MoonshotAI: Kimi K2 0905

Kimi-K2-0905

Kimi K2 0905 is the September update to Kimi K2 0711, a large-scale Mixture-of-Experts (MoE) model developed by Moonshot AI. It features 1 trillion total parameters with 32B active per forward pass and extends long-context inference from 128K to 256K tokens.

This release enhances agentic coding with improved accuracy and better generalization across scaffolds, while also boosting frontend development with more refined and functional outputs for web, 3D, and related applications. Optimized for agentic capabilities—spanning advanced tool use, reasoning, and code synthesis—Kimi K2 continues to excel across benchmarks in coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool use (Tau2, AceBench). Training is powered by a novel stack that incorporates the MuonClip optimizer for stable, large-scale MoE performance.
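
As a rough illustration of the QK-clip step inside MuonClip (per-head rescaling of the query and key projections whenever attention logits exceed a threshold, as described in Moonshot's technical report), here is a minimal PyTorch sketch. The threshold value, the even square-root split, and the in-place update are illustrative assumptions, not details taken from this page.

```python
import torch

def qk_clip(w_q: torch.Tensor, w_k: torch.Tensor,
            max_logit: float, tau: float = 100.0) -> None:
    """Illustrative QK-clip step (assumed form, after Moonshot's report).

    If the largest attention logit observed for a head exceeds tau,
    shrink that head's query and key projection weights in place.
    Applying sqrt(gamma) to each side scales the logits q @ k.T by
    exactly gamma = tau / max_logit, pulling the peak back to tau.
    """
    if max_logit > tau:
        gamma = tau / max_logit
        w_q.mul_(gamma ** 0.5)
        w_k.mul_(gamma ** 0.5)
```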

Creator Moonshot AI
Release Date September, 2025
License Modified MIT License
Context Window 262,144
Image Input Support No
Open Source (Weights) Yes
Parameters 1,000B, 32B active at inference time

Performance Benchmarks

Benchmark | Metric | K2-Instruct-0905 | K2-Instruct-0711 | Qwen3-Coder-480B-A35B-Instruct | GLM-4.5 | DeepSeek-V3.1 | Claude-Sonnet-4 | Claude-Opus-4
SWE-bench Verified | Acc | 69.2 ± 0.63 | 65.8 | 69.6* | 64.2* | 66.0* | 72.7* | 72.5*
SWE-bench Multilingual | Acc | 55.9 ± 0.72 | 47.3 | 54.7* | 52.7 | 54.5* | 53.3* | n/a
Multi-SWE-bench | Acc | 33.5 ± 0.28 | 31.3 | 32.7 | 31.7 | 29.0 | 35.7 | n/a
Terminal-Bench | Acc | 44.5 ± 2.03 | 37.5 | 37.5* | 39.9* | 31.3* | 36.4* | 43.2*
SWE-Dev | Acc | 66.6 ± 0.72 | 61.9 | 64.7 | 63.2 | 53.3 | 67.1 | n/a

Meta: Llama 3.3 70B Instruct

Llama-3.3-70B

The Meta Llama 3.3 multilingual large language model (LLM) is a 70B-parameter pretrained and instruction-tuned text-only model. Optimized for multilingual dialogue, it outperforms many open-source and proprietary chat models on standard industry benchmarks.

It supports a wide range of languages, including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
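
As a quick reference, here is a minimal sketch of multilingual dialogue with the open weights through Hugging Face transformers. The repo id follows Meta's naming on the Hub; the dtype and device settings are assumptions for a typical multi-GPU setup, not requirements stated here.

```python
# Minimal multilingual chat sketch via Hugging Face transformers.
# Assumes gated access to meta-llama/Llama-3.3-70B-Instruct on the Hub
# and enough GPU memory for a 70B model (device_map="auto" shards it).
import torch
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Reply in the user's language."},
    {"role": "user", "content": "¿Cuál es la capital de Tailandia?"},
]
reply = chat(messages, max_new_tokens=64)[0]["generated_text"][-1]
print(reply["content"])  # expected: an answer in Spanish (Bangkok)
```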

Creator Meta
Release Date December, 2024
License Llama 3.3 Community License Agreement
Context Window 65,536
Image Input Support No
Open Source (Weights) Yes
Parameters 70B

MiniMax: MiniMax 01

Minimax-01

MiniMax-01 integrates MiniMax-Text-01 for text generation and MiniMax-VL-01 for image understanding, combining multimodal strengths in a single model. It features 456B parameters, with 45.9B activated per token, and supports context lengths of up to 4 million tokens at inference.

The text component uses a hybrid architecture that blends Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). The vision component follows a “ViT-MLP-LLM” framework, trained on top of the text model to enable advanced multimodal reasoning.
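
MiniMax's technical report describes the text stack as seven lightning-attention blocks followed by one softmax-attention block, repeated through the network. The toy sketch below just encodes that scheduling rule; the 7:1 ratio comes from the report, while the function and its names are purely illustrative.

```python
# Toy sketch of the hybrid attention schedule described for
# MiniMax-Text-01: linear-complexity lightning attention on most
# layers, with full softmax attention every eighth block. The 7:1
# ratio follows MiniMax's technical report; this helper is
# illustrative, not the model's actual implementation.

def attention_kind(layer_idx: int, softmax_every: int = 8) -> str:
    """Return the attention type a hypothetical decoder layer would use."""
    return "softmax" if (layer_idx + 1) % softmax_every == 0 else "lightning"

print([attention_kind(i) for i in range(16)])
# seven 'lightning', one 'softmax', repeated
```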

Creator MiniMax
Release Date January, 2025
License MiniMax Model License Agreement
Context Window 1,000,192
Image Input Support No
Open Source (Weights) Yes
Parameters 456B, 45.9B active at inference time

Performance Benchmarks

Core Academic Benchmarks

Tasks | GPT-4o (11-20) | Claude-3.5-Sonnet (10-22) | Gemini-1.5-Pro (002) | Gemini-2.0-Flash (exp) | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01

General
MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5
MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7
SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7
C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4
IFEval (avg) | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1
Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1

Reasoning
GPQA* (diamond) | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4
DROP* (F1) | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8

Mathematics
GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8
MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4

Coding
MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7
HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9

RULER

Model | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M
GPT-4o (11-20) | 0.970 | 0.921 | 0.890 | 0.888 | 0.884 | n/a | n/a | n/a | n/a
Claude-3.5-Sonnet (10-22) | 0.965 | 0.960 | 0.957 | 0.950 | 0.952 | 0.938 | n/a | n/a | n/a
Gemini-1.5-Pro (002) | 0.962 | 0.960 | 0.960 | 0.958 | 0.938 | 0.917 | 0.916 | 0.861 | 0.850
Gemini-2.0-Flash (exp) | 0.960 | 0.960 | 0.951 | 0.957 | 0.937 | 0.860 | 0.797 | 0.709 | n/a
MiniMax-Text-01 | 0.963 | 0.961 | 0.953 | 0.954 | 0.943 | 0.947 | 0.945 | 0.928 | 0.910

LongBench V2

Model | overall | easy | hard | short | medium | long
Human | 53.7 | 100.0 | 25.1 | 47.2 | 59.1 | 53.7

w/ CoT
GPT-4o (11-20) | 51.4 | 54.2 | 49.7 | 59.6 | 48.6 | 43.5
Claude-3.5-Sonnet (10-22) | 46.7 | 55.2 | 41.5 | 53.9 | 41.9 | 44.4
DeepSeek-V3 | n/a | n/a | n/a | n/a | n/a | n/a
Qwen2.5-72B-Inst. | 43.5 | 47.9 | 40.8 | 48.9 | 40.9 | 39.8
MiniMax-Text-01 | 56.5 | 66.1 | 50.5 | 61.7 | 56.7 | 47.2

w/o CoT
GPT-4o (11-20) | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2
Claude-3.5-Sonnet (10-22) | 41.0 | 46.9 | 37.3 | 46.1 | 38.6 | 37.0
DeepSeek-V3 | 48.7 | n/a | n/a | n/a | n/a | n/a
Qwen2.5-72B-Inst. | 42.1 | 42.7 | 41.8 | 45.6 | 38.1 | 44.4
MiniMax-Text-01 | 52.9 | 60.9 | 47.9 | 58.9 | 52.6 | 43.5

MTOB

Context Type | no context | half book | full book | Δ half book | Δ full book

eng → kalam (ChrF)
GPT-4o (11-20) | 9.90 | 54.30 | n/a | 44.40 | n/a
Claude-3.5-Sonnet (10-22) | 20.22 | 53.62 | 55.65 | 33.39 | 35.42
Gemini-1.5-Pro (002) | 16.79 | 53.68 | 57.90 | 36.89 | 41.11
Gemini-2.0-Flash (exp) | 12.20 | 49.50 | 53.30 | 37.30 | 41.10
Qwen-Long | 16.55 | 48.48 | 45.94 | 31.92 | 29.39
MiniMax-Text-01 | 6.0 | 51.74 | 51.60 | 45.7 | 45.6

kalam → eng (BLEURT)
GPT-4o (11-20) | 33.20 | 58.30 | n/a | 25.10 | n/a
Claude-3.5-Sonnet (10-22) | 31.42 | 59.70 | 62.30 | 28.28 | 30.88
Gemini-1.5-Pro (002) | 32.02 | 61.52 | 63.09 | 29.50 | 31.07
Gemini-2.0-Flash (exp) | 33.80 | 57.50 | 57.00 | 23.70 | 23.20
Qwen-Long | 30.13 | 53.14 | 32.15 | 23.01 | 2.02
MiniMax-Text-01 | 33.65 | 57.10 | 58.00 | 23.45 | 24.35

DeepSeek: DeepSeek V3.1

Deepseek-V3.1

DeepSeek V3.1 is a large-scale hybrid reasoning model with 671B parameters (37B active), capable of operating in both “thinking” and “non-thinking” modes through prompt templates. Building on the DeepSeek-V3 base, it introduces a two-phase long-context training process supporting up to 128K tokens, and leverages FP8 microscaling for more efficient inference. Users can directly control reasoning behavior via a simple boolean toggle.
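
As a concrete sketch of that toggle: with the open weights, mode selection is exposed through the chat template, roughly as below. The `thinking` keyword follows the usage shown on DeepSeek's model card; treat the exact argument name, repo id, and template behavior as assumptions.

```python
# Sketch: toggling DeepSeek-V3.1 between thinking and non-thinking
# modes through its Hugging Face chat template. The `thinking` flag
# mirrors the usage shown on DeepSeek's model card; the template's
# exact behavior is an assumption here.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.1")
messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Non-thinking mode: the template closes the reasoning span up front,
# so the model answers directly.
direct = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=False)

# Thinking mode: the prompt instead elicits a reasoning trace first.
reasoned = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=True)
```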

The model enhances tool use, code generation, and reasoning efficiency, delivering performance on par with DeepSeek-R1 on challenging benchmarks while offering faster response times. With support for structured tool calling, code agents, and search agents, DeepSeek-V3.1 is well-suited for research, programming, and agent-driven workflows. As the successor to DeepSeek-V3-0324, it demonstrates strong performance across a wide range of tasks.

Creator DeepSeek
Release Date August, 2025
License MIT
Context Window 128,000
Image Input Support No
Open Source (Weights) Yes
Parameters 685B, 37B active at inference time

MiniMax: MiniMax M1

Minimax-M1

MiniMax M1 is a large-scale open-weight reasoning model built for long-context processing and efficient inference. Using a hybrid Mixture-of-Experts (MoE) design combined with a custom “lightning attention” mechanism, it can handle sequences up to 1 million tokens while maintaining strong FLOP efficiency. With 456B total parameters and 45.9B active per token, it is optimized for complex, multi-step reasoning.

Trained with a custom reinforcement learning algorithm (CISPO), MiniMax-M1 delivers exceptional performance in long-context comprehension, software engineering, agent-driven tool use, and mathematical reasoning. It achieves strong results across benchmarks such as FullStackBench, SWE-bench, MATH, GPQA, and TAU-Bench, often surpassing other open models like DeepSeek R1 and Qwen3-235B.
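
Per the M1 report, CISPO's distinguishing move is to clip the importance-sampling weight itself and detach it, so every token keeps a gradient through its log-probability instead of being dropped when a PPO-style update would clip it. A minimal PyTorch sketch of that objective follows; the tensor shapes, argument names, and single upper clipping bound are assumptions for illustration.

```python
import torch

def cispo_loss(logp_new: torch.Tensor,    # token log-probs, current policy
               logp_old: torch.Tensor,    # token log-probs, behavior policy
               advantages: torch.Tensor,  # per-token advantage estimates
               eps_high: float = 0.2) -> torch.Tensor:
    """Illustrative CISPO objective (assumed form, after the M1 report).

    The importance-sampling ratio is clipped and detached, so the
    gradient flows through logp_new for every token; no token's
    update is zeroed out the way PPO's clipped surrogate can be.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, max=1.0 + eps_high).detach()
    return -(clipped * advantages * logp_new).mean()
```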

Creator MiniMax
Release Date June, 2025
License Apache 2.0
Context Window 1,000,000
Image Input Support No
Open Source (Weights) Yes
Parameters 456B, 45.9B active at inference time

Performance Benchmarks

Category | Task | MiniMax-M1-80K | MiniMax-M1-40K | Qwen3-235B-A22B | DeepSeek-R1-0528 | DeepSeek-R1 | Seed-Thinking-v1.5 | Claude 4 Opus | Gemini 2.5 Pro (06-05) | OpenAI-o3
 | Extended Thinking | 80K | 40K | 32K | 64K | 32K | 32K | 64K | 64K | 100K
Mathematics | AIME 2024 | 86.0 | 83.3 | 85.7 | 91.4 | 79.8 | 86.7 | 76.0 | 92.0 | 91.6
 | AIME 2025 | 76.9 | 74.6 | 81.5 | 87.5 | 70.0 | 74.0 | 75.5 | 88.0 | 88.9
 | MATH-500 | 96.8 | 96.0 | 96.2 | 98.0 | 97.3 | 96.7 | 98.2 | 98.8 | 98.1
General Coding | LiveCodeBench (24/8~25/5) | 65.0 | 62.3 | 65.9 | 73.1 | 55.9 | 67.5 | 56.6 | 77.1 | 75.8
 | FullStackBench | 68.3 | 67.6 | 62.9 | 69.4 | 70.1 | 69.9 | 70.3 | n/a | 69.3
Reasoning & Knowledge | GPQA Diamond | 70.0 | 69.2 | 71.1 | 81.0 | 71.5 | 77.3 | 79.6 | 86.4 | 83.3
 | HLE (no tools) | 8.4* | 7.2* | 7.6* | 17.7* | 8.6* | 8.2 | 10.7 | 21.6 | 20.3
 | ZebraLogic | 86.8 | 80.1 | 80.3 | 95.1 | 78.7 | 84.4 | 95.1 | 91.6 | 95.8
 | MMLU-Pro | 81.1 | 80.6 | 83.0 | 85.0 | 84.0 | 87.0 | 85.0 | 86.0 | 85.0
Software Engineering | SWE-bench Verified | 56.0 | 55.6 | 34.4 | 57.6 | 49.2 | 47.0 | 72.5 | 67.2 | 69.1
Long Context | OpenAI-MRCR (128k) | 73.4 | 76.1 | 27.7 | 51.5 | 35.8 | 54.3 | 48.9 | 76.8 | 56.5
 | OpenAI-MRCR (1M) | 56.2 | 58.6 | n/a | n/a | n/a | n/a | n/a | 58.8 | n/a
 | LongBench-v2 | 61.5 | 61.0 | 50.1 | 52.1 | 58.3 | 52.5 | 55.6 | 65.0 | 58.8
Agentic Tool Use | TAU-bench (airline) | 62.0 | 60.0 | 34.7 | 53.5 | n/a | 44.0 | 59.6 | 50.0 | 52.0
 | TAU-bench (retail) | 63.5 | 67.8 | 58.6 | 63.9 | n/a | 55.7 | 81.4 | 67.0 | 73.9
Factuality | SimpleQA | 18.5 | 17.9 | 11.0 | 27.8 | 30.1 | 12.9 | n/a | 54.0 | 49.4
General Assistant | MultiChallenge | 44.7 | 44.7 | 40.0 | 45.0 | 40.7 | 43.0 | 45.8 | 51.8 | 56.5
