MoonshotAI: Kimi K2 0905

Kimi K2 0905 is the September update to Kimi K2 0711, a large-scale Mixture-of-Experts (MoE) model developed by Moonshot AI. It features 1 trillion total parameters with 32B active per forward pass and extends long-context inference from 128K to 256K tokens.
This release enhances agentic coding with improved accuracy and better generalization across scaffolds, while also boosting frontend development with more refined and functional outputs for web, 3D, and related applications. Optimized for agentic capabilities—spanning advanced tool use, reasoning, and code synthesis—Kimi K2 continues to excel across benchmarks in coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool use (Tau2, AceBench). Training is powered by a novel stack that incorporates the MuonClip optimizer for stable, large-scale MoE performance.
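Because the model is tuned for agentic tool use, a minimal sketch of calling it through an OpenAI-compatible endpoint is shown below. The base URL, model identifier, and the `run_tests` tool are illustrative assumptions rather than values confirmed by Moonshot AI's documentation; substitute the details of whichever provider or self-hosted deployment you use.

```python
# Hedged sketch: tool calling against an OpenAI-compatible endpoint.
# The base_url, model name, and tool definition below are assumptions,
# not official values from Moonshot AI.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.moonshot.ai/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

# A single illustrative tool the model may decide to call.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the summary.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="kimi-k2-0905-preview",  # assumed model identifier
    messages=[{"role": "user", "content": "Fix the failing test in ./tests."}],
    tools=tools,
)

# If the model chose to call the tool, the call arguments arrive here.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    print(message.content)
```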
| Creator | Moonshot AI |
| Release Date | September, 2025 |
| License | Modified MIT License |
| Context Window | 262,144 |
| Image Input Support | No |
| Open Source (Weights) | Yes |
| Parameters | 1000B, 32.0B active at inference time |
| Model Weights | Click here |
Performance Benchmarks
| Benchmark | Metric | K2-Instruct-0905 | K2-Instruct-0711 | Qwen3-Coder-480B-A35B-Instruct | GLM-4.5 | DeepSeek-V3.1 | Claude-Sonnet-4 | Claude-Opus-4 |
|---|---|---|---|---|---|---|---|---|
| SWE-Bench verified | ACC | 69.2 ± 0.63 | 65.8 | 69.6* | 64.2* | 66.0* | 72.7* | 72.5* |
| SWE-Bench Multilingual | ACC | 55.9 ± 0.72 | 47.3 | 54.7* | 52.7 | 54.5* | 53.3* | – |
| Multi-SWE-Bench | ACC | 33.5 ± 0.28 | 31.3 | 32.7 | 31.7 | 29.0 | 35.7 | – |
| Terminal-Bench | ACC | 44.5 ± 2.03 | 37.5 | 37.5* | 39.9* | 31.3* | 36.4* | 43.2* |
| SWE-Dev | ACC | 66.6 ± 0.72 | 61.9 | 64.7 | 63.2 | 53.3 | 67.1 | – |
Meta: Llama 3.3 70B Instruct

The Meta Llama 3.3 multilingual large language model (LLM) is a 70B-parameter pretrained and instruction-tuned text-only model. Optimized for multilingual dialogue, it outperforms many open-source and proprietary chat models on standard industry benchmarks.
It supports a wide range of languages, including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
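For local use of the open weights, the sketch below loads the model with the Hugging Face `transformers` text-generation pipeline. It assumes a recent `transformers` release that accepts chat-style message lists, approved access to the gated `meta-llama/Llama-3.3-70B-Instruct` repository, and enough GPU memory (or quantization) to host a 70B model.

```python
# Hedged sketch: multilingual chat with the open Llama 3.3 70B weights.
# Assumes a recent transformers version, access to the gated repository,
# and hardware able to hold a 70B model (multiple GPUs or a quantized load).
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You answer in the language of the question."},
    {"role": "user", "content": "¿Cuál es la capital de Portugal?"},
]

out = generator(messages, max_new_tokens=64)
# The pipeline returns the full chat history; the last message is the reply.
print(out[0]["generated_text"][-1]["content"])
```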
| Creator | Meta |
| Release Date | December, 2024 |
| License | Llama 3.3 Community License Agreement |
| Context Window | 65,536 |
| Image Input Support | No |
| Open Source (Weights) | Yes |
| Parameters | 70B |
| Model Weights | Click here |
MiniMax: MiniMax 01

MiniMax-01 integrates MiniMax-Text-01 for text generation and MiniMax-VL-01 for image understanding, combining multimodal strengths in a single model. It features 456B parameters, with 45.9B active per inference, and supports context lengths of up to 4 million tokens.
The text component uses a hybrid architecture that blends Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). The vision component follows a “ViT-MLP-LLM” framework, trained on top of the text model to enable advanced multimodal reasoning.
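As a rough illustration of such a hybrid schedule, the sketch below interleaves linear "lightning" attention layers with periodic softmax attention layers. The 7:1 ratio and the 80-layer depth are assumptions chosen for illustration, not confirmed configuration values.

```python
# Hedged sketch: how a hybrid attention schedule can be laid out.
# The 7-to-1 lightning/softmax ratio and the 80-layer depth are
# illustrative assumptions, not confirmed configuration values.
NUM_LAYERS = 80          # assumed total depth
SOFTMAX_EVERY = 8        # assumed: one softmax block per 8 layers

def layer_kind(index: int) -> str:
    """Return the attention type used at a given layer index."""
    # Every SOFTMAX_EVERY-th layer uses full softmax attention, restoring
    # exact global token mixing; the remaining layers use linear
    # "lightning" attention, whose cost grows linearly with sequence length.
    return "softmax" if (index + 1) % SOFTMAX_EVERY == 0 else "lightning"

schedule = [layer_kind(i) for i in range(NUM_LAYERS)]
print(schedule[:8])                                   # first block of the pattern
print(schedule.count("softmax"), "softmax layers of", NUM_LAYERS)
```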
| Creator | MiniMax |
| Release Date | January, 2025 |
| License | MiniMax Model License Agreement |
| Context Window | 1,000,192 |
| Image Input Support | No |
| Open Source (Weights) | Yes |
| Parameters | 456B, 45.9B active at inference time |
| Model Weights | Click here |
Performance Benchmarks
Core Academic Benchmarks
| Tasks | GPT-4o (11-20) | Claude-3.5-Sonnet (10-22) | Gemini-1.5-Pro (002) | Gemini-2.0-Flash (exp) | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
|---|---|---|---|---|---|---|---|---|
| General | ||||||||
| MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5 |
| MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7 |
| SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7 |
| C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4 |
| IFEval (avg) | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1 |
| Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1 |
| Reasoning | ||||||||
| GPQA* (diamond) | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4 |
| DROP* (F1) | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8 |
| Mathematics | ||||||||
| GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8 |
| MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4 |
| Coding | ||||||||
| MBPP + | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7 |
| HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9 |
RULER
| Model | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o (11-20) | 0.970 | 0.921 | 0.890 | 0.888 | 0.884 | – | – | – | – |
| Claude-3.5-Sonnet (10-22) | 0.965 | 0.960 | 0.957 | 0.950 | 0.952 | 0.938 | – | – | – |
| Gemini-1.5-Pro (002) | 0.962 | 0.960 | 0.960 | 0.958 | 0.938 | 0.917 | 0.916 | 0.861 | 0.850 |
| Gemini-2.0-Flash (exp) | 0.960 | 0.960 | 0.951 | 0.957 | 0.937 | 0.860 | 0.797 | 0.709 | – |
| MiniMax-Text-01 | 0.963 | 0.961 | 0.953 | 0.954 | 0.943 | 0.947 | 0.945 | 0.928 | 0.910 |
LongBench V2
| Model | overall | easy | hard | short | medium | long |
|---|---|---|---|---|---|---|
| Human | 53.7 | 100.0 | 25.1 | 47.2 | 59.1 | 53.7 |
| w/ CoT | ||||||
| GPT-4o (11-20) | 51.4 | 54.2 | 49.7 | 59.6 | 48.6 | 43.5 |
| Claude-3.5-Sonnet (10-22) | 46.7 | 55.2 | 41.5 | 53.9 | 41.9 | 44.4 |
| Deepseek-V3 | – | – | – | – | – | – |
| Qwen2.5-72B-Inst. | 43.5 | 47.9 | 40.8 | 48.9 | 40.9 | 39.8 |
| MiniMax-Text-01 | 56.5 | 66.1 | 50.5 | 61.7 | 56.7 | 47.2 |
| w/o CoT | ||||||
| GPT-4o (11-20) | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2 |
| Claude-3.5-Sonnet (10-22) | 41.0 | 46.9 | 37.3 | 46.1 | 38.6 | 37.0 |
| Deepseek-V3 | 48.7 | – | – | – | – | – |
| Qwen2.5-72B-Inst. | 42.1 | 42.7 | 41.8 | 45.6 | 38.1 | 44.4 |
| MiniMax-Text-01 | 52.9 | 60.9 | 47.9 | 58.9 | 52.6 | 43.5 |
MTOB (Machine Translation from One Book)
| Context Type | no context | half book | full book | Δ half book | Δ full book |
|---|---|---|---|---|---|
| eng → kalam (ChrF) | |||||
| GPT-4o (11-20) | 9.90 | 54.30 | – | 44.40 | – |
| Claude-3.5-Sonnet (10-22) | 20.22 | 53.62 | 55.65 | 33.39 | 35.42 |
| Gemini-1.5-Pro (002) | 16.79 | 53.68 | 57.90 | 36.89 | 41.11 |
| Gemini-2.0-Flash (exp) | 12.20 | 49.50 | 53.30 | 37.30 | 41.10 |
| Qwen-Long | 16.55 | 48.48 | 45.94 | 31.92 | 29.39 |
| MiniMax-Text-01 | 6.0 | 51.74 | 51.60 | 45.7 | 45.6 |
| kalam → eng (BLEURT) | |||||
| GPT-4o (11-20) | 33.20 | 58.30 | – | 25.10 | – |
| Claude-3.5-Sonnet (10-22) | 31.42 | 59.70 | 62.30 | 28.28 | 30.88 |
| Gemini-1.5-Pro (002) | 32.02 | 61.52 | 63.09 | 29.50 | 31.07 |
| Gemini-2.0-Flash (exp) | 33.80 | 57.50 | 57.00 | 23.70 | 23.20 |
| Qwen-Long | 30.13 | 53.14 | 32.15 | 23.01 | 2.02 |
| MiniMax-Text-01 | 33.65 | 57.10 | 58.00 | 23.45 | 24.35 |
DeepSeek: DeepSeek V3.1

DeepSeek V3.1 is a large-scale hybrid reasoning model with 671B parameters (37B active), capable of operating in both “thinking” and “non-thinking” modes through prompt templates. Building on the DeepSeek-V3 base, it introduces a two-phase long-context training process supporting up to 128K tokens, and leverages FP8 microscaling for more efficient inference. Users can directly control reasoning behavior via a simple boolean toggle.
The model enhances tool use, code generation, and reasoning efficiency, delivering performance on par with DeepSeek-R1 on challenging benchmarks while offering faster response times. With support for structured tool calling, code agents, and search agents, DeepSeek-V3.1 is well-suited for research, programming, and agent-driven workflows. As the successor to DeepSeek-V3-0324, it demonstrates strong performance across a wide range of tasks.
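Since reasoning is switched through the prompt template, the sketch below shows how such a boolean toggle could be passed when building prompts with `transformers`. It assumes the chat template of the published `deepseek-ai/DeepSeek-V3.1` checkpoint accepts a `thinking` argument, as its model card suggests; only the tokenizer is loaded, so the full weights are not required.

```python
# Hedged sketch: switching between "thinking" and "non-thinking" prompts.
# Assumes the published chat template accepts a boolean `thinking` argument;
# extra keyword arguments to apply_chat_template are forwarded to the template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.1")

messages = [{"role": "user", "content": "How many primes are below 50?"}]

# Reasoning mode on: the template inserts the thinking-phase markers.
thinking_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=True
)

# Reasoning mode off: a plain chat prompt is produced instead.
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, thinking=False
)

print(thinking_prompt)
print(direct_prompt)
```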
| Creator | DeepSeek |
| Release Date | August, 2025 |
| License | MIT |
| Context Window | 128,000 |
| Image Input Support | No |
| Open Source (Weights) | Yes |
| Parameters | 671B (685B including the MTP module weights), 37B active at inference time |
| Model Weights | Click here |
MiniMax: MiniMax M1

MiniMax M1 is a large-scale open-weight reasoning model built for long-context processing and efficient inference. Using a hybrid Mixture-of-Experts (MoE) design combined with a custom “lightning attention” mechanism, it can handle sequences up to 1 million tokens while maintaining strong FLOP efficiency. With 456B total parameters and 45.9B active per token, it is optimized for complex, multi-step reasoning.
Trained with a custom reinforcement learning pipeline (CISPO), MiniMax-M1 delivers exceptional performance in long-context comprehension, software engineering, agent-driven tool use, and mathematical reasoning. It achieves top results across benchmarks such as FullStackBench, SWE-bench, MATH, GPQA, and TAU-Bench—often surpassing other open models like DeepSeek R1 and Qwen3-235B.
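To illustrate why the sparse design keeps inference cheap, the back-of-the-envelope sketch below applies the common approximation that a forward pass costs roughly 2 FLOPs per token per active parameter, comparing the 45.9B active slice with a hypothetical dense model of the full 456B size.

```python
# Hedged back-of-the-envelope estimate using the common approximation that
# a forward pass costs ~2 FLOPs per token per active parameter. Attention
# costs are ignored, so treat this as a rough illustration of the MoE
# saving, not a measured figure.
TOTAL_PARAMS = 456e9      # all experts combined
ACTIVE_PARAMS = 45.9e9    # parameters actually used for each token

flops_dense = 2 * TOTAL_PARAMS    # hypothetical dense model of equal size
flops_moe = 2 * ACTIVE_PARAMS     # MoE: only the routed experts run

print(f"dense : {flops_dense:.2e} FLOPs/token")
print(f"MoE   : {flops_moe:.2e} FLOPs/token")
print(f"ratio : {flops_dense / flops_moe:.1f}x fewer FLOPs per token")
```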
| Creator | MiniMax |
| Release Date | June, 2025 |
| License | Apache 2.0 |
| Context Window | 1,000,000 |
| Image Input Support | No |
| Open Source (Weights) | Yes |
| Parameters | 456B, 45.9B active at inference time |
| Model Weights | Click here |
Performance Benchmarks
| Category | Task | MiniMax-M1-80K | MiniMax-M1-40K | Qwen3-235B-A22B | DeepSeek-R1-0528 | DeepSeek-R1 | Seed-Thinking-v1.5 | Claude 4 Opus | Gemini 2.5 Pro (06-05) | OpenAI-o3 |
|---|---|---|---|---|---|---|---|---|---|---|
| Extended Thinking | | 80K | 40K | 32k | 64k | 32k | 32k | 64k | 64k | 100k |
| Mathematics | AIME 2024 | 86.0 | 83.3 | 85.7 | 91.4 | 79.8 | 86.7 | 76.0 | 92.0 | 91.6 |
| | AIME 2025 | 76.9 | 74.6 | 81.5 | 87.5 | 70.0 | 74.0 | 75.5 | 88.0 | 88.9 |
| | MATH-500 | 96.8 | 96.0 | 96.2 | 98.0 | 97.3 | 96.7 | 98.2 | 98.8 | 98.1 |
| General Coding | LiveCodeBench (24/8~25/5) | 65.0 | 62.3 | 65.9 | 73.1 | 55.9 | 67.5 | 56.6 | 77.1 | 75.8 |
| | FullStackBench | 68.3 | 67.6 | 62.9 | 69.4 | 70.1 | 69.9 | 70.3 | — | 69.3 |
| Reasoning & Knowledge | GPQA Diamond | 70.0 | 69.2 | 71.1 | 81.0 | 71.5 | 77.3 | 79.6 | 86.4 | 83.3 |
| | HLE (no tools) | 8.4* | 7.2* | 7.6* | 17.7* | 8.6* | 8.2 | 10.7 | 21.6 | 20.3 |
| | ZebraLogic | 86.8 | 80.1 | 80.3 | 95.1 | 78.7 | 84.4 | 95.1 | 91.6 | 95.8 |
| | MMLU-Pro | 81.1 | 80.6 | 83.0 | 85.0 | 84.0 | 87.0 | 85.0 | 86.0 | 85.0 |
| Software Engineering | SWE-bench Verified | 56.0 | 55.6 | 34.4 | 57.6 | 49.2 | 47.0 | 72.5 | 67.2 | 69.1 |
| Long Context | OpenAI-MRCR (128k) | 73.4 | 76.1 | 27.7 | 51.5 | 35.8 | 54.3 | 48.9 | 76.8 | 56.5 |
| | OpenAI-MRCR (1M) | 56.2 | 58.6 | — | — | — | — | — | 58.8 | — |
| | LongBench-v2 | 61.5 | 61.0 | 50.1 | 52.1 | 58.3 | 52.5 | 55.6 | 65.0 | 58.8 |
| Agentic Tool Use | TAU-bench (airline) | 62.0 | 60.0 | 34.7 | 53.5 | — | 44.0 | 59.6 | 50.0 | 52.0 |
| | TAU-bench (retail) | 63.5 | 67.8 | 58.6 | 63.9 | — | 55.7 | 81.4 | 67.0 | 73.9 |
| Factuality | SimpleQA | 18.5 | 17.9 | 11.0 | 27.8 | 30.1 | 12.9 | — | 54.0 | 49.4 |
| General Assistant | MultiChallenge | 44.7 | 44.7 | 40.0 | 45.0 | 40.7 | 43.0 | 45.8 | 51.8 | 56.5 |