The Most Packed Month in AI History
February 2026 will be remembered as the month the frontier AI race went into overdrive. Nine major providers are actively shipping frontier models, each one pushing the boundaries of what's possible with language models.
Here's the timeline:
| Date | Provider | Model | Highlight |
|---|---|---|---|
| Dec 2 | Mistral AI | Mistral Large 3 | 675B MoE, #2 open-source on LMArena |
| Jan 27 | Moonshot AI | Kimi K2.5 | 1T open-source MoE with Agent Swarm |
| Feb 5 | OpenAI | GPT-5.3 Codex | First "self-improving" agentic coding model |
| Feb 11 | Zhipu AI | GLM-5 | 745B open-source model trained on Chinese chips |
| Feb 12 | DeepSeek | V3.2 Update | Context window expanded 10x to 1M+ tokens |
| Feb 15 | Moonshot AI | Kimi Claw | Browser-based agent platform powered by K2.5 |
| Feb 17 | Anthropic | Claude Sonnet 4.6 | Near-Opus performance at 1/5th the price |
| Feb 17 | xAI | Grok 4.2 RC | "Rapid learning" model that improves weekly |
| Feb 17 | DeepSeek | V4 (expected) | 1T-param model targeting coding dominance |
| Feb 19 | Google | Gemini 3.1 Pro | 2x reasoning jump, ARC-AGI-2 score of 77.1% |
| 2026 | MiniMax | M2.5 | #1 Multi-SWE-Bench, 10B active params, $0.30/M |
This isn't just incremental improvement. This is a fundamental shift in what AI models can do, how much they cost, and who's building them.
Let's break down each release.
OpenAI: GPT-5.3 Codex
Released: February 5, 2026
OpenAI's GPT-5.3 Codex represents a paradigm shift from "model that writes code" to "model that does nearly anything developers can do on a computer."
What's New
GPT-5.3 Codex combines the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge of GPT-5.2. The result is a model that can take on long-running tasks involving research, tool use, and complex multi-step execution.
Key improvements:
- 25% faster than GPT-5.2-Codex
- Fewer tokens consumed per task, building more with less
- State-of-the-art on SWE-Bench Pro and Terminal-Bench
- Strong results on OSWorld and GDPval
The Cybersecurity Flag
This is the first OpenAI model to reach "high" on OpenAI's cybersecurity preparedness framework, meaning the company believes GPT-5.3 Codex is capable enough at coding and reasoning to "meaningfully enable real-world cyber harm, especially if automated or used at scale." It's a milestone that underscores just how capable these models have become.
Availability
Available to paid ChatGPT users via Codex app, CLI, IDE extension, and web. A lighter GPT-5.3-Codex-Spark variant was also released. API access coming soon.
Pricing
| Model | Input (per 1M) | Output (per 1M) | Cached Input |
|---|---|---|---|
| GPT-5 | $1.25 | $10.00 | $0.625 |
| GPT-5.3 Codex | TBA (API pending) | TBA | TBA |
| o3 | $2.00 | $8.00 | – |
| o4-mini | $1.10 | $4.40 | $0.55 |
Anthropic: Claude Sonnet 4.6
Released: February 17, 2026
Claude Sonnet 4.6 is Anthropic's answer to a question nobody thought possible a year ago: can a mid-tier model match a flagship?
What's New
This isn't a minor version bump. Sonnet 4.6 is a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It ships with a 1M token context window (in beta).
Benchmark Highlights
| Benchmark | Sonnet 4.6 | Opus 4.6 | Gap |
|---|---|---|---|
| SWE-bench Verified | 79.6% | – | – |
| OSWorld (Computer Use) | 72.5% | 72.7% | 0.2% |
| Office Productivity | 1633 Elo | 1559 Elo | Sonnet leads |
| Financial Analysis | 63.3% | 62.0% | Sonnet leads |
The computer use number is remarkable: 72.5% on OSWorld-Verified, up from 14.9% when computer use first launched just 16 months earlier.
User Preference
Anthropic reports that 70% of users prefer Sonnet 4.6 over Sonnet 4.5, and 59% prefer it over the older Opus 4.5. At $3/$15 per million tokens (one-fifth of Opus 4.6's $15/$75), this is the best value in frontier AI right now for enterprise workloads.
Claude Opus 4.6
The flagship Opus 4.6 remains the ceiling for Anthropic's capabilities, powering the most demanding agentic and reasoning tasks. But the gap with Sonnet is now razor-thin, making the mid-tier model the pragmatic choice for most applications.
Google: Gemini 3.1 Pro
Released: February 19, 2026
Google is framing Gemini 3.1 Pro not as a niche upgrade but as a sturdier default model for complex tasks.
What's New
The headline number: an ARC-AGI-2 score of 77.1%, more than double the reasoning performance of Gemini 3 Pro. The model is specifically designed for tasks that require advanced multi-step reasoning, like synthesizing data across sources or explaining complex interdependent topics.
Availability
Rolling out across the full Google ecosystem:
- Gemini app (higher limits for Pro and Ultra plan users)
- NotebookLM (Pro and Ultra users)
- Gemini API via AI Studio, Vertex AI, Gemini CLI, and Android Studio
- Pricing unchanged from Gemini 3 Pro (~$1.25/$10 per million tokens standard)
Why It Matters
Google kept pricing flat while dramatically improving reasoning. For enterprises already on Google Cloud, 3.1 Pro slots in as a direct upgrade with zero budget impact.
DeepSeek: V4 & The 10x Context Expansion
V3.2 Update: February 12, 2026; V4 Expected: Mid-February 2026
DeepSeek continues to be the most disruptive force in AI pricing while pushing genuine frontier capabilities.
V3.2: 10x Context Expansion
In early February, DeepSeek expanded V3.2's context window from 128,000 tokens to over 1 million, a tenfold increase. At $0.27/$1.10 per million tokens, this is now the cheapest way to process massive documents with a frontier-class model.
V4: The Next Frontier
DeepSeek V4 is expected to launch with:
- 1 trillion parameters (MoE architecture)
- 1M+ token context native
- Three architectural breakthroughs: Engram conditional memory, Manifold-Constrained Hyper-Connections, and DeepSeek Sparse Attention
- Target: 80%+ on SWE-bench, which would put it at the very top of coding benchmarks
- Expected to be open-weight under a permissive license
The Cost Story
The pricing gap between DeepSeek and Western providers remains staggering:
| Task Cost Example | GPT-5 | Claude Opus 4.6 | DeepSeek V3.2 |
|---|---|---|---|
| 100K input + 10K output | $0.225 | $2.25 | $0.038 |
| Ratio to DeepSeek | 6x | 59x | 1x |
A complex task costing $15 with GPT-5 costs roughly $2.50 with DeepSeek at the 6x ratio above. This isn't just a cost advantage; it changes what's economically viable to automate.
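The arithmetic behind those figures is simple enough to sketch in a few lines of Python. The prices are taken from this article's tables; the 100K/10K token split is the example workload from the table above, and cache discounts are ignored:

```python
# Per-million-token prices quoted in this article: (input, output) in USD.
PRICES = {
    "gpt-5": (1.25, 10.00),
    "claude-opus-4.6": (15.00, 75.00),
    "deepseek-v3.2": (0.27, 1.10),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """API cost in USD for one task, ignoring cache discounts."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# The article's example workload: 100K input + 10K output tokens.
for model in PRICES:
    print(f"{model}: ${task_cost(model, 100_000, 10_000):.3f}")
```

Swap in your own token counts to estimate where a given workload lands before committing to a provider.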
Zhipu AI: GLM-5
Released: February 11, 2026
The biggest open-source model release of the month, and possibly the most geopolitically significant.
What's New
GLM-5 is a 745 billion parameter MoE model (44B active parameters) with five core capabilities: creative writing, code generation, multi-step reasoning, agentic intelligence, and long-context processing.
Benchmark Performance
| Benchmark | GLM-5 | Comparison |
|---|---|---|
| SWE-bench Verified | 77.8% | Matches Claude Opus 4.5 |
| AIME 2026 | 92.7% | – |
| GPQA-Diamond | 86.0% | – |
| Humanity's Last Exam | 50.4% | Beats Claude Opus 4.5 |
| Hallucination Rate | 34% | Down from 90% (GLM-4.7) |
The hallucination reduction, from 90% to 34% using a novel RL technique called Slime, is particularly impressive, and it puts GLM-5 at the top of the Artificial Analysis Omniscience Index.
The Geopolitical Signal
GLM-5 was trained entirely on Huawei Ascend chips using the MindSpore framework β zero US-manufactured hardware. This demonstrates that China's domestic compute stack can produce frontier-quality models despite export controls.
Native Agent Mode
GLM-5 ships with a native "Agent Mode" that can transform prompts into professional office documents (.docx, .pdf, .xlsx), directly competing with Anthropic's computer use and OpenAI's Codex on practical business tasks.
Following the launch, Zhipu's shares surged 34% on the Hong Kong Stock Exchange.
Moonshot AI: Kimi K2.5 & Kimi Claw
K2.5 Released: January 27, 2026; Kimi Claw: February 15, 2026
Moonshot AI is building the most complete open-source agentic ecosystem in the Chinese AI space.
Kimi K2.5
A 1 trillion parameter MoE model (32B active parameters) that understands text, images, and video. Key innovation: Agent Swarm capability, powered by a new RL technique called Parallel Agent Reinforcement Learning (PARL) that trains the model to decompose and parallelize complex tasks.
The model is fully open-source and available on Hugging Face.
Kimi Claw
Launched February 15, Kimi Claw is a cloud-native browser-based AI agent platform built on the OpenClaw framework. Think of it as Moonshot's answer to Anthropic's computer use, but running entirely in the cloud.
xAI: Grok 4.2 Release Candidate
Public Beta: February 17, 2026
Elon Musk's Grok 4.2 introduces a fundamentally different approach to model improvement: rapid learning.
What's New
Unlike every other model on this list, Grok 4.2 is designed to improve every week based on public usage. Musk described it as "able to learn rapidly" with weekly improvement cycles and release notes.
New capabilities:
- 4-agent parallel collaboration: specialized AI agents whose outputs are synthesized into a single response
- Medical document analysis via photo upload
- Improved engineering reasoning
Pricing
xAI maintains its aggressive pricing strategy:
| Model | Input (per 1M) | Output (per 1M) |
|---|---|---|
| Grok 4.1 | $0.20 | $0.50 |
| Grok 4.2 RC | TBA (beta) | TBA |
Current Status
Grok 4.2 is currently in public beta, available to select users in the Grok interface. The general public release is expected in March 2026. Official benchmarks will be published after the beta concludes.
Mistral AI: Large 3 & The Coding Stack
Mistral Large 3: December 2, 2025; Devstral 2: December 2025
Mistral continues to punch above its weight as Europe's frontier AI lab, shipping models that compete at the top of open-source leaderboards.
Mistral Large 3
A 675 billion parameter MoE model with 41B active parameters. It debuted at #2 in open-source non-reasoning models on the LMArena leaderboard, behind only the much larger models from Chinese labs.
Key models in Mistral's current lineup:
| Model | Focus | Pricing (per 1M) |
|---|---|---|
| Mistral Large 3 | General frontier | ~$2.00 / $6.00 |
| Mistral Medium 3.1 | Multimodal (40k ctx) | $2.00 / $5.00 |
| Magistral Medium 1.2 | Reasoning | $2.00 / $5.00 |
| Codestral | Code completion | Premier tier |
| Devstral 2 | Agentic coding | Open-weight |
Devstral Small 2
The standout from the December release: a 24B parameter coding model that beats Qwen 3 Coder Flash despite being significantly smaller. For teams that need self-hosted coding AI without massive GPU requirements, Devstral Small 2 is a compelling option.
Ministral 3
Mistral's small-model family (3B, 7B, 14B parameters) achieves the best cost-to-performance ratio of any open-source model, matching or exceeding comparable models while producing an order of magnitude fewer tokens.
MiniMax: M2.5
M2.5 Released: 2026
The dark horse of the frontier race. MiniMax's M2.5 delivers benchmark-topping coding performance with just 10 billion active parameters, a fraction of what competitors use.
What's New
MiniMax M2.5 is purpose-built for coding and agentic execution, with a focus on doing more with less:
- #1 on Multi-SWE-Bench with a score of 51.3
- Surpasses Claude Opus 4.6 on SWE-Bench Pro
- Leading scores on FinSearch, BrowseComp, and RISE benchmarks
- 100 tokens per second throughput, described as "3x faster than Opus"
- Chain of Thought reasoning up to 128K tokens
The Efficiency Story
The standout stat: MiniMax M2.5 completes 327.8 tasks per $100 budget, over 10x more than Opus. At $0.30 per million input tokens ($0.06 with cache), it's in DeepSeek territory for pricing while matching or exceeding premium models on coding tasks.
| Model | Input (per 1M) | With Cache | Speed |
|---|---|---|---|
| M2.5 | $0.30 | $0.06 | 100 TPS |
| M2.5-highspeed | $0.30 | $0.06 | Faster variant |
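As a back-of-envelope check, a tasks-per-budget figure converts directly to an implied cost per task. This quick sketch uses only the numbers quoted above; the Opus line is an upper bound implied by the "over 10x" claim, not a published figure:

```python
def cost_per_task(tasks_per_100_usd: float) -> float:
    """Implied USD cost of one completed task, given tasks finished per $100."""
    return 100 / tasks_per_100_usd

m25 = cost_per_task(327.8)        # M2.5's quoted figure, about $0.31 per task
opus = cost_per_task(327.8 / 10)  # "over 10x" implies Opus finishes at most ~32.8 tasks per $100
print(f"M2.5: ${m25:.2f}/task, Opus: at least ${opus:.2f}/task")
```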
Open Weights
MiniMax has open-sourced the M2.5 weights on Hugging Face, supporting vLLM, SGLang, and Transformers for self-hosting. This makes it one of the most cost-effective options for teams running their own inference infrastructure.
The Pricing Landscape
Here's how all frontier models stack up on cost (per million tokens):
| Provider | Model | Input | Output | Context |
|---|---|---|---|---|
| xAI | Grok 4.1 | $0.20 | $0.50 | – |
| DeepSeek | V3.2 | $0.27 | $1.10 | 1M+ |
| MiniMax | M2.5 | $0.30 | – | 128K |
| OpenAI | o4-mini | $1.10 | $4.40 | – |
| Google | Gemini 3.1 Pro | ~$1.25 | ~$10.00 | 1M |
| OpenAI | GPT-5 | $1.25 | $10.00 | 400K |
| Mistral AI | Medium 3.1 | $2.00 | $5.00 | 40K |
| Mistral AI | Large 3 | ~$2.00 | ~$6.00 | 128K |
| OpenAI | o3 | $2.00 | $8.00 | – |
| Anthropic | Sonnet 4.6 | $3.00 | $15.00 | 1M (beta) |
| Anthropic | Opus 4.6 | $15.00 | $75.00 | 200K |
| Zhipu AI | GLM-5 | Open weights | Free to self-host | – |
| Moonshot AI | Kimi K2.5 | Open weights | Free to self-host | – |
| DeepSeek | V4 (expected) | Open weights | Free to self-host | 1M+ |
The roughly 56x input-price gap between the cheapest frontier API (DeepSeek at $0.27/M) and premium models (Opus 4.6 at $15/M input) represents a real architectural decision for businesses. The question is no longer "can we afford AI?" but "which tier of AI fits our use case?"
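One way to make that tier decision concrete is to project monthly spend from expected volume. Here's a minimal sketch with prices from the table above; the 500M/50M monthly token volume is a hypothetical workload, not a figure from this article:

```python
# (input $/M, output $/M) from the pricing table above.
TIERS = {
    "budget (DeepSeek V3.2)": (0.27, 1.10),
    "mid (Claude Sonnet 4.6)": (3.00, 15.00),
    "premium (Claude Opus 4.6)": (15.00, 75.00),
}

def monthly_cost(in_price: float, out_price: float,
                 input_m_tokens: float, output_m_tokens: float) -> float:
    """Projected monthly API spend in USD, token volumes given in millions."""
    return input_m_tokens * in_price + output_m_tokens * out_price

# Hypothetical workload: 500M input + 50M output tokens per month.
for tier, (inp, outp) in TIERS.items():
    print(f"{tier}: ${monthly_cost(inp, outp, 500, 50):,.0f}/month")
```

At this hypothetical volume the budget tier comes in around $190 a month against roughly $11,250 for premium, the kind of spread that justifies maintaining routing logic between tiers.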
Key Trends
1. The Open-Source Surge
Five of the releases above (GLM-5, Kimi K2.5, Mistral Large 3, MiniMax M2.5, and the expected DeepSeek V4) are open-weight models. They're not just catching up to closed-source: GLM-5 matches Claude Opus 4.5 on SWE-bench and beats it on Humanity's Last Exam, and Mistral Large 3 sits at #2 on the open-source LMArena leaderboard. The quality gap between open and closed is essentially gone.
2. China's Independent AI Stack
Both GLM-5 (Huawei Ascend) and DeepSeek V4 demonstrate that Chinese labs can produce frontier models without US hardware. Export controls have slowed but not stopped China's AI progress, and may have accelerated their investment in domestic alternatives.
3. The Agentic Everything
Every single release this month includes agentic capabilities: GPT-5.3 Codex handles long-running multi-step tasks, Claude Sonnet 4.6 scores 72.5% on computer use, Grok 4.2 runs 4-agent parallel collaboration, GLM-5 ships a native Agent Mode, and Kimi K2.5 has Agent Swarm. 2026 is the year models stopped being chatbots and started being workers.
4. The Mid-Tier Revolution
Claude Sonnet 4.6 proving that a $3/M model can match a $15/M flagship is a watershed moment. Combined with DeepSeek's $0.27/M pricing achieving ~90% of GPT-5 quality, the value proposition of premium API pricing is under serious pressure.
5. Context Window Convergence
Multiple models now offer 1M+ token context windows: Gemini 3.1 Pro, Claude Sonnet 4.6 (beta), DeepSeek V4, and Kimi K2.5. Processing entire codebases, legal documents, or research corpora in a single pass is no longer a differentiator; it's table stakes.
What This Means for Business Users
If you're building AI into your business workflow in 2026, here's the practical takeaway:
For coding and development: GPT-5.3 Codex and Claude Sonnet 4.6 lead the pack. Codex for long-running agentic tasks, Sonnet for versatile coding + computer use.
For cost-sensitive workloads: DeepSeek V3.2 at $0.27/M tokens is unbeatable for high-volume tasks. Open-weight models (GLM-5, Kimi K2.5) are free to self-host if you have GPU infrastructure.
For enterprise reasoning: Gemini 3.1 Pro's 2x reasoning improvement makes it the default for Google Cloud shops. Claude Opus 4.6 remains the ceiling for complex analysis.
For rapid iteration: Grok 4.2's weekly improvement model is unique; if you need a model that gets better at your specific use cases over time, it's worth watching.
For independence: Open-weight models (GLM-5, Kimi K2.5, DeepSeek V4) give you full control over deployment, customization, and data privacy.
Last Updated
February 20, 2026 β This article is updated as new frontier models are released. Follow us for the latest coverage.
Previous updates: Initial publication (Feb 20, 2026)

