Frontier AI Models: Every Major Release This Month (February 2026)
Jozo
2026/02/20
14 min read

The Most Packed Month in AI History

February 2026 will be remembered as the month the frontier AI race went into overdrive. Nine major providers are actively shipping frontier models, each one pushing the boundaries of what's possible with language models.

Here's the timeline:

| Date | Provider | Model | Highlight |
| --- | --- | --- | --- |
| Dec 2 | Mistral AI | Mistral Large 3 | 675B MoE, #2 open-source on LMArena |
| Jan 27 | Moonshot AI | Kimi K2.5 | 1T open-source MoE with Agent Swarm |
| Feb 5 | OpenAI | GPT-5.3 Codex | First "self-improving" agentic coding model |
| Feb 11 | Zhipu AI | GLM-5 | 745B open-source model trained on Chinese chips |
| Feb 12 | DeepSeek | V3.2 Update | Context window expanded 10x to 1M+ tokens |
| Feb 15 | Moonshot AI | Kimi Claw | Browser-based agent platform powered by K2.5 |
| Feb 17 | Anthropic | Claude Sonnet 4.6 | Near-Opus performance at 1/5th the price |
| Feb 17 | xAI | Grok 4.2 RC | "Rapid learning" model that improves weekly |
| Feb 17 | DeepSeek | V4 (expected) | 1T-param model targeting coding dominance |
| Feb 19 | Google | Gemini 3.1 Pro | 2x reasoning jump, ARC-AGI-2 score of 77.1% |
| 2026 | MiniMax | M2.5 | #1 Multi-SWE-Bench, 10B active params, $0.30/M |

This isn't just incremental improvement. This is a fundamental shift in what AI models can do, how much they cost, and who's building them.

Let's break down each release.


OpenAI: GPT-5.3 Codex

Released: February 5, 2026

OpenAI's GPT-5.3 Codex represents a paradigm shift from "model that writes code" to "model that does nearly anything developers can do on a computer."

What's New

GPT-5.3 Codex combines the frontier coding performance of GPT-5.2-Codex with the reasoning and professional knowledge of GPT-5.2. The result is a model that can take on long-running tasks involving research, tool use, and complex multi-step execution.

Key improvements:

  • 25% faster than GPT-5.2-Codex
  • Fewer tokens consumed per task: builds more with less
  • State-of-the-art on SWE-Bench Pro and Terminal-Bench
  • Strong results on OSWorld and GDPval

The Cybersecurity Flag

This is the first OpenAI model to hit "high" on their cybersecurity preparedness framework, meaning they believe GPT-5.3 Codex is capable enough at coding and reasoning to "meaningfully enable real-world cyber harm, especially if automated or used at scale." It's a milestone that underscores just how capable these models have become.

Availability

Available to paid ChatGPT users via Codex app, CLI, IDE extension, and web. A lighter GPT-5.3-Codex-Spark variant was also released. API access coming soon.

Pricing

| Model | Input (per 1M) | Output (per 1M) | Cached Input |
| --- | --- | --- | --- |
| GPT-5 | $1.25 | $10.00 | $0.625 |
| GPT-5.3 Codex | TBA (API pending) | TBA | TBA |
| o3 | $2.00 | $8.00 | – |
| o4-mini | $1.10 | $4.40 | $0.55 |
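Cached-input pricing changes the effective rate more than the headline numbers suggest. A small sketch of the blended input price at a given cache hit rate, using GPT-5's listed prices as the example (the 50% hit rate is an illustrative assumption):

```python
def effective_input_price(base: float, cached: float, hit_rate: float) -> float:
    """Blended input price per 1M tokens, given the fraction of input
    tokens served from the prompt cache at the discounted rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached + (1.0 - hit_rate) * base

# GPT-5 from the table above: $1.25 base, $0.625 cached.
# At a 50% cache hit rate the blended input price is $0.9375 per 1M tokens.
print(effective_input_price(1.25, 0.625, 0.5))
```

For agentic workloads that resend a large system prompt on every turn, hit rates well above 50% are common, which is why the cached column matters as much as the base price.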

Anthropic: Claude Sonnet 4.6

Released: February 17, 2026

Claude Sonnet 4.6 is Anthropic's answer to a question nobody thought possible a year ago: can a mid-tier model match a flagship?

What's New

This isn't a minor version bump. Sonnet 4.6 is a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It ships with a 1M token context window (in beta).

Benchmark Highlights

| Benchmark | Sonnet 4.6 | Opus 4.6 | Gap |
| --- | --- | --- | --- |
| SWE-bench Verified | 79.6% | – | – |
| OSWorld (Computer Use) | 72.5% | 72.7% | 0.2% |
| Office Productivity | 1633 Elo | 1559 Elo | Sonnet leads |
| Financial Analysis | 63.3% | 62.0% | Sonnet leads |

The computer use number is remarkable: 72.5% on OSWorld-Verified, up from 14.9% when computer use first launched just 16 months earlier.

User Preference

Anthropic reports that 70% of users prefer Sonnet 4.6 over Sonnet 4.5, and 59% prefer it over the older Opus 4.5. At $3/$15 per million tokens, one-fifth of Opus 4.6's $15/$75, this is the best value in frontier AI right now for enterprise workloads.

Claude Opus 4.6

The flagship Opus 4.6 remains the ceiling for Anthropic's capabilities, powering the most demanding agentic and reasoning tasks. But the gap with Sonnet is now razor-thin, making the mid-tier model the pragmatic choice for most applications.


Google: Gemini 3.1 Pro

Released: February 19, 2026

Google is framing Gemini 3.1 Pro not as a niche upgrade but as a sturdier default model for complex tasks.

What's New

The headline number: an ARC-AGI-2 score of 77.1%, more than double the reasoning performance of Gemini 3 Pro. The model is designed specifically for tasks that require advanced multi-step reasoning, like synthesizing data across sources or explaining complex interdependent topics.

Availability

Rolling out across the full Google ecosystem:

  • Gemini app (higher limits for Pro and Ultra plan users)
  • NotebookLM (Pro and Ultra users)
  • Gemini API via AI Studio, Vertex AI, Gemini CLI, and Android Studio
  • Pricing unchanged from Gemini 3 Pro (~$1.25/$10 per million tokens standard)

Why It Matters

Google kept pricing flat while dramatically improving reasoning. For enterprises already on Google Cloud, 3.1 Pro slots in as a direct upgrade with zero budget impact.


DeepSeek: V4 & The 10x Context Expansion

V3.2 Update: February 12, 2026
V4 Expected: Mid-February 2026

DeepSeek continues to be the most disruptive force in AI pricing while pushing genuine frontier capabilities.

V3.2: 10x Context Expansion

In early February, DeepSeek expanded V3.2's context window from 128,000 tokens to over 1 million, a roughly tenfold increase. At $0.27/$1.10 per million tokens, this is now the cheapest way to process massive documents with a frontier-class model.
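For a rough sense of what a 1M-token window holds, the common heuristic of about 4 characters per token works as a first pass (an assumption; real tokenizer counts vary with language and content):

```python
def fits_in_context(text_chars: int, context_tokens: int = 1_000_000,
                    chars_per_token: float = 4.0, headroom: float = 0.9) -> bool:
    """Rough check: does a document of `text_chars` characters fit in the
    context window, reserving (1 - headroom) of the window for the prompt
    and the model's response?"""
    estimated_tokens = text_chars / chars_per_token
    return estimated_tokens <= context_tokens * headroom

# A ~300-page book (~600K characters, ~150K tokens) fits in a 1M window...
print(fits_in_context(600_000))                          # True
# ...but the old 128K window would have required chunking.
print(fits_in_context(600_000, context_tokens=128_000))  # False
```

The practical upshot: workloads that used to need retrieval pipelines or chunk-and-merge logic can now be a single call, at which point per-token price dominates the cost.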

V4: The Next Frontier

DeepSeek V4 is expected to launch with:

  • 1 trillion parameters (MoE architecture)
  • 1M+ token context native
  • Three architectural breakthroughs: Engram conditional memory, Manifold-Constrained Hyper-Connections, and DeepSeek Sparse Attention
  • Target: 80%+ on SWE-bench β€” which would put it at the very top of coding benchmarks
  • Expected to be open-weight under a permissive license

The Cost Story

The pricing gap between DeepSeek and Western providers remains staggering:

| Task Cost Example | GPT-5 | Claude Opus 4.6 | DeepSeek V3.2 |
| --- | --- | --- | --- |
| 100K input + 10K output | $0.225 | $2.25 | $0.038 |
| Ratio to DeepSeek | 6x | 59x | 1x |

At that 6x ratio, a complex task costing $15 with GPT-5 runs roughly $2.50 with DeepSeek. This isn't just a cost advantage: it changes what's economically viable to automate.
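Those numbers fall straight out of the per-token prices quoted elsewhere in this article. A minimal sketch, with prices per million tokens:

```python
def task_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Dollar cost of one task, with prices quoted per 1M tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

PRICES = {            # (input, output) per 1M tokens, from this article
    "GPT-5": (1.25, 10.00),
    "Claude Opus 4.6": (15.00, 75.00),
    "DeepSeek V3.2": (0.27, 1.10),
}

for model, (inp, out) in PRICES.items():
    print(model, round(task_cost(100_000, 10_000, inp, out), 3))
```

Running this reproduces the table row: $0.225, $2.25, and $0.038 for a 100K-in / 10K-out task.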


Zhipu AI: GLM-5

Released: February 11, 2026

The biggest open-source model release of the month, and possibly the most geopolitically significant.

What's New

GLM-5 is a 745 billion parameter MoE model (44B active parameters) with five core capabilities: creative writing, code generation, multi-step reasoning, agentic intelligence, and long-context processing.

Benchmark Performance

| Benchmark | GLM-5 | Comparison |
| --- | --- | --- |
| SWE-bench Verified | 77.8% | Matches Claude Opus 4.5 |
| AIME 2026 | 92.7% | – |
| GPQA-Diamond | 86.0% | – |
| Humanity's Last Exam | 50.4% | Beats Claude Opus 4.5 |
| Hallucination Rate | 34% | Down from 90% (GLM-4.7) |

The hallucination reduction, from 90% to 34% via a novel RL technique called Slime, is particularly impressive; it puts GLM-5 at the top of the Artificial Analysis Omniscience Index.

The Geopolitical Signal

GLM-5 was trained entirely on Huawei Ascend chips using the MindSpore framework, with zero US-manufactured hardware. This demonstrates that China's domestic compute stack can produce frontier-quality models despite export controls.

Native Agent Mode

GLM-5 ships with a native "Agent Mode" that can transform prompts into professional office documents (.docx, .pdf, .xlsx), directly competing with Anthropic's computer use and OpenAI's Codex on practical business tasks.

Following the launch, Zhipu's shares surged 34% on the Hong Kong Stock Exchange.


Moonshot AI: Kimi K2.5 & Kimi Claw

K2.5 Released: January 27, 2026
Kimi Claw: February 15, 2026

Moonshot AI is building the most complete open-source agentic ecosystem in the Chinese AI space.

Kimi K2.5

A 1 trillion parameter MoE model (32B active parameters) that understands text, images, and video. Key innovation: Agent Swarm capability, powered by a new RL technique called Parallel Agent Reinforcement Learning (PARL) that trains the model to decompose and parallelize complex tasks.

The model is fully open-source and available on Hugging Face.
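PARL is a training technique, but the runtime pattern it enables (decompose a task, fan subtasks out to parallel agents, then synthesize) can be sketched with a plain thread pool. Everything below, the decomposition, the agent stub, and the join step, is a toy stand-in, not Moonshot's API:

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(task: str) -> list[str]:
    """Toy decomposition: split a task into four independent subtasks."""
    return [f"{task} :: subtask {i}" for i in range(4)]

def run_agent(subtask: str) -> str:
    """Stand-in for one agent in the swarm answering its subtask.
    In a real system this would be a model call."""
    return f"result({subtask})"

def agent_swarm(task: str) -> str:
    subtasks = decompose(task)
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        # map() preserves subtask order while executing in parallel
        results = list(pool.map(run_agent, subtasks))
    return " | ".join(results)  # synthesis step

print(agent_swarm("summarize Q4 filings"))
```

The interesting part of PARL is that the model itself learns when a task decomposes cleanly; the scaffold above only shows the execution side of that pattern.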

Kimi Claw

Launched February 15, Kimi Claw is a cloud-native browser-based AI agent platform built on the OpenClaw framework. Think of it as Moonshot's answer to Anthropic's computer use, but running entirely in the cloud.


xAI: Grok 4.2 Release Candidate

Public Beta: February 17, 2026

Elon Musk's Grok 4.2 introduces a fundamentally different approach to model improvement: rapid learning.

What's New

Unlike every other model on this list, Grok 4.2 is designed to improve every week based on public usage. Musk described it as "able to learn rapidly" with weekly improvement cycles and release notes.

New capabilities:

  • 4-agent parallel collaboration: specialized AI agents that synthesize outputs into a single response
  • Medical document analysis via photo upload
  • Improved engineering reasoning

Pricing

xAI maintains its aggressive pricing strategy:

| Model | Input (per 1M) | Output (per 1M) |
| --- | --- | --- |
| Grok 4.1 | $0.20 | $0.50 |
| Grok 4.2 RC | TBA (beta) | TBA |

Current Status

Grok 4.2 is currently in public beta and can be selected in the Grok interface. The general public release is expected in March 2026. Official benchmarks will be published after the beta concludes.


Mistral AI: Large 3 & The Coding Stack

Mistral Large 3: December 2, 2025
Devstral 2: December 2025

Mistral continues to punch above its weight as Europe's frontier AI lab, shipping models that compete at the top of open-source leaderboards.

Mistral Large 3

A 675 billion parameter MoE model with 41B active parameters. It debuted at #2 in open-source non-reasoning models on the LMArena leaderboard, behind only the much larger models from Chinese labs.

Key models in Mistral's current lineup:

| Model | Focus | Pricing (per 1M) |
| --- | --- | --- |
| Mistral Large 3 | General frontier | ~$2.00 / $6.00 |
| Mistral Medium 3.1 | Multimodal (40k ctx) | $2.00 / $5.00 |
| Magistral Medium 1.2 | Reasoning | $2.00 / $5.00 |
| Codestral | Code completion | Premier tier |
| Devstral 2 | Agentic coding | Open-weight |

Devstral Small 2

The standout from the December release: a 24B parameter coding model that beats Qwen 3 Coder Flash despite being significantly smaller. For teams that need self-hosted coding AI without massive GPU requirements, Devstral Small 2 is a compelling option.

Ministral 3

Mistral's small-model family (3B, 7B, 14B parameters) achieves the best cost-to-performance ratio of any open-source model, matching or exceeding comparable models while producing an order of magnitude fewer tokens.


MiniMax: M2.5

M2.5 Released: 2026

The dark horse of the frontier race. MiniMax's M2.5 delivers benchmark-topping coding performance with just 10 billion active parameters, a fraction of what competitors use.

What's New

MiniMax M2.5 is purpose-built for coding and agentic execution, with a focus on doing more with less:

  • #1 on Multi-SWE-Bench with a score of 51.3
  • Surpasses Claude Opus 4.6 on SWE-Bench Pro
  • Leading scores on FinSearch, BrowseComp, and RISE benchmarks
  • 100 tokens per second throughput, described as "3x faster than Opus"
  • Chain of Thought reasoning up to 128K tokens

The Efficiency Story

The standout stat: MiniMax M2.5 completes 327.8 tasks per $100 budget, over 10x more than Opus. At $0.30 per million input tokens ($0.06 with cache), it's in DeepSeek territory for pricing while matching or exceeding premium models on coding tasks.

| Model | Input (per 1M) | With Cache | Speed |
| --- | --- | --- | --- |
| M2.5 | $0.30 | $0.06 | 100 TPS |
| M2.5-highspeed | $0.30 | $0.06 | Faster variant |
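The tasks-per-budget framing is easy to reproduce: 327.8 tasks per $100 implies an average cost of roughly $0.305 per task (a back-of-envelope inversion, not a published number). A minimal sketch:

```python
def tasks_per_budget(budget: float, cost_per_task: float) -> float:
    """How many tasks a fixed dollar budget buys at a given average task cost."""
    if cost_per_task <= 0:
        raise ValueError("cost_per_task must be positive")
    return budget / cost_per_task

# At ~$0.305 per task (implied by the 327.8 figure), $100 buys ~328 tasks.
print(round(tasks_per_budget(100, 0.305)))  # 328
```

Framing efficiency as tasks-per-dollar rather than price-per-token is arguably the more honest metric, since verbose models can be cheap per token yet expensive per completed task.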

Open Weights

MiniMax has open-sourced M2.5 weights on HuggingFace, supporting vLLM, SGLang, and Transformers for self-hosting. This makes it one of the most cost-effective options for teams running their own inference infrastructure.
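Both vLLM and SGLang expose an OpenAI-compatible HTTP endpoint, so a self-hosted deployment is usually queried with a standard chat-completions payload. The model id below is a placeholder for whatever name your server registers, and this sketch only builds the JSON request body rather than calling a live server:

```python
import json

def chat_request(model: str, prompt: str, max_tokens: int = 512) -> str:
    """Build an OpenAI-compatible /v1/chat/completions request body,
    suitable for POSTing to a self-hosted vLLM or SGLang server."""
    body = {
        "model": model,  # whatever id your inference server registered
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

# Placeholder model id; substitute the HuggingFace repo you actually serve.
payload = chat_request("minimax-m2.5", "Refactor this function for clarity.")
print(payload)
```

Because the wire format matches OpenAI's, most existing client libraries and agent frameworks work against a self-hosted open-weight model by changing only the base URL.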


The Pricing Landscape

Here's how all frontier models stack up on cost (per million tokens):

| Provider | Model | Input | Output | Context |
| --- | --- | --- | --- | --- |
| xAI | Grok 4.1 | $0.20 | $0.50 | – |
| DeepSeek | V3.2 | $0.27 | $1.10 | 1M+ |
| MiniMax | M2.5 | $0.30 | – | 128K |
| OpenAI | o4-mini | $1.10 | $4.40 | – |
| Google | Gemini 3.1 Pro | ~$1.25 | ~$10.00 | 1M |
| OpenAI | GPT-5 | $1.25 | $10.00 | 400K |
| Mistral AI | Medium 3.1 | $2.00 | $5.00 | 40K |
| Mistral AI | Large 3 | ~$2.00 | ~$6.00 | 128K |
| OpenAI | o3 | $2.00 | $8.00 | – |
| Anthropic | Sonnet 4.6 | $3.00 | $15.00 | 1M (beta) |
| Anthropic | Opus 4.6 | $15.00 | $75.00 | 200K |
| Zhipu AI | GLM-5 | Open weights | Free to self-host | – |
| Moonshot AI | Kimi K2.5 | Open weights | Free to self-host | – |
| DeepSeek | V4 (expected) | Open weights | Free to self-host | 1M+ |

The input-price gap between DeepSeek ($0.27/M) and premium models (Opus 4.6 at $15/M) is more than 50x, and it represents a real architectural decision for businesses. The question is no longer "can we afford AI?" but "which tier of AI fits our use case?"
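That tiering question can be made concrete as a routing rule. The thresholds and model picks below are purely illustrative assumptions drawn from this article's pricing table, not a recommendation:

```python
def choose_model(monthly_tokens_m: float, needs_frontier_reasoning: bool) -> str:
    """Toy tier router: pick a model by monthly volume (in millions of
    tokens) and capability needs. Thresholds are illustrative only."""
    if needs_frontier_reasoning:
        # Sonnet 4.6 sits near Opus on most benchmarks at 1/5th the price,
        # so high-volume reasoning workloads route to the mid-tier.
        return "Claude Opus 4.6" if monthly_tokens_m < 100 else "Claude Sonnet 4.6"
    # High-volume routine workloads go to the cheapest frontier-class API.
    return "DeepSeek V3.2" if monthly_tokens_m >= 1000 else "GPT-5"

print(choose_model(5000, needs_frontier_reasoning=False))  # DeepSeek V3.2
print(choose_model(50, needs_frontier_reasoning=True))     # Claude Opus 4.6
```

Real routers would also weigh context length, data-residency constraints, and whether self-hosted open weights are on the table; the point is that tier selection is now a design decision, not a given.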


Five Trends to Watch

1. The Open-Source Surge

Five recent releases (GLM-5, Kimi K2.5, DeepSeek V4, Mistral Large 3, and MiniMax M2.5) are open-weight models. They're not just catching up to closed-source: GLM-5 matches Claude Opus 4.5 on SWE-bench and beats it on Humanity's Last Exam. Mistral Large 3 sits at #2 on open-source LMArena. The quality gap between open and closed is essentially gone.

2. China's Independent AI Stack

Both GLM-5 (Huawei Ascend) and DeepSeek V4 demonstrate that Chinese labs can produce frontier models without US hardware. Export controls have slowed but not stopped China's AI progress, and may have accelerated their investment in domestic alternatives.

3. Agentic Everything

Every single release this month includes agentic capabilities: GPT-5.3 Codex does long-running multi-step tasks, Claude 4.6 has computer use at 72.5%, Grok 4.2 runs 4-agent parallel collaboration, GLM-5 has native Agent Mode, and Kimi has Agent Swarm. 2026 is the year models stopped being chatbots and started being workers.

4. The Mid-Tier Revolution

Claude Sonnet 4.6 proving that a $3/M model can match a $15/M flagship is a watershed moment. Combined with DeepSeek's $0.27/M pricing achieving ~90% of GPT-5 quality, the value proposition of premium API pricing is under serious pressure.

5. Context Window Convergence

Multiple models now offer 1M+ token context windows: Gemini 3.1 Pro, Claude 4.6 (beta), DeepSeek V4, and Kimi K2.5. Processing entire codebases, legal documents, or research corpora in a single pass is no longer a differentiator; it's table stakes.


What This Means for Business Users

If you're building AI into your business workflow in 2026, here's the practical takeaway:

For coding and development: GPT-5.3 Codex and Claude Sonnet 4.6 lead the pack. Codex for long-running agentic tasks, Sonnet for versatile coding + computer use.

For cost-sensitive workloads: DeepSeek V3.2 at $0.27/M tokens is unbeatable for high-volume tasks. Open-weight models (GLM-5, Kimi K2.5) are free to self-host if you have GPU infrastructure.

For enterprise reasoning: Gemini 3.1 Pro's 2x reasoning improvement makes it the default for Google Cloud shops. Claude Opus 4.6 remains the ceiling for complex analysis.

For rapid iteration: Grok 4.2's weekly improvement model is unique; if you need a model that gets better at your specific use cases over time, it's worth watching.

For independence: Open-weight models (GLM-5, Kimi K2.5, DeepSeek V4) give you full control over deployment, customization, and data privacy.


Last Updated

February 20, 2026. This article is updated as new frontier models are released. Follow us for the latest coverage.

Previous updates: Initial publication (Feb 20, 2026)