Cohere just dropped North Mini Code, a 30B-parameter Mixture-of-Experts model with 3B active parameters, and it's their first model explicitly designed for agentic coding. Not code completion. Not chat-with-your-repo. Actual autonomous software engineering agents that edit codebases, run tests, and fix bugs across dozens of turns.
This matters because most code models optimize for HumanEval or single-turn generation benchmarks that don't predict agent performance. North Mini Code was trained end-to-end for multi-turn, tool-using, verification-driven workflows—and the results show it.
The Performance Story: Punching Above Its Weight
On Artificial Analysis' Coding Index, North Mini Code scores 33.4. That's higher than Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), Devstral Small 2 (24B Dense), and—here's the kicker—substantially larger models like Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B).
A 30B-parameter model beating 120B models isn't just efficiency porn. It's evidence that task-specific post-training beats sheer scale when the task is well-defined. Cohere didn't try to build a general-purpose reasoning monster. They built a model that writes code, uses terminal tools, and iterates until tests pass.
The model is Apache 2.0 licensed and available on Hugging Face, which means you can actually run this in production without negotiating enterprise contracts.
Architecture: Sparse MoE With Interleaved Attention
North Mini Code is a decoder-only Transformer with 128 experts, activating 8 per token. The feed-forward blocks use SwiGLU activation, and the router applies sigmoid activation before top-k selection.
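To make the routing step concrete, here is a minimal pure-Python sketch of sigmoid-then-top-k expert selection for a single token. The expert count (128) and active count (8) come from the post; the random logits and the renormalization of the selected scores are assumptions for illustration, not details Cohere has confirmed.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    # Sigmoid scores each expert independently; unlike softmax, experts do not
    # compete for a fixed probability mass before selection.
    scores = [sigmoid(z) for z in router_logits]
    # Keep the top_k highest-scoring experts.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: renormalize the selected scores so mixing weights sum to 1.
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]


# Hypothetical router logits for one token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selection = route_token(logits)
print(len(selection), "experts active out of", NUM_EXPERTS)
```

Only the 8 selected experts run their SwiGLU feed-forward block for this token, which is how the active parameter count stays far below the total.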
The attention mechanism is more interesting: it interleaves sliding-window attention (with RoPE) and full global attention (no positional embeddings) in a 3:1 ratio. Sliding-window keeps compute manageable for long contexts; global attention preserves cross-document reasoning. This hybrid pattern is becoming standard for long-context code models, but Cohere's implementation includes a single dense layer before the sparse layers—a small architectural choice that likely helps routing stability.
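The 3:1 interleaving can be sketched as a layer schedule plus two kinds of causal mask. The layer count, the window size, and the position of the global layer within each group of four are all hypothetical here; the post only gives the ratio.

```python
from typing import List, Optional


def layer_schedule(num_layers: int, sliding_per_global: int = 3) -> List[str]:
    """Label each layer; every (sliding_per_global + 1)-th layer is global."""
    period = sliding_per_global + 1
    return ["global" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


schedule = layer_schedule(8)               # hypothetical 8-layer stack
local = causal_mask(6, window=2)           # sliding-window layer (would use RoPE)
full = causal_mask(6)                      # global layer (no positional embeddings)

print(schedule)
print("keys visible to last token:", sum(local[-1]), "local vs", sum(full[-1]), "global")
```

The cost argument falls out of the masks: a sliding-window row has at most `window` visible keys regardless of sequence length, while a global row grows with the sequence.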
The model was trained with 64K and 128K context windows in a "long-to-longer" cascade. First stage: 64K context on broad data. Second stage: 128K context on only the highest-quality verified samples. This prevents low-quality long-context data from polluting the later stages—a mistake that apparently caused "higher behavioral conflicts" in earlier ablations.
Post-Training: SFT → SFT → RLVR
Cohere's post-training pipeline has three phases:
Phase 1: Broad SFT
The first supervised fine-tuning stage uses a wide mix where code datasets make up 70% of trainable tokens. That 70% breaks down as follows, with both figures given as shares of all trainable tokens:
- 43% agentic tool-use data
- 27% single-turn competitive/scientific programming
This stage establishes baseline coding ability and robustness across domains. The 64K context window keeps training stable while covering shorter, higher-diversity samples.
Phase 2: High-Quality Agentic SFT
The second SFT stage narrows to 4.5 billion tokens of only agentic and reasoning-driven samples, where code forms 61% of trainable tokens. Every sample is verified as executable and correct using containerized coding environments.
Cohere built their data pipeline around over 70,000 verifiable tasks across ~5,000 unique repositories. They explicitly deduplicated against SWE-Bench and SWE-Bench-Pro repository sources to avoid test-set leakage—a detail that should be table stakes but often isn't.
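Repository-level deduplication of this kind can be sketched as a simple blocklist filter. The task schema (a dict with a `repo` field), the normalization rules, and the repository names below are hypothetical; the post does not describe how Cohere implemented the check.

```python
def normalize_repo(url_or_slug: str) -> str:
    """Reduce a repository reference to a lowercase 'owner/name' slug."""
    slug = url_or_slug.strip().lower()
    for prefix in ("https://github.com/", "http://github.com/", "git@github.com:"):
        if slug.startswith(prefix):
            slug = slug[len(prefix):]
    if slug.endswith(".git"):
        slug = slug[:-4]
    return slug.strip("/")


def filter_tasks(tasks, benchmark_repos):
    """Drop every task whose source repository appears in a benchmark."""
    blocked = {normalize_repo(r) for r in benchmark_repos}
    return [t for t in tasks if normalize_repo(t["repo"]) not in blocked]


# Hypothetical data: one task comes from a benchmark repository.
benchmark_repos = ["example-org/bench-repo"]
tasks = [
    {"id": 1, "repo": "https://github.com/Example-Org/Bench-Repo.git"},
    {"id": 2, "repo": "other-org/training-repo"},
]
kept = filter_tasks(tasks, benchmark_repos)
print([t["id"] for t in kept])
```

Filtering at the repository level is deliberately coarse: it discards tasks that never appear in the benchmark, but it also rules out leakage through neighboring commits and shared test files.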
The filtering is aggressive: invalid tool calls, erroneous whitespace, malformed special tokens, and hallucinated citations all get pruned. This stage isn't about hitting benchmark numbers. It's about priming for reinforcement learning by maximizing sampling diversity and pass@K for high K.
After SFT, the model achieves 80.2% pass@10 on SWE-Bench Verified and 55.1% pass@10 on Terminal-Bench v2. Not the final scores—just the starting point for RL.
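For readers unfamiliar with pass@k, the commonly used unbiased estimator is shown below: given n samples per task of which c pass, it estimates the chance that at least one of k randomly drawn samples passes. The post does not say which estimator Cohere used, so treat this as the standard definition rather than their exact evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (requires k <= n)."""
    if n - c < k:
        # Fewer than k failures exist, so any draw of k samples contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 20 samples, 3 of which pass the tests.
print(round(pass_at_k(n=20, c=3, k=1), 4))
print(round(pass_at_k(n=20, c=3, k=10), 4))
```

This is why high pass@K is a useful pre-RL target: it measures whether a correct solution is reachable somewhere in the sampling distribution, which is what RL needs in order to receive any positive reward at all.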
Phase 3: Asynchronous RLVR
Reinforcement learning with verifiable rewards (RLVR) is where things get spicy. Cohere runs a single multi-environment RL training run spanning two task types: terminal-based tasks and software engineering tasks.
Each training batch has 512 rollouts with group size 8. All rollouts share a 128K token context window. The model receives binary rewards from unit-test verifiers, and any rollout with an invalid tool call or unparseable output receives a reward of 0, which drives hallucinated tool calls to near-zero within the first few steps.
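The batch arithmetic and reward rule can be sketched as follows. The batch size, group size, and reward rule come from the post; the `Rollout` fields are hypothetical, and the group-normalized advantage is an assumption borrowed from GRPO-style training, since the post gives the group size but not the advantage formula.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

BATCH_ROLLOUTS = 512                                # from the article
GROUP_SIZE = 8                                      # from the article
PROMPTS_PER_BATCH = BATCH_ROLLOUTS // GROUP_SIZE    # 64 distinct tasks per batch


@dataclass
class Rollout:
    tests_passed: bool
    had_invalid_tool_call: bool = False
    output_parseable: bool = True


def reward(r: Rollout) -> float:
    """Binary verifier reward; malformed rollouts score 0 regardless of tests."""
    if r.had_invalid_tool_call or not r.output_parseable:
        return 0.0
    return 1.0 if r.tests_passed else 0.0


def group_advantages(rewards, eps: float = 1e-6):
    """Assumed GRPO-style advantage: normalize rewards within one task's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


# One hypothetical group of 8 rollouts for the same task.
group = [Rollout(tests_passed=True)] + [Rollout(tests_passed=False)] * 6
group.append(Rollout(tests_passed=True, had_invalid_tool_call=True))
rewards = [reward(r) for r in group]
print(PROMPTS_PER_BATCH, rewards, [round(a, 2) for a in group_advantages(rewards)])
```

Note the last rollout: its tests passed, but the invalid tool call zeroes the reward, so the model is never paid for a malformed trajectory.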
The training objective is CISPO (a log-likelihood objective with token-level importance sampling correction). It differs from PPO and GRPO in that the importance weight multiplies each token's log-likelihood, rather than the objective being built on the probability ratio itself. Crucially, loss is aggregated at the token level, not prompt level, so long agentic traces get proportional gradient signal instead of being down-weighted.
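The difference between token-level and prompt-level aggregation is easy to see numerically. The sketch below follows my understanding of CISPO as originally described: the importance weight is clipped and treated as a constant (stop-gradient) that scales the log-likelihood term. The clipping bounds and the toy numbers are hypothetical, and plain Python has no autograd, so the stop-gradient appears only as a comment.

```python
import math


def cispo_token_terms(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.2):
    """Per-token loss terms for one rollout: -clip(ratio) * advantage * logp_new."""
    terms = []
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)
        # Clipped importance weight; in a real framework this would be wrapped
        # in stop-gradient so only logp_new receives gradient.
        weight = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        terms.append(-weight * advantage * ln)
    return terms


def token_level_loss(per_rollout_terms):
    """Average over all tokens in the batch: long rollouts contribute more."""
    flat = [t for terms in per_rollout_terms for t in terms]
    return sum(flat) / len(flat)


def prompt_level_loss(per_rollout_terms):
    """Average within each rollout first: every rollout counts equally."""
    return sum(sum(t) / len(t) for t in per_rollout_terms) / len(per_rollout_terms)


# Toy batch: a long trace with per-token term 1.0 and a short one with 4.0.
batch = [[1.0] * 1000, [4.0] * 10]
print(token_level_loss(batch), prompt_level_loss(batch))
```

Under prompt-level averaging, each token in the 1000-token trace carries one hundredth of the weight of a token in the 10-token trace; under token-level averaging, every token counts the same.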
Because coding-agent rollouts are "long and highly variable in length," Cohere decouples sampling from learning. A vLLM sidecar serves rollouts continuously while the trainer runs asynchronously, syncing policy weights every few learner steps (K=4). They use a windowed FIFO queue to drain stragglers without blocking on the longest rollouts—recovering "most of the throughput" of completion-order sampling without hurting training stability.
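One plausible reading of the windowed FIFO queue is sketched below: instead of blocking on the single oldest rollout (strict FIFO) or taking any finished rollout (completion order), the trainer collects finished rollouts only from the oldest `window` in-flight slots. This is my interpretation, not Cohere's published code; the function names and window size are hypothetical, and only the sync interval K=4 comes from the post.

```python
from collections import deque

SYNC_EVERY = 4  # K=4 learner steps between policy weight syncs (from the article)


def should_sync(learner_step: int) -> bool:
    """True on learner steps where fresh weights are pushed to the sampler."""
    return learner_step % SYNC_EVERY == 0


def drain_window(in_flight: deque, finished: set, window: int) -> list:
    """Pop finished rollouts found among the `window` oldest in-flight entries."""
    head = list(in_flight)[:window]
    ready = [rid for rid in head if rid in finished]
    for rid in ready:
        in_flight.remove(rid)
    return ready


# Rollouts 0..5 were launched in order; 0 is a straggler, 1, 2 and 5 are done.
finished = {1, 2, 5}

strict = drain_window(deque(range(6)), finished, window=1)      # blocks on rollout 0
windowed = drain_window(deque(range(6)), finished, window=3)    # takes 1 and 2
unordered = drain_window(deque(range(6)), finished, window=6)   # completion order

print(strict, windowed, unordered)
```

The window bounds how far ahead of the oldest rollout the trainer can read, which limits how stale or length-biased a batch can become while still letting training proceed past a single slow rollout.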
Harness Robustness: Why Single-Benchmark Optimization Is A Trap
Here's the insight most code-model teams miss: real agents encounter diverse tooling environments. SWE-Agent uses a rich agent-CLI interface with specialized commands (bash, str_replace_editor, submit). mini-SWE-agent strips that down to a single bash tool with raw stdout. OpenCode uses fine-grained individually typed tools (edit, grep, todowrite) returning structured JSON.
If you train only on one harness, your model memorizes that harness's idiosyncrasies. It'll fail when deployed in a slightly different environment.
Cohere addresses this by mixing data from additional harnesses into second-stage SFT: just 6% of the mix, compared to 50% from the primary SWE-Agent harness. This yields a 10% gain on OpenCode evaluation while maintaining SWE-Agent performance. Cross-harness transfer is cheap and doesn't degrade benchmark scores.
Even more interesting: North Mini Code achieves 61.0% pass@1 using mini-SWE-agent, an improvement that emerged for free from cross-task, cross-harness training. Overlapping tool capabilities create positive transfer. Skills required by different harnesses are "usually complementary rather than contradictory."
The team also notes that introducing sufficient variation in harnesses (data augmentation) forces the model to link instructions to behaviors instead of regurgitating fixed templates. This is especially critical when harnesses look similar but differ in subtle ways.
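One way such harness augmentation could work is to rename tools consistently in both the tool specification and the recorded trajectory, so the only reliable way to pick the right tool name is to read the instructions. The sketch below is purely illustrative: the alias table, the trajectory schema, and the function name are hypothetical, and the post does not describe Cohere's actual augmentation.

```python
import random

# Hypothetical alias table; alias sets are disjoint so renames never collide.
ALIASES = {
    "bash": ["bash", "shell", "run_command"],
    "edit": ["edit", "apply_edit", "modify_file"],
}


def augment_harness(tools, calls, rng):
    """Rename tools consistently in the tool spec and in every recorded call."""
    mapping = {name: rng.choice(ALIASES.get(name, [name])) for name in tools}
    new_tools = {mapping[name]: desc for name, desc in tools.items()}
    new_calls = [{**call, "tool": mapping[call["tool"]]} for call in calls]
    return new_tools, new_calls


# Hypothetical tool spec and two-step trajectory.
tools = {"bash": "Run a shell command.", "edit": "Replace text in a file."}
calls = [
    {"tool": "bash", "args": {"command": "pytest -q"}},
    {"tool": "edit", "args": {"path": "app.py", "old": "x = 1", "new": "x = 2"}},
]

new_tools, new_calls = augment_harness(tools, calls, random.Random(0))
print(sorted(new_tools), [c["tool"] for c in new_calls])
```

Because the descriptions and arguments stay fixed while the surface names change, a model that has memorized a template gets penalized and one that maps instructions to behavior does not.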
What This Means For Agent Developers
North Mini Code is optimized for the workflows that actually matter in production agents:
- Multi-turn debugging and iteration
- Terminal-based task execution
- Repository-scale code understanding
- Tool use across heterogeneous interfaces
The model's training pipeline—verified tasks, multi-harness data, asynchronous RL with verifiable rewards—represents a blueprint for post-training models for agentic tasks beyond coding. The same principles apply to web agents, data analysis agents, or any domain where you can define verifiable success criteria.
The fact that a 30B MoE can outperform models with roughly 120B total parameters on agentic coding benchmarks is a strong signal that task-specific RL scales better than raw parameter count when you have good reward signals. This won't be the last domain where we see smaller, specialized models beat larger generalists.
Open Questions
Cohere's blog post is admirably detailed, but a few things remain opaque:
- How much of the performance gain comes from architecture versus data versus RL? Ablation studies would clarify this.
- What's the latency/throughput profile compared to dense models of similar capability? MoE should be faster, but real-world serving details matter.
- How does the model handle tasks outside its training distribution? The harness-robustness work is promising, but generalization to completely novel tooling environments is still an open question.
Still, this is the most transparent release I've seen from Cohere. The methodology section alone is worth studying if you're building agent-training pipelines.
The Bigger Picture
We're entering the era where single-model architectures are for demos and multi-turn agent systems are for shipping. North Mini Code is designed for that world. It's not trying to be the best chat model or the fastest code completion engine. It's trying to be the foundation for agents that autonomously fix bugs, implement features, and pass CI.
The Apache 2.0 license means the open-source agent ecosystem can actually build on this without running into API rate limits or negotiating contracts. That matters more than most people realize.
If you're building coding agents, go try it. If you're training models for agentic tasks, study the post-training pipeline. And if you're skeptical that specialized models can beat larger generalists—well, here's your counterexample.