What happens when you stop pretending agents think alike
Most multi-agent simulations run one model, many prompts. That's tidy, repeatable, and boring. Thousand Token Wood v2 took a different path: every agent in an emergent woodland economy runs on a different lab's small model. An owl hoards on gpt-oss-20b from OpenAI. A fox speculates on MiniCPM3-4B from OpenBMB. A magistrate hunts you for insider trading on Nemotron-Mini-4B from NVIDIA. And the narrator runs on a fine-tuned Qwen 0.5B.
This isn't novelty for its own sake. It's a thesis: heterogeneity is the product, not a constraint. A market where every participant genuinely differs in training data, post-training, and reasoning style is more interesting than five copies of the same prompt wrapper. And the engineering lesson underneath is sharp—standing up four models from four labs taught more about small-model deployment than any single-model project could.
The serving layer is the real constraint
If you've only run one model family in production, you might think model diversity means architecture hell. It doesn't. The four models in Thousand Token Wood ran into almost zero modeling-layer friction. The pain was entirely at the serving layer, and it was universal:
-
vLLM 0.22.1 JIT-compiles kernels at load and expects
nvccpresent. Lean base images don't ship the CUDA toolkit, so all four models failed identically with "could not find nvcc" until the author rebased them on a CUDA devel image. One image fix unblocked everything. -
gpt-oss-20bruns in native MXFP4 quantization and fits a 24GB L4 GPU with room to spare. No H100 required. It does wrap answers in an analysis preamble, so the consumer has to extract the final channel—a per-model quirk, but a one-line parse. -
MiniCPM3neededtrust_remote_code;Nemotronloaded clean. Per-model footguns, each a config entry.
The thing that made four heterogeneous models tractable was a tolerant JSON parse-and-repair layer that every model's output flows through. Different tokenizers and formatting habits produce different malformations. The parser drops what it can't salvage; the simulation never crashes. Build that layer once and adding a fifth model is a config entry, not a refactor.
This is the quiet lesson for production multi-agent systems: your variety bottleneck isn't model architecture, it's serving infrastructure and output normalization. Fix those and you can mix labs freely.
Information asymmetry is a security property
The dramatic core of v2 is the insider tip. You play the Patron of the Wood, a shadow financier who can whisper a tip to a creature that is true (a real forecast of the next market mania) or false (bait). Acting on a true tip and profiting raises your heat. Cross a threshold and the magistrate opens an investigation that ends in a fine, frozen assets, or exile.
For that to be a real game, the truth of a tip must be hidden from the creatures. They see the rumor text; they must never see the flag. This is a security property, not a UI nicety. Small-model agents make it sharp: everything the model could repeat back is whatever you put in its prompt.
So the hidden flag lives off-prompt entirely—on the player's ledger, stripped from the public event record at construction. The narrator summarizes only public events. A single test scans every creature's full prompt, every turn, for the banned tokens. That test is the most important one in the suite.
When you give an agent secret information, assume it will leak unless a test proves it cannot.
This is harder than it sounds with reasoning models or chain-of-thought prompting, where the model can infer hidden state from context. Thousand Token Wood solved it with data architecture, not prompt instructions. The flag never enters the system the agents can see.
Memory is cheap drama if you bound it
Creatures carry persistent relationships: a signed sentiment toward the Patron and toward each other, nudged by events (you shorted my crop, you repaid your loan, you allied me with a rival). A creature that turns hostile refuses your loans and quotes you worse. Allied creatures stop undercutting each other and behave like a cartel.
The trap is prompt inflation. Raw history grows without bound and a small model drowns in it. The fix: never put history in the prompt. The model sees a one-line bucketed summary—"you feel warmly toward Oona, wary of the Patron"—capped to the few strongest feelings, derived from integer sentiment. Notes are kept for traces but bounded and never shown.
The behavioral bias is part emergent (the summary nudges the model) and part mechanical (a strongly hostile creature deterministically refuses), so it's observable and testable rather than a hope. This is the right pattern for small models: summarize aggressively, bound the context window, and put the weight-bearing logic outside the prompt where you can unit-test it.
What actually shipped
A representative council run, full v2 mechanics live:
- Models in the council: 4 labs, all under the 32B cap, served on Modal
- Fine-tuned 0.5B reliability: 0% self-buys, 100% valid offers (beats its 3B teacher)
- Truth firewall: 0 leaks of a tip's hidden flag across every prompt scanned
- Insider tip edge: a true-tip pre-position settles a positive P&L; a false tip does not
- Heat to investigation: two clean suspicious wins cross the magistrate's line
- Ruin: a margin call and a loan default banish a creature, who returns a chapter later
This is a single seeded run exercising the Patron, the information war, relationships, and leverage end to end. Not a benchmark, but a proof that the mechanics hold under player interaction.
Why this matters for small-model agent systems
Thousand Token Wood is a game, but the engineering choices are not game-specific. They're patterns for any multi-agent system running on small models:
-
A small model is a reliable format generator and an unreliable reasoner. You close the gap with structure, prompting, and a small fine-tune, not with scale.
-
A heterogeneous council is more interesting than a homogeneous one and costs you only config once the serving layer is solid.
-
Secret information given to an agent is a firewall problem. The firewall belongs in the data flow, proven by a test, not in a prompt instruction.
-
Persistent memory is the cheapest way to make agents feel alive, as long as the prompt only ever sees a bounded summary.
The whole council is open, and so are the traces. If you're building multi-agent systems, this is a worked example of how to ship something with personality on models under 5B without waiting for reasoning-model pricing to drop.
Small models, big adventures.