Small is the right tool, not the compromise
Lester Leong's Thousand Token Wood is the kind of project that makes you rethink where frontier models actually belong. It's a multi-agent economic simulation, five woodland creatures trading five goods for pebbles, running entirely on Qwen2.5-3B. Not as a cost-saving measure. As the correct architectural choice.
The constraint is the feature. A living economy needs many agents thinking many times per tick. That's exactly where GPT-4 or Claude would be the wrong tool: too slow, too expensive, too serial. A 3B model served via vLLM on Modal can run the entire council of traders in a single batched GPU call every turn. The simulation runs in real time because the model is small.
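To make the "one batched call per turn" idea concrete, here is a minimal sketch. It does not reproduce the project's serving code: `run_turn` and `generate_batch` are hypothetical names, and the backend is injected as a plain callable so the example runs without a GPU. With vLLM's offline API, a list of prompts passed to `LLM.generate` plays the role of `generate_batch`.

```python
from typing import Callable, Dict, List, Sequence


def run_turn(
    prompts_by_agent: Dict[str, str],
    generate_batch: Callable[[Sequence[str]], List[str]],
) -> Dict[str, str]:
    """Send every agent's prompt in one batch and map the replies back by name."""
    names = list(prompts_by_agent)
    prompts = [prompts_by_agent[name] for name in names]

    # One call for the whole council, instead of one call per creature.
    outputs = generate_batch(prompts)

    if len(outputs) != len(prompts):
        raise ValueError("backend returned a different number of replies than prompts")
    return dict(zip(names, outputs))


# Stand-in backend for illustration only; it records how often it is called.
calls = []


def fake_backend(prompts: Sequence[str]) -> List[str]:
    calls.append(len(prompts))
    return ['{"actions": []}' for _ in prompts]


# Creature names other than Oona are invented for the example.
replies = run_turn(
    {"Oona": "prompt for the owl", "Pip": "prompt for the squirrel"},
    fake_backend,
)
```

The point of the shape is that the number of backend calls per turn stays at one no matter how many creatures sit on the council.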
This is a field report on what worked, what didn't, and what a 3-billion-parameter model can and cannot do when you ask it to panic-sell honey during a bank run.
The first economy died of abundance
The naive version was boring. Production outran consumption. Every creature was self-sufficient. The market cleared once and went silent.
The fix wasn't better prompts or a bigger model. It was designed scarcity:
- Diet variety: A creature can eat only one unit of any single food per meal, so surviving means buying foods it doesn't grow.
- Spoilage: Perishable food rots if hoarded, forcing surplus sales while value remains.
- Winter fuel crisis: Every creature must burn firewood each turn, the need rises over time, and only one creature makes firewood.
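The three rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: the `Creature` class, the set of perishable goods, the half-per-turn spoilage rate, and the firewood ramp are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumption: which goods rot is not stated in the write-up.
PERISHABLE = {"berries", "honey"}


@dataclass
class Creature:
    name: str
    pantry: Dict[str, int] = field(default_factory=dict)  # good -> units held


def eat_meal(creature: Creature, foods: List[str], meal_size: int = 3) -> int:
    """Diet variety: eat at most one unit of each food, up to meal_size units."""
    eaten = 0
    for food in foods:
        if eaten == meal_size:
            break
        if creature.pantry.get(food, 0) > 0:
            creature.pantry[food] -= 1
            eaten += 1
    return eaten


def spoil(creature: Creature) -> None:
    """Spoilage: half of each perishable stock (rounded down) rots every turn."""
    for good in PERISHABLE:
        held = creature.pantry.get(good, 0)
        if held > 0:
            creature.pantry[good] = held - held // 2


def firewood_needed(turn: int, base: int = 1, ramp_every: int = 5) -> int:
    """Winter fuel: the per-turn firewood requirement climbs as the run goes on."""
    return base + turn // ramp_every


# A creature sitting on ten acorns still gets only one unit of food from them.
pip = Creature("Pip", {"acorns": 10})
units_eaten = eat_meal(pip, ["acorns", "honey", "berries"])
```

A pantry full of a single food yields one unit per meal, which is what pushes a self-sufficient producer into the market.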
That last mechanic drives the drama. One supplier can't meet rising demand, so the woodcutter gets rich and everyone else competes for warmth. Emergent inequality appeared without a line of code modeling it directly.
The lesson: abundance is boring. Multi-agent systems need tension baked into the rules, or they optimize themselves into equilibrium and fall silent. Scarcity creates the reasons to trade, hoard, and panic.
Valid JSON, terrible judgment
With scarcity in place, the honest small-model lesson surfaced.
The 3B model emitted valid JSON on 100% of calls. Formatting reliability was perfect. But economic judgment was catastrophically bad: a creature that produced acorns would post an order to buy acorns, the one thing it had in surplus.
The fix wasn't scale. It was prompt structure:
- Tell each agent what it produces and must never buy.
- Compute the exact list of goods it's short on.
- Give it one worked example.
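A prompt built along those three lines might look like the sketch below. The wording, the JSON schema in the worked example, and the helper names `compute_shortages` and `build_prompt` are assumptions for illustration; the actual prompts are in the traces dataset.

```python
import json
from typing import Dict, List


def compute_shortages(inventory: Dict[str, int], needs: Dict[str, int], produces: str) -> List[str]:
    """List goods held below the needed amount, never including the creature's own product."""
    return sorted(
        good
        for good, required in needs.items()
        if good != produces and inventory.get(good, 0) < required
    )


# One worked example shown to the model; the action schema is hypothetical.
WORKED_EXAMPLE = {
    "thought": "I grow acorns and I am short on honey, so I sell acorns and bid for honey.",
    "actions": [
        {"type": "sell", "good": "acorns", "qty": 2, "price": 4},
        {"type": "buy", "good": "honey", "qty": 1, "price": 9},
    ],
}


def build_prompt(name: str, produces: str, inventory: Dict[str, int], needs: Dict[str, int]) -> str:
    """Spell out the role, the precomputed shortages, and a single example reply."""
    short = compute_shortages(inventory, needs, produces)
    lines = [
        f"You are {name}. You produce {produces}. Never buy {produces}.",
        f"You are short on: {', '.join(short) if short else 'nothing'}.",
        "Reply with JSON only, shaped like this example:",
        json.dumps(WORKED_EXAMPLE),
    ]
    return "\n".join(lines)


prompt = build_prompt(
    "Pip",
    "acorns",
    inventory={"acorns": 6, "honey": 0, "firewood": 1},
    needs={"acorns": 1, "honey": 1, "firewood": 1},
)
```

The shortage list is computed in ordinary code, so the model never has to do the inventory arithmetic it tends to get wrong.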
Decision quality jumped. The creatures began trading to their roles. The whole loop is wrapped in a tolerant JSON parse-and-repair layer, so a malformed response degrades to a no-op instead of crashing the simulation.
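A tolerant parse-and-repair layer can be quite small. The sketch below is one plausible shape, not the project's implementation: the `actions` key, the specific repairs (pulling the outermost braces out of surrounding chatter, dropping trailing commas), and the name `parse_actions` are assumptions.

```python
import json
import re
from typing import Any, Dict


def parse_actions(raw: str) -> Dict[str, Any]:
    """Parse a model reply into {"actions": [...]}, falling back to a no-op."""
    candidates = [raw]

    # Repair 1: pull the outermost {...} out of code fences or extra prose.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        body = match.group(0)
        candidates.append(body)
        # Repair 2: drop trailing commas before a closing brace or bracket.
        candidates.append(re.sub(r",\s*([}\]])", r"\1", body))

    for text in candidates:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and isinstance(parsed.get("actions"), list):
            return parsed

    # Nothing usable: the creature simply does nothing this turn.
    return {"actions": []}


clean = parse_actions('{"actions": [{"type": "buy"}]}')
repaired = parse_actions('Sure!\n```json\n{"actions": [{"type": "sell",},]}\n```')
no_op = parse_actions("I would rather not trade today.")
```

Returning a no-op instead of raising keeps a single bad reply from stalling the other creatures' turn.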
This is the small-model trade-off in production: reliable formatting, unreliable reasoning. You close the gap with structure and constraints, not parameter count.
Then it started telling stories
The feature Leong is most pleased with ties the project to market history. Players can draw a Wood Legend: a famous financial episode reskinned as woodland folklore.
- Tulip Mania becomes the Great Acorn Mania.
- The South Sea Bubble becomes the Hollow Log Trading Company.
- The 1929 bank runs become the Run on Oona's Hoard.
These aren't flavor text. Each legend fires real shocks, and the agents react.
In one run, the rumor spread that the owl's vault was empty. Oona began liquidating honey to raise pebbles. The flood of supply crashed the honey price from 10 to 3 over the following turns. A reskinned bank run made an agent dump assets and moved a market price. None of it was scripted.
Making prices move
For that to be visible, prices had to move. Initially they were frozen because agents quoted back the reference price shown to them.
The fix: let the market reference drift with residual supply and demand after each round. Heavy unfilled buying pushes a price up; a glut pushes it down. Prices now trend during scarcity and stay calm in balanced trade.
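One simple way to express that drift is to nudge the reference price in proportion to the imbalance between leftover bids and leftover asks. The formula, the `sensitivity` and `floor` parameters, and the function name `drift_price` are assumptions for illustration; the write-up does not specify the exact update rule.

```python
def drift_price(
    price: float,
    unfilled_buys: int,
    unfilled_sells: int,
    sensitivity: float = 0.1,
    floor: float = 1.0,
) -> float:
    """Move the reference price toward whichever side was left unfilled."""
    total = unfilled_buys + unfilled_sells
    if total == 0:
        # Balanced trade: everything cleared, so the reference stays put.
        return price

    # Imbalance lies in [-1, 1]: +1 is pure unmet demand, -1 is a pure glut.
    imbalance = (unfilled_buys - unfilled_sells) / total
    return max(floor, price * (1 + sensitivity * imbalance))


# Repeated gluts walk the price down turn after turn.
honey = 10.0
for _ in range(5):
    honey = drift_price(honey, unfilled_buys=0, unfilled_sells=4)
```

Because agents tend to quote near the reference they are shown, moving the reference is what lets their quotes move too.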
This is prompt engineering at the system level: not just what you say to the model, but how you let the environment respond to its decisions.
What actually happened
A representative fifteen-turn run with a drought and winter rumor injected partway:
| Metric | Result |
|---|---|
| Valid JSON actions | 100% (75 of 75 calls) |
| Trades per turn | sustained 3 to 9, never silent |
| Honey price | crashed 10 to 3 during bank-run legend |
| Firewood price | rose 4 to 7 as winter scarcity bit |
| Wealth gap (Gini) | widened 0.14 to 0.38 |
| Outcome | woodcutter ended richest, hoarder broke |
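For readers who want to reproduce the wealth-gap row from their own runs, the Gini coefficient can be computed directly from each creature's pebble holdings. This is the standard sorted-rank formula; the sample holdings below are made up and are not taken from the run reported above.

```python
from typing import Iterable


def gini(wealth: Iterable[float]) -> float:
    """Gini coefficient: 0 means perfect equality, values near 1 mean one holder has it all."""
    values = sorted(wealth)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Weight each holding by its 1-based rank in ascending order.
    weighted = sum((rank + 1) * value for rank, value in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n


equal = gini([10, 10, 10, 10, 10])      # everyone holds the same
concentrated = gini([0, 0, 0, 0, 100])  # one creature holds everything
```

With five creatures the maximum possible value is 0.8, reached when a single creature holds every pebble.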
The reasoning behind every move is in the open traces dataset: each row contains a creature's full prompt, raw response, parsed actions, and private thought.
That dataset is the real artifact. It's a corpus of 3B decision-making under scarcity, complete with the prompts that shaped it. If you're building with small models, that's gold.
Wellbeing as mean-reversion, not death spirals
An earlier design modeled wellbeing as an accumulator. Any chronic shortfall ground every creature to zero over a run, a death spiral that was no fun to watch and that punished the agents' imperfect optimization.
Leong reframed it as mean-reverting mood: wellbeing recovers when a creature is fed and warm and never hits zero. Stakes belong in pebbles, prices, and status, not starvation.
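The difference from an accumulator is easiest to see in code. In the sketch below, wellbeing moves a fraction of the way toward a target set by current conditions each turn, so it recovers as soon as needs are met. The target weights, `rate`, `floor`, and the name `update_wellbeing` are assumptions, not the project's values.

```python
def update_wellbeing(
    current: float,
    fed: bool,
    warm: bool,
    rate: float = 0.3,
    floor: float = 0.05,
) -> float:
    """Pull wellbeing toward a condition-dependent target instead of summing shortfalls."""
    # The target ranges from 0.2 (hungry and cold) to 1.0 (fed and warm).
    target = 0.2 + 0.4 * fed + 0.4 * warm
    updated = current + rate * (target - current)
    # Clamp so the value stays in (0, 1] and can never reach zero.
    return max(floor, min(1.0, updated))


# A long stretch of hunger and cold settles near the low target rather than at zero.
mood = 0.8
for _ in range(50):
    mood = update_wellbeing(mood, fed=False, warm=False)
low_point = mood

# A few good turns bring it back up.
for _ in range(10):
    mood = update_wellbeing(mood, fed=True, warm=True)
```

However long the bad stretch lasts, the value settles at the low target instead of draining away, and a few good turns are enough to recover.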
This is a subtle but critical design choice. Multi-agent simulations need to tolerate suboptimal play without collapsing into unrecoverable states. Forgiveness mechanics keep the system legible and the drama in the right place.
Takeaways for building with small models
Most of the engineering is closing the gap between a small model's reliable formatting and its unreliable reasoning, with structure and prompting rather than scale.
Emergent systems need designed scarcity. Abundance is boring. If agents can become self-sufficient, they will, and your simulation will go silent.
The best demos don't need invented drama. Three centuries of market history had it ready. A council of 3B agents was enough to play it out.
And the big one: small models are the right tool when you need many decisions fast. Frontier models are for single, high-stakes reasoning tasks. Small models are for councils, swarms, and simulations.
Try the Space. Watch the wood panic. Small models, big adventures.