The evaluation problem nobody talks about
Most LLM evaluation tools are built for the moment after you ship: you run a finished model through a suite of benchmarks, publish the scores, and move on. But if you're actually building an LLM, you're not evaluating once—you're evaluating constantly. Every tweak to the data mix, every hyperparameter adjustment, every checkpoint sends you back through the same loop: configure the benchmarks, run them, log the results, and decide whether the change actually helped or just added noise.
AI2 just released olmo-eval, a new evaluation workbench designed for that iterative reality. It builds on OLMES, their 2024 open benchmarking standard, but extends it across the entire development cycle—not just the final scorecard. The pitch: faster benchmark integration, modular runtime policies, first-class support for agentic evaluations, and analysis tools that help you distinguish real improvements from statistical noise.
What OLMES solved (and what it didn't)
OLMES—the Open Language Model Evaluation Standard—was AI2's answer to a reproducibility crisis. The same models were scoring differently on the same benchmarks because every paper made different choices about prompt formatting, task formulation, and scoring logic. OLMES pinned those choices down in an open standard, and it became the basis for evaluating Olmo and Tulu.
But OLMES was designed for reporting scores, not for the messy work of iterating on a model under development. Adding a new benchmark was still a sizeable integration project. Changing how a benchmark runs—say, adding tool use or switching execution environments—often meant rewriting the benchmark itself. And comparing two checkpoints meant eyeballing aggregate numbers without a clear view of where the differences actually showed up.
That's the gap olmo-eval fills.
How it differs from Harbor (and why that matters)
olmo-eval overlaps with Harbor, the open framework for evaluating AI agents in sandboxed containers. Both frameworks decouple benchmark logic from runtime policy. Both support agentic, multi-turn evaluations. But they're optimized for different workflows.
Harbor is built for publishing agent benchmarks: everything runs inside sealed, reproducible containers with the verification rigor that public benchmarks require. That's the right tradeoff when you're releasing a leaderboard, but it's heavyweight for everyday development.
olmo-eval lets you choose the execution mode per benchmark. A simple Q&A eval can run directly—faster, cheaper, no container overhead. An eval that needs to execute model-generated code gets an isolated sandbox. The lightweight path is the default; the heavy setup only kicks in when a benchmark actually requires it.
Adding a benchmark in Harbor is a formal process designed for evals you plan to share publicly. In olmo-eval, the integration surface depends on what the benchmark needs: a short Python task definition for basic evals, optional tool and scaffold configurations for agentic benchmarks, or a thin wrapper if the benchmark already has its own runner and you just want olmo-eval to report the results in a consistent schema.
Modularity all the way down
Both frameworks separate benchmarks from runtime policy, but olmo-eval takes modularity further. The model being evaluated, the tools it can use, the containerized environment, and any helper models—like an LLM-as-a-judge—are all swappable components. You can reuse a tool definition across many harnesses, plug a grading model into one benchmark without touching the others, or adjust prompt wording without extensive plumbing.
This matters when you're running dozens of evals per day across a stream of checkpoints. The ability to change one variable while holding everything else fixed is the difference between confident iteration and guesswork.
The four-layer stack
olmo-eval is composed of four components that can each be used on their own but are designed to work together:
1. Task / Suite / Harness abstraction
A task defines what you're evaluating: the dataset, how requests are formatted, and how answers are scored. A suite groups tasks into a standard set you run together. A harness controls how each task runs—which model, which tools, which execution environment.
This separation means the same benchmark can run as a baseline or with tools and scaffolding enabled, without changing what it measures. You define the benchmark once and vary the runtime policy via harness presets.
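To make the separation concrete, here is a minimal toy sketch of the idea. It is not olmo-eval's API: `ToyTask`, `ToyHarness`, `run`, and `toy_model` are all hypothetical names invented for illustration, and the "model" is a stand-in function. The point is only that the task (data and scoring) stays fixed while the harness (model and tools) varies.

```python
from dataclasses import dataclass
from typing import Callable

# All names here are hypothetical illustrations, not olmo-eval's actual API.


@dataclass(frozen=True)
class ToyTask:
    """What is measured: the examples and how answers are scored."""
    name: str
    examples: tuple[tuple[str, str], ...]  # (question, gold answer) pairs

    def score(self, prediction: str, gold: str) -> float:
        # Case-insensitive exact match.
        return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


@dataclass(frozen=True)
class ToyHarness:
    """How it runs: which model callable and which tools are available."""
    name: str
    model: Callable[[str, tuple[str, ...]], str]
    tools: tuple[str, ...] = ()


def run(task: ToyTask, harness: ToyHarness) -> float:
    """Run one task under one harness and return mean accuracy."""
    scores = [
        task.score(harness.model(question, harness.tools), gold)
        for question, gold in task.examples
    ]
    return sum(scores) / len(scores)


def toy_model(question: str, tools: tuple[str, ...]) -> str:
    # Stand-in model: it can only answer the second question with a "search" tool.
    if "capital" in question:
        return "Paris"
    return "42" if "search" in tools else "unknown"


task = ToyTask(
    "toy_qa",
    (("What is the capital of France?", "paris"), ("What is the answer?", "42")),
)
suite = (task,)  # in this sketch a suite is just a group of tasks

baseline = ToyHarness("baseline", toy_model)
with_tools = ToyHarness("search_agent", toy_model, tools=("search",))

# Same task definition, two runtime policies.
results = {h.name: [run(t, h) for t in suite] for h in (baseline, with_tools)}
```

Because the scoring logic lives only in the task, any difference between the two entries in `results` comes from the runtime policy, not from a change in what is being measured.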
2. Sandbox and capability routing
When a benchmark requires tool use—writing code, browsing the web, running shell commands—olmo-eval routes those actions to the appropriate executor. The asynchronous sandbox planner handles parallel container execution, spinning up isolated environments only when needed and tearing them down when the eval completes.
This isn't simulation. When the model calls a tool, olmo-eval runs the tool and feeds the real output back to the model. That's the point: evaluating the model's actual tool-use behavior, not whether a generated tool call merely looks plausible.
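The routing idea can be sketched with plain asyncio. This is an assumption-laden toy, not olmo-eval's planner: `FakeSandbox`, `run_instance`, `run_all`, and the `needs_sandbox` flag are hypothetical, and the sandbox only pretends to execute code. It shows the shape of the logic: the direct path is the default, an isolated environment is created only for instances that need one, and it is torn down as soon as that instance finishes.

```python
import asyncio

# Hypothetical sketch: these names are illustrative, not olmo-eval's API.


class FakeSandbox:
    """Stand-in for an isolated container; counts how many were started."""
    started = 0

    async def __aenter__(self):
        FakeSandbox.started += 1
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False  # real teardown would happen here

    async def execute(self, code: str) -> str:
        await asyncio.sleep(0)  # placeholder for real container I/O
        return f"ran:{code}"


async def run_instance(instance: dict) -> str:
    """Route one instance: direct by default, sandboxed only if required."""
    if instance.get("needs_sandbox"):
        async with FakeSandbox() as box:
            return await box.execute(instance["payload"])
    return f"direct:{instance['payload']}"


async def run_all(instances: list[dict], max_parallel: int = 4) -> list[str]:
    """Run instances concurrently, capping how many are in flight at once."""
    limit = asyncio.Semaphore(max_parallel)

    async def guarded(instance: dict) -> str:
        async with limit:
            return await run_instance(instance)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(i) for i in instances))


instances = [
    {"payload": "qa-1"},
    {"payload": "print(1)", "needs_sandbox": True},
    {"payload": "qa-2"},
]
outputs = asyncio.run(run_all(instances))
```

Only one of the three instances pays the cost of a sandbox here, which is the tradeoff described above: container overhead is incurred per benchmark need rather than across the board.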
3. Normalized experiment schema
Every run—every checkpoint, every configuration change—gets logged in the same structured format. This makes it possible to group related experiments, compare checkpoints over time, and avoid the schema drift that accumulates in long-running projects.
It sounds boring, but it's the foundation for everything else. Without a consistent schema, you can't build tooling to analyze trends or confidently roll back a regression.
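As a rough illustration of why a fixed record shape pays off, here is a sketch using a dataclass. The field names and the `RunRecord`, `to_jsonl`, and `by_checkpoint` helpers are assumptions made up for this example, not olmo-eval's actual schema, and the numbers are invented.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical record layout; field names are assumptions, not olmo-eval's schema.


@dataclass(frozen=True)
class RunRecord:
    experiment: str      # groups related runs
    checkpoint: str
    task: str
    harness: str
    metric: str
    value: float
    num_instances: int


def to_jsonl(records: list[RunRecord]) -> str:
    """Serialize one JSON object per line with a stable key order."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def by_checkpoint(records: list[RunRecord], task: str, metric: str) -> dict[str, float]:
    """Pull a single metric for a single task across checkpoints."""
    return {r.checkpoint: r.value for r in records if r.task == task and r.metric == metric}


# Made-up values, purely for illustration.
records = [
    RunRecord("data-mix-ablation", "step-1000", "internal_freshqa:zero", "default", "accuracy", 0.41, 500),
    RunRecord("data-mix-ablation", "step-2000", "internal_freshqa:zero", "default", "accuracy", 0.44, 500),
]
trend = by_checkpoint(records, "internal_freshqa:zero", "accuracy")
```

Since every record has the same fields, a helper like `by_checkpoint` keeps working no matter which run produced the data, which is what makes trend analysis across a long project practical.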
4. Pairwise results viewer
Harbor reports an overall score for each model. olmo-eval reports those scores too, along with standard errors and minimum detectable effects: the smallest difference between two runs that a benchmark of that size can reliably distinguish from noise.
But the more useful view is the pairwise comparison: lining two checkpoints up question by question, with all else held fixed. A 2.4-percentage-point shift in aggregate accuracy might be noise. A consistent pattern of improvements on a specific question type, visible in the per-instance diff, is signal.
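Here is a small sketch of the arithmetic behind a paired comparison. It is not olmo-eval's implementation: `paired_comparison` is a hypothetical helper, the scores are made up, and the minimum detectable effect uses a standard normal approximation (two-sided significance level `alpha`, target `power`), which is one common choice among several.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_comparison(a: list[int], b: list[int], alpha: float = 0.05, power: float = 0.8) -> dict:
    """Compare per-instance 0/1 scores from two checkpoints on the same questions."""
    if len(a) != len(b):
        raise ValueError("both checkpoints must be scored on the same instances")
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    # Standard error of the mean per-instance difference.
    se = stdev(diffs) / math.sqrt(n)
    # Normal-approximation minimum detectable effect.
    z = NormalDist().inv_cdf
    mde = (z(1 - alpha / 2) + z(power)) * se
    return {
        "delta": mean(diffs),
        "se": se,
        "mde": mde,
        "improved": [i for i, d in enumerate(diffs) if d > 0],
        "regressed": [i for i, d in enumerate(diffs) if d < 0],
    }


# Made-up scores for eight questions (1 = correct, 0 = incorrect).
old = [1, 0, 0, 1, 1, 0, 1, 0]
new = [1, 1, 0, 1, 0, 1, 1, 1]
report = paired_comparison(old, new)
```

In this toy case the aggregate gain is 25 points, yet with only eight questions it falls well below the minimum detectable effect, so the aggregate alone proves little. The `improved` and `regressed` index lists are the per-instance diff: they tell you which questions to go read.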
What adding a benchmark actually looks like
In most setups, integrating a new benchmark is a multi-day project. In olmo-eval, it's a Python class:
@register("internal_freshqa")
class InternalFreshQA(Task):
    # Where the benchmark data lives and which split to evaluate
    data_source = DataSource(path="s3://evals/internal/freshqa.jsonl", split="test")
    formatter = ChatFormatter()
    # Temperature 0.0 keeps generations as deterministic as possible
    sampling_params = SamplingParams(temperature=0.0)
    metrics = (AccuracyMetric(scorer=ExactMatchScorer),)

    @property
    def instances(self):
        # Yield one Instance per record, falling back to a positional id
        loader = DataLoader()
        for idx, doc in enumerate(loader.load(self.config.get_data_source())):
            yield Instance(
                question=doc["question"],
                gold_answer=doc["answer"],
                metadata={"id": doc.get("id", f"freshqa_{idx}")},
            )
Variants express changes in evaluation policy without duplicating the benchmark:
register_variant("internal_freshqa", "3shot", num_fewshot=3, fewshot_seed=1234)
register_variant("internal_freshqa", "zero", num_fewshot=0)
Runtime policy lives in the harness, not the task. Same benchmark, different execution:
# Baseline
olmo-eval run -m my-checkpoint -t internal_freshqa:zero
# Same scoring, search/tool runtime enabled
olmo-eval run -m my-checkpoint -t internal_freshqa:zero --harness search_agent
When to use it (and when not to)
Use olmo-eval when evaluation is part of ongoing model development rather than a one-off scoring run. If your recurring question is "How does this checkpoint differ from the last one, and where exactly did it improve or regress?", that's the workflow olmo-eval is built for.
If you're publishing a leaderboard or need maximum reproducibility guarantees for public benchmarks, Harbor's heavier verification process might be the better fit. If you just need to score a finished model on established benchmarks once, lighter-weight tools will get you there faster.
But if you're in the loop—tweaking data, retraining, evaluating, analyzing, and doing it all again tomorrow—olmo-eval is designed to keep pace.
The open development loop
Reproducible evaluation should move at the speed of model development, not lag behind it. OLMES standardized how we score finished models. olmo-eval carries that standard into the iterative work of building models—and AI2 is releasing it openly so the rest of us can build on it.
The code is on GitHub. The paradigm—modular, iterative, grounded in the reality of constant checkpoints and constant questions—feels like the right direction for evaluation tooling in 2025.