The evaluation results crisis no one talks about
Here's a problem that should bother you more than it does: the same model, on the same benchmark, is routinely reported with wildly different scores depending on who ran it. LLaMA 65B has been clocked at both 63.7 and 48.8 on MMLU. That is not a rounding error. It is a gap of nearly 15 points (14.9) on a percentage-scale benchmark that's supposed to be measuring the same capability.
This isn't a fluke. Evaluation results live scattered across papers, leaderboards, blog posts, and harness logs, each in its own format, often with key settings left unreported, which makes apples-to-apples comparison nearly impossible. When you're comparing models or reasoning about safety thresholds, this chaos is a real problem.
Every Eval Ever (EEE) and Hugging Face Community Evals just became interoperable, and if you care about making evaluation results actually useful, this is the most interesting infrastructure work happening right now.
What EEE actually fixes
Every Eval Ever launched in February 2026 as the first cross-institutional attempt to standardize how evaluation results get reported. It's a single JSON schema that captures everything you need to interpret a benchmark score:
- Who ran it and when
- Which model, accessed how
- Generation settings (temperature, top-p, shot count)
- What the metric actually measures
- Optionally, a companion JSONL with per-sample outputs
The schema was built with input from researchers and policy folks. It accepts results from any source—harness logs, leaderboard scrapes, paper tables—and normalizes them into the same structure.
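To make that list concrete, here is a minimal sketch of what one such record might look like as a Python dictionary, plus a helper that reports which pieces of context are missing. Only four field paths come from this post (source_data.hf_repo, evaluation_name, score_details.score, evaluation_timestamp). Every other key, all the values, and the helper names get_path and missing_context are hypothetical stand-ins, not the real EEE schema.

```python
# Illustrative record. Keys marked "hypothetical" are NOT taken from the EEE schema,
# and all values are made-up placeholders.
record = {
    "evaluation_name": "MMLU-Pro",
    "evaluation_timestamp": "2026-02-20T12:00:00Z",
    "source_data": {"hf_repo": "example-org/example-benchmark"},
    "score_details": {"score": 0.5},
    "evaluator": {"name": "example-lab"},                           # hypothetical
    "model": {"id": "example-org/example-model", "access": "api"},  # hypothetical
    "generation_config": {"temperature": 0.0, "top_p": 1.0, "num_shots": 5},  # hypothetical
    "metric": {"name": "accuracy", "description": "fraction of correct answers"},  # hypothetical
}

# Human-readable label -> dotted path where that context is assumed to live.
CONTEXT_PATHS = {
    "who ran it": "evaluator.name",
    "when": "evaluation_timestamp",
    "which model": "model.id",
    "access method": "model.access",
    "generation settings": "generation_config",
    "metric definition": "metric.description",
}


def get_path(data, dotted):
    """Follow a dotted path through nested dicts; return None if any step is absent."""
    node = data
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def missing_context(data):
    """List the labels whose path is absent from the record."""
    return [label for label, path in CONTEXT_PATHS.items() if get_path(data, path) is None]
```

A bare score with no surrounding context, such as {"evaluation_name": "MMLU-Pro", "score_details": {"score": 0.5}}, would come back with every label flagged as missing, which is roughly the situation the schema is meant to prevent.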
The EEE datastore on Hugging Face now holds around 229,000 evaluation results across more than 22,000 models and 2,200 benchmarks, pulled from 31 different reporting formats. Reproducing those runs from scratch would cost hundreds of thousands of dollars. That's a strong argument for not letting the data scatter once someone has paid to generate it.
What Community Evals does differently
Hugging Face Community Evals, also launched in February 2026, decentralizes how benchmark scores surface on the Hub. It has two sides:
Benchmark leaderboards live on dataset repos. Any dataset can register as a benchmark by adding an eval.yaml. Once registered, that dataset page automatically aggregates every score reported against it across all model repos into a live leaderboard.
Model scores live in .eval_results/*.yaml files inside model repos. They show up on model cards and feed into the matching benchmark leaderboard. Anyone can add a score to any model by opening a pull request with the right YAML.
Each score carries provenance: author-submitted, community-submitted, or independently verified. Model authors can close PRs or hide results on their own repos, but the default is open contribution.
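The aggregation idea is simple enough to sketch. The snippet below is a toy model of it, not Hugging Face's implementation: it assumes the scores have already been parsed out of each model repo's .eval_results files into plain dictionaries. The field names, the placeholder values, and the exact provenance strings are all assumptions made for illustration.

```python
from collections import defaultdict

# Provenance labels as described in the text; the exact strings are an assumption.
PROVENANCE_LEVELS = ("author-submitted", "community-submitted", "independently verified")

# Hypothetical stand-in for scores collected from many model repos.
reported = [
    {"model": "org-a/model-x", "dataset_id": "example/benchmark-a",
     "value": 71.2, "provenance": "author-submitted"},
    {"model": "org-b/model-y", "dataset_id": "example/benchmark-a",
     "value": 64.0, "provenance": "community-submitted"},
    {"model": "org-a/model-x", "dataset_id": "example/benchmark-b",
     "value": 55.5, "provenance": "independently verified"},
]


def build_leaderboards(entries):
    """Group scores by benchmark dataset and sort each group best-first."""
    boards = defaultdict(list)
    for entry in entries:
        if entry["provenance"] not in PROVENANCE_LEVELS:
            raise ValueError(f"unknown provenance: {entry['provenance']!r}")
        boards[entry["dataset_id"]].append(entry)
    return {
        dataset_id: sorted(rows, key=lambda row: row["value"], reverse=True)
        for dataset_id, rows in boards.items()
    }


leaderboards = build_leaderboards(reported)
```

Keeping the provenance label on every row means a leaderboard can show where each number came from instead of flattening everything into one anonymous column.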
The integration that makes both systems useful
Here's where it gets interesting. When you submit a result to both EEE and Community Evals, two things happen simultaneously:
- Your score appears on the Hugging Face model page and gets pulled into the benchmark's leaderboard—where people actually look at models
- It carries a source badge linking straight back to the full EEE record, where the generation config, harness version, reproducibility notes, and instance-level data live
The two destinations do different jobs toward the same goal. Hugging Face puts your result where users browse models. EEE keeps the full structured record that makes the result interpretable and powers tools like Eval Cards on top of it.
Send your data to both and the same evaluation ends up visible and legible at once. That's the entire point of reporting it.
How the converter works
The new converter tool automates the cross-posting. It maps EEE's JSON schema to Hugging Face's YAML format:
- source_data.hf_repo → dataset.id
- evaluation_name → task_id
- score_details.score → value
- evaluation_timestamp → date
Then it drops in the datastore object URL as the source link. The converter currently supports four official benchmarks: MMLU-Pro, GPQA, HLE, and GSM8K.
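As a rough illustration of that field mapping, here is a small Python function that reshapes one EEE-style record into a Community Evals-style entry. It is a sketch, not the converter's code. The four source and target field names come from the mapping described here, while the function name, the way the output is nested, the source.url key, and the sample values are assumptions. The timestamp is passed through unchanged because the post does not say how it is reformatted.

```python
def eee_to_community_eval(record, source_url):
    """Reshape one EEE-style record (a nested dict) into a Community Evals-style entry.

    The output nesting and the "source" key are assumptions made for illustration.
    """
    return {
        "dataset": {"id": record["source_data"]["hf_repo"]},
        "task_id": record["evaluation_name"],
        "value": record["score_details"]["score"],
        "date": record["evaluation_timestamp"],
        "source": {"url": source_url},  # hypothetical key for the datastore object URL
    }


# Made-up placeholder record and URL, used only to show the shape of the result.
example_record = {
    "source_data": {"hf_repo": "example-org/example-benchmark"},
    "evaluation_name": "example_task",
    "score_details": {"score": 0.5},
    "evaluation_timestamp": "2026-02-20T12:00:00Z",
}
entry = eee_to_community_eval(example_record, "https://example.invalid/datastore/object")
```

A record missing any of the four fields raises KeyError here, which is a deliberate choice for the sketch: a half-filled entry is worse than a loud failure.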
But it does more than reshape fields. The converter:
- Downloads the collection and validates object hashes
- Audits what already exists on the model's main branch and in open PRs
- Flags duplicates as already_present, conflicts as score_conflict, missing repos as missing_hf_model
- Writes local YAML previews and a review file for inspection
- Only opens PRs after you type OPEN PRS and confirm
Nothing gets pushed without your sign-off. Reruns reuse cached results unless you force a refresh.
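The safety checks above can be sketched in a few lines of Python. This is a guess at the shape of the logic, not the converter's actual code: the three status strings and the OPEN PRS phrase come from the description above, while the function names, the "new" status, the use of SHA-256, the tolerance parameter, and the input layout are all assumptions.

```python
import hashlib


def verify_hash(data, expected_hex):
    """Check downloaded bytes against an expected digest (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest() == expected_hex


def classify(candidate, existing_repos, existing_scores, tolerance=1e-9):
    """Decide what to do with one candidate score.

    existing_repos:  set of model repo ids known to exist on the Hub.
    existing_scores: dict mapping (model_repo, dataset_id, task_id) to the value
                     already found on the main branch or in an open PR.
    """
    if candidate["model_repo"] not in existing_repos:
        return "missing_hf_model"
    key = (candidate["model_repo"], candidate["dataset_id"], candidate["task_id"])
    if key not in existing_scores:
        return "new"  # hypothetical label for a result that is safe to propose
    if abs(existing_scores[key] - candidate["value"]) <= tolerance:
        return "already_present"
    return "score_conflict"


def confirmed(typed):
    """Only the exact phrase unlocks PR creation; anything else is a no."""
    return typed.strip() == "OPEN PRS"
```

Checking repo existence first means a missing model never gets reported as a conflict, and requiring the exact phrase rather than a y/n prompt makes an accidental confirmation much less likely.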
Why this matters for evaluators
If you run evals—whether first-party (your own models) or third-party (someone else's)—you can now submit to both systems and get:
Visibility: Your results show up where users compare models, not buried in a GitHub repo or PDF supplement.
Attribution: When you submit through your organization's official Hugging Face account, your results get a verified checkmark on EvalEval—a signal that the numbers come straight from the source.
Interpretability: Anyone clicking through to the EEE record sees the full context: what settings you used, what version of the harness, what the metric definition is, and whether you've shared instance-level outputs.
This is infrastructure, not a feature. But good infrastructure makes the difference between evaluation results people can actually use and evaluation results that just add to the confusion.
What this looks like in practice
Run the converter on an EEE collection:
uv run tools/hf-community-evals/community_evals_converter.py MMLU-Pro \
--datastore evaleval/EEE_datastore@main
Review the previews and the report. When you're ready, type OPEN PRS. The tool opens pull requests to the relevant model repos. Model authors can merge or reject.
Once merged, the score appears on the model card with a source badge linking to the full EEE JSON. Same evaluation, two surfaces: one for discoverability, one for reproducibility.
The bigger picture
Evaluation results are how we measure capabilities, compare models, and reason about deployment decisions. When those results are scattered, incompatible, and missing critical context, everyone loses—users picking models, researchers comparing approaches, and policymakers setting thresholds.
EEE and Community Evals together create a path from "I ran this eval" to "anyone can find, understand, and verify this result." The schema handles the structure, the datastore handles the storage, Hugging Face handles the discovery, and the converter handles the annoying part in between.
If you're running evals, submit to the EEE datastore and use the converter. If you're browsing models, you'll start seeing more results with better provenance. If you're building tools on top of eval data, you now have 229,000 standardized records to work with.
This is the kind of unglamorous infrastructure work that makes everything else possible. It's about time someone built it.