The Terminology Problem
The AI agent field is having a vocabulary crisis. Terms like "harness," "scaffold," and "agent" get thrown around at conferences and in docs, but ask three practitioners what they mean and you'll get four different answers. After ICLR 2026, @ariG23498 captured the confusion perfectly: "What do you mean by the terms 'harness' and 'scaffold' in the context of agents? I have heard a lot of explanations while I was at ICLR, but I could not converge to a single explanation."
HuggingFace's new agent glossary is the clearest attempt yet to fix this. It's not trying to impose one true definition—the authors acknowledge that different frameworks use these words differently. Instead, it offers a practical mental model that makes discussions actually followable.
Here's why this matters: if you're building agents, evaluating them, or just trying to understand why Claude Code feels different from Cursor despite similar underlying models, these distinctions are load-bearing.
The Core Distinction: Harness vs Scaffold
The glossary draws a bright line that resolves most of the confusion:
Scaffold is the behavior-defining layer: system prompt, tool descriptions, output parsing rules, context management. It shapes how the model sees the world and what it can express. Think of it as the instructions and knowledge the model works from.
Harness is the execution layer: it calls the model, handles tool invocations, decides when to stop, manages errors. The harness makes the agent run. The scaffold is what the model works from.
This maps cleanly to the product landscape. Claude Code's own documentation says "Claude Code serves as the agentic harness around Claude." Products like Codex and Antigravity CLI are harnesses. Some are tightly coupled to their provider's models; others let you swap in any LLM.
The confusion arises because many products use "harness" to mean the whole system—scaffold included. That's fine for product marketing. But when you're reasoning about training pipelines, evaluation frameworks, or debugging why an agent keeps failing at step 4, you need the finer distinction.
What an Agent Actually Is
The glossary grounds "agent" in its RL origins: a function that takes observations and returns actions in a loop. The environment processes the action, returns a new observation, repeat.
In the LLM world, that loop is still core, but the term has expanded. An agent is a model plus everything that lets it act, not just respond. The formula the community has settled on: Agent = Model + Harness.
Break it down:
- Model: The LLM itself. Takes text in, produces text out. No memory between calls, no loop. Can express intent to call a tool but can't execute it.
- Harness: The execution loop around the model—calls it, handles tool invocations, decides when to stop.
- Scaffold: Lives inside the harness. Defines the model's instructions, tools, format, and memory.
Two products using the same model can feel completely different because their harnesses make different choices. Swap a better model into the same harness and the experience shifts again. The model, the harness, and the product are three different things.
Why Scaffolding Gets Confusing
The scaffold/harness boundary trips people up because "scaffold" sometimes gets used more broadly. You'll hear it refer to any infrastructure the harness relies on: hooks, runtime config, even directory structure.
But in the core sense—the one that matters for agent design—scaffolding is context engineering at scale. It's what goes into the agent's context window at each step: system prompt, tool descriptions, conversation history, retrieved knowledge. It's not a one-time decision. As the agent runs, previous turns shape what goes into future calls. The harness actively manages this throughout.
The cost of getting scaffolding wrong depends on where you are. At inference, it's just text—change a prompt and redeploy. At training, what the model sees shapes what gets learned. Get it wrong and you're retraining from scratch.
Tool Use, Skills, and Sub-Agents
The glossary clarifies three concepts that often get conflated:
Tool use is how agents reach outside themselves: APIs, code interpreters, databases, web search. The model expresses intent in a structured format. Modern inference APIs surface this as a first-class object—the harness receives the call directly and routes it.
Skills are reusable, structured packages of knowledge that enable multi-step tasks. Where a tool is an action ("run this command"), a skill bundles everything needed to accomplish a goal ("investigate this bug, form a hypothesis, write a fix"). Skills are portable across agents and composable.
Sub-agents are full agent instances that act as tools within a parent agent. The parent delegates a task, the sub-agent runs its own loop with its own harness and scaffold, and returns results. This is architecturally heavier than skills but enables true delegation of complex subtasks.
These aren't just semantic distinctions. They map to different implementation patterns and different failure modes.
The Training Vocabulary
The last section of the glossary covers terms specific to training agent models. If you're building harnesses or deploying agents, you can mostly skip this. If you work on model development, these are load-bearing:
RL Environment: The world the agent interacts with during training. Could be a simulator, a live API, a code execution sandbox. Must be cheap to reset and run in parallel.
Trainer: The RL algorithm that updates model weights based on rollout data—PPO, DPO, whatever you're using.
Rollout: A complete episode from start to terminal state. The harness runs the agent loop, collects observations and actions, and feeds them to the trainer.
Reward: The signal that shapes learning. Can be sparse (only at episode end) or dense (every step). Designing good reward functions for complex agent tasks is still more art than science.
At training time, the harness runs many rollouts in parallel and feeds results back to update the model. At eval time, the same pattern becomes an eval harness—run fixed scenarios at a checkpoint, record metrics instead of updating weights.
Why This Matters Now
Agent systems are moving from research demos to production. That means more people building harnesses, more frameworks with different conventions, more products making different tradeoffs. Without shared vocabulary, every design conversation starts with 20 minutes of "wait, what do you mean by agent?"
The HuggingFace glossary won't end all disagreement—the authors explicitly acknowledge that different communities use these terms differently. But it gives us a practical baseline. When someone says "harness," you can ask: "Do you mean the execution layer or the whole system?" When they say "scaffold," you know to clarify: "The prompts and tools, or the broader infrastructure?"
It's a small fix for a fast-moving field. But small fixes to vocabulary problems have outsized impact. Now we can argue about the actual design choices instead of talking past each other about what the words mean.
The Bigger Picture
Beyond the specific terms, the glossary reveals something about where agent development is headed. The fact that we need to distinguish harness from scaffold, skills from tools, sub-agents from function calls—this points to increasing architectural sophistication.
Early agent systems were monolithic: one model, one prompt, one loop. Modern systems are compositional: multiple reasoning patterns, delegated subtasks, hybrid memory architectures. The vocabulary is catching up to the complexity.
That's a good sign. It means we're past the "AI agent" handwave and into the era of serious engineering tradeoffs. Different harness designs optimize for different things. Scaffold choices shape what behaviors emerge. The model is just one piece.
Get the vocabulary right, and the design conversations get clearer. Get the design right, and agents start shipping in production instead of stalling in demos. This glossary is a step toward both.