The harness versus framework split
Most agent projects start with a week of plumbing. You pick a framework, wire up model clients, build tool adapters, stream state to a UI, and somewhere in there decide what the agent actually does. The interesting part arrives last.
CUGA — IBM's open-source Configurable Generalist Agent harness — inverts that. It handles planning, execution loops, tool calls, and state management so you can focus on the two things that matter: which tools your agent reaches and what you tell it to do.
To prove the point, IBM Research built cuga-apps: two dozen single-file working apps, from movie recommenders to IBM Cloud architecture advisors. Each wraps one CugaAgent in FastAPI. They exist to be read and copied. The live gallery runs on Hugging Face Spaces.
What the harness takes off your plate
The fair question for anything in this space is what it saves you from writing. CUGA's answer: the orchestration around a model that you'd otherwise rebuild every time.
It plans before it acts, executes with a mix of tool calls and generated code (CodeAct), and runs a reflection step that catches bad calls and re-plans instead of barreling ahead. On long tasks that run twenty steps, the thing that breaks most agents is losing track of intermediate results and re-deriving them wrong on the next turn. CUGA holds that state.
That machinery is why it topped agent benchmarks. The post notes #1 on AppWorld from July 2025 through February 2026 and on WebArena from February 2025 through September 2025. Those results aren't tuning wins; they come from the harness carrying load the model would otherwise have to carry.
You also set cost/latency tradeoffs from config rather than code: Fast, Balanced, and Accurate reasoning modes, with code execution in whatever sandbox you trust (local, Docker/Podman, or E2B cloud). Same agent definition, different dial.
The planning, reflection, and variable-tracking let smaller open-weight models hold up where they normally wouldn't. The hosted apps run on gpt-oss-120b rather than a frontier API. That's the bet: a smaller open model is enough when the harness does the work.
One app end to end
The IBM Cloud advisor recommends real IBM Cloud services for a given architecture. The whole thing fits in one file: a main.py with the agent factory, tools, and prompt, plus a small UI.
The entire agent is four arguments:
    import os

    def make_agent():
        # _make_tools, _SYSTEM, and _DIR are module-level names in the same file.
        from cuga import CugaAgent
        from _llm import create_llm
        return CugaAgent(
            model=create_llm(
                provider=os.getenv("LLM_PROVIDER"),
                model=os.getenv("LLM_MODEL"),
            ),
            tools=_make_tools(),
            special_instructions=_SYSTEM,
            cuga_folder=str(_DIR / ".cuga"),
        )
The model comes from a small factory that speaks to OpenAI, Anthropic, watsonx, LiteLLM, or Ollama depending on an environment variable. Nothing in the app code knows which model sits behind it. The cuga_folder is where the app keeps state and policies. The two arguments that carry the app are tools and special_instructions.
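The provider switch can be pictured as a small registry keyed by the environment variable. The sketch below is an assumption about the shape of such a factory, not the contents of the real _llm module: StubLLM and _BUILDERS are hypothetical names, and a real factory would build the matching SDK client where the stub is returned.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class StubLLM:
    """Stand-in for a real chat-model client (hypothetical, for illustration)."""
    provider: str
    model: str


# Hypothetical registry: provider name -> builder. A real factory would
# construct the matching SDK client here instead of returning a StubLLM.
_BUILDERS: Dict[str, Callable[[str], StubLLM]] = {
    name: (lambda model, _name=name: StubLLM(_name, model))
    for name in ("openai", "anthropic", "watsonx", "litellm", "ollama")
}


def create_llm(provider: Optional[str] = None, model: Optional[str] = None) -> StubLLM:
    """Pick a builder by provider name, falling back to the environment."""
    provider = (provider or os.getenv("LLM_PROVIDER") or "").lower()
    model = model or os.getenv("LLM_MODEL") or ""
    if provider not in _BUILDERS:
        raise ValueError(
            f"unknown LLM_PROVIDER {provider!r}; expected one of {sorted(_BUILDERS)}"
        )
    if not model:
        raise ValueError("LLM_MODEL is not set")
    return _BUILDERS[provider](model)
```

Because the caller only ever sees the returned client, swapping providers is a change to the environment, not to the app code.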
The tool split that works
Tools mix local functions with hosted ones:
    def _make_tools():
        # Wrapper inferred from the _make_tools() call in make_agent; the import of the tool decorator is not shown in the source.
        @tool
        def search_ibm_catalog(query: str) -> str:
            """Search the IBM Cloud Global Catalog for real IBM Cloud services.
            Always call this before recommending services to verify they exist."""
            ...  # hits the catalog API, returns JSON

        from _mcp_bridge import load_tools
        web_tools = load_tools(["web"])
        return [search_ibm_catalog, *web_tools]
There's a pattern here: MCP tools for generic, stateless capabilities; inline Python functions for app-specific logic. load_tools(["web"]) pulls in web search without hosting anything. Anything app-specific gets defined inline, like search_ibm_catalog, whose docstring is what the agent reads to decide when to call it.
The cloud advisor's prompt tells the agent to search the catalog before naming any service, recommend three to seven services with each one's role, and never invent service names. That last rule earns its keep: an agent recommending services that don't exist is worse than no agent. The prompt forces every recommendation through a catalog lookup first.
Prompts written as ordered steps with explicit "don't make things up" rules behave. Prompts written as personas wander.
The boring convention that matters
Every inline tool returns the same small envelope. Success looks like {"ok": true, "data": {...}}; failure looks like {"ok": false, "code": "...", "error": "..."}.
It looks like boilerplate. It isn't. CUGA's planner handles declared failures gracefully and chokes on undeclared ones, where a raw stack trace bubbles up mid-plan. Across the apps, the ones that worked reliably were the ones whose tools never threw bare exceptions at the agent.
A boring convention, but it's the difference between an agent that recovers and one that face-plants.
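One way to enforce the envelope is a decorator that every inline tool passes through. This is a minimal sketch, not code from the apps: ToolError, enveloped, and lookup_service are hypothetical names, and the in-memory catalog is illustrative data.

```python
import functools
import json
from typing import Any, Callable


class ToolError(Exception):
    """Declared failure carrying a machine-readable code (hypothetical helper)."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code


def enveloped(fn: Callable[..., Any]) -> Callable[..., str]:
    """Wrap a tool so it always returns the ok/data or ok/code/error envelope as JSON."""

    @functools.wraps(fn)  # keeps the name and docstring the agent reads
    def wrapper(*args: Any, **kwargs: Any) -> str:
        try:
            return json.dumps({"ok": True, "data": fn(*args, **kwargs)})
        except ToolError as exc:
            return json.dumps({"ok": False, "code": exc.code, "error": str(exc)})
        except Exception as exc:  # last resort: never leak a stack trace to the planner
            return json.dumps({"ok": False, "code": "internal_error", "error": str(exc)})

    return wrapper


@enveloped
def lookup_service(name: str) -> dict:
    """Look up a service by name in a tiny in-memory catalog (illustrative data)."""
    catalog = {"example-service": {"kind": "demo"}}
    if name not in catalog:
        raise ToolError("not_found", f"no service named {name!r}")
    return catalog[name]
```

With this in place, a tool body can raise freely; the agent only ever sees a declared success or a declared failure.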
MCP as shared infrastructure
The split only pays off because the generic half is already running somewhere. The capabilities the apps reach for — web search, Wikipedia/arXiv, geocoding and weather, finance quotes — live in 7 public MCP servers (36 tools) hosted on IBM Code Engine, no auth required. A small bridge resolves their URLs automatically, and the live gallery ships an MCP Tool Explorer to call any of them from a form before wiring into an agent.
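The URL resolution step might look like the sketch below. Everything in it is an assumption: the article does not give the real hostnames or describe how the bridge works, so the template uses a placeholder domain, and the MCP_<NAME>_URL and MCP_URL_TEMPLATE variables are hypothetical.

```python
import os
from typing import Mapping, Optional

# Placeholder template; the real Code Engine hostnames are not given in the article.
_DEFAULT_TEMPLATE = "https://mcp-{name}.example.invalid/mcp"


def resolve_mcp_url(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the URL for a named MCP server, preferring an explicit override."""
    env = os.environ if env is None else env
    # A per-server override (hypothetical variable name) wins, e.g. for local development.
    override = env.get(f"MCP_{name.upper()}_URL")
    if override:
        return override.rstrip("/")
    # Otherwise fill the server name into a shared template.
    template = env.get("MCP_URL_TEMPLATE", _DEFAULT_TEMPLATE)
    return template.format(name=name)
```

An override of this kind lets the same app point at a local server during development and at the hosted one otherwise.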
From prototype to governed production
Once you've read the cloud advisor, you've read all of them. They share a skeleton. The movie recommender swaps the IBM catalog tool for the knowledge MCP server. The web researcher leans almost entirely on web search.
The real test is what happens when you need the same agent running governed in production. The post walks through this: the prototype runs standalone, then moves into IBM's Agent Assist SaaS where policy enforcement, audit logging, and multi-agent coordination happen at the platform level.
The agent code doesn't change. The tools, prompt, and CugaAgent constructor are identical. What changes is the deployment target and the policies layered over it. That's the promise of a harness: the orchestration is portable because it's not yours to maintain.
Agent Assist and multi-agent delegation
In production, agents delegate over A2A (agent-to-agent protocol). The post describes a content approval workflow where a writer agent drafts content, a reviewer agent checks it against policy, and an approver agent makes the final call. Each agent runs as its own service with its own tool set. The orchestration is declarative — no hand-rolled message passing.
Guardrails are configured, not coded. The same agent that runs permissively in development runs with input validation, output filtering, and rate limiting in production. The difference is config in the cuga_folder.
Why this matters
The agentic stack is consolidating around two layers: the model (increasingly commoditized) and the orchestration (still fragmented). CUGA's argument is that orchestration should be infrastructure, not framework code you maintain.
The fact that IBM shipped two dozen working apps using the same four-argument constructor suggests the abstraction holds. The interesting work (tool selection, prompt engineering, domain modeling) stays in your control. The boring work (planning loops, state management, error recovery) moves into the harness.
The apps prove it's possible. Whether it's the right abstraction is the question the ecosystem will answer. But watching a movie recommender and a cloud advisor share almost all of their plumbing makes a convincing case that most agent code shouldn't exist.
You can explore the full set of apps, read the source, and run them yourself at the CUGA apps gallery. The code is open, the servers are public, and the examples are meant to be copied.