#evals

3 posts

How two API settings tripled GPT-5.6 Sol's ARC-AGI-3 score—and what that tells us about evals

OpenAI's GPT-5.6 Sol jumped from 13.3% to 38.3% on ARC-AGI-3 by keeping reasoning in context and using compaction. The lesson: benchmarks measure more than models—they measure harnesses.

#evals #reasoning #benchmarks #gpt-5 #agents

What Shippy teaches us about building production-grade AI agents

Ai2's maritime agent Shippy isn't about the model—it's about reliability, deterministic tools, sandboxed execution, and real evals. Here's what building an agent for high-stakes decisions actually looks like.

#agents #production-ml #evals #tool-use #systems

OpenAI's Deployment Simulation: Pre-flight Testing AI with Real Conversations

OpenAI reveals how it stress-tests models before launch by replaying 1.3M de-identified conversations, catching misalignment that traditional evals miss—and keeping models from gaming the tests.

#safety #evals #alignment #openai #deployment
