The Result That Shouldn't Work
Direct Preference Optimization has been a chat alignment tool since Rafailov et al. introduced it in 2023. You train on human judgments (helpful versus harmful, accurate versus misleading) and the model learns to prefer better responses. The technique sits alongside RLHF in the preference-optimization family, designed for subjective tasks where quality is a matter of human preference.
DharmaOCR just used it to fix text degeneration in OCR models. Not chat. Not alignment. Structured document extraction—a task with objective ground truth and zero conversational context. And it worked consistently: 59.4% average reduction in degeneration rates across five model families, with the best case hitting 87.6%.
The insight isn't that DPO is good at alignment. It's that DPO is good at teaching models to avoid specific failure modes when you construct the training signal correctly. For structured generation tasks, that signal comes from the model's own outputs when it breaks.
What Supervised Fine-Tuning Can't Fix
Text degeneration is the failure mode where a model enters a repetition loop and can't escape. The same token or phrase repeats until the sequence hits max length. It's not a decoding bug you can patch with temperature tweaks or repetition penalties; those only manage the symptom. The underlying problem is geometric: it lies in the shape of the model's learned next-token distribution.
When a model assigns high probability to a token, and emitting that token raises its own probability at the next step, you get a self-reinforcing attractor in the distribution. Autoregressive sampling walks into that attractor and stays there. The model isn't choosing to repeat; it's sampling from a distribution that has locally collapsed onto the repeated token.
Supervised fine-tuning trains token-by-token, maximizing likelihood of observed sequences. A repetition loop is never penalized as a completion-level failure; to the loss it is just a sequence of locally probable next-token predictions. SFT moves the model closer to the task domain, but it doesn't attack degeneration directly because degeneration isn't in its objective function.
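A toy calculation makes the attractor concrete. This is a minimal sketch, not a model of any real network: `repeat_prob`, its `base`, `gain`, and `cap` parameters, and every number in it are assumptions chosen for illustration. It compares how likely a repetition run is to survive 50 steps when each repeat leaves the odds unchanged versus when each repeat makes the next one more likely.

```python
def repeat_prob(run_length, base=0.5, gain=0.1, cap=0.999):
    """Toy probability of emitting the same token again after `run_length` repeats.

    Hypothetical shape: the probability grows linearly with the run length and is capped.
    """
    return min(cap, base + gain * run_length)


def prob_still_looping(steps, gain):
    """Probability that a run that has just started survives `steps` more steps."""
    p = 1.0
    for run_length in range(1, steps + 1):
        p *= repeat_prob(run_length, gain=gain)
    return p


flat = prob_still_looping(50, gain=0.0)        # repeating never changes the odds
reinforced = prob_still_looping(50, gain=0.1)  # each repeat makes the next one likelier

print(f"no self-reinforcement:   {flat:.2e}")
print(f"with self-reinforcement: {reinforced:.2f}")
```

Without reinforcement the run dies out almost surely. With it, a large share of runs are still looping after 50 steps, which is the behavior sampling tweaks struggle to undo.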
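Written out, the token-level objective makes the gap visible. This is the standard teacher-forced cross-entropy loss, not a formula quoted from the DharmaOCR paper; here $x$ is the input document, $y^{*}$ is the reference transcription of length $T$, and $\pi_\theta$ is the model.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta\!\left(y^{*}_{t} \mid x,\; y^{*}_{<t}\right)
```

Every term conditions on the reference prefix, so a loop the model generates on its own at inference time never appears anywhere in the sum.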
The DharmaOCR benchmark measured vanilla degeneration rates across open-source vision-language models ranging from under 1% to above 33%. SFT reduced those rates for most families, but rarely to production-acceptable levels. One model family showed the opposite pattern: 0.60% vanilla degeneration rising to 3.23% after SFT, before DPO brought it back down to 1.41%.
The pattern matters structurally. Task capability and degeneration resistance can move independently. SFT isn't a degeneration fix that sometimes works—it's a domain adaptation technique that occasionally has that side effect.
The Design Decision: Degenerate Outputs as Rejection Pairs
DPO requires preference pairs: a chosen output and a rejected output for the same input, with a clear quality difference. In chat alignment, human annotators produce those judgments. For OCR, there are no human preference rankings. You either transcribe correctly or you don't.
The DharmaOCR pipeline found its training signal in the range of outputs the SFT model already produces. Generate multiple candidates per document with the SFT model, score them with an automated judge, and you have a distribution of quality. Correct transcriptions become chosen examples. Degenerate outputs—the ones that entered the repetition loop—become rejected examples.
This is the inversion that makes the technique work for structured tasks. The conventional response when degenerate outputs appear in training data is to filter them out as low-quality noise. DharmaOCR deliberately retained them as the negative signal, because they represent exactly the failure mode the optimization needs to suppress.
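The pairing step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's pipeline: the `Candidate` type, the whitespace-token loop heuristic in `is_degenerate`, the `min_chosen_score` threshold, and the `prompt`/`chosen`/`rejected` field names are all hypothetical choices made here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: float  # judge score; the [0, 1] scale is an assumption


def is_degenerate(text, max_period=8, min_repeats=5):
    """Heuristic: does the output end in a phrase of up to `max_period`
    whitespace tokens repeated at least `min_repeats` times back to back?"""
    tokens = text.split()
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(tokens) < span:
            break
        tail = tokens[-span:]
        if all(tail[i] == tail[i % period] for i in range(span)):
            return True
    return False


def build_pairs(prompt, candidates, min_chosen_score=0.9):
    """Pair the best clean candidate with every degenerate one for the same input."""
    chosen = [c for c in candidates
              if c.score >= min_chosen_score and not is_degenerate(c.text)]
    rejected = [c for c in candidates if is_degenerate(c.text)]
    if not chosen or not rejected:
        return []  # no usable contrast for this document
    best = max(chosen, key=lambda c: c.score)
    return [{"prompt": prompt, "chosen": best.text, "rejected": r.text}
            for r in rejected]


good = Candidate("Invoice 1042 Total: 318.50 EUR", 0.97)
loop = Candidate("Invoice 1042 Total: " + "318.50 " * 40, 0.10)
so_so = Candidate("Invoice 1042 Total: 318.5O EUR", 0.55)
pairs = build_pairs("page-1", [good, loop, so_so])
```

Note that the mediocre candidate is used on neither side: only clear successes and clear loops produce a pair. A real detector would also need to tolerate legitimately repetitive documents, such as tables with repeated values.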
Training Against the Attractor
The paper describes this as "preference-guided implicit unlikelihood." DPO trains not only toward better outputs but explicitly away from a specific failure class. Where SFT maximizes likelihood of high-quality outputs, DPO simultaneously penalizes outputs displaying the degeneration attractor geometry.
The training signal is completion-level rather than token-level. A repetition loop can be labeled as the wrong outcome—not just a sequence of locally probable tokens. This targets the failure mode in a way SFT's objective function cannot.
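To see why the signal is completion-level, it helps to look at the DPO loss for a single pair, as defined in the original DPO paper. The sketch below uses only the standard library; the `beta=0.1` default and the log-probability values are illustrative assumptions, not figures from DharmaOCR.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the log-probability of a whole completion (the sum of its
    token log-probs), so the comparison is made per completion, not per token.
    """
    chosen_shift = policy_chosen - ref_chosen        # movement toward the chosen output
    rejected_shift = policy_rejected - ref_rejected  # movement toward the rejected output
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative numbers only: the loop starts out MORE likely than the clean transcription.
at_start = dpo_loss(-40.0, -35.0, -40.0, -35.0)      # policy equals the reference model
after_update = dpo_loss(-38.0, -55.0, -40.0, -35.0)  # the loop has become far less likely

print(f"{at_start:.3f} -> {after_update:.3f}")
```

The loss falls when the policy makes the entire degenerate completion less likely relative to the reference model, which is the "training away" half of the objective.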
The implementation ran on 23,726 training documents. For each document, the SFT model generated multiple candidate responses. An LLM judge scored each candidate. High-quality transcriptions paired with degenerate outputs from the same input became the DPO training set.
The Results: Consistent Across Five Families
Every model family tested showed a lower degeneration rate after DPO than after SFT alone. No exceptions. The magnitude varied: the best case was Nanonets-OCR2-3B, dropping from 1.61% to 0.20%, and even the worst case still improved. The direction was invariant.
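The headline percentages are relative reductions, which is easy to check against the rates quoted in this article. The helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(before, after):
    """Fraction by which a rate dropped, relative to where it started."""
    return (before - after) / before


# Nanonets-OCR2-3B, using the rates quoted above
best_case = relative_reduction(1.61, 0.20)
print(f"{best_case:.1%}")  # prints 87.6%
```

The same calculation shows why the baseline matters: for the family that went from 0.60% to 3.23% under SFT and then to 1.41% under DPO, the rate fell relative to SFT but remained above the vanilla figure.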
This consistency matters because it suggests the technique generalizes across architectures. The families tested weren't minor variants of a single base model. They represented different training regimes, different scale points, different architectural choices. The pattern held.
The paper also tested the approach beyond OCR, applying it to code generation and mathematical reasoning tasks. Both showed the same structure: a completion-level failure mode (infinite loops in code, circular logic in math) that SFT reduced but didn't eliminate, and DPO providing further suppression by training directly against those failure patterns.
What This Means for Structured Generation
The broader implication isn't "use DPO for everything." It's that preference optimization techniques work on objective tasks when you construct the preference signal from task-specific failure modes rather than human judgments.
Consider any structured generation pipeline where:
- The task has objective correctness criteria
- The model produces characteristic failure modes after SFT
- Those failures can be programmatically identified
If all three hold, you have the ingredients for a DPO stage. The rejected examples are sitting in your inference logs. You're already generating them; you just need to label them and train against them.
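Labeling logged outputs can start with a cheap signal. The sketch below flags outputs that hit the length limit and compress unusually well, since a repetition loop is highly compressible. The log schema (`output` and `finish_reason` fields), the `max_ratio` and `min_chars` thresholds, and the label strings are all assumptions made for illustration; this is a different heuristic from whatever the DharmaOCR judge used.

```python
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size; low values mean highly repetitive text."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / max(len(raw), 1)


def label_log_record(record, max_ratio=0.2, min_chars=200):
    """Flag a logged output as a candidate rejected example.

    `record` is a dict with hypothetical keys "output" and "finish_reason".
    """
    text = record["output"]
    hit_limit = record.get("finish_reason") == "length"
    repetitive = len(text) >= min_chars and compression_ratio(text) <= max_ratio
    return "rejected_candidate" if hit_limit and repetitive else "unlabeled"


looping = {"output": "Total: 318.50 " * 300, "finish_reason": "length"}
normal = {"output": "Invoice 1042, issued 3 March, total 318.50 EUR.", "finish_reason": "stop"}

print(label_log_record(looping), label_log_record(normal))
```

A flagged record is only half of a pair. It still needs a correct transcription of the same input to serve as the chosen side.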
This shifts how you think about model failures in production. A degenerate output isn't just a bad sample to discard—it's a training signal showing you exactly where the model's distribution has pathological geometry. Collect enough of those signals and you can train the model away from its own failure modes.
The Open Questions
The technique works, but several structural questions remain open. Why does SFT have a ceiling on degeneration reduction? The paper's conjecture points to loss granularity—token-level training can't penalize completion-level failures—but that's not proven.
How much does the quality of the rejection examples matter? DharmaOCR used degenerate outputs, which are unambiguous failures. Would the technique work as well with "merely mediocre" outputs as rejections, or does it need clear binary failure to provide useful signal?
And what happens if you skip SFT entirely and go straight to DPO from a base model? The paper's methodology treats DPO as a second stage after task-specific fine-tuning, but the results raise the question of whether that ordering is necessary or just conventional.
The Pattern Beyond Chat
DPO's origin story is chat alignment, but its mechanism is broader: it's a technique for training toward one class of outputs and away from another when you have paired examples. Chat alignment uses human preference judgments to construct those pairs. DharmaOCR uses the model's own failure modes.
The technique isn't limited to text degeneration. Any systematic failure mode you can programmatically identify becomes a candidate for this approach. Code that compiles but hangs. Math proofs that are syntactically valid but logically circular. Structured data extraction that produces well-formed JSON with semantically incorrect fields.
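For the "compiles but hangs" case, programmatic identification can be as simple as a timeout. This is a hedged sketch for Python snippets only: the `hangs` helper and its `timeout_s` default are hypothetical, and model-generated code should run inside a proper sandbox, which this example does not provide.

```python
import subprocess
import sys


def hangs(source, timeout_s=2.0):
    """Return True if running `source` in a fresh interpreter exceeds the timeout."""
    try:
        subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising
        return True
    return False


looping_snippet = "while True:\n    pass"
finishing_snippet = "print(sum(range(10)))"
```

Outputs for which `hangs` returns True would play the role that degenerate transcriptions play in the OCR setting: rejected examples to pair against a candidate that terminates and passes its checks.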
If you can generate it, score it, and pair successes with failures, you can train against it. That's the insight beyond chatbots.