DPO Isn't Just for Chat: Using Your Model's Own Failures as Training Signal
DharmaOCR cut text degeneration by 59% on average using DPO—not for alignment, but by training directly against the repetition loops the model produced after supervised fine-tuning.