What It Means
SFT teaches your AI what to say. DPO teaches it what to prefer. That's a deeper, more durable form of alignment — and every needs_fix correction generates a DPO pair automatically.
DPO (Direct Preference Optimization) is a training technique that teaches models to prefer the kinds of outputs humans approve. Instead of just showing the model 'here's a good example' (that's SFT), DPO shows it a pair: 'here's what you generated (rejected) and here's what the expert wrote instead (preferred).' That comparison teaches the model the difference between its instincts and your standards.

In the AI QA & Evaluation Platform, DPO data comes naturally from needs_fix corrections: the original AI message is the rejected version, and the human-corrected gold-standard rewrite is the preferred version. The preference signal is real, not synthetic. It came from a qualified human reviewer applying your rubric in a production context. That is what makes production DPO data so much more valuable than synthetic preference datasets.
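As a sketch, a needs_fix correction maps straight onto a preference pair. The field names below (prompt, chosen, rejected) follow the common open-source DPO dataset convention and are illustrative, not the platform's actual export schema:

```python
# Illustrative only: turning one review correction into a DPO preference pair.
# Field names follow the common prompt/chosen/rejected convention used by
# open-source trainers; the platform's real schema may differ.
def correction_to_dpo_pair(prompt, ai_message, gold_rewrite):
    return {
        "prompt": prompt,            # the input the model responded to
        "rejected": ai_message,      # original AI output flagged needs_fix
        "chosen": gold_rewrite,      # reviewer's gold-standard rewrite
    }

pair = correction_to_dpo_pair(
    "Summarize the refund policy.",
    "Refunds are always available.",
    "Refunds are available within 30 days of purchase with receipt.",
)
```

No extra labeling step is needed: the reviewer's rewrite already encodes which of the two outputs is preferred.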
Why It Matters
DPO is more fine-grained than SFT alone. SFT says 'produce this.' DPO says 'when you're choosing between outputs like these, prefer the one that looks like this.' It directly reshapes the model's generation tendencies toward your quality standards. And because every needs_fix correction in the AI QA & Evaluation Platform naturally produces a DPO pair, you're generating alignment data as a byproduct of quality review. The training data flywheel spins without extra effort.
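Under the hood, the standard DPO objective makes that 'prefer this one' signal concrete: it rewards the model for assigning relatively more probability to the preferred response than to the rejected one, compared against a frozen reference model. A minimal single-pair sketch (plain Python, not a training loop):

```python
import math

# Minimal sketch of the DPO loss for one preference pair.
# Inputs are summed log-probabilities of each full response under the
# policy being trained and under a frozen reference model; beta controls
# how far the policy may drift from the reference.
def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)): the loss shrinks as the policy learns to
    # prefer the chosen response over the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy already favors the chosen response, the margin is positive and the loss is small; when it favors the rejected one, the loss grows, pushing generation tendencies toward your standards.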
How Bookbag Helps
Automatic pair structuring
Every correction is automatically formatted as a DPO preference pair — original (rejected) vs. gold standard rewrite (preferred). No extra annotation work.
Production-grade provenance
Each pair includes which rubric applied, which reviewer corrected, and when — traceable, real-world preference signals, not synthetic data.
Combined with SFT export
Use DPO pairs alongside SFT data for comprehensive model training. Corrections (SFT) plus preferences (DPO) from the same review workflow.
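The same reviewed correction can feed both datasets. A hypothetical export sketch, using JSONL record shapes common to open-source trainers (not necessarily Bookbag's actual export format):

```python
import json

# Hypothetical export: one needs_fix correction yields both an SFT record
# (train on the fix) and a DPO record (learn the preference). Record shapes
# follow common open-source trainer conventions.
correction = {
    "prompt": "Summarize the refund policy.",
    "original": "Refunds are always available.",
    "rewrite": "Refunds are available within 30 days of purchase with receipt.",
}

sft_record = {
    "prompt": correction["prompt"],
    "completion": correction["rewrite"],     # imitate the gold standard
}
dpo_record = {
    "prompt": correction["prompt"],
    "chosen": correction["rewrite"],         # preferred output
    "rejected": correction["original"],      # output to move away from
}

with open("sft.jsonl", "a") as f:
    f.write(json.dumps(sft_record) + "\n")
with open("dpo.jsonl", "a") as f:
    f.write(json.dumps(dpo_record) + "\n")
```

A common recipe is to run SFT on the corrections first, then DPO on the preference pairs, so the model first learns what good output looks like and then learns to prefer it over its own prior tendencies.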
Join the teams shipping safer AI with real-time evaluation, audit trails, and continuous improvement.