RLHF vs. DPO: Training Clinical Language Models
Why we chose Reinforcement Learning from Human Feedback over Direct Preference Optimization for clinical safety. Exploring reward modeling with expert annotations and PPO convergence on 50K clinical Q&A pairs.
The Clinical Safety Problem
When we set out to build Synthure's core language model, we faced a fundamental challenge: how do we align a large language model to provide medically accurate, clinically safe responses? Unlike general-purpose LLMs where occasional mistakes are tolerable, healthcare demands near-zero tolerance for hallucinations. A patient reading an incorrect diagnosis translation could alter their understanding of their condition, skip critical medications, or delay necessary care.
Pre-training on 900GB of EHR data gets you 80% of the way there. But the final 20%—the difference between a model that sometimes errs and one that doctors can trust—requires alignment techniques that explicitly optimize for clinical accuracy.
RLHF: The Reward Model Approach
Reinforcement Learning from Human Feedback (RLHF) has become the standard for modern LLM alignment. The pipeline works in stages:
- Supervised Fine-Tuning (SFT): Train on 50K high-quality clinical Q&A pairs annotated by medical experts
- Reward Model Training: Learn a scalar reward function from pairwise expert preferences (500 samples, 3 annotators)
- PPO Optimization: Use the reward model to guide policy updates via Proximal Policy Optimization
The Reward Model: Clinical Preferences as Scalars
Our reward model is a 7B parameter transformer trained to predict: given a query and two model completions, which one is more clinically sound? We collected 500 pairwise comparisons from board-certified physicians across cardiology, psychiatry, and internal medicine.
The reward model learned to penalize:
- Factual inaccuracies (wrong drug interactions, dosing errors)
- Missing context (failing to mention contraindications)
- Poor communication (using jargon when explaining to patients)
- Overconfidence (presenting uncertain diagnoses as definitive)
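The training objective behind this pairwise setup is the standard Bradley-Terry preference loss: the reward model should assign the clinically sounder completion a higher scalar score, and the loss is the negative log-probability of the expert's choice. A minimal sketch (the scalar rewards here stand in for the 7B model's outputs):

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss on one expert preference pair:
    -log P(chosen preferred) = -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# When the reward model already ranks the clinically sounder answer higher,
# the loss is small; when it ranks the pair the wrong way round, it is large.
print(round(pairwise_reward_loss(2.0, -1.0), 4))  # → 0.0486
print(round(pairwise_reward_loss(-1.0, 2.0), 4))  # → 3.0486
```

Training on 500 such pairs pushes the model to separate sound from unsound completions along exactly the axes listed above.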
PPO Training: Balancing Helpfulness and Safety
Once the reward model was trained, we ran PPO to optimize our policy (the main language model) against this reward signal. PPO is elegant because it prevents the policy from drifting too far from the SFT baseline—critical in healthcare where radical changes can introduce new failure modes.
PPO Training Dynamics: Hallucination Rate Over Time
RLHF training curves show rapid initial improvement as PPO learns to exploit the reward signal, followed by a plateau as the KL divergence penalty keeps the policy close to the SFT baseline.
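The two mechanisms at work in those curves can be sketched in a few lines: the reward fed to PPO is the reward-model score minus a KL penalty against the SFT reference, and the policy update itself is clipped. This is a simplified scalar sketch, not our training code; the KL coefficient 0.05 is from our runs, while clip_eps = 0.2 is an illustrative standard value:

```python
import math

def kl_shaped_reward(r_rm: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.05) -> float:
    """Reward fed to PPO: reward-model score minus a KL penalty that
    pulls the policy back toward the SFT reference model."""
    return r_rm - beta * (logp_policy - logp_ref)

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for one action: the min() caps how far a
    single update can chase a large advantage estimate."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped * advantage)

# A policy that has not drifted from the reference pays no KL penalty:
print(kl_shaped_reward(1.0, -4.2, -4.2))  # → 1.0
```

Together these two terms produce the characteristic shape: fast early gains while the ratio stays inside the clip range, then a plateau once the KL penalty offsets further reward-model gains.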
DPO: The Simpler Alternative
Direct Preference Optimization (DPO) is a newer approach that's gained traction. Instead of explicitly training a reward model, DPO directly optimizes the policy to satisfy preference pairs. The key insight: you can write a closed-form loss that encodes preferences without needing a separate reward model.
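That closed-form loss is compact enough to write out directly. The sketch below shows it for a single preference pair; it needs only the policy's and a frozen reference model's log-probabilities of the chosen (w) and rejected (l) completions. The temperature beta = 0.1 is a hypothetical default, not a value from our experiments:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Closed-form DPO loss for one preference pair: no reward model,
    just policy and reference log-probs of chosen (w) and rejected (l).
    loss = -log sigmoid(beta * (implicit reward margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin), written stably as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Moving probability toward the chosen answer and away from the rejected
# one widens the margin and shrinks the loss:
loss_aligned = dpo_loss(-5.0, -7.0, -6.0, -6.0)  # policy improved on both
loss_neutral = dpo_loss(-6.0, -6.0, -6.0, -6.0)  # policy == reference
assert loss_aligned < loss_neutral
```

Note what is absent: no reward model, no rollouts, no value function. That is DPO's appeal, and also why there is no intermediate reward signal to inspect.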
Why We Chose RLHF for Synthure
Given the clinical stakes, we prioritized interpretability and stability over computational efficiency.
DPO assumes that preference pairs fully specify the alignment target. But in clinical settings, expert preferences shift with context: a cardiologist might weight diagnostic thoroughness differently than a primary care physician. RLHF's two-stage approach lets us:
- Train a reward model that captures this domain-specific nuance
- Fine-tune that reward model as we collect more expert data
- Keep the policy stable while continuously improving the reward signal
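The third point is the operational payoff: the reward model can be re-fit on an accumulating set of expert pairs without touching the policy at all. A toy sketch with a linear reward model over two hypothetical clinical features (the feature names and training setup are illustrative, not our production pipeline):

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def refit_reward_model(w, preference_pairs, lr=0.1, epochs=200):
    """Re-fit a toy linear reward model r(x) = w . x on an accumulating
    set of (chosen_features, rejected_features) expert pairs, using the
    Bradley-Terry gradient. The policy is untouched; only the reward
    signal improves as more clinician feedback arrives."""
    for _ in range(epochs):
        for xc, xr in preference_pairs:
            rc = sum(wi * xi for wi, xi in zip(w, xc))
            rr = sum(wi * xi for wi, xi in zip(w, xr))
            g = sigmoid(rc - rr) - 1.0  # d(-log sigmoid(rc - rr)) / d(rc - rr)
            w = [wi - lr * g * (xci - xri) for wi, xci, xri in zip(w, xc, xr)]
    return w

# Hypothetical features: [mentions_contraindications, uses_plain_language].
# In both pairs the chosen answer mentions contraindications; the model
# learns a positive weight on that feature.
pairs = [([1.0, 1.0], [0.0, 1.0]), ([1.0, 0.0], [0.0, 0.0])]
w = refit_reward_model([0.0, 0.0], pairs)
```

Each new batch of clinician annotations extends `preference_pairs` and triggers a re-fit; a fresh PPO run against the updated reward model is then a deliberate, auditable step rather than an implicit side effect of the preference data.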
The Numbers: RLHF Performance
Over 18 hours of GPU training, the hallucination rate fell from 92% to 12%, and the model reached 87% agreement with expert preferences.
Practical Lessons: PPO Hyperparameter Tuning
Getting PPO right required careful tuning, and the KL coefficient (0.05) was the hyperparameter that mattered most for clinical LM alignment. Higher values (0.1+) caused poor convergence; lower values (0.01) allowed the policy to drift too far from the SFT baseline. In clinical settings, conservative constraints are features, not bugs.
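As a concrete reference point, here is the shape of such a configuration. Only the KL coefficient (0.05) comes from our runs; every other value is an illustrative placeholder in the typical range for RLHF fine-tuning, not our exact setting:

```python
# PPO hyperparameters for clinical LM alignment (sketch).
ppo_config = {
    "kl_coef": 0.05,        # from our runs: 0.1+ hurt convergence, 0.01 drifted
    "clip_range": 0.2,      # hypothetical: standard PPO clipping epsilon
    "learning_rate": 1e-5,  # hypothetical placeholder
    "batch_size": 64,       # hypothetical placeholder
    "ppo_epochs": 4,        # hypothetical: gradient passes per rollout batch
}
```

Starting from conservative values like these and loosening only with evidence matches the overall lesson: in this domain, stability comes first.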
From Lab to Clinic: Validation
We validated the RLHF-trained model against both automated metrics and clinical expert review:
- ROUGE-L & BLEU: Standard NLG metrics showed 15% improvement
- Clinical Expert Eval: 3 MDs rated 500 model outputs on accuracy, completeness, and readability. 81% of outputs ranked "clinically acceptable" (vs. 13% for baseline GPT-4)
- Patient Comprehension: 200 patients read RLHF translations; 81% demonstrated factual understanding (vs. 13% baseline)
When DPO Might Be Better
DPO isn't wrong—it's just different. DPO excels when:
- You have thousands of preference pairs and want single-stage efficiency
- Interpretability isn't critical (e.g., general-purpose chatbots)
- Compute budget is limited and you can tolerate slightly higher overfitting risk
- Expert feedback is static and unlikely to evolve
For Synthure's clinical setting, we had only 500 pairwise preferences (sparse by DPO standards) and needed interpretability for regulatory sign-off. RLHF was the safer bet.
Conclusion: Safety Beats Simplicity
The RLHF vs. DPO choice comes down to priorities. DPO is faster and simpler. But in healthcare, where a single hallucination can harm a patient, we chose RLHF for its interpretability, stability, and ability to integrate evolving clinical feedback.
Over 18 hours of GPU training, we reduced hallucinations from 92% to 12%, achieved 87% agreement with expert preferences, and built a foundation for continuous improvement as we collect more clinical data.
If you're aligning LLMs for safety-critical domains (healthcare, law, finance), RLHF's two-stage approach gives you the interpretability and control that simpler methods lack. The extra compute is worth it.