RLHF vs. DPO: Training Clinical Language Models

Why we chose Reinforcement Learning from Human Feedback over Direct Preference Optimization for clinical safety. Exploring reward modeling with expert annotations and PPO convergence on 50K clinical Q&A pairs.

The Clinical Safety Problem

When we set out to build Synthure's core language model, we faced a fundamental challenge: how do we align a large language model to provide medically accurate, clinically safe responses? Unlike general-purpose LLMs where occasional mistakes are tolerable, healthcare demands near-zero tolerance for hallucinations. A patient reading an incorrect diagnosis translation could alter their understanding of their condition, skip critical medications, or delay necessary care.

Pre-training on 900GB of EHR data gets you 80% of the way there. But the final 20%—the difference between a model that sometimes errs and one that doctors can trust—requires alignment techniques that explicitly optimize for clinical accuracy.

RLHF: The Reward Model Approach

Reinforcement Learning from Human Feedback (RLHF) has become the standard for modern LLM alignment. The pipeline works in stages:

  1. Supervised Fine-Tuning (SFT): Train on 50K high-quality clinical Q&A pairs annotated by medical experts
  2. Reward Model Training: Learn a scalar reward function from pairwise expert preferences (500 samples, 3 annotators)
  3. PPO Optimization: Use the reward model to guide policy updates via Proximal Policy Optimization
Why RLHF for clinical settings: RLHF lets us directly encode clinical safety as a reward signal. Expert physicians can specify what "correct" looks like—not just in text, but in clinical reasoning, evidence citation, and risk communication. This is easier to define via preferences ("I'd rather the model explain the differential diagnosis step-by-step") than to bake into a loss function.

The Reward Model: Clinical Preferences as Scalars

Our reward model is a 7B parameter transformer trained to predict: given a query and two model completions, which one is more clinically sound? We collected 500 pairwise comparisons from board-certified physicians across cardiology, psychiatry, and internal medicine.
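The training objective for this kind of pairwise comparison is the standard Bradley–Terry preference loss. A minimal sketch of the idea (function names and scores are illustrative, not our actual codebase):

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model scores the expert-preferred completion
    above the rejected one; large when the ranking is inverted."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Agreement with the expert yields a low loss; disagreement a high one.
agree = pairwise_reward_loss(2.0, -1.0)
disagree = pairwise_reward_loss(-1.0, 2.0)
assert agree < disagree
```

Minimizing this loss over all 500 expert comparisons is what turns "which answer would a physician prefer?" into a scalar reward the policy can later be optimized against.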

Reward Model Accuracy
87%
Agreement between learned reward and held-out expert preferences (validation set of 100 comparisons)
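The agreement figure is simply ranking accuracy on held-out comparisons: how often the learned reward scores the expert-preferred completion higher. A toy sketch of the computation (names and scores invented for illustration):

```python
def preference_agreement(score_pairs):
    """Fraction of held-out comparisons where the learned reward ranks
    the expert-preferred completion above the alternative.

    `score_pairs` is a list of (reward_preferred, reward_other) tuples."""
    correct = sum(1 for r_pref, r_other in score_pairs if r_pref > r_other)
    return correct / len(score_pairs)

# Toy validation set: 3 of 4 comparisons rank the preferred answer higher.
toy_scores = [(1.2, 0.3), (0.9, 1.1), (2.0, -0.5), (0.4, 0.1)]
assert preference_agreement(toy_scores) == 0.75
```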

The reward model learned to penalize completions that skipped differential-diagnosis reasoning, omitted evidence citations, or communicated risk poorly — the same clinical features our annotators consistently flagged in their preference judgments.

PPO Training: Balancing Helpfulness and Safety

Once the reward model was trained, we ran PPO to optimize our policy (the main language model) against this reward signal. PPO is elegant because it prevents the policy from drifting too far from the SFT baseline—critical in healthcare where radical changes can introduce new failure modes.
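The mechanism behind that "no drifting too far" property is PPO's clipped surrogate objective. Here is a minimal single-action sketch of the clipping (illustrative, not our training code):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, clip_ratio=0.2):
    """PPO clipped surrogate objective for one action (to be maximized).

    The probability ratio pi_new / pi_old is clipped to
    [1 - clip_ratio, 1 + clip_ratio], so no single gradient step can
    move the policy far from the one that generated the rollout."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_ratio), 1.0 + clip_ratio)
    # Pessimistic bound: take the smaller of the two surrogate values.
    return min(ratio * advantage, clipped_ratio * advantage)

# An over-eager update (ratio ~1.65) is capped at 1.2x the advantage:
assert abs(ppo_clip_objective(0.5, 0.0, 1.0) - 1.2) < 1e-9
```

On top of this per-update clipping, a KL penalty against the SFT baseline (discussed below) bounds the total distance the policy can travel over the whole run.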

PPO Training Dynamics: Hallucination Rate Over Time

[Chart: hallucination rate (y-axis, 0–75%) vs. training steps (x-axis, 10K batches). The rate falls from 92% at the SFT baseline to 12% after PPO, with the KL constraint annotated.]

The RLHF training curves show rapid initial improvement as PPO optimizes against the reward model, then a plateau as the KL divergence penalty keeps the policy close to the SFT baseline.

DPO: The Simpler Alternative

Direct Preference Optimization (DPO) is a newer approach that's gained traction. Instead of explicitly training a reward model, DPO directly optimizes the policy to satisfy preference pairs. The key insight: you can write a closed-form loss that encodes preferences without needing a separate reward model.
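Concretely, the DPO loss for one preference pair looks like the following sketch. The policy's implicit reward for a completion is `beta * (log pi(y|x) - log pi_ref(y|x))`; `beta` and the log-probabilities here are illustrative values:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (w = preferred, l = rejected).

    Same -log sigmoid(margin) shape as a reward model's training loss,
    but the margin is built from policy/reference log-prob ratios, so
    no separate reward network is ever trained."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Before training (policy == reference) the loss is log 2; shifting
# probability mass toward the preferred completion lowers it.
untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)
trained = dpo_loss(-4.0, -6.0, -5.0, -5.0)
assert trained < untrained
```

The appeal is obvious: one optimization stage, no reward model to train or serve. The cost, for us, is that the reward signal only ever exists inside this loss.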

| Aspect | RLHF (Our Choice) | DPO |
| --- | --- | --- |
| Reward Model | Explicit: learn separate 7B model | Implicit: derived from preference loss |
| Interpretability | High: can inspect what reward model learned | Low: reward signal is a mathematical abstraction |
| Clinical Debugging | Can identify which preferences drive errors | Harder to trace failure causes |
| Compute Cost | Higher: train both reward model and policy | Lower: single-stage optimization |
| Stability | More stable with KL constraints | Risk of overfitting to preferences |
| Expert Feedback Scale | Works well at 50K+ examples | May underperform with sparse feedback |

Why We Chose RLHF for Synthure

Given the clinical stakes, we prioritized interpretability and stability over computational efficiency. Here's our reasoning:

Interpretability: In healthcare, we need to understand why the model made a particular choice. When our reward model says a response deserves a high score, we can inspect its learned representations and ask: "What clinical features are driving this preference?" This is invaluable for safety audits and regulatory approval.
Reward Model Interpretability Study
73%
Of high-reward predictions can be traced to specific clinical features (differential diagnosis reasoning, evidence citation, risk communication)

Additionally, DPO assumes that preference pairs fully specify the alignment target. But in clinical settings, we have expert preferences that change with context—a cardiologist might weight diagnostic thoroughness differently than a primary care physician. RLHF's two-stage approach lets us:

  1. Train a reward model that captures this domain-specific nuance
  2. Fine-tune that reward model as we collect more expert data
  3. Keep the policy stable while continuously improving the reward signal

The Numbers: RLHF Performance

Hallucination Reduction
92% → 12%
On 500-sample test set with 3-rater clinical evaluation
Convergence Stability
<5%
Perplexity variance across 3 PPO training runs
Reward Model Agreement
87%
With held-out expert preferences (validation)
Training Efficiency
18 hours
Full PPO pipeline on 8x A100 GPU cluster

Practical Lessons: PPO Hyperparameter Tuning

Getting PPO right required careful tuning. Here are the hyperparameters that worked for clinical LM alignment:

```yaml
# Synthure's PPO Configuration
learning_rate: 5e-6    # Conservative; prevent distribution shift
batch_size: 128        # examples
rollout_len: 512       # tokens
num_ppo_epochs: 4
clip_ratio: 0.2        # Standard Proximal Policy Optimization clipping
kl_coef: 0.05          # Heavy KL penalty; stay close to SFT
value_fn_coef: 1.0
entropy_coef: 0.1      # Small bonus for output diversity
max_grad_norm: 1.0
```

Key takeaway: The KL coefficient (0.05) was critical. Higher values (0.1+) caused poor convergence; lower values (0.01) allowed the policy to drift too far from SFT baseline. In clinical settings, conservative constraints are features, not bugs.
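For context, a common way `kl_coef` enters training (the InstructGPT-style recipe) is as a per-token penalty folded into the reward before PPO ever sees it. This is a sketch of that general pattern, not Synthure's exact implementation:

```python
def kl_shaped_rewards(logp_policy, logp_ref, reward_model_score, kl_coef=0.05):
    """Per-token rewards for PPO with a KL penalty against the SFT reference.

    Each generated token is penalized by kl_coef times the policy/reference
    log-prob gap; the reward model's scalar score is added at the final
    token. Returns one reward per token."""
    rewards = [-kl_coef * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += reward_model_score
    return rewards

# Two-token completion: the policy drifted on token 2 (made it more
# probable than under the reference), so a small penalty is subtracted
# from the reward model's score.
rewards = kl_shaped_rewards([-1.0, -1.5], [-1.0, -2.0], reward_model_score=3.0)
assert rewards == [0.0, 3.0 - 0.05 * 0.5]
```

Raising `kl_coef` scales every one of these penalty terms, which is why small changes to it move the whole convergence behavior.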

From Lab to Clinic: Validation

We validated the RLHF-trained model against both automated metrics and clinical expert review:

⚠️ Gotcha we discovered: The reward model initially overweighted "explaining reasoning" at the expense of brevity. Cardiologists wanted concise risk summaries; the reward model kept generating lengthy differential diagnoses. Solution: added explicit preferences for communication style in the next round of annotations.

When DPO Might Be Better

DPO isn't wrong; it's just different. DPO excels when:

  1. Preference data is plentiful (tens of thousands of pairs, not hundreds)
  2. Compute budget is tight and single-stage optimization matters
  3. Interpretability of the reward signal is not a hard requirement

For Synthure's clinical setting, we had only 500 pairwise preferences (sparse by DPO standards) and needed interpretability for regulatory sign-off. RLHF was the safer bet.

Conclusion: Safety Beats Simplicity

The RLHF vs. DPO choice comes down to priorities. DPO is faster and simpler. But in healthcare, where a single hallucination can harm a patient, we chose RLHF for its interpretability, stability, and ability to integrate evolving clinical feedback.

Over 18 hours of GPU training, we reduced hallucinations from 92% to 12%, achieved 87% agreement with expert preferences, and built a foundation for continuous improvement as we collect more clinical data.

If you're aligning LLMs for safety-critical domains (healthcare, law, finance), RLHF's two-stage approach gives you the interpretability and control that simpler methods lack. The extra compute is worth it.

About this post: This deep dive reflects lessons from training Synthure's 7B clinical language model on 900GB of EHR data and 50K clinical Q&A pairs. Published: February 2026. Next: We'll explore how we extended RLHF with constitutional AI techniques to handle edge cases in psychiatric care.