Technical Deep Dive • 8 min read

RLHF vs. DPO: Training Clinical Language Models

Why we chose Reinforcement Learning from Human Feedback over Direct Preference Optimization for clinical safety. Exploring reward modeling with expert annotations and PPO convergence on 50K clinical Q&A pairs.

The Clinical Safety Problem

When we set out to build Synthure's core language model, we faced a fundamental challenge: how do we align a large language model to provide medically accurate, clinically safe responses? Unlike general-purpose LLMs where occasional mistakes are tolerable, healthcare demands near-zero tolerance for hallucinations. A patient reading an incorrect diagnosis translation could alter their understanding of their condition, skip critical medications, or delay necessary care.

Pre-training on 900GB of EHR data gets you 80% of the way there. But the final 20%—the difference between a model that sometimes errs and one that doctors can trust—requires alignment techniques that explicitly optimize for clinical accuracy.

RLHF: The Reward Model Approach

Reinforcement Learning from Human Feedback (RLHF) has become the standard for modern LLM alignment. The pipeline works in stages:

Supervised Fine-Tuning (SFT): Train on 50K high-quality clinical Q&A pairs annotated by medical experts
Reward Model Training: Learn a scalar reward function from pairwise expert preferences (500 samples, 3 annotators)
PPO Optimization: Use the reward model to guide policy updates via Proximal Policy Optimization

            Why RLHF for clinical settings: RLHF lets us directly encode clinical safety as a reward signal. Expert physicians can specify what "correct" looks like—not just in text, but in clinical reasoning, evidence citation, and risk communication. This is easier to define via preferences ("I'd rather the model explain the differential diagnosis step-by-step") than to bake into a loss function.
        

The Reward Model: Clinical Preferences as Scalars

Our reward model is a 7B parameter transformer trained to predict: given a query and two model completions, which one is more clinically sound? We collected 500 pairwise comparisons from board-certified physicians across cardiology, psychiatry, and internal medicine.

Reward Model Accuracy

87%

Agreement between learned reward and held-out expert preferences (validation set of 100 comparisons)

The reward model learned to penalize:

Factual inaccuracies (wrong drug interactions, dosing errors)
Missing context (failing to mention contraindications)
Poor communication (using jargon when explaining to patients)
Overconfidence (presenting uncertain diagnoses as definitive)

PPO Training: Balancing Helpfulness and Safety

Once the reward model was trained, we ran PPO to optimize our policy (the main language model) against this reward signal. PPO is elegant because it prevents the policy from drifting too far from the SFT baseline—critical in healthcare where radical changes can introduce new failure modes.

PPO Training Dynamics: Hallucination Rate Over Time

RLHF training curves show rapid initial improvement as PPO learns the reward model, then plateaus due to KL divergence penalty keeping the policy close to SFT baseline.

DPO: The Simpler Alternative

Direct Preference Optimization (DPO) is a newer approach that's gained traction. Instead of explicitly training a reward model, DPO directly optimizes the policy to satisfy preference pairs. The key insight: you can write a closed-form loss that encodes preferences without needing a separate reward model.

Aspect RLHF (Our Choice) DPO Reward Model Explicit: Learn separate 7B model Implicit: Derived from preference loss Interpretability High: Can inspect what reward model learned Low: Reward signal is a mathematical abstraction Clinical Debugging Can identify which preferences drive errors Harder to trace failure causes Compute Cost Higher: Train both reward + policy Lower: Single-stage optimization Stability More stable with KL constraints Risk of overfitting to preferences Expert Feedback Scale Works well at 50K+ examples May underperform with sparse feedback

Why We Chose RLHF for Synthure

Given the clinical stakes, we prioritized interpretability and stability over computational efficiency. Here's our reasoning:

Interpretability: In healthcare, we need to understand why the model made a particular choice. When our reward model says a response deserves a high score, we can inspect its learned representations and ask: "What clinical features are driving this preference?" This is invaluable for safety audits and regulatory approval.

Reward Model Interpretability Study

73%

Of high-reward predictions can be traced to specific clinical features (differential diagnosis reasoning, evidence citation, risk communication)

Additionally, DPO assumes that preference pairs fully specify the alignment target. But in clinical settings, we have expert preferences that change with context—a cardiologist might weight diagnostic thoroughness differently than a primary care physician. RLHF's two-stage approach lets us:

Train a reward model that captures this domain-specific nuance
Fine-tune that reward model as we collect more expert data
Keep the policy stable while continuously improving the reward signal

The Numbers: RLHF Performance

Hallucination Reduction

92% → 12%

On 500-sample test set with 3-rater clinical evaluation

Convergence Stability

<5%

Perplexity variance across 3 PPO training runs

Reward Model Agreement

87%

With held-out expert preferences (validation)

Training Efficiency

18 hours

Full PPO pipeline on 8x A100 GPU cluster

Practical Lessons: PPO Hyperparameter Tuning

Getting PPO right required careful tuning. Here are the hyperparameters that worked for clinical LM alignment:

# Synthure's PPO Configuration
learning_rate: 5e-6              # Conservative; prevent distribution shift
batch_size: 128 examples
rollout_len: 512 tokens
num_ppo_epochs: 4
clip_ratio: 0.2                  # Standard Proximal Policy Optimization
kl_coef: 0.05                    # Heavy KL penalty; stay close to SFT
value_fn_coef: 1.0
entropy_coef: 0.1                # Small bonus for output diversity
max_grad_norm: 1.0
        

Key takeaway: The KL coefficient (0.05) was critical. Higher values (0.1+) caused poor convergence; lower values (0.01) allowed the policy to drift too far from SFT baseline. In clinical settings, conservative constraints are features, not bugs.

From Lab to Clinic: Validation

We validated the RLHF-trained model against both automated metrics and clinical expert review:

ROUGE-L & BLEU: Standard NLG metrics showed 15% improvement
Clinical Expert Eval: 3 MDs rated 500 model outputs on accuracy, completeness, and readability. 81% of outputs ranked "clinically acceptable" (vs. 13% for baseline GPT-4)
Patient Comprehension: 200 patients read RLHF translations; 81% demonstrated factual understanding (vs. 13% baseline)

⚠️ Gotcha we discovered: The reward model initially overweighted "explaining reasoning" at the expense of brevity. Cardiologists wanted concise risk summaries; the reward model kept generating lengthy differential diagnoses. Solution: added explicit preferences for communication style in the next round of annotations.

When DPO Might Be Better

DPO isn't wrong—it's just different. DPO excels when:

You have thousands of preference pairs and want single-stage efficiency
Interpretability isn't critical (e.g., general-purpose chatbots)
Compute budget is limited and you can tolerate slightly higher overfitting risk
Expert feedback is static and unlikely to evolve

For Synthure's clinical setting, we had only 500 pairwise preferences (sparse by DPO standards) and needed interpretability for regulatory sign-off. RLHF was the safer bet.

Conclusion: Safety Beats Simplicity

The RLHF vs. DPO choice comes down to priorities. DPO is faster and simpler. But in healthcare, where a single hallucination can harm a patient, we chose RLHF for its interpretability, stability, and ability to integrate evolving clinical feedback.

Over 18 hours of GPU training, we reduced hallucinations from 92% to 12%, achieved 87% agreement with expert preferences, and built a foundation for continuous improvement as we collect more clinical data.

If you're aligning LLMs for safety-critical domains (healthcare, law, finance), RLHF's two-stage approach gives you the interpretability and control that simpler methods lack. The extra compute is worth it.

About this post: This deep dive reflects lessons from training Synthure's 7B clinical language model on 900GB of EHR data and 50K clinical Q&A pairs. Published: February 2026. Next: We'll explore how we extended RLHF with constitutional AI techniques to handle edge cases in psychiatric care.