Building AI Doctors Can Trust

From 92% hallucination rate to 12%. How real-time fact validation, clinical guideline integration, and expert review transformed AI safety in healthcare.

The Trust Problem

Doctors are skeptical of AI. With reason: early LLMs hallucinate. They confidently assert facts that don't exist. In healthcare, hallucination isn't a minor nuisance—it's dangerous. A cardiologist told us: "I can't use this. I can't spend 5 minutes fact-checking every sentence. If I don't trust it implicitly, it's a liability."

That conversation drove our clinical safety roadmap. We set a goal: cut the hallucination rate from our 92% baseline to below 15%, and earn physician trust certification.

Understanding Hallucination

First, how we measure it. We count an output as hallucinated if it contains at least one factually incorrect or unsafe statement.

By that measure, our pre-RLHF model had a 92% hallucination rate: 92% of outputs contained at least one incorrect or unsafe claim. This is typical for general-purpose LLMs fine-tuned on medical data.
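
To keep the number comparable across experiments, the rate is simply the fraction of outputs containing at least one flagged statement. A minimal sketch of the metric (the field names here are illustrative, not our production schema):

from typing import List

def hallucination_rate(outputs: List[dict]) -> float:
    # Fraction of outputs with at least one claim the validator
    # marked as factually incorrect or unsafe.
    if not outputs:
        return 0.0
    flagged = sum(
        1 for out in outputs
        if any(not claim["verified"] for claim in out["claims"])
    )
    return flagged / len(outputs)

# Tiny demo: one bad output out of two -> 0.5
demo = [
    {"claims": [{"verified": False}, {"verified": True}]},
    {"claims": [{"verified": True}]},
]
print(hallucination_rate(demo))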

Strategy 1: Real-Time Fact Validation

Our fact validator agent checks model outputs against three sources in real time:

Clinical Guidelines Database

We indexed UpToDate, clinical practice guidelines from major societies (AHA, ADA, ACP), and FDA drug labels. When the model generates medical claims, we embed them and search for contradictions.

Guideline coverage: 94% of claims in cardiology, psychiatry, and internal medicine are covered by indexed guidelines.
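
Under the hood, the lookup is a nearest-neighbor search over guideline passages. A sketch of that step, assuming an off-the-shelf sentence encoder (the model choice and passages below are illustrative; a separate checker, not shown, decides support versus contradiction):

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

guideline_passages = [
    "Metformin is first-line pharmacologic therapy for type 2 diabetes.",
    "Metformin is contraindicated when eGFR is below 30 mL/min/1.73 m2.",
]
passage_embs = encoder.encode(guideline_passages, normalize_embeddings=True)

def nearest_guidelines(claim: str, top_k: int = 3):
    # Retrieve the guideline passages most relevant to a generated claim;
    # downstream logic compares claim and passage for contradiction.
    claim_emb = encoder.encode([claim], normalize_embeddings=True)
    hits = util.semantic_search(claim_emb, passage_embs, top_k=top_k)[0]
    return [(guideline_passages[h["corpus_id"]], h["score"]) for h in hits]

print(nearest_guidelines("First-line treatment is metformin"))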

Web Verification

For claims not in our database, we query the web (via PubMed, FDA databases, medical news). This catches newly published research or rare drug interactions our static database misses.
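
A minimal version of the PubMed lookup can use NCBI's public E-utilities endpoint (simplified here; the production pipeline also hits FDA databases and medical news, and would need retry and rate-limit handling):

import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_ids_for_claim(claim: str, max_results: int = 5):
    # Search PubMed for literature relevant to a generated claim.
    # A later step fetches the abstracts and checks agreement.
    resp = requests.get(
        ESEARCH,
        params={
            "db": "pubmed",
            "term": claim,
            "retmax": max_results,
            "retmode": "json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

print(pubmed_ids_for_claim("metformin gastrointestinal side effects"))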

Expert Feedback Loop

When the model generates a claim with low confidence (<0.75), or when it contradicts guidelines, we flag it for physician review. Physicians vote on correctness, and we retrain the reward model.

The key insight: We don't need the model to be perfectly accurate. We need it to know what it doesn't know. When confidence is low, we trigger human review. This hybrid approach (AI + human) achieves 98.7% accuracy while keeping human burden reasonable.
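
The routing rule itself is small. A sketch of the triage gate (the 0.75 threshold comes from above; the dataclass and field names are illustrative):

from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.75  # below this, a physician must look

@dataclass
class Claim:
    text: str
    source: Optional[str]
    confidence: float
    contradicts_guideline: bool

def needs_physician_review(claim: Claim) -> bool:
    # The hybrid gate: the model keeps what it is sure about and can source;
    # anything uncertain, unsourced, or contradicted goes to a human.
    return (
        claim.confidence < CONFIDENCE_THRESHOLD
        or claim.contradicts_guideline
        or claim.source is None
    )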

Strategy 2: Structured Fact Checking

Free-text outputs are hard to validate. So we redesigned the model to output structured JSON with explicit claims:

{
  "diagnosis": "Type 2 Diabetes Mellitus",
  "confidence": 0.92,
  "claims": [
    {
      "claim": "First-line treatment is metformin",
      "source": "ADA 2024 Guidelines",
      "confidence": 0.95
    },
    {
      "claim": "Common side effects include gastrointestinal upset",
      "source": "FDA Label",
      "confidence": 0.98
    }
  ],
  "warnings": ["Contraindicated in eGFR < 30"]
}

Each claim is tagged with its source and confidence. We validate each claim independently. If a source is unavailable or confidence is low, we flag it for review.
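
With that schema, validation becomes a loop over discrete claims rather than a pass over free text. A minimal sketch against the schema above (the 0.75 cutoff mirrors the review threshold; error handling omitted):

import json

def claims_needing_review(raw_output: str):
    # Parse the structured output and return the claims to flag:
    # anything missing a source or below the confidence threshold.
    output = json.loads(raw_output)
    flagged = []
    for claim in output.get("claims", []):
        if not claim.get("source") or claim.get("confidence", 0.0) < 0.75:
            flagged.append(claim)
    return flagged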

Results: From 92% to 12% Hallucination

Stage | Hallucination Rate | Notes
Baseline | 92% | Pre-RLHF model, standard inference
After RLHF | 28% | Improved through preference learning
+ Fact validation | 18% | Errors caught and corrected in real time
Final | 12% | Remaining cases flagged for human review at runtime

The remaining 12% are subtle hallucinations that slipped through: overgeneralized statements, contextual misses, or novel drug interactions not yet in our database. These are caught by physician review.

Physician Trust Study

We surveyed 50 physicians with the question: "Would you trust Synthure outputs for use in patient care?" Results:

System | Trust (Agree/Strongly Agree) | Would Use in Clinical Practice
Baseline LLM (92% hallucination) | 8% | 6%
Post-RLHF (28% hallucination) | 31% | 18%
With fact validation (18% hallucination) | 64% | 52%
With review flagging (12% hallucination) | 87% | 81%

The jump from 64% to 87% trust with explicit review flagging shows physicians want transparency. Knowing when the system is uncertain matters more than achieving perfect accuracy.

Lessons Learned

⚠️ Early mistake: We initially pursued 100% accuracy without a human in the loop. After 18 weeks of effort we hit a wall: the last 10% of hallucinations were rare edge cases that would have required massive datasets to eliminate. We pivoted to a hybrid approach in which AI catches 88% and humans review the remaining 12%. This earned trust faster.

Clinical Feedback Is Gold

Our biggest breakthrough came from interviewing physicians on what makes them distrust AI. Top factors:

  1. Overconfidence: Stating uncertain facts as certain (e.g., "This patient definitely has X")
  2. Missing context: Ignoring important caveats or exceptions
  3. Unsourced claims: Statements without visible evidence
  4. Logical leaps: Jumping to conclusions without showing reasoning

We redesigned the model to address these. Instead of "Patient has condition X," we now output: "Most likely diagnosis: X (confidence: 0.82 based on symptoms A, B, C per [source])."
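
Because the prose is rendered from the structured claims themselves, the stated certainty cannot drift from the model's actual confidence. An illustrative formatter (the evidence field here is an assumption, not part of the schema shown earlier):

def render_diagnosis(output: dict) -> str:
    # Turn a structured claim into hedged, sourced prose instead of
    # a bare assertion, so wording tracks model confidence.
    evidence = ", ".join(output["evidence"])
    return (
        f"Most likely diagnosis: {output['diagnosis']} "
        f"(confidence: {output['confidence']:.2f} "
        f"based on {evidence} per [{output['source']}])"
    )

print(render_diagnosis({
    "diagnosis": "Type 2 Diabetes Mellitus",
    "confidence": 0.82,
    "evidence": ["symptom A", "symptom B", "symptom C"],
    "source": "ADA 2024 Guidelines",
}))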

Regulatory Path

The FDA regulates clinical AI as Software as a Medical Device (SaMD), and approval requires demonstrated safety and effectiveness. Our fact validation pipeline and review flagging system provided that evidence.

Conclusion: Trust Enables Adoption

The path to clinical AI adoption isn't perfect accuracy—it's earned trust. By combining RLHF training, real-time fact validation, and human-in-the-loop review, we achieved 87% physician trust and 81% willingness to use in practice.

The remaining gap (13% of doctors still skeptical) reflects healthy caution. Healthcare professionals have high standards for safety. We don't fight that—we embrace it and design systems that respect their expertise.

About this post: Clinical validation conducted across 50 board-certified physicians at Stanford Health, UCSF, and Kaiser Permanente. Fact validation database covers UpToDate, FDA labels, and major clinical guidelines. Published: February 2026.