Building AI Doctors Can Trust
From 92% hallucination rate to 12%. How real-time fact validation, clinical guideline integration, and expert review transformed AI safety in healthcare.
The Trust Problem
Doctors are skeptical of AI, and with good reason: early LLMs hallucinate. They confidently assert facts that don't exist. In healthcare, hallucination isn't a minor nuisance—it's dangerous. A cardiologist told us: "I can't use this. I can't spend 5 minutes fact-checking every sentence. If I don't trust it implicitly, it's a liability."
That conversation drove our clinical safety roadmap. We set a goal: reduce hallucination rate below 15% (from our baseline of 92%) and achieve physician trust certification.
Understanding Hallucination
Hallucination in clinical LLMs takes three forms:
- Factual hallucination: Inventing drug interactions, side effects, or contraindications (e.g., "ACE inhibitors are contraindicated in pregnancy" is a correct claim, while "ACE inhibitors cause liver damage" is a fabricated one)
- Citation hallucination: Referencing studies that don't exist or misquoting evidence
- Reasoning hallucination: Following illogical diagnostic chains (e.g., concluding bipolar disorder from a patient's insomnia without other symptoms)
Our pre-RLHF model had a 92% hallucination rate, meaning 92% of outputs contained at least one factually incorrect or unsafe statement. This is typical for general-purpose LLMs fine-tuned on medical data.
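The metric itself is output-level, not statement-level: one bad claim taints the whole output. A minimal sketch of how such a rate might be computed (the class and field names here are illustrative, not our production schema):

```python
from dataclasses import dataclass, field

@dataclass
class ModelOutput:
    statements: list                            # claims extracted from one output
    incorrect: set = field(default_factory=set)  # indices judged incorrect or unsafe

def hallucination_rate(outputs):
    """Fraction of outputs containing at least one incorrect or unsafe statement."""
    if not outputs:
        return 0.0
    flagged = sum(1 for o in outputs if o.incorrect)
    return flagged / len(outputs)
```

Note that this definition is strict: an output with ten correct claims and one wrong one counts fully against the model, which is why the baseline number looks so high.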
Strategy 1: Real-Time Fact Validation
Our fact validator agent checks model outputs against three sources in real-time:
Clinical Guidelines Database
We indexed UpToDate, clinical practice guidelines from major societies (AHA, ADA, ACP), and FDA drug labels. When the model generates medical claims, we embed them and search for contradictions.
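The retrieval step can be sketched roughly as follows. Here `embed` is a stand-in for whatever sentence encoder indexes the guidelines (the hash-based stub is purely so the example runs); in the real pipeline the retrieved candidates would then be checked for contradiction, not merely returned:

```python
import hashlib

import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real sentence encoder: deterministic pseudo-embedding.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(128)
    return v / np.linalg.norm(v)

def nearest_guidelines(claim: str, guideline_index: list, top_k: int = 3):
    """Return the top_k guideline passages most similar to a generated claim.

    Downstream, each candidate pair (claim, passage) would be scored for
    contradiction, e.g. by an NLI model, before any flag is raised.
    """
    q = embed(claim)
    scored = sorted(((float(q @ embed(g)), g) for g in guideline_index),
                    reverse=True)
    return scored[:top_k]
```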
Web Verification
For claims not in our database, we query the web (via PubMed, FDA databases, medical news). This catches newly published research or rare drug interactions our static database misses.
Expert Feedback Loop
When the model generates a claim with low confidence (<0.75), or when it contradicts guidelines, we flag it for physician review. Physicians vote on correctness, and we retrain the reward model.
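The routing rule itself is simple. A sketch (the 0.75 threshold comes from the text above; the function and label names are ours):

```python
REVIEW_THRESHOLD = 0.75

def route_claim(confidence: float, contradicts_guideline: bool) -> str:
    """Decide whether a generated claim ships directly or goes to physician review."""
    if contradicts_guideline or confidence < REVIEW_THRESHOLD:
        return "physician_review"
    return "auto_approve"
```

Claims routed to review collect physician votes on correctness, and those labels feed back into reward-model retraining.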
Strategy 2: Structured Fact Checking
Free-text outputs are hard to validate. So we redesigned the model to output structured JSON with explicit claims:
"diagnosis": "Type 2 Diabetes Mellitus",
"confidence": 0.92,
"claims": [
{
"claim": "First-line treatment is metformin",
"source": "ADA 2024 Guidelines",
"confidence": 0.95
},
{
"claim": "Common side effects include gastrointestinal upset",
"source": "FDA Label",
"confidence": 0.98
}
],
"warnings": ["Contraindicated in eGFR < 30"]
}
Each claim is tagged with its source and confidence. We validate each claim independently. If a source is unavailable or confidence is low, we flag it for review.
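With that structure, validation becomes a loop over the claims array. A minimal sketch, where `source_available` is a hypothetical lookup into the guidelines index:

```python
def validate_output(output: dict, source_available, threshold: float = 0.75) -> list:
    """Validate each structured claim independently; flag a claim when its
    source is unavailable or its confidence falls below the review threshold."""
    results = []
    for claim in output.get("claims", []):
        ok_source = source_available(claim.get("source", ""))
        low_conf = claim.get("confidence", 0.0) < threshold
        results.append({
            "claim": claim["claim"],
            "flag_for_review": (not ok_source) or low_conf,
        })
    return results
```

Because each claim carries its own source and confidence, a single weak claim can be flagged without discarding the rest of the output.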
Results: From 92% to 12% Hallucination
Our combined pipeline cut the hallucination rate from 92% to 12%. The remaining 12% are subtle hallucinations that slip through automated validation: overgeneralized statements, contextual misses, or novel drug interactions not yet in our database. These are caught by physician review.
Physician Trust Study
We surveyed 50 physicians, asking: "Would you trust Synthure outputs to use in patient care?" Results:
| System | Trust (Agree/Strongly Agree) | Would Use in Clinical Practice |
|---|---|---|
| Baseline LLM (92% hallucination) | 8% | 6% |
| Post-RLHF (28% hallucination) | 31% | 18% |
| With fact validation (18%) | 64% | 52% |
| With review flagging (12%) | 87% | 81% |
The jump from 64% to 87% trust with explicit review flagging shows physicians want transparency. Knowing when the system is uncertain matters more than achieving perfect accuracy.
Lessons Learned
Clinical Feedback Is Gold
Our biggest breakthrough came from interviewing physicians about what makes them distrust AI. The top factors:
- Overconfidence: Stating uncertain facts as certain (e.g., "This patient definitely has X")
- Missing context: Ignoring important caveats or exceptions
- Unsourced claims: Statements without visible evidence
- Logical leaps: Jumping to conclusions without showing reasoning
We redesigned the model to address these. Instead of "Patient has condition X," we now output: "Most likely diagnosis: X (confidence: 0.82 based on symptoms A, B, C per [source])."
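That output contract can be enforced with a small formatter. A sketch (the function name and the example findings are illustrative, not from our production system):

```python
def format_diagnosis(diagnosis: str, confidence: float,
                     findings: list, source: str) -> str:
    """Render a hedged, sourced diagnostic statement instead of a bare assertion."""
    evidence = ", ".join(findings)
    return (f"Most likely diagnosis: {diagnosis} "
            f"(confidence: {confidence:.2f} based on {evidence} per [{source}])")
```

Forcing every diagnostic statement through a template like this makes overconfidence and unsourced claims structurally impossible rather than merely discouraged.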
Regulatory Path
The FDA regulates clinical AI as "Software as a Medical Device" (SaMD). Approval requires demonstrating safety and effectiveness, and our fact validation pipeline and review flagging system provided the evidence:
- Clinical validation on 500 cases with 3-rater agreement (87%)
- Failure mode analysis: which outputs fall into the remaining 12% and why
- Risk mitigation: physician review catches them before patient exposure
- Post-market surveillance: tracking real-world hallucinations to retrain
Conclusion: Trust Enables Adoption
The path to clinical AI adoption isn't perfect accuracy—it's earned trust. By combining RLHF training, real-time fact validation, and human-in-the-loop review, we achieved 87% physician trust and 81% willingness to use in practice.
The remaining gap (13% of doctors still skeptical) reflects healthy caution. Healthcare professionals have high standards for safety. We don't fight that—we embrace it and design systems that respect their expertise.