Building AI Doctors Can Trust
From 92% hallucination rate to 12%. How real-time fact validation, clinical guideline integration, and expert review transformed AI safety in healthcare.
The Trust Problem
Doctors are skeptical of AI, and with good reason: early LLMs hallucinate. They confidently assert facts that don't exist. In healthcare, hallucination isn't a minor nuisance—it's dangerous. A cardiologist told us: "I can't use this. I can't spend 5 minutes fact-checking every sentence. If I don't trust it implicitly, it's a liability."
That conversation drove our clinical safety roadmap. We set a goal: reduce hallucination rate below 15% (from our baseline of 92%) and achieve physician trust certification.
Understanding Hallucination
Hallucination in clinical LLMs takes three forms:
- Factual hallucination: Inventing drug interactions, side effects, or contraindications (e.g., "ACE inhibitors are contraindicated in pregnancy" is a correct claim, while "ACE inhibitors cause liver damage" is a fabricated one)
- Citation hallucination: Referencing studies that don't exist or misquoting evidence
- Reasoning hallucination: Following illogical diagnostic chains (e.g., concluding bipolar disorder from a patient's insomnia without other symptoms)
Our pre-RLHF model had a 92% hallucination rate, meaning 92% of outputs contained at least one factually incorrect or unsafe statement. This is typical for general-purpose LLMs fine-tuned on medical data.
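The metric itself is output-level, not statement-level: one bad claim taints the whole output. A minimal sketch of how such a rate might be computed (the class and field names here are illustrative, not our production schema):

```python
from dataclasses import dataclass, field

@dataclass
class ModelOutput:
    statements: list                            # claims extracted from one output
    incorrect: set = field(default_factory=set)  # indices judged incorrect or unsafe

def hallucination_rate(outputs):
    """Fraction of outputs containing at least one incorrect or unsafe statement."""
    if not outputs:
        return 0.0
    flagged = sum(1 for o in outputs if o.incorrect)
    return flagged / len(outputs)
```

Note that this definition is strict: an output with ten correct claims and one wrong one counts fully against the model, which is why the baseline number looks so high.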
Strategy 1: Real-Time Fact Validation
Our fact validator agent checks model outputs against three sources in real-time:
Clinical Guidelines Database
We indexed UpToDate, clinical practice guidelines from major societies (AHA, ADA, ACP), and FDA drug labels. When the model generates medical claims, we embed them and search for contradictions.
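The retrieval step can be sketched roughly as follows. Here `embed` is a stand-in for whatever sentence encoder indexes the guidelines (the hash-based stub is purely so the example runs); in the real pipeline the retrieved candidates would then be checked for contradiction, not merely returned:

```python
import hashlib

import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real sentence encoder: deterministic pseudo-embedding.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(128)
    return v / np.linalg.norm(v)

def nearest_guidelines(claim: str, guideline_index: list, top_k: int = 3):
    """Return the top_k guideline passages most similar to a generated claim.

    Downstream, each candidate pair (claim, passage) would be scored for
    contradiction, e.g. by an NLI model, before any flag is raised.
    """
    q = embed(claim)
    scored = sorted(((float(q @ embed(g)), g) for g in guideline_index),
                    reverse=True)
    return scored[:top_k]
```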
Web Verification
For claims not in our database, we query the web (via PubMed, FDA databases, medical news). This catches newly published research or rare drug interactions our static database misses.
Expert Feedback Loop
When the model generates a claim with low confidence (<0.75), or when it contradicts guidelines, we flag it for physician review. Physicians vote on correctness, and we retrain the reward model.
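The routing rule itself is simple. A sketch (the 0.75 threshold comes from the text above; the function and label names are ours):

```python
REVIEW_THRESHOLD = 0.75

def route_claim(confidence: float, contradicts_guideline: bool) -> str:
    """Decide whether a generated claim ships directly or goes to physician review."""
    if contradicts_guideline or confidence < REVIEW_THRESHOLD:
        return "physician_review"
    return "auto_approve"
```

Claims routed to review collect physician votes on correctness, and those labels feed back into reward-model retraining.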
Strategy 2: Structured Fact Checking
Free-text outputs are hard to validate. So we redesigned the model to output structured JSON with explicit claims:
"diagnosis": "Type 2 Diabetes Mellitus",
"confidence": 0.92,
"claims": [
{
"claim": "First-line treatment is metformin",
"source": "ADA 2024 Guidelines",
"confidence": 0.95
},
{
"claim": "Common side effects include gastrointestinal upset",
"source": "FDA Label",
"confidence": 0.98
}
],
"warnings": ["Contraindicated in eGFR < 30"]
}
Each claim is tagged with its source and confidence. We validate each claim independently. If a source is unavailable or confidence is low, we flag it for review.
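With that structure, validation becomes a loop over the claims array. A minimal sketch, where `source_available` is a hypothetical lookup into the guidelines index:

```python
def validate_output(output: dict, source_available, threshold: float = 0.75) -> list:
    """Validate each structured claim independently; flag a claim when its
    source is unavailable or its confidence falls below the review threshold."""
    results = []
    for claim in output.get("claims", []):
        ok_source = source_available(claim.get("source", ""))
        low_conf = claim.get("confidence", 0.0) < threshold
        results.append({
            "claim": claim["claim"],
            "flag_for_review": (not ok_source) or low_conf,
        })
    return results
```

Because each claim carries its own source and confidence, a single weak claim can be flagged without discarding the rest of the output.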
Results: From 92% to 12% Hallucination
Our combined pipeline cut the hallucination rate from 92% to 12%. The remaining 12% are subtle hallucinations that slip through automated validation: overgeneralized statements, contextual misses, or novel drug interactions not yet in our database. These are caught by physician review.
Physician Trust Study
We surveyed 50 physicians, asking: "Would you trust Synthure outputs to use in patient care?" Results:
| System | Trust (Agree/Strongly Agree) | Would Use in Clinical Practice |
|---|---|---|
| Baseline LLM (92% hallucination) | 8% | 6% |
| Post-RLHF (28% hallucination) | 31% | 18% |
| With fact validation (18%) | 64% | 52% |
| With review flagging (12%) | 87% | 81% |
The jump from 64% to 87% trust with explicit review flagging shows physicians want transparency. Knowing when the system is uncertain matters more than achieving perfect accuracy.
Lessons Learned
Clinical Feedback Is Gold
Our biggest breakthrough came from interviewing physicians about what makes them distrust AI. The top factors:
- Overconfidence: Stating uncertain facts as certain (e.g., "This patient definitely has X")
- Missing context: Ignoring important caveats or exceptions
- Unsourced claims: Statements without visible evidence
- Logical leaps: Jumping to conclusions without showing reasoning
We redesigned the model to address these. Instead of "Patient has condition X," we now output: "Most likely diagnosis: X (confidence: 0.82 based on symptoms A, B, C per [source])."
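That output contract can be enforced with a small formatter. A sketch (the function name and the example findings are illustrative, not from our production system):

```python
def format_diagnosis(diagnosis: str, confidence: float,
                     findings: list, source: str) -> str:
    """Render a hedged, sourced diagnostic statement instead of a bare assertion."""
    evidence = ", ".join(findings)
    return (f"Most likely diagnosis: {diagnosis} "
            f"(confidence: {confidence:.2f} based on {evidence} per [{source}])")
```

Forcing every diagnostic statement through a template like this makes overconfidence and unsourced claims structurally impossible rather than merely discouraged.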
Regulatory Path
The FDA regulates clinical AI as "Software as a Medical Device" (SaMD). Approval requires demonstrating safety and effectiveness, and our fact validation pipeline and review flagging system provided the evidence:
- Clinical validation on 500 cases with 3-rater agreement (87%)
- Failure mode analysis: which outputs fall into the remaining 12% and why
- Risk mitigation: physician review catches them before patient exposure
- Post-market surveillance: tracking real-world hallucinations to retrain
Conclusion: Trust Enables Adoption
The path to clinical AI adoption isn't perfect accuracy—it's earned trust. By combining RLHF training, real-time fact validation, and human-in-the-loop review, we achieved 87% physician trust and 81% willingness to use in practice.
The remaining gap (13% of doctors still skeptical) reflects healthy caution. Healthcare professionals have high standards for safety. We don't fight that—we embrace it and design systems that respect their expertise.