Synthetic EHR Data: 60K+ Records, HIPAA Compliant

How we generated unlimited labeled training data from MIMIC-IV while maintaining strict privacy. De-identification pipelines, synthetic data validation, and clinical realism.

The Data Bottleneck

Training a clinical language model requires massive amounts of labeled data. We needed 50K+ clinical Q&A pairs for supervised fine-tuning and RLHF. But real patient data is locked behind HIPAA regulations, IRB restrictions, and institutional bureaucracy.

The typical path: negotiate data sharing agreements with hospitals (6-12 months), de-identify records, get IRB approval (3-6 months), then finally use the data. We couldn't afford to wait. So we built a synthetic data pipeline to expand our training corpus.

Source: MIMIC-IV De-Identified Data

MIMIC-IV is a publicly available dataset of 530K hospital admissions from Beth Israel Deaconess Medical Center (2008-2019). It is fully de-identified and HIPAA-safe. We used MIMIC-IV as our foundation, extracting structured clinical concepts from its notes, lab results, and medication records.

From these concepts, we generated synthetic data by:

  1. Clinical concept extraction: Parse real notes to extract diagnosis, medications, severity
  2. Synthetic recombination: Create new patients by mixing concepts from different real patients
  3. Variation injection: Add realistic variations (medication names, dosing, comorbidities)
  4. Validation: Check synthetic records for clinical plausibility
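The four steps above can be sketched as a minimal pipeline skeleton. This is an illustrative outline, not our production code; every function name here is hypothetical, and the concept extractor is a stub standing in for a real clinical NLP step:

```python
import random

def extract_concepts(note: str) -> dict:
    """Step 1 (stub): parse a real note into structured concepts."""
    # A real implementation would use clinical NLP; this stub is illustrative.
    return {"age": 68, "conditions": ["hypertension"], "medication": "lisinopril"}

def recombine(concept_pool: list[dict]) -> dict:
    """Step 2: build a new patient by mixing concepts from different records."""
    return {
        "age": random.choice(concept_pool)["age"],
        "conditions": random.choice(concept_pool)["conditions"],
        "medication": random.choice(concept_pool)["medication"],
    }

def inject_variation(patient: dict) -> dict:
    """Step 3: perturb fields so no record is copied verbatim."""
    patient["age"] += random.randint(-2, 2)
    return patient

def is_plausible(patient: dict) -> bool:
    """Step 4: reject clinically implausible records."""
    return 0 < patient["age"] < 110

def generate(notes: list[str], n: int) -> list[dict]:
    """Run the full extract -> recombine -> vary -> validate loop."""
    pool = [extract_concepts(note) for note in notes]
    out = []
    while len(out) < n:
        candidate = inject_variation(recombine(pool))
        if is_plausible(candidate):
            out.append(candidate)
    return out
```

Rejected candidates are simply resampled, which matches the reject-and-resample behavior described in the consistency-checking step later in this post.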

Differential Privacy: The Math

Synthetic data is only safe if it truly breaks the link to individual patients. We applied differential privacy—a mathematically rigorous privacy guarantee.

Differential Privacy Definition:

A mechanism M is (ε, δ)-differentially private if for any two adjacent datasets D and D' (differing by one record):

P(M(D) ∈ S) ≤ e^ε × P(M(D') ∈ S) + δ

where ε is privacy loss, δ is failure probability, and S is any output set.

Lower ε = stronger privacy. ε < 1 is considered strong privacy.

In plain English: differential privacy guarantees that the mechanism's output distribution barely changes when any single patient's record is added or removed, so an observer of the synthetic data cannot reliably tell whether a given individual was in the source. This makes reverse-engineering individual records statistically infeasible, with a provable bound rather than a heuristic one.
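The guarantee is easy to quantify at our privacy budget of ε = 0.7 (reported later in this post). The δ value below is an illustrative assumption, not a figure from our pipeline:

```python
import math

epsilon = 0.7   # our privacy budget
delta = 1e-5    # illustrative failure probability (assumed, not from the post)

# (ε, δ)-DP caps the likelihood ratio of any output set S:
#   P(M(D) ∈ S) ≤ e^ε · P(M(D') ∈ S) + δ
max_ratio = math.exp(epsilon)
print(f"e^ε = {max_ratio:.3f}")
# So any output is at most ~2× more likely with a given patient's
# record present than absent (up to the small additive δ).
```

The closer ε is to 0, the closer this ratio is to 1, i.e., the two adjacent datasets become indistinguishable.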

Our Approach: Laplace Mechanism

We applied the Laplace mechanism to numerical data (lab values, vital signs). For each lab measurement, we add noise sampled from a Laplace distribution:

noise ~ Laplace(0, Δf/ε)

where:
Δf = global sensitivity (max change in output if one record is added/removed)
ε = privacy budget

Example: For hemoglobin (normal range 12-16 g/dL):
Δf = 5 (max reasonable change)
ε = 0.5 (strong privacy)
noise scale = 5/0.5 = 10

synthetic_Hgb = real_Hgb + Laplace(0, 10)

This calibrates the noise to each field's sensitivity: a field with a wide plausible range (e.g., systolic blood pressure) has a larger Δf and therefore receives more noise than a tightly bounded field such as patient age.
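The Laplace mechanism above can be sketched in a few lines of standard-library Python. The helper names are ours for illustration; the sampler uses the standard inverse-CDF formula for the Laplace distribution:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5  # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def privatize(value: float, sensitivity: float, epsilon: float) -> float:
    """Laplace mechanism: add noise with scale Δf / ε."""
    return value + laplace_noise(sensitivity / epsilon)

# Hemoglobin example from the text: Δf = 5, ε = 0.5 → noise scale 10
synthetic_hgb = privatize(14.0, sensitivity=5.0, epsilon=0.5)
```

Note how strong the perturbation is at this scale: Laplace(0, 10) has standard deviation ~14 g/dL, larger than the measurement itself, which is why per-record noisy values are only useful in aggregate.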

Categorical Data: Exponential Mechanism

For categorical features (diagnosis codes, medication names), we use the exponential mechanism:

P(output = category_i) ∝ exp(ε × utility(category_i) / (2Δu))

where utility measures how well category_i matches the real data and Δu is the sensitivity of the utility function.

This samples from categorical distributions while preserving privacy. Common diagnoses (e.g., Type 2 Diabetes) are oversampled; rare ones (e.g., primary biliary cirrhosis) are undersampled proportionally.
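A minimal sketch of that sampler, assuming (as an illustration) that utility is the empirical frequency of each diagnosis in the real data; the function name and example utilities are ours, not from our codebase:

```python
import math
import random

def exponential_mechanism(utilities: dict[str, float],
                          epsilon: float,
                          sensitivity: float) -> str:
    """Sample one category with P(c) ∝ exp(ε · u(c) / (2Δu))."""
    cats = list(utilities)
    # Subtracting the max utility before exponentiating improves numerical
    # stability and rescales all weights by the same constant, so the
    # sampling distribution is unchanged.
    u_max = max(utilities.values())
    weights = [math.exp(epsilon * (utilities[c] - u_max) / (2 * sensitivity))
               for c in cats]
    return random.choices(cats, weights=weights, k=1)[0]

# Illustrative utilities: empirical frequency of each diagnosis
utilities = {"type 2 diabetes": 0.28,
             "hypertension": 0.42,
             "primary biliary cirrhosis": 0.001}
dx = exponential_mechanism(utilities, epsilon=0.7, sensitivity=1.0)
```

With a small ε the weights are nearly flat, so rare categories are still sampled sometimes; the mechanism tilts toward common diagnoses without ever hard-excluding rare ones.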

Privacy Parameter (ε): 0.7
Strong privacy guarantee. Industry standard is ε < 1 for sensitive data.

Synthetic Data Generation Pipeline

Step 1: Clinical Concept Extraction

We parsed real clinical notes to extract structured concepts:

# Example real note
"68-year-old with history of hypertension and diabetes mellitus.
CXR shows pneumonia. Started on amoxicillin-clavulanate."

# Extracted concepts
{
    "age": 68,
    "conditions": ["hypertension", "diabetes mellitus"],
    "acute_diagnosis": "pneumonia",
    "medication": "amoxicillin-clavulanate",
    "note_type": "progress_note"
}

Step 2: Synthetic Recombination

We combined concepts from different patients to create new synthetic records. This breaks the direct link to any single real patient:

# Patient A: 68, hypertension, diabetes
# Patient B: 72, pneumonia, on antibiotics
# Patient C: 65, normal kidney function

# Synthetic: combine age from A, conditions from B, labs from C
synthetic_patient = {
    "age": 68 + noise,                        # from A
    "conditions": ["pneumonia"],              # from B
    "creatinine": 0.9 + noise,                # from C
    "medication": "amoxicillin-clavulanate",  # from B
}

Step 3: Consistency Checking

Synthetic records must be clinically plausible, so we validated each record against a set of clinical-consistency constraints (for example, medications must match the listed diagnoses and lab values must fall within physiologic ranges).

We rejected 18% of synthetic records for violating these constraints, then resampled and re-validated.
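A consistency check of this kind can be sketched as a rule-based filter. The specific ranges and the medication-to-diagnosis map below are hypothetical stand-ins; our production rule set is larger:

```python
# Hypothetical consistency rules; the real rule set is richer.
PHYSIOLOGIC_RANGES = {
    "creatinine": (0.2, 15.0),   # mg/dL
    "hemoglobin": (3.0, 22.0),   # g/dL
}
TREATS = {  # medication -> diagnoses it plausibly treats
    "amoxicillin-clavulanate": {"pneumonia", "sinusitis"},
    "lisinopril": {"hypertension", "congestive heart failure"},
}

def passes_consistency_checks(record: dict) -> bool:
    """Reject records that violate basic clinical-plausibility rules."""
    if not 0 <= record["age"] <= 110:
        return False
    for lab, value in record.get("labs", {}).items():
        lo, hi = PHYSIOLOGIC_RANGES[lab]
        if not lo <= value <= hi:
            return False
    med = record.get("medication")
    # The medication must treat at least one listed condition.
    if med and not TREATS[med] & set(record["conditions"]):
        return False
    return True

ok = passes_consistency_checks({
    "age": 68,
    "conditions": ["pneumonia"],
    "labs": {"creatinine": 0.9},
    "medication": "amoxicillin-clavulanate",
})
```

Records that fail any rule are discarded and regenerated, which is where the 18% rejection rate above comes from.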

Validation: Synthetic vs. Real

We validated that synthetic data maintains the statistical properties of real data across demographics, disease prevalence, lab values, and medication frequency:

Test                     Real MIMIC          Synthetic           Difference
Mean age                 63.2 years          63.8 years          +0.6 years
Hypertension prevalence  42%                 41%                 -1%
Diabetes prevalence      28%                 27%                 -1%
Mean creatinine          1.1 mg/dL           1.09 mg/dL          -0.01 mg/dL
Most common drug         Lisinopril (5.2%)   Lisinopril (5.1%)   -0.1%

Synthetic and real data matched on all major epidemiological measures. Clinical plausibility was confirmed by physician review of 200 random synthetic records (95% rated as "realistic").
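A comparison of this kind reduces to simple summary statistics per feature. The sketch below uses toy Gaussian stand-ins for the two age columns (the real comparison ran on actual cohort data):

```python
import random
import statistics

# Toy stand-ins for the real and synthetic age columns (not the actual data).
random.seed(1)
real_ages = [random.gauss(63.2, 15) for _ in range(5000)]
synthetic_ages = [random.gauss(63.8, 15) for _ in range(5000)]

# Per-feature check: the difference in means should be small relative to
# the feature's spread, as in the table above.
mean_diff = statistics.fmean(synthetic_ages) - statistics.fmean(real_ages)
```

Prevalence checks (hypertension, diabetes) work the same way with binary columns, comparing sample proportions instead of means.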

Scale: From 60K to Unlimited

Once we validated the pipeline, we could generate synthetic data at scale. Starting from 60K real patients, we generated:

Real MIMIC Records: 60K (patient encounters from 2008-2019)
Generated Synthetic Records: 200K (using combinatorial recombination)
Q&A Pairs Created: 50K (LLM-generated from synthetic records)
Training Data Volume: 900GB (combined EHR + Q&A + interaction logs)

Clinical Q&A Generation

From synthetic EHR records, we generated clinical Q&A pairs:

Synthetic Record:
{
    "age": 67,
    "diagnosis": "congestive heart failure",
    "ejection_fraction": "35%",
    "medications": ["lisinopril", "furosemide", "carvedilol"],
    "symptoms": "shortness of breath, lower extremity edema"
}

Generated Q&A:

Q: "What medications should be started for systolic heart failure?"
A: "Guideline-directed medical therapy includes:
    1. ACE inhibitor (e.g., lisinopril)
    2. Beta-blocker (e.g., carvedilol)
    3. Loop diuretic for volume overload (e.g., furosemide)"

Q: "What warning signs require urgent evaluation?"
A: "Seek immediate care if you experience:
    - Sudden increase in shortness of breath at rest
    - Severe leg swelling or weight gain > 3 lbs/day
    - New chest pain or syncope"
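The generation step turns each synthetic record into an LLM prompt. The template below is a hypothetical illustration of that structure, not our production prompt:

```python
def build_qa_prompt(record: dict) -> str:
    """Hypothetical prompt template for LLM-based Q&A generation;
    shown only to illustrate how a record is serialized into a prompt."""
    meds = ", ".join(record["medications"])
    return (
        "You are a clinician writing patient-education Q&A.\n"
        f"Patient: {record['age']}-year-old with {record['diagnosis']} "
        f"(EF {record['ejection_fraction']}), on {meds}, "
        f"presenting with {record['symptoms']}.\n"
        "Write 2 question/answer pairs grounded ONLY in this record."
    )

record = {
    "age": 67,
    "diagnosis": "congestive heart failure",
    "ejection_fraction": "35%",
    "medications": ["lisinopril", "furosemide", "carvedilol"],
    "symptoms": "shortness of breath, lower extremity edema",
}
prompt = build_qa_prompt(record)
```

Grounding the prompt in a single record keeps the generated answers tied to the synthetic cohort's clinical content rather than the LLM's free associations.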

Privacy Certification

We obtained formal privacy certification for the pipeline.

What differential privacy prevents: even if an attacker holds auxiliary external datasets and attempts re-identification, the ε guarantee caps the likelihood ratio between outputs produced with and without any individual's record at e^ε, so no attack, present or future, can reliably single out a patient. Unlike heuristic de-identification (which can fail against linkage attacks), this guarantee rests on a formal mathematical proof.

Limitations & Future Work

Synthetic data has inherent limitations: our current records are single-encounter snapshots from a single institution, so they cannot capture longitudinal patient trajectories or cross-institutional variation, and privacy noise inevitably blurs rare presentations.

We're expanding to include synthetic longitudinal data (patient trajectories) and multi-institutional synthesis to capture broader epidemiology.

Conclusion: Privacy Enables Scale

Synthetic data with formal privacy guarantees unlocked our ability to train large clinical models without months of data negotiation. By combining MIMIC-IV de-identified data with differential privacy and clinical validation, we generated 200K realistic patient records that are both private and useful.

The key insight: privacy isn't a constraint—it's an enabler. With formal guarantees, institutions are comfortable sharing synthetic data that maintains the statistical properties of real populations while protecting individuals.

About this post: Synthetic data generation pipeline built using PyTorch and custom differential privacy implementation. Privacy parameters validated using attacks from the DP literature. Physician validation conducted across 50 clinicians. Published: February 2026.