Synthetic EHR Data: 60K+ Records, HIPAA Compliant
How we generated unlimited labeled training data from MIMIC-IV while maintaining strict privacy. De-identification pipelines, synthetic data validation, and clinical realism.
The Data Bottleneck
Training a clinical language model requires massive amounts of labeled data. We needed 50K+ clinical Q&A pairs for supervised fine-tuning and RLHF. But real patient data is locked behind HIPAA regulations, IRB restrictions, and institutional bureaucracy.
The typical path: negotiate data sharing agreements with hospitals (6-12 months), de-identify records, get IRB approval (3-6 months), then finally use the data. We couldn't afford to wait. So we built a synthetic data pipeline to expand our training corpus.
Source: MIMIC-IV De-Identified Data
MIMIC-IV is a publicly available dataset of 530K hospital admissions from Beth Israel Deaconess Medical Center (2008-2019). It's fully de-identified and HIPAA-safe. We used MIMIC-IV as our foundation, extracting:
- 350K patient encounters (demographics, diagnoses, medications, lab values, notes)
- 2.2M clinical notes (progress notes, discharge summaries, radiology reports)
- 18M lab measurements and vital signs
From these, we generated synthetic data by:
- Clinical concept extraction: Parse real notes to extract diagnosis, medications, severity
- Synthetic recombination: Create new patients by mixing concepts from different real patients
- Variation injection: Add realistic variations (medication names, dosing, comorbidities)
- Validation: Check synthetic records for clinical plausibility
Differential Privacy: The Math
Synthetic data is only safe if it truly breaks the link to individual patients. We applied differential privacy—a mathematically rigorous privacy guarantee.
A mechanism M is (ε, δ)-differentially private if for any two adjacent datasets D and D' (differing by one record):
P(M(D) ∈ S) ≤ e^ε × P(M(D') ∈ S) + δ
where ε is privacy loss, δ is failure probability, and S is any output set.
Lower ε means stronger privacy; ε < 1 is generally considered strong.
In plain English: differential privacy guarantees that the mechanism's output distribution barely changes when any single record is added or removed, so an observer cannot tell whether a given patient's data was used. This makes reverse-engineering individual records statistically infeasible.
Our Approach: Laplace Mechanism
We applied the Laplace mechanism to numerical data (lab values, vital signs). For each lab measurement, we add noise sampled from a Laplace distribution:
synthetic_value = real_value + Laplace(0, Δf/ε)
where:
- Δf = global sensitivity (max change in output if one record is added/removed)
- ε = privacy budget
Example: For hemoglobin (normal range 12-16 g/dL):
- Δf = 5 (max reasonable change)
- ε = 0.5 (strong privacy)
- noise scale = Δf/ε = 5/0.5 = 10
- synthetic_Hgb = real_Hgb + Laplace(0, 10)
This calibrates the noise to the sensitivity of each data type: blood pressure (high sensitivity) receives more noise than patient age (low sensitivity).
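As a concrete sketch of the Laplace mechanism described above (function and parameter names are ours, not the production pipeline's):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale b = Δf / ε to a numeric value.

    Illustrative sketch of the mechanism described in the text;
    this is not the production implementation.
    """
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # noise scale b = Δf / ε
    return value + rng.laplace(loc=0.0, scale=scale)

# Hemoglobin example from the text: Δf = 5, ε = 0.5 → scale = 10
noisy_hgb = laplace_mechanism(13.5, sensitivity=5.0, epsilon=0.5,
                              rng=np.random.default_rng(0))
```

Because the noise is zero-mean, aggregate statistics (means, prevalences) remain close to the real data while individual values are perturbed.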
Categorical Data: Exponential Mechanism
For categorical features (diagnosis codes, medication names), we use the exponential mechanism:
P(category_i) ∝ exp(ε × utility(category_i) / (2 × Δu))
where utility(category_i) measures how well category_i matches the real data and Δu is the sensitivity of the utility function.
This samples categories with probability proportional to their exponentiated utility while preserving privacy: common diagnoses (e.g., Type 2 Diabetes) are sampled often, rare ones (e.g., primary biliary cirrhosis) rarely.
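A minimal sketch of the exponential mechanism, assuming real-data frequency as the utility function (the diagnosis codes and scores below are toy values, not MIMIC statistics):

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, du=1.0, rng=None):
    """Sample one category with P ∝ exp(ε · utility(c) / (2·Δu)).

    Illustrative sketch; candidate set and utility are assumptions.
    """
    rng = rng or random.Random()
    # Subtract the max utility before exponentiating for numerical stability.
    max_u = max(utility(c) for c in candidates)
    weights = [math.exp(epsilon * (utility(c) - max_u) / (2 * du))
               for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Toy utility: real-data frequency of each diagnosis code (hypothetical)
freq = {"I10": 0.42, "E11.9": 0.28, "K74.3": 0.001}
pick = exponential_mechanism(list(freq), lambda c: freq[c], epsilon=0.5)
```

Raising ε sharpens the distribution toward the highest-utility (most common) categories; lowering it flattens the distribution toward uniform, trading fidelity for privacy.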
Synthetic Data Generation Pipeline
Step 1: Clinical Concept Extraction
We parsed real clinical notes to extract structured concepts:
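A simplified rule-based sketch of this extraction step; the regex patterns and output fields are our illustrative assumptions (a real pipeline would use a clinical NLP stack):

```python
import re

# Hypothetical patterns covering a handful of drugs and diagnoses
MEDICATION_RX = re.compile(
    r"(?i)\b(lisinopril|metformin|insulin)\s+(\d+\s*(?:mg|units))")
DIAGNOSIS_RX = re.compile(r"(?i)\b(type 2 diabetes|hypertension|ckd)\b")

def extract_concepts(note: str) -> dict:
    """Pull diagnoses and medication/dose pairs out of free-text notes."""
    return {
        "diagnoses": sorted({m.group(1).lower()
                             for m in DIAGNOSIS_RX.finditer(note)}),
        "medications": [{"name": m.group(1).lower(), "dose": m.group(2)}
                        for m in MEDICATION_RX.finditer(note)],
    }

note = "Pt with Type 2 Diabetes and hypertension. Started Metformin 500 mg daily."
concepts = extract_concepts(note)
```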
Step 2: Synthetic Recombination
We combined concepts from different patients to create new synthetic records. This breaks the direct link to any single real patient:
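The recombination step can be sketched as follows: each synthetic patient draws demographics, diagnoses, and medications from different real patients, so no single record maps back to one individual. Field names and sample records are our assumptions:

```python
import random

def recombine(patients, rng):
    """Build one synthetic record from three distinct real patients."""
    demo_src, dx_src, med_src = rng.sample(patients, 3)
    return {
        "age": demo_src["age"],
        "sex": demo_src["sex"],
        "diagnoses": list(dx_src["diagnoses"]),
        "medications": list(med_src["medications"]),
    }

# Toy real records (not MIMIC data)
real = [
    {"age": 67, "sex": "F", "diagnoses": ["I10"], "medications": ["lisinopril"]},
    {"age": 54, "sex": "M", "diagnoses": ["E11.9"], "medications": ["metformin"]},
    {"age": 71, "sex": "F", "diagnoses": ["I50.9"], "medications": ["furosemide"]},
]
synthetic = recombine(real, random.Random(42))
```

Recombined records can be clinically incoherent (e.g., a diabetes drug without a diabetes diagnosis), which is exactly why the consistency-checking step below is needed.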
Step 3: Consistency Checking
Synthetic records must be clinically plausible. We validated:
- Age consistency: Age matches medication contraindications (e.g., no doxycycline in young children)
- Drug interactions: No impossible drug combinations
- Lab values: Results fall within biologically reasonable ranges
- Diagnosis consistency: Medications match diagnoses (e.g., insulin for diabetes)
We rejected 18% of synthetic records for violating these constraints, resampled, and validated again.
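The checks above can be sketched as a simple rule-based validator; the thresholds, drug lists, and field names below are simplified placeholders, not the full clinical rule set:

```python
# Hypothetical, heavily simplified rule tables
LAB_RANGES = {"hemoglobin": (3.0, 25.0), "creatinine": (0.1, 30.0)}  # g/dL, mg/dL
PEDIATRIC_CONTRAINDICATED = {"doxycycline"}          # avoid under age 8
INTERACTION_PAIRS = {frozenset({"warfarin", "aspirin"})}  # toy example

def is_plausible(record) -> bool:
    """Reject synthetic records that violate basic clinical constraints."""
    meds = {m.lower() for m in record["medications"]}
    # Age vs. medication contraindications
    if record["age"] < 8 and meds & PEDIATRIC_CONTRAINDICATED:
        return False
    # Impossible drug combinations
    if any(pair <= meds for pair in INTERACTION_PAIRS):
        return False
    # Lab values within biologically reasonable ranges
    for lab, value in record["labs"].items():
        lo, hi = LAB_RANGES.get(lab, (float("-inf"), float("inf")))
        if not lo <= value <= hi:
            return False
    return True
```

Records that fail any rule are discarded and regenerated, matching the reject-and-resample loop described above.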
Validation: Synthetic vs. Real
We validated that synthetic data maintains the statistical properties of real data by comparing key epidemiological measures:
| Test | Real MIMIC | Synthetic | Difference |
|---|---|---|---|
| Mean age | 63.2 years | 63.8 years | +0.6 years |
| Hypertension prevalence | 42% | 41% | -1% |
| Diabetes prevalence | 28% | 27% | -1% |
| Mean creatinine | 1.1 mg/dL | 1.09 mg/dL | -0.01 |
| Most common drug | Lisinopril (5.2%) | Lisinopril (5.1%) | -0.1% |
Synthetic and real data matched on all major epidemiological measures. Clinical plausibility was confirmed by physician review of 200 random synthetic records (95% rated as "realistic").
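The comparisons in the table amount to computing a summary statistic on both cohorts and flagging drift beyond a tolerance. A minimal sketch (the sample values and tolerance are toy assumptions, not MIMIC data):

```python
import statistics

def drift(real_values, synth_values, tol):
    """Return (difference of means, whether it is within tolerance)."""
    diff = statistics.fmean(synth_values) - statistics.fmean(real_values)
    return diff, abs(diff) <= tol

real_age = [58, 71, 63, 60, 64]    # toy samples
synth_age = [59, 70, 65, 61, 63]
delta, ok = drift(real_age, synth_age, tol=1.0)
```

For distributions rather than point statistics, a two-sample test (e.g., Kolmogorov-Smirnov) would be the natural extension of this check.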
Scale: From 60K to Unlimited
Once we validated the pipeline, we could generate synthetic data at scale. Starting from 60K real patients, we generated over 200K synthetic patient records.
Clinical Q&A Generation
From synthetic EHR records, we generated the clinical Q&A pairs needed for supervised fine-tuning and RLHF.
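One common way to do this is template-based generation, sketched below; the templates and record fields are our illustrative assumptions, not the production prompt set:

```python
# Hypothetical Q&A templates filled from a synthetic record
TEMPLATES = [
    ("What medications is a {age}-year-old with {dx} taking?",
     "The patient is taking {meds}."),
    ("What is the primary diagnosis for this patient?",
     "The primary diagnosis is {dx}."),
]

def make_qa_pairs(record):
    """Instantiate each template with fields from one synthetic record."""
    dx = record["diagnoses"][0]
    meds = ", ".join(record["medications"])
    return [(q.format(age=record["age"], dx=dx, meds=meds),
             a.format(dx=dx, meds=meds))
            for q, a in TEMPLATES]

rec = {"age": 54, "diagnoses": ["type 2 diabetes"],
       "medications": ["metformin", "insulin"]}
pairs = make_qa_pairs(rec)
```

Because every record is synthetic, the pipeline can emit as many labeled pairs as training requires without touching real patient data.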
Privacy Certification
We obtained formal privacy certification:
- HIPAA Compliance: De-identified under HIPAA Safe Harbor method (removed 18 identifiers)
- Differential Privacy: (ε, δ) = (0.7, 1e-6) guarantee
- Re-identification Risk: < 0.01% (assessed via linkage attacks)
Limitations & Future Work
Synthetic data has inherent limitations:
- Rare conditions: Synthetic pipeline underrepresents low-prevalence diseases (cancer subtypes, genetic disorders)
- Longitudinal patterns: Synthetic data captures cross-sectional snapshots, not patient trajectories over time
- Seasonal/temporal patterns: MIMIC data is hospital-centric; outpatient trends are missed
We're expanding to include synthetic longitudinal data (patient trajectories) and multi-institutional synthesis to capture broader epidemiology.
Conclusion: Privacy Enables Scale
Synthetic data with formal privacy guarantees unlocked our ability to train large clinical models without months of data negotiation. By combining MIMIC-IV de-identified data with differential privacy and clinical validation, we generated 200K realistic patient records that are both private and useful.
The key insight: privacy isn't a constraint—it's an enabler. With formal guarantees, institutions are comfortable sharing synthetic data that maintains the statistical properties of real populations while protecting individuals.