Synthetic EHR Data: 60K+ Records, HIPAA Compliant
How we generated unlimited labeled training data from MIMIC-IV while maintaining strict privacy. De-identification pipelines, synthetic data validation, and clinical realism.
The Data Bottleneck
Training a clinical language model requires massive amounts of labeled data. We needed 50K+ clinical Q&A pairs for supervised fine-tuning and RLHF. But real patient data is locked behind HIPAA regulations, IRB restrictions, and institutional bureaucracy.
The typical path: negotiate data sharing agreements with hospitals (6-12 months), de-identify records, get IRB approval (3-6 months), then finally use the data. We couldn't afford to wait. So we built a synthetic data pipeline to expand our training corpus.
Source: MIMIC-IV De-Identified Data
MIMIC-IV is a publicly available dataset of 530K hospital admissions from Beth Israel Deaconess Medical Center (2008-2019). It's fully de-identified and HIPAA-safe. We used MIMIC-IV as our foundation, extracting:
- 350K patient encounters (demographics, diagnoses, medications, lab values, notes)
- 2.2M clinical notes (progress notes, discharge summaries, radiology reports)
- 18M lab measurements and vital signs
From these, we generated synthetic data by:
- Clinical concept extraction: Parse real notes to extract diagnosis, medications, severity
- Synthetic recombination: Create new patients by mixing concepts from different real patients
- Variation injection: Add realistic variations (medication names, dosing, comorbidities)
- Validation: Check synthetic records for clinical plausibility
Differential Privacy: The Math
Synthetic data is only safe if it truly breaks the link to individual patients. We applied differential privacy—a mathematically rigorous privacy guarantee.
A mechanism M is (ε, δ)-differentially private if for any two adjacent datasets D and D' (differing by one record):
P(M(D) ∈ S) ≤ e^ε × P(M(D') ∈ S) + δ
where ε is privacy loss, δ is failure probability, and S is any output set.
Lower ε means stronger privacy; ε < 1 is generally considered strong.
In plain English: differential privacy guarantees that the mechanism's output distribution barely changes when any single record is added or removed, so an observer cannot tell whether a given patient's data was used. This makes reverse-engineering individual records statistically infeasible.
Our Approach: Laplace Mechanism
We applied the Laplace mechanism to numerical data (lab values, vital signs). For each lab measurement, we add noise sampled from a Laplace distribution:
synthetic_value = real_value + Laplace(0, Δf/ε)
where:
- Δf = global sensitivity (max change in output if one record is added/removed)
- ε = privacy budget
Example: For hemoglobin (normal range 12-16 g/dL):
- Δf = 5 (max reasonable change)
- ε = 0.5 (strong privacy)
- noise scale = Δf/ε = 5/0.5 = 10
- synthetic_Hgb = real_Hgb + Laplace(0, 10)
This calibrates the noise to the sensitivity of each data type: blood pressure (high sensitivity) receives more noise than patient age (low sensitivity).
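As a concrete sketch of the Laplace mechanism described above (function and parameter names are ours, not the production pipeline's):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Add Laplace noise with scale b = Δf / ε to a numeric value.

    Illustrative sketch of the mechanism described in the text;
    this is not the production implementation.
    """
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # noise scale b = Δf / ε
    return value + rng.laplace(loc=0.0, scale=scale)

# Hemoglobin example from the text: Δf = 5, ε = 0.5 → scale = 10
noisy_hgb = laplace_mechanism(13.5, sensitivity=5.0, epsilon=0.5,
                              rng=np.random.default_rng(0))
```

Because the noise is zero-mean, aggregate statistics (means, prevalences) remain close to the real data while individual values are perturbed.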
Categorical Data: Exponential Mechanism
For categorical features (diagnosis codes, medication names), we use the exponential mechanism:
P(category_i) ∝ exp(ε × utility(category_i) / (2 × Δu))
where utility(category_i) measures how well category_i matches the real data and Δu is the sensitivity of the utility function.
This samples categories with probability proportional to their exponentiated utility while preserving privacy: common diagnoses (e.g., Type 2 Diabetes) are sampled often, rare ones (e.g., primary biliary cirrhosis) rarely.
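A minimal sketch of the exponential mechanism, assuming real-data frequency as the utility function (the diagnosis codes and scores below are toy values, not MIMIC statistics):

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, du=1.0, rng=None):
    """Sample one category with P ∝ exp(ε · utility(c) / (2·Δu)).

    Illustrative sketch; candidate set and utility are assumptions.
    """
    rng = rng or random.Random()
    # Subtract the max utility before exponentiating for numerical stability.
    max_u = max(utility(c) for c in candidates)
    weights = [math.exp(epsilon * (utility(c) - max_u) / (2 * du))
               for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Toy utility: real-data frequency of each diagnosis code (hypothetical)
freq = {"I10": 0.42, "E11.9": 0.28, "K74.3": 0.001}
pick = exponential_mechanism(list(freq), lambda c: freq[c], epsilon=0.5)
```

Raising ε sharpens the distribution toward the highest-utility (most common) categories; lowering it flattens the distribution toward uniform, trading fidelity for privacy.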
Synthetic Data Generation Pipeline
Step 1: Clinical Concept Extraction
We parsed real clinical notes to extract structured concepts:
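A simplified rule-based sketch of this extraction step; the regex patterns and output fields are our illustrative assumptions (a real pipeline would use a clinical NLP stack):

```python
import re

# Hypothetical patterns covering a handful of drugs and diagnoses
MEDICATION_RX = re.compile(
    r"(?i)\b(lisinopril|metformin|insulin)\s+(\d+\s*(?:mg|units))")
DIAGNOSIS_RX = re.compile(r"(?i)\b(type 2 diabetes|hypertension|ckd)\b")

def extract_concepts(note: str) -> dict:
    """Pull diagnoses and medication/dose pairs out of free-text notes."""
    return {
        "diagnoses": sorted({m.group(1).lower()
                             for m in DIAGNOSIS_RX.finditer(note)}),
        "medications": [{"name": m.group(1).lower(), "dose": m.group(2)}
                        for m in MEDICATION_RX.finditer(note)],
    }

note = "Pt with Type 2 Diabetes and hypertension. Started Metformin 500 mg daily."
concepts = extract_concepts(note)
```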
Step 2: Synthetic Recombination
We combined concepts from different patients to create new synthetic records. This breaks the direct link to any single real patient:
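The recombination step can be sketched as follows: each synthetic patient draws demographics, diagnoses, and medications from different real patients, so no single record maps back to one individual. Field names and sample records are our assumptions:

```python
import random

def recombine(patients, rng):
    """Build one synthetic record from three distinct real patients."""
    demo_src, dx_src, med_src = rng.sample(patients, 3)
    return {
        "age": demo_src["age"],
        "sex": demo_src["sex"],
        "diagnoses": list(dx_src["diagnoses"]),
        "medications": list(med_src["medications"]),
    }

# Toy real records (not MIMIC data)
real = [
    {"age": 67, "sex": "F", "diagnoses": ["I10"], "medications": ["lisinopril"]},
    {"age": 54, "sex": "M", "diagnoses": ["E11.9"], "medications": ["metformin"]},
    {"age": 71, "sex": "F", "diagnoses": ["I50.9"], "medications": ["furosemide"]},
]
synthetic = recombine(real, random.Random(42))
```

Recombined records can be clinically incoherent (e.g., a diabetes drug without a diabetes diagnosis), which is exactly why the consistency-checking step below is needed.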
Step 3: Consistency Checking
Synthetic records must be clinically plausible. We validated:
- Age consistency: Age matches medication contraindications (e.g., no doxycycline in young children)
- Drug interactions: No impossible drug combinations
- Lab values: Results fall within biologically reasonable ranges
- Diagnosis consistency: Medications match diagnoses (e.g., insulin for diabetes)
We rejected 18% of synthetic records for violating these constraints, resampled, and validated again.
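The checks above can be sketched as a simple rule-based validator; the thresholds, drug lists, and field names below are simplified placeholders, not the full clinical rule set:

```python
# Hypothetical, heavily simplified rule tables
LAB_RANGES = {"hemoglobin": (3.0, 25.0), "creatinine": (0.1, 30.0)}  # g/dL, mg/dL
PEDIATRIC_CONTRAINDICATED = {"doxycycline"}          # avoid under age 8
INTERACTION_PAIRS = {frozenset({"warfarin", "aspirin"})}  # toy example

def is_plausible(record) -> bool:
    """Reject synthetic records that violate basic clinical constraints."""
    meds = {m.lower() for m in record["medications"]}
    # Age vs. medication contraindications
    if record["age"] < 8 and meds & PEDIATRIC_CONTRAINDICATED:
        return False
    # Impossible drug combinations
    if any(pair <= meds for pair in INTERACTION_PAIRS):
        return False
    # Lab values within biologically reasonable ranges
    for lab, value in record["labs"].items():
        lo, hi = LAB_RANGES.get(lab, (float("-inf"), float("inf")))
        if not lo <= value <= hi:
            return False
    return True
```

Records that fail any rule are discarded and regenerated, matching the reject-and-resample loop described above.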
Validation: Synthetic vs. Real
We validated that synthetic data maintains the statistical properties of real data by comparing key epidemiological measures:
| Test | Real MIMIC | Synthetic | Difference |
|---|---|---|---|
| Mean age | 63.2 years | 63.8 years | +0.6 years |
| Hypertension prevalence | 42% | 41% | -1% |
| Diabetes prevalence | 28% | 27% | -1% |
| Mean creatinine | 1.1 mg/dL | 1.09 mg/dL | -0.01 |
| Most common drug | Lisinopril (5.2%) | Lisinopril (5.1%) | -0.1% |
Synthetic and real data matched on all major epidemiological measures. Clinical plausibility was confirmed by physician review of 200 random synthetic records (95% rated as "realistic").
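The comparisons in the table amount to computing a summary statistic on both cohorts and flagging drift beyond a tolerance. A minimal sketch (the sample values and tolerance are toy assumptions, not MIMIC data):

```python
import statistics

def drift(real_values, synth_values, tol):
    """Return (difference of means, whether it is within tolerance)."""
    diff = statistics.fmean(synth_values) - statistics.fmean(real_values)
    return diff, abs(diff) <= tol

real_age = [58, 71, 63, 60, 64]    # toy samples
synth_age = [59, 70, 65, 61, 63]
delta, ok = drift(real_age, synth_age, tol=1.0)
```

For distributions rather than point statistics, a two-sample test (e.g., Kolmogorov-Smirnov) would be the natural extension of this check.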
Scale: From 60K to Unlimited
Once we validated the pipeline, we could generate synthetic data at scale. Starting from 60K real patients, we generated over 200K synthetic patient records.
Clinical Q&A Generation
From synthetic EHR records, we generated the clinical Q&A pairs needed for supervised fine-tuning and RLHF.
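One common way to do this is template-based generation, sketched below; the templates and record fields are our illustrative assumptions, not the production prompt set:

```python
# Hypothetical Q&A templates filled from a synthetic record
TEMPLATES = [
    ("What medications is a {age}-year-old with {dx} taking?",
     "The patient is taking {meds}."),
    ("What is the primary diagnosis for this patient?",
     "The primary diagnosis is {dx}."),
]

def make_qa_pairs(record):
    """Instantiate each template with fields from one synthetic record."""
    dx = record["diagnoses"][0]
    meds = ", ".join(record["medications"])
    return [(q.format(age=record["age"], dx=dx, meds=meds),
             a.format(dx=dx, meds=meds))
            for q, a in TEMPLATES]

rec = {"age": 54, "diagnoses": ["type 2 diabetes"],
       "medications": ["metformin", "insulin"]}
pairs = make_qa_pairs(rec)
```

Because every record is synthetic, the pipeline can emit as many labeled pairs as training requires without touching real patient data.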
Privacy Certification
We obtained formal privacy certification:
- HIPAA Compliance: De-identified under HIPAA Safe Harbor method (removed 18 identifiers)
- Differential Privacy: (ε, δ) = (0.7, 1e-6) guarantee
- Re-identification Risk: < 0.01% (assessed via linkage attacks)
Limitations & Future Work
Synthetic data has inherent limitations:
- Rare conditions: Synthetic pipeline underrepresents low-prevalence diseases (cancer subtypes, genetic disorders)
- Longitudinal patterns: Synthetic data captures cross-sectional snapshots, not patient trajectories over time
- Seasonal/temporal patterns: MIMIC data is hospital-centric; outpatient trends are missed
We're expanding to include synthetic longitudinal data (patient trajectories) and multi-institutional synthesis to capture broader epidemiology.
Conclusion: Privacy Enables Scale
Synthetic data with formal privacy guarantees unlocked our ability to train large clinical models without months of data negotiation. By combining MIMIC-IV de-identified data with differential privacy and clinical validation, we generated 200K realistic patient records that are both private and useful.
The key insight: privacy isn't a constraint—it's an enabler. With formal guarantees, institutions are comfortable sharing synthetic data that maintains the statistical properties of real populations while protecting individuals.