Sub-2s Inference: CUDA Optimization at Scale
How we achieved <1.8s P95 latency with vLLM, custom CUDA kernels, and GPU cluster orchestration. Engineering clinical response times for production healthcare.
Why Latency Matters in Healthcare
In consumer applications, a 5-second API response is acceptable. In healthcare, it's a problem. When a clinician is standing at a patient's bedside, every second counts. A 10-second delay to see a plain-language translation of the patient's diagnosis creates friction that discourages adoption. At Synthure, we set an ambitious target: sub-2 second inference (P95) for end-to-end translation.
This required optimizing at every level: model quantization, kernel fusion, batch scheduling, and cluster-level resource management. We went from a naive baseline of 8.2 seconds to 1.8 seconds—a 4.5x speedup.
Baseline: The Naive Approach
We started with a standard setup: 7B parameter model on a single A100 GPU, using PyTorch's default inference.
| Metric | Baseline | Notes |
|---|---|---|
| Latency (P95) | 8.2 seconds | 512-token input, 256-token output |
| Throughput | 12 req/sec | Single GPU, batch size 1 |
| GPU Utilization | 42% | Memory-bandwidth bound; SMs underutilized |
| Cost per request | $0.08 | A100 GPU-hour allocation |
The bottleneck was clear: memory bandwidth. Language model inference is dominated by streaming model weights from GPU memory, not by compute. The 7B model has ~28GB of FP32 weights, and every generated token requires reading them from HBM again.
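A back-of-envelope roofline makes the memory-bandwidth bound concrete. This is a sketch, not our profiler output; the ~2 TB/s A100 HBM figure is a rounded spec-sheet number:

```python
def decode_time_floor(n_params, bytes_per_param, bandwidth_bytes_per_s, n_tokens):
    """Lower bound on autoregressive decode time: every generated token
    must stream the full set of model weights from HBM at least once."""
    weight_bytes = n_params * bytes_per_param
    return n_tokens * weight_bytes / bandwidth_bytes_per_s

# 7B parameters, A100 HBM bandwidth ~2.0e12 bytes/s, 256 output tokens
t_fp32 = decode_time_floor(7e9, 4, 2.0e12, 256)  # ~3.6 s just moving FP32 weights
t_fp8  = decode_time_floor(7e9, 1, 2.0e12, 256)  # ~0.9 s at 1 byte per weight
```

The gap between the ~3.6s floor and the 8.2s measured baseline is framework overhead and unbatched kernel launches; the FP32-vs-FP8 ratio also previews why quantization pays off so directly.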
Optimization 1: Quantization
Model quantization reduces the precision of weights and activations, trading a small amount of accuracy for dramatic speed improvements. We tested three approaches:
INT8 Quantization
Convert float32 weights to int8 (8-bit integers). This reduces model size by 75% and accelerates matrix multiplications via specialized INT8 GEMM kernels.
x_quant = round(x_float / scale)
where scale = max(|x_float|) / 127
Dequantization: x_float ≈ x_quant × scale
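The scheme above is symmetric per-tensor INT8 quantization; a minimal NumPy sketch (not our production kernel) shows the round trip and its bounded error:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization, as in the formula above."""
    scale = np.abs(x).max() / 127.0
    x_q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_q, scale

def dequantize(x_q, scale):
    """Approximate reconstruction: x_float ~= x_quant * scale."""
    return x_q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
w_q, s = quantize_int8(w)
err = np.abs(dequantize(w_q, s) - w).max()  # bounded by scale / 2
```

Because the scale is set by the tensor's maximum, one outlier weight coarsens the grid for everything else, which is part of why INT8 cost us 2.5% on clinical benchmarks.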
Results: 2.8x speedup, but 2.5% accuracy loss on clinical Q&A benchmarks. For patient safety, 2.5% is too risky—we need near-perfect medical accuracy.
INT4 + FP16 Hybrid
A compromise: quantize most layers to INT4, keep attention layers in FP16. Reduces model size to 3.5GB while minimizing accuracy loss.
Results: 3.2x speedup, 0.8% accuracy loss. Better, but still measurable impact on medical reasoning.
FP8 Quantization
Use NVIDIA's FP8 format (native hardware support arrived with the H100; on A100s it runs via vLLM's software kernels). 8-bit floating point preserves dynamic range better than INT8.
~4x memory reduction vs FP32
Maintains numerical stability better than INT8
Results: 2.9x speedup, 0.3% accuracy loss. We chose this as the baseline for further optimization.
Optimization 2: vLLM Serving
vLLM is an open-source LLM serving engine that dramatically speeds up inference through two key innovations:
Paged Attention
Traditional serving engines cache the key-value (KV) states for all previous tokens during generation, typically in one contiguous buffer per request sized for the longest possible sequence. For a 256-token output, much of that reservation sits unused, and the variable-sized buffers fragment GPU memory.
vLLM's paged attention divides the KV cache into fixed-size pages (e.g., 16 tokens per page), allowing better memory utilization and reducing fragmentation.
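A toy page-table allocator conveys the idea (class and method names here are illustrative, not vLLM's API):

```python
PAGE_SIZE = 16  # tokens per KV-cache page, as in the example above

class PagedKVCache:
    """Sketch in the spirit of PagedAttention: each sequence maps logical
    pages to physical blocks allocated on demand from a shared free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        table = self.tables.setdefault(seq_id, [])
        if pos // PAGE_SIZE >= len(table):      # logical page not yet backed
            table.append(self.free.pop())       # grab any free physical block
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE  # (block, slot)

cache = PagedKVCache(num_blocks=8)
for pos in range(17):                           # 17 tokens -> exactly 2 pages
    block, slot = cache.append_token("seq-1", pos)
```

Because pages are fixed-size and allocated lazily, a sequence only ever wastes the tail of its last page, and freed pages are immediately reusable by any other request.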
Continuous Batching
Instead of waiting for every request in a batch to finish, vLLM schedules at the granularity of individual decode steps: each iteration generates one token for every active request, and the moment a request completes, a waiting request takes its slot. Request A can be on token 10 while Request B, which arrived later, is on token 5 in the same batch.
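The scheduling loop can be sketched in a few lines (a simplification of vLLM's scheduler, with made-up request records):

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Iteration-level scheduling sketch: every step decodes one token for
    each active request; a finished request frees its slot immediately."""
    waiting, active, steps = deque(requests), [], []
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())        # join mid-stream
        steps.append([r["id"] for r in active])     # who decodes this step
        for r in active:
            r["remaining"] -= 1                     # one token generated each
        active = [r for r in active if r["remaining"] > 0]
    return steps

reqs = [{"id": "A", "remaining": 3},
        {"id": "B", "remaining": 1},
        {"id": "C", "remaining": 2}]
steps = continuous_batching(reqs, max_batch=2)
# C enters the batch on the very step after B finishes, with no idle slot
```

Contrast this with static batching, where C would wait for both A and B to drain before a new batch forms.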
Switching to vLLM alone improved throughput 2.1x and reduced latency to 3.9 seconds (P95).
Optimization 3: Custom CUDA Kernels
Generic operations still had overhead. We wrote custom CUDA kernels for two bottleneck operations:
Fused Attention Kernel
Standard attention over the Q, K, V tensors requires 5 separate GPU kernel launches: the QKᵀ matmul, scaling, masking, softmax, and the final matmul with V.
Each kernel launch has overhead. Our fused kernel combines all 5 operations into a single CUDA kernel, eliminating launch overhead and enabling better memory access patterns via shared-memory tiling.
Naive: O(n²d) memory reads (Q, K, V loaded separately)
Fused: O(n²) memory reads (tiles cached in shared memory)
Where n = sequence length, d = head dimension
Results: 1.3x speedup on the attention operation, which is ~40% of total inference time.
Quantized MatMul Kernel
For FP8 matrix multiplications, we tuned the CUTLASS library template for our specific tensor shapes (batch_size=128, seq_len=512, d_model=4096).
Results: 1.2x speedup on matmul, which is ~50% of inference time.
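Amdahl's law translates these per-operation speedups into end-to-end impact; the quick check below uses the fractions quoted above. The predicted numbers are in the same ballpark as, but not identical to, the cumulative figures in the next section's table, because each optimization shifts the time fractions for the next one:

```python
def amdahl(fraction, local_speedup):
    """End-to-end speedup from accelerating one fraction of total runtime."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

attn_gain   = amdahl(0.40, 1.3)  # fused attention: ~1.10x end-to-end
matmul_gain = amdahl(0.50, 1.2)  # tuned FP8 matmul: ~1.09x end-to-end
```

This is also the argument for stopping here: a further 2x on an operation that is down to 10% of runtime buys only ~1.05x overall.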
End-to-End Optimization Results
Combining all three optimizations:
| Optimization | Speedup | Cumulative |
|---|---|---|
| Baseline (FP32) | 1.0x | 8.2s |
| FP8 Quantization | 2.9x | 2.8s |
| vLLM + Paged Attention | 1.4x | 2.0s |
| Fused Attention Kernel | 1.18x | 1.7s |
| Quantized MatMul Kernel | 1.05x | 1.62s |
| Final (P95) | 5.0x | 1.8s* |
*The 5.0x cumulative speedup applies to raw inference (1.62s); the 1.8s P95 figure additionally includes network latency (~100ms) and request queuing.
Scaling to Production: Multi-GPU Cluster
Single-GPU inference is fast, but at scale we need to absorb traffic spikes, so we deploy across 8x A100 GPUs in a load-balanced cluster.
In practice, we operate at 40-60% capacity (due to traffic patterns), which puts cost at $0.017-0.025 per request.
Latency Under Load
Latency degrades gracefully under load. At 80% cluster utilization, P95 latency increases from 1.8s to 2.1s due to queueing. At 95% utilization, it reaches 3.2s.
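A toy queueing model reproduces the shape of this curve; the coefficient below is an illustrative fit to the two data points above, not a measured constant:

```python
def p95_latency(base_s, utilization, queue_coeff=0.08):
    """Illustrative model: service time plus a queueing-delay term that
    blows up as utilization rho -> 1 (M/M/c-style behavior, simplified)."""
    return base_s + queue_coeff * utilization / (1.0 - utilization)

p95_at_80 = p95_latency(1.8, 0.80)  # ~2.1 s
p95_at_95 = p95_latency(1.8, 0.95)  # ~3.3 s
```

The practical takeaway is the asymptote, not the coefficient: the last 15 points of utilization cost more latency than the first 80, which is why we provision headroom.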
Figure: P95 latency vs. cluster utilization.
Real-World Deployment: Lessons Learned
Theory met practice when we deployed to production. Key insights:
Network Latency Dominates at Scale
We achieved 1.8s P95 latency, but production APIs added network overhead. API request → load balancer → GPU cluster → response takes ~200-300ms at P95. Our target shifted from 1.8s inference to 2.0s end-to-end.
Batch Size Tuning is Delicate
We set batch size to 128 for optimal GPU utilization. But during quiet hours, requests arrive slowly, and holding a request for 50ms (to batch) violates our latency target. Solution: dynamic batch size based on arrival rate.
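One simple policy for that dynamic sizing, sketched here with hypothetical names (our production scheduler is more involved):

```python
def pick_batch_size(arrival_rate_rps, max_wait_ms=50, max_batch=128):
    """Cap the batch at what can plausibly arrive within the latency
    budget, so quiet-hour requests are never held just to fill a batch."""
    expected_arrivals = int(arrival_rate_rps * max_wait_ms / 1000)
    return max(1, min(max_batch, expected_arrivals))

pick_batch_size(2)     # quiet hours: dispatch immediately, batch of 1
pick_batch_size(5000)  # peak traffic: saturate at the 128 cap
```

The key property is that the 50ms hold only ever happens when it is actually buying throughput.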
Quantization Affects Edge Cases
FP8 quantization introduced rare (0.3%) accuracy regressions. Most were benign (slightly different phrasing), but in a couple of cases the model hedged on complex drug interactions. We added a fallback: if model confidence is below 0.85, re-run the request in FP32 (adds ~3 seconds but restores full-precision accuracy).
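The gating logic is a few lines; the model callables and their (text, confidence) return shape are illustrative stand-ins, not our actual serving interface:

```python
def translate_with_fallback(prompt, fast_model, safe_model, threshold=0.85):
    """Confidence-gated fallback: low-confidence FP8 answers are re-run
    at full precision before anything reaches a patient."""
    text, confidence = fast_model(prompt)   # quantized FP8 path, ~1.8s
    if confidence >= threshold:
        return text
    text, _ = safe_model(prompt)            # FP32 re-run, ~3s
    return text

fast_lo = lambda p: ("fp8 answer", 0.60)    # stand-in: hesitant FP8 output
fast_hi = lambda p: ("fp8 answer", 0.92)
safe    = lambda p: ("fp32 answer", 0.99)
```

Because only ~0.3% of requests trip the threshold, the fallback's 3-second path barely moves the fleet-wide P95.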
Conclusion: Performance Enables Adoption
Sub-2 second inference isn't just a nice-to-have. In clinical settings, it's the difference between a tool clinicians use and one they ignore. By combining quantization, vLLM's paged attention, and custom CUDA kernels, we achieved 4.5x speedup while maintaining 99.7% accuracy.
The real lesson: infrastructure optimization is inseparable from product success. Every 100ms we save is a doctor who spends less time waiting and more time with patients.