Sub-2s Inference: CUDA Optimization at Scale

How we achieved <1.8s P95 latency with vLLM, custom CUDA kernels, and GPU cluster orchestration. Engineering clinical response times for production healthcare.

Why Latency Matters in Healthcare

In consumer applications, a 5-second API response is acceptable. In healthcare, it's a problem. When a clinician is standing at a patient's bedside, every second counts. A 10-second delay to see a plain-language translation of the patient's diagnosis creates friction that discourages adoption. At Synthure, we set an ambitious target: sub-2 second inference (P95) for end-to-end translation.

This required optimizing at every level: model quantization, kernel fusion, batch scheduling, and cluster-level resource management. We went from a naive baseline of 8.2 seconds to 1.8 seconds—a 4.5x speedup.

Baseline: The Naive Approach

We started with a standard setup: 7B parameter model on a single A100 GPU, using PyTorch's default inference.

| Metric | Baseline | Notes |
| --- | --- | --- |
| Latency (P95) | 8.2 seconds | 512-token input, 256-token output |
| Throughput | 12 req/sec | Single GPU, batch size 1 |
| GPU utilization | 42% | I/O bound, memory not saturated |
| Cost per request | $0.08 | A100 GPU-hour allocation |

The bottleneck was clear: memory bandwidth. Language model inference is dominated by loading model weights from GPU memory, not by compute. The 7B model has ~14GB of FP16 weights, and every generated token requires streaming them from memory again.

Optimization 1: Quantization

Model quantization reduces the precision of weights and activations, trading a small amount of accuracy for dramatic speed improvements. We tested three approaches:

INT8 Quantization

Convert float32 weights to int8 (8-bit integers). This reduces model size by 75% and accelerates matrix multiplications via specialized INT8 GEMM kernels.

Quantization Formula:

x_quant = round(x_float / scale)

where scale = max(|x_float|) / 127

Dequantization: x_float ≈ x_quant × scale
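As a concrete sketch, the symmetric scheme above in NumPy (illustrative only; production quantizers typically work per-channel and calibrate activations separately):

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization, as in the formula above."""
    scale = float(np.max(np.abs(x))) / 127.0
    x_quant = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_quant, scale

def dequantize(x_quant: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction: x_float ~= x_quant * scale."""
    return x_quant.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01], dtype=np.float32)
wq, s = quantize_int8(w)
w_hat = dequantize(wq, s)
# Round-to-nearest bounds the per-weight error by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```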

Results: 2.8x speedup, but 2.5% accuracy loss on clinical Q&A benchmarks. For patient safety, 2.5% is too risky—we need near-perfect medical accuracy.

INT4 + FP16 Hybrid

A compromise: quantize most layers to INT4, keep attention layers in FP16. Reduces model size to 3.5GB while minimizing accuracy loss.

Results: 3.2x speedup, 0.8% accuracy loss. Better, but still measurable impact on medical reasoning.

FP8 Quantization

Use NVIDIA's FP8 format (native to H100 tensor cores; on A100s, vLLM supports FP8 weights via a software path). 8-bit floating point maintains better numerical precision than INT8 at the same bit width.

FP8 uses 1 sign bit + 4 exponent bits + 3 mantissa bits
~4x memory reduction vs FP32
Maintains numerical stability better than INT8
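To make the bit layout concrete, here is a minimal decoder for the E4M3 variant described above. This is my illustrative sketch: it ignores the NaN encoding and other special cases defined in the OCP FP8 specification, and the function name is mine:

```python
def decode_e4m3(byte: int) -> float:
    """Decode an FP8 E4M3 value: 1 sign bit, 4 exponent bits (bias 7),
    3 mantissa bits. NaN and other reserved encodings are not handled."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0:  # subnormal: no implicit leading 1, minimum exponent -6
        return sign * (mant / 8) * 2.0 ** -6
    return sign * (1 + mant / 8) * 2.0 ** (exp - 7)

# 0b0_0111_000: exponent field 7 (unbiased 0), mantissa 0 -> 1.0
assert decode_e4m3(0b00111000) == 1.0
# 0b0_1000_100: 2^1 * (1 + 4/8) -> 3.0
assert decode_e4m3(0b01000100) == 3.0
```

The wide dynamic range from the 4 exponent bits is what lets FP8 track outlier weights that INT8's uniform grid clips or crushes.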

Results: 2.9x speedup, 0.3% accuracy loss. We chose this as the baseline for further optimization.

Optimization 2: vLLM Serving

vLLM is an open-source LLM serving engine that dramatically speeds up inference through two key innovations:

Paged Attention

Transformers cache the key-value (KV) states for all previous tokens during generation. For a 256-token output, the cache grows one token at a time, and naive contiguous per-request allocation fragments GPU memory.

vLLM's paged attention divides the KV cache into fixed-size pages (e.g., 16 tokens per page), allowing better memory utilization and reducing fragmentation.

Impact: Reduced memory fragmentation from ~40% to ~10%, enabling larger batch sizes and better GPU utilization.
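A toy page-table allocator shows the bookkeeping involved. This is an illustrative sketch only: real vLLM manages physical GPU blocks with copy-on-write sharing, and all names here are mine:

```python
PAGE_SIZE = 16  # tokens per KV-cache page, as described above

class PagedKVCache:
    """Toy page-table allocator: requests map to fixed-size pages drawn
    from a shared free list, so memory is only wasted inside the last page."""
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.page_table: dict[str, list[int]] = {}  # request id -> page indices

    def grow_to(self, req_id: str, token_count: int) -> None:
        """Ensure the request has enough pages for token_count tokens."""
        pages = self.page_table.setdefault(req_id, [])
        needed = -(-token_count // PAGE_SIZE)  # ceiling division
        while len(pages) < needed:
            pages.append(self.free_pages.pop())

    def release(self, req_id: str) -> None:
        """Return a finished request's pages to the shared pool."""
        self.free_pages.extend(self.page_table.pop(req_id, []))

cache = PagedKVCache(num_pages=64)
cache.grow_to("req-A", token_count=40)  # 40 tokens -> 3 pages of 16
assert len(cache.page_table["req-A"]) == 3
cache.release("req-A")
assert len(cache.free_pages) == 64
```

Because pages are fixed-size and pooled, a finished request's memory is immediately reusable by any other request, which is where the fragmentation win comes from.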

Continuous Batching

Instead of waiting for every request in a batch to finish before scheduling new work, vLLM schedules generation token by token: each step produces one token for every in-flight request, and as soon as a request completes, a queued request takes its batch slot. A short request never waits on a long one.

```
# Traditional batching: wait for the slowest request
# Time: 0 -------- [Request A] -------- [Request B]
#
# Continuous batching: interleave token generation
# Token 1: [A B C]
# Token 2: [A B C]   (C finished, new request D replaces C)
# Token 3: [A B D]
```

Switching to vLLM alone improved throughput 2.1x and reduced latency to 3.9 seconds (P95).
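The interleaving in the diagram above can be reproduced with a small scheduler simulation. This is an illustrative sketch (real vLLM schedules at kernel granularity, with preemption and memory-aware admission); the function and the fixed slot count are mine:

```python
from collections import deque

def continuous_batch(requests: dict[str, int], max_batch: int = 3) -> list[list[str]]:
    """Simulate continuous batching: each step emits one token for every
    active request; a finished request's slot is refilled immediately.
    `requests` maps request id -> number of tokens to generate."""
    queue = deque(requests)
    remaining = dict(requests)
    active: list[str] = []
    steps: list[list[str]] = []
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())   # admit queued requests
        steps.append(list(active))           # one decode step for all active
        for rid in list(active):
            remaining[rid] -= 1
            if remaining[rid] == 0:
                active.remove(rid)           # freed slot, refilled next step

    return steps

# Matches the timeline above: C finishes after step 2, D takes its slot.
steps = continuous_batch({"A": 3, "B": 3, "C": 2, "D": 1})
assert steps == [["A", "B", "C"], ["A", "B", "C"], ["A", "B", "D"]]
```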

Optimization 3: Custom CUDA Kernels

Generic operations still had overhead. We wrote custom CUDA kernels for two bottleneck operations:

Fused Attention Kernel

Standard attention (Q, K, V tensors) requires 5 GPU kernel launches:

```
attention(Q, K, V):
    S      = matmul(Q, K^T) / sqrt(d_k)   # kernels 1-2: matmul, then scale
    S_mask = apply_causal_mask(S)         # kernel 3: elementwise mask
    P      = softmax(S_mask)              # kernel 4: softmax
    O      = matmul(P, V)                 # kernel 5: matmul
    return O
```

Each kernel launch has overhead. Our fused kernel combines all 5 operations into a single CUDA kernel, eliminating launch overhead and enabling better memory access patterns.
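As a reference point, here is the unfused computation in NumPy, where each labeled step corresponds to at least one separate kernel launch on the GPU (single head, no batching; a sketch, not our kernel code):

```python
import numpy as np

def naive_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Unfused causal attention: every step below is a separate pass over
    memory, which is exactly what the fused kernel eliminates."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                       # matmul + scale
    mask = np.triu(np.ones_like(S, dtype=bool), 1)   # strictly-upper = future
    S = np.where(mask, -np.inf, S)                   # elementwise causal mask
    P = np.exp(S - S.max(axis=-1, keepdims=True))    # numerically stable softmax
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                     # matmul

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
O = naive_attention(Q, K, V)
assert O.shape == (4, 8)
# The first position can only attend to itself, so its output equals V[0].
assert np.allclose(O[0], V[0])
```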

Fused Attention Complexity:

Naive: O(n²d) memory reads (Q, K, V loaded separately)
Fused: O(n²) memory reads (tiles cached in shared memory)

Where n = sequence length, d = head dimension

Results: 1.3x speedup on the attention operation, which is ~40% of total inference time.

Quantized MatMul Kernel

For FP8 matrix multiplications, we tuned the CUTLASS library template for our specific tensor shapes (batch_size=128, seq_len=512, d_model=4096).

```cpp
// Tuned matmul: FP8 input, FP32 output
// M=128, N=4096, K=4096
using MatMulOp = cutlass::gemm::device::Gemm<
    cutlass::float_e4m3_t,            // element_a (FP8, E4M3)
    cutlass::layout::RowMajor,        // layout_a
    cutlass::float_e4m3_t,            // element_b (FP8, E4M3)
    cutlass::layout::RowMajor,        // layout_b
    float,                            // element_c
    cutlass::layout::RowMajor,        // layout_c
    float,                            // element_accumulator
    cutlass::arch::OpMultiplyAddFastF32>;
```

Results: 1.2x speedup on matmul, which is ~50% of inference time.

End-to-End Optimization Results

Combining all three optimizations:

- Latency reduction: 4.5x (8.2s → 1.8s, P95)
- Throughput increase: 6.8x (12 req/sec → 82 req/sec)
- Cost per request: −68% ($0.08 → $0.025)
- Accuracy loss: 0.3% (clinical safety maintained)
| Optimization | Speedup | Cumulative latency |
| --- | --- | --- |
| Baseline (FP32) | 1.0x | 8.2s |
| FP8 Quantization | 2.9x | 2.8s |
| vLLM + Paged Attention | 1.4x | 2.0s |
| Fused Attention Kernel | 1.18x | 1.7s |
| Quantized MatMul Kernel | 1.05x | 1.62s |
| Final (P95) | 5.0x | 1.8s* |

*P95 accounts for network latency (~100ms) and request queuing. Raw inference is 1.7s.

Scaling to Production: Multi-GPU Cluster

Single-GPU inference is fast, but at scale, we need to handle traffic spikes. We deploy across 8x A100 GPUs in a load-balanced cluster:

```
# Synthure Inference Cluster Configuration
GPU Cluster:
  - 8x A100 (80GB memory)
  - 1x Load Balancer (NVIDIA Triton)
  - Round-robin request distribution
  - Dynamic batching (group requests arriving within 50ms)

Per-GPU Throughput:
  - 82 requests/sec (batch-optimized)
  - 8 GPUs → 656 req/sec cluster capacity

Cost:
  - $3.06/GPU-hour
  - 8 GPUs × 24 hours = $588/day
  - 656 req/sec × 86,400 sec/day = 56.7M req/day
  - Cost per request: $0.01 (at full capacity)
```

In practice, we operate at 40-60% capacity (due to traffic patterns), which puts cost at $0.017-0.025 per request.

Latency Under Load

Latency degrades gracefully under load. At 60% cluster utilization, P95 latency increases from 1.8s to 2.1s due to queueing; at 95% utilization, it reaches 3.2s.

Latency vs. Cluster Utilization

[Chart: P95 latency vs. cluster utilization. 20%: 1.8s, 60%: 2.1s, 95%: 3.2s]

Real-World Deployment: Lessons Learned

Theory met practice when we deployed to production. Key insights:

Network Latency Dominates at Scale

We achieved 1.8s P95 latency, but production APIs added network overhead. API request → load balancer → GPU cluster → response takes ~200-300ms at P95. Our target shifted from 1.8s inference to 2.0s end-to-end.

Batch Size Tuning is Delicate

We set batch size to 128 for optimal GPU utilization. But during quiet hours, requests arrive slowly, and holding a request for 50ms (to batch) violates our latency target. Solution: dynamic batch size based on arrival rate.
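One simple version of that policy sizes the batch to the arrivals actually expected within the wait budget. This is a hypothetical sketch of the idea, not our exact production logic, and all names are mine:

```python
def dynamic_batch_size(arrival_rate_hz: float,
                       max_wait_ms: float = 50.0,
                       max_batch: int = 128) -> int:
    """Pick a batch size from the expected arrivals within the wait budget,
    so quiet-hour requests are not held just to fill a large batch."""
    expected = int(arrival_rate_hz * max_wait_ms / 1000.0)
    return max(1, min(max_batch, expected))

assert dynamic_batch_size(10_000) == 128  # busy: full batch of 128
assert dynamic_batch_size(40) == 2        # quiet: dispatch once ~2 arrive
assert dynamic_batch_size(5) == 1         # near-idle: no artificial waiting
```

At high traffic this converges to the GPU-optimal batch of 128; at low traffic a lone request is dispatched immediately instead of burning 50ms of its latency budget.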

Quantization Affects Edge Cases

FP8 quantization introduced rare accuracy regressions (the 0.3% measured above). Most were benign (slightly different phrasing), but in one or two patient cases the model hesitated on complex drug interactions. We added a fallback: if confidence < 0.85, re-run the request in FP32 (adds ~3s but restores full-precision accuracy).
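The fallback reduces to a confidence gate. The sketch below uses stub models, and the `.generate()` interface returning a (text, confidence) pair is hypothetical, purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class StubModel:
    """Stand-in for an inference endpoint; real serving would call the
    FP8 or FP32 deployment. Interface is hypothetical."""
    text: str
    confidence: float

    def generate(self, prompt: str) -> tuple[str, float]:
        return self.text, self.confidence

def translate_with_fallback(prompt: str, fp8_model, fp32_model,
                            threshold: float = 0.85) -> str:
    """Serve the fast FP8 path (~1.8s); re-run in full precision (~3s)
    whenever confidence drops below the threshold."""
    text, confidence = fp8_model.generate(prompt)
    if confidence < threshold:
        text, confidence = fp32_model.generate(prompt)
    return text

fp8 = StubModel("fast answer", 0.70)       # low confidence -> falls back
fp32 = StubModel("precise answer", 0.99)
assert translate_with_fallback("drug interaction?", fp8, fp32) == "precise answer"
```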

Conclusion: Performance Enables Adoption

Sub-2 second inference isn't just a nice-to-have. In clinical settings, it's the difference between a tool clinicians use and one they ignore. By combining quantization, vLLM's paged attention, and custom CUDA kernels, we achieved 4.5x speedup while maintaining 99.7% accuracy.

The real lesson: infrastructure optimization is inseparable from product success. Every 100ms we save is a doctor who spends less time waiting and more time with patients.

About this post: This technical deep dive reflects 6 months of infrastructure optimization at Synthure, from initial profiling to production deployment. CUDA kernel code was written using NVIDIA's CUTLASS library. vLLM integration tested with PyTorch 2.1 and CUDA 12.1. Published: February 2026.