# LLM Inference Optimization
Serving large language models in production requires balancing latency, throughput, and cost. This guide covers the key techniques I use in practice.
## Quantization
Quantization reduces model weight precision to lower memory footprint and increase throughput.
| Method | Precision | Quality Loss | Speed Gain |
|---|---|---|---|
| FP16 | 16-bit | Minimal | ~2x vs FP32 |
| INT8 | 8-bit | Low | ~3–4x |
| INT4 | 4-bit | Moderate | ~5–6x |
Recommended tools: bitsandbytes, GPTQ, AWQ, llama.cpp
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```
## Batching Strategies
### Static Batching
Group incoming requests into fixed-size batches and run each batch to completion. Simple, but every sequence is padded to the longest one in the batch, so GPU time is wasted whenever lengths differ.
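As a rough illustration (a toy calculation, not tied to any serving framework), the padding overhead of a static batch is easy to quantify:

```python
def padding_waste(seq_lens):
    """Fraction of batch tokens that are padding under static batching."""
    max_len = max(seq_lens)           # every sequence is padded to this length
    total = max_len * len(seq_lens)   # tokens actually processed
    useful = sum(seq_lens)            # tokens that carry real content
    return (total - useful) / total

# A batch of 4 requests with very different lengths:
print(padding_waste([512, 64, 128, 32]))  # → 0.640625, i.e. ~64% padding
```

With continuous batching (next section) this waste largely disappears, because slots are refilled as soon as a short sequence finishes.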
### Continuous Batching (Iteration-level Scheduling)
New requests join the batch as soon as a slot frees up. Dramatically improves GPU utilization.
Tools that implement this: vLLM, TGI (Text Generation Inference), TensorRT-LLM
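A toy scheduler sketch (hypothetical, with abstract requests standing in for sequences) shows the core idea: free slots are refilled at every decode iteration rather than once per batch:

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """requests: list of (id, tokens_to_generate). Returns finish step per id."""
    queue = deque(requests)
    active = {}      # request id -> tokens still to generate
    finished = {}    # request id -> step at which it completed
    step = 0
    while queue or active:
        # Admit queued requests into free slots at EVERY iteration --
        # this is the difference from static batching.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        step += 1
        # One decode iteration: each active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]      # slot frees up immediately
                finished[rid] = step
    return finished

print(continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch=2))
# → {'b': 1, 'a': 3, 'c': 3}: "c" takes over "b"'s slot at step 2
# instead of waiting for the whole batch to drain.
```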
## KV Cache Management
The key-value cache stores attention states to avoid recomputation. Efficient KV cache management is the single biggest lever for throughput.
- PagedAttention (vLLM): Treats KV cache like virtual memory pages — eliminates fragmentation
- Prefix caching: Reuse KV cache across requests with shared system prompts
- Sliding window attention: Cap cache size for very long contexts
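A back-of-envelope calculation shows why the KV cache dominates memory: per sequence it is 2 (K and V) × layers × KV heads × head dim × sequence length × bytes per element. Using Llama-3-8B-like numbers (32 layers, 8 KV heads under grouped-query attention, head dim 128, fp16):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-sequence KV cache size: one K and one V tensor for every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Llama-3-8B-like config at an 8K context, fp16:
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.2f} GiB")  # → 1.00 GiB for a single sequence
```

One gibibyte per 8K-token sequence is why fragmentation-free allocation (PagedAttention) and prefix reuse translate directly into more concurrent requests per GPU.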
## Speculative Decoding
Use a small draft model to propose multiple tokens, then verify them with the large target model in a single parallel forward pass. With the standard acceptance rule the output matches what the target model would have produced, typically at a 2–3x latency reduction.
```
Draft model  (e.g. 1B)  → proposes k tokens
Target model (e.g. 70B) → verifies all k tokens in one forward pass
```
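A minimal greedy sketch of the propose/verify loop, with hypothetical stand-in "models" (plain functions mapping a prefix to the next token) instead of real networks:

```python
def speculative_step(prefix, draft, target, k=4):
    """One speculative decoding step: returns the tokens emitted this step."""
    # Draft model autoregressively proposes k cheap tokens.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target checks every position (a single batched forward pass in a
    # real system) and accepts the longest agreeing prefix.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    # The target always contributes one token past the accepted prefix,
    # so each step emits at least one token even if the draft is wrong.
    accepted.append(target(ctx))
    return accepted

# Toy deterministic models: the draft agrees with the target except at
# context length 5, so three proposals are accepted plus one correction.
target = lambda ctx: len(ctx) % 3
draft = lambda ctx: 9 if len(ctx) == 5 else len(ctx) % 3

print(speculative_step([0, 1], draft, target))  # → [2, 0, 1, 2]
```

Four tokens come out of a single "target pass" here, which is exactly where the latency win comes from; real implementations use rejection sampling over token probabilities rather than exact greedy matching.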
## Choosing a Serving Framework
| Framework | Best For |
|---|---|
| vLLM | High-throughput API serving |
| Ollama | Local development |
| TGI | HuggingFace ecosystem |
| TensorRT-LLM | NVIDIA GPU, max performance |
| llama.cpp | CPU / edge deployment |
## Key Metrics to Track
- TTFT (Time to First Token) — user-perceived latency
- TPOT (Time Per Output Token) — streaming speed
- Throughput — tokens/sec across all concurrent users
- GPU utilization — aim for high MFU (Model FLOP Utilization), not just raw `nvidia-smi` utilization, which can read 100% while compute stalls on memory
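TTFT and TPOT fall straight out of per-token wall-clock timestamps; a minimal sketch (function and variable names are illustrative, not from any particular framework):

```python
def streaming_metrics(request_start, token_times):
    """TTFT and TPOT for one request, from token timestamps in seconds."""
    ttft = token_times[0] - request_start                 # time to first token
    # TPOT: average gap between output tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

# Hypothetical trace: first token after 0.5 s, then one token every 20 ms.
ttft, tpot = streaming_metrics(0.0, [0.5, 0.52, 0.54, 0.56, 0.58])
print(ttft, round(tpot, 4))  # → 0.5 0.02
```

Aggregate throughput is then total output tokens divided by wall-clock time across all concurrent requests, which is why it can keep rising while per-request TPOT degrades.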