LLM Inference Optimization

Serving large language models in production requires balancing latency, throughput, and cost. This guide covers the key techniques I use in practice.

Quantization

Quantization reduces model weight precision to lower memory footprint and increase throughput.

Method  Precision  Quality Loss  Speed Gain
FP16    16-bit     Minimal      ~2x vs FP32
INT8    8-bit      Low          ~3–4x
INT4    4-bit      Moderate     ~5–6x
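The memory savings in the table follow directly from bits-per-weight arithmetic. A minimal sketch, using a hypothetical 8B-parameter model (weights only; KV cache and activations add more on top):

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Weight footprint in GB: params * bits / 8 bytes, 1 GB = 1e9 bytes."""
    return num_params * bits_per_weight / 8 / 1e9

params = 8e9  # illustrative 8B-parameter model
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(params, bits):.1f} GB")
# FP32: 32.0 GB, FP16: 16.0 GB, INT8: 8.0 GB, INT4: 4.0 GB
```

In practice quantized formats also store per-group scale factors, so real footprints run slightly above these lower bounds.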

Recommended tools: bitsandbytes, GPTQ, AWQ, llama.cpp

Example: loading a model in 4-bit NF4 with bitsandbytes via transformers:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",                      # place layers across available devices
)

Batching Strategies

Static Batching

Group requests into fixed-size batches. Simple, but every sequence is padded to the longest one in the batch, so GPU time is wasted whenever lengths differ.
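The padding waste is easy to quantify. A toy sketch (sequence lengths are illustrative):

```python
def padding_waste(lengths: list[int]) -> float:
    """Fraction of batch compute spent on pad tokens when all sequences
    are padded to the longest one in the batch."""
    max_len = max(lengths)
    total = max_len * len(lengths)   # tokens actually processed
    useful = sum(lengths)            # tokens that carry real content
    return (total - useful) / total

# A batch mixing short and long sequences wastes most of its compute:
print(padding_waste([32, 64, 512, 1024]))  # ~0.60
```

With one 1024-token sequence and three much shorter ones, roughly 60% of the batch's compute goes to padding, which is exactly the inefficiency continuous batching removes.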

Continuous Batching (Iteration-level Scheduling)

New requests join the batch as soon as a slot frees up. Dramatically improves GPU utilization.

Tools that implement this: vLLM, TGI (Text Generation Inference), TensorRT-LLM
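The scheduling idea can be sketched in a few lines. This is a toy simulation, not vLLM's actual scheduler (which also tracks KV-cache blocks, preemption, and prefill): sequences leave the batch the moment they finish, and queued requests take the freed slots at iteration granularity.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy iteration-level scheduler.
    requests: list of (request_id, tokens_to_generate)."""
    queue = deque(requests)
    active = {}      # request_id -> tokens remaining
    completed = []
    while queue or active:
        # Refill free slots after every decode step, not every batch.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                completed.append(rid)
    return completed

print(continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]))
# ['c', 'a', 'd', 'e', 'b']
```

Note that "e" starts decoding as soon as "c" finishes, instead of waiting for the whole first batch to drain as static batching would require.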

KV Cache Management

The key-value cache stores attention states to avoid recomputation. Efficient KV cache management is the single biggest lever for throughput.

  • PagedAttention (vLLM): Treats KV cache like virtual memory pages — eliminates fragmentation
  • Prefix caching: Reuse KV cache across requests with shared system prompts
  • Sliding window attention: Cap cache size for very long contexts
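KV cache pressure is why these techniques matter: per token, the cache stores a key and a value vector for every layer and KV head. A back-of-the-envelope sketch, using illustrative Llama-3-8B-like shapes (32 layers, 8 KV heads under GQA, head dim 128, fp16):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer per KV head per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

gb = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=16, dtype_bytes=2) / 1e9
print(f"{gb:.1f} GB")  # ~17.2 GB for 16 concurrent 8k-token sequences
```

Sixteen concurrent 8k-token requests consume on the order of 17 GB just for cache, which is why fragmentation-free allocation (PagedAttention) and cache reuse (prefix caching) translate directly into higher batch sizes.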

Speculative Decoding

Use a small draft model to propose k tokens, then verify them with the large model in a single parallel forward pass. With proper rejection sampling the output distribution matches the target model's exactly, typically cutting latency by 2–3x.

Draft model (e.g. 1B) → proposes k tokens
Target model (e.g. 70B) → verifies all k tokens in one forward pass
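A simplified greedy version of the verification step (real implementations use rejection sampling over full distributions; here `target_argmax` is a hypothetical stand-in for the target model's one batched forward pass over all k positions):

```python
def speculative_step(draft_tokens, target_argmax):
    """Accept the draft's tokens up to the first position where the
    target disagrees, then emit the target's own token there.
    draft_tokens: k proposed tokens.
    target_argmax(i): token the target model would emit at position i."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok == target_argmax(i):
            accepted.append(tok)
        else:
            accepted.append(target_argmax(i))  # correction token
            break
    return accepted

# Target agrees on the first two draft tokens, disagrees on the third:
target = lambda i: [10, 11, 99, 13][i]
print(speculative_step([10, 11, 12, 13], target))  # [10, 11, 99]
```

Each target forward pass thus yields between 1 and k+1 tokens instead of exactly one, which is where the speedup comes from; it is largest when the draft model agrees with the target often.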

Choosing a Serving Framework

Framework     Best For
vLLM          High-throughput API serving
Ollama        Local development
TGI           HuggingFace ecosystem
TensorRT-LLM  NVIDIA GPUs, max performance
llama.cpp     CPU / edge deployment

Key Metrics to Track

  • TTFT (Time to First Token) — user-perceived latency
  • TPOT (Time Per Output Token) — streaming speed
  • Throughput — tokens/sec across all concurrent users
  • GPU utilization — prefer MFU (Model FLOPs Utilization) over raw nvidia-smi utilization, which can read near 100% while compute units sit idle
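TTFT and TPOT fall out of per-token arrival timestamps on the client side. A minimal sketch (timestamps are illustrative, in seconds):

```python
def streaming_metrics(request_time, token_times):
    """TTFT: delay until the first token arrives.
    TPOT: average inter-token gap after the first token."""
    ttft = token_times[0] - request_time
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

ttft, tpot = streaming_metrics(0.0, [0.35, 0.40, 0.45, 0.50, 0.55])
print(ttft, tpot)  # TTFT 0.35 s, TPOT ~0.05 s/token
```

TTFT is dominated by prefill (and queueing), while TPOT reflects decode speed, so the two metrics respond to different optimizations: prefix caching mainly improves TTFT, quantization and continuous batching mainly improve TPOT and throughput.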