# LLM Inference Optimization
Serving large language models in production requires balancing latency, throughput, and cost. This guide covers the key techniques I use in practice.
## Quantization
Quantization reduces model weight precision to lower memory footprint and increase throughput.
| Method | Precision | Quality Loss | Speed Gain |
|---|---|---|---|
| FP16 | 16-bit | Minimal | ~2x vs FP32 |
| INT8 | 8-bit | Low | ~3–4x |
| INT4 | 4-bit | Moderate | ~5–6x |
Recommended tools: bitsandbytes, GPTQ, AWQ, llama.cpp
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```
## Batching Strategies
### Static Batching
Group incoming requests into fixed-size batches and run each batch to completion. Simple, but every sequence is padded to the longest one in the batch, so GPU time is wasted whenever lengths differ.
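As a rough illustration (a toy calculation, not tied to any serving framework), the padding overhead of a static batch is easy to quantify:

```python
def padding_waste(seq_lens):
    """Fraction of batch tokens that are padding under static batching."""
    max_len = max(seq_lens)           # every sequence is padded to this length
    total = max_len * len(seq_lens)   # tokens actually processed
    useful = sum(seq_lens)            # tokens that carry real content
    return (total - useful) / total

# A batch of 4 requests with very different lengths:
print(padding_waste([512, 64, 128, 32]))  # → 0.640625, i.e. ~64% padding
```

With continuous batching (next section) this waste largely disappears, because slots are refilled as soon as a short sequence finishes.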
### Continuous Batching (Iteration-level Scheduling)
New requests join the batch as soon as a slot frees up. Dramatically improves GPU utilization.
Tools that implement this: vLLM, TGI (Text Generation Inference), TensorRT-LLM
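A toy scheduler sketch (hypothetical, with abstract requests standing in for sequences) shows the core idea: free slots are refilled at every decode iteration rather than once per batch:

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """requests: list of (id, tokens_to_generate). Returns finish step per id."""
    queue = deque(requests)
    active = {}      # request id -> tokens still to generate
    finished = {}    # request id -> step at which it completed
    step = 0
    while queue or active:
        # Admit queued requests into free slots at EVERY iteration --
        # this is the difference from static batching.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        step += 1
        # One decode iteration: each active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]      # slot frees up immediately
                finished[rid] = step
    return finished

print(continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch=2))
# → {'b': 1, 'a': 3, 'c': 3}: "c" takes over "b"'s slot at step 2
# instead of waiting for the whole batch to drain.
```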
## KV Cache Management
The key-value cache stores attention states to avoid recomputation. Efficient KV cache management is the single biggest lever for throughput.
- PagedAttention (vLLM): Treats KV cache like virtual memory pages — eliminates fragmentation
- Prefix caching: Reuse KV cache across requests with shared system prompts
- Sliding window attention: Cap cache size for very long contexts
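A back-of-envelope calculation shows why the KV cache dominates memory: per sequence it is 2 (K and V) × layers × KV heads × head dim × sequence length × bytes per element. Using Llama-3-8B-like numbers (32 layers, 8 KV heads under grouped-query attention, head dim 128, fp16):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-sequence KV cache size: one K and one V tensor for every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Llama-3-8B-like config at an 8K context, fp16:
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.2f} GiB")  # → 1.00 GiB for a single sequence
```

One gibibyte per 8K-token sequence is why fragmentation-free allocation (PagedAttention) and prefix reuse translate directly into more concurrent requests per GPU.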
## Speculative Decoding
Use a small draft model to propose multiple tokens, then verify them with the large target model in a single parallel forward pass. With the standard acceptance rule the output matches what the target model would have produced, typically at a 2–3x latency reduction.
```
Draft model  (e.g. 1B)  → proposes k tokens
Target model (e.g. 70B) → verifies all k tokens in one forward pass
```
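A minimal greedy sketch of the propose/verify loop, with hypothetical stand-in "models" (plain functions mapping a prefix to the next token) instead of real networks:

```python
def speculative_step(prefix, draft, target, k=4):
    """One speculative decoding step: returns the tokens emitted this step."""
    # Draft model autoregressively proposes k cheap tokens.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target checks every position (a single batched forward pass in a
    # real system) and accepts the longest agreeing prefix.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    # The target always contributes one token past the accepted prefix,
    # so each step emits at least one token even if the draft is wrong.
    accepted.append(target(ctx))
    return accepted

# Toy deterministic models: the draft agrees with the target except at
# context length 5, so three proposals are accepted plus one correction.
target = lambda ctx: len(ctx) % 3
draft = lambda ctx: 9 if len(ctx) == 5 else len(ctx) % 3

print(speculative_step([0, 1], draft, target))  # → [2, 0, 1, 2]
```

Four tokens come out of a single "target pass" here, which is exactly where the latency win comes from; real implementations use rejection sampling over token probabilities rather than exact greedy matching.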
## Choosing a Serving Framework
| Framework | Best For |
|---|---|
| vLLM | High-throughput API serving |
| Ollama | Local development |
| TGI | HuggingFace ecosystem |
| TensorRT-LLM | NVIDIA GPU, max performance |
| llama.cpp | CPU / edge deployment |
## Key Metrics to Track
- TTFT (Time to First Token) — user-perceived latency
- TPOT (Time Per Output Token) — streaming speed
- Throughput — tokens/sec across all concurrent users
- GPU utilization — aim for high MFU (Model FLOP Utilization), not just raw `nvidia-smi` utilization, which can read 100% while compute stalls on memory
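TTFT and TPOT fall straight out of per-token wall-clock timestamps; a minimal sketch (function and variable names are illustrative, not from any particular framework):

```python
def streaming_metrics(request_start, token_times):
    """TTFT and TPOT for one request, from token timestamps in seconds."""
    ttft = token_times[0] - request_start                 # time to first token
    # TPOT: average gap between output tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

# Hypothetical trace: first token after 0.5 s, then one token every 20 ms.
ttft, tpot = streaming_metrics(0.0, [0.5, 0.52, 0.54, 0.56, 0.58])
print(ttft, round(tpot, 4))  # → 0.5 0.02
```

Aggregate throughput is then total output tokens divided by wall-clock time across all concurrent requests, which is why it can keep rising while per-request TPOT degrades.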