# Prompt Engineering for Production
Prompts are the primary interface between your application and an LLM. Small changes have outsized effects on output quality, cost, and reliability.
## Core Principles
### Be Explicit About Format

LLMs default to verbose prose. Always specify the output format you need.

```
# Vague
Summarize this article.

# Explicit
Summarize the following article in 3 bullet points, each under 15 words.
Focus on technical findings, not background context.
```
### Give the Model a Role

System prompts that assign a clear persona improve consistency.

```
You are a senior software engineer reviewing a pull request.
Your feedback is precise, actionable, and backed by reasoning.
You do not suggest changes that are purely stylistic.
```
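In practice the persona goes into the system message so it applies to every turn of the conversation. A minimal sketch of assembling the message list (the helper name and `REVIEWER_PERSONA` constant are illustrative, not from any SDK):

```python
REVIEWER_PERSONA = (
    "You are a senior software engineer reviewing a pull request.\n"
    "Your feedback is precise, actionable, and backed by reasoning.\n"
    "You do not suggest changes that are purely stylistic."
)

def build_messages(user_content: str) -> list[dict]:
    # The system message pins the persona; the user message carries the task.
    return [
        {"role": "system", "content": REVIEWER_PERSONA},
        {"role": "user", "content": user_content},
    ]
```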
### Chain of Thought for Reasoning Tasks

For classification or analysis, ask the model to reason before answering.

```
Analyze the customer complaint below.

First, identify the core issue in one sentence.
Then, classify it as one of: [billing, technical, account, other].
Finally, rate urgency as high/medium/low with a one-line justification.

Complaint: {complaint_text}
```
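A staged prompt like this still returns free text, so the caller needs a small parser to recover the final labels. A sketch under the assumption that the answer names exactly one category and one urgency keyword (the helper and fallback defaults are illustrative):

```python
import re

CATEGORIES = ["billing", "technical", "account", "other"]
URGENCY_LEVELS = ["high", "medium", "low"]

def parse_cot_answer(text: str) -> dict:
    # Scan the answer for the first category/urgency keyword it mentions;
    # fall back to safe defaults when the model drifts off-format.
    words = set(re.findall(r"[a-z]+", text.lower()))
    category = next((c for c in CATEGORIES if c in words), "other")
    urgency = next((u for u in URGENCY_LEVELS if u in words), "medium")
    return {"category": category, "urgency": urgency}
```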
### Few-Shot Examples

Include 2–3 input/output examples for tasks where format precision matters.

```
Convert the following dates to ISO 8601 format.

Input: March 5th, 2024
Output: 2024-03-05

Input: 12 Jan 23
Output: 2023-01-12

Input: {date}
Output:
```
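Keeping the examples in data rather than hard-coded in the prompt string makes them easy to swap or extend. A minimal sketch (the `EXAMPLES` list and helper name are illustrative):

```python
EXAMPLES = [
    ("March 5th, 2024", "2024-03-05"),
    ("12 Jan 23", "2023-01-12"),
]

def build_few_shot_prompt(date: str) -> str:
    # Instruction first, then worked examples, then the open-ended query.
    lines = ["Convert the following dates to ISO 8601 format.", ""]
    for raw, iso in EXAMPLES:
        lines += [f"Input: {raw}", f"Output: {iso}", ""]
    lines += [f"Input: {date}", "Output:"]
    return "\n".join(lines)
```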
### Structured Output

Use JSON mode or function calling instead of parsing free text.

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class EntityExtraction(BaseModel):
    company: str
    role: str
    years_experience: int

# parse() validates the model's JSON against the Pydantic schema
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        # resume_text is supplied by the caller
        {"role": "user", "content": f"Extract info from: {resume_text}"}
    ],
    response_format=EntityExtraction,
)
result = completion.choices[0].message.parsed  # an EntityExtraction instance
```
## Cost and Latency Optimization
| Technique | Latency | Cost | Quality |
|---|---|---|---|
| Reduce prompt length | ↓ | ↓ | Neutral |
| Use smaller model for simple tasks | ↓↓ | ↓↓ | ↓ slightly |
| Cache identical requests (response cache) | ↓↓ | ↓↓ | Same |
| Prompt caching (Anthropic/OpenAI) | ↓ | ↓↓ | Same |
| Batch API | None | ↓↓ | Same |
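Client-side response caching can be as simple as an in-memory map keyed on model and prompt. A sketch (`llm_call` is a stand-in for your real API wrapper, not a library function):

```python
import hashlib

_response_cache: dict[str, str] = {}

def cached_call(model: str, prompt: str, llm_call) -> str:
    # Identical (model, prompt) pairs hit the cache instead of the API.
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = llm_call(model, prompt)
    return _response_cache[key]
```

For production, the same pattern usually moves to a shared store such as Redis with a TTL, since an in-process dict grows without bound and is lost on restart.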
### Prompt Caching

For requests that reuse a large system prompt or context block:

```python
# Anthropic prompt caching
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=1024,  # required by the Messages API
    system=[{
        "type": "text",
        "text": long_system_prompt,
        "cache_control": {"type": "ephemeral"}  # cache this block
    }],
    messages=[{"role": "user", "content": user_query}]
)
```
## Evaluation
Never deploy a prompt change without evaluating it. Build a small eval set (50–200 examples) and track:
- Accuracy on labeled examples
- Format compliance (does output match expected schema?)
- Regression rate on cases the old prompt handled correctly
```python
def evaluate_prompt(prompt_fn, eval_set):
    """Exact-match accuracy; swap in a fuzzier scorer for free-text outputs."""
    correct = 0
    for example in eval_set:
        output = prompt_fn(example["input"])
        if output == example["expected"]:
            correct += 1
    return correct / len(eval_set)
```
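Format compliance can be checked mechanically, without labeled data. A sketch that counts outputs parsing as JSON with the expected keys (the key set here is an illustrative example, not a fixed schema):

```python
import json

REQUIRED_KEYS = {"company", "role", "years_experience"}  # example schema

def format_compliance(outputs: list[str]) -> float:
    # Fraction of raw outputs that parse as JSON and carry every required key.
    ok = 0
    for raw in outputs:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
            ok += 1
    return ok / len(outputs)
```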
## Prompt Versioning
Treat prompts like code:
- Store in version control alongside your codebase
- Tag prompt versions that ship to production
- A/B test significant rewrites before full rollout
- Log which prompt version produced each response
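The logging point above can be sketched as a version tag plus a content hash, so every stored response pins the exact prompt text that produced it. All names and fields below are illustrative:

```python
import hashlib
import time

PROMPT_VERSION = "v3"  # bumped in version control with each prompt change
PROMPT_TEMPLATE = "Summarize the following article in 3 bullet points:\n{article}"

def prompt_fingerprint(template: str) -> str:
    # Short content hash: catches edits that forgot to bump the version tag.
    return hashlib.sha256(template.encode()).hexdigest()[:12]

def log_response(response_text: str) -> dict:
    # One record per completion, joinable back to the prompt that produced it.
    return {
        "ts": time.time(),
        "prompt_version": PROMPT_VERSION,
        "prompt_hash": prompt_fingerprint(PROMPT_TEMPLATE),
        "response": response_text,
    }
```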