The system prompt is the fixed tax paid on every single API call. Embedding a full FAQ document, a multi-page policy, or a large JSON schema in the system prompt means paying for thousands of tokens on every request — including the trivial ones. At 10,000 requests per day on GPT-4o at $2.50 per million input tokens, a bloated 3,000-token system prompt costs roughly $75/day in fixed overhead before any user message token is counted. OWASP LLM10 and ISO 25010 performance-efficiency flag this as architectural waste that compounds with volume.
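The fixed-overhead arithmetic above is easy to reproduce; a quick sketch using the same figures (prompt size, request volume, and the assumed GPT-4o input price):

```typescript
// Daily fixed overhead of a static system prompt, using the figures above.
const promptTokens = 3_000;        // bloated system prompt
const requestsPerDay = 10_000;
const inputPricePerMillion = 2.5;  // assumed GPT-4o input price, USD

const dailyOverheadUsd =
  (promptTokens * requestsPerDay / 1_000_000) * inputPricePerMillion;

console.log(`$${dailyOverheadUsd.toFixed(2)}/day`); // $75.00/day
```

At a million requests per day the same prompt costs $7,500/day, which is why the waste compounds with volume.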
High, because a bloated system prompt adds a fixed token cost to every request, so the overhead scales linearly with traffic.
Move static reference material out of the system prompt into a retrieval layer. Inject only the content relevant to the current query, not the entire corpus.
// Before — entire FAQ hardcoded in system prompt (3000+ tokens per call)
const systemPrompt = `You are a helpful assistant. Here is our FAQ:
Q: How do I reset my password? A: ...
[50 more Q&A pairs]`;
// After — concise system prompt; relevant FAQ injected per-request via RAG
const systemPrompt = `You are a customer support assistant.
Answer based on the provided context only. If the answer is not in the context, say so.`;
const relevantContext = await retrieveRelevantFAQ(userMessage);
After restructuring, verify the system prompt is under 1,500 tokens by running it through countMessageTokens([{ role: "system", content: systemPrompt }], model) and logging the result.
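Note that retrieveRelevantFAQ above is a hypothetical helper. A production version would typically rank FAQ entries by embedding similarity against a vector store; the sketch below substitutes simple keyword overlap so it stays self-contained, but the shape is the same: score, rank, take the top few, and format them as context.

```typescript
// Minimal stand-in for the hypothetical retrieveRelevantFAQ helper.
// Real systems rank by embedding similarity; keyword overlap is used here
// only to keep the sketch runnable without an embeddings API.
interface FaqEntry { question: string; answer: string; }

const faq: FaqEntry[] = [
  { question: "How do I reset my password?", answer: "Go to settings and click..." },
  { question: "How do I cancel my subscription?", answer: "Open billing and..." },
];

function words(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

async function retrieveRelevantFAQ(userMessage: string, topK = 2): Promise<string> {
  const query = words(userMessage);
  return faq
    .map((entry) => {
      const qWords = words(entry.question);
      let overlap = 0;
      for (const w of query) if (qWords.has(w)) overlap++;
      return { entry, overlap };
    })
    .filter((r) => r.overlap > 0)                 // drop unrelated entries
    .sort((a, b) => b.overlap - a.overlap)        // most relevant first
    .slice(0, topK)
    .map((r) => `Q: ${r.entry.question}\nA: ${r.entry.answer}`)
    .join("\n\n");
}
```

Only the matched entries are appended to the request, so per-call token cost scales with relevance rather than with corpus size.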
ID: ai-token-optimization.context-management.system-prompt-size
Severity: high
What to look for: Identify the system prompt definition — usually a constant string or template literal in src/lib/prompts.ts, prompts/system.ts, or inline in the API route handler. Estimate its token count using the approximation of 1 token per 4 characters of English text. Also check if the system prompt embeds large static content such as a full FAQ document, a multi-page policy, a JSON schema dump, or duplicated instructions. Before evaluating, extract and quote the longest system prompt found in the codebase to assess token efficiency. Count all instances found and enumerate each.
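The 4-characters-per-token rule of thumb translates into a one-line estimator (a screening heuristic for English text; exact counts require the model's tokenizer):

```typescript
// Rough token estimate: ~1 token per 4 characters of English text.
// Good enough for screening; exact counts need the model's tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// An ~8,000-character prompt estimates to ~2,000 tokens, the fail threshold
// used by this check.
```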
Pass criteria: The system prompt is concise and purpose-built. For a standard chatbot or assistant, under 800 tokens is excellent; under 1500 tokens is acceptable. The prompt does not contain static reference material that could instead be retrieved dynamically. Report even on pass: "X system prompts found. Average token count: Y. Largest prompt: Z tokens."
Fail criteria: The system prompt exceeds approximately 2000 tokens (roughly 8000 characters), or it contains hardcoded documents, large JSON schemas, full policy text, or extensive few-shot examples that inflate it beyond what is architecturally necessary. Do NOT pass if any single system prompt exceeds 4000 tokens without a clear justification for the length.
Skip (N/A) when: No system prompt is used — the application uses only user-turn prompts with no system-role message.
Signal: No string assigned to the system parameter in AI calls, and no message with role: "system" in the messages array.
Cross-reference: The prompt-template-efficiency check verifies the template structure that contributes to the token counts measured here.
Detail on fail: "System prompt is bloated — large static content consumes context on every request"
Remediation: A large system prompt increases the fixed token cost of every single API call. On high-traffic features, this compounds into significant cost and leaves less room for conversation history.
Move static reference material out of the system prompt and into a retrieval layer:
// Before — entire FAQ embedded in system prompt (3000+ tokens, every request)
const systemPrompt = `You are a helpful assistant. Here is our FAQ:
Q: How do I reset my password? A: Go to settings and click...
[50 more Q&A pairs]`;
// After — concise system prompt; FAQ retrieved and injected as context
const systemPrompt = `You are a helpful customer support assistant.
Answer based on the provided context. If the answer is not in the context, say so.`;
// Inject relevant FAQ entries via RAG based on the user's specific question
const relevantContext = await retrieveRelevantFAQ(userMessage);
After restructuring, verify the system prompt is under 1500 tokens by running it through a counter: countMessageTokens([{ role: "system", content: systemPrompt }], model).
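countMessageTokens is referenced above without a definition. A minimal sketch, assuming the same 4-characters-per-token approximation plus a few tokens of per-message framing overhead; a real implementation should use a tiktoken-compatible tokenizer for the target model:

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Approximate token count for a chat message list. Content is estimated at
// ~1 token per 4 characters; the per-message constant approximates the chat
// framing overhead OpenAI describes for its chat models.
// "model" is kept for signature parity; a real version would select the
// tokenizer based on it.
function countMessageTokens(messages: ChatMessage[], model: string): number {
  const FRAMING_TOKENS = 4; // per-message overhead, approximate
  return messages.reduce(
    (total, m) => total + FRAMING_TOKENS + Math.ceil(m.content.length / 4),
    0,
  );
}

const systemPrompt = `You are a helpful customer support assistant.
Answer based on the provided context. If the answer is not in the context, say so.`;
const tokens = countMessageTokens([{ role: "system", content: systemPrompt }], "gpt-4o");
if (tokens > 1500) console.warn(`System prompt too large: ${tokens} tokens`);
```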