System prompt token count is reasonable
Why it matters
The system prompt is the fixed tax paid on every single API call. Embedding a full FAQ document, a multi-page policy, or a large JSON schema in the system prompt means paying for thousands of tokens on every request — including the trivial ones. At 10,000 requests per day on GPT-4o at $2.50 per million input tokens, a bloated 3,000-token system prompt costs roughly $75/day in fixed overhead before any user message token is counted. OWASP LLM10 and ISO 25010 performance-efficiency flag this as architectural waste that compounds with volume.
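The arithmetic behind that figure can be reproduced with a small helper. This is a minimal sketch: `dailyPromptOverheadUsd` is a hypothetical name, and the 800-token comparison is an illustrative assumption using the same traffic and price as above.

```typescript
// Hypothetical helper: fixed daily cost of resending the system prompt on every request.
function dailyPromptOverheadUsd(
  promptTokens: number,
  requestsPerDay: number,
  usdPerMillionInputTokens: number,
): number {
  const tokensPerDay = promptTokens * requestsPerDay;
  return (tokensPerDay / 1_000_000) * usdPerMillionInputTokens;
}

// Figures from the text: 3,000 tokens, 10,000 requests/day, $2.50 per million input tokens.
const bloated = dailyPromptOverheadUsd(3_000, 10_000, 2.5); // 75

// Illustrative comparison: the same traffic with an 800-token prompt.
const trimmed = dailyPromptOverheadUsd(800, 10_000, 2.5); // 20
```

Because the cost is a straight product, halving the prompt halves the overhead at any traffic level.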
Severity rationale
High because the system prompt's token cost is paid again on every request, so a bloated prompt imposes fixed overhead that grows linearly with traffic.
Remediation
Move static reference material out of the system prompt into a retrieval layer. Inject only the content relevant to the current query, not the entire corpus.
// Before — entire FAQ hardcoded in system prompt (3000+ tokens per call)
const systemPrompt = `You are a helpful assistant. Here is our FAQ:
Q: How do I reset my password? A: ...
[50 more Q&A pairs]`;
// After — concise system prompt; relevant FAQ injected per-request via RAG
const systemPrompt = `You are a customer support assistant.
Answer based on the provided context only. Say so if the answer is not in the context.`;
const relevantContext = await retrieveRelevantFAQ(userMessage);
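The snippet above calls `retrieveRelevantFAQ` without defining it or showing how the result reaches the model. One possible shape is sketched below, assuming an in-memory FAQ list and naive keyword overlap as the relevance score; a production retrieval layer would more likely use embeddings and a vector index. `FaqEntry`, `FAQ_ENTRIES`, `keywords`, `buildMessages`, and the `topK` parameter are all hypothetical names, and the FAQ answers are invented placeholders.

```typescript
// Hypothetical FAQ store; in practice this would live in a database or vector index.
interface FaqEntry {
  question: string;
  answer: string;
}

const FAQ_ENTRIES: FaqEntry[] = [
  { question: "How do I reset my password?", answer: "Use the reset link on the sign-in page." },
  { question: "How do I change my billing plan?", answer: "Open Billing in account settings." },
  { question: "How do I export my data?", answer: "Request an export from the privacy page." },
];

// Lowercased words, ignoring words of three letters or fewer.
function keywords(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 3);
}

// Assumption: naive keyword overlap stands in for a real similarity search.
async function retrieveRelevantFAQ(userMessage: string, topK = 2): Promise<string> {
  const queryWords = keywords(userMessage);
  return FAQ_ENTRIES
    .map((entry) => {
      const entryWords = keywords(entry.question);
      const score = queryWords.filter((w) => entryWords.indexOf(w) !== -1).length;
      return { entry, score };
    })
    .filter((scored) => scored.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(({ entry }) => `Q: ${entry.question}\nA: ${entry.answer}`)
    .join("\n\n");
}

type ChatMessage = { role: "system" | "user"; content: string };

// Only the retrieved entries travel with the request; the system prompt stays short.
async function buildMessages(systemPrompt: string, userMessage: string): Promise<ChatMessage[]> {
  const context = await retrieveRelevantFAQ(userMessage);
  const contextBlock = context || "(no matching FAQ entries)";
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: `Context:\n${contextBlock}\n\nQuestion: ${userMessage}` },
  ];
}
```

Placing the retrieved context in the user turn keeps the system prompt identical across requests, so its token count can be measured once.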
After restructuring, verify the system prompt is under 1,500 tokens by running it through countMessageTokens([{ role: "system", content: systemPrompt }], model) and logging the result.
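The verification step can be turned into a check that fails loudly, for example in CI. The sketch below is an assumption-laden illustration: `assertSystemPromptWithinBudget`, `PromptMessage`, `TokenCounter`, and `approximateCounter` are hypothetical names, and the default counter is only a rough estimate of about 4 characters per token. Where the project's `countMessageTokens` is available, it can be passed in as the counter instead.

```typescript
type PromptMessage = { role: "system" | "user" | "assistant"; content: string };
type TokenCounter = (messages: PromptMessage[], model: string) => number;

// Assumption: rough stand-in for a real tokenizer (~4 characters per token of English text).
const approximateCounter: TokenCounter = (messages) =>
  messages.reduce((total, m) => total + Math.ceil(m.content.length / 4), 0);

// Returns the measured token count, or throws when the prompt is not under the budget.
function assertSystemPromptWithinBudget(
  systemPrompt: string,
  model: string,
  maxTokens = 1500,
  countTokens: TokenCounter = approximateCounter,
): number {
  const tokens = countTokens([{ role: "system", content: systemPrompt }], model);
  if (tokens >= maxTokens) {
    throw new Error(`System prompt is ${tokens} tokens; budget is under ${maxTokens}.`);
  }
  return tokens;
}

// Usage with the concise prompt from the example above.
const concisePrompt = `You are a customer support assistant.
Answer based on the provided context only. Say so if the answer is not in the context.`;
const conciseTokens = assertSystemPromptWithinBudget(concisePrompt, "gpt-4o");
```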
Detection
- ID: system-prompt-size
- Severity: high
- What to look for: Identify the system prompt definition, usually a constant string or template literal in src/lib/prompts.ts, prompts/system.ts, or inline in the API route handler. Estimate its token count using the approximation of 1 token per 4 characters of English text. Also check if the system prompt embeds large static content such as a full FAQ document, a multi-page policy, a JSON schema dump, or duplicated instructions. Before evaluating, extract and quote the longest system prompt found in the codebase to assess token efficiency. Count all instances found and enumerate each.
- Pass criteria: The system prompt is concise and purpose-built. For a standard chatbot or assistant, under 800 tokens is excellent; under 1500 tokens is acceptable. The prompt does not contain static reference material that could instead be retrieved dynamically. Report even on pass: "X system prompts found. Average token count: Y. Largest prompt: Z tokens."
- Fail criteria: The system prompt exceeds approximately 2000 tokens (roughly 8000 characters), or it contains hardcoded documents, large JSON schemas, full policy text, or extensive few-shot examples that inflate it beyond what is architecturally necessary. Do NOT pass if any single system prompt exceeds 4000 tokens without a clear justification for the length.
- Skip (N/A) when: No system prompt is used; the application uses only user-turn prompts with no system-role message. Signal: No string assigned to the system parameter in AI calls, and no message with role: "system" in the messages array.
- Cross-reference: The prompt-template-efficiency check verifies the template structure that contributes to the token counts measured here.
- Detail on fail: "System prompt is bloated — large static content consumes context on every request"
- Remediation: A large system prompt increases the fixed token cost of every single API call. On high-traffic features, this compounds into significant cost and leaves less room for conversation history.
Move static reference material out of the system prompt and into a retrieval layer:
// Before: entire FAQ embedded in system prompt (3000+ tokens, every request)
const systemPrompt = `You are a helpful assistant. Here is our FAQ:
Q: How do I reset my password? A: Go to settings and click...
[50 more Q&A pairs]`;
// After: concise system prompt; FAQ retrieved and injected as context
const systemPrompt = `You are a helpful customer support assistant.
Answer based on the provided context. If the answer is not in the context, say so.`;
// Inject relevant FAQ entries via RAG based on the user's specific question
const relevantContext = await retrieveRelevantFAQ(userMessage);
After restructuring, verify the system prompt is under 1500 tokens by running it through a counter: countMessageTokens([{ role: "system", content: systemPrompt }], model).
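The size thresholds in the Detection criteria can be expressed as a small classifier that also produces the summary line requested on pass. This is a sketch under stated assumptions: `estimateTokens`, `classifyPromptSize`, `summarizePrompts`, and the `Verdict` labels are hypothetical; the 1 token per 4 characters heuristic comes from the check itself; and the range from 1500 to 2000 tokens, which the criteria leave unspecified, is flagged for manual review here rather than passed or failed. The content-based fail conditions (embedded documents, schemas, policy text) are not covered by a size check and still need inspection.

```typescript
type Verdict = "pass-excellent" | "pass-acceptable" | "needs-review" | "fail";

// Heuristic from the check: about 1 token per 4 characters of English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Size-only verdict. Assumption: 1500 to 2000 tokens is treated as "needs-review"
// because the pass and fail criteria do not say how to score that range.
function classifyPromptSize(tokens: number): Verdict {
  if (tokens < 800) return "pass-excellent";
  if (tokens < 1500) return "pass-acceptable";
  if (tokens <= 2000) return "needs-review";
  return "fail";
}

// Builds the summary line the pass criteria ask for.
function summarizePrompts(prompts: string[]): string {
  const counts = prompts.map(estimateTokens);
  const total = counts.reduce((sum, n) => sum + n, 0);
  const average = counts.length > 0 ? Math.round(total / counts.length) : 0;
  const largest = counts.length > 0 ? Math.max(...counts) : 0;
  return `${counts.length} system prompts found. Average token count: ${average}. Largest prompt: ${largest} tokens.`;
}
```

A real tokenizer for the target model gives more reliable counts than the character heuristic, especially for prompts heavy in code, JSON, or non-English text.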
External references
- owasp-llm:2025 · LLM10 — Unbounded Consumption
- cwe · CWE-770 — Allocation of Resources Without Limits or Throttling
- iso-25010:2011 · performance-efficiency.resource-utilization — Resource Utilization — system prompt token overhead minimized
History
- 2026-04-18·v1.0.0·Initial import from ai-token-optimization·automated