All 21 checks with why-it-matters prose, severity, and cross-references to related audits.
Without context window tracking, your application flies blind into the model's hard token ceiling. When a conversation grows long enough to exceed that limit, the provider returns HTTP 400 `context_length_exceeded` — a hard error that crashes the session with no warning, no fallback, and no opportunity for graceful recovery. OWASP LLM10 and CWE-770 (Allocation of Resources Without Limits) both flag this as a critical resource management failure. The user loses their entire conversation state, and you have no log data to diagnose the spike.
Why this severity: Critical because a missing context guard converts normal conversation growth into guaranteed session-terminating crashes with no user-visible recovery path.
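A minimal tracking sketch, assuming a rough chars-per-token heuristic; a production guard should use the provider's actual tokenizer (e.g. tiktoken), and the window and reserve constants below are illustrative, not provider-published values:

```typescript
// Hypothetical sketch: track cumulative context size before each call.
// The chars/4 heuristic is a rough English-text approximation; real code
// should count tokens with the provider's tokenizer.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

const CONTEXT_WINDOW = 128_000; // assumed window for the target model
const RESPONSE_RESERVE = 4_096; // headroom reserved for the completion

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // ~4 characters per token heuristic
}

function contextUsage(messages: ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
}

function fitsInWindow(messages: ChatMessage[]): boolean {
  return contextUsage(messages) + RESPONSE_RESERVE <= CONTEXT_WINDOW;
}
```

Reserving headroom for the response matters: input that "fits" with zero room left still fails once the model starts generating.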
ai-token-optimization.context-management.context-window-tracking

Passing an unbounded `messages` array to the AI provider is a ticking time bomb. Every conversation turn adds tokens to the payload; once the cumulative history exceeds the model's context window, the API returns a hard 400 error mid-session. OWASP LLM10 and CWE-770 classify this as an uncontrolled resource consumption failure. The user's active session terminates without warning. On multi-user platforms this failure recurs predictably, silently degrading retention for your most engaged users — the ones with the longest conversation histories.
Why this severity: Critical because unbounded history growth guarantees eventual API failure for active users, crashing live sessions with no recovery path.
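One common fix is a token-budgeted sliding window. A sketch under the assumption that per-message token counts are precomputed; the `Turn` shape and budget are illustrative:

```typescript
// Sketch of a token-budgeted sliding window: always keep the system
// prompt, then add messages newest-first until the budget is spent.
type Turn = { role: string; content: string; tokens: number };

function truncateHistory(system: Turn, history: Turn[], budget: number): Turn[] {
  const kept: Turn[] = [];
  let used = system.tokens;
  // Walk backwards so the most recent turns survive truncation.
  for (let i = history.length - 1; i >= 0; i--) {
    if (used + history[i].tokens > budget) break;
    used += history[i].tokens;
    kept.unshift(history[i]);
  }
  return [system, ...kept];
}
```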
ai-token-optimization.context-management.conversation-history-truncation

The system prompt is the fixed tax paid on every single API call. Embedding a full FAQ document, a multi-page policy, or a large JSON schema in the system prompt means paying for thousands of tokens on every request — including the trivial ones. At 10,000 requests per day on GPT-4o at $2.50 per million input tokens, a bloated 3,000-token system prompt costs roughly $75/day in fixed overhead before any user message token is counted. OWASP LLM10 and ISO 25010 performance-efficiency flag this as architectural waste that compounds with volume.
Why this severity: High because a bloated system prompt multiplies token cost linearly with every request, imposing fixed overhead that grows directly with traffic.
ai-token-optimization.context-management.system-prompt-size

Sending an oversized payload to the AI API is never free — even when the provider rejects it. The request consumes a full network round-trip, counts toward rate limit quotas on some configurations, and surfaces to the user as a generic 500 error rather than an actionable message. CWE-770 and CWE-400 both apply: unguarded resource allocation with no application-side limit check. NIST AI RMF MANAGE 1.3 requires predictable failure modes; a silent hard crash on context overflow is the opposite. A pre-flight guard converts this crash into a recoverable, user-friendly state.
Why this severity: High because without a pre-flight guard, context overflow errors surface as generic failures — burning rate limit quota and giving users no actionable recovery information.
ai-token-optimization.context-management.token-counting-before-call

Simple FIFO truncation discards messages with no regard for their semantic value. The user's name stated at message 1, a constraint given at message 5, or a decision made at message 10 vanishes permanently when it slides out of the window. In multi-step workflows — job application assistants, planning tools, tutoring systems — this causes visible regression: the AI contradicts itself or asks questions the user already answered. NIST AI RMF MANAGE 2.2 requires managing model limitations that affect outcomes; history amnesia is a direct consequence of ignoring this.
Why this severity: Medium because hard truncation degrades conversation quality and user trust in extended sessions, without causing an outright crash.
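A compaction sketch: older turns collapse into a summary message instead of vanishing. In practice the summarizer is itself an LLM call; here it is injected as a function so the shape is self-contained, and all names are illustrative:

```typescript
// Sketch of compaction instead of hard truncation: the oldest turns are
// replaced by a single summary message so key facts (names, constraints,
// decisions) survive in condensed form.
type Msg = { role: string; content: string };

function compactHistory(
  history: Msg[],
  keepRecent: number,
  summarize: (older: Msg[]) => string, // an LLM call in production
): Msg[] {
  if (history.length <= keepRecent) return history;
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  return [
    { role: "system", content: `Summary of earlier conversation: ${summarize(older)}` },
    ...recent,
  ];
}
```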
ai-token-optimization.context-management.context-compaction-strategy

Hardcoded token limit numbers (`if (tokens > 4000)`) become silently wrong when you upgrade a model. GPT-3.5 Turbo shipped with a 4K window and later 16K; GPT-4o runs at 128K; Claude Sonnet at 200K. If the magic number in your route handler was written for an old model and the model is later swapped, every conversation will truncate far too aggressively — or not at all — with no indication that the limit constant is stale. ISO 25010 maintainability requires that configuration be centralized and named, not scattered as bare literals.
Why this severity: Low because stale magic number limits degrade behavior silently on model upgrades but do not cause immediate crashes or security failures.
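A centralization sketch. The window sizes below are assumptions that should be verified against current provider documentation; the key design choice is the loud failure for unconfigured models instead of a silently stale literal:

```typescript
// Sketch of centralized, named context limits instead of bare literals
// scattered through route handlers.
const MODEL_LIMITS: Record<string, { contextWindow: number }> = {
  "gpt-4o": { contextWindow: 128_000 },
  "gpt-4o-mini": { contextWindow: 128_000 },
  "claude-sonnet": { contextWindow: 200_000 },
};

function contextWindowFor(model: string): number {
  const entry = MODEL_LIMITS[model];
  // Fail loudly on a model swap instead of silently using a stale number.
  if (!entry) throw new Error(`No context limit configured for model "${model}"`);
  return entry.contextWindow;
}
```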
ai-token-optimization.context-management.max-context-config

Without an explicit `max_tokens` ceiling, the model may generate up to its remaining context window in a single response. A model in a repetition loop or producing unexpectedly verbose output runs until it hits the absolute limit — billing you for every token generated. On a high-traffic endpoint, a single user who triggers runaway generation can produce a spike equivalent to thousands of normal requests. OWASP LLM10, CWE-770, and CWE-400 all apply. NIST AI RMF GOVERN 6.1 requires operational controls on AI resource consumption; `max_tokens` is the most direct such control.
Why this severity: Critical because uncapped generation enables runaway token consumption that produces unbounded cost spikes with no application-level control mechanism.
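A request-builder sketch that makes the ceiling unavoidable. The payload shape mirrors the OpenAI chat completions API; the default cap is an illustrative operational choice, not a provider value:

```typescript
// Sketch of a request builder that always applies a max_tokens ceiling,
// bounding the worst-case cost of any single response.
const DEFAULT_MAX_TOKENS = 1_024; // illustrative default cap

function buildCompletionRequest(
  model: string,
  messages: { role: string; content: string }[],
  maxTokens: number = DEFAULT_MAX_TOKENS,
) {
  return {
    model,
    messages,
    max_tokens: maxTokens, // hard ceiling on generation
  };
}
```

Routing every call through one builder means no endpoint can forget the cap.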
ai-token-optimization.token-efficiency.max-tokens-set

Frontier models like GPT-4o cost 10–50x more per token than their lightweight counterparts — and are often no better at simple tasks. Generating a document title, classifying sentiment, or summarizing a short paragraph does not require frontier-level reasoning. Every request where `gpt-4o` runs a classification that `gpt-4o-mini` handles equally well is a 17x cost premium with zero quality benefit. NIST AI RMF MAP 5.1 requires that model selection be matched to operational requirements; using the most powerful model for every task uniformly fails this requirement.
Why this severity: High because using frontier models for all tasks including simple ones imposes a 10–50x cost multiplier on routine, high-volume operations that cheaper models handle equally well.
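A routing sketch. The task taxonomy and the model assignments are illustrative assumptions to adapt to your workload:

```typescript
// Sketch of routing by task complexity: cheap model for classification
// and titling, frontier model only where deep reasoning is required.
type Task = "classify" | "title" | "summarize-short" | "reason" | "code-review";

const MODEL_FOR_TASK: Record<Task, string> = {
  classify: "gpt-4o-mini", // ~17x cheaper, equal quality on simple tasks
  title: "gpt-4o-mini",
  "summarize-short": "gpt-4o-mini",
  reason: "gpt-4o",
  "code-review": "gpt-4o",
};

function pickModel(task: Task): string {
  return MODEL_FOR_TASK[task];
}
```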
ai-token-optimization.token-efficiency.model-selection-by-complexity

Every word in a developer-authored prompt template is processed on every request. Verbose preambles, redundant restatements of the same constraint, and politeness markers like "Please would you kindly" add tokens that produce no improvement in output quality. At 10,000 requests per day, 100 unnecessary tokens per request costs roughly $0.15/day on `gpt-4o-mini` and $2.50/day on `gpt-4o` — purely for filler text. OWASP LLM10 flags prompt bloat as a performance-efficiency failure. Compact prompts also leave more room for actual conversation content before hitting truncation.
Why this severity: Medium because verbose prompt templates impose a fixed per-request token tax that scales directly with traffic volume.
ai-token-optimization.token-efficiency.prompt-template-efficiency

Embedding model choice directly determines storage cost, query latency, and API cost for every document in your system. The legacy `text-embedding-ada-002` model is strictly dominated by `text-embedding-3-small`, which achieves higher retrieval quality at lower cost — there is no justification for using `ada-002` in new systems. Using `text-embedding-3-large` (3,072 dimensions) when `text-embedding-3-small` (1,536 dimensions) produces equivalent retrieval quality for your dataset wastes storage and doubles query payload size. ISO 25010 performance-efficiency requires that resource choices be calibrated to actual requirements.
Why this severity: Medium because an oversized or legacy embedding model imposes avoidable storage, cost, and latency overhead on every embedding operation and vector query.
ai-token-optimization.token-efficiency.embedding-dimension-optimization

RAG chunk size controls how much text is inserted into the LLM context per retrieved document. Oversized chunks (full pages at 3,000+ tokens each) flood the context window with surrounding noise, reducing the signal-to-noise ratio and leaving less room for conversation history. High `topK` values compound this: retrieving 20 chunks at 2,000 tokens each consumes 40,000 tokens before the user's question is even accounted for. OWASP LLM10 flags uncontrolled context consumption. Poorly tuned RAG raises both cost and error rate simultaneously.
Why this severity: Info because suboptimal chunk sizing and retrieval depth degrade answer quality and inflate cost, but do not cause outright failure in typical configurations.
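The topK-times-chunk-size arithmetic can be enforced as a simple budget check; the budget constant below is an illustrative tuning choice:

```typescript
// Sketch of a retrieval budget check: bound topK x chunk size so
// retrieved context cannot crowd out history and the user's question.
const RETRIEVAL_TOKEN_BUDGET = 8_000; // illustrative ceiling

function retrievalBudgetOk(topK: number, avgChunkTokens: number): boolean {
  return topK * avgChunkTokens <= RETRIEVAL_TOKEN_BUDGET;
}
```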
ai-token-optimization.token-efficiency.rag-chunk-size

Many AI applications have query patterns where the same prompt runs repeatedly across different users: SEO description generation for a product page, FAQ answer retrieval for a support bot, or summarization of a static document. Without a cache layer, each of these requests hits the API and bills you for the same tokens repeatedly. On a product listing page visited by 10,000 users per day with identical AI-generated descriptions, you pay for 10,000 API calls when 1 would suffice. ISO 25010 performance-efficiency requires eliminating redundant computation at scale.
Why this severity: High because without caching, repeated identical prompts generate redundant API costs that scale linearly with traffic volume regardless of response novelty.
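An exact-match cache sketch keyed by a hash of model plus prompt. In production this would live in Redis or similar; a `Map` and an illustrative TTL keep the sketch self-contained:

```typescript
// Sketch of an exact-match response cache: identical (model, prompt)
// pairs are served from cache instead of re-billing the API.
import { createHash } from "node:crypto";

const cache = new Map<string, { response: string; expiresAt: number }>();
const TTL_MS = 60 * 60 * 1000; // 1 hour, an illustrative choice

function cacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}\n${prompt}`).digest("hex");
}

function getCached(model: string, prompt: string): string | undefined {
  const hit = cache.get(cacheKey(model, prompt));
  if (!hit || hit.expiresAt < Date.now()) return undefined;
  return hit.response;
}

function putCached(model: string, prompt: string, response: string): void {
  cache.set(cacheKey(model, prompt), { response, expiresAt: Date.now() + TTL_MS });
}
```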
ai-token-optimization.caching-cost.repeated-query-caching

Users phrase semantically identical questions in dozens of different ways: "What is the refund policy?" and "How do I get a refund?" produce identical answers but fail an exact-match cache check. Semantic caching, which compares query embeddings for similarity rather than string equality, captures this overlap. Research and production deployments consistently show that semantic caching captures 20–40% additional cache-eligible traffic that exact matching misses — traffic that would otherwise generate fresh API calls. ISO 25010 performance-efficiency classifies redundant recomputation of equivalent responses as an efficiency defect.
Why this severity: Medium because semantic caching captures a substantial fraction of cache-eligible traffic that exact matching misses, with measurable cost reduction on high-volume query patterns.
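A lookup sketch: compare the query's embedding to cached entries by cosine similarity and reuse the answer above a threshold. Embeddings here are plain number arrays (in practice they come from an embedding API), and the 0.92 threshold is an assumption to tune against your own traffic:

```typescript
// Sketch of a semantic cache lookup over embedding vectors.
type CacheEntry = { embedding: number[]; response: string };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function semanticLookup(
  query: number[],
  entries: CacheEntry[],
  threshold = 0.92, // assumed cutoff; tune against real traffic
): string | undefined {
  let best: CacheEntry | undefined;
  let bestScore = threshold;
  for (const e of entries) {
    const score = cosineSimilarity(query, e.embedding);
    if (score >= bestScore) { best = e; bestScore = score; }
  }
  return best?.response;
}
```

At scale the linear scan would be replaced by a vector index; the threshold-and-best-match logic stays the same.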
ai-token-optimization.caching-cost.semantic-caching

Token counts are the primary cost driver for every AI API integration, but they are invisible by default if not explicitly captured. Without logging `prompt_tokens`, `completion_tokens`, and model name per request, you cannot answer basic operational questions: which feature costs the most, which users are consuming disproportionate tokens, whether a prompt change improved efficiency, or whether usage is trending toward a budget ceiling. NIST AI RMF MEASURE 2.5 requires monitoring AI system resource consumption. Unlogged token usage means you discover cost problems only on the monthly bill.
Why this severity: High because unlogged token usage makes cost monitoring, quota enforcement, and abuse detection operationally impossible — problems only surface on billing statements.
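A logging sketch. The `usage` shape matches what chat completion responses return; the record fields and the idea of tagging by feature and user are illustrative choices (a real system would ship these records to structured logging):

```typescript
// Sketch of a per-request usage record built from the API response's
// usage block, tagged with enough context to answer cost questions.
type Usage = { prompt_tokens: number; completion_tokens: number };

type UsageRecord = {
  timestamp: string;
  model: string;
  feature: string;
  userId: string;
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
};

function usageRecord(model: string, feature: string, userId: string, usage: Usage): UsageRecord {
  return {
    timestamp: new Date().toISOString(),
    model,
    feature,
    userId,
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    totalTokens: usage.prompt_tokens + usage.completion_tokens,
  };
}
```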
ai-token-optimization.caching-cost.token-usage-logging

Raw token counts are an abstraction that most product decisions do not operate on. Knowing a request consumed 1,200 tokens tells you nothing actionable; knowing it cost $0.014 does. Dollar cost estimation unlocks concrete decisions: which feature is too expensive to offer on the free plan, which user is consuming 10x the cohort average, whether a model swap will save $300/month, and at what traffic level a cost-per-user budget is breached. NIST AI RMF MEASURE 2.5 requires quantitative monitoring of AI operational costs; dollar estimates are that quantification.
Why this severity: Low because the absence of cost estimation leaves token counts abstract, delaying cost-driven product decisions — but does not cause operational failure on its own.
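An estimation sketch. The per-million-token prices are assumptions that must be kept in sync with the provider's current pricing page:

```typescript
// Sketch of dollar-cost estimation from token counts and a centralized
// price table (prices per million tokens, assumed values).
const PRICES_PER_MILLION: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

function estimateCostUsd(model: string, promptTokens: number, completionTokens: number): number {
  const p = PRICES_PER_MILLION[model];
  if (!p) throw new Error(`No pricing configured for "${model}"`);
  return (promptTokens * p.input + completionTokens * p.output) / 1_000_000;
}
```

Attached to the usage log, this turns every request into a dollar figure that product and billing decisions can operate on.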
ai-token-optimization.caching-cost.cost-estimation-per-request

The OpenAI Batch API offers a 50% cost discount and higher throughput limits for requests that can tolerate a 24-hour completion window. Non-interactive tasks — nightly content tagging, bulk summarization, product description generation, weekly report creation — have no reason to use real-time standard API calls and pay the full rate. Sending these jobs one-by-one through the synchronous endpoint forgoes half the available cost savings and burns synchronous rate limit capacity that user-facing features need. ISO 25010 performance-efficiency requires that resource consumption be calibrated to actual operational constraints.
Why this severity: Info because missing batch API usage wastes available cost discounts on background tasks, but does not cause failures or affect real-time user experience.
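Batch jobs are submitted as a JSONL file of one request per line. A sketch of building one such line; the shape follows the OpenAI Batch API, while the ids, model, and cap are illustrative:

```typescript
// Sketch of one Batch API input line (JSONL): a custom_id for matching
// results back to jobs, plus the method, endpoint, and request body.
function batchLine(customId: string, model: string, prompt: string): string {
  return JSON.stringify({
    custom_id: customId,
    method: "POST",
    url: "/v1/chat/completions",
    body: { model, messages: [{ role: "user", content: prompt }], max_tokens: 512 },
  });
}
```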
ai-token-optimization.caching-cost.batch-processing

LLM generation at 50 tokens per second means a 1,000-token response takes 20 seconds. Without streaming, the user sees a spinner for the entire duration and receives no indication that anything is happening. Perceived latency is the primary driver of AI application abandonment. Research consistently shows that time-to-first-token is the latency metric users notice most — streaming brings it from 20 seconds to under 2. ISO 25010 performance-efficiency requires that user-facing response times be minimized; non-streaming chat interfaces fail this benchmark for any response over a sentence.
Why this severity: High because non-streaming AI responses impose full generation latency before the user sees any output, making the application feel broken on longer responses.
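A consumption sketch: the stream source here is a plain async iterable so the shape is self-contained; with a provider SDK, the same for-await loop consumes the stream object it returns:

```typescript
// Sketch of consuming a token stream so the first tokens reach the UI
// immediately instead of after full generation completes.
async function renderStream(
  stream: AsyncIterable<string>,
  onToken: (partial: string) => void,
): Promise<string> {
  let text = "";
  for await (const token of stream) {
    text += token;
    onToken(text); // UI updates per token: time-to-first-token drops to near zero
  }
  return text;
}
```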
ai-token-optimization.streaming-performance.streaming-enabled

When a user navigates away mid-generation or realizes their prompt was wrong, the model continues generating and billing you for every token — even though no user will ever read the output. Without an abort mechanism, you pay for the full completion on every abandoned session. On a model that generates 1,000-token responses at $0.01 per completion, even a 5% abandonment rate on 10,000 daily requests costs $50/day in tokens that were never seen. OWASP LLM10, CWE-770, and NIST AI RMF MANAGE 1.3 all require application-level controls to terminate runaway resource consumption.
Why this severity: High because without abort support, navigating away or canceling a request does not stop generation — the model keeps billing for output no user will ever see.
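A wiring sketch using the standard `AbortController` and `fetch` signal; the endpoint URL is a placeholder, and the split into a promise plus a `cancel` handle is one design choice among several:

```typescript
// Sketch of wiring an AbortController through a streaming request so a
// stop button, navigation, or timeout stops generation (and billing for
// output no one will read).
function startGeneration(body: unknown): { promise: Promise<Response>; cancel: () => void } {
  const controller = new AbortController();
  const promise = fetch("https://api.example.com/v1/chat/completions", {
    method: "POST",
    body: JSON.stringify(body),
    signal: controller.signal, // the request is torn down when aborted
  });
  // Call cancel() from a stop button, a route-change handler, or a timeout.
  return { promise, cancel: () => controller.abort() };
}
```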
ai-token-optimization.streaming-performance.abort-cancel-mechanism

AI APIs are susceptible to transient failures: `429 Too Many Requests` during traffic spikes, `503 Service Unavailable` during provider incidents, and occasional `500` errors. Without retry logic, every transient error immediately surfaces to the user as a failed request — even though the same call would succeed 2 seconds later. Without exponential backoff, retries that fire immediately in a tight loop amplify rate limit pressure at exactly the wrong time, causing cascading failures across all requests rather than absorbing the spike gracefully. ISO 25010 reliability requires predictable behavior under transient failure conditions.
Why this severity: Medium because missing or improperly implemented retry logic converts transient API errors into user-visible failures and can worsen rate limit pressure under load.
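A retry sketch with exponential backoff and full jitter. The retryable status set, attempt count, and delay schedule are illustrative operational choices:

```typescript
// Sketch of retry with exponential backoff and jitter for transient
// statuses; permanent errors are rethrown immediately.
const RETRYABLE = new Set([429, 500, 503]);

async function withRetry<T>(
  call: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      // Full jitter: random delay in [0, base * 2^(attempt-1)) spreads
      // retries out instead of hammering the API in lockstep.
      const delay = Math.random() * baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

The jitter matters as much as the exponent: synchronized retries from many clients re-create the spike they are meant to absorb.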
ai-token-optimization.streaming-performance.retry-with-backoff

Server-side-only validation for empty or oversized inputs forces every request through a full network round-trip before a basic error surfaces, wasting LLM API spend on requests that will be rejected and burning user patience on latency that could be zero. A user pasting a 200KB document into a chat box discovers the limit after the spinner, not before. Client-side input validation short-circuits these failures at the keyboard, cutting wasted provider tokens and improving perceived performance for the user-experience taxon this pattern tracks.
Why this severity: Low because the failure mode is latency and wasted tokens, not data loss or security compromise.
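A client-side check sketch. The limits are illustrative and should mirror the server's own checks; the client check improves UX, while the server remains the source of truth:

```typescript
// Sketch of validating chat input before any network call, so empty and
// oversized messages fail instantly at the keyboard.
const MAX_INPUT_CHARS = 32_000; // ~8K tokens at a 4-chars/token heuristic

type ValidationResult = { valid: true } | { valid: false; reason: string };

function validateChatInput(text: string): ValidationResult {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    return { valid: false, reason: "Message is empty." };
  }
  if (trimmed.length > MAX_INPUT_CHARS) {
    return { valid: false, reason: `Message is ${trimmed.length} characters; the limit is ${MAX_INPUT_CHARS}.` };
  }
  return { valid: true };
}
```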
ai-token-optimization.streaming-performance.client-side-input-validation

LLM responses arrive token-by-token, and markdown syntax gets sliced mid-tag during delivery — an unclosed code fence, a half-written bold marker, a partial list item. A naive renderer flashes raw `**text**` and broken layouts between tokens, then suddenly snaps to formatted output when the stream completes. That flicker reads as broken software to end users and destroys the perceived quality of an otherwise functional AI feature, hitting both the user-experience and performance taxons this pattern covers.
Why this severity: Info because the output is still delivered correctly; only the rendering polish during streaming is degraded.
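A sketch of one such guard, covering only the code-fence case: hold back the tail of the stream while a fence is unclosed so the renderer never sees a half-open block. Real streaming-markdown renderers handle more constructs (bold markers, list items); the function name is illustrative:

```typescript
// Sketch of a partial-render guard: render text up to the last unclosed
// code fence and hold the remainder until the fence closes.
function safeRenderSlice(buffer: string): { render: string; held: string } {
  const fenceCount = (buffer.match(/```/g) ?? []).length;
  if (fenceCount % 2 === 1) {
    // Odd number of fences means one is open: hold back from that fence on.
    const lastFence = buffer.lastIndexOf("```");
    return { render: buffer.slice(0, lastFence), held: buffer.slice(lastFence) };
  }
  return { render: buffer, held: "" };
}
```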
ai-token-optimization.streaming-performance.streaming-partial-render

Run this audit in your AI coding tool (Claude Code, Cursor, Bolt, etc.) and submit results here for scoring and benchmarks.