Sending an oversized payload to the AI API is never free, even when the provider rejects it. The request consumes a full network round-trip, counts toward rate limit quotas in some configurations, and surfaces to the user as a generic 500 error rather than an actionable message. CWE-770 and CWE-400 both apply: unguarded resource allocation with no application-side limit check. NIST AI RMF MANAGE 1.3 requires predictable failure modes; a silent hard crash on context overflow is the opposite. A pre-flight guard converts this crash into a recoverable, user-friendly state.
High: without a pre-flight guard, context overflow errors surface as generic failures, burning rate limit quota and giving users no actionable recovery information.
Add a synchronous token check before every AI call that returns a structured 422 response when the payload would exceed the model's limit. Reserve headroom for the expected response length.
// src/app/api/chat/route.ts
const RESPONSE_RESERVE = 1000; // tokens reserved for the response
const totalTokens = countMessageTokens([
  { role: "system", content: systemPrompt },
  ...boundedMessages,
], model);
if (totalTokens > MODEL_CONFIG[model].contextWindow - RESPONSE_RESERVE) {
  return Response.json(
    { error: "Conversation is too long. Please start a new chat to continue." },
    { status: 422 }
  );
}
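The snippet above assumes a countMessageTokens helper. As an illustration only, a minimal version can approximate tokens with a characters-divided-by-four heuristic plus a small per-message overhead; a production implementation would use the model's actual tokenizer (for example, a tiktoken-based encoder).

```typescript
// Illustrative sketch only: approximates tokens as ~4 characters each,
// plus an assumed per-message overhead for role and formatting tokens.
// A production implementation would use the model's real tokenizer.
type ChatMessage = { role: string; content: string };

const PER_MESSAGE_OVERHEAD = 4; // assumed allowance per message

function countMessageTokens(messages: ChatMessage[], _model?: string): number {
  return messages.reduce(
    (sum, m) => sum + Math.ceil(m.content.length / 4) + PER_MESSAGE_OVERHEAD,
    0
  );
}
```

Because the heuristic can undercount, keep the RESPONSE_RESERVE headroom generous when using an approximation like this.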
Verify by constructing a payload that exceeds the limit and confirming the UI displays the specific error rather than a generic failure screen.
ID: ai-token-optimization.context-management.token-counting-before-call
Severity: high
What to look for: Look specifically for a pre-flight guard before the AI API call. This is distinct from post-call usage logging. Patterns to find: a conditional that throws or returns an error before calling the provider if token count exceeds a threshold; or auto-truncation logic that runs synchronously before the call. Also check error handling — is a context_length_exceeded or 400 error caught and surfaced as a user-friendly message, or does it propagate as a generic 500? Count all instances found and enumerate each.
Pass criteria: The code prevents the API call if estimated tokens exceed the model's limit and returns a user-friendly error (e.g., "Your conversation is too long — please start a new chat"), OR the auto-truncation from the history check (previous check) makes this guard redundant by construction. At least 1 implementation must be confirmed.
Fail criteria: The code makes the API call unconditionally and allows the provider to throw a 400 context_length_exceeded error, which either crashes the route handler or surfaces as a generic error in the UI.
Skip (N/A) when: No AI API integration is detected.
Signal: No AI SDK dependencies in package.json.
Detail on fail: "No pre-flight token guard — context_length_exceeded errors will surface as generic 500s"
Remediation: API calls that fail due to length still consume network round-trip time and count toward rate limit quotas in some configurations. More importantly, the user gets a useless error.
Add a guard that fires before the API call:
// src/app/api/chat/route.ts
const RESPONSE_RESERVE = 1000; // tokens reserved for the response
const totalTokens = countMessageTokens([
  { role: "system", content: systemPrompt },
  ...boundedMessages,
], model);
if (totalTokens > MODEL_CONFIG[model].contextWindow - RESPONSE_RESERVE) {
  return Response.json(
    { error: "Conversation is too long. Please start a new chat to continue." },
    { status: 422 }
  );
}
// safe to proceed
const result = await streamText({ model: openai(model), messages: boundedMessages });
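Even with the guard, a catch around the call gives defense in depth when the token estimate undercounts. The error shape below is an assumption for illustration; real SDKs expose typed error classes, and exact codes vary by provider.

```typescript
// Hedged sketch: map a provider-side length error to the same 422 payload
// the pre-flight guard returns. Matching on the error message string is an
// assumption; prefer your SDK's typed errors where available.
function toUserFacingError(err: unknown): { status: number; error: string } {
  const message = err instanceof Error ? err.message : String(err);
  if (message.includes("context_length_exceeded")) {
    return {
      status: 422,
      error: "Conversation is too long. Please start a new chat to continue.",
    };
  }
  // Unrelated failures keep a generic message but are still caught here,
  // so the route handler never crashes with an unhandled rejection.
  return { status: 500, error: "Something went wrong. Please try again." };
}
```

Wrapping the streamText call in try/catch and returning Response.json(toUserFacingError(err)) keeps the failure mode identical whether the guard or the provider rejects the request.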
Verify by constructing a payload that exceeds the limit and confirming the UI displays the specific error message rather than a generic failure.
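As a rough sketch of that verification, assuming a chars/4 estimator and an 8,192-token context window (both illustrative values), an oversized payload should trip the pre-flight comparison before any network call is made.

```typescript
// Illustrative check: with the assumed values, an oversized message must be
// rejected by the pre-flight comparison rather than reaching the provider.
const CONTEXT_WINDOW = 8192;   // assumed model limit for this sketch
const RESPONSE_RESERVE = 1000; // headroom reserved for the response
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

const oversized = "x".repeat((CONTEXT_WINDOW + RESPONSE_RESERVE) * 4);
const blocked =
  estimateTokens(oversized) > CONTEXT_WINDOW - RESPONSE_RESERVE;
// blocked is true: the request would return 422 before the API call
```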