Sending an oversized payload to the AI API is never free, even when the provider rejects it. The request consumes a full network round-trip, counts toward rate limit quotas in some configurations, and surfaces to the user as a generic 500 error rather than an actionable message. CWE-770 and CWE-400 both apply: unguarded resource allocation with no application-side limit check. NIST AI RMF MANAGE 1.3 requires predictable failure modes; a silent hard crash on context overflow is the opposite. A pre-flight guard converts this crash into a recoverable, user-friendly state.
High: without a pre-flight guard, context overflow errors surface as generic failures, burning rate limit quota and giving users no actionable recovery information.
Add a synchronous token check before every AI call that returns a structured 422 response when the payload would exceed the model's limit. Reserve headroom for the expected response length.
// src/app/api/chat/route.ts
const RESPONSE_RESERVE = 1000; // tokens reserved for the response
const totalTokens = countMessageTokens([
  { role: "system", content: systemPrompt },
  ...boundedMessages,
], model);
if (totalTokens > MODEL_CONFIG[model].contextWindow - RESPONSE_RESERVE) {
  return Response.json(
    { error: "Conversation is too long. Please start a new chat to continue." },
    { status: 422 }
  );
}
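The snippet above assumes a countMessageTokens helper. As an illustration only, a minimal version can approximate tokens with a characters-divided-by-four heuristic plus a small per-message overhead; a production implementation would use the model's actual tokenizer (for example, a tiktoken-based encoder).

```typescript
// Illustrative sketch only: approximates tokens as ~4 characters each,
// plus an assumed per-message overhead for role and formatting tokens.
// A production implementation would use the model's real tokenizer.
type ChatMessage = { role: string; content: string };

const PER_MESSAGE_OVERHEAD = 4; // assumed allowance per message

function countMessageTokens(messages: ChatMessage[], _model?: string): number {
  return messages.reduce(
    (sum, m) => sum + Math.ceil(m.content.length / 4) + PER_MESSAGE_OVERHEAD,
    0
  );
}
```

Because the heuristic can undercount, keep the RESPONSE_RESERVE headroom generous when using an approximation like this.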
Verify by constructing a payload that exceeds the limit and confirming the UI displays the specific error rather than a generic failure screen.
ID: ai-token-optimization.context-management.token-counting-before-call
Severity: high
What to look for: Look specifically for a pre-flight guard before the AI API call. This is distinct from post-call usage logging. Patterns to find: a conditional that throws or returns an error before calling the provider if token count exceeds a threshold; or auto-truncation logic that runs synchronously before the call. Also check error handling — is a context_length_exceeded or 400 error caught and surfaced as a user-friendly message, or does it propagate as a generic 500? Count all instances found and enumerate each.
Pass criteria: The code prevents the API call if estimated tokens exceed the model's limit and returns a user-friendly error (e.g., "Your conversation is too long — please start a new chat"), OR the auto-truncation from the history check (previous check) makes this guard redundant by construction. At least 1 implementation must be confirmed.
Fail criteria: The code makes the API call unconditionally and allows the provider to throw a 400 context_length_exceeded error, which either crashes the route handler or surfaces as a generic error in the UI.
Skip (N/A) when: No AI API integration is detected.
Signal: No AI SDK dependencies in package.json.
Detail on fail: "No pre-flight token guard — context_length_exceeded errors will surface as generic 500s"
Remediation: API calls that fail due to length still consume network round-trip time and count toward rate limit quotas in some configurations. More importantly, the user gets a useless error.
Add a guard that fires before the API call:
// src/app/api/chat/route.ts
const RESPONSE_RESERVE = 1000; // tokens reserved for the response
const totalTokens = countMessageTokens([
  { role: "system", content: systemPrompt },
  ...boundedMessages,
], model);
if (totalTokens > MODEL_CONFIG[model].contextWindow - RESPONSE_RESERVE) {
  return Response.json(
    { error: "Conversation is too long. Please start a new chat to continue." },
    { status: 422 }
  );
}
// safe to proceed
const result = await streamText({ model: openai(model), messages: boundedMessages });
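Even with the guard, a catch around the call gives defense in depth when the token estimate undercounts. The error shape below is an assumption for illustration; real SDKs expose typed error classes, and exact codes vary by provider.

```typescript
// Hedged sketch: map a provider-side length error to the same 422 payload
// the pre-flight guard returns. Matching on the error message string is an
// assumption; prefer your SDK's typed errors where available.
function toUserFacingError(err: unknown): { status: number; error: string } {
  const message = err instanceof Error ? err.message : String(err);
  if (message.includes("context_length_exceeded")) {
    return {
      status: 422,
      error: "Conversation is too long. Please start a new chat to continue.",
    };
  }
  // Unrelated failures keep a generic message but are still caught here,
  // so the route handler never crashes with an unhandled rejection.
  return { status: 500, error: "Something went wrong. Please try again." };
}
```

Wrapping the streamText call in try/catch and returning Response.json(toUserFacingError(err)) keeps the failure mode identical whether the guard or the provider rejects the request.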
Verify by constructing a payload that exceeds the limit and confirming the UI displays the specific error message rather than a generic failure.
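As a rough sketch of that verification, assuming a chars/4 estimator and an 8,192-token context window (both illustrative values), an oversized payload should trip the pre-flight comparison before any network call is made.

```typescript
// Illustrative check: with the assumed values, an oversized message must be
// rejected by the pre-flight comparison rather than reaching the provider.
const CONTEXT_WINDOW = 8192;   // assumed model limit for this sketch
const RESPONSE_RESERVE = 1000; // headroom reserved for the response
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

const oversized = "x".repeat((CONTEXT_WINDOW + RESPONSE_RESERVE) * 4);
const blocked =
  estimateTokens(oversized) > CONTEXT_WINDOW - RESPONSE_RESERVE;
// blocked is true: the request would return 422 before the API call
```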