Without context window tracking, your application flies blind into the model's hard token ceiling. When a conversation grows long enough to exceed that limit, the provider returns an HTTP 400 context_length_exceeded error — a hard failure that terminates the session with no warning, no fallback, and no path to graceful recovery. OWASP LLM10 (Unbounded Consumption) and CWE-770 (Allocation of Resources Without Limits or Throttling) both describe this class of resource-management failure. The user loses their entire conversation state, and you have no log data to diagnose the spike.
Critical because a missing context guard converts normal conversation growth into a guaranteed, session-terminating crash with no user-visible recovery path.
Add a token-counting utility using tiktoken or js-tiktoken and measure the full payload before every API call. Store the count alongside the request in structured logs so you can see which sessions are approaching limits.
// src/lib/ai/token-counter.ts
import { encoding_for_model, type TiktokenModel } from "tiktoken";

export function countMessageTokens(
  messages: Array<{ role: string; content: string }>,
  model: string = "gpt-4o"
): number {
  const encoder = encoding_for_model(model as TiktokenModel);
  try {
    let count = 4; // every reply is primed with <|im_start|>assistant<|im_sep|>
    for (const msg of messages) {
      count += 4; // per-message role and formatting overhead
      count += encoder.encode(msg.content).length;
    }
    return count;
  } finally {
    encoder.free(); // tiktoken allocates WASM memory that must be freed
  }
}
Call this before streamText or generateText in src/app/api/chat/route.ts. Log the result and branch on whether you're within the model's limit minus a response reserve.
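A minimal version of that branch might look like the following sketch. MODEL_CONTEXT_LIMITS and RESPONSE_BUFFER are illustrative, app-defined values, not library constants; tune them to the models you actually deploy:

```typescript
// Illustrative per-model token ceilings -- adjust to your deployed models.
const MODEL_CONTEXT_LIMITS: Record<string, number> = {
  "gpt-4o": 128_000,
  "gpt-4o-mini": 128_000,
};

// Tokens reserved for the model's reply so the prompt never fills the window.
const RESPONSE_BUFFER = 4_096;

// Returns true when the prompt fits within the model's window minus the reserve.
function fitsContextWindow(tokenCount: number, model: string): boolean {
  const limit = MODEL_CONTEXT_LIMITS[model];
  if (limit === undefined) return false; // unknown model: fail closed
  return tokenCount <= limit - RESPONSE_BUFFER;
}
```

Failing closed on unknown model names is a deliberate choice: it surfaces a missing limits entry in testing rather than letting an unguarded request reach the provider.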
ID: ai-token-optimization.context-management.context-window-tracking
Severity: critical
What to look for: Look for logic in API routes (e.g., src/app/api/chat/route.ts) or AI service layers that calculates the total token count of the input messages before sending the request. Search for imports of libraries such as tiktoken, js-tiktoken, or gpt-tokenizer being used to sum tokens across the messages array. Also check whether the usage field returned by the Vercel AI SDK's generateText or streamText is consumed to track context headroom. Count all instances found and enumerate each.
Pass criteria: The codebase explicitly calculates or estimates the token count of the conversation history/context — either pre-flight via a counting library or post-call via the response usage field — and uses this data to log usage or trigger truncation logic. At least 1 implementation must be confirmed.
Fail criteria: No token counting logic is found prior to or immediately following the API call. The application relies entirely on the provider to reject requests that exceed limits, with no internal awareness of context state.
Skip (N/A) when: No AI API integration is detected in the project.
Signal: No openai, @ai-sdk/*, @anthropic-ai/sdk, ai, or langchain dependencies in package.json, and no AI API call patterns in source files.
Detail on fail: "No context window tracking found — app will throw hard errors when conversations grow long"
Remediation: Context window usage is untracked, meaning the application will hit hard API errors (HTTP 400 "context_length_exceeded") with no graceful recovery once conversations grow long. This breaks user sessions without warning.
Add a token counting utility and measure context before each API call:
// src/lib/ai/token-counter.ts
import { encoding_for_model, type TiktokenModel } from "tiktoken";

export function countMessageTokens(
  messages: Array<{ role: string; content: string }>,
  model: string = "gpt-4o"
): number {
  const encoder = encoding_for_model(model as TiktokenModel);
  try {
    let count = 4; // every reply is primed with <|im_start|>assistant<|im_sep|>
    for (const msg of messages) {
      count += 4; // per-message role and formatting overhead
      count += encoder.encode(msg.content).length;
    }
    return count;
  } finally {
    encoder.free(); // tiktoken allocates WASM memory that must be freed
  }
}
Then in your API route, call it before the AI request and log or act on the result:
// src/app/api/chat/route.ts
const tokenCount = countMessageTokens(messages, model);
console.log(`[token-usage] context=${tokenCount} model=${model}`);

// MODEL_CONTEXT_LIMITS and RESPONSE_BUFFER are app-defined: a per-model
// token ceiling map and a reserve for the model's reply.
if (tokenCount > MODEL_CONTEXT_LIMITS[model] - RESPONSE_BUFFER) {
  // truncate or return error — do not proceed blind
}
Verify by logging tokenCount and checking the server output during a long conversation.
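The truncation branch above can be sketched as a sliding window that keeps the system prompt plus as many of the most recent messages as fit the budget. This is only one strategy (summarization is another); the estimateTokens helper here is a rough chars/4 stand-in so the sketch is self-contained — in practice you would call countMessageTokens from the utility above:

```typescript
type ChatMessage = { role: string; content: string };

// Rough stand-in estimator (~4 chars per token) so this sketch runs alone;
// replace with a real tokenizer-backed count in production.
const estimateTokens = (msg: ChatMessage): number =>
  Math.ceil(msg.content.length / 4) + 4;

// Sliding-window truncation: always keep system messages, then walk backward
// from the newest message, keeping whatever still fits the token budget.
function truncateToBudget(messages: ChatMessage[], budget: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  let used = system.reduce((sum, m) => sum + estimateTokens(m), 0);
  const kept: ChatMessage[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i]);
    if (used + cost > budget) break; // oldest messages drop off first
    used += cost;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}
```

Dropping whole messages (rather than clipping mid-message) keeps the remaining transcript coherent for the model.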
For a broader look at performance patterns affecting your AI routes, see the Performance & Load Readiness Audit.