Multi-turn chat applications that accumulate conversation history without any token management will eventually exceed the model's context window, at which point the API returns a 400 error and the conversation becomes unusable. Even before that hard failure, unbounded accumulation makes each request's prompt grow with conversation length, so per-message token cost rises linearly and cumulative session cost grows quadratically. The ISO/IEC 25010:2011 performance-efficiency characteristic calls for resource consumption appropriate to the task; a sliding window or summarization strategy prevents both the failure mode and the cost spiral.
Low, because the failure is deterministic but deferred: applications only break once a conversation reaches the context limit, which may take dozens of turns, yet the architectural defect is present from the first message.
Add conversation history trimming before each API call in src/app/api/chat/route.ts:
const MAX_CHARS = 400_000 // ~100k tokens at ~4 chars/token; adjust per model
function trimHistory(messages: Message[]): Message[] {
const system = messages.filter(m => m.role === 'system')
let history = messages.filter(m => m.role !== 'system')
while (history.map(m => m.content).join('').length > MAX_CHARS && history.length > 2) {
history = history.slice(2) // Drop oldest user+assistant turn pair
}
return [...system, ...history]
}
Also catch the API's token-limit error explicitly and return a recoverable error state rather than a 500.
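As a sketch of that error handling (the helper names are hypothetical, and the string match is an assumption — providers surface context-overflow errors with differing shapes, so check your SDK's error type), the route handler could classify the failure and return a recoverable payload instead of a generic 500:

```typescript
// Hypothetical helper: guess whether a thrown API error is a context-window
// overflow. This message-based match is a heuristic, not a provider contract;
// prefer the SDK's typed error classes where available.
function isTokenLimitError(err: unknown): boolean {
  const msg = err instanceof Error ? err.message : String(err)
  return /context|token|maximum.*length/i.test(msg)
}

// Map the error to a response the client can recover from (e.g. by trimming
// history and retrying) rather than an opaque 500.
function toErrorResponse(err: unknown): {
  status: number
  body: { error: string; recoverable: boolean }
} {
  if (isTokenLimitError(err)) {
    return {
      status: 413, // "Payload Too Large" is one reasonable choice; pick per your API conventions
      body: { error: 'Conversation too long for the model context window.', recoverable: true },
    }
  }
  return { status: 500, body: { error: 'Upstream model error.', recoverable: false } }
}
```

The client can treat `recoverable: true` as a signal to drop older turns and resubmit, rather than showing a dead-end error.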
ID: ai-response-quality.response-management.context-window-utilization
Severity: low
What to look for: Enumerate all relevant files. For applications with conversation history (multi-turn chat), check whether there is any token counting or conversation trimming logic to prevent context window overflow. Look for: token counting using tiktoken, @anthropic-ai/tokencount, or manual estimation; message pruning (removing old messages when approaching the limit); summarization of older conversation turns; or a hard message count limit. Check whether the application handles the API error that occurs when input tokens exceed the model's context window.
Pass criteria: At least 1 conforming pattern must exist. Application either (a) counts tokens and trims conversation history before sending, (b) implements a sliding window or summarization strategy, or (c) limits conversation history to a safe message count. API token-limit errors are caught and handled.
Fail criteria: Multi-turn chat application accumulates unlimited conversation history with no token management — will eventually exceed the context window and error.
Skip (N/A) when: Application is single-turn (no conversation history) or always sends a fixed, bounded context.
Detail on fail: "Multi-turn chat in api/chat/route.ts appends all messages with no token limit or pruning — will fail when context window exceeded" (max 500 chars)
Remediation: Add conversation history management:
const MAX_CONTEXT_CHARS = 400_000 // Rough heuristic: ~4 chars per token, leave room for response
function trimHistory(messages: Message[]): Message[] {
const systemMsgs = messages.filter(m => m.role === 'system')
let history = messages.filter(m => m.role !== 'system')
while (
history.map(m => m.content).join('').length > MAX_CONTEXT_CHARS &&
history.length > 2
) {
history = history.slice(2) // Remove oldest turn pair
}
return [...systemMsgs, ...history]
}