Without an explicit max_tokens ceiling, the model may generate up to its remaining context window in a single response. A model in a repetition loop or producing unexpectedly verbose output runs until it hits the absolute limit — billing you for every token generated. On a high-traffic endpoint, a single user who triggers runaway generation can produce a spike equivalent to thousands of normal requests. OWASP LLM10, CWE-770, and CWE-400 all apply. NIST AI RMF GOVERN 6.1 requires operational controls on AI resource consumption; max_tokens is the most direct such control.
Critical because uncapped generation enables runaway token consumption that produces unbounded cost spikes with no application-level control mechanism.
Set maxTokens on every streamText, generateText, and chat.completions.create call. Calibrate the value to what the task actually needs — a chat reply does not need the same ceiling as a document draft.
// Vercel AI SDK
const result = await streamText({
  model: openai("gpt-4o"),
  messages,
  maxTokens: 1000, // sized for a chat reply; increase for long-form tasks
});
// OpenAI SDK directly
const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  max_tokens: 1000,
});
Verify by searching the codebase for all streamText, generateText, and chat.completions.create call sites and confirming each has an explicit maxTokens or max_tokens property set to a task-appropriate value, not the model's absolute maximum.
ID: ai-token-optimization.token-efficiency.max-tokens-set
Severity: critical
What to look for: Examine every openai.chat.completions.create(...), generateText(...), and streamText(...) call in the codebase. Look for the max_tokens parameter (or maxTokens in Vercel AI SDK, max_completion_tokens for newer OpenAI models). Search for these call sites in src/app/api/, src/lib/ai/, and pages/api/. Count every AI API call and enumerate which set an explicit max_tokens parameter vs. which rely on the default. Report: X of Y calls set max_tokens.
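The "X of Y" report above can be produced with a small audit script. The sketch below is illustrative, not part of any SDK: it scans source text for the three call-site patterns and checks whether each call's argument object mentions a token cap (a rough lookahead heuristic, not a real parser).

```typescript
// Patterns for AI call sites and for any recognized token-cap parameter.
const CALL_PATTERN = /(streamText|generateText|chat\.completions\.create)\s*\(/g;
const CAP_PATTERN = /max_tokens|maxTokens|max_completion_tokens/;

// Counts call sites in a source string and how many set an explicit cap.
function auditSource(source: string): { total: number; capped: number } {
  CALL_PATTERN.lastIndex = 0; // reset shared global-regex state
  let total = 0;
  let capped = 0;
  let match: RegExpExecArray | null;
  while ((match = CALL_PATTERN.exec(source)) !== null) {
    total++;
    // Heuristic: inspect text from the call site to the next "})" pair.
    const end = source.indexOf("})", match.index);
    const tail = end === -1 ? "" : source.slice(match.index, end + 2);
    if (CAP_PATTERN.test(tail)) capped++;
  }
  return { total, capped };
}
```

Running this over each file and summing the results yields the "X of Y calls set max_tokens" figure; a real audit would still need manual review of multi-line or dynamically built option objects.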
Pass criteria: Every AI API call that generates text has an explicit max_tokens or maxTokens parameter set to a value appropriate for the task (e.g., 500 for a short reply, 2000 for a document draft, 4000 for long-form generation). The value is not set to the model's absolute maximum. Threshold: at least 90% of AI API calls must set explicit max_tokens.
Fail criteria: One or more AI API calls are made without a max_tokens parameter, meaning the model may generate up to its full remaining context window in one response. This is especially dangerous for chat interfaces where users could trigger runaway generation. Do NOT pass if any API call sets max_tokens to the model's maximum context window — a ceiling equal to the absolute limit is no ceiling at all.
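The task-calibrated values in the pass criteria can be centralized so that no call site invents its own ceiling. A minimal sketch — the task names, values, and function name are assumptions, not from any SDK:

```typescript
// Per-task token ceilings, mirroring the calibration guidance above.
type TaskKind = "chat_reply" | "document_draft" | "long_form";

const TOKEN_BUDGETS: Record<TaskKind, number> = {
  chat_reply: 500,
  document_draft: 2000,
  long_form: 4000,
};

// Every call site pulls its cap from here instead of hardcoding a number.
function maxTokensFor(task: TaskKind): number {
  return TOKEN_BUDGETS[task];
}
```

Centralizing the budgets also makes the audit trivial: a call site without a maxTokensFor(...) reference is a call site without a deliberate cap.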
Skip (N/A) when: No AI API integration is detected.
Signal: No AI SDK dependencies in package.json and no AI API call patterns in source files.
Detail on fail: "One or more AI calls have no max_tokens set — runaway generation risks unexpected cost spikes"
Remediation: Without a max_tokens ceiling, a model that enters a repetition loop or produces verbose output will run until it hits the absolute context limit — billing you for every token. Setting an appropriate ceiling acts as a cost safety brake.
Set maxTokens on every AI call, calibrated to what the task actually needs:
// Vercel AI SDK
const result = await streamText({
  model: openai("gpt-4o"),
  messages,
  maxTokens: 1000, // generous for a chat reply, adjust per use case
});
// OpenAI SDK directly
const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  max_tokens: 1000,
});
For document generation endpoints that legitimately need more tokens, set a higher but still explicit limit. Verify by searching the codebase for all streamText, generateText, and chat.completions.create calls and confirming each has a maxTokens / max_tokens property.
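One way to keep those higher limits explicit is a clamp that every endpoint passes its requested cap through, so no request can exceed a service-wide ceiling. This is a sketch; the ceiling value, default, and function name are assumptions:

```typescript
// Hard upper bound for any single response on this service (assumed value).
const ABSOLUTE_MAX_TOKENS = 4000;

// Returns an explicit maxTokens for a call, clamped to the service ceiling.
// Callers can ask for more but never receive an uncapped or oversized value.
function clampMaxTokens(requested: number): number {
  if (!Number.isFinite(requested) || requested <= 0) {
    return 1000; // assumed safe default when the caller passes nothing sensible
  }
  return Math.min(Math.floor(requested), ABSOLUTE_MAX_TOKENS);
}
```

A document endpoint would then call clampMaxTokens(3000) rather than setting maxTokens directly, keeping the ceiling explicit and the absolute limit enforced in one place.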
For guidance on response length governance and quality, see the AI Response Quality Audit.