Streaming is enabled for long AI responses
Why it matters
At 50 tokens per second, a 1,000-token response takes 20 seconds to generate. Without streaming, the user watches a spinner for that entire time with no sign of progress. Perceived latency is a major driver of abandonment in AI applications, and time-to-first-token is the latency users notice most: streaming cuts the wait for first output from about 20 seconds to under 2. ISO 25010 lists time behaviour as a performance-efficiency characteristic; a non-streaming chat interface scores poorly on it for any response longer than a sentence or two.
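The arithmetic above can be sketched as a small helper. This is a simplified model rather than a measurement: the name secondsUntilVisible and the overheadSeconds parameter are hypothetical, and the model assumes a constant generation rate with all prompt-processing and network time folded into overheadSeconds.

```typescript
/**
 * Hypothetical helper: estimates how long a user waits before any output is visible.
 *
 * tokensBeforeDisplay is the number of tokens that must be generated before the
 * UI can show something: the whole response when not streaming, 1 when streaming.
 * overheadSeconds is an assumed fixed cost (network, prompt processing).
 */
function secondsUntilVisible(
  tokensBeforeDisplay: number,
  tokensPerSecond: number,
  overheadSeconds: number = 0,
): number {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return overheadSeconds + tokensBeforeDisplay / tokensPerSecond;
}

// Non-streaming: all 1,000 tokens must exist before anything is shown.
const blockingWait = secondsUntilVisible(1000, 50);

// Streaming: only the first token is needed, plus an assumed 1 second of overhead.
const streamingWait = secondsUntilVisible(1, 50, 1);
```

With the figures from the text, blockingWait is 20 seconds, while streamingWait stays close to the assumed overhead, which is why the first-token target of under 2 seconds is reachable only with streaming.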
Severity rationale
High because non-streaming AI responses impose full generation latency before the user sees any output, making the application feel broken on longer responses.
Remediation
Replace generateText with streamText in user-facing route handlers and return result.toDataStreamResponse(). Use the useChat hook on the frontend to consume the stream and render tokens as they arrive.
// src/app/api/chat/route.ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const result = await streamText({
    model: openai("gpt-4o"),
    messages,
    maxTokens: 1000,
  });
  return result.toDataStreamResponse();
}
// src/components/chat.tsx
"use client"; // needed when this file is the client boundary in the Next.js App Router, since useChat is a React hook

import { useChat } from "ai/react";

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();
  return (
    <form onSubmit={handleSubmit}>
      {messages.map(m => <div key={m.id}>{m.content}</div>)}
      <input value={input} onChange={handleInputChange} />
      <button type="submit">Send</button>
    </form>
  );
}
Verify by submitting a prompt that generates a long response and confirming that text begins appearing within 2 seconds of submission.
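The manual check above can also be approximated in code. The following is a minimal sketch, not part of the AI SDK: measureStreamTiming, meetsFirstTokenBudget, and the StreamTiming shape are hypothetical names, and the 2000 ms default budget is taken from the verification step. The helper accepts any async iterable of chunks and an injectable clock so it can be exercised without a live endpoint.

```typescript
// Hypothetical timing record for one streamed response.
interface StreamTiming {
  firstChunkMs: number | null; // null when the stream produced no chunks
  totalMs: number;             // time until the stream finished
  chunkCount: number;
}

// Consumes the stream and records when the first chunk arrived.
// The clock is injectable (defaults to Date.now) so tests can be deterministic.
async function measureStreamTiming(
  chunks: AsyncIterable<unknown>,
  now: () => number = Date.now,
): Promise<StreamTiming> {
  const start = now();
  let firstChunkMs: number | null = null;
  let chunkCount = 0;
  for await (const _chunk of chunks) {
    if (firstChunkMs === null) {
      firstChunkMs = now() - start;
    }
    chunkCount += 1;
  }
  return { firstChunkMs, totalMs: now() - start, chunkCount };
}

// True when the first chunk arrived within the budget (2 seconds by default).
function meetsFirstTokenBudget(timing: StreamTiming, budgetMs: number = 2000): boolean {
  return timing.firstChunkMs !== null && timing.firstChunkMs <= budgetMs;
}
```

A non-streaming endpoint tends to show up here as a single chunk whose firstChunkMs is nearly equal to totalMs. In runtimes where the fetch response body is async-iterable (an assumption to confirm for your environment), the body can be passed in directly; elsewhere a reader loop would be needed.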
Detection
- ID: streaming-enabled
- Severity: high
- What to look for: Look for stream: true in OpenAI SDK calls, or use of the Vercel AI SDK's streamText(...) function (as opposed to generateText). In the frontend, look for useChat or useCompletion hooks from the ai package, or manual SSE/stream reading with ReadableStream. Check API route handlers for toDataStreamResponse() or streaming response construction. Also look for evidence of the opposite: long await generateText(...) calls in user-facing routes where the UI shows a spinner until completion. Count all instances found and enumerate each.
- Pass criteria: User-facing AI features that produce responses of more than a sentence or two use streaming: the response begins appearing in the UI within 1-2 seconds of submission, rather than after full generation completes. At least 1 implementation must be confirmed.
- Fail criteria: AI responses are fully generated server-side before being sent to the client. Users see a loading spinner for the full duration of generation. API routes use generateText (non-streaming) for interactive chat or long-form generation features.
- Skip (N/A) when: The AI feature is non-interactive (background job, batch processing) or always produces very short responses (under 50 tokens) where the streaming latency improvement is negligible. Signal: All AI calls are in background jobs or cron handlers, not in response to user HTTP requests. Or the use case demonstrably only generates short outputs (e.g., classification labels, scores).
- Detail on fail: "AI responses not streamed — users wait for full generation before seeing any output"
- Remediation: LLM generation is inherently slow: a 1000-token response at 50 tokens/second takes 20 seconds. Without streaming, users stare at a spinner for the entire duration. With streaming, they see the first token within 1-2 seconds.

// src/app/api/chat/route.ts (streaming route handler)
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const result = await streamText({
    model: openai("gpt-4o"),
    messages,
    maxTokens: 1000,
  });
  return result.toDataStreamResponse();
}

// src/components/chat.tsx (streaming frontend)
"use client"; // needed when this file is the client boundary in the Next.js App Router, since useChat is a React hook

import { useChat } from "ai/react";

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();
  return (
    <div>
      {messages.map(m => <div key={m.id}>{m.content}</div>)}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}

Verify by submitting a chat message requesting a long response and observing that text begins appearing within 2 seconds.
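The "what to look for" signals in the detection list can be approximated with a line-by-line scan. This is only a rough sketch: findSignals and the Signal shape are hypothetical names, the patterns cover just the identifiers named in the list, and a regex match cannot tell whether a route is user-facing, so every hit still needs manual review. Matches inside comments or strings will show up as false positives.

```typescript
// Hypothetical record for one match found in a source file.
interface Signal {
  line: number; // 1-based line number within the scanned source
  kind: "streaming" | "blocking";
  text: string; // the trimmed source line
}

// Streaming indicators named in the detection list.
const STREAMING =
  /\bstreamText\s*\(|\bstream\s*:\s*true\b|\btoDataStreamResponse\s*\(|\buse(?:Chat|Completion)\s*\(/;

// Non-streaming indicator named in the detection list.
const BLOCKING = /\bgenerateText\s*\(/;

// Enumerates every line that carries a streaming or blocking signal.
function findSignals(source: string): Signal[] {
  const signals: Signal[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    if (STREAMING.test(text)) {
      signals.push({ line: index + 1, kind: "streaming", text: text.trim() });
    }
    if (BLOCKING.test(text)) {
      signals.push({ line: index + 1, kind: "blocking", text: text.trim() });
    }
  });
  return signals;
}
```

Returning each match with its line number supports the instruction to count and enumerate all instances; a file with only "blocking" signals in a request handler is a candidate for the fail criteria.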
For overall performance patterns in your application, see the Performance & Load Readiness Audit.
External references
- iso-25010:2011 · performance-efficiency.time-behaviour — Time Behaviour — streaming minimizes time-to-first-token
Taxons
History
- 2026-04-18 · v1.0.0 · Initial import from ai-token-optimization · automated