LLM generation at 50 tokens per second means a 1,000-token response takes 20 seconds. Without streaming, the user sees a spinner for the entire duration with no indication that anything is happening. Perceived latency is a primary driver of AI application abandonment, and time-to-first-token is the latency metric users notice most — streaming brings it from 20 seconds to under 2. ISO 25010's performance-efficiency characteristic calls for response times appropriate to user needs; a non-streaming chat interface fails that expectation for any response longer than a sentence.
High because non-streaming AI responses impose full generation latency before the user sees any output, making the application feel broken on longer responses.
Replace generateText with streamText in user-facing route handlers and return result.toDataStreamResponse(). Use the useChat hook on the frontend to consume the stream and render tokens as they arrive.
// src/app/api/chat/route.ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const result = await streamText({
    model: openai("gpt-4o"),
    messages,
    maxTokens: 1000,
  });
  return result.toDataStreamResponse();
}
// src/components/chat.tsx
"use client"; // useChat is a client hook; this directive is required in the Next.js App Router

import { useChat } from "ai/react";

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();
  return (
    <form onSubmit={handleSubmit}>
      {messages.map((m) => (
        <div key={m.id}>{m.content}</div>
      ))}
      <input value={input} onChange={handleInputChange} />
      <button type="submit">Send</button>
    </form>
  );
}
Verify by submitting a prompt that generates a long response and confirming that text begins appearing within 2 seconds of submission.
ID: ai-token-optimization.streaming-performance.streaming-enabled
Severity: high
What to look for: Look for stream: true in OpenAI SDK calls, or use of the Vercel AI SDK's streamText(...) function (as opposed to generateText). In the frontend, look for useChat or useCompletion hooks from the ai package, or manual SSE/stream reading with ReadableStream. Check API route handlers for toDataStreamResponse() or streaming response construction. Also look for evidence of the opposite: long await generateText(...) calls in user-facing routes where the UI shows a spinner until completion. Count all instances found and enumerate each.
Pass criteria: User-facing AI features that produce responses of more than a sentence or two use streaming — the response begins appearing in the UI within 1-2 seconds of submission, rather than after full generation completes. At least 1 implementation must be confirmed.
Fail criteria: AI responses are fully generated server-side before being sent to the client. Users see a loading spinner for the full duration of generation. API routes use generateText (non-streaming) for interactive chat or long-form generation features.
Skip (N/A) when: The AI feature is non-interactive (background job, batch processing) or always produces very short responses (under 50 tokens) where the streaming latency improvement is negligible. Signal: All AI calls are in background jobs or cron handlers, not in response to user HTTP requests. Or the use case demonstrably only generates short outputs (e.g., classification labels, scores).
Detail on fail: "AI responses not streamed — users wait for full generation before seeing any output"
Remediation: LLM generation is inherently slow — a 1000-token response at 50 tokens/second takes 20 seconds. Without streaming, users stare at a spinner for the entire duration. With streaming, they see the first token within 1-2 seconds.
// src/app/api/chat/route.ts — streaming route handler
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const result = await streamText({
    model: openai("gpt-4o"),
    messages,
    maxTokens: 1000,
  });
  return result.toDataStreamResponse();
}
// src/components/chat.tsx — streaming frontend
"use client"; // useChat is a client hook; this directive is required in the Next.js App Router

import { useChat } from "ai/react";

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();
  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>{m.content}</div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}
Verify by submitting a chat message requesting a long response and observing that text begins appearing within 2 seconds.
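Streaming can also be observed from the command line. A sketch, assuming the route is served by a local dev server on port 3000 (the URL is an assumption):

```shell
# -N disables curl's output buffering so streamed chunks print as they arrive.
# The body shape matches what the useChat hook posts to the route handler.
curl -N -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write a short story."}]}'
```

If text trickles out over several seconds, streaming is working; if the full response appears at once after a long pause, the route is buffering.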
For overall performance patterns in your application, see the Performance & Load Readiness Audit.