Without logging model name, token usage, and latency on AI calls, there is no operational visibility into cost trajectories, quality degradation, or performance regressions. A provider-side model version bump, a prompt change that doubles token usage, or a latency spike degrading user experience will go undetected until users report problems. NIST AI RMF MEASURE-2.7 requires that AI systems have mechanisms for ongoing performance measurement, and the maintainability characteristic of ISO/IEC 25010:2011 requires that system behavior be observable. The observability infrastructure is low-effort to add and eliminates an entire class of invisible production failures.
Info because the absence of observability does not directly cause user-visible failures, but it makes every other AI reliability issue — cost overruns, quality degradation, latency regressions — invisible until they become critical.
Add minimal structured logging to every AI API call:
const t0 = Date.now()
const response = await openai.chat.completions.create({ model: process.env.OPENAI_MODEL ?? 'gpt-4o', ... })
// Log telemetry server-side only (never send it to the client)
console.log(JSON.stringify({
  model: response.model,
  promptTokens: response.usage?.prompt_tokens,
  completionTokens: response.usage?.completion_tokens,
  finishReason: response.choices[0]?.finish_reason,
  latencyMs: Date.now() - t0
}))
For production, route this telemetry to a dedicated platform such as Helicone, Braintrust, or LangSmith rather than console.log. Also configure the model name via process.env.OPENAI_MODEL rather than hardcoding it, so version changes are tracked through config rather than code diffs.
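The pattern above can be factored into a small reusable helper so that call sites do not repeat the timing and logging boilerplate. This is an illustrative sketch, not part of any SDK: withTelemetry, ChatResponse, and Telemetry are hypothetical names, and the types cover only the fields this rule cares about.

```typescript
// Hypothetical helper (not part of the OpenAI SDK): wraps a chat-completion
// call and emits a structured telemetry record alongside the response.
type ChatResponse = {
  model: string
  usage?: { prompt_tokens: number; completion_tokens: number }
  choices: { finish_reason?: string }[]
}

type Telemetry = {
  model: string
  promptTokens?: number
  completionTokens?: number
  finishReason?: string
  latencyMs: number
}

async function withTelemetry(
  call: () => Promise<ChatResponse>,
  // Default sink is console.log; swap in a platform exporter for production.
  log: (t: Telemetry) => void = (t) => console.log(JSON.stringify(t))
): Promise<ChatResponse> {
  const t0 = Date.now()
  const response = await call()
  log({
    model: response.model,
    promptTokens: response.usage?.prompt_tokens,
    completionTokens: response.usage?.completion_tokens,
    finishReason: response.choices[0]?.finish_reason,
    latencyMs: Date.now() - t0,
  })
  return response
}
```

A call site then becomes withTelemetry(() => openai.chat.completions.create({ ... })), so every AI call is logged without duplicating the instrumentation.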
ID: ai-response-quality.response-management.response-metadata-exposed
Severity: info
What to look for: Enumerate all relevant files and check whether the application captures and logs (server-side, not to the client) AI response metadata: model name used, token counts (prompt_tokens, completion_tokens), latency, finish_reason, and request ID. Look for any observability setup: LangSmith, Helicone, Braintrust, OpenTelemetry integration with an AI SDK, or custom logging of response.usage. Check whether the model name is hardcoded or configurable via environment variable.
Pass criteria: Application logs at least model name, token usage, and latency server-side. Model name is configurable via environment variable or config rather than buried in code.
Fail criteria: No response metadata logging exists at all for a production application — impossible to debug cost, latency, or quality issues.
Skip (N/A) when: Application is a prototype or personal tool where operational observability is not yet a concern.
Detail on fail: "No token usage or model metadata logged after AI calls in api/chat/route.ts — no observability into cost or performance" (max 500 chars)
Remediation: Add minimal observability to AI calls:
const startTime = Date.now()
const response = await openai.chat.completions.create({ ... })
console.log({
  model: response.model,
  promptTokens: response.usage?.prompt_tokens,
  completionTokens: response.usage?.completion_tokens,
  finishReason: response.choices[0]?.finish_reason,
  latencyMs: Date.now() - startTime
})
For production, consider a dedicated AI observability platform (Helicone, Braintrust, or LangSmith) to track quality and cost over time.
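As one concrete route, Helicone's documented integration works by repointing the OpenAI client at its logging proxy, so model, token, and latency data are captured without touching call sites. A sketch under stated assumptions: HELICONE_API_KEY is an assumed environment variable, and the base URL should be verified against current Helicone documentation.

```typescript
import OpenAI from 'openai'

// Sketch: route OpenAI traffic through Helicone's proxy so each call's
// model, token usage, and latency are recorded by the platform.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: 'https://oai.helicone.ai/v1', // Helicone gateway (verify against their docs)
  defaultHeaders: {
    'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`, // assumed env var
  },
})
```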