All 20 checks with why-it-matters prose, severity, and cross-references to related audits.
When AI response text is injected directly into the DOM via plain string interpolation, every `**bold**`, `## heading`, and `` `code block` `` in the model's output renders as raw punctuation — a UX failure that signals a half-finished integration. Worse, some models emit markdown-formatted content specifically because they are prompted or fine-tuned to do so; stripping rendering creates a mismatch between what the model produces and what the user sees. OWASP LLM05 identifies trust and safety failures in output presentation as a category of harm; showing raw syntax characters where formatted prose was intended degrades user trust and undermines the inference-contract between model and interface.
Why this severity: Critical because raw markdown syntax in a user-facing interface is both a visible defect and a signal that the AI integration was never tested end-to-end with real model output.
`ai-response-quality.response-formatting.markdown-rendering-enabled`

AI models that are prompted to return JSON do not guarantee valid, correctly-shaped JSON on every response. Without validation via Zod, JSON Schema, or SDK-level structured output, a malformed or schema-violating AI response will either throw an unhandled exception at `JSON.parse()` or silently corrupt downstream logic — writing unexpected data to a database, crashing a rendering component, or causing a downstream API call to fail. OWASP LLM09 (Misinformation) covers the class of failures where AI output is consumed uncritically. CWE-20 (Improper Input Validation) applies directly: structured AI output is external input and must be treated as such.
Why this severity: Critical because unvalidated structured AI output directly exposes application logic to crashes, data corruption, and undefined behavior on every LLM response that deviates from the expected schema.
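The guard this check looks for can be sketched as a type guard over `JSON.parse()` output. In practice a Zod or JSON Schema validator would replace the hand-rolled checks; the `answer` and `confidence` field names here are hypothetical, chosen only to illustrate the pattern:

```typescript
// Minimal validation of structured AI output. The field names ("answer",
// "confidence") are illustrative; a real app defines them per feature,
// typically via a Zod schema rather than manual checks.
type AIAnswer = { answer: string; confidence: number };

type ParseResult =
  | { ok: true; value: AIAnswer }
  | { ok: false; error: string };

function parseAIAnswer(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw); // malformed JSON fails here, not downstream
  } catch {
    return { ok: false, error: "invalid JSON" };
  }
  // Shape check: treat AI output as untrusted external input (CWE-20).
  if (
    typeof data === "object" && data !== null &&
    typeof (data as any).answer === "string" &&
    typeof (data as any).confidence === "number" &&
    (data as any).confidence >= 0 && (data as any).confidence <= 1
  ) {
    return { ok: true, value: data as AIAnswer };
  }
  return { ok: false, error: "schema violation" };
}
```

The point of the discriminated `ParseResult` union is that callers are forced to handle the failure branch; there is no code path where an unvalidated object reaches application logic.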
`ai-response-quality.response-formatting.structured-output-compliance`

Code rendered as plain text mashes variable names, file paths, and commands into prose where users cannot distinguish them from English — copy-paste breaks, indentation collapses, and shell commands get executed with the wrong quoting. This violates the inference-contract taxon: your UI renders markdown, but the model was never told to emit it. Developers paste broken snippets, support tickets escalate, and the product looks amateur next to competitors whose code blocks render with syntax highlighting.
Why this severity: High because every technical response degrades simultaneously, affecting trust and usability across the entire product surface.
`ai-response-quality.response-formatting.code-block-usage`

Unbounded `max_tokens` with no length guidance produces two failure modes at once: the model either pads simple answers into three-paragraph essays (burning tokens against the cost-efficiency taxon) or gets cut off mid-sentence on complex queries because the provider's default completion limit is exhausted. Users wait longer, pay more per request, and read padding that obscures the answer. At scale, a single missing `max_tokens` parameter can triple your OpenAI bill without anyone noticing until the invoice.
Why this severity: Medium because the impact is financial and UX degradation rather than security or data loss, but recurs on every request.
`ai-response-quality.response-formatting.response-length-proportionality`

When `finish_reason: "length"` fires and the application ignores it, users receive half-written code, truncated JSON that fails to parse, or advice that ends mid-clause — and they have no signal that anything was cut. They act on incomplete information, file bug reports against the wrong component, or lose trust in the assistant entirely. The error-resilience taxon requires surfacing partial-output states; silent truncation is the worst possible UX for a recoverable failure because it looks like a complete answer.
Why this severity: Low because truncation is infrequent in practice and users can retry, but the failure mode is silent and misleading.
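Surfacing the truncated state is a one-field check. The `{ text, finishReason }` shape below is a simplified stand-in for what OpenAI-style APIs expose as `choices[0].finish_reason`; the idea is only that the UI layer receives an explicit `truncated` flag instead of a bare string:

```typescript
// Surface truncation instead of silently rendering a partial answer.
// The Completion shape is illustrative, modeled on OpenAI-style responses.
interface Completion {
  text: string;
  finishReason: "stop" | "length" | string;
}

function presentCompletion(c: Completion): { text: string; truncated: boolean } {
  if (c.finishReason === "length") {
    // Flag the partial state so the UI can show a "response was cut off"
    // notice and offer a retry or continue action.
    return { text: c.text, truncated: true };
  }
  return { text: c.text, truncated: false };
}
```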
`ai-response-quality.response-formatting.truncation-handling`

When an AI generates clickable hyperlinks in its response and no guardrail prohibits invented citations, users will click fabricated URLs — routing them to domains that may be squatted, adversarial, or simply nonexistent. For research or knowledge applications, invented paper citations and statistics erode user trust and can cause real-world harm if acted on (medical guidance, legal information, financial claims). OWASP LLM09 explicitly identifies misinformation propagation as an LLM risk. The combination of markdown link rendering and no anti-fabrication instruction in the system prompt is the highest-risk configuration in this category.
Why this severity: Critical because the combination of no citation prohibition and markdown link rendering means users will follow AI-invented URLs, with no indication that the destinations are fabricated.
`ai-response-quality.source-attribution.no-fabricated-references`

RAG pipelines that retrieve document chunks but strip source metadata before passing them to the model make it impossible for the AI to correctly attribute its answers. Users reading the response have no way to verify claims, locate original documents, or identify when the model has drawn on training data instead of retrieved content. For regulated industries — legal, healthcare, financial — unattributed claims create liability and compliance risk. OWASP LLM09 classifies unattributed AI-generated content as a misinformation risk. Without attribution cues, the inference-contract between the retrieval system and the user is broken.
Why this severity: High because unattributed AI responses in knowledge-domain applications undermine user ability to verify claims and expose the application to liability when users act on unverifiable information.
`ai-response-quality.source-attribution.external-claim-citations`

Language models have a training cutoff and cannot know about events, legislation, software releases, or market conditions that postdate it. When an application provides no disclosure — neither in the system prompt nor the UI — users treat AI responses as current, leading to decisions based on outdated information. For compliance-sensitive domains (GDPR enforcement updates, securities regulations, medical guidelines), acting on stale AI output can cause real harm. NIST AI RMF GOVERN-1.1 requires transparency about AI system limitations. OWASP LLM09 identifies temporal misinformation as an LLM risk category.
Why this severity: High because users without cutoff disclosure will act on stale AI information as if it were current, with no signal that facts, laws, or specifications may have materially changed.
`ai-response-quality.source-attribution.knowledge-cutoff-disclosure`

A RAG pipeline that retrieves documents but fails to inject them into the model context — or injects them as unstructured blob text — is just an expensive vector search that the LLM cannot use. The model falls back to parametric knowledge, hallucinates confidently on domain-specific questions, and cites sources it never saw. This breaks the inference-contract taxon: you promised grounded answers, you delivered confabulation. Users trust the retrieved citations exactly when they are least reliable.
Why this severity: Medium because RAG hallucinations bypass the safety layer users believe they are paying for.
`ai-response-quality.source-attribution.rag-source-passthrough`

Ad-hoc citation language — sometimes `[1]`, sometimes "according to the docs," sometimes nothing — prevents the UI from rendering clickable references, breaks footnote components, and forces users to scroll back and hunt for sources. This degrades the user-experience taxon and undermines the inference-contract: users cannot verify claims when citations are inconsistent or absent. Support teams cannot build tooling on top of an undefined format, and audit trails for regulated industries become unworkable.
Why this severity: Low because responses remain usable, but citation inconsistency erodes verifiability and UI polish over time.
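Once a single format is mandated (say, numeric `[n]` markers), a small checker can verify that every marker in a response resolves to a provided source — a sketch under that assumed convention, not a general-purpose citation parser:

```typescript
// Enforce one citation format: numeric [n] markers that must resolve to a
// provided source list. Returns the dangling marker numbers, if any.
// The [n] convention is an assumption; pick one format and enforce it.
function danglingCitations(response: string, sources: string[]): number[] {
  const markers = [...response.matchAll(/\[(\d+)\]/g)].map(m => Number(m[1]));
  return markers.filter(n => n < 1 || n > sources.length);
}
```

A non-empty return value means the response cites a source that was never supplied — exactly the fabrication case the previous check warns about.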
`ai-response-quality.source-attribution.attribution-format-consistency`

RAG applications retrieve context specifically to constrain the model to known, verifiable information — but without an explicit system prompt instruction to stay within that context, the model will freely blend retrieved content with training-data confabulations. Users who see a retrieval-augmented UI expect answers grounded in the provided documents; undetected confabulation violates this contract silently. OWASP LLM09 covers misinformation generated from this pattern. NIST AI RMF MEASURE-2.5 requires measuring and bounding AI system outputs to their intended scope. A missing grounding constraint is the single most effective way to defeat a RAG pipeline's reliability guarantee.
Why this severity: High because an ungrounded RAG system produces confidently-stated confabulations that users cannot distinguish from retrieved-document answers, defeating the core reliability purpose of RAG.
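A minimal sketch of the missing grounding constraint — the instruction wording and the chunk shape are assumptions, not canonical prompt text; the essential parts are the explicit "only from the context" rule, the "say you don't know" escape hatch, and per-chunk source labels the model can cite:

```typescript
// Hypothetical grounded-prompt builder. The instruction sentences are
// examples of the grounding constraint, not a canonical prompt.
function buildGroundedPrompt(chunks: { source: string; text: string }[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (source: ${c.source})\n${c.text}`)
    .join("\n\n");
  return [
    "Answer ONLY from the context below.",
    "If the context does not contain the answer, say you don't know.",
    "Cite sources as [n].",
    "",
    "Context:",
    context,
  ].join("\n");
}
```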
`ai-response-quality.hallucination-prevention.context-grounding-enforced`

A system prompt that instructs an AI to "always sound confident and authoritative" actively suppresses the model's natural uncertainty signaling — turning calibration failures into confident fabrications. Users in factual domains (legal, medical, financial, technical) rely on hedging language to know when to independently verify a claim. Stripping that signal is not a UX improvement; it is an epistemic hazard. OWASP LLM09 identifies overconfident AI output as a misinformation risk. NIST AI RMF MEASURE-2.5 requires that AI systems accurately represent their confidence levels to operators and users.
Why this severity: High because suppressing uncertainty language causes users to act on fabricated or uncertain AI answers without the hedging signals that would prompt them to verify, directly enabling harm in factual-domain applications.
`ai-response-quality.hallucination-prevention.uncertainty-signaling`

A domain-scoped assistant without a scope boundary becomes a general-purpose chatbot the moment a user asks anything off-topic. Customer-support bots start giving medical advice, code assistants opine on tax law, and the company inherits liability for confabulated answers outside its expertise. This violates the inference-contract taxon — you shipped a support product and delivered an unconstrained LLM. Worse, out-of-scope answers are where hallucination rates spike, because the model has no grounding.
Why this severity: Medium because scope leakage expands liability and hallucination surface beyond the product's tested domain.
`ai-response-quality.hallucination-prevention.out-of-scope-refusal`

When server-side code passes raw database records — including user objects with hashed passwords, internal flags, or API credentials — into an AI prompt, that data becomes part of the inference context processed by an external third-party API. Even if the model does not echo the data verbatim, it may reference, paraphrase, or leak it in edge-case responses. GDPR Article 5(1)(c) (data minimization) requires that personal data is limited to what is necessary. CWE-200 (Exposure of Sensitive Information) and OWASP LLM06 (Sensitive Information Disclosure) both apply. Logging raw AI responses without access controls compounds the exposure by persisting the sensitive context to a log store.
Why this severity: Medium because exploitation requires either an unusual model behavior or log access, but the data minimization violation is present in every request and violates GDPR Article 5(1)(c) regardless of observed leakage.
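The minimization fix is an allow-list projection applied before any record reaches a prompt. The field names below are illustrative; the design point is that the projection lists what may leave the server, so sensitive columns added later are excluded by default:

```typescript
// Allow-list projection: only the fields the prompt actually needs leave
// the server. Field names are illustrative, not a real schema.
interface DbUser {
  id: string;
  name: string;
  email: string;
  passwordHash: string;
  apiKey: string;
  isAdmin: boolean;
}

function promptSafeUser(u: DbUser): { name: string } {
  // Deliberately an allow-list, not a deny-list: new sensitive columns
  // are excluded by default (GDPR Art. 5(1)(c) data minimization).
  return { name: u.name };
}
```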
`ai-response-quality.hallucination-prevention.no-sensitive-data-leakage`

High temperature settings (≥0.8) combined with no factuality constraints in the system prompt create a confabulation-maximizing configuration. For research, knowledge, or information retrieval applications, this means the model will confidently generate plausible-but-invented statistics, misattributed quotes, and false technical specifications. OWASP LLM09 directly covers this failure mode. Temperature controls randomness in token sampling; at high values, the model increasingly selects lower-probability tokens — raising the rate of semantically coherent but factually incorrect output in a way that is invisible to users.
Why this severity: Medium because high temperature alone does not guarantee confabulation, but combined with no factuality system prompt instruction it materially and measurably increases hallucination rate in production queries.
`ai-response-quality.hallucination-prevention.factual-claim-boundedness`

System prompts that include "always provide a complete, detailed answer" without qualification effectively prohibit the AI from acknowledging knowledge gaps — creating a forcing function for fabrication. When the model cannot say "I don't know," it generates a plausible-sounding answer instead. For users who cannot independently verify AI output, this is indistinguishable from a correct answer. OWASP LLM09 identifies this as a misinformation risk category. The fix is low-cost — a single sentence of explicit permission — but the absence of it structurally removes the model's most important safety valve against confabulation.
Why this severity: Info because the failure occurs only when the model encounters a genuine knowledge gap, not on every request, but a single forced fabrication in a sensitive domain can cause direct user harm.
`ai-response-quality.hallucination-prevention.idontknow-acknowledgment`

Without a content safety layer, every user-visible AI response is raw model output — including responses to adversarially crafted inputs designed to elicit harmful content. OWASP LLM05 (Improper Output Handling) covers this failure mode. NIST AI RMF MAP-5.1 requires that AI systems have mapped and mitigated output harms before deployment to users. A single unfiltered harmful response to a manipulated prompt is sufficient for reputational, legal, and regulatory exposure — particularly for applications accessible to minors or vulnerable populations. Provider-level safety settings are necessary but insufficient without application-layer verification.
Why this severity: High because without any content safety layer, a single jailbreak or adversarial input can produce harmful content served directly to users, exposing the application to legal liability and NIST AI RMF MAP-5.1 non-compliance.
`ai-response-quality.response-management.content-safety-filtering`

Temperature above 0.8 on a consistency-sensitive endpoint means the same billing question gets three different answers on three retries — one correct, one subtly wrong, one contradicting the docs. For customer support, legal, or documentation Q&A, this breaks the inference-contract taxon: users expect the same question to yield the same answer, and variance here manifests as product bugs, escalated tickets, and compliance incidents. Aggressive sampling is a feature for creative writing and a defect everywhere else.
Why this severity: Low because variance is tolerable in many contexts, but becomes serious for regulated or support-critical surfaces.
`ai-response-quality.response-management.response-consistency`

Multi-turn chat applications that accumulate conversation history without any token management will eventually exceed the model's context window — at which point the API returns a 400 error and the conversation becomes unusable. Beyond hard failure, unbounded context accumulation drives token costs up linearly with conversation length. Users in long sessions pay a disproportionately high cost per message as the prompt grows. ISO 25010:2011 performance-efficiency requires that software resource consumption is proportional to task requirements. A sliding window or summarization strategy prevents both the failure mode and the cost spiral.
Why this severity: Low because the failure is deterministic but deferred — applications only break when a conversation reaches the context limit, which may take dozens of turns, but the architectural defect is present from the first message.
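The sliding-window strategy can be sketched as follows. The token budget and the rough ~4-characters-per-token estimator are assumptions for illustration; production code would use the model's actual tokenizer, and a summarization variant would compress dropped turns instead of discarding them:

```typescript
// Sliding-window history trimming by a rough token estimate (~4 chars/token).
// The estimator is an assumption; use the model's real tokenizer in practice.
interface Msg {
  role: "system" | "user" | "assistant";
  content: string;
}

const estimateTokens = (text: string) => Math.ceil(text.length / 4);

function trimHistory(history: Msg[], budget: number): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  // Walk newest to oldest, keeping whole messages until the budget is hit,
  // so the most recent turns always survive.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i].content);
    if (used + cost > budget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}
```

Note the system prompt would typically be pinned outside this window so it is never trimmed away.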
`ai-response-quality.response-management.context-window-utilization`

Without logging model name, token usage, and latency on AI calls, there is no operational visibility into cost trajectories, quality degradation, or performance regressions. A model version bump by the provider, a prompt change that doubles token usage, or a latency spike affecting user experience will go undetected until users report problems. NIST AI RMF MEASURE-2.7 requires that AI systems have mechanisms for ongoing performance measurement. ISO 25010:2011 maintainability requires that system behavior is observable. The observability infrastructure is low-effort to add and eliminates an entire class of invisible production failures.
Why this severity: Info because the absence of observability does not directly cause user-visible failures, but it makes every other AI reliability issue — cost overruns, quality degradation, latency regressions — invisible until they become critical.
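The low-effort instrumentation described above amounts to wrapping the AI call and emitting one structured log entry per request. The `callAI` and `AIResult` shapes below are stand-ins for whatever SDK the application uses, and the call is shown synchronously for brevity (a real SDK call is async):

```typescript
// Record model, token usage, and latency for every AI call.
// AIResult is a stand-in for the SDK's response type.
interface AIResult {
  text: string;
  model: string;
  totalTokens: number;
}

interface CallLog {
  model: string;
  totalTokens: number;
  latencyMs: number;
}

function observedCall(
  callAI: () => AIResult,
  log: (entry: CallLog) => void,
): AIResult {
  const start = Date.now();
  const result = callAI();
  // One structured entry per request feeds cost dashboards and latency alerts.
  log({
    model: result.model,
    totalTokens: result.totalTokens,
    latencyMs: Date.now() - start,
  });
  return result;
}
```

Injecting the `log` function keeps the wrapper testable and lets the sink be swapped (stdout, OpenTelemetry, a metrics pipeline) without touching call sites.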
`ai-response-quality.response-management.response-metadata-exposed`

Run this audit in your AI coding tool (Claude Code, Cursor, Bolt, etc.) and submit results for scoring and benchmarks.