Input moderation stops known adversarial prompts at the gate, but a sufficiently crafted injection that slips through can still cause the model to produce policy-violating output. OWASP LLM02:2025 specifically calls out output filtering as a required second layer precisely because input-side controls are imperfect. Consumer-facing AI applications in regulated verticals (fintech, healthcare, legal) face liability when harmful content reaches users, regardless of whether the output was triggered by adversarial input. NIST AI RMF MANAGE 1.3 requires active harm management across the full inference pipeline, including the output. The finish_reason field is an often-ignored signal: a value of content_filter means the output was interrupted by provider-side safety filters, a value of length means token-limit truncation, and neither of those incomplete responses should be returned as a complete answer.
High because output moderation is the last code-controlled layer before harmful or adversarially-triggered content reaches users, making its absence a high-impact gap in the defense stack.
Apply output moderation after every completion call, and check finish_reason to handle safety-filtered responses correctly.
const assistantMessage = completion.choices[0]?.message?.content ?? ''
const finishReason = completion.choices[0]?.finish_reason

// Handle provider-side safety filter: never return the interrupted output
if (finishReason === 'content_filter') {
  return Response.json({ error: 'Response was filtered' }, { status: 400 })
}

// Output moderation on the model's response
const mod = await openai.moderations.create({ input: assistantMessage })
if (mod.results[0]?.flagged) {
  return Response.json({ error: 'Response not allowed' }, { status: 400 })
}

return Response.json({ message: assistantMessage })
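The snippet above handles content_filter specifically, but the pass criteria below also expects other non-stop reasons to be handled. A minimal sketch of that classification, where the helper name, the return labels, and the fail-closed default are illustrative assumptions rather than anything the OpenAI SDK provides:

```typescript
// Illustrative helper mapping OpenAI finish_reason values to an action.
// 'stop' is a normal completion; 'content_filter' means provider safety
// filtering fired; 'length' means token-limit truncation; 'tool_calls'
// means the model requested a tool instead of answering.
type FinishReason = 'stop' | 'length' | 'content_filter' | 'tool_calls' | string | undefined

export function classifyFinishReason(
  reason: FinishReason,
): 'ok' | 'filtered' | 'truncated' | 'tool' {
  switch (reason) {
    case 'stop':
      return 'ok'
    case 'content_filter':
      return 'filtered'
    case 'length':
      return 'truncated'
    case 'tool_calls':
      return 'tool'
    default:
      // Fail closed: treat unknown or missing reasons as incomplete output
      return 'truncated'
  }
}
```

Only the 'ok' case should ever be returned to the user as a finished answer; the other three each need their own handling path.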
If you use a non-OpenAI provider, Azure Content Safety and Perspective API both provide output moderation endpoints. Log flagged output events (without the full content) for pattern analysis.
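As a hedged sketch of what a non-OpenAI output check can look like, the snippet below calls Google's Perspective API (the comments:analyze endpoint and doNotStore field are from its public docs) and logs flagged events without the content itself. The 0.8 TOXICITY threshold, the PERSPECTIVE_API_KEY variable, and the helper names are illustrative assumptions to tune per application:

```typescript
// Output moderation via Google's Perspective API (sketch, not a drop-in).
const PERSPECTIVE_URL =
  'https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze'

export function buildPerspectiveRequest(text: string) {
  return {
    comment: { text },
    requestedAttributes: { TOXICITY: {} },
    doNotStore: true, // do not retain user content on Google's side
  }
}

export function isFlagged(
  response: { attributeScores?: { TOXICITY?: { summaryScore?: { value?: number } } } },
  threshold = 0.8, // illustrative threshold; tune per application
): boolean {
  const score = response.attributeScores?.TOXICITY?.summaryScore?.value ?? 0
  return score >= threshold
}

export async function moderateOutput(text: string): Promise<boolean> {
  const res = await fetch(`${PERSPECTIVE_URL}?key=${process.env.PERSPECTIVE_API_KEY}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildPerspectiveRequest(text)),
  })
  if (!res.ok) throw new Error(`Perspective API error: ${res.status}`)
  const flagged = isFlagged(await res.json())
  if (flagged) {
    // Log the event for pattern analysis without storing the full content
    console.warn('output_moderation_flagged', { length: text.length })
  }
  return flagged
}
```

Keeping isFlagged a pure function makes the threshold decision unit-testable without network access.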
ID: ai-prompt-injection.output-filtering.output-content-moderation
Severity: high
What to look for: List all LLM output paths displayed to users. For each, look for content moderation applied to the model's output before it is returned to the user or stored. Check for output moderation API calls (OpenAI Moderation, Perspective API, Azure Content Safety), output filtering functions, or a secondary model call that evaluates the primary model's response. Also look for checks on the AI provider's finish_reason field to detect responses cut short by safety filters.
Pass criteria: Output moderation runs on model responses before they are returned to the client, OR the AI provider in use applies mandatory, unconfigurable output safety filtering (some hosted models do; verify via provider docs and note in findings). finish_reason is checked and non-stop reasons are handled. At least one output moderation layer must apply to 100% of user-facing LLM responses. Report: "X user-facing LLM outputs found, all Y pass through content moderation."
Fail criteria: No output moderation found. Model responses are returned directly to the client without any filtering.
Skip (N/A) when: No AI provider integration detected.
Detail on fail: "POST /api/chat returns model output directly to the client without an output moderation check" or "finish_reason is not checked: responses cut short by safety filters are returned as partial outputs"
Remediation: Even with strong system prompts and input filtering, models can occasionally produce unexpected outputs, especially under adversarial pressure. Add output moderation as a final gate:
const assistantMessage = completion.choices[0]?.message?.content ?? ''
const finishReason = completion.choices[0]?.finish_reason

// Check if the model was stopped by safety filters
if (finishReason === 'content_filter') {
  return Response.json({ error: 'Response was filtered' }, { status: 400 })
}

// Moderation check on output
const modResult = await openai.moderations.create({ input: assistantMessage })
if (modResult.results[0]?.flagged) {
  return Response.json({ error: 'Response not allowed' }, { status: 400 })
}

return Response.json({ message: assistantMessage })
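The "secondary model call" pattern named in the criteria above can also serve as the moderation layer. A hedged sketch: the judge prompt wording, the gpt-4o-mini model choice, and the SAFE/UNSAFE protocol are illustrative assumptions, and the verdict parser fails closed on anything unexpected:

```typescript
// Sketch of an LLM-as-judge output check: a second, cheaper model call
// evaluates the primary model's response before it reaches the user.
const JUDGE_PROMPT =
  'You are a safety reviewer. Reply with exactly SAFE or UNSAFE ' +
  'for the following assistant response.'

export function parseVerdict(raw: string): 'safe' | 'unsafe' {
  // Fail closed: anything other than an explicit SAFE verdict is unsafe
  return raw.trim().toUpperCase().startsWith('SAFE') ? 'safe' : 'unsafe'
}

// `openai` is an OpenAI SDK client instance; typed loosely for the sketch
export async function judgeOutput(openai: any, assistantMessage: string) {
  const judge = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // illustrative choice of judge model
    messages: [
      { role: 'system', content: JUDGE_PROMPT },
      { role: 'user', content: assistantMessage },
    ],
  })
  return parseVerdict(judge.choices[0]?.message?.content ?? '')
}
```

A judge call adds latency and cost per response, so it is typically reserved for high-risk surfaces where a classification API alone is not enough.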