Content safety filtering is active on AI responses

ab-000213 · ai-response-quality.response-management.content-safety-filtering

Severity: high · active

Why it matters

Without a content safety layer, every user-visible AI response is raw model output, including responses to adversarially crafted inputs designed to elicit harmful content. OWASP LLM05 (Improper Output Handling) covers this failure mode. NIST AI RMF MAP-5.1 calls for the likelihood and magnitude of each identified impact to be assessed and documented, which includes harmful output reaching users. A single unfiltered harmful response to a manipulated prompt is enough to create reputational, legal, and regulatory exposure, particularly for applications accessible to minors or vulnerable populations. Provider-level safety settings are necessary but insufficient without application-layer verification.

Severity rationale

High because without any content safety layer, a single jailbreak or adversarial input can produce harmful content served directly to users, exposing the application to legal liability and NIST AI RMF MAP-5.1 non-compliance.

Remediation

Add at least one content safety mechanism to the response pipeline in src/app/api/chat/route.ts:

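// Option A: OpenAI Moderation API (add after generating the response)
// Assumes `openai` is an initialized OpenAI SDK client and `aiResponse.content` holds the generated text
const moderation = await openai.moderations.create({ input: aiResponse.content })
// Block the response before it reaches the user if the moderation endpoint flags it
if (moderation.results[0]?.flagged) {
  return NextResponse.json({ error: 'Response flagged by safety filter' }, { status: 422 })
}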

// Option B: Anthropic Claude — configure system prompt safety framing
// Option C: Azure AI Content Safety or AWS Bedrock Guardrails for provider-level filtering
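Whichever option is used, the check has to sit between generation and the response sent to the user, and it helps to decide up front what happens when the check itself fails. The following is a minimal sketch of a fail-closed wrapper, not a definitive implementation. `ModerationCheck`, `FilterResult`, and `filterResponse` are hypothetical names introduced for illustration and are not part of any SDK.

```typescript
// A moderation check resolves to true when the text should be blocked.
// (Hypothetical type for this sketch; not part of any provider SDK.)
type ModerationCheck = (text: string) => Promise<boolean>;

type FilterResult =
  | { safe: true; content: string }
  | { safe: false; error: string };

// Runs every check in order and blocks on the first one that flags the text.
// If a check throws (timeout, network error), the response is blocked as well,
// so an outage in the moderation service does not let unfiltered output through.
// Note: an empty `checks` array lets everything through, so callers should
// always supply at least one check.
async function filterResponse(
  content: string,
  checks: ModerationCheck[],
): Promise<FilterResult> {
  for (const check of checks) {
    try {
      if (await check(content)) {
        return { safe: false, error: "Response could not be displayed" };
      }
    } catch {
      return { safe: false, error: "Safety check unavailable" };
    }
  }
  return { safe: true, content };
}
```

Under these assumptions, Option A could be passed in as one check, for example a function that calls `openai.moderations.create({ input: text })` and returns the `flagged` value of the first result. Blocking on errors trades availability for safety; an application that prefers to degrade differently would change the `catch` branch.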

For adversarial inputs that attempt to bypass safety filters via prompt injection, the AI Prompt Injection Audit covers jailbreak and injection vectors.

Detection

ID: content-safety-filtering
Severity: high
What to look for: Enumerate all relevant files and check whether the application uses a content moderation layer on AI output. This can be: (a) OpenAI Moderation API called on the response before serving, (b) Anthropic's built-in safety via Claude system prompts, (c) Azure Content Safety or equivalent, (d) a custom regex/keyword filter on response content, (e) a secondary AI moderation pass, or (f) a third-party moderation library. Also check whether the AI provider's safety settings are explicitly configured (e.g., not disabled). Look for safety_settings in Gemini calls, guardrails in Bedrock, or similar provider-level safety config.
Pass criteria: At least one content safety mechanism is active on AI responses: either provider-level safety settings are present and not disabled, or application-layer moderation is applied before serving responses to users.
Fail criteria: No content safety layer is detected — no moderation API calls, no provider safety configuration, and no keyword/pattern filtering on AI output for a user-facing application.
Skip (N/A) when: Application is a private internal tool with no public users, or a developer tool where response safety is not a user concern.
Detail on fail: "No content moderation detected on AI responses in api/chat/route.ts — raw model output served directly to users" (max 500 chars)
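
The "What to look for" step can be approximated with a simple text scan over the project's source files. The sketch below is a heuristic under stated assumptions, not the audit's actual detection logic: `findSafetySignals`, `SafetySignal`, and the pattern list are hypothetical, and a match only suggests that a mechanism is referenced. It cannot tell whether the mechanism is actually applied to responses or whether provider settings have been disabled.

```typescript
interface SafetySignal {
  file: string;
  mechanism: string;
}

// Illustrative patterns only; each one is an assumption about how the
// mechanism typically appears in source code.
const SIGNAL_PATTERNS: ReadonlyArray<{ mechanism: string; pattern: RegExp }> = [
  { mechanism: "OpenAI Moderation API", pattern: /\bmoderations\.create\s*\(/ },
  { mechanism: "Gemini safety settings", pattern: /\bsafety_?settings\b/i },
  { mechanism: "Bedrock guardrails", pattern: /\bguardrail/i },
  { mechanism: "Azure AI Content Safety", pattern: /content[-_ ]?safety/i },
];

// `sources` maps a file path to that file's text content.
// Returns one entry per (file, mechanism) pair whose pattern matches.
function findSafetySignals(sources: Record<string, string>): SafetySignal[] {
  const signals: SafetySignal[] = [];
  for (const file of Object.keys(sources)) {
    const text = sources[file] ?? "";
    for (const { mechanism, pattern } of SIGNAL_PATTERNS) {
      if (pattern.test(text)) {
        signals.push({ file, mechanism });
      }
    }
  }
  return signals;
}
```

An empty result corresponds to the fail criteria; any non-empty result still needs a manual look to confirm the mechanism runs on the response path.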

Remediation: Add content safety to your response pipeline:

// Option A: OpenAI Moderation API
const moderation = await openai.moderations.create({ input: aiResponse })
if (moderation.results[0].flagged) {
  return { error: 'Response could not be displayed', safe: false }
}

// Option B: Use a safety-focused model configuration
// Option C: Provider-level safety (Anthropic Claude with appropriate system prompt framing)

return { content: aiResponse, safe: true }
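
A custom regex/keyword filter, listed as mechanism (d) in the detection criteria, can return the same shape as the snippet above. This is a minimal sketch: the patterns are placeholders, and `BLOCKED_PATTERNS`, `violatesContentPolicy`, and `toSafeResponse` are hypothetical names. A keyword list on its own is easy to evade and is usually paired with a moderation API rather than used as the only layer.

```typescript
// Placeholder patterns for illustration only; a real list would come from
// the application's own content policy.
const BLOCKED_PATTERNS: readonly RegExp[] = [
  /\bblocked-term-one\b/i,
  /\bblocked\s+phrase\s+two\b/i,
];

type SafeResponse =
  | { content: string; safe: true }
  | { error: string; safe: false };

// Returns true when any blocked pattern appears in the text.
function violatesContentPolicy(text: string): boolean {
  return BLOCKED_PATTERNS.some((pattern) => pattern.test(text));
}

// Wraps the model output in the { content | error, safe } shape used above.
function toSafeResponse(aiResponse: string): SafeResponse {
  if (violatesContentPolicy(aiResponse)) {
    return { error: "Response could not be displayed", safe: false };
  }
  return { content: aiResponse, safe: true };
}
```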

For adversarial input attacks that could manipulate safety behavior, the AI Prompt Injection Audit covers prompt injection and jailbreak vectors.

External references

owasp-llm:2025 · LLM05 — Improper Output Handling
nist-ai-rmf:1.0 · MAP-5.1 — Likelihood and magnitude of each identified impact based on exposed individuals

Taxons

inference-contract

History

2026-04-18·v1.0.0·Initial import from ai-response-quality·automated