Content safety filtering is active on AI responses
Why it matters
Without a content safety layer, every user-visible AI response is raw model output, including responses to adversarially crafted inputs designed to elicit harmful content. OWASP LLM05 (Improper Output Handling) covers this failure mode. NIST AI RMF MAP-5.1 calls for the likelihood and magnitude of identified impacts, including output harms, to be assessed and documented. A single unfiltered harmful response to a manipulated prompt is sufficient for reputational, legal, and regulatory exposure, particularly for applications accessible to minors or vulnerable populations. Provider-level safety settings are necessary but insufficient without application-layer verification.
Severity rationale
High because without any content safety layer, a single jailbreak or adversarial input can produce harmful content served directly to users, exposing the application to legal liability and leaving it out of alignment with NIST AI RMF MAP-5.1.
Remediation
Add at least one content safety mechanism to the response pipeline in src/app/api/chat/route.ts:
// Option A: OpenAI Moderation API (add after generating the response)
// Assumes `openai` is an initialized OpenAI client and `aiResponse.content` holds the generated text.
const moderation = await openai.moderations.create({ input: aiResponse.content })
if (moderation.results[0]?.flagged) {
  // Block the response instead of serving flagged content to the user
  return NextResponse.json({ error: 'Response flagged by safety filter' }, { status: 422 })
}
// Option B: Anthropic Claude: configure system prompt safety framing
// Option C: Azure AI Content Safety or AWS Bedrock Guardrails for provider-level filtering
For adversarial inputs that attempt to bypass safety filters via prompt injection, the AI Prompt Injection Audit covers jailbreak and injection vectors.
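The options above can be combined behind a single gate that runs before any response is returned. The following is a minimal sketch, not a definitive implementation: `SafetyCheck`, `keywordCheck`, and `gateResponse` are hypothetical names introduced here for illustration, and the gate is assumed to fail closed (block the response) when a check throws, for example during a moderation API outage.

```typescript
// Sketch only: SafetyCheck, keywordCheck, and gateResponse are hypothetical names, not part of any SDK.

/** Outcome of one safety check on a model response. */
interface SafetyVerdict {
  flagged: boolean;
  reason?: string;
}

/** Any async inspector of response text; a provider moderation call can be wrapped in one. */
type SafetyCheck = (text: string) => Promise<SafetyVerdict>;

/** What the route hands back: either the verified content or a generic error. */
type GateResult =
  | { safe: true; content: string }
  | { safe: false; error: string };

/** Pattern-based check: a coarse backstop, not a replacement for a moderation API. */
function keywordCheck(patterns: RegExp[]): SafetyCheck {
  return async (text) => {
    for (const pattern of patterns) {
      // String.prototype.search ignores lastIndex, so patterns with the `g` flag behave consistently.
      if (text.search(pattern) !== -1) {
        return { flagged: true, reason: `matched /${pattern.source}/` };
      }
    }
    return { flagged: false };
  };
}

/** Run every check in order; block on the first flag or on any check failure. */
async function gateResponse(text: string, checks: SafetyCheck[]): Promise<GateResult> {
  for (const check of checks) {
    try {
      const verdict = await check(text);
      if (verdict.flagged) {
        return { safe: false, error: 'Response could not be displayed' };
      }
    } catch {
      // Fail closed (an assumption of this sketch): never serve output that could not be verified.
      return { safe: false, error: 'Response could not be verified' };
    }
  }
  return { safe: true, content: text };
}

// Hypothetical wiring for Option A, assuming an initialized `openai` client:
// const openaiCheck: SafetyCheck = async (text) => {
//   const moderation = await openai.moderations.create({ input: text });
//   return { flagged: moderation.results[0]?.flagged ?? true }; // a missing result is treated as flagged
// };
```

In a route handler, the result of `gateResponse(aiResponse.content, [openaiCheck, keywordCheck([...])])` would decide between returning the content and returning an error status. Injecting the checks as plain functions keeps the gate testable without network calls.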
Detection
- ID: content-safety-filtering
- Severity: high
- What to look for: Enumerate all relevant files and check whether the application uses a content moderation layer on AI output. This can be: (a) OpenAI Moderation API called on the response before serving, (b) Anthropic's built-in safety via Claude system prompts, (c) Azure Content Safety or equivalent, (d) a custom regex/keyword filter on response content, (e) a secondary AI moderation pass, or (f) a third-party moderation library. Also check whether the AI provider's safety settings are explicitly configured (e.g., not disabled). Look for safety_settings in Gemini calls, guardrails in Bedrock, or similar provider-level safety config.
- Pass criteria: At least one content safety mechanism is active on AI responses: either provider-level safety settings are present and not disabled, or application-layer moderation is applied before serving responses to users.
- Fail criteria: No content safety layer is detected in a user-facing application: no moderation API calls, no provider safety configuration, and no keyword/pattern filtering on AI output.
- Skip (N/A) when: Application is a private internal tool with no public users, or a developer tool where response safety is not a user concern.
- Detail on fail: "No content moderation detected on AI responses in api/chat/route.ts - raw model output served directly to users" (max 500 chars)
- Remediation: Add content safety to your response pipeline:
  // Option A: OpenAI Moderation API
  const moderation = await openai.moderations.create({ input: aiResponse })
  if (moderation.results[0].flagged) {
    return { error: 'Response could not be displayed', safe: false }
  }
  // Option B: Use a safety-focused model configuration
  // Option C: Provider-level safety (Anthropic Claude with appropriate system prompt framing)
  return { content: aiResponse, safe: true }
  For adversarial input attacks that could manipulate safety behavior, the AI Prompt Injection Audit covers prompt injection and jailbreak vectors.
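The "What to look for" step can be approximated with a simple source scan. The sketch below is a rough heuristic under stated assumptions: the signal names and regular expressions are illustrative guesses, not an official rule set, and `scanForSafetySignals` is a hypothetical helper. A match only shows that safety-related code is present; confirming that the setting is enabled and applied before responses are served still needs manual review.

```typescript
// Hypothetical heuristic: signal names and regexes are illustrative, not an official rule set.

interface SafetySignal {
  name: string;
  pattern: RegExp;
}

const SAFETY_SIGNALS: SafetySignal[] = [
  { name: 'openai-moderation', pattern: /\bmoderations\.create\s*\(/ },
  { name: 'gemini-safety-settings', pattern: /\bsafety_settings\b|\bsafetySettings\b/ },
  { name: 'bedrock-guardrails', pattern: /guardrail/i },
  { name: 'azure-content-safety', pattern: /content[-_ ]?safety/i },
];

interface ScanResult {
  status: 'pass' | 'fail';
  matches: { file: string; signal: string }[];
}

/**
 * Scan source text keyed by file path and report which safety signals appear.
 * Status is 'pass' when at least one signal is found, mirroring the pass criteria.
 */
function scanForSafetySignals(files: Record<string, string>): ScanResult {
  const matches: { file: string; signal: string }[] = [];
  for (const file of Object.keys(files)) {
    const source = files[file] ?? '';
    for (const signal of SAFETY_SIGNALS) {
      if (source.search(signal.pattern) !== -1) {
        matches.push({ file, signal: signal.name });
      }
    }
  }
  return { status: matches.length > 0 ? 'pass' : 'fail', matches };
}
```

The function takes file contents as an in-memory map so the caller decides how files are enumerated (for example, everything under src/app/api).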
External references
- owasp-llm:2025 · LLM05 — Improper Output Handling
- nist-ai-rmf:1.0 · MAP-5.1 — Likelihood and magnitude of each identified impact based on exposed individuals
History
- 2026-04-18 · v1.0.0 · Initial import from ai-response-quality · automated