Input content moderation is the cheapest effective first-line filter against harmful and adversarial content reaching your model. Without it, your AI feature accepts arbitrary text from any user—including coordinated abuse, prompt injection probes, and policy-violating requests. OWASP LLM01:2025 identifies moderation as a required control; NIST AI RMF MANAGE 1.3 requires active management of harm vectors in deployed AI systems. For consumer-facing applications, unmoderated inputs expose the platform to both regulatory risk and reputation damage when the model produces harmful outputs in response to adversarial prompts that a moderation layer would have blocked at the door.
Severity is low because most AI providers apply some output-side safety filtering that partially compensates; even so, input moderation is a low-effort control that significantly reduces the attack surface reaching the primary model.
Call the OpenAI Moderation endpoint before your primary completion call. It is free, typically adds under 100 ms of latency, and flags a broad range of harmful and adversarial content categories so your handler can reject the request before it reaches the model.
// src/app/api/chat/route.ts
import OpenAI from 'openai'
import { ChatInputSchema } from './schema' // assumed location of the Zod request schema

const openai = new OpenAI()

export async function POST(req: Request) {
  // Zod's parse is synchronous; it throws on invalid input
  const { message } = ChatInputSchema.parse(await req.json())

  // Moderation gate — runs before the primary completion
  const mod = await openai.moderations.create({ input: message })
  if (mod.results[0]?.flagged) {
    return Response.json({ error: 'Message not allowed' }, { status: 400 })
  }

  // Proceed with primary completion only if moderation passes
  const completion = await openai.chat.completions.create({ ... })
  return Response.json(completion)
}
If you use a provider other than OpenAI, Perspective API (Google) covers overlapping categories (toxicity, severe toxicity, threats, insults) and has a free tier for low-volume applications. Log the moderation result (not the full message content) for monitoring.
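For the Perspective route, a minimal sketch of the request payload and a flagging check. The attribute names match Google's Comment Analyzer API, but the attribute selection and the 0.8 cutoff are assumptions to tune per application:

```typescript
// Builds the Comment Analyzer request body; attribute set is an example choice.
export function buildPerspectiveRequest(text: string) {
  return {
    comment: { text },
    languages: ['en'],
    requestedAttributes: { TOXICITY: {}, SEVERE_TOXICITY: {}, THREAT: {} },
    doNotStore: true, // avoid persisting user content on the service side
  }
}

// Treats the input as flagged if any requested attribute's summary score
// crosses the threshold. Threshold is an assumed default, not an API value.
export function isFlagged(
  scores: Record<string, { summaryScore: { value: number } }>,
  threshold = 0.8,
): boolean {
  return Object.values(scores).some((s) => s.summaryScore.value >= threshold)
}
```

POST the payload to the Comment Analyzer `comments:analyze` endpoint with your API key, then pass `attributeScores` from the response into `isFlagged`; `doNotStore: true` keeps the service from retaining user content.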
ID: ai-prompt-injection.input-sanitization.input-content-moderation
Severity: low
What to look for: Enumerate every user-facing LLM input path. For each, look for integration with a content moderation service on the input side — OpenAI Moderation API (moderations.create), Perspective API, Azure Content Safety, or equivalent. Check whether moderation runs before the primary AI completion call, and whether the result gates access to the AI feature.
Pass criteria: A content moderation check runs on user input before it is sent to the primary AI model, and requests that fail moderation are rejected with an appropriate error response — at least 1 content moderation layer (OpenAI Moderation API, Perspective API, or custom classifier) applied on 100% of user inputs. Report: "X user-facing LLM inputs found, all Y pass through content moderation."
Fail criteria: No input content moderation found — user input is forwarded to the primary model without moderation gating.
Skip (N/A) when: No AI provider integration detected. Also skip if the AI provider used has built-in input filtering that cannot be disabled (some hosted models apply automatic moderation — verify via provider docs).
Detail on fail: "No content moderation call found before primary AI completion in POST /api/chat" or "OpenAI Moderation API imported but commented out / not called on user inputs"
Remediation: Input moderation provides a fast, cheap, first-line defense. The OpenAI Moderation endpoint is free and adds minimal latency:
const modResult = await openai.moderations.create({ input: userMessage })
if (modResult.results[0]?.flagged) {
  return Response.json({ error: 'Message not allowed' }, { status: 400 })
}
Run this before your primary completion call. Log flagged attempts (without storing the full message content) for monitoring.
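One sketch of a content-free log entry: record which categories fired and a truncated hash for correlating repeat attempts, never the message body. The field names and hash-correlation approach are assumptions, not a required format:

```typescript
import { createHash } from 'node:crypto'

// Builds a log-safe record from an OpenAI-style moderation result.
// The raw message is hashed, not stored, so logs contain no user content.
export function moderationLogEntry(
  result: { flagged: boolean; categories: Record<string, boolean> },
  message: string,
) {
  return {
    flagged: result.flagged,
    categories: Object.keys(result.categories).filter((c) => result.categories[c]),
    messageHash: createHash('sha256').update(message).digest('hex').slice(0, 16),
    at: new Date().toISOString(),
  }
}
```

Feed these entries to your existing logger or metrics pipeline; a spike in flagged attempts from one hash or category is an early signal of coordinated probing.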