Moderation APIs configured with permissive thresholds (e.g. flagging only at toxicity > 0.9) will miss the majority of harmful content: most real-world toxic posts score between 0.5 and 0.8. A threshold that high effectively disables the automated filter, forcing human moderators to catch everything the system lets through. Under CWE-20, this is an input validation failure: the system nominally validates content, but its configuration makes the validation meaningless. Platforms that over-rely on high-confidence thresholds face the same regulatory risk as platforms with no filtering at all.
Medium because misconfigured thresholds systematically under-filter harmful content, compounding moderation workload and exposing users to abuse the system was supposed to catch.
Set toxicity thresholds conservatively in your moderation config. Flag at 0.5 (review queue), auto-remove at 0.8. Update src/config/moderation.ts:
export const MODERATION_THRESHOLDS = {
  toxicity: { flag: 0.5, autoRemove: 0.8 },
  severeToxicity: { flag: 0.3, autoRemove: 0.6 },
  insult: { flag: 0.5, autoRemove: 0.8 },
  identityAttack: { flag: 0.4, autoRemove: 0.7 },
};
Content scoring above the flag threshold goes to the moderation queue; content above autoRemove is hidden immediately pending review. Never ship with thresholds above 0.7 for the flag tier.
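The two-tier behavior described above can be sketched as a small helper. This is a minimal illustration, not a real moderation API: the `decide` function, the `Decision` type, and the attribute names are assumptions mirroring the config shown earlier.

```typescript
// Sketch: apply flag/autoRemove tiers to per-attribute scores (0..1).
// Names are illustrative; adapt to your moderation pipeline.
type Thresholds = { flag: number; autoRemove: number };

const MODERATION_THRESHOLDS: Record<string, Thresholds> = {
  toxicity: { flag: 0.5, autoRemove: 0.8 },
  severeToxicity: { flag: 0.3, autoRemove: 0.6 },
  insult: { flag: 0.5, autoRemove: 0.8 },
  identityAttack: { flag: 0.4, autoRemove: 0.7 },
};

type Decision = "allow" | "queue" | "autoRemove";

function decide(scores: Record<string, number>): Decision {
  let decision: Decision = "allow";
  for (const [attr, score] of Object.entries(scores)) {
    const t = MODERATION_THRESHOLDS[attr];
    if (!t) continue; // unknown attribute: ignore
    if (score >= t.autoRemove) return "autoRemove"; // hide immediately, pending review
    if (score >= t.flag) decision = "queue"; // send to moderation queue
  }
  return decision;
}
```

Note that any single attribute crossing its autoRemove threshold wins, so one severe signal cannot be averaged away by benign scores on other attributes.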
ID: community-moderation-safety.content-filtering.safe-defaults
Severity: medium
What to look for: If using third-party content moderation APIs (Perspective API, OpenAI moderation, etc.), check if thresholds are configured conservatively. Look for configuration files or environment variables that control toxicity/safety thresholds.
Pass criteria: If automated moderation is used, thresholds are set conservatively: flag content at greater than 50% toxicity probability, not only above 90%. Quote the actual threshold values from configuration or code. Filtering defaults to "safe", erring on the side of caution. On pass, count all configured threshold values and report their levels.
Fail criteria: Thresholds are too lenient (only flag if >90% confidence), or filtering defaults to "allow everything" mode. A commented-out threshold or placeholder value does not count as pass.
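The pass/fail criteria above can be checked mechanically. A minimal sketch, assuming thresholds are exported as a flat map of { flag, autoRemove } pairs (the `checkThresholds` helper and its shape are hypothetical):

```typescript
// Sketch: validate a threshold config against the pass criteria.
// Fails any attribute whose flag tier exceeds 0.7, or whose
// autoRemove tier sits below its flag tier (inverted ordering).
type Tier = { flag: number; autoRemove: number };

function checkThresholds(config: Record<string, Tier>): string[] {
  const failures: string[] = [];
  for (const [attr, t] of Object.entries(config)) {
    if (t.flag > 0.7) {
      failures.push(`${attr}: flag threshold ${t.flag} exceeds 0.7`);
    }
    if (t.autoRemove < t.flag) {
      failures.push(`${attr}: autoRemove ${t.autoRemove} below flag ${t.flag}`);
    }
  }
  return failures; // empty array means the config passes this check
}
```

A commented-out or placeholder value never reaches this check, which is consistent with the fail criteria: only live, parsed configuration counts.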
Skip (N/A) when: No third-party content moderation API is used, or platform is <500 users with manual-only moderation.
Detail on fail: "Perspective API toxicity threshold is set to 0.90 — only flags content with >90% confidence. This will miss most toxic content, which typically scores between 0.5 and 0.8."
Cross-reference: Compare with community-moderation-safety.content-filtering.profanity-filtering — both should share content pipeline, but this check evaluates numeric threshold configuration while profanity-filtering evaluates keyword presence.
Remediation: Use conservative thresholds when configuring moderation APIs. Flag content at lower confidence levels (>0.5 rather than >0.9) to catch more problematic content. Update config in src/config/moderation.ts or equivalent:
// Conservative thresholds: flag at 50%, auto-remove at 80%
export const MODERATION_THRESHOLDS = {
  toxicity: { flag: 0.5, autoRemove: 0.8 },
  severeToxicity: { flag: 0.3, autoRemove: 0.6 },
  insult: { flag: 0.5, autoRemove: 0.8 },
  identityAttack: { flag: 0.4, autoRemove: 0.7 },
};