Keyword-based profanity filters catch known bad words but miss context-dependent toxicity: veiled threats, dog-whistles, coded harassment, and insults that don't trigger keyword lists. Automated toxicity scoring via ML models (Perspective API, OpenAI Moderation, AWS Comprehend) catches the patterns that keyword lists miss. Without it, communities routinely develop persistent harassment cultures because bad actors quickly learn which phrases bypass keyword filters. This is CWE-20 input validation applied at the semantic layer. Platforms with substantial UGC and no toxicity scoring have higher reported abuse rates and higher moderator burnout.
High because without ML toxicity scoring, keyword-bypassing harassment goes undetected, degrading community safety and driving out users who experience sustained abuse.
Integrate a toxicity scoring service at every content-submission endpoint before the database write. Using Perspective API:
// Calls Perspective API and returns the overall TOXICITY score (0–1).
async function scoreToxicity(text) {
  const res = await fetch(
    `https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key=${process.env.PERSPECTIVE_API_KEY}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        comment: { text },
        requestedAttributes: { TOXICITY: {}, SEVERE_TOXICITY: {}, INSULT: {} },
      }),
    }
  );
  if (!res.ok) throw new Error(`Perspective API error: ${res.status}`);
  const data = await res.json();
  return data.attributeScores.TOXICITY.summaryScore.value;
}
Content scoring above your flag threshold (see safe-defaults check) goes to the moderation queue; content above the auto-remove threshold is hidden immediately. If latency is a hard constraint, run scoring asynchronously after the save, but hide the content from public view until scoring completes.
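The threshold routing described above can be sketched as a pure function. The threshold values here are illustrative assumptions, not values prescribed by this check; tune them per the safe-defaults check.

```javascript
// Illustrative thresholds — actual values come from the safe-defaults check.
const FLAG_THRESHOLD = 0.7;        // at or above: queue for moderator review
const AUTO_REMOVE_THRESHOLD = 0.9; // at or above: hide immediately

// Maps a toxicity score (0–1) to a moderation action.
function routeByToxicity(score) {
  if (score >= AUTO_REMOVE_THRESHOLD) return 'auto_remove';      // hidden now
  if (score >= FLAG_THRESHOLD) return 'moderation_queue';        // human review
  return 'publish';                                              // visible to all
}
```

Keeping routing separate from the scoring call makes the thresholds easy to test and adjust without touching the API integration.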
ID: community-moderation-safety.content-filtering.toxicity-scoring
Severity: high
What to look for: Check if user-submitted text (posts, comments, messages, etc.) goes through automated toxicity analysis before being published or displayed. Look for integration with toxicity scoring services (Perspective API, OpenAI Moderation, AWS Comprehend, etc.) or local ML models. Verify that scoring happens server-side before content is stored or made publicly visible.
Pass criteria: All user-submitted text is analyzed for toxicity using an automated scoring system. List all content-submission endpoints and confirm each routes through the toxicity scorer. Content flagged as toxic is either rejected, quarantined for moderator review, or hidden by default. Scoring runs server-side before publication. On pass, report the count of endpoints with scoring vs. total endpoints.
Fail criteria: No automated toxicity scoring is implemented. User content is not analyzed before being displayed to other users. A stub or placeholder scorer that always returns 0 does not count as pass.
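For illustration, a placeholder scorer of the kind the fail criteria describe looks like this (hypothetical code): it returns a constant regardless of input and performs no analysis, so it must be treated as if no toxicity scoring exists.

```javascript
// A stub scorer: always returns 0, no real analysis. This does NOT pass.
async function scoreToxicity(text) {
  return 0; // TODO: integrate a real scoring service
}
```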
Skip (N/A) when: Platform has fewer than 500 active users AND uses manual-only moderation with no automated analysis.
Detail on fail: "No automated toxicity scoring. User-submitted text is not analyzed for abusive content before publication."
Remediation: Implement automated toxicity scoring using a service like Perspective API or OpenAI Moderation:
import axios from 'axios';

// Calls Perspective API and returns a map of attribute name -> score (0–1).
async function analyzeTextToxicity(text) {
  const response = await axios.post(
    'https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze',
    {
      comment: { text },
      languages: ['en'],
      requestedAttributes: {
        TOXICITY: {},
        SEVERE_TOXICITY: {},
        IDENTITY_ATTACK: {},
        INSULT: {}
      }
    },
    { params: { key: process.env.GOOGLE_API_KEY } }
  );
  const scores = {};
  for (const [attribute, data] of Object.entries(response.data.attributeScores)) {
    scores[attribute] = data.summaryScore.value;
  }
  return scores;
}
// Scores content server-side before the database write.
async function submitComment(content, userId) {
  const toxicityScores = await analyzeTextToxicity(content);
  // Flag or reject if toxicity is high
  if (toxicityScores.TOXICITY > 0.7) {
    // Option 1: Reject
    throw new Error('Your comment was flagged as inappropriate');
    // Option 2: Queue for review instead of rejecting
    // await db.flaggedContent.create({
    //   content,
    //   userId,
    //   toxicityScores,
    //   status: 'pending'
    // });
  }
  // Save comment if toxicity is acceptable
  await db.comments.create({
    content,
    userId,
    toxicityScores,
    createdAt: new Date()
  });
}
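One design choice the remediation leaves open is what to do when the scoring service itself is down. A fail-closed sketch (the `scorer` and `quarantine` parameters are hypothetical, injected so the behavior is testable): quarantine unscored content rather than publishing it.

```javascript
// Fail-closed wrapper: if scoring fails, quarantine instead of publishing.
// `scorer` and `quarantine` are hypothetical injected dependencies.
async function scoreOrQuarantine(text, scorer, quarantine) {
  try {
    return await scorer(text);
  } catch (err) {
    await quarantine(text, String(err)); // hold for moderator review
    return null; // caller must not publish when scores are null
  }
}
```

Failing open (publishing unscored content on scorer errors) silently recreates the unprotected state this check exists to catch.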