Off-the-shelf jailbreak prompts—"ignore previous instructions," "DAN mode," "you are now"—are the first tool any attacker tries because they succeed against unprotected models at a high rate. OWASP LLM01:2025 identifies these as direct prompt injection; MITRE ATLAS AML.T0054 classifies them as adversarial prompt crafting. Without a detection layer, your AI feature is vulnerable to the most amateur-level attacks documented in public jailbreak databases, not just sophisticated adversaries. For applications in regulated sectors (finance, healthcare, legal), a successful jailbreak that causes the model to produce non-compliant output creates liability exposure regardless of whether the developer "intended" the application to behave that way. NIST AI RMF MEASURE 2.6 requires demonstrable controls against known adversarial patterns.
High because publicly documented jailbreak patterns have known success rates against unprotected models, making exploitation accessible to non-technical attackers with no specialized knowledge.
Layer at least two detection mechanisms: a fast regex pass for known patterns, and a moderation API for nuanced cases. The OpenAI Moderation API is free, adds under 100ms, and catches a broad surface area.
// Layer 1: fast pattern matching
const INJECTION_PATTERNS = [
/ignore\s+(all\s+|previous\s+|your\s+)?(instructions|prompt|rules)/i,
/you\s+are\s+now/i,
/pretend\s+(you\s+are|to\s+be)/i,
/disregard\s+(previous|all|your)/i,
/forget\s+(everything|your\s+instructions)/i,
]
if (INJECTION_PATTERNS.some(p => p.test(userMessage))) {
return Response.json({ error: 'Message not allowed' }, { status: 400 })
}
// Layer 2: moderation API before primary completion
const mod = await openai.moderations.create({ input: userMessage })
if (mod.results[0]?.flagged) {
return Response.json({ error: 'Message not allowed' }, { status: 400 })
}
Update the regex list as new jailbreak patterns emerge—treat it as a living file, not a static configuration.
ID: ai-prompt-injection.input-sanitization.jailbreak-detection
Severity: high
What to look for: List all LLM input processing pipelines. For each, look for any code that screens user input for known jailbreak or prompt injection patterns before passing it to the model. This could be: explicit string matching or regex for common patterns (e.g., "ignore previous instructions", "you are now", "DAN", "pretend you are"), a content moderation API call (OpenAI Moderation API, Perspective API, custom classifier), or a pre-flight model call that evaluates whether the input appears adversarial.
Pass criteria: At least one mechanism exists that screens user input for adversarial patterns before the primary AI call. This can be pattern matching, moderation API, or a guard model — at least 1 jailbreak detection mechanism (keyword filter, classifier, or embedding-based) is applied before LLM processing. Report: "X LLM pipelines found, Y include jailbreak detection."
Fail criteria: No jailbreak or injection pattern detection found. User input is forwarded to the AI model without any adversarial screening.
Skip (N/A) when: No AI provider integration detected.
Detail on fail: "No jailbreak pattern detection found — user messages are forwarded to the model without screening in POST /api/chat" or "Content moderation API referenced in comments but no actual call implemented"
Remediation: Pattern matching alone cannot catch all injection attempts, but it stops the most common off-the-shelf attacks. A layered approach works best: fast pattern matching for obvious attempts, then a moderation API for nuanced cases.
// Simple pattern check layer
const INJECTION_PATTERNS = [
/ignore (all |previous |your )?(instructions|prompt|rules)/i,
/you are now/i,
/pretend (you are|to be)/i,
/disregard (previous|all|your)/i,
/forget (everything|your instructions)/i,
]
function hasInjectionPattern(input: string): boolean {
return INJECTION_PATTERNS.some(pattern => pattern.test(input))
}
For a higher-confidence approach, call the OpenAI Moderation endpoint before the primary completion — it's free and fast, and catches a broad range of harmful content patterns.