AI endpoints without rate limiting are vulnerable to three distinct abuse patterns. First, automated prompt-injection probing: an attacker can send thousands of variants per minute, systematically testing which injections succeed, a pattern consistent with MITRE ATLAS adversarial reconnaissance. Second, cost amplification: AI completions are billed per token, so an unprotected endpoint lets malicious actors exhaust your provider quota and generate costs you bear (OWASP A04:2021 Insecure Design). Third, denial of service: even non-adversarial traffic spikes can exhaust your provider's rate limits, causing failures for legitimate users. CWE-770 (Allocation of Resources Without Limits or Throttling) applies directly. NIST AI RMF MANAGE 1.3 requires managing operational risk for deployed AI systems, including resource exhaustion.
High, because the absence of rate limiting enables automated injection probing that dramatically reduces an attacker's cost of finding a successful injection, while simultaneously enabling financial denial of service through provider quota exhaustion.
Apply per-user (preferred) or per-IP rate limiting that fires before the AI provider call. For Next.js on Vercel, Upstash Redis sliding window rate limiting is the lowest-friction approach.
// src/app/api/chat/route.ts
import { Ratelimit } from '@upstash/ratelimit'
import { Redis } from '@upstash/redis'

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(20, '1 m'), // 20 requests/min per identifier
  analytics: true,
})

// Fallback identifier for unauthenticated requests.
function getClientIp(req: Request): string {
  return req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() ?? 'anonymous'
}

export async function POST(req: Request) {
  // Prefer the authenticated user ID; getSession() stands in for your
  // app-specific auth helper (e.g. auth() from next-auth).
  const session = await getSession()
  const identifier = session?.userId ?? getClientIp(req)
  const { success } = await ratelimit.limit(identifier)
  if (!success) {
    return Response.json(
      { error: 'Rate limit exceeded. Try again in a moment.' },
      { status: 429, headers: { 'Retry-After': '60' } }
    )
  }
  // AI call proceeds only after the rate limit check passes
  // ...
}
Set the window based on legitimate usage patterns. 10–20 requests per minute is appropriate for interactive chat; adjust downward for document-analysis endpoints with higher per-call costs.
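Request-count limits cap call volume but not token spend; the checklist below also accepts token budget tracking per user/session. A minimal sketch of that idea follows. All names here are illustrative (not from this rule), and an in-memory Map stands in for the shared store — in production you would back this with Redis (e.g. INCRBY plus EXPIRE for a daily window) so the budget survives restarts and is shared across instances.

```typescript
// Illustrative per-user daily token budget tracker.
class TokenBudget {
  private used = new Map<string, number>()

  constructor(private dailyCap: number) {}

  // Returns true and records usage if the request fits within the cap.
  tryConsume(userId: string, tokens: number): boolean {
    const current = this.used.get(userId) ?? 0
    if (current + tokens > this.dailyCap) return false
    this.used.set(userId, current + tokens)
    return true
  }

  remaining(userId: string): number {
    return Math.max(0, this.dailyCap - (this.used.get(userId) ?? 0))
  }
}
```

In an API route you would call tryConsume with the request's estimated token count before the provider call, and reject with 429 when it returns false.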
ID: ai-prompt-injection.architecture-defense.rate-limiting
Severity: high
What to look for: List all LLM-facing endpoints. For each, look for rate limiting applied to API routes that call AI providers. Check for: rate limiting middleware (upstash/ratelimit, express-rate-limit, Vercel's edge rate limiting), per-user or per-IP request quotas, token budget tracking per user/session, or request throttling. Also check for rate limiting at the hosting platform level (Vercel Edge Config, Cloudflare Rate Limiting rules).
Pass criteria: At least one rate limiting mechanism applies to AI endpoint routes — either per-user (preferred) or per-IP. The rate limit is enforced before the AI API call to prevent quota exhaustion, at no more than 20 requests per minute per user on LLM endpoints. Report: "X LLM endpoints found, Y have rate limits configured."
Fail criteria: No rate limiting found on routes that call AI providers. Any user or IP can make unlimited requests to the AI feature.
Skip (N/A) when: No AI provider integration detected, or the AI feature is only accessible to authenticated users and the auth system has its own session rate controls (verify this is actually implemented, not just assumed).
Detail on fail: "POST /api/chat calls the AI provider without any rate limiting — unlimited requests possible" or "Rate limiting only applies to the auth routes, not the AI completion endpoints"
Remediation: AI endpoints without rate limiting allow three abuse patterns: automated prompt-injection probing, provider quota exhaustion, and cost amplification attacks. For Next.js on Vercel, Upstash Redis rate limiting is the simplest approach:
import { Ratelimit } from '@upstash/ratelimit'
import { Redis } from '@upstash/redis'

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(20, '1 m'), // 20 requests per minute
})

// In your API route (userId from your auth session, ip from request headers):
const identifier = userId ?? ip
const { success } = await ratelimit.limit(identifier)
if (!success) {
  return Response.json({ error: 'Rate limit exceeded' }, { status: 429 })
}
Set the window based on typical legitimate usage. 10–20 requests per minute is generous for chat applications.
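On stacks without Upstash, the same sliding-window idea can be sketched in-process. This is illustrative only: per-process state resets on every deploy and is not shared across instances, so production deployments still need a shared store such as Redis.

```typescript
// Minimal in-process sliding-window limiter sketch (illustrative names).
// Tracks timestamps of allowed requests per identifier and rejects a
// request once maxRequests have been allowed within the trailing window.
function makeSlidingWindowLimiter(maxRequests: number, windowMs: number) {
  const hits = new Map<string, number[]>()
  return (identifier: string, now: number = Date.now()): boolean => {
    // Drop timestamps that have aged out of the window.
    const recent = (hits.get(identifier) ?? []).filter(t => now - t < windowMs)
    if (recent.length >= maxRequests) {
      hits.set(identifier, recent)
      return false
    }
    recent.push(now)
    hits.set(identifier, recent)
    return true
  }
}
```

Usage mirrors the Upstash pattern: `const allow = makeSlidingWindowLimiter(20, 60_000)` and return a 429 whenever `allow(identifier)` is false.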