Many AI applications have query patterns where the same prompt runs repeatedly across different users: SEO description generation for a product page, FAQ answer retrieval for a support bot, or summarization of a static document. Without a cache layer, each of these requests hits the API and bills you for the same tokens again and again. On a product listing page visited by 10,000 users per day with identical AI-generated descriptions, you pay for 10,000 API calls when one would suffice. Eliminating this redundant computation at scale falls under the performance-efficiency characteristic of ISO 25010.
Severity is high because, without caching, repeated identical prompts generate redundant API costs that scale linearly with traffic volume, regardless of response novelty.
Wrap AI calls in a cache layer keyed by a SHA-256 hash of the messages and the model. Include the model name in the cache key so that a model upgrade invalidates old entries rather than serving stale responses.
// src/lib/ai/cached-generate.ts
import { createHash } from "node:crypto";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { redis } from "@/lib/redis"; // an @upstash/redis client; get/set signatures below match that library

export async function cachedGenerate(
  messages: Array<{ role: string; content: string }>,
  options: { model: string; maxTokens: number; ttlSeconds?: number }
): Promise<string> {
  // Key on content + model: a model upgrade produces new keys, so stale
  // responses from the previous model can never be served.
  const cacheKey = `ai:${createHash("sha256")
    .update(JSON.stringify({ messages, model: options.model }))
    .digest("hex")
    .slice(0, 16)}`;
  const cached = await redis.get<string>(cacheKey);
  if (cached) return cached;
  const { text } = await generateText({
    model: openai(options.model),
    messages,
    maxTokens: options.maxTokens,
  });
  await redis.set(cacheKey, text, { ex: options.ttlSeconds ?? 3600 });
  return text;
}
Verify by making the same request twice and confirming the second response is near-instant with a cache hit visible in logs.
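The same verification can be rehearsed offline with a minimal in-memory sketch. The names here are hypothetical stand-ins: a Map replaces Redis and fakeGenerate replaces the real model call, counting invocations so the cache hit is observable without network access.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for the real model call: counts invocations
// so we can confirm the second identical request never reaches it.
let apiCalls = 0;
async function fakeGenerate(prompt: string): Promise<string> {
  apiCalls++;
  return `generated: ${prompt}`;
}

// A Map stands in for Redis; same key derivation as the real wrapper.
const cache = new Map<string, string>();

async function cachedFakeGenerate(prompt: string, model: string): Promise<string> {
  const key = `ai:${createHash("sha256")
    .update(JSON.stringify({ prompt, model }))
    .digest("hex")
    .slice(0, 16)}`;
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // cache hit: no generator call
  const text = await fakeGenerate(prompt);
  cache.set(key, text);
  return text;
}

const first = await cachedFakeGenerate("describe product 42", "gpt-4o-mini");
const second = await cachedFakeGenerate("describe product 42", "gpt-4o-mini");
console.log(apiCalls, first === second); // 1 true
```

Two identical requests produce exactly one generator call; the second is served from the Map.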
ID: ai-token-optimization.caching-cost.repeated-query-caching
Severity: high
What to look for: Look for a caching layer wrapping the AI API call. Patterns to find: redis.get(cacheKey) before the API call, unstable_cache wrapping an AI call in Next.js, @vercel/kv or @upstash/redis usage adjacent to AI call sites, or a custom in-memory cache map for repeated prompts. The cache key should be derived from the prompt or messages content (typically via a hash). Also check if the project uses Cloudflare AI Gateway or a similar proxy that provides transparent caching. Count every AI API call site in the codebase and enumerate which implement response caching vs. which make uncached calls. Report: X of Y call sites use caching.
Pass criteria: Identical prompt inputs return a cached response without triggering a new API call. A caching mechanism (Redis, KV store, or equivalent) exists and is wired to the AI call path. Report even on pass: "X of Y AI API call sites implement caching. Estimated cache hit potential: Z%." At least 1 implementation must be confirmed.
Fail criteria: Every request, including identical repeated prompts, triggers a fresh API call. No caching layer exists between the application and the AI provider. Do NOT pass if caching exists but cache keys do not include the model version — model upgrades could serve stale responses.
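The model-in-the-key requirement above can be checked directly. This standalone sketch (the cacheKey helper is illustrative, mirroring the fix code's derivation) shows identical inputs agreeing on a key while a model change produces a different one:

```typescript
import { createHash } from "node:crypto";

// Illustrative key derivation: content plus model name, as in the fix code.
function cacheKey(
  messages: Array<{ role: string; content: string }>,
  model: string
): string {
  return `ai:${createHash("sha256")
    .update(JSON.stringify({ messages, model }))
    .digest("hex")
    .slice(0, 16)}`;
}

const messages = [{ role: "user", content: "Summarize the return policy." }];
const k1 = cacheKey(messages, "gpt-4o-mini");
const k2 = cacheKey(messages, "gpt-4o-mini");
const k3 = cacheKey(messages, "gpt-4o"); // model upgrade
console.log(k1 === k2, k1 === k3); // true false
```

Because the upgraded model hashes to a fresh key, old entries simply never hit again; no explicit invalidation pass is needed.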
Skip (N/A) when: All queries are genuinely unique by construction (e.g., queries always include a unique user ID, timestamp, or random element) such that caching would never produce a hit. Also skip if caching is explicitly handled by an upstream gateway (Cloudflare AI Gateway, LLM proxy with caching enabled).
Signal: All prompt templates include ${userId} or ${Date.now()} components, or a caching gateway is configured in the deployment setup.
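To see why such templates make caching moot, consider this sketch, where two fixed example timestamps stand in for Date.now() on two successive requests:

```typescript
import { createHash } from "node:crypto";

const key = (content: string) =>
  `ai:${createHash("sha256").update(content).digest("hex").slice(0, 16)}`;

// Fixed values standing in for Date.now() on two successive requests.
const t1 = 1700000000000;
const t2 = 1700000000042;
const k1 = key(`summarize orders as of ${t1}`);
const k2 = key(`summarize orders as of ${t2}`);
console.log(k1 === k2); // false: every request misses the cache
```

Since the hashed content differs on every request, an exact-match cache can never hit, which is what justifies the skip.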
Cross-reference: The semantic-caching check verifies a more sophisticated caching strategy that complements exact-match caching.
Detail on fail: "No cache layer for AI calls — identical queries hit the API every time"
Remediation: Re-generating the same answer is pure cost waste. Many AI applications have patterns like "generate this page's SEO description" or "summarize this product" where the same prompt runs repeatedly across users.
// src/lib/ai/cached-generate.ts
import { createHash } from "node:crypto";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { redis } from "@/lib/redis"; // an @upstash/redis client; get/set signatures below match that library

export async function cachedGenerate(
  messages: Array<{ role: string; content: string }>,
  options: { model: string; maxTokens: number; ttlSeconds?: number }
): Promise<string> {
  // Key on content + model: a model upgrade produces new keys, so stale
  // responses from the previous model can never be served.
  const cacheKey = `ai:${createHash("sha256")
    .update(JSON.stringify({ messages, model: options.model }))
    .digest("hex")
    .slice(0, 16)}`;
  const cached = await redis.get<string>(cacheKey);
  if (cached) return cached;
  const { text } = await generateText({
    model: openai(options.model),
    messages,
    maxTokens: options.maxTokens,
  });
  await redis.set(cacheKey, text, { ex: options.ttlSeconds ?? 3600 });
  return text;
}
Verify by making the same request twice and confirming the second response is near-instant and the cache hit appears in logs.