Semantic or fuzzy caching is implemented for similar queries
Why it matters
Users phrase the same question in dozens of different ways: "What is the refund policy?" and "How do I get a refund?" can be served by the same answer but fail an exact-match cache check. Semantic caching compares query embeddings for similarity rather than strings for equality, so it captures this overlap. Reported figures put the additional cache-eligible traffic captured by semantic caching at 20–40% beyond what exact matching catches; without it, that traffic generates fresh API calls. Under the ISO 25010 performance-efficiency characteristic, redundant recomputation of equivalent responses is an avoidable cost in resource utilization.
Severity rationale
Medium because semantic caching captures a substantial fraction of cache-eligible traffic that exact matching misses, with measurable cost reduction on high-volume query patterns.
Remediation
Implement semantic caching using a vector index with a similarity threshold. Upstash Semantic Cache provides this out of the box; for custom implementations, embed the query and run a nearest-neighbor search before calling the AI API.
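import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// With no arguments, Index reads UPSTASH_VECTOR_REST_URL and
// UPSTASH_VECTOR_REST_TOKEN from the environment.
const index = new Index();
// minProximity sets the similarity threshold a cached prompt must reach
// to count as a hit.
const semanticCache = new SemanticCache({ index, minProximity: 0.95 });

export async function semanticallyCachedGenerate(prompt: string): Promise<string> {
  // Return the cached response if a sufficiently similar prompt was seen before.
  const cached = await semanticCache.get(prompt);
  if (cached) return cached;

  // Cache miss: call the model, then store the response for future lookups.
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt,
    maxTokens: 500,
  });
  await semanticCache.set(prompt, text);
  return text;
}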
Verify by sending two semantically equivalent but differently worded queries — the second should return a cache hit without triggering a new API call.
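For the custom route mentioned above (embed the query, then run a nearest-neighbor search before calling the AI API), a minimal in-memory sketch might look like the following. Every name here (`InMemorySemanticCache`, `cosineSimilarity`, `cachedGenerate`, and the injected `embed` and `callModel` functions) is hypothetical and chosen for illustration. The linear scan stands in for a real vector index, and the 0.95 default simply mirrors the `minProximity` value used earlier.

```typescript
type Embedding = number[];

interface CacheEntry {
  embedding: Embedding;
  response: string;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
function cosineSimilarity(a: Embedding, b: Embedding): number {
  if (a.length !== b.length) {
    throw new Error("Embeddings must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class InMemorySemanticCache {
  private entries: CacheEntry[] = [];
  private threshold: number;

  constructor(threshold: number = 0.95) {
    this.threshold = threshold;
  }

  // Linear scan for the most similar cached embedding. A production system
  // would delegate this nearest-neighbor search to a vector index.
  lookup(query: Embedding): string | undefined {
    let best: CacheEntry | undefined;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(query, entry.embedding);
      if (score > bestScore) {
        bestScore = score;
        best = entry;
      }
    }
    // Only a match at or above the threshold counts as a hit.
    return best !== undefined && bestScore >= this.threshold ? best.response : undefined;
  }

  store(embedding: Embedding, response: string): void {
    this.entries.push({ embedding, response });
  }
}

// Hypothetical wrapper: `embed` and `callModel` are supplied by the caller,
// so no particular embedding or model provider is assumed.
async function cachedGenerate(
  prompt: string,
  cache: InMemorySemanticCache,
  embed: (text: string) => Promise<Embedding>,
  callModel: (prompt: string) => Promise<string>,
): Promise<string> {
  const embedding = await embed(prompt);
  const hit = cache.lookup(embedding);
  if (hit !== undefined) return hit;

  const response = await callModel(prompt);
  cache.store(embedding, response);
  return response;
}
```

Passing a `callModel` that counts its own invocations gives a simple way to run the verification step: after two differently worded but equivalent prompts, the counter should read one.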
Detection
- ID: semantic-caching
- Severity: medium
- What to look for: Search for evidence of semantic caching — where the application checks whether a user's query is semantically similar to a cached query, not just identical. Patterns: embedding the incoming query and searching a vector index for nearby cached embeddings, integration with semantic cache libraries (e.g., Upstash Semantic Cache, Redis Vector Library), or a custom similarity threshold check before proceeding to the API. This is distinct from the exact-match caching check above. Count all instances found and enumerate each.
- Pass criteria: A semantic similarity check runs before the AI API call, returning a cached response when cosine similarity between the new query embedding and a cached query embedding exceeds a threshold (typically 0.92–0.98). At least 1 implementation must be confirmed.
- Fail criteria: Only exact string matching is used for caching (or no caching exists), missing opportunities to serve "What is the capital of France?" from a cache containing "capital city of France?".
- Skip (N/A) when: The application is early-stage or MVP where semantic caching infrastructure is premature. Exact-match caching (previous check) should be implemented first. Signal: No vector database dependencies and no semantic caching library detected. Exact-match caching is not yet implemented (previous check failed).
- Cross-reference: The response-caching check in Caching & Cost verifies the basic caching infrastructure that semantic caching builds upon.
- Detail on fail: "Only exact-match caching in place — semantically similar queries bypass cache"
- Remediation: Users phrase the same question many different ways. Semantic caching can capture an additional 20–40% of cache-eligible traffic that exact matching misses.
  // Using Upstash Semantic Cache
  import { SemanticCache } from "@upstash/semantic-cache";
  import { Index } from "@upstash/vector";
  import { generateText } from "ai";
  import { openai } from "@ai-sdk/openai";

  const index = new Index();
  const semanticCache = new SemanticCache({ index, minProximity: 0.95 });

  export async function semanticallyCachedGenerate(prompt: string): Promise<string> {
    // Serve from cache when a sufficiently similar prompt has been seen.
    const cached = await semanticCache.get(prompt);
    if (cached) return cached;

    const { text } = await generateText({
      model: openai("gpt-4o-mini"),
      prompt,
      maxTokens: 500,
    });
    await semanticCache.set(prompt, text);
    return text;
  }

  Verify by sending two semantically equivalent but differently worded queries; the second should hit the cache.
For response quality considerations in cached vs. live responses, see the AI Response Quality Audit.
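The pass, fail, and skip criteria in the Detection list can be read as a small decision procedure. The sketch below is one possible encoding, not the audit's actual implementation: `CachingSignals`, `evaluateSemanticCaching`, and `hasVectorDependency` are hypothetical names, the package list contains only the two libraries named in this check, and treating the skip signal as "no exact-match cache and no vector dependency" is an assumed interpretation of the skip rule.

```typescript
// Illustrative package list: only the two libraries named in this check.
// Extend it with whatever vector stores your stack actually uses.
const SEMANTIC_CACHE_PACKAGES = ["@upstash/semantic-cache", "@upstash/vector"];

// True if any listed package appears in a dependency map
// (for example, the `dependencies` object of a package.json).
function hasVectorDependency(deps: Record<string, string>): boolean {
  return SEMANTIC_CACHE_PACKAGES.some((name) => name in deps);
}

// Hypothetical signals an auditor might collect for this check.
interface CachingSignals {
  semanticCheckCount: number; // confirmed similarity checks that run before the AI API call
  hasExactMatchCache: boolean; // outcome of the response-caching check
  hasVectorDependency: boolean; // vector DB or semantic cache library detected
}

type Verdict = "pass" | "fail" | "skip";

function evaluateSemanticCaching(signals: CachingSignals): Verdict {
  // Pass: at least one confirmed semantic similarity check.
  if (signals.semanticCheckCount >= 1) return "pass";
  // Skip (assumed reading): no exact-match cache and no vector tooling yet.
  if (!signals.hasExactMatchCache && !signals.hasVectorDependency) return "skip";
  // Fail: caching groundwork or vector tooling exists, but no semantic check.
  return "fail";
}
```

Under this reading, a project with exact-match caching but no similarity check fails, which matches the "Detail on fail" message.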
External references
- iso-25010:2011 · performance-efficiency.resource-utilization — Resource Utilization — semantic caching eliminates redundant AI calls
Taxons
History
- 2026-04-18·v1.0.0·Initial import from ai-token-optimization·automated