Users phrase semantically identical questions in dozens of different ways: "What is the refund policy?" and "How do I get a refund?" produce identical answers but fail an exact-match cache check. Semantic caching, which compares query embeddings for similarity rather than string equality, captures this overlap. Research and production deployments consistently show that semantic caching captures 20–40% additional cache-eligible traffic that exact matching misses — traffic that would otherwise generate fresh API calls. Under ISO/IEC 25010's performance-efficiency characteristic, redundant recomputation of equivalent responses counts as an efficiency defect.
Medium because semantic caching captures a substantial fraction of cache-eligible traffic that exact matching misses, with measurable cost reduction on high-volume query patterns.
Implement semantic caching using a vector index with a similarity threshold. Upstash Semantic Cache provides this out of the box; for custom implementations, embed the query and run a nearest-neighbor search before calling the AI API.
import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Index() reads UPSTASH_VECTOR_REST_URL and UPSTASH_VECTOR_REST_TOKEN from the environment.
const index = new Index();
// minProximity is the similarity threshold a cached query must meet to count as a hit.
const semanticCache = new SemanticCache({ index, minProximity: 0.95 });

export async function semanticallyCachedGenerate(prompt: string): Promise<string> {
  // Returns the stored response for any sufficiently similar prior prompt.
  const cached = await semanticCache.get(prompt);
  if (cached) return cached;

  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt,
    maxTokens: 500,
  });

  // Store the new response so future paraphrases of this prompt hit the cache.
  await semanticCache.set(prompt, text);
  return text;
}
Verify by sending two semantically equivalent but differently worded queries — the second should return a cache hit without triggering a new API call.
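For the custom route (embed the query, then run a nearest-neighbor search before calling the API), the core logic can be sketched as below. The in-memory array, the injected `embed` function, and the 0.95 default threshold are illustrative assumptions — a production implementation would delegate the search to a vector database rather than scanning entries linearly:

```typescript
// Minimal in-memory sketch of a semantic cache. A plain array stands in
// for the vector index; entries pair a query embedding with its response.
type Entry = { embedding: number[]; response: string };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class InMemorySemanticCache {
  private entries: Entry[] = [];

  // embed is injected so the sketch stays provider-agnostic (hypothetical signature);
  // in practice it would call an embedding API.
  constructor(
    private embed: (text: string) => number[],
    private minProximity = 0.95,
  ) {}

  // Nearest-neighbor search: return the stored response only if the closest
  // cached query clears the similarity threshold.
  get(query: string): string | undefined {
    const q = this.embed(query);
    let best: Entry | undefined;
    let bestScore = -Infinity;
    for (const e of this.entries) {
      const score = cosineSimilarity(q, e.embedding);
      if (score > bestScore) {
        bestScore = score;
        best = e;
      }
    }
    return best && bestScore >= this.minProximity ? best.response : undefined;
  }

  set(query: string, response: string): void {
    this.entries.push({ embedding: this.embed(query), response });
  }
}
```

Wire `get` before the API call and `set` after it, exactly as in the Upstash example above; only the storage and search layer differs.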
ID: ai-token-optimization.caching-cost.semantic-caching
Severity: medium
What to look for: Search for evidence of semantic caching — where the application checks whether a user's query is semantically similar to a cached query, not just identical. Patterns: embedding the incoming query and searching a vector index for nearby cached embeddings, integration with semantic cache libraries (e.g., Upstash Semantic Cache, Redis Vector Library), or a custom similarity threshold check before proceeding to the API. This is distinct from the exact-match caching check above. Count all instances found and enumerate each.
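A reviewer can surface candidates for these patterns with a quick grep pass; the patterns and the `src/` path are illustrative assumptions, not an exhaustive detection rule:

```shell
# Look for semantic-cache libraries and similarity-threshold logic (illustrative patterns).
grep -rnE "@upstash/semantic-cache|SemanticCache|minProximity" --include='*.ts' src/ || true
# Custom implementations: cosine similarity or nearest-neighbor checks near the AI call.
grep -rnE "cosine|similarity|nearest.?neighbor" --include='*.ts' src/ || true
```

Each match still needs manual confirmation that the similarity check actually gates the API call.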
Pass criteria: A semantic similarity check runs before the AI API call, returning a cached response when cosine similarity between the new query embedding and a cached query embedding exceeds a threshold (typically 0.92–0.98). At least 1 implementation must be confirmed.
Fail criteria: Only exact string matching is used for caching (or no caching exists), missing opportunities to serve "What is the capital of France?" from a cache containing "capital city of France?".
Skip (N/A) when: The application is early-stage or an MVP where semantic caching infrastructure is premature; exact-match caching (the previous check) should be implemented first. Signals: no vector database dependencies and no semantic caching library detected, or exact-match caching is not yet implemented (previous check failed).
Cross-reference: The response-caching check in Caching & Cost verifies the basic caching infrastructure that semantic caching builds upon.
Detail on fail: "Only exact-match caching in place — semantically similar queries bypass cache"
Remediation: Users phrase the same question many different ways. Semantic caching can capture an additional 20–40% of cache-eligible traffic that exact matching misses.
// Using Upstash Semantic Cache
import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Index() reads UPSTASH_VECTOR_REST_URL and UPSTASH_VECTOR_REST_TOKEN from the environment.
const index = new Index();
// minProximity is the similarity threshold a cached query must meet to count as a hit.
const semanticCache = new SemanticCache({ index, minProximity: 0.95 });

export async function semanticallyCachedGenerate(prompt: string): Promise<string> {
  // Returns the stored response for any sufficiently similar prior prompt.
  const cached = await semanticCache.get(prompt);
  if (cached) return cached;

  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt,
    maxTokens: 500,
  });

  // Store the new response so future paraphrases of this prompt hit the cache.
  await semanticCache.set(prompt, text);
  return text;
}
Verify by sending two semantically equivalent but differently worded queries — the second should hit the cache.
For response quality considerations in cached vs. live responses, see the AI Response Quality Audit.