Users phrase semantically identical questions in dozens of different ways: "What is the refund policy?" and "How do I get a refund?" produce identical answers but fail an exact-match cache check. Semantic caching, which compares query embeddings for similarity rather than string equality, captures this overlap. Research and production deployments consistently show that semantic caching captures 20–40% additional cache-eligible traffic that exact matching misses — traffic that would otherwise generate fresh API calls. Under ISO/IEC 25010's performance-efficiency characteristic, redundant recomputation of equivalent responses counts as an efficiency defect.
Medium because semantic caching captures a substantial fraction of cache-eligible traffic that exact matching misses, with measurable cost reduction on high-volume query patterns.
Implement semantic caching using a vector index with a similarity threshold. Upstash Semantic Cache provides this out of the box; for custom implementations, embed the query and run a nearest-neighbor search before calling the AI API.
import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Index() reads UPSTASH_VECTOR_REST_URL and UPSTASH_VECTOR_REST_TOKEN from the environment.
const index = new Index();
// minProximity is the similarity threshold a cached query must meet to count as a hit.
const semanticCache = new SemanticCache({ index, minProximity: 0.95 });

export async function semanticallyCachedGenerate(prompt: string): Promise<string> {
  // Returns the stored response for any sufficiently similar prior prompt.
  const cached = await semanticCache.get(prompt);
  if (cached) return cached;

  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt,
    maxTokens: 500,
  });

  // Store the new response so future paraphrases of this prompt hit the cache.
  await semanticCache.set(prompt, text);
  return text;
}
Verify by sending two semantically equivalent but differently worded queries — the second should return a cache hit without triggering a new API call.
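For the custom route (embed the query, then run a nearest-neighbor search before calling the API), the core logic can be sketched as below. The in-memory array, the injected `embed` function, and the 0.95 default threshold are illustrative assumptions — a production implementation would delegate the search to a vector database rather than scanning entries linearly:

```typescript
// Minimal in-memory sketch of a semantic cache. A plain array stands in
// for the vector index; entries pair a query embedding with its response.
type Entry = { embedding: number[]; response: string };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class InMemorySemanticCache {
  private entries: Entry[] = [];

  // embed is injected so the sketch stays provider-agnostic (hypothetical signature);
  // in practice it would call an embedding API.
  constructor(
    private embed: (text: string) => number[],
    private minProximity = 0.95,
  ) {}

  // Nearest-neighbor search: return the stored response only if the closest
  // cached query clears the similarity threshold.
  get(query: string): string | undefined {
    const q = this.embed(query);
    let best: Entry | undefined;
    let bestScore = -Infinity;
    for (const e of this.entries) {
      const score = cosineSimilarity(q, e.embedding);
      if (score > bestScore) {
        bestScore = score;
        best = e;
      }
    }
    return best && bestScore >= this.minProximity ? best.response : undefined;
  }

  set(query: string, response: string): void {
    this.entries.push({ embedding: this.embed(query), response });
  }
}
```

Wire `get` before the API call and `set` after it, exactly as in the Upstash example above; only the storage and search layer differs.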
ID: ai-token-optimization.caching-cost.semantic-caching
Severity: medium
What to look for: Search for evidence of semantic caching — where the application checks whether a user's query is semantically similar to a cached query, not just identical. Patterns: embedding the incoming query and searching a vector index for nearby cached embeddings, integration with semantic cache libraries (e.g., Upstash Semantic Cache, Redis Vector Library), or a custom similarity threshold check before proceeding to the API. This is distinct from the exact-match caching check above. Count all instances found and enumerate each.
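A reviewer can surface candidates for these patterns with a quick grep pass; the patterns and the `src/` path are illustrative assumptions, not an exhaustive detection rule:

```shell
# Look for semantic-cache libraries and similarity-threshold logic (illustrative patterns).
grep -rnE "@upstash/semantic-cache|SemanticCache|minProximity" --include='*.ts' src/ || true
# Custom implementations: cosine similarity or nearest-neighbor checks near the AI call.
grep -rnE "cosine|similarity|nearest.?neighbor" --include='*.ts' src/ || true
```

Each match still needs manual confirmation that the similarity check actually gates the API call.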
Pass criteria: A semantic similarity check runs before the AI API call, returning a cached response when cosine similarity between the new query embedding and a cached query embedding exceeds a threshold (typically 0.92–0.98). At least 1 implementation must be confirmed.
Fail criteria: Only exact string matching is used for caching (or no caching exists), missing opportunities to serve "What is the capital of France?" from a cache containing "capital city of France?".
Skip (N/A) when: The application is early-stage or an MVP where semantic caching infrastructure is premature; exact-match caching (the previous check) should be implemented first. Signals: no vector database dependencies and no semantic caching library detected, or exact-match caching is not yet implemented (previous check failed).
Cross-reference: The response-caching check in Caching & Cost verifies the basic caching infrastructure that semantic caching builds upon.
Detail on fail: "Only exact-match caching in place — semantically similar queries bypass cache"
Remediation: Users phrase the same question many different ways. Semantic caching can capture an additional 20–40% of cache-eligible traffic that exact matching misses.
// Using Upstash Semantic Cache
import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Index() reads UPSTASH_VECTOR_REST_URL and UPSTASH_VECTOR_REST_TOKEN from the environment.
const index = new Index();
// minProximity is the similarity threshold a cached query must meet to count as a hit.
const semanticCache = new SemanticCache({ index, minProximity: 0.95 });

export async function semanticallyCachedGenerate(prompt: string): Promise<string> {
  // Returns the stored response for any sufficiently similar prior prompt.
  const cached = await semanticCache.get(prompt);
  if (cached) return cached;

  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt,
    maxTokens: 500,
  });

  // Store the new response so future paraphrases of this prompt hit the cache.
  await semanticCache.set(prompt, text);
  return text;
}
Verify by sending two semantically equivalent but differently worded queries — the second should hit the cache.
For response quality considerations in cached vs. live responses, see the AI Response Quality Audit.