RAG chunk sizes are tuned for retrieval quality vs. token cost
Why it matters
RAG chunk size controls how much text is inserted into the LLM context per retrieved document. Oversized chunks (full pages at 3,000+ tokens each) flood the context window with surrounding noise, reducing the signal-to-noise ratio and leaving less room for conversation history. High topK values compound this: retrieving 20 chunks at 2,000 tokens each consumes 40,000 tokens before the user's question is even accounted for. OWASP LLM10 flags uncontrolled context consumption. Poorly tuned RAG raises both cost and error rate simultaneously.
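The arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of any library: `RetrievalConfig`, `retrievedContextTokens`, and `contextShare` are hypothetical names, and the 128,000-token context window is an assumed figure chosen only to make the ratio concrete.

```typescript
// Hypothetical helper types and functions for illustration only.
interface RetrievalConfig {
  topK: number; // chunks retrieved per query
  tokensPerChunk: number; // average chunk size, in tokens
}

// Tokens consumed by retrieved chunks alone, before the system prompt,
// conversation history, or the user's question are added.
function retrievedContextTokens(cfg: RetrievalConfig): number {
  return cfg.topK * cfg.tokensPerChunk;
}

// Fraction of the model's context window taken up by retrieved chunks.
function contextShare(cfg: RetrievalConfig, contextWindowTokens: number): number {
  return retrievedContextTokens(cfg) / contextWindowTokens;
}

const oversized: RetrievalConfig = { topK: 20, tokensPerChunk: 2000 };
const tuned: RetrievalConfig = { topK: 5, tokensPerChunk: 1000 };

// Assumed context window size; substitute the limit of the model in use.
const ASSUMED_CONTEXT_WINDOW = 128_000;

console.log(retrievedContextTokens(oversized)); // 40000
console.log(retrievedContextTokens(tuned)); // 5000
console.log(contextShare(oversized, ASSUMED_CONTEXT_WINDOW)); // 0.3125
```

Under these assumptions the oversized configuration spends eight times as many tokens on retrieval as the tuned one, for every single query.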
Severity rationale
Info because suboptimal chunk sizing and retrieval depth degrade answer quality and inflate cost, but do not cause outright failure in typical configurations.
Remediation
Tune chunk size to 500–1,500 tokens with 10–20% overlap, and bound topK to 3–10 depending on average chunk relevance. Start conservative and increase only if retrieval quality metrics warrant it.
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  // chunkSize is counted in characters by default, so 1000 is roughly
  // 250 tokens of English text; scale it up to reach the token target
  chunkSize: 1000,
  chunkOverlap: 150, // 15% overlap preserves context across chunk boundaries
});
const chunks = await splitter.splitDocuments(documents);

// Keep topK bounded in all vector retrieval calls
const results = await vectorStore.similaritySearch(query, 5); // topK = 5
Verify by examining the assembled prompt context for a typical query — the retrieved content should be directly relevant to the question, with no padding from surrounding document sections.
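Because the recommendation is stated in tokens while a character-based splitter counts characters, it can help to convert one into the other before setting `chunkSize`. The sketch below assumes about 4 characters per token, a rough heuristic for English text that varies by tokenizer and language. `splitterParamsForTokens` and `SplitterParams` are hypothetical names, not part of LangChain.

```typescript
// Hypothetical helper: converts a token target into character-based
// splitter parameters. Not a LangChain API.
interface SplitterParams {
  chunkSize: number; // in characters
  chunkOverlap: number; // in characters
}

function splitterParamsForTokens(
  targetTokens: number,
  overlapFraction = 0.15, // 10-20% is the range suggested above
  charsPerToken = 4, // ASSUMPTION: rough average for English text
): SplitterParams {
  if (targetTokens <= 0) {
    throw new RangeError("targetTokens must be positive");
  }
  if (overlapFraction < 0 || overlapFraction >= 1) {
    throw new RangeError("overlapFraction must be in [0, 1)");
  }
  const chunkSize = Math.round(targetTokens * charsPerToken);
  const chunkOverlap = Math.round(chunkSize * overlapFraction);
  return { chunkSize, chunkOverlap };
}

const params = splitterParamsForTokens(1000);
console.log(params); // { chunkSize: 4000, chunkOverlap: 600 }
```

The returned values can be passed as `chunkSize` and `chunkOverlap` to a character-based splitter. For exact sizing, counting with the target model's own tokenizer is more reliable than a fixed ratio.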
Detection
- ID: rag-chunk-size
- Severity: info
- What to look for: If the project implements RAG (Retrieval-Augmented Generation), look for document chunking or splitting logic. In LangChain-based projects, search for RecursiveCharacterTextSplitter, TokenTextSplitter, or similar. In custom implementations, look for string slicing logic with chunk size constants. Check the chunkSize and chunkOverlap parameters used, and how many chunks are retrieved per query (topK in vector search calls). Count all instances found and enumerate each.
- Pass criteria: Chunk sizes are tuned to balance retrieval quality and token cost, typically 500-1500 tokens per chunk, with appropriate overlap (10-20% of chunk size). The number of retrieved chunks per query (topK) is bounded (typically 3-10).
- Fail criteria: Extremely large chunks (e.g., entire pages or documents at 3000+ tokens each) are retrieved and inserted into the LLM context, flooding the context window with irrelevant surrounding content. Or topK is set very high (20+) without filtering, retrieving more context than the model can effectively use.
- Skip (N/A) when: No RAG or vector search implementation is detected in the project. Signal: No vector database dependencies, no embedding creation for documents, no retrieval logic injecting context into prompts.
- Detail on fail: "RAG chunks too large or topK too high — context flooded with low-relevance content"
- Remediation: Retrieving massive chunks fills the context window with surrounding noise, reducing answer quality while increasing cost. Tuning chunk size is an iterative process that depends on the content type.
  // Example with LangChain
  import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

  const splitter = new RecursiveCharacterTextSplitter({
    // chunkSize is counted in characters by default, so 1000 is roughly
    // 250 tokens of English text; scale it up to reach the token target
    chunkSize: 1000,
    chunkOverlap: 150, // preserve context across chunk boundaries
  });
  const chunks = await splitter.splitDocuments(documents);

  // Keep topK bounded in retrieval
  const results = await vectorStore.similaritySearch(query, 5); // topK = 5

  Verify by examining the assembled prompt context for a typical query: the retrieved content should be directly relevant, not padded with surrounding document text.
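The pass and fail criteria in the detection list can be sketched as a small classifier over the settings found in a project. The thresholds come from the criteria above. The `RagSettings` shape, the `hasRelevanceFilter` flag, and the intermediate `"review"` verdict for settings that fall between the pass and fail ranges are assumptions added for illustration, since the rule does not define that middle ground.

```typescript
// Hypothetical types and function; thresholds mirror the criteria above.
type Verdict = "pass" | "fail" | "review";

interface RagSettings {
  chunkTokens: number; // chunk size, in tokens
  overlapTokens: number; // overlap between adjacent chunks, in tokens
  topK: number; // chunks retrieved per query
  hasRelevanceFilter: boolean; // ASSUMPTION: e.g. a score threshold or reranker
}

function evaluateRagSettings(s: RagSettings): Verdict {
  // Fail: extremely large chunks (3000+ tokens each).
  if (s.chunkTokens >= 3000) return "fail";
  // Fail: very high topK (20+) with no filtering of the results.
  if (s.topK >= 20 && !s.hasRelevanceFilter) return "fail";

  const overlapRatio = s.overlapTokens / s.chunkTokens;
  const chunkOk = s.chunkTokens >= 500 && s.chunkTokens <= 1500;
  const overlapOk = overlapRatio >= 0.1 && overlapRatio <= 0.2;
  const topKOk = s.topK >= 3 && s.topK <= 10;

  // Anything outside the typical ranges but short of a fail needs a human look.
  return chunkOk && overlapOk && topKOk ? "pass" : "review";
}

const verdictTuned = evaluateRagSettings({
  chunkTokens: 1000, overlapTokens: 150, topK: 5, hasRelevanceFilter: false,
});
const verdictHugeChunks = evaluateRagSettings({
  chunkTokens: 3500, overlapTokens: 0, topK: 5, hasRelevanceFilter: false,
});
const verdictHighTopK = evaluateRagSettings({
  chunkTokens: 1000, overlapTokens: 150, topK: 25, hasRelevanceFilter: false,
});
const verdictBorderline = evaluateRagSettings({
  chunkTokens: 1000, overlapTokens: 150, topK: 15, hasRelevanceFilter: false,
});

console.log(verdictTuned, verdictHugeChunks, verdictHighTopK, verdictBorderline);
// pass fail fail review
```

Since the ranges are described as typical rather than strict, a `"review"` result is best read as a prompt to check retrieval quality metrics, not as a defect in itself.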
External references
- owasp-llm:2025 · LLM10 — Unbounded Consumption
- iso-25010:2011 · performance-efficiency.resource-utilization — Resource Utilization — RAG chunk and topK sizing
Taxons
History
- 2026-04-18·v1.0.0·Initial import from ai-token-optimization·automated