RAG chunk size controls how much text is inserted into the LLM context per retrieved document. Oversized chunks (full pages at 3,000+ tokens each) flood the context window with surrounding noise, reducing the signal-to-noise ratio and leaving less room for conversation history. High topK values compound this: retrieving 20 chunks at 2,000 tokens each consumes 40,000 tokens before the user's question is even accounted for. OWASP LLM10 flags uncontrolled context consumption. Poorly tuned RAG raises both cost and error rate simultaneously.
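The arithmetic above can be made concrete. A minimal sketch, assuming an illustrative 128k-token context window; `retrievalBudget` is a hypothetical helper, not a library API:

```typescript
// Rough context-budget arithmetic. Token counts are illustrative;
// real numbers depend on the model's tokenizer.
function retrievalBudget(topK: number, tokensPerChunk: number, contextWindow: number): number {
  const consumed = topK * tokensPerChunk;
  // Tokens left for the system prompt, conversation history, and the question
  return contextWindow - consumed;
}

// Oversized retrieval: 20 chunks x 2,000 tokens = 40,000 tokens consumed.
console.log(retrievalBudget(20, 2000, 128000)); // 88000 tokens remain
// Tuned retrieval: 5 chunks x 750 tokens = 3,750 tokens consumed.
console.log(retrievalBudget(5, 750, 128000)); // 124250 tokens remain
```

Even on a large context window, oversized retrieval eats more than ten times the budget of a tuned configuration before any answer quality is gained.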
Severity is info because suboptimal chunk sizing and retrieval depth degrade answer quality and inflate cost, but do not cause outright failure in typical configurations.
Tune chunk size to 500–1,500 tokens with 10–20% overlap, and bound topK to 3–10 depending on average chunk relevance. Start conservative and increase only if retrieval quality metrics warrant it.
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 3000,   // measured in characters by default (~750 tokens): good balance of context and precision
  chunkOverlap: 450, // 15% overlap preserves context across chunk boundaries
});
const chunks = await splitter.splitDocuments(documents);

// Keep topK bounded in all vector retrieval calls
const results = await vectorStore.similaritySearch(query, 5); // topK = 5
Verify by examining the assembled prompt context for a typical query — the retrieved content should be directly relevant to the question, with no padding from surrounding document sections.
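Part of that verification can be automated by bounding the combined size of the retrieved chunks. A rough sketch, using a 4-characters-per-token heuristic in place of a real tokenizer; the helper names and `RetrievedChunk` shape are assumptions, not a library API:

```typescript
// Minimal shape of a retrieved document chunk (assumed for illustration).
interface RetrievedChunk {
  pageContent: string;
}

// Crude token estimate: ~4 characters per token for English text.
// Swap in a real tokenizer (e.g. a tiktoken binding) for accurate counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// True if the retrieved set fits within the token budget reserved for context.
function retrievalWithinBudget(chunks: RetrievedChunk[], maxTokens: number): boolean {
  const total = chunks.reduce((sum, c) => sum + estimateTokens(c.pageContent), 0);
  return total <= maxTokens;
}
```

A check like this can run in tests or logging middleware to catch configuration drift before it reaches production prompts.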
ID: ai-token-optimization.token-efficiency.rag-chunk-size
Severity: info
What to look for: If the project implements RAG (Retrieval-Augmented Generation), look for document chunking or splitting logic. In LangChain-based projects, search for RecursiveCharacterTextSplitter, TokenTextSplitter, or similar. In custom implementations, look for string slicing logic with chunk size constants. Check the chunkSize and chunkOverlap parameters used, and how many chunks are retrieved per query (topK in vector search calls). Count all instances found and enumerate each.
Pass criteria: Chunk sizes are tuned to balance retrieval quality and token cost — typically 500-1500 tokens per chunk, with appropriate overlap (10-20% of chunk size). The number of retrieved chunks per query (topK) is bounded (typically 3-10).
Fail criteria: Extremely large chunks (e.g., entire pages or documents at 3000+ tokens each) are retrieved and inserted into the LLM context, flooding the context window with irrelevant surrounding content. Or topK is set very high (20+) without filtering, retrieving more context than the model can effectively use.
Skip (N/A) when: No RAG or vector search implementation is detected in the project. Signal: No vector database dependencies, no embedding creation for documents, no retrieval logic injecting context into prompts.
Detail on fail: "RAG chunks too large or topK too high — context flooded with low-relevance content"
Remediation: Retrieving massive chunks fills the context window with surrounding noise, reducing answer quality while increasing cost. Tuning chunk size is an iterative process that depends on the content type.
// Example with LangChain
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 3000,   // measured in characters by default (~750 tokens): good balance
  chunkOverlap: 450, // 15% overlap preserves context across chunk boundaries
});
const chunks = await splitter.splitDocuments(documents);

// Keep topK bounded in retrieval
const results = await vectorStore.similaritySearch(query, 5); // topK = 5
Verify by examining the assembled prompt context for a typical query — the retrieved content should be directly relevant, not padded with surrounding document text.
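The pass/fail thresholds above can also be expressed as a simple configuration check. A sketch under the assumption that chunk size, overlap, and topK are known in token units; the `RagConfig` shape and `violations` helper are hypothetical, not part of any library:

```typescript
// Hypothetical RAG configuration, with sizes expressed in tokens.
interface RagConfig {
  chunkSizeTokens: number;
  chunkOverlapTokens: number;
  topK: number;
}

// Returns a list of threshold violations, mirroring the fail criteria:
// 3000+ token chunks, topK of 20+, or overlap outside the 10-20% band.
function violations(cfg: RagConfig): string[] {
  const issues: string[] = [];
  if (cfg.chunkSizeTokens >= 3000) {
    issues.push("chunks at 3000+ tokens flood the context window");
  }
  if (cfg.topK >= 20) {
    issues.push("topK of 20+ retrieves more context than the model can use");
  }
  const overlapRatio = cfg.chunkOverlapTokens / cfg.chunkSizeTokens;
  if (overlapRatio < 0.1 || overlapRatio > 0.2) {
    issues.push("overlap outside the recommended 10-20% of chunk size");
  }
  return issues;
}
```

Wiring a check like this into CI keeps the tuned values from regressing silently as the pipeline evolves.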