Introduction
Retrieval-Augmented Generation (RAG) has become the backbone for many production LLM systems. This article walks through a pragmatic architecture that balances freshness, cost, and recall for real-world applications.
Whether you're building a customer support chatbot, a research assistant, or a documentation search system, RAG provides the framework to ground LLM responses in your proprietary data while maintaining the flexibility of natural language generation.
Indexing & Chunking
We start by defining the index: how documents are chunked, embedded, and stored. Small changes in chunk size, overlap, and embedding cadence can dramatically affect recall and latency; I share heuristics below that worked across several projects, plus a minimal chunking sketch after the list.
Chunk size heuristics
- Short documents: 200-400 tokens with minimal overlap (5-10%)
- Long-form content: 800-1200 tokens with 10-20% overlap
- Code repositories: Function-level chunking with context preservation
- Structured data: Entity-based chunking with metadata enrichment
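To make the overlap arithmetic concrete, here's a minimal chunker sketch. It splits on whitespace as a stand-in for a real tokenizer (in practice you'd count tokens with the embedding model's tokenizer, e.g. tiktoken), and the function name and defaults are illustrative:

// Minimal sketch: fixed-size token windows with fractional overlap.
// Whitespace splitting stands in for a real tokenizer here.
function chunkByTokens(text: string, chunkSize = 1000, overlapPct = 0.15): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, Math.floor(chunkSize * (1 - overlapPct)));
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= tokens.length) break; // last window reached the end
  }
  return chunks;
}

// Long-form content: ~1000-token chunks with 15% overlap
const documentText = 'your long-form content here';
const chunks = chunkByTokens(documentText, 1000, 0.15);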
Embedding Strategies
The choice of embedding model significantly impacts both cost and quality. We've had success with:
- text-embedding-ada-002 for general-purpose use (OpenAI)
- all-MiniLM-L6-v2 for cost-sensitive applications
- Domain-specific fine-tuned models for specialized use cases
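One way to keep that choice swappable is to route by use case. A sketch (the fine-tuned model name is a placeholder; the dimensions are the published sizes for the two real models):

// Sketch: choose an embedding model by use case.
type UseCase = 'general' | 'cost-sensitive' | 'domain-specific';

function pickEmbeddingModel(useCase: UseCase): { model: string; dim: number } {
  switch (useCase) {
    case 'general':
      return { model: 'text-embedding-ada-002', dim: 1536 }; // OpenAI hosted
    case 'cost-sensitive':
      return { model: 'all-MiniLM-L6-v2', dim: 384 }; // small, runs locally
    case 'domain-specific':
      return { model: 'my-org/finetuned-embedder', dim: 768 }; // placeholder name
  }
}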
Orchestration & Ranking
Combine semantic search (ANN) with lexical filters for precision. Use a lightweight reranker to improve the top-K candidates before sending them to the model.
// Hybrid retrieval pipeline: cast a wide ANN net for recall
const candidates = await vectorStore.query({
  vector: queryEmbedding,
  topK: 50,
  filter: { category: "docs", status: "published" }
});

// Lexical filtering for precision
const filtered = candidates.filter(doc =>
  matchesLexicalFilters(doc, query)
);

// Cross-encoder reranking, then keep the top 5
const reranked = await reranker.rank(filtered, query);
const topResults = reranked.slice(0, 5);

// Construct prompt with retrieved context
const context = topResults.map(r => r.content).join('\n\n');
const response = await llm.complete({
  prompt: buildPrompt(query, context),
  temperature: 0.3
});
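The matchesLexicalFilters helper above is left abstract. One minimal interpretation, requiring every significant query term to appear in the candidate text, might look like this (a sketch; production systems often use BM25 or an inverted index instead of substring checks):

// Sketch: a candidate passes if it contains every non-trivial query term.
function matchesLexicalFilters(doc: { content: string }, query: string): boolean {
  const terms = query
    .toLowerCase()
    .split(/\W+/)
    .filter(t => t.length > 2); // drop stopword-ish short tokens
  const text = doc.content.toLowerCase();
  return terms.every(term => text.includes(term));
}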
Performance Considerations
In production, we saw a 40% improvement in relevance after introducing a reranking step, at a cost of only ~150 ms of additional latency. The trade-off is worth it for most applications.
Practical Example
Here's a concrete implementation using Pinecone and LangChain. The same pattern applies to any vector store; just swap the client.
import { Pinecone } from '@pinecone-database/pinecone';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';

// Initialize the Pinecone client and target index
const pinecone = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY!,
  environment: process.env.PINECONE_ENV!
});
const index = pinecone.Index('docs');

// Create a vector store backed by the existing index
const vectorStore = await PineconeStore.fromExistingIndex(
  new OpenAIEmbeddings(),
  { pineconeIndex: index }
);

// Query: top 5 matches, restricted by a metadata filter
const query = 'How do I configure the retriever?'; // example query
const results = await vectorStore.similaritySearchWithScore(
  query,
  5,
  { category: 'technical' }
);
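similaritySearchWithScore returns [Document, score] pairs, so a quick way to inspect the hits (assuming your metadata carries a source field) is:

// Each result is a [Document, score] tuple
for (const [doc, score] of results) {
  console.log(score.toFixed(3), doc.metadata.source, doc.pageContent.slice(0, 80));
}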
Monitoring & Optimization
Track these metrics in production:
- Query latency: p50, p95, p99 for vector search + reranking (see the percentile sketch after this list)
- Cache hit rate: For frequently asked questions
- Index freshness: Time since last reindexing
- Relevance scores: User feedback on answer quality
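For the latency percentiles, here's a minimal nearest-rank sketch over raw samples; at scale you'd typically use a histogram structure such as HDRHistogram or t-digest instead of sorting every sample:

// Sketch: nearest-rank percentile from recorded latency samples (ms)
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

const latenciesMs = [42, 55, 61, 48, 120, 95, 51, 380];
console.log(percentile(latenciesMs, 50), percentile(latenciesMs, 95), percentile(latenciesMs, 99));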
Cache hot queries aggressively. In our systems, ~40% of queries can be served from cache with sub-10ms latency.
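A minimal in-memory TTL cache keyed on the normalized query captures most of that win. The sketch below uses a Map; production systems would more likely reach for Redis or similar:

// Sketch: TTL cache for hot queries, keyed on normalized query text
const cache = new Map<string, { answer: string; expiresAt: number }>();
const TTL_MS = 5 * 60 * 1000; // 5 minutes

function normalize(query: string): string {
  return query.trim().toLowerCase().replace(/\s+/g, ' ');
}

async function answerWithCache(
  query: string,
  generate: (q: string) => Promise<string>
): Promise<string> {
  const key = normalize(query);
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.answer; // cache hit: sub-10ms
  const fresh = await generate(query);
  cache.set(key, { answer: fresh, expiresAt: Date.now() + TTL_MS });
  return fresh;
}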