A production-grade Retrieval-Augmented Generation (RAG) system built with FastAPI, LangChain, FAISS, Gemini, and SentenceTransformers. This project demonstrates high-fidelity retrieval, cross-encoder reranking, and rigorous evaluation methodologies.
The system follows a modular architecture designed for reliability, scalability, and observability.
graph TD
A[User Query] --> B[Embedding Generation - HuggingFace]
B --> C[Vector Retrieval - FAISS]
C --> D[Top 20 Candidates]
D --> E[Cross-Encoder Reranking]
E --> F[Top 5 Refined Chunks]
F --> G[LLM Context Assembly]
G --> H[Grounded Response Generation - Gemini API]
H --> I[Final Answer + Citations]
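The flow above can be sketched as a two-stage function: a cheap scorer selects 20 candidates, then a more expensive scorer reorders them and keeps 5. This is only an illustration. The names `retrieve_then_rerank`, `word_overlap`, and `phrase_match` are hypothetical, and the toy scorers stand in for what would be FAISS similarity search and the cross-encoder in the real system.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def retrieve_then_rerank(
    query: str,
    chunks: List[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    candidates_k: int = 20,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    """Two-stage retrieval: broad candidate search, then precise reranking."""
    # Stage 1: rank every chunk with the cheap scorer and keep the top candidates.
    candidates = sorted(
        chunks, key=lambda c: cheap_score(query, c), reverse=True
    )[:candidates_k]
    # Stage 2: rescore only the candidates with the expensive scorer.
    rescored = [(c, precise_score(query, c)) for c in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]


def word_overlap(query: str, chunk: str) -> float:
    """Toy stand-in for vector similarity: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def phrase_match(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: rewards an exact phrase match."""
    bonus = 10.0 if query.lower() in chunk.lower() else 0.0
    return word_overlap(query, chunk) + bonus
```

The expensive scorer only ever sees `candidates_k` items, which is why the second stage stays affordable even when the corpus is large.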
- Ingestion Pipeline: Supports PDF, TXT, and Markdown with recursive chunking (512 tokens, 10% overlap).
- Vector Storage: FAISS (Local) for fast, in-memory vector retrieval without external database costs.
- Embeddings: Local HuggingFace embeddings (sentence-transformers/all-MiniLM-L6-v2) to avoid API costs.
- Generation: Google Gemini API (gemini-2.5-flash) for fast text generation on the free tier.
- Reranking: Cross-encoder model (cross-encoder/ms-marco-MiniLM-L-6-v2) to refine search results.
- Observability: Structured JSON logging and LangSmith tracing for full request lifecycle visibility.
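To make the ingestion settings concrete (512 tokens, 10% overlap, so roughly 51 shared tokens between neighbouring chunks), here is a minimal sketch of recursive chunking. It is not the project's implementation: `recursive_split` and `chunk_text` are hypothetical names, tokens are approximated by whitespace-separated words, and whitespace inside a chunk is normalized to single spaces.

```python
from typing import List, Sequence


def recursive_split(
    text: str, max_tokens: int, separators: Sequence[str] = ("\n\n", "\n", " ")
) -> List[str]:
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_tokens:
        return [text]
    if not separators:
        # No separator left: fall back to a hard split on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    pieces: List[str] = []
    for part in text.split(separators[0]):
        pieces.extend(recursive_split(part, max_tokens, separators[1:]))
    return pieces


def chunk_text(text: str, max_tokens: int = 512, overlap_ratio: float = 0.10) -> List[str]:
    """Merge pieces into chunks of at most max_tokens, repeating an overlap tail."""
    overlap = int(max_tokens * overlap_ratio)  # 512 * 0.10 -> 51 tokens
    budget = max_tokens - overlap  # largest piece that still fits after the tail
    chunks: List[str] = []
    current: List[str] = []
    for piece in recursive_split(text, budget):
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Start the next chunk with the tail of the previous one.
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks are only flushed at piece boundaries, a paragraph that fits within the budget is never cut in half. A production splitter would count tokens with the embedding model's tokenizer instead of whitespace words.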
- Python 3.10+
- Google Gemini API Key
- LangSmith API Key (optional, only needed for tracing)
- Clone the repository.
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Configure environment variables:
cp .env.example .env
# Edit .env with your actual Gemini API key
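As an illustration of how the application might consume these variables, the sketch below reads them from the process environment and fails fast when the required key is missing. The variable names `GEMINI_API_KEY` and `LANGSMITH_API_KEY`, and the `Settings` and `load_settings` names, are assumptions; check `.env.example` for the names the project actually uses. Loading `.env` into the environment (for example with python-dotenv) is assumed to happen elsewhere.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    langsmith_api_key: Optional[str] = None  # tracing is optional

    @property
    def tracing_enabled(self) -> bool:
        return self.langsmith_api_key is not None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment; raise if the required key is missing."""
    # Variable names are assumptions; match them to .env.example.
    gemini_key = env.get("GEMINI_API_KEY", "").strip()
    if not gemini_key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; copy .env.example to .env and fill it in"
        )
    langsmith_key = env.get("LANGSMITH_API_KEY", "").strip() or None
    return Settings(gemini_api_key=gemini_key, langsmith_api_key=langsmith_key)
```

Validating at startup surfaces a missing key immediately instead of on the first generation request.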
uvicorn app.main:app --reload
The API will be available at http://localhost:8000. Explore the interactive docs at /docs.
docker-compose up --build
Adding cross-encoder reranking improved every evaluation metric over the no-rerank baseline:
| Metric | Baseline (No Rerank) | Optimized (Rerank) |
|---|---|---|
| Faithfulness | 0.67 | 0.81 |
| Answer Relevance | 0.62 | 0.76 |
| Context Recall | 0.71 | 0.79 |
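To read the table in relative terms, the short calculation below derives the absolute and percentage gain for each metric. The scores are copied from the table; the `improvement` helper is a hypothetical name used only for this illustration.

```python
from typing import Dict, Tuple

# (baseline, reranked) scores copied from the evaluation table.
RESULTS: Dict[str, Tuple[float, float]] = {
    "Faithfulness": (0.67, 0.81),
    "Answer Relevance": (0.62, 0.76),
    "Context Recall": (0.71, 0.79),
}


def improvement(baseline: float, optimized: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = optimized - baseline
    return round(absolute, 2), round(100.0 * absolute / baseline, 1)


if __name__ == "__main__":
    for metric, (baseline, optimized) in RESULTS.items():
        absolute, relative = improvement(baseline, optimized)
        print(f"{metric}: +{absolute:.2f} absolute, +{relative:.1f}% relative")
```

Faithfulness and answer relevance each gain 0.14 (about 21% and 23% relative), while context recall gains 0.08 (about 11%). That pattern is consistent with reranking mainly changing which of the retrieved chunks reach the LLM, not which chunks are retrieved in the first place.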
| Feature | Pinecone | FAISS |
|---|---|---|
| Retrieval Latency | ~120ms | ~45ms |
| Metadata Filtering | Robust / Native | Limited / Manual |
| Scalability | Managed / SaaS | Horizontal / Manual |
| Persistence | Persistent | In-memory (index must be saved to and loaded from disk manually) |
Conclusion: We currently use FAISS so the application stays cost-free and easy to run locally, without the overhead of a cloud vector database.
- Reranking vs. Latency: The cross-encoder adds ~150-300ms to each request but reduces hallucination (faithfulness rose from 0.67 to 0.81 in the evaluation above) because the LLM receives only the most relevant context.
- Recursive vs. Fixed Chunking: Recursive chunking preserves semantic continuity better than fixed-size splits, though it requires more careful metadata management.
- Pydantic for Data Validation: All request/response models use Pydantic V2 for strict type safety and automatic OpenAPI documentation.
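The reranking trade-off is easier to judge when each stage's latency is measured per request. Here is a minimal sketch of one way to do that, emitting a structured JSON line in the spirit of the logging described earlier. The names `timed_stage` and `latency_log_line` and the log field names are hypothetical, not taken from the project.

```python
import json
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed_stage(name: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of one pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0


def latency_log_line(request_id: str, timings: Dict[str, float]) -> str:
    """Serialize per-stage timings as one structured JSON log line."""
    record = {
        "request_id": request_id,
        "stages_ms": {stage: round(ms, 1) for stage, ms in timings.items()},
        "total_ms": round(sum(timings.values()), 1),
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    timings: Dict[str, float] = {}
    with timed_stage("retrieve", timings):
        time.sleep(0.01)  # placeholder for the vector search
    with timed_stage("rerank", timings):
        time.sleep(0.02)  # placeholder for the cross-encoder pass
    print(latency_log_line("req-1", timings))
```

Logging the stages separately shows whether the cross-encoder or another step dominates a slow request.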
- Multi-hop Reasoning: Performance degrades when answers require synthesizing information across 3+ disparate chunks.
- Context Fragmentation: Small chunk sizes (512 tokens) can sometimes lose the "big picture" of a long document.
- Reranking Overhead: The MiniLM cross-encoder is fast but still adds sequential latency.
- Implement Hybrid Search (Keyword + Semantic).
- Add support for multi-modal ingestion (images/tables in PDFs).
- Explore Graph-RAG for complex multi-hop reasoning.
- Implement query expansion/transformation.
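For the hybrid search item, one common way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below shows the idea only: it is not part of the current system, and the function name and example document IDs are hypothetical. The constant `k = 60` is the value conventionally used with RRF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical keyword results
    semantic_hits = ["doc_b", "doc_c", "doc_d"]  # hypothetical vector results
    print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

RRF uses only rank positions, so keyword and vector scores never need to be normalized onto a shared scale.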