Develop a simple Python backend for a Retrieval-Augmented Generation (RAG) pipeline that uses a Large Language Model (LLM) over a knowledge base of PDF files.
This repository contains a minimal Retrieval-Augmented Generation (RAG) pipeline built with FastAPI and the Mistral API. It ingests PDF documents, chunks and embeds their text, performs hybrid semantic/keyword search for queries, and uses a Large Language Model to generate answers based on retrieved context.
User Query -> [FastAPI Endpoint] -> needs_retrieval?
-> if False: return a default response (no retrieval)
-> if True:
rewritten_query = rewrite_query(user_query)
query_vec = embed(rewritten_query)
sem_matches = semantic_search(query_vec)
key_matches = keyword_search(user_query)
top_chunks = RRF_fuse(sem_matches, key_matches)
final_prompt = build_prompt(user_query, top_chunks)
answer = LLM_generate(final_prompt)
-> Return answer (with original and rewritten query)
- Features
- Requirements
- Installation and Setup
- Project Structure
- How to Run
- API Endpoints
- System Design Details
  - Data Ingestion
  - Chunking & Embedding
  - Query Intent & Rewriting
  - Hybrid Search & Reranking
  - Answer Generation
- Technical Notes and Recommendations
- Possible Improvements
- Author
- Upload PDFs via a REST endpoint; the system automatically extracts, chunks, and embeds their text.
- Local in-memory embedding store (no external vector DB).
- Intent classification with two options: a rule-based or an LLM-based approach.
- LLM-based query rewriting to improve retrieval.
- Hybrid retrieval (semantic & keyword) with Reciprocal Rank Fusion.
- LLM-based answer generation with Mistral's chat API, grounded in retrieved chunks.
- User-friendly Streamlit web interface for document management and querying.
- Python 3.9+ (recommended)
- FastAPI
- Uvicorn
- PyMuPDF (fitz) for PDF parsing
- NumPy for vector operations
- Requests for HTTP calls
- python-dotenv to load environment variables
- Mistral AI client library
- Streamlit for the web interface
- A Mistral AI account and API key
- Clone or download the repository to your local environment.
- Install dependencies:
pip install -r requirements.txt
- Configure environment:
- Create a .env file in the root directory (or set these as environment variables):
MISTRAL_API_KEY=<your_mistral_api_key>
MISTRAL_EMBED_MODEL=mistral-embed
MISTRAL_CHAT_MODEL=mistral-large-latest
- Ensure python-dotenv is installed so the code picks up these variables (a sketch of how config.py might read them follows this list).
- Optional: If you have scanned PDFs or advanced needs, you may require OCR or other libraries. This MVP does not handle OCR.
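For illustration only, app/config.py might load these settings roughly as follows; the exact contents of that file are an assumption.
# config.py (illustrative sketch, not the actual implementation)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root, if present

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
MISTRAL_EMBED_MODEL = os.getenv("MISTRAL_EMBED_MODEL", "mistral-embed")
MISTRAL_CHAT_MODEL = os.getenv("MISTRAL_CHAT_MODEL", "mistral-large-latest")

# Global in-memory index: one dict per chunk (doc_id, text, embedding).
docs_index = []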
rag-mvp/
│
├── app/
│ ├── main.py # FastAPI entrypoint, route registration
│ ├── ingestion.py # Endpoints for ingesting/deleting PDF documents
│ ├── query.py # Endpoint for querying, orchestrates retrieval + generation
│ ├── pdf_utils.py # PDF extraction logic (using PyMuPDF)
│ ├── chunking.py # Text chunking
│ ├── search.py # Semantic, keyword, and hybrid search with RRF
│ ├── embedding.py # Mistral API calls (embeddings)
│ ├── llm_utils.py # LLM-based functions for query rewriting and answer generation
│ ├── intent.py # Heuristic to decide if retrieval is needed
│ ├── config.py # Global index, environment variables
│ └── streamlit_app.py # User-friendly web interface
│
├── requirements.txt # Python dependencies
├── README.md # This file!
├── LICENSE # MIT License
└── .env (ignored) # Mistral secrets
- Start the FastAPI server (e.g. on port 8000):
uvicorn app.main:app --reload --port 8000
- Test the endpoints via the Swagger UI at
http://localhost:8000/docs.
Example using curl:
# Upload a PDF
curl -X POST "http://localhost:8000/documents" \
-F "files=@path/to/your_document.pdf"
# Delete a PDF
curl -X DELETE "http://localhost:8000/documents/<doc_id>"
# Query for information
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{"query":"What is the summary of the PDF?"}'- Make sure the backend API is running (see above).
- Start the Streamlit app:
streamlit run app/streamlit_app.py
- Open your browser at
http://localhost:8501 to use the interactive interface.
- POST /documents
- Ingest one or more PDFs:
- Extract text from each PDF
- Split text into chunks
- Create embeddings for each chunk
- Store chunks in a global list docs_index
- DELETE /documents/{doc_id}
- Remove the document's chunks from docs_index
- Attempt to remove the uploaded file from disk
- POST /query
- Accepts JSON:
{"query":"..."} - Runs intent classification, optional rewriting
- Hybrid retrieval (semantic & keyword) plus RRF fusion
- Feeds top chunks as context to Mistral's Chat API
- Returns JSON with original_query, rewritten_query, and the final answer.
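An illustrative response body (field values are placeholders, not actual system output):
{
  "original_query": "What is the summary of the PDF?",
  "rewritten_query": "summary of the PDF document",
  "answer": "..."
}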
- Upload: POST /documents takes PDF files.
- Extraction: Using pdf_utils.extract_text_from_pdf() with PyMuPDF (see the sketch after this list).
- Chunking: Splits text into overlapping segments (~200 words with 50-word overlap) for better context coverage.
- Embedding: Calls Mistral's embedding API (embedding.get_embedding).
- Storing: Each chunk is stored with doc_id, text, and embedding in the in-memory list docs_index.
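The extraction step can be as simple as the sketch below, roughly what pdf_utils.extract_text_from_pdf() does; the actual implementation may differ.
# pdf_utils.py (illustrative sketch)
import fitz  # PyMuPDF

def extract_text_from_pdf(path: str) -> str:
    # Concatenate the plain text of every page in the PDF.
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)
The extracted text is then split into overlapping chunks: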
# chunking.py
def chunk_text(text, max_tokens=200, overlap=50):
    # Split text into overlapping word-based chunks: ~max_tokens words each,
    # with `overlap` words shared between consecutive chunks.
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for i in range(0, len(words), step):
        chunk_slice = words[i:i + max_tokens]
        if not chunk_slice:
            continue
        chunk = " ".join(chunk_slice)
        chunks.append(chunk)
    return chunks
Then each chunk's text is embedded:
# embedding.py
def get_embedding(text: str) -> np.ndarray:
    # Calls Mistral API, returns embedding vector
    ...
Each embedding is stored in docs_index along with its doc_id.
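A hedged sketch of what get_embedding() might look like when calling Mistral's REST embeddings endpoint directly with requests; the actual embedding.py may use the mistralai client instead, and error handling and batching are omitted here.
# embedding.py (illustrative sketch, not the actual implementation)
import numpy as np
import requests
from app.config import MISTRAL_API_KEY, MISTRAL_EMBED_MODEL

def get_embedding(text: str) -> np.ndarray:
    resp = requests.post(
        "https://api.mistral.ai/v1/embeddings",
        headers={"Authorization": f"Bearer {MISTRAL_API_KEY}"},
        json={"model": MISTRAL_EMBED_MODEL, "input": [text]},
        timeout=30,
    )
    resp.raise_for_status()
    return np.array(resp.json()["data"][0]["embedding"], dtype=np.float32)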
Advanced Chunking Ideas:
- Semantic Chunking: Dividing documents based on semantic coherence rather than fixed sizes
- Contextual Chunk Headers: Adding higher-level context (title, section info) to chunks before embedding (see the sketch after this list)
- Parent-child relationships: Creating hierarchical chunks where finer chunks are linked to broader parent chunks
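For example, a contextual chunk header could be as simple as prefixing document-level metadata before embedding; the helper below is hypothetical and not part of the current code.
def with_header(chunk: str, doc_title: str, section: str) -> str:
    # Prepend document/section context so the embedding captures where the chunk came from.
    return f"Document: {doc_title}\nSection: {section}\n\n{chunk}"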
- Intent: System supports two intent classification methods:
- Naive rule-based approach: Uses keywords and question marks to determine retrieval need (a sketch follows this list)
- LLM-based approach: Uses Mistral API to intelligently decide if retrieval is required
- Rewrite: If needed, the system calls Mistral's chat API again to simplify the query, which can improve matching against chunk text.
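A rough sketch of the rule-based intent check described above; the keyword list and logic are assumptions, see app/intent.py for the actual heuristic.
QUESTION_WORDS = {"what", "who", "when", "where", "why", "how", "which"}

def needs_retrieval(query: str) -> bool:
    # Retrieve if the query looks like a question (question mark or question word).
    q = query.strip().lower()
    return q.endswith("?") or any(w in QUESTION_WORDS for w in q.split())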
Advanced Query Processing Ideas:
- Step-back Prompting: Taking a step back to understand the broader context before addressing specific questions
- Sub-query Decomposition: Breaking complex queries into smaller, manageable sub-queries
- HyDE (Hypothetical Document Embeddings): Generating a hypothetical answer to help bridge query-document semantic gaps
- Semantic Search: Compare the query embedding with each chunk's embedding via L2 distance or cosine similarity.
- Keyword Search: Count matching words in each chunk's text.
- Reciprocal Rank Fusion (RRF): The top results from both methods are merged based on rank, so chunks that rank highly in both the semantic and keyword lists are prioritized (see the sketch below).
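A compact sketch of RRF over the two ranked lists; k=60 is a common default, and the constant and return shape in app/search.py may differ.
def rrf_fuse(sem_ranked: list, key_ranked: list, k: int = 60, top_n: int = 5) -> list:
    # Each chunk earns 1 / (k + rank) from every list it appears in; higher total wins.
    scores = {}
    for ranked in (sem_ranked, key_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]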
Advanced Retrieval Ideas:
- Two-level indexing: First retrieve relevant documents based on summaries, then retrieve chunks from those documents
- Metadata filtering: Refining retrieved chunks based on metadata like categories or timestamps
- The final top chunks are concatenated as context in a prompt:
Context information is below.
---------------------
{chunk1}
{chunk2}
...
---------------------
Given the context information (and not prior knowledge), please answer the query.
Query: {user_query}
Answer:
- This prompt is sent to Mistral's chat API with a moderate temperature; the LLM returns a response grounded in the provided chunks (a sketch of this step follows).
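A hedged sketch of the generation step, calling Mistral's chat completions endpoint with requests; the actual llm_utils.py may use the mistralai client, and the temperature value is an assumption.
# llm_utils.py (illustrative sketch, not the actual implementation)
import requests
from app.config import MISTRAL_API_KEY, MISTRAL_CHAT_MODEL

def generate_answer(user_query: str, top_chunks: list) -> str:
    context = "\n".join(top_chunks)
    prompt = (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information (and not prior knowledge), "
        "please answer the query.\n"
        f"Query: {user_query}\nAnswer:"
    )
    resp = requests.post(
        "https://api.mistral.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {MISTRAL_API_KEY}"},
        json={
            "model": MISTRAL_CHAT_MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.4,  # "moderate" temperature; exact value is an assumption
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]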
- Documents are PDF text: We assume standard text-based PDFs. Scanned PDFs or images require OCR (not included).
- Memory-based index: Chunks and embeddings live in-memory (docs_index). No persistence, meaning data is lost if the server restarts.
- Single-user or low concurrency: Global list usage with no locking might cause issues under high concurrency.
- Mistral API availability: System relies on external calls to Mistral for embeddings and generation.
- We favored simplicity (Python list for embeddings, naive intent detection) to meet the MVP constraints.
- No external DB was used, fulfilling the requirement for purely local storage but limiting scalability.
- Responses only contain final answers (and queries), not the source context. Could be expanded to provide references.
- Query rewriting is an additional API call, raising latency. In production, measure if it improves accuracy enough to justify the extra request.
- Enhanced Retrieval: Replace simple word counting with BM25 or TF-IDF scoring for more accurate keyword search (see the sketch after this list)
- Source Attribution: Include references to source documents in responses for transparency and fact-checking
- Metadata Filtering: Allow filtering search results by document type, date, or other metadata
- Vector Database: Store embeddings in a lightweight vector database (FAISS) for persistence and faster retrieval
- Context Optimization: Implement prompt engineering techniques to better utilize retrieved chunks
- Caching System: Cache similar or identical queries to improve response time and reduce API costs
- Asynchronous Processing: Implement parallel search and embedding generation for better performance
- Response Validation: Implement basic checks to detect potential hallucinations or inconsistencies in LLM outputs.
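As an illustration of the BM25 idea, the rank_bm25 package (not currently a dependency) could replace the word-count scorer roughly like this:
from rank_bm25 import BM25Okapi

def bm25_keyword_search(query: str, chunk_texts: list, top_n: int = 5) -> list:
    # Tokenize naively by whitespace; a real implementation would normalize/stem.
    corpus = [text.lower().split() for text in chunk_texts]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(chunk_texts)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]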
Daniel Fiuza Dosil
Email: dafiuzadosil@gmail.com
Date: 27 March 2025