Develop a simple Python backend for a Retrieval-Augmented Generation (RAG) pipeline that uses a Large Language Model (LLM) over a knowledge base of PDF files.
This repository contains a minimal Retrieval-Augmented Generation (RAG) pipeline built with FastAPI and the Mistral API. It ingests PDF documents, chunks and embeds their text, performs hybrid semantic/keyword search for queries, and uses a Large Language Model to generate answers based on retrieved context.
User Query -> [FastAPI Endpoint] -> needs_retrieval?
-> if False: return a default response (no retrieval)
-> if True:
rewritten_query = rewrite_query(user_query)
query_vec = embed(rewritten_query)
sem_matches = semantic_search(query_vec)
key_matches = keyword_search(user_query)
top_chunks = RRF_fuse(sem_matches, key_matches)
final_prompt = build_prompt(user_query, top_chunks)
answer = LLM_generate(final_prompt)
-> Return answer (with original and rewritten query)
- Features
- Requirements
- Installation and Setup
- Project Structure
- How to Run
- API Endpoints
- System Design Details
  - Data Ingestion
  - Chunking & Embedding
  - Query Intent & Rewriting
  - Hybrid Search & Reranking
  - Answer Generation
- Technical Notes and Recommendations
- Possible Improvements
- Author
- Upload PDFs via a REST endpoint; the system automatically extracts, chunks, and embeds their text.
- Local in-memory embedding store (no external vector DB).
- Intent classification with two options: a rule-based or an LLM-based approach.
- LLM-based query rewriting to improve retrieval.
- Hybrid retrieval (semantic & keyword) with Reciprocal Rank Fusion.
- LLM-based answer generation with Mistral's chat API, grounded in retrieved chunks.
- User-friendly Streamlit web interface for document management and querying.
- Python 3.9+ (recommended)
- FastAPI
- Uvicorn
- PyMuPDF (fitz) for PDF parsing
- NumPy for vector operations
- Requests for HTTP calls
- python-dotenv to load environment variables
- Mistral AI client library
- Streamlit for the web interface
- A Mistral AI account and API key
- Clone or download the repository to your local environment.
- Install dependencies:
pip install -r requirements.txt
- Configure environment:
- Create a .env file in the root directory (or set these as environment variables):
MISTRAL_API_KEY=<your_mistral_api_key>
MISTRAL_EMBED_MODEL=mistral-embed
MISTRAL_CHAT_MODEL=mistral-large-latest
- Ensure python-dotenv is installed so the code picks up these variables (a sketch of how config.py might read them follows this list).
- Optional: If you have scanned PDFs or advanced needs, you may require OCR or other libraries. This MVP does not handle OCR.
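For illustration only, app/config.py might load these settings roughly as follows; the exact contents of that file are an assumption.
# config.py (illustrative sketch, not the actual implementation)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root, if present

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
MISTRAL_EMBED_MODEL = os.getenv("MISTRAL_EMBED_MODEL", "mistral-embed")
MISTRAL_CHAT_MODEL = os.getenv("MISTRAL_CHAT_MODEL", "mistral-large-latest")

# Global in-memory index: one dict per chunk (doc_id, text, embedding).
docs_index = []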
rag-mvp/
│
├── app/
│ ├── main.py # FastAPI entrypoint, route registration
│ ├── ingestion.py # Endpoints for ingesting/deleting PDF documents
│ ├── query.py # Endpoint for querying, orchestrates retrieval + generation
│ ├── pdf_utils.py # PDF extraction logic (using PyMuPDF)
│ ├── chunking.py # Text chunking
│ ├── search.py # Semantic, keyword, and hybrid search with RRF
│ ├── embedding.py # Mistral API calls (embeddings)
│ ├── llm_utils.py # LLM-based functions for query rewriting and answer generation
│ ├── intent.py # Heuristic to decide if retrieval is needed
│ ├── config.py # Global index, environment variables
│ └── streamlit_app.py # User-friendly web interface
│
├── requirements.txt # Python dependencies
├── README.md # This file!
├── LICENSE # MIT License
└── .env (ignored) # Mistral secrets
- Start the FastAPI server (e.g. on port 8000):
uvicorn app.main:app --reload --port 8000
- Test the endpoints via the Swagger UI at
http://localhost:8000/docs.
Example using curl:
# Upload a PDF
curl -X POST "http://localhost:8000/documents" \
-F "files=@path/to/your_document.pdf"
# Delete a PDF
curl -X DELETE "http://localhost:8000/documents/<doc_id>"
# Query for information
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{"query":"What is the summary of the PDF?"}'- Make sure the backend API is running (see above).
- Start the Streamlit app:
streamlit run app/streamlit_app.py
- Open your browser at
http://localhost:8501 to use the interactive interface.
- POST /documents
- Ingest one or more PDFs:
- Extract text from each PDF
- Split text into chunks
- Create embeddings for each chunk
- Store chunks in a global list docs_index
- DELETE /documents/{doc_id}
- Remove the document's chunks from docs_index
- Attempt to remove the uploaded file from disk
- POST /query
- Accepts JSON:
{"query":"..."} - Runs intent classification, optional rewriting
- Hybrid retrieval (semantic & keyword) plus RRF fusion
- Feeds top chunks as context to Mistral's Chat API
- Returns JSON with original_query, rewritten_query, and the final answer.
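An illustrative response body (field values are placeholders, not actual system output):
{
  "original_query": "What is the summary of the PDF?",
  "rewritten_query": "summary of the PDF document",
  "answer": "..."
}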
- Upload: POST /documents takes PDF files.
- Extraction: Using pdf_utils.extract_text_from_pdf() with PyMuPDF (see the sketch after this list).
- Chunking: Splits text into overlapping segments (~200 words with 50-word overlap) for better context coverage.
- Embedding: Calls Mistral's embedding API (embedding.get_embedding).
- Storing: Each chunk is stored with doc_id, text, and embedding in the in-memory list docs_index.
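The extraction step can be as simple as the sketch below, roughly what pdf_utils.extract_text_from_pdf() does; the actual implementation may differ.
# pdf_utils.py (illustrative sketch)
import fitz  # PyMuPDF

def extract_text_from_pdf(path: str) -> str:
    # Concatenate the plain text of every page in the PDF.
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)
The extracted text is then split into overlapping chunks: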
# chunking.py
def chunk_text(text, max_tokens=200, overlap=50):
    # Split text into overlapping word-based chunks: ~max_tokens words each,
    # with `overlap` words shared between consecutive chunks.
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for i in range(0, len(words), step):
        chunk_slice = words[i:i + max_tokens]
        if not chunk_slice:
            continue
        chunk = " ".join(chunk_slice)
        chunks.append(chunk)
    return chunks
Then each chunk's text is embedded:
# embedding.py
def get_embedding(text: str) -> np.ndarray:
    # Calls Mistral API, returns embedding vector
    ...
Each embedding is stored in docs_index along with its doc_id.
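A hedged sketch of what get_embedding() might look like when calling Mistral's REST embeddings endpoint directly with requests; the actual embedding.py may use the mistralai client instead, and error handling and batching are omitted here.
# embedding.py (illustrative sketch, not the actual implementation)
import numpy as np
import requests
from app.config import MISTRAL_API_KEY, MISTRAL_EMBED_MODEL

def get_embedding(text: str) -> np.ndarray:
    resp = requests.post(
        "https://api.mistral.ai/v1/embeddings",
        headers={"Authorization": f"Bearer {MISTRAL_API_KEY}"},
        json={"model": MISTRAL_EMBED_MODEL, "input": [text]},
        timeout=30,
    )
    resp.raise_for_status()
    return np.array(resp.json()["data"][0]["embedding"], dtype=np.float32)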
Advanced Chunking Ideas:
- Semantic Chunking: Dividing documents based on semantic coherence rather than fixed sizes
- Contextual Chunk Headers: Adding higher-level context (title, section info) to chunks before embedding (see the sketch after this list)
- Parent-child relationships: Creating hierarchical chunks where finer chunks are linked to broader parent chunks
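For example, a contextual chunk header could be as simple as prefixing document-level metadata before embedding; the helper below is hypothetical and not part of the current code.
def with_header(chunk: str, doc_title: str, section: str) -> str:
    # Prepend document/section context so the embedding captures where the chunk came from.
    return f"Document: {doc_title}\nSection: {section}\n\n{chunk}"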
- Intent: System supports two intent classification methods:
- Naive rule-based approach: Uses keywords and question marks to determine retrieval need (a sketch follows this list)
- LLM-based approach: Uses Mistral API to intelligently decide if retrieval is required
- Rewrite: If needed, the system calls Mistral's chat API again to simplify the query, which can improve matching against chunk text.
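A rough sketch of the rule-based intent check described above; the keyword list and logic are assumptions, see app/intent.py for the actual heuristic.
QUESTION_WORDS = {"what", "who", "when", "where", "why", "how", "which"}

def needs_retrieval(query: str) -> bool:
    # Retrieve if the query looks like a question (question mark or question word).
    q = query.strip().lower()
    return q.endswith("?") or any(w in QUESTION_WORDS for w in q.split())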
Advanced Query Processing Ideas:
- Step-back Prompting: Taking a step back to understand the broader context before addressing specific questions
- Sub-query Decomposition: Breaking complex queries into smaller, manageable sub-queries
- HyDE (Hypothetical Document Embeddings): Generating a hypothetical answer to help bridge query-document semantic gaps
- Semantic Search: Compare the query embedding with each chunk's embedding via L2 distance or cosine similarity.
- Keyword Search: Count matching words in each chunk's text.
- Reciprocal Rank Fusion (RRF): The top results from both methods are merged based on rank, so chunks that rank highly in both the semantic and keyword lists are prioritized (see the sketch below).
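A compact sketch of RRF over the two ranked lists; k=60 is a common default, and the constant and return shape in app/search.py may differ.
def rrf_fuse(sem_ranked: list, key_ranked: list, k: int = 60, top_n: int = 5) -> list:
    # Each chunk earns 1 / (k + rank) from every list it appears in; higher total wins.
    scores = {}
    for ranked in (sem_ranked, key_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]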
Advanced Retrieval Ideas:
- Two-level indexing: First retrieve relevant documents based on summaries, then retrieve chunks from those documents
- Metadata filtering: Refining retrieved chunks based on metadata like categories or timestamps
- The final top chunks are concatenated as context in a prompt:
Context information is below.
---------------------
{chunk1}
{chunk2}
...
---------------------
Given the context information (and not prior knowledge), please answer the query.
Query: {user_query}
Answer:
- This prompt is sent to Mistral's chat API with a moderate temperature; the LLM returns a response grounded in the provided chunks (a sketch of this step follows).
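A hedged sketch of the generation step, calling Mistral's chat completions endpoint with requests; the actual llm_utils.py may use the mistralai client, and the temperature value is an assumption.
# llm_utils.py (illustrative sketch, not the actual implementation)
import requests
from app.config import MISTRAL_API_KEY, MISTRAL_CHAT_MODEL

def generate_answer(user_query: str, top_chunks: list) -> str:
    context = "\n".join(top_chunks)
    prompt = (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information (and not prior knowledge), "
        "please answer the query.\n"
        f"Query: {user_query}\nAnswer:"
    )
    resp = requests.post(
        "https://api.mistral.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {MISTRAL_API_KEY}"},
        json={
            "model": MISTRAL_CHAT_MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.4,  # "moderate" temperature; exact value is an assumption
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]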
- Documents are PDF text: We assume standard text-based PDFs. Scanned PDFs or images require OCR (not included).
- Memory-based index: Chunks and embeddings live in-memory (docs_index). No persistence, meaning data is lost if the server restarts.
- Single-user or low concurrency: Global list usage with no locking might cause issues under high concurrency.
- Mistral API availability: System relies on external calls to Mistral for embeddings and generation.
- We favored simplicity (Python list for embeddings, naive intent detection) to meet the MVP constraints.
- No external DB was used, fulfilling the requirement for purely local storage but limiting scalability.
- Responses only contain final answers (and queries), not the source context. Could be expanded to provide references.
- Query rewriting is an additional API call, raising latency. In production, measure if it improves accuracy enough to justify the extra request.
- Enhanced Retrieval: Replace simple word counting with BM25 or TF-IDF scoring for more accurate keyword search (see the sketch after this list)
- Source Attribution: Include references to source documents in responses for transparency and fact-checking
- Metadata Filtering: Allow filtering search results by document type, date, or other metadata
- Vector Database: Store embeddings in a lightweight vector database (FAISS) for persistence and faster retrieval
- Context Optimization: Implement prompt engineering techniques to better utilize retrieved chunks
- Caching System: Cache similar or identical queries to improve response time and reduce API costs
- Asynchronous Processing: Implement parallel search and embedding generation for better performance
- Response Validation: Implement basic checks to detect potential hallucinations or inconsistencies in LLM outputs.
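As an illustration of the BM25 idea, the rank_bm25 package (not currently a dependency) could replace the word-count scorer roughly like this:
from rank_bm25 import BM25Okapi

def bm25_keyword_search(query: str, chunk_texts: list, top_n: int = 5) -> list:
    # Tokenize naively by whitespace; a real implementation would normalize/stem.
    corpus = [text.lower().split() for text in chunk_texts]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(chunk_texts)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]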
Daniel Fiuza Dosil
Email: dafiuzadosil@gmail.com
Date: 27 March 2025