RAG-MVP: Retrieval-Augmented Generation with FastAPI and Mistral

StackAI-Task

Develop a simple Python backend for a Retrieval-Augmented Generation (RAG) pipeline, using a Large Language Model (LLM) for a knowledge base consisting of PDF files.


Overview

This repository contains a minimal Retrieval-Augmented Generation (RAG) pipeline built with FastAPI and the Mistral API. It ingests PDF documents, chunks and embeds their text, performs hybrid semantic/keyword search for queries, and uses a Large Language Model to generate answers grounded in the retrieved context.

Core Steps

User Query -> [FastAPI Endpoint] -> needs_retrieval? 
    -> if False: return a default response (no retrieval)
    -> if True:
         rewritten_query = rewrite_query(user_query)
         query_vec = embed(rewritten_query)
         sem_matches = semantic_search(query_vec)
         key_matches = keyword_search(user_query)
         top_chunks = RRF_fuse(sem_matches, key_matches)
         final_prompt = build_prompt(user_query, top_chunks)
         answer = LLM_generate(final_prompt)
    -> Return answer (with original and rewritten query)

Table of Contents

  1. Features
  2. Requirements
  3. Installation and Setup
  4. Project Structure
  5. How to Run
  6. API Endpoints
  7. System Design Details
    • Data Ingestion
    • Chunking & Embedding
    • Query Intent & Rewriting
    • Hybrid Search & Reranking
    • Answer Generation
  8. Technical Notes and Recommendations
  9. Possible Improvements
  10. Author

Features

  1. Upload PDFs via a REST endpoint; the system automatically extracts, chunks, and embeds their text.
  2. Local in-memory embedding store (no external vector DB).
  3. Intent classification with two options: a rule-based or an LLM-based approach.
  4. LLM-based query rewriting to improve retrieval.
  5. Hybrid retrieval (semantic & keyword) with Reciprocal Rank Fusion.
  6. LLM-based answer generation with Mistral's chat API, grounded in retrieved chunks.
  7. User-friendly Streamlit web interface for document management and querying.

Requirements

  • Python 3.9+ (recommended)
  • FastAPI
  • Uvicorn
  • PyMuPDF (fitz) for PDF parsing
  • NumPy for vector operations
  • Requests for HTTP calls
  • python-dotenv to load environment variables
  • Mistral AI client library
  • Streamlit for the web interface
  • A Mistral AI account with an API key

Installation and Setup

  1. Clone or download the repository to your local environment.
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure environment:
    • Create a .env file in the root directory (or set these as environment variables):
MISTRAL_API_KEY=<your_mistral_api_key>
MISTRAL_EMBED_MODEL=mistral-embed
MISTRAL_CHAT_MODEL=mistral-large-latest
    • Ensure python-dotenv is installed so the code picks up these variables (a minimal loading sketch is shown after this list).
  4. Optional: If you have scanned PDFs or more advanced needs, you may require OCR or additional libraries. This MVP does not handle OCR.
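
For reference, a minimal sketch of how these variables could be loaded with python-dotenv (illustrative only; see app/config.py for the actual implementation):

# sketch: loading Mistral settings from .env with python-dotenv (illustrative)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory, if present

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
MISTRAL_EMBED_MODEL = os.getenv("MISTRAL_EMBED_MODEL", "mistral-embed")
MISTRAL_CHAT_MODEL = os.getenv("MISTRAL_CHAT_MODEL", "mistral-large-latest")

if not MISTRAL_API_KEY:
    raise RuntimeError("MISTRAL_API_KEY is not set; add it to .env or the environment")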

Project Structure

rag-mvp/
│
├── app/
│   ├── main.py              # FastAPI entrypoint, route registration
│   ├── ingestion.py         # Endpoints for ingesting/deleting PDF documents
│   ├── query.py             # Endpoint for querying, orchestrates retrieval + generation
│   ├── pdf_utils.py         # PDF extraction logic (using PyMuPDF)
│   ├── chunking.py          # Text chunking
│   ├── search.py            # Semantic, keyword, and hybrid search with RRF
│   ├── embedding.py         # Mistral API calls (embeddings)
│   ├── llm_utils.py         # LLM-based functions for query rewriting and answer generation
│   ├── intent.py            # Heuristic to decide if retrieval is needed
│   ├── config.py            # Global index, environment variables
│   └── streamlit_app.py     # User-friendly web interface
│
├── requirements.txt         # Python dependencies
├── README.md                # This file!
├── LICENSE                  # MIT License
└── .env (ignored)           # Mistral secrets

How to Run

Backend API

  1. Start the FastAPI server (e.g. on port 8000):
uvicorn app.main:app --reload --port 8000
  2. Test the endpoints via the Swagger UI at http://localhost:8000/docs.

Example using curl:

# Upload a PDF
curl -X POST "http://localhost:8000/documents" \
     -F "files=@path/to/your_document.pdf"

# Delete a PDF
curl -X DELETE "http://localhost:8000/documents/<doc_id>"

# Query for information
curl -X POST "http://localhost:8000/query" \
     -H "Content-Type: application/json" \
     -d '{"query":"What is the summary of the PDF?"}'

Streamlit Frontend

  1. Make sure the backend API is running (see above).
  2. Start the Streamlit app:
streamlit run app/streamlit_app.py
  3. Open your browser at http://localhost:8501 to use the interactive interface.

API Endpoints

  1. POST /documents
    • Ingest one or more PDFs:
    • Extract text from each PDF
    • Split text into chunks
    • Create embeddings for each chunk
    • Store chunks in a global list docs_index
  2. DELETE /documents/{doc_id}
    • Removes the document's chunks from docs_index
    • Attempts to remove the uploaded file from disk
  3. POST /query
    • Accepts JSON: {"query":"..."}
    • Runs intent classification, optional rewriting
    • Hybrid retrieval (semantic & keyword) plus RRF fusion
    • Feeds top chunks as context to Mistral's Chat API
    • Returns JSON with original_query, rewritten_query, and the final answer.
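
Equivalent to the curl examples above, the endpoints can be exercised from Python with requests (a usage sketch; paths and the doc_id placeholder are illustrative):

# sketch: calling the API from Python with requests (illustrative)
import requests

BASE = "http://localhost:8000"

# Upload a PDF
with open("path/to/your_document.pdf", "rb") as f:
    upload = requests.post(f"{BASE}/documents", files={"files": f})
print(upload.json())

# Query for information
resp = requests.post(f"{BASE}/query", json={"query": "What is the summary of the PDF?"})
print(resp.json())  # contains original_query, rewritten_query, and the final answer

# Delete a PDF (use the doc_id returned by the upload call)
# requests.delete(f"{BASE}/documents/<doc_id>")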

System Design Details

Data Ingestion

  • Upload: POST /documents takes PDF files.
  • Extraction: Uses pdf_utils.extract_text_from_pdf() with PyMuPDF (a minimal sketch follows this list).
  • Chunking: Splits text into overlapping segments (~200 words with 50-word overlap) for better context coverage.
  • Embedding: Calls Mistral's embedding API (embedding.get_embedding).
  • Storing: Each chunk is stored with doc_id, text, and embedding in the in-memory list docs_index.
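
A minimal sketch of the PyMuPDF-based extraction step (illustrative; the repository's version lives in app/pdf_utils.py):

# sketch: PDF text extraction with PyMuPDF (illustrative)
import fitz  # PyMuPDF

def extract_text_from_pdf(path: str) -> str:
    """Concatenate the plain text of every page in the PDF."""
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            pages.append(page.get_text())
    return "\n".join(pages)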

Chunking & Embedding

# chunking.py
def chunk_text(text, max_tokens=200, overlap=50):
    # "Tokens" here are whitespace-separated words, not model tokens.
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # slide forward by 150 words, keeping a 50-word overlap
    for i in range(0, len(words), step):
        chunk_slice = words[i:i + max_tokens]
        if not chunk_slice:
            continue
        chunk = " ".join(chunk_slice)
        chunks.append(chunk)
    return chunks

Then each chunk's text is embedded:

# embedding.py
def get_embedding(text: str) -> np.ndarray:
    # Calls Mistral API, returns embedding vector
    ...

Each embedding is stored in docs_index along with its doc_id and chunk text.
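
For illustration, the embedding call could look roughly like this (a sketch against Mistral's REST embeddings endpoint; the repository's app/embedding.py may use the official Mistral client instead, and request/response field names should be checked against the current API docs):

# sketch: embedding a chunk via Mistral's REST API (illustrative; verify field names against the docs)
import os
import numpy as np
import requests

def get_embedding(text: str) -> np.ndarray:
    resp = requests.post(
        "https://api.mistral.ai/v1/embeddings",
        headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
        json={"model": os.getenv("MISTRAL_EMBED_MODEL", "mistral-embed"), "input": [text]},
        timeout=30,
    )
    resp.raise_for_status()
    return np.array(resp.json()["data"][0]["embedding"], dtype=np.float32)

# Each chunk is then stored in the in-memory index, e.g.:
# docs_index.append({"doc_id": doc_id, "text": chunk, "embedding": get_embedding(chunk)})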

Advanced Chunking Ideas:

  • Semantic Chunking: Dividing documents based on semantic coherence rather than fixed sizes
  • Contextual Chunk Headers: Adding higher-level context (title, section info) to chunks before embedding
  • Parent-child relationships: Creating hierarchical chunks where finer chunks are linked to broader parent chunks

Query Intent & Rewriting

  • Intent: System supports two intent classification methods:
    • Naive rule-based approach: Uses keywords and question marks to decide whether retrieval is needed (a minimal sketch follows this list)
    • LLM-based approach: Uses Mistral API to intelligently decide if retrieval is required
  • Rewrite: If retrieval is needed, the system calls Mistral's chat API again to simplify the query, which can improve matching against chunk text.
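
A minimal sketch of the rule-based intent check (illustrative; the keyword list is an example, and the actual heuristic lives in app/intent.py):

# sketch: naive rule-based intent check (keyword list is illustrative)
GREETINGS = {"hi", "hello", "hey", "thanks", "thank you", "bye"}

def needs_retrieval(query: str) -> bool:
    q = query.strip().lower()
    if q in GREETINGS:          # small talk -> no retrieval
        return False
    if "?" in q:                # explicit question -> retrieve
        return True
    # fall back to retrieving for anything that looks like an information request
    return any(w in q for w in ("what", "how", "why", "summar", "explain", "list"))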

Advanced Query Processing Ideas:

  • Step-back Prompting: Taking a step back to understand the broader context before addressing specific questions
  • Sub-query Decomposition: Breaking complex queries into smaller, manageable sub-queries
  • HyDE (Hypothetical Document Embeddings): Generating a hypothetical answer to help bridge query-document semantic gaps

Hybrid Search & Reranking

  • Semantic Search: Compare the query embedding with each chunk's embedding via L2 distance or cosine similarity.
  • Keyword Search: Count matching words in each chunk's text.
  • Reciprocal Rank Fusion (RRF): The top results from both methods are merged based on rank, ensuring chunks that rank highly in both semantic and keyword lists are prioritized.
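
A minimal sketch of the fusion step (illustrative; k=60 is a value commonly used for RRF, not necessarily what app/search.py uses):

# sketch: Reciprocal Rank Fusion over two ranked lists of chunk ids (illustrative)
def rrf_fuse(semantic_ids, keyword_ids, k=60, top_n=5):
    """Merge two rankings; chunks ranked highly in both lists get the largest scores."""
    scores = {}
    for ranked in (semantic_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# e.g. rrf_fuse(["c3", "c1", "c7"], ["c1", "c9", "c3"]) -> ["c1", "c3", ...]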

Advanced Retrieval Ideas:

  • Two-level indexing: First retrieve relevant documents based on summaries, then retrieve chunks from those documents
  • Metadata filtering: Refining retrieved chunks based on metadata like categories or timestamps

Answer Generation

  • The final top chunks are concatenated as context in a prompt:
Context information is below.
---------------------
{chunk1}
{chunk2}
...
---------------------
Given the context information (and not prior knowledge), please answer the query.
Query: {user_query}
Answer:
  • This prompt is sent to Mistral's chat API with a moderate temperature. The LLM returns a response grounded in the provided chunks.
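
For illustration, prompt assembly and the chat call could look roughly like this (a sketch against Mistral's REST chat-completions endpoint; app/llm_utils.py may use the official client instead, and the temperature is an example value):

# sketch: building the prompt and generating the answer (illustrative; check Mistral's API docs)
import os
import requests

def build_prompt(user_query: str, top_chunks: list[str]) -> str:
    context = "\n".join(top_chunks)
    return (
        "Context information is below.\n---------------------\n"
        f"{context}\n---------------------\n"
        "Given the context information (and not prior knowledge), please answer the query.\n"
        f"Query: {user_query}\nAnswer:"
    )

def llm_generate(prompt: str) -> str:
    resp = requests.post(
        "https://api.mistral.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
        json={
            "model": os.getenv("MISTRAL_CHAT_MODEL", "mistral-large-latest"),
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.4,  # "moderate" temperature; example value
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]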

Technical Notes and Recommendations

Assumptions and Constraints

  • Documents are PDF text: We assume standard text-based PDFs. Scanned PDFs or images require OCR (not included).
  • Memory-based index: Chunks and embeddings live in-memory (docs_index). No persistence, meaning data is lost if the server restarts.
  • Single-user or low concurrency: Global list usage with no locking might cause issues under high concurrency.
  • Mistral API availability: System relies on external calls to Mistral for embeddings and generation.

Trade-offs and Design Decisions

  • We favored simplicity (Python list for embeddings, naive intent detection) to meet the MVP constraints.
  • No external DB was used, fulfilling the requirement for purely local storage but limiting scalability.
  • Responses contain only the final answer (and the original/rewritten queries), not the source context; this could be expanded to provide references.
  • Query rewriting is an additional API call, which adds latency. In production, measure whether it improves accuracy enough to justify the extra request.

Possible Improvements

  • Enhanced Retrieval: Replace simple word counting with BM25 or TF-IDF scoring for more accurate keyword search
  • Source Attribution: Include references to source documents in responses for transparency and fact-checking
  • Metadata Filtering: Allow filtering search results by document type, date, or other metadata
  • Vector Database: Store embeddings in a lightweight vector database (FAISS) for persistence and faster retrieval
  • Context Optimization: Implement prompt engineering techniques to better utilize retrieved chunks
  • Caching System: Cache similar or identical queries to improve response time and reduce API costs
  • Asynchronous Processing: Implement parallel search and embedding generation for better performance
  • Response Validation: Implement basic checks to detect potential hallucinations or inconsistencies in LLM outputs.

Author

Daniel Fiuza Dosil
Email: dafiuzadosil@gmail.com
Date: 27 March 2025
