Doc Intelligence

RAG system for document Q&A. Built to learn what actually works in retrieval systems vs. what's hype.

Results

Evaluated on 38 queries across 5 documents:

Category	Accuracy
Easy	87%
Vague	64%
Hard	67%
Overall	74%

Metric	Value
Retrieval	285ms
Generation	1,573ms
Total	1,858ms

What I Tried

#	Experiment	Result	Decision
001	Baseline	80% accuracy, Q3 query failed	Found chunking bug
002	Recursive chunking	Fixed the failure	✅ Keep
003	Hybrid search	+6% accuracy	✅ Keep
004	Reranking	+5s latency, 0% gain	❌ Rejected
005	HyDE	+20% on vague queries	✅ Conditional
006	Evaluation framework	38-query test set	✅ Essential
007	Hallucination detection	Catches fabrications	✅ Keep

Key Learnings

Chunking matters more than retrieval. Fixed-size chunking split "Q3 2024" from "$1.15 billion". A chunk literally started with "ervices" (mid-word cut). Recursive chunking fixed it.

Measure latency, not just accuracy. Reranking moved correct chunk to #1. Then I measured: 5 seconds per query. Killed it.

Easy tests lie. For 5 days I got 100% accuracy. Then I built proper evaluation with vague queries. Real accuracy: 77%.

Architecture

User Query
    │
    ▼
┌─────────────────────────────────────┐
│   HyDE (if query ≤ 3 words)         │
└─────────────┬───────────────────────┘
              │
    ┌─────────┴─────────┐
    ▼                   ▼
┌────────┐        ┌──────────┐
│Semantic│        │   BM25   │
│ Search │        │  Search  │
└────┬───┘        └────┬─────┘
     └────────┬────────┘
              ▼
       Rank Fusion (RRF)
              │
              ▼
         GPT-4o-mini
              │
              ▼
    Groundedness Check (optional)
              │
              ▼
          Response

Quick Start

git clone https://github.com/SAMithila/doc-intelligence.git
cd doc-intelligence
pip install -e .

export OPENAI_API_KEY=your_key

# Run evaluation
python tests/test_evaluation.py

# Start UI
streamlit run app/streamlit_app.py

Project Structure

src/docint/
├── ingest/        # Chunking (fixed, recursive)
├── retrieval/     # Semantic, BM25, hybrid, HyDE
├── generation/    # LLM answer generation
├── verification/  # Hallucination detection
├── evaluation/    # Metrics and testing
└── api/           # FastAPI endpoints

experiments/       # Documented experiments
docs/              # Architecture decisions

Design Decisions

Decision	Choice	Why
Chunking	Recursive	Fixed-size broke semantic units
Search	Hybrid (BM25 + semantic)	+6% accuracy
Reranking	Rejected	5s latency, no gain
HyDE	Conditional only	Expensive, use for short queries
Vector store	ChromaDB	Simple, good for <100K docs

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
.github/workflows		.github/workflows
analysis		analysis
app		app
benchmarks		benchmarks
configs		configs
docker		docker
docs		docs
eval_data		eval_data
experiments		experiments
src/docint		src/docint
tests		tests
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
run_api.py		run_api.py
run_streamlit.py		run_streamlit.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Doc Intelligence

Results

What I Tried

Key Learnings

Architecture

Quick Start

Project Structure

Design Decisions

Screenshots

Streamlit UI

FastAPI

What I'd Do Differently

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Doc Intelligence

Results

What I Tried

Key Learnings

Architecture

Quick Start

Project Structure

Design Decisions

Screenshots

Streamlit UI

FastAPI

What I'd Do Differently

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages