CodeSearchEngine/
├── src/cse/ # Main source code package
│ ├── data_manager/ # PostgreSQL vector store operations (add, get, remove, search)
│ ├── embeddings_tunner/ # Fine-tuning logic for embedding models (PyTorch Lightning)
│ ├── evaluation/ # Evaluation metrics (Recall@k, MRR@k, NDCG@k) and evaluation runner
│ ├── logger/ # Logging configuration and utilities
│ └── settings/ # Application settings and configuration management
├── scripts/ # Executable scripts for downloading, training, and evaluation
├── postgres/ # PostgreSQL database schema and migrations
├── models/ # Saved embedding models (downloaded and fine-tuned)
├── experiments/ # Training experiment logs and checkpoints (TensorBoard)
├── results/ # Evaluation results and metrics saved as JSON
├── pyproject.toml # Project setup file
├── .env # PostgreSQL environment variables
├── docker-compose.yml # Docker configuration for PostgreSQL database
├── README.md # Installation instructions and script usage
└── report.ipynb # Report on my work
-
Clone the repository
git clone <repo_url>
cd <repo_folder>
-
Create and activate a virtual environment
On Linux/macOS:
python -m venv .venv
source .venv/bin/activate
Using Conda:
conda create --name code-search python=3.10
conda activate code-search
-
Upgrade pip
python -m pip install --upgrade pip
-
Install dependencies
python -m pip install -e .
-
Create the .env file
You can copy the contents of .env.example:
cp .env.example .env
-
Set up the Postgres Vectorstore
(requires Docker Compose)
docker compose up -d db
-
Download embeddings (all-MiniLM-L6-v2)
python scripts/download_all-MiniLM-L6-v2.py
-
Download embeddings (granite-embedding-small-english-r2)
python scripts/download_granite-embedding-small-english-r2.py
-
Populate the vectorstore
python scripts/populate_vectorstore.py <model_name> <docset_name>
Example:
python scripts/populate_vectorstore.py all-MiniLM-L6-v2 cosqa_test
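To make the role of the docset name concrete, here is a toy in-memory sketch of what populating and searching a vector store looks like conceptually. It is not the project's PostgreSQL implementation: the class `InMemoryVectorStore`, its method names, and the use of cosine similarity are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector store; each docset is a separate corpus."""

    def __init__(self) -> None:
        self._docsets: Dict[str, Dict[str, Vector]] = {}

    def add(self, docset: str, doc_id: str, embedding: Vector) -> None:
        # Embeddings are grouped by docset so corpora indexed with
        # different models never mix.
        self._docsets.setdefault(docset, {})[doc_id] = embedding

    def search(self, docset: str, query: Vector, k: int) -> List[Tuple[str, float]]:
        # Score every document in the docset and keep the k most similar.
        scored = [
            (doc_id, cosine_similarity(query, embedding))
            for doc_id, embedding in self._docsets.get(docset, {}).items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

Searching a docset that was never populated returns an empty list in this sketch, which is why each model needs its own populated docset before evaluation.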
-
Evaluate a model
python scripts/eval.py <model_name> <docset_name>
Example:
python scripts/eval.py all-MiniLM-L6-v2 cosqa_test
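For reference, the metrics listed for the evaluation package (Recall@k, MRR@k, NDCG@k) are conventionally computed per query as in the sketch below. This is a minimal illustration assuming binary relevance and a ranked list without duplicate IDs; the function names are hypothetical and not taken from `src/cse/evaluation`.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mrr_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """DCG of the top-k with binary gains, divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # The ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

A dataset-level score is then typically the mean of the per-query values.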
-
Tune embeddings
python scripts/tune.py <experiment_name> <base_model>
Example:
python scripts/tune.py all-mini-tuned all-MiniLM-L6-v2
-
Evaluate tuned model performance
After tuning, re-run the population and evaluation scripts (the "Populate the vectorstore" and "Evaluate a model" steps above), passing your new model name and a new docset name so that the updated embeddings stay distinct from previously indexed corpora.
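The re-evaluation can be scripted as two sequential calls, as in the sketch below. It assumes the tuned model is addressable by its experiment name (for example `all-mini-tuned`) and that the scripts are run from the repository root; the helper names `build_reeval_commands` and `run_reeval` and the docset name `cosqa_test_tuned` are hypothetical.

```python
import subprocess
import sys
from typing import List


def build_reeval_commands(model_name: str, docset_name: str) -> List[List[str]]:
    """Return the populate and eval commands in the order they should run."""
    return [
        [sys.executable, "scripts/populate_vectorstore.py", model_name, docset_name],
        [sys.executable, "scripts/eval.py", model_name, docset_name],
    ]


def run_reeval(model_name: str, docset_name: str) -> None:
    """Populate a fresh docset with the tuned model, then evaluate on it."""
    for command in build_reeval_commands(model_name, docset_name):
        # check=True stops the sequence if a script exits with an error,
        # so evaluation never runs against a half-populated docset.
        subprocess.run(command, check=True)
```

Calling `run_reeval("all-mini-tuned", "cosqa_test_tuned")` would then index the corpus under the new docset name and evaluate against it, leaving the embeddings stored for `cosqa_test` untouched.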