Micz26/CodeSearchEngine

Code Search Engine

Project Structure

CodeSearchEngine/
├── src/cse/                    # Main source code package
│   ├── data_manager/           # PostgreSQL vector store operations (add, get, remove, search)
│   ├── embeddings_tunner/      # Fine-tuning logic for embedding models (PyTorch Lightning)
│   ├── evaluation/             # Evaluation metrics (Recall@k, MRR@k, NDCG@k) and evaluation runner
│   ├── logger/                 # Logging configuration and utilities
│   └── settings/               # Application settings and configuration management
├── scripts/                     # Executable scripts for downloading, training, and evaluation
├── postgres/                    # PostgreSQL database schema and migrations
├── models/                      # Saved embedding models (downloaded and fine-tuned)
├── experiments/                 # Training experiment logs and checkpoints (TensorBoard)
├── results/                     # Evaluation results and metrics saved as JSON
├── pyproject.toml               # Project setup file
├── .env                         # PostgreSQL environment variables
├── docker-compose.yml           # Docker configuration for PostgreSQL database
├── README.md                    # Installation and script usage instructions
└── report.ipynb                 # Report on my work
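
The evaluation/ package above is described as computing Recall@k, MRR@k, and NDCG@k. As a rough illustration of what these metrics measure, here is a minimal sketch assuming binary relevance and a ranked list without duplicates. The function names and signatures are hypothetical and are not taken from the repository's code.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """DCG of the top k divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    # An ideal ranking places all relevant documents first, up to k slots.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]   # hypothetical search results for one query
    relevant = {"b", "d"}           # hypothetical ground-truth matches
    print(recall_at_k(ranked, relevant, k=2))
    print(mrr_at_k(ranked, relevant, k=2))
    print(ndcg_at_k(ranked, relevant, k=4))
```

Per-query values like these are typically averaged over all queries in a docset to produce the final score.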

Installation

  1. Clone the repository

    git clone <repo_url>
    cd <repo_folder>
  2. Create and activate a virtual environment

    Using venv (Linux/macOS):

    python -m venv .venv
    source .venv/bin/activate

    Or using Conda:

    conda create --name code-search python=3.10
    conda activate code-search
  3. Upgrade pip

    python -m pip install --upgrade pip
  4. Install dependencies

    python -m pip install -e .
  5. Create the .env file
    You can copy the contents of .env.example:

    cp .env.example .env
  6. Set up the PostgreSQL vector store
    (requires Docker Compose)

    docker compose up -d db

Scripts

  1. Download the embedding model (all-MiniLM-L6-v2)

    python scripts/download_all-MiniLM-L6-v2.py
  2. Download the embedding model (granite-embedding-small-english-r2)

    python scripts/download_granite-embedding-small-english-r2.py
  3. Populate the vector store

    python scripts/populate_vectorstore.py <model_name> <docset_name>

    Example:

    python scripts/populate_vectorstore.py all-MiniLM-L6-v2 cosqa_test
  4. Evaluate a model

    python scripts/eval.py <model_name> <docset_name>

    Example:

    python scripts/eval.py all-MiniLM-L6-v2 cosqa_test
  5. Tune embeddings

    python scripts/tune.py <experiment_name> <base_model>

    Example:

    python scripts/tune.py all-mini-tuned all-MiniLM-L6-v2
  6. Evaluate tuned model performance
    After tuning, re-run the population and evaluation scripts (steps 3 and 4), passing the new model name and a new docset name so that the updated embeddings are kept separate from previously indexed corpora.
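
The re-run described in step 6 could be scripted along the following lines. This is only a sketch: the helper names are hypothetical, the script paths are the ones shown in steps 3 and 4, and it assumes the commands are launched from the repository root. The docset name `cosqa_test_tuned` is made up for illustration, and using the experiment name `all-mini-tuned` as the model name is an assumption about how tuned models are saved.

```python
import subprocess
import sys
from typing import List


def build_commands(model_name: str, docset_name: str) -> List[List[str]]:
    """Return the populate and eval command lines in the order they should run."""
    return [
        [sys.executable, "scripts/populate_vectorstore.py", model_name, docset_name],
        [sys.executable, "scripts/eval.py", model_name, docset_name],
    ]


def run_pipeline(model_name: str, docset_name: str, dry_run: bool = True) -> List[List[str]]:
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    commands = build_commands(model_name, docset_name)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a script exits non-zero,
            # so evaluation never runs against a half-populated store.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical names; replace with your tuned model and a fresh docset name.
    run_pipeline("all-mini-tuned", "cosqa_test_tuned", dry_run=True)
```

Keeping `dry_run=True` as the default lets you check the exact commands before anything touches the database.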
