LLMQualitativeCoder is a tool for automated qualitative coding using Large Language Models (LLMs). It provides a modern web interface for easy interaction with the qualitative coding pipeline, as well as a programmable API for advanced users.
Key features include:
- Web Interface: A modern React-based UI for uploading files, configuring the pipeline, running analyses, and managing results.
- Flexible Data Handling: Supports diverse JSON input data formats with customizable fields and basic filtering.
- Automated Parsing: Breaks large texts into smaller meaning units for coding, based on the specifications in the LLM user prompt.
- Deductive and Inductive Coding: Offers both predefined (deductive) and emergent (inductive) coding approaches.
- Customizable Configuration: Configure batch size, context size, and LLM provider/model directly through the UI.
The easiest way to use LLMQualitativeCoder is through its web interface, which requires minimal setup.
1. Clone the Repository:

   ```bash
   git clone https://github.com/iggygraceful/LLMQualitativeCoder.git
   cd LLMQualitativeCoder
   ```

2. Install Dependencies: Follow the installation steps in section 3 below to set up the required dependencies.

3. Start the Backend Server:

   ```bash
   # Activate your virtual environment if not already active
   source /path/to/your/virtualenv/bin/activate

   # Start the FastAPI server
   cd src
   uvicorn transcriptanalysis.new_api:app --reload --host 0.0.0.0 --port 8000
   ```

4. Start the Frontend Server:

   ```bash
   # Navigate to the frontend directory
   cd frontend

   # Start the development server
   npm run dev
   ```

5. Access the UI: Open your browser and navigate to http://localhost:5173 (or the port displayed in your terminal after starting the frontend server).
The web interface provides an intuitive workflow for qualitative coding:
- View Available Files: The home screen displays both default example files and your uploaded files.
- Upload New Files: Click the "Upload JSON File" button to add your own JSON files for analysis.
- Preview Files: Click on a file to see its content before processing.
- Delete Files: Remove user-uploaded files when no longer needed.
- Select a File: Click the "Configure" button next to a file to set up analysis parameters.
- Map Fields: Specify which fields in your JSON contain the text content, contextual information, etc.
- Choose Coding Mode: Select between inductive (generate new codes) or deductive (use predefined codes) coding.
- Configure Segmentation: Choose between LLM-based or sentence-based segmentation of your text.
- Select Model: Choose which OpenAI model to use for analysis.
- Advanced Options: Configure batch sizes, context window, and thread count.
- View Codebases: See available code schemes under the "Codebases" tab.
- Create New Codebases: Build your own coding scheme or modify existing ones.
- Add Codes: Add new codes to your codebase with descriptions and metadata.
- Edit Prompts: Customize the instructions sent to the LLM for inductive, deductive, and parsing tasks.
- Preview Prompts: See how your configuration will appear in the prompt sent to the LLM.
- Reset to Default: Easily reset prompts to their default templates if needed.
- Start Jobs: Begin the analysis process with your selected configuration.
- Monitor Progress: See the status of running and completed jobs.
- Download Results: Get output and validation files from completed jobs.
- View Reports: Analyze validation reports to check for quality issues.
- Set OpenAI API Key: Enter your OpenAI API key in the settings panel to authenticate API requests.
- Secure Storage: Your key is stored securely in memory for the duration of your session.
LLMQualitativeCoder uses Poetry for dependency management and packaging.
1. Python: The project requires Python 3.10+, but can be modified to work with Python 3.9 by changing the `python` version in `pyproject.toml`.

2. Poetry: Install Poetry using one of the following methods.

   Method 1: Using curl (recommended):

   ```bash
   curl -sSL https://install.python-poetry.org | python3 -
   ```

   Method 2: Using pip (alternative):

   ```bash
   python3 -m pip install --user poetry
   ```

   Add Poetry to your PATH: After installation, make sure Poetry's bin directory is in your PATH:

   ```bash
   # For curl installation
   export PATH=$PATH:$HOME/.local/bin

   # For pip installation (adjust Python version as needed)
   export PATH=$PATH:$HOME/Library/Python/3.9/bin
   ```

   Note: To make the `poetry` command available in all terminal sessions, add the relevant `export PATH=...` line to your shell's startup file (e.g., `~/.zshrc`, `~/.bash_profile`, or `~/.config/fish/config.fish`), then source it or open a new terminal.

   Verify the installation:

   ```bash
   poetry --version
   ```
1. Clone the Repository:

   ```bash
   git clone https://github.com/iggygraceful/LLMQualitativeCoder.git
   cd LLMQualitativeCoder
   ```

2. Optional: Adjust Python Version (if using Python 3.9): Edit the `pyproject.toml` file to change:

   ```toml
   python = ">=3.10,<4.0"
   ```

   to:

   ```toml
   python = ">=3.9,<4.0"
   ```

3. Install Dependencies:

   ```bash
   poetry lock     # Generate/update the lock file if needed
   poetry install
   ```

4. Activate the Virtual Environment:

   ```bash
   # For Poetry 2.0+ (newer versions)
   poetry env activate

   # For older Poetry versions
   poetry shell
   ```

   The `poetry env activate` command will output the source command you need to run, for example:

   ```bash
   source /path/to/virtualenvs/myproject-py3.9/bin/activate
   ```

   Run the `source` command that `poetry env activate` outputs. After running it, you can verify the environment is active by:

   - Checking your command prompt: It should now be prefixed with the environment name (e.g., `(transcriptanalysis-py3.9)`).
   - Checking the Python interpreter path:

     ```bash
     which python  # Should point to the python executable within your virtual environment
     ```

   - Listing installed packages:

     ```bash
     pip list  # You should see the project's dependencies installed here
     ```

5. Install Additional Required Packages:

   ```bash
   pip install python-multipart
   ```

   This package is required for file uploads in the web interface.

6. Set Environment Variables: Configure API keys before running the pipeline.

   On Linux/macOS:

   ```bash
   export OPENAI_API_KEY='your-openai-api-key'
   # export HUGGINGFACE_API_KEY='your-huggingface-api-key'  # If HuggingFace is supported later
   ```

   On Windows CMD:

   ```cmd
   set OPENAI_API_KEY=your-openai-api-key
   REM set HUGGINGFACE_API_KEY=your-huggingface-api-key  (if HuggingFace is supported later)
   ```

   Note: Currently, only OpenAI models are fully supported.

   Alternative: You can also set your API key through the web interface, eliminating the need to set environment variables.
- Poetry Not Found: If you get a "command not found" error, ensure Poetry's bin directory is added to your PATH.
- Python Version Mismatch: If you see "Python version X is not supported by the project", modify the `pyproject.toml` file as described above.
- SSL/OpenSSL Warnings: You may see warnings about OpenSSL versions with urllib3. These are usually harmless and can be ignored.
- Missing Dependencies: If you encounter errors about missing modules, try running `poetry update` followed by `poetry install`.
- File Upload Issues: If file uploads are not working, ensure you have installed the `python-multipart` package.
- No Files Showing in UI: Make sure the backend server is running and check the terminal for any error messages.
For advanced users who prefer to run the pipeline programmatically without the web interface:
Set the necessary API keys:

- On Linux/macOS:

  ```bash
  export OPENAI_API_KEY='your-openai-api-key'
  ```

- On Windows CMD:

  ```cmd
  set OPENAI_API_KEY=your-openai-api-key
  ```
Important: Ensure you have set your OPENAI_API_KEY environment variable with a valid key before proceeding to execute the pipeline. The application will not function correctly without it.
Make sure you're in the Poetry virtual environment (see Installation Step 4 and verify its activation), then run:

```bash
# Navigate to the src directory
cd src

# Run the main module
python -m transcriptanalysis.main
```

Alternatively, you can use Poetry's run script (from the project root):

```bash
poetry run transcriptanalysis.main:run
```

If you encounter any errors, check the logs for details.
A successful run will show API communication logs (HTTP 200) and messages indicating that the output and validation report JSON files have been saved to the outputs/ directory.
Defines the pipeline behavior, including coding modes, model selections, paths, and logging settings.
Note on Configuration File Paths: The application expects config.json and data_format_config.json (described below) to be in the src/transcriptanalysis/configs/ directory by default. Ensure your input data files (e.g., teacher_transcript.json) and prompt files (e.g., parse_prompt.txt) are correctly pathed within your config.json relative to the project structure and the paths specified in config.json itself (e.g., paths.json_folder, paths.prompts_folder).
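Before launching a long run, it can help to sanity-check the loaded configuration. The sketch below is illustrative only — `check_config` is not part of the project's API, and the key names simply follow the example `config.json` shown below:

```python
import json
from pathlib import Path

def check_config(cfg: dict) -> list[str]:
    """Return a list of problems found in a config dict (empty list = OK)."""
    problems = []
    # coding_mode must be one of the two supported modes
    if cfg.get("coding_mode") not in ("deductive", "inductive"):
        problems.append(f"unknown coding_mode: {cfg.get('coding_mode')}")
    # a few keys the pipeline needs to resolve inputs and outputs
    for key in ("paths", "selected_json_file", "output_folder"):
        if key not in cfg:
            problems.append(f"missing key: {key}")
    return problems

# Typical usage, with the default config location described above:
# cfg = json.loads(Path("src/transcriptanalysis/configs/config.json").read_text())
# assert not check_config(cfg), check_config(cfg)
```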
Example:

```json
{
  "coding_mode": "deductive",
  "use_parsing": true,
  "preliminary_segments_per_prompt": 5,
  "meaning_units_per_assignment_prompt": 10,
  "context_size": 5,
  "data_format": "transcript",
  "paths": {
    "prompts_folder": "transcriptanalysis/prompts",
    "codebase_folder": "transcriptanalysis/codebases",
    "json_folder": "transcriptanalysis/json_inputs",
    "config_folder": "transcriptanalysis/configs",
    "user_uploads_folder": "data/user_uploads"
  },
  "selected_codebase": "default_codebase.json",
  "selected_json_file": "teacher_transcript.json",
  "parse_prompt_file": "parse_prompt.txt",
  "inductive_coding_prompt_file": "inductive_prompt.txt",
  "deductive_coding_prompt_file": "deductive_prompt.txt",
  "output_folder": "outputs",
  "enable_logging": true,
  "logging_level": "INFO",
  "log_to_file": true,
  "log_file_path": "logs/application.log",
  "thread_count": 4,
  "parse_llm_config": {
    "provider": "openai",
    "model_name": "gpt-4",
    "temperature": 0.7,
    "max_tokens": 2000,
    "api_key": "YOUR_OPENAI_API_KEY"
  },
  "assign_llm_config": {
    "provider": "openai",
    "model_name": "gpt-4o-mini",
    "temperature": 0.6,
    "max_tokens": 1500,
    "api_key": "YOUR_OPENAI_API_KEY"
  }
}
```

Specifies your JSON input file, including fields for content, speaker, source IDs, and filtering rules.
Context fields are input fields that you want to include during the LLM coding task.
Example:

```json
{
  "transcript": {
    "content_field": "text",
    "context_fields": ["speaker", "timestamp"],
    "list_field": "dialogues",
    "filter_rules": []
  }
}
```

The codebase includes several modules:
Purpose: Coordinates the workflow from data loading to validation. Key Tasks:
- Reads configuration files.
- Initializes the environment and logging.
- Loads and filters data.
- Optionally parses data into smaller meaning units.
- Assigns codes (deductive or inductive).
- Saves coded meaning units to an output JSON file.
- Runs validation and generates a report.
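The orchestration steps above can be sketched as a simple sequence. All function names below are illustrative stand-ins, not the project's actual API; the real module wires these stages to LLM calls, file I/O, and validation:

```python
# Illustrative end-to-end flow with trivial stand-ins for each pipeline stage.
def load_and_filter(config):
    # stand-in: load the selected JSON file and apply filter rules
    return config["_demo_segments"]

def parse_into_units(segments):
    # stand-in: LLM- or sentence-based segmentation into meaning units
    return [{"source_id": i, "content": s} for i, s in enumerate(segments)]

def assign_codes(units, mode):
    # stand-in: deductive or inductive code assignment via the LLM
    return [dict(u, code=f"{mode}-code") for u in units]

def run_pipeline(config):
    segments = load_and_filter(config)
    units = parse_into_units(segments) if config["use_parsing"] else segments
    coded = assign_codes(units, config["coding_mode"])
    return coded  # the real pipeline also saves output and runs validation

coded = run_pipeline({"coding_mode": "deductive", "use_parsing": True,
                      "_demo_segments": ["Hello class.", "Any questions?"]})
```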
Purpose: Manages data loading, transformation, and filtering based on configuration. Key Tasks:
- Loads JSON files.
- Applies filter rules.
- Transforms data into `MeaningUnit` objects.
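This config-driven transformation can be sketched as follows. A plain dict stands in for the real `MeaningUnit` class, and the format dict mirrors the `data_format_config.json` transcript example above:

```python
def to_meaning_units(data: dict, fmt: dict) -> list[dict]:
    """Extract content and context fields from each record, per the format config."""
    units = []
    for record in data[fmt["list_field"]]:
        units.append({
            "content": record[fmt["content_field"]],
            "context": {f: record.get(f) for f in fmt["context_fields"]},
        })
    return units

# Format config mirroring the transcript example in data_format_config.json
fmt = {"list_field": "dialogues", "content_field": "text",
       "context_fields": ["speaker", "timestamp"]}
data = {"dialogues": [{"text": "Hello class.", "speaker": "Teacher",
                       "timestamp": "00:01"}]}
units = to_meaning_units(data, fmt)
```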
Purpose: Includes core functionalities for parsing transcripts and assigning codes. Key Tasks:
- Interfaces with LLMs for parsing and coding.
Purpose: Provides helper functions for environment setup and resource initialization. Key Tasks:
- Loads JSON and text files.
- Manages environment variables.
- Generates structured LLM responses.
Purpose: Ensures output consistency and completeness through validation reports. Key Tasks:
- Compares original segments with meaning units.
- Identifies skipped or inconsistent segments.
- Generates JSON validation reports.
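The core of the skipped-segment check can be sketched like this. Field names such as `source_id` are illustrative; see the validator module for the actual report format:

```python
def find_skipped(original_segments: list[dict], meaning_units: list[dict]) -> list:
    """Return IDs of original segments that no meaning unit refers back to."""
    coded_ids = {u["source_id"] for u in meaning_units}
    return [s["id"] for s in original_segments if s["id"] not in coded_ids]

# Segment 2 was never covered by a meaning unit, so it is flagged as skipped.
report = {"skipped_segments": find_skipped(
    [{"id": 1}, {"id": 2}, {"id": 3}],
    [{"source_id": 1}, {"source_id": 3}],
)}
```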
Purpose: Centralizes logging setup. Key Tasks:
- Configures log levels (DEBUG, INFO, etc.).
- Manages console and file outputs.
Purpose: Provides a FastAPI server for asynchronous pipeline execution and web interface support. Key Endpoints:
- `POST /run-pipeline` - Start a new coding job
- `GET /jobs/{job_id}` - Get job status
- `GET /jobs/{job_id}/output` - Download output file
- `GET /jobs/{job_id}/validation` - Download validation report
- `POST /files/upload` - Upload a JSON file
- `GET /files/list` - List available files
- `DELETE /files/(unknown)` - Delete a user-uploaded file
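Requests to these endpoints can be built with Python's standard library. A minimal sketch: the base URL matches the Quick Start setup, and the run-pipeline payload fields are hypothetical — check `new_api.py` for the actual request schema:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # backend address from the Quick Start

def start_job_request(config: dict) -> request.Request:
    """Build a POST /run-pipeline request (payload fields are hypothetical)."""
    return request.Request(
        f"{BASE_URL}/run-pipeline",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def job_status_request(job_id: str) -> request.Request:
    """Build a GET /jobs/{job_id} request."""
    return request.Request(f"{BASE_URL}/jobs/{job_id}")

# Send with a running backend: request.urlopen(start_job_request({...}))
```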
Validation ensures the final meaning units accurately represent the original segments. Discrepancies are reported in validation_report.json.
Logs capture pipeline operations and are saved in the specified log_file_path.
- Preliminary Segments: JSON files containing raw data.
- Codebase Files: JSONL files for deductive coding.
- Prompts: Text files with LLM instructions.
- Coded JSON Files: Contain meaning units with assigned codes.
- Validation Reports: Detail discrepancies between input and output.
- Logs: Available in the console and specified files.