19 commits
c953180
Add Docker setup for GPU-based inference
claude May 25, 2026
920c89d
Add tiered container health check script
claude May 25, 2026
ccec106
Add Gradio web UI (serve.py) with model warm-caching
claude May 25, 2026
5160853
Add .gitignore and untrack .env
claude May 28, 2026
b3a89f3
Fix CUDA base image tag (cudnn9 → cudnn)
claude May 28, 2026
9744995
docs: add Docker one-click setup section to README
claude May 28, 2026
22ac206
chore: add .gitignore and .env.example
danielesalpietro May 28, 2026
3d20e57
docs: add web UI screenshot placeholder and improve CPU fallback inst…
claude May 28, 2026
3e562c7
fix: move Gradio theme to launch() for Gradio 6.0 compatibility
claude May 28, 2026
42c7cb5
chore: add serve-cpu.bat for GPU-less Windows launch
claude May 28, 2026
dbacac8
fix: update Chatbot message format for Gradio 6.0 compatibility
claude May 28, 2026
1032efa
fix: add missing seed and sample_seed to fake_args in serve.py
claude May 28, 2026
3632cfc
fix: remove redundant python/serve.py from compose command for serve
claude May 28, 2026
78b8a87
feat: add first-use download info message below style dropdown
claude May 28, 2026
36dab1a
Create webui.png
danielesalpietro May 28, 2026
1a7612d
Merge branch 'claude/sharp-carson-nBKHX' of https://github.com/daniel…
danielesalpietro May 28, 2026
cd0631f
docs: remove placeholder webui screenshot reference
claude May 28, 2026
533ad21
docs: add Gradio web UI screenshot to README
claude May 28, 2026
4b5eea3
Merge pull request #1 from danielesalpietro/claude/sharp-carson-nBKHX
danielesalpietro May 28, 2026
7 changes: 7 additions & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
.git
.env
__pycache__
*.pyc
*.pyo
*.pyd
.DS_Store
1 change: 0 additions & 1 deletion .env

This file was deleted.

2 changes: 2 additions & 0 deletions .env.example
@@ -0,0 +1,2 @@
TAVILY_API_KEY=tavily-api-key-here
HF_TOKEN=hf-token-here
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
.env
__pycache__/
*.pyc
*.pyo
*.pyd
.DS_Store
29 changes: 29 additions & 0 deletions Dockerfile
@@ -0,0 +1,29 @@
FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
# Hugging Face model cache: mount a volume here to persist downloads across runs
ENV HF_HOME=/hf_cache
ENV TOKENIZERS_PARALLELISM=false
ENV MAS_FORCE_DISABLE_TORCHVISION=1

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 \
        python3.10-dev \
        python3-pip \
        git \
    && rm -rf /var/lib/apt/lists/* \
    && ln -sf /usr/bin/python3.10 /usr/bin/python \
    && ln -sf /usr/bin/pip3 /usr/bin/pip

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip \
    && pip install --no-cache-dir -r requirements.txt

COPY . .

VOLUME ["/hf_cache"]

ENTRYPOINT ["python", "run.py"]
30 changes: 30 additions & 0 deletions Dockerfile.serve
@@ -0,0 +1,30 @@
FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV HF_HOME=/hf_cache
ENV TOKENIZERS_PARALLELISM=false
ENV MAS_FORCE_DISABLE_TORCHVISION=1

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 \
        python3.10-dev \
        python3-pip \
        git \
    && rm -rf /var/lib/apt/lists/* \
    && ln -sf /usr/bin/python3.10 /usr/bin/python \
    && ln -sf /usr/bin/pip3 /usr/bin/pip

WORKDIR /app

COPY requirements.txt requirements-serve.txt ./
RUN pip install --no-cache-dir --upgrade pip \
    && pip install --no-cache-dir -r requirements.txt -r requirements-serve.txt

COPY . .

VOLUME ["/hf_cache"]

EXPOSE 7860

ENTRYPOINT ["python", "serve.py"]
151 changes: 150 additions & 1 deletion README.md
@@ -80,6 +80,148 @@ Please set up a search API key (e.g., a Tavily API key) in `.env` file:
TAVILY_API_KEY=your_tavily_api_key_here
```

## 🐳 Docker: One-Click Setup

> Get a fully isolated, GPU-ready environment running in **~60 seconds**: no conda, no manual driver configuration.

### Prerequisites

| Requirement | Notes |
|---|---|
| [Docker Desktop](https://docs.docker.com/get-docker/) ≥ 24 | WSL2 backend required on Windows |
| NVIDIA driver ≥ 470 | [Linux: NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) · Windows: driver with WSL2 support |
| [Hugging Face token](https://huggingface.co/settings/tokens) | Read access, for model downloads |

### Step 1 - Configure secrets

Create a `.env` file in the project root (**never commit this file**):

```env
HF_TOKEN=hf_your_token_here
TAVILY_API_KEY=your_tavily_key_here # required only for deliberation style
```
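
Before building, it can help to confirm that the file actually defines both keys. The following is a minimal sketch, not part of the repository: `parse_env` and `missing_keys` are hypothetical helpers, and the parser assumes simple `KEY=VALUE` lines with optional `#` comments rather than the full syntax Docker Compose accepts.

```python
from pathlib import Path

REQUIRED_KEYS = ("HF_TOKEN", "TAVILY_API_KEY")  # names taken from .env.example


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        # Drop inline comments (assumes values never contain '#').
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    print("missing:", missing_keys(parse_env(text)))
```

An empty list means both keys are set. Since the comment in the example file says `TAVILY_API_KEY` is needed only for the deliberation style, a missing value there may be acceptable for other styles.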

### Step 2 - Build the image

```bash
docker compose build recursivemas
```

The first build takes ~5 minutes to pull the CUDA base layer. Subsequent builds reuse Docker's layer cache, so only layers whose inputs changed are rebuilt.

### Step 3 - Run batch inference

```bash
docker compose up recursivemas
```

Models are downloaded from Hugging Face on first run and persisted in the `hf_cache` Docker volume, so subsequent runs skip the download.

### Step 4 - Launch the Gradio web UI

```bash
docker compose up serve
```

Open [http://localhost:7860](http://localhost:7860). The UI exposes all 5 collaboration styles. Models are loaded into VRAM on the first request and stay warm for subsequent ones, so there is no reload between questions.

<p align="center">
<img src="assets/webui.png" width="90%" alt="RecursiveMAS Gradio Web UI">
</p>

---

### 🩺 Health Check

Verify the container before running inference:

```bash
# Level 1 - Python dependencies + all 5 styles registered (no GPU needed)
docker run --rm --entrypoint python recursivemas healthcheck.py --level 1

# Level 2 - CUDA device detection + tensor allocation (--gpus all exposes the host GPUs)
docker run --rm --gpus all --entrypoint python recursivemas healthcheck.py --level 2

# Level 3 - Hugging Face Hub reachability (requires the HF_TOKEN env var)
docker run --rm --entrypoint python -e HF_TOKEN="$HF_TOKEN" recursivemas healthcheck.py --level 3
```

Expected output for a passing level-1 check:

```
======================================================
RecursiveMAS - container health check
======================================================

[Level 1] Python dependencies + internal modules
[PASS] torch: version=2.9.0+cu128
[PASS] transformers: version=5.3.0
[PASS] huggingface_hub: version=1.7.1
[PASS] accelerate: version=1.12.0
[PASS] internal modules (modeling, load_from_repo, prompts): 5 styles registered

======================================================
All 5/5 checks passed.
```
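
The contents of `healthcheck.py` are not shown in this diff, so the following is only a sketch of how a tiered check of this kind could be structured. The function names are hypothetical, the package list is taken from the sample output above, and the level-3 network check is left out because it needs a token and connectivity.

```python
import importlib.metadata
import importlib.util

# Package names taken from the sample output; the real script may check more.
DEPENDENCIES = ("torch", "transformers", "huggingface_hub", "accelerate")


def check_dependency(name: str) -> tuple[bool, str]:
    """Level-1 style check: is the package importable, and which version is installed?"""
    if importlib.util.find_spec(name) is None:
        return False, f"{name}: not installed"
    try:
        version = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        version = "unknown"
    return True, f"{name}: version={version}"


def check_cuda() -> tuple[bool, str]:
    """Level-2 style check: can torch see a CUDA device and allocate a tensor on it?"""
    if importlib.util.find_spec("torch") is None:
        return False, "torch: not installed"
    import torch

    if not torch.cuda.is_available():
        return False, "cuda: no device visible"
    torch.zeros(1, device="cuda")  # tiny allocation to prove the device is usable
    return True, f"cuda: {torch.cuda.get_device_name(0)}"


def run(level: int) -> int:
    """Run all checks up to `level`; return 0 if every check passed, else 1."""
    results = [check_dependency(name) for name in DEPENDENCIES]
    if level >= 2:
        results.append(check_cuda())
    for ok, message in results:
        print(f"[{'PASS' if ok else 'FAIL'}] {message}")
    return 0 if all(ok for ok, _ in results) else 1


if __name__ == "__main__":
    print("exit code:", run(level=1))
```

Returning a numeric status rather than raising keeps each level usable both from the command line and from a Docker `HEALTHCHECK`, which treats a non-zero exit code as unhealthy.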

---

### ⚠️ No GPU? CPU Fallback

If your machine has no NVIDIA GPU, or GPU passthrough is not yet configured (common on **Windows + WSL2**), you can still explore the web UI and run inference on CPU.

**Step 1 - Create `docker-compose.override.yml`** in the project root:

```yaml
services:
  recursivemas:
    runtime: runc
    deploy: {}
  serve:
    runtime: runc
    deploy: {}
```

The `runtime: runc` key forces the standard Docker runtime, bypassing the NVIDIA hook entirely.

**Step 2 - Start the web UI**

```bash
docker compose down # remove any existing containers
docker compose up serve # start fresh without GPU reservation
```

Open [http://localhost:7860](http://localhost:7860). The **Device** dropdown will show `cpu` only; select it and send your question.

> CPU inference is orders of magnitude slower than GPU (several minutes per question vs. a few seconds). It is suitable for exploring the UI and validating the pipeline end-to-end, not for benchmarking.

**Alternatively**, bypass Compose entirely with `docker run`:

```bash
# Linux / macOS
docker run --rm -p 7860:7860 \
-e HF_TOKEN="" -e TAVILY_API_KEY="" \
-v recursivemas_hf_cache:/hf_cache \
--entrypoint python recursivemas-serve \
serve.py --host 0.0.0.0 --port 7860

# Windows PowerShell
docker run --rm -p 7860:7860 `
-e HF_TOKEN="" -e TAVILY_API_KEY="" `
-v recursivemas_hf_cache:/hf_cache `
--entrypoint python recursivemas-serve `
serve.py --host 0.0.0.0 --port 7860
```

**Fixing GPU passthrough on Windows (WSL2)** to unlock full GPU speed:

1. Run `wsl --list --verbose`; the `VERSION` column must show **2** (not 1)
2. Update the NVIDIA Windows driver to **≥ 470** from [nvidia.com/drivers](https://www.nvidia.com/drivers)
3. Docker Desktop → **Settings → Resources → WSL Integration** → enable your distro
4. Restart Docker Desktop, delete the override file, and re-run `docker compose up serve`

---

## 💥 Quick Start

### 🤖 Load Model Checkpoints
@@ -167,13 +309,20 @@ The current repository is organized as follows:
RecursiveMAS/
├── README.md
├── __init__.py
├── run.py
├── run.py # unified CLI entry point for batch inference
├── serve.py # Gradio web UI (all 5 styles, warm model cache)
├── healthcheck.py # 3-level container health check
├── load_from_repo.py
├── hf_resolver.py
├── modeling.py
├── system_loader.py
├── prompts.py
├── requirements.txt
├── requirements-serve.txt # extra deps for serve.py (gradio)
├── Dockerfile # batch inference image
├── Dockerfile.serve # web UI image
├── docker-compose.yml # orchestrates both services + shared hf_cache volume
├── .dockerignore
├── assets/
├── dataset/
└── inference_utils/
Binary file added assets/webui.png
56 changes: 56 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,56 @@
services:
  # ── Batch evaluation ────────────────────────────────────────────────────────
  recursivemas:
    build: .
    image: recursivemas:latest
    # Requires NVIDIA Container Toolkit on the host:
    # https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - HF_TOKEN=${HF_TOKEN:-}
      - TAVILY_API_KEY=${TAVILY_API_KEY:-}
    volumes:
      - hf_cache:/hf_cache
    # Override with your desired --style, --dataset, etc.
    command:
      - --style
      - sequential_light
      - --dataset
      - math500
      - --device
      - cuda

  # ── Web UI (Gradio) ─────────────────────────────────────────────────────────
  serve:
    build:
      context: .
      dockerfile: Dockerfile.serve
    image: recursivemas-serve:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - HF_TOKEN=${HF_TOKEN:-}
      - TAVILY_API_KEY=${TAVILY_API_KEY:-}
    volumes:
      - hf_cache:/hf_cache
    ports:
      - "${SERVE_PORT:-7860}:7860"
    command:
      - --host
      - "0.0.0.0"
      - --port
      - "7860"

volumes:
  hf_cache:
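
The `${SERVE_PORT:-7860}` and `${HF_TOKEN:-}` entries rely on Compose's default-value interpolation: the text after `:-` is used when the variable is unset or empty. The Python sketch below mimics that single rule for illustration only; `resolve` is a hypothetical helper, not part of Compose or this repository.

```python
import os
from typing import Mapping, Optional


def resolve(name: str, default: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Mimic ${NAME:-default}: fall back when NAME is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    return value if value else default


if __name__ == "__main__":
    host_port = resolve("SERVE_PORT", "7860")
    # Left side is the host port, right side the fixed container port.
    print(f"{host_port}:7860")
```

In practice this means `SERVE_PORT=8080 docker compose up serve` publishes the UI on host port 8080, while the container keeps listening on 7860.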