A middleware server for OpenAI-compatible backends that provides passthrough (OpenAI/Anthropic), Anthropic-to-OpenAI conversion, per-model sampling-parameter overrides, and mid-generation response validation.
- Passthrough Modes: OpenAI, Anthropic, and Anthropic-to-OpenAI conversion
- Parameter Override: Apply custom sampling parameters per model
- Streaming Support: Both streaming and non-streaming responses
- Garbage Detection: Validate responses and auto-retry when garbage output is detected
- Mid-Stream Validation: Detect garbage during generation (not just at the end)
- Flexible Validator API: Supports Anthropic and OpenAI API formats
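The Anthropic-to-OpenAI conversion is essentially a request-shape mapping between the two public API schemas. A simplified sketch of the idea (not the proxy's actual code; it ignores tool use and other block types):

```python
def anthropic_to_openai(body: dict) -> dict:
    """Map an Anthropic /messages request body to an OpenAI
    /chat/completions body (simplified sketch)."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first chat message.
    if "system" in body:
        messages.append({"role": "system", "content": body["system"]})
    for msg in body.get("messages", []):
        content = msg["content"]
        # Anthropic content may be a list of typed blocks; flatten text blocks.
        if isinstance(content, list):
            content = "".join(
                block.get("text", "")
                for block in content
                if block.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    out = {
        "model": body.get("model"),
        "messages": messages,
        "stream": body.get("stream", False),
    }
    if "max_tokens" in body:  # required by Anthropic, optional for OpenAI
        out["max_tokens"] = body["max_tokens"]
    return out
```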
```bash
# Clone and setup
git clone https://github.com/avtc/sampling-proxy.git
cd sampling-proxy
python -m venv sampling-proxy

# Activate venv and install
source sampling-proxy/bin/activate   # Linux/macOS
sampling-proxy\Scripts\activate      # Windows
pip install -r requirements.txt

# Configure and run
cp config_sample.json config.json
python sampling_proxy.py
```

Alternatively, use the launch scripts:

```bash
./sampling_proxy.sh    # Linux/macOS
.\sampling_proxy.ps1   # Windows
```

Both scripts auto-activate the sampling-proxy venv and run the proxy.
Run a small model as a validator for garbage detection:
```bash
llama-server --hf-repo unsloth/Qwen3.5-4B-GGUF --hf-file Qwen3.5-4B-UD-Q6_K_XL.gguf --host 127.0.0.1 --port 1235 -ngl 99 --parallel 2 --jinja -fa on -c 40000 --chat-template-kwargs "{\"enable_thinking\": false}" --temp 1 --min-p 0 --top-p 0.95 --top-k 20 --repeat-penalty 1 --presence-penalty 1.5 --cache-ram 0
```

List all command-line options with `python sampling_proxy.py --help`:

| Option | Description |
|---|---|
| `--config, -c` | Path to config JSON file |
| `--host` | Proxy server host |
| `--port` | Proxy server port |
| `--target-base-url` | Backend URL |
| `--debug-logs, -d` | Enable debug logging |
| `--enforce-params, -e` | Enforce parameters as JSON |
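Per-model parameter enforcement boils down to overlaying a configured dict onto whatever sampling parameters the client sent. A minimal sketch of that merge (the model name and parameter set here are illustrative, not the proxy's actual schema):

```python
# Hypothetical per-model overrides, of the kind --enforce-params might carry.
ENFORCED = {
    "my-model": {"temperature": 0.7, "top_p": 0.95, "min_p": 0.0},
}

def apply_overrides(request: dict) -> dict:
    """Overlay enforced sampling params onto an incoming request body."""
    overrides = ENFORCED.get(request.get("model"), {})
    merged = dict(request)
    merged.update(overrides)  # enforced values win over client-sent values
    return merged
```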
Enable validation to detect garbage responses and retry them automatically.
Detected issues:
- Repetition loops: the same phrase repeated 3+ times
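A repetition loop of this kind can be caught with a simple n-gram frequency count. The function below is an illustrative check, not the proxy's actual detector:

```python
from collections import Counter

def has_repetition_loop(text: str, ngram: int = 5, threshold: int = 3) -> bool:
    """Return True if any `ngram`-word phrase occurs `threshold` or more times."""
    words = text.split()
    grams = Counter(
        tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1)
    )
    return any(count >= threshold for count in grams.values())
```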
| Option | Description | Default |
|---|---|---|
| `validation.enabled` | Enable response validation | `false` |
| `validation.validator_url` | Validator endpoint URL | `http://127.0.0.1:1235` |
| `validation.validator_model` | Model name for validation | `Qwen3.5-4B-UD-Q6_K_XL.gguf` |
| `validation.max_retries` | Max retry attempts | `1` |
| `validation.mid_stream_validation_enabled` | Validate during streaming | `true` |
| `validation.mid_stream_validation_interval_words` | Check every N words | `300` |
When enabled, validates responses periodically during streaming:
- Catches repetition loops at ~300 words (configurable)
- Interrupts garbage immediately and retries
- Reduces latency by not waiting for full garbage responses
```json
{
  "validation": {
    "enabled": true,
    "mid_stream_validation_enabled": true,
    "mid_stream_validation_interval_words": 300
  }
}
```

Failed validation responses are saved to `~/.sampling-proxy/logs/`.
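The word-interval trigger can be sketched as a wrapper around the chunk stream. This is illustrative only; `validate` stands in for a call to the validator endpoint, and the caller is assumed to retry when the stream is cut short:

```python
def stream_with_validation(chunks, validate, interval_words=300):
    """Yield chunks, running validate(text_so_far) every `interval_words`
    words; stop the stream early if validation fails."""
    buffer = []
    next_check = interval_words
    for chunk in chunks:
        buffer.append(chunk)
        text = "".join(buffer)
        if len(text.split()) >= next_check:
            if not validate(text):
                return  # interrupt garbage mid-generation; caller retries
            next_check += interval_words
        yield chunk
```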
| Endpoint | Description |
|---|---|
| `/chat/completions` | OpenAI chat completions |
| `/messages` | Anthropic messages (converted) |
| `/models` | List available models |
| `/` | Health check |
MIT License