Run AI entirely in your browser

No servers, no API keys, no data leaves your device. Powered by WebLLM, Transformers.js, and MediaPipe — everything runs locally on your hardware.

First load downloads model weights to your browser — this is a one-time download. After that, the model loads from cache in seconds.
WebLLM and MediaPipe models require WebGPU (Chrome 113+, Edge 113+). Transformers.js models use ONNX Runtime Web, which can fall back to WebAssembly where WebGPU is unavailable. All three need network access to huggingface.co for the initial download.
How do these models run in your browser?

Three paths to in-browser AI

WebLLM + MLC

HuggingFace: MLC-format weights
↓
TVM / MLC Compiler: ahead-of-time compilation
↓
WebGPU Compute Shaders: pre-optimized GPU kernels
↓
Your GPU

Models are compiled ahead-of-time using Apache TVM / MLC (Machine Learning Compilation). The compiler transforms model weights and operations into optimized WebGPU compute shaders that run directly on your GPU.
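A minimal sketch of this path, using WebLLM's `CreateMLCEngine` entry point. The model ID must be one of WebLLM's prebuilt MLC-compiled models; the specific ID below is an assumption for illustration.

```javascript
// Sketch: load an MLC-compiled model and run one chat turn with WebLLM.
// Requires a WebGPU-capable browser. The model ID is assumed to be in
// WebLLM's prebuilt model list.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function chatOnce(prompt) {
  const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
    // Reports weight download and shader-compilation progress
    // (shader compilation happens on the first run, then is cached)
    initProgressCallback: (report) => console.log(report.text),
  });
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
  });
  return reply.choices[0].message.content;
}
```

WebLLM exposes an OpenAI-style `chat.completions` interface, so code written against the OpenAI API shape ports over with few changes.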

Trade-offs
  • Fast inference — kernels are pre-optimized
  • First run compiles shaders for your GPU (cached after)
  • Model must be specifically compiled for MLC

Used by

SmolLM2 360M, SmolLM2 1.7B, Qwen3 4B, Phi-3.5 Mini, Llama 3.2 1B

Transformers.js + ONNX Runtime Web

HuggingFace: ONNX-format model graph + weights
↓
ONNX Runtime Web: builds execution plan at load time
↓
WebGPU (or WASM fallback)
↓
Your GPU / CPU

Models are stored in the standard ONNX (Open Neural Network Exchange) format. ONNX Runtime Web interprets the model graph at load time and executes it on your GPU via WebGPU, or falls back to WebAssembly on unsupported hardware.
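A sketch of this path with the Transformers.js `pipeline` helper. The model ID is an assumption; any ONNX-exported text-generation model on the Hub works the same way.

```javascript
// Sketch: text generation via Transformers.js, which executes the ONNX
// graph through ONNX Runtime Web. Model ID is illustrative.
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct",
  { device: "webgpu" } // omit or use "wasm" to run on the WebAssembly backend
);

const out = await generator(
  [{ role: "user", content: "Explain WebGPU in one sentence." }],
  { max_new_tokens: 64 }
);
```

Because the ONNX graph is interpreted at load time rather than compiled ahead-of-time, swapping models is just a matter of changing the model ID.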

Trade-offs
  • Supports any model exportable to ONNX
  • Can fall back to WASM if WebGPU is unavailable
  • Slightly more overhead than pre-compiled kernels
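The WASM fallback decision reduces to a one-line capability check on `navigator.gpu`, the standard WebGPU entry point. `pickBackend` below is a hypothetical helper, not part of any library:

```javascript
// Choose an execution backend the way an ONNX-Runtime-Web-style runtime
// does: prefer WebGPU, otherwise fall back to WebAssembly.
// `pickBackend` is a hypothetical name for illustration.
function pickBackend(nav) {
  // `navigator.gpu` is only defined in WebGPU-capable browsers
  return nav && "gpu" in nav ? "webgpu" : "wasm";
}

// In the browser: pickBackend(navigator)
```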

Used by

Qwen3.5 0.8B, Qwen3.5 2B, Qwen3.5 4B

MediaPipe + LiteRT

HuggingFace: LiteRT model file (.litertlm)
↓
MediaPipe GenAI: LLM Inference API
↓
WebGPU Compute: multimodal, text + images
↓
Your GPU

Google's MediaPipe LLM Inference API loads Gemma models in the LiteRT format (formerly TFLite). It supports multimodal input — text and images — all processed on-device via WebGPU.
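A sketch of this path using MediaPipe's LLM Inference API. The WASM asset URL and model path below are assumptions; the model file is the single `.litertlm` download noted in the trade-offs.

```javascript
// Sketch: Gemma inference through MediaPipe's LLM Inference API.
// Asset URL and model path are illustrative placeholders.
import { FilesetResolver, LlmInference } from "@mediapipe/tasks-genai";

const genaiFileset = await FilesetResolver.forGenAiTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm"
);
const llm = await LlmInference.createFromOptions(genaiFileset, {
  baseOptions: { modelAssetPath: "/models/gemma-3n-E2B.litertlm" },
  maxTokens: 512,
});
const answer = await llm.generateResponse("Summarize WebGPU in one line.");
```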

Trade-offs
  • Multimodal: text and image input
  • Single large file download (no split shards)
  • Requires WebGPU — no WASM fallback

Used by

Gemma 3n E2B, Gemma 3n E4B

All three methods use WebGPU for GPU acceleration. All model weights are cached in your browser after the first download — no server involved.
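Since the weights live in the browser's own storage, you can get a rough sense of how much space they occupy with the standard Storage API; the helper names below are illustrative.

```javascript
// Convert a byte count to a megabyte display string (pure helper).
function toMB(bytes) {
  return (bytes / (1024 * 1024)).toFixed(1);
}

// Sketch: report how much browser storage is in use (cached model
// weights included). navigator.storage.estimate() is a standard web API.
async function reportModelStorage() {
  const { usage = 0, quota = 0 } = await navigator.storage.estimate();
  console.log(`Using ${toMB(usage)} MB of ${toMB(quota)} MB available`);
}
```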