Albor (Spanish: "dawn") — A sovereign Python code completion model trained from first principles using only the Sovereign AI stack.
Specification Book · Training Log · Gap Register
A 350M-parameter decoder-only transformer for Python code completion, trained entirely in Rust with zero Python dependencies. Every operation — data loading, tokenization, training, evaluation, checkpointing — uses the Sovereign AI stack. No PyTorch, no pip, no conda.
The project has two goals:
- Produce a usable Python code completion model that runs anywhere Rust compiles
- Identify and fix every gap in the Sovereign AI stack that blocks end-to-end LLM development — 98 gaps found and fixed so far
v15 training RUNNING — Step 24K/155K (15.6%), 786M tokens processed.
| Metric | Value |
|---|---|
| Training run | v15 (15th attempt, seed=123) |
| Step | 24,000 / 155,000 (15.6%) |
| Best val_ppl | 309 (step 9K, pre-outage) |
| Throughput | 14,900 tok/s, 46.9% MFU |
| Hardware | RTX 4090 (24 GB), single GPU |
| Data | codeparrot-clean, 5.08B tokens (73% Chinchilla) |
| ETA | ~3.3 days remaining |
Phase change at step 3K (earliest ever). Power outage at step 11K — resumed from the step-10K checkpoint. Post-resume best val_ppl: 400 (step 17K). 98 gaps fixed, including ALB-122 (a trueno PTX bug discovered during the resume).
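The ~3.3-day ETA in the table follows directly from the live counters. A minimal sketch (tokens-per-step is inferred from the counters, not read from the training config):

```rust
// ETA estimate from the status table above. All inputs are the table's values;
// tokens-per-step (~32,750) is inferred, consistent with a 1024-token context
// times a ~32-sequence batch.
fn eta_days() -> f64 {
    let total_steps = 155_000.0_f64;
    let current_step = 24_000.0;
    let tokens_so_far = 786.0e6;
    let tok_per_s = 14_900.0;

    let tok_per_step = tokens_so_far / current_step;
    let remaining_tokens = (total_steps - current_step) * tok_per_step;
    remaining_tokens / tok_per_s / 86_400.0 // seconds per day
}

fn main() {
    println!("ETA: {:.1} days", eta_days()); // ≈ 3.3, matching the table
}
```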
Through 15 training runs and 98 bug fixes, the project has:
- Trained a 350M transformer on a single GPU entirely in Rust — no Python anywhere in the stack
- Achieved 8.5K tok/s at 24.6% MFU on an RTX 4090 with hand-written PTX kernels + cuBLAS
- Discovered and fixed 98 infrastructure gaps across 6 upstream repos, including:
- Silent memory corruption in CUDA backward kernels (ALB-041, 043, 059)
- Missing RoPE backward pass (ALB-119) — model trained without position gradients
- GPU optimizer state not checkpointed (ALB-118) — resume destroyed weights
- Data loader position not checkpointed (ALB-120) — resume caused data overlap
- Activation gradient overflow at GPU-CPU boundary (ALB-044) — NaN in embeddings
- Stream synchronization race conditions (ALB-065) — stale GPU data on D2H transfer
- Reached val_ppl=129 (v9) on 490M tokens — v15 is on track to beat this with 10x more data
- Built 39 provable contracts verified by `pv` (provable-contracts)
- 108/108 batuta falsification tests PASS (Toyota Standard grade)
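The MFU figures above follow the standard estimate: achieved FLOP/s ≈ 6·N·(tokens/s) for forward plus backward, divided by the GPU's peak. A hedged sketch — the 82.6 TFLOP/s FP32 peak for the RTX 4090 is an assumption here, and the repo's own accounting may use a different peak, so this gives a ballpark rather than reproducing the quoted 24.6%:

```rust
// Model FLOPs Utilization: 6 * params * throughput / peak FLOP/s.
// The peak value is an assumed RTX 4090 FP32 figure, not taken from the repo.
fn mfu(params: f64, tok_per_s: f64, peak_flops: f64) -> f64 {
    6.0 * params * tok_per_s / peak_flops
}

fn main() {
    // 370M effective params at 8.5K tok/s against an assumed 82.6 TFLOP/s peak
    let m = mfu(370.0e6, 8_500.0, 82.6e12);
    println!("MFU ≈ {:.1}%", m * 100.0); // ≈ 22.8% under these assumptions
}
```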
Why this is hard (Textbooks Are All You Need): phi-1-small (350M) achieved 45% HumanEval — but with 7B tokens of synthetic textbook-quality data generated by GPT-3.5. Albor trains on codeparrot-clean (raw GitHub Python, ~50GB). Data quality is the primary ceiling, not model architecture. Distillation from Qwen3-Coder-30B partially compensates.
Scaling position (Chinchilla, Beyond Chinchilla): Our 5.08B tokens on 350M params (14.5:1) is below Chinchilla-optimal (20:1 = 7B tokens). Modern practice overtrains small models far beyond this — Llama 3 uses 1875:1. For inference-optimized deployment, training longer on a smaller model is preferred.
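The scaling arithmetic above can be sanity-checked directly (all inputs come from the text):

```rust
// Tokens-per-parameter ratio and fraction of the Chinchilla 20:1 optimum.
fn tokens_per_param(tokens: f64, params: f64) -> f64 {
    tokens / params
}

fn chinchilla_fraction(tokens: f64, params: f64) -> f64 {
    tokens / (20.0 * params) // 20:1 optimum for this model size = 7B tokens
}

fn main() {
    let (params, tokens) = (350.0e6, 5.08e9);
    println!(
        "{:.1}:1, {:.0}% of Chinchilla-optimal",
        tokens_per_param(tokens, params),
        chinchilla_fraction(tokens, params) * 100.0
    ); // 14.5:1, 73% — matching the figures in the text
}
```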
Competitive landscape (CodeGen): CodeGen-350M-mono achieves 10.2% HumanEval pass@1, trained on 577B tokens. No sub-1B model has appeared on the Big Code Models Leaderboard.
| Milestone | When | Success Criteria |
|---|---|---|
| v15 surpasses v9 (val_ppl < 129) | Step 10-15K (~2 days) | Sentence-level patterns learned |
| v15 reaches ppl < 50 | Step 155K (~5 days) | Syntactic structure captured |
| HumanEval pass@1 > 0% | val_ppl < 100 | First valid Python generation |
| Distillation from Qwen3-Coder-30B | After base model | Synthetic textbook-style data |
| HumanEval pass@1 > 10% | After distillation | Beat CodeGen-350M-mono (10.2%) |
| Big Code Leaderboard submission | After distillation | First sub-1B entry |
Minimum viable (Phase 3): Base model converges to val_ppl < 100 and achieves HumanEval pass@1 > 5%. Proves the sovereign stack can train a working code model end-to-end in Rust.
Good (Phase 5): Distilled model hits HumanEval pass@1 > 10%, beating CodeGen-350M-mono (10.2%). Proves distillation from Qwen3-Coder-30B MoE teacher through the sovereign stack produces competitive results.
Full success (Phase 8): All 6 model variants benchmarked, Q4 model under 100MB runs at <50ms/token on CPU, submitted to Big Code Leaderboard as the first sub-1B entry. The stack is proven end-to-end.
LLaMA-style decoder-only transformer
├── 24 layers, 1024 hidden dim, 16 attention heads, 4 KV heads (GQA)
├── SwiGLU FFN (4096 intermediate), RoPE, RMSNorm (pre-norm)
├── 32,768 vocab (ByteLevel BPE v2), 1024 context (GPU-resident)
├── ~370M parameters, GPU-resident AdamW on RTX 4090 (~13 GB VRAM)
└── Cosine LR schedule (3e-4 peak, 155K steps, 2K warmup)
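The schedule on the last line can be sketched as follows. This is a minimal sketch using the spec's values (3e-4 peak, 2K warmup, 155K total steps); the linear warmup shape and the decay-to-zero floor are assumptions — entrenar's actual schedule may differ:

```rust
use std::f64::consts::PI;

// Cosine LR decay with linear warmup. Peak, warmup, and total steps are the
// spec's values; decaying to zero (rather than, say, 10% of peak) is assumed.
fn lr_at(step: u32) -> f64 {
    let (peak, warmup, total) = (3e-4_f64, 2_000.0, 155_000.0);
    let s = step as f64;
    if s < warmup {
        peak * s / warmup // linear warmup from 0 to peak
    } else {
        let t = (s - warmup) / (total - warmup); // decay progress in [0, 1]
        0.5 * peak * (1.0 + (PI * t).cos())      // cosine decay to 0
    }
}
```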
15 training runs, each revealing and fixing infrastructure bugs:
| Run | Steps | Best val_ppl | Outcome | Key Fix |
|---|---|---|---|---|
| v2 | 1K | 1,008 | Crashed | ALB-073: PTX instruction bug |
| v3 | 28K | 1,018 | Plateau | ALB-079: no cosine LR decay |
| v5 | — | — | Failed | ALB-092: gradient accumulation bug |
| v8 | 5K | — | Killed | ALB-106: trained without RoPE |
| v9 | 15K | 129 | Stopped | Best genuine result (490M tokens) |
| v13 | 62K | 239 (inflated) | Stopped | ALB-120: data position not checkpointed |
| v14 | 20K | 782 | Killed | Degenerate init (seed=42) |
| v15 | 24K+ | 309 | Running | Phase change at step 3K |
| Component | Role | Gaps Fixed |
|---|---|---|
| entrenar | Training engine | 40+ (CUDA kernels, optimizer, checkpoint) |
| trueno | GPU tensor ops | 15+ (RoPE, RMSNorm, cuBLAS, PTX) |
| aprender (apr) | CLI | 10+ (eval, train, checkpoint) |
| realizar | Inference | 5+ (Qwen3 MoE, Q4K) |
| alimentar | Data pipeline | 5+ (Parquet, FIM) |
| provable-contracts | Verification | 39 contracts |
```bash
# Build the CLI
cd ~/src/aprender && cargo build --release -p apr-cli

# Train from scratch (RTX 4090, ~5 days)
apr train apply --task pretrain --config configs/train/pretrain-350m-v15.yaml

# Evaluate
apr eval --task humaneval --model checkpoints/albor-base-350m-v15/model-best.apr
```

Apache-2.0