paiml/albor

Albor — Sovereign Python Code Completion

Albor (Spanish: "dawn") — A sovereign Python code completion model trained from first principles using only the Sovereign AI stack.

Specification Book · Training Log · Gap Register


What Is Albor?

A 350M-parameter decoder-only transformer for Python code completion, trained entirely in Rust with zero Python dependencies. Every operation — data loading, tokenization, training, evaluation, checkpointing — uses the Sovereign AI stack. No PyTorch, no pip, no conda.

The project has two goals:

  1. Produce a usable Python code completion model that runs anywhere Rust compiles
  2. Identify and fix every gap in the Sovereign AI stack that blocks end-to-end LLM development — 97 gaps found and fixed so far

Current Status

v15 training RUNNING — Step 24K/155K (15.6%), 786M tokens processed.

| Metric | Value |
| --- | --- |
| Training run | v15 (15th attempt, seed=123) |
| Step | 24,000 / 155,000 (15.6%) |
| Best val_ppl | 309 (step 9K, pre-outage) |
| Throughput | 14,900 tok/s, 46.9% MFU |
| Hardware | RTX 4090 (24 GB), single GPU |
| Data | codeparrot-clean, 5.08B tokens (73% Chinchilla) |
| ETA | ~3.3 days remaining |

The loss phase change arrived at step 3K, the earliest of any run. A power outage at step 11K forced a resume from the step-10K checkpoint; the post-resume best is val_ppl 400 (step 17K). 98 gaps have been fixed to date, including ALB-122, a trueno PTX bug discovered during the resume.

What We've Built

Through 15 training runs and 97 bug fixes, the project has:

  • Trained a 350M transformer on a single GPU entirely in Rust — no Python anywhere in the stack
  • Achieved 8.5K tok/s at 24.6% MFU on an RTX 4090 with hand-written PTX kernels + cuBLAS
  • Discovered and fixed 97 infrastructure gaps across 6 upstream repos, including:
    • Silent memory corruption in CUDA backward kernels (ALB-041, 043, 059)
    • Missing RoPE backward pass (ALB-119) — model trained without position gradients
    • GPU optimizer state not checkpointed (ALB-118) — resume destroyed weights
    • Data loader position not checkpointed (ALB-120) — resume caused data overlap
    • Activation gradient overflow at GPU-CPU boundary (ALB-044) — NaN in embeddings
    • Stream synchronization race conditions (ALB-065) — stale GPU data on D2H transfer
  • Reached val_ppl=129 (v9) on 490M tokens — v15 is on track to beat this with 10x more data
  • Built 39 provable contracts verified by pv (provable-contracts)
  • 108/108 batuta falsification tests PASS (Toyota Standard grade)
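
A quick aid for reading the val_ppl numbers above: perplexity is the exponentiated mean cross-entropy loss in nats. That is the standard definition; we assume it matches how the stack reports val_ppl. A minimal sketch:

```rust
// Perplexity <-> loss conversion (standard definition, assumed to match
// how val_ppl is reported in this README).
fn ppl(loss_nats: f64) -> f64 {
    loss_nats.exp()
}

fn main() {
    // v9's best val_ppl of 129 corresponds to a loss of ln(129) ≈ 4.86 nats.
    assert!((ppl(129f64.ln()) - 129.0).abs() < 1e-9);
    assert!((129f64.ln() - 4.86).abs() < 0.01);
    // Uniform guessing over the 32,768-token vocab gives ppl = 32,768
    // (loss ≈ 10.40 nats) — the ceiling an untrained model starts from.
    assert!((ppl(32_768f64.ln()) - 32_768.0).abs() < 1e-6);
    println!("ppl 129 corresponds to loss ≈ {:.2} nats", 129f64.ln());
}
```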

Research Context

Why this is hard (Textbooks Are All You Need): phi-1-small (350M) achieved 45% HumanEval — but with 7B tokens of synthetic textbook-quality data generated by GPT-3.5. Albor trains on codeparrot-clean (raw GitHub Python, ~50GB). Data quality is the primary ceiling, not model architecture. Distillation from Qwen3-Coder-30B partially compensates.

Scaling position (Chinchilla, Beyond Chinchilla): Our 5.08B tokens on 350M params (14.5:1) is below Chinchilla-optimal (20:1 = 7B tokens). Modern practice overtrains small models far beyond this — Llama 3 uses 1875:1. For inference-optimized deployment, training longer on a smaller model is preferred.
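
The scaling arithmetic above can be checked back-of-envelope. The 20:1 tokens-per-parameter figure is the Chinchilla paper's rule of thumb, not an exact constant:

```rust
// Back-of-envelope check of the scaling numbers quoted above.
fn main() {
    let params = 350e6_f64;
    let tokens = 5.08e9_f64;

    // 5.08B tokens / 350M params ≈ 14.5 tokens per parameter.
    let ratio = tokens / params;
    assert!((ratio - 14.5).abs() < 0.1);

    // Chinchilla-optimal at the 20:1 rule of thumb ≈ 7B tokens.
    let chinchilla_tokens = 20.0 * params;
    assert!((chinchilla_tokens / 1e9 - 7.0).abs() < 1e-9);

    // 5.08B / 7B ≈ 73%, matching the "73% Chinchilla" in the status table.
    let pct = (tokens / chinchilla_tokens * 100.0).round() as i64;
    assert_eq!(pct, 73);

    println!("{ratio:.1} tokens/param, {pct}% of Chinchilla-optimal");
}
```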

Competitive landscape (CodeGen): CodeGen-350M-mono achieves 10.2% HumanEval pass@1, trained on 577B tokens. No sub-1B model has appeared on the Big Code Models Leaderboard.

What Needs to Happen

| Milestone | When | Success Criteria |
| --- | --- | --- |
| v15 surpasses v9 (val_ppl < 129) | Step 10-15K (~2 days) | Sentence-level patterns learned |
| v15 reaches ppl < 50 | Step 155K (~5 days) | Syntactic structure captured |
| HumanEval pass@1 > 0% | val_ppl < 100 | First valid Python generation |
| Distillation from Qwen3-Coder-30B | After base model | Synthetic textbook-style data |
| HumanEval pass@1 > 10% | After distillation | Beat CodeGen-350M-mono (10.2%) |
| Big Code Leaderboard submission | After distillation | First sub-1B entry |

When We Declare Success

Minimum viable (Phase 3): Base model converges to val_ppl < 100 and achieves HumanEval pass@1 > 5%. Proves the sovereign stack can train a working code model end-to-end in Rust.

Good (Phase 5): Distilled model hits HumanEval pass@1 > 10%, beating CodeGen-350M-mono (10.2%). Proves distillation from Qwen3-Coder-30B MoE teacher through the sovereign stack produces competitive results.

Full success (Phase 8): All 6 model variants benchmarked, Q4 model under 100MB runs at <50ms/token on CPU, submitted to Big Code Leaderboard as the first sub-1B entry. The stack is proven end-to-end.


Architecture

LLaMA-style decoder-only transformer
├── 24 layers, 1024 hidden dim, 16 attention heads, 4 KV heads (GQA)
├── SwiGLU FFN (4096 intermediate), RoPE, RMSNorm (pre-norm)
├── 32,768 vocab (ByteLevel BPE v2), 1024 context (GPU-resident)
├── ~370M parameters, GPU-resident AdamW on RTX 4090 (~13 GB VRAM)
└── Cosine LR schedule (3e-4 peak, 155K steps, 2K warmup)
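
The schedule in the last line can be sketched as follows. This is a minimal illustration assuming linear warmup to the peak followed by cosine decay to zero; entrenar's actual implementation may add a floor LR or differ in detail:

```rust
// Cosine LR schedule with linear warmup (sketch; assumed form, not
// necessarily identical to entrenar's implementation).
fn lr_at(step: usize, peak: f64, total: usize, warmup: usize) -> f64 {
    if step < warmup {
        // Linear ramp from 0 to peak over the warmup steps.
        peak * step as f64 / warmup as f64
    } else {
        // Cosine decay from peak to 0 over the remaining steps.
        let t = (step - warmup) as f64 / (total - warmup) as f64;
        0.5 * peak * (1.0 + (std::f64::consts::PI * t).cos())
    }
}

fn main() {
    let (peak, total, warmup) = (3e-4, 155_000, 2_000);
    // LR reaches the peak exactly at the end of warmup ...
    assert!((lr_at(warmup, peak, total, warmup) - peak).abs() < 1e-12);
    // ... is below peak mid-warmup, and decays to ~0 at the final step.
    assert!(lr_at(1_000, peak, total, warmup) < peak);
    assert!(lr_at(total, peak, total, warmup) < 1e-9);
    println!("lr at step 24,000 ≈ {:.2e}", lr_at(24_000, peak, total, warmup));
}
```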

Training History

15 training runs, each revealing and fixing infrastructure bugs:

| Run | Steps | Best val_ppl | Outcome | Key Fix |
| --- | --- | --- | --- | --- |
| v2 | 1K | 1,008 | Crashed | ALB-073: PTX instruction bug |
| v3 | 28K | 1,018 | Plateau | ALB-079: no cosine LR decay |
| v5 | | | Failed | ALB-092: gradient accumulation bug |
| v8 | 5K | | Killed | ALB-106: trained without RoPE |
| v9 | 15K | 129 | Stopped | Best genuine result (490M tokens) |
| v13 | 62K | 239 (inflated) | Stopped | ALB-120: data position not checkpointed |
| v14 | 20K | 782 | Killed | Degenerate init (seed=42) |
| v15 | 24K+ | 309 | Running | Phase change at step 3K |

Sovereign AI Stack

| Component | Role | Gaps Fixed |
| --- | --- | --- |
| entrenar | Training engine | 40+ (CUDA kernels, optimizer, checkpoint) |
| trueno | GPU tensor ops | 15+ (RoPE, RMSNorm, cuBLAS, PTX) |
| aprender (apr) | CLI | 10+ (eval, train, checkpoint) |
| realizar | Inference | 5+ (Qwen3 MoE, Q4K) |
| alimentar | Data pipeline | 5+ (Parquet, FIM) |
| provable-contracts | Verification | 39 contracts |

Reproduce

```sh
# Build the CLI
cd ~/src/aprender && cargo build --release -p apr-cli

# Train from scratch (RTX 4090, ~5 days)
apr train apply --task pretrain --config configs/train/pretrain-350m-v15.yaml

# Evaluate
apr eval --task humaneval --model checkpoints/albor-base-350m-v15/model-best.apr
```

License

Apache-2.0
