Albor (Spanish: "dawn") — A sovereign Python code completion model trained from first principles using only the Sovereign AI stack.
Specification Book · Training Log · Gap Register
A 350M-parameter decoder-only transformer for Python code completion, trained entirely in Rust with zero Python dependencies. Every operation — data loading, tokenization, training, evaluation, checkpointing — uses the Sovereign AI stack. No PyTorch, no pip, no conda.
The project has two goals:
- Produce a usable Python code completion model that runs anywhere Rust compiles
- Identify and fix every gap in the Sovereign AI stack that blocks end-to-end LLM development — 98 gaps found and fixed so far
v15 training RUNNING — Step 24K/155K (15.6%), 786M tokens processed.
| Metric | Value |
|---|---|
| Training run | v15 (15th attempt, seed=123) |
| Step | 24,000 / 155,000 (15.6%) |
| Best val_ppl | 309 (step 9K, pre-outage) |
| Throughput | 14,900 tok/s, 46.9% MFU |
| Hardware | RTX 4090 (24 GB), single GPU |
| Data | codeparrot-clean, 5.08B tokens (73% Chinchilla) |
| ETA | ~3.3 days remaining |
Phase change at step 3K (earliest ever). Power outage at step 11K — resumed from the step-10K checkpoint. Post-resume best val_ppl: 400 (step 17K). 98 gaps fixed, including ALB-122 (a trueno PTX bug discovered during the resume).
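The ~3.3-day ETA in the table follows directly from the live counters. A minimal sketch (tokens-per-step is inferred from the counters, not read from the training config):

```rust
// ETA estimate from the status table above. All inputs are the table's values;
// tokens-per-step (~32,750) is inferred, consistent with a 1024-token context
// times a ~32-sequence batch.
fn eta_days() -> f64 {
    let total_steps = 155_000.0_f64;
    let current_step = 24_000.0;
    let tokens_so_far = 786.0e6;
    let tok_per_s = 14_900.0;

    let tok_per_step = tokens_so_far / current_step;
    let remaining_tokens = (total_steps - current_step) * tok_per_step;
    remaining_tokens / tok_per_s / 86_400.0 // seconds per day
}

fn main() {
    println!("ETA: {:.1} days", eta_days()); // ≈ 3.3, matching the table
}
```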
Through 15 training runs and 98 bug fixes, the project has:
- Trained a 350M transformer on a single GPU entirely in Rust — no Python anywhere in the stack
- Achieved 8.5K tok/s at 24.6% MFU on an RTX 4090 with hand-written PTX kernels + cuBLAS
- Discovered and fixed 98 infrastructure gaps across 6 upstream repos, including:
- Silent memory corruption in CUDA backward kernels (ALB-041, 043, 059)
- Missing RoPE backward pass (ALB-119) — model trained without position gradients
- GPU optimizer state not checkpointed (ALB-118) — resume destroyed weights
- Data loader position not checkpointed (ALB-120) — resume caused data overlap
- Activation gradient overflow at GPU-CPU boundary (ALB-044) — NaN in embeddings
- Stream synchronization race conditions (ALB-065) — stale GPU data on D2H transfer
- Reached val_ppl=129 (v9) on 490M tokens — v15 is on track to beat this with 10x more data
- Built 39 provable contracts verified by `pv` (provable-contracts)
- 108/108 batuta falsification tests PASS (Toyota Standard grade)
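The MFU figures above follow the standard estimate: achieved FLOP/s ≈ 6·N·(tokens/s) for forward plus backward, divided by the GPU's peak. A hedged sketch — the 82.6 TFLOP/s FP32 peak for the RTX 4090 is an assumption here, and the repo's own accounting may use a different peak, so this gives a ballpark rather than reproducing the quoted 24.6%:

```rust
// Model FLOPs Utilization: 6 * params * throughput / peak FLOP/s.
// The peak value is an assumed RTX 4090 FP32 figure, not taken from the repo.
fn mfu(params: f64, tok_per_s: f64, peak_flops: f64) -> f64 {
    6.0 * params * tok_per_s / peak_flops
}

fn main() {
    // 370M effective params at 8.5K tok/s against an assumed 82.6 TFLOP/s peak
    let m = mfu(370.0e6, 8_500.0, 82.6e12);
    println!("MFU ≈ {:.1}%", m * 100.0); // ≈ 22.8% under these assumptions
}
```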
Why this is hard (Textbooks Are All You Need): phi-1-small (350M) achieved 45% HumanEval — but with 7B tokens of synthetic textbook-quality data generated by GPT-3.5. Albor trains on codeparrot-clean (raw GitHub Python, ~50GB). Data quality is the primary ceiling, not model architecture. Distillation from Qwen3-Coder-30B partially compensates.
Scaling position (Chinchilla, Beyond Chinchilla): Our 5.08B tokens on 350M params (14.5:1) is below Chinchilla-optimal (20:1 = 7B tokens). Modern practice overtrains small models far beyond this — Llama 3 uses 1875:1. For inference-optimized deployment, training longer on a smaller model is preferred.
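The scaling arithmetic above can be sanity-checked directly (all inputs come from the text):

```rust
// Tokens-per-parameter ratio and fraction of the Chinchilla 20:1 optimum.
fn tokens_per_param(tokens: f64, params: f64) -> f64 {
    tokens / params
}

fn chinchilla_fraction(tokens: f64, params: f64) -> f64 {
    tokens / (20.0 * params) // 20:1 optimum for this model size = 7B tokens
}

fn main() {
    let (params, tokens) = (350.0e6, 5.08e9);
    println!(
        "{:.1}:1, {:.0}% of Chinchilla-optimal",
        tokens_per_param(tokens, params),
        chinchilla_fraction(tokens, params) * 100.0
    ); // 14.5:1, 73% — matching the figures in the text
}
```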
Competitive landscape (CodeGen): CodeGen-350M-mono achieves 10.2% HumanEval pass@1, trained on 577B tokens. No sub-1B model has appeared on the Big Code Models Leaderboard.
| Milestone | When | Success Criteria |
|---|---|---|
| v15 surpasses v9 (val_ppl < 129) | Step 10-15K (~2 days) | Sentence-level patterns learned |
| v15 reaches ppl < 50 | Step 155K (~5 days) | Syntactic structure captured |
| HumanEval pass@1 > 0% | val_ppl < 100 | First valid Python generation |
| Distillation from Qwen3-Coder-30B | After base model | Synthetic textbook-style data |
| HumanEval pass@1 > 10% | After distillation | Beat CodeGen-350M-mono (10.2%) |
| Big Code Leaderboard submission | After distillation | First sub-1B entry |
Minimum viable (Phase 3): Base model converges to val_ppl < 100 and achieves HumanEval pass@1 > 5%. Proves the sovereign stack can train a working code model end-to-end in Rust.
Good (Phase 5): Distilled model hits HumanEval pass@1 > 10%, beating CodeGen-350M-mono (10.2%). Proves distillation from Qwen3-Coder-30B MoE teacher through the sovereign stack produces competitive results.
Full success (Phase 8): All 6 model variants benchmarked, Q4 model under 100MB runs at <50ms/token on CPU, submitted to Big Code Leaderboard as the first sub-1B entry. The stack is proven end-to-end.
LLaMA-style decoder-only transformer
├── 24 layers, 1024 hidden dim, 16 attention heads, 4 KV heads (GQA)
├── SwiGLU FFN (4096 intermediate), RoPE, RMSNorm (pre-norm)
├── 32,768 vocab (ByteLevel BPE v2), 1024 context (GPU-resident)
├── ~370M parameters, GPU-resident AdamW on RTX 4090 (~13 GB VRAM)
└── Cosine LR schedule (3e-4 peak, 155K steps, 2K warmup)
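The schedule on the last line can be sketched as follows. This is a minimal sketch using the spec's values (3e-4 peak, 2K warmup, 155K total steps); the linear warmup shape and the decay-to-zero floor are assumptions — entrenar's actual schedule may differ:

```rust
use std::f64::consts::PI;

// Cosine LR decay with linear warmup. Peak, warmup, and total steps are the
// spec's values; decaying to zero (rather than, say, 10% of peak) is assumed.
fn lr_at(step: u32) -> f64 {
    let (peak, warmup, total) = (3e-4_f64, 2_000.0, 155_000.0);
    let s = step as f64;
    if s < warmup {
        peak * s / warmup // linear warmup from 0 to peak
    } else {
        let t = (s - warmup) / (total - warmup); // decay progress in [0, 1]
        0.5 * peak * (1.0 + (PI * t).cos())      // cosine decay to 0
    }
}
```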
15 training runs, each revealing and fixing infrastructure bugs:
| Run | Steps | Best val_ppl | Outcome | Key Fix |
|---|---|---|---|---|
| v2 | 1K | 1,008 | Crashed | ALB-073: PTX instruction bug |
| v3 | 28K | 1,018 | Plateau | ALB-079: no cosine LR decay |
| v5 | — | — | Failed | ALB-092: gradient accumulation bug |
| v8 | 5K | — | Killed | ALB-106: trained without RoPE |
| v9 | 15K | 129 | Stopped | Best genuine result (490M tokens) |
| v13 | 62K | 239 (inflated) | Stopped | ALB-120: data position not checkpointed |
| v14 | 20K | 782 | Killed | Degenerate init (seed=42) |
| v15 | 24K+ | 309 | Running | Phase change at step 3K |
| Component | Role | Gaps Fixed |
|---|---|---|
| entrenar | Training engine | 40+ (CUDA kernels, optimizer, checkpoint) |
| trueno | GPU tensor ops | 15+ (RoPE, RMSNorm, cuBLAS, PTX) |
| aprender (apr) | CLI | 10+ (eval, train, checkpoint) |
| realizar | Inference | 5+ (Qwen3 MoE, Q4K) |
| alimentar | Data pipeline | 5+ (Parquet, FIM) |
| provable-contracts | Verification | 39 contracts |
```bash
# Build the CLI
cd ~/src/aprender && cargo build --release -p apr-cli

# Train from scratch (RTX 4090, ~5 days)
apr train apply --task pretrain --config configs/train/pretrain-350m-v15.yaml

# Evaluate
apr eval --task humaneval --model checkpoints/albor-base-350m-v15/model-best.apr
```

Apache-2.0