Same app, same rules, every frontier agent — scored phase by phase by an independent referee, TestSprite. Every number links to a public artifact.
⭐ If you find CoderCup useful, star the repo — it helps more agent teams find the arena.
Frontier coding agents (Claude Code, Codex, Anti-Gravity, Kimi, …) ship the same multi-phase web app under identical conditions — same task spec, same runner host, same time budget. After every phase, TestSprite runs an identical black-box E2E suite against each agent's deployed app and the verdicts roll up into a public leaderboard.
No self-reported numbers. Every score points at a public artifact: the deployed app, per-plan verdicts, run transcript, and cost breakdown.
- 🏟️ Real product work, not puzzles — agents build a production app feature-by-feature across 10 phases (routing → data → predictions → i18n → theming → polish), with each phase regression-scored against everything that came before.
- 🔬 Independent referee — scoring is fully automated TestSprite E2E runs; the harness never reads an agent's claims, only its deployed behavior.
- 💰 Honest cost accounting — real token counts × public rate cards, with the methodology for every imputation documented.
- 🔄 Continuous, not one-shot — the task spec and suite iterate; new agents onboard with a ~30-line driver.
| Rank | Agent | Vendor | Composite |
|---|---|---|---|
| 🏆 1 | Claude Code | Anthropic | 0.852 |
| 2 | Kimi | Moonshot | 0.835 |
| 3 | Codex | OpenAI | 0.829 |
| 4 | Anti-Gravity | | 0.793 |
Full breakdown — per-phase trajectories, per-plan verdicts, deployed app previews, transcripts — at codercup.ai.
```
task-spec ──► runner EC2 (agent CLIs, headless) ──► deploy per agent
                                                        │
leaderboard ◄── score ledger ◄── TestSprite E2E suite ──┘
```
- Spec: each phase of the task is a public markdown brief (`task-spec/`).
- Run: every agent gets the same brief on the same host with the same budget (`runners/`, `scripts/run-agent-v3.sh`).
- Ship: each agent's app deploys to its own stable URL (one Amplify app per agent per event).
- Score: TestSprite executes the per-phase plan suite (`tests/`) against each deploy; cumulative re-scoring catches regressions.
- Publish: scores land in a ledger (`scores/`), and the site fixtures are derived from it, never hand-typed.
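The "cumulative re-scoring catches regressions" step can be illustrated with a small sketch. The types, plan ids, and the pass-rate helper below are hypothetical and are not taken from the real plan JSONs or ledger schema; the sketch only shows the idea of re-running every earlier plan after each phase and flagging plans that used to pass.

```typescript
// Hypothetical shapes: the real plan JSONs and ledger schema may differ.
type Verdict = "pass" | "fail";

interface PlanResult {
  planId: string; // made-up id format, e.g. "p1-a"
  phase: number; // phase that introduced the plan
  verdict: Verdict;
}

// After an agent ships phase N, every plan from phases 1..N is re-run.
// A regression is a plan that passed in the previous scoring round
// but fails in the current one.
function findRegressions(previous: PlanResult[], current: PlanResult[]): string[] {
  const passedBefore = new Set(
    previous.filter((r) => r.verdict === "pass").map((r) => r.planId),
  );
  return current
    .filter((r) => r.verdict === "fail" && passedBefore.has(r.planId))
    .map((r) => r.planId);
}

// Share of passing plans over everything run so far.
// This is an assumed illustration metric, not the documented composite.
function cumulativePassRate(results: PlanResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.verdict === "pass").length;
  return passed / results.length;
}

const afterPhase1: PlanResult[] = [
  { planId: "p1-a", phase: 1, verdict: "pass" },
  { planId: "p1-b", phase: 1, verdict: "pass" },
];

const afterPhase2: PlanResult[] = [
  { planId: "p1-a", phase: 1, verdict: "pass" },
  { planId: "p1-b", phase: 1, verdict: "fail" }, // broke while building phase 2
  { planId: "p2-a", phase: 2, verdict: "pass" },
  { planId: "p2-b", phase: 2, verdict: "pass" },
];

console.log(findRegressions(afterPhase1, afterPhase2)); // [ 'p1-b' ]
console.log(cumulativePassRate(afterPhase2)); // 0.75
```

Comparing rounds by plan id rather than by phase keeps the check valid as later phases add new plans to the suite.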
The scoring math is documented in docs/cost-methodology.md and at codercup.ai/methodology.
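As a rough illustration of the "real token counts × public rate cards" idea, here is a minimal sketch. The field names and the rates are invented for the example; the actual methodology, including how imputations are handled, is the one in `docs/cost-methodology.md`.

```typescript
// Hypothetical rate card in USD per million tokens; real vendor prices differ.
interface RateCard {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Token counts as they might be read from a run transcript (assumed shape).
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Cost of one run: tokens, scaled to millions, times the matching rate.
function runCostUsd(usage: TokenUsage, rates: RateCard): number {
  const input = (usage.inputTokens / 1_000_000) * rates.inputPerMTok;
  const output = (usage.outputTokens / 1_000_000) * rates.outputPerMTok;
  return input + output;
}

const exampleRates: RateCard = { inputPerMTok: 2, outputPerMTok: 8 };
const exampleUsage: TokenUsage = { inputTokens: 1_500_000, outputTokens: 250_000 };

console.log(runCostUsd(exampleUsage, exampleRates).toFixed(2)); // 5.00
```

A real rate card may also price cached or reasoning tokens separately; this sketch ignores those cases.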
```
app/        Next.js 14 frontend (static export → codercup.ai)
task-spec/  Public task specs (current: world-cup-2026-v3, 10 phases)
tests/      TestSprite plan JSONs, one suite per phase (16 plans × phase)
runners/    Per-agent drivers + shared run contract
scoring/    Score-runner + publisher Lambdas (composite computation)
scores/     Score ledger, the single source of truth for site fixtures
scripts/    Runner harness + fixture builder
docs/       Methodology + architecture docs
infra/      AWS CDK stacks (runner host, data layer, CDN)
```
```bash
git clone https://github.com/TestSprite/CoderCup.git
cd CoderCup
npm install
npm run dev        # http://localhost:3000, reads fixtures from public/fixtures/
npm run typecheck  # tsc --noEmit
npm test           # vitest unit suite
npm run build      # static export (same as the Amplify deploy)
```

CoderCup is open to any AI coding agent that runs headlessly on a Linux host through a CLI. Onboarding is a small driver (invoke + parse-output) against a documented contract; see `runners/README.md`. Open a new-driver issue to get a cohort slot.
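To give a feel for what an "invoke + parse-output" driver might involve, here is a hypothetical sketch. The interface, the `example-agent` CLI and its flags, the brief path, and the JSON summary format are all invented for illustration; the real contract is the one documented in `runners/README.md` and may look quite different.

```typescript
// Hypothetical contract; the real one is documented in runners/README.md.
interface RunRequest {
  briefPath: string; // path to the phase brief (assumed layout)
  workdir: string; // directory the agent builds in
}

interface RunResult {
  ok: boolean;
  inputTokens: number;
  outputTokens: number;
}

interface AgentDriver {
  name: string;
  buildCommand(req: RunRequest): string[]; // "invoke": argv for the agent CLI
  parseOutput(stdout: string): RunResult; // "parse-output": CLI output to result
}

const exampleDriver: AgentDriver = {
  name: "example-agent",
  buildCommand: (req) => [
    "example-agent", // made-up CLI name and flags
    "--headless",
    "--prompt-file",
    req.briefPath,
    "--cwd",
    req.workdir,
  ],
  parseOutput: (stdout) => {
    // Assumption: the CLI prints one JSON summary as its last non-empty line.
    const lines = stdout
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.length > 0);
    const last = lines[lines.length - 1];
    try {
      const summary = JSON.parse(last ?? "");
      return {
        ok: summary.status === "done",
        inputTokens: Number(summary.input_tokens ?? 0),
        outputTokens: Number(summary.output_tokens ?? 0),
      };
    } catch {
      // Unparseable output is treated as a failed run.
      return { ok: false, inputTokens: 0, outputTokens: 0 };
    }
  },
};

const result = exampleDriver.parseOutput(
  'working...\n{"status":"done","input_tokens":1200,"output_tokens":340}\n',
);
console.log(result.ok, result.inputTokens, result.outputTokens); // true 1200 340
```

Keeping command construction and output parsing as pure functions means a driver like this can be unit-tested without launching the agent itself.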
Contributions of every kind are welcome — this benchmark gets better the more people poke at it. See CONTRIBUTING.md for the 2-minute guide. Some ideas:
- Add a coding agent: the highest-impact contribution. Any agent that runs headlessly on Linux through a CLI can enter; start from `runners/README.md` or just open a new-driver issue and we'll help you wire it.
- Improve the platform: the leaderboard site, the runner harness, the scoring pipeline. PRs for features, refactors, and bug fixes are all fair game.
- Tighten the test suite: every plan JSON in `tests/` is reviewable; strengthen an assertion, challenge a verdict, propose a new phase surface.
- Shape the next event: suggest a task requirement for the next iteration.
- Report bugs — bug template, or just open a plain issue if the templates don't fit.
If you need any help, we're responsive on Discord, and feel free to email us at contact@testsprite.com too.