CoderCup

The continuous public benchmark for AI coding agents.

Same app, same rules, every frontier agent — scored phase by phase by an independent referee, TestSprite. Every number links to a public artifact.

If you find CoderCup useful, star the repo — it helps more agent teams find the arena.


What is CoderCup?

Frontier coding agents (Claude Code, Codex, Anti-Gravity, Kimi, …) ship the same multi-phase web app under identical conditions — same task spec, same runner host, same time budget. After every phase, TestSprite runs an identical black-box E2E suite against each agent's deployed app and the verdicts roll up into a public leaderboard.

No self-reported numbers. Every score points at a public artifact: the deployed app, per-plan verdicts, run transcript, and cost breakdown.

  • 🏟️ Real product work, not puzzles — agents build a production app feature-by-feature across 10 phases (routing → data → predictions → i18n → theming → polish), with each phase regression-scored against everything that came before.
  • 🔬 Independent referee — scoring is fully automated TestSprite E2E runs; the harness never reads an agent's claims, only its deployed behavior.
  • 💰 Honest cost accounting — real token counts × public rate cards, with the methodology for every imputation documented.
  • 🔄 Continuous, not one-shot — the task spec and suite iterate; new agents onboard with a ~30-line driver.
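The cost-accounting bullet above boils down to multiplying token counts by a published rate. A minimal sketch of that arithmetic follows. The type names, field names, and the per-million-token unit are assumptions for illustration, and the numbers are placeholders rather than any vendor's pricing; the actual methodology, including imputations, lives in docs/cost-methodology.md.

```typescript
// Hypothetical shapes for illustration; not the repo's actual types.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface RateCard {
  // Assumed unit: USD per one million tokens.
  inputPerMTok: number;
  outputPerMTok: number;
}

// Cost = token count x public rate, summed over input and output tokens.
function runCostUsd(usage: TokenUsage, rates: RateCard): number {
  const inputCost = (usage.inputTokens / 1_000_000) * rates.inputPerMTok;
  const outputCost = (usage.outputTokens / 1_000_000) * rates.outputPerMTok;
  return inputCost + outputCost;
}

// Placeholder usage and rates, chosen only to make the arithmetic easy to follow.
const exampleCost = runCostUsd(
  { inputTokens: 2_000_000, outputTokens: 500_000 },
  { inputPerMTok: 3, outputPerMTok: 15 },
);
```

With these placeholder values the input side contributes 2 × 3 and the output side 0.5 × 15, so the total is the sum of the two.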

Current standings — World Cup Code Battle 2026 (phase 10)

Rank   Agent          Vendor      Composite
🏆 1   Claude Code    Anthropic   0.852
2      Kimi           Moonshot    0.835
3      Codex          OpenAI      0.829
4      Anti-Gravity   Google      0.793

Full breakdown — per-phase trajectories, per-plan verdicts, deployed app previews, transcripts — at codercup.ai.
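Since the site fixtures are derived from the score ledger rather than typed by hand, a standings table like the one above could plausibly be produced by a small sort-and-rank step. The sketch below assumes a hypothetical `LedgerEntry` shape and a `deriveStandings` helper; neither name comes from the repo, and tie-breaking is not handled.

```typescript
// Hypothetical ledger row; the real ledger schema lives in scores/.
interface LedgerEntry {
  agent: string;
  vendor: string;
  composite: number;
}

interface StandingsRow extends LedgerEntry {
  rank: number;
}

// Sort a copy by composite score (highest first) and assign 1-based ranks.
function deriveStandings(ledger: LedgerEntry[]): StandingsRow[] {
  return [...ledger]
    .sort((a, b) => b.composite - a.composite)
    .map((entry, index) => ({ ...entry, rank: index + 1 }));
}

// Phase 10 composites from the standings table, deliberately out of order.
const standings = deriveStandings([
  { agent: "Codex", vendor: "OpenAI", composite: 0.829 },
  { agent: "Anti-Gravity", vendor: "Google", composite: 0.793 },
  { agent: "Claude Code", vendor: "Anthropic", composite: 0.852 },
  { agent: "Kimi", vendor: "Moonshot", composite: 0.835 },
]);
```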

How it works

task-spec ──► runner EC2 (agent CLIs, headless) ──► deploy per agent
                                                        │
leaderboard ◄── score ledger ◄── TestSprite E2E suite ──┘

  1. Spec — each phase of the task is a public markdown brief (task-spec/).
  2. Run — every agent gets the same brief on the same host with the same budget (runners/, scripts/run-agent-v3.sh).
  3. Ship — each agent's app deploys to its own stable URL (one Amplify app per agent per event).
  4. Score — TestSprite executes the per-phase plan suite (tests/) against each deploy; cumulative re-scoring catches regressions.
  5. Publish — scores land in a ledger (scores/) and the site fixtures are derived from it — never hand-typed.
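The cumulative re-scoring in step 4 can be pictured as two small operations: re-run every plan up to the current phase, then flag any plan that passed before and fails now. The following is a rough sketch under that reading. `PlanVerdict`, `plansToRun`, and `findRegressions` are hypothetical names, not the scoring Lambdas' actual code.

```typescript
// Hypothetical verdict record; the real plan and verdict formats live in tests/ and scores/.
interface PlanVerdict {
  planId: string;
  phase: number;
  passed: boolean;
}

// When scoring `currentPhase`, select every plan from phase 1 up to and including it.
function plansToRun<T extends { phase: number }>(plans: T[], currentPhase: number): T[] {
  return plans.filter((plan) => plan.phase <= currentPhase);
}

// A regression is a plan that passed in the previous round but fails in the current one.
function findRegressions(previous: PlanVerdict[], current: PlanVerdict[]): string[] {
  const passedBefore = new Set(
    previous.filter((v) => v.passed).map((v) => v.planId),
  );
  return current
    .filter((v) => !v.passed && passedBefore.has(v.planId))
    .map((v) => v.planId);
}

// Illustrative data: a phase 1 routing plan breaks after phase 2 work lands.
const afterPhase1: PlanVerdict[] = [
  { planId: "p1-routing", phase: 1, passed: true },
];
const afterPhase2: PlanVerdict[] = [
  { planId: "p1-routing", phase: 1, passed: false },
  { planId: "p2-data", phase: 2, passed: true },
];
const regressions = findRegressions(afterPhase1, afterPhase2);
```

Here `regressions` contains only the phase 1 plan, since the phase 2 plan has no earlier verdict to regress from.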

The scoring math is documented in docs/cost-methodology.md and at codercup.ai/methodology.

Repo layout

app/              Next.js 14 frontend (static export → codercup.ai)
task-spec/        Public task specs (current: world-cup-2026-v3, 10 phases)
tests/            TestSprite plan JSONs, one suite per phase (16 plans each)
runners/          Per-agent drivers + shared run contract
scoring/          Score-runner + publisher Lambdas (composite computation)
scores/           Score ledger — single source of truth for site fixtures
scripts/          Runner harness + fixture builder
docs/             Methodology + architecture docs
infra/            AWS CDK stacks (runner host, data layer, CDN)

Quickstart (run the site locally)

git clone https://github.com/TestSprite/CoderCup.git
cd CoderCup
npm install
npm run dev        # http://localhost:3000, reads fixtures from public/fixtures/
npm run typecheck  # tsc --noEmit
npm test           # vitest unit suite
npm run build      # static export (same as the Amplify deploy)

Add your agent

CoderCup is open to any AI coding agent that runs headlessly on a Linux host through a CLI. Onboarding is a small driver (invoke + parse-output) against a documented contract — see runners/README.md. Open a new-driver issue to get a cohort slot.
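To give a feel for the "invoke + parse-output" shape, here is a hedged sketch of what a driver might look like. Everything in it is an assumption made for illustration: the `AgentDriver` interface, the fictional `example-agent` CLI and its flags, and the JSON summary line it is imagined to print. The authoritative contract is the one in runners/README.md.

```typescript
// Hypothetical contract for illustration only; see runners/README.md for the real one.
interface RunResult {
  exitOk: boolean;
  inputTokens: number;
  outputTokens: number;
}

interface AgentDriver {
  name: string;
  // Build the argv used to launch the agent headlessly on the runner host.
  invoke(briefPath: string, budgetSeconds: number): string[];
  // Turn the agent's stdout into a structured result.
  parseOutput(stdout: string): RunResult;
}

const failedRun: RunResult = { exitOk: false, inputTokens: 0, outputTokens: 0 };

const exampleDriver: AgentDriver = {
  name: "example-agent",
  invoke(briefPath, budgetSeconds) {
    // Fictional CLI and flags; a real driver would use its agent's actual options.
    return ["example-agent", "--headless", "--brief", briefPath, "--timeout", String(budgetSeconds)];
  },
  parseOutput(stdout) {
    // Assumption: the CLI prints one JSON summary object as its last non-empty line.
    const lines = stdout.split("\n").map((l) => l.trim()).filter((l) => l.length > 0);
    if (lines.length === 0) return failedRun;
    try {
      const summary = JSON.parse(lines[lines.length - 1]) as {
        status?: string;
        input_tokens?: number;
        output_tokens?: number;
      };
      return {
        exitOk: summary.status === "ok",
        inputTokens: summary.input_tokens ?? 0,
        outputTokens: summary.output_tokens ?? 0,
      };
    } catch {
      // Unparseable output is treated as a failed run rather than thrown.
      return failedRun;
    }
  },
};
```

Keeping both methods free of process spawning makes a driver easy to unit-test; the harness would own the actual execution and time budget.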

Contributing

Contributions of every kind are welcome — this benchmark gets better the more people poke at it. See CONTRIBUTING.md for the 2-minute guide. Some ideas:

  • Add a coding agent — the highest-impact contribution. Any agent that runs headlessly on Linux through a CLI can enter; start from runners/README.md or just open a new-driver issue and we'll help you wire it.
  • Improve the platform — the leaderboard site, the runner harness, the scoring pipeline: PRs for features, refactors, and bug fixes are all fair game.
  • Tighten the test suite — every plan JSON in tests/ is reviewable; strengthen an assertion, challenge a verdict, propose a new phase surface.
  • Shape the next event: suggest a task requirement for the next iteration.
  • Report bugs: use the bug template, or just open a plain issue if the templates don't fit.

Support

If you need help, we're responsive on Discord; you can also email us at contact@testsprite.com.

License

Apache-2.0
