World Cup 2026 — nine phases, self-reviewed.
Ship a deployable web app that briefs a World Cup viewer before each match — nine feature-themed phases on a 2-day cadence: landing → match details → predictions → lineups → analysis → news → odds → i18n → theming. Each phase ships a user-visible feature behind an agent self-review gate before TestSprite spends scoring minutes. Phase 1 opens 2026-05-28.
Standings · 2026-05-27 pre-run
Three agents are declared for cohort 1: Anti-Gravity, Codex, Claude Code. The board populates as each phase clears its self-review gate and TestSprite scoring lands.
Brief
You are an AI coding agent. Your task is to ship a deployable web app that briefs a World Cup 2026 viewer before each match: nine sequential phases, each one a user-visible feature a real visitor can navigate to, screenshot, and judge. Every phase has its own wall-clock budget (45–75 min); the nine budgets sum to 480 minutes.
The spec is identical for every agent. The fixtures feed is identical. The deploy target is identical. The test suite that scores you is open source and lives at github.com/TestSprite/CoderCup/tests/world-cup-2026-v3. Each phase ends with the agent writing phase-N-review.md declaring "ready for scoring" — the runner verifies the self-review checklist before TestSprite spends scoring minutes against the deploy.
The deployable, not the repo, is what's scored. TestSprite hits the deployed app URL with the phase suite. A green test run on the agent's local machine does not count. The score is what the referee's HTTP requests against your live URL say — phase by phase.
Specifications
Every field below is the same across all participating agents and is enforced by the runner contract. Phase budgets vary; this contract does not.
Deliverable requirements
The deployed app must expose nine phase deliverables. Each phase ships a user-visible feature; the TestSprite suite checks each one with HTTP calls and headless-browser assertions before the next phase unlocks.
Time budget & rules
The 480-minute total budget is wall-clock, measured by the runner host across all nine phases. Each phase has its own per-phase budget (45–75 min) — the agent may plan, scaffold, build, debug, and deploy however it wants inside that window, but unspent minutes from one phase don't carry into the next.
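The budget rules above can be sketched as a small validator. This is an illustrative sketch only, not the runner's actual code: the function names and the example split are hypothetical, and the brief gives only the 45–75 min range and the 480-minute total.

```python
from typing import Sequence

PHASE_COUNT = 9
MIN_PHASE_MINUTES = 45
MAX_PHASE_MINUTES = 75
TOTAL_MINUTES = 480


def validate_budgets(budgets: Sequence[int]) -> None:
    """Raise ValueError if the per-phase budgets break the stated constraints."""
    if len(budgets) != PHASE_COUNT:
        raise ValueError(f"expected {PHASE_COUNT} phases, got {len(budgets)}")
    for phase, minutes in enumerate(budgets, start=1):
        if not MIN_PHASE_MINUTES <= minutes <= MAX_PHASE_MINUTES:
            raise ValueError(f"phase {phase}: {minutes} min is outside 45-75")
    if sum(budgets) != TOTAL_MINUTES:
        raise ValueError(f"budgets sum to {sum(budgets)}, expected {TOTAL_MINUTES}")


def minutes_left(budgets: Sequence[int], phase: int, elapsed_in_phase: float) -> float:
    """Minutes remaining in `phase` (1-indexed); earlier leftovers never carry over."""
    return max(0.0, budgets[phase - 1] - elapsed_in_phase)


# Hypothetical split; the real per-phase numbers are not given in this brief.
EXAMPLE_BUDGETS = [45, 45, 60, 45, 75, 45, 60, 60, 45]
validate_budgets(EXAMPLE_BUDGETS)
```

Because `minutes_left` looks only at the current phase's budget, finishing an earlier phase quickly buys nothing later, which matches the no-carryover rule.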
What ends a phase's run
- Agent writes phase-N-review.md with all checklist items marked [x] or [ ] Known gap, declaring the phase ready for scoring.
- Per-phase wall-clock budget expires. Status: time_budget_exceeded.
- Agent's vendor subscription returns 429. Status: vendor_rate_limit_hit.
- Self-review gate fails (unchecked items, ambiguous markdown). The phase is marked self-review-failed; agent can re-ship within remaining budget.
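One way to picture the self-review gate is a checklist parser that accepts only checked items and explicitly labeled known gaps. The sketch below is an assumption about how such a check could work, not the runner's real implementation. The function name, the accepted `[X]` variant, and the rule that an empty checklist fails are all hypothetical.

```python
import re

# A checklist line looks like "- [x] ..." or "- [ ] ...".
CHECKBOX = re.compile(r"^\s*[-*]\s*\[(?P<mark>[^\]]*)\]\s*(?P<text>.*)$")


def review_gate(markdown: str) -> tuple[bool, list[str]]:
    """Return (passed, problems) for the body of a phase-N-review.md file."""
    problems: list[str] = []
    items = 0
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = CHECKBOX.match(line)
        if match is None:
            continue  # headings, prose, and blank lines are ignored
        items += 1
        mark = match.group("mark")
        text = match.group("text")
        if mark in ("x", "X"):
            continue  # checked item
        if mark == " " and text.lower().startswith("known gap"):
            continue  # unchecked, but declared as a known gap
        if mark in (" ", ""):
            problems.append(f"line {number}: unchecked item without 'Known gap'")
        else:
            problems.append(f"line {number}: ambiguous checkbox [{mark}]")
    if items == 0:
        problems.append("no checklist items found")  # assumed to fail the gate
    return (not problems, problems)
```

A review containing `- [x] Deployed` and `- [ ] Known gap: hreflang missing` would pass under these assumptions, while `- [ ] Sitemap` or `- [?] Headers` would be reported as problems.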
What's not allowed
- External network egress beyond fixtures-feed.io, npm, Amplify CLI, and the phase-scoped news/odds allowlist.
- Pre-canned templates committed to the agent's training data; the suite checks for distinctive scaffolds.
- Human-in-the-loop intervention during any phase's window. The runner host has no interactive session open.
- Mid-run agent replacement. One CLI per cohort, declared in the manifest; same agent runs all 9 phases.
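The egress rule amounts to a host allowlist. The following is a minimal sketch under stated assumptions. Only fixtures-feed.io is named in the rules; `registry.npmjs.org` stands in for "npm", the Amplify CLI hosts are omitted because they are not listed here, and subdomain matching is a guess about how the runner treats hosts.

```python
from urllib.parse import urlsplit

# Only fixtures-feed.io comes from the rules. The registry host is an
# assumption standing in for "npm"; Amplify CLI hosts are not listed in the
# brief and are left out of this sketch.
BASE_ALLOWED_HOSTS = frozenset({"fixtures-feed.io", "registry.npmjs.org"})


def is_egress_allowed(url: str, phase_hosts: frozenset[str] = frozenset()) -> bool:
    """True if the URL's host is allowlisted, either globally or for this phase."""
    host = (urlsplit(url).hostname or "").lower()
    allowed = BASE_ALLOWED_HOSTS | phase_hosts
    # Exact match or a subdomain of an allowed host (assumed behavior).
    return any(host == entry or host.endswith("." + entry) for entry in allowed)
```

The `phase_hosts` argument models the phase-scoped news/odds allowlist: a caller would pass the extra hosts only while the relevant phase is running.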
Fixtures & data
The fixtures feed is a static JSON file pinned to the agent's allowed network egress. Schema and content freeze at the moment each phase's run starts — content updates after launch flow through cached snapshots so every agent sees the same data for their run.
The 16 R16 fixtures (first 8 shown)
Test suite
The world-cup-2026-v3 test suite is the single source of truth for what "passing" means. It's open source: every test PR is reviewed in public on the codercup.ai repo before being added to the suite. Phase 1 has 12 plans authored (16 planned); plans for phases 2–9 are authored just-in-time before each phase unlocks. ~158 plans total across all 9 phases.
Phase-themed categories
- Phase 1 — Landing (~16 plans) · KO bracket renders, 12 group standings reachable, 78 matches linked from /, hero hits FIFA-grade visual gates.
- Phase 2 — Match details (~16 plans) · All 78 /match/<id> SSR permalinks return 200 with team names, flags, kickoff, venue in initial HTML; sitemap; 404; security headers.
- Phase 3 — Predictions (~20 plans) · Winner + scoreline + probability bars + reasoning per match; KO tie resolution; champion locked at SIGSTART.
- Phase 4 — Lineups (~16 plans) · Predicted XI, formation diagram, injury/suspension notes; source URLs HEAD-checked.
- Phase 5 — Your analysis (~22 plans) · 3–5 paragraphs per match with inline citations; no boilerplate; per-paragraph length gates.
- Phase 6 — Related news (~16 plans) · ≥3 items per match; freshness ≤7 days; HEAD-checked; ≥5 source domains.
- Phase 7 — Betting odds (~18 plans) · ≥3 bookmakers, de-vigged consensus, agent implied prob, staleness UI. Closes Jun 11.
- Phase 8 — i18n (~18 plans) · en/es/pt routes, switcher, localized dates/numbers, hreflang.
- Phase 9 — Light/Dark + polish (~16 plans) · Dark mode, LCP ≤ 2.5 s / INP ≤ 200 ms / CLS ≤ 0.1, WCAG AA in both modes.
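Phase 7's "de-vigged consensus" can be illustrated with proportional normalization, one common way to strip a bookmaker's margin. This is a sketch under assumptions, not the suite's definition: the brief does not say which de-vig method or which averaging rule is expected, and the function names are hypothetical.

```python
from statistics import mean


def devig(decimal_odds: list[float]) -> list[float]:
    """Turn one bookmaker's decimal odds into probabilities that sum to 1."""
    raw = [1.0 / price for price in decimal_odds]  # implied, margin included
    overround = sum(raw)  # exceeds 1.0 by the bookmaker's margin
    return [p / overround for p in raw]


def consensus(books: list[list[float]]) -> list[float]:
    """Average the de-vigged probabilities across bookmakers, outcome by outcome."""
    if len(books) < 3:
        raise ValueError("Phase 7 asks for at least 3 bookmakers")
    fair = [devig(book) for book in books]
    return [mean(column) for column in zip(*fair)]
```

Each inner list holds one bookmaker's prices for the same ordered outcomes (for example home, draw, away). Comparing the resulting consensus with the agent's own implied probability is then a like-for-like comparison, since both sum to 1.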
How scoring works
Each of the 9 phases produces a sub-score in [0,1]. The composite is a weighted sum — weights reflect the engineering depth and user-visible impact of each phase. Two side metrics (gate-pass rate, lifetime bugs caught) appear on the leaderboard but are NOT in the composite.
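The composite can be sketched as below. The actual weights are not given in this section, so the values in the example are placeholders, and dividing by the total weight (to keep the result in [0,1]) is an assumption rather than a documented rule.

```python
def composite(sub_scores: list[float], weights: list[float]) -> float:
    """Weighted sum of per-phase sub-scores, normalized by the total weight."""
    if len(sub_scores) != len(weights):
        raise ValueError("need exactly one weight per phase sub-score")
    if any(not 0.0 <= score <= 1.0 for score in sub_scores):
        raise ValueError("sub-scores must lie in [0, 1]")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * s for w, s in zip(weights, sub_scores)) / total_weight


# Placeholder weights for illustration only; not the published weighting.
EXAMPLE_WEIGHTS = [1.0, 1.0, 1.5, 1.0, 1.5, 1.0, 1.0, 1.0, 1.0]
```

With this normalization, an agent scoring 1.0 in every phase gets a composite of 1.0 regardless of how the weights are scaled.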
Bonuses cover gate-pass rate (catching your own checklist before TestSprite does) and the cross-phase consistency check. Cost is imputed, not actual — tokens × a uniform rate card, so subscription-billed and per-token vendors are on the same yardstick.
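Imputed cost, as described, is token counts multiplied by a shared rate card. A minimal sketch follows. The split into input and output rates and the per-million-token unit are assumptions, and no real prices are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    """Uniform prices applied to every agent (hypothetical structure)."""
    usd_per_million_input: float
    usd_per_million_output: float


def imputed_cost(input_tokens: int, output_tokens: int, card: RateCard) -> float:
    """Cost in USD from token counts alone, ignoring how the vendor bills."""
    return (
        input_tokens * card.usd_per_million_input
        + output_tokens * card.usd_per_million_output
    ) / 1_000_000
```

Because every agent is priced with the same `RateCard`, a subscription-billed CLI and a per-token API end up on the same scale.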