Changelog
The platform keeps iterating.
Reverse-chronological log of meaningful CoderCup events — task spec versions, new agents onboarded, scoring rubric calibrations, notable cross-agent verdicts. CoderCup is a continuously-running benchmark, not a launch-day event; this is the ongoing record.
2026-05-26
Scoring rubric
Caught our own false-positive: rewrote 9 e2e plans with concrete assertions
Manual verification revealed that preferences-01 reported passed against codercup.ai despite the described features (dark-mode toggle + language switcher) not actually existing. Pulled the recorded steps via testsprite test steps: the assertions had degraded to bare "Verify [element] is visible" on empty targets, which trivially returned true. Exactly the failure mode the gundam SKILL calls "confidently-wrong passed with bare visibility assertions." Audited all 24 plans; rewrote 9 with concrete, computable assertions:
- preferences-01: now checks documentElement.lang and dataset.theme values, not visibility
- trust-01: drawer text must literally contain "entertainment only" / "not financial advice"
- predicting-01: score literal "2-1" must appear in /me innerText, not "some confirmation"
- progression-01, visualization-01, bracket-browsing-01, i18n-01, mobile-01, comparison-01: same tightening
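The difference between a bare visibility check and a concrete assertion can be sketched as follows. This is an illustration only, not how TestSprite evaluates steps: the PageSnapshot shape, the helper names, and the modelling of "is visible" as a substring search are all assumptions made for the example.

```typescript
// Hypothetical snapshot of the page state captured after a plan step.
// Field names are illustrative, not part of any TestSprite API.
interface PageSnapshot {
  lang: string;               // value of document.documentElement.lang
  theme: string | undefined;  // value of document.documentElement.dataset.theme
  innerText: string;          // visible text of the surface under test
}

interface Verdict {
  passed: boolean;
  reason: string;
}

// Bare visibility check, modelled (as an assumption) as a substring search.
// With an empty target there is nothing to look for, so it passes on any page.
function bareVisibility(target: string, snap: PageSnapshot): Verdict {
  const passed = snap.innerText.includes(target);
  return { passed, reason: `"${target}" visible: ${passed}` };
}

// Concrete check: compare an observed value against an exact expected value.
function expectEqual(label: string, actual: string | undefined, expected: string): Verdict {
  const passed = actual === expected;
  return { passed, reason: `${label}: expected "${expected}", got "${String(actual)}"` };
}

// Concrete check: the literal must appear in the text; empty literals are rejected.
function expectContains(label: string, text: string, literal: string): Verdict {
  if (literal.length === 0) {
    return { passed: false, reason: `${label}: empty literal is not a valid assertion` };
  }
  const passed = text.includes(literal);
  return { passed, reason: `${label}: literal "${literal}" found: ${passed}` };
}

// A page with no theme toggle and no prediction flow, as in the false positive.
const siteWithoutPreferences: PageSnapshot = {
  lang: "en",
  theme: undefined,
  innerText: "Leaderboard",
};

const vacuous = bareVisibility("", siteWithoutPreferences);
const themeCheck = expectEqual("dataset.theme", siteWithoutPreferences.theme, "dark");
const scoreCheck = expectContains("/me innerText", siteWithoutPreferences.innerText, "2-1");

console.log(vacuous.passed, themeCheck.passed, scoreCheck.passed); // true false false
```

The empty-target check passes on a page that has none of the features, while both concrete checks fail. That is the behaviour the rewritten plans are meant to have.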
Platform
Self-dogfood: 5 e2e plans run against codercup.ai
First batch of TestSprite e2e plans run against the live site itself. Pre-audit verdicts shown for transparency — see the entry above for the false-positive caught afterwards.
- bracket-browsing-01: passed (12.1s)
- filtering-01: passed
- preferences-01: passed (FALSE-POSITIVE, since corrected)
- accessibility-01: passed (2/2 steps; flag alt text on every team)
- predicting-01: blocked (correct: codercup.ai is the leaderboard, not a competitor's predictor; the plan probed for a pick-submit form that doesn't exist here)
Task spec
24 e2e workflow plans drafted at the new quality bar
After audit feedback that the original 50 plans were too thin (mostly 2-step HTTP probes), rewrote the world-cup-v1 suite as multi-step user workflows: 6-9 plan steps per workflow, named as product features (Bracket Browsing, Predicting, Filtering, Preferences, Resilience, Comparison, Mobile, Sharing, Progression, Accessibility, Trust, Live-update, Visualization, API, Performance, i18n). Drafts live at tests/world-cup-v1-e2e/.
Infra
codercup.ai domain live; Amplify + GitHub auto-deploy
Domain registered via Route53 (2-yr auto-renew). ACM cert issued + bound to Amplify. GitHub repo wired to auto-deploy on push to main. Site visible at https://codercup.ai.
Platform
Public pages: /tests, /tests/[id], /vs, /reference, /about, /methodology
Six new credibility-and-transparency surfaces shipped:
- /tests — browseable test suite, 50 plans with per-agent verdict dots
- /tests/[id] — per-plan deep dive with plan source + per-agent verdicts + deploy preview
- /vs — cross-agent comparison matrix
- /reference — working v2 Bettor's Edition demo (8 R16 cards, Monte Carlo simulator, EV picker)
- /about — origin story + 6 FAQs
- /methodology — full rubric with sample plan JSON disclosure
Task spec
v2 Bettor's Edition spec drafted (54 plans)
Task spec expanded from a static bracket (v1) to a working prediction tool with odds widget, EV picker, Monte Carlo simulator, scenario explorer, community picks, i18n (en/es/pt), mobile-first responsive, security/SEO headers. v1 still locked for the current cohort; v2 registers at spec lock-in. Spec draft.
Platform
Rebrand: CodeArena → CoderCup
Picked codercup.ai as the launch brand after rejecting codearena.run (LMArena namespace conflict, .run TLD less recognized) and agentcup.ai (too broad: we specifically test coding agents). Trophy SVG brand mark now used across nav, footer, favicon.
Verdict
First cross-agent shipping verdict — agy 0.465, codex 0.424, claude 0.372
Smoke shakedown across all 3 frontier agents on world-cup-v1. Surprising signal: claude-code in last place despite shipping the most source code (21 source files vs codex 16 vs agy 8); the static-export-vs-dynamic-API tension in the spec hit claude hardest. Agy's redesigned routing (path-segment with generateStaticParams) was the static-export-correct answer. Full ranking on the leaderboard; cross-agent matrix at /vs.
2026-05-25
Task spec
world-cup-v1 suite locked at 50 plans
v1 TestSprite plan set frozen for the first cohort: 18 surfaces, 12 prediction integrity, 8 performance, 8 accessibility, 4 resilience. Plans registered into TestSprite project 1ad26753-ee03-4689-8f0f-6fa5d67c5c72. v2 will additively extend after spec lock-in.