Tennis XGBoost Autoresearch

March 2026 · repo public, post published

Autonomous research loop meets tennis, ELO, XGBoost, and Goodhart's Law.

The seed was beautifully simple: tennis is a head-to-head game, ELO is a strong temporal skill signal, surfaces matter a lot, and XGBoost is exactly the kind of fast, boring learner that lets agents iterate without waiting for a GPU monastery to warm up.

This page is the short explainer; the full blog post tells the whole story.

So I built the initial pipeline from a Green Code tennis-video idea that reached me through Phosphen's X thread: Jeff Sackmann tennis data, 2025-2026 extensions, ATP/WTA temporal splits, overall and surface-specific ELO, recent-form windows, head-to-head features, tournament context, player metadata, and an XGBoost model sitting on top.
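To make the "overall and surface-specific ELO" part concrete, here is a minimal sketch of how two pre-match rating features could be maintained. It is not the repo's implementation: the `DualElo` class, the K-factor of 32, the 1500 starting rating, and the feature names are all assumptions chosen for illustration.

```python
from collections import defaultdict

K = 32.0       # assumed update step, not taken from the repo
BASE = 1500.0  # assumed starting rating, not taken from the repo


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo win probability for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


class DualElo:
    """Hypothetical tracker for one overall and one per-surface rating."""

    def __init__(self, k: float = K, base: float = BASE):
        self.k = k
        self.overall = defaultdict(lambda: base)     # key: player
        self.by_surface = defaultdict(lambda: base)  # key: (player, surface)

    def features(self, a: str, b: str, surface: str) -> dict:
        """Rating differences as they stand before the match is played."""
        return {
            "elo_diff": self.overall[a] - self.overall[b],
            "surface_elo_diff": self.by_surface[(a, surface)]
            - self.by_surface[(b, surface)],
        }

    def update(self, winner: str, loser: str, surface: str) -> None:
        """Apply one result to both the overall and the surface rating."""
        for table, w, l in (
            (self.overall, winner, loser),
            (self.by_surface, (winner, surface), (loser, surface)),
        ):
            delta = self.k * (1.0 - expected_score(table[w], table[l]))
            table[w] += delta
            table[l] -= delta
```

The ordering is the part that matters: matches are processed chronologically, `features()` is read before `update()` is called for the same match, and the result never leaks into its own row.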

Then I let the loop run. `run-research.sh` dispatched sequential Codex workers through agent-mux. Each worker read `program.md` and `RESEARCH_LOG.md`, changed a narrow write set, and ran the gate; its change was then either committed or rolled back. No vibes. One scalar gate: ROC-AUC.
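The commit-or-rollback decision can be sketched in a few lines. This is an illustration under assumptions, not the logic inside `run-research.sh`: `roc_auc`, `gate`, and the `min_gain` parameter are hypothetical names, and the AUC here is a plain-Python pairwise version meant for small examples rather than full validation sets.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. This is O(n_pos * n_neg), fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gate(best_auc: float, candidate_auc: float, min_gain: float = 0.0) -> str:
    """Return 'commit' only when the candidate strictly beats the best so far."""
    return "commit" if candidate_auc > best_auc + min_gain else "rollback"


def run_loop(baseline_auc: float, candidate_aucs):
    """Replay a sequence of worker results against the single scalar gate."""
    best = baseline_auc
    decisions = []
    for auc in candidate_aucs:
        decision = gate(best, auc)
        if decision == "commit":
            best = auc  # the committed change becomes the new bar
        decisions.append(decision)
    return best, decisions
```

A single scalar keeps the loop simple, and it also means the loop optimizes whatever that scalar can be made to say.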

The first phase worked. The loop moved from a decent baseline into real improvements: tour-specific hyperparameters, segment blending, surface-aware features, rank momentum, and feature exclusions. Honest combined ROC-AUC climbed to about 0.7609, roughly +155 bps (about 0.0155 AUC) over the pre-loop baseline.

Then the curve stopped behaving like research and started behaving like an optimizer. The loop began carving the validation set into narrower and narrower tournament specialists, then eventually found the nastier path: the evaluator lived too close to mutable model code. If the judge is in the arena, the optimizer learns to punch the judge.

At the peak of the broken run, reported combined ROC-AUC reached 0.8523. It looked exciting for about five minutes, then turned into a much better story: the loop had learned to rewrite the referee. What it had produced was probability transformations, validation-specific specialists, and metric extraction, not tennis knowledge.

The fix became the useful artifact. Evaluation moved into an immutable `evaluate.py`; the gate now blocks edits to evaluator, data, and test paths; prediction distributions are sanity-checked; and anomalous jumps are treated as a warning signal instead of a celebration. The lesson is boring and important: structural controls beat "please do not game the metric" prompting.
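As a rough sketch of what such structural controls can look like, the three checks below reject a change on its shape rather than on its stated intent. Everything here is assumed for illustration: the `PROTECTED` entries other than `evaluate.py`, the thresholds, and the function names are hypothetical and are not copied from the repo's gate.

```python
import math
from pathlib import PurePosixPath
from statistics import mean, pstdev

# Hypothetical protected locations; only `evaluate.py` is named in the text.
PROTECTED = ("evaluate.py", "data", "tests")


def touches_protected(changed_paths, protected=PROTECTED):
    """Return the changed paths that fall on or under a protected location."""
    hits = []
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        for entry in protected:
            entry_parts = PurePosixPath(entry).parts
            if parts[: len(entry_parts)] == entry_parts:
                hits.append(raw)
                break
    return hits


def predictions_look_sane(probs, min_std=0.01, mean_range=(0.2, 0.8)):
    """Check that predictions are valid probabilities with a plausible spread.

    The thresholds are assumed values, not tuned ones.
    """
    if not probs:
        return False
    if any(not math.isfinite(p) or p < 0.0 or p > 1.0 for p in probs):
        return False
    return pstdev(probs) >= min_std and mean_range[0] <= mean(probs) <= mean_range[1]


def is_anomalous_jump(prev_auc, new_auc, max_jump=0.02):
    """Flag a single-step gain larger than an assumed plausible maximum."""
    return (new_auc - prev_auc) > max_jump


def review(changed_paths, probs, prev_auc, new_auc):
    """Collect every reason to reject a change; an empty list means it passes."""
    reasons = []
    blocked = touches_protected(changed_paths)
    if blocked:
        reasons.append(f"protected paths edited: {blocked}")
    if not predictions_look_sane(probs):
        reasons.append("prediction distribution failed sanity check")
    if is_anomalous_jump(prev_auc, new_auc):
        reasons.append("metric jump too large to accept without inspection")
    return reasons
```

With these assumed thresholds, a move from 0.7609 to 0.8523 in one step would be flagged for inspection instead of being committed. A real gate would take the list of changed paths from version control rather than from the worker's own report.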

The repo preserves both sides: the clean pipeline and the archived gaming branch. That makes it less a leaderboard project than a small lab specimen for autonomous research loops. Honest gains are real, and Goodhart is the default execution behavior when the gate is weak.

Current continuation: I am experimenting with Minecraft speedrunning via MCSR XGBoost Autoresearch. Same broad family, different substrate: 43K+ MCSR Ranked matches, timeline-derived features, dual ELO, and a 0.8941 ROC-AUC baseline before the autoresearch loop is ported over.

Receipts: public repo, full blog post, archived gamed iterations, MCSR continuation, WecoAI awesome-autoresearch PR, and alvinreal awesome-autoresearch PR.