Overview
I wrote a simple harness to run OpenRouter LLMs in known-winning endgame positions.
There’s plenty of prior art [1, 2] on LLMs playing chess, dating back even to a cool fine-tuned GPT-2 project [3], but overall they still can’t really do this, which I found surprising.
Results
I burned on the order of $200 to obtain these runs.
Interesting aspects to me:
- For the most part the models didn’t struggle with making illegal moves (unexpected)
- Tested models were great at the mate-in-one
- Google models were strongest by far
- The Google models nailed the pawn endgame position that I took from a chess.com tutorial with good SEO, but were only partially successful with a pawn endgame of lesser difficulty that I set up manually.
- More abstract endgames proved more difficult (rook endgame) or impossible (knight + bishop)
- grok 4.20 managed to get checkmated and lose from a forced-win position
- No model was able to win an “odds” matchup [4] that I personally can win comfortably. This isn’t really that meaningful, and it’s not that surprising either
- gemini flash produces extremely unhinged and incoherent messages in place of valid SAN moves when the position gets too hard. I haven’t reviewed the other message traces in detail yet, but the “2 Knight + Move odds gemini flash sample 5” trace appears to contain an entire blog post ranting about the Invisible Hand of the Algorithm or something, complete with HTML formatting
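To make the “valid SAN move versus incoherent message” distinction concrete, here is a minimal sketch of a shape-only check on a model reply. It is not the harness’s actual code: the names `SAN_RE` and `parse_san_reply` are hypothetical, and the pattern only tests whether the reply looks like a single SAN token. It says nothing about whether the move is legal in the position, which would still need a board-aware library.

```python
import re
from typing import Optional

# Hypothetical pattern: accepts text that is *shaped* like one SAN move.
# It does not know the board, so a well-formed but illegal move still passes.
SAN_RE = re.compile(
    r"""
    (?:
        O-O(?:-O)?                           # castling, short or long
      | [KQRBN][a-h]?[1-8]?x?[a-h][1-8]      # piece move, optional disambiguation/capture
      | (?:[a-h]x)?[a-h][1-8](?:=[QRBN])?    # pawn move, optional capture/promotion
    )
    [+#]?                                    # optional check or mate suffix
    """,
    re.VERBOSE,
)


def parse_san_reply(reply: str) -> Optional[str]:
    """Return the trimmed reply if it is a single SAN-shaped token, else None."""
    candidate = reply.strip()
    return candidate if SAN_RE.fullmatch(candidate) else None
```

A reply such as `Qh7#` passes this check, while an HTML-formatted rant or a sentence that merely mentions a move does not. Requiring the whole reply to match, rather than searching for a move inside it, is a deliberately strict choice that keeps “the model answered with a move” separate from “the model wrote something that happens to contain one”.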
I was surprised by the poor performance of Anthropic models, and ran this to confirm that the configured “medium” reasoning effort in OpenRouter wasn’t the culprit.
Prior Art
GPTChessElo, 2023-09-30. (From what I can gather, gpt-3.5-turbo-instruct was a standout chess player, but I didn’t replicate this.)