AI Plays Poker and Mafia: Game Arena Changes the Benchmark
- Poker and Mafia (Werewolf) Added to Kaggle Game Arena
- Gemini 3 Pro/Flash Ranked 1st and 2nd on Chess and Mafia Leaderboards
- Three-Day Live Event with Hikaru Nakamura Commentary Underway
What Happened?
Google DeepMind added Poker and Werewolf (Mafia) to the Kaggle Game Arena. [Google Blog] DeepMind’s Oran Kelly explained the reasoning behind the expansion: “Chess is a perfect information game. The real world isn’t.” [TechBuzz]
Why is it Important?
Frankly, existing AI benchmarks have clear limitations. Scores are hitting the ceiling, and data contamination is a serious problem. Game Arena takes a different approach.
| Game | Measured Ability | Characteristics |
|---|---|---|
| Chess | Strategic Reasoning | Perfect Information |
| Poker | Risk Assessment | Imperfect Information + Probability |
| Mafia | Social Reasoning, Deception Detection | Natural Language Team Game |
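To make the poker column concrete, here is a toy pot-odds calculation of the kind an agent must make under imperfect information. The numbers and function name are purely illustrative and are not taken from Game Arena.

```python
# Illustrative only: a toy pot-odds / expected-value calculation for a poker
# call decision. All numbers are hypothetical.

def call_ev(pot: float, bet_to_call: float, win_probability: float) -> float:
    """EV of calling: win (pot + opponent's bet) with p, lose the call with 1 - p."""
    return win_probability * (pot + bet_to_call) - (1 - win_probability) * bet_to_call

if __name__ == "__main__":
    pot = 100.0          # chips in the middle before the opponent's bet
    bet_to_call = 50.0   # opponent's bet we must match to continue
    p_win = 0.30         # estimated chance our hand wins at showdown

    ev = call_ev(pot, bet_to_call, p_win)
    # Break-even win probability is bet / (pot + 2 * bet) = 50 / 200 = 25%.
    breakeven = bet_to_call / (pot + 2 * bet_to_call)
    print(f"EV of calling: {ev:+.1f} chips (break-even p = {breakeven:.2%})")
```

With a 30% estimated win probability the call is slightly profitable (+10 chips); the same decision with 20% would be a fold, which is exactly the kind of probabilistic judgment the benchmark is meant to measure.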
Mafia is also very useful for AI safety research. Because models play both the deceiving role and the truth-seeking role, it tests an AI’s ability to deceive, and to detect deception, in a controlled environment. [TechBuzz]
Personally, I think it’s a necessary benchmark in the age of agentic AI.
What Will Happen in the Future?
Gemini 3 Pro and Flash are ranked 1st and 2nd on the Chess and Mafia leaderboards. [Google Blog] A live event runs from February 2nd to 4th, with commentary from chess GM Hikaru Nakamura, poker pro Doug Polk, and others. [TechBuzz]
Future plans include expansion to multiplayer video games and real-world simulations. The open-source harness is available on GitHub. [GitHub]
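The repository defines its own interfaces, so purely as a rough illustration, here is a hypothetical sketch of how an LLM can be wrapped as a game-playing agent. The names below (GameState, LLMAgent.act, call_model) are assumptions for this sketch, not the actual Game Arena harness API.

```python
# Hypothetical sketch of wrapping an LLM as a game-playing agent.
# The interface (GameState, LLMAgent.act, call_model) is assumed for
# illustration; it is NOT the real Game Arena harness API.
from dataclasses import dataclass

@dataclass
class GameState:
    """Everything the agent is allowed to see on its turn."""
    observation: str          # natural-language description of the visible state
    legal_actions: list[str]  # actions the harness will accept

def call_model(prompt: str) -> str:
    """Placeholder for a call to whichever LLM is being benchmarked."""
    raise NotImplementedError("plug in your model's API client here")

class LLMAgent:
    """Turns a game observation into a single legal action via an LLM."""

    def act(self, state: GameState) -> str:
        prompt = (
            "You are playing a game.\n"
            f"Current situation:\n{state.observation}\n"
            f"Legal actions: {', '.join(state.legal_actions)}\n"
            "Reply with exactly one legal action."
        )
        reply = call_model(prompt).strip()
        # Fall back to the first legal action if the reply is invalid,
        # so the agent never forfeits on a formatting mistake.
        return reply if reply in state.legal_actions else state.legal_actions[0]
```

The point of this kind of adapter is that the same model wrapper can be dropped into chess, poker, or Mafia without changing the game logic itself.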
Frequently Asked Questions (FAQ)
Q: Can models other than Gemini participate?
A: Yes. Kaggle Game Arena is an independent public benchmark platform on which various frontier models compete against each other, and new models can be added through the open-source harness, so anyone can participate.
Q: Do game benchmarks reflect actual AI performance?
A: They are more realistic than existing multiple-choice benchmarks: poker tests decision-making under uncertainty, and Mafia tests social reasoning in natural language. That said, games are still constrained environments and do not fully capture real-world complexity.
Q: Can LLMs beat chess engines like Stockfish?
A: Not yet. Stockfish calculates millions of moves per second, while LLMs rely on pattern recognition. Interestingly, LLMs reason more like human players, using concepts such as piece activity and pawn structure.
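As a concrete illustration of those human-style concepts, here is a small sketch using the python-chess library to compute two classic heuristics: mobility (a proxy for piece activity) and doubled pawns (a proxy for pawn structure). This is not how Stockfish or any LLM actually evaluates positions; it only makes the vocabulary concrete.

```python
# Simple heuristics of the kind human players (and LLM commentary) reason in,
# computed with the python-chess library.
import chess

def mobility(board: chess.Board) -> int:
    """Piece-activity proxy: number of legal moves for the side to move."""
    return sum(1 for _ in board.legal_moves)

def doubled_pawns(board: chess.Board, color: chess.Color) -> int:
    """Pawn-structure proxy: count extra pawns stacked on the same file."""
    files = [chess.square_file(sq) for sq in board.pieces(chess.PAWN, color)]
    return sum(files.count(f) - 1 for f in set(files))

if __name__ == "__main__":
    board = chess.Board()  # standard starting position
    print("White mobility:", mobility(board))                         # 20
    print("White doubled pawns:", doubled_pawns(board, chess.WHITE))  # 0
```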
If this article was helpful, please subscribe to AI Digester.
Reference Materials
- Advancing AI benchmarking with Game Arena – Google Blog (2026-02-02)
- Google DeepMind Expands Game Arena AI Benchmarks – TechBuzz (2026-02-02)
- Game Arena GitHub Repository – GitHub (2026-02-02)