Researchers are now using Super Mario Bros. as a benchmark to test AI capabilities, arguing that it presents a tougher challenge than Pokémon. On Friday, Hao AI Lab, a research group at the University of California, San Diego, ran AI models through live Super Mario Bros. games. Anthropic’s Claude 3.7 performed best, followed by Claude 3.5, while Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled to navigate the game effectively.
The game used in the test wasn’t the original 1985 release but a modified version running in an emulator, which the researchers integrated with their in-house framework, GamingAgent, to let AI models control Mario. GamingAgent supplied each model with in-game screenshots and basic instructions, such as “If an obstacle or enemy is near, move/jump left to dodge.” The model then wrote Python code that generated controller inputs to maneuver Mario through the level.
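A minimal sketch of what such a screenshot-in, code-out loop might look like is below. Every name in it (`capture_screenshot`, `query_model`, `press`) is a hypothetical stand-in, not GamingAgent’s actual API, and the emulator and model calls are stubbed so the script runs on its own:

```python
import time

# --- Hypothetical stubs; illustrative only, not GamingAgent's real interface. ---

def capture_screenshot() -> bytes:
    """Grab the current emulator frame (stubbed as empty PNG bytes)."""
    return b""

def query_model(prompt: str, frame: bytes) -> str:
    """Send the prompt plus screenshot to a vision-language model; the model
    replies with a short Python snippet (stubbed here as a fixed action)."""
    return "press('right'); press('A')"

def press(button: str) -> None:
    """Forward a button press to the emulator (stubbed as a print)."""
    print(f"pressed {button}")

PROMPT = (
    "You control Mario. If an obstacle or enemy is near, move/jump left to dodge. "
    "Respond only with Python calls to press(button)."
)

def agent_loop(steps: int = 3, delay: float = 0.5) -> None:
    for _ in range(steps):
        frame = capture_screenshot()
        snippet = query_model(PROMPT, frame)
        # The model's reply is executed as code; exposing only press()
        # confines the snippet to controller inputs.
        exec(snippet, {"press": press})
        time.sleep(delay)

if __name__ == "__main__":
    agent_loop()
```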
Despite this assistance, the game still demanded strategic planning and complex maneuvers from the models. Surprisingly, the study found that reasoning-based AI models, such as OpenAI’s o1, which work through problems step by step, performed worse than non-reasoning models. The researchers attribute this to latency: reasoning models often take seconds to settle on an action, a major disadvantage in real-time games like Super Mario Bros., where split-second decisions determine success or failure.
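A back-of-the-envelope calculation makes the latency penalty concrete. The per-decision times below are assumed for illustration, not figures from the study:

```python
FPS = 60  # the NES renders roughly 60 frames per second

# Assumed, illustrative per-decision latencies (not measured in the study)
latencies = {"non-reasoning model": 0.5, "reasoning model": 4.0}

for label, seconds in latencies.items():
    frames = int(seconds * FPS)
    print(f"{label}: ~{seconds}s per decision -> {frames} frames pass unplayed")
```

Even a few seconds of deliberation means hundreds of frames elapse before the chosen input lands, long enough for Mario to coast into an enemy or off a ledge.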
While video games have been used to test AI for decades, some experts question whether gaming benchmarks truly measure technological progress. Unlike the real world, games are abstract and structured, and they supply effectively unlimited training data. Former OpenAI researcher Andrej Karpathy went so far as to call the broader trend an “evaluation crisis,” saying that current AI performance metrics are unclear and difficult to interpret.