A new AI benchmark based on the classic arcade game Street Fighter III was conceived during the Mistral AI hackathon in San Francisco. The LLM Colosseum benchmark, created by Stan Girard of Quivr, pits AI combatants against one another inside an emulator, an unconventional twist on the traditional benchmarking landscape.
Matthew Berman, an AI enthusiast, presents the novel beat 'em up LLM tournament in an embedded video. Beyond showcasing the street-fighting action, Berman's video walks viewers through setting up the open-source project on their own computers so they can experience it firsthand.
(Image credit: OpenGenerativeAI team)
Unusually for an LLM benchmark, smaller models may hold an inherent advantage here: lower latency means faster responses, and faster responses can translate into more victories in the arena. As with human players of beat 'em ups, swift reactions are pivotal for countering an opponent's moves, and the same principle carries over to AI-versus-AI showdowns.
The LLMs in this benchmark make decisions in real time during combat. Operating as text-based entities, they receive a description of the current game state, reason about it, and then commit to their next moves, choosing from actions such as moving closer, retreating, unleashing fireballs, executing powerful punches, summoning hurricanes, and casting mega-fireballs.
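To make that loop concrete, here is a minimal sketch of how such a text-based fighting agent could work. Everything in it, the function names, the prompt wording, and the exact move list, is an illustrative assumption rather than the project's actual API; the real benchmark drives the emulator directly and uses its own action set.

```python
import random

# Illustrative move set; the real project exposes its own game-specific actions.
MOVES = ["Move Closer", "Move Away", "Fireball", "Mega Punch", "Hurricane", "Mega Fireball"]

def build_prompt(game_state: dict) -> str:
    """Turn the emulator's game state into a text prompt for the LLM."""
    return (
        f"You are Ken in Street Fighter III. "
        f"Your health: {game_state['own_health']}, opponent health: {game_state['opp_health']}. "
        f"Distance to opponent: {game_state['distance']} pixels. "
        f"Choose ONE move from: {', '.join(MOVES)}. Reply with the move name only."
    )

def choose_move(llm_reply: str) -> str:
    """Parse the model's free-text reply back into a legal move."""
    for move in MOVES:
        if move.lower() in llm_reply.lower():
            return move
    # Fall back to a random legal move if the model names something else.
    return random.choice(MOVES)

# Example round-trip with a stubbed model reply:
state = {"own_health": 120, "opp_health": 95, "distance": 60}
prompt = build_prompt(state)
print(choose_move("I will throw a Fireball!"))  # -> "Fireball"
```

Note the fallback in choose_move: because the model replies in free text, the parser has to decide what to do when the reply names no legal move, which is exactly where the hallucination issues described below come into play.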
(Image credit: OpenGenerativeAI team)
The video shows battles flowing smoothly, with the models countering adeptly, defending, and deploying their special abilities tactically. However, the current version of the project restricts both players to Ken, a character known for balanced attributes, which makes the matches somewhat less varied to watch.
In the quest to determine the premier Street Fighter III AI contender, Girard's own assessment of eight competing LLMs crowns OpenAI's GPT-3.5 Turbo the victor with an ELO of 1776. In a separate evaluation by Amazon executive Banjo Obayomi, Anthropic's claude_3_haiku came out on top (ELO 1613) after 14 LLMs fought 314 matches.
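For readers curious how those ratings arise, the sketch below shows the standard Elo update rule that leaderboards like these typically apply after every match. The K-factor of 32 and the 1500 starting rating are common defaults assumed here for illustration, not values taken from either evaluation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one match; K=32 is a common default."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two models start at the customary 1500 and play one match:
a, b = update_elo(1500, 1500, a_won=True)
print(round(a), round(b))  # -> 1516 1484
```

Repeated over hundreds of matches, as in Obayomi's 314-match bracket, these incremental updates converge toward a stable ranking of the contenders.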
Banjo also noted cases where LLM quirks, such as hallucinated moves and safety-driven refusals, affected how specific models performed in combat.
That observation raises the question of whether this benchmark has practical value for evaluating LLMs or is merely a diversion. More intricate games could potentially yield deeper insights, but interpreting their outcomes would likely be harder than reading the straightforward win/loss results of Street Fighter III.