AI continues to generate plenty of light and heat. The top models in text and images are now available through subscriptions, integrated into consumer products, and competing closely, with companies like OpenAI, Google, and Anthropic leading the pack.
It comes as no surprise that AI researchers are exploring new frontiers for generative models. Given the massive data requirements of AI, one way to anticipate future developments is by examining the vast but largely untapped online data landscape.
Video, being abundant, stands out as a logical next step. Recently, OpenAI unveiled Sora, a groundbreaking text-to-video AI that stunned onlookers.
But what about video games?
The Quest for Innovation
Interestingly, there is a wealth of gaming footage online. Google DeepMind, for instance, says it trained a new AI system, Genie, on 30,000 hours of curated video of gamers playing simple platformers, reminiscent of early Nintendo games. As a result, Genie can now generate its own examples.
Genie transforms a basic image, photograph, or sketch into an interactive video game.
Upon receiving a prompt, such as a character sketch and its environment, the AI can interact with a player to guide the character through its virtual realm. DeepMind showcased Genie’s creations traversing 2D landscapes, moving around, or leaping between platforms. Some of these virtual worlds were even inspired by AI-generated images, creating a fascinating loop of creativity.
Unlike traditional video games, Genie constructs these interactive worlds frame by frame. Given a prompt and a movement command, it predicts and generates each subsequent frame on the fly. It has even mastered parallax, a common visual effect in platformers in which the foreground moves faster than the background.
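Conceptually, this interactive loop is autoregressive: each new frame is conditioned on the frames generated so far plus the player's latest command. The sketch below illustrates that control flow only; the model call (`predict_next_frame`) and the action vocabulary are hypothetical stand-ins, not DeepMind's actual API.

```python
# Sketch of a Genie-style interactive loop: each frame is predicted
# from the history of frames plus the player's latest discrete action.
# predict_next_frame is a hypothetical stand-in for the learned model.

ACTIONS = {"left", "right", "jump", "noop"}

def predict_next_frame(frames, action):
    """Stand-in for the model: record history length and action."""
    return {"index": len(frames), "action": action}

def play(prompt_frame, action_stream, max_frames=100):
    """Start from a prompt frame, then generate one frame per action."""
    frames = [prompt_frame]
    for action in action_stream:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        frames.append(predict_next_frame(frames, action))
        if len(frames) >= max_frames:
            break
    return frames

frames = play({"index": 0, "action": None}, ["right", "right", "jump"])
print(len(frames))  # 4: the prompt frame plus one frame per action
```

The point of the loop is that there is no pre-built level: the "game" exists only as the sequence of frames the model has emitted so far.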
Remarkably, Genie’s training did not involve explicit labels. Instead, the AI learned to correlate input commands—like left, right, or jump—with in-game actions simply by observing examples during training. This autonomous learning ability suggests that future iterations could potentially leverage a vast amount of relevant online video content.
While Genie serves as a compelling proof of concept, its development is still in its early stages, and DeepMind has not yet announced plans to release the model to the public.
The games themselves depict pixelated worlds unfolding at a leisurely pace of one frame per second. In comparison, modern video games can achieve frame rates of 60 to 120 frames per second. Additionally, like all generative algorithms, Genie may exhibit peculiar or inconsistent visual anomalies. The research team noted that it is also prone to envisioning “unrealistic futures.”
Nonetheless, there are promising indicators that Genie will evolve further.
Crafting Virtual Realms
Given that the AI can learn from unlabeled online videos and is relatively compact at 11 billion parameters, there is ample room to scale. Larger models trained on bigger datasets tend to improve significantly. And with the industry's growing emphasis on inference (the process by which a trained AI performs tasks like generating images or text), generation speed is likely to increase as well.
DeepMind envisions that Genie could assist various individuals, including professional developers, in creating video games. Similar to OpenAI’s broader perspective with Sora, the team is contemplating larger applications beyond gaming. For instance, the approach could extend to controlling robots using AI. By training a separate model on videos of robotic arms performing diverse tasks, the AI learned to manipulate the robots and interact with various objects.
Furthermore, DeepMind proposed that the video game environments generated by Genie could serve as training grounds for AI agents. This strategy is not entirely novel. In a 2021 paper, another DeepMind team introduced XLand, a video game populated by AI agents and overseen by an AI overlord that designs tasks and challenges for them. The notion that the next evolutionary leap in AI will necessitate algorithms capable of training each other or generating synthetic training data is gaining traction.
This ongoing innovation represents the latest development in the fierce competition between OpenAI and Google to showcase advancements in AI. While players like Anthropic are progressing with multimodal models akin to GPT-4, Google and OpenAI are particularly focused on algorithms that simulate reality. Such algorithms are poised to excel in planning and interaction, essential skills for the AI agents these organizations aim to create.
The researchers at DeepMind articulated, “Genie can be prompted with images it has never seen before, such as real-world photographs or sketches, enabling individuals to interact with their envisioned virtual worlds—essentially serving as a foundational world model.” They emphasized that their method, although demonstrated with 2D platformer games and robotics videos, is versatile and scalable to encompass various domains and even larger internet datasets.
Likewise, when OpenAI introduced Sora recently, researchers hinted at a more fundamental breakthrough: a world simulator. Both teams seem to leverage the vast reservoir of online videos not only to train AI in generating video content but also to enhance its comprehension and capabilities in real-world applications, whether online or offline.
The outcome of these endeavors and their long-term sustainability remain uncertain. While the human brain operates on minimal power, generative AI consumes vast amounts of energy from data centers. Yet, the current landscape—with its abundance of talent, technological advancements, intellectual capital, and financial investments—indicates a concerted effort to enhance AI capabilities and efficiency.
We have witnessed remarkable advancements in text, images, audio, and their convergence. Videos are now being added to the mix, promising an even more potent concoction.
Image Credit: Google DeepMind