
Unveiling Gemini: Uncover Unique Video Techniques with Open-Source AI

A team of researchers used a novel approach to training AI models and extended Meta’s Llama 2…

Google’s latest state-of-the-art generative artificial intelligence (AI) system, Gemini 1.5, a successor to the original Gemini program launched in December, has astounded the global audience with its recent demonstration. Gemini 1.5 showcases proficiency in solving complex tasks like the “needle-in-a-haystack” problem, where it can identify a specific frame of video based on a textual description, among other capabilities.
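
For the text-only version of this test, the setup is easy to sketch: bury one "needle" fact in a long stretch of filler and ask the model to retrieve it; the video variant works analogously, with a target frame standing in for the hidden sentence. The snippet below is a minimal, illustrative sketch of that setup only; the filler sentences, passphrase, and function name are invented for illustration and are not part of any benchmark described here.

```python
import random

def build_haystack_prompt(filler_sentences, needle, num_filler=2000, seed=0):
    """Build a long prompt with one 'needle' fact buried at a random position."""
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(num_filler)]
    insert_at = rng.randrange(len(haystack))
    haystack.insert(insert_at, needle)
    context = " ".join(haystack)
    question = "What is the secret passphrase mentioned in the text above?"
    return f"{context}\n\n{question}", insert_at

filler = [
    "The committee met on Tuesday to review the quarterly figures.",
    "Rainfall in the region was slightly above the seasonal average.",
    "The museum extended its opening hours for the summer exhibition.",
]
needle = "The secret passphrase is 'violet harbor'."
prompt, position = build_haystack_prompt(filler, needle)
print(f"Needle at sentence index {position}; prompt length: {len(prompt)} characters")
# A model passes if its answer to the question contains "violet harbor".
```
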

As with many AI initiatives by major corporations, Google discloses only limited detail about how the system actually works. The 58-page technical report on Gemini 1.5 describes the system and its methodology in broad terms but omits specifics about the model's architecture, and the source code has not been released.


Indeed, Google, OpenAI, and other prominent companies have in recent years increasingly limited the disclosure of technical details about their AI systems.

Open-source programs that mirror some of Gemini's functionality while making their source code available offer one way to counter such secrecy.

Researchers have repurposed Meta's open-source Llama 2 large language model to build a multi-modal system that, like Gemini 1.5, can process both text and images, though unlike Gemini 1.5 it does not handle audio. Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel of the University of California, Berkeley, describe the work in a recent paper.


The authors were able to handle inputs of up to one million "tokens" (the units of text, image, or video fed to the system) using the mainstream version of Llama 2, a modest 7-billion-parameter neural network. That capacity far exceeds the 128,000 tokens handled by Gemini 1.0 and OpenAI's GPT-4 Turbo.

Their creation, known as the Large World Model (LWM), exhibits capabilities akin to those of Gemini 1.5. For instance, when presented with a one-hour YouTube video, LWM can answer detailed questions about it, such as the color of a jacket worn by a person at a particular point in the video.

UC Berkeley's Large World Model outperforms Google's Gemini 1.0 and OpenAI's GPT-4 Turbo by accurately answering challenging questions about video content.

The paper highlights how Liu and team's results stack up against Gemini 1.0 and GPT-4, as well as how the work compares with Gemini 1.5. Notably, LWM correctly answers complex questions that the other two models get wrong.

LWM can hold discussions about video content and carry on detailed conversations about images, a capability the authors call "image chat." Given textual prompts, it can also generate images and videos.

Interestingly, Liu and team achieved results comparable to Gemini 1.0 with less computing power. Gemini 1.0's technical documentation, like the report for 1.5, offers only glimpses of the training setup, noting that Google used Pods of TPU Version 4 and Version 5 chips. That suggests Google likely trained Gemini with considerably more processing power than Liu and team had available for LWM.

LWM's ability to match Gemini 1.0's results while using less computing power and only a small, open-source model comes down to a different approach to building the neural network.

Both models are built on the Transformer, a type of neural network, with Google layering improvements in training algorithms, data, and infrastructure on top of it.

In contrast, Liu and team trained LWM in successive stages, gradually increasing the size of the "context window," the amount of input data the model attends to at each step. They began with a context window of 32,768 tokens and progressively extended it to one million tokens.

This staged approach builds on a technique called "Ring Attention," which splits a long sequence into blocks spread across many devices arranged in a ring. Each device computes attention over its block while key-value blocks rotate around the ring, overlapping communication with computation, so the usable context length scales with the number of devices rather than the memory of any single one.
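
In an actual implementation the blocks live on separate accelerators; the following is only a minimal, single-process NumPy sketch of the communication and online-softmax pattern just described. The block sizes, function names, and the sanity check are illustrative assumptions, not the authors' code.

```python
import numpy as np

def ring_attention_sim(q_blocks, k_blocks, v_blocks):
    """Single-process simulation of the Ring Attention pattern.

    Each simulated 'device' i permanently holds query block i and starts out
    holding key/value block i. At every ring step each device computes
    blockwise attention between its queries and the key/value block it
    currently holds, then the key/value blocks rotate one position around
    the ring. Softmax is accumulated online (running max, numerator,
    denominator), so no device ever materializes the full attention matrix.
    """
    n_dev = len(q_blocks)
    head_dim = q_blocks[0].shape[-1]
    scale = 1.0 / np.sqrt(head_dim)

    # Per-device online-softmax accumulators.
    numer = [np.zeros_like(q) for q in q_blocks]                 # sum of exp(score) * V
    denom = [np.zeros(q.shape[0]) for q in q_blocks]             # sum of exp(score)
    row_max = [np.full(q.shape[0], -np.inf) for q in q_blocks]   # running max per query row

    kv = list(zip(k_blocks, v_blocks))
    for _ in range(n_dev):                                        # one full pass around the ring
        for i in range(n_dev):                                    # each device works in parallel (simulated serially)
            k, v = kv[i]
            scores = q_blocks[i] @ k.T * scale
            new_max = np.maximum(row_max[i], scores.max(axis=-1))
            correction = np.exp(row_max[i] - new_max)             # rescale previous accumulators
            p = np.exp(scores - new_max[:, None])
            numer[i] = numer[i] * correction[:, None] + p @ v
            denom[i] = denom[i] * correction + p.sum(axis=-1)
            row_max[i] = new_max
        kv = kv[-1:] + kv[:-1]                                    # pass key/value blocks to the next device
    return np.concatenate([n / d[:, None] for n, d in zip(numer, denom)])

# Sanity check against ordinary full-sequence attention.
rng = np.random.default_rng(0)
seq_len, head_dim, n_dev = 16, 8, 4
q, k, v = (rng.normal(size=(seq_len, head_dim)) for _ in range(3))
split = lambda x: np.split(x, n_dev)
out_ring = ring_attention_sim(split(q), split(k), split(v))

scores = q @ k.T / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
out_full = (weights / weights.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(out_ring, out_full, atol=1e-6)
print("ring attention simulation matches full attention")
```
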

Concretely, LWM is trained on progressively longer sequences, starting from 32K tokens and scaling up in powers of two to 1M tokens, which makes efficient use of compute during training.
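
A minimal sketch of what such a doubling schedule might look like in code, assuming six stages from 32K to 1M tokens; the loop and names are illustrative, not the authors' actual training code.

```python
START_TOKENS = 32_768          # 32K starting context window
TARGET_TOKENS = 1_048_576      # 1M final context window

def context_schedule(start=START_TOKENS, target=TARGET_TOKENS):
    """Yield the per-stage context lengths, doubling until the target is reached."""
    length = start
    while length <= target:
        yield length
        length *= 2

for stage, ctx_len in enumerate(context_schedule(), start=1):
    # In a real run, each stage would continue from the previous stage's
    # checkpoint and extend the positional encodings to ctx_len.
    print(f"stage {stage}: train with a context window of {ctx_len:,} tokens")
```
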

The data sets used to train LWM encompass well-known repositories like Books3 and Video Instruct-100K, contributing to its robust learning process.

Google, by contrast, describes Gemini 1.0's training data only briefly, as multimodal and multilingual, drawn from sources including images, audio, video, web pages, books, and code.

The authors see long-context modeling as a way to expand what these systems can do. While Google pushes ahead with Gemini 1.5, capable of handling up to 10 million tokens, Liu and team believe Ring Attention could extend context indefinitely, constrained only by the number of devices available.
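
As back-of-the-envelope arithmetic, that scaling works because each device in the ring only ever holds one block of the sequence, so the attainable context grows roughly linearly with the device count. The numbers below are purely illustrative and are not figures from the paper.

```python
def max_context_tokens(tokens_per_device: int, num_devices: int) -> int:
    """With Ring Attention, each device holds one block of the sequence,
    so the attainable context length scales roughly linearly with devices."""
    return tokens_per_device * num_devices

# Illustrative only: if each accelerator can hold a 32K-token block, then...
for devices in (1, 8, 32, 256):
    print(f"{devices:>4} devices -> ~{max_context_tokens(32_768, devices):,} tokens of context")
```
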

The released LWM model is intended to serve as a foundation for future work on extended-context models and on challenging long-range tasks that go beyond mere fact retrieval.

The source code for LWM is available on the researchers’ GitHub repository.
