
### NVIDIA’s Eos Supercomputer Sets a New AI Training Record

NVIDIA’s new Eos supercomputer uses more than 10,000 H100 Tensor Core GPUs to train a 175-billion-parameter GPT-3 model on 1 billion tokens in under four minutes.

Training a large language model can be a time-consuming endeavor, taking weeks, months, or even years depending on the hardware infrastructure being used. Such long turnaround times are impractical for running a business efficiently. On Wednesday, NVIDIA unveiled Eos, a supercomputer powered by more than 10,000 H100 Tensor Core GPUs that can train a 175-billion-parameter GPT-3 model on 1 billion tokens in under four minutes, nearly tripling the company’s previous record on the MLPerf AI industry benchmark.
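For a sense of scale, a common back-of-the-envelope rule estimates transformer training compute at roughly 6 FLOPs per parameter per token. The sketch below applies that approximation to the benchmark run; it is a rough illustrative estimate, not an official NVIDIA figure, with the 3.9-minute runtime taken from the MLPerf result discussed below.

```python
# Back-of-the-envelope compute estimate for the Eos GPT-3 benchmark run,
# using the common ~6 * parameters * tokens approximation for transformer
# training FLOPs. Illustrative only; the runtime comes from the MLPerf result.

params = 175e9                       # 175 billion parameters
tokens = 1e9                         # 1 billion training tokens
train_flops = 6 * params * tokens    # ~1.05e21 FLOPs total

runtime_s = 3.9 * 60                 # 3.9-minute benchmark run, in seconds
sustained = train_flops / runtime_s  # ~4.5e18 FLOP/s sustained

print(f"estimated training compute: {train_flops:.2e} FLOPs")
print(f"implied sustained throughput: {sustained / 1e18:.1f} exaFLOP/s")
```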

Eos represents a massive amount of computational capability: some 40 exaflops of AI processing power from 10,752 GPUs interconnected via NVIDIA’s InfiniBand networking, which moves roughly a petabyte of data per second across the system. The machine also carries 860 terabytes of high-bandwidth memory, with 36 PB/sec of aggregate bandwidth and 1.1 PB/sec of interconnect bandwidth. Its cloud architecture comprises 1,344 nodes, individual servers that businesses can rent access to for approximately $37,000 per month, gaining this class of AI capability without building out their own infrastructure.
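The memory figure checks out against simple per-GPU arithmetic. A minimal sketch, assuming the 80 GB HBM3 configuration of the H100 data-center part:

```python
# Sanity-check Eos's aggregate high-bandwidth memory from per-GPU specs.
# Assumes the 80 GB HBM3 variant of the H100; this is an illustrative
# check, not an official system breakdown.

gpus = 10_752
hbm_per_gpu_gb = 80                       # GB of HBM3 per H100

total_tb = gpus * hbm_per_gpu_gb / 1000   # 860,160 GB ≈ 860 TB
print(f"aggregate HBM: {total_tb:.0f} TB")
```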

Across nine standardized tests, NVIDIA set six new records, including a 3.9-minute result for GPT-3, a 2.5-minute record for training a Stable Diffusion model on 1,024 Hopper GPUs, and strong timings on other models such as DLRM, RetinaNet, 3D U-Net, and BERT-Large. It is worth noting that neither the Stable Diffusion model nor the 175-billion-parameter GPT-3 used in these benchmarks was a full-sized version. A full-scale GPT-3 model, reportedly around 3.7 trillion parameters, would take far longer to train, particularly on older hardware such as A100-based systems.

To keep benchmarking tractable, NVIDIA and MLCommons used compact versions of the models, training on 1 billion tokens rather than the full-scale datasets. The latest performance leap can be attributed largely to a substantial increase in GPU count, from 3,584 Hopper GPUs in the previous round of testing to 10,752 H100 GPUs in the latest round. Despite tripling the GPU count, NVIDIA maintained a 2.8x performance gain, a 93 percent scaling efficiency achieved largely through software optimization.
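That 93 percent figure follows directly from the two numbers quoted: a 3x increase in GPUs yielding a 2.8x speedup. The arithmetic, as a quick sketch:

```python
# Scaling efficiency of the latest MLPerf round versus the previous one.
# Perfect (linear) scaling would yield a speedup equal to the GPU ratio.

prev_gpus, new_gpus = 3_584, 10_752
speedup = 2.8                            # measured performance gain

gpu_ratio = new_gpus / prev_gpus         # 3.0x more GPUs
efficiency = speedup / gpu_ratio         # 2.8 / 3.0 ≈ 0.93

print(f"GPU count ratio: {gpu_ratio:.1f}x")
print(f"scaling efficiency: {efficiency:.0%}")   # ~93%
```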

Dave Salvator, Director of Accelerated Computing Products at NVIDIA, emphasized the importance of scaling to both system performance and cost-effectiveness. Notably, Microsoft’s Azure group submitted results from a comparable 10,752 H100 GPU system in the same round, landing within 2 percent of NVIDIA’s performance. That parity underscores how far AI hardware has advanced across the industry, and Azure’s system is commercially available, offering capabilities similar to the cutting-edge Eos machine.

NVIDIA’s work spans GPU design, rendering technologies, generative AI, and autonomous driving systems, all of which stand to benefit from these enhanced compute capabilities. The benchmarks themselves continue to evolve as well: MLCommons recently added Stable Diffusion training to its MLPerf suite, reflecting how quickly the generative AI landscape is moving. Benchmarks like these matter because they give the industry a shared yardstick against which market claims can be rigorously evaluated and validated.

As generative AI continues to evolve rapidly, independent oversight and standardized benchmarks play a crucial role in maintaining credibility and transparency. NVIDIA’s push to advance AI capabilities aligns with the industry’s trajectory toward more sophisticated and efficient computing, and through initiatives like MLPerf, companies can validate their performance claims within a collaborative industry framework, fostering accountability and reliability across the AI ecosystem.
