In today’s discussions with IT and business leaders, the topic of artificial intelligence (AI) consistently takes center stage. Almost every company views AI as a crucial strategic endeavor, leading to substantial investments and widespread technology implementation.
Similar to past technology trends, the adoption of AI has been largely experimental for many organizations, characterized by a learn-as-you-go approach. While some may immediately associate AI deployment with software stacks and sophisticated language models, the initial focus lies on infrastructure. This encompasses the storage, networking, and servers essential for managing the vast amounts of data utilized across enterprises.
As new-generation servers replace outdated technology and bespoke infrastructure is integrated alongside them, many IT departments are grappling with strained power budgets. The heat generated by these servers, equipped with advanced CPUs, GPUs, and specialized silicon, exceeds what conventional air cooling can remove.
Amidst the escalating server temperatures and the growing demand for specialized AI acceleration, datacenter architects face a pressing dilemma. In the subsequent sections, I will delve deeper into the power and cooling challenges while exploring emerging trends in this domain.
Escalating Server Temperatures
The insatiable demand for computational power remains a fundamental truth in enterprise IT. In the past, GPUs were predominantly associated with gamers and high-performance computing (HPC) enthusiasts. Today, they are indispensable pieces of silicon, sought after worldwide. Similarly, the mention of ASICs five years ago would have been misconstrued as a reference to athletic footwear.
Aligned with this transformative shift, organizations are deploying specialized technologies to cater to modern workloads essential for competing in a market teeming with agile cloud-native competitors. Consequently, there is a surge in the deployment of servers embedded with advanced (and specialized) silicon to expedite the crucial time-to-value metric.
The resulting power consumption on a global scale is staggering. The International Energy Agency reports that datacenters consumed between 240 and 340 terawatt-hours of electricity in 2022, equivalent to 2% of global consumption and on par with Australia’s electricity usage. By 2030, this figure is projected to escalate to approximately 8%.
Short of a monumental shift in IT priorities, mitigating this surge in power consumption seems implausible. The CPUs being deployed are not only more powerful but also more power-hungry. Furthermore, the GPUs and other accelerators used for data analysis and model training demand even more power. Consider this: Intel and AMD CPUs can individually consume up to 400 watts, while Nvidia and AMD GPUs can exceed 700 watts each. Expect these figures to rise with the advent of next-generation silicon.
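To make these figures concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the per-chip numbers quoted above (400 watts per CPU, 700 watts per GPU). The server configuration of two CPUs and eight GPUs is a hypothetical example, and memory, storage, fans, and power-supply losses are deliberately ignored, so a real server would draw more.

```python
# Per-chip figures quoted in the text; treated here as rough upper-end values.
CPU_WATTS = 400
GPU_WATTS = 700


def silicon_power_watts(cpus: int, gpus: int) -> int:
    """Estimate the power drawn by processors alone (no memory, fans, or PSU losses)."""
    return cpus * CPU_WATTS + gpus * GPU_WATTS


# Hypothetical accelerated server: 2 CPUs and 8 GPUs (an assumption for illustration).
server_watts = silicon_power_watts(cpus=2, gpus=8)
print(f"Processor power per server: {server_watts} W")  # 6400 W
```

Even under these simplified assumptions, a single accelerated server lands in the multi-kilowatt range from its processors alone.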
For datacenters operating within tight budgets, two significant challenges loom for managers and operators:
- Enhancing the utilization of available power to support modernized infrastructure needs.
- Implementing effective cooling solutions to counteract the escalating heat generated within server form factors.
To provide additional context to these intertwined challenges, approximately 40% of the average datacenter budget is allocated to cooling.
Focus on Power Usage Effectiveness (PUE)
Power usage effectiveness measures datacenter power efficiency as the ratio of total facility power to the power that directly runs servers, storage, and networking equipment within the datacenter. An ideal PUE rating of 1.0 implies that every watt consumed powers these essential datacenter components, with none spent on cooling or other overhead.
Contrary to this ideal scenario, the National Renewable Energy Laboratory reveals that the average datacenter PUE hovers around 1.8. Datacenters emphasizing sustainability typically achieve a PUE of about 1.2. These statistics underscore the significance of reducing PUE for IT organizations to manage their power budgets effectively.
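As a quick illustration, PUE is commonly defined as total facility power divided by IT equipment power. The sketch below computes it, along with the share of facility power that goes to overhead such as cooling and power distribution. The function names and the sample 1,000 kW IT load are illustrative assumptions, not measurements from any real facility.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_share(pue_value: float) -> float:
    """Fraction of total facility power that does not reach IT equipment."""
    if pue_value < 1:
        raise ValueError("PUE cannot be below 1.0")
    return 1 - 1 / pue_value


# Hypothetical facility: 1,000 kW of IT load drawing 1,800 kW in total.
print(f"PUE: {pue(1800, 1000):.2f}")

# Overhead at the average and sustainability-focused PUE values cited above.
for value in (1.8, 1.2):
    print(f"PUE {value}: {overhead_share(value):.0%} of facility power is overhead")
```

At a PUE of 1.8, roughly 44% of facility power never reaches IT equipment. At 1.2, that share falls to about 17%.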
The pivotal question arises: How can organizations lower their PUE while effectively cooling servers operating at higher temperatures than ever before? The apparent solution lies in adopting a different approach—liquid cooling.
Distinguishing Liquid Cooling Methodologies
Liquid cooling, once reserved for high-performance computing and niche applications, has transitioned into the mainstream due to the rush to deploy AI-ready infrastructure and the evolution of compute platforms. Two primary liquid cooling methods—direct-to-chip and immersion—have gained prominence as effective cooling solutions.
In direct-to-chip cooling, cool liquids circulate through a cold plate directly connected to heat-generating components. These cold plates extract heat and transfer it to the liquid. Conversely, immersion cooling involves housing entire servers in a dielectric fluid that efficiently dissipates heat from the components.
Both direct-to-chip and immersion cooling encompass two subtypes: single-phase and two-phase. In single-phase cooling, a hydrocarbon-based liquid absorbs heat and is subsequently cooled through a heat exchanger. Two-phase cooling, on the other hand, uses a fluorocarbon-based liquid that absorbs heat from servers. As it heats up, the liquid boils into vapor, which is then separated and condensed back into liquid form.
While two-phase cooling boasts superior PUE compared to single-phase cooling, environmental concerns arise from its use of fluorocarbon-based liquids, which belong to the class of chemicals known as PFAS (per- and polyfluoroalkyl substances). These substances pose environmental risks, prompting several governments to prohibit their production and usage.
When contrasting direct-to-chip and immersion cooling, operational disruptions must be considered. Direct-to-chip cooling minimally impacts server deployment and maintenance, whereas immersion cooling necessitates a reevaluation of server maintenance practices. Additionally, immersion cooling poses challenges concerning component warranties.
Regardless of the approach adopted, liquid cooling—be it direct-to-chip or immersion—significantly surpasses air-based cooling in terms of PUE. While the average datacenter maintains a PUE of 1.8, direct-to-chip cooling can decrease PUE to under 1.2, and immersion cooling can achieve a PUE as low as 1.02. Achieving a PUE below 1.2 signifies substantial power (and cost) savings for organizations.
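To put these PUE values in perspective, the sketch below estimates annual facility energy for a fixed IT load under each cooling approach. The 1,000 kW IT load and the assumption of constant year-round operation are hypothetical, and 1.2 is used as the upper bound for direct-to-chip cooling, so treat the results as rough comparisons rather than predictions.

```python
HOURS_PER_YEAR = 8760  # 365 days, constant load assumed


def annual_facility_mwh(it_load_kw: float, pue_value: float) -> float:
    """Annual facility energy in MWh for a constant IT load at a given PUE."""
    return it_load_kw * pue_value * HOURS_PER_YEAR / 1000


IT_LOAD_KW = 1000  # hypothetical IT load, for illustration only

# PUE values taken from the text above.
scenarios = {
    "air (average datacenter)": 1.8,
    "direct-to-chip": 1.2,
    "immersion": 1.02,
}

baseline_mwh = annual_facility_mwh(IT_LOAD_KW, scenarios["air (average datacenter)"])

for name, pue_value in scenarios.items():
    mwh = annual_facility_mwh(IT_LOAD_KW, pue_value)
    reduction = 1 - mwh / baseline_mwh
    print(f"{name}: {mwh:,.0f} MWh/year ({reduction:.0%} below baseline)")
```

Under these assumptions, moving from a PUE of 1.8 to 1.2 cuts facility energy by about a third, and reaching 1.02 cuts it by roughly 43%.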
Initiatives by Server Vendors
Acknowledging the escalating power consumption and shrinking form factors, each server vendor has devised tailored responses to address these challenges.
- Lenovo’s fifth-generation Neptune cooling system integrates direct-to-chip cooling and a rear door heat exchanger to deliver comprehensive system-level cooling. The company has also dedicated significant resources to engineering an intricate tubing system.
- HPE leverages cooling technologies acquired through Cray and SGI to offer diverse cooling solutions tailored to specific workloads and customer requirements.
- Dell provides a direct liquid-cooling solution and actively collaborates with ecosystem partners. The company also shares its research findings and engages in standards bodies such as ASHRAE and the Open Compute Project.
- Supermicro has also made notable strides in liquid cooling, leveraging its agile business model to swiftly adapt to evolving market demands.
The Evolving Cooling Landscape
While each server vendor offers unique solutions, the market features a plethora of best-of-breed cooling solutions spanning direct-to-chip, single-phase, and two-phase immersion methods. Companies like ZutaCore, LiquidStack, CoolIT, GRC, Submer, and Vertiv are actively competing to establish themselves in this dynamic market.
Selecting the optimal partner hinges on a company-specific evaluation process. For IT or facilities organizations considering liquid cooling deployment, I recommend the following approach:
- Thoroughly understand your requirements and capabilities. What is necessary to implement a liquid cooling solution effectively?
- Enhance your knowledge of liquid cooling variants to identify the best fit for your needs.
- Engage with your server vendor, channel partners, and cooling standards bodies.
- Compile a shortlist of vendors and conduct a comprehensive evaluation.
- Seek references from organizations akin to yours (e.g., same industry vertical and datacenter type) that have embraced liquid cooling. Identify the technology and vendor they have adopted.
- Initiate with a small-scale deployment and gradually expand.
Closing Remarks
Liquid cooling is poised to become the standard in datacenters, marking an inevitable shift in cooling methodologies. Whether direct-to-chip or immersion cooling prevails, or the industry develops an environmentally safe two-phase liquid, the future datacenter landscape will likely encompass a diverse array of cooling solutions. Although this transformation may not happen overnight, the need for alternative cooling methods to support AI and HPC workloads is immediate. The question remains: what steps will you take?