A widely acclaimed large language model for genomic data has shown that it can generate gene sequences closely resembling real-world variants of SARS-CoV-2, the virus behind COVID-19.
The model, called GenSLMs, was trained on a dataset of nucleotide sequences, the building blocks of DNA and RNA. Last year it won the Gordon Bell special prize for high performance computing-based COVID-19 research. It was developed by researchers from Argonne National Laboratory, NVIDIA, the University of Chicago and numerous other academic and commercial collaborators.
When the researchers looked back at the sequences GenSLMs had generated, they found that specific characteristics closely matched the real-world Eris and Pirola subvariants that have been prevalent this year, even though the model was trained only on COVID-19 virus genomes from the first year of the pandemic.
Arvind Ramanathan, lead researcher on the project and a computational biologist at Argonne, said the model’s generative process is extremely naive: it has no specific information or constraints about what a new COVID variant should look like. The AI’s ability to predict the kinds of gene mutations found in recent COVID strains, despite having seen only the Alpha and Beta variants during training, is in his view a strong validation of its capabilities.
Besides generating its own sequences, GenSLMs can classify and cluster COVID genome sequences by distinguishing between variants. In a demo coming soon to NGC, NVIDIA’s hub for accelerated software, users will be able to explore visualizations of GenSLMs’ analysis of the evolutionary patterns of proteins within the COVID viral genome.
Unveiling Evolutionary Trends and Delving Deeper
A key feature of GenSLMs is its ability to interpret long strings of nucleotides, represented by sequences of the letters A, T, G and C in DNA, or A, U, G and C in RNA, in the same way an LLM trained on English text would interpret a sentence. This capability lets the model understand the relationships between different regions of the genome, which in coronaviruses consists of around 30,000 nucleotides.
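To make the text analogy concrete, here is a minimal Python sketch of how a nucleotide string could be turned into the integer tokens a language model consumes. It is an illustration only, not the GenSLMs implementation: the function names (`dna_to_rna`, `tokenize`, `encode`) are hypothetical, and the choice of non-overlapping three-letter tokens is an assumption made for this example rather than something stated in the article.

```python
# Illustrative sketch only: a toy tokenizer, NOT the GenSLMs implementation.
# The three-letter token size and all function names are assumptions.
from typing import Dict, List

DNA_ALPHABET = set("ATGC")


def dna_to_rna(dna: str) -> str:
    """Rewrite a DNA string in RNA notation by replacing T with U."""
    dna = dna.upper()
    unknown = set(dna) - DNA_ALPHABET
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return dna.replace("T", "U")


def tokenize(seq: str, k: int = 3) -> List[str]:
    """Split a sequence into non-overlapping k-letter tokens.

    Trailing letters that do not fill a whole token are dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % k
    return [seq[i:i + k] for i in range(0, usable, k)]


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to integer ids, adding unseen tokens to the vocabulary."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]


if __name__ == "__main__":
    vocab: Dict[str, int] = {}
    tokens = tokenize("ATGTTTATG")
    print(tokens)                 # ['ATG', 'TTT', 'ATG']
    print(encode(tokens, vocab))  # [0, 1, 0]
    print(dna_to_rna("ATGTTT"))   # AUGUUU
```

Under this assumed scheme, a coronavirus genome of roughly 30,000 nucleotides would become a sequence of about 10,000 tokens, which gives a sense of how much context a model must handle to relate distant regions of the genome.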
In the demo, users will be able to choose among eight COVID variants to see how the AI model tracks mutations across the proteins of the viral genome. The visualization depicts evolutionary couplings across the viral proteins, highlighting which snippets of the genome are likely to appear in a given variant.
Ramanathan said that understanding how different parts of the genome co-evolve offers clues about how the virus may develop new vulnerabilities or new forms of resistance. Looking at which mutations the model considers particularly strong in a variant may help scientists with downstream tasks, such as determining how a specific strain can evade the human immune system.
GenSLMs was trained on more than 110 million prokaryotic gene sequences and then fine-tuned on a global dataset of around 1.5 million COVID viral sequences, using open-source data from the Bacterial and Viral Bioinformatics Resource Center. In the future, the model could be fine-tuned on the genomes of other viruses or bacteria, opening up new research applications.
To train the model, the researchers used supercomputers powered by NVIDIA A100 Tensor Core GPUs, including Argonne’s Polaris system, the U.S. Department of Energy’s Perlmutter and NVIDIA’s Selene.
The GenSLMs research team received its Gordon Bell special prize at last year’s SC22 supercomputing conference. At this year’s SC23 in Denver, NVIDIA is presenting further work in accelerated computing.
NVIDIA Research comprises teams of scientists and engineers focused on topics including AI, computer graphics, computer vision, self-driving cars and robotics. To keep up with the latest work, learn more about NVIDIA Research and subscribe to NVIDIA healthcare news.
Main image courtesy of Bharat Kale, Argonne National Laboratory. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. DOE Office of Science and the National Nuclear Security Administration. It was also supported by the DOE through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on the response to COVID-19, with funding from the Coronavirus CARES Act.