With limited medical data, how can AI learn to understand medical language? By generating synthetic training data with another AI model.
The landscape of medical practice is evolving rapidly, with artificial intelligence playing an increasingly pivotal role in various scientific endeavors.
Cutting-edge AI technologies such as generative AI are at the forefront of this transformation. A prime example is GatorTronGPT, a sophisticated language model developed on the HiPerGator AI supercomputer at the University of Florida and recently described in a paper published in npj Digital Medicine.
GatorTronGPT joins a growing number of large language models (LLMs) trained on clinical data. Its developers built it on the GPT-3 architecture, the same one used by ChatGPT.
The effort drew on a massive corpus of 277 billion words: 195 billion words of diverse English text and 82 billion words from de-identified clinical notes.
The research team then used GatorTronGPT to generate a synthetic corpus of roughly 20 billion words of medical text, guided by carefully crafted prompts. The synthetic text imitates authentic clinical notes written by medical professionals and captures the relevant clinical factors.
This synthetic data was then used to train a BERT-based model called GatorTron-S.
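To make the idea of a prompt-driven synthetic-data pipeline concrete, here is a minimal sketch using the Hugging Face transformers library. GPT-2 stands in for a clinically trained generator, and the prompt and sampling parameters are illustrative assumptions, not the setup used for GatorTronGPT.

```python
# A minimal sketch of prompt-driven synthetic text generation, assuming a
# Hugging Face causal language model. GPT-2 is a stand-in generator, and
# the prompt is a hypothetical example, not one from the study.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Carefully crafted prompts steer the generator toward note-like text.
prompt = "Chief complaint: chest pain. History of present illness:"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation; repeating this at scale is how a multibillion-word
# synthetic corpus could be assembled.
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

In practice, the quality of such a corpus hinges on the prompts and on the generator itself, which is why a domain-trained model like GatorTronGPT is used rather than a general-purpose one.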
In a quantitative evaluation, GatorTron-S outperformed GatorTron-OG, the original BERT-based model trained on the 82-billion-word clinical dataset, on clinical natural language processing tasks such as medical concept extraction and medical relation extraction. Remarkably, it achieved these results with less training data.
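For readers unfamiliar with these tasks: clinical concept extraction is typically framed as token classification, where a BERT-style model labels each token as part of a medical concept or not. The sketch below illustrates the pattern with a generic public NER checkpoint as a stand-in; it does not reproduce the GatorTron models or benchmarks.

```python
# A minimal sketch of concept extraction framed as token classification.
# The checkpoint is a generic public NER model, not a GatorTron model.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

checkpoint = "dslim/bert-base-NER"  # placeholder stand-in model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

text = "Patient was started on aspirin for chest pain."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each token to its highest-scoring label.
pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, pred_ids):
    print(f"{token}\t{model.config.id2label[label_id.item()]}")
```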
The GatorTron-OG and GatorTron-S models were trained on 560 NVIDIA A100 Tensor Core GPUs in the University of Florida's HiPerGator supercomputer, running the NVIDIA Megatron-LM framework. The Megatron-LM technology used in the project has since been incorporated into the NVIDIA NeMo framework, which has been central to more recent work on GatorTronGPT.
Synthetic data generated by LLMs addresses a pressing challenge: the scarcity of the high-quality medical data needed to train them. Synthetic data also allows models to be trained in compliance with HIPAA and other medical data privacy regulations.
GatorTronGPT is one example of how LLMs, which have seen rapid adoption since ChatGPT's debut, can be tailored to specific domains. It also underscores the breakthroughs made possible by new AI techniques powered by accelerated computing.
The project is the latest fruit of the collaboration between the University of Florida and NVIDIA, which in 2020 announced plans to build the world's fastest AI supercomputer in academia.
That initiative was supported by a $50 million gift from NVIDIA cofounder Chris Malachowsky and the company itself.
HiPerGator's role in advancing AI education is one illustration of its impact; the supercomputer stands poised to drive further progress in the health sciences and beyond across the University of Florida system.