They dubbed the resulting dataset “TinyStories” and used it to train very small language models of around 10 million parameters. To their surprise, when prompted to create its own stories, the small language model trained on TinyStories generated fluent narratives with perfect grammar.
Next, they took their experiment up a grade, so to speak. This time a bigger group of researchers used carefully selected publicly available data that was filtered based on educational value and content quality to train Phi-1. After collecting publicly available information into an initial dataset, they used a prompting and seeding formula inspired by the one used for TinyStories, but made it more sophisticated so that it would capture a wider scope of data. To ensure high quality, they repeatedly filtered the resulting content before feeding it back into an LLM for further synthesizing. In this way, over several weeks, they built up a corpus of data large enough to train a more capable SLM.
“A lot of care goes into producing these synthetic data,” Bubeck said, referring to data generated by AI, “looking over it, making sure it makes sense, filtering it out. We don’t take everything that we produce.” They dubbed this dataset “CodeTextbook.”
The researchers further enhanced the dataset by approaching data selection like a teacher breaking down difficult concepts for a student. “Because it’s reading from textbook-like material, from quality documents that explain things very, very well,” said Bubeck, “you make the task of the language model to read and understand this material much easier.”
Distinguishing between high- and low-quality information isn’t difficult for a human, but sorting through the more than a terabyte of data that Microsoft researchers determined they would need to train their SLM would be impossible without help from an LLM.
“The power of the current generation of large language models is really an enabler that we didn’t have before in terms of synthetic data generation,” said Ece Kamar, a Microsoft vice president who leads the Microsoft Research AI Frontiers Lab, where the new training approach was developed.
Starting with carefully selected data helps reduce the likelihood of models returning unwanted or inappropriate responses, but it’s not sufficient to guard against all potential safety challenges. As with all generative AI model releases, Microsoft’s product and responsible AI teams used a multi-layered approach to manage and mitigate risks in developing Phi-3 models.
For instance, after initial training they provided additional examples and feedback on how the models should ideally respond, which builds in an additional safety layer and helps the model generate high-quality results. Each model also undergoes assessment, testing and manual red-teaming, in which experts identify and address potential vulnerabilities.
Finally, developers using the Phi-3 model family can also take advantage of a suite of tools available in Azure AI to help them build safer and more trustworthy applications.
Choosing the right-size language model for the right task
But even small language models trained on high-quality data have limitations. They are not designed for in-depth knowledge retrieval, where large language models excel due to their greater capacity and training on much larger data sets.
LLMs are better than SLMs at complex reasoning over large amounts of information due to their size and processing power. That’s a function that could be relevant for drug discovery, for example, by helping to pore through vast stores of scientific papers, analyze complex patterns and understand interactions between genes, proteins or chemicals.
“Anything that involves things like planning where you have a task, and the task is complicated enough that you need to figure out how to partition that task into a set of sub tasks, and sometimes sub-sub tasks, and then execute through all of those to come with a final answer … are really going to be in the domain of large models for a while,” said Vargas.
Based on ongoing conversations with customers, Vargas and Yadav expect to see some companies “offloading” some tasks to small models if the task is not too complex.
For instance, a business could use Phi-3 to summarize the main points of a long document or extract relevant insights and industry trends from market research reports. Another organization might use Phi-3 to generate copy, helping create content for marketing or sales teams such as product descriptions or social media posts. Or a company might use Phi-3 to power a support chatbot that answers customers’ basic questions about their plan or service upgrades.
Internally, Microsoft is already using suites of models in which a large language model plays the role of router, directing queries that require less computing power to small language models while tackling the more complex requests itself.
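To make the routing idea concrete, here is a minimal sketch in Python. It is not Microsoft's implementation: in the setup described above a large language model decides where each query goes, while this sketch swaps in a simple word-count and keyword heuristic so that it runs on its own. The functions `small_model` and `large_model`, the `COMPLEXITY_HINTS` list and the `max_words` threshold are all hypothetical names and values chosen for illustration.

```python
# Hypothetical stand-ins for real model calls; these are not an actual API.
def small_model(query: str) -> str:
    return f"[small] {query}"


def large_model(query: str) -> str:
    return f"[large] {query}"


# Assumed heuristic: phrases that suggest multi-step reasoning or planning.
COMPLEXITY_HINTS = ("step by step", "analyze", "compare", "multi-step")


def needs_large_model(query: str, max_words: int = 30) -> bool:
    """Return True when the query looks too complex for the small model."""
    text = query.lower()
    too_long = len(text.split()) > max_words
    has_hint = any(hint in text for hint in COMPLEXITY_HINTS)
    return too_long or has_hint


def route(query: str) -> str:
    """Send simple queries to the small model and the rest to the large one."""
    handler = large_model if needs_large_model(query) else small_model
    return handler(query)
```

In this sketch a short support question such as "When does my service upgrade start?" goes to the small model, while a request to analyze and compare a set of reports goes to the large one. A production router would replace the heuristic with a model's own judgment of task complexity.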
“The claim here is not that SLMs are going to substitute or replace large language models,” said Kamar. Instead, SLMs “are uniquely positioned for computation on the edge, computation on the device, computations where you don’t need to go to the cloud to get things done. That’s why it is important for us to understand the strengths and weaknesses of this model portfolio.”
And size carries important advantages. There’s still a gap between small language models and the level of intelligence that you can get from the big models on the cloud, said Bubeck. “And maybe there will always be a gap because you know – the big models are going to keep making progress.”