Building ever larger language models has led to groundbreaking jumps in performance. But it's also pushing state-of-the-art AI beyond the reach of all but the most well-resourced AI labs. That makes efforts to shrink models down to more manageable sizes more important than ever, researchers say.
In 2020, researchers at OpenAI proposed AI scaling laws that suggested increasing model size led to reliable and predictable improvements in capability. But this trend is quickly putting the cutting edge of AI research out of reach for all but a handful of private labs. While the company has remained tight-lipped on the matter, there is speculation that its latest GPT-4 large language model (LLM) has as many as a trillion parameters, far more than most companies or research groups have the computing resources to train or run. As a result, the only way most people can access the most powerful models is through the APIs of industry leaders.
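The predictability those scaling laws describe can be sketched as a simple power law: loss falls smoothly as parameter count grows. The constants below are made up purely for illustration, not the fitted values from any paper:

```python
# Illustrative power-law scaling curve in the spirit of the 2020 scaling
# laws: loss falls predictably as a power of parameter count.
# L_inf, k, and alpha here are invented for illustration, not fitted values.
def predicted_loss(n_params, L_inf=1.7, k=1e13, alpha=0.076):
    """Toy loss estimate: L(N) = L_inf + (k / N) ** alpha."""
    return L_inf + (k / n_params) ** alpha

for n in [1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The curve never hits zero: each order-of-magnitude jump in size buys a smaller and smaller improvement, which is part of why the cost of staying at the frontier grows so fast.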
"We won't be able to make models bigger forever. There comes a point where even with hardware improvements, given the pace that we're increasing the model size, we just can't."
- Dylan Patel, SemiAnalysis
That's a problem, says Dylan Patel, chief analyst at the consultancy SemiAnalysis, because it makes it more or less impossible for others to reproduce these models. That means external researchers aren't able to probe these models for potential safety concerns, and companies looking to deploy LLMs are "tied to the hip" of OpenAI's data set and model design choices.
There are more practical concerns too. The pace of innovation in the GPU chips that are used to run AI is lagging behind model size, meaning that pretty soon we could face a "brick wall" beyond which scaling cannot plausibly go. "We won't be able to make models bigger forever," he says. "There comes a point where even with hardware improvements, given the pace that we're increasing the model size, we just can't."
How large do large language models need to be?
Efforts to push back against the logic of scaling are underway, though. Last year, researchers at DeepMind showed that training smaller models on far more data could significantly boost performance. DeepMindâs 70-billion-parameter Chinchilla model outperformed the 175-billion-parameter GPT-3 by training on nearly five times as much data. This February, Meta used the same approach to train much smaller models that could still go toe-to-toe with the biggest LLMs. Its resulting LLaMa model came in a variety of sizes between 7 and 65 billion parameters, with the 13-billion-parameter version outperforming GPT-3 on most benchmarks.
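The Chinchilla result is often summarized as a rule of thumb: for a fixed compute budget, train on roughly 20 tokens per parameter rather than pouring everything into model size. A back-of-the-envelope sketch (the 20-tokens-per-parameter figure is the widely cited approximation of DeepMind's fits, used here only for illustration):

```python
# Rough sketch of the compute-optimal heuristic attributed to Chinchilla:
# scale training tokens roughly in proportion to parameters, at about
# 20 tokens per parameter. The ratio is an approximation, for illustration.
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal token count for a given model size."""
    return n_params * tokens_per_param

# A 70-billion-parameter model calls for on the order of 1.4 trillion
# training tokens, versus the roughly 300 billion GPT-3 was trained on.
print(f"{chinchilla_tokens(70e9):.1e} tokens")
```

The point is that GPT-3-era models were, by this accounting, dramatically undertrained for their size, which is the slack that smaller, data-hungry models like Chinchilla and LLaMa exploit.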
The company's stated goal was to make such LLMs more accessible, and so Meta offered the trained model to any researchers who asked for it. This experiment in accessibility quickly got out of control, though, after the model was leaked online. And earlier this month, researchers at Stanford pushed things further: They took the 7-billion-parameter version of LLaMa and fine-tuned it on 52,000 query responses from GPT-3.5, the model that originally powered ChatGPT and (as of press time) still powers OpenAI's free version. The resulting model, called Alpaca, was able to replicate much of the behavior of the OpenAI model, according to the researchers, who released their data and training recipe so others could replicate it.
"Increasingly, we were finding that there was a gap in the qualitative behavior of models available to the research community and the closed-source models being served by leading LLM providers," says Tatsunori Hashimoto, an assistant professor at Stanford who led the research. "Our view was that having a capable and accessible model was important to have the academic community engage in analyzing and solving the many deficiencies of instruction-following LLMs."
Since then, hackers and hobbyists have run with the idea, using the LLaMa weights and the Alpaca training scheme to run their own LLMs on PCs, phones, and even a Raspberry Pi single-board computer. Hashimoto says it's great to see more people engaging with LLMs, and he's been surprised at the efficiency people have squeezed out of these models. But he stresses that Alpaca is still very much a research model not suitable for widespread use, and that broad accessibility to LLMs also carries risks.
"If we can take advantage of the knowledge already frozen in these models, we should."
- Jim Fan, Nvidia
Patel says there are question marks around the way the Stanford researchers evaluated their model, and it's not clear its performance is as good as that of larger models. But there are plenty of other approaches to boosting efficiency that are also making progress. One promising technique is the "mixture of experts" (MoE), he says, which involves training multiple smaller sub-models specialized for specific tasks rather than using a single large model to solve all of them. The MoE approach makes a lot of sense, says Patel: our brains follow a similar pattern, with different regions specialized for different tasks.
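The routing idea at the heart of MoE can be sketched in a few lines. This is a toy illustration only, with made-up shapes and random weights, not any lab's actual architecture: a small gating network scores the experts for each input, and only the top-scoring expert runs, so compute per input stays flat even as total parameters grow with the number of experts.

```python
import numpy as np

# Toy top-1 mixture-of-experts forward pass (illustrative only).
rng = np.random.default_rng(0)
d_in, d_out, n_experts = 8, 4, 3

# Each "expert" is just a small linear layer here.
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d_in, n_experts))

def moe_forward(x):
    scores = x @ gate_w                 # gating logits, one per expert
    chosen = int(np.argmax(scores))     # top-1 routing: pick one expert
    return x @ experts[chosen], chosen  # only that expert's weights are used

x = rng.normal(size=d_in)
y, expert_id = moe_forward(x)
print(f"routed to expert {expert_id}, output shape {y.shape}")
```

Real MoE layers typically route each token independently and may blend the top few experts, but the economics are the same: total parameter count scales with the number of experts while per-token compute does not.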
Nvidia recently used the approach to build a vision-language model called Prismer, designed to answer questions about images or provide captions. The company showed that the model could match the performance of models trained on 10 to 20 times as much data. "There are tons of high-quality pretrained models for various tasks like depth estimation, object segmentation, and 3D understanding," says Jim Fan, AI research scientist at Nvidia. "If we can take advantage of the knowledge already frozen in these models, we should."
Turning LLM sparsity into opportunity
Another attractive approach to boosting model efficiency is to exploit a property known as sparsity, says Patel. A surprisingly large number of weights in LLMs are set to zero, and performing operations on these values is a waste of computation. Finding ways to remove these zeros could help shrink the size of models and reduce computational costs, says Patel.
Sparsity is one of the most promising future directions for compressing models, says Sara Hooker, who leads the research lab Cohere For AI, but current hardware is not well-suited to exploit it. Patterns of sparsity typically don't have any obvious structure, but today's GPUs are specialized for processing data in well-defined matrices. This means that even when a weight is zero it still needs to be represented in the matrix, which takes up memory and adds computational overhead. While enforcing structured patterns of sparsity is a partial workaround, the chips can't take full advantage, and further hardware innovation is probably needed, Hooker says. "The interesting challenge is, how do you represent the absence of something without actually representing it?" she says.
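The storage side of that argument is easy to see with a rough sketch. Assuming a made-up weight matrix in which 90 percent of entries are zero, a simple coordinate format that stores only the nonzeros and their indices takes a fraction of the dense footprint, which is the opportunity Hooker describes; the catch is that GPUs can't consume this irregular layout anywhere near as efficiently as a dense matrix:

```python
import numpy as np

# Illustrative sparsity accounting with a made-up 90%-sparse weight matrix.
rng = np.random.default_rng(1)
dense = rng.normal(size=(1000, 1000)).astype(np.float32)
dense[rng.random(dense.shape) < 0.9] = 0.0  # zero out ~90% of the weights

# A simple coordinate (COO) format: keep only nonzero values plus indices.
rows, cols = np.nonzero(dense)
values = dense[rows, cols]

dense_bytes = dense.nbytes
sparse_bytes = (values.nbytes
                + rows.astype(np.int32).nbytes
                + cols.astype(np.int32).nbytes)
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

Nothing is lost in the conversion; the matrix can be rebuilt exactly from the values and indices. The hard part, per Hooker, is hardware that can multiply through this representation without paying for the zeros.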
Many of the techniques that are effective at compressing smaller AI models also don't appear to translate well to LLMs, says Hooker. One popular approach is known as quantization, which reduces data requirements by representing weights using fewer bits, for instance using 8-bit floating-point numbers rather than 32-bit. Another is knowledge distillation, in which a large teacher model is used to train a smaller one. So far, though, these techniques have had little success when applied to models above 6 billion parameters, says Hooker.
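The quantization idea can be illustrated in a few lines. This sketch uses symmetric per-tensor 8-bit integer quantization, a common scheme, rather than the 8-bit floating-point variant mentioned above; the weight values are randomly generated for illustration:

```python
import numpy as np

# Toy post-training weight quantization: map 32-bit floats to 8-bit
# integers with a single per-tensor scale, then dequantize for use.
rng = np.random.default_rng(2)
weights = rng.normal(scale=0.02, size=1024).astype(np.float32)

scale = np.abs(weights).max() / 127.0  # symmetric range: [-127, 127]
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale

print(f"storage: {weights.nbytes} -> {q.nbytes} bytes, "
      f"max error: {np.abs(weights - dequantized).max():.6f}")
```

Storage drops 4x, and the rounding error is bounded by half the scale. The trouble Hooker points to is that in very large models a few outlier weights and activations stretch that scale, and the error they induce starts to visibly damage model quality.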
The fight against AI scaling laws also faces more prosaic challenges, says Patel. Part of the reason they've proved so enduring is that it's often easier to throw computing power at a well-understood model architecture than to fine-tune new techniques. "If I have 1,000 GPUs for three months, what's the best model I can make?" he says. "A lot of times, the answer is, unfortunately, that you really can't get these new architectures to run efficiently."
That's not to say that efforts to shrink larger models are a waste of time, says Patel. However, he adds, scaling is likely to continue to be important for setting new states of the art. "The max size is going to continue to grow, and the quality at small sizes is going to continue to grow," he says. "I think there's two divergent paths, and you're kind of following both."
This article appears in the xDATEx print issue as "When Large Language Models Shrink."
Edd Gent is a freelance science and technology writer based in Bengaluru, India. His writing focuses on emerging technologies across computing, engineering, energy and bioscience. He's on Twitter at @EddytheGent and email at edd dot gent at outlook dot com. His PGP fingerprint is ABB8 6BB3 3E69 C4A7 EC91 611B 5C12 193D 5DFC C01B. His public key is here. DM for Signal info.