Building ever larger language models has led to groundbreaking jumps in performance. But it's also pushing state-of-the-art AI beyond the reach of all but the most well-resourced AI labs. That makes efforts to shrink models down to more manageable sizes more important than ever, researchers say.
In 2020, researchers at OpenAI proposed AI scaling laws that suggested increasing model size led to reliable and predictable improvements in capability. But this trend is quickly putting the cutting edge of AI research out of reach for all but a handful of private labs. While the company has remained tight-lipped on the matter, there is speculation that its latest GPT-4 large language model (LLM) has as many as a trillion parameters, far more than most companies or research groups have the computing resources to train or run. As a result, the only way most people can access the most powerful models is through the APIs of industry leaders.
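The predictability those scaling laws describe can be sketched as a simple power law: loss falls smoothly as parameter count grows. The constants below are made up purely for illustration, not the fitted values from any paper:

```python
# Illustrative power-law scaling curve in the spirit of the 2020 scaling
# laws: loss falls predictably as a power of parameter count.
# L_inf, k, and alpha here are invented for illustration, not fitted values.
def predicted_loss(n_params, L_inf=1.7, k=1e13, alpha=0.076):
    """Toy loss estimate: L(N) = L_inf + (k / N) ** alpha."""
    return L_inf + (k / n_params) ** alpha

for n in [1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The curve never hits zero: each order-of-magnitude jump in size buys a smaller and smaller improvement, which is part of why the cost of staying at the frontier grows so fast.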
"We won't be able to make models bigger forever. There comes a point where even with hardware improvements, given the pace that we're increasing the model size, we just can't."
- Dylan Patel, SemiAnalysis
That's a problem, says Dylan Patel, chief analyst at the consultancy SemiAnalysis, because it makes it more or less impossible for others to reproduce these models. That means external researchers aren't able to probe these models for potential safety concerns, and companies looking to deploy LLMs are "tied to the hip" of OpenAI's data set and model design choices.
There are more practical concerns too. The pace of innovation in the GPU chips that are used to run AI is lagging behind model size, meaning that pretty soon we could face a "brick wall" beyond which scaling cannot plausibly go. "We won't be able to make models bigger forever," he says. "There comes a point where even with hardware improvements, given the pace that we're increasing the model size, we just can't."
How large do large language models need to be?
Efforts to push back against the logic of scaling are underway, though. Last year, researchers at DeepMind showed that training smaller models on far more data could significantly boost performance. DeepMindâs 70-billion-parameter Chinchilla model outperformed the 175-billion-parameter GPT-3 by training on nearly five times as much data. This February, Meta used the same approach to train much smaller models that could still go toe-to-toe with the biggest LLMs. Its resulting LLaMa model came in a variety of sizes between 7 and 65 billion parameters, with the 13-billion-parameter version outperforming GPT-3 on most benchmarks.
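The Chinchilla result is often summarized as a rule of thumb: for a fixed compute budget, train on roughly 20 tokens per parameter rather than pouring everything into model size. A back-of-the-envelope sketch (the 20-tokens-per-parameter figure is the widely cited approximation of DeepMind's fits, used here only for illustration):

```python
# Rough sketch of the compute-optimal heuristic attributed to Chinchilla:
# scale training tokens roughly in proportion to parameters, at about
# 20 tokens per parameter. The ratio is an approximation, for illustration.
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal token count for a given model size."""
    return n_params * tokens_per_param

# A 70-billion-parameter model calls for on the order of 1.4 trillion
# training tokens, versus the roughly 300 billion GPT-3 was trained on.
print(f"{chinchilla_tokens(70e9):.1e} tokens")
```

The point is that GPT-3-era models were, by this accounting, dramatically undertrained for their size, which is the slack that smaller, data-hungry models like Chinchilla and LLaMa exploit.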
The company's stated goal was to make such LLMs more accessible, and so Meta offered the trained model to any researchers who asked for it. This experiment in accessibility quickly got out of control, though, after the model was leaked online. And earlier this month, researchers at Stanford pushed things further: They took the 7-billion-parameter version of LLaMa and fine-tuned it on 52,000 query responses from GPT-3.5, the model that originally powered ChatGPT and (as of press time) still powers OpenAI's free version. The resulting model, called Alpaca, was able to replicate much of the behavior of the OpenAI model, according to the researchers, who released their data and training recipe so others could replicate it.
"Increasingly, we were finding that there was a gap in the qualitative behavior of models available to the research community and the closed-source models being served by leading LLM providers," says Tatsunori Hashimoto, an assistant professor at Stanford who led the research. "Our view was that having a capable and accessible model was important to have the academic community engage in analyzing and solving the many deficiencies of instruction-following LLMs."
Since then, hackers and hobbyists have run with the idea, using the LLaMa weights and the Alpaca training scheme to run their own LLMs on PCs, phones, and even a Raspberry Pi single-board computer. Hashimoto says it's great to see more people engaging with LLMs, and he's been surprised at the efficiency people have squeezed out of these models. But he stresses that Alpaca is still very much a research model not suitable for widespread use, and that broad accessibility to LLMs also carries risks.
"If we can take advantage of the knowledge already frozen in these models, we should."
- Jim Fan, Nvidia
Patel says there are question marks around the way the Stanford researchers evaluated their model, and it's not clear its performance is as good as that of larger models. But there are plenty of other approaches to boosting efficiency that are also making progress. One promising technique is the "mixture of experts" (MoE), he says, which involves training multiple smaller sub-models specialized for specific tasks rather than using a single large model to solve all of them. The MoE approach makes a lot of sense, says Patel: our brains follow a similar pattern, with different regions specialized for different tasks.
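The routing idea at the heart of MoE can be sketched in a few lines. This is a toy illustration only, with made-up shapes and random weights, not any lab's actual architecture: a small gating network scores the experts for each input, and only the top-scoring expert runs, so compute per input stays flat even as total parameters grow with the number of experts.

```python
import numpy as np

# Toy top-1 mixture-of-experts forward pass (illustrative only).
rng = np.random.default_rng(0)
d_in, d_out, n_experts = 8, 4, 3

# Each "expert" is just a small linear layer here.
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d_in, n_experts))

def moe_forward(x):
    scores = x @ gate_w                 # gating logits, one per expert
    chosen = int(np.argmax(scores))     # top-1 routing: pick one expert
    return x @ experts[chosen], chosen  # only that expert's weights are used

x = rng.normal(size=d_in)
y, expert_id = moe_forward(x)
print(f"routed to expert {expert_id}, output shape {y.shape}")
```

Real MoE layers typically route each token independently and may blend the top few experts, but the economics are the same: total parameter count scales with the number of experts while per-token compute does not.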
Nvidia recently used the approach to build a vision-language model called Prismer, designed to answer questions about images or provide captions. The company showed that the model could match the performance of models trained on 10 to 20 times as much data. "There are tons of high-quality pretrained models for various tasks like depth estimation, object segmentation, and 3D understanding," says Jim Fan, AI research scientist at Nvidia. "If we can take advantage of the knowledge already frozen in these models, we should."
Turning LLM sparsity into opportunity
Another attractive approach to boosting model efficiency is to exploit a property known as sparsity, says Patel. A surprisingly large number of weights in LLMs are set to zero, and performing operations on these values is a waste of computation. Finding ways to remove these zeros could help shrink the size of models and reduce computational costs, says Patel.
Sparsity is one of the most promising future directions for compressing models, says Sara Hooker, who leads the research lab Cohere For AI, but current hardware is not well-suited to exploit it. Patterns of sparsity typically don't have any obvious structure, but today's GPUs are specialized for processing data in well-defined matrices. This means that even when a weight is zero it still needs to be represented in the matrix, which takes up memory and adds computational overhead. While enforcing structured patterns of sparsity is a partial workaround, the chips can't take full advantage, and further hardware innovation is probably needed, Hooker says. "The interesting challenge is, how do you represent the absence of something without actually representing it?" she says.
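The storage side of that argument is easy to see with a rough sketch. Assuming a made-up weight matrix in which 90 percent of entries are zero, a simple coordinate format that stores only the nonzeros and their indices takes a fraction of the dense footprint, which is the opportunity Hooker describes; the catch is that GPUs can't consume this irregular layout anywhere near as efficiently as a dense matrix:

```python
import numpy as np

# Illustrative sparsity accounting with a made-up 90%-sparse weight matrix.
rng = np.random.default_rng(1)
dense = rng.normal(size=(1000, 1000)).astype(np.float32)
dense[rng.random(dense.shape) < 0.9] = 0.0  # zero out ~90% of the weights

# A simple coordinate (COO) format: keep only nonzero values plus indices.
rows, cols = np.nonzero(dense)
values = dense[rows, cols]

dense_bytes = dense.nbytes
sparse_bytes = (values.nbytes
                + rows.astype(np.int32).nbytes
                + cols.astype(np.int32).nbytes)
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

Nothing is lost in the conversion; the matrix can be rebuilt exactly from the values and indices. The hard part, per Hooker, is hardware that can multiply through this representation without paying for the zeros.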
Many of the techniques that are effective at compressing smaller AI models also don't appear to translate well to LLMs, says Hooker. One popular approach is known as quantization, which reduces data requirements by representing weights using fewer bits, for instance using 8-bit floating-point numbers rather than 32-bit. Another is knowledge distillation, in which a large teacher model is used to train a smaller one. So far, though, these techniques have had little success when applied to models above 6 billion parameters, says Hooker.
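The quantization idea can be illustrated in a few lines. This sketch uses symmetric per-tensor 8-bit integer quantization, a common scheme, rather than the 8-bit floating-point variant mentioned above; the weight values are randomly generated for illustration:

```python
import numpy as np

# Toy post-training weight quantization: map 32-bit floats to 8-bit
# integers with a single per-tensor scale, then dequantize for use.
rng = np.random.default_rng(2)
weights = rng.normal(scale=0.02, size=1024).astype(np.float32)

scale = np.abs(weights).max() / 127.0  # symmetric range: [-127, 127]
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale

print(f"storage: {weights.nbytes} -> {q.nbytes} bytes, "
      f"max error: {np.abs(weights - dequantized).max():.6f}")
```

Storage drops 4x, and the rounding error is bounded by half the scale. The trouble Hooker points to is that in very large models a few outlier weights and activations stretch that scale, and the error they induce starts to visibly damage model quality.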
The fight against AI scaling laws also faces more prosaic challenges, says Patel. Part of the reason they've proved so enduring is that it's often easier to throw computing power at a well-understood model architecture than to fine-tune new techniques. "If I have 1,000 GPUs for three months, what's the best model I can make?" he says. "A lot of times, the answer is, unfortunately, that you really can't get these new architectures to run efficiently."
That's not to say that efforts to shrink larger models are a waste of time, says Patel. However, he adds, scaling is likely to continue to be important for setting new states of the art. "The max size is going to continue to grow, and the quality at small sizes is going to continue to grow," he says. "I think there's two divergent paths, and you're kind of following both."
This article appears in the xDATEx print issue as "When Large Language Models Shrink."
Edd Gent is a freelance science and technology writer based in Bengaluru, India. His writing focuses on emerging technologies across computing, engineering, energy and bioscience. He's on Twitter at @EddytheGent and email at edd dot gent at outlook dot com. His PGP fingerprint is ABB8 6BB3 3E69 C4A7 EC91 611B 5C12 193D 5DFC C01B. His public key is here. DM for Signal info.