What are scaling laws?

"Scaling laws", in the context of training an AI model, describe how model performance depends on three key quantities: the model's size (number of parameters), the length of the training run, and the amount of data it was trained on. These three quantities determine how much compute is used in the training process, and scaling laws are used to allocate a fixed amount of compute between them so as to produce the most capable model.

Scaling laws are used to decide on trade-offs like: Should I pay Stack Overflow to train on its data? Or should I buy more GPUs? Or should I pay the higher electricity bills I would get by training my model longer? If my compute goes up 10×, how many parameters should I add to my model to make the best possible use of my GPUs?

In the case of frontier models like GPT-4, these trade-offs might look like training a 20-billion parameter model on 40% of an archive of the Internet, training a 200-billion parameter model on 4% of an archive of the Internet, or any strategy in between.
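
The two strategies above cost roughly the same amount of compute, because dividing the parameter count by 10 while multiplying the data by 10 keeps the parameters-times-tokens product fixed. A quick check using the 6·N·D approximation from above, with a made-up archive size just to set the scale:

```python
# The archive size is a hypothetical placeholder; only the ratio between the
# two strategies matters for comparing their compute budgets.
ARCHIVE_TOKENS = 10e12  # assume a 10-trillion-token archive for illustration

strategies = {
    "20B parameters on 40% of the archive": (20e9, 0.40 * ARCHIVE_TOKENS),
    "200B parameters on 4% of the archive": (200e9, 0.04 * ARCHIVE_TOKENS),
}

for name, (n_params, n_tokens) in strategies.items():
    print(f"{name}: {6 * n_params * n_tokens:.1e} FLOPs")
# Both print ~4.8e+23 FLOPs: 20e9 * 0.40 equals 200e9 * 0.04 (times the archive
# size), so the two strategies spend the same compute budget in different ways.
```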

In 2020, OpenAI proposed the first scaling laws, based on the finding that, at least for the largest models at the time, increasing model size was more effective than using more data. Subsequent research largely accepted this hypothesis: note, in the table below, how model size grew rapidly while the amount of training data stayed roughly constant. (The power-law form these laws take is sketched after the table.)

| Model | Year | Size (parameters) | Data (training tokens) |
|---|---|---|---|
| LaMDA | 2021 | 137 billion | 168 billion |
| GPT-3 | 2020 | 175 billion | 300 billion |
| Jurassic | 2021 | 178 billion | 300 billion |
| Gopher | 2021 | 280 billion | 300 billion |
| MT-NLG 530B | 2022 | 530 billion | 270 billion |

Caption: The number of parameters has been increasing faster in recent years. Note the logarithmic scale. Graph from Epoch.
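
The 2020 paper (Kaplan et al.) expressed these laws as power laws; for example, with data and compute not acting as bottlenecks, test loss as a function of model size alone takes roughly the form below. The constants are the approximate fitted values reported in that paper, included only for illustration:

```python
# Sketch of the single-variable power law from Kaplan et al. (2020): with data
# and compute not bottlenecked, test loss falls as a power of model size N.
# N_C and ALPHA_N are the approximate fitted constants reported in that paper.
N_C = 8.8e13      # characteristic parameter count
ALPHA_N = 0.076   # exponent: how quickly loss improves with scale

def kaplan_loss_vs_params(n_params: float) -> float:
    """Predicted test loss (nats/token) as a function of model size alone."""
    return (N_C / n_params) ** ALPHA_N

# Doubling model size shaves only a few percent off the predicted loss:
print(kaplan_loss_vs_params(1e9), kaplan_loss_vs_params(2e9))
```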

DeepMind researchers proposed new scaling laws in 2022. They found that increasing the size of the model and the size of the dataset by roughly the same amount was a more effective use of compute than mainly increasing model size. To test the new scaling law, DeepMind trained a 70-billion parameter model called "Chinchilla" using the same amount of compute as the 280-billion parameter Gopher. Chinchilla’s smaller size allowed DeepMind to reallocate compute to train the model on a much larger dataset (1.4 trillion tokens compared to Gopher’s 300 billion). As the new scaling laws predicted, Chinchilla performed significantly better than Gopher.
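
As a rough illustration of what the 2022 laws imply, the sketch below assumes two rules of thumb that are consistent with the Chinchilla result but are not stated in this article: compute ≈ 6·N·D, and a compute-optimal ratio of roughly 20 training tokens per parameter.

```python
# Rough Chinchilla-style allocation: grow parameters and tokens together,
# keeping about 20 training tokens per parameter. Both the 6*N*D compute
# approximation and the 20:1 ratio are rules of thumb assumed for this sketch.
def chinchilla_optimal(compute_flops: float,
                       tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Return (n_params, n_tokens) that roughly exhaust a compute budget."""
    # Solve compute ≈ 6 * N * D with D = tokens_per_param * N
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: roughly the compute budget used for Gopher and Chinchilla
n, d = chinchilla_optimal(5.8e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")  # ~70B, ~1.4T
```

Under this allocation, parameters and data both scale with the square root of compute, which also answers the earlier question: a 10× compute budget buys roughly √10 ≈ 3.2× more parameters, with the rest of the extra compute going into roughly 3.2× more training data.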


