What are scaling laws?

"Scaling laws", in the context of training an AI model, describe how model performance depends on three key quantities: the model's size (number of parameters), the length of the training run, and the amount of data it was trained on. These three quantities determine how much compute is used in the training process, and scaling laws are used to allocate a fixed amount of compute between them so as to produce the most capable model.

Scaling laws are used to decide on trade-offs like: Should I pay Stack Overflow to train on its data? Or should I buy more GPUs? Or should I pay the higher electricity bills I would get by training my model longer? If my compute goes up 10×, how many parameters should I add to my model to make the best possible use of my GPUs?

In the case of frontier models like GPT-4, these trade-offs might look like training a 20-billion parameter model on 40% of an archive of the Internet, training a 200-billion parameter model on 4% of an archive of the Internet, or any strategy in between.
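
The two strategies above cost roughly the same amount of compute, because dividing the parameter count by 10 while multiplying the data by 10 keeps the parameters-times-tokens product fixed. A quick check using the 6·N·D approximation from above, with a made-up archive size just to set the scale:

```python
# The archive size is a hypothetical placeholder; only the ratio between the
# two strategies matters for comparing their compute budgets.
ARCHIVE_TOKENS = 10e12  # assume a 10-trillion-token archive for illustration

strategies = {
    "20B parameters on 40% of the archive": (20e9, 0.40 * ARCHIVE_TOKENS),
    "200B parameters on 4% of the archive": (200e9, 0.04 * ARCHIVE_TOKENS),
}

for name, (n_params, n_tokens) in strategies.items():
    print(f"{name}: {6 * n_params * n_tokens:.1e} FLOPs")
# Both print ~4.8e+23 FLOPs: 20e9 * 0.40 equals 200e9 * 0.04 (times the archive
# size), so the two strategies spend the same compute budget in different ways.
```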

In 2020, OpenAI proposed the first scaling laws, based on the finding that, at least for the largest models at the time, increasing model size was more effective than using more data. Subsequent research largely accepted this hypothesis: note, in the table below, how model size grew rapidly while the amount of training data stayed roughly constant. (The power-law form these laws take is sketched after the table.)

| Model | Year | Size (parameters) | Data (training tokens) |
|---|---|---|---|
| LaMDA | 2021 | 137 billion | 168 billion |
| GPT-3 | 2020 | 175 billion | 300 billion |
| Jurassic | 2021 | 178 billion | 300 billion |
| Gopher | 2021 | 280 billion | 300 billion |
| MT-NLG 530B | 2022 | 530 billion | 270 billion |

Caption: The number of parameters has been increasing faster in recent years. Note the logarithmic scale. Graph from Epoch.
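
The 2020 paper (Kaplan et al.) expressed these laws as power laws; for example, with data and compute not acting as bottlenecks, test loss as a function of model size alone takes roughly the form below. The constants are the approximate fitted values reported in that paper, included only for illustration:

```python
# Sketch of the single-variable power law from Kaplan et al. (2020): with data
# and compute not bottlenecked, test loss falls as a power of model size N.
# N_C and ALPHA_N are the approximate fitted constants reported in that paper.
N_C = 8.8e13      # characteristic parameter count
ALPHA_N = 0.076   # exponent: how quickly loss improves with scale

def kaplan_loss_vs_params(n_params: float) -> float:
    """Predicted test loss (nats/token) as a function of model size alone."""
    return (N_C / n_params) ** ALPHA_N

# Doubling model size shaves only a few percent off the predicted loss:
print(kaplan_loss_vs_params(1e9), kaplan_loss_vs_params(2e9))
```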

DeepMind researchers proposed new scaling laws in 2022. They found that increasing the size of the model and the size of the dataset by roughly the same amount was a more effective use of compute than mainly increasing model size. To test the new scaling law, DeepMind trained a 70-billion parameter model called "Chinchilla" using the same amount of compute as the 280-billion parameter Gopher. Chinchilla’s smaller size allowed DeepMind to reallocate compute to train the model on a much larger dataset (1.4 trillion tokens compared to Gopher’s 300 billion). As the new scaling laws predicted, Chinchilla performed significantly better than Gopher.
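
As a rough illustration of what the 2022 laws imply, the sketch below assumes two rules of thumb that are consistent with the Chinchilla result but are not stated in this article: compute ≈ 6·N·D, and a compute-optimal ratio of roughly 20 training tokens per parameter.

```python
# Rough Chinchilla-style allocation: grow parameters and tokens together,
# keeping about 20 training tokens per parameter. Both the 6*N*D compute
# approximation and the 20:1 ratio are rules of thumb assumed for this sketch.
def chinchilla_optimal(compute_flops: float,
                       tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Return (n_params, n_tokens) that roughly exhaust a compute budget."""
    # Solve compute ≈ 6 * N * D with D = tokens_per_param * N
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: roughly the compute budget used for Gopher and Chinchilla
n, d = chinchilla_optimal(5.8e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")  # ~70B, ~1.4T
```

Under this allocation, parameters and data both scale with the square root of compute, which also answers the earlier question: a 10× compute budget buys roughly √10 ≈ 3.2× more parameters, with the rest of the extra compute going into roughly 3.2× more training data.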


