JP van Oosten

Training a language model on a single GPU in one day

Feb 2, 2023

The other day, I mentioned nanoGPT in one of my posts: an implementation of GPT that smaller organisations can use for generating text. This type of research is cool because it puts these kinds of models within reach of a much wider range of users. I’ll leave my thoughts about going all in on the "deep-learning bet" for another time 😉

In December, two researchers from the University of Maryland published a pre-print on “Cramming”, a challenge they set themselves: how well can a BERT-style language model be trained in one day, on a single GPU? They use a number of tricks to make the model smaller and the training more efficient within that budget.
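
If you’re curious what that kind of setup looks like in code, here is a minimal sketch of masked-language-model training on a single GPU with Hugging Face transformers. To be clear, this is not the authors’ recipe: the dataset, model size and hyperparameters below are placeholder assumptions, just to show the moving parts.

```python
# Minimal sketch (not the Cramming recipe) of masked-language-model
# pre-training on a single GPU with Hugging Face transformers.
# Dataset, model size and hyperparameters are illustrative placeholders.
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A small public corpus for illustration; a real run would push far more
# data through the model within the 24-hour budget.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = raw.map(tokenize, batched=True, remove_columns=["text"])

# A scaled-down BERT-style configuration (assumed numbers, not the paper's).
config = BertConfig(
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
)
model = BertForMaskedLM(config)

# Standard 15% token masking for the masked-language-modelling objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cram-bert",
    per_device_train_batch_size=64,
    learning_rate=1e-3,
    max_steps=10_000,                    # stop by step count, not epochs
    fp16=torch.cuda.is_available(),      # mixed precision to stretch the budget
    logging_steps=500,
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collator,
).train()
```

The fixed step budget and mixed precision here are stand-ins for the real constraint in the paper, which is wall-clock time on one GPU.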

The paper shows that they got pretty far in terms of performance. Even given the limitations of such a small model, this type of research is inspiring, because it allows individuals and small organisations to use language modelling successfully as well.

Of course, most of the time you don’t need to train your own BERT-like models and can just use pre-trained ones; see the quick example below. You might not even need fancy deep learning models to solve your business problems at all. Are you wondering what kind of AI technology is best for your business? Feel free to drop me a line and we’ll have a chat!
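
To give a sense of how little code reusing a pre-trained model takes, here is a fill-mask example with a pre-trained BERT via Hugging Face transformers (the model choice is just an illustration):

```python
from transformers import pipeline

# Load a pre-trained BERT and let it fill in the blank; no training needed.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("Training a language model on a single [MASK] in one day."):
    print(prediction["token_str"], round(prediction["score"], 3))
```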

(Also posted on my LinkedIn feed)