Newsletter #3: 2022-12-22

Hi again!

Here we are at our third AI/ML newsletter of the term, and this week's theme is sustainability and the resources used by machine learning.

I’ve been thinking about this ever since the CEO of OpenAI said that the cost of operating ChatGPT is “eye-watering”.

The cost of training or using a machine learning model is, at its core, essentially the cost of the energy used to run the computers. That means the dollar cost is a rough proxy for the model's environmental impact.
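To make that concrete, here's a toy back-of-the-envelope calculation. Every number in it is an illustrative assumption (GPU power draw, electricity price, grid carbon intensity), not a measurement, but it shows why the dollar figure and the carbon figure move together:

```python
# A toy back-of-the-envelope: dollars and carbon both scale with energy use.
# All numbers here are illustrative assumptions, not measurements.
gpu_power_kw = 0.3                 # one GPU drawing ~300 W
hours = 24 * 30                    # a month of continuous use
price_per_kwh = 0.12               # assumed electricity price in USD
grid_intensity_kg_per_kwh = 0.4    # rough grid carbon intensity, kg CO2 per kWh

energy_kwh = gpu_power_kw * hours
print(f"Energy: {energy_kwh:.0f} kWh")
print(f"Cost:   ${energy_kwh * price_per_kwh:.2f}")
print(f"CO2:    {energy_kwh * grid_intensity_kg_per_kwh:.0f} kg")
# Both the dollar figure and the carbon figure are the same kWh number
# multiplied by a different constant, which is why cost tracks emissions.
```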

So I don’t know about you but I hear that costs are “eye-watering” and my immediate thought is “Oh no, how much carbon emission is this exactly?”

Understanding the resource usage of modern large-scale machine learning is still very much in its infancy, and at the moment it relies on private organizations self-reporting their training and inference costs. We know that Stable Diffusion cost about $600k to train because Stability AI told us. Meanwhile, as far as I can tell (unless it was in a poorly circulated press release), we only have estimates for OpenAI products like DALL·E 2 (estimated at upwards of $1M) and GPT-3 (estimated to have cost at least $4.6M, but possibly on the order of $10M).

That doesn’t mean, though, that there aren’t a lot of people in the ML community who care about this! Hugging Face has been doing research on these estimates and has even started including carbon-cost data alongside hosted models, which (and this is absolutely wild to me) lets you search for models by estimated carbon emissions in grams.

They even have a semi-automated way of keeping track of carbon emissions for your code if you’re using their infrastructure for training and using models!
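If you want that kind of bookkeeping outside of their infrastructure, here's a minimal sketch using the open-source codecarbon library, which, as I understand it, is the same sort of tooling that powers these integrations. The `train_model` function and the project name are just stand-ins for your own code:

```python
# A minimal sketch of per-run carbon tracking with the codecarbon library.
# `train_model` is a hypothetical stand-in for your own training loop.
from codecarbon import EmissionsTracker

def train_model():
    # ... your actual training code would go here ...
    pass

tracker = EmissionsTracker(project_name="newsletter-demo")
tracker.start()
train_model()
emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent for this run
print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")
```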

But this opens a lot of questions about the use of large models, especially in the rush to build both toys and tools on top of things like GitHub’s Copilot or OpenAI’s ChatGPT. Even running GPT-3 once it’s trained requires something like $50k worth of GPUs shared across multiple machines. That’s a lot of upfront cost and resources just to run a single, albeit impressive, service. And having to build and buy new computers is, with the exception of very low-power architectures, almost always worse for the environment than making better use of hardware that already exists.
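A rough back-of-the-envelope shows why a single machine isn't enough: 175 billion parameters at two bytes each (fp16) is about 350 GB of weights, which doesn't fit on any single GPU. The 80 GB card size below is an assumption for illustration, and real deployments also need memory for activations on top of the weights:

```python
# Back-of-the-envelope: why GPT-3 needs multiple GPUs just for inference.
# Assumes fp16 weights (2 bytes per parameter) and 80 GB accelerators;
# real deployments also need memory for activations and caching.
import math

params = 175e9            # GPT-3 parameter count
bytes_per_param = 2       # fp16
gpu_memory_gb = 80        # one high-end accelerator

weights_gb = params * bytes_per_param / 1e9
min_gpus = math.ceil(weights_gb / gpu_memory_gb)
print(f"{weights_gb:.0f} GB of weights -> at least {min_gpus} GPUs")
# ~350 GB of weights -> at least 5 GPUs, before any activation memory
```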

There are two obvious directions for improving sustainability: reduce the amount of resources used, or find a better way to use the resources we have. We’ll talk about the second one first, because I think it’s the better explored of the two. Distributed computing is the practice of splitting big computational jobs across multiple machines and collating the results back together in an automated way. Note that I say “automated” rather than “automatic”: you will probably have to do a little work to make your code fit a distributed computing framework, but once you have, you can split the job into pieces and farm them out to different machines.
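Here's a minimal single-machine sketch of that split / farm out / collate pattern, using Python's standard library as a stand-in for what a real distributed framework automates across many machines. `simulate_chunk` is a hypothetical placeholder for one independent piece of a big job:

```python
# A single-machine sketch of the split / farm out / collate pattern that
# distributed computing frameworks automate across many machines.
from concurrent.futures import ProcessPoolExecutor

def simulate_chunk(seed: int) -> float:
    # Pretend this is an expensive, independent piece of work.
    return sum((seed * i) % 7 for i in range(100_000)) / 1e5

if __name__ == "__main__":
    chunks = range(8)  # split the job into 8 independent pieces
    with ProcessPoolExecutor() as pool:
        partial_results = list(pool.map(simulate_chunk, chunks))  # farm out
    total = sum(partial_results)  # collate the results back together
    print(f"Combined result from {len(chunks)} chunks: {total:.2f}")
```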

As a personal connection, my first research group back in the day used distributed computing for particle physics simulations on machines spread across the entire UW–Madison campus. It turns out that this project – Condor, now HTCondor – still exists and has actually pivoted to distributed computing for machine learning!

For a more recent machine learning success with distributed computing, EleutherAI – a large, genuinely open-source organization – has used distributed training frameworks like Mesh TensorFlow to train its own competitors to GPT-3, from GPT-Neo up to GPT-NeoX.

And of course, this makes me think of eons and eons ago, back when I used to leave projects like Folding@Home (Folding@Home on Wikipedia) running on my PS3, and wonder whether training large models might one day be a collective work of humanity rather than something only a few organizations can afford to do. Unshockingly, there are other people who’ve been wondering about this too.

If you check out the site above you’ll see this is still very much a project in its infancy, but wouldn’t it be neat to run something like it across a bunch of machines in a community college system with multiple campuses? Just a random example.

But, okay, what about just making models smaller? We already know that making models bigger and bigger isn’t necessarily making them better. Chinchilla is a recent model that is still large, don’t get me wrong, but it’s less than half the size of GPT-3 (70 billion parameters versus 175 billion) and performs better, largely because it was trained on far more data for a similar compute budget. I cannot stress enough that we don’t actually know what the “right” number of parameters should be for things like these large language models. It’s entirely possible, maybe even likely, that they’re so big not because they have to be but because we need that big of a hammer to force the circle through the triangular hole.
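The rough rule of thumb that came out of the Chinchilla paper is on the order of 20 training tokens per parameter; that exact ratio is an approximation, not a hard law, but a quick sketch shows why a smaller model trained on more data can come out ahead:

```python
# A rough sketch of the Chinchilla rule of thumb: for a fixed compute budget,
# aim for roughly 20 training tokens per model parameter.
# The 20x ratio is an approximation from the paper, not a hard law.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

for name, n_params in [("GPT-3 (175B)", 175e9), ("Chinchilla (70B)", 70e9)]:
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{name}: ~{tokens / 1e12:.1f} trillion tokens to be compute-optimal")

# GPT-3 was trained on roughly 0.3 trillion tokens, far fewer than this rule
# suggests, which is part of why a smaller model trained on more data beat it.
```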

Which brings me to a paper an internet friend once shared with me: Playing Atari with Six Neurons. The paper shows how, by being more clever about the design of the system, the authors built an absolutely tiny neural network that can play old Atari video games. They essentially trade up-front engineering effort for very small final models. And while the specific technique they use probably won’t apply to something like a future micro-GPT, maybe that core lesson will!

The last thing I wanted to share about sustainability, and perhaps better context for why all these things are on my mind, is that I come from the permacomputing scene, where we’re concerned with how to create a world where computers are long-term artifacts that last decades rather than disposable commodities. The idea, to me, is that we want not only to reduce future electronic waste and make better use of the computers that already exist so they don’t rot in landfills, but also to make computing more financially accessible for everyone by reducing the hardware requirements for participating in this shared computational infrastructure. I have a lot of thoughts on this topic, if you can’t tell!

So now, with the heavy stuff out of the way, here are some of the links shared by faculty members this week:

Okay, and with that I’ll leave y’all for the holidays!