Newsletter #6: 2023-05-30

Hi everyone, it’s been months since my last newsletter but there’s a lot to cover as we’re getting back to talking about what’s been happening in the world of large language models (LLMs).

So, obviously, a lot of this is going to be about chatGPT. That’s unavoidable. It’s the big flashy product that’s growing to tens and tens of millions of users, over a hundred million if their hype is to be believed, and it’s the main thing that everyone is talking about. It was supposed to change the world: but has it?

When it comes to education, certainly the fear of chatGPT has already had a large impact. Now that we’re reaching the end of the school year here in the US we’re seeing a lot of new stories crop up about “AI detection tools” getting used to flunk swaths of students.

The biggest story was about Texas A&M-Commerce.

But I’ve also been finding stories like this showing up on Reddit a lot.

(with regard to comments, caveat lector)

So the Texas A&M story is just wild, right? Like the professor, who apparently is a 30-year-old who just got his Ph.D. a couple of years ago so this isn’t a tenured boomer on his way out, is pasting in answers into chatGPT and asking “Did you write this?” which is an absolutely non-sensical thing to do. As I’m fond of quoting Pauli: it’s not even wrong.

But the stories involving tools like GPTZero are more interesting because they are, at least in principle, capable of detecting whether text is generated by GPT-3/3.5/4. So what does it mean when almost every student in a class is getting accused of plagiarism? Well, this is an example of what sometimes is called the “false positive paradox” which itself is an example of “base rate fallacy“.

The idea is pretty simple: let’s imagine, just to make the numbers look nice, when GPTZero flags an essay as generated text it’s right 95% of the time and the other 5% is a false positive. Now that sounds pretty good, right? But the problem is that we have to worry about how many cheaters there are in the first place! Imagine a class of two hundred students where fully half of them are cheating. So then you’re going to get 100*0.95 = 95 cheaters correctly flagged and 100*0.05 = 5 non-cheaters incorrectly flagged. Well, duh, that’s what we expect from the 95% accuracy, right?

Now consider, what if out of that two hundred only 10 of them are cheaters? Then you’re going to get, roughly, all ten of those cheaters identified but also 190*0.05 = 9.5 (rounding up) another ten students who aren’t cheaters flagged as well.

So the efficacy of a tool like GPTZero is almost entirely dependent on whether you think that cheating is prevalent – but even in the best-case scenario, you will be punishing students who are not cheating!

I feel comfortable saying, then, that trying to detect whether students are using LLMs with any kind of automated tool is probably useless.

So what of cheaters? I don’t know. I’m still not entirely convinced that we need to worry about it for most classes. Here’s part of my reason: we already know that payments to “paper mills” has gone way down since chatGPT became available. See this article.

This tells me that the people using chatGPT to cheat were already turning in work that wasn’t their own. I think it’s kind of a wash and we don’t need to upend our teaching and make assessment more inaccessible.

Okay, so that being said are there good uses of LLMs in the classroom? I think so! I think they’re good for quickly unpacking common definitions, rephrasing text copied from Wikipedia or textbooks in different ways to help understanding, building first drafts from outlines, and as aids for students who are still gaining fluency in English to make their drafts more idiomatic.

I think we’re soon going to be over the hype of “Can LLMs give you advice and lifehacks and answer your every question?” type of deals. For a few months now I’ve been stuck thinking about this experiment.

Basically, it’s a gag where someone entered the following prompt into the chatGPT GPT-4 model: “You are HustleGPT, an entrepreneurial AI. I am your human counterpart. I can act as a liaison between you and the physical world. You have $100, and your only goal is to turn that into as much money as possible in the shortest time possible, without doing anything illegal. I will do everything you say and keep you updated on our current cash total. No manual labor.”

And documented the results.

The problem is that what you’re going to get out is the standard advice for “how to make a bunch of money without doing any work” you’ll find plastered on the internet. If you’re as familiar with this scene as I am then you already know this means our old collective nemeses dropshipping and affiliate marketing. For those who don’t know what these terms mean they are, respectively yet uncharitably, being an online middle-man between sellers and buyers—neither making nor stocking products themselves – and being an SEO spammer who fills their posts with links to products that you will get a few cents for if people actually click through.

These are not actually viable business practices, basically ever, as noted by the fact that the thousands of people who claim to have used these tricks to become quadrillionaires while working 30 seconds a day don’t seem to do anything with their time and money other than sell you expensive courses telling you how to also become impossibly rich with no effort.

If you know what I mean by MLM “tooling scams” this is basically the distributed, hierarchyless, version of that.

So, yes, chatGPT recommends exactly this because it’s the advice that people who want to take your money have flooded the internet with.

So what happened with HustleGPT?

Unsurprisingly, no, his affiliate marketing site did not take off and make a billion dollars. He has however built a business around helping people to build successful businesses with chatGPT! Huh, does this recall something I said just a few paragraphs ago?

I’m pretty sure this trend is going to die out the way “buy this NFT and soon you can sell it for infinite money” hype did. Maybe FoldingIdeas will even make a video on it.

The trend I’m less confident is going to die out is “automate content creation with chatGPT”. It’s pretty big right now, from really low-effort entries like “faceless videos” on YouTube to making automated blog posts – often for your affiliate marketing – people are arguing that soon you’ll be able to automate away all that pesky writing you need to do in your life.

If I can get a little polemical, I find most of this sorta fascinating because, well, I’m not sure if I understand the point in writing things that are simplistic enough that you can automate the process.

Using an LLM to help brainstorm? Oh, absolutely. Using them to turn outlines into drafts? I personally don’t have use for that but that’s only because I’m an experienced writer. For a novice, I can see it being really helpful for seeing examples of how writing could be done. The problem with using the LLM to try and “automate” content creation is that it’s honestly just a mediocre writer. If you’re literally just taking the output without large amounts of editing, you’re only going to get the blandest version possible of that writing. It will be competent in terms of grammar but will be, largely, just an unpacking and repacking of definitions.

How could it not be? It’s about likelihood and averages with a little wiggle room in terms of randomness so that it doesn’t behave like hitting word prediction on your phone repeatedly.

So, yes, you can automate out huge chunks of writing if the writing was going to be bland and perfunctory to begin with, but if that’s the case why are you doing it in the first place?

To make Wittgenstein wince via non-linear time: of that which one must automate, one must remain silent.

And yet far from being silent, we’re seeing people literally bragging about using chatGPT to create the literary equivalent of shovelware.

Part of the danger here is that we’re going to create an environment in which the web is further diluted in terms of real, careful, informative content. This is going to degrade search even more than it already has been and, I fear, we’re going to start having the same companies providing the tools for flooding the internet will continue selling us the tools to fix the problems they’re exacerbating.

It’s a bit of an aside but I think in a year or two we’re going to see Google and Bing become essentially useless. We’re going to need to use a mixture of personal bookmarking, webrings, RSS feeds, and the like in order to actually manage our information rather than being able to retrieve it via search. The new internet is going to become the old internet, is what I’m saying.

So all that – somewhat negative – news aside what are some of the good things that have been coming out of the LLM world?

Well since my last newsletter GPT-4 has become available as part of chatGPT (the original being a fine-tuned version of GPT-3.5). I have to say in my own experiments GPT-4 is incredibly impressive. It kinda wipes the floor with GPT-3.5 in a lot of ways. This isn’t just my impression, either. It does really well on a lot of various benchmarks and metrics. So much so that it’s ignited a bit of a hype train. Now despite the fact that I don’t think it’s “agi” we’re starting to see some kind of interesting emergent properties, like the fact that prompting tricks like the ones outlined in this video and the papers he links even work. If you don’t have time to watch this I can summarize and say that we’re discovering that a lot of tricks like “explain step-by-step”, “reflect on your answer”, “find what’s wrong in the provided answer” etc. actually provide massively improved results when prompting. We don’t really understand why this is true. Some people have felt like there are new abilities that show up in LLMs as they get larger, but it’s actually hard to tell if that’s real or if it’s an accident of how we’re evaluating the model’s capabilities as can be seen in this paper.

The big OpenAI news is not just that GPT-4 is available but also that they’ve started adding extensions: GPT-4 + plugins and GPT-4 + browsing. The tl;dr of the former is that developers are starting to create integrations between the natural language interface to chatGPT with their products. They range from creating Spotify playlists out of natural language prompts to searching for real estate to integration with pdf parsing tools so that you can load in pdfs as context for your queries. There are literally hundreds and hundreds of plugins for GPT-4 now and it’s only been available for development for a couple of months.

You won’t be surprised that I have a lot of concerns about the idea of creating an interface to your application with a natural language interface like an LLM. The flexibility that makes the LLM an easy tool for an interface also makes it hard to control. Simon Wilison has written a lot about this:

But on the other hand, I think things like the ability to query your own documents are pretty great. I guess you can summarize my feeling as “LLMs are great when they’re used to process text, not to take actions based on that text”. So while, yes, there’s an Instacart plugin that can let you ask about a recipe and then build a cart of all the ingredients for that recipe I think trying to do anything more complicated than that is – wait for it – a recipe for disaster.

This does bring us to things like GPT-4 + browsing, where with chatGPT you’re now able to ask queries and also have it double-check the query against a Bing search. I’ll be honest I can’t tell if directly providing URLs to use in a search actually affects how the Bing search is done and the information is processed. It’s fairly opaque to see what it’s doing, which I find frustrating: two steps forward, one step back.

Although I’m segueing to our last topic major topic which is that if you want to have more control and transparency over what’s happening you need to have more local code and control rather than using an opaque platform such as chatGPT. That’s not quite possible yet, but we’re getting there.

The big thing that happened in the “local LLM install” revolution was the release of the llama model weights. (As a reminder, by weights we mean the values of all the parameters that go into the neural network. If you know the shape of the neural network and you know the final weights, you have everything you need to replicate the trained network yourself.) Facebook had trained its own large language model, meant to be a competitor for GPT-3.5 and GPT-4, then they released the weights to researchers for study, and those researchers took to sharing it with each other more efficiently by using BitTorrent to spread the files. So, very quickly, you had almost everyone who wanted to experiment with a GPT-4-like model messing with llama. Llama came in a variety of sizes, too: a 7 billion parameter version, a 13 billion parameter version, and a 65 billion parameter version.

One of the things that happened almost immediately is that someone figured out how to run these models, albeit slowly, if you needed to run them entirely via the CPU rather than the GPU.

(As another reminder: since the calculations you have to do when running a computation with a large neural network involve many many many multiplications and additions that can be done in parallel – that is, at the same time – it’s much faster to run them on a graphics card since that’s literally the kind of math they’ve been designed to do quickly.) Why is this a big deal? Comparatively few people have the expensive array of multiple graphics cards that are needed to run something like even a 13 billion parameter model: the amount of video RAM needed to store the model will run you at least $1,500 dollars if not much more. To run something like a 65 billion parameter model entirely on GPUs requires the kind of setup that costs a good $10k minimum. Meanwhile, buying the equivalent amount of “normal” RAM is a few hundred dollars.

So, yes, running llamas and llama-byproducts entirely on the CPU is definitely worth it when it comes to making these things accessible for experimenters without “I have the budget for an entire data center” kinda money.

Beyond running LLMs on the CPU, there’s also an entire world opening up about the quantization of LLM model weights. The idea of quantization is a little technical but I’m going to try and give the gist. So all data on a computer, fundamentally, comes down to a sequence of ones and zeroes. That is, I think, mostly common knowledge. What’s maybe not as well known is that when it comes to numbers that aren’t “whole” numbers, that can have stuff to the right of the decimal point, the amount of space you devote to each number affects how fine-grained you can make the distinctions between numbers. If you allow arbitrary amounts of digits to the right of the decimal point, storing a single number could take an infinite amount of space, so you have to choose how much precision to keep.

The trick of quantization is that you can take the weights, that were made with a larger precision, and convert them to smaller precision. By doing this you can reduce the amount of RAM/VRAM needed to run an LLM by a factor of 2, 4, or even 8 times smaller. Reducing the precision does change the weights though, as on some level you are – albeit cleverly – throwing away information after a certain number of decimal points. How much does this affect the quality of the texted generated?

Can you guess what the answer is? If you said “We don’t really know!” you’d be right!

I cannot stress enough that this world is not only not “science” it isn’t even “engineering” yet.

But, running a 65B parameter model with the precision cleverly cut in half is probably better than not running the model at all!

Beyond the original llama, though, we’ve also seen a bunch of llama descendants that have come from fine-tuning llama on various examples of chat-like question-and-answer pairs. The first of these was alpaca, which was a cool research project by researchers at Stanford where they used chatGPT to generate the question/answer pairs for training. The problem with alpaca is that it kind of violated OpenAI’s terms of service. Instead, what’s really taken off is Vicuna (Large Model Systems OrganizationVicuna: An Open-Source Chatbot) which is an alpaca-like that does not violate OpenAI’s terms of service and thus runs into legal trouble for existing. I’ve tried using Vicuna, via the FastChat framework and it’s a little wonky but kinda fun. Everything in this space is still very experimental and is research-quality software (derogatory).

Here’s a summary of how all the various GPT-likes compare to each other on non-trivial tasks.

Spoilers: basically nothing compares to GPT-4 but on some things, other models almost get to the level of GPT-3.5. That’s still pretty exciting, in my opinion.

So what do you do with models once you run them locally? I prefaced this whole discussion with a promise that it would let you extend and use them more transparently than GPT-4 + plugins.

Enter langchain. This is going to be most exciting for the programmers in our cohort, but the basic concept is that this is a framework for writing applications built on top of large language models, rather than just using an LLM as a simple prompt -> answer generator. You can extend the LLM with a kind of long-term memory, integrate it into other programs, and even automate processes where you feed the outputs of prompts in as new prompts to take advantage of some of those tricks I talked about above that we’re using with GPT-4.

Langchain technically can work with OpenAI, with the paid API, or it can work with locally installed models.

I haven’t personally done anything with langchain yet but you can bet I’m hoping to dig into it when I have the time, and I’ll definitely report back the results.