
Episode 266 - Simplicity, Complexity, and Text Classification with Joel Grus

We are joined by Software Engineer, Data Scientist, and Author Joel Grus for a discussion that ranges from the latest NLP techniques to thoughts on how to think about complexity costs.

Joel works at a large financial services company, leading a small team that builds data and ML products.

Also, he did research and engineering at the Allen Institute for Artificial Intelligence in Seattle and worked on the AllenNLP project, engineered software at Google, and did data science at various startups.

Joel Grus

Links

Adversarial Learning
Norm Conf
ChatGPT
Wikipedia: BERT (language model)

Joel Grus: Website | Books | LinkedIn | Twitter | Email

The Local Maximum: Email

Transcript

Max Sklar: You're listening to the Local Maximum episode 266.

Narration: Time to expand your perspective. Welcome to the Local Maximum. Now here's your host, Max Sklar.

Max Sklar: Welcome everyone, welcome! You have reached another Local Maximum. 

Celebrating five years of Local Maximums for you this year, actually. Back on February 6, two weeks ago, episode 264, was the five-year anniversary of the Local Maximum. Doesn't that go fast? 

If you've been with us for a while, please consider helping me reach my goal of 50 supporting listeners on maximum.locals.com. I know that 50 doesn't sound like that high of a number but I know a lot of people just listen for free, which is fine. But hopefully, I'm trying to get a few people to support us on locals and get a really great community going there. 50 for five years, maybe that makes sense. 

Not only will you support the show, which you love, for only $4 a month, but you get access to our community where we'll have further discussions and commentary way better than the discourse that you have on Reddit, I can assure you. Basically, this is just a direct line to me and Aaron so you could ask us anything more conveniently. 

A single month counts so please sign up with Winter 23. You get one free month and then just do another free month. If you don't like it, it still counts towards my 50. So that is a good one to do so please help me out there. It's so easy. 

Last week, we talked about the multi-armed bandit. I thought that was going to go over a lot of people's heads. Maybe it did, but I actually got a lot of emails and a lot of messages about the multi-armed bandit episode. People really liked it. People thought it was interesting. People got something out of it, so maybe I'll do more of those kinds of mathematical, read-a-paper type episodes.

I don't want to turn the show into a PhD seminar or something like that. But I do want to try to discuss ideas and figure out what they mean for technologists. What do they mean for the average person? So I think that helps quite a bit. 

Today, we are going to talk more about technology. We're going to bring you more AI content today. Specifically, we're going to talk about natural language processing but going further than that, we're going to talk about strategies for thinking about simplicity, complexity, and prioritization because my guest today, in addition to being a software engineer and a data scientist, is also a seasoned author and a speaker. Just go on his website, he has tons of examples of talks, all with slides. 

It's all really interesting and engaging and gives you food for thought, so it's worth going to his website at- Oh, shoot, I should probably do it at the end, or should I do it at the beginning? Maybe I will just do it at the beginning. It's joelgrus.com. J O E L G R U S dot com. I'll also give it at the end. I first found him recently at Norm Conf, an online conference, with the talk, “What's the simplest thing that might possibly work and why didn't you try that first?”

Joel Grus, you've reached the Local Maximum. Welcome to the show.

Joel Grus: Thanks for having me.

Max: I appreciate having you on, and I know you've done some podcasting yourself at one point.

Joel: We're gonna bring it back at some point. We keep talking about it.

Max: Oh really? What was your podcast called?

Joel: It's called Adversarial Learning. It's me and Andrew.

Max: When you do bring it back, definitely let us know. Announce it, you could come on this show and announce it. I can forward it to my audience. So that's exciting! I'll look into it.

Joel: It wasn't a deliberate hiatus. It was more we just got busy and then stopped recording episodes.

Max: I know. That's what happens. People don't realize a podcast episode takes a lot more out of you than you expect. Like right now, we're recording for, what, half an hour? It's like, okay, that's just a little bit of time in my calendar. But now I realize I have to have wind-down time and then rev-up time. There's a whole hour before and after that I need to always block off.

But anyway, I first heard you recently at Norm Conf, which, by the way, I really enjoyed that conference. So how did you hear about that and how did you come up with your talk, which was the main one that caught my eye because the title was, What's the Simplest Thing You Can Do and Why Didn't You Do That?

Joel: So I'm friends with a bunch of the Norm Conf organizers. Honestly, it was a conference that ran largely on cronyism. I know the organizers and they asked a lot of their friends to come and speak. It so happens that they knew a lot of people who are prominent in the data science community, one way or another. As soon as they announced it, I said I'll speak there. It's always fun to have an excuse to speak. 

As for why I chose that topic, a couple of things happened. One is that I'm someone who has really always embraced the idea of simplicity. I like simple solutions to things. I like clean code that’s not too complicated. I like trying simple things before complicated things. I used to run data science/machine learning/engineering teams. When people would bring me something complicated, I'd say, “Did you try something simple first?” And they'd say, “No.” I'd say, “Why not?” After a while, they would anticipate that and just start trying simpler things first. 

That mindset was sort of half of why I started thinking about that talk. The other half was sort of an opposite experience. I started talking to a lot of people who were working on natural language processing problems or text classification problems. They would tell me, “Oh, I'm trying a Naive Bayes classifier” or “I'm going to do word vectors and then something else.” Things that, in some ways, are really simple approaches to the problem. Naive Bayes is maybe the first thing you teach someone about how to do text classification, and you could implement it yourself in an afternoon.

But at the same time, my reaction was that Naive Bayes is so bad compared to using BERT or something and fine-tuning it. Why would you even waste your time when you're gonna get much better results the other way? That caused tension inside of me, because usually I say try the simplest possible thing first, but here's this other situation where someone did, ostensibly, a simple thing and I'm like, why did you waste your time doing that? So some of the talk was about trying to square those two instincts within myself.

Max: I've been through that with text classification. I want to get into that in a little bit. 

It sounds like from the talk, I was thinking about it. It's one thing to say I'm going to embrace simplicity but in reality, when you do it, it's not so simple. There's some deception when it comes to what is simple and what is complex. So I don't know if there's a way. You just spoke about that contradiction, which is Naive Bayes versus BERT. What’s at the heart of the tension there? Is that something that comes up in other areas as well?

Joel: There's a couple of things there. One is that Naive Bayes is a model with not that many parameters, and it's simple to implement and simple to train. At the same time, if you talk about BERT, it's a very complex transformer model with millions or probably hundreds of millions of parameters. If you compare a model with 1000 parameters and a model with hundreds of millions of parameters, then intuitively, the one with hundreds of millions of parameters is a lot more complicated and the one with 1000 parameters is a lot simpler. I think you have a lot of situations like that. 

At the same time, and I used this example in my talk, if you read the BERT paper and said, I want to apply that to my problem, you were signing up for writing thousands of lines of code to get this thing implemented. Over time, Hugging Face came out with the transformers library and started a hub of all these pre-trained models. Now, today, you could fine-tune a BERT model using five to 10 lines of code. Not any more code than it would take you to train the Naive Bayes classifier. 
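
A rough sketch of the kind of fine-tune Joel is describing, assuming the Hugging Face transformers and datasets libraries; the toy texts, labels, and training settings below are made up, not anything from the episode.

```python
# A minimal sketch (not from the episode) of fine-tuning BERT for text classification
# with the Hugging Face transformers library. The tiny dataset here is made up.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["loved it, would buy again", "arrived broken and nobody answered my emails"]
labels = [1, 0]  # 1 = positive, 0 = negative

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = Dataset.from_dict({"text": texts, "labels": labels})
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-sentiment", num_train_epochs=3),
    train_dataset=dataset,
    tokenizer=tokenizer,  # lets the Trainer pad each batch dynamically
)
trainer.train()
```

In practice you would pass many more labelled examples and hold some out for evaluation, but the shape of the code stays about this small.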

From a building, tooling, and scaffolding perspective, suddenly you have this very complex model in some sense, which is just as simple to use as the model that's simpler in some abstract sense, and gets much better results. 

Then on top of that, one nice thing, and this has a little bit more to do with text classification or NLP than other things, is that in all these BERT models, you shove your text in as is. Because it understands, understands may be the wrong word, but it understands that the word walk and the word walking are related and sort of takes that into account automatically. Whereas if you're using a Naive Bayes classifier, you might have to do stemming, you might have to remove stop words, you have to make all these different modelling choices, which are in some ways arbitrary, and in some ways are adding all this extra complexity to the model. Because suddenly, it's not just that I took a pre-trained thing and shoved my text through it. I took a simple thing, but then I started making a large number of decisions about how I want to feed my data through it.
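
By contrast, here is a sketch of the choices Joel is pointing at on the bag-of-words side. The tokenizer, stop-word list, and stemmer below are arbitrary picks, which is exactly the point; the two training sentences are made up.

```python
# A sketch of the modelling choices a bag-of-words Naive Bayes pipeline forces on you.
# NLTK's Porter stemmer and scikit-learn's English stop-word list are arbitrary choices here.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def analyze(text):
    # Each line is a decision: lowercase or not? which stop words? stem or lemmatize?
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS]

model = make_pipeline(CountVectorizer(analyzer=analyze), MultinomialNB())
model.fit(["walking to the store was great", "the walk home was terrible"], [1, 0])
print(model.predict(["walking there was terrible"]))  # the stemmer maps walk/walking together
```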

Max: I've been through this exactly with Foursquare. I built out the natural language processing pipeline there. This was all from 2012 to 2015, so there was no BERT. What I ended up doing was, well, for one of the main applications, it was sentiment analysis, and we had a lot of likes and dislikes, so we were able to put together a large training set. 

We started with Naive Bayes after some language detection, tokenization, and stemming. Then we moved on to logistic regression, which beat Naive Bayes by a lot. We also did it on a four-gram model. We had all sorts of different phrases in there. 

The model was really, really cool, but I get what you mean about having to make all of these decisions within that. We tried tons of different things. I remember one project that really was a waste of time. I was like, maybe it doesn't make sense to do a four-gram if the four words in sequence span two sentences or skip over a comma, because they're in two different phrases. So let's just remove those from the features of a given piece of text that we're trying to classify. So I did all this work, and the end model was worse. It didn't do anything.
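
A rough scikit-learn approximation of the pipeline Max describes, not Foursquare's actual code; the one-through-four-gram range matches what he says here, and the minimum-count cutoff of five reflects a detail he mentions later in the conversation.

```python
# Not Foursquare's actual code: a scikit-learn sketch of a 1-to-4-gram logistic regression
# sentiment model, with features dropped unless they appear at least five times.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentiment_model = make_pipeline(
    CountVectorizer(ngram_range=(1, 4), min_df=5),  # phrases up to four words, rare ones dropped
    LogisticRegression(max_iter=1000),
)
# sentiment_model.fit(tips, likes)  # hypothetical lists of tip text and like/dislike labels
```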

Joel: In some ways, it's similar to the ever-present desire to figure out how to use neural networks on tabular data instead of XGBoost or something. Because with XGBoost, you have to do all this feature engineering, and there are so many things: I'll bucket this, and then I'll look at that, and I'll cut it off here. Whereas in theory, if you just throw it into a deep learning model, then maybe you get the feature engineering for, quote, unquote, free. But I think people have had less success there.

Max: It sounds like when trying to analyze the complexity of a project, we have to have a much broader view. It's not just the complexity of the model. It's the complexity of the code. It's the complexity of the team. It's how the abstractions are laid out. There's a lot more to it than just… Now it sounds like my analysis of complexity is getting a little too complex.

Joel: I think the way I put it was, as our tools get better and as our abstractions get better, things that previously were very complex to do and understand and implement, become much simpler. So simple and complex are not decisions that can be made in some kind of abstract vacuum, but they're really a function of what is the tooling we have? What are the resources we have? And so on.

Max: How do you know if something's ready for prime time? Because there would have been a time with BERT, for example, in 2018, I don't know exactly when, when it would have been like, this is not ready for us yet. Do you kind of have to keep your pulse on the latest research and try to figure it out? It seems like you might have one impression, and then that knowledge might get old after a few years.

Joel: Yeah, so most knowledge gets old after a few years. But I think it depends. Some problems need state of the art; in some problems, every percentage point of accuracy, or F1, or whatever, is worth millions of dollars if you're doing something at Amazon scale. Then at the other end of the spectrum, maybe it's the case that with your Foursquare sentiment analysis model, if you improve it by 5%, it affects people at the margin, but it's not going to drive that much change. 

I think those are some of the calculations that need to go into it. Like okay, I'm potentially gonna get an incremental improvement by staying on top of the state of the art. How much is that incremental improvement worth to me? And that's a function of my problem and my business, and so on and so forth. 

I think there are some places where it would be worthwhile, and they probably have enough people dedicated to it to say, BERT was published. Let's try it on our data tomorrow even though we're gonna have to write a lot of ugly code and debug a lot of stuff and figure out how to get it to work and so on and so forth. Then there's also places where you say that looks interesting. Once the tooling around it is sufficient that my data scientists can work with it, we'll give it a try. But that doesn't need to be this month or next month.

Max: I enjoy these philosophical discussions about it because I feel like this helps people even who are not in data science, even if you're not in machine learning, just people who work on really anything. This applies to everyone who's listening today. 

I haven't looked at the Foursquare NLP pipeline since maybe 2018, maybe 2019. Help me update my skills. Let's talk about what this is, BERT and transformers. What is it doing that either a logistic regression n-gram model or even a neural net isn't doing?

Joel: I'll do a little bit of a history lesson here. With Naive Bayes or like an n-gram model, you're basically saying, I'm going to chop this thing into words then basically treat it as a bag of words, in some sense. I take a document, and I don't care that much about word order, modulo what I capture through the n-grams. I'm just going to count how many times this occurs, how many times that occurs, and so on. Now I have a vector of counts and I'm going to use some method to say, I'm going to turn that vector into a probability of positive sentiment or negative sentiment. So logistic regression is sort of the natural thing to do there. 

But the problem is, there's a couple of problems. One, you've thrown away word order, and word order is very important. Like if I have negations, or whatever, and I say, “this is not terrible,” well, if I don't know how to associate the not with the terrible, then you just see “terrible.” That's a huge negative word. 
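
A bare-bones illustration of what Joel means, in plain Python: once the sentence becomes a vector of counts, there is nothing tying the not to the terrible.

```python
# Bag-of-words throws away word order: "not" and "terrible" become independent counts.
from collections import Counter

docs = ["this is not terrible", "this is terrible terrible service"]
vocab = sorted({word for doc in docs for word in doc.split()})
count_vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)          # ['is', 'not', 'service', 'terrible', 'this']
print(count_vectors)  # [[1, 1, 0, 1, 1], [1, 0, 1, 2, 1]]
```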

Max: That’s why the model that we had where we looked at all phrases of length one through four actually fixed that problem. I guess you could have had something of length five or some complex thing but it really solved all the problems with that. 

Joel: That will help with the shorter-range dependencies, but then you have longer-range things. 

The second problem has to do with stemming and similar words. Like I said, walk and walking: if you're just encoding things as a vector of which words are in there, then a one in the walk place and a one in the walking place have nothing to do with each other, other than what you sort of trained through your objective function. There's no way to bake in the knowledge that those are basically the same word and are related and things like that. 

One step from there is what you could call the word vector era: word2vec, GloVe. For each word, instead of a one-hot embedding, where if it's the word walk, you get a one in position 134, which is the walk position, you get some sort of dense embedding in some 500-dimensional space, or whatever, and you do it in a way that the embedding somehow captures the meaning of the word, so that walk and walking get embeddings that are similar in some sense. 

Then we've taken a step towards this problem of how to represent these words in a way that captures their semantics a little bit. The next step would be, instead of doing this one-hot encoding with n-grams, we could take the word vectors, take the average, and then train logistic regression on top of that. 
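
A toy sketch of that step. The three-dimensional "embeddings" below are invented just to show the shape of the computation; a real system would load word2vec or GloVe vectors instead.

```python
# Toy word vectors (made up, 3-dimensional); real ones would come from word2vec or GloVe.
import numpy as np
from sklearn.linear_model import LogisticRegression

vectors = {
    "walk":     np.array([0.9, 0.1, 0.0]),
    "walking":  np.array([0.8, 0.2, 0.0]),  # close to "walk", unlike a one-hot encoding
    "terrible": np.array([0.1, 0.0, 0.9]),
    "great":    np.array([0.0, 0.9, 0.1]),
}

def document_vector(text):
    words = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(words, axis=0) if words else np.zeros(3)

X = np.stack([document_vector(t) for t in ["walking was terrible", "a great walk"]])
y = [0, 1]
classifier = LogisticRegression().fit(X, y)  # averaging still ignores word order, which is the next point
```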

But then we still are not taking word order into account. Then you can introduce this concept of recurrent neural networks like LSTMs, and so on, and so forth. What you do is, instead of just averaging all your word vectors together, you sort of feed them into this stateful neural network where as you feed one word vector in at a time, it builds up a state that depends on the previous things you've fed in and sort of puts the word vectors in context so that if you have the same word, but it's in a different sentence, something sort of slightly different will happen to it. 

Then you get some state that captures either a sequence of contextual word vectors, or maybe just a vector representing the whole sentence, and you can classify those. That was kind of the state of the art pre-BERT. That was what a lot of people were doing pre-BERT. 
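
A rough PyTorch sketch of that pre-BERT recipe; the vocabulary size and dimensions are arbitrary placeholders, not anything from the episode.

```python
# A sketch of the pre-BERT recipe: embeddings fed through an LSTM one token at a time,
# with the final hidden state classified. All sizes here are arbitrary placeholders.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # could be initialised from word2vec/GloVe
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)           # (batch, seq_len, embed_dim)
        _, (final_state, _) = self.lstm(embedded)  # the state built up one token at a time
        return self.classify(final_state[-1])      # logits: (batch, num_classes)

logits = LSTMClassifier()(torch.randint(0, 10_000, (4, 20)))  # 4 sentences, 20 tokens each
```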

Here's what BERT did. BERT said, rather than using these recurrent neural networks, which are sort of a pain because you have to do them one step at a time: if I have a sentence with 200 tokens in it, I feed in the first token, update the state, feed in the second token. It's a sequential process that takes 200 steps. They said, instead of doing that, we're going to use this concept of attention that's used within neural networks, which is basically, when I have a position of input, what other positions is it related to?

They replaced the recurrent neural network with this idea of self-attention, which is, instead of looking at the things before me, I'm just going to look across all the other positions in the sentence. The nice thing about that is that I can do it in parallel. Each position looks across all the other positions in parallel. So now, if I have 200 spots in my sentence, I can just use GPUs and do each of those self-attentions in parallel. That allows me to train much faster, which means I can train much bigger models.
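
A toy NumPy version of the scaled dot-product self-attention Joel is describing; the sentence length and dimensions are made up, but it shows every position attending to every other position in one batch of matrix multiplications rather than a sequential loop.

```python
# Toy scaled dot-product self-attention: every position looks at every other position
# at once, so there is no step-by-step loop over tokens.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # queries, keys, values: (seq_len, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len): all pairs of positions
    return softmax(scores) @ V               # same sequence length in, same length out

rng = np.random.default_rng(0)
seq_len, d = 5, 8                            # a made-up 5-token "sentence", 8-dim embeddings
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8)
```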

Max: I don't know if this is a question that makes sense. Maybe it's where I'm deficient in my mental model of these transformer models. But if you have a sentence, let's say it's 20 words: if I'm trying to run through word five, I'll have to pass through words one through four, and then I have to pass through six through 20. 

It seems like the model's gonna have to run a little bit differently depending on what position your word is in, because it has more to look back at and more to look forward to, whereas with the recurrent neural network, it's obvious how to pass in one word at a time. In other words, how does it take into account that the word you're giving to the network can have a variable number of words before it and after it?

Joel: That's a fair point. Basically, what it's doing is a matrix multiplication that takes each word and multiplies it by a matrix, and multiplies it by sort of the matrix of all the words. It's basically the same sort of computation. The dimensions are set up in such a way that you're doing matrix multiplication, and whatever length you put in, you get the same length out.

Max: It sounds like there's sort of a context up to that point and then there's a context after that point. Those are like matrices. Am I thinking about this right or no?

Joel: Not exactly, because within the transformer model, it's not really making that distinction between before and after the way you're describing it. There were a few pre-BERT models that were basically trying to generate contextual embeddings that were looking at that, and they'd say, I'm gonna run one recurrent neural network in the forward direction and one in the backward direction and build the state from both sides, and so on and so forth. But here we're just saying, I have a word, I also have a sentence's worth of words, and I'm going to use those two things to generate some context.

Max: Okay, got it. Have you used this personally? What are some of your best use cases when it comes to text classification that you’ve seen?

Joel: I've done it in a couple of ways. One, let's say, categorization. I have a bunch of texts that may fit into a bunch of different categories. So I run a call center, and I have all these call transcripts and I know that this person called because their widget arrived broken and they're upset about it. This person called because they didn't understand how to read the instructions. And this person called because they're angry because the battery was not included and they thought batteries would be included. If I have a set of categories that I've decided is how I want to think about my customers and their problems and I have some labelled data, I can train a BERT model to pick those out quite nicely. 

The other thing that maybe I should have mentioned is that for either logistic regression, or even the RNN models that we're talking about, you're basically starting with random weights and trying to learn what's going on. You need a lot of data to learn that. 

With these transformer models, what you're doing is… It turns out that BERT is pre-trained on a large corpus of data. Like billions of documents. What that means is that a lot of quote-unquote understanding of language is baked into it from the beginning. Which means that for a lot of applications, you can fine-tune it using a very small data set and get good results. So rather than having to have tens of thousands of labelled data points, you could get away with a couple hundred, maybe, and just fine-tune the model, which already starts off knowing all this stuff about how language is structured, and get good results.

Max: Yeah. I mean, that's huge. It's not just the cost savings; it sounds like you could almost get away with doing labelling in-house. A hundred labels, 200 labels, a person can do that. Maybe get an intern if you need to do a thousand or so. 

Joel: I'm probably exaggerating a little bit, but not much.

Max: Well, yeah. But it's not like we were using hundreds of thousands to get logistic regression. I don't know, it probably could have worked with less. I guess we could see that, because we were running one for every single language, and maybe the top five to 10 languages kind of worked. Maybe the 10th most popular language had 40,000, 50,000 examples.

How does BERT, by the way, do with the different languages? Does it have a different model for every language? Or is it kind of a universal understanding type of thing?

Joel: I know people have trained different ones for different languages and they tend to give them cutesy names. I think there's one called CamemBERT, which is French, and so on, and so forth. I don't actually know off the top of my head how the original BERT does on multilingual stuff. I know some of these big transformer models, people do train them to be multilingual. I don’t actually know if BERT is…

Max: It's always an interesting problem because it's never quite as cut and dried as you want it to be. It's never like, oh, these different languages are all different walled gardens, where each user is going to speak one language and each piece of text is going to be one language. It's not even the case that each piece of text has the same character set from one sentence to another. So oftentimes, you run into these problems. At least I did, and I'm sure that's probably-

Joel: On some level, all these models are doing is learning associations between words or tokens. BERT is trained off of this objective of masked language modelling. So they'll give it a sentence, and then blank out a few of the words randomly, and then the model has to learn to predict what the missing word is. 

So, in theory, if you feed it a lot of English sentences but also a lot of Spanish sentences, then it will learn both of them. Because when it sees surrounding Spanish words, it'll predict the missing Spanish word, and so on, and so forth. Now, in practice, I don't know if that's the best thing to do or if having different models is the best thing to do.
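
You can poke at that masked-language-modelling objective directly with the Hugging Face fill-mask pipeline; the model name is the standard public BERT checkpoint and the example sentence is made up.

```python
# Probing BERT's masked-language-modelling objective: the model guesses the blanked-out word.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for guess in unmasker("The battery was not [MASK] in the box."):
    print(guess["token_str"], round(guess["score"], 3))
```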

Max: Yeah, I'm pretty sure that the logistic regression model would do worse if we just threw all the languages in there. For one thing, sometimes the same literal characters have different meanings across different languages. And secondly, it would also be a lot more space, at least for the logistic regression model. I don't know, maybe it could work. I didn't try it. It just doesn't seem like it would work very well.

Joel: Well, the thing that's hard to wrap your mind around, I think, is that the model capacity in these transformer models is huge. Your logistic regression model, even if you go up to four-grams, is still thousands of parameters. Like, thousands of features and weights for those thousands of features.

Max: Could be way more. We ended up throwing out everything that wasn't used at least like five times in the corpus, and still, that was enormous.

Joel: In the BERT model, again, you're talking about literally hundreds of millions of parameters. Within hundreds of millions of parameters, there's a lot of space to learn patterns. So there's room to learn multiple languages. 

For instance, take ChatGPT, which is a bigger model than BERT. If you go to ChatGPT and you start talking to it in a different language, it will answer in a different language.

Max: Oh, I should try that. That sounds pretty cool. 

This is kind of exciting. I am looking forward to, hopefully, if I have an NLP project in the future, working with one of these models. It sounds really neat. 

What do you think are some common mistakes people make when they're doing NLP and text classification? Or things that could go wrong?

Joel: So one thing that can go wrong is if your data is labelled poorly. A lot of times, what I've seen in multiclass classification is that the classes are maybe not distinct enough. So if you're trying to determine what this complaint is about, to go back to my earlier example: one was batteries were not included, one was I didn't receive it on time, things like that. If your categories are not distinct enough, then even a perfect labeller will still get it wrong, because they'll say, oh, I guess it's this or that? I think that's one problem that I've run into multiple times in a lot of contexts.

Max: I was just saying, I agree. I was gonna give some personal examples. But if you had one more…

There were things like… Some of the classifications that we did included sentiment analysis, which did have some wiggle room. There were some mixed reviews, and then there were some statements that were not reviews at all. So we had four different categorizations there. 

Then, in addition, our dataset had three classifications, which was just people saying like, dislike, or the middle one, which is really ordinal data, but we treated it as categorical data. But then there was spaminess. 

I think the big one was quality. Are these Foursquare reviews, Foursquare tips, that we want to show people? Or is this just a monkey banging on the keyboard? Now, sometimes it's obvious it's just a monkey. But other times, we disagreed about that. It was very subjective.

Joel: That's really one of the biggest problems I see in text classification. If you can't get smart humans to agree on what the right label should be, then you're gonna have a real hard time getting a computer to be able to come up with them. 
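
One cheap way to test Joel's point before training anything is to measure how well two human labellers agree on the same examples; the labels below are invented.

```python
# If two humans barely agree on the labels, the model has little chance of learning them.
# Cohen's kappa measures agreement beyond chance; these example labels are made up.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["positive", "negative", "spam", "positive", "mixed"]
annotator_b = ["positive", "negative", "positive", "positive", "negative"]
print(cohen_kappa_score(annotator_a, annotator_b))  # well below 1.0, so the categories need work
```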

Max: And I've also run into this problem. Well, I don't know if it was a problem so much as a situation: okay, we built a model. Maybe it was a sentiment analysis model. I said, let's look at the top hundred that it labelled as positive that we labelled as negative. So I looked at the top 100, and most of them were ones we had mislabelled. In other words, the model was doing better than we were. 

I asked my manager, should I change these in our set? I don't know. It sounds awfully arbitrary but we probably should. I kind of ran into that. I don't know if that was necessarily a problem. It was just an interesting kind of situation.

Joel: My experience is that if you have labelling errors that are sort of random and you have enough data, the model is gonna learn around them. But if you have labelling errors that are systematic, then the model is going to learn those systematic errors as well.

Max: But when you look at the top lists of disagreements, then it looks like you have many more labelling errors than you really do. Or at least in terms of percentage because you're literally focusing in on them.

Joel: Yeah, because you're finding the worst mistakes. This was one that was obvious to the computer, and we got it wrong. But if you look at the other end, you'd find the opposite.
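
A sketch of the exercise Max describes, surfacing the cases where a model most confidently disagrees with the human label; the names `classifier`, `X_validation`, and `human_labels` are hypothetical stand-ins for whatever fitted model and held-out data you have.

```python
# Surface the examples a fitted classifier is most confident are positive even though a
# human labelled them negative; in Max's experience, many of these are labelling mistakes.

def top_disagreements(classifier, X, human_labels, n=100):
    p_positive = classifier.predict_proba(X)[:, 1]
    conflicts = [(p, i) for i, p in enumerate(p_positive) if human_labels[i] == 0]
    return sorted(conflicts, reverse=True)[:n]  # model says positive, human said negative

# top_disagreements(classifier, X_validation, human_labels)  # hypothetical model and data
```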

Max: Yeah, exactly. Although, interestingly, there was a model that I did once that was not text classification; it was trying to predict users' age and gender based on where they went in Foursquare's location-based system. So it was locations you go to, and gender. It turned out the ones it got wrong weren't mislabelled. It was usually just people who were married and were always out with their wife or husband, a spouse of the opposite gender. 

It made that kind of error, but that was because of the nature of the model more than… It wasn't like oh, you liked to do things that the opposite gender does. It wasn't like that at all, which was kind of interesting. 

Joel: That’s how you know you did it a long time ago, because today you'd get in trouble if you tried to predict people's gender.

Max: Well it was 20… It was 2017 so it was getting there. I think I had to do a few caveats like, hey, this is just a model. It's for marketing. But then I was totally fine. I don't know, today I don't know if I would even attempt that. But that happened.

Alright. Before we head out, I want to hear a little bit about the book you wrote, Data Science From Scratch. We don't have time to go through all of it. But what was it like to write that? Why'd you decide to write it? Tell me a little bit about that. 

Joel: So I wrote that book in 2014, so it's been a long time now. I'll talk about it, but it requires me to delve into long ago. The second edition came out in 2019. 

So a couple of things. One, it's a very self-aggrandizing story. Data Science was becoming a big thing. All these famous data scientists were coming up and getting quoted. I was like, oh, I want to be like a famous data scientist so how can I do that when I'm just some dude working in a start-up? So I’ll write a book. 

I studied math in school and I went to grad school for a bit, so I have this very mathematical pedagogy in my head. That’s the way I learned things, and I like learning things from first principles. I pitched a book that was much bigger than the book I ended up writing, where part one was here's all these models and here's how they work from first principles, and part two was here's the libraries you'd use if you wanted to use them in real life. O'Reilly said, “That's two books and we're not giving you a two-book contract because we've never heard of you.” So I said, “Okay, fine. I’ll write the first part and skip the second part.”

Basically, the premise of the book is that we're going to understand how data science tools and techniques work by building them ourselves in base Python. Okay, logistic regression: we're going to code up logistic regression from scratch and then try it on some problems. Naive Bayes: we're going to code it up from scratch and then try it on some problems. 

Then in the second edition, I added some more stuff. It goes all the way through neural networks and RNNs and things like that; those are in there. That's the way I like thinking about it, so that's how I wrote the book.
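
In the spirit of that premise, though not code from the book, here is what a from-scratch approach looks like: logistic regression trained with plain gradient descent using nothing but the standard library, on a tiny made-up dataset.

```python
# Not from the book, just a sketch in its spirit: logistic regression trained with
# plain (stochastic) gradient descent in base Python.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(xs, ys, learning_rate=0.1, epochs=1000):
    weights = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            gradient = [(prediction - y) * xi for xi in x]  # gradient of the log loss
            weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights

# A tiny made-up dataset: the second feature drives the label.
xs = [[1.0, 0.2], [1.0, 0.9], [1.0, 0.1], [1.0, 0.8]]
ys = [0, 1, 0, 1]
print(train_logistic_regression(xs, ys))
```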

Max: One of the things I've learned… So I just submitted a paper to arXiv. It's like 30 pages on probability. One of the hardest things that I ran into that I wasn't expecting was how to order all the things that you want to talk about so that once you talk about something, you've already established all the necessary background and definitions before that. 

I felt like I spent so much time reordering things and then reading through it and being like, Oh, wait, this doesn't make sense because we haven't introduced this next thing yet. So did you run into that problem at all? And I don't know, I just asked because it’s on my mind.

Joel: I would say I ran into that problem but it was definitely something I had to take into account when I was plotting out the book. So the way I wrote the book is I basically made a list of chapters. Then I did that kind of topological sort of like, this chapter needs to come before this other chapter because I want to use this. 

I took the original Andrew Ng ML class, which was sort of like the first MOOC. That class used gradient descent as an organizing principle, if you will. Like, let's learn about gradient descent and then use it to solve a whole bunch of problems. That was pretty influential on my way of thinking. In my book, I did something similar, which is that we go through gradient descent in great detail, and then I can use it to do linear regression and logistic regression and so on.

It took a little bit of work, but in general, things were… I sort of designed it from the beginning so that I had things in the right order. There were a few things where it was like, oh, I guess I'm gonna have to move this one topic into an earlier chapter than I would have preferred, but for the most part, it wasn't too bad.

Max: It sounds like a little bit of topological-sort planning goes a long way. Although, I found sometimes you end up having to introduce things that you didn't even realize you needed until you were writing about something. 

Joel: Actually, that was one of the… not exactly that, but one of the most humbling things about writing a book is that you discover that most of the things you thought you understood, you don't understand. When it came time to write the section on inference and hypothesis testing, and I started writing out how hypothesis testing works, I quickly discovered that I did not understand it well enough. I thought I understood it, but I did not understand it well enough to write about it. I had to go off and study and actually learn it so that I could write about it sensibly.

Max: That's one of the benefits of writing too, because then you really understand something. It’s like if you explain it back then you understand it well enough. There have definitely been things like, I'm really excited about x and then someone's like, explain it to me and I'm like, I guess I don't understand it well enough to explain it.

Joel: I’m worried that my earlier explanations of transformers were not quite right and someone's gonna get angry at me. I apologize if that’s the case.

Max: Don't worry about it. This podcast is more of a discussion with people in tech. This podcast is often like what's on your mind pre-book. This is not like hey, this is our final course on BERT. This is like, hey, what's interesting to us now and what's on our mind about it.

Joel: Never underestimate the propensity of people to listen to things and say, “I can't believe he said that that works this way because it actually works this other way.”

Max: I will let you know if I get an email about it but I'm pretty sure you're good. Joel, thanks for coming on the show. Do you have any last thoughts on what we talked about today? And tell the audience where we can find out more about you and where they can contact you and all that. 

Joel: Yeah, so I can be found on Twitter. It’s my name @joelgrus, J O E L G R U S. I'm one of the Twitter holdouts. As everyone else moves to Mastodon, I'm gonna keep staying on Twitter cause I like Twitter and I haven't found a Mastodon server whose terms of service I can live with. I also have a website which I very rarely update but that's joelgrus.com. Again, my name, J O E L G R U S and all my contact information is there, but really, Twitter's probably the best way to get in touch with me. 

Max: All right, awesome. Joel, thanks for coming on the show. 

Joel: Thank you. 

Max: Alright, that was a lot of fun. I've been busy conducting a lot of interviews and we're really going to branch out over the next month, which is exciting. Next on deck is another Bayesian thinking interview, which, for us, I guess isn't branching out as much as it's going deep. But some of the further ones down are going to be branching out. We've got an interview on hardware, which I think you'll find fascinating. 

We're going to try to intersperse these interviews with the solo shows and the discussions with Aaron on current events. Those solo shows and co-hosted shows are also going to have the probability distribution of the week segments. 

Today's episode, once again, is at localmaxradio.com/266, where you can get all the links, including the link to Joel's website, joelgrus.com. Have a great week, everyone.

Narrator: That's the show. To support the Local Maximum, sign up for exclusive content and our online community at maximum.locals.com. The Local Maximum is available wherever podcasts are found. If you want to keep up, remember to subscribe on your podcast app. Also, check out the website with show notes and additional materials at localmaxradio.com. If you want to contact me, the host, send an email to localmaxradio@gmail.com. Have a great week.

Episode 267 - Bernoulli's Fallacy with Aubrey Clayton

Episode 265 - The Multi-Armed Bandit