
Episode 265 - The Multi-Armed Bandit

Max talks about the multi-armed bandit, an important concept in the data science world. The news of the day is Microsoft's plan to integrate generative text (from OpenAI) into its Bing search engine, and whether Google can turn around from negative news stories and compete with this.

Probability Distribution of the Week: Dirichlet Distribution

Links

Wikipedia: Multi-armed bandit
Optimizely: Different strategies for the multi-armed bandit
Arxiv: Variational inference for the multi-armed contextual bandit
The New York Times: Bing (Yes, Bing) Just Made Search Interesting Again
NPR: Google shares drop $100 billion after its new AI chatbot makes a mistake
Arxiv: Fast MLE Computation for the Dirichlet Multinomial
YouTube: Tutorial on Dirichlet Distribution by Max Sklar
YouTube: Using the Dirichlet Distribution to Describe Count Data

Related Episodes

Episode 36 - The Google Graveyard
Episode 234 - Simplexes and Distributions
Episode 262 - Category Theory, Google Responds, and Another Covid Retro

Transcript

Max Sklar: You're listening to the Local Maximum episode 265.

Narration: Time to expand your perspective. Welcome to the Local Maximum. Now here's your host, Max Sklar.

Max Sklar: Welcome everyone, welcome! You have reached another Local Maximum. 

I've now been doing this podcast, the Local Maximum, for five years. Bringing you great content, making you smarter every week of the year for five years. I know that there are a lot of you listening right now and some of you have been listening to me for a while.

All I'm trying to get is 50 paid supporters signed up for the Locals on maximum.locals.com. It's not much, it's only $4 a month. And guess what? If you only go for one month, Locals counts that but I hope you stay for multiple months. So if you get value from the show, get even more value by having conversations directly with me, directly with Aaron and other listeners by subscribing on maximum.locals.com. Remember to use promo code WINTER23. We're having a good time on there and I think with a few more of us on there, we are going to have a much better time! So looking forward to seeing you on there.

Today we're going to start out with a technical topic, the Multi-armed Bandit. It's a technical topic with kind of a fun name and a fun background so that'll be interesting. Then we get a little news update and then we return to another technical topic with probability distribution of the week. 

So let's get started. Let's get right into it with the Multi-armed Bandit. This is something that comes up over and over again in data science and machine learning. I actually learned about the Multi-armed Bandit both in grad school and at work at Foursquare, in certain situations where we were running an A/B test. We'll talk about what an A/B test is and why the Multi-armed Bandit is useful there in a minute. 

But of course, I had kind of forgotten a little bit about it. I got a little fuzzier about it because I hadn't talked about it for a while. But now I have been doing job interviews, and it turns out that people interviewing candidates for jobs in data science and machine learning love to ask about the Multi-armed Bandit. I get asked about the Multi-armed Bandit all the time, like multiple times a day now. Since I had to dive into it anyway and learn it, I thought I'd share this idea with you. We can talk a little bit about what I think of it and where the research is going. 

The idea of the Multi-armed Bandit, as you can tell, is kind of taken from gambling. You have a bunch of slot machines for some reason, probably very good reasons. A single slot machine is often known as a one-armed bandit because, I guess, you pull the arm and it takes your money. I did that once. I put $1 in, pulled the arm, and it ate the dollar. I was like, alright, let's play something else. But people love playing the slot machines all the time. 

The way these machines work in a casino is that each one has some probability of giving a reward. Now, of course, this whole slot-machine-in-the-casino thing… A lot of probability theory started by taking inspiration from casino games, trying to understand casino games, but then it gets generalized to problems in life, problems in business, and all that. This is one of those cases. 

A single slot machine is known as a One-armed Bandit. It has some kind of method to the madness. There's some way in which it works but we don't really know how it works. All we could do is observe. We could observe people playing it, we could observe ourselves playing it, we could see what rewards we can get. Then over time, we can figure out how this so-called One-armed Bandit really works or at least we have a better and better idea over time. 

So in this case, we have a bunch of machines. The idea is that at every step, you're allowed to play one of the slot machines. We know that each one of these machines is unique, or potentially unique. Some of them might be the same but some of them might be different. There might be better ones to play than others. Each turn we get a reward, and as we play, we can learn about the reward system of each machine. So presumably, the more time we spend in that casino playing these slot machines, the more we learn about them.
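To make that setup concrete, here is a minimal Python sketch. The class name and the payout numbers are made up purely for illustration; the point is just that each machine has a hidden reward distribution, and all the player can do is pull arms and observe.

```python
import random

class BernoulliBandit:
    """Hypothetical K-armed bandit: each arm pays out 1 with its own hidden probability."""
    def __init__(self, payout_probs):
        self.payout_probs = payout_probs  # hidden from the player

    def pull(self, arm):
        # Reward is 1 (a win) or 0 (nothing), drawn from the chosen arm's hidden probability
        return 1 if random.random() < self.payout_probs[arm] else 0

# Three machines with payout rates the player doesn't know (illustrative values)
casino = BernoulliBandit([0.10, 0.25, 0.15])
reward = casino.pull(1)  # play machine 1 and observe a reward
```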

Unlike a casino, where your best bet is probably not to play slot machines at all if you're trying to maximize monetary reward, we're talking about actions in the real world where you're trying to maximize a real reward, where you have to make a decision, and oftentimes you're talking about things that do have a positive reward. 

But let's still think about the casino. Every day, you’re going into this casino playing slot machines all day. After a while, you start to learn which ones are better to play, which ones are worse to play. And as you play it, presumably, you get a better and better idea. 

So the problem is this, which machines do you pick to play and in what order? Secondly, how does context play into this? Because, unlike real-life slot machines, the actual repeated decisions that we make in life may depend on the conditions on the ground. 

For example, let's suppose I'm choosing between various investments or various types of investment vehicles. Maybe those investments are my slot machines. I play one, see what happens, play another one, see what happens. But I might want to also take certain economic indicators into account. Maybe if there's a recession, this one is better. Maybe if unemployment is like this, this one is better. Maybe so on and so forth. 

Now that's called the contextual Multi-armed Bandit. In that case, not only do the machines have certain distributions of rewards, but those distributions depend upon features, what in machine learning are called your x variables, the things that go into trying to predict what these machines are going to do. 

Another obvious parallel that we can make, where these things are actually used, is when you're talking about A/B tests. An A/B test, or an ABC test, or however you want it, in the idealized version, is a randomized controlled trial where, let's say, half of your users see one version of your app. Let's say it's a design question, something simple. You're trying to see if the background should be blue or green. Maybe half of them see blue, half of them see green. You want to see if it changes what action they take, and then you can kind of understand, after the experiment is done, which version is better. 

But if you take the Multi-armed Bandit approach, then instead of running an A/B test, you kind of have A and B, and the test tries both. Over time, it automatically learns which one to select, and maybe it converges on the one it should have been doing all along. It cuts out the middleman of, okay, let's run the test for X amount of weeks, then we're going to analyze the results, and then we're going to make some changes in our application to reflect whichever result is best. The Multi-armed Bandit approach does this all at once. You launch it. It immediately starts testing and then it slowly converges onto the right solution without your engineering team having to do anything. So pretty cool, huh? 

A good way to think about this is what's known as the exploration-exploitation dilemma. It's kind of a problem that you face in life as well. Another way to say it that rhymes, I don't know why people like things that rhyme, is learning versus earning. Do I spend time learning new skills? Or do I spend time applying what I already know, trying to exploit those skills to figure out what to do? 

Usually, the best thing to do is some kind of mixed strategy. You're learning and earning at the same time: you're exploring new ideas, but you also want to exploit the ideas that work. The Multi-armed Bandit approach is more exacting and more explicit in terms of how we're going to make that trade-off, so that's really good. 
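One simple mixed strategy, not the Thompson sampling discussed next but a common baseline, is epsilon-greedy: most of the time exploit the machine with the best observed average, and some fraction of the time explore a random one. A rough sketch, reusing the hypothetical BernoulliBandit from the earlier snippet:

```python
import random

def epsilon_greedy(bandit, n_arms, n_steps=1000, epsilon=0.1):
    counts = [0] * n_arms    # how many times each arm has been played
    totals = [0.0] * n_arms  # total reward observed from each arm
    for _ in range(n_steps):
        if random.random() < epsilon or 0 in counts:
            arm = random.randrange(n_arms)  # explore: try a random machine
        else:
            # exploit: play the machine with the best observed average so far
            arm = max(range(n_arms), key=lambda a: totals[a] / counts[a])
        reward = bandit.pull(arm)
        counts[arm] += 1
        totals[arm] += reward
    return counts, totals

counts, totals = epsilon_greedy(casino, n_arms=3)
```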

Now, one of the main concepts that comes up in the Multi-armed Bandit is something called Thompson sampling. Thompson sampling is… When did it come about? It's pretty early on, almost 100 years ago. Let's see, yeah, 1933, almost 100 years ago. It was described by mathematician William Thompson. 

The idea is it's a heuristic that we use to pick which arm we're going to pull in the Multi-armed Bandit problem. The concept is you have this kind of fuzzy notion at the beginning. By fuzzy notion, obviously, as someone who's listened to the show and knows about my view of probability as a subjective measure, it means that you have some probabilistic notion of how each action is going to lead to a reward. We have a probability distribution over the reward, given the action that we take, and the action is, of course, one of the K bandits. 

So what does Thompson sampling tell you to do? Well, first of all, interestingly enough, you do not do what you think you should do, which is pick the one with the highest expected reward. That's sort of what you'd expect. You'd be like, okay, I think on average this machine's going to get me $10 and this other machine is going to get me $20, so I'd better pull the $20 machine. But that solution would actually be total exploitation with no exploration. 

Think about it. That's like totally exploiting your current knowledge of the world, trying to do what you think is best given your state of the world without any conception that you might be wrong about the state of the world. Maybe try the one that's $10 because you could be wrong about that. Maybe the one that you think is $10 is really $30 on average, you don't know. 

All right, so the second thing you do not do is pick the one that you think will beat all of the others. Now that's a little bit different from highest expected reward. Let's use the example from before: one machine has an average of $20 and the other has an average of $10. But it could be that the one that pays an average of $20 usually pays $5, and only in very, very rare circumstances pays out a very high return. That's the jackpot. So in very rare circumstances you hit the jackpot, but in most cases you are only getting $5. Meanwhile, maybe the $10 one more or less always gives you $10. 

So the $10 machine is the one that beats the other in any given game, most of the time. That strategy would just pick that one: hey, that one is usually going to give me more. But you could have a situation where one machine beats all the others a plurality of times, and yet when the others win, they hit a big jackpot. So you'd be missing out on your chances to win that jackpot. 

The idea of Thompson sampling is you need to think even fuzzier than this. You don't really know the expected reward of each machine but you have some kind of probability cloud over how these machines work so you actually have a probabilistic notion of what the expected rewards are. You can combine these to get the total expected reward as well but we're going to use this probabilistic notion instead.

Let's suppose… It's the difference between saying, I think this machine has a 50% chance of giving me $10 and a 50% chance of giving me $20 and my average is $15. But when you take that average, you're kind of losing some information there. So we keep this full probabilistic notion of each machine. But on top of that, there's more.

Let me back up a little bit. Let's suppose the machine gives you $10 half the time, and $20 half the time, and then on average, that's 15. But we actually don't know how the machine really works. Maybe the difference in probabilities between $10 and $20, is not 50/50. Maybe it has other values that it could give us if we continue to play this machine. We only know more about what this machine's behavior is as we play it more and more. 

There are two levels of uncertainty here. The first level of uncertainty is that the machine is essentially a dice roll. We don't know what the machine is going to do. There's some uncertainty in every play. But then there's some uncertainty above that, on the weights of the machine. What are the probabilities of the machine? How does it work? So, oftentimes in probability, in these hierarchical models, there are two different levels of uncertainty and you have to kind of suss out the difference. I don't know if I explained it the best here, but that's one of the important things in trying to understand probability. 

Another example, which we'll get back to later, is if I flip a coin, I'm uncertain as to whether it's heads or tails. Even if I'm certain that it's a fair coin, I'm still uncertain about the flip. But then there's another uncertainty on top of that, which is: is it even a fair coin? I don't know. What mixture might it give between heads and tails? I don't know that either. That's the second level. You have to think multi-level here. You have some probabilistic notion over the expected rewards of each machine. Therefore, you have some probabilistic notion over which one gives the highest expected reward. 
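The coin example can be written down directly: a Beta distribution over the coin's unknown bias captures that second level of uncertainty, and each observed flip updates it. A small sketch, with made-up counts:

```python
from scipy.stats import beta

# Second level of uncertainty: what is the coin's probability of heads?
# Start from a uniform prior, Beta(1, 1), then update with observed flips.
heads, tails = 7, 3               # hypothetical observations
posterior = beta(1 + heads, 1 + tails)

print(posterior.mean())           # best single guess at the bias, about 0.67
print(posterior.interval(0.95))   # but a wide range of biases is still plausible
```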

So again, let's go back to the first thing. The first thing I said not to do is pick the one that you think has the highest expected reward. But you actually don't know which one has the highest expected reward, because you have uncertainty. You could end up calculating something like the following.

If I knew for sure this machine gives me $20 on average, and this other machine gives me $10 on average, if I knew that with certainty, then sure, pick the $20 one all the time. But I don't know those numbers for certain. I have some probability distribution over those numbers. So it could be that, for machine A, I think there's a 50% chance that machine A is the one with the highest expected return. I think there's a 30% chance that machine B has the highest expected return. And I think there's a 20% chance that machine C has the highest expected return. 

Those are the kinds of probabilities that you want to be calculating here. Again, I'm not telling you how to calculate these. I'm just kind of giving you a high-level intuitive notion of how this works. Once you calculate those probabilities, those are the probabilities from which you choose your next bandit. 

It's not that I always choose the one that's most likely to have the highest expected return. If machine A, like I said, has a 50% chance of having the highest expected return, then there's a 50% chance that I select A, a 30% chance that I select B, and a 20% chance that I select C, and then you keep playing that way. You're going to be trying A, you're going to be trying B, you're going to be trying C, and as you go on, you adjust those percentages, and that is done automatically. 

So that is Thompson sampling. That is how the Multi-armed Bandit works. Then of course, over time, you adjust your probabilities, you adjust to the world. You have to start out with some kind of prior model of how these machines work. Your first pull, for example, if you don't know anything about these machines, is probably going to come from a uniform distribution over the machines. In other words, on your first pull, you just pull a random machine.
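Putting this together, here is a hedged sketch of Thompson sampling for the Bernoulli case, again reusing the hypothetical BernoulliBandit from above. Drawing one value from each arm's posterior and playing the argmax is what produces the "select machine A with the probability that A is best" behavior described here, without ever computing those probabilities explicitly. With Beta(1, 1) priors, the very first pull is effectively a uniform random choice of machine.

```python
import numpy as np

def thompson_sampling(bandit, n_arms, n_steps=1000):
    # Beta(1, 1) priors: a uniform belief about each arm's unknown payout probability
    wins = np.ones(n_arms)
    losses = np.ones(n_arms)
    for _ in range(n_steps):
        # Draw one plausible payout rate per arm from its current posterior...
        sampled_rates = np.random.beta(wins, losses)
        # ...and play the arm whose sampled rate came out highest.
        arm = int(np.argmax(sampled_rates))
        reward = bandit.pull(arm)
        wins[arm] += reward
        losses[arm] += 1 - reward
    return wins, losses

wins, losses = thompson_sampling(casino, n_arms=3)
```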

If that makes sense, then you might be asking, okay, I kind of twisted my head into a knot, and I think that makes sense as to how you do it, but is this actually possible to calculate? How do you calculate it? Well, it turns out that in some simple cases, when you use simple classes of probability distributions, you can use Thompson sampling directly, and Thompson sampling is optimal. But it also turns out that that's not always possible.

I'm actually looking at a paper currently, a recent one from 2021. It's by Iñigo Urteaga and Chris Wiggins from Columbia. Chris Wiggins is a professor at Columbia, also chief data scientist at The New York Times, and a great researcher and speaker on these issues. He's heavily involved in the data science community in New York City, which is why I've met him a few times and worked with him a few times. 

The title of this paper is Variational Inference for the Multi-armed Contextual Bandit. The idea here is that Thompson sampling, as I described above and as it was first described in 1933, is great for certain ideal situations when you're like, I have these machines and I have this nice class of probability distributions, and all my ducks are in a row. But you can tell how complex it was just to state the problem. In more complex situations, these equations become what's known as computationally intractable. That means we can't find an exact answer, and we need to improvise, maybe estimate or simplify it somehow. 

How exactly do they do this? Well, they take the probability distributions that are suggested by Thompson sampling and approximate them with simpler distributions using a technique known as variational Bayes. Maybe we'll get into variational Bayes another time, but I think the bottom line here is that Multi-armed Bandits can be used in a lot of situations. They model active learning: trying to choose actions that maximize rewards over time, while you observe the reward that you get over time. 
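In general terms (this is the generic variational Bayes idea, not the specific construction in the Urteaga and Wiggins paper), the intractable posterior over the bandit's parameters is replaced by the closest member of a simpler family of distributions:

$$q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \; \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid \mathcal{D})\big),$$

which is equivalent to maximizing the evidence lower bound (ELBO),

$$\mathcal{L}(q) = \mathbb{E}_{q(\theta)}\big[\log p(\mathcal{D}, \theta)\big] - \mathbb{E}_{q(\theta)}\big[\log q(\theta)\big].$$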

So it's called active learning because as the learner, you're actually choosing actions as you're learning. Passive learning is something that we do more often, like supervised machine learning, where I'm just given the data set. I'm passive. I don't get to choose what's in the data set; it's already there. And oftentimes, that's fine as well. But the Multi-armed Bandit is active. 

It's interesting how you have this research on the Multi-armed Bandit from the pre-computer age, then it's used in the internet age in A/B testing to automate exploration and exploitation, and how even today there is a lot of research going on into how to solve it. Because even the approach we think of as the solution, Thompson sampling, isn't a full solution, since it's not computationally possible in every situation. That means there are a lot of tricks and techniques that can be used to get closer to what Thompson sampling would give us. 

All that is pretty good. I hope that makes sense to you. If you have a question about it, go on our Locals, maximum.locals.com, or send me an email through localmaxradio.com. 

I have a news update in terms of, what else, the thing that we've been talking about all year: computational natural language processing, generative language, and of course OpenAI's ChatGPT and what it is doing in the market for search. This article from The New York Times is by Kevin Roose, who has tested a new version of Microsoft Bing that he says uses ChatGPT. The title of the article is Bing (Yes, Bing) Just Made Search Interesting Again. 

Kevin claims that Bing is now good, actually, which is surprising because it hasn't been as good as Google in the past. It's kind of been playing number two for many years. Microsoft recently re-upped their investment in OpenAI to the tune of billions of dollars, like $10 billion, to build OpenAI's language model technology into their search product. 

Here's the key description from the article that I wanted to read for you. 

“The new Bing, which is available only to a small group of testers now and will become more widely available soon, looks like a hybrid of a standard search engine and a GPT-style chatbot. Type in a prompt like ‘Write me a menu for a vegetarian dinner party’ and the left side of your screen fills up with the standard ads and links to recipe websites. On the right side of your screen, Bing’s AI engine starts typing out a response in full sentences, often annotated with links to the websites it's retrieving information from. 

To ask a follow-up question or make a more detailed request, for example, ‘Write a grocery list for that menu sorted by aisle with amounts needed to make enough food for eight people,’ you can open up the chat window and type it. 

For now the new Bing only works on desktop computers using Edge, Microsoft’s web browser, but the company told me that it plans to expand to other browsers and devices eventually. 

I tested the new Bing for a few hours on Tuesday afternoon and it's a marked improvement over Google. It's also an improvement over ChatGPT, which, despite its many capabilities, was never designed to be used as a search engine. It doesn't cite sources, and it has trouble incorporating up-to-date information or events. So while ChatGPT can write a beautiful poem about baseball, or draft a testy email to your landlord, it's less suited to telling you what happened in Ukraine last week, or where to find a decent meal in Albuquerque.”

The article here calls out some imperfections that these chat models, these large language models, have. The kind of obvious one is the weakness in math. Sometimes it gets math right but sometimes it just does not understand math, which is kind of funny because it's all based on math. Sometimes it makes up some stuff wholesale, just gives you fake news and makes it sound good. So all of those are issues that we have to work around. 

But Roose says, “When the new Bing works, it's not just a better search engine. It's an entirely new way of interacting with information on the internet and one whose full implications I'm still trying to wrap my head around.” Obviously, the rest of us can't use it yet, so we'll have to see whether the hype is warranted when we can test it out. 

Okay, so what's going on over at Google? NPR says Google shares dropped, the Google market cap, I guess, dropped a hundred billion dollars after its new AI chatbot made a mistake. 

NPR writes that in a fateful ad that ran on Google's Twitter feed this week, “the company described Bard as a launchpad for curiosity and a search tool to help simplify complex topics. An accompanying GIF prompts Bard with the question, what new discoveries from the James Webb Space Telescope can I tell my nine-year-old about? The chatbot responds with a few bullet points, including the claim that the telescope took the very first pictures of exoplanets, or planets outside the solar system. These discoveries can spark a child's imagination about the infinite wonders of the universe, Bard says. 

But the James Webb telescope didn't discover exoplanets. The European Southern Observatory's Very Large Telescope took the first pictures of those special celestial bodies in 2004, a fact that NASA confirms. Social media users quickly pointed out that the company could’ve fact-checked the exoplanet claim by, well, googling it.”

This one episode, of course, doesn't discount Google's whole stack. Again, like the article says, this is an issue that can be fixed by fact-checking, but it can happen with any of these generative text systems. I don't believe for a second that ChatGPT, even when it goes on to GPT-4, even when you scale it up, even when Microsoft is using it, I don't think any of these generative text systems are immune from this whole thing. The only thing is that this was embarrassing for Google because they didn't check the example that they were giving out. 

So again, the article's emphasis is more on how the technology raises the risk of biased answers, increased plagiarism, and the spread of misinformation. Though they're often perceived as all-knowing machines, AI bots frequently state incorrect information as fact because they're designed to fill in the gaps. 

I think a big question is going to be well, how do we deal with that? The correct information problem goes beyond just AI. That's a problem that we're dealing with even in the human internet. So we still have to have good ways of dealing with that. It's probably going to require us thinking about epistemology. It's probably going to require us thinking about what are our filters for misinformation. Could those filters themselves be captured and be vehicles for misinformation, as is often the case? 

Clearly, Microsoft is winning the narrative war today. Over the last several months, Google has been behind, not necessarily in the technology, but in the narrative. Could this change? Absolutely. It's like politics: sometimes one party’s ahead and sometimes another party's ahead. 

I still think Google's less-than-stellar record on launching new products, going all the way back to Episode 36, the Google Graveyard, can hurt them. What helps them is their years of research dominance and hiring the brightest. It would really be a shame: you hire the best people, you have all this research, and then you stumble right at the end of the race because your system for launching products is not that great. 

But still, this is brand new AI tech. The winner could go in any direction, including new entrants coming in that could challenge both Google and Microsoft, perhaps even OpenAI itself. 

Remember, Google started by partnering with the then-dominant Yahoo, helped Yahoo a lot, but ultimately surpassed them, and maybe ultimately killed them in the end. There was a period in the late 2000s and early 2010s when Yahoo was clinging to the fact that they had their home screen where you could do a Yahoo search. But once it wasn't backed by Google Search, and once the Google Search screen was really what people wanted, there was no reason to use Yahoo anymore, so they died out. Maybe there's more to the story of Yahoo, I don't know, but that's what's going on there. 

All right. So now that we've gone through our news update, now it's time, of course, for the probability distribution of the week. 

Narrator: Now, the probability distribution of the week. 

Max: I probably shouldn't have said probability distribution of the week before I played the bumper music. I think you're supposed to be surprised by the bumper and be like, oh yeah, that's what this is. I don't know, maybe my broadcast skills need an update here. 

All right, the Dirichlet distribution, that's what we're talking about today. It's something I've thought about a lot, and we're going to talk about it more later because, as I'll mention, there are a lot of things that you can build from the Dirichlet distribution that are really cool, really sophisticated. 

And of course, on the podcast here today, I'm trying to get you interested in these ideas. I'm not trying to get into all the nitty-gritty. But I will have links on the website, localmaxradio.com/265, to get into the nitty-gritty if you want. 

I went with the Dirichlet distribution today because it goes with the Multi-armed Bandit pretty well. In the Multi-armed Bandit, you have many different actions to choose from, but it's a finite space: you have K different actions to choose from. In the Dirichlet distribution, we're also looking at categorical data, or multinomial data, with several options to choose from. You also encounter that two-phase thinking, and I'll get to that in a little bit.

So in Episode 234, we covered the probability simplex. The probability simplex is when there are several different events, each with a certain probability, and those probabilities have to add to one because I know that exactly one of these things is going to be true. 

Then in Episode 26, we looked at the beta distribution and we said, okay, there are two events: there's some probability p that event A is going to occur, and then one minus p, the complementary probability, that event B is going to occur. We have some uncertainty as to what that probability is, and we want to come up with some probability distribution over that uncertainty. A beta distribution does that. 

The Dirichlet distribution is the multi-dimensional generalization of the beta distribution. It's a distribution over the probability simplex. We have a bunch of numbers that add to one. I don't know what those numbers are. I want to express uncertainty as to what those numbers are, but I have some information about them. That's when I can use the Dirichlet distribution.

Imagine that you have some weighted die where each side has a different weight. If you're uncertain as to how that die was weighted, you can express that uncertainty as a Dirichlet distribution, which is very useful. This is where we get into the two-phase thinking. There's the bottom phase, where I know how the die works; I just don't know what it's going to come up as when I roll it. Then there's the higher level, the Dirichlet distribution, where I actually don't know how the die works. I have some ideas, and that uncertainty is expressed in the Dirichlet distribution. 
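Here is a small sketch of that two-phase picture using scipy; the particular parameter values are made up for illustration.

```python
import numpy as np
from scipy.stats import dirichlet

# Upper level: uncertainty about how a six-sided die is weighted.
alpha = np.array([2.0, 2.0, 2.0, 2.0, 2.0, 4.0])   # a hunch that side 6 is favored
die_weights = dirichlet(alpha).rvs()[0]             # one plausible weighting of the die

# Lower level: given that weighting, a roll is an ordinary categorical draw.
roll = np.random.choice(6, p=die_weights) + 1
print(die_weights, roll)
```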

Now you're talking about a K-sided die. You have K categories. The Dirichlet actually has K parameters but unlike the categorical distribution, they no longer have to add up to one. So yes, each of the K parameters kind of corresponds to a different side of the die but they don't have to add up to one. 

You can normalize them to add up to one, in other words, divide by the sum. That is actually the mean location of the Dirichlet distribution; that's how you think the die is weighted, on average, once you normalize those K numbers. But it also matters whether those numbers are very high or very low. If you add them up, do they come to 100, or close to zero, or close to one? 
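Concretely, the normalized parameters are the mean of the Dirichlet, while their overall size controls how tightly samples cluster around that mean. A tiny, hypothetical example:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([10.0, 30.0, 60.0])
print(alpha / alpha.sum())       # normalized parameters: [0.1, 0.3, 0.6]
print(dirichlet(alpha).mean())   # the Dirichlet's mean is exactly that
```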

If the values are very high, then the probability is concentrated around the mean. If the values are very low, there's a high amount of variance, and the probability starts to accumulate in the corners of the simplex. What does that really mean practically? It means that if you have a Dirichlet distribution with a high weight, we're pretty certain about the weighting of the die. 

For example, let's say we think the die is fair. That means every parameter of the Dirichlet distribution, one for every side, is the same, because it's fair. They're all equal, but they're also all really high. Maybe they're all a hundred, maybe they're all a thousand, maybe they're all a million. That means we think the die is going to be very close to fair. It might be a little bit off; one side might be a few points higher than another, so we're still uncertain, but we're expressing a high degree of certainty. 

And that doesn't have to be a fair die. It could be that we're pretty certain that the one is twice as likely to come up as the six on this die. Maybe it's not exactly twice as likely, maybe it's 2.1 times as likely, maybe it's 1.9 times, but we're pretty sure it's around there. That will also give you a Dirichlet distribution with a very high weight. 

Now what happens if it's a low weight? A low weight could mean, let's say in the symmetric case, that they're all 0.01. That means it's a trick die, where one side essentially always comes up, but we're very uncertain over which side that is. So notice the two cases: you can have a fair die with a very high weight for the Dirichlet distribution, or a trick die where one side always comes up, with a very low weight for the distribution, and you have no idea which side it is. 

In both cases, for a single roll, we have total uncertainty. We're equally uncertain as to which number it’s going to come up as. But it's a different kind of uncertainty. In one case, the uncertainty is because we know the die is fair, and that fairness gives us a rightful uncertainty over what's going to happen. In the other case, we know that the die is unfair, but we don't know how it's unfair, so we still have the same uncertainty over what comes up. 
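Here's that contrast as a sketch: with a large, equal weight every sample looks like a nearly fair die, while with a tiny, equal weight every sample looks like a trick die, though which side it favors varies from draw to draw.

```python
import numpy as np
from scipy.stats import dirichlet

high = dirichlet([100.0] * 6).rvs(3)   # three draws: each looks like a nearly fair die
low = dirichlet([0.01] * 6).rvs(3)     # three draws: each looks like a trick die

print(np.round(high, 3))  # rows all close to [0.167, 0.167, 0.167, 0.167, 0.167, 0.167]
print(np.round(low, 3))   # rows with almost all the mass on one (random) side
```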

That goes hand in hand with the two levels of thinking. That's why, unlike in the categorical distribution, these numbers don't add up to one. That gives you an extra degree of freedom, and that extra degree of freedom corresponds to the extra layer of uncertainty that we're taking into account here. 

As you can imagine, you can use the Dirichlet distribution to build many things just like you can the normal distribution. The Dirichlet multinomial is my favorite and we'll have to follow up on that another time. I have a paper called Fast Maximum Likelihood Estimation of Dirichlet Multinomial which I wrote in 2014. Honestly, I think it's almost time for an update on that because I've learned so much in the last 10 years but hopefully, we'll get into that soon and maybe I'll post some papers on that. Maybe I'll post that paper here on the Local Maximum.

Next week, we're going to speak to author Joel Grus about the latest technology in natural language processing. This is getting into the weeds of how this works, not just the tech news on Google and Microsoft and OpenAI, which is what we've been doing, but the more nitty-gritty. It'll make us all smarter in terms of how this natural language processing technology actually works. 

If you really want to understand that, you're not going to want to miss this. Then later down the road, we're going to have Aubrey Clayton, another author, and he is going to talk about Bayesian inference and its importance to humanity. 

There was a lot of interesting feedback from last week's episode with Adam Kovacevich. I know a lot of you have messaged me, and a lot of my audience has very different views, and that's great, I love it. So tell me what you think about all of these things.

Once again, I'm going to pitch the Locals, maximum.locals.com. Use the promo code WINTER23, or just email us through localmaxradio.com. Have a great week, everyone. 

Narrator: That's the show. To support the Local Maximum, sign up for exclusive content and our online community at maximum.locals.com. The Local Maximum is available wherever podcasts are found. If you want to keep up, remember to subscribe on your podcast app. Also, check out the website with show notes and additional materials at localmaxradio.com. If you want to contact me, the host, send an email to localmaxradio@gmail.com. Have a great week.

Episode 266 - Simplicity, Complexity, and Text Classification with Joel Grus

Episode 264 - Talking Tik Tok, Privacy, and Propaganda with Adam Kovacevich