
What is Occam's Razor?

Occam’s Razor is the idea that less complex explanations are more likely to be correct than more complex ones. Alternatively - when we are talking about statistical inference - simpler hypotheses tend to be more probable than complex hypotheses. A philosophical razor, in general, is a way of removing (or demoting) certain ideas when the number of possibilities is large.

Occam’s Razor is a probabilistic statement and cannot always be true. We examine some of the practical and theoretical applications below.

Examples

Occam’s razor is usually brought up as a counter to so-called “conspiracy theories”. In reality, it does not refute every deviation from the official narrative, but it does strike down exceedingly complex and implausible allegations, such as those that would require thousands of people to collude in secrecy.

In Inference and Machine Learning

Occam’s Razor applies when constructing a Bayesian prior distribution over hypotheses. There are several reasons why you’d want to weight simpler explanations as more likely.

First - and this may depend on the application, but it is very common - Occam’s razor holds because more complex explanations require more coincidences: more independent events that must all occur, and the product of their probabilities is correspondingly smaller.
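As a toy illustration (with made-up numbers, not tied to any particular application), the sketch below shows how an explanation that needs several independent coincidences ends up far less probable than a simpler one:

```python
# Toy numbers: a simple explanation needs one moderately likely event,
# a complex one needs several independent coincidences to all line up.
simple_explanation = [0.3]
complex_explanation = [0.3, 0.2, 0.1, 0.05]

def joint_probability(event_probs):
    """Probability that all of the (assumed independent) events occur."""
    p = 1.0
    for prob in event_probs:
        p *= prob
    return p

print(joint_probability(simple_explanation))   # 0.3
print(joint_probability(complex_explanation))  # 0.0003
```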

Second - it could be that Occam’s Razor isn’t literally true, but it saves time and energy to keep the probability model simple to build and to work with. Therefore, we want to set it up so that we only reach for the complex explanations when the evidence is overwhelming.
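One way to “set it up” like this is simply to give the simpler hypothesis most of the prior mass. The following sketch uses two hypothetical hypotheses and invented likelihood values to show that the complex hypothesis only wins the posterior once the evidence strongly favours it:

```python
# Hypothetical prior: the simple hypothesis gets most of the mass up front.
prior = {"simple": 0.95, "complex": 0.05}

def posterior(likelihood):
    """Bayes' rule over the two hypotheses, given per-hypothesis likelihoods."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

# Mild evidence for the complex hypothesis: the simple one still dominates.
print(posterior({"simple": 0.4, "complex": 0.6}))    # simple ~0.93
# Overwhelming evidence: only now does the complex hypothesis take over.
print(posterior({"simple": 0.01, "complex": 0.99}))  # complex ~0.84
```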

Third, Occam’s razor must be true when the hypothesis space is infinite (specifically, countably infinite) and we want to assign every hypothesis a non-zero probability. For this, see the next section on predicting new words.

Example: Predicting New Words

Suppose we want to build a language model that predicts the probability of a given word in any given spot in an English text (without context). The model assigns each possible word a non-zero probability of appearing at any given time in an English text. We also want to include words that we have never seen.

Suppose that we define a word as a “finite string of Latin characters”. This means that there are infinitely many words! This also means that some words are going to be simple to write down, and others are going to have very many letters.

This always happens when representing an infinite hypothesis space: if there were a ceiling on the number of characters used to represent its elements, the space would be finite.

We can immediately conclude that as the complexity grows (as the number of characters in a potential word goes up and up), the probability of seeing such a word must eventually shrink toward zero. If it didn’t, the probabilities could not sum to 1. So, in this sense, Occam’s Razor is a mathematical certainty.
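One concrete (and entirely hypothetical) way to build such a model: spread probability mass (1/2)^L over word lengths L = 1, 2, 3, ..., then spread it uniformly over the 26^L words of each length. Every word gets a non-zero probability, the total mass is 1, and longer words are necessarily less likely:

```python
# Toy prior over all finite strings of the 26 Latin letters (a hypothetical
# construction, not a real language model): mass (1/2)**L goes to length L,
# shared uniformly among the 26**L words of that length.

def word_probability(word):
    length = len(word)
    return (0.5 ** length) / (26 ** length)

print(word_probability("dorn"))       # short word: relatively large probability
print(word_probability("dornblatt"))  # longer word: much smaller probability

# Sanity check: (per-word probability) x (number of words of that length),
# summed over lengths 1..30, approaches 1 (exactly 1 in the limit).
total = sum((0.5 ** L) / (26 ** L) * (26 ** L) for L in range(1, 31))
print(total)  # ~1.0
```

Note that this toy prior is monotone in word length, which is stronger than Occam’s Razor actually requires - as the next paragraph points out, the razor only has to hold globally, not locally.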

Note that while Occam’s Razor must be true globally (I’d be less surprised by seeing a new word like “dorn” than I would be by seeing a new word like “askawhickacallitendinatafalletcanatunafaroutastan”), it doesn’t have to be true locally. At some point we would learn that 1- and 2-letter words, while they exist, are actually less common than 3- and 4-letter words.
