
What is a hill climbing algorithm?

A hill climbing algorithm is any algorithm that searches for an optimal solution by starting from an arbitrary solution and randomly tweaking it to see if the result is an improvement. It’s a very simple algorithm to implement and can solve some problems on its own, but it often needs to be “upgraded” in some way to be useful.

Prerequisites for Implementation

Several elements must be present in order to implement a hill climbing algorithm. These are the three main ones:

1) A well-understood mathematical space of solutions must be available. This could be a discrete space (such as the space of permutations of 26 characters) or a continuous space (say, 10 real-valued parameters of a model).

2) For each solution, a scoring function must be available that tells us how good that solution is (and lets us compare the scores of several solutions).

3) A transition function must be available to “tweak” any given solution. In other words, for any solution, this will find a random and nearby solution.

Transition functions can take on a variety of forms. In a discrete space, the solutions are often arranged in a network, and a transition function is a “random neighbor”. In a continuous space, the transition function will often look at nearby points by adding a normal distribution (or something similar) to the starting point.
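As a rough sketch of both cases in Python (the function names, and the swap-based neighbor for the discrete case, are illustrative choices rather than anything prescribed here):

```python
import random

def continuous_transition(point, scale=0.1):
    """Propose a nearby point by adding Gaussian noise to each coordinate."""
    return [x + random.gauss(0, scale) for x in point]

def discrete_transition(permutation):
    """Propose a neighboring permutation by swapping two random positions."""
    neighbor = list(permutation)
    i, j = random.sample(range(len(neighbor)), 2)
    neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
    return neighbor
```

The `scale` parameter here plays the role of how "narrow" or "broad" the transition function is.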

An assumption that must hold for a transition function: solutions that are close together (that is, ones likely to step into each other as the algorithm runs) have similar scores. If this doesn’t hold, then there’s no point in searching near a higher-scoring solution, since its neighbors are not likely to score highly either. If it holds only very weakly, the hill climbing algorithm will take much more time to make any progress.

An example of a problem where this assumption doesn’t hold is finding the input to a cryptographic hash function that yields a particular output. Even if you find a solution that gets some parts of the output correct, a “step” (which likely involves swapping characters in the input) will erase all those matching parts and generate what is effectively a random new output. Therefore, hill climbing (or any optimization algorithm) cannot be used to crack cryptographic hash functions - which is a good thing for data security!

A Sketch of the Algorithm

  1. Choose a random point and label it as the current solution.

  2. Compute the score of the current solution.

  3. Use the transition function on the current point to find a new solution (called the proposed solution).

  4. Compute the score for the proposed solution. If it is less than or equal to the score for the current solution, go back to step 3.

  5. The proposed solution has a score greater than the current solution. Make the proposed solution the new current solution, and go to step 2.

Note that there’s no stopping criterion in the above sketch. One would have to be chosen on a case-by-case basis.
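The steps above can be sketched as a short Python loop. Here a fixed budget of proposals stands in for the stopping criterion, and the toy scoring and transition functions at the bottom are purely illustrative:

```python
import random

def hill_climb(score, transition, start, max_proposals=10_000):
    """Basic hill climbing: accept a proposal only when it scores strictly higher."""
    current = start
    current_score = score(current)
    for _ in range(max_proposals):
        proposed = transition(current)
        proposed_score = score(proposed)
        if proposed_score > current_score:
            current, current_score = proposed, proposed_score
    return current, current_score

# Toy example: maximize -(x - 3)^2, whose global maximum is at x = 3.
best, best_score = hill_climb(
    score=lambda p: -(p[0] - 3) ** 2,
    transition=lambda p: [p[0] + random.gauss(0, 0.1)],
    start=[0.0],
)
```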

The Transition Function and Convergence

The convergence of a hill climbing algorithm is partially determined by the nature of the transition function. Transition functions are closely related to the topology of a mathematical space, or the connections between points.

A narrow transition function stays close to the point of origin, while a broad transition function may take you far from it. Typically, a narrow transition function makes it more likely that a hill climber will find a better solution quickly, while a broad one makes larger jumps possible when they are needed.

A narrow transition function can be turned into a broad transition function by applying it multiple times. Therefore, a balance can be kept between finding nearby solutions more quickly and making larger jumps.

In a discrete space (which is where hill climbers are most likely to be implemented), the transition function is often a finite list of “potential next steps” from a given point. In this case, a hill climber could try every step in the list of neighbors and choose the best one. The drawback is a slower speed per step, but it could take fewer steps to converge.
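This “try every neighbor and take the best” variant is often called steepest-ascent hill climbing. A minimal sketch, where `neighbors` is an assumed helper returning the finite list of next steps:

```python
def steepest_ascent(score, neighbors, start):
    """Repeatedly move to the best-scoring neighbor until no neighbor improves."""
    current, current_score = start, score(start)
    while True:
        candidates = neighbors(current)
        if not candidates:
            return current, current_score
        best = max(candidates, key=score)
        if score(best) <= current_score:
            return current, current_score  # no neighbor improves: local maximum
        current, current_score = best, score(best)

# Toy example: climb on the integers toward the maximum of -(n - 7)^2.
peak, peak_score = steepest_ascent(lambda n: -(n - 7) ** 2, lambda n: [n - 1, n + 1], 0)
```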

In Bayesian Inference and Machine Learning

In Bayesian Inference, a hill climbing algorithm can be used to search the space of a posterior probability - to find a hypothesis that has a high probability of being right relative to the other hypotheses in the space. Therefore, this is a search for the maximum a posteriori (MAP), or the maximum likelihood, of a probability distribution.

A good example of this was covered in Episode 4 of the Local Maximum when solving the substitution cipher.

More generally in machine learning, hill climbing can be used to search a solution space, including spaces scored by loss functions and energy functions, where the convention is descent (minimizing) rather than climbing.

Drawbacks to these applications

For most potential applications in Bayesian Inference and Machine Learning, hill climbing by itself is (substantially) sub-optimal and can easily be improved. However, sometimes it’s a good starting point, especially given the simplicity of implementing it.

One issue is that the transition function typically chooses a nearby solution in a “random” direction and doesn’t take into account the gradient of the scoring function. Sometimes that gradient can be calculated directly using differential calculus, but even when it can’t, a hill climber on a continuous mathematical space can often estimate a gradient by scoring several samples. The main exceptions are discrete mathematical spaces, where gradients (or directions) have no meaning.

In order to take the gradient into account, use a gradient ascent algorithm (or, more commonly, gradient descent to find a minimum). For even faster convergence, use the second derivative (or the Hessian matrix in the multi-dimensional case) with a Newton-Raphson method.
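As a sketch of the gradient ascent idea (the learning rate and step count are arbitrary illustrative values, and in practice a stopping criterion would replace the fixed step count):

```python
def gradient_ascent(grad, start, learning_rate=0.1, steps=100):
    """Follow the gradient uphill instead of proposing random steps."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [x + learning_rate * gx for x, gx in zip(point, g)]
    return point

# Toy example: maximize f(x, y) = -(x - 1)^2 - (y + 2)^2, whose gradient is
# (-2(x - 1), -2(y + 2)); the maximum is at (1, -2).
top = gradient_ascent(lambda p: [-2 * (p[0] - 1), -2 * (p[1] + 2)], [0.0, 0.0])
```

Unlike the hill climber, nothing here is random: each step moves directly in the steepest uphill direction.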

Finally, often times in Bayesian Inference the goal is not to find the maximum posterior. Sometimes, the goal is instead to sample from the posterior distribution. In this case, variants of Markov Chain Monte Carlo methods can be used instead of hill climbing, and instead of gradient/Newton methods.

The Local Maximum Problem

A common problem with hill climbing algorithms (one that affects gradient-based algorithms as well) is their tendency to get stuck in a local maximum. See Question: What is a Local Maximum?

This can be ameliorated in several ways:

1) Random restart: every time the algorithm gets stuck in a local maximum, restart it at another random location and find another local maximum. This allows the algorithm to find multiple local maxima, but each trial learns nothing from previous trials.

2) Random backup: if a local maximum is reached, retrace your steps in the hope of finding a better solution. The number of steps to go back could be random. This is a potential improvement over random restart because some information is retained and the starting point is better.

3) Broaden the transition function by applying the transition function multiple times.

4) Simulated annealing: in this case, the hill climber probabilistically accepts lower-scoring solutions, and becomes less and less likely to do so as time goes on. When it finally stops accepting lower-scoring solutions and settles into a local maximum, that maximum is much more likely to be a global maximum (or at least a higher local maximum) than before.
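A minimal sketch of simulated annealing, using the common exponential acceptance rule and a geometric cooling schedule (both are illustrative choices among many):

```python
import math
import random

def simulated_annealing(score, transition, start, temp=1.0, cooling=0.999, steps=10_000):
    """Accept worse solutions with probability exp(delta / temp); cool temp over time."""
    current, current_score = start, score(start)
    for _ in range(steps):
        proposed = transition(current)
        delta = score(proposed) - current_score
        # Always accept improvements; accept declines with shrinking probability.
        if delta > 0 or random.random() < math.exp(delta / temp):
            current, current_score = proposed, current_score + delta
        temp *= cooling  # the algorithm gets pickier as time goes on
    return current, current_score
```

As `temp` approaches zero, the acceptance rule becomes equivalent to plain hill climbing.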


