Maximum likelihood estimation

Maximum Likelihood Estimation (MLE) is a technique used to estimate the parameters of a statistical model.

But what are parameters?

A parameter is a quantity of the model whose value is estimated from observed data. For example, in the case of linear regression (see our article on linear regression), the model is Y = mx + b, and the parameters are m and b. Here are a few examples of distributions and their corresponding parameters; these are the kinds of parameters we will estimate using MLE.

Parameters and random variables
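To make the idea of parameters concrete, here is a minimal sketch of the linear-regression example above: we generate data from y = mx + b with known values and recover m and b by a least-squares fit (the data values and the use of `np.polyfit` are illustrative choices, not from the article).

```python
import numpy as np

# Illustrative "true" parameters of the model y = m*x + b.
m_true, b_true = 2.0, 5.0
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = m_true * x + b_true

# Fitting a degree-1 polynomial recovers the parameters m and b from the data.
m_hat, b_hat = np.polyfit(x, y, deg=1)
print(m_hat, b_hat)
```

Since the data here are exactly linear, the fit recovers the true parameters up to floating-point error.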

Understanding MLE intuitively

Suppose we want to fit a distribution to the data we have: a fitted distribution is easier to work with, and it lets us generalize to similar problems. A distribution assigns probabilities to the possible outcomes of an experiment. Searching over the class of all possible distributions would be intractable, and there is no general method that determines the optimal distribution for the data.

Hence, what we can do is pick a distribution from a parameterized family of distributions p(X; θ), indexed by the parameter θ. The basic idea of MLE is to pick the parameters θ that maximize the probability of the observed data. We can understand this further by going through an example.

Say we want to fit a distribution to this data, which represents the heights of students in a class. The height increases from left to right.

Distribution of height in a class

We decide on a family of distribution curves, such as the normal, exponential, or gamma distribution (see wiki), and then figure out how to place the curve. Here we can observe that most of the students' heights are concentrated in the middle and become sparser towards the ends. We can thus choose the normal distribution (a bell-shaped curve), since it matches the shape of the data.

Placing the center of the normal distribution

Our goal is to find the position of the curve that makes the observed data most likely. We place the center of the distribution at various locations and compute the likelihood of the data at each one. In the figure above, we can observe that if the curve is placed to the extreme left, it matches the data poorly, so the likelihood there is low.

Plotting the position of the center

Thus, we change the location of the center and plot the likelihood of observing the data at each position. From this plot we can read off the best position for the center, and thus for the curve, which gives us our parameter estimate. Now let us look at this mathematically.
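The "slide the center and record the likelihood" procedure can be sketched numerically. Here is a minimal version, assuming a hypothetical sample of heights and a fixed spread (the data values, grid range, and sigma are all illustrative assumptions):

```python
import numpy as np

# Hypothetical sample of student heights in cm (illustrative values).
heights = np.array([160.0, 162.5, 165.0, 167.0, 168.5,
                    170.0, 171.5, 174.0, 176.0, 181.0])

def normal_pdf(x, mu, sigma):
    """Density of a bell-shaped normal curve with center mu and spread sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Slide the center of the curve across candidate positions (spread held fixed)
# and record the likelihood of the whole sample at each position.
candidate_centers = np.linspace(150.0, 190.0, 401)
sigma = 6.0  # assumed fixed spread for this sketch
likelihoods = np.array([np.prod(normal_pdf(heights, mu, sigma))
                        for mu in candidate_centers])

# The center with the highest likelihood sits at (or next to) the sample mean.
best_center = candidate_centers[np.argmax(likelihoods)]
print(best_center)
```

Plotting `likelihoods` against `candidate_centers` reproduces the curve in the figure above: the likelihood peaks where the center of the normal curve lines up with the bulk of the data.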

How is this represented mathematically?

Mathematically, assuming the n samples X1, X2, …, Xn are independent, the likelihood of the data is the product of the probabilities of the individual samples under the model:

L(θ) = p(X1; θ) · p(X2; θ) · … · p(Xn; θ)

In maximum likelihood estimation, our goal is to choose the parameter values that maximize the likelihood function. We choose θ as follows:

θ̂ = argmax_θ L(θ)

Here, the argmax of a function denotes the value of its argument at which the function attains its maximum.

Graphs of Logarithmic functions

Now, observe that the likelihood is a product of n terms, and differentiating such a product to find the best parameters is tedious.

We take the log of the likelihood since it makes the math simpler: a property of the log is that it converts the product into a sum. Also, since the log is a monotonically increasing function, the argmax of a function is equal to the argmax of the log of the function. Thus, for MLE, we write the log-likelihood function as follows.

log L(θ) = log p(X1; θ) + log p(X2; θ) + … + log p(Xn; θ)
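A quick numerical sketch of why this works (the probability values below are arbitrary illustrative numbers):

```python
import numpy as np

# A few small per-sample probabilities; arbitrary illustrative values.
probs = np.array([1e-3, 2e-3, 5e-4, 1e-3])

product = np.prod(probs)           # likelihood: product of the probabilities
log_sum = np.sum(np.log(probs))    # log-likelihood: sum of the logs

# The log of the product equals the sum of the logs, and because log is
# increasing, both quantities are maximized by the same parameter values.
print(np.isclose(np.log(product), log_sum))  # prints True
```

As a practical bonus, summing logs is also numerically stabler than multiplying many tiny probabilities, which can underflow to zero.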

Taking the log of the likelihood is recommended for ease of calculation. We then find the parameter values at which the log-likelihood attains its maximum, typically by setting its derivative with respect to the parameters to zero and solving.
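As a sketch of this procedure for a normal model fitted to hypothetical height data: setting the derivative of the log-likelihood to zero yields the well-known closed-form MLEs, the sample mean and the (biased) sample variance. The data values here are illustrative assumptions.

```python
import numpy as np

# Hypothetical height data in cm (illustrative values).
heights = np.array([160.0, 162.5, 165.0, 167.0, 168.5,
                    170.0, 171.5, 174.0, 176.0, 181.0])

def log_likelihood(mu, sigma2, x=heights):
    """Log-likelihood of the data under a normal model with mean mu, variance sigma2."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Closed-form MLEs for the normal model, obtained by zeroing the derivative:
mu_hat = heights.mean()
sigma2_hat = np.mean((heights - mu_hat) ** 2)

# Sanity check: nudging either parameter away from the MLE lowers the log-likelihood.
best = log_likelihood(mu_hat, sigma2_hat)
print(best > log_likelihood(mu_hat + 0.5, sigma2_hat))   # prints True
print(best > log_likelihood(mu_hat, sigma2_hat * 1.2))   # prints True
```

For families without closed-form solutions, the same log-likelihood would instead be maximized numerically.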

How is likelihood different from probability?

Here is an answer on StackExchange that explains the distinction well:

Probability quantifies anticipation (of outcome), likelihood quantifies trust (in model). Suppose somebody challenges us to a 'profitable gambling game'. Then, probabilities will serve us to compute things like the expected profile of your gains and losses (mean, mode, median, variance, information ratio, value at risk, gambler's ruin, and so on). In contrast, likelihood will serve us to quantify whether we trust those probabilities in the first place; or whether we 'smell a rat'.

You can also explore definitions from different angles in the same StackExchange post.
