Maximum Likelihood Estimation (MLE) is a technique used to estimate the parameters of a statistical model.
But what are parameters?
A parameter is a quantity that defines a particular model and whose value can be estimated from data. For example, in linear regression (see our article on linear regression), the model is y = mx + b and the parameters are m and b. Distributions have parameters too: a normal distribution has a mean μ and a standard deviation σ, while an exponential distribution has a rate λ. We will be estimating parameters like these using MLE.
Understanding MLE intuitively
Suppose we want to fit a distribution to the data we have, since a fitted distribution is easier to work with and lets us generalize to similar problems. A distribution describes how probable each outcome of an experiment is. If we were to consider the class of all possible distributions, the search would be intractable, and there is no general method for determining the optimal distribution for the data.
Instead, we pick the distribution from a parameterized family of distributions p(X;θ), indexed by the parameter θ. The basic idea of MLE is to pick the parameters θ that maximize the probability (or, for continuous data, the probability density) of the observed data. We can understand this further by going through an example.
Say we want to fit a distribution to this data, which represents the heights of the students in a class. Height increases from left to right.
We decide on a family of distribution curves, such as the normal, exponential, or gamma distribution (see wiki), and then figure out where to place the curve. Here we can observe that most of the students' heights are concentrated in the middle and get sparser towards the ends. We can therefore choose the normal distribution (a bell-shaped curve), because its shape matches the shape of the data.
Our goal is to find the position of the curve at which the likelihood of the observed data is highest. We place the center of the distribution at various locations and observe the likelihood of the data at each one. In the figure above, we can observe that if the curve is placed at the extreme left, it does not match the distribution of the data.
Thus, we try to change the location of the center and plot the likelihood of observing the data at various locations. By observing this plot, we can find out the best possible position for the center and thus the curve, which will aid us in calculating the parameters. Now let us try to understand it mathematically.
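The sliding-the-curve idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the height values are made up, the spread of the bell curve is fixed at an arbitrary SIGMA, and the names normal_pdf, likelihood, and candidate_centers are hypothetical helpers rather than anything from a library.

```python
import math

# Hypothetical sample of student heights in cm (made-up values for illustration).
heights = [150.0, 155.0, 158.0, 160.0, 161.0, 163.0, 165.0, 168.0, 172.0, 178.0]

# Assumed fixed spread of the bell curve; only the center is varied here.
SIGMA = 8.0


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood(data, mu, sigma):
    """Product of the densities of all observations (assumes independence)."""
    result = 1.0
    for x in data:
        result *= normal_pdf(x, mu, sigma)
    return result


# Candidate centers from 140 cm to 190 cm in 0.5 cm steps.
candidate_centers = [140.0 + 0.5 * i for i in range(101)]
likelihoods = [likelihood(heights, mu, SIGMA) for mu in candidate_centers]

# The best center is the candidate with the highest likelihood.
best_index = max(range(len(candidate_centers)), key=lambda i: likelihoods[i])
best_center = candidate_centers[best_index]

print("best center on the grid:", best_center)
print("sample mean:", sum(heights) / len(heights))
```

For this made-up sample, the best center on the grid coincides with the sample mean, which is what we would expect for a normal curve. Plotting likelihoods against candidate_centers would give the kind of likelihood plot described above.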
How is this represented mathematically?
Mathematically, we can represent the likelihood of our data as follows, where there are n samples (X1, X2, …, Xn) that are assumed to be independent.
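The original equation is not reproduced here, so the following is a reconstruction in standard notation, assuming the samples are independent and identically distributed and using p(X;θ) from the section above. The symbol L(θ) for the likelihood is a naming assumption.

```latex
L(\theta) = p(X_1, X_2, \dots, X_n; \theta) = \prod_{i=1}^{n} p(X_i; \theta)
```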
In maximum likelihood estimation, our goal is to choose the values of the parameters that maximize the likelihood function. We choose θ to maximize the likelihood, which is represented as follows:
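In standard notation, and assuming independent samples so that the likelihood of the n samples is a product of the individual terms p(X_i;θ), the estimate can be written as below. The hat notation for the estimate is an assumption of this sketch.

```latex
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} p(X_i; \theta)
```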
Here, the argmax of a function is the value of the variable at which the function attains its maximum.
Now, as you can observe, the likelihood function is a product, and differentiating a product of many terms in order to find the best parameters is tedious.
We take the log of the function since it makes the math simpler. It converts the product to a sum, which is a property of the log. Also, since the log is a strictly increasing function (a larger input always gives a larger output), the argmax of a function is equal to the argmax of the log of the function. Thus for MLE, we write the log-likelihood function as follows.
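Written out in standard notation, again assuming independent samples, the log turns the product of the terms p(X_i;θ) into a sum. The symbol ℓ(θ) for the log-likelihood is a naming assumption.

```latex
\ell(\theta) = \log \prod_{i=1}^{n} p(X_i; \theta) = \sum_{i=1}^{n} \log p(X_i; \theta),
\qquad
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \ell(\theta)
```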
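Taking the log of the function is generally recommended for ease of calculation; it also avoids numerical underflow when many small probabilities are multiplied together. We then find the parameter values at which the log-likelihood attains its maximum, typically by setting its derivative with respect to each parameter to zero and solving. You can calculate the argmax of a function using this technique as explained on onlinemath4all.com.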
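As a small worked case, consider a normal distribution. Setting the derivatives of its log-likelihood to zero gives the well-known closed-form estimates: the sample mean for μ and the square root of the average squared deviation (dividing by n, not n - 1) for σ. The sketch below uses made-up height values and hypothetical names (log_likelihood, mu_hat, sigma_hat), and checks numerically that nearby parameter values do not give a higher log-likelihood.

```python
import math

# Hypothetical sample of student heights in cm (made-up values for illustration).
heights = [150.0, 155.0, 158.0, 160.0, 161.0, 163.0, 165.0, 168.0, 172.0, 178.0]


def log_likelihood(data, mu, sigma):
    """Log-likelihood of independent normal observations with mean mu, std sigma."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (
        -n * math.log(sigma)
        - 0.5 * n * math.log(2.0 * math.pi)
        - squared_error / (2.0 * sigma ** 2)
    )


# Closed-form maximum likelihood estimates for a normal distribution.
n = len(heights)
mu_hat = sum(heights) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in heights) / n)

# Numerical sanity check: perturbing either parameter should not help.
best = log_likelihood(heights, mu_hat, sigma_hat)
neighbours = [
    log_likelihood(heights, mu_hat + d_mu, sigma_hat + d_sigma)
    for d_mu in (-1.0, 0.0, 1.0)
    for d_sigma in (-0.5, 0.0, 0.5)
]
is_maximum = all(value <= best for value in neighbours)

print("mu_hat:", mu_hat)
print("sigma_hat:", sigma_hat)
print("no neighbour beats the closed-form estimate:", is_maximum)
```

The check over a handful of neighbours is not a proof, only a quick way to see that the closed-form values sit at a peak of the log-likelihood.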
How is likelihood different than probability?
Here is an answer on StackExchange that explains this in the best possible way:
"Probability quantifies anticipation (of outcome), likelihood quantifies trust (in model). Suppose somebody challenges us to a ‘profitable gambling game’. Then, probabilities will serve us to compute things like the expected profile of your gains and loses (mean, mode, median, variance, information ratio, value at risk, gamblers ruin, and so on). In contrast, likelihood will serve us to quantify whether we trust those probabilities in the first place; or whether we ‘smell a rat’." (Source: https://stats.stackexchange.com/q/56035)
You can also explore definitions from different angles in the same StackExchange post.
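One way to make the distinction concrete is a coin-flip sketch. This example is not from the StackExchange post: the coin, the 10 flips, the 7 observed heads, and the helper binomial_pmf are all assumptions chosen for illustration. Probability fixes the parameter and asks about outcomes; likelihood fixes the observed outcome and asks about the parameter.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


N_FLIPS = 10

# Probability: the parameter is fixed (p = 0.5) and the outcome k varies.
probabilities = [binomial_pmf(k, N_FLIPS, 0.5) for k in range(N_FLIPS + 1)]
print("sum over all outcomes:", sum(probabilities))

# Likelihood: the outcome is fixed (7 heads observed) and the parameter p varies.
OBSERVED_HEADS = 7
candidate_ps = [i / 100 for i in range(1, 100)]
likelihoods = [binomial_pmf(OBSERVED_HEADS, N_FLIPS, p) for p in candidate_ps]

best_index = max(range(len(candidate_ps)), key=lambda i: likelihoods[i])
best_p = candidate_ps[best_index]
print("most likely p on the grid:", best_p)
print("likelihood of a fair coin:", binomial_pmf(OBSERVED_HEADS, N_FLIPS, 0.5))
```

The probabilities over all outcomes add up to 1, whereas the likelihood values over the candidate parameters are not required to. Comparing the likelihood at p = 0.5 with the likelihood at the best p is one way of asking how far we should trust the claim that the coin is fair.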