Maximum likelihood estimation

Intro

One commonly used technique in network (and more broadly, statistical) analysis is maximum likelihood estimation (MLE). In this post, we will go over a basic analytical method for calculating the MLE of familiar probability distributions. This will set the stage for applying MLE methods to random graph models.

Maximum Likelihood

To understand this method, let’s begin with Bayes’ Rule. Suppose you have some dataset \(A\) and you assume it was generated from some model with parameters \(\Theta\). Then according to Bayes’ Rule, the probability of observing \(A\) given such model parameters is: $$ P\left(A|\Theta\right) = \frac{P\left(\Theta|A\right)P\left(A\right)}{P\left(\Theta\right)} \tag{1} $$ This is called the model likelihood. Often when doing these analyses, we assume that \(A\) is given, and what we are really interested in is the probability of the model parameters, \(\Theta\), given our dataset. We can obtain an expression for this by simply rearranging Equation (1): $$ P\left(\Theta|A\right) = \frac{P\left(A|\Theta\right)P\left(\Theta\right)}{P\left(A\right)} \tag{2} $$
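As a quick sanity check on Equation (2), here is a small Python sketch with a discrete toy model. All the numbers below are made up for illustration: a binary parameter \(\Theta\) and binary data \(A\).

```python
# Toy check of Bayes' rule with made-up discrete probabilities:
# a parameter Theta in {0, 1} and data A in {0, 1}.
p_theta = {0: 0.3, 1: 0.7}                  # P(Theta), the prior
p_a_given_theta = {0: {0: 0.9, 1: 0.1},     # P(A | Theta), the likelihood
                   1: {0: 0.4, 1: 0.6}}

# Marginal P(A = 1) by the law of total probability.
p_a1 = sum(p_a_given_theta[t][1] * p_theta[t] for t in p_theta)

# Posterior P(Theta = 1 | A = 1) via Equation (2).
posterior = p_a_given_theta[1][1] * p_theta[1] / p_a1
```

Here the posterior works out to \(0.42 / 0.45 \approx 0.933\): observing \(A = 1\) shifts our belief toward \(\Theta = 1\), since that parameter value makes the observation more likely.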

The quantity \(P\left(\Theta|A\right)\) is called the posterior. Recall from your introductory calculus class that to optimize a function, we take its derivative and set it equal to \(0\); MLE involves doing the exact same thing. However, the expression for \(P\left(\Theta|A\right)\) is often quite complex, and it is easier to optimize the logarithm of the function rather than the function itself. The logarithm is a monotonically increasing function, so any \(\hat{x}\) that optimizes \(\log\left(f\left(x\right)\right)\) also optimizes \(f\left(x\right)\). Therefore, optimizing the logarithm of the posterior is sufficient.
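The monotonicity argument is easy to verify numerically. Below is a short sketch using a hypothetical one-parameter function \(f(\theta) = \theta^3 e^{-2\theta}\) (chosen only because its maximizer, \(\theta = 1.5\), is easy to check by hand): a grid search finds the same maximizer for \(f\) and for \(\log f\).

```python
import math

# A hypothetical unnormalized likelihood in one parameter theta:
# f(theta) = theta^3 * exp(-2*theta), maximized at theta = 1.5.
def f(theta):
    return theta**3 * math.exp(-2 * theta)

def log_f(theta):
    # log of f, written directly: 3*log(theta) - 2*theta
    return 3 * math.log(theta) - 2 * theta

# Grid search over (0, 5]: the argmax of f and of log(f) coincide.
grid = [i / 1000 for i in range(1, 5001)]
best_f = max(grid, key=f)
best_log_f = max(grid, key=log_f)
```

Both searches land on \(\theta = 1.5\), as predicted by setting \(f'(\theta) = \theta^2 e^{-2\theta}(3 - 2\theta) = 0\).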

$$ \log\left(P\left(\Theta|A\right)\right) = \log\left(P\left(A|\Theta\right)\right) + \log\left(P\left(\Theta\right)\right) - \log\left(P\left(A\right)\right) $$

I left out one important detail, which is a discussion of the term \(P\left(\Theta\right)\), known as the prior. The prior distribution allows us to incorporate beliefs about the model parameters, which are ultimately our subject of interest. However, if we do not need to bias the estimation in any direction, it is often sufficient to assume the probability that they take on any particular value is constant. This means the prior will not contribute anything to the optimization step, and we can drop it altogether. Hence, we are only maximizing the likelihood term.

$$ \begin{align*} \log\left(P\left(\Theta|A\right)\right) &\propto \log\left(P\left(A|\Theta\right)\right) \\ \implies \frac{\partial}{\partial \Theta}\log\left(P\left(\Theta|A\right)\right) &= \frac{\partial}{\partial \Theta}\log\left(P\left(A|\Theta\right)\right) = 0 \end{align*} $$
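When the score equation \(\frac{\partial}{\partial \Theta}\log\left(P\left(A|\Theta\right)\right) = 0\) has no convenient closed form, it can be solved numerically. The sketch below illustrates this with a hypothetical example not taken from the post: data assumed to come from an exponential distribution with rate \(\lambda\), whose score is \(n/\lambda - \sum_i x_i\). Bisection on the score recovers the closed-form MLE \(\hat{\lambda} = n / \sum_i x_i\).

```python
import random

random.seed(1)

# Hypothetical setup: data assumed exponential with rate lam.
# The score (derivative of the log-likelihood) is n/lam - sum(x_i).
data = [random.expovariate(2.0) for _ in range(500)]
n, total = len(data), sum(data)

def score(lam):
    return n / lam - total

# The score is strictly decreasing in lam, so bisect on a bracketing interval.
lo, hi = 1e-6, 100.0
for _ in range(100):
    mid = (lo + hi) / 2
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid

lam_hat = (lo + hi) / 2  # numerical root of the score equation
```

The numerical root agrees with the analytical answer \(n / \sum_i x_i\) to within the bisection tolerance, which is exactly the workflow we will apply analytically in the next section.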

Example: Gaussian Distribution

Let’s try this with a concrete example: consider a dataset \(X = \{x_1, x_2, \cdots, x_n\}\), and suppose we assume it was generated from a Gaussian distribution with mean \(\mu\) and standard deviation \(\sigma\), but we are only given \(\sigma\). The question we might ask, then, is: what \(\mu\) maximizes the probability that this dataset was indeed generated from a Gaussian distribution? In this example, \(\Theta = \mu\), and each observation has density \(P\left(x_i|\Theta\right) = \frac {1}{\sigma \sqrt{2\pi }}e^{-{\frac {1}{2}}\left({\frac {x_i-\mu }{\sigma }}\right)^{2}} \), since we are assuming \(X\) was generated from a Gaussian distribution. We will assume a constant prior distribution on \(\mu\), implying that all possible means are equally likely. Altogether: $$ \begin{align*} P\left(\Theta|X\right) & \propto \prod\limits_{i = 1}^{n} \frac {1}{\sigma \sqrt{2\pi }}e^{-{\frac {1}{2}}\left({\frac {x_i-\mu }{\sigma }}\right)^{2}}\\ \implies \log\left(P\left(\Theta|X\right)\right) &= \sum\limits_{i = 1}^{n}\log\left( \frac {1}{\sigma \sqrt{2\pi }} \right) - \frac{1}{2}\left(\frac{x_i - \mu}{\sigma} \right)^{2}\\ \implies \frac{\partial}{\partial \Theta}\log\left(P\left(\Theta|X\right)\right) &= \frac{1}{\sigma^2} \sum\limits_{i = 1}^{n} \left(x_i-\mu\right) = 0 \\ \sum\limits_{i=1}^n {x_i} &= n\mu\\ \mbox{and so the } &\hat{\mu}\mbox{ that maximizes the probability is}\\ \hat{\mu}&=\boxed{\frac{1}{n}\sum\limits_{i=1}^n {x_i}} \end{align*} $$
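We can check this result numerically. The sketch below (with made-up values \(\mu = 5\), \(\sigma = 2\) used only to generate synthetic data) evaluates the Gaussian log-likelihood at the sample mean and at several shifted candidates; no shifted candidate does better, as the derivation predicts.

```python
import math
import random

random.seed(0)

# Synthetic data from a Gaussian with a known sigma (made-up values).
sigma = 2.0
data = [random.gauss(5.0, sigma) for _ in range(1000)]

def log_likelihood(mu):
    # Sum of log N(x_i | mu, sigma^2), constants included for completeness.
    return sum(-math.log(sigma * math.sqrt(2 * math.pi))
               - 0.5 * ((x - mu) / sigma) ** 2 for x in data)

mu_hat = sum(data) / len(data)          # the closed-form MLE: the sample mean
candidates = [mu_hat + d for d in (-0.5, -0.1, 0.0, 0.1, 0.5)]
best = max(candidates, key=log_likelihood)
```

Since the log-likelihood in \(\mu\) is a downward-opening parabola centered exactly at the sample mean, `best` coincides with `mu_hat`.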

It is no surprise that this is in fact the sample mean of the dataset. The most enlightening interpretation of \(\mu\) (or \(\sigma\), if we were optimizing for that parameter too) is not as the mean and standard deviation per se, but as the parameters of the distribution that best fit our dataset.

Conclusions

The idea behind maximum likelihood estimation is quite intuitive. You have a dataset and make assumptions on its generative process, usually dependent on some latent, unobserved parameters. These parameters, in principle, tell you something about the system being studied, so the goal is to estimate what parameters maximize the probability that the data can be explained by this model. It is akin to finding the line of best fit in linear regression.