Maximum a posteriori

Intro

The simplest way to think about community detection is to think of it as a method of clustering the nodes of a network based solely on its connections. This partitioning of the graph (or network) allows us to see some of the more hidden structures in the system the network is trying to represent.

[Figure: an unpartitioned graph (left) and the same graph partitioned into two communities (right)]

The left (red) graph illustrates what a network might look like before community detection, while the right shows a clean partitioning of the graph into two obvious communities.

Graph partitioning is one of the most extensively studied problems in graph theory and network scientists have many different approaches to doing community detection; we will focus on methods based on statistical inference.

Maximum A Posteriori (MAP)

The MAP method is central to statistical-inference-based community detection. To understand it, let's begin with Bayes' Rule. Suppose you have some dataset \(A\) and you assume it was generated from some model with parameters \(\Theta\). Then according to Bayes' Rule, the probability of observing \(A\) given such model parameters is: $$ P\left(A|\Theta\right) = \frac{P\left(\Theta|A\right)P\left(A\right)}{P\left(\Theta\right)} $$ With MAP we assume that \(A\) is given, and what we are really interested in is the probability of the model parameters \(\Theta\) given our dataset. We can get an expression for this by simply rearranging Bayes' Rule: $$ P\left(\Theta|A\right) = \frac{P\left(A|\Theta\right)P\left(\Theta\right)}{P\left(A\right)} \tag{1} $$

The quantity \(P\left(\Theta|A\right)\) is called the posterior, and the goal of MAP is to maximize it. Recall from introductory calculus that to optimize a function we take its derivative and set it equal to \(0\); MAP involves doing exactly that. However, the expression for \(P\left(\Theta|A\right)\) is often quite complex, and it is easier to optimize the logarithm of the function rather than the function itself. The logarithm is a monotonically increasing function, which means that any \(\hat{x}\) that optimizes \(\log\left(f\left(x\right)\right)\) also optimizes \(f\left(x\right)\), and so optimizing the logarithm of the posterior is sufficient.
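To see the monotonicity argument in action, here is a minimal sketch with a toy unnormalized posterior (the function and grid are hypothetical, chosen purely for illustration): a grid search over \(f\) and over \(\log f\) lands on the same maximizer.

```python
import math

# Toy unnormalized posterior f(x) = exp(-(x - 3)^2), maximized at x = 3.
def f(x):
    return math.exp(-(x - 3) ** 2)

# Search the same candidate points for f and for log(f).
grid = [i / 100 for i in range(0, 601)]  # 0.00, 0.01, ..., 6.00
argmax_f = max(grid, key=f)
argmax_log_f = max(grid, key=lambda x: math.log(f(x)))

print(argmax_f, argmax_log_f)  # both report 3.0
```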

$$ \log\left(P\left(\Theta|A\right)\right) = \log\left(P\left(A|\Theta\right)\right) + \log\left(P\left(\Theta\right)\right) - \log\left(P\left(A\right)\right) $$

Observe that \(P\left(A\right)\) is the probability of our dataset itself (often called the evidence); it is the same no matter what parameters or model we use, i.e. it is constant with respect to \(\Theta\) and will disappear in the optimization process, and so we need only consider: $$ \begin{align*} \log\left(P\left(\Theta|A\right)\right) &= \log\left(P\left(A|\Theta\right)\right) + \log\left(P\left(\Theta\right)\right) + \text{const.} \\ \implies \frac{\partial}{\partial \Theta}\log\left(P\left(\Theta|A\right)\right) &= \frac{\partial}{\partial \Theta}\log\left(P\left(A|\Theta\right)\right) + \frac{\partial}{\partial \Theta}\log\left(P\left(\Theta\right)\right)= 0 \end{align*} $$
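This recipe works even when the prior is not flat. As a hedged sketch (the coin-flip counts and Beta prior parameters below are invented for illustration), consider estimating a coin's heads-probability \(\theta\) with a Beta prior: maximizing the log-likelihood plus log-prior by grid search recovers the closed-form answer obtained by setting the derivative to zero.

```python
import math

# Hypothetical data: 7 heads, 3 tails, with a Beta(2, 2) prior on theta.
heads, tails = 7, 3
alpha, beta = 2.0, 2.0

def log_posterior(theta):
    # log-likelihood + log-prior; the evidence log P(A) drops out.
    return (heads * math.log(theta) + tails * math.log(1 - theta)
            + (alpha - 1) * math.log(theta) + (beta - 1) * math.log(1 - theta))

grid = [i / 1000 for i in range(1, 1000)]
theta_map = max(grid, key=log_posterior)

# Closed form from d/d(theta) log posterior = 0:
closed_form = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)
print(theta_map, closed_form)
```

Both values come out near \(2/3\): the prior pulls the MAP estimate slightly away from the raw frequency \(0.7\).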

Example: Gaussian Distribution

Let's try this with a more concrete example. Consider a dataset \(X = \{x_1, x_2, \cdots, x_n\}\) and suppose we assume it was generated from a Gaussian distribution with standard deviation \(\sigma\) and mean \(\mu\), but we are only given \(\sigma\). The question we might ask, then, is: what \(\mu\) maximizes the probability that this dataset was indeed generated from such a Gaussian distribution? In this example \(\Theta = \mu\) and, since we are assuming \(X\) was generated from a Gaussian distribution, each point contributes a likelihood \(P\left(x_i|\Theta\right) = \frac {1}{\sigma \sqrt{2\pi }}e^{-{\frac {1}{2}}\left({\frac {x_i-\mu }{\sigma }}\right)^{2}} \). Lastly we need to choose an expression for \(P\left(\Theta\right)\). \(P\left(\Theta\right)\) is known as the prior, and it encodes any prior knowledge we may have on how the parameters \(\Theta\) were chosen. If we do not know anything about how the parameters were chosen, then we may assume they are all equally probable and \(P\left(\Theta\right)\) is constant. Altogether: $$ \begin{align*} P\left(\Theta|X\right) &= \left[\prod\limits_{i = 1}^{n} \frac {1}{\sigma \sqrt{2\pi }}e^{-{\frac {1}{2}}\left({\frac {x_i-\mu }{\sigma }}\right)^{2}}\right] \cdot \frac{k}{c} \mbox{ where } k, c \mbox{ are arbitrary constants}\\ \implies \log\left(P\left(\Theta|X\right)\right) &= \sum\limits_{i = 1}^{n}\left[\log\left( \frac {1}{\sigma \sqrt{2\pi }} \right) - \frac{1}{2}\left(\frac{x_i - \mu}{\sigma} \right)^{2}\right] + \log\left(k\right) - \log\left( c \right)\\ \implies \frac{\partial}{\partial \Theta}\log\left(P\left(\Theta|X\right)\right) &= \frac{1}{\sigma^2} \sum\limits_{i = 1}^{n} \left(x_i-\mu\right) = 0 \\ \sum\limits_{i=1}^n {x_i} &= n\mu\\ \mbox{and so the } &\hat{\mu}\mbox{ that maximizes the posterior is}\\ \hat{\mu}&=\boxed{\frac{1}{n}\sum\limits_{i=1}^n {x_i}} \end{align*} $$
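The derivation above can be sanity-checked numerically. In this sketch (the true mean, \(\sigma\), sample size, and grid are all hypothetical choices), we grid-search the log-posterior over candidate values of \(\mu\) under a flat prior and confirm it peaks at the sample mean.

```python
import math
import random

# Simulate a hypothetical dataset from a Gaussian with known sigma.
random.seed(0)
sigma = 2.0
data = [random.gauss(5.0, sigma) for _ in range(200)]

def log_posterior(mu):
    # Up to additive constants: -(1/2) * sum(((x - mu) / sigma)^2)
    return -0.5 * sum(((x - mu) / sigma) ** 2 for x in data)

grid = [i / 100 for i in range(300, 701)]  # mu candidates in [3, 7]
mu_map = max(grid, key=log_posterior)
sample_mean = sum(data) / len(data)

print(mu_map, sample_mean)  # agree to within the grid spacing
```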

It is no surprise that this is in fact the sample mean of the dataset. The most enlightening interpretation of \(\mu\) (or \(\sigma\), if we were optimizing for that parameter too) is not to think of it as the mean and standard deviation, but instead as the parameters of the distribution that give us the best fit for our dataset.