The Stochastic Block Model
Intro
This and the next post will be quite heavy in derivations, so I will be saving the code for a third post. In this post (and the next) we will be focusing on a very important model of community detection, the Stochastic Block Model (SBM). Like the Chung-Lu and Erdős–Rényi models, it is a generative model; unlike them, however, it has community structure built in.
Definition
The SBM is a model that generates graphs containing clusters of vertices, generally called communities. You can think of this as partitioning the vertex set \(V\) of a network \(G = (V,E)\) into \(k\) disjoint subsets \(C_1, C_2, \cdots, C_k\). Here, disjoint is important because the canonical SBM does not allow nodes to be part of multiple communities: if a node \(i\) is in community \(C_r\), then it is not in any other community. The model produces these partitions by specifying different probabilities for edge placement depending on the group membership of the nodes.
Building the Stochastic Block Model
$$ A = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix} $$
Consider the above network and its adjacency matrix, \(A\). To derive the SBM, we are going to perform a series of operations on \(A\) based on what we are trying to achieve: a model that generates communities. Recall that the stochastic block model incorporates communities by requiring that the probability of an edge between two nodes depend on their group membership. Let's quantify this.
Suppose a network has \(k\) groups and \(n\) nodes; in our example, \(k=2\) and \(n=5\). Let \(\Omega_{rs}\) be a \(k\times k\) symmetric matrix wherein each entry defines the probability of an edge between a node in group \(r\) and a node in group \(s\). Furthermore, let \(\Gamma_{ir}\) be an \(n \times k\) matrix such that: $$ \begin{align} \Gamma_{ir} = \begin{cases} 1 & \text{if node } i \text{ in group } r\\ 0 & \text{otherwise} \end{cases} \end{align} $$
The \(\Gamma_{ir}\) matrix is our group membership matrix: it tells us whether or not a node is in a group. Moreover, if we sum down the columns, we get a \(1\times k\) matrix of group occupancies, i.e. each column contains the number of nodes in the corresponding group: \(n_r = \sum \limits_{i}^n \Gamma_{ir}\). For our example network, \(\Gamma_{ir}\) looks like: $$ \begin{align*} \Gamma_{ir} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \end{align*} $$
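To make this concrete, here is a minimal sketch (using numpy; the variable names `groups`, `Gamma`, and `n_r` are my own choices for illustration) of building the membership matrix and the group occupancies for the example network:

```python
import numpy as np

# Group assignments for the five example nodes: nodes 0-2 in group 0, nodes 3-4 in group 1.
groups = np.array([0, 0, 0, 1, 1])
n, k = len(groups), 2

# One-hot encode the assignments to get the n x k membership matrix Gamma.
Gamma = np.zeros((n, k), dtype=int)
Gamma[np.arange(n), groups] = 1

# Summing down each column gives the group occupancies n_r.
n_r = Gamma.sum(axis=0)
print(Gamma)  # [[1 0] [1 0] [1 0] [0 1] [0 1]]
print(n_r)    # [3 2]
```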
The \(\Omega_{rs}\) matrix looks like this:
$$ \Omega_{rs}= \begin{pmatrix} \Omega_{11} & \Omega_{12}\\ \Omega_{12} & \Omega_{22} \end{pmatrix} $$
Again, this matrix defines the probability of an edge between any two nodes based on their group assignments. The stochastic block model can generate networks with community structure when the diagonal terms in the matrix are greater than the off-diagonal terms; that is, when in-group probabilities are greater than between-group probabilities. This gives rise to community structure by ensuring that the density of edges within groups is, on average, greater than between groups.
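For example, with two groups, a choice like $$ \Omega = \begin{pmatrix} 0.8 & 0.1 \\ 0.1 & 0.7 \end{pmatrix} $$ (the numbers here are purely illustrative) places an edge between two nodes of the same group with probability \(0.8\) or \(0.7\), but between nodes of different groups with probability only \(0.1\), so the generated networks will, on average, contain two dense clusters with relatively few edges between them.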
Let’s now define a new matrix by transforming the adjacency matrix: \(m_{rs} = \sum\limits_{ij}\Gamma_{ir}A_{ij}\Gamma_{js}\). The matrix \(m_{rs}\) has the following structure: $$ \begin{align} m_{rs} = \begin{cases} \text{number of edges between groups } r \text{ and } s & r \neq s\\ 2 \times \text{number of edges within group } r & r=s \end{cases} \end{align} $$
This matrix tells us how many edges are within each group and how many edges run between groups. I call it the edge-adjacency matrix; it essentially gives us an edge perspective of our network, as opposed to the node perspective that the adjacency matrix gives. For our example graph, we can calculate \(m_{rs}\) as follows: $$ \begin{align*} m_{rs}=\Gamma^{\text{T}}A\Gamma = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \\ \end{pmatrix} \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 6 & 1\\ 1 & 2 \end{pmatrix} \end{align*} $$
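If you would like to check this yourself, a quick numpy sketch (variable names are my own) reproduces the calculation:

```python
import numpy as np

# Adjacency matrix of the example network.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

# Membership matrix: nodes 0-2 in group 0, nodes 3-4 in group 1.
Gamma = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])

# Edge-adjacency matrix: between-group edges off the diagonal, twice the within-group edges on it.
m = Gamma.T @ A @ Gamma
print(m)  # [[6 1] [1 2]]
```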
As expected, the diagonal entries are twice the number of edges within each group, and the off-diagonal entries are the number of edges going from group \(1\) to group \(2\) and vice versa.
Let’s now make another transformation, but this time on \(\Omega_{rs}\). Define the matrix \(\Lambda_{ij} = \sum\limits_{rs}\Gamma_{ir}\Omega_{rs}\Gamma_{js}\). To see what this is exactly, let's calculate it explicitly for our example: $$ \begin{align*} \Lambda_{ij}=\Gamma\Omega\Gamma^{\text{T}} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \Omega_{11} & \Omega_{12}\\ \Omega_{12} & \Omega_{22} \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \\ \end{pmatrix} = \begin{pmatrix} \color{blue}{\Omega_{11}} & \color{blue}{\Omega_{11}} & \color{blue}{\Omega_{11}} & \Omega_{12} & \Omega_{12} \\ \color{blue}{\Omega_{11}} & \color{blue}{\Omega_{11}} & \color{blue}{\Omega_{11}} & \Omega_{12} & \Omega_{12} \\ \color{blue}{\Omega_{11}} & \color{blue}{\Omega_{11}} & \color{blue}{\Omega_{11}} & \Omega_{12} & \Omega_{12} \\ \Omega_{12} & \Omega_{12} & \Omega_{12} & \color{red}{\Omega_{22}} & \color{red}{\Omega_{22}} \\ \Omega_{12} & \Omega_{12} & \Omega_{12} & \color{red}{\Omega_{22}} & \color{red}{\Omega_{22}} \end{pmatrix} \end{align*} $$
Recall that the elements of \(\Omega_{rs}\) are the edge probabilities of the model. So a closer look at the resulting matrix above tells us that \(\Lambda_{ij}\) is the edge probability corresponding to each individual pair of nodes. Notice the blocks of equal probabilities: nodes in the same group have one probability of having an edge placed between them, and nodes from different groups have another.
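The same block structure is easy to reproduce numerically; here is a short sketch (the \(\Omega\) values are assumed purely for illustration):

```python
import numpy as np

Gamma = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])

# Illustrative edge probabilities: denser within groups than between them.
Omega = np.array([[0.8, 0.1],
                  [0.1, 0.7]])

# Per-pair edge probabilities; the blocks mirror the group assignments.
Lambda = Gamma @ Omega @ Gamma.T
print(Lambda)
# Rows/columns 0-2 form a block of 0.8, rows/columns 3-4 a block of 0.7,
# and the off-diagonal blocks are 0.1.
```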
Observe also that $$ n_rn_s\Omega_{rs} = \sum\limits_{ij}\Gamma_{ir}\Lambda_{ij}\Gamma_{js} $$
where \(n_r\) is the number of nodes in group \(r\). This is a useful relation that we will make use of later; I implore you to work the calculation out for yourself.
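If you would rather let the computer do the calculation, here is a small numerical check of the relation (again with illustrative \(\Omega\) values):

```python
import numpy as np

Gamma = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])
Omega = np.array([[0.8, 0.1], [0.1, 0.7]])
Lambda = Gamma @ Omega @ Gamma.T

n_r = Gamma.sum(axis=0)           # group sizes [3, 2]
lhs = np.outer(n_r, n_r) * Omega  # n_r * n_s * Omega_rs
rhs = Gamma.T @ Lambda @ Gamma    # the transformed Lambda
print(np.allclose(lhs, rhs))      # True
```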
Stochastic Block Model Equation
Armed with this notation, we can now write down the stochastic block model quantitatively. The probability of observing a network \(G\) with adjacency matrix \(A\), conditional on group assignments \(\Gamma\) and edge probabilities \(\Omega\), is given by $$ \begin{align} P\left(A| \Omega, \Gamma\right) = \prod\limits_{i<j}^{n}\left(\Lambda_{ij}\right)^{A_{ij}}\left(1-\Lambda_{ij}\right)^{1-A_{ij}} \end{align} $$ This is not magic: we are just looping through all the node pairs and flipping a Bernoulli coin to determine edge placement. It's the Erdős–Rényi model, except that the probability is allowed to change with every node pair.
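To see the generative process in action, here is a minimal sketch of sampling a network from this Bernoulli model (the function name `sample_sbm` and the \(\Omega\) values are my own illustrative choices):

```python
import numpy as np

def sample_sbm(Gamma, Omega, rng):
    """Draw a simple undirected graph: one Bernoulli(Lambda_ij) flip per node pair i < j."""
    Lambda = Gamma @ Omega @ Gamma.T
    n = Lambda.shape[0]
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < Lambda[i, j]:
                A[i, j] = A[j, i] = 1
    return A

rng = np.random.default_rng(0)
Gamma = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])
Omega = np.array([[0.8, 0.1], [0.1, 0.7]])
print(sample_sbm(Gamma, Omega, rng))
```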
As we did with the Chung-Lu model, we take the limit of Poisson-distributed edges with expected value \(\Lambda_{ij}\) to make the math easier. This allows us to write $$ \begin{align} P\left(A| \Omega, \Gamma\right) = \prod\limits_{i<j}^{n}\frac{\left(\Lambda_{ij}\right)^{A_{ij}}}{A_{ij}!}e^{-\Lambda_{ij}} \times \prod\limits_{i}^{n}\frac{\left(\frac{1}{2}\Lambda_{ii}\right)^{\frac{1}{2}A_{ii}}}{\left(\frac{1}{2}A_{ii}\right)!}e^{-\frac{1}{2}\Lambda_{ii}} \end{align} $$
In this limit, we allow both self-edges and multi-edges. We therefore require that the expected value of a self-edge be \(\frac{1}{2}\Lambda_{ii}\), because of the convention that a self-edge contributes \(2\) to the corresponding diagonal entry of the adjacency matrix; this is precisely what the second product accounts for.
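A sketch of this Poisson variant, under the same assumptions as the Bernoulli sampler above, only changes the sampling step: each pair receives a Poisson number of edges, and self-edges are drawn with mean \(\frac{1}{2}\Lambda_{ii}\), each contributing \(2\) to the diagonal of \(A\):

```python
import numpy as np

def sample_poisson_sbm(Gamma, Omega, rng):
    """Poisson variant: multi-edges allowed, self-edges have mean Lambda_ii / 2."""
    Lambda = Gamma @ Omega @ Gamma.T
    n = Lambda.shape[0]
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        # A_ii counts each self-edge twice by convention.
        A[i, i] = 2 * rng.poisson(Lambda[i, i] / 2)
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = rng.poisson(Lambda[i, j])
    return A
```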
Stochastic Block Model MAP
Now that we have our SBM, we can do a maximum a posteriori estimate, just as we've done with the other random graph models discussed. By Bayes' theorem we can express the probability of observing \(\Omega\) and \(\Gamma\) as $$ \begin{align*} P\left(\Omega, \Gamma| A\right) = \frac{P(A| \Omega, \Gamma)P(\Omega,\Gamma)}{P(A)} \end{align*} $$ The probability \(P(A)\) is a constant with respect to \(\Omega\) and \(\Gamma\), since the network has already been observed. Additionally, if we assume uniform priors on \(\Omega\) and \(\Gamma\), then these terms disappear from the optimization problem, so we can simply consider the relation $$ \begin{align*} P\left(\Omega, \Gamma| A\right) \propto P\left(A| \Omega, \Gamma\right) \end{align*} $$ We can simplify this optimization problem by optimizing the log of the likelihood instead: $$ \begin{align*} \log P(\Omega, \Gamma| A) & \propto \log\left[\prod\limits_{i<j}^{n}\frac{\left(\Lambda_{ij}\right)^{A_{ij}}}{A_{ij}!}e^{-\Lambda_{ij}} \times \prod\limits_{i}^{n}\frac{\left(\frac{1}{2}\Lambda_{ii}\right)^{\frac{1}{2}A_{ii}}}{\left(\frac{1}{2}A_{ii}\right)!}e^{-\frac{1}{2}\Lambda_{ii}}\right]\\ &= \sum\limits_{i<j}^n \left[A_{ij}\log\Lambda_{ij} - \log A_{ij}! -\Lambda_{ij}\right] +\frac{1}{2}\sum\limits_{i}^n\left[ A_{ii}\log\Lambda_{ii} + A_{ii}\log\frac{1}{2} - 2\log\left(\left(\tfrac{1}{2}A_{ii}\right)!\right) -\Lambda_{ii}\right] \end{align*} $$
In the first sum, we can sum over all \(i, j\) and account for double counting with a factor of \(\frac{1}{2}\); the terms that do not depend on \(\Lambda\) are absorbed into a constant: $$ \begin{align*} \mathscr{L} = \frac{1}{2}\sum\limits_{i\neq j}^n \left[A_{ij}\log\Lambda_{ij} -\Lambda_{ij}\right] + \frac{1}{2}\sum\limits_{i}^n \left[A_{ii}\log\Lambda_{ii} -\Lambda_{ii}\right] + \text{const} \end{align*} $$ After combining terms and dropping the constant, we arrive at $$ \begin{align} \mathscr{L} = \frac{1}{2}\sum\limits_{ij}^n \left[A_{ij}\log\Lambda_{ij} -\Lambda_{ij}\right] \end{align} $$
This equation should look familiar: it has the exact same form as the equation we optimized for the Chung-Lu model. So we can reuse the same machinery to optimize for a particular \(\Omega_{rs}\) (recall that the entries of \(\Lambda\) are the \(\Omega_{rs}\)).
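For completeness, here is a sketch of evaluating this log-likelihood for a given assignment and \(\Omega\) (assuming all entries of \(\Omega\) are strictly positive so the logarithm is defined):

```python
import numpy as np

def log_likelihood(A, Gamma, Omega):
    """(1/2) * sum_ij (A_ij * log(Lambda_ij) - Lambda_ij), with constant terms dropped."""
    Lambda = Gamma @ Omega @ Gamma.T
    return 0.5 * np.sum(A * np.log(Lambda) - Lambda)
```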
Choosing a particular \(\Lambda_{lk}\)
$$ \begin{align*} \frac{\partial\mathscr{L}}{\partial\Lambda_{lk}} &= \frac{1}{2}\left(\frac{A_{lk}}{\Lambda_{lk}} - 1\right) = 0\\ \implies A_{lk} &= \Lambda_{lk} \end{align*} $$ Now transforming this relation to the edge/group view, $$ \begin{align*} \sum\limits_{lk}\Gamma_{lr}A_{lk}\Gamma_{ks} &= \sum\limits_{lk}\Gamma_{lr}\Lambda_{lk}\Gamma_{ks}\\ m_{rs} &= n_rn_s\Omega_{rs} \end{align*} $$ So we have that the optimal probabilities are given by $$ \begin{align} \hat{\Omega}_{rs} = \frac{m_{rs}}{n_rn_s} \end{align} $$
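For the example network and group assignment from earlier, a quick sketch gives the fitted probabilities directly:

```python
import numpy as np

A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
Gamma = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])

m = Gamma.T @ A @ Gamma            # edge counts between/within groups
n_r = Gamma.sum(axis=0)            # group sizes
Omega_hat = m / np.outer(n_r, n_r)
print(Omega_hat)
# approximately [[0.667 0.167]
#                [0.167 0.5  ]]
```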
Subbing \(\hat{\Omega}_{rs}\) into the log-likelihood, we arrive at
$$ \begin{align} \mathscr{L} = \frac{1}{2}\sum\limits_{rs}m_{rs}\log\left(\frac{m_{rs}}{n_rn_s}\right) \end{align} $$ where I have dropped the constant term and used the fact that $$ \begin{align*} \sum\limits_{ij}A_{ij}\log\Lambda_{ij} = \sum\limits_{rs}m_{rs}\log\Omega_{rs} \end{align*} $$ This expression for the log-likelihood is known as the profile likelihood.
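A short sketch of evaluating the profile likelihood for a given assignment (treating \(0\log 0\) as \(0\) so empty blocks do not contribute):

```python
import numpy as np

def profile_log_likelihood(A, Gamma):
    """(1/2) * sum_rs m_rs * log(m_rs / (n_r * n_s)), skipping empty blocks."""
    m = Gamma.T @ A @ Gamma
    n_r = Gamma.sum(axis=0)
    nn = np.outer(n_r, n_r)
    mask = m > 0
    return 0.5 * np.sum(m[mask] * np.log(m[mask] / nn[mask]))
```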
Conclusion
This derivation has been quite involved, so let's take a moment to reflect on what we've just accomplished. The goal of the MAP for this problem is to find the edge probabilities and the group assignments that maximize the probability that our observed network was generated from a stochastic block model. The \(\hat{\Omega}\) that we arrived at tells us that the edge probabilities that best fit the model can be re-expressed in terms of quantities we can calculate using our group assignment matrix (I showed previously how to calculate \(m_{rs}\) and \(n_r\)). Now all that is left is to maximize the profile likelihood; this, however, is no easy task. Doing so effectively finds the most likely group assignments of the nodes. This is a discrete problem and must be solved computationally, which I will save for a future post.