Stochastic block model

This model tends to produce graphs containing communities, subsets of nodes characterized by being connected with one another with particular edge densities.

Its mathematical formulation was first introduced in 1983 in the field of social network analysis by Paul W. Holland et al.[1] The stochastic block model is important in statistics, machine learning, and network science, where it serves as a useful benchmark for the task of recovering community structure in graph data.

The stochastic block model takes the following parameters: The edge set is then sampled at random as follows: any two vertices

vertices, where the edges are sampled as described, recover the groups

This case is degenerate—the partition into communities becomes irrelevant—but it illustrates a close relationship to the Erdős–Rényi model.

The planted partition model is the special case that the values of the probability matrix

, while two vertices in different communities share an edge with probability

: each diagonal entry is only required to dominate the rest of its own row and column.

[2] Disassortative forms of this terminology exist, by reversing all inequalities.

For some algorithms, recovery might be easier for block models with assortative or disassortative conditions of this form.

[2] Much of the literature on algorithmic community detection addresses three statistical tasks: detection, partial recovery, and exact recovery.

The algorithmic task is to correctly identify which of these two underlying models generated the graph.

of the graph to grow, keeping the community sizes in fixed proportions.

If the probability matrix remains fixed, tasks such as partial and exact recovery become feasible for all non-degenerate parameter settings.

increases, we observe a sharp phase transition: for certain settings of the parameters, it will become possible to achieve recovery with probability tending to 1, whereas on the opposite side of the parameter threshold, the probability of recovery tends to 0 no matter what algorithm is used.

, resulting in graphs of constant average degree.

In the case of two equal-sized communities, in the assortative planted partition model with probability matrix

, resulting in graphs of logarithmic average degree.

Here a similar threshold exists: for the assortative planted partition model with

In fact, the exact recovery threshold is known for the fully general stochastic block model.

[5] In principle, exact recovery can be solved in its feasible range using maximum likelihood, but this amounts to solving a constrained or regularized cut problem such as minimum bisection that is typically NP-complete.

Hence, no known efficient algorithms will correctly compute the maximum-likelihood estimate in the worst case.

However, a wide variety of algorithms perform well in the average case, and many high-probability performance guarantees have been proven for algorithms in both the partial and exact recovery settings.

Successful algorithms include spectral clustering of the vertices,[9][4][5][10] semidefinite programming,[2][8] forms of belief propagation,[7][11] and community detection[12] among others.

One minor tweak allocates vertices to communities randomly, according to a categorical distribution, rather than in a fixed partition.

Signed graphs allow for both favorable and adverse relationships and serve as a common model choice for various data analysis applications, e.g., correlation clustering.

The stochastic block model can be trivially extended to signed graphs by assigning both positive and negative edge weights or equivalently using a difference of adjacency matrices of two stochastic block models.

[18] GraphChallenge[19] encourages community approaches to developing new solutions for analyzing graphs and sparse data derived from social media, sensor feeds, and scientific data to enable relationships between events to be discovered as they unfold in the field.

Streaming stochastic block partition is one of the challenges since 2017.

[20] Spectral clustering has demonstrated outstanding performance compared to the original and even improved[21] base algorithm, matching its quality of clusters while being multiple orders of magnitude faster.