Abstract

Community detection is an important task in network analysis, in which we aim to learn a network partition that groups together vertices with similar community-level connectivity patterns. By finding such groups of vertices with similar structural roles, we extract a compact representation of the network's large-scale structure, which can facilitate its scientific interpretation and the prediction of unknown or future interactions. Popular approaches, including the stochastic block model (SBM), assume edges are unweighted, which limits their utility by discarding potentially useful information. We introduce the weighted stochastic block model (WSBM), which generalizes the SBM to networks with edge weights drawn from any exponential family distribution. This model learns from both the presence and weight of edges, allowing it to discover structure that would otherwise be hidden when weights are discarded or thresholded. We describe a Bayesian variational algorithm for efficiently approximating this model's posterior distribution over latent block structures. We then evaluate the WSBM's performance on both edge-existence and edge-weight prediction tasks for a set of real-world weighted networks. In all cases, the WSBM performs as well as or better than the best alternatives on these tasks.

1. Introduction

Networks are an increasingly important form of structured data consisting of interactions between pairs of individuals in complex social, biological or other systems. Unlike attribute data where each observation is associated with an individual, network data are represented by graphs, where individuals are vertices and interactions are edges. Because vertices are pairwise related, network data violate traditional assumptions of attribute data, such as independence. This intrinsic difference in structure prompts the development of new tools for handling network data.

In many networks, vertices play distinct structural roles in generating the network's large-scale structure. To identify these latent roles, we seek a network partition that groups together vertices with similar group-level connectivity patterns. We call these groups ‘communities’, and their inference produces a compact description of the large-scale structure of a network. (We note that this definition of a ‘community’ is more general than the assortative-only definition that is commonly used.) This compact large-scale description itself has many potential uses, including dividing a large heterogeneous system into several smaller and more homogeneous parts that may be studied semi-independently, and predicting unknown or future patterns of interactions. By grouping vertices according to these roles, community detection in networks is similar to clustering in vector spaces, and many approaches have been proposed [1].

The stochastic block model (SBM) [2,3] is a popular generative model for learning community structure in unweighted networks. In its classic form, the SBM is a probabilistic model of pairwise interactions among $n$ vertices. Each vertex $i$ belongs to one of $K$ latent groups or ‘blocks’ denoted by $z_i$, and each edge $A_{ij}$ exists with a probability $\theta_{z_i z_j}$ that depends only on the group memberships of the connecting vertices. Vertices in the same block are stochastically equivalent, indicating their equivalent roles in generating the network's structure. The SBM is fully specified by a vector $z$ denoting the group membership of each vertex and a $K \times K$ matrix $\theta$ of edge bundle probabilities, where $\theta_{k,k'}$ gives the probability that a vertex in group $k$ connects to a vertex in group $k'$.
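
To make this generative process concrete, the following minimal Python sketch draws an unweighted network from an SBM; the group count, network size and the assortative choice of $\theta$ are illustrative assumptions, not values used elsewhere in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, K = 60, 3                      # illustrative sizes: 60 vertices, 3 groups
z = rng.integers(K, size=n)       # latent group membership z_i of each vertex
theta = np.full((K, K), 0.05)     # edge-bundle probabilities theta_{k,k'}
np.fill_diagonal(theta, 0.40)     # assortative choice: denser within groups

# Each (directed) edge A_ij is an independent Bernoulli draw with success
# probability theta_{z_i z_j}; theta[z][:, z] is the n-by-n matrix of these.
A = (rng.random((n, n)) < theta[z][:, z]).astype(int)
```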

The SBM is popular in part because it can generate a wide variety of large-scale patterns of network connectivity depending on the choice of $\theta$ (Fig. 1(a–d)). For example, if the diagonal elements of $\theta$ are greater than its off-diagonal elements, the block structure is assortative, with communities exhibiting greater edge densities within than between them (Fig. 1(a)), a common pattern in social networks [4]. Reversing the pattern in $\theta$ generates disassortative structure (Fig. 1(b)), which is often found in language and ecological networks [5]. Other choices of $\theta$ can generate hierarchical, multi-partite or core-periphery patterns [6,7], and the SBM has also been generalized for count-valued data, degree correction [8], bipartite structure [9] and categorical values [10].

Fig. 1. Examples of structure that can be learned using the SBM. The first row shows the abstract connections between four groups. The second row shows the ‘block’ structure found in the adjacency matrix after sorting by group membership; black corresponds to edges and white corresponds to non-edges. (a) Assortative structure: edges mainly exist within groups. (b) Disassortative structure: edges mainly exist between distinct groups. (c) Core-periphery structure: the ‘core’ connects mainly with itself and the ‘periphery’, whereas the ‘periphery’ mainly connects with the ‘core’. (d) Ordered structure: top-left connects to top-right, top-right connects to bottom-left, bottom-left connects to bottom-right.

In addition to this flexibility, the SBM's probabilistic structure provides a principled approach to quantifying uncertainty of group membership, an attractive feature in unsupervised network analysis. This structure has led to theoretical guarantees, including consistency of the SBM estimators [11] and the identifiability and consistency of latent block models [12,13].

However, each of these models assumes an unweighted network, where edge presence or absence is represented as a binary variable (or perhaps a count-valued variable), while most real-world networks have weights, e.g. interaction frequency, volume or character. Such information is typically discarded via thresholding before analysis, which can obscure or distort latent structure [14]. To illustrate this loss of information from thresholding, consider a toy network of four equally sized groups labeled 1–4 (see Fig. 2), where each edge $(i,j)$ is assigned a weight equal to the smaller of the endpoints' group labels, plus a small amount of noise. Edges between groups are thus assigned weights near 1, 2 or 3, whereas those within a group are assigned weights near 1, 2, 3 or 4. This model is obviously unrealistic but serves to illustrate the common consequences of applying a global threshold to edge-weighted networks.

Fig. 2. (a) An example of a weighted network where thresholding will never succeed. (b) A plot of the normalized mutual information (NMI) between the true community structure and inferred SBM community structure after thresholding at various threshold values (averaged over 100 trials). Examples of community structure found by thresholding are shown above the graph (different shades of gray). As the NMI is $<1$ for all threshold values, the SBM after thresholding never infers the true community structure shown in (a).

To apply the SBM to this simple network, we must convert it into an unweighted network by discarding edges with weights less than some threshold. To illustrate the results of this action, we consider all possible thresholds and compute the average NMI between the best community structure found using the SBM and the true structure (Fig. 2). No matter what threshold we choose, edges are divided into at most three groups: those with weight above, at or below the threshold. The SBM can thus recover a maximum of three groups, rather than the four planted in this network, and the threshold determines which three groups it finds. No threshold yields the correct inference here, because thresholding discards edge-weight information.
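
The collapse is easy to see at the level of the group labels themselves. The short sketch below assumes the min-label weight rule described above (noise omitted for clarity): the four planted weight levels are all present in the block-level weights, but any single threshold reduces them to a binary pattern that cannot separate all four groups.

```python
import numpy as np

labels = np.arange(1, 5)            # the four planted groups, labeled 1-4

# Block-level weights under the toy rule: w(g, g') = min(g, g').
W = np.minimum.outer(labels, labels)
print(W)                            # four distinct weight levels: 1, 2, 3, 4

t = 2.5                             # any single threshold...
print((W > t).astype(int))          # ...collapses them to a binary pattern,
                                    # merging the groups on each side of t
```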

Instead of thresholding, we could use more complex methods, such as using multiple thresholds or a binning scheme, to convert a weighted network into an unweighted or count-valued network of some sort. These methods would perform better than applying a single threshold, at the cost of additional complexity in specifying multiple threshold or bin values. Regardless of the method, these approaches will still discard potentially useful edge-weight information. To exploit the maximal amount of information in the original data in recovering the true hidden structure, we should prefer to model the edge weights directly.

In this paper, we introduce the weighted stochastic block model (WSBM), a generalization of the SBM that learns from both the presence and the weight of edges, providing a natural solution to the problem described above. Specifically, the WSBM models each weighted edge $A_{ij}$ as a draw from a parametric exponential family distribution, whose parameters depend only on the group memberships of the connecting vertices $i$ and $j$. It includes as special cases most standard distributional forms, e.g. the normal, the exponential and their generalizations, and enables the direct use of weighted edges in recovering latent group or block structure. This paper generalizes and extends our previous work [15].

We first describe the form of the WSBM, which combines edge-existence and -weight information. We then derive a variational Bayes algorithm for efficiently learning WSBM parameters from data. Applying this algorithm to a small real-world weighted network, we show that the SBM and WSBM can learn distinct latent structures as a result of observing or ignoring edge weights. Finally, we compare the performance of the WSBM to alternative methods for two edge prediction tasks, using a set of real-world networks. In all cases, the WSBM performs as well as alternatives on edge-existence prediction, and outperforms all alternatives on edge-weight prediction. This model thus enables the discovery of latent group structures in a wider range of networks than was previously possible.

2. Weighted stochastic block model

We begin by reviewing the SBM and exponential families, and then describe a natural generalization of the SBM to weighted networks. In what follows, we consider the general case of directed graphs; undirected graphs are a special case of this model.

In the SBM, the network's adjacency matrix $A$ contains binary values representing edge existence, i.e. $A_{ij} \in \{0,1\}$, the integer $K$ denotes a fixed number of latent groups, and the vector $z$ contains the group label of each vertex, $z_i \in \{1,\ldots,K\}$. The number of latent groups $K$ controls the model's complexity and may be chosen in a variety of ways; we defer a discussion of this matter until Section 3.3. Each possible group assignment vector $z$ represents a different partition of the vertices into $K$ groups, and each pair of groups $(k,k')$ defines a ‘bundle’ of edges that run between them. The SBM assigns an edge-existence parameter $\theta_{k k'}$ to each edge bundle, which we represent collectively by the $K \times K$ matrix $\theta$. The existence probability of an edge $A_{ij}$ is given by the parameter $\theta_{z_i z_j}$, which depends only on the group memberships of vertices $i$ and $j$.

Assuming that each edge-existence $A_{ij}$ is conditionally independent given $z$ and $\theta$, the SBM's likelihood function is
$$\Pr(A \mid z,\theta) = \prod_{ij} \theta_{z_i z_j}^{A_{ij}} \left(1 - \theta_{z_i z_j}\right)^{1-A_{ij}}, \tag{2.1}$$
which we may rewrite as
$$\Pr(A \mid z,\theta) = \prod_{ij} \exp\left( A_{ij} \log\frac{\theta_{z_i z_j}}{1-\theta_{z_i z_j}} + \log\left(1-\theta_{z_i z_j}\right) \right).$$
Thus, the likelihood has the form of an exponential family
$$\Pr(A \mid z,\theta) = \prod_{ij} \exp\left( T(A_{ij}) \cdot \eta(\theta_{z_i z_j}) \right), \tag{2.2}$$
where $T(x) = (x,1)$ is the vector-valued function of sufficient statistics of the Bernoulli random variable and $\eta(x) = (\log[x/(1-x)], \log[1-x])$ is the vector-valued function of natural parameters. Appendix B provides further details about exponential families.
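
The equivalence between the Bernoulli likelihood and this $(T,\eta)$ pair can be checked numerically; the following sketch simply verifies that $\exp(T(x) \cdot \eta(\theta))$ reproduces the Bernoulli probability mass function.

```python
import numpy as np

def T(x):                 # sufficient statistics of the Bernoulli variable
    return np.array([x, 1.0])

def eta(p):               # natural parameters for an edge probability p
    return np.array([np.log(p / (1 - p)), np.log(1 - p)])

p = 0.3
for x in (0, 1):
    assert np.isclose(np.exp(T(x) @ eta(p)),      # exp(T(x) . eta(p))
                      p**x * (1 - p)**(1 - x))    # Bernoulli pmf
```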

This choice of functions $(T,\eta)$ produces binary-valued edge weights. By choosing an appropriate but different pair of functions $(T,\eta)$, defined on some domain $\mathcal{X}$ and $\Theta$, respectively, we may specify an SBM whose weights are drawn from an exponential family distribution over $\mathcal{X}$. As in the SBM, this WSBM is defined by a vector $z$ and a matrix $\theta$, but now each $\theta_{z_i z_j}$ specifies the parameters governing the weight distribution of the $(z_i z_j)$ edge bundle. Figure 3 visualizes the dependencies in the WSBM's likelihood function as a graphical model.

Fig. 3. Graphical model for the WSBM. Each weighted edge $A_{ij}$ (plate) is distributed according to the appropriate edge parameter $\theta_{z_i,z_j}$ for each observed interaction $(i,j)$. In our variational Bayes inference scheme, the WSBM's latent parameters $z,\theta$ are themselves modeled as random variables distributed according to $\mu,\tau$, respectively. We highlight that the arrow from $z$ to $\theta_{z_i,z_j}$ hides the complex relational structure between each $z_i$.

The generative process of creating a weighted network from the WSBM consists of the following steps.

  • For each vertex $i$, assign a group membership $z_i$.

  • For each pair of groups $(k,k')$, assign an edge-bundle parameter $\theta_{k k'} \in \Theta$.

  • For each edge $(i,j)$, draw $A_{ij} \in \mathcal{X}$ from the exponential family $(T,\eta)$ parametrized by $\theta_{z_i z_j}$.

The community structure of the WSBM retains the stochastic equivalence principle of the classic SBM, in which all vertices in a group maintain the same probabilistic connectivity to the rest of the network.

For example, if the edge weights are real-valued, $\mathcal{X} = \mathbb{R}$, then we may choose to model the edge weights with the normal distribution, which has sufficient statistics $T = (x, x^2, 1)$ and natural parameters $\eta = (\mu/\sigma^2,\, -1/(2\sigma^2),\, -\mu^2/(2\sigma^2))$. Instead of edge-existence probabilities, each edge bundle $(z_i z_j)$ is now parameterized by a mean and variance, $\theta_{z_i z_j} = (\mu_{z_i z_j}, \sigma^2_{z_i z_j})$. In this case, the likelihood function would be
$$\Pr(A \mid z,\theta) \propto \prod_{ij} \exp\left( A_{ij}\,\frac{\mu_{z_i z_j}}{\sigma^2_{z_i z_j}} - A_{ij}^2\,\frac{1}{2\sigma^2_{z_i z_j}} - \frac{\mu^2_{z_i z_j}}{2\sigma^2_{z_i z_j}} \right). \tag{2.3}$$
That is, this particular WSBM uses a normal distribution instead of a Bernoulli distribution to model the values observed in an edge bundle. We emphasize that the choice of the normal distribution is merely illustrative.
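
For instance, the three generative steps above can be written out directly for this normal case; the sketch below uses purely illustrative parameter values and draws a complete matrix of weights.

```python
import numpy as np

rng = np.random.default_rng(0)

n, K = 40, 2
z = rng.integers(K, size=n)              # step 1: assign group memberships

mu = np.array([[ 1.0, -1.0],             # step 2: bundle parameters
               [-1.0,  1.0]])            #   theta_{k,k'} = (mu_{k,k'}, sigma2_{k,k'})
sigma2 = np.full((K, K), 0.25)

# Step 3: draw each weight A_ij from Normal(mu_{z_i z_j}, sigma2_{z_i z_j}).
A = rng.normal(mu[z][:, z], np.sqrt(sigma2[z][:, z]))
```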

This construction produces complete graphs, in which every pair of vertices is connected by an edge with some real-valued weight. For a complete network, this formulation may be entirely sufficient. However, most real-world networks are sparse, with only $O(n)$ pairs having a connection that may have a weight, and a dense model like this one cannot be applied directly. We now describe how sparsity can be naturally incorporated within our model, which also produces more scalable inference algorithms.

2.1 Sparse weighted graphs

A key insight for modeling edge-weighted sparse networks lies in clarifying the meaning of zeros in a weighted adjacency matrix. Typically, a value $A_{ij}=0$ may represent one of three things: (i) the absence of an edge, (ii) an edge that exists but has weight zero or (iii) missing data, i.e. an unobserved interaction. In the former two cases, we do in fact observe the interaction, whereas in the latter, we do not. Among observed interactions, we call an observed non-interaction a ‘non-edge’, and we let $A_{ij}=0$ denote the presence of an edge with weight zero. In many empirical networks, distinct types of interactions may have been confounded, e.g. non-edges, edges with zero weight and unobserved interactions may all be assigned a value $A_{ij}=0$. However, for accurate inference, this distinction can be important. For example, a non-edge may indicate an interaction that is impossible to measure, which is distinct from choosing not to measure the interaction (an unobserved interaction) or an interaction with weight zero.

Here, we assume that these three types of interactions are distinguished in our input data. This creates two types of information: information from edge existence (non-edges versus weighted edges) and information from edge weight (the weighted values). To handle these two types of information, the WSBM models an edge's existence as a Bernoulli or binary random variable, as in the SBM, and models an edge's weight using an exponential family distribution. Terms corresponding to unobserved interactions contribute no information to inference and are dropped from the likelihood function. If the pair $(T_{\mathrm{e}},\eta_{\mathrm{e}})$ denotes the family of edge-existence distributions and the pair $(T_{\mathrm{w}},\eta_{\mathrm{w}})$ denotes the family of edge-weight distributions, then we may combine their contributions in the likelihood function via a simple tuning parameter $\alpha \in [0,1]$ that determines their relative importance in inference:
$$\log\Pr(A \mid z,\theta) = \alpha \sum_{(i,j) \in E} T_{\mathrm{e}}(A_{ij}) \cdot \eta_{\mathrm{e}}\left(\theta_{z_i z_j}\right) + (1-\alpha) \sum_{(i,j) \in W} T_{\mathrm{w}}(A_{ij}) \cdot \eta_{\mathrm{w}}\left(\theta_{z_i z_j}\right), \tag{2.4}$$
where $E$ is the set of observed interactions (including non-edges) and $W$ is the set of weighted edges ($W \subset E$). This generalization can be reduced to the compact form of Equation (2.2) by combining the vectors $\alpha T_{\mathrm{e}}$ with $(1-\alpha) T_{\mathrm{w}}$ and $\eta_{\mathrm{e}}$ with $\eta_{\mathrm{w}}$.

By tuning $\alpha$, we can learn different latent structures. When $\alpha = 1$, the model ignores edge-weight information and reduces to the SBM. When $\alpha = 0$, the model treats edge absence as if it were unobserved, and fits only to the weight information. When $0 < \alpha < 1$, the likelihood combines information from both edge existence and edge weight. In principle, the best choice of $\alpha$ could also be learned, but we leave this subtle problem for future work. In practice, we often find that $\alpha = 1/2$, giving equal weight to both types of information, works well.
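
As a sketch of how Equation (2.4) combines the two contributions, the hypothetical function below evaluates the $\alpha$-weighted log-likelihood given the two exponential-family pairs; the argument names and helper signatures are ours, not part of any released implementation.

```python
import numpy as np

def combined_log_likelihood(alpha, A, z, E, W, T_e, eta_e, T_w, eta_w):
    """Sketch of Equation (2.4). E: observed pairs (non-edges included);
    W: the subset of E carrying weights, as a Python set of (i, j) pairs;
    (T_e, eta_e) and (T_w, eta_w) are the existence and weight families,
    with eta_*(k, kp) returning the natural parameters of bundle (k, kp)."""
    ll = 0.0
    for (i, j) in E:                     # existence term: is (i, j) an edge?
        x = 1.0 if (i, j) in W else 0.0
        ll += alpha * (T_e(x) @ eta_e(z[i], z[j]))
    for (i, j) in W:                     # weight term: observed value A_ij
        ll += (1 - alpha) * (T_w(A[i, j]) @ eta_w(z[i], z[j]))
    return ll
```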

2.2 Degree correction

The last piece of the WSBM is a generalization to naturally handle heavy-tailed degree distributions, which are ubiquitous in real-world networks and are known to cause the SBM to produce undesirable results, e.g. placing all high-degree vertices in a group together, regardless of their natural community membership [8].

Karrer and Newman [8] introduced an elegant extension of the SBM that circumvents this behavior. In their ‘degree-corrected’ SBM (here DCBM), they add vertex degree information into the generative model by adding an ‘edge-propensity’ parameter $\phi_i$ to each vertex. As a result, the number of edges that exist between a pair of vertices $i$ and $j$ is a Poisson random variable with mean $\phi_i \phi_j \theta_{z_i,z_j}$. Because vertices with high propensity are more likely to connect than vertices with low propensity, the propensity parameters $\phi$ allow for heterogeneous degree distributions within groups. In the DCBM, vertices in the same block are no longer stochastically equivalent, but have similar group-level connectivity patterns conditioned on their propensity parameters $\phi$.

The likelihood function for this model is
$$\Pr(A \mid z,\theta,\phi) = \prod_{ij} \frac{\left(\phi_i \phi_j \theta_{z_i z_j}\right)^{A_{ij}}}{A_{ij}!} \exp\left(-\phi_i \phi_j \theta_{z_i z_j}\right),$$
where the maximum likelihood estimate of each propensity parameter $\phi_i$ is simply the vertex degree $d_i$ [8]. By fixing $\phi_i = d_i$, we can rewrite the DCBM in the exponential family form
$$\Pr(A \mid z,\theta) \propto \prod_{ij} \exp\left( A_{ij} \log\theta_{z_i z_j} - d_i d_j\, \theta_{z_i z_j} \right), \tag{2.5}$$
where the sufficient statistics are $T = (A_{ij}, -d_i d_j)$ and the natural parameters are $\eta = (\log\theta_{z_i z_j}, \theta_{z_i z_j})$. Thus, to derive a degree-corrected WSBM, we simply replace the SBM contribution in Equation (2.4) with that of the DCBM in Equation (2.5). We note that this model can easily be extended to include in- and out-propensity parameters for directed networks.
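
The exponential family form of Equation (2.5) can be verified directly: the sketch below checks, for arbitrary illustrative values, that $T \cdot \eta$ matches the Poisson log-likelihood up to terms that depend only on the data and therefore drop out of inference.

```python
import numpy as np

# With T = (A_ij, -d_i d_j) and eta = (log th, th), T . eta equals the
# Poisson log-likelihood A_ij log(d_i d_j th) - d_i d_j th up to the
# parameter-free term A_ij log(d_i d_j) (log A_ij! is likewise omitted).
th, a, di, dj = 0.7, 3, 5, 4                                   # illustrative values
t_dot_eta = np.array([a, -di * dj]) @ np.array([np.log(th), th])
poisson_ll = a * np.log(di * dj * th) - di * dj * th
assert np.isclose(poisson_ll - t_dot_eta, a * np.log(di * dj))
```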

This degree-corrected WSBM allows for heterogeneous degree distributions within groups by modeling vertex degree, that is, the sum of edge existences. This is distinct from what one might call a ‘strength’-corrected SBM, which would produce heterogeneous weight distributions within edge bundles by modeling vertex strength (the sum of a vertex's edge weights). This ‘strength’-corrected model is not considered here and is an area for future work.

3. Learning latent block structure

Given a sparse weighted graph $A$, we recover the underlying communities by learning the parameters $z,\theta$. Any of a large number of standard approaches can be used to optimize the likelihood function for the WSBM. Here, we describe an efficient variational Bayes approach [16,17], which effectively handles one technical difficulty in fitting the model to real data.

Specifically, learning the parameters $z,\theta$ by directly maximizing the likelihood in Equation (2.2) can suffer from degenerate solutions under continuous-valued weights. For instance, consider the WSBM with normally distributed edge weights, where some bundle of edges has all-equal weights. In this case, the maximum likelihood estimate of that bundle's variance parameter is zero, which creates a degeneracy in the likelihood calculation. This case is not pathological, as a poor choice of partition $z$, chosen perhaps inadvertently over the course of maximizing the likelihood, can easily create two small groups with only a few edges, each with the same weight, between them. This problem has not previously been identified in the block-modeling literature because in the SBM the edge ‘weights’ are discrete Bernoulli random variables, whose parameters are never degenerate.

We solve this problem using Bayesian regularization. In the Bayesian framework, we treat the parameters as random variables and assign an appropriate prior distribution $\pi$ to our parameters $z,\theta$. If we treat the prior distribution as the probability of the parameters, $\pi(z,\theta) = \Pr(z,\theta)$, then we may calculate the posterior distribution as the probability of the parameters conditioned on the data, $\pi^*(z,\theta) = \Pr(z,\theta \mid A)$, through Bayes' law
$$\pi^*(z,\theta) = \frac{\Pr(A \mid z,\theta)\,\pi(z,\theta)}{\Pr(A)}.$$
After calculating the posterior distribution, we may either return our posterior beliefs $\pi^*$ about the parameters $z,\theta$ or further calculate a point estimate that minimizes a posterior expected loss with respect to a given loss function [18,19]. In both cases, it suffices to calculate the posterior $\pi^*$. By contrast, the maximum likelihood estimate corresponds to maximizing only the likelihood $\Pr(A \mid z,\theta)$. The inclusion of the prior distribution $\pi$ prevents the posterior distribution $\pi^*$ from over-fitting to the degenerate maximum likelihood solutions, and therefore estimation can proceed smoothly.

However, the posterior distribution is generally difficult to calculate analytically. Instead, we approximate $\pi^*(z,\theta)$ by a factorizable distribution $q(z,\theta) = q_z(z)\,q_\theta(\theta)$, a common approach in both machine learning and statistical physics. We select our approximation $q$ by minimizing its Kullback–Leibler (KL) divergence to the posterior,
$$D_{\mathrm{KL}}(q \,\|\, \pi^*) = \sum_{z} \int_{\Theta} q(z,\theta) \log\frac{q(z,\theta)}{\pi^*(z,\theta)} \,\mathrm{d}\theta.$$
The KL divergence is a non-symmetric, non-negative, information-theoretic measure of the difference between two distributions. Thus, our approximation $q$ can be thought of as the closest approximation to the posterior $\pi^*$, subject to factorization and distributional constraints.

Expanding the constant marginal log-likelihood $\log\Pr(A)$, we observe that minimizing the KL divergence is equivalent to maximizing the functional $\mathcal{G}(q)$ defined as follows. Let
$$\log\Pr(A) = \mathcal{G}(q) + D_{\mathrm{KL}}(q \,\|\, \pi^*),$$
where
$$\mathcal{G}(q) = \mathbb{E}_q\left[\log\Pr(A \mid z,\theta)\right] - D_{\mathrm{KL}}(q \,\|\, \pi). \tag{3.1}$$
The first term of Equation (3.1) is the expected log-likelihood under the approximation $q$, and the second term is the negative KL divergence of the approximation $q$ from the prior $\pi$. Therefore, we aim to maximize the expected log-likelihood of the data while weakly constraining the approximation to be close to the prior. The second term serves as a regularizer that prevents over-fitting and eliminates the aforementioned maximum likelihood degeneracies. In practice, the first term dominates the second given sufficient data, and the procedure approximates maximum likelihood estimation.

Because the KL divergence is non-negative, we can think of $\mathcal{G}(q)$ as a functional lower bound on the log-evidence or marginal log-likelihood, that is,
$$\log\Pr(A) = \mathcal{G}(q) + D_{\mathrm{KL}}(q \,\|\, \pi^*) \geq \mathcal{G}(q). \tag{3.2}$$
Maximizing $\mathcal{G}(q)$ is equivalent to minimizing the KL divergence $D_{\mathrm{KL}}(q \,\|\, \pi^*)$ because the log-evidence $\log\Pr(A)$ is constant. Therefore, as we maximize $\mathcal{G}(q)$, our approximation $q$ gets closer to the true posterior $\pi^*$. For more details on variational Bayesian inference in graphical models, we refer the interested reader to Ref. [16].

3.1 Conjugate distributions

To calculate $\mathcal{G}$ in practice, we must assign prior distributions $\pi$ to our parameters and place constraints on the distributions of our approximation $q$. For mathematical convenience, we choose $\pi$ and restrict $q$ to be the product of parameterized conjugate distributions. Because $q$ takes a parameterized form, maximizing the functional $\mathcal{G}(q)$ over all factorized distributions $q$ simplifies to maximizing $\mathcal{G}(q)$ over the parameters of $q$.

For the edge bundle parameters $\theta$, the standard conjugate prior for the parameter of an exponential family $(T,\eta)$ is
$$\pi(\theta \mid \tau) = \frac{1}{Z(\tau)} \exp\left( \tau \cdot \eta(\theta) \right), \tag{3.3}$$
where $\tau$ parameterizes the prior and $Z(\tau)$ is a normalizing constant for fixed $\tau$.

For notational convenience, we let $r$ index the $K \times K$ edge bundles between groups; hence $\theta = (\theta_1,\ldots,\theta_{K^2})$. When we update the prior based on the observed weights in a given edge bundle $r$, the posterior's parameter becomes $\tau^* = \tau + T_r$, where $T_r$ is the sufficient statistic of the observed edges. Thus $\tau$ can be viewed as a set of pseudo-observations that push the likelihood function away from the degenerate cases, so that every edge bundle, no matter how small or uniform, produces a valid parameter estimate.
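
A concrete instance may help here. For the Bernoulli existence family of Equation (2.2), the conjugate prior of Equation (3.3) is Beta-like, and the pseudo-observation update $\tau^* = \tau + T_r$ keeps the estimate away from the degenerate boundary even for a tiny, uniform bundle; the numbers below are purely illustrative.

```python
import numpy as np

# Bernoulli family: T(x) = (x, 1), eta(p) = (log(p/(1-p)), log(1-p)), so the
# conjugate prior pi(p | tau) with tau = (a, b) is proportional to
# p^a (1-p)^(b-a), i.e. a Beta(a+1, b-a+1) distribution.
tau0 = np.array([1.0, 2.0])        # prior pseudo-observations: 1 edge in 2 trials

bundle = np.array([1, 1, 1])       # a tiny, uniform bundle: all three edges exist
T_r = np.array([bundle.sum(), bundle.size])

tau_star = tau0 + T_r              # conjugate update: tau* = tau0 + T_r
p_map = tau_star[0] / tau_star[1]  # posterior mode of the resulting Beta(5, 2)
print(p_map)                       # 0.8: pulled back from the degenerate ML value 1.0
```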

For the vertex labels $z$, the natural conjugate prior is a categorical distribution with parameter $\mu \in \mathbb{R}^{n \times K}$. The parameter $\mu_i(k)$ represents the probability that vertex $i$ belongs to group $k$ in all of its interactions. If the probability in parameter $\mu_i$ is spread among multiple groups, then this indicates uncertainty in the membership of vertex $i$, not mixed membership. We fit $\mu_i$ directly, with flat prior $\mu_0(k) = 1/K$.

The form of our prior is thus
$$\pi(z,\theta) = \prod_i \pi_i(z_i \mid \mu_0) \prod_r \pi_r(\theta_r \mid \tau_0), \tag{3.4}$$
where $\mu_0$ and $\tau_0$ are the parameters for the priors $\pi_i$ and $\pi_r$, picked to be a ‘non-informative’ reference prior [20], also called a flat prior.
Similarly, our approximation $q$ takes the form
$$q(z,\theta) = \prod_i q_i(z_i \mid \mu_i) \prod_r q_r(\theta_r \mid \tau_r). \tag{3.5}$$

3.2 An efficient algorithm for optimizing $\mathcal{G}$

Now we consider maximizing $\mathcal{G}$ over $q$'s parameters $\mu_i$ and $\tau_r$. To simplify notation, let $\langle T \rangle_r$ and $\langle \eta \rangle_r$ be the expected values of the sufficient statistics $T_r$ and natural parameters $\eta_r$ under the approximation $q$, that is, we set
$$\langle T \rangle_r = \sum_{(i,j)} T(A_{ij}) \sum_{(z_i,z_j) \to r} \mu_i(z_i)\,\mu_j(z_j), \tag{3.6}$$
$$\langle \eta \rangle_r = \mathbb{E}_{q_r}\left[\eta(\theta_r)\right] = \left.\frac{\partial}{\partial\tau} \log Z(\tau)\right|_{\tau = \tau_r}, \tag{3.7}$$
where the inner sum in Equation (3.6) runs over the pairs of group labels $(z_i,z_j)$ that map to bundle $r$. Substituting the conjugate forms of $\pi$ and $q$ into $\mathcal{G}$ thus yields
$$\mathcal{G} = \sum_r \langle T \rangle_r \cdot \langle \eta \rangle_r - \sum_i \sum_{z} \mu_i(z) \log\frac{\mu_i(z)}{\mu_0(z)} - \sum_r D_{\mathrm{KL}}\left(q_r \,\|\, \pi_r\right). \tag{3.8}$$

To optimize $\mathcal{G}$, we take derivatives with respect to $q$'s parameters $\mu$ and $\tau$ and set them to zero. We iteratively solve for the maximum by updating $\mu$ and $\tau$ independently.

For the edge bundle parameters $\tau$, the derivative of $\mathcal{G}$ is
$$\frac{\partial \mathcal{G}}{\partial \tau_r} = \left( \langle T \rangle_r + \tau_0 - \tau_r \right) \cdot \frac{\partial \langle \eta \rangle_r}{\partial \tau_r}, \tag{3.9}$$
and setting this equal to zero yields a compact update equation
$$\tau_r = \tau_0 + \langle T \rangle_r \tag{3.10}$$
for each edge bundle $r$.
For the vertex-label parameters $\mu$, we include Lagrange multipliers $\lambda_i$ to enforce the constraints $\sum_z \mu_i(z) = 1$. Setting the derivative of $\mathcal{G}$ with respect to $\mu_i(z)$ equal to $\lambda_i$ yields
$$\sum_r \frac{\partial \langle T \rangle_r}{\partial \mu_i(z)} \cdot \langle \eta \rangle_r - \log\frac{\mu_i(z)}{\mu_0(z)} - 1 = \lambda_i,$$
where
$$\frac{\partial \langle T \rangle_r}{\partial \mu_i(z)} = \sum_j \left( T(A_{ij}) \sum_{(z,z_j) \to r} \mu_j(z_j) + T(A_{ji}) \sum_{(z_j,z) \to r} \mu_j(z_j) \right).$$
Solving for $\mu_i(z)$ yields a compact update equation
$$\mu_i(z) \propto \mu_0(z) \exp\left( \sum_r \frac{\partial \langle T \rangle_r}{\partial \mu_i(z)} \cdot \langle \eta \rangle_r \right), \tag{3.11}$$
where each $\mu_i$ is normalized to a probability distribution. To calculate the $\mu_i$ values, we iteratively update each $\mu_i$ from some initial guess until convergence to within some numerical tolerance.

Algorithm 1 gives pseudocode for the full variational Bayes algorithm, which alternates between updating the edge-bundle parameters and the vertex-label parameters using update equations (3.10) and (3.11). Updating $\theta$ is relatively fast: we first calculate $\langle T \rangle_r$ and $\tau_r$ for each edge bundle $r$ and then update each $\langle \eta \rangle_r$, which takes $O(nK^2)$ time. Updating $\mu$ is the limiting step of the calculation, as we iteratively update $\mu$ until convergence while holding $\theta$ fixed. To calculate $\partial \langle T \rangle_r / \partial \mu_i(z)$, each vertex must sum over its connected edges for each edge bundle, which takes $O(d_i K^2)$ time. If $m$ is the total number of edges in the network, then updating $\mu$ takes $O(mK^2)$ time; in particular, if the network is sparse, with $m = O(n)$, then updating $\mu$ takes $O(nK^2)$ time. In practice, convergence is achieved after a small number of iterations.
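
To make the alternation concrete, the following simplified sketch implements the two update steps for the normally distributed, complete-graph case ($\alpha = 0$), replacing the conjugate-posterior bookkeeping of Equation (3.10) with per-bundle point estimates of the mean and variance; it illustrates the loop structure and is not a transcription of Algorithm 1.

```python
import numpy as np

def fit_wsbm_normal(A, K, n_iter=50, seed=0):
    """Simplified mean-field loop for normal weights on a complete directed
    graph (alpha = 0). Point estimates (m, v) stand in for the conjugate
    posteriors tau_r of Equation (3.10); mu follows Equation (3.11)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    mu = rng.dirichlet(np.ones(K), size=n)        # soft labels mu_i(z)

    for _ in range(n_iter):
        # Bundle step: weighted mean and variance of each bundle (k, l).
        w = np.einsum('ik,jl->klij', mu, mu)      # weight of pair (i,j) in (k,l)
        W = w.sum(axis=(2, 3)) + 1e-9
        m = np.einsum('klij,ij->kl', w, A) / W
        sq = (A[None, None] - m[..., None, None]) ** 2
        v = (w * sq).sum(axis=(2, 3)) / W + 1e-6

        # Label step: mu_i(z) prop. exp(expected log-density of i's edges),
        # summing both edge directions, then normalized per vertex.
        logN = -0.5 * (np.log(2 * np.pi * v)[..., None, None]
                       + sq / v[..., None, None])
        s = (np.einsum('jl,klij->ik', mu, logN)       # edges i -> j
             + np.einsum('jl,lkji->ik', mu, logN))    # edges j -> i
        mu = np.exp(s - s.max(axis=1, keepdims=True))
        mu /= mu.sum(axis=1, keepdims=True)
    return mu, m, v
```

On weights drawn from the normal example of Section 2, this loop typically recovers the planted groups up to a relabeling, though, as with any such scheme, it can stop at a poor local optimum and benefits from random restarts.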

As with all variational approaches, and indeed all practical optimization techniques, the algorithm is only guaranteed to converge to a local optimum of $\mathcal{G}$. In practice, we run the algorithm to convergence from a number of randomly chosen initial conditions, and then select the best $\mu,\tau$.

In addition to the variational Bayes algorithm above, we derive in Appendix C an efficient loopy belief propagation algorithm [21–23] for the WSBM on sparse graphs. The loopy belief propagation algorithm creates a more flexible approximation to the posterior distribution than the variational Bayes algorithm, but with a slightly higher computational cost. Small modifications for dealing with sparse-weighted networks are described in Appendix D. Finally, Appendix A describes how to obtain our implementation of these methods.

3.3 Selecting $K$ with Bayes factors

As with most SBMs, the number of groups $K$ is a free parameter that must be chosen before the model can be applied to data. For the WSBM, we must also choose the tuning parameter $\alpha$ and the exponential family distributions $(T,\eta)$.

In principle, any of several model selection techniques could be used, including minimum description length [24], integrated likelihood [25] or Bayes factors [17]. Classic complexity-control techniques such as AIC or BIC are known to misestimate $K$ in certain situations [23]. Here, we describe an approach based on Bayes factors, which selects the value of $K$ with the largest (approximate) marginal log-likelihood.

Let $\mathcal{M}_1 = (K_1,\alpha_1,T,\eta)$ and $\mathcal{M}_2 = (K_2,\alpha_2,T,\eta)$ be two competing models, one with $K_1$ groups and one with $K_2$ groups. The Bayes factor between these models is
$$\Lambda = \frac{\Pr(\mathcal{M}_1 \mid A)}{\Pr(\mathcal{M}_2 \mid A)} = \frac{\Pr(A \mid \mathcal{M}_1)\,\Pr(\mathcal{M}_1)}{\Pr(A \mid \mathcal{M}_2)\,\Pr(\mathcal{M}_2)} \approx \exp\left( \mathcal{G}_1 - \mathcal{G}_2 \right), \tag{3.12}$$
where we approximate the marginal log-likelihood $\log\Pr(A \mid \mathcal{M}_i)$ of each model with our lower bound $\mathcal{G}_i$ from Equation (3.2). Although this use of Bayes factors assigns a uniform prior over the set of nested models, the approach retains a built-in penalty for complex models through the prior distribution. In our experiments below, we treat $K,\alpha,T$ and $\eta$ as fixed. This method has produced good results on synthetic data with known planted structure [15].
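
In code, the resulting selection rule is one line once the bounds are in hand; the values below are hypothetical stand-ins for fitted bounds, not results from any data set in this paper.

```python
# Hypothetical fitted lower bounds G_K (Eq. 3.2), one per candidate K.
G = {4: -1523.7, 8: -1498.2, 12: -1501.9}

K_best = max(G, key=G.get)          # largest approximate marginal log-likelihood
log_bayes_factor = G[8] - G[4]      # log of Equation (3.12), uniform model prior
print(K_best, log_bayes_factor)
```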

We now demonstrate the use and efficacy of Bayes factors in selecting the number of groups $K$ in the WSBM. For our demonstration, we choose $K=8$ groups of $10$ vertices each, and consider the method's performance for a variety of edge-weight structures. Specifically, edge weights within each group are drawn from a normal distribution with mean $-1$ and variance $\sigma^2$, while edge weights between groups are drawn from a normal distribution with mean $1$ and variance $\sigma^2$. By varying the variance parameter $\sigma^2$, we vary the difficulty of recovering the true group structure, with a larger variance $\sigma^2$ making inference more difficult by causing the edge-weight distributions within and between groups to increasingly overlap. Figure 4(a) shows an example network drawn from this model, where we choose $\sigma^2 = 0.15$.

Fig. 4. (a) Example network with $K = 8$ groups and variance $\sigma^2 = 0.15$. (b) Approximate marginal log-likelihood $\mathcal{G}$ for each model as a function of $K$. (c) NMI between the fitted model and the true planted structure as a function of $K$. Each data series in (b,c) corresponds to a different choice of variance $\sigma^2$ in edge weights. Results are averaged over 20 trials and the error bars are standard errors.

For each choice of $\sigma^2$, and for a large number of networks drawn from this model, we fit the WSBM using the normal distribution for the edge weights and vary the number of inferred groups $K$ from $1$ to $14$. Figure 4(b) shows the approximate marginal log-likelihood $\mathcal{G}$ of each fitted model as $K$ varies, which represents our proportional belief that each choice of $K$ is the correct one. Similarly, Fig. 4(c) shows the NMI between each fitted model and the true planted structure, which measures the accuracy of each choice of $K$. Reassuringly, both quantities are maximized at or close to the true value of $K$. When the within- and between-group edge-weight distributions are relatively well separated, both the marginal log-likelihood $\mathcal{G}$ and the NMI are consistently maximized at $K = 8$, indicating that Bayes factors provide a reasonably reliable method for selecting the correct number of groups and thereby recovering the true planted structure in most cases. As the distributions overlap (greater $\sigma^2$ here), it becomes more difficult to distinguish groups, and accuracy degrades to some degree, as would be expected.

4. Experimental evaluation

In this section, we evaluate the performance of the WSBM on several real-world networks, in two different ways. First, we consider the question of whether adding edge-weight information necessarily reinforces the latent group structure contained in the edge existences. That is, can the WSBM find structure distinct from what the SBM would find? Second, we evaluate the WSBM's performance on two prediction tasks. The first focuses on predicting missing edges (also called ‘link prediction’), whereas the second focuses on predicting missing edge weights. We compare its performance with other block models through cross-validation.

4.1 Edge weight versus edge-existence latent group structure

To probe the question of whether edge weights can contain latent group structure that is distinct from that contained in the edge existences, we consider a simple network derived from the competitions among a set of professional sports teams. In this network, called ‘NFL-2009’ hereafter, each vertex represents one of the 32 professional American football teams in the National Football League (NFL). An edge exists whenever a pair of teams played each other in the 2009 season, and each edge is assigned a weight equal to the average score difference across games played by that pair [26]. (This definition of edge weight implies the network is skew-symmetric, $A_{ij} = -A_{ji}$.) These teams are divided equally between two ‘conferences’ (called AFC and NFC), and within each conference, teams are assigned to one of four divisions, each containing four teams. Play among teams, i.e. the existence of an edge, is determined by division memberships, and many teams never play each other during the regular season.

To analyze this network, we choose $K=4$ and fit both the SBM ($\alpha = 1$) and the ‘pure’ WSBM (pWSBM, $\alpha = 0$) using the normal distribution as a model of edge weights. This choice is reasonable for these data, as score differences can be positive or negative, and score totals are close to a binomial distribution [27]. The $\alpha = 1$ (SBM) case ignores the weights of edges, whereas the $\alpha = 0$ (pWSBM) case ignores the presence or absence of edges, focusing only on the observed score differences.

Examining the results of both models, we see that the block structure learned by the SBM ($\alpha = 1$, Fig. 5(a,b)) exactly recovers the major divisions within each conference, along with the division between conferences, illustrating that division membership fully explains which teams played each other in this season. That is, the empty off-diagonal blocks (Fig. 5) reflect the fact that two pairs of divisions never play each other.

Fig. 5. NFL-2009 network: black nodes ($\bullet$) are teams in conference 1 (NFC) and white nodes ($\circ$) are teams in conference 2 (AFC). Edges are colored by score differential (light positive, gray approximately zero, dark negative). (a) Network showing SBM communities ($\alpha = 1$). (b) Adjacency matrix, sorted by SBM communities. (c) Network showing WSBM communities ($\alpha = 0$). (d) Adjacency matrix, sorted by WSBM communities. The SBM ($\alpha = 1$) groups correspond to the NFL conference structure, whereas the WSBM ($\alpha = 0$) groups correspond to relative skill levels.

In contrast, the block structure learned by the pure WSBM ($\alpha = 0$, Fig. 5(c,d)) recovers a global ordering of teams (as in Fig. 1(d)) that reflects each team's general skill, so that teams within each block have roughly equal skill. This pattern mixes teams across conference and division lines, and thus disagrees with the block structure recovered by the SBM. For instance, consider the upper-left group in Fig. 5, which generally has positive score differences (wins) in games against teams in either lower group, with a mean lead of 11 points. Similarly, the lower-left group has positive score differences (wins) against teams in the lower-right group. The small upper-right group performs equally well against teams of every other group. Within each group, however, score differences tend toward zero, indicating roughly equal skill.

The fact that the SBM and the pure WSBM recover entirely distinct block structures illustrates that adding edge-weight information to the inference step can dramatically alter our conclusions about the latent block structure of a network. That is, adding edge weights does not necessarily reinforce the inferences produced from binary edges alone. The extremal settings of the parameter $\alpha$ in our model allow a practitioner to choose which of these types of latent structure to find, while if a mixed-type conclusion is preferred, an intermediate value of $\alpha$ may be chosen. In the following section, we demonstrate such a model, which we call the ‘balanced’ WSBM, which can learn simultaneously from edge-existence and edge-weight information.

4.2 Predicting edge existence and weight

For a more rigorous evaluation of the WSBM, in this section we consider the problem of predicting missing information when the model is fitted to a partially observed network. In particular, we consider predicting the existence or the weight of some unobserved interaction, a task similar to missing and spurious link prediction [28,29].

Here, we compare the WSBM to other block models on five real-world networks from various domains. Most of these models are defined only for unweighted networks, and thus some care is required to make them perform the edge-weight prediction task, as we describe below. We evaluate performance numerically across multiple trials of cross-validation, training each model on 80% of the $n^2$ possible edges and testing on the remaining 20%.

The weighted graphs we consider are the following.

  • Airport. Vertices represent the $n=500$ busiest airports in the USA, and each of the $m=5960$ directed edges is weighted by the number of passengers traveling from one airport to another [30].

  • Collaboration. Vertices represent $n=226$ nations on Earth, and each of the $m=20616$ edges is weighted by a normalized count of academic papers whose author lists include that pair of nations [31].

  • Congress. Vertices represent the $n=163$ committees in the 102nd US Congress, and each of the $m=26569$ edges is weighted by the pairwise normalized ‘interlock’ value of shared members [32].

  • Forum. Vertices represent $n=1899$ users of a student social network at UC Irvine, and each of the $m=20291$ directed edges is weighted by the number of messages sent between users [33].

  • College FB. Vertices represent the $n=1411$ NCAA college football teams, and each of the $m=22168$ edges is weighted by the average point difference across games between a pair of teams [26].

For each of the two prediction tasks and for each network, we evaluate five models: the ‘pure’ WSBM (pWSBM), which uses only weight information ($\alpha = 0$); the ‘balanced’ WSBM (bWSBM), which uses both edge and weight information ($\alpha = 0.5$); the ‘classic’ SBM, which uses only edge information ($\alpha = 1$); a degree-corrected weighted block model (DCWBM) with $\alpha = 0.5$; and the degree-corrected block model (DCBM). For the weighted block models, we select the normal distribution to model the edge weights.

In both prediction tasks, we first choose a uniformly random $20\%$ of the $n^2$ interactions, which we treat as missing when we fit the model to the network. We then fit each model to the observed edges and infer group membership labels for each vertex in the network. Finally, we use the posterior mean obtained from variational inference as the predictor for the existence and weight of unobserved interactions between those groups. For the models that do not naturally model edge weights (SBM, DCBM), we take their partitions, compute the sample mean weight for each of the induced edge bundles in the weighted network, and use this value to predict the weight of any missing edge in that bundle. These estimators correctly correspond to the underlying generative model for edge prediction in the SBM and DCBM, and are a natural extension for predicting edge weights for a given block membership. Under this scheme, each model is made to predict the unobserved interactions for a given network, and we score the accuracy of these predictions using the mean-squared error (MSE). Edge-existence prediction could also be evaluated using alternative criteria such as the AUC [6], which gives similar results.
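
The protocol itself is compact; the sketch below shows the shape of our evaluation loop, with `fit` and `predict` as stand-ins for whichever block model is being scored (the names and the NaN convention for unobserved pairs are our illustrative choices).

```python
import numpy as np

def prediction_mse(A, fit, predict, frac=0.2, seed=0):
    """Hide a uniformly random 20% of the n^2 interactions, fit on the rest,
    and score the held-out predictions by mean-squared error."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    held_out = rng.random((n, n)) < frac          # the missing interactions
    model = fit(np.where(held_out, np.nan, A))    # NaN marks unobserved pairs
    return np.mean((predict(model)[held_out] - A[held_out]) ** 2)
```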

Each of these models has a free parameter $K$ that determines the number of parameters to be estimated, which thus controls their overall flexibility. We control this variable model complexity and ensure a fair comparison by fixing all models to have $K = 4$ latent groups, and we treat all networks as directed. Finding the true number of latent groups $K$ for each network is a separate, worthwhile problem not considered here. To compare results across different data sets, all edge weights were normalized to fall in the interval $[-1,1]$. Non-negative weights were normalized after applying a logarithmic transform (cases marked with a star $*$ in Tables 1 and 2).

Table 1. Average MSE on edge prediction in $25$ trials

Network        pWSBM        bWSBM        SBM          DCWBM        DCBM
Airport        0.0202 (1)   0.0156 (1)   0.0158 (1)   0.0238 (1)   0.0238 (1)
Collaboration  0.1446 (3)   0.1167 (3)   0.1138 (3)   0.2289 (5)   0.2454 (5)
Congress       0.1765 (4)   0.1648 (4)   0.1640 (5)   0.2298 (9)   0.2402 (9)
Forum          0.00560 (1)  0.00535 (1)  0.00535 (1)  0.00565 (1)  0.00565 (1)
College FB     0.0369 (2)   0.0344 (1)   0.0346 (1)   0.0387 (2)   0.0389 (2)
Table 2. Average MSE on normalized weight prediction in 25 trials

Network         pWSBM       bWSBM       SBM         DCWBM       DCBM
Airport*        0.0486 (6)  0.0543 (5)  0.0632 (8)  0.0746 (9)  0.0918 (8)
Collaboration*  0.0407 (1)  0.0462 (1)  0.0497 (3)  0.0500 (2)  0.0849 (3)
Congress*       0.0571 (4)  0.0594 (4)  0.0634 (6)  0.0653 (4)  0.1050 (6)
Forum*          0.0726 (3)  0.0845 (3)  0.0851 (4)  0.0882 (4)  0.0882 (4)
College FB      0.0124 (1)  0.0140 (1)  0.0145 (1)  0.0149 (1)  0.0160 (2)

For each model and each network, we ran 25 independent trials with our $80/20$ cross-validation split, as described above, and then computed the average MSE on each prediction task. The results for predicting edge existence are summarized in Table 1, and the results for predicting edge weights are summarized in Table 2. In each table, parentheses indicate the uncertainty (standard error) in the last digit, and the lowest value in each row marks the best-performing model.

Notably, in the edge-existence prediction task, the SBM and the bWSBM are the most accurate among all models, often by a large margin. The fact that the SBM performs well is perhaps unsurprising, as it is, by design, sensitive only to edge existence in the first place. However, the bWSBM learns from both existence and weight information, and its strong performance indicates that, for these networks, learning from edge weights does not necessarily confuse predictions of edge existence. In the edge-weight prediction task, however, the pWSBM ($\alpha = 0$) is the most accurate, often by a large margin, as we might expect for a model designed to learn only from edge-weight information.

In this experimental framework, none of the degree-corrected models performs well. This is likely a consequence of the DCBM's and DCWBM's correction for edge propensity in the group memberships: by seeking community structure only after accounting for edge propensity, the DCBM and DCWBM make less accurate predictions of both edge existence and edge weight. It is worth pointing out, however, that prediction is not the only measure of utility for community detection techniques, and degree-corrected models often perform better than non-corrected models at recovering meaningful latent group structure in practical situations. We thus expect the degree-corrected WSBM to be most useful in situations where the goal is the recovery of scientifically meaningful group structure, rather than edge-existence or edge-weight prediction.

In general, the SBM performs well on edge prediction but poorly on weight prediction, whereas the pWSBM performs poorly on edge prediction but well on weight prediction. This pattern is precisely as we might expect, as the SBM only considers edge-existence information, while the pWSBM only considers edge-weight information.

What is surprising, however, is the good performance on both tasks by the bWSBM ($\alpha = 0.5$), which is as good or nearly as good as the SBM in edge prediction, but substantially better than the SBM in weight prediction. This demonstrates that the bWSBM is a more powerful model than the SBM: it performs as well as the SBM on SBM-like tasks and better on edge-weight tasks. In these examples, incorporating edge-weight information into the SBM framework does not detract from the WSBM's performance in edge prediction. This good general performance is possible because the bWSBM learns from both edge-existence and edge-weight information.

5. Discussion

In the analysis of networks, the inference of latent community structure is a common task that facilitates subsequent analysis, e.g. by dividing a large heterogeneous network into a set of smaller, more homogeneous subgraphs, and can reveal important insights into its basic organizational patterns. When edges are annotated with weights, this extra information is often discarded, e.g. by applying a single universal threshold to all weights. The weighted stochastic block model (WSBM) we described here is a natural generalization of the popular stochastic block model (SBM) to edge-weighted sparse networks. Crucially, the WSBM provides a statistically principled solution to the community detection problem in edge-weighted networks, and removes the need to apply any thresholds before analysis. Thus, this model preserves the maximal amount of information in such networks for characterizing their large-scale structure.

The WSBM's general form, given in Equation (2.4), is parametrized by a mixing parameter $\alpha$, which allows it to learn simultaneously from both the existence (presence or absence) of edges and their associated weights. In our tests with real-world networks, the WSBM yields excellent results on both edge-existence and edge-weight prediction tasks. Additionally, the balanced model ($\alpha = 0.5$) performed as well or nearly as well as the best alternative block model, suggesting it may work well as a general model for novel applications where it is not known whether edge existence or edge weight is more informative.

In many applications, the inferred group structure will be of primary interest. For these cases, it is important to note that the groups identified by the WSBM can be distinct from those identified by examining only an unweighted version of the same network. Both forms of latent structure may be interesting and are likely to shed different light on the underlying organization of the network. It remains an open question to determine the types of networks for which weight information contains distinct partition structure from edge existences, although we have shown at least one example of such a network in Section 4.1.

The variational algorithm described here provides an efficient method for fitting the WSBM to an empirical network. Its scalability is relatively good by modern standards, and it should thus be applicable to networks of millions of vertices or more. Alternative algorithms, such as those based on Markov chain Monte Carlo for unweighted networks [29,34], are possible; however, each must contend with several technical problems presented by edge-weight distributions, e.g. the degeneracies in the likelihood function produced by edge bundles whose weights have zero variance.

Finally, there are several natural extensions of the WSBM, including mixed memberships [35], bipartite forms [9], dynamic networks [36], different distributions for different edge bundles and the handling of more complex forms of auxiliary information, e.g. on the vertices or edges.

An important and open theoretical question presented by this model is whether utilizing weight information modifies the fundamental detectability of latent group structure, which exhibits a phase transition in the classic SBM [22]. We look forward to these and other extensions.

Funding

This work was supported by the U.S. Air Force Office of Scientific Research and the Defense Advanced Research Projects Agency (grant number FA9550-12-1-0432).

Acknowledgements

We thank Dan Larremore, Leto Peel and Nora Connor for helpful conversations and suggestions.

Certain data included herein are derived from the Science Citation Index Expanded, Social Science Citation Index and Arts & Humanities Citation Index, prepared by Thomson Reuters, Philadelphia, Pennsylvania, USA, Copyright Thomson Reuters, 2011.

Appendix A. Working implementation

A working implementation of the WSBM inference code, written by the authors, may be found at http://tuvalu.santafe.edu/%7Eaaronc/wsbm/. This code implements the efficient algorithms discussed in Appendix D.

Appendix B. Exponential families

Let $\mathcal{X}$ be a fixed domain and $\Theta$ be a set of parameters. An exponential family is a collection of parametric distributions $\mathcal{F}$ that can be written in the form
$$\Pr(x \mid \theta) = h(x) \exp\left( T(x) \cdot \eta(\theta) \right),$$
where $h, T$ and $\eta$ are fixed functions. The map $T$ is the sufficient statistic function and the map $\eta(\theta)$ gives the natural parameters; note that $T$ and $\eta$ can be vectors. The function $h(x)$ distinguishes different probability distributions, but appears as an additive constant in the log-likelihood function, which can thus be ignored. Thus, only the pair $(T,\eta)$ directly impacts the likelihood function.

Examples of exponential families include the normal, exponential, gamma, log-normal, Pareto, binomial, multinomial, Poisson and beta distributions. Examples of distributions that are not exponential families are the uniform distribution and certain mixture distributions.

A common representation of an exponential family separates out the log-partition function |$A(\theta)$|, written as
|$f(x \,|\, \theta) = h(x) \exp (T(x) \cdot \eta (\theta) - A(\theta)).$|
To keep notation compact, we absorb |$-1 \cdot A$| into |$T \cdot \eta $| (by appending a constant component |$1$| to |$T$|, paired with |$-A(\theta)$| in |$\eta $|).
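
As an illustration of this |$(T,\eta)$| representation (our example, not part of the authors' implementation), the following short Python sketch writes the normal distribution in this form, with the absorbed |$-A(\theta)$| carried by a constant third component of |$T$|, and checks it against a reference implementation:

    import numpy as np
    from scipy.stats import norm

    # Sufficient statistics (x, x^2, 1); the constant third component
    # carries the absorbed -A(theta) term described above.
    def T(x):
        return np.array([x, x ** 2, 1.0])

    # Natural parameters paired with T; the last entry is -A(theta).
    def eta(mu, sigma2):
        return np.array([
            mu / sigma2,
            -1.0 / (2.0 * sigma2),
            -mu ** 2 / (2.0 * sigma2) - 0.5 * np.log(2.0 * np.pi * sigma2),
        ])

    # Here h(x) = 1, so the log-likelihood is exactly T(x) . eta(theta).
    x, mu, sigma2 = 1.3, 0.4, 2.0
    assert np.isclose(T(x) @ eta(mu, sigma2),
                      norm.logpdf(x, mu, np.sqrt(sigma2)))
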
A convenient property of exponential families is that they admit easily written conjugate priors. For our exponential family, the standard class of conjugate priors |$\pi $| is
|$\pi (\theta \,|\, \tau) = Z(\tau)^{-1} \exp (\tau \cdot \eta (\theta)),$|
where |$\tau $| denotes the (hyper-)parameters of the prior and can be thought of as pseudo-observations of |$T$|. The function |$Z$| is the normalizing constant, defined as
|$Z(\tau) = \int _{\Theta } \exp (\tau \cdot \eta (\theta)) \,{\rm d}\theta .$|
Finally, it can be shown that the expected value of |$\eta (\theta)$| under |$\pi (\cdot \,|\, \tau)$| is
|$\langle \eta (\theta) \rangle _{\pi (\cdot \,|\, \tau)} = \frac {\partial }{\partial \tau } \log Z(\tau).$|
Further details on exponential families can be found in Refs. [18,19] and for appropriate prior distributions in Ref. [20].
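
As a concrete check of this last identity (again our example, not from the paper): for exponentially distributed edge weights we have |$T(x) = (x, 1)$| and |$\eta (\lambda) = (-\lambda , \log \lambda)$|, so the conjugate prior |$\pi (\lambda \,|\, \tau) \propto \lambda ^{\tau _2}\,{\rm e}^{-\tau _1 \lambda }$| is a gamma distribution and both sides of the identity can be evaluated directly:

    import numpy as np
    from scipy.special import digamma, gammaln

    # log Z(tau) for exponentially distributed weights, where the
    # conjugate prior is a gamma distribution and
    # Z(tau) = Gamma(tau2 + 1) / tau1^(tau2 + 1).
    def log_Z(tau):
        tau1, tau2 = tau
        return gammaln(tau2 + 1.0) - (tau2 + 1.0) * np.log(tau1)

    tau = np.array([3.0, 2.0])

    # <eta(lambda)> under the gamma prior: (-E[lambda], E[log lambda]).
    expected_eta = np.array([
        -(tau[1] + 1.0) / tau[0],
        digamma(tau[1] + 1.0) - np.log(tau[0]),
    ])

    # Central-difference gradient of log Z(tau) reproduces <eta(lambda)>.
    eps = 1e-6
    grad = np.array([(log_Z(tau + eps * e) - log_Z(tau - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
    assert np.allclose(grad, expected_eta, atol=1e-5)
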

Appendix C. Belief propagation derivation

The main difference between a loopy belief propagation (hereafter simply BP) algorithm and the variational Bayes algorithm described in the main text lies in how we update the group membership parameters |$\mu $| [21]. The BP approach gives a more accurate approximation of the true posterior of |$z$| and has been shown to produce good results in the classic SBM case [22].

In variational Bayes, we used a mean-field approximation to the posterior distribution |$\pi ^*$|:
|$q(z) = \prod _{i} q_i(z_i),$|
where each vertex label is assumed to be independently distributed according to |$q_i$| (a categorical or multinomial random variable).
In BP, we use pairwise approximations to the posterior distribution. Ideally, this approximation would have the form
|$q(z) \propto \prod _{ij \in E} \mu _{ij}(z_i,z_j),$|
where the |$\mu _{ij}(\cdot ,\cdot)$| are joint probabilities. However, this is typically not achievable, because normalizing the product of distributions over all edge pairs is non-trivial (each vertex of degree |$k_i$| appears |$k_i$| times). Luckily, in the case of trees, it is possible to normalize |$q$| to a probability distribution by accounting for this repetition, that is,
|$q(z) = \prod _{ij \in E} \mu _{ij}(z_i,z_j) \prod _{i} \mu _i(z_i)^{1-k_i},$|
where |$\mu _i$| is the marginal of |$\mu _{ij}$|, |$k_i$| is the degree of vertex |$i$| and |$E$| is the set of observed edges. But the factor graph of the WSBM is not a tree, so this form is not necessarily exact.

Here, we take a loopy BP approach: we assume that the structure of pairwise terms |$ij \in E$| is in fact locally tree-like and then apply the BP update equations. The assumption of locally tree-like structure makes this algorithm a poor choice for dense networks (where we observe |$O(n^2)$| interactions), but it is both acceptable and effective for sparse networks.

Under this formulation, our goal is to maximize the variational approximation |$\mathcal {G}$| to the likelihood of the data, so that the KL divergence between |$q$| and |$\pi ^*$| is minimized. Recall from Equation (3.1) that the objective function |$\mathcal {G}$| consists of two parts: a likelihood term and a prior regularizer term.
The likelihood term is
where
and where we approximate |$\mathbb {E}_q(z_i z_j) \approx q_{ij}(z_i,z_j)$|; here |$\tau ^{(0)}_r$| and |$\mu ^{(0)}$| are the parameters of the prior.
The regularizer term consists of two parts. The second part requires us to sum over |$q(z)$|, which is combinatorially difficult to calculate, so we use the Bethe approximation
Combining these parts, the objective function may be written as
To enforce the marginalization and normalization restrictions on |$q(z)$|⁠, we introduce Lagrange multipliers, yielding
Note that |$\lambda _i$| enforces normalization of |$\mu _i$|⁠, |$\lambda _{ij,z}$| enforces marginalization over |$i$|⁠, and |$\lambda '_{ij,z'}$| enforces marginalization over |$j$|⁠. We maximize |$\mathcal {G}'$| by setting its derivatives with respect to the parameters of |$q$| equal to |$0$|⁠.
For the edge parameters |$\theta $|⁠, we differentiate with respect to |$\tau _r$|
This is the same expression as for the variational Bayes solution, since we only modified |$q(z)$|⁠. The update equations for |$\tau $| remain |$\tau _r = \tau ^{(0)}_r+ \langle T \rangle _r$|⁠.
For the vertex labels |$z$|, we differentiate with respect to |$\mu _i(z)$| and |$\mu _{ij}(z,z')$| and solve the resulting system of equations using a message-passing method, as is standard in BP. The derivatives are
and
Solving for |$\mu _i(z)$| and |$\mu _{ij}(z,z')$| we obtain
For notational convenience, let
Since |$\sum _{z'} \mu _{ij}(z,z') = \mu _i(z)$|⁠, we have
Setting our two expressions for |$\mu _i(z)$| equal, we obtain
(*)
Let |$\psi _{i\rightarrow j}(z_j)$| denote the message from vertex |$i$| to vertex |$j$| and set
Plugging in our definition of |$\psi $|⁠, we obtain
Using Equation (*), we obtain the following recursive definition for |$\psi $|:
Finally, our update equations for |$\mu $| become

If |$m = |E|$| is the number of observed edges/interactions, then the BP algorithm requires |$O(m)$| messages to be passed and therefore each iteration has an |$O((m+n)K^2)$| running time (updating the messages |$\psi $| and then the group membership parameters |$\mu $|⁠).

It will be convenient to use the following equivalent messages |$\varphi $|, used in Refs. [22,23,37], in our BP algorithm
Note that from our old message |$\psi $| update equations, we obtain
Putting these two equations together, our new update equations, with |$\varphi $| as our messages, become
Algorithm 2 gives pseudocode for the full loopy BP algorithm.
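
For readers who want the message-passing structure in executable form, the following is a deliberately generic Python sketch of one sweep of loopy BP for a pairwise model with |$K$| states; it is not the paper's Algorithm 2, whose updates also involve the prior terms |$\tau $| and |$\mu ^{(0)}$| derived above, and all names here are ours.

    import numpy as np

    def bp_sweep(messages, evidence, prior):
        """One sweep of loopy BP for a generic pairwise model.

        messages[(i, j)] is the current message from i to j, a length-K
        distribution over z_j; evidence[(i, j)] is a K x K matrix M_ij
        with rows indexed by z_i and columns by z_j (both orientations
        stored); prior[i] is a length-K vector.
        """
        new_messages = {}
        for (i, j) in messages:
            # Cavity belief at i: prior times all incoming messages
            # except the one that came from j.
            belief = prior[i].copy()
            for (k, l) in messages:
                if l == i and k != j:
                    belief = belief * messages[(k, i)]
            # Sum over z_i through the edge evidence and normalize.
            m = belief @ evidence[(i, j)]
            new_messages[(i, j)] = m / m.sum()
        return new_messages

    def marginals(messages, evidence, prior):
        # mu_i(z) is proportional to the prior times all incoming messages.
        mu = {}
        for i in prior:
            belief = prior[i].copy()
            for (k, l) in messages:
                if l == i:
                    belief = belief * messages[(k, i)]
            mu[i] = belief / belief.sum()
        return mu

A practical implementation would index the incoming messages of each vertex rather than scanning all message keys, cache per-vertex products and work in log-space to avoid underflow; this is what yields the |$O((m+n)K^2)$| per-iteration cost quoted above.
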

Appendix D. Modifications for sparse-weighted graphs

We now consider modifications to our variational Bayes algorithm (Algorithm 1) and our BP algorithm (Algorithm 2) for the case of sparse-weighted graphs discussed in Section 2.1.

Recall that for a network of |$n$| nodes we can partition the |$n^2$| interactions into three disjoint edge lists |$W, N$| and |$M$|, where |$W$| is a list of weighted edges, |$N$| is a list of non-edges and |$M$| is a list of missing (unobserved) edges. We define the union |$E = W \cup N$| as the list of observed edges. Let |$m_W = |W|$| be the number of weighted edges, |$m_{\mathrm {E}} = |E|$| be the number of observed edges and |$m_{\mathrm {M}} = |M|$| be the number of missing edges. Note that |$m_{\mathrm {E}}+m_{\mathrm {M}} = |E| + |M| = |A| = n^2$|.

Both algorithms we presented require |$O(|E| K^2)$| time when updating |$\mu $|. If the number of observed edges is sparse (|$|E| = O(n)$|), then no changes are required. However, it may be the case that the number of weighted edges is sparse (|$|W| = O(n)$|), whereas the number of non-edges is dense (|$|N| = O(n^2)$|). In this case, if we assume the number of missing edges is also sparse (|$|M| = O(n)$|), then we can modify Algorithms 1 and 2 so that the running time is once again |$O(n K^2)$|. The key idea is to exploit the structure of our edge-existence distribution.

First we introduce some notation, then we consider the edge bundle |$\tau $| updates, and finally we introduce modifications to the group membership |$\mu $| updates.

Notation. We distinguish two types of degree: degree with respect to weighted edges and degree with respect to observed edges. Let |$d_W^{-}(i)$| and |$d_W^{+}(i)$| denote the in- and out-degree of vertex |$i$| with respect to weighted edges, and let |$d_{\mathrm {E}}^{-}(i)$| and |$d_{\mathrm {E}}^{+}(i)$| denote the corresponding in- and out-degrees with respect to observed edges.
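
For instance (a minimal sketch with hypothetical variable names), when the edge lists are stored as integer arrays these degrees can be accumulated in linear time:

    import numpy as np

    def in_out_degrees(edge_list, n):
        # edge_list: (m, 2) integer array of directed (source, target) pairs.
        d_out = np.bincount(edge_list[:, 0], minlength=n)
        d_in = np.bincount(edge_list[:, 1], minlength=n)
        return d_in, d_out

    # d_W from the weighted-edge list W, d_E from the observed-edge list E;
    # the toy edge list here is hypothetical.
    W = np.array([[0, 1], [1, 2], [2, 0]])
    d_W_in, d_W_out = in_out_degrees(W, n=3)
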

Let our exponential family edge-weight distribution |$f_{\mathrm {w}}$| with parameter |$\theta _{\mathrm {w}}$| take the form
|$f_{\mathrm {w}}(x \,|\, \theta _{\mathrm {w}}) = h_{\mathrm {w}}(x) \exp (T_{\mathrm {w}}(x) \cdot \eta _{\mathrm {w}}(\theta _{\mathrm {w}})),$|
where |$h_{\mathrm {w}}, T_{\mathrm {w}}$| and |$\eta _{\mathrm {w}}$| are fixed functions. Similarly, let our exponential family edge-existence distribution |$f_{\mathrm {e}}$| with parameter |$\theta _{\mathrm {e}}$| take the form
|$f_{\mathrm {e}}(x \,|\, \theta _{\mathrm {e}}) = h_{\mathrm {e}}(x) \exp (T_{\mathrm {e}}(x) \cdot \eta _{\mathrm {e}}(\theta _{\mathrm {e}})),$|
where |$h_{\mathrm {e}}, T_{\mathrm {e}}$| and |$\eta _{\mathrm {e}}$| are fixed functions.

Let |$R$| denote the mapping from pairs of groups |$(z,z')$| to edge bundles |$r$|.

D.1 Update for |$\tau $| (edge distribution)

The edge bundle updates consist of two steps: (i) calculating the expected sufficient statistic |$\langle T \rangle $| for each edge bundle and (ii) updating |$\tau $| for each edge bundle.

Weighted |$\tau _{\mathrm {w}}$|. For the weighted distribution, the expected sufficient statistic |$\langle T_{\mathrm {w}} \rangle _r$| for each edge bundle |$r$| can be calculated, for all pairs of groups |$(z,z')$|, as
(D.1)
Since the evaluation for each pair of groups is dominated by the summation over the set |$W$|, computing Equation (D.1) takes |$O(n+m_W)$| time per pair and |$O(K^2 (n+m_W))$| time in total.
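
The following Python sketch illustrates our reading of this accumulation; the function and argument names are ours, and a real implementation would vectorize the loop over edges:

    import numpy as np

    def expected_T_w(W, weights, mu, R, n_bundles, T_w):
        """Accumulate <T_w>_r over weighted edges (sketch of Eq. (D.1)).

        W: (m_W, 2) weighted-edge array; weights: (m_W,) edge weights;
        mu: (n, K) membership probabilities; R: (K, K) integer map from
        group pairs to bundles; T_w: weight -> sufficient-statistic vector.
        """
        K = mu.shape[1]
        d = T_w(weights[0]).shape[0]
        acc = np.zeros((K, K, d))
        for (i, j), w in zip(W, weights):
            # Each edge contributes mu_i(z) mu_j(z') T_w(A_ij) to (z, z').
            acc += np.outer(mu[i], mu[j])[:, :, None] * T_w(w)
        # Aggregate group pairs into edge bundles via the map R.
        out = np.zeros((n_bundles, d))
        np.add.at(out, R.ravel(), acc.reshape(K * K, d))
        return out

The edge-bundle update from the main text is then simply |$\tau _r = \tau ^{(0)}_r + \langle T \rangle _r$|.
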

Edge-existence |$\tau _{\mathrm {e}}$|. To update |$T_{\mathrm {e}}$| we note that the sufficient statistic of a non-edge is typically zero in all but the last dimension, which takes the value |$1$| for every observed edge. Knowing that this value is |$1$| for all observed edges lets us calculate |$T_{\mathrm {e}}$| without needing to sum over all of |$W \cup N$|.

Therefore, we update |$T_{\mathrm {e}}$| using Equation (D.1) for all but the last dimension of |$T_{\mathrm {e}}$|. For the last dimension, we update |$T_{\mathrm {e}}$| with
(D.2)
which takes |$O(K^2 (n+m_M))$| time.
Degree-corrected edge-existence |$\tau _{\mathrm {e}}$|. For the degree-corrected block model, recall that the edge-existence distribution is modified slightly by replacing |$T_{\mathrm {e}}(A_{ij}) = 1$| in the last dimension of |$T_{\mathrm {e}}$| with the product of |$i$|'s out-degree and |$j$|'s in-degree, |$T_{\mathrm {e}}(A_{ij}) = d_W^{+}(i)\, d_W^{-}(j)$|. This changes Equation (D.2) by replacing |$\mu _i(z)$| with |$d_W^{+}(i)\mu _i(z)$| and |$\mu _j(z')$| with |$d_W^{-}(j)\mu _j(z')$|. This gives us
(D.3)
The running time remains the same as in the edge-existence case.
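
A compact sketch of this subtraction trick, covering both the plain and degree-corrected variants (our code, with the handling of self-pairs left aside since it depends on whether the model permits self-edges):

    import numpy as np

    def last_dim_T_e(mu, M_list, d_out=None, d_in=None):
        """Last dimension of <T_e> for all group pairs (Eqs. (D.2)-(D.3)).

        mu: (n, K) membership probabilities; M_list: (m_M, 2) array of
        missing pairs. Passing d_out and d_in gives the degree-corrected
        variant.
        """
        a = mu if d_out is None else mu * d_out[:, None]
        b = mu if d_in is None else mu * d_in[:, None]
        # Sum over all n^2 pairs, as an outer product of column sums ...
        total = np.outer(a.sum(axis=0), b.sum(axis=0))   # K x K
        # ... minus the contribution of each missing pair.
        for (i, j) in M_list:
            total -= np.outer(a[i], b[j])
        return total

The column sums cost |$O(nK)$|, the outer product |$O(K^2)$| and the correction loop |$O(K^2 m_M)$|, matching the |$O(K^2(n+m_M))$| bound above.
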

D.2 Update for |$\mu $| (vertex labels)

Variational Bayes algorithm. The update for the vertex labels under the variational Bayes algorithm is to (i) calculate |${\partial \langle T \rangle _r}/{\partial \mu _i(z)}$| and (ii) update |$\mu _i$| using

The rate-limiting step is calculating |${\partial \langle T \rangle _r}/{\partial \mu _i(z)}$|⁠.

For the weighted sufficient statistics |$T_{\mathrm {w}}$|⁠, we calculate for all pairs |$(z,z')$| and for each vertex |$i$|
(D.4)
where |$\partial i^{+}_W$| is the neighborhood formed by the outgoing weighted edges of vertex |$i$|. Since the sum in Equation (D.4) is over |$d_W^{+}(i)$| terms, the running time is |$O(K^2\sum _i d_W^{+}(i)) = O(K^2(n+m_W))$|.
Similarly to how we updated |$\tau _{\mathrm {e}}$|, in the edge-existence case we update |${\partial \langle T \rangle _r}/{\partial \mu _i(z)}$| by calculating the entire sum and then subtracting the contribution of the missing edges. Again, we exploit the fact that the last dimension of |$T_{\mathrm {e}}$| is |$1$| for observed edges, and
(D.5)
Calculating Equation (D.5) for all vertices has a total |$O(K^2(n+m_M))$| running time if we pre-calculate |$\sum _j \mu _j(z')$|⁠.

For the degree-corrected block model, we replace |$\mu _j(z')$| with |$d^{-}_W(j)\mu _j(z')$| and use Equation (D.5).

Loopy BP algorithm. The update for the vertex labels under the BP algorithm requires us to (i) calculate the marginal evidence from each edge |$M_{ij}(z,z')$|⁠, (ii) update messages |$\varphi _{j\rightarrow i}(z_i)$| between weighted edges, (iii) approximate messages |$\varphi _{\rightarrow i}(z_i) = \mu _i(z_i)$| between non-edges and (iv) calculate the vertex-label probabilities |$\mu _i$|⁠.

We calculate the marginal evidence |$M$|
(D.6)
for each weighted edge |$ij \in W$| for all |$z,z'$|⁠. This takes |$O(K^2 m_W)$| time. Note that |$M_{ij} = M_{ji}$|⁠. For the non-edges, we again exploit the fact that the last dimension of |$T_{\mathrm {e}}$| is 1 for observed edges and only need to calculate |$M_{ij} = M_N$| once using Equation (D.6) for all non-edges |$ij \in N$|⁠.
The messages between weighted edges are
(D.7)
Each step requires |$O(|\partial i_W| K^2)$| calculations. In the case of a sparse graph, |$|\partial i_W| = O(1)$|, and since we repeat this step for each pair |$i,j$| in |$W$|, the overall running time is |$O(K^2 m_W)$|.

Since there are |$O(n^2)$| non-edges, the messages between non-edges must be approximated for our algorithm to be efficient. The idea behind this approximation is to exploit the sparsity of the weighted edges.

To be concrete, suppose we select the Bernoulli distribution as our edge-existence distribution |$f_{\mathrm {e}}(x\,|\,p)$|. Then the marginal evidence takes the form
(D.8)
where |$p$| is the edge-existence parameter |$\theta _{\mathrm {e}}$|. If the graph is sparse, then |$\langle \log (1-p)\rangle _r = O(1/n)$|. Thus, for non-edges |$ij \in N$|, we have |$\tilde {M}_{ij} \approx 1$|, and the messages between non-edges can therefore be approximated as
(D.9)
Thus, we can approximate all messages between non-edges |${\varphi }_{i \rightarrow j}(z')$| by their marginal distributions |$\mu _j(z')$|, requiring only |$O(nK)$| space and time.
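
Concretely (our sketch, with hypothetical names), the shared non-edge evidence is computed once per iteration and non-edge messages are never stored:

    import numpy as np

    def non_edge_evidence(expected_log_1mp, R):
        # M_N(z, z') = exp(<log(1 - p)>_{R(z,z')}); in a sparse graph each
        # entry is 1 + O(1/n). expected_log_1mp is a length-n_bundles
        # array of <log(1 - p)>_r values; R is the (K, K) bundle map.
        return np.exp(expected_log_1mp[R])

    # Wherever the update equations call for phi_{i -> j}(z') along a
    # non-edge, we substitute the current marginal mu_j(z'), so only the
    # O(nK) marginals themselves need to be kept.
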

The Poisson and degree-corrected cases are more complicated and should follow along the lines of Ref. [23]. This extension is left for future work.

Given the messages and marginal evidence, we calculate the vertex-label probabilities with
(D.10)
where |$\partial i_{\mathrm {E}}$| is the set of observed edges incident to |$i$| (in and out) and |$M_N$| is the marginal evidence of a non-edge. Each of these updates takes |$O(|\partial i_W| K^2)$| calculations, where |$\partial i_W$| is the in- and out-neighborhood of |$i$|. In the case of a sparse graph, |$|\partial i_W| = O(1)$|, and since we repeat this step for each vertex, the overall running time is |$O(K^2 m_W)$|.

In conclusion, all three steps take |$O(nK^2)$| time when both the weighted edges and the missing edges are sparse (|$|W| = O(n)$| and |$|M| = O(n)$|). Although the variational Bayes algorithm and the loopy BP algorithm have the same asymptotic running time, the constant in front of |$O(nK^2)$| for the loopy BP algorithm depends on the average weighted degree of the network.

References

1. Fortunato S. Community detection in graphs. Phys. Rep., 2010, vol. 486, pp. 75–174.
2. Holland P., Laskey K., Leinhardt S. Stochastic blockmodels: first steps. Soc. Netw., 1983, vol. 5, pp. 109–137.
3. Wang Y. J., Wong G. Y. Stochastic blockmodels for directed graphs. J. Am. Stat. Assoc., 1987, vol. 82, pp. 8–19.
4. Newman M. E. J. Mixing patterns in networks. Phys. Rev. E, 2003, vol. 67, 026126.
5. Newman M. E. J. Networks: An Introduction. Oxford University Press, Oxford/New York, 2010.
6. Clauset A., Moore C., Newman M. E. J. Hierarchical structure and the prediction of missing links in networks. Nature, 2008, vol. 453, pp. 98–101.
7. Park Y., Moore C., Bader J. Dynamic networks from hierarchical Bayesian graph clustering. PLoS ONE, 2010, vol. 5, e8118.
8. Karrer B., Newman M. E. J. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 2011, vol. 83, 016107.
9. Larremore D. B., Clauset A., Jacobs A. Z. Efficiently inferring community structure in bipartite networks. Pre-print, arXiv:1403.2933, 2014.
10. Guimerà R., Sales-Pardo M. A network inference method for large-scale unsupervised identification of novel drug-drug interactions. PLoS Comput. Biol., 2013, vol. 9, e1003374.
11. Celisse A., Daudin J.-J., Pierre L. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat., 2012, vol. 6, pp. 1847–1899.
12. Ambroise C., Matias C. New consistent and asymptotically normal parameter estimates for random-graph mixture models. J. R. Stat. Soc. Ser. B, 2012, vol. 74, pp. 3–35.
13. Allman E. S., Matias C., Rhodes J. A. Parameter identifiability in a class of random graph mixture models. J. Statist. Plann. Inference, 2011, vol. 141, pp. 1719–1736.
14. Thomas A. C., Blitzstein J. K. Valued ties tell fewer lies: why not to dichotomize network edges with thresholds. Pre-print, arXiv:1101.0788, 2011.
15. Aicher C., Jacobs A. Z., Clauset A. Adapting the stochastic block model to edge-weighted networks. ICML Workshop on Structured Learning (SLG 2013), 2013.
16. Attias H. A variational Bayesian framework for graphical models. Adv. Neural Info. Proc. Sys. 12, MIT Press, 2000, pp. 209–215.
17. Hofman J., Wiggins C. Bayesian approach to network modularity. Phys. Rev. Lett., 2008, vol. 100, 258701.
18. O'Hagan A. Kendall's Advanced Theory of Statistics: Bayesian Inference, vol. 2B, 1st edn. Wiley, New York, 2004.
19. Robert C. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer, New York, 2007.
20. Berger J. O., Bernardo J. M. On the development of reference priors. Bayesian Statistics, 1992, vol. 4, pp. 35–60.
21. Yedidia J., Freeman W., Weiss Y. Understanding belief propagation and its generalizations. Explor. Artif. Intell. Millenn., 2003, vol. 8, pp. 236–239.
22. Decelle A., Krzakala F., Moore C., Zdeborová L. Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett., 2011, vol. 107, 65701.
23. Yan X., Jensen J. E., Krzakala F., Moore C., Shalizi C. R., Zdeborová L., Zhang P., Zhu Y. Model selection for degree-corrected block models. Pre-print, arXiv:1207.3994, 2012.
24. Peixoto T. Parsimonious module inference in large networks. Phys. Rev. Lett., 2013, vol. 110, 148701.
25. Côme E., Latouche P. Model selection and clustering in stochastic block models with the exact integrated complete data likelihood. Pre-print, arXiv:1303.2962, 2013.
26. STATS LLC, 2014.
27. Merritt S., Clauset A. Scoring dynamics across professional team sports: tempo, balance and predictability. EPJ Data Sci., 2014, vol. 3, 4.
28. Guimerà R., Sales-Pardo M. Missing and spurious interactions and the reconstruction of complex networks. Proc. Natl. Acad. Sci. USA, 2009, vol. 106, pp. 22073–22078.
29. Clauset A., Moore C., Newman M. E. Structural inference of hierarchies in networks. Lecture Notes in Computer Science, vol. 4503. Springer, Berlin/Heidelberg, 2007.
30. Colizza V., Pastor-Satorras R., Vespignani A. Reaction–diffusion processes and metapopulation models in heterogeneous networks. Nat. Phys., 2007, vol. 3, pp. 276–282.
31. Pan R. K., Kaski K., Fortunato S. World citation and collaboration networks: uncovering the role of geography in science. Sci. Rep., 2012, vol. 2, 902.
32. Porter M. A., Mucha P. J., Newman M., Warmbrand C. M. A network analysis of committees in the United States House of Representatives. Proc. Natl. Acad. Sci. USA, 2005, vol. 102, pp. 7057–7062.
33. Opsahl T., Panzarasa P. Clustering in weighted networks. Soc. Netw., 2009, vol. 31, pp. 155–163.
34. Peixoto T. Hierarchical block structures and high-resolution model selection in large networks. Phys. Rev. X, 2014, vol. 4, 011047.
35. Airoldi E. M., Blei D. M., Fienberg S. E., Xing E. P. Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 2008, vol. 9, pp. 1981–2014.
36. Peel L., Clauset A. Detecting change points in the large-scale structure of evolving networks. Pre-print, arXiv:1403.0989, 2014.
37. Zhang P., Krzakala F., Reichardt J., Zdeborová L. Comparative study for inference of hidden classes in stochastic block models. J. Stat. Mech., 2012, vol. 2012, P12021.
