5.2.2 ConvGNNs
These networks are closely related to recurrent GNNs. Instead of iterating node states with contractive constraints, they address the cyclic mutual dependencies architecturally using a fixed number of layers with different weights in each layer, as illustrated in Figure 5.2a. This key distinction from recurrent GNNs is illustrated in Figures 5.2b and 5.2c. As graph convolutions are more efficient and convenient to compose with other neural networks, the popularity of ConvGNNs has been rapidly growing in recent years. These networks fall into two categories, spectral‐based and spatial‐based. Spectral‐based approaches define graph convolutions by introducing filters from the perspective of graph signal processing [44], where the graph convolutional operation is interpreted as removing noise from graph signals. Spatial‐based approaches inherit ideas from RecGNNs to define graph convolutions by information propagation. Since GCN [14] bridged the gap between spectral‐based approaches and spatial‐based approaches, spatial‐based methods have developed rapidly recently due to their attractive efficiency, flexibility, and generality.
Spectral‐based ConvGNNs assume graphs to be undirected. The normalized graph Laplacian matrix is a mathematical representation of an undirected graph, defined as L = I_n − D^{−1/2} A D^{−1/2}, where D is a diagonal matrix of node degrees, D_{ii} = ∑_j A_{i,j}. For details, see Appendix 5.A. The normalized graph Laplacian matrix possesses the property of being real symmetric positive semidefinite. With this property, the normalized Laplacian matrix can be factored as L = UΛU^T, where U = [u_0, u_1, ⋯, u_{n−1}] ∈ R^{n×n} is the matrix of eigenvectors ordered by eigenvalues, and Λ is the diagonal matrix of eigenvalues (spectrum), Λ_{ii} = λ_i. The eigenvectors of the normalized Laplacian matrix form an orthonormal space, in mathematical words U^T U = I. In graph signal processing, a graph signal x ∈ R^n is a feature vector of all nodes of a graph, where x_i is the value of the i‐th node. The graph Fourier transform of a signal x is defined as F(x) = U^T x = x̂, and the inverse graph Fourier transform is defined as F^{−1}(x̂) = Ux̂, where x̂ represents the resulting signal from the graph Fourier transform. For more details, see Appendix 5.A. The graph Fourier transform projects the input graph signal to the orthonormal space where the basis is formed by eigenvectors of the normalized graph Laplacian. Elements of the transformed signal x̂ are the coordinates of the graph signal in the new space, so that the input signal can be represented as x = ∑_i x̂_i u_i = Ux̂, which is exactly the inverse graph Fourier transform. Now, the graph convolution *G of the input signal x with a filter g ∈ R^n is defined as

x *G g = F^{−1}(F(x) ⊙ F(g)) = U(U^T x ⊙ U^T g)
where ⊙ denotes the element‐wise product. If we denote a filter as gθ = diag(U^T g), then the spectral graph convolution *G is simplified as

x *G gθ = U gθ U^T x (5.37)
Spectral‐based ConvGNNs all follow this definition. The key difference lies in the choice of the filter gθ.
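As a small numerical sketch of Eq. (5.37) (an illustration, not part of the original formulation), the following NumPy fragment builds the normalized Laplacian of a toy four‐node graph, computes its eigendecomposition, and filters a graph signal with a randomly initialized diagonal filter gθ; the graph, the signal, and the filter values are illustrative assumptions.

# Minimal sketch of the spectral graph convolution in Eq. (5.37) using NumPy.
# The graph, its signal x, and the filter parameters are illustrative assumptions.
import numpy as np

# Adjacency matrix of a small undirected graph (4 nodes).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

# Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}.
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# Eigendecomposition L = U diag(lam) U^T (eigh returns ascending eigenvalues).
lam, U = np.linalg.eigh(L)

# A graph signal (one scalar per node) and a diagonal spectral filter g_theta.
x = np.array([1.0, 2.0, 3.0, 4.0])
g_theta = np.diag(np.random.randn(len(A)))   # learnable in practice

# Eq. (5.37): x *_G g_theta = U g_theta U^T x.
y = U @ g_theta @ U.T @ x
print(y)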
Spectral Convolutional Neural Network (Spectral CNN) [12] assumes the filter is a set of learnable parameters and considers graph signals with multiple channels. The graph convolutional layer of Spectral CNN is defined as
H^{(k)}_{:,j} = σ( ∑_{i=1}^{f_{k−1}} U Θ^{(k)}_{i,j} U^T H^{(k−1)}_{:,i} ), j = 1, 2, …, f_k (5.38)
where k is the layer index, H^{(k−1)} ∈ R^{n×f_{k−1}} is the input graph signal, H^{(0)} = X, f_{k−1} is the number of input channels and f_k is the number of output channels, and Θ^{(k)}_{i,j} is a diagonal matrix filled with learnable parameters. Due to the eigendecomposition of the Laplacian matrix, spectral CNN faces three limitations: (i) any perturbation to a graph results in a change of eigenbasis; (ii) the learned filters are domain dependent, meaning they cannot be applied to a graph with a different structure; and (iii) eigendecomposition requires O(n^3) computational complexity. In follow‐up works, ChebNet [45] and GCN [14, 46] reduced the computational complexity to O(m), where m is the number of edges, by making several approximations and simplifications.
Chebyshev Spectral CNN (ChebNet) [45] approximates the filter gθ by Chebyshev polynomials of the diagonal matrix of eigenvalues, that is, gθ = ∑_{i=0}^{K} θ_i T_i(Λ̃), where Λ̃ = 2Λ/λ_max − I_n, and the values of Λ̃ lie in [−1, 1]. The Chebyshev polynomials are defined recursively by T_i(x) = 2xT_{i−1}(x) − T_{i−2}(x), with T_0(x) = 1 and T_1(x) = x. As a result, the convolution of a graph signal x with the defined filter gθ is
x *G gθ = U( ∑_{i=0}^{K} θ_i T_i(Λ̃) ) U^T x (5.39)
where L̃ = 2L/λ_max − I_n. As T_i(L̃) = U T_i(Λ̃) U^T, which can be proved by induction on i, ChebNet takes the form

x *G gθ = ∑_{i=0}^{K} θ_i T_i(L̃) x (5.40)
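As a minimal sketch of the Chebyshev filtering in Eq. (5.40) (an illustration, not part of the original text), the fragment below applies the recursion T_i(L̃)x = 2L̃ T_{i−1}(L̃)x − T_{i−2}(L̃)x directly to a signal on a toy graph; the graph, the signal, and the coefficients θ_i are illustrative assumptions.

# Sketch of the ChebNet filtering of Eq. (5.40): x *_G g_theta = sum_i theta_i T_i(L~) x,
# with T_i computed by the recursion T_i(x) = 2 x T_{i-1}(x) - T_{i-2}(x).
# The example graph and theta values are illustrative assumptions.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

lam_max = np.linalg.eigvalsh(L).max()
L_tilde = 2.0 * L / lam_max - np.eye(len(A))   # rescaled Laplacian, spectrum in [-1, 1]

x = np.array([1.0, 0.0, 2.0, -1.0])
K = 3
theta = np.random.randn(K + 1)                 # learnable filter coefficients

# Chebyshev recursion applied directly to the signal:
# T_0 x = x, T_1 x = L~ x, T_i x = 2 L~ (T_{i-1} x) - T_{i-2} x.
Tx = [x, L_tilde @ x]
for i in range(2, K + 1):
    Tx.append(2.0 * L_tilde @ Tx[-1] - Tx[-2])

y = sum(theta[i] * Tx[i] for i in range(K + 1))
print(y)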
As an improvement over spectral CNN, the filters defined by ChebNet are localized in space, which means filters can extract local features independently of the graph size. The spectrum of ChebNet is mapped to [−1, 1] linearly. CayleyNet [47] further applies Cayley polynomials, which are parametric rational complex functions, to capture narrow frequency bands. The spectral graph convolution of CayleyNet is defined as
x *G gθ = c_0 x + 2Re{ ∑_{j=1}^{r} c_j (hL − iI)^j (hL + iI)^{−j} x } (5.41)
where Re(·) returns the real part of a complex number, c_0 is a real coefficient, c_j is a complex coefficient, i is the imaginary unit, and h is a parameter that controls the spectrum of a Cayley filter. While preserving spatial locality, CayleyNet shows that ChebNet can be considered as a special case of CayleyNet.
GCN [14] introduces a first‐order approximation of ChebNet. Assuming K = 1 and λmax = 2, Eq. (5.40) is simplified as
x *G gθ = θ_0 x − θ_1 D^{−1/2} A D^{−1/2} x (5.42)
To restrict the number of parameters and avoid overfitting, GCN further assumes θ = θ_0 = −θ_1, leading to the following definition of a graph convolution:

x *G gθ = θ( I_n + D^{−1/2} A D^{−1/2} ) x (5.43)
To allow multiple channels of inputs and outputs, GCN modifies Eq. (5.43) into a compositional layer, defined as

H = X *G gΘ = f( ĀXΘ ) (5.44)
where Ā = I_n + D^{−1/2} A D^{−1/2} and f(·) is an activation function. Using I_n + D^{−1/2} A D^{−1/2} empirically causes numerical instability in GCN. To address this problem, GCN applies a normalization to replace Ā = I_n + D^{−1/2} A D^{−1/2} by Ā = D̃^{−1/2} Ã D̃^{−1/2}, with Ã = A + I_n and D̃_{ii} = ∑_j Ã_{i,j}. Being a spectral‐based method, GCN can also be interpreted as a spatial‐based method. From a spatial‐based perspective, GCN can be considered as aggregating feature information from a node’s neighborhood. Equation (5.44) can be expressed as
h_v = f( Θ^T ( ∑_{u ∈ {v} ∪ N(v)} Ā_{v,u} x_u ) ), ∀v ∈ V (5.45)
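A minimal sketch of the GCN layer of Eq. (5.44) with the normalization Ā = D̃^{−1/2} Ã D̃^{−1/2} is given below; the toy graph, feature matrix, weight matrix Θ, and the choice of ReLU for f(·) are illustrative assumptions.

# Sketch of the GCN propagation rule of Eq. (5.44) with the renormalization trick,
# H = f(D~^{-1/2} A~ D~^{-1/2} X Theta). Graph, features, and weights are illustrative.
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                         # n x f0 node features

A_tilde = A + np.eye(len(A))                       # add self-loops: A~ = A + I
d_tilde = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt          # D~^{-1/2} A~ D~^{-1/2}

Theta = np.random.randn(2, 4)                      # f0 x f1 learnable weights
H = np.maximum(A_hat @ X @ Theta, 0.0)             # f(.) taken as ReLU here
print(H.shape)                                     # (3, 4)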
Several recent works made incremental improvements over GCN [14] by exploring alternative symmetric matrices.
Adaptive Graph Convolutional Network (AGCN) [16] learns hidden structural relations unspecified by the graph adjacency matrix. It constructs a so‐called residual graph adjacency matrix through a learnable distance function that takes two nodes’ features as inputs.
DGCN [17] introduces a dual graph convolutional architecture with two graph convolutional layers in parallel. While these two layers share parameters, they use the normalized adjacency matrix and the positive pointwise mutual information (PPMI) matrix, which captures nodes’ co‐occurrence information through random walks sampled from a graph. The PPMI matrix is defined as
PPMI_{v1,v2} = max( log( count(v1, v2) · |D| / (count(v1) count(v2)) ), 0 ) (5.46)
where v1, v2 ∈ V, |D| = ∑_{v1,v2} count(v1, v2), and the count(·) function returns the frequency with which node v and/or node u co‐occur/occur in sampled random walks. By ensembling outputs from the dual graph convolutional layers, DGCN encodes both local and global structural information without the need to stack multiple graph convolutional layers.
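The following rough sketch illustrates how a PPMI matrix in the spirit of Eq. (5.46) can be estimated from sampled random walks; the walk length, the number of walks per node, the co‐occurrence window, and the toy graph are illustrative assumptions rather than the settings used by DGCN.

# Sketch of building a PPMI matrix (Eq. (5.46)) from random walks.
# Walk length, number of walks, and window size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = len(A)

def random_walk(start, length):
    walk = [start]
    for _ in range(length - 1):
        neighbors = np.flatnonzero(A[walk[-1]])
        walk.append(rng.choice(neighbors))
    return walk

# Count co-occurrences of node pairs inside a sliding window over sampled walks.
window, counts = 2, np.zeros((n, n))
for start in range(n):
    for _ in range(20):                      # 20 walks of length 8 per node
        walk = random_walk(start, 8)
        for i, v in enumerate(walk):
            for u in walk[max(0, i - window): i + window + 1]:
                counts[v, u] += 1

# PPMI_{v,u} = max(log(count(v,u) * |D| / (count(v) count(u))), 0).
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    ppmi = np.maximum(np.log(counts * total / (row * col)), 0.0)
print(np.round(ppmi, 2))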
Spatial‐based ConvGNNs: Analogous to the convolutional operation of a conventional CNN on an image, spatial‐based methods define graph convolutions based on a node’s spatial relations. Images can be considered as a special form of graph with each pixel representing a node. Each pixel is directly connected to its nearby pixels, as illustrated in Figure 5.3a (adopted from [38]). A filter is applied to a 3 × 3 patch by taking the weighted average of pixel values of the central node and its neighbors across each channel. Similarly, the spatial‐based graph convolutions convolve the central node’s representation with its neighbors’ representations to derive the updated representation for the central node, as illustrated in Figure 5.3b. From another perspective, spatial‐based ConvGNNs share the same idea of information propagation/message passing with RecGNNs. The spatial graph convolutional operation essentially propagates node information along edges.
Neural Network for Graphs (NN4G) [48]: Proposed in parallel with GNN*, this is the first work toward spatial‐based ConvGNNs. Distinctively different from RecGNNs, NN4G learns graph mutual dependency through a compositional neural architecture with independent parameters at each layer. The neighborhood of a node can be extended through incremental construction of the architecture. NN4G performs graph convolutions by summing up a node’s neighborhood information directly. It also applies residual connections and skip connections to memorize information over each layer. As a result, NN4G derives its next‐layer node states by
h_v^{(k)} = f( W^{(k)T} x_v + ∑_{i=1}^{k−1} ∑_{u ∈ N(v)} Θ^{(k)T} h_u^{(i)} ) (5.47)
where f(·) is an activation function and h_v^{(0)} = 0. The equation can also be written in a matrix form:
H^{(k)} = f( XW^{(k)} + ∑_{i=1}^{k−1} A H^{(i)} Θ^{(k)} ) (5.48)
which resembles the form of GCN [14]. One difference is that NN4G uses the unnormalized adjacency matrix, which may potentially cause hidden node states to have extremely different scales.
Contextual Graph Markov Model (CGMM) [20] proposes a probabilistic model inspired by NN4G. While maintaining spatial locality, CGMM has the benefit of probabilistic interpretability.
Diffusion convolutional neural network (DCNN) [16] regards graph convolutions as a diffusion process. It assumes that information is transferred from one node to one of its neighboring nodes with a certain transition probability so that information distribution can reach equilibrium after several rounds. DCNN defines the diffusion graph convolution (DGC) as
H^{(k)} = f( W^{(k)} ⊙ P^k X ) (5.49)
where f(·) is an activation function, and the probability transition matrix P ∈ R^{n×n} is computed by P = D^{−1} A. Note that in DCNN, the hidden representation matrix H^{(k)} has the same dimensions as the input feature matrix X and is not a function of the previous hidden representation matrix H^{(k−1)}. DCNN concatenates H^{(1)}, H^{(2)}, ⋯, H^{(K)} together as the final model outputs. As the stationary distribution of a diffusion process is the sum of the power series of probability transition matrices, DGC [49] sums up outputs at each diffusion step instead of concatenating them. It defines the DGC by

H = ∑_{k=0}^{K} f( P^k X W^{(k)} ) (5.50)
where W^{(k)} ∈ R^{D×F} and f(·) is an activation function. Using the power of a transition probability matrix implies that distant neighbors contribute very little information to a central node.
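The sketch below contrasts the two diffusion‐based convolutions on a toy graph: DCNN keeps one hidden matrix per diffusion step (Eq. (5.49)) and concatenates them, while DGC sums the step outputs (Eq. (5.50)); the graph, the features, the weights, and the broadcasting of DCNN’s per‐step weights across nodes are illustrative assumptions.

# Sketch of the two diffusion-based convolutions: DCNN (Eq. (5.49)) keeps one
# hidden matrix per diffusion step, while DGC (Eq. (5.50)) sums the step outputs.
# DCNN's weights are taken here as one vector per step, shared across nodes and
# broadcast in the element-wise product (an assumption of this sketch).
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
n, D, F, K = 4, 3, 2, 3
X = np.random.randn(n, D)
P = np.diag(1.0 / A.sum(axis=1)) @ A             # transition matrix P = D^{-1} A
relu = lambda M: np.maximum(M, 0.0)

# DCNN: H^{(k)} = f(W^{(k)} ⊙ P^k X); concatenate H^{(1)}, ..., H^{(K)}.
W_dcnn = [np.random.randn(1, D) for _ in range(K)]
Pk = np.eye(n)
H_steps = []
for k in range(K):
    Pk = Pk @ P                                   # P^{k+1}
    H_steps.append(relu(W_dcnn[k] * (Pk @ X)))
H_dcnn = np.concatenate(H_steps, axis=1)          # n x (K*D)

# DGC: H = sum_{k=0}^{K} f(P^k X W^{(k)}), with W^{(k)} in R^{D x F}.
W_dgc = [np.random.randn(D, F) for _ in range(K + 1)]
H_dgc = sum(relu(np.linalg.matrix_power(P, k) @ X @ W_dgc[k]) for k in range(K + 1))
print(H_dcnn.shape, H_dgc.shape)                  # (4, 9) (4, 2)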
Parametric graph convolution (PGC‐DGCNN) [12, 39] increases the contributions of distant neighbors based on shortest paths. It defines a shortest‐path adjacency matrix S^{(j)}: if the shortest path from a node v to a node u is of length j, then S^{(j)}_{v,u} = 1; otherwise, it is 0. With a hyperparameter r to control the receptive field size, PGC‐DGCNN introduces a graph convolutional operation as follows:
H^{(k)} = ∥_{j=0}^{r} f( (D̃^{(j)})^{−1} S^{(j)} H^{(k−1)} W^{(j,k)} ) (5.51)
where D̃^{(j)}_{ii} = ∑_l S^{(j)}_{i,l}, H^{(0)} = X, and ∥ represents the concatenation of vectors. The calculation of the shortest‐path adjacency matrix can be expensive, with complexity up to O(n^3). An illustration of the parameter r is given in Figure 5.4 [39].
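As an illustration of the shortest‐path adjacency matrices S^{(j)} used above (not part of the original text), the following sketch computes them by breadth‐first search on a toy path graph; the graph and the choice r = 2 are illustrative assumptions.

# Sketch of building the shortest-path adjacency matrices S^{(j)} used by
# PGC-DGCNN, via breadth-first search from each node.
import numpy as np
from collections import deque

A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
n, r = len(A), 2                                  # receptive-field hyperparameter r

def bfs_distances(source):
    dist = np.full(n, -1)
    dist[source] = 0
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in np.flatnonzero(A[v]):
            if dist[u] < 0:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

dist = np.stack([bfs_distances(v) for v in range(n)])   # all-pairs shortest paths
S = [(dist == j).astype(float) for j in range(r + 1)]   # S^{(0)}, ..., S^{(r)}
print(S[2])    # 1 where the shortest path between two nodes has length 2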
Partition graph convolution (PGC) [50] partitions a node’s neighbors into Q groups based on certain criteria not limited to shortest paths. PGC constructs Q adjacency matrices according to the neighborhood defined by each group. Then, PGC applies GCN [14] with a different parameter matrix to each neighbor group and sums the results:
H^{(k)} = ∑_{j=1}^{Q} Ā^{(j)} H^{(k−1)} W^{(j,k)} (5.52)
where H^{(0)} = X, Ā^{(j)} = (D̃^{(j)})^{−1/2} Ã^{(j)} (D̃^{(j)})^{−1/2}, and Ã^{(j)} = A^{(j)} + I.
The Message Passing Neural Network (MPNN) [51] outlines a general framework of spatial‐based ConvGNNs. It treats graph convolutions as a message passing process in which information can be passed from one node to another along edges directly. MPNN runs K‐step message passing iterations to let information propagate further. The message passing function (namely the spatial graph convolution) is defined as
h_v^{(k)} = U_k( h_v^{(k−1)}, ∑_{u ∈ N(v)} M_k( h_v^{(k−1)}, h_u^{(k−1)}, x^e_{vu} ) ) (5.53)
where h_v^{(0)} = x_v, and U_k(·) and M_k(·) are functions with learnable parameters. After deriving the hidden representations of each node, h_v^{(K)} can be passed to an output layer to perform node‐level prediction tasks or to a readout function to perform graph‐level prediction tasks. The readout function generates a representation of the entire graph based on node hidden representations. It is generally defined as
h_G = R( h_v^{(K)} | v ∈ G ) (5.54)
where R(·) represents the readout function with learnable parameters. MPNN can cover many existing GNNs by assuming different forms of Uk(·), Mk(·), and R(·).
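A minimal sketch of one message‐passing step (Eq. (5.53)) followed by a sum readout (Eq. (5.54)) is given below; the choice of simple linear maps with ReLU for M_k(·) and U_k(·), the omission of edge features x^e_{vu}, and the toy graph are illustrative assumptions.

# Minimal sketch of one MPNN message-passing step followed by a sum readout.
# M_k and U_k are taken here as simple linear maps with ReLU; edge features are omitted.
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.random.randn(3, 4)                       # initial node features h_v^{(0)} = x_v
W_msg = np.random.randn(8, 5)                   # message weights (assumed shapes)
W_upd = np.random.randn(9, 4)                   # update weights (assumed shapes)
relu = lambda M: np.maximum(M, 0.0)

def message(h_v, h_u):                          # M_k: depends on both endpoints
    return relu(np.concatenate([h_v, h_u]) @ W_msg)

def update(h_v, m_v):                           # U_k: combines node state and message
    return relu(np.concatenate([h_v, m_v]) @ W_upd)

H = X.copy()
H_new = np.zeros_like(H)
for v in range(len(A)):
    m_v = sum(message(H[v], H[u]) for u in np.flatnonzero(A[v]))   # aggregate messages
    H_new[v] = update(H[v], m_v)

h_G = H_new.sum(axis=0)                         # readout: graph-level representation
print(h_G.shape)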
Graph Isomorphism Network (GIN): Reference [52] finds that previous MPNN‐based methods are incapable of distinguishing different graph structures based on the graph embeddings they produce. To amend this drawback, GIN adjusts the weight of the central node by a learnable parameter ε^{(k)}. It performs graph convolutions by
h_v^{(k)} = MLP( (1 + ε^{(k)}) h_v^{(k−1)} + ∑_{u ∈ N(v)} h_u^{(k−1)} ) (5.55)
where MLP(·) represents a multi‐layer perceptron. As the number of neighbors of a node can vary from one to a thousand or even more, it is inefficient to take the full size of a node’s neighborhood. GraphSAGE [23] adopts sampling to obtain a fixed number of neighbors for each node. It performs graph convolutions by
h_v^{(k)} = σ( W^{(k)} · f_k( h_v^{(k−1)}, { h_u^{(k−1)}, ∀u ∈ S_{N(v)} } ) ) (5.56)
where f_k(·) is an aggregation function and S_{N(v)} is a random sample of the node v’s neighbors. The aggregation function should be invariant to permutations of node orderings, such as the mean, sum, or max function.
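The following sketch implements one GraphSAGE‐style layer in the spirit of Eq. (5.56), with neighbor sampling and a mean aggregator that concatenates the node’s own representation with the aggregated one; the sample size, the toy graph, the weights, and this particular aggregator variant are illustrative assumptions.

# Sketch of one GraphSAGE layer with neighbor sampling and a mean aggregator.
# Sample size, graph, and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
H = np.random.randn(4, 5)                       # previous-layer node representations
W = np.random.randn(10, 6)                      # learnable layer weights
sample_size = 2
sigma = np.tanh                                 # activation

H_new = np.zeros((4, 6))
for v in range(len(A)):
    neighbors = np.flatnonzero(A[v])
    sampled = rng.choice(neighbors, size=min(sample_size, len(neighbors)),
                         replace=False)         # S_{N(v)}: fixed-size neighbor sample
    agg = H[sampled].mean(axis=0)               # permutation-invariant mean aggregator
    H_new[v] = sigma(np.concatenate([H[v], agg]) @ W)
print(H_new.shape)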
GAT [53] assumes that the contributions of neighboring nodes to the central node are neither identical, as in GraphSAGE [23], nor pre‐determined, as in GCN [14]. GAT adopts attention mechanisms to learn the relative weights between two connected nodes. The graph convolutional operation according to GAT is defined as
h_v^{(k)} = σ( ∑_{u ∈ N(v) ∪ {v}} α_{vu}^{(k)} W^{(k)} h_u^{(k−1)} ) (5.57)
where h_v^{(0)} = x_v. The attention weight α_{vu}^{(k)} measures the connective strength between the node v and its neighbor u:
α_{vu}^{(k)} = softmax( g( a^T [ W^{(k)} h_v ∥ W^{(k)} h_u ] ) ) (5.58)
where g(·) is a LeakyReLU activation function and a is a vector of learnable parameters. The softmax function ensures that the attention weights sum up to one over all neighbors of the node v.
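A minimal single‐head sketch of the attention mechanism of Eqs. (5.57) and (5.58) is given below; the toy graph, the parameters W and a, the LeakyReLU slope of 0.2, and the use of tanh as the outer activation are illustrative assumptions.

# Sketch of single-head graph attention: a shared linear map W, a LeakyReLU
# scoring of concatenated pairs, and a softmax over each node's neighborhood
# (including the node itself).
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
H = np.random.randn(3, 4)                        # previous-layer representations
W = np.random.randn(4, 6)                        # shared weight matrix
a = np.random.randn(12)                          # attention vector
leaky_relu = lambda z: np.where(z > 0, z, 0.2 * z)

Wh = H @ W                                       # W^{(k)} h_u for every node
H_new = np.zeros((3, 6))
for v in range(len(A)):
    nbrs = np.append(np.flatnonzero(A[v]), v)    # N(v) ∪ {v}
    scores = np.array([leaky_relu(a @ np.concatenate([Wh[v], Wh[u]])) for u in nbrs])
    alpha = np.exp(scores) / np.exp(scores).sum()        # softmax over the neighborhood
    H_new[v] = np.tanh((alpha[:, None] * Wh[nbrs]).sum(axis=0))
print(H_new.shape)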
Graph pooling modules: After a GNN generates node features, we can use them for the final task. But using all these features directly can be computationally challenging, and thus a downsampling strategy is needed. Depending on the objective and the role it plays in the network, different names are given to this strategy: (i) the pooling operation aims to reduce the number of parameters by downsampling the nodes to generate smaller representations, thereby avoiding overfitting and excessive computational complexity while maintaining permutation invariance, and (ii) the readout operation is mainly used to generate a graph‐level representation based on node representations (see Figure 5.5). Their mechanism is very similar. In this section, we use pooling to refer to all kinds of downsampling strategies applied to GNNs. Nowadays, mean/max/sum pooling, already illustrated in Section 5.1, is the most primitive and effective way to implement downsampling since calculating the mean/max/sum value in the pooling window is fast:
h_G = mean/max/sum( h_1^{(K)}, h_2^{(K)}, …, h_n^{(K)} ) (5.59)
where K is the index of the last graph convolutional layer.
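As a simple illustration of Eq. (5.59) (not part of the original text), the readout below applies mean, max, and sum pooling to a randomly generated matrix of final node representations.

# Sketch of the mean/max/sum readout of Eq. (5.59) applied to the node
# representations of the last graph convolutional layer.
import numpy as np

H_K = np.random.randn(5, 8)        # final node representations, n x f_K

h_G_mean = H_K.mean(axis=0)        # mean pooling
h_G_max = H_K.max(axis=0)          # max pooling
h_G_sum = H_K.sum(axis=0)          # sum pooling
print(h_G_mean.shape, h_G_max.shape, h_G_sum.shape)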