5.1.2 Propagation Types
The propagation step and the output step in the model define the hidden states of nodes (or edges). Several major modifications have been made to the propagation step of the original GNN model, whereas in the output step a simple feedforward neural network setting is the most popular. The variants utilize different aggregators to gather information from each node's neighbors and specific updaters to update nodes' hidden states.
Convolution has been generalized to the graph domain as well. Advances in this direction are often categorized as spectral approaches and non‐spectral (spatial) approaches. Spectral approaches work with a spectral representation of the graphs.
Spectral network: The spectral network was proposed in Bruna et al. [12]. The convolution operation is defined in the Fourier domain by computing the eigen decomposition of the graph Laplacian. For the basics of these techniques, see Appendix 5.A. The operation can be defined as the multiplication of a signal x ∈ ℝN (a scalar for each node) with a filter gθ = diag(θ) parameterized by θ ∈ ℝN:
g_θ ⋆ x = U g_θ(Λ) Uᵀ x    (5.8)
where U is the matrix of the eigenvectors of the normalized graph Laplacian L = I_N − D^{−1/2} A D^{−1/2} = U Λ Uᵀ (D is the degree matrix, and A is the adjacency matrix of the graph), with Λ the diagonal matrix of its eigenvalues.
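As an illustration of Eq. (5.8), the following NumPy sketch filters a graph signal in the spectral domain of the normalized Laplacian. It is not code from [12]; the function names and the toy path graph are our own, and the filter is specified directly by its spectral coefficients θ.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    D_inv_sqrt = np.diag(d_inv_sqrt)
    return np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

def spectral_conv(A, x, theta):
    """Eq. (5.8): g_theta * x = U g_theta(Lambda) U^T x."""
    L = normalized_laplacian(A)
    lam, U = np.linalg.eigh(L)     # eigendecomposition L = U diag(lam) U^T
    x_hat = U.T @ x                # graph Fourier transform of the signal
    return U @ (theta * x_hat)     # filter in the spectral domain, transform back

# toy example: 4-node path graph; theta = 1 gives the identity filter
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])
print(spectral_conv(A, x, theta=np.ones(4)))   # recovers x
```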
ChebyshevNet [13] truncates gθ(Λ) in terms of Chebyshev polynomials Tk(x) up to Kth order as
g_θ′ ⋆ x ≈ Σ_{k=0}^{K} θ′_k T_k(L̃) x,   with L̃ = (2/λ_max) L − I_N    (5.9)
Parameter λ_max denotes the largest eigenvalue of L. θ′ ∈ ℝ^K is now a vector of Chebyshev coefficients. The Chebyshev polynomials are defined by the recurrence T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), with T_0(x) = 1 and T_1(x) = x. It can be observed that the operation is K‐localized since it is a Kth‐order polynomial in the Laplacian.
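A minimal NumPy sketch of the K‐localized filter in Eq. (5.9). The recurrence on the rescaled Laplacian follows the Chebyshev definition above; the coefficient values and the toy triangle graph are arbitrary illustrations.

```python
import numpy as np

def cheb_filter(L, x, theta, lam_max):
    """Eq. (5.9): sum_k theta_k T_k(L_tilde) x, with L_tilde = 2 L / lam_max - I."""
    N = L.shape[0]
    L_tilde = 2.0 * L / lam_max - np.eye(N)
    Tx_prev = x                      # T_0(L_tilde) x = x
    Tx_curr = L_tilde @ x            # T_1(L_tilde) x = L_tilde x
    out = theta[0] * Tx_prev
    if len(theta) > 1:
        out += theta[1] * Tx_curr
    for k in range(2, len(theta)):   # T_k = 2 L_tilde T_{k-1} - T_{k-2}
        Tx_next = 2.0 * L_tilde @ Tx_curr - Tx_prev
        out += theta[k] * Tx_next
        Tx_prev, Tx_curr = Tx_curr, Tx_next
    return out

# toy usage on a 3-node triangle graph (L = I - D^{-1/2} A D^{-1/2}, lam_max = 1.5)
A = np.ones((3, 3)) - np.eye(3)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt
x = np.array([1.0, 2.0, 3.0])
print(cheb_filter(L, x, theta=np.array([0.5, 0.3, 0.2]), lam_max=1.5))
```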
Graph convolutional network (GCN) [14]: This limits the layer‐wise convolution operation to K = 1 to alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions. By approximating λ_max ≈ 2, the equation simplifies to
g_θ′ ⋆ x ≈ θ′_0 x + θ′_1 (L − I_N) x = θ′_0 x − θ′_1 D^{−1/2} A D^{−1/2} x    (5.10)
with two free parameters θ′_0 and θ′_1. After constraining the number of parameters with θ = θ′_0 = −θ′_1, we get

g_θ ⋆ x ≈ θ (I_N + D^{−1/2} A D^{−1/2}) x    (5.11)
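A small sketch of the single‐parameter operator in Eq. (5.11); the toy adjacency matrix and signal are ours, and the scalar θ is an arbitrary value.

```python
import numpy as np

def first_order_filter(A, x, theta):
    """Eq. (5.11): g_theta * x ≈ theta (I + D^{-1/2} A D^{-1/2}) x."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return theta * (np.eye(A.shape[0]) + d_inv_sqrt @ A @ d_inv_sqrt) @ x

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
x = np.array([1.0, -1.0, 2.0])
print(first_order_filter(A, x, theta=0.5))
```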
Non‐spectral approaches define convolutions directly on the graph, operating on spatially close neighbors. The major challenge with non‐spectral approaches is defining the convolution operation with differently sized neighborhoods and maintaining the local invariance of CNNs.
Neural fingerprints (FP) [15] use different weight matrices for nodes with different degrees,
x = h_v^{t−1} + Σ_{k∈N_v} h_k^{t−1},   h_v^t = σ(x W_t^{|N_v|})    (5.12)
where W_t^{|N_v|} is the weight matrix for nodes with degree |N_v| at layer t. The main drawback of the method is that it cannot be applied to large‐scale graphs, since a separate weight matrix is required for every node degree that occurs.
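A hedged sketch of the degree‐dependent update in Eq. (5.12). The dictionary keyed by node degree, the logistic nonlinearity, and the random weights are our illustrative choices, not the implementation of [15].

```python
import numpy as np

def neural_fp_layer(A, H, weights_by_degree):
    """Eq. (5.12): x_v = h_v + sum of neighbor states; h_v' = sigma(x_v W^{|N_v|})."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))            # logistic nonlinearity
    degrees = A.sum(axis=1).astype(int)
    H_new = np.zeros_like(H)
    for v in range(A.shape[0]):
        x_v = H[v] + A[v] @ H                             # self state + neighbor sum
        H_new[v] = sigma(x_v @ weights_by_degree[degrees[v]])
    return H_new

# toy graph with node degrees 1 and 2; one weight matrix per observed degree
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
F = 4
H = rng.normal(size=(3, F))
weights_by_degree = {1: rng.normal(size=(F, F)), 2: rng.normal(size=(F, F))}
print(neural_fp_layer(A, H, weights_by_degree))
```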
Diffusion‐convolutional neural networks (DCNNs) were proposed in [16]. Transition matrices are used to define the neighborhood for nodes in a DCNN. For node classification, it has the form
H = f(W^c ⊙ P* X)    (5.13)
where X is an N × F tensor of input features (N is the number of nodes, and F is the number of features). P* is an N × K × N tensor that contains the power series {P, P², …, P^K} of the matrix P, and P is the degree‐normalized transition matrix obtained from the graph adjacency matrix A. Each entity is transformed to a diffusion‐convolutional representation, a K × F matrix defined by K hops of graph diffusion over F features, which is then combined with a K × F weight matrix W^c and a nonlinear activation function f. Finally, H (which is N × K × F) denotes the diffusion representations of each node in the graph. For graph classification, DCNN simply takes the average of the nodes' representations,
H = f(W^c ⊙ 1_Nᵀ P* X / N)    (5.14)
and 1_N here is an N × 1 vector of ones. DCNN can also be applied to edge classification tasks, which requires converting edges to nodes and augmenting the adjacency matrix.
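The following NumPy sketch illustrates the diffusion‐convolutional node representation of Eq. (5.13); the choice of tanh as f, the unit weights W^c, and the toy inputs are placeholders of ours.

```python
import numpy as np

def dcnn_node_representations(A, X, Wc, K):
    """Eq. (5.13): H = f(W^c ⊙ P* X), where P* stacks the powers P, P^2, ..., P^K
    of the degree-normalized transition matrix P = D^{-1} A."""
    N, F = X.shape
    P = A / A.sum(axis=1, keepdims=True)        # degree-normalized transition matrix
    P_star = np.zeros((N, K, N))
    P_power = np.eye(N)
    for k in range(K):
        P_power = P_power @ P                   # P^{k+1}
        P_star[:, k, :] = P_power
    H = np.einsum('nkm,mf->nkf', P_star, X)     # K hops of diffusion over F features
    return np.tanh(Wc[None, :, :] * H)          # elementwise K x F weights W^c, then f

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
X = np.arange(6, dtype=float).reshape(3, 2)     # N = 3 nodes, F = 2 features
K = 2
Wc = np.ones((K, 2))
print(dcnn_node_representations(A, X, Wc, K).shape)   # (3, 2, 2): N x K x F
```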
Dual graph convolutional network (DGCN) [17]: This jointly considers the local consistency and global consistency on graphs. It uses two convolutional networks to capture the local/global consistency and adopts an unsupervised loss function to ensemble them. As the first step, let us note that stacking the operator in Eq. (5.11) could lead to numerical instabilities and exploding/vanishing gradients, so [14] introduces the renormalization trick: I_N + D^{−1/2} A D^{−1/2} → D̃^{−1/2} Ã D̃^{−1/2}, with Ã = A + I_N and D̃_ii = Σ_j Ã_ij. Finally, [14] generalizes the definition to a signal X ∈ ℝ^{N × C} with C input channels and F filters for feature maps as follows:

Z = D̃^{−1/2} Ã D̃^{−1/2} X Θ    (5.15)
where Θ ∈ ℝC × F is a matrix of filter parameters, and Z ∈ ℝN × F is the convolved signal matrix. The first convolutional network for DGCN is the same as Eq. (5.15). The second network replaces the adjacency matrix with a positive pointwise mutual information (PPMI) matrix:
H′ = ρ(D_P^{−1/2} X_P D_P^{−1/2} H Θ)    (5.16)
where X_P is the PPMI matrix, D_P is the diagonal degree matrix of X_P, and ρ is a nonlinear activation function.
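A sketch of the renormalized propagation rule in Eq. (5.15); the DGCN branch in Eq. (5.16) has the same form with the PPMI matrix X_P in place of Ã. The feature and weight values below are illustrative only.

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """Eq. (5.15): Z = D_tilde^{-1/2} A_tilde D_tilde^{-1/2} X Theta,
    with A_tilde = A + I and D_tilde its degree matrix (renormalization trick)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt @ X @ Theta

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])            # N = 3 nodes, C = 2 input channels
Theta = np.array([[1.0, 0.5, 0.0],
                  [0.0, 0.5, 1.0]])   # C = 2 -> F = 3 filters
print(gcn_layer(A, X, Theta))         # convolved signal Z, shape (3, 3)
```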
When dealing with large‐scale networks, low‐dimensional vector embeddings of nodes in large graphs have proved extremely useful as feature inputs for a wide variety of prediction and graph analysis tasks [18–22]. The basic idea behind node‐embedding approaches is to use dimensionality reduction techniques to distill the high‐dimensional information about a node’s graph neighborhood into a dense vector embedding. These node embeddings can then be fed to downstream machine learning systems and aid in tasks such as node classification, clustering, and link prediction [19–21].
GraphSAGE (SAmple and aggreGatE) [23] is a general inductive framework. The framework generates embeddings by sampling and aggregating features from a node's local neighborhood:

h_{N_v}^t = AGGREGATE_t({h_u^{t−1}, ∀u ∈ N_v}),   h_v^t = σ(W^t · [h_v^{t−1} ‖ h_{N_v}^t])    (5.17)

However, instead of using the full neighbor set in Eq. (5.17), the original work uniformly samples a fixed‐size set of neighbors. Three aggregator functions are used.
The mean aggregator could be viewed as an approximation of the convolutional operation from the transductive GCN framework [14], so that the inductive version of the GCN variant could be derived by
h_v^t = σ(W · MEAN({h_v^{t−1}} ∪ {h_u^{t−1}, ∀u ∈ N_v}))    (5.18)
The mean aggregator is different from the other aggregators because it does not perform the concatenation operation that concatenates h_v^{t−1} and h_{N_v}^t in Eq. (5.17). It could be viewed as a form of “skip connection” [24] and could achieve better performance.
The long short‐term memory (LSTM) aggregator, which has a larger expressive capability, is also used. However, LSTMs process inputs sequentially, so they are not permutation invariant. Reference [23] adapts LSTMs to operate on an unordered set by permuting the node's neighbors.
Pooling aggregator: In the pooling aggregator, each neighbor's hidden state is fed through a fully connected layer, after which a max‐pooling operation is applied to the set of the node's neighbors:
h_{N_v}^t = max({σ(W_pool h_u^{t−1} + b), ∀u ∈ N_v})    (5.19)
Note that any symmetric function could be used in place of the max‐pooling operation here. The operation of GraphSAGE is illustrated in Figure 5.1 [23].
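A compact sketch of the GraphSAGE update in Eq. (5.17) with a mean aggregator, neighbor sampling, and the concatenation of the node's own state with the aggregated neighborhood state. The adjacency lists, sample size, and random weights are our own toy choices.

```python
import numpy as np

def graphsage_layer(neighbors, H, W, sample_size, rng):
    """Eq. (5.17): aggregate sampled neighbor states, concatenate with the node's
    own state, then apply a shared weight matrix and a nonlinearity."""
    sigma = np.tanh
    N, F = H.shape
    H_new = np.zeros((N, W.shape[1]))
    for v in range(N):
        nbrs = neighbors[v]
        if len(nbrs) > sample_size:                      # uniform fixed-size sampling
            nbrs = rng.choice(nbrs, size=sample_size, replace=False)
        h_nv = H[nbrs].mean(axis=0)                      # mean AGGREGATE
        H_new[v] = sigma(np.concatenate([H[v], h_nv]) @ W)
    return H_new

rng = np.random.default_rng(0)
neighbors = [[1, 2, 3], [0], [0, 3], [0, 2]]             # adjacency lists
F, F_out = 4, 3
H = rng.normal(size=(len(neighbors), F))
W = rng.normal(size=(2 * F, F_out))                      # acts on [h_v || h_{N_v}]
print(graphsage_layer(neighbors, H, W, sample_size=2, rng=rng))
```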
Gate: Several works have attempted to use a gate mechanism such as gated recurrent units (GRUs) [25] or LSTM [26] in the propagation step, to mitigate the restrictions of earlier GNN models and improve the long‐term propagation of information across the graph structure.
Gated graph neural network (GGNN) [27] uses GRUs in the propagation step, unrolls the recurrence for a fixed number of steps T, and uses backpropagation through time in order to compute gradients. So, the propagation model can be presented as
a_v^t = A_vᵀ [h_1^{t−1}, …, h_N^{t−1}]ᵀ + b
z_v^t = σ(W^z a_v^t + U^z h_v^{t−1})
r_v^t = σ(W^r a_v^t + U^r h_v^{t−1})
h̃_v^t = tanh(W a_v^t + U(r_v^t ⊙ h_v^{t−1}))
h_v^t = (1 − z_v^t) ⊙ h_v^{t−1} + z_v^t ⊙ h̃_v^t    (5.20)
Figure 5.1 Operation of GraphSAGE: (a) sample neighborhood, (b) aggregate feature information from neighbors, (c) predict graph context and label using aggregated information.
Source: Hamilton et al. [23].
The node v first aggregates messages from its neighbors, where A_v is the submatrix of the graph adjacency matrix A that denotes the connection of node v with its neighbors. The GRU‐like update functions then incorporate information from the other nodes and from the previous time step to update each node's hidden state: the vector a_v^t gathers the neighborhood information of node v, and z and r are the update and reset gates.
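A sketch of one GRU‐style propagation step of Eq. (5.20). For brevity the neighborhood aggregation uses the plain adjacency matrix with shared weights across nodes, and all parameter values are random placeholders.

```python
import numpy as np

def ggnn_step(A, H, params):
    """Eq. (5.20): a^t = A^T H + b, followed by GRU-style gating of the hidden states."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    Wz, Uz, Wr, Ur, W, U, b = params
    Agg = A.T @ H + b                                   # a^t: neighborhood aggregation
    Z = sigmoid(Agg @ Wz + H @ Uz)                      # update gate z
    R = sigmoid(Agg @ Wr + H @ Ur)                      # reset gate r
    H_cand = np.tanh(Agg @ W + (R * H) @ U)             # candidate hidden state
    return (1.0 - Z) * H + Z * H_cand                   # new hidden state h^t

rng = np.random.default_rng(1)
N, D = 4, 3
A = (rng.random((N, N)) < 0.5).astype(float)            # random directed toy graph
np.fill_diagonal(A, 0.0)
H = rng.normal(size=(N, D))
params = tuple(rng.normal(size=(D, D)) for _ in range(6)) + (rng.normal(size=D),)
print(ggnn_step(A, H, params))
```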
LSTM architecture extensions, referred to as the Child‐Sum Tree‐LSTM and the N‐ary Tree‐LSTM, are presented in [28]. As in standard LSTM units, each Tree‐LSTM unit (indexed by v) contains input and output gates i_v and o_v, a memory cell c_v, and a hidden state h_v. Instead of a single forget gate, the Tree‐LSTM unit contains one forget gate f_vk for each child k, allowing the unit to selectively incorporate information from each child. The Child‐Sum Tree‐LSTM transition equations are given as
h̃_v^{t−1} = Σ_{k∈N_v} h_k^{t−1}
i_v^t = σ(W^i x_v^t + U^i h̃_v^{t−1} + b^i)
f_vk^t = σ(W^f x_v^t + U^f h_k^{t−1} + b^f)
o_v^t = σ(W^o x_v^t + U^o h̃_v^{t−1} + b^o)
u_v^t = tanh(W^u x_v^t + U^u h̃_v^{t−1} + b^u)
c_v^t = i_v^t ⊙ u_v^t + Σ_{k∈N_v} f_vk^t ⊙ c_k^{t−1}
h_v^t = o_v^t ⊙ tanh(c_v^t)    (5.21)
Here, x_v^t is the input vector at time t, as in the standard LSTM setting. If the branching factor of a tree is at most K and all children of a node are ordered (that is, they can be indexed from 1 to K), then the N‐ary Tree‐LSTM can be used. For node v, h_vk^t and c_vk^t denote the hidden state and memory cell of its k‐th child at time t, respectively. The transition equations are now
i_v^t = σ(W^i x_v^t + Σ_{l=1}^{K} U_l^i h_vl^{t−1} + b^i)
f_vk^t = σ(W^f x_v^t + Σ_{l=1}^{K} U_kl^f h_vl^{t−1} + b^f)
o_v^t = σ(W^o x_v^t + Σ_{l=1}^{K} U_l^o h_vl^{t−1} + b^o)
u_v^t = tanh(W^u x_v^t + Σ_{l=1}^{K} U_l^u h_vl^{t−1} + b^u)
c_v^t = i_v^t ⊙ u_v^t + Σ_{l=1}^{K} f_vl^t ⊙ c_vl^{t−1}
h_v^t = o_v^t ⊙ tanh(c_v^t)    (5.22)
The introduction of separate parameter matrices for each child k allows the model to learn representations that are conditioned on the states of a unit's children in a more fine‐grained way than the Child‐Sum Tree‐LSTM.
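A sketch of a single Child‐Sum Tree‐LSTM update (Eq. (5.21)); the parameter shapes and the toy children states are our own placeholders.

```python
import numpy as np

def child_sum_tree_lstm(x_v, child_h, child_c, params):
    """Eq. (5.21): gates computed from the summed child states, with one forget
    gate per child, producing the new memory cell and hidden state of node v."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    Wi, Ui, bi, Wf, Uf, bf, Wo, Uo, bo, Wu, Uu, bu = params
    h_tilde = child_h.sum(axis=0)                       # sum of child hidden states
    i = sigmoid(Wi @ x_v + Ui @ h_tilde + bi)           # input gate
    o = sigmoid(Wo @ x_v + Uo @ h_tilde + bo)           # output gate
    u = np.tanh(Wu @ x_v + Uu @ h_tilde + bu)           # candidate memory
    f = sigmoid(Wf @ x_v + child_h @ Uf.T + bf)         # one forget gate per child k
    c = i * u + (f * child_c).sum(axis=0)               # memory cell
    h = o * np.tanh(c)                                  # hidden state
    return h, c

rng = np.random.default_rng(2)
D = 3                                                   # hidden/input size
x_v = rng.normal(size=D)
child_h = rng.normal(size=(2, D))                       # two children
child_c = rng.normal(size=(2, D))
params = tuple(rng.normal(size=(D, D)) if k % 3 != 2 else rng.normal(size=D)
               for k in range(12))                      # W, U matrices and b vectors
print(child_sum_tree_lstm(x_v, child_h, child_c, params))
```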
The two types of Tree‐LSTMs can be easily adapted to the graph. The graph‐structured LSTM in [29] is an example of the N‐ary Tree‐LSTM applied to the graph. However, it is a simplified version since each node in the graph has at most two incoming edges (from its parent and sibling predecessor). Reference [30] proposed another variant of the Graph LSTM based on the relation extraction task. The main difference between graphs and trees is that edges of graphs have labels. Work in [30] utilizes different weight matrices to represent different labels:
i_v^t = σ(W^i x_v^t + Σ_{k∈N_v} U_{m(v,k)}^i h_k^{t−1} + b^i)
f_vk^t = σ(W^f x_v^t + U_{m(v,k)}^f h_k^{t−1} + b^f)
o_v^t = σ(W^o x_v^t + Σ_{k∈N_v} U_{m(v,k)}^o h_k^{t−1} + b^o)
u_v^t = tanh(W^u x_v^t + Σ_{k∈N_v} U_{m(v,k)}^u h_k^{t−1} + b^u)
c_v^t = i_v^t ⊙ u_v^t + Σ_{k∈N_v} f_vk^t ⊙ c_k^{t−1}
h_v^t = o_v^t ⊙ tanh(c_v^t)    (5.23)
where m(v, k) denotes the edge label between node v and k.
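The only change relative to Eq. (5.22) is that the weight matrix applied to each child is looked up by the edge label m(v, k). A minimal sketch of that lookup follows; the label set ("parent", "sibling") and the shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 3
# one U matrix per edge label, as in Eq. (5.23)
U_by_label = {label: rng.normal(size=(D, D)) for label in ("parent", "sibling")}

def labeled_child_term(children, hidden):
    """Sum U_{m(v,k)} h_k over children k, selecting weights by the edge label m(v, k)."""
    return sum(U_by_label[label] @ hidden[k] for k, label in children)

hidden = rng.normal(size=(4, D))                     # hidden states of candidate children
children = [(1, "parent"), (2, "sibling")]           # (child index, edge label) pairs
print(labeled_child_term(children, hidden))
```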
The attention mechanism has been successfully used in many sequence‐based tasks such as machine translation [31–33] and machine reading [34]. Work in [35] proposed a graph attention network (GAT) that incorporates the attention mechanism into the propagation step. It computes the hidden states of each node by attending to its neighbors, following a self‐attention strategy. The work defines a single graph attentional layer and constructs arbitrary GATs by stacking this layer. The layer computes the coefficients in the attention mechanism of the node pair (i, j) by
α_ij = exp(LeakyReLU(aᵀ [W h_i ‖ W h_j])) / Σ_{k∈N_i} exp(LeakyReLU(aᵀ [W h_i ‖ W h_k]))    (5.24)
where α_ij is the attention coefficient of node j to node i, and N_i represents the neighborhood of node i in the graph. The input set of node features to the layer is h = {h_1, h_2, …, h_N}, h_i ∈ ℝ^F, where N is the number of nodes and F is the number of features of each node; the layer produces a new set of node features (of potentially different cardinality F′), h′ = {h′_1, h′_2, …, h′_N}, h′_i ∈ ℝ^{F′}, as its output. W ∈ ℝ^{F′×F} is the weight matrix of a shared linear transformation that is applied to every node, and a ∈ ℝ^{2F′} is the weight vector of a single‐layer FNN. The coefficients are normalized by a softmax function, and the LeakyReLU nonlinearity (with negative input slope α = 0.2) is applied. After applying a nonlinearity, the final output features of each node can be obtained as
h′_i = σ(Σ_{j∈N_i} α_ij W h_j)    (5.25)
The layer utilizes multi‐head attention similarly to [33] to stabilize the learning process. It applies K independent attention mechanisms to compute the hidden states and then concatenates their features (or computes the average), resulting in the following two output representations:
h′_i = ‖_{k=1}^{K} σ(Σ_{j∈N_i} α_ij^k W^k h_j)    (5.26)
h′_i = σ((1/K) Σ_{k=1}^{K} Σ_{j∈N_i} α_ij^k W^k h_j)    (5.27)
where α_ij^k is the normalized attention coefficient computed by the k‐th attention mechanism. The attention architecture in [35] has several properties: (i) the computation of the node‐neighbor pairs is parallelizable, thus making the operation efficient; (ii) it can be applied to graph nodes with different degrees by specifying arbitrary weights for the neighbors; and (iii) it can be easily applied to inductive learning problems.
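A single‐head sketch of Eqs. (5.24) and (5.25). Here neighborhoods are taken to include the node itself (via added self‐loops), tanh stands in for the output nonlinearity σ, and all parameters are random placeholders.

```python
import numpy as np

def gat_layer(A, H, W, a, neg_slope=0.2):
    """Eqs. (5.24)-(5.25): alpha_ij = softmax_j(LeakyReLU(a^T [W h_i || W h_j])),
    h_i' = sigma(sum_{j in N_i} alpha_ij W h_j)."""
    leaky = lambda z: np.where(z > 0, z, neg_slope * z)
    N, F_out = A.shape[0], W.shape[1]
    Wh = H @ W                                   # shared linear transform, N x F'
    a_src, a_dst = a[:F_out], a[F_out:]          # split a^T [Wh_i || Wh_j] into two dots
    logits = leaky(Wh @ a_src[:, None] + (Wh @ a_dst)[None, :])   # e_ij, N x N
    mask = A + np.eye(N)                         # attend over N_i, self-loops included
    logits = np.where(mask > 0, logits, -np.inf)
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over j in N_i
    return np.tanh(alpha @ Wh)                   # final output features h_i'

rng = np.random.default_rng(4)
N, F, F_out = 4, 5, 3
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = rng.normal(size=(N, F))
print(gat_layer(A, H, rng.normal(size=(F, F_out)), rng.normal(size=2 * F_out)))
```

Multi‐head attention as in Eqs. (5.26) and (5.27) simply runs K copies of this layer with independent W^k and a^k and concatenates (or averages) the outputs.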
Apart from different variants of GNNs, several general frameworks have been proposed that aim to integrate different models into a single framework.
Message passing neural networks (MPNNs) [36]: This framework abstracts the commonalities between several of the most popular models for graph‐structured data, such as spectral and non‐spectral approaches in graph convolution, gated GNNs, interaction networks, molecular graph convolutions, and deep tensor neural networks. The model contains two phases, a message passing phase and a readout phase. The message passing phase (namely, the propagation step) runs for T time steps and is defined in terms of the message function M_t and the vertex update function U_t. Using messages m_v^{t+1}, the updating functions of the hidden states h_v^t are
m_v^{t+1} = Σ_{w∈N_v} M_t(h_v^t, h_w^t, e_vw),   h_v^{t+1} = U_t(h_v^t, m_v^{t+1})    (5.28)
where e_vw represents features of the edge from node v to w. The readout phase computes a feature vector for the whole graph using the readout function R according to
ŷ = R({h_v^T | v ∈ G})    (5.29)
where T denotes the total time steps. The message function Mt , vertex update function Ut , and readout function R could have different settings. Hence, the MPNN framework could generalize several different models via different function settings. Here, we give an example of generalizing GGNN, and other models’ function settings could be found in Eq. (5.36). The function settings for GGNNs are
M_t(h_v^t, h_w^t, e_vw) = A_{e_vw} h_w^t,   U_t = GRU(h_v^t, m_v^{t+1}),   R = Σ_{v∈V} σ(i(h_v^T, h_v^0)) ⊙ (j(h_v^T))    (5.30)
where A_{e_vw} is the adjacency matrix, one for each edge label e, GRU is the gated recurrent unit introduced in [25], and i and j are neural networks in the function R.
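A generic sketch of the two MPNN phases (Eqs. (5.28) and (5.29)). The concrete message, update, and readout functions below are simple placeholder choices of ours, not those of any specific model in [36]; plugging in different functions yields different instances of the framework.

```python
import numpy as np

def mpnn(edges, H, edge_feats, T, message_fn, update_fn, readout_fn):
    """Message passing for T steps (Eq. (5.28)), then a graph-level readout (Eq. (5.29))."""
    H = H.copy()
    for _ in range(T):
        M = np.zeros_like(H)
        for v, w in edges:                       # edge (v, w): v receives a message from w
            M[v] += message_fn(H[v], H[w], edge_feats[(v, w)])
        H = np.array([update_fn(H[v], M[v]) for v in range(len(H))])
    return readout_fn(H), H

# placeholder function settings (illustrative only)
message_fn = lambda hv, hw, evw: evw * hw        # M_t: edge-scaled neighbor state
update_fn = lambda hv, mv: np.tanh(hv + mv)      # U_t: additive update
readout_fn = lambda H: H.sum(axis=0)             # R: sum over all nodes

edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
edge_feats = {e: 0.5 for e in edges}
H0 = np.eye(3)                                   # one-hot initial node states
y_hat, H_T = mpnn(edges, H0, edge_feats, T=2, message_fn=message_fn,
                  update_fn=update_fn, readout_fn=readout_fn)
print(y_hat)
```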
Non‐local neural networks (NLNN) are proposed for capturing long‐range dependencies with deep neural networks by computing the response at a position as a weighted sum of the features at all positions (in space, time, or spacetime). The generic non‐local operation is defined as
h′_i = (1/C(h)) Σ_{∀j} f(h_i, h_j) g(h_j)    (5.31)
where i is the index of an output position, and j is the index that enumerates all possible positions. f(h_i, h_j) computes a scalar between i and j representing the relation between them, g(h_j) denotes a transformation of the input h_j, and a factor 1/C(h) is utilized to normalize the results.
There are several instantiations with different f and g settings. For simplicity, a linear transformation can be used as the function g, that is, g(h_j) = W_g h_j, where W_g is a learned weight matrix. The Gaussian function is a natural choice for f, giving f(h_i, h_j) = exp(h_iᵀ h_j), where h_iᵀ h_j is dot‐product similarity and C(h) = Σ_{∀j} f(h_i, h_j). It is straightforward to extend the Gaussian function by computing similarity in an embedding space, giving f(h_i, h_j) = exp(θ(h_i)ᵀ φ(h_j)) with θ(h_i) = W_θ h_i, φ(h_j) = W_φ h_j, and C(h) = Σ_{∀j} f(h_i, h_j). The function f can also be implemented as a dot‐product similarity f(h_i, h_j) = θ(h_i)ᵀ φ(h_j); here, the factor C(h) = N, where N is the number of positions in h. Concatenation can also be used, defined as f(h_i, h_j) = ReLU(w_fᵀ [θ(h_i) ‖ φ(h_j)]), where w_f is a weight vector projecting the concatenated vector to a scalar, and C(h) = N.
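A sketch of the generic non‐local operation in Eq. (5.31) with the embedded‐Gaussian instantiation of f and a linear g; the projection matrices and the toy input are random placeholders.

```python
import numpy as np

def non_local(H, W_theta, W_phi, W_g):
    """Eq. (5.31): h_i' = (1 / C(h)) * sum_j f(h_i, h_j) g(h_j), with the embedded
    Gaussian f(h_i, h_j) = exp(theta(h_i)^T phi(h_j)) and C(h) = sum_j f(h_i, h_j)."""
    theta, phi, g = H @ W_theta, H @ W_phi, H @ W_g
    scores = np.exp(theta @ phi.T)                          # f(h_i, h_j) for all pairs
    weights = scores / scores.sum(axis=1, keepdims=True)    # 1 / C(h) normalization
    return weights @ g                                      # weighted sum over positions j

rng = np.random.default_rng(5)
N, D = 6, 4
H = rng.normal(size=(N, D))
print(non_local(H, rng.normal(size=(D, D)), rng.normal(size=(D, D)),
                rng.normal(size=(D, D))))
```

With the embedded‐Gaussian choice, the normalized weights are exactly a softmax over the pairwise dot products, which is why the non‐local operation recovers a form of self‐attention.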