5.3.1 Labeled Graph NN (LGNN)
Here, a graph G is a pair (N, E), where N is the set of nodes and E is the set of edges. The set ne[n] stands for the neighbors of n, that is, the nodes connected to n by an arc, while co[n] denotes the set of arcs having n as a vertex. Nodes and edges may have labels represented by real vectors. The labels attached to node n and edge (n1, n2) will be represented by ln and l(n1, n2), respectively. Let l denote the vector obtained by stacking together all the labels of the graph.
If y is a vector that contains data from a graph and S is a subset of the nodes (the edges), then yS denotes the vector obtained by selecting from y the components related to the nodes (the edges) in S.
For example, lne[n] stands for the vector containing the labels of all the neighbors of n. Labels usually include features of objects related to nodes and features of the relationships between the objects. No assumption is made regarding the arcs: directed and undirected edges are both permitted. However, when different kinds of edges coexist in the same dataset, it is necessary to distinguish them. This can be easily achieved by attaching a proper label to each edge. In this case, different kinds of arcs turn out to be just arcs with different labels.
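As a concrete illustration, the following minimal Python sketch shows one possible way to store a labeled graph together with the ne[n] and co[n] sets defined above; the class and field names are illustrative only and are not part of the model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class LabeledGraph:
    # node_labels[n] is the label vector l_n; edge_labels[(n1, n2)] is l_(n1, n2)
    node_labels: Dict[int, np.ndarray]
    edge_labels: Dict[Tuple[int, int], np.ndarray]

    def ne(self, n: int) -> List[int]:
        """Neighbors of n: the nodes connected to n by an arc."""
        return sorted({v for (u, v) in self.edge_labels if u == n} |
                      {u for (u, v) in self.edge_labels if v == n})

    def co(self, n: int) -> List[Tuple[int, int]]:
        """Arcs having n as a vertex."""
        return [e for e in self.edge_labels if n in e]

# Example: a triangle with 2-dimensional node labels and 1-dimensional edge labels
g = LabeledGraph(
    node_labels={0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 1.0])},
    edge_labels={(0, 1): np.array([0.5]), (1, 2): np.array([0.2]), (0, 2): np.array([0.9])},
)
print(g.ne(1), g.co(1))   # [0, 2]  [(0, 1), (1, 2)]
```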
The considered graphs may be either positional or nonpositional. Nonpositional graphs are those that have been described thus far; positional graphs differ from them in that a unique integer identifier is assigned to each neighbor of a node n to indicate its logical position. Formally, for each node n in a positional graph, there exists an injective function νn : ne[n] → {1, …, ∣N∣}, which assigns to each neighbor u of n a position νn(u).
The domain is the set of pairs of a graph and a node, that is, 𝒟 = 𝒢 × 𝒩, where 𝒢 is a set of graphs and 𝒩 is a subset of their nodes. We assume a supervised learning framework with the learning set

ℒ = {(Gi, ni,j, ti,j) ∣ Gi = (Ni, Ei) ∈ 𝒢; ni,j ∈ Ni; ti,j ∈ Rm, 1 ≤ i ≤ p, 1 ≤ j ≤ qi}, (5.73)

where ni,j ∈ Ni denotes the j‐th node in the set Ni and ti,j is the desired target associated with ni,j. Finally, p ≤ ∣𝒢∣ and qi ≤ ∣Ni∣. All the graphs of the learning set can be combined into a unique disconnected graph, and therefore one might think of the learning set as the pair ℒ = (G, 𝒯), where G = (N, E) is a graph and 𝒯 is a set of pairs {(ni, ti) ∣ ni ∈ N, ti ∈ Rm, 1 ≤ i ≤ q}. It is worth mentioning that this compact definition is useful not only for its simplicity, but also because it directly captures the very nature of some problems where the domain consists of only one graph, for instance, a large portion of the Web (see Figure 5.8).
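For illustration, a short sketch of how several training graphs could be combined into one such disconnected graph, assuming the hypothetical LabeledGraph container from the earlier sketch and non-negative integer node identifiers:

```python
def merge_graphs(graphs):
    """Combine several LabeledGraph objects (see the earlier sketch) into one
    disconnected graph by offsetting node indices, mirroring the compact view
    of the learning set used above."""
    node_labels, edge_labels, offset = {}, {}, 0
    for g in graphs:
        for n, lab in g.node_labels.items():
            node_labels[n + offset] = lab
        for (u, v), lab in g.edge_labels.items():
            edge_labels[(u + offset, v + offset)] = lab
        offset += max(g.node_labels) + 1   # assumes non-negative integer node ids
    return LabeledGraph(node_labels, edge_labels)
```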
The model: As before, the nodes in a graph represent objects or concepts, and edges represent their relationships. Each concept is naturally defined by its features and the related concepts. Thus, we can attach a state xn ∈ Rs to each node n that is based on the information contained in the neighborhood of n (see Figure 5.9). The state xn contains a representation of the concept denoted by n and can be used to produce an output on, that is, a decision about the concept. Let fw be a parametric function, called the local transition function, that expresses the dependence of a node n on its neighborhood, and let gw be the local output function that describes how the output is produced. Then, xn and on are defined as follows:

xn = fw(ln, lco[n], xne[n], lne[n])
on = gw(xn, ln), n ∈ N, (5.74)
where ln, lco[n], xne[n], and lne[n] are the label of n, the labels of its edges, the states, and the labels of the nodes in the neighborhood of n, respectively.
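For concreteness, here is a minimal sketch of one possible realization of fw and gw as single-layer networks in the positional form; the dimensions, weight scale, and padding width M are assumptions for illustration, not prescriptions of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
S, DN, DE = 4, 2, 1      # assumed state, node-label, and edge-label dimensions
M = 3                    # assumed maximum number of neighbors (positional form)

# Single-layer realizations of f_w and g_w; small weights also help keep the
# transition map contractive (see the discussion of the contraction condition below).
IN_F = DN + M * (DE + S + DN)            # [l_n, l_co[n], x_ne[n], l_ne[n]], padded to M slots
W_f = 0.1 * rng.standard_normal((S, IN_F))
W_g = 0.1 * rng.standard_normal((1, S + DN))

def f_w(l_n, l_co, x_ne, l_ne):
    """Local transition function: new state of node n from its label, the labels of
    its arcs, and the (padded) states and labels of its neighbors."""
    z = np.concatenate([l_n, l_co, x_ne, l_ne])
    return np.tanh(W_f @ z)              # tanh keeps the update smooth and bounded

def g_w(x_n, l_n):
    """Local output function: output of node n from its state and its label."""
    return W_g @ np.concatenate([x_n, l_n])
```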
Let x, o, l, and lN be the vectors constructed by stacking all the states, all the outputs, all the labels, and all the node labels, respectively. Then, a compact form of Eq. (5.74) becomes

x = Fw(x, l)
o = Gw(x, lN), (5.75)
where Fw, the global transition function, and Gw, the global output function, are stacked versions of ∣N∣ instances of fw and gw, respectively. For us, the case of interest is when x and o are uniquely defined and Eq. (5.75) defines a map ϕw : 𝒟 → Rm that takes a graph as input and returns an output on for each node. Banach’s fixed‐point theorem [62] provides a sufficient condition for the existence and uniqueness of the solution of a system of equations. According to Banach’s theorem, Eq. (5.75) has a unique solution provided that Fw is a contraction map with respect to the state; that is, there exists μ, 0 ≤ μ < 1, such that ‖Fw(x, l) − Fw(y, l)‖ ≤ μ‖x − y‖ holds for any x, y, where ‖·‖ denotes a vectorial norm. Thus, for the moment, let us assume that Fw is a contraction map. Later, we will show that, in GNNs, this property is enforced by an appropriate implementation of the transition function.
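The contraction condition can be probed numerically; the following sketch estimates the ratio ‖Fw(x, l) − Fw(y, l)‖/‖x − y‖ over random state pairs (a heuristic check only, not a proof of contraction; the function name and arguments are illustrative).

```python
import numpy as np

def estimate_contraction_ratio(F_w, l, state_dim, trials=200, seed=0):
    """Empirically estimate max ||F_w(x, l) - F_w(y, l)|| / ||x - y|| over random
    state pairs. A value below 1 is consistent with (but does not prove) F_w
    being a contraction with respect to the state."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        x, y = rng.standard_normal(state_dim), rng.standard_normal(state_dim)
        ratio = np.linalg.norm(F_w(x, l) - F_w(y, l)) / np.linalg.norm(x - y)
        worst = max(worst, ratio)
    return worst
```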
Figure 5.8 A subset of the Web.
Figure 5.9 Graph and the neighborhood of a node. The state x1 of node 1 depends on the information contained in its neighborhood.
Source: Scarselli et al. [1].
For positional graphs, fw must receive the positions of the neighbors as additional inputs. In practice, this can be easily achieved provided that the information contained in xne[n], lco[n], and lne[n] is sorted according to neighbors’ positions and is appropriately padded with special null values in positions corresponding to nonexistent neighbors. For example, xne[n] = [y1, …, yM], where M = maxn,u νn(u) is the maximum number of neighbors of a node; yi = xu holds if u is the i‐th neighbor of n (νn(u) = i); and yi = x0, for some predefined null state x0, if there is no i‐th neighbor.
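A small sketch of this sorting-and-padding step (the argument names are illustrative):

```python
import numpy as np

def pad_neighbor_states(states, neighbor_ids, positions, M, x_null):
    """Arrange neighbor states by their positions nu_n(u) and pad with a null state.

    states       : dict node -> state vector x_u
    neighbor_ids : the neighbors u of node n
    positions    : dict u -> nu_n(u) in {1, ..., M}
    x_null       : predefined null state x_0 for empty slots
    """
    slots = [x_null] * M
    for u in neighbor_ids:
        slots[positions[u] - 1] = states[u]   # the i-th neighbor goes to slot i
    return np.concatenate(slots)
```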
For nonpositional graphs, it is useful to replace function fw of Eq. (5.74) with

xn = ∑u∈ne[n] hw(ln, l(n, u), xu, lu), n ∈ N, (5.76)
where hw is a parametric function. This transition function, which has been successfully used in recursive NNs, is not affected by the positions or the number of the children. In the following, Eq. (5.76) is referred to as the nonpositional form, and Eq. (5.74) is called the positional form. To implement the GNN model, we need (i) a method to solve Eq. (5.74); (ii) a learning algorithm to adapt fw and gw using examples from the training dataset; and (iii) an implementation of fw and gw. These steps will be considered in the following paragraphs.
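A minimal sketch of the nonpositional form of Eq. (5.76), assuming the hypothetical LabeledGraph container introduced earlier and any parametric hw (for example, a small MLP):

```python
def nonpositional_transition(n, graph, states, h_w):
    """Nonpositional form of Eq. (5.76): x_n = sum over neighbors u of
    h_w(l_n, l_(n,u), x_u, l_u)."""
    l_n = graph.node_labels[n]
    total = None
    for u in graph.ne(n):
        # look up the arc label regardless of the stored orientation
        edge = (n, u) if (n, u) in graph.edge_labels else (u, n)
        term = h_w(l_n, graph.edge_labels[edge], states[u], graph.node_labels[u])
        total = term if total is None else total + term
    return total
```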
Computation of the state: Banach’s fixed‐point theorem not only ensures the existence and uniqueness of the solution of Eq. (5.74), but also suggests the following classic iterative scheme for computing the state:

x(t + 1) = Fw(x(t), l), (5.77)
where x(t) denotes the t‐th iterate of x. The dynamical system in Eq. (5.77) converges exponentially fast to the solution of Eq. (5.75) for any initial value x(0). We can, therefore, think of x(t) as the state that is updated by the transition function Fw. In fact, Eq. (5.77) implements the Jacobi iterative method for solving nonlinear equations [63]. Thus, the outputs and the states can be computed by iterating

xn(t + 1) = fw(ln, lco[n], xne[n](t), lne[n])
on(t) = gw(xn(t), ln), n ∈ N. (5.78)
Note that the computation described in Eq. (5.78) can be interpreted as the representation of a network consisting of units that compute fw and gw. Such a network is called an encoding network, following an analogous terminology used for the recursive neural network model [64]. In order to build the encoding network, each node of the graph is replaced by a unit computing the function fw (see Figure 5.10). Each unit stores the current state xn(t) of node n, and when activated it calculates the state xn(t + 1) using the node label and the information stored in the neighborhood. The simultaneous and repeated activation of the units produces the behavior described in Eq. (5.78). The output of node n is produced by another unit, which implements gw. When fw and gw are implemented by FNNs, the encoding network turns out to be a recurrent neural network where the connections between the neurons can be divided into internal and external connections. The internal connectivity is determined by the neural network architecture used to implement the unit. The external connectivity depends on the edges of the processed graph.
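The relaxation of Eqs. (5.77) and (5.78) can be sketched as a Jacobi-style loop. In the sketch below, f_w(n, graph, states) returns the new state of node n; for instance, it could be the nonpositional transition above with hw fixed via functools.partial. The tolerance and iteration cap are assumptions for illustration.

```python
import numpy as np

def relax_to_fixed_point(graph, f_w, state_dim, tol=1e-6, max_iter=500):
    """Jacobi-style iteration of Eq. (5.77)/(5.78): all node states are updated
    simultaneously until the change falls below a tolerance (sketch only)."""
    states = {n: np.zeros(state_dim) for n in graph.node_labels}   # any x(0) works
    for _ in range(max_iter):
        new_states = {n: f_w(n, graph, states) for n in graph.node_labels}
        delta = max(np.linalg.norm(new_states[n] - states[n]) for n in graph.node_labels)
        states = new_states
        if delta < tol:          # contraction guarantees this loop terminates
            break
    return states
```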
The learning algorithm: Learning in GNNs consists of estimating the parameter w such that ϕw approximates the data in the learning dataset of Eq. (5.73), where qi is the number of supervised nodes in Gi. For graph‐focused tasks, one special node is used for the target (qi = 1 holds), whereas for node‐focused tasks, in principle, the supervision can be performed on every node. The learning task can thus be posed as the minimization of a quadratic cost function

ew = ∑i=1…p ∑j=1…qi (ti,j − ϕw(Gi, ni,j))². (5.79)
Figure 5.10 Graph (on the top, left), the corresponding encoding network (top right), and the network obtained by unfolding the encoding network (at the bottom). The nodes (the circles) of the graph are replaced, in the encoding network, by units computing fw and gw (the squares). When fw and gw are implemented by FNNs, the encoding network is a recurrent neural network. In the unfolding network, each layer corresponds to a time instant and contains a copy of all the units of the encoding network. Connections between layers depend on encoding network connectivity.
The learning algorithm is based on a gradient‐descent strategy and is composed of the following steps:
1 The states xn(t) are iteratively updated by Eq. (5.78) until at time T they approach the fixed‐point solution of Eq. (5.75): x(T) ≈ x.
2 The gradient ∂ew(T)/∂w is computed.
3 The weights w are updated according to the gradient computed in Step 2 (a minimal sketch of one such step follows this list).
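The following sketch shows one gradient-descent step of the three-phase procedure above, with the forward relaxation, output function, and gradient computation passed in as callables; all names are illustrative, not part of the model.

```python
import numpy as np

def train_step(graph, supervised, params, forward, output, grad_fn, lr=0.01):
    """One gradient-descent step of the three-phase procedure (sketch only).

    supervised : dict node -> target t_n
    forward    : callable relaxing the states to the fixed point (Step 1)
    output     : callable (x_n, l_n, params) -> o_n, i.e. g_w
    grad_fn    : callable returning de_w/dw as a dict keyed like params (Step 2),
                 e.g. backpropagation through time or the fixed-point scheme
                 sketched later in this section
    """
    states = forward(graph, params)                                   # Step 1: x(T) ~ x
    outputs = {n: output(states[n], graph.node_labels[n], params) for n in supervised}
    cost = sum(float(np.sum((supervised[n] - outputs[n]) ** 2)) for n in supervised)
    grads = grad_fn(graph, states, supervised, params)                # Step 2
    params = {k: v - lr * grads[k] for k, v in params.items()}        # Step 3
    return params, cost
```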
When it comes to Step 1, note that the hypothesis that Fw is a contraction map ensures convergence to the fixed point. Step 3 is carried out within the traditional framework of gradient descent. We will show in the following that Step 2 can be carried out in a very efficient way by exploiting the diffusion process that takes place in GNNs. This diffusion process is closely related to the one that occurs in recurrent neural networks, for which the gradient computation is based on the BPTT algorithm (see Chapter 3). In this case, the encoding network is unfolded from time T back to an initial time t0. The unfolding produces the layered network shown in Figure 5.10. Each layer corresponds to a time instant and contains a copy of all the units fw of the encoding network. The units of two consecutive layers are connected following graph connectivity. The last layer, corresponding to time T, also includes the units gw and computes the output of the network.
Backpropagation through time consists of carrying out the traditional backpropagation step (see Chapter 3) on the unfolded network to compute the gradient of the cost function at time T with respect to all the instances of fw and gw. Then, ∂ew(T)/∂w is obtained by summing the gradients of all instances. However, backpropagation through time requires storing the states of every instance of the units. When the graphs and T − t0 are large, the memory required may be considerable. In this case, a more efficient approach is possible, based on the Almeida–Pineda algorithm [65, 66]. Since Eq. (5.78) has reached a stable point x before the gradient computation, we can assume that x(t) = x holds for any t ≥ t0. Thus, backpropagation through time can be carried out by storing only x. The following two theorems show that this intuitive approach has a formal justification. The first theorem proves that the function ϕw is differentiable.
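Under the assumption that all Jacobians are evaluated at the stable state x, a compact sketch of this fixed-point style gradient computation (a backward relaxation of an adjoint vector z followed by the accumulation of ∂ew/∂w) might look as follows; the shapes, names, and iteration count are assumptions for illustration.

```python
import numpy as np

def fixed_point_gradient(dF_dx, dF_dw, dG_dx, dG_dw, de_do, n_iter=200):
    """Gradient of the cost at the fixed point without storing the forward
    trajectory (sketch of an Almeida-Pineda-style computation).

    dF_dx : (S, S)  Jacobian of F_w w.r.t. the state, at the stable state x
    dF_dw : (S, W)  Jacobian of F_w w.r.t. the weights
    dG_dx : (M, S)  Jacobian of G_w w.r.t. the state
    dG_dw : (M, W)  Jacobian of G_w w.r.t. the weights
    de_do : (M,)    gradient of the cost w.r.t. the outputs
    """
    b = de_do @ dG_dx                     # (S,)
    z = np.zeros_like(b)
    for _ in range(n_iter):               # backward relaxation z <- z A + b;
        z = z @ dF_dx + b                 # converges because ||dF_dx|| < 1 (contraction)
    return z @ dF_dw + de_do @ dG_dw      # (W,) total gradient de_w/dw
```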