5.3.1 Labeled Graph NN (LGNN)
Here, a graph G is a pair (N, E), where N is the set of nodes and E is the set of edges. The set ne[n] stands for the neighbors of n, that is, the nodes connected to n by an arc, while co[n] denotes the set of arcs having n as a vertex. Nodes and edges may have labels represented by real vectors. The labels attached to node n and edge (n1, n2) will be represented by ln and l(n1, n2), respectively. Let l denote the vector obtained by stacking together all the labels of the graph.
If y is a vector that contains data from a graph and S is a subset of the nodes (the edges), then yS denotes the vector obtained by selecting from y the components related to the nodes (the edges) in S.
For example, lne[n] stands for the vector containing the labels of all the neighbors of n. Labels usually include features of objects related to nodes and features of the relationships between the objects. No assumption is made regarding the arcs: directed and undirected edges are both permitted. However, when different kinds of edges coexist in the same dataset, it is necessary to distinguish them. This can be easily achieved by attaching a proper label to each edge. In this case, different kinds of arcs turn out to be just arcs with different labels.
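As a concrete illustration, the following minimal Python sketch shows one possible way to store a labeled graph together with the ne[n] and co[n] sets defined above; the class and field names are illustrative only and are not part of the model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class LabeledGraph:
    # node_labels[n] is the label vector l_n; edge_labels[(n1, n2)] is l_(n1, n2)
    node_labels: Dict[int, np.ndarray]
    edge_labels: Dict[Tuple[int, int], np.ndarray]

    def ne(self, n: int) -> List[int]:
        """Neighbors of n: the nodes connected to n by an arc."""
        return sorted({v for (u, v) in self.edge_labels if u == n} |
                      {u for (u, v) in self.edge_labels if v == n})

    def co(self, n: int) -> List[Tuple[int, int]]:
        """Arcs having n as a vertex."""
        return [e for e in self.edge_labels if n in e]

# Example: a triangle with 2-dimensional node labels and 1-dimensional edge labels
g = LabeledGraph(
    node_labels={0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 1.0])},
    edge_labels={(0, 1): np.array([0.5]), (1, 2): np.array([0.2]), (0, 2): np.array([0.9])},
)
print(g.ne(1), g.co(1))   # [0, 2]  [(0, 1), (1, 2)]
```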
The considered graphs may be either positional or nonpositional. Nonpositional graphs are those that have been described thus far; positional graphs differ from them in that a unique integer identifier is assigned to each neighbor of a node n to indicate its logical position. Formally, for each node n in a positional graph, there exists an injective function νn : ne[n] → {1, …, ∣N∣}, which assigns to each neighbor u of n a position νn(u).
The domain is the set of pairs of a graph and a node, that is, 𝒟 = 𝒢 × 𝒩, where 𝒢 is a set of graphs and 𝒩 is a subset of their nodes. We assume a supervised learning framework with the learning set

ℒ = {(Gi, ni,j, ti,j) ∣ Gi = (Ni, Ei) ∈ 𝒢; ni,j ∈ Ni; ti,j ∈ Rm, 1 ≤ i ≤ p, 1 ≤ j ≤ qi}, (5.73)

where ni,j ∈ Ni denotes the j‐th node in the set Ni and ti,j is the desired target associated with ni,j. Finally, p ≤ ∣𝒢∣ and qi ≤ ∣Ni∣. All the graphs of the learning set can be combined into a unique disconnected graph, and therefore one might think of the learning set as the pair ℒ = (G, 𝒯), where G = (N, E) is a graph and 𝒯 is a set of pairs {(ni, ti) ∣ ni ∈ N, ti ∈ Rm, 1 ≤ i ≤ q}. It is worth mentioning that this compact definition is useful not only for its simplicity, but also because it directly captures the very nature of some problems where the domain consists of only one graph, for instance, a large portion of the Web (see Figure 5.8).
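For illustration, a short sketch of how several training graphs could be combined into one such disconnected graph, assuming the hypothetical LabeledGraph container from the earlier sketch and non-negative integer node identifiers:

```python
def merge_graphs(graphs):
    """Combine several LabeledGraph objects (see the earlier sketch) into one
    disconnected graph by offsetting node indices, mirroring the compact view
    of the learning set used above."""
    node_labels, edge_labels, offset = {}, {}, 0
    for g in graphs:
        for n, lab in g.node_labels.items():
            node_labels[n + offset] = lab
        for (u, v), lab in g.edge_labels.items():
            edge_labels[(u + offset, v + offset)] = lab
        offset += max(g.node_labels) + 1   # assumes non-negative integer node ids
    return LabeledGraph(node_labels, edge_labels)
```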
The model: As before, the nodes in a graph represent objects or concepts, and edges represent their relationships. Each concept is naturally defined by its features and the related concepts. Thus, we can attach a state xn ∈ Rs to each node n that is based on the information contained in the neighborhood of n (see Figure 5.9). The state xn contains a representation of the concept denoted by n and can be used to produce an output on, that is, a decision about the concept. Let fw be a parametric function, called the local transition function, that expresses the dependence of a node n on its neighborhood, and let gw be the local output function that describes how the output is produced. Then, xn and on are defined as follows:

xn = fw(ln, lco[n], xne[n], lne[n])
on = gw(xn, ln), n ∈ N, (5.74)
where ln, lco[n], xne[n], and lne[n] are the label of n, the labels of its edges, the states, and the labels of the nodes in the neighborhood of n, respectively.
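For concreteness, here is a minimal sketch of one possible realization of fw and gw as single-layer networks in the positional form; the dimensions, weight scale, and padding width M are assumptions for illustration, not prescriptions of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
S, DN, DE = 4, 2, 1      # assumed state, node-label, and edge-label dimensions
M = 3                    # assumed maximum number of neighbors (positional form)

# Single-layer realizations of f_w and g_w; small weights also help keep the
# transition map contractive (see the discussion of the contraction condition below).
IN_F = DN + M * (DE + S + DN)            # [l_n, l_co[n], x_ne[n], l_ne[n]], padded to M slots
W_f = 0.1 * rng.standard_normal((S, IN_F))
W_g = 0.1 * rng.standard_normal((1, S + DN))

def f_w(l_n, l_co, x_ne, l_ne):
    """Local transition function: new state of node n from its label, the labels of
    its arcs, and the (padded) states and labels of its neighbors."""
    z = np.concatenate([l_n, l_co, x_ne, l_ne])
    return np.tanh(W_f @ z)              # tanh keeps the update smooth and bounded

def g_w(x_n, l_n):
    """Local output function: output of node n from its state and its label."""
    return W_g @ np.concatenate([x_n, l_n])
```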
Let x, o, l, and lN be the vectors constructed by stacking all the states, all the outputs, all the labels, and all the node labels, respectively. Then, a compact form of Eq. (5.74) becomes

x = Fw(x, l)
o = Gw(x, lN), (5.75)
where Fw, the global transition function, and Gw, the global output function, are stacked versions of ∣N∣ instances of fw and gw, respectively. For us, the case of interest is when x and o are uniquely defined and Eq. (5.75) defines a map ϕw : 𝒟 → Rm that takes a graph as input and returns an output on for each node. Banach’s fixed‐point theorem [62] provides a sufficient condition for the existence and uniqueness of the solution of a system of equations. According to Banach’s theorem, Eq. (5.75) has a unique solution provided that Fw is a contraction map with respect to the state; that is, there exists μ, 0 ≤ μ < 1, such that ‖Fw(x, l) − Fw(y, l)‖ ≤ μ‖x − y‖ holds for any x, y, where ‖·‖ denotes a vectorial norm. Thus, for the moment, let us assume that Fw is a contraction map. Later, we will show that, in GNNs, this property is enforced by an appropriate implementation of the transition function.
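The contraction condition can be probed numerically; the following sketch estimates the ratio ‖Fw(x, l) − Fw(y, l)‖/‖x − y‖ over random state pairs (a heuristic check only, not a proof of contraction; the function name and arguments are illustrative).

```python
import numpy as np

def estimate_contraction_ratio(F_w, l, state_dim, trials=200, seed=0):
    """Empirically estimate max ||F_w(x, l) - F_w(y, l)|| / ||x - y|| over random
    state pairs. A value below 1 is consistent with (but does not prove) F_w
    being a contraction with respect to the state."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        x, y = rng.standard_normal(state_dim), rng.standard_normal(state_dim)
        ratio = np.linalg.norm(F_w(x, l) - F_w(y, l)) / np.linalg.norm(x - y)
        worst = max(worst, ratio)
    return worst
```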
Figure 5.8 A subset of the Web.
Figure 5.9 Graph and the neighborhood of a node. The state x1 of node 1 depends on the information contained in its neighborhood.
Source: Scarselli et al. [1].
For positional graphs, fw must receive the positions of the neighbors as additional inputs. In practice, this can be easily achieved provided that the information contained in xne[n], lco[n], and lne[n] is sorted according to neighbors’ positions and is appropriately padded with special null values in positions corresponding to nonexistent neighbors. For example, xne[n] = [y1, …, yM], where M = maxn,u νn(u) is the maximum number of neighbors of a node; yi = xu holds if u is the i‐th neighbor of n (νn(u) = i); and yi = x0, for some predefined null state x0, if there is no i‐th neighbor.
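A small sketch of this sorting-and-padding step (the argument names are illustrative):

```python
import numpy as np

def pad_neighbor_states(states, neighbor_ids, positions, M, x_null):
    """Arrange neighbor states by their positions nu_n(u) and pad with a null state.

    states       : dict node -> state vector x_u
    neighbor_ids : the neighbors u of node n
    positions    : dict u -> nu_n(u) in {1, ..., M}
    x_null       : predefined null state x_0 for empty slots
    """
    slots = [x_null] * M
    for u in neighbor_ids:
        slots[positions[u] - 1] = states[u]   # the i-th neighbor goes to slot i
    return np.concatenate(slots)
```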
For nonpositional graphs, it is useful to replace function fw of Eq. (5.74) with

xn = ∑u∈ne[n] hw(ln, l(n, u), xu, lu), n ∈ N, (5.76)
where hw is a parametric function. This transition function, which has been successfully used in recursive NNs, is not affected by the positions or the number of the children. In the following, Eq. (5.76) is referred to as the nonpositional form, and Eq. (5.74) is called the positional form. To implement the GNN model, we need (i) a method to solve Eq. (5.74); (ii) a learning algorithm to adapt fw and gw using examples from the training dataset; and (iii) an implementation of fw and gw. These steps will be considered in the following paragraphs.
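A minimal sketch of the nonpositional form of Eq. (5.76), assuming the hypothetical LabeledGraph container introduced earlier and any parametric hw (for example, a small MLP):

```python
def nonpositional_transition(n, graph, states, h_w):
    """Nonpositional form of Eq. (5.76): x_n = sum over neighbors u of
    h_w(l_n, l_(n,u), x_u, l_u)."""
    l_n = graph.node_labels[n]
    total = None
    for u in graph.ne(n):
        # look up the arc label regardless of the stored orientation
        edge = (n, u) if (n, u) in graph.edge_labels else (u, n)
        term = h_w(l_n, graph.edge_labels[edge], states[u], graph.node_labels[u])
        total = term if total is None else total + term
    return total
```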
Computation of the state: Banach’s fixed‐point theorem not only ensures the existence and uniqueness of the solution of Eq. (5.74), but also suggests the following classic iterative scheme for computing the state:

x(t + 1) = Fw(x(t), l), (5.77)
where x(t) denotes the t‐th iterate of x. The dynamical system in Eq. (5.77) converges exponentially fast to the solution of Eq. (5.75) for any initial value x(0). We can, therefore, think of x(t) as the state that is updated by the transition function Fw. In fact, Eq. (5.77) implements the Jacobi iterative method for solving nonlinear equations [63]. Thus, the outputs and the states can be computed by iterating

xn(t + 1) = fw(ln, lco[n], xne[n](t), lne[n])
on(t) = gw(xn(t), ln), n ∈ N. (5.78)
Note that the computation described in Eq. (5.78) can be interpreted as the representation of a network consisting of units that compute fw and gw. Such a network is called an encoding network, following an analogous terminology used for the recursive neural network model [64]. In order to build the encoding network, each node of the graph is replaced by a unit computing the function fw (see Figure 5.10). Each unit stores the current state xn(t) of node n, and when activated it calculates the state xn(t + 1) using the node label and the information stored in the neighborhood. The simultaneous and repeated activation of the units produces the behavior described in Eq. (5.78). The output of node n is produced by another unit, which implements gw. When fw and gw are implemented by FNNs, the encoding network turns out to be a recurrent neural network where the connections between the neurons can be divided into internal and external connections. The internal connectivity is determined by the neural network architecture used to implement the unit. The external connectivity depends on the edges of the processed graph.
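The relaxation of Eqs. (5.77) and (5.78) can be sketched as a Jacobi-style loop. In the sketch below, f_w(n, graph, states) returns the new state of node n; for instance, it could be the nonpositional transition above with hw fixed via functools.partial. The tolerance and iteration cap are assumptions for illustration.

```python
import numpy as np

def relax_to_fixed_point(graph, f_w, state_dim, tol=1e-6, max_iter=500):
    """Jacobi-style iteration of Eq. (5.77)/(5.78): all node states are updated
    simultaneously until the change falls below a tolerance (sketch only)."""
    states = {n: np.zeros(state_dim) for n in graph.node_labels}   # any x(0) works
    for _ in range(max_iter):
        new_states = {n: f_w(n, graph, states) for n in graph.node_labels}
        delta = max(np.linalg.norm(new_states[n] - states[n]) for n in graph.node_labels)
        states = new_states
        if delta < tol:          # contraction guarantees this loop terminates
            break
    return states
```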
The learning algorithm: Learning in GNNs consists of estimating the parameter w such that ϕw approximates the data in the learning dataset of Eq. (5.73), where qi is the number of supervised nodes in Gi. For graph‐focused tasks, one special node is used for the target (qi = 1 holds), whereas for node‐focused tasks, in principle, the supervision can be performed on every node. The learning task can thus be posed as the minimization of a quadratic cost function

ew = ∑i=1…p ∑j=1…qi (ti,j − ϕw(Gi, ni,j))². (5.79)
Figure 5.10 Graph (on the top, left), the corresponding encoding network (top right), and the network obtained by unfolding the encoding network (at the bottom). The nodes (the circles) of the graph are replaced, in the encoding network, by units computing fw and gw (the squares). When fw and gw are implemented by FNNs, the encoding network is a recurrent neural network. In the unfolding network, each layer corresponds to a time instant and contains a copy of all the units of the encoding network. Connections between layers depend on encoding network connectivity.
The learning algorithm is based on a gradient‐descent strategy and is composed of the following steps:
1 The states xn(t) are iteratively updated by Eq. (5.78) until at time T they approach the fixed‐point solution of Eq. (5.75): x(T) ≈ x.
2 The gradient ∂ew(T)/∂w is computed.
3 The weights w are updated according to the gradient computed in Step 2 (a minimal sketch of one such step follows this list).
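The following sketch shows one gradient-descent step of the three-phase procedure above, with the forward relaxation, output function, and gradient computation passed in as callables; all names are illustrative, not part of the model.

```python
import numpy as np

def train_step(graph, supervised, params, forward, output, grad_fn, lr=0.01):
    """One gradient-descent step of the three-phase procedure (sketch only).

    supervised : dict node -> target t_n
    forward    : callable relaxing the states to the fixed point (Step 1)
    output     : callable (x_n, l_n, params) -> o_n, i.e. g_w
    grad_fn    : callable returning de_w/dw as a dict keyed like params (Step 2),
                 e.g. backpropagation through time or the fixed-point scheme
                 sketched later in this section
    """
    states = forward(graph, params)                                   # Step 1: x(T) ~ x
    outputs = {n: output(states[n], graph.node_labels[n], params) for n in supervised}
    cost = sum(float(np.sum((supervised[n] - outputs[n]) ** 2)) for n in supervised)
    grads = grad_fn(graph, states, supervised, params)                # Step 2
    params = {k: v - lr * grads[k] for k, v in params.items()}        # Step 3
    return params, cost
```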
When it comes to Step 1, note that the hypothesis that Fw is a contraction map ensures convergence to the fixed point. Step 3 is carried out within the traditional framework of gradient descent. We will show in the following that Step 2 can be carried out in a very efficient way by exploiting the diffusion process that takes place in GNNs. This diffusion process is closely related to the one that occurs in recurrent neural networks, for which the gradient computation is based on the BPTT algorithm (see Chapter 3). In this case, the encoding network is unfolded from time T back to an initial time t0. The unfolding produces the layered network shown in Figure 5.10. Each layer corresponds to a time instant and contains a copy of all the units fw of the encoding network. The units of two consecutive layers are connected following graph connectivity. The last layer, corresponding to time T, also includes the units gw and computes the output of the network.
Backpropagation through time consists of carrying out the traditional backpropagation step (see Chapter 3) on the unfolded network to compute the gradient of the cost function at time T with respect to all the instances of fw and gw. Then, ∂ew(T)/∂w is obtained by summing the gradients of all instances. However, backpropagation through time requires storing the states of every instance of the units. When the graphs and T − t0 are large, the memory required may be considerable. In this case, a more efficient approach is possible, based on the Almeida–Pineda algorithm [65, 66]. Since Eq. (5.78) has reached a stable point x before the gradient computation, we can assume that x(t) = x holds for any t ≥ t0. Thus, backpropagation through time can be carried out by storing only x. The following two theorems show that this intuitive approach has a formal justification. The first theorem proves that the function ϕw is differentiable.
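Under the assumption that all Jacobians are evaluated at the stable state x, a compact sketch of this fixed-point style gradient computation (a backward relaxation of an adjoint vector z followed by the accumulation of ∂ew/∂w) might look as follows; the shapes, names, and iteration count are assumptions for illustration.

```python
import numpy as np

def fixed_point_gradient(dF_dx, dF_dw, dG_dx, dG_dw, de_do, n_iter=200):
    """Gradient of the cost at the fixed point without storing the forward
    trajectory (sketch of an Almeida-Pineda-style computation).

    dF_dx : (S, S)  Jacobian of F_w w.r.t. the state, at the stable state x
    dF_dw : (S, W)  Jacobian of F_w w.r.t. the weights
    dG_dx : (M, S)  Jacobian of G_w w.r.t. the state
    dG_dw : (M, W)  Jacobian of G_w w.r.t. the weights
    de_do : (M,)    gradient of the cost w.r.t. the outputs
    """
    b = de_do @ dG_dx                     # (S,)
    z = np.zeros_like(b)
    for _ in range(n_iter):               # backward relaxation z <- z A + b;
        z = z @ dF_dx + b                 # converges because ||dF_dx|| < 1 (contraction)
    return z @ dF_dw + de_do @ dG_dw      # (W,) total gradient de_w/dw
```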