5.3.2 Computational Complexity
Here, we derive an analysis of the computational cost of GNNs. The analysis will focus on three different GNN models: positional GNNs, where the functions fw and gw of Eq. (5.74) are implemented by FNNs; linear (nonpositional) GNNs; and nonlinear (nonpositional) GNNs.
First, we will describe in more detail the most complex instructions involved in the learning procedure (see Table 5.2, reproduced from [1]). Then, the complexity of the learning algorithm will be defined. For the sake of simplicity, the cost is derived assuming that the training set contains just one graph G. Such an assumption does not cause any loss of generality, since the graphs of the training set can always be merged into a single graph. The complexity is measured by the order of floating point operations. By the common definition of time complexity, an algorithm requires O(l(a)) operations if there exist α > 0 and a₀ such that c(a) ≤ αl(a) holds for each a ≥ a₀, where c(a) is the maximal number of operations executed by the algorithm when the length of the input is a.
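To make the single-graph assumption concrete, the sketch below merges a set of training graphs into one disconnected graph by stacking their adjacency matrices block-diagonally. It is a minimal sketch: the function and variable names (merge_graphs, adjs, labels) are illustrative and not taken from [1].

```python
# Sketch: merging a training set of graphs into a single disconnected graph.
# Each graph is assumed to be given as a dense adjacency matrix plus a matrix
# of node labels; all names here are illustrative.
import numpy as np

def merge_graphs(adjs, labels):
    """Stack graphs into one block-diagonal adjacency and one label matrix."""
    n_total = sum(a.shape[0] for a in adjs)
    big_adj = np.zeros((n_total, n_total))
    big_lab = np.concatenate(labels, axis=0)
    offset = 0
    for a in adjs:
        n = a.shape[0]
        big_adj[offset:offset + n, offset:offset + n] = a  # no cross-graph edges
        offset += n
    return big_adj, big_lab

if __name__ == "__main__":
    g1 = np.array([[0, 1], [1, 0]])                    # 2-node graph
    g2 = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])   # 3-node graph
    l1 = np.random.randn(2, 4)                         # 4-dimensional node labels
    l2 = np.random.randn(3, 4)
    A, L = merge_graphs([g1, g2], [l1, l2])
    print(A.shape, L.shape)                            # (5, 5) (5, 4)
```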
We will assume that there exist two procedures FP and BP, which implement the forward phase and the backward phase of the backpropagation procedure, respectively. Formally, given a function lw : Rᵃ → Rᵇ implemented by an FNN, we have

FP(lw, y) = lw(y)

BP(lw, y, δ) = [δ(∂lw/∂w)(y), δ(∂lw/∂y)(y)]
Table 5.2 Time complexity of the most expensive instructions of the learning algorithm. For each instruction and each GNN model, a bound on the order of floating point operations is given. The table also displays the number of times per epoch that each instruction is executed. Here, c̄ᶠ(l) and c̄ᵇ(l) denote the cost of one forward and one backward pass, respectively, through the FNN that implements l.
Source: Scarselli et al. [1].
Instruction | Positional | Nonlinear | Linear | Execs. |
---|---|---|---|---|
z(t + 1) = z(t) · A + b | s²∣E∣ | s²∣E∣ | s²∣E∣ | itb |
o = Gw(x(t), lN) | ∣N∣ · c̄ᶠ(g) | ∣N∣ · c̄ᶠ(g) | ∣N∣ · c̄ᶠ(g) | 1 |
x(t + 1) = Fw(x(t), l) | ∣N∣ · c̄ᶠ(f) | ∣E∣ · c̄ᶠ(h) | s²∣E∣ | itf |
A = (∂Fw/∂x)(x, l) | s∣N∣ · c̄ᵇ(f) | s∣E∣ · c̄ᵇ(h) | – | 1 |
∂ew/∂o | ∣N∣ | ∣N∣ | ∣N∣ | 1 |
∂pw/∂w | see text | see text | – | 1 |
b = (∂ew/∂o)(∂Gw/∂x)(x, lN) | ∣N∣ · c̄ᵇ(g) | ∣N∣ · c̄ᵇ(g) | ∣N∣ · c̄ᵇ(g) | 1 |
c = (∂ew/∂o)(∂Gw/∂w)(x, lN) | ∣N∣ · c̄ᵇ(g) | ∣N∣ · c̄ᵇ(g) | ∣N∣ · c̄ᵇ(g) | 1 |
d = z(t)(∂Fw/∂w)(x, l) | ∣N∣ · c̄ᵇ(f) | ∣E∣ · c̄ᵇ(h) | see text | 1 |
Here, y ∈ Rᵃ is the input vector, and the row vector δ ∈ Rᵇ is a signal that suggests how the network output must be adjusted to improve the cost function. In most applications, the cost function is ew(y) = (t − o)², where o = lw(y) and t (target) is the vector of the desired output corresponding to the input y, so that δ = (∂ew/∂o)(y) = 2(o − t). On the other hand, δ(∂lw/∂y)(y) is the gradient of ew with respect to the network input and is easily computed as a side product of backpropagation. Backpropagation computes for each neuron v the delta value (∂ew/∂av)(y) = δ(∂lw/∂av)(y), where ew is the cost function and av is the activation level of neuron v. Thus, δ(∂lw/∂y)(y) is just a vector stacking all the delta values of the input neurons. Finally, let c̄ᶠ(lw) and c̄ᵇ(lw) denote the computational complexity required by the application of FP and BP to lw, respectively. For example, if lw is implemented by a multilayered FNN with a inputs, b hidden neurons, and c outputs, then c̄ᶠ(lw) = c̄ᵇ(lw) = O(ab + bc) holds.
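As a concrete illustration of FP and BP, the following minimal sketch implements them for a one-hidden-layer FNN with tanh hidden units and a linear output. The class and variable names are illustrative assumptions; BP returns the pair [δ(∂lw/∂w)(y), δ(∂lw/∂y)(y)] discussed above, with the input gradient obtained from the delta values of the input neurons.

```python
# Sketch of FP and BP for a one-hidden-layer FNN lw : R^a -> R^b with tanh
# hidden units and a linear output layer; names are illustrative only.
import numpy as np

class FNN:
    def __init__(self, a, hidden, b, rng=np.random.default_rng(0)):
        self.W1 = rng.normal(scale=0.1, size=(hidden, a))
        self.t1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.1, size=(b, hidden))
        self.t2 = np.zeros(b)

def FP(net, y):
    """Forward phase: returns lw(y)."""
    h = np.tanh(net.W1 @ y + net.t1)
    return net.W2 @ h + net.t2

def BP(net, y, delta):
    """Backward phase: returns (delta * dlw/dw, delta * dlw/dy).

    delta is a row vector (shape (b,)) telling how the output should change;
    the first result stacks the gradients w.r.t. all weights, the second is
    the vector of delta values of the input neurons."""
    a1 = net.W1 @ y + net.t1
    h = np.tanh(a1)
    grad_W2 = np.outer(delta, h)          # output-layer weight gradient
    grad_t2 = delta
    delta_h = (delta @ net.W2) * (1.0 - h ** 2)   # delta values of hidden neurons
    grad_W1 = np.outer(delta_h, y)
    grad_t1 = delta_h
    grad_w = np.concatenate([grad_W1.ravel(), grad_t1, grad_W2.ravel(), grad_t2])
    grad_y = delta_h @ net.W1             # delta * dlw/dy
    return grad_w, grad_y

# usage with the quadratic cost ew = (t - o)^2, so delta = 2(o - t)
net = FNN(a=3, hidden=5, b=2)
y, t = np.ones(3), np.zeros(2)
o = FP(net, y)
grad_w, grad_y = BP(net, y, 2.0 * (o - t))
```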
Complexity of Instructions
1 Instructions z(t + 1) = z(t) · A + b, o = Gw(x(t), lN), and x(t + 1) = Fw(x(t), l): Since A is a matrix having at most s²∣E∣ nonnull elements, the multiplication of z(t) by A, and as a consequence the instruction z(t + 1) = z(t) · A + b, costs O(s²∣E∣) floating point operations. The state x(t + 1) and the output vector o are calculated by applying the local transition function and the local output function to each node n. Thus, in positional GNNs and in nonlinear GNNs, where fw, hw, and gw are directly implemented by FNNs, x(t + 1) and o are computed by running the forward phase of backpropagation once for each node or edge (see Table 5.2). On the other hand, in linear GNNs, xn(t) is calculated in two steps: first, the matrices An of Eq. (5.82) and the vectors bn of Eq. (5.83) are evaluated; then, x(t) is computed. The former phase, whose cost amounts to one forward pass of the corresponding networks for each arc and node, is executed once per epoch, whereas the latter phase, whose cost is O(s²∣E∣), is executed at every step of the cycle in the function FORWARD.
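The following sketch illustrates why one iteration of z(t + 1) = z(t) · A + b costs O(s²∣E∣): A is stored as a dictionary of s × s blocks, one per arc, so each iteration touches every stored block exactly once. The data layout and names are illustrative assumptions, not the implementation used in [1].

```python
# Sketch: the backward iteration z(t+1) = z(t) * A + b with A stored as a
# block-sparse matrix {(n, u): s-by-s block}; one iteration costs O(s^2 |E|).
import numpy as np

def step(z, blocks, b):
    """One iteration of z(t+1) = z(t) A + b; z and b have shape (num_nodes, s)."""
    z_next = b.copy()
    for (n, u), A_nu in blocks.items():   # A_nu measures the effect of node u on node n
        z_next[u] += z[n] @ A_nu          # s^2 operations per stored arc
    return z_next

# toy example: 3 nodes, 2 arcs, state dimension s = 4
s, num_nodes = 4, 3
rng = np.random.default_rng(0)
blocks = {(0, 1): 0.1 * rng.normal(size=(s, s)),
          (1, 2): 0.1 * rng.normal(size=(s, s))}
b = rng.normal(size=(num_nodes, s))
z = np.zeros((num_nodes, s))
for _ in range(10):                       # itb iterations of the BACKWARD cycle
    z = step(z, blocks, b)
```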
2 Instruction A = (∂Fw/∂x)(x, l): This instruction requires the computation of the Jacobian of Fw. Note that A = {An,u} is a block matrix where the block An,u measures the effect of node u on node n if there is an arc (n, u) from u to n, and is null otherwise. In the linear model, the matrices An,u correspond to those displayed in Eq. (5.82) and are used to calculate x(t) in the forward phase. Thus, such an instruction has no cost in the backward phase in linear GNNs.

In nonlinear GNNs, An,u = (∂hw/∂xu)(ln, l(n,u), xu, lu) is computed by appropriately exploiting the backpropagation procedure. In other words, let qi ∈ Rˢ be a vector whose components are all zero except for the i‐th one, which equals one, that is, q1 = [1, 0, …, 0], q2 = [0, 1, 0, …, 0], and so on. Note that BP, when applied to lw with δ = qi, returns qi(∂lw/∂y)(y), that is, the i‐th column of the Jacobian (∂lw/∂y)(y). Thus, An,u can be computed by applying BP on all the qi, that is,

An,u = [BP2(hw, y, q1), …, BP2(hw, y, qs)], restricted to the components corresponding to the inputs xu, with y = [ln, l(n,u), xu, lu], (5.84)
where BP2 indicates that we are considering only the second part of the output of BP, that is, the derivative with respect to the network input. A similar reasoning can also be used with positional GNNs. The complexity of these procedures is easily derived and is displayed in the fourth row of Table 5.2.
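A minimal numerical sketch of Eq. (5.84): the Jacobian block An,u is recovered by running the backward phase s times with the one-hot vectors qi as δ, and one column of the result is checked against a finite-difference approximation. The tiny tanh transition network and the way the input is sliced into [ln, xu] are simplifying assumptions made for illustration.

```python
# Sketch of Eq. (5.84): recovering A_{n,u} = dh_w/dx_u by running the backward
# phase s times with one-hot vectors q_i as delta. Network and names illustrative.
import numpy as np

rng = np.random.default_rng(1)
s, label_dim = 3, 2                     # state dimension and label dimension
in_dim = label_dim + s                  # input y = [l_n, x_u] (simplified)
hid = 5
W1, t1 = rng.normal(size=(hid, in_dim)), np.zeros(hid)
W2, t2 = rng.normal(size=(s, hid)), np.zeros(s)

def h_w(y):                             # transition network h_w : R^{in_dim} -> R^s
    return W2 @ np.tanh(W1 @ y + t1) + t2

def bp_input_grad(y, delta):            # returns delta * dh_w/dy (part of BP's output)
    hidden = np.tanh(W1 @ y + t1)
    return ((delta @ W2) * (1 - hidden ** 2)) @ W1

y = rng.normal(size=in_dim)
# build dh_w/dy one slice at a time, as in Eq. (5.84)
jac = np.stack([bp_input_grad(y, np.eye(s)[i]) for i in range(s)])
A_nu = jac[:, label_dim:]               # keep only the components w.r.t. x_u

# finite-difference check of the first column of A_nu
eps = 1e-6
e = np.zeros(in_dim); e[label_dim] = eps
approx = (h_w(y + e) - h_w(y)) / eps
print(np.allclose(A_nu[:, 0], approx, atol=1e-4))
```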
3 Computation of ∂ew/∂o and ∂pw/∂w: In linear GNNs, the cost function is ew = Σk (tk − ok)², where the sum ranges over the supervised nodes nk, and as a consequence, the k‐th component of ∂ew/∂o equals 2(ok − tk) if nk is a node belonging to the training set, and 0 otherwise. Thus, ∂ew/∂o is easily calculated by O(∣N∣) operations.
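A minimal sketch of this computation, assuming a quadratic cost defined only on the supervised nodes: the gradient ∂ew/∂o is filled node by node and is zero for unsupervised nodes, which is why O(∣N∣) operations suffice. The variable names are illustrative.

```python
# Sketch: gradient de_w/do for a quadratic cost on the supervised nodes only;
# entries for unsupervised nodes stay zero. Names illustrative.
import numpy as np

o = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])            # outputs for 3 nodes
targets = {0: np.array([0.0, 1.0]), 2: np.array([1.0, 0.0])}  # supervised nodes only

grad = np.zeros_like(o)
for k, t in targets.items():
    grad[k] = 2.0 * (o[k] - t)     # 2(o_k - t_k) on supervised nodes, 0 elsewhere
```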
In positional and nonlinear GNNs, a penalty term pw is added to the cost function to force the transition function to be a contraction map. In this case, it is necessary to compute ∂pw/∂w, because such a vector must be added to the gradient. Let An,u(i, j) denote the element in position i, j of the block An,u. Recalling the definition of pw, we have

where each term of the derivative is nonnull only if the corresponding sum exceeds the threshold μ, and is 0 otherwise. It follows that

where sgn is the sign function. Let Rn,u be the matrix that stores, in position i, j, the corresponding term of the above derivative, and let vec be the operator that takes a matrix and produces a column vector by stacking all its columns one on top of the other. Then
holds. The vector ∂vec(An, u)/∂w depends on selected implementation of hw or fw . For the sake of simplicity, let us restrict our attention to nonlinear GNNs and assume that the transition network is a three‐layered FNN. σj, aj, Vj , and tj are the activation function, the vector of the activation levels, the matrix of the weights, and the thresholds of the j‐th layer, respectively. σj is a vectorial function that takes as input the vector of the activation levels of neurons in a layer and returns the vector of the outputs of the neurons of the same layer. The following reasoning can also be extended to positional GNNs and networks with a different number of layers. The function hw is formally defined in terms of σj, aj, Vj , and tj
By the chain differentiation rule, we get
where is the derivative of σj , diag is an operator that transforms a vector into a diagonal matrix having such a vector as diagonal, and is the submatrix of V1 that contains only the weights that connect the inputs corresponding to xu to the hidden layer. The parameters w affect four components of vec(An, u), that is, a3, V2, a2 , and . By the properties of derivatives for matrix products and the chain rule
(5.86)
holds. Thus, (vec(Rn,u))′ · ∂vec(An,u)/∂w is the sum of four contributions. In order to derive a method of computing those terms, let Ia denote the a × a identity matrix. Let ⊗ be the Kronecker product, and suppose that Pa is an a² × a matrix such that vec(diag(v)) = Pa v holds for any vector v ∈ Rᵃ. By the Kronecker product's properties, vec(AB) = (B′ ⊗ Ia) · vec(A) holds for matrices A, B, and Ia having compatible dimensions [67]. Thus, we have
which implies
Similarly, using the properties vec(ABC) = (C′ ⊗ A) · vec(B) and vec(AB) = (Ia ⊗ A) · vec(B), it follows that
where dh is the number of hidden neurons. Then, we have
(5.89)
where the aforementioned Kronecker product properties have been used.
It follows that (vec(Rn,u))′ · ∂vec(An,u)/∂w can be written as the sum of the four contributions represented by Eqs. (5.87)–(5.90). The second and the fourth terms, Eqs. (5.88) and (5.90), can be computed directly using the corresponding formulas. The first one can be calculated by observing that it looks like the function computed by a three‐layered FNN that is the same as hw except for the activation function of the last layer. In fact, if we denote such a network by h̃w, then
holds. A similar reasoning can also be applied to the third contribution.
Required number of operations: The above method includes two tasks: the matrix multiplications of Eqs. (5.87)–(5.90) and the backpropagation defined by Eq. (5.91). The former task consists of several matrix multiplications. By inspection of Eqs. (5.87)–(5.90), the number of floating point operations is approximately 2s² + 12s·hih + 10s²·hih, where hih denotes the number of hidden‐layer neurons of the network implementing the function hw. The second task has approximately the same cost as a backpropagation phase through the original function hw. These values are obtained from the following observations: for an a × b matrix C and a b × c matrix D, the multiplication CD requires approximately 2abc operations; more precisely, abc multiplications and ac(b − 1) sums. If D is a diagonal b × b matrix, then CD requires 2ab operations. Similarly, if C is an a × b matrix, D is a b × a matrix, and Pa is the a² × a matrix defined above and used in Eqs. (5.87)–(5.90), then computing (vec(CD))′ Pa costs only 2ab operations provided that a sparse representation is used for Pa. Finally, a1, a2, a3 are already available, since they are computed during the forward phase of the learning algorithm. Thus, the complexity of computing ∂pw/∂w is O(∣E∣ · (s²·hih + c̄ᵇ(h))). Note, however, that even if the sum in Eq. (5.85) ranges over all the arcs of the graph, only those arcs (n, u) such that Rn,u ≠ 0 have to be considered. In practice, Rn,u ≠ 0 is a rare event, since it happens only when a column of the Jacobian is larger than μ, and the penalty function is precisely intended to limit the occurrence of such cases. As a consequence, a better estimate of the complexity of computing ∂pw/∂w is obtained by restricting the count to the arcs (n, u) for which Rn,u ≠ 0; this is captured by tR, the average number of nodes u such that Rn,u ≠ 0 holds for some n.
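Before moving on, the vec/Kronecker identities used above, together with the sparse matrix Pa, can be checked numerically. The short sketch below does so with NumPy, using column-major ("Fortran") flattening so that vec stacks columns; the dimensions chosen are arbitrary.

```python
# Numerical check of the vec/Kronecker identities and of the sparse matrix P_a.
import numpy as np

def vec(M):
    """vec() stacks the columns of M, i.e. column-major (Fortran) flattening."""
    return M.flatten(order="F")

rng = np.random.default_rng(0)
a, b, c = 3, 4, 2
A = rng.normal(size=(a, b))
B = rng.normal(size=(b, c))
C = rng.normal(size=(c, 5))
v = rng.normal(size=a)

# vec(AB) = (B' kron I_a) vec(A)   and   vec(AB) = (I_c kron A) vec(B)
print(np.allclose(vec(A @ B), np.kron(B.T, np.eye(a)) @ vec(A)))
print(np.allclose(vec(A @ B), np.kron(np.eye(c), A) @ vec(B)))
# vec(ABC) = (C' kron A) vec(B)
print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))

# P_a is sparse: one 1 per column, sending v_i to position (i, i) of diag(v)
P = np.zeros((a * a, a))
for i in range(a):
    P[i * a + i, i] = 1.0
print(np.allclose(vec(np.diag(v)), P @ v))
```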
4 Instructions b = (∂ew/∂o)(∂Gw/∂x)(x, lN) and c = (∂ew/∂o)(∂Gw/∂w)(x, lN): The terms b and c can be calculated by backpropagating ∂ew/∂o through the network that implements gw. Since such an operation must be repeated for each node, the time complexity of both instructions is O(∣N∣ · c̄ᵇ(g)) for all the GNN models.
5 Instruction d = z(t)(∂Fw/∂w)(x, l): By the definition of Fw, fw, and BP, we have (5.92)
where y = [ln, xu, l(n, u), lu] and BP1 indicates that we are considering only the first part of the output of BP. Similarly
(5.93)
where y = [ln, xu, l(n, u), lu]. These two equations provide a direct method to compute d in positional and nonlinear GNNs, respectively.
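As a sketch of the nonlinear case, the code below accumulates d by performing one backward pass of hw per arc, feeding zn(t) as the δ signal and summing the resulting weight gradients, in the spirit of Eq. (5.93). The network shape, the toy arc list, and the simplified input [ln, xu] are assumptions made for illustration only.

```python
# Sketch of the nonlinear case: d = z(t) dF_w/dw accumulated by one backward
# pass of h_w per arc, with z_n(t) as the delta signal. Names illustrative.
import numpy as np

rng = np.random.default_rng(2)
s, lab = 3, 2
in_dim, hid = lab + s, 6                 # input y = [l_n, x_u] (labels simplified)
W1, W2 = rng.normal(size=(hid, in_dim)), rng.normal(size=(s, hid))

def bp_weight_grad(y, delta):
    """delta * dh_w/dw for h_w(y) = W2 tanh(W1 y): returns (dW1, dW2)."""
    h = np.tanh(W1 @ y)
    dW2 = np.outer(delta, h)
    delta_h = (delta @ W2) * (1 - h ** 2)
    dW1 = np.outer(delta_h, y)
    return dW1, dW2

arcs = [(0, 1), (1, 2), (2, 0)]          # arcs (n, u) of a toy graph
labels = rng.normal(size=(3, lab))
x = rng.normal(size=(3, s))              # current state estimate
z = rng.normal(size=(3, s))              # z(t) from the BACKWARD iteration

dW1_total = np.zeros_like(W1)
dW2_total = np.zeros_like(W2)
for n, u in arcs:
    y = np.concatenate([labels[n], x[u]])
    dW1, dW2 = bp_weight_grad(y, z[n])   # one backward pass per arc
    dW1_total += dW1
    dW2_total += dW2
```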
For linear GNNs, let hn,u denote the output of hw and note that

holds, where each element of the matrix An,u is obtained from the corresponding output of the transition network, each element of the vector bn is obtained from the corresponding output of the forcing network (see Eq. (5.83)), and xu is the state of node u. Then
where δ = ∣ne[n]∣ · z′(t), and the signal fed to the transition network is a vector that stores, in the position corresponding to i, j, the associated term of the above expression. Thus, in linear GNNs, d is computed by calling the backpropagation procedure once for each arc and each node.
Time complexity of the GNN model: Formally, the complexity is easily derived from Table 5.2: for one run of the forward phase it is O(itf · ∣N∣ · c̄ᶠ(f) + ∣N∣ · c̄ᶠ(g)) for positional GNNs, O(itf · ∣E∣ · c̄ᶠ(h) + ∣N∣ · c̄ᶠ(g)) for nonlinear GNNs, and O(itf · s²∣E∣ + ∣N∣ · c̄ᶠ(g)), plus the cost of evaluating the matrices An and the vectors bn once per input graph, for linear GNNs. In practice, the cost of the test phase is mainly due to the repeated computation of the state x(t). The cost of each iteration is linear with respect to both the dimension of the input graph (the number of edges) and the dimension of the employed FNNs and the state, with the sole exception of linear GNNs, whose single iteration cost is quadratic with respect to the state. The number of iterations required for the convergence of the state depends on the problem at hand, but Banach's theorem ensures that the convergence is exponentially fast, and experiments have shown that 5–15 iterations are generally sufficient to approximate the fixed point [1].
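A minimal sketch of the forward fixed-point loop: the state is updated until successive iterates are close, which, for a contraction map, happens after a number of iterations that grows only logarithmically with the required precision. The hand-made contraction map below is a stand-in for a trained transition network; all names are illustrative.

```python
# Sketch of the forward fixed-point loop x(t+1) = F_w(x(t), l): iterate until
# successive states are close. The damped neighbor averaging is a stand-in
# contraction, not a trained transition network.
import numpy as np

rng = np.random.default_rng(3)
num_nodes, s = 6, 4
neighbors = {n: [(n + 1) % num_nodes, (n + 2) % num_nodes] for n in range(num_nodes)}
bias = rng.normal(size=(num_nodes, s))

def F(x):
    """A contraction: each node moves toward the mean of its neighbors' states."""
    out = np.empty_like(x)
    for n, ne in neighbors.items():
        out[n] = 0.3 * x[ne].mean(axis=0) + bias[n]
    return out

x = np.zeros((num_nodes, s))
for it in range(1, 100):
    x_new = F(x)
    if np.linalg.norm(x_new - x) < 1e-6:   # convergence of the state
        break
    x = x_new
print("converged after", it, "iterations")  # typically a handful of iterations
```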
In positional and nonlinear GNNs, the transition function must be activated itf · ∣N∣ and itf · ∣E∣ times, respectively. Even if such a difference may appear significant, in practice the complexity of the two models is similar, because the network that implements fw is larger than the one that implements hw. In fact, fw has M(s + lE) input neurons, where M is the maximum number of neighbors for a node, whereas hw has only s + lE input neurons. A significant difference can be noticed only for graphs where the number of neighbors of nodes is highly variable, since the inputs of fw must be sufficient to accommodate the maximum number of neighbors, and many inputs may remain unused when fw is applied. On the other hand, in the linear model the FNNs are used only once, before the iterations begin, so that the complexity of each iteration is O(s²∣E∣) instead of O(∣E∣ · c̄ᶠ(h)). Note that c̄ᶠ(h) ≥ s · hih holds when hw is implemented by a three‐layered FNN with hih hidden neurons. In practical cases, where hih is often larger than s, the linear model is therefore faster than the nonlinear model. As confirmed by the experiments, such an advantage is counterbalanced by the smaller accuracy that the model usually achieves.
In GNNs, the learning phase requires much more time than the test phase, mainly due to the repetition of the forward and backward phases for several epochs. The experiments have shown that the time spent in the forward and backward phases is not very different. Similarly to the forward phase, the cost of the function BACKWARD is mainly due to the repetition of the instruction that computes z(t). Theorem 5.2 ensures that z(t) converges exponentially fast, and experiments have confirmed that itb is usually a small number.
Formally, the cost of each learning epoch is given by the sum of all the instructions times the iterations in Table 5.2. An inspection of the table shows that the cost of all instructions involved in the learning phase is linear with respect to both the dimension of the input graph and that of the FNNs. The only exceptions are the computation of z(t + 1) = z(t) · A + b, A = (∂Fw/∂x)(x, l), and ∂pw/∂w, which depend quadratically on s.
The most expensive instruction is apparently the computation of ∂pw/∂w in nonlinear GNNs, whose cost grows with both the state dimension s and tR. On the other hand, the experiments have shown [1] that tR is usually a small number. In most epochs, tR is 0, since the Jacobian does not violate the imposed constraint, and in the other cases, tR is usually in the range 1–5. Thus, for a small state dimension s, the computation of ∂pw/∂w requires few applications of backpropagation on hw and has a small impact on the global complexity of the learning process. On the other hand, in theory, if s is very large, the per‐arc cost of this computation becomes significant, and if at the same time tR ≫ 0, the computation of the gradient may become very slow.