5.3.2 Computational Complexity
Here, we derive an analysis of the computational cost of GNNs. The analysis will focus on three different GNN models: positional GNNs, where the functions fw and gw of Eq. (5.74) are implemented by FNNs; linear (nonpositional) GNNs; and nonlinear (nonpositional) GNNs.
First, we will describe in more detail the most complex instructions involved in the learning procedure (see Table 5.2, reproduced from [1]). Then, the complexity of the learning algorithm will be defined. For the sake of simplicity, the cost is derived assuming that the training set contains just one graph G. Such an assumption does not cause any loss of generality, since the graphs of the training set can always be merged into a single graph. The complexity is measured by the order of floating point operations. By the common definition of time complexity, an algorithm requires O(l(a)) operations if there exist α > 0 and a₀ such that c(a) ≤ αl(a) holds for each a ≥ a₀, where c(a) is the maximal number of operations executed by the algorithm when the length of the input is a.
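To make the single-graph assumption concrete, the sketch below merges a set of training graphs into one disconnected graph by stacking their adjacency matrices block-diagonally. It is a minimal sketch: the function and variable names (merge_graphs, adjs, labels) are illustrative and not taken from [1].

```python
# Sketch: merging a training set of graphs into a single disconnected graph.
# Each graph is assumed to be given as a dense adjacency matrix plus a matrix
# of node labels; all names here are illustrative.
import numpy as np

def merge_graphs(adjs, labels):
    """Stack graphs into one block-diagonal adjacency and one label matrix."""
    n_total = sum(a.shape[0] for a in adjs)
    big_adj = np.zeros((n_total, n_total))
    big_lab = np.concatenate(labels, axis=0)
    offset = 0
    for a in adjs:
        n = a.shape[0]
        big_adj[offset:offset + n, offset:offset + n] = a  # no cross-graph edges
        offset += n
    return big_adj, big_lab

if __name__ == "__main__":
    g1 = np.array([[0, 1], [1, 0]])                    # 2-node graph
    g2 = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])   # 3-node graph
    l1 = np.random.randn(2, 4)                         # 4-dimensional node labels
    l2 = np.random.randn(3, 4)
    A, L = merge_graphs([g1, g2], [l1, l2])
    print(A.shape, L.shape)                            # (5, 5) (5, 4)
```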
We will assume that there exist two procedures FP and BP, which implement the forward phase and the backward phase of the backpropagation procedure, respectively. Formally, given a function lw : Rᵃ → Rᵇ implemented by an FNN, we have

FP(lw, y) = lw(y)

BP(lw, y, δ) = [δ(∂lw/∂w)(y), δ(∂lw/∂y)(y)]
Table 5.2 Time complexity of the most expensive instructions of the learning algorithm. For each instruction and each GNN model, a bound on the order of floating point operations is given. The table also displays the number of times per epoch that each instruction is executed. Here, c̄ᶠ(l) and c̄ᵇ(l) denote the cost of one forward and one backward pass, respectively, through the FNN that implements l.
Source: Scarselli et al. [1].
Instruction | Positional | Nonlinear | Linear | Execs. |
---|---|---|---|---|
z(t + 1) = z(t) · A + b | s²∣E∣ | s²∣E∣ | s²∣E∣ | itb |
o = Gw(x(t), lN) | ∣N∣ · c̄ᶠ(g) | ∣N∣ · c̄ᶠ(g) | ∣N∣ · c̄ᶠ(g) | 1 |
x(t + 1) = Fw(x(t), l) | ∣N∣ · c̄ᶠ(f) | ∣E∣ · c̄ᶠ(h) | s²∣E∣ | itf |
A = (∂Fw/∂x)(x, l) | s∣N∣ · c̄ᵇ(f) | s∣E∣ · c̄ᵇ(h) | – | 1 |
∂ew/∂o | ∣N∣ | ∣N∣ | ∣N∣ | 1 |
∂pw/∂w | see text | see text | – | 1 |
b = (∂ew/∂o)(∂Gw/∂x)(x, lN) | ∣N∣ · c̄ᵇ(g) | ∣N∣ · c̄ᵇ(g) | ∣N∣ · c̄ᵇ(g) | 1 |
c = (∂ew/∂o)(∂Gw/∂w)(x, lN) | ∣N∣ · c̄ᵇ(g) | ∣N∣ · c̄ᵇ(g) | ∣N∣ · c̄ᵇ(g) | 1 |
d = z(t)(∂Fw/∂w)(x, l) | ∣N∣ · c̄ᵇ(f) | ∣E∣ · c̄ᵇ(h) | see text | 1 |
Here, y ∈ Rᵃ is the input vector, and the row vector δ ∈ Rᵇ is a signal that suggests how the network output must be adjusted to improve the cost function. In most applications, the cost function is ew(y) = (t − o)², where o = lw(y) and t (target) is the vector of the desired output corresponding to the input y, so that δ = (∂ew/∂o)(y) = 2(o − t). On the other hand, δ(∂lw/∂y)(y) is the gradient of ew with respect to the network input and is easily computed as a side product of backpropagation. Backpropagation computes for each neuron v the delta value (∂ew/∂av)(y) = δ(∂lw/∂av)(y), where ew is the cost function and av is the activation level of neuron v. Thus, δ(∂lw/∂y)(y) is just a vector stacking all the delta values of the input neurons. Finally, let c̄ᶠ(lw) and c̄ᵇ(lw) denote the computational complexity required by the application of FP and BP to lw, respectively. For example, if lw is implemented by a multilayered FNN with a inputs, b hidden neurons, and c outputs, then c̄ᶠ(lw) = c̄ᵇ(lw) = O(ab + bc) holds.
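As a concrete illustration of FP and BP, the following minimal sketch implements them for a one-hidden-layer FNN with tanh hidden units and a linear output. The class and variable names are illustrative assumptions; BP returns the pair [δ(∂lw/∂w)(y), δ(∂lw/∂y)(y)] discussed above, with the input gradient obtained from the delta values of the input neurons.

```python
# Sketch of FP and BP for a one-hidden-layer FNN lw : R^a -> R^b with tanh
# hidden units and a linear output layer; names are illustrative only.
import numpy as np

class FNN:
    def __init__(self, a, hidden, b, rng=np.random.default_rng(0)):
        self.W1 = rng.normal(scale=0.1, size=(hidden, a))
        self.t1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.1, size=(b, hidden))
        self.t2 = np.zeros(b)

def FP(net, y):
    """Forward phase: returns lw(y)."""
    h = np.tanh(net.W1 @ y + net.t1)
    return net.W2 @ h + net.t2

def BP(net, y, delta):
    """Backward phase: returns (delta * dlw/dw, delta * dlw/dy).

    delta is a row vector (shape (b,)) telling how the output should change;
    the first result stacks the gradients w.r.t. all weights, the second is
    the vector of delta values of the input neurons."""
    a1 = net.W1 @ y + net.t1
    h = np.tanh(a1)
    grad_W2 = np.outer(delta, h)          # output-layer weight gradient
    grad_t2 = delta
    delta_h = (delta @ net.W2) * (1.0 - h ** 2)   # delta values of hidden neurons
    grad_W1 = np.outer(delta_h, y)
    grad_t1 = delta_h
    grad_w = np.concatenate([grad_W1.ravel(), grad_t1, grad_W2.ravel(), grad_t2])
    grad_y = delta_h @ net.W1             # delta * dlw/dy
    return grad_w, grad_y

# usage with the quadratic cost ew = (t - o)^2, so delta = 2(o - t)
net = FNN(a=3, hidden=5, b=2)
y, t = np.ones(3), np.zeros(2)
o = FP(net, y)
grad_w, grad_y = BP(net, y, 2.0 * (o - t))
```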
Complexity of Instructions
1 Instructions z(t + 1) = z(t) · A + b, o = Gw(x(t), lN), and x(t + 1) = Fw(x(t), l): Since A is a matrix having at most s²∣E∣ nonnull elements, the multiplication of z(t) by A, and as a consequence the instruction z(t + 1) = z(t) · A + b, costs O(s²∣E∣) floating point operations. The state x(t + 1) and the output vector o are calculated by applying the local transition function and the local output function to each node n. Thus, in positional GNNs and in nonlinear GNNs, where fw, hw, and gw are directly implemented by FNNs, x(t + 1) and o are computed by running the forward phase of backpropagation once for each node or edge (see Table 5.2). On the other hand, in linear GNNs, xn(t) is calculated in two steps: first, the matrices An of Eq. (5.82) and the vectors bn of Eq. (5.83) are evaluated; then, x(t) is computed. The former phase, whose cost amounts to one forward pass of the corresponding networks for each arc and node, is executed once per epoch, whereas the latter phase, whose cost is O(s²∣E∣), is executed at every step of the cycle in the function FORWARD.
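The following sketch illustrates why one iteration of z(t + 1) = z(t) · A + b costs O(s²∣E∣): A is stored as a dictionary of s × s blocks, one per arc, so each iteration touches every stored block exactly once. The data layout and names are illustrative assumptions, not the implementation used in [1].

```python
# Sketch: the backward iteration z(t+1) = z(t) * A + b with A stored as a
# block-sparse matrix {(n, u): s-by-s block}; one iteration costs O(s^2 |E|).
import numpy as np

def step(z, blocks, b):
    """One iteration of z(t+1) = z(t) A + b; z and b have shape (num_nodes, s)."""
    z_next = b.copy()
    for (n, u), A_nu in blocks.items():   # A_nu measures the effect of node u on node n
        z_next[u] += z[n] @ A_nu          # s^2 operations per stored arc
    return z_next

# toy example: 3 nodes, 2 arcs, state dimension s = 4
s, num_nodes = 4, 3
rng = np.random.default_rng(0)
blocks = {(0, 1): 0.1 * rng.normal(size=(s, s)),
          (1, 2): 0.1 * rng.normal(size=(s, s))}
b = rng.normal(size=(num_nodes, s))
z = np.zeros((num_nodes, s))
for _ in range(10):                       # itb iterations of the BACKWARD cycle
    z = step(z, blocks, b)
```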
2 Instruction A = (∂Fw/∂x)(x, l): This instruction requires the computation of the Jacobian of Fw. Note that A = {An,u} is a block matrix where the block An,u measures the effect of node u on node n if there is an arc (n, u) from u to n, and is null otherwise. In the linear model, the matrices An,u correspond to those displayed in Eq. (5.82) and are used to calculate x(t) in the forward phase. Thus, such an instruction has no cost in the backward phase in linear GNNs.

In nonlinear GNNs, An,u = (∂hw/∂xu)(ln, l(n,u), xu, lu) is computed by appropriately exploiting the backpropagation procedure. In other words, let qi ∈ Rˢ be a vector whose components are all zero except for the i‐th one, which equals one, that is, q1 = [1, 0, …, 0], q2 = [0, 1, 0, …, 0], and so on. Note that BP, when applied to lw with δ = qi, returns qi(∂lw/∂y)(y), that is, the i‐th column of the Jacobian (∂lw/∂y)(y). Thus, An,u can be computed by applying BP on all the qi, that is,

An,u = [BP2(hw, y, q1), …, BP2(hw, y, qs)], restricted to the components corresponding to the inputs xu, with y = [ln, l(n,u), xu, lu], (5.84)
where BP2 indicates that we are considering only the second part of the output of BP, that is, the derivative with respect to the network input. A similar reasoning can also be used with positional GNNs. The complexity of these procedures is easily derived and is displayed in the fourth row of Table 5.2.
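A minimal numerical sketch of Eq. (5.84): the Jacobian block An,u is recovered by running the backward phase s times with the one-hot vectors qi as δ, and one column of the result is checked against a finite-difference approximation. The tiny tanh transition network and the way the input is sliced into [ln, xu] are simplifying assumptions made for illustration.

```python
# Sketch of Eq. (5.84): recovering A_{n,u} = dh_w/dx_u by running the backward
# phase s times with one-hot vectors q_i as delta. Network and names illustrative.
import numpy as np

rng = np.random.default_rng(1)
s, label_dim = 3, 2                     # state dimension and label dimension
in_dim = label_dim + s                  # input y = [l_n, x_u] (simplified)
hid = 5
W1, t1 = rng.normal(size=(hid, in_dim)), np.zeros(hid)
W2, t2 = rng.normal(size=(s, hid)), np.zeros(s)

def h_w(y):                             # transition network h_w : R^{in_dim} -> R^s
    return W2 @ np.tanh(W1 @ y + t1) + t2

def bp_input_grad(y, delta):            # returns delta * dh_w/dy (part of BP's output)
    hidden = np.tanh(W1 @ y + t1)
    return ((delta @ W2) * (1 - hidden ** 2)) @ W1

y = rng.normal(size=in_dim)
# build dh_w/dy one slice at a time, as in Eq. (5.84)
jac = np.stack([bp_input_grad(y, np.eye(s)[i]) for i in range(s)])
A_nu = jac[:, label_dim:]               # keep only the components w.r.t. x_u

# finite-difference check of the first column of A_nu
eps = 1e-6
e = np.zeros(in_dim); e[label_dim] = eps
approx = (h_w(y + e) - h_w(y)) / eps
print(np.allclose(A_nu[:, 0], approx, atol=1e-4))
```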
3 Computation of ∂ew/∂o and ∂pw/∂w: In linear GNNs, the cost function is ew = Σk (tk − ok)², where the sum ranges over the supervised nodes nk, and as a consequence, the k‐th component of ∂ew/∂o equals 2(ok − tk) if nk is a node belonging to the training set, and 0 otherwise. Thus, ∂ew/∂o is easily calculated by O(∣N∣) operations.
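A minimal sketch of this computation, assuming a quadratic cost defined only on the supervised nodes: the gradient ∂ew/∂o is filled node by node and is zero for unsupervised nodes, which is why O(∣N∣) operations suffice. The variable names are illustrative.

```python
# Sketch: gradient de_w/do for a quadratic cost on the supervised nodes only;
# entries for unsupervised nodes stay zero. Names illustrative.
import numpy as np

o = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])            # outputs for 3 nodes
targets = {0: np.array([0.0, 1.0]), 2: np.array([1.0, 0.0])}  # supervised nodes only

grad = np.zeros_like(o)
for k, t in targets.items():
    grad[k] = 2.0 * (o[k] - t)     # 2(o_k - t_k) on supervised nodes, 0 elsewhere
```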
In positional and nonlinear GNNs, a penalty term pw is added to the cost function to force the transition function to be a contraction map. In this case, it is necessary to compute ∂pw/∂w, because such a vector must be added to the gradient. Let An,u(i, j) denote the element in position i, j of the block An,u. Recalling the definition of pw, we have

where each term of the derivative is nonnull only if the corresponding sum exceeds the threshold μ, and is 0 otherwise. It follows that

where sgn is the sign function. Let Rn,u be the matrix that stores, in position i, j, the corresponding term of the above derivative, and let vec be the operator that takes a matrix and produces a column vector by stacking all its columns one on top of the other. Then
holds. The vector ∂vec(An, u)/∂w depends on selected implementation of hw or fw . For the sake of simplicity, let us restrict our attention to nonlinear GNNs and assume that the transition network is a three‐layered FNN. σj, aj, Vj , and tj are the activation function, the vector of the activation levels, the matrix of the weights, and the thresholds of the j‐th layer, respectively. σj is a vectorial function that takes as input the vector of the activation levels of neurons in a layer and returns the vector of the outputs of the neurons of the same layer. The following reasoning can also be extended to positional GNNs and networks with a different number of layers. The function hw is formally defined in terms of σj, aj, Vj , and tj
By the chain differentiation rule, we get
where is the derivative of σj , diag is an operator that transforms a vector into a diagonal matrix having such a vector as diagonal, and is the submatrix of V1 that contains only the weights that connect the inputs corresponding to xu to the hidden layer. The parameters w affect four components of vec(An, u), that is, a3, V2, a2 , and . By the properties of derivatives for matrix products and the chain rule
(5.86)
holds. Thus, (vec(Rn,u))′ · ∂vec(An,u)/∂w is the sum of four contributions. In order to derive a method of computing those terms, let Ia denote the a × a identity matrix. Let ⊗ be the Kronecker product, and suppose that Pa is an a² × a matrix such that vec(diag(v)) = Pa v holds for any vector v ∈ Rᵃ. By the Kronecker product's properties, vec(AB) = (B′ ⊗ Ia) · vec(A) holds for matrices A, B, and Ia having compatible dimensions [67]. Thus, we have
which implies
Similarly, using the properties vec(ABC) = (C′ ⊗ A) · vec(B) and vec(AB) = (Ia ⊗ A) · vec(B), it follows that
where dh is the number of hidden neurons. Then, we have
(5.89)
where the aforementioned Kronecker product properties have been used.
It follows that (vec(Rn,u))′ · ∂vec(An,u)/∂w can be written as the sum of the four contributions represented by Eqs. (5.87)–(5.90). The second and the fourth terms, Eqs. (5.88) and (5.90), can be computed directly using the corresponding formulas. The first one can be calculated by observing that it looks like the function computed by a three‐layered FNN that is the same as hw except for the activation function of the last layer. In fact, if we denote such a network by h̃w, then
holds. A similar reasoning can also be applied to the third contribution.
Required number of operations: The above method includes two tasks: the matrix multiplications of Eqs. (5.87)–(5.90) and the backpropagation defined by Eq. (5.91). The former task consists of several matrix multiplications. By inspection of Eqs. (5.87)–(5.90), the number of floating point operations is approximately 2s² + 12s·hih + 10s²·hih, where hih denotes the number of hidden‐layer neurons of the network implementing the function hw. The second task has approximately the same cost as a backpropagation phase through the original function hw. These values are obtained from the following observations: for an a × b matrix C and a b × c matrix D, the multiplication CD requires approximately 2abc operations; more precisely, abc multiplications and ac(b − 1) sums. If D is a diagonal b × b matrix, then CD requires 2ab operations. Similarly, if C is an a × b matrix, D is a b × a matrix, and Pa is the a² × a matrix defined above and used in Eqs. (5.87)–(5.90), then computing (vec(CD))′ Pa costs only 2ab operations provided that a sparse representation is used for Pa. Finally, a1, a2, a3 are already available, since they are computed during the forward phase of the learning algorithm. Thus, the complexity of computing ∂pw/∂w is O(∣E∣ · (s²·hih + c̄ᵇ(h))). Note, however, that even if the sum in Eq. (5.85) ranges over all the arcs of the graph, only those arcs (n, u) such that Rn,u ≠ 0 have to be considered. In practice, Rn,u ≠ 0 is a rare event, since it happens only when a column of the Jacobian is larger than μ, and the penalty function is precisely intended to limit the occurrence of such cases. As a consequence, a better estimate of the complexity of computing ∂pw/∂w is obtained by restricting the count to the arcs (n, u) for which Rn,u ≠ 0; this is captured by tR, the average number of nodes u such that Rn,u ≠ 0 holds for some n.
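Before moving on, the vec/Kronecker identities used above, together with the sparse matrix Pa, can be checked numerically. The short sketch below does so with NumPy, using column-major ("Fortran") flattening so that vec stacks columns; the dimensions chosen are arbitrary.

```python
# Numerical check of the vec/Kronecker identities and of the sparse matrix P_a.
import numpy as np

def vec(M):
    """vec() stacks the columns of M, i.e. column-major (Fortran) flattening."""
    return M.flatten(order="F")

rng = np.random.default_rng(0)
a, b, c = 3, 4, 2
A = rng.normal(size=(a, b))
B = rng.normal(size=(b, c))
C = rng.normal(size=(c, 5))
v = rng.normal(size=a)

# vec(AB) = (B' kron I_a) vec(A)   and   vec(AB) = (I_c kron A) vec(B)
print(np.allclose(vec(A @ B), np.kron(B.T, np.eye(a)) @ vec(A)))
print(np.allclose(vec(A @ B), np.kron(np.eye(c), A) @ vec(B)))
# vec(ABC) = (C' kron A) vec(B)
print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))

# P_a is sparse: one 1 per column, sending v_i to position (i, i) of diag(v)
P = np.zeros((a * a, a))
for i in range(a):
    P[i * a + i, i] = 1.0
print(np.allclose(vec(np.diag(v)), P @ v))
```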
4 Instructions b = (∂ew/∂o)(∂Gw/∂x)(x, lN) and c = (∂ew/∂o)(∂Gw/∂w)(x, lN): The terms b and c can be calculated by backpropagating ∂ew/∂o through the network that implements gw. Since such an operation must be repeated for each node, the time complexity of both instructions is O(∣N∣ · c̄ᵇ(g)) for all the GNN models.
5 Instruction d = z(t)(∂Fw/∂w)(x, l): By the definition of Fw, fw, and BP, we have (5.92)
where y = [ln, xu, l(n, u), lu] and BP1 indicates that we are considering only the first part of the output of BP. Similarly
(5.93)
where y = [ln, xu, l(n, u), lu]. These two equations provide a direct method to compute d in positional and nonlinear GNNs, respectively.
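As a sketch of the nonlinear case, the code below accumulates d by performing one backward pass of hw per arc, feeding zn(t) as the δ signal and summing the resulting weight gradients, in the spirit of Eq. (5.93). The network shape, the toy arc list, and the simplified input [ln, xu] are assumptions made for illustration only.

```python
# Sketch of the nonlinear case: d = z(t) dF_w/dw accumulated by one backward
# pass of h_w per arc, with z_n(t) as the delta signal. Names illustrative.
import numpy as np

rng = np.random.default_rng(2)
s, lab = 3, 2
in_dim, hid = lab + s, 6                 # input y = [l_n, x_u] (labels simplified)
W1, W2 = rng.normal(size=(hid, in_dim)), rng.normal(size=(s, hid))

def bp_weight_grad(y, delta):
    """delta * dh_w/dw for h_w(y) = W2 tanh(W1 y): returns (dW1, dW2)."""
    h = np.tanh(W1 @ y)
    dW2 = np.outer(delta, h)
    delta_h = (delta @ W2) * (1 - h ** 2)
    dW1 = np.outer(delta_h, y)
    return dW1, dW2

arcs = [(0, 1), (1, 2), (2, 0)]          # arcs (n, u) of a toy graph
labels = rng.normal(size=(3, lab))
x = rng.normal(size=(3, s))              # current state estimate
z = rng.normal(size=(3, s))              # z(t) from the BACKWARD iteration

dW1_total = np.zeros_like(W1)
dW2_total = np.zeros_like(W2)
for n, u in arcs:
    y = np.concatenate([labels[n], x[u]])
    dW1, dW2 = bp_weight_grad(y, z[n])   # one backward pass per arc
    dW1_total += dW1
    dW2_total += dW2
```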
For linear GNNs, let hn,u denote the output of hw and note that

holds, where each element of the matrix An,u is obtained from the corresponding output of the transition network, each element of the vector bn is obtained from the corresponding output of the forcing network (see Eq. (5.83)), and xu is the state of node u. Then
where δ = ∣ne[n]∣ · z′(t), and the signal fed to the transition network is a vector that stores, in the position corresponding to i, j, the associated term of the above expression. Thus, in linear GNNs, d is computed by calling the backpropagation procedure once for each arc and each node.
Time complexity of the GNN model: Formally, the complexity is easily derived from Table 5.2: for one run of the forward phase it is O(itf · ∣N∣ · c̄ᶠ(f) + ∣N∣ · c̄ᶠ(g)) for positional GNNs, O(itf · ∣E∣ · c̄ᶠ(h) + ∣N∣ · c̄ᶠ(g)) for nonlinear GNNs, and O(itf · s²∣E∣ + ∣N∣ · c̄ᶠ(g)), plus the cost of evaluating the matrices An and the vectors bn once per input graph, for linear GNNs. In practice, the cost of the test phase is mainly due to the repeated computation of the state x(t). The cost of each iteration is linear with respect to both the dimension of the input graph (the number of edges) and the dimension of the employed FNNs and the state, with the sole exception of linear GNNs, whose single iteration cost is quadratic with respect to the state. The number of iterations required for the convergence of the state depends on the problem at hand, but Banach's theorem ensures that the convergence is exponentially fast, and experiments have shown that 5–15 iterations are generally sufficient to approximate the fixed point [1].
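A minimal sketch of the forward fixed-point loop: the state is updated until successive iterates are close, which, for a contraction map, happens after a number of iterations that grows only logarithmically with the required precision. The hand-made contraction map below is a stand-in for a trained transition network; all names are illustrative.

```python
# Sketch of the forward fixed-point loop x(t+1) = F_w(x(t), l): iterate until
# successive states are close. The damped neighbor averaging is a stand-in
# contraction, not a trained transition network.
import numpy as np

rng = np.random.default_rng(3)
num_nodes, s = 6, 4
neighbors = {n: [(n + 1) % num_nodes, (n + 2) % num_nodes] for n in range(num_nodes)}
bias = rng.normal(size=(num_nodes, s))

def F(x):
    """A contraction: each node moves toward the mean of its neighbors' states."""
    out = np.empty_like(x)
    for n, ne in neighbors.items():
        out[n] = 0.3 * x[ne].mean(axis=0) + bias[n]
    return out

x = np.zeros((num_nodes, s))
for it in range(1, 100):
    x_new = F(x)
    if np.linalg.norm(x_new - x) < 1e-6:   # convergence of the state
        break
    x = x_new
print("converged after", it, "iterations")  # typically a handful of iterations
```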
In positional and nonlinear GNNs, the transition function must be activated itf · ∣N∣ and itf · ∣E∣ times, respectively. Even if such a difference may appear significant, in practice the complexity of the two models is similar, because the network that implements fw is larger than the one that implements hw. In fact, fw has M(s + lE) input neurons, where M is the maximum number of neighbors for a node, whereas hw has only s + lE input neurons. A significant difference can be noticed only for graphs where the number of neighbors of nodes is highly variable, since the inputs of fw must be sufficient to accommodate the maximum number of neighbors, and many inputs may remain unused when fw is applied. On the other hand, in the linear model the FNNs are used only once, before the iterations begin, so that the complexity of each iteration is O(s²∣E∣) instead of O(∣E∣ · c̄ᶠ(h)). Note that c̄ᶠ(h) ≥ s · hih holds when hw is implemented by a three‐layered FNN with hih hidden neurons. In practical cases, where hih is often larger than s, the linear model is therefore faster than the nonlinear model. As confirmed by the experiments, such an advantage is counterbalanced by the smaller accuracy that the model usually achieves.
In GNNs, the learning phase requires much more time than the test phase, mainly due to the repetition of the forward and backward phases for several epochs. The experiments have shown that the time spent in the forward and backward phases is not very different. Similarly to the forward phase, the cost of the function BACKWARD is mainly due to the repetition of the instruction that computes z(t). Theorem 5.2 ensures that z(t) converges exponentially fast, and experiments have confirmed that itb is usually a small number.
Formally, the cost of each learning epoch is given by the sum of all the instructions times the iterations in Table 5.2. An inspection of the table shows that the cost of all instructions involved in the learning phase is linear with respect to both the dimension of the input graph and that of the FNNs. The only exceptions are the computation of z(t + 1) = z(t) · A + b, A = (∂Fw/∂x)(x, l), and ∂pw/∂w, which depend quadratically on s.
The most expensive instruction is apparently the computation of ∂pw/∂w in nonlinear GNNs, whose cost grows with both the state dimension s and tR. On the other hand, the experiments have shown [1] that tR is usually a small number. In most epochs, tR is 0, since the Jacobian does not violate the imposed constraint, and in the other cases, tR is usually in the range 1–5. Thus, for a small state dimension s, the computation of ∂pw/∂w requires few applications of backpropagation on hw and has a small impact on the global complexity of the learning process. On the other hand, in theory, if s is very large, the per‐arc cost of this computation becomes significant, and if at the same time tR ≫ 0, the computation of the gradient may become very slow.