Читать книгу Multi-Processor System-on-Chip 2 - Liliana Andrade - Страница 30
1.5.2. Design space exploration
ОглавлениеLet us refer to Figure 1.8 again. Note that the matrix elements are color-coded to match outputs with inputs.
Three loops can be identified: loop i along K, loop m along M of the dot product and loop l along M of the cyclic shift. We will use this notation for the rest of this chapter. D matrix is m and i dependent; g matrix is l, m and i dependent and output x is i and l dependent. There are 6 (3!) loop order combinations to arrange data reading, computation and data storing in this algorithm. In addition, we can vectorize loops i or l at the output and any of the loops at the input. In our notation, a vectorized input loop is emboldened, for example, i→i, and a vectorized output loop is underlined, for example, i→i.
In the design space, we can search for optimizations in regard to section 1.5.1. We decided to search for two optimizations: a) ideal vectorization and b) low memory bandwidth. In a), we analyze two variants (black and red) and in b) two other variants (blue and green) of GFDM. Due to some features overlapping between optimization variants, in the following text we will first present features and then assign them accordingly to black, red, blue and green variants. A summary is provided in Table 1.3 (Damjancevic et al. 2019). The features are:
1 1) Modifying the very long instruction word (VLIW) configuration – The vDSP has the capability to perform either a wide 2veclen vector load or a veclen vector load + veclen vector store. This removes the bandwidth bottleneck for operations that require two operands, which, in turn, removes stalls when executing consecutive vector MAC operations. Basic configuration supports a veclen vector load + veclen vector store.
2 2) Vectorizing loop i maximizes efficiency (no boundary effects) – The IDFT output is of size K = 2i, i ∈ N due to the assumed algorithm implementation of IDFT as 2i point iFFT (K ∈ [128, 4096] i.e.[low end, high end]), and veclen =16. Moreover, this makes the code scale with any M and K value.
3 3) Vectorizing loop m allows the use of a) a shuffle operation on buffered coefficients g, reducing the required number of g loads from M 2 K to MK; b) a combination of element-wise vector multiplication (MPY) and reduction add (rADD) that avoids MACs, and if M ≤ veclen, then the m loop is consumed altogether. This greatly reduces the number of memory accesses. Adversely, for the cases when M is not a composite of veclen, the boundary conditions deteriorate and the efficiency drops. We could handcraft the code for every M to optimize boundary effects, but this is a limitation to code scalability. In Table 1.3, we present the upper bound of efficiency for arbitrary M.
4 4) Vectorizing loop l at the output reduces the operation count, but is subject to boundary effects that reduce core efficiency.
5 5) m nested in i minimizes the need for ACCs in the register file (see Figure 1.13). Variants with this loop order use 1 veclen ACC, by computing the partial sums of the dot product first. Table 1.3. Selected GFDM implementation variants
6 6) i nested in l produces the output sequentially on a symbol-by-symbol basis of size K.
7 7) i nested in m allows D to be read sequentially symbol-by-symbol, preventing stalls when the IFFT result is incoming sequentially (blocks of size K in contrast to the otherwise parallel M × K blocks requirement).
8 8) l nested in i removes the need to either reload or buffer D in register file.
9 9) m and l nested in i allow the use of the same memory locations for both D and x. An index location of D becomes obsolete when x of the same index location is ready, for example, when x0,0 is ready, D0,0 is no longer used. The relative order of m and l is irrelevant. The same is true for i nested in m and l.
Black uses l {i {m {… }}}, hence 1), 2), 5), 6) are valid.
Red uses m {i {l {… }}}, hence 1), 2), 7), 8) are valid.
Blue uses i {l {m {… }}}, hence 3), 5), 8), 9) are valid.
Green uses l {i {m {… }}}, hence 3), 5), 6) are valid. In green, we have to either reload or buffer D; we chose buffer D in register file, reducing memory bandwidth.
Now that we have identified GFDM variants to minimize cycle count (for maximum throughput) or required memory bandwidth (for low energy consumption), we can investigate how the variable throughput requirement from the standard analysis in section 1.1 maps to GFDM kernels, in particular, to black – high-throughput variant, and blue – low-energy-consumption variant. We drop red and green from further consideration since they do not achieve the lowest cycle counts and have high-memory register file usage. However, we should keep in mind the lesson learned: that our code could create a kernel that is in a very unfavorable spot on the design space, if we were to underestimate all the complexity layers and their implications.