Multi-Processor System-on-Chip 2 – Liliana Andrade
1.5.1. Implementation considerations
For the kernel implementation, we need to factor in the following shortened list of considerations (Damjancevic et al. 2019):
1) Vectorization – In the ideal case, a single-instruction, multiple-data (SIMD) processor of vector length veclen, i.e. with veclen SIMD lanes, reduces the number of operations required by a factor of veclen. In practice, the gain is smaller due to Amdahl’s law, boundary conditions and other potential bottlenecks, such as load/store bandwidth. Similarly, the number of memory accesses (loads/stores) is, in the ideal case, reduced veclen times.
2) Boundary conditions – Depending on the algorithm implemented, mismatch and misalignment between the data layout in memory and the data accessed by the vector load/store unit (Figure 1.13) lead to extra processing cycles.
3) Bandwidth bottleneck – When the functional units and register file blocks of Figure 1.13 consume or produce more data than the load/store block can handle, the core runs into a bandwidth bottleneck, lowering core utilization. Multiple load/store units and/or an increased load/store vector length can mitigate this problem, although such architectures come with higher energy consumption per load/store and a larger chip area for the load/store unit.
4) Memory bandwidth – The amount of data transferred per unit of time.
5) Trade-off between cycle count and memory bandwidth – When implementing an algorithm, it is sometimes beneficial to minimize the cycle count, which is usually paid for with additional memory accesses; conversely, optimizing for a low load/store count increases the overall cycle count. In GFDM, we encounter such a trade-off. Low cycle count optimization reduces execution time, while low memory bandwidth optimization saves energy (and area, if an architecture with a smaller load/store unit can be used).
6) Loop order and vectorization – Some algorithms, such as GFDM, can be described with mutually independent nested loops; we identified three such loops in GFDM (see the “for” loops in Figure 1.9). The loop order and the choice of which loop to vectorize can significantly alter the boundary conditions, bandwidth, memory and register space used, and even the type (instruction set architecture, ISA) and count of operations needed.
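Consideration 1) can be made concrete with Amdahl’s law. The sketch below is illustrative only (the function name and the 90% vectorizable fraction are assumptions, not figures from the text): only the vectorizable portion of a kernel runs veclen times faster, so the effective speedup saturates well below veclen.

```python
def effective_speedup(veclen, vectorizable_fraction):
    """Amdahl's law applied to SIMD: only the vectorizable fraction of the
    work runs veclen times faster; the rest stays scalar."""
    serial = 1.0 - vectorizable_fraction
    return 1.0 / (serial + vectorizable_fraction / veclen)

# With 90% vectorizable work, 8 lanes deliver well under the ideal 8x:
print(round(effective_speedup(8, 0.9), 2))  # 4.71
```

Doubling veclen in this model yields diminishing returns, which is why the remaining considerations (boundary handling, load/store bandwidth) matter as much as lane count.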
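Consideration 2) can be sketched as a vector loop with a scalar remainder. The model below is a simplification under assumed names (VECLEN, vector_add): when the array length is not a multiple of the vector length, the leftover boundary elements cost extra scalar iterations on top of the full-width vector iterations.

```python
VECLEN = 4  # assumed SIMD width, for illustration only

def vector_add(dst, a, b):
    """Process full VECLEN-wide chunks as 'vector' operations and handle
    the leftover tail (the boundary condition) with scalar operations."""
    n = len(a)
    full = n // VECLEN          # iterations that fill all SIMD lanes
    vector_ops = 0
    scalar_ops = 0
    for i in range(0, full * VECLEN, VECLEN):
        dst[i:i + VECLEN] = [x + y for x, y in zip(a[i:i + VECLEN], b[i:i + VECLEN])]
        vector_ops += 1
    for i in range(full * VECLEN, n):   # boundary: elements beyond the last full chunk
        dst[i] = a[i] + b[i]
        scalar_ops += 1
    return vector_ops, scalar_ops

dst = [0] * 10
print(vector_add(dst, list(range(10)), list(range(10))))  # (2, 2)
```

For n = 10 and VECLEN = 4, two of the ten elements fall outside the full chunks, so the kernel pays two extra scalar iterations; on real hardware, misaligned vector loads add similar overhead.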
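The trade-off in consideration 5) can be sketched by a common choice in GFDM-like kernels: load precomputed filter coefficients from memory (fewer compute cycles, more load/store traffic) or recompute them on the fly (more compute cycles, less memory bandwidth). The cost function and the 5-cycles-per-coefficient figure below are illustrative assumptions, not measurements from the text.

```python
def cost(n, precomputed, compute_cycles_per_coeff=5):
    """Toy cost model: coefficient loads versus recomputation cycles for
    n coefficients. Both numbers cannot be minimized at the same time."""
    loads = n if precomputed else 0
    compute = 0 if precomputed else n * compute_cycles_per_coeff
    return {"loads": loads, "compute_cycles": compute}

print(cost(1024, precomputed=True))   # {'loads': 1024, 'compute_cycles': 0}
print(cost(1024, precomputed=False))  # {'loads': 0, 'compute_cycles': 5120}
```

Which variant wins depends on the target: the low-cycle-count variant shortens execution time, while the low-bandwidth variant saves energy and load/store unit area, exactly the tension described above.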