Multi-Processor System-on-Chip 1 - Liliana Andrade
2.3.3. VLIW core
The MPPA cores implement a 64-bit VLIW architecture, which is an effective way to design instruction-level parallel cores targeting numerical, signal and image processing (Fisher et al. 2005). The VLIW core has six issue lanes (Figure 2.10) that, respectively, feed a branch and control unit (BCU), two 64-bit ALUs, a 128-bit FPU, a 256-bit load–store unit (LSU) and a deep learning coprocessor. Each VLIW core has private L1 instruction and data caches, both 16 KB and four-way set-associative with an LRU replacement policy. All load instructions also have an L1 cache-bypass variant for direct access to the cluster SPM or L2$. These instructions improve the performance of codes with non-temporal memory access patterns, and also increase the accuracy of static analysis for computing worst-case execution time (WCET) bounds. The implementation of this VLIW core and its caches ensures that the resulting processing element is timing-compositional, a critical property with regard to computing accurate bounds on worst-case response times (WCRT) (Kästner et al. 2013).
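As an illustration of the L1 cache organization described above, the following minimal sketch models a 16 KB, four-way set-associative cache with LRU replacement. The capacity, associativity and replacement policy come from the text; the 64-byte line size, the class name `SetAssociativeLRU` and the address-to-set mapping are assumptions made only for this example and are not taken from the MPPA documentation.

```python
from collections import OrderedDict

CACHE_BYTES = 16 * 1024  # from the text: 16 KB per L1 cache
WAYS = 4                 # from the text: four-way set-associative
LINE_BYTES = 64          # ASSUMPTION: the text does not give the line size

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)


class SetAssociativeLRU:
    """Hypothetical model of a set-associative cache with LRU replacement."""

    def __init__(self, num_sets=NUM_SETS, ways=WAYS, line_bytes=LINE_BYTES):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One ordered map per set: oldest entry first, newest entry last.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = address // self.line_bytes
        index = line % self.num_sets   # ASSUMPTION: simple modulo indexing
        tag = line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)     # mark as most recently used
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[tag] = None
        return False
```

Under these assumptions, addresses that differ by `NUM_SETS * LINE_BYTES` bytes compete for the same set, so a fifth such address evicts the least recently used of the first four. Because the outcome of each access depends only on the access history of its own set, this kind of policy is comparatively easy to model in a static cache analysis.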
Figure 2.10. VLIW core instruction pipeline
Based on previous compiler design experience with different types of VLIW architectures (Dupont de Dinechin et al. 2000, 2004), a Fisher-style VLIW architecture has been selected, rather than an EPIC-style VLIW architecture (Table 2.3). The main features of the Kalray VLIW architecture are as follows:
– Partial predication: fully predicated architectures are expensive with regard to instruction encoding, while control speculation of arithmetic instructions performs better than if-conversion when applicable. Moreover, conditional SELECT operations are equivalent to CMOV operations with operand renaming constraints (Dupont de Dinechin 2014). As a result, if-conversion only needs to be supported by conditional load/store and CMOV instructions on scalar and vector operands.
Table 2.3. Types of VLIW architectures
Classical VLIW architecture | EPIC VLIW architecture |
SELECT operations on Boolean operand | Fully predicated ISA |
Conditional load/store/floating-point operations | Advanced loads (data speculation) |
Dismissible loads (control speculation) | Speculative loads (control speculation) |
Clustered register files and function units | Polycyclic/multiconnect register files |
Multi-way conditional branches | Rotating registers |
Compiler techniques | |
Trace scheduling | Modulo scheduling |
Partial predication | Full predication |
Main examples | |
Multiflow TRACE processors | Cydrome Cydra-5 |
HP Labs Lx / STMicroelectronics ST200 | HP-Intel IA64 |
Philips TriMedia | Texas Instruments VelociTI |
– Dismissible loads: these instructions enable control speculation of load instructions by suppressing exceptions on address errors, and by ensuring that no side-effects occur in the I/O areas. Additional configuration in the MMU refines their behavior on protection and no-mapping exceptions.
– No rotating registers: rotating registers rename temporary variables defined inside software pipelines, whose schedule is built while ignoring register antidependences. However, rotating registers add significant ISA and implementation complexity, while temporary variable renaming can be done by the compiler.
– Widened memory access: widening the memory accesses on a single port is simpler to implement than multiple memory ports, especially when memory address translation is involved. This simplification enables, in turn, the support of misaligned memory accesses, which significantly improves compiler vectorization opportunities.
– Unification of the scalar and SIMD data paths around a main register file of 64×64-bit registers, for the same motivations as the POWER vector-scalar architecture (Gschwind 2016). Operands for the SIMD instructions map to register pairs (16 bytes) or to register quadruples (32 bytes).
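The interplay between partial predication and dismissible loads listed above can be sketched in a few lines. The example below is only a behavioral model in Python, not Kalray code: the names `load`, `dismissible_load`, `select`, `branchy` and `if_converted`, the contents of `MEMORY`, and the choice of returning 0 from a dismissed load are all assumptions made for illustration. The text only states that dismissible loads suppress exceptions on address errors.

```python
# Hypothetical mapped memory: address -> 64-bit word.
MEMORY = {0x1000: 7, 0x1008: 11}


def load(address):
    """Ordinary load: faults on an unmapped address."""
    if address not in MEMORY:
        raise MemoryError(f"address error at {address:#x}")
    return MEMORY[address]


def dismissible_load(address):
    """Dismissible load: the address error is suppressed.

    ASSUMPTION: the dismissed load yields 0; the text does not say
    which value is returned.
    """
    return MEMORY.get(address, 0)


def select(condition, if_true, if_false):
    """Model of a conditional SELECT (or CMOV) on a Boolean operand."""
    return if_true if condition else if_false


def branchy(valid, address, default):
    """Original control flow: the load is guarded by a branch."""
    if valid:
        return load(address) + 1
    return default


def if_converted(valid, address, default):
    """Branch-free form: the load and the add are speculated, then selected."""
    speculative = dismissible_load(address) + 1  # executed unconditionally
    return select(valid, speculative, default)
```

In this sketch, `if_converted(False, 0xDEAD, 5)` returns 5 without faulting, whereas hoisting the ordinary `load` above the condition would raise an exception on the unmapped address. This is the reason why control speculation of loads needs the dismissible variant, while arithmetic instructions can be speculated directly.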