
List of Illustrations


1 Chapter 1
Figure 1.1. Training and inference in machine learning
Figure 1.2. Different types of processing in machine learning inference applicat...
Figure 1.3. 2D convolution applying a weight kernel to input data to calculate a...
Figure 1.4. Example pooling operations: max pooling and average pooling
Figure 1.5. Two types of vector MAC instructions of the ARC EM9D processor
Figure 1.6. ARC EM9D processor with XY memory and address generation units
Figure 1.7. Assembly code generated from MLI C-code for a fully connected layer ...
Figure 1.8. Assembly code generated from MLI C-code for 2D convolution of 16-bit...
Figure 1.9. CNN graph of the CIFAR-10 example application
Figure 1.10. MLI code of the CIFAR-10 inference application

2 Chapter 2
Figure 2.1. Homogeneous multi-core processor (Firesmith 2017)
Figure 2.2. NVIDIA Fermi GPGPU architecture (Huang et al. 2013)
Figure 2.3. Operation of a Volta tensor core (NVIDIA 2020)
Figure 2.4. Numerical formats used in deep learning inference (adapted from Gust...
Figure 2.5. Autoware automated driving system functions (CNX 2019)
Figure 2.6. Application domains and partitions on the MPPA3 processor
Figure 2.7. Overview of the MPPA3 processor
Figure 2.8. Global interconnects of the MPPA3 processor
Figure 2.9. Local interconnects of the MPPA3 processor
Figure 2.10. VLIW core instruction pipeline
Figure 2.11. Tensor coprocessor data path
Figure 2.12. Load-scatter to a quadruple register operand
Figure 2.13. INT8.32 matrix multiply-accumulate operation
Figure 2.14. OpenCL NDRange execution using the SPMD mode
Figure 2.15. KaNN inference code generator workflow
Figure 2.16. Activation splitting across MPPA3 compute clusters
Figure 2.17. KaNN augmented computation graph
Figure 2.18. ROSACE harmonic multi-periodic case study (Graillat et al. 2018)
Figure 2.19. MCG code generation of the MPPA processor

3 Chapter 3
Figure 3.1. Plural many-core architecture. Many cores, hardware accelerators and...
Figure 3.2. Task state graph
Figure 3.3. Many-flow pipelining: (a) Task graph and single execution of an imag...
Figure 3.4. Core management table
Figure 3.5. Task management table
Figure 3.6. Core state graph
Figure 3.7. Allocation (top) and termination (bottom) algorithms
Figure 3.8. Plural run-time software. The kernel enables boot, initialization, t...
Figure 3.9. Event sequence performing stream input
Figure 3.10. Plural software development kit
Figure 3.11. Matrix multiplication code on the Plural architecture
Figure 3.12. Task graph for matrix multiplication

4 Chapter 4
Figure 4.1. Overview of the ASIP pipeline with its vector ALUs and register file...
Figure 4.2. On-chip memory subsystem with banked vector memories and an example ...
Figure 4.3. Cell area of the synthesized cores’ logic for different clock period...
Figure 4.4. 3x3 NoC based on the HERMES framework
Figure 4.5. Runtime over flit-width and port buffer size for two exemplary layer...
Figure 4.6. Runtime for different packet lengths

5 Chapter 5
Figure 5.1. Roofline: a visual performance model for multi-core architectures. A...
Figure 5.2. Proposed tile-based many-core architecture
Figure 5.3. Directory savings using the RBCC concept compared to global coherenc...
Figure 5.4. RBCC-malloc() example
Figure 5.5. Internal block diagram of the coherency region manager
Figure 5.6. Breakdown of the CRM’s resource utilization for increasing region si...
Figure 5.7. Execution time per-frame: rbcc mode and mp mode
Figure 5.8. Breakdown of execution time for different clips with increasing back...
Figure 5.9. Normalized benchmark execution time for different coherency region s...
Figure 5.10. Taxonomy of in-/near-memory computing (colored elements are covered...
Figure 5.11. Architecture of the remote near-memory synchronization accelerator
Figure 5.12. Queues are widely used as message passing buffers
Figure 5.13. Mechanism for a remote dequeue operation (right) for queues in dist...
Figure 5.14. NAS benchmark 4 × 4 results (left) and IS scalability (right) for di...
Figure 5.15. Far-from memory (left) versus near-memory (right) graph copy
Figure 5.16. IMSuite benchmark results on a 4 × 4 tile design with 1 memory tile
Figure 5.17. IMSuite benchmark results for inter-memory graph copy
Figure 5.18. Effect of NCA
Figure 5.19. Interplay of RBCC and NMA for shared and distributed memory program...

6 Chapter 6
Figure 6.1. The MAGIC NOR gate. (a) MAGIC NOR gate schematic; (b) MAGIC NOR gate...
Figure 6.2. Evaluation tool for MAGIC within crossbar arrays. The initial voltag...
Figure 6.3. The SIMPLE and SIMPLER flows. In both flows, the logic is synthesize...
Figure 6.4. A 1-bit full adder implementation using SIMPLER. (a) A 1-bit full ad...
Figure 6.5. High-level description of the mMPU architecture. A program is execut...
Figure 6.6. The internal structure of the mMPU controller. First, an instruction...

7 Chapter 7
Figure 7.1. Address translation for the ARMv7 architecture
Figure 7.2. Host view of the architected state of the guest
Figure 7.3. Pseudo-code of the helper for the ldr instruction
Figure 7.4. QEMU-generated code to perform a load instruction
Figure 7.5. Embedding guest address space
Figure 7.6. Overview of the implementation
Figure 7.7. Contrasting memory access binary translations
Figure 7.8. Kernel module page fault handler
Figure 7.9. Percentage of memory accesses (with Linux)
Figure 7.10. Time spent in the Soft MMU
Figure 7.11. Benchmark speed-ups: our solution versus vanilla QEMU
Figure 7.12. Plain/hybrid speed-ups versus vanilla QEMU
Figure 7.13. Number of calls to slow path during program execution (i386 and ARM...
Figure 7.14. Page fault optimization speed-ups (ARM guest)
Figure 7.15. Page fault handling – internal versus percolated (note the logarith...

8 Chapter 8
Figure 8.1. Kalray MPPA overall architecture
Figure 8.2. Memory banks and local interconnect in a Kalray cluster
Figure 8.3. Example of collisions with four accesses to the same memory bank
Figure 8.4. Description of memory address bits and their use for a 32-bit memory...
Figure 8.5. Example of interleaving with four accesses to different memory banks
Figure 8.6. Description of memory address bits and their use for a 32-bit memory...
Figure 8.7. A four-bank architecture, 1-byte words, with a 16-byte stride access...
Figure 8.8. A four-bank architecture, 1-byte words, with a 17-byte stride
Figure 8.9. Left: the distribution of addresses within a 5-bank memory system. R...
Figure 8.10. Distribution of addresses across five memory banks with Index = Add...
Figure 8.11. Using a hash function for memory bank selection. N is the address s...
Figure 8.12. Left: an example of an H matrix of size 2 × 3. Right: the same H m...
Figure 8.13. Example of PRIM allocation in a four-bank architecture, and four me...
Figure 8.14. H matrix for the PRIM solution
Figure 8.15. Complex Addressing circuit overview
Figure 8.16. Intel Complex Addressing stage 1 and PRIM
Figure 8.17. Overview of the Kalray MPPA simplified local memory architecture
Figure 8.18. Kalray MPPA simplified crossbar internal architecture
Figure 8.19. Theoretical performance measure (in accesses per cycle) for stride ...
Figure 8.20. Theoretical performance measure (in accesses per cycle) for stride ...
Figure 8.21. Comparison between MOD 16 and MOD 17 for the same executable code a...
Figure 8.22. Theoretical performance measure (in accesses per cycle) for stride ...
Figure 8.23. Hotmap of memory access efficiency according to the number of banks...

9 Chapter 9
Figure 9.1. A traditional synchronous bus, in this case the implementation of an...
Figure 9.2. Arteris switch fabric network (Arteris IP 2020)
Figure 9.3. NoC layer mapping summary (Arteris IP 2020)
Figure 9.4. The NoC on the left has a floorplan-unfriendly topology, whereas the...
Figure 9.5. Pipeline stages are required in a path to span a particular distance...
Figure 9.6. Single-event effect (SEE) error hierarchy diagram (ISO26262-11 2018b...
Figure 9.7. Failure mode effects and diagnostic analysis (FMEDA) includes analys...
Figure 9.8. A cache coherent NoC interconnect allows the integration of IP using...
Figure 9.9. NoC interconnects enable easier creation of hard macro tiles that ca...
Figure 9.10. Hierarchical coherency macros enable massive scalability of cache c...
Figure 9.11. An AUTOSAR MCAL showing how NoC configuration information will be u...

10 Chapter 10
Figure 10.1. Energy efficiency improvement by near-threshold computing
Figure 10.2. Block diagram of the SCM with an R × C-bit capacity
Figure 10.3. An example of SCM structures (R = 4, C = 4)
Figure 10.4. Energy measurement results for two memories with a 256 × 32 capacit...
Figure 10.5. Minimum energy point curves
Figure 10.6. Energy and delay contours
Figure 10.7. Minimum energy point in near- or sub-threshold region
Figure 10.8. Minimum energy point in super-threshold region
Figure 10.9. The concept of minimum energy point tracking algorithm
Figure 10.10. An OS-based algorithm for MEPT

11 Chapter 11
Figure 11.1. Tasks in a heterogeneous computing architecture communicate with ea...
Figure 11.2. Hardware context switch on a task with FIFO-based communication channe...
Figure 11.3. Modified communication channel to support hardware context switchin...
Figure 11.4. The proposed communication protocol in hardware context switching
Figure 11.5. FIFOs in the communication channel
Figure 11.6. Reconfigurable architecture with the proposed communication structu...
Figure 11.7. Hardware context switch scenario in the experiments
Figure 11.8. Comparison of hardware context switch (preemption) time between CS ...
Figure 11.9. Hardware task migration between heterogeneous reconfigurable SoCs
Figure 11.10. Task migration timeline

