Multi-Processor System-on-Chip 1 - Liliana Andrade

2.2. Motivations and context

2.2.1. Many-core processors


A multi-core processor refers to a computing device that contains multiple software-programmable processing units (cores with caches). Multi-core processors deployed in desktop computers or data centers have homogeneous cores and a memory hierarchy composed of coherent caches (Figure 2.1). Conversely, a many-core processor is characterized by the architecturally visible grouping of cores inside compute units: cache coherence may not extend beyond the compute unit, or the compute unit may provide scratch-pad memory and data movement engines. A multi-core processor scales by replicating its cores, while a many-core processor scales by replicating its compute units. A many-core architecture may thus be scaled to hundreds, if not thousands, of cores.


Figure 2.1. Homogeneous multi-core processor (Firesmith 2017)

The GPGPU architecture introduced by the NVIDIA Fermi (Figure 2.2) is a mainstream many-core architecture, whose compute units are called streaming multiprocessors (SMs). Each SM comprises 32 streaming cores (SCs) that share a local memory, caches and a global memory system. Threads are scheduled and executed atomically by “warps”, which are sets of 32 threads dispatched to SCs that execute the same instruction at any given time. Hardware multi-threading enables warp execution switching on each cycle, helping to cover the memory access latencies.

Figure 2.2. NVIDIA Fermi GPGPU architecture (Huang et al. 2013)

Although embedded GPGPU processors provide adequate performance and energy efficiency for accelerated computing, their architecture carries inherent limitations that hinder their use in intelligent systems:

 – the kernel programming environment lacks standard features of C/C++, such as recursion, standard multi-threading or access to a (virtual) file system;

 – performance of kernels is highly sensitive to run-time control flow (because of branch divergence) and data access patterns (because of memory coalescing);

 – thread blocks are dynamically allocated to SMs, while warps are dynamically scheduled for execution inside an SM, making execution timing difficult to predict;

 – coupling between the host CPU and the GPGPU relies on a software stack that results in long and hard-to-predict latencies (Cavicchioli et al. 2019).

