Читать книгу Heterogeneous Computing - Mohamed Zahran - Страница 10
Оглавление2Different Players: Heterogeneity in Computing
In this chapter we take a closer look at the different computing nodes that can exist in a heterogeneous system. Computing nodes are the parts that do the computations, and computations are the main tasks of any program. Computing nodes are like programming languages. Each one can do any computation, but some are way more efficient in some type of computations than others, as we will see.
In 1966 Michael Flynn classified computations into four categories based on how instructions interact with data. The traditional sequential central processing unit (CPU) executes an instruction with its data, then another instruction with its data, and so on. In Flynn’s classification, this computation is called single instruction–single data (SISD). You can execute the same instruction on different data. Think of multiplying each element of a matrix by a factor, for example. This is called single instruction–multiple data (SIMD). The other way around, we refer to the same data that go through different instructions as multiple instruction–single data (MISD). There are not many examples of MISD around. With some stretch we can call pipelining a special case of MISD. Redundant execution of instructions, for reliability reasons, can also be considered MISD. Finally, the most generic category is multiple instruction–multiple data (MIMD). There are some generalizations. For instance, if we execute the same set of instructions on different data, we can generalize SIMD to single thread (or single program)–multiple data (SPMD). One of the advantages of such classifications is to build hardware suitable for each category, or for categories that are used more often, as we will see in this chapter.
2.1Multicore
The first player in a heterogeneity team is the multicore processor itself. Figure 2.1 shows a generic multicore processor. The de facto definition of a core now is a CPU and its two level-1 caches (one for instructions and the other for data). Below the L1 caches are different designs. One design has a shared L2 and L3 cache, where L3 is usually the last-level cache (LLC) before going off-chip. An L2 cache is physically distributed and logically shared to increase scalability with the number of cores. This makes the shared L2 cache a nonuniform cache access (NUCA), as we saw in the previous chapter. An LLC cache is also NUCA. This LLC is designed either in SRAM or embedded DRAM (eDRAM). POWER processors, from IBM, have eDRAM as an LLC. Another design has private L2 caches per core followed by a shared LLC. In some recent processors, but not in many, there is also an L4 shared cache; for example, Intel’s Broadwell i7 processor has a 128 MB L4 cache implemented in eDRAM technology. After the cache hierarchy, we go off-chip to access the system memory. Currently, the vast majority of system memory is in DRAM, but as we saw earlier, nonvolatile memory technology (such as PCM, STTRAM, MRAM, ReRAM, etc.) will soon appear and will be used with and/or in place of DRAM and also in some cache levels.
Figure 2.1Generic Multicore Processors
When we consider programming a multicore processor, we need to take into account several factors. The first is the process technology used for fabrication. It determines the cost, the power density, and the speed of transistors. The second factor is the number of cores and whether they support simultaneous multithreading (SMT) [Tullsen et al. 1995], called hyperthreading technology in Intel lingo and symmetrical multithreading in AMD parlance. This is where a single core can serve more than one thread at the same time, sharing resources. So if the processor has four cores and each one has two-way SMT capability, then the OS will see your processor as one with eight cores. That number of cores (physical and logical) determines the amount of parallelism that you can get and hence the potential performance gain. The third factor is the architecture of the core itself as it affects the performance of a single thread. The fourth factor is the cache hierarchy: the number of cache levels, the specifics of each cache, the coherence protocol, the consistency model, etc. This factor is of crucial importance because going off-chip to access the memory is a very expensive operation. The cache hierarchy helps reduce those expensive trips, of course with help from the programmer, the compiler, and the OS. Finally, the last factor is scaling out. How efficient is a multisocket design? Can we scale even further to thousands of processors?
Figure 2.2IBM POWER9 processor. (Courtesy of International Business Machines Corporation, © International Business Machines Corporation)
Let’s see an example of a multicore. Figure 2.2 shows the POWER9 processor from IBM [Sadasivam et al. 2017]. The POWER9 is fabricated with 14 nm FinFET technology, with about eight billion transistors, which is a pretty advanced one, as of 2018, even though we see lower process technologies (e.g., 10 nm) but still very expensive and not yet in mass production. The figure shows 24 CPU cores. Each core can support up to four hardware threads (SMT). This means we can have up to 96 threads executed in parallel. There is another variation of the POWER9 (not shown in the figure) that has 12 cores, each of which supports up to 8 hardware threads, bringing the total again to 96 threads. The first variation, the one in the figure, has more physical cores so is better in terms of potential performance, depending on the application at hand, of course. Before we proceed, let’s think from a programmer’s perspective. Suppose you are writing a parallel program for this processor and the language you are using gives you the ability to assign threads (or processes) to cores. How will you decide which thread goes to which core? It is obvious that the first rule of thumb is to assign different threads to different physical cores. But there is a big chance that you have more threads than physical cores. In this case try to assign threads of different personalities to the same physical core; that is, a thread that is memory bound and a thread that is compute bound, or a thread with predominant floating point operations and one with predominant integer operations, and so on. Of course there is no magic recipe, but these are rules of thumb. Note that your assignment may be overridden by the language runtime, the OS, or the hardware. Now back to the Power9.
Each core includes its own L1 caches (instructions and data). The processor has a three-level cache hierarchy. L2 is a private 512 KB 8-way set-associative cache. Depending on the market segment, power9 has two types of cores: SMT4 and SMT8, where the latter has twice the fetch/decode capacity of the former. The L2 cache is private to SMT8, but if we use SMT4 cores, it is shared among two cores. Level 3 is shared, banked, and built out of eDRAM. But DRAM has high density, as we said earlier, and L3 is a massive 120 MB and has nonuniform cache access (NUCA). This cache is divided into 12 regions with 20-way set associativity per region. This means a region is local per SMT8 core, or two SMT4 cores, but can be accessed by the other cores with higher latency (hence NUCA). The on-chip bandwidth is 7 TB/s (tera bytes per second). If we leave the chip to access the main memory, POWER9 has a bandwidth of up to 120 GB/s to a DDR4 memory. These numbers are important because it gives you an indication of how slow/fast getting your data from the memory is, and how crucial it is to have a cache-friendly memory access pattern.
For big problem sizes, you will use a machine with several multicore processors and accelerators (like a GPU, for example). Therefore, it is important to know the bandwidth available to you from the processor to the accelerator because it affects your decision to outsource the problem to the accelerator or do it in-house in the multicore itself. POWER9 is equipped with PCIe (PCI Express) generation 4 with 48 lanes (a single lane gives about 1.9 GB/s), a 16 GB/s interface for connecting neigh-boring sockets, and a 25 GB/s interface that can be used by externally connected accelerators or I/O devices.
Multicore processors represent one of the pieces of the puzzle of heterogeneous computing. But there are some other chips that are much better than multicore processors for certain types of applications. The term much better here means they have a better performance per watt. One of these well-known chips that is playing a big role in our current era of artificial intelligence and big data is the graphics processing unit (GPU).
2.2GPUs
Multicore processors are MIMD in Flynn’s classification. MIMD is very generic and can implement all other types. But if we have an application that is single instruction (or program or thread)–multiple data, then a multicore processor may not be the best choice [Kang et al. 2011]. Why is that? Let’s explain the reason with an example. Suppose we have the matrix-vector multiplication operation that we saw in the previous chapter (repeated here in Algorithm 2.1 for convenience). If we write this program in a multithreaded way and we execute it on a multicore processor, where each thread is responsible for calculating a subset of the vector Y, then each core must fetch/decode/issue instructions for threads, even though they are the same instructions for all the threads. This does not affect the correctness of the execution but is a waste of time and energy.
Algorithm 2AX= Y: Matrix-Vector Multiplication
for i = 0 to m – 1 do
y[i] = 0;
for j = 0 to n – 1 do
y[i] += A[i][j] * X[j];
end for
end for
If we now try to execute the same program on a GPU, the situation will be different. SIMD architectures have several execution units (named differently by different companies) that share the same front end for fetching/decoding/issuing instructions, thus, amortizing the overhead of that part. This also will save a lot of the chip real estate for more execution units, resulting in much better performance.
Figure 2.3 shows a generic GPU. Each block that is labeled lower level scheduling can be seen as a front end and many execution units. Each execution unit is Algorithm 2.1 responsible for calculating one or more elements for vector Y in the example of Algorithm 2.1. Why do we have several blocks then? There are several reasons. First, threads assigned to the different execution units within the same block can exchange data and synchronize among each other. It would be extremely expensive to do that among the execution units of all the chips as there are hundreds in small GPUs and thousands in high-end GPUs. So this distributed design makes the cost manageable. Second, it gives some flexibility. You can execute different SIMD-friendly applications on different blocks. This is why we have high-level scheduling shown in the figure. Execution units of different blocks can communicate, albeit in a slow manner, through the memory shared among all the blocks, labeled “memory hierarchy” in the figure, because in some designs there are some cache levels above the global memory as well as specialized memories like texture memory.
Figure 2.3Generic GPU Design
The confusing thing about GPUs is that each brand has its own naming convention. In NVIDIA parlance, those blocks are called streaming multiprocessors (SM or SMX in later version) and the execution units are called streaming processors (SPs) or CUDA cores. In AMD parlance those blocks are called shader engines and the execution units are called compute units. In Intel parlance, the blocks are called slices (or sub-slices) and the execution units are called just like that: execution units. There are some very slight differences between each design, but the main idea is almost the same.
GPUs can be discrete, that is, stand-alone chips connected to the other processors using connections like PCIe or NVLink, or they can be embedded with the multicore processor in the same chip. On the one hand, the discrete ones are of course more powerful because they have more real estate. But they suffer from the communication overhead of sending the data back and forth between the GPU’s memory and the system’s memory [Jablin et al. 2011], even if the programmer sees a single virtual address space. On the other hand, the embedded GPUs, like Intel GPUs and many AMD APUs, are tightly coupled with the multicore and do not suffer from communication overhead. However, embedded GPUs have limited area because they share the chip with the processor and hence are weaker in terms of performance.
If you have a discrete GPU in your system, there is a high chance you also have an embedded GPU in your multicore chip, which means you can make use of a multicore processor, an embedded GPU, and a discrete GPU, which is a nice exercise of heterogeneous programming!
Let’s see an example of a recent GPU: the Volta architecture V100 from NVIDIA [2017]. Figure 2.4 shows the block diagram of the V100. The giga thread engine at the top of the figure is what we called high-level scheduling in our generic GPU of Figure 2.3. Its main purpose is to schedule blocks to SMs. A block, in NVIDIA parlance, is a group of threads, doing the same operations on different data, assigned to the same SM, so that they can share data more easily and synchronize. There is an L2 cache shared by all, and it is the last-level cache (LLC) before going off-chip to the GPU global memory, not shown in the figure. NVIDIA packs several SMs together in what are called GPU processing clusters (GPCs). In Volta there are six GPCs; each one has 14 SMs. You can think of a GPC as a small full-fledged GPU, with its SMs, raster engines, etc. The main players, who actually do the computations, are the SMs.
Figure 2.4NVIDIA V100 GPU block diagram. (Based on NVIDIA, 2017. NVIDIA Tesla v100 GPU architecture)
Figure 2.5NVIDIA V100 GPU streaming multiprocessor. (Based on NVIDIA, 2017. NVIDIA Tesla v100 GPU architecture)
Figure 2.5 shows the internal configuration of a single SM. Each SM is equipped with an L1 data cache and a shared memory. The main difference is that the cache is totally transparent to the programmer. The shared memory is controllable by the programmer and can be used to share data among the block of threads assigned to that SM. This is why a block of threads assigned to the same SM can share data faster. There is also an L1 instruction cache and an L0 instruction cache for each warp scheduler. Warp? What is a warp?