Читать книгу Heterogeneous Computing - Mohamed Zahran - Страница 9
Оглавление1Why Are We forced to Deal with Heterogeneous Computing?
When computers were first built, about seven decades ago, there was one item on the wish list: correctness. Then soon a second wish appeared: speed. The notion of speed differs of course from those old days and applications to today’s requirements. But in general we can say that we want fast execution. After a few more decades and the proliferation of desktop PCs and then laptops, power became the third wish, whether in the context of battery life or electricity bills. As computers infiltrated many fields and were used in many applications, like military and health care, we were forced to add a fourth wish: reliability. We do not want a computer to fail during a medical procedure, for example; or it would have been a big loss (financially and scientifically) if the rover Curiosity, which NASA landed on Mars in 2012, failed. (And yes, Curiosity is a computer.) With the interconnected world we are in today, security became a must. And this is the fifth wish. Correctness, speed, power, reliability, and security are the five main wishes we want from any computer system. The order of the items differs based on the application, societal needs, and the market segment. This wish list is what directs the advances in hardware and software. But the enabling technologies for fulfilling this wish list lie in hardware advances and software evolution. So there is a vicious cycle between the wish list and hardware and software advances, and this cycle is affected by societal needs. This chapter explains the changes we have been through from the dawn of computer systems till today that made heterogeneous computing a must.
In this chapter we see how computing systems evolved till the current status quo. We learn about the concept of heterogeneity and how to make use of it. At the end of this chapter, ask yourself: Have we reached heterogeneous computing willingly? Or against our will? I hope by then you will have an answer.
1.1The Power Issue
In 1965 Gordon Moore, cofounder of Intel together with Robert Noyce, published a four-page paper that became very famous [Moore 1965]. This paper, titled “Cramming More Components onto Integrated Circuits,” made a prediction that the number of components (he did not mention transistors specifically, but the prediction evolved to mean transistors) in an integrated circuit (IC) will double every year. This prediction evolved over time to be two years, then settled on 18 months. This is what we call Moore’s law: transistors on a chip are expected to double every 18 months. The 50th anniversary of Moore’s law was in 2015! More transistors per chip means more features, which in turn means, hopefully, better performance. Life was very rosy for both the hardware community and the software community. On the hardware side, faster processors with speculative execution, superscalar capabilities, simultaneous multithreading, etc., were coupled with better process technology and higher frequency, which produced faster and faster processors. On the software side, you could write your program and expect it to get faster with every new generation of processors with any effort on your part! Until everything stopped around 2004. What happened?
Around 2004 Dennard scaling stopped. In 1974 Robert Dennard and several other authors [Dennard et al. 1974] published a paper that predicted that voltage and current should be proportional to the linear dimensions of the transistors. This has been known as Dennard scaling. It works quite well with Moore’s law. Transistors get smaller and hence faster and their voltage and current also scale down, so power can stay almost constant, or at least will not increase fast. However, a closer look at the Dennard scaling prediction shows that the authors ignored leakage current (was very insignificant at the time when the paper was published). Now as transistors get smaller and smaller, leakage becomes more significant. The aforementioned paper also ignored the threshold voltage at which the transistor switches. Around 2004 those two factors overcame the prediction of Dennard scaling, and now we increase the number of transistors per Moore’s law, but the power density also increases. Power density has many effects. One of them is that it increases packaging cost. Also, dealing with power dissipation becomes problematic and expensive. Given all that and given that dynamic power is proportional to clock frequency, we are stuck! What is the solution?
The solution is to stop increasing the clock frequency and instead increase the number of cores per chip, mostly at lower frequency. We can no longer increase frequency, otherwise power density becomes unbearable. With simpler cores and lower frequency, we reduce power dissipation and consumption. With multiple cores, we hope to maintain higher performance. Figure 1.1 [Rupp 2018] tells the whole story and shows the trends of several aspects of microprocessors throughout the years. As the figure shows, from around 2004 the number of logical cores started to increase beyond single core. The word logical includes physical cores with simultaneous multithreading (SMT) capability, also known as hyperthreading technology [Tullsen et al. 1995, Lo et al. 1997]. So a single core with four-way hyperthreading is counted as four logical cores. With SMT and the increase in the number of physical cores, we can see a sharp increase in the number of logical cores (note the logarithmic scale). If we look at the power metric, the 1990s was not a very friendly decade in terms of power. We see a steady increase. After we moved to multicore, things slowed down a bit due to the multicore era as well as the rise of dark-silicon techniques [Allred et al. 2012, Bose 2013, Esmaeilzadeh et al. 2011] and some VLSI tricks. “Dark silicon” refers to the parts of the processor that must not be turned off (hence dark) in order for the heat generated not to exceed the maximum capability that the cooling system can dissipate (called thermal design point, or TDP). How to manage dark silicon while trying to increase performance? This is the question that has resulted in many research papers in the second decade of the twenty-first century. We can think of the dark-silicon strategy as a way to continue increasing the number of cores per chip while keeping the power and temperature at a manageable level. The figure also shows that we stopped, or substantially slowed, increasing clock frequency. With this bag of tricks, we sustained, so far, a steady increase of transistors, as the figure shows at its top curve. There is one interesting curve remaining in the figure: the single thread (i.e., sequential programs) performance. There is a steady increase in single thread performance almost till 2010. The major reason is Moore’s law, which allowed computer architects to make use of these transistors to add more features (from pipelining to superscalar to speculative execution, etc.). Another reason is the increase in clock frequency that was maintained till around 2004. There are some minor factors that make single thread performance a bit better with multicore. One of them is that the single thread program has a higher chance of executing on a core by itself without sharing resources with another program. The other is that of thread migration. If a single thread program is running on a core and that core becomes warm, the frequency and voltage will be scaled down, slowing down the program. If the program is running on a multicore, and thread migration is supported, the program may migrate to another core, losing some performance in the migration process but continuing at full speed afterwards.
Figure 1.1Trend of different aspects of microprocessors. (Karl Rupp. 2018. 42 years of microprocessor trend data. Courtesy of Karl Rupp https://github.com/karlrupp/microprocessor-trend-data; last accessed March 2018)
The story outlined above relates the hardware tricks developed to manage the power and temperature near the end of Moore’s law and the end of Dennard scaling. Those tricks gave some relief to the hardware community but started a very difficult problem for software folks.
Now that we have multicore processors all over the place, single thread programs are no longer an option.
The free lunch is over [Sutter 2005]! In the good old days, you could write a sequential program and expect that your program would become faster with every new generation of processors. Now, unless you write parallel code, don’t expect to get that much of a performance boost anymore. Take another look at the single thread performance in Figure 1.1. We moved from single core to multicore not because the software community was ready for concurrency but because the hardware community could not afford to neglect the power issue. The problem is getting even harder because this multicore or parallel machine is no longer homogeneous. You are not writing code for a machine that consists of similar computing nodes but different ones. So now we need heterogeneous parallel programming.
We saw how we moved from single core to multiple homogeneous cores. How did heterogeneity arise? It is again a question of power, as we will see. But before we go deeper into heterogeneity, it is useful to categorize it into two types from a programmer’s perspective.
A machine is as useful as the programs written for it. So let’s look at heterogeneity from a programmer’s perspective. There is this heterogeneity that is beyond a programmer’s control. Surprisingly, this type has been around for several years now; and many programmers don’t know it exists! There is also heterogeneity within a programmer’s control. What is the difference? And how come we have been dealing with heterogeneity without knowing it?
1.2Heterogeneity Beyond Our Control
Multicore processors have been around now for more than a decade, and a lot of programs were written for them using different parallel programming paradigms and languages. However, almost everybody thinks they are writing programs for a heterogeneous machine, unless of course there is an explicit accelerator like a GPU or FPGA involved. In this section we show that we have not been programming a pure homogeneous machine even if we thought so!
1.2.1Process Technology
Everybody, software programmers included, knows that we are using CMOS electronics in our design of digital circuits, and to put them on integrated circuits we use process technology that is based on silicon. This has been the norm for decades. This is true. But even in process technology, there is heterogeneity.
Instead of silicon, semiconductor manufacturing uses a silicon-insulator-silicon structure. The main reason for using silicon on insulator (SOI) is to reduce device capacitance. This capacitance causes the circuit elements to behave in nonideal ways. SOI reduces this capacitance and hence results in performance enhancement.
Instead of traditional CMOS transistors, many manufacturers use what is called a Fin field-effect (FinFET) transistor. Without going into a lot of electronics details, a transistor, which is the main building block of processors, is composed of gate, drain, and source. Depending on the voltage at gate, the current flows from source to drain or is cut off. Switching speed (i.e., from on to off) affects the overall performance. FinFET transistors are found to have a much higher switching time than traditional CMOS technology. An example of a FinFET transistor is Intel’s tri-gate transistor, which was used in 2012 in the Ivy Bridge CPU.
Those small details are usually not known, or not well known, to the software community, making it harder to reason about the expected performance of a chip, or, even worse, of several chips in a multisocket system (i.e., several processors sharing the memory).
1.2.2Voltage and Frequency
The dynamic power consumed and dissipated by the digital circuits of all our processors is defined by this equation: P=C×Vcc2×F×N, where C is the capacitance, Vcc is the supply voltage, F is the frequency, and N is the number of bits switching. As we can see, there is a cubic relationship between the dynamic power and supply voltage and frequency. Reducing the frequency and reducing the supply voltage (up to a limit to avoid switching error) greatly reduces the dynamic power, at the expense of performance.
There are many layers in the computing stack that do dynamic voltage and frequency scaling (DVFS). It is done at the hardware level and per core. This means that even if we think we are writing an application for a homogeneous multicore, it is actually heterogeneous because it may have different performance measure-ments based on the processes running on each core. It is also done at the operating system (OS) level. With Intel processors, for example, the OS requests a particular level of performance, known as performance-level (P-level), from the processor. The processor then uses DVFS to try to meet the requested P-state. Up to that point, the programmer has no control and all this is happening under the hood. There are some techniques that involve application-directed DVFS. The programmer knows best when high performance is needed and when the program can tolerate lower performance for better power saving. However, this direction from the application can be overridden by the OS or the hardware.
1.2.3Memory system
Beginning programmers see the memory as just a big array that is, usually, byte addressable. As programmers gain more knowledge, the concept of virtual memory will arise and they will know that each process has its known virtual address space that is mapped to physical memory that they see as a big array that is, usually, byte addressable! Depending on the background of the programmers, the concept of cache memory may be known to them. But what programmers usually do not know is that the access time for the memory system and large caches is no longer fixed. To overcome complexity and power dissipation, both memory and large caches are divided into banks. Depending on the address accessed, the bank may be near, or far, from the requesting core, resulting in nonuniform memory access (NUMA for memory) [Braithwaite et al. 2012] and nonuniform cache access (NUCA for cache) [Chishti et al. 2003]. This is one of the results from heterogeneous performance of memory hierarchy.
Another factor that contributes to the heterogeneity in memory systems is the cache hits and misses. Professional programmers, and optimizing compilers to some extent, know how to write cache-friendly code. However, the multiprogramming environment, where several processes are running simultaneously, the virtual memory system, and nondeterminism in parallel code make the memory hierarchy response time almost unpredictable. And this is a kind of temporal heterogeneity.
Another form of heterogeneity in memory systems is the technology. In the last several decades, the de facto technology used in memory hierarchy is dynamic RAM (DRAM) for the system memory, and in the last decade embedded DRAM or eDRAM for last-level cache, for some processors, especially IBM POWER processors. For the cache hierarchy static RAM (SRAM) is the main choice. DRAM has higher density but higher latency, due to its refresh cycle. Despite many architecture tricks, DRAM is becoming a limiting factor for performance. This does not mean it will disappear from machines, at least not very soon, but it will need to be complemented with something else. SRAM has shorter latency and lower density. This is why it is used with caches that need to be fast but not as big as the main system memory. Caches are also a big source of static power dissipation, especially leakage [Zhang et al. 2005]. With more cores on chip and with larger datasets, the big-data era, we need larger caches and bigger memory. But DRAM and SRAM are giving us diminishing returns from different angles: size, access latency, and power dissipation/consumption. A new technology is needed, and this adds a third element of heterogeneity.
The last few years have seen several emerging technologies that are candidates for caches and system memory. These technologies have the high density of DRAM, the low latency of SRAM, and, on top of that, they are nonvolatile [Boukhobza et al. 2017]. These technologies are not yet mainstream, but some of them are very close, waiting to solve some challenges related to cost, power, and data consistency.
Table 1.1 shows a comparison between the current (volatile) memory technologies used for caches and main memory, namely, DRAM and SRAM, and the new nonvolatile memory (NVM) technologies. The numbers in the table are approximate and collected from different sources but for the most part are from Boukhobza et al. [2017]. Many of the nonvolatile memory technologies have much higher density than DRAM and SRAM; look at the cell size. They also have comparable read latency and even lower read power in most cases. There are several challenges in using NVM that need to be solved and are shown in the table. For instance, write endurance is much lower than DRAM and SRAM, which causes a reliability problem. The power needed for write is relatively high in NVM. Consistency is also a big issue. When there is a power outage, we know that the data in DRAM and SRAM are gone. But for NVM when do not know whether the data stored are stale or updated. The power may have gone off while in the middle of a data update. A lot of research is needed to address these challenges. NVM can be used in the memory hierarchy at a level by itself, for example, as a last-level-cache (LLC) or in main memory, which is a vertical integration. NVM can also be used in tandem with traditional DRAM or SRAM, which is a horizontal integration. The integration of NVM in the memory hierarchy can be managed by the hardware, managed by the operating system, or left to the programmer to decide where to place the data. The first two cases are beyond the programmer’s control. In the near future, memory hierarchy is expected to include volatile and nonvolatile memories, adding to the heterogeneity of the memory system.
Table 1.1:Comparison of Several Memory Technologies
Figure 1.2 shows a summary of the factors that we have just discussed.
Figure 1.2Factors Introducing Heterogeneity in Memory
1.3Heterogeneity Within Our Control
In the previous section we explored what happens under the hood that makes the system heterogeneous in nature. In this section we explore factors that are under our control and make us use the heterogeneity of the system. There is a big debate on how much control to give the programmer. The more control the better the performance and power efficiency we may get, depending of course on the expertise of the programmer, and the less the productivity. We discuss this issue later in the book. For this section we explore, from a programmer perspective, what we can control.
1.3.1The Algorithm and The Language
When you want to solve a program, you can find several algorithms for that. For instance, look at how many sorting algorithms we have. You decide which algorithm to pick. We have to be very careful here. In the good old days of sequential programming, our main issues were the big-O notation. This means we need to optimize for the amount of computations done. In parallel computing, computation is no longer the most expensive operation. Communication among computing nodes (or cores) and memory access are more expensive than computation. Therefore, it is sometimes wiser to pick a worse algorithm in terms of computation if it has a better communication pattern (i.e., less communication) and a better memory access pattern (i.e., locality). You can even find some algorithms with the same big-O, but one of them is an order of magnitude slower than the other.
Once you pick your algorithm, or set of algorithms in the case of more sophisticated applications, you need to translate it to a program using one of the many parallel programming languages available (and counting!). Here also you are in control: which language to pick. There are several issues to take into account when picking a programming language for your project. The first is how suitable this language is for the algorithm at hand. Any language can implement anything. This applies to sequential and parallel languages. But some languages are much easier than others for some tasks. For example, if you want to count the number of times a specific pattern of characters appears in a text file, you can write a C program to do it. But a small Perl or Python script will do the job in much fewer lines. If you want less control but higher productivity, you can pick some languages with a higher level of abstraction (like Java, Scala, Python, etc.) or application-specific languages. On the other hand, the brave souls who are using PThreads, OpenMP, OpenCL, CUDA, etc., have more control yet the programs are more sophisticated.
Algorithm 1AX= Y: Matrix-Vector Multiplication
for i = 0 to m- 1 do
y[i] = 0;
for j = 0 to n - 1 do
y[i] += A[i][j] * X[j];
end for
end for
1.3.2The Computing Nodes
When you pick an algorithm and a programming language, you already have in mind the type of computing nodes you will be using. A program, or part of a program, can have data parallelism (single thread–multiple data), so it is a good fit for graphics processing units (GPUs). Algorithm 1.1 shows a matrix ( m × n) vector multiplication, which is a textbook definition of data parallelism. As a programmer, you may decide to execute it on a GPU or a traditional multicore. Your decision depends on the amount of parallelism available; in our case, it is the matrix dimension. If the amount of parallelism is not very big, it will not overcome the overhead of moving the data from the main memory to the GPU memory or the overhead of the GPU accessing the main memory (if your GPU and runtime supports that). You are in control.
If you have an application that needs to handle a vast amount of streaming data, like real-time network packet analysis, you may decide to use a field-programmable gate array (FPGA).
With a heterogeneous computing system, you have control of which computing node to choose for each part of your parallel application. You may decide not to use this control and use a high-abstraction language or workflow that does this assignment on your behalf for the sake of productivity—your productivity. However, in many cases an automated tool does not produce better results than a human expert, at least so far.
1.3.3The Cores in Multicore
Let’s assume that you decided to run your application on a multicore processor. You have another level of control: to decide which thread (or process) to assign to which core. In many parallel programming languages, programmers are not even aware that they have this control. For example, in OpenMP there is something called thread affinity that allows the programmer to decide how threads are assigned to cores (and sockets in the case of a multisocket system). This is done by setting some environment variables. If you use PThreads, there are APIs that help you assign thread to cores, such as pthread_setaffinity_np().
Not all the languages allow you this control though. If you are writing in CUDA, for example, you cannot guarantee on which streaming multiprocessor(SM), which is a group of execution units in NVIDIA parlance, your block of threads will execute on. But remember, you have the choice to pick the programming language you want. So, if you want this control, you can pick a language that allows you to have it. You have to keep in mind though that sometimes your thread assignments may be override by the OS or the hardware for different reasons such as thread migration due to temperature, high overhead on the machine from other programs running concurrently with yours, etc.
1.4Seems Like Part of a Solution to Exascale Computing
If we look at the list of the top 500 supercomputers in the world,1 we realize that we are in the petascale era. That is, the peak performance that such a machine can reach is on the order of 1015 floating point operations per second (FLOPS). This list is updated twice a year. Figure 1.3 shows the top four supercomputers. Rmax is the maximal achieved performance, while Rpeak is the theoretical peak (assuming zero-cost communication, etc.). The holy grail of high-performance computing is to have an exascale machine by the year 2021. That deadline has been a moving target: from 2015 to 2018 and now 2021. What is hard about that? We can build an exascale machine, that is, on the order of 1018 FLOPS by connecting, say, a thousand petascale machines with high-speed interconnection, right? Wrong! If you build the machine in the way we just mentioned, it would require about 50% of the power generated by the Hoover Dam! It is the problem of power again. The goal set by the US Department of Energy (2013) for an exascale machine is to have one exascale for 20–30 MW of power. This makes the problem very challenging.
Figure 1.3Part of the TOP500 list of fastest supercomputers (as of November 2018). (Top 500 List. 2018. Top 500 List Super Computers (November 2018) Courtesy Jack Dongarra; Retrieved November 2018; https://www.top500.org/lists/2018/11/)
Heterogeneity is one step toward the solution. Some GPUs may dissipate power more than multicore processors. But if a program is written in a GPU-friendly way and optimized for the GPU at hand, you get orders of magnitude speedup over a multicore, which makes the GPU better than a multicore in performance-per-watt measurement. If we assume the power budget to be fixed to, say, 30 MW, then using the right chips for the application at hand gets you much higher performance. Of course heterogeneity alone will not solve the exascale challenge, but it is a necessary step.