CHAPTER 3
Customization of Cores
3.1 INTRODUCTION
Because processing cores contribute greatly to energy consumption in modern processors, the conventional processing core is a good place to start looking for customizations to computation engines. Processing cores are pervasive, and their architecture and compilation flow are mature. Modifications made to processing cores therefore have the advantage that the existing hardware modules and infrastructure invested in building efficient, high-performance processors can be leveraged, without necessarily abandoning existing software stacks, as may be required when designing hardware from the ground up. Additionally, programmers can use their existing knowledge of programming conventional processing cores as a foundation for learning new techniques that build on conventional cores, instead of having to adopt new programming paradigms or new languages.
In addition to benefiting from mature software stacks, any modifications made to a conventional processing core can also take advantage of many of the architectural components that have made cores so effective. Examples of these architectural components are caches, mechanisms for out-of-order scheduling and speculative execution, and software scheduling mechanisms. By integrating modifications directly into a processing core, new features can be designed to blend into these components. For example, adding a new instruction to the existing execution pipeline automatically enables this instruction to benefit from aggressive instruction scheduling already present in a conventional core.
However, introducing new compute capability, such as new arithmetic units, into existing processing cores means being burdened by many of the design restrictions that these cores already exert on arithmetic unit design. For example, out-of-order processing benefits considerably from short-latency instructions, since long-latency instructions can cause pipeline stalls. Conventional cores are also fundamentally bound, both in performance and in efficiency, by the infrastructure necessary to execute instructions. As a result, conventional cores cannot be as efficient at performing a particular task as a hardware structure that is specialized for that purpose [26]. Figure 3.1 illustrates this point, showing that the energy cost of executing an instruction is much greater than the energy required to perform the arithmetic computation itself (e.g., energy devoted to integer and floating point arithmetic). The rest of the energy is spent on the infrastructure internal to the processing core that performs tasks such as scheduling instructions, fetching and decoding, extracting instruction-level parallelism, etc. Figure 3.1 compares only structures internal to the processing core itself, and excludes external components such as memory systems and networks. These burdens are ever present in conventional processing cores, and they represent the architectural cost of generality and programmability. This can be contrasted against the energy proportions shown in Figure 3.2, which shows the energy savings when the compute engine is customized for a particular application instead of being a general-purpose design. The difference in energy devoted to computation is primarily the result of relaxing the design requirements of the functional units: they operate only at the precisions that are necessary, are designed to emphasize energy efficiency per computation, and can exhibit deeper pipelines and longer latencies than would be tolerable inside a conventional core.
Figure 3.1: Energy consumed by subcomponents of a conventional compute core as a proportion of the total energy consumed by the core. Subcomponents that are not computationally necessary (i.e., they are part of the architectural cost of extracting parallelism, fetching and decoding instructions, scheduling, dependency checking, etc.) are shown as slices without fill. Results are for a Nehalem era 4-core Intel Xeon CPU. Memory includes L1 cache energy only. Taken from [26].
This chapter will cover the following topics related to customization of processing cores:
• Dynamic Core Scaling and Defeaturing: A post-silicon method of selectively deactivating underutilized components with the goal of conserving energy.
Figure 3.2: Energy cost of subcomponents in a conventional compute core as a proportion of the total energy consumed by the core. This shows the energy savings attainable if computation is performed in an energy-optimal ASIC. Results are for a Nehalem era 4-core Intel Xeon CPU. Memory includes L1 cache energy only. Taken from [26].
• Core Fusion: Architectures that enable one “big” core to act as if it were really many “small cores,” and vice versa, to dynamically adapt to different amounts of thread-level or instruction-level parallelism.
• Customized Instruction Set Extensions: Augmenting processor cores with new workload-specific instructions.
3.2 DYNAMIC CORE SCALING AND DEFEATURING
When a general-purpose processor is designed, it is done with a wide range of potential workloads in mind. For any particular workload, many resources may not be fully utilized. As a result, these resources continue to consume power but do not contribute meaningfully to program performance. In order to improve energy efficiency, architectural features can be added that allow these components to be selectively turned off. While this obviously does not allow the chip area spent on deactivated components to be repurposed, it does allow for a meaningful improvement in energy efficiency.
Manufacturers of modern CPUs enable this type of selective defeaturing, though typically not for this purpose. This is done with the introduction of machine-specific registers that indicate the activation of particular components. The purpose of these registers, from a manufacturer’s perspective, is to improve processor yield by allowing faulty components in an otherwise stable processor to be disabled. For this reason, the machine-specific registers governing device activation are rarely documented.
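As a concrete illustration, the following is a minimal sketch of how such a register can be inspected on Linux through the standard /dev/cpu/N/msr interface (provided by the msr kernel module and used by tools such as rdmsr). The register address below is purely illustrative; as noted above, the registers that actually govern defeaturing are rarely documented.

/* Minimal sketch: reading a machine-specific register on Linux through the
 * standard /dev/cpu/N/msr interface (requires the msr kernel module and root
 * privileges).  The register address below is purely illustrative; the
 * registers that actually govern defeaturing are rarely documented. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define ILLUSTRATIVE_MSR 0x1A4   /* hypothetical address, for illustration only */

int main(void)
{
    uint64_t value;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/cpu/0/msr");
        return 1;
    }
    /* Each MSR is 8 bytes wide; pread at the register address reads it. */
    if (pread(fd, &value, sizeof(value), ILLUSTRATIVE_MSR) != sizeof(value)) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("MSR 0x%x on core 0 = 0x%016llx\n",
           ILLUSTRATIVE_MSR, (unsigned long long)value);
    close(fd);
    return 0;
}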
There has been extensive academic work on using defeaturing to create dynamically heterogeneous systems. These works center on identifying when a program is entering a code region that systemically underutilizes some set of features that exist in a conventional core. For example, if it is possible to statically discover that a code region contains long sequences of dependencies between instructions, then it is clear that a processor with a wide issue and fetch width will not be able to find enough independent instructions to make effective use of those wide resources [4, 8, 19, 125]. In that case, powering off the components that enable wide fetch and issue, along with the architectural support for large instruction windows, can save energy without impacting performance. This academic work is contingent upon being able to discern the run-time behavior of code, either using run-time monitoring [4, 8] or static analysis [125].
An example of dynamic resource scaling from academia is CoolFetch [125]. CoolFetch relies on compiler support to statically estimate the execution rate of a code region, and then uses this information to dynamically grow and contract structures within a processor's fetch and issue units. By constraining these structures in code regions with few opportunities to exploit instruction-level parallelism or out-of-order scheduling, CoolFetch also observed a carry-over effect: power consumption of other processor structures that normally operate in parallel was reduced, and energy spent on squashed instructions fell because fewer instructions stall at retirement. In all, CoolFetch reported an average of 8% energy savings, with a relatively trivial architectural modification and a negligible performance penalty.
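To make the flavor of this approach concrete, the toy sketch below shows the kind of decision a compiler-guided runtime might make when scaling the front end; it is not the actual CoolFetch mechanism, and the thresholds and the set_fetch_width hook are invented for illustration.

/* Toy sketch, not the actual CoolFetch mechanism: given a compiler's static
 * estimate of how much instruction-level parallelism a region exposes, choose
 * how wide the fetch/issue structures need to be.  The thresholds and the
 * set_fetch_width hook are invented for illustration. */
#include <stdio.h>

static void set_fetch_width(int width)      /* hypothetical hardware hook */
{
    printf("fetch/issue structures scaled to width %d\n", width);
}

static int choose_width(double estimated_ipc, int max_width)
{
    /* Long dependence chains (low estimated IPC) cannot fill a wide front
     * end, so shrink it and save the associated energy. */
    if (estimated_ipc < 1.0)
        return 1;
    if (estimated_ipc < (double)max_width / 2.0)
        return max_width / 2;
    return max_width;
}

int main(void)
{
    set_fetch_width(choose_width(0.8, 4));  /* pointer-chasing region */
    set_fetch_width(choose_width(3.5, 4));  /* data-parallel inner loop */
    return 0;
}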
3.3 CORE FUSION
The efficiency of a processing core, both in terms of energy consumption and compute per area, tends to be reduced as the potential performance of the core increases. The primary cause of this is a shifting of focus from investing resources in compute engines in the case of small cores, to aggressive scheduling mechanisms in the case of big cores. In modern out-of-order processors, this scheduling mechanism constitutes the overwhelming majority of core area investment, and the overwhelming majority of energy consumption.
An obvious conclusion, then, is that large sophisticated cores are not worth including in a system, since a sea of weak cores provides greater potential for system-wide throughput than a small group of powerful cores. The problem with this conclusion, however, is that parallelizing software is difficult: parallel code is prone to errors like race conditions, and many algorithms are limited by sequential components that are more difficult to parallelize. Some code cannot reasonably be parallelized at all. In fact, the majority of software is not parallelized at all, and thus cannot make use of a large number of cores. In these situations a single powerful core is preferable, since it offers high single-thread throughput at the cost of restricting the capability to exploit thread-level parallelism. From this observation it is clear that the best design depends on the number of threads exposed in software. Large numbers of threads can be run on a large number of cores, enabling higher system-wide throughput, while a few threads may be better run on a few powerful cores, since the additional cores could not be utilized anyway.
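The limit that sequential code places on a sea of weak cores can be made quantitative with Amdahl's law (the chapter does not state the formula; it is added here for reference). If a fraction p of a program's execution can be spread across N cores while the remainder stays sequential, the overall speedup is bounded by

S(N) = \frac{1}{(1 - p) + p/N} \le \frac{1}{1 - p}.

With p = 0.5, for example, even an unlimited number of weak cores yields at most a 2x speedup, which is why the best choice of core organization tracks the number of threads the software actually exposes.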
This observation gave rise to a number of academic works that explore heterogeneous systems which feature a small group of very powerful cores on die with a large group of very efficient cores [64, 71, 84]. In addition to numerous academic perspectives on heterogeneous systems, industry has begun to adopt this trend, such as the ARM big.LITTLE [64]. While these designs are interesting, they still allocate compute resources statically, and thus cannot react to variation in the degree of parallelism present in software. To address this rigidity, core fusion [74], and other related work [31, 108, 115, 123], propose mechanisms for implementing powerful cores out of collaborative collections of weak cores. This allows a system to grow and shrink so that there are as many “cores” in the system as there are threads, and each of these cores is scaled to maximize performance with the current amount of parallelism in the system.
Core fusion [74] accomplishes this scaling by splitting a core into two halves: a narrow-issue conventional core with the fetch engine stripped off, and an additional modular fetch/decode/commit unit. This added unit either performs fetches for each core individually from separate program sequences, or performs a single wide fetch to feed all cores. Similar to how a line buffer reads in multiple instructions in a single effort, this wide fetch engine reads an entire block of instructions and issues them across the different cores. Decode and register renaming are also performed collectively, with registers physically resident in the various cores. A crossbar is added to move register values from one core to another when necessary. At the end of the pipeline, a reordering step is introduced to guarantee correct commit and exception handling. A diagram of this architecture is shown in Figure 3.3. Two additional instructions are added to this architecture that allow the operating system to merge and split core collections, thus adjusting the number of virtual cores available for scheduling.
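The sketch below is a purely hypothetical illustration of how an operating system might drive such merge and split operations when rebalancing; the core fusion work defines the instructions but not a C-level interface, so the cf_fuse_group and cf_split_group wrappers here are invented for illustration.

/* Purely hypothetical illustration: core fusion adds instructions for merging
 * and splitting core groups, but does not define a C-level interface;
 * cf_fuse_group and cf_split_group are invented stand-ins. */
#include <stdio.h>

static void cf_fuse_group(int group)  { printf("fuse core group %d\n", group); }
static void cf_split_group(int group) { printf("split core group %d\n", group); }

/* Match the number of virtual cores to the number of runnable threads:
 * few threads -> fuse small cores into wide ones; many threads -> split
 * back into many narrow cores. */
static void rebalance(int runnable_threads, int groups)
{
    for (int g = 0; g < groups; g++) {
        if (runnable_threads < groups)
            cf_fuse_group(g);
        else
            cf_split_group(g);
    }
}

int main(void)
{
    rebalance(2, 4);    /* mostly sequential phase: prefer fused, wide cores */
    rebalance(16, 4);   /* highly parallel phase: prefer many narrow cores   */
    return 0;
}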
As shown in Figure 3.4, fused cores perform only slightly worse than a monolithic processor of the same effective issue width, achieving performance within 20% of it. The main reason is that the infrastructure that enables fusion comes with a performance cost. The strength of this system, however, is its adaptability, not necessarily its performance compared to a processor designed for a particular software configuration. Furthermore, the structures necessary for wide out-of-order scheduling do not need to be powered when cores are not fused. As a result, core fusion surrenders a portion of the area used to implement the out-of-order scheduler and about 20% of performance when fused to emulate a larger core. In exchange, it enables run-time customization of core width and core count in a way that is binary compatible, and thus completely transparent. For systems that do not have a priori knowledge of the workloads that will run on the processor, or that expect software to transition between sequential and parallel phases, the ability to adjust to varying workloads is a great benefit.
Figure 3.3: A 4-core core fusion processor bundle with components added to support merging of cores. Adapted from [74].
Figure 3.4: Comparison of performance between various processors of issue widths and 6-issue merged core fusion. Taken from [74].
3.4 CUSTOMIZED INSTRUCTION SET EXTENSIONS
In a conventional, general-purpose processor design, each time an instruction is executed, it must pass through a number of stages of the processor pipeline. Each of these stages incurs a cost, which depends on the type of processor. Figure 3.1 showed the energy consumed in various stages of the processor pipeline. In terms of the core computational requirement of an application, the energy spent in the execute stage is energy spent doing productive compute work, and everything else (i.e., instruction fetch, renaming, instruction window allocation, wakeup and select logic) is overhead required to support and accelerate general-purpose instruction processing for a particular architecture. The reason execution constitutes such a small portion of the energy consumed is that most instructions each perform only a small amount of work.
Extending the instruction set of an otherwise conventional compute core to increase the amount of work done per instruction is one way of improving both performance and energy efficiency for particular tasks. This is accomplished by merging the tasks that would otherwise have been performed by multiple instructions into a single instruction. This is valuable because the single large instruction still requires only one pass through the fetch, decode, and commit phases, and thus requires less bookkeeping to perform the same task. In addition to reducing the overhead associated with processing an instruction, ISA extensions enable access to custom compute engines that implement these composite operations more efficiently than they could be implemented otherwise.
The strategy of instruction set customization ranges from very simple (e.g., [6, 95, 111]) to complex (e.g., [63, 66]). Simple but effective instruction set extensions are now common in commodity processors in the form of specialized vector instructions, such as SSE and AVX. Section 3.4.1 discusses vector instructions, which allow simple operations, mostly floating point operations, to be packed into a single instruction and applied over a large volume of data, potentially simultaneously. While these vector instructions are restricted to regular, compute-dense code, they provide a large enough performance advantage that processor manufacturers continue to push toward more feature-rich vector extensions [55].
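As a concrete example of this style of extension, the short C program below uses the AVX intrinsic _mm256_add_ps to perform eight single-precision additions with a single instruction (compile with, e.g., gcc -mavx). This is standard, documented functionality of AVX-capable x86 processors rather than anything specific to the works cited above.

/* One AVX instruction (reached here through the _mm256_add_ps intrinsic)
 * performs eight single-precision additions, doing the work of eight scalar
 * add instructions in a single pass through the pipeline. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float b[8] = {7, 6, 5, 4, 3, 2, 1, 0};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* 8 additions, one instruction */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);          /* prints 7 eight times */
    printf("\n");
    return 0;
}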
In addition to vector instructions, there has also been work proposed by both industry [95] and academia [63] that ties multiple operations together into a single compute engine that operates over a single element of data. These custom compute engines are discussed in Section 3.4.2, and differ from vector instructions in that they describe a group of operations over a small set of data, rather than the reverse. Thus, they can be tied more tightly into the critical path of a conventional core [136].
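A commodity analogue of this idea is the fused multiply-add available on x86 through FMA3 intrinsics: a multiply and an add over a single data element are combined into one instruction, as in the short example below (compile with, e.g., gcc -mfma). The custom compute engines discussed in Section 3.4.2 generalize this to larger groups of operations.

/* The scalar fused multiply-add combines a multiply and an add over a single
 * data element into one instruction; custom ISA extensions generalize this
 * to larger groups of operations. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_set_ss(3.0f);
    __m128 b = _mm_set_ss(4.0f);
    __m128 c = _mm_set_ss(5.0f);

    /* One instruction performs both the multiply and the add: a*b + c. */
    __m128 r = _mm_fmadd_ss(a, b, c);

    printf("%.1f\n", _mm_cvtss_f32(r)); /* prints 17.0 */
    return 0;
}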