CHAPTER 2

Road Map

Customized computing involves the specialization of hardware for a particular domain, and often includes a software component to fully leverage this specialization in hardware. In this chapter, we will lay the foundation for customized computing, enumerating the design trade-offs and defining vocabulary.

2.1 CUSTOMIZABLE SYSTEM-ON-CHIP DESIGN

In order to provide efficient support for customized computing, the general-purpose CMP (chip multiprocessor) widely used today needs to be replaced or transformed into a Customizable System-on-a-Chip (CSoC), also called a customizable heterogeneous platform (CHP) in some other publications [39]. A CSoC can be customized for a particular domain through the specialization of four major components of the computing platform: (1) processor cores, (2) accelerators and co-processors, (3) on-chip memory components, and (4) the network-on-chip (NoC) that connects the various components. We will explore each of these in detail individually, as well as in concert with the other CSoC components.

2.1.1 COMPUTE RESOURCES

Compute components like processor cores handle the actual processing demands of the CSoC. There is a wide array of design choices among the compute components of the CSoC, but when looking at customized compute units, there are three major factors to consider, all of which are largely independent of one another:

• Programmability

• Specialization

• Reconfigurability

Programmability

A fixed function compute unit can perform one operation on incoming data, and nothing else. For example, a compute unit that is designed to perform an FFT operation on any incoming data is fixed function. This inflexibility limits how much the compute unit may be leveraged, but it streamlines the design of the unit such that it may be highly optimized for that particular task. The number of bits used within the unit's datapath and the types of mathematical operators included, for example, can be precisely tuned to the particular operation the compute unit will perform.

In contrast, a programmable compute unit executes sequences of instructions that define the tasks it is to perform. The instructions understood by the programmable compute unit constitute its instruction set architecture (ISA). The ISA is the interface for use of the programmable compute unit: software that makes use of the unit consists of these instructions, and the instructions are typically chosen to make the ISA expressive enough to describe the computation desired of the programmable unit. The hardware of the programmable unit handles these instructions in a generally more flexible datapath than that of the fixed function compute unit. The fetching, decoding, and sequencing of instructions leads to performance and power overhead that is not required in a fixed function design, but the programmable compute unit is capable of executing different sequences of instructions to handle a wider array of functions than a fixed function pipeline.

There exists a broad spectrum of design choices between these two alternatives. Programmable units may have a large number of instructions or a small number of instructions, for example. A pure fixed function compute unit can be thought of as a programmable compute unit that has only a single implicit instruction (i.e., perform an FFT). The more instructions supported by the compute unit, the more compactly software can express the desired functionality. The fewer instructions supported by the compute unit, the simpler the hardware required to implement them and the greater the potential for an optimized and streamlined implementation. Thus the programmability of the compute unit refers to the degree to which it may be controlled via a sequence of instructions, from fixed function compute units that require no instructions at all to complex, expressive programmable designs with a large number of instructions.
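The contrast can also be sketched in software terms. The fragment below is a minimal illustration, not real hardware, and the tiny ISA is invented for this example: the fixed function unit is a single routine with one implicit operation, while the programmable unit fetches and decodes instructions from a program, which is exactly the flexibility, and the overhead, discussed above.

    #include <stddef.h>

    /* Minimal sketch (not real hardware): a fixed function unit performs one
     * implicit operation, while a programmable unit interprets an instruction
     * stream drawn from a small, hypothetical ISA. */

    /* Fixed function: a single implicit "instruction" (e.g., perform an FFT). */
    void fixed_function_unit(float *data, size_t n) {
        (void)data; (void)n;   /* a highly tuned FFT datapath would live here */
    }

    /* A tiny, hypothetical ISA for a programmable unit. */
    typedef enum { OP_LOAD, OP_ADD, OP_MUL, OP_STORE } opcode_t;
    typedef struct { opcode_t op; int dst, src; } insn_t;

    /* Programmable: fetch, decode, and execute each instruction in turn. The
     * loop and the switch are the fetch/decode/sequencing overhead that a
     * fixed function design avoids. */
    void programmable_unit(const insn_t *program, size_t len, float regs[8], float *mem) {
        for (size_t pc = 0; pc < len; pc++) {                          /* fetch   */
            insn_t insn = program[pc];
            switch (insn.op) {                                         /* decode  */
            case OP_LOAD:  regs[insn.dst] = mem[insn.src];   break;    /* execute */
            case OP_ADD:   regs[insn.dst] += regs[insn.src]; break;
            case OP_MUL:   regs[insn.dst] *= regs[insn.src]; break;
            case OP_STORE: mem[insn.dst]  = regs[insn.src];  break;
            }
        }
    }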

Specialization

Customized computing targets a smaller set of applications and algorithms within a domain to improve performance and reduce power requirements. The degree to which components are customized to a particular domain is the specialization of those components. There are a large number of different specializations that a hardware designer may utilize, from the datapath width of the compute unit, to the number and type of functional units, to the amount of cache, and more.

This is distinct from a general purpose design, which attempts to cover all applications rather than providing a customized architecture for a particular domain. General purpose designs may use a set of benchmarks from a target performance suite, but the idea is not to optimize specifically for those benchmarks. Rather, that performance suite may simply be used to gauge performance.

There is again a broad spectrum of design choices between specialized and general purpose designs. One may consider general purpose designs to be those specialized for the domain of all applications. In some cases, general purpose designs are more cost effective since the design time may be amortized over more possible uses; an ALU that is designed once and then used in a variety of compute units amortizes its design cost, for example.

Reconfigurability

Once a design has been implemented, it can be beneficial to allow further adaptation, so that the hardware can continue to be customized in response to (1) changes in data usage patterns, (2) algorithmic changes or advancements, and (3) domain expansion or unintended uses. For example, a compute unit may have been optimized to perform a particular algorithm for FFT, but a new approach may be faster. Hardware that can flexibly adapt even after tape-out is reconfigurable hardware. The degree to which hardware may be reconfigured depends on the granularity of reconfiguration. While finer-granularity reconfiguration can allow greater flexibility, the overhead of reconfiguration can mean that a reconfigurable design will perform worse and/or be less energy efficient than a static (i.e., non-reconfigurable) alternative. One example of a fine-grain reconfigurable platform is an FPGA, which can be used to implement a wide array of different compute units, from fixed function to programmable units, with all levels of specialization. But an FPGA implementation of a particular compute unit is less efficient than an ASIC implementation of the same compute unit. However, the ASIC implementation is static, and cannot adapt after design tape-out. We will examine more coarse-grain alternatives for reconfigurable compute units in Section 4.4.

Examples

• Accelerators—early GPUs, MPEG/media decoders, crypto accelerators

• Programmable Cores—modern GPUs, general purpose cores, ASIPs

• Future designs may feature accelerators in a primary computational role

• Some programmable cores and/or programmable fabric are still included for generality/longevity

Chapter 3 covers the customization of processor cores and Chapter 4 covers co-processors and accelerators. We split compute components into two chapters to better manage the diversity of the design space for these components.

2.1.2 ON-CHIP MEMORY HIERARCHY

Chips are fundamentally pin-limited, which impacts the amount of bandwidth that can be supplied to the compute units described in the previous section. This is further exacerbated by limitations in DRAM scaling. On-chip memory is one technique to mitigate this. On-chip memory can be used in a variety of ways, from providing data buffering for streaming data from off-chip to providing a place to store data that will be reused multiple times for computation. Once again, different applications will have different on-chip memory requirements. As with compute units, there are a wide array of design choices for hardware architects to consider in the design of the memory hierarchy.

Transparency to Software

A cache is a relatively small but fast memory that leverages the principle of locality to reduce the latency of memory accesses. Data that will be reused in the near future is kept in the cache to avoid accesses to longer-latency memory. There are two primary approaches to managing a cache (i.e., orchestrating what data comes into the cache and what data leaves it): purely hardware approaches and software-managed caches. In this book, we will use the term scratchpad to refer to software-managed caches, where the application writer or the compiler is responsible for explicitly bringing data into and out of the cache through special instructions. Hardware caches, where control circuits orchestrate data movement without software intervention, will simply be referred to as caches in this book. Scratchpads have tremendous potential for application-specific customization since cache management can be tuned to a particular application, but they also come with coding overhead, as the programmer or compiler must explicitly map out this orchestration. Conventional caches are more flexible, as they can handle a wider array of applications without requiring explicit management, and may be preferable in cases where the access pattern is unpredictable and therefore requires dynamic adaptation.
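The difference in management can be sketched as follows. This is a minimal illustration under assumed names: scratchpad_copy_in and scratchpad_copy_out stand in for the DMA transfers or special instructions a real platform would provide (here modeled with memcpy), and the computation is a simple scaling loop rather than a real kernel. The cached version performs the same work with no explicit data movement.

    #include <stddef.h>
    #include <string.h>

    #define TILE 1024
    static float spm[TILE];   /* models an on-chip scratchpad buffer */

    /* Hypothetical scratchpad interface; on real hardware these would be DMA
     * transfers or special instructions, modeled here with memcpy. */
    static void scratchpad_copy_in(const float *src, size_t n)  { memcpy(spm, src, n * sizeof(float)); }
    static void scratchpad_copy_out(float *dst, size_t n)       { memcpy(dst, spm, n * sizeof(float)); }

    /* Scratchpad version: the programmer (or compiler) explicitly stages data. */
    void scale_scratchpad(float *data, size_t n, float k) {
        for (size_t i = 0; i < n; i += TILE) {
            size_t len = (n - i < TILE) ? (n - i) : TILE;
            scratchpad_copy_in(&data[i], len);        /* explicit fill       */
            for (size_t j = 0; j < len; j++)
                spm[j] *= k;                          /* compute on-chip     */
            scratchpad_copy_out(&data[i], len);       /* explicit writeback  */
        }
    }

    /* Cache version: the same computation with no explicit data movement;
     * the hardware cache decides what stays on chip. */
    void scale_cached(float *data, size_t n, float k) {
        for (size_t i = 0; i < n; i++)
            data[i] *= k;
    }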

Sharing

On-chip memory may be kept private to a particular compute unit or may be shared among multiple compute units. Private on-chip memory means that the application will not need to contend for space with another application, and will get the full benefit of the on-chip memory. Shared on-chip memory can amortize the cost of on-chip memory over several compute units, providing a potentially larger pool of space for these compute units to leverage than if the space were partitioned among the units as private memory. For example, four compute units can each have 1MB of on-chip memory dedicated to them. Each compute unit will always have 1MB regardless of the demand from other compute units. However, if the four compute units instead share 4MB of on-chip memory and use different amounts of memory, one compute unit may, for example, use more than 1MB of space at a particular time since a larger pool of memory is available. Sharing works particularly well when compute units use different amounts of memory at different times. Sharing is also extremely effective when compute units make use of the same memory locations. For example, if compute units are all working on an image in parallel, storing the image in a single memory shared among the units allows the compute units to cooperate more effectively on the shared data.

2.1.3 NETWORK-ON-CHIP

On-chip memory stores the data needed by the compute units, but an important part of the overall CSoC is the communication infrastructure that distributes this stored data to the compute units, delivers data between the on-chip memory and the memory interfaces that communicate off-chip, and allows compute units to synchronize and communicate with one another. In many applications there is a considerable amount of data that must be communicated to the compute units used to accelerate application performance. And with multiple compute units often employed to maximize data-level parallelism, there are often multiple data streams being communicated around the CSoC. These requirements exceed what the conventional bus-based interconnects of older multicore designs can provide, so designers instead choose network-on-chip (NoC) designs, which enable the communication of more data between more CSoC components.

Components interfacing with an NoC typically bundle transmitted data into packets, which contain at least addressing information identifying the desired communication destination, along with the payload itself: some portion of the data to be transmitted to that destination. NoCs transmit messages via packets to enable flexible and reliable data transport; packets may be buffered at intermediate nodes within the network or reordered in some situations. Packet-based communication also avoids the long-latency arbitration associated with communicating in a single hop across an entire chip. Each hop through a packet-based NoC performs local arbitration instead.
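As a concrete illustration, a packet might be laid out as in the sketch below. The field names and widths are assumptions for illustration only; real NoCs typically split packets into flits and carry additional control state.

    #include <stdint.h>

    /* Illustrative packet layout (field names and widths are assumptions only). */
    typedef struct {
        uint8_t  dest;          /* destination node: where the packet is routed */
        uint8_t  src;           /* source node, useful for replies              */
        uint16_t len;           /* number of valid payload bytes                */
        uint8_t  payload[64];   /* a portion of the data being transmitted      */
    } noc_packet_t;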

The creation of an NoC involves a rich set of design decisions that may be highly customized for a set of applications in a particular domain. Most design decisions impact the latency or bandwidth of the NoC. In simple terms, the latency of the NoC is how long it takes a given piece of data to pass through the NoC. The bandwidth of the NoC is how much data can be communicated in the NoC at a particular time. Lower latency may be more important for synchronizing communication, like locks or barriers that impact multiple computational threads in an application. Higher bandwidth is more important for applications with streaming computation (i.e., low data locality) for example.
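As a rough illustration of how these two quantities combine, the back-of-the-envelope calculation below uses assumed numbers (the per-hop latency, link bandwidth, and message size are not taken from any particular design): the delivery time of one message is roughly the per-hop latency times the hop count, plus the time to serialize the message onto a link.

    #include <stdio.h>

    /* Back-of-the-envelope model with assumed numbers, illustration only:
     * delivery time = per-hop latency * hops + message size / link bandwidth. */
    int main(void) {
        double hop_latency_ns = 2.0;    /* assumed router + link delay per hop */
        double link_bw_gbps   = 16.0;   /* assumed link bandwidth in GB/s      */
        double msg_bytes      = 64.0;   /* one cache-line-sized message        */
        int    hops           = 4;

        double latency_ns   = hop_latency_ns * hops;
        double serialize_ns = msg_bytes / link_bw_gbps;   /* bytes / (GB/s) = ns */
        printf("message delivery time: %.1f ns\n", latency_ns + serialize_ns);
        return 0;
    }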

One example of a design decision is the topology of an NoC. This refers to the pattern of links that connect particular components of the NoC. A simple topology is a ring, where each component in the NoC is connected to two neighboring components, forming a chain of components. More complex communication patterns may be realized by more highly connected topologies that allow more simultaneous communication or a shorter communication distance.
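To make the ring example concrete, the toy model below (an illustration only, not any particular NoC) lists each node's two neighbors and the hop count between two nodes when a packet may travel either way around the ring.

    /* Toy model of a ring topology with N nodes: each node connects to exactly
     * two neighbors, and a packet may travel either direction around the ring. */
    #define N 8

    int left_neighbor(int node)  { return (node + N - 1) % N; }
    int right_neighbor(int node) { return (node + 1) % N; }

    /* Hop count between two nodes, taking the shorter direction. */
    int ring_hops(int a, int b) {
        int cw  = (b - a + N) % N;     /* clockwise distance        */
        int ccw = (a - b + N) % N;     /* counterclockwise distance */
        return cw < ccw ? cw : ccw;
    }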

Another example is the bandwidth of an individual link in the topology, that is, the wires traversed in one cycle of the network's clock. Wider links can improve bandwidth but require more buffering space at intermediate network nodes, which can increase power cost.

An NoC is typically designed with a particular level of utilization in mind, where decisions like topology or link bandwidth are chosen based on an expected level of service. For example, NoCs may be designed for worst case behavior, where the bandwidth of individual links is sized for peak traffic requirements, and every path in the network is capable of sustaining that peak bandwidth requirement. This is a flexible design in that the worst case behavior can manifest on any particular communication path in the NoC, and there will be sufficient bandwidth to handle it. But it can mean overprovisioning the NoC if worst case behavior is infrequent or sparsely exhibited. In other words, the higher-bandwidth components can mean wasted power (if only static power) or wasted area most of the time. NoCs may also be designed for average case behavior, where the bandwidth is sized according to the average traffic requirement, but in such cases performance can suffer when worst case behavior is exhibited.

Topological Customization

Customized designs can specialize different parts of the NoC for different communication patterns seen in applications within a domain. For example, an architecture may specialize the NoC such that there is a high bandwidth connection between a memory interface and a particular compute unit that performs workload balancing and sorting for particular tasks, and a lower bandwidth connection between that workload-balancing compute unit and the remainder of the compute units that perform the actual computation (i.e., work). More sophisticated designs can adapt bandwidth to the dynamic requirements of the application in execution. Customized designs may also adapt the topology of the NoC to the specific requirements of the application in execution. Section 6.2 will explore such flexible designs, along with some of the complexity in implementing NoC designs that are specialized for particular communication patterns.

Routing Customization

Another approach to specialization is to change the routing of packets in the NoC. Packets may be scheduled in different ways to avoid congestion in the NoC, for example. Another example would be circuit switching, where a particular route through the NoC is reserved for a particular communication, allowing packets in that communication to be expedited through the NoC without intermediate arbitration. This is useful in bursty communication where the cost of arbitration may be amortized over the communication of many packets.

Physical Design Customization

Some designs leverage different types of wires (i.e., different physical trade-offs) to provide a heterogeneous NoC with specialized communication paths. And there are also a number of exciting alternative interconnects that are emerging for use in NoC design. These alternative interconnects typically improve interconnect bandwidth and reduce communication latency, but may require some overhead (such as upconversion to an analog signal to make use of the alternative interconnect). These interconnects have some physical design and architectural challenges, but also provide some interesting options for customized computing, as we will discuss in Section 6.4.

2.2 SOFTWARE LAYER

Customization is often a holistic process that involves both hardware customization and software orchestration. Application writers (i.e., domain experts) may have intimate knowledge of their applications which may not be expressed easily or at all in traditional programming languages. Such information could include knowledge of data value ranges or error tolerance, for example. Software layers should provide a sufficiently expressive language for programmers to communicate their knowledge of the applications in a particular domain to further customize the use of specialized hardware.

There are a number of approaches to programming domain-specific hardware. A common approach is to create multiple layers of abstraction between the application programmer and the domain-specific hardware. The application programmer writes code in a relatively high level language that is expressive enough to capture domain-specific information. The high level language uses library routines implemented in the lower levels of abstraction as much as possible to cover the majority of computational tasks. The library routines may be implemented through further levels of abstraction, but ultimately lead to a set of primitives that directly leverage domain-specific hardware. As an example, library routines to do FFTs may leverage hardware accelerators specifically designed for FFT. This provides some portability of the higher level application programmer code, while still providing domain-specific specialization at the lower abstraction levels that directly leverages customized hardware. This also hides the complexity of customized hardware from the application writer.
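A sketch of how such layering might look is given below. The accelerator hooks (fft_accel_available, fft_accel_execute) and the fallback routine are hypothetical names invented for this illustration and stubbed out so the sketch compiles; the point is only that the application calls a stable library routine, lib_fft, while the layer beneath it decides whether customized hardware is used.

    #include <complex.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical accelerator hooks; the names are invented for illustration
     * and stubbed out so the sketch compiles. A real platform would probe for
     * and drive the FFT accelerator here. */
    static bool fft_accel_available(void) { return false; }
    static void fft_accel_execute(float complex *buf, size_t n) { (void)buf; (void)n; }

    /* Portable fallback: a naive O(n^2) DFT standing in for a tuned software FFT. */
    static void fft_software(float complex *buf, size_t n) {
        float complex out[n];                    /* C99 variable-length array */
        const float two_pi = 6.2831853f;
        for (size_t k = 0; k < n; k++) {
            out[k] = 0;
            for (size_t j = 0; j < n; j++)
                out[k] += buf[j] * cexpf(-I * two_pi * (float)(k * j) / (float)n);
        }
        for (size_t k = 0; k < n; k++)
            buf[k] = out[k];
    }

    /* The routine the application programmer sees: the interface stays stable,
     * while the layer beneath it chooses customized hardware when available. */
    void lib_fft(float complex *buf, size_t n) {
        if (fft_accel_available())
            fft_accel_execute(buf, n);           /* dispatch to FFT accelerator */
        else
            fft_software(buf, n);                /* portable software fallback  */
    }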

Another question is how much of the process of software mapping may be automated. Compilers may be able to perform much of the mapping of high level code to customized hardware through intelligent algorithms that can transform code and leverage application-specific information from the application programmer. Automation is a powerful tool for discovering opportunities for acceleration in code that may not be covered by existing library routines.
