Not only can the compute engine be customized, but so can the memory system and on-chip interconnects. For example, instead of only using a general-purpose cache, one may use program-managed or accelerator-managed buffers (or scratchpad memories). Customization is needed to flexibly partition these two types of on-chip memories. Memory customization will be discussed in Chapter 5. Also, instead of using a general-purpose mesh-based network-on-chip (NoC) for packet switching, one may prefer a customized circuit-switching topology between accelerators and the memory system. Customization of on-chip interconnects will be discussed in Chapter 6.
The remainder of this lecture is organized as follows. Chapter 2 gives a broad overview of the trajectory of customization in computing. Chapter 3 covers customization of compute cores, such as custom instructions. Chapter 4 discusses loosely coupled compute engines. Chapter 5 discusses customizations to the memory system, and Chapter 6 discusses custom network interconnect designs. Finally, Chapter 7 concludes the lecture with discussions of industrial trends and future research topics.
CHAPTER 2
Road Map
Customized computing involves the specialization of hardware for a particular domain, and often includes a software component to fully leverage this hardware specialization. In this chapter, we will lay the foundation for customized computing, enumerating the design trade-offs and defining vocabulary.
2.1 CUSTOMIZABLE SYSTEM-ON-CHIP DESIGN
In order to provide efficient support of customized computing, the general-purpose chip multiprocessor (CMP) widely used today needs to be replaced by, or transformed into, a Customizable System-on-a-Chip (CSoC), also called a customizable heterogeneous platform (CHP) in some other publications [39]. A CSoC can be customized for a particular domain through the specialization of four major components of the computing platform: (1) processor cores, (2) accelerators and co-processors, (3) on-chip memory components, and (4) the network-on-chip (NoC) that connects the various components. We will explore each of these in detail individually, as well as in concert with the other CSoC components.
2.1.1 COMPUTE RESOURCES
Compute components like processor cores handle the actual processing demands of the CSoC. There is a wide array of design choices in the compute components of the CSoC. When looking at customized compute units, however, there are three major factors to consider, all of which are largely independent of one another:
• Programmability
• Specialization
• Reconfigurability
Programmability
A fixed function compute unit can do one operation on incoming data, and nothing else. For example, a compute unit that is designed to perform an FFT operation on any incoming data is fixed function. This inflexibility limits how much a compute unit may be leveraged, but it streamlines the design of the unit such that it may be highly optimized for that particular task. The number of bits used within the datapath of the unit and the types of mathematical operators included, for example, can be precisely tuned to the particular operation the compute unit will perform.
In contrast, a programmable compute unit executes sequences of instructions that define the tasks it is to perform. The instructions understood by the programmable compute unit constitute the instruction set architecture (ISA). The ISA is the interface to the programmable compute unit. Software that makes use of the programmable compute unit consists of these instructions, which are typically chosen to maximize how expressively the ISA can describe the computation desired of the programmable unit. The hardware of the programmable unit handles these instructions in a datapath that is generally more flexible than that of a fixed function compute unit. The fetching, decoding, and sequencing of instructions leads to performance and power overhead that is not required in a fixed function design. But the programmable compute unit is capable of executing different sequences of instructions to handle a wider array of functions than a fixed function pipeline.
There exists a broad spectrum of design choices between these two alternatives. Programmable units may have a large number of instructions or a small number of instructions, for example. A pure fixed function compute unit can be thought of as a programmable compute unit that only has a single implicit instruction (i.e., perform an FFT). The more instructions supported by the compute unit, the more compactly software can express the desired functionality. The fewer instructions supported by the compute unit, the simpler the hardware required to implement these instructions and the more potential for an optimized and streamlined implementation. Thus the programmability of the compute unit refers to the degree to which it may be controlled via a sequence of instructions, from fixed function compute units that require no instructions at all to complex, expressive programmable designs with a large number of instructions.
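The contrast between the two ends of this spectrum can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: a dot product stands in for the FFT to keep the example short, and the opcodes (`CLR`, `MAC`, `ADD`), the class `ProgrammableUnit`, and the use of an instruction count as a proxy for fetch/decode/sequencing overhead are all hypothetical choices, not taken from any real design.

```python
# Toy model only: names, opcodes, and the overhead proxy are assumptions.

def fixed_function_dot(a, b):
    """Fixed function unit: one implicit operation, no instructions needed."""
    return sum(x * y for x, y in zip(a, b))


class ProgrammableUnit:
    """Programmable unit with a hypothetical three-instruction accumulator ISA."""

    def __init__(self):
        self.acc = 0
        # Counts instructions handled; a stand-in for fetch/decode/sequencing cost.
        self.instructions_executed = 0

    def run(self, program):
        self.acc = 0
        self.instructions_executed = 0
        for opcode, *operands in program:      # "fetch" the next instruction
            self.instructions_executed += 1
            if opcode == "CLR":                # "decode", then execute
                self.acc = 0
            elif opcode == "MAC":              # multiply-accumulate
                x, y = operands
                self.acc += x * y
            elif opcode == "ADD":              # accumulate a single value
                (x,) = operands
                self.acc += x
            else:
                raise ValueError(f"unknown opcode: {opcode}")
        return self.acc


def dot_program(a, b):
    """Instruction sequence that computes a dot product."""
    return [("CLR",)] + [("MAC", x, y) for x, y in zip(a, b)]


def sum_program(a):
    """A different instruction sequence on the same hardware: a plain sum."""
    return [("CLR",)] + [("ADD", x) for x in a]


a, b = [1, 2, 3], [4, 5, 6]
unit = ProgrammableUnit()
dot_result = unit.run(dot_program(a, b))   # same answer as the fixed unit
dot_steps = unit.instructions_executed     # overhead the fixed unit avoids
sum_result = unit.run(sum_program(a))      # function the fixed unit cannot do
```

In this sketch the programmable unit reproduces the fixed function result but must process one instruction per step, while also being able to run `sum_program`, which the fixed function unit cannot. Adding or removing opcodes moves the toy design along the spectrum described above.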
Specialization
Customized computing targets a smaller set of applications and algorithms within a domain to improve performance and reduce power requirements. The degree to which components are customized to a particular domain is the specialization of those components. There are a large number of different specializations that a hardware designer may utilize, from the datapath width of the compute unit, to the number and type of functional units, to the amount of cache, and more.
This is distinct from a general purpose design, which attempts to cover all applications rather than providing a customized architecture for a particular domain. General purpose designs may use a set of benchmarks from a target performance suite, but the idea is not to optimize specifically for those benchmarks. Rather, that performance suite may simply be used to gauge performance.
There is again a broad spectrum of design choices between specialized and general purpose designs. One may consider general purpose designs to be those specialized for the domain of all applications. In some cases, general purpose designs are more cost effective because the design time may be amortized over more possible uses. For example, an ALU that is designed once and then used in a variety of compute units spreads its design cost across all of those units.
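The amortization argument can be made concrete with a small calculation. The sketch below assumes a simple cost model in which a one-time design cost is spread evenly over every unit built; the function name `cost_per_unit` and all of the figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cost_per_unit(design_cost, unit_cost, units):
    """One-time design cost spread evenly over all units, plus per-unit cost."""
    if units <= 0:
        raise ValueError("units must be positive")
    return design_cost / units + unit_cost


# Hypothetical figures, in arbitrary cost units.
DESIGN_COST = 1_000_000   # same one-time design effort in both cases
UNIT_COST = 5             # same per-unit manufacturing cost in both cases

# A block reused widely versus one used only in a narrow domain.
general_purpose = cost_per_unit(DESIGN_COST, UNIT_COST, units=100_000)
specialized = cost_per_unit(DESIGN_COST, UNIT_COST, units=1_000)
```

Under these assumed numbers the widely reused block costs 15.0 per unit and the narrowly used one costs 1005.0 per unit, so a specialized design has to recover the difference through better performance or energy efficiency in its target domain.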
Reconfigurability
Once a design has been implemented, it can be beneficial to allow further adaptation so that the hardware can continue to be customized in response to (1) changes in data usage patterns, (2) algorithmic changes or advancements, and (3) domain expansion or unintended use. For example, a compute unit may have been optimized to perform a particular algorithm for FFT, but a new approach may be faster. Hardware that can flexibly adapt even after tape-out is reconfigurable hardware. The degree to which hardware may be reconfigured depends on the granularity of reconfiguration. While finer-granularity reconfiguration can allow greater flexibility, the overhead of reconfiguration can mean that a reconfigurable design will perform worse and/or be less energy efficient than a static (i.e., non-reconfigurable) alternative. One example of a fine-grain reconfigurable platform is an FPGA, which can be used to implement a wide array of different compute units, from fixed function to programmable units, with all levels of specialization. But an FPGA implementation of