Routing Customization
Another approach to specialization is to change the routing of packets in the NoC. Packets may be scheduled in different ways to avoid congestion in the NoC, for example. Another example would be circuit switching, where a particular route through the NoC is reserved for a particular communication, allowing packets in that communication to be expedited through the NoC without intermediate arbitration. This is useful in bursty communication where the cost of arbitration may be amortized over the communication of many packets.
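The amortization argument can be made concrete with a toy cost model. The sketch below is illustrative only: the function names, the cycle parameters, and the assumption that costs simply add up per hop are hypothetical simplifications, not a description of any particular NoC.

```python
def packet_switched_cycles(num_packets, hops, arb_cycles, link_cycles):
    """Total cycles when every packet arbitrates at every hop (toy model)."""
    return num_packets * hops * (arb_cycles + link_cycles)


def circuit_switched_cycles(num_packets, hops, setup_cycles, link_cycles):
    """Total cycles when the route is reserved once, then used without arbitration."""
    return setup_cycles + num_packets * hops * link_cycles


def break_even_packets(hops, arb_cycles, setup_cycles):
    """Smallest burst length for which the reserved circuit is strictly cheaper.

    The circuit wins when setup_cycles < n * hops * arb_cycles.
    """
    return setup_cycles // (hops * arb_cycles) + 1
```

Under these assumptions, with 4 hops, 2 arbitration cycles per hop, and a 20-cycle setup, the reserved circuit becomes cheaper once a burst contains at least 3 packets; shorter communications are better served by ordinary packet switching.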
Physical Design Customization
Some designs leverage different types of wires (i.e., wires with different physical trade-offs) to provide a heterogeneous NoC with specialized communication paths. A number of alternative interconnects are also emerging for use in NoC design. These alternative interconnects typically improve interconnect bandwidth and reduce communication latency, but may incur some overhead (such as upconversion to an analog signal in order to use the alternative interconnect). They pose physical design and architectural challenges, but also provide interesting options for customized computing, as we will discuss in Section 6.4.
2.2 SOFTWARE LAYER
Customization is often a holistic process that involves both hardware customization and software orchestration. Application writers (i.e., domain experts) may have intimate knowledge of their applications which may not be expressed easily or at all in traditional programming languages. Such information could include knowledge of data value ranges or error tolerance, for example. Software layers should provide a sufficiently expressive language for programmers to communicate their knowledge of the applications in a particular domain to further customize the use of specialized hardware.
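As a rough illustration of how such knowledge might be expressed and used, the sketch below lets a programmer declare a value range and an error tolerance, and derives a signed fixed-point format from them. The `ValueHint` type and `fixed_point_format` function are hypothetical names introduced for this example; they do not correspond to any existing language or toolchain, and the sizing rule ignores overflow caused by rounding at the top of the range.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueHint:
    """Hypothetical programmer-supplied annotation for one variable."""
    lo: float             # smallest value the variable can take
    hi: float             # largest value the variable can take
    abs_tolerance: float  # largest acceptable absolute error


def fixed_point_format(hint):
    """Return (integer_bits, fraction_bits) for a signed fixed-point type."""
    if hint.abs_tolerance <= 0 or hint.lo > hint.hi:
        raise ValueError("invalid hint")
    magnitude = max(abs(hint.lo), abs(hint.hi))
    # One sign bit, plus enough bits to hold the integer part of the magnitude.
    if magnitude >= 1:
        int_bits = 1 + math.floor(math.log2(magnitude)) + 1
    else:
        int_bits = 1
    # Round-to-nearest with f fraction bits has error at most 2**-(f + 1).
    frac_bits = max(0, math.ceil(-math.log2(hint.abs_tolerance)) - 1)
    return int_bits, frac_bits
```

For a variable known to lie in [-100, 100] with a tolerance of 0.01, this yields 8 integer bits and 6 fraction bits, considerably narrower than a 32-bit floating-point unit would provide by default.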
There are a number of approaches to programming domain-specific hardware. A common approach is to create multiple layers of abstraction between the application programmer and the domain-specific hardware. The application programmer writes code in a relatively high-level language that is expressive enough to capture domain-specific information. The high-level language relies as much as possible on library routines, implemented in the lower levels of abstraction, to cover the majority of computational tasks. The library routines may themselves be implemented through further levels of abstraction, but ultimately lead to a set of primitives that directly leverage domain-specific hardware. As an example, library routines that compute FFTs may leverage hardware accelerators specifically designed for FFT. This provides some portability for the higher-level application code, while still providing domain-specific specialization at the lower abstraction levels that directly leverage customized hardware. It also hides the complexity of customized hardware from the application writer.
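The layering described above can be sketched as a library routine that dispatches to a hardware primitive when one is available and otherwise falls back to software. Everything here is a hypothetical illustration: the registry, `register_accelerator`, and the dispatch policy are assumptions made for the example, and the fallback is a plain radix-2 Cooley-Tukey FFT rather than any vendor's implementation.

```python
import cmath

# Hypothetical registry mapping a primitive name to an accelerator-backed callable.
_ACCELERATORS = {}


def register_accelerator(name, fn):
    """Lower layer: a platform package would call this when hardware is present."""
    _ACCELERATORS[name] = fn


def _software_fft(x):
    """Portable fallback: recursive radix-2 FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = _software_fft(x[0::2])
    odd = _software_fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft(x):
    """Library routine seen by the application programmer."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    impl = _ACCELERATORS.get("fft", _software_fft)
    return impl([complex(v) for v in x])
```

Application code calls `fft` in the same way on every platform; only the lower layer knows whether an accelerator was registered, which is what gives the higher-level code its portability.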
Another question is how much of the software mapping process can be automated. Compilers may be able to perform much of the mapping of high-level code to customized hardware through intelligent algorithms that transform code and leverage application-specific information supplied by the application programmer. Automation is a powerful tool for discovering opportunities for acceleration in code that may not be covered by existing library routines.
CHAPTER 3
Customization of Cores
3.1 INTRODUCTION
Because processing cores contribute greatly to energy consumption in modern processors, the conventional processing core is a good place to start looking for customizations to computation engines. Processing cores are pervasive, and their architecture and compilation flow are mature. Modifications made to processing cores therefore have the advantage that existing hardware modules and the infrastructure invested in building efficient, high-performance processors can be leveraged, without necessarily having to abandon existing software stacks, as may be required when designing hardware from the ground up. Additionally, programmers can use their existing knowledge of programming conventional processing cores as a foundation for learning new techniques that build upon conventional cores, instead of having to adopt new programming paradigms or new languages.
In addition to benefiting from mature software stacks, any modifications made to a conventional processing core can also take advantage of many of the architectural components that have made cores so effective. Examples of these architectural components are caches, mechanisms for out-of-order scheduling and speculative execution, and software scheduling mechanisms. By integrating modifications directly into a processing core, new features can be designed to blend into these components. For example, adding a new instruction to the existing execution pipeline automatically enables this instruction to benefit from aggressive instruction scheduling already present in a conventional core.
However, introducing new compute capability, such as new arithmetic units, into existing processing cores means being burdened by many of the design restrictions that these cores already impose on arithmetic unit design. For example, out-of-order processing benefits considerably from short-latency instructions, as long-latency instructions can cause pipeline stalls. Conventional cores are also fundamentally bound, in terms of both performance and efficiency, by the infrastructure necessary to execute instructions. As a result, conventional cores cannot be as efficient at performing a particular task as a hardware structure that is more specialized to that purpose [26]. Figure 3.1 illustrates this point, showing that the energy cost of executing an instruction is much greater than the energy required to perform the arithmetic computation itself (e.g., energy devoted to integer and floating-point arithmetic). The rest of the energy is spent on the infrastructure internal to the processing core that performs tasks such as scheduling instructions, fetching and decoding them, and extracting instruction-level parallelism. Figure 3.1 compares only structures internal to the processing core itself, and excludes external components such as memory systems and networks. These burdens are ever present in conventional processing cores, and they represent the architectural cost of generality and programmability. This can be contrasted with the energy proportions shown in Figure 3.2, which show the energy savings when the compute engine is customized for a particular application rather than built as a general-purpose design.
The difference in energy cost devoted to computation is primarily the result of relaxing the design requirements of functional units, so that functional units operate only at precisions that are necessary and are designed to emphasize energy efficiency per computation, and potentially exhibit deeper pipelines and longer latencies than would be tolerable when couched inside a conventional core.
Figure 3.1: Energy consumed by subcomponents of a conventional compute core as a proportion of the total energy consumed by the core. Subcomponents that are not computationally necessary (i.e., they are part of the architectural cost of extracting parallelism, fetching and decoding instructions, scheduling, dependency checking, etc.) are shown as slices without fill. Results are for a Nehalem-era 4-core Intel Xeon CPU. Memory includes L1 cache energy only. Taken from [26].
This chapter will cover the following topics related to customization of processing cores:
• Dynamic Core Scaling and Defeaturing: A post-silicon method of selectively deactivating underutilized components with the goal of conserving energy.
Figure