The different types of processing may be implemented using a heterogeneous multi-processor architecture, with different types of processors to satisfy the different processing requirements. However, for low/mid-end machine learning inference, the total compute requirements are often limited and can be handled by a single processor running at a reasonable frequency, provided it has the right capabilities. As discussed above, the use of a single processor eliminates the area and communication overhead associated with multi-processor architectures. It also simplifies software development, as a single tool chain can be used for the complete application. However, it requires the processor to perform DSP, neural network processing and control processing with excellent cycle efficiency. In the next section, we will discuss in more detail the capabilities of a programmable processor that enable such cycle efficiency.
For many IoT edge devices, low cost is a key requirement. Therefore, making IoT edge devices smarter by adding machine learning inference must be cost-effective. The main contributor to cost is silicon area, particularly for high-volume products, so it is important that the processor implementing the machine learning inference minimizes the logic area and uses small memories. In addition, small code size is key to limiting the area of the instruction memory.
Many IoT edge devices are battery-operated and have a tight power budget. This demands a power-efficient processor, measured in µW/MHz, as well as excellent cycle efficiency so that the processor can be run at a low frequency. Low power consumption is particularly important for IoT edge devices that perform always-on functions such as:
– smart speakers, smartphones, etc. with always-on voice command functions that are “always listening”;
– camera-based devices that are “always watching”, performing, for example, face detection or gesture recognition;
– health and fitness monitoring devices that are “always sensing”.
Such devices typically apply smart techniques to reduce power consumption. For example, an “always listening” device may sample the microphone signal and use simple voice detection techniques to check whether anyone is speaking at all. It then applies the more compute-intensive machine learning inference for recognizing voice commands only when voice activity is detected. A processor must limit power consumption in each of these different states, i.e. voice detection and voice command recognition. For this purpose, it must offer various power management features, including effective sleep modes and power-down modes.
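The two-state behavior described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the method of any particular product: the text says only that "simple voice detection techniques" are used, so the frame-energy threshold, the function names frame_energy and next_state, and the listen_state_t type are all hypothetical choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical states of an "always listening" device. */
typedef enum {
    STATE_VOICE_DETECTION,     /* cheap check, runs all the time      */
    STATE_COMMAND_RECOGNITION  /* compute-intensive inference, gated  */
} listen_state_t;

/* Mean energy of one frame of 16-bit microphone samples.
   Assumption: energy thresholding stands in for "simple voice detection". */
static uint32_t frame_energy(const int16_t *samples, size_t n)
{
    assert(n > 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t s = samples[i];
        acc += (uint64_t)(s * s);   /* |s| <= 32768, so s*s fits in int32_t */
    }
    return (uint32_t)(acc / n);
}

/* Select the state for the next frame: run the expensive recognizer
   only when the frame energy exceeds the (application-tuned) threshold. */
static listen_state_t next_state(const int16_t *frame, size_t n,
                                 uint32_t threshold)
{
    return frame_energy(frame, n) > threshold ? STATE_COMMAND_RECOGNITION
                                              : STATE_VOICE_DETECTION;
}
```

In such a scheme, the processor would spend most of its time in the cheap detection state, possibly sleeping between frames, and would raise its activity only for frames classified as containing speech.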
1.3.2. Processor capabilities for low-power machine learning inference
Selecting the right processor is key to achieving high efficiency for the implementation of low/mid-end machine learning inference. In this section, we will describe a number of key capabilities of the DSP-enhanced ARC EM9D processor and illustrate how they can be used to implement neural network processing efficiently.
As described earlier, the dot-product operation on input samples and weights is a dominant computation. The key primitive for implementing the dot product is the multiply-accumulate (MAC) operation, which can be used to incrementally sum up the products of input samples and weights. Vectorization of the MAC operations is an important way to increase the efficiency of neural network processing. Figure 1.5 illustrates two types of vector MAC instructions of the ARC EM9D processor.
Figure 1.5. Two types of vector MAC instructions of the ARC EM9D processor
Both of these vector MAC instructions operate on 2x16-bit vector operands. The DMAC instruction on the left is a dual-MAC that can be used to implement a dot product, with A1 and A2 being two neighboring samples from the input map and B1 and B2 being two neighboring weights from the weight kernel. The ARC EM9D processor supports 32-bit accumulators for which an additional eight guard bits can be enabled to avoid overflow. The DMAC operation can effectively be used for weight kernels with an even width, reducing the number of MAC instructions by a factor of two compared to a scalar implementation. However, for weight kernels with an odd width, this instruction is less effective. In such cases, the VMAC instruction, shown on the right in Figure 1.5, can be used to perform two dot-product operations in parallel, accumulating intermediate results into two accumulators. When the weight kernel “moves” over the input map with a stride of one, A1 and A2 are two neighboring samples from the input map, and B1 and B2 hold the same weight, which is applied to both A1 and A2.
Efficient execution of the dot-product operations requires not only proper vector MAC instructions, but also sufficient memory bandwidth to feed operands to these MAC instructions, as well as ways to avoid overhead for performing address updates, data size conversions, etc. For these purposes, the ARC EM9D processor provides XY memory with advanced address generation. Put simply, the XY architecture provides up to three logical memories that the processor can access concurrently, as illustrated in Figure 1.6. The processor can access memory through regular load and store instructions or enable a functional unit to perform memory accesses through address generation units (AGUs). An AGU can be set up with an address pointer to data in one of the memories and a prescription, or modifier, to update this address pointer in a particular way when a data access is performed through the AGU. After the setup, the AGUs can be used in instructions for directly accessing operands and storing results from/to memory. No explicit load or store instructions need to be executed for these operands and results. Typically, an AGU is set up before a software loop and then used repeatedly as data is traversed inside the loop.
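The difference between the two instructions can be made concrete with a small reference model in portable C. This is a sketch of the arithmetic only, not the processor's instruction set or intrinsics: the names v2i16_t, dmac, vmac, dot_even and conv_pair are hypothetical, and int64_t is used as a stand-in for a 32-bit accumulator with guard bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 2x16-bit vector operand, modeled as two signed 16-bit lanes. */
typedef struct { int16_t lane[2]; } v2i16_t;

/* Dual-MAC model: both lane products are added into ONE accumulator. */
static int64_t dmac(int64_t acc, v2i16_t a, v2i16_t b)
{
    return acc + (int32_t)a.lane[0] * b.lane[0]
               + (int32_t)a.lane[1] * b.lane[1];
}

/* Vector-MAC model: each lane product is added into its OWN accumulator. */
static void vmac(int64_t acc[2], v2i16_t a, v2i16_t b)
{
    acc[0] += (int32_t)a.lane[0] * b.lane[0];
    acc[1] += (int32_t)a.lane[1] * b.lane[1];
}

/* Dot product over an even number of elements: n/2 dual-MAC steps
   instead of n scalar MAC steps. */
static int64_t dot_even(const int16_t *x, const int16_t *w, size_t n)
{
    assert(n % 2 == 0);
    int64_t acc = 0;
    for (size_t i = 0; i < n; i += 2) {
        v2i16_t a = {{ x[i], x[i + 1] }};  /* two neighboring samples */
        v2i16_t b = {{ w[i], w[i + 1] }};  /* two neighboring weights */
        acc = dmac(acc, a, b);
    }
    return acc;
}

/* Two neighboring outputs of a stride-1 1D kernel of any width k:
   out[0] = sum_j x[j] * w[j],  out[1] = sum_j x[j + 1] * w[j].
   x must hold at least k + 1 samples. Takes k vector-MAC steps. */
static void conv_pair(const int16_t *x, const int16_t *w, size_t k,
                      int64_t out[2])
{
    out[0] = 0;
    out[1] = 0;
    for (size_t j = 0; j < k; j++) {
        v2i16_t a = {{ x[j], x[j + 1] }};  /* two neighboring samples     */
        v2i16_t b = {{ w[j], w[j] }};      /* one weight in both lanes    */
        vmac(out, a, b);
    }
}
```

In both cases, the work per output is about half that of a scalar loop: dot_even needs n/2 steps for one output, while conv_pair needs k steps for two outputs and places no constraint on whether k is odd or even.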
Figure 1.6. ARC EM9D processor with XY memory and address generation units
The AGUs support the following features relevant to machine learning inference:
– multiple modifiers per address pointer, which allow different schemes for address pointer updates to be prescribed and used. For example, a 2D access pattern can be supported by having one modifier prescribing a small horizontal stride within a row in the input map and another modifier prescribing a large stride to move the pointer to the next row in the input map;
– data size conversions, which allow, for example, 2x8-bit data to be expanded on the fly for use as a 2x16-bit vector operand. No extra instructions for unpacking and sign extension are required;
– replications, which allow data values to be replicated on the fly into vectors. For example, a single weight value may be replicated into a 2x16 vector for use in the VMAC instruction as discussed above.
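The three features listed above can be mimicked by a small software model in C. This is a hedged sketch for illustration only: the agu_t structure, its field names and the functions agu_read_2x8 and agu_read_rep8 are hypothetical and do not correspond to the processor's actual AGU registers or programming interface. On the processor, these steps are performed implicitly as part of the operand access rather than by separate instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 2x16-bit vector operand, modeled as two signed 16-bit lanes. */
typedef struct { int16_t lane[2]; } v2i16_t;

/* Hypothetical AGU model: one address pointer with two modifiers.
   Each modifier is a post-update step, counted in 8-bit elements. */
typedef struct {
    const int8_t *ptr;
    ptrdiff_t     mod[2];
} agu_t;

/* Data size conversion: read 2x8-bit data, sign-extend it to a
   2x16-bit vector, then update the pointer using modifier m. */
static v2i16_t agu_read_2x8(agu_t *agu, int m)
{
    assert(m == 0 || m == 1);
    v2i16_t v = {{ agu->ptr[0], agu->ptr[1] }};
    agu->ptr += agu->mod[m];
    return v;
}

/* Replication: read one 8-bit value, copy it into both 16-bit lanes,
   then update the pointer using modifier m. */
static v2i16_t agu_read_rep8(agu_t *agu, int m)
{
    assert(m == 0 || m == 1);
    v2i16_t v = {{ agu->ptr[0], agu->ptr[0] }};
    agu->ptr += agu->mod[m];
    return v;
}
```

As a usage example, consider an input map with rows of six 8-bit samples and a window covering the first four columns. Setting mod[0] = 2 (small horizontal stride within a row) and mod[1] = 4 (jump from column 2 of one row to column 0 of the next) lets a loop alternate between the two modifiers to walk the window row by row, while a second AGU with mod[0] = 1 delivers each weight replicated into both lanes.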
In summary, the use of XY memory and AGUs enables very efficient code as no instructions are needed to load and store data, perform pointer math, or convert and rearrange data. All of these are performed implicitly while accessing data through the AGUs, with up to three memory accesses per cycle. In the next section, we present code examples that illustrate the use of the processor’s XY memory and AGUs for machine learning inference.
Most other embedded processors have to issue explicit load and store instructions to perform accesses to memory. In a single-issue processor, the execution of these instructions may consume a significant portion of the available cycles, effectively reducing the throughput in MACs/cycle. Multi-issue processors, such as VLIW processors, aim to perform the load and store operations in parallel with compute operations (such as the MACs) to increase throughput. However, since wide instructions have to be used, this comes at the price of larger code size and higher power consumption in the instruction memory.
1.3.3. A software library for machine learning inference
After selecting the right processor, the next question is how to arrive at an efficient