Table 1.2. Supported kernels in the embARC MLI library
Group | Kernels | Description |
Convolution | 2D convolution; depthwise 2D convolution | Convolve input features with a set of trained weights |
Pooling | Average pooling; max pooling | Pool input features with a function |
Recurrent and Core | Basic RNN; Long Short-Term Memory (LSTM); fully connected | Recurrent and fully connected kernels |
Transform (Activation) | ReLU (Relu1, Relu6, Leaky ReLU, ...); Sigmoid and TanH; SoftMax | Transform each element of input according to a particular non-linear function |
Element-wise | Addition; subtraction; multiplication; maximum; minimum | Apply multi-operand function element-wise to several inputs |
Data Manipulation | Permute; concatenation; padding 2D | Move or extend input data by a specified pattern |
The embARC machine learning inference (MLI) library (embARC Open Software Platform 2019) is a set of C functions and associated data structures for building neural networks. Library functions, also called kernels, implement the processing associated with a complete layer in a neural network, with multidimensional input/output maps represented as tensor structures. Thus, a neural network graph can be implemented as a series of MLI function calls. Table 1.2 lists the currently supported MLI kernels, organized into six groups. For each kernel, there can be multiple functions in the library, including functions specifically optimized to support particular data types, weight kernel sizes and strides. In addition, helper functions are provided; these include operations such as data type conversion, data pointing and other operations not directly involved in the kernel functionality.
A notable feature of the embARC MLI library is the explicit support for recurrent layers. Sequential data series from sensors such as microphones or accelerometers are frequently used in the IoT domain. As explained above, RNNs maintain the state while processing sequences of inputs, thereby having the ability to recognize patterns across time. They have proven their effectiveness and are widely used in state-of-the-art solutions, such as speech applications (Amodei et al. 2016).
Data representation is an important part of the MLI definition. Despite the large variety of kernels, the data directly involved in the calculations has common properties. Deep learning frameworks therefore typically employ a unified data representation. For example, TensorFlow works with tensors, while Caffe uses similar objects called “blobs”. Such objects represent multidimensional arrays of the proper sizes and may include additional data, such as links to related objects, synchronization primitives, etc. Similarly, mli_tensor is a universal data container in MLI. It is a lightweight tensor that contains the necessary elements for describing the data: a pointer to the data buffer, its capacity, shape, rank and data format-specific values. A kernel may take multiple tensors as inputs for producing an output tensor. For example, the 2D convolution kernel has three input tensors: the input map, the weights and the biases. All kernel-specific parameters, such as stride and padding for a convolution kernel, or the new order of dimensions for a permute kernel, are grouped into configuration structures. This data representation has several advantages:
– it provides a simple and clear interface for the functions;
– it allows the same interface to be used for several versions of one kernel;
– it matches well with layered neural networks, enabling ease of use and natural library extensibility.
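The data representation described above can be sketched in C as follows. This is a minimal illustration, not the library's actual definition: the type names (sketch_tensor, sketch_conv2d_cfg), field names, the rank limit and the helper functions are all hypothetical and chosen only to mirror the elements listed in the text (data pointer, capacity, shape, rank and format-specific values).

```c
#include <stdint.h>

#define SKETCH_MAX_RANK 4  /* assumed upper bound on tensor rank */

/* Hypothetical element formats; names are illustrative only. */
typedef enum { SKETCH_FX8, SKETCH_FX16 } sketch_el_type;

/* Hypothetical lightweight tensor: it describes a buffer, it does not own it. */
typedef struct {
    void          *data;                    /* pointer to the data buffer */
    uint32_t       capacity;                /* size of the buffer in bytes */
    uint32_t       shape[SKETCH_MAX_RANK];  /* size of each dimension */
    uint32_t       rank;                    /* number of dimensions in use */
    sketch_el_type el_type;                 /* element data format */
    uint8_t        frac_bits;               /* format-specific value */
} sketch_tensor;

/* Hypothetical configuration structure grouping convolution parameters. */
typedef struct {
    uint8_t stride_height, stride_width;
    uint8_t padding_top, padding_bottom, padding_left, padding_right;
} sketch_conv2d_cfg;

/* Number of elements described by the shape. */
static uint32_t sketch_count_elements(const sketch_tensor *t)
{
    uint32_t n = 1;
    for (uint32_t i = 0; i < t->rank; ++i)
        n *= t->shape[i];
    return n;
}

/* Size in bytes of one element of the tensor. */
static uint32_t sketch_element_size(const sketch_tensor *t)
{
    return (t->el_type == SKETCH_FX16) ? 2u : 1u;
}

/* Returns 1 if the rank is valid and the described data fits in the buffer. */
static int sketch_tensor_fits(const sketch_tensor *t)
{
    return t->rank <= SKETCH_MAX_RANK &&
           sketch_count_elements(t) * sketch_element_size(t) <= t->capacity;
}
```

With such a layout, a kernel needs only tensor pointers and one configuration pointer as arguments, which is what allows several optimized versions of a kernel to share one interface.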
The embARC MLI library uses signed fixed-point data types based on the Q-notation (Q Number Format 2019). Tensor structures with 8-bit and 16-bit data types are supported, both for input/output data and weights. When building an application, we should select the data types that provide sufficiently accurate results for each layer in the neural network. In addition to supporting kernels with the same data type (either 16-bit or 8-bit) for both data and weights, the MLI library also provides kernels with 16-bit data and 8-bit weights, in order to provide more flexibility for accuracy-vs-memory trade-offs.
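To make the Q-notation concrete, the following sketch converts real values to signed fixed-point values with a given number of fractional bits, rounding to nearest and saturating at the limits of the container type. The function names (to_fx16, to_fx8, fx_to_float) are hypothetical helpers written for this illustration; they are not taken from the MLI library.

```c
#include <stdint.h>

/* Scale x by 2^frac_bits, round to nearest and saturate to [lo, hi]. */
static int32_t round_and_saturate(float x, int frac_bits, int32_t lo, int32_t hi)
{
    float scaled = x * (float)(1 << frac_bits);
    if (scaled >= (float)hi) return hi;
    if (scaled <= (float)lo) return lo;
    /* Adding +/-0.5 before truncation rounds half away from zero. */
    return (int32_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Real value to 16-bit fixed point, e.g. frac_bits = 15 for Q15. */
static int16_t to_fx16(float x, int frac_bits)
{
    return (int16_t)round_and_saturate(x, frac_bits, INT16_MIN, INT16_MAX);
}

/* Real value to 8-bit fixed point, e.g. frac_bits = 7 for Q7. */
static int8_t to_fx8(float x, int frac_bits)
{
    return (int8_t)round_and_saturate(x, frac_bits, INT8_MIN, INT8_MAX);
}

/* Fixed-point value back to a real value. */
static float fx_to_float(int32_t q, int frac_bits)
{
    return (float)q / (float)(1 << frac_bits);
}
```

For example, 0.5 in Q15 is stored as 16384, while 1.0 is not representable in Q15 and saturates to 32767. Choosing the number of fractional bits per layer is one way to balance range against precision.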
The AGUs of the ARC EM9D processor support complex memory access patterns without spending cycles on load and store instructions. We illustrate the benefits of the AGUs with a code example of a fully connected layer. Each output value is calculated using a dot-product operation. We consider the case where inputs are vectors of 16-bit values and weights are vectors of 8-bit values. Typically, this requires extending the weight operands to 16-bit values, which takes additional processing cycles inside the loop. The ARCv2DSP instruction set architecture (ISA) has several dual-MAC instructions, which perform two multiply-accumulate operations in a single cycle (see the left diagram in Figure 1.5). These include 16x8 dual-MAC instructions where one operand is a 2x16-bit vector and the other operand is a 2x8-bit vector, which allows direct use of the 8-bit data. The assembly code in Figure 1.7, generated from high-level C code, shows that this yields a zero-overhead loop with a loop body of just one DMAC instruction and a throughput of two 16x8 MACs/cycle. In addition to the dual MAC operation, the DMAC instruction also performs two memory reads and two pointer updates through the two AGUs.
Figure 1.7. Assembly code generated from MLI C-code for a fully connected layer with 16-bit input data and 8-bit weights
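The computation behind this example can be written in portable C as shown below. This is a reference sketch under our own assumptions, not the MLI source code and not ARC-specific code: the function names, the row-major weight layout, the 32-bit bias in accumulator format and the single output shift are all hypothetical. The inner loop processes two input/weight pairs per iteration, which is the pairing that a 16x8 dual-MAC instruction would execute in one cycle.

```c
#include <stdint.h>

/* Saturate a wide accumulator value to the 16-bit output range. */
static int16_t saturate16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/*
 * Fully connected layer with 16-bit inputs and 8-bit weights.
 * weights holds out_size rows of in_size values (assumed row-major layout).
 * bias is given in accumulator format; out_shift rescales the result.
 */
static void fully_connected_16x8(const int16_t *in, const int8_t *weights,
                                 const int32_t *bias, int16_t *out,
                                 int in_size, int out_size, int out_shift)
{
    for (int o = 0; o < out_size; ++o) {
        const int8_t *w = weights + o * in_size;
        int64_t acc = bias[o];  /* wide accumulator avoids overflow */
        int i = 0;
        /* Two MACs per iteration; the 8-bit weights are used directly. */
        for (; i + 1 < in_size; i += 2) {
            acc += (int32_t)in[i] * w[i];
            acc += (int32_t)in[i + 1] * w[i + 1];
        }
        if (i < in_size)  /* leftover element for odd in_size */
            acc += (int32_t)in[i] * w[i];
        /* Assumes an arithmetic right shift for negative values. */
        out[o] = saturate16(acc >> out_shift);
    }
}
```

In this portable form, the compiler has to widen each weight and issue separate loads and pointer increments. On the ARC EM9D, these steps are absorbed by the DMAC instruction and the AGUs, as described above.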
Similar optimizations using dual-MAC instructions are used for other kernels, including RNN kernels and some convolution kernels. However, this approach does not suit all cases. As an example, we consider a depthwise convolution for a channel–height–width (CHW) data layout with a 3x3 weight kernel size. This type of layer applies a 2D convolution to each input channel separately. Dual-MAC instructions cannot be used optimally here, due to the odd weight kernel size and the short MAC series for calculating an output value. If the convolution stride parameter is equal to 1, then neighboring output values are calculated from neighboring input data elements. This implies that we can calculate two output values simultaneously using VMAC instructions that use two accumulators (see the right diagram in Figure 1.5), as shown in the generated assembly code in Figure 1.8. Each VMAC instruction performs two 16x16 MACs, two memory reads, sign extension and replication of the 8-bit value accessed through AGU 2, and two pointer updates.
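The idea of calculating two neighboring output values with two accumulators can be sketched in portable C as follows. This is an illustration under our own assumptions, not the MLI implementation: the function names are hypothetical, no padding is applied, the stride is fixed to 1 and a single output shift is used. Each weight is read once, sign-extended and applied to two adjacent input elements, which corresponds to the replication of the weight across the two lanes of a vector MAC.

```c
#include <stdint.h>

/* Clip a wide accumulator value to the 16-bit output range. */
static int16_t clip16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* 3x3 convolution of one channel, stride 1, no padding. */
static void depthwise3x3_channel(const int16_t *in, const int8_t *w,
                                 int16_t *out, int in_h, int in_w, int shift)
{
    int out_h = in_h - 2, out_w = in_w - 2;
    for (int y = 0; y < out_h; ++y) {
        for (int x = 0; x < out_w; x += 2) {
            int pair = (x + 1 < out_w);  /* 0 for the last odd column */
            int64_t acc0 = 0, acc1 = 0;  /* two accumulators, two outputs */
            for (int ky = 0; ky < 3; ++ky) {
                const int16_t *row = in + (y + ky) * in_w + x;
                for (int kx = 0; kx < 3; ++kx) {
                    int32_t wk = w[ky * 3 + kx];  /* sign-extended weight */
                    acc0 += wk * row[kx];
                    if (pair)
                        acc1 += wk * row[kx + 1];  /* same weight, next input */
                }
            }
            /* Assumes an arithmetic right shift for negative values. */
            out[y * out_w + x] = clip16(acc0 >> shift);
            if (pair)
                out[y * out_w + x + 1] = clip16(acc1 >> shift);
        }
    }
}

/* Depthwise convolution over a CHW tensor: each channel has its own 3x3 weights. */
static void depthwise3x3_chw(const int16_t *in, const int8_t *w, int16_t *out,
                             int channels, int in_h, int in_w, int shift)
{
    int out_size = (in_h - 2) * (in_w - 2);
    for (int c = 0; c < channels; ++c)
        depthwise3x3_channel(in + c * in_h * in_w, w + c * 9,
                             out + c * out_size, in_h, in_w, shift);
}
```

Note that the input is read in place with a 2D access pattern (a step along the row, then a jump to the next row), so no rearranged copy of the input is needed.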
This example demonstrates the flexibility of AGUs for complex data addressing patterns, including 2D accesses using two modifiers for the input data as well as sign extension and replication of weights. A typical approach for calculating convolution layers, for example, as popularized by Caffe, is to use additional image-to-column (im2col) transformations. Although such transformations are helpful on some processors because they simplify the subsequent convolution calculations, they come at the price of significant overhead. The advanced AGUs, as used in Figure 1.8, make these transformations obsolete, thereby supporting efficient