Multi-Processor System-on-Chip 1. Liliana Andrade.

Author: Liliana Andrade
Publisher: John Wiley & Sons Limited
ISBN: 9781119818281
      Figure 1.8. Assembly code generated from MLI C-code for 2D convolution of 16-bit input data and 8-bit weights

      From the user’s point of view, the embARC MLI library provides ease of use, allowing the construction of efficient machine learning inference engines without requiring in-depth knowledge of the processor architecture and software optimization details. The embARC MLI library provides a broad set of optimized functions, so that the user can concentrate on the application and write embedded code using familiar high-level constructs for machine learning inference.

      1.3.4. Example machine learning applications and benchmarks

      The embARC MLI library is available from embarc.org (embARC Open Software Platform 2019), together with a number of example applications that demonstrate the usage of the library, such as:

       – CIFAR-10 low-resolution object classifier: CNN graph;

       – face detection: CNN graph;

       – human activity recognition (HAR): LSTM-based network;

       – keyword spotting: graph with CNN and LSTM layers trained on the Google speech command dataset.

      Figure 1.9. CNN graph of the CIFAR-10 example application

      We used the CIFAR-10 example application, with 8-bit precision for both feature data and weights, to benchmark the performance of machine learning inference on the ARC EM9D processor. The code of this CIFAR-10 application, built using the embARC MLI library, is illustrated in Figure 1.10.

      Figure 1.10. MLI code of the CIFAR-10 inference application

      As the code in Figure 1.10 shows, each layer in the graph is implemented by calling a function from the embARC MLI library. Before executing the first convolution layer, we call a permute function from the embARC MLI library to transform the RGB image into CHW format so that neighboring data elements are from the same color plane. The code further shows that a ping-pong scheme with two buffers, ir_X and ir_Y, is used for buffering input and output maps.
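      The two ideas in this paragraph, the HWC-to-CHW permutation and the ping-pong buffering, can be sketched in a few lines of plain Python. This is only an illustration of the data movement, not the embARC MLI API: the function names (hwc_to_chw, run_ping_pong, double_layer, pool2_layer) and the toy 1-D layers are hypothetical, and buf_x and buf_y merely stand in for the ir_X and ir_Y buffers of Figure 1.10.

```python
def hwc_to_chw(image, height, width, channels):
    """Reorder a flat HWC buffer (interleaved RGB) into CHW (one plane per color)."""
    out = [0] * (height * width * channels)
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                src_index = (h * width + w) * channels + c
                dst_index = c * height * width + h * width + w
                out[dst_index] = image[src_index]
    return out


def run_ping_pong(layers, data, capacity):
    """Run layers in sequence, alternating between two preallocated buffers.

    Each layer is a callable layer(src, n, dst) that reads src[:n],
    writes its result to the start of dst and returns the output length.
    """
    buf_x = [0] * capacity  # stand-in for ir_X
    buf_y = [0] * capacity  # stand-in for ir_Y
    buf_x[:len(data)] = data
    src, dst, n = buf_x, buf_y, len(data)
    for layer in layers:
        n = layer(src, n, dst)
        src, dst = dst, src  # the output of this layer is the input of the next
    return src[:n]


# Two toy layers (hypothetical, for demonstration only).
def double_layer(src, n, dst):
    for i in range(n):
        dst[i] = 2 * src[i]
    return n


def pool2_layer(src, n, dst):
    """1-D max pooling with window 2 and stride 2."""
    for i in range(n // 2):
        dst[i] = max(src[2 * i], src[2 * i + 1])
    return n // 2


# A 1 x 2 RGB image stored as HWC: pixel 0 = (1, 2, 3), pixel 1 = (4, 5, 6).
planes = hwc_to_chw([1, 2, 3, 4, 5, 6], height=1, width=2, channels=3)
result = run_ping_pong([double_layer, pool2_layer], [1, 5, 3, 2], capacity=4)
```

      In this sketch only two buffers are ever allocated, each sized for the largest intermediate map, regardless of the number of layers in the graph.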

      Table 1.3. Model parameters of the CIFAR-10 CNN graph

 # | Layer type      | Weights tensor shape | Output tensor shape | Coefficients
 0 | Permute         | -                    | 3 × 32 × 32         | 0
 1 | Convolution     | 32 × 3 × 5 × 5       | 32 × 32 × 32 (32K)  | 2400
 2 | Max Pooling     | -                    | 32 × 16 × 16 (8K)   | 0
 3 | Convolution     | 32 × 32 × 5 × 5      | 32 × 16 × 16 (8K)   | 25600
 4 | Avg Pooling     | -                    | 32 × 8 × 8 (2K)     | 0
 5 | Convolution     | 64 × 32 × 5 × 5      | 64 × 8 × 8 (4K)     | 51200
 6 | Avg Pooling     | -                    | 64 × 4 × 4 (1K)     | 0
 7 | Fully-connected | 64 × 1024            | 64                  | 65536
 8 | Fully-connected | 10 × 64              | 10                  | 640
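
      The coefficient counts and output sizes in Table 1.3 follow directly from the tensor shapes, which the short Python sketch below recomputes. The multiply-accumulate (MAC) estimate is an addition that is not in the table: it assumes one MAC per weight per output element for convolutions and one MAC per weight for fully connected layers, and it ignores pooling and permute operations. All names in the sketch are illustrative.

```python
from math import prod

# (layer type, weights shape or None, output shape), as listed in Table 1.3.
LAYERS = [
    ("Permute", None, (3, 32, 32)),
    ("Convolution", (32, 3, 5, 5), (32, 32, 32)),
    ("Max Pooling", None, (32, 16, 16)),
    ("Convolution", (32, 32, 5, 5), (32, 16, 16)),
    ("Avg Pooling", None, (32, 8, 8)),
    ("Convolution", (64, 32, 5, 5), (64, 8, 8)),
    ("Avg Pooling", None, (64, 4, 4)),
    ("Fully-connected", (64, 1024), (64,)),
    ("Fully-connected", (10, 64), (10,)),
]


def coefficients(weights):
    """Number of weight coefficients: the product of the weights tensor shape."""
    return prod(weights) if weights else 0


def macs(kind, weights, output):
    """Estimated multiply-accumulates for one layer (assumption stated above)."""
    if kind == "Convolution":
        _out_ch, in_ch, k_h, k_w = weights
        return prod(output) * in_ch * k_h * k_w
    if kind == "Fully-connected":
        return prod(weights)
    return 0


coefficient_counts = [coefficients(w) for _, w, _ in LAYERS]
output_sizes = [prod(out) for _, _, out in LAYERS]
total_coefficients = sum(coefficient_counts)

conv_macs = sum(macs(k, w, o) for k, w, o in LAYERS if k == "Convolution")
all_macs = sum(macs(k, w, o) for k, w, o in LAYERS)
conv_share = conv_macs / all_macs  # fraction of MACs spent in 5 x 5 convolutions
```

      Under these assumptions the three convolution layers account for about 12.3 million MACs, well over 99% of the MACs in the graph, while the two fully connected layers hold almost half of the 145376 coefficients.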

      The performance data for processor A is published in (Lai et al. 2018) in terms of milliseconds for a processor running at a clock frequency of 216 MHz. The cycle counts for processor A in Table 1.4 have been calculated by multiplying the published millisecond numbers by this clock frequency. The CIFAR-10 CNN graph reported in (Lai et al. 2018) has the same convolution and pooling layers as listed in Table 1.3, but uses a single fully connected layer with a 4 × 4 × 64 × 10 filter shape to directly transform the 64 × 4 × 4 input map into 10 output values. This modification of the Caffe CNN graph reduces the size of the weight data considerably, but requires retraining of the graph. The impact on the total cycle count is marginal.
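
      Both calculations in this paragraph are simple enough to sketch in Python. The conversion function below turns a latency in milliseconds into a cycle count at a given clock frequency, and the second part compares the coefficient count of the two fully connected layers in Table 1.3 with that of the single 4 × 4 × 64 × 10 layer. The 1.0 ms latency is a made-up placeholder for demonstration, not a published measurement, and all names are illustrative.

```python
def ms_to_cycles(milliseconds, clock_mhz):
    """Convert a latency to clock cycles: 1 MHz equals 1000 cycles per millisecond."""
    return milliseconds * clock_mhz * 1000


# Hypothetical latency of 1.0 ms at the 216 MHz clock mentioned in the text.
example_cycles = ms_to_cycles(1.0, 216)

# Fully connected coefficients: two layers (Table 1.3) versus one merged layer.
two_layer_fc = 64 * 1024 + 10 * 64   # 64 x 1024 followed by 10 x 64
single_layer_fc = 4 * 4 * 64 * 10    # maps the 64 x 4 x 4 input straight to 10 outputs
saved_coefficients = two_layer_fc - single_layer_fc
```

      With these shapes the single-layer variant needs 10240 fully connected coefficients instead of 66176, which is consistent with the statement that the weight data shrinks considerably while the convolution workload is unchanged.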

      The performance data for the RISC-V processor published in (Croome 2018) reports a total of 1.5 Mcycles for executing the CIFAR-10 graph on a highly parallel 8-core RISC-V architecture. To calculate the total number of cycles on a single RISC-V core, we consider that the performance is dominated by the cycles spent on 5 × 5 convolutions, which constitute more than 98% of the compute operations in this graph. For these 5 × 5 convolutions, (Croome 2018) reports a speed-up from a 1-core system to an 8-core system of 18.5/2.2 = 8.2. Hence,