1.5.3. Measurements for low-end and high-end use cases
To analyze GFDM kernel performance for the high-end and low-end workloads identified by the standard analysis in section 1.2 and Table 1.1, we need a vDSP architecture on which to execute the GFDM kernel.
Referring to Figure 1.13, we designed a cycle-accurate simulation model of the vDSP in the ASIP Designer tool suite (Synopsys Inc 2019). The tool comes with a C compiler and a debugging and profiling environment, which allowed us to measure realistic cycle counts for the implementations of the GFDM variants. In an iterative process, we tuned the prototype, refining the HW architecture, instruction set and VLIW configuration and editing the functional units of the vDSP, in parallel with the development of the black and blue vector codes. We arrived at an adequate processor and correspondingly optimized black and blue code variants. Based on the considerations thus far, we settled on 512-bit SIMD (vector) processing, 16-bit complex data types, 40-bit complex accumulators and a 1 GHz clock.
We use the theoretical minimum number of operations as the baseline for a fair comparison between the code variants and their utilization of processor resources. We call this metric the implementation execution density ρ and define it as the ratio of the theoretical minimum number of operations to the measured cycles. The theoretical minimum includes a) only general/standard arithmetic or logical operations (not fused/composite operations that combine several into one, with MAC as the only exception), and b) memory accesses. The theoretical minimum of vector operations depends on the implementation variant, i.e. on the combination of loop order and vectorization.
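The definition of ρ can be illustrated with a minimal Python sketch. The function name and all operation and cycle counts below are hypothetical placeholders chosen for illustration; they are not values from Table 1.3 or Table 1.4.

```python
def execution_density(min_theoretical_ops, measured_cycles):
    """Implementation execution density: minimum theoretical ops per measured cycle."""
    if measured_cycles <= 0:
        raise ValueError("measured_cycles must be positive")
    return min_theoretical_ops / measured_cycles


# Hypothetical numbers for illustration only (not taken from the tables).
measured_cycles = 500    # cycle count reported by the profiler
min_vector_ops = 600     # variant-specific theoretical minimum of vector operations
min_scalar_ops = 16_000  # theoretical minimum of scalar operations

rho_v = execution_density(min_vector_ops, measured_cycles)
rho_s = execution_density(min_scalar_ops, measured_cycles)
print(f"rho_v = {rho_v:.2f}, rho_s = {rho_s:.2f}")
```

With these placeholder inputs, the vector density exceeds 1 because more than one counted operation completes per cycle on average, which is only possible when several VLIW issue slots are kept busy at once.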
Table 1.4 gives the GFDM kernel profile. The variant-specific minimum operations used in the ρ calculation are taken from Table 1.3. From Table 1.4, we note the following: a) the total memory accesses and cycle counts differ between the two code variants as a consequence of their different loop ordering and loop vectorization; b) values of ρv > 1 are possible when the low cycle count (black) variant is optimized through careful code scheduling on the VLIW architecture of the processor, which has multiple instruction issue slots with 512-bit SIMD capability per slot, and through efficient pipelining for the selected loop order and vectorization; and c) ρs is the ratio between the theoretical best-case scalar execution and our kernel, and can be interpreted as the speed-up over the optimal theoretical scalar processor implementation.
Table 1.4. Kernel profile: cycles, memory accesses, and density
As the last step in our investigation, Table 1.5 shows the frequency budget required for the corner cases. To form the budget requirements, we take into account the cycle counts from Table 1.4, as well as the number of kernel calls, the kernel parameters and the deadlines from Table 1.2. Finally, we put this budget requirement in the context of our vDSP, in particular how large a portion of the total vDSP frequency budget is needed to meet the requirements of each use case. A utilization above 100% means that more than one vDSP core is needed.
Table 1.5. GFDM: required frequency budget and performance on our vDSP
Use case | Metric | Black | Blue
low-end LTE legacy | required budget [MHz] | 1.01 | 5.39
low-end LTE legacy | our vDSP: processing time [µs] | 0.504 | 2.695
low-end LTE legacy | our vDSP: utilization [%] | 0.10 | 0.54
low-end LTE legacy | our vDSP: min. vDSPs to run [#] | 1 | 1
CA high-end FR2 | required budget [MHz] | 921.2 | 5,505.5
CA high-end FR2 | our vDSP: processing time [µs] | 57.58 | 344.09
CA high-end FR2 | our vDSP: utilization [%] | 92.12 | 550.55
CA high-end FR2 | our vDSP: min. vDSPs to run [#] | 1 | 6
MIMO CA high-end FR2 | required budget [GHz] | 7.37 | 44.04
MIMO CA high-end FR2 | our vDSP: processing time [µs] | 460.6 | 2,752.7
MIMO CA high-end FR2 | our vDSP: utilization [%] | 736.95 | 4,404.38
MIMO CA high-end FR2 | our vDSP: min. vDSPs to run [#] | 8 | 45
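The arithmetic behind the rows of Table 1.5 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function name is hypothetical, and the example inputs (57,575 cycles, one kernel call, a 62.5 µs deadline) are assumptions back-calculated from the CA high-end black row, since Table 1.2 and Table 1.4 are not reproduced here.

```python
import math

VDSP_CLOCK_MHZ = 1000.0  # 1 GHz vDSP clock, as stated in the text


def frequency_budget(cycles_per_call, kernel_calls, deadline_us,
                     clock_mhz=VDSP_CLOCK_MHZ):
    """Return (required MHz, processing time in µs, utilization in %, min. cores)."""
    total_cycles = cycles_per_call * kernel_calls
    required_mhz = total_cycles / deadline_us        # cycles per µs equals MHz
    processing_time_us = total_cycles / clock_mhz    # time on one core at clock_mhz
    utilization_pct = 100.0 * required_mhz / clock_mhz
    min_cores = max(1, math.ceil(required_mhz / clock_mhz))
    return required_mhz, processing_time_us, utilization_pct, min_cores


# Assumed inputs, back-calculated from the CA high-end black row (not from Table 1.2).
required, time_us, util, cores = frequency_budget(
    cycles_per_call=57_575, kernel_calls=1, deadline_us=62.5)
print(f"{required:.1f} MHz, {time_us:.2f} us, {util:.2f} %, {cores} core(s)")
```

Under these assumptions the sketch yields roughly 921.2 MHz, 57.58 µs, 92.12% and one core, in line with the CA high-end black column. The core count is the utilization rounded up to the next integer, so any value above 100% requires at least two cores.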
The results support the discussion in section 1.2: the low-end use cases can run almost effortlessly in parallel with other kernels and tasks on a vDSP. Surprisingly, even the CA high-end black GFDM fits on a single vDSP core and meets the deadline, albeit at a heavy load of about 92% utilization. Since there are several vDSPs on the MPSoC, running this modulation flavor is an option to consider, provided, of course, that the memory bandwidth allows using the black flavor. Finally, as expected, it is more practical to use HW accelerator engines for the MIMO CA high-end use case than many fully loaded vDSP cores.
1.6. Conclusion
This chapter closely followed an SW implementation of the GFDM algorithm on a SotA vDSP and noted the considerations taken into account with regard to the handset workloads expected in modern and future mobile communications. We gave analyses and conclusions on four layers: specification requirements, translation of theory to pseudo-code, precision analysis and requirements, and implementation space exploration.
First,