Multi-Processor System-on-Chip 1 - Liliana Andrade
1.3.2. Processor capabilities for low-power machine learning inference
Selecting the right processor is key to achieving high efficiency for the implementation of low/mid-end machine learning inference. In this section, we will describe a number of key capabilities of the DSP-enhanced ARC EM9D processor and illustrate how they can be used to implement neural network processing efficiently.
As described earlier, the dot-product operation on input samples and weights is a dominant computation. The key primitive for implementing the dot product is the multiply-accumulate (MAC) operation, which can be used to incrementally sum up the products of input samples and weights. Vectorization of the MAC operations is an important way to increase the efficiency of neural network processing. Figure 1.5 illustrates two types of vector MAC instructions of the ARC EM9D processor.
Figure 1.5. Two types of vector MAC instructions of the ARC EM9D processor
Both of these vector MAC instructions operate on 2x16-bit vector operands. The DMAC instruction on the left is a dual MAC that can be used to implement a dot product, with A1 and A2 being two neighboring samples from the input map and B1 and B2 being two neighboring weights from the weight kernel. The ARC EM9D processor supports 32-bit accumulators, for which an additional eight guard bits can be enabled to avoid overflow. The DMAC operation is effective for weight kernels with an even width, where it halves the number of MAC instructions compared to a scalar implementation. For weight kernels with an odd width it is less effective, because the last weight in each kernel row has no neighbor to pair with. In such cases, the VMAC instruction, shown on the right in Figure 1.5, can be used to perform two dot-product operations in parallel, accumulating intermediate results into two accumulators. When the weight kernel moves over the input map with a stride of one, A1 and A2 are two neighboring samples from the input map, and B1 and B2 both hold the same weight, which is applied to both A1 and A2.
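The difference between the two instructions can be sketched in plain C. This is only a behavioral model, not ARC EM9D code: the names `vec2x16`, `dmac`, `vmac`, `dot_even` and `conv_pair` are hypothetical, the accumulators are ordinary 32-bit integers, and the optional guard bits are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Plain-C stand-in for a 2x16-bit vector operand (hypothetical type). */
typedef struct { int16_t lo, hi; } vec2x16;

/* Model of the dual MAC: both products are added into ONE accumulator. */
static int32_t dmac(int32_t acc, vec2x16 a, vec2x16 b) {
    return acc + (int32_t)a.lo * b.lo + (int32_t)a.hi * b.hi;
}

/* Model of the vector MAC: each lane updates its OWN accumulator. */
static void vmac(int32_t acc[2], vec2x16 a, vec2x16 b) {
    acc[0] += (int32_t)a.lo * b.lo;
    acc[1] += (int32_t)a.hi * b.hi;
}

/* Dot product over an even number of elements: n/2 dual-MAC steps. */
int32_t dot_even(const int16_t *x, const int16_t *w, int n) {
    assert(n % 2 == 0);
    int32_t acc = 0;
    for (int i = 0; i < n; i += 2) {
        vec2x16 a = { x[i], x[i + 1] };   /* two neighboring samples */
        vec2x16 b = { w[i], w[i + 1] };   /* two neighboring weights */
        acc = dmac(acc, a, b);
    }
    return acc;
}

/* Two neighboring outputs of a stride-1 kernel of any width k:
 *   out[0] = sum_j x[j]     * w[j]
 *   out[1] = sum_j x[j + 1] * w[j]
 * Takes k vector-MAC steps; x must hold at least k + 1 samples. */
void conv_pair(const int16_t *x, const int16_t *w, int k, int32_t out[2]) {
    out[0] = 0;
    out[1] = 0;
    for (int j = 0; j < k; j++) {
        vec2x16 a = { x[j], x[j + 1] };   /* two neighboring samples */
        vec2x16 b = { w[j], w[j] };       /* same weight in both lanes */
        vmac(out, a, b);
    }
}
```

In this model, an odd kernel width costs nothing extra with `conv_pair`, since each of the k steps still performs two useful multiplications, whereas `dot_even` cannot pair the last element of an odd-width row.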
Efficient execution of the dot-product operations requires not only proper vector MAC instructions, but also sufficient memory bandwidth to feed operands to these MAC instructions, as well as ways to avoid overhead for performing address updates, data size conversions, etc. For these purposes, the ARC EM9D processor provides XY memory with advanced address generation. Put simply, the XY architecture provides up to three logical memories that the processor can access concurrently, as illustrated in Figure 1.6. The processor can access memory through regular load and store instructions, or it can enable a functional unit to perform memory accesses through address generation units (AGUs). An AGU can be set up with an address pointer to data in one of the memories and a prescription, or modifier, that updates this address pointer in a particular way whenever a data access is performed through the AGU. After the setup, the AGUs can be used in instructions to read operands directly from memory and to write results directly to memory. No explicit load or store instructions need to be executed for these operands and results. Typically, an AGU is set up before a software loop and then used repeatedly as data is traversed inside the loop.
Figure 1.6. ARC EM9D processor with XY memory and address generation units
The AGUs support the following features relevant to machine learning inference:
– multiple modifiers per address pointer, which allow different schemes for address pointer updates to be prescribed and used. For example, a 2D access pattern can be supported by having one modifier prescribing a small horizontal stride within a row in the input map and another modifier prescribing a large stride to move the pointer to the next row in the input map;
– data size conversions, which allow, for example, 2x8-bit data to be expanded on the fly for use as a 2x16-bit vector operand. No extra instructions for unpacking and sign extension are required;
– replications, which allow data values to be replicated on the fly into vectors. For example, a single weight value may be replicated into a 2x16 vector for use in the VMAC instruction as discussed above.
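The three features listed above can be illustrated with a small software model in C. This is a sketch under stated assumptions, not the processor's programming interface: `agu_model`, `agu_read_2x8`, `agu_read_rep8` and `window_dot` are hypothetical names, the address pointer is modeled as an offset from a base address, and the modifiers are modeled as simple post-increments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int16_t lo, hi; } vec2x16;

/* Hypothetical model of one AGU: an address (base + offset) and two
 * modifiers, each a post-increment applied after an access. */
typedef struct {
    const int8_t *base;
    ptrdiff_t off;
    ptrdiff_t mod[2];
} agu_model;

/* Size conversion: read 2x8-bit data, sign-extend it to 2x16-bit,
 * then update the address with modifier m. */
vec2x16 agu_read_2x8(agu_model *agu, int m) {
    vec2x16 v = { agu->base[agu->off], agu->base[agu->off + 1] };
    agu->off += agu->mod[m];
    return v;
}

/* Replication: read one 8-bit value and copy it into both lanes. */
vec2x16 agu_read_rep8(agu_model *agu, int m) {
    vec2x16 v = { agu->base[agu->off], agu->base[agu->off] };
    agu->off += agu->mod[m];
    return v;
}

/* Dot product of a kh x kw kernel (kw even) with a window of an 8-bit
 * input map whose rows are row_stride elements apart. Modifier 0 steps
 * to the next pair within a row; modifier 1 jumps to the next row. */
int32_t window_dot(const int8_t *in, int row_stride,
                   const int8_t *kernel, int kh, int kw) {
    assert(kw % 2 == 0);
    agu_model a = { in, 0, { 2, row_stride - kw + 2 } };
    agu_model b = { kernel, 0, { 2, 2 } };
    int32_t acc = 0;
    for (int r = 0; r < kh; r++) {
        for (int c = 0; c < kw; c += 2) {
            int m = (c + 2 == kw);            /* last pair of the row */
            vec2x16 s = agu_read_2x8(&a, m);
            vec2x16 w = agu_read_2x8(&b, 0);
            acc += (int32_t)s.lo * w.lo + (int32_t)s.hi * w.hi;
        }
    }
    return acc;
}
```

Note that the loop body of `window_dot` contains no pointer arithmetic or unpacking of its own; in the model, all of that work is folded into the two read helpers, which mirrors the role the text assigns to the AGUs.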
In summary, the use of XY memory and AGUs enables very efficient code as no instructions are needed to load and store data, perform pointer math, or convert and rearrange data. All of these are performed implicitly while accessing data through the AGUs, with up to three memory accesses per cycle. In the next section, we present code examples that illustrate the use of the processor’s XY memory and AGUs for machine learning inference.
Most other embedded processors have to issue explicit load and store instructions to access memory. In a single-issue processor, the execution of these instructions may consume a significant portion of the available cycles, effectively reducing the throughput in MACs/cycle. Multi-issue processors, such as VLIW processors, aim to perform the load and store operations in parallel with compute operations (such as the MACs) to increase throughput. However, since wide instructions have to be used, this comes at the price of larger code size and higher power consumption in the instruction memory.
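The effect of explicit loads on throughput can be estimated with a back-of-the-envelope model. The sketch below assumes an idealized single-issue pipeline in which every instruction takes exactly one cycle and nothing stalls; the function name `macs_per_cycle` and the instruction counts used with it are illustrative assumptions, not measurements of any processor.

```c
#include <assert.h>

/* MACs per cycle for one loop iteration on an idealized single-issue
 * machine: one cycle per instruction, no stalls, no loop overhead. */
double macs_per_cycle(int macs_per_instr, int mac_instrs,
                      int load_store_instrs) {
    assert(mac_instrs > 0 && load_store_instrs >= 0);
    int cycles = mac_instrs + load_store_instrs;
    return (double)(macs_per_instr * mac_instrs) / cycles;
}
```

Under these assumptions, a dual MAC that needs two explicit operand loads gives `macs_per_cycle(2, 1, 2)`, or two MACs in three cycles, while the same dual MAC with operands fetched implicitly gives `macs_per_cycle(2, 1, 0)`, or two MACs in one cycle.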