Multi-Processor System-on-Chip 1 - Liliana Andrade

2.2.2. Machine learning inference


The main use of machine learning techniques in intelligent systems is the inference of deep learning networks. When considering deep learning inference acceleration, several architectural approaches appear effective. These include loosely coupled accelerators that implement a systolic data path (Google TPU, NVIDIA NVDLA), coarse-grained reconfigurable arrays (Cerebras WSE) or a bulk-synchronous parallel graph processor (GraphCore IPU). Other approaches tightly couple general-purpose processing units with vector or tensor processing units that share the instruction stream and the memory hierarchy. In particular, the GPGPU architecture evolved further with the NVIDIA Volta, which integrates eight “tensor cores” per streaming multiprocessor (SM) in order to accelerate machine learning workloads (Jia et al. 2018). Each tensor core executes mixed-precision matrix multiply-accumulate operations on 4 × 4 matrices. The multiplication operand elements use the IEEE 754 binary16 floating-point representation (FP16), while the accumulation and result operands use either the IEEE 754 binary16 or the binary32 (FP32) floating-point representation (Figure 2.3).


Figure 2.3. Operation of a Volta tensor core (NVIDIA 2020)
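The mixed-precision multiply-accumulate described above, D = A × B + C on 4 × 4 matrices, can be sketched in a few lines. This is a minimal numerical model only, not a description of the Volta hardware: the helper names `to_fp16`, `to_fp32` and `mma_4x4` are hypothetical, and the accumulation order and the rounding after each addition are assumptions made for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 binary16 value (raises OverflowError if out of range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mma_4x4(a, b, c, accumulate=to_fp32):
    """Compute D = A x B + C with FP16 multiplicands.

    `accumulate` is the rounding applied to the accumulator after each
    addition: to_fp32 (default) or to_fp16, matching the two accumulation
    formats mentioned in the text. The sequential order over k is an
    assumption of this sketch.
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = accumulate(c[i][j])
            for k in range(n):
                # The product of two binary16 values is exact in double precision.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = accumulate(acc + prod)
            d[i][j] = acc
    return d
```

With FP16 accumulation, adding 1 to 2048 is lost because consecutive binary16 values are 2 apart at that magnitude, whereas FP32 accumulation returns 2049. This illustrates why the wider accumulator is usually preferred.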

Machine learning computations normally rely on FP32 arithmetic; however, significant savings in memory footprint and gains in performance and efficiency can be achieved by using 16-bit representations for training and 8-bit representations for inference, with acceptable precision loss. The main 16-bit formats are FP16; BF16, which is FP32 with the 16 low-order mantissa bits truncated (Intel 2018); and INT16, which covers the 16-bit integer and fixed-point representations (Figure 2.4a). These reduced bit-width formats are, in fact, used as multiplication operands in linear operations, whose results are still accumulated in FP32, INT32 or larger fixed-point representations.
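The relationship between FP32 and BF16 can be made concrete by clearing the 16 low-order bits of the FP32 encoding. The sketch below is illustrative only; the function names are hypothetical, and it uses plain truncation as stated in the text, whereas practical converters may instead round to nearest.

```python
import struct

def fp32_to_bits(x: float) -> int:
    """Return the 32-bit pattern of x rounded to IEEE 754 binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_fp32(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_bf16_truncate(x: float) -> float:
    """Keep the sign, the 8 exponent bits and the top 7 mantissa bits of FP32."""
    return bits_to_fp32(fp32_to_bits(x) & 0xFFFF0000)

print(to_bf16_truncate(3.14159))  # 3.140625
```

Because BF16 keeps the 8-bit exponent of FP32, a value such as 1e38 remains finite after truncation, whereas it lies beyond the FP16 range, whose largest finite value is 65504.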

While the mainstream 8-bit formats in convolutional network inference are signed or unsigned integers (Jacob et al. 2018; Krishnamoorthi 2018), floating-point formats narrower than 16 bits are also being investigated. Their purpose is to eliminate the complexities associated with small integer quantization: fake quantization, where weights and activations are quantized and dequantized in succession during both the forward and backward passes of training; and post-training calibration, where the histogram of activations is collected on a representative dataset to adjust the saturation thresholds. Microsoft introduced the Msfp8 data format (Chung et al. 2018), which is FP16 truncated to 8 bits, with only 2 bits of mantissa left, along with its extension Msfp9. Among the reduced bit-width floating-point formats, however, the Posit8 representations generate the most interest (Carmichael et al. 2019).
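The two integer quantization steps mentioned above can be illustrated with a small sketch. It assumes symmetric signed 8-bit quantization and picks the saturation threshold as a high percentile of the absolute activation values; both choices, as well as the names `fake_quantize` and `calibrate_threshold`, are assumptions for illustration and not the specific procedures of the cited works.

```python
def fake_quantize(x: float, threshold: float, bits: int = 8) -> float:
    """Quantize x to a signed integer grid, then dequantize it again.

    `threshold` is the positive saturation threshold: magnitudes above it clip.
    """
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    scale = threshold / qmax            # real value of one integer step
    q = round(x / scale)                # quantize
    q = max(-qmax, min(qmax, q))        # saturate
    return q * scale                    # dequantize

def calibrate_threshold(samples, percentile: float = 99.9) -> float:
    """Choose a saturation threshold from activations seen on a calibration set.

    Assumption: a simple percentile of |activation| stands in for the
    histogram-based threshold selection described in the text.
    """
    mags = sorted(abs(s) for s in samples)
    idx = min(len(mags) - 1, int(len(mags) * percentile / 100))
    return mags[idx]
```

A single large outlier in the calibration data then no longer dictates the scale: the threshold stays near the bulk of the distribution, and the outlier saturates.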

A Positn.es representation (Figure 2.4b) is parameterized by n, the total number of bits, and es, the number of exponent bits (Gustafson and Yonemoto 2017). The main difference with an IEEE 754 binary floating-point representation is the regime field, which has a dynamic width and encodes a power of $2^{2^{es}}$ in unary numerals. The advantages and disadvantages of Posit representations are discussed by de Dinechin et al. (2019). They advise the use of Posit as a storage-only format in order to benefit from the compact encoding, while still relying on standard IEEE binary floating-point arithmetic for numerical guarantees. Experimentally, Posit8 numbers provide an effective compressed representation of FP32 network weights and activations by rounding them to Posit8.1 or Posit8.2 numbers (Resmerita et al. 2020). Another approach is to use Posit8.1 in the log domain for the multiplicands, while converting to the linear domain for the accumulations (Johnson 2018). In both cases, however, the large dynamic range that motivates the use of the Posit representations in machine learning inference requires high-precision or exact accumulations.


Figure 2.4. Numerical formats used in deep learning inference (adapted from Gustafson (2017) and Rodriguez et al. (2018))
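To make the role of the regime field concrete, the following sketch decodes a Posit bit pattern into a floating-point value, following the field layout of Gustafson and Yonemoto (2017): sign, unary regime, up to es exponent bits, then fraction. The function names are hypothetical. The encoder is a deliberately simplified nearest-value search over all patterns, an assumption of this sketch rather than the rounding rule of the Posit definition, and it only suits small n.

```python
def decode_posit(bits: int, n: int = 8, es: int = 1) -> float:
    """Decode an n-bit Posit with es exponent bits into a Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")                 # NaR (not a real)
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask               # negative values: two's complement
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]
    # Regime: run of identical bits, ended by the opposite bit or the word end.
    run = 1
    while run < len(body) and body[run] == body[0]:
        run += 1
    k = run - 1 if body[0] == 1 else -run
    rest = body[run + 1:]                   # skip the terminating bit
    exp_bits = rest[:es] + [0] * (es - len(rest[:es]))  # missing bits read as 0
    e = 0
    for b in exp_bits:
        e = (e << 1) | b
    frac = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(rest[es:]))
    # value = useed**k * 2**e * (1 + frac), with useed = 2**(2**es)
    value = 2.0 ** (k * (1 << es) + e) * (1.0 + frac)
    return -value if sign else value

def encode_posit_nearest(x: float, n: int = 8, es: int = 1) -> int:
    """Return the pattern whose value is closest to x (simplified search)."""
    if x == 0.0:
        return 0
    nar = 1 << (n - 1)
    candidates = [p for p in range(1, 1 << n) if p != nar]
    return min(candidates, key=lambda p: abs(decode_posit(p, n, es) - x))
```

For Posit8.1, the pattern 0x40 decodes to 1.0 and the largest pattern 0x7F decodes to 4096, so storing an FP32 weight amounts to calling `encode_posit_nearest` once and `decode_posit` before each use, in line with the storage-only usage discussed above.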
