CHAPTER 2
Overview of Deep Neural Networks
Deep Neural Networks (DNNs) come in a wide variety of shapes and sizes depending on the application.1 The popular shapes and sizes are also evolving rapidly to improve accuracy and efficiency. In all cases, the input to a DNN is a set of values representing the information to be analyzed by the network. For instance, these values can be pixels of an image, sampled amplitudes of an audio wave, or the numerical representation of the state of some system or game.
In this chapter, we will describe the key building blocks for DNNs. As there are many different types of DNNs [50], we will focus our attention on those that are most widely used. We will begin by describing the salient characteristics of commonly used DNN layers in Sections 2.1 and 2.2. We will then describe popular DNN layers and how these layers can be combined to form various types of DNNs in Section 2.3. Section 2.4 will provide a detailed discussion on convolutional neural networks (CNNs), since they are widely used and tend to provide many opportunities for efficient DNN processing. It will also highlight various popular CNN models that are often used as workloads for evaluating DNN hardware accelerators. Next, in Section 2.5, we will briefly discuss other types of DNNs and describe how they are similar to and differ from CNNs from a workload processing perspective (e.g., data dependencies, types of compute operations, etc.). Finally, in Section 2.6, we will discuss the various DNN development resources (e.g., frameworks and datasets), which researchers and practitioners have made available to help enable the rapid progress in DNN model and hardware research and development.
2.1 ATTRIBUTES OF CONNECTIONS WITHIN A LAYER
As discussed in Chapter 1, DNNs are composed of several processing layers, where in most layers the main computation is a weighted sum. There are several different types of layers, which primarily differ in terms of how the inputs and outputs are connected within the layers. There are two main attributes of the connections within a layer:
1. The connection pattern between the input and output activations, as shown in Figure 2.1a: if a layer has the attribute that every input activation is connected to every output, then we call that layer fully connected. On the other hand, if a layer has the attribute that only a subset of inputs are connected to the output, then we call that layer sparsely connected. Note that the weights associated with these connections can be zero or non-zero; if a weight happens to be zero (e.g., as a result of training), it does not mean there is no connection (i.e., the connection still exists).
Figure 2.1: Properties of connections in DNNs (Figure adapted from [4]).
For sparsely connected layers, a sub-attribute is related to the structure of the connections. Input activations may connect to any output activation (i.e., global), or they may only connect to output activations in their neighborhood (i.e., local). The consequence of such local connections is that each output activation is a function of a restricted window of input activations, which is referred to as the receptive field.
2. The value of the weight associated with each connection: the most general case is that the weight can take on any value (e.g., each weight can have a unique value). A more restricted case is that the same value is shared by multiple weights, which is referred to as weight sharing.
Combinations of these attributes result in many of the common layer types. Any layer with the fully connected attribute is called a fully connected layer (FC layer). To distinguish the layer type from the attribute, in this chapter we will use the term FC layer for the former and fully connected for the latter; in subsequent chapters, however, we will follow the common practice of using the terms interchangeably. Another widely used layer type is the convolutional (CONV) layer, which is locally and sparsely connected with weight sharing.2 The computation in both FC and CONV layers is a weighted sum. However, there are other computations that might be performed, and these result in other types of layers. We will discuss FC, CONV, and these other layers in more detail in Section 2.3.
2.2 ATTRIBUTES OF CONNECTIONS BETWEEN LAYERS
Another attribute is the connections from the output of one layer to the input of another layer, as shown in Figure 2.1b. The output can be connected to the input of the next layer, in which case the connection is referred to as feed forward. With feed-forward connections, all of the computation is performed as a sequence of operations on the outputs of a previous layer.3 The network has no memory, and the output for an input is always the same irrespective of the sequence of inputs previously given to the network. DNNs that contain only feed-forward connections are referred to as feed-forward networks. Examples of these types of networks include multi-layer perceptrons (MLPs), which are DNNs that are composed entirely of feed-forward FC layers, and convolutional neural networks (CNNs), which are DNNs that contain both FC and CONV layers. CNNs, which are commonly used for image processing and computer vision, will be discussed in more detail in Section 2.4.
Alternatively, the output can be fed back to the input of its own layer in which case the connection is often referred to as recurrent. With recurrent connections, the output of a layer is a function of both the current and prior input(s) to the layer. This creates a form of memory in the DNN, which allows long-term dependencies to affect the output. DNNs that contain these connections are referred to as recurrent neural networks (RNNs), which are commonly used to process sequential data (e.g., speech, text), and will be discussed in more detail in Section 2.5.
2.3 POPULAR TYPES OF LAYERS IN DNNs
In this section, we will discuss the various popular layers used to form DNNs. We will begin by describing the CONV and FC layers, whose main computation is a weighted sum, since that tends to dominate the computation cost in terms of both energy consumption and throughput. We will then discuss various layers that can optionally be included in a DNN and do not use weighted sums, such as nonlinearity, pooling, and normalization.
These layers can be viewed as primitive layers, which can be combined to form compound layers. Compound layers are often given names as a convenience when the same combination of primitive layers is frequently used together. In practice, people often refer to either primitive or compound layers as just layers.
2.3.1 CONV LAYER (CONVOLUTIONAL)
CONV layers are primarily composed of high-dimensional convolutions, as shown in Figure 2.2. In this computation, the input activations of a layer are structured as a 3-D input feature map (ifmap), where the dimensions are the height (H), width (W), and number of input channels (C). The weights of a layer are structured as a 3-D filter, where the dimensions are the height (R), width (S), and number of input channels (C). Notice that the number of channels for the input feature map and the filter are the same. For each input channel, the input feature map undergoes a 2-D convolution (see Figure 2.2a) with the corresponding channel in the filter. The results of the convolution at each point are summed across all the input channels to generate the output partial sums. In addition, a 1-D (scalar) bias can be added to the filtering results, but some recent networks [24] remove its usage from parts of the layers. The results of this computation are the output partial sums that comprise one channel of the output feature map (ofmap).4 Additional 3-D filters can be used on the same input feature map to create additional output channels (i.e., applying M filters to the input feature map generates M output channels in the output feature map). Finally, multiple input feature maps (N) may be processed together as a batch to potentially improve reuse of the filter weights.
Figure 2.2: Dimensionality of convolutions. (a) Shows the traditional 2-D convolution used in image processing. (b) Shows the high dimensional convolution used in CNNs, which applies a 2-D convolution on each channel.
Table 2.1: Shape parameters of a CONV/FC layer
Shape Parameter | Description |
N | Batch size of 3-D fmaps |
M | Number of 3-D filters / number of channels of ofmap (output channels) |
C | Number of channels of filter / ifmap (input channels) |
H/W | Ifmap spatial height/width |
R/S | Filter spatial height/width (= H/W in FC) |
P/Q | Ofmap spatial height/width (= 1 in FC) |
Given the shape parameters in Table 2.1,5 the computation of a CONV layer is defined as:

o[n][m][p][q] = b[m] + Σ_{c=0}^{C−1} Σ_{r=0}^{R−1} Σ_{s=0}^{S−1} i[n][c][Up + r][Uq + s] × f[m][c][r][s],

0 ≤ n < N, 0 ≤ m < M, 0 ≤ p < P, 0 ≤ q < Q,
P = (H − R + U)/U, Q = (W − S + U)/U. (2.1)

o, i, f, and b are the tensors of the ofmaps, ifmaps, filters, and biases, respectively. U is a given stride size.
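As a concrete illustration of Equation (2.1), the following is a minimal NumPy sketch of the CONV layer loop nest; the function and variable names are our own choices for illustration, not part of the original text.

```python
import numpy as np

def conv_layer(ifmap, filters, bias, U=1):
    """Direct loop-nest evaluation of Equation (2.1).

    ifmap:   (N, C, H, W) input feature maps
    filters: (M, C, R, S) filter weights
    bias:    (M,)         per-output-channel bias
    U:       stride
    Returns the (N, M, P, Q) array of output partial sums.
    """
    N, C, H, W = ifmap.shape
    M, _, R, S = filters.shape
    P, Q = (H - R) // U + 1, (W - S) // U + 1
    ofmap = np.zeros((N, M, P, Q), dtype=ifmap.dtype)
    for n in range(N):
        for m in range(M):
            for p in range(P):
                for q in range(Q):
                    # weighted sum over input channels and the R x S filter window
                    window = ifmap[n, :, U*p:U*p+R, U*q:U*q+S]
                    ofmap[n, m, p, q] = np.sum(window * filters[m]) + bias[m]
    return ofmap
```

The dataflows in Chapter 5 and the mappings in Chapter 6 can be thought of as different ways of reordering, tiling, and parallelizing this loop nest.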
Figure 2.2b shows a visualization of this computation (ignoring biases). As much as possible, we will adhere to the following coloring scheme in this book.
• Blue: input activations belonging to an input feature map.
• Green: weights belonging to a filter.
• Red: partial sums—Note: since there is no formal term for an array of partial sums, we will sometimes label an array of partial sums as an output feature map and color it red (even though, technically, output feature maps are composed of activations derived from partial sums that have passed through a nonlinear function and therefore should be blue).
Returning to the CONV layer calculation in Equation (2.1), one notes that the operands (i.e., the ofmaps, ifmaps, and filters) have many dimensions. Therefore, these operands can be viewed as tensors (i.e., high-dimension arrays) and the computation can be treated as a tensor algebra computation where the computation involves performing binary operations (e.g., multiplications and additions forming dot products) between tensors to produce new tensors. Since the CONV layer can be viewed as a tensor algebra operation, it is worth noting that an alternative representation for a CONV layer can be created using the tensor index notation found in [51], which describes a compiler for sparse tensor algebra computations.6 The tensor index notation provides a compact way to describe a kernel’s functionality. For example, in this notation matrix multiply Z = AB can be written as:

Z_{i,j} = Σ_k A_{i,k} × B_{k,j} (2.2)

That is, the output point (i, j) is formed by taking a dot product of k values along the i-th row of A and the j-th column of B.7 Extending this notation to express computation on the index variables (by putting those calculations in parentheses) allows a CONV layer in tensor index notation to be represented quite concisely as:

o_{n,m,p,q} = Σ_{c,r,s} i_{n,c,(Up+r),(Uq+s)} × f_{m,c,r,s} (2.3)
In this calculation, each output at a point (n, m, p, q) is calculated as a dot product taken across the index variables c, r, and s of the specified elements of the input activation and filter weight tensors. Note that this notation attaches no significance to the order of the index variables in the summation. The relevance of this will become apparent in the discussion of dataflows (Chapter 5) and mapping computations onto a DNN accelerator (Chapter 6).
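As a side note, NumPy's einsum uses essentially this kind of index notation, which makes it convenient for prototyping such kernels. Below is a minimal sketch (our own, with illustrative shapes) of Equations (2.2) and (2.3); the CONV case first materializes the R×S input windows so that the strided index arithmetic becomes an ordinary reduction.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Equation (2.2): Z[i, j] = sum_k A[i, k] * B[k, j]
A = np.random.rand(4, 5)
B = np.random.rand(5, 3)
Z = np.einsum('ik,kj->ij', A, B)   # same result as A @ B

# Equation (2.3): O[n, m, p, q] = sum_{c, r, s} I[n, c, U*p + r, U*q + s] * F[m, c, r, s]
N, C, H, W, M, R, S, U = 2, 3, 8, 8, 4, 3, 3, 2
I = np.random.rand(N, C, H, W)
F = np.random.rand(M, C, R, S)
# windows[n, c, p', q', r, s] = I[n, c, p' + r, q' + s]; keep every U-th window for the stride
windows = sliding_window_view(I, (R, S), axis=(2, 3))[:, :, ::U, ::U]
O = np.einsum('ncpqrs,mcrs->nmpq', windows, F)
```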
Finally, to align the terminology of CNNs with the generic DNN,
• filters are composed of weights (i.e., synapses), and
• input and output feature maps (ifmaps, ofmaps) are composed of input and output activations (partial sums after application of a nonlinear function) (i.e., input and output neurons).
Figure 2.3: Fully connected layer from convolution point of view with H = R, W = S, P = Q = 1, and U = 1.
2.3.2 FC LAYER (FULLY CONNECTED)
In an FC layer, every value in the output feature map is a weighted sum of every input value in the input feature map (i.e., it is fully connected). Furthermore, FC layers typically do not exhibit weight sharing and, as a result, the computation tends to be memory bound. FC layers are often processed in the form of a matrix multiplication, which will be explained in Chapter 4. This is why matrix multiplication is often associated with DNN processing.
An FC layer can also be viewed as a special case of a CONV layer; specifically, a CONV layer in which the filters are the same size as the input feature maps. As a result, it does not have the local, sparsely connected, weight-sharing properties of CONV layers. Equation (2.1) therefore still holds for the computation of FC layers with a few additional constraints on the shape parameters: H = R, W = S, P = Q = 1, and U = 1. Figure 2.3 shows a visualization of this computation and in the tensor index notation from Section 2.3.1 it is:

o_{n,m,1,1} = Σ_{c,h,w} i_{n,c,h,w} × f_{m,c,h,w} (2.4)
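Since H = R, W = S, and P = Q = 1, the FC computation collapses to a matrix multiply. The minimal sketch below (shapes and names are illustrative assumptions) shows the matrix-multiply view and the tensor index form of Equation (2.4) producing the same result.

```python
import numpy as np

N, C, H, W, M = 4, 8, 7, 7, 10
ifmap   = np.random.rand(N, C, H, W)    # batch of input feature maps
filters = np.random.rand(M, C, H, W)    # one filter per output activation (R = H, S = W)
bias    = np.random.rand(M)

# FC layer as a matrix multiply: flatten each ifmap and each filter into a vector
ofmap = ifmap.reshape(N, -1) @ filters.reshape(M, -1).T + bias   # shape (N, M)

# same result written directly in the tensor index form of Equation (2.4)
ofmap_eq24 = np.einsum('nchw,mchw->nm', ifmap, filters) + bias
assert np.allclose(ofmap, ofmap_eq24)
```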
2.3.3 NONLINEARITY
A nonlinear activation function is typically applied after each CONV or FC layer. Various nonlinear functions are used to introduce nonlinearity into the DNN, as shown in Figure 2.4. These include historically conventional nonlinear functions such as sigmoid or hyperbolic tangent. These were popular because they facilitate mathematical analysis/proofs. The rectified linear unit (ReLU) [52] has become popular in recent years due to its simplicity and its ability to enable fast training, while achieving comparable accuracy.8 Variations of ReLU, such as leaky ReLU [53], parametric ReLU [54], exponential LU [55], and Swish [56] have also been explored for improved accuracy. Finally, a nonlinearity called maxout, which takes the maximum value of two intersecting linear functions, has been shown to be effective in speech recognition tasks [57, 58].
Figure 2.4: Various forms of nonlinear activation functions. (Figure adapted from [62].)
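For reference, the commonly used nonlinearities in Figure 2.4 can be written in a few lines of NumPy; this is a minimal sketch (the 0.01 leaky slope is just a typical illustrative value).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # slope of the negative region; 0.01 is a typical illustrative choice
    return np.where(x > 0, x, slope * x)

def swish(x):
    return x * sigmoid(x)

# the hyperbolic tangent is available directly as np.tanh(x)
```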
2.3.4 POOLING AND UNPOOLING
There are a variety of computations that can be used to change the spatial resolution (i.e., H and W or P and Q) of the feature map depending on the application. For applications such as image classification, the goal is to summarize the entire image into one label; therefore, reducing the spatial resolution may be desirable. Networks that reduce input into a sparse output are often referred to as encoder networks. For applications such as semantic segmentation, the goal is to assign a label to each pixel in the image;9 as a result, increasing the spatial resolution may be desirable. Networks that expand input into a dense output are often referred to as decoder networks.
Reducing the spatial resolution of a feature map is referred to as pooling or more generically downsampling. Pooling, which is applied to each channel separately, enables the network to be robust and invariant to small shifts and distortions. Pooling combines, or pools, a set of values in its receptive field into a smaller number of values. Pooling can be parameterized based on the size of its receptive field (e.g., 2×2) and pooling operation (e.g., max or average), as shown in Figure 2.5. Typically, pooling occurs on non-overlapping blocks (i.e., the stride is equal to the size of the pooling). Usually a stride of greater than one is used such that there is a reduction in the spatial resolution of the representation (i.e., feature map). Pooling is usually performed after the nonlinearity.
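For example, 2×2 max or average pooling with non-overlapping blocks (stride equal to the pooling size), applied independently to each channel, can be written as a simple reshape-and-reduce; a minimal sketch, assuming H and W are divisible by 2:

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """fmap: (N, C, H, W) with H and W divisible by 2; returns (N, C, H//2, W//2)."""
    N, C, H, W = fmap.shape
    blocks = fmap.reshape(N, C, H // 2, 2, W // 2, 2)   # split into 2x2 blocks
    if mode == "max":
        return blocks.max(axis=(3, 5))                  # max pooling
    return blocks.mean(axis=(3, 5))                     # average pooling
```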
Figure 2.5: Various forms of pooling.
Figure 2.6: Various forms of unpooling/upsampling. (Figures adapted from [64].)
Increasing the spatial resolution of a feature map is referred to as unpooling or more generically as upsampling. Commonly used forms of upsampling include inserting zeros between the activations, as shown in Figure 2.6a (this type of upsampling is commonly referred to as unpooling10), interpolation using nearest neighbors [63, 64], as shown in Figure 2.6b, and interpolation with bilinear or bicubic filtering [65]. Upsampling is usually performed before the CONV or FC layer. Upsampling can introduce structured sparsity in the input feature map that can be exploited for improved energy efficiency and throughput, as described in Section 8.1.1.
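The zero-insertion form of unpooling in Figure 2.6a can be sketched as follows; this is a simple illustrative version that upsamples by an integer factor, with names of our own choosing.

```python
import numpy as np

def unpool_zero_insertion(fmap, factor=2):
    """Insert zeros between activations: (N, C, H, W) -> (N, C, factor*H, factor*W)."""
    N, C, H, W = fmap.shape
    out = np.zeros((N, C, factor * H, factor * W), dtype=fmap.dtype)
    out[:, :, ::factor, ::factor] = fmap   # original activations land on a regular grid
    return out
```

For factor = 2, three out of every four activations in the upsampled feature map are zero, which is the structured sparsity mentioned above.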
2.3.5 NORMALIZATION
Controlling the input distribution across layers can help to significantly speed up training and improve accuracy. Accordingly, the distribution of the layer input activations (described by its mean μ and standard deviation σ) is normalized such that it has a zero mean and a unit standard deviation. In batch normalization (BN), the normalized value is further scaled and shifted, as shown in Equation (2.5), where the parameters (γ, β) are learned from training [66]:11,12

y = γ ((x − μ) / √(σ² + ∊)) + β (2.5)

where ∊ is a small constant to avoid numerical problems.
Prior to the wide adoption of BN, local response normalization (LRN) [7] was used, which was inspired by lateral inhibition in neurobiology where excited neurons (i.e., high value activations) should subdue their neighbors (i.e., cause low value activations); however, BN is now considered standard practice in the design of CNNs while LRN is mostly deprecated. Note that while LRN is usually performed after the nonlinear function, BN is usually performed between the CONV or FC layer and the nonlinear function. If BN is performed immediately after the CONV or FC layer, its computation can be folded into the weights of the CONV or FC layer, resulting in no additional computation for inference.
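The folding follows directly from Equation (2.5): since BN applies the same per-channel affine transform to every weighted-sum output, the scale can be absorbed into the filter weights and the shift into the bias. A minimal sketch under those assumptions (names are ours; eps is the small constant ∊):

```python
import numpy as np

def fold_bn_into_conv(filters, bias, gamma, beta, mu, sigma2, eps=1e-5):
    """filters: (M, C, R, S); bias, gamma, beta, mu, sigma2: (M,) per output channel."""
    scale = gamma / np.sqrt(sigma2 + eps)            # per-channel scale from Equation (2.5)
    folded_filters = filters * scale[:, None, None, None]
    folded_bias = (bias - mu) * scale + beta         # shift absorbed into the bias
    return folded_filters, folded_bias
```

Running the CONV layer with the folded filters and bias produces the same result as running the original CONV layer followed by BN.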
2.3.6 COMPOUND LAYERS
The above primitive layers can be combined to form compound layers. For instance, attention layers are composed of matrix multiplications and feed-forward, fully connected layers [68]. Attention layers have become popular for processing a wide range of data including language and images and are commonly used in a type of DNN called Transformers. We will discuss Transformers in more detail in Section 2.5. Another example of a compound layer is the up-convolution layer [60], which performs zero-insertion (unpooling) on the input and then applies a convolutional layer.13 Up-convolution layers are typically used in DNNs such as Generative Adversarial Networks (GANs) and Auto Encoders (AEs) that process image data. We will discuss GANs and AEs in more detail in Section 2.5.
2.4 CONVOLUTIONAL NEURAL NETWORKS (CNNs)
CNNs are a common form of DNNs that are composed of multiple CONV layers, as shown in Figure 2.7. In such networks, each layer generates a successively higher-level abstraction of the input data, called a feature map (fmap), which preserves essential yet unique information. Modern CNNs are able to achieve superior performance by employing a very deep hierarchy of layers. CNNs are widely used in a variety of applications including image understanding [7], speech recognition [70], game play [10], robotics [42], etc. This book will focus on their use in image processing, specifically for the task of image classification [7]. Modern CNN models for image classification typically have 5 [7] to more than 1,000 [24] CONV layers. A small number, e.g., 1 to 3, of FC layers are typically applied after the CONV layers for classification purposes.
Figure 2.7: Convolutional Neural Networks.
2.4.1 POPULAR CNN MODELS
Many CNN models have been developed over the past two decades. Each of these models is different in terms of the number of layers, layer types, layer shapes (i.e., filter size, number of channels and filters), and connections between layers. Understanding these variations and trends is important for incorporating the right flexibility in any efficient DNN accelerator, as discussed in Chapter 3.
In this section, we will give an overview of various popular CNNs, such as LeNet [71], as well as those that competed in and/or won the ImageNet Challenge [23], as shown in Figure 1.8; pre-trained weights for most of these models are publicly available for download. The CNN models are summarized in Table 2.2. Two results for Top-5 error are reported. In the first row, the accuracy is boosted by using multiple crops from the image and an ensemble of multiple trained models (i.e., the CNN needs to be run several times); these results were used to compete in the ImageNet Challenge. The second row reports the accuracy if only a single crop was used (i.e., the CNN is run only once), which is more consistent with what would likely be deployed in real-time and/or energy-constrained applications.
Table 2.2: Summary of popular CNNs [7, 24, 71, 73, 74]. †Accuracy is measured based on Top-5 error on ImageNet [23] using multiple crops. ‡This version of LeNet-5 has 431k weights for the filters and requires 2.3M MACs per image, and uses ReLU rather than sigmoid.
LeNet [20] was one of the first CNN approaches introduced in 1989. It was designed for the task of digit classification in grayscale images of size 28×28. The most well known version, LeNet-5, contains two CONV layers followed by two FC layers [71]. Each CONV layer uses filters of size 5×5 (1 channel per filter) with 6 filters in the first layer and 16 filters in the second layer. Average pooling of 2×2 is used after each convolution and a sigmoid is used for the non-linearity. In total, LeNet requires 60k weights and 341k multiply-and-accumulates (MACs) per image. LeNet led to CNNs’ first commercial success, as it was deployed in ATMs to recognize digits for check deposits.
AlexNet [7] was the first CNN to win the ImageNet Challenge in 2012. It consists of five CONV layers followed by three FC layers. Within each CONV layer, there are 96 to 384 filters and the filter size ranges from 3×3 to 11×11, with 3 to 256 channels each. In the first layer, the three channels of the filter correspond to the red, green, and blue components of the input image. A ReLU nonlinearity is used in each layer. Max pooling of 3×3 is applied to the outputs of layers 1, 2, and 5. To reduce computation, a stride of 4 is used at the first layer of the network. AlexNet introduced the use of LRN in layers 1 and 2 before the max pooling, though LRN is no longer popular in later CNN models. One important factor that differentiates AlexNet from LeNet is that the number of weights is much larger and the shapes vary from layer to layer. To reduce the number of weights and the computation in the second CONV layer, the 96 output channels of the first layer are split into two groups of 48 input channels for the second layer, such that the filters in the second layer only have 48 channels. This approach is referred to as “grouped convolution” and is illustrated in Figure 2.8.14 Similarly, the weights in the fourth and fifth layers are also split into two groups. In total, AlexNet requires 61M weights and 724M MACs to process one 227×227 input image.
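For reference, a grouped convolution simply partitions the input and output channels and concatenates the per-group outputs; the minimal sketch below (function names are ours) uses groups=2, corresponding to AlexNet's split layers.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(ifmap, filters, U=1):
    """Plain CONV: ifmap (N, C, H, W), filters (M, C, R, S) -> (N, M, P, Q)."""
    R, S = filters.shape[2:]
    windows = sliding_window_view(ifmap, (R, S), axis=(2, 3))[:, :, ::U, ::U]
    return np.einsum('ncpqrs,mcrs->nmpq', windows, filters)

def grouped_conv(ifmap, filters, groups=2, U=1):
    """Grouped CONV: each filter only has C // groups channels, as in AlexNet layers 2, 4, 5."""
    C, M = ifmap.shape[1], filters.shape[0]
    Cg, Mg = C // groups, M // groups
    outs = [conv2d(ifmap[:, g*Cg:(g+1)*Cg], filters[g*Mg:(g+1)*Mg], U)
            for g in range(groups)]
    return np.concatenate(outs, axis=1)   # concatenate groups along the output channels
```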
Overfeat [72] has a very similar architecture to AlexNet with five CONV layers followed by three FC layers. The main differences are that the number of filters is increased for layers 3 (384 to 512), 4 (384 to 1024), and 5 (256 to 1024), layer 2 is not split into two groups, the first FC layer only has 3072 channels rather than 4096, and the input size is 231×231 rather than 227×227. As a result, the number of weights grows to 146M and the number of MACs grows to 2.8G per image. Overfeat has two different models: fast (described here) and accurate. The accurate model used in the ImageNet Challenge gives a 0.65% lower Top-5 error rate than the fast model at the cost of 1.9× more MACs.
VGG-16 [73] goes deeper to 16 layers consisting of 13 CONV layers followed by 3 FC layers. In order to balance out the cost of going deeper, larger filters (e.g., 5×5) are built from multiple smaller filters (e.g., 3×3), which have fewer weights, to achieve the same effective receptive fields, as shown in Figure 2.9a. As a result, all CONV layers have the same filter size of 3×3. In total, VGG-16 requires 138M weights and 15.5G MACs to process one 224×224 input image. VGG has two different models: VGG-16 (described here) and VGG-19. VGG-19 gives a 0.1% lower Top-5 error rate than VGG-16 at the cost of 1.27× more MACs.
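To quantify the saving (our own worked arithmetic, for a layer with C input and C output channels): two stacked 3×3 CONV layers have the same 5×5 effective receptive field as a single 5×5 layer, but require

2 × (3 × 3 × C × C) = 18C² weights versus 5 × 5 × C × C = 25C² weights,

a 28% reduction in weights (and in MACs per output activation).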
GoogLeNet [74] goes even deeper with 22 layers. It introduced an inception module, shown in Figure 2.10, whose input is distributed through multiple feed-forward connections to several parallel layers. These parallel layers contain different sized filters (i.e., 1×1, 3×3, 5×5), along with 3×3 max-pooling, and their outputs are concatenated for the module output. Using multiple filter sizes has the effect of processing the input at multiple scales. For improved training speed, GoogLeNet is designed such that the weights and the activations, which are stored for backpropagation during training, could all fit into the GPU memory. In order to reduce the number of weights, 1×1 filters are applied as a “bottleneck” to reduce the number of channels for each filter [75], as shown in Figure 2.11. The 22 layers consist of three CONV layers, followed by nine inception modules (each of which is two CONV layers deep), and one FC layer. The number of FC layers was reduced from three to one using a global average pooling layer, which summarizes the large feature map from the CONV layers into one value per channel; global pooling will be discussed in more detail in Section 9.1.2. Since its introduction in 2014, GoogLeNet (also referred to as Inception) has had multiple versions: v1 (described here), v3,15 and v4. Inception-v3 decomposes the convolutions by using smaller 1-D filters, as shown in Figure 2.9b, to reduce the number of MACs and weights in order to go deeper, to 42 layers. In conjunction with batch normalization [66], v3 achieves over 3% lower Top-5 error than v1 with 2.5× more MACs [76]. Inception-v4 uses residual connections [77], described in the next section, for a 0.4% reduction in error.
Figure 2.8: An example of dividing feature map into two grouped convolutions. Each filter requires 2× fewer weights and multiplications.
Figure 2.9: Decomposing larger filters into smaller filters.
Figure 2.10: Inception module from GoogLeNet [74] with example channel lengths. Note that each CONV layer is followed by a ReLU (not drawn).
ResNet [24], also known as Residual Net, uses feed-forward connections that connect to layers beyond the immediate next layer (often referred to as residual, skip, or identity connections); these connections enable a DNN with many layers (e.g., 34 or more) to be trainable. It was the first CNN entry in the ImageNet Challenge that exceeded human-level accuracy with a Top-5 error rate below 5%. One of the challenges with deep networks is the vanishing gradient during training [78]; as the error backpropagates through the network, the gradient shrinks, which affects the ability to update the weights in the earlier layers for very deep networks. ResNet introduces a “shortcut” module which contains an identity connection such that the weight layers (i.e., CONV layers) can be skipped, as shown in Figure 2.12. Rather than learning the direct mapping H(x) in the weight layers, the shortcut module learns the residual mapping F(x) = H(x) − x. Initially, F(x) is zero and the identity connection is taken; then gradually during training, the actual forward connection through the weight layers is used. ResNet also uses the “bottleneck” approach of using 1×1 filters to reduce the number of weights. As a result, the two layers in the shortcut module are replaced by three layers (1×1, 3×3, 1×1), where the first 1×1 layer reduces the number of activations and thus the number of weights in the 3×3 layer, and the last 1×1 layer restores the number of activations in the output of the third layer. ResNet-50 consists of one CONV layer, followed by 16 shortcut layers (each of which is three CONV layers deep), and one FC layer; it requires 25.5M weights and 3.9G MACs per image. There are various versions of ResNet with multiple depths (e.g., without bottleneck: 18, 34; with bottleneck: 50, 101, 152). The ResNet with 152 layers was the winner of the ImageNet Challenge, requiring 11.3G MACs and 60M weights. Compared to ResNet-50, it reduces the Top-5 error by around 1% at the cost of 2.9× more MACs and 2.5× more weights.
Figure 2.11: Apply 1×1×C filter (usually referred to as 1×1) to capture cross-channel correlation, but no spatial correlation. This bottleneck approach reduces the number of channels in next layer assuming the number of filters applied (M) is less than the original number of channels (C).
Figure 2.12: Shortcut module from ResNet [24]. Note that the ReLU following the last CONV layer in the shortcut module is applied after the addition.
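The shortcut module in Figure 2.12 can be summarized in a few lines. The sketch below is schematic (the conv* arguments are placeholder callables of our own invention, and padding/normalization details are omitted); it shows only how the identity connection is added before the final ReLU.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def bottleneck_shortcut(x, conv1x1_a, conv3x3, conv1x1_b):
    """ResNet bottleneck shortcut module. Each conv* argument is a callable that maps a
    feature map to a feature map of the same spatial size, so that F(x) + x is valid."""
    f = relu(conv1x1_a(x))   # 1x1 CONV reduces the number of channels
    f = relu(conv3x3(f))     # 3x3 CONV on the reduced representation
    f = conv1x1_b(f)         # 1x1 CONV restores the number of channels
    return relu(f + x)       # identity (shortcut) connection added before the last ReLU
```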
Several trends can be observed in the popular CNNs shown in Table 2.2. Increasing the depth of the network tends to provide higher accuracy. Controlling for number of weights, a deeper network can support a wider range of nonlinear functions that are more discriminative and also provides more levels of hierarchy in the learned representation [24, 73, 74, 79]. The number of filter shapes continues to vary across layers, thus flexibility is still important. Furthermore, most of the computation has been placed on CONV layers rather than FC layers. In addition, the number of weights in the FC layers is reduced and in most recent networks (since GoogLeNet) the CONV layers also dominate in terms of weights. Thus, the focus of hardware implementations targeted at CNNs should be on addressing the efficiency of the CONV layers, which in many domains are increasingly important.
Since ResNet, several other notable networks have been proposed to increase accuracy. DenseNet [84] extends the concept of skip connections by adding skip connections from multiple previous layers to strengthen feature map propagation and feature reuse. This concept, commonly referred to as feature aggregation, continues to be widely explored. WideNet [85] proposes increasing the width (i.e., the number of filters) rather than the depth of the network, which has the added benefit that increasing width is more parallel-friendly than increasing depth. ResNeXt [86] proposes increasing the number of convolution groups (referred to as cardinality) instead of the depth and width of the network and was used as part of the winning entry for ImageNet in 2017. Finally, EfficientNet [87] proposes uniformly scaling all dimensions including depth, width, and resolution, rather than focusing on a single dimension, since there is an interplay between the different dimensions (e.g., to support higher input image resolution, the DNN needs higher depth to increase the receptive field and higher width to capture more fine-grained patterns). WideNet, ResNeXt, and EfficientNet demonstrate that there exist methods beyond increasing depth for increasing accuracy, and thus highlight that there remains much to be explored and understood about the relationship between layer shape, number of layers, and accuracy.
Figure 2.13: Auto Encoder network for semantic segmentation. Feature maps along with pooling and upsampling layers are shown. (Figure adapted from [92].)
2.5 OTHER DNNs
There are other types of DNNs beyond CNNs including Recurrent Neural Networks (RNNs) [88, 89], Transformers [68], Auto Encoders (AEs) [90], and Generative Adversarial Networks (GANs) [91]. The diverse types of DNNs allow them to handle a wide range of inputs for a wide range of tasks. For instance, RNNs and Transformers are often used to handle sequential data that can have variable length (e.g., audio for speech recognition, or text for natural language processing). AEs and GANs can be used to generate dense output predictions by combining encoder and decoder networks. Example applications that use AEs include predicting pixel-wise depth values for depth estimation [64] and assigning pixel-wise class labels for semantic segmentation [92], as shown in Figure 2.13. Example applications that use GANs to generate images with the same statistics as the training set include image synthesis [93] and style transfer [94].
Figure 2.14: Dependencies in RNN are in both the time and depth dimension. The same weights (Wi) are used across time, while different weights are used across depth. (Figure adapted from [4].)
While their applications may differ from the CNNs described in Section 2.4, many of the building blocks and primitive layers are similar. For instance, RNNs and transformers heavily rely on matrix multiplications, which means that they have similar challenges as FC layers (e.g., they are memory bound due to lack of data reuse); thus, many of the techniques used to accelerate FC layers can also be used to accelerate RNNs and transformers (e.g., tiling discussed in Chapter 4, network pruning discussed in Chapter 8, etc.). Similarly, the decoder network of GANs and AEs for image processing use up-convolution layers, which involves upsampling the input feature map using zero insertion (unpooling) before applying a convolution; thus, many of the techniques used to accelerate CONV layers can also be used to accelerate the decoder network of GANs and AEs for image processing (e.g., exploit input activation sparsity discussed in Chapter 8).
While the dominant compute aspect of these DNNs is similar to CNNs, they do often require some other forms of compute. For instance, RNNs, particularly Long Short-Term Memory networks (LSTMs) [95], require support for element-wise multiplications as well as a variety of nonlinear functions (sigmoid, tanh), unlike CNNs, which typically only use ReLU. However, these operations do not tend to dominate run-time or energy consumption; they can be computed in software [96] or the nonlinear functions can be approximated by piecewise linear look-up tables [97]. For GANs and AEs, additional support is required for upsampling.
Finally, RNNs have additional dependencies since the output of a layer is fed back to its input, as shown in Figure 2.14. For instance, the inputs to layer i at time t depend on the output of layer i − 1 at time t and layer i at time t − 1. This is similar to the dependency across layers, in that the output of layer i is the input to layer i + 1. These dependencies limit what inputs can be processed in parallel (e.g., within the same batch). For DNNs with feed-forward layers, any inputs can be processed at the same time (i.e., batch size greater than one); however, multiple layers of the same input cannot be processed at the same time (e.g., layers i and i + 1). In contrast, RNNs can only process multiple inputs at the same time if the inputs are not sequentially dependent; in other words, RNNs can process two separate sequences at the same time, but not multiple elements within the same sequence (e.g., inputs t and t + 1 of the same sequence) and not multiple layers of the same input (which is similar to feed-forward networks).
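These time and depth dependencies can be made explicit with a small sketch (a plain vanilla-RNN cell rather than an LSTM; the weight names are illustrative): the activation of layer i at time t needs both h[i − 1][t] and h[i][t − 1] before it can be computed.

```python
import numpy as np

def rnn_forward(x_seq, W_in, W_rec, b):
    """x_seq: list of T input vectors; W_in[i], W_rec[i], b[i]: weights of layer i.
    Returns h, where h[i][t] is the activation of layer i at time step t."""
    L, T = len(W_in), len(x_seq)
    h = [[None] * T for _ in range(L)]
    for t in range(T):                 # elements of one sequence cannot be processed in parallel
        for i in range(L):             # layer i at time t depends on layer i-1 at time t ...
            below = x_seq[t] if i == 0 else h[i - 1][t]
            before = h[i][t - 1] if t > 0 else np.zeros_like(b[i])
            h[i][t] = np.tanh(W_in[i] @ below + W_rec[i] @ before + b[i])  # ... and layer i at t-1
    return h
```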
2.6 DNN DEVELOPMENT RESOURCES
One of the key factors that has enabled the rapid development of DNNs is the set of development resources that have been made available by the research community and industry. These resources are also key to the development of DNN accelerators by providing characterizations of the workloads and facilitating the exploration of trade-offs in model complexity and accuracy. This section will describe these resources such that those who are interested in this field can quickly get started.
2.6.1 FRAMEWORKS
For ease of DNN development and to enable the sharing of trained networks, several deep learning frameworks have been developed from various sources. These open-source frameworks contain software libraries for DNNs. Caffe was made available in 2014 from UC Berkeley [59]. It supports C, C++, Python, and MATLAB. Tensorflow [98] was released by Google in 2015, and supports C++ and Python; it also supports multiple CPUs and GPUs and has more flexibility than Caffe, with the computation expressed as dataflow graphs to manage the “tensors” (multidimensional arrays). Another popular framework is Torch, which was developed by Facebook and NYU and supports C, C++, and Lua; PyTorch [99] is its successor and is built in Python. There are several other frameworks such as Theano, MXNet, and CNTK, which are described in [100]. There are also higher-level libraries that can run on top of the aforementioned frameworks to provide a more universal experience and faster development. One example of such libraries is Keras, which is written in Python and supports Tensorflow, CNTK, and Theano.
Such frameworks are not only a convenient aid for DNN researchers and application designers; they are also invaluable for engineering high-performance or more efficient DNN computation engines. In particular, because the frameworks make heavy use of a set of primitive operations, such as the processing of a CONV layer, they can incorporate the use of optimized software or hardware accelerators. This acceleration is transparent to the user of the framework. Thus, for example, most frameworks can use Nvidia’s cuDNN library for rapid execution on Nvidia GPUs. Similarly, transparent incorporation of dedicated hardware accelerators can be achieved as was done with the Eyeriss chip using Caffe [101].
Finally, these frameworks are a valuable source of workloads for hardware researchers. They can be used to drive experimental designs for different workloads, for profiling different workloads and for exploring hardware-algorithm trade-offs.
Figure 2.15: MNIST (10 classes, 60k training, 10k testing) [103] versus ImageNet (1000 classes, 1.3M training, 100k testing) [23] dataset.
2.6.2 MODELS
Pretrained DNN models can be downloaded from various websites [80–83] for the various different frameworks. It should be noted that even for the same DNN (e.g., AlexNet) the accuracy of these models can vary by around 1 to 2% depending on how the model was trained and tested, and thus the results do not always exactly match the original publication.
These pre-trained models often are tied to a given framework. In order to facilitate easier exchange between different networks, Open Neural Network Exchange (ONNX) has been established as an open ecosystem for interchangeable DNN models [102]; the current participants include Amazon, Facebook, and Microsoft.
2.6.3 POPULAR DATASETS FOR CLASSIFICATION
It is important to factor in the difficulty of the task when comparing different DNN models. For instance, the task of classifying handwritten digits from the MNIST dataset [103] is much simpler than classifying an object into one of 1000 classes as is required for the ImageNet dataset [23] (Figure 2.15). It is expected that the size of the DNNs (i.e., number of weights) and the number of MACs will be larger for the more difficult task than the simpler task and thus require more energy and have lower throughput. For instance, LeNet-5 [71] is designed for digit classification, while AlexNet [7], VGG-16 [73], GoogLeNet [74], and ResNet [24] are designed for the 1000-class image classification.
There are many AI tasks that come with publicly available datasets in order to evaluate the accuracy of a given DNN. Public datasets are important for comparing the accuracy of different approaches. The simplest and most common task in computer vision is image classification, which involves being given an entire image, and selecting 1 of N classes that the image most likely belongs to. There is no localization or detection.
MNIST is a widely used dataset for digit classification that was introduced in 1998 [103]. It consists of 28×28 pixel grayscale images of handwritten digits. There are 10 classes (for 10 digits) and 60,000 training images and 10,000 test images. LeNet-5 was able to achieve an accuracy of 99.05% when MNIST was first introduced. Since then the accuracy has increased to 99.79% using regularization of neural networks with dropconnect [104]. Thus, MNIST is now considered a fairly easy dataset.
CIFAR is a dataset that consists of 32×32 pixel colored images of various objects, which was released in 2009 [105]. CIFAR is a subset of the 80 million Tiny Image dataset [106]. CIFAR-10 is composed of 10 mutually exclusive classes. There are 50,000 training images (5000 per class) and 10,000 test images (1000 per class). A two-layer convolutional deep belief network was able to achieve 64.84% accuracy on CIFAR-10 when it was first introduced [107]. Since then the accuracy has increased to 96.53% using fractional max pooling [108].
ImageNet is a large-scale image dataset that was first introduced in 2010; the dataset stabilized in 2012 [23]. It contains 256×256-pixel color images spanning 1000 classes. The classes are defined using WordNet as a backbone to handle ambiguous word meanings and to combine together synonyms into the same object category. In other words, there is a hierarchy for the ImageNet categories. The 1000 classes were selected such that there is no overlap in the ImageNet hierarchy. The ImageNet dataset contains many fine-grained categories including 120 different breeds of dogs. There are 1.3M training images (732 to 1300 per class), 100,000 testing images (100 per class), and 50,000 validation images (50 per class).
The accuracy for the image classification task in the ImageNet Challenge is reported using two metrics: Top-5 and Top-1 accuracy.16 Top-5 accuracy means that if any of the top five scoring categories are the correct category, it is counted as a correct classification. Top-1 accuracy requires that the top scoring category be correct. In 2012, the winner of the ImageNet Challenge (AlexNet) was able to achieve an accuracy of 83.6% for the Top-5 (substantially better than the second-place entry that year, which achieved 73.8% without using DNNs); it achieved 61.9% on the Top-1 of the validation set. In 2019, the state-of-the-art DNNs achieve accuracy above 97% for the Top-5 and above 84% for the Top-1 [87].
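The Top-1 and Top-5 metrics are easy to state in code; a minimal sketch (variable and function names are ours), where scores is an (images × classes) array of DNN outputs:

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """scores: (num_images, num_classes); labels: (num_images,) ground-truth class indices."""
    top_k = np.argsort(scores, axis=1)[:, -k:]          # indices of the k highest-scoring classes
    correct = (top_k == labels[:, None]).any(axis=1)    # correct if the true class is among them
    return correct.mean()

# Top-1 accuracy is the same computation with k=1.
```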
In summary of the various image classification datasets, it is clear that MNIST is a fairly easy dataset, while ImageNet is a more challenging one with a wider coverage of classes. Thus, in terms of evaluating the accuracy of a given DNN, it is important to consider the dataset upon which the accuracy is measured.
2.6.4 DATASETS FOR OTHER TASKS
Since state-of-the-art DNNs now perform better than human-level accuracy on image classification tasks, the ImageNet Challenge has started to focus on more difficult tasks such as single-object localization and object detection. For single-object localization, the target object must be localized and classified (out of 1000 classes). The DNN outputs the top five categories and top five bounding box locations. There is no penalty for identifying an object that is in the image but not included in the ground truth. For object detection, all objects in the image must be localized and classified (out of 200 classes). The bounding box for all objects in these categories must be labeled. Objects that are not labeled are penalized, as are duplicate detections.
Beyond ImageNet, there are also other popular image datasets for computer vision tasks. For object detection, there is the PASCAL VOC (2005-2012) dataset that contains 11k images representing 20 classes (27k object instances, 7k of which have detailed segmentation) [109]. For object detection, segmentation, and recognition in context, there is the M.S. COCO dataset with 2.5M labeled instances in 328k images (91 object categories) [110]; compared to ImageNet, COCO has fewer categories but more instances per category, which is useful for precise 2-D localization. COCO also has more labeled instances per image to potentially help with contextual information.
Most recently, even larger scale datasets have been made available. For instance, Google has an Open Images dataset with over 9M images [111], spanning 6000 categories. There is also a YouTube dataset with 8M videos (0.5M hours of video) covering 4800 classes [112]. Google also released an audio dataset comprised of 632 audio event classes and a collection of 2M human-labeled 10-second sound clips [113]. These large datasets will be ever more important as DNNs become deeper with more weights to train. In addition, it has been shown that accuracy increases logarithmically based on the amount of training data [16].17
Undoubtedly, both larger datasets and datasets for new domains will serve as important resources for profiling and exploring the efficiency of future DNN engines.
2.6.5 SUMMARY
The development resources presented in this section enable us to evaluate hardware using the appropriate DNN model and dataset. In particular, it’s important to realize that difficult tasks typically require larger models; for instance, LeNet would not apply to the ImageNet Challenge. In addition, different datasets are required for different tasks; for instance, self-driving cars require high-definition video, and thus a network trained on the low resolution ImageNet dataset may not be sufficient. To address these requirements, the number of datasets continues to grow at a rapid pace.
1 The DNN research community often refers to the shape and size of a DNN as its “network architecture.” However, to avoid confusion with the use of the word “architecture” by the hardware community, we will talk about “DNN models” and their shape and size in this book.
2 CONV layers use a specific type of weight sharing, which will be described in Section 2.4.
3 Connections can come from the immediately preceding layer or an earlier layer. Furthermore, connections from a layer can go to multiple later layers.
4 For simplicity, in this chapter, we will refer to an array of partial sums as an output feature map. However, technically, the output feature map would be composed of the values of the partial sums after they have gone through a nonlinear function (i.e., the output activations).
5 In some literature, K is used rather than M to denote the number of 3-D filters (also referred to as kernels), which determines the number of output feature map channels. We opted not to use K to avoid confusion with yet other communities that use it to refer to the number of dimensions. We also have adopted the convention of using P and Q as the dimensions of the output to align with other publications and since our prior use of E and F caused an alias with the use of “F” to represent filter weights. Note that some literature also uses X and Y to denote the spatial dimensions of the input rather than W and H.
6 Note that many of the values in the CONV layer tensors are zero, making the tensors sparse. The origins of this sparsity, and approaches for performing the resulting sparse tensor algebra, are presented in Chapter 8.
7 Note that Albert Einstein popularized a similar notation for tensor algebra which omits any explicit specification of the summation variable.
8 In addition to being simple to implement, ReLU also increases the sparsity of the output activations, which can be exploited by a DNN accelerator to increase throughput, reduce energy consumption and reduce storage cost, as described in Section 8.1.1.
9 In the literature, this is often referred to as dense prediction.
10 There are two versions of unpooling: (1) zero insertion is applied in a regular pattern, as shown in Figure 2.6a [60]—this is most commonly used; and (2) unpooling is paired with a max pooling layer, where the location of the max value during pooling is stored, and during unpooling the location of the non-zero value is placed in the location of the max value before pooling [61].
11 It has been recently reported that the reason batch normalization enables faster and more stable training is due to the fact that it makes the optimization landscape smoother resulting in more predictive and stable behavior of the gradient [67]; this is in contrast to the popular belief that batch normalization stabilizes the distribution of the input across layers. Nonetheless, batch normalization continues to be widely used for training and thus needs to be supported during inference.
12 During training, the parameters σ and μ are computed per batch, and γ and β are updated per batch based on the gradient; therefore, training with different batch sizes will result in different σ and μ parameters, which can impact accuracy. Note that each channel has its own set of σ, μ, γ, and β parameters. During inference, all parameters are fixed, where σ and μ are computed from the entire training set. To avoid performing an extra pass over the entire training set to compute σ and μ, they are usually implemented as the running average of the per-batch σ and μ computed during training.
13 Note that variants of the up-convolution layer with different types of upsampling include the deconvolution layer, sub-pixel or fractional convolutional layer, transposed convolutional layer, and backward convolution layer [69].
14 This grouped convolution approach is applied more aggressively when performing co-design of algorithms and hardware to reduce complexity, which will be discussed in Chapter 9.
15 v2 is very similar to v3.
16 Note that in some parts of the book we use Top-1 and Top-5 error. The error can be computed as 100% minus accuracy.
17 This was demonstrated on Google’s internal JFT-300M dataset with 300M images and 18,291 classes, which is two orders of magnitude larger than ImageNet. However, performing four iterations across the entire training set using 50 K-80 GPUs required two months of training, which further emphasizes that compute is one of the main bottlenecks in the advancement of DNN research.