
CHAPTER 1

Introduction

Deep neural networks (DNNs) are currently the foundation for many modern artificial intelligence (AI) applications [5]. Since the breakthrough application of DNNs to speech recognition [6] and image recognition1 [7], the number of applications that use DNNs has exploded. These DNNs are employed in a myriad of applications from self-driving cars [8], to detecting cancer [9], to playing complex games [10]. In many of these domains, DNNs are now able to exceed human accuracy. The superior accuracy of DNNs comes from their ability to extract high-level features from raw sensory data by using statistical learning on a large amount of data to obtain an effective representation of an input space. This is different from earlier approaches that use hand-crafted features or rules designed by experts.

The superior accuracy of DNNs, however, comes at the cost of high computational complexity. To date, general-purpose compute engines, especially graphics processing units (GPUs), have been the mainstay for much DNN processing. Increasingly, however, in these waning days of Moore’s law, there is a recognition that more specialized hardware is needed to keep improving compute performance and energy efficiency [11]. This is especially true in the domain of DNN computations. This book aims to provide an overview of DNNs, the various tools for understanding their behavior, and the techniques being explored to efficiently accelerate their computation.

1.1 BACKGROUND ON DEEP NEURAL NETWORKS

In this section, we describe the position of DNNs in the context of artificial intelligence (AI) in general and some of the concepts that motivated the development of DNNs. We will also present a brief chronology of the major milestones in the history of DNNs, and some current domains to which they are being applied.

1.1.1 ARTIFICIAL INTELLIGENCE AND DEEP NEURAL NETWORKS

DNNs, also referred to as deep learning, are a part of the broad field of AI. AI is the science and engineering of creating intelligent machines that have the ability to achieve goals like humans do, according to John McCarthy, the computer scientist who coined the term in the 1950s. The relationship of deep learning to the whole of AI is illustrated in Figure 1.1.


Figure 1.1: Deep learning in the context of artificial intelligence.

Within AI is a large sub-field called machine learning, which was defined in 1959 by Arthur Samuel [12] as “the field of study that gives computers the ability to learn without being explicitly programmed.” That means a single program, once created, will be able to learn how to do some intelligent activities outside the notion of programming. This is in contrast to purpose-built programs whose behavior is defined by hand-crafted heuristics that explicitly and statically define their behavior.

The advantage of an effective machine learning algorithm is clear. Instead of the laborious and hit-or-miss approach of creating a distinct, custom program to solve each individual problem in a domain, a single machine learning algorithm simply needs to learn, via a process called training, to handle each new problem.

Within the machine learning field, there is an area that is often referred to as brain-inspired computation. Since the brain is currently the best “machine” we know of for learning and solving problems, it is a natural place to look for inspiration. Therefore, a brain-inspired computation is a program or algorithm that takes some aspects of its basic form or functionality from the way the brain works. This is in contrast to attempts to recreate a brain itself; rather, the program aims to emulate some aspects of how we understand the brain to operate.

Although scientists are still exploring the details of how the brain works, it is generally believed that the main computational element of the brain is the neuron. There are approximately 86 billion neurons in the average human brain. The neurons themselves are connected by a number of elements entering them, called dendrites, and an element leaving them, called an axon, as shown in Figure 1.2. The neuron accepts the signals entering it via the dendrites, performs a computation on those signals, and generates a signal on the axon. These input and output signals are referred to as activations. The axon of one neuron branches out and is connected to the dendrites of many other neurons. The connection between a branch of the axon and a dendrite is called a synapse. There are estimated to be $10^{14}$ to $10^{15}$ synapses in the average human brain.


Figure 1.2: Connections to a neuron in the brain. $x_i$, $w_i$, $f(\cdot)$, and $b$ are the activations, weights, nonlinear function, and bias, respectively. (Figure adapted from [4].)

A key characteristic of the synapse is that it can scale the signal (xi) crossing it, as shown in Figure 1.2. That scaling factor can be referred to as a weight (wi), and the way the brain is believed to learn is through changes to the weights associated with the synapses. Thus, different weights result in different responses to an input. One aspect of learning can be thought of as the adjustment of weights in response to a learning stimulus, while the organization (what might be thought of as the program) of the brain largely does not change. This characteristic makes the brain an excellent inspiration for a machine-learning-style algorithm.

Within the brain-inspired computing paradigm, there is a subarea called spiking computing. In this subarea, inspiration is taken from the fact that communication on the dendrites and axons takes the form of spike-like pulses, and that the information being conveyed is not based solely on a spike’s amplitude; it also depends on the time at which the pulse arrives, and the computation that happens in the neuron is a function not of a single value but of the pulse widths and the timing relationships between different pulses. The IBM TrueNorth project is an example of work that was inspired by the spiking of the brain [13]. In contrast to spiking computing, another subarea of brain-inspired computing is called neural networks, which is the focus of this book.2


Figure 1.3: Simple neural network example and terminology. (Figure adapted from [4].)

1.1.2 NEURAL NETWORKS AND DEEP NEURAL NETWORKS

Neural networks take their inspiration from the notion that a neuron’s computation involves a weighted sum of the input values. These weighted sums correspond to the value scaling performed by the synapses and the combining of those values in the neuron. Furthermore, the neuron does not directly output that weighted sum because the expressive power of the cascade of neurons involving only linear operations is just equal to that of a single neuron, which is very limited. Instead, there is a functional operation within the neuron that is performed on the combined inputs. This operation appears to be a nonlinear function that causes a neuron to generate an output only if its combined inputs cross some threshold. Thus, by analogy, neural networks apply a nonlinear function to the weighted sum of the input values.3 These nonlinear functions are inspired by biological functions, but are not meant to emulate the brain. We look at some of those nonlinear functions in Section 2.3.3.
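To make the weighted-sum-plus-nonlinearity idea concrete, here is a minimal sketch of a single artificial neuron in Python/NumPy; the input values, weights, and the choice of ReLU as the nonlinear function are assumptions made purely for illustration, not something prescribed by the text.

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: a weighted sum of the input activations plus a
    bias, followed by a nonlinear function (ReLU is assumed here)."""
    weighted_sum = np.dot(w, x) + b       # combine the scaled inputs
    return np.maximum(0.0, weighted_sum)  # output only if the sum crosses zero

# Arbitrary example values (not from the book).
x = np.array([0.5, -1.0, 2.0])   # input activations
w = np.array([0.8, 0.2, -0.4])   # weights (one per synapse)
b = 0.1                          # bias
print(neuron(x, w, b))           # the neuron's output activation
```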

Figure 1.3a shows a diagram of a three-layer (non-biological) neural network. The neurons in the input layer receive some values, compute their weighted sums followed by the nonlinear function, and propagate the outputs to the neurons in the middle layer of the network, which is also frequently called a “hidden layer.” A neural network can have more than one hidden layer, and the outputs from the hidden layers ultimately propagate to the output layer, which computes the final outputs of the network to the user. To align brain-inspired terminology with neural networks, the outputs of the neurons are often referred to as activations, and the synapses are often referred to as weights, as shown in Figure 1.3a. We will use the activation/weight nomenclature in this book.


Figure 1.4: Example of image classification using deep neural networks. (Figure adapted from [15].) Note that the features go from low level to high level as we go deeper into the network.

Figure 1.3b shows an example of the computation at layer 1: $y_j = f\left(\sum_{i=1}^{3} W_{ij} x_i + b_j\right)$, where $W_{ij}$, $x_i$, and $y_j$ are the weights, input activations, and output activations, respectively, and $f(\cdot)$ is a nonlinear function described in Section 2.3.3. The bias term $b_j$ is omitted from Figure 1.3b for simplicity. In this book, we will use the color green to denote weights, blue to denote activations, and red to denote weighted sums (or partial sums, which are further accumulated to become the final weighted sums).
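For readers who prefer code to notation, the layer computation above can be sketched as a small matrix-vector product in NumPy; the sigmoid nonlinearity and the random weight values are assumptions for illustration only.

```python
import numpy as np

def fc_layer(x, W, b):
    """Fully connected layer: y_j = f(sum_i W_ij * x_i + b_j).
    The array is stored output-index first, so W[j, i] plays the role of W_ij.
    A sigmoid is used as the nonlinear function f(.) purely for illustration."""
    z = W @ x + b                      # weighted sums (partial sums accumulate here)
    return 1.0 / (1.0 + np.exp(-z))    # nonlinear function f(.)

# Layer 1 of the small example in Figure 1.3: three inputs, three outputs.
rng = np.random.default_rng(0)
x = np.array([0.2, 0.7, -0.3])          # input activations
W = 0.1 * rng.standard_normal((3, 3))   # weights (arbitrary values)
b = np.zeros(3)                         # biases
y = fc_layer(x, W, b)                   # output activations, passed to the next layer
print(y)
```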

Within the domain of neural networks, there is an area called deep learning, in which the neural networks have more than three layers, i.e., more than one hidden layer. Today, the typical numbers of network layers used in deep learning range from five to more than a thousand. In this book, we will generally use the terminology deep neural networks (DNNs) to refer to the neural networks used in deep learning.

DNNs are capable of learning high-level features with more complexity and abstraction than shallower neural networks. An example that demonstrates this point is using DNNs to process visual data, as shown in Figure 1.4. In these applications, pixels of an image are fed into the first layer of a DNN, and the outputs of that layer can be interpreted as representing the presence of different low-level features in the image, such as lines and edges. In subsequent layers, these features are then combined into a measure of the likely presence of higher-level features, e.g., lines are combined into shapes, which are further combined into sets of shapes. Finally, given all this information, the network provides a probability that these high-level features comprise a particular object or scene. This deep feature hierarchy enables DNNs to achieve superior performance in many tasks.


Figure 1.5: Example of an image classification task. The machine learning platform takes in an image and outputs the class probabilities for a predefined set of classes.

1.2 TRAINING VERSUS INFERENCE

Since DNNs are an instance of machine learning algorithms, the basic program does not change as it learns to perform its given tasks. In the specific case of DNNs, this learning involves determining the value of the weights (and biases) in the network, and is referred to as training the network. Once trained, the program can perform its task by computing the output of the network using the weights determined during the training process. Running the program with these weights is referred to as inference.

In this section, we will use image classification, as shown in Figure 1.5, as a driving example for training and using a DNN. When we perform inference using a DNN, the input is an image and the output is a vector of values representing the class probabilities. There is one value for each object class, and the class with the highest value indicates the most likely (predicted) class of object in the image. The overarching goal for training a DNN is to determine the weights that maximize the probability of the correct class and minimize the probabilities of the incorrect classes. The correct class is generally known, as it is often defined in the training set. The gap between the ideal correct probabilities and the probabilities computed by the DNN based on its current weights is referred to as the loss (L). Thus, the goal of training DNNs is to find a set of weights to minimize the average loss over a large training set.
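As one concrete (though not prescribed by the text) example of a loss, the sketch below uses the cross-entropy loss, a common choice for classification: it is small when the probability assigned to the correct class is high and large when it is low.

```python
import numpy as np

def loss(class_probs, correct_class):
    """Cross-entropy loss: penalizes a low predicted probability for the
    correct class. class_probs is the DNN's output probability vector."""
    return -np.log(class_probs[correct_class] + 1e-12)  # epsilon avoids log(0)

# Example: four classes, the correct class is index 2.
probs = np.array([0.1, 0.2, 0.6, 0.1])   # DNN output (class probabilities)
print(loss(probs, correct_class=2))      # ~0.51; drops toward 0 as probs[2] -> 1
```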

When training a network, the weights ($w_{ij}$) are usually updated using a hill-climbing (hill-descending) optimization process called gradient descent. In gradient descent, a weight is updated by a scaled version of the partial derivative of the loss with respect to the weight (i.e., updated to $w_{ij} \leftarrow w_{ij} - \alpha \frac{\partial L}{\partial w_{ij}}$, where $\alpha$ is called the learning rate4). Note that this gradient indicates how the weights should change in order to reduce the loss. The process is repeated iteratively to reduce the overall loss.
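A minimal sketch of the gradient descent update on a toy one-weight loss; the quadratic loss, the learning rate value, and the use of a numerical gradient (instead of backpropagation) are all assumptions chosen only to keep the example self-contained.

```python
def toy_loss(w):
    """Stand-in loss with its minimum at w = 3 (purely for illustration)."""
    return (w - 3.0) ** 2

w = 0.0        # initial weight
alpha = 0.1    # learning rate
for _ in range(100):
    eps = 1e-6
    # Numerically estimate dL/dw; a real framework obtains this via backpropagation.
    grad = (toy_loss(w + eps) - toy_loss(w - eps)) / (2 * eps)
    w = w - alpha * grad   # the update rule: w <- w - alpha * dL/dw
print(w)                   # converges toward 3.0
```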

Sidebar: Key steps in training

Here, we will provide a very brief summary of the key steps of training and deploying a model. For more details, we recommend the reader refer to more comprehensive references such as [2]. First, we collect a labeled dataset and divide the data into subsets for training and testing. Second, we use the training set to train a model so that it can learn the weights for a given task. After achieving adequate accuracy on the training set, the ultimate quality of the model is determined by how accurately it performs on unseen data. Therefore, in the third step, we test the trained model by asking it to predict the labels for a test set that it has never seen before and compare the predictions to the ground truth labels.

Generalization refers to how well the model maintains its accuracy between training data and unseen data. If the model does not generalize well, it is often said to be overfitting; this implies that the model is fitting to the noise rather than to the underlying structure of the data that we would like it to learn. One way to combat overfitting is to have a large, diverse dataset; it has been shown that accuracy increases logarithmically as a function of the number of training examples [16]. Section 2.6.3 will discuss various popular datasets used for training. There are also other mechanisms that help with generalization, including regularization, which adds constraints to the model during training, such as smoothness, the number or size of the parameters, a prior distribution or structure, or randomness in training using dropout [17].

Further partitioning the training set into training and validation sets is another useful tool. Designing a DNN requires determining (tuning) a large number of hyperparameters, such as the size and shape of a layer or the number of layers. Tuning the hyperparameters based on the test set may cause overfitting to the test set, which results in a misleading evaluation of the true performance on unseen data. In this circumstance, the validation set can be used instead of the test set to mitigate this problem. Finally, if the model performs sufficiently well on the test set, it can be deployed on unlabeled images.
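As a small illustration of the dataset partitioning described in the sidebar, the sketch below shuffles a labeled dataset and splits it into training, validation, and test subsets; the 80/10/10 split and the synthetic data are arbitrary assumptions.

```python
import numpy as np

def split_dataset(examples, labels, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a labeled dataset and partition it into training, validation,
    and test subsets. The split fractions here are an arbitrary choice."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(examples))
    n_train = int(train_frac * len(idx))
    n_val = int(val_frac * len(idx))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return ((examples[train], labels[train]),   # used to learn the weights
            (examples[val], labels[val]),       # used to tune hyperparameters
            (examples[test], labels[test]))     # used only for the final evaluation

# Example with synthetic data (100 "images" of 8x8 pixels, 10 classes).
rng = np.random.default_rng(1)
X = rng.random((100, 8, 8))
y = rng.integers(0, 10, size=100)
(train_set, val_set, test_set) = split_dataset(X, y)
```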

An efficient way to compute the partial derivatives of the gradient is through a process called backpropagation. Backpropagation, which is a computation derived from the chain rule of calculus, operates by passing values backward through the network to compute how the loss is affected by each weight.


Figure 1.6: An example of backpropagation through a neural network.

This backpropagation computation is, in fact, very similar in form to the computation used for inference, as shown in Figure 1.6 [19].5 Thus, techniques for efficiently performing inference can sometimes be useful for performing training. There are, however, some important additional considerations to note. First, backpropagation requires intermediate outputs of the network to be preserved for the backward computation, so training has increased storage requirements. Second, because the gradients are used for hill-climbing (hill-descending), the precision requirement for training is generally higher than for inference. Thus, many of the reduced precision techniques discussed in Chapter 7 are limited to inference only.
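Following the two steps described in footnote 5, here is a hedged sketch of backpropagation through a single fully connected layer; the bias and the nonlinearity are omitted for brevity, and the example values are arbitrary.

```python
import numpy as np

def fc_backward(x, W, dL_dy):
    """Backpropagation through y = W @ x (bias and nonlinearity omitted).
    Step 1: dL/dW_ij = x_i * dL/dy_j, from the saved forward activations
            and the gradients of the loss with respect to the layer outputs.
    Step 2: dL/dx_i  = sum_j W_ij * dL/dy_j, from the weights and the same
            output gradients; this is passed back to the previous layer."""
    dL_dW = np.outer(dL_dy, x)   # same [output, input] layout as W
    dL_dx = W.T @ dL_dy
    return dL_dW, dL_dx

# Toy shapes: 3 inputs, 2 outputs, arbitrary values.
x = np.array([0.5, -1.0, 2.0])          # forward activations saved during inference
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
dL_dy = np.array([0.1, -0.2])           # gradient arriving from the layer above
dL_dW, dL_dx = fc_backward(x, W, dL_dy)
```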

A variety of techniques are used to improve the efficiency and robustness of training. For example, often, the loss from multiple inputs is computed before a single pass of weight updates is performed. This is called batching, which helps to speed up and stabilize the process.6
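A schematic of batching on a toy one-weight model: the per-example gradients in each mini-batch are averaged before a single weight update is applied. The linear model, synthetic data, and hyperparameter values are assumptions for illustration only.

```python
import numpy as np

def batch_update(w, xs, ys, alpha=0.05):
    """One weight update from a batch: average the per-example gradients of a
    squared-error loss for the toy model y_hat = w * x, then update once."""
    grads = 2.0 * (w * xs - ys) * xs      # dL/dw for every example in the batch
    return w - alpha * grads.mean()       # single update for the whole batch

# Synthetic data whose true weight is 2.0 (illustration only).
rng = np.random.default_rng(0)
xs = rng.normal(size=256)
ys = 2.0 * xs + 0.05 * rng.normal(size=256)

w, batch_size = 0.0, 32
for epoch in range(20):                               # one epoch = one full pass
    for start in range(0, len(xs), batch_size):
        sl = slice(start, start + batch_size)
        w = batch_update(w, xs[sl], ys[sl])
print(w)    # close to 2.0
```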

There are multiple ways to train the weights. The most common approach, as described above, is called supervised learning, where all the training samples are labeled (e.g., with the correct class). Unsupervised learning is another approach, where no training samples are labeled. Essentially, the goal is to find the structure or clusters in the data. Semi-supervised learning falls between the two approaches, where only a small subset of the training data is labeled (e.g., use unlabeled data to define the cluster boundaries, and use the small amount of labeled data to label the clusters). Finally, reinforcement learning can be used to train the weights such that, given the state of the current environment, the DNN can output what action the agent should take next to maximize expected rewards; however, the rewards might not be available immediately after an action, but instead only after a series of actions (often referred to as an episode).

Another commonly used approach to determine weights is fine-tuning, where previously trained weights are available and are used as a starting point, and those weights are then adjusted for a new dataset (e.g., transfer learning) or for a new constraint (e.g., reduced precision). This results in faster training than starting from randomly initialized weights, and can sometimes result in better accuracy.
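A minimal sketch of the fine-tuning idea: initialize from previously trained weights when they exist, and fall back to a random starting point otherwise. The file name and shapes in the usage comment are placeholders, not artifacts from the book.

```python
import numpy as np

def init_weights(shape, pretrained=None, seed=0):
    """Return starting weights for training: reuse previously trained weights
    when available (fine-tuning); otherwise start from random values."""
    if pretrained is not None:
        return np.array(pretrained, copy=True)     # learned features as the starting point
    rng = np.random.default_rng(seed)
    return 0.1 * rng.standard_normal(shape)        # training from scratch

# Hypothetical usage (file name and shapes are placeholders):
# pretrained_W = np.load("weights_trained_on_large_dataset.npy")
# W = init_weights((128, 784), pretrained=pretrained_W)
# ...then continue gradient descent on the new dataset, often with a smaller learning rate.
```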

This book will focus on the efficient processing of DNN inference rather than training, since DNN inference is often performed on embedded devices (rather than the cloud) where resources are limited, as discussed in more details later.

1.3 DEVELOPMENT HISTORY

Although neural networks were proposed in the 1940s, the first practical application employing multiple digital neurons didn’t appear until the late 1980s, with the LeNet network for handwritten digit recognition [20].7 Such systems are widely used by ATMs for digit recognition on checks. The early 2010s have seen a blossoming of DNN-based applications, with highlights such as Microsoft’s speech recognition system in 2011 [6] and the AlexNet DNN for image recognition in 2012 [7]. A brief chronology of deep learning is shown in Figure 1.7.

The deep learning successes of the early 2010s are believed to be due to a confluence of three factors. The first factor is the amount of available information to train the networks. To learn a powerful representation (rather than using a hand-crafted approach) requires a large amount of training data. For example, Facebook receives up to a billion images per day, Walmart creates 2.5 petabytes of customer data every hour, and YouTube has over 300 hours of video uploaded every minute. As a result, these and many other businesses have a huge amount of data to train their algorithms.

The second factor is the amount of compute capacity available. Semiconductor device and computer architecture advances have continued to provide increased computing capability, and we appear to have crossed a threshold where the large amount of weighted sum computation in DNNs, which is required for both inference and training, can be performed in a reasonable amount of time.


Figure 1.7: A concise history of neural networks. “Deep” refers to the number of layers in the network.

The successes of these early DNN applications opened the floodgates of algorithmic development. They have also inspired the development of several (largely open source) frameworks that make it even easier for researchers and practitioners to explore and use DNNs. These combined efforts contribute to the third factor, which is the evolution of the algorithmic techniques that have improved accuracy significantly and broadened the domains to which DNNs are being applied.

An excellent example of the successes in deep learning can be illustrated with the ImageNet Challenge [23]. This challenge is a contest involving several different components. One of the components is an image classification task, where algorithms are given an image and they must identify what is in the image, as shown in Figure 1.5. The training set consists of 1.2 million images, each of which is labeled with one of a thousand object categories that the image contains. For the evaluation phase, the algorithm must accurately identify objects in a test set of images, which it hasn’t previously seen.

Figure 1.8 shows the performance of the best entrants in the ImageNet contest over a number of years. The best entries initially had an error rate of 25% or more. In 2012, a group from the University of Toronto used graphics processing units (GPUs) for their high compute capability together with a DNN approach, named AlexNet, and reduced the error rate by approximately 10 percentage points [7]. Their accomplishment inspired an outpouring of deep learning algorithms that have resulted in a steady stream of improvements.

In conjunction with the trend toward using deep learning approaches for the ImageNet Challenge, there has been a corresponding increase in the number of entrants using GPUs: from 2012 when only 4 entrants used GPUs to 2014 when almost all the entrants (110) were using them. This use of GPUs reflects the almost complete switch from traditional computer vision approaches to deep learning-based approaches for the competition.


Figure 1.8: Results from the ImageNet Challenge [23].

In 2015, the ImageNet winning entry, ResNet [24], exceeded human-level accuracy with a Top-5 error rate8 below 5%. Since then, the error rate has dropped below 3% and more focus is now being placed on more challenging components of the competition, such as object detection and localization. These successes are clearly a contributing factor to the wide range of applications to which DNNs are being applied.
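For concreteness, here is a small sketch of how a Top-5 error rate can be computed from a model's per-class scores; the scores and labels below are random placeholders rather than actual ImageNet results.

```python
import numpy as np

def top5_error_rate(scores, true_labels):
    """Fraction of images whose correct category is NOT among the five
    highest-scoring categories selected by the model."""
    top5 = np.argsort(scores, axis=1)[:, -5:]              # 5 best classes per image
    hits = [label in row for row, label in zip(top5, true_labels)]
    return 1.0 - np.mean(hits)

# Example: 4 images, 1,000 classes, random scores (illustration only).
rng = np.random.default_rng(0)
scores = rng.random((4, 1000))
labels = np.array([3, 997, 42, 7])
print(top5_error_rate(scores, labels))
```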

1.4 APPLICATIONS OF DNNs

Many domains can benefit from DNNs, ranging from entertainment to medicine. In this section, we will provide examples of areas where DNNs are currently making an impact and highlight emerging areas where DNNs may make an impact in the future.

Image and Video: Video is arguably the biggest of big data. It accounts for over 70% of today’s Internet traffic [25]. For instance, over 800 million hours of video is collected daily worldwide for video surveillance [26]. Computer vision is necessary to extract meaningful information from video. DNNs have significantly improved the accuracy of many computer vision tasks such as image classification [23], object localization and detection [27], image segmentation [28], and action recognition [29].

Speech and Language: DNNs have significantly improved the accuracy of speech recognition [30] as well as many related tasks such as machine translation [6], natural language processing [31], and audio generation [32].

Medicine and Health Care: DNNs have played an important role in genomics to gain insight into the genetics of diseases such as autism, cancers, and spinal muscular atrophy [33–36]. They have also been used in medical imaging such as detecting skin cancer [9], brain cancer [37], and breast cancer [38].

Game Play: Recently, many of the grand AI challenges involving game play have been overcome using DNNs. These successes also required innovations in training techniques, and many rely on reinforcement learning [39]. DNNs have surpassed human level accuracy in playing games such as Atari [40], Go [10], and StarCraft [41], where an exhaustive search of all possibilities is not feasible due to the immense number of possible moves.

Robotics: DNNs have been successful in the domain of robotic tasks such as grasping with a robotic arm [42], motion planning for ground robots [43], visual navigation [8, 44], control to stabilize a quadcopter [45], and driving strategies for autonomous vehicles [46].

DNNs are already widely used in multimedia applications today (e.g., computer vision, speech recognition). Looking forward, we expect that DNNs will likely play an increasingly important role in the medical and robotics fields, as discussed above, as well as finance (e.g., for trading, energy forecasting, and risk assessment), infrastructure (e.g., structural safety, and traffic control), weather forecasting, and event detection [47]. The myriad application domains pose new challenges to the efficient processing of DNNs; the solutions then have to be adaptive and scalable in order to handle the new and varied forms of DNNs that these applications may employ.

1.5 EMBEDDED VERSUS CLOUD

The various applications and aspects of DNN processing (i.e., training versus inference) have different computational needs. Specifically, training often requires a large dataset9 and significant computational resources for multiple weight-update iterations. In many cases, training a DNN model still takes several hours to multiple days (or weeks or months!) and thus is typically performed in the cloud.

Inference, on the other hand, can happen either in the cloud or at the edge (e.g., Internet of Things (IoT) or mobile). In many applications, it is desirable to have the DNN inference processing at the edge near the sensor. For instance, in computer vision applications, such as measuring wait times in stores or predicting traffic patterns, it would be desirable to extract meaningful information from the video right at the image sensor rather than in the cloud, to reduce the communication cost. For other applications, such as autonomous vehicles, drone navigation, and robotics, local processing is desired since the latency and security risks of relying on the cloud are too high. However, video involves a large amount of data, which is computationally complex to process; thus, low-cost hardware to analyze video is challenging, yet critical, to enabling these applications.10 Speech recognition allows us to seamlessly interact with electronic devices, such as smartphones. While currently most of the processing for applications such as Apple Siri and Amazon Alexa voice services is in the cloud, it is still desirable to perform the recognition on the device itself to reduce latency. Some work has even considered partitioning the processing between the cloud and the edge on a per-layer basis in order to improve performance [49]. However, considerations related to dependency on connectivity, privacy, and security argue for keeping computation at the edge. Many of the embedded platforms that perform DNN inference have stringent limitations on energy consumption, compute, and memory cost; efficient processing of DNNs has become of prime importance under these constraints.
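The per-layer cloud/edge partitioning mentioned above [49] can be illustrated with a simple latency model: run the first k layers at the edge, upload that layer's output activations, and finish the remaining layers in the cloud, choosing k to minimize the total. The sketch below is a toy model with made-up numbers, not the actual method of [49].

```python
def best_split(edge_ms, cloud_ms, send_bytes, uplink_bytes_per_ms):
    """Pick how many layers k to run at the edge (0..n) so that
    edge compute + transmission of the hand-off data + cloud compute is minimized.
    send_bytes[k] is the data uploaded if the split happens after layer k
    (send_bytes[0] is the raw input). Toy model; all numbers are assumptions."""
    n = len(edge_ms)
    best_k, best_latency = 0, float("inf")
    for k in range(n + 1):
        edge = sum(edge_ms[:k])                                  # layers 1..k at the edge
        send = send_bytes[k] / uplink_bytes_per_ms if k < n else 0.0
        cloud = sum(cloud_ms[k:])                                # remaining layers in the cloud
        total = edge + send + cloud
        if total < best_latency:
            best_k, best_latency = k, total
    return best_k, best_latency

# Made-up example: 4 layers; early layers are cheap at the edge and shrink the data a lot.
edge_ms    = [2.0, 3.0, 40.0, 40.0]
cloud_ms   = [1.0, 1.0, 1.0, 1.0]
send_bytes = [600_000, 50_000, 2_000, 1_000, 500]   # per split point (k = 0..4)
print(best_split(edge_ms, cloud_ms, send_bytes, uplink_bytes_per_ms=50.0))  # -> (2, 47.0)
```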

1 Image recognition is also commonly referred to as image classification.

2 Note: Recent work using TrueNorth in a stylized fashion allows it to be used to compute reduced precision neural networks [14]. These types of neural networks are discussed in Chapter 7.

3 Without a nonlinear function, multiple layers could be collapsed into one.

4 A large learning rate increases the step size applied at each iteration, which can help speed up the training, but may also result in overshooting the minimum or cause the optimization to not converge. A small learning rate decreases the step size applied at each iteration, which slows down the training but increases the likelihood of convergence. There are various methods to set the learning rate, such as ADAM [18]. Finding the best learning rate is one of the key challenges in training DNNs.

5 To backpropagate through each layer: (1) compute the gradient of the loss relative to the weights, $\partial L / \partial W_{ij}$, from the layer inputs (i.e., the forward activations, $x_i$) and the gradients of the loss relative to the layer outputs, $\partial L / \partial y_j$; and (2) compute the gradient of the loss relative to the layer inputs, $\partial L / \partial x_i$, from the layer weights, $W_{ij}$, and the gradients of the loss relative to the layer outputs, $\partial L / \partial y_j$.

6 There are various forms of gradient descent, which differ in how frequently the weights are updated. Batch Gradient Descent updates the weights after computing the loss on the entire training set, which is computationally expensive and requires significant storage. Stochastic Gradient Descent updates the weights after computing the loss on a single training example, and the examples are shuffled after going through the entire training set. While it is fast, looking at a single example can be noisy and cause the weights to move in the wrong direction. Finally, Mini-batch Gradient Descent divides the training set into smaller sets called mini-batches, and updates the weights based on the loss of each mini-batch (commonly referred to simply as a “batch”); this approach is the most commonly used. In general, each pass through the entire training set is referred to as an epoch.

7 In the early 1960s, single neuron systems built out of analog logic were used for adaptive filtering [21, 22].

8 The Top-5 error rate is measured based on whether the correct answer appears in one of the top five categories selected by the algorithm.

9 One of the major drawbacks of DNNs is their need for large datasets to prevent overfitting during training.

10 As a reference, running a DNN on an embedded device is estimated to consume several orders of magnitude more energy per pixel than video compression, which is a common form of processing near the image sensor [48].
