A Guide to Convolutional Neural Networks for Computer Vision - Salman Khan

CHAPTER 1

Introduction

Computer vision and machine learning have together played decisive roles in the development of a variety of image-based applications within the last decade (e.g., various services provided by Google, Facebook, Microsoft, and Snapchat). During this time, vision-based technology has transformed from just a sensing modality into intelligent computing systems which can understand the real world. Thus, acquiring computer vision and machine learning (e.g., deep learning) knowledge is an important skill that is required in many modern innovative businesses and is likely to become even more important in the near future.

1.1 WHAT IS COMPUTER VISION?

Humans use their eyes and their brains to see and understand the 3D world around them. For example, given an image as shown in Fig. 1.1a, humans can easily see a “cat” in the image and thus, categorize the image (classification task); localize the cat in the image (classification plus localization task as shown in Fig. 1.1b); localize and label all objects that are present in the image (object detection task as shown in Fig. 1.1c); and segment the individual objects that are present in the image (instance segmentation task as shown in Fig. 1.1d). Computer vision is the science that aims to give a similar, if not better, capability to computers. More precisely, computer vision seeks to develop methods which are able to replicate one of the most amazing capabilities of the human visual system, i.e., inferring characteristics of the 3D real world purely using the light reflected to the eyes from various objects.

However, recovering and understanding the 3D structure of the world from two-dimensional images captured by cameras is a challenging task. Researchers in computer vision have been developing mathematical techniques to recover the three-dimensional shape and appearance of objects/scene from images. For example, given a large enough set of images of an object captured from a variety of views (Fig. 1.2), computer vision algorithms can reconstruct an accurate dense 3D surface model of the object using dense correspondences across multiple views. However, despite all of these advances, understanding images at the same level as humans still remains challenging.

1.1.1 APPLICATIONS

Due to the significant progress in the field of computer vision and visual sensor technology, computer vision techniques are being used today in a wide variety of real-world applications, such as intelligent human-computer interaction, robotics, and multimedia. It is also expected that the next generation of computers could even understand human actions and languages at the same level as humans, carry out some missions on behalf of humans, and respond to human commands in a smart way.

Figure 1.1: What do we want computers to do with the image data? To look at the image and perform classification, classification plus localization (i.e., to find a bounding box around the main object (CAT) in the image and label it), to localize all objects that are present in the image (CAT, DOG, DUCK) and to label them, or perform semantic instance segmentation, i.e., the segmentation of the individual objects within a scene, even if they are of the same type.

Figure 1.2: Given a set of images of an object (e.g., upper human body) captured from six different viewpoints, a dense 3D model of the object can be reconstructed using computer vision algorithms.

Human-computer Interaction

Nowadays, video cameras are widely used for human-computer interaction and in the entertainment industry. For instance, hand gestures are used in sign language to communicate, transfer messages in noisy environments, and interact with computer games. Video cameras provide a natural and intuitive way of human communication with a device. Therefore, one of the most important aspects for these cameras is the recognition of gestures and short actions from videos.

Robotics

Integrating computer vision technologies with high-performance sensors and cleverly designed hardware has given rise to a new generation of robots which can work alongside humans and perform many different tasks in unpredictable environments. For example, an advanced humanoid robot can jump, talk, run, or walk up stairs in much the same way as a human does. It can also recognize and interact with people. In general, an advanced humanoid robot can perform various activities that are mere reflexes for humans and do not require a high intellectual effort.

Multimedia

Computer vision technology plays a key role in multimedia applications. This has led to a massive research effort in the development of computer vision algorithms for processing, analyzing, and interpreting multimedia data. For example, given a video, one can ask “What does this video mean?”, which involves the quite challenging task of image/video understanding and summarization. As another example, given a video clip, computers could search the Internet and retrieve millions of similar videos. More interestingly, when one gets tired of watching a long movie, computers could automatically summarize the movie for them.

1.1.2 IMAGE PROCESSING VS. COMPUTER VISION

Image processing can be considered as a preprocessing step for computer vision. More precisely, the goal of image processing is to extract fundamental image primitives, such as edges and corners, using operations such as filtering and morphology. These image primitives are usually represented as images. For example, in order to perform semantic image segmentation (Fig. 1.1), which is a computer vision task, one might need to apply some filtering on the image (an image processing task) during that process.
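As a rough illustration of filtering as an image processing step, the sketch below slides a 3x3 horizontal Sobel kernel over a tiny grayscale image and returns the response as a new (smaller) image. The function name `filter2d` and the toy image are assumptions made for this example only; a practical system would rely on an optimized library rather than nested Python loops.

```python
# Horizontal Sobel kernel: responds to left-to-right intensity changes.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def filter2d(image, kernel):
    """Slide `kernel` over `image` (valid region only) and return the responses.

    This computes cross-correlation, i.e., the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 4x5 grayscale image: dark on the left, bright on the right.
image = [[0, 0, 10, 10, 10],
         [0, 0, 10, 10, 10],
         [0, 0, 10, 10, 10],
         [0, 0, 10, 10, 10]]

edges = filter2d(image, SOBEL_X)
print(edges)  # large values mark the vertical edge, zeros mark the flat region
```

The output is itself an image of edge strengths, which matches the remark that image primitives are usually represented as images.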

Unlike image processing, which is mainly focused on processing raw images without giving any knowledge feedback on them, computer vision produces semantic descriptions of images. Based on the abstraction level of the output information, computer vision tasks can be divided into three different categories, namely low-level, mid-level, and high-level vision.

Low-level Vision

Based on the extracted image primitives, low-level vision tasks can be performed on images/videos. Image matching is an example of a low-level vision task. It is defined as the automatic identification of corresponding image points in a given pair of images of the same scene taken from different viewpoints, or of a moving scene captured by a fixed camera. Identifying image correspondences is an important problem in computer vision for geometry and motion recovery.

Another fundamental low-level vision task is optical flow computation and motion analysis. Optical flow is the pattern of the apparent motion of objects, surfaces, and edges in a visual scene caused by the movement of an object or camera. Optical flow is a 2D vector field where each vector corresponds to a displacement vector showing the movement of points from one frame to the next. Most existing methods which estimate camera motion or object motion use optical flow information.
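To make the idea of a displacement vector concrete, the following sketch estimates the motion of a single patch between two frames by exhaustive block matching: it tries every shift within a small search window and keeps the one with the lowest sum of squared differences. Block matching is used here only as a simple stand-in, not as a specific optical flow method from the literature; the function names, patch size, and search radius are assumptions for illustration.

```python
def patch_ssd(frame_a, frame_b, r, c, dr, dc, size):
    """Sum of squared differences between the patch of frame_a at (r, c)
    and the patch of frame_b displaced by (dr, dc)."""
    total = 0
    for i in range(size):
        for j in range(size):
            diff = frame_a[r + i][c + j] - frame_b[r + dr + i][c + dc + j]
            total += diff * diff
    return total


def estimate_displacement(frame_a, frame_b, r, c, size=2, search=1):
    """Return the (dr, dc) within +/- `search` pixels that best matches the patch."""
    height, width = len(frame_b), len(frame_b[0])
    best, best_cost = (0, 0), None
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            # Skip candidates whose patch would fall outside frame_b.
            if not (0 <= r + dr and r + dr + size <= height):
                continue
            if not (0 <= c + dc and c + dc + size <= width):
                continue
            cost = patch_ssd(frame_a, frame_b, r, c, dr, dc, size)
            if best_cost is None or cost < best_cost:
                best, best_cost = (dr, dc), cost
    return best


# Two hypothetical 5x5 frames: a bright 2x2 blob moves one pixel to the right.
frame_1 = [[0, 0, 0, 0, 0],
           [0, 9, 9, 0, 0],
           [0, 9, 9, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0]]
frame_2 = [[0, 0, 0, 0, 0],
           [0, 0, 9, 9, 0],
           [0, 0, 9, 9, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0]]

flow = estimate_displacement(frame_1, frame_2, r=1, c=1)
print(flow)  # (row shift, column shift) of the patch anchored at (1, 1)
```

Repeating this estimate for a patch around every pixel would yield a dense 2D vector field of the kind described above.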

Mid-level Vision

Mid-level vision provides a higher level of abstraction than low-level vision. For instance, inferring the geometry of objects is one of the major aspects of mid-level vision. Geometric vision includes multi-view geometry, stereo, and structure from motion (SfM), which infer 3D scene information from 2D images so that 3D reconstruction becomes possible. Another task of mid-level vision is visual motion capturing and tracking, which estimates 2D and 3D motions, including deformable and articulated motions. In order to answer the question “How does the object move?,” image segmentation is required to find areas in the images which belong to the object.

High-level Vision

Based on an adequate segmented representation of the 2D and/or 3D structure of the image, extracted using lower level vision (e.g., low-level image processing, low-level and mid-level vision), high-level vision completes the task of delivering a coherent interpretation of the image. High-level vision determines what objects are present in the scene and interprets their interrelations. For example, object recognition and scene understanding are two high-level vision tasks which infer the semantics of objects and scenes, respectively. Achieving robust recognition, e.g., recognizing objects from different viewpoints, is still a challenging problem.

Another example of higher level vision is image understanding and video understanding. Based on information provided by object recognition, image and video understanding try to answer questions such as “Is there a tiger in the image?,” “Is this video a drama or an action?,” or “Is there any suspicious activity in a surveillance video?” Developing such high-level vision tasks helps to fulfill different higher level tasks in intelligent human-computer interaction, intelligent robots, smart environments, and content-based multimedia.

1.2 WHAT IS MACHINE LEARNING?

Computer vision algorithms have seen rapid progress in recent years. In particular, combining computer vision with machine learning contributes to the development of flexible and robust computer vision algorithms and thus improves the performance of practical vision systems. For instance, Facebook has combined computer vision, machine learning, and its large corpus of photos to achieve a robust and highly accurate facial recognition system. That is how Facebook can suggest who to tag in your photo. In the following, we first define machine learning and then describe the importance of machine learning for computer vision tasks.

Machine learning is a type of artificial intelligence (AI) which allows computers to learn from data without being explicitly programmed. In other words, the goal of machine learning is to design methods that automatically perform learning using observations of the real world (called the “training data”), without explicit definition of rules or logic by the humans (“trainer”/“supervisor”). In that sense, machine learning can be considered as programming by data samples. In summary, machine learning is about learning to do better in the future based on what was experienced in the past.

A diverse set of machine learning algorithms has been proposed to cover the wide variety of data and problem types. These learning methods can be divided into three main approaches, namely supervised, semi-supervised, and unsupervised. However, the majority of practical machine learning methods are currently supervised learning methods, because of their superior performance compared to their counterparts. In supervised learning methods, the training data takes the form of a collection of (data: x, label: y) pairs and the goal is to produce a prediction y* in response to a query sample x. The input x can be a feature vector, or more complex data such as images, documents, or graphs. Similarly, different types of output y have been studied. The output y can be a binary label which is used in a simple binary classification problem (e.g., “yes” or “no”). However, there have also been numerous research works on problems such as multi-class classification, where y is labeled by one of K labels; multi-label classification, where y can take several of the K labels simultaneously; and general structured prediction problems, where y is a high-dimensional output which is constructed from a sequence of predictions (e.g., semantic segmentation).

Supervised learning methods approximate a mapping function f(x) which can predict the output variables y for a given input sample x. Different forms of mapping function f(.) exist (some are briefly covered in Chapter 2), including decision trees, Random Decision Forests (RDF), logistic regression (LR), Support Vector Machines (SVM), Neural Networks (NN), kernel machines, and Bayesian classifiers. A wide range of learning algorithms has also been proposed to estimate these different types of mappings.
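As one small instance of such a mapping, the sketch below fits a logistic regression model f(x) = sigmoid(w . x + b) to a handful of labeled (x, y) pairs with batch gradient descent and then answers two query samples. The toy data, learning rate, number of epochs, and function names are all assumptions chosen for illustration; they are not taken from any particular system.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(samples, labels, lr=0.5, epochs=100):
    """Fit weights w and bias b of f(x) = sigmoid(w . x + b) by gradient descent."""
    n_features = len(samples[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            error = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            for k in range(n_features):
                grad_w[k] += error * x[k]
            grad_b += error
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return the predicted binary label y* for a query sample x."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical, zero-centered 2D training set of (x, y) pairs.
samples = [[-1.0, -0.8], [-0.9, -1.0], [-0.8, -0.9],
           [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_logistic_regression(samples, labels)
print(predict(w, b, [-0.5, -0.7]), predict(w, b, [0.6, 0.4]))
```

Swapping in a decision tree, an SVM, or a neural network changes the form of f and the learning algorithm, but the interface stays the same: train on labeled pairs, then predict for unseen inputs.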

On the other hand, unsupervised learning is where one would only have input data x and no corresponding output variables. It is called unsupervised learning because (unlike supervised learning) there are no ground-truth outputs and there is no teacher. The goal of unsupervised learning is to model the underlying structure/distribution of the data in order to discover interesting structure in it. The most common unsupervised learning methods are clustering approaches such as hierarchical clustering, k-means clustering, Gaussian Mixture Models (GMMs), Self-Organizing Maps (SOMs), and Hidden Markov Models (HMMs).
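For example, k-means clustering can be sketched in a few lines: it alternates between assigning each unlabeled point to its nearest centroid and moving each centroid to the mean of its assigned points. The version below is a minimal sketch with a fixed number of iterations and hand-picked initial centroids; both choices, along with the toy data and function names, are assumptions for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) starting from the given centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(point, centroids[k]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical unlabeled 2D data forming two well-separated groups.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
          [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]

centroids, assignments = kmeans(points, initial_centroids=[points[0], points[3]])
print(assignments)
print(centroids)
```

No labels are used anywhere: the grouping emerges only from the distances between the input samples.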

Semi-supervised learning methods sit in-between supervised and unsupervised learning. These learning methods are used when a large amount of input data is available and only some of the data is labeled. A good example is a photo archive where only some of the images are labeled (e.g., dog, cat, person), and the majority are unlabeled.

1.2.1 WHY DEEP LEARNING?

While these machine learning algorithms have been around for a long time, the ability to automatically apply complex mathematical computations to large-scale data is a recent development. This is because the increased power of today’s computers, in terms of speed and memory, has helped machine learning techniques evolve to learn from a large corpus of training data. For example, with more computing power and a large enough memory, one can create neural networks of many layers, which are called deep neural networks. There are three key advantages which are offered by deep learning.

Simplicity: Instead of problem specific tweaks and tailored feature detectors, deep networks offer basic architectural blocks, network layers, which are repeated several times to generate large networks.

Scalability: Deep learning models are easily scalable to huge datasets. Other competing methods, e.g., kernel machines, encounter serious computational problems if the datasets are huge.

Domain transfer: A model learned on one task is applicable to other related tasks and the learned features are general enough to work on a variety of tasks which may have scarce data available.
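The simplicity argument can be illustrated with a short sketch in which one basic block, a fully connected layer followed by a ReLU, is repeated to form a deeper network. The weights here are random and untrained, and the layer sizes and function names are assumptions for this example; the point is only that depth is obtained by repeating the same block, not by designing new components.

```python
import random


def dense_relu(inputs, weights, biases):
    """One basic block: a fully connected layer followed by a ReLU nonlinearity."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def make_layer(n_in, n_out, rng):
    """Randomly initialized (untrained) parameters, for illustration only."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def build_network(layer_sizes, seed=0):
    """Create one block per consecutive pair of sizes in `layer_sizes`."""
    rng = random.Random(seed)
    return [make_layer(n_in, n_out, rng)
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


def forward(network, inputs):
    """Pass the input through every block in order."""
    activations = inputs
    for weights, biases in network:
        activations = dense_relu(activations, weights, biases)
    return activations


network = build_network([4, 8, 8, 8, 3])  # four blocks: 4->8, 8->8, 8->8, 8->3
output = forward(network, [0.5, -1.0, 2.0, 0.1])
print(len(network), output)
```

Making the network deeper only requires adding entries to the list of layer sizes.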

Due to the tremendous success in learning these deep neural networks, deep learning techniques are currently state-of-the-art for the detection, segmentation, classification and recognition (i.e., identification and verification) of objects in images. Researchers are now working to apply these successes in pattern recognition to more complex tasks such as medical diagnoses and automatic language translation. Convolutional Neural Networks (ConvNets or CNNs) are a category of deep neural networks which have proven to be very effective in areas such as image recognition and classification (see Chapter 7 for more details). Due to the impressive results of CNNs in these areas, this book is mainly focused on CNNs for computer vision tasks. Figure 1.3 illustrates the relation between computer vision, machine learning, human vision, deep learning, and CNNs.

1.3 BOOK OVERVIEW

CHAPTER 2

The book begins in Chapter 2 with a review of the traditional feature representation and classification methods. Computer vision tasks, such as image classification and object detection, have traditionally been approached using hand-engineered features which are divided into two main categories: global features and local features. Due to the popularity of the low-level representation, this chapter first reviews three widely used low-level hand-engineered descriptors, namely Histogram of Oriented Gradients (HOG) [Triggs and Dalal, 2005], Scale-Invariant Feature Transform (SIFT) [Lowe, 2004], and Speeded-Up Robust Features (SURF) [Bay et al., 2008]. A typical computer vision system feeds these hand-engineered features to machine learning algorithms to classify images/videos. Two widely used machine learning algorithms, namely SVM [Cortes, 1995] and RDF [Breiman, 2001, Quinlan, 1986], are also introduced in detail.

Figure 1.3: The relation between human vision, computer vision, machine learning, deep learning, and CNNs.

CHAPTER 3

The performance of a computer vision system is highly dependent on the features used. Therefore, current progress in computer vision has been based on the design of feature learners which minimize the gap between high-level representations (interpreted by humans) and low-level features (detected by HOG [Triggs and Dalal, 2005] and SIFT [Lowe, 2004] algorithms). Deep neural networks are one of the well-known and popular feature learners which allow the removal of complicated and problematic hand-engineered features. Unlike the standard feature extraction algorithms (e.g., SIFT and HOG), deep neural networks use several hidden layers to hierarchically learn the high-level representation of an image. For instance, the first layer might detect edges and curves in the image, the second layer might detect object body-parts (e.g., hands or paws or ears), the third layer might detect the whole object, etc. In this chapter, we provide an introduction to deep neural networks, their computational mechanism and their historical background. Two generic categories of deep neural networks, namely feed-forward and feed-back networks, with their corresponding learning algorithms are explained in detail.

CHAPTER 4

CNNs are a prime example of deep learning methods and have been most extensively studied. Due to the lack of training data and computing power in the early days, it was hard to train a large high-capacity CNN without overfitting. After the rapid growth in the amount of annotated data and the recent improvements in the strengths of Graphics Processing Units (GPUs), research on CNNs has grown rapidly and achieved state-of-the-art results on various computer vision tasks. In this chapter, we provide a broad survey of the recent advances in CNNs, including state-of-the-art layers (e.g., convolution, pooling, nonlinearity, fully connected, transposed convolution, ROI pooling, spatial pyramid pooling, VLAD, spatial transformer layers), weight initialization approaches (e.g., Gaussian, uniform and orthogonal random initialization, unsupervised pre-training, Xavier and Rectified Linear Unit (ReLU)-aware scaled initialization, and supervised pre-training), regularization approaches (e.g., data augmentation, dropout, drop-connect, batch normalization, ensemble averaging, ℓ1 and ℓ2 regularization, elastic net, max-norm constraint, early stopping), and several loss functions (e.g., soft-max, SVM hinge, squared hinge, Euclidean, contrastive, and expectation loss).
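To give a flavor of these building blocks before they are treated in detail, the sketch below implements three of them on plain Python lists: a ReLU nonlinearity, non-overlapping 2x2 max pooling, and the soft-max (cross-entropy) loss. The function names and the toy feature map are assumptions for illustration, and the code ignores batching, channels, and gradients.

```python
import math


def relu(feature_map):
    """Element-wise nonlinearity: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling (assumes even height and width)."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        pooled.append([
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]), 2)
        ])
    return pooled


def softmax_loss(scores, target):
    """Cross-entropy of the soft-max probabilities against the true class index."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    return -math.log(exps[target] / sum(exps))


# Hypothetical 4x4 feature map, e.g., the output of a convolution layer.
feature_map = [[1.0, -2.0, 3.0, 0.5],
               [-1.0, 4.0, -3.0, 2.0],
               [0.0, -1.0, -2.0, -0.5],
               [2.5, 1.0, -4.0, -1.0]]

pooled = max_pool_2x2(relu(feature_map))
print(pooled)
print(softmax_loss([2.0, 2.0], target=0))  # two equally scored classes
```

Pooling halves each spatial dimension, while the loss is smallest when the score of the true class dominates the others.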

CHAPTER 5

The CNN training process involves the optimization of its parameters such that the loss function is minimized. This chapter reviews well-known and popular gradient-based training algorithms (e.g., batch gradient descent, stochastic gradient descent, mini-batch gradient descent) followed by state-of-the-art optimizers (e.g., Momentum, Nesterov momentum, AdaGrad, AdaDelta, RMSprop, Adam) which address the limitations of the gradient descent learning algorithms. In order to make this book a self-contained guide, this chapter also discusses the different approaches that are used to compute differentials of the most popular CNN layers which are employed to train CNNs using the error back-propagation algorithm.
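As a small preview, the following sketch compares plain gradient descent with a classical momentum update on a one-dimensional toy loss L(w) = (w - 3)^2. The loss, the learning rate, the momentum coefficient `beta`, and the step count are assumptions chosen for illustration; the momentum rule shown is one common formulation, and how the two methods compare depends on the problem and on these hyperparameters.

```python
def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, lr=0.01, steps=100):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        w -= lr * gradient(w)
    return w


def momentum_descent(w, lr=0.01, beta=0.9, steps=100):
    """Gradient descent with a velocity term that carries over between steps."""
    velocity = 0.0
    for _ in range(steps):
        # The velocity accumulates an exponentially decaying sum of past gradients.
        velocity = beta * velocity - lr * gradient(w)
        w += velocity
    return w


w_gd = gradient_descent(0.0)
w_momentum = momentum_descent(0.0)
print(w_gd, w_momentum)  # both move from 0 toward the minimum at 3
```

With this small learning rate, the accumulated velocity lets the momentum variant get closer to the minimum in the same number of steps.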

CHAPTER 6

This chapter introduces the most popular CNN architectures which are formed using the basic building blocks studied in Chapter 4 and Chapter 7. Both early CNN architectures which are easier to understand (e.g., LeNet, NiN, AlexNet, VGGnet) and more recent ones (e.g., GoogLeNet, ResNet, ResNeXt, FractalNet, DenseNet), which are relatively complex, are presented in detail.

CHAPTER 7

This chapter reviews various applications of CNNs in computer vision, including image classification, object detection, semantic segmentation, scene labeling, and image generation. For each application, the popular CNN-based models are explained in detail.

CHAPTER 8

Deep learning methods have resulted in significant performance improvements in computer vision applications and, thus, several software frameworks have been developed to facilitate these implementations. This chapter presents a comparative study of nine widely used deep learning frameworks, namely Caffe, TensorFlow, MatConvNet, Torch7, Theano, Keras, Lasagne, Marvin, and Chainer, on different aspects. This chapter helps the readers to understand the main features of these frameworks (e.g., the provided interface and platforms for each framework) and, thus, the readers can choose the one which suits their needs best.
