
1.2.5 GoogLeNet


In 2014, Google [5] proposed the Inception network for the ImageNet Challenge, addressing both the detection and the classification tasks. The basic unit of this model is called the "Inception cell": a set of parallel convolutional layers with different filter sizes, which performs a series of convolutions at different scales and concatenates the results; the different filter sizes extract feature maps at different scales. To reduce the input channel depth, and with it the computational cost, 1 × 1 convolutions are used. So that the branch outputs can be concatenated properly, max pooling with "same" padding is used, which preserves the spatial dimensions. Later versions of the architecture include Inception v2, v3, and v4, as well as Inception-ResNet. Figure 1.5 shows the Inception module and Figure 1.6 shows the architecture of GoogLeNet.
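As an illustration, a minimal PyTorch sketch of such an Inception cell is given below; the class name and the per-branch filter counts are placeholders rather than values fixed by the paper. Each of the larger-kernel branches is preceded by a 1 × 1 reduction, and the padding keeps the spatial size unchanged so the branch outputs can be concatenated along the channel axis:

import torch
import torch.nn as nn

class InceptionCell(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions plus a pooled branch,
    concatenated along the channel dimension."""
    def __init__(self, in_ch, c1x1, c3x3_red, c3x3, c5x5_red, c5x5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(
            nn.Conv2d(in_ch, c1x1, kernel_size=1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(  # 1x1 reduce, then 3x3 with "same" padding
            nn.Conv2d(in_ch, c3x3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3x3_red, c3x3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(  # 1x1 reduce, then 5x5 with "same" padding
            nn.Conv2d(in_ch, c5x5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5x5_red, c5x5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(  # 3x3 max pool (stride 1, "same"), then 1x1 projection
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)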

Each image is resized so that the input to the network is a 224 × 224 × 3 image, and the mean is subtracted before the training image is fed to the network. The dataset contains 1,000 categories, with 1.2 million images for training, 100,000 for testing, and 50,000 for validation. GoogLeNet is 22 layers deep, uses nine Inception modules, and replaces the fully connected layers with global average pooling to go from 7 × 7 × 1,024 to 1 × 1 × 1,024, which saves a huge number of parameters. It also includes auxiliary softmax classifiers during training, which act as regularizers. The network was trained on a few high-end GPUs within a week and achieved a top-5 error rate of 6.67%. GoogLeNet trains faster than VGG, and a pre-trained GoogLeNet is considerably smaller than a pre-trained VGG.
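The saving from global average pooling can be checked directly: averaging a 7 × 7 × 1,024 feature map down to 1 × 1 × 1,024 needs no parameters at all, whereas a fully connected layer over the flattened map would need more than 51 million. A small PyTorch check (a sketch, not code from the paper):

import torch
import torch.nn as nn

x = torch.randn(1, 1024, 7, 7)                  # feature map entering the classifier
gap = nn.AdaptiveAvgPool2d(1)                   # global average pooling, zero parameters
print(gap(x).shape)                             # torch.Size([1, 1024, 1, 1])

fc = nn.Linear(7 * 7 * 1024, 1024)              # the fully connected alternative
print(sum(p.numel() for p in fc.parameters()))  # 51,381,248 weights and biases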

Table 1.5 Various parameters of VGG-16.

Layer name | Input size | Filter size | Window size | # Filters | Stride/Padding | Output size | # Feature maps | # Parameters
Conv 1 | 224 × 224 | 3 × 3 | - | 64 | 1/1 | 224 × 224 | 64 | 1,792
Conv 2 | 224 × 224 | 3 × 3 | - | 64 | 1/1 | 224 × 224 | 64 | 36,928
Max-pooling 1 | 224 × 224 | - | 2 × 2 | - | 2/0 | 112 × 112 | 64 | 0
Conv 3 | 112 × 112 | 3 × 3 | - | 128 | 1/1 | 112 × 112 | 128 | 73,856
Conv 4 | 112 × 112 | 3 × 3 | - | 128 | 1/1 | 112 × 112 | 128 | 147,584
Max-pooling 2 | 112 × 112 | - | 2 × 2 | - | 2/0 | 56 × 56 | 128 | 0
Conv 5 | 56 × 56 | 3 × 3 | - | 256 | 1/1 | 56 × 56 | 256 | 295,168
Conv 6 | 56 × 56 | 3 × 3 | - | 256 | 1/1 | 56 × 56 | 256 | 590,080
Conv 7 | 56 × 56 | 3 × 3 | - | 256 | 1/1 | 56 × 56 | 256 | 590,080
Max-pooling 3 | 56 × 56 | - | 2 × 2 | - | 2/0 | 28 × 28 | 256 | 0
Conv 8 | 28 × 28 | 3 × 3 | - | 512 | 1/1 | 28 × 28 | 512 | 1,180,160
Conv 9 | 28 × 28 | 3 × 3 | - | 512 | 1/1 | 28 × 28 | 512 | 2,359,808
Conv 10 | 28 × 28 | 3 × 3 | - | 512 | 1/1 | 28 × 28 | 512 | 2,359,808
Max-pooling 4 | 28 × 28 | - | 2 × 2 | - | 2/0 | 14 × 14 | 512 | 0
Conv 11 | 14 × 14 | 3 × 3 | - | 512 | 1/1 | 14 × 14 | 512 | 2,359,808
Conv 12 | 14 × 14 | 3 × 3 | - | 512 | 1/1 | 14 × 14 | 512 | 2,359,808
Conv 13 | 14 × 14 | 3 × 3 | - | 512 | 1/1 | 14 × 14 | 512 | 2,359,808
Max-pooling 5 | 14 × 14 | - | 2 × 2 | - | 2/0 | 7 × 7 | 512 | 0
Fully connected 1 | 7 × 7 × 512 | - | - | - | - | 4,096 neurons | - | 102,764,544
Fully connected 2 | 4,096 | - | - | - | - | 4,096 neurons | - | 16,781,312
Fully connected 3 | 4,096 | - | - | - | - | 1,000 neurons | - | 4,097,000
Softmax | 1,000 | - | - | - | - | 1,000 classes | - | 0
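The counts in the last column follow from (filter height × filter width × input channels + 1) × number of filters, the +1 being the bias term; the short script below, with the layer list transcribed from the table, reproduces them:

# Reproduce the per-layer parameter counts of Table 1.5.
# Each 3x3 conv has (3*3*in_ch + 1) * out_ch parameters (the +1 is the bias).
convs = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256), (256, 256),
         (256, 256), (256, 512), (512, 512), (512, 512), (512, 512),
         (512, 512), (512, 512)]
for i, (in_ch, out_ch) in enumerate(convs, 1):
    print(f"Conv {i}: {(3 * 3 * in_ch + 1) * out_ch:,}")

fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]
for i, (n_in, n_out) in enumerate(fcs, 1):
    print(f"FC {i}: {(n_in + 1) * n_out:,}")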

Figure 1.5 Inception module.


Figure 1.6 Architecture of GoogLeNet.

First layer: The input image is 224 × 224 × 3. The first convolutional layer uses 64 kernels of size 7 × 7 × 3 with stride 2, producing a 112 × 112 × 64 output feature map. This is followed by ReLU and by max pooling with a 3 × 3 kernel and stride 2, which reduces the feature map to 56 × 56 × 64, after which local response normalization is applied.
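In code, this stem might be sketched as follows (a PyTorch approximation; the padding values and the LRN size are assumptions chosen so the stated output sizes work out):

import torch
import torch.nn as nn

stem1 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),  # 224x224x3 -> 112x112x64
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # 112x112x64 -> 56x56x64
    nn.LocalResponseNorm(size=5),                          # LRN; size=5 is an assumption
)
print(stem1(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 56, 56])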

Second layer: This is a simplified Inception block. A 1 × 1 convolution with 64 filters generates feature maps from the previous layer's output before a 3 × 3 convolution (stride 1) with 192 filters is applied; ReLU and local response normalization follow. Finally, a 3 × 3 max pooling with stride 2 produces an output of 192 feature maps of size 28 × 28.
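Continuing the sketch under the same assumptions, the second layer could be written as:

import torch
import torch.nn as nn

stem2 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=1),                        # 1x1 reduce, 64 filters
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 192, kernel_size=3, stride=1, padding=1),  # 3x3, 192 filters: 56x56x192
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 56x56x192 -> 28x28x192
)
print(stem2(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 192, 28, 28])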

Third layer: This is a complete Inception module. The previous layer's output is 28 × 28 with 192 channels, and four parallel branches originate from it. The first branch applies a 1 × 1 convolution with 64 filters and ReLU, generating a 28 × 28 × 64 feature map; the second branch applies a 1 × 1 convolution with 96 filters (with ReLU) before a 3 × 3 convolution with 128 filters, generating a 28 × 28 × 128 feature map; the third branch applies a 1 × 1 convolution with 16 filters (with ReLU) before a 5 × 5 convolution with 32 filters, generating a 28 × 28 × 32 feature map; the fourth branch applies a 3 × 3 max pooling layer followed by a 1 × 1 convolution with 32 filters, generating a 28 × 28 × 32 feature map. Concatenating the generated feature maps yields a 28 × 28 output with 64 + 128 + 32 + 32 = 256 channels.
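Instantiating the InceptionCell sketch from above with these filter counts confirms the arithmetic (inc3a is a hypothetical name, and the class is assumed to be in scope):

import torch

inc3a = InceptionCell(192, c1x1=64, c3x3_red=96, c3x3=128,
                      c5x5_red=16, c5x5=32, pool_proj=32)
print(inc3a(torch.randn(1, 192, 28, 28)).shape)
# torch.Size([1, 256, 28, 28]) -- 64 + 128 + 32 + 32 = 256 channels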

The fourth layer is the next Inception module, whose input is 28 × 28 × 256. The branches are: a 1 × 1 convolution with 128 filters and ReLU; a 1 × 1 convolution with 128 filters as a reduction before a 3 × 3 convolution with 192 filters; a 1 × 1 convolution with 32 filters as a reduction before a 5 × 5 convolution with 96 filters; and a 3 × 3 max pooling with padding 1 before a 1 × 1 convolution with 64 filters. The branch outputs are 28 × 28 × 128, 28 × 28 × 192, 28 × 28 × 96, and 28 × 28 × 64, respectively, so the concatenated output is 28 × 28 × 480. Table 1.6 shows the parameters of GoogLeNet.
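Chaining the hypothetical stem1, stem2, and inc3a objects from the earlier sketches with such a second Inception cell reproduces the stated output size:

import torch
import torch.nn as nn

inc3b = InceptionCell(256, c1x1=128, c3x3_red=128, c3x3=192,
                      c5x5_red=32, c5x5=96, pool_proj=64)
net = nn.Sequential(stem1, stem2, inc3a, inc3b)
print(net(torch.randn(1, 3, 224, 224)).shape)
# torch.Size([1, 480, 28, 28]) -- 128 + 192 + 96 + 64 = 480 channels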
