
1.1.6 Sparse coding


In the previous subsection, we introduced several models of natural image statistics, which produce results similar to the responses of the retina and the LGN in the HVS. These models only remove the first- and second-order redundancy in images. Now, we will introduce two models that were the first successful attempts to reproduce responses similar to those found in simple cells in the visual cortex; these models suppress higher-order redundancy in images. When these models are used to interpret visual perception, the data are pre-processed by whitening and DC removal, which is consistent with the pre-processing performed by the ganglion and LGN cells in the HVS.

Although these two models are milestones in mimicking the responses of simple cells to natural images, their computational methods are quite time-consuming. Here, we only focus on the main idea behind these models, and in chapter 4 we will introduce some efficient methods to obtain the same results.

The first model was proposed by Olshausen and Field (Olshausen and Field 1996). They used a one-layer network and trained it with natural image patches to extract distinctive features for natural image coding. According to this study, V1 contains about 200 million simple cells, while the number of ganglion and LGN cells responsible for visual perception is only just over one million. This indicates that sparse coding is an effective strategy for data redundancy reduction and efficient image representation.

Sparse coding means that a given image can typically be described in terms of a small number of suitable basis functions chosen out of a large set learned from training data. A heavy-tailed distribution of representation coefficients is often observed, as illustrated in figure 1.10. For instance, if we consider a natural image patch as a vector x, as shown in figure 1.11, this vector can be represented by just two components, i.e. numbers 3 and 6 out of the 12 features in total. To generalize this idea, a typical sparse encoding strategy is to approximate an image as a linear combination of basis functions:

x = \sum_{i=1}^{K} \alpha_i \phi_i, \qquad (1.16)

where αi is a representation coefficient or an active coefficient for the ith basis function, and ϕi is the ith basis function.
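
As a toy illustration of equation (1.16), the following sketch builds a patch from only two active coefficients out of twelve, mirroring figure 1.11. The dictionary here is random and the sizes are illustrative; they are not the learned basis functions discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d = 12, 64                       # 12 basis functions, 8x8 = 64-pixel patches (illustrative sizes)
Phi = rng.standard_normal((d, K))   # columns are basis functions phi_i (random stand-ins)
Phi /= np.linalg.norm(Phi, axis=0)  # normalize each basis function

# Sparse code: only coefficients 3 and 6 (0-based indices 2 and 5) are active
alpha = np.zeros(K)
alpha[[2, 5]] = [1.5, -0.8]

# Equation (1.16): x is a linear combination of the active basis functions
x = Phi @ alpha
print("active coefficients:", np.flatnonzero(alpha))
```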


Figure 1.11. An image modeled as a linear superposition of basis functions. Sparse encoding learns basis functions that efficiently capture the structures in a specific domain, such as natural images. Adapted from figure 1 in Baraniuk (2007) with permission. Copyright 2007 IEEE.

In their work published in 1996 (Olshausen and Field 1996), they trained a feed-forward artificial neural network on natural image patches to obtain a sparse representation with an over-complete basis. In sparse coding, the search process should match, as closely as possible, the distribution of images described by the linear image model under sparsity constraints to the corresponding training targets (figure 1.12). For this purpose, Lagrangian optimization was used to describe this problem, and the final objective function can be formulated as follows:

\min_{\alpha,\phi} \sum_{j=1}^{m} \left( \left\| x_j - \sum_{i=1}^{K} \alpha_{j,i}\phi_i \right\|^2 + \lambda \sum_{i} S(\alpha_{j,i}) \right), \qquad (1.17)

where xj is an image patch extracted from a natural image, αj,i is the representation coefficient for basis function ϕi in image patch xj, S is a sparsity measure, and λ is a weighting parameter. This formula contains two components: the first term computes the reconstruction error while the second term imposes the sparsity penalty.
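
As a rough computational analogue of equation (1.17) with an L1-type sparsity measure (the Lasso case discussed below), the sketch below runs scikit-learn's dictionary learning. The random `patches` array is only a stand-in for real whitened 12 × 12 natural-image patches, and the hyperparameters are illustrative rather than those used by Olshausen and Field.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# `patches` is assumed to be an (m, 144) array of whitened 12x12 natural-image
# patches; random data stands in here so the snippet runs on its own.
rng = np.random.default_rng(0)
patches = rng.standard_normal((5000, 144))

# alpha weights the L1 sparsity penalty, playing the role of lambda in (1.17)
dico = MiniBatchDictionaryLearning(n_components=144, alpha=1.0, random_state=0)
codes = dico.fit_transform(patches)   # sparse coefficients alpha_{j,i}
basis = dico.components_              # learned basis functions phi_i (one per row)

print("fraction of non-zero coefficients:", np.mean(codes != 0))
```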


Figure 1.12. Sparse representation characterized by a generalized Gaussian distribution of representation coefficients, which generates sparse coefficients in terms of an over-complete dictionary. (a) An image is represented by a small number of ‘active’ code elements and (b) the probability distribution of its ‘activities’. Lena image © Playboy Enterprises, Inc.

Although this formula is quite simple and easy to comprehend, it leaves an open question: how does one measure sparseness mathematically? As a reference point, the distribution of a zero-mean random variable can be compared to the Gaussian distribution with the same mean and variance. The rationale for selecting the Gaussian distribution as the reference is that, among all probability distributions with the same variance, the Gaussian has the largest entropy. Thus, if the distribution of interest is more concentrated than the Gaussian distribution, it can be regarded as sparse. Based on this consideration, sparseness can be measured heuristically. The criteria for a sparsity function to work as intended are to emphasize values that are close to zero and values that are much larger than a positive constant, such as 1 for a normalized/whitened random variable. A coefficient distribution satisfying these two requirements is heavy tailed, i.e. many coefficients are insignificant and the significant coefficients are few, so that the resultant image representation is sparse. Interestingly, if we use S(x)=∣x∣, the coding process amounts to solving a Lasso problem, i.e. the regularization term is the L1 norm. This explains why we often use the L1 norm to obtain a sparse solution.
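
One common heuristic along these lines, offered here only as an illustration, compares a distribution's excess kurtosis against the Gaussian reference with the same variance: a zero-mean, unit-variance Laplacian is more peaked at zero and heavier tailed, so its excess kurtosis comes out positive while the Gaussian's is near zero.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
n = 100_000

# Two zero-mean, unit-variance random variables: a Gaussian reference and a
# heavier-tailed Laplacian (scale chosen so that its variance is also 1).
gauss = rng.normal(0.0, 1.0, n)
laplace = rng.laplace(0.0, 1.0 / np.sqrt(2.0), n)

# Excess kurtosis is one heuristic sparseness measure: it is ~0 for a Gaussian
# and positive for distributions more concentrated at zero with heavy tails.
print("Gaussian excess kurtosis :", kurtosis(gauss))
print("Laplacian excess kurtosis:", kurtosis(laplace))
```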

By training their network with image patches of 12 × 12 pixels, they obtained 144 basis functions, as shown in figure 1.13. Recall that the patches were whitened before being fed into the network. The basis functions obtained by sparse coding of natural images are Gabor-like, similar to the receptive fields of simple cells in V1. Hence, these basis functions model the receptive fields of simple cells in V1 very well.
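
Since whitening comes up repeatedly in this section (the ICA filters in figure 1.16 below were likewise learned on ZCA-whitened patches), here is a minimal ZCA-whitening sketch. It follows a standard recipe and is not necessarily the exact pre-processing used in the original papers.

```python
import numpy as np

def zca_whiten(patches, eps=1e-5):
    """ZCA-whiten an (m, d) array of image patches (a standard recipe)."""
    X = patches - patches.mean(axis=0)          # remove the mean/DC component
    cov = X.T @ X / X.shape[0]                  # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X @ W                                # decorrelated, unit-variance patches

# Example: whiten 12x12 patches flattened to 144-dimensional vectors
rng = np.random.default_rng(0)
raw = rng.standard_normal((5000, 144))          # stand-in for real patches
white = zca_whiten(raw)
print(np.round(np.cov(white, rowvar=False)[:3, :3], 2))  # near-identity covariance
```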


Figure 1.13. Basis functions learned by the sparse coding algorithm. All were normalized, with zero always represented by the same gray level. Reproduced with permission from Olshausen and Field (1997). Copyright 1997 Elsevier.

The second model (Bell and Sejnowski 1997) was proposed based on the independent component analysis (ICA) principle. ICA is an approach to solving the blind source separation (BSS) problem (figure 1.14). In natural image statistics, let X={xi∣i=1,…,N} represent N independent source signals forming a column vector, Y={yj∣j=1,…,M} represent M image patches also forming a column vector, and W be the mixing matrix of M × N dimensions. The BSS problem is to invert the measurement model

Y = WX, \quad M \geqslant N, \qquad (1.18)

for both W and X, subject to uncertainties in the amplitudes and permutations of the independent source signals. ICA helps us find the basis components X={xi∣i=1,…,N}, which capture representative features of the image patches.
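
A minimal sketch of this decomposition using scikit-learn's FastICA is given below. The random `patches` array is again only a placeholder for real whitened natural-image patches, and FastICA stands in for the original information-theoretic (infomax) algorithm of Bell and Sejnowski.

```python
import numpy as np
from sklearn.decomposition import FastICA

# `patches` stands in for an (m, 144) array of ZCA-whitened 12x12 natural-image
# patches; with real data the recovered components are Gabor-like filters.
rng = np.random.default_rng(0)
patches = rng.standard_normal((5000, 144))

# Equation (1.18): each patch is modeled as a mixture of independent sources;
# FastICA estimates both the sources and the mixing matrix.
ica = FastICA(n_components=64, random_state=0)
sources = ica.fit_transform(patches)   # estimated independent components X
mixing = ica.mixing_                   # estimated mixing matrix W

print(sources.shape, mixing.shape)     # (5000, 64) and (144, 64)
```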


Figure 1.14. ICA to find both embedded independent components and the mixing matrix that blends the independent components.

The premise of ICA is statistical independence among hidden data sources. In information theory (see appendix A for more details), we use mutual information to measure the relationship between two signals. Let H(X) and H(Y) represent the self-information, which depends solely on the probability density functions of X and Y, respectively. H(X, Y) is the joint information, which represents the amount of information generated when X and Y occur together. I(X, Y) is the mutual information, i.e. the information about one signal that is gained by knowing the other (figure 1.15). For example, if we know Y, we only need an additional H(X) − I(X, Y) amount of information to determine X completely. When two signals are independent, the mutual information is zero (Chechile 2005).
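
These relationships can be checked numerically. The short sketch below uses an arbitrary joint distribution of two binary variables and computes H(X), H(Y), H(X, Y), and I(X, Y) = H(X) + H(Y) − H(X, Y); if the joint distribution factorized into the product of its marginals, I(X, Y) would drop to zero.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zero entries are ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution of two binary variables X and Y (an arbitrary example)
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
p_x = p_xy.sum(axis=1)          # marginal distribution of X
p_y = p_xy.sum(axis=0)          # marginal distribution of Y

H_x, H_y, H_xy = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())
I_xy = H_x + H_y - H_xy         # mutual information I(X, Y)

print(f"H(X)={H_x:.3f}  H(Y)={H_y:.3f}  H(X,Y)={H_xy:.3f}  I(X,Y)={I_xy:.3f}")
```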


Figure 1.15. Joint information determined by the signal information H(X) and H(Y) as well as mutual information I(X,Y).

We consider the ICA operation as a system, with Y as the input and X as the output. When the output information of the system reaches its maximum, the mutual information between the output components is at a minimum. That is to say, the output components are as independent of each other as possible, since any non-trivial linear mixing would compromise their independence. This is a simplified description of the infomax principle.

In 1997, Bell and Sejnowski applied ICA using the information theoretic approach in the case of natural images, and found that ICA is a special sparse coding method. They explained the results obtained with the network proposed by Olshausen and Field in the ICA framework. ICA on natural images produces decorrelating filters that are sensitive to both phase and frequency, similar to the cases with transforms involving oriented Gabor functions or wavelets. Representative ICA filters generated from natural images are shown in figure 1.16. It can be seen that ICA can also model the Gabor-like receptive fields of simple cells in V1.


Figure 1.16. A matrix of 144 filters obtained using ICA on ZCA-whitened natural images. Reproduced with permission from Bell and Sejnowski (1997). Copyright 1997 Elsevier.

In this chapter, we have provided a general explanation of how data redundancy is reduced and a sparse representation is formed in the HVS. Multiple types of cells, such as ganglion and LGN cells, are involved in normalizing first- and second-order statistics and removing the associated redundancy. In the HVS, higher-order redundancy is eliminated by simple cells. From the viewpoint of biomimicry, the mechanism of simple cells is the basis for sparse representation. In addition, from the natural image perspective, we can use a sparsifying transform or model to obtain results similar to those observed in the HVS. It is noted that deep neural networks (to be formally explained in chapter 3) exhibit workflows similar to those of the HVS, such as multi-resolution analysis. As a second example, the whitening process is used to pre-process data in both the HVS and machine learning. Yet another example is that the operations removing higher-order redundancy share Gabor-like characteristics in both the HVS and machine learning. It will become increasingly clear that machine learning imitates the HVS in major ways. Now, we have the tools to extract features constrained by or in reference to natural image statistics. How can we use these features to help solve practical problems? This question naturally leads us to the following chapters.
