
3.5 Bayesian Inference for Normal Distribution


Let D = {x1, x2,…, xn} denote the observed data set. In maximum likelihood estimation, the distribution parameters are considered fixed, and the estimation errors are obtained by considering the random distribution of possible data sets D. By contrast, in Bayesian inference, we treat the observed data set D as the only data set, and the uncertainty in the parameters is characterized through a probability distribution over the parameters.

In this subsection, we focus on Bayesian inference for the normal distribution when the mean μ is unknown and the covariance matrix Σ is assumed known. Bayesian inference is based on Bayes’ theorem. In general, Bayes’ theorem gives the conditional probability of an event A given that an event B has occurred:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$
Applying Bayes’ theorem for Bayesian inference of μ, we have

$$f(\mu \mid D) = \frac{f(D \mid \mu)\,g(\mu)}{p(D)}, \qquad (3.25)$$

where g(μ) is the prior distribution of μ, which is the distribution before observing the data, and f(μ|D) is called the posterior distribution, which is the distribution after we have observed D. The function f(D|μ) on the right-hand side of (3.25) is the density function of the observed data set D. When viewed as a function of the unknown parameter μ, f(D|μ) is exactly the likelihood function of μ. Therefore, Bayes’ theorem can be stated in words as

$$\text{posterior} \propto \text{likelihood} \times \text{prior}, \qquad (3.26)$$

where ∝ stands for “is proportional to”. Note that the denominator p(D) on the right-hand side of (3.25) is a constant that does not depend on the parameter μ. It plays a normalization role, ensuring that the left-hand side is a valid probability density function that integrates to one. Integrating the right-hand side of (3.25) with respect to μ and setting the result equal to one, it is easy to see that

$$p(D) = \int f(D \mid \mu)\,g(\mu)\,d\mu.$$
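As a small numerical sketch of this normalization (with illustrative data and prior choices, not taken from the text), we can evaluate p(D) by numerical integration in R and check that the resulting posterior density integrates to one:

# Numerical check that dividing by p(D) normalizes the posterior
# (univariate normal data with known variance; illustrative values only)
set.seed(1)
x <- rnorm(10, mean = 2, sd = 1)                      # observed data set D
lik <- function(mu) sapply(mu, function(m) prod(dnorm(x, mean = m, sd = 1)))
prior <- function(mu) dnorm(mu, mean = 0, sd = 1)     # prior g(mu)
p.D <- integrate(function(mu) lik(mu) * prior(mu), -Inf, Inf)$value
posterior <- function(mu) lik(mu) * prior(mu) / p.D   # f(mu | D) in (3.25)
integrate(posterior, -Inf, Inf)$value                 # equals 1, as required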
A point estimate of μ can be obtained by maximizing the posterior distribution. This method is called maximum a posteriori (MAP) estimation. The MAP estimate of μ can be written as

$$\hat{\mu}_{\mathrm{MAP}} = \arg\max_{\mu}\, f(D \mid \mu)\,g(\mu). \qquad (3.27)$$

From (3.27), it can be seen that the MAP estimate is closely related to the MLE. Without the prior g(μ), the MAP estimate is the same as the MLE. So if the prior follows a uniform distribution, the MAP estimate and the MLE are equivalent. Following this argument, if the prior distribution has a flat shape, we expect the MAP estimate and the MLE to be similar.
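To make the flat-prior case explicit, suppose g(μ) = c for some constant c > 0 over the region of interest. The constant factors out of the maximization in (3.27), so the MAP estimate reduces to the MLE:

$$\hat{\mu}_{\mathrm{MAP}} = \arg\max_{\mu}\, f(D \mid \mu)\,c = \arg\max_{\mu}\, f(D \mid \mu) = \hat{\mu}_{\mathrm{MLE}}.$$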

We first consider a simple case where the data follow a univariate normal distribution with unknown mean μ and known variance σ². The likelihood function based on a random sample of independent observations D = {x1, x2,…, xn} is given by

$$f(D \mid \mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x_i - \mu)^2}{2\sigma^2}\right\}.$$
Based on (3.26), we have

$$f(\mu \mid D) \propto g(\mu)\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x_i - \mu)^2}{2\sigma^2}\right\},$$
where g(μ) is the probability density function of the prior distribution. We choose a normal distribution N(μ0, σ0²) as the prior for μ. This prior is a conjugate prior because the resulting posterior distribution will also be normal. By completing the square in the exponent of the likelihood and prior, the posterior distribution can be obtained as

$$\mu \mid D \sim N(\mu_n, \sigma_n^2),$$
where

$$\mu_n = \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\bar{x}, \qquad (3.28)$$

$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}. \qquad (3.29)$$

The posterior mean given in (3.28) can be understood as a weighted average of the prior mean μ0 and the sample mean x̄, which is the MLE of μ. When the sample size n is very large, the weight for x̄ is close to one, the weight for μ0 is close to zero, and the posterior mean is very close to the MLE, or the sample mean. On the other hand, when n is very small, the posterior mean is very close to the prior mean μ0. Similarly, if the prior variance σ0² is very large, the prior distribution has a flat shape and the posterior mean is close to the MLE. Note that because the mode of a normal distribution is equal to its mean, the MAP estimate of μ is exactly μn. Consequently, when n is very large, or when the prior is flat, the MAP estimate is close to the MLE.

Equation (3.29) shows the relationship between the posterior variance and the prior variance. The relationship is easier to understand if we consider the inverse of the variance, which is called the precision. A high (low) precision corresponds to a low (high) variance. Equation (3.29) says that the posterior precision is equal to the prior precision plus a contribution proportional to n. Each observation adds a contribution of 1/σ², the precision of a single observation, to the posterior precision. When n is very large, the posterior precision becomes very high, or equivalently the posterior variance becomes very small. On the other hand, when n is very small, the posterior precision and variance are very close to the prior precision and variance. In particular, when n = 0, the posterior distribution is the same as the prior distribution. We illustrate the posterior distribution of the mean with known variance under various sample sizes in Figure 3.3, where the data are generated from N(2, 1) and the prior distribution of the mean is N(0, 1). It is clear from Figure 3.3 that as the sample size n gets larger, the posterior distribution of the mean becomes more and more concentrated at the true mean.


Figure 3.3 Posterior distribution of the mean with various sample sizes
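As a minimal R sketch of the computation behind Figure 3.3 (using the same setup, with data generated from N(2, 1) and prior N(0, 1); the variable names are only illustrative), the posterior mean and variance follow directly from (3.28) and (3.29):

# Posterior of the mean with known variance, using (3.28) and (3.29)
# Setup of Figure 3.3: data generated from N(2, 1), prior N(0, 1)
set.seed(1)
sigma2 <- 1                          # known data variance
mu0 <- 0; sigma02 <- 1               # prior mean and variance
x <- rnorm(100, mean = 2, sd = sqrt(sigma2))
for (n in c(1, 5, 20, 100)) {
  x.bar <- mean(x[1:n])              # sample mean of the first n observations
  # posterior mean: weighted average of prior mean and sample mean, (3.28)
  mu.n <- sigma2 / (n * sigma02 + sigma2) * mu0 +
    n * sigma02 / (n * sigma02 + sigma2) * x.bar
  sigma2.n <- 1 / (1 / sigma02 + n / sigma2)   # posterior variance, (3.29)
  cat(sprintf("n = %3d: posterior mean %.3f, posterior sd %.3f\n",
              n, mu.n, sqrt(sigma2.n)))
}

As n grows, the printed posterior mean approaches the sample mean and the posterior standard deviation shrinks, which is the behavior shown in Figure 3.3.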

When the data follow a p-dimensional multivariate normal distribution with unknown mean μ and known covariance matrix Σ, the posterior distribution based on a random sample of independent observations D = {x1, x2,…, xn} is given by

$$f(\mu \mid D) \propto g(\mu)\prod_{i=1}^{n} \exp\left\{-\frac{1}{2}(x_i - \mu)^T \Sigma^{-1}(x_i - \mu)\right\},$$
where g(μ) is the density of the conjugate prior distribution Np(μ0, Σ0). Similar to the univariate case, the posterior distribution of μ can be obtained as

$$\mu \mid D \sim N_p(\mu_n, \Sigma_n),$$
where

$$\mu_n = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar{x}\right), \qquad (3.30)$$

$$\Sigma_n = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}, \qquad (3.31)$$

where x̄ is the sample mean of the data, which is the MLE of μ. It is easy to see the similarity between the results for univariate data in (3.28) and (3.29) and the results for multivariate data in (3.30) and (3.31). The MAP estimate of μ is exactly μn. As in the univariate case, when n is large, or when the prior distribution is flat, the MAP estimate is close to the MLE.

One advantage of Bayesian inference is that prior knowledge can be included naturally. Suppose, for example, a randomly sampled product turns out to be defective. The MLE of the defective rate based on this single observation would be equal to 1, implying that all products are defective. By contrast, a Bayesian approach with a reasonable prior gives a much less extreme conclusion. In addition, Bayesian inference can be performed sequentially in a very natural way. To see this, we can write the posterior distribution of μ with the contribution from the last data point xn separated out as

$$f(\mu \mid D) \propto f(x_n \mid \mu)\left[g(\mu)\prod_{i=1}^{n-1} f(x_i \mid \mu)\right]. \qquad (3.32)$$

Equation (3.32) can be viewed as the posterior distribution given a single observation xn, with the term in the square brackets treated as the prior. Note that the term in the square brackets is just the posterior distribution (up to a normalization constant) after observing the first n − 1 data points. Equation (3.32) therefore says that we can treat the posterior based on the first n − 1 observations as the prior and update it with the next observation using Bayes’ theorem. This process can be repeated sequentially for each new observation. The sequential update of the posterior under the Bayesian framework is very useful when observations are collected sequentially over time.
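A minimal R sketch of this sequential update for the univariate case (illustrative prior and data; the one-observation update is the n = 1 special case of (3.28) and (3.29)) shows that processing the data one point at a time gives the same posterior as the batch formulas:

# Sequential Bayesian update: the posterior after n - 1 observations
# serves as the prior for the n-th observation (univariate, known variance)
set.seed(1)
sigma2 <- 1                         # known data variance
mu0 <- 0; sigma02 <- 1              # prior N(0, 1)
x <- rnorm(20, mean = 2, sd = sqrt(sigma2))
mu.cur <- mu0; var.cur <- sigma02   # current prior, updated after each observation
for (xi in x) {
  # one-observation special case of (3.28) and (3.29)
  var.new <- 1 / (1 / var.cur + 1 / sigma2)
  mu.cur <- var.new * (mu.cur / var.cur + xi / sigma2)
  var.cur <- var.new
}
# batch posterior from all observations at once, using (3.28) and (3.29)
n <- length(x); x.bar <- mean(x)
mu.batch <- sigma2 / (n * sigma02 + sigma2) * mu0 +
  n * sigma02 / (n * sigma02 + sigma2) * x.bar
var.batch <- 1 / (1 / sigma02 + n / sigma2)
c(sequential = mu.cur, batch = mu.batch)     # the two posterior means agree
c(sequential = var.cur, batch = var.batch)   # the two posterior variances agree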

Example 3.3: For the side_temp_defect data set from a hot rolling process, suppose the true covariance matrix of the side temperatures measured at locations 2, 40, and 78 of Stand 5 is known and given by

$$\Sigma = \begin{pmatrix} 2547.4 & -111.0 & 133.7 \\ -111.0 & 533.1 & 300.7 \\ 133.7 & 300.7 & 562.5 \end{pmatrix}.$$
We use the nominal mean temperatures given in Example 3.2 as the mean of the prior distribution, and a diagonal matrix with a variance of 100 for each temperature variable as its covariance matrix:

$$\mu_0 = \begin{pmatrix} 1926 \\ 1851 \\ 1872 \end{pmatrix}, \qquad \Sigma_0 = \begin{pmatrix} 100 & 0 & 0 \\ 0 & 100 & 0 \\ 0 & 0 & 100 \end{pmatrix}.$$
Based on (3.30) and (3.31), the following R code calculates the posterior mean and covariance matrix of μ using the first five (n = 5) observations in the data set.

Sigma <- matrix(c(2547.4, -111.0, 133.7,
                  -111.0, 533.1, 300.7,
                  133.7, 300.7, 562.5),
                nrow = 3, ncol = 3, byrow = TRUE)  # known covariance matrix
Precision <- solve(Sigma)                   # known precision matrix (inverse of Sigma)
Sigma0 <- diag(rep(100, 3))                 # prior covariance matrix
Precision0 <- solve(Sigma0)                 # prior precision matrix
mu0 <- c(1926, 1851, 1872)                  # prior mean
n <- 5                                      # number of observations used
X.n <- side.temp.defect[1:n, c(2, 40, 78)]  # side temperatures at locations 2, 40, 78
x.bar <- apply(X.n, 2, mean)                # sample mean vector
# posterior mean and covariance matrix, equations (3.30) and (3.31)
mu.n <- solve(Precision0 + n * Precision) %*%
  (Precision0 %*% mu0 + n * Precision %*% x.bar)
Sigma.n <- solve(Precision0 + n * Precision)

The posterior mean and covariance matrix are obtained as


Compared to the sample mean of the first five observations, which is (1943, 1850, 1838)^T, the posterior mean deviates from both the sample mean and the prior mean μ0. Now we use the first 100 (n = 100) observations to find the posterior mean by changing n in the R code from 5 to 100. The posterior mean and covariance matrix are


Compared to the sample mean vector of the first 100 observations, which is (1944, 1849, 1865)^T, the posterior mean with n = 100 observations is very close to the sample mean, while the influence of the prior mean is very small. In addition, the posterior variance for the mean temperature at each of the three locations is much smaller for n = 100 than for n = 5.
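As a follow-up sketch tying the example to the sequential update in (3.32) (assuming the objects Sigma, Sigma0, Precision, mu0, and side.temp.defect from the code above are still in the workspace), accumulating one observation's precision contribution at a time, which is equivalent to applying (3.32) repeatedly, reproduces the batch results of (3.30) and (3.31):

# Sequential version of Example 3.3: add one observation's contribution at a time
Prec.cur <- solve(Sigma0)                  # start from the prior precision
b.cur <- Prec.cur %*% mu0                  # precision-weighted prior mean
X.5 <- as.matrix(side.temp.defect[1:5, c(2, 40, 78)])
for (i in 1:nrow(X.5)) {
  Prec.cur <- Prec.cur + Precision         # each observation adds its precision
  b.cur <- b.cur + Precision %*% X.5[i, ]  # and its precision-weighted value
}
Sigma.n.seq <- solve(Prec.cur)             # matches Sigma.n from (3.31)
mu.n.seq <- Sigma.n.seq %*% b.cur          # matches mu.n from (3.30)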
