
2.2 ML Algorithm Analysis

2.2.1 Logistic Regression


In this section, we provide more details on the performance analysis [4] of logistic regression, introduced in Section 2.1.2. There, in Eq. (2.5), we gave an expression for the probability that an individual with dataset values X1, X2, …, Xp is in outcome g, that is, p_g = Pr(Y = g ∣ X). For this expression, we need to estimate the parameters β used in the linear predictors B_g. The likelihood for a sample of N observations is given by
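For concreteness, the probability expression of Eq. (2.5) can be evaluated directly. The following minimal Python sketch (assuming NumPy; the function and variable names are illustrative, not from the text) computes p_g = e^{B_g} / ∑_h e^{B_h} for a single observation:

import numpy as np

def group_probabilities(beta, x):
    # Illustrative sketch of Eq. (2.5): p_g = exp(B_g) / sum_h exp(B_h),
    # where B_g = beta_g0 + beta_g1*X_1 + ... + beta_gp*X_p.
    # beta : (G, p+1) coefficient matrix; the reference row (g = 1) is all zeros
    # x    : (p+1,) covariate vector with a leading 1 for the intercept
    B = beta @ x                 # linear predictors B_g, one per outcome
    e = np.exp(B - B.max())      # subtract max(B) for numerical stability
    return e / e.sum()           # probabilities over the G outcomes sum to one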

(2.7) ℓ = ∏_{j=1}^{N} ∏_{g=1}^{G} p_{gj}^{y_{gj}}

where p_{gj} = Pr(Y_j = g ∣ X_j) and y_{gj} is one if the jth observation is in outcome g and zero otherwise. Using the fact that ∑_{g=1}^{G} y_{gj} = 1 for every j, the log likelihood, L, becomes

(2.8) L = ln(ℓ) = ∑_{j=1}^{N} ∑_{g=1}^{G} y_{gj} ln(p_{gj}) = ∑_{j=1}^{N} [ ∑_{g=1}^{G} y_{gj} B_{gj} − ln( ∑_{g=1}^{G} e^{B_{gj}} ) ]

Maximum likelihood estimates of the β’s are those values that maximize this log likelihood equation. This is accomplished by calculating the partial derivatives and setting them to zero. These equations are ∂L/∂β_{gk} = 0 for g = 1, 2, …, G and k = 1, 2, …, p. Since all coefficients are fixed at zero for g = 1 (the reference outcome), the effective range of g is from 2 to G.
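For illustration, the log likelihood of Eq. (2.8) can be evaluated numerically before maximizing it. The sketch below assumes NumPy and SciPy; the names log_likelihood, X, and y_onehot are illustrative:

import numpy as np
from scipy.special import logsumexp

def log_likelihood(beta, X, y_onehot):
    # Eq. (2.8): L = sum_j [ sum_g y_gj * B_gj - ln(sum_g exp(B_gj)) ]
    # beta     : (G, p+1) coefficients; the reference row (g = 1) fixed at zero
    # X        : (N, p+1) design matrix with a leading column of ones
    # y_onehot : (N, G) indicators y_gj
    B = X @ beta.T                                # B_gj for every j and g
    return np.sum(y_onehot * B) - np.sum(logsumexp(B, axis=1))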

Because the equations are nonlinear in the parameters, there is no closed-form solution, and they must be solved iteratively. The Newton–Raphson method [4–7] is used to solve these equations. This method makes use of the information matrix, I(β), which is formed from the matrix of second partial derivatives.
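A minimal sketch of the Newton–Raphson iteration, written for the binary case (G = 2) for brevity; the multinomial case is analogous. The function name and the 0/1 coding are assumptions of this sketch, not taken from the text:

import numpy as np

def newton_raphson_logistic(X, y, tol=1e-8, max_iter=25):
    # X : (N, p+1) design matrix with an intercept column; y : (N,) coded 0/1
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        grad = X.T @ (y - p)                  # score vector dL/dbeta
        W = p * (1.0 - p)                     # diagonal weights p(1 - p)
        info = X.T @ (W[:, None] * X)         # information matrix I(beta)
        step = np.linalg.solve(info, grad)    # Newton step I(beta)^{-1} grad
        beta += step
        if np.max(np.abs(step)) < tol:        # stop when the step is negligible
            break
    return beta, info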

The elements of the information matrix are given by

I(β)_{(gk),(g′k′)} = −∂²L/∂β_{gk} ∂β_{g′k′} = ∑_{j=1}^{N} x_{kj} x_{k′j} p_{gj} (δ_{gg′} − p_{g′j}),

where δ_{gg′} is one if g = g′ and zero otherwise, and x_{kj} is the kth dataset value of observation j.
The information matrix is used because the asymptotic covariance matrix of the maximum likelihood estimates is equal to the inverse of the information matrix. That is, V(β̂) = I(β)^{−1}. This covariance matrix is used in the calculation of confidence intervals for the regression coefficients, odds ratios, and predicted probabilities.
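Continuing the binary sketch above, the covariance matrix and Wald-type confidence intervals follow directly from I(β)^{−1}; the 95% level and the variable names are illustrative:

import numpy as np
from scipy import stats

beta_hat, info = newton_raphson_logistic(X, y)  # X, y as in the sketch above
cov = np.linalg.inv(info)                       # asymptotic covariance I(beta)^{-1}
se = np.sqrt(np.diag(cov))                      # standard errors s_bj
z = stats.norm.ppf(0.975)                       # 95% normal quantile
ci_beta = np.column_stack([beta_hat - z * se, beta_hat + z * se])
ci_odds = np.exp(ci_beta)                       # intervals for the odds ratios e^{b_j}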

The interpretation of the estimated regression coefficients is not straightforward. In logistic regression, not only is the relationship between X and Y nonlinear, but also, if the dependent variable has more than two unique values, there are several regression equations. Consider the usual case of a binary dependent variable, Y, and a single independent variable, X. Assume that Y is coded so it takes on the values 0 and 1. In this case, the logistic regression equation is ln(p/(1 − p)) = β0 + β1 X. Now consider the impact of a unit increase in X. The logistic regression equation becomes ln(p′/(1 − p′)) = β0 + β1(X + 1) = β0 + β1 X + β1. We can isolate the slope by taking the difference between these two equations. We have

(2.9) ln(p′/(1 − p′)) − ln(p/(1 − p)) = ln[ (p′/(1 − p′)) / (p/(1 − p)) ] = β1

That is, β1 is the log of the ratio of the odds at X + 1 and X. Removing the logarithm by exponentiating both sides gives e^{β1} = [p′/(1 − p′)] / [p/(1 − p)]. The regression coefficient β1 is interpreted as the log of the odds ratio comparing the odds after a one-unit increase in X to the original odds. Note that the interpretation of β1 depends on the particular value of X since the probability values, the p’s, will vary for different X.
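A short numerical illustration of this interpretation (the value β1 = 0.7 is made up for the example):

import numpy as np

beta1 = 0.7                           # illustrative slope, not from the text
odds_ratio = np.exp(beta1)            # e^{beta1} ~= 2.01
p = 0.50                              # starting probability: odds p/(1 - p) = 1
new_odds = odds_ratio * p / (1 - p)   # odds after a one-unit increase in X
p_new = new_odds / (1 + new_odds)     # ~= 0.668: same odds ratio, different p's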

Inferences about individual regression coefficients, groups of regression coefficients, goodness of fit, mean responses, and predictions of group membership of new observations are all of interest. These inference procedures can be treated by considering hypothesis tests and/or confidence intervals. The inference procedures in logistic regression rely on large sample sizes for accuracy. Two procedures are available for testing the significance of one or more independent variables in a logistic regression: likelihood ratio tests and Wald tests. Simulation studies usually show that the likelihood ratio test performs better than the Wald test. However, the Wald test is still used to test the significance of individual regression coefficients because of its ease of calculation.

The likelihood ratio test statistic is −2 times the difference between the log likelihoods of two models, one of which is a subset of the other. The likelihood ratio is defined as LR = −2[L_subset − L_full] = −2 ln(ℓ_subset/ℓ_full). When the full model in the likelihood ratio test statistic is the saturated model, LR is referred to as the deviance. A saturated model is one that includes all possible terms (including interactions) so that the predicted values from the model equal the original data. The formula for the deviance is D = −2[L_Reduced − L_Saturated]. The deviance may be calculated directly using the formula for the deviance residuals:

(2.10) D = 2 ∑_{j=1}^{N} ∑_{g=1}^{G} y_{gj} ln(y_{gj}/p_{gj}), with the convention 0 · ln 0 = 0

This expression may be used to calculate the log likelihood of the saturated model without actually fitting a saturated model. The formula is L_Saturated = L_Reduced + D/2.
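For individual (ungrouped) 0/1 data, the deviance of Eq. (2.10) reduces to a sum over each observation's fitted probability for its actual outcome, since the terms with y_gj = 0 vanish. A minimal sketch, with illustrative names:

import numpy as np

def deviance(y_onehot, p_hat):
    # Eq. (2.10): D = 2 * sum_j sum_g y_gj * ln(y_gj / p_gj), with 0*ln(0) = 0.
    # With 0/1 indicators, only the y_gj = 1 terms survive: ln(1/p) = -ln(p).
    mask = y_onehot == 1
    return -2.0 * np.sum(np.log(p_hat[mask]))

# L_Saturated can then be recovered without fitting the saturated model:
# L_Saturated = L_Reduced + D / 2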

The deviance in logistic regression is analogous to the residual sum of squares in multiple regression. In fact, when the deviance is calculated in multiple regression, it is equal to the sum of the squared residuals. Deviance residuals, to be discussed later, may be squared and summed as an alternative way to calculate the deviance D.

The change in deviance, ΔD, due to excluding (or including) one or more variables is used in logistic regression just as the partial F test is used in multiple regression. Many texts use the letter G to represent ΔD, but we have already used G to represent the number of groups in Y. Instead of using the F distribution, the distribution of the change in deviance is approximated by the chi-square distribution. Note that since the log likelihood for the saturated model is common to both deviance values, ΔD is calculated without actually estimating the saturated model. This fact becomes very important during subset selection. The formula for ΔD used for testing the significance of the regression coefficient(s) associated with the independent variable X1 is

ΔD_X1 = D_without X1 − D_with X1 = −2[L_without X1 − L_Saturated] + 2[L_with X1 − L_Saturated] = −2[L_without X1 − L_with X1].

Note that this formula looks identical to the likelihood ratio statistic. Because of the similarity between the change in deviance test and the likelihood ratio test, their names are often used interchangeably.
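A sketch of the change-in-deviance (equivalently, likelihood ratio) test for X1, assuming the two models have already been fit; the function name is illustrative, and the degrees of freedom equal the number of coefficients dropped with X1 (G − 1 in the multinomial case):

from scipy import stats

def change_in_deviance_test(L_without_X1, L_with_X1, df):
    # Delta-D = -2 [L_without_X1 - L_with_X1], referred to chi-square(df)
    delta_D = -2.0 * (L_without_X1 - L_with_X1)
    p_value = stats.chi2.sf(delta_D, df)   # upper-tail chi-square probability
    return delta_D, p_value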

The formula for the Wald statistic is z_j = b_j / s_{b_j}, where s_{b_j} is an estimate of the standard error of b_j provided by the square root of the corresponding diagonal element of the covariance matrix, V(β̂) = I(β̂)^{−1}. With large sample sizes, the distribution of z_j is closely approximated by the normal distribution. With small and moderate sample sizes, the normal approximation is described as “adequate.”
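A corresponding sketch of the Wald test, reusing the covariance matrix computed earlier; the names are illustrative:

import numpy as np
from scipy import stats

def wald_test(b, cov):
    # z_j = b_j / s_bj, with s_bj from the diagonal of cov = I(beta)^{-1}
    se = np.sqrt(np.diag(cov))
    z = b / se
    p = 2.0 * stats.norm.sf(np.abs(z))   # two-sided normal p-values
    return z, p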

