CHAPTER 3
Statistics and Methods
Part VI Linear Regression Analysis
Linear Regression (One Regressor)
One of the most popular models in statistics is the linear regression model. Given two constants, α and β, and a random error term, ϵ, in its simplest form the model posits a relationship between two variables, X and Y:

(3.86) Y = α + βX + ϵ
As specified, X is known as the regressor or independent variable. Similarly, Y is known as the regressand or dependent variable. As dependent implies, traditionally we think of X as causing Y. This relationship is not necessary, and in practice, especially in finance, this cause-and-effect relationship is either ambiguous or entirely absent. In finance, it is often the case that both X and Y are being driven by a common underlying factor.
The linear regression relationship is often represented graphically as a plot of Y against X, as shown in Figure 3.9. The solid line in the chart represents the deterministic portion of the linear regression equation, Y = α + βX. For any particular point, the distance above or below the line is the error, ϵ, for that point.
FIGURE 3.9 Linear Regression Example
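To make Equation 3.86 concrete, the following Python sketch simulates data from the model and recovers the two constants with the standard ordinary least squares formulas. The parameter values, sample size, and error volatility are hypothetical and chosen only for illustration; numpy is assumed to be available.

```python
import numpy as np

np.random.seed(42)

# Hypothetical parameters, chosen only for illustration
alpha, beta = 0.5, 1.2
n = 500

x = np.random.normal(0.0, 1.0, n)        # regressor X
eps = np.random.normal(0.0, 0.8, n)      # random error term, epsilon
y = alpha + beta * x + eps               # Equation 3.86: Y = alpha + beta*X + epsilon

# Ordinary least squares estimates of the intercept and slope
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
```

With enough observations, the estimates land close to the values used to generate the data, and the scatter of the points around the fitted line corresponds to the error term ϵ shown in Figure 3.9.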
Because there is only one regressor, this model is often referred to as a univariate regression. Mainly, this is to differentiate it from the multivariate model, with more than one regressor, which we will explore later in this chapter. While everybody agrees that a model with two or more regressors is multivariate, not everybody agrees that a model with one regressor is univariate. Even though the univariate model has one regressor, X, it has two variables, X and Y, which has led some people to refer to Equation 3.86 as a bivariate model. From here on out, however, we will refer to Equation 3.86 as a univariate model.
In Equation 3.86, α and β are constants. In the univariate model, α is typically referred to as the intercept, and β is often referred to as the slope. β is referred to as the slope because it measures the slope of the solid line when Y is plotted against X. We can see this by taking the derivative of Y with respect to X:
(3.87) dY/dX = β
The final term in Equation 3.86, ϵ, represents a random error, or residual. The error term allows us to specify a relationship between X and Y, even when that relationship is not exact. In effect, the model is incomplete; it is an approximation. Changes in X may drive changes in Y, but there are other variables, which we are not modeling, that also affect Y. These unmodeled variables cause X and Y to deviate from a purely deterministic relationship. That deviation is captured by ϵ, our residual.
In risk management this division of the world into two parts, a part that can be explained by the model and a part that cannot, is a common dichotomy. We refer to risk that can be explained by our model as systematic risk, and to the part that cannot be explained by the model as idiosyncratic risk. In our regression model, Y is divided into a systematic component, α + βX, and an idiosyncratic component, ϵ.
(3.88) Y = (α + βX) + ϵ
Which component of the overall risk is more important? It depends on what our objective is. As we will see, portfolio managers who wish to hedge certain risks in their portfolios are basically trying to reduce or eliminate systematic risk. Portfolio managers who try to mimic the returns of an index, on the other hand, can be viewed as trying to minimize idiosyncratic risk.
EVALUATING THE REGRESSION
Unlike a controlled laboratory experiment, the real world is a very noisy and complicated place. In finance it is rare that a simple univariate regression model is going to completely explain a large data set. In many cases, the data are so noisy that we must ask ourselves if the model is explaining anything at all. Even when a relationship appears to exist, we are likely to want some quantitative measure of just how strong that relationship is.
Probably the most popular statistic for describing linear regressions is the coefficient of determination, commonly known as R-squared, or just R2. R2 is often described as the goodness of fit of the linear regression. When R2 is one, the regression model completely explains the data: all the residuals are zero, and the residual sum of squares, RSS, is zero. At the other end of the spectrum, if R2 is zero, the model does not explain any variation in the observed data. In other words, Y does not vary with X, and β is zero.
To calculate the coefficient of determination, we need to define two additional terms: TSS, the total sum of squares, and ESS, the explained sum of squares. They are defined as:
(3.89)
TSS = Σ(yᵢ − ȳ)²
ESS = Σ(ŷᵢ − ȳ)²

where ȳ is the sample mean of Y and ŷᵢ is the value of Y predicted by the regression for the i-th observation.
These two sums are related to the previously encountered residual sum of squares, as follows:
(3.90) TSS = ESS + RSS
In other words, the total variation in our regressand, TSS, can be broken down into two components, the part the model can explain, ESS, and the part the model cannot, RSS. These sums can be used to compute R2:
(3.91) R2 = ESS/TSS = 1 − RSS/TSS
As promised, when there are no residual errors, when RSS is zero, R2 is one. Also, when ESS is zero, that is, when RSS is equal to TSS, R2 is zero. It turns out that for the univariate linear regression model, R2 is also equal to the square of the correlation between X and Y. If X and Y are perfectly correlated, ρxy = 1, or perfectly negatively correlated, ρxy = –1, then R2 will equal one.
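The relationships in Equations 3.89 through 3.91 are easy to verify numerically. The sketch below, on simulated data with hypothetical parameter values, computes TSS, ESS, and RSS directly, confirms that TSS = ESS + RSS, and checks that R2 matches the squared correlation between X and Y.

```python
import numpy as np

np.random.seed(0)

# Hypothetical simulated data: Y = 0.5 + 1.2*X + noise
x = np.random.normal(size=500)
y = 0.5 + 1.2 * x + np.random.normal(scale=0.8, size=500)

# Univariate OLS fit
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()
y_hat = alpha_hat + beta_hat * x               # fitted values

tss = np.sum((y - y.mean()) ** 2)              # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)          # explained sum of squares
rss = np.sum((y - y_hat) ** 2)                 # residual sum of squares

r_squared = ess / tss
rho = np.corrcoef(x, y)[0, 1]

print(f"TSS = ESS + RSS? {np.isclose(tss, ess + rss)}")
print(f"R2 = {r_squared:.4f}, rho^2 = {rho ** 2:.4f}")
```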
Estimates of the regression parameters are, just like the parameter estimates we examined earlier, subject to hypothesis testing. In regression analysis, the most common null hypothesis is that the slope parameter, β, is zero. If β is zero, then the regression model does not explain any variation in the regressand.
In finance, we often want to know if α is significantly different from zero, but for different reasons. In modern finance, alpha has become synonymous with the ability of a portfolio manager to generate excess returns. This is because, in a regression equation modeling the returns of a portfolio manager, after we remove all the randomness, ϵ, and the influence of the explanatory variable, X, if α is still positive, this suggests that the portfolio manager is producing positive excess returns, something that should be very difficult in efficient markets. Of course, it is not enough that α is positive; we require that it be both positive and statistically significant.
Sample Problem
Question:
As a risk manager and expert on statistics, you are asked to evaluate the performance of a long/short equity portfolio manager. You are given 10 years of monthly return data. You regress the log returns of the portfolio manager against the log returns of a market index.
Assume both series are normally distributed and homoscedastic. From this analysis, you obtain the following regression results:
What can we say about the performance of the portfolio manager?
Answer:
The R2 for the regression is low. Only 8.11 percent of the variation in the portfolio manager's returns can be explained by the constant, beta, and variation in the market. The rest is idiosyncratic risk, and is unexplained by the model.
That said, both the constant and the beta seem to be statistically significant (i.e., they are statistically different from zero). We can get the t-statistic by dividing the value of the coefficient by its standard error. For the constant, we have:
Similarly, for beta we have a t-statistic of 2.10. Using a statistical package, we calculate the corresponding probability associated with each t-statistic. This should be a two-tailed test with 118 degrees of freedom (10 years × 12 months per year – 2 parameters). We can reject the hypothesis that the constant and slope are zero at the 2 percent level and 4 percent level, respectively. In other words, there seems to be a significant market component to the fund manager's return, but the manager is also generating statistically significant excess returns.
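As a sketch of the mechanics only, the snippet below uses hypothetical coefficient and standard error values, chosen so that the resulting t-statistics and significance levels are in line with those quoted in the answer; they are not the sample problem's actual regression output. Each coefficient is divided by its standard error and compared with a Student's t distribution with 118 degrees of freedom. scipy is assumed to be available.

```python
from scipy import stats

# Hypothetical coefficients and standard errors, for illustration only;
# they are not the sample problem's actual regression output.
coef = {"constant": 0.0040, "beta": 0.21}
std_err = {"constant": 0.0017, "beta": 0.10}

n_obs = 10 * 12                 # 10 years of monthly returns
dof = n_obs - 2                 # two estimated parameters

for name in coef:
    t_stat = coef[name] / std_err[name]
    p_value = 2 * stats.t.sf(abs(t_stat), dof)   # two-tailed p-value
    print(f"{name}: t = {t_stat:.2f}, p = {p_value:.3f}")
```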
Linear Regression (Multivariate)
Univariate regression models are extremely common in finance and risk management, but sometimes we require a slightly more complicated model. In these cases, we might use a multivariate regression model. The basic idea is the same, but instead of one regressand and one regressor, we have one regressand and multiple regressors. Our basic equation will look something like:
(3.92) Y = β1 + β2X2 + β3X3 + … + βnXn + ϵ
Notice that rather than denoting the first constant with α, we chose to go with β1. This is the more common convention in multivariate regression. To make the equation even more regular, we can assume that there is an X1, which, unlike the other X's, is constant and always equal to one. This convention allows us to easily express a set of observations in matrix form. For t observations and n regressors, we could write:
(3.93)
[ y1 ]   [ x11 x12 … x1n ] [ β1 ]   [ ϵ1 ]
[ y2 ] = [ x21 x22 … x2n ] [ β2 ] + [ ϵ2 ]
[ ⋮  ]   [  ⋮   ⋮      ⋮ ] [ ⋮  ]   [ ⋮  ]
[ yt ]   [ xt1 xt2 … xtn ] [ βn ]   [ ϵt ]
where the first column of the X matrix, x11, x21, … , xt1, is understood to consist entirely of ones. The entire equation can be written more succinctly as:
(3.94) Y = Xβ + ϵ
where, as before, we have used bold letters to denote matrices.
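As a minimal numerical sketch of Equation 3.94, assuming numpy and synthetic data, we can build the X matrix with a leading column of ones and solve for the coefficient vector with a least-squares solver, which is equivalent to the usual OLS formula (X′X)⁻¹X′Y but more numerically stable.

```python
import numpy as np

np.random.seed(1)
t_obs = 200                                    # t observations

# Hypothetical regressors; the first column of X is all ones (the constant, X1)
x2 = np.random.normal(size=t_obs)
x3 = np.random.normal(size=t_obs)
X = np.column_stack([np.ones(t_obs), x2, x3])

true_beta = np.array([0.1, 0.7, -0.3])         # illustrative coefficients
y = X @ true_beta + np.random.normal(scale=0.5, size=t_obs)

# OLS estimate of beta in Y = X beta + eps
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                # close to [0.1, 0.7, -0.3]
```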
MULTICOLLINEARITY
In order to determine the parameters of the multivariate regression, we again turn to our OLS assumptions. In the multivariate case, the assumptions are the same as before, but with one addition: we require that all of the independent variables be linearly independent of each other. We say that the independent variables must lack multicollinearity:
(A7) The independent variables have no multicollinearity.
To say that the independent variables lack multicollinearity means that it is impossible to express one of the independent variables as a linear combination of the others.
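One quick, informal way to check for perfect multicollinearity in a data set is to look at the column rank of the X matrix: if any regressor is an exact linear combination of the others, the rank falls short of the number of columns. The sketch below, on hypothetical data, illustrates this.

```python
import numpy as np

np.random.seed(2)
x2 = np.random.normal(size=100)
x3 = 0.5 + 2.0 * x2                            # X3 is an exact linear function of X2
X = np.column_stack([np.ones(100), x2, x3])

# With perfect multicollinearity the rank is less than the number of columns
print(np.linalg.matrix_rank(X), X.shape[1])    # prints 2 and 3
```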
This additional assumption is required to remove ambiguity. To see why this is the case, imagine that we attempt a regression with two independent variables where the second independent variable, X3, can be expressed as a linear function of the first independent variable, X2:

(3.95)
Y = β1 + β2X2 + β3X3 + ϵ1
X3 = λ1 + λ2X2 + ϵ2
If we substitute the second line of Equation 3.95 into the first, we get:
(3.96)
Y = β1 + β2X2 + β3(λ1 + λ2X2 + ϵ2) + ϵ1 = (β1 + β3λ1) + (β2 + β3λ2)X2 + (β3ϵ2 + ϵ1)
Y = β4 + β5X2 + ϵ3
In the second line, we have simplified by introducing new constants and a new error term. We have replaced (β1 + β3λ1) with β4, replaced (β2 + β3λ2) with β5, and replaced (β3ϵ2 + ϵ1) with ϵ3. β5 can be uniquely determined in a univariate regression, but there are infinitely many combinations of β2, β3, and λ2 that we could choose to equal β5. If β5 = 10, any of the following combinations would work:
(3.97)
β2 = 10, β3 = 0, λ2 = any value
β2 = 0, β3 = 10, λ2 = 1
β2 = 5, β3 = 2.5, λ2 = 2
This is why we say that β2 and β3 are ambiguous in the initial equation.
Even in the presence of multicollinearity, the regression model still works in a sense. In the preceding example, even though β2 and β3 are ambiguous, any combination where (β2 + β3λ2) equals β5 will produce the same value of Y for a given set of X's. If our only objective is to predict Y, then the regression model still works. The problem is that the value of the parameters will be unstable. A slightly different data set can cause wild swings in the value of the parameter estimates, and may even flip the signs of the parameters. A variable that we expect to be positively correlated with the regressand may end up with a large negative beta. This makes interpreting the model difficult. Parameter instability is often a sign of multicollinearity.
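The parameter instability described above is easy to reproduce with simulated data. In the sketch below, X3 is almost, but not exactly, a linear function of X2; fitting the same regression on two independently simulated samples can produce wildly different, and even sign-flipped, estimates of the slopes on X2 and X3. All values are hypothetical.

```python
import numpy as np

def fit(seed):
    """Regress Y on a constant, X2, and a nearly collinear X3 for one simulated sample."""
    rng = np.random.default_rng(seed)
    x2 = rng.normal(size=100)
    x3 = 1.0 + 2.0 * x2 + rng.normal(scale=0.01, size=100)  # almost a linear function of X2
    y = 0.5 + 1.0 * x2 + 1.0 * x3 + rng.normal(scale=0.5, size=100)
    X = np.column_stack([np.ones(100), x2, x3])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Two slightly different samples; the estimated slopes on X2 and X3 can swing wildly
print(fit(0))
print(fit(1))
```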
There is no well-accepted procedure for dealing with multicollinearity. The easiest course of action is often simply to eliminate a variable from the regression. While easy, this is hardly satisfactory.
Another possibility is to transform the variables, to create uncorrelated variables out of linear combinations of the existing variables. In the previous example, even though X3 is correlated with X2, X3 – λ2X2 is uncorrelated with X2.
(3.98) X3 − λ2X2 = λ1 + ϵ2, so Cov[X2, X3 − λ2X2] = Cov[X2, ϵ2] = 0
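A short numerical check of this decorrelation, using hypothetical values of λ1 and λ2 with simulated data: X3 is strongly correlated with X2, while the transformed variable X3 − λ2X2 is not.

```python
import numpy as np

rng = np.random.default_rng(4)
lambda_1, lambda_2 = 1.0, 2.0                  # hypothetical values, for illustration

x2 = rng.normal(size=1000)
x3 = lambda_1 + lambda_2 * x2 + rng.normal(scale=0.5, size=1000)
spread = x3 - lambda_2 * x2                    # equals lambda_1 + error, unrelated to X2

print(np.corrcoef(x2, x3)[0, 1])               # close to 1
print(np.corrcoef(x2, spread)[0, 1])           # close to 0
```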
One potential problem with this approach is similar to what we saw with principal component analysis (which is really just another method for creating uncorrelated variables from linear combinations of correlated variables). If we are lucky, a linear combination of variables will have a simple economic interpretation. For example, if X2 and X3 are two equity indexes, then their difference might correspond to a familiar spread. Similarly, if the two variables are interest rates, their difference might bear some relation to the shape of the yield curve. Other linear combinations might be difficult to interpret, and if the relationship is not readily identifiable, then the relationship is more likely to be unstable or spurious.
Global financial markets are becoming increasingly integrated. More now than ever before, multicollinearity is a problem that risk managers need to be aware of.