Handbook of Regression Analysis With Applications in R - Samprit Chatterjee

1.4 Example — Estimating Home Prices

Determining the appropriate sale price for a home is clearly of great interest to both buyers and sellers. While this can be done in principle by examining the prices at which other similar homes have recently sold, the well‐known existence of strong effects related to location means that there are likely to be relatively few homes with the same important characteristics to make the comparison. A solution to this problem is the use of hedonic regression models, where the sale prices of a set of homes in a particular area are regressed on important characteristics of the home such as the number of bedrooms, the living area, the lot size, and so on. Academic research on this topic is plentiful, going back to at least Wabe (1971).

This analysis is based on a sample from public data on sales of one‐family homes in the Levittown, NY area from June 2010 through May 2011. Levittown is famous as the first planned suburban community built using mass production methods, being aimed at former members of the military after World War II. Most of the homes in this community were built in the late 1940s to early 1950s, without basements and designed to make expansion on the second floor relatively easy.

For each of the houses in the sample, the number of bedrooms, number of bathrooms, living area (in square feet), lot size (in square feet), the year the house was built, and the property taxes are used as potential predictors of the sale price. In any analysis the first step is to look at the data, and Figure 1.4 gives scatter plots of sale price versus each predictor. It is apparent that there is a positive association between sale price and each variable, other than number of bedrooms and lot size. We also note that there are two houses with unusually large living areas for this sample, two with unusually large property taxes (these are not the same two houses), and three that were built six or seven years later than all of the other houses in the sample.


FIGURE 1.4: Scatter plots of sale price versus each predictor for the home price data.

The output below summarizes the results of a multiple regression fit.

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -7.149e+06  3.820e+06  -1.871 0.065043 .
Bedrooms     -1.229e+04  9.347e+03  -1.315 0.192361
Bathrooms     5.170e+04  1.309e+04   3.948 0.000171 ***
Living.area   6.590e+01  1.598e+01   4.124 9.22e-05 ***
Lot.size     -8.971e-01  4.194e+00  -0.214 0.831197
Year.built    3.761e+03  1.963e+03   1.916 0.058981 .
Property.tax  1.476e+00  2.832e+00   0.521 0.603734
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 47380 on 78 degrees of freedom
Multiple R-squared:  0.5065,  Adjusted R-squared:  0.4685
F-statistic: 13.34 on 6 and 78 DF,  p-value: 2.416e-10
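As a quick check on how the columns of this output relate to one another, the following sketch recomputes the t statistics and two-sided tail probabilities from the printed estimates and standard errors. The variable names (`est`, `se`, `df_resid`) are illustrative, and because the inputs are the rounded values shown above, the results agree with the printed output only up to rounding.

```r
# Estimates and standard errors copied from the regression output above
# (rounded to four significant digits, as printed).
est <- c("(Intercept)" = -7.149e+06, Bedrooms = -1.229e+04,
         Bathrooms = 5.170e+04, Living.area = 6.590e+01,
         Lot.size = -8.971e-01, Year.built = 3.761e+03,
         Property.tax = 1.476e+00)
se <- c(3.820e+06, 9.347e+03, 1.309e+04, 1.598e+01,
        4.194e+00, 1.963e+03, 2.832e+00)
df_resid <- 78  # residual degrees of freedom reported in the output

# Each t statistic is the estimate divided by its standard error.
t_value <- est / se

# Two-sided tail probability from the t distribution on 78 df.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(cbind(t_value, p_value), 4))
```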

The overall regression is strongly statistically significant, with the tail probability of the F-test roughly 2.4 × 10^-10. The predictors account for roughly 50% of the variability in sale prices (R² = 0.5065). Two of the predictors (number of bathrooms and living area) are highly statistically significant, with tail probabilities less than 0.001, and the coefficient of the year built variable is marginally statistically significant. The coefficients imply that, given all else in the model is held fixed, one additional bathroom in a house is associated with an estimated expected price that is $51,700 higher; one additional square foot of living area is associated with an estimated expected price that is $65.90 higher (given the typical value of the living area variable, a more meaningful statement would probably be that an additional 100 square feet of living area is associated with an estimated expected price that is $6,590 higher); and a house being built one year later is associated with an estimated expected price that is $3,761 higher.
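The summary statistics in the output are tied together by standard formulas, and the following sketch recovers the F-statistic, its tail probability, and the adjusted R² from the reported R² alone. It assumes a sample size of 85, inferred from the 78 residual degrees of freedom plus the 7 estimated coefficients; the variable names are illustrative. The last lines rescale the living area coefficient to a 100 square foot change, which is simply a choice of a more interpretable unit.

```r
r2 <- 0.5065   # multiple R-squared from the output
p  <- 6        # number of predictors
n  <- 78 + 7   # residual df plus number of coefficients (assumed n = 85)

# Overall F-statistic and its upper-tail probability.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_pval <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

# Adjusted R-squared penalizes for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(F = f_stat, p.value = f_pval, adj.R2 = adj_r2))

# Rescaling a slope: the change in estimated expected price associated
# with 100 additional square feet of living area, all else held fixed.
slope_living_area <- 6.590e+01
effect_100sqft <- 100 * slope_living_area
print(effect_100sqft)
```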

This is a situation where the distinction between a confidence interval for a fitted value and a prediction interval (and which is of more interest to a particular person) is clear. Consider a house with [...] bedrooms, [...] bathroom, [...] square feet of living area, a [...] square foot lot, built in 1948, with [...] in property taxes. Substituting those values into the above equation gives an estimated expected sale price of a house with these characteristics equal to [...]. A buyer or a seller is interested in the sale price of one particular house, so a prediction interval for the sale price would provide a range for what the buyer can expect to pay and the seller can expect to get. The standard error of the estimate can be used to construct a rough prediction interval, in that roughly 95% of the time a house with these characteristics can be expected to sell for within ±$94,760 (twice the residual standard error of $47,380) of that estimated sale price, but a more exact interval might be required. On the other hand, a home appraiser or tax assessor is more interested in the typical (average) sale price for all homes of that type in the area. They need a justifiable interval estimate of the precision with which the true expected value of the house is estimated, so a confidence interval for the fitted value is desired.
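The substitution and the rough interval can be sketched as follows. The house characteristics in `house` are hypothetical values chosen only for illustration; they are not the values used in the text. The coefficients are the rounded ones from the printed output, so the fitted value is only approximate (rounding in the intercept and the year built coefficient alone can shift it by several hundred dollars).

```r
# Rounded coefficients from the regression output.
coefs <- c("(Intercept)" = -7.149e+06, Bedrooms = -1.229e+04,
           Bathrooms = 5.170e+04, Living.area = 6.590e+01,
           Lot.size = -8.971e-01, Year.built = 3.761e+03,
           Property.tax = 1.476e+00)
sigma_hat <- 47380  # residual standard error from the output

# HYPOTHETICAL house, for illustration only (not the text's example).
house <- c(Bedrooms = 3, Bathrooms = 1, Living.area = 1000,
           Lot.size = 6000, Year.built = 1948, Property.tax = 6000)

# Estimated expected sale price: intercept plus sum of slope * value.
fitted_price <- unname(coefs["(Intercept)"] +
                       sum(coefs[names(house)] * house))

# Rough 95% prediction interval: fitted value +/- 2 * sigma_hat.
rough_pi <- c(lower = fitted_price - 2 * sigma_hat,
              upper = fitted_price + 2 * sigma_hat)

print(fitted_price)
print(rough_pi)
```

The rough interval ignores the uncertainty in the estimated coefficients, which is why an exact interval from software is somewhat wider.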

Exact intervals for a house with these characteristics can be obtained from statistical software, and turn out to be [...] for the prediction interval and [...] for the confidence interval. As expected, the prediction interval is much wider than the confidence interval, since it reflects the inherent variability in sale prices in the population of houses. Indeed, it is probably too wide to be of any practical value in this case, but an interval with smaller coverage (one that is expected to include the actual price only [...] of the time, say) might be useful (a [...] interval in this case would be [...], so a seller could be told that there is a [...] chance that their house will sell for a value in this range).
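In R, exact intervals of both kinds come from `predict()` applied to an `lm` fit, with the `interval` and `level` arguments controlling the type and coverage. Since the Levittown data are not reproduced here, the sketch below uses simulated data with only two predictors; the data frame `sim`, its generating values, and `new_house` are all assumptions made for illustration, so the numbers it prints have no connection to the intervals reported in the text.

```r
# SIMULATED data standing in for the home price sample (illustration only).
set.seed(1)
n <- 85
sim <- data.frame(
  Living.area = runif(n, 800, 2500),
  Bathrooms   = sample(1:3, n, replace = TRUE)
)
sim$Price <- 150000 + 65 * sim$Living.area + 50000 * sim$Bathrooms +
  rnorm(n, sd = 47000)

fit <- lm(Price ~ Living.area + Bathrooms, data = sim)

# Hypothetical house for which intervals are wanted.
new_house <- data.frame(Living.area = 1200, Bathrooms = 1)

# Each call returns a one-row matrix with columns fit, lwr, upr.
pi95 <- predict(fit, newdata = new_house, interval = "prediction", level = 0.95)
ci95 <- predict(fit, newdata = new_house, interval = "confidence", level = 0.95)
pi50 <- predict(fit, newdata = new_house, interval = "prediction", level = 0.50)

print(rbind(pi95 = pi95[1, ], ci95 = ci95[1, ], pi50 = pi50[1, ]))
```

The prediction interval is always wider than the confidence interval at the same level, and lowering `level` narrows either interval at the cost of coverage.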

The validity of all of these results depends on whether the assumptions hold. Figure 1.5 gives a scatter plot of the residuals versus the fitted values and a normal plot of the residuals for this model fit. There is no apparent pattern in the plot of residuals versus fitted values, and the ordered residuals form a roughly straight line in the normal plot, so there are no apparent violations of assumptions here. The plots of residuals versus each of the predictors (Figure 1.6) also do not show any apparent patterns, other than the houses with unusual living area and year built, respectively. It would be reasonable to omit these observations to see if they have had an effect on the regression, but we will postpone discussion of that to Chapter 3, where diagnostics for unusual observations are discussed in greater detail.
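Residual plots of the kind described here take only a few lines of base R. The sketch below again uses simulated data (the data frame `sim` and its generating values are assumptions, not the Levittown sample), so the plots it draws illustrate the mechanics rather than reproduce Figure 1.5.

```r
# SIMULATED data standing in for the home price sample (illustration only).
set.seed(1)
n <- 85
sim <- data.frame(
  Living.area = runif(n, 800, 2500),
  Bathrooms   = sample(1:3, n, replace = TRUE)
)
sim$Price <- 150000 + 65 * sim$Living.area + 50000 * sim$Bathrooms +
  rnorm(n, sd = 47000)

fit <- lm(Price ~ Living.area + Bathrooms, data = sim)
res <- resid(fit)   # residuals
fv  <- fitted(fit)  # fitted values

op <- par(mfrow = c(1, 2))  # two panels side by side

# (a) Residuals versus fitted values: look for curvature or changing spread.
plot(fv, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# (b) Normal plot: points near the reference line are consistent with
# normally distributed errors.
qqnorm(res)
qqline(res)

par(op)  # restore the previous plotting layout
```

Plots of residuals against each predictor follow the same pattern, for example `plot(sim$Living.area, res)`.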

An obvious consideration at this point is that the models discussed here appear to be overspecified; that is, they include variables that do not apparently add to the predictive power of the model. As was noted earlier, this suggests the consideration of model building, where a more appropriate (simplified) model can be chosen, which will be discussed in Chapter 2.


FIGURE 1.5: Residual plots for the home price data. (a) Plot of residuals versus fitted values. (b) Normal plot of the residuals.


FIGURE 1.6: Scatter plots of residuals versus each predictor for the home price data.
