Data Science For Dummies - Lillian Pierson
Linear regression
Linear regression is a machine learning method you can use to describe and quantify the relationship between your target variable y (the predictant, in statistics lingo) and the dataset features you've chosen to use as predictor variables (commonly designated as dataset X in machine learning). When you use just one predictor variable, linear regression is as simple as the middle-school algebra formula y = mx + b. A classic example of linear regression is predicting home prices, as shown in Figure 4-6. You can also use it to model a target from several predictor variables at once, a technique called multiple linear regression. Before getting too excited about using linear regression, though, make sure you've considered its limitations:
Linear regression works only with numerical variables, not categorical ones; categorical features must be encoded numerically before you can use them.
If your dataset has missing values, they will cause problems. Be sure to address missing values before attempting to build a linear regression model.
If your data contains outliers, your model will produce inaccurate results. Check for and handle outliers before proceeding.
The linear regression model assumes that a linear relationship exists between dataset features and the target variable.
The linear regression model assumes that all features are independent of each other; in practice, this means the predictors should not be strongly correlated with one another (a problem known as multicollinearity).
Prediction errors, or residuals, should be normally distributed.
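The missing-value and outlier checks above can be sketched with NumPy; the array values below are invented purely for illustration:

```python
import numpy as np

# Invented single-feature array: nan stands in for a missing value,
# and 40.0 plays the role of an outlier.
X = np.array([3.0, 4.0, np.nan, 5.0, 6.0, 40.0])

# 1. Address missing values before modeling (here, simply drop them).
clean = X[~np.isnan(X)]

# 2. Flag outliers by z-score. A cutoff of |z| > 3 is common for large
# samples; with only a handful of points, a looser cutoff like 1.5 is
# needed for anything to register.
z = (clean - clean.mean()) / clean.std()
outliers = clean[np.abs(z) > 1.5]
```

With real data you would typically impute rather than drop missing values, and investigate each flagged outlier rather than delete it automatically.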
Credit: Python for Data Science Essential Training Part 2, LinkedIn.com
FIGURE 4-6: Linear regression used to predict home prices based on the number of rooms in a house.
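A straight-line fit like the one in Figure 4-6 can be reproduced in miniature with NumPy's polyfit; the room counts and prices below are made up for illustration and are not the figure's actual data:

```python
import numpy as np

# Invented example data: number of rooms vs. sale price (in $1,000s).
rooms = np.array([3, 4, 5, 6, 7, 8])
prices = np.array([180, 215, 260, 300, 345, 380])

# y = mx + b: np.polyfit with degree 1 returns the slope m and
# intercept b of the least-squares line through the points.
m, b = np.polyfit(rooms, prices, 1)

# Use the fitted line to predict the price of a 5-room house.
predicted = m * 5 + b
```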
Don’t forget dataset size! A good rule of thumb is that you should have at least 20 observations per predictive feature if you expect to generate reliable results using linear regression.
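Multiple linear regression extends the same least-squares idea to several predictors. A minimal NumPy sketch, using synthetic data sized to respect the 20-observations-per-feature rule of thumb:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 40 observations and 2 features meets the
# "at least 20 observations per predictive feature" rule of thumb.
n_obs, n_features = 40, 2
assert n_obs >= 20 * n_features

X = rng.normal(size=(n_obs, n_features))
# True relationship (known only because we generated the data):
# y = 3*x1 - 2*x2 + 5, plus a little noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=n_obs)

# Append a column of ones so the solve also estimates the intercept.
A = np.column_stack([X, np.ones(n_obs)])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
# coefs recovers approximately [3.0, -2.0, 5.0], up to the injected noise.
```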