Probability with R, Jane M. Horgan

Example 3.1


Suppose there are 50 pairs of observations (x, y) available for obtaining the line that best fits the data in order to predict y from x. The data are randomly divided into a training set and a testing set, using 40 observations for training (Table 3.1) and 10 for testing (Table 3.2).
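The random 40/10 split described above can be sketched in R with sample(). This is not the book's code, and the data generated here are synthetic stand-ins; with the real 50 observations, x and y would hold the measured values instead.

```r
# Hedged sketch of a random 40/10 train/test split (names x, y, idx
# are illustrative; the data below are synthetic, not from the book).
set.seed(1)                        # fix the split for reproducibility
x <- runif(50, 3, 16)              # 50 synthetic x-values
y <- 5 * x + rnorm(50, sd = 8)     # 50 synthetic y-values
idx <- sample(1:50, 40)            # 40 indices chosen at random
x_train <- x[idx];  y_train <- y[idx]
x_test  <- x[-idx]; y_test  <- y[-idx]
length(x_train); length(x_test)    # 40 and 10
```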

TABLE 3.1 The Training Set

Observation Number x y Observation Number x y
1 11.8 31.3 21 15.1 80.1
2 10.8 59.9 22 14.7 66.9
3 8.6 27.6 23 10.5 42.0
4 10.3 57.7 24 10.9 72.9
5 8.5 50.2 25 11.6 67.8
6 11.6 52.1 26 9.1 45.3
7 14.4 79.1 27 5.4 30.2
8 8.6 32.3 28 8.8 49.6
9 12.4 58.8 29 11.2 44.3
10 14.9 79.5 30 7.4 46.1
11 8.9 57.0 31 7.9 45.1
12 8.7 35.1 32 12.2 46.5
13 11.7 68.2 33 8.5 42.7
14 11.4 60.1 34 9.3 56.3
15 8.8 44.5 35 10.0 27.4
16 5.9 28.9 36 3.8 20.2
17 13.5 75.8 37 14.9 68.5
18 8.7 48.7 38 12.4 72.6
19 11.0 54.7 39 11.1 54.3
20 8.3 32.8 40 8.9 38.5

TABLE 3.2 The Testing Set

Observation Number 1 2 3 4 5 6 7 8 9 10
x 8.5 9.4 5.4 11.7 6.5 10.3 12.7 11.0 15.4 2.8
y 49.4 43.0 19.3 56.4 28.3 53.7 58.1 28.7 80.7 13.6

Use the training set to obtain the line of best fit of y on x, and the testing set to examine how well the line fits the data.

First, read the training set into R.

x_train <- c(11.8, 10.8, 8.6, ..., 8.9)
y_train <- c(31.3, 59.9, 27.6, ..., 38.5)

and the testing set

x_test <- c(8.5, 9.4, 5.4, ..., 2.8)
y_test <- c(49.4, 43.0, 19.3, ..., 13.6)

Then plot the training set to establish whether a linear trend exists.

plot(x_train, y_train, main = "Training Data", font.main = 1)

gives Fig. 3.17.


Figure 3.17 The Scatter of the Training Data

Since Fig. 3.17 shows a linear trend, we obtain the line of best fit of y on x and superimpose it on the scatter diagram in Fig. 3.17. In R, write

abline(lm(y_train ~ x_train))

to get Fig. 3.18.


Figure 3.18 The Line of Best Fit for the Training Data

Next, we use the testing data to decide on the suitability of the line.

The coefficients of the line are obtained in R with

lm(y_train ~ x_train)

Call:
lm(formula = y_train ~ x_train)

Coefficients:
(Intercept)      x_train
    -0.9764       4.9959

The estimated values are calculated in R as follows:

y_est <- -0.9764 + 4.9959 * x_test
round(y_est, 1)

which gives

[1] 41.5 46.0 26.0 57.5 31.5 50.5 62.5 54.0 76.0 13.0
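The same estimates can be produced without retyping the rounded coefficients by letting R's predict() apply the fitted model itself. The sketch below is not from the text; it simply repeats the calculation using the full data from Tables 3.1 and 3.2.

```r
# Full training data from Table 3.1 and x-values from Table 3.2.
x_train <- c(11.8, 10.8, 8.6, 10.3, 8.5, 11.6, 14.4, 8.6, 12.4, 14.9,
             8.9, 8.7, 11.7, 11.4, 8.8, 5.9, 13.5, 8.7, 11.0, 8.3,
             15.1, 14.7, 10.5, 10.9, 11.6, 9.1, 5.4, 8.8, 11.2, 7.4,
             7.9, 12.2, 8.5, 9.3, 10.0, 3.8, 14.9, 12.4, 11.1, 8.9)
y_train <- c(31.3, 59.9, 27.6, 57.7, 50.2, 52.1, 79.1, 32.3, 58.8, 79.5,
             57.0, 35.1, 68.2, 60.1, 44.5, 28.9, 75.8, 48.7, 54.7, 32.8,
             80.1, 66.9, 42.0, 72.9, 67.8, 45.3, 30.2, 49.6, 44.3, 46.1,
             45.1, 46.5, 42.7, 56.3, 27.4, 20.2, 68.5, 72.6, 54.3, 38.5)
x_test <- c(8.5, 9.4, 5.4, 11.7, 6.5, 10.3, 12.7, 11.0, 15.4, 2.8)

fit <- lm(y_train ~ x_train)                          # fit on training data
y_est <- predict(fit, newdata = data.frame(x_train = x_test))
round(y_est, 1)                                       # estimates for the testing set
```

Because predict() uses the unrounded coefficients, the results may differ from the hand calculation by at most a digit in the last place.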

We now compare these estimated values with the observed values.

y_test
[1] 49.4 43.0 19.3 56.4 28.3 53.7 58.1 28.7 80.7 13.6

plot(x_test, y_test, main = "Testing Data", font.main = 1)
abline(lm(y_train ~ x_train))  # plot the line of best fit
segments(x_test, y_test, x_test, y_est)

gives Fig. 3.19. Here, segments plots vertical lines between the points (x_test, y_test) and (x_test, y_est).

Figure 3.19 shows the observed values, y_test, along with the values estimated from the line, y_est. The vertical lines illustrate the differences between them. A decision then has to be made as to whether the line is a "good fit" or whether an alternative model should be investigated.


Figure 3.19 Differences Between Observed and Estimated Values in the Testing Set
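The text makes this decision by eye from Fig. 3.19. A common numeric companion, not used in the book at this point, is the mean squared error of the predictions on the testing set (smaller is better); the variable name mse below is illustrative.

```r
# Mean squared error of the line's predictions on the testing set,
# using the data from Table 3.2 and the coefficients reported above.
x_test <- c(8.5, 9.4, 5.4, 11.7, 6.5, 10.3, 12.7, 11.0, 15.4, 2.8)
y_test <- c(49.4, 43.0, 19.3, 56.4, 28.3, 53.7, 58.1, 28.7, 80.7, 13.6)
y_est  <- -0.9764 + 4.9959 * x_test       # estimates from the fitted line
mse <- mean((y_test - y_est)^2)           # average squared difference
round(mse, 1)
```

Most of this error comes from a single observation (number 8), which is also the longest vertical segment in Fig. 3.19.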

The line of best fit is the simplest regression model; it uses just one independent variable for prediction. In real-life situations, many more independent variables, or other models such as a quadratic, may be required, but for supervised learning the approach is always the same:

 Determine if there is a relationship between the dependent variable and the independent variables;

 Fit the model to the training data;

 Test the suitability of the model by predicting the y-values in the testing data from the model and comparing the observed and predicted y-values.

The predictions from these models assume that the trend, based on the data analyzed, continues to hold. Should the trend change, for example when a house pricing model is estimated from data gathered before an economic crash, the predictions will no longer be valid.

Regression analysis is just one of the many techniques from the area of Probability and Statistics that machine learning invokes. We will encounter more in later chapters. Should you wish to go into this topic more deeply, we recommend the book, A First Course in Machine Learning by Girolami (2015).

