Data Science For Dummies - Lillian Pierson - Page 53
Walking through the steps of the machine learning process
Three main steps are involved in machine learning: setup, learning, and application. Setup involves acquiring data, preprocessing it, selecting the most appropriate variables for the task at hand (called feature selection), and breaking the data into training and test datasets. You use the training data to train the model, and the test data to test the accuracy of the model’s predictions. The learning step involves model experimentation, training, building, and testing. The application step involves model deployment and prediction.
Here’s a rule of thumb for breaking data into training and test sets: Randomly sample two-thirds of the original dataset and use that sample to train the model. Use the remaining one-third of the data as test data, for evaluating the model’s predictions.
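As a rough illustration of that rule of thumb, here is a minimal Python sketch that uses only the standard library. The function name split_train_test, the seed value, and the 30-row stand-in dataset are all hypothetical and chosen just for this example; in practice you would likely use whatever splitting utility your data science library provides.

```python
import random

def split_train_test(records, train_fraction=2/3, seed=0):
    """Randomly split records into a training list and a test list."""
    rng = random.Random(seed)      # seeded only so the sketch is repeatable
    shuffled = list(records)       # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)          # random order gives every row an equal chance
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for a dataset of 30 observations.
observations = list(range(30))
train, test = split_train_test(observations)
print(len(train), len(test))       # prints "20 10"
```

Shuffling first and then cutting the list guarantees that the two sets never overlap and that together they cover the whole dataset.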
In a random sample, every observation in the original dataset has an equal probability of being selected. Figure 3-1 below shows a simple example of a random sample. You need your sample to be randomly chosen so that it represents the full dataset in an unbiased way. Random sampling lets you train and test a model without selection bias.
FIGURE 3-1: An example of a simple random sample
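To make the "equal probability" idea concrete, the following Python sketch draws a simple random sample and then checks empirically that each observation is picked about equally often. The population of 10 numbered observations, the sample size of 4, the seed, and the number of trials are assumptions made up for this illustration; they do not come from the figure.

```python
import random
from collections import Counter

population = list(range(10))   # hypothetical dataset of 10 observations
rng = random.Random(42)        # seeded only so the sketch is repeatable

# One simple random sample of 4 observations, drawn without replacement.
sample = rng.sample(population, 4)

# Repeat the draw many times and count how often each observation appears.
trials = 10_000
counts = Counter()
for _ in range(trials):
    counts.update(rng.sample(population, 4))

# Each observation should show up in roughly 4/10 of the draws.
frequencies = {obs: counts[obs] / trials for obs in population}
for obs, freq in frequencies.items():
    print(obs, round(freq, 2))
```

If some observations appeared far more often than others, the sampling procedure would be biased, and a model trained on such samples could inherit that selection bias.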