Читать книгу Machine Learning For Dummies - John Paul Mueller, John Mueller Paul, Luca Massaron - Страница 48
Defining What Training Means
ОглавлениеMany people are somewhat used to the idea that applications start with a function, accept data as input, and then provide a result. For example, a programmer might create a function called Add()
that accepts two values as input, such as 1
and 2
. The result of Add()
is 3
. The output of this process is a value. In the past, writing a program meant understanding the function used to manipulate data to create a given result with certain inputs.
Machine learning turns this process around. In this case, you know that you have inputs, such as 1
and 2
. You also know that the desired result is 3
. However, you don’t know what function to apply to create the desired result. Training provides a learner algorithm with all sorts of examples of the desired inputs and results expected from those inputs. The learner then uses this input to create a function. In other words, training is the process whereby the learner algorithm maps a flexible function to the data. The output is typically the probability of a certain class or a numeric value.
A single learner algorithm can learn many different things, but not every algorithm is suited for certain tasks. Some algorithms are general enough that they can play chess, recognize faces on Facebook, and diagnose cancer in patients. An algorithm reduces the data inputs and the expected results of those inputs to a function in every case, but the function is specific to the kind of task you want the algorithm to perform.
The secret to machine learning is generalization. The goal is to generalize the output function so that it works on data beyond the training set. For example, consider a spam filter. Your dictionary contains 100,000 words (actually a small dictionary). A limited training dataset of 4,000 or 5,000 word combinations must create a generalized function that can then find spam in the 2100000 combinations that the function will see when working with actual data.
When viewed from this perspective, training might seem impossible and learning even worse. However, to create this generalized function, the learner algorithm relies on just three components:
Representation: The learner algorithm creates a model, which is a function that will produce a given result for specific inputs. The representation is a set of models that a learner algorithm can learn. In other words, the learner algorithm must create a model that will produce the desired results from the input data. If the learner algorithm can’t perform this task, it can’t learn from the data and the data is outside the hypothesis space of the learner algorithm. Part of the representation is to discover which features (data elements within the data source) to use for the learning process.
Evaluation: The learner can create more than one model. However, it doesn’t know the difference between good and bad models. An evaluation function determines which of the models works best in creating a desired result from a set of inputs. The evaluation function scores the models because more than one model could provide the required results.
Optimization: At some point, the training process produces a set of models that can generally output the right result for a given set of inputs. At this point, the training process searches through these models to determine which one works best. The best model is then output as the result of the training process.
Much of this book focuses on representation. For example, in Chapter 10 you discover how to create a working spam detector using the Naïve Bayes algorithm, based on a probabilistic representation of the problem. However, the training process is more involved than simply choosing a representation. All three steps come into play when performing the training process. Fortunately, you can start by focusing on representation and allow the various libraries discussed in the book to do the rest of the work for you.