Machine Learning for Time Series Forecasting with Python - Francesca Lazzeri

Supervised Learning for Time Series Forecasting

Machine learning is a subset of artificial intelligence that uses techniques (such as deep learning) that enable machines to use experience to improve at tasks (aka.ms/deeplearningVSmachinelearning). The learning process is based on the following steps:

1 Feed data into an algorithm. (In this step you can provide additional information to the model, for example, by performing feature extraction.)
2 Use this data to train a model.
3 Test and deploy the model.
4 Consume the deployed model to do an automated predictive task. In other words, call and use the deployed model to receive the predictions returned by the model (aka.ms/deeplearningVSmachinelearning).
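The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data values are made up, the model is a one-feature least-squares line chosen for brevity, and "deployment" is reduced to calling a function. The names `fit_line` and `predict` are hypothetical helpers, not part of any library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Step 2: train a one-feature linear model y = a*x + b by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def predict(model, x):
    """Step 4: consume the trained model to obtain a prediction."""
    a, b = model
    return a * x + b


# Step 1: feed data (made-up values that follow y = 2x + 1 exactly).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
train_x, test_x = xs[:4], xs[4:]
train_y, test_y = ys[:4], ys[4:]

# Step 2: train on the first part of the data.
model = fit_line(train_x, train_y)

# Step 3: test on held-out data the model has not seen.
test_errors = [abs(predict(model, x) - y) for x, y in zip(test_x, test_y)]

# Step 4: call the model on a new input.
new_prediction = predict(model, 7.0)
print(model, max(test_errors), new_prediction)
```

In a real project each step is usually far more involved (feature extraction, validation, a serving endpoint), but the order of operations stays the same.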

Machine learning is a way to achieve artificial intelligence. By using machine learning and deep learning techniques, data scientists can build computer systems and applications that do tasks that are commonly associated with human intelligence. These tasks include time series forecasting, image recognition, speech recognition, and language translation (aka.ms/deeplearningVSmachinelearning).

There are three main classes of machine learning: supervised learning, unsupervised learning, and reinforcement learning. In the following few paragraphs, we will have a closer look at each of these machine learning classes:

Supervised learning is a type of machine learning system in which both the input (the collection of values for the variables in your data set) and the desired output (the known values of the target variable) are provided. Data is identified and labeled a priori to provide the algorithm a learning memory for future data handling. An example of a numerical label is the sale price associated with a used car (aka.ms/MLAlgorithmCS). The goal of supervised learning is to study many labeled examples like these and then make predictions about future data points, for example, assigning accurate sale prices to other used cars whose characteristics are similar to those of the labeled cars. It is called supervised learning because data scientists supervise the process of an algorithm learning from the training data set (www.aka.ms/MLAlgorithmCS): they know the correct answers and they iteratively share them with the algorithm during the learning process. There are several specific types of supervised learning. Two of the most common are classification and regression:

Classification: Classification is a type of supervised learning used to identify what category new information belongs in. It can answer simple two-choice questions, like yes or no, true or false, for example:
Is this tweet positive?
Will this customer renew their service?
Which of two coupons draws more customers?

Classification can also be used to predict between several categories, in which case it is called multi-class classification. It answers complex questions with multiple possible answers, for example:
What is the mood of this tweet?
Which service will this customer choose?
Which of several promotions draws more customers?

Regression: Regression is a type of supervised learning used to forecast the future by estimating the relationship between variables. Data scientists use it to achieve the following goals:
Estimate product demand
Predict sales figures
Analyze marketing returns
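To make the used-car example concrete, here is a minimal sketch of learning from labeled examples. It uses a 1-nearest-neighbor rule, which is a technique swapped in here for brevity and not one the text prescribes. The feature choice (age and mileage), the prices, and the helper name `predict_price` are all invented for illustration.

```python
# Labeled examples: (age in years, mileage in thousands of km) -> sale price.
# All numbers are invented for illustration.
labeled_cars = [
    ((2, 30), 18000),
    ((5, 80), 11000),
    ((10, 160), 4000),
]


def predict_price(car, examples):
    """Return the label of the most similar labeled example (1-nearest neighbor)."""
    def distance(a, b):
        # Squared Euclidean distance on the raw, unscaled features.
        return sum((p - q) ** 2 for p, q in zip(a, b))

    _, price = min(examples, key=lambda ex: distance(ex[0], car))
    return price


# A new, unlabeled car that resembles the second labeled example.
print(predict_price((4, 70), labeled_cars))
```

Replacing the numeric prices with category labels (for example "renews" and "does not renew") would turn the same rule into a classifier; the difference between regression and classification lies in the type of label, not in the idea of learning from labeled data.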

Unsupervised learning is a type of machine learning system in which data points for the input have no labels associated with them. In this case, data is not labeled a priori, so the unsupervised learning algorithm itself has to organize the data and describe its structure. This can mean grouping it into clusters or finding different ways of looking at complex data structures (aka.ms/MLAlgorithmCS). There are several types of unsupervised learning, such as cluster analysis, anomaly detection, and principal component analysis:

Cluster analysis: Cluster analysis is a type of unsupervised learning used to separate similar data points into intuitive groups. Data scientists use it when they have to discover structures among their data, such as in the following examples:
Perform customer segmentation
Predict customer tastes
Determine market price

Anomaly detection: Anomaly detection is a type of unsupervised learning used to identify and predict rare or unusual data points. Data scientists use it when they have to discover unusual occurrences, such as with these examples:
Catch abnormal equipment readings
Detect fraud
Predict risk

The approach that anomaly detection takes is to simply learn what normal activity looks like (for example, using a history of non-fraudulent transactions) and identify anything that is significantly different.

Principal component analysis: Principal component analysis is a method for reducing the dimensionality of the feature space by representing it with fewer uncorrelated variables. Data scientists use it when they need to combine input features in order to drop the least important features while still retaining the most valuable information from the features in the data set. Principal component analysis is very helpful when data scientists need to answer questions such as the following:
How can we understand the relationships between each variable?
How can we look at all of the variables collected and focus on a few of them?
How can we avoid the danger of overfitting our model to our data?
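The anomaly detection idea described above (learn what normal activity looks like, then flag anything significantly different) can be sketched with a simple mean and standard deviation rule. This is one possible approach among many, shown under assumptions: the sensor readings are made up, the threshold of three standard deviations is an arbitrary choice, and `fit_normal` and `is_anomaly` are hypothetical helper names.

```python
from statistics import mean, stdev


def fit_normal(history):
    """Learn what 'normal' looks like: the mean and standard deviation."""
    return mean(history), stdev(history)


def is_anomaly(value, normal, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = normal
    return abs(value - mu) > threshold * sigma


# Made-up history of normal equipment readings (no labels are needed).
normal_readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]
normal = fit_normal(normal_readings)

# Score two new readings: one typical, one far outside the usual range.
flags = [is_anomaly(v, normal) for v in [20.1, 35.0]]
print(flags)
```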

Reinforcement learning is a type of machine learning system in which the algorithm is trained to make a sequence of decisions. The algorithm learns to achieve a goal in an uncertain, potentially complex environment by employing a trial and error process to come up with a solution to the problem (aka.ms/MLAlgorithmCS). Data scientists need to define the problem a priori, and the algorithm gets either rewards or penalties for the actions it performs. Its goal is to maximize the total reward. It's up to the model to figure out how to perform the task to maximize the reward, starting from totally random trials. Here are some examples of applications of reinforcement learning:
Reinforcement learning for traffic signal control
Reinforcement learning for optimizing chemical reactions
Reinforcement learning for personalized news recommendations

When data scientists are choosing an algorithm, there are many different factors to take into consideration (aka.ms/AlgorithmSelection):

Evaluation criteria: Evaluation criteria help data scientists evaluate the performance of their solutions by using different metrics to monitor how well machine learning models represent data. They are an important step in the training pipeline to validate a model. Different machine learning approaches call for different evaluation metrics: accuracy, precision, recall, F-score, receiver operating characteristic (ROC), and area under the curve (AUC) for classification scenarios, and mean absolute error (MAE), mean squared error (MSE), R-squared score, and adjusted R-squared for regression scenarios. MAE is a metric that can be used to measure forecast accuracy. As the name denotes, it is the mean of the absolute errors, where the absolute error is the absolute value of the difference between the forecasted value and the actual value. MAE is scale-dependent: because it is not scaled to the average demand, it is of limited use to data scientists who need to compare accuracy across time series with different scales. For time series forecasting scenarios, data scientists can also use the mean absolute percentage error (MAPE) to compare the fits of different forecasting and smoothing methods. This metric expresses each absolute error as a percentage of the corresponding actual value, which allows data scientists to compare forecasts of series on different scales.
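The difference between a scale-dependent metric (MAE) and a percentage metric (MAPE) is easy to see on a small example. The sketch below uses the standard definitions of both metrics; the two series are made up so that the second is simply the first multiplied by 100, and the function names `mae` and `mape` are defined here rather than taken from a library.

```python
def mae(actual, forecast):
    """Mean absolute error, in the same units as the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero.
    """
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(ratios) / len(actual)


# Two made-up series with the same relative errors but different scales.
small_actual, small_forecast = [10, 20, 40], [11, 18, 44]
large_actual, large_forecast = [1000, 2000, 4000], [1100, 1800, 4400]

print(mae(small_actual, small_forecast), mae(large_actual, large_forecast))
print(mape(small_actual, small_forecast), mape(large_actual, large_forecast))
```

The two MAE values differ by a factor of 100 even though the forecasts are equally good in relative terms, while both MAPE values come out at roughly 10 percent.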

Training time: Training time is the amount of time needed to train a machine learning model. Training time is often closely tied to overall model accuracy. In addition, some algorithms are more sensitive to the number of data points than others. When time is limited, it can drive the choice of algorithm, especially when the data set is large.

Linearity: Linearity is a mathematical property that describes a specific relationship between the data points of a data set: the data points can be graphically represented as a straight line. Linear algorithms tend to be algorithmically simple and fast to train. Different machine learning algorithms make use of linearity. Linear classification algorithms (such as logistic regression and support vector machines) assume that classes in a data set can be separated by a straight line. Linear regression algorithms assume that data trends follow a straight line.

Number of parameters: Machine learning parameters are numbers (such as the error tolerance, the number of iterations, or options that select between variants of how the algorithm behaves) that data scientists usually need to select manually in order to improve an algorithm's performance (aka.ms/AlgorithmSelection). The training time and accuracy of the algorithm can sometimes be quite sensitive to getting just the right settings. Typically, algorithms with large numbers of parameters require the most trial and error to find a good combination. Systematically trying many combinations is a great way to make sure you've spanned the parameter space, but the time required to do so increases exponentially with the number of parameters. The upside is that having many parameters typically indicates that an algorithm has greater flexibility. It can often achieve very good accuracy, provided you can find the right combination of parameter settings (aka.ms/AlgorithmSelection).
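The exponential growth mentioned above comes from the fact that every candidate value of one parameter has to be paired with every candidate value of the others. A minimal sketch, assuming a hypothetical grid of three parameters with three candidate values each (the names and values are placeholders, not settings of any real algorithm):

```python
from itertools import product

# Hypothetical parameter names and candidate values, for illustration only.
grid = {
    "tolerance": [1e-2, 1e-3, 1e-4],
    "iterations": [100, 500, 1000],
    "variant": ["a", "b", "c"],
}

# Every combination would require training one model.
combinations = list(product(*grid.values()))
print(len(combinations))  # 3 values for each of 3 parameters: 3**3 = 27
```

Adding a fourth parameter with three candidate values would raise the count to 81 training runs, which is why exhaustive searches become expensive quickly.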

Number of features: Features are properties of a phenomenon based on which data scientists would like to predict results. A large number of features can overload some learning algorithms, making training time long. Data scientists can perform techniques such as feature selection and dimensionality reduction to reduce the number and the dimensionality of the features they have to work with. While both methods are used for reducing the number of features in a data set, there is an important difference:
Feature selection is simply selecting and excluding given features without changing them.
Dimensionality reduction transforms features into a lower dimension.

With these important machine learning concepts in mind, you can now learn how to reshape your forecasting scenario as a supervised learning problem and, as a consequence, get access to a large portfolio of linear and nonlinear machine learning algorithms.

Time series data can be expressed as a supervised learning problem: data scientists usually transform their time series data sets into supervised learning data sets by using previous time steps as the input and the next time step as the output of the model. Figure 1.8 shows the difference between an original time series data set and a data set transformed into a supervised learning problem.

Figure 1.8: Time series data set as supervised learning problem

We can summarize some observations from Figure 1.8 in the following way:

The value of Sensor_1 at the prior time step (for example, 01/01/2020) becomes the input (Value x) in a supervised learning problem.

The value of Sensor_1 at the subsequent time step (for example, 01/02/2020) becomes the output (Value y) in a supervised learning problem.

It is important to note that the temporal order between the Sensor_1 values needs to be maintained during the training of machine learning algorithms.

By performing this transformation on our time series data, the resulting supervised learning data set will show an empty value (NaN) in the first row of Value x. This means that no prior Value x can be leveraged to predict the first value in the time series data set. We suggest removing this row because we cannot use it for our time series forecasting solution.

Finally, the value that follows the last value in the sequence is unknown: this is the value that needs to be predicted by our machine learning model.

How can we turn any time series data set into a supervised learning problem? Data scientists usually exploit the values of prior time steps to predict the subsequent time step value by using a statistical method called the sliding window method. Once the sliding window method is applied and a time series data set is converted, data scientists can leverage standard linear and nonlinear machine learning methods to model their time series data.
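A minimal sketch of the sliding window method on a univariate series follows. It assumes the series is a plain Python list already sorted in time order; the readings are made up, and `sliding_window` is a hypothetical helper written for this illustration.

```python
def sliding_window(series, width=1):
    """Turn a time-ordered sequence into (inputs, output) pairs.

    Each pair uses the `width` prior values as inputs and the next value
    as the output. Time steps without enough history are skipped, which
    has the same effect as dropping the rows that would contain NaN.
    """
    rows = []
    for t in range(width, len(series)):
        rows.append((series[t - width:t], series[t]))
    return rows


sensor_1 = [10, 20, 30, 40, 50]  # made-up readings in time order
pairs = sliding_window(sensor_1, width=1)
print(pairs)
# [([10], 20), ([20], 30), ([30], 40), ([40], 50)]
```

The pairs are produced in temporal order, which matters when the data is later split into training and test sets. With pandas, the same lagged input column is commonly built with `Series.shift(1)` followed by `dropna()`.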

Previously and in Figure 1.8, I used examples of univariate time series: these are data sets where only a single variable is observed at each time, such as energy load at each hour. However, the sliding window method can be applied to a time series data set that includes more than one historical variable observed at each time step and when the goal is to predict more than one variable in the future: this type of time series data set is called multivariate time series (I will discuss this concept in more detail later in this book).

We can reframe a multivariate time series data set with two variables, Value 1 and Value 2, as a supervised learning problem with a window width of one. This means that we will use the prior time step values of Value 1 and Value 2. We will also have available the next time step value for Value 1. We will then predict the next time step value of Value 2. As illustrated in Figure 1.9, this will give us three input features and one output value to predict for each training pattern.

Figure 1.9: Multivariate time series as supervised learning problem
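The framing described for Figure 1.9 (three input features and one output per training pattern, with a window width of one) might be sketched as follows. The two series hold made-up values, and `frame_multivariate` is a hypothetical helper; the layout of the actual figure may differ in column order.

```python
def frame_multivariate(value_1, value_2):
    """Build training patterns with a window width of one.

    Inputs: Value 1 and Value 2 at the prior time step, plus Value 1 at
    the current time step. Output: Value 2 at the current time step.
    """
    rows = []
    for t in range(1, len(value_1)):
        inputs = (value_1[t - 1], value_2[t - 1], value_1[t])
        rows.append((inputs, value_2[t]))
    return rows


value_1 = [1, 2, 3, 4]      # made-up values in time order
value_2 = [10, 20, 30, 40]  # made-up values in time order
rows = frame_multivariate(value_1, value_2)
print(rows)
# [((1, 10, 2), 20), ((2, 20, 3), 30), ((3, 30, 4), 40)]
```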

In the example of Figure 1.9, we worked with two different variables (Value 1 and Value 2) and predicted a single time step ahead, but very often data scientists need to predict multiple time steps ahead for one output variable. This is called multi-step forecasting. In multi-step forecasting, data scientists need to specify the number of time steps ahead to be forecasted, also called the forecast horizon in time series. Forecasting problems usually come in two different formats:

One-step forecast: When data scientists need to predict the next time step (t + 1)

Multi-step forecast: When data scientists need to predict two or more (n) future time steps (t + 1 through t + n)

For example, demand forecasting models predict the quantity of an item that will be sold the following week and the following two weeks, given the sales up until the current week. In the stock market, given the stock prices up until today, one can predict the stock prices for the next 24 hours and 48 hours. Using a weather forecasting engine, one can predict the weather for the next day and for the entire week (Brownlee 2017).

The sliding window method can also be applied to a multi-step forecasting solution to transform it into a supervised learning problem. As illustrated in Figure 1.10, we can use the same univariate time series data set from Figure 1.8 as an example and structure it as a two-step forecasting data set for supervised learning with a window width of one (Brownlee 2017).

Figure 1.10: Univariate time series as multi-step supervised learning

As illustrated in Figure 1.10, data scientists cannot use the first row (time stamp 01/01/2020) and the second to last row (time stamp 01/04/2020) of this sample data set to train their supervised model; hence we suggest removing them. Moreover, this new version of the supervised data set has only one variable, Value x, that data scientists can exploit to predict both Value y and Value y² for the last row (time stamp 01/05/2020).
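A two-step framing with a window width of one can be sketched by extending the sliding window idea with a forecast horizon. The sketch below keeps only the patterns for which every input and output value is known; how those patterns map onto the rows of Figure 1.10 depends on the figure's layout, which is not reproduced here. The readings are made up, and `frame_multi_step` is a hypothetical helper.

```python
def frame_multi_step(series, width=1, horizon=2):
    """Build (inputs, outputs) patterns for multi-step forecasting.

    Each pattern uses `width` prior values as inputs and the following
    `horizon` values as outputs. Patterns with missing values at either
    end of the series are not generated.
    """
    rows = []
    for t in range(width, len(series) - horizon + 1):
        rows.append((series[t - width:t], series[t:t + horizon]))
    return rows


series = [10, 20, 30, 40, 50]  # made-up readings in time order
patterns = frame_multi_step(series, width=1, horizon=2)
print(patterns)
# [([10], [20, 30]), ([20], [30, 40]), ([30], [40, 50])]
```

Setting `horizon=1` reduces this to the one-step framing, so the same helper covers both forecast formats.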

In the next section you will learn about different Python libraries for time series data and how libraries such as pandas, statsmodels, and scikit-learn can help you with data handling, time series modeling, and machine learning, respectively.

Originally developed for financial time series such as daily stock market prices, the robust and flexible data structures in different Python libraries can be applied to time series data in any domain, including marketing, health care, engineering, and many others. With these tools you can easily organize, transform, analyze, and visualize your data at any level of granularity—examining details during specific time periods of interest and zooming out to explore variations on different time scales, such as monthly or annual aggregations, recurring patterns, and long-term trends.
