Machine Learning For Dummies, by John Paul Mueller and Luca Massaron

Understanding the datasets used in this book


Apart from the datasets offered by Scikit-learn (https://scikit-learn.org/stable/datasets/), this book uses a number of datasets that you can access at https://github.com/lmassaron/datasets. These datasets demonstrate various ways in which you can interact with data, and you use them in the examples to perform a variety of tasks. The following list provides a quick overview of each dataset and where it originally comes from:

 Air Passengers (https://www.kaggle.com/rakannimer/air-passengers): A .csv file containing the number of passengers on an example airline per month for 12 years starting in 1949.

 IMDB 50K (https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews): A dataset for binary sentiment classification containing a set of 25,000 highly polar movie reviews for training and 25,000 for testing.

 Fashion MNIST (https://github.com/zalandoresearch/fashion-mnist): A dataset of Zalando's article (https://jobs.zalando.com/en/tech/) images. It consists of a training set of 60,000 examples and a test set of 10,000 examples, each of which is labeled with one of ten categories.

 Palmer Penguins (https://github.com/allisonhorst/palmerpenguins): A package containing two datasets collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

 Shakespeare (https://www.kaggle.com/kingburrito666/shakespeare-plays): A listing of all of Shakespeare's plays, the lines from these plays, and who speaks each line.

 SMS Spam Collection (https://www.kaggle.com/uciml/sms-spam-collection-dataset): A set of tagged SMS messages collected for SMS spam research. It contains 5,574 English-language SMS messages, each tagged as either ham (legitimate) or spam.

 Tennis (https://www.kaggle.com/ehallmar/a-large-tennis-dataset-for-atp-and-itf-betting): A dataset containing statistics for a large number of tennis matches from the ATP and ITF leagues.

 Titanic (https://www.openml.org/d/40945): A dataset describing the survival status of individual passengers on the Titanic. The Titanic data does not contain information about the crew, but it does contain the actual ages of half of the passengers.

 Wine (https://www.kaggle.com/sgus1318/winedata): A dataset containing statistics related to the quality of wine.

The technique for loading each of these datasets can vary according to the source. The following example shows how to load the Air Passengers dataset. You can find the code in the ML4D2E; 04; Dataset Load.ipynb notebook. The downloadable datasets are archived in the Apache Arrow-based Feather File Format (https://arrow.apache.org/docs/python/feather.html). To make this file format accessible in Notebook, you open an Anaconda prompt and type the following command:

conda install feather-format -c conda-forge

The command takes a while to complete as it collects the package information and solves the environment (determines what to do to perform the installation). At some point, you'll need to type y and press Enter to complete the installation. To verify that you have a good installation, use this command:

conda list feather-format

After a few moments, you see output similar to this:

# packages in environment at C:\Users\John\anaconda3:
#
# Name            Version    Build           Channel
feather-format    0.4.1      pyh9f0ad1d_0    conda-forge
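You can also check the installation from inside Python. The following is a minimal sketch that uses only the standard library. It assumes that installing feather-format also makes the pyarrow module importable, because pyarrow is what the loading code in this chapter imports. The is_importable helper is a hypothetical name used for illustration and isn't part of any library.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# pyarrow is the module that the chapter's loading code imports.
# (Assumption: the feather-format install provides it.)
for name in ("pyarrow",):
    status = "available" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If the module shows up as missing, run the code from the same environment in which you performed the conda installation before you try the command again.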

Now that you have the required library, you can load a dataset from those supplied on the book's dataset site. To start, download the air_passengers.feather file from https://github.com/lmassaron/datasets and place it in the folder you created for this book. (In later chapters, you see how to download the .feather files directly from the book's dataset site, but performing the download now keeps things simple.) Here is an example of the code you use to load the Air Passengers dataset as a dataframe.

import pyarrow.feather as feather
read_df = feather.read_feather('air_passengers.feather')
print(read_df)

The result is a 144-row dataframe containing the number of passengers per month. Figure 4-9 shows typical output.


FIGURE 4-9: The read_df object contains the loaded dataset as a dataframe.

