Sports Analytics in Practice with R - Ted Kwartler

Applying R Basics to Real Data

Let’s reward your laborious work through foundational R coding with something of actual interest: sports data. Like many scripts in this book, this one begins by loading packages. For each package, you first need to run `install.packages`; assuming that executes without error, the following library calls will specialize R for the task at hand. As an example script using real sports data, our only goal is to obtain the data, manipulate it, and finally plot it.

To begin, call `library(RCurl)`, which is a general network interface client. Functions within this library allow R to make a network connection to download the data. One could have data locally in a file, connect to an API or a database, or even web scrape the data. However, in the upcoming code, the data are downloaded directly from an online repository. Next, `library(ggplot2)` loads the grammar of graphics namespace with excellent visualization capabilities. The `library(ggthemes)` call is a convenience library accompanying `ggplot2` for quick, predefined aesthetics. Lastly, the `library(tidyr)` functions are used for tidying data, which is a style of data organization that is efficient if not intuitive. Here, the basic raw data will be rearranged before plotting.

library(RCurl)
library(ggplot2)
library(ggthemes)
library(tidyr)
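Because each `library` call fails when its package has never been installed, it can help to check first. The following is a minimal sketch rather than part of the original script; the helper name `findMissingPackages` is hypothetical, and the `install.packages` call is left commented out so nothing is installed unintentionally.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
# requireNamespace() returns TRUE or FALSE instead of raising an error.
findMissingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

neededPkgs  <- c("RCurl", "ggplot2", "ggthemes", "tidyr")
missingPkgs <- findMissingPackages(neededPkgs)
print(missingPkgs)

# Uncomment to install whatever is missing:
# if (length(missingPkgs) > 0) install.packages(missingPkgs)
```

If `missingPkgs` prints as an empty character vector, the four `library` calls should load without an installation step.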

Next, before establishing a connection between R and the data repository, a character object called `c1Data` is created. The character string is the web URL of the raw comma-separated values (CSV) file. If you open this web address in a typical browser, you will see the raw text-based regular season statistics for the Dallas NBA team in the 2019–2020 season. However, the following code does not open a browser; instead, it downloads this simple file before loading it as an R object.

c1Data <- 'https://raw.githubusercontent.com/kwartler/Practical_Sports_Analytics/main/C1_Data/2019-2020%20Dallas%20Player%20Stats.csv'

Now, to execute a network connection, employ the `getURL` function, which lies within the `RCurl` package. This function simply accepts the string URL address previously defined. Be sure to have the address exactly correct to avoid any errors.

nbaFile <- getURL(c1Data)

Finally, the base-R function `read.csv` is used with the downloaded data. The `read.csv` function is widely used because CSV files are ubiquitous. Further, the function can accept a path to a local file on disk instead of the text downloaded here, but the path must be exactly correct. Stray spaces, wrong capitalization, and misspellings will result in cryptic and frustrating file-not-found errors. Assuming the web address was correct and the `getURL` function executed without error, the result of this code is a new object called `nbaData`. It is automatically read in as a `data.frame` object.

nbaData <- read.csv(text = nbaFile)
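To see the two input styles of `read.csv` side by side without a network connection, here is a small sketch. The CSV content, player names, and numbers are invented for illustration and are not the Dallas data; `csvPath` points to a temporary file created only for this example.

```r
# Hypothetical CSV content; names and values are made up.
csvText <- "PLAYER,TEAM,MPG\nPlayer A,Dal,30.1\nPlayer B,Dal,12.4"

# 1. Parse CSV text already held in memory, as getURL() would return it.
fromText <- read.csv(text = csvText)

# 2. Parse the same content from a local file, checking the path first
#    so a wrong path produces a clear message.
csvPath <- tempfile(fileext = ".csv")
writeLines(csvText, csvPath)
if (file.exists(csvPath)) {
  fromFile <- read.csv(csvPath)
} else {
  stop("File not found: ", csvPath)
}

# Both routes should yield a data frame of the same shape.
dim(fromText)
dim(fromFile)
```

The `text` argument is what lets `read.csv` work on the character string returned by `getURL`; without it, the first argument is treated as a file path or URL.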

Unlike a spreadsheet program where you can scroll to any area of the sheet to look at the contents, R holds the data frame as an object which is an abstraction. As a result, it can be difficult to comprehend the loaded data. Thus, it is a best practice to explore the data to learn about its characteristics. In fact, exploratory data analysis, EDA, in itself is a robust field within analytics. The code below only scratches the surface of what is possible.

To begin this basic EDA, determine the dimensions of the data by applying the `dim` function to the `nbaData` data frame. This will print the total rows and columns for the data frame. Similar to the indexing code, the first number represents the rows and the second the columns.

dim(nbaData)

Since data frames have named columns, you may want to know what the column headers are. The base-R function `names` accepts a few types of objects and in this case will print the column names of the basketball data.

names(nbaData)
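As a quick illustration of what `dim` and `names` return, the sketch below uses a tiny made-up data frame called `toyData`; its contents are assumptions for demonstration only.

```r
# Hypothetical three-player data frame.
toyData <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C"),
  TEAM   = c("Dal", "Dal", "Dal"),
  MPG    = c(30.1, 12.4, 21.7)
)

toyDims <- dim(toyData)   # rows first, then columns
toyDims[1]                # number of rows, same as nrow(toyData)
toyDims[2]                # number of columns, same as ncol(toyData)

toyNames <- names(toyData)  # character vector of column headers
toyNames
```

Because `dim` returns an ordinary integer vector, its elements can be indexed or reused in later code, for example to loop over columns.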

At this point you know the column names and the size of the data loaded in the environment. Another popular way to get familiar with the data is to glimpse a portion of it. This is preferred to calling the entire object in your console. Data frames can often be tens of thousands of rows or more, plus hundreds of columns. If you call a large object directly in the console, your system may lag trying to print that much data as output. Thus, the popular `head` function accepts a data object along with an integer parameter, `n`, representing the number of records to print. Since this function call is not being assigned to an object, the result is printed to the console for review. The default behavior selects six rows, though this can be adjusted for more or fewer observations. When called, the `head` function will print the first `n` rows of the data frame. This is in contrast to the `tail` function, which will print the last `n` rows.

head(nbaData, n = 6)
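The contrast between `head` and `tail` is easiest to see on a small object. This sketch assumes a made-up ten-row data frame, `toyGames`, whose columns are invented for the example.

```r
# Hypothetical ten-row data frame.
toyGames <- data.frame(game = 1:10, points = seq(2, 20, by = 2))

firstDefault <- head(toyGames)         # default: first six rows
firstThree   <- head(toyGames, n = 3)  # first three rows
lastTwo      <- tail(toyGames, n = 2)  # last two rows

firstThree
lastTwo
```

Assigning the result to an object, as done here, suppresses printing; typing the object name afterward prints it to the console.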

You should notice that the column `TEAM` shows “Dal” for all results in the `head` function. To ensure this data set only contains players from the Dallas team you can employ the `table` function specifying the `TEAM` column either by name or by index position. The `table` function merely tallies the levels or values of a column. After running the next code chunk, you see that “Dal” appears 19 times in this data set. Had there been another value in this column, additional tallied information would be presented.

table(nbaData$TEAM)
table(nbaData[,2])
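To show what the tally looks like when a column holds more than one value, here is a sketch on invented data. The second team value, "Bos", is purely hypothetical and does not appear in the Dallas data set.

```r
# Hypothetical data with two distinct team values.
toyTeams <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C"),
  TEAM   = c("Dal", "Dal", "Bos")
)

byName  <- table(toyTeams$TEAM)   # select the column by name
byIndex <- table(toyTeams[, 2])   # select the same column by position

byName
```

Both calls tally the same column, so a mismatch between them would suggest that the index position does not point at the column you intended.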

Lastly, another basic EDA function is `summary`. The `summary` function can be applied to any object and will return some information determined by the type of object it receives. In the case of a data frame, the `summary` function will examine each column individually. It will denote character columns and, when a column is declared as a factor, will tally the different factor levels. Perhaps most important is how `summary` treats numeric columns. For each numeric column, the minimum, first quartile, median, mean, third quartile, and maximum are returned. If missing values are stored as “NA” in a particular column, the function will also tally them. This allows the practitioner to understand each column’s range, distribution, and average, and how much of the column contains NA values.

summary(nbaData)
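The behavior described above can be checked on a small invented example. In this sketch, `toyData` is hypothetical: one factor column and one numeric column with a single missing value.

```r
# Hypothetical data: a factor column and a numeric column containing one NA.
toyData <- data.frame(
  TEAM = factor(c("Dal", "Dal", "Dal", "Dal", "Dal")),
  PPG  = c(1, 2, 3, 4, NA)
)

# Column-by-column summary of the whole data frame.
summary(toyData)

# Summary of the numeric column alone; the NA count is reported separately
# and the other statistics are computed on the non-missing values.
ppgSummary <- summary(toyData$PPG)
names(ppgSummary)
ppgSummary
```

Note that `summary` ignores the missing value when computing the quartiles and mean, whereas calling `mean(toyData$PPG)` directly would return `NA` unless `na.rm = TRUE` is supplied.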

Now that you have a rudimentary understanding of the player-level Dallas basketball data set, you can visualize aspects of it. For example, one would expect that the more minutes a player averages per game, the more points the player averages per game. To confirm this assumption, a simple scatter plot may help identify the correlation. Of course, you can calculate the correlation with the `cor` function, but visualizing data is often a powerful tool in a sports analyst’s toolkit. The `ggplot2` library contains a convenience function called `qplot` for quick plotting. This function accepts the name of a column for the x-axis, followed by another column name to plot on the y-axis. The last parameter is the data itself. The `data` parameter requires a data frame so that the specific columns can be plotted. Additionally, an optional aesthetic is added declaring the `size` for each dot in the scatter plot.

The code below adds another aspect to the `qplot` to improve the overall look. Specifically, another “layer” is added from the `ggthemes` library to adjust many parameters within a single function call. Here, the function `theme_hc`, called with no arguments, emulates the popular “Highcharts” JavaScript theme. As is standard with `ggplot2` objects, additional parameters such as aesthetics are added in layers using the `+` sign. Here it is not the arithmetic addition sign but an operator that appends layers to `ggplot` objects. Figure 1.8 is the result of the `qplot` and `theme_hc` adjustment using the Dallas basketball data to explore the relationship between average minutes per game and average points per game.


Figure 1.8 As expected the more minutes a player averages the higher the average points.

qplot(x = MPG_min_per_game, y = POINTS_PER_GAME, size = 5, data = nbaData) + theme_hc()
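The text mentions `cor` as the numeric counterpart to the scatter plot. The sketch below computes it on invented numbers and also shows one way to build a similar scatter plot with the full `ggplot` function, since recent `ggplot2` releases mark `qplot` as deprecated. The data values are hypothetical, only the column names mirror the ones used above, and the `theme_hc` layer is omitted to keep the example to a single package.

```r
library(ggplot2)

# Hypothetical minutes and points; not the actual Dallas statistics.
toyData <- data.frame(
  MPG_min_per_game = c(10, 20, 30, 40),
  POINTS_PER_GAME  = c(4, 9, 15, 22)
)

# Pearson correlation between minutes and points (a value near 1 indicates
# a strong positive linear relationship).
minutesPointsCor <- cor(toyData$MPG_min_per_game, toyData$POINTS_PER_GAME)
minutesPointsCor

# A scatter plot built with ggplot() and geom_point() instead of qplot().
scatterPlot <- ggplot(data = toyData,
                      mapping = aes(x = MPG_min_per_game,
                                    y = POINTS_PER_GAME)) +
  geom_point(size = 5)
```

Placing `size = 5` inside `geom_point` sets a fixed dot size rather than mapping size to a variable, so no size legend is drawn.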

Let’s add a bit more complexity to the visualization by creating a heatmap. The heatmap chart has x and y axes but represents data amounts as color intensity. A heatmap allows the audience to comprehend complex data quickly and concisely. To begin, let’s use the `data.frame` function to create a smaller data set. Here, the column names are being renamed and each individual column from the `nbaData` object is explicitly selected. The new object has the same number of rows but a subset of the columns. There are additional functions to perform this operation but this is straightforward. As the book continues, more concise though complex examples will perform the same operation.

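# The first column name likely carries a byte-order-mark prefix (shown here as "ï.");
# check names(nbaData) and adjust the name if yours differs.
smallerStats <- data.frame(player   = nbaData$ï.PLAYER,
                           FTA      = nbaData$FTA_free_throws_attempted,
                           TWO_PA   = nbaData$TWO_PA,
                           THREE_PA = nbaData$THREE_PA)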

In order to construct a heatmap with `ggplot2`, the `smallerStats` data frame must be rearranged into a “tidy” format. This type of data organization can be difficult to comprehend for novice R programmers, but the main point is that the data is not being changed, merely rearranged. The `tidyr` library function `pivot_longer` accepts the data frame first. Next, the `cols` parameter defines which columns to pivot. In this case, `-c(player)` selects every column except `player`, so the three statistic columns are pivoted while `player` is kept as the identifier. This will result in each player’s name being repeated and two new columns being created. These columns are named by the `names_to` and `values_to` parameters, respectively. In the end, each player and corresponding statistic name and value are captured as a row. Whereas the `smallerStats` data frame had 19 observations with 4 columns, the pivoted `nbaDataLong` object has 57 rows (19 players times 3 statistics) and 3 columns. After the pivot, the `head` function is executed to demonstrate the difference.

nbaDataLong <- pivot_longer(data = smallerStats,
                            cols = -c(player),
                            names_to = "stat",
                            values_to = "value")
head(nbaDataLong)
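The reshaping is easier to follow on a data frame small enough to read in full. This sketch applies the same `pivot_longer` call to a made-up two-player version of `smallerStats`; the player names and values are assumptions for illustration.

```r
library(tidyr)

# Hypothetical wide data: one row per player, one column per statistic.
toyStats <- data.frame(
  player   = c("Player A", "Player B"),
  FTA      = c(5.1, 1.2),
  TWO_PA   = c(9.0, 3.3),
  THREE_PA = c(7.4, 2.0)
)

# Pivot every column except `player` into stat/value pairs.
toyLong <- pivot_longer(data = toyStats,
                        cols = -c(player),
                        names_to = "stat",
                        values_to = "value")

dim(toyStats)  # 2 rows, 4 columns
dim(toyLong)   # 6 rows (2 players times 3 statistics), 3 columns
toyLong
```

Every value from the wide table appears exactly once in the long table, which is why the row count grows to players times statistics while the column count shrinks to three.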

Now that the data has been rearranged, it will be readily accepted by the `ggplot` function. Instead of the previous `qplot` function, the more expansive `ggplot` function is now called. The first parameter is the `data` object. The next parameter is the `mapping` aesthetics. This is a multi-part input declared with yet another function, `aes`. Within the `aes` function, the column names to be plotted are defined. Specifically, these are the x-axis column name, `stat`, followed by the y-axis column name, `player`, and finally the fill value, which corresponds to the `value` column. Thus, the visual is set up so that player statistics are arranged on the x-axis, each individual player occupies a single row along the y-axis, and the color intensity is scaled by the player’s corresponding statistical value.

Once the base layer plot has been defined, another layer is added with the `+` sign to declare the type of plot needed. In this case, the heatmap is called using `geom_tile`. In subsequent chapters, additional visuals are illustrated, including more `ggplot2` charts and dynamic interactive graphics. Since this text requires gray-scale graphics, another layer is added to define the color intensity between `lightgrey` and `black`. Finally, another layer is added to retitle the x-axis label as “Scoring Statistics,” encased in quotes because it is a label, not an object or column name. For simplicity, this is captured in an object called `heatPlot`.

heatPlot <- ggplot(data = nbaDataLong,
                   mapping = aes(x = stat, y = player, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "lightgrey", high = "black") +
  xlab(label = "Scoring Statistics")

Although calling `heatPlot` now in the console will create the visual, some additional layers can be added. First, a predefined theme for Highcharts is added, just as before, using `theme_hc`. Next, a chart title is declared with `ggtitle` along with the quoted “Dallas Team Offensive Stats.” Lastly, a `theme` is appended as the final layer that simply removes the legend altogether. Now when the `heatPlot` object is called, a clean, visually compelling plot is created that clearly shows the most offensively productive player on the team for the three statistics. Additionally, other players’ strengths in these statistics are easily understood because their sections are darker compared to teammates. Conversely, weaker players in these stats have a lighter color. These facts are more quickly understood in a visual than by reviewing a table of player data. The result of the `heatPlot` object is shown in Figure 1.9.


Figure 1.9 The Dallas team statistics represented in a heatmap illustrating the most impactful players among these statistics in the 2019–2020 regular NBA season.

heatPlot <- heatPlot +
  theme_hc() +
  ggtitle('Dallas Team Offensive Stats') +
  theme(legend.position = "none")

There are multiple ways to extend the lessons of this chapter to improve R coding fluency. For example, the data itself can be explored further or subset by position. Additional visualizations are also possible although many of these topics are covered in subsequent chapters with expanded explanations.
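As one possible starting point for subsetting by position, consider the sketch below. It is an illustration under stated assumptions: the column name `POS`, the position codes, and all values are hypothetical, so check `names(nbaData)` for the actual position column before adapting it.

```r
# Hypothetical player data with an assumed position column `POS`.
toyData <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C", "Player D"),
  POS    = c("G", "F", "G", "C"),
  PPG    = c(20.5, 8.1, 11.3, 6.7)
)

# Keep only the guards, using subset() or equivalent logical indexing.
guards        <- subset(toyData, POS == "G")
guardsIndexed <- toyData[toyData$POS == "G", ]

# Average points per game within each position.
meanByPos <- aggregate(PPG ~ POS, data = toyData, FUN = mean)

guards
meanByPos
```

The same filtered data frame could then be passed to the plotting code from this chapter to compare players within a single position.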
