Practical Data Analysis with JMP, Third Edition - Robert Carver
Chapter 3: Describing a Single Variable
Variable Types and Their Distributions
Distribution of a Categorical Variable
Using Graph Builder to Explore Categorical Data Visually
Distribution of a Quantitative Variable
Using the Distribution Platform for Continuous Data
Exploring Further with the Graph Builder
Summary Statistics for a Single Variable
Overview
Once we have framed some research questions and gathered relevant data, the next phase of an investigation is to examine the variability in the data. The goal of descriptive analysis is to summarize where things stand with each variable. In fact, the term statistics comes from the practice of characterizing the state of political affairs through the reporting of facts and figures. This chapter presents several standard tools that we can use to examine how a variable varies, to describe the pattern of variation that it exhibits, and to look for departures from the overall pattern as well.
The Concept of a Distribution
Data analysis generally focuses on one or more variables—attributes of the individual observations. When we speak of a variable’s distribution, we are referring to a pattern of values. The distribution describes the different values the variable can assume, and how often it assumes each value.
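To make the idea concrete, here is a minimal Python sketch (not part of JMP) that tabulates a distribution: each distinct value, how often it occurs, and its share of all observations. The function name `frequency_distribution` and the sample values are made up for illustration.

```python
from collections import Counter

def frequency_distribution(values):
    """Return {value: (count, proportion)} for a list of observations."""
    counts = Counter(values)
    total = len(values)
    return {value: (count, count / total) for value, count in counts.items()}

# Made-up observations of a categorical variable (not from the World Bank table).
sample = ["A", "B", "A", "C", "A", "B"]
dist = frequency_distribution(sample)

for value, (count, proportion) in sorted(dist.items()):
    print(f"{value}: count={count}, proportion={proportion:.3f}")
```

The two numbers reported for each value correspond to the "how often" part of the definition, once as a raw count and once as a proportion.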
In our first example, we will continue to consider the variability of life expectancy around the world. The data that we will use come to us from the World Bank. In Chapter 1, we used a small portion of this data set for 2017. Now we will look at more years.
Variable Types and Their Distributions
In Chapter 2, we did our work in a JMP Project. Get in the habit of using a project for each chapter.
1. Select File ► New ► Project.
2. Select File ► Open, select the Life Expectancy data table, and click Open.
Before doing any analysis, make sure that you can answer these questions:
● What population does this data table represent?
● What is the source of the data?
● How many variables are in the table?
● What data type is each variable?
● What does each variable represent?
● How many observations are there?
Take special note of the way this data table has been organized. We have 12 annual observations for each country, spaced at 5-year intervals, and they are stacked one upon the other. Not surprisingly, JMP refers to this arrangement as stacked data.
As in Chapter 1, we will raise some questions about how life expectancy at birth varies in different parts of the world. There are far too many observations for us to get a general sense of the variation simply by scanning the table visually. We need some sensible ways to find the patterns among the large number of rows. We will begin our analysis by looking at the nominal variable called Region.
Statisticians generally distinguish among four types of data:
Categorical Types | Quantitative Types
Nominal | Interval
Ordinal | Ratio
One reason that it is important to understand the differences among data types is that we analyze them in different ways. In JMP, we differentiate between nominal, ordinal, and continuous data. Nominal and ordinal variables are categorical, distinguishing one observation from another in some qualitative, non-measurable way. Interval and ratio data are both numeric. Interval variables are artificially constructed, like a temperature scale or stock index, with arbitrarily chosen zero points. Most measurement data are considered ratio data because ratios of values are meaningful. For example, a film that lasts 120 minutes is twice as long as one lasting 60 minutes. In contrast, 120 degrees Celsius is not twice as hot as 60 degrees Celsius.
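The temperature example can be checked with a little arithmetic. The sketch below, in Python rather than JMP, converts the two Celsius readings to kelvins (a scale whose zero is not arbitrary) and compares the ratios. The helper name `celsius_to_kelvin` is ours.

```python
def celsius_to_kelvin(degrees_c):
    """Convert a Celsius reading to kelvins, a scale with a true zero."""
    return degrees_c + 273.15

# Ratio data: film lengths in minutes. The ratio is meaningful.
ratio_minutes = 120 / 60

# Interval data: the ratio of the Celsius labels is 2, but that number
# depends entirely on where the scale happens to place its zero.
ratio_celsius_labels = 120 / 60

# On the kelvin scale the same two temperatures differ by far less than 2x.
ratio_kelvin = celsius_to_kelvin(120) / celsius_to_kelvin(60)

print(f"minutes ratio: {ratio_minutes:.2f}")
print(f"Celsius label ratio: {ratio_celsius_labels:.2f}")
print(f"kelvin ratio: {ratio_kelvin:.2f}")
```

The film really is twice as long, while the hotter object is only about 18% hotter in absolute terms, which is why ratios of interval-scale values should not be interpreted.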
Distribution of a Categorical Variable
In its reporting, the World Bank identifies each country of the world with a continental region. There are seven regions, each with a different number of countries. The variable Region is nominal—it literally names a country’s general location on earth. Let’s get familiar with the different regions and see how many countries are in each. In other words, let’s look at the distribution of Region.
1. Select Analyze ► Distribution. In the Distribution dialog box (Figure 3.1), select the variable region as the Y, Columns variable. Click OK.
Figure 3.1: Distribution Dialog Box
Anytime you want to assign a column to a role in a JMP dialog box, you have three options: you can highlight the column name in the Select Columns list and click the corresponding role button, you can double-click the column name, or you can click-drag the column name into the role box.
The result appears in Figure 3.2. JMP constructs a simple bar chart listing the seven continental regions and showing a rectangular bar whose length corresponds to the number of times the name of the region occurs in the data table. Though we cannot immediately tell from the graph alone exactly how many countries are in each, North America clearly has the fewest countries and Europe & Central Asia has the most.
Figure 3.2: Distribution of Region
Below the graph is a frequency distribution (titled Frequencies), which provides a more specific summary. Here we find the name of each region, and the number of times each regional name appears in our table. For example, “East Asia & Pacific” occurs 432 times. As a proportion of the whole table, 16.7% of the rows (Prob. = 0.16744) represent countries in that region.
At this point, you might wisely pause and say, “Wait a second. Can there possibly be 432 countries in the East Asia & Pacific region?” And you would be right to ask. Remember that we have stacked data, with 12 rows representing 12 years of data devoted to each country. Therefore, there are 432/12 = 36 countries in the region.
Even though JMP handles the heavy computational or graphical tasks, always think about the data and its context and ask yourself if the results make sense to you.
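One way to practice this kind of sanity check is to redo the arithmetic yourself. The Python sketch below uses only figures quoted in this section (432 rows, Prob. = 0.16744, 12 rows per country). The function name `countries_from_rows` is hypothetical.

```python
ROWS_PER_COUNTRY = 12  # one row for each of 1960, 1965, ..., 2015

def countries_from_rows(row_count, rows_per_country=ROWS_PER_COUNTRY):
    """Convert a count of stacked rows into a count of countries."""
    if row_count % rows_per_country != 0:
        raise ValueError("row count is not a whole number of countries")
    return row_count // rows_per_country

east_asia_rows = 432
east_asia_countries = countries_from_rows(east_asia_rows)

# The reported proportion implies the size of the whole table:
# 432 rows are 16.744% of all rows.
total_rows = round(east_asia_rows / 0.16744)
total_countries = countries_from_rows(total_rows)

print(east_asia_countries, total_rows, total_countries)
```

If the implied totals had not come out as whole numbers of countries, that would have been a signal to look more closely at how the table is organized.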
Using the Data Filter to Temporarily Narrow the Focus
Because we know each country appears repeatedly in this data table, let’s choose just one year’s data to obtain a clearer picture of regional variation. We can specify rows to display in a graph by using the Data Filter. This is a tool that allows us to select rows that satisfy specific conditions, such as displaying only the data rows from the year 2015.
This chapter illustrates the use of the Data Filter to temporarily select rows in a data table for all active analyses. This is known as the global Data Filter. Alternatively, when you click the red triangles in most analysis reports, you will find a Script option with a local Data Filter that applies only to the current report. The local Data Filter is illustrated in later chapters, but curious readers should explore it at any time.
1. To see the effects of the Data Filter, we will instruct JMP to automatically update the graph and recalculate the frequencies. Click the red triangle next to Distributions and choose Redo ► Automatic Recalc.
2. Select Rows ► Data Filter. In the list of Columns, select year and click the Add button.
3. The dialog box takes on a new appearance (Figure 3.3). It now displays a list of years contained in the table. Near the top of the dialog box, check Show and Include so that only the rows that we select for 2015 will appear in all graphs and be included in any computations. Other rows will be hidden and excluded.
Figure 3.3: Choosing 2015 in the Data Filter
4. Scroll down the list of Year levels and highlight 2015. As noted in the dialog box, this selects 215 rows and temporarily suppresses the others.
5. Minimize the Data Filter. If you look in the data table of Life Expectancy, you will see that most rows now have two icons indicating that they are excluded and hidden. The rows from 2015 are highlighted and will remain so until we clear the Data Filter or take another action that selects other rows.
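For readers who think in code, the effect of these steps can be sketched outside JMP. The Python fragment below is only an analogy for what Show and Include accomplish: rows that fail the condition drop out of both the display and the calculations. The rows, country names, and region labels are made up.

```python
from collections import Counter

# Hypothetical stacked rows: one row per country per year (made-up values).
rows = [
    {"country": "Country A", "region": "Region X", "year": 2010},
    {"country": "Country A", "region": "Region X", "year": 2015},
    {"country": "Country B", "region": "Region X", "year": 2010},
    {"country": "Country B", "region": "Region X", "year": 2015},
    {"country": "Country C", "region": "Region Y", "year": 2010},
    {"country": "Country C", "region": "Region Y", "year": 2015},
]

def filter_rows(rows, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]

rows_2015 = filter_rows(rows, "year", 2015)

# With one year selected, each country is counted exactly once per region.
region_counts = Counter(row["region"] for row in rows_2015)
print(region_counts)
```

After filtering to a single year, the frequency of each region equals its number of countries, which is the clearer picture that the Data Filter is meant to provide.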
Using Graph Builder to Explore Categorical Data Visually
In Chapter 1, we met the Graph Builder, and we will use it throughout this book. It is most useful when working with multiple variables, but even with a single nominal variable, it provides a quick way to generate multiple views of the same data. Because interactivity is such an important feature of the tool, this section of the chapter provides few step-by-step directions. You should interact with the tool and think about the extent to which different graphing formats and options communicate the information content of the variable called region.
1. Select Graph ► Graph Builder. The region column identifies groups of countries. Drag it to the X drop zone.
Within Graph Builder, you can freely reposition a column from one drop zone to another. Hover the cursor over the column name until the cursor changes to the hand shape, then click-drag it to any other drop zone. What’s more, there is also an Undo button. You can also use the same variable in more than one drop zone. For example, you might also color the bars by region.
With Region on the X axis, you will see seven clumps of black points above the seven region names. This is not very informative.
At the top of Graph Builder is a selector bar of icons (see Figure 3.4) representing different graph types. The graphing options available depend on the type of data we have placed on the graph. Hence, some icons are dimmed, but with Region on the X axis, we can opt for any of the highlighted options.
Figure 3.4: Graphing Options for a Nominal Column
2. Spend some time using different graphing formats. Which ones do you think do the best job of clearly and fully summarizing the number of countries within each region?
3. For this example, let’s use a bar chart (seventh option from the left). There is considerable research demonstrating that most people find this simple graph type easy to interpret accurately. Then click Done.
It is always good practice to help a reader by giving a graph an informative descriptive title. The default title “region,” though accurate, is not very helpful. In JMP, it is easy to alter the titles of graphs and other results.
4. Move your cursor to the title just above the graph and double-click. You can now customize the title of this chart to make it more informative. Type Observations per Region, replacing “region” as the title.
5. We have done a bit of work on our project. Let’s save it now as Chap_03.
With most categorical data, JMP automatically reports values in alphabetical sequence (East Asia & Pacific, Europe & Central Asia, and so on). We can revise the order of values to suit our purposes as well. Suppose that we want to list the regions approximately from West to East, North to South. In that case, we might prefer a sequence as follows:
North America
Latin America & Caribbean
Europe & Central Asia
Middle East & North Africa
Sub-Saharan Africa
South Asia
East Asia & Pacific
To change the default sequence of categorical values (whether nominal or ordinal), we return to the Life Expectancy data table.
6. Select region from the data grid or the columns panel, right-click, and select Column Info.
7. Click Column Properties and select Value Order.
8. Select a value name and use the Move Up and Move Down buttons to revise the value order to match what we have chosen. Then, click OK.
Now return to Graph Builder and look at the bar chart. You will see that customizing the value order within the data table re-orders the X axis. The effect should speak for itself.
9. Experiment with the other charting options by clicking the red triangle and choosing Show Control Panel and then selecting various graph types.
With categorical data, your choices are limited. Still, it’s worth a few minutes to become familiar with them. When you are through exploring, restore the graphic to a bar chart and leave it open. We will return to this graph in a few pages.
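The Value Order property amounts to replacing the default alphabetical sort with a sort on a custom ranking. Here is a minimal Python sketch of that idea, not of JMP's internals. The region names come from the list above, while the counts are placeholders invented for the example.

```python
# The preferred sequence from this section (roughly West to East, North to South).
VALUE_ORDER = [
    "North America",
    "Latin America & Caribbean",
    "Europe & Central Asia",
    "Middle East & North Africa",
    "Sub-Saharan Africa",
    "South Asia",
    "East Asia & Pacific",
]

# Placeholder counts, invented for illustration only.
counts = {name: 10 + i for i, name in enumerate(sorted(VALUE_ORDER))}

def order_by_value_order(counts, value_order):
    """Return (name, count) pairs sorted by position in `value_order`."""
    rank = {name: position for position, name in enumerate(value_order)}
    return sorted(counts.items(), key=lambda item: rank[item[0]])

alphabetical = sorted(counts.items())            # the default sequence
custom = order_by_value_order(counts, VALUE_ORDER)

print([name for name, _ in alphabetical])
print([name for name, _ in custom])
```

The counts themselves never change. Only the sequence in which the categories are presented does, which is exactly what happens to the X axis of the bar chart.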
Distribution of a Quantitative Variable
The standard graphing choices expand considerably when we have quantitative data—particularly for continuous variables or discrete variables with many possible values. We will want to summarize a large collection of values in a way that shows where observations tend to cluster.
As a way of visualizing the distribution of a continuous variable, the most commonly used graph is a histogram. A histogram is basically a bar chart with values of the variable on one axis and frequency on the other. Let’s illustrate.
In our data set, we have estimated life expectancy at birth for each country for 12 different years. We just used the Data Filter to isolate the data for 2015, so let’s continue to explore the state of the world in 2015.
Using the Distribution Platform for Continuous Data
As before, we will first use the Distribution platform to do most of the work here.
1. Select Analyze ► Distribution. Cast LifeExp into the role of Y, Columns and click OK.
2. When the distribution window opens, click the red triangle next to Distributions, and select Stack. This will re-orient the output horizontally making it a bit easier to interpret.
The histogram (Figure 3.5) is one representation of the distribution of life expectancy around the world in 2015, and it gives us one view of how much life expectancy varies. Above the histogram is a box plot (also known as a box-and-whiskers plot), which will be explained later in this chapter.
Figure 3.5: A Typical Histogram
As in the bar charts that we have studied earlier, there are two dimensions in the graph. Here, the horizontal axis displays values of the variable and the vertical axis displays the frequency of each small interval of values. For example, we can see that only a few countries have projected life expectancies of 51 to 54 years, but many have life expectancies between 74 and 78 years.
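Behind every histogram is a simple counting exercise: divide the axis into equal intervals and count the observations that fall in each. The Python sketch below illustrates the counting with made-up values. It is not JMP's binning algorithm, and the function name `histogram_counts` is ours.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each interval [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Made-up life expectancies in years (not the World Bank figures).
values = [52.1, 53.0, 61.5, 66.2, 70.4, 72.8, 74.1,
          75.3, 75.9, 76.4, 77.0, 81.2, 83.5]

counts = histogram_counts(values, start=50, width=5, n_bins=7)

# Print a rough text histogram, one row per interval.
for k, count in enumerate(counts):
    low = 50 + 5 * k
    print(f"{low:>3}-{low + 5:<3} {'#' * count}")
```

Each row of the text output plays the role of one bar: the interval is the position on the value axis, and the number of marks is the frequency.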
When we look at a histogram, we want to develop the habit of looking for four things: the shape, the center (or central tendency), the dispersion of the distribution, and unusual observations. The histogram can very often clearly represent these four aspects of the distribution.
Shape: Shape refers to the symmetry of the histogram and to the presence of peaks in the graph. A graph is symmetric if you could find a vertical line in the center defining two sides that are mirror images of one another. In Figure 3.5, we see an asymmetrical graph. There are few observations in the tails on the left, and most observations clump together on the right side. We say this is a left-skewed (or negatively skewed) distribution.
Many distributions have one or more peaks—data values that occur more often than the other values. Here we have a distinct peak around 75 to 76 years, and others closer to 72 and 83. Some distributions have multiple peaks, and some have no distinctive peaks at all. In short, we might describe the shape of this distribution as “multi-peaked and left-skewed.”
Center (or central tendency): Where do the values congregate on the number line? In other words, what values does this variable typically assume? As you might already know, there are several definitions of center as reflected in the mean, median, and mode statistics. Visually, we might think of the center of a histogram as the point on the horizontal axis with half of the observations on either side (the median, which is approximately 74 years in this case), as the highest-frequency region (the highest peak near 75), perhaps as a type of visual balancing point (the mean, which is approximately 72), or in some other way. Each of these interpretations has legitimacy, and all respond to the question in slightly different ways.
Dispersion (or spread): While the concept of center focuses on the typical, the concept of spread focuses on departures from the typical. The question here is, “how much does this variable vary?” and again there are several reasonable ways to respond. We might think in terms of the lowest and highest observed values (from about 50 to 85), in terms of a vicinity of the center (for example, “life expectancy tends to vary in most countries between about 65 and 85”), or in some other relative sense.
Unusual Observations: We can summarize the variability of a distribution by citing its shape, center, and dispersion, but in some distributions, there may be a small number of observations that deviate substantially from the pattern. In 2015, there was no such grouping, but let’s explore the shifts in the distribution over time and also find some unusual observations.
3. Re-open the global data filter (Rows ► Data Filter). Click Clear.
4. Click the red triangle next to Distributions and choose Redo ► Automatic Recalc. You will see the histogram change and might notice that it now represents more observations—we are looking at all twelve years of data.
5. Again, click the red triangle next to Distributions and choose Local Data Filter; choose year and click Add.
6. Rather than choosing one year, click the red triangle next to Local Data Filter and choose Animation, as shown in Figure 3.6. This will step through the twelve years, briefly selecting each one and changing the histogram for each year.
Figure 3.6: Animating a Local Data Filter
7. In the Animation Controls, click the blue “play” arrow and watch what happens. Take special notice of how life expectancy has tended to improve from 1960 through 2015.
8. After a few cycles, pause the animation in the year 1995.
Look at the box plot above the histogram. There are two dots at the far left end; these represent two nations with extraordinarily brief life expectancies. We refer to such values as outliers.
9. Hover the cursor over the left-most point in the box plot. You will see a pop-up note that this is Rwanda, with a life expectancy of only 31.977 years in 1995, reflecting the genocide that took place in 1994.
Often, it’s easier to think about shape, center, dispersion, and outliers by comparing two distributions. For example, Figure 3.7 shows two histograms using the life expectancy data from 1965 and 2015. We might wonder how human life expectancy changed during a 50-year period, and in these two histograms, we can note differences in shape, center, dispersion and unusual observations.
Figure 3.7: Comparing Two Distributions
To create the results shown in Figure 3.7, do the following:
10. Return to the original Life Expectancy data table.
11. Re-open the Data Filter dialog box (either choose Windows and find the filter or Rows ► Data Filter). Clear the Select check box but leave Show and Include checked.
12. Hold down the Ctrl key and highlight 1965 and 2015.
13. From the menu bar, choose Analyze ► Distribution.
14. Select LifeExp as Y, just as you did earlier.
15. Cast year into the role of By and click OK.
This creates the two distributions with vertically oriented histograms. When you look at them, notice that the axis of the first one runs from 25 to 75 years, and the axis on the second graph runs from 50 to 85 years.
To facilitate the comparison, it is helpful to orient the histograms horizontally in a stacked arrangement and to set the axes to a uniform scale, an option that is available in the red triangle menu next to Distributions. This makes it easy to compare their shapes, centers, and spreads at a glance.
16. In the Distribution report, while pressing the Ctrl key, click the uppermost red triangle and select Uniform Scaling.
If you click the red triangle without pressing the Ctrl key, the uniform scaling option would apply only to the upper histogram. Pressing the Ctrl key has the effect of applying the choice to all graphs in the window.
17. Hold down the Ctrl key, click the red triangle once again, and choose Stack.
The histograms on your screen should now look like Figure 3.7. How does the shape of the 1965 distribution compare to that of the 2015 distribution? What might have caused these changes in the shape of the distribution?
We see that people tend to live longer now than they did in 1965. The location (or central tendency) of the 2015 distribution is to the right side of the 1965 distribution. Additionally, these two distributions also have quite different spreads (degrees of dispersion). We can see that the values were far more spread out in 1965 than they are in 2015 and that there were no outliers in either year. What does that reveal about life expectancy around the world during the past 50 years?
Taking Advantage of Linked Graphs and Tables to Explore Data
When we construct graphs, JMP automatically links all open tables and graphs. If we select rows either in the data table or in a graph, JMP selects and highlights those rows in all open windows.
1. Within the 2015 life expectancy histogram, place the cursor over the right-most bar and click. While pressing the Shift key, also click the adjacent bar. Now you should have selected the two bars representing life expectancies of over 75 years. How many rows are now selected? Look in the Rows panel of the Data Table window.
2. Now find the first window with the Distribution of Region (second tab in your project). Notice that some bars are partially highlighted. When you selected the two bars in the histogram, you were indirectly selecting a group of countries. These countries are grouped within the bar chart as shown, revealing the parts of the world where people tend to live longest.
Customizing Bars and Axes in a Histogram
When we use the Distribution platform to analyze a continuous variable, JMP determines how to divide the variable axis and how to create “bins” for grouping observations. These automatic choices can affect the appearance of the distribution and there are several ways to customize the appearance of a histogram.
We can alter the number of bars in the histogram, creating new boundaries between groups of observations and shifting observations from one bar to the next.
1. Move back to the Distribution report tab. Click anywhere in a blank area of the 2015 histogram to de-select the two right bars.
2. Choose Tools ► Grabber.
3. Position the hand anywhere over the bars in the 2015 histogram beneath the box plot, and click-drag the tool straight up and down. In doing so, you will change the number and width of the bars, sometimes dramatically changing the shape of the graph.
Think about this: the apparent shape of the distribution depends on the number of bars we create. By default, the software chooses an initial number of bars, or bins, to categorize the continuous variable. However, that initial choice should not be the final word. As we adjust the number of bins, we should watch closely to see how the shape changes, looking for a rendering that accurately and honestly displays the overall pattern of variation.
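The dependence of shape on the number of bins is easy to demonstrate numerically. In the Python sketch below, the same made-up values are grouped into 2, 4, and 8 bins, and a small helper counts the peaks in each rendering. The names `bin_counts` and `count_peaks` are hypothetical, and the peak rule (a bar taller than both neighbors) is a simplification chosen for this example.

```python
def bin_counts(values, low, high, n_bins):
    """Group values into n_bins equal-width bins spanning [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that a value equal to `high` lands in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

def count_peaks(counts):
    """Count bars that are strictly taller than both of their neighbors."""
    padded = [-1] + counts + [-1]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

# Made-up values, used only to show the effect of the bin count.
values = [51, 53, 58, 62, 63, 64, 66, 71, 74, 75, 76, 76, 77, 78, 82, 83]

for n_bins in (2, 4, 8):
    counts = bin_counts(values, low=50, high=90, n_bins=n_bins)
    print(n_bins, counts, "peaks:", count_peaks(counts))
```

With few bins the data appear single-peaked, and with more bins additional peaks emerge from the very same observations. Neither view is wrong, but each tells a somewhat different story.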
One way to resolve the issue is by using a shadowgram. A shadowgram visually averages a large number of bin widths into a diffuse image with no distinct bars at all. Here is how:
4. Click the red triangle next to LifeExp in the 2015 histogram.
5. Choose Histogram Options ► Shadowgram. Figure 3.8 shows the result.
Figure 3.8: A Shadowgram for a Continuous Variable
You should notice that there are several Histogram Options. While you are here, explore them—see what there is to see.
We can also change the scale of the horizontal axis interactively. Initially, JMP set the left and right endpoints, and the limits changed when we chose uniform scaling. Suppose we want the axis to begin at 30 and end at 85.
6. Move the cursor to the left end of the horizontal axis, and notice that the hand now points to the left (this is true whether you have previously chosen the hand tool or not). Click and drag the cursor slowly left and right, and see that you are scrunching or stretching the axis values. Stop when the minimum value is 30.
7. Move the cursor to the right end of the axis, and similarly set the maximum at 85 years just by dragging the cursor.
Finally, we can “pan” along the axis. Think of the border around the graph as a camera’s viewfinder through which we see just a portion of the entire infinite axis.
8. Without holding the mouse button, move the cursor toward the middle of the axis until the hand points upward. Now click and drag to the left or right, and you will pan along the axis.
9. Alternatively, rather than clicking and dragging to change axis attributes, you can directly edit all “Axis Settings” by double-clicking on the axis itself. This opens a dialog box where you can specify a variety of settings.
Exploring Further with the Graph Builder
Our original data table contains values for 12 years, and we have now compared the variation in life expectancy for two years. The Graph Builder can allow us to make a quick visual comparison over 12 years.
1. First, we want to clear our earlier filtering so that we can now access all years. Choose Rows ► Clear Row States to deselect, show, and include all rows.
2. Select Graph ► Graph Builder.
3. Drag LifeExp to the X drop zone.
4. Find the menu bar at the top of the Graph Builder window and locate the Histogram button near the center. Click it.
5. Drag Year to the Wrap drop zone and click the Done button. Your graph should look like Figure 3.9.
Figure 3.9: Longer Lives in Most of the World, 1960 to 2015
What do you see as you inspect these small multiple histograms? Can you see life expectancies gradually getting longer in most countries? There were two peaks in 1960: many countries with short lives, and many with longer lives. The lower peak slowly flattened out as the entire distribution has crept rightward.
Summary Statistics for a Single Variable
Graphs are an ideal way to summarize a large data set and to communicate a great deal of information about a distribution. We can also describe variation in a quantitative variable with summary statistics (also called summary measures or descriptive statistics). Just as a distribution has shape, center, and dispersion, we have summary statistics that capture information about the shape, center, or dispersion of a variable.
Let’s look back at the distribution report for our sample of 2015 life expectancies in 198 countries of the world. Just to the right of the histogram, we find a table of Quantiles followed by a list of Summary Statistics.
Figure 3.10: Quantiles and Summary Statistics
Quantile is a generic term; you might be more familiar with percentiles. When we sort observations in a data set, divide them into groups of observations, and locate the boundaries between the groups, we establish quantiles. When there are 100 such groups, the boundaries are called percentiles. If there are four such groups, we refer to quartiles.
For example, we find that the 90th percentile is 81.54 years. This means that 90% of the observations have life expectancies shorter than 81.54 years. JMP also labels five quantiles known as the five-number summary. They identify the minimum, maximum, 25th percentile (1st quartile or Q1), 50th percentile (median), and 75th percentile (3rd quartile or Q3). Of the 198 countries listed in the data table, one-fourth have life expectancies shorter than 66.43 years, and one-fourth have life expectancies longer than 77.49 years.
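As a rough illustration of how a five-number summary is assembled, the following Python sketch uses the standard library's `statistics.quantiles`. The data are made up, and the interpolation rule may differ from the one JMP uses, so the quartiles it reports need not match a JMP report digit for digit.

```python
import statistics

def five_number_summary(values, method="exclusive"):
    """Return the minimum, Q1, median, Q3, and maximum of `values`."""
    q1, median, q3 = statistics.quantiles(values, n=4, method=method)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

# Made-up life expectancies in years (not the World Bank figures).
data = [52, 58, 61, 66, 70, 73, 74, 76, 77, 79, 83]

summary = five_number_summary(data)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

Reading the result the same way as the JMP table: about one-fourth of the values lie below Q1, about half lie below the median, and about one-fourth lie above Q3.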
Summary Statistics refer to the common descriptive statistics shown in Figure 3.10. At this stage in your study of statistics, three of these statistics are useful, and the other three should wait until Chapter 8.
● The mean is the simple arithmetic average of the observations, usually denoted by the symbol $\bar{x}$ and computed as follows:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Along with the median, it is commonly used as a measure of central tendency; in a symmetric distribution, the mean and median are quite close in value. When a distribution is strongly left-skewed like this one, the mean will tend to be smaller than the median. In a right-skewed distribution, the opposite will be true.
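● The standard deviation (Std Dev) is a measure of dispersion, and you might think of it as a typical distance of a value from the mean of the distribution. It is usually represented by the symbol s, and is computed as follows, where $\bar{x}$ is the mean and n is the number of observations:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$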
We will have more to say about the standard deviation in later chapters, but for now, please note that it must be greater than or equal to zero, and that highly dispersed variables have larger standard deviations than consistent variables.
● n refers to the number of observations in the sample.
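The three statistics above can be computed by hand in a few lines. The Python sketch below spells out the usual sample formulas (dividing by n − 1 for the standard deviation) and checks them against the standard library. The data are made up, and the second list is deliberately left-skewed to show the mean falling below the median.

```python
import math
import statistics

def sample_mean(values):
    """Arithmetic average: the sum of the observations divided by n."""
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation, using n - 1 in the denominator."""
    mean = sample_mean(values)
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (len(values) - 1))

# Made-up values for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print("n =", len(data))
print("mean =", sample_mean(data))
print("std dev =", round(sample_std(data), 3))

# A made-up left-skewed sample: one very low value pulls the mean down.
left_skewed = [40, 70, 72, 74, 75, 76, 78]
print("mean =", round(sample_mean(left_skewed), 2),
      "median =", statistics.median(left_skewed))
```

In the skewed sample the single low value drags the mean well below the median, while the median barely notices it. That is the pattern described above for left-skewed distributions.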
Outlier Box Plots
Now that we have discussed the five-number summary, we can interpret a box plot. The key to interpreting an outlier box plot is to recognize that it is a diagram of the five-number summary. Here is a typical example:
In a box plot, there is a rectangle with an intersecting line. Two edges of the rectangle are located at the first (Q1) and third (Q3) quartile values, and the line is located at the median. In other words, the rectangular box spans the interquartile range (IQR). Extending from the ends of the box are two lines called whiskers. In a distribution that is free of outliers, the whiskers reach to the minimum and maximum values. Otherwise, the plot limits the reach of the whiskers by the upper and lower fences, which are located 1.5 IQRs from each quartile. In this illustration, we have a cluster of seven low-value outliers.
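The fence rule can be sketched in Python as follows. This is an illustration of the 1.5 × IQR rule described above, not JMP's own code: the quartiles come from `statistics.quantiles`, whose interpolation may differ slightly from JMP's, and the data are invented so that one value is an obvious outlier.

```python
import statistics

def box_plot_landmarks(values):
    """Compute quartiles, fences, whisker ends, and outliers for a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme observations that lie inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up values with one unusually low observation.
data = [31, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80]
landmarks = box_plot_landmarks(data)
for name, value in landmarks.items():
    print(f"{name:>12}: {value}")
```

Note that the fences themselves are not drawn. They only decide where the whiskers stop and which points are plotted individually as outliers.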
JMP also adds two other features to the box plot. One is a diamond that represents the location of the mean. If you imagine a vertical line through the top and bottom vertices of the diamond, you have located the mean. The other two vertices are positioned at the upper and lower confidence limits of the mean. We will discuss those in Chapter 11.
The second additional feature is a red bracket above the box. This is the shortest half bracket, representing the smallest part of the number line comprising 50% of the cases. We can divide the observations in half in different ways. The median gives the upper and lower halves; the IQR box gives the middle half. This bracket gives the shortest half.
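One plausible way to locate a shortest half is to sort the values, slide a window covering half of them along the sorted list, and keep the narrowest window. The Python sketch below does this with made-up data. It assumes the window holds ⌈n/2⌉ observations, and JMP's exact rule for the window size and for ties may differ.

```python
import math

def shortest_half(values):
    """Return (low, high) for the narrowest interval holding half the values."""
    data = sorted(values)
    half = math.ceil(len(data) / 2)  # assumed window size
    best = None
    for start in range(len(data) - half + 1):
        low, high = data[start], data[start + half - 1]
        if best is None or (high - low) < (best[1] - best[0]):
            best = (low, high)
    return best

# Made-up left-skewed values: sparse on the left, dense on the right.
data = [40, 55, 63, 70, 72, 73, 74, 75, 76, 80]
low, high = shortest_half(data)
print(f"shortest half runs from {low} to {high}")
```

Because the values are densest near the upper end, the shortest half sits there rather than in the middle of the range, which is what the red bracket shows on a skewed distribution.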
A box plot very efficiently conveys information about the center, symmetry, dispersion, and outliers for a single distribution. When we compare box plots across several groups or samples, the results can be quite revealing. In the next chapter, we will look at such box plots and other ways of summarizing two variables at a time.
Application
Now that you have completed all of the activities in this chapter, use the techniques that you have learned to respond to these questions.
1. Scenario: We will continue our analysis of the variation in life expectancy at birth in 2015. Reset the Data Filter to show and include 2015.
a. When we first constructed the LifeExp histogram, we described it as multi-peaked and left-skewed. Use the hand tool to increase and reduce the number of bars. Adjust the number of bars so that there are two prominent peaks. Describe what you did, and where the peaks are located.
b. Rescale the axes of the same histogram and see if you can emphasize the two peaks even more (in other words, have them separated distinctly). Describe what you did to make these peaks more distinct and noticeable.
c. Based on what you have seen in these exercises, why is it a good idea to think critically about an analyst’s choice of scale in a reported graph?
d. Highlight a few of the left-most bars in the histogram for LifeExp and look at the Distribution report for region. Which continent or continents are home to the countries with the shortest life expectancies in the world? What might account for this?
2. Scenario: Now let’s look at the distribution of life expectancy 25 years before 2015. Use the Data Filter to choose the observations from 1990.
a. Use the Distribution platform to summarize Region and LifeExp for this subset. In a few sentences, describe the distribution of LifeExp in 1990.
b. Compare the five-number summaries for life expectancy in 1990 and in 2015. Comment on what you find.
c. Compare the standard deviations for life expectancy in 1990 and 2015. Comment on what you find.
d. You will recall that in 2015, the mean life expectancy was shorter than the median, consistent with the left-skewed shape. How do the mean and median compare in the 1990 data?
3. Scenario: The data file called Sleeping Animals contains data about the size, sleep habits, lifespan, and other attributes of different mammalian species.
a. Construct box plots for Lifespan and TotalSleep. For each plot, explain what the landmarks tell you about the distribution of the variable. Comment on noteworthy features of the plot.
b. Which distribution is more symmetric? Explain specifically how the graphs and descriptive statistics helped you come to a conclusion.
c. According to the data table, “Man” has a maximum life span of 100 years. Approximately what percent of mammals in the data set have a maximum life span shorter than 100 years?
d. Sleep hours are divided into “dreaming” and “non-dreaming” sleep. How do the distributions of these types of sleep compare?
e. Select the species that tend to get the most total sleep. Comment on how those species compare to the other species in terms of their predation, exposure, and overall danger indexes.
f. Now use the Distribution platform to analyze the body weights of these mammals. What’s different about this distribution in comparison to the other continuous variables that you have analyzed thus far?
g. Select those mammals that sleep in the most exposed locations. How do their body weights tend to compare to the other mammals? What might explain this comparison?
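The landmarks of a box plot can also be computed by hand, which may help when interpreting a strongly skewed variable such as body weight. The sketch below assumes the common convention that points lying more than 1.5 times the interquartile range beyond the quartiles are flagged as outliers. The function name `box_plot_landmarks` and the weights are hypothetical and do not come from the Sleeping Animals data table, and the quartile values may differ slightly from JMP's because software packages use different quantile conventions.

```python
import statistics


def box_plot_landmarks(values):
    """Compute quartiles, the IQR, the 1.5 * IQR fences, and flagged outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical body weights for illustration only; NOT the Sleeping Animals data.
weights = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 250]

landmarks = box_plot_landmarks(weights)
print(landmarks)
```

A single very large value sits far above the upper fence here, which is the kind of pattern to look for when a distribution has a long right tail.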
4. Scenario: When financial analysts want a benchmark for the performance of individual equities (stocks), they often rely on a “broad market index” such as the S&P 500 in the U.S. There are many such indexes in stock markets around the world. One major index on the Tokyo Stock Exchange is the Nikkei 225, and this set of questions refers to data about the monthly values of the Nikkei 225 from December 31, 2013 through December 31, 2018. In other words, our data table called NIKKEI225 reflects monthly market activity for a five-year period.
a. The variable called Volume is the total number of shares traded per month (in millions of shares). Describe the distribution of this variable.
b. The variable called Change% is the monthly change, expressed as a percentage, in the closing value of the index. When Change% is positive, the index increased that month. When the variable is negative, the index decreased that month. Describe the distribution of this variable.
c. Use the Quantiles to determine approximately how often the Nikkei declines. (Hint: What percentile is 0?)
d. Use Graph Builder to make a Line Graph (6th icon in the icon bar) that shows adjusted closing prices over time. Then, use the Distribution platform to create a histogram of adjusted closing prices. Each graph summarizes the Adj Close variable, but each graph presents a different view of the data. Comment on the comparison of the two graphs.
e. Now make a line graph of the monthly percentage changes over time. How would you describe the pattern in this graph?
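The hint about the percentile of 0 rests on a simple idea: the percentile of a value is approximately the share of observations that fall below it. A minimal sketch of that calculation follows. The function name `share_below` and the monthly changes are hypothetical and are not drawn from the NIKKEI225 data table.

```python
def share_below(values, threshold=0.0):
    """Return the proportion of observations strictly below the threshold."""
    return sum(1 for v in values if v < threshold) / len(values)


# Hypothetical monthly percentage changes for illustration only; NOT Nikkei data.
changes = [-3.1, 1.2, 0.8, -0.5, 2.4, -1.7, 3.0, 0.4, -2.2, 1.9]

proportion_declining = share_below(changes)
print(f"Share of months with a decline: {proportion_declining:.0%}")
```

The same reasoning applies to the arrival delays in Scenario 5, where the share of late flights is the proportion of values above 0 rather than below it.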
5. Scenario: Anyone traveling by air understands that there is always some chance of a flight delay. In the United States, the Department of Transportation monitors the arrival and departure time of every flight. The data table Airline Delays contains a sample of 51,603 flights for four airlines destined for three busy airports.
a. The variable called DEST is the airport for the flight destination. Describe the distribution of this variable.
b. The variable called Arr Delay is the actual arrival delay, measured in minutes. A positive value indicates that the flight was late, and a negative value indicates that the flight arrived early. Describe the distribution of this variable.
c. Notice that the distribution of Arr Delay is skewed. Based on your experience as a traveler, why should we have anticipated that this variable would have a skewed distribution?
d. Use the Quantiles to determine approximately how often flights in this sample were delayed. (Hint: Approximately what percentile is 0?)
6. Scenario: For many years, it has been understood that tobacco use leads to health problems related to the heart and lungs. The Tobacco Use data table contains data about the prevalence of tobacco use and of certain diseases around the world.
a. Use an appropriate technique from this chapter to summarize and describe the variation in tobacco usage (TobaccoUse) around the world.
b. Use an appropriate technique from this chapter to summarize and describe the variation in cancer mortality (CancerMort) around the world.
c. Use an appropriate technique from this chapter to summarize and describe the variation in cardiovascular mortality (CVMort) around the world.
d. You have now examined three distributions. Comment on the similarities and differences in the shapes of these three distributions.
e. Summarize the distribution of the region variable and comment on what you find.
f. We have two columns containing the percentage of males and females around the world who use tobacco. Create a summary for each of these variables and explain how tobacco use among men compares to that among women.
7. Scenario: The States data table contains measures and attributes for the 50 U.S. states and the District of Columbia.
a. The population of the state as estimated by the United States Census Bureau is in pop2018est. Summarize the data in this column, commenting on the center, shape, and spread of the distribution. Note any outliers.
b. Construct box plots for owner-occ and poverty. For each plot, explain what the landmarks tell you about the distribution of each variable and comment on noteworthy features of the plot.
c. The column mean_Income is the mean household income, and med_income_17 is the median household income in the state. Use an appropriate technique from this chapter to summarize the data in these two columns and comment on what you see. Why do you think mean incomes are consistently greater than median incomes?
d. The column homicide is the rate of homicide deaths per 100,000 persons in the state. Summarize the responses and comment.
e. The column soc_sec is the number of people receiving Social Security benefits within the state. Use an appropriate technique to summarize the distribution of this variable. Identify the outlying states and suggest a reason for the fact that these states are outliers.
f. Compare the distributions of unemp2010 and unemp2017 and comment on what you find.
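When exploring how a mean and a median respond to a long right tail, as in the household income columns above, it can help to experiment with a small made-up sample. The sketch below is purely illustrative: the incomes are hypothetical values in thousands of dollars and are not taken from the States data table.

```python
import statistics

# Hypothetical household incomes (in thousands of dollars) for illustration only;
# these are NOT values from the States data table.
incomes = [28, 35, 41, 47, 52, 58, 66, 75, 90, 400]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print("mean:", mean_income)
print("median:", median_income)
print("mean exceeds median:", mean_income > median_income)
```

Try replacing the largest value with one closer to the rest of the sample and note which of the two statistics changes more.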
Endnotes
1 There are exceptions to this general principle, as with months of the year or days of the week, for example.