c. The variable called Arr Delay is the actual arrival delay, measured in minutes. A positive value indicates that the flight was late, and a negative value indicates that the flight arrived early. Describe the distribution of this variable.
d. Notice that the distribution of Arr Delay is skewed. Based on your experience as a traveler, why should we have anticipated that this variable would have a skewed distribution?
e. Use the Quantiles to determine approximately how often flights in this sample were delayed. (Hint: Approximately what percentile is 0?)
6. Scenario: For many years, it has been understood that tobacco use leads to health problems related to the heart and lungs. The Tobacco Use data table contains data about the prevalence of tobacco use and of certain diseases around the world.
a. Use an appropriate technique from this chapter to summarize and describe the variation in tobacco usage (TobaccoUse) around the world.
b. Use an appropriate technique from this chapter to summarize and describe the variation in cancer mortality (CancerMort) around the world.
c. Use an appropriate technique from this chapter to summarize and describe the variation in cardiovascular mortality (CVMort) around the world.
d. You have now examined three distributions. Comment on the similarities and differences in the shapes of these three distributions.
e. Summarize the distribution of the region variable and comment on what you find.
f. We have two columns containing the percentage of males and females around the world who use tobacco. Create a summary for each of these variables and explain how tobacco use among men compares to that among women.
7. Scenario: The States data table contains measures and attributes for the 50 U.S. states and the District of Columbia.
a. The population of the state as estimated by the United States Census Bureau is in pop2018est. Summarize the data in this column, commenting on the center, shape, and spread of the distribution. Note any outliers.
b. Construct box plots for owner-occ and poverty. For each plot, explain what the landmarks tell you about the distribution of each variable and comment on noteworthy features of the plot.
c. The column mean_Income is the mean household income, and med_income_17 is the median household income in the state. Use an appropriate technique from this chapter to summarize the data in these two columns and comment on what you see. Why do you think mean incomes are consistently greater than median incomes?
d. The column homicide is the rate of homicide deaths per 100,000 persons in the state. Summarize the responses and comment.
e. The column soc_sec is the number of people receiving Social Security benefits within the state. Use an appropriate technique to summarize the distribution of this variable. Identify the outlying states and suggest a reason for the fact that these states are outliers.
f. Compare the distributions of unemp2010 and unemp2017 and comment on what you find.
Endnotes
1 There are exceptions to this general principle, as with months of the year or days of the week, for example.
Chapter 4: Describing Two Variables at a Time
Describing Covariation: Two Categorical Variables
A Digression: Recoding a Variable and Changing Value Order
Describing Covariation: One Continuous, One Categorical Variable
Describing Covariation: Two Continuous Variables
Scatter Plots for Very Large Data Tables
More Informative Scatter Plots
Overview
Some of the most interesting questions in statistical inquiries involve covariation: how does one variable change when another variable changes? After working through the examples in this chapter, you will know some basic approaches to bivariate analysis, that is, the analysis of two variables at a time.
Two-by-Two: Bivariate Data
Chapter 3 covered techniques for summarizing the variation of a single variable: univariate distributions. In many statistical investigations, we are interested in how two variables vary together and, in particular, how one variable varies in response to the other. For example, nutritionists might ask how consumption of carbohydrates affects weight loss, or marketers might ask whether a demographic group responds positively to an advertising strategy. In these cases, it’s not sufficient to look at one univariate distribution or even to look at the variation in each of two key variables separately. We need methods to describe the covariation of bivariate data, which is to say we need methods to summarize the ways in which two variables vary together.
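To see why separate univariate summaries are not enough, consider the following minimal Python sketch. The numbers are invented purely for illustration, and the Pearson correlation coefficient is used here only as one convenient numeric measure of covariation; it is an assumption of this sketch rather than a technique introduced so far. The two response columns have exactly the same univariate distribution, yet they vary with the factor in opposite directions.

```python
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Pearson correlation computed from population moments."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))


# Made-up data: one factor and two hypothetical responses.
factor = [1, 2, 3, 4, 5]
response_a = [2, 4, 6, 8, 10]   # rises as the factor rises
response_b = [10, 8, 6, 4, 2]   # same values, reversed order

# The univariate summaries of the two responses are identical...
print(mean(response_a), mean(response_b))
print(pstdev(response_a), pstdev(response_b))

# ...but their covariation with the factor is completely different.
r_a = pearson_r(factor, response_a)
r_b = pearson_r(factor, response_b)
print(round(r_a, 3), round(r_b, 3))
```

Looking at either response column alone would give no hint of this difference; only a bivariate summary reveals it.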
The organization of this chapter is simple. We have been classifying data as categorical or continuous. If we focus on two variables in a study and conceive of one variable as a response to the other (the factor), there are four possible combinations to consider, shown in Table 4.1: two categorical variables, two continuous variables, a continuous response with a categorical factor, or a categorical response to a continuous factor. The next three sections discuss three of these four possibilities.
Table 4.1: Chapter Organization—Bivariate Factor-Response Combinations
 | Continuous Factor | Categorical Factor |
Continuous Response | Third section to follow | Second section to follow |
Categorical Response | See Chapter 19 | Next section |
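The four combinations can also be written out as a small lookup, sketched below in Python. The function name `suggest_summary` and the dictionary `SUMMARIES` are hypothetical, and the technique named for each pairing is an assumption based on common practice (contingency tables, grouped box plots, scatter plots), not a statement of exactly what each section covers.

```python
# Hypothetical lookup: (response type, factor type) -> a typical summary.
# The technique names are assumptions based on common practice.
SUMMARIES = {
    ("categorical", "categorical"): "contingency table (cross-tabulation)",
    ("continuous", "categorical"): "side-by-side box plots or statistics by group",
    ("continuous", "continuous"): "scatter plot",
    ("categorical", "continuous"): "deferred to Chapter 19",
}


def suggest_summary(response_type, factor_type):
    """Return a typical bivariate summary for the given pair of data types."""
    key = (response_type.lower(), factor_type.lower())
    if key not in SUMMARIES:
        raise ValueError("each type must be 'categorical' or 'continuous'")
    return SUMMARIES[key]


# Walk through the four cells in the same layout as the table.
for response in ("continuous", "categorical"):
    for factor in ("continuous", "categorical"):
        print(f"{response} response, {factor} factor: "
              f"{suggest_summary(response, factor)}")
```

The point of the sketch is only that the choice of technique is driven by the data types of the response and the factor, which is how the chapter is organized.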
In this chapter, we will introduce several common methods for three of these pairings. The first examples relate to a serious issue in civil (non-military) air travel: the periodic collisions between wildlife and commercial airplanes. According to the U.S. Federal Aviation Administration (FAA), so-called wildlife-aircraft strikes have cost hundreds of lives in the past century and also account for significant financial losses from damage to aircraft. These collisions present environmental, public safety, and business issues for many interested parties. The FAA maintains a database to monitor the incidence of wildlife-aircraft strikes. For the period from 2010 through April 2019, the database contains nearly 117,000 reports of strikes in North America. The state reporting the largest number of events was California.
For this chapter, we will use a subset of the database, looking only at bird strikes associated with three California airports: Los Angeles International, Sacramento (the state capital), and San Francisco International. All of the available data is in the data table called FAA Bird Strikes CA. This data table contains 36 columns providing attributes for each of 3,411 bird strikes at or near the three airports.
1. Open the FAA Bird Strikes CA data table now and scroll through the columns. Our analysis will use several columns, each of which we will explain as we work through the examples.
The REMARKS column in this data table has