As the number of rows in a data table grows, we might encounter a common graphing problem called overplotting. When we attempt to plot many points in a fixed area, the points can become so densely packed that they overlap. At times, we may have multiple instances of the same X-Y pair, so that the graph gives the false impression of a single point where there are actually several.
It is beyond the scope of this chapter to provide a full treatment of graphing large data sets, but this is an apt moment to look at a few strategies.
We have already encountered jittering in the univariate context. Jittering spreads points slightly to reveal when multiple points are at a single location. When the number of rows grows very large, jittering might not be enough.
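To make the idea concrete outside JMP, the following Python sketch counts how many rows share each exact X-Y location and then applies a small uniform jitter. The readings are made-up values for illustration, not NHANES data, and the `jitter` function with its `width` and `seed` parameters is hypothetical; JMP's own jittering may use a different scheme.

```python
import random
from collections import Counter

# Made-up (first reading, second reading) pairs; not taken from NHANES 2016.
readings = [(120, 118), (120, 118), (120, 118), (130, 128), (130, 128), (110, 112)]

# Count how many rows sit at each exact X-Y location.
counts = Counter(readings)
overplotted = {point: n for point, n in counts.items() if n > 1}

def jitter(points, width=0.4, seed=1):
    """Shift each coordinate by a uniform random offset in [-width, +width]."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-width, width), y + rng.uniform(-width, width))
            for x, y in points]

jittered = jitter(readings)
distinct_before = len(set(readings))  # coincident rows collapse to one location
distinct_after = len(set(jittered))   # jitter separates the coincident rows

print(overplotted)
print(distinct_before, distinct_after)
```

A plain scatterplot of these six rows shows only three locations. After jittering, each row has its own position, which is why jittering helps with small data sets but cannot rescue a plot with thousands of rows competing for the same area.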
To illustrate, we will consider the data in NHANES 2016, which contains detailed health information about nearly 10,000 people. We first saw this data table in Chapter 2, and we will use it extensively in our study of regression analysis. As part of the National Health and Nutrition Examination Survey, researchers measured the blood pressure of subjects multiple times. We will examine the covariation of systolic blood pressure (the “top” number) between the first and second readings for each person.
1. Open the data table NHANES 2016.
2. In Graph Builder, drag BPXSY1 to the X drop zone and BPXSY2 to the Y drop zone. This will treat the second reading as a function of the first reading.
Note that Graph Builder initially displays a point cloud and a smoother. The many points are densely packed in an elongated elliptical pattern. Try suppressing the display of points by clicking the left-most button in the menu bar. When there are so many points, one option for clarity is to show only their trend without plotting every point. As we did earlier to create Figure 4.9, choosing the Contour option in the menu bar allows us to see the linear trend as well as the regions that are most dense. The darkest contours here show the most common systolic pressure readings.
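Contours summarize where points concentrate. A cruder stand-in for the same idea, sketched here in Python, is to bin the points into a grid and count the points in each cell. This is not the density estimate that JMP uses to draw its contours, and the data, the `bin_counts` helper, and the 10-unit cell width are illustrative assumptions.

```python
from collections import Counter

def bin_counts(points, cell=10):
    """Count points per square grid cell; each cell is keyed by its lower-left corner."""
    counts = Counter()
    for x, y in points:
        key = (int(x // cell) * cell, int(y // cell) * cell)
        counts[key] += 1
    return counts

# Made-up systolic readings (first, second), in mmHg.
points = [(118, 116), (121, 119), (124, 122), (126, 121), (122, 128),
          (135, 131), (142, 150), (108, 104)]

counts = bin_counts(points)
densest_cell, densest_count = counts.most_common(1)[0]
print(densest_cell, densest_count)
```

The cell with the largest count plays the role of the darkest contour region: it marks the combination of first and second readings that occurs most often.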
What if we want to see all 10,000 points more distinctly? Here are three options, among several others.
1. Click the left-most menu button to display all the points.
2. On the right side of the display, right-click over the legend ● BPXSY2.
3. From the menu, select Marker to open a menu of marker shapes. Select an open circle or an X rather than a solid dot. After experimenting with a few shapes, return to the solid dot.
Depending on the density of the points, sometimes an open marker is preferable since it can show small spaces between nearby points, rather than having one occlude the other. In some graphs, reducing the size of each marker can achieve the same goal.
4. Again, right-click over the legend ● BPXSY2 and choose Marker Size to see the choices. The Preferred Size in this graph happens to be 3, Large. Try making them larger and smaller, noting how larger points tend to obscure one another in some parts of the graph. For the sake of the next step, leave the markers at size 5, XXL.
Similarly, rendering points as translucent helps with the overplotting problem. If we make the points partially transparent, then locations with multiple points will just appear darker in the graph. This effectively communicates which graph locations are most prevalent.
5. Right-click over the legend again and choose Transparency. By default, points are opaque with a transparency parameter of 1. Try setting transparency to 0.2, 0.5, and 0.8. Figure 4.15 shows this scatterplot with a transparency parameter of 0.2.
Figure 4.15: Using Transparency to Reveal Overplotting
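The darkening effect can be approximated with a short calculation. Assuming each marker is blended over what has already been drawn using standard alpha compositing (an assumption about rendering in general, not a statement about JMP's internals), n coincident markers that each have opacity a combine to an opacity of 1 - (1 - a)^n. The function name `stacked_opacity` is hypothetical.

```python
def stacked_opacity(alpha, n):
    """Combined opacity of n coincident markers, each with opacity alpha (0 to 1)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return 1.0 - (1.0 - alpha) ** n

# Compare a lone point with increasingly crowded locations at a setting of 0.2.
for n in (1, 2, 5, 10):
    print(n, round(stacked_opacity(0.2, n), 3))
```

Under this assumption, a lone point at a setting of 0.2 is faint, while ten coincident points reach a combined opacity of about 0.89. Sparse and crowded locations therefore remain visually distinguishable, which is not the case when every marker is fully opaque.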
More Informative Scatter Plots
This chapter has introduced ways to depict and summarize bivariate data, but sometimes we want or need to incorporate more than two variables into an analysis. We already know that with Graph Builder, we can color points to represent another variable, and we can wrap or overlay additional variables in a graph. We can use a By column in a Fit Y by X analysis to add a third factor into an investigation.
Another very useful visualization tool, the bubble plot, provides a way to visualize as many as seven columns in one graph. This section provides a brief first look at bubble plots in JMP. We’ll use the tool further in Chapter 5.
Think of a bubble plot as a “scatterplot plus….” In addition to the X-Y axes, the size and color of points can represent variables. Other variables can be used to interactively label points, and in cases where we have repeated measurements over time, the entire graph can be animated through time. Try this first exercise with the 2017 birth rate data:
1. Select Graph ► Bubble Plot. Cast Fertil as Y and Birthrate as X, just as before.
2. Next, cast Country as ID, Region as Coloring, and MortMaternal2016 as Sizes. Click OK.
3. In the lower left of the Bubble Plot window, find the slider control next to Bubble Size and slide it to the left until your graph looks like Figure 4.16.
Figure 4.16: Enhancing a Scatterplot by Using a Bubble Plot
Compare this graph to Figure 4.13. The Y and X variables are the same, but this picture also conveys further descriptive generalizations: the highest birth and fertility rates are in sub-Saharan Africa, where maternal mortality rates are also the highest in the world. As you move your cursor around the graph, you will see hover labels identifying the points. Notice also that countries with similar birth and fertility rates might have quite different maternal mortality rates (bubble sizes). Bubble plots pack a great deal of information into a compact, data-rich display.
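When a column is mapped to bubble size, a common graphing convention is to make each bubble's area, rather than its radius, proportional to the value, so that large values are not visually exaggerated. The Python sketch below illustrates that convention. It is not a description of how JMP scales its bubbles; the mortality figures are made up, and `bubble_radii` and `max_radius` are hypothetical names.

```python
import math

def bubble_radii(values, max_radius=20.0):
    """Scale radii so that each bubble's area is proportional to its value."""
    largest = max(values)
    if largest <= 0:
        raise ValueError("at least one value must be positive")
    # Area grows with the square of the radius, so take the square root of the ratio.
    return [max_radius * math.sqrt(max(v, 0) / largest) for v in values]

# Made-up maternal mortality figures, for illustration only.
mortality = [800, 200, 50, 8]
radii = bubble_radii(mortality)
print([round(r, 1) for r in radii])
```

Here `max_radius` plays a role loosely similar to the Bubble Size slider: it rescales every bubble together without changing the bubbles' sizes relative to one another.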
Application
Now that you have completed all of the activities in this chapter, use the concepts and techniques that you have learned to respond to these questions.
1. Scenario: We will continue to examine the World Development Indicators data in BirthRate 2017. We will broaden our analysis to work with other variables:
- Provider: Source of maternity leave benefits (public, private, or a combination of both)
- Fertil: Average number of births per woman during child-bearing years
- MortUnder5: Deaths of children under 5 years per 1,000 live births
- MortInfant: Deaths of infants per 1,000 live births
a. Create a mosaic plot and contingency table for the Provider and Region columns. Report on what you find.
b. Use appropriate methods to investigate how fertility rates vary across regions of the world. Report on what you find.
c. Create a scatterplot for MortUnder5 and MortInfant. Report the equation of the fitted line and the R-square value, and explain what you have found.
d. Is there any noteworthy pattern in the covariation of Provider and MatLeave90+? Explain what techniques you used, and what you found.
2. Scenario: How do prices of used cars differ, if at all, in different areas of the United States? How do the prices of used cars vary according to the mileage of the cars? Our data table Used Cars contains observational data about the listed prices of three popular compact car models in three different metropolitan areas in the U.S. The cities are Phoenix, AZ; Portland, OR; and Raleigh-Durham-Chapel Hill, NC. The car models are the Chrysler PT Cruiser Touring Edition, the Honda Civic EX, and the Toyota Corolla LE. The cars were all two years old at the time.
a. Create a scatterplot of price versus mileage. Report the equation of the fitted line, the R-square value, and the correlation coefficient, and explain what you have found.
b. Use the Graph Builder to see whether the relationship between price and mileage differs across different car models.
c. Describe the distribution of prices across the three cities in this