Illustrative Example – Scatter Plot – Baseline Corn Size by Corn Size at a Three‐Month Follow‐up
A scatter plot is illustrated in Figure 2.8 for the baseline corn size against the three‐month follow‐up corn size for 181 patients with corns on their feet. There appears to be some association between baseline and three‐month corn size in this sample, with larger baseline corn sizes associated with larger three‐month corn sizes and vice versa. Several patients have the same combination of baseline and three‐month corn sizes, so many data points in the scatter plot would overlap. Such overlapping, or overplotting, can occur when a continuous measurement, in this example corn size, is rounded to some convenient unit (e.g. the nearest mm). To prevent overlapping data points in Figure 2.8, we have jittered the data. Jittering is the act of adding a small amount of random noise to data in order to prevent overplotting in statistical graphs.
Figure 2.8 Scatter plot of baseline corn size by corn size at a three‐month follow‐up for 181 patients with corns.
(Source: data from Farndon et al. 2013).
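Jittering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corn sizes below are hypothetical values (not the Farndon et al. data), the helper name `jitter` and the noise width of 0.3 mm are our own choices, and uniform noise is one of several reasonable options.

```python
import numpy as np

def jitter(values, amount=0.3, seed=0):
    """Return a copy of `values` with uniform noise in [-amount, +amount) added.

    `amount` is an assumed noise width; keeping it below 0.5 mm means each
    jittered point still rounds back to its recorded value.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-amount, amount, size=values.shape)

# Hypothetical corn sizes (mm), rounded to the nearest mm, so pairs repeat.
baseline = np.array([3, 3, 3, 4, 4, 5, 5, 5, 6])
follow_up = np.array([2, 2, 2, 3, 3, 4, 4, 4, 6])

x = jitter(baseline, seed=1)
y = jitter(follow_up, seed=2)

# Count the distinct plotted positions before and after jittering.
n_before = len(set(zip(baseline.tolist(), follow_up.tolist())))
n_after = len(set(zip(x.tolist(), y.tolist())))
print(n_before, n_after)
```

The jittered values are used for plotting only; summary statistics should still be calculated from the recorded measurements.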
It is likely that baseline corn size will have an influence on corn size at three months, but the reverse cannot be the case. When one variable, x (baseline corn size), could cause the other, y (three‐month corn size), it is usual to plot the x variable on the horizontal axis and the y variable on the vertical axis.
In contrast, if we were interested in the relationship between baseline corn size and height of the patient then either variable could cause or influence the other. In this example it would be immaterial which variable (corn size or height) is plotted on which axis.
Measures of Symmetry
One important reason for producing dot plots and histograms is to get some idea of the shape of the distribution of the data. In Figure 2.6 there is a (slight) suggestion that the distribution of corn size is not symmetric; that is, if the distribution were folded over some central point, the two halves of the distribution would not coincide. When this is the case, the distribution is termed skewed. A distribution is right (left) skewed if the longer tail is to the right (left) (see Figure 2.9). If the distribution is symmetric then the median and mean will be close. If the distribution is skewed then the median and interquartile range are in general more appropriate summary measures than the mean and standard deviation, since the latter are sensitive to the skewness.
Figure 2.9 Examples of two skewed distributions.
For the corn size data, the mean from the 200 patients is 3.8 mm and the median is 4 mm, so we conclude the data are reasonably symmetric. One is more likely to see skewness when the variables are constrained at one end or the other. For example, waiting time or time in hospital cannot be negative, but can be very large for some patients while being relatively short for the majority, and so it is likely to be right or positively skewed.
A common skewed distribution is annual income, where a few high earners pull up the mean, but not the median. In the UK about 68% of the population earn less than the average wage, that is, the mean value of annual pay is equivalent to the 68th percentile of the income distribution. Thus, many people who earn more than 50% of the population (that is, more than the median) will still feel underpaid!
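The effect of a long right tail on the mean can be checked with a small calculation. The ten incomes below are hypothetical figures chosen purely for illustration (they are not UK earnings data); the sketch compares the mean with the median and finds the proportion of people earning less than the mean.

```python
import statistics

# Hypothetical annual incomes (thousands of pounds): most are modest,
# while two high earners form the long right tail.
incomes = [18, 20, 22, 24, 25, 27, 30, 34, 60, 140]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Proportion of people earning less than the mean income.
below_mean = sum(1 for v in incomes if v < mean_income) / len(incomes)

print(mean_income, median_income, below_mean)
```

In this made‐up sample the mean is 40 and the median is 26, and 80% of the incomes lie below the mean, the same pattern as described above for a right skewed distribution.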
2.6 Within‐Subject Variability
In Figure 2.1, measurements were made only once for each subject. Thus the variability, expressed, say, by the standard deviation, is the between‐subject variability. If, however, measurements are made repeatedly on one subject, we are assessing within‐subject variability.
Illustrative Example – Within‐Subject Variability – Total Steps per Day
Figure 2.10 shows an example in which the total steps‐per‐day walked by one subject, assessed by a pedometer worn on the hip, was recorded every day for 100 days. The observed daily total step count is subject to day‐to‐day fluctuations. There is considerable day‐to‐day variation in steps but little evidence of any trend over time. Such variation is termed within‐subject variation. The within‐subject standard deviation in this case is SD = 4959 steps, with a mean daily step count of 14 107 steps over the 100 days.
Figure 2.10 Total steps per day for 100 days for one participant in a global corporate challenge designed to increase physical activity.
If another subject had also completed this experiment, we could calculate their within‐subject variation as well, and perhaps compare the variabilities for the two subjects using these summary measures. For example, a second subject had a mean step count of 12 745 with a standard deviation of 4861 steps, and so has a smaller mean but similar variability.
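A comparison of this kind can be sketched as follows. The step counts are hypothetical (one week per subject rather than the 100 days of Figure 2.10), the subject labels are our own, and we assume the usual sample standard deviation with divisor n − 1.

```python
import statistics

# Hypothetical daily step counts for two subjects (one week each).
steps = {
    "subject_A": [12000, 15000, 9000, 18000, 14000, 20000, 10000],
    "subject_B": [11000, 13000, 8000, 17000, 12000, 19000, 11000],
}

# Within-subject mean and standard deviation, calculated separately
# from each subject's own repeated measurements.
summary = {}
for subject, counts in steps.items():
    summary[subject] = (statistics.mean(counts), statistics.stdev(counts))

for subject, (mean_steps, sd_steps) in summary.items():
    print(f"{subject}: mean = {mean_steps:.0f}, SD = {sd_steps:.0f}")
```

Each standard deviation here describes how one person's count varies from day to day; it says nothing about how much people differ from one another.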
Successive within‐subject values are unlikely to be independent, that is, consecutive values will be dependent on values preceding them. For example, if a sedentary or inactive person records a low step count on one day, then the count is likely to be low on the next day as well. This does not imply that the step count will be low, only that it is a good bet that it will be. In contrast, examples can be found in which high step counts are usually followed by lower values and vice versa. With independent observations, the step count on one day gives no indication or clue as to the step count on the next.
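One simple way to examine this kind of dependence is the lag‐1 autocorrelation, the correlation between each value and the one that follows it. It is not a measure introduced in this section, so the sketch below is offered only as an illustration; the function name and both series (in thousands of steps) are hypothetical.

```python
import numpy as np

def lag1_autocorrelation(series):
    """Correlation between each value and its successor (lag-1 autocorrelation)."""
    x = np.asarray(series, dtype=float)
    d = x - x.mean()  # deviations from the series mean
    return float(np.sum(d[:-1] * d[1:]) / np.sum(d * d))

# Hypothetical daily step counts, in thousands of steps.
persistent = [5, 5, 6, 6, 7, 14, 15, 15, 16, 16]   # low days follow low days
alternating = [5, 15, 6, 14, 5, 16, 6, 15, 5, 14]  # high days follow low days

r_persistent = lag1_autocorrelation(persistent)
r_alternating = lag1_autocorrelation(alternating)
print(round(r_persistent, 2), round(r_alternating, 2))
```

A positive value corresponds to low counts tending to follow low counts, a negative value to high counts tending to be followed by lower ones, and a value near zero is what would be expected from independent observations.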
It is clear from Figure 2.10 that the daily step counts are not constant over the observation period. This is nearly always the case when medical observations or measurements are taken over time. Such variation occurs for a variety of reasons. For example, the step count may depend critically on the day of the week, whether the subject was on holiday or at work, or whether the subject was on medication or unwell. There may be observer‐to‐observer variation if the successive step counts were recorded by different personnel rather than always by the same person. There may be measuring‐device‐to‐measuring‐device variation if the successive step counts were recorded by different pedometers rather than always by the same pedometer. The possibility of recording errors in the laboratory, or of transcription errors when conveying the results to the clinic or for statistical analysis, should not be overlooked in appropriate circumstances. When only a single observation is made on one patient at one time, the influences of the above sources of variation are not assessable, but may nevertheless all be reflected to some extent in the final entry in the patient's record.
Suppose successive observations on a patient with heart disease taken over time fluctuate around some more or less constant daily step count, then the particular level may be