Categorical or Qualitative Data
Nominal Categorical Data
Nominal or categorical data are data that one can name and put into categories. They are not measured but simply counted. They often consist of unordered ‘either‐or’ type observations that have two categories and are often known as binary. For example: dead or alive; male or female; cured or not cured; pregnant or not pregnant. In Table 2.1 gender is a binary variable. However, categorical data often can have more than two categories, for example: blood group O, A, B, AB, country of origin, ethnic group or social class. The methods of presentation of nominal data are limited in scope. Thus Table 2.1 gives the number and percentage of people treated at each of the seven centres in each of the two randomised groups. Categorical data are sometimes referred to as ‘qualitative’, to distinguish them from ‘quantitative’ data, which we will discuss later. However, there is a whole area of methodology called ‘qualitative research’ and so to avoid confusion we will not use this term.
Ordinal Data
If there are more than two categories of classification it may be possible to order them in some way. For example, after treatment a patient may be either improved, the same or worse. In Table 2.1 smoking history is given in three categories: non‐smoker, previous smoker, and current smoker. Thus, someone who is a current smoker has more recent exposure to tobacco than someone who is an ex‐smoker and someone who has never smoked. However, without further knowledge (of the current and past levels of tobacco consumption) it would be wrong to ascribe a numerical quantity to the category, for example, non‐smoker = 0; previous smoker = 1; current smoker = 2, as one cannot say that someone who is a current smoker has twice the levels of tobacco consumption as someone who is a previous smoker. This type of data is also known as ordered categorical or ordinal data.
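As a minimal sketch of this idea, the three smoking categories from Table 2.1 can be stored with an explicit order and compared by position only, without treating the positions as quantities. The function names and the list of patients below are hypothetical and serve only as illustration.

```python
# Ordered categories: the position encodes order only, not a quantity.
SMOKING_ORDER = ("non-smoker", "previous smoker", "current smoker")


def position(category: str) -> int:
    """Return the place of a category in the ordering (0 = least recent exposure)."""
    return SMOKING_ORDER.index(category)


def more_recent_exposure(a: str, b: str) -> bool:
    """True if category a implies more recent tobacco exposure than category b."""
    return position(a) > position(b)


# Hypothetical patients, sorted from least to most recent exposure.
patients = ["current smoker", "non-smoker", "previous smoker", "non-smoker"]
ordered = sorted(patients, key=position)

print(more_recent_exposure("current smoker", "previous smoker"))
print(ordered)
```

Comparing and sorting are legitimate operations for ordinal data. Adding the positions, or taking their ratio, is not, because the spacing between categories is unknown.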
Ranks
In some studies it may be appropriate to assign ranks. For example, patients with corns may be asked to rank four treatments in order of preference: hard skin (corn) removal by scalpel; special rehydration creams for thickened skin; customised soft padding or foam insoles; corn plaster containing salicylic acid. Here, although numerical values from 1 to 4 may be assigned to each treatment, we cannot treat them as numerical values. They are in fact only codes for best, second best, third choice, and worst.
Numerical or Quantitative Data
Count Data
Table 2.1 gives details of the number of corns each participant had at the start of the trial. Since this can only be a whole number or integer value, for example, 0, 1, 2, or 3 in this trial, this is termed count data. Other examples are often counts per unit of time such as the number of deaths in a hospital per year, or the number of attacks of asthma a person has per month. In dentistry, a common measure is the number of decayed, filled or missing teeth (DFM).
Measured or Numerical Continuous
Such data are measurements that can, in theory at least, take any value within a given range. These data contain the most information, and are the ones most commonly used in statistics. Examples of continuous data in Table 2.1 are age, size of index corn, visual analogue scale (VAS) pain score and EQ‐5D tariff.
However, for simplicity, it is often the case in medicine that continuous data are dichotomised to make binary data. Thus, diastolic blood pressure, which is continuous, is converted into hypertension (>90 mmHg) and normotension (≤90 mmHg). This clearly leads to a loss of information. There are two main reasons for dichotomising data. First, it is easier to describe a population by the proportion of people affected, for example, the proportion of people in the population with hypertension is 10%. Second, one often has to make a decision: if a person has hypertension, then they will get treatment, and this too is easier if high blood pressure has been categorised.
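The dichotomisation of diastolic blood pressure can be sketched as follows, using the 90 mmHg cut‐off given in the text. The function name and the readings are hypothetical and chosen only to show what the conversion does.

```python
def classify_diastolic(dbp_mmhg: float, threshold: float = 90.0) -> str:
    """Label a diastolic pressure as hypertension (> threshold) or normotension."""
    return "hypertension" if dbp_mmhg > threshold else "normotension"


# Hypothetical diastolic readings in mmHg (not taken from any study).
readings = [72.0, 88.0, 90.0, 91.0, 104.0]
labels = [classify_diastolic(r) for r in readings]

# Proportion of the sample classified as hypertensive.
proportion_hypertensive = labels.count("hypertension") / len(labels)

print(labels)
print(proportion_hypertensive)
```

Note that readings of 91 and 104 mmHg receive the same label, while 90 and 91 mmHg receive different ones. This is the loss of information referred to above.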
One can also divide a continuous variable into more than two groups. For example, we could divide age into bands of equal width of, say, 10 years, such as: 0–9; 10–19; 20–29, etc. When categorising continuous data, authors should give an indication as to why they chose these cut‐off points, and a reader should be wary that the cuts may have been chosen to make a particular point. Some statisticians have termed the habit of categorising continuous variables ‘dichotomania’, which they regard as poor practice since it loses information and assumes a discontinuous relationship that is unlikely in nature.
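A minimal sketch of banding age into 10‐year groups is given below. The function name and the ages are hypothetical, and age is assumed to be recorded in completed years.

```python
from collections import Counter


def age_band(age_years: int, width: int = 10) -> str:
    """Return the band containing age_years, e.g. 25 -> '20–29' for width 10."""
    lower = (age_years // width) * width
    return f"{lower}–{lower + width - 1}"


# Hypothetical ages in completed years.
ages = [3, 9, 10, 19, 25, 29, 41]
band_counts = Counter(age_band(a) for a in ages)

print(band_counts)
```

Ages 9 and 10 fall into different bands while ages 10 and 19 share one, which illustrates why the choice of cut‐off points needs to be justified.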
Interval and Ratio Scales
One can distinguish between interval and ratio scales. In an interval scale, such as body temperature or calendar dates, a difference between two measurements has meaning, but their ratio does not. For example, when temperature is measured in degrees centigrade, we cannot say that a temperature of 20 °C is twice as hot as a temperature of 10 °C. In a ratio scale, however, such as bodyweight, a 10% increase implies the same weight increase whether expressed in kilogrammes or pounds. The crucial difference is that in a ratio scale, the value of zero has real meaning, whereas in an interval scale, the position of zero is arbitrary.
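The distinction can be checked numerically. The sketch below converts the two temperatures to degrees Fahrenheit and a hypothetical bodyweight of 70 kg to pounds, and compares ratios before and after the change of units. The conversion factor for pounds is approximate.

```python
import math

KG_TO_LB = 2.20462  # approximate number of pounds in one kilogramme


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees centigrade to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval scale: the ratio of two temperatures depends on the units used.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# The difference remains meaningful: 10 degrees centigrade is always 18 degrees Fahrenheit.
diff_fahrenheit = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)

# Ratio scale: a 10% increase in weight is a 10% increase in any unit.
weight_kg = 70.0  # hypothetical bodyweight
increased_kg = weight_kg * 1.10
ratio_kg = increased_kg / weight_kg
ratio_lb = (increased_kg * KG_TO_LB) / (weight_kg * KG_TO_LB)

print(ratio_celsius, ratio_fahrenheit)
print(math.isclose(ratio_kg, ratio_lb))
```

The ratio of the two temperatures is 2 in degrees centigrade but 68/50 = 1.36 in degrees Fahrenheit, because the zero point moves with the units. The ratio of the two weights is unchanged, because zero weight means the same thing in both units.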
One difficulty with giving ranks to ordered categorical data is that one cannot assume that the scale is interval. Thus, as we have indicated when discussing ordinal data, one cannot assume that the risk of a corn healing for a current smoker, relative to a previous smoker, is the same as the risk for a previous smoker relative to a non‐smoker. Were Farndon et al. (2013) simply to score the three levels of smoking as 0, 1, 2 in their subsequent analysis, then this would imply that the intervals between the levels or scores have equal numerical value.
2.2 Summarising Categorical Data
Binary data are the simplest type of data in which each individual has a label that takes one of two values such as: male or female; corn healed or not healed. A simple summary would be to count the different types of label. However, a raw count is rarely useful. For example, in Table 2.1 there are more non‐smokers in the scalpel group (40 out of 99 or 40%) than in the corn plaster group (34 out of 98 or 35%). It is only when this number is expressed as a proportion that it becomes useful. Hence the first step in analysing categorical data is to count the number of observations in each category and express them as proportions of the total sample size.
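The calculation for the non‐smokers quoted from Table 2.1 can be sketched as follows. The counts and group sizes are those given in the text. The function and variable names are hypothetical.

```python
def summarise(count: int, total: int) -> tuple[float, int]:
    """Return the proportion and the percentage rounded to a whole number."""
    proportion = count / total
    return proportion, round(100 * proportion)


# Non-smokers and group sizes as quoted from Table 2.1: (count, total).
non_smokers = {"scalpel": (40, 99), "corn plaster": (34, 98)}

summary = {group: summarise(count, total)
           for group, (count, total) in non_smokers.items()}

for group, (proportion, percentage) in summary.items():
    print(f"{group}: proportion {proportion:.3f}, {percentage}%")
```

Expressed this way, the two groups can be compared directly even though the denominators (99 and 98) differ.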
Illustrative Example – Salicylic