When one of the chosen values is the MLE, for example the sample mean, then something called the maximum LR is calculated (see Section 2.1.5). This is equivalent to using p values, since there is a direct transformation between the maximum LR and a p value: both measure evidence against a single value, usually represented by the null hypothesis. This is useful because it provides the maximum possible LR, and hence support, for an effect relative to another specified value such as the null.
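As a rough sketch of that transformation, the code below assumes a normal model with known standard deviation, for which the log of the maximum LR equals z²/2; the sample values and variable names are invented for illustration and are not taken from Section 2.1.5.

```python
from math import exp, sqrt, erfc

# Hypothetical numbers for illustration (not from the text): a sample of n
# observations from a normal model with known sigma, where the MLE is the sample mean.
xbar, mu0, sigma, n = 103.0, 100.0, 15.0, 25

z = (xbar - mu0) / (sigma / sqrt(n))   # z statistic against the null value mu0
max_support = z**2 / 2                 # S = log of the maximum LR (MLE vs null)
max_LR = exp(max_support)              # maximum likelihood ratio against the null
p_two_sided = erfc(abs(z) / sqrt(2))   # the corresponding two-sided p value

print(f"z = {z:.2f}, max LR = {max_LR:.2f}, S = {max_support:.2f}, p = {p_two_sided:.3f}")
```

Because z, the maximum support, and the p value are all functions of one another in this setting, reporting the maximum support conveys the same information as the p value while staying on the support scale.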
1.2.4 Pros and Cons of Likelihood Approach
Advantages:
1 Provides an objective measure of evidence between competing hypotheses, unaffected by the intentions of the investigator.
2 Calculations tend to be simpler, and are often based on commonly used statistics such as z, t, and F.
3 Likelihoods, LRs and support (log LR) can be calculated directly from the data. Statistical tables and critical values are not required.
4 The scale for support values is intuitively easy to understand, ranging from −∞ to +∞. Positive values represent support for the primary hypothesis, negative values represent support for the secondary hypothesis.3 Zero represents no evidence for either hypothesis.
5 Support values are proportional to the quantity of data, representing the weight of evidence. This means that support values from independent studies can simply be added together, as illustrated in the sketch following this list.
6 Collecting additional data in a study will tend to strengthen the support for the hypothesis closest to the true value. By contrast, with p values, even when the null hypothesis is true, collecting additional data will always eventually give statistically significant results at any level of α (0.05, 0.01, 0.001, etc.) required.
7 Categorical data analyses are not restricted by normality assumptions, and within a given model, support values sum algebraically.
8 It is versatile and has unlimited flexibility for model comparisons.
9 The stronger the evidence, the less likely it is to be misleading, because the probability of observing misleading evidence has a universal bound of e^(−m), where m is the support for one hypothesis over another.
10 Unlike other approaches based on probabilities, it is unaffected by transformations of variables.
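To make points 5 and 9 concrete, here is a minimal sketch, assuming a normal model with known standard deviation and invented study summaries: support for the same pair of hypotheses is computed in two independent studies, the two values are added, and the universal bound e^(−m) is evaluated at the combined support.

```python
from math import exp

def support_normal(xbar, n, sigma, mu1, mu0):
    """Support (log likelihood ratio) for mu1 vs mu0 under a normal model
    with known sigma, given a sample mean xbar from n observations."""
    return n / (2 * sigma**2) * ((xbar - mu0)**2 - (xbar - mu1)**2)

# Two hypothetical independent studies comparing the same pair of hypotheses
# (all numbers are made up for illustration, not taken from the text).
mu0, mu1, sigma = 0.0, 5.0, 10.0
S1 = support_normal(xbar=4.0, n=20, sigma=sigma, mu1=mu1, mu0=mu0)
S2 = support_normal(xbar=3.5, n=30, sigma=sigma, mu1=mu1, mu0=mu0)

S_total = S1 + S2        # support from independent studies simply adds
bound = exp(-S_total)    # universal bound e^(-m) on the probability of misleading evidence

print(f"S1 = {S1:.2f}, S2 = {S2:.2f}, combined S = {S_total:.2f}, bound = {bound:.4f}")
```

With these invented numbers each study gives a support of 1.5, the combined support is 3.0, and the probability of evidence this misleading is bounded by e^(−3) ≈ 0.05.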
Disadvantages:
1 Does not provide a specific threshold separating unimportant from important results.
2 Few major statistical packages support the evidential approach.
3 Few statistics curricula include the evidential approach.
4 Hence, few researchers are familiar with the evidential approach.
1.3 Effect Size – True If Huge!
Breaking news! Massive story! Huge if true! These are phrases used in media headlines to report the latest outrage or scoop. How do we decide how big the story is? Well, there may be several dimensions: timing (e.g. novelty), proximity (cultural and geographical), prominence (celebrities), and magnitude (e.g. number of deaths). In science, the issues of effect size and impact may be more prosaic but are actually of great importance. Indeed, this issue has been sadly neglected in statistical teaching and practice. Too much emphasis has been put on whether a result is statistically significant or not. As Cohen [31] observed ‘The primary product of a research inquiry is one or more measures of effect size, not p values’. We need to ask what the effect size is, and how we should measure it.
The effect size, or size of effect, is simply the observed magnitude of a difference between two measurements or the strength of the association between two variables. For example, if there is a very obvious difference between the outcomes produced by two clinical treatments with a high proportion of patients cured, we would say the effect size is large. On the other hand, if the difference between the treatment outcomes were barely noticeable, then we would say that the effect size is small. In general, the larger the effect size, the greater the practical or clinical importance of the result. The effect size in clinical treatments is clearly important, but the effect size also impacts on the assessment of theories, where the observed effect size strongly influences the credibility of the theory.
A common question is: how do we know what effect size is of practical importance? That depends. If we think of prices, a difference of $1 between two car insurance quotes would probably not be considered important. However, a $1 difference in the cost of a coffee offered by two similar cafés would likely influence our choice of café. Sometimes, it is more difficult than this. For example, a drug that produces an absolute risk reduction of 1% might appear to be a small effect size. The reduction means that of 100 people taking the drug there would be 1 fewer person suffering from the disease. If the baseline rate of the disease is 10% in people not taking the drug, then taking the drug would reduce this to 9%. Again, this might appear small, but if we consider a million people, this would represent an extra 10 000 people being affected if they did not take the drug.
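Writing the risk-reduction arithmetic out explicitly (the numbers below simply restate those in the paragraph above):

```python
# Restating the risk-reduction arithmetic from the paragraph above.
baseline_rate = 0.10                          # disease rate without the drug (10%)
treated_rate = 0.09                           # disease rate with the drug (9%)
arr = baseline_rate - treated_rate            # absolute risk reduction = 0.01 (1%)

fewer_per_100 = arr * 100                     # 1 fewer case per 100 people treated
population = 1_000_000
extra_cases = arr * population                # 10 000 extra cases if no one took the drug

print(f"ARR = {arr:.0%}; {fewer_per_100:.0f} fewer case per 100 treated; "
      f"{extra_cases:,.0f} extra cases per million if untreated")
```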
Reporting the effect size for the results of a study is important, informing readers about the impact of the findings. Effect size is also used in the context of planning a study, since it influences the statistical power of the study. Here, it may be specified in three ways. First, the effect size may be that expected from similar previous published work or even from a pilot study. Second, it may be the minimum effect size that is of practical or clinical importance. Third, it may be simply an effect size that is considered to represent a useful effect. For example, in testing a new treatment in hypertensive patients with systolic blood pressure of 140 mmHg or more, a clinician might judge that a mean reduction of at least 10 mmHg would be clinically important and have a clear desirable health outcome. This would represent the minimum effect size. Alternatively, a clinician might not specify a minimum but merely judge that a mean decrease of 15 mmHg would be clinically important. (In practice, other aspects of a new treatment need to be considered – will the treatment be financially affordable and what might be the likely adverse side effects?)
In most areas of research, there is some effect, however small, for a treatment difference or a correlation. Tukey in 1991 [32] explained this in a forthright manner: ‘Statisticians classically asked the wrong question – and were willing to answer with a lie, one that was often a downright lie. They asked “Are the effects of A and B different?” and they were willing to answer “no.” All we know about the world teaches us that the effects of A and B are always different – in some decimal place – for any A and B. Thus asking “Are the effects different?” is foolish’. Hence, we generally need to think about how large an effect is, rather than whether one is present or not. The latter practice is encouraged by the statistical testing approach, where p is less than or greater than some significance level.
The habit of thinking about effect size forces the researcher to focus on the phenomenon under study. It places emphasis on practical/clinical importance of findings. One scientist, clearly interested in effect sizes, expresses their frustration ‘Honestly, at some point I'd like to work on things where the effect size is grounded on a real-world measurable outcome, but if I'm just looking at difference between psych measures, I'm not sure how to define it other than that’ (Twitter @PaoloAPalma 22 May 2019). This brings us on to how to define effect size.
Which metric should be used to define effect size? Baguley argues convincingly that the best measure of effect size uses the original units