3 For simplicity we consider a one-sided hypothesis; it can easily be reformulated as a two-sided hypothesis.
CHAPTER 3
Statistical Significance Tests
In this book, we are interested in the process of comparing the performance of different NLP algorithms in a statistically sound manner. How is this goal related to the calculation of the p-value? Well, calculating the p-value is inextricably linked to statistical significance testing, as we will attempt to explain next. Recall the definition of δ(X) in Equation (2.3): δ(X) is our test statistic for the hypothesis test defined there.
δ(X) is computed based on X, a specific data sample. In general, one can claim that if our data sample is representative of the data population, extreme values of δ(X) (either negative or positive) are less likely. In other words, the far left and right tails of the δ(X) distribution curve represent the unlikely events in which δ(X) obtains extreme values. What is the chance, given that the null hypothesis is true, that our δ(X) value lands in those extreme tails? That probability is exactly the p-value obtained in the statistical test.
So, we now know that the probability of obtaining a δ(X) this high (or higher) is very low under the null hypothesis. Therefore, is the null hypothesis likely given this δ(X)? Well, the answer is, most likely, no. It is much more likely that the performance of algorithm A is better. To summarize, because the probability of seeing such a δ(X) under the null hypothesis (i.e., the p-value) is very low (< α), we reject the null hypothesis and conclude that there is a statistically significant difference between the performance of the two algorithms. This shows that statistical significance tests and the calculation of the p-value go hand in hand: the p-value quantifies the likelihood of the observed results under the null hypothesis, and the test turns that quantity into a decision.
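The decision rule just described can be sketched in a few lines of Python. This is only an illustration under a loud assumption: that δ(X) follows a normal distribution with known mean and standard deviation under the null hypothesis. The function name `one_sided_p_value`, the observed value 2.1, and the threshold α = 0.05 are all hypothetical choices made for the example.

```python
from statistics import NormalDist

def one_sided_p_value(delta_obs, null_mean=0.0, null_std=1.0):
    """Return P(delta >= delta_obs) under an ASSUMED normal null distribution."""
    null = NormalDist(mu=null_mean, sigma=null_std)
    # Upper-tail probability: the chance of a value this high or higher.
    return 1.0 - null.cdf(delta_obs)

alpha = 0.05                       # hypothetical significance level
p = one_sided_p_value(delta_obs=2.1)  # hypothetical observed test statistic
reject = p < alpha                 # reject the null hypothesis if p < alpha
print(f"p-value = {p:.4f}, reject null hypothesis: {reject}")
```

With these made-up numbers the p-value falls below α, so the null hypothesis would be rejected. In practice, the hard part is justifying the null distribution itself.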
In this chapter we move from describing the general framework of statistical significance testing to the specific considerations involved in the selection of a statistical significance test for an NLP application. We shall define the difference between parametric and nonparametric tests, and explore another important characteristic of the sample of scores that we work with, one that is highly critical for the design of a valid statistical test. We will present prominent tests useful for NLP setups, and conclude our discussion by providing a simple decision tree that aims to guide the process of selecting a significance test.
3.1 PRELIMINARIES
We previously presented an example of using the statistical significance testing framework for deciding between an LSTM and a phrase-based MT system, based on a certain dataset and evaluation metric, BLEU in our example. We defined our test statistic δ(X) as the difference in BLEU score between the two algorithms, and wanted to compute the p-value, i.e., the probability of observing such a δ(X) under the null hypothesis. But wait, how can we calculate this probability without knowing the distribution of δ(X) under the null hypothesis? Could we possibly choose a test statistic about which we have solid prior knowledge?
A major consideration in the selection of a statistical significance test is the distribution of the test statistic, δ(X), under the null hypothesis. If the distribution of δ(X) is known, then the suitable test will come from the family of parametric tests, which use δ(X)’s distribution under the null hypothesis in order to obtain statistically powerful results, i.e., results with a small probability of a type II error. If the distribution under the null hypothesis is unknown, then any assumption made by a test may lead to erroneous conclusions, and hence we have to back off to nonparametric tests that do not make any such assumptions. While nonparametric tests may be less powerful than their parametric counterparts, they do not make unjustified assumptions and are hence statistically sound even when the test statistic distribution is unknown.
How can one know the test statistic distribution under the null hypothesis? One common tool is the Central Limit Theorem (CLT), which establishes that, in most situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. Hence, statistical significance tests defined over the mean of observations (e.g., the unlabeled attachment score values of the parse trees of the test-set sentences) often assume that this average is normally distributed after proper normalization.
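A small simulation can make the CLT claim tangible. The sketch below is not tied to any NLP data. It draws independent observations from an exponential distribution (heavily skewed, chosen here only for illustration), normalizes the sample mean, and checks how often the result falls within ±1.96, which should happen about 95% of the time if the normalized mean is close to standard normal. The function name, sample sizes, and seed are arbitrary assumptions.

```python
import random

def normalized_means(n_obs, n_repeats, seed=0):
    """Normalized sample means of i.i.d. Exponential(1) draws (mean 1, std 1)."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        sample = [rng.expovariate(1.0) for _ in range(n_obs)]
        sample_mean = sum(sample) / n_obs
        # Subtract the true mean (1) and divide by the std of the mean (1 / sqrt(n)).
        results.append((sample_mean - 1.0) * n_obs ** 0.5)
    return results

z = normalized_means(n_obs=200, n_repeats=2000)
inside = sum(abs(v) <= 1.96 for v in z) / len(z)
print(f"fraction of normalized means within +/-1.96: {inside:.3f}")
```

Note that the simulation draws the observations independently by construction; whether real test-set observations satisfy this is a separate question.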
Let us elaborate on this. Recall the definition of the test statistic δ(X) from Equation (2.3). In a dependency parsing example, M(A, X) can be, for example, the unlabeled attachment score (UAS) for the parser of Kiperwasser and Goldberg [2016] (K&G), and M(B, X) can be the UAS of the TurboParser [Martins et al., 2013]. Following the NLP literature, the test-set X is comprised of multiple sentences, and the metric M is calculated as the average score over all words in the test-set sentences. Hence, according to the CLT, the distribution of both M(A, X) and M(B, X) can be approximated by the normal distribution. Since δ(X) is defined as the difference between two variables with a normal distribution, it can also be assumed to have a normal distribution, which will make it easy for us to compute probabilities for its different possible values.
Unfortunately, in order to use the CLT, one is required to assume independence between the observations in the sample (test-set), and this independence assumption often does not hold in NLP setups. For example, a dependency parsing test-set (e.g., the WSJ Penn Treebank, Section 23 [Marcus et al., 1993]) often consists of subsets of sentences taken from the same article, and many sentences in the Europarl parallel corpus [Koehn, 2005] are taken from the same parliament discussion. Later on in this book we will discuss this fundamental problem, and list it as one of the open issues to be considered in the context of statistical hypothesis testing in NLP.
If we cannot use the CLT in order to assume a normal distribution for the test statistic, we could potentially apply tests designed to evaluate the distribution of a sample of observations. For example, the Shapiro–Wilk test [Shapiro and Wilk, 1965] tests the null hypothesis that a sample comes from a normally distributed population, the Kolmogorov–Smirnov test quantifies the distance between the empirical cumulative distribution function of the sample and the cumulative distribution function of a reference distribution, and the Anderson–Darling test [Anderson and Darling, 1954] tests whether a given sample of data is drawn from a given probability distribution. As we will show later, there are other heuristics that are used in practice but are usually not mentioned in research papers.
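To illustrate the quantity behind the Kolmogorov–Smirnov test, the following sketch computes the largest gap between a sample's empirical cumulative distribution function and a fully specified reference distribution, here the standard normal. It computes only the distance, not a p-value, and the helper `ks_statistic` and the synthetic samples are assumptions made for the example. Libraries such as SciPy offer complete implementations (`scipy.stats.shapiro`, `scipy.stats.kstest`, `scipy.stats.anderson`), which would be the natural choice in practice.

```python
import random
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a reference `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

rng = random.Random(0)
reference = NormalDist(0.0, 1.0).cdf

# Synthetic samples: one truly standard normal, one skewed (shifted
# exponential with mean 0 and std 1, so only the shape differs).
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
skewed_sample = [rng.expovariate(1.0) - 1.0 for _ in range(1000)]

d_normal = ks_statistic(normal_sample, reference)
d_skewed = ks_statistic(skewed_sample, reference)
print(f"KS distance, normal sample: {d_normal:.3f}")
print(f"KS distance, skewed sample: {d_skewed:.3f}")
```

The skewed sample should show a clearly larger distance than the normal one. If the reference distribution's parameters are estimated from the same sample, the standard Kolmogorov–Smirnov critical values no longer apply, which is one reason to prefer a library routine designed for that case.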
To summarize the above discussion:
• Parametric tests: assume that we have complete knowledge regarding our test statistic’s distribution under the null hypothesis. If we indeed have this knowledge, parametric tests can utilize it to ensure a low probability of making a type II error. However, if the distribution is unknown, then any assumptions made by such a test may lead to erroneous conclusions.
• Nonparametric tests: do not require the test statistic’s distribution under the null hypothesis to be known or assumed. Nonparametric tests may be less powerful than their parametric counterparts, but because they make no assumptions about the test statistic distribution, they are statistically sound even when that distribution is unknown.
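The trade-off summarized above can be sketched on a toy example. The code below applies two one-sided tests to a made-up list of per-sentence score differences between two algorithms. The first is a parametric test that assumes the normalized mean difference is standard normal under the null hypothesis. The second is the nonparametric sign test, which assumes only that, under the null hypothesis, each nonzero difference is equally likely to be positive or negative. The data, function names, and the use of a normal approximation with only ten observations are illustrative assumptions, not a recommended procedure.

```python
from math import comb, sqrt
from statistics import NormalDist, mean, stdev

def z_test_p_value(diffs):
    """Parametric: ASSUMES the normalized mean is standard normal under the null."""
    n = len(diffs)
    z = mean(diffs) / (stdev(diffs) / sqrt(n))
    return 1.0 - NormalDist().cdf(z)

def sign_test_p_value(diffs):
    """Nonparametric: uses only the signs of the nonzero differences."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    k = sum(d > 0 for d in nonzero)
    # P(at least k positives out of n) when each sign is a fair coin flip.
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

# Hypothetical per-sentence differences (algorithm A minus algorithm B).
diffs = [0.8, 1.2, -0.3, 0.5, 0.9, -0.1, 1.1, 0.4, 0.7, 0.6]

p_parametric = z_test_p_value(diffs)
p_sign = sign_test_p_value(diffs)
print(f"parametric p-value: {p_parametric:.5f}")
print(f"sign test p-value:  {p_sign:.5f}")
```

On this toy data the sign test sees 8 positive differences out of 10, giving a p-value of 56/1024 ≈ 0.055, while the parametric test yields a p-value below 0.001. The parametric test is more decisive here, but its conclusion is only as trustworthy as the normality assumption behind it.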
In ensuring that we choose the appropriate statistical tool, we need not only to decide between a parametric and a nonparametric test, but should also consider another important quality of our dataset. Many statistical tests require an assumption of independence between the two populations (the test-set scores of the two algorithms in our case), and the following subtle point is often brushed aside: are the two populations that we are comparing truly independent, or are they related to one another? Can we regard samples that represent a state of “before” and “after” as independent? Or yet another example, from the world