Читать книгу Statistical Significance Testing for Natural Language Processing - Rotem Dror - Страница 11
ОглавлениеCHAPTER 2
Statistical Hypothesis Testing
We begin with a definition of the statistical hypothesis testing framework. This fundamental framework will then allow us to discuss statistical significance tests (Chapter 3) and later on their application to experimental research in NLP.
A statistical hypothesis is defined as an hypothesis that is testable by observing and analyzing a process modeled by a set of random variables. In the basic setting, two datasets are compared and a hypothesis is proposed for the statistical relationship between them. This hypothesis is usually suggested as an alternative to an ideal null hypothesis that (often) proposes no relationship between two datasets. If the relationship between the datasets seems unlikely under the null hypothesis according to a threshold probability—the significance level—the null hypothesis will be rejected.
In order to distinguish between the null hypothesis and the alternative hypothesis, we consider two conceptual types of errors. The first type of error occurs when the null hypothesis is wrongly rejected while the second occurs when we wrongfully do not reject the null hypothesis. These two types of errors are known as type I and type II errors, and we will further elaborate on them later on.
In empirical machine learning research in general, and in the NLP community in particular, we would often like to prove the superiority of one algorithm over the other, and present this superiority in terms of a statistically significant improvement according to an evaluation metric, such as accuracy or F-score.1 Therefore, we begin by formulating a general hypothesis testing framework for the comparison between two algorithms. This is the common type of hypothesis testing framework applied in NLP, and its detailed formulation will help us develop our ideas.
2.1 HYPOTHESIS TESTING
We wish to compare two algorithms, A and B. As an example, let us consider a comparison between two machine translation (MT) algorithms: phrase-based MT (such as the Moses MT system [Koehn et al., 2007]) vs. an LSTM Neural Encoder-decoder Network (e.g., the model described in Cho et al. [2014]). In order to compare between the two algorithms, we would experiment with several different parallel corpora. Let X be the set of such corpora, i.e., a collection of datasets X = {X1, X2,…, XN}, where each data set Xi is comprised of sentence pairs, one from the source language and one from the target language. That is, for all i є {1,…, N}, Xi = {xi,1,…, xi,ni}. where xi,j is a source language sentence and its translation.
The difference in performance between the two algorithms is measured with one or more evaluation metrics. In our example, when evaluating the performance of machine translation systems, we may use several evaluation measures to assess the quality of translation from various angles. For example, we would probably like our MT system to provide an accurate translation but we may also want to encourage creativity and linguistic richness, and prefer systems that do not excessively repeat the same words and phrases. Accordingly, we would evaluate it using two vastly used different metrics: BLEU [Papineni et al., 2002] and PINC [Chen et al., 2011]. We denote our set of metrics as M = {M1,…., Mm}.2
So far, we have our two MT algorithms A and B, trained and evaluated on a set of metrics M = {M1,…, Mm}. We denote with Mj(ALG, Xi) the value of the measure Mj when algorithm ALG is applied to the dataset Xi. Without loss of generality, we assume that higher values of the measure are better.
We define the difference in performance between two algorithms, A and B, according to the measure Mj on the dataset Xi as:
Finally, using this notation we formulate the following statistical hypothesis testing problem:
The goal of testing the above hypotheses is to determine if algorithm A is significantly better than algorithm B on the dataset Xi using the evaluation measure Mj. In our example, this translates to the following question: “Is the LSTM-based MT system better than the Phrasebased one on the Wikipedia parallel corpus when considering the BLEU metric?”
If we strive to show that the LSTM is superior to the phrase-based system (in the specific setup of the Wikipedia Corpus and the BLEU metric), we would need to provide statistically valid evidence. Our hypotheses can be described as follows: The (somewhat pessimistic) null hypothesis would state that there is no significant performance difference between the LSTM and the phrase-based system, or that the latter performs even better, while the alternative hypothesis would state that the LSTM performs significantly better.
More generally, in our formulation the null hypothesis, H0, states that there is no difference between the performance of algorithm A and algorithm B, or that B performs better. This hypothesis is tested vs. the alternative statement, H1—that A is superior. If the statistical test results in rejecting the null hypothesis, one concludes that A outperforms B in this setup—i.e., on dataset Xi with respect to the evaluation metric Mj. Otherwise, there is not enough evidence in the data to make the conclusion of rejecting the null hypothesis. In this case, it is uncustomary to claim that we accept the null hypothesis, since the null hypothesis is the starting point, and by posing an alternative hypothesis we try to challenge the idealized state.
Naturally, we could be wrong in our conclusion. Our specific experiments may show that the LSTM outperforms the phrase-based system in a certain setup, but this does not necessarily reflect the true nature of things. Let us now properly define the two types of errors that we may encounter in our hypothesis test.
• Type I error—rejection of the null hypothesis when it is true, i.e., there is no difference in performance between the two algorithms. For example, concluding that the LSTM is superior to the phrase-based system in the explored setting when, in fact, that is not the case in general.
• Type II error—non-rejection of the null hypothesis when the alternative hypothesis is true. For example, missing the fact that the LSTM is in fact superior to the phrase-based system.
Knowing which one of the hypotheses is correct with full certainty is practically impossible, as that would require us to create a sample of all possible scenarios, i.e., observe the complete data generating distribution. Therefore, in practice, we can never know which one of the two algorithms is superior, and so the statistical significance testing framework actually strives to minimize the probability of type I and type II errors. We will touch on this in the following section.
Note, however, that reducing the probability of one of the errors may cause an increase of the probability of the other. The classical approach to hypothesis testing is to find a test that guarantees that the probability of making a type I error is upper bounded by a predefined constant α—the significance level of the test—while keeping the probability of a type II error as low as possible. The last is also referred to as designing a test that is as statistically powerful as possible.
A statistical test is called valid if it controls a certain type I error criterion, i.e., it guarantees to bound the error criterion by a predefined constant. By this definition, however, high validity can be obtained by never rejecting any null hypothesis. Hence, the quality of a statistical test is measured not only by its validity, but also by its power: the probability that it would in fact reject a false null hypothesis. This probability is called the statistical power of the test. In general, we wish to design tests that are both valid and powerful.
In the following section we will introduce the concept of p-value, a statistical instrument that allows us to test whether or not the null hypothesis holds, based on a data sample that is available.
2.2 P-VALUE IN THE WORLD OF NLP
We will now discuss a practical approach for deciding whether or not to reject the null hypothesis. We focus on the setup where the performance of two algorithms, A and B, on a dataset X, is compared using an evaluation measure M. Let us denote with M(ALG, X) the value of the evaluation measure M when algorithm ALG is applied to the dataset X. Without loss of generality, we assume that higher values of the measure are better. We define the difference in performance between the two algorithms according to the measure M on the dataset X as:
In our example, A could be the LSTM and B the phrase-based MT system, and M could be the BLEU metric. According to Equation (2.3), δ(X) would be the difference in performance between our two MT algorithms with respect to the BLEU metric. We would like to test whether δ(X) > 0, which would indicate a higher BLEU score (i.e., better performance) for the LSTM. However, we would also like to assess whether this result is likely to happen again in a new experiment, or whether the current experiment does not reflect the actual relationship between the algorithms.
We will refer to δ(X) as our test statistic—a quantity derived from the experiment and used for the statistical hypothesis testing. Using this notation we formulate the following statistical hypothesis testing problem3:
The null hypothesis, H0, is that δ(X) is smaller than or equal to zero, meaning that algorithm B is better than A, or that B is as good as A. In contrast, the alternative hypothesis, H1, is that there is in fact a difference in performance and that algorithm A is superior. In order to decide whether or not to reject the null hypothesis, we can ask the following question.
Considering the test statistic that we chose and its distribution under the null hypothesis, how likely would it be to encounter the δ(X) value that we have observed in our test, given that the null hypothesis is indeed correct?
After all, if δ(X) is a very large number, then algorithm A strongly outperformed algorithm B, and that would be unlikely under the hypothesis that algorithm B is better. To answer this question we will need to compute a probability term where δ(X) is a random variable, which requires some prior knowledge regarding its distribution under the null hypothesis—we will discuss this further later on this book. We therefore phrase our decision in terms of the probability of observing the δobserved value if the null hypothesis was in fact true. This probability is exactly the p-value of the test.
The p-value is defined as the probability, under the null hypothesis H0, of obtaining a result equal to or even more extreme than what was actually observed. For the hypothesis testing framework defined here, the p-value is defined as:
where δobserved is the performance difference between the algorithms (according to M) when they are applied to X. Going back to our example, we could describe the p-value as the probability that the LSTM shows such stronger performance in this setting (i.e., to observe such a δobserved) when the phrase-based MT system is actually a better model. If δobserved is small, meaning the LSTM’s BLEU score is only slightly better than that of the phrase-based system, it may very well be a statistical “fluke”, such that if we were to repeat the experiment with a slightly different dataset from the same distribution we could probably encounter the opposite result of the phrase-based MT performing better. However, as δobserved increases, the probability of encountering such values under the assumption that the phrase-based MT system is better becomes smaller and smaller.
The smaller the p-value, the stronger is the indication that the observed outcome is unlikely under the null hypothesis, H0. In order to decide whether H0 should be rejected, the researcher should pre-define an arbitrary, fixed threshold value α a.k.a the significance level. Only if p-value < α then the null hypothesis is rejected.
For example, let us say that the probability to encounter a difference of 10 points between BLEU(LSTM) and BLEU (phrase-based) under the assumption that the phrase-based MT system is better, is 0:05. For a significance level of 0:1 we would reject the null hypothesis, since p-value < α. For a significance level of 0:03 we would not reject the null hypothesis. A lower α is a stronger demand, equivalent to saying “We need to see a stronger, more extreme improvement in the LSTM in order to determine that it is a superior model. We want to see such a strong improvement (such a large δobserved), that would only have a probability of 0:03 or less under the null hypothesis.”
How should we choose an α? As noted above, it is impossible to actually know which hypothesis is correct, H0 or H1, and hence we can only strive to minimize the probability of choosing the wrong hypothesis. A small α ensures that we do not reject the null hypothesis easily, but it may also cause us to not reject the null hypothesis when we should. More technically, a small α yields a lower probability of a type I error and a higher probability of a type II error. A common practice is to choose an α that guarantees that the probability of making a type I error is upper bounded by a pre-defined desired value, while achieving the highest possible power, i.e., the lowest possible probability of making a type II error. Popular α values in the literature are 0.05 and 0.01.
1 In this book we use the terms evaluation metric and evaluation measure interchangeably.
2 To keep the discussion concise, throughout this book we assume that only one evaluation measure is used. Our framework can be easily extended to deal with multiple measures.
3 For simplicity we consider A one-sided hypothesis, it can be easily reformulated as A two-sided hypothesis.