Statistical Significance Testing for Natural Language Processing - Rotem Dror


CHAPTER 3

Statistical Significance Tests

In this book, we are interested in the process of comparing the performance of different NLP algorithms in a statistically sound manner. How is this goal related to the calculation of the p-value? Well, calculating the p-value is inextricably linked to statistical significance testing, as we will attempt to explain next. Recall the definition of δ(X) in Equation (2.3): δ(X) is our test statistic for the hypothesis test defined there.

δ(X) is computed based on X, a specific data sample. In general, one can claim that if our data sample is representative of the data population, extreme values of δ(X) (either negative or positive) are less likely. In other words, the far left and right tails of the δ(X) distribution curve represent the unlikely events in which δ(X) obtains extreme values. What is the chance, given the null hypothesis is true, to have our δ(X) value land in those extreme tails? That probability is exactly the p-value obtained in the statistical test.

So, we now know that the probability of obtaining a δ(X) this high (or higher) is very low under the null hypothesis. Is the null hypothesis likely given this δ(X)? Well, the answer is, most likely, no: it is much more likely that the performance of algorithm A is better. To summarize, because the probability of seeing such a δ(X) under the null hypothesis (i.e., seeing such a p-value) is very low (< α), we reject the null hypothesis and conclude that there is a statistically significant difference between the performance of the two algorithms. This shows that statistical significance tests and the calculation of the p-value are complementary tools that help quantify the likelihood of the observed results under the null hypothesis.

In this chapter we move from describing the general framework of statistical significance testing to the specific considerations involved in the selection of a statistical significance test for an NLP application. We shall define the difference between parametric and nonparametric tests, and explore another important characteristic of the sample of scores that we work with, one that is highly critical for the design of a valid statistical test. We will present prominent tests useful for NLP setups, and conclude our discussion by providing a simple decision tree that aims to guide the process of selecting a significance test.

3.1 PRELIMINARIES

We previously presented an example of using the statistical significance testing framework for deciding between an LSTM and a phrase-based MT system, based on a certain dataset and evaluation metric, BLEU in our example. We defined our test statistic δ(X) as the difference in BLEU score between the two algorithms, and wanted to compute the p-value, i.e., the probability of observing such a δ(X) under the null hypothesis. But wait, how can we calculate this probability without knowing the distribution of δ(X) under the null hypothesis? Could we possibly choose a test statistic about which we have solid prior knowledge?

A major consideration in the selection of a statistical significance test is the distribution of the test statistic, δ(X), under the null hypothesis. If the distribution of δ(X) is known, then the suitable test will come from the family of parametric tests, which use δ(X)’s distribution under the null hypothesis to obtain statistically powerful results, i.e., a small probability of making a type II error. If the distribution under the null hypothesis is unknown, then any assumption made by a test may lead to erroneous conclusions, and hence we have to back off to nonparametric tests that do not make any such assumptions. While nonparametric tests may be less powerful than their parametric counterparts, they do not make unjustified assumptions and are hence statistically sound even when the test statistic distribution is unknown.

How can one know the test statistic distribution under the null hypothesis? One common tool is the Central Limit Theorem (CLT), which establishes that, in most situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. Hence, statistical significance tests defined over the mean of observations (e.g., the unlabeled attachment score, averaged over the parse trees of the test-set sentences) often assume that this average is normally distributed after proper normalization.
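As a quick illustration (a simulation sketch, not part of the book's exposition), the CLT behavior can be observed with a few lines of NumPy; the per-word correctness indicators and the success probability of 0.8 are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-word "correct/incorrect" indicators drawn from a decidedly
# non-normal (Bernoulli) distribution, as in a UAS-style metric.
n_words = 2000      # hypothetical number of words in a test set
n_replicas = 5000   # number of simulated test sets

samples = rng.binomial(1, 0.8, size=(n_replicas, n_words))
means = samples.mean(axis=1)  # one accuracy-like score per simulated test set

# By the CLT, the per-test-set means are approximately normal around 0.8,
# with standard deviation close to sqrt(p * (1 - p) / n_words).
print(means.mean())
print(means.std(), (0.8 * 0.2 / n_words) ** 0.5)
```

Note that the simulation draws the per-word indicators independently, which is exactly the assumption discussed next.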

Let us elaborate on this. Recall the definition of the test statistic δ(X) from Equation (2.3). In a dependency parsing example, M(A, X) can be, for example, the unlabeled attachment score (UAS) of the parser of Kiperwasser and Goldberg [2016] (K&G), and M(B, X) can be the UAS score of the TurboParser [Martins et al., 2013]. Following the NLP literature, the test-set X consists of multiple sentences, and the metric M is calculated as the average score over all words in the test-set sentences. Hence, according to the CLT, the distributions of both M(A, X) and M(B, X) can be approximated by the normal distribution. Since δ(X) is defined as the difference between two variables with a normal distribution, it can also be assumed to have a normal distribution, which makes it easy to compute probabilities for its different possible values.

Unfortunately, in order to use the CLT, one is required to assume independence between the observations in the sample (test-set), and this independence assumption often does not hold in NLP setups. For example, a dependency parsing test-set (e.g., The WSJ Penn Treebank, Section 23 Marcus et al. [1993]) often consists of subsets of sentences taken from the same article, and many sentences in the Europarl parallel corpus [Koehn, 2005] are taken from the same parliament discussion. Later on in this book we will discuss this fundamental problem, and list it as one of the open issues to be considered in the context of statistical hypothesis testing in NLP.

If we cannot use the CLT in order to assume a normal distribution for the test statistic, we could potentially apply tests designed to evaluate the distribution of a sample of observations. For example, the Shapiro–Wilk test [Shapiro and Wilk, 1965] tests the null hypothesis that a sample comes from a normally distributed population, the Kolmogorov–Smirnov test quantifies the distance between the empirical cumulative distribution function of the sample and the cumulative distribution function of a reference distribution, and the Anderson–Darling test [Anderson and Darling, 1954] tests whether a given sample of data is drawn from a given probability distribution. As we will show later, there are other heuristics that are used in practice but are usually not mentioned in research papers.
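All three tests are available in standard statistics libraries; the sketch below uses scipy.stats, with a synthetic sample of per-sentence score differences invented purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-sentence score differences between two systems.
deltas = rng.normal(loc=0.01, scale=0.05, size=200)

# Shapiro-Wilk: the null hypothesis is that the sample is normal.
w_stat, w_p = stats.shapiro(deltas)

# Kolmogorov-Smirnov against a normal reference distribution (note that
# estimating its parameters from the data makes the test approximate).
ks_stat, ks_p = stats.kstest(deltas, "norm",
                             args=(deltas.mean(), deltas.std(ddof=1)))

# Anderson-Darling for normality; reports critical values per significance
# level rather than a p-value (index 2 corresponds to the 5% level).
ad = stats.anderson(deltas, dist="norm")

print(w_p, ks_p, ad.statistic < ad.critical_values[2])
```

A large p-value from Shapiro–Wilk or Kolmogorov–Smirnov means only that there is no evidence against normality, not proof of it.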

To summarize the above discussion:

• Parametric tests—assume that we have complete knowledge regarding our test statistic’s distribution under the null hypothesis. If we indeed have this knowledge, parametric tests can utilize it to ensure a low probability of making a type II error. However, if the distribution is unknown, then any assumptions made by such a test may lead to erroneous conclusions.

• Nonparametric tests—do not require the test statistic’s distribution under the null hypothesis to be known or assumed. Nonparametric tests may be less powerful than their parametric counterparts, but since they make no assumptions about the test statistic distribution, they remain statistically sound even when that distribution is unknown.

To choose the appropriate statistical tool, we need not only to decide between a parametric and a nonparametric test, but also to consider another important property of our dataset. Many statistical tests require an assumption of independence between the two populations (the test-set scores of the two algorithms in our case), and the following subtle point is often brushed aside: are the two populations that we are comparing truly independent, or are they related to one another? Can we regard two samples that represent a state of “before” and “after” as independent? Or, to take another example from the world of NLP, can we regard as independent the scores of two different algorithms applied to the same sentence?

The above are examples of paired samples—samples in which a natural coupling occurs. In a dataset of paired samples (often called dependent samples), each data point in one sample is uniquely paired to a data point in the other sample. Paired samples can be before/after samples, in which a metric is computed before and after a certain action. Alternatively, they could be matched samples, in which individuals are matched on some characteristic such as age or gender. In general, paired samples appear in any circumstance in which each data point in one sample is directly matched to a data point in the other sample.

As opposed to the case of paired samples, sometimes we have independent samples, consisting of unrelated data points. Such independent samples can be obtained simply by randomly sampling from two different populations. A more realistic case arises in medical experiments, where two separate treatment groups (often a treatment group and a placebo group) are created at random, without first matching the subjects.

Algorithm 3.1 Statistical Hypothesis Testing Process with Critical Regions

Input : H0 the null hypothesis, H1 the alternative hypothesis, α the significance level.

Output : Decision to either reject the null hypothesis in favor of the alternative or not reject it.

1:O ← ∅—the list of observations, initially empty.

2:O ← Perform experiment to test the hypotheses.

3:Decide which statistical test is appropriate.

4:Calculate the observed test statistic T(O).

5:Derive the distribution of the test statistic under the null hypothesis H0.

6:Calculate the critical region—the possible values of T for which the null hypothesis is rejected. The probability of the critical region under the distribution of the test statistic under the null hypothesis is α.

7:Reject the null hypothesis H0 in favor of the alternative hypothesis H1 if the observed test statistic T(O) is in the critical region.

Algorithm 3.2 Statistical Hypothesis Testing Process with p-value

Input : H0 the null hypothesis, H1 the alternative hypothesis, α the significance level.

Output : Decision to either reject the null hypothesis in favor of the alternative or not reject it. Notice: steps 1–5 are the same as in Algorithm 3.1.

6:Calculate the p-value—the probability, under the null hypothesis H0, of observing a test statistic at least as extreme as that which was observed.

7:Reject the null hypothesis H0 in favor of the alternative hypothesis H1 if and only if the p-value is less than (or equal to) α.
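The equivalence of the two procedures is easy to verify numerically. The sketch below assumes, purely for illustration, a one-sided test whose statistic is standard normal under H0 and an invented observed value:

```python
from scipy import stats

alpha = 0.05
t_obs = 1.9  # hypothetical observed statistic, assumed ~ N(0, 1) under H0

# Algorithm 3.1: reject iff T(O) falls in the critical region, here the
# right tail [z_{1-alpha}, inf), whose probability under H0 is alpha.
critical_value = stats.norm.ppf(1 - alpha)  # approx. 1.645
reject_by_region = t_obs >= critical_value

# Algorithm 3.2: reject iff the p-value P(Z >= t_obs) is at most alpha.
p_value = stats.norm.sf(t_obs)
reject_by_p = p_value <= alpha

# The two decision rules always agree.
print(reject_by_region, reject_by_p)
```

Because the critical region is exactly the set of statistic values whose tail probability is at most α, the two rules cannot disagree.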

The notion of paired vs. independent samples is crucial in NLP. Oftentimes we compare several algorithms on the same dataset, and hence paired tests are more common. In what follows, we survey prominent parametric and nonparametric tests, emphasizing the paired setup. In addition, Algorithms 3.1 and 3.2 present pseudocode of the general testing process that is applied when testing for statistical significance. The two processes are equivalent.

3.2 PARAMETRIC TESTS

As previously defined, parametric tests are statistical significance tests that assume prior knowledge regarding the test statistic’s distribution under the null hypothesis. When using such tests, we utilize the test statistic’s assumed distribution in order to ensure a bound on the type I error and a low probability of making a type II error. We will now elaborate on several prominent parametric tests that are suitable for the setup of paired samples.

Algorithm 3.3 The Paired Z-test

Input : Paired samples {xi}, {yi}; σ—the standard deviation of the paired differences.

Output : p—the p-value.

Notations : n—the sample size.

1:Calculate the mean of the paired differences: d̄ = (1/n) Σ (xi − yi).

2:Calculate the test statistic: z = d̄ / (σ/√n).

3:Calculate p = P(Z ≥ z), where Z ∼ N(0, 1).
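Algorithm 3.3 takes only a few lines to implement. The sketch below assumes the one-sided alternative that algorithm A outperforms algorithm B; the BLEU scores and the known σ are invented for illustration:

```python
import math
from statistics import NormalDist

def paired_z_test(xs, ys, sigma):
    """One-sided paired Z-test (Algorithm 3.3); H1: mean(x - y) > 0.

    sigma is the known population standard deviation of the paired
    differences -- the assumption that makes this a Z-test.
    """
    n = len(xs)
    d_bar = sum(x - y for x, y in zip(xs, ys)) / n  # mean paired difference
    z = d_bar / (sigma / math.sqrt(n))              # test statistic
    p = 1.0 - NormalDist().cdf(z)                   # P(Z >= z), Z ~ N(0, 1)
    return z, p

# Hypothetical BLEU scores of two MT systems on four test documents,
# with an (unrealistically) known sigma of 0.01.
bleu_a = [0.31, 0.28, 0.35, 0.30]
bleu_b = [0.29, 0.27, 0.33, 0.28]
z, p = paired_z_test(bleu_a, bleu_b, sigma=0.01)
print(z, p)  # z close to 3.5, p well below 0.05
```

In practice σ is rarely known, which is exactly why the t-test below is preferred in NLP.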

We begin with tests that are highly relevant to NLP setups, accounting for cases where the metric values come from a normal distribution. Example relevant NLP metrics are sentence level accuracy, recall, unlabeled attachment score (UAS) and labeled attachment score (LAS) [Yeh, 2000].

Paired Z-test In this test, the sample is assumed to be normally distributed and the standard deviation of the population is known. This test is used to validate the hypothesis that the sample drawn belongs to the same population, through checking if the sample mean is the same as the population mean. This test is not very applicable in NLP since the population standard deviation is rarely known, but we define it here for completeness. In addition, the statistical test used to validate the same hypothesis without assuming a known standard deviation is one of the most commonly used tests in NLP: the t-test, described next. The Z-test is defined in Algorithm 3.3.

Paired Student’s t-test This test aims to assess whether the population means of two sets of measurements differ from each other, and is based on the assumption that both samples come from a normal distribution [Fisher, 1937]. The calculations of the test statistic and the p-value for this test are shown in Algorithm 3.4.

Since this test assumes a normal distribution and is computed over population means, one may argue that based on the Central Limit Theorem (CLT) it can be applied to compare between any large enough measurement sets; however, in NLP setups the test examples (e.g., sentences from the same document) are often dependent, violating the independence assumption of the CLT.

In practice, the t-test is often applied with evaluation measures such as accuracy, UAS, and LAS, which compute the mean number of correct predictions per input example. When comparing two dependency parsers, for example, we can apply the test to check if the averaged difference of their UAS scores is significantly larger than zero, which can serve as an indication that one parser is better than the other. Using the t-test with such metrics can be justified based on the CLT.

Algorithm 3.4 The Paired Sample t-test

Input : Paired samples.

Output : p—the p-value.

Notations : D—the differences between the two paired samples; di—the ith observation in D; n—the sample size; d̄—the sample mean of the differences; sd—the sample standard deviation of the differences; T—the critical value of a t-distribution with n − 1 degrees of freedom; t—the t-statistic (t-test statistic) for a paired sample t-test.

1:Calculate the sample mean: d̄ = (1/n) Σ di.

2:Calculate the sample standard deviation: sd = √( Σ (di − d̄)² / (n − 1) ).

3:Calculate the test statistic: t = d̄ / (sd/√n).

4:Find the p-value in the t-distribution table, using the predefined significance level α and n − 1 degrees of freedom.

That is, accuracy measures in structured tasks tend to be normally distributed when the number of individual predictions (e.g., number of words in a sentence when considering sentence-level UAS) is large enough.
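Algorithm 3.4 can likewise be implemented directly and cross-checked against the library routine scipy.stats.ttest_rel; the per-sentence UAS scores below are invented for illustration:

```python
import math
import numpy as np
from scipy import stats

def paired_t_test(xs, ys):
    """One-sided paired t-test (Algorithm 3.4); H1: mean(x - y) > 0."""
    d = np.asarray(xs) - np.asarray(ys)
    n = len(d)
    d_bar = d.mean()                     # sample mean of the differences
    s_d = d.std(ddof=1)                  # sample standard deviation
    t = d_bar / (s_d / math.sqrt(n))     # t-statistic with n - 1 dof
    p = stats.t.sf(t, df=n - 1)          # one-sided p-value
    return t, p

# Hypothetical per-sentence UAS scores of two parsers on the same test set.
uas_a = [0.91, 0.88, 0.95, 0.90, 0.87, 0.93]
uas_b = [0.89, 0.86, 0.94, 0.88, 0.86, 0.90]

t, p = paired_t_test(uas_a, uas_b)
t_ref, p_ref = stats.ttest_rel(uas_a, uas_b, alternative="greater")
print(t, p)  # matches the library result up to floating-point error
```

The manual computation and ttest_rel agree because both divide the mean difference by its standard error and consult a t-distribution with n − 1 degrees of freedom.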

