Statistical Significance Testing for Natural Language Processing - Rotem Dror
Preface
The field of Natural Language Processing (NLP) has made substantial progress in the last two decades. This progress stems from multiple sources: the data revolution that has made abundant amounts of textual data from a variety of languages and linguistic domains available, the development of increasingly effective predictive statistical models, and the availability of hardware that can apply these models to large datasets. This dramatic improvement in the capabilities of NLP algorithms carries the potential for a great impact.
The extended reach of NLP algorithms has also led NLP papers to place more and more emphasis on their experiment and result sections, which compare multiple algorithms on various datasets from different languages and domains. It can be safely argued that the ultimate test for the quality of an NLP algorithm is its performance on well-accepted datasets, sometimes referred to as “leader-boards”. This emphasis on empirical results highlights the role of statistical significance testing in NLP research: if we rely on empirical evaluation to validate our hypotheses and reveal the correct language processing mechanisms, we had better be sure that our results are not coincidental.
The goal of this book is to discuss the main aspects of statistical significance testing in NLP. Particularly, we aim to briefly summarize the main concepts so that they are readily available to the interested researcher, address the key challenges of hypothesis testing in the context of NLP tasks and data, and discuss open issues and the main directions for future work.
We start with two introductory chapters that present the basic concepts of statistical significance testing: Chapter 2 provides a brief presentation of the hypothesis testing framework, and Chapter 3 introduces common statistical significance tests. Then, Chapter 4 discusses the application of statistical significance testing to NLP. In Chapter 4, we assume that two algorithms are compared on a single dataset, based on a single output that each of them produces, and discuss the relevant significance tests for various NLP tasks and evaluation measures. The chapter emphasizes the aspects in which NLP tasks and data differ from common examples in the statistical literature, such as the non-Gaussian distribution of the data and the dependence between the participating examples (e.g., sentences in the same corpus). This chapter, which extends our ACL 2018 paper [Dror et al., 2018], provides our recommended matching between NLP tasks, with their evaluation measures, and statistical significance tests.
The next two chapters relax two of the basic assumptions of Chapter 4: (a) that each of the compared algorithms produces a single output for each test example (e.g., a single parse tree for a given input sentence), and (b) that the comparison between the two algorithms is performed on a single dataset. Particularly, Chapter 5 addresses the comparison between two algorithms based on multiple solutions that each of them produces for a single dataset, while Chapter 6 addresses the comparison between two algorithms across several datasets.
The first challenge stems from the recent emergence of Deep Neural Networks (DNNs), which has made data-driven performance comparison much more complicated. This is because these models are non-deterministic due to their non-convex objective functions, their complex hyperparameter tuning process, and training heuristics, such as random dropout, that are often applied in their implementation. Chapter 5, therefore, defines a framework for a statistically valid comparison between two DNNs based on multiple solutions each of them produces for a given dataset. The chapter summarizes previous attempts in the NLP literature to perform this comparison task and evaluates them in light of the proposed framework. Then, it presents a new comparison method that is better fitted to the pre-defined framework. This chapter is based on our ACL 2019 paper [Dror et al., 2019].
The second challenge is crucial for the efforts to extend the reach of NLP technology to multiple domains and languages. These well-justified efforts result in a large number of comparisons between algorithms, across corpora from a large number of languages and domains. The goal of Chapter 6 is to provide the NLP community with a statistical analysis framework, termed Replicability Analysis, which will allow us to draw statistically sound conclusions in evaluation setups that involve multiple comparisons. The classical goal of replicability analysis is to examine the consistency of findings across studies in order to address the basic dogma of science, namely that a finding is more convincingly true if it is replicated in at least one more study [Heller et al., 2014, Patil et al., 2016]. We adapt this goal to NLP, where we wish to ascertain the superiority of one algorithm over another across multiple datasets, which may come from different languages, domains, and genres. This chapter is based on our TACL paper [Dror et al., 2017].
Finally, while this book aims to provide a basic framework for proper statistical significance testing in NLP research, it is by no means the final word on this topic. Indeed, Chapter 7 presents a list of open questions that are still to be addressed in future research. We hope that this book will contribute to the evaluation practices in our community and eventually to the development of more effective NLP technology.
INTENDED READERSHIP
The book is intended for researchers and practitioners in NLP who would like to analyze their experimental results in a statistically sound manner. Hence, we assume technical background in computer science and related areas such as statistics and probability, mostly at the undergraduate level. Moreover, while in Chapter 4 we discuss various NLP tasks and their proposed significance tests, our discussion of these tasks is quite shallow. Furthermore, when we analyze experimental results on NLP tasks in Chapters 5 and 6, we do not provide the details of the tasks because we assume the reader is familiar with the basic tasks of NLP. Despite these assumptions about the reader’s background, we try to be as self-contained as possible when it comes to statistical hypothesis testing and the derived concepts and methodology, as presenting these ideas to the NLP audience is a core objective of this book.
Further Reading. For broader and more in-depth reading on the fundamental concepts of statistics, we refer the reader to other existing resources such as Montgomery and Runger [2007] (which provides an engineering perspective) and Johnson and Bhattacharyya [2019]. For further reading on the topic of multiple comparisons in statistics, we recommend the book by Bretz et al. [2016], which demonstrates the basic concepts and provides examples with R code.
This book evolved from a series of conference and journal papers (Dror et al. [2017], Dror et al. [2018], Dror et al. [2019]), which have been greatly expanded in order to form this book. First, we added background chapters that discuss the foundations of statistical hypothesis testing and provide the details of the statistical significance tests that we find most relevant for NLP. Then, we took the handbook approach and provided the pseudocode of the various methods discussed throughout the book, along with concrete recommendations and guidelines; our goal is to allow the practitioner to directly and easily implement the methods described in this book. Finally, in Chapter 7, we critically discuss the ideas presented in this book and point to challenges that are yet to be addressed in order to perform statistically sound analysis of NLP experimental results.
FOCUS OF THIS BOOK
This book is intended to be self-contained, presenting the framework of statistical hypothesis testing and its derived concepts and methodology in the context of NLP research. However, the main focus of the book is on this statistical framework and its application to the analysis of NLP experimental results, rather than on providing in-depth coverage of the NLP field.
Most of the book takes the handbook approach and aims to provide concrete solutions to practical problems. As such, it does not provide in-depth technical coverage of statistical hypothesis testing to a level that will allow the reader to propose alternative solutions to those proposed here, or to solve some of the open challenges we point to. Yet, our hope is that highlighting the challenges of statistically sound evaluation of NLP experiments, both those that already have decent solutions and those that are still open, will attract the attention of the community to these issues and facilitate future development of additional methods and techniques.
Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart
April 2020