Introduction to Corpus Linguistics - Sandrine Zufferey

1.3. Chomsky’s arguments against empiricism in linguistics

Although corpus linguistics has grown strongly over the past 20 years, the empirical grounding of linguistics is not new. Linguists have long used observational data. In the 19th century, for example, linguists compared Indo-European languages in an attempt to reconstruct their common origin. This research was based on existing data about languages spoken in Europe, such as German, French and English. Similarly, in the first half of the 20th century in the United States, the so-called distributionist approach to syntax studied how sentences were formed in syntactic structures as they appeared in text corpora and, from there, tried to infer how language functions in general. Around the late 1950s, the use of corpora in linguistics was almost completely interrupted in certain fields such as syntax, following the work of the American linguist Noam Chomsky. Chomsky defended a strictly rationalist methodological approach to linguistics and fiercely opposed any use of external data. He raised numerous objections against the use of such data in linguistics. We will briefly review them, to show in what ways most of them have lost their raison d’être in the context of current research.

Chomsky’s first objection to the use of corpora, which is also the most fundamental one, is that corpora contain language samples produced by speakers. According to him, linguistics should not focus on the linguistic performance of speakers, but on the competence they have in their mother tongue, something he calls their internal language. The problem is the following. When people speak, what they produce (their performance) does not necessarily reflect what they know about their language (their competence). For example, under the effect of stress or fatigue, speakers sometimes produce slips of the tongue or make language mistakes. From time to time, almost everybody misconjugates an irregular verb and mistakenly produces the form “he eated” instead of “he ate”. However, if the speakers who produced this wrong form were recorded and then asked whether they thought they had spoken correctly, we can be almost sure that they would realize their mistake and would be able to state the correct form, “he ate”. Conversely, a speaker could pronounce a word like “serendipity” after having heard it from somebody else, but without really knowing its meaning. These examples illustrate the fact that the words speakers utter are not always a true reflection of their linguistic competence. For this reason, according to Chomsky, studying corpora puts linguists on the wrong track, because it leads them to consider language from the point of view of production, which is merely a biased reflection of the rules of language.

According to Chomsky, another problem with corpus linguistics stems from the fact that corpora are not representative of the language as a whole. He illustrates this problem in an extreme way, by picking the case of an aphasic speaker recorded in a corpus. Linguists analyzing this corpus would draw totally incorrect conclusions about the language in question, since this person does not represent the linguistic competence of a typical speaker. Furthermore, even if no atypical speaker were included, a corpus could never represent more than a tiny sample of all the oral and written productions in any language. For this same reason, it is impossible to conclude that a word does not exist in a language just because it is absent from a corpus. It may simply never have been used in the particular contexts sampled, while it could exist in other language registers or have been used by other speakers not included in the corpus. This problem is particularly acute in the case of rare linguistic phenomena, such as infrequent words or little-used linguistic structures.

This limitation leads to Chomsky’s third criticism of corpora, namely that a corpus can never contain the whole of a language and that, therefore, the above-mentioned biases cannot be resolved. According to him, this problem is all the more serious because even if a corpus were very large and included a representative portion of the language, linguists could not analyze it fully, since it is impossible to analyze the content of billions of sentences manually.

Chomsky’s last two objections have largely become obsolete thanks to advances in computer science. The size of corpora has increased exponentially over the past 20 years, and corpus analysis tools have also made considerable progress. It has thus become possible to analyze very large amounts of data, which mirror the language much more accurately than the data available when Chomsky formulated his objections. We will return to this in section 1.4, devoted to the connections between computer science and corpus linguistics. In addition to these technological advances, theoretical and methodological advances have also largely made it possible to eliminate or control the other types of biases mentioned by Chomsky. For example, good practice for building a corpus is to accurately document the type of language it contains. This helps to avoid the situation Chomsky describes, in which the language of a single aphasic subject is analyzed by mistake. It is nonetheless true that a corpus can only show what it contains, and therefore the absence of a word or a structure from a corpus cannot constitute definitive proof of its absence from the language. Thus, for certain research questions relating to phenomena that are rare or hard to observe in a corpus, it may be advisable to complement corpus research with another empirical method, namely the experimental method. As we will see later in this chapter, this method shares the use of a quantitative methodology with corpus linguistics.
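The point about rare phenomena can be illustrated with a little arithmetic. The sketch below is only a rough illustration and rests on a strong simplifying assumption that is ours, not Chomsky’s: each token in a corpus is treated as an independent draw, so a word with relative frequency p is absent from a corpus of N tokens with probability (1 - p)^N. Real texts do not behave this way, since words cluster by topic and register. The function name `prob_absent` and the frequencies used are hypothetical, chosen for illustration only.

```python
import math


def prob_absent(rel_freq: float, corpus_size: int) -> float:
    """Probability that a word never occurs in a corpus of corpus_size tokens.

    Assumes (unrealistically) that tokens are independent draws and that the
    word has a fixed relative frequency rel_freq, with 0 <= rel_freq < 1.
    """
    if not 0.0 <= rel_freq < 1.0:
        raise ValueError("rel_freq must satisfy 0 <= rel_freq < 1")
    if corpus_size < 0:
        raise ValueError("corpus_size must be non-negative")
    # (1 - p) ** N, computed through logarithms for numerical stability.
    return math.exp(corpus_size * math.log1p(-rel_freq))


if __name__ == "__main__":
    # Hypothetical frequencies, expressed as occurrences per million tokens.
    for corpus_size in (1_000_000, 100_000_000):
        for per_million in (10.0, 1.0, 0.01):
            p = prob_absent(per_million / 1_000_000, corpus_size)
            print(f"{corpus_size:>11,} tokens, {per_million:>5} per million: "
                  f"P(absent) = {p:.4f}")
```

Under this assumption, enlarging the corpus quickly makes it unlikely that a moderately frequent word is missed, whereas a very rare word can remain absent from even a large corpus purely by chance. This is why absence from a corpus does not establish absence from the language.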

In conclusion, we should point out that the rationalist method suggested by Chomsky also comes with biases and limitations which are not negligible and which can be corrected by the use of empirical methods. In particular, this method leaves a lot of room for the subjectivity of linguists, while it overestimates the linguistic skills of speakers. Indeed, the use of grammaticality judgments presupposes that all speakers have a definite and consistent intuition regarding all the sentences in their mother tongue. However, such is not the case. While all English speakers agree that a sentence like “Mary dog her walks” is incorrect in English, whereas the sentence “Mary walks her dog” is correct, judgments will not be so unanimous in the case of complex sentences, such as the one mentioned above: “When do you think he will prepare which cake?”. These divergences become problematic as soon as such judgments are used for building a linguistic theory. What is more, while it is likely that many English speakers would reject a sentence such as “He does be working” as grammatically incorrect, in certain areas of the English-speaking world (such as Ireland), this sentence would be acceptable. By drawing on many different speakers from different geographical areas and including them in reference corpora, corpus linguistics makes it possible to respond to this problem in a much more satisfactory way.

What is more, in many areas of linguistics such as lexicology, language acquisition and sociolinguistics, the idea of relying on the internal judgments of linguists is simply not conceivable. No one can study children’s language by remembering how he or she spoke as a child, or make assumptions about language differences between men and women by imagining how he or she would speak as a member of the other sex. In all these fields, the need for text corpora has long been obvious, and the use of corpora was never interrupted as a result of Chomsky’s work. The paradigm shift in recent decades has taken place in areas where it is conceivable to use a purely rationalist methodology, for example syntax.

Finally, it is important to remember that linguistic theory and the intuition of researchers still play a role in most corpus studies. Indeed, a majority of linguists consider corpus studies as a tool that makes it possible to validate or invalidate hypotheses on language, formulated in advance on the basis of the scientific literature and their linguistic intuitions. We will see many examples of this approach (empirical validation) throughout this book. This corpus-based research approach is opposed to an approach which considers corpus data as the only point of reference, both in a theoretical and a methodological sense. In this second approach, linguists begin their research without a priori assumptions and simply let hypotheses emerge from corpus data (this is called a corpus-driven approach). This approach is far from being unanimously accepted among linguists working with an empirical methodology. On this point, we agree with Chomsky, who argued by way of a metaphor that doing linguistics in this way would be like physicists hoping to discover the physical laws of the universe by looking out of their window. Observing data without a hypothesis often makes it impossible to make sense of the data. It is for this reason that the approach we will adopt in this book is a corpus-based one, in which corpora are considered as tools that linguists can use to test their hypotheses.
