Introduction to Corpus Linguistics - Sandrine Zufferey

1.4. Corpus linguistics and computer tools

As we have seen above, corpus linguistics, as practiced today, cannot do without computers. Work related to corpus linguistics has existed for a long time (such as the indexing of the Bible by theologians or the file-based construction of dictionaries by scholars like Antoine Furetière in French or Samuel Johnson in English), but the discipline was only able to take off properly after the arrival of computing.

Corpus linguistics depends on computer science for various reasons. The first one, which we have already mentioned above, is the need for computerized texts in order to carry out truly quantitative research. Nevertheless, looking for elements in a corpus, even a computerized one, with a simple word processor is rather inconvenient. Going back to the example of the search for terms related to love in Flaubert, which we discussed earlier, we find that the search function of a typical word processor quickly reaches its limits. First of all, in order to verify that all the occurrences found when looking for the verb to love correspond to expressions of love as a feeling rather than to modal uses, as in the phrase “I would love it if you kept quiet”, it is necessary to examine each occurrence and thus browse the entire text. Second, to find all the occurrences of the verb to love, it is necessary to perform a separate search for each verbal form, for example love, loved, etc. It is for this reason that other computing tools, specifically devoted to corpus linguistics, have been developed.
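To illustrate the second limitation, here is a minimal sketch, in Python, of a single query that retrieves several forms of the verb to love at once. The sample text, the pattern and the function name `find_forms` are our own assumptions for the purpose of illustration, and the list of forms is not exhaustive.

```python
import re

# Hypothetical sample text (not taken from Flaubert).
text = "She loved him. He loves her. They love to talk. Lovely gloves!"

# One pattern covering several inflected forms of the verb "to love".
# The \b word boundaries keep "Lovely" and "gloves" out of the results.
LOVE_FORMS = re.compile(r"\b(?:love|loves|loved|loving)\b", re.IGNORECASE)

def find_forms(source):
    """Return every matched verb form, in order of appearance."""
    return LOVE_FORMS.findall(source)

print(find_forms(text))
```

Such a query only solves the problem of multiple forms. Deciding whether each occurrence expresses love as a feeling still requires reading the occurrences one by one.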

In particular, concordancers are useful for retrieving, in a single query, all the occurrences of a word together with their context of use, and for displaying the results line by line. These tools also make it possible to establish the list of words contained in the corpus, together with their frequency, and to generate a list of keywords characteristic of the content of a corpus. In the case of corpora containing texts as well as their translations, tools called aligners make it possible to align the content of the corpus sentence by sentence. Once this has been done, bilingual concordancers search directly for the occurrences of a word in one of the two languages of the corpus and simultaneously extract the matching sentence in the other language. We will learn how to use these tools in Chapter 5, which is devoted to the presentation of the main French corpora, as well as the tools for analyzing them.
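The three operations described above (a concordance, a frequency list and a bilingual search) can be sketched in a few lines of Python. This is only a toy illustration under our own assumptions: the tokenization rule is deliberately crude, the sample sentences are invented, the sentence pairs are assumed to be aligned already, and the function names are hypothetical. Real concordancers offer far more options.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"\w+", text.lower())

def concordance(tokens, keyword, window=2):
    """Return one line per occurrence: left context, [keyword], right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def bilingual_search(pairs, keyword):
    """Return the aligned (source, translation) pairs whose source contains keyword."""
    return [(src, tgt) for src, tgt in pairs if keyword in tokenize(src)]

# Invented monolingual sample and invented, already aligned sentence pairs.
sample = "The cat sat on the mat. The dog saw the cat."
aligned = [("The cat sleeps.", "Le chat dort."),
           ("The dog barks.", "Le chien aboie.")]

tokens = tokenize(sample)
for line in concordance(tokens, "cat"):
    print(line)
print(frequency_list(tokens)[:2])
print(bilingual_search(aligned, "cat"))
```

The `window` parameter sets how many words of context are shown on each side of the keyword. Here the bilingual search simply filters the sentence pairs; the alignment itself is taken as given.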

Then, in Chapter 7, we will also see that in order to answer certain research questions, it is necessary to annotate the content of a corpus. For example, let us imagine that we wish to study the different contexts in which we can use the causal adverb since. If we only look up the word since in the corpus, we will also find occurrences which do not correspond to the use of this word as a causal adverb, but to its use as a preposition, for example in “I haven’t seen Mary since Christmas”. So, to retrieve only the uses of since that we are interested in, we should keep the occurrences which are adverbs and exclude those which are prepositions. This search can be greatly simplified if the corpus has been annotated by determining, for each word, its grammatical category. This operation, called part-of-speech tagging, can be performed automatically by dedicated software.
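As a minimal sketch of how such an annotation simplifies the search, let us assume a corpus in which each word is stored together with a grammatical category. The two sentences, the tag labels (`ADV`, `PREP`, etc.) and the function name `find_by_tag` are our own hand-written assumptions that follow the categories used in this section. Actual taggers use their own tagsets and may label these uses differently.

```python
# Hypothetical hand-annotated corpus: each sentence is a list of (word, tag) pairs.
# The tag labels are illustrative only and do not come from a real tagset.
tagged_corpus = [
    [("I", "PRON"), ("stayed", "VERB"), ("home", "NOUN"),
     ("since", "ADV"), ("it", "PRON"), ("rained", "VERB")],
    [("I", "PRON"), ("haven't", "VERB"), ("seen", "VERB"),
     ("Mary", "PROPN"), ("since", "PREP"), ("Christmas", "PROPN")],
]

def find_by_tag(corpus, word, tag):
    """Return the sentences in which `word` occurs with the given tag."""
    hits = []
    for sentence in corpus:
        if any(w.lower() == word and t == tag for w, t in sentence):
            hits.append(" ".join(w for w, _ in sentence))
    return hits

print(find_by_tag(tagged_corpus, "since", "ADV"))
print(find_by_tag(tagged_corpus, "since", "PREP"))
```

Searching on the pair (word, tag) rather than on the word alone keeps the causal use and discards the prepositional one without any manual sorting.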

Another problem might arise if we decide to study the use of relative clauses such as “the girl who is intelligent” or “the violin which was left on the bus”. For this study, a good starting point would be to look for relative pronouns such as who or which in order to find occurrences of relative clauses in the corpus. The problem is that these pronouns are also used in interrogative sentences such as “Who do you prefer?” or “Which hat is yours?” In this case, looking for the grammatical category of the word will not solve the problem, because the relative and the interrogative forms are both pronouns. In order to find only the occurrences of who and which as relative pronouns, we should use a corpus in which the syntactic structure of each sentence has been analyzed in such a way that we can assign a grammatical function to each word and group the words into syntactic constituents. Tools for analyzing the syntactic structure of sentences have also been developed in the context of work on natural language processing. These automatic analyses still require human checking in order to catch errors, but their performance is continually improving. The arrival of these tools has greatly accelerated research in corpus linguistics. We will discuss this issue in Chapter 7, which is devoted to annotations.
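The following sketch shows how a syntactic annotation could separate the two uses. We assume, for illustration only, that each word is stored with the word it depends on and with the name of the relation between them. The annotations below are written by hand, the relation labels are loosely modeled on the Universal Dependencies conventions, and the function name `is_relative_use` is hypothetical.

```python
# Each token: (id, word, head_id, relation); head_id 0 marks the root of the sentence.
# Hand-written toy annotations, not the output of a real parser.
relative = [
    (1, "the", 2, "det"),
    (2, "girl", 0, "root"),
    (3, "who", 5, "nsubj"),
    (4, "is", 5, "cop"),
    (5, "intelligent", 2, "acl:relcl"),  # head of the relative clause
]
interrogative = [
    (1, "Who", 4, "obj"),
    (2, "do", 4, "aux"),
    (3, "you", 4, "nsubj"),
    (4, "prefer", 0, "root"),
]

def is_relative_use(sentence, pronouns=("who", "which")):
    """True if one of the pronouns depends on the head of a relative clause."""
    relation_of = {tok_id: rel for tok_id, _, _, rel in sentence}
    return any(
        word.lower() in pronouns and relation_of.get(head) == "acl:relcl"
        for _, word, head, _ in sentence
    )

print(is_relative_use(relative))
print(is_relative_use(interrogative))
```

The part of speech of who is the same in both sentences. Only the relation carried by the word it depends on tells the relative use apart from the interrogative one.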

But corpus linguistics did not develop only thanks to the creation of such tools. Above all, it is the general development of computers and the digital revolution which have made the greatest advances possible. In fact, the increase in the computing power of machines, as well as in their memory, has made it possible to build ever larger corpora. Until the 1980s, a corpus of a million words was considered to be a very large corpus. For instance, the first reference corpora (such as the Brown corpus developed for American English in the early 1960s) were about this size. At the same time, the arrival of cassette recorders on the market enabled the creation of the first oral corpora containing an exact transcription of speech, rather than a summary taken down in shorthand.

The marketing of scanners in the 1980s later made it possible to digitize a significant amount of data, and corpora began to reach larger sizes, up to 20 million words. Then, with the democratization of computer use, the growing amount of digitally disseminated text greatly accelerated the growth of corpora. Finally, since the beginning of the 21st century, the wide dissemination of documents online via the Internet has given another dimension to the size of corpora available to researchers. At present, the Google Books corpus, for example, contains more than 500 billion words, which represents approximately 4% of all the books ever published (Michel et al. 2011). We will discuss the possible uses of such a corpus in the following chapters. In Chapter 6, we will also see that the Internet potentially offers an exceptional data resource for corpus linguistics, but that data collected from the Internet cannot be used without an additional processing step if we are to guarantee data quality.
