Introduction to Corpus Linguistics - Sandrine Zufferey

1.7. Different types of corpora

As we will see in the following chapters, corpora represent linguistic samples of a very varied nature, and it is precisely this variety that makes it possible to answer diverse research questions in all fields of linguistics. In this last section, we will introduce a first classification of the types of existing corpora, in order to be able to refer back to it in the following chapters.

The first distinction we can make among existing corpora is the one between sample corpora and monitor corpora. Sample corpora are those in which data have been collected once and for all, and which no longer evolve thereafter. For this reason, they are also known as closed corpora in the specialized literature. The advantage of these corpora is that they have been designed to contain a set of texts representative of the language, or of the part of the language to be studied, with, for example, a balanced representation of the different text genres. These corpora thus make it possible to draw conclusions that can be generalized. Their main drawback, however, is that they age quickly and do not follow changes in the language. Sample corpora therefore need to be compiled anew at regular intervals.

Monitor corpora, by contrast, are never finished and constantly integrate new elements, which is why they are described as open corpora in the literature. A typical example of this type of data is a corpus that contains newspaper archives or parliamentary debates: every year, the amount of available data increases. For this reason, it is difficult to maintain a perfect balance between the different parts of these corpora, and their representativeness cannot be fully guaranteed. We will return to the problem of representativeness in Chapter 6. In return, these corpora remain up to date. When they span a period of a few decades, they make it possible to observe the appearance of certain changes in language.

The second major distinction to be made among existing corpora differentiates general language corpora from specialized language corpora. General language corpora aim to offer a panorama of the whole of a language at a given time. It is obviously impossible to collect a sample of the whole language, but in the same way that a general language dictionary aims to describe the common lexicon of a language, the general corpus seeks to offer a global image, including the main textual genres found in the language. These corpora are very valuable when it comes to studying a language as a whole, but they cannot offer precise answers about linguistic phenomena found in specific means of communication, such as mobile texting, social media, medical reports, etc.

In order to study one of these areas specifically, it is preferable to resort to a specialized corpus. There are, in fact, corpora especially devoted to texting, social media, etc. In addition, general corpora include productions by adults who are native speakers of the language represented. Other corpora specialize in representing other population categories, such as monolingual children in the process of acquiring their mother tongue, bilingual children, foreign-language learners, or children with neuro-developmental disorders that affect language acquisition, such as autism and specific language impairment. Finally, by default, a general corpus includes examples of the variety considered to be the standard form of the language, or one of its main varieties. For French, this generally means the French of France and, more precisely, of the Parisian region. For English, general corpora can represent British English or American English. Conversely, some corpora specialize in the productions of speakers of a particular language variety, such as the French of French-speaking Switzerland, Belgium, Canada, etc.

General or specialized language corpora can contain either written language or spoken language samples. For a long time, written language corpora were the norm, but the analysis of spoken language has developed considerably since the 2000s. Corpora of spoken language are typically smaller than written language ones. Recording voices is easy, but carrying out searches directly on an audio file is difficult, and speech recognition software does not always produce fully reliable automatic transcriptions. For this reason, oral data must be transcribed manually, which often limits the size of spoken corpora. More recently, audio-visual recording corpora (also called “multimodal” corpora) have been created in order to facilitate, for instance, the study of gestures and facial expressions as well as their role in communication. These corpora still pose many coding and interpretation challenges. Finally, let us point out that video corpora are also used for the study of sign language.

Another distinction that can be made regarding the types of existing corpora relates to the type of processing carried out on the linguistic data of the corpus. On the one hand, raw corpora contain nothing but language samples. This is the case for the majority of French corpora. On the other hand, annotated corpora contain specific linguistic information in addition to the language samples. The most common type of annotation is the assignment of a grammatical category to each word in the corpus, as we have already mentioned. More rarely, certain corpora contain a syntactic analysis of all of their sentences, as well as other types of information, such as an annotation of the discourse relations (cause, condition, etc.) that link the sentences of a text. Finally, certain corpora compiled with the aim of studying phonological phenomena may be transcribed using the International Phonetic Alphabet.
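To make the difference between a raw and an annotated corpus concrete, here is a minimal sketch in Python. It is only an illustration: the sentence, the category labels (DET, NOUN, VERB, PREP), the `Token` type and the two helper functions are all invented for this example and do not correspond to any particular corpus or annotation scheme.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """One word of an annotated corpus (hypothetical structure)."""
    form: str  # the word as it appears in the text
    pos: str   # its grammatical category (illustrative labels only)


# A raw corpus stores nothing but the language sample itself.
raw_sentence = "the cat sleeps on the mat"

# The same sample with a grammatical category attached to each word.
# The categories were assigned by hand for the purpose of this example.
annotated_sentence = [
    Token("the", "DET"),
    Token("cat", "NOUN"),
    Token("sleeps", "VERB"),
    Token("on", "PREP"),
    Token("the", "DET"),
    Token("mat", "NOUN"),
]


def count_by_category(tokens):
    """Count how many tokens carry each grammatical category."""
    return Counter(token.pos for token in tokens)


def forms_with_category(tokens, pos):
    """Return, in text order, the word forms annotated with a given category."""
    return [token.form for token in tokens if token.pos == pos]


print(count_by_category(annotated_sentence))
print(forms_with_category(annotated_sentence, "NOUN"))
```

A query such as "all the nouns in the sample" can be answered directly on the annotated version, whereas the raw version would first have to be analyzed.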

So far, all the types of corpora we have considered are monolingual. Another distinction that we can make is to differentiate these corpora from multilingual corpora. There are two types of multilingual corpora. On the one hand, we have comparable corpora, which contain similar samples produced by native speakers in two or more languages. For example, it is possible to build a comparable corpus of parliamentary debates in France and the UK. Such a corpus would make it possible to compare the ways of speaking in a similar context in two languages and two different cultures. On the other hand, so-called parallel corpora contain texts produced in one language and their translation into one or more other languages. These corpora make it possible to study the linguistic correspondences between languages, as well as the linguistic phenomena linked to the translation process. Parallel corpora can also be annotated with exact matches between sentences. This process is called alignment and gives rise to so-called aligned corpora.
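As an illustration of what alignment makes possible, the following Python sketch stores a tiny aligned corpus as a list of sentence pairs. The sentences, the `AlignedPair` type and the `translations_of` function are invented for this example, and the word matching is deliberately naive (whitespace splitting and punctuation stripping); real aligned corpora are far larger and use dedicated formats and tools.

```python
from typing import NamedTuple


class AlignedPair(NamedTuple):
    """One source sentence matched with its translation (hypothetical structure)."""
    source: str
    target: str


# A toy English-French aligned corpus; the sentences are invented.
parallel_corpus = [
    AlignedPair("The session is open.", "La séance est ouverte."),
    AlignedPair("We will now vote.", "Nous allons maintenant voter."),
    AlignedPair("The session is closed.", "La séance est levée."),
]


def translations_of(corpus, word):
    """Return the target sentences aligned with source sentences containing `word`."""
    needle = word.lower()
    matches = []
    for pair in corpus:
        # Naive tokenization: split on spaces and strip common punctuation.
        words = [w.strip(".,;:!?").lower() for w in pair.source.split()]
        if needle in words:
            matches.append(pair.target)
    return matches


for sentence in translations_of(parallel_corpus, "session"):
    print(sentence)
```

Because each source sentence is explicitly matched with its translation, a search for a word in one language immediately returns the contexts in which it was translated in the other.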

Finally, many corpora are drawn from contemporary written or spoken data. However, there are archives that make it possible to study the history of a language, going back to Old French, for example. Contemporary corpora are used for studying language in a synchronic way, that is, at a given moment in its evolution, whereas historical corpora make it possible to carry out studies from a diachronic point of view, that is, on the evolution of language over time.
