Before the texts to be included in a corpus are collected, annotated, and analyzed, it is important to plan the construction of the corpus carefully: what size it will be, what types of texts will be included in it, and what population will be sampled to supply the texts that will comprise the corpus. Ultimately, decisions concerning the composition of a corpus will be determined by the planned uses of the corpus. If, for instance, the corpus is to be used primarily for grammatical analysis (e.g. the analysis of relative clauses or the structure of noun phrases), it can consist simply of text excerpts rather than complete texts. On the other hand, if the corpus is intended to permit the study of discourse features, then it will have to contain complete texts.
Deciding how long the text samples within a corpus should be is but one of the many methodological considerations that must be addressed before one begins collecting data for inclusion in a corpus. To explore the process of planning a corpus, this chapter will consider the methodological assumptions that guided the compilation of the British National Corpus. Examining the British National Corpus reveals how current corpus planners have overcome the methodological deficiencies of earlier corpora, and raises more general methodological considerations that anyone planning to create a corpus must address.
The British National Corpus
At approximately 100 million words, the British National Corpus (BNC) (see table 2.1) is one of the largest corpora ever created.
When someone is referred to as a “corpus linguist,” it is tempting to think of this individual as studying language within a particular linguistic paradigm, corpus linguistics, on par with other paradigms within linguistics, such as sociolinguistics or psycholinguistics. However, if the types of linguistic analyses that corpus linguists conduct are examined, it becomes quite evident that corpus linguistics is more a way of doing linguistics, “a methodological basis for pursuing linguistic research” (Leech 1992: 105), than a separate paradigm within linguistics.
To understand why corpus linguistics is a methodology, it is first of all necessary to examine the main object of inquiry for the corpus linguist: the linguistic corpus. Most corpus linguists conduct their analyses giving little thought to what a corpus actually is. But defining a corpus is a more interesting question than one might think. A recent posting on the “Corpora” list inquired about the availability of an online corpus of proverbs (Maniez 2000). This message led to an extensive discussion of how a corpus should be defined. Could something as specific as a computerized collection of proverbs be considered a corpus, or would the body of texts from which the proverbs were taken be the corpus, and the proverbs themselves the result of a corpus analysis of those texts?
The answer to this question depends crucially on how broadly one wishes to define a corpus.
For a corpus to be fully useful to potential users, it needs to be annotated. There are three types of annotation, or “markup,” that can be inserted in a corpus: “structural” markup, “part-of-speech” markup, and “grammatical” markup.
Structural markup provides descriptive information about the texts. For instance, general information about a text can be included in a “file header,” which is placed at the start of a text, and can contain such information as a complete bibliographic citation for a written text, or ethnographic information about the participants (e.g. their age and gender) in a spoken dialogue. Within the actual spoken and written texts themselves, additional structural markup can be included to indicate, for instance, paragraph boundaries in written texts or overlapping segments of speech in spoken texts. Part-of-speech markup is inserted by a software program called a “tagger” that automatically assigns a part-of-speech designation (e.g. noun, verb) to every word in a corpus. Grammatical markup is inserted by a software program called a “parser” that assigns labels to grammatical structures beyond the level of the word (e.g. phrases, clauses).
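To make the distinction between structural and part-of-speech markup concrete, the following minimal sketch (in Python, using the NLTK library) shows a tagger assigning part-of-speech labels and simple XML-style structural markup being wrapped around the result. The sentence and the element names are invented for illustration, and NLTK’s default tagger uses the Penn Treebank tagset rather than the tagsets of corpora such as the Brown Corpus or the BNC:

    import nltk

    # One-time downloads of NLTK's tokenizer and tagger models
    # (resource names vary slightly across NLTK versions):
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentence = "The corpus was annotated automatically."
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # e.g. [('The', 'DT'), ('corpus', 'NN'), ...]

    # Wrap each tagged word in illustrative XML-style structural markup;
    # the <s> and <w> element names are placeholders for this sketch,
    # not those of any actual markup standard such as TEI:
    words = " ".join(f'<w pos="{tag}">{word}</w>' for word, tag in tagged)
    print(f"<s>{words}</s>")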
This chapter focuses on the process of annotating texts with these three kinds of markup. The first section discusses why it is necessary for corpora to be annotated with structural markup and then provides an overview of the various systems of structural markup that have evolved over the years as well as the various tools that have been developed for inserting markup in corpora.
When the first computer corpus, the Brown Corpus, was being created in the early 1960s, generative grammar dominated linguistics, and there was little tolerance for approaches to linguistic study that did not adhere to what generative grammarians deemed acceptable linguistic practice. As a consequence, even though the creators of the Brown Corpus, W. Nelson Francis and Henry Kučera, are now regarded as pioneers and visionaries in the corpus linguistics community, in the 1960s their efforts to create a machine-readable corpus of English were not warmly accepted by many members of the linguistic community. W. Nelson Francis (1992: 28) tells the story of a leading generative grammarian of the time characterizing the creation of the Brown Corpus as “a useless and foolhardy enterprise” because “the only legitimate source of grammatical knowledge” about a language was the intuitions of the native speaker, which could not be obtained from a corpus. Although some linguists still hold to this belief, linguists of all persuasions are now far more open to the idea of using linguistic corpora for both descriptive and theoretical studies of language. Moreover, the division and divisiveness that have characterized the relationship between the corpus linguist and the generative grammarian rest on a false assumption: that all corpus linguists are descriptivists, interested only in counting and categorizing constructions occurring in a corpus, and that all generative grammarians are theoreticians unconcerned with the data on which their theories are based.
Hungarian is a Finno-Ugric language. The Finno-Ugric languages and the practically extinct Samoyed languages of Siberia constitute the Uralic language family. Within the Finno-Ugric family, Hungarian belongs to the Ugric branch, together with Mansi, or Vogul, and Khanty, or Ostyak, spoken by a few thousand people in western Siberia. The family also has:
a Finnic branch, including Finnish (5 million speakers) and Estonian (1 million speakers);
a Sami or Lappish branch (35 000 speakers);
a Mordvin branch, consisting of Erzya (500 000 speakers) and Moksha (250 000 speakers);
a Mari or Cheremis branch (550 000 speakers); and
a Permi branch, consisting of Udmurt or Votyak (500 000 speakers) and Komi or Zyrian (350 000 speakers).
Hungarian, Finnish and Estonian are state languages; Sami is spoken in northern Norway, Sweden and Finland, whereas the Mordvin, Mari, and Permi languages are spoken in the European territories of Russia.
Hungarian is spoken in Central Europe. It is the state language of Hungary, but the area where it is a native language also extends to the neighboring countries. It has 10 million speakers in Hungary, 2 million in Romania, 700 000 in Slovakia, 300 000 in Yugoslavia, and 150 000 in Ukraine. There are also Hungarian minorities in Croatia, Slovenia, and Austria, and a considerable diaspora in Western Europe, North America, South America, Israel, and Australia.
The inner structures of the various types of verb complements resemble the inner structure of the extended verb phrase in their outlines: they consist of a lexical phrase embedded in morphosyntactic projections and operator projections. The noun phrase, too, will be analyzed as a complex containing a lexical kernel (the NP proper) subsumed by operator projections extending it into an indefinite numeral phrase (NumP) and/or a definite determiner phrase (DP). The full range of morphosyntactic projections that play a role in the noun phrase will become evident in the analysis of the possessive construction.
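Schematically, and simplifying considerably, these layers can be shown with a labeled bracketing; the Hungarian phrase used here is added purely for illustration. A phrase such as a hat kalap ‘the six hats’ can be bracketed as [DP a [NumP hat [NP kalap]]], with the bare NP kalap extended first into a numeral phrase by the numeral hat ‘six’ and then into a definite determiner phrase by the article a ‘the’.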
The basic syntactic layers of the noun phrase
The minimal noun projection appearing in the Hungarian sentence is a bare singular case-marked noun; for example:
(1) a. János könyvet olvas.
John book-ACC reads
‘John is book-reading.’
b. János moziba ment.
John cinema-to went
‘John went to (the) cinema.’
In the unmarked case such nouns function as verb modifiers. In Section 3.6 verb modifiers were shown to display both phrase-like and head-like properties: they move into Spec, AspP like a phrase, and merge with the V in Asp like a head, which suggests that they are both minimal and maximal, i.e., they are phrases containing merely a head.
Once the basic outlines of a corpus are determined, it is time to begin the actual creation of the corpus. This is a three-part process, involving the collection, computerization, and annotation of data. This chapter will focus on the first two parts of this process – how to collect and computerize data. The next chapter will focus in detail on the last part of the process: the annotation of a corpus once it has been encoded into computer-readable form.
Collecting data involves recording speech, gathering written texts, obtaining permission from speakers and writers to use their texts, and keeping careful records about the texts collected and the individuals from whom they were obtained. How these collected data are computerized depends upon whether the data are spoken or written. Recordings of speech need to be manually transcribed using either a special cassette tape recorder that can automatically replay segments of a recording, or software that can do the equivalent with a sample of speech that has been converted into digital form. Written texts that are not available in electronic form can be computerized with an optical scanner and accompanying OCR (optical character recognition) software, or (less desirably) they can be retyped manually.
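To make the scanning route concrete, here is a minimal sketch in Python, assuming the third-party pytesseract and Pillow packages and a local installation of the Tesseract OCR engine; the file names are placeholders invented for this example:

    from PIL import Image
    import pytesseract

    # Open a scanned page image and run OCR over it
    # (file names here are placeholders):
    page = Image.open("scanned_page.png")
    text = pytesseract.image_to_string(page)

    # OCR output invariably contains recognition errors and must be
    # proofread manually before the text enters a corpus:
    with open("scanned_page.txt", "w", encoding="utf-8") as out:
        out.write(text)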
Even though the process of collecting, computerizing, and annotating texts will be discussed as separate stages in this and the next chapter, in many senses the stages are closely connected: after a conversation is recorded, for instance, it may prove more efficient to transcribe it immediately, since whoever made the recording will be available to answer questions about it and to aid in its transcription.
The process of analyzing a completed corpus is in many respects similar to the process of creating a corpus. Like the corpus compiler, the corpus analyst needs to consider such factors as whether the corpus to be analyzed is lengthy enough for the particular linguistic study being undertaken and whether the samples in the corpus are balanced and representative. The major difference between creating and analyzing a corpus, however, is that while the creator of a corpus has the option of adjusting what is included in the corpus to compensate for any complications that arise during its creation, the corpus analyst is confronted with a fixed corpus and has to decide whether to continue with the analysis even though the corpus is not entirely suitable, or to find a new corpus altogether.
This chapter describes the process of analyzing a completed corpus. It begins with a discussion of how to frame a research question so that, from the start, the analyst has a clear “hypothesis” to test in a corpus, thereby avoiding the common complaint about corpus-based analyses: that many of them do little more than “count” linguistic features, paying little attention to the significance of the counts. The next sections describe the process of doing a corpus analysis: how to determine whether a given corpus is appropriate for a particular linguistic analysis, how to extract grammatical information relevant to the analysis, how to create data files for recording the grammatical information taken from the corpus, and how to determine the appropriate statistical tests for analyzing the information in the data files that have been created.
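To give one concrete, if invented, instance of the last two steps: suppose counts of some grammatical feature have been recorded for two genres in a data file. A sketch of one suitable significance test (a chi-square test of independence, using the SciPy library; the counts and genre labels are fabricated for illustration) might look as follows:

    from scipy.stats import chi2_contingency

    # Hypothetical counts of a grammatical feature versus all other
    # clauses in two genres of a corpus (invented numbers):
    #                feature   other
    #   fiction         120     880
    #   academic        210     790
    table = [[120, 880],
             [210, 790]]

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
    # A small p-value indicates that the feature's frequency
    # differs significantly across the two genres.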