This chapter is based on two standard reference corpora, the British National Corpus and the Corpus of Contemporary American English, as opposed to the multi-billion-word database of Google Books Ngrams, which, despite its allure, has so far not been used in many systematic linguistic studies. The case study focuses on indefinite article allomorphy (‘a’ vs ‘an’) as an orthographic cue to the phonological strength of ‹h›-onsets in British and American English. As expected, the size advantage of the Ngrams database plays out in larger type and token counts, more stable estimates and fewer distortions due to data sparsity. However, as its metadata are extremely limited (to year and variety), a fully accountable analysis of the Ngrams data is not feasible. The case study illustrates how richly annotated corpora can shed light on potential disturbances arising from two sources: genre differences and between-author variability. A sensitivity analysis offers some degree of reassurance when extending the analysis to the Ngrams database. In this way, the authors demonstrate that the strengths and limitations of corpora and big data resources can, with due caution, be counterbalanced to answer questions of linguistic interest.
This chapter throws into relief the importance of the link between corpus hits and their sources (i.e. texts or speakers) by comparing study designs that are uninformed and informed by such metadata. The authors’ argument draws attention to the consequences of uneven distributions of observations across the basic text units of a corpus. The case study, a distributional analysis of the use of ‘actually’ in the BNCs of 1994 and 2014, demonstrates that disproportionate word counts contributed by individual sources (in this case, speakers) will distort estimates for relevant subsections of corpus data (in this case, demographic groups defined by age and gender) if the analysis assigns the same weight to every observation. The proposed solution is to factor in the text (or speaker) level, but this hinges on the availability of the relevant metadata. Moreover, insights into the hierarchical structure of corpus data are shown to benefit the design stage of a study. Thus, if manual post-processing steps preclude an exhaustive analysis of corpus hits, insights into the organization of data points can direct down-sampling strategies to generate a statistically efficient subset of tokens.
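The weighting problem described in this abstract can be illustrated with a minimal sketch. It is not the authors' analysis: the speaker records, the group label and the two function names are hypothetical, and the counts are invented for illustration rather than taken from the BNC. The sketch contrasts a pooled estimate, in which every token carries the same weight, with a speaker-level estimate, in which every speaker carries the same weight.

```python
# Hypothetical toy data: (speaker_id, group, hits of the target word, total words).
# The numbers are invented for illustration and are not taken from any corpus.
speakers = [
    ("S1", "group_A", 90, 30000),  # one very talkative speaker
    ("S2", "group_A", 2, 2000),
    ("S3", "group_A", 1, 1000),
]


def pooled_rate(rows):
    """Hits per million words when all tokens are pooled (observation-level weighting)."""
    hits = sum(row[2] for row in rows)
    words = sum(row[3] for row in rows)
    return 1e6 * hits / words


def speaker_mean_rate(rows):
    """Unweighted mean of per-speaker rates per million words (speaker-level weighting)."""
    rates = [1e6 * row[2] / row[3] for row in rows]
    return sum(rates) / len(rates)


print(f"pooled:        {pooled_rate(speakers):.0f} per million words")
print(f"speaker-level: {speaker_mean_rate(speakers):.0f} per million words")
```

In this toy group the pooled estimate is pulled towards the single speaker who contributes most of the words, whereas the speaker-level mean treats the three speakers alike. A simple mean of per-speaker rates is only one way of factoring in the speaker level; it is used here as a stand-in, and it presupposes metadata that link each hit to its speaker.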