Doing Linguistics with a Corpus

Methodological Considerations for the Everyday User

Published online by Cambridge University Press:  13 October 2020

Jesse Egbert, Northern Arizona University
Tove Larsson, Northern Arizona University
Douglas Biber, Northern Arizona University

Summary

Paradoxically, doing corpus linguistics is both easier and harder than it has ever been. On the one hand, it is easier because we have access to more existing corpora, more corpus analysis software tools, and more statistical methods than ever before. On the other hand, reliance on these existing corpora and corpus linguistic methods can create layers of distance between the researcher and the language in a corpus, making it a challenge to do linguistics with a corpus. The goal of this Element is to explore ways to improve how we approach linguistic research questions with quantitative corpus data. We introduce and illustrate the major steps in the research process, including how to select and evaluate corpora; establish linguistically motivated research questions, observational units, and variables; select linguistically interpretable variables; understand and evaluate existing corpus software tools; adopt minimally sufficient statistical methods; and qualitatively interpret quantitative findings.
Type: Element
Online ISBN: 9781108888790
Publisher: Cambridge University Press
Print publication: 12 November 2020
