Natural Language Processing for Corpus Linguistics

Published online by Cambridge University Press: 04 March 2022

Jonathan Dunn
Affiliation:
University of Canterbury, Christchurch, New Zealand

Summary

Corpus analysis can be expanded and scaled up by incorporating computational methods from natural language processing. This Element shows how text classification and text similarity models can extend our ability to undertake corpus linguistics across very large corpora. These computational methods are becoming increasingly important as corpora grow too large for more traditional types of linguistic analysis. We draw on five case studies to show how and why to use computational methods, ranging from usage-based grammar to authorship analysis to using social media for corpus-based sociolinguistics. Each section is accompanied by an interactive code notebook that shows how to implement the analysis in Python. A stand-alone Python package is also available to help readers use these methods with their own data. Because large-scale analysis introduces new ethical problems, this Element pairs each new methodology with a discussion of potential ethical implications.

Information

Type: Element
Online ISBN: 9781009070447
Publisher: Cambridge University Press
Print publication: 31 March 2022

