3 - Statistics for Corpus-Based and Corpus-Driven Approaches to Empirical Translation Studies

Published online by Cambridge University Press:  10 June 2019

Meng Ji
Affiliation:
University of Sydney
Michael Oakes
Affiliation:
University of Wolverhampton

Summary

Tognini-Bonelli (2001) distinguished between corpus-based and corpus-driven studies. Corpus-based studies start with pre-existing theories, which are tested using corpus data; in corpus-driven studies, the hypothesis is derived by examining the corpus evidence. This chapter gives an overview of the two families of statistical tests suited to these two approaches. For corpus-based approaches, we use more traditional statistics, such as the t-test or ANOVA, which return a p-value that helps us decide whether to reject the hypothesis under test. Multi-level modelling (also known as mixed modelling) is a new technique which shows considerable promise for corpus-based studies; it is described here and used to analyse the ENNTT subset of the Europarl corpus. Multi-level modelling is useful for examining hierarchically structured or “nested” data, where, for example, translations may be “nested” together in a class if they have the same language of origin. A multi-level model takes account of both the variation between individual translations and the variation between classes. For example, we might expect the scores (such as vocabulary richness or readability scores) of two translations in the same class to be more similar to each other than those of two translations in different classes.
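
The idea that translations in the same class resemble each other more than translations in different classes can be illustrated with a small sketch. It is not the chapter's own analysis: it uses a one-way ANOVA variance decomposition and the intraclass correlation as a simpler stand-in for a full multi-level model, it assumes a balanced design (the same number of translations per class), and the function name `variance_components`, the class labels and the scores are all hypothetical.

```python
from statistics import mean


def variance_components(groups):
    """Split score variance into between-class and within-class parts.

    groups: dict mapping a class label (e.g. source language) to a list
    of scores. This sketch assumes every class has the same number of
    scores (a balanced design) and at least two scores per class.
    """
    sizes = {len(scores) for scores in groups.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes a balanced design")
    n = sizes.pop()   # translations per class
    k = len(groups)   # number of classes
    if n < 2 or k < 2:
        raise ValueError("need at least two classes and two scores per class")

    grand_mean = mean(x for scores in groups.values() for x in scores)
    class_means = {label: mean(scores) for label, scores in groups.items()}

    # Sums of squares, as in a one-way ANOVA.
    ss_between = n * sum((m - grand_mean) ** 2 for m in class_means.values())
    ss_within = sum(
        (x - class_means[label]) ** 2
        for label, scores in groups.items()
        for x in scores
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (k * (n - 1))

    # Method-of-moments estimate of the between-class variance,
    # truncated at zero because a variance cannot be negative.
    var_between = max(0.0, (ms_between - ms_within) / n)
    total = var_between + ms_within
    icc = var_between / total if total > 0 else 0.0
    f_stat = ms_between / ms_within if ms_within > 0 else float("inf")

    return {
        "var_between": var_between,
        "var_within": ms_within,
        "icc": icc,
        "f_statistic": f_stat,
    }


# Hypothetical vocabulary-richness scores, grouped by language of origin.
scores = {
    "fr": [0.52, 0.54, 0.53, 0.55],
    "de": [0.60, 0.61, 0.59, 0.62],
    "fi": [0.45, 0.47, 0.46, 0.44],
}
result = variance_components(scores)
print(result)
```

An intraclass correlation close to 1 means that most of the variation lies between classes, so two translations from the same class are much more alike than two from different classes. A multi-level model estimates the same two sources of variation, but also handles unbalanced classes and additional predictors.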

Type: Chapter
Information: Advances in Empirical Translation Studies: Developing Translation Resources and Technologies, pp. 28–52
Publisher: Cambridge University Press
Print publication year: 2019

References

Baayen, R. Harald (2008). Analysing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge, UK: Cambridge University Press.
Biber, Douglas (2009). Corpus-based and corpus-driven analyses of language variation and use. In Heine, Bernd and Narrog, Heiko (eds.), The Oxford Handbook of Linguistic Analysis (1st edition). Oxford, UK: Oxford University Press.
Koehn, Philipp (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), Phuket, Thailand. Tokyo: Asia-Pacific Association for Machine Translation.
Koppel, M. and Ordan, N. (2011). Translationese and its dialects. In Proceedings of ACL, Portland OR, June 2011. Stroudsburg, PA: Association for Computational Linguistics, pp. 1318–1326.
Nisioi, Sergiu, Rabinovich, Ella, Dinu, Liviu P. and Wintner, Shuly (2016). A corpus of native, non-native and translated texts. In Calzolari, Nicoletta et al. (eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), Portoroz, Slovenia, May 23–28, 2016. European Language Resources Association, pp. 4197–4200.
Rabinovich, Ella, Nisioi, Sergiu, Ordan, Noam and Wintner, Shuly (2016). On the similarities between native, non-native and translated texts. In van den Bosch, Antal (General Chair) (ed.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August. Stroudsburg, PA: Association for Computational Linguistics, pp. 1870–1881.
Rabinovich, Ella, Ordan, Noam and Wintner, Shuly (2017). Found in translation: Reconstructing phylogenetic language trees from translations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 30 July–4 August. Stroudsburg, PA: Association for Computational Linguistics, pp. 530–540.
Serva, Maurizio and Petroni, Filippo (2008). Indo-European languages tree by Levenshtein distance. Europhysics Letters 81(6), 68005.
Tognini-Bonelli, Elena (2001). Corpus Linguistics at Work. Amsterdam: John Benjamins.
Winter, Bodo (2013). Linear models and linear mixed effects models in R with linguistic applications. Tutorials 1 and 2. arXiv:1308.5499. http://arxiv.org/pdf/1308.5499.pdf.
