In recent decades, the field of machine translation (MT) has undergone a major shift in approach, from phrase-based statistical machine translation to neural machine translation, with a corresponding remarkable improvement in translation quality. From the era of statistical MT systems into the era of neural models, the role of MT in industry has grown, and research aimed at enhancing the models has only intensified. Recently, there have been increasing claims that neural machine translation (NMT) systems are reaching human parity (Hassan et al., 2018, i.a.), followed by subsequent analyses that incorporate contextual information and show that there remains a gap between machine translation and human translation performance (Toral et al., 2018; Läubli et al., 2018). This has resulted in calls for changes to evaluation that can distinguish high-performing sentence-level machine translation from human translation, as well as for improved approaches to incorporating context into machine translation systems and automatic evaluation metrics.
This special issue aims to tackle a variety of questions about the roles that context plays in NMT and in its evaluation. The three research papers published here focus on components of the evaluation process. They each take different perspectives and examine different aspects of what it means to consider “context” in NMT, highlighting the broad range of work still to be undertaken on this topic, which is also summarized in the survey article (Castilho and Knowles, 2024). The survey article not only covers past and recent work in the field but also highlights the emergence of large language models and their applications in translation and evaluation. Furthermore, the authors provide perspectives on the future of the field.
Motivated by analyses of reference translations that periodically turn up references of lower quality than expected, Zouhar et al. (2024) propose a consensus-based approach to building high-quality document-level translations, which they call “optimal reference translations,” along with a manual evaluation of those translations to verify their quality. They examine annotator agreement, finding that meaning, style, and pragmatics were most influential in overall score, which also matched annotator questionnaire responses about the importance of these factors. Additionally, they consider annotator differences, including those between annotators with differing levels of translation expertise. As the field shifts towards document-level translation and evaluation and as NMT performance continues to improve, having high-quality document-level references will be vital for the appropriate use of automatic reference-based metrics and human evaluation.
With the increase in the fluency of MT output brought about by the shift to NMT, do Campo Bayón and Sánchez-Gijón (2024) focus on evaluating the naturalness and user acceptance of NMT output in a low-resource language (Galician) in the social media domain. They propose evaluating NMT using the non-inferiority principle, which is more commonly applied in health and medicine. This type of evaluation for naturalness can serve as a complement to adequacy-based evaluations. Their work touches on context in two important ways. First, they highlight the importance of providing annotators with additional textual context (in this case tweet threads rather than isolated tweets) during annotation. Second, and more broadly, they consider how the genre, real-world context, and annotator characteristics may interact, such as how different age demographics may have different perceptions of and expectations for social media translation. As NMT is used in a wider variety of everyday contexts, it may be increasingly important to design evaluations that account for aspects specific to the context in which the NMT is being used and the interplay between that context and users’ expectations.
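To make the non-inferiority principle concrete, the following minimal sketch illustrates the general statistical idea rather than the authors’ procedure; the ratings, margin, and significance level are assumed for the example. It checks whether MT naturalness ratings are not worse than human-translation ratings by more than a pre-specified margin:

import numpy as np
from scipy import stats

def non_inferior(mt_scores, ht_scores, margin=0.5, alpha=0.05):
    # Conclude non-inferiority if the one-sided (1 - alpha) upper confidence
    # bound on (human mean - MT mean) lies below the pre-specified margin.
    mt = np.asarray(mt_scores, dtype=float)
    ht = np.asarray(ht_scores, dtype=float)
    diff = ht.mean() - mt.mean()  # average amount by which MT trails human translation
    se = np.sqrt(mt.var(ddof=1) / len(mt) + ht.var(ddof=1) / len(ht))
    df = len(mt) + len(ht) - 2    # simple approximation; Welch's df is an alternative
    upper = diff + stats.t.ppf(1 - alpha, df) * se
    return upper < margin

# Hypothetical 1-5 naturalness ratings for MT output and human translations
print(non_inferior([4, 5, 3, 4, 4, 5, 3, 4], [5, 4, 4, 5, 4, 4, 5, 4]))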
Knowles and Lo (2024) examine data from recent WMT (Conference on Machine Translation) shared tasks in order to explore the impacts of incorporating intersentential context into human evaluation of MT. They examine inter- and intra-annotator variation and discuss best practices for balancing the challenges of handling document-level intersentential context in human evaluation, such as using calibration sets to pre-screen annotators or standardize annotator scores (a rough illustration of such standardization follows below). They also highlight the need for future work on better understanding how annotators interpret annotation tasks, and how the shift towards document-level translation and evaluation necessitates a reexamination of assumptions about human evaluation protocols.
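As a rough illustration of annotator score standardization (a minimal sketch rather than the exact WMT procedure; the annotator names and raw scores are invented), per-annotator z-normalization rescales each annotator’s scores by that annotator’s own mean and standard deviation so that scores become comparable across annotators:

import numpy as np

def z_normalize(scores_by_annotator):
    # Map each annotator's raw scores to z-scores using that annotator's
    # own mean and standard deviation.
    normalized = {}
    for annotator, scores in scores_by_annotator.items():
        s = np.asarray(scores, dtype=float)
        std = s.std(ddof=1)
        normalized[annotator] = (s - s.mean()) / std if std > 0 else s - s.mean()
    return normalized

# Hypothetical raw 0-100 direct assessment scores from two annotators
print(z_normalize({"annotator_a": [70, 85, 90, 60], "annotator_b": [40, 55, 65, 30]}))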
These papers cover a wide range of context-related topics in NMT and also illustrate how interconnected many of these aspects are in the broader conversation about context. Zouhar et al. (2024) and do Campo Bayón and Sánchez-Gijón (2024) both consider differences between annotators (the former in terms of qualifications related to translation, the latter demographically) and how these influence annotation. Knowles and Lo (2024) used anonymized data from WMT, so they did not have access to such information, but they examined patterns of behaviour across annotators. For Knowles and Lo (2024), this included certain annotators who strongly preferred to give scores that lined up with the tick marks on a sliding-scale annotation tool, comparable to the observation in Zouhar et al. (2024) that annotators tended towards round-number scores (with scores typed into a spreadsheet as the annotation tool). The observation in Knowles and Lo (2024) that low-scoring segments in evaluations that include intersentential context influence the scores given to other segments may relate to the finding in Zouhar et al. (2024) that annotators focused on the lowest-rated segments when asked to produce document-level scores; there is clearly more to examine in this area. Both do Campo Bayón and Sánchez-Gijón (2024) and Knowles and Lo (2024) consider whether the task the annotators are performing is sufficiently well-defined, with do Campo Bayón and Sánchez-Gijón (2024) iterating on their survey questions and providing annotators with more explanations in order to better capture the desired information. Zouhar et al. (2024), do Campo Bayón and Sánchez-Gijón (2024), and Knowles and Lo (2024) all consider the effects of having access to additional within-document context while performing evaluation, which ties into questions considered in prior work about how much intersentential context is necessary for well-informed evaluation (Castilho et al., 2020; Castilho, 2022, i.a.). In turn, this question of which intersentential context is helpful is discussed in the survey (Castilho and Knowles, 2024), from the perspective of how it can affect the quality of translation output as well as how it can be involved in evaluation.
The many questions of context in NMT and NMT evaluation are deeply interconnected and remain an exciting area for future examination, especially with the rapid introduction of models such as large language models with extended intersentential context capabilities. The fact that perspectives on evaluation were the main focus of the special issue’s submissions highlights the importance of appropriate evaluation in the development of the field of context-aware MT, as well as the fact that the field is still in the process of developing best practices for evaluation. Ensuring that we have sound and appropriate methods of evaluation will allow us to better explore the capabilities of these systems and, in turn, allow the field to make the most of the many ways in which different types of context may be useful for improving MT quality.
We hope that the insights shared in this special issue will prove valuable to our readers and that they will not only ignite further curiosity but also serve as a practical guide for the community, fostering a deeper understanding and robust exploration of the diverse ways in which context can enhance MT.