1. Introduction: L2 pronunciation is a teaching and learning challenge
Learning and teaching foreign and second language (L2) pronunciation is a major challenge in instructed second language acquisition (SLA). For many learners, pronunciation is a difficult learning challenge. Survey studies have shown that learners perceive pronunciation to be an important component of language learning but still do not know what to do to improve their pronunciation skills (Darcy, 2018). For example, they may focus on improving accuracy in the production of L2 vowels and consonants while neglecting timing and speech rhythm, or they may be unaware of the importance of linking phenomena in connected speech. For many teachers, pronunciation represents a teaching challenge too. They do not know what aspects of pronunciation they should be teaching, what the main pronunciation targets should be, or when and how they should be teaching them (Darcy et al., 2012). In addition, teachers may lack knowledge of phonetics and of the sound systems of the learners’ first (or native) languages (L1s), and consequently they may be unable to establish pronunciation teaching priorities according to different proficiency levels or the relative functional load of the pronunciation targets that matter the most for learners to achieve high levels of intelligibility in the L2. Even when they have such knowledge, teachers find it methodologically challenging to integrate pronunciation instruction effectively into communicative language teaching (CLT) (Darcy & Rocca, 2022; Mora-Plaza, 2023), to motivate students to work on pronunciation, and to implement methods and techniques to help them with pronunciation (Huensch, 2019).
Several features of current foreign language classroom instruction make it particularly challenging for teachers to implement efficient pronunciation instruction methods and for researchers to identify individual and context-related predictors of pronunciation learning success in this context. Two such features are the quantity and quality of input learners typically receive in classroom instruction and the scarce opportunities typical meaning-focused activities provide for developing learners’ awareness of their L2 pronunciation learning needs:
Input. One important limitation for L2 pronunciation development in instructed SLA is the usual poor input conditions under which acquisition develops (Muñoz, 2014). Not only do learners have little opportunity for meaningful communicative interaction in the L2, but the oral interactions that take place with peers and teachers are also often L1-accented. Such input conditions severely limit exposure to the range of contrasting L2 sounds and features that would be necessary to guarantee the development of a fully functional L2 phonology. One undesirable consequence of such poor input conditions is that learners’ vocabulary typically grows with an under-developed phonology and before phonetic contrasts have been acquired. Although L2 learners’ larger vocabulary size is generally associated with more accurate pronunciation and a more precise encoding of difficult L2 phonetic contrasts (Daidone & Darcy, 2021; Llompart, 2021), vocabulary growth itself might hinder phonological development under learning conditions of extensive exposure to L1-accented input and little exposure to authentic L2 input, as imprecise (‘fuzzy’) phonolexical representations might get more strongly entrenched in the L2 mental lexicon through L2 vocabulary use (Darcy et al., 2025).
Awareness of pronunciation. An inherent feature of CLT is that speaking tasks are naturally meaning-oriented and, logically, pronunciation-unfocused. Consequently, the teaching focus tends to be on vocabulary and grammar because these linguistic domains are more essential than pronunciation in achieving task goals (Mora-Plaza, 2023), especially at lower proficiency levels. As a result, learners have few opportunities to develop awareness of their pronunciation learning needs and are unlikely to identify their own pronunciation difficulties and develop strategies to overcome them.
Although much L2 learning takes place in foreign language learning environments and mainly in the classroom, current well-established models of L2 speech learning such as PAM-L2 (Best & Tyler, 2007) and SLM-r (Flege & Bohn, 2021) aim to account for L2 phonological acquisition in naturalistic language learning environments where the L2 is predominantly used and have not had much to say about how phonological development occurs in instructed SLA. By L2 phonological acquisition, these models essentially refer to the process whereby learners develop perceptual phonological categories for L2 sounds that are distinct from those of their native language and acquire the ability to make use of L2-specific features in the distinction of phonologically contrastive L2 sounds (e.g. /æ/-/ʌ/ as in cap-cup). Whether the predictions of these models hold under poor input conditions, such as foreign language classroom instruction, remains an empirical question. For example, whereas in naturalistic environments the quality and quantity of input received, the age at which immersion begins, length of residence, and the amount of L1 and L2 use have been shown to account for a large proportion of the variance in L2 speech learning, in much classroom instruction poor input conditions make phonological acquisition extremely challenging, with only very modest gains typically being reported. Consequently, the factors that explain phonological acquisition in immersion settings are unlikely to fully explain phonological acquisition in instructed SLA. This situation may change with current extensive access to online sources of spoken L2 input via media or immersive experiences provided by teachers.
Researchers have generally assumed that for the models’ predictions to hold in classroom contexts, instruction should be enriched in terms of input exposure, increased awareness of cross-language phonetic differences, and support for perceptual learning at early stages (Piske, 2007; Tyler, 2019).
Beyond the limitations of the learning conditions in instructed SLA outlined above, other factors make pronunciation learning inherently challenging. One such factor is the nature of our perceptual systems and how they cope with the acquisition of a new phonology. L2 sounds are initially perceived in terms of our L1 phonetic categories because our perceptual system is attuned to the L1 and will filter out acoustic information that is irrelevant for efficient L1 phonological processing (i.e. L1-based processing). However, such acoustic information may be critical for efficient L2 phonological processing. For example, Spanish learners of English will not attend to the vowel duration differences English speakers use as a cue to voicing in word-final position (plays /pleɪz/ realized as [pleɪz̥] vs place /pleɪs/ realized as [pleɪs]), as there are no voiced obstruent consonants in word-final position in Spanish. Consequently, learners may not be able to perceive L2 sounds and L2-specific speech processes as faithfully as they would in their L1 (a phenomenon known as ‘perceptual foreign accent’), and they may fail to focus on the relevant L2-specific phonetic cues. Similarly, the English vowels in cap /kæp/ and cup /kʌp/ may both be perceptually assimilated to the only Spanish low central vowel category /a/, which may lead to the development of imprecise phonolexical representations. Thus, both /kæp/ and /kʌp/ might have the phonological form /kap/ in the learners’ mental lexicon, which may lead to both cap and cup being mispronounced indistinguishably as /kap/.
The pronunciation learning challenges outlined above may have far-reaching consequences in terms of overall language development. For example, students who speak the L2 with poor intelligibility may feel that their pronunciation prevents them from communicating effectively and fluently in the L2, which may demotivate them, prevent them from seeking opportunities for interacting with other speakers, or lead them to avoid getting actively engaged in classroom speaking tasks. Rather than opportunities for language learning, speaking tasks may then be perceived as sources of language speaking anxiety interfering with pronunciation learning (Baran-Łucarz & Lee, 2021). Still, learners’ pronunciation learning goals may vary a lot depending on their overall language learning objectives. Whereas many language learners aim at acquiring a near native-like pronunciation of the L2 because their aim is to communicate with native speakers of the L2 (e.g. Italian, Russian, Japanese), the primary aim of many learners of English is to become fluent English users for intercultural communication. For them, English has the status of a ‘lingua franca’ they use for general communication purposes with other non-native speakers of English (Jenkins, 2006). For such learners, the presence of an L1-accent in their English is not a reason for concern as long as it is not seriously detrimental to intelligibility. For them, the ability to perceptually accommodate to a variety of non-native accents of English for fluent communication is more essential than speaking the L2 in a native-like manner.
However, there may be a minority of learners acquiring English with the primary aim of communicating with native English speakers, or with the aim of becoming English language teachers, who may prefer to acquire a pronunciation of English that closely resembles the standard pronunciation of native speakers of some widely taught variety of English (e.g. General American English or Standard Southern British English) representing a large percentage of learners’ media consumption in the L2.
Aligned with many learners’ goals in pronunciation learning, the field of L2 pronunciation pedagogy has experienced an important paradigm shift from an emphasis on nativelikeness to an emphasis on speech intelligibility and comprehensibility as a more useful and realistic goal of pronunciation instruction (Levis, 2020). However, pedagogical practices that integrate a focus on speech comprehensibility development into CLT are scarce and empirically under-researched. Although the intelligibility principle for L2 pronunciation advocates for abandoning the nativeness principle (where L2 learners’ goal was to acquire native-like pronunciation) to embrace intelligible pronunciation (and comprehensibility) as the goal of L2 pronunciation teaching and learning, in general, ideologies of nativeness continue to be pervasive (Jeong & Lindemann, 2025). Learners vary in how willing they may be to assume an L2 speaker identity that can speak the L2 without an L1 accent (Szyszka, 2022), which may affect the pronunciation learning strategies and goals they set for themselves (Sardegna, 2022). Similarly, although non-native speaking pronunciation teachers may idealize nativeness while relying on their knowledge of L2 phonetics and phonology to obtain legitimacy as pronunciation instructors (Gordon & Barrantes-Elizondo, 2024), the goals of pronunciation instruction and the focus on specific pronunciation targets may largely depend on teacher cognitions and identity related to pronunciation instruction (Kochem, 2022). Students’ and teachers’ cognitions and ideologies about L2 pronunciation teaching and learning are beyond the scope of the current paper, but they should be considered in relation to the effectiveness of the pronunciation instruction methods and training techniques described below.
While some approaches to pronunciation instruction may be apt to develop L2 pronunciation from both the perspectives of intelligibility (actual understanding of speech) and nativelikeness (sounding like a native speaker), some pronunciation training techniques may be more effective than others at improving overall speech comprehensibility (ease of understanding).
Helping learners overcome the challenges associated with pronunciation learning in CLT through classroom instruction requires a thorough understanding of the pronunciation learning difficulties of specific L1-L2 combinations and learner groups, identifying the most relevant (i.e. high functional load) pronunciation targets, and implementing approaches to L2 pronunciation instruction and training methods and techniques whose effectiveness has been proven by empirical research. Given that learners’ individual differences (e.g. experiential, emotional, aptitude-related) are likely to interact with one another and with teaching methods and learning contexts in complex ways to modulate individual pronunciation development (Mora, 2022), they also need to be considered when researching the effectiveness of pronunciation instruction methods and pronunciation teaching, training, and assessment.
2. Integrating pronunciation instruction into CLT: Empirical findings on pronunciation instruction methods
Several meta-analytical studies have shown that pronunciation instruction is effective at improving various segmental and suprasegmental features of L2 pronunciation (e.g. Lee et al., 2015; Thomson & Derwing, 2015), even when it is not integrated into CLT. Pronunciation instruction, including explicit pronunciation instruction (Gordon et al., 2013), appears to be especially effective when it is form-focused, when learners’ attention is explicitly directed to L2-specific features, when focused practice and corrective feedback are provided, when the focus is on phonetic features that have an impact on intelligibility, and when assessment methods elicit controlled speech that is measured acoustically (Saito & Plonsky, 2019). In addition, pronunciation research in instructed SLA is extremely varied with respect to intervention outcomes, which are largely dependent on the type (explicit, form-focused, communicative) and length of instruction, the pronunciation targets the instruction focuses on (specific segmental or suprasegmental features vs global pronunciation proficiency dimensions like comprehensibility and accentedness), and the task types used to practice pronunciation (Crowther & Loewen, 2025). The complexity of the overall emerging picture calls for further research on pronunciation instruction methods. However, research investigating the effectiveness of integrating a focus on phonetic form (FoPF) into CLT is currently scarce and findings are far from conclusive.
There are currently two main approaches to pronunciation instruction that aim at integrating pronunciation instruction into communicative language teaching. One approach advocates for combining explicit instruction with communicative tasks (Darcy et al., 2021; Gordon, 2021; Gordon & Darcy, 2022), as the addition of communicative activities focusing on the target phonetic features previously taught has been shown to provide a learning advantage. For example, Darcy & Rocca (2022) compared two types of pronunciation instruction, explicit only versus explicit + communicative. Pronunciation instruction, which focused on suprasegmentals, was integrated into a seven-week oral communication course (110 instruction hours). A control group that did not integrate a specific focus on pronunciation provided baseline data. By comparing scores before and after the intervention, the researchers tested gains in comprehensibility and perceived errors (stress and vowel reduction) on speech samples obtained through a read aloud task, a monologic spontaneous speaking task, and an interactive group discussion task (information gap task). They found improvement in comprehensibility when phonetic instruction was integrated in both the explicit only and the explicit + communicative groups, but not in the pronunciation-unfocused group. However, the group with a communicative component did better in the group discussion task than the explicit group, whereas the latter obtained higher gains in the read aloud task and both groups improved only modestly in the monologic speaking task. Much more classroom research of this kind is needed to gain a better understanding of how and to what extent integrating a focus on pronunciation into CLT is possible and effective.
Another approach is task-based pronunciation teaching (TBPT). This approach applies the principles of task-based language teaching (TBLT) to L2 pronunciation teaching and learning (Mora & Levkina, 2017; Mora & Mora-Plaza, 2023; Mora-Plaza, 2023; Solon et al., 2017; Xu et al., 2024). The central idea is to promote attention to linguistic form during communicative task performance by manipulating task design features such as task complexity. This approach is grounded in TBLT research showing that increasing the cognitive complexity of a task enhances attention to linguistic form, thus promoting oral production that is lexically and grammatically more complex (and often more accurate), enhancing linguistic development. Research on TBPT has investigated whether this holds true for pronunciation. Some studies have found an increase in perceptual or productive segmental accuracy resulting from increased cognitive complexity when pronunciation practice is provided through meaning-oriented pronunciation-focused tasks, especially when the target phonetic feature is made essential for task completion. For example, Mora-Plaza (2023) implemented a seven-week pedagogic intervention with 92 L1-Spanish EFL school learners (aged 16) around the topic of a trip consisting of 20 pronunciation-focused dyadic problem-solving communicative tasks (e.g. deciding on destinations and visits, choosing which objects to take on the trip) based on minimal-pair words targeting challenging phonological contrasts (/iː/-/ɪ/, /æ/-/ʌ/).
Learners, who were assigned to either a simple or a complex task condition (or a control group), engaged in pre-tasks that familiarized them with the target word forms and in post-tasks that aimed to consolidate the target vowel contrasts through self-assessment before and after the communicative tasks, respectively. The key strategy in the task design of the study is that target forms are practiced during meaningful communication between peers, because this is what is meant to drive acquisition. She found both experimental learner groups to improve significantly in the perception and production of the /iː/-/ɪ/ and /æ/-/ʌ/ contrasts, but those in the complex task condition obtained higher retention rates, suggesting that, in these kinds of pronunciation-focused tasks, increased task complexity may generate more robust learning gains in pronunciation. However, in pronunciation-unfocused tasks increased task complexity can be detrimental to speaking fluency and pronunciation accuracy in terms of accentedness, rhythm, and intonation (Crowther et al., 2018). This raises the question of whether learners can actually attend to pronunciation in meaning-oriented pronunciation-unfocused tasks and, especially in cognitively complex tasks, whether enhanced attention to form generated by increased complexity would be detrimental to pronunciation by favouring increased attention to grammar and lexis. Such a question has important implications for pedagogic approaches that aim to integrate pronunciation instruction into task-based CLT, especially in terms of communicative task design and implementation.
Empirical research has yet to determine the extent to which there may exist attentional trade-offs between pronunciation and lexico-grammar during meaning-based activities with detrimental effects for pronunciation, or the extent to which such trade-offs can be manipulated through task design to enhance attention to phonetic form. Mora-Plaza et al. (2024) is a first attempt at investigating this question. This study manipulated task complexity and examined its effects on specific features of L2 English pronunciation in the speech of Spanish EFL learners (voice onset time [VOT], vowel production accuracy, and speech rhythm) in a problem-solving pronunciation-unfocused monologic speaking task (an adaptation of the dinner table task, available from http://sla-speech-tools.com, Mora-Plaza et al., 2022). Pronunciation accuracy was gauged through acoustic measurements of laryngeal timing (VOT), vowel contrastiveness and nativelikeness (Mahalanobis distances), and native speakers’ ratings of comprehensibility and accentedness. Results revealed detrimental effects of increased task complexity on the production of oral stops and speech accentedness; however, no consistent task complexity effects were found on vowel accuracy. In a follow-up study using the same speech materials but focusing on L2 speech rhythm (Fullana et al., 2025), the oral production data were analysed using well-established rhythm metrics (%V, VarcoV, nPVI-V, and VarcoC), novel distance measures (Euclidean and Mahalanobis distance scores between pairs of rhythm metrics), and native speaker ratings of comprehensibility and accentedness.
The results showed that task complexity effects on L2 speech rhythm depended on the specific rhythm metric or distance measure used, and no consistent overall picture emerged that could confirm potential detrimental effects of task complexity on L2 speech rhythm. Task complexity was only slightly detrimental to comprehensibility and accentedness.
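The rhythm metrics named above have widely used definitions in the speech rhythm literature, and a minimal sketch may help readers who are unfamiliar with them. The Python below computes %V, VarcoV/VarcoC, and nPVI-V from lists of vocalic and consonantal interval durations. The function names, the example durations, and the use of the population standard deviation are illustrative assumptions; the code is not taken from the analysis pipeline of the studies discussed here, and the distance measures between pairs of metrics are not covered.

```python
from statistics import mean, pstdev


def percent_v(vocalic, consonantal):
    """%V: share of total utterance duration taken up by vocalic intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total


def varco(durations):
    """Varco: standard deviation of interval durations divided by their mean, x100.

    Apply to vocalic intervals for VarcoV and to consonantal intervals for VarcoC.
    Assumption: the population standard deviation is used here; some
    implementations use the sample standard deviation instead.
    """
    return 100.0 * pstdev(durations) / mean(durations)


def npvi(durations):
    """nPVI: mean normalized difference between successive intervals, x100."""
    pairs = zip(durations, durations[1:])
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * mean(terms)


# Hypothetical interval durations in seconds (not real learner data).
vocalic = [0.10, 0.20, 0.10, 0.20]
consonantal = [0.05, 0.15, 0.05, 0.15]

print(round(percent_v(vocalic, consonantal), 2))  # %V
print(round(varco(vocalic), 2))                   # VarcoV
print(round(varco(consonantal), 2))               # VarcoC
print(round(npvi(vocalic), 2))                    # nPVI-V
```

Because Varco and nPVI normalize by the mean duration (globally or pairwise), they are less sensitive to overall speech rate than raw duration variability, which is one reason such metrics are preferred when comparing speakers who differ in fluency.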
These findings suggest that increasing task complexity in pronunciation-unfocused tasks may be detrimental for L2 pronunciation learning, such that, when tasks are pronunciation-unfocused, simple (but not complex) tasks may be more likely to enhance some focus on phonetic form during communicative task performance. In pronunciation-focused communicative tasks, on the other hand, a TBPT task design that makes pronunciation targets essential for task completion may be effective in promoting pronunciation learning in the classroom. Pronunciation pedagogy would benefit substantially from empirical research investigating the effectiveness of integrated L2 pronunciation instruction approaches, which is currently very scarce. In addition, further research is needed in various areas of L2 pronunciation teaching and training methods, such as the extent to which high variability phonetic training (HVPT) may be effective at developing pronunciation globally, or the extent to which specific pronunciation training methods (e.g. accent imitation, shadowing) can be useful in supporting pronunciation development by enhancing learners’ awareness of their pronunciation limitations and learning needs. Finally, without helping learners establish precise phonological representations for new words and helping them update the representations of already established lexical forms, L2 pronunciation training is unlikely to lead to substantial benefits in L2 speech comprehensibility and accentedness (Darcy et al., 2025). Consequently, it becomes crucial to identify pronunciation methods and techniques that hold a potential for accomplishing such goals, as well as to empirically test whether certain task-design features (for example, the use of form-meaning mapping trials when training difficult sound contrasts through HVPT) can benefit the establishment and updating of phonolexical representations more than others.
Therefore, a potentially fruitful research avenue in the area of L2 pronunciation teaching and learning might be to compare and combine specific and global methods of L2 pronunciation training with a focus on the development of pronunciation at the lexical level and to examine the role of aptitude-related individual differences, such as domain-general auditory processing and cognitive factors.
3. An outline of pronunciation training techniques: Characteristics, limitations, and empirical findings
There is little empirical evidence of the positive impact of phonetic training methods on the development of comprehensible speech. Pronunciation instruction approaches like TBPT and explicit instruction (+ communicative tasks) outlined above have produced promising results, but much more research is needed to determine their efficacy in learners of varying proficiency levels, in different learning contexts, and on a wider range of segmental and suprasegmental phonetic targets. Similarly, the efficacy of pronunciation training techniques available to teachers for in and out of class use is largely under-researched. Such techniques include, but are not limited to, the following:
High variability phonetic training (HVPT): In HVPT learners are typically perceptually trained on difficult L2 vocalic or consonantal contrasts (e.g. /r/-/l/ for Japanese learners of English or /æ/-/ʌ/ for Spanish learners of English) through exposure to highly variable stimuli (different speakers and phonetic environments) via identification or discrimination tasks with feedback (Thomson, 2018). High variability phonetic training has been found to produce robust gains in L2 speech perception and production generalizable to new speakers and contexts (Sakai & Moorman, 2018; Uchihara et al., 2024, 2025) and has been successfully adapted to implicit gaming paradigms (Saito et al., 2022). A potential pedagogical limitation of HVPT, however, is that it is more suitable for individual pronunciation training than for integrated classroom practice.
Shadowing: Shadowing involves repeating what one hears as simultaneously and accurately as possible. When shadowing speech presented auditorily, learners are forced to process L2 sound units, sequences, and phonological features while avoiding a focus on meaning during L2 speech input processing. Shadowing has been shown to improve speech perception and production and listening skills (Foote & McDonough, 2017; Hamada, 2019). This technique is highly adaptable and has been used in various forms to train phonological and bottom-up processing skills (Hamada & Suzuki, 2024).
Embodied pronunciation training (EPT): EPT involves the use of multimodal exposure through gestures to visually support pronunciation training by linking auditory perception with visual actions to enhance the acquisition of segmental and suprasegmental features of speech (Chan, 2018). This technique, which may involve hand clapping, hand movements representing tones or intonational contours, or a fist-to-open-hand gesture representing the release burst of aspirated stops, among other gestures, has been shown to result in pronunciation benefits in specific (segmental and suprasegmental accuracy) and global (comprehensibility, accentedness) speech dimensions (Baills et al., 2022; Xi et al., 2024).
Multimodal pronunciation training through L2-captioned video (CaptV): Captioned video is a useful tool to train the simultaneous processing of L2 auditory input with the visual support of onscreen text and background action. Benefits of exposure to this kind of audiovisual input have been widely attested for listening comprehension and for vocabulary and grammar development (Muñoz, 2022), but also for speech segmentation (Charles & Trenkic, 2015) and speech perception (Mitterer & McQueen, 2009). Although still an under-researched method for pronunciation learning, several studies suggest it has the potential to generate benefits at the segmental processing level and in speech perception (Galimberti et al., 2023; Scheffler & Baranowska, 2023; Wisniewska & Mora, 2020) and production (Hutchinson & Dmitrieva, 2022). Given the difficulty in directing learners’ attention to phonetic form under such attention-demanding rich input conditions, studies have used question prompts (Wisniewska & Mora, 2020), input-based activities (Galimberti et al., 2025), or input enhancement (Azpilicueta-Martínez & Ocáriz-Tejada, 2025; Fouz-González & Mora, 2025) to bring target pronunciation features to the attentional foreground during viewing.
L2 accent imitation in L1 (L2AIL1): L2AIL1 involves mimicking the accent of a speaker of the L2 one is learning while speaking one’s L1, usually after having been exposed to such an accent through listening materials. For example, a Spanish learner of English would imitate an English speaker speaking Spanish with an English accent. This technique has been found useful in developing awareness of cross-language phonetic differences resulting in pronunciation improvement at the segmental level (Henderson & Rojczyk, 2023; Mora et al., 2014; Rojczyk, 2015).
Pronunciation self- and peer-assessment (SelfAss): Self- and peer-assessment of pronunciation are useful tools to raise learners’ awareness of their own pronunciation difficulties and help them identify the features they need to improve on (Isbell & Lee, 2022; Saito et al., 2020). Pronunciation self-assessments are typically compared to native listeners’ assessments to obtain measures of under- and over-estimation of learners’ perceived pronunciation ability (Strachan et al., 2019; Trofimovich et al., 2016). Training self-assessment may therefore be a way to achieve a better alignment between assessment by learners and native listeners, which may have positive consequences for L2 pronunciation development (Tsunemoto et al., 2022).
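The comparison between self-assessments and listener assessments described in the last item above can be illustrated with a simple difference score. The Python sketch below is an illustration under stated assumptions rather than the procedure used in the cited studies: it assumes a hypothetical 9-point comprehensibility scale on which higher values mean easier to understand, averages the listeners’ ratings, and treats gaps within an arbitrary tolerance of 0.5 scale points as aligned. The function names and the example ratings are invented for illustration.

```python
from statistics import mean


def calibration_gap(self_rating, listener_ratings):
    """Self-rating minus the mean listener rating (positive = over-estimation)."""
    return self_rating - mean(listener_ratings)


def label_gap(gap, tolerance=0.5):
    """Classify a gap; the 0.5-point tolerance is an arbitrary assumption."""
    if gap > tolerance:
        return "over-estimation"
    if gap < -tolerance:
        return "under-estimation"
    return "aligned"


# Hypothetical ratings on a 9-point scale (not real data).
learners = {
    "learner_A": (7, [4, 5, 6]),   # rates self above listeners
    "learner_B": (3, [5, 6, 7]),   # rates self below listeners
    "learner_C": (6, [6, 6, 7]),   # close to listeners
}

for name, (self_rating, listener_ratings) in learners.items():
    gap = calibration_gap(self_rating, listener_ratings)
    print(name, round(gap, 2), label_gap(gap))
```

A raw difference score is the simplest option; studies may instead compare standardized scores or ranks, which would change the numbers but not the underlying logic of comparing the learner’s judgement with that of listeners.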
One way to orient research on the effectiveness of the pronunciation training techniques briefly outlined above is to determine their appropriateness in terms of a set of critical features (see Table 1 below), such as learners’ proficiency level (low, intermediate, advanced), learning context (classroom vs self-study), the processing level the training involves (pre-lexical or lexical), the pronunciation targets they focus on (vowels, consonants, suprasegmentals), the training modality (primarily perception or production), the training focus (L2 features or contrasting L2-L1 features), or the specific (segmental contrasts, prosodic features) or global dimensions (intelligibility, comprehensibility, accentedness, fluency, listening) they are most likely to have an effect on (see Saito & Plonsky, 2019). Empirical studies are needed for every intersection between each one of the training techniques and the different task features in Table 1. For example, HVPT has been widely used to perceptually train L2 segmental contrasts (using lexical and non-lexical materials) in intermediate and advanced adult learners, but very little research has investigated its effectiveness in training the perception of cross-language phonetic differences (Cebrian et al., 2025) or in training young children (Brekelmans et al., 2024).
Table 1. Task features to consider when assessing effectiveness of pronunciation training techniques

Note: Acc = accentedness, Comp = comprehensibility, Flu = fluency, Int = intelligibility, List = listening.
The pronunciation training techniques in Table 1, and many others, are likely to develop dramatically in the near future through technological advances (Fouz-González, in press; O’Brien et al., 2018), especially through the implementation of computer- and mobile-assisted pronunciation training (CAPT/MAPT), which has been found to be effective in promoting pronunciation development (Stoughton & Kang, 2024), but also through the use of artificial intelligence (AI) and automated speech recognition (ASR) systems (Dai & Wu, 2023). Undoubtedly, both the design and implementation of the pronunciation training techniques described above will benefit greatly from technological advances, including the use of ASR to train L2 pronunciation and AI-generated speaker voices to generate speech stimuli for training and testing.
Another important, still largely under-researched aspect of L2 pronunciation training interventions (whether focusing on specific or global dimensions of pronunciation proficiency) is the extent to which they can have an impact on the processes that modify learners’ developing phonolexical representations, and whether such an impact can have positive consequences for L2 speech production (John & Frasnelli, 2022) or can contribute to enhancing the spreading of perception-based phonolexical updates to the entire L2 lexicon and to spontaneous L2 speech production (Rocca et al., 2025). At present, the mechanisms that underlie the updating of phonolexical representations in the mental lexicon are not fully understood and therefore it is currently not possible to implement specific training methods that would be maximally efficient at changing the imprecise pronunciation of words learners have established in their L2 lexicons.
In addition, there is large variability in L2 learners’ phonological acquisition, in general and in instructed SLA in particular, in terms of pronunciation instruction and training gains and overall attainment levels. Understanding the multi-componential nature of the sources that underlie inter-learner variability in phonological acquisition is fundamental in adapting methods to learners’ pronunciation learning needs and their individual cognitive and emotional profiles (Mora, 2022). Research into L2 pronunciation has only recently begun to investigate the complex role of individual differences in L2 pronunciation learning in instructed SLA. The preliminary picture emerging suggests that individual factors interact with one another in complex ways to enhance or constrain pronunciation development and points to the need for further research in this area, as outlined in the following section.
4. Domain-general auditory processing skills and cognitive and emotional individual differences in L2 pronunciation learning
Sources of individual differences in L2 pronunciation learning, which include experience-related factors linked to overall L2 proficiency (e.g. vocabulary size, amount of L2 use), socio-psychological emotional variables linked to pronunciation learning (e.g. motivation, anxiety, enjoyment), domain-general auditory processing skills (e.g. frequency discrimination), and cognitive and aptitude-related factors (e.g. memory, inhibition, and attention), interact with one another to contribute differentially to L2 pronunciation learning as a function of individual learner profiles, learning contexts, and pronunciation teaching approaches. One way to think about individual differences factors is to conceptualize them as cue-enhancement devices. For example, good attention control is likely to help learners pay attention to relevant L2-specific phonetic cues during L2 speech processing. Another way to conceptualize them is in terms of a bottleneck that determines how much learners can benefit from auditory input. Several factors might play this role in L2 speech learning, but they would also vary as to their relative weight at an individual level. For example, for someone receiving a particular type of input (e.g. three weekly hours of classroom instruction + regular weekly exposure to L2 captioned video) through a given type of instruction (form-focused grammar-centred CLT), auditory processing skills may function as a bottleneck and then other factors (e.g. L2 proficiency, anxiety, attention control, or motivation) may further constrain input processing to varying degrees. Saito’s (2023) auditory precision hypothesis-L2 advocates for the fundamental role of domain-general auditory processing in L2 speech acquisition in allowing learners to maximally benefit from input exposure and pronunciation training, thus leading to more advanced L2 pronunciation proficiency.
Within this view, auditory processing skills are supported by cognitive attentional mechanisms that allow learners to efficiently direct their attention towards relevant L2-specific phonetic features and by integration mechanisms that allow them to transform perceptual representations into motor actions generating L2 speech output (Saito et al., 2024). The interaction between auditory processing skills and attentional and integration skills is partly determined by how learners differ from one another in such skills. In addition, the interaction between instructional conditions, input exposure, and cognitive and auditory processing skills is likely to shape L2 learners’ phonolexical representations, and eventually pronunciation accuracy during L2 use. That is, individual differences will affect individual learners differently in shaping their representations and ultimately their pronunciation development. For the learner model represented in Fig. 1 below, domain-general auditory processing skills appear to be the primary source of individual differences in L2 speech learning (1), constraining how much the learner can benefit from the L2 speech input and exposure received through pronunciation instruction and learning, followed by attention control (2), motivation (3), emotions such as speaking anxiety (4), and proficiency factors like vocabulary size (5), in decreasing order of importance. The specific combination of individual differences factors and their relative weight will partly determine the nature and shape of the learners’ phonetic, phonological, and phonolexical representations, which will eventually be reflected in the learners’ speech output during L2 use.

Figure 1. Model of a learner with individual differences factors in order of importance (1–5).
Whether certain factors may stand out as more fundamental (for most learners in general) in constraining how much learners can benefit from L2 exposure and pronunciation training and instruction, or the extent to which different factors play a stronger role for different learners or learning environments, still remains an empirical question in need of further research. For example, some researchers have hypothesized that it is domain-general auditory processing skills and cognitive factors that are most fundamental (e.g. Saito, 2023). Investigating the joint contribution of auditory processing and cognitive skills in L2 pronunciation learning may help clarify some of the mixed findings on the role of cognitive individual differences in L2 speech learning (e.g. Saito et al., 2024). From a research perspective, the key difficulty in determining the role of individual factors in pronunciation development in instructed SLA lies in the possibility of any given factor playing a role for specific learner groups and learning contexts, but not for others. Critical methodological features of research design such as sample size, whether the data is longitudinal or cross-sectional, or whether several factors are tested on the same participants within the same study, will also affect the extent to which the contribution of individual differences factors can reliably be measured. For example, in a recent study, Saito et al. (2025) investigated the relationship between individual differences in motivation (ideal self and ought-to self), emotions (enjoyment and anxiety), and L2 speech learning longitudinally (three testing times in 1.5 years, a rarity in individual differences research) among 121 Japanese EFL high school students.
Unlike findings in previous cross-sectional research pointing to its detrimental effect on language learning, anxiety did not surface as a significant predictor of L2 speech learning, whereas both motivation and enjoyment predicted increased classroom practice and led to significant speech learning. The authors point out that anxiety may play less of a role longitudinally than it does when its effects are measured cross-sectionally, which highlights the importance of conducting more longitudinal research on individual differences (Nagle, 2023).
Research on the role of attention control in L2 speech learning offers another example of mixed findings related to learner populations and pronunciation targets (perception and production of vowels and consonants). In two related studies on the role of inhibitory control and attention control with instructed L2 learners (see Table 2), we obtained inconsistent results related to learner populations and testing targets despite using the same testing instruments for cognitive skills (a retrieval-induced forgetting task for inhibition and an alternating runs attention-switching task for attention) and pronunciation proficiency (a categorical ABX discrimination task in perception and a delayed sentence repetition task in production).
Table 2. Summary of outcomes of individual differences studies on inhibition and attention

* indicates significance,
† indicates marginal significance.
In Study 1 (Darcy et al., 2016; Mora & Darcy, 2023) inhibition and attention were tested bidirectionally on comparable populations of English learners of Spanish and Spanish learners of English. Inhibitory control explained a significant 17.6% of the variance in vowel perception, but not in consonant perception, whereas in production it explained a non-significant 11.6% in consonant but not vowel production. On the other hand, attention control was found to explain some (non-significant) variance only in vowel production (11.2%). In Study 2 (Darcy & Mora, 2016; Mora & Darcy, 2016), two groups of learners, monolingual Spanish speakers and bilingual Spanish-Catalan speakers, were tested on their inhibitory and attention control skills and their L2 English perception and production skills as in Study 1. However, with this population, inhibition was found to explain a significant 46.2% of the variance in vowel and consonant perception (not in production), but only in the monolingual group (not in the comparable bilingual group). On the other hand, attention control explained a significant amount of variance in vowel and consonant production (but not perception) and only for the monolingual group. These results suggest that inhibition and attention contribute differently to vowels and consonants, to perception and production, and to monolingual and bilingual populations. Interestingly, Huensch (2024) replicated the first of these studies on inhibitory control and added two more inhibition tasks (Simon, Stroop) with L1 English learners of Spanish and found that cognitive measures only explained a small non-significant amount of variance (1%–6%) in the perception and production of vowels and consonants.
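For readers less familiar with "variance explained", the following minimal sketch shows how the proportion of variance in an outcome accounted for by a single predictor (R²) can be computed in the simplest case, where it equals the squared Pearson correlation. This is an illustration only and not the analysis used in the studies above, which may have relied on different regression models; the function name and all scores are hypothetical.

```python
from statistics import mean

def variance_explained(predictor, outcome):
    """R-squared for a one-predictor linear regression.

    With a single predictor this equals the squared Pearson
    correlation between predictor and outcome.
    """
    mx, my = mean(predictor), mean(outcome)
    sxy = sum((x - mx) * (y - my) for x, y in zip(predictor, outcome))
    sxx = sum((x - mx) ** 2 for x in predictor)
    syy = sum((y - my) ** 2 for y in outcome)
    return sxy ** 2 / (sxx * syy)

# Hypothetical scores, invented for illustration (not data from any study):
# an inhibition score and a vowel discrimination accuracy per learner.
inhibition = [0.12, 0.30, 0.25, 0.41, 0.18, 0.35]
vowel_abx = [0.61, 0.70, 0.58, 0.79, 0.66, 0.64]

r2 = variance_explained(inhibition, vowel_abx)
print(f"Variance explained: {r2 * 100:.1f}%")
```

Read this way, a figure such as 17.6% corresponds to R² = 0.176, which in the single-predictor case would amount to a correlation of roughly 0.42 in absolute value.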
To summarize, previous and current research on individual differences in L2 speech learning, especially in instructed SLA, suggests that individual variation is pervasive throughout and remains a very challenging research endeavour. Experiential, affective, cognitive and auditory processing factors are likely to interact with one another differently in different learners according to individual learning styles, learning contexts, instructional methods, and training conditions and tasks, making the explanation of the sources of inter-learner variation in L2 pronunciation proficiency extremely challenging.
5. Conclusions
This article has presented an overview of challenging research topics in L2 pronunciation learning and teaching in instructed SLA with the aim of highlighting fruitful avenues for future research. One such challenge is finding out whether the principles and predictions of current models of L2 speech learning developed to account for phonological development in naturalistic environments, such as PAM-L2 and SLM-r, can be applied to and hold for instructed SLA. To this end, research should investigate the long-term effects of enriching input conditions and training cross-language phonetic differences and L2 segmental contrasts at an early stage. Still, a focus on pronunciation is not easy to integrate into communicative approaches to language teaching that prioritize meaning-focused activities. Attempts at integrating a focus on phonetic form into CLT (e.g. TBPT) are currently scarce and under-researched but constitute a fruitful approach to L2 pronunciation teaching and learning that should be explored and developed further, both from pedagogical and research perspectives.
A second important challenge involves adjusting instructional methods to reflect the recent paradigm shift from an emphasis on accent reduction and nativelikeness to an emphasis on the development of comprehensible speech and intelligible pronunciation. Although traditional explicit form-focused pronunciation instruction as well as approaches that combine pronunciation instruction with communicative practice lead to benefits in L2 pronunciation, there is still little empirical evidence of their effectiveness to develop global dimensions of L2 pronunciation proficiency (e.g. comprehensibility). Recent technological advances (AI, ASR) in combination with rich input-based pronunciation training techniques (shadowing, textually-enhanced captioned video, accent imitation) may provide new ways of training pronunciation globally to effect larger impacts on the development of overall speech intelligibility.
Finally, a third challenge with important pedagogical implications involves gaining a better understanding of the sources of inter-learner variation in L2 pronunciation performance, development, and attainment. Research-informed knowledge of the relative contribution of different factors (auditory processing skills, cognitive individual differences in attention control, motivation and emotional variables) to L2 speech learning in relation to the characteristics of specific learner populations under different instructional methods and in a variety of learning contexts would be very valuable in developing tasks and training methods tailored to match their pronunciation learning needs. I hope that the three research challenges outlined above, among others, pave the way for exciting new research avenues in L2 pronunciation teaching and learning.
Joan C. Mora is professor of English applied linguistics in the Department of Modern Languages and Literatures and English Studies at the University of Barcelona (UB) in Spain where he teaches courses in English phonetics and phonology and the acquisition of second language (L2) speech. His research aims at gaining a better understanding of how contextual and individual factors interact with pronunciation teaching methods and training techniques to shape L2 speech learning in instructed second language acquisition (SLA). His current research interests focus on the role of cognitive and emotional individual differences in the development of L2 pronunciation and speaking fluency, phonological learning in the mental lexicon, phonetic training methods, multimodal pronunciation training, and task-based pronunciation teaching and learning in instructed SLA.