Introduction
With rising awareness of diversity, equity, and inclusion, there is a pressing need to examine an ongoing form of social discrimination in which individuals’ language use is misjudged and misunderstood because of listeners’ stereotypes about speakers’ social identities. Much research has demonstrated listeners’ biases toward second language (L2)-accented speech — that is, perceiving accented utterances as less credible when delivering trivia statements (Lev-Ari & Keysar, Reference Lev-Ari and Keysar2010), less acceptable for certain professional positions (Kang et al., Reference Kang, Yaw and Kostromitina2023), or less grammatical in spoken language (Ruivivar & Collins, Reference Ruivivar and Collins2019), owing to listeners’ attitudinal and stereotyping biases (Kang & Rubin, Reference Kang and Rubin2009). Some drivers of these judgments are also known to stem from listeners’ backgrounds and previous experiences (Kang, Reference Kang2012), although there is contradictory evidence indicating that listeners’ judgments are little affected by their background or accent familiarity (Munro et al., Reference Munro, Derwing and Morton2006; Powers et al., Reference Powers, Schedl, Wilson Leung and Butler1999).
Artificial Intelligence (AI), on the other hand, offers the promise of processing input data and predicting reasonable output using an intelligence developed by exposing complex deep learning models to training data. While AI can be applied to many facets of decision-making (e.g., finding patterns in numerical data, detecting features in audio/images, or predicting the next token in a chat sequence, as in text-based Generative AI), its output can give the impression of a highly intelligent and knowledgeable system. In some contexts, AI has emerged as a potential alternative to human decision-making and a way to mitigate biased judgments. In such contexts, AI has been presented as an objective tool, for example in automated L2-accented speech scoring systems (Zechner & Evanini, Reference Zechner and Evanini2019). AI is also used to assess characteristics such as gender, age, and mood in facial-analysis systems (Buolamwini & Gebru, Reference Buolamwini, Gebru, Friedler and Wilson2018; Wolfe & Caliskan, Reference Wolfe and Caliskan2022), and to distinguish between L2 speakers’ writing and AI-generated texts (Jiang et al., Reference Jiang, Hao, Fauss and Li2024). Nevertheless, many current AI models seem to replicate, and at times even amplify or distort, the social biases found in human language in unexpected ways. Research on raciolinguistic, gender, and language background factors shows that AI systems do not process speech with diverse characteristics equitably. For instance, they exhibit lower accuracy in recognizing African American Language in workplace settings (Martin & Wright, Reference Martin and Wright2023), gender-based disparities in YouTube’s auto-generated captions (Tatman, Reference Tatman2017), and reduced recognition accuracy for Chinese first-language (L1) speakers compared to Spanish and Indian L1 speakers of English (Bae & Kang, Reference Bae and Kang2024). Despite these findings, it is at present difficult to determine the exact cause of AI bias, as multiple factors, such as training data, model architecture, feature selection, and other components of the machine learning pipeline, may contribute to biased output (Bommasani et al., Reference Bommasani, Hudson, Adeli, Altman, Arora, von Arx, Demszky, Bernstein, Bohg, Bosselut, Brunskill and Brynjolfsson2022).
Overall, a substantial amount of research in applied linguistics (AL), social psychology, and related fields has enhanced our understanding of social biases among humans and introduced interventions to mitigate their effects. However, the implications of such social biases and their persistence in AI have not yet been fully understood by AL scholars or computer scientists who investigate AI ethics (e.g., Abid et al., Reference Abid, Farooqi and Zou2021; Tatman, Reference Tatman2017; Wolfe & Caliskan, Reference Wolfe and Caliskan2022) and AI explainability/interpretability (Saeed & Omlin, Reference Saeed and Omlin2023), nor by those in a variety of fields who use AI technologies (Bommasani et al., Reference Bommasani, Hudson, Adeli, Altman, Arora, von Arx, Demszky, Bernstein, Bohg, Bosselut, Brunskill and Brynjolfsson2022; Field et al., Reference Field, Blodgett, Waseem and Tsvetkov2021). Accordingly, the current paper discusses issues in social biases that impact users of different language varieties and relates these to the currently limited research on biases in AI, with a particular focus on stereotyping and AI processing of speech. It begins by discussing fundamental concepts of human biases and stereotyping, then moves on to AI-related counterparts, reviewing recent research that illuminates bias in AI-based Natural Language Processing (NLP), focusing on generative AI and Automatic Speech Recognition (ASR). The paper concludes with specific recommendations and future directions for research and pedagogical practices.
Bias, linguistic stereotyping, and reverse linguistic stereotyping
The social groups we belong to help form our identities, both internally and externally, at the individual and social levels. People may be, rather subconsciously, biased against others outside of their own social group (out-groups), showing prejudice, stereotypes, and even discrimination (Kang & Rubin, Reference Kang and Rubin2009). An in-group refers to a social group that a person identifies as being a part of, while an out-group is a social group that one does not identify with. We often consider the various types of bias (i.e., prejudice, stereotyping, and discrimination) in concert because they are closely related to one another, but they can also occur separately (Dovidio et al., Reference Dovidio, Gaertner and Kawakami2003). According to Allport (Reference Allport1954), a stereotype is a cognitive bias that refers to a specific belief or assumption about individuals based on their membership in a group; it can be positive or negative. Some researchers (Giles & Niedzielski, Reference Giles, Niedzielski, Bauer and Trudgill1998) argue that stereotypes are not based on inherent differences in the speech produced but rather reflect social group associations that listeners learn from a young age. Prejudice, on the other hand, is an emotional bias linked to a negative attitude and feeling toward an individual based on their membership in a particular social group. What is more concerning is discrimination, which refers to acting on prejudiced attitudes toward a group of people and can therefore bring more harmful and negative effects than prejudice or stereotyped attitudes alone. Within the field of AL, these phenomena are often linked to linguistic or raciolinguistic discrimination.
Biased perceptual judgments can be based on linguistic factors as well as non-linguistic ones, but the latter often play a crucial role in human judgments, sometimes far more than one would expect. These non-linguistic variables have been examined most extensively in the context of higher education, though their effects can also occur elsewhere. For instance, Kang and Rubin’s reverse linguistic stereotyping (RLS) research involved international teaching assistants (ITAs)/instructors and domestic undergraduate students (Kang & Rubin, Reference Kang and Rubin2009; Rubin, Reference Rubin1992) or university L2 students with non-native English teachers (Ghanem & Kang, Reference Ghanem and Kang2021). RLS posits that if extraneous information leads listeners to expect to hear a speaker with a marked non-native accent, then their speech perceptions are likely to be distorted in that direction. Additionally, Rubin and his colleagues have extended RLS research beyond educational settings to business (Rubin et al., Reference Rubin, Ainsworth, Cho, Turk and Winn1999) and healthcare (HIV-prevention counseling, Rubin et al., Reference Rubin, Healy, Zath, Gardiner and Moore1997; perceptions of health care aides, Rubin et al., Reference Rubin, Coles and Barnett2016). Such listener bias also has real-world consequences, such as modified communicative behaviors (e.g., asking fewer questions and employing communication avoidance strategies when interacting with L2 English speakers; Lindemann, Reference Lindemann2003) or lower evaluations of teaching effectiveness (Kang & Rubin, Reference Kang and Rubin2009).
RLS is a theoretical framework in which attributions of a speaker’s group membership or racial identity cue distorted perceptions of that speaker’s language style or proficiency (Rubin, Reference Rubin1992). The RLS phenomenon demonstrates how listener expectations based on speaker ethnicity affect judgments of speakers, and RLS research typically utilizes contextual information to examine listeners’ expectations of how a speaker will sound. What is more commonly known among social psychologists is the linguistic stereotyping (LS) hypothesis, which states that non-standard speech varieties or accents can cue negative listener attitudes toward a speaker (Bradac et al., Reference Bradac, Cargile, Hallett, Robinson and Giles2001; Lambert et al., Reference Lambert, Hodgson, Gardner and Fillenbaum1960). Extensive research on the LS hypothesis supports the presence of such social issues. In RLS, by contrast, listener beliefs about how a person will sound, based on visual and other contextual characteristics, can override what is actually present in that person’s speech. These two frameworks are therefore the approaches most researchers focus on when discussing stereotyping.
Some examples of RLS include the use of L1 English speakers’ photos with different ethno-racial backgrounds (Babel & Russell, Reference Babel and Russell2015; Kang & Rubin, Reference Kang and Rubin2009; Rubin, Reference Rubin1992) or a name associated with an ethnic minority group (Prikhodkine et al., Reference Prikhodkine, Correia Saavedra and Dos Santos Mamed2016). Kang and Rubin’s (Reference Kang and Rubin2009) and Rubin’s (Reference Rubin1992) studies have shown that listener expectations based on speaker ethnicity affect perceptions of comprehensibility and accuracy. In their studies, participants listened to a tape-recorded lecture produced by an L1 user of Standard American English (SAE). Instructor ethnicity was operationalized by projecting a photograph of either a Caucasian or an Asian face. Listeners who were shown the fabricated picture of an Asian instructor perceived a stronger foreign accent and scored lower on a listening comprehension test than those who were shown the photo of a Caucasian instructor, even though what they heard was identical. In other words, listening comprehension appeared to be undermined simply by visually identifying the speaker as Asian. Kang and Rubin’s study further illustrated that listener RLS, as well as listener background factors, contributed substantial variance to ratings of L2 users’ oral performance. Lindemann (Reference Lindemann2005) likewise examined how U.S. English speakers constructed social categories for, and sometimes discriminated linguistically against, people outside the United States. Her study showed that U.S. undergraduate students evaluated accented speakers differently according to their familiarity with, and socio-political beliefs about, the speakers’ countries of origin. Listener background traits, such as native language, exposure to various speech forms, and experience with foreign language study or teaching, are recognized to influence how attitudes and expectations develop and are shaped (e.g., Adank et al., Reference Adank, Evans, Stuart-Smith and Scotti2009; Kang & Yaw, Reference Kang and Yaw2021).
Case studies have provided strong evidence that non-linguistic attributes matter in how people are perceived and judged. Piller (Reference Piller2016), for example, offers another illustration of how language, or any linguistic factor, is not always the primary obstacle for immigrants seeking employment. Her study reports on a group of Iraqi translators and interpreters who worked with the Australian army and later relocated to Australia after the troops’ withdrawal. A few years after resettlement, only nine of the 223 Iraqis surveyed reported being employed full-time, and of these, only one reported holding a job within their field of expertise, despite over 60% (N = 135) holding a university degree and all having prior work experience in their fields.
Kang and Yaw’s (2024) recent case study also illustrates such raciolinguistic phenomena through immigrants’ employment prospects in the U.S. restaurant business. The study examined restaurant owner-managers’ reactions to an applicant with a North African racial identity who could produce two distinct accents: (1) a “standard” North American English accent and (2) a North African accent. The participants were six owner-managers of restaurants who agreed to interview the applicant after she contacted them about job openings. In interviews 1‒3, the applicant used her North African accent for all spoken communication; with participants 4‒6, communication was conducted in a General American (U.S.) accent. The findings confirmed that the intersection of language and race carries crucial weight for listeners in social contexts. That is, a speaker’s physical appearance, as well as a non-standard accent, can play a major role in setting the listener’s expectations of how the speaker will sound (Burgoon, Reference Burgoon1993). When the candidate spoke with her North African accent, she was immediately turned down. More surprisingly, even when the candidate interviewed with the General American accent, the restaurant owner-managers were very hesitant to hire her and expressed concerns about her employment prospects, regardless of her accent or English competence. They were unwilling to hire the candidate because of her foreign (non-White-American) ethno-physical appearance. In this case, the employment decision was less about the candidate’s English itself than about her race and ethnicity. Employers stated that the applicant was “not American enough,” which largely meant she did not present as Caucasian White. Indeed, employability was treated not just as a linguistic issue but as a raciolinguistic one (Rubin & Smith, Reference Rubin and Smith1990). This illustrates the complexity of human judgments, which can be intricately intertwined with social bias and stereotyping.
In general, social discrimination persists as individuals’ language use is often misjudged or misunderstood due to listeners’ stereotypes about the speakers’ social identities. Since language judgments can directly affect individuals’ everyday lives – such as educational opportunities, career prospects, and civil rights – these stereotypes can have a profound influence on real-world outcomes and daily experiences. There appears to be a growing need for methods to identify biases among individuals, including students, teachers, physicians, police officers, and airline personnel. However, there is still no clear consensus on how to effectively measure human listeners’ biased reactions in everyday interactions. As a result, issues related to stereotyping and bias in human judgment across different social contexts remain a significant area of interest for researchers and practitioners alike.
AI technologies and social biases
The term Artificial Intelligence is used to describe systems that make predictions based on a prompt and an understanding of the underlying knowledge required to complete a task, developed through training data (Russell & Norvig, Reference Russell and Norvig2016). These systems are considered intelligent because, with sufficient architectural complexity (i.e., multiple layers of latent computation), access to input features, and large enough training datasets, they can parallel human decision-making across a variety of tasks. These tasks include classification, recognition, and generation in natural and programming languages, as well as in audio, images, and other types of data. Because of the control model designers have over model architecture and training data, AI technologies can make predictions reliably within certain contexts for certain tasks. In constrained environments with few output options, such as deciding whether an image contains a target item, a training set of fewer than a hundred images with and without the target is sufficient for relatively high accuracy (see https://studio.code.org/s/oceans from code.org for an interactive AI training example of detecting fish in pictures). However, tasking AI with making predictions becomes more problematic when (a) the prompt stimuli are limited in quantity and/or variability (e.g., only a few bits of data or only one category of data amongst many possible categories), (b) relationships amongst environmental factors are complex (e.g., the prediction is correct only in a small subset of conditions), and/or (c) there are numerous output options (e.g., selecting the correct option from thousands of options is more difficult than from a few). For example, Large Language Models (LLMs) such as OpenAI’s GPT-4 are designed to predict the most probable next words in a sequence based on a given prompt. Previous research has confirmed that the text produced by LLMs is more accurate when the prompt is sufficiently detailed and guides the model to arrive at reasoned predictions (Kojima et al., Reference Kojima, Gu, Reid, Matsuo and Iwasawa2022).
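To make the role of prompting concrete, the sketch below contrasts a bare question with one that explicitly asks the model to reason step by step, in the spirit of Kojima et al.’s (Reference Kojima, Gu, Reid, Matsuo and Iwasawa2022) zero-shot chain-of-thought prompting. It is a minimal illustration using the OpenAI Python client; the model name, prompt wording, and settings are illustrative choices rather than a prescribed setup.

```python
# Minimal sketch: comparing a bare prompt with a reasoning-guided prompt.
# Assumes the OpenAI Python client (v1+) and an API key in the environment.
from openai import OpenAI

client = OpenAI()

QUESTION = "A train leaves at 9:40 and arrives at 13:05. How long is the trip?"

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's text response."""
    response = client.chat.completions.create(
        model="gpt-4o",          # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,           # reduce run-to-run variation for comparison
    )
    return response.choices[0].message.content

# Condition 1: bare question.
print(ask(QUESTION))

# Condition 2: the same question with explicit reasoning guidance,
# following the zero-shot chain-of-thought idea ("think step by step").
print(ask(QUESTION + "\nLet's think step by step before giving the final answer."))
```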
Bias in data created by humans can transfer to AI models in numerous ways, primarily through training data, model construction, validation, and use. When models are trained with data collected from humans, these biases can be expected to carry over into model predictions, because AI models lack awareness of contextual factors beyond their training data. Furthermore, AI models are also examined through experiments in which prompts and outputs are assessed by humans (Navigli et al., Reference Navigli, Conia and Ross2023). AI models that process natural language are trained on language datasets created by humans, which are likely to contain elements of social bias toward users of marginalized language varieties because of the subtle and sometimes imperceptible nature of bias in language. In fact, expressions of bias need not be overt; subtle relationships between words learned during model training can introduce social bias (e.g., anti-Muslim sentiment, transphobia, or a preference for standard-variety users) into the layers of an AI model (Bommasani et al., Reference Bommasani, Hudson, Adeli, Altman, Arora, von Arx, Demszky, Bernstein, Bohg, Bosselut, Brunskill and Brynjolfsson2022; Hofmann et al., Reference Hofmann, Kalluri, Jurafsky and King2024). Therefore, AI can mirror, or even amplify and distort, the biases found amongst humans in unpredictable ways.
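One way to see how such subtle word-level relationships can carry bias is to inspect associations in pretrained word embeddings, in the spirit of word-embedding association tests. The sketch below is a simplified, hedged illustration, not the method used in the studies cited above; the embedding model and word lists are illustrative choices.

```python
# Simplified sketch of probing word-embedding associations for social bias.
# Assumes gensim is installed; the first call downloads pretrained GloVe vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # illustrative pretrained embedding

# Illustrative word lists (not taken from any specific study).
target_groups = {"group_a": ["american", "british"], "group_b": ["muslim", "arab"]}
attributes = {"pleasant": ["peaceful", "friendly"], "unpleasant": ["violent", "hostile"]}

def mean_similarity(words_a, words_b):
    """Average cosine similarity between two word lists."""
    pairs = [(a, b) for a in words_a for b in words_b]
    return sum(vectors.similarity(a, b) for a, b in pairs) / len(pairs)

for group, group_words in target_groups.items():
    pleasant = mean_similarity(group_words, attributes["pleasant"])
    unpleasant = mean_similarity(group_words, attributes["unpleasant"])
    # A large gap between groups in (pleasant vs. unpleasant) association
    # is the kind of subtle, learned relationship discussed above.
    print(f"{group}: pleasant={pleasant:.3f}, unpleasant={unpleasant:.3f}")
```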
AI researchers have taken diverse approaches and perspectives in investigating social biases in commonly used AI models. Bommasani et al. (Reference Bommasani, Hudson, Adeli, Altman, Arora, von Arx, Demszky, Bernstein, Bohg, Bosselut, Brunskill and Brynjolfsson2022) described the threats posed by foundation models in terms of intrinsic biases that are expressed when a model is used to generate text or other output. These can further be divided into misrepresentations (i.e., stereotypes), under-representations, and over-representations. In a review of NLP research on racial bias, Field et al. (Reference Field, Blodgett, Waseem and Tsvetkov2021) identified that biased output is influenced not only by the representativeness of the training data but also by factors such as who labeled the data, how it was labeled, the model architecture, and the specific application and user.
When considering the LLMs that drive AI output in applications such as ChatGPT, Field et al. (Reference Field, Blodgett, Waseem and Tsvetkov2021) outlined how bias can be introduced in LLMs that use publicly available information repositories as part of their training datasets. Navigli et al. (Reference Navigli, Conia and Ross2023) illustrate that sports is one of the most frequent topics in Wikipedia, which is commonly used in training LLMs, leaving other domains, such as chemistry, relatively under-represented. While the original investigators did not, and arguably could not, investigate the impact of these biases on model output, we ran an anecdotal investigation using GPT-4o (API platform, default settings). We asked two questions: “Who scored the most points in a game?” and “Who discovered the most chemical compounds?” Across ten attempts, GPT-4o named different chemists: Carl Wilhelm Scheele appeared in seven responses, while Marie Curie was mentioned only twice. For the first question, Wilt Chamberlain and his 100-point game for the Philadelphia Warriors in 1962 was the consistent response every time. While these results are anecdotal, Field et al. (Reference Field, Blodgett, Waseem and Tsvetkov2021) and Navigli et al. (Reference Navigli, Conia and Ross2023) posit that biases in AI models are shaped not only by the topics covered but also by the perspectives of the individuals who contribute to publicly available texts. These biases, rooted in the interests and viewpoints of those contributors, represent just one form of social bias embedded in many AI models used today.
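The anecdotal probe described above can be approximated with a short script that queries the model repeatedly and tallies which names appear in its answers. This is a hedged sketch of that procedure, not the exact script we used; the model name, query loop, and candidate name list are illustrative, and results will vary across runs and model versions.

```python
# Sketch of a repeated-query probe: ask the same question several times
# and tally which named answers the model returns.
from collections import Counter
from openai import OpenAI

client = OpenAI()
QUESTION = "Who discovered the most chemical compounds?"
ATTEMPTS = 10

answers = []
for _ in range(ATTEMPTS):
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; default settings, as in the probe above
        messages=[{"role": "user", "content": QUESTION}],
    )
    answers.append(response.choices[0].message.content)

# Crude tally of candidate names mentioned in the free-text answers
# (candidate list is illustrative, not exhaustive).
candidates = ["Scheele", "Curie", "Lavoisier", "Davy"]
counts = Counter(name for text in answers for name in candidates if name in text)
print(counts)
```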
Previous research has also found that many end-user AI applications over-represent Western, White, and heterosexual groups (Abid et al., Reference Abid, Farooqi and Zou2021; Hofmann et al., Reference Hofmann, Kalluri, Jurafsky and King2024; Nozza et al., Reference Nozza, Bianchi and Hovy2021). For example, Nozza et al. (Reference Nozza, Bianchi and Hovy2021) found that BERT, an earlier transformer-based language model predating the GPT models that power ChatGPT, produced sentence completions containing harmful words 4%‒9% of the time, replicating societal gender bias and homophobia/transphobia. Abid et al. (Reference Abid, Farooqi and Zou2021) likewise found a consistent anti-Muslim bias in GPT-3 sentence completions that associated Muslims with violence. Hofmann et al. (Reference Hofmann, Kalluri, Jurafsky and King2024) examined covert and overt racism in words associated with SAE and African American English (AAE) in several GPT models. They found that overt output associated with AAE was positive, but that covert racism (i.e., dialect prejudice) against AAE was more negative than observed in previous experiments with humans. Furthermore, recent AI models are multimodal and can incorporate visual, audio, textual, and contextual information as input for training or prediction. While these models are promising, Wolfe and Caliskan (Reference Wolfe and Caliskan2022) conducted a series of studies on the conceptualization of nationality in leading image-language models and found that these models associated the demonym American with White across numerous classification, generation, and downstream tasks. These studies demonstrate AI models’ capacity to amplify social bias in ways that require examination in multiple dimensions.
The social bias baked into AI models also results in unpredictable output when prompts are created by or associated with users of marginalized language varieties. In an experiment by Reusens et al. (Reference Reusens, Borchert, De Weerdt and Baesens2024), commonly used LLMs were given classification and generation tasks under two conditions: with and without the user’s L1/L2 status given in the prompt text. Their results indicated that many leading LLMs produced more biased and incorrect information for L2 and non-Western users than for L1 and Western users. The biased output worsened when the model was made aware of the user’s L1/L2 status, sometimes diverging from the task prompt and generating a response in another language. These findings highlight the influence of L1/L2 and (non-)Western status on LLM output. However, recent efforts by Jiang et al. (Reference Jiang, Hao, Fauss and Li2024) demonstrate a promising approach to reducing bias against L2 writing in classification tasks. They developed a ChatGPT-generated essay detection system that exhibited minimal bias against L2 writers as compared to L1 writers. Their modeling approach suggests that, when targeted training data, model creation, and specific tasks are considered carefully, specific applications of AI may be able to overcome certain kinds of bias.
Overall, these studies confirm that AI technologies are not exempt from the bias and stereotyping issues documented in human social bias research. Such AI biases can have a direct impact on users of marginalized language varieties and on language learners; this issue is therefore discussed in more detail in the following section.
Bias in speech varieties and AI technologies
Perceiving speech is a constructivist process (von Glasersfeld, Reference von Glasersfeld1995) in which individual listeners impose patterning based on serial probabilities about what sounds make sense for them to hear (Rubin, Reference Rubin, Levis and LeVelle2012). As discussed above, a substantial body of research supports the LS and RLS hypotheses across a wide range of contexts (see Fuertes et al., Reference Fuertes, Gottdiener, Martin, Gilbert and Giles2012 for a meta-analytic review of this topic). In U.S. higher education, ITAs with a variety of accents may face blame from their students for low grades (Fitch & Morgan, Reference Fitch and Morgan2003), have lower course enrollments than domestic Teaching Assistants (Bresnahan & Sun Kim, Reference Bresnahan and Sun Kim1993), and receive lower course evaluation ratings (Jiang, Reference Jiang2014). In non-educational settings, such as criminal law cases, guilt is often attributed differently to suspects based on their accent varieties: speakers with marginalized or regional accents are frequently perceived as more likely to be guilty than individuals with higher-prestige accents (Dixon et al., Reference Dixon, Mahoney and Cocks2002). This pattern is also evident in employment contexts. As Lippi-Green (Reference Lippi-Green2012) strongly emphasizes, speakers of L2 varieties face lower chances of being hired, increased risks of unemployment, and greater difficulties in securing promotions compared to their L1-accented colleagues. Linguistic discrimination is a genuine outcome of these stereotypes, affecting various aspects of daily life for individuals who speak less prestigious language varieties or have L2 accents.
Similarly, the bias introduced through human-produced training datasets and human validation procedures affects AI-based ASR, and therefore AI’s use of speech in applications such as spoken dialogue systems and virtual assistants. Previous studies have confirmed these biases across accent varieties and gender. For example, Martin and Wright’s (Reference Martin and Wright2023) comprehensive review clearly described the subpar performance of ASR systems in accurately capturing African American Language. This issue was empirically corroborated by Koenecke et al. (Reference Koenecke, Nam, Lake, Nudell, Quartey, Mengesha and Goel2020), who examined ASR systems developed by Apple, Amazon, IBM, Microsoft, and Google. Their study revealed that these systems exhibited significantly lower transcription accuracy for the speech of Black Americans compared to White Americans. These two studies echo Tatman and Kasten’s (Reference Tatman and Kasten2017) earlier findings, which reported that two transcription systems, Microsoft Bing Speech and YouTube, yielded the highest word error rates (WER) for African American speech and the lowest error rates for General American speech. Other dimensions of language variation bias were reported in Tatman (Reference Tatman2017). In that earlier study, Scottish speakers experienced lower transcription accuracy in comparison to other varieties, confirming the presence of dialect biases. The analysis also revealed gender biases in ASR, with male speakers more accurately transcribed by YouTube’s automatically generated captions.
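Much of the evidence above rests on word error rate (WER), which compares an ASR transcript against a human reference transcript and counts substitutions, deletions, and insertions relative to the number of reference words. The sketch below shows how a WER-based group comparison might be computed; it is a minimal illustration using the jiwer package, with invented transcripts rather than data from any of the studies cited.

```python
# Minimal sketch of a WER-based comparison across speaker groups.
# WER = (substitutions + deletions + insertions) / number of reference words.
# Transcripts below are invented for illustration only.
import jiwer

samples = {
    "group_a": [
        ("she sells seashells by the seashore", "she sells seashells by the seashore"),
        ("the weather is nice today", "the weather is nice today"),
    ],
    "group_b": [
        ("she sells seashells by the seashore", "she sell sea shells by the shore"),
        ("the weather is nice today", "the water is nice to day"),
    ],
}

for group, pairs in samples.items():
    references = [ref for ref, _ in pairs]
    hypotheses = [hyp for _, hyp in pairs]
    # jiwer aggregates errors over all sentence pairs in the lists.
    error_rate = jiwer.wer(references, hypotheses)
    print(f"{group}: WER = {error_rate:.2f}")
```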
At the same time, with the rise of interaction with AI and the development of ASR technology, it is essential to consider how these technologies affect L2 learners and speakers with diverse accents in their everyday communication (Moussalli & Cardoso, Reference Moussalli and Cardoso2020). At present, several commercial entities provide ASR-based pronunciation practice applications and services (Walesiak, Reference Walesiak, Kirkova-Naskova, Henderson and Fouz-González2021), some of which report having millions of users (e.g., https://elsaspeak.com/). The evolution of ASR capabilities from earlier probabilistic speech processing methods (e.g., Neri et al., Reference Neri, Cucchiarini and Strik2003) to modern AI-driven technology in language learning has stimulated an increase in research on AI-powered ASR processing and learning by L2 speakers (e.g., Bae & Kang, Reference Bae and Kang2024; Hirschi & Kang, Reference Hirschi, Kang and Sadeghi2024; Inceoglu et al., Reference Inceoglu, Chen and Lim2023).
Early models with probabilistic prediction of speech were known to be relatively reliable and consistent when designed for specific accent varieties. Until 2019, simpler ASR systems operated using a two-stage recognition process: an acoustic model first classified speech signals into phonemes, and these phonemes were then fed into a language model to compute the most likely words (Yu & Deng, Reference Yu and Deng2016). This approach enabled researchers and industry professionals to improve human–computer reliability rates across a variety of use cases, including L2 pronunciation (Neri et al., Reference Neri, Cucchiarini and Strik2003). Furthermore, it was easier to determine which data were causing issues because of the relatively small training datasets (under 25 hours) needed for adequate recognition accuracy. Modern models, however, have improved accuracy with increasingly complex computational models built on deep learning and various types of neural networks (Yu & Deng, Reference Yu and Deng2016).
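To make the two-stage architecture concrete, the toy sketch below separates an acoustic step (mapping acoustic observations to phonemes) from a language-model step (mapping phoneme sequences to the most likely word). Every component here is a deliberately simplified stand-in with made-up lookup tables; real systems of that era used hidden Markov model acoustic models and n-gram language models rather than dictionaries.

```python
# Toy sketch of a two-stage ASR pipeline: acoustic model -> language model.
# The lookup tables are invented; real systems used HMM acoustic models
# and n-gram language models estimated from corpora.

# Stage 1: acoustic model. Map each (pretend) acoustic observation to a phoneme.
ACOUSTIC_TABLE = {"obs1": "k", "obs2": "ae", "obs3": "t"}

def acoustic_model(observations):
    """Classify each acoustic observation into its most likely phoneme."""
    return [ACOUSTIC_TABLE[o] for o in observations]

# Stage 2: language model. Score candidate words given a phoneme sequence.
LEXICON = {
    ("k", "ae", "t"): {"cat": 0.9, "cut": 0.1},
    ("k", "ao", "t"): {"caught": 0.7, "cot": 0.3},
}

def language_model(phonemes):
    """Return the most probable word for the recognized phoneme sequence."""
    candidates = LEXICON.get(tuple(phonemes), {})
    return max(candidates, key=candidates.get) if candidates else "<unknown>"

phonemes = acoustic_model(["obs1", "obs2", "obs3"])
print(language_model(phonemes))  # -> "cat"
```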
Commercial ASR models in the early days of deep learning and LLMs also demonstrated social bias. Lima et al. (Reference Lima, Furtado, Furtado and Almeida2019) analyzed the speech recognition accuracy of Apple’s Siri and Google Assistant for Brazilian Portuguese speakers of different genders and accents, including L2 speakers. Although their study was preliminary, with a relatively small sample size (N = 20), their findings indicated that the speech recognition and assistant response models were more accurate for female speakers than for male speakers and, in some cases, performed better with foreign accents than with certain regional ones. Lima et al. (Reference Lima, Furtado, Furtado and Almeida2019) partially attribute the low performance on some regional accents to lower socioeconomic status, indicating that these models may favor those who are more likely to afford expensive technologies.
With increasingly accessible computational power, the availability of large datasets for ASR model training, and more efficient training approaches for processing complex speech data, ASR has evolved toward end-to-end prediction models. Unlike previous systems, end-to-end models such as Wav2vec 2.0 analyze large amounts of speech data in a self-supervised manner, identifying patterns in acoustic energy and mapping them to lexical items (Baevski et al., Reference Baevski, Zhou, Mohamed and Auli2020). However, the neural networks in Wav2vec 2.0, with their multiple layers of latent computational space, create a prediction system in which it is difficult to understand how or why a specific prediction is made. Furthermore, this model and many others rely on LibriSpeech (Panayotov et al., Reference Panayotov, Chen, Povey and Khudanpur2015), a dataset of 960 hours of public domain audiobooks representing carefully scripted speech, largely by L1 speakers of English, drawn from 1,568 mainly Western books, including David Copperfield, History of the Decline and Fall of the Roman Empire, and the United States Declaration of Independence. Because of the content of these books and the speech features of those who read them, the LibriSpeech dataset is likely to introduce biases along multiple dimensions. These biases, embedded in deep learning models, often go unnoticed in human analysis. This can be particularly problematic when processing L2 speech, as most ASR systems are trained and tested without considering the linguistic background of the speakers or are optimized for the speech patterns of their likely consumers.
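For readers unfamiliar with how such end-to-end models are used, the sketch below transcribes a single audio file with a publicly released Wav2vec 2.0 checkpoint fine-tuned on the 960 hours of LibriSpeech, via the Hugging Face transformers library. The audio filename is a placeholder, and the code is a minimal usage illustration rather than the evaluation setup of any study cited here.

```python
# Minimal sketch: transcribing one audio file with a LibriSpeech-trained
# Wav2vec 2.0 checkpoint (Hugging Face transformers + torchaudio).
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "facebook/wav2vec2-base-960h"  # fine-tuned on 960 h of LibriSpeech
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# "sample.wav" is a placeholder for any mono recording.
waveform, sample_rate = torchaudio.load("sample.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits   # frame-level character scores
predicted_ids = torch.argmax(logits, dim=-1)     # greedy CTC decoding
print(processor.batch_decode(predicted_ids)[0])  # transcript (upper case)
```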
Several studies have also investigated biased recognition rates of ASR systems for L2 speakers. Nacimiento-García et al. (Reference Nacimiento-García, Díaz-Kaas-Nielsen and González-González2024) examined ASR accuracy across varieties of Spanish and speaker genders. They found that Alexa and Whisper had slightly different accuracy rates depending on the gender of the speaker, but clear biases for different accents emerged: speech from southern Spain and the Caribbean showed lower accuracy rates, although the largest Whisper model achieved very high accuracy, resembling the findings of Hirschi and Kang (Reference Hirschi, Kang and Sadeghi2024). A recent study (Bae & Kang, Reference Bae and Kang2024) demonstrated the presence of bias in AI intelligibility of L2-accented speech and differences in intelligibility ratings between human experts and AI. They used three datasets: (1) L2 speakers of Chinese, Indian, Spanish, and South African English (N = 12), (2) read-aloud samples from the same L1s (N = 60), and (3) TOEFL responses from Arabic, Chinese, Korean, and Spanish speakers (N = 40). Intelligibility was operationalized through transcription-based WER, using Apple’s Siri and Google Assistant. Their results revealed that, although human raters produced comparable intelligibility (WER) scores for all speakers, the AI systems’ WERs varied by L1, with significantly higher WERs for Chinese-accented speech than for Indian- and Spanish-accented speech. These findings have important implications for L2 learners and teachers by raising awareness of AI-related fairness issues in L2 learning and technology applications.
The study by Chan et al. (Reference Chan, Choe, Li, Chen, Gao and Holliday2022) evaluated the commercial ASR transcription system Otter using speech corpus data from 24 English varieties. The authors found that Otter had poorer transcription performance for speakers whose native language was tonal, indicating that the linguistic characteristics of L2 speakers’ L1s significantly influenced its performance. These findings are particularly noteworthy because Otter’s ASR system had been trained on several language backgrounds of L2 English, yet the language category (i.e., tonal vs. non-tonal) had a greater impact on ASR accuracy than previous training on a specific set of accents. While these results cannot be attributed to a single mechanism, these biases, along with other documented biases related to race, dialect, and gender, suggest that AI systems are susceptible to accent bias in unpredictable ways. Therefore, just as for human listeners, accent bias in AI remains a significant and ongoing challenge that warrants further investigation and mitigation in future AI development.
Summary
Overall, we have seen that bias and stereotyped judgments in both humans and AI can affect individuals’ communicative behaviors, evaluative styles, and real-life decision-making processes. Therefore, the implications of this social bias research extend far beyond the educational domain. As mentioned earlier, the scope of RLS/LS bias extends to linguistic discrimination involving L2 speakers and immigrants more broadly, providing evidence of how listeners judge speakers from different language, social, and ethnic backgrounds. With the growing interest in sociolinguistic justice in both educational and social communities, more AL-oriented social justice research can advance our understanding of, and interventions for, linguistically subordinated individuals in sociopolitical struggles over language, as a function of inclusive excellence in social and educational settings.
As we have argued throughout, listener beliefs and background factors can have real-world impacts on social judgments, such as the modification of communicative behaviors (Lindemann, Reference Lindemann2003), lower ratings of teaching effectiveness (Kang & Rubin, Reference Kang and Rubin2009; Rubin, Reference Rubin1992), or decisions not to hire qualified job candidates (Kang & Yaw, Reference Kang, Yaw, Kubota and Motha2024). Non-linguistic biased judgments can also affect a teacher’s or employer’s assessment, as they may filter their evaluation of a learner’s or employee’s speech through those expectations, leading to perceived weaknesses in the student’s or employee’s speech that are not actually present. Future research can be conducted to better understand how such biases influence our daily practices. In addition, future studies can consider more up-to-date social and contextual factors in relation to unconscious and implicit biases, such as socially biased reactions. Examples of such implicit factors include listeners’ exposure to certain social media, degree of cultural exposure (e.g., K-pop, K-dramas, or Japanese anime), preferences for social media influencers, and attitudes towards social and political issues (e.g., anti-Asian hate crimes; Tessler et al., Reference Tessler, Choi and Kao2020).
In fact, bias issues are often situated in subtle and implicit contexts, where people do not explicitly express their dislike of a certain group (Piller et al., Reference Piller, Torsh and Smith-Khan2023). As seen in the National Research Council’s (2004) social bias and discrimination guidelines, subtle and implicit biases should be considered just as seriously as intentional and explicit biases. The National Research Council (2004) outlines several relevant sources of discrimination, including statistical profiling, in which people’s perceptions of an individual are based on statistics about the group that they are affiliated with (e.g., believing that someone is uneducated because of their racial background). Further AL investigations of social bias through the lens of statistical profiling may be productive. Institutional and organizational processes also play a role in bias because “they reflect many of the same biases of the people who operate within them” (National Research Council, 2004, p. 63). For example, organizations may provide training to speakers of non-prestigious varieties under the assumption that they are less competent. Accordingly, future studies can carefully design listener background questionnaires that elicit socially, culturally, and politically latent information and explore its relationship with bias and stereotyping measures.
Indeed, many of our students, our colleagues, or even we ourselves may face social biases and related phenomena without explicitly realizing it. It is important to acknowledge that social judgments of different accents or races may be prompted by a listener’s ethno-racial expectations of the individual, rather than by any objective linguistic features present in the speaker’s output (Kang & Rubin, Reference Kang and Rubin2009). Emphasizing this reality can offer a different perspective to L2 learners, immigrants, or speakers of non-prestigious language varieties, as it highlights the role of the listener or interlocutor rather than the speaker alone. Communication is a two-way street, and successful communication requires an active and responsive listener who is willing to communicate. Listener training through structured intergroup contact has proven successful in improving attitudes toward L2-accented speech (Kang et al., Reference Kang, Rubin and Lindemann2015). Therefore, more research on L2 speakers’ awareness training, as well as listener-based positive contact training, is called for in the future. These directions can help prevent L2 speakers from misattributing the cause of their failed encounters and expending emotional resources on critiquing their own language learning efforts. Instead, this movement can help them become responsible and collaborative interlocutors who see themselves as speakers with diverse speech characteristics in global contexts.
Future directions
We propose that bias found in social perception research can serve as an agenda for investigating AI models’ classification, processing, and generation tasks. As outlined above, the types of bias found amongst humans can also be expected to exist in AI models because their training data are produced by humans. Novel areas of social bias in the perception, production, and representation of human language variation will illuminate additional dimensions of bias that need to be investigated in AI models and will support current efforts in AI research to understand social bias in AI (e.g., Navigli et al., Reference Navigli, Conia and Ross2023; Saeed & Omlin, Reference Saeed and Omlin2023). For example, in multimodal systems, approaches and stimuli used to reduce visual components of racial bias in human perception could be used as training data for more inclusive models through adversarial training (e.g., Berg et al., Reference Berg, Hall, Bhalgat, Yang, Kirk, Shtedritski and Bain2022). As research on bias in multimodal systems is still emerging, priority must be given to reducing the negative impacts of such systems, given their widespread practical use in social gatekeeping. Regarding L2 speech, the future of accurate ASR recognition is promising: to date, large transformer-based ASR models outperform human listeners in transcription tasks after a single listen-and-transcribe attempt (Hirschi & Kang, Reference Hirschi, Kang and Sadeghi2024). However, as these models’ recognition does not always align with human listeners’ perception, uncovering biases and creating large, diverse training samples of accented speech may help reduce such bias.
Overall, this paper has attempted to demonstrate a number of issues related to human judgments and AI applications. However, we would like to note that the purpose of the paper is not simply to highlight these problems, but also to look collaboratively for solutions. We need to actively seek ways to promote socially responsible human listeners and ethical AI by addressing social bias toward marginalized language users. We can promote more AI literacy training and workshops to help raise awareness among users and researchers about ongoing AI bias issues. Furthermore, applied linguists who collect data on different speech varieties can, when ethical and with permission, collaborate with AI researchers to create less biased models in the future. In other words, we argue for a critical perspective on the use of AI and on research that involves, or may intersect with, varieties spoken by underprivileged language users. We must not only be aware of linguistic diversity but also uphold its importance through our work in educational, legal, and civic contexts. These changes should come from, and benefit, everyone, including students, teachers, researchers, policy makers, and even AI developers.