1. Introduction
The important role of vocabulary learning in language learning has been emphasized by researchers and educators. Indeed, as published studies attest, a lack of vocabulary usually leads to poor language skills, which further results in decreased motivation in vocabulary learning (Hao, Wang & Ardasheva, 2021) and subsequently in poor learning performance (Li & Tong, 2020). Moreover, EFL learners often struggle with vocabulary learning because of the complexity of knowing a word and the large number of core words (Yu & Trainin, 2022). To address this challenge, educators have increasingly turned to mobile technology as a potential remedy to improve vocabulary learning processes and outcomes.
The use of mobile applications in language learning has been shown not only to enrich language learners’ learning experiences but also to customize learning processes according to individual needs (Lee, 2023). However, learning outcomes have not been entirely positive. Some recent studies have not found significant differences between the use of mobile and traditional forms of instruction (Rachels & Rockinson-Szapkiw, 2018). In addition, Li and colleagues (2021) observed a decline in students’ vocabulary performance after three weeks of a WeChat-assisted lexical learning program, suggesting that mobile applications may be ineffective and could even hinder learning. These inconsistent results attest to the need to synthesize existing findings to reach a more convincing conclusion. Therefore, the purpose of this study was to examine whether mobile applications were more effective than traditional face-to-face methods by meta-analyzing existing primary studies.
2. Literature review
2.1. Vocabulary knowledge
Vocabulary is a central component of language proficiency (Schmitt & Schmitt, 2020). Nation (2013) stresses that language learners have to not only learn many words (vocabulary breadth) but also master a range of knowledge aspects related to individual words (vocabulary depth). Using a component approach, Nation (2013) proposes three subconstructs of vocabulary learning: form, meaning, and use. Each can be further broken into two levels of mastery: receptive knowledge mastery and productive knowledge mastery. Of course, there are other important areas such as semantic or morphological aspects of vocabulary learning, which have been frequently examined in linguistic studies (see Levin & Hovav, 2017). As we are particularly interested in vocabulary learning in this study, discussions of those aspects are beyond the scope of the current research.
When it comes to teaching and learning vocabulary, researchers tend to separate vocabulary knowledge into receptive and productive knowledge (Milton, 2009). However, overemphasis on the receptive–productive distinction could lead to the misconception that receptive knowledge always precedes productive knowledge, overlooking the fact that productive knowledge of one aspect could precede receptive knowledge of another, and a learner could write a word correctly before fully understanding all its form–meaning connections. Further, this receptive versus productive view tends to generate inconsistent results when examining the effectiveness of technology use on vocabulary learning: some have confirmed the benefit of technology use (Hao et al., 2021), whereas others disagree (Lin & Lin, 2019; Yu & Trainin, 2022). As Nation’s (2013) framework considers multiple dimensions of vocabulary learning and levels of vocabulary mastery, vocabulary knowledge has more often been described as multifaceted, rather than a simple binary construct, including spelling, word collocation, words’ grammatical functions, and so on. Therefore, it is necessary to consider levels of mastery as well as aspects of vocabulary knowledge to assess the effectiveness of technology use on vocabulary learning.
2.2. Mobile-assisted vocabulary learning
MAVL (mobile-assisted vocabulary learning) can be defined as vocabulary learning with the help of mobile technologies, including mobile short message service/multimedia message service, mobile learning apps, and context-aware mobile technologies (Lin & Lin, 2019). It stands in contrast with conventional teaching, which involves face-to-face interaction in the classroom through structured resources such as textbooks, blackboards, and slides (Yap, 2016). According to the cognitive theory of multimedia learning (Mayer, 2005), the usefulness of mobiles in vocabulary learning rests on the fact that multimedia learning combines both words and pictures, which enhances understanding by facilitating the development of verbal and visual mental models. While working with two separate channels, learners can process information differently, and each set of information can be organized into models that help learners understand and remember.
Recent years have witnessed a significant growth of research on MAVL. The majority of studies have focused on the effectiveness of mobile use on vocabulary learning motivation (Chen, Wang, Zou, Lin & Xie, 2019) and knowledge development (Li & Hafner, 2022). While many scholars have concurred on the benefits of mobile devices in vocabulary learning, other researchers have pointed out the challenges associated with MAVL. Indeed, some studies report no significant differences in delayed vocabulary tests between learning with mobiles and learning with print materials (e.g. Lai, 2016). Further, studies that support the effectiveness of MAVL typically have short-term treatments, which leaves unanswered the question of whether MAVL could have a long-term positive impact on vocabulary learning (Burston & Giannakou, 2022). Hence, the effectiveness of using mobile applications in vocabulary learning remains inconclusive.
2.3. Present study
The present study aimed to examine the overall effectiveness of MAVL and potential moderators in this relationship. Specifically, two research questions guided this study:
1. To what extent is mobile technology effective in improving foreign language learners’ vocabulary learning?
2. Is the effect of mobile technology on foreign language learners’ vocabulary learning significantly moderated by gender, education level, cultural background, vocabulary knowledge, vocabulary aspect, learning environment, sample size, type of mobile application, or language proficiency?
To achieve a holistic understanding of this issue, several meta-analyses have been conducted, as summarized in Table 1. The current review differed from these studies in several respects. First, Lin and Lin (2019), Mahdi (2018), and Hao et al. (2021) only included studies with small sample sizes and short durations. Given that the perceived effectiveness of the intervention tends to be substantially exaggerated in such experiments (Burston & Giannakou, 2022), we only considered studies with larger sample sizes and longer duration of the intervention to reach more substantiated conclusions. Further, a precise calculation of moderator effects should involve at least seven studies in each subgroup of moderators (Borenstein, Hedges, Higgins & Rothstein, 2010). However, many previous reviews did not meet this requirement (e.g. Lin & Lin, 2019; Mahdi, 2018). Finally, previous meta-analyses tend to explore vocabulary learning outcomes as a whole (e.g. Chen, Chen, Jia & An, 2020) or emphasize vocabulary acquisition based on vocabulary types without considering the vocabulary aspect (e.g. Lin & Lin, 2019; Yu & Trainin, 2022). They consequently fail to present a full picture of this topic.
Table 1. Summary of prior meta-analysis studies in mobile-assisted/technology-assisted vocabulary learning

As a result, the current meta-analysis targeted studies with larger sample sizes, longer duration of intervention, and more studies in each subgroup of moderators by combining a longer time span of publications (15 years) with a wider research scope (considering studies of vocabulary knowledge types and aspects). It is noteworthy that much attention has been paid to English learning in previous reviews, neglecting other languages (Oakes & Howard, 2022). Hence, the current review did not limit the search to English as the target learning language.
3. Methods
3.1. Data collection
The literature search was performed in June 2025 in two steps. First, to conduct a comprehensive literature review, an annotated bibliography of previous mobile-assisted language learning (MALL) review studies was examined for possible keywords (Lin & Lin, 2019). Together, three sets of search terms were applied to locate relevant literature (see Supplementary Material A): mobile-related keywords, language learning–related keywords, and vocabulary learning–related keywords. Second, major research databases were thoroughly searched, including Web of Science, the Education Resources Information Center, Education Search Complete, Scopus, Google Scholar, and CNKI. Keywords in English and Chinese were used, with no language restrictions in the search results. In addition to publications in English and Chinese, research published in other languages that provided an English title and/or abstract was also retrieved. Due to technical issues, we were unable to access all of these publications. As a result, only studies in English, Korean, Turkish, and Chinese were included in the data set. We also hand-searched key educational and applied linguistics journals and peer-reviewed book chapters, tracked reference lists from relevant papers, and consulted existing MALL bibliographies (i.e. Burston, n.d.; Burston, 2025).
3.2. Eligibility criteria
Duman, Orhon and Gedik (2015) documented a sharp growth in MALL publications from 2008 onward. Thus, we chose to review studies published from 2008 onwards to ensure that studies under review would reflect the influence of mobile devices on language learning. The following criteria were used to select articles for analysis in the study:
- The study adopted either an experimental or a quasi-experimental design and reported empirical data of between-group comparisons with a control group. Studies with only pre-/post-test designs were excluded.
- The outcome variables in the study should be specific to vocabulary learning, rather than to language learning in general.
- The technology used in the treatment must be exclusively mobiles, not other technologies such as laptops or desktops.
- The study employed second language (L2) learner samples from K-12 to higher education.
- Studies with sample sizes of 15 or more in each experimental/control group (i.e. a total sample size of 30 or more) and a treatment duration of at least 10 weeks were included.
- The study provided detailed sample information and sufficient statistical information (i.e. standard deviation [SD], mean [M], statistical significance [p], and sample size [N]).
Studies were excluded for the following reasons:
- Studies that focused on the perceptions, attitudes, or motivation towards MAVL (e.g. Sun & Gao, 2020).
- Studies with insufficient data for statistical analyses (e.g. Chen, Jia & Li, 2021).
- Studies whose full text was unavailable.
3.3. Selection process
The first author independently screened the titles and abstracts of identified papers for eligibility. Studies that were not published in English or Chinese but showed potential relevance to the focus of this review were further evaluated to see if they met the inclusion criteria, using Google Translate. After narrowing down the number of articles, both authors examined the full texts of potentially suitable papers to ensure their eligibility for inclusion. The interrater agreement was initially 90% and was then brought to 100% agreement after further discussion. The PRISMA flow diagram (Liberati et al., 2009) showed that the initial database search yielded a total of 10,163 records, and 6,461 publications were excluded after reviewing the abstracts. Another 1,591 records were removed due to duplication. This led to a total of 2,111 publications being assessed for eligibility, from which 68 studies were selected for further review (Figure 1). A full list of the studies included in this review is presented in Supplementary Material B.

Figure 1. The PRISMA flow diagram for the review.
3.4. Coding scheme
We performed data coding in EndNote 20. The coding was conducted by the two authors independently. For interrater reliability, Cohen’s kappa coefficient was .80, with an interrater agreement of 90%. When disagreement arose, the authors cross-checked the accuracy of data coding to ensure no errors were involved and came to a final agreement. A total of nine moderator variables were coded for each study.
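As an illustration of how the two reliability figures above relate, the following minimal sketch computes Cohen’s kappa from two coders’ labels. The function name and the 20 example codes are hypothetical and are not the review’s data; they are chosen only so that 90% raw agreement with balanced categories yields a kappa of about .80.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over codes.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for 20 studies: 18 agreements and 2 disagreements.
coder_1 = ["classroom"] * 10 + ["out-of-class"] * 10
coder_2 = ["classroom"] * 9 + ["out-of-class"] * 10 + ["classroom"]
kappa = cohen_kappa(coder_1, coder_2)
```

Because kappa discounts agreement expected by chance, it is lower than the raw agreement rate whenever chance agreement is above zero.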
Gender. Past studies have shown that learners of different genders interact with mobile devices for learning activities differently (Şad, Özer, Yakar & Öztürk, 2022). Hence, gender was considered as a moderator and coded as a continuous variable using the percentage (%) of female participants in each study, ranging from 0% to 100%.
Education level. Education level has been found to be a significant moderator in MAVL in previous analyses (Mahdi, 2018), with senior students benefiting more than junior students. Therefore, in this study, education level was considered as a moderator and coded as a categorical moderator in three levels: primary, secondary, and tertiary.
Cultural background. Learners from diverse cultural backgrounds have tended to hold varying attitudes toward MAVL in terms of technological affordances and applicability (Hsu, 2013). Hence, cultural background was considered as a moderator and was coded as a continuous variable according to Hofstede and colleagues’ (2010) individualism index, ranging from 0 to 100. Countries with individualism index values under 50 were characterized as collectivistic countries.
Vocabulary knowledge. According to previous meta-analyses (Lin & Lin, 2019), learners typically gained more receptive vocabulary knowledge than other types of vocabulary knowledge when learning with mobile technology. We thus considered it as a moderator and coded vocabulary knowledge based on the types of vocabulary knowledge (Nation, 2013). If the study measured both receptive and productive vocabulary knowledge, it was coded as “receptive + productive knowledge.”
Vocabulary aspect. Vocabulary aspects have been found to serve as a moderator in the relationship between the use of mobile applications and vocabulary learning, wherein learners gained most in the meaning aspect of vocabulary learning (Mahdi, 2018). In this review, learners’ vocabulary knowledge was coded as “form,” “meaning,” or “use.” If the study examined learners’ vocabulary knowledge in multiple aspects, then it was coded as a combination of these three aspects.
Mobile application type. Different types of mobile applications could significantly affect the level of effectiveness of mobile use on language learning, with educational-purpose technologies generating more learning gains than general-purpose ones (Chen et al., 2020). In this study, we considered mobile application type as a moderator and coded it as “educational-purpose application” (programs that are specifically designed for educational purposes), “general-purpose application” (programs that are not specifically designed for educational purposes), and “general-purpose + educational-purpose application” (programs that are designed for mixed purposes) (Chen et al., 2020).
Learning environment. Prior studies have shown that MAVL seems to be more efficient in unrestricted settings, as both formal and informal learning opportunities exposed learners to various resources and thus reinforced their vocabulary knowledge (Chen et al., 2020). In this study, we followed Chen et al.’s (2020) practice to categorize learning environments into three groups: classroom (learning in formal classrooms), out-of-class (learning outside formal classrooms), and unrestricted learning settings (learning in and outside formal classrooms).
Sample size. Past studies (Li, 2023) have shown that studies with large sample sizes tend to achieve greater effect sizes than studies with small sample sizes. Based on the classification of the number of participants in prior research (Hwang & Fu, 2019), sample size was coded as a categorical moderator in three categories: small (30–50), medium (51–100), and large (over 100).
Language proficiency. Past studies have shown that learners of higher L2 proficiency perform better than learners of lower L2 proficiency when using mobile apps to assist vocabulary learning (Ou-Yang & Wu, 2017). We thus included L2 proficiency as a potential moderator in MAVL and classified participants’ language proficiency levels as “elementary,” “intermediate,” and “advanced” based on the Common European Framework of Reference for Languages (Council of Europe, 2001). Only studies wherein participants’ proficiency was assessed through standardized tests were coded for this moderator.
3.5. Data analysis methods
The data analysis procedure consisted of three steps: overall effect size calculation, moderator analyses, and publication bias calculation. Standardized mean difference was used to compute the overall effectiveness of MAVL by extracting standard deviation, mean, and sample size from each study. We employed a Bayesian meta-analysis method as it addresses publication bias from small-study effects and allows for the inclusion of more variables, even with small sample sizes (Thompson & von Gillern, 2020). Specifically, we used the “bayesmeta” package to calculate effect sizes. The heterogeneity was evaluated by between-study variance τ². Due to an anticipated high level of heterogeneity among effect sizes (Harrer, Cuijpers, Furukawa & Ebert, 2021), we used a random-effects model (more descriptions are presented in Supplementary Material C). Cohen’s d (effect size) and 95% credible interval were reported to interpret the Bayesian analysis results. The benchmark of effect sizes set by Plonsky and colleagues (2023) was adopted, whereby d values around .40 were considered a small effect, .70 a medium effect, and 1.00 a large effect.
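To make the effect size step concrete, the sketch below computes a standardized mean difference (Cohen’s d) and a commonly used large-sample approximation of its sampling variance from the mean, standard deviation, and sample size of each group. It is an illustration in Python rather than the “bayesmeta” workflow used in this study, and the function name and example scores are hypothetical.

```python
import math


def cohens_d(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference (Cohen's d) with its approximate sampling variance."""
    # Pooled variance across the treatment and control groups.
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    d = (mean_t - mean_c) / math.sqrt(pooled_var)
    # Common large-sample approximation to the variance of d.
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    return d, var_d


# Hypothetical post-test scores (not taken from any included study).
d, var_d = cohens_d(mean_t=78.0, sd_t=10.0, n_t=30, mean_c=70.0, sd_c=10.0, n_c=30)
se_d = math.sqrt(var_d)  # standard error used to weight the study in a meta-analysis
```

With these illustrative inputs, the mean difference of 8 points over a pooled standard deviation of 10 gives d = .80, which would fall between the medium and large benchmarks cited above.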
We adopted weakly informative priors, which are especially useful for small-sample subgroup analyses (Gelman, 2009). For the effect µ, we adopted a vague normal prior centered at 0. For the heterogeneity τ, we used a half-normal distribution with a scale of .50, which assumes τ ≤ .98 with 95% probability (Röver, 2020). Furthermore, we conducted sensitivity analyses by changing the priors to observe how the posterior results responded to different priors (Norouzian, De Miranda & Plonsky, 2019). Various prior distributions (i.e. DuMouchel, Jeffreys, uniform) were compared and no significant difference was observed. Hence, the aforementioned prior distributions were considered appropriate.
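The 95% bound implied by the half-normal prior on τ can be checked directly. The short sketch below, which uses only the Python standard library and a hypothetical helper name, derives the quantile of a half-normal distribution from the standard normal quantile function.

```python
from statistics import NormalDist


def half_normal_quantile(p, scale):
    """Quantile of a half-normal distribution with the given scale."""
    # If X ~ Normal(0, scale), then |X| is half-normal and
    # P(|X| <= q) = 2 * Phi(q / scale) - 1, so q = scale * Phi^{-1}((1 + p) / 2).
    return scale * NormalDist().inv_cdf((1 + p) / 2)


upper_95 = half_normal_quantile(0.95, scale=0.5)
print(round(upper_95, 2))  # 0.98
```

The result matches the stated assumption that τ ≤ .98 with 95% prior probability under a scale of .50.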
Categorical moderators were examined with subgroup analyses, and continuous moderators (i.e. gender composition and cultural background) were examined with meta-regression analyses. For subgroup analyses, we examined each moderator separately, and then computed the differences between subgroups, using posterior means, credible intervals, and posterior probabilities. For continuous moderators, meta-regression was carried out with the “bmr” function, which was used to compute the coefficient β and its 95% credible interval to examine the effect of moderators.
Additionally, following van Doorn and colleagues’ (2021) guidelines, we selected a statistical model and a prior distribution in the planning stage, then conducted a sensitivity analysis to check prior assumptions and a Bayesian meta-analysis. With the posterior distribution and credible intervals established, we plotted the prior and posterior distributions, reported the posterior mean and 95% credible interval, set and justified prior settings, and discussed the robustness of the analysis in the final reporting.
Publication bias occurs when studies with significant results are more likely to be published than ones with non-significant results (Harrer et al., 2021). We examined publication bias using a contour-enhanced funnel plot. Specifically, a symmetrical funnel plot and a small number of imputed studies indicate a lower risk of publication bias, which provides robust support for the meta-analysis results (Lam & Zhou, 2022).
4. Results
4.1. Descriptions of studies
To ensure the robustness of the meta-analysis results, we followed Röver (2020) and conducted an $\widehat{R}$ value check and posterior predictive checks. We first used posterior predictive checks to identify outliers that fell outside the 95% posterior predictive intervals. Three studies (i.e. Lye, 2022; Roussel & Galan, 2018; Wu, 2015) were detected as outliers and were subsequently excluded from the following analyses. Our data set showed an $\widehat{R}$ value below 1.01 (Harrer et al., 2021), demonstrating good convergence. All remaining studies were then considered to determine the association between mobile use and vocabulary learning. Among the 65 studies, quasi-experimental design was adopted in 49 studies (i.e. studies with no random assignment to the experimental group and control group of participants), and 16 studies used mixed methods (i.e. studies that used both quantitative and qualitative approaches). Most studies (N = 59) were published in English, four in Chinese, one in Turkish, and one in Korean. A full description of the studies in our review is presented in Figure 2.

Figure 2. Descriptions of included studies (Note: The sum may be lower than 65 due to insufficient information for coding).
4.2. Overall analyses
The summary is presented in Figure 3. Meta-analysis of all studies showed an overall large effect of mobile use on vocabulary learning, with a weighted mean effect size across all studies of d = 1.28 (95% CI = [1.03, 1.52]), suggesting strong evidence in support of the hypothesis that students who used mobile applications for language learning demonstrated more gains than those who did not. Further, we found a high heterogeneity of τ = 1.07 (95% CI = [.91, 1.27]), indicating substantial variance in effect sizes and warranting further moderator analyses. Only 11 studies reported delayed post-test results, with an overall effect size of 1.41 (95% CI = [.86, 1.95]) and a heterogeneity of 1.06 (95% CI = [.75, 1.45]). Figure 4 shows the prior and posterior distributions of effect size and heterogeneity.

Figure 3. Forest plot for all included effect sizes.

Figure 4. Prior and posterior distribution of overall effect size (μ) and between-study heterogeneity (τ) regarding all included studies.
4.3. Subgroup analyses
Subgroup analyses were conducted for seven moderators (excluding gender and cultural backgrounds). Table 2 provides a summary of the effect sizes, heterogeneity, and moderator analysis results. The between-group comparisons in these moderator analyses are presented in Table 3.
Table 2. Summary of effect sizes, heterogeneity, and moderator analyses

Note. CI = credible interval.
a These two variables were coded as continuous variables, and meta-regression was used for moderator analyses.
Table 3. Summary of between-group comparisons in moderator analyses

Note. CI = credible interval.
Gender. Only 34 studies reported gender information of the sample, and the regression weight was small (β = −.45, 95% CI = [−1.61, .71]). The gender model also showed high heterogeneity (τ = 1.00, 95% CI = [.76, 1.25]), suggesting substantial between-study variation in effect sizes.
Education level. Two studies included primary school students, nine studies included secondary school students, and the remaining 43 studies included university/college students. Eleven studies failed to report this information. The effect sizes tended to be larger for studies with tertiary samples (d = 1.31, 95% CI = [.99, 1.61]), followed by secondary school (d = 1.26, 95% CI = [.77, 1.73]) and primary school samples (d = .60, 95% CI = [−.47, 1.54]).
Cultural background. All 65 studies provided cultural background information. The regression weight was very low (β = .00, 95% CI = [−.02, .02]), with a high heterogeneity (τ = 1.08, 95% CI = [.91, 1.27]), suggesting that the observed effect did not vary much across cultural backgrounds.
Vocabulary knowledge. Thirty-three studies examined receptive vocabulary knowledge, one examined productive knowledge, and another 18 examined both receptive and productive vocabulary knowledge; 13 studies did not report such information. MAVL was most efficient in receptive vocabulary knowledge (d = 1.38, 95% CI = [1.05, 1.69]), followed by productive vocabulary knowledge (d = 1.25, 95% CI = [.48, 1.92]). A combination of receptive and productive vocabulary knowledge showed the smallest effect size (d = .66, 95% CI = [.27, 1.04]).
Vocabulary aspects. Seventeen studies focused on the vocabulary knowledge aspect of form and meaning, and 17 examined form and meaning and use; 31 studies failed to report this information. Learning vocabulary with mobiles was found to be more efficient when learning the form and meaning of a word (d = 1.33, 95% CI = [.97, 1.76]) than learning all aspects of a word (i.e. form, meaning, and use) (d = .65, 95% CI = [.33, .97]).
Learning environment. Eighteen studies were conducted in classrooms, 38 out of class, and seven in unrestricted learning settings; two studies did not specify the learning environments. MAVL was most efficient in out-of-class contexts (d = 1.42, 95% CI = [1.10, 1.73]), followed by unrestricted settings (d = 1.22, 95% CI = [.51, 1.86]). Learning in class was found to have the smallest effect size (d = .82, 95% CI = [.48, 1.16]).
Sample size. Seventeen studies included a small sample size (30–50 participants), 38 included a medium sample size (51–100), and 10 involved a large sample size (over 100). MAVL was found to be the most efficient in studies with a medium sample size (d = 1.40, 95% CI = [1.07, 1.74]), followed by the small sample size studies (d = 1.07, 95% CI = [.70, 1.43]) and the large sample size studies (d = .94, 95% CI = [.38, 1.47]).
Mobile application type. Twenty-two studies employed general-purpose applications, 33 used educational-purpose applications, and 10 had both general-purpose and educational-purpose applications. The learning effect was found to be the largest for mobile applications designed for both general and educational purposes (d = 1.36, 95% CI = [.73, 1.96]), compared to mobile applications developed for educational purposes (d =1.31, 95% CI = [.99, 1.66]). Mobile applications designed for general purposes revealed a smaller effect size (d = 1.08, 95% CI = [.69, 1.47]).
Language proficiency. Twenty-eight studies reported participants’ language proficiency level, among which 21 included intermediate L2 learners, five included elementary L2 learners, and two studies included advanced L2 learners; the remaining 37 studies failed to report participants’ language proficiency level through standardized tests. MAVL was more efficient in intermediate learners (d = 1.57, 95% CI = [1.05, 2.08]), followed by advanced learners (d = 1.31, 95% CI = [.56, 1.91]) and elementary learners (d = .79, 95% CI = [−.10, 1.65]).
4.4. Publication bias
The publication bias analyses were first performed via a contour-enhanced funnel plot (Figure 5). The observed asymmetry, particularly the absence of studies in the lower left contour, could suggest the presence of publication bias or small-study effects. This pattern was further examined through Egger’s regression test for overall vocabulary learning with mobile apps (β = 6.56, t = 5.62, p < .001). Subsequent trim-and-fill test results showed that 26 more studies were needed to nullify publication bias, and the overall adjusted effect size was .74 (95% CI = [.44, 1.03]). Thus, after adjusting, we still observed a medium effect size of mobile use in enhancing vocabulary learning. Moreover, the adjusted effect size was .87 for receptive vocabulary knowledge (95% CI = [.44, 1.30]), 1.46 for productive vocabulary knowledge (95% CI = [.57, 2.36]), and .68 (95% CI = [.30, 1.06]) for a combination of receptive and productive vocabulary knowledge.

Figure 5. Contour-enhanced funnel plot of study-level standardized mean differences.
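For readers unfamiliar with Egger’s regression test, the following minimal sketch shows the classical formulation: each study’s standardized effect (effect divided by its standard error) is regressed on its precision (one divided by the standard error), and an intercept far from zero signals funnel-plot asymmetry. The function name and the example values are hypothetical, the regression is plain ordinary least squares, and the sketch is not the software routine used to obtain the results reported above.

```python
import math


def egger_test(effects, std_errors):
    """Egger's regression: regress effect/SE on 1/SE; return the intercept and its t value."""
    y = [d / se for d, se in zip(effects, std_errors)]  # standardized effects
    x = [1 / se for se in std_errors]                   # precisions
    n = len(y)
    if n < 3:
        raise ValueError("need at least three studies")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be identical")
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance and standard error of the intercept (ordinary least squares).
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_intercept = math.sqrt(residual_ss / (n - 2) * (1 / n + mean_x**2 / sxx))
    t_value = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, t_value


# Hypothetical data in which less precise studies report larger effects.
example_d = [1.9, 1.4, 1.1, 0.9, 0.7, 0.6]
example_se = [0.45, 0.35, 0.30, 0.20, 0.15, 0.10]
intercept, t_value = egger_test(example_d, example_se)
```

The t value would be compared against a t distribution with n − 2 degrees of freedom; a positive intercept, as in this illustrative data set, corresponds to smaller studies reporting larger effects.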
5. Discussion
The present meta-analysis aggregated studies published since 2008 and provided a comprehensive examination of the effectiveness of MAVL while considering a series of potential moderators. Overall, the findings were consistent with previous meta-analyses that MAVL was generally an effective approach for vocabulary learning, as shown by a large effect size. Specifically, learners who used mobile applications demonstrated a sustained advantage in long-term vocabulary retention compared to those who did not.
Moderator analyses showed the effectiveness of MAVL tended to vary across different aspects of vocabulary learning. First, mobile applications enhanced vocabulary learning more when it targeted a combination of form and meaning than when it involved all three aspects. This was also corroborated by recent research on the order of acquisition of word knowledge components, which points to a hierarchical acquisition of vocabulary aspects, starting with the form–meaning link and ending with multiple meanings (Sukying & Nontasee, 2022).
Next, secondary and tertiary learners showed more learning gains than primary school learners when using mobiles for vocabulary learning. Underlying factors for the lower learning gains by primary students included lower language learning motivation, limited language proficiency, and underdeveloped self-regulation capacity (Yu & Trainin, 2022). Third, applications for educational purposes yielded a larger effect size in vocabulary learning than applications for general purposes (also see Chen et al., 2020). Plausible reasons included the availability of interesting and diversified learning content (Roohani & Vincheh, 2023) and individualized learning systems in educational mobile applications (Song & Xiong, 2023). General applications for vocabulary learning could also present distracting factors such as online videos and irrelevant messages, which adversely affected the learning process and outcome.
Fourth, unrestricted environments and out-of-class settings contributed to greater gains than classroom settings, which echoed Lin and Lin’s (Reference Lin and Lin2019) findings. It seems that learners in unrestricted and informal settings had more opportunities to explore and self-regulate their own learning experience after class (Criollo-C, Guerrero-Arias, Jaramillo-Alcázar & Luján-Mora, Reference Criollo-C, Guerrero-Arias, Jaramillo-Alcázar and Luján-Mora2021), which contributed to increased learning gains. Fifth, mobile devices were likely to produce greater vocabulary gains for intermediate-level learners than for elementary-level learners, which was consistent with Burston and Giannakou’s (Reference Burston and Giannakou2022) findings. Nonetheless, it should be noted that only three studies involved elementary-level learners, and no studies examined advanced learners. Future research is thus warranted to re-examine these results with learners at varied proficiency levels.
Finally, mobile application use for vocabulary learning was found to be more effective in enhancing productive vocabulary knowledge than receptive vocabulary knowledge (Hao et al., Reference Hao, Wang and Ardasheva2021). Previous linguistic studies have confirmed that grammatical functions and collocations are learned after form–meaning links have been created (Sukying & Nontasee, Reference Sukying and Nontasee2022). However, mobile applications provide only limited, or even no, opportunities for the acquisition of grammatical functions and collocations (Barjesteh, Movafaghardestani & Modaberi, Reference Barjesteh, Movafaghardestani and Modaberi2022). The difficulty of the target word components and the limited learning resources could explain why combined vocabulary knowledge benefited the least from mobile application usage.
6. Conclusion and implications
This meta-analysis adds to the existing MAVL literature by confirming a positive and large effect of mobile applications on both vocabulary learning and delayed vocabulary retention. In other words, the use of mobile applications for vocabulary learning was more effective than conventional methods with a treatment duration of over 10 weeks. Among the eight potential moderators of this relationship, only aspects of vocabulary knowledge, mobile application type, and education level were significant, suggesting that the effect of mobile applications on vocabulary learning was independent of many individual and contextual attributes.
This study does, however, have some limitations. First, the contour-enhanced funnel plot and Egger’s test indicated the existence of publication bias, and this analysis was thus at risk of the “file-drawer” effect, in which studies with significant findings are more likely to be accepted than those with non-significant results (Dechartres, Atal, Riveros, Meerpohl & Ravaud, Reference Dechartres, Atal, Riveros, Meerpohl and Ravaud2018). As a result, the computed effect size might be inflated. Second, we identified only two studies with primary school student samples; future research is recommended in this area. Third, some studies did not report separate scores for different aspects of vocabulary learning, which prevented us from disentangling the effect of MAVL on individual aspects of vocabulary learning. Last, some moderators in our study (e.g. vocabulary aspect and vocabulary knowledge) tended to be interconnected and thus might have led to a confounding effect. However, the limited number of studies included in our analyses did not allow for further analysis to address this concern.
Despite these limitations, our findings have implications for future MAVL research and practice. First, although teenagers were found to be keener on using mobile applications, studies with this age group were surprisingly few. As vocabulary is a fundamental factor in learning a language, especially at an early age (Sitompul, Reference Sitompul2020), further investigations of the pedagogical potential of MAVL should be extended to various age groups. Furthermore, our study shows that mobile applications designed for educational purposes produced a significantly stronger effect on vocabulary learning than those designed for general purposes. To realize the maximum benefits of mobile applications, language teachers are encouraged to incorporate educational mobile applications into both in-class instruction and out-of-class learning, while granting learners sufficient autonomy. For designers, more advanced and integrated features need to be developed to create greater opportunities for strengthening learners’ vocabulary learning.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/S0958344025100335
Data availability statement
Data are available within the supplementary materials. Data are also openly available in a public repository, with a permanent identifier (https://doi.org/10.7910/DVN/RMVEUS).
Authorship contribution statement
Yonghong Zhou: Writing – original draft, Methodology, Investigation, Visualization, Formal analysis, Data curation; Mingming Zhou: Conceptualization, Writing – review & editing, Methodology, Formal analysis, Supervision.
Funding disclosure statement
This research did not receive any specific funding.
Competing interests statement
The authors declare no competing interests.
Ethical statement
Ethical approval was not required.
GenAI use disclosure statement
The authors declare no use of generative AI.
About the authors
Yonghong Zhou is a PhD student in the Faculty of Education at the University of Macau. Her research interests focus on second language acquisition, educational technology, and emotion regulation.
Mingming Zhou is a professor in the Faculty of Education at the University of Macau. Her research interests focus on educational technology, emotion, and positive psychology.