Pictures of objects are commonly used as stimuli in different tasks assessing speakers’ production abilities, lexical retrieval, proficiency, and more. In order to accurately assess performance, there is a need for a set of pictures that is standardized across a large number of participants (Brodeur et al., Reference Brodeur, Guérard and Bouras2014; Souza et al., Reference Souza, Garrido and Carmo2020). Numerous sets of standardized pictures exist, ranging from black and white drawings (Snodgrass & Vanderwart, Reference Snodgrass and Vanderwart1980) through colorful drawings (Duñabeitia et al., Reference Duñabeitia, Baciero, Antoniou, Antoniou, Ataman, Baus, Ben-Shachar, Çağlar, Chromý and Comesaña2022; Duñabeitia et al., Reference Duñabeitia, Crepaldi, Meyer, New, Pliatsikas, Smolka and Brysbaert2018) to naturalistic photographs (Krautz & Keuleers, Reference Krautz and Keuleers2022; Moreno-Martínez & Montoro, Reference Moreno-Martínez and Montoro2012; Souza et al., Reference Souza, Garrido, Saraiva and Carmo2021). Although such resources are available in many languages, these are often collected from a single population (i.e., native speakers of the target language), and the degree to which information collected from one population predicts performance in other populations is relatively underexplored. Thus, the current study aims to examine whether norms collected from L1 speakers of the language are predictive of second-language (L2) speakers’ performance in a timed picture-naming task.
Picture norms vs. timed picture naming
Picture naming tasks are generally understood to involve three key stages: (1) visual recognition and conceptual identification, where the features of the depicted object are matched with existing semantic knowledge; (2) lexical selection, during which appropriate lexical representations of words are accessed and chosen; and (3) articulation, where the selected word is verbally expressed (for a recent review see Momenian et al., Reference Momenian, Bakhtiar, Chan, Cheung and Weekes2021). Collecting picture norms prior to testing timed picture-naming performance is a standard practice in the field (Souza et al., Reference Souza, Garrido and Carmo2020). Such norms allow researchers to identify the expected response for each picture and help prevent data loss during the actual experiment, as problematic stimuli or ambiguous items can be avoided. Typically, norms establish the percentage of name agreement for particular items, defined as the proportion of participants who gave the same name to an object (Sirois et al., Reference Sirois, Kremin and Cohen2006). Pictures with a single dominant response (i.e., high name agreement) are considered good items and tend to be named more rapidly and more accurately in timed picture-naming tasks, compared to items that received multiple responses in the norms (Barry et al., Reference Barry, Morrison and Ellis1997; Vitkovitch & Tyrrell, Reference Vitkovitch and Tyrrell1995). One explanation for this pattern is that variability among individuals in the names provided in the norms reflects variability within the individual, such that pictures with low name agreement may be more difficult to name due to internal competition among multiple alternatives within the individual (e.g., sofa/couch; for related discussion see Balatsou et al., Reference Balatsou, Fischer-Baum and Oppenheim2022; Prior et al., Reference Prior, Wintner, MacWhinney and Lavie2011). Indeed, items with low name agreement are associated with more pauses during production, thought to be indicative of increased cognitive load (Hartsuiker & Notebaert, Reference Hartsuiker and Notebaert2009).
However, there are some inherent differences between norming and naming performance. Specifically, picture norms are normally self-paced, and participants respond by typing the name of the picture, in contrast to the oral response provided in a timed picture-naming task (Duñabeitia et al., Reference Duñabeitia, Baciero, Antoniou, Antoniou, Ataman, Baus, Ben-Shachar, Çağlar, Chromý and Comesaña2022; Duñabeitia et al., Reference Duñabeitia, Crepaldi, Meyer, New, Pliatsikas, Smolka and Brysbaert2018; Liu et al., Reference Liu, Hao, Li and Shu2011; Momenian et al., Reference Momenian, Bakhtiar, Chan, Cheung and Weekes2021; Moreno-Martínez & Montoro, Reference Moreno-Martínez and Montoro2012). Therefore, while also capturing visual perception and conceptual identification, self-paced picture norms are more reflective of language-specific vocabulary knowledge (Garcia & Gollan, Reference Garcia and Gollan2022). Moreover, as no time limit is imposed in the norms (but see Krautz & Keuleers, Reference Krautz and Keuleers2022), participants may engage in monitoring and metalinguistic processes to a greater extent in norms than in timed naming tasks. These differences in task demands may constrain the degree to which name agreement from picture norms indeed predicts timed picture-naming performance in different populations.
Evidence supporting this possibility comes from a recent study by Momenian et al. (Reference Momenian, Bakhtiar, Chan, Cheung and Weekes2021) who examined the role of item-related characteristics, such as name agreement, age of acquisition, imageability, and grammatical class, in predicting timed naming performance in speakers with different language histories. In particular, in that study, norms in Mandarin were collected from monolingual Mandarin speakers and Mandarin-Cantonese bilinguals, whereas norms for Cantonese were collected from Cantonese-English bilinguals. Their pattern of results showed both differences and similarities in the degree to which each variable predicted performance across monolinguals and bilinguals. For current purposes, most notable is the finding that L1 name agreement predicted naming performance in both monolinguals and bilinguals, but that the effect was stronger in the former. Thus, the degree to which name agreement may predict timed naming performance depended on participants’ characteristics. Notably, that study did not report on the degree to which L1 name agreement predicted naming performance in participants’ L2, because the norms collected from Cantonese-English bilinguals in Cantonese were not used to predict the performance of the Mandarin-Cantonese bilinguals. Here, in contrast, we focus on this cross-group prediction. Specifically, we compiled a set of normed pictures and tested its utility in predicting timed picture-naming performance, in both L1 and L2 speakers.
L2 performance
The goal of the current project was to examine whether normative data derived from native L1 speakers can be used to predict L2 naming performance, relative to norms collected from L2 speakers. There are several reasons to anticipate differences between L1-derived and L2-derived norms. First, we might expect larger ranges of name agreement among L2 speakers due to their more variable exposure to the target language. Whereas in both L1 and L2 norms, frequent items are expected to result in higher name agreement, the effects of frequency may be larger in the L2 compared to the L1 (Duyck et al., Reference Duyck, Vanderelst, Desmet and Hartsuiker2008). Thus, larger frequency effects in the L2 may lead to greater variability in L2 name agreement.
Second, offline norms reflect the ways in which speakers name objects, and linguistic experience has been shown to affect object naming patterns, with non-target language conceptual representations influencing the ways bilinguals name common objects in the other language (Ameel et al., Reference Ameel, Storms, Malt and Sloman2005; Malt & Sloman, Reference Malt and Sloman2003; Pavlenko & Malt, Reference Pavlenko and Malt2011). Such conceptual cross-language influences (see also Degani et al., Reference Degani, Prior and Tokowicz2011) may lead L2 speakers to name objects differently than would native speakers of that language. Thus, name agreement patterns may differ between the L1 and the L2 speakers. To examine these potential differences, the current study compiled a set of norms from L1 and L2 speakers of Hebrew.
Available picture sets for Hebrew
Because object naming can vary significantly across languages, and some objects may not have direct equivalents in different cultures or may have different levels of familiarity (e.g., a football helmet may not be familiar to individuals outside of the United States, where American football is not played), it is critical to establish picture norms in the target language. Only limited picture resources are available for use with healthy Hebrew-speaking adults. A previous effort by Kavé (Reference Kavé2005) provided Hebrew standardization for the Boston Naming Test (BNT, Kaplan et al., Reference Kaplan, Goodglass and Weintraub1983), but no detailed information regarding name agreement or other picture characteristics was available. Moreover, the study included a set of 48 black-and-white line drawings (derived from the BNT). While many of the items in that set remain relevant and recognizable today, some have become outdated due to cultural and technological changes.
Recently, Duñabeitia et al. (Reference Duñabeitia, Baciero, Antoniou, Antoniou, Ataman, Baus, Ben-Shachar, Çağlar, Chromý and Comesaña2022) introduced the MultiPic set, which has been normed across multiple languages, including Hebrew. While the availability of the MultiPic dataset in Hebrew is a valuable resource for researchers testing Hebrew speakers, it primarily consists of colored drawings, as opposed to the realistic, full-color photographs employed in the current study, which provide a more authentic representation and can potentially lead to more naturalistic linguistic responses (see also Krautz & Keuleers, Reference Krautz and Keuleers2022; Souza et al., Reference Souza, Garrido, Saraiva and Carmo2021; for an extended discussion see van Hoef et al., Reference van Hoef, Lynott and Connell2024).
Furthermore, both the MultiPic (Duñabeitia et al., Reference Duñabeitia, Baciero, Antoniou, Antoniou, Ataman, Baus, Ben-Shachar, Çağlar, Chromý and Comesaña2022) and the Kavé (Reference Kavé2005) datasets lack comprehensive data on the language background of the participants. In Kavé (Reference Kavé2005), 87 of the 365 participants were in fact immigrants to Israel, so that Hebrew was their L2, but limited information is available on the L1 of these participants. Similarly, no information is provided by Duñabeitia et al. (Reference Duñabeitia, Baciero, Antoniou, Antoniou, Ataman, Baus, Ben-Shachar, Çağlar, Chromý and Comesaña2022) regarding the language profile of the Hebrew speakers who completed the norms. This omission can limit the interpretability and applicability of the norms in different linguistic contexts. This is particularly important in environments with increased immigration, like Israel, where close to half of the adult population are native speakers of a language other than Hebrew (The Social Survey of the Israeli Central Bureau of Statistics, 2021). Thus, here we compile normative data on an up-to-date set of realistic colored pictures, separately examining the performance of L1 and L2 speakers of Hebrew.
To summarize, in the current study, we first collected normative data on a large set of realistic colored pictures, largely taken from the Moreno-Martínez and Montoro (Reference Moreno-Martínez and Montoro2012) Spanish norms. Following their procedure, participants were to type in the name of each picture and then rate it on its visual complexity, familiarity, and typicality. Critically, these norms were collected from two populations in Israel—native Hebrew speakers and native Arabic speakers who have learned Hebrew as an L2, all naming the pictures in Hebrew. Next, we examined the predictive value of these offline norms for the timed picture-naming performance of two separate groups of L1 and L2 Hebrew speakers. As such, the current project serves to bridge the gap between L1 and L2 vocabulary knowledge, as reflected in the norms, and real-time language processing, as reflected in the timed picture-naming task. In particular, the study examines whether normative information predicts online performance in a similar way across individuals who differ in their language background, including L1 and L2 speakers. We test whether item-specific normative information derived from L1 speakers can predict the timed picture-naming performance of L2 speakers of that same language. We expected these L1 norms to have limited value in predicting L2 performance, because the mental lexicon of L2 speakers may differ due to reduced exposure to the language (e.g., the Frequency Lag Hypothesis, Gollan et al., Reference Gollan, Slattery, Goldenberg, Van Assche, Duyck and Rayner2011) and to influences from conceptual representations in the L1 (e.g., Degani et al., Reference Degani, Prior and Tokowicz2011). Documenting such effects may inform future research that utilizes normative information when testing diverse and multilingual populations.
Beyond this theoretical contribution, the current project carries practical importance. Specifically, the native Arabic speakers (L2 Hebrew) in this study maintain dominance in their L1 Arabic but are required to function in Hebrew daily within the society in which they reside. Thus, it is important to examine to what extent their processing differs from that of the L1 Hebrew speakers with whom they interact frequently.
Study 1—picture norming
Methods
Materials
A total of 323 colored, realistic pictures of objects with noun names served as stimuli in the study. Of these, 205 were selected from the set provided by Moreno-Martínez and Montoro (Reference Moreno-Martínez and Montoro2012) for Spanish speakers, taking perceived cultural appropriateness into account. In addition, 118 pictures were selected from Google Images based on concepts included in the Moreno-Martínez and Montoro (Reference Moreno-Martínez and Montoro2012) or the Snodgrass and Vanderwart (Reference Snodgrass and Vanderwart1980) original sets. Examples are presented in Figure 1. The norming survey included 94 additional pictures normed for other purposes that are not reported here. For the norming study conducted with L2 Hebrew speakers, three pictures were inadvertently excluded, resulting in a total set of 320 pictures.

Figure 1. Examples of the normed stimuli.
(a) Examples of items from Moreno-Martínez and Montoro (Reference Moreno-Martínez and Montoro2012)
(b) Examples of items from Google Images based on concepts from Moreno-Martínez and Montoro (Reference Moreno-Martínez and Montoro2012) and Snodgrass and Vanderwart (Reference Snodgrass and Vanderwart1980)
All pictures were centered on a white background and sized to fit into frames of 432 × 324 pixels on the computer screen. The full set of pictures and the corresponding norms are available through the Open Science Framework (OSF, https://osf.io/2nwzm/).
Participants
Norming data for Hebrew as an L1 were collected from a group of 449 native Hebrew speakers (123 males, 326 females), who each rated a random subset of the pictures. Because the focus was not on RTs, we opted not to restrict participants’ age, which averaged 28.72 years (SD = 6.99; range 18–62 years). These participants were born in Israel and had not been regularly exposed to other languages before the age of 6. They were recruited from the community to complete the online survey or received university class credit for their participation.
Norming data for Hebrew as an L2 were collected from a group of 163 native Arabic speakers who had learned Hebrew as an L2 (42 males, 121 females; mean age = 21.24 years, range 18–33 years). Detailed proficiency information was not collected for these participants, but they were drawn from a population of native Arabic speakers living in Israel, who typically begin learning Hebrew in elementary school and are also exposed to Hebrew as the majority language spoken in the environment.
Procedure
The study was conducted as an online survey (www.qualtrics.com). General instructions in Hebrew were followed by demographic questions regarding age, gender, place of birth, native language, and additional language knowledge. Participants who indicated a place of birth other than Israel were thanked and could not proceed with the survey. In addition, participants who indicated a language other than Hebrew or Arabic as their native language were similarly excluded.
Participants were then presented with a random subset of pictures sampled from the large set, such that each participant evaluated a different number of pictures, presented in random order. Each picture was presented individually on the screen. Following the procedures described in Moreno-Martínez and Montoro (Reference Moreno-Martínez and Montoro2012), each picture was followed by a text box in which participants were to type its name, and by three scales for rating visual complexity, familiarity, and typicality. To evaluate visual complexity, participants were asked to rate “To what degree is the picture visually complex?” on a 7-point Likert scale (1 = very simple, 7 = very complex). To evaluate familiarity, participants were asked “To what degree is the presented object familiar to you?” on a 7-point Likert scale (1 = not familiar at all, 7 = very familiar). To evaluate typicality, participants were asked “To what degree is the picture typical of the name you’ve given it?”, providing their responses on a 7-point Likert scale (1 = not typical at all, 7 = the most typical). Participants named each picture and rated it on each of the three dimensions (with the order of dimensions kept constant for all participants) before moving on to the next picture. It was explained that there were no correct answers and that we were interested in their opinion. If participants did not know the name of the object, they were instructed to respond with “I don’t know,” but they were not allowed to skip any question. The survey was self-paced, and participants could not go back to a previous picture to revise their responses. The number of pictures rated by each participant varied, but critically, each picture was evaluated by a minimum of 30 participants.
L1 norming results
The final dataset resulting from the L1 norming study is freely available through the OSF (https://osf.io/2nwzm/). Each picture received an average of 32.49 responses (SD = 3.36). For each picture, we calculated its name agreement, defined as the proportion of responses corresponding to the most dominant name (Lev Ari & Shao, Reference Lev Ari and Shao2017). For this calculation, all responses were treated as valid. Responses that were phonologically similar (e.g., /mivreshet/), even when written differently (e.g., both /מברשת/ and /מיברשת/), were counted together. Moreover, for each picture we calculated the mean ratings of visual complexity, familiarity, and typicality. Overall, name agreement in the L1 Hebrew norms ranged from 23% to 100%, with a mean of 81.14% (SD = 18.24%). See Table 1 for the descriptive data and Figure 2 for the frequency histogram of name agreement in the L1 norms, along with that of the L2 norms (described below).
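To illustrate the name agreement calculation, a minimal R sketch is given below; the object and column names (norms, picture, response) are hypothetical stand-ins for the coded norming responses, with alternative spellings already pooled.

```r
# Minimal sketch (hypothetical object and column names): per-picture name agreement
# as the proportion of responses matching the modal (most frequent) name.
library(dplyr)

# Toy data: one row per participant response
norms <- data.frame(
  picture  = c("sofa", "sofa", "sofa", "broom", "broom"),
  response = c("sofa", "couch", "sofa", "broom", "broom")
)

name_agreement <- norms %>%
  group_by(picture) %>%
  summarise(
    n_responses    = n(),
    modal_name     = names(which.max(table(response))),
    name_agreement = max(table(response)) / n()
  )
```

In this toy example, the modal name for the sofa picture is “sofa”, with a name agreement of 2/3, whereas the broom picture reaches 100% agreement.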
Table 1. Descriptive data of the L1 and L2 norms

Note: Values are means (standard deviations). * Denotes a significant difference between the L1 and L2 norms at the p<.05 level based on t-tests.

Figure 2. Frequency of Name Agreement distribution across items in the L1 and L2 Hebrew norms.
L2 norming results
In the L2 norming data, each picture received an average of 34.31 responses (SD = 3.18; see Table 1). Coding was similar to that employed in the L1 norms above. Notably, there were more spelling errors in the data of this L2 group; however, we adhered to the criterion set above, whereby alternative spellings that retained the phonological form of the word were pooled together and counted as one word. The distribution of name agreement was substantially wider than that observed in the L1 norms (see Figure 2), ranging from 0% to 100%, with a mean of 48.48% (SD = 27.45%).
Comparing L1 and L2 norming results
Name agreement was calculated separately for each group. L1 speakers tended to converge on the same response more often than did the L2 speakers (see Table 1); thus, there was more variability in the responses of L2 speakers.
Next, we examined the alignment of the modal responses across the two datasets, to identify cases where a different modal response was selected in the L1 vs. the L2 norms. While name agreement percentages varied by group, in 70% of the cases (224 of 323 items) the same modal response was preferred in both groups, suggesting that for most items L2 speakers converged on the same name as L1 speakers.
We assessed the reliability of participants’ ratings for visual complexity, familiarity, and typicality using intraclass correlation coefficients (ICC), with values reported on the diagonal of Table 2. Specifically, we used ICC(2,k), which estimates the consistency of average ratings from multiple raters, assuming raters are a random sample (Shrout & Fleiss, Reference Shrout and Fleiss1979; Koo & Li, Reference Koo and Li2016). We further examined the Spearman rank correlations among the variables extracted from each set of norms. As seen in Table 2, although there are significant correlations across the two sets, these tended to be low to medium in magnitude. Most critically, the percentage of Name Agreement, as extracted from the norms collected from L1 Hebrew speakers, was only weakly correlated with the percentage of Name Agreement extracted from L2 Hebrew speakers (r(320)=.212, p<.001). Given the observed differences between L1 Name Agreement and L2 Name Agreement, Study 2 examined their predictive value in timed picture-naming performance.
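For transparency, a sketch of these reliability and correlation computations is given below, assuming the psych package; the rating matrix and item-level data frame are toy stand-ins for the actual norming data.

```r
# Sketch of the reliability and correlation analyses (psych package assumed;
# toy data stand in for the actual norms).
library(psych)

set.seed(1)
ratings_wide <- matrix(sample(1:7, 20 * 5, replace = TRUE), nrow = 20)  # 20 items x 5 raters

# ICC(2,k): consistency of average ratings, two-way random-effects model
icc_out <- ICC(ratings_wide)
icc_out$results[icc_out$results$type == "ICC2k", "ICC"]    # the ICC(2,k) estimate

# Spearman rank correlation between L1 and L2 name agreement (toy values)
item_norms <- data.frame(na_L1 = runif(20), na_L2 = runif(20))
cor.test(item_norms$na_L1, item_norms$na_L2, method = "spearman")
```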
Table 2. Spearman correlations among rating dimensions in the L1 and L2 Hebrew norms (with ICCs on the diagonal)

Note: L1 Norms: N = 323, L2 Norms: N = 320, * p < .05; ** p < .01. Values below the diagonal are Spearman rank correlations between mean ratings. Diagonal values are intraclass correlation coefficients (ICC) indicating the reliability of ratings. Correlations across L1/L2 are marked in bold.
Of note, the subjective visual complexity data appear to differ substantially by group, with a low correlation between the ratings collected from L1 speakers and those collected from L2 speakers (see Table 2). This raises concerns regarding the degree to which these subjective evaluations of visual complexity appropriately capture the theoretical construct of interest, namely the visual features of the presented object. Indeed, prior research (e.g., Machado et al., Reference Machado, Romero, Nadal, Santos, Correia and Carballal2015) noted that subjective ratings may be biased by familiarity with the items (Forsythe et al., Reference Forsythe, Mulhern and Sawey2008), which may differ across L1 and L2 populations. This is also evident in our data on item familiarity, where the correlation between the scores provided by L1 speakers and those provided by L2 speakers is only moderate (see Table 2). Moreover, despite assumptions that visual complexity influences participants’ performance, there is convincing evidence for a null effect. Using a Bayesian meta-analysis, Perret and Bonin (Reference Perret and Bonin2019) showed that although visual complexity is often taken into account, it does not affect performance in picture naming experiments. We have, therefore, decided to exclude the visual complexity ratings from the analyses predicting timed picture-naming performance, as it is not clear what these ratings reflect (see Footnote 1).
Study 2—timed picture-naming
Methods
Materials
A subset of 135 pictures from the 320 normed pictures was used in the timed picture-naming task. Pictures were selected if they had available frequency counts in Hebrew (based on the heTenTen 2014 corpus via SketchEngine; see Kilgarriff et al., Reference Kilgarriff, Baisa, Bušta, Jakubíček, Kovář, Michelfeit, Rychlý and Suchomel2014), and if they had a name agreement higher than 70% in the L1 Hebrew norms. This was done in order to reduce frustration among the L2 speakers during the study and to avoid items that are likely to be unknown within this population. For this subset, L1 name agreement averaged 90% (SD = 8.13%, range 70%–100%), and L2 name agreement averaged 49% (SD = 26.94%, range 0%–100%). Selected items did not have cognate names with their English or Arabic translations. These 135 pictures were presented along with 29 additional pictures aimed at examining lexical ambiguity, as well as 7 items with two-word names, which are not reported here.
Participants
Two additional groups of participants, who did not take part in Study 1, were recruited for the timed picture-naming task. At the time of the study, they were students at a large university in Israel, where Hebrew is predominantly used. One group consisted of 162 native L1 Hebrew speakers. Of these, 27 participants were excluded (15 did not complete the experimental protocol; 9 had early exposure to another language; 3 were not born in Israel). A second group of 59 native Arabic speakers who had learned Hebrew as an L2 participated as the L2 Hebrew group. Of these, 9 participants were excluded for not completing the experimental protocol. The L2 participants were born in Israel to native Arabic-speaking families and started learning Hebrew in school (mean age of Hebrew onset = 7.76 years). Note that Hebrew is also the majority language spoken in the environment, providing additional exposure to Hebrew for these native Arabic speakers. The final set of participants, therefore, included 135 L1 Hebrew speakers (44 males, 91 females) and 50 L2 Hebrew speakers (19 males, 31 females) who were proficient enough in Hebrew to be enrolled in higher education carried out in Hebrew.
Participants received about $10 for their participation. Participants’ characteristics based on a detailed language history questionnaire (modified version of the LEAP-Q, Marian et al., Reference Marian, Blumenfeld and Kaushanskaya2007) administered at the end of the experimental task are available in Table 3.
Table 3. Linguistic characteristics of the final set of participants in the timed picture-naming task

Note: Values are means (standard deviations). * Denotes a significant difference between the groups at the p<.05 level based on t-tests. AoA refers to Age of Acquisition. SES refers to socio-economic status. Language proficiency in reading, writing, conversation, and speech comprehension, and language use in reading, writing, conversation, internet, listening, and TV watching, were rated on an 11-point scale, on which 0 indicated the lowest and 10 the highest level of ability/use. The averages of these ratings constitute the proficiency and use measures.
Procedure
Pictures were divided into two matched blocks presented to participants in randomized order. In between these two matched blocks, an intervening block including exposure to a different language was inserted for other purposes. As this exposure may have affected performance in the second block (e.g., Kreiner & Degani, Reference Kreiner and Degani2015), we opted to focus the analyses reported here on the first block presented to each participant. Thus, each picture was presented to half of the participants in each group. In total, each picture was named by an average of 66.09 (SD = 1.25; Range 62–68) participants in the L1 Hebrew group and an average of 24.99 (SD = 1.00; Range 24–26) participants in the L2 Hebrew group.
Each participant was tested individually in the lab, using E-Prime (Version 2.0, Psychology Software Tools, Inc., Pittsburgh, Pennsylvania). Each trial began with a fixation cross appearing at the center of the screen for 750 ms, followed by a picture, which was presented until a vocal response was recorded or for a maximum of 3000 ms. RTs were recorded by the computer program using a voice key, from the onset of stimulus presentation until the onset of the vocal response, and responses were digitally recorded for later coding of accuracy. Participants were instructed to name each picture out loud as quickly and accurately as possible.
Data analysis approach
Overall accuracy rates and RTs were analyzed using linear mixed-effects models in R (version 4.4.0, R Core Team, 2024), as these models allow one to simultaneously account for variance related to participants and to items. Accuracy data were analyzed using a mixed logistic regression approach via the glmer function from the lme4 package (v. 1.1-21, Bates et al., Reference Bates, Mächler, Bolker, Walker, Christensen, Singmann, Dai, Grothendieck, Green and Bolker2015). Linear mixed-effects models for RTs were fitted using the lmer function. RTs were log-transformed prior to analysis to reduce skew in the distribution. Log transformation is commonly used in psycholinguistic research, as it generally makes the distribution acceptable for statistical analyses without eliminating or altering potentially legitimate data points (Nicklin & Plonsky, Reference Nicklin and Plonsky2020). To evaluate the appropriateness of this approach, we also fitted generalized linear mixed-effects models using Gamma, Gaussian, and log-normal distributions on the raw RT data. These alternative models yielded highly similar fixed-effect patterns and marginal R2 values, supporting the use of log transformation in the main analyses.
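A minimal sketch of these fitting calls is shown below, assuming a hypothetical trial-level data frame d with one row per naming trial; the full predictor set is described in the next paragraph.

```r
# Sketch of the main fitting calls (lme4 assumed; 'd' and its columns are
# hypothetical stand-ins for the trial-level data).
library(lme4)

# Accuracy: mixed logistic regression
m_acc <- glmer(accuracy ~ group * na_L1 + (1 | subject) + (1 | item),
               data = d, family = binomial)

# RTs: linear mixed-effects model on log-transformed latencies (correct trials only)
m_rt <- lmer(log(rt) ~ group * na_L1 + (1 | subject) + (1 | item),
             data = subset(d, accuracy == 1))

# Robustness check: Gamma GLMM (log link) fitted to the raw RTs
m_rt_gamma <- glmer(rt ~ group * na_L1 + (1 | subject) + (1 | item),
                    data = subset(d, accuracy == 1), family = Gamma(link = "log"))
```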
The models included fixed effects of Group (deviation coding: −0.5 = L1 Hebrew, +0.5 = L2 Hebrew), the L1 norms (Name Agreement, Familiarity, and Typicality), and their interactions with Group. To control for differences across the two groups in English AoA, English Use, Age, and SES (see Table 3), these variables were included as covariates. Additional covariates included the lexical characteristics of length (in number of syllables) and frequency (based on the heTenTen 2014 corpus via SketchEngine; see Kilgarriff et al., Reference Kilgarriff, Baisa, Bušta, Jakubíček, Kovář, Michelfeit, Rychlý and Suchomel2014), as these were available for the items in Hebrew and may differentially predict naming latencies in different languages (Bates et al., Reference Bates, D’Amico, Jacobsen, Székely, Andonova, Devescovi, Herron, Ching Lu, Pechmann and Pléh2003). All continuous variables were centered prior to analyses. The random structure included by-participant and by-item intercepts, as well as a by-item slope for Group and a by-participant slope for L1 Name Agreement. Follow-up comparisons on significant interactions were conducted using the testInteractions function from the phia package, applying Bonferroni correction for multiple comparisons.
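The sketch below illustrates this specification, continuing the hypothetical data frame from the previous sketch (centered predictors carry the suffix _c); it is a simplified illustration rather than the exact fitted model.

```r
# Sketch of the fuller specification (hypothetical variable names; covariates centered).
d$group <- factor(d$group, levels = c("L1", "L2"))
contrasts(d$group) <- c(-0.5, 0.5)   # deviation coding: -0.5 = L1 Hebrew, +0.5 = L2 Hebrew

m_acc_full <- glmer(
  accuracy ~ group * (na_L1_c + familiarity_L1_c + typicality_L1_c) +
    eng_aoa_c + eng_use_c + age_c + ses_c + length_c + freq_c +
    (1 + na_L1_c | subject) + (1 + group | item),
  data = d, family = binomial
)

# Follow-up simple slopes of L1 name agreement within each group (Bonferroni-adjusted)
library(phia)
testInteractions(m_acc_full, fixed = "group", slope = "na_L1_c",
                 adjustment = "bonferroni")
```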
Results
Table 4 presents descriptive data on accuracy and RT measures of the timed picture-naming task in both groups. Accurate responses were only those that aligned with the expected response (i.e., the modal response derived from the L1 norms). All other responses were considered errors. RTs were computed on correct responses only, excluding trials with voice key errors (e.g., hesitations, repetitions), and were trimmed to eliminate extreme values below 300 ms (4% exclusion for L1 Hebrew and 17% exclusion for L2 Hebrew; see Footnote 2).
Table 4. Overall Mean (SD) Accuracy and RTs in the L1 and L2 picture naming task

Using L1 norms to predict L1 and L2 picture-naming performance
In order to test whether norms collected from L1 speakers (as is typically done in the literature) can serve as valid predictors of naming performance across a wide range of participants, including L2 speakers, we examined the extent to which the L1-collected norms predicted the timed picture-naming performance of both L1 and L2 Hebrew speakers.
We first checked for multicollinearity by verifying that the Variance Inflation Factors (VIFs) were below 5 for all predictors (Craney & Surles, Reference Craney and Surles2002). No problematic multicollinearity was identified.
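One way to obtain such VIFs for the mixed models is sketched below; the performance package is assumed, and the model object name follows the earlier sketches.

```r
# Sketch of the collinearity check: VIFs per fixed-effect term, with 5 as the cut-off.
library(performance)
check_collinearity(m_acc_full)
```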
Performance varied by language group, such that L1 Hebrew speakers performed the naming task more accurately and more quickly than the L2 Hebrew group (see Table 5), as expected given their language history. Furthermore, L1 name agreement significantly predicted performance, such that higher name agreement of a specific picture in the L1 norms was associated with better accuracy and shorter RTs when naming this picture in a timed picture-naming task (see Table 5). Critically, the effect of name agreement varied by group in both measures (see Figure 3). Follow-up pairwise comparisons revealed that whereas L1 name agreement significantly predicted accuracy (b = 0.49, SE = 0.10, χ²(1) = 24.31, p<.001) and RT (b = −0.02, SE = 0.01, χ²(1) = 18.10, p<.001) in the L1 Hebrew group, the effects were not significant in the L2 Hebrew group (for accuracy: b = −0.03, SE = 0.19, χ²(1) = 0.03, p = 1.00; for RT: b = −0.002, SE = 0.01, χ²(1) = 0.13, p = 1.00).
Table 5. L1 norms prediction of L1 and L2 picture naming task—model summary


Figure 3. Interaction between Group and L1 Name Agreement in Accuracy (Panel A) and RT (Panel B).
In addition, the main effect of familiarity was significant both in the RT analysis and in the accuracy analysis, but the effect of familiarity varied by group. Specifically, follow-up pairwise comparisons revealed that whereas L1 familiarity significantly predicted accuracy (b = 1.38, SE = 0.28, χ²(1) = 25.26, p<.001) and RT (b = −0.04, SE = 0.01, χ²(1) = 11.26, p<.01) in the L2 Hebrew group, in the L1 Hebrew group the effects were non-significant both in the RT data (b = −0.02, SE = 0.01, χ²(1) = 4.80, p=.057) and in the accuracy data (b = 0.25, SE = 0.14, χ²(1) = 3.15, p = .152).
Although there was also a significant interaction between group and typicality, follow-up comparisons revealed that typicality did not significantly predict accuracy (L1 Hebrew: b = 0.07, SE = 0.14, χ²(1) = 0.29, p = 1.00; L2 Hebrew: b = −0.52, SE = 0.26, χ²(1) = 3.86, p=.099) or RT in either group when examined separately (L1 Hebrew: b = −0.02, SE = 0.01, χ²(1) = 4.19, p=.082; L2 Hebrew: b = 0.01, SE = 0.01, χ²(1) = 0.53, p=.932).
Lexical frequency predicted performance, with increased frequency associated with greater accuracy and faster responses. No other effects were significant.
As expected, name agreement information extracted from the L1 norms significantly predicted timed picture-naming performance among L1 Hebrew speakers. Critically, this same measure did not predict the performance of L2 Hebrew speakers. In addition, L1 familiarity (how familiar the object is) and L1 typicality (how typical the object is of the provided name) were more predictive of L2 than of L1 naming data. These findings were observed while controlling for lexical characteristics (frequency and length) as well as individual characteristics pertaining to age, SES, English use, and English AoA, which differed across the groups.
One possible explanation for the finding that L1 name agreement did not significantly predict L2 naming performance is reduced power in the L2 analysis, as fewer participants were tested in this group. To rule out this possibility, we carried out an additional analysis in which we randomly selected a smaller subset of the L1 Hebrew speakers (n = 50, as in the L2 Hebrew group). With this smaller subset, L1 name agreement was still predictive of L1 naming performance in both accuracy (b = 0.41, SE = 0.11, χ²(1) = 13.08, p < .001) and RT (b = −0.02, SE = 0.01, χ²(1) = 11.12, p<.001), suggesting that reduced power cannot fully explain the differential pattern observed in the two groups. We next examined whether, within the same sample of 50 L2 Hebrew speakers, name agreement information derived from L2 speakers could predict timed picture-naming performance.
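A sketch of this subsampling check is shown below, continuing the hypothetical objects from the earlier sketches; the seed value is arbitrary.

```r
# Sketch of the power check: refit the accuracy model on a random subset of
# 50 L1 Hebrew speakers.
set.seed(123)                                   # arbitrary seed for reproducibility
l1_ids <- unique(d$subject[d$group == "L1"])
d_sub  <- subset(d, group == "L1" & subject %in% sample(l1_ids, 50))

m_acc_sub <- glmer(accuracy ~ na_L1_c + familiarity_L1_c + typicality_L1_c +
                     age_c + ses_c + length_c + freq_c +
                     (1 + na_L1_c | subject) + (1 | item),
                   data = d_sub, family = binomial)
```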
Using L2 norms to predict L1 and L2 picture-naming performance
One potential remedy for the problem described above, whereby L1 norms are ineffective in predicting the timed picture-naming performance of L2 speakers, is to collect norms from L2 speakers. Presumably, if name agreement and timed picture naming similarly rely on the structure of the mental lexicon, then the vocabulary knowledge of L2 speakers as reflected in the norms should align with the lexical processing of a separate group (sampled from the same population of L2 speakers) as reflected in timed picture-naming performance. Moreover, finding a meaningful relation between L2 norms and L2 timed picture-naming performance would reduce the concern that limited power due to the smaller sample size of the L2 speakers explains the lack of relation between L1 norms and L2 timed picture-naming performance.
We first checked for multicollinearity by calculating VIF for all predictors. Most predictors had VIF values below the cut-off of 5, indicating acceptable levels of collinearity. However, L2 Typicality exceeded the cut-off both in the accuracy data (VIF = 7.43) and in the RT data (VIF = 6.24) and was therefore removed from both models.
L2 Name Agreement significantly predicted performance, such that higher name agreement of a specific picture from the L2 norms was associated with better accuracy and shorter RTs when naming this picture in a timed picture-naming task (see Table 6). Critically, the effect of L2 Name Agreement varied by group (see Figure 4) in accuracy but not in RT. Follow-up pairwise comparisons revealed that L2 name agreement significantly predicted accuracy in both the L1 Hebrew group (b = 0.50, SE = 0.11, χ²(1) = 20.85, p<.001) and the L2 Hebrew group (b = 1.83, SE = 0.13, χ²(1) = 211.43, p<.001), but that the effect was stronger for the L2 Hebrew group.
Table 6. L2 norms prediction of L1 and L2 picture naming task—model summary


Figure 4. Interaction between Group and L2 Name Agreement in Accuracy (Panel A) and RT (Panel B).
Furthermore, as seen in Table 6, increased familiarity was associated with higher accuracy and shorter RTs, but again, the effect did not vary by group. Here, lexical characteristics, although included in the model, did not significantly predict performance.
General discussion
The current study collected normative data on a set of colored realistic pictures to be used for designing picture-naming tasks with Hebrew speakers. Critically, norming data were collected separately from L1 Hebrew speakers and L2 Hebrew speakers, revealing substantial differences in the distribution of name agreement (Study 1). Furthermore, L1-derived name agreement was useful in predicting the performance of native Hebrew speakers in a timed picture-naming task, when controlling for key lexical characteristics, demonstrating the predictive validity of these data. However, this same name agreement information did not predict the performance of a group of L2 Hebrew speakers who completed the same timed picture-naming task in Hebrew (Study 2). Moreover, name agreement information from the L2 Hebrew norms was useful in predicting both L1 and L2 picture naming, with stronger effects in the L2. Below we discuss each of these findings and their implications.
L1 Hebrew name agreement
The current normative dataset extends available resources applicable for use with healthy adult Hebrew speakers (Kavé, Reference Kavé2005; Duñabeitia et al., Reference Duñabeitia, Baciero, Antoniou, Antoniou, Ataman, Baus, Ben-Shachar, Çağlar, Chromý and Comesaña2022), using realistic photographs as stimuli, expected to yield more ecologically valid linguistic responses (see also Krautz & Keuleers, Reference Krautz and Keuleers2022; Souza et al., Reference Souza, Garrido, Saraiva and Carmo2021).
These norming data were used to predict timed picture-naming performance of Hebrew speakers. In particular, name agreement collected from L1 speakers significantly predicted timed Hebrew production when controlling for lexical characteristics, underscoring the utility of the norms. This alignment between L1 norms and L1 timed naming suggests that researchers interested in Hebrew word production may rely on the information available in these norms to reduce uncertainty and select appropriate pictures for their tasks.
Critically, the current findings further show that name agreement information collected from L1 speakers does not predict the performance of L2 speakers in a timed picture-naming task, when lexical characteristics are included in the model. An additional analysis further shows that this lack of an effect is unlikely to be due to reduced power, because the same measure was predictive for a comparably sized sample of L1 Hebrew speakers. These findings carry important practical implications for researchers testing diverse populations and comparing across language dominance groups. They call into question the assumption that norms collected on linguistic materials are useful across the board. As most available norms were collected from L1 speakers, our findings suggest that the usefulness of such name agreement information may be limited when testing the performance of L2 speakers, as is often the case in bilingualism research. Potentially, items assumed to be appropriate and “good” for timed production studies may be less suitable for L2 speakers due to reduced name agreement for such items within this population relative to L1 speakers. As this is the first study we are aware of to examine this issue, additional work and careful consideration in future studies are warranted.
L2 Hebrew name agreement
A unique feature of the current study is the compilation of normative data from a group of L2 speakers of the target language. Whereas previous picture resources in Hebrew included data from both native speakers and immigrants whose L1 was not Hebrew (Kavé, Reference Kavé2005), there was no information on the distribution of name agreement or other normative ratings. Furthermore, it is unknown whether the Hebrew norming data provided in the MultiPic dataset (Duñabeitia et al., Reference Duñabeitia, Baciero, Antoniou, Antoniou, Ataman, Baus, Ben-Shachar, Çağlar, Chromý and Comesaña2022) include responses from L2 speakers of Hebrew. The present study is the first to our knowledge to target this population for purposes of comparisons across groups. Momenian et al. (Reference Momenian, Bakhtiar, Chan, Cheung and Weekes2021) collected norms from both monolingual and bilingual speakers but did not report how norms collected from L1 speakers predicted the performance of L2 speakers of that language.
We provide normative information as estimated by native Arabic speakers completing the norming task in their L2 Hebrew. Comparisons across the norms collected from the L1 group and those collected from the L2 group reveal substantial differences. Most notably, the distribution of name agreement in the L2 group was wider than that of the L1 group (see Figure 2). This finding may be due to differences across populations in word knowledge and may suggest that the L2 speakers converge less in labeling objects in their L2. This may be linked to the observation that bilinguals tend to use similar patterns when naming objects in their L1 and L2 (Ameel et al., Reference Ameel, Storms, Malt and Sloman2005). Thus, it is possible that conceptual influences from the L1 Arabic of these participants affected their naming patterns in their L2 Hebrew, as reflected in the norms. Furthermore, this increased name agreement variability in the L2 relative to the L1 is also consistent with findings from translation norms, where highly proficient speakers were more consistent in their responses (Prior et al., Reference Prior, Wintner, MacWhinney and Lavie2011). Taken together, these factors suggest that the norms from L1 and L2 speakers reflect the unique linguistic experiences and challenges of each group. This underlines the necessity of developing separate norms for L1 and L2 speakers to accurately capture the nuances of language use in different populations.
L2 norms were predictive of L2 timed picture-naming performance. When a new group of native Arabic speakers performed a timed picture-naming task in Hebrew, their L2, information extracted from the L2 norms predicted both their accuracy and their RTs when controlling for lexical characteristics. This finding stands in contrast to the observation that the L1-collected norms did not predict L2 performance. Thus, the degree of convergence of the norming participants in their L2 was linked to the ease with which different L2 speakers (sampled from the same population) performed the timed picture-naming task. This suggests that collecting norms from L2 speakers may be advantageous for stimulus selection in production studies with diverse populations, including L2 speakers.
Visual complexity, typicality, and familiarity
Interestingly, not all information derived from the norms was similarly effective. Specifically, following the procedures outlined in Moreno-Martínez and Montoro (Reference Moreno-Martínez and Montoro2012), participants not only provided the name of each pictured object, but also rated how visually complex they perceived the picture to be (visual complexity), the degree to which they were familiar with the object (familiarity), and the degree to which they thought the object was typical of the name they provided (typicality). The effects of these variables were somewhat unexpected. First, visual complexity ratings substantially differed across L1 and L2 speakers, questioning the degree to which these ratings accurately capture the visual properties of the pictured object (see also Forsythe et al., Reference Forsythe, Mulhern and Sawey2008; Machado et al., Reference Machado, Romero, Nadal, Santos, Correia and Carballal2015; Nadal et al., Reference Nadal, Munar, Marty and Cela-Conde2010). Objective visual complexity scores may be a more appropriate measure (e.g., Momenian et al., Reference Momenian, Bakhtiar, Chan, Cheung and Weekes2021), but which objective visual complexity measure best aligns with human judgments requires additional evidence (e.g., Nath et al., Reference Nath, Brändle, Schulz, Dayan and Brielmann2024). Moreover, we observed that L1 familiarity predicted performance in L2 speakers but not in L1 speakers. At the same time, L2 familiarity predicted accuracy and latencies in both groups. Of note, it is not clear that these ratings fully reflect the intended dimensions, warranting caution in interpreting these findings. In particular, familiarity and visual complexity ratings were expected to depend predominantly on the pictured object and to be relatively uninfluenced by the name of the picture or by participants’ knowledge of the L2 name. Accordingly, we expected high correlations between ratings of familiarity and visual complexity across the two groups of L1 and L2 participants. Nonetheless, visual complexity was only weakly correlated, and familiarity with the object was only moderately correlated, across the L1 and L2 norms (see Table 2). These patterns suggest that the processes underlying these ratings may vary across participants (see also Table 1), putting their validity into question.
Typicality was expected to reflect both conceptual processing and lexical knowledge, as participants were asked to rate how typical the object is for the provided name. In our L1 norms, this measure was correlated with familiarity of the object, as most participants likely knew the name of the object in their L1. Conversely, in the L2 norms, L2 typicality was highly correlated with both L2 name agreement and L2 familiarity, suggesting that lexical knowledge and conceptual processing of the object jointly affected L2 typicality ratings. However, this high correlation precluded the possibility of estimating its unique contribution in the model. Because similar dimensions were not always collected in other recent norming studies (Momenian et al., Reference Momenian, Bakhtiar, Chan, Cheung and Weekes2021; Duñabeitia et al., Reference Duñabeitia, Baciero, Antoniou, Antoniou, Ataman, Baus, Ben-Shachar, Çağlar, Chromý and Comesaña2022), it is difficult to determine whether the observed pattern can be confidently generalized, and the issue awaits additional work.
Limitations
The current data underscore the potential in collecting and using normative data from L2 speakers, because L2 name agreement was a more valid predictor of L2 speakers’ performance than L1 name agreement. To make this comparison, we treated the L1 norms as the point of departure, selecting items with L1 name agreement above 70% and treating the modal response from the L1 norms as the expected response in the timed picture-naming task for both groups (L1 and L2). These decisions may have affected the results in two ways. First, the range of name agreement in the L1 was reduced relative to that of the L2, limiting its potential predictive power. Of note, relaxing this criterion might have increased frustration on the part of the L2 speakers when encountering lower quality or less familiar items, and may have resulted in close to floor performance, given that accuracy rates were already not high in the L2 group. Furthermore, the decision to rely on the L1 modal response as the expected response may have resulted in lower accuracy for the L2 group in the timed picture-naming task than would have been attained had we adopted a more lenient criterion, because in 23% of the items the modal response derived from the L1 norms differed from that derived from the L2 norms, suggesting that L2 speakers tend to name these objects differently. Nonetheless, given that our interest was in the alignment of the norms with the naming data, any other coding scheme would have reduced the potential correlation between the L1 norms and the L2 naming data. Additional research exploring these decisions further is needed.
Furthermore, our norms focused on the role of name agreement in the context of frequency and length, as well as familiarity and typicality, but other measures known to affect picture naming performance were not explicitly examined in the current study. For instance, AoA ratings (Kuperman et al., Reference Kuperman, Stadthagen-Gonzalez and Brysbaert2012) and additional semantic characteristics such as concreteness and imageability were not tested here, because such information was not readily available for Hebrew, and because these measures may differ between L1 and L2 speakers. Moreover, as noted above, a limitation of our work is that we collected subjective visual complexity ratings, but these turned out not to reflect the intended construct reliably. Furthermore, as such measures often do not predict picture naming performance (e.g., Perret & Bonin, Reference Perret and Bonin2019), we excluded these from the analyses. Objective measures of visual complexity may be a better way forward (e.g., Momenian et al., Reference Momenian, Bakhtiar, Chan, Cheung and Weekes2021), but because there are multiple complexity dimensions that could be calculated and it is not clear which of these best reflects human judgments (e.g., Nath et al., Reference Nath, Brändle, Schulz, Dayan and Brielmann2024), these were not computed here. Nonetheless, future research that incorporates multiple measures of visual complexity (e.g., reflecting both local elements of pixels and more global estimations of the perimeter) may shed light on the degree to which visual complexity exerts differential influences on L1 and L2 naming performance.
Relatedly, the focus of the current study was on comparisons across L1 and L2 speakers of the target language, leading us to compare native Hebrew speakers to native Arabic speakers with Hebrew as the L2. However, all participants were also proficient to some extent in English, making them bilinguals and trilinguals, respectively. The analytic approach we adopted allowed us to control for any cross-group differences in English use and AoA (as well as Age and SES), but the degree to which the findings extend to other monolingual and bilingual groups remains to be explored, as bilingualism may somewhat modulate the alignment between name agreement and timed picture-naming performance (Momenian et al., Reference Momenian, Bakhtiar, Chan, Cheung and Weekes2021).
The L2 results are based on bilinguals of two typologically similar languages (Arabic-Hebrew) who are highly proficient in their L2. The degree to which this finding holds for other bilingual populations, including those who differ in their patterns of language use or trajectories of language learning, awaits additional research.
Finally, it would be valuable to explore whether our results hold with larger picture sets and other languages. Specifically, one may test whether the name agreement distribution and predictive power of the recently published MultiPic (Duñabeitia et al., Reference Duñabeitia, Baciero, Antoniou, Antoniou, Ataman, Baus, Ben-Shachar, Çağlar, Chromý and Comesaña2022) remain the same when naming is collected from L2 speakers of the same target languages. Relatedly, the degree to which the results generalize to other types of items (e.g., pictures of actions, as in Momenian et al., Reference Momenian, Bakhtiar, Chan, Cheung and Weekes2021) remains to be explored.
Conclusion
The compilation of norms we present here from both L1 and L2 speakers provides researchers with a valuable resource for conducting cross-group comparisons and selecting appropriate stimuli that are applicable across different populations. Moreover, understanding the degree to which L1 and L2 norms align in predicting timed performance is broadly important when studying participants with diverse language backgrounds. Whereas there are picture norming studies in which the same set of stimuli was named by different groups, allowing for comparisons across languages (e.g., MultiPic, Duñabeitia et al., Reference Duñabeitia, Baciero, Antoniou, Antoniou, Ataman, Baus, Ben-Shachar, Çağlar, Chromý and Comesaña2022), the current study is the first to our knowledge to allow comparisons using not only the same stimuli but also the same target language, across two groups who differ in their order of language acquisition. With this novel design, our study revealed that L1 norms are useful in predicting L1 timed picture-naming performance, but that they are unable to predict L2 timed picture-naming performance. Normative information derived from L2 speakers was useful in predicting the performance of L2 speakers in a timed task, highlighting the potential of compiling normative picture information from diverse populations.
Replication package
All data, materials, and scripts are available through the OSF platform at https://osf.io/2nwzm/. None of the experiments reported here was preregistered.
Acknowledgements
The authors would like to thank Yael Marchuson and Miri Goldberg for assistance with data collection, and Aia Zuabi for assistance with data coding. The research was funded by ISF grant 1341/14 awarded to HK and TD.
Competing interests
The authors declare none.
Appendix—Model summaries including visual complexity measures
Table A1. L1 norms prediction of L1 and L2 picture naming task—model summary

Table A2. L2 norms prediction of L1 and L2 picture naming task—model summary

