1. Introduction
When infants begin learning a spoken language, there is a period in which only a small portion of the speech that they hear corresponds to words they can recognise. This must be true: infants hear sentences from birth, and at birth, they do not know any words. So, in early infancy, most of each utterance presents a mysterious flow of intricate phonetic information. How do infants move on from this point? For the most part, infant language learning research has addressed this question in three ways: by tracking infants' progress in learning the speech-sound categories of the native language (Kuhl et al., 1992; Werker & Tees, 1984), by computational modelling of how these native sounds could be grouped together to locate words (e.g., Brent & Cartwright, 1996), and by experimentally testing infants' recognition of word-forms presented repeatedly in sentence passages (e.g., Jusczyk & Aslin, 1995).
Possibly because infants seem so adept at learning their language's phones, and therefore (perhaps) at categorising them in continuous speech, relatively little research has addressed the question of how variability in the realisation of speech sounds and words might affect the word discovery process. Most computational studies of infant word segmentation elide the question of phonetic variability entirely, presenting models with error-free sequences of phonological labels stripped of any extraneous phonetic content (e.g., Batchelder, 2002; Cabiddu et al., 2023). When researchers draw word frequency counts from infant-directed speech corpora, all the instances of a word are taken to increment the count equally, as if all were equally accessible to the child (e.g., Swingley & Humphrey, 2017). But laboratory studies of infants' word-finding using familiarisation–test procedures indicate that phonetic variability has a substantial impact on infants' tendency to treat instances of the same word as being, in fact, the same (e.g., Houston & Jusczyk, 2000; Singh et al., 2004). Even 2-year-olds and 3-year-olds have trouble recognising phonetically reduced word-forms, even when this reduction poses no difficulty for adults (Beech et al., 2023). In sum, ordinary processes of phonetic reduction can have a material impact on infants' categorisation of words, but descriptive work has left more unknown than known about how phonetic reduction plays out in infant-directed speech.
A premise of the present work is that understanding how infants start discovering words requires a better characterisation of patterns of reduction and hyperarticulation in infant-directed speech. If infants often fail to encode hypoarticulated instances of words or do not consider them equivalent to their more exaggerated, canonical forms, then simulating early development depends on knowing how common reduced forms are and how they are distributed. For example, studies of discourse among adults and among parent–child dyads show that when a word is used twice in close succession, the first instance is often hyperarticulated relative to the second (e.g., Bortfeld & Morgan, 2010; Fisher & Tokura, 1995; Fowler & Housum, 1987; Tippenhauer et al., 2020). If one of the ways infants discover words is by noting repeated stretches of speech in nearby utterances (Nencheva et al., 2024), we can learn more about how this process might work by assessing the magnitude and consistency of first-mention effects. If first-mention effects are large and consistent, it would suggest that infants could use a strategy of listening for hyperarticulated portions of sentences and monitoring subsequent sentences for the presence of similar-sounding stretches of speech, though this would only work if infants were capable of relating hyperarticulated forms to their reduced variants.
Here, in a sample of American English infant-directed speech, we assessed the magnitude and distribution of first-mention effects on spoken word clarity. Based on prior research, we expected that the clearest, most emphatic, or most hyperarticulated instances of words would tend to be those occurring for the first time in a discourse and that the subsequent mention of these words would be substantially reduced. In addition, we evaluated several predictors we thought might relate to the first-mention effect, or to spoken-word clarity more generally: word frequency, whether the parent thought the child knew the word, and whether the word was utterance-final.
To summarise, then, our goal was to estimate the following effects on speech clarity, as measured using rating and transcription tasks: (a) the size of first-mention effects; (b) whether first-mention effects would be modulated by word frequency, child knowledge of the relevant word, or utterance-final word position; and (c) whether word clarity, independent of first or second mention, would be affected by frequency, child word knowledge, or sentence position.
In contrast to some prior research, our goal was to characterise the problem facing the infant, not to describe the difference between infant-directed speech and adult-directed speech. As a result, we were able to examine this question using a speech corpus of natural, unscripted interaction, rather than speech hedged in by constraining situational devices meant to channel conversations with adult and child addressees onto similar paths. We selected word repetitions from the Brent corpus of infant-directed interaction (Brent & Siskind, 2001) and played them to adult native English speakers for transcription and judgements of clarity. These responses permitted us to evaluate the magnitude of first-mention effects and the conditions under which they appear.
2. Experiment
2.1. Methods
Stimuli. This work was originally done (and preregistered) as two separate experiments with different stimulus sets, selected according to slightly different criteria. The original purpose of the second experiment was to clarify certain effects present in the first dataset that seemed likely to have been carried by a small number of stimuli. Indeed, the primary result of the second study was that these effects did not replicate with a stimulus set that was better balanced on the relevant variables. In the interest of contributing a more concise report that avoids such detours, the two experiments have been merged here (see Footnote 1).
Our first step was to extract repeated word tokens from the recordings of eight mother–child dyads in the Brent corpus (the same portion of the corpus analysed in Swingley & Humphrey, 2017). Repetitions were defined as word types of identical orthographic transcription repeated in consecutive utterances. The first mention was always the first instance in its source recording session. Types occurring for the first time within the initial 125 utterances of the session were excluded, on the grounds that they might also have been spoken not long before the recording began. Items repeated within an utterance were excluded, because in many cases these were not in real sentences (e.g., “teeth teeth teeth.”). “Utterance” was defined as in the original Brent corpus transcription: a stretch of speech from one talker with no pauses longer than 300 ms.
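For concreteness, this pair-extraction step can be sketched as follows. This is a minimal illustration, not the original extraction script; it assumes a tokenised transcript in a data frame with hypothetical columns session, utt_num (utterance index within the session), and word.

```r
# Sketch of repetition-pair extraction (not the original script). Assumes a
# data frame 'utts' with one row per word token and hypothetical columns:
# session, utt_num, and word.
library(dplyr)

find_repetition_pairs <- function(utts, skip_first = 125) {
  utts %>%
    group_by(session, word) %>%
    mutate(first_utt = min(utt_num)) %>%
    # exclude types first occurring in the initial utterances of the session
    filter(first_utt > skip_first) %>%
    # keep only tokens from the first-mention utterance and the next one
    filter(utt_num %in% c(first_utt, first_utt + 1)) %>%
    # require exactly one token in each of two consecutive utterances,
    # which also drops items repeated within a single utterance
    filter(n() == 2, n_distinct(utt_num) == 2) %>%
    ungroup()
}
```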
These sets of repetition pairs were then restricted to just those words that appear on the Words and Gestures version of the MacArthur-Bates Communicative Development Inventory (Fenson et al., 1994), which had been completed at the researchers' 12-month visit by 7 of the 8 mothers, and at the 15-month visit by all 8. This allowed us to evaluate, in a quite preliminary way, whether intelligibility varied with the parent's belief that her child understood the word she was saying.
In creating the first stimulus set, we chose words that were used by at least four of the eight mothers, to restrict the set to broadly common types. We dropped this restriction in creating the second stimulus set, to expand the number of possible items. The second stimulus set also excluded the dyad for which no 12-month CDI was available (mom “s2”). Its items were selected so that, across pairs, utterance-final and non-utterance-final tokens were balanced as evenly as possible for both first and second mentions, and so that these counts were in turn as similar as possible between words reported known and unknown on the 12-month CDI. The two stimulus sets were otherwise selected the same way.
We excluded items that usually serve as function words and rarely receive emphatic focus (e.g., am, gonna, for, is), words functioning as proper names (e.g., mama, Piglet), words that are onomatopoeic or occur in stereotyped interactions (e.g., pattycake, hi, yum), and words that fall outside the usual phonological constraints of English (e.g., hmm, mmhm). Some word types were included as pairs more than once, but in such cases, the pairs were always drawn from different dyads.
Having created a pool of potential items this way, two members of our lab listened to each of the available tokens to evaluate its recording quality, checking for undesirable features like hiss, overlapping speech, or unusually quiet vocalisation, but not considering the degree of emphasis or articulation of the words. To be considered for inclusion in the study, both tokens of a pair needed a recording-quality rating of at least 3 out of 5 from both raters.
Once the potential item set was narrowed by these constraints, we used a quasirandom process to select the final set of tokens, sampling as evenly as possible from the dyads.
The final item set included 172 pairs in the first set and 113 pairs in the second. Their distribution by speaker is given in Table 1. Infant age at the time of the recording ranged, for both stimulus sets, from 8 months, 28 days to 15 months, 8 days; the mean age was 12 months, 9 days for the first stimulus set and 12 months, 11 days for the second. The words were typical of the vocabulary of the CDI: airplane, bug, cry, open, store, where, and so forth. The full list of words is given in the publicly available dataset.
Table 1. Stimulus distribution over Brent dyads. Dyad labels (c1, d1…) are as given in the Brent dataset

The words evaluated came from different sentence positions. Classifying each token as coming from a one-word utterance (“isolated”), as the first or last word of a multi-word utterance, or as utterance-medial, the words may be counted as shown in Table 2. Across the first and second utterances, the mean number of words occurring between the first mention and the second was 3.5 (sd, 2.3; 25th percentile, 2; 75th percentile, 5).
Table 2. Stimulus distribution by sentence position. Tokens from one-word utterances are excluded from the other categories in the table

Once identified, the audio tokens were extracted from their sentences and scaled to have equivalent maximum amplitude using the norm effect of the sox utility.
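As an illustration, this normalisation step can be scripted by calling sox from R; the directory names here are hypothetical.

```r
# Amplitude normalisation by calling sox's norm effect from R; directory
# names are hypothetical. 'norm' scales each clip to a common maximum
# amplitude (0 dBFS by default).
clips <- list.files("clips_raw", pattern = "\\.wav$", full.names = TRUE)
dir.create("clips_norm", showWarnings = FALSE)
for (f in clips) {
  system2("sox", c(f, file.path("clips_norm", basename(f)), "norm"))
}
```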
Procedures. All judges completed three tasks: rating, transcription, and paired comparison. In the rating and transcription tasks, judges heard one word at a time, indicated on a scale from 1 to 5 how clearly realised they thought the word was, and then typed in as a free response what word they thought they had heard. Listeners were permitted to re-play the word as often as they desired (though they usually did not; most listeners re-played only a few items). In the paired comparison task, both members of a first-mention/second-mention pair were played, and judges indicated one or the other as the more clearly articulated and easier to understand. Re-plays were not permitted in this task. The rating and transcription tasks were completed first, and then the paired comparison task. The instructions for the rating tasks are given in the Supporting Materials. All items were used in all three tasks for every participant.
To set up the presentation orders for the paired comparison task, items were divided into an a set and a b set, with each set including one pair of each word (i.e., each word type: car, ball, …). In trial order 1, the first-mention token was presented first on paired trials for the a set, and the second-mention token was presented first for the b set. The reverse was true in trial order 2. In this way, every token, whether a first-mention or a second-mention in the corpus, was presented first or second in a paired trial equally often. All trials were randomly ordered at presentation time. For each listener, half of the items were thus presented with the first mention played first, and half with the second mention played first.
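The counterbalancing logic can be sketched as follows; the data frame and column names are hypothetical, and the actual assignment of pairs to sets followed the word-type constraint described above.

```r
# Sketch of the counterbalancing scheme; data frame and column names are
# hypothetical. Pairs are split into an 'a' set and a 'b' set, and the two
# trial orders reverse which token of each pair is played first.
pairs <- data.frame(pair_id = 1:10)                      # toy example
pairs$set <- rep(c("a", "b"), length.out = nrow(pairs))

make_order <- function(pairs, order = 1) {
  a_first <- if (order == 1) "first_mention" else "second_mention"
  b_first <- if (order == 1) "second_mention" else "first_mention"
  pairs$played_first <- ifelse(pairs$set == "a", a_first, b_first)
  pairs[sample(nrow(pairs)), ]   # trials randomly ordered at presentation
}

order1 <- make_order(pairs, order = 1)
order2 <- make_order(pairs, order = 2)
```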
Out of a concern that vagaries of the participant recruitment process might introduce confounding variation across the (original) Experiments 1 and 2 datasets, a quasirandom selection of pairs from the first study was included in the stimulus set of the second study. This matched set of stimuli allowed us to evaluate the possibility that participants in the two experiments were different enough in their performance to limit the comparability of the experiments. The items were selected by averaging the participant ratings of the first and second mentions of each pair; placing the pairs into quintiles; and randomly sampling five pairs from each quintile. These 25 pairs (50 token clips) were analysed separately to compare the participant pools for the two stimulus sets.
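The quintile-based selection can be sketched as follows, with toy stand-in data; in the actual study, the mean rating was computed from the participant ratings of the two mentions of each Experiment 1 pair.

```r
# Sketch of the matched-pair selection, with toy stand-in values.
set.seed(1)
items1 <- data.frame(pair_id = 1:172, mean_rating = runif(172, 1, 5))
# place pairs into quintiles by mean rating, then sample 5 pairs from each
items1$quintile <- cut(rank(items1$mean_rating), breaks = 5, labels = FALSE)
matched <- do.call(rbind, lapply(split(items1, items1$quintile),
                                 function(q) q[sample(nrow(q), 5), ]))
nrow(matched)   # 25 pairs, i.e., 50 token clips
```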
Participants were recruited using Prolific, an online testing platform. Thirty-six judges were included in the final sample, evenly divided between the Experiment 1 and Experiment 2 stimulus sets. PennController software (Zehr & Schwarz, 2018) handled stimulus presentation and data recording. At the start of the experimental session, participants were asked a few questions about their demographic background. All participants indicated that English was their first language and their language of daily use. To guard against respondents who were not actually English speakers, or who might have been software programs rather than humans, a few introductory multiple-choice questions were asked about the names of household objects shown in an image (a sieve, pliers), and a few simple linguistic word problems were given. Performance on these was variable, but almost all participants got most of the questions right. The total score on these tests did not correlate with performance in the transcription task, nor with the degree to which a given participant's responses correlated with the average responses of the other participants. Responses were also checked for suspicious behaviours like conspicuous patterns in the ratings or long stretches of strange transcription responses. No participants were excluded on the basis of any of these checks. The whole procedure took judges about 45 minutes.
3. Results
We began by assessing whether the second experiment's participants were comparable to the first experiment's participants in their responses to the items that the two experiments had in common. Indeed, judges behaved similarly: considering subject means over all items, mean likelihoods of choosing the first mention as clearer were similar (Expt 1, 56.4%; Expt 2, 56.9%; t(34.0) = 0.50, p = 0.88), and the ratings advantage for first mentions over second mentions was similar (Expt 1, 0.216; Expt 2, 0.267; t(27.1) = 0.81, p = 0.42). Considering averages over subjects for each item, proportions of first-mention choices in the first and second studies were correlated at r(23) = 0.78, p < .001, and the first-mention ratings advantages were correlated at r(23) = 0.89, p < .001. Thus, any differences in outcomes between the first and second studies were not readily attributable to important differences in the participants.
Paired comparisons of first and second mentions. We present the judges' responses on each of the three tasks in turn, starting with the paired task, which asked judges to compare the clarity of the two tokens of a word originally spoken in consecutive sentences. To begin, Figure 1 provides an overview of the first-mention advantage, showing every item from each of the dyads as the proportion of judges picking the first mention as the clearer token. The effect was not large, and it varied substantially across items, but it was shown by all eight of the dyads.

Figure 1. For each word (item), the proportion of judges selecting the first mention as the clearer token. Boxplots show medians and quartiles of the items averaging over subjects. Means are shown as red triangles. Results for each pair are overlaid on the boxplots, with items from the Experiment 1 set as circles and Experiment 2 as x’s. Dyads are listed using the codes given in the Brent corpus.
We tested a series of stimulus characteristics that might influence judges' choice of the first-mention token as the clearer one. First, we evaluated whether the order in which a given token appeared on a given trial affected performance. Indeed, it did: judges tended to choose the first mention as clearer 66.3% of the time when they heard it second (se 2.4% over items) but only 49.6% of the time when they heard it first (se 2.5% over items). This difference was significant in a paired t-test (t = 12.6, df = 284, p < 0.0001). We speculate that this effect arose because judges hearing the word a second time were primed by the first presentation, and the resulting ease of recognition of the second token biased their judgement of its clarity of articulation. This effect was large and consistent but did not interact with any variable of interest in any of the subsequent analyses. Given that every item was counterbalanced over this variable and was responded to by the same number of judges in each presentation condition, this order effect does not interfere with interpretation of the other variables; however, because the size of the trial position effect varied somewhat between judges, regressions predicting the size of the first-mention effect included a random term combining judge and trial position.
We then evaluated whether first-mention effects were moderated by whether the parent thought the child knew the word. The dataset included parent reports on the CDI checklist instrument for each tested word at 12 months and 15 months. The recordings spanned the ages of 8 to 15 months. This means that for some children and some words, the CDI results were informative about the mother's estimate of the child's knowledge of the words she was saying; for other words, the CDI results were ambiguous. For example, a “known” on the CDI at 12 months indicates that in a 13-month-old recording, the parent probably thought the child knew the word; but in a 10-month-old recording, we do not really know, because the child might have learned the word at 11 months. Similarly, if the parent indicated “not known” on the 12-month CDI and “known” at 15 months, we may assume “not known” for an 11-month recording but cannot be sure for a 14-month recording. Following these considerations to their conclusions for each item in the dataset, 140 words were classified as not known at the time of recording, 52 as known, and 93 as ambiguous and excluded from these analyses. (We also computed a parallel set of analyses that included all of the words, using the result on the 12-month CDI as the CDI measure. These analyses yielded very similar results.)
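This classification rule can be expressed as a small function. The following is our rendering of the logic just described, with hypothetical argument names.

```r
# Our rendering of the word-knowledge classification described above.
# age_mo is the child's age at recording; known_12 and known_15 are the
# CDI reports at 12 and 15 months.
classify_knowledge <- function(age_mo, known_12, known_15) {
  if (known_12) {
    # reported known at 12 months: known for later recordings, but for
    # earlier ones the word may have been learned after the recording
    if (age_mo >= 12) "known" else "ambiguous"
  } else if (!known_15) {
    "not_known"   # never reported known through 15 months
  } else {
    # not known at 12 months but known at 15: not known for recordings
    # up to 12 months, ambiguous in between
    if (age_mo <= 12) "not_known" else "ambiguous"
  }
}
```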
In this analysis, we also evaluated potential effects of word frequency on the first-mention effect, considering the possibility that parents might speak a word more clearly the first time if it were rarer, but then back off this extra hyperarticulation after the first mention. Word frequency was estimated from the relevant speaker's own data within the Brent corpus. Thus, this analysis predicted first-mention choice (yes or no) from trial position, CDI status, centred (log) word frequency, and the interaction of CDI status and frequency, with random effects for subject and trial position, and for word.
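In R model syntax, this corresponds to a mixed logistic regression along the following lines; the use of lme4 is our assumption, and the variable names are hypothetical.

```r
# Sketch of the choice model using lme4 (assumed; variable names are
# hypothetical). chose_first codes whether the judge picked the
# first-mention token; trial_pos codes whether that token played first or
# second on the trial; log_freq_c is centred log frequency.
library(lme4)
m_choice <- glmer(
  chose_first ~ trial_pos + cdi + log_freq_c + cdi:log_freq_c +
    (1 + trial_pos | judge) + (1 | word),
  data = choices, family = binomial)
summary(m_choice)
```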
Parents did not exhibit stronger or weaker first-mention effects for words they thought their children knew. In addition, neither frequency nor its interaction with CDI status predicted first-mention choice. Table 3 displays the analysis.
Table 3. Regression predicting judges’ choice of whether the first-mention or second-mention token was the clearer one. Predictors are named at left. The condition that contrasts from baseline is given in square brackets. exp(coef) indicates the multiplicative change in the odds of choosing the first-mention token when going from the baseline case to the indicated case

Another element of parental speech we considered was the tendency to place important words sentence-finally in infant-directed speech (e.g., Aslin et al., 1996; Fernald & Mazzie, 1991). Utterance-final words often have longer vowels than utterance-medial words in infant-directed speech (e.g., Swingley, 2019) and, in hyperarticulated contexts, are recognised more easily (Fernald et al., 2001; see also Seidl & Johnson, 2006). There is some evidence that words that appear frequently in utterance-final position are learned more easily than words that do not (Frank et al., 2017; though see Swingley & Humphrey, 2017). Thus, it seemed reasonable to consider whether utterance-finality might relate to first-mention hyperarticulation. We would expect first-mention choices to be more likely when the first mention was utterance-final and less likely when the second mention was utterance-final. An interaction might suggest that when parents place successive instances of a word utterance-finally, they are using a teaching register that could be immune to the typical discourse effects of second mentions.
The regression revealed a significant effect of utterance-final positioning of the first mention, favouring its selection; a nonsignificant complementary effect in the (expected) opposite direction for utterance-final positioning of the second mention; and no interaction. Frequency modulated the effect of utterance position on first-mention effects: when words were more common, utterance-final first mentions yielded larger first-mention advantages (p = .021), and utterance-final second mentions yielded smaller first-mention advantages, though this latter effect was not significant (p = .075). See Table 4.
Table 4. Logistic regression predicting judges’ choice of whether the first-mention or second-mention token was the clearer one. Predictors are named at left. The condition that contrasts from baseline is given in square brackets

This may be viewed graphically in Figure 2. When words were low in frequency (on the left side of each facet), utterance finality had a minimal impact on choice of the first-mention token. As words gained in frequency, listeners' choice of the first or second mention as the clearer one was increasingly dominated by utterance position, on those items for which the two tokens differed in utterance-finality. Thus, for high-frequency words, listeners chose the first mention when only it was utterance-final, and the second mention when only it was utterance-final. When utterance-finality was equivalent for the first and second mentions, frequency did not have a significant impact on the size of the first-mention effect. We confirmed this result in a variant of the above regression analysis that restricted the dataset to the 207 items for which neither or both tokens were utterance-final. In this restricted dataset, frequency did not have a significant impact on first-mention choice, whether on its own (coef. = −.05, p > 0.5) or in interaction with utterance position (i.e., both-final versus neither-final; coef. = .07, p > 0.5). The interaction of frequency and utterance position on the first-mention effect is probably not really “about” the first-mention effect per se; its effects on first- versus second-mention clarity may be best viewed as collateral effects of the confounding of first and second mention with utterance position in pairs where utterance position varies between tokens. This is somewhat easier to see in the independent clarity ratings than in the relative judgements over pairs, so we turn to the clarity ratings next.

Figure 2. For each item pair, the proportion of judges who chose the first mention as the clearer one, split into facets according to whether the first or second mention was utterance-final. Within each facet, items are arrayed on the x axis by (log) frequency.
Clarity ratings. In addition to making judgements of the relative clarity of first and second mentions, participants also rated the clarity of each of the words, presented in a randomised order, on a scale of 1 to 5. The paired choice task discussed above optimised fine comparison between the first and second mentions by explicitly requiring relative judgements in which one item was the standard for judging the other. The choice task has a weakness, though: we can only evaluate the magnitude of the difference between first and second mentions by comparing decisions over judges. If judges agree with one another, this seems to indicate a larger difference; if they disagree, a more subtle one. The single-item rating task, on the other hand, yields a more continuous measure, if at the cost of relying on judges maintaining a consistent set of internal rating criteria over trials. Ratings can also be used independently of the comparison of first and second mentions, a point we return to later. The item-by-item results are shown in Figure 3.

Figure 3. For each word pair, the mean rating of the first mention and the second mention. Each Brent corpus dyad is given separately, indicated by the dyad identifier from the corpus. Points above the diagonal line show items that had a first-mention advantage.
To establish that first mentions were, in general, rated as clearer than second mentions, we computed a linear model over the set of items, with each item represented as the mean difference in ratings of the first mention and the second mention, averaging over judges. These difference scores were approximately normally distributed, with positive values indicating a first-mention advantage. Predictors in the regression were CDI status (known, not known), centred log frequency, and utterance position of the two tokens (both utterance-final, only the first mention utterance-final, only the second, or neither). The model results are given in Table 5.
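A sketch of this item-level model, coding the two tokens' utterance-finality as separate binary predictors (our rendering; variable names are hypothetical):

```r
# Item-level model of the rating difference scores (a sketch). rating_diff
# is each pair's mean first-mention rating minus its mean second-mention
# rating; final1 and final2 indicate whether the first and second mentions
# were utterance-final.
m_diff <- lm(rating_diff ~ cdi + log_freq_c + final1 * final2, data = items)
summary(m_diff)
```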
Table 5. Analysis of first-mention and second-mention mean difference scores in ratings of word clarity, including CDI result and utterance-final positioning

The R package emmeans was used to assess the significance of the first-mention effect. Weighting the levels of “cdi” and “utterance position” according to their proportions in the dataset, the estimated mean advantage of first mentions was 0.260 (on the rating scale from 1 to 5), which was significantly greater than zero (s.e. = .054; t(186) = 4.81, p < 0.0001; see Footnote 2).
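Given the item-level model sketched above, the corresponding call might look like the following; the package is named in the text, but the call details are our reconstruction.

```r
# Estimated overall first-mention advantage, weighting factor levels by
# their proportions in the data.
library(emmeans)
test(emmeans(m_diff, ~ 1, weights = "proportional"))
```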
The analysis showed no impact of CDI knowledge on the size of the first-mention effect. Word frequency was also unrelated to the size of the first-mention effect. Regarding utterance position, the first-mention effect was significantly larger when the first-mention token was utterance-final and tended to be smaller when the second-mention token was utterance-final, though this latter difference was not reliable. As one might expect, the first-mention advantage when the first-mention token was utterance-final was diminished when the second-mention token was also utterance-final, though this interaction effect was only marginally significant (p = 0.068).
Given that there was no sign of an impact of CDI knowledge on the first-mention effect, a follow-up analysis excluded the CDI and could therefore include all 5130 observations in the dataset (the 18 judges' first-mention minus second-mention difference scores for 285 items). Predictors were frequency, utterance position (utterance-final or not), and their interactions.
The results were consistent with the outcome of the choice task. Ratings were affected by utterance-final position: when only one of the tokens was utterance-final, that token was rated more highly, either increasing the first-mention effect (if the first mention was utterance-final) or decreasing it (though not significantly) if the second mention was utterance-final. Word frequency modulated these first-mention effects, with more common words showing the utterance-finality enhancement more strongly. These outcomes are enumerated in Table 6 and displayed graphically in Figure 4.
Table 6. Analysis of first-mention and second-mention mean difference scores in ratings of word clarity. The analysis includes all trials, and tests the interaction between frequency and utterance-final position. Log frequency is mean-centred


Figure 4. For each word pair, the mean rating of the first mention and the second mention, arrayed according to word frequency. The facets indicate which tokens of the pair were utterance-final. The first-mention tokens' ratings are shown as darker blue points, the second-mention tokens' ratings as lighter red points. Lines show linear fits, with grey shading indicating a 95% confidence interval.
Considering Figure 4, we can estimate the first-mention effect as the difference between the dark blue regression lines and the lighter red ones. The effect is essentially unaffected by frequency (the lines are nearly parallel) when neither word was utterance-final (leftmost panel) or when both were (rightmost panel). The two centre panels show the outcome of mixing an utterance-final advantage with the first-mention advantage. The first- versus second-mention difference changes with frequency in both cases, though in opposite directions, as one would expect: strengthening when the first mention is utterance final, becoming increasingly negative when the second mention is final. The ratings data reveal a feature of this effect that was not visible in the choice data, namely that frequency’s influence on the first-mention effect is carried mainly by the utterance-medial word. Ratings of tokens that were not utterance-final fell off strongly with frequency (r = −.270, t(232) = −4.27, p < 0.0001), whereas ratings of utterance-final tokens fell off less strongly (r = −.099, t(334) = −1.82, p = 0.069). This suggests that mothers maintain hyperarticulation for utterance-final words, but as words become more common, they allow phonetic reduction to take place in less privileged sentence positions. This generalisation appears to be true independently of first-mention effects.
In one final analysis of the ratings data, we considered effects of CDI knowledge, not on the size of the first-mention effect but purely on speech clarity. Table 7 shows the results of an ordinal regression predicting single-trial ratings from mention (first versus second), utterance-final position, centred log frequency, CDI status, the interactions of CDI status with mention and with utterance position, and the interaction of log frequency and utterance position. Listener identity and item pair were included as random effects, with random slopes for mention. The CDI interactions were not significant but are included here in keeping with the purpose of the analysis. The results show that second mentions were rated lower; more frequent words were rated lower; utterance-final words were rated higher; and words that the mother thought her child knew were rated higher. Variants of this analysis that included other interactions (not shown) indicated that none were significant or near significant.
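The text does not name the fitting software for this model; a sketch using the ordinal package's clmm function, with hypothetical variable names, might look like this.

```r
# Sketch of the trial-level ordinal model using the ordinal package (our
# assumption). Ratings are treated as an ordered factor.
library(ordinal)
d$rating <- factor(d$rating, levels = 1:5, ordered = TRUE)
m_ord <- clmm(
  rating ~ mention + final + log_freq_c + cdi +
    cdi:mention + cdi:final + log_freq_c:final +
    (1 + mention | listener) + (1 + mention | item_pair),
  data = d)
summary(m_ord)
```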
Table 7. Ordinal regression predicting ratings of words. Data were entered at the trial level. Contrasts were treatment coded. [yes] and [known] in brackets refer to the level designated as treatment. All data for which CDI scores were available were included

Most of these effects are familiar from previous analyses; for example, the frequency effect’s interaction with utterance position is visible in Figure 4. Note, for example, the steep slope of the frequency effect in the “neither” panel of that figure (both tokens utterance-medial) relative to the shallower slope in the “both” panel (both tokens utterance-final), and the analogous findings in the middle panels. Utterance-finality clearly protects words from frequency-based reduction, to some degree.
The CDI effect is in the wrong direction for the hypothesis that mothers would hyperarticulate words more strongly or more consistently when children do not know them yet. Instead, mothers seem to hyperarticulate when they think it will help them be understood. We return to this theme in the Discussion.
Transcriptions. Finally, we considered judges’ transcriptions of the words. This task was intended to address a limitation of rating methods, which is that it is difficult to calibrate rating differences to functional consequences. It could be that ratings are just aesthetic judgements whose range is limited enough that even poorly rated words would be perfectly recognisable.
Judges' free transcription responses were not always words. Responses not in the PronLex dictionary (Kingsbury et al., 1994) were evaluated one by one. When they appeared to be typographical errors or misspellings, they were corrected (coffe for coffee, manget for magnet). When they were nonwords but seemed plausibly intended as such, they were retained, and pronunciations were estimated for them by analogy to other words (myan in response to lion was assumed to rhyme with lion). When responses were English words with more than one pronunciation (like read), the pronunciation was assumed to be the one closest to the transcription of the word in the corpus. Pronunciations were tabulated to evaluate the phonological distance between responses and the corpus transcription, potentially providing a more sensitive measure than the binary outcome of whether a response matched exactly. Distances were computed using the R stringdist function's implementation of the Levenshtein distance metric (van der Loo, 2014). The Levenshtein distance, also known as the edit distance, is the minimum number of insertions, deletions, or substitutions required to convert one string into another.
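The distance computation itself is straightforward with stringdist, which the text names. The strings below are illustrative stand-ins for pronunciation encodings (one character per phone).

```r
# Levenshtein distances with the stringdist package, as named in the text.
library(stringdist)
stringdist("laIn", "laIn", method = "lv")   # exact match: distance 0
stringdist("laIn", "maIn", method = "lv")   # one substitution: distance 1
stringdist("laIn", "laInz", method = "lv")  # one insertion: distance 1
```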
Most words were recognised; that is, most responses matched the corpus transcription. Among first mentions, 70.5% of trials' responses matched; among second mentions, 66.4%. This difference was significant by proportion test (χ² = 19.34, p < 0.0001). The full table of distances is given in Table 8, enumerated over responses without averaging. For each pair, the mean Levenshtein distance of the responses to the second mention was subtracted from the mean distance to the first mention, giving a distribution of difference scores that was roughly normal in form, with a mean of −0.133 and standard deviation of 0.713 (first quartile, −0.333; third quartile, 0.167). The mean was significantly less than zero by two-tailed one-sample t-test (t(284) = −3.141, p = 0.0019). Note that because the measure is a distance, a negative mean corresponds to a tendency for first mentions to be closer to the correct pronunciation than second mentions.
Table 8. Estimated phonological distances of all word responses from the canonical pronunciation of the corpus word

Predictors of transcription accuracy were evaluated using negative binomial regression (similar to Poisson regression, but accounting for overdispersion in the outcome distribution). Data were entered at the trial level, with the outcome being the Levenshtein distance of the response from the canonical pronunciation of the spoken word. Predictors were mention, (log) frequency in the maternal corpus, word knowledge on the CDI, whether the word was utterance-final, the interactions of each of these predictors with mention, the interaction of frequency and utterance-finality, and random-effects terms for subject and for item pair, with a slope term for mention in the item effect. This analysis is given in Table 9.
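A sketch of this model using lme4's glmer.nb (our assumption about the fitting function; variable names are hypothetical):

```r
# Sketch of the transcription-distance model. lev_dist is the Levenshtein
# distance of a trial's response from the canonical pronunciation.
library(lme4)
m_nb <- glmer.nb(
  lev_dist ~ mention * (log_freq_c + cdi + final) + log_freq_c:final +
    (1 | judge) + (1 + mention | item_pair),
  data = d)
summary(m_nb)
```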
Table 9. Negative binomial regression predicting the phonological distance of transcriptions from the correct word. For the “mention” predictor, regression coefficients are negative when first-mention effects are stronger. Exponentiated coefficients (exp(coef)) give the multiplicative change in phonological distance expected given a unit change in the predictor. Contrasts are treatment coded. Material in brackets, like [yes], refers to the level designated as treatment. Random effects were listener and item pair with random slopes for mention

No interaction terms involving mention were significant; thus, the results provided no robust evidence for an effect of word frequency, CDI-reported word knowledge, or utterance position on the magnitude of the first-mention effect on transcription accuracy. Removing the nonsignificant interaction predictors and re-running the analysis yielded the outcome presented in Table 10.
Table 10. Negative binomial regression predicting the phonological distance of transcriptions from the correct word, with nonsignificant interactions removed from the formula (cf. Table 9)

Transcriptions of second mentions were significantly less close to the target than transcriptions of first mentions. Higher maternal word frequency was associated with greater distance, in keeping with the typical effects of frequency on reduction, but this effect was attenuated in utterance-final position. In general, utterance-finality was linked to closer proximity to the canonical form. Words were also closer to the canonical form when mothers thought their child knew them.
The effects of frequency and sentence position on transcription accuracy are shown in the right panel of Figure 5, with the analogous ratings data in the left panel. Recall that for the transcription measure, greater distance (higher on the y axis) corresponds to lower clarity. Over trials, 77.6% of the utterance-final transcription distances were zero; only 55.4% of the utterance-medial distances were zero. The effect of utterance position on transcription accuracy was quite large and became more pronounced with greater word frequency.

Figure 5. For each word pair, the mean rating of the first mention and the second mention (left panel) or the transcriptions’ mean distance from the correct target word (right panel) arrayed according to word frequency. Utterance-final words are shown in red, non-utterance-final words in grey. Coloured lines are linear regressions of rating or distance on log frequency, separated according to sentence position category. Marginal density plots show ratings or distance distributions by utterance position; areas under the curves sum to one. The plots show a very large effect of utterance position on both measures, and a significant tendency for more frequent words to be spoken less clearly.
Summary of the results across the three tasks. The three tasks yielded similar results: however we asked judges to evaluate words, first mentions were found to be moderately clearer phonetically than second mentions, as expected. This effect was nevertheless quite variable over items, and its magnitude was not predictable from our measurements of word frequency, maternal estimates of the child's knowledge of the word, or utterance-finality. The results therefore failed to support a number of hypotheses about maternal discourse effects: for example, that maintenance of more emphatic realisations from first to second mention might coincide with maternal beliefs about the child's knowledge of the word, or that mothers might tend to compensate for placing a word utterance-medially in a second mention by hyperarticulating it a bit more. If such effects are present in English infant-directed speech, they may be too weak to emerge from the myriad other influences on the phonetic realisation of words.
First- and second-mentions aside, ratings and transcriptions showed that high word frequency tended to go along with less hyperarticulation, and utterance-finality with more. Finally, parents were more likely to hyperarticulate words they thought their children knew.
4. Discussion
On embarking on this project, we expected that phonetic reduction on words' second mention would be consistent and substantial, given prior reports (e.g., Bortfeld & Morgan, 2010), though perhaps weaker than in adult conversation (Fernald & Mazzie, 1991). In fact, we found that first-mention effects were weak and variable, though certainly present in the speech of all eight mothers. We also anticipated that first-mention effects might depend on whether the mother thought her child already knew the word she was saying. Because known words (by definition) do not need to be taught, parents might feel freer to offer reduced second mentions of known words relative to unknown words. Instead, we found the clarity of known words to be greater than the clarity of unknown words independently of mention.
If maternal conversation with infants were primarily dedicated to word teaching, we might expect the reverse, namely that as-yet-unknown words would be the clearest of all. Instead, the data show considerable heterogeneity in clarity. Why might this be? One possibility is that a word’s referent might be independently clear from the situational context. For example, if a toy car figured prominently in a play interaction before being mentioned, a parent might reasonably refrain from hyperarticulation because the lexical concept was already “given” in the discourse.
Another possibility is that the parent's priority is not always to maximise the likelihood of being understood. Parents speaking with their infants have a range of goals. One of these is just to maintain an ongoing social connection, a goal that might lead parents to produce a stream of talk without necessarily ensuring that its details are linguistically interpretable. Parents also sometimes talk simply to entertain themselves while changing diapers or washing bottles. Perhaps this diversity of speech functions underlies some of the variability in the clarity effects found here (Beech & Swingley, 2024).
Consider, for example, Figure 6. The dominant feature of the ratings distribution is that within adjacent utterances, ratings of repetitions of words are correlated, the first with the second (r = 0.775, t(283) = 20.6, p < 0.0001). It appears that there are sentence pairs in which common words were relevant enough to be repeated, but that were still spoken with low clarity, and other sentence pairs in which both words were hyperarticulated. If hyperarticulation is linked primarily to the speaker’s desire to be understood, rather than to the speaker’s interest in teaching new words, it follows that there are some utterance pairs in which the parent does not make extra efforts to use clear articulation, either because conveying a given word is not a priority, or because she considers it unnecessary in context.

Figure 6. Over all items, mean rating for the first mention and the second mention of each pair. The first-mention advantage is shown by colour and position. When judges gave higher (better) clarity ratings to the first mention than the second mention, the plot point for that item falls in the upper left portion of the plot.
Utterance-finality is well known to be linked with clarity, particularly in infant-directed speech (e.g., Fernald & Mazzie, 1991). Parents place words that are significant in the discourse in utterance-final position and frequently also speak those words with noticeable pitch peaks and increased duration in the words' segments (e.g., Aslin et al., 1996; Swingley, 2019). Fernald and Mazzie found that parents talking to children maintained this pitch feature on the second mentions of words to a greater degree than adults did in conversation with other adults. This behaviour would be expected to reduce first-mention effects.
The frequency effect found here is familiar in psycholinguistics and broadly attested (Clopper & Turnbull, 2018; Jurafsky et al., 2001). There is some debate in the literature about whether reduction of high-frequency words reflects production processes or audience design. Perhaps words uttered more frequently are spoken along well-canalised production pathways that have weathered away some of the distinguishing features of the component sounds. Or perhaps speakers are alert to the in-the-moment needs of the listener, and intuit that listeners require clearer realisations of lower-probability words. If the latter hypothesis explained our results, the lack of interaction with the CDI results would be surprising; a priori, one would expect the word-knowledge variable to dominate the frequency variable. If parents think their child does not know a word, despite its high occurrence frequency, an audience-tuned phonetic approach would suggest hyperarticulating such words independently of their frequency. This would be revealed as an interaction, which we did not observe. Thus, our results suggest that the frequency effects may emerge from psycholinguistic processes in the speaker having to do with word retrieval and production representations, rather than fine attunement to the needs of infant listeners.
What do these results mean for infant word-finding early in language learning? The words we tested here were relatively privileged words, first of all for being repeated in successive utterances, and second for being present on the CDI. These are the kinds of words that make up children’s early lexicons. The substantial phonetic variability with which they are apparently presented to children suggests that we should not assume that a word’s presence in a transcript implies that the word is available to the infant, particularly if it is not yet familiar. It is not yet known whether presentation of a very clear token as a first mention followed by a relatively hypoarticulated token helps infants to accept the second as an instance of the same type as the first (much as our adult listeners found the second token they heard to be considerably more interpretable than the first). If so, it might also help educate infants about the typical form of phonetic reduction.
This study has some limitations. The listening conditions of the participants, and the participants’ language backgrounds, cannot be guaranteed, so it is possible that there is some contamination in the data. In addition, a dataset with more vocabulary measurements would be quite useful; although we established some robust relations between CDI status and word clarity, their strength would be better estimated with vocabulary measurements made closer in time to the speech samples. Our conclusions about vocabulary knowledge would be stronger, too, if the CDI comparisons could be made within words across children. Here, although some words were tested with more than one pair (in different dyads), there were not enough CDI-discordant items to perform a within-word test of the relationship between clarity and word knowledge, and as a result, it is theoretically possible that the CDI effects are actually facts about the particular set of words children tend to know and that an unmeasured variable is responsible for the greater average clarity of the “known” words. This could be resolved using a design specifically targeting words whose “known” status varies across children, or developmentally within children.
Computational models of infant word-finding typically make the simplifying assumptions that words are always spoken the same way, or with only minor contextually dependent variation, and that infants can reliably extract a veridical phonological transcription of spoken words as strings of syllables or phones. This is unlikely to be the case, given the marked variability in word realisations noted here and in prior research (e.g., Bard & Anderson, 1983). An alternative possibility is that at first infants operate over the phonetic signal directly rather than over phonetic categorisations, deriving both words and phones at the same time (e.g., Feldman et al., 2013; Swingley, 2009). Given how consistently we found that utterance-final words were easier to identify and rated as clearer than other words, it might be useful to contemplate an alternative model of the infant as listening for prominent utterance-final chunks of speech, or for sequences that are repeated in close succession (McInnes & Goldwater, 2011; Nencheva et al., 2024), and building the initial lexicon from these.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0305000925100263.
Acknowledgements
The author thanks M. Brent for his role in creating the Brent Corpus and thanks many dedicated research assistants at Penn for their annotation efforts over the years.
Funding statement
This work was supported by NSF grants SBE-1917608 and SBE-2444175 and uses corpus resources that were created under NIH grant R01-HD049681 to D. Swingley. Portions of the research reported here were presented at the Workshop on Infant Language Development (Donostia) in 2022.