1. Introduction
Figurative language is defined as a form of language use where the meaning of the stimulus is not directly available from the words it comprises but is learned by convention. For example, ‘bright student’ describes the intelligence, rather than the luminosity, of the person. Figurative expressions are pervasive in everyday communication (Cameron, 2008), yet they are unique as linguistic expressions in that they tend to be emotionally charged, far beyond the emotionality of the words they contain (Citron & Goldberg, 2014; Citron, Lee, & Michaelis, 2020). In this study, we investigate some of the processes involved in the acquisition of figurative language and examine how the emotional content of figurative expressions unfolds.
Conceptual metaphor theory (Lakoff & Johnson, 1980) suggests that metaphors typically consist of mappings between concrete and abstract concepts. For example, ‘bright’, something we can experience with our senses, is used metaphorically to convey the otherwise abstract concept ‘intelligent’. Metaphorical language is used much more frequently to describe personal events that are high in emotional intensity (Fainsilber & Ortony, 1987). Furthermore, Citron and Goldberg (2014) found that conventional metaphors conveying taste were more emotionally engaging than literal sentences and increased activation of the left hippocampus and amygdala; similar results were found for longer passages containing metaphors not restricted to taste (Citron et al., 2016; see also Citron, Michaelis, & Goldberg, 2020). The amygdala has been associated with emotional memory processing (McGaugh, 2004), and the encoding of emotional memories has been correlated with co-activation of both the amygdala and the hippocampus (for a review, see Walker, 2021).
Increased activation in a number of brain regions (see meta-analysis by Bohrn et al., 2012) suggests that greater processing resources are involved in the understanding of figurative compared to literal language. However, conventional metaphors (e.g. ‘bright student’) have been found to require far fewer processing resources than novel metaphors (Lai & Curran, 2013); in fact, conventional metaphors are processed as quickly as literal expressions as their meaning is accessed directly (Gibbs et al., 2004). The meaning of novel metaphors, by contrast, needs to be created on the fly; for example, ‘her husband is an elephant’ can mean that he is clumsy or large or that he has a long nose, and this requires inferences, theory of mind and other pragmatic processes (Festl, 2010). Novel metaphorical word pairs (e.g. ‘chilly awkwardness’) may also differ in their sensibility, with some definitions (‘uncomfortable social situation’) matching the novel word pair more congruently than others (‘busy traffic with angry drivers’), and these congruent pairs might, again, be easier to process.
Though we know little about when and how figurative language engages readers or listeners emotionally, there is substantial knowledge of how emotion and memory inter-relate. For instance, verbal memory for emotional information tends to be more accurate than that for non-emotional information (Adelman & Estes, 2013; Kensinger & Corkin, 2003; McGaugh, 2004). Emotional valence, which describes how positive or negative a stimulus is, and arousal, relating to how calm or exciting an experience is (Russell, 2003), have different effects on memory. A meta-analysis by Pereira et al. (2023) on (non-verbal) emotional memories found that positively or negatively valenced memories were remembered less well than neutral memories. However, arousal (regardless of valence) resulted in improved memory compared to neutral memories. For verbal stimuli, the few studies that have distinguished valence and arousal show conflicting effects. Kensinger and Corkin (2003) found that words high in negative valence and high in arousal lead to the greatest increases in memory performance, whereas Adelman and Estes (2013) reported significant increases in memory performance for words of both negative and positive valence; however, there was no effect of arousal. Thus, although differences in memory performance for stimuli varying in emotionality are observed, the exact effects of valence and arousal for encoding verbal stimuli remain somewhat unclear.
A range of evidence indicates a role of sleep in promoting both learning and memory (Berres & Erdfelder, 2021; Cellini et al., 2016; Cordi & Rasch, 2021; Rasch & Born, 2013), and sleep may thus be vital to the encoding of new figurative expressions and their emotional enkindling. For example, a period of sleep after learning increases memory for word pairs compared to wakefulness (Wilson et al., 2012), and integration of novel word pairs into long-term memory has also been found to occur after a period of sleep (Dumay & Gaskell, 2007). The active systems consolidation hypothesis (Diekelmann & Born, 2010; Marshall & Born, 2007) suggests that this boost to learning and memory during offline periods of sleep is due to the reactivation and integration of these memories from the hippocampus, where they are initially encoded, into long-term semantic representations in the neocortex. This encoding in the cortex can then reduce the effects of interference that may occur if encoding remains in the hippocampus (Dastgheib et al., 2022; Frankland & Bontempi, 2005).
In addition to promoting memory of information content, sleep may affect the processing of emotional information (Chambers & Payne, 2014; Niu et al., 2024; Payne et al., 2008; Reid et al., 2022; Wagner et al., 2001; Walker, 2021). Lipinska et al. (2019) conducted a meta-analysis of sleep effects on emotional memory consolidation and found that there was no overall effect of sleep in promoting memory for emotional stimuli in particular, though there was some evidence of a beneficial effect when studies involved overnight sleep or free recall instead of only recognition memory. Focusing on emotional memory recognition studies, Schäfer et al. (2020) found in a meta-analysis that sleep improved memory for neutral and emotional stimuli compared to wake. There was no overall benefit for emotional stimuli from sleep, but there was some evidence that rapid eye movement (REM) sleep might relate to a boost to emotional memory. Davidson et al. (2021) conducted a meta-analysis of studies comparing sleep and wake groups on emotional memory consolidation. They found that the majority of studies did not find an effect of sleep on emotional memory compared to neutral stimuli and concluded that if it is there, then it is likely a small effect.
In parallel with consolidating memory content, sleep may also change the affective tone of the information itself. The sleep to forget, sleep to remember (SFSR) hypothesis (Walker, 2021) suggests that emotional memories do persist over time, but the emotional response to such information is manipulable (e.g. Xia et al., 2023) and can be reduced as a consequence of sleep, though such effects are not always observed (e.g. Cellini et al., 2016). However, a key feature of previous sleep and memory studies of emotion is that they employ stimuli that are already imbued with emotion, which prevents an investigation of how emotion and memory might develop in synchrony. In this study, therefore, we provided participants with novel stimuli – metaphorical expressions – that become charged with emotion during learning.
We also varied the extent to which the meaning of the figurative expressions to be learned was congruent or incongruent with the words they contained. Figurative language varies in this respect; for instance, ‘She spilled the beans’ is congruent (or semantically transparent), with its idiomatic meaning of ‘revealing a secret’, which can be almost intuitively inferred from its constituting words, whereas ‘He went down rabbit holes’ is incongruent (or semantically opaque) in that the words it contains are unrelated to its idiomatic meaning of ‘becoming deeply involved in a complex situation, which often leads to even more complications’; indeed, this meaning must be learned (Keysar & Bly, 1995). Typically, semantically transparent figurative expressions are processed more quickly, and thus more easily, than opaque expressions (e.g. Bell & Schäfer, 2016; Gibbs, Nayak & Cutting, 1989). This congruence, or semantic transparency, might also relate to different effects of sleep on learning and memory. Payne et al. (2012) compared learning of word pairs that were either semantically related or unrelated and found that sleep resulted in a larger advantage for memory of unrelated word pairs. Ashton et al. (2022) found that semantically incongruent word pairings (e.g. red–elephant) were remembered worse than congruent pairings, but this disadvantage was significantly smaller following sleep, indicating that sleep promoted storage of the incongruent pairs. Sio et al. (2013) found, for a problem-solving study, that harder-to-process material benefited more from sleep than wake, consistent with the observation that unrelated material – also harder to learn – is boosted by sleep.
The current study therefore investigates whether sleep differentially affects memory for conventional and novel metaphorical word pairs of negative, positive and neutral valence that vary in their congruence. We made four predictions. First, there will be an overall benefit of sleep compared to wake on both recognition and definition recall accuracy of both conventional and novel metaphors as sleep promotes memory. Second, sleep will enhance memory for the harder-to-process novel metaphorical word pairs compared to conventional metaphors, and also, sleep will boost incongruent more than congruent pairs. Third, emotionality of the word pairs will enhance their accurate recognition (Adelman & Estes, 2013; Kensinger & Corkin, 2003). Due to inconsistencies within the literature as to the effects of valence versus arousal, we may expect to see an increase in memory performance for word pairs rated high in negative valence and arousal (Kensinger & Corkin, 2003), or increases in memory for word pairs of both negative and positive valence, independent of arousal (Adelman & Estes, 2013). Fourth, if sleep affects emotional memory processing (Walker, 2021), then, according to the SFSR hypothesis, ratings of arousal and valence of the word pairs will be reduced as a consequence of sleep.
2. Method
2.1. Participants
Forty-nine Lancaster University students took part in the study for course credit or payment of £7. All participants gave informed consent and were fully debriefed at the end of the study. Non-native British English speakers (n = 6) were removed from the analysis, as well as one participant who self-reported being in bed for less than 6 hours (cut-off based on previous studies of sleep and memory, e.g. Diekelmann et al., 2008). Two participants were removed due to a computer error during completion of the experiment. This led to the inclusion of 40 participants (34 female), mean age = 20.8 years, standard deviation [SD] = 3.01 and range = 18–32, which meant we had a power of 0.8 to find effects of d = 0.8 or greater. Participants were randomly allocated to the sleep condition (N = 20; 15 female, 5 male; mean age = 20.4 years, SD = 1.6, range = 18–24) or the wake condition (N = 20; 19 female, 1 male; mean age = 21.2 years, SD = 3.9, range = 18–32). Ethical approval was granted by the University’s faculty ethics board, and the study was conducted in accordance with the World Medical Association Declaration of Helsinki ethical guidelines.
2.2. Materials
We collated 71 conventional and 67 novel metaphorical word pairs from a range of sources: 56 conventional and 59 novel word pairs were taken from Liu et al. (2019), two conventional word pairs from Mashal and Itkes (2014) and five conventional and four novel word pairs from Forgács et al. (2014). Eight conventional and four novel metaphors were also added.
Using these word pairs, we collected ratings on a 7-point scale measuring sensibility (the extent to which a word pair made sense), metaphoricity (the extent to which a word pair expressed a non-literal expression), familiarity (extent of exposure to the word pair), emotional valence (from negative to positive) and emotional arousal (from non-arousing to arousing), from 15 participants who did not take part in the main experiment. Thirteen of those same participants completed a second questionnaire providing a definition of what they thought the meaning of each novel word pair was.
We extracted 30 conventional metaphorical word pairs by choosing those that were rated high in metaphoricity (mean = 4.97, SD = 0.68), sensibility (mean = 6.39, SD = 0.33) and familiarity (mean = 5.48, SD = 0.76). We also extracted 60 novel metaphorical word pairs rated low in familiarity (mean = 2.15, SD = 0.83, comparison to conventional metaphors: t(62.94) = 18.98, p < .001, d = 4.12), to ensure these were not previously known to participants. The novel word pairs were also rated lower in metaphoricity (mean = 4.61, SD = 0.57, comparison to conventional metaphors: t(50.30) = 2.50, p = .016, d = 0.59) and sensibility (mean = 3.88, SD = 1.09, comparison to conventional metaphors: t(77.55) = 16.35, p < .001, d = 2.75). For both conventional and novel word pairs, we ensured that a third were rated high in negative valence (mean = −1.74, SD = 0.49), a third high in positive valence (mean = 1.44, SD = 0.54) and a third neutral in valence (mean = −0.14, SD = 0.49), which were significantly different from the negative, t(58.00) = −12.71, p < .001, d = −3.28, and the positive, t(57.34) = 11.76, p < .001, d = 3.04, word pairs. We chose negative and positive word pairs with ratings of arousal that were as similar as possible (mean for negative = 3.57, SD = 0.84; mean for positive = 4.13, SD = 1.03, t(55.69) = 2.28, p = .026, d = 0.59) and very different than the neutral word pairs, which were selected to be low in arousal (mean = 2.43, SD = 0.98), comparison to negative word pairs: t(57.85) = 6.53, p < .001, d = 1.69, comparison to positive word pairs: t(57.34) = 11.76, p < .001, d = 3.04.
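The fractional degrees of freedom in these comparisons suggest that unequal-variance (Welch) t-tests were used, although the text does not say so explicitly. Under that assumption, and assuming group sizes of 30 conventional and 60 novel word pairs, the following minimal sketch recomputes the familiarity comparison from the reported means and SDs. The function name `welch_t` is illustrative only and is not taken from the study's analysis scripts.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite df from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Familiarity ratings: conventional (n = 30) versus novel (n = 60) word pairs.
# Means and SDs are the rounded values reported in the text.
t_fam, df_fam = welch_t(5.48, 0.76, 30, 2.15, 0.83, 60)
print(f"t({df_fam:.2f}) = {t_fam:.2f}")
```

Because the inputs are rounded to two decimals, the result can differ slightly in the last digit from the reported t(62.94) = 18.98.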
We then divided these pairs into two sets, with 15 conventional and 30 novel word pairs each. One set was presented to participants during training and testing (seen pairs), and the other set was reserved as an unseen set for testing (unseen pairs).
Within each set, the novel word pairs were divided into two categories. Fifteen word pairs were assigned a congruent meaning, i.e. a meaning similar to those provided by participants in the questionnaire and therefore sensible: e.g., ‘jingling satisfaction’ was paired with the meaning ‘feeling very pleased and proud of yourself’. The other 15 word pairs were assigned an incongruent meaning, i.e. a meaning that differed from those provided by participants and was therefore less sensible: e.g., ‘jingling satisfaction’ was paired with ‘a generous reward’. Whether a word pair was in the congruent or incongruent condition was counterbalanced: half of the participants saw one set of word pairs with a congruent meaning, and the other half saw the same set with an incongruent meaning. Conventional metaphors, instead, were always assigned their congruent, conventionalised meaning; e.g., ‘handsome profit’ was paired with ‘a large sum of money made’.
The three metaphor conditions (conventional, novel congruent and novel incongruent) were each composed of three groups differing in valence, so that five word pairs had negative valence, five word pairs had positive valence, and five were neutral.
For seen word pairs, we ensured that conventional and novel word pairs differed in familiarity and sensibility but were controlled as far as possible on valence and arousal. We also ensured that conventional and novel word pairs did not differ in metaphoricity, valence or arousal, with the exception of positive conventional pairs, which were significantly higher in metaphoricity and valence than positive novel pairs. Since metaphors are often highly emotive (Citron, Lee, & Michaelis, 2020; Fainsilber & Ortony, 1987), the lower valence ratings of the positive novel pairs could be due to their lower metaphoricity. We also checked that valence was significantly different for positive versus neutral versus negative lists for both conventional and novel word pairs. All statistical tests are reported in the Supplementary Information (SI, available at https://osf.io/s3g98).
For seen versus unseen word lists, we ensured that there were no overall differences in familiarity, sensibility, metaphoricity, valence and arousal ratings for each subset of positive, neutral and negative conventional word pairs. We also tested for any differences between the seen versus unseen novel word pairs. There were no significant differences except that negative seen novel word pairs were rated lower than unseen for metaphoricity, and positive seen novel word pairs were rated higher than unseen for valence (see Supplementary Information S1 for statistics). These differences are not problematic because the effects of word pair properties on memory were only investigated within seen word pairs.
2.3. Procedure
Participants in the wake group were exposed to the word pairs and definitions at 9 am and then tested the same day at 9 pm after a day of wakefulness. Participants in the sleep group were exposed to word pairs and definitions at 9 pm and then tested at 9 am after a night of sleep. Participants in the sleep condition were required to wear an ActiGraph sleep monitor overnight to measure time spent asleep.
During the first session, participants first completed a questionnaire on their general sleep habits, a mood questionnaire (the Self-Assessment Manikin, SAM; Bradley & Lang, 1994), which includes measures of pleasure (or emotional valence), activation (or emotional arousal) and control (or dominance), and the Stanford Sleepiness Scale (SSS; Hoddes, Zarcone, Smythe, Phillips, & Dement, 1973). Participants were then seated approximately 60 cm from the computer screen and were exposed to the 15 conventional, 15 novel congruent and 15 novel incongruent word pairs and definitions. Participants were presented with each word pair and definition one at a time for 5 s each. The definition would then disappear from the screen, and participants were required to provide a rating of valence and arousal for the word pair. Valence was rated on a scale from negative (1) to positive (7), with 4 meaning completely neutral, and arousal was rated on a scale from not at all arousing (1) to completely arousing (7). Participants were asked to press the corresponding key on the keyboard to give their response. Participants were then required to reproduce the definition of the word pair by typing the meaning into the computer using the keyboard (first recall of definitions). Participants were not given a time limit to complete the ratings or definitions. Each word pair and definition was presented in random order and once only.
After all 45 word pairs and definitions had been presented to participants, we again gave participants each word pair one at a time in random order, without an accompanying definition, and asked them to type the definition of each word pair again (second recall). Participants were instructed not to guess and to type ‘I don’t know’ if they could not remember the definition. Each word pair remained on the screen until participants entered their definition, and participants were not given a time limit to complete these definitions. This test was included in order to determine the extent to which encoding of the definitions had been effective.
Participants then left the lab, and those in the wake condition were instructed not to nap throughout the day. Twelve hours later, after a period of overnight sleep or daytime wake, participants returned for the testing phase of the experiment. In this second session, participants were again asked to complete the sleep habit questionnaire, the SAM and the SSS. Participants were then asked to complete a recognition task on the screen (same setup as for the recall task above). They were presented with a central fixation cross for 500 ms, before one of the word pairs was presented at the centre of the screen. Participants were asked to decide whether they recognised the word pair as one that was presented to them in the first session (old) or not presented in the first session (new). Participants made their decision on whether the word pair was old or new on a scale from one to six based on their confidence, with 1 indicating ‘definitely old’ and 6 indicating ‘definitely new’ (see Weidemann & Kahana, 2016, for a similar approach). The word pair remained on the screen until participants made a decision. Participants were not given a response deadline but were asked to respond as quickly and accurately as possible. Once participants had made their decision, they were then presented with another fixation point for 500 ms, before another word pair appeared on the screen. The presentation order of each word pair was randomised for each participant.
After participants had seen all word pairs, they were then given each of the previously seen word pairs again, one at a time, without an accompanying definition. Participants were first asked to rate the valence and arousal of the word pair using the same rating scales as in the first session and then to provide the definition of the word pair by typing into the box provided on the computer screen (third recall). Participants were instructed not to guess and to type ‘I don’t know’ if they did not know the answer. The word pair remained on the screen until participants gave the ratings and definition, and participants were not given a response deadline.
The procedure is summarised in Figure 1.

Figure 1. Experimental procedure.
2.4. Data analysis
Data were analysed using the package lme4 in R. For binary outcomes (recall and recognition accuracy), generalised linear mixed-effects models were used. For response times (RTs), linear mixed-effects analyses were performed. For all analyses, to determine the best-fitting model, fixed effects were added incrementally. The model was initially fitted with random effects of participants and items (word pairs), and then, fixed effects were added to the model one at a time. Each model was compared to the previous model using likelihood-ratio tests, and fixed effects remained in the final model if they contributed significantly to model fit.
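As an illustration of the model-comparison step, the sketch below computes a likelihood-ratio statistic from two log-likelihoods and its p-value for the case where the added fixed effect contributes a single parameter (1 degree of freedom). It is written in Python rather than R, uses hypothetical log-likelihood values, and is not the study's analysis code; in lme4 the same comparison is typically obtained by passing two nested fitted models to `anova()`. For 1 degree of freedom, the upper tail of the chi-square distribution has the closed form erfc(sqrt(x/2)).

```python
import math


def lrt_statistic(loglik_reduced, loglik_full):
    """Likelihood-ratio statistic for two nested models."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_p_df1(x):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical log-likelihoods for a model without and with one extra fixed effect.
stat = lrt_statistic(loglik_reduced=-1200.0, loglik_full=-1197.5)
p_value = chi2_p_df1(stat)
keep_effect = p_value < 0.05  # retain the effect only if it improves model fit
print(f"chi2(1) = {stat:.3f}, p = {p_value:.3f}, retain effect: {keep_effect}")
```

The closed form applies only to single-parameter additions; a factor with more than two levels adds more than one parameter and needs the general chi-square distribution.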
For recall of definitions, whether the definition provided by participants was correct or incorrect was determined by two scorers independently. Responses were scored as correct if the definition was the same or similar to that learned during training. If synonyms of the definition were used, the definition was scored as correct. Where the scorers differed in their decision, the participant’s response was discussed between scorers, and a final decision was made. Both recall and recognition accuracy were coded such that a correct response was given a score of 1, and an incorrect response was given a score of 0.
We converted the six-point confidence ratings (from 1, definitely old, to 6, definitely new) to accuracies. A response was scored as accurate when a previously seen word pair was judged as seen before with any degree of confidence (1, 2 or 3 on the confidence scale), or when a word pair that was not previously seen was judged as not seen before, again with any degree of confidence (4, 5 or 6 on the confidence scale). Supplementary Information S2 reports analyses of these confidence ratings, which show very similar results to the accuracy analyses reported in this paper.
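The conversion from confidence ratings to accuracy can be sketched as follows, using the scale described in the Procedure (1 = ‘definitely old’, 6 = ‘definitely new’). The function and argument names are hypothetical and chosen for illustration only.

```python
def recognition_accuracy(confidence, was_seen):
    """Score one recognition response as correct (1) or incorrect (0).

    confidence: rating from 1 ('definitely old') to 6 ('definitely new').
    was_seen:   True if the word pair was presented in the first session.
    """
    if confidence not in range(1, 7):
        raise ValueError("confidence must be an integer from 1 to 6")
    judged_old = confidence <= 3  # ratings 1-3 count as an 'old' judgement
    return int(judged_old == was_seen)


# A seen pair rated 2 is a hit; an unseen pair rated 2 is a false alarm.
print(recognition_accuracy(2, was_seen=True), recognition_accuracy(2, was_seen=False))
```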
RTs for correct responses in the recognition task were analysed. RTs exceeding 2.5 SDs from the mean were removed from the analysis, leading to the removal of 2.44% of the data and leaving 3157 data points. Based on a test of normality, we reduced skewness in the RT distribution by transforming latencies to log10(RT) and analysed the logRTs, following suggestions from Baayen, Feldman and Schreuder (2006).
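A minimal sketch of this trimming and transformation step is given below. It assumes that the mean and SD are computed over all correct-response RTs at once (the text does not state whether trimming was done overall or per participant), that the sample SD is used, and that RTs are in milliseconds; the example values are invented.

```python
import math
import statistics


def trim_and_log(rts, n_sd=2.5):
    """Drop RTs more than n_sd SDs from the mean, then log10-transform the rest."""
    mean = statistics.mean(rts)
    sd = statistics.stdev(rts)  # sample standard deviation (assumption)
    kept = [rt for rt in rts if abs(rt - mean) <= n_sd * sd]
    return [math.log10(rt) for rt in kept]


# Invented RTs in ms; the last value is an extreme outlier.
example_rts = [520, 560, 580, 600, 610, 630, 650, 670, 700, 4000]
log_rts = trim_and_log(example_rts)
print(len(example_rts) - len(log_rts), "RT(s) removed")
```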
To compare performance between novel and conventional word pairs, and between novel congruent and novel incongruent word pairs, we coded the ‘type of metaphor’ categorical variable using Helmert contrasts. For novel versus conventional metaphors (contrast A), conventional word pairs = 1 and both novel congruent and novel incongruent word pairs = −.5. To compare novel congruent and incongruent word pairs (contrast B), conventional word pairs = 0, novel congruent word pairs = −.5 and novel incongruent word pairs = .5. For the other categorical fixed effects: for the sleep versus wake groups, sleep was the reference category; for time of testing, the testing session was the reference category against which the training session was compared.
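The contrast weights described above can be written out as a small lookup table. The sketch below uses hypothetical condition labels and checks two properties that make the contrasts interpretable as independent comparisons: each contrast sums to zero across the three conditions, and the two contrasts are orthogonal.

```python
# Contrast weights as stated in the text; the condition labels are hypothetical.
CONTRASTS = {
    "conventional":      {"A": 1.0,  "B": 0.0},
    "novel_congruent":   {"A": -0.5, "B": -0.5},
    "novel_incongruent": {"A": -0.5, "B": 0.5},
}


def code_metaphor_type(metaphor_type):
    """Return the (contrast A, contrast B) predictor values for one item."""
    weights = CONTRASTS[metaphor_type]
    return weights["A"], weights["B"]


sum_a = sum(w["A"] for w in CONTRASTS.values())
sum_b = sum(w["B"] for w in CONTRASTS.values())
dot_ab = sum(w["A"] * w["B"] for w in CONTRASTS.values())
print(sum_a, sum_b, dot_ab)
```

With this coding, contrast A compares conventional pairs against the average of the two novel conditions, and contrast B compares novel incongruent against novel congruent pairs.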
3. Results
Data and analyses are available at https://osf.io/s3g98. In order to determine the effect of sleep and wake consolidation on metaphor encoding and processing, we first ensured that sleep versus wake effects were not due to time-of-day effects (3.1). We then assessed recognition accuracy of word pairs at the testing session (3.2); then, we assessed change from training to testing sessions for recall accuracy (3.3), and for participants’ ratings of emotional valence and emotional arousal (3.4), in order to determine if sleep affected changes in recall and emotionality over time. We focused on the effect of stimulus type on processing for the previously seen stimuli, as the type of stimulus was only controlled for these items. Supplementary Information contains reports of model construction and comparisons, as well as additional analyses of seen versus unseen stimuli (S1) and analyses of RTs (S3).
3.1. Analyses of time-of-day effects on performance
3.1.1. Participants’ mood and sleepiness scores in each session
Mean scores of sleepiness, mood valence, mood activation/arousal and mood control/dominance (the latter three measures from the SAM) in training and testing sessions are reported in Table 1. We tested whether sleepiness or mood differed between the sleep and wake groups because of the time of day of each session; these tests are also reported in Table 1. We found no significant effects apart from wake group mood activation scores being higher than for the sleep group before the testing session. To determine whether this difference in mood activation had an influence on definition recall, we conducted a generalised linear mixed-effects model with recall accuracy of definitions of word pairs at testing as the dependent variable. Adding mood activation scores did not improve model fit compared to a model with no fixed effects (χ2(1) = 1.125, p = .289). We also tested whether mood activation scores affected word pair recognition at testing, and again, adding it to a model with no fixed effects did not improve model fit (χ2(1) = 0.467, p = .495). Thus, there was no evidence that differences in sleepiness or mood between the two groups at different times of day affected the results.
Table 1. Mean, SD and t-tests comparing wake and sleep groups for sleepiness and SAM mood self-ratings

3.1.2. Time-of-day controls in training session
Emotional valence and arousal ratings of word pairs at training. We analysed differences in participants’ ratings of valence and arousal in the training session, to determine whether time of day affected ratings of the stimuli. Table 2 reports the means and SDs of ratings by the sleep and wake groups. Adding the fixed effect of group to a linear mixed-effects model of valence ratings of the word pairs did not significantly improve model fit, indicating that there was no evident effect of time of day on the ratings (χ2(1) = 0.036, p = .849). Adding the fixed effect of group to a linear mixed-effects model of arousal ratings of the word pairs also did not significantly improve model fit (χ2(1) = 0.803, p = .370).
Table 2. Means and standard deviations of ratings of valence and arousal; first and second recall of definitions at training (scores indicate proportion of correct items recalled)

First recall of definitions of word pairs at training. To check whether the two groups were different in the first session, indicating an effect of time of day on encoding, we ran a generalised linear mixed-effects model, with first recall of definition accuracy at the training session as the dependent variable (see Table 2 for descriptives). Adding a fixed effect of group to a model with only random effects did not improve model fit (χ2(1) = 1.202, p = .273), indicating, as expected, no significant difference between the sleep and wake groups.
Second recall of definitions at training. During the training session, we also measured participants’ ability to recall definitions for a second time, to gain a measure of recall accuracy at the end of the training session after a short retention period. Although all participants provided definitions for a second time during training, due to a computer error only half of the data were collected (10 participants in the wake group and 10 in the sleep group). We thus ran a generalised linear mixed-effects model on data from these 20 participants, with random effects of participants and items (see Table 2 for descriptives). Adding the effect of group to a model with only random effects did not significantly improve model fit, indicating no time-of-day effect on recall of definitions (χ2(1) = 0.626, p = .429).
3.2. Effects of sleep on recognition of word pairs at testing
As fixed effects, we coded stimulus type as Helmert contrasts (comparing conventional versus novel word pairs and whether the novel pairs were congruent or incongruent). We then included the comparison between positive and neutral stimuli and between negative and neutral stimuli, as well as actual emotional valence and arousal ratings of word pairs given by participants during the training session to determine whether individual ratings affected processing. We then added the fixed effect of group (sleep or wake) as a main effect and in interaction with stimulus type and emotionality. Effects were only retained in the model if they contributed significantly to model fit.
For recognition accuracy, the final model included a significant fixed effect of congruent versus incongruent novel metaphors (χ2(1) = 5.575, p = .018): contrary to expectations, participants recognised novel congruent word pairs (mean = 0.89, SD = 0.13) better than novel incongruent word pairs (mean = 0.84, SD = 0.12). The model also included arousal ratings at training (χ2(1) = 5.362, p = .021), with higher arousal ratings associated with higher accuracy, and wake versus sleep group (χ2(1) = 7.058, p = .008): as predicted, the sleep group (mean = 0.91, SD = 0.05) outperformed the wake group (mean = 0.84, SD = 0.10). No other main effects or interactions significantly contributed to model fit; the final model is reported in Table 3.
Table 3. Summary of fixed effects in the final model of recognition accuracy of previously seen word pairs

3.3. Effects of sleep on recall of definitions at training (second recall) versus testing
We next modelled participants’ ability to recall definitions of word pairs at training versus testing. The final model (see Table 4) included a significant fixed effect of time of testing (χ2(1) = 7.789, p = .005), with reduced accuracy at testing (mean proportion correct = 0.45, SD = 0.14) compared to training (mean = 0.56, SD = 0.11), as anticipated given the longer period between exposure and testing; an effect of conventional versus novel metaphors (χ2(1) = 33.782, p < .001), with conventional metaphors (mean = 0.73, SD = 0.23) recalled better than novel metaphors (mean = 0.33, SD = 0.13); an effect of congruent versus incongruent novel metaphors (χ2(1) = 641.620, p < .001), with congruent novel metaphors (mean = 0.59, SD = 0.23) recalled better than incongruent novel metaphors (mean = 0.07, SD = 0.08); and an effect of valence ratings at training (χ2(1) = 6.242, p = .012), with word pairs rated as more negative recalled more accurately. No other main effects or interactions significantly contributed to model fit.
Table 4. Summary of fixed effects in the final model of recall of definitions, comparing training and testing sessions

3.4. Effects of change in emotion for stimuli
When valence ratings, including ratings at both training and testing, were the outcome variable, emotionality of the word pairs significantly contributed to model fit (χ2(2) = 87.049, p < .001), confirming that positive word pairs were rated as more positive and negative word pairs as more negative than neutral word pairs; in addition, arousal ratings at training significantly contributed to model fit (χ2(1) = 91.871, p < .001): word pairs rated as more arousing at training were rated as more positive. The interaction between time of testing and conventional versus novel metaphor was also significant (χ2(2) = 8.238, p = .016): while conventional metaphors declined in positive ratings from training to testing, novel metaphors increased in positive ratings (see Figure 2). The interaction between time of testing and emotionality of the word pair was significant (χ2(2) = 33.374, p < .001): at testing, positive word pairs were rated as less positive, negative word pairs were rated as less negative, and neutral pairs remained the same as at training (see Figure 3). Finally, the interaction between time of testing and arousal (at training) was significant (χ2(1) = 8.413, p = .004): increasingly more arousing pairs were rated as less positive at testing than training. Critically, there was also a significant interaction between time of testing and sleep or wake group (χ2(2) = 7.443, p = .024): sleep reduced valence ratings of word pairs (z = 3.46, p = .003), whereas there was no evidence of an effect for wake (z = −0.23, p = .996); see Figure 4 and the final model in Table 5.

Figure 2. Mean valence ratings and 95% confidence intervals at training (1) versus testing (2) for conventional and novel metaphorical word pairs.

Figure 3. Mean valence ratings and 95% confidence intervals at training (1) versus testing (2) for word pairs with different emotionality.

Figure 4. Mean valence ratings and 95% confidence intervals for word pairs at training (1) versus testing (2) for the sleep and wake groups.
Table 5. Summary of fixed effects in the final model of valence ratings across training and testing sessions

When arousal ratings, including both ratings at training and testing, were the outcome variable, the final model (Table 6) included significant fixed effects of congruent versus incongruent novel metaphors (χ2(1) = 30.59, p < .001): congruent word pairs were rated as more arousing than incongruent pairs; word pair emotionality (χ2(2) = 55.594, p < .001), with positive and negative word pairs rated as higher in arousal than neutral ones; valence rating at training (χ2(1) = 131.37, p < .001), with more positively valenced word pairs rated as more arousing; an interaction between time of testing and emotionality (χ2(3) = 32.958, p < .001), with positive emotional pairs declining in arousal, negative emotional pairs increasing in arousal ratings at testing and no change for neutral word pairs (see Figure 5); and an interaction between time of testing and valence rating (χ2(1) = 56.784, p < .001): the positive relationship between arousal and valence was stronger at training than at testing.
Table 6. Summary of fixed effects in the final model of arousal ratings across training and testing sessions


Figure 5. Mean arousal ratings and 95% confidence intervals for word pairs at training (1) versus testing (2) for word pairs with different emotionality.
4. Discussion
The present study examined the way in which figurative language is constructed and imbued with emotional affect. We tested the role of sleep on the learning of and memory for both novel and conventional metaphorical word pairs varying in emotionality. To our knowledge, this was the first attempt at investigating memory consolidation for stimuli whose emotive content is not given but develops in tandem with learning of the expressions themselves. We investigated novel metaphors, including both positively and negatively valenced stimuli and including measures of both emotional valence and arousal.
We found a significant main effect of group (sleep versus wake) on recognition accuracy; however, there was no group difference for recall of definitions. Our results therefore only partially support the large body of research indicating a likely overall benefit of sleep to both learning and memory (Berres & Erdfelder, 2021; Rasch & Born, 2013). Also, definitions were less accurately recalled at testing than at training, showing memory decay over time.
We further predicted that sleep consolidation would particularly benefit learning and memory for more complex items (Ashton et al., 2022; Sio et al., 2013), i.e. novel metaphors, and in particular those with incongruent definitions. However, our incongruent items turned out to be particularly difficult for participants to generate definitions for and to remember (7% average definition accuracy); thus, sleep consolidation was unlikely to help for such poorly encoded items (Wernette & Fenn, 2024). In fact, we observed better accuracy and speed of recognition and better recall of definitions for congruent than for incongruent novel metaphors. These incongruent novel metaphors are evidently difficult to learn, in particular in terms of generating definitions, although they were recognised with 84% accuracy at testing. However, investigating how such expressions might become embedded in the speaker’s mind clearly requires more exposure and training than our paradigm provided. Furthermore, we observed better recall of the definitions of conventional than novel metaphors. This may be because conventional metaphorical meanings are already known to participants; they do not need to be learned, and therefore recall performance is higher than for newly learned items. This suggests an overall advantage of a consolidation period for known compared to newly learned items, and for new items whose definition is more intuitively sensible compared to incongruent items. This is in line with empirical research showing that semantically transparent figurative expressions, which are more congruent with their figurative meaning, are processed faster and therefore more easily than opaque expressions (Bell & Schäfer, 2016; Gibbs, Nayak & Cutting, 1989). This may, in turn, lead to better memory for transparent/congruent word pairs. However, no differences between sleep and wake were found on recall of definitions.
The lack of advantage for sleep over wake in recall of definitions could be explained by the fact that recall of definitions was tested three times, while recognition was only assessed once at testing. According to Antony et al. (2017), repeated retrieval of information may act as a fast route to memory consolidation, strengthening the links between hippocampal and neocortical representations and leading to greater reliance on neocortical representations over time. Sleep is particularly beneficial for memory consolidation from the hippocampus to the neocortex, and thus, no added benefit of sleep consolidation may be observed when online fast consolidation has taken place.
In addition, we found that novel metaphors were more positively rated after consolidation than at training, whereas highly familiar conventional expressions were less positively rated after consolidation than at training. In line with this, congruent novel metaphors were rated as more arousing than incongruent ones, independent of time of testing. Both of these results are in line with the optimal innovation hypothesis (Giora et al., 2004), according to which highly familiar and overly repeated stimuli, i.e. conventional metaphors in our case, lead to potential disengagement and boredom (thus less positive ratings at testing than at training), while optimally innovative stimuli, i.e. novel congruent metaphors in our case, elicit interest and engagement (hence higher arousal ratings than novel incongruent metaphors), as long as they are not too difficult to process, as our novel incongruent pairs appear to have been.
When it comes to the effects of the emotive content of our stimuli, recognition accuracy was better for more arousing word pairs, whereas definitions were better recalled for more negative pairs. Furthermore, higher rated arousal at training was associated with more positive valence at testing, and more positive valence at training with higher arousal at testing. This pattern seems to support an overall memory advantage for more emotive word pairs (Walker, 2021) and a relationship between positive valence and arousal. These effects were mostly observed when analysing the more sensitive, continuous ratings provided by the participants themselves rather than the predetermined categories of positive, negative and neutral pairs (although individual ratings essentially confirmed our categorisation); also, the effects were sometimes driven by valence and sometimes by arousal. Because the number of stimuli was limited to avoid excessive cognitive demand and overly long task duration, our experimental design cannot help resolve existing inconsistencies in the literature regarding the specific roles of valence and arousal in memory consolidation (e.g. Adelman & Estes, 2013; Kensinger & Corkin, 2003; Reid et al., 2022; Schäfer et al., 2020), nor can it help clarify whether sleep consolidation is more beneficial for emotive stimuli than consolidation during wake.
Interestingly, the results of the current study partially support the SFSR hypothesis (Walker, 2021) as participants showed a reduction in their affective ratings of the word pairs. In line with the ‘sleep to forget’ hypothesis, the sleep group rated word pairs as significantly less valenced at testing than at training, whereas the wake group showed no difference in affective ratings. In addition, further effects point towards a reduction in perceived emotive tone after a period of consolidation regardless of whether that period contained sleep or wake: (1) positive words were rated as less positive and negative words as less negative at testing than at training, while neutral words remained the same; in line with this, positive pairs declined in arousal rating after training, negative pairs increased, but neutral pairs remained the same; (2) increasingly more arousing word pairs were rated as less positive at testing than at training, showing a reduction in strength of the relationship between arousal and positive valence; this was confirmed by the interaction between time of testing and valence ratings at training in predicting arousal ratings at testing. Even though these additional results showed no advantage of sleep versus wake, they are in line with the SFSR hypothesis, which suggests the affective tone of a memory is reduced over time and after sleep. Although there is variability in the effect of emotional content on sleep-based memory consolidation (Lipinska et al., 2019; Pereira et al., 2023; Schäfer et al., 2020), the contribution of our study may point to a stimulus constraint on when emotional effects are elicited by sleep: namely, when emotional content is constructed at the same time as memorisation is developing. Our results point to the importance of distinguishing valence from arousal when examining effects on memory (see also Reid et al., 2022).
It is important to consider whether changes in emotion from the first to second sessions might be explained in terms of regression to the mean. In Figure 3, we see higher valence values at session 1 becoming lower at session 2 and vice versa: lower values at session 1 becoming higher at session 2. However, whereas valence might orient to the mean, there is no similar pattern for levels of arousal. Figure 5 shows that, whereas arousal for neutral stimuli remains constant, arousal ratings for positive stimuli reduce towards the mean for the neutral stimuli and arousal ratings for negative stimuli diverge from arousal for the neutral stimuli – increasing from session 1 to session 2.
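Regression to the mean can be illustrated with a small simulation, offered here as a hedged sketch rather than a model of the present data. It assumes each item has a stable true valence plus independent rating noise at each session, and that items are grouped by their observed session 1 rating; all parameter values are arbitrary.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Each simulated item: stable true valence plus independent noise per session.
true_valence = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
session1 = [t + rng.gauss(0.0, 1.0) for t in true_valence]
session2 = [t + rng.gauss(0.0, 1.0) for t in true_valence]

# Group items by their observed session 1 rating (arbitrary cut-offs).
high = [i for i, r in enumerate(session1) if r > 1.0]
low = [i for i, r in enumerate(session1) if r < -1.0]

high_s1 = mean(session1[i] for i in high)
high_s2 = mean(session2[i] for i in high)
low_s1 = mean(session1[i] for i in low)
low_s2 = mean(session2[i] for i in low)

# With no true change, both extreme groups still drift towards the centre.
print(f"high: {high_s1:.2f} -> {high_s2:.2f}")
print(f"low:  {low_s1:.2f} -> {low_s2:.2f}")
```

Under pure regression to the mean, both extremes move towards the centre, which is why a rating that diverges from the neutral level between sessions, as described for arousal in negative stimuli, is not readily explained by it.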
The results of the present study require further investigation with an increased sample size and a larger set of stimuli, to increase statistical power to detect the effects found and to allow exploration of interactive effects between emotive content and type of metaphor. Our study was powered to detect effect sizes of d = 0.8 or greater, which is a relatively large effect size; a larger sample may enable greater nuance to be determined in future studies of the learning of metaphors and their accumulation of emotional content. As noted by Davidson et al. (2021), sleep studies tend to be relatively low in power: their meta-analysis of emotion and memory studies, for instance, showed that, for recognition memory, studies of sleep had a mean sample size of 40 (SD = 23, range: 10–84). Increasing sample size, then, can help to increase sensitivity to true effects and also minimise the risk of finding spurious effects (Button et al., 2013; Cordi & Rasch, 2021).
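For orientation, the relationship between effect size and sample size can be sketched with the standard normal approximation for a two-group comparison. This is not the authors' power analysis: the two-sided alpha of .05, the target power of .80 and the simple independent-groups design are assumptions, and the approximation slightly underestimates the sample size required by a t-test. The function name n_per_group is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants per group for a two-sided two-group test."""
    if d <= 0:
        raise ValueError("effect size d must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_power = z.inv_cdf(power)              # quantile for the target power
    return math.ceil(2.0 * (z_alpha + z_power) ** 2 / d ** 2)


large = n_per_group(0.8)   # the effect size named in the text
medium = n_per_group(0.5)  # a smaller, hypothetical effect size
print(large, medium)
```

Because the required sample grows with the inverse square of d, detecting a more modest effect requires a considerably larger sample than detecting d = 0.8, under these assumptions.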
To conclude, the current study partially supports the benefit of sleep in the learning of metaphorical items, in the consolidation of emotional memories and in the reduction in their perceived emotive tone after consolidation. However, it provides support for an overall benefit of a period of consolidation, whether after sleep or wake, in the learning of and memory for novel metaphors, more emotive word pairs, and in the reduction in their perceived emotive tone. It also extends previous findings to stimuli whose emotive content is learned and to positive in addition to negative stimuli.
Supplementary material
The supplementary material for this article can be found at https://osf.io/s3g98.




