1 Introduction
While research on human judgment and decision making has yielded many findings suggesting that the best way to make a difficult choice is to think carefully about the options and their consequences (e.g., Baron, 2008; Kahneman, 2011), the theory of unconscious thought (Dijksterhuis & Nordgren, 2006) proposes that this is not necessarily so. Rather, this theory proposes that the best way to make a difficult decision is to refrain from painstaking conscious deliberation and to let one’s unconscious mind solve the problem while one engages in more enjoyable activities such as solving a crossword puzzle. More specifically, the theory claims the existence of an unconscious form of thought that has a much greater information-processing capacity than conscious thought. As a result, a momentary diversion of attention would benefit a difficult decision because it allows the clever unconscious mind to take charge and solve the problem at hand.

Figure 1: The paradigm that was introduced by Dijksterhuis (2004) to examine the potential benefits of distraction in complex decision making.
So should decision makers really be told, as they have been (BBC News, 2006; Hoare, 2012), to refrain from conscious deliberation and to rely on their unconscious minds in making difficult decisions? Research examining this matter began with seminal studies in which Dijksterhuis and colleagues (Dijksterhuis, 2004; Dijksterhuis, Bos, Nordgren, & Van Baaren, 2006) presented participants with a large number of properties of different choice options (e.g., cars, candidate roommates, apartments), and then asked the participants to select the best option either after a period of conscious deliberation or after performing an unrelated task. (See Figure 1 for a graphical depiction of the paradigm.) Although the statistical analyses reported by Dijksterhuis and colleagues were suboptimal (e.g., Hasselman, Crielaard, & Bosman, submitted; Nieuwenstein & Van Rijn, 2012), the results of some of these experiments did show that participants who had first performed the unrelated task were more likely to select the best option than participants who were given the opportunity to deliberate, a phenomenon termed the unconscious thought advantage (UTA; Dijksterhuis, 2004; Dijksterhuis et al., 2006; see also Dijksterhuis & Nordgren, 2006). Following these reports, many other researchers attempted to replicate the finding of a UTA (e.g., Acker, 2008, who reviewed results available in 2008). The results of these replication attempts were mixed, as they were split almost evenly between studies that did and did not find evidence for the UTA. (For a recent overview, see Nieuwenstein & Van Rijn, 2012.)
2 The current study
In the current study, we contrast two explanations for the inconsistent results of previous studies examining the UTA. According to the reliability account, the UTA does not exist and previous reports of this effect concern nothing but spurious differences obtained from an unreliable paradigm. In contrast, the moderator account proposes that the UTA is real but observed only when specific conditions are met in the choice task. In the following sections, we first elaborate on the argumentation underlying these accounts before turning to the approach we took to adjudicate between them.
2.1 The reliability account
The reliability account was already hinted at in one of the early studies that failed to replicate the UTA. In this study, Acker (2008) conducted a meta-analysis on 17 experiments that were available at that time. The analysis showed that only five of these experiments reported a statistically significant UTA effect. Furthermore, Acker found that these experiments had “the largest effect sizes but at the same time the smallest sample sizes” (Acker, 2008, p. 299), thus raising the possibility that the results found in these studies concerned spurious effects (see also Bakker, Van Dijk, & Wicherts, 2012; Newell & Rakow, 2011; Rothstein, Sutton, & Borenstein, 2005).
Indeed, the unconscious thought paradigm illustrated in Figure 1 has three properties that together seem to make a potent recipe for spurious results, especially with small sample sizes. First, the paradigm involves a complex task for which performance is likely to depend on a host of factors that can differ across time and participants, including concentration, mindset, gender, motivation, expertise about the choice at hand, attention, and memory. Second, the paradigm uses a between-subjects manipulation of mode of thought with random assignment, meaning that the effect of the distraction vs. deliberation manipulation is assessed by comparing the performance of different participants. Third, the performance measure for the task stems from only a single observation for each participant, meaning that each participant carries out the task only once, without practice. Arguably, this combination of properties makes a potent recipe for spurious results because the use of random assignment does not necessarily guarantee an equal distribution of task-relevant factors across two groups of participants, especially when the number of such factors is large (Hsu, 1989; Krause & Howard, 2003), as would seem to be the case in the unconscious thought paradigm. Moreover, the use of a single-trial design entails that the performance measure derived for each participant is bound to be an unreliable index of the true, mean performance of that participant. Accordingly, it seems clear that the reliability and validity of results of studies examining the UTA hinge critically on whether these studies used a sample size that was sufficiently large to balance out the many potential confounding factors in the comparison of performance in the deliberation and distraction conditions.
By implication, it stands to reason that the small-sample studies that found a statistically significant difference in performance in the deliberation and distraction conditions concerned a spurious difference.
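The sample-size argument can be illustrated with a small simulation. The sketch below is not part of the original study: it assumes a hypothetical true accuracy of .60 in both conditions (i.e., no UTA at all) and a single binary choice per participant, and it estimates how large the chance difference between two randomly composed groups typically is for a given group size. The function name and all parameter values are illustrative assumptions.

```python
import random


def mean_abs_difference(n_per_group, p_correct=0.60, n_studies=2000, seed=1):
    """Average absolute difference in proportion correct between two groups
    when the true accuracy is identical in both groups (no real effect)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_studies):
        # One binary observation per participant, as in a single-trial design.
        group_a = sum(rng.random() < p_correct for _ in range(n_per_group))
        group_b = sum(rng.random() < p_correct for _ in range(n_per_group))
        total += abs(group_a - group_b) / n_per_group
    return total / n_studies


if __name__ == "__main__":
    # Hypothetical group sizes, chosen only to span small to large studies.
    for n in (10, 40, 200):
        print(n, round(mean_abs_difference(n), 3))
```

Under these assumptions, the typical chance difference between groups shrinks as the group size grows, which is why large observed differences are more easily produced by small studies.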
Table 1: Moderators of the UTA identified in the meta-analysis by Strick, Dijksterhuis, Bos, Sjoerdsma, & Van Baaren (2011), and the manner in which these conditions were incorporated in the current large-scale replication attempt (see also Nieuwenstein & Van Rijn, 2012)

2.2 The moderator account
In contrast to the reliability account, the moderator account proposes that the UTA is a real effect that is observed only when certain conditions are met with regard to the choice task. This account was proposed in a recent meta-analysis that was conducted by proponents of the theory of unconscious thought (Strick, Dijksterhuis, Bos, Sjoerdsma, & Van Baaren, 2011). The analysis included a large collection of published and unpublished data sets and it examined a large number of potential moderators of the UTA, including seemingly trivial methodological details such as whether the distracting task involved a word-search puzzle or an anagram task. The results yielded a pooled effect size of .218 (CI: .130-.307, p < .01), suggesting that, overall, a benefit of distraction in making complex choices does exist. Furthermore, many of the moderator variables included in the analysis indeed had a significant effect on the magnitude of this benefit (see Table 1). Specifically, the effect size of the UTA was found to depend on the complexity of the choice problem, the type of goal participants were led to adopt during the information acquisition phase of the task, the manner in which the information about the choice alternatives was presented, the duration of the deliberation or distraction phase, and the nature of the task that was used to divert attention in the distraction condition. Accordingly, Strick et al. concluded that the UTA is real but the occurrence of this effect requires that certain conditions be met, as indicated by the results of the moderator analyses.
3 Outline of the current study
In the current study, we set out to adjudicate between the reliability and moderator accounts. To this end, we conducted a large-scale replication study that met each of the conditions found to yield a strong effect in the meta-analysis by Strick et al. (2011; see Table 1), and we conducted a meta-analysis that moved beyond the analysis by Strick et al. by examining the relationship between sample and effect sizes using a funnel plot (i.e., a plot that depicts effect sizes against a measure of study precision that is directly related to sample size, such as the inverse of the standard error; e.g., Egger, Smith, Schneider, & Minder, 1997; Light & Pillemer, 1984). According to the reliability account, previous findings of a significant benefit of distraction concern nothing but a spurious result, and, therefore, these findings would be expected to be confined to studies that used relatively small sample sizes because the probability of a spurious effect should decrease with increasing sample size. Furthermore, the reliability account also predicts that our large-scale replication study should show no significant UTA, in spite of the fact that the design of this study adhered to the recommendations provided by Strick et al.’s (2011) meta-analysis. In contrast, the moderator account would predict that the UTA should also be observed in studies that used a relatively large sample size, provided that they met the conditions under which the UTA is expected to occur (Strick et al., 2011). Thus, according to the moderator account, our large-scale replication study would also be predicted to reveal the UTA.
4 The large-scale replication study (Footnote 1)
The starting point for the large-scale replication study was a recent study in which Nieuwenstein and Van Rijn (2012) conducted a first test of the moderator account and found a number of results that warranted further empirical confirmation. In this earlier study, Nieuwenstein and Van Rijn used a task that met the conditions under which the UTA should be strong according to Strick et al. (2011), with the contrast between deliberation and distraction implemented as a within-subjects design so as to preclude the possibility that any observed UTA could be due to a spurious between-group difference. The results of four such experiments did not yield a statistically significant UTA effect, suggesting that even when all the moderator conditions identified by Strick et al. are met, the UTA is either small or does not occur at all. Importantly, however, these experiments used relatively small sample sizes (24-48 participants), and the experiment that used the largest sample size (N = 48) did show a non-significant difference in the direction of the UTA. Furthermore, the results also suggested that perhaps the UTA is gender-specific, as a post-hoc exploratory analysis across all four experiments yielded a significant interaction of mode of thought and gender, with male participants showing a statistically significant conscious thought advantage while female participants showed a non-significant trend towards a UTA. Lastly, the results of these experiments also suggested that insofar as the UTA indeed exists, it might occur only when the duration of the deliberation phase in the conscious deliberation condition is fixed at several minutes.
Specifically, the results showed that participants needed only 30 seconds to deliberate about their choice, and they also provided evidence to suggest that performance in the conscious deliberation condition is better when the deliberation phase is self-paced, as opposed to fixed and unnecessarily long (see also Payne, Samper, Bettman, & Luce, 2008).
Given the concerns about the reliability of results obtained in the unconscious thought paradigm, and given the post-hoc nature of the exploratory analyses that suggested that the UTA might be gender-specific, it is clear that the results reported by Nieuwenstein and Van Rijn (2012) warrant a more powerful test with a larger group of participants. To this end, the current study replicated the first experiment in Nieuwenstein and Van Rijn (i.e., the one that showed a non-significant difference in the direction of the UTA) with a sample of participants that was nearly an order of magnitude larger (N = 399) than the sample used by Nieuwenstein and Van Rijn, thus offering a much more powerful test of the UTA (Footnote 2) and the potential moderating role of gender. Furthermore, this large-scale replication attempt also used a within-subjects design for the comparison of the deliberation and distraction conditions, with the order of these conditions counterbalanced across participants. In addition, the experiment included two versions of the deliberation condition that differed in whether the duration of the deliberation phase was fixed or self-paced, thus allowing us to verify whether performance in the deliberation condition, and perhaps the occurrence of the UTA, indeed depends on the duration of the deliberation phase. The duration of the deliberation phase was varied between subjects, and we used two different choice sets for the two choices that were to be made by each participant (i.e., a choice between four cars or four apartments), with a random distribution of these choice sets across the two choice conditions.
4.1 Methods
4.1.1 Participants
The study was conducted as part of a test session at the University of Amsterdam (Footnote 3) in which all first-year undergraduates in Psychology could participate on a voluntary basis to obtain course credit. The number of students who took part in the study was 423 and this sample included 24 non-native speakers of Dutch, whose data were excluded from analysis. Exclusion of these participants did not change the results. The remaining 399 participants were 19.7 years old on average (SD = 1.86 years), and they included 130 males.
4.1.2 Materials
The experiment was conducted on a computer, using a program written in Adobe Authorware. The experiment comprised two choice tasks and a word-search task. The word-search puzzle task was used to distract participants during the unconscious deliberation phase.
For each of the choice tasks, participants received information about four options (cars or apartments) that were described in terms of twelve properties that could be desirable or undesirable (Footnote 4). The quality of the options was defined in terms of their number of desirable properties, such that the best option had 9 desirable properties whereas two intermediate options each had 6 desirable properties, and the worst option had only 3 desirable properties. During the information acquisition phase, these properties were presented one after the other in a series of timed displays that each included the fictitious name of the option, a sentence describing a property of the option, and a picture of the choice option. The pictures depicted real cars and apartment buildings (see also Nieuwenstein & Van Rijn, 2012). The word-search puzzle task comprised a 10 × 10 array of letters that was shown together with a target word. The letters were indexed by the numbers 1–100 and the task for the participants was to find the target word and type in the numbers that corresponded to the first and last letter of the word. The target words denoted countries, vegetables, or fruits, and could be written in the array in any direction.
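The structure of a choice set and the definition of a correct choice can be sketched as follows. This is a minimal illustration rather than the Authorware program used in the study: the option labels "A" to "D", the class name, and the function names are hypothetical, and only the counts of desirable properties (9, 6, 6, and 3 out of 12) are taken from the description above.

```python
from dataclasses import dataclass

N_PROPERTIES = 12  # properties per option, as described in the text


@dataclass
class Option:
    name: str          # hypothetical label; the study used fictitious names
    n_desirable: int   # number of desirable properties out of 12

    @property
    def n_undesirable(self) -> int:
        return N_PROPERTIES - self.n_desirable


def make_choice_set():
    """One best, two intermediate, and one worst option (9/6/6/3 desirable)."""
    return [Option("A", 9), Option("B", 6), Option("C", 6), Option("D", 3)]


def is_correct(choice_set, chosen_name):
    """A choice is scored as correct if it picks the option with the
    greatest number of desirable properties."""
    best = max(choice_set, key=lambda option: option.n_desirable)
    return chosen_name == best.name
```

With four options of twelve properties each, a choice set comprises 4 × 12 = 48 property statements in total.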
4.1.3 Procedure
At the start of the study, the participants practiced the word-search puzzle task they would later be asked to do again during the unconscious deliberation phase. After practicing this task for one minute, the participants were informed that they would now see a presentation about four [cars/apartments] that would each be described in terms of different properties. In accordance with the recommendations by Strick et al. (2011), participants were instructed that they should form a good impression of each of these options. They were then shown a sequence of 48 displays of the options and their properties. The properties were presented grouped by option and the twelve properties were presented in the same order for each of the four options. The duration of each display was set at 2.5 seconds. In the distraction condition, this information acquisition phase was followed by an instruction telling the participants that they would later be asked for their opinion about the options and that they would first have to do the word-search puzzle task for a period of three minutes. In the deliberation conditions, participants were also told that they would later be asked for their opinion about the options, and they were instructed that they would first get three minutes (fixed deliberation phase) or as long as they needed (self-paced deliberation phase) to think carefully about the options. During this period, the pictures and names of the options remained in view, together with a counter that indicated the passage of time in seconds. In the self-paced deliberation condition, the same display was shown but now participants could press a designated key once they had made up their mind. At this point, participants received the instruction to select the best option by pressing a corresponding key on the keyboard. In the fixed deliberation condition, this instruction appeared automatically after three minutes had passed. 
After selecting the best option, participants were asked to indicate on a 10-pt. scale how confident they were about their choice. In addition, participants in the deliberation condition with a fixed 3-minute deliberation phase were asked to estimate how long they had needed to arrive at a decision. For participants in the self-paced deliberation condition, the program registered how long it took before they indicated they had made up their mind.
Table 2: Number of participants (N) included in each of the four versions of the task

4.1.4 Design
Each participant made one choice after conscious deliberation and one choice after doing the word-search task, and the order of these conditions was counterbalanced across participants. For half the participants, the duration of the deliberation phase in the conscious deliberation condition was fixed at 3 minutes and it was self-paced for the other participants. The duration of the word-search task that was used to induce a diversion of thought in the distraction condition was three minutes for all participants. The two orders of the deliberation and distraction conditions and the two durations of the deliberation phase were crossed to create four different versions of the task, and participants were randomly assigned to one of these four versions (see Table 2). The two choice sets (cars and apartments) were randomly assigned to the deliberation and distraction conditions, yielding a balanced design of within and between-subject factors.
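The crossing of factors described above can be sketched in a few lines. The snippet is an illustration under stated assumptions, not the assignment procedure actually used: the labels, the function name, and the use of simple (unrestricted) randomization are all hypothetical, and simple randomization does not by itself guarantee equal numbers of participants per version.

```python
import itertools
import random

ORDERS = ("deliberation_first", "distraction_first")
DURATIONS = ("fixed", "self_paced")
CHOICE_SETS = ("cars", "apartments")

# Crossing two condition orders with two deliberation durations
# yields the four versions of the task.
VERSIONS = list(itertools.product(ORDERS, DURATIONS))


def assign(participant_ids, seed=0):
    """Randomly assign each participant to a task version and randomly
    pair the two choice sets with the two thought conditions."""
    rng = random.Random(seed)
    plan = {}
    for pid in participant_ids:
        order, duration = rng.choice(VERSIONS)
        sets = list(CHOICE_SETS)
        rng.shuffle(sets)
        plan[pid] = {
            "order": order,
            "deliberation_duration": duration,
            "deliberation_set": sets[0],
            "distraction_set": sets[1],
        }
    return plan
```

Each participant thus receives both thought conditions (within subjects), one deliberation duration (between subjects), and a different choice set in each condition.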
4.1.5 Data-analysis
The plan for data-analysis was to examine accuracy on the choice task for main effects and interactions of mode of thought (deliberation vs. distraction), gender (male vs. female), and the duration of the deliberation phase in the deliberation condition (fixed vs. self-paced). Choice accuracy was defined in terms of whether a participant selected the option with the greatest number of desirable properties, as is typically done in this paradigm. Since this outcome has a binomial distribution, the data were modelled using a logit function and analyzed using a generalized linear model (GLM). The effects that were tested using the GLM were estimated using generalized estimating equations so as to allow for the possibility that the observations could be correlated across the within-subjects factor of mode of thought. The confidence ratings were treated as an ordinal variable and analyzed for the same effects using a GLM.
4.2 Results
As a first step in analyzing the data, we examined how long participants needed to deliberate about their choice in the fixed and self-paced conscious thought conditions, and we examined whether choice accuracy in this condition depended on whether the duration of the deliberation phase was self-paced or fixed at three minutes. The analysis of deliberation time showed that, on average, participants in the self-paced condition took only 23 seconds to deliberate (SD = 19.4, 95% CI = [20.5; 26.1]). In addition, this analysis showed that there was no significant relationship between choice accuracy and deliberation time, with the mean deliberation times being 25.0 (SD = 25.4, 95% CI = [20.0; 30.7]) and 21.7 seconds (SD = 13.1, 95% CI = [13.5; 24.4]), respectively, for participants who made an incorrect or correct choice, t(195) = 1.17, p = .24, Cohen’s d = .17. A similar result was found for participants for whom the duration of the deliberation phase was fixed at three minutes. To be precise, these participants reported that they had needed 37 seconds (SD = 31.0, 95% CI = [32.7; 41.4]) on average to deliberate, and for these participants too, self-reported deliberation time did not differ between participants who made a correct or incorrect choice, M = 37.7 (SD = 30.1, 95% CI = [31.3; 44.3]) vs. M = 37.1 (SD = 31.8, 95% CI = [32.1; 42.7]) seconds, respectively, t(200) = .15, p = .88, Cohen’s d = .02. Lastly, a comparison of choice accuracy in the deliberation conditions with a self-paced and fixed deliberation phase showed no significant effect of the duration of the deliberation phase, with the percentage of correct choices being 59.4 and 56.9%, respectively, for the fixed and self-paced conditions, Z = .52, p = .61.
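As a rough plausibility check of the last comparison, the sketch below applies a pooled two-proportion z-test. Several ingredients are assumptions rather than reported facts: the text does not state which test produced the Z value, the group sizes (202 fixed, 197 self-paced) are inferred from the degrees of freedom of the two t-tests (200 and 195), and the counts of correct choices (120 and 112) are reconstructed from the reported percentages.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(correct_1, n_1, correct_2, n_2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_1, p_2 = correct_1 / n_1, correct_2 / n_2
    pooled = (correct_1 + correct_2) / (n_1 + n_2)
    se = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Assumed counts (not reported in the text): 120/202 is about 59.4% correct
# in the fixed condition, 112/197 is about 56.9% in the self-paced condition.
z, p = two_proportion_z(120, 202, 112, 197)
print(round(z, 2), round(p, 2))
```

Under these assumptions the result is close to the reported Z = .52 and p = .61, which suggests that the reconstruction is at least consistent with the reported statistics.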
Table 3: A. Percentage of participants who chose the option with the largest number of desirable properties in the deliberation and distraction conditions, shown separately for male and female participants

B. Outcomes of the generalized linear model examining effects of mode of thought (deliberation vs. distraction) and gender on choice accuracy

The main analysis of interest examined choice accuracy for effects of mode of thought and gender. As can be seen in Tables 3A and 3B, there were no significant effects involving mode of thought, with the percentage of correct choices being 58.2% and 61.9%, respectively, in the deliberation and distraction conditions (Footnote 5). The sole effect to reach significance was the main effect of gender, with female participants being significantly more likely to select the best option than male participants (63% vs. 53%, respectively). Crucially, however, gender did not interact with mode of thought, thus failing to replicate the interaction effect that was found in an exploratory analysis by Nieuwenstein and Van Rijn (2012). Lastly, the analysis of the confidence ratings did not show significant effects of mode of thought or of the duration of the deliberation phase, whereas it did yield a significant effect of gender, χ²(1) = 13.27, p < .001, with female participants being less confident about their choice than male participants (M = 6.9 vs. M = 7.4, respectively).
4.3 Bayes factor analysis
Though the results of the GLM analysis are clear in demonstrating a lack of a statistically significant UTA, this type of analysis does not allow for a quantification of the extent to which the results support the null hypothesis over an alternative hypothesis that stipulates that the effect does exist. One approach that offers an elegant means to do so is the computation of a Bayes factor (e.g., Dienes, 2008; Dienes, 2011; Jeffreys, 1961; Morey & Rouder, 2011; Newell & Rakow, 2011; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Wagenmakers, 2007). To be precise, a Bayes factor can be used to competitively contrast two models of the data, which in this case represent the null hypothesis (H0) that there exists no UTA effect and an alternative hypothesis (H1), which assumes that this effect does exist. The Bayes factor is the relative likelihood of the data under these two hypotheses, and the outcome of this computation indicates the extent to which rational observers should adjust their relative beliefs in response to the data. Specifically, if the Bayes factor is greater than one, it indicates that belief should be adjusted in favor of the null hypothesis, and if it is less than one, it indicates that belief should be adjusted in favor of the alternative hypothesis.
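In symbols, the definition above can be written as follows. The notation is ours rather than the original text's: D denotes the observed data, and the subscript 01 indicates that the null hypothesis is in the numerator, which matches the convention that values greater than one favor the null.

```latex
\[
\mathrm{BF}_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)},
\qquad
\frac{p(H_0 \mid D)}{p(H_1 \mid D)}
  = \mathrm{BF}_{01} \times \frac{p(H_0)}{p(H_1)} .
\]
```

The second equation expresses the belief update: the posterior odds of the two hypotheses equal the prior odds multiplied by the Bayes factor.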
To competitively contrast the H0 and H1 models, we first had to construct a model for H1 that was intended to fairly represent the outcome a proponent of the UTA would predict for the current study. To construct the model, we used the outcomes of six experiments that were conducted by proponents of the UTA, and that were reported to show a significant UTA (Experiment 2 in Dijksterhuis [2004], Experiment 1 in Dijksterhuis et al. [2006], Experiments 1 and 2 in Nordgren, Bos, & Dijksterhuis [2011], and Experiments 1 and 2 in Strick, Dijksterhuis, & Van Baaren [2010]). The reasons for using these experiments as the basis for the H1 model were that they were all reported to show evidence in favor of the UTA (even though not all these effects were statistically significant, see the Supplement), and because they were similar to the current study in terms of their outcome measure (proportion of correct choices). The reason why we chose to use only studies that reported proportions correct, as opposed to using all studies done by proponents of the UTA, was that this enabled us to use the same scale to model the data from our own study and from the studies we used to construct the H1 prior.
Taken together, the six experiments used as a basis for the H1 prior included 150 participants in the distraction condition and 172 participants in the deliberation condition, and the proportions of correct choices in these conditions were .62 and .31, respectively. On the basis of these data, we developed a model for H1 (see Supplement for details) which can be argued to reflect the prediction a proponent of the UTA would make for the current experiment, according to their own observations. Indeed, it could even be argued that our estimate of this prediction underestimates the effect a proponent of the UTA would predict for the current study, as the six studies used for deriving this predicted outcome did not all meet all of the requirements for a strong UTA effect, as suggested by Strick et al. (2011) in their meta-analysis. Thus, to the extent that one believes the UTA is stronger if the recommendations of Strick et al. are followed, one should also believe that the outcome we derived as a prediction for the H1 prior underestimates the magnitude of the UTA that proponents would predict for our experiment, which met all recommendations of Strick et al.
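Because the effect size in the Bayes factor analysis is expressed in probit units, it can help to see what the pooled proportions of .62 and .31 correspond to on that scale. The sketch below simply takes the difference between the probit-transformed proportions. It is an illustration only, not the model described in the Supplement, which also has to represent the uncertainty around this value; the function name is hypothetical.

```python
from statistics import NormalDist


def probit_difference(p_distraction, p_deliberation):
    """Difference between two proportions on the probit (inverse normal) scale."""
    probit = NormalDist().inv_cdf
    return probit(p_distraction) - probit(p_deliberation)


# Pooled proportions correct in the six prior experiments, as given in the text.
effect = probit_difference(0.62, 0.31)
print(round(effect, 2))
```

A positive value indicates better performance after distraction than after deliberation, that is, a difference in the direction of the UTA.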
Table 4: Results for a between-subject comparison of performance in the condition that was done first by each participant in the current study


Figure 2: Graphical depiction of the probability density functions for the effect size of the UTA predicted under H1 (the prior, depicted as a dashed line), and the posterior probability density function after inclusion of the outcome of the current study (the solid line). Effect size is defined in probit units.
In computing the Bayes factor, we assumed that the proportions of correct choices in the deliberation and distraction conditions were binomially distributed, and the parameters of these distributions were derived from a standard probit model. By applying this probit model to the six previous studies showing the UTA, we derived a distribution of a priori expectations for the true effect size under H1 (depicted by the dashed line in Figure 2). For the current study, we followed a similar procedure to model the results for the between-subjects comparison of the deliberation and distraction conditions, using only the outcomes for the condition that was done first by each participant (Footnote 6; see Table 4). The Bayes factor was then computed as the extent by which the density around the null hypothesis d = 0 grew from the prior for H1 to the posterior after including the data from our large-scale study. As can be seen in Figure 2, the null effect of our study caused the posterior distribution to gather around the null value d = 0. Specifically, the density at d = 0 grew by a factor of 7.83, meaning that a rational observer who considers H1 against H0 should adjust their belief in favor of H0 by a factor of 7.83 (Footnote 7).
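The computation described in this paragraph corresponds to what is commonly called the Savage-Dickey density ratio, a name the text itself does not use. Written out, with D denoting the data of the current study and BF01 denoting the Bayes factor in favor of H0 over H1 (our notation), the ratio of posterior to prior density at the null value is:

```latex
\[
\mathrm{BF}_{01}
  = \frac{p(d = 0 \mid D, H_1)}{p(d = 0 \mid H_1)}
  \approx 7.83 .
\]
```

The numerator is the height of the solid line in Figure 2 at d = 0 and the denominator is the height of the dashed line at the same point.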
5 Meta-analysis
Taken together, the results of the large-scale replication study provide compelling evidence against the moderator account, as they make clear that a high-powered study that is optimized in accordance with the purported moderators of the UTA yields no evidence for this effect. By implication, the results of the large-scale replication study may also be considered as support for the reliability account. As described in the introduction, this account not only predicts that the UTA will not be found in a large-scale study but it also predicts that previous studies that did show this effect should be confined to studies that were unreliable due to the use of small sample sizes.
To test this prediction, we examined the relationship between effect and sample sizes for a data set that included both our large-scale study and all previously published experiments that compared the accuracy of difficult choices made after distraction or deliberation. Specifically, we collected data from all published studies that used the same type of multi-attribute choice task, and the same types of deliberation and distraction conditions as Dijksterhuis and colleagues used in their seminal studies from 2004 and 2006 (see Figure 1 for a depiction of the task), and which have since then been used in dozens of replication attempts. (See Table 6 for a list of these studies and their effect and sample sizes.) Based on these data, we constructed a so-called funnel plot in which the effect sizes were plotted against a measure of study precision directly related to sample size, namely the inverse of the standard error (Egger et al., 1997; see also Bakker et al., 2012; Light & Pillemer, 1984). Of particular relevance to the present study, this type of plot allows one to mark regions of statistical (non)significance, as the significance of a standardized mean difference score is a function of the score and its standard error. Thus, a funnel plot allows the viewer to gauge in a single glance both the distribution of significant and non-significant effects, as well as the relationship between these effects and their reliability, defined in terms of standard error. Accordingly, by inspection of the funnel plot, one can determine whether previous reports of a significant UTA are indeed confined to studies that were relatively unreliable due to the use of small sample sizes, as predicted by the reliability account.
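The coordinates of a study in such a funnel plot can be sketched as follows. The snippet rests on assumptions that are not taken from the text: it uses a common large-sample approximation for the standard error of a standardized mean difference, a two-sided criterion of z = 1.96, and three purely hypothetical studies. It is not the procedure used to produce the plot in this article.

```python
from math import sqrt


def smd_standard_error(d, n_1, n_2):
    """Large-sample approximation of the standard error of a
    standardized mean difference d for group sizes n_1 and n_2."""
    return sqrt((n_1 + n_2) / (n_1 * n_2) + d ** 2 / (2 * (n_1 + n_2)))


def funnel_point(d, n_1, n_2, z_crit=1.96):
    """Return the funnel-plot coordinates and significance flag of a study."""
    se = smd_standard_error(d, n_1, n_2)
    return {
        "effect_size": d,                      # horizontal position
        "precision": 1 / se,                   # inverse standard error
        "significant": abs(d) / se > z_crit,   # two-sided test at alpha = .05
    }


# Hypothetical studies (effect size, n per group); not data from Table 6.
studies = [(0.90, 12, 12), (0.25, 20, 20), (0.05, 200, 200)]
points = [funnel_point(*study) for study in studies]
for point in points:
    print(point)
```

Because significance depends only on the ratio of the effect size to its standard error, the boundary between significant and non-significant results forms two straight lines in this plot, and a small study can only fall outside them if its effect size is large.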
Aside from using a funnel plot to examine the relationship between effect and sample sizes, we subjected the data set to a quantitative meta-analysis in which we computed the overall effect size, and analyzed and corrected the data set for the existence of publication bias, using procedures described in detail in the following sections.
Table 5: Brief description of the types of studies found in the search for studies comparing the effects of deliberation and distraction on judgment and decision making. N = number of research articles that reported studies in one or more domains, K = total number of studies within a particular domain. References and further details for all studies are provided in the Supplement

Table 6: Effect and sample sizes of the studies included in the meta-analysis. Note that the effect sizes derived from the study by Nieuwenstein and Van Rijn (2012) were based on the outcome of between-subjects comparisons of the condition done first in experiments that used a within-subjects design in which each participant made one or more choices after deliberation or distraction

5.1 Data collection and study inclusion criteria
Studies comparing the effects of distraction and deliberation on human judgment and decision making were identified through searching the Web of Science database with “unconscious thought” and “deliberation without attention” as keywords. In addition, we checked all citations of the two seminal studies by Dijksterhuis and colleagues (Dijksterhuis, 2004; Dijksterhuis et al., 2006), and we cross-checked the studies we found against the set of studies included in the meta-analysis by Strick et al. (2011). Altogether, this search yielded a set of 54 published research articles that reported a total of 129 unique comparisons of the effects of distraction and deliberation on some measure of judgment or choice accuracy (see Table 5 for a general description of these studies; see the Supplement for a table listing all studies found).
As can be seen in Table 5, the majority of published studies that have compared the effects of distraction and deliberation on judgment and decision making have used a multi-attribute choice task similar to that used in the current large-scale replication attempt. Specifically, of the 54 research articles we found, 33 included one or more studies comparing the effects of distraction and deliberation on a multi-attribute choice task, and these articles together reported a total of 81 such studies (63% of all studies). In comparison, the next largest set of studies (those examining the effects of deliberation and distraction on creativity) included only 13 studies that were reported in 5 research articles. Since our main goal for the meta-analysis was to investigate the relationship between sample and effect sizes, we chose to restrict our analysis to studies using a multi-attribute choice task, as these studies constituted the large majority of all studies, and because the use of the same type of task entailed that they could all be assumed to measure the same effect. Studies examining the effects of deliberation and distraction on multi-attribute choice tasks were included in the meta-analysis if they met the following three inclusion criteria:
- 1. The study should include sufficient information to compute Hedges’ g, a measure of the standardized mean difference between conditions. This criterion led to the exclusion of one study.
- 2. The instructions given to the participants had to be similar to the instructions used by Dijksterhuis and colleagues in their seminal studies from 2004 and 2006. This meant that participants should have been instructed to form an impression of the options during the information acquisition phase (as opposed to being instructed to memorize the information about the options) and that they should have been informed prior to the distraction task that they would later be asked to judge or choose amongst the options. This criterion led to the exclusion of 7 studies that each used an instruction to memorize the information during the information acquisition phase. 
- 3. The choice problem used in the multi-attribute choice task should have been complex, as the UTA is only predicted to occur for complex choices. The complexity of a multi-attribute choice task can be defined in terms of the number of options multiplied by the number of attributes used to describe these options. Dijksterhuis and Nordgren (2006) did not propose a criterion for when a multi-attribute choice should be considered to be complex, but the studies by Dijksterhuis and colleagues make clear that choices involving a total of 16 attributes are considered simple, and therefore unlikely to produce the UTA, whereas choices involving a total of 30 or more attributes were predicted to yield a UTA, and may thus be considered to be complex. Accordingly, we included only studies with a total of at least 30 attributes, resulting in the exclusion of 4 studies that each used a multi-attribute choice task with four options defined by only four attributes.
5.2 Data set and effect size computation
After exclusion of the twelve multi-attribute choice studies that did not meet our inclusion criteria, we had a total of 69 studies remaining in our data set. As a subsequent step, we computed composite effect sizes for 7 studies that reported two separate comparisons for two groups of participants. To be precise, we computed composite effect sizes for studies that compared a distraction and deliberation condition separately for two groups of participants that differed in having been primed to obtain a feeling of high or low power (Experiments 1 and 2 in Smith, Dijksterhuis, & Wigboldus, 2008), in the consumption of a can of 7-Up (Bos, Dijksterhuis, & Van Baaren, 2012), in low vs. high need for cognition (Experiment 2 in Lassiter, Lindberg, Gonzalez-Vallejo, Belleza, & Phillips, 2009), or in featural vs. configural mindset (Experiments 2 and 3 in Lerouge, 2009). The reason for aggregating the results across these between-subjects factors was that these factors could be expected to vary naturally across participants in the other studies. Lastly, we also computed composite effect sizes for two studies in which the information about the options was presented in two different formats (numerical scores vs. colour-defined scores and numerical scores vs. star-count scores; Abadie, Villejoubert, Waroquier, & Vallée-Tourangeau, 2013a). As a result of computing these composite effect sizes, our data set was reduced to a total of 61 unique effect sizes (see Table 6 for the studies and their effect sizes). The computation of effect sizes was done using the compute.es function in R, and the meta-analysis was done using Viechtbauer’s (2010) metafor package.
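For readers unfamiliar with Hedges' g, the following minimal sketch shows the textbook computation from group means, standard deviations, and sample sizes. It is written in Python for illustration and is not the compute.es code used for the analysis; the function name and the example values are hypothetical, and studies that report choice proportions rather than means would require a different conversion that is not shown here.

```python
import math


def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Return (g, se) for the standardized mean difference between two groups.

    Uses the pooled standard deviation, the small-sample correction factor
    J = 1 - 3 / (4 * df - 1), and the usual large-sample variance of d.
    """
    df = n1 + n2 - 2
    sd_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / sd_pooled                      # Cohen's d
    j = 1.0 - 3.0 / (4.0 * df - 1.0)               # small-sample correction
    g = j * d
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2))
    var_g = j ** 2 * var_d
    return g, math.sqrt(var_g)


if __name__ == "__main__":
    # Hypothetical distraction vs. deliberation groups of 20 participants each
    g, se = hedges_g(0.6, 1.0, 20, 0.1, 1.0, 20)
    print(round(g, 3), round(se, 3))
```

With groups this small the standard error is roughly 0.31, so even a medium-sized effect of about 0.5 would fall short of significance, which illustrates why sample size matters for the reliability of individual studies.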
5.3 Results
The 61 studies included in our data set had sample sizes that ranged between 40 and 399, and their effect sizes ranged between −.74 and 1.48 (see Table 6). Based on these data, we constructed a funnel plot to visualize the distribution of significant and non-significant effects, and their relationship to study precision, defined in terms of the inverse of the standard error (see Figure 3a). The white area in the plot marks the region in which effect sizes were non-significant whereas the grey areas mark the regions in which effect sizes were significant either in the direction of a conscious thought advantage (CTA; area on the left, with Hedges’ g < 0) or an unconscious thought advantage (UTA; area on the right, with Hedges’ g > 0). As this figure illustrates at a glance, the published literature on the unconscious thought effect in multi-attribute choice tasks includes predominantly non-significant effects (N = 45), and only 16 statistically significant effects, of which 12 were in the direction of the UTA whereas 4 were in the opposite direction, that is, in the direction of an advantage for deliberation over distraction. Moreover, the plot shows a clear relationship between study precision and the finding of a significant UTA, such that the finding of a significant UTA appears to be confined to studies that had lower precision. Indeed, the studies with a relatively high precision show either a non-significant difference or an advantage for deliberation. Accordingly, it may be concluded that the observation of a statistically significant UTA appears to be confined to studies that were unreliable due to the use of small sample sizes.

Figure 3: A. A funnel plot showing the effect sizes of studies comparing choices made after distraction and deliberation plotted as a function of the inverse of their standard error. The grey area marks the area wherein effect sizes are statistically significant at p < .05 and the dashed line indicates the pooled effect size, Hedges’ g = .15.

B. A funnel plot with the same effect sizes as those shown in Figure 3a (grey symbols), with the addition of the effect sizes that were filled in using the trim and fill procedure (open symbols). The dashed line indicates the pooled effect size after inclusion of the filled-in effect sizes.
As a subsequent step in our analysis we submitted the data set to a quantitative meta-analysis to compute the overall effect size. The analysis used a random effects model and yielded a pooled effect size of 0.15, with a confidence interval of [0.03; 0.26], a Z-score of 2.54, and p = 0.01, thus suggesting the existence of a small but statistically significant UTA. Importantly, however, the distribution of effect sizes shown in Figure 3a suggests that this effect may need correction for publication bias, as the distribution appears to be asymmetrical, with a relatively large number of low-precision UTA effects, and only a few low-precision effects of equal magnitude in the opposite direction. The reason why such asymmetry may hint at a publication bias is that a theoretical, completely filled-in funnel would be expected to show a symmetrical distribution of studies around the estimated true, mean effect size, such that studies of the same level of precision would be expected to be distributed symmetrically around this mean. An asymmetrical funnel lacking effects of a particular magnitude, direction, and precision is therefore often interpreted to reflect a publication bias against this type of finding (e.g., Egger et al., 1997).
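To make the logic of a random effects model concrete, the sketch below pools a set of effect sizes using the DerSimonian-Laird estimator of between-study variance. This is an illustrative substitute written in Python: the analysis reported here was run with the metafor package in R, whose estimator may differ, and the function name and example values are hypothetical.

```python
import math


def random_effects_dl(effects, ses, z_crit=1.959964):
    """Pool effect sizes under a random effects model (DerSimonian-Laird).

    effects: list of effect sizes (e.g., Hedges' g); ses: their standard errors.
    Returns a dict with the pooled estimate, its SE, z, 95% CI, and tau^2.
    """
    k = len(effects)
    v = [se ** 2 for se in ses]
    w = [1.0 / vi for vi in v]                       # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)               # between-study variance
    w_re = [1.0 / (vi + tau2) for vi in v]           # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {
        "pooled": pooled,
        "se": se,
        "z": pooled / se,
        "ci": (pooled - z_crit * se, pooled + z_crit * se),
        "tau2": tau2,
    }


if __name__ == "__main__":
    # Hypothetical effect sizes and standard errors for four studies
    print(random_effects_dl([0.45, 0.10, -0.05, 0.30], [0.30, 0.15, 0.10, 0.25]))
```

Adding the between-study variance to each study's sampling variance flattens the weights, so small studies count for relatively more than they would under a fixed-effect model. This is one reason why an excess of small studies with large effects can inflate a random effects estimate.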
Since publication bias constitutes a common problem in meta-analyses, several procedures have been developed to deal with it. Some of these procedures focus solely on statistically significant effects, for instance by using the distribution of the p-values of these effects as a means to determine whether the distribution matches what could be expected if an effect truly existed (Simonsohn, Nelson, & Simmons, 2014; see also Van Assen, Van Aert, & Wicherts, in press). Other procedures use the distribution of all effect sizes, thus offering methods compatible with the current data set, which featured predominantly non-significant effects (e.g., Duval & Tweedie, 2000; Sterne & Egger, 2005). A first such procedure that is of relevance for the current purposes regards the possibility to test whether the asymmetry in a funnel plot is statistically significant. This can be done by means of a regression analysis in which study precision is used as a predictor of effect sizes (Sterne & Egger, 2005). Using such a test, we indeed found evidence for significant asymmetry, with Z = 2.11, and p = .04 (see Footnote 8).
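A regression test of funnel plot asymmetry can be sketched as follows. The Python below implements the classical Egger formulation, in which each study's standardized effect (effect divided by its standard error) is regressed on its precision, and an intercept that departs from zero signals asymmetry. This is an assumption about the general form of such a test: the exact specification behind the Z-score reported above may differ, and the function name and example data are hypothetical.

```python
import math


def egger_regression(effects, ses):
    """Ordinary least squares fit of (effect / SE) on (1 / SE).

    Returns the intercept (asymmetry estimate), the slope (an estimate of the
    underlying effect), and the standard error of the intercept.
    """
    x = [1.0 / se for se in ses]                      # precision
    y = [g / se for g, se in zip(effects, ses)]       # standardized effect
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in residuals) / (n - 2)     # residual variance
    se_intercept = math.sqrt(s2 * (1.0 / n + mean_x ** 2 / sxx))
    return {"intercept": intercept, "slope": slope, "se_intercept": se_intercept}


if __name__ == "__main__":
    # Hypothetical data in which less precise studies report larger effects
    fit = egger_regression([0.15, 0.35, 0.30, 0.70], [0.10, 0.20, 0.25, 0.50])
    print(fit)
```

The ratio of the intercept to its standard error can then be compared against a t distribution with n − 2 degrees of freedom. In a symmetrical funnel, effect sizes do not depend on precision and the intercept is expected to lie close to zero.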
Aside from methods to compute the statistical significance of funnel plot asymmetry, researchers have also developed methods to correct for this asymmetry. One such method is the so-called trim-and-fill procedure, which allows one to impute missing effect sizes based on the assumption that effect sizes of equal precision should be distributed symmetrically around the mean effect size (Duval & Tweedie, 2000; see Footnote 9). The results of applying this procedure to the current data set are shown in Figure 3b, wherein the open symbols denote the 10 effect sizes that were filled in to correct for the asymmetry. After this correction, the overall effect size of the UTA turned non-significant, with a pooled Hedges’ g = 0.018, a confidence interval of [−0.10; 0.14], a Z-score of 0.30, and p = 0.77 (see Footnote 10).
6 Discussion and conclusions
With several dozen published experiments presenting conflicting results, the unconscious thought advantage (UTA) may be considered one of the most controversial phenomena in psychological science today. While proponents of the UTA have argued that the studies that failed to replicate this effect did not meet certain methodological requirements (Strick et al., 2011), critics have argued that the effect does not exist and that previous reports of the UTA concerned nothing but spurious, unreliable findings (e.g., Acker, 2008; Newell & Rakow, 2011; Nieuwenstein & Van Rijn, 2012). To adjudicate between these opposing views, we conducted a large-scale study that adhered to the conditions deemed optimal for replicating this effect (Strick et al., 2011), and we conducted a meta-analysis that examined the relationship between the effect and sample sizes of previous studies. The results of the large-scale replication study yielded no evidence for the UTA, and it also dispelled the recent suggestion from Nieuwenstein and Van Rijn (2012) that the UTA might be gender-specific. Furthermore, the meta-analysis showed that previous reports of a statistically significant UTA were confined to studies that were relatively unreliable due to the use of small samples of participants. Accordingly, the results of the current study lead us to conclude that the claim that distraction leads to better decision making than deliberation in a multi-attribute choice task has no reliable support.
What is left to be explained then is why the paradigm shown in Figure 1 yields no difference in the quality of decisions made after distraction or deliberation. Does that mean that decision makers are just as well off if they do not think consciously about their choices (Bargh, 2011)? The answer to this question depends on whether one believes that the choices made in the unconscious thought paradigm truly reflect the outcome of two different modes of thought. On this matter, the literature on human judgment and decision-making offers a sobering perspective. Specifically, this literature includes many findings that show that people rapidly form their opinion when asked to make a judgment (e.g., Baron, 2008; Gigerenzer & Gaissmaier, 2011; Kahneman, 2011). Furthermore, an abundance of findings show that once people have formed an opinion, they are unlikely to change that opinion, as they will only tend to seek further evidence to support that opinion (e.g., Bruner & Potter, 1964; Edwards & Smith, 1996; Lord, Ross, & Lepper, 1979). Accordingly, the fact that there is no difference in the accuracy of difficult choices made after distraction or deliberation is naturally explained by assuming that participants have already made up their minds during the information acquisition phase of the task and that the ensuing deliberation or distraction phase does not lead them to change their opinion (see also Lassiter et al., 2009; Newell & Rakow, 2011). Rather, participants in the distraction condition may simply recall their earlier judgment, whereas participants in the conscious deliberation condition may only search their memory for confirmatory evidence for their earlier established preference.
A last aspect of the data that needs to be explained is why the published literature includes more studies reporting a significant UTA than studies reporting a significant benefit for conscious deliberation. In keeping with the results of our meta-analysis, this asymmetry appears to be due to a publication bias against small sample studies that found evidence for a conscious thought advantage. We can conceive of two reasons for this publication bias. The first is that the UTA concerns a more newsworthy finding than the finding of a conscious thought advantage, as distraction is generally thought to have a detrimental effect on task performance, and, therefore, studies reporting a beneficial effect of distraction will be considered more interesting and newsworthy than studies reporting a detrimental effect of distraction. A second reason could be that any small-sample studies (modeled after the original, small-sample studies by Dijksterhuis and colleagues, 2004, 2006) that produced an effect opposite to that of Dijksterhuis and colleagues are likely to be rejected due to the use of a small sample size. This may be considered the catch-22 of the publication of a small sample study that shows a remarkable, but spurious novel effect: Once such a report is published, researchers will generally adhere to the methods of the original study in their replication attempts, and this may either lead to a coincidental replication of the same spurious effect, or to a non-replication that is much more difficult to publish because it is difficult to argue against the existence of a published effect on the basis of a small-sample study (e.g., Frick, 1995).
Aside from a publication bias, another reason for the asymmetry in available findings could be a confirmation bias on the part of the researchers who believe in the existence of the UTA. This bias could take different forms, as researchers who believe in a certain theory or phenomenon might engage in various questionable research practices, such as p-hacking (e.g., collecting data until the results look the way they should according to one’s favorite hypothesis; Bakker et al., 2012; Ioannidis, 2005; Wagenmakers, 2007), selectively reporting one of several indices of performance (Simmons, Nelson, & Simonsohn, 2011), or running several studies to test the same hypothesis, each time under slightly different conditions, until a theory-predicted result is found (e.g., Greenwald, Pratkanis, Leippe, & Baumgardner, 1986). Of course, the risk of these practices is that they are bound to produce a predicted outcome at some point, if only by mere coincidence.
To conclude, the current study shows that previous findings suggesting the existence of an unconscious thought advantage in complex decision making concern spurious effects that were obtained with unreliable methods. Accordingly, our findings make clear that future research on the UTA should use more reliable methods, and they also make clear that the results of previous studies on this effect should be interpreted with great caution until they have been replicated in a properly powered study. Until that day, the idea that a momentary diversion of thought leads to better decision making than a period of deliberation remains an intriguing but speculative hypothesis that lacks empirical support.
 
 








