When people in psychological distress seek a diagnosis, they often prefer more experienced clinical psychologists; however, counter-intuitively, this may not always be wise (Tracey et al., 2014; Shanteau, 1992). While it is self-evident that clinical psychologists who have worked longer in the field gain more experience, it has also been shown that they do not necessarily gain more expertise in psychodiagnostic decision-making (Spengler et al., 2009; Vollmer et al., 2013): Even after years of experience, clinical psychologists are relatively poor at categorizing mental disorders into DSM categories (Brailey et al., 2001; Schulte-Mecklenbeck et al., 2015) and make judgments strikingly similar to those of novices (e.g., Ægisdóttir et al., 2006; Garb, 1998; Strasser & Gruber, 2004; Witteman & Van den Bercken, 2007).
Two possible explanations for this phenomenon have been suggested. One is a task effect (Shanteau, 1992; Shanteau & Weiss, 2014; Tracey et al., 2014). In a “wicked” learning environment (Hogarth, 2001) such as the clinical domain, where decisions are based on uncertain, incomplete knowledge and without feedback, it is hard to learn from experience and improve performance. The other explanation is that the reasoning style on which experienced clinical psychologists rely does not fit the task (Tracey et al., 2014). In this paper, we focus on the second explanation.
As experience increases, professionals tend to move from the deliberative, detail-oriented processing of the beginner to faster, more automated information processing (Betsch & Haberstroh, 2005; Elstein & Schwartz, 2002; Evans, 2008; Kahneman, 2011). Epstein (e.g., 2010) uses the terms “experiential/intuitive” versus “rational/analytical” to refer to these different ways of processing information. Experienced clinical psychologists may thus diagnose more intuitively, quickly matching client presentations to prototypes (Westen, 2012). The benefit of greater experience, demonstrated in many other fields of expertise (Ericsson, 2009; Quiñones et al., 1995), may therefore be offset in the mental health domain by the use of a less suitable reasoning style, which would explain why more experienced psychologists are not more diagnostically accurate than novices.
In our study we employed two formats for making clinical decisions: i) brief text vignettes and ii) MouselabWeb matrices (Schulte-Mecklenbeck et al., 2011), a process tracing tool that records both how long and how often cues are inspected. We also included control vignettes and matrices, and recruited age-matched participants from other fields of expertise to control for age and the domain-specificity of the effects.
Outcome hypothesis. We expected to find no differences in psychodiagnostic accuracy between novice and experienced clinical psychologists on the clinical tasks (cf. Spengler et al., 2009); and we expected clinical psychologists to perform better than control participants on the clinical tasks, as these required domain-specific knowledge.
Style hypothesis. We expected experienced clinical psychologists to report a stronger preference for experiential processing than novice psychologists (cf. Betsch & Haberstroh, 2005). Additionally, because the environment does not allow learning from experience (Tracey et al., 2014), we expected that more experience would not be associated with higher accuracy, and that a stronger preference for rational processing would be related to higher accuracy.
Processing hypothesis. We expected experienced clinical psychologists to be more intuitive and quicker in their decision-making (Westen, 2012), especially on the clinical tasks, than both novice psychologists and control participants.
Twenty novice (18 females) and 20 experienced clinical psychologists (14 females) participated in this study. Novices were Master's students in clinical psychology or young professionals, with a mean of 3 months of experience (SD = 3.2 months) and an average age of 25.3 years (SD = 3.83 years). Experienced clinical psychologists had a mean of 15.6 years of experience (SD = 11.4 years) and an average age of 42.9 years (SD = 13.1 years).
Novice clinical participants were recruited at two universities, with similar clinical psychology curricula. Experienced clinical participants were recruited through the membership list of their professional organization.
Additionally, we recruited 40 age-matched control participants. Twenty (14 females) were Master's students or young professionals in a field other than mental health, with an average age of 23.6 years (SD = 3.21 years). The other twenty (12 females), who served as controls for the experienced group, had an average age of 43.3 years (SD = 11.8 years).
All participants completed the study at home on their own computers. Eight gift certificates (each worth €5) were raffled among the 80 participants.
Psychodiagnostic tasks. All psychodiagnostic tasks, both vignettes and MouselabWeb matrices, used criteria taken from the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV-TR; American Psychiatric Association, 2000). Each task required participants to indicate which of two DSM-IV-TR diagnoses best fit the case information. There were always eight pieces of information. Between one and four of these were diagnostic (i.e., defining) for one diagnosis but not for the other. The remaining pieces of information were non-specific for either of the two diagnoses. For instance, in a task with the diagnostic choices major depressive disorder and dysthymic disorder, information that symptoms have been present only for the last month is defining for major depressive disorder, while low energy can typically occur in both disorders. Different pieces of information and diagnoses were used in each task, for a total of 90 unique clinical tasks (45 in vignette format, 45 in MouselabWeb format).
The control tasks concerned general knowledge about countries, food, and animals and were constructed in the same way as the clinical tasks (i.e., eight pieces of information and two possible answers). The control tasks were piloted and difficulty matched using a different group of young and older volunteers.
Vignettes. The vignettes were short text descriptions of the eight pieces of information, followed by the two possible diagnostic labels. Control vignettes were in the same format, with 15 about countries, 15 about food, and 15 about animals. The cases presented using vignettes were different from those presented via MouselabWeb.
MouselabWeb matrices. In the MouselabWeb matrices (Willemsen & Johnson, 2011; see Figure 1), participants first saw eight closed grey boxes and two diagnostic labels. Hovering the mouse over a box revealed the information in that box, while moving it away closed the box again. Thus, participants saw only one piece of information at a time. Participants were instructed to inspect the information, without time constraints, and were not required to open all eight boxes. They indicated a decision by clicking on one of the labels. Participants completed 45 clinical and 45 control MouselabWeb matrices, the latter comprising 15 tasks each about countries, food, and animals. Tasks were presented to all participants in the same order.
Figure 1. Example of clinical (left) and control (right) MouselabWeb matrices.
The MouselabWeb software allowed us to measure accuracy (correct or incorrect), the number of opened boxes (number of acquisitions), and the time spent on each task.
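As a rough illustration of how these three measures could be derived from process-tracing data, the sketch below computes them from a list of box-opening events. The record layout (`Acquisition`, `Trial`, and all field names) is hypothetical and does not reflect MouselabWeb's actual output format. Counting every opening as one acquisition, rather than counting distinct boxes, is likewise an assumption, so both counts are shown.

```python
from dataclasses import dataclass, field


@dataclass
class Acquisition:
    """One hover over a box (hypothetical record format)."""
    box: int       # index of the opened box, 0-7
    open_ms: int   # time the box opened, in ms since task onset
    close_ms: int  # time the box closed again, in ms since task onset


@dataclass
class Trial:
    """One matrix task as completed by one participant (hypothetical)."""
    correct_label: str
    chosen_label: str
    decision_ms: int  # time of the click on a label, in ms since task onset
    acquisitions: list = field(default_factory=list)

    def accuracy(self) -> int:
        """1 if the chosen label is the correct one, else 0."""
        return int(self.chosen_label == self.correct_label)

    def n_acquisitions(self) -> int:
        """Number of box openings, counting re-openings separately."""
        return len(self.acquisitions)

    def n_boxes_inspected(self) -> int:
        """Number of distinct boxes opened at least once."""
        return len({a.box for a in self.acquisitions})

    def time_spent_s(self) -> float:
        """Total time on the task in seconds."""
        return self.decision_ms / 1000.0
```

A trial in which box 0 is opened twice and box 3 once would yield three acquisitions but two inspected boxes.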
The design of the study was as presented in Figure 2.
3.1. Rational Experiential Inventory
Figure 2. Study design.
Table 1. Mean scores (SDs) of the Rational-Experiential
The scale ranges from 1 to 5.
3.2. Analysis Strategy
To investigate the role of experience level (novice/experienced), profession (clinical psychologist/control), question type (clinical/non-clinical), and their potential interactions on diagnostic decisions, we used (generalized) linear mixed-effects models (sometimes also referred to as multilevel models or hierarchical linear models), which can account for non-independence in the data (for example, because each participant contributed more than one data point). This approach has several advantages over more traditional analysis approaches: It allows analysis of data at the trial level (thus making it unnecessary to aggregate across items or participants) while safeguarding against inflated Type I errors by modeling all relevant potential sources of variation and taking the non-independence into account. We used the lme4 package (Bates, Maechler, Bolker, & Walker, 2014) in R (R Core Team, 2013) for the mixed-models analysis. To determine p-values for the effects of interest based on likelihood ratio tests (comparing the model with the effect of interest to the same model without it), we used the mixed function from the package afex, version 0.15-2 (Singmann, Bolker, & Westfall, 2014).
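The logic of such a likelihood ratio test can be sketched in a few lines. This is a minimal Python illustration, not the afex implementation: the function names are hypothetical, and it covers only tests with one degree of freedom, for which the chi-square tail probability has the closed form erfc(sqrt(x / 2)).

```python
import math


def lrt_statistic(loglik_full: float, loglik_reduced: float) -> float:
    """Chi-square statistic: twice the gain in log-likelihood
    of the full model over the reduced model."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_sf_df1(stat: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 df.

    A chi-square(1) variable is a squared standard normal, so
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if stat <= 0:
        return 1.0
    return math.erfc(math.sqrt(stat / 2.0))


# Example: the p-value implied by a statistic of chi2(1) = 4.19.
p_value = chi2_sf_df1(4.19)
```

For a statistic of 4.19 the resulting p-value falls just below 0.05, while the conventional critical value of about 3.84 corresponds to p = 0.05.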
As a general modeling strategy, we always first ran an omnibus model containing all predictors and interaction terms of interest (experience level, profession, vignette type, and their interactions), and then ran follow-up models to further investigate significant interactions and/or main effects of interest.
For the vignettes analysis (as for the MouselabWeb analysis), experience level, profession, vignette type, and their interactions were modeled as fixed effects, and participants and items were modeled as random intercepts. Vignette type was added as a random slope varying over participants, and experience level and profession were added as random slopes varying over items; in addition, the model contained all possible random correlation terms among the random effects. This represents a “maximal” random effects structure that both accounts for the repeated-measures nature of the data and avoids inflated Type I errors (Barr, Levy, Scheepers, & Tily, 2013).
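One way to write down the accuracy model described above is given here. The notation is an assumption introduced for illustration: Y_{pi} is the correctness of participant p on item i, E_p is experience level, C_p is profession, T_i is vignette type, and x_{pi} collects these three predictors and all their interactions. Whether the by-item slopes also included the interaction of experience level and profession is not stated, so only the two main-effect slopes are shown.

```latex
\begin{align}
\operatorname{logit}\Pr(Y_{pi} = 1)
  &= \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_{pi}
   + u_{0p} + u_{1p}\,T_i
   + w_{0i} + w_{1i}\,E_p + w_{2i}\,C_p, \\
(u_{0p}, u_{1p})^{\top}
  &\sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_u), \qquad
(w_{0i}, w_{1i}, w_{2i})^{\top}
  \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_w).
\end{align}
```

Here the u terms are by-participant random effects and the w terms are by-item random effects; leaving both covariance matrices unstructured corresponds to including all random correlation terms. For the response-time models, the logit link would be replaced by the identity link with a Gaussian residual term.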
For the vignettes data, we first present the analysis of accuracy (correct or incorrect response), using a generalized mixed-effects model appropriate for binary data, followed by the analysis of response times, which used a Gaussian model. The same analyses were run for the MouselabWeb data, with an additional model that used the number of acquisitions as the dependent variable. Finally, within each analysis, we first present the results of the models without REI scores (representing tests of our outcome hypotheses), followed by the same models with REI scores added (representing tests of our style hypotheses).
3.3. Vignettes-Outcome and Style
There were no significant main effects of experience level (χ2(1) = 0.52, coeff = −0.044, p = 0.47, CI95% [−0.17, 0.08]) or vignette type (χ2(1) = 0.02, coeff = −0.022, p = 0.89, CI95% [−0.35, 0.26]), and no significant interaction between profession and experience level (χ2(1) = 1.49, coeff = 0.21, p = 0.22, CI95% [−1.25, 0.17]), on accuracy (model fit: BIC = 5435; AIC = 5516). Clinical psychologists exhibited overall greater accuracy than controls (χ2(1) = 4.19, coeff = 0.143, p < 0.05, CI95% [0.001, 0.29]). The follow-up models showed that psychologists were significantly more accurate than controls only in clinical vignettes (χ2(1) = 21.6, coeff = 0.536, p < 0.001, CI95% [0.31, 0.76]). Clinical psychologists and controls did not differ in accuracy in control vignettes (χ2(1) = 1.14, coeff = 0.025, p = 0.89, CI95% [−0.11, 0.4]; see Table 2).
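For readers who prefer odds ratios, coefficients from a binomial mixed model can be exponentiated. The sketch below assumes that the reported coefficients are on the log-odds (logit) scale and that each coefficient codes the full difference between the two groups; neither is stated explicitly. If the factors were sum-coded (+1/−1), a coefficient would correspond to half of the group difference on the logit scale, and the conversion would change accordingly. The function names are hypothetical.

```python
import math


def odds_ratio(coeff: float) -> float:
    """Convert a log-odds coefficient to an odds ratio."""
    return math.exp(coeff)


def odds_ratio_ci(lower: float, upper: float) -> tuple:
    """Convert confidence limits on the log-odds scale to the odds-ratio scale."""
    return (math.exp(lower), math.exp(upper))


# Example: profession effect on clinical vignettes
# (coeff = 0.536, CI95% [0.31, 0.76]), under the assumptions stated above.
or_clinical = odds_ratio(0.536)
or_clinical_ci = odds_ratio_ci(0.31, 0.76)
```

Under these assumptions, a coefficient of 0.536 corresponds to odds of a correct answer roughly 1.7 times higher for clinical psychologists than for controls.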
Response time showed a negative relationship with accuracy in all groups and in both vignette types; the longer participants took to solve tasks, the less accurate they were (χ2(1) = 7.81, coeff = −0.301, p < 0.01; CI95% [−1.45, −0.58]).
The observed results are consistent with our outcome hypothesis: Experienced and novice clinical psychologists did not differ in accuracy on the clinical vignettes.
With respect to our style hypothesis, the following interactions were significant in the same model as above, but with thinking styles (rationality, RAT; experientiality, EXP) added (model fit: BIC = 5624; AIC = 5434): the three-way interaction between RAT, experience, and profession (χ2(1) = 8.62, coeff = 0.24, p < 0.01, CI95% [0.17, 0.32]); the three-way interaction between experience level, vignette type, and EXP score (χ2(1) = 8.24, coeff = 0.15, p < 0.05, CI95% [0.15, 0.41]); and the four-way interaction between RAT, profession, vignette type, and experience (χ2(1) = 6.51, coeff = 0.16, p = 0.01, CI95% [0.52, 0.91]). Follow-up models demonstrated that there was no relationship between EXP and accuracy in experienced clinical psychologists (χ2(1) = 0.01, coeff = −0.001, p = 0.99, CI95% [−0.44, 0.37]), whereas in novices, EXP scores were negatively related to accuracy (χ2(1) = 8.16, coeff = −0.71, p < 0.01, CI95% [−1.27, −0.12]; Figure 3, left). In contrast, the RAT score was related to accuracy in experienced psychologists (χ2(1) = 9.71, coeff = 0.556, p < 0.01, CI95% [0.71, 1.42]) but not in novice psychologists (χ2(1) = 3.6, coeff = 0.21, p = 0.06, CI95% [−0.94, 0.04]; see Figure 3, right).
Table 2. Mean accuracy percentages.
Figure 3. Association between experientiality (EXP) and accuracy (left) and rationality (RAT) and accuracy (right) for novice and experienced psychologists in vignettes. Accuracy score is the percentage of correct answers. Experientiality and rationality scores were analyzed as continuous variables but are, for illustrative purposes, presented as binary variables.
For descriptive reasons (to present the relationships in a measure more familiar to most readers than the coefficients in the mixed-effects models), Pearson correlations were computed between REI scores and accuracy of both novice and experienced clinicians. As in the mixed-models analysis, higher experiential scores were negatively correlated with the mean accuracy of novice (r = −0.626; p < 0.01) but not experienced clinicians (r = −0.096; p = 0.69); higher rational scores were positively correlated with the mean accuracy of experienced (r = 0.457; p < 0.05) but not novice clinicians (r = −0.278; p = 0.24).
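The descriptive correlations can be related to their significance levels with a short calculation. The sketch below is an illustration under stated assumptions rather than the analysis code used in the study: it computes a Pearson correlation from raw scores and converts a correlation into the t statistic used for its significance test, assuming n = 20 clinicians per group. The function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length, n >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t statistic (df = n - 2) for testing a correlation against zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Example: EXP score and accuracy in novices (r = -0.626), assuming n = 20.
t_novice_exp = t_statistic(-0.626, 20)
```

With 18 degrees of freedom, the two-sided critical value at the 0.01 level is about 2.88, so a t statistic of about −3.41 is consistent with the reported p < 0.01.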
There were no significant interactions between EXP and RAT and other variables in the control group or in control tasks (all p > 0.08).
3.4. Vignettes-Processing Time
To investigate whether the four groups differed in task completion time, we used a modelling approach similar to that used for accuracy, with the duration of each task (in seconds, log-transformed) as the dependent variable, and with the lmer function instead of glmer.
As with accuracy, there were no significant main effects of experience level (χ2(1) = 0.4, coeff = 0.01, p = 0.53, CI95% [−0.05, 0.06]) or profession (χ2(1) = 0.69, coeff = 0.01, p = 0.41, CI95% [−0.05, 0.07]) on vignette duration. The results did not support the processing hypothesis, as no differences emerged between novice and experienced clinicians in the task duration for the vignette tasks.
3.5. MouselabWeb-Outcome and Style
The results of the MouselabWeb matrices were consistent with those of the vignettes (model fit: BIC = 6362; AIC = 6219). There was no significant main effect of experience level (χ2(1) = 0.02, coeff = 0.9, p = 0.9, CI95% [−0.11, 0.13]) on accuracy (Table 2, third and fourth columns). Clinicians exhibited greater accuracy overall than controls (χ2(1) = 11.93, p < 0.001, CI95% [0.09, 0.26]), and control question type was associated with greater accuracy (χ2(1) = 6.72, p = 0.01, CI95% [−0.59, −0.09]). These effects were qualified by a significant interaction between profession and question type (χ2(1) = 30.87, p < 0.001, CI95% [0.1, 0.23]). Follow-up models demonstrated no difference between the groups on control questions (χ2(1) = 0.01, p = 0.93, CI95% [−0.16, 0.18]), but clinicians were significantly more accurate than controls on the clinical questions (χ2(1) = 32.97, p < 0.001, CI95% [0.22, 0.49]).
Time spent solving the matrices had a negative effect on accuracy for both groups and in both matrix types: Longer times were associated with lower accuracy (χ2(1) = 37.88, coeff = −0.28, p < 0.001, CI95% [−0.65, −0.46]).
When REI scores were added to the model, we again found an effect that was specific to clinical psychologists in their field of expertise (model fit: BIC = 6203; AIC = 5498): In novice clinical psychologists, a stronger preference for an EXP style was associated with lower accuracy (χ2(1) = 5.49, coeff = −0.28, p = 0.02, CI95% [−0.78, −0.34]), while in experienced psychologists a stronger preference for RAT was associated with higher accuracy (χ2(1) = 4.39, coeff = 0.16, p = 0.04, CI95% [0.66, 1.74]).
As for the vignettes, we present correlations for purely descriptive purposes. Higher experiential scores were negatively correlated with the accuracy of novice (r = −0.472; p < 0.05) but not of experienced clinicians (r = 0.121; p = 0.61); higher rational scores were positively correlated with the mean accuracy of experienced (r = 0.457; p < 0.05) but not novice clinicians (r = −0.331; p = 0.15).
3.6. MouselabWeb-Processing Time and Acquisitions
Only one main effect was significant in the omnibus model with time: Response times differed significantly between task types (χ2(1) = 13.05, coeff = 0.09, p < 0.001, CI95% [0.04, 0.14]). All participants took longer to complete the clinical than the control tasks.
Only one significant main effect was found in the model with acquisitions: Task type was associated with the number of acquisitions. All participants made more acquisitions in the clinical than in the control tasks (χ2(1) = 41.67, coeff = 1.19, p < 0.001, CI95% [0.86, 1.55]). All other main effects and interactions were non-significant (all ps > 0.21; see Table 3).
First, both with vignettes and with MouselabWeb we replicated previous research showing that experience does not influence the accuracy of psychodiagnostic decisions (cf. Spengler et al., 2009): Novice and experienced clinical psychologists did not differ in their accuracy in diagnostic tasks. Clinical participants were more accurate than control participants on the clinical tasks, and performed equally well on the control tasks.
Table 3. Mean number of acquisitions (SDs) in MouselabWeb matrices
The number of acquisitions indicates how many boxes were opened on average per task.
Second, and contrary to our hypothesis, novice and experienced clinical psychologists did not differ in self-reported preference for an experiential thinking style, while novice psychologists had a lower self-reported preference for a rational thinking style than the other groups.
Our results do not support the explanation that experience fails to improve psychodiagnostic accuracy because experienced clinical psychologists prefer to use intuition more than novices do. On the contrary: Experienced clinical psychologists did not report a stronger preference for experiential reasoning than novices. They may realize that the clinical environment is not predictable but is, in Hogarth’s terms, “wicked” (Hogarth, 2001), and that they have not had an opportunity to learn (cf. Kahneman & Klein, 2009). No educated intuition seems achievable in this task; clinicians do not engage in deliberate practice and they lack accurate feedback (Tracey et al., 2014). A novice’s intuition is uninformed and therefore not conducive to accuracy, which may explain why preferring to use intuition does not help novices be more accurate.
A few limitations have to be addressed. First, the relationship between self-reported thinking strategy and actual strategy use is questionable (Nisbett & Wilson, 1977; Wilson, 2002). Higher REI scores indicate a stronger preference for, but not necessarily actual use of, the respective thinking style. However, previous studies demonstrated that REI scores do correlate with performance on tasks that have heuristic-intuitive or reasoned-rational solutions (Witteman et al., 2009).
Another limitation of this study is that the tasks employed were quite easy (average correct responses over 80%). Future research might profit from using only the more difficult tasks. One can argue that the tasks used are somewhat artificial, more so than asking clinicians to interact with an actor-client (Groenier, Beerthuis, Pieters, Witteman, & Swinkels, 2011), and do not mimic actual psychodiagnosis. While in practice diagnostic decision-making indeed does not involve a binary choice, diagnostic classification is a sub-task that needs to be performed before treatment can start and entails clustering the presented symptoms into a disorder label. We used forced-choice tasks to allow us to see which symptom(s) were judged as diagnostic of the presented disorders. As done previously (e.g., Witteman & Van den Bercken, 2007), vignettes were used to optimize methodological rigor (cf. Bachmann et al., 2008); this greater methodological rigor, however, comes at the cost of the ability to generalize our results to more realistic diagnostic situations.
Finally, though our sample size is typical for this kind of study, increasing the number of participants would increase the confidence that our results generalize to the larger population of clinical psychologists.
The results of the current study indicate that a preference for deliberative thinking is associated with better clinical decision-making, but only for experienced clinical psychologists. For novice psychologists, preferring experiential or intuitive processing is associated with poorer clinical decision-making. We conclude that deliberating about a psychodiagnostic classification serves even the more experienced clinical psychologists, while novices should not trust their intuition. Our results might be used to inform the training of clinical psychologists. Prospective clinical psychologists should be aware of the impact of their thinking style on their diagnostic accuracy, and be encouraged to deliberate and to question their intuition.