In contrast to many randomized controlled trials (RCTs), the naturalistic pre-post design evaluates the outcome of psychological treatment in a real-world setting, where patients are referred to therapy in standard clinical practice (Leichsenring & Rabung, 2007; Nathan, Stuart, & Dolen, 2000). However, drop-out from research projects is a major concern in psychotherapy research, and is a serious threat to the external validity of naturalistic effectiveness studies if those who drop out differ systematically from those who remain in the study (Groenwold, Donders, Roes, Harrell, & Moons, 2012). Nevertheless, Schlomer, Bauman, & Card (2010) found that only 38% of outcome studies published in the Journal of Counseling Psychology in 2008 reported percentages of missing data, and only one reported the method used to handle this problem. Moreover, even though 84% of the Cochrane reviews of outcome studies in mental health since 2009 reported missing data, only 35% discussed the implications of this (Spineli, Pandis, & Salanti, 2015).
There are no universally agreed criteria for acceptable rates of project drop-out, but rates of 20% to 50% have been suggested as acceptable in epidemiological cohort studies (Fewtrell, Kennedy, Singhal, Martin, Ness et al., 2008). This, however, may be a substantial loss of data in most naturalistic studies of the outcome of psychotherapy. Moreover, Kristman, Manno, & Côté (2004) presented evidence that acceptable rates may depend on the missingness mechanism, and that data not missing at random in epidemiological studies will result in a significant bias even when only 20% of the population is lost. Thus, generalization from outcome studies, including RCTs, should be preceded by a careful examination of the “missingness mechanism(s)” (Bell, Kenward, Fairclough, & Horton, 2013).
When data are “missing completely at random” (MCAR), the causes of missing data are unrelated to any variable in the dataset, whereas data “missing at random” (MAR) are unrelated to the outcome after controlling for the predictor variables. In contrast, data “missing not at random” (MNAR) imply that the probability that the outcome values are missing depends on the missing values themselves (cf. Streiner, 2008; Newman, 2014). Consequently, the problem in a pre-post psychotherapy evaluation project is to clarify whether a patient who drops out would have been less likely to improve than patients with a similar background and similar scores on predictor variables. Thus, even though data missing not at random are “nonignorable nonresponses”, the possibility of non-responding becomes a conceptual consideration, as there is no easy way to unravel the missingness mechanism from the observed data (ibid).
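The practical difference between the three mechanisms can be illustrated with a small simulation. The sketch below is not part of the study: the data are synthetic, and the variable names, response probabilities and effect size are assumptions chosen only for illustration. It compares a completer mean with an estimate that adjusts for an observed predictor; under these assumptions, the adjustment should largely remove the bias under MAR but not under MNAR.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate(mechanism, n=20000, seed=1):
    """Return (full, observed) lists of (x, y) pairs for one missingness mechanism.

    x is a hypothetical observed pre-treatment predictor, y a hypothetical
    outcome (higher = more improvement) that is correlated with x.
    """
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 0.5 * x + rng.gauss(0.0, 1.0)
        if mechanism == "MCAR":      # response unrelated to any variable
            p_obs = 0.75
        elif mechanism == "MAR":     # response depends only on the observed x
            p_obs = 0.9 if x > 0 else 0.6
        elif mechanism == "MNAR":    # response depends on the outcome itself
            p_obs = 0.9 if y > 0 else 0.6
        else:
            raise ValueError(mechanism)
        full.append((x, y))
        if rng.random() < p_obs:
            observed.append((x, y))
    return full, observed


def completer_mean(observed):
    """Naive estimate: mean outcome among responders only."""
    return mean([y for _, y in observed])


def stratified_mean(full, observed):
    """Responder means within strata of x, weighted by full-sample stratum shares."""
    estimate = 0.0
    for positive in (True, False):
        share = sum(1 for x, _ in full if (x > 0) == positive) / len(full)
        ys = [y for x, y in observed if (x > 0) == positive]
        estimate += share * mean(ys)
    return estimate


for mechanism in ("MCAR", "MAR", "MNAR"):
    full, observed = simulate(mechanism)
    true_mean = mean([y for _, y in full])
    print(
        mechanism,
        "completer bias:", round(completer_mean(observed) - true_mean, 3),
        "adjusted bias:", round(stratified_mean(full, observed) - true_mean, 3),
    )
```

In a real project the outcomes of the drop-outs are unknown, so the "true" mean used here for comparison is not available; this is why the mechanism cannot be identified from the observed data alone.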
Nevertheless, there is evidence that missing outcome data in psychotherapy studies may be related to a less favorable outcome of treatment. In a previous study, Samstag, Batchelder, Muran, Safran et al. (1994, 1998) found missing data to be an even better predictor of treatment failure in brief psychotherapy than a number of self-report questionnaires. More recently, Clark, Layard, Smithies, Richards et al. (2009) analyzed United Kingdom psychotherapy data collection sites which used two self-report outcome monitoring systems. One of the measures was a session-by-session system and the other a more conventional, less frequently sampled symptom questionnaire. Patients were most compliant with the session-by-session registration, but those who also complied with the less frequently sampled questionnaire had, on average, effect sizes that were 1.72 times larger in the session-by-session evaluations of outcome.
However, according to Newman (2014), it is important to distinguish between partial project responders, as reported by Clark, Layard, Smithies, Richards et al. (2009), and “non-responders” who do not answer any of the outcome questionnaires. This level of missingness is far more problematic, because the researchers possess no relevant information about the patient that can be used to reduce missing data bias or error in the evaluation of outcome (ibid). Thus, even though statistical methods have been employed to handle missing data, such as multiple imputation or last-observation-carried-forward (Bell, Kenward, Fairclough, & Horton, 2013), the main drawback of all methods of estimating outcomes in the absence of actual data is that they introduce uncertainty about what really happened to the patient (Streiner & Geddes, 2001).
We have previously reported the outcome of 39 sessions of psychodynamic group psychotherapy (Jensen, Mortensen, & Lotz, 2010), a treatment format which is now included in the treatment packages in Danish outpatient psychiatry (Danske Regioner, 2012). Pre-post treatment improvement was assessed by the Symptom Check List-90-Revised Global Severity Index (SCL-90-R GSI) according to Jacobson & Truax’s (1991) construct of Reliable Change. The number of project drop-outs who had completely missing outcome data was reported, but the implications of this were not discussed. However, as recommended by Schlomer, Bauman, & Card (2010), the possibility of outcome data missing not at random should always be considered.
Fortunately, in our previous study, it is possible to explore missing outcome data by analysis of therapist ratings of pre-post treatment improvement after a final post-treatment interview with the patients. In the present study, we estimate the percentage of reliably changed project drop-outs by exploring the association between therapist evaluations of improvement and reliable change in GSI in project compliant patients. We predict that project drop-outs are less likely to have improved according to the therapists, and consequently, that they are estimated to be less likely to have reliably changed in GSI.
In contrast to earlier practices, therapist evaluations of outcome seem to be less common in psychodynamic psychotherapy research (Lambert, 2013). However, as part of the national digitization of the public services in Denmark, both therapists and patients are encouraged to participate in the electronic evaluation of the treatment offered. The therapists assess the patients on the Global Assessment of Functioning (GAF) scale, but unfortunately only at termination from therapy. This implies that, in contrast to the present analysis, only end-state functioning of the project drop-outs can be measured, and not improvement in the course of therapy.
2.1. Participants and Procedure
The study is part of a long-term pre-post naturalistic psychotherapy evaluation project with a 1-year follow-up at the out-patient psychotherapy unit, Bispebjerg Hospital, Copenhagen. Measurements and treatment have been described previously (Lotz & Jensen, 2006; Jensen, Mortensen, & Lotz, 2010). In brief, 378 patients were invited to participate in 39 sessions of psychodynamic group therapy, mostly within a period of 13 - 15 weeks. Of these patients, 348 (92.1%) agreed to participate in the evaluation project. However, 4 did not answer any of the pre-treatment questionnaires, 2 otherwise project compliant patients had missing post-treatment SCL-90-R, and 15 were not SCL-90-R Global Severity Index (GSI) cases according to Danish norms for pathology (Olsen, Mortensen, & Bech, 2006). This resulted in a final sample representing 86.5% (327) of the eligible patients (73% women, mean age 36.7 (SD 11.1) years). All patients had ICD-8 diagnoses, and one third also had ICD-10 diagnoses. However, transformation from ICD-8 to ICD-10 revealed that the majority of the patients had mood (8.8%; F30-39), neurotic (47.7%; F40-48) and personality disorders (41.2%; F60-69).
Two therapists and 6 - 8 patients participated in each of four open heterogeneous groups. The majority of the therapies were administered by non-academic staff (nurses, social workers, and occupational and physical therapists) with a psychiatrist, a psychologist, or a physician (under training), as co-therapist. A total of sixteen therapists participated in the study.
As previously described, the analysis includes a broad spectrum of clinical, socio-demographic and self-report variables, including the SCL-90-R and the Millon Clinical Multiaxial Inventory-II (MCMI-II) (see Jensen, Mortensen, & Lotz, 2008, 2010, 2013). The analysis of SCL-90-R subscales included GSI-corrected (i.e., ipsatized) subscale scores, obtained by subtracting the GSI value from each of the subscales (Jensen, Mortensen, & Lotz, 2013).
A Therapist Retrospective Outcome Evaluation questionnaire (ThROE) including six items was answered by the group therapists after termination from therapy (Jensen, Mortensen, & Lotz, 2008). The item “What happened with the patients’ symptoms and problems” was included in the present analysis (1 = much improved, 2 = improved, 3 = unchanged, 4 = worse, and 5 = much worse). A corresponding post-treatment questionnaire was answered by the patients (PtROE). The two therapists in the groups discussed the outcome and agreed upon the rating after a final post-treatment interview. Immediately after the interview, the patient was given an envelope with post-treatment questionnaires which had to be returned within the next few weeks. Thus, due to the data-sampling procedure, none of the therapists were aware of the patients’ self-reported symptoms at either pre-treatment or post-treatment when they rated patient outcome, and they did not know whether the patient would subsequently drop out of the project.
Failure to respond to the complete post-treatment questionnaire package was classified as project drop-out (Newman, 2014). Thus, a few patients who failed to answer either the PtROE or the MCMI-II, but nevertheless answered all other post-treatment questionnaires, were classified as project compliant patients, as some of the questionnaires might have been lost during the six-year data collection and data entry period. Similarly, drop-out of treatment was operationalized as premature termination from the 39 therapy sessions that was not approved by the therapists (Jensen, Mortensen, & Lotz, 2014).
2.3. Statistical Analysis
We classified GSI difference scores of improvement according to Jacobson & Truax’s (1991) Reliable Change Index (RCI). The RCI is viewed as the minimum individual pre-post-treatment change (difference score) that can be called “statistically significant”, and it may be calculated from the reliability coefficients of the test (cf. Jensen, Mortensen, & Lotz, 2010). The RCI was calculated to be −0.34 (post-treatment minus pre-treatment difference scores).
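For readers unfamiliar with the construct, the following sketch shows how a reliable change threshold is commonly derived from Jacobson & Truax’s (1991) formula, in which the standard error of measurement is the pre-treatment standard deviation multiplied by the square root of one minus the test reliability. The standard deviation and reliability used in the example call are hypothetical; they are not the values behind the −0.34 reported above, which are given in the earlier publication. Treating a change exactly equal to the threshold as reliable is also an assumption of this sketch.

```python
import math


def reliable_change_threshold(sd_pre, reliability, z=1.96):
    """Smallest absolute pre-post difference regarded as reliable change.

    sd_pre      -- standard deviation of the pre-treatment scores
    reliability -- reliability coefficient of the test (e.g. test-retest)
    z           -- critical value; 1.96 corresponds to p < 0.05, two-tailed
    """
    se_measurement = sd_pre * math.sqrt(1.0 - reliability)
    s_diff = math.sqrt(2.0 * se_measurement ** 2)  # SE of a difference score
    return z * s_diff


def is_reliably_improved(pre, post, threshold):
    """Lower GSI means fewer symptoms, so improvement is a negative difference."""
    return (post - pre) <= -threshold


# Hypothetical inputs, for illustration only.
example_threshold = reliable_change_threshold(sd_pre=1.0, reliability=0.5)
print(round(example_threshold, 2))

# Classification with the threshold of 0.34 reported in the text.
print(is_reliably_improved(pre=1.50, post=1.01, threshold=0.34))
print(is_reliably_improved(pre=1.50, post=1.40, threshold=0.34))
```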
The frequency of GSI reliably changed project drop-outs is estimated by a simple ratio calculation based on the frequency of GSI reliably changed project compliant patients and the frequency of evaluations of being “improved” (i.e., “much improved” and “improved”) according to ThROE. The therapists failed to evaluate “symptoms and problems” in two of the patients, but based on available information the two missing ThROE evaluations were classified as “improved” and “not improved” (i.e., “unchanged” or “worse”), respectively. All data were analyzed using SPSS version 22.0 (SPSS for Windows Inc., Chicago, Illinois). For significance tests, alpha was set at 0.05.
Of the 327 patients, 25.4% (83) dropped out of the project and 20.8% (68) dropped out of treatment; 72.3% (60) of the project drop-outs had also dropped out of treatment. According to ThROE, 66.1% (216) “improved” in symptoms and problems and 33.9% (111) did “not improve” (Chi-square 46.82, p < 0.001).
According to ThROE, only 25.3% (21) of the project drop-outs improved as compared with 79.1% (193) of the project compliant patients (Chi-square 80.15, p < 0.001). Moreover, 52.9% (129) of the project compliant patients reliably changed in GSI. This implies a ratio of 0.668 (129/193) between the number of patients who reliably changed in GSI and the number who improved according to ThROE. Applying this “reliable change/ThROE improvement” ratio to the 21 project drop-outs who improved according to ThROE suggests that 14 (21 × 0.668) may have reliably changed. Thus, an estimated 16.9% (14) of the project drop-outs reliably changed as compared with 52.9% (129) of the project compliant patients. In sum, in the whole sample, 143 patients may have reliably changed (129 project compliers and 14 project drop-outs), corresponding to 43.7%. This total sample estimate is significantly lower than the 52.9% (129) found among the project responders (Chi-square 4.96, p = 0.02).
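The arithmetic of the ratio calculation can be retraced in a few lines. The sketch below uses only the counts reported in the text; the variable names are ours, and the number of project compliant patients (244) is derived as 327 minus 83 rather than quoted from the source.

```python
# Counts reported in the text.
n_total = 327               # final sample
n_dropouts = 83             # project drop-outs
improved_compliers = 193    # project compliers rated "improved" on ThROE
improved_dropouts = 21      # project drop-outs rated "improved" on ThROE
reliable_compliers = 129    # project compliers with reliable change in GSI

n_compliers = n_total - n_dropouts  # derived, not quoted: 244

# "Reliable change / ThROE improvement" ratio among project compliers.
ratio = reliable_compliers / improved_compliers

# Apply the ratio to the drop-outs who were rated as improved.
est_reliable_dropouts = round(improved_dropouts * ratio)
est_reliable_total = reliable_compliers + est_reliable_dropouts


def pct(part, whole):
    """Percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)


print("ratio:", round(ratio, 3))
print("drop-outs reliably changed (est.):", est_reliable_dropouts,
      f"({pct(est_reliable_dropouts, n_dropouts)}%)")
print("compliers reliably changed:", reliable_compliers,
      f"({pct(reliable_compliers, n_compliers)}%)")
print("whole sample (est.):", est_reliable_total,
      f"({pct(est_reliable_total, n_total)}%)")
```

The estimate rests on the assumption that the ratio observed among project compliers also holds among project drop-outs, which cannot be verified from the available data.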
The validity of the simple ratio calculation obviously depends on the association between patient-reported SCL-90-R and therapist evaluations of improvement. However, the two outcome measures were significantly associated, and 81.0% (17) of the “much improved” patients reliably changed in GSI as compared with 58.5% (101) of the “improved”, 22.9% (11) of the “unchanged”, and none of the “worse” patients (Kappa 0.026, p = 0.007). Pre-post treatment GSI difference scores and ThROE ratings are shown in Figure 1.
Moreover, evaluations of being “much improved”/“improved” as compared with “unchanged”/“worse” (Figure 1) constitute homogeneous subsets with mean GSI difference scores of −0.49 (0.47) and −0.10 (0.39), respectively (One-Way ANOVA, F = 13.2, p < 0.001; Tukey post hoc test). This supports the validity of the binary classification of ThROE into “improved” and “not improved” patients. The pattern was replicated in the analysis of PtROE with mean GSI difference scores of −0.56 (0.48) and −0.09 (0.39) (One-Way ANOVA, F = 21.1, p < 0.001; Tukey post hoc test; n = 236 due to missing PtROE evaluations). Moreover, patients who reliably changed in GSI were evaluated as improved in ThROE and PtROE in 91.5% (118) and 86.1% (105) of the cases, and therapists and patients agreed on this improvement in 87.7% of the evaluations (Kappa 0.381, p < 0.001).
Furthermore, we identified predictor variables of ThROE evaluations of being “improved”. Independent-samples t-tests were used for continuous variables, and Chi-square tests and Fisher’s exact test for categorical variables. All significant variables were included in multiple logistic regressions that ensured mutual adjustment for the effects of all predictor variables. The result is shown in Table 1. As can be seen from the table, treatment drop-out was the overall most substantial predictor of ThROE improvement. Project drop-out also reached significance, which implies that project drop-out was independently associated with therapist evaluations of improvement.
Figure 1. Pre-treatment (solid lines) and post-treatment (dotted lines) SCL-90-R GSI of project compliant patients associated with therapist evaluations of improvement (see text). The solid horizontal line indicates the gender-stratified project cut-off for GSI pathology, and the dotted line the gender-stratified mean score, according to Danish norms.
Table 1. Multiple logistic regression of all significant predictor variables of therapist evaluations of outcome (ThROE). As can be seen from the table, project drop-out was a significant predictor of ThROE evaluations of being “improved” even after mutual adjustment for the effects of all other variables (see text).
However, project drop-out was significantly associated with drop-out of treatment, and consequently, also with shorter treatment length (Pearson r = 0.88). Project drop-outs stayed in therapy for an average of 6.3 weeks (SD 5.1) as compared with 13.6 weeks (SD 2.0) in project compliant patients (Independent-samples t-test, t = 18.6, p < 0.001). When treatment length was substituted for treatment drop-out in Table 1, treatment length (Wald 24.21, OR 3.17, p < 0.001), ICD-8 other personality disturbances and Somatization symptoms were significant, whereas project drop-out became marginally significant (Wald 3.06, OR 0.47, p = 0.08); the model explained 46.5% of the variance. When both treatment drop-out and treatment length were included, treatment drop-out (Wald 7.08, OR 0.17, p = 0.008), treatment length (Wald 4.48, OR 1.89, p = 0.03), ICD-8 other personality disturbances and Somatization symptoms were significant, whereas project drop-out became non-significant (Wald 1.11, OR 0.61, p = 0.29); the model explained 48.5% of the variance. However, premature termination and a correspondingly short treatment length are characteristic of 72.3% (60) of the 83 project drop-outs. Moreover, it is not surprising that treatment drop-out, which is not approved by the therapists, is associated with a subsequent less favorable evaluation of outcome, and consequently that premature termination is the overall most substantial predictor of improvement in ThROE (Table 1).
Furthermore, we analyzed missing data utilizing the SPSS multiple imputation automatic procedure with 40 iterations, and included all pre-treatment variables in the study together with ThROE, treatment drop-out, project drop-out and treatment length. This method scans the data and uses a monotone method if the data show a monotone pattern of missing values. Otherwise a fully conditional specification is used (SPSS 22 Manual for Windows Inc., Chicago, Illinois). Depending on the variables included, an estimate of 48.6% (160) reliably changed patients was the overall lowest, as compared with 50.2% (165) when ThROE was not included. These estimates are not significantly different from the 52.9% reliably changed in the project completer analysis (Chi-square < 0.1, p > 0.9). However, when ThROE was not included in the multiple imputation, the estimate differed marginally significantly from the 43.7% (143) calculated on the basis of the “reliable change/ThROE improvement” ratio (Chi-square 2.95, p = 0.08).
As soon as the recruitment phase to therapy was finished, problems with missing data arose in the present study. Thus, 7.9% of the 378 patients who participated in 39 sessions of psychodynamic group therapy did not want to participate in the project. We have no data from these patients, but Strauss, Lutz, Steffanowski, Wittmann, Boehnke et al. (2015) described therapist-reported reasons for declining to participate in a large national psychotherapy research project to be mainly additional expenditure of time (33.7%), distrust of data security (22.0%) and dislike of tests (18.8%). It has been suggested that 5% missing data or less is inconsequential for the statistical analysis, whereas more than 10% missing data may result in a bias (cf. Dong & Peng, 2013). We do not believe that the number of patients who were not included in the present study (7.9%) is a serious threat to the generalizability of the results.
In contrast, 25.4% of the patients who participated in the study had completely missing outcome data which, according to the analysis of therapist evaluations of outcome, are most likely to be classified as “missing not at random”. This may be a substantial problem. According to the Jacobson & Truax criterion, 52.9% of the project responders reliably changed on the SCL-90-R Global Severity Index (GSI). However, a simple ratio calculation based on therapist evaluations of improvement, and their concordance with GSI, resulted in an estimate of 43.7% reliably changed. In comparison, a post hoc intention-to-treat analysis based on the last-observation-carried-forward method revealed 39.4% (129) to have reliably changed. However, this type of intention-to-treat analysis may be highly problematic, especially in longer-term treatment, because it mistakenly assumes that all project drop-outs are without improvement (cf. Jung, Serralta, Nunes, & Eizirik, 2013).
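The last-observation-carried-forward figure follows directly from its underlying assumption, as the short sketch below retraces. With completely missing post-treatment data, the carried-forward value is the pre-treatment score, so every project drop-out is counted as unchanged. The counts are those reported in the text; the function name is ours.

```python
n_total = 327              # final sample
n_dropouts = 83            # project drop-outs with no post-treatment data
reliable_completers = 129  # reliably changed among project completers


def locf_reliable_change_rate(reliable_completers, n_total):
    """Percentage reliably changed when every drop-out is treated as unchanged."""
    return 100.0 * reliable_completers / n_total


completer_rate = 100.0 * reliable_completers / (n_total - n_dropouts)
locf_rate = locf_reliable_change_rate(reliable_completers, n_total)

print("completer analysis:", round(completer_rate, 1))
print("LOCF intention-to-treat:", round(locf_rate, 1))
```

Because it allows no improvement among drop-outs, this estimate can be read as a lower bound on the proportion reliably changed, with the completer analysis at the optimistic end.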
Moreover, post hoc analysis of missing data utilizing the SPSS multiple imputation automatic procedure revealed 48.6% to have reliably changed. However, outcome data “missing not at random” complicate all approaches to missing value replacement, including multiple imputation, especially when project drop-outs have completely missing outcome data as in the present study. Although statistical methods exist to handle this problem, they require a model for the missing data, which is not always possible to develop (cf. Streiner; Newman, 2014). Moreover, demographic variables are poor predictors of outcome, and even though clinical characteristics may be associated with end-state functioning, they are not necessarily related to improvement (Bohart & Wade, 2013; Eskildsen, Hougaard, & Rosenberg, 2010).
Therapists may overestimate the outcome of group psychotherapy (Chapman, Burlingame, Gleave, Rees et al., 2012; Elkjaer, Mortensen, Poulsen, Kristensen, & Lau, 2012), and retrospective evaluations generally overestimate treatment success as compared with progress as measured with symptom questionnaires (cf. Green, Gleser, Stone, & Seifert, 1975; Hill & Lambert, 2004). Nevertheless, the calculation of about 44% reliably changed patients may be a realistic estimate of outcome in the present study. This implies that the proportion of reliably changed patients who started in 39 sessions of psychodynamic group therapy may be about 82% of the figures previously reported (Jensen, Mortensen, & Lotz, 2010).
4.1. Clinical Implications
There is “no easy statistical fix” to handle missing data (Little, D’Agostino, Cohen, Dickersin et al., 2012), and according to Graham (2009), the optimal solution would be to collect data in a random sample from those initially missing. This, however, is not easy and sometimes impossible, and the key to solving the problem of missing data is to design and carry out the trial in a way that limits the problem.
However, a recent Cochrane review of strategies to improve project compliance revealed no effects of non-monetary incentives, letters delivered by priority post, or additional reminders to return project questionnaires (Brueton, Tierney, Stenning, Harding, Meredith, Nazareth, & Rait, 2014) . The methods that appeared to work were offering a small amount of money for return of a completed questionnaire, or enclosing a small amount of money with a questionnaire, with the promise of a further small amount of money for return of a completed questionnaire.
However, if there are no such data available, the present study strongly supports that therapist evaluations of improvement after a final post-treatment interview may be valuable in psychotherapy research and quality assurance evaluation programs (cf. Lambert, 2013). In contrast, the use of standard statistical programs for missing data replacement, such as those in SPSS, which are available to a large group of researchers, should be critically considered, because most of these programs are based on the assumption that data are missing at random (cf. Streiner, 2008). Sensitivity analysis techniques have been suggested in effectiveness studies of psychotherapy to test this assumption (cf. Crameri et al., 2015), but in contrast to most standard statistical procedures, they require advanced, specific statistical knowledge.
One percent of the patients who agreed to participate in the present project could not be included in the analysis due to project drop-out at pre-treatment, and 4% did not fulfill the SCL-90-R criterion for pathology. The exclusion of these patients is in agreement with the purpose of the present analysis, and with Chiesa, Fonagy, Bateman, & Mace (2009), who suggested that patients referred to psychodynamic treatment in the UK National Health Service, and who were subclinical according to GSI, probably should have been offered less expensive treatment within a primary care setting. In the present study, post hoc analysis revealed that 9 (60%) of these patients improved in symptoms and problems, and that 6 (40%) dropped out of the project. Overall, however, we do not believe that missing data from project-refusing and excluded patients are a serious threat to the generalizability of the present estimates of GSI improvement.
According to therapist retrospective evaluations of outcome, which correspond with patient-reported improvement in GSI, project drop-outs are less likely to have improved. Thus, a GSI reliable improvement rate of 52.9% based on project responders is more likely to be about 44% in the total sample included in 39 sessions of psychodynamic group therapy. This result has implications for the generalizability of outcome as evaluated on the basis of completer analyses. Moreover, missing GSI outcome data in the present study are most likely to be “missing not at random”, which suggests that standard statistical multiple imputation of missing values is most likely to be biased.
This research was supported in part by grants from the Danish Research Council for the Humanities (No. 9600938), Director Jacob Madsen and Wife Olga Madsens Foundation, and The Grant of 22nd June 1959. Thanks are due to Vibeke Munk, M.A., for critical comments and help with the manuscript.
Conflict of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.