
2. Own Contributions

2.1. Current obstacles in replicating risk assessment findings: A systematic review of

2.1.5. Discussion

According to recent surveys, the VRAG, SORAG, and Static-99 are the most commonly used ARAIs in clinical practice (R. P. Archer et al., 2006; Viljoen et al., 2010). ARAIs can be considered valid if independent studies can replicate the findings of the development study. In order for a study to qualify as a replication, it has to follow the study protocol of the original study with respect to key study characteristics as well as the guidelines for test administration.


Instrument authors claim that there are several replication studies corroborating the validity of ARAIs (Phenix, Hanson, Harris, & Thornton; Waypoint Centre for Mental Health Care, 2012). It is unclear, however, whether these replication studies matched the development studies with respect to key characteristics of the study protocol as well as the administration of the ARAIs.

The goal of the present study was to examine the similarity between development and replication studies and to investigate assessment integrity. A systematic search of the predictive validity literature on the VRAG, SORAG, and Static-99 identified 84 peer-reviewed investigations comprising 108 samples. Correspondence between the development and replication studies was assessed on the basis of sample characteristics, instrument administration, follow-up, controls for attrition, and outcome definition.
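To make the matching procedure concrete, the following minimal sketch in Python scores a single replication sample as the proportion of criteria on which it corresponds to the development study. The criterion labels follow the dimensions listed above, but the dichotomous scoring rule and the handling of unreported information are illustrative assumptions, not the exact coding scheme of the review.

# Hypothetical sketch of a "replication match" score: the share of
# predefined criteria on which a replication sample corresponds to the
# development study. Criterion labels follow the dimensions named in the
# text; the scoring rule itself is an illustrative assumption.

CRITERIA = [
    "offender_sex_composition",    # sample characteristics
    "adult_offenders_only",
    "index_offense_type",
    "file_information_consulted",  # instrument administration
    "no_items_omitted",
    "no_items_altered",
    "trained_raters_or_irr",       # reliability of administration
    "fixed_follow_up_length",      # follow-up
    "attrition_controlled",
    "recidivism_type",             # outcome definition
]

def replication_match(sample):
    """Proportion of criteria (0 to 1) on which the sample matches the
    development study; criteria not reported as matching are counted as
    non-matching in this simplified sketch (the review handled some
    criteria, e.g. assessment integrity, differently)."""
    matched = sum(1 for criterion in CRITERIA if sample.get(criterion, False))
    return matched / len(CRITERIA)

# Example: a hypothetical sample matching 6 of the 10 criteria -> 0.6
example_sample = {criterion: True for criterion in CRITERIA[:6]}
print(replication_match(example_sample))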

The replication match varied considerably across these criteria. The highest replication match was found for assessment integrity, namely whether assessors did not systematically omit (86.1%) or alter items (81.5%). These scores have to be interpreted with caution, though, since assessment integrity was assumed to be given if not explicitly stated otherwise. In 76.9% of the study samples, the authors consulted file information, as required by the instrument manuals. This finding shows that assessment integrity was violated in roughly one out of four study samples, since the basis of information did not meet the minimum requirement for test administration.

The remaining matching criteria were considered in less than two-thirds of the studies. Offender sex composition matched the development sample in 65.7% of cases, type of recidivism in 63.9%, and type of index offense in 63.0%. Reliability of instrument administration was considered for 58.3% of the study samples, either by using trained raters or by documenting inter-rater reliability. The remaining five criteria were explicitly considered by less than 50% of the study samples: 44% of the samples consisted solely of adult offenders. Finally, controlling for attrition and a fixed length of the follow-up period were the least frequently reported criteria (31.5% each).

Of the 108 samples investigated, not a single study with a perfect replication match could be identified. Roughly half of the study samples matched the development study on two-thirds or fewer of the relevant criteria, although the replication match was better for Static-99 studies than for SORAG and VRAG validation studies.

It is unclear, however, whether the replication study authors merely neglected to report information regarding sample characteristics, instrument administration, follow-up, attrition, and outcome definition, or whether they did not, in fact, fully consider these characteristics in their empirical design. If the poor replication match is merely a reporting phenomenon, study authors should be reminded to adhere to scientific reporting standards when publishing their results.

If, however, the relatively poor replication match was the result of altered methodological designs, the studies extracted for this investigation cannot be considered true replication studies. Accordingly, their results should be interpreted with caution and cannot serve as direct corroboration of the predictive validity of ARAIs.

Since the majority of the extracted investigations found a significant association with an outcome, the included studies are commonly interpreted as proof of the robustness of the VRAG, SORAG, and Static-99 (e.g. Hastings, Krishnan, Tangney, & Stuewig, 2011; Kröner, Stadtland, Eidt, & Nedopil, 2007), although the purported replication investigations differed with respect to the follow-up period (Quinsey, Book, & Skilling, 2004; Rettenberger, Matthes, et al., 2009), the composition of the study sample (G. T. Harris, Rice, & Cormier, 2002; Hastings et al., 2011; Snowden, Gray, & Taylor, 2010), and the definition of the outcome criterion (Endrass, Rossegger, Frischknecht, Noll, & Urbaniok, 2008; G. T. Harris & Rice, 2003; G. T. Harris et al., 2004; Hastings et al., 2011; Kroner & Mills, 2001; Lindsay et al., 2008; Loza, Villeneuve, & Loza-Fanous, 2002; Storey, Watt, Jackson, & Hart, 2012). Deviations from the methodology used in the development study were interpreted as corroboration of model robustness (G. T. Harris & Rice, 2007; G. T. Harris et al., 2004). Even though ARAI authors have, on other occasions, stipulated specific requirements for replication studies, which also formed the basis of the current investigation, they still interpreted the results of studies with deviating methodological characteristics as corroboration for the ARAI (e.g. G. T. Harris & Rice, 2007; G. T. Harris et al., 2004).

Using a study that differs from the original investigation with respect to key study characteristics as corroboration for model robustness can lead to contradictory results. This is especially the case when, aside from measures of accuracy (such as the “gold standard” area under the curve; Mossman, 1994), measures of calibration are used (e.g. Endrass, Urbaniok, Held, Vetter, & Rossegger, 2009). If a replication study finds, for example, that 44% of the offenders in risk bin X reoffended within three years of follow-up (compared with the 7-year follow-up period in the original study), this is evidence of poor calibration, not of the robustness of the model. The validity of the instrument is challenged even more when the outcome is altered. If ARAIs correlate with all sorts of socially maladaptive behavior, the question arises: what do these instruments really measure?
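The distinction can be illustrated with a minimal numerical sketch in Python (all figures below are invented for illustration and are not taken from any of the reviewed studies): discrimination, as indexed by the AUC, depends only on the rank ordering of the risk bins, whereas calibration compares the observed recidivism rate in a bin with the rate expected from the development norms and therefore breaks down as soon as the follow-up periods differ.

# Hypothetical illustration: discrimination (AUC) can remain acceptable
# while calibration fails because the replication follow-up (3 years)
# is shorter than the development follow-up (7 years). All numbers are
# invented for illustration.

# Expected 7-year recidivism rates per risk bin, as a development study
# might publish them (illustrative values only).
EXPECTED_7YR = {"low": 0.15, "medium": 0.35, "high": 0.60}

# Hypothetical replication sample: (risk bin, reoffended within 3 years).
REPLICATION = [
    ("low", 0), ("low", 0), ("low", 0), ("low", 0),
    ("medium", 0), ("medium", 0), ("medium", 0), ("medium", 1),
    ("high", 0), ("high", 0), ("high", 1), ("high", 1),
]

BIN_RANK = {"low": 0, "medium": 1, "high": 2}

def auc(scores, outcomes):
    # Rank-based AUC: probability that a randomly drawn recidivist has a
    # higher risk score than a randomly drawn non-recidivist (ties = 0.5).
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [BIN_RANK[b] for b, _ in REPLICATION]
outcomes = [y for _, y in REPLICATION]
print(f"AUC (discrimination): {auc(scores, outcomes):.2f}")  # ~0.80

# Calibration: observed 3-year rates fall well below the expected
# 7-year rates, even though the rank ordering of the bins is preserved.
for b, expected in EXPECTED_7YR.items():
    obs = [y for bb, y in REPLICATION if bb == b]
    print(f"{b}: observed 3-yr rate {sum(obs) / len(obs):.2f} "
          f"vs. expected 7-yr rate {expected:.2f}")

In this sketch the AUC remains close to .80, yet the observed 3-year rate in every bin falls short of the published 7-year rate; such an apparent discrepancy says nothing about the model itself and everything about the mismatched follow-up periods.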

If ARAIs correlate with all sorts of behavior under different conditions, in different contexts, and across different follow-up periods and databases, such findings cannot automatically serve as corroboration of the (assumed) robustness of the instrument, as it remains unclear what the instrument really measures. If maladaptive behavior in general, and not a proneness to sexual or violent offending, were the latent trait assessed by an ARAI, it would be questionable whether such an ARAI should be used in court to assess the risk of persistent sexual or violent offending. Whereas some deviation from the original study could suggest model robustness, a larger deviation jeopardizes its validity, especially if the deviation concerns the dependent variable.