When is a Replication Successful?
... and more.
Werner Stahel
Seminar für Statistik, ETH Zürich
Zurich, 23 January 2020
Science
... aims at production of “real” knowledge.
real = reproducible → plausible theory
Reproducibility = transparency / replicability, successful!
We all know:
• Ioannidis, 2005, PLOS Med. 2: Why most published research findings are false.
→ many papers, newspaper articles, round tables, editorials of journals, ...
“Open Science Collaboration”, 2015: “Estimating the reproducibility of psychological science”
Only 36% “success”!
Success? = Significant again! NHST (Null Hypothesis Significance Testing)
“Only 36%”? What should we expect?
Why? Selective reporting bias
• Garden of forking paths
• Reporting, within study
• Publication: ± only significant results!
→ Too often significant, biased estimated effect.
Science Mining
Gold Mining: Find nuggets!
Dig through lots of rock! Often very inefficient. Tedious!
Popular mining tool: The NHST filter
finds a lot of garbage → “P-value problem”
Overview
1. Science versus NHST
2. Success of replication
3. We need strategies!
1. Science versus NHST
The NHST paradox
If a study is undertaken to estimate an effect, it is not plausible that it be exactly 0.
→ A (tiny) effect exists
→ for n → ∞, the power → 1.
NHST says whether we have chosen a large enough sample to find a formally significant effect.
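The paradox can be made concrete with a minimal sketch. It assumes a two-sided one-sample z-test with unit standard deviation, which is an illustrative choice and not taken from the slides; the function name `z_test_power` and the effect value 0.01 are hypothetical.

```python
from statistics import NormalDist


def z_test_power(effect: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample z-test with unit standard deviation.

    effect: true standardized effect (hypothetical input)
    n:      sample size
    alpha:  significance level
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5  # mean of the test statistic under the true effect
    # Probability that the statistic falls beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# A tiny effect becomes "significant" almost surely once n is large enough.
for n in (100, 10_000, 1_000_000):
    print(n, round(z_test_power(0.01, n), 3))
```

With an effect of exactly 0 the function returns alpha for every n; for any nonzero effect, however small, the returned power approaches 1 as n grows.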
Scientific Question must be: Is the effect RELEVANT?
This needs a relevance threshold!
Presentation of results: confidence intervals.
Ban P-values in favor of confidence intervals! DO IT!
Proposal by 71 authors: Lower significance level!
Paradox still applies (larger n needed)! Not helpful!
The ASA’s statement on p-values: context, process, and purpose, 2016, The American Statistician
Retire statistical significance!, 2019, Nature, > 800 signatories
What should scientists do routinely?
The problem is the lack of a scientific question as long as no relevance threshold is specified!
Classification of results (... of a single study that estimates an effect)
[Figure: effect axis with 0 and the relevance threshold marked]
Rlv: Relevant
Sig: Significant
Amb: Ambiguous
Ngl: Negligible
Ctr: Contradicting
→ new filter: Relevance instead of significance = Confidence interval above relevance threshold
Selective reporting is not eliminated, but there will be less garbage.
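A minimal sketch of such a filter follows. Only the rule for Rlv (confidence interval entirely above the relevance threshold) is stated on the slide; the rules used here for Sig, Amb, Ngl and Ctr are an assumed reading of the labels, and the function name `classify_interval` is hypothetical.

```python
def classify_interval(lower: float, upper: float, threshold: float) -> str:
    """Classify a confidence interval [lower, upper] for an effect.

    threshold is the (positive) relevance threshold. Every rule except
    the one for "Rlv" is an assumption, not taken from the slides.
    """
    if not (lower <= upper and threshold > 0):
        raise ValueError("need lower <= upper and threshold > 0")
    if upper < 0:
        return "Ctr"  # whole interval below 0
    if upper < threshold:
        return "Ngl"  # whole interval below the relevance threshold
    if lower > threshold:
        return "Rlv"  # whole interval above the relevance threshold
    if lower > 0:
        return "Sig"  # excludes 0 but does not clear the threshold
    return "Amb"      # compatible with both 0 and a relevant effect


# Hypothetical intervals with a relevance threshold of 0.2.
for interval in [(0.3, 0.6), (0.05, 0.4), (-0.1, 0.5), (-0.1, 0.1), (-0.5, -0.1)]:
    print(interval, classify_interval(*interval, threshold=0.2))
```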
Choice of relevance threshold?
– absolute
– relative to effect size, e.g. 10%
– relative to random variation of data (Cohen’s d), e.g. 20%
2. Success of replication
TWO aspects:
• “sigag”: significant again → “relag”: relevant again!
• “Compatible” = no significant difference of effect estimates
  significant? Not a scientific problem! → relevant!
Ask a scientific question!
“True” Effect Size Difference = parameter $ED = \psi^{(r)} - \psi^{(o)}$ (r: replication, o: original)
→ estimate! ... with confidence interval
→ classify!
Classification of replication results
based on
IEffr = confidence interval for the effect from the replication and
EDS = confidence interval for Effect Difference (Standardized)
(Cnf) Confirmation: IEffr: Rlv, EDS small (Ngl or Amb)
(CnfW) Weak Confirmation: IEffr: Sig
(Att) Attenuation: EDS: Rlv = relevant effect difference, even if IEffr: Rlv
(Enh) Enhancement: if the replication suggests a clearly stronger effect, IEffr: Rlv, EDS significantly positive (Ctr)
(Amb) Ambiguous: if IEffr: Amb
(Anh) Annihilation: IEffr: Ngl
(Ctr) Contradiction: IEffr: Ctr
(Drp) Dropout: replication failed
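The interval for the effect difference that feeds this classification can be sketched as follows. The sketch assumes that the original and the replication give independent, approximately normal estimates with known standard errors; this assumption, the function name `effect_difference_ci` and the numbers in the usage line are illustrative and not from the slides.

```python
from math import sqrt
from statistics import NormalDist


def effect_difference_ci(est_r, se_r, est_o, se_o, level=0.95):
    """Return (estimate, lower, upper) for ED = psi_r - psi_o.

    est_r, se_r: effect estimate and standard error from the replication
    est_o, se_o: effect estimate and standard error from the original study
    Assumes independent, approximately normal estimates.
    """
    diff = est_r - est_o
    se_diff = sqrt(se_r ** 2 + se_o ** 2)  # variances add under independence
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se_diff, diff + z * se_diff


# Hypothetical standardized effects: replication 0.3, original 0.5.
print(effect_difference_ci(0.3, 0.1, 0.5, 0.1))
```

If the inputs are standardized effects, the resulting interval is on the standardized scale and can be compared with a relevance threshold in the same way as the interval for a single effect.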
OSC15: Paired sample t-tests from Open Science Collaboration, 2015
[Figure: standardized effect size per study; axis ticks at −1, 0, 0.2, 1, 2]
Study:  St.15  St.33  St.6  St.113  St.7  St.153
        95     40     24    125     100   8
        242    40     32    177     15    8
Class:  CnfW   CnfW   CnfW  Enh     Amb   Amb
Yet another complication!
Heterogeneity = between studies variance component
Experience from chemistry: interlaboratory studies.
Expected effect size differs between studies.
→ “True Effect Difference” is never 0! EDS ≠ 0
→ Random effects version of meta-analysis!
Need several studies to estimate between studies variance!
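A minimal sketch of a random-effects summary is given here. It uses the DerSimonian-Laird moment estimator for the between-studies variance, which is one common choice and is not named on the slides; the function name `random_effects_summary` and the input numbers are hypothetical. The normal-based interval is crude when only a few studies are available.

```python
from math import sqrt
from statistics import NormalDist


def random_effects_summary(effects, variances, level=0.95):
    """DerSimonian-Laird random-effects summary (assumed estimator).

    effects:   per-study effect estimates
    variances: their squared standard errors
    Returns (pooled estimate, lower, upper, tau2), where tau2 is the
    estimated between-studies variance.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    w = [1.0 / v for v in variances]  # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    sws = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sws
    se = sqrt(1.0 / sws)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return pooled, pooled - z * se, pooled + z * se, tau2


# Five hypothetical studies with equal precision but differing effects.
print(random_effects_summary([0.0, 0.5, 1.0, -0.5, 0.8], [0.01] * 5))
```

When the studies disagree more than their standard errors allow, tau2 becomes positive and the interval widens accordingly, which a single replication cannot reveal.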
A valid confidence interval for the effect size can only be obtained from several studies, ≥ 5, say.
A single replication with good power is not enough!
→ new power calculations for multiple replications needed!
3. We need strategies for “knowledge production”!
• Claim of basic interest → multiple pre-registered replications
  – Close or conceptual replications?
  – Choice of threshold?
  Subject for professional society.
• Exploratory study → One close pre-registered replication
  – “confirmation” → more replications/generalization
  – “attenuation” or “ambiguous” → more close replication
  – “contradiction” → ...
• “Original” studies without replication = exploratory
  → indications for theoretical hypotheses to be examined by preregistered studies.
  Still important!
Messages
• NO scientific QUESTION → NO rational ANSWER
  Scientific questions regarding an effect need a relevance threshold.
  NHST is not scientific
  → confidence intervals for parameters, not test statistics or P-values!
• Success of a replication should be described by > 2 categories
  based on outcome of replication and Effect Size Difference.
• One replication is not enough, due to between study variation.
  A strategy is needed for advancing scientific knowledge!
Sustainable empirical research needs a change of culture!