Reproducibility:
Measuring Success of Replication
Werner Stahel
Seminar für Statistik, ETH Zürich
Basel, Nov 13, 2017
1. Thoughts on the Role of Reproducibility
1.1 Paradigms
Reproducibility defines knowledge. “The Scientific Method”:
Scientific facts are those that have been reproduced independently.
... as opposed to belief, which is “irrational” for some of us.
Science should be the collection of knowledge that is “true”.
Well, not quite: the Big Bang is not reproducible, yet it is a theory and is nevertheless called scientific knowledge.
In fact, empirical science needs theories as its foundation.
Reproducibility of facts defines science
– physics, chemistry, biology, life science = “Exact” Sciences.
What is the role of reproducibility in the Humanities and Arts?
• Humanities try to become “exact sciences” by adopting “the scientific method”.
• Arts: A composition is a reproducible piece of music. Reproducibility is achieved by fixing notes.
  Intonation is only “reproducible” with recordings.
• Improvisation in music; a mandala as “sculpture”: the intention is to make something unique, irreproducible.
Back to “exact” sciences!
1.2 The Crisis
Reproducibility is a myth in most fields of science!
• Ioannidis, 2005, PLOS Med. 2: Why most published research findings are false.
  → many papers, newspaper articles, round tables, editorials of journals, ...
  Topic of Collegium Helveticum, ETH Zurich → Handbook
• Tages-Anzeiger of Aug 28, 2015: “Psychologie-Studien sind wenig glaubwürdig”
  (Psychological studies have little credibility)
  ⇐ “Open Science Collaboration”, Science 349, 943–952, 2015: Estimating the reproducibility of psychological science
[Figure: P-values of the replications plotted against the original-study P-values]
[Figure: original vs. replication effect sizes]
→ Effect sizes are lower, as a rule, in the replication.
Results:
36% of replications had significant results;
39% of effects were subjectively rated to have replicated the original result;
47% of original effect sizes were in the 95% confidence interval of the replication effect size.
Similar results for pharmaceutical trials, genetic effects, ...
Cancer Biology: 22 replications planned, 7 completed; about 55% of the 23 effects studied were significant again.
Note:
What is a success/failure of a reproduction?
→ not well defined! ... not even in the case of assessing just a single effect!
1.3 Outline
1. Introduction: over!
2. A further replication study
3. Success of replication
4. P-values
5. Between studies variance
6. Consequences for replication
7. Selection biases
→ no time ...
8. Conclusions: Is reproducibility a useful concept?
How should scientific findings be “confirmed”?
2. A further replication study
“Many Labs” Replication Study in Psychology:
17 “hypotheses” = scientific questions,
36 “samples” (mostly university bachelors).
Example: one scientific question, “Anchoring”:
"The distance from San Francisco to New York City is
[low] longer than 1,500 miles
[high] shorter than 6,000 miles.
How far do you think it is?"
True distance is 2,906 miles.
Do the answers differ between the two “anchoring” groups?
Results for PSU (Penn State) – taken here as “original study”
[Figure: boxplots of distance judgments (miles) for the low and high anchor groups]
Test for difference:
##
## Welch Two Sample t-test
##
## data: anchoring1.y by anchoring1.g
## t = 7, df = 60, p-value = 4e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1164 2014
## sample estimates:
## mean in group low mean in group high
## 4182 2592
Confidence interval for difference: [1164, 2014]
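The Welch statistics can be recomputed from group summaries. A minimal Python sketch (the group means are from the output above; the standard deviations and group sizes are assumed for illustration, and the 1.96 normal quantile stands in for the exact t quantile):

```python
import math

# hypothetical group summaries: means from the slide, sd and n assumed
m_low, s_low, n_low = 4182.0, 860.0, 33
m_high, s_high, n_high = 2592.0, 860.0, 33

diff = m_low - m_high
se = math.sqrt(s_low**2 / n_low + s_high**2 / n_high)  # se of the difference
t = diff / se
# Welch-Satterthwaite degrees of freedom
df = (s_low**2 / n_low + s_high**2 / n_high)**2 / (
    (s_low**2 / n_low)**2 / (n_low - 1) + (s_high**2 / n_high)**2 / (n_high - 1))
ci = (diff - 1.96 * se, diff + 1.96 * se)  # normal approximation to the t CI
```

With these assumed inputs the sketch gives t ≈ 7.5 and a confidence interval close to the one printed above; the exact R output uses the t quantile with 60 degrees of freedom.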
Results for UVA (Virginia) – taken here as “replication study”
[Figure: boxplots of distance judgments (miles) for psu (original) and uva (replication)]
Estimates and confidence intervals for difference:
## original replication
## means 4182, 2592 3979, 3057
## est.diff 1589 922
## stand.err 212 255
## conf.int [ 1164, 2014 ] [ 413, 1432 ]
Is the replication successful?
3. Success of replication
3.1 Answer “significance”:
Replication is successful
if both results are significant.
Probability of getting successful replications depends on
• original P-value,
• sample size N_r of the replication.
For N_r → ∞, almost always successful, because of the
P-value Paradox: “There is always at least a tiny effect.
The test only shows if you have a sufficiently large sample to make it statistically significant.”
"First, we are typically not terribly concerned with Type 1 error because we rarely believe that it is possible for the null hypothesis to be strictly true." [Gelman et al. 2009]
For N_r → ∞, almost always successful.
→ Do not use too high power in the replication! ???
The problem is with the P-value! See later.
3.2 Answer OSC (Open Science Collaboration):
Replication is successful
if the confidence interval of the replication contains the estimate of the original.
OSC study: 47%
??? What kind of statistics is this???
Replication will “always” fail if its sample size is big enough!
3.3 Answer “overlap”:
Replication is successful
if confidence intervals overlap.
Confidence intervals are for parameters!
Are the parameters in the overlap those that are still possible?
Hope for a small overlap!
Remember your Stat-1 course!
3.4 Answer “compatibility”:
Replication is successful
if the data could come from the same “population”.
→ Classical two-sample problem
→ P-value? NO!!! Need a confidence interval for the difference!
General case: two samples → two estimates θ̂_k with standard errors se_k, k = o, r.
The standard error of Δ̂ = θ̂_r − θ̂_o is
se_d = √(se_o² + se_r²).
Confidence interval: θ̂_r − θ̂_o ± 1.96 · se_d.
“Anchoring” Example.
                          original      replication
means                     2592 ; 4182   3057 ; 3979
est. difference           1589          922
standard error            210           253
difference of differences:    −667 ± 643
conf.int(diff. of diff.):     [−1317, −18]
(Sorry!) Why?
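The last two rows follow directly from the two studies' summary statistics. A Python sketch, using the standard errors 212 and 255 printed two slides earlier (the table above rounds them to 210 and 253):

```python
import math

# summary statistics of the two studies (slide values)
est_o, se_o = 1589.0, 212.0   # original (PSU)
est_r, se_r = 922.0, 255.0    # replication (UVA)

delta = est_r - est_o                        # difference of the two effect estimates
se_d = math.sqrt(se_o**2 + se_r**2)          # standard error of the difference
ci = (delta - 1.96 * se_d, delta + 1.96 * se_d)
```

This reproduces the interval [−1317, −18] up to rounding: the interval does not contain 0, so the two estimates are not “compatible” at the 95% level.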
Why?
• Original or replication study not properly done or analyzed;
• improved experimental methods have reduced systematic error;
• the statistical model needs improvement!
• ... (see after the following excursion)
4. Ban p-values
P-value Paradox: “There is always at least a tiny effect.
The test only shows if you have a sufficiently large sample to make it statistically significant.”
Science should ask questions. Is the effect relevant?
Need threshold of relevance
Asking too much?
Confidence interval gives answer for any threshold.
→ The presentation of results does not depend on the threshold.
Replace p-values by confidence intervals!
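A sketch of the paradox with assumed numbers: a tiny effect of 0.02 measured on a million observations is wildly “significant”, while the confidence interval makes its irrelevance obvious against any sensible relevance threshold (0.1 here, also an assumption):

```python
import math

def z_test_ci(est, se):
    """Two-sided normal-theory p-value and 95% confidence interval."""
    z = est / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return p, (est - 1.96 * se, est + 1.96 * se)

# tiny true effect, huge sample -> standard error shrinks like 1/sqrt(n)
effect, sd, n = 0.02, 1.0, 1_000_000
se = sd / math.sqrt(n)                 # = 0.001
p, ci = z_test_ci(effect, se)

relevance = 0.1                        # hypothetical relevance threshold
# p is astronomically small, yet the CI [0.018, 0.022] shows the effect
# lies far below the relevance threshold
```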
Here, Example Anchoring:
Estimate the difference of distance judgements between the “high” and “low” anchoring groups.
Original: 1589 ± 210 → [1179, 2000]
[Figure: confidence intervals for the ‘high’−‘low’ differences (original, replication) and for their difference]
The confidence interval for the difference of differences does not cover 0.
Is the difference relevant? – Decision unclear.
General: confidence interval for the difference of effects between original and replication.
→ The decision “difference not relevant” is possible in principle, although the sample size will often not be sufficient if the relevance threshold is low.
In 1999, a committee of psychologists came close to a
ban of the p-value!
→ Use confidence intervals!
[Cartoon: P-value / confidence interval / test result / yes-no answer. © Markus Kalisch]
5. Between Studies Variance
Estimate of the effect from many studies.
Example: 36 studies.
[Figure: anchoring1 — effect estimates with confidence intervals for the 36 samples, grouped by Uni.US, Uni.notUS, notUni.US, ...]
Overall estimate of the effect:
θ̂ = (1/K) Σ_{k=1}^K θ̂_k
Precision?
• “Theoretical”: var_t(θ̂) = (1/K²) Σ_{k=1}^K se_k²
• “Empirical”: var_e(θ̂) = (1/(K(K−1))) Σ_{k=1}^K (θ̂_k − θ̂)²
var_e(θ̂) > var_t(θ̂) → overdispersion!
This is a generalization of 1-way ANOVA,
which uses the within-group variance → var_t
and the between-group variance → var_e.
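The two precision estimates can be sketched with hypothetical study results (the estimates and standard errors below are invented for illustration; the real computation uses the 36 anchoring samples):

```python
import math

# hypothetical effect estimates and standard errors from K = 5 studies
estimates = [1200.0, 1400.0, 1100.0, 1500.0, 1300.0]
ses = [100.0, 120.0, 90.0, 110.0, 105.0]
K = len(estimates)

theta_hat = sum(estimates) / K                       # overall estimate
var_t = sum(s**2 for s in ses) / K**2                # "theoretical" variance of the mean
var_e = sum((t - theta_hat)**2 for t in estimates) / (K * (K - 1))  # "empirical"
overd_factor = math.sqrt(var_e / var_t)              # overdispersion factor
```

A factor clearly above 1, as in this made-up example, signals between-study variance beyond what the within-study standard errors explain.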
→ Estimation of overdispersion
• Restricted to US Universities:
## meand seMeandTh seMeandEmp overdFac
## 1245.14 51.41 55.62 1.08
• All 36 “samples”: ...
Close replication vs. conceptual replication = generalization.
One may hope for no overdispersion (factor = 1) in close replications.
Should we? Consider the P-value Paradox!
→ Need a relevance threshold for the difference of effects between studies.
Expanding the confidence intervals:
Assuming a known overdispersion factor ω,
the confidence intervals can be expanded accordingly.
This results in
## [1] -1370.1 35.5
– which covers 0!

Some overdispersion factors:
            overdFac
anchoring1     1.030
anchoring2     1.438
anchoring3     1.544
anchoring4     1.314
gambler        1.454
moneyprim      0.843
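The expansion itself is a one-liner. A sketch with the compatibility-interval numbers from before and ω = 1.08 taken as an assumption (the US-universities factor):

```python
import math

# difference of effects and its standard error (from the compatibility slide)
delta, se_d = -667.0, 331.6
omega = 1.08          # assumed overdispersion factor

ci_plain = (delta - 1.96 * se_d, delta + 1.96 * se_d)
ci_expanded = (delta - 1.96 * omega * se_d, delta + 1.96 * omega * se_d)
covers_zero = ci_expanded[0] < 0.0 < ci_expanded[1]  # the expanded CI covers 0
```

The plain interval excludes 0, the expanded one covers it, close to the [-1370.1, 35.5] shown above.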
6. Consequences for replication
A single replication is rather useless.
Assume that
– we have a relevance threshold θ* > 0, and the confidence interval of the original study lies above it: θ̂_o − 2·se_o ≥ θ*;
– the precision of the replication is similar to the original's: se_r ≈ se_o.
If the true θ were θ = θ̂_o = θ* + 2·se_o, then
the probability that the replication confidence interval lies ≥ θ* would be about 0.5.
Unless θ̂_o − 2·se_o ≫ θ*,
the probability of getting an inconclusive result is quite high.
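The “about 0.5” can be checked directly: with θ̂_r ∼ N(θ, se²) and θ = θ* + 2·se, the replication interval clears θ* exactly when θ̂_r ≥ θ. A sketch with assumed numbers:

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

theta_star = 1.0     # hypothetical relevance threshold
se = 0.25            # replication standard error, assumed equal to the original's
theta = theta_star + 2 * se   # true effect sitting exactly at theta_hat_o

# P( replication CI lies entirely above theta_star )
#   = P( theta_hat_r - 2*se >= theta_star ),  theta_hat_r ~ N(theta, se^2)
p_conclusive = 1 - norm_cdf((theta_star + 2 * se - theta) / se)
```

The probability is exactly 0.5, whatever values are chosen for θ* and se.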
In case of
• an inconclusive result, one would plan another replication study.
  → sequential inference, to be taken into account even if ...
• the desired result θ̂_r − 2·se_r ≥ θ*:
  what is an appropriate se for calculating a reliable confidence interval?
  It should take overdispersion into account!
  But we do not have an estimate for it!
→ Need several replications → θ̂_k.
Then, estimate θ as the mean (or median) of the θ̂_k, with empirical se_e.
Do not include the original study, because of potential selection bias.
→ “We need to move toward a better understanding of the relationship between reproducibility, cumulative evidence, and the truth of scientific claims.” [Clemens 2015]
7. Selection biases
7.1 Within study selection
Here is a common way of learning from empirical studies:
• visualize data,
• see patterns (unexpected, but with a sensible interpretation),
• test if statistically significant,
• if yes (usually), describe the “pattern” in a manuscript.
(cf. “industry producing statistically significant effects”)
The problem, formalized: multiple comparisons.
7 groups, generated by random numbers ∼ N(0, 1). → H_0 is true!
[Figure: simulated values y for groups 1–7, with standard deviations and confidence intervals]
Test each pair of groups for a difference in expected values.
→ 7 · 6 / 2 = 21 tests. P(rejection) = 0.05 for each test.
→ Expected number of significant test results = 1.05!
Significant differences for 1 vs. 6 and 1 vs. 7.
Publish the significant result!
You will certainly find an explanation why it makes sense ...
→ Selection bias.
Solution for multiple (“all pairs”) comparisons:
• Make a single test for the hypothesis that all μ_g are equal!
  → F-test for factors.
• Lower the level α for each of the 21 tests such that
  P(≥ 1 significant test result) ≤ α = 0.05!
Simplest and general (multiple testing): the Bonferroni correction:
divide α by the number of tests
→ conservative testing procedure
→ you will get no significant results
→ nothing published
(Are we back to testing? – The considerations also apply to confidence intervals!)
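The expected count of 1.05 and the effect of the Bonferroni correction can be checked by simulation. A Python sketch (Welch statistic with a normal approximation instead of the exact t reference; group size 30 is an assumption):

```python
import math
import random
from itertools import combinations

random.seed(1)

def welch_z(x, y):
    """Welch two-sample statistic for two lists of observations."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx)**2 for v in x) / (nx - 1)
    vy = sum((v - my)**2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def p_two_sided(z):
    """Two-sided p-value, normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

n_sim, total_sig, sims_with_bonf_sig = 400, 0, 0
for _ in range(n_sim):
    # 7 groups of N(0,1) data: H_0 is true for all 21 pairwise comparisons
    groups = [[random.gauss(0, 1) for _ in range(30)] for _ in range(7)]
    pvals = [p_two_sided(welch_z(a, b)) for a, b in combinations(groups, 2)]
    total_sig += sum(p < 0.05 for p in pvals)               # no correction
    sims_with_bonf_sig += any(p < 0.05 / 21 for p in pvals)  # Bonferroni

avg_sig = total_sig / n_sim              # close to 21 * 0.05 = 1.05
fam_error = sims_with_bonf_sig / n_sim   # family-wise error, roughly <= 0.05
```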
In reality, it is even worse!
When exploring data, nobody knows how many hypotheses are
“informally tested” by visual inspection of informative graphs.
Exploratory data analysis – curse or blessing?
Solution?
One dataset – one test!
(or: only a small number of planned tests/confidence intervals, with Bonferroni adjustments (?))
Dream for statistician, nightmare for researchers!
7.2 More biases
• Publication bias: a manuscript without significant effects will not be written, or will be rejected.
• “Researcher degrees of freedom” = selecting transformations, estimation methods, model selection, ...
  → Discussion.
Remedy: Pre-registration: publish the project (with review!), accepted if the scientific question is worthwhile,
then publish the results in any case!
7.3 Stepping procedure of advancing science:
1. Explore data freely, allowing all creativity.
   Create “hypotheses” about relevant effects.
2. Conduct a new study to examine the hypotheses (not H_0!).
   “Believe” effects that are successfully confirmed with a magnitude sufficient to be relevant.
3. Extend the scope in a planned manner to generalize the insight.
1.* Use the dataset in an exploratory attitude to generate new hypotheses.
Iterate until retirement.
Note that step 2 is a phony replication!
Huang & Gottardo, Briefings in Bioinformatics 14 (2012), 391-401
8. Conclusions
Where and when is reproducibility a useful concept?
8.1 “Exact” sciences ...
(well: “quantitative, empirical part of sciences”)
... Physics, Chemistry, Biology, Medicine, Life sciences.
• Reproducibility is an important principle to keep in mind.
  Feasible? Sometimes. Needs motivation, skill & luck. Recognition?
• Data Challenge, Confirmation
Science is not only about collecting facts
that stand the criterion of reproducibility, but about
generating theories (in a wide sense) that connect the facts.
Types of confirmation:
+ Repetition = “close replication”: same setting
  → should produce response values within “data compatibility”.
+ Generalization = “conceptual replication”, extrapolation:
  vary the situation to check the “robustness” of the result.
  Data Challenge
+ Extension: generalize the problem.
The distinction is more useful in more complicated situations (regression).
Pre-registration!
Repetition studies should be pre-registered.
The journal that published the original result should be prepared to accept and list pre-registrations of its repetition (if the design is ok) and to publish the result regardless of the outcome.
→ Journal section “Replication”
Recommendation:
Perform a combined study for replication and generalization and/or extension.
Each PhD thesis could consist of a replication phase, followed by a generalization study.
8.2 Social sciences, ...
– Macro-Economics: Economy only exists once, no replication.
– Society, History: same
– Psychology: Circumstances (therapist, institution, culture) are difficult to replicate.
These sciences should not be reduced to their quantitative parts!
What about philosophy and religion?
Good for discussions over lunch.
Messages
• Avoid significance tests and P-values. Use confidence intervals! → Teaching!
• Precise replication, in the sense of compatibility of quantitative results (non-significant difference), is rare (outside students' lab classes in physics).
• It becomes somewhat more realistic if models contain a study-to-study variance component.
• Dilemma of advancing science: exploratory and confirmatory steps. → Data Challenge
  Instead of mere replication studies, perform confirmation / generalization studies!
• Replication is only applicable to empirical science.
  There are other modes of thinking that should be recognized as “science” in the broad sense (“Wissenschaft”).
  What is confirmation in these fields?
• → In what sense / to what degree should replication be a requirement for serious research?
Thank you for your endurance
References
Atmanspacher, H. and Maasen, S. (eds) (2016). Reproducibility: Principles, Problems, Practices, and Prospects, Wiley.
Clemens, M. A. (2015). The meaning of failed replications: A review and proposal., J. Econ. Surv. 10?: 10.1111/joes.12139.
Gelman, A., Hill, J. and Yajima, M. (2009). Why we (usually) don’t have to worry about multiple comparisons, ??
Ioannidis, J. (2005). Why most published research findings are false, PLoS Medicine 2: 696–701.
Klein, R. et al. (2014). Investigating variation in replicability, Soc Psychol 45(3): 142–152. (“Many Labs”: replication of 17 effects)
Open Science Collaboration (2015). Estimating the reproducibility of psychological science, Science 349: 943–952.