Reproducibility:
Measuring Success of Replication
Werner Stahel
Seminar für Statistik, ETH Zürich
Basel, Nov 13, 2017
1. Thoughts on the Role of Reproducibility
1.1 Paradigms
Reproducibility defines knowledge. “The Scientific Method”:
Scientific facts are those that have been reproduced independently.
... as opposed to belief, which is “irrational” for some of us.
Science should be the collection of knowledge that is “true”.
Well, not quite: the Big Bang is not reproducible, yet it is a theory and is nevertheless called scientific knowledge.
In fact, empirical science needs theories as its foundation.
Reproducibility of facts defines science
– physics, chemistry, biology, life science = “Exact” Sciences.
What is the role of reproducibility in the Humanities and Arts?
• Humanities try to become “exact sciences” by adopting “the scientific method”.
• Arts: A composition is a reproducible piece of music. Reproducibility is achieved by fixing notes.
  Intonation is only “reproducible” with recordings.
• Improvisation in music; a mandala as “sculpture”: the intention is to make something unique, irreproducible.
Back to “exact” sciences!
1.2 The Crisis
Reproducibility is a myth in most fields of science!
• Ioannidis, 2005, PLOS Med. 2: Why most published research findings are false.
  → many papers, newspaper articles, round tables, editorials of journals, ...
  Topic of Collegium Helveticum, ETH Zurich → Handbook
• Tages-Anzeiger of Aug 28, 2015: “Psychologie-Studien sind wenig glaubwürdig”
  (Psychological studies have little credibility)
  ⇐ “Open Science Collaboration”, Science 349, 943–952, 2015: Estimating the reproducibility of psychological science
[Figure: P-values of the replications plotted against the original-study P-values]
[Figure: original vs. replication effect sizes]
→ Effect sizes are lower, as a rule, in the replication.
Results:
36% of replications had significant results;
39% of effects were subjectively rated to have replicated the original result;
47% of original effect sizes were in the 95% confidence interval of the replication effect size.
Similar results for pharmaceutical trials, genetic effects, ...
Cancer Biology: 22 replications planned, 7 completed; about 55% of the 23 effects studied were significant again.
Note:
What is a success/failure of a reproduction?
→ not well defined! ... not even in the case of assessing just a single effect!
1.3 Outline
1. Introduction: over!
2. A further replication study
3. Success of replication
4. P-values
5. Between studies variance
6. Consequences for replication
7. Selection biases
→ no time ...
8. Conclusions: Is reproducibility a useful concept?
How should scientific findings be “confirmed”?
2. A further replication study
“Many Labs” Replication Study in Psychology:
17 “hypotheses” = scientific questions,
36 “samples” (mostly university bachelors).
Example: one scientific question, “Anchoring”:
"The distance from San Francisco to New York City is
[low] longer than 1,500 miles
[high] shorter than 6,000 miles.
How far do you think it is?"
True distance is 2,906 miles.
Do the answers differ between the two “anchoring” groups?
Results for PSU (Penn State) – taken here as “original study”
[Figure: boxplots of distance judgments (miles) for the low and high anchor groups]
Test for difference:
##
## Welch Two Sample t-test
##
## data: anchoring1.y by anchoring1.g
## t = 7, df = 60, p-value = 4e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1164 2014
## sample estimates:
## mean in group low mean in group high
## 4182 2592
Confidence interval for difference: [1164, 2014]
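The Welch statistics can be recomputed from group summaries. A minimal Python sketch (the group means are from the output above; the standard deviations and group sizes are assumed for illustration, and the 1.96 normal quantile stands in for the exact t quantile):

```python
import math

# hypothetical group summaries: means from the slide, sd and n assumed
m_low, s_low, n_low = 4182.0, 860.0, 33
m_high, s_high, n_high = 2592.0, 860.0, 33

diff = m_low - m_high
se = math.sqrt(s_low**2 / n_low + s_high**2 / n_high)  # se of the difference
t = diff / se
# Welch-Satterthwaite degrees of freedom
df = (s_low**2 / n_low + s_high**2 / n_high)**2 / (
    (s_low**2 / n_low)**2 / (n_low - 1) + (s_high**2 / n_high)**2 / (n_high - 1))
ci = (diff - 1.96 * se, diff + 1.96 * se)  # normal approximation to the t CI
```

With these assumed inputs the sketch gives t ≈ 7.5 and a confidence interval close to the one printed above; the exact R output uses the t quantile with 60 degrees of freedom.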
Results for UVA (Virginia) – taken here as “replication study”
[Figure: boxplots of distance judgments (miles) for psu (original) and uva (replication)]
Estimates and confidence intervals for difference:
## original replication
## means 4182, 2592 3979, 3057
## est.diff 1589 922
## stand.err 212 255
## conf.int [ 1164, 2014 ] [ 413, 1432 ]
Is the replication successful?
3. Success of replication
3.1 Answer “significance”:
Replication is successful
if both results are significant.
Probability of getting successful replications depends on
• original P-value,
• sample size N_r of the replication.
For N_r → ∞, almost always successful, because of the
P-value Paradox: “There is always at least a tiny effect.
The test only shows if you have a sufficiently large sample to make it statistically significant.”
"First, we are typically not terribly concerned with Type 1 error because we rarely believe that it is possible for the null hypothesis to be strictly true." [Gelman et al. 2009]
For N_r → ∞, almost always successful.
→ Do not use too high power in the replication! ???
The problem is with the P-value! See later.
3.2 Answer OSC (Open Science Collaboration):
Replication is successful
if the confidence interval of the replication contains the estimate of the original.
OSC study: 47%
??? What kind of statistics is this???
Replication will “always” fail if its sample size is big enough!
3.3 Answer “overlap”:
Replication is successful
if confidence intervals overlap.
Confidence intervals are for parameters!
Are the parameters in the overlap those that are still possible?
Hope for a small overlap!
Remember your Stat-1 course!
3.4 Answer “compatibility”:
Replication is successful
if the data could come from the same “population”.
→ Classical two-sample problem
→ P-value? NO!!! Need a confidence interval for the difference!
General case: two samples → two estimates θ̂_k with standard errors se_k, k = o, r.
The standard error of Δ̂ = θ̂_r − θ̂_o is
se_d = √(se_o² + se_r²).
Confidence interval: θ̂_r − θ̂_o ± 1.96 · se_d.
“Anchoring” Example.
                          original      replication
means                     2592 ; 4182   3057 ; 3979
est. difference           1589          922
standard error            210           253
difference of differences:    −667 ± 643
conf.int(diff. of diff.):     [−1317, −18]
(Sorry!) Why?
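The last two rows follow directly from the two studies' summary statistics. A Python sketch, using the standard errors 212 and 255 printed two slides earlier (the table above rounds them to 210 and 253):

```python
import math

# summary statistics of the two studies (slide values)
est_o, se_o = 1589.0, 212.0   # original (PSU)
est_r, se_r = 922.0, 255.0    # replication (UVA)

delta = est_r - est_o                        # difference of the two effect estimates
se_d = math.sqrt(se_o**2 + se_r**2)          # standard error of the difference
ci = (delta - 1.96 * se_d, delta + 1.96 * se_d)
```

This reproduces the interval [−1317, −18] up to rounding: the interval does not contain 0, so the two estimates are not “compatible” at the 95% level.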
Why?
• Original or replication study not properly done or analyzed;
• improved experimental methods have reduced systematic error;
• the statistical model needs improvement!
• ... (see after the following excursion)
4. Ban p-values
P-value Paradox: “There is always at least a tiny effect.
The test only shows if you have a sufficiently large sample to make it statistically significant.”
Science should ask questions. Is the effect relevant?
Need threshold of relevance
Asking too much?
Confidence interval gives answer for any threshold.
→ The presentation of results does not depend on the threshold.
Replace p-values by confidence intervals!
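A sketch of the paradox with assumed numbers: a tiny effect of 0.02 measured on a million observations is wildly “significant”, while the confidence interval makes its irrelevance obvious against any sensible relevance threshold (0.1 here, also an assumption):

```python
import math

def z_test_ci(est, se):
    """Two-sided normal-theory p-value and 95% confidence interval."""
    z = est / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return p, (est - 1.96 * se, est + 1.96 * se)

# tiny true effect, huge sample -> standard error shrinks like 1/sqrt(n)
effect, sd, n = 0.02, 1.0, 1_000_000
se = sd / math.sqrt(n)                 # = 0.001
p, ci = z_test_ci(effect, se)

relevance = 0.1                        # hypothetical relevance threshold
# p is astronomically small, yet the CI [0.018, 0.022] shows the effect
# lies far below the relevance threshold
```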
Here, Example Anchoring:
Estimate the difference of distance judgements between the “high” and “low” anchoring groups.
Original: 1589 ± 210 → [1179, 2000]
[Figure: confidence intervals for the ‘high’−‘low’ differences (original, replication) and for their difference]
The confidence interval for the difference of differences does not cover 0.
Is the difference relevant? – Decision unclear.
General: confidence interval for the difference of effects between original and replication.
→ The decision “difference not relevant” is possible in principle, although the sample size will often not be sufficient if the relevance threshold is low.
In 1999, a committee of psychologists came close to a
ban of the p-value!
→ Use confidence intervals!
[Cartoon: P-value / confidence interval / test result / yes-no answer. © Markus Kalisch]
5. Between Studies Variance
Estimate of the effect from many studies.
Example: 36 studies.
[Figure: anchoring1 — effect estimates with confidence intervals for the 36 samples, grouped by Uni.US, Uni.notUS, notUni.US, ...]
Overall estimate of the effect:
θ̂ = (1/K) Σ_{k=1}^K θ̂_k
Precision?
• “Theoretical”: var_t(θ̂) = (1/K²) Σ_{k=1}^K se_k²
• “Empirical”: var_e(θ̂) = (1/(K(K−1))) Σ_{k=1}^K (θ̂_k − θ̂)²
var_e(θ̂) > var_t(θ̂) → overdispersion!
This is a generalization of 1-way ANOVA,
which uses the within-group variance → var_t
and the between-group variance → var_e.
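The two precision estimates can be sketched with hypothetical study results (the estimates and standard errors below are invented for illustration; the real computation uses the 36 anchoring samples):

```python
import math

# hypothetical effect estimates and standard errors from K = 5 studies
estimates = [1200.0, 1400.0, 1100.0, 1500.0, 1300.0]
ses = [100.0, 120.0, 90.0, 110.0, 105.0]
K = len(estimates)

theta_hat = sum(estimates) / K                       # overall estimate
var_t = sum(s**2 for s in ses) / K**2                # "theoretical" variance of the mean
var_e = sum((t - theta_hat)**2 for t in estimates) / (K * (K - 1))  # "empirical"
overd_factor = math.sqrt(var_e / var_t)              # overdispersion factor
```

A factor clearly above 1, as in this made-up example, signals between-study variance beyond what the within-study standard errors explain.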
→ Estimation of overdispersion
• Restricted to US Universities:
## meand seMeandTh seMeandEmp overdFac
## 1245.14 51.41 55.62 1.08
• All 36 “samples”: ...
Close replication vs. conceptual replication = generalization.
One may hope for no overdispersion (factor = 1) in close replications.
Should we? Consider the P-value Paradox!
→ Need a relevance threshold for the difference of effects between studies.
Expanding the confidence intervals:
Assuming a known overdispersion factor ω,
the confidence intervals can be expanded accordingly.
This results in
## [1] -1370.1 35.5
– which covers 0!

Some overdispersion factors:
            overdFac
anchoring1     1.030
anchoring2     1.438
anchoring3     1.544
anchoring4     1.314
gambler        1.454
moneyprim      0.843
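The expansion itself is a one-liner. A sketch with the compatibility-interval numbers from before and ω = 1.08 taken as an assumption (the US-universities factor):

```python
import math

# difference of effects and its standard error (from the compatibility slide)
delta, se_d = -667.0, 331.6
omega = 1.08          # assumed overdispersion factor

ci_plain = (delta - 1.96 * se_d, delta + 1.96 * se_d)
ci_expanded = (delta - 1.96 * omega * se_d, delta + 1.96 * omega * se_d)
covers_zero = ci_expanded[0] < 0.0 < ci_expanded[1]  # the expanded CI covers 0
```

The plain interval excludes 0, the expanded one covers it, close to the [-1370.1, 35.5] shown above.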
6. Consequences for replication
A single replication is rather useless.
Assume that
– we have a relevance threshold θ* > 0, and the confidence interval of the original study lies above it: θ̂_o − 2·se_o ≥ θ*;
– the precision of the replication is similar to the original's: se_r ≈ se_o.
If the true θ were θ = θ̂_o = θ* + 2·se_o, then
the probability that the replication confidence interval lies ≥ θ* would be about 0.5.
Unless θ̂_o − 2·se_o ≫ θ*,
the probability of getting an inconclusive result is quite high.
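The “about 0.5” can be checked directly: with θ̂_r ∼ N(θ, se²) and θ = θ* + 2·se, the replication interval clears θ* exactly when θ̂_r ≥ θ. A sketch with assumed numbers:

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

theta_star = 1.0     # hypothetical relevance threshold
se = 0.25            # replication standard error, assumed equal to the original's
theta = theta_star + 2 * se   # true effect sitting exactly at theta_hat_o

# P( replication CI lies entirely above theta_star )
#   = P( theta_hat_r - 2*se >= theta_star ),  theta_hat_r ~ N(theta, se^2)
p_conclusive = 1 - norm_cdf((theta_star + 2 * se - theta) / se)
```

The probability is exactly 0.5, whatever values are chosen for θ* and se.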
In case of
• an inconclusive result, one would plan another replication study.
  → sequential inference, to be taken into account even if ...
• the desired result θ̂_r − 2·se_r ≥ θ*:
  what is an appropriate se for calculating a reliable confidence interval?
  It should take overdispersion into account!
  But we do not have an estimate for it!
→ Need several replications → θ̂_k.
Then, estimate θ as the mean (or median) of the θ̂_k, with empirical se_e.
Do not include the original study, because of potential selection bias.
→ “We need to move toward a better understanding of the relationship between reproducibility, cumulative evidence, and the truth of scientific claims.” [Clemens 2015]
7. Selection biases
7.1 Within study selection
Here is a common way of learning from empirical studies:
• visualize data,
• see patterns (unexpected, but with a sensible interpretation),
• test if statistically significant,
• if yes (usually), describe the “pattern” in a manuscript.
(cf. “industry producing statistically significant effects”)
The problem, formalized: multiple comparisons.
7 groups, generated by random numbers ∼ N(0, 1). → H_0 is true!
[Figure: simulated values y for groups 1–7, with standard deviations and confidence intervals]
Test each pair of groups for a difference in expected values.
→ 7 · 6 / 2 = 21 tests. P(rejection) = 0.05 for each test.
→ Expected number of significant test results = 1.05!
Significant differences for 1 vs. 6 and 1 vs. 7.
Publish the significant result!
You will certainly find an explanation why it makes sense ...
→ Selection bias.
Solution for multiple (“all pairs”) comparisons:
• Make a single test for the hypothesis that all μ_g are equal!
  → F-test for factors.
• Lower the level α for each of the 21 tests such that
  P(≥ 1 significant test result) ≤ α = 0.05!
Simplest and general (multiple testing): the Bonferroni correction:
divide α by the number of tests
→ conservative testing procedure
→ you will get no significant results
→ nothing published
(Are we back to testing? – The considerations also apply to confidence intervals!)
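The expected count of 1.05 and the effect of the Bonferroni correction can be checked by simulation. A Python sketch (Welch statistic with a normal approximation instead of the exact t reference; group size 30 is an assumption):

```python
import math
import random
from itertools import combinations

random.seed(1)

def welch_z(x, y):
    """Welch two-sample statistic for two lists of observations."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx)**2 for v in x) / (nx - 1)
    vy = sum((v - my)**2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def p_two_sided(z):
    """Two-sided p-value, normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

n_sim, total_sig, sims_with_bonf_sig = 400, 0, 0
for _ in range(n_sim):
    # 7 groups of N(0,1) data: H_0 is true for all 21 pairwise comparisons
    groups = [[random.gauss(0, 1) for _ in range(30)] for _ in range(7)]
    pvals = [p_two_sided(welch_z(a, b)) for a, b in combinations(groups, 2)]
    total_sig += sum(p < 0.05 for p in pvals)               # no correction
    sims_with_bonf_sig += any(p < 0.05 / 21 for p in pvals)  # Bonferroni

avg_sig = total_sig / n_sim              # close to 21 * 0.05 = 1.05
fam_error = sims_with_bonf_sig / n_sim   # family-wise error, roughly <= 0.05
```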
In reality, it is even worse!
When exploring data, nobody knows how many hypotheses are
“informally tested” by visual inspection of informative graphs.
Exploratory data analysis – curse or blessing?
Solution?
One dataset – one test!
(or: only a small number of planned tests/confidence intervals, with Bonferroni adjustments (?))
Dream for statistician, nightmare for researchers!
7.2 More biases
• Publication bias: a manuscript without significant effects will not be written, or will be rejected.
• “Researcher degrees of freedom” = selecting transformations, estimation methods, model selection, ...
  → Discussion.
Remedy: Pre-registration: publish the project (with review!), accepted if the scientific question is worthwhile,
then publish the results in any case!
7.3 Stepping procedure of advancing science:
1. Explore data freely, allowing all creativity.
   Create “hypotheses” about relevant effects.
2. Conduct a new study to examine the hypotheses (not H_0!).
   “Believe” effects that are successfully confirmed with a magnitude sufficient to be relevant.
3. Extend the scope in a planned manner to generalize the insight.
1.* Use the dataset in an exploratory attitude to generate new hypotheses.
Iterate until retirement.
Note that step 2 is a phony replication!
Huang & Gottardo, Briefings in Bioinformatics 14 (2012), 391-401
8. Conclusions
Where and when is reproducibility a useful concept?
8.1 “Exact” sciences ...
(well: “quantitative, empirical part of sciences”)
... Physics, Chemistry, Biology, Medicine, Life sciences.
• Reproducibility is an important principle to keep in mind.
  Feasible? Sometimes. Needs motivation, skill & luck. Recognition?
• Data Challenge, Confirmation
Science is not only about collecting facts
that stand the criterion of reproducibility, but about
generating theories (in a wide sense) that connect the facts.
Types of confirmation:
+ Repetition = “close replication”: same setting
  → should produce response values within “data compatibility”.
+ Generalization = “conceptual replication”, extrapolation:
  vary the situation to check the “robustness” of the result.
  Data Challenge
+ Extension: generalize the problem.
The distinction is more useful in more complicated situations (regression).
Pre-registration!
Repetition studies should be pre-registered.
The journal that published the original result should be prepared to accept and list pre-registrations of its repetition (if the design is ok) and to publish the result regardless of the outcome.
→ Journal section “Replication”
Recommendation:
Perform a combined study for replication and generalization and/or extension.
Each PhD thesis could consist of a replication phase, followed by a generalization study.
8.2 Social sciences, ...
– Macro-Economics: Economy only exists once, no replication.
– Society, History: same
– Psychology: Circumstances (therapist, institution, culture) are difficult to replicate.
These sciences should not be reduced to their quantitative parts!
What about philosophy and religion?
Good for discussions over lunch.
Messages
• Avoid significance tests and P-values. Use confidence intervals! → Teaching!
• Precise replication, in the sense of compatibility of quantitative results (non-significant difference), is rare (outside students' lab classes in physics).
• It becomes somewhat more realistic if models contain a study-to-study variance component.
• Dilemma of advancing science: exploratory and confirmatory steps. → Data Challenge
  Instead of mere replication studies, perform confirmation / generalization studies!
• Replication is only applicable to empirical science.
  There are other modes of thinking that should be recognized as “science” in the broad sense (“Wissenschaft”).
  What is confirmation in these fields?
• → In what sense / to what degree should replication be a requirement for serious research?
Thank you for your endurance
References
Atmanspacher, H. and Maasen, S. (eds) (2016). Reproducibility: Principles, Problems, Practices, and Prospects, Wiley.
Clemens, M. A. (2015). The meaning of failed replications: A review and proposal., J. Econ. Surv. 10?: 10.1111/joes.12139.
Gelman, A., Hill, J. and Yajima, M. (2009). Why we (usually) don’t have to worry about multiple comparisons, ??
Ioannidis, J. (2005). Why most published research findings are false, PLoS Medicine 2: 696–701.
Klein, R. et al. (2014). Investigating variation in replicability, Soc Psychol 45(3): 142–152. (“Many Labs”: replication of 17 effects)
Open Science Collaboration (2015). Estimating the reproducibility of psychological science, Science 349: 943–952.