
Reproducibility: Measuring Success of Replication

Werner Stahel

Seminar für Statistik, ETH Zürich

Basel, Nov 13, 2017

1. Thoughts on the Role of Reproducibility

1.1 Paradigms

Reproducibility defines knowledge. “The Scientific Method”:

Scientific facts are those that have been reproduced independently.

... as opposed to belief, which is “irrational” for some of us.

Science should be the collection of knowledge that is “true”.

Well, not quite: the Big Bang is not reproducible; it is a theory, yet it is nevertheless called scientific knowledge.

In fact, empirical science needs theories as its foundation.


Reproducibility of facts defines science

– physics, chemistry, biology, life science = “Exact” Sciences

What is the role of reproducibility in Humanities and Arts?

Humanities try to become “exact sciences” by adopting “the scientific method”.

Arts: A composition is a reproducible piece of music.

Reproducibility achieved by fixing notes.

Intonation only “reproducible” with recordings.

Improvisation in music; mandala in “sculpture”:

Intention to make something unique, irreproducible.

Back to “exact” sciences!


1.2 The Crisis

Reproducibility is a myth in most fields of science!

Ioannidis, 2005, PLOS Med. 2: Why most published research findings are false.

−→ many papers, newspaper articles, round tables, editorials of journals, ...

Topic of Collegium Helveticum, ETH Zurich −→ Handbook

Tages-Anzeiger of Aug 28, 2015:

“Psychologie-Studien sind wenig glaubwürdig”

(Psychological studies have little credibility)

“Open Science Collaboration”, Science 349, 943-952, 2015:

Estimating the reproducibility of psychological science

[Figure: P-values]

[Figure: Original study P-value]

[Figure: Effect Size]

−→ Effect sizes are lower, as a rule, in the replication.

Results:

36% of replications had significant results;

39% of effects were subjectively rated to have replicated the original result;

47% of original effect sizes were in the 95% confidence interval of the replication effect size.

Similar results for pharmaceutical trials, genetic effects, ...

Cancer Biology: 22 replications planned, 7 completed; about 55% of the 23 effects studied were significant again.

Note: What is a success/failure of a reproduction?

−→ not well defined!

... not even in the case of assessing just a single effect!


1.3 Outline

1. Introduction: over!

2. A further replication study

3. Success of replication

4. P-values

5. Between studies variance

6. Consequences for replication

7. Selection biases −→ no time...

8. Conclusions: Is reproducibility a useful concept? How should scientific findings be “confirmed”?

2. A further replication study

“Many Labs” Replication Study in Psychology

17 “hypotheses” = scientific questions,

36 “samples” (mostly university bachelor students)

Example: 1 scientific question: “Anchoring”

"The distance from San Francisco to New York City is

[low] longer than 1,500 miles

[high] shorter than 6,000 miles.

How far do you think it is?"

True distance is 2,906 miles.

Do the answers differ between the two “anchoring” groups?


Results for PSU (Penn State) – taken here as “original study”

[Figure: answers in miles (axis 2000 to 6000) by anchor group (low, high)]


Test for difference:

##
## Welch Two Sample t-test
##
## data: anchoring1.y by anchoring1.g
## t = 7, df = 60, p-value = 4e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1164 2014
## sample estimates:
## mean in group low mean in group high
##              4182               2592

Confidence interval for difference: [1164, 2014]
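The quantities behind such a Welch test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `welch_t` and the two small samples are made up, and the raw anchoring answers are not reproduced here. The sketch returns the difference of means, its standard error, the t statistic and the Welch-Satterthwaite degrees of freedom; it deliberately stops short of a p-value.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return (difference of means, standard error, t statistic, Welch df)."""
    nx, ny = len(x), len(y)
    # squared standard errors of the two group means (unequal variances allowed)
    vx, vy = variance(x) / nx, variance(y) / ny
    diff = mean(x) - mean(y)
    se = math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return diff, se, diff / se, df

# Made-up illustration data, NOT the anchoring answers from the study
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
diff, se, t_stat, df = welch_t(group_a, group_b)
```

A confidence interval for the difference is then `diff` plus or minus a t quantile (roughly 2) times `se`.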


Results for UVA (Virginia) – taken here as “replication study”

[Figure: answers in miles (axis 2000 to 6000) by referrer (psu = original, uva = replication)]


Estimates and confidence intervals for difference:

##             original          replication
## means       4182, 2592        3979, 3057
## est.diff    1589              922
## stand.err   212               255
## conf.int    [ 1164, 2014 ]    [ 413, 1432 ]

Is the replication successful?

3. Success of replication

3.1 Answer “significance”:

Replication is successful if both results are significant.

Probability of getting successful replications depends on

– original P-value,

– sample size $N_r$ of replication.


For $N_r \to \infty$, almost always successful because of the

P-value Paradox: “There is always at least a tiny effect. The test only shows if you have a sufficiently large sample to make it statistically significant.”

"First, we are typically not terribly concerned with Type 1 error because we rarely believe that it is possible for the null hypothesis to be strictly true." [Gelman et al. 2009]

For $N_r \to \infty$, almost always successful.

−→ Do not use too high power in the replication! ???

Problem is with P-value! see later!
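The P-value paradox can be made concrete with a small calculation. The sketch below assumes a two-sided z-test with known standard error (a simplification, not the test used in the study); the function name `prob_significant` and the tiny effect of 0.01 are hypothetical. It computes the probability that a replication turns out significant for growing $N_r$.

```python
from statistics import NormalDist

def prob_significant(effect, se_r, level=0.05):
    """P(two-sided z-test rejects) when the true effect is `effect`
    and the replication estimate has standard error `se_r`."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - level / 2)
    shift = effect / se_r
    # rejection happens in the upper or in the lower tail
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)

# Hypothetical tiny effect 0.01 with unit-variance data, so se_r = 1 / sqrt(N_r)
power_by_n = {n: prob_significant(0.01, 1 / n ** 0.5)
              for n in (100, 10_000, 1_000_000)}
```

Under these assumptions the probability starts close to the level 0.05 for small $N_r$ and approaches 1 as $N_r$ grows, even though the effect is irrelevant in size.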


3.2 Answer OSC (Open Science Collaboration):

Replication is successful if the confidence interval of the replication contains the estimate of the original.

OSC study: 47%

??? What kind of statistics is this???

Replication will “always” fail if its sample size is big enough!
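Why this criterion fails for large replications can be seen from a short calculation. The sketch assumes that both studies estimate the same true effect with normally distributed errors and no between-study variance; the function name `prob_osc_success` is hypothetical. The original estimate then differs from the replication estimate by a normal variable with standard deviation $\sqrt{se_o^2 + se_r^2}$, while the replication interval has half-width $1.96\, se_r$.

```python
import math
from statistics import NormalDist

def prob_osc_success(se_o, se_r, level=0.95):
    """P(original estimate lies inside the replication confidence interval),
    assuming both studies estimate the same theta."""
    std = NormalDist()
    z = std.inv_cdf(0.5 + level / 2)
    sd_diff = math.sqrt(se_o ** 2 + se_r ** 2)  # sd of theta_hat_o - theta_hat_r
    return 2 * std.cdf(z * se_r / sd_diff) - 1

p_equal = prob_osc_success(1.0, 1.0)   # replication as precise as the original
p_large = prob_osc_success(1.0, 0.1)   # replication ten times more precise
```

Even with identical precision and a perfectly reproducible effect, the success probability stays clearly below 95%, and it shrinks towards 0 as the replication gets more precise.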


3.3 Answer “overlap”:

Replication is successful if confidence intervals overlap.

Confidence intervals are for parameters!

Are the parameters in the overlap those that are still possible?

Hope for a small overlap!

Remember your Stat-1 course!


3.4 Answer “compatibility”:

Replication is successful if the data could come from the same “population”.

−→ Classical two-sample problem

−→ P-value NO!!! Need confidence interval for the difference!

General case: 2 samples −→ two estimates $\hat\theta_k$ with standard errors $se_k$, $k = o, r$.

Standard error for $\Delta = \hat\theta_r - \hat\theta_o$ is $se_d = \sqrt{se_o^2 + se_r^2}$.

Confidence interval: $\hat\theta_r - \hat\theta_o \pm 1.96 \cdot se_d$.


“Anchoring” Example.

                      original       replication
means                 2592 ; 4182    3057 ; 3979
est.difference        1589           922
standard error        210            253
difference of diff.   −667 ± 643
conf.int(diff.diff)   [-1317, -18]

(Sorry!) Why?
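The compatibility interval for the anchoring example can be recomputed from the rounded table entries. The sketch below is a minimal illustration; the function name `diff_ci` is hypothetical, and it uses the normal quantile 1.96 on rounded inputs, so it lands close to, but not exactly on, the interval reported above.

```python
import math

def diff_ci(est_o, se_o, est_r, se_r, z=1.96):
    """Difference of two independent estimates with its standard error
    and an approximate 95% confidence interval."""
    diff = est_r - est_o
    se_d = math.sqrt(se_o ** 2 + se_r ** 2)
    return diff, se_d, (diff - z * se_d, diff + z * se_d)

# Rounded values from the anchoring table (original: PSU, replication: UVA)
diff, se_d, (ci_low, ci_high) = diff_ci(1589, 210, 922, 253)
```

The upper limit stays below 0, so the two effect estimates are not compatible under this simple model.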


Why?

Original or replication study not properly done or analyzed;

Improved experimental methods have reduced systematic error;

Statistical model needs improvement!

... (see after the following excursion)

4. Ban p-values

P-value Paradox: “There is always at least a tiny effect.

The test only shows if you have a sufficiently large sample to make it statistically significant.”

Science should ask questions. Is the effect relevant?

Need threshold of relevance

Asking too much?

Confidence interval gives answer for any threshold.

−→ Presentation of results does not depend on threshold.

Replace p-values by confidence intervals!


Here, Example Anchoring:

Estimate the difference of distance judgements between the “high” and “low” anchoring groups.

Original: 1589 (standard error 210), confidence interval [1179, 2000]

[Figure: differences of 'high' and 'low' for original and replication (axis 0 to 2000), and their difference (axis −1500 to 500)]


Confidence interval for difference of differences does not cover 0.

Is the difference relevant? – Decision unclear.

General: confidence interval for difference of effects between original and replication.

−→ Decision “difference not relevant” is possible in principle, although the sample size will often not be sufficient if the relevance threshold is low.


In 1999, a committee of psychologists came close to a ban of the p-value!

−→ Use confidence intervals!

[Figure: P value, confidence interval, test result, yes/no answer. © Markus Kalisch]

5. Between Studies Variance

Estimate of the effect from many studies.

Example: 36 studies.

[Figure: anchoring1, per-study results (axis 500 to 2500), grouped as Uni.US, Uni.notUS, notUni.US, ...]


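Overall estimate of effect: $\hat\theta = \frac{1}{K} \sum_{k=1}^{K} \hat\theta_k$

Precision?

“Theoretical”: $\mathrm{var}_t(\hat\theta) = \frac{1}{K^2} \sum_{k=1}^{K} se_k^2$

“Empirical”: $\mathrm{var}_e(\hat\theta) = \frac{1}{K(K-1)} \sum_{k=1}^{K} (\hat\theta_k - \hat\theta)^2$

$\mathrm{var}_e(\hat\theta) > \mathrm{var}_t(\hat\theta)$ −→ overdispersion!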
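The comparison of the two precision measures can be sketched as follows. Two assumptions are made explicit here: the empirical variance of the mean is taken as the sample variance of the $\hat\theta_k$ divided by $K$, and the overdispersion factor is taken as the ratio of the empirical to the theoretical standard error of the mean. The function name `overdispersion` and the three toy studies are made up.

```python
import math
from statistics import mean, variance

def overdispersion(estimates, std_errors):
    """Return (overall mean, theoretical se, empirical se, overdispersion factor)."""
    k = len(estimates)
    theta = mean(estimates)
    var_t = sum(se ** 2 for se in std_errors) / k ** 2  # from within-study standard errors
    var_e = variance(estimates) / k                     # from the spread between studies
    # assumption: the factor is the ratio of the two standard errors of the mean
    return theta, math.sqrt(var_t), math.sqrt(var_e), math.sqrt(var_e / var_t)

# Made-up toy data: three studies with standard error 1 each
theta, se_t, se_e, factor = overdispersion([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
```

In this toy case the estimates scatter twice as much as their standard errors suggest, so the factor is 2; a factor near 1 would indicate no overdispersion.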


This is a generalization of 1-way ANOVA, which uses

– within group variance −→ $\mathrm{var}_t$

– and between group variance −→ $\mathrm{var}_e$


−→ Estimation of overdispersion

Restricted to US Universities

##   meand seMeandTh seMeandEmp overdFac
## 1245.14     51.41      55.62     1.08

All 36 “samples”:

## Error in eval(expr, envir, enclos): object ’r.anch1’ not found

Close replication vs. conceptual replication = generalization

May hope for no overdispersion (factor=1) in close replications.

Should we? Consider P-value Paradox!

−→ need relevance threshold for difference of effects between studies.


Expanding the confidence intervals.

Assuming a known overdispersion factor $\omega$, the confidence intervals can be expanded accordingly.

This results in

## [1] -1370.1 35.5

– which covers 0!
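One way to read this expansion is that the overdispersion factor multiplies the standard error and therefore the half-width of the interval. The sketch below applies that reading, which is an assumption, to the interval [-1317, -18] for the difference of differences, with the factor 1.08 reported for US universities. The function name `expand_ci` is hypothetical.

```python
def expand_ci(lower, upper, omega):
    """Widen a symmetric confidence interval by the factor omega,
    keeping its centre fixed."""
    center = (lower + upper) / 2
    half_width = (upper - lower) / 2 * omega
    return center - half_width, center + half_width

# Interval for the difference of differences, rounded factor 1.08
expanded_low, expanded_high = expand_ci(-1317, -18, 1.08)
```

With these rounded inputs the expanded interval comes out close to the one shown above and, like it, includes 0.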

Some overdispersion factors:

           overdFac
anchoring1    1.030
anchoring2    1.438
anchoring3    1.544
anchoring4    1.314
gambler       1.454
moneyprim     0.843

6. Consequences for replication

A single replication is rather useless. Assume that

– we have a relevance threshold $\theta^* > 0$ and the confidence interval of the original study is above it, $\hat\theta_o - 2\,se_o \ge \theta^*$;

– the precision of the replication is similar to the original, $se_r \approx se_o$.

If the true $\theta$ were $\theta = \hat\theta_o = \theta^* + 2\,se_o$, then the probability that the replication confidence interval lies above $\theta^*$ (that is, $\hat\theta_r - 2\,se_r \ge \theta^*$) would be about 0.5.

Unless $\hat\theta_o - 2\,se_o \gg \theta^*$, the probability of getting an inconclusive result is quite high.
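The value 0.5 follows from a one-line normal calculation, sketched here. The function name `prob_conclusive`, the relevance threshold of 500 miles and the use of the anchoring numbers are hypothetical choices for illustration only. A replication counts as conclusive when its estimate minus 2 standard errors is at least the threshold.

```python
from statistics import NormalDist

def prob_conclusive(theta_true, threshold, se_r, z=2.0):
    """P(theta_hat_r - z * se_r >= threshold) for theta_hat_r ~ N(theta_true, se_r^2)."""
    needed = threshold + z * se_r  # smallest replication estimate that is conclusive
    return 1 - NormalDist(mu=theta_true, sigma=se_r).cdf(needed)

# Hypothetical relevance threshold of 500 miles, se_r = se_o = 210
borderline = prob_conclusive(theta_true=500 + 2 * 210, threshold=500, se_r=210)
comfortable = prob_conclusive(theta_true=1589, threshold=500, se_r=210)
```

When the true effect sits exactly 2 standard errors above the threshold, the replication is conclusive only half of the time; the probability rises only when the effect is far above the threshold.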


In case of

– an inconclusive result, one would plan another replication study. −→ sequential inference, to be taken into account even if ...

– the desired result $\hat\theta_r - 2\,se_r \ge \theta^*$ (replication interval above the relevance threshold $\theta^*$), what is an appropriate $se$ for calculating a reliable confidence interval?

Should take overdispersion into account!

But we do not have an estimate for it!

−→ Need several replications −→ $\hat\theta_k$.

Then, estimate $\theta$ as mean (median) of the $\hat\theta_k$s, with empirical $se_e$.

Do not include the original study because of potential selection bias.

−→ “We need to move toward a better understanding of the relationship between reproducibility, cumulative evidence, and the truth of scientific claims.” [Clemens 2015]

7. Selection biases

7.1 Within study selection

Here is a common way of learning from empirical studies:

– visualize data,

– see patterns (unexpected, but with sensible interpretation),

– test if statistically significant,

– if yes (usually), describe “pattern” in manuscript.

(cf. “industry producing statistically significant effects”)


The problem, formalized: Multiple comparisons

7 groups, generated by random numbers $\sim N(0, 1)$. −→ $H_0$ true!

[Figure: y (axis −2 to 2) for groups 1 to 7; legend: std.dev., conf.int]


Test each pair of groups for a difference in expected values.

−→ $7 \cdot 6 / 2 = 21$ tests.

$P(\text{rejection}) = 0.05$ for each test.

−→ Expected number of significant test results = 1.05!

In this example: significant differences for 1 vs. 6 and 1 vs. 7.

Publish the significant result!

You will certainly find an explanation why it makes sense...

−→ Selection bias.
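The expected count of 1.05 false positives can be checked by simulation. The sketch below makes simplifying assumptions: 20 observations per group (the slides do not state the group size), and a z-test with known variance 1 instead of a t-test. All names are hypothetical.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist

def count_false_positives(rng, groups=7, n=20, level=0.05):
    """Simulate `groups` samples from N(0, 1) and count the pairwise
    z-tests that come out significant although H0 is true."""
    z_crit = NormalDist().inv_cdf(1 - level / 2)
    means = [sum(rng.gauss(0, 1) for _ in range(n)) / n for _ in range(groups)]
    se = math.sqrt(2 / n)  # standard error of a difference of two means, variance known
    return sum(abs(a - b) / se > z_crit for a, b in combinations(means, 2))

rng = random.Random(1)  # fixed seed for reproducibility
reps = 2000
counts = [count_false_positives(rng) for _ in range(reps)]
avg_significant = sum(counts) / reps                  # compare with 21 * 0.05 = 1.05
share_any = sum(c > 0 for c in counts) / reps         # share of datasets with at least one "finding"
```

The average number of significant pairs should come out near 1.05, and a sizeable share of pure-noise datasets offers at least one publishable "effect".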


Solution for multiple (“all pairs”) comparisons:

– Make a single test for the hypothesis that all $\mu_g$ are equal! −→ F-test for factors.

– Lower the level $\alpha$ for each of the 21 tests such that $P(\ge 1 \text{ significant test result}) \le \alpha = 0.05$!

Simplest and general (multiple testing): Bonferroni correction: divide $\alpha$ by the number of tests

−→ conservative testing procedure

−→ You will get no significant results −→ nothing published

(Are we back to testing? – Considerations also apply to confidence intervals!)
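The Bonferroni rule itself is a one-liner, sketched below for the 21 pairwise tests. The function name `bonferroni` and the list of p-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test level alpha / m and, for each p-value,
    whether it is significant at that level."""
    per_test_level = alpha / len(p_values)
    return per_test_level, [p <= per_test_level for p in p_values]

# Hypothetical p-values for 21 pairwise tests
p_values = [0.001, 0.004, 0.03] + [0.5] * 18
per_test_level, rejected = bonferroni(p_values)
```

With 21 tests the per-test level drops to about 0.0024, so a p-value of 0.004 or 0.03, "significant" when taken alone, no longer counts.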


In reality, it is even worse!

When exploring data, nobody knows how many hypotheses are “informally tested” by visual inspection of informative graphs.

Exploratory data analysis – curse or blessing?

Solution?

One dataset – one test!

(or: only a small number of planned tests/confidence intervals, with Bonferroni adjustments (?))

Dream for statistician, nightmare for researchers!


7.2 More biases

Publication bias: A manuscript without significant effects will not be written, or will be rejected.

“Researcher degrees of freedom” = selecting transformations, estimation methods, model selection, ...

−→ Discussion.

Remedy: Pre-registration: Publish the project (with review!), accepted if the scientific question is worthwhile, then publish the results in any case!


7.3 Stepping procedure of advancing science:

1. Explore data freely, allowing all creativity.

Create “hypotheses” about relevant effects.

2. Conduct a new study to examine the hypotheses (not $H_0$!)

“Believe” effects that are successfully confirmed with a sufficient magnitude to be relevant.

3. Extend the scope in a planned manner to generalize the insight.

1.* Use dataset in an exploratory attitude to generate new hypotheses.

it. Iterate until retirement.

Note that step 2 is a phony replication!


Huang & Gottardo, Briefings in Bioinformatics 14 (2012), 391-401

8. Conclusions

Where and when is reproducibility a useful concept?

8.1 “Exact” sciences ...

(well: “quantitative, empirical part of sciences”)

... Physics, Chemistry, Biology, Medicine, Life sciences.

Reproducibility is an important principle to keep in mind.

Feasible? Sometimes. Needs motivation, skill & luck. Recognition?

Data Challenge, Confirmation

Science is not only about collecting facts that meet the criterion of reproducibility, but about generating theories (in a wide sense) that connect the facts.


Types of confirmation:

+ Repetition = “close replication”: Same setting −→ should produce response values within “data compatibility”.

+ Generalization = “conceptual replication”, extrapolation:

Vary the situation to check the “robustness” of the result.

Data Challenge

+ Extension: Generalize the problem.

Distinction is more useful in more complicated situations (regression).


Pre-registration!

Repetition studies should be pre-registered.

The journal that published the original result should be prepared to accept and list pre-registrations of its repetition (if the design is ok) and publish the result regardless of the outcome.

−→ Journal Section “Replication”

Recommendation:

Perform combined study for replication and generalization and/or extension.

Each PhD thesis could consist of a replication phase, followed by a generalization study.


8.2 Social sciences, ...

– Macro-Economics: Economy only exists once, no replication.

– Society, History: same

– Psychology: Circumstances (therapist, institution, culture) are difficult to replicate.

These sciences should not be reduced to their quantitative parts!

What about philosophy and religion?

Good for discussions over lunch.


Messages

Avoid significance tests and P-values. Use confidence intervals!

−→ Teaching!

Precise replication in the sense of compatibility of quantitative results (non-significant difference) is rare (outside students’ lab classes in physics).

It becomes somewhat more realistic if models contain a study-to-study variance component.


Dilemma of advancing science: Exploratory and confirmatory steps.

−→ Data Challenge

Instead of mere replication studies, perform confirmation / generalization studies!

Replication is only applicable to empirical science.

There are other modes of thinking that should be recognized as “science” in the broad sense (“Wissenschaft”).

What is confirmation in these fields?

−→ In what sense / to what degree should replication be a requirement for serious research?

Thank you for your endurance


References

Atmanspacher, H. and Maasen, S. (eds) (2016). Reproducibility: Principles, Problems, Practices, and Prospects, Wiley.

Clemens, M. A. (2015). The meaning of failed replications: A review and proposal, J. Econ. Surv. 10?: 10.1111/joes.12139.

Gelman, A., Hill, J. and Yajima, M. (2009). Why we (usually) don’t have to worry about multiple comparisons, ??

Ioannidis, J. (2005). Why most published research findings are false, PLoS Medicine 2: 696–701.

Klein, R. et al. (2014). Investigating variation in replicability, Soc Psychol 45(3): 142–152. “Many Labs”: reproduction of 17 effects.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science, Science 349: 943–952.
