
Reproducibility: Crisis, Concepts, Outlook

Werner Stahel

Seminar für Statistik, ETH Zürich WBL, Dec 18, 2017

1. Thoughts on the Role of Reproducibility

1.1 Paradigms

Reproducibility defines knowledge. “The Scientific Method”:

Scientific facts are those that have been reproduced independently.

... as opposed to belief, which is “irrational” for some of us.

Science should be the collection of knowledge that is “true”.

Well, not quite: the Big Bang is not reproducible. It is a theory, yet it is nevertheless called scientific knowledge.

In fact, empirical science needs theories as its foundation.

Reproducibility of facts defines science

– physics, chemistry, biology, life science = “Exact” Sciences

What is the role of reproducibility in Humanities and Arts?

Humanities try to become “exact sciences” by adopting “the scientific method”.

Arts: A composition is a reproducible piece of music.

Reproducibility achieved by fixing notes.

Intonation only “reproducible” with recordings.

Improvisation in music; mandala in “sculpture”:

Intention to make something unique, irreproducible.

Back to “exact” sciences!


1.2 The Crisis

Reproducibility is a myth in most fields of science!

Ioannidis, 2005, PLOS Med. 2:

Why most published research findings are false.

→ many papers, newspaper articles, round tables, editorials of journals, ...

Topic of Collegium Helveticum, ETH Zurich → Handbook

Tages-Anzeiger of Aug 28, 2015:

“Psychologie-Studien sind wenig glaubwürdig”

(Psychological studies have little credibility)

“Open Science Collaboration”, Science 349, 943-952, 2015:

Estimating the reproducibility of psychological science

[Figure: P-values; axis label: original study P-value]

[Figure: Effect size]

→ Effect sizes are lower, as a rule, in the replication.

Results:

36% of replications had significant results;

39% of effects were subjectively rated to have replicated the original result;

47% of original effect sizes were in the 95% confidence interval of the replication effect size.

Similar results for pharmaceutical trials, genetic effects, ...

Cancer Biology: 22 replications planned, 7 completed; about 55% of the 23 effects studied were significant again.

Note: What is a success/failure of a reproduction?

→ Not well defined! ... not even in the case of assessing just a single effect!


1.3 Outline

1. Introduction: over!

2. A further replication study

3. Success of replication

4. P-values

5. Between studies variance

6. Consequences for replication

7. Selection biases

8. Regression and Model Selection

9. Conclusions: Is reproducibility a useful concept?

How should scientific findings be “confirmed”?

2. A further replication study

“Many Labs” Replication Study in Psychology

17 “hypotheses” = scientific questions,

36 “samples” (mostly bachelor students at universities)

Example: 1 scientific question: “Anchoring”

"The distance from San Francisco to New York City is

[low] longer than 1,500 miles

[high] shorter than 6,000 miles.

How far do you think it is?"

True distance is 2,906 miles.

Do the answers differ between the two “anchoring” groups?


Results for PSU (Penn State) – taken here as “original study”

[Figure: judged distance in miles (about 2000 to 6000) by anchor group, low vs. high]

Test for difference:

##
##  Welch Two Sample t-test
##
## data:  anchoring1.y by anchoring1.g
## t = 7, df = 60, p-value = 4e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1164 2014
## sample estimates:
##  mean in group low mean in group high
##               4182               2592

Confidence interval for difference: [1164, 2014]


Results for UVA (Virginia) – taken here as “replication study”

[Figure: judged distance in miles (about 2000 to 6000) by referrer, psu (original) vs. uva (replication)]

Estimates and confidence intervals for difference:

##             original          replication
## means       4182, 2592        3979, 3057
## est.diff    1589              922
## stand.err   212               255
## conf.int    [ 1164, 2014 ]    [ 413, 1432 ]

Is the replication successful?

3. Success of replication

3.1 Answer “significance”:

Replication is successful

if both results are significant.

The probability of getting successful replications depends on the original P-value and on the sample size $N_r$ of the replication.

For $N_r \to \infty$, the replication is almost always successful because of the

P-value Paradox: “There is always at least a tiny effect. The test only shows if you have a sufficiently large sample to make it statistically significant.”

"First, we are typically not terribly concerned with Type 1 error because we rarely believe that it is possible for the null hypothesis to be strictly true." [Gelman et al. 2009]

→ Do not use too high power in the replication! ???

Problem is with P-value! See later!
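The P-value paradox can be made concrete with a small calculation. The sketch below is not from the talk: it assumes a two-sided two-sample z-test at level 0.05 with known standard deviation and equal group sizes, and a hypothetical tiny true effect of 0.02 standard deviations. The function name `power_two_sample_z` is chosen here for illustration.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def power_two_sample_z(effect_sd: float, n_per_group: int, z_crit: float = 1.96) -> float:
    """Probability that a two-sided two-sample z-test comes out significant.

    effect_sd   : true difference in means, in units of the standard deviation
    n_per_group : number of observations in each of the two groups
    """
    # Under the true effect the standardized test statistic is N(shift, 1).
    shift = effect_sd * math.sqrt(n_per_group / 2.0)
    return normal_cdf(-z_crit + shift) + normal_cdf(-z_crit - shift)


tiny_effect = 0.02  # hypothetical value, for illustration only
for n in (100, 10_000, 1_000_000):
    print(n, round(power_two_sample_z(tiny_effect, n), 3))
```

Under these assumptions the probability of a significant result stays close to the level for small samples and approaches 1 as the sample size grows, however small the effect.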

3.2 Answer OSC (Open Science Collaboration):

Replication is successful

if the confidence interval of the replication contains the estimate of the original.

OSC study: 47%

??? What kind of statistics is this???

Replication will “always” fail if its sample size is big enough!
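Why this criterion must fail for large replications can be sketched numerically. The code below is an illustration, not part of the talk: it assumes that both studies estimate the same true effect without bias and with normal errors, so that the difference of the two estimates has standard error sqrt(se_o^2 + se_r^2). The function name is hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_ci_contains_original(se_o: float, se_r: float, z: float = 1.96) -> float:
    """P(original estimate lies inside the replication's 95% interval).

    Assumes both estimates are normal and unbiased for the same true effect,
    so their difference is N(0, se_o^2 + se_r^2).
    """
    se_d = math.sqrt(se_o ** 2 + se_r ** 2)
    # The criterion holds when |difference| <= z * se_r.
    return 2.0 * normal_cdf(z * se_r / se_d) - 1.0


for ratio in (1.0, 0.5, 0.1, 0.01):
    print(ratio, round(prob_ci_contains_original(1.0, ratio), 3))
```

Even with perfect agreement of the true effects, the probability is well below 95% for equal precision and tends to 0 as the replication becomes more precise, because the uncertainty of the original estimate is ignored.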


3.3 Answer “overlap”:

Replication is successful

if confidence intervals overlap.

Confidence intervals are for parameters!

Are the parameters in the overlap those that are still possible?

Hope for a small overlap!

Remember your Stat-1 course!


3.4 Answer “compatibility”:

Replication is successful

if the data could come from the same “population”.

→ Classical two-sample problem → P-value? NO!!! Need confidence interval for the difference!

General case: 2 samples → two estimates $\hat\theta_k$ with standard errors $se_k$, $k = o, r$.

Standard error for $\Delta = \hat\theta_r - \hat\theta_o$ is

$$ se_d = \sqrt{se_o^2 + se_r^2} $$

Confidence interval:

$$ \hat\theta_r - \hat\theta_o \pm 1.96 \cdot se_d . $$


“Anchoring” Example.

                       original       replication
means                  2592 ; 4182    3057 ; 3979
est.difference         1589           922
standard error         210            253

difference of diff.    −667 ± 643
conf.int(diff.diff)    [-1317, -18]
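The compatibility check for the anchoring example can be recomputed from the estimates and standard errors listed for the two studies. This is a minimal sketch using the normal quantile 1.96; the function name `difference_ci` is made up here. Its interval differs slightly from the one reported in the talk, presumably because of rounding of the inputs or a different quantile.

```python
import math


def difference_ci(est_o: float, se_o: float, est_r: float, se_r: float, z: float = 1.96):
    """Estimate, standard error and confidence interval for est_r - est_o."""
    diff = est_r - est_o
    se_d = math.sqrt(se_o ** 2 + se_r ** 2)  # standard error of the difference
    return diff, se_d, (diff - z * se_d, diff + z * se_d)


# Values taken from the anchoring example: original (PSU) and replication (UVA).
diff, se_d, (lower, upper) = difference_ci(est_o=1589, se_o=210, est_r=922, se_r=253)
print(f"difference = {diff}, se = {se_d:.1f}, interval = [{lower:.0f}, {upper:.0f}]")
```

The interval lies entirely below 0, so under this criterion the two studies are not compatible.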

(Sorry!) Why?

Original or replication study not properly done or analyzed;

Improved experimental methods have reduced systematic error;

Statistical model needs improvement!

... (see after the following excursion)

4. Ban p-values

P-value Paradox: “There is always at least a tiny effect.

The test only shows if you have a sufficiently large sample to make it statistically significant.”

Science should ask questions. Is the effect relevant?

Need threshold of relevance

Asking too much?

Confidence interval gives answer for any threshold.

→ Presentation of results does not depend on threshold.

Replace p-values by confidence intervals!


Here, Example Anchoring:

Estimate difference of distance judgements between

“high” and “low” anchoring group

Original:

1589 ± 210 = [1179 , 2000]

[Figure: confidence intervals for the differences of 'high' and 'low' in the original and in the replication (scale 0 to 2000), and for their difference (scale −1500 to 500)]

Confidence interval for difference of differences does not cover 0.

Is the difference relevant? – Decision unclear.

General: confidence interval for difference of effects between original and replication.

→ Decision “difference not relevant” is possible in principle, although the sample size will often not be sufficient if the relevance threshold is low.
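The three possible outcomes of comparing a confidence interval with a relevance threshold can be written down as a small decision rule. This is an illustrative sketch: the function name, the symmetric threshold, and the threshold value of 500 miles are assumptions made here, since the talk does not fix a threshold.

```python
def relevance_decision(lower: float, upper: float, threshold: float) -> str:
    """Classify a confidence interval for a difference of effects.

    threshold is a (hypothetical) relevance threshold, applied symmetrically:
    differences with absolute value below it count as not relevant.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if -threshold < lower and upper < threshold:
        return "not relevant"  # whole interval inside (-threshold, threshold)
    if lower >= threshold or upper <= -threshold:
        return "relevant"      # whole interval outside the irrelevance zone
    return "unclear"           # interval straddles a threshold


# Interval for the difference of differences in the anchoring example.
print(relevance_decision(-1317, -18, threshold=500))
```

A "not relevant" verdict requires an interval that is narrow compared with the threshold, which is why a low threshold demands a large sample.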

In 1999, a committee of psychologists came close to a ban of the p-value!

→ Use confidence intervals!

[Figure: P value, confidence interval, test result (yes/no answer). © Markus Kalisch]

5. Between Studies Variance

Estimate of the effect from many studies.

Example: 36 studies.

[Figure: estimated effects with confidence intervals for anchoring1 (scale 500 to 2500), samples grouped as Uni.US, Uni.notUS, notUni.US, ...]

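Overall estimate of effect:

$$ \hat\theta = \frac{1}{K} \sum_{k=1}^{K} \hat\theta_k $$

Precision?

“Theoretical”:

$$ \mathrm{var}_t(\hat\theta) = \frac{1}{K^2} \sum_{k=1}^{K} se_k^2 $$

“Empirical”:

$$ \mathrm{var}_e(\hat\theta) = \frac{1}{K-1} \sum_{k=1}^{K} (\hat\theta_k - \hat\theta)^2 $$

$\mathrm{var}_e(\hat\theta) > \mathrm{var}_t(\hat\theta)$ → overdispersion!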

This is a generalization of 1-way ANOVA, which uses the within-group variance → $\mathrm{var}_t$ and the between-group variance → $\mathrm{var}_e$.

→ Estimation of overdispersion

Restricted to US Universities:

##   meand  seMeandTh  seMeandEmp  overdFac
## 1245.14      51.41       55.62      1.08

All 36 “samples”:

##   meand  seMeandTh  seMeandEmp  overdFac
## 1248.94      36.30       45.60      1.26

Close replication vs. conceptual replication = generalization

May hope for no overdispersion (factor = 1) in close replications.

Should we? Consider P-value Paradox!

→ Need relevance threshold for difference of effects between studies.
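The quantities in the tables above (meand, seMeandTh, seMeandEmp, overdFac) could be computed along the following lines. This is a sketch under assumptions, not the author's code: it takes the empirical variance of the overall estimate to be the sample variance of the study estimates divided by K, so that it is comparable with the theoretical one, and it defines the overdispersion factor as the ratio of the two standard errors, which is consistent with the reported numbers (55.62 / 51.41 ≈ 1.08). The input data are invented.

```python
import math
from statistics import mean, variance


def overdispersion(estimates, std_errors):
    """Return (overall estimate, theoretical se, empirical se, overdispersion factor)."""
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies with one standard error each")
    overall = mean(estimates)
    # Theoretical: propagate the within-study standard errors to the mean.
    se_theoretical = math.sqrt(sum(se ** 2 for se in std_errors)) / k
    # Empirical: spread of the study estimates (assumption: divided by K for the mean).
    se_empirical = math.sqrt(variance(estimates) / k)
    return overall, se_theoretical, se_empirical, se_empirical / se_theoretical


# Invented example with three studies.
result = overdispersion([1000.0, 1200.0, 1400.0], [100.0, 100.0, 100.0])
print(result)
```

A factor clearly above 1 indicates that the studies differ by more than their individual standard errors would suggest.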


Expanding the confidence intervals.

Assuming a known overdispersion factor $\omega$, the confidence intervals can be expanded accordingly.

This results in

## [1] -1370.1 35.5

– which covers 0!
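The expansion can be sketched as follows. The talk does not state which value of the overdispersion factor was used; the code assumes 1.08 (the factor reported for US universities) and the standard errors 212 and 255 from the table of estimates, and it scales the standard error of the difference by that factor. With these assumptions the result is close to, but not identical with, the interval shown above.

```python
import math


def expanded_difference_ci(est_o, se_o, est_r, se_r, omega=1.0, z=1.96):
    """Confidence interval for est_r - est_o, widened by an overdispersion factor omega."""
    diff = est_r - est_o
    se_d = omega * math.sqrt(se_o ** 2 + se_r ** 2)
    return diff - z * se_d, diff + z * se_d


plain = expanded_difference_ci(1589, 212, 922, 255)                 # omega = 1
widened = expanded_difference_ci(1589, 212, 922, 255, omega=1.08)   # assumed omega
print([round(v, 1) for v in plain], [round(v, 1) for v in widened])
```

Even a modest factor is enough to turn the conclusion from "difference excludes 0" into "difference covers 0".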

Some overdispersion factors (estimated without the “original”):

              overdFac
anchoring1    1.030
anchoring2    1.438
anchoring3    1.544
anchoring4    1.314
gambler       1.454
moneyprim     0.843

6. Consequences for replication

A single replication is rather useless. Assume that

– we have a relevance threshold $\theta^* > 0$ and the confidence interval of the original study is above it, $\hat\theta_o - 2\,se_o \ge \theta^*$;

– the precision of the replication is similar to the original, $se_r \approx se_o$.

If the true $\theta$ were $\theta = \hat\theta_o = \theta^* + 2\,se_o$, then the probability that the replication confidence interval is $\ge \theta^*$ would be about 50%.

Unless $\hat\theta_o - 2\,se_o \gg \theta^*$, the probability of getting an inconclusive result is quite high.
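The 50% figure follows from a one-line normal calculation, sketched here under the assumption that the replication estimate is normally distributed around the true effect. The margin is the distance between the true effect and the relevance threshold, measured in standard errors of the original study; the function name and the parameter `se_ratio` (replication standard error relative to the original) are introduced here for illustration.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_replication_conclusive(margin_in_se: float, se_ratio: float = 1.0) -> float:
    """P(replication estimate - 2 * se_r >= threshold).

    margin_in_se : (true effect - threshold) / se_o
    se_ratio     : se_r / se_o
    """
    # Need: estimate >= threshold + 2 * se_r, with estimate ~ N(true effect, se_r^2).
    # Standardizing gives the cut-off 2 - margin_in_se / se_ratio.
    return 1.0 - normal_cdf(2.0 - margin_in_se / se_ratio)


for margin in (2.0, 3.0, 4.0):
    print(margin, round(prob_replication_conclusive(margin), 3))
```

With a margin of exactly 2 standard errors the replication interval clears the threshold only half of the time; a comfortable probability needs either a much larger margin or a more precise replication.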


In case of

– an inconclusive result, one would plan another replication study.
→ sequential inference, to be taken into account even if ...

– the desired result $\hat\theta_r - 2\,se_r \ge \theta^*$ (replication confidence interval above the relevance threshold $\theta^*$), what is an appropriate $se$ for calculating a reliable confidence interval?

Should take overdispersion into account!

But we do not have an estimate for it!

→ Need several replications → $\hat\theta_k$.

Then, estimate $\theta$ as mean (median) of the $\hat\theta_k$s, with empirical $se_e$. Do not include the original study because of potential selection bias.

→ “We need to move toward a better understanding of the relationship between reproducibility, cumulative evidence, and the truth of scientific claims.” [Clemens 2015]

7. Selection biases

7.1 Within study selection

Here is a common way of learning from empirical studies:

visualize data,

see patterns (unexpected, but with sensible interpretation),

test if statistically significant,

if yes (usually), describe “pattern” in manuscript.

(cf. “industry producing statistically significant effects”)


The problem, formalized: Multiple comparisons

7 groups, generated by random numbers $\sim N(0, 1)$ → $H_0$ true!

[Figure: y (about −2 to 2) by group (1 to 7), with std.dev. and conf.int. marked for each group]

Test each pair of groups for a difference in expected values.

→ $7 \cdot 6 / 2 = 21$ tests.

$P(\text{rejection}) = 0.05$ for each test.

→ Expected number of significant test results = 1.05!

Significant differences for 1 vs. 6 and 1 vs. 7.

Publish the significant result!

You will certainly find an explanation why it makes sense...

→ “Selection bias”
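The expected number of 1.05 spurious findings can be checked by simulation. The following sketch is not the author's code: to stay within the Python standard library it uses a z-test with known variance 1 instead of a t-test, and the group size of 10 and the number of simulation runs are arbitrary choices.

```python
import math
import random
from itertools import combinations
from statistics import mean


def two_sided_p_value(z: float) -> float:
    """Two-sided P-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def count_significant_pairs(rng, n_groups=7, n_per_group=10, alpha=0.05) -> int:
    """Simulate groups under H0 and count significant pairwise differences."""
    group_means = [
        mean(rng.gauss(0.0, 1.0) for _ in range(n_per_group))
        for _ in range(n_groups)
    ]
    se_diff = math.sqrt(2.0 / n_per_group)  # known variance 1 in every group
    return sum(
        two_sided_p_value((a - b) / se_diff) < alpha
        for a, b in combinations(group_means, 2)
    )


def average_significant(n_sim=2000, seed=1) -> float:
    """Average number of significant pairs over n_sim simulated datasets."""
    rng = random.Random(seed)
    return mean(count_significant_pairs(rng) for _ in range(n_sim))


print(average_significant())
```

The simulated average should be close to 21 × 0.05 = 1.05, although every null hypothesis is true by construction.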


Remedy: For multiple (“all pairs”) comparisons:

Make a single test for the hypothesis that all $\mu_g$ are equal! → F-test for factors.

Lower the level $\alpha$ for each of the 21 tests such that $P(\ge 1 \text{ significant test result}) \le \alpha = 0.05$!

Simplest and general (multiple testing): Bonferroni correction: divide $\alpha$ by number of tests → conservative testing procedure

→ You will get no significant results → nothing published

(Are we back to testing? – Considerations also apply to confidence intervals!)
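For the 21 pairwise tests of the example, the Bonferroni correction works out as follows. The bound in the second line is the union bound, which holds whether or not the tests are independent; this worked step is added here for illustration.

```latex
\alpha_{\mathrm{Bonf}} = \frac{0.05}{21} \approx 0.0024,
\qquad
P(\ge 1 \text{ significant test result})
  \le \sum_{j=1}^{21} P(\text{test } j \text{ significant})
  = 21 \cdot \frac{0.05}{21} = 0.05 .
```

Each single test thus has to reach a P-value below about 0.0024, which explains why the procedure is conservative.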


In reality, it is even worse!

When exploring data, nobody knows how many hypotheses are “informally tested” by visual inspection of informative graphs.

→ “exploration bias”

Exploratory data analysis – curse or benediction?

Remedy?

One dataset – one predetermined test!

(or: only a small number of planned tests/confidence intervals, with Bonferroni adjustments (?))

Dream for statistician, nightmare for researchers!


7.2 More biases

Publication bias: A manuscript without significant effects will not be written, or will be rejected.

“Researcher degrees of freedom” or “Garden of forking paths” = selecting transformations, estimation methods, model selection, ...

Remedy: Pre-registration: Publish the project (with review!), accepted if the scientific question is worthwhile and the statistical evaluation procedures are correct, then publish the results in any case!


7.3 Stepping procedure of advancing science:

1. Explore data freely, allowing all creativity.

Create “hypotheses” about relevant effects.

2. Conduct a new study to examine the hypotheses (not $H_0$!)

“Believe” effects that are successfully confirmed with a sufficient magnitude to be relevant.

3. Extend the scope in a planned manner to generalize the insight.

1.* Use dataset in an exploratory attitude to generate new hypotheses.

it. Iterate until retirement.

Note that step 2 is a phony replication!


Huang & Gottardo, Briefings in Bioinformatics 14 (2012), 391-401

8. Regression and Model Development

Response variable $Y$ depends on input variables $x^{(1)}, ..., x^{(m)}$

$$ Y_i = \beta_0 + \beta_1 x_i^{(1)} + \beta_2 x_i^{(2)} + ... + \beta_m x_i^{(m)} + E_i , \qquad E_i \sim N(0, \sigma^2), \text{ indep.} $$

Example: Distances needed for stopping freight trains.


Distance S, velocity V0

$S_i = \beta_0 V0_i + \beta_1 V0_i^2 + \tilde{E}_i$ (quadratic in V0)

$(S/V0)_i = \beta_0 + \beta_1 V0_i + E_i$ (linear in V0)

[Figure: S/V0 (0.0 to 1.6) against V0 (0 to 100)]

Example: Inclination as another input variable, and many more, see later.
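The link between the quadratic and the linear formulation of the stopping-distance model is a division by the velocity. The step below is added for illustration and uses a tilde to mark the error term of the quadratic model:

```latex
S_i = \beta_0 V0_i + \beta_1 V0_i^2 + \tilde{E}_i
\;\Longrightarrow\;
\frac{S_i}{V0_i} = \beta_0 + \beta_1 V0_i + \frac{\tilde{E}_i}{V0_i},
\qquad\text{so } E_i = \frac{\tilde{E}_i}{V0_i} .
```

Both forms share the coefficients $\beta_0$ and $\beta_1$ but differ in what they assume about the errors: a constant standard deviation $\sigma$ for $E_i$ corresponds to a standard deviation $\sigma \cdot V0_i$ for $\tilde{E}_i$, growing with velocity.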


Reproducibility

In some applications, replication is not or cannot be based on the same input variable values.

→ Variables that should be kept constant but cannot: → Include in regression model!

Fit a joint model for the data of the original and the replication study with a grouping variable “Study” and all interactions of it with the interesting variables. (Possibly with a model for correlation of errors $E_i$.)

This allows for a differential interpretation of the parts where reproducibility has and has not been achieved.
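For a single input variable, the joint model with a “Study” factor and its interaction can be sketched without any regression library. The code relies on the fact that, for ordinary least squares with all interactions included, the coefficients of the joint model equal those of separate fits per study; the interaction terms are then the differences between the studies. Function names and data are invented for illustration, and a correlation model for the errors is not covered.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def joint_model_coefficients(x_o, y_o, x_r, y_r):
    """Coefficients of y ~ x + Study + Study:x, with the original study as reference."""
    b0_o, b1_o = fit_line(x_o, y_o)
    b0_r, b1_r = fit_line(x_r, y_r)
    return {
        "intercept": b0_o,
        "x": b1_o,
        "study_r": b0_r - b0_o,    # shift of the level in the replication
        "study_r:x": b1_r - b1_o,  # change of the effect of x in the replication
    }


# Invented data: same slope in both studies, different level, different x values.
coef = joint_model_coefficients(
    x_o=[0.0, 1.0, 2.0, 3.0], y_o=[1.0, 3.0, 5.0, 7.0],
    x_r=[1.0, 2.0, 3.0, 4.0], y_r=[3.5, 5.5, 7.5, 9.5],
)
print(coef)
```

In this made-up case the effect of x is reproduced (interaction 0) while the level is not (shift 0.5), which is the kind of differential interpretation described above. In practice each term would be judged by its confidence interval.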


Model development

Adapt to the data:

transformations

variable selection

interactions

Example: Distances needed for stopping freight trains. Result:

S/V0 ~ Inclin + Lambda + Length + Type + Lambda^2 + V0 + V0:(Inclin + Lambda + Length + Type) + V0:(Inclin:Lambda + Inclin^2 + Lambda^2) + V0^2 + (V0^2):Length

→ The resulting model is certainly not the correct one!

What is “the correct model”, anyway?

And when is the model incorrect?

For prediction, only fitted values are important.

For scientific knowledge, approx. “correct” form may be enough.

Reproducibility: Model selection is a non-reproducible process, except for formalized steps, strategy!

Formal variable selection → methods for adequate inference: family-wise error rate; “Post-selection Inference (POSI)”

Model Development → Many researcher degrees of freedom!

Should it be banned? Do not forbid my favorite computer game, please!

The dilemma of “exploratory data analysis”: A creative process leading to new insight vs. fuzzy inference

→ Yet another version of the dilemma of advancing science!

Remedy: Separate exploratory and “confirmatory” studies!


Reproducibilities

Recomputability (= “reproducibility” in the narrow sense)

Repeatability: Sufficient description of Material and Methods

Replication, successful (= “replicability”): close vs. conceptual replication

Data challenge of a theory (“conceptual replication”)


Distinguish types of confirmation:

+ Repetition = “close replication”: Same setting → should produce response values within “data compatibility”.

+ Generalization = “conceptual replication”, extrapolation:

Vary the situation to check the “robustness” of the result.

Data Challenge

+ Extension: Generalize the problem.

Distinction is more useful in more complicated situations (regression).

9. Conclusions

Where and when is reproducibility a useful concept?

9.1 “Exact” sciences ...

(well: “quantitative, empirical part of sciences”)

... Physics, Chemistry, Biology, Medicine, Life sciences.

Reproducibility is an important principle to keep in mind.

Feasible? Sometimes. Needs motivation, skill & luck. Recognition?

Data Challenge, Confirmation

Science is not only about collecting facts

that stand the criterion of reproducibility, but about

generating theories (in a wide sense) that connect the facts.


Pre-registration!

Repetition studies should be pre-registered.

The journal that published the original result should be prepared to accept and list pre-registrations of its repetition (if the design is ok) and publish the result regardless of the outcome.

→ Journal Section “Replication”

Recommendation:

Perform combined study for replication and generalization and/or extension.

Each PhD thesis could consist of a replication phase, followed by a generalization study.

9.2 Social sciences, ...

– Macro-Economics: Economy only exists once, no replication.

– Society, History: same

– Psychology: Circumstances (therapist, institution, culture) are difficult to replicate.

These sciences should not be reduced to their quantitative parts!

What about philosophy and religion?

Good for discussions over lunch.


Messages

Avoid significance tests and P-values. Use confidence intervals! → Teaching!

Precise replication in the sense of compatibility of quantitative results (non-significant difference) is rare (outside students’ lab classes in physics).

It becomes somewhat more realistic if models contain a study-to-study variance component.

Dilemma of advancing science: Exploratory and confirmatory steps.

→ Data Challenge

Instead of mere replication studies, perform confirmation / generalization studies!

Replication is only applicable to empirical science.

There are other modes of thinking that should be recognized as “science” in the broad sense (“Wissenschaft”).

What is confirmation in these fields?

→ In what sense / to what degree should replication be a requirement for serious research?

Thank you for your endurance


References

Atmanspacher, H. and Maasen, S. (eds) (2016). Reproducibility: Principles, Problems, Practices, and Prospects, Wiley.

Clemens, M. A. (2015). The meaning of failed replications: A review and proposal, J. Econ. Surv. 10?: 10.1111/joes.12139.

Gelman, A., Hill, J. and Yajima, M. (2009). Why we (usually) don’t have to worry about multiple comparisons, arXiv:0907.2478 [stat.AP].

Ioannidis, J. (2005). Why most published research findings are false, PLoS Medicine 2: 696–701.

Klein, R. et al. (2014). Investigating variation in replicability, Soc Psychol 45(3): 142–152. “Manylabs”: repr of 17 effects

Open Science Collaboration (2015). Estimating the reproducibility of psychological science, Science 349: 943–952.
