When is a Replication Successful? ... and more.

(1)

Werner Stahel

Seminar für Statistik, ETH Zürich

Zurich, 23 January 2020

(2)

Science

... aims at production of “real” knowledge.

real = reproducible −→ plausible theory

Reproducibility = transparency / replicability, successful!

We all know:

• Ioannidis, 2005, PLOS Med. 2: Why most published research findings are false.
  −→ many papers, newspaper articles, round tables, editorials of journals, ...

(3)

“Open Science Collaboration”, 2015

“Estimating the reproducibility of psychological science”

Only 36% “success”!

Success? = Significant again! NHST (Null Hypothesis Statistical Testing)

“Only 36%”? What should we expect?

Why? Selective reporting bias

• Garden of forking paths
• Reporting, within study
• Publication: ± only significant results!

−→ Too often significant, biased estimated effect.

(4)

Science Mining

Gold Mining: Find nuggets!

Dig through lots of rock! Often very inefficient. Tedious!

Popular mining tool: The NHST filter

finds a lot of garbage −→ “P-value problem”

(5)

Overview

1. Science versus NHST
2. Success of replication
3. We need strategies!

(6)

1. Science versus NHST

The NHST paradox

If a study is undertaken to estimate an effect, it is not plausible that it be exactly 0.

−→ A (tiny) effect exists −→ for n → ∞ the power → 1.

NHST says whether we have chosen a large enough sample to find a formally significant effect.

Scientific Question must be: Is the effect RELEVANT?

This needs a relevance threshold!
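
The NHST paradox can be illustrated numerically. The following is a minimal sketch, assuming a two-sided one-sample z-test with known unit variance; the function name `z_test_power` and the effect value 0.01 are hypothetical choices for illustration, not taken from the talk. It shows that the power for a tiny but non-zero standardized effect grows towards 1 as n increases.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(delta, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    delta: true standardized effect (mean / standard deviation), an assumed value
    n:     sample size
    alpha: significance level
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n)  # mean of the test statistic under the true effect
    # Probability that the test statistic falls outside [-z_crit, z_crit]
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


tiny_effect = 0.01  # hypothetical, practically irrelevant effect
for n in (100, 10_000, 1_000_000):
    print(n, round(z_test_power(tiny_effect, n), 3))
```

With an effect this small, the test is close to its nominal level for n = 100 and rejects almost surely for n = 1,000,000, although the effect itself has not become any more relevant.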

(7)

Presentation of results: confidence intervals.

Ban P-values in favor of confidence intervals! DO IT!

Proposal by 71 authors: Lower significance level!

Paradox still applies (larger n needed)! Not helpful!

The ASA’s statement on p-values: context, process, and purpose, 2016, The American Statistician

Retire statistical significance!, 2019, Nature, > 800 signatories

What should scientists do routinely?

The problem is the lack of a scientific question as long as no relevance threshold is specified!

(8)

Classification of results (... of a single study that estimates an effect)

[Figure: effect axis with 0 and the relevance threshold (relevance th.) marked]

Rlv: Relevant
Sig: Significant
Amb: Ambiguous
Ngl: Negligible
Ctr: Contradicting

−→ new filter: Relevance instead of significance

= Confidence interval above relevance threshold

Selective reporting is not eliminated, but there will be less garbage.
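
The classification can be sketched as a small function of the confidence interval and the relevance threshold. Only the rule for Rlv (confidence interval above the relevance threshold) is stated on the slide; the boundaries used for Sig, Amb, Ngl and Ctr below are assumptions made for illustration, as is the function name `classify_effect`.

```python
def classify_effect(ci_low, ci_high, threshold):
    """Classify an estimated effect from its confidence interval.

    Assumes the expected effect direction is positive and threshold > 0.
    Only the "Rlv" rule is taken from the slide; the others are assumed.
    """
    if threshold <= 0:
        raise ValueError("relevance threshold must be positive")
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")

    if ci_high < 0:
        return "Ctr"  # assumed: whole interval below 0, contradicts the expected sign
    if ci_low > threshold:
        return "Rlv"  # from the slide: interval entirely above the threshold
    if ci_high < threshold:
        return "Ngl"  # assumed: whole interval below the threshold
    if ci_low > 0:
        return "Sig"  # assumed: above 0, but the threshold lies inside the interval
    return "Amb"      # assumed: interval contains both 0 and the threshold


print(classify_effect(0.05, 0.40, threshold=0.2))
```

In this sketch a significant result whose interval lies entirely below the threshold is counted as negligible, which is one possible reading of the figure.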

(9)

Choice of relevance threshold?

– absolute

– relative to effect size, e.g. 10%

– relative to random variation of data (Cohen’s d), e.g. 20%

(10)

2. Success of replication

TWO aspects:

• “sigag”: significant again −→ “relag”: relevant again!
• “Compatible” = no significant difference of effect estimates
  significant? Not a scientific problem! −→ relevant!

Ask a scientific question!

“True” Effect Size Difference = parameter ED = ψ^(r) − ψ^(o)

−→ estimate! ... with confidence interval −→ classify!
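
As a minimal sketch of the "estimate with confidence interval" step, the effect difference ED can be estimated from the two effect estimates and their standard errors. This assumes that the superscripts r and o stand for replication and original study, that the two estimates are independent and approximately normal, and that the standard errors are known; the function name `effect_difference_ci` and the numbers in the usage line are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def effect_difference_ci(est_r, se_r, est_o, se_o, level=0.95):
    """Estimate ED = psi_r - psi_o with a normal-approximation confidence interval.

    est_r, se_r: effect estimate and standard error from the replication
    est_o, se_o: effect estimate and standard error from the original study
    Returns (estimate, lower bound, upper bound).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = est_r - est_o
    # Variances add for the difference of two independent estimates
    se_diff = sqrt(se_r ** 2 + se_o ** 2)
    return diff, diff - z * se_diff, diff + z * se_diff


# Hypothetical standardized effects: replication 0.3, original 0.5
print(effect_difference_ci(0.3, 0.1, 0.5, 0.1))
```

The resulting interval can then be compared with a relevance threshold in the same way as the interval for a single effect.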

(11)

Classification of replication results

based on
IEffr = confidence interval for the effect from the replication and
EDS = confidence interval for Effect Difference (Standardized)

(Cnf) Confirmation: IEffr: Rlv, EDS small (Ngl or Amb)
(CnfW) Weak Confirmation: IEffr: Sig
(Att) Attenuation: EDS: Rlv = relevant effect diff. – even if IEffr: Rlv
(Enh) Enhancement: if the replication suggests a clearly stronger effect, IEffr: Rlv, EDS significantly positive (Ctr)
(Amb) Ambiguous: if IEffr: Amb
(Anh) Annihilation: IEffr: Ngl
(Ctr) Contradiction: IEffr: Ctr
(Drp) Dropout: replication failed

(12)

OSC15: Paired sample t-tests from Open Science Collaboration, 2015

[Figure: effect size, standardized; axis ticks at −1, 0, 0.2, 1, 2]

St.15   St.33   St.6   St.113   St.7   St.153
95      40      24     125      100    8
242     40      32     177      15     8
CnfW    CnfW    CnfW   Enh      Amb    Amb

(14)

Yet another complication!

Heterogeneity = between studies variance component

Experience from chemistry: interlaboratory studies.

Expected effect size differs between studies.

−→ “True Effect Difference” is never 0! EDS ≠ 0

−→ Random effects version of meta-analysis!

Need several studies to estimate between studies variance!

A valid confidence interval for the effect size can only be obtained from several studies, ≥ 5, say.

A single replication with good power is not enough!

−→ new power calculations for multiple replications needed!
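
To make the random effects idea concrete, here is a minimal sketch using the DerSimonian-Laird moment estimator for the between studies variance. This particular estimator is a choice made for illustration and is not named in the talk; the function name `dersimonian_laird` and the input numbers are hypothetical.

```python
from math import sqrt


def dersimonian_laird(estimates, std_errors):
    """Random effects meta-analysis with the DerSimonian-Laird estimator.

    estimates:  effect estimates, one per study
    std_errors: their within-study standard errors
    Returns (pooled effect, its standard error, between studies variance tau2).
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies with matching standard errors")

    # Fixed-effect (inverse variance) weights and pooled estimate
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q and the moment estimate of the between studies variance
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random effects weights include the between studies variance
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum_w_star
    return pooled, sqrt(1.0 / sum_w_star), tau2


# Hypothetical standardized effects from three studies with equal precision
print(dersimonian_laird([0.0, 0.5, 1.0], [0.1, 0.1, 0.1]))
```

When the studies disagree, the standard error of the pooled effect is larger than under the fixed-effect model. With only a few studies the variance estimate tau2 is itself very imprecise, which is the reason given above for requiring several replications.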

(15)

3. We need strategies for “knowledge production”!

• Claim of basic interest −→ multiple pre-registered replications
  – Close or conceptual replications?
  – Choice of threshold?
  Subject for professional society.
• Exploratory study −→ One close pre-registered replication
  – “confirmation” −→ more replications/generalization
  – “attenuation” or “ambiguous” −→ more close replication
  – “contradiction” −→ ...
• “Original” studies without replication = exploratory
  −→ indications for theoretical hypotheses to be examined by pre-registered studies.
  Still important!

(16)

Messages

• NO scientific QUESTION −→ NO rational ANSWER
  Scientific questions regarding an effect need a relevance threshold.
  NHST is not scientific −→ confidence intervals for parameters, not test statistics or P-values!
• Success of a replication should be described by > 2 categories based on outcome of replication and Effect Size Difference.
• One replication is not enough, due to between study variation.

A strategy is needed for advancing scientific knowledge!

Sustainable empirical research needs a change of culture!