(1)

When is a Replication Successful?

... and more.

Werner Stahel

Seminar für Statistik, ETH Zürich

Zurich, 23 January 2020

(2)

Science

... aims at production of “real” knowledge.

real = reproducible → plausible theory

Reproducibility = transparency / replicability, successful!

We all know:

Ioannidis, 2005, PLOS Med. 2: “Why most published research findings are false.”

→ many papers, newspaper articles, round tables, editorials of journals, ...

(3)

“Open Science Collaboration”, 2015

“Estimating the reproducibility of psychological science”

Only 36% “success”!

Success? = Significant again! NHST (Null Hypothesis Statistical Testing)

“Only 36%”? What should we expect?

Why? Selective reporting bias

– Garden of forking paths

– Reporting, within study

– Publication: ± only significant results!

→ Too often significant, and biased effect estimates.
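
The question “What should we expect?” can be explored with a small simulation. The sketch below is only an illustration under assumed values (true standardized effect 0.2, n = 30, known standard deviation 1, two-sided 5% level); none of these numbers come from the Open Science Collaboration data. It publishes only significant originals and then replicates each of them once with the same n.

```python
import math
import random
from statistics import NormalDist, mean

def simulate(true_d=0.2, n=30, studies=20000, alpha=0.05, seed=1):
    """Simulate studies, 'publish' only significant ones, replicate each once."""
    rng = random.Random(seed)
    se = 1.0 / math.sqrt(n)               # standard error of the mean (sd = 1 assumed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    published = []                        # estimates that passed the NHST filter
    significant_again = 0                 # replications that are "sigag"
    for _ in range(studies):
        original = rng.gauss(true_d, se)  # estimated effect of the original study
        if abs(original) / se <= z_crit:
            continue                      # not significant: never published
        published.append(original)
        replication = rng.gauss(true_d, se)
        same_sign = replication * original > 0
        if same_sign and abs(replication) / se > z_crit:
            significant_again += 1
    return mean(published), significant_again / len(published)

mean_published, sigag_rate = simulate()
print(f"true effect: 0.20, mean published estimate: {mean_published:.2f}")
print(f"share of replications significant again: {sigag_rate:.0%}")
```

Under these assumptions the published estimates overstate the true effect, and the share of “significant again” replications is close to the power of a single study, not close to 100%.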

(4)

Science Mining

Gold Mining: Find nuggets!

Dig through lots of rock! Often very inefficient. Tedious!

Popular mining tool: The NHST filter

finds a lot of garbage → “P-value problem”

(5)

Overview

1. Science versus NHST

2. Success of replication

3. We need strategies!

(6)

1. Science versus NHST

The NHST paradox

If a study is undertaken to estimate an effect, it is not plausible that the effect is exactly 0.

→ A (tiny) effect exists → for n → ∞, the power → 1.

NHST says whether we have chosen a large enough sample to find a formally significant effect.
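
A minimal numerical sketch of the paradox, assuming a two-sided one-sample z-test with known standard deviation 1 and a hypothetical tiny standardized effect of 0.01 (both are illustrative assumptions, not values from the talk):

```python
import math
from statistics import NormalDist

def power_two_sided_z(effect, n, alpha=0.05):
    """Power of a two-sided one-sample z-test for a standardized effect."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect * math.sqrt(n)         # effect measured in standard-error units
    # Probability that the test statistic lands in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

tiny_effect = 0.01                        # hypothetical, practically irrelevant effect
for n in (100, 10_000, 1_000_000):
    print(n, round(power_two_sided_z(tiny_effect, n), 3))
```

The power rises towards 1 as n grows, although the effect itself stays as small as before.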

Scientific Question must be: Is the effect RELEVANT?

This needs a relevance threshold!

(7)

Presentation of results: confidence intervals.

Ban P-values in favor of confidence intervals! DO IT!

Proposal by 71 authors: Lower significance level!

Paradox still applies (larger n needed)! Not helpful!

The ASA’s statement on p-values: context, process, and purpose, 2016, The American Statistician

Retire statistical significance!, 2019, Nature, > 800 signatories

What should scientists do routinely?

The problem is the lack of a scientific question as long as no relevance threshold is specified!

(8)

Classification of results (... of a single study that estimates an effect)

[Figure: effect axis with 0 and the relevance threshold marked]

Rlv: Relevant

Sig: Significant

Amb: Ambiguous

Ngl: Negligible

Ctr: Contradicting

→ new filter: Relevance instead of significance

= Confidence interval above relevance threshold

Selective reporting is not eliminated, but there will be less garbage.
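
The classification can be sketched as a small function. Only the rule for Rlv (confidence interval above the relevance threshold) is stated on the slides; the boundaries used for the other four categories are assumptions made for this illustration, and the function name is hypothetical.

```python
def classify_effect(lower, upper, threshold):
    """Classify a confidence interval [lower, upper] against 0 and a relevance threshold."""
    if lower > upper or threshold <= 0:
        raise ValueError("need lower <= upper and threshold > 0")
    if lower > threshold:
        return "Rlv"   # from the slides: whole interval above the threshold
    if upper < 0:
        return "Ctr"   # assumed: whole interval below 0
    if upper < threshold:
        return "Ngl"   # assumed: whole interval below the threshold
    if lower > 0:
        return "Sig"   # assumed: above 0, but the threshold lies inside the interval
    return "Amb"       # assumed: interval contains both 0 and the threshold

# Hypothetical intervals with a relevance threshold of 0.2
for ci in [(0.3, 0.6), (0.05, 0.5), (-0.1, 0.5), (-0.1, 0.1), (-0.5, -0.1)]:
    print(ci, classify_effect(*ci, threshold=0.2))
```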

(9)

Choice of relevance threshold?

– absolute

– relative to effect size, e.g. 10%

– relative to random variation of data (Cohen’s d), e.g. 20%

(10)

2. Success of replication

TWO aspects:

“sigag”: significant again → “relag”: relevant again!

“Compatible” = no significant difference of effect estimates

Significant? Not a scientific problem! → relevant!

Ask a scientific question!

“True” Effect Size Difference = parameter ED = ψ^(r) − ψ^(o), where (r) marks the replication and (o) the original study

→ estimate! ... with confidence interval → classify!
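
A minimal sketch of the estimation step, assuming two independent studies whose effect estimates are approximately normal with known standard errors. The function name and the numbers are hypothetical and serve only as an illustration.

```python
import math
from statistics import NormalDist

def effect_difference_ci(est_o, se_o, est_r, se_r, level=0.95):
    """Estimate ED = psi_r - psi_o with a normal-approximation confidence interval."""
    diff = est_r - est_o
    se_diff = math.sqrt(se_o ** 2 + se_r ** 2)   # variances add for independent studies
    q = NormalDist().inv_cdf(0.5 + level / 2)    # two-sided normal quantile
    return diff, diff - q * se_diff, diff + q * se_diff

# Hypothetical estimates and standard errors, not taken from any study
ed, lower, upper = effect_difference_ci(est_o=0.60, se_o=0.20, est_r=0.25, se_r=0.10)
print(f"ED = {ed:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

The resulting interval can then be compared with a relevance threshold in the same way as the interval for a single effect.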

(11)

Classification of replication results

based on

IEffr = confidence interval for the effect from the replication and

EDS = confidence interval for Effect Difference (Standardized)

(Cnf) Confirmation: IEffr: Rlv, EDS small (Ngl or Amb)

(CnfW) Weak Confirmation: IEffr: Sig

(Att) Attenuation: EDS: Rlv = relevant effect difference, even if IEffr: Rlv

(Enh) Enhancement: if the replication suggests a clearly stronger effect, IEffr: Rlv, EDS significantly positive (Ctr)

(Amb) Ambiguous: if IEffr: Amb

(Anh) Annihilation: IEffr: Ngl

(Ctr) Contradiction: IEffr: Ctr

(Drp) Dropout: replication failed
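
The list above can be written as a lookup on the two category labels. This is a sketch only: the slide does not say which rule wins when several apply, so the order of the checks is an assumption, and combinations the slide does not mention raise an error. The function name is hypothetical.

```python
def classify_replication(ieffr, eds, failed=False):
    """Map the categories of IEffr and EDS to a replication outcome label."""
    if failed:
        return "Drp"                      # Dropout: replication failed
    if ieffr == "Ctr":
        return "Ctr"                      # Contradiction
    if ieffr == "Ngl":
        return "Anh"                      # Annihilation
    if ieffr == "Amb":
        return "Amb"                      # Ambiguous
    if eds == "Rlv":
        return "Att"                      # Attenuation, even if IEffr is Rlv
    if ieffr == "Rlv" and eds == "Ctr":
        return "Enh"                      # Enhancement
    if ieffr == "Rlv" and eds in ("Ngl", "Amb"):
        return "Cnf"                      # Confirmation
    if ieffr == "Sig":
        return "CnfW"                     # Weak confirmation
    raise ValueError(f"combination not covered by the slide: {ieffr}, {eds}")

print(classify_replication("Rlv", "Ngl"))
print(classify_replication("Sig", "Amb"))
print(classify_replication("Rlv", "Rlv"))
```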

(12)

OSC15: Paired-sample t-tests from Open Science Collaboration, 2015

[Figure: effect size, standardized; axis ticks at −1, 0, 0.2, 1, 2]

Study:    St.15   St.33   St.6   St.113   St.7   St.153

          95      40      24     125      100    8

          242     40      32     177      15     8

Class:    CnfW    CnfW    CnfW   Enh      Amb    Amb

(14)

Yet another complication!

Heterogeneity = between-studies variance component

Experience from chemistry: interlaboratory studies.

Expected effect size differs between studies.

→ “True Effect Difference” is never 0! EDS ≠ 0

→ Random effects version of meta-analysis!

Need several studies to estimate between-studies variance!

A valid confidence interval for the effect size can only be obtained from several studies, ≥ 5, say.

A single replication with good power is not enough!

→ new power calculations for multiple replications needed!
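
A sketch of a random-effects analysis for several studies. The slides do not name an estimator; the DerSimonian-Laird moment estimator is used here as one common choice, with a normal-approximation interval. All input numbers are hypothetical.

```python
import math
from statistics import NormalDist

def random_effects_meta(estimates, std_errors, level=0.95):
    """DerSimonian-Laird random-effects pooling of independent study estimates."""
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies with matching standard errors")
    w = [1.0 / se ** 2 for se in std_errors]                # fixed-effect weights
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                      # between-studies variance
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]  # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return pooled, tau2, (pooled - z * se_pooled, pooled + z * se_pooled)

# Five hypothetical studies with equal standard errors but differing estimates
pooled, tau2, ci = random_effects_meta([0.1, 0.5, 0.9, -0.2, 0.6], [0.1] * 5)
print(f"pooled = {pooled:.2f}, tau^2 = {tau2:.3f}, CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

When the estimates scatter more than their standard errors explain, the between-studies variance is positive and the interval becomes much wider than a fixed-effect interval. With only a handful of studies the variance estimate itself is imprecise, so the normal-approximation interval tends to be too narrow.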

(15)

3. We need strategies for “knowledge production”!

Claim of basic interest → multiple pre-registered replications

– Close or conceptual replications?

– Choice of threshold?

Subject for professional society.

Exploratory study → one close pre-registered replication

– “confirmation” → more replications/generalization

– “attenuation” or “ambiguous” → more close replications

– “contradiction” → ...

“Original” studies without replication = exploratory

→ indications for theoretical hypotheses to be examined by pre-registered studies.

Still important!

(16)

Messages

NO scientific QUESTION → NO rational ANSWER

Scientific questions regarding an effect need a relevance threshold.

NHST is not scientific → confidence intervals for parameters, not test statistics or P-values!

Success of a replication should be described by > 2 categories based on outcome of replication and Effect Size Difference.

One replication is not enough, due to between study variation.

A strategy is needed for advancing scientific knowledge!

Sustainable empirical research needs a change of culture!
