Academic year: 2021

1 Data Analysis

1.1 Theory of Data Analysis ?!

a Courses in statistics:

Collection of methods for a fixed type of problems.

Exercises: Datasets asking for this methodology.

b Practice: Start from (scientific) question and (potential) data.

−→ Choose appropriate statistical methods!

In reality:

• Questions are often vague

−→ Sharpen to identify suitable methods

• Data are distorted, missing, ...

c Practice of Data Analysis is difficult to teach consistently.

−→ “Apprenticeship”, “Statistics Lab”

d Theory of Data Analysis?

This would need “soft science” methods. First step: concepts, such as

• Statistical Problem:

(“Scientific”) Question and (potential) Data.

• Strategy, consisting of steps (see course on Regression).

• Quality of a result:

How to measure? −→ Ranking of Strategies.

e Case Studies are useful!

f General Strategy to define and tackle a Statistical Problem?

A strategy consists of steps. Each step is defined by a step procedure that leads to step results and possibly to step decisions about the further steps.

2 Steps in a Statistical Project

Hints for efficient data analysis.

Using the regr0 package:

install.packages("regr0", repos="http://r-forge.r-project.org")

require(regr0)

Example script in the Appendix

2.1 Preparing

a Situation: Data generated by “client”.

Start a notes file with the client’s contact details, keywords from discussions, points to remember, and possibly dates and steps.

Use RStudio and knitr (need to change default ...)

b • Project goal −→ report, Introduction

• Nature of data:

Variables: target variable(s), primary and secondary expl.v.

−→ variable names to be used in models, graphs, ...

−→ table of names with explanations

Structure of the dataset:

– nature of observation units

– groupings

−→ variance components, random effects, correlations

c • Data transfer:

often Excel → csv,

d.proj <- read.csv(file)

names(d.proj) <- c(...)

Check result: str(d.proj), showd(d.proj)

−→ R-script data.R

• Data screening:

– ranges; factors −→ summary, str

– scatterplot matrix −→ pairs, plmatrix

−→ “basic data set”, store!

save(d.proj, file="data.rda")

2.2 Model(s)

a Most often, a type of regression model is needed for answering the project questions.

b • Type of target variable:

continuous-normal, survival-failure, binary, count, ordered, factor censored; multivariate regression

−→ choose appropriate model or use regr

• Random effects? −→ function lmer (package lme4)

c Model development: dangerous, but fascinating...

What you need to examine a model:

– fitting function, summary, drop1, plot, termplot

– regr0: regr, plot

• Many models for many subsets of the data or many variables:

Select 3 subproblems: 1 easy, 1 typical, 1 extreme.

Careful analysis for these 3, then for loop.

Document the 3 selected in main part of the report, others in Appendix
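The “three first, then loop” idea might be sketched as follows. The built-in iris data, the model formula and the grouping variable are placeholders chosen for illustration only; they are not part of the course material.

```r
## Sketch: fit the same model to every subset, after checking a few by hand.
## 'iris', the formula and the grouping variable are placeholders.
d.all <- iris
t.formula <- Sepal.Length ~ Petal.Length
t.subsets <- split(d.all, d.all$Species)   # one data frame per subproblem

## Step 1: careful analysis of a selected subproblem (here: the first one)
r.first <- lm(t.formula, data = t.subsets[[1]])
summary(r.first)

## Step 2: the same fit for all subproblems in a for loop, storing the results
r.fits <- list()
for (l.name in names(t.subsets)) {
  r.fits[[l.name]] <- lm(t.formula, data = t.subsets[[l.name]])
}

## Collect the slope estimates for a compact overview (e.g. for the Appendix)
t.slopes <- sapply(r.fits, function(r) coef(r)[["Petal.Length"]])
t.slopes
```

Storing the fitted objects in a list keeps the residual plots and summaries of each subproblem available for the Appendix.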

2.3 Interpretation and Conclusions

• Write down the conclusions explicitly, even if they seem “obvious”

from the numerical results.

• Give graphical representations of results and explain the displays.

• Make sure that you explicitly answer the original question(s) – or give reasons why you don’t.

2.5 Structure of a Report

a Target audience: Client and his/her “environment”.

Theory: Sketch main ideas, give basic reference.

b Structure: in scientific articles, the common structure is fixed:

“Introduction”, “Material and Methods”, “Results”, “Discussion”

This may be adequate. Sometimes preferable:

Integrate statistical methods with results

c Do not write a detective story! Do not withhold the results!

d Proposal for the structure:

• Introduction: Problem, background, earlier work.

• Data

• Statistical methods and results

• Discussion and outlook

– Summary and interpretation of results,

– Outlook on open questions and extensions of the analyses

e Write the summary at the end. It should be readable on its own.

f Appendices:

– detailed description of data,

– exact description of statistical methods

– “complaints”

– repetitive analyses

– analyses with uninteresting results

g Length of report: 5 - 10 pages (without appendices)

6 Statistical Studies and Consulting

6.1 Experience

a Statistical Consulting covers a wide spectrum:

• Knowledge and skill of the client and the consultant

• Goals of the consultancy:

“How should I interpret this output?”

“A reviewer has criticized the following: ...

How can I justify my approach?”

“I have been told to do a Conjoint Analysis.

Which program does this for me?”

“I have here an interesting data set.

I would like to apply multivariate methods.”

b Roles: The statistician is

• consultant with a limited task (limited liability)

• responsible for adequate statistical analysis of the data

• partner in the project, from planning to reporting / article

c Communication

• Who poses the (scientific) problem, and to whom should the answer be targeted?

What effort will the targeted persons make to understand the answer?

−→ Type of answer, approaches that can be understood.

• How much effort can be invested? (Money, time, energy)

d Critical points. At the beginning clarify:

• the problem (“scientific” question).

Informal questions are OK at the outset, but must be made precise!

−→ leads to methodology = models and procedures

• structure of the data: How have / will they be generated?

Search for groupings:

– blocks,

– “main/subplot” / “within/between subjects”,

– “closeness” in space or time.

e Why is data structure important?

Statistics = model + estimation of parameters

Estimation without indication of precision is meaningless.

−→ “standard error”

−→ confidence interval or test.

Determination of s.e. needs independent observations (or a good model for the dependencies).

If (positive) correlations among observations are neglected,

−→ s.e. will be under-estimated

−→ too short confidence intervals or “liberal tests” (= wrong!)

Look for independent groups of observations!

Do any analysis you like, as for individual observations, then use bootstrap over groups to get precision.
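A minimal sketch of such a bootstrap over groups, on simulated data. The number of groups, the group size, the estimator (a plain mean) and all object names are assumptions made for illustration.

```r
## Sketch: bootstrap over groups to obtain a standard error.
## Simulated data; group structure and sizes are assumptions.
set.seed(1)
n.groups <- 20
n.per <- 5
t.group <- rep(seq_len(n.groups), each = n.per)
t.geff <- rnorm(n.groups)[t.group]       # shared group effect -> positive correlation
t.y <- t.geff + rnorm(n.groups * n.per)

## The analysis "as for individual observations": here simply the mean
f.est <- function(y) mean(y)

## Resample whole groups with replacement; observations of a group stay together
t.ybygroup <- split(t.y, t.group)
n.boot <- 500
t.boot <- replicate(n.boot, {
  l.pick <- sample(names(t.ybygroup), replace = TRUE)
  f.est(unlist(t.ybygroup[l.pick]))
})

se.naive <- sd(t.y) / sqrt(length(t.y))  # treats all observations as independent
se.group <- sd(t.boot)                   # respects the grouping
c(naive = se.naive, group.bootstrap = se.group)
```

With positively correlated observations within groups, the group-bootstrap standard error will typically come out larger than the naive one, which is the under-estimation described above.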

7 Reproducibility – Crisis of Science

7.1 Paradigms

Reproducibility defines knowledge. “The Scientific Method”:

Scientific facts are those that have been reproduced independently.

... as opposed to belief, which is “irrational” for some of us.

Science should be the collection of knowledge that is “true”.

Well, not quite: the Big Bang is not reproducible. It is a theory, yet it is nevertheless called scientific knowledge.

In fact, empirical science needs theories as its foundation.

Reproducibility of facts defines science

– physics, chemistry, biology, life science = “Exact” Sciences

What is the role of reproducibility in Humanities and Arts?

• Humanities try to become “exact sciences” by adopting

“the scientific method”.

• Arts: A composition is a reproducible piece of music.

Reproducibility achieved by fixing notes.

Intonation only “reproducible” with recordings.

• Improvisation in music; mandala in “sculpture”:

Intention to make something unique, irreproducible.

Back to “exact” sciences!

7.2 The Crisis

Reproducibility is a myth in most fields of science!

• Ioannidis, 2005, PLOS Med. 2:

Why most published research findings are false.

−→ many papers, newspaper articles, round tables, editorials of journals, ...,

Topic of Collegium Helveticum, ETH Zurich −→ Handbook

• Tages-Anzeiger of Aug 28, 2015:

“Psychologie-Studien sind wenig glaubwürdig”

(Psychological studies have little credibility) ⇐ Science (journal)

7.3 Reproducibility: Empirical results

“Open Science Collaboration”, Science 349, 943-952, Aug 28, 2015:

“Estimating the reproducibility of psychological science”

100 research articles from high-ranking psychological journals.

260 collaborators attempt to reproduce 1 result for each.

Effect size could be expressed as a correlation

−→ P-values, confidence intervals.
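For an effect expressed as a correlation, a confidence interval can be sketched with Fisher’s z transformation. This is a standard approximation and not necessarily the procedure used in the study; the function name f.corci and all numbers below are made up for illustration.

```r
## Sketch: confidence interval for a correlation via Fisher's z transformation.
## r and n below are made-up numbers, not taken from the study.
f.corci <- function(r, n, level = 0.95) {
  z <- atanh(r)                       # Fisher z transform of the correlation
  se <- 1 / sqrt(n - 3)               # approximate standard error of z
  q <- qnorm(1 - (1 - level) / 2)
  tanh(c(lower = z - q * se, upper = z + q * se))   # back to correlation scale
}

ci.repl <- f.corci(r = 0.20, n = 100)   # hypothetical replication result
r.orig <- 0.45                          # hypothetical original estimate
ci.repl

## Does the replication interval cover the original estimate?
t.covered <- r.orig >= ci.repl[["lower"]] && r.orig <= ci.repl[["upper"]]
t.covered
```

The last line is the kind of check behind statements about how many replication intervals cover the original estimated effect.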

[Figure: P-values (axis: original study P-value)]

[Figure: Effect Size]

−→ Effect sizes are lower, as a rule, in the replication.

Significant difference in effect size?

was not studied!!!

Instead: only 47% of the confidence intervals of the reproduction studies covered the original estimated effect!

Similar results for pharmaceutical trials, genetic effects, ...

Note:

What is a success/failure of a reproduction? −→ not well defined!

... not even in the case of assessing just a single effect!

Why does replication fail?

Data manipulation? Biased experiment?

7.4 Multiple comparisons and multiple testing

Here is a common way of learning from empirical studies:

• visualize data,

• see patterns (unexpected, but with sensible interpretation),

• test if statistically significant,

• if yes, publish.

(cf. “industry producing statistically significant effects”)

The problem, formalized

7 groups, generated by random numbers ∼ N(0, 1). −→ H0 true!

[Figure: y plotted against group (1 to 7), with std.dev. and conf.int. marked for each group]

Test each pair of groups for a difference in expected values.

−→ 7 · 6/2 = 21 tests. P(rejection) = 0.05 for each test.

−→ Expected number of significant test results = 1.05!

Significant differences for 1 vs. 6 and 1 vs. 7

−→ Publish the significant result!

You will certainly find an explanation why it makes sense...

−→ Selection bias.
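The situation can be imitated by a small simulation: 7 groups of pure noise, all 21 pairwise t-tests at level 0.05, repeated many times. The group size (10), the number of runs (500) and the function name f.nsignif are assumptions for illustration.

```r
## Sketch: 7 groups of pure noise, all pairwise t-tests at level 0.05.
## Group size and number of simulation runs are assumptions.
set.seed(2)
n.groups <- 7
n.per <- 10
n.tests <- choose(n.groups, 2)   # 7 * 6 / 2 = 21 pairs
n.expected <- n.tests * 0.05     # expected number of rejections under H0

f.nsignif <- function() {
  l.g <- factor(rep(seq_len(n.groups), each = n.per))
  l.y <- rnorm(n.groups * n.per)              # H0 is true by construction
  l.p <- pairwise.t.test(l.y, l.g, p.adjust.method = "none")$p.value
  sum(l.p < 0.05, na.rm = TRUE)               # unused matrix cells are NA
}

t.nsig <- replicate(500, f.nsignif())
mean(t.nsig)        # average number of "significant" pairs per dataset
mean(t.nsig >= 1)   # share of datasets with at least one "significant" pair
```

The average count should come out close to the expected number computed from the 21 tests, even though no group differs from any other.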

Solutions for multiple (“all pairs”) comparisons:

• Make a single test for the hypothesis that all µ_g are equal!

−→ F-test for factors.

• Lower the level α for each of the 21 tests such that

P(≥ 1 significant test result) ≤ α = 0.05!

Bonferroni correction: divide α by number of tests.

−→ conservative testing procedure

−→ You will get no significant results

−→ nothing published (Are we back to testing? –

Considerations also apply to confidence intervals!)
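Both solutions can be sketched in base R on simulated noise data. The data, the group size and the object names are assumptions for illustration.

```r
## Sketch: overall F-test and Bonferroni-adjusted pairwise tests.
## Simulated noise data; group size is an assumption.
set.seed(3)
d.sim <- data.frame(g = factor(rep(1:7, each = 10)), y = rnorm(70))

## (a) a single F-test for the factor: are all group means equal?
r.aov <- anova(lm(y ~ g, data = d.sim))
p.F <- r.aov[["Pr(>F)"]][1]
p.F

## (b) all 21 pairwise tests, without and with Bonferroni adjustment
p.raw <- pairwise.t.test(d.sim$y, d.sim$g, p.adjust.method = "none")$p.value
p.bonf <- pairwise.t.test(d.sim$y, d.sim$g, p.adjust.method = "bonferroni")$p.value

## Comparing p with alpha/21 is the same as comparing min(1, 21 * p) with alpha
all.equal(p.bonf, pmin(21 * p.raw, 1))
```

The adjusted P-values are simply the raw ones multiplied by the number of tests (capped at 1), which shows why the procedure is conservative.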

In reality, it is even worse!

When exploring data, nobody knows how many hypotheses are

“informally tested” by visual inspection of informative graphs.

Exploratory data analysis – curse or benediction?

Solution?

One dataset – one test!

(or: only a small number of planned tests/confidence intervals, with Bonferroni adjustments (?))

Dream for statistician, nightmare for researchers!

Analogous:

Publication bias

7.5 Model development

... consists of adapting the (structure of the) model to the data.

Select:

a. the explanatory and nuisance variables (“full model”)

b. functional form (transformations, polynomials, splines)

c. interaction terms

d. possibly a correlation structure of the random errors

−→ systematically select the best fitting terms!

−→ overfitting!

Tradeoff between flexibility and parsimony.

Why should models be parsimonious?

Here: Intuition says that simple models reproduce better.
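A small simulation can make this intuition concrete. The true relation below is linear by construction; the polynomial degrees, the sample sizes and the function name f.sim are assumptions for illustration.

```r
## Sketch: flexible models fit the data at hand better but may reproduce worse.
## True relation is linear; degrees and sample sizes are assumptions.
set.seed(4)
f.sim <- function(n) {
  x <- runif(n, -1, 1)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n))
}
d.fit <- f.sim(30)      # data used for model development
d.new <- f.sim(1000)    # independent data, playing the role of a replication

t.deg <- 1:8
t.res <- t(sapply(t.deg, function(k) {
  r <- lm(y ~ poly(x, k), data = d.fit)
  c(degree = k,
    mse.fit = mean(residuals(r)^2),                               # fit to own data
    mse.new = mean((d.new$y - predict(r, newdata = d.new))^2))    # fit to new data
}))
t.res
```

The error on the development data can only decrease as terms are added, whereas the error on the new data has no such guarantee; comparing the two columns shows how much of the apparent gain carries over.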

−→ The resulting model is certainly not the correct one!

What is “the correct model”, anyway?

Reproducibility: Model selection is a non-reproducible process (except for formalized procedures)

Should it be banned?

−→ Yet another version of the dilemma of advancing science!

Summarizing:

Model development leads to severe reproducibility problems because of “Researcher’s degrees of freedom”

Adequate statistical procedures can solve the more formalized types of such problems.

−→ Model Selection Procedures.

7.6 Conclusions

Stepping procedure of advancing science:

1. Explore data freely, allowing all creativity.

Create hypotheses about relevant effects.

2. Conduct a new study to confirm the hypotheses (not H0!)

“Believe” effects that are successfully confirmed (with a sufficient magnitude to be relevant!)

1.* Use dataset in an exploratory attitude to generate new hypotheses.

it. Iterate until retirement.

Note that step 2 is a phony replication!

• Data Challenge, Confirmation

Science is not only about collecting facts

that stand the criterion of reproducibility, but about

generating theories (in a wide sense) that connect the facts.

Types of confirmation:

+ Reproduction: Same values of input variables

−→ should produce response values within variability of error distr.

+ Generalization = extrapolation: Extend the range of input var’s.

Check if regression function is still appropriate.

Data Challenge

+ Extension: Vary additional input variables to find adequate extension of the model.

Recommendation: Perform combined study for reproducibility and generalization and/or extension.

Psychology: Reproducibility of Concept

Quantify “concepts” such as intelligence.

Questionnaires or “tests” −→ quantified concept.

Study relationships between concepts (response) and e.g., socio-economic variables, or between concepts.

Confirm concepts by using different questionnaires / tests, hopefully getting “the same” concepts and their relations.

Messages

• Avoid significance tests and P-values. Use confidence intervals!

−→ Teaching!

• Precise reproducibility in the sense of compatibility of quantitative results (non-significant difference)

is rare (outside students’ lab classes in physics)

• The main problem is selection bias on different levels:

“Researcher degrees of freedom”,

informal (visual) preliminary analysis, model development, selection of results within paper, publication bias

• Dilemma of advancing science: Exploratory and confirmatory steps.

−→ Data Challenge

Instead of mere reproducibility studies, perform confirmation / generalization studies!

• Reproducibility is only applicable to empirical science.

There are other modes of thinking that should be recognized as “science” in the broad sense (“Wissenschaft”).

What is confirmation in these fields?

• −→ In what sense / to what degree

should reproducibility be a requirement for serious research?

Thank you for your endurance

References

Atmanspacher, H. and Maasen, S. (eds) (2016). Reproducibility: Principles, Problems, Practices, and Prospects, Wiley.

Ioannidis, J. (2005). Why most published research findings are false, PLoS Medicine 2: 696–701.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science, Science 349: 943–952.

Appendix

Example R scripts for birthrates example.

r-birthr-data.R

## require(regr0)

## ---

## aa <- count.fields("~/data/fertility-switz.dat")
dd <- read.table("~/data/birthratesOrig.dat", header=FALSE)
names(dd)

## variable names and descriptions
vv <- read.table("~/data/birthratesExt-vars.csv", sep=";", header=FALSE,
                 stringsAsFactors=FALSE)

names(dd) <- vv[-nrow(vv),1] ## last variable will be added below
d.fertilityX <- dd

## extract useful variables

tj <- c(2,1,19,6:13,26:32,42:43,47,48)
dd <- d.fertilityX[,tj]

names(dd) <- c("fertility", "fertTotal","infantMort",

"catholic", "single24","single49",

"eAgric","eIndustry","eCommerce","eTransport","eAdmin",

"german","french","italian","romansh","gradeHigh","gradeLow",

"educHigh","bornLocal","bornForeign","canton","district")

## altitude: create factor indicating the predominant layer
## (done for having a sensible factor among the explanatory variables)
t.a <- apply(d.fertilityX[,c("LOWALT","ALT500","ALT1000")],1, which.max)
dd$altitude <- ordered(t.a, labels=c("low","middle","high"))

## standardization

t.m <- ceiling(log10(apply(dd[,1:18],2,max)-1))
t.m["bornForeign"] <- 4

for (j in names(t.m)) dd[,j] <- dd[,j]/10^(t.m[j]-2)

## language: generate factor

td <- dd[,c("german","french","italian","romansh")]

## range(apply(td,1,sum)) ## [1] 97.5 100.0
tl <- td>60
ti <- which(apply(tl,1,sum)!=1)
td[ti,]

tv <- sapply(apply(tl,1, which ),
             function(x) if(length(x)) x else c(romansh=4) )
dd$language <- factor(names(tv))

##- table(dd$language)

##- french german italian romansh

##- 47 116 10 9

tvd <- c(vv[tj,2],"altitude in three categories: low, medium, high",
         "dominating language: german, french, italian, romansh")
tvd <- sub("^ *","",tvd)

tv <- cbind(names(dd), tvd)

## store

d.fertility <- dd

write.table(d.fertility, file="~/data/birthratesRed.csv", sep=";")

write.table(tv, file="~/data/birthratesRed-vars.csv", sep=";", quote=F, col.names=F)

## ---

## link to dataset swiss in R

ta <- paste(d.fertilityX$CATHOL/1000,d.fertilityX$IG/10)
tb <- paste(swiss$Catholic,swiss$Fertility)
ti <- match(ta,tb)

## sumna(ti) + nrow(swiss)
d.fertilityX$RowInSwiss <- ti

write.table(d.fertilityX, file="~/data/birthratesExt.csv", sep=";")

## ======================================================

## enrich the ’swiss’ dataset

sum(duplicated(d.fertility$catholic))
ti <- match(tb,ta)

## sumna(ti)

table(d.fertility$canton[ti])

d.swiss <- cbind(swiss, Language=d.fertility$language[ti],
                 Canton=factor(d.fertilityX$Canton[ti]),
                 District=I(as.character(d.fertilityX$District[ti])))
write.table(d.swiss, file="~/data/swissExt.csv", sep=";")

r-birthr-ana.R

## Regression models for birthrate data

## ---

## package regr0: more "service"

## install the package. Do this only once!

## install.packages("quantreg")

## install.packages("regr0", repos="http://r-forge.r-project.org")
require(regr0)

## get data
dd <- read.csv("~/data/birthrates.csv", sep=";")
names(dd)
head(dd) # tail(dd)
showd(dd) ## regr0
str(dd)

## screen
summary(dd)
pairs(dd)

plmatrix(dd) ## regr0

## ---

## multiple regression

## model formula

t.model <- fertility ~ eAgric + single24 + catholic + german
plot(t.model, data=dd)
plmatrix(t.model, data=dd) ## regr0

r.mult <- lm(t.model, data=dd)
summary(r.mult)

r.m <- regr(t.model, data=dd)
r.m

## ---

## a "large" model

r1 <- regr( fertility ~ catholic + single24 + single49 + eAgric + eIndustry + eCommerce + eTransport + eAdmin + german + french + educHigh + bornLocal + altitude + canton, data=dd )

r1

## with first aid transformations:

rf <- regr( fertility ~ asinp(catholic) + asinp(single24) + asinp(single49) + asinp(eAgric) +

asinp(eIndustry) + asinp(eCommerce) + asinp(eTransport) + asinp(eAdmin) + asinp(german) + asinp(french) +

asinp(educHigh) + asinp(bornLocal/100) + altitude + canton, data=dd )

tt <- rf$termtable

print(tt[tt[,"p.value"]>0.05,"p.value",drop=F],quote=F)

## stepwise model selection
rback <- step(rf, k=5)

formula(rback)
plot(rback)

## similar, with plain R:
rlm <- lm(formula(rback), data=dd)
plot(rlm)

termplot(rlm, partial.resid = TRUE)

## add squares and interactions?

( radd <- add1(rback) )

## ...
