1 Data Analysis
1.1 Theory of Data Analysis ?!
a Courses in statistics:
Collection of methods for a fixed type of problems.
Exercises: Datasets asking for this methodology.
b Practice: Start from (scientific) question and (potential) data.
−→ Choose appropriate statistical methods!
In reality:
• Questions are often vague
−→ Sharpen to identify suitable methods
• Data are distorted, missing, ...
c Practice of Data Analysis is difficult to teach consistently.
−→ “Apprenticeship”, “Statistics Lab”
d Theory of Data Analysis?
would need “soft science” methods. First step: Concepts, like
• Statistical Problem:
(“Scientific”) Question and (potential) Data.
• Strategy, consisting of steps (see course on Regression).
• Quality of a result:
How to measure? −→ Ranking of Strategies.
e Case Studies are useful!
f General Strategy to define and tackle a Statistical Problem?
Strategy consists of steps, defined by a step procedure leading to step results and
possibly to step decisions about the further steps.
2 Steps in a Statistical Project
Hints for efficient data analysis.
Using the regr0 package:
install.packages("regr0", repos="http://r-forge.r-project.org")
require(regr0)
Example script in the Appendix
2.1 Preparing
a Situation: Data generated by “client”.
Start a notes file with contact details of the client, key points of discussions to remember, possibly dates and steps.
Use RStudio, knitr (need to change default ...)
b • Project goal −→ report, Introduction
• Nature of data:
Variables: target variable(s), primary and secondary explanatory variables
−→ variable names to be used in models, graphs, ...
−→ table of names with explanations
Structure of the dataset:
– nature of observation units
– groupings
−→ variance components, random effects, correlations
c • Data transfer:
often Excel → csv,
d.proj <- read.csv(file)
names(d.proj) <- c(...)
Check result: str(d.proj), showd(d.proj) −→ R-script data.R
• Data screening:
– ranges; factors −→ summary, str
– scatterplot matrix −→ pairs, plmatrix
−→ “basic data set”, store!
save(d.proj, file="data.rda")
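The transfer and screening steps above can be sketched in base R as follows. This is only an illustration: the built-in swiss data stands in for a client's export, the temporary file and the short variable names are assumptions, and summary and pairs are used in place of showd and plmatrix from regr0.

```r
## Stand-in for the client's Excel export (hypothetical): write the
## built-in 'swiss' data to a temporary csv file.
t.file <- tempfile(fileext = ".csv")
write.csv(swiss, t.file, row.names = FALSE)

## Data transfer: read the file and assign short variable names
d.proj <- read.csv(t.file)
names(d.proj) <- c("fertility", "agric", "exam", "educ", "catholic", "infMort")
str(d.proj)

## Data screening: ranges, missing values, scatterplot matrix
summary(d.proj)
t.na <- colSums(is.na(d.proj))
pairs(d.proj)

## Store the "basic data set" (here in the session's temporary directory)
t.rda <- file.path(tempdir(), "data.rda")
save(d.proj, file = t.rda)
```

Keeping these lines in a script of their own (data.R) means the basic data set can be rebuilt from the raw file at any time.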
2.2 Model(s)
a Most often, a type of regression model is needed for answering the project questions.
b • Type of target variable:
continuous (normal), survival / failure time, binary, count, ordered, factor, censored; multivariate regression
−→ choose appropriate model or use regr
• Random effects? −→ function lmer (package lme4)
c Model development: dangerous, but fascinating...
What you need to examine a model:
– fitting function, summary, drop1, plot, termplot
– regr0: regr, plot
• Many models for many subsets of the data or many variables:
Select 3 subproblems: 1 easy, 1 typical, 1 extreme.
Careful analysis for these 3, then for loop.
Document the 3 selected in the main part of the report, the others in the Appendix.
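A minimal sketch of the tools listed above, using plain R and the built-in swiss data as a stand-in for project data (an assumption). The split into two subsets is hypothetical and only illustrates the "then for loop" step.

```r
## Fit and examine one model
r.lm <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality,
           data = swiss)
summary(r.lm)                       # coefficients and standard errors
t.d1 <- drop1(r.lm, test = "F")     # F-test for dropping each term in turn
plot(r.lm, which = 1)               # residuals against fitted values
termplot(r.lm, partial.resid = TRUE)

## Many models for many subsets: after a careful analysis of a few selected
## cases, repeat the same steps in a loop (hypothetical grouping).
t.group <- ifelse(swiss$Catholic > 50, "catholic", "protestant")
r.sub <- list()
for (g in unique(t.group))
  r.sub[[g]] <- lm(Fertility ~ Education + Infant.Mortality,
                   data = swiss[t.group == g, ])
t.coef <- sapply(r.sub, coef)       # one column of coefficients per subset
t.coef
```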
2.3 Interpretation and Conclusions
• Write down the conclusions explicitly, even if they seem “obvious”
from the numerical results.
• Give graphical representations of results and explain the displays.
• Make sure that you explicitly answer the original question(s) – or give reasons why you don’t.
2.5 Structure of a Report
a Target audience: Client and his/her “environment”.
Theory: Sketch main ideas, give basic reference.
b Structure: in scientific articles, the common structure is fixed:
“Introduction”, “Material and Methods”, “Results”, “Discussion”
This may be adequate. Sometimes preferable:
Integrate statistical methods with results
c Do not write a detective story! Do not withhold the results!
d Proposal for the structure:
• Introduction: Problem, background, earlier work.
• Data
• Statistical methods and results
• Discussion and outlook
– Summary and interpretation of results,
– Outlook on open questions and extensions of the analyses
e Write the summary at the end. It should be readable on its own.
f Appendices:
– detailed description of data,
– exact description of statistical methods
– “complaints”
– repetitive analyses
– analyses with uninteresting results
g Length of report: 5 - 10 pages (without appendices)
6 Statistical Studies and Consulting
6.1 Experience
a Statistical Consulting covers a wide spectrum:
• Knowledge and skill of the client and the consultant
• Goals of the consultancy:
“How should I interpret this output?”
“A reviewer has criticized the following: ...
How can I justify my approach?”
“I have been told to do a Conjoint Analysis.
Which program does this for me?”
“I have here an interesting data set.
I would like to apply multivariate methods.”
b Roles: The statistician is
• consultant with a limited task (limited liability)
• responsible for adequate statistical analysis of the data
• partner in the project, from planning to reporting / article
c Communication
• Who poses the (scientific) problem, and to whom should the answer be targeted?
What effort will the targeted persons make to understand the answer?
−→ Type of answer, approaches that can be understood.
• How much effort can be invested? (Money, time, energy)
d Critical points. At the beginning clarify:
• the problem (“scientific” question).
Informal questions are OK at the outset, but must be made precise!
−→ leads to methodology = models and procedures
• structure of the data: How have they been / will they be generated?
Search for groupings
– blocks,
– “main/subplot” / “within/between subjects”
– “closeness” in space or time.
e Why is data structure important?
Statistics = model + estimation of parameters
Estimation without indication of precision is meaningless −→ “standard error”
−→ confidence interval or test.
Determination of the s.e. needs independent observations (or a good model for the dependencies).
If (positive) correlations among observations are neglected,
−→ s.e. will be underestimated
−→ confidence intervals that are too short, or “liberal tests” (= wrong!)
Look for independent groups of observations!
Do any analysis you like, as for individual observations, then use a bootstrap over groups to get the precision.
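The bootstrap over groups can be sketched as follows. The data are simulated and the estimate is simply the overall mean; group sizes, the number of bootstrap samples, and all object names are assumptions for illustration. The point is that whole groups, not single observations, are resampled.

```r
set.seed(1)
## Simulated data (assumption): 20 groups of 10 observations each, with a
## group effect that makes observations within a group positively correlated.
n.g <- 20; n.i <- 10
d.sim <- data.frame(group = rep(1:n.g, each = n.i))
d.sim$y <- 5 + rep(rnorm(n.g), each = n.i) + rnorm(n.g * n.i)

## Naive standard error of the mean, treating all 200 values as independent
t.naive <- sd(d.sim$y) / sqrt(nrow(d.sim))

## Bootstrap over groups: resample whole groups with replacement and
## recompute the estimate on the stacked data.
f.est <- function(d) mean(d$y)       # replace by any analysis you like
t.split <- split(d.sim, d.sim$group)
t.boot <- replicate(1000, {
  t.s <- sample(length(t.split), replace = TRUE)
  f.est(do.call(rbind, t.split[t.s]))
})
t.se <- sd(t.boot)
c(naive = t.naive, groupBootstrap = t.se)
```

With positively correlated observations within groups, the naive standard error comes out clearly smaller than the one from the group bootstrap, which is the under-estimation described above.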
7 Reproducibility – Crisis of Science
7.1 Paradigms
Reproducibility defines knowledge. “The Scientific Method”:
Scientific facts are those that have been reproduced independently.
... as opposed to belief, which is “irrational” for some of us.
Science should be the collection of knowledge that is “true”.
Well, not quite: the Big Bang is not reproducible; it is a theory, yet it is called scientific knowledge nevertheless.
In fact, empirical science needs theories as its foundation.
Reproducibility of facts defines science: physics, chemistry, biology, life science = “Exact” Sciences
What is the role of reproducibility in Humanities and Arts?
• Humanities try to become “exact sciences” by adopting
“the scientific method”.
• Arts: A composition is a reproducible piece of music.
Reproducibility achieved by fixing notes.
Intonation only “reproducible” with recordings.
• Improvisation in music; mandala in “sculpture”:
Intention to make something unique, irreproducible.
Back to “exact” sciences!
7.2 The Crisis
Reproducibility is a myth in most fields of science!
• Ioannidis, 2005, PLOS Med. 2:
Why most published research findings are false.
−→ many papers, newspaper articles, round tables, editorials of journals, ...,
Topic of Collegium Helveticum, ETH Zurich −→ Handbook
• Tages-Anzeiger of Aug 28, 2015:
“Psychologie-Studien sind wenig glaubwürdig”
(Psychological studies have little credibility) ⇐ Science (journal)
7.3 Reproducibility: Empirical results
“Open Science Collaboration”, Science 349, 943-952, Aug 28, 2015:
“Estimating the reproducibility of psychological science”
100 research articles from high-ranking psychological journals.
260 collaborators attempt to reproduce 1 result for each.
Effect size could be expressed as a correlation
−→ P-values, confidence intervals.
[Figure: P-values]
[Figure: Original study P-value]
[Figure: Effect Size]
−→ Effect sizes are lower, as a rule, in the replication.
Is the difference in effect size significant?
This was not studied!
Instead: only 47% of the confidence intervals of the replication studies covered the originally estimated effect!
Similar results for pharmaceutical trials, genetic effects, ...
Note:
What is a success/failure of a reproduction? −→ not well defined!
... not even in the case of assessing just a single effect!
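To illustrate why "success" is not well defined even for a single effect, here is a sketch with two possible criteria for an effect expressed as a correlation. All numbers are hypothetical, and the use of Fisher's z transformation for the interval and for the comparison of two correlations is a choice made for this sketch, not necessarily the procedure of the cited study.

```r
## Confidence interval for a correlation via Fisher's z transformation
f.corci <- function(r, n, level = 0.95) {
  t.q <- qnorm(1 - (1 - level) / 2)
  tanh(atanh(r) + c(-1, 1) * t.q / sqrt(n - 3))
}

## Hypothetical numbers, for illustration only
r.orig <- 0.45; n.orig <- 30     # original study
r.repl <- 0.20; n.repl <- 80     # replication study

## Criterion 1: does the replication interval cover the original estimate?
t.ci <- f.corci(r.repl, n.repl)
t.covered <- t.ci[1] <= r.orig && r.orig <= t.ci[2]

## Criterion 2: is the difference between the two estimates significant?
## This also accounts for the uncertainty of the original estimate.
t.z <- (atanh(r.orig) - atanh(r.repl)) /
  sqrt(1 / (n.orig - 3) + 1 / (n.repl - 3))
t.p <- 2 * pnorm(-abs(t.z))

c(covered = t.covered, p.difference = t.p)
```

With these numbers the replication interval does not cover the original estimate, yet the difference between the two estimates is not significant at the 5% level, so the two criteria disagree.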
Why does replication fail?
Data manipulation? Biased experiment?
7.4 Multiple comparisons and multiple testing
Here is a common way of learning from empirical studies:
• visualize data,
• see patterns (unexpected, but with sensible interpretation),
• test if statistically significant,
• if yes, publish.
(cf. “industry producing statistically significant effects”)
The problem, formalized
7 groups, generated by random numbers ∼ N⟨0, 1⟩. −→ H0 true!
[Figure: y (range −2 to 2) against group (1 to 7), with standard deviation and confidence interval marked for each group]
Test each pair of groups for a difference in expected values.
−→ 7 · 6/2 = 21 tests. P⟨rejection⟩ = 0.05 for each test.
−→ Expected number of significant test results = 1.05!
Significant differences for 1 vs. 6 and 1 vs. 7
−→ Publish the significant result!
You will certainly find an explanation why it makes sense...
−→ Selection bias.
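The calculation above can be checked by simulation. In this sketch the group size of 20 and the number of simulated datasets are assumptions, and Welch t-tests are used for the pairwise comparisons.

```r
set.seed(7)
n.g <- 7; n.i <- 20                  # 20 observations per group (assumption)
t.pairs <- combn(n.g, 2)             # all 21 pairs of groups

## Number of significant pairwise tests in one dataset generated under H0
f.nsig <- function() {
  t.y <- matrix(rnorm(n.g * n.i), n.i, n.g)    # all groups ~ N(0, 1)
  t.p <- apply(t.pairs, 2,
               function(ij) t.test(t.y[, ij[1]], t.y[, ij[2]])$p.value)
  sum(t.p < 0.05)
}
t.nsig <- replicate(1000, f.nsig())
mean(t.nsig)          # average number of significant pairs per dataset
mean(t.nsig >= 1)     # share of datasets with at least one "finding"
```

The average should come out close to 21 · 0.05 = 1.05, and a sizeable share of the datasets should offer at least one publishable "effect" although H0 is true throughout.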
Solutions for multiple (“all pairs”) comparisons:
• Make a single test for the hypothesis that all µg are equal!
−→ F-test for factors.
• Lower the level α for each of the 21 tests such that P⟨≥ 1 significant test result⟩ ≤ α = 0.05!
Bonferroni correction: divide α by number of tests.
−→ conservative testing procedure −→ You will get no significant results
−→ nothing published (Are we back to testing? –
Considerations also apply to confidence intervals!)
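Both solutions are available in base R. The sketch below uses simulated data under H0 (7 groups of 20 observations, an assumption) and shows the overall F-test as well as Bonferroni-adjusted pairwise tests; adjusting the P-values by the factor 21 is equivalent to testing each pair at level 0.05/21.

```r
set.seed(3)
d.g <- data.frame(group = factor(rep(1:7, each = 20)), y = rnorm(140))

## Single test of the hypothesis that all group means are equal
t.f <- anova(lm(y ~ group, data = d.g))[["Pr(>F)"]][1]

## All 21 pairwise t-tests, unadjusted and Bonferroni-adjusted
t.raw  <- pairwise.t.test(d.g$y, d.g$group, p.adjust.method = "none")$p.value
t.bonf <- pairwise.t.test(d.g$y, d.g$group,
                          p.adjust.method = "bonferroni")$p.value
t.p <- t.raw[lower.tri(t.raw, diag = TRUE)]   # the 21 unadjusted P-values

## Bonferroni: multiply each P-value by the number of tests, capped at 1
all.equal(pmin(1, 21 * t.p), p.adjust(t.p, method = "bonferroni"))
sum(t.p < 0.05)                               # unadjusted "findings"
sum(t.bonf < 0.05, na.rm = TRUE)              # findings after adjustment
```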
In reality, it is even worse!
When exploring data, nobody knows how many hypotheses are
“informally tested” by visual inspection of informative graphs.
Exploratory data analysis – curse or blessing?
Solution?
One dataset – one test!
(or: only a small number of planned tests/confidence intervals, with Bonferroni adjustments (?))
A dream for statisticians, a nightmare for researchers!
Analogous:
Publication bias
7.5 Model development
... consists of adapting the (structure of the) model to the data.
Select:
a. the explanatory and nuisance variables (“full model”)
b. functional form (transformations, polynomials, splines)
c. interaction terms
d. possibly a correlation structure of the random errors
−→ systematically select the best fitting terms! −→ overfitting!
Tradeoff between flexibility and parsimony.
Why should models be parsimonious?
Here: Intuition says that simple models reproduce better.
−→ The resulting model is certainly not the correct one!
What is “the correct model”, anyway?
Reproducibility: Model selection is a non-reproducible process (except for formalized procedures)
Should it be banned?
−→ Yet another version of the dilemma of advancing science!
Summarizing:
Model development leads to severe reproducibility problems because of “Researcher’s degrees of freedom”
Adequate statistical procedures can solve the more formalized types of such problems. −→ Model Selection Procedures.
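A small simulation illustrates the overfitting caused by systematic selection. Here the response is unrelated to all 20 candidate variables, and backward selection by AIC with step is applied; sample size, number of variables, and object names are assumptions of this sketch.

```r
set.seed(11)
n <- 50; p <- 20
## Response and explanatory variables are independent noise
d.noise <- data.frame(y = rnorm(n), matrix(rnorm(n * p), n, p))

r.full <- lm(y ~ ., data = d.noise)
r.sel <- step(r.full, trace = 0)          # backward selection by AIC
t.sel <- summary(r.sel)$coefficients[-1, , drop = FALSE]
nrow(t.sel)                               # noise variables retained
sum(t.sel[, "Pr(>|t|)"] < 0.05)           # ... that look "significant"

## Reproduction check on new data generated by the same mechanism
d.new <- data.frame(y = rnorm(n), matrix(rnorm(n * p), n, p))
t.insample <- mean(residuals(r.sel)^2)
t.newdata <- mean((d.new$y - predict(r.sel, newdata = d.new))^2)
c(inSample = t.insample, newData = t.newdata)
```

Typically, some noise variables survive the selection and appear significant in the final summary, while the prediction error on new data is larger than the in-sample error. The P-values of the selected model do not account for the selection step.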
7.6 Conclusions
Stepping procedure of advancing science:
1. Explore data freely, allowing all creativity.
Create hypotheses about relevant effects.
2. Conduct a new study to confirm the hypotheses (not H0!)
“Believe” effects that are successfully confirmed (with a sufficient magnitude to be relevant!)
1.* Use dataset in an exploratory attitude to generate new hypotheses.
it. Iterate until retirement.
Note that step 2 is a phony replication!
• Data Challenge, Confirmation
Science is not only about collecting facts
that stand the criterion of reproducibility, but about
generating theories (in a wide sense) that connect the facts.
Types of confirmation:
+ Reproduction: Same values of input variables
−→ should produce response values within variability of error distribution.
+ Generalization = extrapolation: Extend the range of input variables.
Check if regression function is still appropriate.
Data Challenge
+ Extension: Vary additional input variables to find adequate extension of the model.
Recommendation: Perform combined study for reproducibility and generalization and/or extension.
Psychology: Reproducibility of Concept
Quantify “concepts” such as intelligence.
Questionnaires or “tests” −→ quantified concept.
Study relationships between concepts (response) and, e.g., socio-economic variables, or between concepts.
Confirm concepts by using different questionnaires / tests, hopefully getting “the same” concepts and their relations.
Messages
• Avoid significance tests and P-values. Use confidence intervals!
−→ Teaching!
• Precise reproducibility in the sense of compatibility of quantitative results (non-significant difference)
is rare (outside students’ lab classes in physics)
• The main problem is selection bias on different levels:
“Researcher degrees of freedom”,
informal (visual) preliminary analysis, model development, selection of results within paper, publication bias
• Dilemma of advancing science: Exploratory and confirmatory steps.
−→ Data Challenge
Instead of mere reproducibility studies, perform confirmation / generalization studies!
• Reproducibility is only applicable to empirical science.
There are other modes of thinking that should be recognized as “science” in the broad sense (“Wissenschaft”).
What is confirmation in these fields?
• −→ In what sense / to what degree
should reproducibility be a requirement for serious research?
Thank you for your endurance
References
Atmanspacher, H. and Maasen, S. (eds) (2016). Reproducibility:
Principles, Problems, Practices, and Prospects, Wiley.
Ioannidis, J. (2005). Why most published research findings are false, PLoS Medicine 2: 696–701.
Open Science Collaboration (2015). Estimating the reproducibility
of psychological science, Science 349: 943–952.
Appendix
Example R scripts for the birthrates example.
r-birthr-data.R
## require(regr0)
## ---
## aa <- count.fields("~/data/fertility-switz.dat")
dd <- read.table("~/data/birthratesOrig.dat", header=FALSE)
names(dd)
vv <- read.table("~/data/birthratesExt-vars.csv", sep=";", header=FALSE,
                 stringsAsFactors=FALSE)
names(dd) <- vv[-nrow(vv),1] ## last variable will be added below
d.fertilityX <- dd
## extract useful variables
tj <- c(2,1,19,6:13,26:32,42:43,47,48)
dd <- d.fertilityX[,tj]
names(dd) <- c("fertility", "fertTotal","infantMort",
"catholic", "single24","single49",
"eAgric","eIndustry","eCommerce","eTransport","eAdmin",
"german","french","italian","romansh","gradeHigh","gradeLow",
"educHigh","bornLocal","bornForeign","canton","district")
## altitude: create factor indicating the predominant layer
## (done for having a sensible factor among the explanatory variables)
t.a <- apply(d.fertilityX[,c("LOWALT","ALT500","ALT1000")],1, which.max)
dd$altitude <- ordered(t.a, labels=c("low","middle","high"))
## standardization
t.m <- ceiling(log10(apply(dd[,1:18],2,max)-1))
t.m["bornForeign"] <- 4
for (j in names(t.m)) dd[,j] <- dd[,j]/10^(t.m[j]-2)
## language: generate factor
td <- dd[,c("german","french","italian","romansh")]
## range(apply(td,1,sum)) ## [1] 97.5 100.0
tl <- td>60
ti <- which(apply(tl,1,sum)!=1)
td[ti,]
tv <- sapply(apply(tl,1, which ),
             function(x) if(length(x)) x else c(romansh=4) )
dd$language <- factor(names(tv))
##- table(dd$language)
##- french german italian romansh
##- 47 116 10 9
tvd <- c(vv[tj,2],"altitude in three categories: low, medium, high",
         "dominating language: german, french, italian, romansh")
tvd <- sub("^ *","",tvd)
tv <- cbind(names(dd), tvd)
## store
d.fertility <- dd
write.table(d.fertility, file="~/data/birthratesRed.csv", sep=";")
write.table(tv, file="~/data/birthratesRed-vars.csv", sep=";", quote=FALSE, col.names=FALSE)
## ---
## link to dataset swiss in R
ta <- paste(d.fertilityX$CATHOL/1000,d.fertilityX$IG/10)
tb <- paste(swiss$Catholic,swiss$Fertility)
ti <- match(ta,tb)
## sumna(ti) + nrow(swiss)
d.fertilityX$RowInSwiss <- ti
write.table(d.fertilityX, file="~/data/birthratesExt.csv", sep=";")
## ======================================================
## enrich the ’swiss’ dataset
sum(duplicated(d.fertility$catholic))
ti <- match(tb,ta)
## sumna(ti)
table(d.fertility$canton[ti])
d.swiss <- cbind(swiss, Language=d.fertility$language[ti], Canton=factor(d.fertilityX$Canton[ti]),
                 District=I(as.character(d.fertilityX$District[ti])))
write.table(d.swiss, file="~/data/swissExt.csv", sep=";")
r-birthr-ana.R
## Regression models for birthrate data
## ---
## package regr0: more "service"
## install the package. Do this only once!
## install.packages("quantreg")
## install.packages("regr0", repos="http://r-forge.r-project.org")
require(regr0)
## get data
dd <- read.csv("~/data/birthrates.csv", sep=";")
names(dd)
head(dd) # tail(dd)
showd(dd) ## regr0
str(dd)
## screen
summary(dd)
pairs(dd)
plmatrix(dd) ## regr0
## ---
## multiple regression
## model formula
t.model <- fertility ~ eAgric + single24 + catholic + german
plot(t.model, data=dd)
plmatrix(t.model, data=dd) ## regr0
r.mult <- lm(t.model, data=dd)
summary(r.mult)
r.m <- regr(t.model, data=dd)
r.m
## ---
## a "large" model
r1 <- regr( fertility ~ catholic + single24 + single49 + eAgric + eIndustry + eCommerce + eTransport + eAdmin + german + french + educHigh + bornLocal + altitude + canton, data=dd )
r1
## with first aid transformations:
rf <- regr( fertility ~ asinp(catholic) + asinp(single24) + asinp(single49) + asinp(eAgric) +
asinp(eIndustry) + asinp(eCommerce) + asinp(eTransport) + asinp(eAdmin) + asinp(german) + asinp(french) +
asinp(educHigh) + asinp(bornLocal/100) + altitude + canton, data=dd )
tt <- rf$termtable
print(tt[tt[,"p.value"]>0.05,"p.value",drop=F],quote=F)
## stepwise model selection
rback <- step(rf, k=5)
formula(rback)
plot(rback)
## similar, with plain R:
rlm <- lm(formula(rback), data=dd)
plot(rlm)
termplot(rlm, partial.resid = TRUE)
## add squares and interactions?
( radd <- add1(rback) )
## ...