Using R for Data Analysis and Graphics
5. More on Statistics
5.1 Missing Values
In practice, some data values may be missing.
Here, we fake this situation.
> t.kugel <- d.sport[,'kugel']
> t.kugel[2] <- NA
NA indicates missing data.
> t.kugel
[1] 15.66 NA 15.82 15.31 ...
Which elements of t.kugel are missing?
> t.kugel == NA
[1] NA NA NA NA NA ...
== does not work: a comparison with NA always yields NA. Use is.na() instead:
> is.na(t.kugel)
[1] FALSE TRUE FALSE ...
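The logical vector returned by is.na() can be used further, for example to count or locate the missing values. A minimal sketch; the vector x is made-up stand-in data, since d.sport is not reproduced here.

```r
# Hypothetical stand-in for t.kugel
x <- c(15.66, NA, 15.82, 15.31)

x == NA                        # NA NA NA NA: comparing with NA gives NA
is.na(x)                       # FALSE TRUE FALSE FALSE

n.missing <- sum(is.na(x))     # number of missing values
i.missing <- which(is.na(x))   # positions of the missing values
```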
Applying functions to vectors with missing values:
> mean(t.kugel)
[1] NA
> mean(t.kugel, na.rm=TRUE)
[1] 15.45
Other simple functions also have the na.rm argument.
For “higher” functions (e.g. wilcox.test), the argument na.action defines how missing values are handled.
na.action=na.omit: omit cases with NAs.
Plotting functions work with missing values.
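The two mechanisms can be compared side by side. This is a sketch under assumptions: the data are invented, and lm stands in for a “higher” function here (the slides name wilcox.test).

```r
x <- c(15.66, NA, 15.82, 15.31)    # made-up values
m.all <- mean(x)                   # NA, because one value is missing
m.obs <- mean(x, na.rm = TRUE)     # mean of the three observed values
s.obs <- sd(x, na.rm = TRUE)       # sd, median, sum, ... accept na.rm as well

# na.action in a model-fitting function: incomplete rows are dropped before fitting
d <- data.frame(y = c(1.1, 2.0, NA, 4.2, 4.9), x = 1:5)
r <- lm(y ~ x, data = d, na.action = na.omit)
n.used <- nobs(r)                  # number of rows actually used in the fit
```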
Drop the NA elements:
> t.kugel[!is.na(t.kugel)]
na.omit(df) drops the rows of data.frame df that contain at least one missing value.
How to specify missing values when reading data from “outside”:
> d.dat <- read.table(..., na.strings=c('.','-9999'))
Default: empty fields are taken as NA for numerical variables.
... or later:
> d.dat[d.dat==-99] <- NA
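A small self-contained sketch of reading data with coded missing values. The file content, the column names and the codes "." and "-9999" are assumptions for illustration; the text argument of read.table replaces a real file.

```r
# Hypothetical file content with two different missing-value codes
txt <- "id weight height
1 70.5 180
2 . 175
3 82.1 -9999"

d.dat <- read.table(text = txt, header = TRUE, na.strings = c(".", "-9999"))
d.dat

n.na <- colSums(is.na(d.dat))    # missing values per column
d.complete <- na.omit(d.dat)     # only rows without any NA
```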
5.2 Distributions and Random Numbers
The normal distribution is characterized by:
Density function:
> dnorm(0.5, mean=0, sd=1)
[1] 0.35207
Cumulative probability function:
> pnorm(c(1, 1.96), mean=0, sd=1)
[1] 0.841 0.975
Quantile function:
> qnorm(c(0.25,0.975), mean=100, sd=10)
[1] 93.3 119.6
Random number generator function:
> rnorm(5, mean=2, sd=2)
[1] 2.964 2.730 -2.581 -2.260 -0.781
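How the four functions relate to each other can be checked numerically. A minimal sketch; the variable names are chosen for illustration only.

```r
p <- pnorm(1.96)                 # P(X <= 1.96) for the standard normal
q <- qnorm(p)                    # qnorm inverts pnorm, so this returns 1.96

# P(90 < X <= 110) for X ~ N(mean = 100, sd = 10)
p.int <- pnorm(110, mean = 100, sd = 10) - pnorm(90, mean = 100, sd = 10)

# integrating the density up to 1.96 reproduces the cumulative probability
p.num <- integrate(dnorm, lower = -Inf, upper = 1.96)$value
```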
Poisson distribution: dpois, ppois, qpois, rpois
> rpois(10, lambda=3.5)
[1] 3 6 1 4 1 3 3 4 5 5
Discrete Distributions
binom           Binomial distribution
pois            Poisson distribution
hyper           Hypergeometric distribution
Continuous Distributions
unif            Uniform distribution
exp             Exponential distribution
norm            Normal distribution
lnorm           Log-Normal distribution
t, F, chisq     t-, F-, χ² (chi-square) distribution
weibull, gamma  Weibull, Gamma distribution
5.3 Random Numbers
“Random” numbers are generated by a deterministic function.
Nevertheless, two identical calls give different results.
> runif(4)
[1] 0.9285 0.2529 0.7131 0.0506
> runif(4)
[1] 0.1304 0.2456 0.3324 0.8217
How can this be? The function uses a vector .Random.seed, which changes with every call. To obtain the same numbers again, use set.seed:
> set.seed(27)
> runif(1)
[1] 0.9717
> set.seed(27)
> runif(1)
[1] 0.9717
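The same idea for a longer sequence, as a minimal sketch (the seed value 27 is taken from the example above, the variable names are arbitrary):

```r
set.seed(27)
a <- runif(3)
set.seed(27)
b <- runif(3)        # same seed, same numbers
identical(a, b)      # TRUE

c3 <- runif(3)       # no new seed: the stream continues with different numbers
```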
5.4 Visualization of distributions
Discrete distributions:
> plot(0:15, dpois(0:15,lambda=3.5),
+      type="h")
Continuous distributions:
> curve(dnorm(x,5,2), xlim=c(-1,10),
+       xlab="x", ylab="density",
+       main="normal distribution")
6. Elements of the S Language
6.1 Objects
The basic building blocks of the S language are called “objects”. They come in “classes”:
numeric, ...  one-dimensional sequence of numbers, strings, logicals
matrix        two-dimensional array of numbers, ...
array         higher dimensions
data.frame    two-dimensional, numbers and strings
formula       (regression) model
function      also an object!
list          collection of objects, see below
... and more
Get information about the structure of an object:
> str(d.sport)
'data.frame': 15 obs. of 7 variables:
$ weit : num 7.57 8.07 7.6 7.77 7.48 ...
$ kugel : num 15.7 13.6 15.8 15.3 16.3 ...
$ hoch : int 207 204 198 204 198 201 ...
...
6.2 Object Oriented Programming
Each object has a class, shown by class(object)
> class(d.sport)
[1] "data.frame"
Many functions do rather different things according to the class of the first argument.
Most prominent instance: plot
Implementation: a “generic function” examines the class of its first argument and then calls the matching “method function”.
Example: plot(speer~kugel, data=d.sport) calls the “formula method” of the “plot generic function”.
The most basic generic function is print! Example:
> r.t <- wilcox.test(extra~group, data=sleep)
> str(r.t)
List of 7
$ statistic : Named num 25.5
..- attr(*, "names")= chr "W"
$ parameter : NULL
$ p.value : num 0.0693
$ null.value : Named num 0
..- attr(*, "names")= chr "location shift"
$ alternative: chr "two.sided"
$ method : chr "Wilcoxon rank sum test with ..."
$ data.name : chr "extra by group"
- attr(*, "class")= chr "htest"
> r.t (or print(r.t))
calls the “htest method” of print
Wilcoxon rank sum test with continuity correction
data: extra by group
W = 25.5, p-value = 0.06933
alternative hypothesis: true location shift is not equal to 0
(“New S”: An extension of this principle examines the classes of several arguments.)
→ “S4 classes”, package methods
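The dispatch from a generic function to a method can be tried out with a home-made class. This is only a sketch: the class name "temp" and the function new.temp are invented for illustration.

```r
# Hypothetical constructor: a list with the class attribute "temp"
new.temp <- function(celsius) structure(list(celsius = celsius), class = "temp")

# Method of the generic print() for class "temp";
# the generic finds it by its name print.<class>
print.temp <- function(x, ...) {
  cat("Temperature:", format(x$celsius), "degrees Celsius\n")
  invisible(x)
}

t.obj <- new.temp(21.5)
t.obj                     # auto-printing calls print(), which dispatches to print.temp
print(unclass(t.obj))     # class attribute removed: printed as a plain list
```

Because the method is found through the class attribute alone, no change to print itself is needed.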
Apart from basic classes such as matrix and formula, many functions implementing statistical methods attach a specific class attribute to their result.
Example: Linear regression → function lm
> r.lm <- lm(speer~kugel, data=d.sport)
> class(r.lm)
[1] "lm"
These functions come with “methods” for plot, print, summary:
> summary(r.lm)
> plot(r.lm) ## explained later ...
6.3 Attributes
In order to store all kinds of useful information
along with an object, each object can have “attributes”.
Some attributes we have met before: class, names.
dim is an attribute of matrices; dim() also works for data.frames.
row.names and names contain the row and column names of data.frames (dimnames for matrices).
All of these are obtained by a function with the same name:
> dim(d.sport)
[1] 15 7
All attributes of an object can be seen by attributes:
> attributes(d.sport)
$names
[1] "weit" "kugel" "hoch" "disc" "stab" "speer" "punkte"
$class
[1] "data.frame"
$row.names
[1] "OBRIEN" "BUSEMANN" "DVORAK" "FRITZ" "HAMALAINEN"
...
You will rarely use attributes actively.
Often, you do not see them when you just “print” an object (the method of the object’s class for print does not show them.)
“Intestines” of the S language.
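For the rare cases where attributes are handled directly, a minimal sketch with attr(). The attribute "unit" is made up for illustration and has no meaning to R itself.

```r
m <- matrix(1:6, nrow = 2)
attributes(m)               # $dim: 2 3
attr(m, "dim")              # the same values as dim(m)

x <- c(a = 1, b = 2)
attr(x, "unit") <- "m"      # attach a user-defined attribute
attributes(x)               # $names and $unit
```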
6.4 Lists
Objects of any kind can be collected into a list:
> list(t.v, you='nice')
[[1]]
 Hans Fritz  Elsa Trudi  Olga
  2.0  -1.0   9.0   0.4 100.0
$you
[1] "nice"
As with c(...), all arguments are collected; names can be given to the components.
Lists are an important (additional) class of objects, since most statistical functions produce a list that collects the results.
> t.l <- hist(t.kugel, plot=FALSE)
> t.l
$breaks
[1] 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0
$counts
[1] 2 1 4 1 4 1 2
$intensities
[1] 0.267 0.133 0.533 0.133 0.533 0.133 0.267
$density
[1] 0.267 0.133 0.533 0.133 0.533 0.133 0.267
$mids
[1] 13.75 14.25 14.75 15.25 15.75 16.25 16.75
...
Get a sublist of the list: [ ]
> t.l[2:3]
$counts
[1] 2 1 4 1 4 1 2
$intensities
[1] 0.267 0.133 0.533 0.133 0.533 0.133 0.267
or t.l[c("counts","intensities")]
Get a component: [[ ]]
> t.l[[2]]
[1] 2 1 4 1 4 1 2
or t.l[['counts']] or t.l$counts.
Note: t.l['counts'] is a list with one component.
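The difference between single and double brackets is easy to verify. A sketch with an invented list (not the hist result from the slides):

```r
lst <- list(breaks = c(0, 1, 2), counts = c(5L, 3L), name = "demo")  # made-up list

sub <- lst["counts"]          # single brackets: a list of length 1
cmp <- lst[["counts"]]        # double brackets: the component itself
class(sub)                    # "list"
class(cmp)                    # "integer"
identical(cmp, lst$counts)    # TRUE: $ is shorthand for [[ ]] with a name
```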
Hint: A data.frame is a list with additional attributes.
→ Single columns (variables) can be selected by $:
> d.sport$kugel
(select elements from it: d.sport$kugel[4:6])
Make a list of subsets of a vector:
> split(1:7, c(1, 1, 2, 3, 3, 2, 1))
$`1`
[1] 1 2 7
$`2`
[1] 3 6
$`3`
[1] 4 5
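Both points can be tried on the sleep data set that ships with R (used here instead of d.sport, which is not available outside the course):

```r
is.list(sleep)                          # TRUE: a data.frame is a list of columns
names(sleep)                            # the component (column) names

g <- split(sleep$extra, sleep$group)    # list with one component per group
names(g)                                # "1" "2"
m.g <- sapply(g, mean)                  # group means as a named vector
```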
unlist concatenates all elements of all components into a single vector.
> unlist(t.l[1:2])
breaks1 breaks2 breaks3 breaks4 breaks5 breaks6 ...
13.5 14.0 14.5 15.0 15.5 16.0 ...
counts3 counts4 counts5 counts6 counts7
4.0 1.0 4.0 1.0 2.0
6.5 Loops
Loops are basic for programming. Most important type: for
Calculate the first twelve elements of the Fibonacci series.
> fib <- c(1,1)
> for (i in 1:10)
+ fib <- c(fib, fib[i]+fib[i+1])
> fib
[1] 1 1 2 3 5 8 13 21 34 55 89 144
Other loops: while, repeat
Jump out of the loop: break
Note: For loops should only be used if the result of the previous iterations is required in the next iteration.
Otherwise: see below!
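A sketch of the three situations mentioned above: a for loop that depends on earlier results, a repeat loop left with break, and a computation that needs no loop at all. The preallocation with numeric(n) is a variation on the slide's code, not the slide's own version.

```r
# for: each element needs the two previous ones, so a loop is appropriate
n <- 12
fib <- numeric(n)              # preallocate instead of growing the vector with c()
fib[1:2] <- 1
for (i in 3:n) fib[i] <- fib[i - 1] + fib[i - 2]

# repeat with break: first Fibonacci number above 1000
a <- 1; b <- 1
repeat {
  nxt <- a + b
  a <- b
  b <- nxt
  if (b > 1000) break
}

# iterations independent of each other: vectorized arithmetic replaces the loop
sq <- (1:10)^2
```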
6.6 if – else
Conditional execution:
if ( fib[i+2] > 1E10 ) break
if ( fib[i+2] %% 3 == 0 )
  print(paste('hit', fib[i+2])) else
  print(fib[i+2] %% 3)
Select elements from 2 vectors based on a condition:
> ifelse(c(TRUE,FALSE,TRUE), 1:3, 11:13)
[1] 1 12 3
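The contrast between the two constructs, as a minimal sketch with invented values: if decides once on a single condition, ifelse decides element by element.

```r
x <- c(-2, 0, 3)

# if ... else: one condition of length one, one decision
if (length(x) > 2) msg <- "long" else msg <- "short"

# ifelse: the condition is a vector, the result has the same length
s <- ifelse(x > 0, "pos", "non-pos")
```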
6.7 Apply
Loops can and should be avoided in many cases!
Apply a function to each column (or row) of a data.frame or matrix:
> apply(d.sport, 2, mean)
  weit  kugel   hoch   disc   stab  speer punkte
   7.6   15.2  202.0   46.4  498.0   62.0 8444.7
Second argument: 1 for summary of rows, 2 for columns
If the function needs more arguments, they are provided as additional arguments of apply:
> apply(d.sport, 2, mean, trim=0.3)
   weit   kugel    hoch    disc    stab   speer  punkte
   7.59   15.19  201.86   46.42  495.71   63.00 8397.86
Apply a function to each component of a list:
> t.l <- hist(t.kugel,plot=FALSE) # see lists
> lapply(t.l[1:2], length)
$breaks
[1] 8
$counts
[1] 7
Simplified result (unlisted):
> sapply(t.l[1:4], length)
breaks counts intensities density
8 7 7 7
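The apply family on small invented objects, as a sketch that runs without d.sport. Note that apply converts a data.frame to a matrix first, so it is best suited to all-numeric data.

```r
m <- matrix(1:6, nrow = 2)        # 2 x 3 matrix, filled column by column
row.sums <- apply(m, 1, sum)      # 1 = over rows
col.max  <- apply(m, 2, max)      # 2 = over columns

l <- list(a = 1:3, b = letters)
lapply(l, length)                 # result is a list with components a and b
n <- sapply(l, length)            # result simplified to a named vector
```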
6.8 Aggregate
Summaries over groups of data:
> aggregate(sleep[,'extra'],
+           list(sleep[,'group']), median)
Group.1 x
1 1 0.35
2 2 1.75
Result is a data.frame.
If there are many groups, it often makes sense to analyze the summaries using the new data.frame!
Similar: by, tapply.
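The same group medians can also be obtained with the formula interface of aggregate or with tapply. A sketch on the built-in sleep data; the variable names are arbitrary.

```r
# formula interface: result is a data.frame with columns group and extra
a <- aggregate(extra ~ group, data = sleep, FUN = median)

# tapply: result is a named vector (one element per group)
t.med <- tapply(sleep$extra, sleep$group, median)
```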
6.9 Write your own Function
A simple function:
Get the maximal value of a vector and its index.
f.maxi <- function(vec) {
  t.max <- max(vec, na.rm=TRUE)   # largest value, ignoring NAs
  t.i <- match(t.max, vec)        # position of its first occurrence
  c(max=t.max, i=t.i)
}
Syntax:
fnname <- function( arg(s) ) { statements }
> f.maxi(c(3,4,78,2))
max   i
 78   3
(Note: R provides the function which.max)
This function can now be used in apply.
> apply(d.sport, 2, f.maxi)
    weit kugel hoch  disc stab speer punkte
max 8.07 16.97  213 49.84  540 70.16   8824
i   2.00 13.00    8  4.00    6  3.00      1
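A variant built on which.max, as a sketch. The name f.maxi2 is hypothetical; which.max ignores missing values and returns the index of the first maximum.

```r
f.maxi2 <- function(vec) {
  i <- which.max(vec)        # index of the first maximum, NAs are skipped
  c(max = vec[i], i = i)
}

r <- f.maxi2(c(3, 4, NA, 78, 2))
r
```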
6.10 Error Handling
Error messages are often helpful ...
... but sometimes you have no clue, mostly if they occur in a function that was called by a function ...
Show the “stack” of function calls:
> traceback()
Ask an experienced user ...
If you write your own functions:
?debug
browser() as a statement in the function:
stops execution and lets you inspect all variables.
options(error=recover) calls browser when an error occurs.
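Besides the interactive tools above, errors can also be caught inside your own functions with tryCatch from base R. tryCatch is not covered on these slides; the helper f.safe.log is a hypothetical example.

```r
f.safe.log <- function(x) {
  tryCatch(
    log(x),
    error = function(e) {
      # conditionMessage() extracts the text of the error
      message("log failed: ", conditionMessage(e))
      NA                     # value returned instead of stopping
    }
  )
}

ok  <- f.safe.log(10)        # ordinary result
bad <- f.safe.log("a")       # prints a message and returns NA
```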
6.11 Options
Options tailor some aspects of R’s behavior to your desires:
> options(digits=3)
rounds results to 3 significant digits (for printing).
Use options like par: get a specific option by
> options("digits")
The settings of options (and par) will be lost upon q().
To always set certain options and run other initial actions, use the startup mechanism, see ?Startup
Linux: provide a file called .Rprofile
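As with par, options returns the previous settings, which makes temporary changes easy to undo. A minimal sketch:

```r
old <- options(digits = 3)    # set a new value, keep the old one
print(pi)                     # printed with 3 significant digits
getOption("digits")           # query a single option
options(old)                  # restore the earlier setting
```

The same options(...) calls can be placed in the startup file to make them permanent.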
6.12 Packages
By default, R only provides a basic set of functions.
Additional functions (and datasets) are obtained by loading additional “packages”.
> library(MASS) or require(MASS)
How to find a command and the corresponding package?
> help.search("...") (slow!)
On the internet:
http://www.r-project.org/search.html (or Google!)
What does a package do?
> library(help=MASS)
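A sketch of loading a package only if it is installed, and of searching the attached packages by name. requireNamespace and apropos are base/utils functions not mentioned on the slide; the search pattern is an arbitrary example.

```r
# TRUE if MASS is installed, FALSE otherwise (no error either way)
has.mass <- requireNamespace("MASS", quietly = TRUE)
if (has.mass) {
  library(MASS)
  head(ls("package:MASS"))    # some of the objects the package provides
}

# names of all visible objects that start with "read."
fns <- apropos("^read\\.")
```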