
Using R for Data Analysis and Graphics

5. More on Statistics


5.1 Missing Values

In practice, some data values may be missing.

Here, we fake this situation.

> t.kugel <- d.sport[,'kugel']

> t.kugel[2] <- NA

NA indicates a missing value.

> t.kugel

[1] 15.66 NA 15.82 15.31 ...

Which elements of t.kugel are missing?

> t.kugel == NA

[1] NA NA NA NA NA ...

Comparison with == does not work: any comparison with NA yields NA. Use is.na() instead:

> is.na(t.kugel)

[1] FALSE TRUE FALSE ...
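The behaviour of == and is.na() can be tried on a small vector. The following is only a sketch: the vector x and its values are made up for illustration and are not part of d.sport.

```r
# Hypothetical shot-put distances; the second value is missing.
x <- c(15.66, NA, 15.82, 15.31)

x == NA          # every comparison with NA is itself NA
is.na(x)         # FALSE TRUE FALSE FALSE
sum(is.na(x))    # number of missing values: 1
which(is.na(x))  # position of the missing value: 2
anyNA(x)         # TRUE, since at least one value is missing
```

sum() and which() work here because is.na() returns an ordinary logical vector.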


Applying functions to vectors with missing values:

> mean(t.kugel)

[1] NA

> mean(t.kugel, na.rm=TRUE)

[1] 15.45

Other simple functions also have the na.rm argument.

For “higher” functions (e.g. wilcox.test), the argument na.action defines how missing values are handled.

na.action=na.omit: omit cases with NAs.

Plotting functions work with missing values.


Drop the NA elements:

> t.kugel[!is.na(t.kugel)]

na.omit(df) drops all rows of the data.frame df that contain at least one missing value.

How to specify missing values when reading data from “outside”:

> d.dat <- read.table(..., na.strings=c(".", "-9999"))

Default: empty fields are taken as NA for numerical variables.

... or later:

> d.dat[d.dat==-99] <- NA
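Recoding a sentinel value and dropping incomplete rows can be sketched on a tiny data frame. The data frame d.demo and the code -99 are assumptions chosen for illustration, not data from the course.

```r
# Hypothetical data frame in which -99 codes a missing measurement.
d.demo <- data.frame(a = c(1.2, -99, 3.4), b = c(10, 20, -99))

d.demo[d.demo == -99] <- NA   # recode the sentinel value as NA
d.demo
colSums(is.na(d.demo))        # number of missing values per column
na.omit(d.demo)               # keeps only the complete rows (here: row 1)
```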


5.2 Distributions and Random Numbers

The normal distribution is characterized by:

Density function:

> dnorm(0.5, mean=0, sd=1)

[1] 0.35207

Cumulative probability function:

> pnorm(c(1, 1.96), mean=0, sd=1)

[1] 0.841 0.975

Quantile function:

> qnorm(c(0.25,0.975), mean=100, sd=10)

[1] 93.3 119.6

Random number generator function:

> rnorm(5, mean=2, sd=2)

[1] 2.964 2.730 -2.581 -2.260 -0.781
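How the four functions relate to each other can be checked numerically. This is a sketch for the standard normal distribution; the tolerance passed to all.equal is an arbitrary choice.

```r
p <- pnorm(1.96)        # P(X <= 1.96) for the standard normal
q <- qnorm(p)           # the quantile function inverts pnorm
all.equal(q, 1.96)      # TRUE

# Integrating the density up to 1.96 gives the cumulative probability.
area <- integrate(dnorm, lower = -Inf, upper = 1.96)$value
all.equal(area, p, tolerance = 1e-4)
```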


Poisson distribution: dpois, ppois, qpois, rpois

> rpois(10, lambda=3.5)

[1] 3 6 1 4 1 3 3 4 5 5

Discrete distributions:
binom            Binomial distribution
pois             Poisson distribution
hyper            Hypergeometric distribution

Continuous distributions:
unif             Uniform distribution
exp              Exponential distribution
norm             Normal distribution
lnorm            Log-normal distribution
t, F, chisq      t-, F-, chi-square distribution
weibull, gamma   Weibull, Gamma distribution
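The prefixes d, p, q and r combine with every name in the table above. As a sketch, here is the binomial case; the parameters size = 10 and prob = 0.3 are arbitrary.

```r
dbinom(2, size = 10, prob = 0.3)   # P(X = 2)
pbinom(2, size = 10, prob = 0.3)   # P(X <= 2)

# The cumulative probability is the sum of the point probabilities.
p.sum <- sum(dbinom(0:2, size = 10, prob = 0.3))
all.equal(p.sum, pbinom(2, size = 10, prob = 0.3))   # TRUE

qbinom(0.5, size = 10, prob = 0.3)   # median of the distribution
rbinom(5, size = 10, prob = 0.3)     # five random draws
```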


5.3 Random Numbers

“Random” numbers are generated by a deterministic function.

Nevertheless, two identical calls give different results.

> runif(4)

[1] 0.9285 0.2529 0.7131 0.0506

> runif(4)

[1] 0.1304 0.2456 0.3324 0.8217

How is this possible? The function uses and updates a vector .Random.seed that holds the state of the generator. To obtain the same numbers again, use set.seed:

> set.seed(27)

> runif(1)

[1] 0.9717

> set.seed(27)

> runif(1)

[1] 0.9717
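The effect of set.seed can also be verified programmatically instead of comparing printed numbers by eye. A minimal sketch, with the variable names u1, u2 and u3 chosen for illustration:

```r
set.seed(27)
u1 <- runif(3)
set.seed(27)        # reset the generator to the same state
u2 <- runif(3)
identical(u1, u2)   # TRUE: same seed, same numbers

u3 <- runif(3)      # no reset: the generator has moved on
identical(u2, u3)   # FALSE
```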


5.4 Visualization of distributions

Discrete distributions:

> plot(0:15, dpois(0:15,lambda=3.5),
+   type="h")

Continuous distributions:

> curve(dnorm(x,5,2), xlim=c(-1,10),
+   xlab="x", ylab="density",
+   main="normal distribution")

6. Elements of the S Language


6.1 Objects

The basic building blocks of the S language are called “objects”. They come in “classes”:

numeric, ...   one-dimensional sequence of numbers, strings, logical values
matrix         two-dimensional array of numbers, ...
array          higher dimensions
data.frame     two-dimensional, numbers and strings
formula        (regression) model
function       also an object!
list           collection of objects, see below
... and more


Get information about the structure of an object:

> str(d.sport)

'data.frame': 15 obs. of 7 variables:

$ weit : num 7.57 8.07 7.6 7.77 7.48 ...

$ kugel : num 15.7 13.6 15.8 15.3 16.3 ...

$ hoch : int 207 204 198 204 198 201 ...

...


6.2 Object Oriented Programming

Each object has a class, shown by class(object):

> class(d.sport)

[1] "data.frame"

Many functions do rather different things according to the class of their first argument.

Most prominent instance: plot

This is implemented by a “generic function”, which examines the class of its first argument and then calls a “method function” accordingly.

Example: plot(speer ~ kugel, data=d.sport) calls the “formula method” of the “plot generic function”.


The most basic generic function is print! Example:

> r.t <- wilcox.test(extra ~ group, data=sleep)

> str(r.t)

List of 7

$ statistic : Named num 25.5

..- attr(*, "names")= chr "W"

$ parameter : NULL

$ p.value : num 0.0693

$ null.value : Named num 0

..- attr(*, "names")= chr "location shift"

$ alternative: chr "two.sided"

$ method : chr "Wilcoxon rank sum test with ..."

$ data.name : chr "extra by group"

- attr(*, "class")= chr "htest"


> r.t

(or print(r.t)) calls the “htest method” of print:

Wilcoxon rank sum test with continuity correction

data: extra by group

W = 25.5, p-value = 0.06933

alternative hypothesis: true location shift is not equal to 0

(“New S”: an extension of this principle examines the class of more than one argument.)

→ “S4 classes”, package methods
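The dispatch from a generic function to a method can be reproduced with a class of your own. The sketch below is an illustration only: the class "temperature", its constructor and its print method are hypothetical and not part of R.

```r
# Constructor for a hypothetical class "temperature".
temperature <- function(celsius) {
  structure(list(celsius = celsius), class = "temperature")
}

# print method: found by the generic print() through the class attribute.
print.temperature <- function(x, ...) {
  cat("Temperature:", x$celsius, "degrees Celsius\n")
  invisible(x)
}

t.temp <- temperature(21.5)
t.temp            # dispatches to print.temperature
unclass(t.temp)   # the underlying list, printed by the default method
```

The naming scheme generic.class (here print.temperature) is what connects the method to the generic function.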


Apart from basic classes such as matrix and formula, many functions implementing statistical methods attach a specific class attribute to their result.

Example: Linear regression → function lm

> r.lm <- lm(speer ~ kugel, data=d.sport)

> class(r.lm)

[1] "lm"

These functions come with “methods” for plot, print, summary:

> summary(r.lm)

> plot(r.lm) ## explained later ...


6.3 Attributes

In order to store all kinds of useful information along with an object, each object can have “attributes”.

Some attributes we have met before: class, names.

dim is an attribute of matrices; for data.frames, dim() computes the result from the rows and columns.

row.names and names contain the row and column names of data.frames (dimnames for matrices).

All of these are obtained by a function with the same name:

> dim(d.sport)

[1] 15 7


All attributes of an object can be seen by attributes:

> attributes(d.sport)

$names

[1] "weit" "kugel" "hoch" "disc" "stab" "speer" "punkte"

$class

[1] "data.frame"

$row.names

[1] "OBRIEN" "BUSEMANN" "DVORAK" "FRITZ" "HAMALAINEN"

...

You will rarely use attributes actively.

Often, you do not see them when you just “print” an object, because the print method of the object's class does not show them.

They are the “intestines” of the S language.
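For the rare cases where attributes are handled directly, the functions attributes and attr read and set them. A small sketch on a matrix; the attribute name "unit" is a made-up example.

```r
m <- matrix(1:6, nrow = 2)
names(attributes(m))       # "dim": the only attribute of a plain matrix

attr(m, "unit") <- "cm"    # attach a hypothetical custom attribute
attributes(m)              # now shows $dim and $unit

v <- m
dim(v) <- NULL             # removing dim turns the copy into a plain vector
is.matrix(m)               # TRUE
is.matrix(v)               # FALSE
```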


6.4 Lists

Objects of any kind can be collected into a list:

> list(t.v, you='nice')

[[1]]

 Hans Fritz  Elsa Trudi  Olga
  2.0  -1.0   9.0   0.4 100.0

$you

[1] "nice"

As with c(...), all arguments are collected, and names can be given to the components.


Lists are an important (additional) class of objects, since most statistical functions produce a list that collects the results.

> t.l <- hist(t.kugel, plot=FALSE)

> t.l

$breaks

[1] 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0

$counts

[1] 2 1 4 1 4 1 2

$intensities

[1] 0.267 0.133 0.533 0.133 0.533 0.133 0.267

$density

[1] 0.267 0.133 0.533 0.133 0.533 0.133 0.267

$mids

[1] 13.75 14.25 14.75 15.25 15.75 16.25 16.75

...


Get a sublist of the list: [ ]

> t.l[2:3]

$counts

[1] 2 1 4 1 4 1 2

$intensities

[1] 0.267 0.133 0.533 0.133 0.533 0.133 0.267

or t.l[c("counts","intensities")]

Get a component: [[ ]]

> t.l[[2]]

[1] 2 1 4 1 4 1 2

or t.l[['counts']] or t.l$counts.

Note: t.l['counts'] is a list with one component.
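The difference between [ ] and [[ ]] shows up in the class of the result. A minimal sketch with a made-up list t.demo:

```r
# A small hypothetical list with two named components.
t.demo <- list(counts = c(2, 1, 4), label = "demo")

class(t.demo["counts"])     # "list": single brackets return a sublist
class(t.demo[["counts"]])   # "numeric": double brackets return the component
identical(t.demo$counts, t.demo[["counts"]])   # TRUE

# Arithmetic needs the component, not the sublist.
sum(t.demo[["counts"]])     # 7
```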


Hint: A data.frame is a list with additional attributes.

→ Single columns (variables) can be selected by $:

> d.sport$kugel

(select elements from it: d.sport$kugel[4:6])

Make a list of subsets of a vector:

> split(1:7, c(1, 1, 2, 3, 3, 2, 1))

$`1`

[1] 1 2 7

$`2`

[1] 3 6

$`3`

[1] 4 5


unlist concatenates all elements of all components into a single vector.

> unlist(t.l[1:2])

breaks1 breaks2 breaks3 breaks4 breaks5 breaks6 ...

13.5 14.0 14.5 15.0 15.5 16.0 ...

counts3 counts4 counts5 counts6 counts7

4.0 1.0 4.0 1.0 2.0


6.5 Loops

Loops are basic for programming. Most important type: for

Calculate the first twelve elements of the Fibonacci series:

> fib <- c(1,1)

> for (i in 1:10)

+ fib <- c(fib, fib[i]+fib[i+1])

> fib

[1] 1 1 2 3 5 8 13 21 34 55 89 144

Other loops: while, repeat

Jump out of the loop: break

Note: for loops should only be used if the result of the previous iterations is required in the next iteration.

Otherwise: see below!
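When the iterations do not depend on each other, a vectorized expression usually replaces the loop. The following sketch squares the numbers 1 to 10 both ways; the variable names are chosen for illustration.

```r
x <- 1:10

# Loop version: each iteration is independent of the others.
sq.loop <- numeric(length(x))
for (i in seq_along(x)) sq.loop[i] <- x[i]^2

# Vectorized version: one expression, no loop.
sq.vec <- x^2
identical(sq.loop, sq.vec)   # TRUE

# A running total depends on the previous step, yet cumsum() covers it.
cumsum(x)
```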


6.6 if – else

Conditional execution:

if ( fib[i+2] > 1E10 ) break

if ( fib[i+2] %% 3 == 0 )
  print(paste('hit', fib[i+2])) else
  print(fib[i+2] %% 3)

Select elements from 2 vectors based on a condition:

> ifelse(c(TRUE,FALSE,TRUE), 1:3, 11:13)

[1] 1 12 3
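The two constructs answer different questions: ifelse works element by element, while if needs a single TRUE or FALSE. A sketch with a made-up vector x:

```r
x <- c(-2, 0, 3)

# ifelse() works element by element and returns a vector.
t.sign <- ifelse(x >= 0, "non-negative", "negative")
t.sign

# if () expects a single TRUE or FALSE, so reduce the condition first.
if (any(x < 0)) {
  msg <- "x contains negative values"
} else {
  msg <- "all values are non-negative"
}
msg
```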


6.7 Apply

Loops can and should be avoided in many cases!

Apply a function to each column (or row) of a data.frame or matrix:

> apply(d.sport, 2, mean)

  weit  kugel   hoch   disc   stab  speer punkte
   7.6   15.2  202.0   46.4  498.0   62.0 8444.7

Second argument: 1 for summary of rows, 2 for columns.

If the function needs more arguments, they are provided as additional arguments of apply:

> apply(d.sport, 2, mean, trim=0.3)

   weit   kugel    hoch    disc    stab   speer  punkte
   7.59   15.19  201.86   46.42  495.71   63.00 8397.86
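The role of the second argument and of additional arguments can be tried on a small matrix. This is a sketch with made-up numbers; m and m.na are illustrative names.

```r
# Hypothetical 2 x 3 matrix, filled column by column.
m <- matrix(c(1, 2, 3, 4, 5, 60), nrow = 2)

row.means <- apply(m, 1, mean)   # 1: one result per row
col.means <- apply(m, 2, mean)   # 2: one result per column
row.means                        # 3 22
col.means                        # 1.5 3.5 32.5

# Additional arguments are passed on to the function, here na.rm.
m.na <- m
m.na[1, 1] <- NA
apply(m.na, 2, mean, na.rm = TRUE)   # 2.0 3.5 32.5
```

For plain means, rowMeans and colMeans give the same results.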


Apply a function to each component of a list:

> t.l <- hist(t.kugel,plot=FALSE) # see lists

> lapply(t.l[1:2], length)

$breaks

[1] 8

$counts

[1] 7

Simplified result (unlisted):

> sapply(t.l[1:4], length)

breaks counts intensities density

8 7 7 7
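What sapply returns depends on the results of the function: a vector if each result has length one, a matrix if all results have the same length greater than one. A sketch with a made-up list t.demo:

```r
# Hypothetical list of numeric vectors of different lengths.
t.demo <- list(a = 1:3, b = c(2.5, 7), c = 10)

lapply(t.demo, mean)              # a list with one mean per component
t.means <- sapply(t.demo, mean)   # simplified to a named numeric vector
t.means

t.range <- sapply(t.demo, range)  # two values per component: a 2 x 3 matrix
t.range
```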


6.8 Aggregate

Summaries over groups of data:

> aggregate(sleep[,'extra'],
+   list(sleep[,'group']), median)

Group.1 x

1 1 0.35

2 2 1.75

Result is a data.frame.

If there are many groups, it often makes sense to analyze the summaries as a new data.frame!

Similar: by, tapply.
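For comparison, the same group medians can be obtained with tapply, which returns a named vector instead of a data.frame. The sketch uses the built-in sleep data; the name t.med is chosen for illustration.

```r
# Median of extra sleep per group, returned as a named vector.
t.med <- tapply(sleep$extra, sleep$group, median)
t.med

# The same summary with the formula interface of aggregate().
d.med <- aggregate(extra ~ group, data = sleep, FUN = median)
d.med
```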


6.9 Write your own Function

A simple function:

Get the maximal value of a vector and its index.

f.maxi <- function(vec) {
  t.max <- max(vec, na.rm=TRUE)
  t.i <- match(t.max, vec)
  c(max=t.max, i=t.i)
}

Syntax:

fnname <- function( arg(s) ) { statements }

> f.maxi(c(3,4,78,2))

max   i
 78   3

(Note: R provides the function which.max)
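A short sketch of which.max on a made-up vector with a missing value and a tied maximum. Like match in f.maxi, it reports only the first position of the maximum.

```r
x <- c(3, 4, 78, 2, NA, 78)

which.max(x)                       # 3: first maximum, NAs are skipped
max(x, na.rm = TRUE)               # 78
which(x == max(x, na.rm = TRUE))   # 3 6: all positions of the maximum
```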


This function can now be used in apply.

> apply(d.sport, 2, f.maxi)

    weit kugel hoch  disc stab speer punkte
max 8.07 16.97  213 49.84  540 70.16   8824
i   2.00 13.00    8  4.00    6  3.00      1


6.10 Error Handling

Error messages are often helpful ...

... but sometimes you have no clue, mostly if they occur in a function that was called by another function.

Show the “stack” of function calls:

> traceback()

Ask an experienced user ...

If you write your own functions:

?debug

browser() as a statement in the function: stops execution and lets you inspect all variables.

options(error=recover) calls browser when an error occurs.


6.11 Options

Options tailor some aspects of R's behavior to your desires:

> options(digits=3)

rounds results to 3 significant digits (for printing).

Use options like par: get a specific option by

> options("digits")

The settings of options (and par) will be lost upon q().

In order to always set some options and perform other initial actions, use the startup mechanism, see ?Startup

Linux: provide a file called .Rprofile
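Within a session, an option can be changed temporarily and restored afterwards, because options() returns the previous settings. A minimal sketch; the names old and t.pi are chosen for illustration.

```r
old <- options(digits = 3)   # options() returns the previous settings
getOption("digits")          # 3
t.pi <- format(pi)           # formatted with 3 significant digits
t.pi                         # "3.14"

options(old)                 # restore the earlier value
getOption("digits")
```

The same pattern works with par for graphical parameters.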


6.12 Packages

By default, R only provides a basic set of functions.

Additional functions (and datasets) are obtained by loading additional “packages”.

> library(MASS) or require(MASS)

How to find a command and the corresponding package?

> help.search("...") (slow!)

On the internet: http://www.r-project.org/search.html (or Google!)

What does a package do?

> library(help=MASS)
