Statistical Programming Languages – Day 2

(1)

Uwe Ziegenhagen

Institut für Statistik und Ökonometrie

Humboldt-Universität zu Berlin

http://www.uweziegenhagen.de

(2)

Agenda for Today

Data frames

Reading and Writing Data

Exploratory Data Analysis

(3)

Data Frames

matrices can hold only one data type

data frames: several types allowed, one per column

all columns must have equal length
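A minimal sketch of these three points, using made-up values. The object names `m`, `df` and `res` are only illustrative:

```r
# A matrix holds a single type: mixing numbers and strings coerces everything
m <- cbind(c(10, 20), c("m", "f"))
typeof(m)                      # "character"

# A data frame keeps one type per column
df <- data.frame(Age = c(10, 20), Sex = c("m", "f"))
sapply(df, class)

# Columns whose lengths cannot be matched are rejected
res <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(res, "try-error")     # TRUE
```

Note that data.frame() recycles a shorter column when its length divides the longer one, so only incompatible lengths raise an error.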

(4)

Data Frame Example

a <- c(10, 20, 15, 43, 76, 41, 25, 46)  # numeric
# factor 'sex'
b <- factor(c("m", "f", "m", "f", "m", "f", "m", "f"))
# siblings, numeric
c <- c(2, 5, 8, 3, 6, 1, 5, 6)
myframe <- data.frame(a, b, c)
myframe
colnames(myframe) <- c("Age", "Sex", "Siblings")

(5)

Factor variables

Factor variables: categorical variables (numeric or string)

advantages:

- handled correctly in statistical modeling
- very useful in many different types of graphics
- correct number of degrees of freedom
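To illustrate, here is a small sketch with the same sex vector as in the data frame example. The variable names `sex` and `X` are chosen for illustration only:

```r
sex <- factor(c("m", "f", "m", "f", "m", "f", "m", "f"))

levels(sex)        # categories, sorted alphabetically: "f" "m"
table(sex)         # counts per category
as.integer(sex)    # internal integer codes

# In a model, a factor with 2 levels uses 1 degree of freedom:
# the design matrix has an intercept plus one dummy column
X <- model.matrix(~ sex)
ncol(X)            # 2
```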

(6)

Addressing Components

myframe[, 1]
myframe["Age"]
myframe$Age
myframe[3, 3] <- 2   # change value
myframe[, -2]        # all variables except the 2nd

(7)

Overview of Objects II

# add object to the search path
attach(name)
# remove from the search path
detach(name)

(8)

Subgrouping Data Frames

> subset(myframe, myframe$Age > 30)  # 4 entries
> mean(subset(myframe$Age, myframe$Sex == "m"))
[1] 31.5
> mean(subset(myframe$Age, myframe$Sex == "f"))
[1] 37.5
myframe[(myframe$Sex == "m") & (myframe$Age > 30), ]
# males over 30
myframe[(myframe$Sex == "m") | (myframe$Age > 30), ]
# male or over 30

(9)

Data Frames - Variables

> myframe <- cbind(myframe, "Income(USD)" =
    c(1700, 2100, 2300, 2050, 2800, 1450, 3400, 2000))
> names(myframe)[names(myframe) == "Income(USD)"] <- "IncomeUSD"

Task: Add variable IncomeEUR.

(10)

Search and Replace

Use gsub to replace all matches of a regular expression.

> names(myframe) <- gsub("In", "Out", names(myframe))
> myframe
  Age Sex Siblings OutcomeUSD
1  10   m        2       1700
2  20   f        5       2100
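Because the pattern is a regular expression, anchors and alternatives can be used to limit what is replaced. A small sketch on a hypothetical vector of variable names (`vars` is not part of the slides):

```r
vars <- c("IncomeUSD", "IncomeEUR", "Age")

# "^" anchors the match at the start of the name
renamed <- gsub("^In", "Out", vars)

# "$" anchors at the end; (USD|EUR) matches either currency code
stripped <- sub("(USD|EUR)$", "", vars)

renamed
stripped
```

sub() replaces only the first match per element, gsub() replaces all of them.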

(11)

Deleting and Sorting

> myframe$Age <- NULL             # delete the variable Age
> myframe

> myframe[order(myframe$Age), ]   # sort rows by Age (requires that Age has not been deleted)
1  10   m        2       1700
3  15   m        2       2300

(12)

Deleting and Sorting

1. Sort by Sex 2. Sort by Age

> myframe[order(myframe$Sex, myframe$Age), ]
2  20   f        5       2100
6  41   f        1       1450
4  43   f        3       2050
8  46   f        6       2000
1  10   m        2       1700
3  15   m        2       2300
7  25   m        5       3400
5  76   m        6       2800

(13)

Short excursion sed & awk

sed:

stream editor, works line by line

examples:

- sed 's/abc/def/' input.txt > output.txt
- sed 's|/|\\|g' input.txt > output.txt (replaces every slash with a backslash)
- example using regular expression

extremely useful to process large amounts of data

Tutorial: http://www.grymoire.com/Unix/Sed.html

(14)

Short excursion sed & awk

awk:

named after its authors Aho, Weinberger, Kernighan

in general used to work on columns

- awk '{print $1 $2}' concatenates columns 1 and 2
- awk '{print $1, $3}' prints columns 1 and 3
- another example using a sum

Tutorial: http://www.vectorsite.net/tsawk.html

In general: Avoid data processing inside R, try to do it outside.

(15)

Data Management

Sources of data:

- Data in human-readable format (CSV, TXT)
- Data in binary format (Excel, SPSS, STATA)
- Data from relational databases

R ships with about 100 built-in datasets: objects("package:datasets")

many packages bring their own datasets

(16)

Loading data from library

library("datasets")  # loads the datasets package
                     # (loaded automatically)
data("pressure")     # loads dataset
data(pressure)       # alternative
pressure             # output pressure data

(17)

Data Management

objects("package:datasets")
help(Titanic)
data(Titanic)
objects()

(18)

Reading & Writing Data

data <- read.table("filename", header = TRUE)
# guesses the type of each variable: int, double, text
# header with column names is available
names(data)  # variable names
str(data)    # show structure of data frame
head(data)   # show first rows

(19)

Reading & Writing Data

# R checks whether the data file has a header
# may yield string columns only
# if the 1st row has one field fewer than the 2nd, a header is assumed
data <- read.table("filename")

(20)

Reading & Writing Data

# using the wrong separator
data <- read.table("file", sep = "\t")
# assumes tab-separated fields, may read the whole
# line as one variable

(21)

Reading & Writing Data

# reading missing values (NA)
data <- read.table("file", na.strings = ".")
# assumes missing values are represented as '.'

(22)

Reading & Writing Data

# reading CSV
# decimal sep '.', variable sep ','
data <- read.csv("file")
# decimal sep ',', variable sep ';'
data2 <- read.csv2("file")
# direct import from Excel (via the clipboard)
data <- read.table(file = "clipboard")

(23)

Reading & Writing Data

x <- read.csv("beispiel.csv", sep = ";")
dim(x)
names(x)
x
# write to file
write.table(x, file = "test.csv", sep = ";",
            row.names = FALSE, quote = FALSE)
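The options from the previous slides (header, sep, na.strings) can be tried out without an external file. The following sketch writes a small made-up data frame to a temporary file and reads it back; the names `df`, `tmp` and `back` are illustrative only:

```r
# Small example data with one missing value
df <- data.frame(Age = c(10, 20, NA), Sex = c("m", "f", "m"))

tmp <- tempfile(fileext = ".csv")      # scratch file, removed at the end

# Write with ';' as separator and '.' as the marker for missing values
write.table(df, file = tmp, sep = ";",
            row.names = FALSE, quote = FALSE, na = ".")

# Read it back: the same separator and NA marker must be given
back <- read.table(tmp, header = TRUE, sep = ";", na.strings = ".")
str(back)

unlink(tmp)
```

If sep or na.strings is left out when reading, the whole line ends up in a single column or the '.' is read as text, which is the failure mode described on the earlier slides.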

(24)

Univariate Statistics

ddistrib   density function
pdistrib   distribution function
qdistrib   quantile function
rdistrib   random numbers

(25)

Univariate Statistics

dnorm(0)    # density of N(0,1) at 0
pnorm(0)    # cumulative probability up to 0
qnorm(0.5)  # quantile for 0.5
rnorm(100)  # vector with 100 random numbers
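A short sketch of how the four functions relate for the standard normal distribution. The seed value is arbitrary and only makes the random draw repeatable:

```r
d <- dnorm(0)              # equals 1 / sqrt(2 * pi)
p <- pnorm(0)              # half of the probability mass lies below 0
q <- qnorm(0.5)            # the median, i.e. 0

# q* is the inverse of p*
roundtrip <- qnorm(pnorm(1.3))

set.seed(1)                # arbitrary seed
x <- rnorm(100)            # 100 draws from N(0,1)
length(x)
```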

(27)

Distributions in standard R

<key>binom Binomial

<key>chisq Chi-Squared

<key>exp Exponential

<key>f F

<key>hyper Hypergeometric

<key>multinom Multinomial

<key>logis Logistic

<key>norm Normal

<key>pois Poisson

<key>t Student t

<key>unif Uniform
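The <key> prefix is one of d, p, q or r. As a sketch, here is the same naming scheme applied to the binomial and uniform distributions (parameter values chosen only for illustration):

```r
# P(X <= 2) for X ~ Binomial(5, 0.5)
p <- pbinom(2, size = 5, prob = 0.5)

# The same probability as a sum of point probabilities
d <- sum(dbinom(0:2, size = 5, prob = 0.5))

# 25% quantile of the uniform distribution on [0, 4]
q <- qunif(0.25, min = 0, max = 4)

c(p, d, q)
```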

(28)

Empirical Distributions in R

density()  # kernel density estimate (Gaussian kernel by default)
ecdf()     # empirical cdf
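Both functions return objects that can be inspected or plotted. A minimal sketch on simulated data; the seed and sample size are arbitrary:

```r
set.seed(42)            # arbitrary seed
x <- rnorm(200)

kde <- density(x)       # list with grid points kde$x and estimates kde$y
length(kde$x)           # 512 grid points by default

Fn <- ecdf(x)           # ecdf() returns a function
Fn(0)                   # share of observations <= 0
Fn(max(x))              # 1 at the largest observation

# plot(kde) and plot(Fn) draw the two estimates
```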

(29)

Sampling in R

sample(n)                     # random permutation of 1:n
sample(x)                     # shuffle the vector x
sample(x, replace = TRUE)     # bootstrap the vector x
sample(x, n)                  # draw sample of size n from x
sample(x, n, replace = TRUE)  # bootstrap sample of size n from x

The seed is stored in .Random.seed; for reproducible simulations use set.seed().
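A small sketch of why set.seed() matters: resetting the seed to the same (arbitrary) value reproduces the same bootstrap sample. The vector reuses the ages from the data frame example:

```r
x <- c(10, 20, 15, 43, 76, 41, 25, 46)

set.seed(123)                       # arbitrary seed value
b1 <- sample(x, replace = TRUE)

set.seed(123)                       # same seed again
b2 <- sample(x, replace = TRUE)

identical(b1, b2)                   # TRUE: the draw is reproducible

perm <- sample(x)                   # a permutation keeps every value exactly once
```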

(30)

Summary statistics

mean(x)         # mean *
median(x)       # median
var(x)          # sample variance
sd(x)           # sample std. deviation
cov(y)          # covariance of matrix y
quantile(x, p)  # sample quantile *
min(x)          # minimum of x *
max(x)          # maximum of x *
range(x)        # range of x *
skewness(x)     # skewness (not in base R, e.g. package e1071)
kurtosis(x)     # kurtosis (not in base R, e.g. package e1071)

* can remove missing values (NA, NaN) using the parameter na.rm = TRUE
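A minimal sketch of the effect of na.rm on a made-up vector. The helper `skew_moment` is hypothetical and not part of any package; it uses the simple moment-based definition of skewness, which may differ from the default of a packaged skewness() function:

```r
x <- c(10, 20, NA, 43)

mean(x)                           # NA: a missing value propagates
m  <- mean(x, na.rm = TRUE)       # mean of the 3 observed values
r  <- range(x, na.rm = TRUE)      # 10 43
q  <- quantile(x, 0.5, na.rm = TRUE)

# Hypothetical helper: moment-based skewness m3 / m2^(3/2)
skew_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
s <- skew_moment(c(1, 2, 3, NA))  # symmetric data, so 0
```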

(31)

Linear Regression

the linear regression model tries to model the relation between a dependent variable Y and p independent variables $X_1, \dots, X_p$

influence of the variables is linear; the first regressor $X_1$ is usually set to a constant

a sample of size n is fitted to the model:

$$y_i = \beta_1 + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i$$

$$y_i = x_i^\top \beta + \varepsilon_i$$

(32)

Linear Regression

Goals:

estimate the unknown βs using least squares

decide if all variables are needed

check if the resulting model explains the data well enough

use the model to forecast

$$\hat\beta = (X^\top X)^{-1} X^\top y$$
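The least squares formula can be checked numerically against lm(). This is a sketch on simulated data; the true coefficients 2 and 3, the noise level and the seed are arbitrary choices for illustration:

```r
set.seed(1)                               # arbitrary seed
n  <- 50
x2 <- runif(n)
y  <- 2 + 3 * x2 + rnorm(n, sd = 0.1)     # simulated response

X <- cbind(1, x2)                         # first regressor is the constant

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

fit <- lm(y ~ x2)                         # same model via lm()
all.equal(as.vector(beta_hat), unname(coef(fit)))
```

lm() itself uses a QR decomposition rather than the explicit inverse, which is numerically more stable; the two results agree up to rounding here.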

(33)

Linear Regression

# standard model
lm(y ~ x + z)
# no intercept
lm(y ~ x - 1)
# using data frame
lm(amount ~ price, data = consumption)
# using data frame and attach()
lm(amount ~ price)
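The consumption data frame is not available here, so the following sketch uses the built-in dataset cars (columns speed and dist) as a stand-in to show the same calls on real data:

```r
# Model with intercept, variables taken from the data frame
fit <- lm(dist ~ speed, data = cars)
coef(fit)                      # intercept and slope

# Same regressor, but without intercept
fit0 <- lm(dist ~ speed - 1, data = cars)
coef(fit0)                     # slope only

# Use the fitted model to forecast for a new observation
new <- data.frame(speed = 15)
pred <- predict(fit, newdata = new)

summary(fit)                   # coefficient table, R-squared, residual summary
```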

(34)

Exercise

Download the Hubble data from http://lib.stat.cmu.edu/DASL/Datafiles/Hubble.html and estimate the Hubble constant H using the model

recession-velocity = H · distance

(35)

Call:
lm(formula = rec.vel ~ distance - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-411.544 -191.302   -7.103  127.951  496.063

Coefficients:
         Estimate Std. Error t value Pr(>|t|)
distance   423.94      42.15   10.06 6.87e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 229 on 23 degrees of freedom
Multiple R-squared: 0.8147, Adjusted R-squared: 0.8067

(36)

Residuals

[Figure: residuals(lm) plotted against Index]

(37)

Residuals

[Figure: Normal Q-Q Plot of the residuals; y-axis: Sample Quantiles]

(38)

1. Download Cereal data from DASL

2. Read data as data frame

3. Run linear regression rating ~ sugars + fat

(39)

Residuals:
     Min       1Q   Median       3Q      Max
-14.6640  -5.6937   0.2078   4.7660  32.6163

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  61.0886     1.9527  31.284  < 2e-16 ***
sugars       -2.2128     0.2347  -9.428 2.59e-14 ***
fat          -3.0658     1.0365  -2.958  0.00416 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.755 on 74 degrees of freedom
Multiple R-squared: 0.6218, Adjusted R-squared: 0.6116

(40)

t-Test

checks whether a certain coefficient $\beta_j$ is different from 0.

test statistic:

$$t = \frac{\hat\beta_j}{\mathrm{SD}(\hat\beta_j)}$$

under $H_0$: $t \sim t_{n-p}$, with p as the number of independent variables (regressors, including the constant)
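The t statistics reported by summary() can be recomputed by hand. A sketch using the built-in cars data as a stand-in example (n = 50 observations, p = 2 coefficients including the intercept):

```r
fit <- lm(dist ~ speed, data = cars)
est <- coef(summary(fit))               # coefficient table as a matrix

# t = estimate / standard error, one value per coefficient
t_manual <- est[, "Estimate"] / est[, "Std. Error"]

# Degrees of freedom n - p
df <- nrow(cars) - length(coef(fit))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(abs(t_manual), df = df, lower.tail = FALSE)

all.equal(t_manual, est[, "t value"])
all.equal(p_manual, est[, "Pr(>|t|)"])
```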

(41)

F-Test

idea: check whether the sum of squared residuals is reduced significantly if one regressor is added

add one regressor ⇒ model gets better, but significantly?

Compute $\mathrm{RSS}_1$ for the full model with k parameters, compute $\mathrm{RSS}_2$ for the simplified model with k − q parameters

Compute the test statistic

$$F = \frac{(\mathrm{RSS}_2 - \mathrm{RSS}_1)/q}{\mathrm{RSS}_1/(n - k)}$$

under $H_0$: $F \sim F_{(q,\, n-k)}$
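The comparison of a full and a simplified model can be sketched with the built-in mtcars data (a stand-in example, not the course data). The statistic is computed by hand and compared with anova(), which refers it to an F distribution with q and n − k degrees of freedom:

```r
full    <- lm(mpg ~ wt + hp, data = mtcars)   # k = 3 parameters
reduced <- lm(mpg ~ wt, data = mtcars)        # k - q = 2 parameters

rss1 <- sum(residuals(full)^2)                # RSS of the full model
rss2 <- sum(residuals(reduced)^2)             # RSS of the simplified model

n <- nrow(mtcars)
k <- length(coef(full))
q <- k - length(coef(reduced))

F_manual <- ((rss2 - rss1) / q) / (rss1 / (n - k))
p_manual <- pf(F_manual, q, n - k, lower.tail = FALSE)

tab <- anova(reduced, full)                   # built-in comparison of nested models
all.equal(F_manual, tab$F[2])
```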

(42)

Residuals

[Figure: residuals(health) plotted against Index]

(43)

Residuals

[Figure: residual plot; y-axis: Sample Quantiles]

(44)

Residuals

[Figure: quantile plot of the residuals; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]