Uwe Ziegenhagen
Institut für Statistik and Ökonometrie
Humboldt-Universität zu Berlin
http://www.uweziegenhagen.de
Agenda for Today
Data frames
Reading and Writing Data
Exploratory Data Analysis
Data Frames
a matrix can hold only one data type
a data frame allows a different type per column
all columns must have equal length
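A minimal sketch of these three points, using made-up values (the names m, df and res are purely illustrative):

```r
# A matrix holds a single type: mixing numbers and strings
# coerces everything to character
m <- cbind(c(1, 2, 3), c("a", "b", "c"))
typeof(m)

# A data frame keeps one type per column
df <- data.frame(num = c(1, 2, 3), chr = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
sapply(df, class)

# Columns of unequal length are rejected with an error
res <- try(data.frame(x = 1:3, y = 1:2), silent = TRUE)
inherits(res, "try-error")
```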
Data Frame Example
a <- c(10, 20, 15, 43, 76, 41, 25, 46)  # numeric
# factor 'sex'
b <- factor(c("m", "f", "m", "f", "m", "f", "m", "f"))
# siblings, numeric
c <- c(2, 5, 8, 3, 6, 1, 5, 6)
myframe <- data.frame(a, b, c)
myframe
colnames(myframe) <- c("Age", "Sex", "Siblings")
Factor variables
Factor variables: categorical variables (numeric or string). Advantages:
- handled correctly in statistical modeling
- very useful in many different types of graphics
- correct number of degrees of freedom
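As a small illustration with made-up values, the sketch below shows how a factor stores its categories and how a model expands a factor with two levels into a single dummy column next to the intercept, which is where the correct degrees of freedom come from:

```r
sex <- factor(c("m", "f", "m", "f"))

levels(sex)       # categories, sorted alphabetically
as.integer(sex)   # internal integer codes
table(sex)        # counts per category

# Design matrix used by model functions: intercept plus one dummy column
mm <- model.matrix(~ sex)
colnames(mm)
```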
Addressing Components
myframe[, 1]
myframe["Age"]
myframe$Age
myframe[3, 3] <- 2  # change value
myframe[, -2]       # all vars except 2nd
Overview of Objects II
# add object to the search path
attach(name)
# remove from search path
detach(name)
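A short sketch of attach() and detach() on a made-up data frame (consumption and its columns are illustrative names). The last line uses with(), which is one possible alternative that evaluates an expression inside the data frame without changing the search path:

```r
# Illustrative data frame; the values are made up for this sketch
consumption <- data.frame(price = c(1, 2, 3), amount = c(9, 7, 4))

attach(consumption)            # columns are now found by name
mean_attached <- mean(price)
detach(consumption)            # remove it from the search path again

# with() evaluates the expression inside the data frame
mean_with <- with(consumption, mean(price))
```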
Subgrouping Data Frames
> subset(myframe, myframe$Age > 30)  # 4 entries
> mean(subset(myframe$Age, myframe$Sex == "m"))
[1] 31.5
> mean(subset(myframe$Age, myframe$Sex == "f"))
[1] 37.5
> myframe[(myframe$Sex == "m") & (myframe$Age > 30), ]  # males over 30
> myframe[(myframe$Sex == "m") | (myframe$Age > 30), ]  # male or over 30
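The two group means can also be computed in one call. The sketch below rebuilds the example frame so that it runs on its own; tapply() and the select argument of subset() are used here as one possible approach:

```r
myframe <- data.frame(
  Age      = c(10, 20, 15, 43, 76, 41, 25, 46),
  Sex      = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
  Siblings = c(2, 5, 2, 3, 6, 1, 5, 6)
)

# Mean age per level of Sex
group_means <- tapply(myframe$Age, myframe$Sex, mean)
group_means

# Rows with Age > 30, keeping only two columns
older <- subset(myframe, Age > 30, select = c(Age, Sex))
nrow(older)
```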
Data Frames - Variables
> myframe <- cbind(myframe, "Income(USD)" =
+   c(1700, 2100, 2300, 2050, 2800, 1450, 3400, 2000))
> names(myframe)[names(myframe) == "Income(USD)"] <- "IncomeUSD"
Task: Add variable IncomeEUR.
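One possible way to approach the task, as a sketch: the exchange rate usd_to_eur is a hypothetical value chosen only for illustration and needs to be replaced by a real rate.

```r
myframe <- data.frame(
  IncomeUSD = c(1700, 2100, 2300, 2050, 2800, 1450, 3400, 2000)
)

usd_to_eur <- 0.9   # hypothetical exchange rate, not a real quote

# A new column is created by assigning to a name that does not exist yet
myframe$IncomeEUR <- myframe$IncomeUSD * usd_to_eur
head(myframe, 3)
```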
Search and Replace
Use gsub to replace all matches of a regular expression.
> names(myframe) <- gsub("In", "Out", names(myframe))
> myframe
  Age Sex Siblings OutcomeUSD
1  10   m        2       1700
2  20   f        5       2100
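A few more patterns on a made-up vector of names, as a sketch: sub() replaces only the first match, an anchor restricts where a match may occur, and \\1 refers back to a captured group.

```r
x <- c("IncomeUSD", "Income(EUR)", "Age")

# Anchored at the start of the string, first match only
r1 <- sub("^In", "Out", x)

# Character class: remove all parentheses
r2 <- gsub("[()]", "", x)

# Capture the currency code at the end and put an underscore before it
r3 <- gsub("(USD|EUR)$", "_\\1", r2)
r3
```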
Deleting and Sorting
> myframe$Age <- NULL  # deletes the variable Age
> myframe

> myframe[order(myframe$Age), ]  # sort by Age (frame with Age still present)
  Age Sex Siblings OutcomeUSD
1  10   m        2       1700
3  15   m        2       2300
Deleting and Sorting
Sort first by Sex, then by Age
> myframe[order(myframe$Sex, myframe$Age), ]
  Age Sex Siblings OutcomeUSD
2  20   f        5       2100
6  41   f        1       1450
4  43   f        3       2050
8  46   f        6       2000
1  10   m        2       1700
3  15   m        2       2300
7  25   m        5       3400
5  76   m        6       2800
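A self-contained sketch of sorting with order(), rebuilding part of the example frame. Each further argument of order() acts as a tie-breaker for the previous one, and a numeric key can be negated to sort it in descending order:

```r
myframe <- data.frame(
  Age = c(10, 20, 15, 43, 76, 41, 25, 46),
  Sex = factor(c("m", "f", "m", "f", "m", "f", "m", "f"))
)

# Sex first, Age as tie-breaker
by_sex_age <- myframe[order(myframe$Sex, myframe$Age), ]

# Age in descending order
by_age_desc <- myframe[order(myframe$Age, decreasing = TRUE), ]

# Sex ascending, Age descending within each Sex
by_sex_age_desc <- myframe[order(myframe$Sex, -myframe$Age), ]
by_sex_age
```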
Short excursion sed & awk
sed:
stream editor, works row by row; examples:
- sed 's/abc/def/' input.txt > output.txt
- sed 's|/|\\|g' input.txt > output.txt
- example using regular expression
extremely useful for processing large amounts of data
Tutorial: http://www.grymoire.com/Unix/Sed.html
Short excursion sed & awk
awk:
named after Aho, Weinberger, Kernighan; in general used to work on columns
- awk '{print $1 $2}' concatenates columns 1 and 2
- awk '{print $1, $3}' prints columns 1 and 3
- another example using a sum
Tutorial: http://www.vectorsite.net/tsawk.html
In general: Avoid data processing inside R, try to do it outside.
Data Management
Sources of data:
- Data in human-readable format (CSV, TXT)
- Data in binary format (Excel, SPSS, Stata)
- Data from relational databases
R has around 100 built-in datasets: objects("package:datasets")
many packages bring their own datasets
Loading data from library
library("datasets")  # loads dataset library (automatically loaded)
data("pressure")     # loads dataset
data(pressure)       # alternative
pressure             # output pressure data
Data Management
objects("package:datasets")
help(Titanic)
data(Titanic)
objects()
Reading & Writing Data
data <- read.table("filename", header = TRUE)
# guesses type of variable: int, double, text
# header with column names is available
names(data)  # variable names
str(data)    # show structure of data frame
head(data)   # show first rows
Reading & Writing Data
# check if data file has a header
# without header = TRUE the header row may be read as data,
# turning all columns into strings
# a header is assumed only if the 1st row has one field fewer than the 2nd
data <- read.table("filename")
Reading & Writing Data
# using wrong separator
data <- read.table("file", sep = "\t")
# assumes tabulator; if the file uses another separator,
# the whole line may be read as one variable
Reading & Writing Data
# reading missing values (NA)
data <- read.table("file", na.strings = ".")
# assumes NA to be represented as '.'
Reading & Writing Data
# reading CSV
# decimal sep '.', var. sep ','
data <- read.csv("file")
# decimal sep ',', var. sep ';'
data2 <- read.csv2("file")
# direct import from Excel (copy the cells to the clipboard first)
data <- read.table(file = "clipboard")
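To see the difference between the two CSV conventions without creating files, the sketch below passes made-up content through the text argument of the reading functions; both calls are expected to yield the same numeric column:

```r
csv_en <- "x,y\n1.5,a\n2.5,b"   # decimal point, comma as separator
csv_de <- "x;y\n1,5;a\n2,5;b"   # decimal comma, semicolon as separator

d1 <- read.csv(text = csv_en)
d2 <- read.csv2(text = csv_de)

str(d1)
identical(d1$x, d2$x)
```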
Reading & Writing Data
x <- read.csv("beispiel.csv", sep = ";")
dim(x)
names(x)
x
# write to file
write.table(x, file = "test.csv", sep = ";",
            row.names = FALSE, quote = FALSE)
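A self-contained round trip as a sketch: a made-up data frame is written to a temporary file with ';' as separator and '.' as missing-value marker, then read back with matching settings. The file name comes from tempfile(), so nothing in the working directory is touched.

```r
x <- data.frame(id = 1:3, value = c(1.5, NA, 3.25))

tmp <- tempfile(fileext = ".csv")
write.table(x, file = tmp, sep = ";", row.names = FALSE,
            quote = FALSE, na = ".")

# Separator and NA marker must match what was written
y <- read.table(tmp, header = TRUE, sep = ";", na.strings = ".")
str(y)

unlink(tmp)   # clean up the temporary file
```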
Univariate Statistics
ddistrib  density function
pdistrib  distribution function
qdistrib  quantile function
rdistrib  random numbers
Univariate Statistics
dnorm(0)    # density value of N(0,1) at 0
pnorm(0)    # cumulative probability up to 0
qnorm(0.5)  # quantile for 0.5
rnorm(100)  # vector with 100 random numbers
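The four prefixes fit together as sketched below: q inverts p, and for a discrete distribution p is the running sum of d. The numeric arguments are arbitrary illustration values.

```r
d0 <- dnorm(0)                # equals 1 / sqrt(2 * pi)
p  <- pnorm(1.3)              # P(X <= 1.3) for X ~ N(0,1)
q  <- qnorm(p)                # inverse of pnorm, gives back 1.3

# Same naming scheme for other distributions, here the binomial
cdf3 <- pbinom(3, size = 10, prob = 0.5)
sum3 <- sum(dbinom(0:3, size = 10, prob = 0.5))

# Random numbers with non-default parameters
set.seed(1)
x <- rnorm(1000, mean = 5, sd = 2)
```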
Univariate Statistics
Distributions in standard R (<key> stands for one of the prefixes d, p, q, r)
<key>binom Binomial
<key>chisq Chi-Squared
<key>exp Exponential
<key>f F
<key>hyper Hypergeometric
<key>multinom Multinomial
<key>logis Logistic
<key>norm Normal
<key>pois Poisson
<key>t Student t
<key>unif Uniform
Empirical Distributions in R
density(x)  # KDE using Gaussian kernel
ecdf(x)     # empirical cdf
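A brief sketch on simulated data: ecdf() returns a function that can be evaluated at any point, while density() returns an object holding the grid and the estimated density values. The plotting calls are left as comments.

```r
set.seed(42)
x <- rnorm(200)

Fn <- ecdf(x)        # a step function
Fn(0)                # share of observations <= 0

kde <- density(x)    # Gaussian kernel is the default
length(kde$x)        # number of grid points of the estimate

# plot(kde); plot(Fn)
```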
Sampling in R
sample(n)                     # random permutation of 1:n
sample(x)                     # shuffle the x vector
sample(x, replace = TRUE)     # bootstrap x vector
sample(x, n)                  # draw sample of size n from x
sample(x, n, replace = TRUE)  # bootstrap sample of size n from x
The seed is stored in .Random.seed; for reproducible simulations use set.seed()
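A minimal sketch of why set.seed() matters: the same seed gives the same draw. The second part uses sample() with replacement for a simple bootstrap of the mean; the data values and the 1000 replications are arbitrary choices for illustration.

```r
set.seed(123)
a <- sample(10)
set.seed(123)
b <- sample(10)
identical(a, b)   # same seed, same permutation

# Bootstrap distribution of the mean of a small made-up vector
x <- c(10, 20, 15, 43, 76, 41, 25, 46)
set.seed(1)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(boot_means)    # bootstrap estimate of the standard error
```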
Summary statistics
mean(x)         # mean *
median(x)       # median
var(x)          # sample variance
sd(x)           # sample std. deviation
cov(y)          # covariance matrix of matrix y
quantile(x, p)  # sample quantile *
min(x)          # minimum of x *
max(x)          # maximum of x *
range(x)        # range of x *
skewness(x)     # skewness (not in base R, e.g. package e1071 or moments)
kurtosis(x)     # kurtosis (not in base R, e.g. package e1071 or moments)
* can remove NAs using parameter na.rm = TRUE
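The effect of na.rm on a made-up vector, plus a hand-computed moment-based skewness and excess kurtosis as a sketch for when no add-on package is loaded. Packages may use slightly different definitions, so the hand-computed values are an assumption about the intended formula.

```r
x <- c(2, 4, 4, 5, NA, 10)

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # missing value dropped before averaging

# Moment-based skewness and excess kurtosis, computed by hand
z    <- x[!is.na(x)]
m2   <- mean((z - mean(z))^2)
skew <- mean((z - mean(z))^3) / m2^1.5
kurt <- mean((z - mean(z))^4) / m2^2 - 3
```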
Linear Regression
linear regression model
tries to model the relation between dependent variable $Y$ and $1, \dots, n$ independent variables $X_1, \dots, X_n$
influence of the variables is linear; the first regressor $X_1$ is usually set to a constant
sample of size n is fitted to model:
$y_i = \beta_1 + \beta_2 \cdot x_{i2} + \dots + \beta_n \cdot x_{in} + \varepsilon_i$
$y_i = x_i^\top \beta + \varepsilon_i$
Linear Regression
Goals:
- estimate unknown βs using least squares
- decide if all variables are needed
- check if resulting model explains data well enough
- use model to forecast
$\hat{\beta} = (X^\top X)^{-1} X^\top y$
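The least squares formula can be checked numerically. The sketch below uses the built-in cars dataset (chosen here only as a convenient example) and compares the result of the normal equations with the coefficients returned by lm():

```r
# Design matrix with a constant first column, response vector
X <- cbind(1, cars$speed)
y <- cars$dist

# Solve (X'X) beta = X'y instead of inverting X'X explicitly
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

fit <- lm(dist ~ speed, data = cars)
cbind(manual = as.numeric(beta_hat), lm = unname(coef(fit)))
```

Solving the linear system rather than forming the inverse is numerically more stable; lm() itself relies on a QR decomposition.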
Linear Regression
# standard model
lm(y ~ x + z)
# no intercept
lm(y ~ x - 1)
# using data frame
lm(amount ~ price, data = consumption)
# using data frame and attach()
attach(consumption)
lm(amount ~ price)
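A sketch of what can be done with a fitted model object, again on the built-in cars dataset (an illustrative choice): extracting coefficients, fitting without intercept, forecasting for a new value and collecting residuals for diagnostic plots.

```r
fit  <- lm(dist ~ speed, data = cars)       # with intercept
fit0 <- lm(dist ~ speed - 1, data = cars)   # through the origin

coef(fit)
coef(fit0)

# Forecast for a new observation with speed = 15
pred <- predict(fit, newdata = data.frame(speed = 15))

# Residuals for diagnostic plots
res <- residuals(fit)
# plot(res); qqnorm(res); qqline(res)
```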
Exercise
Download Hubble data from
http://lib.stat.cmu.edu/DASL/Datafiles/Hubble.html and estimate the Hubble constant H by the model
recession-velocity = H · distance
Call:
lm(formula = rec.vel ~ distance - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-411.544 -191.302   -7.103  127.951  496.063

Coefficients:
         Estimate Std. Error t value Pr(>|t|)
distance   423.94      42.15   10.06 6.87e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 229 on 23 degrees of freedom
Multiple R-squared: 0.8147, Adjusted R-squared: 0.8067
Residuals
[Figure: residuals(lm) plotted against Index]
Residuals
[Figure: Normal Q-Q Plot of the residuals; y-axis: Sample Quantiles]
1. Download Cereal data from DASL
2. Read data as data frame
3. Run linear regression rating = sugars + fat
Residuals:
     Min       1Q   Median       3Q      Max
-14.6640  -5.6937   0.2078   4.7660  32.6163

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  61.0886     1.9527  31.284  < 2e-16 ***
sugars       -2.2128     0.2347  -9.428 2.59e-14 ***
fat          -3.0658     1.0365  -2.958  0.00416 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.755 on 74 degrees of freedom
Multiple R-squared: 0.6218, Adjusted R-squared: 0.6116
t-Test
checks if a certain coefficient $\beta_j$ is different from 0.
test statistic:
$t = \frac{\hat{\beta}_j}{SD(\hat{\beta}_j)}$
under $H_0$: $t \sim t_{n-p}$ with p the number of independent variables (regressors, including the constant)
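The t statistics reported by summary() can be reproduced by hand. The sketch below uses the built-in mtcars dataset with an arbitrarily chosen model; the residual degrees of freedom are the number of observations minus the number of estimated coefficients.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
ct  <- coef(summary(fit))      # coefficient table as a matrix

# t = estimate / standard error
t_manual <- ct[, "Estimate"] / ct[, "Std. Error"]

# Two-sided p-values from the t distribution
df_res   <- df.residual(fit)   # n minus number of coefficients
p_manual <- 2 * pt(abs(t_manual), df = df_res, lower.tail = FALSE)

cbind(t_manual, p_manual)
```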
F-Test
idea: check if the sum of squared residuals is reduced significantly if one regressor is added
add one regressor ⇒ model gets better, but significantly?
Compute RSS1 for the full model with k parameters, compute RSS2 for the simplified model with k − q parameters
Compute the test statistic
$F = \frac{(RSS_2 - RSS_1)/q}{RSS_1/(n - k)}$
under $H_0$: $F \sim F_{(q,\, n-k)}$
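The F statistic can be computed directly from the two residual sums of squares and compared with anova(). This is a sketch on the built-in mtcars dataset; the choice of models is arbitrary and only serves as illustration.

```r
full    <- lm(mpg ~ wt + hp, data = mtcars)   # k = 3 parameters
reduced <- lm(mpg ~ wt, data = mtcars)        # k - q = 2 parameters

rss1 <- sum(residuals(full)^2)      # RSS of the full model
rss2 <- sum(residuals(reduced)^2)   # RSS of the simplified model

n <- nrow(mtcars)
k <- length(coef(full))
q <- k - length(coef(reduced))

F_manual <- ((rss2 - rss1) / q) / (rss1 / (n - k))
p_manual <- pf(F_manual, df1 = q, df2 = n - k, lower.tail = FALSE)

tab <- anova(reduced, full)   # same test via anova()
tab
```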
[Figure: Residuals, residuals(health) plotted against Index]
[Figure: Residuals, Normal Q-Q Plot; y-axis: Sample Quantiles]
[Figure: Residuals, Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]