Statistical Programming Languages – Day 2 SVN-revision: 0

Academic year: 2021

(1)

Uwe Ziegenhagen

Institut für Statistik und Ökonometrie

Humboldt-Universität zu Berlin

http://www.uweziegenhagen.de

(2)

Agenda for Today

Data frames

Reading and Writing Data

Exploratory Data Analysis

(3)

Data Frames

matrices can only have one datatype

data frames: several types allowed

all columns must have equal length

(4)

Data Frame Example

a <- c(10, 20, 15, 43, 76, 41, 25, 46)  # numeric
# factor 'sex'
b <- factor(c("m", "f", "m", "f", "m", "f", "m", "f"))
# siblings, numeric
c <- c(2, 5, 8, 3, 6, 1, 5, 6)
myframe <- data.frame(a, b, c)
myframe
colnames(myframe) <- c("Age", "Sex", "Siblings")

(5)

Factor variables

Factor variables: categorical variables (numeric or string)

advantages:

- implemented correctly in statistical modeling
- very useful in many different types of graphics
- correct number of degrees of freedom
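As a minimal sketch of what a factor stores, the following uses the Sex and Age values from the example data frame. The variable names `sex`, `age` and `group_means` are chosen here for illustration only.

```r
# Sex coded as a factor (values taken from the example data frame)
sex <- factor(c("m", "f", "m", "f", "m", "f", "m", "f"))
age <- c(10, 20, 15, 43, 76, 41, 25, 46)

levels(sex)        # the distinct categories, sorted alphabetically
table(sex)         # frequency of each category
as.integer(sex)    # internal integer codes (1 = "f", 2 = "m")

# group-wise statistic: mean age per factor level
group_means <- tapply(age, sex, mean)
group_means
```

Functions such as `tapply()`, `table()` and `lm()` use the levels of a factor to form groups, which is why a categorical variable should be stored as a factor rather than as plain numbers.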

(6)

Addressing Components

myframe[, 1]
myframe["Age"]
myframe$Age
myframe[3, 3] <- 2  # change value
myframe[, -2]       # all variables except the 2nd
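The access forms do not all return the same kind of object. The sketch below rebuilds the example data frame (with the column names set directly) and compares the results. The names `by_position`, `by_name`, `by_dollar` and `by_double` are illustrative.

```r
myframe <- data.frame(
  Age      = c(10, 20, 15, 43, 76, 41, 25, 46),
  Sex      = factor(c("m", "f", "m", "f", "m", "f", "m", "f")),
  Siblings = c(2, 5, 8, 3, 6, 1, 5, 6)
)

by_position <- myframe[, 1]      # plain numeric vector
by_name     <- myframe["Age"]    # data frame with a single column
by_dollar   <- myframe$Age       # plain numeric vector
by_double   <- myframe[["Age"]]  # same vector via [[ ]]

class(by_position)
class(by_name)
identical(by_position, by_dollar)
```

Single brackets with a name keep the data frame structure, while `$` and `[[ ]]` extract the column itself.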

(7)

Overview of Objects II

# add object to the search path
attach(name)
# remove from the search path
detach(name)

(8)

Subgrouping Data Frames

> subset(myframe, myframe$Age > 30)  # 4 entries
> mean(subset(myframe$Age, myframe$Sex == "m"))
[1] 31.5
> mean(subset(myframe$Age, myframe$Sex == "f"))
[1] 37.5
myframe[(myframe$Sex == "m") & (myframe$Age > 30), ]
# males over 30
myframe[(myframe$Sex == "m") | (myframe$Age > 30), ]
# male or over 30

(9)

Data Frames - Variables

> myframe <- cbind(myframe, "Income (USD)" =
    c(1700, 2100, 2300, 2050, 2800, 1450, 3400, 2000))
> names(myframe)[names(myframe) == "Income (USD)"] <- "IncomeUSD"

Task: Add variable IncomeEUR.
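One possible solution to the task, as a sketch: a new column is created by assigning to `myframe$IncomeEUR`. The exchange rate `usd_to_eur` is a hypothetical value chosen for illustration, and the data frame is reduced to two columns to keep the example short.

```r
myframe <- data.frame(
  Age       = c(10, 20, 15, 43, 76, 41, 25, 46),
  IncomeUSD = c(1700, 2100, 2300, 2050, 2800, 1450, 3400, 2000)
)

usd_to_eur <- 0.9  # hypothetical exchange rate, for illustration only
myframe$IncomeEUR <- myframe$IncomeUSD * usd_to_eur
names(myframe)
head(myframe, 3)
```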

(10)

Search and Replace

Use gsub to perform replacement of matches determined by regular expressions.

> names(myframe) <- gsub("In", "Out", names(myframe))
> myframe
  Age Sex Siblings OutcomeUSD
1  10   m        2       1700
2  20   f        5       2100
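The pattern in `gsub()` is a regular expression, so anchors and capture groups are available. The following sketch uses a made-up vector of names (`vars`) to show the difference between a plain pattern, an anchored pattern and a back-reference.

```r
vars <- c("IncomeUSD", "IncomeEUR", "Age", "Inhabitants")

# "^In" matches "In" only at the start of a name
gsub("^In", "Out", vars)

# capture group: move the three-letter currency code to the front
swapped <- gsub("^Income([A-Z]{3})$", "\\1_Income", vars)
swapped

# sub() replaces only the first match, gsub() every match
sub("a", "A", "banana")
gsub("a", "A", "banana")
```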

(11)

Deleting and Sorting

> myframe$Age <- NULL
> myframe

> myframe[order(myframe$Age), ]
  Age Sex Siblings OutcomeUSD
1  10   m        2       1700
3  15   m        2       2300

(12)

Deleting and Sorting

1. Sort by Sex 2. Sort by Age

> myframe[order(myframe$Sex, myframe$Age), ]
  Age Sex Siblings OutcomeUSD
2  20   f        5       2100
6  41   f        1       1450
4  43   f        3       2050
8  46   f        6       2000
1  10   m        2       1700
3  15   m        2       2300
7  25   m        5       3400
5  76   m        6       2800

(13)

Short excursion sed & awk

sed:

stream editor, works row by row

examples:

- sed 's/abc/def/' input.txt > output.txt
- sed 's|/|\\|g' input.txt > output.txt
- example using regular expressions

extremely useful to process large amounts of data

Tutorial: http://www.grymoire.com/Unix/Sed.html

(14)

Short excursion sed & awk

awk:

named after Aho, Weinberger, Kernighan

in general used to work on columns

- awk '{print $1 $2}' concatenates columns 1 and 2
- awk '{print $1,$3}' prints columns 1 and 3
- another example using a sum

Tutorial: http://www.vectorsite.net/tsawk.html

In general: Avoid data processing inside R, try to do it outside.

(15)

Data Management

Sources of data:

- Data in human-readable format (CSV, TXT)
- Data in binary format (Excel, SPSS, STATA)
- Data from relational databases

R has about 100 built-in datasets: objects("package:datasets")

many packages bring their own datasets

(16)

Loading data from library

library("datasets")  # loads datasets library
                     # (automatically loaded)
data("pressure")     # loads dataset
data(pressure)       # alternative
pressure             # output pressure data

(17)

Data Management

objects("package:datasets")
help(Titanic)
data(Titanic)
objects()

(18)

Reading & Writing Data

data <- read.table("filename", header = TRUE)
# guesses type of variable: int, double, text
# header with column names is available
names(data)  # variable names
str(data)    # show structure of data frame
head(data)   # show first rows

(19)

Reading & Writing Data

# check if data file has header
# otherwise the result may contain only string columns
# if the 1st row has one field fewer than the 2nd, a header is assumed
data <- read.table("filename")

(20)

Reading & Writing Data

# using wrong separator
data <- read.table("file", sep = "\t")
# assumes tabulator, may read whole
# line as one variable

(21)

Reading & Writing Data

# reading missing values (NA)
data <- read.table("file", na.strings = ".")
# assumes missing values to be represented as '.'
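The options from the previous slides (`header`, `sep`, `na.strings`) can be combined in one call. The sketch below writes a small made-up file to a temporary path so that it runs without any external data. The file content and the names `path` and `dat` are assumptions for illustration.

```r
# write a small tab-separated file with a header and '.' for missing values
path <- tempfile(fileext = ".txt")
writeLines(c("age\tsex\tincome",
             "10\tm\t1700",
             "20\tf\t.",
             "15\tm\t2300"), path)

dat <- read.table(path, header = TRUE, sep = "\t", na.strings = ".")
str(dat)                         # column types are guessed from the content
is.na(dat$income)                # the '.' entry became NA
mean(dat$income, na.rm = TRUE)   # mean of the non-missing incomes
```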

(22)

Reading & Writing Data

# reading CSV
# decimal sep '.', variable sep ','
data <- read.csv("file")
# decimal sep ',', variable sep ';'
data2 <- read.csv2("file")
# direct import from Excel
data <- read.table(file = "clipboard")

(23)

Reading & Writing Data

x <- read.csv("beispiel.csv", sep = ";")
dim(x)
names(x)
x
# write to file
write.table(x, file = "test.csv", sep = ";",
            row.names = FALSE, quote = FALSE)
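A quick way to check that writing and reading use matching settings is a round trip through a temporary file. This is a sketch with a made-up data frame; `path` and `y` are illustrative names.

```r
x <- data.frame(id = 1:3, price = c(1.5, 2.25, 3))

path <- tempfile(fileext = ".csv")
write.table(x, file = path, sep = ";", row.names = FALSE, quote = FALSE)

readLines(path)                 # raw file content, header line first
y <- read.csv(path, sep = ";")  # read it back with the same separator
all.equal(x, y)
```

If the separator passed to `read.csv()` differs from the one used for writing, each line ends up in a single column.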

(24)

Univariate Statistics

ddistrib density function
pdistrib distribution function
qdistrib quantile function
rdistrib random numbers

(25)

Univariate Statistics

dnorm(0)    # density value of N(0,1) at 0
pnorm(0)    # cumulative probability up to 0
qnorm(0.5)  # quantile for 0.5
rnorm(100)  # vector with 100 random numbers
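The four function families are related: the quantile function inverts the distribution function, and the density is its derivative. A small sketch for the normal distribution, where the step size `h` and the variable names are chosen for illustration:

```r
p <- pnorm(1.96)   # P(X <= 1.96) for X ~ N(0, 1)
q <- qnorm(p)      # the quantile function inverts pnorm
q

# the density is the derivative of the distribution function
h <- 1e-6
slope <- (pnorm(h) - pnorm(-h)) / (2 * h)
c(numerical = slope, exact = dnorm(0))

# distribution parameters are passed by name
pnorm(110, mean = 100, sd = 10)   # same probability as pnorm(1)
```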

(27)

Distributions in standard R

<key>binom Binomial

<key>chisq Chi-Squared

<key>exp Exponential

<key>f F

<key>hyper Hypergeometric

<key>multinom Multinomial

<key>logis Logistic

<key>norm Normal

<key>pois Poisson

<key>t Student t

<key>unif Uniform

(28)

Empirical Distributions in R

density()  # KDE using Gaussian kernel
ecdf()     # empirical cdf
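Both functions return objects rather than single numbers. The following sketch applies them to simulated data (not data from the lecture). The seed value and the names `Fn` and `kde` are arbitrary.

```r
set.seed(1)        # arbitrary seed, only for reproducibility
x <- rnorm(200)    # simulated data

Fn <- ecdf(x)      # ecdf() returns a function
Fn(0)              # share of observations <= 0
Fn(max(x))         # all observations are <= the maximum

kde <- density(x)  # Gaussian kernel by default
kde$bw             # bandwidth chosen automatically
range(kde$x)       # grid on which the density was evaluated
```

Both objects have plot methods, so `plot(Fn)` and `plot(kde)` draw the step function and the estimated density curve.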

(29)

Sampling in R

sample(n)                     # random permutation of 1:n
sample(x)                     # shuffle the x vector
sample(x, replace = TRUE)     # bootstrap x vector
sample(x, n)                  # draw sample of size n from x
sample(x, n, replace = TRUE)  # bootstrap sample of size n from x

The seed is stored in .Random.seed; for reproducible simulations use set.seed()
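A short sketch of why `set.seed()` matters and how `sample(x, replace = TRUE)` is used for a bootstrap. The seed value and the number of replications are arbitrary, and `x` reuses the Age values from the example data frame.

```r
set.seed(42)   # arbitrary seed value
a <- sample(10)
set.seed(42)
b <- sample(10)
identical(a, b)   # same seed, same permutation

# bootstrap estimate of the standard error of the mean
x <- c(10, 20, 15, 43, 76, 41, 25, 46)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(boot_means)
```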

(30)

Summary statistics

mean(x)         # mean *
median(x)       # median *
var(x)          # sample variance *
sd(x)           # sample std. deviation *
cov(y)          # covariance of matrix y
quantile(x, p)  # sample quantile *
min(x)          # minimum of x *
max(x)          # maximum of x *
range(x)        # range of x *
skewness(x)     # skewness (not in base R, e.g. package e1071 or moments)
kurtosis(x)     # kurtosis (not in base R, e.g. package e1071 or moments)

* missing values (NA) can be removed using parameter na.rm = TRUE
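The effect of `na.rm` and a hand-written version of skewness and kurtosis can be sketched as follows. The moment-based formulas below are one of several common definitions and are an assumption here, so results from add-on packages may differ slightly depending on the definition they use.

```r
x <- c(10, 20, 15, NA, 76, 41, 25, 46)   # example values with one NA

mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # mean of the remaining values

# moment-based skewness and excess kurtosis
z <- x[!is.na(x)]
m2 <- mean((z - mean(z))^2)
skew <- mean((z - mean(z))^3) / m2^1.5
kurt <- mean((z - mean(z))^4) / m2^2 - 3
c(skewness = skew, kurtosis = kurt)
```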

(31)

Linear Regression

linear regression model

tries to model the relation between a dependent variable $Y$ and $1, \dots, n$ independent variables $X_1, \dots, X_n$

influence of variables is linear, first regressor $X_1$ usually set to constant

sample of size n is fitted to model:

$$y_i = \beta_1 + \beta_2 \cdot x_{i2} + \dots + \beta_n \cdot x_{in} + \varepsilon_i$$

$$y_i = x_i^\top \beta + \varepsilon_i$$

(32)

Linear Regression

Goals:

estimate unknown βs using least squares

decide if all variables are needed

check if resulting model explains data well enough

use model to forecast

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$
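The least squares formula can be checked numerically against `lm()`. The sketch below uses simulated data with made-up coefficients. `solve(A, b)` is used instead of forming the inverse explicitly, which gives the same estimate.

```r
set.seed(123)   # arbitrary seed
n <- 50
x2 <- runif(n)
x3 <- runif(n)
y  <- 1 + 2 * x2 - 3 * x3 + rnorm(n, sd = 0.1)   # simulated data

X <- cbind(1, x2, x3)                       # first regressor is the constant
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X'X) b = X'y

fit <- lm(y ~ x2 + x3)
data.frame(normal_equations = as.vector(beta_hat), lm = coef(fit))
```

Internally `lm()` uses a QR decomposition rather than the normal equations, but for well-conditioned data both give the same coefficients up to rounding.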

(33)

Linear Regression

# standard model
lm(y ~ x + z)
# no intercept
lm(y ~ x - 1)
# using data frame
lm(amount ~ price, data = consumption)
# using data frame and attach()
attach(consumption)
lm(amount ~ price)
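A runnable sketch of the `data =` variant. The data frame `consumption` below is made up for illustration and is not the data set used in the lecture.

```r
# small made-up data frame with the column names used above
consumption <- data.frame(
  price  = c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5),
  amount = c(9.8, 9.1, 8.2, 7.4, 6.1, 5.5)
)

fit <- lm(amount ~ price, data = consumption)
coef(fit)      # intercept and slope
summary(fit)   # standard errors, t values, R-squared
predict(fit, newdata = data.frame(price = 4))   # forecast at a new price
```

Passing `data =` keeps the variables inside the data frame and avoids name clashes that can occur with `attach()`.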

(34)

Exercise

Download Hubble data from http://lib.stat.cmu.edu/DASL/Datafiles/Hubble.html and estimate the Hubble constant H by the model

recession-velocity = H · distance

(35)

Call:
lm(formula = rec.vel ~ distance - 1)

Residuals:
     Min       1Q   Median       3Q      Max
-411.544 -191.302   -7.103  127.951  496.063

Coefficients:
         Estimate Std. Error t value Pr(>|t|)
distance   423.94      42.15   10.06 6.87e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 229 on 23 degrees of freedom
Multiple R-squared: 0.8147, Adjusted R-squared: 0.8067

(36)

Residuals

[Figure: residuals(lm) plotted against Index]

(37)

Residuals

[Figure: Normal Q-Q Plot of the residuals, y-axis: Sample Quantiles]

(38)

1. Download Cereal data from DASL
2. Read data as data frame
3. Run linear regression rating ~ sugars + fat

(39)

Residuals:
     Min       1Q   Median       3Q      Max
-14.6640  -5.6937   0.2078   4.7660  32.6163

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  61.0886     1.9527  31.284  < 2e-16 ***
sugars       -2.2128     0.2347  -9.428 2.59e-14 ***
fat          -3.0658     1.0365  -2.958  0.00416 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.755 on 74 degrees of freedom
Multiple R-squared: 0.6218, Adjusted R-squared: 0.6116

(40)

t-Test

checks if a certain coefficient $\beta_j$ is different from 0.

test statistic

$$t = \frac{\hat{\beta}_j}{\mathrm{SD}(\hat{\beta}_j)}$$

under $H_0$: $t \sim t_{n-p}$, with $p$ as number of independent variables
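The t values reported by `summary()` can be reproduced by hand from the estimates and their standard errors. The sketch below uses simulated data in which `z` has no influence on `y` by construction. All names and the seed are illustrative. The degrees of freedom are taken from `df.residual()`, which is the sample size minus the number of estimated coefficients, counting the constant.

```r
set.seed(7)   # arbitrary seed
n <- 40
x <- rnorm(n)
z <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)   # z is irrelevant by construction

fit <- lm(y ~ x + z)
tab <- coef(summary(fit))     # Estimate, Std. Error, t value, Pr(>|t|)

t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

cbind(t_manual, t_summary = tab[, "t value"],
      p_manual, p_summary = tab[, "Pr(>|t|)"])
```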

(41)

F-Test

idea: check if sum of squared residuals is reduced significantly if one regressor is added

add one regressor ⇒ model gets better, but significantly?

Compute RSS1 for full model with k parameters, compute RSS2 for simplified model with k − q parameters

Compute test statistic

$$F = \frac{(\mathrm{RSS}_2 - \mathrm{RSS}_1)/q}{\mathrm{RSS}_1/(n - k)}$$

under $H_0$: $F \sim F_{(q,\, n-k)}$
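The comparison of a full and a simplified model can be sketched with simulated data. The F statistic is computed by hand from the two residual sums of squares and compared with `anova()`, using q and n − k degrees of freedom as the reference distribution. The data, seed and variable names are assumptions for illustration.

```r
set.seed(11)   # arbitrary seed
n <- 60
x <- rnorm(n)
z <- rnorm(n)
y <- 1 + 2 * x + 0.5 * z + rnorm(n)

full    <- lm(y ~ x + z)   # k = 3 parameters
reduced <- lm(y ~ x)       # k - q = 2 parameters, q = 1

rss1 <- sum(residuals(full)^2)
rss2 <- sum(residuals(reduced)^2)
k <- length(coef(full))
q <- k - length(coef(reduced))

F_manual <- ((rss2 - rss1) / q) / (rss1 / (n - k))
p_manual <- pf(F_manual, df1 = q, df2 = n - k, lower.tail = FALSE)
c(F = F_manual, p = p_manual)

anova(reduced, full)       # reports the same F statistic and p-value
```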

(42)

Residuals

[Figure: residuals(health) plotted against Index]

(43)

Residuals

[Figure: Normal Q-Q Plot of the residuals, y-axis: Sample Quantiles]

(44)

Residuals

[Figure: Normal Q-Q Plot, x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
