Visualization 1
Applied Multivariate Statistics – Spring 2013
Goals
Covariance, Correlation (true / sample version)
Test for zero correlation: Fisher’s z-Transformation
Scatterplot / Scatterplot matrix
Covariance matrix / Correlation matrix
Multivariate Normal Distribution
Mahalanobis distance
Visualization in 1d
Normal distribution in 1d:
Most common model choice

$$\varphi_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left(-\frac{1}{2}\cdot\frac{(x-\mu)^2}{\sigma^2}\right)$$

The exponent contains the squared Mahalanobis distance $\frac{(x-\mu)^2}{\sigma^2}$
= sq. distance from the mean in standard deviations
One variable: Expected value and variance

Expected value: $\mu = E[X] = \int x\, f(x)\,dx$. Estimate: mean $\hat\mu = \bar{x} = \frac{1}{n}\sum x_i$

Variance: $\sigma_X^2 = Var(X) = E\left[(X - E[X])^2\right] = \int (x - E[X])^2 f(x)\,dx$. Estimate: sample variance $\hat\sigma^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$

Standard deviation: $\sigma_X = \sqrt{Var(X)}$. Estimate: square root of the sample variance
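These estimates can be sketched in a few lines of NumPy (the course itself uses R, where `mean()`, `var()`, and `sd()` compute the same quantities; the data vector here is made up for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Mean: x_bar = (1/n) * sum(x_i)
mean = x.sum() / len(x)

# Sample variance: (1/(n-1)) * sum((x_i - x_bar)^2)  -- note n-1, not n
var = ((x - mean) ** 2).sum() / (len(x) - 1)

# Standard deviation: square root of the sample variance
sd = var ** 0.5

# NumPy agrees when told to use the n-1 (sample) versions via ddof=1
assert np.isclose(var, np.var(x, ddof=1))
assert np.isclose(sd, np.std(x, ddof=1))
print(mean, var, sd)
```

The `ddof=1` argument is what switches NumPy from the population formula (divide by n) to the sample formula (divide by n−1) used on this slide.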
Two variables: Covariance and Correlation

Covariance: $Cov(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] \in (-\infty, \infty)$

Correlation: $Corr(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1]$

Sample covariance: $\widehat{Cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

Sample correlation: $r_{xy} = \widehat{Cor}(x, y) = \frac{\widehat{Cov}(x, y)}{\hat\sigma_x \hat\sigma_y}$

Correlation is invariant to changes in units, covariance is not
(e.g. gram vs. kilogram, meter vs. kilometer, etc.)
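The unit-invariance claim can be checked directly with a small NumPy sketch (data and the factor 1000, i.e. meters to kilometers, are illustrative; in R, `cov()` and `cor()` are the built-in equivalents):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

def sample_cov(x, y):
    # Cov-hat(x, y) = 1/(n-1) * sum((x_i - x_bar)(y_i - y_bar))
    return ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

def sample_cor(x, y):
    # r_xy = Cov-hat(x, y) / (sigma_hat_x * sigma_hat_y)
    return sample_cov(x, y) / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Changing units (e.g. meters -> kilometers) rescales the covariance ...
cov_m, cov_km = sample_cov(x, y), sample_cov(x / 1000, y)
# ... but leaves the correlation unchanged
cor_m, cor_km = sample_cor(x, y), sample_cor(x / 1000, y)

assert np.isclose(cov_km, cov_m / 1000)
assert np.isclose(cor_km, cor_m)
```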
Scatterplot: Correlation is scale invariant
Intuition and pitfalls for correlation:
Correlation = LINEAR relation
Source: Wikipedia
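The classic pitfall: a variable can depend perfectly on another and still have correlation zero, because correlation only measures the linear part of the relation. A minimal sketch (the parabola example is an assumption of this sketch, not from the slides):

```python
import numpy as np

# x symmetric around 0, y a deterministic but NONLINEAR function of x
x = np.linspace(-1, 1, 101)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
# y is completely determined by x, yet r is (numerically) zero:
# the positive and negative branches of the parabola cancel
print(r)
```

By symmetry the sample covariance is a sum of terms $x_i^3$ that cancel pairwise, so r vanishes up to floating point.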
Test for zero correlation: Fisher’s z-Test
X, Y bivariate normally distributed with true correlation $\rho$
Take n samples
Compute the sample correlation r
Compute $z = \frac{1}{2}\log\frac{1+r}{1-r}$
Compute $\xi = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$
For large n: $\sqrt{n-1}\,(z - \xi) \approx N(0, 1)$
Use cor.test() in R to test and get confidence intervals
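The steps above can be hand-rolled in a few lines (in R, `cor.test()` does all of this for you; the helper name `fisher_z_test` and the inputs r = 0.3, n = 100 are illustrative). This sketch follows the slide's $\sqrt{n-1}$ scaling; many references use $\sqrt{n-3}$ instead, which makes little difference for large n:

```python
import math

def fisher_z_test(r, n, rho0=0.0):
    """Test H0: rho = rho0 via Fisher's z-transformation."""
    z = 0.5 * math.log((1 + r) / (1 - r))
    xi = 0.5 * math.log((1 + rho0) / (1 - rho0))
    stat = math.sqrt(n - 1) * (z - xi)       # approx. N(0, 1) for large n
    p = math.erfc(abs(stat) / math.sqrt(2))  # two-sided normal p-value
    return stat, p

# Sample correlation r = 0.3 from n = 100 samples: is the true rho zero?
stat, p = fisher_z_test(0.3, 100)
print(stat, p)  # H0: rho = 0 is rejected at the 5% level
```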
Many dimensions: Scatterplot matrix
Covariance matrix / correlation matrix:
Table of pairwise values

True covariance matrix: $\Sigma_{ij} = Cov(X_i, X_j)$
True correlation matrix: $C_{ij} = Cor(X_i, X_j)$
Sample covariance matrix: $S_{ij} = \widehat{Cov}(x_i, x_j)$ (diagonal: variances)
Sample correlation matrix: $R_{ij} = \widehat{Cor}(x_i, x_j)$ (diagonal: 1)
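The diagonal properties are easy to verify on random data. A NumPy sketch (in R this is simply `cov(X)` and `cor(X)` on a data matrix; the data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))     # n = 500 samples of p = 3 variables

S = np.cov(X, rowvar=False)       # sample covariance matrix, p x p
R = np.corrcoef(X, rowvar=False)  # sample correlation matrix, p x p

# Diagonal of S: the sample variances of the individual variables
assert np.allclose(np.diag(S), np.var(X, axis=0, ddof=1))
# Diagonal of R: exactly 1 (each variable is perfectly correlated with itself)
assert np.allclose(np.diag(R), 1.0)
# Both matrices are symmetric: S_ij = S_ji, R_ij = R_ji
assert np.allclose(S, S.T) and np.allclose(R, R.T)
```

`rowvar=False` tells NumPy that rows are observations and columns are variables, matching R's convention.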
Multivariate Normal Distribution:
Most common model choice
$$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\,\exp\left(-\frac{1}{2}\,(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
Multivariate Normal Distribution:
Funny facts
If $X_1, \dots, X_p$ are multivariate normal, then
every linear combination $Y = a_1 X_1 + \dots + a_p X_p$ is normally distributed
every projection onto a subspace is multivariate normally distributed
If the margins are normally distributed, it is NOT
GUARANTEED that the underlying joint distribution is multivariate normal
(i.e., “multivariate normal” is stronger than just normal margins)
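A classic counterexample makes the last fact concrete. The construction below (Y = W·X with a random sign W, not from the slides) gives two standard normal margins whose joint distribution is not bivariate normal, because the linear combination X + Y fails to be normal:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
w = rng.choice([-1.0, 1.0], size=n)  # random sign, independent of x
y = w * x                            # y is again standard normal ...

# ... both margins look like N(0, 1): mean ~ 0, variance ~ 1
assert abs(y.mean()) < 0.02 and abs(y.var() - 1) < 0.02

# ... but (x, y) is NOT bivariate normal: the linear combination x + y
# would then have to be normal, yet it equals 0 exactly whenever w = -1,
# i.e. it has a point mass of probability 1/2 at zero
frac_zero = np.mean(x + y == 0)
print(frac_zero)  # about 0.5
```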
Multivariate Normal Distribution: Example 1
1000 random samples

$$\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

Diagonal of $\Sigma$: variances along X1 and X2; off-diagonal: covariance between X1 and X2
Multivariate Normal Distribution: Example 2
1000 random samples

$$\mu = \begin{pmatrix} 5 \\ 10 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 10 & 3 \\ 3 & 2 \end{pmatrix}$$

Diagonal of $\Sigma$: variances along X1 and X2; off-diagonal: covariance between X1 and X2
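Samples like those in Example 2 are generated with `mvrnorm` (MASS package) in R; a NumPy sketch (seed illustrative) that also checks the sample statistics against the true parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([5.0, 10.0])
Sigma = np.array([[10.0, 3.0],
                  [3.0, 2.0]])

# Draw 1000 samples (R equivalent: mvrnorm(1000, mu, Sigma))
X = rng.multivariate_normal(mu, Sigma, size=1000)

# The sample mean and sample covariance matrix recover mu and Sigma
# up to sampling noise
print(X.mean(axis=0))           # close to (5, 10)
print(np.cov(X, rowvar=False))  # close to [[10, 3], [3, 2]]
```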
Multivariate Normal Distribution:
Most common model choice (p dimensions)
$$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\,\exp\left(-\frac{1}{2}\,(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

Sq. Mahalanobis distance: $MD^2(x) = (x-\mu)^T \Sigma^{-1} (x-\mu)$
= sq. distance from the mean in standard deviations
IN THE DIRECTION OF x
Mahalanobis distance: Example
$$\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}$$

Point (20, 0): MD = 4
Point (10, 7): MD ≈ 7.3
Point (0, 10): MD = 10
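These numbers are quick to verify; a NumPy sketch (in R the pieces are `solve(Sigma)` for the inverse and `%*%` for the matrix product):

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[25.0, 0.0],
                  [0.0, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)  # R: solve(Sigma)

def mahalanobis(x, mu, Sigma_inv):
    # MD(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))
    d = x - mu
    return float(np.sqrt(d @ Sigma_inv @ d))

for point in [(20.0, 0.0), (10.0, 7.0), (0.0, 10.0)]:
    print(point, mahalanobis(np.array(point), mu, Sigma_inv))
# (20, 0) lies 20/5 = 4 standard deviations from the mean along X1;
# (0, 10) lies 10/1 = 10 standard deviations away along X2
```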
Concepts to know
Covariance, Correlation (true / sample version)
Test for zero correlation: Fisher’s z-Transformation
Scatterplot / Scatterplot matrix
Covariance matrix / Correlation matrix
Multivariate Normal Distribution
Mahalanobis distance
R commands to know
read.csv, head, str, dim
colMeans, cov, cor
mvrnorm, t, solve, %*%