
Finding Multivariate Outliers

Applied Multivariate Statistics – Spring 2012


Goals

• Concept: Detecting outliers with the (robustly) estimated Mahalanobis distance and a QQ-plot

• R: chisq.plot, pcout from package “mvoutlier”


Outliers in one dimension: easy

• Look at scatterplots

• Find the dimensions in which outliers occur

• Find the extreme samples in just these dimensions

• Remove the outliers


2d: More tricky

[Figure: 2d scatterplot with one point labeled “Outlier”. The point is not an outlier in x or in y alone.]

Recap: Mahalanobis distance

• True Mahalanobis distance:

$$\mathrm{MD}(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

• Estimated Mahalanobis distance:

$$\widehat{\mathrm{MD}}(x) = \sqrt{(x - \hat{\mu})^T \hat{\Sigma}^{-1} (x - \hat{\mu})}$$

Squared Mahalanobis distance $\mathrm{MD}^2(x)$ = squared distance from the mean, measured in standard deviations IN THE DIRECTION OF x

Mahalanobis distance: Example

$$\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

• Point (20, 0): MD = 4

• Point (0, 10): MD = 10

• Point (10, 7): MD = 7.3
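The three example distances can be checked numerically. The following is a minimal sketch in plain Python, assuming a 2x2 covariance matrix that is inverted in closed form; the function name `mahalanobis_2d` and the constants `SIGMA` and `MU` are illustrative and do not come from the slides.

```python
import math

def mahalanobis_2d(x, mu, sigma):
    """Mahalanobis distance of a 2d point x from mean mu under covariance sigma.

    sigma is a 2x2 matrix given as ((a, b), (c, d)); it is inverted in closed form.
    """
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix: 1/det * [[d, -b], [-c, a]]
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# Covariance and mean from the example above
SIGMA = ((25.0, 0.0), (0.0, 1.0))
MU = (0.0, 0.0)

for point in [(20, 0), (0, 10), (10, 7)]:
    print(point, round(mahalanobis_2d(point, MU, SIGMA), 1))
```

The point (0, 10) is closer to the mean than (20, 0) in Euclidean terms, yet its Mahalanobis distance is larger because the variance in the second coordinate is much smaller.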


Theory of Mahalanobis Distance

Assume the data are multivariate normally distributed (d dimensions).

Then the squared Mahalanobis distance of the samples follows a Chi-Square distribution with d degrees of freedom.

(“By definition”: the sum of the squares of d independent standard normal random variables has a Chi-Square distribution with d degrees of freedom.)
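The link between the distance and the Chi-Square distribution can be written out in one step. This is a short derivation under the stated normality assumption, with the true $\mu$ and $\Sigma$; the standardized vector $z$ is notation introduced here for illustration.

$$x \sim N_d(\mu, \Sigma) \;\Rightarrow\; z = \Sigma^{-1/2}(x - \mu) \sim N_d(0, I_d)$$

$$\mathrm{MD}^2(x) = (x - \mu)^T \Sigma^{-1} (x - \mu) = z^T z = \sum_{j=1}^{d} z_j^2 \sim \chi^2_d$$

The components $z_j$ are independent standard normal variables, so the sum of their squares is Chi-Square distributed with d degrees of freedom.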


Check for multivariate outliers

• Are there samples whose estimated Mahalanobis distance doesn’t fit a Chi-Square distribution at all?

• Check with a QQ-plot

• Technical details:

- The Chi-Square distribution is still a reasonably good approximation for the squared estimated Mahalanobis distance

- Use robust estimates for $\mu$, $\Sigma$
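The coordinates of such a QQ-plot can be sketched as follows. This is an illustrative example in plain Python, restricted to d = 2 because the Chi-Square quantile then has a closed form; the function names, the plotting positions (i - 0.5) / n, and the squared distances are assumptions made for illustration, and this is not the code behind chisq.plot.

```python
import math

def chi2_quantile_2df(p):
    """Quantile of the Chi-Square distribution with 2 degrees of freedom.

    For d = 2 the cdf is 1 - exp(-q / 2), so the quantile has a closed form.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return -2.0 * math.log(1.0 - p)

def qq_points(sq_distances):
    """Pair sorted squared distances with Chi-Square(2) quantiles.

    Returns (theoretical, observed) pairs, using plotting positions (i - 0.5) / n.
    """
    observed = sorted(sq_distances)
    n = len(observed)
    return [(chi2_quantile_2df((i - 0.5) / n), obs)
            for i, obs in enumerate(observed, start=1)]

# Hypothetical squared Mahalanobis distances of 8 samples in d = 2
sq_md = [0.2, 0.5, 0.9, 1.4, 2.0, 2.9, 4.1, 31.0]
for theo, obs in qq_points(sq_md):
    print(f"{theo:6.2f} {obs:6.2f}")
```

Points that follow the Chi-Square distribution lie near the diagonal. In this made-up data the last sample has an observed value far above its theoretical quantile, which is the kind of point that “sticks out”.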


Robust Estimates: Income of 7 people

[Figure: incomes of 7 people, comparing the standard deviation (“Std. Dev.”) with a robust scatter estimate (“Robust Scatter”).]

Robust Estimates for outlier detection

• If scatter is estimated robustly, outliers “stick out” much more

• Robust Mahalanobis distance: mean and covariance matrix are estimated robustly
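A one-dimensional sketch illustrates the effect. It assumes the median and the scaled median absolute deviation (MAD) as robust stand-ins for mean and standard deviation; the slides do not say which robust estimator is used, and the income values below are hypothetical.

```python
import statistics

def mad(values, scale=1.4826):
    """Median absolute deviation, scaled so that it is comparable to the
    standard deviation when the data are normally distributed."""
    med = statistics.median(values)
    return scale * statistics.median([abs(v - med) for v in values])

# Hypothetical incomes of 7 people (in thousands); the last one is extreme.
incomes = [30, 35, 40, 45, 50, 55, 1000]

mean = statistics.mean(incomes)
sd = statistics.stdev(incomes)
med = statistics.median(incomes)
robust_sd = mad(incomes)

# Distance of the extreme income from the center, in units of scatter
classical_distance = abs(incomes[-1] - mean) / sd
robust_distance = abs(incomes[-1] - med) / robust_sd

print(f"classical: center={mean:.1f} scatter={sd:.1f} distance={classical_distance:.1f}")
print(f"robust:    center={med:.1f} scatter={robust_sd:.1f} distance={robust_distance:.1f}")
```

The extreme value inflates the mean and the standard deviation, so its classical distance stays moderate. The median and MAD are barely affected, so the same value ends up many robust scatter units away from the center.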


Example - continued

Outlier easily detected!

Outliers in >2d can be well hidden!

[Figure: scatterplot matrix of the data.] No outlier, right?

Wrong! This outlier can’t be seen in the scatterplot matrix (but it can be seen in a 3d plot).

Method 1: Quantile of Chi-Square distribution

• Compute for each sample (in d dimensions) the robustly estimated squared Mahalanobis distance $\mathrm{MD}^2(x_i)$

• Compute the 97.5% quantile Q of the Chi-Square distribution with d degrees of freedom

• All samples with $\mathrm{MD}^2(x_i) > Q$ are declared outliers
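The steps of Method 1 can be sketched as follows. This is a minimal illustration in plain Python for d = 2 only, where the Chi-Square quantile has a closed form; for general d a statistics library would be needed (for example `scipy.stats.chi2.ppf`). The mean and covariance passed in are assumed to come from a robust estimator computed elsewhere, and all function names are hypothetical.

```python
import math

def squared_md_2d(x, mu, sigma):
    """Squared Mahalanobis distance of a 2d point, with sigma = ((a, b), (c, d))."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T Sigma^{-1} (x - mu), using the closed-form 2x2 inverse
    return (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det

def chi2_quantile_2df(p):
    """Quantile of Chi-Square with 2 degrees of freedom: cdf is 1 - exp(-q / 2)."""
    return -2.0 * math.log(1.0 - p)

def flag_outliers(points, mu, sigma, level=0.975):
    """Return True for each point whose squared distance exceeds the quantile."""
    cutoff = chi2_quantile_2df(level)
    return [squared_md_2d(p, mu, sigma) > cutoff for p in points]

# Assumed (robust) estimates; values reused from the Mahalanobis example
mu = (0.0, 0.0)
sigma = ((25.0, 0.0), (0.0, 1.0))
points = [(5, 1), (-3, -2), (10, 2), (20, 0), (0, 10)]

for point, is_outlier in zip(points, flag_outliers(points, mu, sigma)):
    print(point, "outlier" if is_outlier else "ok")
```

The cutoff is applied to the squared distance because it is the squared distance that follows the Chi-Square distribution under normality.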


Method 2: Adjusted Quantile

• Adjusted quantile for outliers: depends on the distance between the cdf of the Chi-Square distribution and the ecdf of the samples in the tails

• Simulate “normal” deviations in the tails

• Outliers have “abnormally large” deviations in the tails (e.g. more than seen in 100 simulations without outliers)
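The comparison step can be sketched as follows. This is a simplified illustration in plain Python for d = 2, not the implementation behind aq.plot: it measures the largest amount by which the Chi-Square cdf exceeds the ecdf beyond a tail start `delta`, and compares it with the largest value seen in 100 simulated outlier-free samples. Function names, seeds, and the contaminated data are assumptions, and the derivation of the adaptive cutoff itself is left out.

```python
import math
import random

def chi2_cdf_2df(q):
    """cdf of the Chi-Square distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-q / 2.0) if q > 0 else 0.0

def tail_deviation(sq_distances, delta):
    """Largest excess of the cdf over the ecdf for arguments beyond delta."""
    xs = sorted(sq_distances)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        if x >= delta:
            # At most i samples lie strictly below x, so just below x the ecdf
            # is at most i / n, while the cdf approaches chi2_cdf_2df(x).
            worst = max(worst, chi2_cdf_2df(x) - i / n)
    return worst

def simulated_critical_value(n, delta, n_sim=100, seed=0):
    """Largest tail deviation seen in n_sim outlier-free samples of size n."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_sim):
        # A Chi-Square(2) variable is exponential with mean 2 (rate 0.5).
        sample = [rng.expovariate(0.5) for _ in range(n)]
        worst = max(worst, tail_deviation(sample, delta))
    return worst

delta = -2.0 * math.log(1.0 - 0.975)  # 97.5% quantile of Chi-Square(2)
rng = random.Random(1)
clean = [rng.expovariate(0.5) for _ in range(200)]
contaminated = clean + [50.0] * 20    # hypothetical outliers far in the tail

crit = simulated_critical_value(len(contaminated), delta)
observed = tail_deviation(contaminated, delta)
print("abnormally large tail deviation:", observed > crit)
```

In the contaminated sample roughly 9% of the points sit far in the tail, where the Chi-Square distribution expects almost none, so the observed deviation is well above what outlier-free simulations produce.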

[Figure: ecdf of the samples compared with the Chi-Square cdf. The point where the ECDF leaves the “plausible” range defines the adaptive cutoff.]

R: Function “aq.plot”

Method 3: State of the art - pcout

• Complex method based on robust principal components

• Pretty involved methodology

• Very fast, good for high dimensions

• R: Function “pcout” in package “mvoutlier”

• $wfinal01: 0 marks an outlier

• $wfinal: smaller values indicate more severe outliers

• P. Filzmoser, R. Maronna, M. Werner. Outlier identification in high dimensions. Computational Statistics and Data Analysis, 52, 1694-1711, 2008


Automatic outlier detection

• It is always better to look at a QQ-plot to find outliers! Just look for points “sticking out”; no distributional assumption is needed

• If you can’t: automatic outlier detection

- usually finds too many or too few outliers, depending on parameter settings

- depends on distributional assumptions (e.g. multivariate normality)

+ good for screening large amounts of data


Concepts to know

• Find multivariate outliers with the robustly estimated Mahalanobis distance

• Cutoff

- by eye (best method)

- quantile of the Chi-Square distribution


R commands to know

• chisq.plot, pcout in package “mvoutlier”


Next week

• Missing values
