Applied Multivariate Statistics – Spring 2013 Trees

(1)

Trees

Applied Multivariate Statistics – Spring 2013

(2)

Overview

• Intuition for Trees
• Regression Trees
• Classification Trees

(3)

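Idea of Trees: Regression Trees

Continuous response

[Figure: scatter plot of y against x next to the corresponding binary tree. The root predicts Y=1.2. The first split is X≤1 vs. X>1 (node predictions Y=1.4 and Y=0.7). The second-level splits are X≤0.3 vs. X>0.3 and X≤2 vs. X>2 (leaf predictions Y=0.2, Y=1.9, Y=1.3, Y=0.3).]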

(4)

Idea of Trees: Classification Tree

Discrete response

Survived in Titanic? (node label = predicted class; counts = No / Yes)

No 800/200
  Sex=F: No 150/50
    Age <35: Yes 3/17
    Age ≥35: No 147/33
  Sex=M: No 650/150
    Age <27: Yes 70/130
    Age ≥27: No 580/20

Misclassification rate:

- Total: (3+33+70+20) / 1000 = 0.126
- “Yes”-class: 53/200 = 0.265
- “No”-class: 73/800 = 0.09
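The three rates can be recomputed from the leaf counts alone. The following is a minimal sketch in plain Python; the variable names are illustrative and the counts are the ones shown in the tree.

```python
# Leaf counts from the Titanic tree, as (predicted class, n_no, n_yes).
leaves = [
    ("Yes", 3, 17),    # Sex=F, Age < 35
    ("No", 147, 33),   # Sex=F, Age >= 35
    ("Yes", 70, 130),  # Sex=M, Age < 27
    ("No", 580, 20),   # Sex=M, Age >= 27
]

# A "No" case is misclassified when it lands in a leaf predicting "Yes",
# and a "Yes" case when it lands in a leaf predicting "No".
wrong_no = sum(n_no for pred, n_no, n_yes in leaves if pred == "Yes")
wrong_yes = sum(n_yes for pred, n_no, n_yes in leaves if pred == "No")
total_no = sum(n_no for _, n_no, _ in leaves)
total_yes = sum(n_yes for _, _, n_yes in leaves)

total_rate = (wrong_no + wrong_yes) / (total_no + total_yes)
yes_rate = wrong_yes / total_yes
no_rate = wrong_no / total_no

print(f"total: {total_rate:.3f}, yes-class: {yes_rate:.3f}, no-class: {no_rate:.3f}")
```

The per-class rates differ a lot even though the total rate looks small, because the “No” class is four times as large as the “Yes” class.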

(5)

Intuition of Trees: Recursive Partitioning

For simplicity: restrict to recursive binary splits

(6)

Fighting overfitting: Cost-complexity pruning

[Figure: training error and test error plotted against complexity of model]

Overfitting: Fitting the training data perfectly might not be good for predicting future data

In practice: Use cross-validation

For trees:

1. Fit a very detailed model

2. Prune it using a complexity penalty to optimize cross-validation performance

(7)

Building Regression Trees 1/2

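• Assume a given partition of the space into regions $R_1, \dots, R_M$

Tree model:

$$f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m)$$

• Goal is to minimize the sum of squared residuals:

$$\sum_{i=1}^{N} \bigl(y_i - f(x_i)\bigr)^2$$

• Solution: the average of the data points in every region:

$$\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m)$$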
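To make the fixed-partition case concrete, here is a minimal sketch in plain Python. The data points and the three intervals are hypothetical and chosen only for illustration; the helper names `region_index`, `f` and `ssr` are not from the slides.

```python
from statistics import mean

# Hypothetical toy data: (x, y) pairs, not taken from the slides.
data = [(0.1, 0.2), (0.2, 0.3), (0.6, 1.2), (0.9, 1.4), (1.5, 1.9), (2.5, 0.3)]

# Assumed partition of the x-axis into M = 3 regions, as intervals (lo, hi].
regions = [(float("-inf"), 0.3), (0.3, 2.0), (2.0, float("inf"))]

def region_index(x):
    """Return m such that x lies in region R_m."""
    for m, (lo, hi) in enumerate(regions):
        if lo < x <= hi:
            return m
    raise ValueError("x is not covered by the partition")

# c_hat[m] is the average response of the training points in R_m.
c_hat = [mean([y for x, y in data if region_index(x) == m])
         for m in range(len(regions))]

def f(x):
    """Tree prediction: the constant fitted in the region containing x."""
    return c_hat[region_index(x)]

def ssr(constants):
    """Sum of squared residuals when region m predicts constants[m]."""
    return sum((y - constants[region_index(x)]) ** 2 for x, y in data)

# Shifting the constants away from the region means increases the SSR.
shifted = [c + 0.1 for c in c_hat]
print(c_hat, ssr(c_hat), ssr(shifted))
```

Once the regions are fixed, fitting the tree reduces to one average per region; the hard part is choosing the regions.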

(8)

Building Regression Trees 2/2

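• Finding the best binary partition is computationally infeasible

• Use a greedy approach: for variable $j$ and split point $s$, define the two generated regions:

$$R_1(j, s) = \{X \mid X_j \le s\}, \qquad R_2(j, s) = \{X \mid X_j > s\}$$

• Choose the splitting variable $j$ and split point $s$ that solve:

$$\min_{j,\,s} \Bigl[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Bigr]$$

The inner minimization is solved by the region averages:

$$\hat{c}_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j,s)), \qquad \hat{c}_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j,s))$$

• Repeat the splitting process on each of the two resulting regions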
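One greedy step can be sketched as an exhaustive search over variables and observed split points. This is an illustrative plain-Python version, not how “rpart” is implemented; the function names and the toy data are assumptions made for the example.

```python
from statistics import mean

def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    c = mean(ys)
    return sum((y - c) ** 2 for y in ys)

def best_split(X, y):
    """Search all variables j and candidate split points s; return (j, s, cost)."""
    best = None
    for j in range(len(X[0])):
        # Candidate split points: observed values of variable j except the
        # largest, so that both regions are non-empty.
        values = sorted(set(row[j] for row in X))
        for s in values[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            # Inner minimization: each region predicts its own mean.
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (j, s, cost)
    return best

# Hypothetical toy data with two predictors; the response jumps when X[0] > 2.
X = [[1, 5], [2, 3], [3, 4], [4, 1], [5, 2]]
y = [1.0, 1.2, 3.0, 3.1, 2.9]
j, s, cost = best_split(X, y)
print(j, s, round(cost, 3))
```

Growing a full tree would call `best_split` again on the rows of each resulting region until the stopping rule is met.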


(9)

Pruning Regression Trees

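• Stop splitting when some minimal node size (= number of samples per node) is reached (e.g. 5)

• Then cut the tree back again (“pruning”) to optimize the cost-complexity criterion:

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$$

Here $|T|$ is the number of terminal nodes, $N_m$ the number of samples in terminal node $m$, and $Q_m(T)$ the “impurity measure” of node $m$. The sum measures goodness of fit; the term $\alpha |T|$ penalizes complexity.

• The tuning parameter $\alpha$ is chosen by cross-validation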

(10)

Classification Trees

• Regression Tree:

Quality of split measured by “Squared error”

• Classification Tree:

Quality of split measured by general “Impurity measure”

(11)

Classification Trees: Impurity Measures

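• Proportion of class $k$ observations in node $m$:

$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$$

• Define the majority class in node $m$: $k(m) = \arg\max_k \hat{p}_{mk}$

• Common impurity measures $Q_m(T)$:

- Misclassification error: $1 - \hat{p}_{m k(m)}$
- Gini index: $\sum_{k} \hat{p}_{mk} (1 - \hat{p}_{mk})$

• For just two classes, with $p$ the proportion of one class:

- Misclassification error: $1 - \max(p, 1 - p)$
- Gini index: $2p(1 - p)$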
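The misclassification error and the Gini index can be computed directly from the class counts in a node. The sketch below is plain Python with illustrative function names and a hypothetical node; it also checks the two-class shortcut 2p(1-p) for the Gini index.

```python
def proportions(counts):
    """Class proportions in a node, computed from its per-class counts."""
    n = sum(counts)
    return [c / n for c in counts]

def misclassification_error(counts):
    """Share of observations that are not in the majority class."""
    return 1.0 - max(proportions(counts))

def gini(counts):
    """Gini index: sum over classes of p * (1 - p)."""
    return sum(p * (1.0 - p) for p in proportions(counts))

# Two-class check on a hypothetical node (not from the slides).
counts = [30, 70]
p = counts[0] / sum(counts)
print(misclassification_error(counts), gini(counts), 2 * p * (1 - p))
```

Both measures are 0 for a pure node and largest when the classes are equally frequent.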

(12)

Example: Gini Index

Side effects after treatment? 100 persons, 50 with and 50 without side effects. Counts are No / Yes.

Split on sex (root 50 / 50):

- M: 30 / 40, Gini = 0.49
- F: 20 / 10, Gini = 0.44
- Total Gini = 0.49 + 0.44 = 0.93

Split on age (root 50 / 50):

- young: 10 / 50, Gini = 0.28
- old: 40 / 0, Gini = 0
- Total Gini = 0.28 + 0 = 0.28

0.28 < 0.93, therefore:

Choose split on age
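The comparison can be reproduced in a few lines of plain Python using the two-class Gini index 2p(1-p). The example adds the child Gini values without weights. As an assumption beyond that example, the sketch also reports the sum weighted by each child's share of observations, which mirrors the node-size weighting in the cost-complexity criterion.

```python
def gini(n_no, n_yes):
    """Two-class Gini index 2p(1-p) for a node with the given counts."""
    p = n_yes / (n_no + n_yes)
    return 2 * p * (1 - p)

# Child-node counts (No, Yes) from the example.
splits = {
    "sex": [(30, 40), (20, 10)],
    "age": [(10, 50), (40, 0)],
}

totals = {}
for name, children in splits.items():
    n_total = sum(no + yes for no, yes in children)
    unweighted = sum(gini(no, yes) for no, yes in children)
    # Assumption beyond the example: weight each child by its share of samples.
    weighted = sum((no + yes) / n_total * gini(no, yes) for no, yes in children)
    totals[name] = (unweighted, weighted)
    print(f"{name}: unweighted {unweighted:.3f}, weighted {weighted:.3f}")

best = min(totals, key=lambda name: totals[name][0])
print("chosen split:", best)
```

Under either aggregation the split on age has the lower total impurity, so the choice does not depend on the weighting here.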

(13)

Classification Trees: Impurity Measures

• Usually:

- Gini Index used for building

- Misclassification error used for pruning

(14)

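Example: Pruning using Misclassification Error (MCE)

Counts are No / Yes; e.g., $\alpha = 0.5$.

Full tree (3 leaves):

50 / 50
  young: 10 / 50 (MCE = 0.167), split further on height (short / tall) into leaves 0 / 50 (MCE = 0) and 10 / 0 (MCE = 0)
  old: 40 / 0, MCE = 0

$C_\alpha(T) = 50 \cdot 0 + 10 \cdot 0 + 40 \cdot 0 + 0.5 \cdot 3 = 1.5$

Pruned tree (2 leaves):

50 / 50
  young: 10 / 50, MCE = 0.167
  old: 40 / 0, MCE = 0

$C_\alpha(T) = 60 \cdot 0.167 + 40 \cdot 0 + 0.5 \cdot 2 = 11.0$

The full tree has the smaller $C_\alpha(T)$, therefore don't prune.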
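The two cost-complexity values can be checked with a short plain-Python sketch. It assumes the criterion is the sum over leaves of node size times MCE, plus α times the number of leaves, as in the calculation for this example; the function names are illustrative.

```python
def mce(n_no, n_yes):
    """Misclassification error of a leaf that predicts its majority class."""
    return min(n_no, n_yes) / (n_no + n_yes)

def cost_complexity(leaves, alpha):
    """Sum over leaves of N_m * MCE_m, plus alpha times the number of leaves."""
    fit = sum((no + yes) * mce(no, yes) for no, yes in leaves)
    return fit + alpha * len(leaves)

alpha = 0.5
full_tree = [(0, 50), (10, 0), (40, 0)]   # leaf counts (No, Yes), 3 leaves
pruned_tree = [(10, 50), (40, 0)]         # leaf counts (No, Yes), 2 leaves

c_full = cost_complexity(full_tree, alpha)
c_pruned = cost_complexity(pruned_tree, alpha)
print(c_full, c_pruned)
```

With these counts the full tree costs 3α and the pruned tree 10 + 2α, so pruning would only pay off for α above 10.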

(15)

Trees in R

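• Function “rpart” (recursive partitioning) in package “rpart”, together with “print”, “plot”, “text”

• Function “rpart” runs 10-fold CV by default and stores the cross-validated error for each value of the complexity parameter cp (a rescaled $\alpha$); the returned tree is grown with the default cp = 0.01 and is not automatically pruned at the CV-optimal value

• Functions “plotcp” and “printcp” for cost-complexity information

• Function “prune” for manual pruning at the chosen cp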

(16)

Concepts to know

• Trees as recursive partitionings

• Concept of cost-complexity pruning

• Impurity measures

(17)

R functions to know

• From package “rpart”: “rpart”, “print”, “plot”, “text”, “plotcp”, “printcp”, “prune”
