Random Forest
Applied Multivariate Statistics – Spring 2013
Overview
Intuition of Random Forest
The Random Forest Algorithm
De-correlation gives better accuracy
Out-of-bag error (OOB-error)
Variable importance
Intuition of Random Forest
[Figure: three example classification trees (Tree 1, Tree 2, Tree 3), splitting on variables such as age (young/old), height (short/tall), sex (female/male), and work status (working/retired), with leaves labeled healthy or diseased.]

New sample: old, retired, male, short
Tree predictions: diseased, healthy, diseased
Majority rule: diseased
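A minimal sketch of the majority rule in R, using the three tree predictions above:

  preds <- c("diseased", "healthy", "diseased")
  names(which.max(table(preds)))   # "diseased"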
The Random Forest Algorithm
3
Differences to standard tree
Train each tree on a bootstrap resample of the data
(Bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement, so some samples will typically occur multiple times in the new data set)
For each split, consider only m randomly selected variables
Don’t prune
Fit B trees in this way and aggregate their results by averaging (regression) or majority vote (classification); see the sketch below
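These steps map directly onto the arguments of the randomForest() function from the R package "randomForest" (listed again at the end of these slides): ntree is B, mtry is m, and trees are grown unpruned by default. A minimal sketch, with the iris data as a stand-in example:

  library(randomForest)
  set.seed(1)
  # B = 500 trees (ntree), m = 2 variables considered at each split (mtry)
  rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
  # Majority vote over the 500 trees for a new sample:
  predict(rf, newdata = iris[1, ])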
Why Random Forest works 1/2
Mean Squared Error = Variance + Bias²
If trees are sufficiently deep, they have very small bias
How could we improve the variance over that of a single tree?
Why Random Forest works 2/2
For B trees, each with variance σ² and pairwise correlation ρ, splitting the double sum of covariances into its i = j and i ≠ j parts gives the variance of their average:

Var( (1/B) Σ_{b=1}^{B} T_b ) = ρσ² + ((1 − ρ)/B)·σ²

The second term decreases if the number of trees B increases (irrespective of ρ).
The first term decreases if ρ decreases, i.e., if m decreases.
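A quick numeric check of this formula in R (the values of σ² and ρ are made up for illustration):

  sigma2 <- 1     # variance of a single tree (assumed value)
  rho    <- 0.5   # pairwise correlation between trees (assumed value)
  B      <- c(1, 10, 100, 1000)
  rho * sigma2 + (1 - rho) / B * sigma2
  # 1.0000 0.5500 0.5050 0.5005 -> shrinks toward the floor rho * sigma2 = 0.5

More trees shrink only the second term; the floor ρσ² remains, which is why de-correlating the trees (small m) matters.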
De-correlation gives better accuracy
Estimating generalization error:
Out-of-bag (OOB) error
Similar to leave-one-out cross-validation, but almost without any additional computational burden
Data:
old, tall – healthy
old, short – diseased
young, tall – healthy
young, short – healthy
young, short – diseased
young, tall – healthy
old, short – diseased

Resampled data (drawn with replacement from the data above):
old, tall – healthy
old, tall – healthy
old, short – diseased
old, short – diseased
young, tall – healthy
young, tall – healthy
young, short – healthy

[Figure: tree fitted on the resampled data, splitting on age (young/old) and height (short/tall), with leaves labeled healthy or diseased.]

Out-of-bag samples (never drawn into the resample):
young, short – diseased
young, tall – healthy
old, short – diseased

Out-of-bag (OOB) error rate: 1/3 = 0.33
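In R, the OOB error falls out of the randomForest() fit directly, with no extra computation (iris again as a stand-in dataset):

  library(randomForest)
  set.seed(1)
  rf <- randomForest(Species ~ ., data = iris, ntree = 500)
  # OOB error rate after all 500 trees ("OOB" column of the error-rate matrix)
  rf$err.rate[rf$ntree, "OOB"]
  print(rf)   # also reports the "OOB estimate of error rate"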
Variable Importance for variable i using Permutations

For each tree b = 1, …, B:
fit Tree b on Resampled Dataset b and compute its error e_b on OOB Data b
permute the values of variable i in OOB Data b and recompute the error, giving p_b
d_b = p_b − e_b

Average and standardize the differences over all B trees:

d̄ = (1/B) Σ_{b=1}^{B} d_b
s_d² = (1/(B − 1)) Σ_{b=1}^{B} (d_b − d̄)²

Importance of variable i: v_i = d̄ / s_d
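This permutation measure is what importance(..., type = 1) and varImpPlot() report in the randomForest package (the package's exact normalization differs slightly, but it follows the same permutation idea; iris is again a placeholder dataset):

  library(randomForest)
  set.seed(1)
  # importance = TRUE makes the fit carry out the permutation step
  rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
  importance(rf, type = 1)   # mean decrease in accuracy per variable, scaled
  varImpPlot(rf, type = 1)   # plot of the same values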
Trees vs. Random Forest
Trees:
+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions tend to have high variance

Random Forest:
+ Smaller prediction variance and therefore usually better generalization performance
+ Easy to tune parameters
- Rather slow
- “Black box”: rather hard to get insight into the decision rules
Comparing runtime (just for illustration)

[Figure: runtime comparison, RF vs. Tree; RF: first predictor cut into 15 levels]

• Up to “thousands” of variables
• Problematic if there are categorical predictors with many levels (max: 32 levels)
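A sketch of how one might produce such a timing comparison (the data are made up, and using rpart as the single-tree fitter is an assumption, not the slide's setup):

  library(randomForest)
  library(rpart)
  set.seed(1)
  # Hypothetical data: 1000 samples, 50 numeric predictors
  d <- data.frame(matrix(rnorm(1000 * 50), nrow = 1000))
  d$y <- factor(rbinom(1000, 1, 0.5))
  system.time(rpart(y ~ ., data = d))          # one tree: fast
  system.time(randomForest(y ~ ., data = d))   # 500 trees: much slower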
RF vs. LDA

LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Can read off decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only for categorical response
- Needs CV for estimating prediction error

RF:
+ Can model nonlinear class boundaries
+ OOB error “for free” (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- “Black box”
- Slow
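A small sketch contrasting the two error estimates (MASS::lda with leave-one-out CV via its CV = TRUE option vs. the forest's built-in OOB error; iris is again just a placeholder):

  library(MASS)
  library(randomForest)
  set.seed(1)
  # LDA needs cross-validation; CV = TRUE returns leave-one-out predictions
  ld <- lda(Species ~ ., data = iris, CV = TRUE)
  mean(ld$class != iris$Species)   # LOO CV error of LDA
  # RF gets its error estimate for free from the OOB samples
  rf <- randomForest(Species ~ ., data = iris)
  rf$err.rate[rf$ntree, "OOB"]     # OOB error of RF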
Concepts to know
Idea of Random Forest and how it reduces the prediction variance of trees
OOB error
Variable Importance based on Permutation
R functions to know
Functions “randomForest” and “varImpPlot” from package “randomForest”