(1)

Random Forest

Applied Multivariate Statistics – Spring 2013


(2)

Overview

• Intuition of Random Forest

• The Random Forest Algorithm

• De-correlation gives better accuracy

• Out-of-bag error (OOB error)

• Variable importance



(3)

Intuition of Random Forest

[Figure: three decision trees (Tree 1, Tree 2, Tree 3), each grown on different variables (age: young/old, height: short/tall, sex: female/male, employment: working/retired), each ending in "healthy" or "diseased" leaves]

New sample: old, retired, male, short

Tree predictions: diseased, healthy, diseased

Majority rule: diseased
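The majority rule is just a vote count over the tree predictions. A one-line check in R, using the three predictions from the figure:

    ## Majority vote over the tree predictions from the figure
    preds <- c("diseased", "healthy", "diseased")
    names(which.max(table(preds)))  # "diseased"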

(4)

The Random Forest Algorithm


(5)

Differences to standard tree

• Train each tree on a bootstrap resample of the data

(Bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)

• For each split, consider only m randomly selected variables

• Don't prune

• Fit B trees in this way and aggregate the results by averaging or majority voting (see the sketch below)
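A minimal sketch of these steps in R; using rpart for the base trees and iris as the data set are assumptions for illustration, not part of the slides. One simplification: the m random variables are drawn once per tree instead of at every split, which rpart does not expose.

    ## Minimal random-forest sketch: bootstrap + random variable subset + no pruning
    library(rpart)

    B <- 100                   # number of trees
    n <- nrow(iris)
    p <- ncol(iris) - 1        # number of predictor variables
    m <- floor(sqrt(p))        # variables per tree (simplified: per tree, not per split)

    trees <- vector("list", B)
    for (b in 1:B) {
      boot <- sample(n, n, replace = TRUE)       # bootstrap resample
      vars <- sample(names(iris)[1:p], m)        # random subset of m variables
      trees[[b]] <- rpart(reformulate(vars, "Species"),
                          data = iris[boot, ],
                          control = rpart.control(cp = 0, minsplit = 2))  # don't prune
    }

    ## Aggregate by majority vote over the B trees
    votes <- sapply(trees, function(t)
      as.character(predict(t, iris, type = "class")))
    pred <- apply(votes, 1, function(v) names(which.max(table(v))))
    mean(pred == iris$Species)   # training accuracy (optimistic, of course)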

(6)

Why Random Forest works 1/2

• Mean squared error = variance + bias²

• If trees are sufficiently deep, they have very small bias

• How could we improve the variance over that of a single tree?


(7)

Why Random Forest works 2/2

Variance of the average of B trees, each with variance σ² and pairwise correlation ρ:

Var = ρσ² + ((1 − ρ) / B) σ²

The second term decreases if the number of trees B increases (irrespective of ρ); the first term decreases if ρ decreases, i.e., if m decreases.

De-correlation gives better accuracy.
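A quick numeric check of this formula (the values of ρ and σ² are made up for illustration): adding trees only drives the variance down to the floor ρσ².

    ## Var = rho*sigma^2 + (1 - rho)*sigma^2 / B for growing B
    sigma2 <- 1
    rho    <- 0.3
    for (B in c(1, 10, 100, 1000))
      cat("B =", B, " Var =", rho * sigma2 + (1 - rho) * sigma2 / B, "\n")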

(8)

Estimating generalization error:

Out-of-bag (OOB) error

• Similar to leave-one-out cross-validation, but almost without any additional computational burden


Data:

old, tall – healthy
old, short – diseased
young, tall – healthy
young, short – healthy
young, short – diseased
young, tall – healthy
old, short – diseased

Resampled data (bootstrap):

old, tall – healthy
old, tall – healthy
old, short – diseased
old, short – diseased
young, tall – healthy
young, tall – healthy
young, short – healthy

Out-of-bag samples (never drawn into the resample):

young, short – diseased
young, tall – healthy
old, short – diseased

[Figure: tree fitted on the resampled data, splitting on young/old and short/tall with healthy/diseased leaves]

The tree misclassifies one of the three out-of-bag samples, so the out-of-bag (OOB) error rate is 1/3 ≈ 0.33.
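The randomForest package reports this estimate automatically when fitting; a minimal sketch, with iris as a stand-in data set:

    ## OOB error comes "for free" when fitting the forest
    library(randomForest)
    set.seed(1)                      # reproducibility
    fit <- randomForest(Species ~ ., data = iris, ntree = 500)
    fit$err.rate[500, "OOB"]         # OOB error estimate after 500 trees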

(9)

Variable importance for variable i using permutations

For each tree b = 1, …, m:

• Fit Tree b on Resampled Dataset b and compute its OOB error e_b on OOB Data b.

• Permute the values of variable i in OOB Data b and compute the OOB error p_b on the permuted data.

• Record the difference d_b = p_b − e_b.

Aggregate over the m trees:

d̄ = (1/m) Σ_b d_b

s_d² = (1/(m − 1)) Σ_b (d_b − d̄)²

Importance of variable i: v_i = d̄ / s_d
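The randomForest package implements this permutation scheme; with scale = TRUE the reported value is the standardized importance d̄ / s_d. iris is again only a stand-in data set:

    ## Permutation-based variable importance (mean decrease in accuracy)
    library(randomForest)
    set.seed(1)
    fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
    importance(fit, type = 1, scale = TRUE)  # type 1 = permutation importance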

(10)

Trees vs. Random Forest

Trees:

+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions of trees tend to have high variance

Random Forest:

+ Smaller prediction variance and therefore usually better general performance
+ Easy to tune parameters
- Rather slow
- "Black box": rather hard to get insight into decision rules

(11)

Comparing runtime (just for illustration)

[Figure: runtime of RF vs. a single tree as the number of variables grows]

• RF handles up to "thousands" of variables.

• RF is problematic if there are categorical predictors with many levels (at most 32 levels are allowed).

• In this illustration, the first predictor was cut into 15 levels for RF.

(12)

RF vs. LDA

LDA:

+ Very fast
+ Discriminants for visualizing group separation
+ Can read off the decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only for categorical responses
- Needs CV for estimating the prediction error

RF:

+ Can model nonlinear class boundaries
+ OOB error "for free" (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- "Black box"
- Slow


(13)

Concepts to know

• Idea of Random Forest and how it reduces the prediction variance of trees

• OOB error

• Variable importance based on permutations

(14)

R functions to know

• Functions "randomForest" and "varImpPlot" from the package "randomForest"
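A minimal end-to-end call of the two functions, with iris as a stand-in data set:

    ## Fit a forest, print the OOB error, plot variable importance
    library(randomForest)
    set.seed(1)
    fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
    print(fit)        # includes the OOB estimate of the error rate
    varImpPlot(fit)   # importance measures per variable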

