
• construction of the so-called maximum tree T_MAX (see below),

• choice of the right tree size T (tree pruning),

• classification of new data using the constructed tree T.

A maximum tree is one that contains observations of only one class at each of its terminal nodes. The root node – the one at the top of any tree – represents the whole learning sample. Moving from the root node down the tree, the learning sample is split recursively so that increasingly homogeneous clusters of observations are separated into tree nodes. This can be achieved as follows.

4.2 Impurity Measures for Classification Trees

Suppose there are n observations in the learning sample and n_j is the overall number of observations belonging to the class j, j = 1, . . . , J. In the described stock picking setup, where y ∈ {long, short, neutral}, J is equal to three. Then define the class probabilities as follows:

π_j = n_j / n,   (4.1)

i.e. the proportion of observations belonging to a particular class relative to the overall number of observations.

Let n(t) be the number of observations in the node t and n_j(t) the number of observations belonging to the j-th class in the same node t. Then the joint probability of the event that an observation of the j-th class falls into the node t is

p(j, t) = π_j · n_j(t) / n_j.   (4.2)

Hence p(t) = Σ_{j=1}^{J} p(j, t), and the conditional probability that an observation belongs to the class j given that it falls into the node t is computed as follows:

p(j|t) = p(j, t) / p(t) = n_j(t) / n(t),   (4.3)

which is the proportion of the class j in the node t. One can easily show that Σ_{j=1}^{J} p(j|t) = 1.
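The probabilities (4.1)–(4.3) can be illustrated with a short numerical sketch; all counts below are hypothetical:

```python
# Hypothetical learning sample: n = 10 observations, J = 3 classes
# (long, short, neutral); node t contains 4 of them.
n = 10
n_j = {"long": 5, "short": 3, "neutral": 2}          # class counts n_j
n_t = 4                                              # n(t)
n_j_t = {"long": 3, "short": 1, "neutral": 0}        # n_j(t)

pi = {j: n_j[j] / n for j in n_j}                    # class probabilities (4.1)
p_jt = {j: pi[j] * n_j_t[j] / n_j[j] for j in n_j}   # joint probabilities (4.2)
p_t = sum(p_jt.values())                             # p(t) = sum_j p(j, t)
p_j_given_t = {j: p_jt[j] / p_t for j in n_j}        # conditional probabilities (4.3)

# p(j|t) reduces to n_j(t)/n(t), and the conditionals sum to one:
assert abs(p_t - n_t / n) < 1e-12
assert abs(sum(p_j_given_t.values()) - 1.0) < 1e-12
```

Note that p(t) comes out as n(t)/n, as expected from substituting (4.1) into (4.2) and summing over the classes.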

A measure that shows the degree of class heterogeneity in a given node of a classification tree is called an impurity measure i(t). With its help, different node configurations or splits – combinations of question variables X_i and question values x – can be compared, and therefore only one of them can finally be selected to be incorporated in the tree, thus making it possible to determine specific optimal (see below) values of X_i and x employed in the questions. An impurity measure can be defined via an impurity function ϕ(·) that is defined on the sets {p_1, . . . , p_J} with p_j ≥ 0, j = 1, . . . , J, and Σ_{j=1}^{J} p_j = 1, so that


4 Introduction to Binary Classification Trees

1. ϕ(·) has a unique maximum at the point (1/J, 1/J, . . . , 1/J);

2. ϕ(·) has a unique minimum at the points (1, 0, . . . , 0), (0, 1, 0, . . . , 0), . . ., (0, 0, . . . , 1);

3. ϕ(·) is a symmetric function of the class probabilities p_1, . . . , p_J.

Each function satisfying these conditions can be called an impurity function. Functions taking only non-negative values will be considered in this study. Given the function ϕ(·), define an impurity measure i(t) for a node t as

i(t) = ϕ(p(1|t), p(2|t), . . . , p(J|t)).   (4.4)

It is important to point out that from the given definitions it follows that it is possible to define multiple impurity measures for the same node t.
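As a familiar example (specific functional forms are treated in Section 4.3), the Gini index ϕ(p) = 1 − Σ_j p_j² satisfies the three conditions above, which can be checked numerically:

```python
def gini(p):
    """Gini impurity function: phi(p) = 1 - sum_j p_j^2."""
    return 1.0 - sum(pj * pj for pj in p)

J = 3
# Condition 1: the maximum is attained at the uniform point (1/J, ..., 1/J).
assert gini([1 / J] * J) > gini([0.5, 0.3, 0.2])
# Condition 2: the minimum (zero) is attained at the pure points (1, 0, ..., 0), etc.
assert gini([1.0, 0.0, 0.0]) == 0.0
# Condition 3: symmetry in the class probabilities.
assert abs(gini([0.5, 0.3, 0.2]) - gini([0.2, 0.5, 0.3])) < 1e-12
```

Plugging the node proportions p(j|t) from (4.3) into such a function yields the impurity measure i(t) of (4.4).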

Figure 4.2: The triplet of nodes: t_P – parent node, t_L – left child node, and t_R – right child node

Let t_P be the parent node and t_L, t_R the left and right child nodes of t_P respectively, so that a fraction p_L of observations from the node t_P goes to the left child node, and a fraction p_R = (1 − p_L) to the right one.

If n_P is the number of observations in t_P and n_L, n_R the numbers of observations in t_L and t_R respectively, then the probabilities of an observation falling into one of the two child nodes can be computed as follows:

p_L = n_L / n_P,   p_R = n_R / n_P.   (4.5)

Denote an arbitrary data split by s. The functional that determines the question at each tree node – the split s – is the maximum value of the one-level decrement of the impurity function ∆i(s, t_P), which can be computed for an arbitrary node t_P:

∆i(s, t_P) = i(t_P) − p_L · i(t_L) − p_R · i(t_R).   (4.6)

Obviously, the higher the value of ∆i(s, t_P), the better the split that has been obtained, since data impurity could be reduced more significantly. Since t_L ∪ t_R = t_P, the value ∆i(s, t_P) represents the change of data impurity in t_P solely due to the split s.

To find the optimal s for a given node, it is natural to maximize ∆i(s, t_P) over different s, i.e. by choosing different variables from the learning sample and adjusting the relevant question values. In this way, a classification tree of any configuration up to a maximum tree can be built.

While searching for the optimal value of s, the value of i(t_P) remains constant because it does not depend on X_i and x, which together create t_L and t_R. Hence, it is equivalent to state that

s = argmax_s ∆i(s, t_P) = argmax_s {−p_L · i(t_L) − p_R · i(t_R)} = argmin_s {p_L · i(t_L) + p_R · i(t_R)},   (4.7)

where t_L and t_R are implicit functions of s.

If the resulting nodes t_L and t_R are not sufficiently class-homogeneous (see below), the same procedure can be repeated until the decision tree reaches the required configuration.
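A minimal sketch of the locally optimal split search in (4.7); the Gini impurity and the threshold question form "X_i ≤ x" are illustrative assumptions, not the only possible choices:

```python
from collections import Counter

def gini_node(labels):
    """i(t) via the Gini impurity of the class proportions p(j|t)."""
    n_t = len(labels)
    return 1.0 - sum((c / n_t) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Minimize p_L*i(t_L) + p_R*i(t_R) over questions 'X_i <= x', as in (4.7)."""
    best = (float("inf"), None, None)            # (weighted impurity, feature i, value x)
    n_P = len(y)
    for i in range(len(X[0])):                   # candidate question variables X_i
        for x in sorted({row[i] for row in X}):  # candidate question values x
            left  = [y[k] for k in range(n_P) if X[k][i] <= x]
            right = [y[k] for k in range(n_P) if X[k][i] > x]
            if not left or not right:            # skip degenerate splits
                continue
            w = (len(left) / n_P) * gini_node(left) + (len(right) / n_P) * gini_node(right)
            if w < best[0]:
                best = (w, i, x)
    return best

# Toy sample: one feature separates 'long' from 'short' perfectly,
# so the question 'X_0 <= 0.2' yields pure children with zero impurity.
X = [[0.1], [0.2], [0.8], [0.9]]
y = ["long", "long", "short", "short"]
assert best_split(X, y) == (0.0, 0, 0.2)
```

Applying `best_split` recursively to the resulting child nodes grows the tree toward the maximum tree described above.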

Classes are then assigned to terminal nodes using the following rule:

If p(j|t) = max_i p(i|t), then j(t) = j.   (4.8)

If the maximum is not unique, then the class j(t) is assigned arbitrarily from the pool of arguments {i} for which p(i|t) takes its maximum value.
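Rule (4.8) amounts to a majority vote within the terminal node; a minimal sketch, where the tie is broken by a fixed alphabetical pick (one valid way of choosing "arbitrarily" from the maximizing pool):

```python
from collections import Counter

def assign_class(labels):
    """Rule (4.8): j(t) = argmax_j p(j|t) within a terminal node t."""
    counts = Counter(labels)
    top = max(counts.values())
    pool = [j for j, c in counts.items() if c == top]  # classes attaining the max
    return sorted(pool)[0]                             # arbitrary-but-reproducible pick

assert assign_class(["long", "long", "short"]) == "long"
```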

Note that the criterion in (4.7) can be used with any valid impurity measure, which makes the whole algorithm quite versatile.

It may, though, have a little drawback that is worth pointing out. Maximizing the decrement of the impurity function means that only two levels of a decision tree are taken into account, whereas other parts of the tree (such as possible child nodes of t_L and t_R) cannot influence the choice of the optimal split. That is why the procedure can be characterized only as locally optimal.

Is it possible to build a globally optimal algorithm of data splitting? One of the potential criteria that takes into account the whole structure of the tree could be the integral tree impurity decrement, i.e. a weighted sum of local tree impurity decrements.

Figure 4.3: A maximum binary decision tree containing three splitting levels, M = 3.

Figure 4.3 displays the first three splitting levels of an arbitrary decision tree. For each splitting level, let us unite the node probabilities and the corresponding node impurities into groups. At the first level, where t_P is split into t_L and t_R, there are two corresponding probabilities and impurities:

p^(1)_L = p_L,   p^(1)_R = p_R
i^(1)_L = i(t_L),   i^(1)_R = i(t_R)   (4.9)

Therefore, the conventional locally optimal impurity decrement rule can be rewritten as

∆i^(1) = ∆i(s, t_P) = i(t_P) − ⟨p^(1)_L, i^(1)_L⟩ − ⟨p^(1)_R, i^(1)_R⟩ → max_s,   (4.10)

where ⟨·, ·⟩ denotes the scalar product.
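With hypothetical numbers, one can check that at the first level the scalar products in (4.10) collapse to the ordinary products of (4.6), since each group contains a single element:

```python
# Hypothetical impurities and split fractions for a single parent node.
i_tP, i_tL, i_tR = 0.48, 0.0, 0.375   # i(t_P), i(t_L), i(t_R)
p_L, p_R = 0.4, 0.6                   # fractions sent to the child nodes

# <p_L^(1), i_L^(1)> + <p_R^(1), i_R^(1)> reduces to p_L*i(t_L) + p_R*i(t_R),
# so Delta i^(1) coincides with Delta i(s, t_P) from (4.6).
delta_1 = i_tP - (p_L * i_tL + p_R * i_tR)
assert abs(delta_1 - 0.255) < 1e-12
```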

Due to the binary nature of a tree, deeper splitting levels contain more elements. For the next two levels, the groups will contain the following elements:

p^(2)_L = {p_LL, p_RL},   p^(2)_R = {p_LR, p_RR}
i^(2)_L = {i(t_LL), i(t_RL)},   i^(2)_R = {i(t_LR), i(t_RR)}   (4.11)

If a node contains only observations of the same class (and thus the impurity measure reaches its minimum according to the definition of the measure), then its child nodes are empty sets and

p_L(∅) = p_R(∅) = 0,   i(∅) = 0.   (4.13)

The globally optimal impurity decrement rule is proposed in the following form:

∆i^(M) = i(t_0) − Σ_{m=1}^{M} (⟨p^(m)_L, i^(m)_L⟩ + ⟨p^(m)_R, i^(m)_R⟩) → max over s^(1), . . . , s^(M),   (4.14)

where s^(1), . . . , s^(M) are all possible splits for each split level of a tree.

As can be seen from (4.14), bigger chunks of data influence the final splitting configuration more, due to the higher values of the respective probabilities. The whole tree is created at once, as opposed to (4.10), where only a single split is determined at a time.

Note that a locally optimal decision tree, i.e. one built via (4.10), is not (under general conditions) globally optimal. Quite interestingly, a globally optimal tree may not be locally optimal.

Unfortunately, because of the enormous computing power required to build an optimal tree via (4.14), it is not (yet) practically possible to test its relative efficiency. However, the rapid development of microprocessors provides serious reasons to conclude that this and other resource-hungry algorithms will become practically feasible in the near future.

In the financial sphere, where computations are sometimes required to be carried out virtually online or at least to be conducted very quickly, speed becomes crucial, which is why it is reasonable to apply locally optimal procedures. Once the


4.3 Two Special Functional Forms of Impurity Measures