(1)

Microarray Workshop

Introduction to Classification Issues in Microarray Data Analysis

Jane Fridlyand and Jean Yee Hwa Yang
University of California, San Francisco

Elsinore, Denmark, May 17-21, 2004

(2)

Bad prognosis: recurrence < 5 yrs
Good prognosis: recurrence > 5 yrs

Reference: L. van't Veer et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

[Schematic: objects = arrays; feature vectors = gene expression; predefined classes = clinical outcome. A learning set of arrays with known outcomes yields a classification rule; a new array is then assigned a class, e.g. good prognosis (metastasis > 5 yrs).]

(3)


Discrimination and Allocation

[Flow diagram: a learning set (data with known classes) is fed to a classification technique, which produces a classification rule (discrimination); the rule is then applied to data with unknown classes to yield a class assignment (prediction).]

(4)

Classification rule

Maximum likelihood discriminant rule

• A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest.

• For known class conditional densities p_k(X), the maximum likelihood (ML) discriminant rule predicts the class of an observation X by

C(X) = argmax_k p_k(X)

(5)


Gaussian ML discriminant rules

• For multivariate Gaussian (normal) class densities X|Y = k ~ N(µ_k, Σ_k), the ML classifier is

C(X) = argmin_k {(X − µ_k)′ Σ_k⁻¹ (X − µ_k) + log|Σ_k|}

• In general, this is a quadratic rule (Quadratic discriminant analysis, or QDA)

• In practice, the population mean vectors µ_k and covariance matrices Σ_k are estimated by the corresponding sample quantities.
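A minimal sketch of this plug-in rule (illustrative Python, not the workshop's code): estimate per-class sample means and covariances, then assign each observation to the class minimizing the quadratic score above.

```python
import numpy as np

def qda_fit(X, y):
    """Plug-in QDA: per-class sample mean and covariance."""
    return {k: (X[y == k].mean(axis=0), np.cov(X[y == k], rowvar=False))
            for k in np.unique(y)}

def qda_predict(params, X):
    """Assign each row to argmin_k (x-mu_k)' Sigma_k^{-1} (x-mu_k) + log|Sigma_k|."""
    classes = list(params)
    scores = np.empty((len(X), len(classes)))
    for j, k in enumerate(classes):
        mu, Sigma = params[k]
        diff = X - mu
        # quadratic form for every row, plus the log-determinant penalty
        scores[:, j] = (np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
                        + np.linalg.slogdet(Sigma)[1])
    return np.array(classes)[scores.argmin(axis=1)]
```

With thousands of genes and few arrays the sample covariance is singular and cannot be inverted, which is what motivates the diagonal special cases on the next slide.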

(6)

ML discriminant rules - special cases

[DLDA] Diagonal linear discriminant analysis: class densities have the same diagonal covariance matrix, Σ = diag(σ_1², …, σ_p²).

[DQDA] Diagonal quadratic discriminant analysis: class densities have different diagonal covariance matrices, Σ_k = diag(σ_1k², …, σ_pk²).

Note: the weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (different variance calculation).
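Continuing the sketch above (again illustrative, not the workshop's code), DLDA pools a single diagonal covariance across classes, so the log-determinant term cancels and the rule reduces to a variance-weighted Euclidean distance:

```python
import numpy as np

def dlda_fit(X, y):
    """DLDA: per-class means plus one pooled diagonal covariance."""
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    # pooled within-class variance per gene (diagonal of the common covariance)
    ss = sum(((X[y == k] - means[k]) ** 2).sum(axis=0) for k in classes)
    return means, ss / (len(X) - len(classes))

def dlda_predict(model, X):
    means, var = model
    classes = list(means)
    scores = np.stack([(((X - means[k]) ** 2) / var).sum(axis=1)
                       for k in classes], axis=1)
    return np.array(classes)[scores.argmin(axis=1)]
```

Only p variances are estimated instead of a full p x p matrix, so the rule stays well-defined even when genes vastly outnumber samples.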

(7)

Microarray Workshop 7

Classification with SVMs

Generalization of the idea of separating hyperplanes in the original space: linear boundaries between classes in a higher-dimensional feature space correspond to non-linear boundaries in the original space. A sketch follows below.

[Figure adapted from the internet.]
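A hedged illustration using scikit-learn (my choice of library, not something the slides prescribe): an RBF-kernel SVM fits a linear boundary in an implicit high-dimensional feature space, which appears as a non-linear boundary in the original two dimensions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy two-class problem no straight line can separate:
# class 0 near the origin, class 1 on a surrounding ring
X0 = rng.normal(0.0, 0.5, size=(50, 2))
theta = rng.uniform(0, 2 * np.pi, 50)
X1 = np.c_[2 * np.cos(theta), 2 * np.sin(theta)] + rng.normal(0.0, 0.2, size=(50, 2))
X = np.vstack([X0, X1])
y = np.r_[np.zeros(50), np.ones(50)]

clf = SVC(kernel='rbf', C=1.0).fit(X, y)      # linear in the kernel feature space
print(clf.predict([[0.0, 0.0], [2.0, 0.0]]))  # expect [0., 1.]
```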

(8)

Nearest neighbor classification

• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
• The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
  - find the k observations in the learning set closest to X;
  - predict the class of X by majority vote, i.e. choose the class that is most common among those k observations.
• The number of neighbors k can be chosen by cross-validation (more on this later); a sketch follows below.
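A minimal sketch of the rule (illustrative code):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """k-NN: majority vote among the k training points closest to x."""
    dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean; 1 - correlation also works
    nearest = y_train[np.argsort(dist)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[counts.argmax()]
```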

(9)


Nearest neighbor rule

(10)

Classification tree

• Partition the feature space into a set of rectangles, then fit a simple model in each one.
• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself.
• Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier.

(11)


Classification tree

[Figure: a classification tree on two genes and the corresponding partition of the feature space. The root splits on Gene 1 at M_i1 < -0.67 and the next node on Gene 2 at M_i2 > 0.18, with terminal nodes labeled 0, 1 and 2; the cut points -0.67 (Gene 1) and 0.18 (Gene 2) divide the plane into three rectangles carrying those class labels.]

(12)

Three aspects of tree construction

• Split selection rule:
  - e.g., at each node, choose the split maximizing the decrease in impurity (Gini index, entropy, misclassification error); a sketch follows below.
• Split-stopping:
  - e.g., grow a large tree, then prune to obtain a sequence of subtrees and use cross-validation to identify the subtree with the lowest misclassification rate.
• Class assignment:
  - e.g., for each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node.

Supplementary slide
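A minimal sketch of split selection by decrease in Gini impurity (illustrative code; entropy or misclassification error would slot in the same way):

```python
import numpy as np

def gini(y):
    """Gini impurity of a node: 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_split(x, y):
    """Best threshold on one feature: maximize the decrease in impurity."""
    best_t, best_gain = None, 0.0
    parent = gini(y)
    for t in np.unique(x)[:-1]:                  # candidate cut points
        left, right = y[x <= t], y[x > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - child > best_gain:
            best_t, best_gain = t, parent - child
    return best_t, best_gain
```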

(13)


Other classifiers include…

• Neural networks

• Projection pursuit

• Bayesian belief networks

• …

(14)

Why select features

• Leads to better classification performance by removing variables that are noise with respect to the outcome.
• May provide useful insights into the etiology of a disease.
• Can eventually lead to diagnostic tests (e.g., a "breast cancer chip").

(15)


Approaches to feature selection

• Methods fall into three basic categories:
  - Filter methods
  - Wrapper methods
  - Embedded methods
• The simplest and most frequently used are the filter methods.

Adapted from A. Hartemink

(16)

Filter methods

[Diagram: feature selection maps the full feature space R^p to R^s, s << p, followed by classifier design.]

• Features are scored independently and the top s are used by the classifier (sketch below).
• Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.
• Easy to interpret; can provide some insight into disease markers.
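A sketch of a filter (illustrative code, assuming two classes coded 0/1): score every gene with a two-sample t-statistic and keep the s highest-scoring ones.

```python
import numpy as np

def t_filter(X, y, s=50):
    """Rank genes by absolute two-sample (Welch) t-statistic; keep the top s."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(t)[::-1][:s]   # column indices of the s top-scoring genes
```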

(17)


Problems with filter method

• Redundancy in selected features: features are considered independently and not measured on the basis of whether they contribute new information.
• Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others).
• The classifier has no say in which features are used: some scores may be more appropriate in conjunction with some classifiers than others.

Supplementary slide. Adapted from A. Hartemink

(18)

Wrapper methods

[Diagram: feature selection maps R^p to R^s, s << p, iterating with classifier design.]

• Iterative approach: many feature subsets are scored based on classification performance and the best one is used.
• Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc. A forward-selection sketch follows below.
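A hedged sketch of greedy forward selection (illustrative; fit_score is an assumed caller-supplied function returning, e.g., cross-validated accuracy for a candidate feature subset):

```python
import numpy as np

def forward_select(X, y, fit_score, max_features=10):
    """Greedily add the feature that most improves the classifier's score."""
    selected, best_score = [], -np.inf
    while len(selected) < max_features:
        candidates = {j: fit_score(X[:, selected + [j]], y)
                      for j in range(X.shape[1]) if j not in selected}
        if not candidates:
            break
        j_best = max(candidates, key=candidates.get)
        if candidates[j_best] <= best_score:
            break                     # no remaining feature improves the score
        selected.append(j_best)
        best_score = candidates[j_best]
    return selected
```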

(19)


Problems with wrapper methods

• Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated.
• No exhaustive search is possible (2^p subsets to consider): generally greedy algorithms only.
• Easy to overfit.

Supplementary slide. Adapted from A. Hartemink

(20)

Embedded methods

• Attempt to jointly or simultaneously train both a classifier and a feature subset.
• Often optimize an objective function that jointly rewards classification accuracy and penalizes the use of more features.
• Intuitively appealing.

Some examples: tree-building algorithms, shrinkage methods (LDA, kNN).

(21)


Performance assessment

• Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large independent population-based collection of samples is available at the initial classifier-building phase.
• One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.
• Performance of the classifier can be assessed by:
  - cross-validation;
  - a test set;
  - independent testing on a future dataset.

(22)

Performance assessment (I)

• Resubstitution estimation: error rate on the learning set.
  - Problem: downward bias.
• Test set estimation:
  1) Divide the learning set into two sub-sets, L and T; build the classifier on L and compute the error rate on T.
  2) Build the classifier on the training set (L) and compute the error rate on an independent test set (T).
  - L and T must be independent and identically distributed (i.i.d.).
  - Problem: reduced effective sample size.

Supplementary slide

(23)


Performance assessment (II)

• V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build classifiers leaving one subset out at a time; compute the test set error rate on the left-out subset and average over the V runs (sketch below).
  - Bias-variance tradeoff: smaller V can give larger bias but smaller variance.
  - Computationally intensive.
• Leave-one-out cross-validation (LOOCV): the special case V = n. Works well for stable classifiers (k-NN, LDA, SVM).

Supplementary slide
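A minimal V-fold CV sketch (illustrative; fit_predict is an assumed function that trains on the supplied training rows and returns predictions for the test rows):

```python
import numpy as np

def cv_error(X, y, fit_predict, V=5, seed=0):
    """V-fold cross-validation estimate of the misclassification rate."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), V)
    errors = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        y_hat = fit_predict(X[train], y[train], X[test])
        errors.append(np.mean(y_hat != y[test]))
    return np.mean(errors)             # average error over the V left-out folds
```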

(24)

Performance assessment (III)

• It is common practice to do feature selection on the full learning set, then use CV only for model building and classification.
• However, the features are usually not known in advance, and the intended inference includes feature selection. Then CV estimates as above tend to be downward biased.
• Features (variables) should be selected only from the portion of the learning set used to build the model within each fold, not from the entire set; see the sketch below.
