
Logistic Regression

Two Worlds: Probabilistic & Algorithmic

Can we have a probabilistic classifier with a modelling focus on classification?

Bayes Classifier

Probabilistic classifier with a generative setup based on class density models: Bayes (Gauss), Naïve Bayes

"Direct" Classifiers

Find the best parameters (e.g. $\boldsymbol{w}$) with respect to a specific loss function measuring misclassification: Perceptron, SVM, Tree, ANN

We know two conceptual approaches to classification:

Bayes classifier: data → class density estimation → classification rule → decision

"Direct" classifier: data → classification function learning → decision


Advantages of Both Worlds

• Posterior distribution has advantages over a classification label:

• Asymmetric risks: need the classification probability

• Classification certainty: indicator if the decision is unsure

• Algorithmic approach with direct learning has advantages:

• Focus of modelling power on correct classification where it counts

• Easier decision line interpretation

• Combination?

Discriminative Probabilistic Classifier

𝑃 π‘₯ 𝐢1 𝑃 π‘₯ 𝐢2

𝑔 Τ¦π‘₯ = π’˜π‘‡π’™ + 𝑀0

Bayes Classifier Linear Classifier

Discriminative Probabilistic

Classifier

Bishop PRML

Bishop PRML

(3)

Towards a "Direct" Probabilistic Classifier

• Idea 1: Directly learn a posterior distribution

For classification with the Bayes classifier, the posterior distribution is relevant. We can directly estimate a model of this distribution (we called this a discriminative classifier in Naïve Bayes). We know from Naïve Bayes that we can probably expect a good performance from the posterior model.

• Idea 2: Extend linear classification with a probabilistic interpretation

The linear classifier outputs a distance to the decision plane. We can use this value and interpret it probabilistically: "The further away, the more certain."

Logistic Regression

Logistic regression will implement both ideas: it is a model of a posterior class distribution for classification and can be interpreted as a probabilistic linear classifier. But it is a fully probabilistic model, not only a "post-processing" of a linear classifier.

It extends the hyperplane decision idea to the Bayes world:

• Direct model of the posterior for classification

• Probabilistic model (classification according to a probability distribution)

• Discriminative model (models the posterior rather than likelihood and prior)

• Linear model for classification

• Simple and accessible (we can understand it)

• We can study the relation to other linear classifiers, e.g. the SVM

History of Logistic Regression

• Logistic regression is a very "old" method of statistical analysis and in widespread use, especially in the traditional statistics community (not machine learning).

1957/58: Walker, Duncan, Cox

• A method more often used to study and identify explaining factors rather than to do individual prediction.

Statistical analysis vs. the prediction focus of modern machine learning. Many medical studies of risk factors etc. are based on logistic regression.

Statistical Data Models

The simplest form besides a constant (one prototype) is a linear model.

$\mathrm{Lin}_{\boldsymbol{w}}(\boldsymbol{x}) = \langle \boldsymbol{w}, \boldsymbol{x} \rangle + w_0 = \sum_{i=1}^{d} w_i x_i + w_0$

With the augmented vectors

$\tilde{\boldsymbol{x}} = \begin{bmatrix} 1 \\ \boldsymbol{x} \end{bmatrix}, \qquad \tilde{\boldsymbol{w}} = \begin{bmatrix} w_0 \\ \boldsymbol{w} \end{bmatrix}$

$\Rightarrow \; \mathrm{Lin}_{\boldsymbol{w}, w_0}(\boldsymbol{x}) = \langle \boldsymbol{w}, \boldsymbol{x} \rangle + w_0 = \langle \tilde{\boldsymbol{w}}, \tilde{\boldsymbol{x}} \rangle$

We do not know $P(\boldsymbol{x}, y)$, but we can assume a certain form.

→ This is called a data model.

Repetition: Linear Classifier

Linear classification rule:

$g(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} + w_0$

$g(\boldsymbol{x}) \ge 0 \Rightarrow$ class 1, $\quad g(\boldsymbol{x}) < 0 \Rightarrow$ class 2

The decision boundary is a hyperplane.

Repetition: Posterior Distribution

• Classification with the posterior distribution: Bayes

Based on class densities and a prior:

$P(C_1 \mid \boldsymbol{x}) = \dfrac{p(\boldsymbol{x} \mid C_1)\, P(C_1)}{p(\boldsymbol{x} \mid C_1)\, P(C_1) + p(\boldsymbol{x} \mid C_2)\, P(C_2)}$

$P(C_2 \mid \boldsymbol{x}) = \dfrac{p(\boldsymbol{x} \mid C_2)\, P(C_2)}{p(\boldsymbol{x} \mid C_1)\, P(C_1) + p(\boldsymbol{x} \mid C_2)\, P(C_2)}$

Bishop PRML

Combination: Discriminative Classifier

Probabilistic interpretation of the classification output: ~distance to the separation plane

[Figure (Bishop PRML): decision boundary.]

Notation Changes

• We work with two classes

Data with (numerical) feature vectors $\vec{x}$ and labels $y \in \{0, 1\}$

We do not use the Bayes notation with $\omega$ anymore. We will need the explicit label value of $y$ in our models later.

• Classification goal: infer the best class label ($0$ or $1$) for a given feature point:

$y^* = \arg\max_{y \in \{0,1\}} P(y \mid \boldsymbol{x})$

• All our modeling focuses only on the posterior of having class 1:

$P(y = 1 \mid \boldsymbol{x})$

• Obtaining the other is trivial: $P(y = 0 \mid \boldsymbol{x}) = 1 - P(y = 1 \mid \boldsymbol{x})$

Parametric Posterior Model

We need a model for the posterior distribution, depending on the feature vector (of course) and neatly parameterized.

The linear classifier is a good starting point. We know its parametrization very well:

We thus model the posterior as a function of the linear classifier:

Posterior from the classification result: "scaled distance" to the decision plane

$P(y = 1 \mid \boldsymbol{x}, \boldsymbol{\theta}) = f(\boldsymbol{x}; \boldsymbol{\theta})$

$g(\boldsymbol{x}; \boldsymbol{w}, w_0) = \boldsymbol{w}^T\boldsymbol{x} + w_0$

$P(y = 1 \mid \boldsymbol{x}, \boldsymbol{w}, w_0) = f(\boldsymbol{w}^T\boldsymbol{x} + w_0)$

Logistic Function

To use the unbounded distance to the decision plane in a probabilistic setup, we need to map it into the interval $[0, 1]$.

This is very similar to what we did in neural nets: the activation function.

The logistic function $\sigma(x)$ squashes a value $x \in \mathbb{R}$ to $[0, 1]$:

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

The logistic function is a smooth, soft threshold: $\sigma(x) \to 1$ for $x \to \infty$, $\sigma(x) \to 0$ for $x \to -\infty$, and $\sigma(0) = \tfrac{1}{2}$.
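As a small illustration, a minimal NumPy sketch of the logistic function and its limiting behaviour (function and variable names are my own, not from the slides):

import numpy as np

def sigmoid(x):
    # logistic function: squashes any real value into the unit interval
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))     # 0.5
print(sigmoid(10.0))    # close to 1 for large positive x
print(sigmoid(-10.0))   # close to 0 for large negative x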


The Logistic Function

The Logistic "Regression"

The Logistic Regression Posterior

We model the posterior distribution for classification in a two-class setting by applying the logistic function to the linear classifier:

$P(y = 1 \mid x) = \sigma(g(x))$

$P(y = 1 \mid \boldsymbol{x}, \boldsymbol{w}, w_0) = f(\boldsymbol{w}^T\boldsymbol{x} + w_0) = \dfrac{1}{1 + e^{-(\boldsymbol{w}^T\boldsymbol{x} + w_0)}}$

This is a location-dependent model of the posterior distribution, parametrized by a linear hyperplane classifier.

Logistic Regression is a Linear Classifier

The logistic regression posterior leads to a linear classifier:

$P(y = 1 \mid \boldsymbol{x}, \boldsymbol{w}, w_0) = \dfrac{1}{1 + \exp\left(-(\boldsymbol{w}^T\boldsymbol{x} + w_0)\right)}, \qquad P(y = 0 \mid \boldsymbol{x}, \boldsymbol{w}, w_0) = 1 - P(y = 1 \mid \boldsymbol{x}, \boldsymbol{w}, w_0)$

$P(y = 1 \mid \boldsymbol{x}, \boldsymbol{w}, w_0) > \tfrac{1}{2} \;\Rightarrow\; y = 1$ classification; $y = 0$ otherwise

The classification boundary is at $P(y = 1 \mid \boldsymbol{x}, \boldsymbol{w}, w_0) = \tfrac{1}{2}$:

$\dfrac{1}{1 + \exp\left(-(\boldsymbol{w}^T\boldsymbol{x} + w_0)\right)} = \tfrac{1}{2} \;\Rightarrow\; \boldsymbol{w}^T\boldsymbol{x} + w_0 = 0$

⇒ The classification boundary is a hyperplane.
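To make this correspondence concrete, here is a small sketch (with assumed example weights and my own naming) that evaluates the posterior and checks that thresholding it at 1/2 gives the same label as the sign of the linear function:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posterior_class1(x, w, w0):
    # P(y = 1 | x, w, w0) = sigma(w^T x + w0)
    return sigmoid(np.dot(w, x) + w0)

w, w0 = np.array([2.0, -1.0]), 0.5          # assumed example parameters
x = np.array([0.3, 1.2])

p1 = posterior_class1(x, w, w0)
label_from_posterior = int(p1 > 0.5)
label_from_hyperplane = int(np.dot(w, x) + w0 >= 0)
print(p1, label_from_posterior, label_from_hyperplane)   # both labels agree here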


Interpretation: Logit

Is the choice of the logistic function justified?

• Yes, the logit is a linear function of our data:

Logit: log of the odds ratio $\ln\dfrac{p}{1 - p}$

$\ln\dfrac{P(y = 1 \mid \boldsymbol{x})}{P(y = 0 \mid \boldsymbol{x})} = \boldsymbol{w}^T\boldsymbol{x} + w_0$

The linear function (~distance from the decision plane) directly expresses our classification certainty, measured by the "odds ratio": double distance ↔ squared odds, e.g. $3{:}2 \to 9{:}4$.

• But other choices are valid, too

They lead to other models than logistic regression, e.g. probit regression

→ Generalized Linear Models (GLM): $E[y] = f^{-1}(\boldsymbol{w}^T\boldsymbol{x} + w_0)$
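A tiny numerical check of this odds-ratio reading (distances chosen only for illustration): doubling the distance to the decision plane squares the odds.

import numpy as np

def odds(distance):
    # logit = w^T x + w0 = signed distance (up to scaling); odds = p / (1 - p) = exp(logit)
    return np.exp(distance)

d = np.log(3.0 / 2.0)     # distance whose odds are 3:2
print(odds(d))            # 1.5  -> odds 3:2
print(odds(2.0 * d))      # 2.25 -> odds 9:4, the square of 3:2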

The Logistic Regression

• So far we have made no assumption on the data!

• We can get $r(x)$ from a generative model or model it directly as a function of the data (discriminative)

Logistic regression:

Model: the logit $r(\boldsymbol{x}) = \log\dfrac{P(y = 1 \mid \boldsymbol{x})}{P(y = 0 \mid \boldsymbol{x})} = \log\dfrac{p}{1 - p}$ is a linear function of the data:

$r(\boldsymbol{x}) = \log\dfrac{p}{1 - p} = \langle \boldsymbol{w}, \boldsymbol{x} \rangle + w_0 = \sum_{i=1}^{d} w_i x_i + w_0$

Training a Posterior Distribution Model

The posterior model for classification requires training. Logistic regression is not just a post-processing of a linear classifier. Learning of good parameter values needs to be done with respect to the probabilistic meaning of the posterior distribution.

• In the probabilistic setting, learning is usually estimation

We now have a slightly different situation than with Bayes: we do not need class densities but a good posterior distribution.

• We will use Maximum Likelihood and Maximum-A-Posteriori estimates of our parameters $\boldsymbol{w}, w_0$

Later: this also corresponds to a cost function for obtaining $\boldsymbol{w}, w_0$

Maximum Likelihood Learning

The Maximum Likelihood principle can be adapted to fit the posterior distribution (discriminative case):

• We choose the parameters $\boldsymbol{w}, w_0$ which maximize the posterior distribution of the training set $\boldsymbol{X}$ with labels $\boldsymbol{Y}$:

$P(y \mid \boldsymbol{x}; \boldsymbol{w}, w_0) = P(y = 1 \mid \boldsymbol{x}; \boldsymbol{w}, w_0)^{y} \, P(y = 0 \mid \boldsymbol{x}; \boldsymbol{w}, w_0)^{1-y}$

$(\boldsymbol{w}, w_0) = \arg\max_{\boldsymbol{w}, w_0} P(Y \mid X; \boldsymbol{w}, w_0) = \arg\max_{\boldsymbol{w}, w_0} \prod_{\boldsymbol{x} \in X} P(y \mid \boldsymbol{x}; \boldsymbol{w}, w_0) \quad \text{(i.i.d.)}$

Logistic Regression: Maximum Likelihood Estimate of w (1)

To simplify the notation we use the augmented vectors $\boldsymbol{w}, \boldsymbol{x}$ (bias absorbed) instead of $\boldsymbol{w}, w_0$.

The discriminative (log) likelihood function for our data:

$P(Y \mid X) = \prod_{i=1}^{N} P(y_i \mid x_i)$

With $P(y = 1 \mid x) = \sigma(\boldsymbol{w}^T\boldsymbol{x}) \equiv p$ and $P(y = 0 \mid x) = 1 - \sigma(\boldsymbol{w}^T\boldsymbol{x})$:

$P(y \mid x) = P(y = 1 \mid x)^{y} \, P(y = 0 \mid x)^{1-y} = p^{y} (1 - p)^{1-y}$

$\Rightarrow \; P(Y \mid X) = \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i}$

$\log P(Y \mid X) = \sum_{i=1}^{N} \left[ y_i \log\dfrac{p_i}{1 - p_i} + \log(1 - p_i) \right] = \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$

The negative of this is the "cross-entropy" cost function.
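As a sketch, the cross-entropy cost evaluated with NumPy (toy data and names are my own; labels y in {0, 1}, data rows already augmented with a leading 1):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # log P(Y | X) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])   # assumed toy data
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.1, 0.8])
print(-log_likelihood(w, X, y))    # cross-entropy cost of this w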

Maximum Likelihood Estimate of w (2)

Log-likelihood function continued:

$\log L(Y \mid X, \boldsymbol{w}) \equiv \log P(Y \mid X) = \sum_{i=1}^{N} \left[ y_i \log\dfrac{p_i}{1 - p_i} + \log(1 - p_i) \right]$

$\log L(Y \mid X, \boldsymbol{w}) = \sum_{i=1}^{N} \left[ y_i \, \boldsymbol{w}^T\boldsymbol{x}_i - \log\left(1 + e^{\boldsymbol{w}^T\boldsymbol{x}_i}\right) \right]$

Remember $p_i = \sigma(\boldsymbol{w}^T\boldsymbol{x}_i) = \dfrac{1}{1 + e^{-\boldsymbol{w}^T\boldsymbol{x}_i}}$ and the linear logit $\log\dfrac{p_i}{1 - p_i} = \boldsymbol{w}^T\boldsymbol{x}_i$.

Maximum Likelihood Estimate of w (3)

$\dfrac{\partial}{\partial \boldsymbol{w}} \log L(Y \mid X, \boldsymbol{w}) = \dfrac{\partial}{\partial \boldsymbol{w}} \sum_{i=1}^{N} \left[ y_i \, \boldsymbol{w}^T\boldsymbol{x}_i - \log\left(1 + e^{\boldsymbol{w}^T\boldsymbol{x}_i}\right) \right] = \sum_{i=1}^{N} \left[ y_i \boldsymbol{x}_i^T - \dfrac{e^{\boldsymbol{w}^T\boldsymbol{x}_i}}{1 + e^{\boldsymbol{w}^T\boldsymbol{x}_i}} \boldsymbol{x}_i^T \right]$

Derivative of a Dot Product

πœ•

πœ•π’˜= 𝛻𝐰= πœ•

πœ•π‘€1, πœ•

πœ•π‘€2, … , πœ•

πœ•π‘€π‘‘

πœ•

πœ•π’˜π’˜π‘‡π’™ = πœ•

πœ•π‘€1π’˜π‘‡π’™, πœ•

πœ•π‘€2π’˜π‘‡π’™, … , πœ•

πœ•π‘€π‘‘π’˜π‘‡π’™

πœ•

πœ•π‘€π‘–π’˜π‘‡π’™ = πœ•

πœ•π‘€π‘–ΰ·

π‘˜=0 𝑑

π‘€π‘˜π‘₯π‘˜ = π‘₯𝑖

πœ•

πœ•π’˜π’˜π‘‡π’™ = π‘₯1, π‘₯2, … , π‘₯𝑑 = 𝒙𝑇 Gradient operator

Final derivative Per component

(14)

   

Maximum Likelihood Estimate of w (3)

$\dfrac{\partial}{\partial \boldsymbol{w}} \log L(Y \mid X, \boldsymbol{w}) = \sum_{i=1}^{N} \left[ y_i \boldsymbol{x}_i^T - \dfrac{e^{\boldsymbol{w}^T\boldsymbol{x}_i}}{1 + e^{\boldsymbol{w}^T\boldsymbol{x}_i}} \boldsymbol{x}_i^T \right] \stackrel{!}{=} 0$

$= \sum_{i=1}^{N} \left( y_i - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i) \right) \boldsymbol{x}_i^T \stackrel{!}{=} 0$

using $\dfrac{e^{\boldsymbol{w}^T\boldsymbol{x}_i}}{1 + e^{\boldsymbol{w}^T\boldsymbol{x}_i}} = \dfrac{1}{1 + e^{-\boldsymbol{w}^T\boldsymbol{x}_i}} = \sigma(\boldsymbol{w}^T\boldsymbol{x}_i)$

• Non-linear equation in $\boldsymbol{w}$: no closed-form solution.

• The function $\log L$ is concave, therefore a unique maximum exists.
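A direct transcription of this gradient as a sketch (my own names; plain gradient ascent is used here only for illustration, the slides proceed with Newton-Raphson / IRLS):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_likelihood(w, X, y):
    # d/dw log L = sum_i (y_i - sigma(w^T x_i)) x_i
    p = sigmoid(X @ w)
    return (y - p) @ X

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 1.0]])   # assumed toy data, augmented
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.zeros(2)
for _ in range(500):
    w += 0.1 * grad_log_likelihood(w, X, y)    # ascend the concave log-likelihood
print(w)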

Iterative Reweighted Least Squares

The concave $\log P(\boldsymbol{Y} \mid \boldsymbol{X})$ can be maximized iteratively with the Newton-Raphson algorithm: Iterative Reweighted Least Squares.

$\boldsymbol{w}_{n+1} \leftarrow \boldsymbol{w}_n - \boldsymbol{H}^{-1} \dfrac{\partial}{\partial \boldsymbol{w}} \ln P(\boldsymbol{Y} \mid \boldsymbol{X}; \boldsymbol{w}_n)$

Derivatives and evaluation always with respect to $\boldsymbol{w}_n$.

Hessian: Concave Likelihood

𝑯 = πœ•2

πœ•π’˜πœ•π’˜π‘‡ln 𝑃 π‘Œ|𝑋

πœ•

πœ•π’˜

πœ•

πœ•π’˜π‘‡ln 𝑃 π‘Œ 𝑋 = βˆ’ ෍

𝑖

π’™π‘–π’™π‘–π‘‡πœŽ π’˜π‘‡π’™π‘– 1 βˆ’ 𝜎 π’˜π‘‡π’™π‘– = βˆ’π‘Ώπ‘‡π‘Ίπ‘Ώ We use an old trick to keep it simple:

π’˜ ≔ 𝑀0

π’˜ , 𝒙 ≔ 1 𝒙

The Hessian is negative definite:

β€’ The sample covariance matrixσ𝑖𝒙𝑖𝒙𝑖𝑇is positive definite

β€’ 𝜎 π’˜π‘‡π’™π‘– 1 βˆ’ 𝜎 π’˜π‘‡π’™π‘– is always positive

The optimization problem is said to be convexand has thus a optimal solution which can be iteratively calculated.

32
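A small numeric sketch of the Hessian $-\boldsymbol{X}^T\boldsymbol{S}\boldsymbol{X}$ and a check of its negative definiteness (toy data and names assumed):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(w, X):
    # H = -X^T S X with S = diag( sigma(w^T x_i) * (1 - sigma(w^T x_i)) )
    p = sigmoid(X @ w)
    S = np.diag(p * (1 - p))
    return -X.T @ S @ X

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])   # assumed augmented toy data
w = np.array([0.2, 0.7])
print(np.linalg.eigvalsh(hessian(w, X)))   # all eigenvalues negative -> negative definite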

Iterative Reweighted Least Squares

The Newton-Raphson iteration from above results in a sequence of reweighted least-squares steps (sketched in code below):

$\boldsymbol{w}_{n+1} = \left( \boldsymbol{X}^T \boldsymbol{S} \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{S} \, \boldsymbol{z}$

$\boldsymbol{z} = \boldsymbol{X}\boldsymbol{w}_n + \boldsymbol{S}^{-1}\left( \boldsymbol{Y} - \boldsymbol{P}(\boldsymbol{w}_n) \right)$

• Weighted least-squares with $\boldsymbol{z}$ as target: $\left( \boldsymbol{X}^T \boldsymbol{S} \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{S} \, \boldsymbol{z}$

• $\boldsymbol{z}$: adjusted responses (updated every iteration)

• $\boldsymbol{P}(\boldsymbol{w}_n)$: vector of responses $[p_1, p_2, \ldots, p_N]^T$
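A compact sketch of this IRLS update (assuming an augmented data matrix with a leading 1-column, my own variable names, and no safeguards for the separable case):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(X, y, n_iter=20):
    # Newton-Raphson / Iterative Reweighted Least Squares for logistic regression
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                              # current responses P(w_n)
        S = np.diag(p * (1 - p))                        # weight matrix S
        z = X @ w + np.linalg.solve(S, y - p)           # adjusted responses z
        w = np.linalg.solve(X.T @ S @ X, X.T @ S @ z)   # weighted least-squares step
    return w

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 1.0]])   # assumed toy data
y = np.array([1.0, 0.0, 1.0, 0.0])
print(irls(X, y))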


Example: Logistic Regression

Solid line: classification boundary ($p = 0.5$)

Dashed lines: $p = 0.25$ and $p = 0.75$ contours

Probabilistic result: posterior of the classification everywhere

The posterior probability decays/increases with the distance to the decision boundary.

Linearly Separable

• Maximum Likelihood learning is problematic in the linearly separable case: $\boldsymbol{w}$ diverges in length

→ leads to classification with infinite certainty

• Classification is still correct, but the posterior estimate is not

Prior Assumptions

• Infinitely certain classification is likely an estimation artefact:

We do not have enough training samples

→ maximum likelihood estimation leads to problematic results

• Solution: MAP estimate with prior assumptions on $\boldsymbol{w}$

$P(\boldsymbol{w}) = N(\boldsymbol{w} \mid 0, \sigma^2 I)$: smaller $\boldsymbol{w}$ are preferred (shrinkage)

The likelihood model is unchanged: $P(y \mid \boldsymbol{x}, \boldsymbol{w}, w_0) = p^{y}(1 - p)^{1-y}$

$(\boldsymbol{w}, w_0) = \arg\max_{\boldsymbol{w}, w_0} P(Y \mid X; \boldsymbol{w}, w_0) \, P(\boldsymbol{w}) = \arg\max_{\boldsymbol{w}, w_0} P(\boldsymbol{w}) \prod_{\boldsymbol{x} \in X} P(y \mid \boldsymbol{x}, \boldsymbol{w}, w_0)$

MAP Learning

$\ln\left[ P(\boldsymbol{w}) \prod_{\boldsymbol{x} \in X} P(y \mid \boldsymbol{x}, \boldsymbol{w}, w_0) \right] = \sum_i \left( y_i (\boldsymbol{w}^T\boldsymbol{x}_i + w_0) - \ln\left(1 + \exp(\boldsymbol{w}^T\boldsymbol{x}_i + w_0)\right) \right) - \dfrac{1}{2\sigma^2} \lVert \boldsymbol{w} \rVert^2$

$\dfrac{\partial}{\partial \boldsymbol{w}} \ln\left[ P(Y \mid X; \boldsymbol{w}, w_0)\, P(\boldsymbol{w}) \right] = \sum_i \left( y_i - \sigma(\boldsymbol{w}^T\boldsymbol{x}_i + w_0) \right)\boldsymbol{x}_i^T - \dfrac{1}{\sigma^2} \boldsymbol{w}^T \stackrel{!}{=} 0$

We need: $\dfrac{\partial}{\partial \boldsymbol{w}} \lVert \boldsymbol{w} \rVert^2 = 2\boldsymbol{w}^T$

• Iterative solution: Newton-Raphson

• The prior enforces a regularization
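A sketch of the MAP gradient with the Gaussian prior (the prior variance sigma2 is an assumed hyperparameter; only the weights, not the bias in the first augmented component, are shrunk, matching the formulas above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_gradient(w, X, y, sigma2):
    # gradient of the log-posterior: sum_i (y_i - sigma(w^T x_i)) x_i - w / sigma2
    p = sigmoid(X @ w)
    grad = (y - p) @ X
    penalty = w / sigma2
    penalty[0] = 0.0            # do not shrink the bias w0
    return grad - penalty

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])   # assumed toy data, linearly separable
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
for _ in range(2000):
    w += 0.05 * map_gradient(w, X, y, sigma2=1.0)
print(w)    # stays finite despite the separable data, thanks to the prior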


Bayesian Logistic Regression

Idea: In the separable case, there are many perfect linear classifiers which all separate the data. Average the classification result and accuracy using all of these classifiers.

• Optimal way to deal with missing knowledge in the Bayes sense

Bishop PRML

Logistic Regression and Neural Nets

• The standard single neuron with the logistic activation is logistic regression if trained with the same cost function (cross-entropy)

But training with least-squares results in a different classifier

• Multiclass logistic regression with soft-max corresponds to what is called a soft-max layer in ANN. It is the standard multiclass output in most ANN architectures.

$P(y = 1 \mid \boldsymbol{x}, \boldsymbol{w}, w_0) = \sigma(\boldsymbol{w}^T\boldsymbol{x} + w_0)$

[Figure: a single neuron with inputs $x_1, x_2, x_3$, weights $\boldsymbol{w}$, summation $\Sigma$ and logistic activation $\sigma$.]

Non-Linear Extension

• Logistic regression is often extended to non-linear cases:

Extension through adding additional transformed features, e.g. $\boldsymbol{x} := (\boldsymbol{x}, x_1 x_2, x_2^2)$ (see the sketch after this list)

• Combination terms: $x_i x_j$

• Monomial terms: $x_i^2$

Standard procedure in medicine: inspect the resulting $\boldsymbol{w}$ to find important factors and interactions $x_i x_j$ (comes with statistical information).

• Usage of kernels is possible: training and classification can be formulated with dot products of data points. The scalar products can be "replaced" by kernel expansions with the kernel trick.

Kernel Logistic Regression

• Equations of logistic regression can be reformulated with dot products:

$\boldsymbol{w}^T\boldsymbol{x} = \sum_{i=1}^{N} \alpha_i \boldsymbol{x}_i^T \boldsymbol{x} \;\to\; \sum_{i=1}^{N} \alpha_i k(\boldsymbol{x}_i, \boldsymbol{x})$

$P(y = 1 \mid \boldsymbol{x}) = \sigma\left( \sum_{i=1}^{N} \alpha_i k(\boldsymbol{x}_i, \boldsymbol{x}) \right)$

• No support vectors: kernel evaluations with all training points

IVM (import vector machine):

Extension with only sparse support points

Ji Zhu & Trevor Hastie (2005), "Kernel Logistic Regression and the Import Vector Machine", Journal of Computational and Graphical Statistics.
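A sketch of the kernelized posterior with an RBF kernel (kernel choice, bandwidth and the alpha coefficients are assumed placeholders here; fitting the alphas is not shown):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_posterior(x, X_train, alpha, gamma=1.0):
    # P(y = 1 | x) = sigma( sum_i alpha_i k(x_i, x) ): kernel evaluations with all training points
    k = np.array([rbf_kernel(xi, x, gamma) for xi in X_train])
    return sigmoid(alpha @ k)

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])   # assumed training points
alpha = np.array([0.5, -1.0, 2.0])                          # assumed (not fitted) coefficients
print(kernel_posterior(np.array([1.0, 0.5]), X_train, alpha))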


Discriminative vs. Generative

Comparison of logistic regression to naïve Bayes

Ng, Andrew Y., and Michael I. Jordan. "On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes." Advances in NIPS 14, 2001.

Conclusion:

• Logistic regression has a lower asymptotic error

• Naïve Bayes can reach its (higher) asymptotic error faster

General over-simplification (dangerous!): use a generative model with few data (more knowledge) and a discriminative model with a lot of training data (more learning).

Logistic Regression: Summary

✓ A probabilistic, linear method for classification!

✓ Discriminative method (model for the posterior)

✓ Linear model for the logit:

$\log\dfrac{p}{1 - p} = \langle \boldsymbol{w}, \boldsymbol{x} \rangle$

✓ The posterior probability is given by the logistic function of the logit:

$P(y = 1 \mid \boldsymbol{x}, \boldsymbol{w}) = \dfrac{1}{1 + \exp\left(-\langle \boldsymbol{w}, \boldsymbol{x} \rangle\right)}$

✓ ML estimation of $\boldsymbol{w}$ is unique but non-linear

✓ Logistic regression is a very frequently used method
