(1)

Machine Learning

Bayesian Decision Theory

Dmitrij Schlesinger

WS2013/2014, November 5, 2013

(2)

Recognition

The model:

Let two random variables be given:

– The first one is typically discrete (k ∈ K) and is called the “class”
– The second one is often continuous (x ∈ X) and is called the “observation”

Let the joint probability distribution p(x, k) be “given”.

As k is discrete, it is often specified as p(x, k) = p(k) · p(x|k).

The recognition task: given x, estimate k.

Usual problems (questions):

– How to estimate k from x? (today)
– The joint probability is not always explicitly specified
– The set K is sometimes huge

(3)

Idea – a game

Somebody samples a pair (x, k) according to the p.d. p(x, k). He keeps k hidden and presents x to you.

You decide for some k according to a chosen decision strategy.

Somebody penalizes your decision according to a loss function, i.e. he compares your decision to the true hidden k.

You know both p(x, k) and the loss function (i.e. how he compares).

Your goal is to design the decision strategy so as to pay as little as possible on average.
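As a concrete illustration of the game, a minimal simulation sketch (the priors, the 1-D Gaussian observation model, and the threshold strategy below are made-up assumptions, not from the slides): an opponent samples (x, k), only x is shown, and the 0/1 loss is averaged over many rounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model (made up): two classes with priors p(k) and 1-D Gaussian p(x|k)
p_k = np.array([0.7, 0.3])
mu = np.array([0.0, 2.0])

def play_round(strategy):
    """Somebody samples (x, k), shows only x, and penalizes with the 0/1 loss."""
    k = rng.choice(2, p=p_k)          # hidden class, drawn from p(k)
    x = rng.normal(mu[k], 1.0)        # observation, drawn from p(x|k)
    return float(strategy(x) != k)    # loss paid in this round

strategy = lambda x: 0 if x < 1.0 else 1            # some chosen decision strategy e: X -> K
losses = [play_round(strategy) for _ in range(10_000)]
print(np.mean(losses))                              # empirical average loss of this strategy
```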

(4)

Bayesian Risk

Notations:

The decision set D. Note: it need not coincide with K!

Examples: decisions like “I don’t know”, “not this class”, ...

A decision strategy is a mapping e: X → D. A loss function is a mapping C: D × K → R.

The Bayesian Risk of a strategy e is the expected loss:

R(e) = \sum_x \sum_k p(x, k) \cdot C(e(x), k) \;\to\; \min_e

It should be minimized with respect to the decision strategy
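A minimal numerical sketch of this definition (the joint table, the loss matrix, and the candidate strategy are made-up assumptions): the risk of a strategy e is just the loss C(e(x), k) averaged under p(x, k).

```python
import numpy as np

# Hypothetical toy model (all numbers made up): 3 observations, 2 classes
p_xk = np.array([[0.30, 0.10],    # p(x=0, k=0), p(x=0, k=1)
                 [0.05, 0.25],
                 [0.15, 0.15]])
C = np.array([[0.0, 1.0],         # C(d, k): rows = decision, columns = true class
              [1.0, 0.0]])

def bayesian_risk(e, p_xk, C):
    """R(e) = sum_x sum_k p(x, k) * C(e(x), k)."""
    return sum(p_xk[x, k] * C[e[x], k]
               for x in range(p_xk.shape[0])
               for k in range(p_xk.shape[1]))

e = [0, 1, 0]                     # one candidate strategy e: X -> D, as a lookup table
print(bayesian_risk(e, p_xk, C))  # expected loss of this particular strategy
```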

(5)

Some variants

General:

R(e) = \sum_x \sum_k p(x, k) \cdot C(e(x), k) \;\to\; \min_e

Almost always: decisions can be made for different x independently (the set of decision strategies is not restricted). Then:

R(e(x)) = \sum_k p(x, k) \cdot C(e(x), k) \;\to\; \min_{e(x)}

Very often: the decision set coincides with the set of classes, i.e. D = K. Then:

k^* = \arg\min_k \sum_{k'} p(x, k') \cdot C(k, k') = \arg\min_k \sum_{k'} p(k'|x) \cdot C(k, k')
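For the case D = K, a small sketch of the per-observation decision (the posterior vector and the asymmetric loss matrix are invented for illustration): with a non-trivial loss, the minimizer of the expected loss need not be the class with the largest posterior.

```python
import numpy as np

def bayes_decision(posterior, C):
    """Pick the decision d minimizing sum_k' p(k'|x) * C(d, k')."""
    expected_loss = C @ posterior        # one expected loss per candidate decision d
    return int(np.argmin(expected_loss))

# Hypothetical posterior over 3 classes and an asymmetric loss matrix (made up)
posterior = np.array([0.5, 0.3, 0.2])
C = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
print(bayes_decision(posterior, C))      # returns 1, not the argmax class 0
```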

(6)

Maximum A-posteriori Decision (MAP)

The Loss is the simplest one:

C(k, k') = \begin{cases} 1 & \text{if } k \neq k' \\ 0 & \text{otherwise} \end{cases} \;=\; \delta(k \neq k')

i.e. we pay 1 if the answer is not the true class, no matter what error we make.

From that follows:

R(k) = \sum_{k'} p(k'|x) \cdot \delta(k \neq k') = \sum_{k'} p(k'|x) - p(k|x) = 1 - p(k|x) \;\to\; \min_k

\quad\Longleftrightarrow\quad p(k|x) \;\to\; \max_k
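A quick numerical check of this derivation (the posterior values are made up): under the 0/1 loss, minimizing the expected loss is the same as maximizing the posterior.

```python
import numpy as np

posterior = np.array([0.5, 0.3, 0.2])   # hypothetical p(k|x), values made up
C = 1.0 - np.eye(3)                     # 0/1 loss: C(k, k') = delta(k != k')
expected_loss = C @ posterior           # equals 1 - p(k|x) for each k
assert np.argmin(expected_loss) == np.argmax(posterior)
```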

(7)

A MAP example

Let K = {1, 2}, x ∈ R², and p(k) be given. The conditional probability distributions for the observations given the classes are Gaussians:

p(x|k) = \frac{1}{2\pi\sigma_k^2} \exp\left( -\frac{\|x - \mu_k\|^2}{2\sigma_k^2} \right)

The loss function is δ(k ≠ k'), i.e. we want MAP.

The decision strategy e: X → K partitions the input space into two regions: the one corresponding to the first class and the one corresponding to the second class.

What does this partition look like?

(8)

A MAP example

For a particular x we decide for class 1 if

p(1) \cdot \frac{1}{2\pi\sigma_1^2} \exp\left( -\frac{\|x - \mu_1\|^2}{2\sigma_1^2} \right) \;>\; p(2) \cdot \frac{1}{2\pi\sigma_2^2} \exp\left( -\frac{\|x - \mu_2\|^2}{2\sigma_2^2} \right)

Special case (for simplicity): σ₁ = σ₂

→ the decision strategy is (derivation on the board) ⟨x, µ₂ − µ₁⟩ > const

→ a linear classifier – the separating hyperplane is orthogonal to µ₂ − µ₁

More classes, equal σ and equal p(k) → Voronoi diagram
More classes, equal σ, different p(k) → Fisher classifier
Two classes, different σ → a general quadratic curve, etc.
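A sketch of the two-Gaussian case with equal variances (all parameters are invented): the MAP comparison of the two weighted densities agrees with the equivalent linear test ⟨x, µ₂ − µ₁⟩ > const, where the constant below follows from taking logarithms of the inequality above.

```python
import numpy as np

# Hypothetical parameters (made up): two isotropic Gaussians in R^2 with equal sigma
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
sigma, p1, p2 = 1.0, 0.6, 0.4

def score(x, mu, prior):
    # log( p(k) * N(x | mu, sigma^2 I) ), dropping the normalization constant,
    # which is the same for both classes here because the sigmas are equal
    return np.log(prior) - np.sum((x - mu) ** 2) / (2 * sigma ** 2)

def map_decide(x):
    # decide for class 1 iff its (unnormalized log-) posterior is at least as large
    return 1 if score(x, mu1, p1) >= score(x, mu2, p2) else 2

# Equivalent linear test (obtained by taking logs): decide for 2 iff <x, mu2 - mu1> > const
w = mu2 - mu1
const = (np.dot(mu2, mu2) - np.dot(mu1, mu1)) / 2 + sigma ** 2 * np.log(p1 / p2)

x = np.array([1.5, 0.5])
print(map_decide(x) == 2, np.dot(x, w) > const)   # the two tests agree
```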

(9)

Decision with rejection

The decision set is D = K ∪ {r}, i.e. it is extended by a special decision r, “I don’t know”. The loss function is

C(d, k) = \begin{cases} \delta(d \neq k) & \text{if } d \in K \\ \varepsilon & \text{if } d = r \end{cases}

i.e. we pay a (reasonable) penalty if we are too lazy to decide.

Case-by-case analysis:

1. We decide for a class d ∈ K; then the best such decision is the MAP one, d = k* = arg max_k p(k|x), and the loss for it is 1 − p(k*|x).
2. We decide to reject, d = r, and pay ε for this.

The decision strategy is:

Compare p(k*|x) with 1 − ε and decide for the greater value.
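A minimal sketch of this rule (the posteriors and ε are made up): decide for the MAP class if its posterior exceeds 1 − ε, otherwise reject.

```python
import numpy as np

def decide_with_rejection(posterior, eps):
    """Decide for the MAP class if its posterior exceeds 1 - eps, otherwise reject."""
    k_star = int(np.argmax(posterior))
    return k_star if posterior[k_star] > 1 - eps else "reject"

# Hypothetical posteriors and eps (made up)
print(decide_with_rejection(np.array([0.9, 0.1]), eps=0.2))    # confident -> class 0
print(decide_with_rejection(np.array([0.55, 0.45]), eps=0.2))  # ambiguous -> "reject"
```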

(10)

Other simple loss-functions

Let the set of classes be structured (in some sense). Example:

We have a probability density p(x, y) with an observation x and a continuous hidden value y. Suppose we know p(y|x) for a given x, for which we would like to infer y.

The Bayesian Risk reads:

R(e(x)) = \int_{-\infty}^{\infty} p(y|x) \cdot C(e(x), y) \, dy

(11)

Other simple loss-functions

The simple δ-loss-function → MAP (not interesting anymore).

The loss may instead account for differences between the decision and the “true” hidden value, for instance C(d, y) = (d − y)², i.e. we pay depending on the distance.

Then (see the board again):

e(x) = \arg\min_d \int_{-\infty}^{\infty} p(y|x) \cdot (d - y)^2 \, dy = \int_{-\infty}^{\infty} y \cdot p(y|x) \, dy = \mathbb{E}_{p(y|x)}[y]

Other choices: C(d, y) = |d − y|, C(d, y) = δ(|d − y| > ε), combinations with “rejection”, etc.
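A small numerical illustration of the squared-loss case (the discretized posterior is an arbitrary made-up example): the posterior mean coincides with the grid point that minimizes the expected squared loss.

```python
import numpy as np

# Hypothetical discretized posterior p(y|x) on a grid (shape and values made up)
ys = np.linspace(-3.0, 3.0, 601)
post = np.exp(-0.5 * (ys - 1.0) ** 2)      # Gaussian-shaped, centered at y = 1
post /= post.sum()

posterior_mean = np.sum(ys * post)         # the decision for the squared loss

# Numerical check: expected squared loss as a function of the decision d
expected_loss = [(post * (d - ys) ** 2).sum() for d in ys]
print(posterior_mean, ys[int(np.argmin(expected_loss))])   # both are close to 1.0
```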

(12)

Additive loss-functions – an example

      Q1  Q2  ...  Qn
P1     1   0  ...   1
P2     0   1  ...   0
...   ...
Pm     0   1  ...   0
P*     ?   ?  ...   ?

Consider a “questionnaire”: m persons answer n questions.

Furthermore, let us assume that the persons are rated – a “reliability” measure is assigned to each one.

The goal is to find the “right” answers for all questions.

Strategy 1:

Choose the best person and take all of his/her answers.

Strategy 2:

– Consider a particular question
– Look at what all the people say concerning it and do a (weighted) vote

(13)

Additive loss-functions – example interpretation

People are the classes k, and the reliability measure is the posterior p(k|x).

Specialty: the classes consist of “parts” (the questions) – the classes are structured. The set of classes is Kᵐ; a class k = (k₁, k₂, ..., kₘ) can be seen as a vector of m components, each one being a simple answer (0 or 1 in the above example).

The “Strategy 1” is MAP

How can the other decision strategy be derived (or understood) from the viewpoint of Bayesian Decision Theory?

(14)

Additive loss-functions

Consider the simple loss C(k, k') = δ(k ≠ k') for the case where the classes are structured – it does not reflect how strongly the class and the decision disagree.

A better (?) choice is an additive loss function:

C(k, k') = \sum_i c_i(k_i, k'_i)

i.e. disagreements of all components are summed up

Substitute it into the formula for the Bayesian Risk, derive, and see what happens ...

(15)

Additive loss-functions – derivation

R(k) = \sum_{k'} p(k'|x) \cdot \sum_i c_i(k_i, k'_i) =                                / swap the summations

= \sum_i \sum_{k'} c_i(k_i, k'_i) \cdot p(k'|x) =                                     / split the summation over k'

= \sum_i \sum_{l \in K} \sum_{k' : k'_i = l} c_i(k_i, l) \cdot p(k'|x) =              / factor out c_i(k_i, l)

= \sum_i \sum_{l \in K} c_i(k_i, l) \cdot \sum_{k' : k'_i = l} p(k'|x) =              / these are marginals

= \sum_i \sum_{l \in K} c_i(k_i, l) \cdot p(k'_i = l \mid x) \;\to\; \min_k           / independent problems

\Rightarrow\quad \sum_{l \in K} c_i(k_i, l) \cdot p(k'_i = l \mid x) \;\to\; \min_{k_i} \quad \forall i

(16)

Additive loss-functions – the strategy

1. Compute the marginal probability distributions over the values,

p(k'_i = l \mid x) = \sum_{k' : k'_i = l} p(k'|x)

for each variable i and each value l.

2. Decide for each variable “independently” according to its marginal p.d. and the local loss c_i:

\sum_{l \in K} c_i(k_i, l) \cdot p(k'_i = l \mid x) \;\to\; \min_{k_i}

This is again a Bayesian Decision Problem – minimize the average loss
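A sketch of this two-step strategy for binary answers (the posterior over answer vectors and the asymmetric local loss are made-up assumptions): marginals are accumulated from the full posterior, then each component is decided independently.

```python
import numpy as np
from itertools import product

# Hypothetical setup (all numbers made up): n = 3 binary questions and an explicit
# posterior p(k'|x) over all 2^n answer vectors k'
n = 3
configs = list(product([0, 1], repeat=n))
post = np.random.default_rng(0).random(len(configs))
post /= post.sum()

# Step 1: accumulate the marginals p(k'_i = l | x)
marginals = np.zeros((n, 2))
for prob, k in zip(post, configs):
    for i, l in enumerate(k):
        marginals[i, l] += prob

# Step 2: decide each variable independently; c(d, l) is a made-up asymmetric local loss
c = np.array([[0.0, 2.0],
              [1.0, 0.0]])
decision = [int(np.argmin(c @ marginals[i])) for i in range(n)]
print(marginals)
print(decision)
```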

(17)

Additive loss-functions – a special case

For each variable we pay 1 if we are wrong:

c_i(k_i, k'_i) = \delta(k_i \neq k'_i)

The overall loss is the number of misclassified variables (wrongly answered questions),

C(k, k') = \sum_i \delta(k_i \neq k'_i)

and is called the Hamming distance.

The decision strategy is the Maximum Marginal Decision:

k_i^* = \arg\max_l \, p(k'_i = l \mid x) \quad \forall i
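A one-line check of this special case (the marginal values are made up): with the δ local loss, the per-variable minimization of the previous slide reduces to taking the argmax of the marginal.

```python
import numpy as np

marginal = np.array([0.35, 0.65])   # hypothetical p(k'_i = l | x) for one variable i
c = 1.0 - np.eye(2)                 # local loss c_i(k_i, l) = delta(k_i != l)
assert np.argmin(c @ marginal) == np.argmax(marginal)
```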
