
(1)

Machine Learning

Bayesian Decision Theory

Dmitrij Schlesinger

WS2014/2015, 27.10.2014

(2)

Recognition

The model:

Let two random variables be given:

– The first is typically discrete (k ∈ K) and is called the "class"

– The second is often continuous (x ∈ X) and is called the "observation"

Let the joint probability distribution p(x, k) be "given".

As k is discrete, the distribution is often specified as p(x, k) = p(k) · p(x|k).

The recognition task: given x, estimate k.

Usual problems (questions):

– How to estimate k from x? (today)

– The joint probability is not always explicitly specified

– The set K is sometimes huge

(3)

Idea – a game

Somebody samples a pair (x, k) according to a p.d. p(x, k). He keeps k hidden and presents x to you.

You decide for some k according to a chosen decision strategy.

Somebody penalizes your decision according to a loss function, i.e. he compares your decision to the true hidden k.

You know both p(x, k) and the loss function (how he compares).

Your goal is to design the decision strategy so as to pay as little as possible on average.

(4)

Bayesian Risk

Notations:

The decision set D. Note: it need not coincide with K!

Examples: decisions like "I don't know", "not this class", ...

A decision strategy is a (deterministic) mapping e: X → D.

The loss function is C: D × K → R, i.e. for a decision d and a "true" class k the penalty is C(d, k).

The Bayesian risk of a strategy e is the expected loss:

R(e) = Σ_x Σ_k p(x, k) · C(e(x), k) → min_e

It should be minimized with respect to the decision strategy.

(5)

Some variants

General:

R(e) = Σ_x Σ_k p(x, k) · C(e(x), k) → min_e

Almost always: decisions can be made for different x independently (the set of decision strategies is not restricted). Then, for each x:

R(e(x)) = R(d) = Σ_k p(x, k) · C(d, k) → min_d

Very often the decision set coincides with the set of classes, i.e. D = K:

k* = argmin_k Σ_{k'} p(x, k') · C(k, k') = argmin_k Σ_{k'} p(k'|x) · C(k, k')
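The per-observation rule above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides; the function name is made up, and the posterior vector and loss matrix are assumed to be given as arrays:

```python
import numpy as np

def bayes_decision(posterior, loss):
    """Pick the decision d minimizing the expected loss
    R(d) = sum_k' p(k'|x) * C(d, k').
    posterior: shape (K,) holding p(k'|x); loss: shape (D, K) holding C(d, k')."""
    risks = loss @ posterior        # expected loss of every decision d
    return int(np.argmin(risks))

# With D = K = {0, 1} and the 0/1 loss this reduces to MAP:
zero_one = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
print(bayes_decision(np.array([0.3, 0.7]), zero_one))   # 1
```

Any loss matrix C(d, k) can be plugged in, including rectangular ones where D ≠ K.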

(6)

Maximum A-posteriori Decision (MAP)

The loss is the simplest one:

C(k, k') = δ(k ≠ k') = { 1 if k ≠ k', 0 otherwise }

i.e. we pay 1 if the answer is not the true class, no matter what error we make.

From that follows:

R(k) = Σ_{k'} p(k'|x) · δ(k ≠ k') = Σ_{k'} p(k'|x) − p(k|x) = 1 − p(k|x) → min_k

⇔ p(k|x) → max_k

(7)

A MAP example

Let K = {1, 2}, x ∈ R², and p(k) be given. The conditional probability distributions of the observations given the classes are Gaussians:

p(x|k) = 1/(2πσ_k²) · exp(−‖x − µ_k‖² / (2σ_k²))

The loss function is δ(k ≠ k'), i.e. we want MAP.

The decision strategy e: X → K partitions the input space into two regions: one corresponding to the first class and one to the second.

What does this partition look like?

(8)

A MAP example

For a particular x we decide for class 1 if

p(1) · 1/(2πσ₁²) · exp(−‖x − µ₁‖² / (2σ₁²)) > p(2) · 1/(2πσ₂²) · exp(−‖x − µ₂‖² / (2σ₂²))

Special case (for simplicity): σ₁ = σ₂

→ the decision strategy is (derivation on the board) ⟨x, µ₂ − µ₁⟩ > const

→ a linear classifier – the separating hyperplane is orthogonal to µ₂ − µ₁

More classes, equal σ and equal p(k) → Voronoi diagram
More classes, equal σ, different p(k) → Fisher classifier
Two classes, different σ → a general quadratic curve, etc.
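The Gaussian MAP rule can be sketched as follows. This is an illustrative implementation (names are made up, class indices are 0-based rather than 1-based as on the slide), working in the log domain for numerical stability:

```python
import numpy as np

def map_gaussian(x, mu, sigma, prior):
    """MAP decision for isotropic Gaussian classes p(x|k) = N(mu_k, sigma_k^2 I).
    x: (d,), mu: (K, d), sigma: (K,), prior: (K,)"""
    d = x.shape[0]
    log_post = (np.log(prior)                     # log p(k)
                - d * np.log(sigma)               # log-normalizer, up to a constant
                - np.sum((x - mu) ** 2, axis=1) / (2.0 * sigma ** 2))
    return int(np.argmax(log_post))

mu = np.array([[0.0, 0.0], [4.0, 0.0]])
sigma = np.array([1.0, 1.0])                      # equal sigmas -> linear boundary
prior = np.array([0.5, 0.5])
print(map_gaussian(np.array([1.0, 0.0]), mu, sigma, prior))   # 0 (closer to mu_0)
print(map_gaussian(np.array([3.0, 0.0]), mu, sigma, prior))   # 1
```

With equal sigmas and priors the boundary is the perpendicular bisector of µ₀ and µ₁, matching the linear-classifier result on the slide.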

(9)

Decision with rejection

The decision set is D = K ∪ {r}, i.e. extended by a special decision "I don't know". The loss function is

C(d, k) = { δ(d ≠ k) if d ∈ K; ε if d = r }

i.e. we pay a (moderate) penalty if we are too lazy to decide.

Case-by-case analysis:

1. We decide for a class d ∈ K; then the decision is MAP, d = k* = argmax_k p(k|x), and the loss is 1 − p(k*|x).
2. We decide to reject, d = r, and pay ε.

The decision strategy is: compare p(k*|x) with 1 − ε and decide for the greater value.

Note: not only the argument argmax_k is important but also the value of the minimal loss (for the comparison).
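The case analysis above boils down to one comparison; a minimal sketch (function name made up, assuming a posterior vector as input):

```python
import numpy as np

def decide_with_reject(posterior, eps):
    """Return the MAP class when its loss 1 - p(k|x) beats the
    rejection loss eps, i.e. when p(k|x) > 1 - eps; otherwise reject."""
    k = int(np.argmax(posterior))
    return k if posterior[k] > 1.0 - eps else "reject"

print(decide_with_reject(np.array([0.55, 0.45]), eps=0.2))   # 'reject' (0.55 <= 0.8)
print(decide_with_reject(np.array([0.90, 0.10]), eps=0.2))   # 0       (0.90 >  0.8)
```

Note that for ε ≥ 1 − 1/|K| the rejection branch is never taken, since max_k p(k|x) ≥ 1/|K| always holds.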

(10)

Other simple loss-functions

Let the set of classes be structured (in some sense). Example:

We have a probability density p(x, y) with an observation x and a continuous hidden value y. Suppose we know p(y|x) for a given x, for which we would like to infer y.

The Bayesian risk reads:

R(e(x)) = ∫_{−∞}^{+∞} p(y|x) · C(e(x), y) dy

(11)

Other simple loss-functions

The simple δ-loss-function → MAP (not interesting anymore). The loss may instead account for the difference between the decision and the "true" hidden value, for instance C(d, y) = (d − y)², i.e. we pay depending on the distance.

Then (see board again):

e(x) = argmin_d ∫_{−∞}^{+∞} p(y|x) · (d − y)² dy = ∫_{−∞}^{+∞} y · p(y|x) dy = E_{p(y|x)}[y]

Other choices: C(d, y) = |d − y|, C(d, y) = δ(|d − y| > ε), combinations with "rejection", etc.
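The claim that the squared loss is minimized by the posterior mean can be checked numerically on a discretized posterior. This is a sanity check, not from the slides; the Gaussian posterior around 1.3 is an arbitrary assumption:

```python
import numpy as np

# Discretized check: for C(d, y) = (d - y)^2 the best decision is E[y|x].
y = np.linspace(-5.0, 5.0, 1001)                 # grid of hidden values
p = np.exp(-0.5 * (y - 1.3) ** 2)                # assumed posterior p(y|x)
p /= p.sum()                                     # normalize on the grid

risks = np.array([np.sum(p * (d - y) ** 2) for d in y])  # expected loss per candidate d
d_star = y[np.argmin(risks)]                     # risk minimizer on the grid
mean = np.sum(p * y)                             # posterior mean
print(d_star, mean)                              # both close to 1.3
```

Replacing the loss with |d − y| in the same check yields the posterior median instead of the mean.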

(12)

Additive loss-functions – an example

Consider a "questionnaire": m persons answer n questions.

      Q1  Q2  ...  Qn
P1     1   0  ...   1
P2     0   1  ...   0
...
Pm     0   1  ...   0
P''    ?   ?  ...   ?

Furthermore, let us assume that the persons are rated – a "reliability" measure is assigned to each one. The goal is to find the "right" answers to all questions.

Strategy 1:

Choose the best person and take all of his/her answers.

Strategy 2:

– Consider a particular question

– Look at what all the people say about it and do a (weighted) vote

(13)

Additive loss-functions – example interpretation

People are classes k; the reliability measure is the posterior p(k|x).

Specialty: the classes consist of "parts" (the questions) – the classes are structured. The set of classes is k = (k₁, k₂, ..., kₘ) ∈ Kᵐ; a class can be seen as a vector of m components, each one a simple answer (0 or 1 in the above example).

"Strategy 1" is MAP.

How can the other decision strategy be derived (interpreted, understood) from the viewpoint of Bayesian decision theory?

(14)

Additive loss-functions

Consider the simple loss C(k, k') = δ(k ≠ k') for the case that the classes are structured – it does not reflect how strongly the class and the decision disagree.

A better (?) choice – the additive loss function:

C(k, k') = Σ_i c_i(k_i, k'_i)

i.e. the disagreements of all components are summed up.

Substitute it into the formula for the Bayesian risk, derive, and see what happens ...

(15)

Additive loss-functions – derivation

R(k) = Σ_{k'} p(k'|x) · Σ_i c_i(k_i, k'_i) =                        / swap summations

     = Σ_i Σ_{k'} c_i(k_i, k'_i) · p(k'|x) =                        / split the summation

     = Σ_i Σ_{l∈K} Σ_{k': k'_i = l} c_i(k_i, l) · p(k'|x) =        / factor out

     = Σ_i Σ_{l∈K} c_i(k_i, l) · Σ_{k': k'_i = l} p(k'|x) =        / these are marginals

     = Σ_i Σ_{l∈K} c_i(k_i, l) · p(k'_i = l|x) → min_k             / independent problems

⇒ Σ_{l∈K} c_i(k_i, l) · p(k'_i = l|x) → min_{k_i}   ∀i

(16)

Additive loss-functions – the strategy

1. Compute the marginal probability distributions of the values

p(k'_i = l|x) = Σ_{k': k'_i = l} p(k'|x)

for each variable i and each value l.

2. Decide for each variable "independently" according to its marginal p.d. and the local loss c_i:

Σ_{l∈K} c_i(k_i, l) · p(k'_i = l|x) → min_{k_i}

This is again a Bayesian decision problem – minimize the average loss.

(17)

Additive loss-functions – a special case

For each variable we pay 1 if we are wrong:

c_i(k_i, k'_i) = δ(k_i ≠ k'_i)

The overall loss is the number of misclassified variables (wrongly answered questions)

C(k, k') = Σ_i δ(k_i ≠ k'_i)

and is called the Hamming distance.

The decision strategy is the maximum marginal decision:

k_i = argmax_l p(k'_i = l|x)   ∀i
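The two-step strategy (marginalize, then decide per component) can be sketched as follows. This is an illustrative implementation with made-up names; the joint posterior is assumed to be given explicitly as a dictionary, which is only feasible for tiny K^m:

```python
import numpy as np

def max_marginal_decision(post, m, K):
    """post: dict mapping each tuple k' in K^m to its posterior p(k'|x).
    Computes the marginals p(k_i' = l | x) and decides each
    component i by its own argmax."""
    marg = np.zeros((m, K))
    for kp, p in post.items():
        for i, l in enumerate(kp):
            marg[i, l] += p
    return [int(np.argmax(marg[i])) for i in range(m)]

# Toy posterior over {0,1}^2: the MAP answer is (0, 0),
# but the per-component (Hamming-optimal) decision is (1, 0).
post = {(0, 0): 0.4, (1, 0): 0.3, (0, 1): 0.0, (1, 1): 0.3}
print(max_marginal_decision(post, m=2, K=2))     # [1, 0]
```

The toy example shows that MAP and the maximum marginal decision can genuinely disagree: each is optimal only for its own loss.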

(18)

Minimum Marginal Squared Error (MMSE)

Assume the values l for k_i are numbers (or vectors). Examples:

– in tracking and pose estimation, the set of all possible positions of the object to be tracked

– in stereo, the set of all disparity/depth values

– in denoising, a gray value, etc.

→ a more reasonable (additive) loss should account for the metric difference between the decision and the true value, e.g.

C(k, k') = Σ_i c_i(k_i, k'_i) = Σ_i ‖k_i − k'_i‖²

The task to be solved for each position i is

Σ_{l∈K} ‖k_i − l‖² · p(k'_i = l|x) → min_{k_i}

(19)

Minimum Marginal Squared Error (MMSE)

Σ_{l∈K} ‖k_i − l‖² · p(k'_i = l|x) → min_{k_i}

Setting the derivative with respect to k_i to zero:

Σ_{l∈K} 2 · (k_i − l) · p(k'_i = l|x) = 0

Σ_{l∈K} k_i · p(k'_i = l|x) = Σ_{l∈K} l · p(k'_i = l|x)

k_i = Σ_{l∈K} l · p(k'_i = l|x)

The optimal decision for the i-th variable is the expectation (average) under the corresponding marginal probability distribution.

Note: the decision is not necessarily an element of K, e.g. it may be real-valued → the sets D and K are different.
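The resulting rule is just a dot product between the values and their marginal probabilities; a minimal sketch (names made up, disparity levels as an assumed example):

```python
import numpy as np

def mmse_decision(marginal, values):
    """MMSE decision for one variable i: the expectation of its marginal
    posterior. marginal[j] = p(k_i' = values[j] | x)."""
    return float(np.dot(marginal, values))

values = np.array([0.0, 1.0, 2.0, 3.0])          # e.g. candidate disparity levels
marginal = np.array([0.1, 0.2, 0.4, 0.3])
print(mmse_decision(marginal, values))           # 1.9 -- not an element of K
```

The output 1.9 illustrates the note above: the decision is real-valued even though K is discrete, so D and K differ.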

(20)

Summary

Today:

– The idea – a game

– The approach – minimize the average (expected) loss

– The simple δ-loss – maximum a-posteriori decision

– Rejection – an example of "extended" decision sets

– "Metric" classes – more elaborate losses are possible

– Structured classes – even more elaborate losses ...

The message:

The design of the appropriate loss function is as important as the design of the appropriate probabilistic model.

The next class: probabilistic learning
