Machine Learning
Bayesian Decision Theory
Dmitrij Schlesinger
WS2014/2015, 27.10.2014
Recognition
The model:
Let two random variables be given:
– The first one is typically discrete (k ∈ K) and is called
“class”
– The second one is often continuous (x ∈ X) and is called
“observation”
Let the joint probability distribution p(x, k) be “given”.
As k is discrete, it is often specified as p(x, k) = p(k) · p(x|k).
The recognition task: given x, estimate k.
Usual problems (questions):
– How to estimate k from x? (today)
– The joint probability is not always explicitly specified
– The set K is sometimes huge
Idea – a game
Somebody samples a pair (x, k) according to a p.d. p(x, k).
He keeps k hidden and presents x to you.
You decide for some k∗ according to a chosen decision strategy.
Somebody penalizes your decision according to a loss-function, i.e. he compares your decision to the true hidden k.
You know both p(x, k) and the loss-function (how he compares).
Your goal is to design the decision strategy so as to pay as little as possible on average.
Bayesian Risk
Notations:
The decision set D. Note: it need not coincide with K!
Examples: decisions like “I don’t know”, “not this class” ...
A decision strategy is a (deterministic) mapping e : X → D.
A loss-function C : D × K → R, i.e. for a decision d and a
“true” class k the penalty is C(d, k).
The Bayesian Risk of a strategy e is the expected loss:

R(e) = \sum_x \sum_k p(x, k) \cdot C(e(x), k) \to \min_e

It should be minimized with respect to the decision strategy.
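As a small sketch of this quantity for finite X and K: the joint table and the 0/1 loss below are invented toy numbers, not part of the lecture.

```python
import numpy as np

# Hypothetical toy setup: 3 observations, 2 classes.
# p_xk[x, k] is the joint probability p(x, k); the entries are assumptions.
p_xk = np.array([[0.30, 0.05],
                 [0.10, 0.20],
                 [0.05, 0.30]])

# 0/1 loss C(d, k) = delta(d != k)
C = 1.0 - np.eye(2)

def risk(e, p_xk, C):
    """Bayesian risk R(e) = sum_x sum_k p(x, k) * C(e(x), k)."""
    return sum(p_xk[x, k] * C[e[x], k]
               for x in range(p_xk.shape[0])
               for k in range(p_xk.shape[1]))

# Strategy: for each x decide the class with the larger joint probability
e_map = p_xk.argmax(axis=1)          # -> [0, 1, 1]
print(risk(e_map, p_xk, C))          # 0.05 + 0.10 + 0.05 = 0.20
```

Any other strategy on this table (e.g. always deciding class 0) yields a larger risk, illustrating the minimization over e.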
Some variants
General:

R(e) = \sum_x \sum_k p(x, k) \cdot C(e(x), k) \to \min_e

Almost always:
decisions can be made for different x independently (the set of decision strategies is not restricted). Then, for each x separately:

R(e(x)) = R(d) = \sum_k p(x, k) \cdot C(d, k) \to \min_d
Very often: the decision set coincides with the set of classes, i.e. D = K:

k^* = \arg\min_k \sum_{k'} p(x, k') \cdot C(k, k') = \arg\min_k \sum_{k'} p(k'|x) \cdot C(k, k')
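A minimal sketch of this decision rule for a finite K. The asymmetric loss matrix and the posterior vector are invented for illustration; they show that with a non-trivial loss the optimal decision need not be the most probable class.

```python
import numpy as np

def bayes_decision(posterior, C):
    """Return argmin_d sum_k' posterior[k'] * C[d, k'].

    posterior: p(k'|x) as a vector over the classes,
    C[d, k']: loss for deciding d when the true class is k'.
    """
    expected_loss = C @ posterior        # one value per candidate decision
    return int(expected_loss.argmin())

# Assumed asymmetric loss: confusing class 1 for class 0 is three times
# as costly as the converse.
C = np.array([[0.0, 1.0],
              [3.0, 0.0]])
post = np.array([0.4, 0.6])
print(bayes_decision(post, C))   # decides 0, although class 1 is more probable
```

With the symmetric 0/1 loss the same function reduces to the MAP rule discussed next.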
Maximum A-posteriori Decision (MAP)
The loss is the simplest one:

C(k, k') = \delta(k \neq k') = \begin{cases} 1 & \text{if } k \neq k' \\ 0 & \text{otherwise} \end{cases}

i.e. we pay 1 if the answer is not the true class, no matter what error we make.
From that follows:

R(k) = \sum_{k'} p(k'|x) \cdot \delta(k \neq k') = \sum_{k'} p(k'|x) - p(k|x) = 1 - p(k|x) \to \min_k

which is equivalent to

p(k|x) \to \max_k
A MAP example
Let K = {1, 2}, x ∈ R², and p(k) be given. The conditional probability distributions for the observations given the classes are Gaussians:

p(x|k) = \frac{1}{2\pi\sigma_k^2} \exp\left( -\frac{\|x - \mu_k\|^2}{2\sigma_k^2} \right)

The loss-function is \delta(k \neq k'), i.e. we want MAP.
The decision strategy e : X → K partitions the input space into two regions: one corresponding to the first class and one corresponding to the second.
How does this partition look?
A MAP example
For a particular x we decide for 1 if

p(1) \cdot \frac{1}{2\pi\sigma_1^2} \exp\left( -\frac{\|x - \mu_1\|^2}{2\sigma_1^2} \right) > p(2) \cdot \frac{1}{2\pi\sigma_2^2} \exp\left( -\frac{\|x - \mu_2\|^2}{2\sigma_2^2} \right)

Special case (for simplicity): \sigma_1 = \sigma_2
→ the decision strategy is (derivation on the board) \langle x, \mu_2 - \mu_1 \rangle > \text{const}
→ a linear classifier – the hyperplane orthogonal to \mu_2 - \mu_1
– More classes, equal σ and p(k) → Voronoi diagram
– More classes, equal σ, different p(k) → Fisher classifier
– Two classes, different σ → a general quadratic curve, etc.
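The equivalence of the MAP rule and the linear rule in the equal-σ case can be checked numerically. All parameters below (µ1, µ2, σ, the priors) are assumed values for the sketch, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, sigma = np.array([0.0, 0.0]), np.array([2.0, 1.0]), 1.0
p1, p2 = 0.5, 0.5  # priors, assumed equal here

def log_joint(x, mu, prior):
    # log p(k) + log N(x; mu, sigma^2 I), dropping the shared normalizer
    return np.log(prior) - np.sum((x - mu) ** 2) / (2 * sigma ** 2)

def map_decide(x):
    return 1 if log_joint(x, mu1, p1) > log_joint(x, mu2, p2) else 2

def linear_decide(x):
    # decide 2  <=>  <x, mu2 - mu1> > const
    w = mu2 - mu1
    const = (np.dot(mu2, mu2) - np.dot(mu1, mu1)) / 2 + sigma**2 * np.log(p1 / p2)
    return 2 if np.dot(x, w) > const else 1

# The two rules agree everywhere (up to ties exactly on the hyperplane)
xs = rng.normal(size=(1000, 2)) * 3
assert all(map_decide(x) == linear_decide(x) for x in xs)
```

Expanding the squared norms in the log-joint comparison cancels the ‖x‖² terms, which is exactly why only the linear term ⟨x, µ2 − µ1⟩ survives.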
Decision with rejection
The decision set is D = K ∪ {r}, i.e. extended by a special decision “I don’t know”. The loss-function is

C(d, k) = \begin{cases} \delta(d \neq k) & \text{if } d \in K \\ \varepsilon & \text{if } d = r \end{cases}

i.e. we pay a (reasonable) penalty if we are too lazy to decide.
Case-by-case analysis:
1. We decide for a class d ∈ K; then the decision is MAP, d = k^* = \arg\max_k p(k|x), and the expected loss is 1 - p(k^*|x).
2. We decide to reject, d = r, and pay ε for this.
The decision strategy is:
Compare p(k^*|x) with 1 − ε and decide for the greater value.
Note: not only the argument arg max_k is important, but the value of the minimal loss as well (for the comparison).
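The case analysis above fits in a few lines. The posterior vector and the values of ε are made-up examples:

```python
import numpy as np

def decide_with_reject(posterior, eps):
    """MAP with a reject option: reject iff 1 - max_k p(k|x) > eps,
    i.e. iff max_k p(k|x) < 1 - eps."""
    k_star = int(np.argmax(posterior))
    return k_star if posterior[k_star] >= 1 - eps else "reject"

post = np.array([0.55, 0.45])
print(decide_with_reject(post, eps=0.3))   # 0.55 < 0.7  -> "reject"
print(decide_with_reject(post, eps=0.5))   # 0.55 >= 0.5 -> class 0
```

A small ε makes rejection cheap, so the classifier only answers when it is very confident; ε ≥ 1 disables rejection entirely.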
Other simple loss-functions
Let the set of classes be structured (in some sense). Example:
We have a probability density p(x, y) with an observation x and a continuous hidden value y. Suppose we know p(y|x) for a given x, for which we would like to infer y.
The Bayesian Risk reads:

R(e(x)) = \int_{-\infty}^{\infty} p(y|x) \cdot C(e(x), y) \, dy
Other simple loss-functions
The simple δ-loss-function → MAP (not interesting anymore).
The loss may instead account for the difference between the decision and the “true” hidden value, for instance C(d, y) = (d − y)², i.e. we pay depending on the distance.
Then (see board again):

e(x) = \arg\min_d \int_{-\infty}^{\infty} p(y|x) \cdot (d - y)^2 \, dy = \int_{-\infty}^{\infty} y \cdot p(y|x) \, dy = \mathbb{E}_{p(y|x)}[y]

Other choices: C(d, y) = |d − y|, C(d, y) = δ(|d − y| > ε), combinations with “rejection”, etc.
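The difference between the δ-loss (MAP) and the squared loss (posterior mean) shows up clearly for a multimodal posterior. The mixture below is an assumed example, evaluated on a grid:

```python
import numpy as np

# Assumed posterior for the sketch: an unequal mixture of two Gaussians in y.
def posterior(y):
    n1 = np.exp(-(y - 1.0) ** 2 / 2) / np.sqrt(2 * np.pi)
    n2 = np.exp(-(y - 4.0) ** 2 / 2) / np.sqrt(2 * np.pi)
    return 0.6 * n1 + 0.4 * n2

y = np.linspace(-10, 15, 20001)
dy = y[1] - y[0]
p = posterior(y)
p /= p.sum() * dy                      # renormalize on the grid

map_decision = y[np.argmax(p)]         # minimizes the delta loss: the mode
mmse_decision = (y * p).sum() * dy     # minimizes the squared loss: E[y|x]
print(map_decision, mmse_decision)     # mode near 1.0, mean = 0.6*1 + 0.4*4 = 2.2
```

Note that the posterior mean 2.2 lies in a region of low posterior density between the two modes, which is exactly why the choice of loss matters.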
Additive loss-functions – an example
Consider a “questionnaire”: m persons answer n questions.

      Q1  Q2  ...  Qn
P1     1   0  ...   1
P2     0   1  ...   0
...   ..  ..  ...  ..
Pm     0   1  ...   0
“P”    ?   ?  ...   ?

Furthermore, let us assume that the persons are rated – a “reliability” measure is assigned to each one.
The goal is to find the “right” answers for all questions.
Strategy 1:
Choose the best person and take all his/her answers.
Strategy 2:
– Consider a particular question
– Look at what all the people say concerning it and do a (weighted) voting
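The two strategies can be sketched directly on a small answer table. The concrete 0/1 entries and reliability weights below are invented for illustration:

```python
import numpy as np

# Rows = persons, columns = questions (made-up data in the spirit of the table)
answers = np.array([[1, 0, 1],
                    [0, 1, 0],
                    [0, 1, 0]])
reliability = np.array([0.4, 0.35, 0.25])   # e.g. posteriors p(k|x), sum to 1

# Strategy 1: take every answer of the single best-rated person
best = answers[np.argmax(reliability)]

# Strategy 2: weighted vote per question
votes_for_1 = reliability @ answers         # total weight behind answer "1"
voted = (votes_for_1 > 0.5).astype(int)

print(best)    # [1 0 1]
print(voted)   # [0 1 0] -- the two strategies can disagree on every question
```

Here the best single person answers (1, 0, 1), while the weighted majority of the remaining weight overrules him on each question, previewing the MAP-vs-marginal contrast derived below.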
Additive loss-functions – example interpretation
People are classes k, the reliability measure is the posterior p(k|x).
Specialty: classes consist of “parts” (questions) – the classes are structured.
A class k = (k_1, k_2, \ldots, k_m) ∈ K^m can be seen as a vector of m components, each one being a simple answer (0 or 1 in the above example).
“Strategy 1” is MAP.
How can the other decision strategy be derived (understood) from the viewpoint of Bayesian Decision Theory?
Additive loss-functions
Consider the simple loss C(k, k') = δ(k ≠ k') for the case that the classes are structured – it does not reflect how strongly the class and the decision disagree.
A better (?) choice – an additive loss-function:

C(k, k') = \sum_i c_i(k_i, k'_i)

i.e. the disagreements of all components are summed up.
Substitute it into the formula for the Bayesian Risk, derive, and see what happens ...
Additive loss-functions – derivation
R(k) = \sum_{k'} p(k'|x) \cdot \sum_i c_i(k_i, k'_i) =                              / swap the summations
     = \sum_i \sum_{k'} c_i(k_i, k'_i) \cdot p(k'|x) =                              / split the summation
     = \sum_i \sum_{l \in K} \sum_{k' : k'_i = l} c_i(k_i, l) \cdot p(k'|x) =       / factor out
     = \sum_i \sum_{l \in K} c_i(k_i, l) \cdot \sum_{k' : k'_i = l} p(k'|x) =       / these are marginals
     = \sum_i \sum_{l \in K} c_i(k_i, l) \cdot p(k'_i = l \mid x) \to \min_k        / independent problems

\Rightarrow \quad \sum_{l \in K} c_i(k_i, l) \cdot p(k'_i = l \mid x) \to \min_{k_i} \quad \forall i
Additive loss-functions – the strategy
1. Compute the marginal probability distributions

p(k'_i = l \mid x) = \sum_{k' : k'_i = l} p(k'|x)

for each variable i and each value l.
2. Decide for each variable “independently” according to its marginal p.d. and the local loss c_i:

\sum_{l \in K} c_i(k_i, l) \cdot p(k'_i = l \mid x) \to \min_{k_i}
This is again a Bayesian Decision Problem – minimize the average loss
Additive loss-functions – a special case
For each variable we pay 1 if we are wrong:

c_i(k_i, k'_i) = \delta(k_i \neq k'_i)

The overall loss is the number of misclassified variables (wrongly answered questions)

C(k, k') = \sum_i \delta(k_i \neq k'_i)

and is called the Hamming distance.
The decision strategy is the Maximum Marginal Decision:

k_i^* = \arg\max_l p(k'_i = l \mid x) \quad \forall i
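On a tiny structured example, MAP and the Maximum Marginal Decision can differ. The posterior table over k = (k_1, k_2) with binary components is an invented example, chosen so that the two strategies disagree:

```python
import numpy as np

# Assumed posterior over structured classes k = (k1, k2), k_i in {0, 1}
post = {(0, 0): 0.0, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.3}

# MAP: the single most probable configuration
k_map = max(post, key=post.get)                          # (0, 1)

# Max-marginal: argmax of each component's marginal distribution
marg = np.zeros((2, 2))                                  # marg[i, l] = p(k_i = l | x)
for k, p in post.items():
    for i, l in enumerate(k):
        marg[i, l] += p
k_mm = tuple(int(np.argmax(marg[i])) for i in range(2))  # (1, 1)

print(k_map, k_mm)
```

The max-marginal answer (1, 1) has zero joint probability mass competing against it componentwise, yet as a configuration it is less probable than (0, 1): each decision rule is optimal only for its own loss.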
Minimum Marginal Squared Error (MMSE)
Assume the values l for k_i are numbers (vectors). Examples:
– in tracking and pose estimation: the set of all possible positions of the object to be tracked
– in stereo: the set of all disparity/depth values
– in denoising: a grayvalue, etc.
→ a more reasonable (additive) loss should account for the metric difference between the decision and the true position, e.g.

C(k, k') = \sum_i c_i(k_i, k'_i) = \sum_i \|k_i - k'_i\|^2

The task to be solved for each position i is

\sum_{l \in K} \|k_i - l\|^2 \cdot p(k'_i = l \mid x) \to \min_{k_i}
Minimum Marginal Squared Error (MMSE)
\sum_{l \in K} \|k_i - l\|^2 \cdot p(k'_i = l \mid x) \to \min_{k_i}

Setting the derivative with respect to k_i to zero:

\frac{\partial}{\partial k_i} = \sum_{l \in K} 2 \cdot (k_i - l) \cdot p(k'_i = l \mid x) = 0

\sum_{l \in K} k_i \cdot p(k'_i = l \mid x) = \sum_{l \in K} l \cdot p(k'_i = l \mid x)

k_i = \sum_{l \in K} l \cdot p(k'_i = l \mid x)

The optimal decision for the i-th variable is the expectation (average) under the corresponding marginal probability distribution.
Note: the decision is not necessarily an element of K, e.g. it may be real-valued → the sets D and K are different.
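A per-variable sketch of this result, with an assumed marginal over integer labels (think of disparities 0..4 in the stereo example):

```python
import numpy as np

# Assumed marginal p(k_i = l | x) over integer labels l = 0..4
labels = np.arange(5)
marg = np.array([0.05, 0.5, 0.2, 0.15, 0.1])

k_maxmarg = int(labels[np.argmax(marg)])   # Hamming-optimal decision: 1
k_mmse = float(labels @ marg)              # squared-loss-optimal: 1.75
print(k_maxmarg, k_mmse)
```

The MMSE decision 1.75 is not an element of K = {0, ..., 4}, illustrating the note above that D and K differ under this loss.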
Summary
Today:
– The idea – a game
– The approach – minimize the average (expected) loss
– Simple δ-loss – Maximum A-posteriori Decision
– Rejection – an example of “extended” decision sets
– “Metric” classes – more elaborate losses are possible
– Structured classes – even more elaborate losses ...
The message:
The design of the appropriate loss-function is as important as the design of the appropriate probabilistic model.
The next class: probabilistic learning