(1)

Machine Learning II

Probability Theory (recap)

(2)

Probability space

A probability space is a triple (Ω, 𝜎, 𝑃) with:

• Ω − the set of elementary events

• 𝜎 − the 𝜎-algebra of events; in our models 𝜎 = 2^Ω (the power set)

• 𝑃 − a probability measure 𝑃: 𝜎 → [0, 1] with

• the normalization 𝑃(Ω) = 1 and

• 𝜎-additivity:

let 𝐴𝑖 be pairwise disjoint subsets, i.e. 𝐴𝑖 ∩ 𝐴𝑖′ = ∅ for 𝑖 ≠ 𝑖′, then

𝑃(⋃_𝑖 𝐴𝑖) = Σ_𝑖 𝑃(𝐴𝑖)
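For a finite Ω with 𝜎 = 2^Ω these axioms are easy to check numerically. Below is a minimal sketch (not part of the slides; the fair-die space is my own toy choice) that stores 𝑃 on elementary events and verifies normalization and additivity for disjoint events.

```python
# A toy finite probability space: Omega = faces of one fair die.
Omega = {1, 2, 3, 4, 5, 6}
P_elem = {w: 1 / 6 for w in Omega}           # P(omega) for each elementary event

def P(event):
    """Probability of an event A (a subset of Omega), using sigma-additivity."""
    return sum(P_elem[w] for w in event)

assert abs(P(Omega) - 1.0) < 1e-12           # normalization: P(Omega) = 1

A, B = {1, 2}, {5, 6}                        # pairwise disjoint events
assert abs(P(A | B) - (P(A) + P(B))) < 1e-12 # additivity for disjoint sets
```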

(3)

Elementary Events ↔ Random variables

We consider an event 𝐴 as a subset of elementary ones:

𝑃(𝐴) = Σ_{𝜔∈𝐴} 𝑃(𝜔)

A (real-valued) random variable πœ‰ for a probability space (Ξ©, 𝜎, 𝑃) is a mapping πœ‰: Ξ© β†’ ℝ

Note: elementary events are not numbers – they are elements of a general set Ξ©

Random variables, in contrast, take numeric values, i.e. they can be added, subtracted, squared, etc.
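A random variable is nothing more than such a mapping. The sketch below (my own illustration; the face labels anticipate the dice example later in the slides) shows a non-numeric Ω and a numeric 𝜉 defined on it, and evaluates 𝑃(𝐴) for 𝐴 = {𝜔: 𝜉(𝜔) ≤ 2}.

```python
# Elementary events need not be numbers: here they are labelled faces of a die.
Omega = ["a", "b", "c", "d", "e", "f"]
P_elem = {w: 1 / 6 for w in Omega}

# A (real-valued) random variable is a mapping xi: Omega -> R.
xi = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}

# P(A) = sum of P(omega) over omega in A, here for A = {omega: xi(omega) <= 2}
A = [w for w in Omega if xi[w] <= 2]
print(sum(P_elem[w] for w in A))   # 1/3
```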

(4)

Distributions

Cumulative distribution function of a random variable 𝜉: 𝐹_𝜉(𝑟) = 𝑃({𝜔: 𝜉(𝜔) ≤ 𝑟})

Probability distribution of a discrete random variable 𝜉: Ω → ℤ: 𝑝_𝜉(𝑟) = 𝑃({𝜔: 𝜉(𝜔) = 𝑟})

Probability density of a continuous random variable 𝜉: Ω → ℝ: 𝑝_𝜉(𝑟) = ∂𝐹_𝜉(𝑟) / ∂𝑟

(5)

Mean

A mean (expectation, average ...) of a random variable 𝜉 is

𝔼_𝑃(𝜉) = Σ_{𝜔∈Ω} 𝑃(𝜔)⋅𝜉(𝜔) = Σ_𝑟 Σ_{𝜔: 𝜉(𝜔)=𝑟} 𝑃(𝜔)⋅𝑟 = Σ_𝑟 𝑝_𝜉(𝑟)⋅𝑟

Arithmetic mean is a special case:

x̄ = (1/𝑁) Σ_{𝑖=1…𝑁} 𝑥𝑖 = Σ_𝑟 𝑝_𝜉(𝑟)⋅𝑟

with 𝑥𝑖 ≡ 𝑟 and 𝑝_𝜉(𝑟) = 1/𝑁 (uniform probability distribution).
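A quick numerical check (my own sketch) that the sum over elementary events and the sum over values give the same expectation, and that the arithmetic mean is exactly the expectation under a uniform empirical distribution.

```python
import numpy as np

Omega = ["a", "b", "c", "d", "e", "f"]
P_elem = {w: 1 / 6 for w in Omega}
xi = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}

# E_P(xi) as a sum over elementary events ...
E1 = sum(P_elem[w] * xi[w] for w in Omega)
# ... and as a sum over values r, weighted by p_xi(r)
values = set(xi.values())
E2 = sum(r * sum(P_elem[w] for w in Omega if xi[w] == r) for r in values)
print(E1, E2)                               # both 3.5

# Arithmetic mean = expectation under the uniform distribution p(r) = 1/N
x = np.array([2.0, 3.0, 7.0, 4.0])
print(x.mean(), np.sum(x * (1 / len(x))))   # identical
```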

(6)

Example – two independent dice numbers

The set of elementary events (6×6 faces):

Ω = {𝑎, 𝑏, 𝑐, 𝑑, 𝑒, 𝑓} × {𝑎, 𝑏, 𝑐, 𝑑, 𝑒, 𝑓}

Probability measure: 𝑃(𝑎𝑏) = 1/36, 𝑃({𝑐𝑑, 𝑓𝑎}) = 1/18, …

Two random variables:

1) The number of the first die: 𝜉1(𝑎𝑏) = 1, 𝜉1(𝑎𝑐) = 1, 𝜉1(𝑒𝑓) = 5, …
2) The number of the second die: 𝜉2(𝑎𝑏) = 2, 𝜉2(𝑎𝑐) = 3, 𝜉2(𝑒𝑓) = 6, …

Probability distributions:

𝑝_𝜉1(1) = 𝑝_𝜉1(2) = ⋯ = 𝑝_𝜉1(6) = 1/6
𝑝_𝜉2(1) = 𝑝_𝜉2(2) = ⋯ = 𝑝_𝜉2(6) = 1/6

(7)

Example – two independent dice numbers

Consider the new random variable: πœ‰ = πœ‰1 + πœ‰2

The probability distribution 𝑝_𝜉 is not uniform anymore: 𝑝_𝜉 ∝ (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)

The mean value is 𝔼_𝑃(𝜉) = 7. In general, for mean values:

𝔼_𝑃(𝜉1 + 𝜉2) = Σ_{𝜔∈Ω} 𝑃(𝜔)⋅(𝜉1(𝜔) + 𝜉2(𝜔)) = 𝔼_𝑃(𝜉1) + 𝔼_𝑃(𝜉2)

(8)

Random variables of higher dimension

Analogously: Let 𝜉: Ω → ℝ^𝑛 be a mapping (𝑛 = 2 for simplicity), with 𝜉 = (𝜉1, 𝜉2), 𝜉1: Ω → ℝ and 𝜉2: Ω → ℝ

Cumulative distribution function:

𝐹_𝜉(𝑟, 𝑠) = 𝑃({𝜔: 𝜉1(𝜔) ≤ 𝑟} ∩ {𝜔: 𝜉2(𝜔) ≤ 𝑠})

Joint probability distribution (discrete):

𝑝_𝜉=(𝜉1,𝜉2)(𝑟, 𝑠) = 𝑃({𝜔: 𝜉1(𝜔) = 𝑟} ∩ {𝜔: 𝜉2(𝜔) = 𝑠})

Joint probability density (continuous):

𝑝_𝜉=(𝜉1,𝜉2)(𝑟, 𝑠) = ∂²𝐹_𝜉(𝑟, 𝑠) / (∂𝑟 ∂𝑠)

(9)

Independence

Two events 𝐴 ∈ 𝜎 and 𝐵 ∈ 𝜎 are independent if 𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴)⋅𝑃(𝐵)

Interesting: the events 𝐴 and the complement B̄ = Ω ∖ 𝐵 are independent if 𝐴 and 𝐵 are independent.

Two random variables are independent if

𝐹_𝜉=(𝜉1,𝜉2)(𝑟, 𝑠) = 𝐹_𝜉1(𝑟)⋅𝐹_𝜉2(𝑠) ∀ 𝑟, 𝑠

It follows (example for a continuous 𝜉):

𝑝(𝑟, 𝑠) = ∂²𝐹_𝜉(𝑟, 𝑠)/(∂𝑟 ∂𝑠) = (∂𝐹_𝜉1(𝑟)/∂𝑟)⋅(∂𝐹_𝜉2(𝑠)/∂𝑠) = 𝑝(𝑟)⋅𝑝(𝑠)

(10)

Conditional probabilities

Conditional probability:

𝑃(𝐴 | 𝐵) = 𝑃(𝐴 ∩ 𝐵) / 𝑃(𝐵)

Independence (almost equivalent): 𝐴 and 𝐵 are independent if 𝑃(𝐴 | 𝐵) = 𝑃(𝐴) and/or 𝑃(𝐵 | 𝐴) = 𝑃(𝐵)

Bayes’ Theorem (formula, rule):

𝑃(𝐴 | 𝐵) = 𝑃(𝐵 | 𝐴)⋅𝑃(𝐴) / 𝑃(𝐵)

(11)

Further definitions (for random variables)

Shorthand: 𝑝(𝑥, 𝑦) ≡ 𝑝_𝜉(𝑥, 𝑦)

Marginal probability distribution:

𝑝(𝑥) = Σ_𝑦 𝑝(𝑥, 𝑦)

Conditional probability distribution:

𝑝(𝑥 | 𝑦) = 𝑝(𝑥, 𝑦) / 𝑝(𝑦)

Note: Σ_𝑥 𝑝(𝑥 | 𝑦) = 1

Independent probability distribution:

𝑝(𝑥, 𝑦) = 𝑝(𝑥)⋅𝑝(𝑦)

(12)

Example

Let the probability of being taken ill be

𝑝(𝑖𝑙𝑙) = 0.02

Let the conditional probability of having a temperature in that case be 𝑝(𝑡𝑒𝑚𝑝 | 𝑖𝑙𝑙) = 0.9

However, one may have a temperature without any illness, i.e.

𝑝(𝑡𝑒𝑚𝑝 | ¬𝑖𝑙𝑙) = 0.05

What is the probability of being taken ill, given that one has a temperature?

(13)

Example

Bayes’ rule:

𝑝(𝑖𝑙𝑙 | 𝑡𝑒𝑚𝑝) = 𝑝(𝑡𝑒𝑚𝑝 | 𝑖𝑙𝑙)⋅𝑝(𝑖𝑙𝑙) / 𝑝(𝑡𝑒𝑚𝑝)

(marginal probability in the denominator)

= 𝑝(𝑡𝑒𝑚𝑝 | 𝑖𝑙𝑙)⋅𝑝(𝑖𝑙𝑙) / (𝑝(𝑡𝑒𝑚𝑝 | 𝑖𝑙𝑙)⋅𝑝(𝑖𝑙𝑙) + 𝑝(𝑡𝑒𝑚𝑝 | ¬𝑖𝑙𝑙)⋅𝑝(¬𝑖𝑙𝑙))

= 0.9⋅0.02 / (0.9⋅0.02 + 0.05⋅0.98) ≈ 0.27

− not as high as one might expect; the reason is the very low prior probability of being taken ill.
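The computation is easy to reproduce (a minimal sketch with the numbers from the slide):

```python
p_ill = 0.02
p_temp_given_ill = 0.9
p_temp_given_not_ill = 0.05

# Marginal probability of a temperature (law of total probability)
p_temp = p_temp_given_ill * p_ill + p_temp_given_not_ill * (1 - p_ill)

# Bayes' rule
p_ill_given_temp = p_temp_given_ill * p_ill / p_temp
print(round(p_ill_given_temp, 2))   # 0.27
```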

(14)

Further Topics

The model:

Let two random variables be given:

β€’ The first one (π‘˜ ∈ 𝐾) is typically discrete and is called β€œclass”

β€’ The second one (π‘₯ ∈ 𝑋) is often continuous and is called

β€œobservation”

Let the joint probability distribution 𝑝(π‘₯, π‘˜) be β€œgiven”.

Note: in structured models the sets 𝐾 and 𝑋 are quite complex, e.g. labellings and images respectively. Their cardinalities are huge!

(15)

Further Topics

Inference: given π‘₯, estimate π‘˜.

Usual problems (questions):

β€’ How to estimate π‘˜ from π‘₯ ?

β€’ The joint probability is not always explicitly specified.

β€’ The set 𝐾 is huge.

Statistical learning:

Often (almost always) the probability distribution is known only up to free parameters. How can they be learned from examples?

Discriminative learning:

Directly train the prediction function that does inference.
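As a toy illustration of the inference task (my own sketch, not part of the slides): with a simple generative model 𝑝(𝑥, 𝑘) = 𝑝(𝑘)⋅𝑝(𝑥 | 𝑘) over two classes 𝑘 with Gaussian observations 𝑥, estimating 𝑘 from 𝑥 amounts to maximizing the posterior 𝑝(𝑘 | 𝑥); the priors and means below are made-up parameters.

```python
import math

# Toy model: two classes with priors p(k) and Gaussian densities p(x | k)
priors = {0: 0.7, 1: 0.3}
means, sigma = {0: -1.0, 1: 2.0}, 1.0

def gauss(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def infer(x):
    """MAP inference: k* = argmax_k p(k | x) = argmax_k p(k) * p(x | k)."""
    joint = {k: priors[k] * gauss(x, means[k], sigma) for k in priors}
    return max(joint, key=joint.get)

print(infer(-0.5), infer(1.8))   # 0 1
```

Statistical learning would fit the free parameters (here the priors and means) from examples; discriminative learning would instead train the decision rule itself.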
