Machine Learning II
Probability Theory (recap)
Probability space
is a three-tuple $(\Omega, \mathcal{A}, P)$ with:
• $\Omega$ – the set of elementary events
• $\mathcal{A}$ – an algebra; in our models $\mathcal{A} = 2^\Omega$ (the powerset)
• $P$ – a probability measure $P: \mathcal{A} \to [0, 1]$ with
  • the normalization $P(\Omega) = 1$ and
  • $\sigma$-additivity: let $A_i$ be pairwise disjoint subsets, i.e. $A_i \cap A_{i'} = \emptyset$; then
    $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$
Elementary Events → Random variables
We consider an event $A$ as a subset of elementary events:
$P(A) = \sum_{\omega \in A} P(\omega)$
A (real-valued) random variable $X$ for a probability space $(\Omega, \mathcal{A}, P)$ is a mapping $X: \Omega \to \mathbb{R}$.
Note: elementary events are not numbers – they are elements of a general set $\Omega$.
Random variables, in contrast, take numeric values, i.e. they can be summed, subtracted, squared, etc.
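A minimal sketch of these definitions in Python (the sample space and probabilities are invented for illustration):

```python
# Finite probability space: Omega as a set, P given on the elementary
# events, the algebra implicitly being the powerset 2^Omega.
P_elem = {'a': 0.5, 'b': 0.3, 'c': 0.2}       # P(omega) per elementary event
Omega = set(P_elem)

def P(A):
    """Probability of an event A (a subset of Omega): sum over omega in A."""
    return sum(P_elem[w] for w in A)

assert abs(P(Omega) - 1.0) < 1e-12            # normalization: P(Omega) = 1

A, B = {'a'}, {'b', 'c'}                      # disjoint events
assert abs(P(A | B) - (P(A) + P(B))) < 1e-12  # additivity on disjoint sets

# A random variable is just a mapping X: Omega -> R (here a dict)
X = {'a': 1.0, 'b': 2.0, 'c': 2.0}
```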
Distributions
Cumulative distribution function of a random variable $X$: $F_X(c) = P(\{\omega : X(\omega) \le c\})$
Probability distribution of a discrete random variable $X: \Omega \to \mathbb{Z}$: $p_X(k) = P(\{\omega : X(\omega) = k\})$
Probability density of a continuous random variable $X: \Omega \to \mathbb{R}$: $p_X(c) = \frac{dF_X(c)}{dc}$
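As a quick added illustration: for a fair die, $p_X(k) = \frac{1}{6}$ for $k = 1, \ldots, 6$ and $F_X(c) = \frac{\lfloor c \rfloor}{6}$ for $0 \le c \le 6$; for $X$ uniform on $[0, 1]$, $F_X(c) = c$ and hence $p_X(c) = \frac{dF_X(c)}{dc} = 1$.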
Mean
The mean (expectation, average, ...) of a random variable $X$ is
$\mathbb{E}_P(X) = \sum_{\omega \in \Omega} X(\omega) \cdot P(\omega) = \sum_k k \cdot P(\{\omega : X(\omega) = k\}) = \sum_k p_X(k) \cdot k$
The arithmetic mean is a special case:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \sum_k p_X(k) \cdot k$
with $X(i) \equiv x_i$ on $\Omega = \{1, \ldots, n\}$ and $P(i) = \frac{1}{n}$ (the uniform probability distribution).
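A small numeric check (the data values are invented): the expectation summed over elementary events, the expectation via $p_X$, and the arithmetic mean coincide for the uniform $P(i) = \frac{1}{n}$:

```python
from collections import Counter

x = [1, 2, 2, 3, 3, 3]                  # X(i) = x_i on Omega = {1, ..., n}
n = len(x)

E_elem = sum(xi / n for xi in x)        # sum_omega X(omega) * P(omega)
p_X = {k: c / n for k, c in Counter(x).items()}
E_dist = sum(k * pk for k, pk in p_X.items())   # sum_k p_X(k) * k
mean = sum(x) / n                       # the arithmetic mean

assert abs(E_elem - E_dist) < 1e-12 and abs(E_dist - mean) < 1e-12
```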
Example – two independent dice
The set of elementary events (6×6 faces):
$\Omega = \{1, 2, 3, 4, 5, 6\} \times \{1, 2, 3, 4, 5, 6\}$
Probability measure: $P(\{ij\}) = \frac{1}{36}$, $P(\{ij, i'j'\}) = \frac{1}{18}$, …
Two random variables:
1) The number of the first die: $X_1(12) = 1$, $X_1(13) = 1$, $X_1(56) = 5$, …
2) The number of the second die: $X_2(12) = 2$, $X_2(13) = 3$, $X_2(56) = 6$, …
Probability distributions:
$p_{X_1}(1) = p_{X_1}(2) = \cdots = p_{X_1}(6) = \frac{1}{6}$ and $p_{X_2}(1) = p_{X_2}(2) = \cdots = p_{X_2}(6) = \frac{1}{6}$
Example – two independent dice (continued)
Consider the new random variable $X = X_1 + X_2$.
The probability distribution $p_X$ is not uniform anymore: $p_X \propto (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)$ for the values $2, \ldots, 12$.
The mean value is $\mathbb{E}_P(X) = 7$. In general for mean values:
$\mathbb{E}_P(X_1 + X_2) = \sum_{\omega \in \Omega} P(\omega) \cdot (X_1(\omega) + X_2(\omega)) = \mathbb{E}_P(X_1) + \mathbb{E}_P(X_2)$
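The dice example can be verified directly; a short sketch of the claims above:

```python
from itertools import product
from collections import Counter

Omega = list(product(range(1, 7), repeat=2))   # 36 elementary events ij
P = 1 / len(Omega)                             # P(ij) = 1/36

p_X = Counter(i + j for i, j in Omega)         # counts for each sum 2..12
assert [p_X[s] for s in range(2, 13)] == [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]

E_X = sum((i + j) * P for i, j in Omega)
E_X1 = sum(i * P for i, j in Omega)
E_X2 = sum(j * P for i, j in Omega)
assert abs(E_X - 7.0) < 1e-12                  # E(X) = 7
assert abs(E_X - (E_X1 + E_X2)) < 1e-12        # linearity of the mean
```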
Random variables of higher dimension
Analogously: let $X: \Omega \to \mathbb{R}^n$ be a mapping ($n = 2$ for simplicity), with $X = (X_1, X_2)$, $X_1: \Omega \to \mathbb{R}$ and $X_2: \Omega \to \mathbb{R}$.
Cumulative distribution function:
$F_X(c, d) = P(\{\omega : X_1(\omega) \le c\} \cap \{\omega : X_2(\omega) \le d\})$
Joint probability distribution (discrete):
$p_{X=(X_1,X_2)}(k, l) = P(\{\omega : X_1(\omega) = k\} \cap \{\omega : X_2(\omega) = l\})$
Joint probability density (continuous):
$p_{X=(X_1,X_2)}(c, d) = \frac{\partial^2 F_X(c, d)}{\partial c \, \partial d}$
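A quick added sanity check of the mixed-derivative definition: for two independent variables, each uniform on $[0, 1]$, the joint CDF is $F_X(c, d) = c \cdot d$ for $0 \le c, d \le 1$, and indeed $p_X(c, d) = \frac{\partial^2 (c \cdot d)}{\partial c \, \partial d} = 1$.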
Independence
Two events $A \in \mathcal{A}$ and $B \in \mathcal{A}$ are independent if $P(A \cap B) = P(A) \cdot P(B)$.
Interesting: the events $A$ and $\bar{B} = \Omega \setminus B$ are independent if $A$ and $B$ are independent (see the derivation below).
Two random variables are independent if
$F_{X=(X_1,X_2)}(c, d) = F_{X_1}(c) \cdot F_{X_2}(d) \quad \forall c, d$
It follows (example for continuous $X$):
$p(c, d) = \frac{\partial^2 F_X(c, d)}{\partial c \, \partial d} = \frac{dF_{X_1}(c)}{dc} \cdot \frac{dF_{X_2}(d)}{dd} = p(c) \cdot p(d)$
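The added one-line derivation of the fact above: $P(A \cap \bar{B}) = P(A) - P(A \cap B) = P(A) - P(A) \cdot P(B) = P(A) \cdot (1 - P(B)) = P(A) \cdot P(\bar{B})$.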
Conditional probabilities
Conditional probability:
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
Independence (almost equivalent, since conditioning requires $P(B) > 0$): $A$ and $B$ are independent if $P(A \mid B) = P(A)$ and/or $P(B \mid A) = P(B)$.
Bayes' Theorem (formula, rule):
$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$
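The rule follows directly from the definition of conditional probability (added step): $P(A \mid B) \cdot P(B) = P(A \cap B) = P(B \mid A) \cdot P(A)$; dividing by $P(B)$ gives the formula.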
Further definitions (for random variables)
Shorthand: $p(x, y) \equiv p_X(x, y)$
Marginal probability distribution:
$p(x) = \sum_y p(x, y)$
Conditional probability distribution:
$p(x \mid y) = \frac{p(x, y)}{p(y)}$. Note: $\sum_x p(x \mid y) = 1$.
Independent probability distribution:
$p(x, y) = p(x) \cdot p(y)$
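A sketch of marginalization and conditioning on an invented discrete joint table:

```python
# Joint distribution p(x, y) for x, y in {0, 1} (values invented).
p_xy = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def p_x(x):              # marginal: p(x) = sum_y p(x, y)
    return sum(p for (xx, _), p in p_xy.items() if xx == x)

def p_y(y):              # marginal: p(y) = sum_x p(x, y)
    return sum(p for (_, yy), p in p_xy.items() if yy == y)

def p_x_given_y(x, y):   # conditional: p(x|y) = p(x, y) / p(y)
    return p_xy[(x, y)] / p_y(y)

assert abs(p_x(0) + p_x(1) - 1.0) < 1e-12
# the conditional distribution is normalized: sum_x p(x|y) = 1
assert abs(sum(p_x_given_y(x, 0) for x in (0, 1)) - 1.0) < 1e-12
```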
Example
Let the probability of being ill be
$P(ill) = 0.02$
Let the conditional probability of having a temperature in that case be
$P(temp \mid ill) = 0.9$
However, one may have a temperature without any illness, i.e.
$P(temp \mid \overline{ill}) = 0.05$
What is the probability of being ill given that one has a temperature?
Example (continued)
Bayes' rule:
$P(ill \mid temp) = \frac{P(temp \mid ill) \cdot P(ill)}{P(temp)}$
(marginal probability in the denominator)
$= \frac{P(temp \mid ill) \cdot P(ill)}{P(temp \mid ill) \cdot P(ill) + P(temp \mid \overline{ill}) \cdot P(\overline{ill})} = \frac{0.9 \cdot 0.02}{0.9 \cdot 0.02 + 0.05 \cdot 0.98} \approx 0.27$
→ not as high as one might expect; the reason is the very low prior probability of being ill.
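The same computation in code (same numbers as on the slide):

```python
p_ill = 0.02
p_temp_given_ill = 0.90
p_temp_given_healthy = 0.05

# marginal probability of having a temperature
p_temp = p_temp_given_ill * p_ill + p_temp_given_healthy * (1 - p_ill)
# Bayes' rule
p_ill_given_temp = p_temp_given_ill * p_ill / p_temp
print(round(p_ill_given_temp, 2))   # -> 0.27
```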
Further Topics
The model:
Let two random variables be given:
• The first one ($k \in K$) is typically discrete and is called the "class"
• The second one ($x \in X$) is often continuous and is called the "observation"
Let the joint probability distribution $p(x, k)$ be "given".
Note: in structured models the sets $K$ and $X$ are quite complex,
e.g. labellings and images respectively. Their cardinalities are huge!
Further Topics
Inference: given $x$, estimate $k$ (a toy sketch follows at the end of this section).
Usual problems (questions):
• How to estimate $k$ from $x$?
• The joint probability is not always explicitly specified.
• The set $K$ is huge.
Statistical learning:
Often (almost always) the probability distribution is known only up to free parameters. How can they be learned from examples?
Discriminative learning:
Directly train the prediction function that performs the inference.
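A minimal sketch of the inference task under strong simplifying assumptions (the joint table, its values, and the class/observation names are all invented; real structured models have far too many classes to enumerate like this):

```python
# Toy joint distribution p(x, k) over observations x and classes k.
# MAP inference: k* = argmax_k p(k|x) = argmax_k p(x, k), since the
# denominator p(x) does not depend on k.
p_joint = {
    ('bright', 'day'): 0.35, ('bright', 'night'): 0.05,
    ('dark', 'day'): 0.10,   ('dark', 'night'): 0.50,
}
classes = ('day', 'night')

def infer(x):
    """Return the most probable class for observation x."""
    return max(classes, key=lambda k: p_joint[(x, k)])

print(infer('bright'))   # -> 'day'
print(infer('dark'))     # -> 'night'
```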