Pattern Recognition: Probability Theory
Dennis Madsen
Department of Mathematics and Computer Science University of Basel
Fall Semester 2018
Variability of a pattern - Dog
Variability of a pattern - Digit 4
Figure: Bishop 2009
Variability of measurement (noise)
Figure: Bishop 2009
Uncertainty in the model
Figure: Bishop 2009
Motivation
Why do we need probability theory?
Probability and statistics let us model:
• the variability of the pattern itself
• the variability of the measurement (noise)
• the uncertainty in our model
⇒ A short repetition of probability theory in the context of pattern recognition
First part: theory, as a quick reference for you
Second part: the multivariate Gaussian as an example
Discrete Random Variables
Random variable X with possible realisations x ∈ {1, 2, 3, ...}:
Cumulative Distribution Function (cdf)
P[X < x] = F(x)
Probability Mass Function (pmf)
P[X = x] = P_x
Normalisation and Positivity
∑_x P_x = 1,  P_x ≥ 0
Discrete Random Variables — Examples
Bernoulli – a coin flip (the binomial distribution with a single trial)
x ∈ {0, 1}
P_0 = P[X = 0] = p,  P_1 = P[X = 1] = q
p ∈ [0, 1],  q = 1 − p
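As a quick sanity check, this pmf can be simulated. A minimal sketch using numpy (not from the slides; the value p = 0.7 is arbitrary):

```python
import numpy as np

# Coin flip with the slide's convention: P[X = 0] = p, P[X = 1] = q = 1 - p.
p = 0.7
q = 1.0 - p
rng = np.random.default_rng(0)
flips = rng.binomial(n=1, p=q, size=100_000)  # n=1 makes each draw a Bernoulli trial

# Empirical frequencies approach the pmf values, which sum to 1.
print("P[X = 0] ~", np.mean(flips == 0))  # close to p = 0.7
print("P[X = 1] ~", np.mean(flips == 1))  # close to q = 0.3
```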
Continuous Random Variables
Random variable X with possible realisations x ∈ ℝ:
Cumulative Distribution Function (cdf)
P[X < x] = F(x)
Probability Density Function (pdf)
p(x):  P[x < X < x + dx] = p(x) dx = dF(x)
Normalisation and Positivity
∫_{−∞}^{∞} p(x) dx = 1,  p(x) ≥ 0
Continuous Random Variables — Examples
Gaussian
X ∼ N(μ, σ²),  x ∈ ℝ
p(x) = 1 / √(2πσ²) · exp(−(x − μ)² / (2σ²))
Mean μ, variance σ²
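The density is easy to evaluate directly. A minimal sketch (not from the slides), checked against scipy, with arbitrary example values for μ and σ²:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density p(x), written out exactly as on the slide."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 1.0, 4.0           # arbitrary example values
x = np.linspace(-5.0, 7.0, 5)

# Note: scipy's norm is parameterised by the standard deviation, not the variance.
assert np.allclose(gaussian_pdf(x, mu, sigma2),
                   norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))
print(gaussian_pdf(x, mu, sigma2))
```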
Example: Gaussian
Figure: Gaussian densities φ_{μ,σ²}(x) for (μ = 0, σ² = 0.2), (μ = 0, σ² = 1.0), (μ = 0, σ² = 5.0) and (μ = −2, σ² = 0.5). Source: wikipedia.org
Mean
The mean is a measure of central tendency.
Expected Value, Mean, Expectation
E[X] = ∑_x x P_x    (discrete)
E[X] = ∫ x p(x) dx    (continuous)
Variance
The variance is a measure of spread.
Variance / Standard Deviation
V[X] = E[(X − E[X])²]
sd[X] = σ_X = √V[X]
Hint: V[X] = E[X²] − E[X]²
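Both definitions, and the hint, can be verified on samples. A minimal numpy sketch (not from the slides; μ = 2 and σ = 3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)  # samples with mu = 2, sigma = 3

mean = x.mean()                          # E[X]
var_def = np.mean((x - mean) ** 2)       # V[X] = E[(X - E[X])^2]
var_hint = np.mean(x ** 2) - mean ** 2   # V[X] = E[X^2] - E[X]^2

print(mean)               # ~ 2.0
print(var_def, var_hint)  # both ~ 9.0, equal up to floating-point error
print(np.sqrt(var_def))   # sd[X] ~ 3.0
```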
Multivariate Case
Multiple Random Variables
Example: more than one random variable, e.g. the length L and weight W of an object, X = [L, W]ᵀ.
Joint Probability
P[X = x ∧ Y = y] = P_{x,y}  (discrete),   p(x, y)  (continuous density)
Marginals and Conditionals
Marginalisation
P[X = x] = ∑_y P[X = x, Y = y]
p(x) = ∫ p(x, y) dy
Conditional Probability
P[X = x | Y = y] = P[X = x, Y = y] / P[Y = y],   P[Y = y] > 0
p(x | y) := p(x, y) / p(y)
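In the discrete case both operations are just sums over a joint table. A minimal numpy sketch (not from the slides; the joint pmf values are made up):

```python
import numpy as np

# Hypothetical joint pmf P[X = x, Y = y]; rows index x, columns index y.
P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])
assert np.isclose(P_xy.sum(), 1.0)

# Marginalisation: sum out the unwanted variable.
P_x = P_xy.sum(axis=1)  # P[X = x] = sum_y P[X = x, Y = y]
P_y = P_xy.sum(axis=0)  # P[Y = y]

# Conditioning: fix Y = y and renormalise that column.
y = 0
P_x_given_y = P_xy[:, y] / P_y[y]  # P[X = x | Y = y]
print(P_x, P_y, P_x_given_y, sep="\n")
```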
Bayes’ Rule
Use the factorisation of the joint probability distribution / density:
p(x, y) = p(x | y) p(y)
p(x, y) = p(y | x) p(x)
P_{x|y} = P_{y|x} P_x / P_y
p(x | y) = p(y | x) p(x) / p(y)
Bayesian talk: “Prior adapted to data leads to posterior”
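A minimal numeric sketch of the discrete rule (not from the slides; the prior and likelihood values are made up):

```python
import numpy as np

P_x = np.array([0.4, 0.6])            # prior P[X = x]
P_y_given_x = np.array([[0.9, 0.1],   # likelihood P[Y = y | X = x];
                        [0.2, 0.8]])  # row = x, column = y, rows sum to 1
y = 0                                 # the observed data

P_y = P_y_given_x[:, y] @ P_x              # evidence P[Y = y] = sum_x P[y|x] P[x]
posterior = P_y_given_x[:, y] * P_x / P_y  # P[X = x | Y = y]
print(posterior)        # [0.75, 0.25]: the prior [0.4, 0.6] adapted to the data
print(posterior.sum())  # 1.0
```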
Covariance and Independence
Covariance
Cov(X, Y) = E[(X − E[X]) (Y − E[Y])]
Σ(X) = E[(X − E[X])(X − E[X])ᵀ]
Independence
p(x, y) = p(x) p(y)  ⇔  X and Y are independent
Covariance ≠ Independence
X and Y are independent, X ⊥ Y  ⇒  Cov(X, Y) = 0
The converse does not hold: e.g. for X ∼ N(0, 1) and Y = X², Cov(X, Y) = 0 although Y is completely determined by X.
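The counterexample above is easy to check empirically. A minimal numpy sketch (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)  # X ~ N(0, 1), symmetric around 0
y = x ** 2                      # Y is fully determined by X, hence dependent

# Cov(X, Y) = E[(X - E[X])(Y - E[Y])] is ~0 despite the dependence:
cov = np.mean((x - x.mean()) * (y - y.mean()))
print(cov)  # close to 0: zero covariance does NOT imply independence
```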
Multivariate Gaussian Distribution
This distribution occurs very frequently.
It is simple enough to demonstrate all of the above concepts.
Multivariate Gaussian Distribution
p(x) = 1 / √((2π)^d |Σ|) · exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))
μ : mean vector
Σ : covariance matrix (d × d, positive definite, symmetric)
|Σ| : determinant of Σ
d : number of dimensions
X ∼ N(μ, Σ)
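The density can be written out directly from the formula. A minimal sketch (not from the slides), checked against scipy, with arbitrary values for μ and Σ:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, written out as on the slide."""
    d = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm_const

mu = np.array([0.0, 1.0])       # arbitrary example values
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])  # symmetric, positive definite
x = np.array([1.0, 2.0])

assert np.isclose(mvn_pdf(x, mu, Sigma),
                  multivariate_normal.pdf(x, mean=mu, cov=Sigma))
print(mvn_pdf(x, mu, Sigma))
```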
2D Gaussian — Surface Plot
2D Gaussian — Contour Plot
Points on a contour have equal probability density (equidensity lines).
The contours are ellipsoids.
Figure: Bishop 2009
2D Gaussian — Samples / Scatter
Equidensity lines are Ellipsoids
The ellipsoids are determined by the quadratic form (x − μ)ᵀ Σ⁻¹ (x − μ).
Σ is positive definite and symmetric ⇒ the contours are ellipsoids centered at μ.
Eigenvectors and eigenvalues of Σ:  Σ e_i = λ_i e_i
The directions of the semi-axes are given by the eigenvectors e_i.
λ_i measures the variance along the corresponding eigendirection e_i.
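The eigendecomposition is one call in numpy. A minimal sketch (not from the slides; Σ is an arbitrary positive definite matrix):

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])  # arbitrary symmetric, positive definite covariance

# eigh is numpy's eigensolver for symmetric matrices: Sigma e_i = lambda_i e_i.
lambdas, E = np.linalg.eigh(Sigma)  # eigenvalues ascending; eigenvectors in columns

print(lambdas)           # variances along the principal directions
print(E)                 # columns: directions of the ellipsoid's semi-axes
print(np.sqrt(lambdas))  # the semi-axis lengths scale with sqrt(lambda_i)
```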
Moments of a Multivariate Gaussian Distribution
Mean
E[X] = μ,  E[X_i] = μ_i
Covariance
V[X] = Σ,  Cov(X_i, X_j) = Σ_ij
Correlation
Cor(X_i, X_j) = ρ_ij = Cov(X_i, X_j) / (σ_i σ_j) = Σ_ij / √(Σ_ii Σ_jj),   σ_i = √Σ_ii
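Turning a covariance matrix into a correlation matrix is a one-liner. A minimal numpy sketch (not from the slides; Σ is made up):

```python
import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])  # arbitrary covariance matrix

sigma = np.sqrt(np.diag(Sigma))       # sigma_i = sqrt(Sigma_ii)
Rho = Sigma / np.outer(sigma, sigma)  # rho_ij = Sigma_ij / (sigma_i * sigma_j)
print(Rho)  # unit diagonal; off-diagonal entries lie in [-1, 1] (here 0.6)
```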
Correlation and Covariance
Correlation measures the strength of linear relations between variables.
It does not measure independence.
It does not tell you anything about causal relations.
Correlation is normalised and dimensionless.
Marginals
Marginal (German: Randverteilung)
Removing unknown variables — “projection”
p(x) = ∫ p(x, y) dy
Marginal of a Gaussian
X ∼ N(μ, Σ), partitioned as
X = (X_a, X_b)ᵀ,  μ = (μ_a, μ_b)ᵀ,  Σ = [ Σ_aa  Σ_ab ]
                                        [ Σ_ba  Σ_bb ]
p(x_a) = N(x_a | μ_a, Σ_aa)
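For a Gaussian, no integral is needed: the marginal just reads off the corresponding sub-blocks. A minimal numpy sketch (not from the slides; the partition a = [0, 1], b = [2] and all values are arbitrary):

```python
import numpy as np

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
a = [0, 1]  # keep X_a, marginalise out the rest

# Marginal of a Gaussian: p(x_a) = N(x_a | mu_a, Sigma_aa).
mu_a = mu[a]
Sigma_aa = Sigma[np.ix_(a, a)]  # the Sigma_aa sub-block
print(mu_a, Sigma_aa, sep="\n")
```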
Conditionals
Conditional (German: Bedingte Verteilung)
Fixing a variable to a certain value — “slices”
p(x | y) = p(x, y) / p(y)
Conditional of a Gaussian
X ∼ N(μ, Σ), partitioned as above:
X = (X_a, X_b)ᵀ,  μ = (μ_a, μ_b)ᵀ,  Σ = [ Σ_aa  Σ_ab ]
                                        [ Σ_ba  Σ_bb ]
p(x_a | X_b = x_b) = N(x_a | μ_{a|b}, Σ_{a|b})
μ_{a|b} = μ_a + Σ_ab Σ_bb⁻¹ (x_b − μ_b)
Σ_{a|b} = Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba
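The two conditioning formulas translate directly into code. A minimal numpy sketch (not from the slides; same hypothetical partition and values as in the marginal example):

```python
import numpy as np

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
a, b = [0, 1], [2]
x_b = np.array([3.0])  # the observed value of X_b

S_aa = Sigma[np.ix_(a, a)]
S_ab = Sigma[np.ix_(a, b)]
S_ba = Sigma[np.ix_(b, a)]
S_bb = Sigma[np.ix_(b, b)]

gain = S_ab @ np.linalg.inv(S_bb)
mu_cond = mu[a] + gain @ (x_b - mu[b])  # mu_{a|b} = mu_a + S_ab S_bb^-1 (x_b - mu_b)
Sigma_cond = S_aa - gain @ S_ba         # Sigma_{a|b} = S_aa - S_ab S_bb^-1 S_ba
print(mu_cond, Sigma_cond, sep="\n")
```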
Marginal and Conditional of a Gaussian
Figure: Bishop 2009
Affine Transformations
Gaussians are stable under affine transformations.
Affine transformation: Y = AX + b  (A and b are constant)
Affine Transform
X ∼ N(μ, Σ),  X ∈ ℝ^d
Y = AX + b,  Y ∈ ℝ^n,  A ∈ ℝ^{n×d},  b ∈ ℝ^n
Y ∼ N(μ_Y, Σ_Y)
μ_Y = Aμ + b
Σ_Y = AΣAᵀ
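The two formulas can be checked against sample statistics. A minimal numpy sketch (not from the slides; A, b, μ and Σ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, -1.0]])  # arbitrary 3x2 matrix, so d = 2 and n = 3
b = np.array([0.5, 0.0, 1.0])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
Y = X @ A.T + b              # Y = A X + b, applied to every sample row

print(A @ mu + b, Y.mean(axis=0), sep="\n")    # mu_Y vs empirical mean
print(A @ Sigma @ A.T, np.cov(Y.T), sep="\n")  # Sigma_Y vs empirical covariance
```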
Standard Normal
Univariate Standard Normal
X ∼ N(0, 1),  μ = 0,  σ = 1
Multivariate Standard Normal
X ∼ N(0, I_d),  μ = 0,  Σ = I
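Combined with the affine-transform property, the standard normal gives the usual recipe for sampling from any Gaussian: X = LZ + μ with LLᵀ = Σ and Z ∼ N(0, I). A minimal numpy sketch (not from the slides; μ and Σ are arbitrary):

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

rng = np.random.default_rng(0)
Z = rng.standard_normal((100_000, 2))  # Z ~ N(0, I)

# Affine stability: X = L Z + mu with L L^T = Sigma gives X ~ N(mu, Sigma).
L = np.linalg.cholesky(Sigma)
X = Z @ L.T + mu

print(X.mean(axis=0))  # ~ mu
print(np.cov(X.T))     # ~ Sigma
```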
When to Stop using Gaussians
Gaussians are very handy and can be used in many situations, but be careful if one of these points applies to your problem:
Gaussians do not have heavy tails
• In many real-world (empirical) distributions, extreme events occur far more often than a Gaussian would allow.
Gaussians have only a single mode
• A mixture of Gaussians can be used here (see lecture).
Heavy Tails
Figure: density p(x) of the standard normal versus the Cauchy distribution. The Cauchy density has much heavier tails.
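How much heavier the tails are can be quantified with tail probabilities. A minimal scipy sketch (not from the slides):

```python
from scipy.stats import cauchy, norm

# Probability of an observation more than 5 "widths" above the center;
# the survival function sf(x) = 1 - F(x) gives the upper tail mass.
print(norm.sf(5))    # ~ 2.9e-07 for the standard normal
print(cauchy.sf(5))  # ~ 6.3e-02 for the Cauchy: five orders of magnitude larger
```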