(1)

Machine Learning

Volker Roth

Department of Mathematics & Computer Science, University of Basel


(2)

Section 9: Mixture Models


(3)

Structure and mixtures

Assume that input examples come in different, potentially unobserved types (groups, clusters, etc.).

Assume that
1. there are m underlying types z = 1, . . . , m;
2. each type z occurs with probability P(z);
3. examples of type z are distributed according to p(x|z).

According to this model, each observed x comes from a mixture distribution:

$$p(x) = \sum_{j=1}^{m} \underbrace{P(z=j)}_{\pi_j} \, p(x \mid z=j, \theta_j)$$

In many practical data analysis problems (such as probabilistic clustering), we want to estimate such parametric models from samples {x_1, . . . , x_n}. In particular, we are often interested in finding the types that have generated the examples.
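To make the sum concrete, here is a minimal numeric sketch for a univariate two-type mixture; all parameter values are invented for illustration:

```python
import numpy as np

# Invented parameters for illustration: m = 2 types with
# mixing proportions pi_j = P(z = j).
pi = np.array([0.3, 0.7])      # P(z = 1), P(z = 2); sums to 1
mu = np.array([0.0, 4.0])      # component means
sigma = np.array([1.0, 1.5])   # component standard deviations

def gauss_pdf(x, m, s):
    """Univariate Gaussian density N(x | m, s^2)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def mixture_pdf(x):
    """p(x) = sum_j P(z = j) * p(x | z = j)."""
    return sum(pi[j] * gauss_pdf(x, mu[j], sigma[j]) for j in range(len(pi)))

print(mixture_pdf(1.0))   # mixture density evaluated at x = 1.0
```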


(4)

Mixture of Gaussians

A mixture of Gaussians model has the form

$$p(x \mid \theta) = \sum_{j=1}^{m} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j),$$

where θ = {π_1, . . . , π_m, µ_1, . . . , µ_m, Σ_1, . . . , Σ_m} contains all the parameters and the {π_j} are the mixing proportions.
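The same density in the multivariate case, sketched with scipy's Gaussian density and invented 2-D parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Invented 2-D parameters; theta collects pi, mu and Sigma as above.
pi = [0.5, 0.5]
mu = [np.zeros(2), np.array([3.0, 3.0])]
Sigma = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def gmm_pdf(x):
    """p(x | theta) = sum_j pi_j * N(x | mu_j, Sigma_j)."""
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=S)
               for p, m, S in zip(pi, mu, Sigma))

print(gmm_pdf(np.array([1.0, 1.0])))
```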


(5)

Mixture densities

Data generation process:

[Figure: the mixing distribution P(z) selects one of the two components p(x|z=1), p(x|z=2), from which x is then drawn.]

$$p(x \mid \theta) = \sum_{j=1}^{m} \pi_j \, p(x \mid \mu_j, \Sigma_j)$$

Any data point x could have been generated in two ways; the responsible component needs to be inferred.
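The two-stage generation process is easy to sketch in code (sample_mixture is an illustrative helper, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_mixture(n, pi, mu, Sigma):
    """Generate n points: first draw the latent type z ~ P(z),
    then draw x from the selected component N(mu_z, Sigma_z)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.stack([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])
    return x, z   # with real data only x is observed; z must be inferred
```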


(6)

Mixtures as Latent Variable Models

In the model p(x | z = j, θ) the class indicator variable z is latent.

This is an example of a large class of latent variable models (LVM).

Bayesian network (DAG) = graphical representation of the joint distribution of RVs (nodes) as

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \text{parents}(x_i))$$

[Plate diagram: for i = 1, . . . , n, the latent z_i (with prior π) selects the component; the observed x_i depends on z_i and the parameters µ, Σ.]

$$p(x_i \mid \theta) = \sum_{z_i} p(x_i, z_i \mid \theta) = \sum_{z_i} p(x_i \mid \mu, \Sigma, z_i) \, p(z_i \mid \pi).$$


(7)

Mixture densities

Consider a two-component mixture of Gaussians model.

$$p(x \mid \theta) = \pi_1 \, p(x \mid \mu_1, \Sigma_1) + \pi_2 \, p(x \mid \mu_2, \Sigma_2)$$

If we knew the generating component z_i ∈ {1, 2} for each example x_i, then the estimation would be easy.

[Figure: the same two-component generation process P(z), p(x|z=1), p(x|z=2) as on the previous slide.]

In particular, we can estimate each Gaussian independently.


(8)

Mixture density estimation

Let δ(j|i) be an indicator function of whether example i is labeled j. Then for each j = 1, 2:

$$\hat{\pi}_j \leftarrow \frac{\hat{n}_j}{n}, \quad \text{where } \hat{n}_j = \sum_{i=1}^{n} \delta(j|i)$$

$$\hat{\mu}_j \leftarrow \frac{1}{\hat{n}_j} \sum_{i=1}^{n} \delta(j|i)\, x_i$$

$$\hat{\Sigma}_j \leftarrow \frac{1}{\hat{n}_j} \sum_{i=1}^{n} \delta(j|i)\, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T$$
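These three estimators translate directly into code; a sketch under the assumption that the labels are given as integers (fit_labeled and all names are illustrative):

```python
import numpy as np

def fit_labeled(X, labels, m):
    """ML estimates when the generating component of each example is known.
    labels[i] = j plays the role of the indicator delta(j|i)."""
    n = len(X)
    pi_hat, mu_hat, Sigma_hat = [], [], []
    for j in range(m):
        Xj = X[labels == j]          # examples i with delta(j|i) = 1
        nj = len(Xj)                 # n_hat_j
        mu_j = Xj.mean(axis=0)       # mu_hat_j
        diff = Xj - mu_j
        pi_hat.append(nj / n)        # pi_hat_j
        mu_hat.append(mu_j)
        Sigma_hat.append(diff.T @ diff / nj)   # Sigma_hat_j
    return np.array(pi_hat), np.array(mu_hat), np.array(Sigma_hat)
```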


(9)

Mixture density estimation

We don’t have such labels... but we can guess what the labels might be based on our current distribution.

One possible choice: evaluate the posterior probability that an observed x was generated from the first component:

$$P(z=1 \mid x, \theta) = \frac{P(z=1)\, p(x \mid z=1)}{\sum_{j=1,2} P(z=j)\, p(x \mid z=j)} = \frac{\pi_1 \, p(x \mid \mu_1, \Sigma_1)}{\sum_{j=1,2} \pi_j \, p(x \mid \mu_j, \Sigma_j)}$$

This gives information about the component responsible for generating x: soft labels or posterior probabilities

$$\hat{p}(j|i) \leftarrow P(z_i = j \mid x_i, \theta), \quad \text{where } \sum_{j=1,2} \hat{p}(j|i) = 1 \;\; \forall i = 1, \ldots, n.$$
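A sketch of computing these soft labels for all examples at once (responsibilities is an illustrative name; scipy supplies the Gaussian densities):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    """p_hat(j|i) = pi_j N(x_i|mu_j, Sigma_j) / sum_k pi_k N(x_i|mu_k, Sigma_k)."""
    # one column of unnormalized posteriors per component j
    R = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(len(pi))])
    return R / R.sum(axis=1, keepdims=True)   # each row sums to 1
```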


(10)

The EM algorithm: iteration t

E-step: softly assign examples to mixture components:

$$\hat{p}(j|i) \leftarrow P(z_i = j \mid x_i, \theta^t), \quad \forall j = 1, 2 \text{ and } i = 1, \ldots, n.$$

Note: the superscript t is the time (iteration) index.

M-step: estimate new mixture parameters θ^{t+1} based on the soft assignments (this can be done separately for the two Gaussians):

$$\hat{\pi}_j \leftarrow \frac{\hat{n}_j}{n}, \quad \text{where } \hat{n}_j = \sum_{i=1}^{n} \hat{p}(j|i)$$

$$\hat{\mu}_j \leftarrow \frac{1}{\hat{n}_j} \sum_{i=1}^{n} \hat{p}(j|i)\, x_i, \qquad \hat{\Sigma}_j \leftarrow \frac{1}{\hat{n}_j} \sum_{i=1}^{n} \hat{p}(j|i)\, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T$$
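Putting both steps together gives one full iteration; a sketch that reuses the responsibilities() helper from the previous slide (all names are illustrative):

```python
import numpy as np

def em_step(X, pi, mu, Sigma):
    """One EM iteration, mapping theta^t to theta^{t+1}."""
    R = responsibilities(X, pi, mu, Sigma)   # E-step: R[i, j] = p_hat(j|i)
    nj = R.sum(axis=0)                       # n_hat_j
    pi_new = nj / len(X)                     # pi_hat_j
    mu_new = (R.T @ X) / nj[:, None]         # mu_hat_j
    Sigma_new = np.stack([
        (R[:, j, None] * (X - mu_new[j])).T @ (X - mu_new[j]) / nj[j]
        for j in range(len(pi))
    ])                                       # Sigma_hat_j
    return pi_new, mu_new, Sigma_new
```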


(11)

The EM algorithm: Convergence

The EM algorithm monotonically increases the log-likelihood of the training data (we will show this later). In other words,

$$l(\theta^0) < l(\theta^1) < l(\theta^2) < \ldots \text{ until convergence}, \qquad l(\theta^t) = \sum_{i=1}^{n} \log p(x_i \mid \theta^t).$$

[Plot: log-likelihood over EM iterations, increasing monotonically and flattening at convergence.]
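Convergence can be monitored by evaluating l(θ^t) after each iteration; a minimal sketch:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mu, Sigma):
    """l(theta) = sum_i log p(x_i | theta)."""
    px = sum(pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
             for j in range(len(pi)))
    return np.log(px).sum()
```

Tracking this value across successive em_step() calls should reproduce the monotone increase in the plot above.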


(12)

Mixture density estimation: example


Fig. 11.11 in K. Murphy


(13)

Mixture density estimation: example

Fig. 11.11 in K. Murphy


(14)

EM example: Iris data

The famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.

The species are Iris setosa, versicolor, and virginica.
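For experiments, one copy of the data ships with scikit-learn (a tooling assumption; any copy of the iris data works):

```python
from sklearn.datasets import load_iris

# X: 150 x 4 matrix of the four measurements, y: species labels 0, 1, 2
X, y = load_iris(return_X_y=True)
```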


(15)


(16)

Bayesian model selection for mixture models

As a simple strategy for selecting the appropriate number of mixture components, we can find m that minimizes the overall description length (cf. BIC):

$$\mathrm{DL} \approx -\log p(\text{data} \mid \hat{\theta}_m) + \frac{d_m}{2} \log(n),$$

where
n is the number of training points,
θ̂_m are the maximum-likelihood parameters of the m-component mixture, and
d_m is the number of parameters in the m-component mixture.
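A sketch of this selection with scikit-learn's GaussianMixture (a tooling assumption): its .bic() returns −2 log p(data|θ̂_m) + d_m log(n), i.e. twice the description length above, so the minimizing m is the same:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_m(X, max_m=6):
    """Fit mixtures with m = 1, ..., max_m and return the m with smallest BIC."""
    bics = [GaussianMixture(n_components=m, random_state=0).fit(X).bic(X)
            for m in range(1, max_m + 1)]
    return int(np.argmin(bics)) + 1, bics
```

Applied to the iris measurements X loaded above, this compares, among others, the m = 2, 3, 4 fits shown on the following slides.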


(17)

Model selection example: Iris data, m = 2

[Pairs plot of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width with the fitted m = 2 mixture.]

(18)

Model selection example: Iris data, m = 3

[Pairs plot of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width with the fitted m = 3 mixture.]

(19)

Model selection example: Iris data, m = 4

[Pairs plot of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width with the fitted m = 4 mixture.]
