The EM algorithm produces a sequence of estimates $\{\hat{\theta}(t),\ t = 0, 1, 2, \ldots\}$ by alternately applying the following two steps (until some convergence criterion is met):

E-Step: Computes the conditional expectation of the complete log-likelihood in Eq. 3.6, given Y and the current estimate $\hat{\theta}(t)$. Since $\log p(Y, Z|\theta)$ is linear with respect to the missing Z, we simply have to compute the conditional expectation $W \equiv E[Z \mid Y, \hat{\theta}(t)]$ and plug it into $\log p(Y, Z|\theta)$. The result is the so-called Q-function:

Q(\theta, \hat{\theta}(t)) \equiv E[\log p(Y, Z|\theta) \mid Y, \hat{\theta}(t)] = \log p(Y, W|\theta)   (3.7)

Since the elements of Z are binary, their conditional expectations (i.e., the elements of W) are given by

w_m^{(i)} \equiv E\left[ z_m^{(i)} \mid y, \hat{\theta}(t) \right] = \Pr\left[ z_m^{(i)} = 1 \mid y, \hat{\theta}(t) \right] = \frac{\hat{\alpha}_m(t)\, p\left( y^{(i)} \mid \hat{\theta}_m(t) \right)}{\sum_{j=1}^{k} \hat{\alpha}_j(t)\, p\left( y^{(i)} \mid \hat{\theta}_j(t) \right)},   (3.8)

where the last equality is simply Bayes' law (notice that $\hat{\alpha}_m$ is the a priori probability that $z_m^{(i)} = 1$, while $w_m^{(i)}$ is the a posteriori probability that $z_m^{(i)} = 1$ after observing $y^{(i)}$), for any $i$.
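As an illustrative sketch (not part of the original formulation), Eq. 3.8 can be evaluated for a univariate Gaussian mixture; the choice of component density, the variable names, and the use of NumPy/SciPy are assumptions of this example:

```python
import numpy as np
from scipy.stats import norm

def e_step(y, alphas, means, stds):
    """E-step (Eq. 3.8): posterior responsibilities w_m^(i) for a
    univariate Gaussian mixture -- an illustrative choice of p(y|theta_m)."""
    # numerator: alpha_m * p(y^(i) | theta_m), one column per component
    num = np.vstack([a * norm.pdf(y, mu, sd)
                     for a, mu, sd in zip(alphas, means, stds)]).T
    # normalize each row by the sum over components j = 1..k
    return num / num.sum(axis=1, keepdims=True)

# toy usage with assumed current estimates
y = np.array([-1.2, 0.1, 3.4, 4.0])
W = e_step(y, alphas=[0.5, 0.5], means=[0.0, 3.5], stds=[1.0, 1.0])
print(W.sum(axis=1))  # each row of responsibilities sums to 1
```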

M-Step: Updates the parameter estimates according to

\hat{\theta}(t+1) = \arg\max_{\theta} \left\{ Q(\theta, \hat{\theta}(t)) + \log p(\theta) \right\}   (3.9)

in the case of MAP estimation, or

\hat{\theta}(t+1) = \arg\max_{\theta} Q(\theta, \hat{\theta}(t)).   (3.10)
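Continuing the illustrative Gaussian-mixture example above (the setup and variable names are assumptions, not part of the original text), the ML M-step of Eq. 3.10 has a well-known closed form given the responsibilities W:

```python
import numpy as np

def m_step(y, W):
    """ML M-step (Eq. 3.10) for a univariate Gaussian mixture: with W fixed,
    Q is maximized in closed form (standard result, shown for illustration)."""
    n_m = W.sum(axis=0)                          # effective counts per component
    alphas = n_m / len(y)                        # updated mixing weights
    means = (W * y[:, None]).sum(axis=0) / n_m   # responsibility-weighted means
    var = (W * (y[:, None] - means) ** 2).sum(axis=0) / n_m
    return alphas, means, np.sqrt(var)
```

Alternating this sketch with the E-step sketch until the log-likelihood stops improving implements the EM loop described at the beginning of this section.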


For statistical-learning-based model selection, a model is defined as a collection of probability distributions indexed by model parameters, $M = \{f(Y|\theta) \mid \theta \in \Omega\}$, forming a Riemannian manifold embedded in the space of probability distributions. The Akaike information criterion (AIC) [3] is derived as an asymptotic approximation of the Kullback-Leibler information distance between the model of interest and the truth, within a set of proposed models.

In the following, we review some of the existing methods for approximating marginal likelihoods; we then present our proposed methods for statistical-model-selection-based nonparametric blur identification. The reviewed methods are analytic approximations such as the Laplace method [124], the Bayesian Information Criterion (BIC) [223], the Minimum Description Length (MDL) [203], and the Minimum Message Length (MML) [253]. All of these methods make use of the MAP estimate, which is usually straightforward to compute.

3.3.1 Entropy and Information Measure

Information theory was originally developed for data compression and data transmission. With the evolution of statistics, researchers found that information theory underlies many disciplines that describe the physical world. This broad realm of information theory, with its numerous topics, is rooted in two basic concepts: entropy and mutual information, which are functions of probability distributions that underlie the process of communication.

According to MacKay [154], entropy measures the uncertainty of a random variable, which can be considered the information embedded in the variable. Mathematically, the entropy of a discrete random variable X, denoted by S(X), is defined as the expected value of the negative logarithm of the probabilities, as shown in Table 3.1. From the definition we can derive the nonnegativity property of entropy, $S(X) \geq 0$: since $0 \leq p(x) \leq 1$ for all $x \in \mathcal{X}$, where $\mathcal{X}$ is the set of all possible values of X, we have $\log p(x) \leq 0$, and therefore $S(X) = E_p(-\log p(x)) \geq 0$.

The definition of entropy can be extended to the case of multiple variables, e.g., joint entropy and conditional entropy. The joint entropy of a pair of discrete random variables X and Y, denoted by S(X, Y), measures the uncertainty of the pair (X, Y). The conditional entropy of a random variable Y given another variable X, denoted by S(Y|X), measures the uncertainty of Y when X is known. The definitions are shown in Table 3.1.

Table 3.1: List of Definitions of Entropy and Mutual Information for Discrete Random Variables

Entropy (single variable): $S(X) = E_p(-\log p(x)) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$

Joint entropy: $S(X, Y) = E_p(-\log p(x, y)) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$

Conditional entropy: $S(Y|X) = \sum_{x \in \mathcal{X}} p(x)\, S(Y|X = x) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)$

Relative entropy: $S_{KL}(p(x), q(x)) = E_p\left[\log \frac{p(x)}{q(x)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$

Mutual information: $M(Y; X) = S(Y) - S(Y|X) = S(X) - S(X|Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
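To make the definitions in Table 3.1 concrete, here is a minimal sketch (Python/NumPy and the toy joint distribution are assumptions of this example) that evaluates the entropy, joint entropy, and conditional entropy of a small discrete pair (X, Y):

```python
import numpy as np

def entropy(p):
    """S(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# toy joint distribution p(x, y): rows index x, columns index y
p_xy = np.array([[0.25, 0.25],
                 [0.40, 0.10]])
p_x = p_xy.sum(axis=1, keepdims=True)             # marginal p(x)

S_x = entropy(p_x)                                # S(X)
S_xy = entropy(p_xy.ravel())                      # S(X, Y)
S_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x))  # S(Y|X) as in Table 3.1
print(S_x, S_xy, S_y_given_x)
```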

The relative entropy, also called the Kullback-Leibler (KL) divergence, is a measure of the distance between two distributions. It is a function of two probability mass functions p(x) and q(x) that potentially characterize the same random variable X, as shown in Table 3.1. The relative entropy is nonnegative; the proof is based on the convexity of the logarithm function. For any given probability mass functions p(x) and q(x), we have $S_{KL}(p(x), q(x)) \geq 0$, with equality if and only if p(x) = q(x). The conventions $0 \log \frac{0}{q(x)} = 0$ and $p(x) \log \frac{p(x)}{0} = \infty$ are used. When q(x) is a uniform probability distribution, the KL divergence equals the maximum entropy $\log|\mathcal{X}|$ minus Shannon's entropy S(X), i.e., it measures the uncertainty of X relative to the maximum-uncertainty model. Thus, Shannon's entropy can be interpreted as the amount of information in a model q(x) of X compared to the maximum-uncertainty model, the uniform distribution, which is the distribution with maximum entropy.
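Similarly, a minimal sketch of the relative entropy (the toy distributions are illustrative), showing its nonnegativity, the equality condition, and its relation to the uniform reference distribution mentioned above:

```python
import numpy as np

def kl_divergence(p, q):
    """S_KL(p, q) = sum_x p(x) log(p(x)/q(x)), with 0 log(0/q) taken as 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([1/3, 1/3, 1/3])    # uniform reference distribution
print(kl_divergence(p, q))       # > 0, since p != q
print(kl_divergence(p, p))       # 0: equality holds iff p == q
print(np.log(3) - (-np.sum(p * np.log(p))))  # log|X| - S(p), same as line one
```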

The concept of mutual information measures the amount of information that one random variable contains about another: it is the reduction in the uncertainty of one random variable due to knowledge of the other. For two random variables X and Y, the information about Y contained in X is given by the uncertainty reduction of Y when X is known; likewise, the information about X contained in Y is given by the uncertainty reduction of X when Y is known. Mutual information is defined by $M(Y;X) = S(Y) - S(Y|X) = S(X) - S(X|Y)$, and these two expressions are equivalent. The definition of mutual information is shown in Table 3.1.
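The following sketch (again with an assumed toy joint table) computes the mutual information from the sum in Table 3.1 and verifies that it equals the entropy difference S(Y) - S(Y|X):

```python
import numpy as np

# toy joint distribution p(x, y): rows index x, columns index y
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)

# M(Y;X) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )
mi_direct = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))

# entropy-difference form: S(Y) - S(Y|X)
S_y = -np.sum(p_y * np.log(p_y))
S_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x))
print(mi_direct, S_y - S_y_given_x)      # the two values agree
```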

Both basic concepts, entropy and mutual information, can also be extended to continuous random variables, yielding the differential entropy, whose definition replaces the "sum" by an "integral".

3.3.2 Laplace’s Method

We infer the Bayesian information criterion (BIC) from the Laplace approximation, from which related information-measure criteria such as AIC, MDL and MML can also be easily understood. Based on Bayes' rule, the posterior over the parameters θ of a model m is

P(\theta|Y, m) = \frac{p(Y|\theta, m)\, P(\theta|m)}{p(Y|m)}   (3.11)

The logarithm of the numerator is defined in the following,

t(\theta) = \log\left[ p(Y|\theta, m)\, P(\theta|m) \right] = \log P(\theta|m) + \sum_{i=1}^{n} \log p(y^{(i)}|\theta, m)   (3.12)

The Laplace approximation [124] makes a local Gaussian approximation around the MAP parameter estimate $\hat{\theta}$ of Eq. 3.2. The validity of this approximation rests on the large-data limit and regularity constraints. Expanding t(θ) to second order as a Taylor series around this point gives

t(\theta) = t(\hat{\theta}) + (\theta - \hat{\theta})^{\top} \left. \frac{\partial t(\theta)}{\partial \theta} \right|_{\theta=\hat{\theta}} + \frac{1}{2!} (\theta - \hat{\theta})^{\top} \left. \frac{\partial^2 t(\theta)}{\partial \theta\, \partial \theta^{\top}} \right|_{\theta=\hat{\theta}} (\theta - \hat{\theta}) + \ldots   (3.13)

\approx t(\hat{\theta}) + \frac{1}{2} (\theta - \hat{\theta})^{\top} H(\hat{\theta}) (\theta - \hat{\theta})   (3.14)


where $H(\hat{\theta})$ is the Hessian of the log posterior evaluated at $\hat{\theta}$; it is the matrix of second derivatives of Eq. 3.12,

H(\hat{\theta}) = \left. \frac{\partial^2 \log p(\theta|Y, m)}{\partial \theta\, \partial \theta^{\top}} \right|_{\theta=\hat{\theta}} = \left. \frac{\partial^2 t(\theta)}{\partial \theta\, \partial \theta^{\top}} \right|_{\theta=\hat{\theta}}   (3.15)

The linear term in Eq. 3.13 vanishes because the gradient of the posterior, $\partial t(\theta)/\partial \theta$, is zero at $\hat{\theta}$, the MAP setting or a local maximum. Substituting Eq. 3.14 into the log marginal likelihood in Eq. 3.5 and integrating yields

\log p(Y|m) = \log \int d\theta\, P(\theta|m)\, p(Y|\theta, m) = \log \int d\theta\, \exp[t(\theta)]   (3.16)

\approx t(\hat{\theta}) + \frac{1}{2} \log \left| 2\pi H^{-1} \right|   (3.17)

= \log P(\hat{\theta}|m) + \log p(Y|\hat{\theta}, m) + \frac{d}{2} \log 2\pi - \frac{1}{2} \log |H|   (3.18)

where d is the dimensionality of the parameter space and |H| denotes the determinant of H. Thus, Eq. 3.18 can be written as

p(Y|m)_{\text{Laplace}} = P(\hat{\theta}|m)\, p(Y|\hat{\theta}, m)\, \left| 2\pi H^{-1} \right|^{1/2}   (3.19)

The Laplace approximation to the marginal likelihood thus consists of a term for the data likelihood at the MAP setting, a penalty term from the prior, and a volume term calculated from the local curvature. However, this approximation has the shortcoming that the second derivatives required for H can be intractable to compute.
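As an illustration of Eqs. 3.12-3.19, the following sketch evaluates the Laplace approximation to a log marginal likelihood for a toy one-parameter Gaussian model; the numerical Hessian, the chosen prior, and all variable names are assumptions made for this example only:

```python
import numpy as np

def log_joint(theta, y, prior_std=10.0):
    """t(theta) = log P(theta|m) + sum_i log p(y_i|theta, m)  (Eq. 3.12),
    for a toy model: y_i ~ N(theta, 1) with a N(0, prior_std^2) prior."""
    log_prior = -0.5 * (theta / prior_std) ** 2 - 0.5 * np.log(2 * np.pi * prior_std ** 2)
    log_lik = np.sum(-0.5 * (y - theta) ** 2 - 0.5 * np.log(2 * np.pi))
    return log_prior + log_lik

def laplace_log_evidence(y):
    """log p(Y|m) ~= t(theta_hat) + (d/2) log 2*pi - (1/2) log|H|  (Eq. 3.18),
    with |H| read as the positive curvature at the maximum."""
    # MAP estimate found by a crude grid search (illustrative only)
    grid = np.linspace(-5, 5, 10001)
    theta_hat = grid[np.argmax([log_joint(t, y) for t in grid])]
    # second derivative of t at theta_hat by central finite differences
    eps = 1e-4
    d2t = (log_joint(theta_hat + eps, y) - 2 * log_joint(theta_hat, y)
           + log_joint(theta_hat - eps, y)) / eps ** 2
    H = -d2t                      # curvature of the negative log posterior
    d = 1                         # dimensionality of the parameter space
    return log_joint(theta_hat, y) + 0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(H)

y = np.random.default_rng(0).normal(1.0, 1.0, size=50)
print(laplace_log_evidence(y))
```

For this conjugate toy model the exact evidence is available in closed form, so the quality of the approximation can be checked against it if desired.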

The preference of Bayesian model selection for simpler models is a built-in Ockham's razor. Bayesian model selection methods, including Bayes factors and the Bayesian information criterion (BIC), are also used for model selection and parameter estimation [255]. BIC can be considered an approximation of the Bayes factor [201]; it is based on a large-sample approximation of the marginal likelihood, yielding the easily computable BIC score.

3.3.3 BIC and MDL

AIC is derived as an asymptotic approximation of the Kullback-Leibler information distance between the model of interest and the truth. AIC and BIC have similar prediction mechanisms: they estimate a generalized model which has the ability to fit all "future" data samples from the same underlying process, not just the current data sample.

Because of their intrinsic simplicity, the Akaike information criterion (AIC), shown in Table 3.2, and MDL [203] are widely applied; both combine two terms, a data-likelihood term to be maximized and a penalty term for the complexity of the model. However, within a Bayesian framework, model selection appears more complex, as it involves the evaluation of Bayes factors [124].

These Bayes factors require the computation of high-dimensional integrals with no closed-form analytical expression. These computational problems have restricted the use of Bayesian model selection, except for the cases for which asymptotic expansions of the Bayes factor are valid [13], [23].

The Bayesian Information Criterion (BIC) [223], like AIC, is applicable in settings where the fitting is carried out by maximization of a log-likelihood. The BIC can be obtained from the Laplace approximation by retaining only those terms that grow with n. From Eq. 3.18, we have

\log p(Y|m)_{\text{Laplace}} = \underbrace{\log P(\hat{\theta}|m)}_{O(1)} + \underbrace{\log p(Y|\hat{\theta}, m)}_{O(n)} + \underbrace{\tfrac{d}{2} \log 2\pi}_{O(1)} - \underbrace{\tfrac{1}{2} \log |H|}_{O(d \log n)}   (3.20)

where the dependence of each term on n has been annotated using "big-O" notation, which indicates how each term grows with the sample size n. Retaining only the O(n) and O(log n) terms yields

\log p(Y|m)_{\text{Laplace}} = \log p(Y|\hat{\theta}, m) - \frac{1}{2} \log |H|   (3.21)

From Eq. 3.12 and Eq. 3.15 we know that the Hessian scales linearly with n, so we have

\lim_{n \to \infty} \frac{1}{2} \log |H| = \frac{1}{2} \log |n H_0| = \frac{d}{2} \log n + \frac{1}{2} \log |H_0|   (3.22)

Assuming further that the prior is non-zero at $\hat{\theta}$, Eq. 3.21 in the limit of large n becomes the BIC score,

\log p(Y|m)_{\text{BIC}} = \log p(Y|\hat{\theta}, m) - \frac{d}{2} \log n   (3.23)

There are two main advantages of BIC. Firstly, it does not depend on the prior p(θ|m). Secondly, it does not take into account the local geometry of the parameter space and is therefore invariant to reparameterizations of the model. In practice, the model dimension d used here is equal to the number of well-determined parameters, once any potential parameter degeneracies have been removed.
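As a concrete instance of Eq. 3.23, here is a minimal sketch (an assumed polynomial-regression example with Gaussian noise, not from the original text) that scores models of different complexity with the BIC:

```python
import numpy as np

def bic_score(y, y_hat, d):
    """BIC (Eq. 3.23): maximized log-likelihood minus (d/2) log n, here for a
    regression model with Gaussian noise whose variance is estimated by ML."""
    n = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)                     # ML noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # log p(Y | theta_hat, m)
    return log_lik - 0.5 * d * np.log(n)

# toy data from a quadratic trend plus noise (illustrative)
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 200)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0, 0.3, x.size)

# compare polynomial models of increasing order; d counts the polynomial
# coefficients plus the noise variance
for order in (1, 2, 3, 5):
    coeffs = np.polyfit(x, y, order)
    y_hat = np.polyval(coeffs, x)
    print(order, bic_score(y, y_hat, d=order + 2))
# the highest BIC score is expected at the true order (2)
```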

The Minimum Description Length (MDL) principle [203] informally states that the best model is the one which minimizes the sum of two terms: first, the length of the model, and second, the length of the data when encoded using the model as a predictor for the data. In other words, the MDL criterion is utilized for resolving the tradeoff between model complexity (each retained coefficient increases the number of model parameters) and goodness-of-fit (each truncated coefficient decreases the fit between the received, i.e., noisy, signal and its reconstruction). We seek the data representation that results in the shortest encoding of both observations and constraints.

On the other hand, the BIC is in fact exactly minus the minimum description length (MDL) penalty used by Rissanen [203], [204], shown in Table 3.2. The MDL method [203] is rooted in algorithmic coding theory: regularities (redundancy) in the data can be used to compress it. The main principle of MDL is to find the model that provides the shortest description length of the data in bits, by "compressing" the data as tightly as possible. It suggests a means of evaluating a representation system in terms of the length of the representation of the data item using the model and the mismatch between this representation and the actual data. Recently, the minimum message length (MML) framework of Wallace and Freeman [253], Lanterman [137], and Figueiredo and Jain [77] has been intensively applied to unsupervised model selection [77]; it is closely related to Bayesian integration over parameters.
