Munich Personal RePEc Archive
Kullback-Leibler Simplex
Kangpenkae, Popon
June 2012
Online at https://mpra.ub.uni-muenchen.de/39494/
MPRA Paper No. 39494, posted 17 Jun 2012 00:49 UTC
KULLBACK-LEIBLER SIMPLEX
Abstract. This technical reference presents the functional structure and the algorithmic implementation of the KL (Kullback-Leibler) simplex. It details the simplex approximation and fusion. The KL simplex is a fundamental, robust, and adaptive informatics agent for computational research in economics, finance, games, and mechanisms. From this perspective the study provides comprehensive results to facilitate future work in these areas.
God does not care about our mathematical difficulties. He integrates empirically. — Albert Einstein
There is nothing free, except the grace of God. — True Grit (2010)
1. Introduction
This paper presents an alternative form of sequential optimizing agent, which is crucial for the reliability of computational economics research. In particular, it is a version of an online classifier: a machine learner that processes classification over a data stream. The sequential implementation makes it an efficient, fast, and practical processor of flowing data. Among classifiers of this type, the informatics-divergence approach stands out for its solid foundation in mathematical statistics and information theory. It is instructive to see the difference between the two approaches: the standard approach targets performance in an objective function, while the informatics approach works with statistical measures, e.g. the Kullback-Leibler and Renyi divergences [CDR07].
Positively, the informatics agent can be an effective alternative to standard sequential optimizers. Furthermore, the informatics approach delivers powerful concepts: (i) it leverages notions and insights from dynamic programming [Sni10]; when the control is a simplex and a transition matrix, it has a strong foundation in probability and Markov chains [Beh00]; (ii) being model-free, or agnostic about the data, it can derive a superior second-order perceptron that works on real-world data [BCG05]. This approach can consequently improve machine learning so that it is robust and applicable to computational research in economics, finance, games, and mechanisms.
The next section lists useful formulas and identities. Section 3 presents the structure of the online machine learner [CDF08, LHZG11] and the key results; section 4 discusses the implementation. Instructive remarks are in section 5 and the proofs are in the Appendix.
2. The matrix simplex [CY11].
⟨1⟩ $\mu \in \overleftrightarrow{\triangle} \Leftrightarrow \mu\cdot\mathbf{1} = 1$ and ⟨2⟩ $\mu \in \triangle \Leftrightarrow \mu \in \overleftrightarrow{\triangle}$ with $\min(\mu) \geq 0$.
Taylor expansion.
$$\ln(\mu\cdot x_i) \approx \ln(\mu_i\cdot x_i) + \frac{(\mu-\mu_i)\cdot x_i}{\mu_i\cdot x_i}$$
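A quick numerical check of this first-order expansion, in Python with NumPy (the vectors are made-up values, not from the paper):

```python
import numpy as np

# First-order Taylor expansion of ln(mu . x_i) around mu_i, as in the text.
mu_i = np.array([0.5, 0.3, 0.2])
x_i = np.array([1.0, 2.0, 0.5])
mu = mu_i + np.array([0.01, -0.005, -0.005])   # small step that keeps mu . 1 = 1

exact = np.log(mu @ x_i)
approx = np.log(mu_i @ x_i) + ((mu - mu_i) @ x_i) / (mu_i @ x_i)
assert abs(exact - approx) < 1e-4              # agreement to first order
```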
Date: March 23, 2012.
Key words and phrases. KL divergence, second-order perceptron, informatics agent, simplex projection and fusion.
Acknowledgments. The economics departments at Thammasat University and Queen's University greatly supported the study. The author warmly thanks Frank Flatters, Frank Milne, Ted Neave and Pranee Tinakorn for the encouragement, without which this paper would not have been written. He certainly appreciates comments and discussions with the Sukniyom, Ko-Kao, Chai-Shane, Lek-Air, Ron-Fran, Pui, NaPoj and, of course, Kay and Nongyao.
Contact. facebook.com/popon.kangpenkae.
Symmetric squared decomposition (SSD).
$$\Sigma[i] = \Upsilon^2[i], \qquad \Upsilon[i] = Q[i]\,\operatorname{diag}\!\left(\sqrt{\lambda[i]_1}, \ldots, \sqrt{\lambda[i]_d}\right) Q[i]^{\top}$$
where $Q[i]$ is orthogonal and holds the eigenvectors of $\Sigma[i]$, and $\lambda[i]_1, \ldots, \lambda[i]_d$ are the eigenvalues of $\Sigma[i]$. Of course $\Upsilon[i]$ and $\Sigma[i]$ are symmetric PSD.
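The SSD is straightforward to compute with an eigendecomposition; a sketch in NumPy (the helper name `ssd` is ours):

```python
import numpy as np

def ssd(Sigma):
    """Symmetric squared decomposition: return Upsilon with Upsilon @ Upsilon = Sigma.

    Sigma must be symmetric PSD; Upsilon = Q diag(sqrt(lambda)) Q^T is again
    symmetric PSD, as stated in the text.
    """
    lam, Q = np.linalg.eigh(Sigma)      # eigenvalues and orthogonal eigenvectors
    lam = np.clip(lam, 0.0, None)       # guard against tiny negative round-off
    return Q @ np.diag(np.sqrt(lam)) @ Q.T

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Upsilon = ssd(Sigma)
assert np.allclose(Upsilon @ Upsilon, Sigma)   # Upsilon^2 = Sigma
assert np.allclose(Upsilon, Upsilon.T)         # symmetric
```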
Inversion [PP08, §146].
$$(A + BCD)^{-1} = A^{-1} - A^{-1}B\left(C^{-1} + DA^{-1}B\right)^{-1} DA^{-1};$$
for our application,
$$\Sigma^{-1} = \Sigma_i^{-1} + \frac{x_i x_i^{\top}}{c} \;\Rightarrow\; \Sigma = \Sigma_i - \frac{\Sigma_i x_i x_i^{\top}\Sigma_i}{c + x_i^{\top}\Sigma_i x_i}.$$
Differentiation [PP08, §§78, 49, 102, 83].
$$\frac{\partial}{\partial \mu}(\mu_i-\mu)^{\top}\Upsilon_i^{-2}(\mu_i-\mu) = 2\Upsilon_i^{-2}(\mu-\mu_i)$$
$$\frac{\partial}{\partial \Upsilon}\ln\det\Upsilon^2 = 2\Upsilon^{-1}$$
$$\frac{\partial}{\partial \Upsilon}\operatorname{Tr}\!\left(\Upsilon_i^{-2}\Upsilon^2\right) = \Upsilon_i^{-2}\Upsilon + \Upsilon\Upsilon_i^{-2}$$
$$\frac{\partial}{\partial \Upsilon}\, x_i^{\top}\Upsilon^2 x_i = x_i x_i^{\top}\Upsilon + \Upsilon x_i x_i^{\top} = \frac{\partial}{\partial \Upsilon}\|\Upsilon x_i\|^2$$
$$\frac{\partial}{\partial \Upsilon}\|\Upsilon x_i\| = \frac{x_i x_i^{\top}\Upsilon + \Upsilon x_i x_i^{\top}}{2\|\Upsilon x_i\|} = \frac{x_i x_i^{\top}\Upsilon + \Upsilon x_i x_i^{\top}}{2\sqrt{x_i^{\top}\Upsilon^2 x_i}}$$
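The rank-one inversion identity above is cheap to verify numerically; a NumPy sketch with arbitrary SPD data:

```python
import numpy as np

# Numerical check of the rank-one inversion identity:
# (Sigma_i^{-1} + x x^T / c)^{-1} = Sigma_i - Sigma_i x x^T Sigma_i / (c + x^T Sigma_i x)
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma_i = A @ A.T + np.eye(4)          # a generic SPD matrix
x = rng.normal(size=4)
c = 2.0

lhs = np.linalg.inv(np.linalg.inv(Sigma_i) + np.outer(x, x) / c)
rhs = Sigma_i - np.outer(Sigma_i @ x, Sigma_i @ x) / (c + x @ Sigma_i @ x)
assert np.allclose(lhs, rhs)
```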
KL divergence.
$$D_{KL}\!\left(N(\mu, \Upsilon^2)\,\|\,N(\mu_i, \Upsilon_i^2)\right) = \frac{1}{2}\left[\ln\frac{\det\Upsilon_i^2}{\det\Upsilon^2} + \operatorname{Tr}\!\left(\Upsilon_i^{-2}\Upsilon^2\right) + (\mu_i-\mu)^{\top}\Upsilon_i^{-2}(\mu_i-\mu)\right]$$
(an additive constant, irrelevant to the optimization, is dropped).
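For reference, the divergence can be coded directly; this sketch keeps the additive constant $-d$ that the display above drops, so identical distributions give exactly zero (function name ours):

```python
import numpy as np

def kl_gaussians(mu, Sigma, mu_i, Sigma_i):
    """D_KL(N(mu, Sigma) || N(mu_i, Sigma_i)), including the -d constant."""
    d = len(mu)
    Si_inv = np.linalg.inv(Sigma_i)
    diff = mu_i - mu
    return 0.5 * (np.log(np.linalg.det(Sigma_i) / np.linalg.det(Sigma)) - d
                  + np.trace(Si_inv @ Sigma) + diff @ Si_inv @ diff)

mu = np.array([0.4, 0.6])
Sigma = np.array([[0.2, 0.05], [0.05, 0.1]])
assert abs(kl_gaussians(mu, Sigma, mu, Sigma)) < 1e-12   # KL(p || p) = 0
assert kl_gaussians(mu + 0.1, Sigma, mu, Sigma) > 0      # KL is nonnegative
```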
3. Approximation
3.1. This section refers to [CDR07, LHZG11] for the model concept and definition.
As the KL simplex solution in $\triangle$ does not have a closed form, the approximation starts with $\overleftrightarrow{\triangle}$:
$$\left(\mu_{i+1}, \Sigma_{i+1}\right) = \operatorname*{argmin}_{\mu,\,\Sigma}\; D_{KL}\!\left(N(\mu,\Sigma)\,\|\,N(\mu_i,\Sigma_i)\right)$$
$$\text{subject to}\quad \hbar\!\left(y_i f(\mu\cdot x_i) - \epsilon\right) \geq \phi\sqrt{x_i^{\top}\Sigma x_i}, \qquad y_i \in \{-1, 1\}, \qquad \mu \in \overleftrightarrow{\triangle}.$$
Applying the main result in [LCLMV04, VI.2], an invariance theorem is straightforward.
Theorem. The optimal pair $\left(\mu_{i+1}, \Sigma_{i+1}\right)$ is invariant to similarity-metric divergences.
We consider the {normal, hinge, hinge²} constraints (see section 5), each with two flavors, $\{$linear, logarithm$\} = \{[\mathrm{lin}], [\mathrm{ln}]\} \ni f(\cdot)$. Let $\Sigma[i] = \Upsilon^2[i]$, where $\Upsilon[i]$ comes from the SSD; the $\hbar$-Lagrangian is
$$L = \frac{1}{2}\left[\ln\frac{\det\Upsilon_i^2}{\det\Upsilon^2} + \operatorname{Tr}\!\left(\Upsilon_i^{-2}\Upsilon^2\right) + (\mu_i-\mu)^{\top}\Upsilon_i^{-2}(\mu_i-\mu)\right] + \alpha\left(\phi\|\Upsilon x_i\| - \hbar\right) + \rho\left(\mu\cdot\mathbf{1} - 1\right)$$
Define the hinge function $\lfloor z\rfloor = \max\{0, z\}$ and the indicator $\langle z\rangle = \lfloor z\rfloor/|z| \in \{0, 1\}$.
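The hinge $\lfloor\cdot\rfloor$ and the indicator $\langle\cdot\rangle$ used throughout are, in Python (helper names ours; we take $\langle 0\rangle = 0$):

```python
def hinge(z):
    """The hinge: floor-bracket z = max{0, z}."""
    return max(0.0, z)

def gate(z):
    """The indicator: angle-bracket z = hinge(z)/|z|, in {0, 1} (0 at z = 0)."""
    return hinge(z) / abs(z) if z != 0 else 0.0

assert hinge(-1.5) == 0.0 and hinge(2.0) == 2.0
assert gate(3.0) == 1.0 and gate(-3.0) == 0.0
```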
3.2. [normal], $\hbar_\emptyset$. 3.2.1. Linear: $\hbar_\emptyset[\mathrm{lin}]$.
Lemma 1. $\Sigma_{i+1}^{-1} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$
$$\Sigma_{i+1}^{-1} = \Sigma_i^{-1} + \frac{\alpha\phi\,x_i x_i^{\top}}{\sqrt{x_i^{\top}\Sigma_{i+1}x_i}}$$
Lemma 2. $\Sigma_{i+1} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$
$$\Sigma_{i+1} = \Sigma_i - \beta\,\Sigma_i x_i x_i^{\top}\Sigma_i, \quad \text{where } \beta = \frac{\alpha\phi}{\sqrt{u_i} + \alpha\phi\,\upsilon_i},\;\; (u_i, \upsilon_i) \equiv \left(x_i^{\top}\Sigma_{i+1}x_i,\, x_i^{\top}\Sigma_i x_i\right)$$
Lemma 3. $\sqrt{u_i} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$
$$\sqrt{u_i} = \frac{-\alpha\phi\,\upsilon_i + \sqrt{\alpha^2\phi^2\upsilon_i^2 + 4\upsilon_i}}{2}$$
Lemma 4. $\mu_{i+1} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$
$$\mu_{i+1} = \mu_i + \alpha y_i\Sigma_i(x_i - \bar{x}_i), \quad \text{where } \bar{x}_i = \bar{x}\,\mathbf{1} \equiv \frac{\mathbf{1}^{\top}\Sigma_i x_i}{\mathbf{1}^{\top}\Sigma_i\mathbf{1}}\,\mathbf{1}$$
Lemma 5. $\alpha \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$
$$\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor \quad \text{such that } (a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$$
$$\left(\lambda, \lambda'\right) = \left(y_i(\mu_i\cdot x_i) - \epsilon,\; x_i^{\top}\Sigma_i(x_i - \bar{x}_i)\right)$$
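Lemmas 1–5 assemble into a single online update; a hedged sketch in NumPy (function and variable names are ours; the ± root is resolved by clipping α at zero, which is our reading of the $\lfloor\cdot\rfloor$ in Lemma 5):

```python
import numpy as np

def update_normal_linear(mu_i, Sigma_i, x_i, y_i, eps, phi):
    """One online update for the normal/linear case, assembled from Lemmas 1-5.
    A sketch under the paper's notation, not the author's reference code."""
    ones = np.ones_like(mu_i)
    xbar = (ones @ Sigma_i @ x_i) / (ones @ Sigma_i @ ones) * ones  # Lemma 4's x-bar
    lam = y_i * (mu_i @ x_i) - eps                                  # lambda
    lam_p = x_i @ Sigma_i @ (x_i - xbar)                            # lambda'
    ups = x_i @ Sigma_i @ x_i                                       # upsilon_i
    # Lemma 5: quadratic a alpha^2 + b alpha + c = 0, alpha clipped at 0.
    a = lam_p * (lam_p + ups * phi**2)
    b = 2.0 * lam * (lam_p + ups * phi**2 / 2.0)
    c = lam**2 - ups * phi**2
    alpha = max((-b + np.sqrt(max(b * b - 4 * a * c, 0.0))) / (2 * a), 0.0)
    # Lemmas 3 and 2: sqrt(u_i), beta, and the covariance update.
    sqrt_u = (-alpha * phi * ups + np.sqrt(alpha**2 * phi**2 * ups**2 + 4 * ups)) / 2
    beta = alpha * phi / (sqrt_u + alpha * phi * ups)
    Sigma = Sigma_i - beta * np.outer(Sigma_i @ x_i, Sigma_i @ x_i)
    # Lemma 4: the mean step; 1 . Sigma_i (x_i - xbar) = 0 keeps mu . 1 = 1.
    mu = mu_i + alpha * y_i * Sigma_i @ (x_i - xbar)
    return mu, Sigma

mu_new, Sigma_new = update_normal_linear(
    np.array([0.5, 0.3, 0.2]), 0.1 * np.eye(3),
    np.array([1.0, 2.0, 0.5]), y_i=1.0, eps=1.5, phi=1.0)
```

Because the step direction $\Sigma_i(x_i - \bar{x}_i)$ is orthogonal to $\mathbf{1}$, the update stays on $\overleftrightarrow{\triangle}$; projection onto $\triangle$ (section 4) is still needed for nonnegativity.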
3.2.2. Logarithm: $\hbar_\emptyset[\mathrm{ln}]$.
Lemma 6. $\Sigma_{i+1}^{-1} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$ ≡ Lemma 1.
Lemma 7. $\Sigma_{i+1} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$ ≡ Lemma 2.
Lemma 8. $\sqrt{u_i} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$ ≡ Lemma 3.
Lemma 9. $\mu_{i+1} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$
$$\mu_{i+1} \approx \mu_i + \frac{\alpha y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i), \quad \text{where } \bar{x}_i = \bar{x}\,\mathbf{1} \equiv \frac{\mathbf{1}^{\top}\Sigma_i x_i}{\mathbf{1}^{\top}\Sigma_i\mathbf{1}}\,\mathbf{1}$$
Lemma 10. $\alpha \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$
$$\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor \quad \text{such that } (a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$$
$$\left(\lambda, \lambda'\right) \approx \left(y_i\ln(\mu_i\cdot x_i) - \epsilon,\; \frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}\right)$$
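The logarithmic flavor differs only in $(\lambda, \lambda')$ and in the $1/(\mu_i\cdot x_i)$ factor of the mean step (Lemmas 9–10); a sketch under the same assumptions as before (names ours, $\mu_i\cdot x_i > 0$ assumed):

```python
import numpy as np

def update_normal_log(mu_i, Sigma_i, x_i, y_i, eps, phi):
    """One online update for the normal/logarithm case (Lemmas 6-10); a sketch
    using the first-order expansion of ln(mu . x_i) from section 2."""
    ones = np.ones_like(mu_i)
    xbar = (ones @ Sigma_i @ x_i) / (ones @ Sigma_i @ ones) * ones
    m = mu_i @ x_i                                   # mu_i . x_i > 0 assumed
    lam = y_i * np.log(m) - eps                      # lambda  (Lemma 10)
    lam_p = (x_i @ Sigma_i @ (x_i - xbar)) / m**2    # lambda' (Lemma 10)
    ups = x_i @ Sigma_i @ x_i
    a = lam_p * (lam_p + ups * phi**2)
    b = 2.0 * lam * (lam_p + ups * phi**2 / 2.0)
    c = lam**2 - ups * phi**2
    alpha = max((-b + np.sqrt(max(b * b - 4 * a * c, 0.0))) / (2 * a), 0.0)
    sqrt_u = (-alpha * phi * ups + np.sqrt(alpha**2 * phi**2 * ups**2 + 4 * ups)) / 2
    beta = alpha * phi / (sqrt_u + alpha * phi * ups)
    Sigma = Sigma_i - beta * np.outer(Sigma_i @ x_i, Sigma_i @ x_i)
    mu = mu_i + (alpha * y_i / m) * Sigma_i @ (x_i - xbar)           # Lemma 9
    return mu, Sigma

mu_new, Sigma_new = update_normal_log(
    np.array([0.5, 0.3, 0.2]), 0.1 * np.eye(3),
    np.array([1.0, 2.0, 0.5]), y_i=1.0, eps=0.5, phi=1.0)
```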
3.3. [hinge], $\hbar_1$, and [hinge²], $\hbar_2$. 3.3.1. Linear: $\hbar_1[\mathrm{lin}]$, $\hbar_2[\mathrm{lin}]$.
$\Sigma_{i+1}^{-1} \blacktriangleright \hbar_{[1,2]}[\mathrm{lin}]$: Lemma 11 ≡ Lemma 21 ≡ Lemma 1 ($\Sigma_{i+1}^{-1} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$).
$\Sigma_{i+1} \blacktriangleright \hbar_{[1,2]}[\mathrm{lin}]$: Lemma 12 ≡ Lemma 22 ≡ Lemma 2 ($\Sigma_{i+1} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$).
$\sqrt{u_i} \blacktriangleright \hbar_{[1,2]}[\mathrm{lin}]$: Lemma 13 ≡ Lemma 23 ≡ Lemma 3 ($\sqrt{u_i} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$).
Lemma 14. $\mu_{i+1} \blacktriangleright \hbar_1[\mathrm{lin}]$
$$\mu_{i+1} = \mu_i + \left\langle y_i(\mu_i\cdot x_i) - \epsilon\right\rangle\alpha y_i\Sigma_i(x_i - \bar{x}_i), \quad \text{where } \bar{x}_i = \bar{x}\,\mathbf{1} \equiv \frac{\mathbf{1}^{\top}\Sigma_i x_i}{\mathbf{1}^{\top}\Sigma_i\mathbf{1}}\,\mathbf{1}$$
Lemma 15. $\alpha \blacktriangleright \hbar_1[\mathrm{lin}]$
$$\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor \quad \text{such that } (a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$$
$$\left(\lambda, \lambda'\right) = \left(y_i(\mu_i\cdot x_i) - \epsilon,\; x_i^{\top}\Sigma_i(x_i - \bar{x}_i)\right)$$
Lemma 24. $\mu_{i+1} \blacktriangleright \hbar_2[\mathrm{lin}]$
$$\mu_{i+1} = \mu_i + \frac{\left\lfloor y_i(\mu_i\cdot x_i) - \epsilon\right\rfloor}{0.5\alpha^{-1} - x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}\;y_i\Sigma_i(x_i - \bar{x}_i), \quad \text{where } \bar{x}_i = \bar{x}\,\mathbf{1} \equiv \frac{\mathbf{1}^{\top}\Sigma_i x_i}{\mathbf{1}^{\top}\Sigma_i\mathbf{1}}\,\mathbf{1}$$
Lemma 25. $\alpha \blacktriangleright \hbar_2[\mathrm{lin}]$
$$\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor \quad \text{such that } (a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$$
$$\left(\lambda, \lambda'\right) \approx \left(\left(y_i(\mu_i\cdot x_i) - \epsilon\right)^2,\; 4\lambda\, x_i^{\top}\Sigma_i(x_i - \bar{x}_i)\right)$$
3.3.2. Logarithm: $\hbar_1[\mathrm{ln}]$, $\hbar_2[\mathrm{ln}]$.
$\Sigma_{i+1}^{-1} \blacktriangleright \hbar_{[1,2]}[\mathrm{ln}]$: Lemma 16 ≡ Lemma 26 ≡ Lemma 6 ($\Sigma_{i+1}^{-1} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$).
$\Sigma_{i+1} \blacktriangleright \hbar_{[1,2]}[\mathrm{ln}]$: Lemma 17 ≡ Lemma 27 ≡ Lemma 7 ($\Sigma_{i+1} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$).
$\sqrt{u_i} \blacktriangleright \hbar_{[1,2]}[\mathrm{ln}]$: Lemma 18 ≡ Lemma 28 ≡ Lemma 8 ($\sqrt{u_i} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$).
Lemma 19. $\mu_{i+1} \blacktriangleright \hbar_1[\mathrm{ln}]$
$$\mu_{i+1} = \mu_i + \left\langle y_i\ln(\mu_i\cdot x_i) - \epsilon\right\rangle\frac{\alpha y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i), \quad \text{where } \bar{x}_i = \bar{x}\,\mathbf{1} \equiv \frac{\mathbf{1}^{\top}\Sigma_i x_i}{\mathbf{1}^{\top}\Sigma_i\mathbf{1}}\,\mathbf{1}$$
Lemma 20. $\alpha \blacktriangleright \hbar_1[\mathrm{ln}]$
$$\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor \quad \text{such that } (a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$$
$$\left(\lambda, \lambda'\right) \approx \left(y_i\ln(\mu_i\cdot x_i) - \epsilon,\; \frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}\right)$$
Lemma 29. $\mu_{i+1} \blacktriangleright \hbar_2[\mathrm{ln}]$
$$\mu_{i+1} \approx \mu_i + \frac{\left\lfloor y_i\ln(\mu_i\cdot x_i) - \epsilon\right\rfloor}{0.5\alpha^{-1} - \frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}}\cdot\frac{y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i), \quad \text{where } \bar{x}_i = \bar{x}\,\mathbf{1} \equiv \frac{\mathbf{1}^{\top}\Sigma_i x_i}{\mathbf{1}^{\top}\Sigma_i\mathbf{1}}\,\mathbf{1}$$
Lemma 30. $\alpha \blacktriangleright \hbar_2[\mathrm{ln}]$
$$\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor \quad \text{such that } (a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$$
$$\left(\lambda, \lambda'\right) \approx \left(\left(y_i\ln(\mu_i\cdot x_i) - \epsilon\right)^2,\; 4\lambda\,\frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}\right)$$
4. Implementation
4.1. The results in section 3 are valid for the $\overleftrightarrow{\triangle}$ simplex. A more common constraint is the $\triangle$ simplex; however, a closed-form solution is not possible with this simplex. Projecting the simplex $\overleftrightarrow{\triangle}$ onto $\triangle$ is a practical approximation; [LHZG11] reports the effectiveness of this method. The projection necessarily requires a certain transformation of the covariance matrix $\Sigma$. Further information on implementing the projection algorithm and the covariance transformation is in [CY11] and [LHZG11], respectively.
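A standard sort-based implementation of Euclidean projection onto the probability simplex, in the spirit of [CY11] (the covariance transformation is not shown):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    n = len(v)
    u = np.sort(v)[::-1]                  # sort in descending order
    css = np.cumsum(u)
    ks = np.arange(1, n + 1)
    rho = np.nonzero(u + (1.0 - css) / ks > 0)[0][-1]   # largest feasible index
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.clip(v + theta, 0.0, None)

p = project_simplex(np.array([0.4, 0.7, 0.1]))
assert abs(p.sum() - 1.0) < 1e-12 and p.min() >= 0.0
# A point already in the simplex is a fixed point of the projection.
q = np.array([0.2, 0.3, 0.5])
assert np.allclose(project_simplex(q), q)
```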
Conjecture. Correlation transform is an nSD-effective covariance transformer.
4.2. Section 3 presents various choices of simplex, from which one can limit the set of simplexes using a statistical-dominance concept, e.g. nSD-effectiveness. Projecting the simplexes and then integrating, or fusing, them is in practice an empirical issue. We define a new simplex-fusing method, FED (fusing extensive dimension), as follows. Let $\triangle_{i \in \{1 \ldots m\}}$ be a set of nSD-effective simplexes, each $\triangle_i \in [0,1]^N$. Concatenate the $m$ subsimplexes into a vector in $[0,1]^{m \cdot N}$ and apply simplex projection to that vector. The result is a simplex $\triangle \in [0,1]^{m \cdot N}$; overlay this simplex, i.e. slot $\triangle$ into $m$ vectors in $[0,1]^N$ and sum the vectors with the proper array. The overlay composes a FED simplex $\in [0,1]^N$.
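A sketch of our reading of the FED construction, reusing the sort-based projection of [CY11] (helper names ours; "overlay" is read as folding the projected $m \cdot N$ vector into $m$ length-$N$ slots and summing them):

```python
import numpy as np

def _project_simplex(v):
    """Sort-based Euclidean projection onto the probability simplex [CY11]."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.clip(v + (1.0 - css[rho]) / (rho + 1.0), 0.0, None)

def fed_fuse(subsimplexes):
    """FED fusion: concatenate the m subsimplexes, project the m*N vector onto
    the simplex, then overlay (fold and sum) back to length N."""
    m, N = subsimplexes.shape
    flat = _project_simplex(subsimplexes.reshape(m * N))
    return flat.reshape(m, N).sum(axis=0)   # the overlay: sum the m slots

fused = fed_fuse(np.array([[0.5, 0.5], [0.9, 0.1]]))
assert abs(fused.sum() - 1.0) < 1e-12 and fused.min() >= 0.0
```

By construction the overlay sums nonnegative blocks whose total is 1, so the fused vector is itself a point of $\triangle \in [0,1]^N$.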
Conjecture. FED simplex is an nSD-effective fuse of its nSD-effective subsimplex.
(nSD-effective: empirically non-dominated, with respect to the n-order stochastic dominance definition [Dav06].)
5. Remark
5.1. The logic of the confidence constraint. Suppose $\frac{F(w\cdot x_i) - \mu_{F(w\cdot x_i)}}{\sigma_{F(w\cdot x_i)}} = Z \sim \Phi\text{-cdf}$; consider a generic confidence constraint $\Pr(F(w\cdot x_i) \geq 0) \geq \eta \equiv \Phi(\phi)$.
$$\Pr\!\left(\frac{F(w\cdot x_i) - \mu_{F(w\cdot x_i)}}{\sigma_{F(w\cdot x_i)}} \geq \frac{-\mu_{F(w\cdot x_i)}}{\sigma_{F(w\cdot x_i)}}\right) \geq \eta \;\Rightarrow\; \Phi\!\left(\frac{-\mu_{F(w\cdot x_i)}}{\sigma_{F(w\cdot x_i)}}\right) \leq 1-\eta$$
$$\frac{-\mu_{F(w\cdot x_i)}}{\sigma_{F(w\cdot x_i)}} \leq \Phi^{-1}(1-\eta) = -\Phi^{-1}(\eta) \;\Rightarrow\; \mu_{F(w\cdot x_i)} \geq \Phi^{-1}(\eta)\,\sigma_{F(w\cdot x_i)} = \phi\,\sigma_{F(w\cdot x_i)}$$
i.e. the distance $\mu_{F(w\cdot x_i)} - \phi\,\sigma_{F(w\cdot x_i)}$ determines the proximity to the confidence constraint; [OC09] discusses the validity of a similar approach for online optimization.
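A minimal illustration of the equivalence $\Pr(F \geq 0) \geq \eta \Leftrightarrow \mu_F \geq \phi\,\sigma_F$, with $\eta = 0.9$ chosen arbitrarily:

```python
from statistics import NormalDist

# phi = Phi^{-1}(eta): for F ~ N(mu_F, sigma_F^2), Pr(F >= 0) = Phi(mu_F / sigma_F),
# so the constraint Pr(F >= 0) >= eta is exactly mu_F >= phi * sigma_F.
nd = NormalDist()
eta = 0.9
phi = nd.inv_cdf(eta)

# Boundary check: with mu_F = phi * sigma_F the constraint holds with equality.
sigma_F = 2.0
mu_F = phi * sigma_F
assert abs(nd.cdf(mu_F / sigma_F) - eta) < 1e-9
```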
5.2. The approximating property of the {normal, hinge, hinge²} confidence. Define the {normal, hinge, hinge²} functions as follows:
$$\text{normal:}\quad \hbar_\emptyset[f] \in \left\{\hbar_\emptyset[\mathrm{lin}], \hbar_\emptyset[\mathrm{ln}]\right\} \equiv \left\{y_i(\mu\cdot x_i)-\epsilon,\; y_i\ln(\mu\cdot x_i)-\epsilon\right\}$$
$$\text{hinge:}\quad \hbar_1[f] \in \left\{\hbar_1[\mathrm{lin}], \hbar_1[\mathrm{ln}]\right\} \equiv \left\{\lfloor y_i(\mu\cdot x_i)-\epsilon\rfloor,\; \lfloor y_i\ln(\mu\cdot x_i)-\epsilon\rfloor\right\}$$
$$\text{hinge}^2\!:\quad \hbar_2[f] \in \left\{\hbar_2[\mathrm{lin}], \hbar_2[\mathrm{ln}]\right\} \equiv \left\{\lfloor y_i(\mu\cdot x_i)-\epsilon\rfloor^2,\; \lfloor y_i\ln(\mu\cdot x_i)-\epsilon\rfloor^2\right\}$$
As a result of the assumption $w \sim N(\mu, \Sigma = \Upsilon^2)$:
normal: $\hbar_\emptyset[\mathrm{lin}]$ is exact; $\hbar_\emptyset[\mathrm{ln}]$ is approximate.
$$F(w\cdot x_i) = y_i(w\cdot x_i)-\epsilon \;\Rightarrow\; \left(\mu_{F(w\cdot x_i)},\, \sigma^2_{F(w\cdot x_i)}\right) = \left(y_i(\mu\cdot x_i)-\epsilon,\; \sigma^2_{w\cdot x_i} = x_i^{\top}\Sigma x_i\right)$$
$$F(w\cdot x_i) = y_i\ln(w\cdot x_i)-\epsilon \;\Rightarrow\; \mu_{F(w\cdot x_i)} \approx y_i\ln(\mu\cdot x_i)-\epsilon$$
hinge: $\hbar_1[\mathrm{lin}]$ and $\hbar_1[\mathrm{ln}]$ are approximate.
$$F(w\cdot x_i) = \lfloor y_i(w\cdot x_i)-\epsilon\rfloor \;\Rightarrow\; \mu_{F(w\cdot x_i)} \approx \lfloor y_i(\mu\cdot x_i)-\epsilon\rfloor$$
$$F(w\cdot x_i) = \lfloor y_i\ln(w\cdot x_i)-\epsilon\rfloor \;\Rightarrow\; \mu_{F(w\cdot x_i)} \approx \lfloor y_i\ln(\mu\cdot x_i)-\epsilon\rfloor$$
hinge²: $\hbar_2[\mathrm{lin}]$ and $\hbar_2[\mathrm{ln}]$ are approximate.
$$F(w\cdot x_i) = \lfloor y_i(w\cdot x_i)-\epsilon\rfloor^2 \;\Rightarrow\; \mu_{F(w\cdot x_i)} \approx \lfloor y_i(\mu\cdot x_i)-\epsilon\rfloor^2$$
$$F(w\cdot x_i) = \lfloor y_i\ln(w\cdot x_i)-\epsilon\rfloor^2 \;\Rightarrow\; \mu_{F(w\cdot x_i)} \approx \lfloor y_i\ln(\mu\cdot x_i)-\epsilon\rfloor^2$$
Appendix
Lemma 1. $\Sigma_{i+1}^{-1} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$
$$\frac{\partial L}{\partial \Upsilon} = 0 = -\Upsilon^{-1} + \frac{1}{2}\Upsilon_i^{-2}\Upsilon + \frac{1}{2}\Upsilon\Upsilon_i^{-2} + \frac{\alpha\phi\, x_i x_i^{\top}\Upsilon}{2\sqrt{x_i^{\top}\Upsilon^2 x_i}} + \frac{\alpha\phi\, \Upsilon x_i x_i^{\top}}{2\sqrt{x_i^{\top}\Upsilon^2 x_i}}$$
The $\Upsilon^{-1}$ update condition is
$$\Upsilon^{-1} = \frac{1}{2}\Upsilon_i^{-2}\Upsilon + \frac{1}{2}\Upsilon\Upsilon_i^{-2} + \frac{\alpha\phi\, x_i x_i^{\top}\Upsilon}{2\sqrt{x_i^{\top}\Upsilon^2 x_i}} + \frac{\alpha\phi\, \Upsilon x_i x_i^{\top}}{2\sqrt{x_i^{\top}\Upsilon^2 x_i}}$$
Start with the solution, the implicit $\Upsilon^{-2}$ update
$$\Upsilon^{-2} \equiv \Upsilon_{i+1}^{-2} = \Upsilon_i^{-2} + \frac{\alpha\phi\, x_i x_i^{\top}}{\sqrt{x_i^{\top}\Upsilon^2 x_i}}$$
which yields
$$\frac{\Upsilon^{-1}}{2} = \frac{\Upsilon_i^{-2}\Upsilon}{2} + \frac{\alpha\phi}{2}\cdot\frac{x_i x_i^{\top}\Upsilon}{\sqrt{x_i^{\top}\Upsilon^2 x_i}} \quad [\times\Upsilon]$$
$$\frac{\Upsilon^{-1}}{2} = \frac{\Upsilon\Upsilon_i^{-2}}{2} + \frac{\alpha\phi}{2}\cdot\frac{\Upsilon x_i x_i^{\top}}{\sqrt{x_i^{\top}\Upsilon^2 x_i}} \quad [\Upsilon\times]$$
$$\Upsilon^{-2} \;\Rightarrow\; [\times\Upsilon] + [\Upsilon\times] \;\Rightarrow\; \Upsilon^{-1}$$
i.e. the implicit $\Upsilon^{-2}$ update satisfies the $\Upsilon^{-1}$ update condition. The result is direct from the replacement $\left(\Upsilon_i^2, \Upsilon^2\right) = (\Sigma_i, \Sigma_{i+1})$:
$$\Sigma_{i+1}^{-1} = \Sigma_i^{-1} + \frac{\alpha\phi\, x_i x_i^{\top}}{\sqrt{x_i^{\top}\Sigma_{i+1}x_i}}$$
Lemma 2. $\Sigma_{i+1} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$
Apply matrix inversion to $\Sigma_{i+1}^{-1} = \Sigma_i^{-1} + \frac{\alpha\phi\, x_i x_i^{\top}}{\sqrt{x_i^{\top}\Sigma_{i+1}x_i}}$:
$$\Sigma_{i+1} = \Sigma_i - \frac{\Sigma_i x_i x_i^{\top}\Sigma_i}{\frac{\sqrt{x_i^{\top}\Sigma_{i+1}x_i}}{\alpha\phi} + x_i^{\top}\Sigma_i x_i} = \Sigma_i - \frac{\alpha\phi\,\Sigma_i x_i x_i^{\top}\Sigma_i}{\sqrt{x_i^{\top}\Sigma_{i+1}x_i} + \alpha\phi\, x_i^{\top}\Sigma_i x_i}$$
$$\Sigma_{i+1} = \Sigma_i - \frac{\alpha\phi\,\Sigma_i x_i x_i^{\top}\Sigma_i}{\sqrt{u_i} + \alpha\phi\,\upsilon_i} = \Sigma_i - \beta\,\Sigma_i x_i x_i^{\top}\Sigma_i$$
Lemma 3. $\sqrt{u_i} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$
$$\Sigma_{i+1} = \Sigma_i - \frac{\alpha\phi\,\Sigma_i x_i x_i^{\top}\Sigma_i}{\sqrt{u_i} + \alpha\phi\,\upsilon_i} \;\Rightarrow\; x_i^{\top}\Sigma_{i+1}x_i = x_i^{\top}\Sigma_i x_i - \frac{\alpha\phi\left(x_i^{\top}\Sigma_i x_i\right)^2}{\sqrt{u_i} + \alpha\phi\,\upsilon_i}$$
$$u_i = \upsilon_i - \frac{\alpha\phi\,\upsilon_i^2}{\sqrt{u_i} + \alpha\phi\,\upsilon_i} \;\Rightarrow\; \sqrt{u_i} = \frac{-\alpha\phi\,\upsilon_i + \sqrt{\alpha^2\phi^2\upsilon_i^2 + 4\upsilon_i}}{2}$$
Lemma 4. $\mu_{i+1} \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$
$$\frac{\partial L}{\partial \mu} = 0 = \Upsilon_i^{-2}(\mu-\mu_i) - \alpha\hbar_\emptyset' f' y_i x_i + \rho\mathbf{1}; \qquad \frac{\partial L}{\partial \rho} = 0 = \mu\cdot\mathbf{1} - 1$$
$$\Upsilon_i^{-2}(\mu-\mu_i) - \alpha\hbar_\emptyset' f' y_i x_i + \rho\mathbf{1} = 0 \;\Rightarrow\; \mu = \mu_i + \Upsilon_i^2\left(\alpha\hbar_\emptyset' f' y_i x_i - \rho\mathbf{1}\right)$$
$$\mathbf{1}^{\top}\mu = \mathbf{1}^{\top}\mu_i + \alpha\hbar_\emptyset' f' y_i\,\mathbf{1}^{\top}\Upsilon_i^2 x_i - \rho\,\mathbf{1}^{\top}\Upsilon_i^2\mathbf{1} \;\Rightarrow\; \rho\mathbf{1} = \alpha\hbar_\emptyset' f' y_i\,\frac{\mathbf{1}^{\top}\Upsilon_i^2 x_i}{\mathbf{1}^{\top}\Upsilon_i^2\mathbf{1}}\,\mathbf{1} = \alpha\hbar_\emptyset' f' y_i\,\bar{x}_i$$
$$\Rightarrow\; \mu = \mu_i + \alpha\hbar_\emptyset' f' y_i\,\Upsilon_i^2(x_i - \bar{x}_i)$$
Use $\hbar_\emptyset'(\cdot) = 1$, $f'(\cdot) = 1$, and $\Upsilon_i^2 = \Sigma_i$ to have $\mu_{i+1} = \mu = \mu_i + \alpha y_i\Sigma_i(x_i - \bar{x}_i)$.
Lemma 5. $\alpha \blacktriangleright \hbar_\emptyset[\mathrm{lin}]$
From Lemma 3, $\sqrt{u_i} = \frac{-\alpha\phi\,\upsilon_i + \sqrt{\alpha^2\phi^2\upsilon_i^2 + 4\upsilon_i}}{2}$, which can be simplified with $\lambda + \lambda'\alpha = \phi\sqrt{u_i}$. Its quadratic is $a\alpha^2 + b\alpha + c = 0$, such that $(a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$. The solution to $\lambda + \lambda'\alpha = \phi\sqrt{u_i}$ is $\frac{-b \pm \sqrt{b^2-4ac}}{2a}$; we choose $\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor$ to ensure a valid $\alpha \geq 0$.
To find $\left(\lambda, \lambda'\right)$, use the binding constraint $\phi\|\Upsilon x_i\| = \hbar_\emptyset[\mathrm{lin}] \Rightarrow \phi\|\Upsilon x_i\| = y_i(\mu\cdot x_i) - \epsilon$. Apply the update $\mu = \mu_i + \alpha y_i\Sigma_i(x_i - \bar{x}_i)$ and $\sqrt{u_i} \equiv \|\Upsilon x_i\|$:
$$\phi\sqrt{u_i} = y_i(\mu_i\cdot x_i) - \epsilon + \alpha\, x_i^{\top}\Sigma_i(x_i - \bar{x}_i)$$
i.e. $\left(\lambda, \lambda'\right) = \left(y_i(\mu_i\cdot x_i) - \epsilon,\; x_i^{\top}\Sigma_i(x_i - \bar{x}_i)\right)$.
Lemma 6. $\Sigma_{i+1}^{-1} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$ ≡ Lemma 1.
Lemma 7. $\Sigma_{i+1} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$ ≡ Lemma 2.
Lemma 8. $\sqrt{u_i} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$ ≡ Lemma 3.
Lemma 9. $\mu_{i+1} \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$
Similar to Lemma 4, $\mu = \mu_i + \alpha\hbar_\emptyset' f' y_i\Upsilon_i^2(x_i - \bar{x}_i)$; use $\hbar_\emptyset'(\cdot) = 1$, the expansion $\ln(\mu\cdot x_i) \approx \ln(\mu_i\cdot x_i) + \frac{(\mu-\mu_i)\cdot x_i}{\mu_i\cdot x_i} \Rightarrow f'(\cdot) = \frac{1}{\mu_i\cdot x_i}$, and $\Upsilon_i^2 = \Sigma_i$, which gives
$$\mu_{i+1} = \mu \approx \mu_i + \frac{\alpha y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i)$$
Lemma 10. $\alpha \blacktriangleright \hbar_\emptyset[\mathrm{ln}]$
Similar to Lemma 5, $\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor$ where $(a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$. To find $\left(\lambda, \lambda'\right)$, set the constraint binding: $\phi\|\Upsilon x_i\| = \hbar_\emptyset[\mathrm{ln}] \Rightarrow \phi\|\Upsilon x_i\| = y_i\ln(\mu\cdot x_i) - \epsilon$. Apply the update $\mu = \mu_i + \frac{\alpha y_i}{\mu_i\cdot x_i}\Sigma_i(x_i - \bar{x}_i)$, $\sqrt{u_i} \equiv \|\Upsilon x_i\|$, and the approximation $y_i\ln(\mu\cdot x_i) - \epsilon \approx y_i\left(\ln(\mu_i\cdot x_i) + \frac{(\mu-\mu_i)\cdot x_i}{\mu_i\cdot x_i}\right) - \epsilon$:
$$\phi\sqrt{u_i} \approx y_i\ln(\mu_i\cdot x_i) - \epsilon + \alpha\,\frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}$$
i.e. $\left(\lambda, \lambda'\right) \approx \left(y_i\ln(\mu_i\cdot x_i) - \epsilon,\; \frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}\right)$.
Lemma 11. $\Sigma_{i+1}^{-1} \blacktriangleright \hbar_1[\mathrm{lin}]$ ≡ Lemma 1.
Lemma 12. $\Sigma_{i+1} \blacktriangleright \hbar_1[\mathrm{lin}]$ ≡ Lemma 2.
Lemma 13. $\sqrt{u_i} \blacktriangleright \hbar_1[\mathrm{lin}]$ ≡ Lemma 3.
Lemma 14. $\mu_{i+1} \blacktriangleright \hbar_1[\mathrm{lin}]$
Similar to Lemma 4, $\mu = \mu_i + \alpha\hbar_1' f' y_i\Upsilon_i^2(x_i - \bar{x}_i)$. There are two cases, $y_i(\mu\cdot x_i) - \epsilon\ [>]\,[\leq]\ 0$.
Case [>]: $\hbar_1'(\cdot) = 1$, $f'(\cdot) = 1$, and $\Upsilon_i^2 = \Sigma_i \Rightarrow \mu_{i+1} = \mu = \mu_i + \alpha y_i\Sigma_i(x_i - \bar{x}_i)$.
Case [≤]: $\hbar_1'(\cdot) = 0 \Rightarrow \mu_{i+1} = \mu = \mu_i$.
With some manipulation we find a $\mu$-update
$$\mu_{i+1} = \mu_i + \left\langle y_i(\mu_i\cdot x_i) - \epsilon\right\rangle\alpha y_i\Sigma_i(x_i - \bar{x}_i)$$
Lemma 15. $\alpha \blacktriangleright \hbar_1[\mathrm{lin}]$
Similar to Lemma 5, $\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor$ where $(a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$. To find $\left(\lambda, \lambda'\right)$, use the binding constraint $\phi\|\Upsilon x_i\| = \hbar_1[\mathrm{lin}] \Rightarrow \phi\|\Upsilon x_i\| = \lfloor y_i(\mu\cdot x_i) - \epsilon\rfloor$. We only need the update case $y_i(\mu\cdot x_i) - \epsilon > 0$. Apply the update $\mu = \mu_i + \alpha y_i\Sigma_i(x_i - \bar{x}_i)$ and $\sqrt{u_i} \equiv \|\Upsilon x_i\|$:
$$\phi\sqrt{u_i} = \lfloor y_i(\mu\cdot x_i) - \epsilon\rfloor = y_i(\mu\cdot x_i) - \epsilon = y_i(\mu_i\cdot x_i) - \epsilon + \alpha\, x_i^{\top}\Sigma_i(x_i - \bar{x}_i)$$
i.e. $\left(\lambda, \lambda'\right) = \left(y_i(\mu_i\cdot x_i) - \epsilon,\; x_i^{\top}\Sigma_i(x_i - \bar{x}_i)\right)$.
Lemma 16. $\Sigma_{i+1}^{-1} \blacktriangleright \hbar_1[\mathrm{ln}]$ ≡ Lemma 6.
Lemma 17. $\Sigma_{i+1} \blacktriangleright \hbar_1[\mathrm{ln}]$ ≡ Lemma 7.
Lemma 18. $\sqrt{u_i} \blacktriangleright \hbar_1[\mathrm{ln}]$ ≡ Lemma 8.
Lemma 19. $\mu_{i+1} \blacktriangleright \hbar_1[\mathrm{ln}]$
Similar to Lemma 14, $\mu = \mu_i + \alpha\hbar_1' f' y_i\Upsilon_i^2(x_i - \bar{x}_i)$ with two cases, $y_i\ln(\mu\cdot x_i) - \epsilon\ [>]\,[\leq]\ 0$.
Case [>]: $\hbar_1'(\cdot) = 1$; $\ln(\mu\cdot x_i) \approx \ln(\mu_i\cdot x_i) + \frac{(\mu-\mu_i)\cdot x_i}{\mu_i\cdot x_i} \Rightarrow f'(\cdot) = \frac{1}{\mu_i\cdot x_i}$ and $\Upsilon_i^2 = \Sigma_i$:
$$\mu_{i+1} = \mu = \mu_i + \frac{\alpha y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i)$$
Case [≤]: $\hbar_1'(\cdot) = 0 \Rightarrow \mu_{i+1} = \mu = \mu_i$. With some manipulation we find a $\mu$-update
$$\mu_{i+1} = \mu_i + \left\langle y_i\ln(\mu_i\cdot x_i) - \epsilon\right\rangle\frac{\alpha y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i)$$
Lemma 20. $\alpha \blacktriangleright \hbar_1[\mathrm{ln}]$
Similar to Lemma 15, $\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor$ where $(a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$. To find $\left(\lambda, \lambda'\right)$, set the constraint binding: $\phi\|\Upsilon x_i\| = \hbar_1[\mathrm{ln}] \Rightarrow \phi\|\Upsilon x_i\| = \lfloor y_i\ln(\mu\cdot x_i) - \epsilon\rfloor$. We only need the update case $y_i\ln(\mu\cdot x_i) - \epsilon > 0$. Apply the update $\mu = \mu_i + \frac{\alpha y_i}{\mu_i\cdot x_i}\Sigma_i(x_i - \bar{x}_i)$, $\sqrt{u_i} \equiv \|\Upsilon x_i\|$, and the approximation $y_i\ln(\mu\cdot x_i) - \epsilon \approx y_i\left(\ln(\mu_i\cdot x_i) + \frac{(\mu-\mu_i)\cdot x_i}{\mu_i\cdot x_i}\right) - \epsilon$:
$$\phi\sqrt{u_i} = \lfloor y_i\ln(\mu\cdot x_i) - \epsilon\rfloor = y_i\ln(\mu\cdot x_i) - \epsilon \approx y_i\ln(\mu_i\cdot x_i) - \epsilon + \alpha\,\frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}$$
i.e. $\left(\lambda, \lambda'\right) \approx \left(y_i\ln(\mu_i\cdot x_i) - \epsilon,\; \frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}\right)$.
Lemma 21. $\Sigma_{i+1}^{-1} \blacktriangleright \hbar_2[\mathrm{lin}]$ ≡ Lemma 1.
Lemma 22. $\Sigma_{i+1} \blacktriangleright \hbar_2[\mathrm{lin}]$ ≡ Lemma 2.
Lemma 23. $\sqrt{u_i} \blacktriangleright \hbar_2[\mathrm{lin}]$ ≡ Lemma 3.
Lemma 24. $\mu_{i+1} \blacktriangleright \hbar_2[\mathrm{lin}]$
Similar to Lemma 4, $\mu = \mu_i + \alpha\hbar_2' f' y_i\Upsilon_i^2(x_i - \bar{x}_i)$. There are two cases, $y_i(\mu\cdot x_i) - \epsilon\ [>]\,[\leq]\ 0$.
Case [>]: $\hbar_2'(\cdot) = 2\left(y_i(\mu\cdot x_i) - \epsilon\right)$; use $f'(\cdot) = 1$ and $\Upsilon_i^2 = \Sigma_i$:
$$\mu = \mu_i + 2\alpha\left(y_i(\mu\cdot x_i) - \epsilon\right)y_i\Sigma_i(x_i - \bar{x}_i)$$
$$y_i(\mu\cdot x_i) - \epsilon = y_i(\mu_i\cdot x_i) - \epsilon + 2\alpha\left(y_i(\mu\cdot x_i) - \epsilon\right)x_i^{\top}\Sigma_i(x_i - \bar{x}_i)$$
Write $X = y_i(\mu\cdot x_i) - \epsilon$, $C = y_i(\mu_i\cdot x_i) - \epsilon$, $S = 2\alpha\, x_i^{\top}\Sigma_i(x_i - \bar{x}_i)$; then
$$(\mu, X) = \left(\mu_i + 2\alpha X y_i\Sigma_i(x_i - \bar{x}_i),\; C + SX = \frac{C}{1-S}\right)$$
Case [≤]: $\hbar_2'(\cdot) = 0 \Rightarrow (\mu, X) = \left(\mu_i + 2\alpha X y_i\Sigma_i(x_i - \bar{x}_i),\; 0\right)$.
We can conclude with the update $\mu_{i+1} = \mu = \mu_i + 2\alpha\lfloor X\rfloor y_i\Sigma_i(x_i - \bar{x}_i)$, i.e.
$$\mu_{i+1} = \mu_i + \frac{\left\lfloor y_i(\mu_i\cdot x_i) - \epsilon\right\rfloor}{0.5\alpha^{-1} - x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}\;y_i\Sigma_i(x_i - \bar{x}_i)$$
Lemma 25. $\alpha \blacktriangleright \hbar_2[\mathrm{lin}]$
Similar to Lemma 5, $\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor$ where $(a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$. To find $\left(\lambda, \lambda'\right)$, use the binding constraint $0 \leq \phi\|\Upsilon x_i\| = \hbar_2[\mathrm{lin}] \Rightarrow \phi\|\Upsilon x_i\| = \lfloor y_i(\mu\cdot x_i) - \epsilon\rfloor^2$. We only need the update case $y_i(\mu\cdot x_i) - \epsilon > 0$. Apply the update $\mu = \mu_i + \frac{y_i(\mu_i\cdot x_i) - \epsilon}{0.5\alpha^{-1} - x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}\,y_i\Sigma_i(x_i - \bar{x}_i)$ and $\sqrt{u_i} \equiv \|\Upsilon x_i\|$:
$$\phi\sqrt{u_i} = \lfloor y_i(\mu\cdot x_i) - \epsilon\rfloor^2 = \left(y_i(\mu\cdot x_i) - \epsilon\right)^2$$
$$\phi\sqrt{u_i} = \left(y_i(\mu_i\cdot x_i) - \epsilon + \frac{y_i(\mu_i\cdot x_i) - \epsilon}{0.5\alpha^{-1} - x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}\cdot x_i^{\top}\Sigma_i(x_i - \bar{x}_i)\right)^2$$
Suppose $g(\alpha) = \left(A + \frac{AC}{0.5\alpha^{-1} - C}\right)^2$ with $(A, C, \alpha_0) = \left(y_i(\mu_i\cdot x_i) - \epsilon,\; x_i^{\top}\Sigma_i(x_i - \bar{x}_i),\; 0\right)$, and use the Taylor expansion $g(\alpha) \approx g(\alpha_0) + g'(\alpha_0)(\alpha - \alpha_0)$. It follows that $\left(g(0), g'(0)\right) = \left(A^2, 4A^2C\right)$, thus $\phi\sqrt{u_i} = g(\alpha) \approx A^2 + 4A^2C\alpha$, i.e.
$$\left(\lambda, \lambda'\right) \approx \left(\left(y_i(\mu_i\cdot x_i) - \epsilon\right)^2,\; 4\lambda\, x_i^{\top}\Sigma_i(x_i - \bar{x}_i)\right)$$
Lemma 26. $\Sigma_{i+1}^{-1} \blacktriangleright \hbar_2[\mathrm{ln}]$ ≡ Lemma 6.
Lemma 27. $\Sigma_{i+1} \blacktriangleright \hbar_2[\mathrm{ln}]$ ≡ Lemma 7.
Lemma 28. $\sqrt{u_i} \blacktriangleright \hbar_2[\mathrm{ln}]$ ≡ Lemma 8.
Lemma 29. $\mu_{i+1} \blacktriangleright \hbar_2[\mathrm{ln}]$
Similar to Lemma 24, $\mu = \mu_i + \alpha\hbar_2' f' y_i\Upsilon_i^2(x_i - \bar{x}_i)$ with two cases, $y_i\ln(\mu\cdot x_i) - \epsilon\ [>]\,[\leq]\ 0$.
Case [>]: $\hbar_2'(\cdot) = 2\left(y_i\ln(\mu\cdot x_i) - \epsilon\right)$; use $\ln(\mu\cdot x_i) \approx \ln(\mu_i\cdot x_i) + \frac{(\mu-\mu_i)\cdot x_i}{\mu_i\cdot x_i} \Rightarrow f'(\cdot) = \frac{1}{\mu_i\cdot x_i}$ and $\Upsilon_i^2 = \Sigma_i$:
$$\mu \approx \mu_i + 2\alpha\left(y_i\left(\ln(\mu_i\cdot x_i) + \frac{(\mu-\mu_i)\cdot x_i}{\mu_i\cdot x_i}\right) - \epsilon\right)\frac{y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i)$$
$$\frac{(\mu-\mu_i)\cdot y_i x_i}{\mu_i\cdot x_i} \approx 2\alpha\,\frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}\left(y_i\ln(\mu_i\cdot x_i) - \epsilon + \frac{(\mu-\mu_i)\cdot y_i x_i}{\mu_i\cdot x_i}\right)$$
Write $X = \frac{(\mu-\mu_i)\cdot y_i x_i}{\mu_i\cdot x_i}$, $C = y_i\ln(\mu_i\cdot x_i) - \epsilon$, $S = 2\alpha\,\frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}$; hence $X = S(C+X) = \frac{SC}{1-S}$ and
$$(\mu,\, C+X) \approx \left(\mu_i + 2\alpha(C+X)\cdot\frac{y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i),\; \frac{C}{1-S}\right)$$
Case [≤]: $\hbar_2'(\cdot) = 0 \Rightarrow (\mu,\, C+X) = \left(\mu_i + 2\alpha(C+X)\cdot\frac{y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i),\; 0\right)$.
We can conclude with the update $\mu_{i+1} = \mu \approx \mu_i + 2\alpha\lfloor C+X\rfloor\frac{y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i)$, i.e.
$$\mu_{i+1} \approx \mu_i + \frac{\left\lfloor y_i\ln(\mu_i\cdot x_i) - \epsilon\right\rfloor}{0.5\alpha^{-1} - \frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}}\cdot\frac{y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i)$$
Lemma 30. $\alpha \blacktriangleright \hbar_2[\mathrm{ln}]$
Similar to Lemma 25, $\alpha = \left\lfloor\frac{-b \pm \sqrt{b^2-4ac}}{2a}\right\rfloor$ where $(a, b, c) = \left(\lambda'\!\left(\lambda' + \upsilon_i\phi^2\right),\; 2\lambda\!\left(\lambda' + \tfrac{\upsilon_i\phi^2}{2}\right),\; \lambda^2 - \upsilon_i\phi^2\right)$. To find $\left(\lambda, \lambda'\right)$, use the binding constraint $0 \leq \phi\|\Upsilon x_i\| = \hbar_2[\mathrm{ln}] \Rightarrow \phi\|\Upsilon x_i\| = \lfloor y_i\ln(\mu\cdot x_i) - \epsilon\rfloor^2$. We only need the update case $y_i\ln(\mu\cdot x_i) - \epsilon > 0$. Apply the update
$$\mu = \mu_i + \frac{y_i\ln(\mu_i\cdot x_i) - \epsilon}{0.5\alpha^{-1} - \frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}}\cdot\frac{y_i}{\mu_i\cdot x_i}\,\Sigma_i(x_i - \bar{x}_i)$$
and $\sqrt{u_i} \equiv \|\Upsilon x_i\|$ to have $\phi\sqrt{u_i} = \lfloor y_i\ln(\mu\cdot x_i) - \epsilon\rfloor^2 = \left(y_i\ln(\mu\cdot x_i) - \epsilon\right)^2$. Use the approximation $y_i\ln(\mu\cdot x_i) - \epsilon \approx y_i\left(\ln(\mu_i\cdot x_i) + \frac{(\mu-\mu_i)\cdot x_i}{\mu_i\cdot x_i}\right) - \epsilon$:
$$\phi\sqrt{u_i} \approx \left(y_i\ln(\mu_i\cdot x_i) - \epsilon + \frac{y_i\ln(\mu_i\cdot x_i) - \epsilon}{0.5\alpha^{-1} - \frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}}\cdot\frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}\right)^2$$
Similar to Lemma 25, with $(A, C, \alpha_0) = \left(y_i\ln(\mu_i\cdot x_i) - \epsilon,\; \frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2},\; 0\right)$ one can show $\phi\sqrt{u_i} \approx A^2 + 4A^2C\alpha$, i.e.
$$\left(\lambda, \lambda'\right) \approx \left(\left(y_i\ln(\mu_i\cdot x_i) - \epsilon\right)^2,\; 4\lambda\,\frac{x_i^{\top}\Sigma_i(x_i - \bar{x}_i)}{(\mu_i\cdot x_i)^2}\right)$$
References
[Beh00] E. Behrends. Introduction to Markov Chains with Special Emphasis on Rapid Mixing. Advanced Lectures in Mathematics, Vieweg Verlag, 2000.
[BCG05] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3): 640–668, 2005.
[CY11] Y. Chen and X. Ye. Projection onto a simplex. Department of Mathematics, University of Florida, 2011.
[CDR07] J.F. Coeurjolly, R. Drouilhet, and J.F. Robineau. Normalized information-based divergences. Problems of Information Transmission, 43(3): 167–189, 2007.
[CDF08] K. Crammer, M. Dredze, and F. Pereira. Exact convex confidence-weighted learning. Neural Information Processing Systems (NIPS), 2008.
[Dav06] R. Davidson. Stochastic dominance. Department of Economics, McGill University, 2006.
[LCLMV04] M. Li, X. Chen, X. Li, B. Ma, and P.M.B. Vitányi. The similarity metric. IEEE Transactions on Information Theory, 50(12): 3250–3264, 2004.
[LHZG11] B. Li, S.C.H. Hoi, P. Zhao, and V. Gopalkrishnan. Confidence weighted mean reversion strategy for on-line portfolio selection. School of Computer Engineering, Nanyang Technological University, 2011.
[OC09] F. Orabona and K. Crammer. New adaptive algorithms for online classification. Neural Information Processing Systems (NIPS), 2010.
[PP08] K.B. Petersen and M.S. Pedersen. The Matrix Cookbook, 2008.
[Sni10] M. Sniedovich. Dynamic Programming: Foundations and Principles, 2nd ed., CRC Press, 2010.