
3. Regularized least-squares estimation

3.2. Regularized least-squares

3.2.1. Primal formulation

The least-squares problem considered here amounts to the minimization of the criterion

  l_λ(Θ) = (1/(2µ̄)) ‖Y − XΘ‖² + λ ‖Θ‖_nuc ,   ⟨3.1⟩

wherein Y, X ∈ W^{×m} are given linear maps from ℝ^m into a Euclidean space W; Θ ranges over the m×m symmetric matrices in S^m; µ̄ > 0 adjusts the scaling of the first summand; and λ > 0 controls the relative importance of the two summands of l_λ.

Usually W is spanned by finitely many real-valued and µ-square-integrable functions defined on a finite measure space (Ω, F, µ). Then µ̄ = µ(Ω) is a natural choice, but not required by the results of this section, which hold irrespective of the particular value µ̄ > 0. As an example, if W equals ℝⁿ, that is, Y, X ∈ ℝ^{n×m}, then the total mass equals µ̄ = n. Section 2.1.3 presents further instances of this construct.
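For concreteness, a minimal numerical sketch of the criterion ⟨3.1⟩ under the assumption W = ℝⁿ with µ̄ = n (NumPy; the helper name l_lambda is ours, not the text's):

```python
import numpy as np

def l_lambda(Theta, X, Y, lam, mu_bar):
    """Criterion <3.1>: scaled squared error plus nuclear-norm penalty."""
    fit = np.linalg.norm(Y - X @ Theta) ** 2 / (2 * mu_bar)
    # For symmetric Theta, the nuclear norm is the sum of absolute eigenvalues.
    nuc = np.abs(np.linalg.eigvalsh(Theta)).sum()
    return fit + lam * nuc

rng = np.random.default_rng(0)
n, m = 8, 3
X = rng.standard_normal((n, m))
u = np.array([1.0, 0.0, -1.0])
Theta_true = np.outer(u, u)          # rank-one symmetric target
Y = X @ Theta_true                   # noiseless data: the fit term vanishes
print(l_lambda(Theta_true, X, Y, lam=0.1, mu_bar=n))  # ≈ 0.1 * ||Theta_true||_nuc = 0.2
```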

Both summands of l_λ, namely g = λ‖·‖_nuc : S^m → ℝ and the composition f ∘ X, wherein f = ‖Y − ·‖²/(2µ̄) is convex and X : S^m → W^{×m} is linear, are convex. Therefore, the criterion function S^m ∋ Θ ↦ l_λ(Θ) is, too. In particular, l_λ : S^m → ℝ is continuous, and therefore its sublevel sets {l_λ ≤ t}, t > 0, are closed. The second summand g = λ‖·‖_nuc of l_λ implies that these sublevel sets are also bounded and therefore compact. Continuity and compactness guarantee the existence of a minimizer Θ̂ ∈ S^m.
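The boundedness claim rests on the elementary bound λ‖Θ‖_nuc ≤ l_λ(Θ), so that {l_λ ≤ t} ⊂ {‖·‖_nuc ≤ t/λ}; a quick numerical spot-check of this bound under the assumption W = ℝⁿ (NumPy):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, lam = 7, 3, 0.2
mu_bar = float(n)
X = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))

for _ in range(100):
    A = rng.standard_normal((m, m))
    Theta = 10 * (A + A.T)                      # arbitrary symmetric test point
    nuc = np.abs(np.linalg.eigvalsh(Theta)).sum()
    crit = np.linalg.norm(Y - X @ Theta) ** 2 / (2 * mu_bar) + lam * nuc
    # The fit term is nonnegative, so the penalty alone bounds the criterion.
    assert lam * nuc <= crit + 1e-12
print("sublevel sets of l_lambda are nuclear-norm bounded")
```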

The summands of l_λ mirror the twofold goal behind the minimization of ⟨3.1⟩. The first term ensures that a minimizer Θ̂ of l_λ yields a close substitute to Y, in terms of ‖·‖, in the form of XΘ̂. In addition, lemma 3.3 suggests that the second term promotes a low rank of minimizers Θ̂ ∈ S^m. Hence, minimizers Θ̂ of the criterion l_λ trade off "fidelity to the data" X, Y against their own complexity, expressed as the dimension of their image. A proof of lemma 3.3 follows on page 80 in appendix 3.b.

Lemma 3.3. The restriction g₀ of g = λ‖·‖_nuc to the ‖·‖_op-ball H = {‖·‖_op ≤ 1} ⊂ S^m, wherein ‖·‖_op : S^m → ℝ, equals the convex envelope of H ∋ B ↦ h₀(B) = λ rk B.

The restriction on the ‖·‖_op-length in lemma 3.3 is essential: if λ rk B ≥ a + ⟨S, B⟩ for all B ∈ S^m, then λm ≥ λ rk(tS) ≥ a + t‖S‖² for all t > 0, which forces S = 0. Consequently, the conjugate of λ rk (on S^m) is defined only at zero, and its biconjugate equals zero everywhere.

Lemma 3.3 is related to the equality conv{±uuᵀ | ‖u‖ = 1} = {‖·‖_nuc ≤ 1}. In fact, if convex functions f, f₀ satisfy f(B) ≤ f₀(B) ≤ rk(B) for all B ∈ {‖·‖_op ≤ 1}, then {±uuᵀ | ‖u‖ = 1} ⊂ {rk ≤ 1} ∩ {‖·‖_op ≤ 1} ⊂ {f₀ ≤ 1} ⊂ {f ≤ 1}. The latter two subsets are convex and thus contain H₀ = conv{±uuᵀ | ‖u‖ = 1}. Hence, lemma 3.3 implies H₀ ⊂ {‖·‖_nuc ≤ 1} ⊂ {f ≤ 1} for every convex f with f ≤ rk on {‖·‖_op ≤ 1}.
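Lemma 3.3 in particular places the nuclear norm below the rank on the operator-norm ball; a numerical spot-check of this inequality for random symmetric matrices (NumPy; the scaling step is our own illustration, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(100):
    A = rng.standard_normal((4, 4))
    B = (A + A.T) / 2                                    # random symmetric matrix
    B /= max(1.0, np.abs(np.linalg.eigvalsh(B)).max())   # enforce ||B||_op <= 1
    eig = np.linalg.eigvalsh(B)
    nuc = np.abs(eig).sum()                              # ||B||_nuc
    rank = np.linalg.matrix_rank(B)
    # On {||.||_op <= 1}, the envelope lies below the rank: ||B||_nuc <= rk B.
    assert nuc <= rank + 1e-9
print("nuclear norm minorizes the rank on the operator-norm ball")
```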

3.2.2. Dual formulation

The minimization of the criterion function l_λ in ⟨3.1⟩ over S^m can be cast in an alternative form. This so-called dual problem amounts to the maximization of the dual criterion shown in ⟨3.3⟩. Its form derives from analyzing the sensitivity of the optimal value inf_{Θ∈S^m} l_λ(Θ) to perturbations. Lemma 3.4 contains the essential ingredients of this approach. A proof of this result starts on page 80 in appendix 3.b.

Lemma 3.4. The function W^{×m} ∋ Z ↦ v(Z) ∈ ℝ, wherein

  v(Z) = inf_{Θ∈S^m} (1/(2µ̄)) ‖Y − (XΘ + Z)‖² + λ ‖Θ‖_nuc = inf_{Θ∈S^m} f(XΘ + Z) + g(Θ),

is convex. Its conjugate is defined for all D ∈ W^{×m} with ‖⟨⟨X, D⟩⟩ + ⟨⟨D, X⟩⟩‖_op ≤ 2λ by

  v*(D) = f*(D) + g*( −(⟨⟨X, D⟩⟩ + ⟨⟨D, X⟩⟩)/2 ),

wherein f* : W^{×m} → ℝ, D ↦ (µ̄/2)( ‖D + Y/µ̄‖² − ‖Y/µ̄‖² ), and g* : {‖·‖_op ≤ λ} → ℝ, M ↦ 0.
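The closed form of f* in lemma 3.4 follows from a one-line stationarity computation; in the notation above:

```latex
f^\star(D)
  = \sup_{w \in W^{\times m}} \langle D, w\rangle - \tfrac{1}{2\bar\mu}\lVert Y - w\rVert^2
  % stationarity: D + (Y - w)/\bar\mu = 0, so the supremum is attained at w = Y + \bar\mu D
  = \langle D, Y\rangle + \tfrac{\bar\mu}{2}\lVert D\rVert^2
  = \tfrac{\bar\mu}{2}\bigl(\lVert D + Y/\bar\mu\rVert^2 - \lVert Y/\bar\mu\rVert^2\bigr).
```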

Lemma 3.4 considers the scaled nuclear norm g = λ‖·‖_nuc as a real-valued function on S^m, instead of on H = {‖·‖_op ≤ λ⁻¹} ⊂ S^m, which explains the difference between g* and the intermediate result, in particular ⟨A3.1⟩ on page 80, used to prove lemma 3.3.

The Fenchel-Young inequality together with lemma 3.4 implies

  l_λ(Θ) ≥ v(0) ≥ sup { −f*(M) : ‖⟨⟨X, M⟩⟩ + ⟨⟨M, X⟩⟩‖_op ≤ 2λ } ≥ −f*(D)   ⟨3.2⟩

for every symmetric Θ and D ∈ {‖⟨⟨X, ·⟩⟩ + ⟨⟨·, X⟩⟩‖_op ≤ 2λ}. Equality holds if and only if D is a subgradient of v at 0, the neutral element of the additive group (W^{×m}, +). Convexity of v on the open set W^{×m} guarantees the existence of such a subgradient D̂.

Consequently, the minimal value inf_{Θ∈S^m} l_λ(Θ) = v(0) equals the supremum of

  −f*(D) = (µ̄/2) ( ‖Y/µ̄‖² − ‖D + Y/µ̄‖² ),   ‖ (⟨⟨X, D⟩⟩ + ⟨⟨D, X⟩⟩)/2 ‖_op ≤ λ ,   ⟨3.3⟩

which is attained at every element D̂ of ∂v(0) ≠ ∅. This maximization exercise provides the dual problem to the minimization of the (primal) criterion l_λ in ⟨3.1⟩.
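Weak duality ⟨3.2⟩/⟨3.3⟩ can be checked numerically: every dual-feasible D yields a lower bound −f*(D) on the primal criterion. A sketch under the assumption W = ℝⁿ, reading ⟨⟨X, D⟩⟩ as XᵀD (our reading; the helper names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, lam = 6, 3, 0.5
mu_bar = float(n)
X = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))

def primal(Theta):
    """l_lambda of <3.1> for symmetric Theta."""
    return (np.linalg.norm(Y - X @ Theta) ** 2 / (2 * mu_bar)
            + lam * np.abs(np.linalg.eigvalsh(Theta)).sum())

def dual(D):
    """-f*(D), the dual objective of <3.3>."""
    return (mu_bar / 2) * (np.linalg.norm(Y / mu_bar) ** 2
                           - np.linalg.norm(D + Y / mu_bar) ** 2)

# Scale a random candidate into the feasible set ||(X'D + D'X)/2||_op <= lam.
D = rng.standard_normal((n, m))
G = -(X.T @ D + D.T @ X) / 2
D *= lam / (np.abs(np.linalg.eigvalsh(G)).max() + 1e-12)

for _ in range(50):
    A = rng.standard_normal((m, m))
    Theta = (A + A.T) / 2
    assert dual(D) <= primal(Theta) + 1e-9    # weak duality <3.2>
print("dual values lower-bound the primal criterion")
```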

Figure 3.5 (re-)interprets l_λ as −(−f ∘ X) + g. The upper part of its panel (A) visualizes this difference for the case W = ℝ, m = 1 with X = 1 and in terms of the (partial) graphs of f and g. Its lower part contains the corresponding part of the graph of l_λ. The primal objective function l_λ is not smaller than the sum of the intercepts −f*(D) and −g*(−D) for every choice of (Θ, D), wherein D and −D correspond to affine minorants of f and g, respectively. The upper part also sheds light on the form of g*: in fact, there exists an affine minorant of g = λ|·| with slope −D if and only if |D| ≤ λ; in that case, the maximal intercept equals zero.

Figure 3.5
The figure illustrates (the relation between) the primal and dual formulation of the least-squares problem ⟨3.1⟩ for the case W = ℝ, m = 1, and X = 1. The upper part of panel (A) rephrases l_λ as the difference −(−f) + g; the lower part shows parts of its graph. The primal objective value l_λ(Θ) exceeds the sum of the intercepts −f*(D) and −g*(−D) for any Θ and D such that D and −D correspond to affine minorants of f and g, respectively. Panel (B) reproduces the setting of panel (A) and shows that primal and dual objectives coincide for a minimizing/maximizing pair (Θ̂, D̂), which satisfies D̂ ∈ ∂f(Θ̂) and −D̂ ∈ ∂g(Θ̂).

Panel (B) of figure 3.5 concerns the identical setting. It illustrates that the equality v(0) = l_λ(Θ̂) = −v*(D̂) is possible.

If (Θ̂, D̂) ∈ S^m × {‖⟨⟨X, ·⟩⟩ + ⟨⟨·, X⟩⟩‖_op ≤ 2λ} is a minimizing/maximizing pair, then

  0 = ( f(XΘ̂) + g(Θ̂) ) + ( f*(D̂) + g*(Ĝ) )
    = ( f(XΘ̂) + f*(D̂) − ⟨XΘ̂, D̂⟩ ) + ( g(Θ̂) + g*(Ĝ) − ⟨XΘ̂, −D̂⟩ ),

wherein Ĝ = −(⟨⟨X, D̂⟩⟩ + ⟨⟨D̂, X⟩⟩)/2. The Fenchel-Young inequality together with ⟨XΘ̂, −D̂⟩ = ⟨Θ̂, Ĝ⟩ implies that the latter two summands are generally nonnegative. Consequently, both summands equal zero, and the two pairs of optimality conditions

  D̂ ∈ ∂f(XΘ̂) = { (XΘ̂ − Y)/µ̄ },   Ĝ = −(⟨⟨X, D̂⟩⟩ + ⟨⟨D̂, X⟩⟩)/2 ∈ ∂g(Θ̂)

and

  XΘ̂ ∈ ∂f*(D̂) = { µ̄D̂ + Y },   Θ̂ ∈ ∂g*(Ĝ) = ncone({‖·‖_op ≤ λ}, Ĝ)   ⟨3.4⟩

hold. The second twin in ⟨3.4⟩ follows from f and g being convex functions on the open sets W^{×m} and S^m, respectively. More specifically, these properties guarantee f = f** and g = g**, wherein f** and g** equal the biconjugate functions of f and g, respectively.

Panel (B) of figure 3.5 reflects the first pair of necessary conditions: if W = ℝ and X equals the identity, then an optimal D̂ provides a subgradient of f at the corresponding minimizer Θ̂, and at the same time −D̂ lies in the subdifferential ∂g(Θ̂).

The conditions in ⟨3.4⟩ are also sufficient in the sense that if (Θ̂, D̂) ∈ S^m × W^{×m} satisfies either set of conditions, then this pair is minimizing/maximizing. More specifically, either of its lower parts implies ‖Ĝ‖_op ≤ λ, wherein Ĝ = −(⟨⟨X, D̂⟩⟩ + ⟨⟨D̂, X⟩⟩)/2. In fact, if the lower part on the lefthand side of ⟨3.4⟩ holds, then λ‖Θ‖_nuc ≥ λ‖Θ̂‖_nuc + ⟨Ĝ, Θ − Θ̂⟩ for all Θ ∈ S^m. Corollary 2.6 guarantees the existence of Θ₀ ∈ {‖·‖_nuc = 1} with ⟨Θ₀, Ĝ⟩ = ‖Ĝ‖_op, and thus λ‖Θ̂‖_nuc + λ ≥ λ‖Θ̂ + Θ₀‖_nuc ≥ λ‖Θ̂‖_nuc + ⟨Ĝ, Θ₀⟩. In case of the righthand conditions, this inequality holds since subgradients exist only at points where g* is defined. Furthermore, either pair in ⟨3.4⟩ implies the first equality in the display above ⟨3.4⟩. Hence, the (in)equalities 0 = l_λ(Θ̂) + v*(D̂) ≥ v(0) + v*(D̂) ≥ 0 ensure via ⟨3.2⟩ that Θ̂ and D̂ are optimal in ⟨3.1⟩ and ⟨3.3⟩, respectively.
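The conditions ⟨3.4⟩ can be verified numerically on an approximate minimizer. The sketch below solves ⟨3.1⟩ for W = ℝⁿ by proximal gradient descent, where the prox of the nuclear norm on S^m soft-thresholds eigenvalues; the solver is our illustration, not the text's method:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, lam = 12, 4, 0.3
mu_bar = float(n)
X = rng.standard_normal((n, m))
u = rng.standard_normal(m)
Y = X @ np.outer(u, u) + 0.1 * rng.standard_normal((n, m))

def prox_nuc(S, t):
    """Prox of t*||.||_nuc on S^m: soft-threshold the eigenvalues of S."""
    w, V = np.linalg.eigh(S)
    w = np.sign(w) * np.maximum(np.abs(w) - t, 0.0)
    return (V * w) @ V.T

step = mu_bar / np.linalg.norm(X.T @ X, 2)   # 1/L for the smooth summand
Theta = np.zeros((m, m))
for _ in range(5000):
    grad = X.T @ (X @ Theta - Y) / mu_bar
    grad = (grad + grad.T) / 2               # restrict the gradient to S^m
    Theta = prox_nuc(Theta - step * grad, step * lam)

# Check the lefthand side of <3.4>: D = (X Theta - Y)/mu_bar and
# G = -(X'D + D'X)/2 must satisfy ||G||_op <= lam (equality when Theta != 0).
D = (X @ Theta - Y) / mu_bar
G = -(X.T @ D + D.T @ X) / 2
print(np.abs(np.linalg.eigvalsh(G)).max(), "<=", lam)
```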

3.2.3. The least-squares solution set

Section 3.2.1 proves the existence of a minimizer of the objective function l_λ in ⟨3.1⟩, that is, an element Θ̂ ∈ S^m with l_λ(Θ̂) ≤ l_λ(Θ) for all Θ ∈ S^m. This section characterizes the set of minimizers of l_λ, which is denoted by argmin_{Θ∈S^m} l_λ(Θ) as in section 3.1.2.

The conjugate f* : W^{×m} → ℝ in lemma 3.4 is convex by construction. Its form implies differentiability with ∂f*(D) = {µ̄D + Y} and thereby strict convexity. The restriction of f* to the set {‖⟨⟨X, ·⟩⟩ + ⟨⟨·, X⟩⟩‖_op ≤ 2λ} inherits this property, and therefore exhibits at most one minimizer. Consequently, the second component of any minimizing/maximizing pair (Θ̂, D̂) is uniquely determined. The upper parts of ⟨3.4⟩ ensure that XΘ̂ = µ̄D̂ + Y is, too. In particular, any two minimizers Θ̂, Θ̂′ of ⟨3.1⟩ provide boundary points of the nuclear norm ball {‖·‖_nuc ≤ ℓ̂} with radius

  ℓ̂ = ‖Θ̂‖_nuc = ‖Θ̂′‖_nuc = (1/λ) ( inf_{Θ∈S^m} l_λ(Θ) − (1/(2µ̄)) ‖Y − XΘ̂‖² )
                           = (1/λ) ( inf_{Θ∈S^m} l_λ(Θ) − (1/(2µ̄)) ‖Y − XΘ̂′‖² ).

If ℓ̂ = 0, then the unique minimizer Θ̂ amounts to the m×m zero matrix. Otherwise, the lefthand side of ⟨3.4⟩ together with lemma 3.2 implies that Ĝ = −(⟨⟨X, D̂⟩⟩ + ⟨⟨D̂, X⟩⟩)/2 exhibits ‖·‖_op-length λ and that ⟨Ĝ, Θ̂⟩ = λℓ̂ for every minimizer Θ̂ of l_λ. The latter together with the requirement XΘ̂ = µ̄D̂ + Y leads to the assertion of lemma 3.5. Its proof starts on page 81 in appendix 3.b.

Lemma 3.5. The least-squares criterion l_λ exhibits a unique minimizer Θ̂ if ker(Ĝ ∓ λ id) ∩ ker X = {0}, wherein id symbolizes the identity map on ℝ^m.

The condition of lemma 3.5 is generally satisfied. In fact, if u ∈ ker(Ĝ − λ id) ∩ ker X, then λu = Ĝu = −⟨⟨X, D̂⟩⟩u/2 ∈ img ⟨⟨X, ·⟩⟩ = (ker X)^⊥, thus u = 0. The case u ∈ ker(Ĝ + λ id) ∩ ker X is analogous. Proposition 3.6 summarizes the preceding discussion. A proof of this assertion starts on page 81 of appendix 3.b.

Figure 3.6
The figure shows a selection of rank one boundary points of {‖·‖_nuc ≤ ℓ} ⊂ S², ℓ > 0, and the differences of elements of the same exposed face of {‖·‖_nuc ≤ ℓ}. The two ellipses in panel (A) consist of the rank one boundary points. The panel also contains the spans of a selection of these matrices (represented by dots). Panel (B) shows two pairs (Θ̂, Θ̂′) and (Θ̂″, Θ̂‴), each lying in an exposed face, alongside their component differences. In addition, the panel indicates the solution set of a linear equation. Both panels show coordinates with respect to B̄₁,₁, B̄₁,₂, B̄₂,₂; see (c) (section 2.1.1). Panel (A) omits the coordinate axes for visual clarity.

Proposition 3.6. The least-squares criterion l_λ in ⟨3.1⟩ exhibits a unique minimizer Θ̂. The latter equals zero if and only if ‖⟨⟨X, Y⟩⟩ + ⟨⟨Y, X⟩⟩‖_op ≤ 2µ̄λ.
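The zero-solution threshold in proposition 3.6 can be read off a fixed-point argument: Θ̂ = 0 solves ⟨3.1⟩ exactly when one eigenvalue soft-thresholding step from zero returns zero. A sketch assuming W = ℝⁿ, with ⟨⟨X, Y⟩⟩ read as XᵀY (the fixed-point reading and helper names are our illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 10, 3
mu_bar = float(n)
X = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))

G0 = (X.T @ Y + Y.T @ X) / 2     # symmetrized data term at Theta = 0
# Proposition 3.6: Theta = 0 is optimal iff ||G0||_op <= mu_bar * lam.
threshold = np.abs(np.linalg.eigvalsh(G0)).max() / mu_bar

def zero_is_fixed_point(lam, t=1.0):
    """One prox-gradient step from 0: soft-threshold eig(t*G0/mu_bar) by t*lam."""
    w = np.linalg.eigvalsh(t * G0 / mu_bar)
    shrunk = np.sign(w) * np.maximum(np.abs(w) - t * lam, 0.0)
    return np.allclose(shrunk, 0.0)

print(zero_is_fixed_point(1.01 * threshold))  # True: penalty strong enough
print(zero_is_fixed_point(0.50 * threshold))  # False: some eigenvalue survives
```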

From a geometric perspective, the lefthand side of ⟨3.4⟩ shows that argmin_{Θ∈S^m} l_λ(Θ) equals the intersection of the exposed face {Ĝ ∈ ncone(B_nuc, ·)} of the nuclear norm ball B_nuc = {‖·‖_nuc ≤ ℓ̂} and the set of solutions {X· = µ̄D̂ + Y}.

If m = 2, then panel (A) of figure 3.6 indicates that the difference between two distinct elements Θ̂ and Θ̂′ of the same exposed face has rank two. In fact, all rank one matrices of ‖·‖_nuc-length ℓ > 0, that is, matrices of the form ±ℓuuᵀ, ‖u‖ = 1, lie in the upper and lower ellipse shown in that panel. This panel also shows the spans of a selection of these matrices, represented by dots. The neighboring panel (B) verifies this observation for two pairs Θ̂ ≠ Θ̂′ and Θ̂″ ≠ Θ̂‴. Each of these pairs lies in an exposed face of the ball {‖·‖_nuc ≤ ℓ}: the gray line connecting the two ellipses and the area circumscribed by the upper ellipse. This observation is, however, incompatible with ∆ = Θ̂ − Θ̂′ ∈ ker X unless X equals zero. After all, the equality rk ∆ = 2 ensures that its columns form a basis of ℝ². Panel (B) illustrates this point by showing part of {X· = µ̄D̂ + Y} with X = (0, 1) and µ̄D̂ + Y = (0, 1).