
3. Regularized least-squares estimation

3.2. Regularized least-squares

3.2.1. Primal formulation

The least-squares problem considered here amounts to the minimization of the criterion

  l_λ(Θ) = (1/(2µ̄)) ‖Y − XΘ‖² + λ ‖Θ‖_nuc ,   ⟨3.1⟩

wherein Y, X ∈ W^{×m} are given linear maps from ℝ^m into a Euclidean space W; Θ ranges over the m×m symmetric matrices in S^m; µ̄ > 0 adjusts the scaling of the first summand; and λ > 0 controls the relative importance of the two summands of l_λ.

Usually W is spanned by finitely many real-valued and µ-square-integrable functions defined on a finite measure space (Ω, F, µ). Then µ̄ = µ(Ω) is a natural choice, but not required by the results of this section, which hold irrespective of the particular value µ̄ > 0. As an example, if W equals ℝⁿ, that is, Y, X ∈ ℝ^{n×m}, then the total mass equals µ̄ = n. Section 2.1.3 presents further instances of this construct.
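For concreteness, a minimal numerical sketch of the criterion ⟨3.1⟩ under the assumption W = ℝⁿ with µ̄ = n (NumPy; the helper name l_lambda is ours, not the text's):

```python
import numpy as np

def l_lambda(Theta, X, Y, lam, mu_bar):
    """Criterion <3.1>: scaled squared error plus nuclear-norm penalty."""
    fit = np.linalg.norm(Y - X @ Theta) ** 2 / (2 * mu_bar)
    # For symmetric Theta, the nuclear norm is the sum of absolute eigenvalues.
    nuc = np.abs(np.linalg.eigvalsh(Theta)).sum()
    return fit + lam * nuc

rng = np.random.default_rng(0)
n, m = 8, 3
X = rng.standard_normal((n, m))
u = np.array([1.0, 0.0, -1.0])
Theta_true = np.outer(u, u)          # rank-one symmetric target
Y = X @ Theta_true                   # noiseless data: the fit term vanishes
print(l_lambda(Theta_true, X, Y, lam=0.1, mu_bar=n))  # ≈ 0.1 * ||Theta_true||_nuc = 0.2
```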

Both summands of l_λ, namely g = λ‖·‖_nuc : S^m → ℝ and the composition f ∘ X, wherein f = ‖Y − ·‖²/(2µ̄) is convex and X : S^m → W^{×m} is linear, are convex. Therefore, the criterion function S^m ∋ Θ ↦ l_λ(Θ) is, too. In particular, l_λ : S^m → ℝ is continuous, and therefore its sublevel sets {l_λ ≤ t}, t > 0, are closed. The second summand g = λ‖·‖_nuc of l_λ implies that these sublevel sets are also bounded and therefore compact. Continuity and compactness guarantee the existence of a minimizer Θ̂ ∈ S^m.
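The boundedness claim rests on the elementary bound λ‖Θ‖_nuc ≤ l_λ(Θ), so that {l_λ ≤ t} ⊂ {‖·‖_nuc ≤ t/λ}; a quick numerical spot-check of this bound under the assumption W = ℝⁿ (NumPy):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, lam = 7, 3, 0.2
mu_bar = float(n)
X = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))

for _ in range(100):
    A = rng.standard_normal((m, m))
    Theta = 10 * (A + A.T)                      # arbitrary symmetric test point
    nuc = np.abs(np.linalg.eigvalsh(Theta)).sum()
    crit = np.linalg.norm(Y - X @ Theta) ** 2 / (2 * mu_bar) + lam * nuc
    # The fit term is nonnegative, so the penalty alone bounds the criterion.
    assert lam * nuc <= crit + 1e-12
print("sublevel sets of l_lambda are nuclear-norm bounded")
```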

The summands of l_λ mirror the twofold goal behind the minimization of ⟨3.1⟩. The first term ensures that a minimizer Θ̂ of l_λ yields a close substitute to Y, in terms of ‖·‖, in the form of XΘ̂. In addition, lemma 3.3 suggests that the second term promotes a low rank of minimizers Θ̂ ∈ S^m. Hence, minimizers Θ̂ of the criterion l_λ trade off "fidelity to the data" X, Y against their own complexity, expressed as the dimension of their image. A proof of lemma 3.3 follows on page 80 in appendix 3.b.

Lemma 3.3. The restriction g₀ of g = λ‖·‖_nuc to the ‖·‖_op-ball H = {‖·‖_op ≤ 1} ⊂ S^m, wherein ‖·‖_op : S^m → ℝ, equals the convex envelope of H ∋ B ↦ h₀(B) = λ rk B.

The restriction on the ‖·‖_op-length in lemma 3.3 is essential: if λ rk B ≥ a + ⟨S, B⟩ for all B ∈ S^m, then λm ≥ λ rk(tS) ≥ a + t‖S‖² for all t > 0, which forces S = 0. Consequently, the conjugate of λ rk (on S^m) is defined only at zero, and its biconjugate equals zero everywhere.

Lemma 3.3 is related to the equality conv{±uuᵀ | ‖u‖ = 1} = {‖·‖_nuc ≤ 1}. In fact, if convex functions f, f₀ satisfy f(B) ≤ f₀(B) ≤ rk(B) for all B ∈ {‖·‖_op ≤ 1}, then {±uuᵀ | ‖u‖ = 1} ⊂ {rk ≤ 1} ∩ {‖·‖_op ≤ 1} ⊂ {f₀ ≤ 1} ⊂ {f ≤ 1}. The latter two subsets are convex and thus contain H₀ = conv{±uuᵀ | ‖u‖ = 1}. Hence, lemma 3.3 implies H₀ ⊂ {‖·‖_nuc ≤ 1} ⊂ {f ≤ 1} for every convex f with f ≤ rk on {‖·‖_op ≤ 1}.
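Lemma 3.3 in particular places the nuclear norm below the rank on the operator-norm ball; a numerical spot-check of this inequality for random symmetric matrices (NumPy; the scaling step is our own illustration, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(100):
    A = rng.standard_normal((4, 4))
    B = (A + A.T) / 2                                    # random symmetric matrix
    B /= max(1.0, np.abs(np.linalg.eigvalsh(B)).max())   # enforce ||B||_op <= 1
    eig = np.linalg.eigvalsh(B)
    nuc = np.abs(eig).sum()                              # ||B||_nuc
    rank = np.linalg.matrix_rank(B)
    # On {||.||_op <= 1}, the envelope lies below the rank: ||B||_nuc <= rk B.
    assert nuc <= rank + 1e-9
print("nuclear norm minorizes the rank on the operator-norm ball")
```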

3.2.2. Dual formulation

The minimization of the criterion function l_λ in ⟨3.1⟩ over S^m can be cast in an alternative form. This so-called dual problem amounts to the maximization of the dual criterion shown in ⟨3.3⟩. Its form derives from analyzing the sensitivity of the optimal value inf_{Θ∈S^m} l_λ(Θ) to perturbations. Lemma 3.4 contains the essential ingredients of this approach. A proof of this result starts on page 80 in appendix 3.b.

Lemma 3.4. The function W^{×m} ∋ Z ↦ v(Z) ∈ ℝ, wherein

  v(Z) = inf_{Θ∈S^m} (1/(2µ̄)) ‖Y − (XΘ + Z)‖² + λ ‖Θ‖_nuc = inf_{Θ∈S^m} f(XΘ + Z) + g(Θ),

is convex. Its conjugate is defined for all D ∈ W^{×m} with ‖⟨⟨X, D⟩⟩ + ⟨⟨D, X⟩⟩‖_op ≤ 2λ by

  v*(D) = f*(D) + g*( −(⟨⟨X, D⟩⟩ + ⟨⟨D, X⟩⟩)/2 ),

wherein f* : W^{×m} → ℝ, D ↦ (µ̄/2)( ‖D + Y/µ̄‖² − ‖Y/µ̄‖² ), and g* : {‖·‖_op ≤ λ} → ℝ, M ↦ 0.
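The closed form of f* in lemma 3.4 follows from a one-line stationarity computation; in the notation above:

```latex
f^\star(D)
  = \sup_{w \in W^{\times m}} \langle D, w\rangle - \tfrac{1}{2\bar\mu}\lVert Y - w\rVert^2
  % stationarity: D + (Y - w)/\bar\mu = 0, so the supremum is attained at w = Y + \bar\mu D
  = \langle D, Y\rangle + \tfrac{\bar\mu}{2}\lVert D\rVert^2
  = \tfrac{\bar\mu}{2}\bigl(\lVert D + Y/\bar\mu\rVert^2 - \lVert Y/\bar\mu\rVert^2\bigr).
```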

Lemma 3.4 considers the scaled nuclear norm g = λ‖·‖_nuc as a real-valued function on S^m, instead of on H = {‖·‖_op ≤ λ⁻¹} ⊂ S^m, which explains the difference between g* and the intermediate result, in particular ⟨A3.1⟩ on page 80, used to prove lemma 3.3.

The Fenchel-Young inequality together with lemma 3.4 implies

  l_λ(Θ) ≥ v(0) ≥ sup { −f*(M) : ‖⟨⟨X, M⟩⟩ + ⟨⟨M, X⟩⟩‖_op ≤ 2λ } ≥ −f*(D)   ⟨3.2⟩

for every symmetric Θ and D ∈ {‖⟨⟨X, ·⟩⟩ + ⟨⟨·, X⟩⟩‖_op ≤ 2λ}. Equality holds if and only if D is a subgradient of v at 0, the neutral element of the additive group (W^{×m}, +). Convexity of v on the open set W^{×m} guarantees the existence of such a subgradient D̂.

Consequently, the minimal value inf_{Θ∈S^m} l_λ(Θ) = v(0) equals the supremum of

  −f*(D) = (µ̄/2) ( ‖Y/µ̄‖² − ‖D + Y/µ̄‖² ),   ‖ (⟨⟨X, D⟩⟩ + ⟨⟨D, X⟩⟩)/2 ‖_op ≤ λ ,   ⟨3.3⟩

which is attained at every element D̂ of ∂v(0) ≠ ∅. This maximization exercise provides the dual problem to the minimization of the (primal) criterion l_λ in ⟨3.1⟩.
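Weak duality ⟨3.2⟩/⟨3.3⟩ can be checked numerically: every dual-feasible D yields a lower bound −f*(D) on the primal criterion. A sketch under the assumption W = ℝⁿ, reading ⟨⟨X, D⟩⟩ as XᵀD (our reading; the helper names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, lam = 6, 3, 0.5
mu_bar = float(n)
X = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))

def primal(Theta):
    """l_lambda of <3.1> for symmetric Theta."""
    return (np.linalg.norm(Y - X @ Theta) ** 2 / (2 * mu_bar)
            + lam * np.abs(np.linalg.eigvalsh(Theta)).sum())

def dual(D):
    """-f*(D), the dual objective of <3.3>."""
    return (mu_bar / 2) * (np.linalg.norm(Y / mu_bar) ** 2
                           - np.linalg.norm(D + Y / mu_bar) ** 2)

# Scale a random candidate into the feasible set ||(X'D + D'X)/2||_op <= lam.
D = rng.standard_normal((n, m))
G = -(X.T @ D + D.T @ X) / 2
D *= lam / (np.abs(np.linalg.eigvalsh(G)).max() + 1e-12)

for _ in range(50):
    A = rng.standard_normal((m, m))
    Theta = (A + A.T) / 2
    assert dual(D) <= primal(Theta) + 1e-9    # weak duality <3.2>
print("dual values lower-bound the primal criterion")
```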

Figure 3.5 (re-)interprets l_λ as −(−f ∘ X) + g. The upper part of its panel (A) visualizes this difference for the case W = ℝ, m = 1 with X = 1 and in terms of the (partial) graphs of f and g. Its lower part contains the corresponding part of the graph of l_λ. The primal objective function l_λ is not smaller than the sum of the intercepts −f*(D) and −g*(−D) for every choice of (Θ, D), wherein D and −D correspond to affine minorants of f and g, respectively. The upper part also sheds light on the form of g*: in fact, there exists an affine minorant of g = λ|·| with slope −D if and only if |D| ≤ λ; in that case, the maximal intercept equals zero.

Figure 3.5
The figure illustrates (the relation between) the primal and dual formulation of the least-squares problem ⟨3.1⟩ for the case W = ℝ, m = 1, and X = 1. The upper part of panel (A) rephrases l_λ as the difference −(−f) + g; the lower part shows parts of its graph. The primal objective value l_λ(Θ) exceeds the sum of the intercepts −f*(D) and −g*(−D) for any Θ and D such that D and −D correspond to affine minorants of f and g, respectively. Panel (B) reproduces the setting of panel (A) and shows that primal and dual objectives coincide for a minimizing/maximizing pair (Θ̂, D̂), which satisfies D̂ ∈ ∂f(Θ̂) and −D̂ ∈ ∂g(Θ̂).

Panel (B) of figure 3.5 concerns the identical setting. It illustrates that the equality v(0) = l_λ(Θ̂) = −v*(D̂) is possible.

If (Θ̂, D̂) ∈ S^m × {‖⟨⟨X, ·⟩⟩ + ⟨⟨·, X⟩⟩‖_op ≤ 2λ} is a minimizing/maximizing pair, then

  0 = ( f(XΘ̂) + g(Θ̂) ) + ( f*(D̂) + g*(Ĝ) )
    = ( f(XΘ̂) + f*(D̂) − ⟨XΘ̂, D̂⟩ ) + ( g(Θ̂) + g*(Ĝ) − ⟨XΘ̂, −D̂⟩ ),

wherein Ĝ = −(⟨⟨X, D̂⟩⟩ + ⟨⟨D̂, X⟩⟩)/2. The Fenchel-Young inequality together with ⟨XΘ̂, −D̂⟩ = ⟨Θ̂, Ĝ⟩ implies that the latter two summands are generally nonnegative. Consequently, both summands equal zero, and the two pairs of optimality conditions

  D̂ ∈ ∂f(XΘ̂) = { (XΘ̂ − Y)/µ̄ },   Ĝ = −(⟨⟨X, D̂⟩⟩ + ⟨⟨D̂, X⟩⟩)/2 ∈ ∂g(Θ̂)

and

  XΘ̂ ∈ ∂f*(D̂) = { µ̄D̂ + Y },   Θ̂ ∈ ∂g*(Ĝ) = ncone({‖·‖_op ≤ λ}, Ĝ)   ⟨3.4⟩

hold. The second twin in ⟨3.4⟩ follows from f and g being convex functions on the open sets W^{×m} and S^m, respectively. More specifically, these properties guarantee f = f** and g = g**, wherein f** and g** equal the biconjugate functions of f and g, respectively.

Panel (B) of figure 3.5 reflects the first pair of necessary conditions: if W = ℝ and X equals the identity, then an optimal D̂ provides a subgradient of f at the corresponding minimizer Θ̂, and at the same time −D̂ lies in the subdifferential ∂g(Θ̂).

The conditions in ⟨3.4⟩ are also sufficient in the sense that if (Θ̂, D̂) ∈ S^m × W^{×m} satisfies either set of conditions, then this pair is minimizing/maximizing. More specifically, either of its lower parts implies ‖Ĝ‖_op ≤ λ, wherein Ĝ = −(⟨⟨X, D̂⟩⟩ + ⟨⟨D̂, X⟩⟩)/2. In fact, if the lower part on the lefthand side of ⟨3.4⟩ holds, then λ‖Θ‖_nuc ≥ λ‖Θ̂‖_nuc + ⟨Ĝ, Θ − Θ̂⟩ for all Θ ∈ S^m. Corollary 2.6 guarantees the existence of Θ₀ ∈ {‖·‖_nuc = 1} with ⟨Θ₀, Ĝ⟩ = ‖Ĝ‖_op, and thus λ‖Θ̂‖_nuc + λ ≥ λ‖Θ̂ + Θ₀‖_nuc ≥ λ‖Θ̂‖_nuc + ⟨Ĝ, Θ₀⟩. In case of the righthand conditions, this inequality holds since subgradients exist only at points where g* is defined. Furthermore, either pair in ⟨3.4⟩ implies the first equality in the display above ⟨3.4⟩. Hence, the (in)equalities 0 = l_λ(Θ̂) + v*(D̂) ≥ v(0) + v*(D̂) ≥ 0 ensure via ⟨3.2⟩ that Θ̂ and D̂ are optimal in ⟨3.1⟩ and ⟨3.3⟩, respectively.
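The conditions ⟨3.4⟩ can be verified numerically on an approximate minimizer. The sketch below solves ⟨3.1⟩ for W = ℝⁿ by proximal gradient descent, where the prox of the nuclear norm on S^m soft-thresholds eigenvalues; the solver is our illustration, not the text's method:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, lam = 12, 4, 0.3
mu_bar = float(n)
X = rng.standard_normal((n, m))
u = rng.standard_normal(m)
Y = X @ np.outer(u, u) + 0.1 * rng.standard_normal((n, m))

def prox_nuc(S, t):
    """Prox of t*||.||_nuc on S^m: soft-threshold the eigenvalues of S."""
    w, V = np.linalg.eigh(S)
    w = np.sign(w) * np.maximum(np.abs(w) - t, 0.0)
    return (V * w) @ V.T

step = mu_bar / np.linalg.norm(X.T @ X, 2)   # 1/L for the smooth summand
Theta = np.zeros((m, m))
for _ in range(5000):
    grad = X.T @ (X @ Theta - Y) / mu_bar
    grad = (grad + grad.T) / 2               # restrict the gradient to S^m
    Theta = prox_nuc(Theta - step * grad, step * lam)

# Check the lefthand side of <3.4>: D = (X Theta - Y)/mu_bar and
# G = -(X'D + D'X)/2 must satisfy ||G||_op <= lam (equality when Theta != 0).
D = (X @ Theta - Y) / mu_bar
G = -(X.T @ D + D.T @ X) / 2
print(np.abs(np.linalg.eigvalsh(G)).max(), "<=", lam)
```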

3.2.3. The least-squares solution set

Section 3.2.1 proves the existence of a minimizer of the objective function l_λ in ⟨3.1⟩, that is, an element Θ̂ ∈ S^m with l_λ(Θ̂) ≤ l_λ(Θ) for all Θ ∈ S^m. This section characterizes the set of minimizers of l_λ, which is denoted by argmin_{Θ∈S^m} l_λ(Θ) as in section 3.1.2.

The conjugate f* : W^{×m} → ℝ in lemma 3.4 is convex by construction. Its form implies differentiability with ∂f*(D) = {µ̄D + Y} and thereby strict convexity. The restriction of f* to the set {‖⟨⟨X, ·⟩⟩ + ⟨⟨·, X⟩⟩‖_op ≤ 2λ} inherits this property, and therefore exhibits at most one minimizer. Consequently, the second component of any minimizing/maximizing pair (Θ̂, D̂) is uniquely determined. The upper parts of ⟨3.4⟩ ensure that XΘ̂ = µ̄D̂ + Y is, too. In particular, any two minimizers Θ̂, Θ̂′ of ⟨3.1⟩ provide boundary points of the nuclear norm ball {‖·‖_nuc ≤ ℓ̂} with radius

  ℓ̂ = ‖Θ̂‖_nuc = ‖Θ̂′‖_nuc = (1/λ) ( inf_{Θ∈S^m} l_λ(Θ) − (1/(2µ̄)) ‖Y − XΘ̂‖² )
                           = (1/λ) ( inf_{Θ∈S^m} l_λ(Θ) − (1/(2µ̄)) ‖Y − XΘ̂′‖² ).

If ℓ̂ = 0, then the unique minimizer Θ̂ amounts to the m×m zero matrix. Otherwise, the lefthand side of ⟨3.4⟩ together with lemma 3.2 implies that Ĝ = −(⟨⟨X, D̂⟩⟩ + ⟨⟨D̂, X⟩⟩)/2 exhibits ‖·‖_op-length λ and that ⟨Ĝ, Θ̂⟩ = λℓ̂ for every minimizer Θ̂ of l_λ. The latter together with the requirement XΘ̂ = µ̄D̂ + Y leads to the assertion of lemma 3.5. Its proof starts on page 81 in appendix 3.b.

Lemma 3.5. The least-squares criterion l_λ exhibits a unique minimizer Θ̂ if ker(Ĝ ∓ λ id) ∩ ker X = {0}, wherein id symbolizes the identity map on ℝ^m.

The condition of lemma 3.5 is generally satisfied. In fact, if u ∈ ker(Ĝ − λ id) ∩ ker X, then λu = Ĝu = −⟨⟨X, D̂⟩⟩u/2 ∈ img ⟨⟨X, ·⟩⟩ = (ker X)^⊥, thus u = 0. The case u ∈ ker(Ĝ + λ id) ∩ ker X is analogous. Proposition 3.6 summarizes the preceding discussion. A proof of this assertion starts on page 81 of appendix 3.b.

Figure 3.6
The figure shows a selection of rank one boundary points of {‖·‖_nuc ≤ ℓ} ⊂ S², ℓ > 0, and the differences of elements of the same exposed face of {‖·‖_nuc ≤ ℓ}. The two ellipses in panel (A) consist of the rank one boundary points. The panel also contains the spans of a selection of these matrices (represented by dots). Panel (B) shows two pairs (Θ̂, Θ̂′) and (Θ̂″, Θ̂‴), each lying in an exposed face, alongside their component differences. In addition, the panel indicates the solution set of a linear equation. Both panels show coordinates with respect to B̄₁,₁, B̄₁,₂, B̄₂,₂; see (c) (section 2.1.1). Panel (A) omits the coordinate axes for visual clarity.

Proposition 3.6. The least-squares criterion l_λ in ⟨3.1⟩ exhibits a unique minimizer Θ̂. The latter equals zero if and only if ‖⟨⟨X, Y⟩⟩ + ⟨⟨Y, X⟩⟩‖_op ≤ 2µ̄λ.
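The zero-solution threshold in proposition 3.6 can be read off a fixed-point argument: Θ̂ = 0 solves ⟨3.1⟩ exactly when one eigenvalue soft-thresholding step from zero returns zero. A sketch assuming W = ℝⁿ, with ⟨⟨X, Y⟩⟩ read as XᵀY (the fixed-point reading and helper names are our illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 10, 3
mu_bar = float(n)
X = rng.standard_normal((n, m))
Y = rng.standard_normal((n, m))

G0 = (X.T @ Y + Y.T @ X) / 2     # symmetrized data term at Theta = 0
# Proposition 3.6: Theta = 0 is optimal iff ||G0||_op <= mu_bar * lam.
threshold = np.abs(np.linalg.eigvalsh(G0)).max() / mu_bar

def zero_is_fixed_point(lam, t=1.0):
    """One prox-gradient step from 0: soft-threshold eig(t*G0/mu_bar) by t*lam."""
    w = np.linalg.eigvalsh(t * G0 / mu_bar)
    shrunk = np.sign(w) * np.maximum(np.abs(w) - t * lam, 0.0)
    return np.allclose(shrunk, 0.0)

print(zero_is_fixed_point(1.01 * threshold))  # True: penalty strong enough
print(zero_is_fixed_point(0.50 * threshold))  # False: some eigenvalue survives
```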

From a geometric perspective, the lefthand side of ⟨3.4⟩ shows that argmin_{Θ∈S^m} l_λ(Θ) equals the intersection of the exposed face {Ĝ ∈ ncone(B_nuc, ·)} of the nuclear norm ball B_nuc = {‖·‖_nuc ≤ ℓ̂} and the set of solutions {X· = µ̄D̂ + Y}.

If m = 2, then panel (A) of figure 3.6 indicates that the difference between two distinct elements Θ̂ and Θ̂′ of the same exposed face has rank two. In fact, all rank one matrices of ‖·‖_nuc-length ℓ > 0, that is, matrices of the form ±ℓuuᵀ, ‖u‖ = 1, lie in the upper and lower ellipse shown in that panel. This panel also shows the spans of a selection of these matrices, represented by dots. The neighboring panel (B) verifies this observation for two pairs Θ̂ ≠ Θ̂′ and Θ̂″ ≠ Θ̂‴. Each of these pairs lies in an exposed face of the ball {‖·‖_nuc ≤ ℓ}: the gray line connecting the two ellipses and the area circumscribed by the upper ellipse. This observation is, however, incompatible with ∆ = Θ̂ − Θ̂′ ∈ ker X unless X equals zero. After all, the equality rk ∆ = 2 ensures that its columns form a basis of ℝ². Panel (B) illustrates this point by showing part of {X· = µ̄D̂ + Y} with X = (0, 1) and µ̄D̂ + Y = (0, 1).