


4.3 Block-coordinate projected gradient descent algorithms

where, for all y ∈ Y, G_i : Y → Y_i is given by

G_i(y) := [y_i − â_i(y) T_i(y) ∇_i g(y)]^+_{Y_i, T_i(y)}.    (4.11)

Thus, the mapping G is such that each of its block components G_i is a `local' projected gradient descent in Y_i characterised by a specific step-size rule â_i and a specific scaling strategy T_i (i ∈ N). Using (3.21), it is easy to show that (4.11) is equivalent to

G_i(y) = argmin_{z ∈ Y_i} { g(y) + ∇_i g(y)ᵀ(z − y_i) + (1/2) ‖z − y_i‖²_{[â_i(y)T_i(y)]⁻¹} },    (4.12)

where G_i(y) is the (unique) global minimum over Y_i of a function which can be regarded as a quadratic approximation of the function g near the point y and in the subset Y_i. Equivalently, the cost approximation (CA) interpretation of (4.11) regards the function z ∈ Y_i ↦ ∇_i g(y) + [â_i(y)T_i(y)]⁻¹(z − y_i) as a local, linear approximation of ∇_i g at y, and G_i(y) as the point z̄ ∈ Y_i satisfying the first-order optimality condition (Proposition 3.5) for this linear approximation of ∇_i g over Y_i, i.e.

{∇_i g(y) + [â_i(y)T_i(y)]⁻¹(z̄ − y_i)}ᵀ (z − z̄) ≥ 0,    ∀z ∈ Y_i.    (4.13)

From Definition 4.1, we see that the mapping G requires the specification, for each i ∈ N, of a step-size selection rule â_i (discussed in Section 4.3.2) and of a scaling function T_i. A favourable choice for T_i is the inverse of the local diagonal block ∇²_{ii}g(y) of the Hessian matrix, in which case the minimand of (4.12) reduces to a (scaled) Taylor expansion of g in Y_i. If in addition g is quadratic, then G_i reduces to a component solution method (see Section 3.4.5). The choice of the T_i functions will be discussed further in Section 4.3.8.
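To make the local update concrete, here is a minimal Python sketch of the block mapping G_i under two simplifying assumptions that are not made in the text: each Y_i is a box (so the scaled projection reduces to a componentwise clip, whatever the diagonal scaling), and T_i is diagonal and stored as a vector. The helper names are hypothetical.

```python
import numpy as np

def block_update(y, idx, grad_i, T_i, a, lower, upper):
    """One 'local' projected gradient step G_i(y) of (4.11).
    Assumptions: Y_i is the box [lower, upper] and T_i is a diagonal scaling
    given as a vector; grad_i(y) returns the block gradient of g at y; idx
    holds the coordinates of block i inside the full vector y."""
    z = np.clip(y[idx] - a * T_i * grad_i(y), lower, upper)  # scaled step, then projection
    y_new = y.copy()
    y_new[idx] = z                                           # only block i is modified
    return y_new
```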

The next proposition states that the condition G(y) = y is necessary and sufficient for y to be a solution of Problem 4.1. In other words, the operation G(y) will cause a nonzero displacement from y along at least one component unless y is a solution. Conversely, the operation G(y) will cause no displacement if y is a solution.

Proposition 4.2 (Stationarity) Consider Problem 4.1 and let y ∈ Y. The vector y is a solution if and only if G_i(y) = y_i for i ∈ N.

Proof We first assume that y is a solution of Problem 4.1 and thus satisfies the optimality condition given in Proposition 4.1. Let i ∈ N and h(z) denote the minimand of (4.12), i.e.

h(z) := g(y) + ∇_i g(y)ᵀ(z − y_i) + (1/2) ‖z − y_i‖²_{[â_i(y)T_i(y)]⁻¹}.    (4.14)


It follows from (4.4) that h(z) ≥ g(y) for all z ∈ Y_i. Noting that h(z) = g(y) if z = y_i, we find that y_i minimises h(z) over z ∈ Y_i. Hence G_i(y) = y_i, which holds for any i ∈ N.

Suppose now that G_i(y) = y_i for i ∈ N. By setting z̄ = y_i for each i ∈ N in (4.13) we find that (4.4) holds, and y is a solution of Problem 4.1, which completes the proof.

Since Y is a product set, the mapping G implicitly defines several modes of implementation. These include the Jacobi, Gauss-Seidel, arbitrary and random modes, defined as follows.

Definition 4.2 (Modes of implementation of G)

(Jacobi) The Jacobi mode of implementation of G is given by the algorithm y^{k+1} = G(y^k).

(Cyclic or Gauss-Seidel) Consider the Y → Y mapping

S := Ĝ_n ∘ Ĝ_{n−1} ∘ ... ∘ Ĝ_1,    (4.15)

where we introduce Ĝ_i : y ∈ Y ↦ Ĝ_i(y) := (y_1, ..., y_{i−1}, G_i(y), y_{i+1}, ..., y_n), i ∈ N. The Gauss-Seidel mode of implementation of G is given by the algorithm y^{k+1} = S(y^k).

(Arbitrary) A sequential arbitrary mode of implementation of G is given by the algorithm y^{k+1} = A_k(y^k), where (A_k) is a mapping sequence defined by a block selection sequence (ρ_k) taking arbitrary values in N and

A_k := Ĝ_{ρ_k}.    (4.16)

The various modes of implementation of G specified in Definition 4.2 correspond to different approaches to parallel optimisation.

In the Jacobi implementation, the local mappings G_i are applied simultaneously in all the nodes and to an identical point of Y. The Jacobi method can be qualified as synchronous. This study places special emphasis on the sequential implementation modes, where the local mappings are applied sequentially at different points of Y and one at a time. This constraint can nevertheless be relaxed if g(y) has a sparse dependency structure on y_1, ..., y_n, using the colouring schemes already mentioned in Section 3.4.5, so that the algorithms can be executed without global coordination.

Amongst the sequential modes, we make a distinction between the cyclic or Gauss-Seidel method, where y^{k+1} is obtained from y^k after n successive, ordered local mappings Ĝ_i, and the sequential arbitrary method, where the nodes of the network apply the mappings Ĝ_i in arbitrary order and at their own pace. The sequence (ρ_k) is arbitrary and can be generated at random. We note that the Gauss-Seidel mode of implementation is a particular case of the arbitrary mode where

(ρ_k) : 1, ..., n, 1, ..., n, 1, ... .    (4.17)

Therefore any property of the sequence (A_k) is also valid for S, and frequently we will only consider (A_k) when deriving results which hold for both modes of implementation. More sophisticated block selection techniques (e.g. the Gauss-Southwell method) relying on centralised decisions or requiring consensus among the nodes are not discussed in this text.
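The following Python sketch contrasts the three modes of Definition 4.2. The helpers are hypothetical: G_blocks[i](y) is assumed to return the block component G_i(y), and G_hat(y, i) to implement Ĝ_i, i.e. to return y with block i replaced by G_i(y).

```python
import numpy as np

def jacobi_step(y, G_blocks):
    """Jacobi mode: every block is updated from the same point y^k."""
    return np.concatenate([G_i(y) for G_i in G_blocks])

def gauss_seidel_step(y, G_hat, n):
    """Gauss-Seidel mode, Eq. (4.15): S = Ĝ_n ∘ ... ∘ Ĝ_1; each block update
    sees the blocks already updated earlier in the sweep."""
    for i in range(n):
        y = G_hat(y, i)
    return y

def arbitrary_step(y, G_hat, rho_k):
    """Arbitrary (possibly random) mode, Eq. (4.16): y^{k+1} = Ĝ_{ρ_k}(y^k)."""
    return G_hat(y, rho_k)
```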

4.3.2 Line search and global convergence of the sequential implementations

Global convergence of the projected gradient methods is typically guaranteed by line search techniques for the step size selection, where the step size at each step is chosen so as to minimise the objective function along a given descent direction. In practice, line search is done approximately by decreasing an initial step size value (usually set to 1) until a satisfactory step size is found, as for instance in the Armijo rule introduced in Section 3.4.4.

The convergence of Jacobi implementations of the gradient projection method is commonly obtained using a single line search procedure agreeing on an identical step size for all the nodes. The resulting algorithm is a mere parallel implementation of the global gradient projection method (4.9).

Global line search, however, is inappropriate for sequential implementations of the method, which require local step-size selection procedures. Throughout this chapter, we consider that the sequential mapping S and mapping sequences (A_k) are implemented with local approximate line search routines of the Armijo type. The following definition is then suggested for the functions â_i.

Definition 4.3 (Local approximate line search) Given the fixed scalar parameters β, σ ∈ (0,1) and a scaling matrix function T_i : Y → T(m_i) for each i ∈ N, we consider the mappings â_i : Y → (0,1] where, for y ∈ Y, â_i(y) is the largest step size a ∈ {β^m}_{m=0,1,2,...} satisfying

g(y) − g(ŷ(a)) ≥ σ a⁻¹ ‖ŷ_i(a) − y_i‖²_{T_i(y)⁻¹},    (4.18)

where ŷ_j(a) := y_j if j ≠ i and ŷ_i(a) := [y_i − a T_i(y) ∇_i g(y)]^+_{Y_i, T_i(y)}.

A particularity of the above step-size selection rule with regard to conventional applications of the Armijo rule is that the step sizes are decided independently and may thus differ across nodes. Besides, the decisions taken in Definition 4.3 are based on local information, under the assumption made in Problem 4.1 that the variations of g along block components can be computed locally.
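A minimal Python sketch of this backtracking rule is given below, assuming a diagonal scaling T_i stored as a vector and a hypothetical projection helper project_i onto Y_i; y is represented as a list of block vectors.

```python
import numpy as np

def local_armijo(g, grad_i, project_i, y, i, T_i, beta=0.5, sigma=0.1):
    """Local approximate line search of Definition 4.3: returns the largest
    a in {beta^m} satisfying the sufficient-decrease test (4.18), together
    with the corresponding trial point ŷ(a)."""
    a = 1.0
    gi = grad_i(y)
    while True:
        y_hat = list(y)
        y_hat[i] = project_i(y[i] - a * T_i * gi)     # ŷ_i(a); other blocks unchanged
        d = y_hat[i] - y[i]
        # test (4.18): decrease of g versus the squared T_i^{-1}-norm of the displacement
        if g(y) - g(y_hat) >= (sigma / a) * np.sum(d * d / T_i):
            return a, y_hat
        a *= beta                                     # backtrack
```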

It is shown in the next section that the step-size selection rule (4.18) guarantees the convergence of sequential implementations of the mapping G.

4.3.3 Global convergence of sequential implementations of G

The proof of the convergence of the sequential implementations of G is based on a descent approach. Since the Gauss-Seidel method S is a particular instance of the arbitrary implementation, we only need to prove global convergence for an arbitrary instance of (A_k). The proof relies on two lemmas.

Lemma 4.1 (Scaled gradient projection) Let i ∈ N, y ∈ Y, T ∈ T(m_i), and a ≥ 0. If ȳ ∈ Y is such that ȳ_i = [y_i − aT∇_i g(y)]^+_{Y_i,T}, then

−a∇_i g(y)ᵀ(ȳ_i − z) ≥ (ȳ_i − y_i)ᵀT⁻¹(ȳ_i − z),    ∀z ∈ Y_i.    (4.19)

Proof Since Y_i is a convex set we have, as per the scaled projection theorem (Proposition B.9-(ii) in Appendix B.4),

(y_i − aT∇_i g(y) − ȳ_i)ᵀT⁻¹(z − ȳ_i) ≤ 0,    ∀z ∈ Y_i,    (4.20)

which is equivalent to (4.19).

The next lemma states that local projected gradient descents lead to significant descents on g in the directions where (4.4) does not hold.

Lemma 4.2 (Descent) Let (4.2) hold, i ∈ N, y ∈ Y, T ∈ T(m_i), and a > 0. Consider ȳ ∈ Y with ȳ_i = [y_i − aT∇_i g(y)]^+_{Y_i,T} and ȳ_j = y_j if j ≠ i. Then

g(y) − g(ȳ) ≥ (σ/a) ‖ȳ_i − y_i‖²_{T⁻¹}    (4.21)

is satisfied for any σ ∈ (0,1) if a ≤ 2(1−σ)(λ̄L)⁻¹.

Proof Let ζ(t) := y + t(ȳ − y). We have

g(y) − g(ȳ) = −∇_i g(y)ᵀ(ȳ_i − y_i) − ∫₀¹ [∇_i g(ζ(t)) − ∇_i g(y)]ᵀ(ȳ_i − y_i) dt.    (4.22)

It follows from Lemma 4.1 (applied with z = y_i) that

−a∇_i g(y)ᵀ(ȳ_i − y_i) ≥ ‖ȳ_i − y_i‖²_{T⁻¹}.    (4.23)

Since, by (4.2), ‖∇_i g(ζ(t)) − ∇_i g(y)‖ ≤ tL‖ȳ − y‖ = tL‖ȳ_i − y_i‖, we find, using (4.22), (4.23), T ≼ λ̄I, and the Cauchy-Schwarz inequality (Proposition A.2 in Appendix A),

g(y) − g(ȳ) ≥ (1/a − λ̄L/2) ‖ȳ_i − y_i‖²_{T⁻¹},    (4.24)

which proves the lemma.

A direct consequence of Lemma 4.2 is that the step-size selection rule given in Definition 4.3 is well-defined for Problem 4.1. Indeed, for any y ∈ Y, i ∈ N and T_i : Y → T(m_i), we find

0 < a ≤ â_i(y) ≤ 1,    (4.25)

where

a := min{1, 2β(1−σ)/(λ̄L)}.    (4.26)

In other words, the computation of â_i requires a finite number of evaluations of (4.18).
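As a small worked check of (4.25)-(4.26), the snippet below bounds the number of backtracking trials; the numerical constants are illustrative placeholders, not values taken from the text.

```python
import math

beta, sigma, lam_bar, L = 0.5, 0.1, 2.0, 10.0   # assumed example constants
c = 2 * (1 - sigma) / (lam_bar * L)             # by Lemma 4.2, any a <= c passes (4.18)
a_low = min(1.0, beta * c)                      # lower bound (4.26) on the accepted step
max_trials = math.ceil(math.log(c) / math.log(beta)) + 1 if c < 1 else 1
print(a_low, max_trials)                        # 0.045 and 5: at most 5 evaluations of (4.18)
```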

We can now prove the convergence of the arbitrary mode of implementation of G. We consider block selection sequences (ρ_k) such that each node is chosen with non-null limiting relative frequency¹ as k → ∞. This ensures that, for all k and i ∈ N, we can find a finite step k̄ such that k̄ > k and ρ_{k̄} = i, and thus Ĝ_i is applied an infinite number of times. In the case of randomly generated sequences, this condition and the following proposition may only hold with probability one.

Proposition 4.3 (Global convergence of (A_k)) Consider a block-coordinate sequence (ρ_k) such that each i ∈ N is chosen with non-null limiting relative frequency as k → ∞, the corresponding mapping sequence (A_k), and a sequence (y^k) generated by (A_k). If Problem 4.1 has a finite optimum, then any limit point of (y^k) is a solution.

Proof Consider the sequence (y^k) generated by an arbitrary sequence (A_k), and a step k. Suppose that at step k the algorithm yields the point y^{k+1} = Ĝ_i(y^k) for some i ∈ N. Since y^{k+1} and y^k differ only in their ith block component, and using (4.18) and (4.25), we find

g(y^k) − g(y^{k+1}) ≥ (σ/λ̄) ‖y^{k+1} − y^k‖²,    (4.27)

¹The notion of limiting relative frequency is briefly discussed in Appendix D.1. In this section, the condition of non-null limiting relative frequency may be understood as lim inf_{k→∞} (1/k) Σ_{l=1}^{k} 1_{{i}}(ρ_l) > 0 for all i ∈ N, where the indicator function 1 is defined by (D.10).


which holds for any k. Now, suppose g has a finite optimum on Y. By (4.18) and the convexity of g, the sequence (y^k) is bounded and we can find a subsequence converging to a point y. Since g(y^k) is monotonically nonincreasing, we have g(y^k) ↓ g(y), and g(y^{k+1}) − g(y^k) → 0. It follows from (4.27) that

‖y^{k+1} − y^k‖ → 0,    (4.28)

and thus y^k → y. Using Lemma 4.1, we find, for all z ∈ Y_i,

−∇_i g(y^k)ᵀ(y_i^{k+1} − z) ≥ (y_i^{k+1} − z)ᵀ T_i(y^k)⁻¹ (y_i^{k+1} − y_i^k) / â_i(y^k).    (4.29)

If we write T̂ := â_i(y^k)T_i(y^k), (4.29) yields

−∇_i g(y^k)ᵀ(z − y_i^k) = ∇_i g(y^k)ᵀ(y_i^{k+1} − z) − ∇_i g(y^k)ᵀ(y_i^{k+1} − y_i^k)
    ≤ [T̂⁻¹(z − y_i^{k+1}) − ∇_i g(y^k)]ᵀ(y_i^{k+1} − y_i^k)
    ≤ ‖T̂⁻¹(z − y_i^{k+1}) − ∇_i g(y^k)‖ ‖y_i^{k+1} − y_i^k‖
    ≤ ‖T̂⁻¹(z − y_i^{k+1}) − ∇_i g(y^k)‖ ‖y^{k+1} − y^k‖.    (4.30)

Now, we fix i and consider the set W_i of all the time steps where a local projected gradient descent occurs in Y_i and (4.30) holds. Since by assumption the limiting relative frequency of the event ρ_k = i is positive, the set W_i is infinite and the subsequence (y^k)_{k∈W_i} converges to y. It follows from (4.28) and the boundedness of T̂⁻¹ that the right-hand side of (4.30) vanishes as k → ∞ and y^k → y. Hence we have ∇_i g(y)ᵀ(z − y_i) ≥ 0, and this result can be obtained for each i ∈ N. By Proposition 4.1, y is a solution of Problem 4.1.

4.3.4 Local convergence of the sequential implementations

Deterministic implementations

Linear convergence can generally be proven for deterministic block-coordinate implementations of the projected gradient algorithm (e.g. the Gauss-Seidel or Gauss-Southwell methods). The first result of this section is concerned with the Gauss-Seidel implementation of G and is a direct application of Theorem C.1 in Appendix C.1. Before proceeding, we make the following assumption on the norm of the residual y − [y − ∇g(y)]^+_Y.

Assumption 4.1 (Local error bound) Let Ỹ = argmin_{y∈Y} g(y) denote the set of solutions of Problem 4.1. The set Ỹ is non-empty, and for every υ ≥ min_{y∈Y} g(y) there exist scalars δ > 0 and τ > 0 such that

min_{z∈Ỹ} ‖y − z‖ ≤ τ ‖y − [y − ∇g(y)]^+_Y‖    (4.31)

for all y ∈ Y with g(y) ≤ υ and ‖y − [y − ∇g(y)]^+_Y‖ ≤ δ.

Assumption 4.1 is a transcription for Problem 4.1 of Assumption C.1 in Appendix C.1. Examples of situations where Assumption 4.1 holds are given in Appendix C.1. They include for instance the cases when g is a strongly convex function, when Y is polyhedral and g is a quadratic function, or when Y is a polyhedron and −g is the dual function of a convex optimisation problem with affine constraints and a strongly convex differentiable objective function with Lipschitz continuous gradient. Note that the second assumption required by Theorem C.1 (Assumption C.2 in Appendix C.1) holds by convexity of Problem 4.1.

The next proposition guarantees the linear convergence of the mapping S under Assumption 4.1.

Proposition 4.4 (Linear convergence of S) Suppose that g is bounded below on Y and Problem 4.1 has a non-empty set of solutions denoted by Ỹ = argmin_{y∈Y} g(y). Let Assumption 4.1 hold and (y^k) be a sequence generated by the mapping S with scaling functions T_i : Y → T(m_i), i ∈ N. Then {g(y^k)} converges at least Q-linearly to min_{y∈Y} g(y) and (y^k) converges at least R-linearly to a solution.

Proof We show that the mapping S satisfies the conditions of Theorem C.1 in Appendix C.1. Let y ∈ Y. Since S(y) ∈ Y, we can write

S(y) = [y − ∇g(y) + e]^+_Y,    (4.32)

for a certain vector e. By setting z[0] = y and computing successively z[i] = Ĝ_i(z[i−1]) for i = 1, ..., n, we find S(y) = z[n]. Since Y = ∏_{i=1}^{n} Y_i, (4.32) reduces to

z_i[i] = [y_i − ∇_i g(y) + e_i]^+_{Y_i},    i ∈ N.    (4.33)

Let i ∈ N and Φ_i : x ∈ R^{m_i} ↦ Φ_i(x) = [â_i(z[i−1])T_i(z[i−1])]⁻¹x. It follows from (4.13) that the gradient approximation

∇_i g(z[i−1]) + Φ_i(z_i[i] − z_i[i−1])    (4.34)

is orthogonal to Y_i at z_i[i], and thus

z_i[i] = [z_i[i] − ∇_i g(z[i−1]) − Φ_i(z_i[i] − z_i[i−1])]^+_{Y_i},    (4.35)

where by construction z_i[i−1] = y_i and z_i[i] = S_i(y). It is easy to see that (4.35) is equivalent to (4.33) if we set

e_i = S_i(y) − y_i − Φ_i(S_i(y) − y_i) − ∇_i g(z[i−1]) + ∇_i g(y).    (4.36)


Using (4.2), ‖z[i−1] − y‖ ≤ ‖S(y) − y‖, and ‖Φ_i(x)‖ ≤ (aλ)⁻¹‖x‖, where a is given by (4.26), we find

‖e_i‖ ≤ [1 + (aλ)⁻¹] ‖S_i(y) − y_i‖ + L ‖S(y) − y‖.    (4.37)

Since (4.37) holds for all i ∈ N, we have

‖e‖ ≤ n [1 + (aλ)⁻¹ + L] ‖S(y) − y‖.    (4.38)

Using (4.27), we also find

g(S(y)) − g(y) ≤ −(σ/λ̄) ‖S(y) − y‖².    (4.39)

It follows from (4.32), (4.38) and (4.39) that the conditions of Theorem C.1 for Problem 4.1 are satisfied by the algorithm y^{k+1} = S(y^k) under Assumption 4.1.

Stochastic implementations

The linear convergence of random implementations is more difficult to certify.

A significant convergence result was recently established in [Nes12] for a version of the random block-coordinate projected gradient descent algorithm used with constant (scalar) scaling coefficients, and with a uniform random block selection where descent occurs at each node i ∈ N with equal probability 1/n. In this particular algorithm, which has been successfully applied to huge-scale optimisation problems, scaling is based on local Lipschitz constants and is conservative in the sense that convergence is guaranteed with unit step sizes (i.e. without the need for a step-size selection routine). The scaling strategy of this algorithm can be described as follows. For each i ∈ N, we denote by I_i the identity matrix in R^{m_i} and by U_i the R^{m×m_i} matrix such that (U_1, ..., U_n) is the identity matrix in R^m. It is assumed that a Lipschitz constant L_i for ∇_i g is known, which satisfies

‖∇_i g(y) − ∇_i g(y + U_i z)‖ ≤ L_i ‖z‖,    z ∈ R^{m_i}, y, (y + U_i z) ∈ Y.    (4.40)

The descent direction at a node i ∈ N, which is computed by projected gradient descent in Y_i, is then scaled by the constant coefficient L_i⁻¹. In contrast to the methods based on step-size selection by line search, notice that this algorithm relies on the knowledge of the coordinate Lipschitz constants. Conservative bounds can be used in place of the Lipschitz constants (e.g. the actual Lipschitz constant L), though the performance of the algorithm remains dependent on the tightness of these bounds.
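A minimal sketch of this uniform random block method (constant steps 1/L_i, no line search) is given below; y is a list of block vectors, and grads[i] and projections[i] are hypothetical helpers returning ∇_i g(y) and the projection onto Y_i.

```python
import numpy as np

def random_block_pgd(grads, projections, L, y0, n_iter, rng=None):
    """Uniform random block-coordinate projected gradient descent (cf. [Nes12]):
    at each step one block i is drawn with probability 1/n and updated with the
    constant step 1/L_i, where L_i satisfies the coordinate Lipschitz bound (4.40)."""
    rng = rng or np.random.default_rng()
    y = [yi.copy() for yi in y0]
    n = len(y)
    for _ in range(n_iter):
        i = rng.integers(n)                              # P(rho_k = i) = 1/n
        y[i] = projections[i](y[i] - grads[i](y) / L[i]) # scaled projected step
    return y
```

With T_i = L_i⁻¹ I_i and σ ≤ 1/2, inequality (4.43) below shows that the unit step a = 1 always passes the test (4.18), so the line search can indeed be dispensed with.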

In this section, we make the observation that it is possible to recover the above-described algorithm using a random implementation of G in combination with constant scalar scaling and uniformly generated block selection sequences. We infer the following convergence result for the mapping sequence (A_k).

Proposition 4.5 (Linear convergence of (A_k)) Suppose that g∗ = inf_{y∈Y} g(y) is finite and that local Lipschitz constants {L_i}_{i∈N} satisfying (4.40) are known. Let (y^k) be a sequence generated by a mapping sequence (A_k) using σ ∈ (0, 1/2] for the step-size selection, T_i(y) = L_i⁻¹I_i for all y ∈ Y and i ∈ N, and a block selection sequence (ρ_k) drawn randomly with P(ρ_k = i) = 1/n for all k and i ∈ N. Then the step sizes computed by local line search are all equal to 1 and, for any confidence level ρ ∈ (0,1) and accuracy ϵ > 0, there is a k̄ < ∞ such that P(g(y^k) − g∗ ≤ ϵ) ≥ 1 − ρ for k ≥ k̄. If in addition g is strongly convex, then the expectation E{g(y^k)} converges at least Q-linearly to g∗.

Proof Let i ∈ N and y ∈ Y. By (4.40) and a rationale already used in the proof of Lemma 4.2, we find, for all z ∈ R^{m_i} such that y + U_i z ∈ Y,

g(y + U_i z) ≤ g(y) + ∇_i g(y)ᵀz + (L_i/2) ‖z‖².    (4.41)

Take a positive step size a and define ŷ(a) as in Definition 4.3 by ŷ_j(a) = y_j if j ≠ i and ŷ_i(a) = [y_i − aT_i(y)∇_i g(y)]^+_{Y_i,T_i(y)}. It follows from (4.41) and Lemma 4.1 that

g(y) − g(ŷ(a)) ≥ −∇_i g(y)ᵀ(ŷ_i(a) − y_i) − (L_i/2) ‖ŷ_i(a) − y_i‖²    (4.42)
    ≥ L_i (1/a − 1/2) ‖ŷ_i(a) − y_i‖²,    (4.43)

where (4.42) follows from (4.41) and (4.43) from (4.19). Hence (4.18) is satisfied for a = 1 if σ ≤ 1/2. By setting σ ∈ (0, 1/2] we recover the constrained minimisation algorithm of [Nes12] and the proposition follows from [RT12a, Theorem 4] and [Nes12, Theorem 5].

4.3.5 Asymptotic behaviour over polyhedrons

This section studies the asymptotic behaviour of convergent block-coordinate implementations of G. It is shown that, when approaching a convergence point enjoying a property of strict complementarity, the algorithms reduce to simple gradient descents in reduced spaces of smaller dimensions. The analysis is restricted to polyhedral product sets and we make the following assumption.


Assumption 4.2 (Polyhedral feasible set) For i ∈ N, Y_i is a polyhedron defined by a finite set C_i = {c_1, ..., c_{r_i}} of scalar affine constraints R^{m_i} → R, i.e.

Y_i = {y ∈ R^{m_i} : c(y) ≤ 0, ∀c ∈ C_i}.    (4.44)

We introduce the following definitions.

Definition 4.4 (Sets A_i, Ā_i and matrices E_i, Ē_i) Under Assumption 4.2, and for each i ∈ N and y ∈ Y_i, we define the set A_i(y) = {c ∈ C_i : c(y) = 0} as the set of the constraints specifying Y_i that are active at y, and Ā_i(y) = C_i \ A_i(y) as the set of the inactive constraints. We denote by E_i(y) any matrix the columns of which form an orthonormal basis of the nullspace of {∇c(y) : c ∈ A_i(y)}, and by Ē_i(y) any matrix such that (E_i(y), Ē_i(y)) is an orthogonal matrix of R^{m_i} and E_i(y)ᵀĒ_i(y) = 0.

Now we introduce the concept of strict complementarity. From Proposition B.6 in Appendix B.3, we know that a geometric interpretation of the optimality of a point y for Problem 4.1 under a constraint qualification is that −∇g(y) must be parallel to the normal vector of one hyperplane supporting Y at y. Under Assumption 4.2 in particular, we can write A_i(y_i) = {c_1, ..., c_{r̄_i}} for all i ∈ N. Proposition B.6 states that y is a solution of Problem 4.1 if and only if

∇_i g(y) = −Σ_{j=1}^{r̄_i} α_j ∇c_j(y_i)    (4.45)

holds for some nonnegative scalars α_1, ..., α_{r̄_i} and for all i ∈ N. Notice that (4.45) is in fact the KKT condition (3.45). If A_i(y_i) = ∅ the condition (4.45) reduces to ∇_i g(y) = 0.

We speak of strict complementarity at the solution y when the coefficients α_1, ..., α_{r̄_i} are all strictly positive.

Definition 4.5 (Strict complementarity) Let Assumption 4.2 hold and y be a solution of Problem 4.1 with A_i(y_i) = {c_1, ..., c_{r̄_i}}, i ∈ N. Strict complementarity holds at y if and only if minus the gradient of g at y is a positive combination of the gradients of the active constraints, i.e., for all i ∈ N such that A_i(y_i) ≠ ∅, one can find positive scalars α_1, ..., α_{r̄_i} satisfying

∇_i g(y) = −Σ_{j=1}^{r̄_i} α_j ∇c_j(y_i).    (4.46)

If Y takes the form (4.3), then strict complementarity at y reduces to

(y_i)_j = 0 ⇒ (∇_i)_j g(y) > 0,    ∀j ∈ P_i, i ∈ N,    (4.47)

where (∇_i)_j g denotes the jth component of ∇_i g.
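In the nonnegativity case (4.47), the property is straightforward to test numerically; a small sketch, assuming flat numpy arrays:

```python
import numpy as np

def strict_complementarity_nonneg(y, grad_y, tol=1e-10):
    """Check (4.47) for Y = {y >= 0}: every active coordinate (y_j = 0) must
    have a strictly positive partial derivative; grad_y is the gradient of g at y."""
    active = y <= tol
    return bool(np.all(grad_y[active] > tol))
```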

The next result states that strict complementarity at the point of convergence of a sequence generated by a block-coordinate implementation of G ensures that the active constraints at the convergence point are identified by the algorithm in a finite number of iterations. Since the Gauss-Seidel mode of implementation of G can be regarded as a particular case of the arbitrary mode, we only consider the Jacobi mode y^{k+1} = G(y^k) and the arbitrary mode y^{k+1} = A_k(y^k). For these two modes note that, at each step k and for each i ∈ N, we have either y_i^{k+1} = G_i(y^k) or y_i^{k+1} = y_i^k.

Proposition 4.6 (Identification of the active constraints) Under Assumption 4.2, let (y^k) be a sequence generated by a block-coordinate implementation of G which converges to a solution y of Problem 4.1 where strict complementarity holds. Then there exists a k̄ < ∞ such that A_i(y_i^k) = A_i(y_i) for all k ≥ k̄ and i ∈ N.

Proof Since the set of nodes N is finite, we only need to consider an arbitrary node i ∈ N, and we suppose that (4.46) holds for the strictly positive scalars α_1, ..., α_{r̄_i}.

We first show that there exists a k̂ < ∞ such that A_i(y_i^k) ⊂ A_i(y_i) for all k ≥ k̂. Otherwise there would be a constraint c ∈ Ā_i(y_i) and a subsequence (y^k)_{k∈K} such that c(y_i^k) = 0 for k ∈ K. Since y^k → y and by continuity of c, we would find c(y_i) = 0 and thus c ∈ A_i(y_i), which is a contradiction.

Suppose now that A_i(y_i) = ∅. Then we also have A_i(y_i^k) ⊃ A_i(y_i) for all k ≥ k̂ and we are done.

If otherwise A_i(y_i) ≠ ∅, then ‖∇_i g(y)‖ ≠ 0 by strict complementarity at y. Let M = {1, ..., r̄_i} denote the index set of the active constraints at y_i, and let A_i(y′) = Â for some point y′ ∈ Y_i, with Â = {c_j}_{j∈M̄}, M̄ ⊂ M and M̄ ≠ M. We have Â ⊂ A_i(y_i) and Â ≠ A_i(y_i). This implies that ∇_i g(y) cannot be expressed as a linear combination of the vectors {∇c_j(y_i)}_{j∈M̄}. Indeed, we have c_j(y′) = 0 for j ∈ M̄ and c_j(y′) < 0 for j ∈ M \ M̄, and thus Σ_{j∈M} α_j c_j(y′) = Σ_{j∈M\M̄} α_j c_j(y′) < 0. Since the constraints are affine we can write c_j(y′) = ∇c_j(y_i)ᵀ(y′ − y_i) for all y′ ∈ Y_i and j ∈ M. It follows that

Σ_{j∈M} α_j c_j(y′) = [Σ_{j∈M} α_j ∇c_j(y_i)]ᵀ (y′ − y_i) = −∇_i g(y)ᵀ(y′ − y_i),    (4.48)

where the last equality follows from (4.46); this quantity would be equal to 0, a contradiction, if ∇_i g(y) were a linear combination of the vectors {∇c_j(y_i)}_{j∈M̄}. Hence we can find a ∆ > 0, independent of y′, such that

‖∇_i g(y) + Σ_{j∈M̄} ᾱ_j ∇c_j(y_i)‖ > ∆,    ∀{ᾱ_j}_{j∈M̄}.    (4.49)


Let δ > 0. Since y^k → y, we can find a finite k̃ ≥ k̂ such that ‖y^k − y‖ < δ, and thus ‖y_i^k − y_i‖ < δ and ‖y_i^{k+1} − y_i^k‖ < 2δ, for all k ≥ k̃. By Lipschitz continuity of ∇g, we also have ‖∇_i g(y^k) − ∇_i g(y)‖ ≤ L‖y^k − y‖ < Lδ for k ≥ k̃. Consider now an iteration k ≥ k̃ of the algorithm yielding y^{k+1} from y^k, where y_i^{k+1} = G_i(y^k), and assume that A_i(y_i^{k+1}) = Â. We infer from the geometric interpretation of y_i^{k+1} as the solution of (4.12) (Proposition B.6 in Appendix B.3, or the KKT condition (3.45)) that one can find nonnegative coefficients {α̂_j}_{j∈M̄} such that

∇_i g(y^k) + [â_i(y^k)T_i(y^k)]⁻¹(y_i^{k+1} − y_i^k) = −Σ_{j∈M̄} α̂_j ∇c_j(y_i^{k+1}),    (4.50)

where the left member of (4.50) is the gradient at y_i^{k+1} of the cost approximation of g at y^k. Since the constraints are affine, ∇c_j(y_i^{k+1}) = ∇c_j(y_i), and we find, for k ≥ k̃,

‖∇_i g(y) + Σ_{j∈M̄} α̂_j ∇c_j(y_i)‖ = ‖[∇_i g(y) − ∇_i g(y^k)] − [â_i(y^k)T_i(y^k)]⁻¹(y_i^{k+1} − y_i^k)‖ ≤ [L + 2(aλ)⁻¹] δ,

where the equality follows from (4.50). This contradicts (4.49) if we initially set

δ < ∆ / [L + 2(aλ)⁻¹].    (4.51)

Hence A_i(y_i^{k+1}) ≠ Â.

Now, recall that at each step k where G or A_k is applied, we have either y_i^{k+1} = y_i^k, and thus A_i(y_i^{k+1}) = A_i(y_i^k), or y_i^{k+1} = G_i(y^k), which holds for an infinity of steps as y^k → y. Since the number of constraints specifying Y_i and the number of possible values for Â are finite, by induction we can find a k̄ < ∞ such that A_i(y_i^k) = A_i(y_i) for k ≥ k̄, which completes the proof.

In anticipation of Proposition 4.7, we introduce the following notations. Let (y^k) be a sequence generated by a block-coordinate implementation of G converging to a point y. If for each i ∈ N we denote by m̃_i the number of columns of E_i(y_i), we can write, for any z ∈ Y_i such that A_i(z) = A_i(y_i), z = y_i + E_i(y_i)ỹ for some ỹ ∈ R^{m̃_i}, where R^{m̃_i} is called the reduced space at y_i. We also define ∇̃_i g := E_i(y_i)ᵀ∇_i g, the projected gradient component near y_i, and T̃_i(y) := [E_i(y_i)ᵀ T_i(y)⁻¹ E_i(y_i)]⁻¹, the projected scaling function near y_i. We denote by m̃ := Σ_{i=1}^{n} m̃_i the dimension of the (global) reduced space at y. We can now formulate the following consequence of Proposition 4.6.

Proposition 4.7 (Descent in reduced spaces) Let Assumption 4.2 hold, and (y^k) be a sequence of points generated by a block-coordinate implementation of G and converging to a solution y of Problem 4.1 where strict complementarity holds. One can find a k̄ < ∞ such that, for any i ∈ N and k > k̄ where y_i^{k+1} = G_i(y^k), one has y_i^k = y_i + E_i(y_i)ỹ_i^k and y_i^{k+1} = y_i + E_i(y_i)ỹ_i^{k+1}, with ỹ_i^k, ỹ_i^{k+1} ∈ R^{m̃_i} and

ỹ_i^{k+1} = ỹ_i^k − â_i(y^k) T̃_i(y^k) ∇̃_i g(y^k),    (4.52)

which can be regarded as an unconstrained gradient descent step in the reduced space R^{m̃_i}.

Proof If m̃_i = 0, then (4.52) reduces to ỹ_i^{k+1} = ỹ_i^k, and the proposition is a direct consequence of Proposition 4.6. Hence we suppose that m̃_i > 0. For i ∈ N, let Y′_i := {z ∈ Y_i : A_i(z) = A_i(y_i)} and Ỹ_i := {E_i(y_i)ᵀ(z − y_i) : z ∈ Y′_i}. By continuity of the constraint functions, it is easy to show that Ỹ_i is an open subset of R^{m̃_i} if m̃_i > 0. Moreover, the function h_i(ỹ) := y_i + E_i(y_i)ỹ is a bijection between Ỹ_i and Y′_i, i.e. Y′_i = {h_i(z̃) : z̃ ∈ Ỹ_i}. It follows from Proposition 4.6 that one can find a k̄ < ∞ such that, for all k ≥ k̄, we have y_i^k, y_i^{k+1} ∈ Y′_i, and thus y_i^k = h_i(ỹ_i^k) and y_i^{k+1} = h_i(ỹ_i^{k+1}) for some points ỹ_i^k, ỹ_i^{k+1} ∈ Ỹ_i. If, in addition, i and k are such that y_i^{k+1} = G_i(y^k), then

ỹ_i^{k+1} = argmin_{z̃ ∈ Ỹ_i} { ∇̃_i g(y^k)ᵀ(z̃ − ỹ_i^k) + (1/2) ‖z̃ − ỹ_i^k‖²_{[â_i(y^k)T̃_i(y^k)]⁻¹} }    (4.53)

    = [ỹ_i^k − â_i(y^k) T̃_i(y^k) ∇̃_i g(y^k)]^+_{Ỹ_i, T̃_i(y^k)}    (4.54)

    = ỹ_i^k − â_i(y^k) T̃_i(y^k) ∇̃_i g(y^k),    (4.55)

where (4.53) follows from (4.12), (4.54) from (3.20), and (4.55) from the fact that Ỹ_i is an open set and ỹ_i^{k+1} ∈ Ỹ_i.

Propositions 4.6 and 4.7 hold for many convergent block-coordinate implementations of the scaled gradient projection algorithm with bounded or constant step sizes. Their implication is that the asymptotic convergence properties of the non-projected block-coordinate gradient descent methods for unconstrained problems are also enjoyed by the projected algorithms in constrained problems.
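For the common nonnegativity case, the reduced-space objects of Definition 4.4 and Proposition 4.7 are easy to construct explicitly; the following sketch (with hypothetical helper names) takes E_i(y_i) as the identity columns of the free coordinates of a block.

```python
import numpy as np

def reduced_basis_nonneg(y_i, tol=1e-10):
    """For a nonnegativity block Y_i = {y >= 0}, the active constraints at y_i
    are the zero coordinates, and the columns of E_i(y_i) can be taken as the
    identity columns of the free (inactive) coordinates."""
    free = np.flatnonzero(y_i > tol)
    return np.eye(len(y_i))[:, free]

# The projected gradient of Proposition 4.7 is then E.T @ grad_i, and the
# update (4.52) only moves the free coordinates of block i.
```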

4.3.6 Asymptotic rates of convergence

In this section we derive asymptotic convergence rates for block-coordinate implementations of G over polyhedrons. We first assume strict complementarity and twice continuous differentiability at the point of convergence, then uniqueness of the solution. The particular cases when the Hessian is discontinuous at the point of convergence and when strict complementarity does not hold are discussed at the end of the section.

Local convergence under strict complementarity

Let y be the point of convergence of a convergent algorithm. We introduce the block-diagonal m × m̃ matrix

E(y) := diag(E_1(y_1), ..., E_n(y_n)),    y ∈ Y,    (4.56)

and define the projected Hessian of g at y as ∇̃²g(y) := E(y)ᵀ∇²g(y)E(y), and the block-diagonal form T̃ := diag(T̃_1, ..., T̃_n), where T̃_i is the projected scaling function at node i. We decompose the Hessian matrix of g into

∇²g = D − L − Lᵀ,    (4.57)

where D = diag(∇²_{11}g, ..., ∇²_{nn}g) is block diagonal and L is strictly lower block-triangular. By setting D̃(y) := E(y)ᵀD(y)E(y) and L̃(y) := E(y)ᵀL(y)E(y), we find ∇̃²g = D̃ − L̃ − L̃ᵀ. The next proposition is concerned with the asymptotic behaviour of the local step-size selection rule. Again, we regard S as a particular instance of (A_k).

Proposition 4.8 (Asymptotic efficiency of the step size rule) Under Assumption 4.2, consider a block-coordinate implementation of G, and let the algorithm generate a sequence of points (y^k) converging to a solution y of Problem 4.1 where T is continuous, strict complementarity holds and g is twice continuously differentiable. Then the step sizes computed by local line search at a node i ∈ N become identically equal to 1 near y if (1−σ)T̃_i(y)⁻¹ − (1/2)∇̃²_{ii}g(y) ≻ 0, and at all the nodes if

2(1−σ)T̃(y)⁻¹ − D̃(y) ≻ 0.    (4.58)

Proof Let i ∈ N. According to Propositions 4.6 and 4.7, there is a k̄ such that, for any step k ≥ k̄ at which y_i^{k+1} = G_i(y^k), we have A_i(y_i^k) = A_i(y_i^{k+1}) = A_i(y_i) and y_i^{k+1} − y_i^k = E_i(y_i)δ̃, where δ̃ ∈ R^{m̃_i} is the displacement in the reduced space given by

δ̃ = −a T̃_i(y^k) ∇̃_i g(y^k).    (4.59)

The Taylor theorem (see (A.9) in Appendix A) yields, after simplification,

g(y^{k+1}) = g(y^k) − δ̃ᵀ [ (1/a) T̃_i(y^k)⁻¹ − (1/2) ∇̃²_{ii}g(y^k) ] δ̃ + o(‖δ̃‖²).    (4.60)

Using (4.60), the condition (4.18) for a to be a valid step size reduces to

δ̃ᵀ [ ((1−σ)/a) T̃_i(y^k)⁻¹ − (1/2) ∇̃²_{ii}g(y^k) ] δ̃ + o(‖δ̃‖²) ≥ 0,    (4.61)

where ‖δ̃‖ → 0 as k → ∞ by stationarity of y. Condition (4.61) holds as ‖δ̃‖ → 0, and a becomes an acceptable step size at node i for k large enough, if

((1−σ)/a) T̃_i(y)⁻¹ − (1/2) ∇̃²_{ii}g(y) ≻ 0,    (4.62)

which proves the proposition.

Notice that it is always possible to enforce the condition (4.58) by decreasing the scales T_i.

In the next result, we derive the asymptotic convergence rate of the Gauss-Seidel implementation of G. The positive definiteness of ∇²g at the point of convergence is required; this implies the uniqueness of the solution.

Proposition 4.9 (Matrix convergence rate of S) Let Assumption 4.2 hold and suppose that Problem 4.1 has a unique and finite solution y where strict complementarity holds and ∇²g is positive definite and continuous. Let (y^k) be a sequence generated by S with scaling functions T_i : Y → T(m_i) (i ∈ N) continuously differentiable at y and meeting the conditions for trivial step-size selection near y specified by Proposition 4.8. Then (y^k) converges linearly to y at rate

R̃_S(y) = [T̃(y)⁻¹ − L̃(y)]⁻¹ [T̃(y)⁻¹ − D̃(y) + L̃(y)ᵀ],    (4.63)

with spectral radius ρ(R̃_S(y)) < 1.

Proof We study the first-order sensitivity of S in the reduced space at y. One can find a k̄ < ∞ such that, for any k ≥ k̄, we have y^k = y + E(y)ỹ^k for some ỹ^k ∈ R^{m̃}. For i ∈ N, we denote by Ĩ_i the identity matrix of R^{m̃_i}, and define

W̃_i := diag(Ĩ_1, ..., Ĩ_{i−1}, 0, ..., 0).    (4.64)

Let ỹ be a point of the reduced space R^{m̃} and d̃ symbolise a displacement in R^{m̃}. For i ∈ N, the point obtained after local displacements from ỹ along the first i−1 components of d̃ is given by

ȳ_i(d̃, ỹ) := y + E(y)(ỹ + W̃_i d̃).    (4.65)

Consider now the function Z̃ = (Z̃_1, ..., Z̃_n) such that

Z̃_i(d̃, ỹ) = [T̃_i(ȳ_i(d̃, ỹ))]⁻¹ d̃_i + ∇̃_i g(ȳ_i(d̃, ỹ)),    i ∈ N.    (4.66)


The implicit function theorem (Proposition A.3 in Appendix A) states that there exists a function d̃(ỹ), continuously differentiable on a neighbourhood Ω̃ ⊂ R^{m̃} of 0, such that d̃(0) = 0 and Z̃(d̃(ỹ), ỹ) = 0 for all ỹ ∈ Ω̃. For ỹ ∈ Ω̃, d̃(ỹ) gives the displacement caused by an application of S at the point y + E(y)ỹ, and is such that

d̃_i(ỹ) = −T̃_i(ȳ_i(d̃(ỹ), ỹ)) ∇̃_i g(ȳ_i(d̃(ỹ), ỹ)),    i ∈ N.    (4.67)

Consider the function z̃(ỹ) := Z̃(d̃(ỹ), ỹ). By differentiation of z̃ at 0 we find

J_z̃(0) = [T̃(y)⁻¹ − L̃(y)] J_d̃(0) + ∇̃²g(y),    (4.68)

where J_z̃ and J_d̃ denote the Jacobians of z̃ and d̃, respectively. From J_z̃(0) = 0 follows

J_d̃(0) = −[T̃(y)⁻¹ − L̃(y)]⁻¹ ∇̃²g(y).    (4.69)

For k ≥ k̄ we have ỹ^{k+1} = ỹ^k + d̃(ỹ^k). The Taylor theorem and d̃(0) = 0 yield

ỹ^{k+1} = ỹ^k + J_d̃(0)ỹ^k + h(ỹ^k)    (4.70)

    = R̃_S(y)ỹ^k + h(ỹ^k),    (4.71)

with h(ỹ) = o(‖ỹ‖), where (4.70) follows from (A.8) and (4.71) from (4.69). We can rewrite R̃_S(y) as

R̃_S(y) = (D̂ − Ê)⁻¹Êᵀ,    (4.72)

where D̂ = 2T̃(y)⁻¹ − D̃(y) and Ê = T̃(y)⁻¹ − D̃(y) + L̃(y). Noting that D̂ − Ê − Êᵀ = ∇̃²g(y) is positive definite and (D̂ − Ê) is nonsingular, the Ostrowski-Reich theorem (Theorem A.2 in Appendix A) states that ρ(R̃_S(y)) < 1 if D̂ ≻ 0, i.e. if

2T̃(y)⁻¹ − D̃(y) ≻ 0.    (4.73)

In that case the local convergence of the algorithm is analogous to that of the procedure which consists of solving the equation ∇̃²g(y)ỹ = 0 using the Gauss-Seidel iterative method ỹ^{k+1} = R̃_S(y)ỹ^k. Observing that (4.58) implies (4.73) completes the proof.

The fact that (4.58) reduces to (4.73) when we let σ → 0 is unsurprising. Indeed, if (4.73) holds, then it is possible to find a σ > 0 satisfying (4.58). In other words, (4.73) guarantees the existence of parameter values such that y^{k+1} = S(y^k) will converge to the solution with unit step sizes near the optimum, and this is only possible if ρ(R̃_S(y)) < 1.
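Rates of the form (4.63) are easy to evaluate numerically once the (projected) Hessian at the solution is available. The sketch below uses 1×1 blocks, so D and L are simply the diagonal and the (negated) strictly lower triangle of the Hessian; with local Newton scaling (T⁻¹ = D̃) the expression reduces to (4.85). The numerical Hessian used in the example is an assumption for illustration.

```python
import numpy as np

def gauss_seidel_rate(H, T_inv):
    """Asymptotic rate matrix (4.63) of S in the reduced space, for 1x1 blocks:
    R_S = (T^{-1} - L)^{-1} (T^{-1} - D + L^T), where H = D - L - L^T."""
    D = np.diag(np.diag(H))
    L = -np.tril(H, k=-1)                 # H = D - L - L^T  =>  L = -tril(H, -1)
    R = np.linalg.solve(T_inv - L, T_inv - D + L.T)
    return R, np.max(np.abs(np.linalg.eigvals(R)))

# Example with an assumed projected Hessian and Newton scaling T^{-1} = D:
H = np.array([[8.0, 4.0], [4.0, 8.0]])
R, rho = gauss_seidel_rate(H, np.diag(np.diag(H)))
print(rho)    # 0.25 < 1 for this example, confirming linear convergence of S
```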

Extensions

In some particular cases, rates of the type (4.63) cannot be derived. These situations include implementations with variable step sizes near the solution, and the absence of strict complementarity or the discontinuity of the Hessian matrix at the solution. In such scenarios, the asymptotic trajectory of S is expected to jump between regions of Y in which the algorithm behaves differently.

Variable step-sizes near the point of convergence. When for instance the condition (4.58) does not hold, the step sizes computed by line search may vary near the optimum. One can equivalently consider that the scaling function at any node i takes various expressions in a set {β^m T_i}_{m=0}^{m̄}, where β^{m̄} is the smallest step size permitted by (4.25). By following the rationale of the proof of Proposition 4.9, one shows that the local convergence of the algorithm is that of a stable discrete-time switching system defined by a finite rate set {R̃_S(ψ, y)}_{ψ∈Ψ} with Ψ ⊂ {0, ..., m̄}ⁿ and ρ(R̃_S(ψ, y)) < 1 for ψ ∈ Ψ, and given near the optimum by an equation of the type

ỹ^{k+1} = R̃_S(ψ(k), y)ỹ^k + h(ỹ^k),    (4.74)

where h(ỹ) = o(‖ỹ‖), R̃_S(·, y) takes an expression of the type (4.63), and ψ : N → Ψ is a switching function.

Discontinuous Hessian. Propositions 4.8 and 4.9 can also be extended to the case when the Hessian of g shows discontinuities in Y. An example of a function that is only piecewise twice continuously differentiable is the dual of the NUM problem (Problem 3.3) formulated in Section 3.4.5 when the side constraint set S is polyhedral; this was observed in Examples 3.7 and 3.8 in Section 3.4.3.

Suppose that the minimum y is unique, and that ∇²g is discontinuous at y and continuous, positive definite almost everywhere on a neighbourhood Ω of y, so that Ω is partitioned into a finite number q̄ of open sets where ∇²g is continuous, and at the borders of which ∇²g is not defined and one should instead consider either the subderivatives of ∇g or the directional derivatives {∇²ĝ_q}_{q=1}^{q̄}, where each ∇²ĝ_q is assumed continuous, positive definite and equal to ∇²g on the qth subset of Ω. Near the solution and for all i ∈ N, we have T_i(y) = T_{i,q}(y) at any y located in the qth subset of Ω, where each T_{i,q} ∈ T(m_i) for q ∈ {1, ..., q̄}. Consider now the reduced space at y and the corresponding projected forms ∇̃²ĝ_q and T̃_{i,q} for q = 1, ..., q̄ and i ∈ N. By proceeding as in the proof of Proposition 4.8, we find that 1 becomes an acceptable step size near y if

(1−σ) T̃_{i,q}(y)⁻¹ − (1/2) ∇̃²_{ii}ĝ_r(y) ≻ 0    (4.75)

holds for all q, r ∈ {1, ..., q̄}. The local convergence of S is then given by a switching system of the type (4.74) with Ψ ⊂ {1, ..., q̄}ⁿ.

4.3.7 Asymptotic convergence of Jacobi modes of implementation of G

Matrix convergence rate

Similar developments for the Jacobi mode of implementation of G lead to the convergence rate

R̃_G(y) = Ĩ − T̃(y) ∇̃²g(y),    (4.76)

where Ĩ denotes the identity matrix in the reduced space R^{m̃}. Hence R̃_G(y) is the asymptotic rate of convergence of the sequences generated by any Jacobi implementation of G which converge to y with unit step sizes near y. The global convergence of such sequences is typically achieved via global line search and network-scale consensus on the step sizes. In contrast with the Gauss-Seidel mode, the condition for unit step sizes near the optimum, as well as the condition ρ(R̃_G(y)) < 1 for linear asymptotic convergence, are difficult to guarantee in distributed settings by means of a local analysis. In the special case when we set T̃(y) = D̃(y)⁻¹ in a neighbourhood of y, (4.76) reduces to R̃_G(y) = D̃(y)⁻¹[L̃(y) + L̃(y)ᵀ] and takes the form of the asymptotic convergence rate of a Jacobi method for solving a system of linear equations.

If, in addition, m_i = 1 for all i ∈ N, the elements of L̃(y) are nonnegative, and σ < 1/2, then ρ(R̃_S(y)) < 1, and it follows from the Stein-Rosenberg theorem (Theorem A.1 in Appendix A) that ρ(R̃_G(y)) < 1 as well and the Jacobi implementation of G is convergent on the condition that the step sizes reduce to 1 near the point of convergence and that the initial point y^0 is chosen sufficiently close to the optimum y. Global convergence of the Jacobi implementation of G is however not guaranteed for any step-size selection rule and any initial point y^0 ∈ Y, as illustrated in Example 4.2, where a Jacobi implementation of G combined with the local step-size selection rule of Definition 4.3 proves to diverge, and in Example 4.3, where the same algorithm is convergent when started in a certain neighbourhood of the solution but does not converge to the solution when the initial point lies outside this neighbourhood. As already mentioned in Section 4.3.2, the synchronous gradient projection algorithms are generally implemented with identical step sizes for all the nodes of the network, which usually requires network-scale line search and consensus on the step sizes at each step.

It is interesting to note that R̃_G(y) is in fact the asymptotic rate of convergence of any gradient descent algorithm (implemented in parallel or not) reducing near the optimum y to

ỹ^{k+1} = ỹ^k − T̃(y^k) ∇̃g(y^k).    (4.80)

Also, (4.76) can be derived by setting n = 1, L̃(y) = 0 and D̃(y) = ∇̃²g(y) in (4.63).

Accelerated methods

The convergence rates (4.63) and (4.76) cannot be set to 0 under the assumption that T̃(y) is block-diagonal. Hence attaining superlinear convergence in non-degenerate problems is impossible with block-coordinate implementations of the mapping G. From the proof of Proposition 4.9, it is easily seen that superlinear convergence can only be achieved with non-diagonal

Example 4.2 (Global convergence of block-coordinate algorithms (i))

[Figure: the diverging Jacobi iterates (a, a), (−2a, −2a), (4a, 4a), (−8a, −8a), ... in the (y_1, y_2) plane, together with the Gauss-Seidel iterates converging to (0, 0).]

This example shows that the global convergence of the Jacobi implementation of G combined with the local line search routine given in Definition 4.3 is not guaranteed. We consider the function of Example 4.1, previously defined in (4.7), and set Y = R². Let y^0 = (a, a) and (y^k) be a sequence of points generated by successive applications of y^{k+1} = G(y^k) with local line search as in Definition 4.3 and with the scaling matrices T_1(y) = 16/3 and T_2(y) = 10/3 for all y in Y. If σ ≤ 1/3, we find y^k = (−2)^k (a, a) and the sequence of points diverges.

The plain line on the figure shows the sequence of points generated by the Gauss-Seidel algorithm y^{k+1} = S(y^k) with initial point y^0 = (a, a), where S is implemented with the same constant scaling functions T_1 = 16/3 and T_2 = 10/3 and local line search. We see that this sequence converges to the solution (0, 0) by successive descents along y_1 and y_2.


block-matrices T̃(y), and this requires the exploitation of non-strictly local information by each node.

One possible way, suggested in recent studies on parallel optimisation, to reduce the convergence rates is to use as descent directions approximations of the (global) Newton direction −[∇²g(y)]⁻¹∇g(y).

Example 4.3 (Global convergence of block-coordinate algorithms (ii))

[Figure: the regions y_1 + y_2 ≥ 1, |y_1 + y_2| < 1 and y_1 + y_2 ≤ −1 of the (y_1, y_2) plane, separated by the lines y_1 + y_2 = 1 and y_1 + y_2 = −1, with the oscillating Jacobi iterates (1, 1) and (−1, −1).]

Consider the unconstrained optimisation problem of minimising

g(y) = (y_1 + y_2 + 2)² + (y_1 − y_2)²      if y_1 + y_2 ≥ 1,
g(y) = 3(y_1 + y_2)² + (y_1 − y_2)² + 6     if |y_1 + y_2| < 1,
g(y) = (y_1 + y_2 − 2)² + (y_1 − y_2)²      if y_1 + y_2 ≤ −1,    (4.77)

on Y = R². It is easy to verify that g is strictly convex and continuously differentiable on Y with gradient

∇g(y) = 4(y_1 + 1, y_2 + 1)        if y_1 + y_2 ≥ 1,
∇g(y) = 4(2y_1 + y_2, y_1 + 2y_2)  if |y_1 + y_2| < 1,
∇g(y) = 4(y_1 − 1, y_2 − 1)        if y_1 + y_2 ≤ −1,    (4.78)

and unique minimum y = (0, 0). The Hessian ∇²g is given by ∇²g(y) = A if |y_1 + y_2| < 1 and ∇²g(y) = B if |y_1 + y_2| > 1, where we define

A = 4 [ 2 1 ; 1 2 ],    B = 4 [ 1 0 ; 0 1 ],    (4.79)

and ∇²g is not defined between these regions.

Now, let (y^k) be a sequence generated by the Jacobi algorithm y^{k+1} = G(y^k), where the mapping G is combined with the local line search of Definition 4.3 and with scaling T(y) = A⁻¹ if |y_1 + y_2| ≤ 1 and T(y) = B⁻¹ if |y_1 + y_2| > 1, which yields T̃(y)⁻¹ = D̃(y) for all y ∈ Y such that |y_1 + y_2| ≠ 1.

It is easy to see that for any initial point y^0 such that |y^0_1 + y^0_2| < 1, we find y^1 = y and the sequence converges to the optimum in one iteration. However, if for instance y^0 = (1, 1) and σ < 3/4 in (4.18), we find y^k = (−1)^k (1, 1) and the resulting sequence oscillates between the points (1, 1) and (−1, −1).

Under the condition ρ(D^{−1/2}(L + Lᵀ)D^{−1/2}) < 1, the Taylor development

[∇²g]⁻¹ = D^{−1/2} Σ_{t=0}^{∞} [D^{−1/2}(L + Lᵀ)D^{−1/2}]^t D^{−1/2}    (4.81)

is considered, where, under separability assumptions, the dependency structure of each term of the development grows with the parameter t. The technique, denoted by N(q) in this manuscript, consists of setting T^k = T^{(q)}(y^k) in (4.9) with

T^{(q)} = D^{−1/2} Σ_{t=0}^{q} [D^{−1/2}(L + Lᵀ)D^{−1/2}]^t D^{−1/2},    (4.82)

where q is a nonnegative integer directly proportional to the computational complexity and the communication overhead. Notice that N(0) reduces to the Jacobi mode of implementation of G with scaling T_i = [∇²_{ii}g]⁻¹.
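The scaling (4.82) is straightforward to form explicitly when the Hessian is available; the sketch below uses 1×1 blocks, so D is the diagonal of the Hessian and q = 0 recovers the Jacobi scaling D⁻¹.

```python
import numpy as np

def neumann_scaling(H, q):
    """N(q) scaling matrix (4.82): truncated Neumann-series approximation of
    the inverse Hessian built from H = D - L - L^T, valid when
    rho(D^{-1/2}(L+L^T)D^{-1/2}) < 1."""
    d_isqrt = np.diag(1.0 / np.sqrt(np.diag(H)))
    offdiag = np.diag(np.diag(H)) - H                   # = L + L^T
    M = d_isqrt @ offdiag @ d_isqrt
    S = sum(np.linalg.matrix_power(M, t) for t in range(q + 1))
    return d_isqrt @ S @ d_isqrt
```

As q grows, T^{(q)} approaches the inverse Hessian and the unconstrained rate (4.84) below vanishes, at the price of a denser dependency structure.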

From (4.76), we find that the asymptotic convergence rate of the projected gradient algorithm used with (4.82) is given by

R̃_{N(q)}(y) = T̃^{(q)}(y) E(y)ᵀ T^{(q)}(y)⁻¹ R_{N(q)}(y) E(y),    q = 0, 1, 2, ...,    (4.83)

where

R_{N(q)}(y) = [D(y)⁻¹(L(y) + L(y)ᵀ)]^{q+1}    (4.84)

is the asymptotic convergence rate of the unconstrained problem (i.e. when Y = R^m). Note that ρ(R_{N(q)}(y)), and thus ρ(R̃_{N(q)}(y)), vanish as q grows. This shows that the parameter q represents a trade-off between the speed of convergence on the one hand, and the quantity of exchanged information and the remoteness of the neighbours each node must communicate with on the other.

In Section 6.1.1, the performance in terms of matrix convergence rate of the N(q) algorithm will be compared with that of the Gauss-Seidel implementation of G for the dual of a simple NUM problem.

4.3.8 Second-order scaling

From Propositions 4.8 and 4.9 and from the cost approximation interpretation of gradient projections, we see that it is appropriate to scale the descent directions based on second-order information, as in the Newton method. In this section we briefly discuss two second-order scaling strategies: local Newton scaling and diagonal scaling. Possible variants include for instance the quasi-Newton methods, based on the iterative computation of approximations of the inverse Hessian, or the constant scaling technique based on Lipschitz constants discussed in Section 4.3.4.


Local Newton scaling

A `local' Newton direction in a subspace R^{m_i} is obtained by setting the scaling matrix to the inverse of the ith diagonal block of the Hessian ∇²g, i.e. T_i = [∇²_{ii}g]⁻¹, provided that ∇²_{ii}g is well-conditioned. Under the assumption made in Problem 4.1 that the variations in g(y) due to local displacements in the feasible set have local impact, the diagonal blocks of ∇²g, and thus the local Newton directions, can be computed in closed form by the nodes using only local information. This is shown in Appendix B.5 for the dual of the NUM problem (Problem 3.3) introduced in Section 3.4.5. Using this strategy for i ∈ N, we have T̃(y)⁻¹ = D̃(y) near the point of convergence and the asymptotic rate (4.63) is minimised.

Since (4.58) always holds when local Newton scaling is used and σ < 1/2, it follows from Proposition 4.9 that sequences generated by Gauss-Seidel implementations of G converge linearly with step sizes identically equal to 1 near the point of convergence. We have the following proposition².

Proposition 4.10 (Local Newton scaling) Under Assumption 4.2, suppose that Problem 4.1 has a unique and finite solution y where strict complementarity holds, g is twice continuously differentiable, and the diagonal blocks of ∇²g are invertible and continuously differentiable. Let (y^k) be a sequence generated by S, where we use local Newton scaling T_i(y) = [∇²_{ii}g(y)]⁻¹ near y and set σ ∈ (0, 1/2) for the local line search. Then the step sizes become identically equal to 1 after a finite number of iterations. If in addition ∇²g(y) is positive definite, then (y^k) converges linearly to y at rate

R̃_S(y) = [D̃(y) − L̃(y)]⁻¹ L̃(y)ᵀ,    (4.85)

with ρ(R̃_S(y)) < 1.

Diagonal scaling

Drawbacks of local Newton scaling include the computation overhead and matrix conditioning issues due to the inversions of the ∇²_{ii}g matrices and the scaled projections on the sets Y_i (solutions of quadratic programs). Diagonal scaling is an effort-saving strategy which simplifies the computations by using diagonal approximations of the ∇²_{ii}g matrices. If Y is a box, the projections

2It can be seen from from (4.85) that convergence is superlinear iL(y˜ ) = 0. This degenerated case occurs for instance in Problem 3.3 when the constraints are not binding variables assigned to dierent nodes and the problem reduces to n independent optimisation problems, while local Newton scaling leads tonlocal executions of the Newton algorithm with quadratic convergence.