


4.3 Block-coordinate projected gradient descent algorithms

where, for all y ∈ Y, G_i : Y → Y_i is given by

G_i(y) := [y_i − â_i(y) T_i(y) ∇_i g(y)]^+_{Y_i, T_i(y)}.    (4.11)

Thus, the mapping G is such that each of its block components G_i is a `local' projected gradient descent in Y_i characterised by a specific step-size rule â_i and a specific scaling strategy T_i (i ∈ N). Using (3.21), it is easy to show that (4.11) is equivalent to

G_i(y) = argmin_{z ∈ Y_i} { g(y) + ∇_i g(y)ᵀ(z − y_i) + (1/2) ‖z − y_i‖²_{[â_i(y)T_i(y)]⁻¹} },    (4.12)

where G_i(y) is the (unique) global minimum over Y_i of a function which can be regarded as a quadratic approximation of the function g near the point y and in the subset Y_i. Equivalently, the cost approximation (CA) interpretation of (4.11) regards the function z ∈ Y_i ↦ ∇_i g(y) + [â_i(y)T_i(y)]⁻¹(z − y_i) as a local, linear approximation of ∇_i g at y, and G_i(y) as the point z̄ ∈ Y_i satisfying the first-order optimality condition (Proposition 3.5) for this linear approximation of ∇_i g over Y_i, i.e.

{∇_i g(y) + [â_i(y)T_i(y)]⁻¹(z̄ − y_i)}ᵀ (z − z̄) ≥ 0,    ∀z ∈ Y_i.    (4.13)

From Definition 4.1, we see that the mapping G requires the specification, for each i ∈ N, of a step-size selection rule â_i (discussed in Section 4.3.2) and of a scaling function T_i. A favourable choice for T_i is the inverse of the local diagonal block ∇²_{ii}g(y) of the Hessian matrix, in which case the minimand of (4.12) reduces to a (scaled) Taylor expansion of g in Y_i. If in addition g is quadratic, then G_i reduces to a component solution method (see Section 3.4.5). The choice of the T_i functions will be discussed further in Section 4.3.8.
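To make the local update concrete, here is a minimal Python sketch of the block mapping G_i under two simplifying assumptions that are not made in the text: each Y_i is a box (so the scaled projection reduces to a componentwise clip, whatever the diagonal scaling), and T_i is diagonal and stored as a vector. The helper names are hypothetical.

```python
import numpy as np

def block_update(y, idx, grad_i, T_i, a, lower, upper):
    """One 'local' projected gradient step G_i(y) of (4.11).
    Assumptions: Y_i is the box [lower, upper] and T_i is a diagonal scaling
    given as a vector; grad_i(y) returns the block gradient of g at y; idx
    holds the coordinates of block i inside the full vector y."""
    z = np.clip(y[idx] - a * T_i * grad_i(y), lower, upper)  # scaled step, then projection
    y_new = y.copy()
    y_new[idx] = z                                           # only block i is modified
    return y_new
```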

The next proposition states that the condition G(y) = y is necessary and sufficient for y to be a solution of Problem 4.1. In other words, the operation G(y) will cause a nonzero displacement from y along at least one component unless y is a solution. Conversely, the operation G(y) will cause no displacement if y is a solution.

Proposition 4.2 (Stationarity) Consider Problem 4.1 and let y ∈ Y. The vector y is a solution if and only if G_i(y) = y_i for i ∈ N.

Proof We first assume that y is a solution of Problem 4.1 and thus satisfies the optimality condition given in Proposition 4.1. Let i ∈ N and h(z) denote the minimand of (4.12), i.e.

h(z) := g(y) + ∇_i g(y)ᵀ(z − y_i) + (1/2) ‖z − y_i‖²_{[â_i(y)T_i(y)]⁻¹}.    (4.14)


It follows from (4.4) that h(z) ≥ g(y) for all z ∈ Y_i. Noting that h(z) = g(y) if z = y_i, we find that y_i minimises h(z) over z ∈ Y_i. Hence G_i(y) = y_i, which holds for any i ∈ N.

Suppose now that G_i(y) = y_i for i ∈ N. By setting z̄ = y_i for each i ∈ N in (4.13) we find that (4.4) holds, and y is a solution of Problem 4.1, which completes the proof.

Since Y is a product set, the mapping G implicitly defines several modes of implementation. These include the Jacobi, Gauss-Seidel, arbitrary and random modes, defined as follows.

Definition 4.2 (Modes of implementation of G)

(Jacobi) The Jacobi mode of implementation of G is given by the algorithm y^{k+1} = G(y^k).

(Cyclic or Gauss-Seidel) Consider the Y → Y mapping

S := Ĝ_n ∘ Ĝ_{n−1} ∘ ... ∘ Ĝ_1,    (4.15)

where we introduce Ĝ_i : y ∈ Y ↦ Ĝ_i(y) := (y_1, ..., y_{i−1}, G_i(y), y_{i+1}, ..., y_n), i ∈ N. The Gauss-Seidel mode of implementation of G is given by the algorithm y^{k+1} = S(y^k).

(Arbitrary) A sequential arbitrary mode of implementation of G is given by the algorithm y^{k+1} = A_k(y^k), where (A_k) is a mapping sequence defined by a block selection sequence (ρ_k) taking arbitrary values in N and

A_k := Ĝ_{ρ_k}.    (4.16)

The various modes of implementation of G specified in Definition 4.2 correspond to different approaches to parallel optimisation.

In the Jacobi implementation, the local mappings G_i are applied simultaneously in all the nodes and to an identical point of Y. The Jacobi method can be qualified as synchronous. This study places special emphasis on the sequential implementation modes, where the local mappings are applied sequentially at different points of Y and one at a time. This constraint can nevertheless be relaxed if g(y) has a sparse dependency structure on y_1, ..., y_n, using the colouring schemes already mentioned in Section 3.4.5, so that the algorithms can be executed without global coordination.

Amongst the sequential modes, we make a distinction between the cyclic or Gauss-Seidel method, where y^{k+1} is obtained from y^k after n successive, ordered local mappings Ĝ_i, and the sequential arbitrary method, where the nodes of the network apply the mappings Ĝ_i in arbitrary order and at their own pace. The sequence (ρ_k) is arbitrary and can be generated at random. We note that the Gauss-Seidel mode of implementation is a particular case of the arbitrary mode where

(ρ_k) : 1, ..., n, 1, ..., n, 1, ... .    (4.17)

Therefore any property of the sequence (A_k) is also valid for S, and frequently we will only consider (A_k) when deriving results which hold for both modes of implementation. More sophisticated block selection techniques (e.g. the Gauss-Southwell method) relying on centralised decisions or requiring consensus among the nodes are not discussed in this text.
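The following Python sketch contrasts the three modes of Definition 4.2. The helpers are hypothetical: G_blocks[i](y) is assumed to return the block component G_i(y), and G_hat(y, i) to implement Ĝ_i, i.e. to return y with block i replaced by G_i(y).

```python
import numpy as np

def jacobi_step(y, G_blocks):
    """Jacobi mode: every block is updated from the same point y^k."""
    return np.concatenate([G_i(y) for G_i in G_blocks])

def gauss_seidel_step(y, G_hat, n):
    """Gauss-Seidel mode, Eq. (4.15): S = Ĝ_n ∘ ... ∘ Ĝ_1; each block update
    sees the blocks already updated earlier in the sweep."""
    for i in range(n):
        y = G_hat(y, i)
    return y

def arbitrary_step(y, G_hat, rho_k):
    """Arbitrary (possibly random) mode, Eq. (4.16): y^{k+1} = Ĝ_{ρ_k}(y^k)."""
    return G_hat(y, rho_k)
```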

4.3.2 Line search and global convergence of the sequential implementations

Global convergence of the projected gradient methods is typically guaranteed by line search techniques for the step size selection, where the step size at each step is chosen so as to minimise the objective function along a given descent direction. In practice, line search is done approximately by decreasing an initial step size value (usually set to 1) until a satisfactory step size is found, as for instance in the Armijo rule introduced in Section 3.4.4.

The convergence of Jacobi implementations of the gradient projection method is commonly obtained using a single line search procedure agreeing on an identical step size for all the nodes. The resulting algorithm is a mere parallel implementation of the global gradient projection method (4.9).

Global line search, however, is inappropriate for sequential implementations of the method, which require local step-size selection procedures. Throughout this chapter, we consider that the sequential mapping S and mapping sequences (A_k) are implemented with local approximate line search routines of the Armijo type. The following definition is then suggested for the functions â_i.

Definition 4.3 (Local approximate line search) Given the fixed scalar parameters β, σ ∈ (0,1) and a scaling matrix function T_i : Y → T(m_i) for each i ∈ N, we consider the mappings â_i : Y → (0,1] where, for y ∈ Y, â_i(y) is the largest step size a ∈ {β^m}_{m=0,1,2,...} satisfying

g(y) − g(ŷ(a)) ≥ σ a⁻¹ ‖ŷ_i(a) − y_i‖²_{T_i(y)⁻¹},    (4.18)

where ŷ_j(a) := y_j if j ≠ i and ŷ_i(a) := [y_i − a T_i(y) ∇_i g(y)]^+_{Y_i, T_i(y)}.

A particularity of the above step-size selection rule with regard to conventional applications of the Armijo rule is that the step sizes are decided independently and may thus differ across nodes. Besides, the decisions taken in Definition 4.3 are based on local information, under the assumption made in Problem 4.1 that the variations of g along block components can be computed locally.
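A minimal Python sketch of this backtracking rule is given below, assuming a diagonal scaling T_i stored as a vector and a hypothetical projection helper project_i onto Y_i; y is represented as a list of block vectors.

```python
import numpy as np

def local_armijo(g, grad_i, project_i, y, i, T_i, beta=0.5, sigma=0.1):
    """Local approximate line search of Definition 4.3: returns the largest
    a in {beta^m} satisfying the sufficient-decrease test (4.18), together
    with the corresponding trial point ŷ(a)."""
    a = 1.0
    gi = grad_i(y)
    while True:
        y_hat = list(y)
        y_hat[i] = project_i(y[i] - a * T_i * gi)     # ŷ_i(a); other blocks unchanged
        d = y_hat[i] - y[i]
        # test (4.18): decrease of g versus the squared T_i^{-1}-norm of the displacement
        if g(y) - g(y_hat) >= (sigma / a) * np.sum(d * d / T_i):
            return a, y_hat
        a *= beta                                     # backtrack
```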

It is shown in the next section that the step-size selection rule (4.18) guarantees the convergence of sequential implementations of the mapping G.

4.3.3 Global convergence of sequential implementations of G

The proof of the convergence of the sequential implementations of G is based on a descent approach. Since the Gauss-Seidel method S is a particular instance of the arbitrary implementation, we only need to prove global convergence for an arbitrary instance of (A_k). The proof relies on two lemmas.

Lemma 4.1 (Scaled gradient projection) Let i ∈ N, y ∈ Y, T ∈ T(m_i), and a ≥ 0. If ȳ ∈ Y is such that ȳ_i = [y_i − aT∇_i g(y)]^+_{Y_i,T}, then

−a∇_i g(y)ᵀ(ȳ_i − z) ≥ (ȳ_i − y_i)ᵀT⁻¹(ȳ_i − z),    ∀z ∈ Y_i.    (4.19)

Proof Since Y_i is a convex set we have, as per the scaled projection theorem (Proposition B.9-(ii) in Appendix B.4),

(y_i − aT∇_i g(y) − ȳ_i)ᵀT⁻¹(z − ȳ_i) ≤ 0,    ∀z ∈ Y_i,    (4.20)

which is equivalent to (4.19).

The next lemma states that local projected gradient descents lead to significant descents on g in the directions where (4.4) does not hold.

Lemma 4.2 (Descent) Let (4.2) hold, i ∈ N, y ∈ Y, T ∈ T(m_i), and a > 0. Consider ȳ ∈ Y with ȳ_i = [y_i − aT∇_i g(y)]^+_{Y_i,T} and ȳ_j = y_j if j ≠ i. Then

g(y) − g(ȳ) ≥ (σ/a) ‖ȳ_i − y_i‖²_{T⁻¹}    (4.21)

is satisfied for any σ ∈ (0,1) if a ≤ 2(1−σ)(λ̄L)⁻¹.

Proof Let ζ(t) := y + t(ȳ − y). We have

g(y) − g(ȳ) = −∇_i g(y)ᵀ(ȳ_i − y_i) − ∫₀¹ [∇_i g(ζ(t)) − ∇_i g(y)]ᵀ(ȳ_i − y_i) dt.    (4.22)

It follows from Lemma 4.1 (applied with z = y_i) that

−a∇_i g(y)ᵀ(ȳ_i − y_i) ≥ ‖ȳ_i − y_i‖²_{T⁻¹}.    (4.23)

Since, by (4.2), ‖∇_i g(ζ(t)) − ∇_i g(y)‖ ≤ tL‖ȳ − y‖ = tL‖ȳ_i − y_i‖, we find, using (4.22), (4.23), T ≼ λ̄I, and the Cauchy-Schwarz inequality (Proposition A.2 in Appendix A),

g(y) − g(ȳ) ≥ (1/a − λ̄L/2) ‖ȳ_i − y_i‖²_{T⁻¹},    (4.24)

which proves the lemma.

A direct consequence of Lemma 4.2 is that the step-size selection rule given in Definition 4.3 is well-defined for Problem 4.1. Indeed, for any y ∈ Y, i ∈ N and T_i : Y → T(m_i), we find

0 < a ≤ â_i(y) ≤ 1,    (4.25)

where

a := min{1, 2β(1−σ)/(λ̄L)}.    (4.26)

In other words, the computation of â_i requires a finite number of evaluations of (4.18).
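As a small worked check of (4.25)-(4.26), the snippet below bounds the number of backtracking trials; the numerical constants are illustrative placeholders, not values taken from the text.

```python
import math

beta, sigma, lam_bar, L = 0.5, 0.1, 2.0, 10.0   # assumed example constants
c = 2 * (1 - sigma) / (lam_bar * L)             # by Lemma 4.2, any a <= c passes (4.18)
a_low = min(1.0, beta * c)                      # lower bound (4.26) on the accepted step
max_trials = math.ceil(math.log(c) / math.log(beta)) + 1 if c < 1 else 1
print(a_low, max_trials)                        # 0.045 and 5: at most 5 evaluations of (4.18)
```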

We can now prove the convergence of the arbitrary mode of implementation of G. We consider block selection sequences (ρ_k) such that each node is chosen with non-null limiting relative frequency¹ as k → ∞. This ensures that, for all k and i ∈ N, we can find a finite step k̄ such that k̄ > k and ρ_{k̄} = i, and thus Ĝ_i is applied an infinite number of times. In the case of randomly generated sequences, this condition and the following proposition may only hold with probability one.

Proposition 4.3 (Global convergence of (A_k)) Consider a block-coordinate sequence (ρ_k) such that each i ∈ N is chosen with non-null limiting relative frequency as k → ∞, the corresponding mapping sequence (A_k), and a sequence (y^k) generated by (A_k). If Problem 4.1 has a finite optimum, then any limit point of (y^k) is a solution.

Proof Consider the sequence (y^k) generated by an arbitrary sequence (A_k), and a step k. Suppose that at step k the algorithm yields the point y^{k+1} = Ĝ_i(y^k) for some i ∈ N. Since y^{k+1} and y^k differ only in their ith block component, and using (4.18) and (4.25), we find

g(y^k) − g(y^{k+1}) ≥ (σ/λ̄) ‖y^{k+1} − y^k‖²,    (4.27)

¹The notion of limiting relative frequency is briefly discussed in Appendix D.1. In this section, the condition of non-null limiting relative frequency may be understood as lim inf_{k→∞} (1/k) Σ_{l=1}^{k} 1_{{i}}(ρ_l) > 0 for all i ∈ N, where the indicator function 1 is defined by (D.10).


which holds for any k. Now, suppose g has a finite optimum on Y. By (4.18) and the convexity of g, the sequence (y^k) is bounded and we can find a subsequence converging to a point y. Since g(y^k) is monotonically nonincreasing, we have g(y^k) ↓ g(y), and g(y^{k+1}) − g(y^k) → 0. It follows from (4.27) that

‖y^{k+1} − y^k‖ → 0,    (4.28)

and thus y^k → y. Using Lemma 4.1, we find, for all z ∈ Y_i,

−∇_i g(y^k)ᵀ(y_i^{k+1} − z) ≥ (y_i^{k+1} − z)ᵀ T_i(y^k)⁻¹ (y_i^{k+1} − y_i^k) / â_i(y^k).    (4.29)

If we write T̂ := â_i(y^k)T_i(y^k), (4.29) yields

−∇_i g(y^k)ᵀ(z − y_i^k) = ∇_i g(y^k)ᵀ(y_i^{k+1} − z) − ∇_i g(y^k)ᵀ(y_i^{k+1} − y_i^k)
    ≤ [T̂⁻¹(z − y_i^{k+1}) − ∇_i g(y^k)]ᵀ(y_i^{k+1} − y_i^k)
    ≤ ‖T̂⁻¹(z − y_i^{k+1}) − ∇_i g(y^k)‖ ‖y_i^{k+1} − y_i^k‖
    ≤ ‖T̂⁻¹(z − y_i^{k+1}) − ∇_i g(y^k)‖ ‖y^{k+1} − y^k‖.    (4.30)

Now, we fix i and consider the set W_i of all the time steps where a local projected gradient descent occurs in Y_i and (4.30) holds. Since by assumption the limiting relative frequency of the event ρ_k = i is positive, the set W_i is infinite and the subsequence (y^k)_{k∈W_i} converges to y. It follows from (4.28) and the boundedness of T̂⁻¹ that the right-hand side of (4.30) vanishes as k → ∞ and y^k → y. Hence we have ∇_i g(y)ᵀ(z − y_i) ≥ 0, and this result can be obtained for each i ∈ N. By Proposition 4.1, y is a solution of Problem 4.1.

4.3.4 Local convergence of the sequential implementations

Deterministic implementations

Linear convergence can generally be proven for deterministic block-coordinate implementations of the projected gradient algorithm (e.g. the Gauss-Seidel or Gauss-Southwell methods). The first result of this section is concerned with the Gauss-Seidel implementation of G and is a direct application of Theorem C.1 in Appendix C.1. Before proceeding, we make the following assumption on the norm of the residual y − [y − ∇g(y)]^+_Y.

Assumption 4.1 (Local error bound) Let Ỹ = argmin_{y∈Y} g(y) denote the set of solutions of Problem 4.1. The set Ỹ is non-empty, and for every υ ≥ min_{y∈Y} g(y) there exist scalars δ > 0 and τ > 0 such that

min_{z∈Ỹ} ‖y − z‖ ≤ τ ‖y − [y − ∇g(y)]^+_Y‖    (4.31)

for all y ∈ Y with g(y) ≤ υ and ‖y − [y − ∇g(y)]^+_Y‖ ≤ δ.

Assumption 4.1 is a transcription for Problem 4.1 of Assumption C.1 in Appendix C.1. Examples of situations where Assumption 4.1 holds are given in Appendix C.1. They include for instance the cases when g is a strongly convex function, when Y is polyhedral and g is a quadratic function, or when Y is a polyhedron and −g is the dual function of a convex optimisation problem with affine constraints and a strongly convex differentiable objective function with Lipschitz continuous gradient. Note that the second assumption required by Theorem C.1 (Assumption C.2 in Appendix C.1) holds by convexity of Problem 4.1.

The next proposition guarantees the linear convergence of the mapping S under Assumption 4.1.

Proposition 4.4 (Linear convergence of S) Suppose that g is bounded below on Y and Problem 4.1 has a non-empty set of solutions denoted by Ỹ = argmin_{y∈Y} g(y). Let Assumption 4.1 hold and (y^k) be a sequence generated by the mapping S with scaling functions T_i : Y → T(m_i), i ∈ N. Then {g(y^k)} converges at least Q-linearly to min_{y∈Y} g(y) and (y^k) converges at least R-linearly to a solution.

Proof We show that the mapping S satisfies the conditions of Theorem C.1 in Appendix C.1. Let y ∈ Y. Since S(y) ∈ Y, we can write

S(y) = [y − ∇g(y) + e]^+_Y,    (4.32)

for a certain vector e. By setting z[0] = y and computing successively z[i] = Ĝ_i(z[i−1]) for i = 1, ..., n, we find S(y) = z[n]. Since Y = ∏_{i=1}^{n} Y_i, (4.32) reduces to

z_i[i] = [y_i − ∇_i g(y) + e_i]^+_{Y_i},    i ∈ N.    (4.33)

Let i ∈ N and Φ_i : x ∈ R^{m_i} ↦ Φ_i(x) = [â_i(z[i−1])T_i(z[i−1])]⁻¹x. It follows from (4.13) that the gradient approximation

∇_i g(z[i−1]) + Φ_i(z_i[i] − z_i[i−1])    (4.34)

is orthogonal to Y_i at z_i[i], and thus

z_i[i] = [z_i[i] − ∇_i g(z[i−1]) − Φ_i(z_i[i] − z_i[i−1])]^+_{Y_i},    (4.35)

where by construction z_i[i−1] = y_i and z_i[i] = S_i(y). It is easy to see that (4.35) is equivalent to (4.33) if we set

e_i = S_i(y) − y_i − Φ_i(S_i(y) − y_i) − ∇_i g(z[i−1]) + ∇_i g(y).    (4.36)


Using (4.2), ‖z[i−1] − y‖ ≤ ‖S(y) − y‖, and ‖Φ_i(x)‖ ≤ (aλ)⁻¹‖x‖, where a is given by (4.26), we find

‖e_i‖ ≤ [1 + (aλ)⁻¹] ‖S_i(y) − y_i‖ + L ‖S(y) − y‖.    (4.37)

Since (4.37) holds for all i ∈ N, we have

‖e‖ ≤ n [1 + (aλ)⁻¹ + L] ‖S(y) − y‖.    (4.38)

Using (4.27), we also find

g(S(y)) − g(y) ≤ −(σ/λ̄) ‖S(y) − y‖².    (4.39)

It follows from (4.32), (4.38) and (4.39) that the conditions of Theorem C.1 for Problem 4.1 are satisfied by the algorithm y^{k+1} = S(y^k) under Assumption 4.1.

Stochastic implementations

The linear convergence of random implementations is more difficult to certify.

A significant convergence result was recently established in [Nes12] for a version of the random block-coordinate projected gradient descent algorithm used with constant (scalar) scaling coefficients, and with a uniform random block selection where descent occurs at each node i ∈ N with equal probability 1/n. In this particular algorithm, which has been successfully applied to huge-scale optimisation problems, scaling is based on local Lipschitz constants and is conservative in the sense that convergence is guaranteed with unit step sizes (i.e. without the need for a step-size selection routine). The scaling strategy of this algorithm can be described as follows. For each i ∈ N, we denote by I_i the identity matrix in R^{m_i} and by U_i the R^{m×m_i} matrix such that (U_1, ..., U_n) is the identity matrix in R^m. It is assumed that a Lipschitz constant L_i for ∇_i g is known, which satisfies

‖∇_i g(y) − ∇_i g(y + U_i z)‖ ≤ L_i ‖z‖,    z ∈ R^{m_i}, y, (y + U_i z) ∈ Y.    (4.40)

The descent direction at a node i ∈ N, which is computed by projected gradient descent in Y_i, is then scaled by the constant coefficient L_i⁻¹. In contrast to the methods based on step-size selection by line search, notice that this algorithm relies on the knowledge of the coordinate Lipschitz constants. Conservative bounds can be used in place of the Lipschitz constants (e.g. the actual Lipschitz constant L), though the performance of the algorithm remains dependent on the tightness of these bounds.
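A minimal sketch of this uniform random block method (constant steps 1/L_i, no line search) is given below; y is a list of block vectors, and grads[i] and projections[i] are hypothetical helpers returning ∇_i g(y) and the projection onto Y_i.

```python
import numpy as np

def random_block_pgd(grads, projections, L, y0, n_iter, rng=None):
    """Uniform random block-coordinate projected gradient descent (cf. [Nes12]):
    at each step one block i is drawn with probability 1/n and updated with the
    constant step 1/L_i, where L_i satisfies the coordinate Lipschitz bound (4.40)."""
    rng = rng or np.random.default_rng()
    y = [yi.copy() for yi in y0]
    n = len(y)
    for _ in range(n_iter):
        i = rng.integers(n)                              # P(rho_k = i) = 1/n
        y[i] = projections[i](y[i] - grads[i](y) / L[i]) # scaled projected step
    return y
```

With T_i = L_i⁻¹ I_i and σ ≤ 1/2, inequality (4.43) below shows that the unit step a = 1 always passes the test (4.18), so the line search can indeed be dispensed with.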

In this section, we make the observation that it is possible to recover the above-described algorithm using a random implementation of G in combination with constant scalar scaling and uniformly generated block selection sequences. We infer the following convergence result for the mapping sequence (A_k).

Proposition 4.5 (Linear convergence of (A_k)) Suppose that g∗ = inf_{y∈Y} g(y) is finite and that local Lipschitz constants {L_i}_{i∈N} satisfying (4.40) are known. Let (y^k) be a sequence generated by a mapping sequence (A_k) using σ ∈ (0, 1/2] for the step-size selection, T_i(y) = L_i⁻¹I_i for all y ∈ Y and i ∈ N, and a block selection sequence (ρ_k) drawn randomly with P(ρ_k = i) = 1/n for all k and i ∈ N. Then the step sizes computed by local line search are all equal to 1 and, for any confidence level ρ ∈ (0,1) and accuracy ϵ > 0, there is a k̄ < ∞ such that P(g(y^k) − g∗ ≤ ϵ) ≥ 1 − ρ for k ≥ k̄. If in addition g is strongly convex, then the expectation E{g(y^k)} converges at least Q-linearly to g∗.

Proof Let i ∈ N and y ∈ Y. By (4.40) and a rationale already used in the proof of Lemma 4.2, we find, for all z ∈ R^{m_i} such that y + U_i z ∈ Y,

g(y + U_i z) ≤ g(y) + ∇_i g(y)ᵀz + (L_i/2) ‖z‖².    (4.41)

Take a positive step size a and define ŷ(a) as in Definition 4.3 by ŷ_j(a) = y_j if j ≠ i and ŷ_i(a) = [y_i − aT_i(y)∇_i g(y)]^+_{Y_i,T_i(y)}. It follows from (4.41) and Lemma 4.1 that

g(y) − g(ŷ(a)) ≥ −∇_i g(y)ᵀ(ŷ_i(a) − y_i) − (L_i/2) ‖ŷ_i(a) − y_i‖²    (4.42)
    ≥ L_i (1/a − 1/2) ‖ŷ_i(a) − y_i‖²,    (4.43)

where (4.42) follows from (4.41) and (4.43) from (4.19). Hence (4.18) is satisfied for a = 1 if σ ≤ 1/2. By setting σ ∈ (0, 1/2] we recover the constrained minimisation algorithm of [Nes12] and the proposition follows from [RT12a, Theorem 4] and [Nes12, Theorem 5].

4.3.5 Asymptotic behaviour over polyhedrons

This section studies the asymptotic behaviour of convergent block-coordinate implementations of G. It is shown that, when approaching a convergence point enjoying a property of strict complementarity, the algorithms reduce to simple gradient descents in reduced spaces of smaller dimensions. The analysis is restricted to polyhedral product sets and we make the following assumption.


Assumption 4.2 (Polyhedral feasible set) For i ∈ N, Y_i is a polyhedron defined by a finite set C_i = {c_1, ..., c_{r_i}} of scalar affine constraints R^{m_i} → R, i.e.

Y_i = {y ∈ R^{m_i} : c(y) ≤ 0, ∀c ∈ C_i}.    (4.44)

We introduce the following definitions.

Definition 4.4 (Sets A_i, Ā_i and matrices E_i, Ē_i) Under Assumption 4.2, and for each i ∈ N and y ∈ Y_i, we define the set A_i(y) = {c ∈ C_i : c(y) = 0} as the set of the constraints specifying Y_i that are active at y, and Ā_i(y) = C_i \ A_i(y) as the set of the inactive constraints. We denote by E_i(y) any matrix the columns of which form an orthonormal basis of the nullspace of {∇c(y) : c ∈ A_i(y)}, and by Ē_i(y) any matrix such that (E_i(y), Ē_i(y)) is an orthogonal matrix of R^{m_i} and E_i(y)ᵀĒ_i(y) = 0.

Now we introduce the concept of strict complementarity. From Proposition B.6 in Appendix B.3, we know that a geometric interpretation of the optimality of a point y for Problem 4.1 under a constraint qualification is that −∇g(y) must be parallel to the normal vector of one hyperplane supporting Y at y. Under Assumption 4.2 in particular, we can write A_i(y_i) = {c_1, ..., c_{r̄_i}} for all i ∈ N. Proposition B.6 states that y is a solution of Problem 4.1 if and only if

∇_i g(y) = −Σ_{j=1}^{r̄_i} α_j ∇c_j(y_i)    (4.45)

holds for some nonnegative scalars α_1, ..., α_{r̄_i} and for all i ∈ N. Notice that (4.45) is in fact the KKT condition (3.45). If A_i(y_i) = ∅ the condition (4.45) reduces to ∇_i g(y) = 0.

We speak of strict complementarity at the solution y when the coefficients α_1, ..., α_{r̄_i} are all strictly positive.

Definition 4.5 (Strict complementarity) Let Assumption 4.2 hold and y be a solution of Problem 4.1 with A_i(y_i) = {c_1, ..., c_{r̄_i}}, i ∈ N. Strict complementarity holds at y if and only if minus the gradient of g at y is a positive combination of the gradients of the active constraints, i.e., for all i ∈ N such that A_i(y_i) ≠ ∅, one can find positive scalars α_1, ..., α_{r̄_i} satisfying

∇_i g(y) = −Σ_{j=1}^{r̄_i} α_j ∇c_j(y_i).    (4.46)

If Y takes the form (4.3), then strict complementarity at y reduces to

(y_i)_j = 0 ⇒ (∇_i)_j g(y) > 0,    ∀j ∈ P_i, i ∈ N,    (4.47)

where (∇_i)_j g denotes the jth component of ∇_i g.
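In the nonnegativity case (4.47), the property is straightforward to test numerically; a small sketch, assuming flat numpy arrays:

```python
import numpy as np

def strict_complementarity_nonneg(y, grad_y, tol=1e-10):
    """Check (4.47) for Y = {y >= 0}: every active coordinate (y_j = 0) must
    have a strictly positive partial derivative; grad_y is the gradient of g at y."""
    active = y <= tol
    return bool(np.all(grad_y[active] > tol))
```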

The next result states that strict complementarity at the point of convergence of a sequence generated by a block-coordinate implementation of G ensures that the active constraints at the convergence point are identified by the algorithm in a finite number of iterations. Since the Gauss-Seidel mode of implementation of G can be regarded as a particular case of the arbitrary mode, we only consider the Jacobi mode y^{k+1} = G(y^k) and the arbitrary mode y^{k+1} = A_k(y^k). For these two modes note that, at each step k and for each i ∈ N, we have either y_i^{k+1} = G_i(y^k) or y_i^{k+1} = y_i^k.

Proposition 4.6 (Identification of the active constraints) Under Assumption 4.2, let (y^k) be a sequence generated by a block-coordinate implementation of G which converges to a solution y of Problem 4.1 where strict complementarity holds. Then there exists a k̄ < ∞ such that A_i(y_i^k) = A_i(y_i) for all k ≥ k̄ and i ∈ N.

Proof Since the set of nodes N is finite, we only need to consider an arbitrary node i ∈ N, and we suppose that (4.46) holds for the strictly positive scalars α_1, ..., α_{r̄_i}.

We first show that there exists a k̂ < ∞ such that A_i(y_i^k) ⊂ A_i(y_i) for all k ≥ k̂. Otherwise there would be a constraint c ∈ Ā_i(y_i) and a subsequence (y^k)_{k∈K} such that c(y_i^k) = 0 for k ∈ K. Since y^k → y and by continuity of c, we would find c(y_i) = 0 and thus c ∈ A_i(y_i), which is a contradiction.

Suppose now that A_i(y_i) = ∅. Then we also have A_i(y_i^k) ⊃ A_i(y_i) for all k ≥ k̂ and we are done.

If otherwise A_i(y_i) ≠ ∅, then ‖∇_i g(y)‖ ≠ 0 by strict complementarity at y. Let M = {1, ..., r̄_i} denote the index set of the active constraints at y_i, and let A_i(y′) = Â for some point y′ ∈ Y_i, with Â = {c_j}_{j∈M̄}, M̄ ⊂ M and M̄ ≠ M. We have Â ⊂ A_i(y_i) and Â ≠ A_i(y_i). This implies that ∇_i g(y) cannot be expressed as a linear combination of the vectors {∇c_j(y_i)}_{j∈M̄}. Indeed, we have c_j(y′) = 0 for j ∈ M̄ and c_j(y′) < 0 for j ∈ M \ M̄, and thus Σ_{j∈M} α_j c_j(y′) = Σ_{j∈M\M̄} α_j c_j(y′) < 0. Since the constraints are affine we can write c_j(y′) = ∇c_j(y_i)ᵀ(y′ − y_i) for all y′ ∈ Y_i and j ∈ M. It follows that

Σ_{j∈M} α_j c_j(y′) = [Σ_{j∈M} α_j ∇c_j(y_i)]ᵀ (y′ − y_i) = −∇_i g(y)ᵀ(y′ − y_i),    (4.48)

where the last equality follows from (4.46); this quantity would be equal to 0, a contradiction, if ∇_i g(y) were a linear combination of the vectors {∇c_j(y_i)}_{j∈M̄}. Hence we can find a ∆ > 0, independent of y′, such that

‖∇_i g(y) + Σ_{j∈M̄} ᾱ_j ∇c_j(y_i)‖ > ∆,    ∀{ᾱ_j}_{j∈M̄}.    (4.49)


Let δ > 0. Since y^k → y, we can find a finite k̃ ≥ k̂ such that ‖y^k − y‖ < δ, and thus ‖y_i^k − y_i‖ < δ and ‖y_i^{k+1} − y_i^k‖ < 2δ, for all k ≥ k̃. By Lipschitz continuity of ∇g, we also have ‖∇_i g(y^k) − ∇_i g(y)‖ ≤ L‖y^k − y‖ < Lδ for k ≥ k̃. Consider now an iteration k ≥ k̃ of the algorithm yielding y^{k+1} from y^k, where y_i^{k+1} = G_i(y^k), and assume that A_i(y_i^{k+1}) = Â. We infer from the geometric interpretation of y_i^{k+1} as the solution of (4.12) (Proposition B.6 in Appendix B.3, or the KKT condition (3.45)) that one can find nonnegative coefficients {α̂_j}_{j∈M̄} such that

∇_i g(y^k) + [â_i(y^k)T_i(y^k)]⁻¹(y_i^{k+1} − y_i^k) = −Σ_{j∈M̄} α̂_j ∇c_j(y_i^{k+1}),    (4.50)

where the left member of (4.50) is the gradient at y_i^{k+1} of the cost approximation of g at y^k. Since the constraints are affine, ∇c_j(y_i^{k+1}) = ∇c_j(y_i), and we find, for k ≥ k̃,

‖∇_i g(y) + Σ_{j∈M̄} α̂_j ∇c_j(y_i)‖ = ‖[∇_i g(y) − ∇_i g(y^k)] − [â_i(y^k)T_i(y^k)]⁻¹(y_i^{k+1} − y_i^k)‖ ≤ [L + 2(aλ)⁻¹] δ,

where the equality follows from (4.50). This contradicts (4.49) if we initially set

δ < ∆ / [L + 2(aλ)⁻¹].    (4.51)

Hence A_i(y_i^{k+1}) ≠ Â.

Now, recall that at each step k where G or A_k is applied, we have either y_i^{k+1} = y_i^k, and thus A_i(y_i^{k+1}) = A_i(y_i^k), or y_i^{k+1} = G_i(y^k), which holds for an infinity of steps as y^k → y. Since the number of constraints specifying Y_i and the number of possible values for Â are finite, by induction we can find a k̄ < ∞ such that A_i(y_i^k) = A_i(y_i) for k ≥ k̄, which completes the proof.

In anticipation of Proposition 4.7, we introduce the following notations. Let (y^k) be a sequence generated by a block-coordinate implementation of G converging to a point y. If for each i ∈ N we denote by m̃_i the number of columns of E_i(y_i), we can write, for any z ∈ Y_i such that A_i(z) = A_i(y_i), z = y_i + E_i(y_i)ỹ for some ỹ ∈ R^{m̃_i}, where R^{m̃_i} is called the reduced space at y_i. We also define ∇̃_i g := E_i(y_i)ᵀ∇_i g, the projected gradient component near y_i, and T̃_i(y) := [E_i(y_i)ᵀ T_i(y)⁻¹ E_i(y_i)]⁻¹, the projected scaling function near y_i. We denote by m̃ := Σ_{i=1}^{n} m̃_i the dimension of the (global) reduced space at y. We can now formulate the following consequence of Proposition 4.6.

Proposition 4.7 (Descent in reduced spaces) Let Assumption 4.2 hold, and (y^k) be a sequence of points generated by a block-coordinate implementation of G and converging to a solution y of Problem 4.1 where strict complementarity holds. One can find a k̄ < ∞ such that, for any i ∈ N and k > k̄ where y_i^{k+1} = G_i(y^k), one has y_i^k = y_i + E_i(y_i)ỹ_i^k and y_i^{k+1} = y_i + E_i(y_i)ỹ_i^{k+1}, with ỹ_i^k, ỹ_i^{k+1} ∈ R^{m̃_i} and

ỹ_i^{k+1} = ỹ_i^k − â_i(y^k) T̃_i(y^k) ∇̃_i g(y^k),    (4.52)

which can be regarded as an unconstrained gradient descent step in the reduced space R^{m̃_i}.

Proof If m̃_i = 0, then (4.52) reduces to ỹ_i^{k+1} = ỹ_i^k, and the proposition is a direct consequence of Proposition 4.6. Hence we suppose that m̃_i > 0. For i ∈ N, let Y′_i := {z ∈ Y_i : A_i(z) = A_i(y_i)} and Ỹ_i := {E_i(y_i)ᵀ(z − y_i) : z ∈ Y′_i}. By continuity of the constraint functions, it is easy to show that Ỹ_i is an open subset of R^{m̃_i} if m̃_i > 0. Moreover, the function h_i(ỹ) := y_i + E_i(y_i)ỹ is a bijection between Ỹ_i and Y′_i, i.e. Y′_i = {h_i(z̃) : z̃ ∈ Ỹ_i}. It follows from Proposition 4.6 that one can find a k̄ < ∞ such that, for all k ≥ k̄, we have y_i^k, y_i^{k+1} ∈ Y′_i, and thus y_i^k = h_i(ỹ_i^k) and y_i^{k+1} = h_i(ỹ_i^{k+1}) for some points ỹ_i^k, ỹ_i^{k+1} ∈ Ỹ_i. If, in addition, i and k are such that y_i^{k+1} = G_i(y^k), then

ỹ_i^{k+1} = argmin_{z̃ ∈ Ỹ_i} { ∇̃_i g(y^k)ᵀ(z̃ − ỹ_i^k) + (1/2) ‖z̃ − ỹ_i^k‖²_{[â_i(y^k)T̃_i(y^k)]⁻¹} }    (4.53)

    = [ỹ_i^k − â_i(y^k) T̃_i(y^k) ∇̃_i g(y^k)]^+_{Ỹ_i, T̃_i(y^k)}    (4.54)

    = ỹ_i^k − â_i(y^k) T̃_i(y^k) ∇̃_i g(y^k),    (4.55)

where (4.53) follows from (4.12), (4.54) from (3.20), and (4.55) from the fact that Ỹ_i is an open set and ỹ_i^{k+1} ∈ Ỹ_i.

Propositions 4.6 and 4.7 hold for many convergent block-coordinate implementations of the scaled gradient projection algorithm with bounded or constant step sizes. Their implication is that the asymptotic convergence properties of the non-projected block-coordinate gradient descent methods for unconstrained problems are also enjoyed by the projected algorithms in constrained problems.
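For the common nonnegativity case, the reduced-space objects of Definition 4.4 and Proposition 4.7 are easy to construct explicitly; the following sketch (with hypothetical helper names) takes E_i(y_i) as the identity columns of the free coordinates of a block.

```python
import numpy as np

def reduced_basis_nonneg(y_i, tol=1e-10):
    """For a nonnegativity block Y_i = {y >= 0}, the active constraints at y_i
    are the zero coordinates, and the columns of E_i(y_i) can be taken as the
    identity columns of the free (inactive) coordinates."""
    free = np.flatnonzero(y_i > tol)
    return np.eye(len(y_i))[:, free]

# The projected gradient of Proposition 4.7 is then E.T @ grad_i, and the
# update (4.52) only moves the free coordinates of block i.
```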

4.3.6 Asymptotic rates of convergence

In this section we derive asymptotic convergence rates for block-coordinate implementations of G over polyhedrons. We first assume strict complementarity and twice continuous differentiability at the point of convergence, then uniqueness of the solution. The particular cases when the Hessian is discontinuous at the point of convergence and when strict complementarity does not hold are discussed at the end of the section.

Local convergence under strict complementarity

Let y be the point of convergence of a convergent algorithm. We introduce the block-diagonal m × m̃ matrix

E(y) := diag(E_1(y_1), ..., E_n(y_n)),    y ∈ Y,    (4.56)

and define the projected Hessian of g at y as ∇̃²g(y) := E(y)ᵀ∇²g(y)E(y), and the block-diagonal form T̃ := diag(T̃_1, ..., T̃_n), where T̃_i is the projected scaling function at node i. We decompose the Hessian matrix of g into

∇²g = D − L − Lᵀ,    (4.57)

where D = diag(∇²_{11}g, ..., ∇²_{nn}g) is block diagonal and L is strictly lower block-triangular. By setting D̃(y) := E(y)ᵀD(y)E(y) and L̃(y) := E(y)ᵀL(y)E(y), we find ∇̃²g = D̃ − L̃ − L̃ᵀ. The next proposition is concerned with the asymptotic behaviour of the local step-size selection rule. Again, we regard S as a particular instance of (A_k).

Proposition 4.8 (Asymptotic efficiency of the step size rule) Under Assumption 4.2, consider a block-coordinate implementation of G, and let the algorithm generate a sequence of points (y^k) converging to a solution y of Problem 4.1 where T is continuous, strict complementarity holds and g is twice continuously differentiable. Then the step sizes computed by local line search at a node i ∈ N become identically equal to 1 near y if (1−σ)T̃_i(y)⁻¹ − (1/2)∇̃²_{ii}g(y) ≻ 0, and at all the nodes if

2(1−σ)T̃(y)⁻¹ − D̃(y) ≻ 0.    (4.58)

Proof Let i ∈ N. According to Propositions 4.6 and 4.7, there is a k̄ such that, for any step k ≥ k̄ at which y_i^{k+1} = G_i(y^k), we have A_i(y_i^k) = A_i(y_i^{k+1}) = A_i(y_i) and y_i^{k+1} − y_i^k = E_i(y_i)δ̃, where δ̃ ∈ R^{m̃_i} is the displacement in the reduced space given by

δ̃ = −a T̃_i(y^k) ∇̃_i g(y^k).    (4.59)

The Taylor theorem (see (A.9) in Appendix A) yields, after simplification,

g(y^{k+1}) = g(y^k) − δ̃ᵀ [ (1/a) T̃_i(y^k)⁻¹ − (1/2) ∇̃²_{ii}g(y^k) ] δ̃ + o(‖δ̃‖²).    (4.60)

Using (4.60), the condition (4.18) for a to be a valid step size reduces to

δ̃ᵀ [ ((1−σ)/a) T̃_i(y^k)⁻¹ − (1/2) ∇̃²_{ii}g(y^k) ] δ̃ + o(‖δ̃‖²) ≥ 0,    (4.61)

where ‖δ̃‖ → 0 as k → ∞ by stationarity of y. Condition (4.61) holds as ‖δ̃‖ → 0, and a becomes an acceptable step size at node i for k large enough, if

((1−σ)/a) T̃_i(y)⁻¹ − (1/2) ∇̃²_{ii}g(y) ≻ 0,    (4.62)

which proves the proposition.

Notice that it is always possible to enforce the condition (4.58) by decreasing the scales T_i.

In the next result, we derive the asymptotic convergence rate of the Gauss-Seidel implementation of G. The positive definiteness of ∇²g at the point of convergence is required; this implies the uniqueness of the solution.

Proposition 4.9 (Matrix convergence rate of S) Let Assumption 4.2 hold and suppose that Problem 4.1 has a unique and finite solution y where strict complementarity holds and ∇²g is positive definite and continuous. Let (y^k) be a sequence generated by S with scaling functions T_i : Y → T(m_i) (i ∈ N) continuously differentiable at y and meeting the conditions for trivial step-size selection near y specified by Proposition 4.8. Then (y^k) converges linearly to y at rate

R̃_S(y) = [T̃(y)⁻¹ − L̃(y)]⁻¹ [T̃(y)⁻¹ − D̃(y) + L̃(y)ᵀ],    (4.63)

with spectral radius ρ(R̃_S(y)) < 1.

Proof We study the first-order sensitivity of S in the reduced space at y. One can find a k̄ < ∞ such that, for any k ≥ k̄, we have y^k = y + E(y)ỹ^k for some ỹ^k ∈ R^{m̃}. For i ∈ N, we denote by Ĩ_i the identity matrix of R^{m̃_i}, and define

W̃_i := diag(Ĩ_1, ..., Ĩ_{i−1}, 0, ..., 0).    (4.64)

Let ỹ be a point of the reduced space R^{m̃} and d̃ symbolise a displacement in R^{m̃}. For i ∈ N, the point obtained after local displacements from ỹ along the first i−1 components of d̃ is given by

ȳ_i(d̃, ỹ) := y + E(y)(ỹ + W̃_i d̃).    (4.65)

Consider now the function Z̃ = (Z̃_1, ..., Z̃_n) such that

Z̃_i(d̃, ỹ) = [T̃_i(ȳ_i(d̃, ỹ))]⁻¹ d̃_i + ∇̃_i g(ȳ_i(d̃, ỹ)),    i ∈ N.    (4.66)


The implicit function theorem (Proposition A.3 in Appendix A) states that there exists a function d̃(ỹ), continuously differentiable on a neighbourhood Ω̃ ⊂ R^{m̃} of 0, such that d̃(0) = 0 and Z̃(d̃(ỹ), ỹ) = 0 for all ỹ ∈ Ω̃. For ỹ ∈ Ω̃, d̃(ỹ) gives the displacement caused by an application of S at the point y + E(y)ỹ, and is such that

d̃_i(ỹ) = −T̃_i(ȳ_i(d̃(ỹ), ỹ)) ∇̃_i g(ȳ_i(d̃(ỹ), ỹ)),    i ∈ N.    (4.67)

Consider the function z̃(ỹ) := Z̃(d̃(ỹ), ỹ). By differentiation of z̃ at 0 we find

J_z̃(0) = [T̃(y)⁻¹ − L̃(y)] J_d̃(0) + ∇̃²g(y),    (4.68)

where J_z̃ and J_d̃ denote the Jacobians of z̃ and d̃, respectively. From J_z̃(0) = 0 follows

J_d̃(0) = −[T̃(y)⁻¹ − L̃(y)]⁻¹ ∇̃²g(y).    (4.69)

For k ≥ k̄ we have ỹ^{k+1} = ỹ^k + d̃(ỹ^k). The Taylor theorem and d̃(0) = 0 yield

ỹ^{k+1} = ỹ^k + J_d̃(0)ỹ^k + h(ỹ^k)    (4.70)

    = R̃_S(y)ỹ^k + h(ỹ^k),    (4.71)

with h(ỹ) = o(‖ỹ‖), where (4.70) follows from (A.8) and (4.71) from (4.69). We can rewrite R̃_S(y) as

R̃_S(y) = (D̂ − Ê)⁻¹Êᵀ,    (4.72)

where D̂ = 2T̃(y)⁻¹ − D̃(y) and Ê = T̃(y)⁻¹ − D̃(y) + L̃(y). Noting that D̂ − Ê − Êᵀ = ∇̃²g(y) is positive definite and (D̂ − Ê) is nonsingular, the Ostrowski-Reich theorem (Theorem A.2 in Appendix A) states that ρ(R̃_S(y)) < 1 if D̂ ≻ 0, i.e. if

2T̃(y)⁻¹ − D̃(y) ≻ 0.    (4.73)

In that case the local convergence of the algorithm is analogous to that of the procedure which consists of solving the equation ∇̃²g(y)ỹ = 0 using the Gauss-Seidel iterative method ỹ^{k+1} = R̃_S(y)ỹ^k. Observing that (4.58) implies (4.73) completes the proof.

The fact that (4.58) reduces to (4.73) when we let σ → 0 is unsurprising. Indeed, if (4.73) holds, then it is possible to find a σ > 0 satisfying (4.58). In other words, (4.73) guarantees the existence of parameter values such that y^{k+1} = S(y^k) will converge to the solution with unit step sizes near the optimum, and this is only possible if ρ(R̃_S(y)) < 1.
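Rates of the form (4.63) are easy to evaluate numerically once the (projected) Hessian at the solution is available. The sketch below uses 1×1 blocks, so D and L are simply the diagonal and the (negated) strictly lower triangle of the Hessian; with local Newton scaling (T⁻¹ = D̃) the expression reduces to (4.85). The numerical Hessian used in the example is an assumption for illustration.

```python
import numpy as np

def gauss_seidel_rate(H, T_inv):
    """Asymptotic rate matrix (4.63) of S in the reduced space, for 1x1 blocks:
    R_S = (T^{-1} - L)^{-1} (T^{-1} - D + L^T), where H = D - L - L^T."""
    D = np.diag(np.diag(H))
    L = -np.tril(H, k=-1)                 # H = D - L - L^T  =>  L = -tril(H, -1)
    R = np.linalg.solve(T_inv - L, T_inv - D + L.T)
    return R, np.max(np.abs(np.linalg.eigvals(R)))

# Example with an assumed projected Hessian and Newton scaling T^{-1} = D:
H = np.array([[8.0, 4.0], [4.0, 8.0]])
R, rho = gauss_seidel_rate(H, np.diag(np.diag(H)))
print(rho)    # 0.25 < 1 for this example, confirming linear convergence of S
```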

Extensions

In some particular cases, rates of the type (4.63) cannot be derived. These situations include implementations with variable step sizes near the solution, and the absence of strict complementarity or the discontinuity of the Hessian matrix at the solution. In such scenarios, the asymptotic trajectory of S is expected to jump between regions of Y in which the algorithm behaves differently.

Variable step-sizes near the point of convergence. When for instance the condition (4.58) does not hold, the step sizes computed by line search may vary near the optimum. One can equivalently consider that the scaling function at any node i takes various expressions in a set {β^m T_i}_{m=0}^{m̄}, where β^{m̄} is the smallest step size permitted by (4.25). By following the rationale of the proof of Proposition 4.9, one shows that the local convergence of the algorithm is that of a stable discrete-time switching system defined by a finite rate set {R̃_S(ψ, y)}_{ψ∈Ψ} with Ψ ⊂ {0, ..., m̄}ⁿ and ρ(R̃_S(ψ, y)) < 1 for ψ ∈ Ψ, and given near the optimum by an equation of the type

ỹ^{k+1} = R̃_S(ψ(k), y)ỹ^k + h(ỹ^k),    (4.74)

where h(ỹ) = o(‖ỹ‖), R̃_S(·, y) takes an expression of the type (4.63), and ψ : N → Ψ is a switching function.

Discontinuous Hessian. Propositions 4.8 and 4.9 can also be extended to the case when the Hessian of g shows discontinuities in Y. An example of a function that is only piecewise twice continuously differentiable is the dual of the NUM problem (Problem 3.3) formulated in Section 3.4.5 when the side constraint set S is polyhedral; this was observed in Examples 3.7 and 3.8 in Section 3.4.3.

Suppose that the minimum y is unique, and that ∇²g is discontinuous at y and continuous, positive definite almost everywhere on a neighbourhood Ω of y, so that Ω is partitioned into a finite number q̄ of open sets where ∇²g is continuous, and at the borders of which ∇²g is not defined and one should instead consider either the subderivatives of ∇g or the directional derivatives {∇²ĝ_q}_{q=1}^{q̄}, where each ∇²ĝ_q is assumed continuous, positive definite and equal to ∇²g on the qth subset of Ω. Near the solution and for all i ∈ N, we have T_i(y) = T_{i,q}(y) at any y located in the qth subset of Ω, where each T_{i,q} ∈ T(m_i) for q ∈ {1, ..., q̄}. Consider now the reduced space at y and the corresponding projected forms ∇̃²ĝ_q and T̃_{i,q} for q = 1, ..., q̄ and i ∈ N. By proceeding as in the proof of Proposition 4.8, we find that 1 becomes an acceptable step size near y if

(1−σ) T̃_{i,q}(y)⁻¹ − (1/2) ∇̃²_{ii}ĝ_r(y) ≻ 0    (4.75)

holds for all q, r ∈ {1, ..., q̄}. The local convergence of S is then given by a switching system of the type (4.74) with Ψ ⊂ {1, ..., q̄}ⁿ.

4.3.7 Asymptotic convergence of Jacobi modes of implementation of G

Matrix convergence rate

Similar developments for the Jacobi mode of implementation of G lead to the convergence rate

R̃_G(y) = Ĩ − T̃(y) ∇̃²g(y),    (4.76)

where Ĩ denotes the identity matrix in the reduced space R^{m̃}. Hence R̃_G(y) is the asymptotic rate of convergence of the sequences generated by any Jacobi implementation of G which converge to y with unit step sizes near y. The global convergence of such sequences is typically achieved via global line search and network-scale consensus on the step sizes. In contrast with the Gauss-Seidel mode, the condition for unit step sizes near the optimum, as well as the condition ρ(R̃_G(y)) < 1 for linear asymptotic convergence, are difficult to guarantee in distributed settings by means of a local analysis. In the special case when we set T̃(y) = D̃(y)⁻¹ in a neighbourhood of y, (4.76) reduces to R̃_G(y) = D̃(y)⁻¹[L̃(y) + L̃(y)ᵀ] and takes the form of the asymptotic convergence rate of a Jacobi method for solving a system of linear equations.

If, in addition, m_i = 1 for all i ∈ N, the elements of L̃(y) are nonnegative, and σ < 1/2, then ρ(R̃_S(y)) < 1, and it follows from the Stein-Rosenberg theorem (Theorem A.1 in Appendix A) that ρ(R̃_G(y)) < 1 as well and the Jacobi implementation of G is convergent on the condition that the step sizes reduce to 1 near the point of convergence and that the initial point y^0 is chosen sufficiently close to the optimum y. Global convergence of the Jacobi implementation of G is however not guaranteed for any step-size selection rule and any initial point y^0 ∈ Y, as illustrated in Example 4.2, where a Jacobi implementation of G combined with the local step-size selection rule of Definition 4.3 proves to diverge, and in Example 4.3, where the same algorithm is convergent when started in a certain neighbourhood of the solution but does not converge to the solution when the initial point lies outside this neighbourhood. As already mentioned in Section 4.3.2, the synchronous gradient projection algorithms are generally implemented with identical step sizes for all the nodes of the network, which usually requires network-scale line search and consensus on the step sizes at each step.

It is interesting to note that R̃_G(y) is in fact the asymptotic rate of convergence of any gradient descent algorithm (implemented in parallel or not) reducing near the optimum y to

ỹ^{k+1} = ỹ^k − T̃(y^k) ∇̃g(y^k).    (4.80)

Also, (4.76) can be derived by setting n = 1, L̃(y) = 0 and D̃(y) = ∇̃²g(y) in (4.63).

Accelerated methods

The convergence rates (4.63) and (4.76) cannot be set to 0 under the assumption that T̃(y) is block-diagonal. Hence attaining superlinear convergence in non-degenerate problems is impossible with block-coordinate implementations of the mapping G. From the proof of Proposition 4.9, it is easily seen that superlinear convergence can only be achieved with non-diagonal

Example 4.2 (Global convergence of block-coordinate algorithms (i))

[Figure: the diverging Jacobi iterates (a, a), (−2a, −2a), (4a, 4a), (−8a, −8a), ... in the (y_1, y_2) plane, together with the Gauss-Seidel iterates converging to (0, 0).]

This example shows that the global convergence of the Jacobi implementation of G combined with the local line search routine given in Definition 4.3 is not guaranteed. We consider the function of Example 4.1, previously defined in (4.7), and set Y = R². Let y^0 = (a, a) and (y^k) be a sequence of points generated by successive applications of y^{k+1} = G(y^k) with local line search as in Definition 4.3 and with the scaling matrices T_1(y) = 16/3 and T_2(y) = 10/3 for all y in Y. If σ ≤ 1/3, we find y^k = (−2)^k (a, a) and the sequence of points diverges.

The plain line on the figure shows the sequence of points generated by the Gauss-Seidel algorithm y^{k+1} = S(y^k) with initial point y^0 = (a, a), where S is implemented with the same constant scaling functions T_1 = 16/3 and T_2 = 10/3 and local line search. We see that this sequence converges to the solution (0, 0) by successive descents along y_1 and y_2.


block-matrices T̃(y), and this requires the exploitation of non-strictly local information by each node.

One possible way, suggested in recent studies on parallel optimisation, to reduce the convergence rates is to use as descent directions approximations of the (global) Newton direction −[∇²g(y)]⁻¹∇g(y).

Example 4.3 (Global convergence of block-coordinate algorithms (ii))

[Figure: the regions y_1 + y_2 ≥ 1, |y_1 + y_2| < 1 and y_1 + y_2 ≤ −1 of the (y_1, y_2) plane, separated by the lines y_1 + y_2 = 1 and y_1 + y_2 = −1, with the oscillating Jacobi iterates (1, 1) and (−1, −1).]

Consider the unconstrained optimisation problem of minimising

g(y) = (y_1 + y_2 + 2)² + (y_1 − y_2)²      if y_1 + y_2 ≥ 1,
g(y) = 3(y_1 + y_2)² + (y_1 − y_2)² + 6     if |y_1 + y_2| < 1,
g(y) = (y_1 + y_2 − 2)² + (y_1 − y_2)²      if y_1 + y_2 ≤ −1,    (4.77)

on Y = R². It is easy to verify that g is strictly convex and continuously differentiable on Y with gradient

∇g(y) = 4(y_1 + 1, y_2 + 1)        if y_1 + y_2 ≥ 1,
∇g(y) = 4(2y_1 + y_2, y_1 + 2y_2)  if |y_1 + y_2| < 1,
∇g(y) = 4(y_1 − 1, y_2 − 1)        if y_1 + y_2 ≤ −1,    (4.78)

and unique minimum y = (0, 0). The Hessian ∇²g is given by ∇²g(y) = A if |y_1 + y_2| < 1 and ∇²g(y) = B if |y_1 + y_2| > 1, where we define

A = 4 [ 2 1 ; 1 2 ],    B = 4 [ 1 0 ; 0 1 ],    (4.79)

and ∇²g is not defined between these regions.

Now, let (y^k) be a sequence generated by the Jacobi algorithm y^{k+1} = G(y^k), where the mapping G is combined with the local line search of Definition 4.3 and with scaling T(y) = A⁻¹ if |y_1 + y_2| ≤ 1 and T(y) = B⁻¹ if |y_1 + y_2| > 1, which yields T̃(y)⁻¹ = D̃(y) for all y ∈ Y such that |y_1 + y_2| ≠ 1.

It is easy to see that for any initial point y^0 such that |y^0_1 + y^0_2| < 1, we find y^1 = y and the sequence converges to the optimum in one iteration. However, if for instance y^0 = (1, 1) and σ < 3/4 in (4.18), we find y^k = (−1)^k (1, 1) and the resulting sequence oscillates between the points (1, 1) and (−1, −1).

Under the condition ρ(D^{−1/2}(L + Lᵀ)D^{−1/2}) < 1, the Taylor development

[∇²g]⁻¹ = D^{−1/2} Σ_{t=0}^{∞} [D^{−1/2}(L + Lᵀ)D^{−1/2}]^t D^{−1/2}    (4.81)

is considered, where, under separability assumptions, the dependency structure of each term of the development grows with the parameter t. The technique, denoted by N(q) in this manuscript, consists of setting T^k = T^{(q)}(y^k) in (4.9) with

T^{(q)} = D^{−1/2} Σ_{t=0}^{q} [D^{−1/2}(L + Lᵀ)D^{−1/2}]^t D^{−1/2},    (4.82)

where q is a nonnegative integer directly proportional to the computational complexity and the communication overhead. Notice that N(0) reduces to the Jacobi mode of implementation of G with scaling T_i = [∇²_{ii}g]⁻¹.
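The scaling (4.82) is straightforward to form explicitly when the Hessian is available; the sketch below uses 1×1 blocks, so D is the diagonal of the Hessian and q = 0 recovers the Jacobi scaling D⁻¹.

```python
import numpy as np

def neumann_scaling(H, q):
    """N(q) scaling matrix (4.82): truncated Neumann-series approximation of
    the inverse Hessian built from H = D - L - L^T, valid when
    rho(D^{-1/2}(L+L^T)D^{-1/2}) < 1."""
    d_isqrt = np.diag(1.0 / np.sqrt(np.diag(H)))
    offdiag = np.diag(np.diag(H)) - H                   # = L + L^T
    M = d_isqrt @ offdiag @ d_isqrt
    S = sum(np.linalg.matrix_power(M, t) for t in range(q + 1))
    return d_isqrt @ S @ d_isqrt
```

As q grows, T^{(q)} approaches the inverse Hessian and the unconstrained rate (4.84) below vanishes, at the price of a denser dependency structure.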

From (4.76), we find that the asymptotic convergence rate of the projected gradient algorithm used with (4.82) is given by

R̃_{N(q)}(y) = T̃^{(q)}(y) E(y)ᵀ T^{(q)}(y)⁻¹ R_{N(q)}(y) E(y),    q = 0, 1, 2, ...,    (4.83)

where

R_{N(q)}(y) = [D(y)⁻¹(L(y) + L(y)ᵀ)]^{q+1}    (4.84)

is the asymptotic convergence rate of the unconstrained problem (i.e. when Y = R^m). Note that ρ(R_{N(q)}(y)), and thus ρ(R̃_{N(q)}(y)), vanish as q grows. This shows that the parameter q represents a trade-off between the speed of convergence on the one hand, and the quantity of exchanged information and the remoteness of the neighbours each node must communicate with on the other.

In Section 6.1.1, the performance in terms of matrix convergence rate of the N(q) algorithm will be compared with that of the Gauss-Seidel implementation of G for the dual of a simple NUM problem.

4.3.8 Second-order scaling

From Propositions 4.8 and 4.9 and from the cost approximation interpretation of gradient projections, we see that it is appropriate to scale the descent directions based on second-order information, as in the Newton method. In this section we briefly discuss two second-order scaling strategies: local Newton scaling and diagonal scaling. Possible variants include for instance the quasi-Newton methods, based on the iterative computation of approximations of the inverse Hessian, or the constant scaling technique based on Lipschitz constants discussed in Section 4.3.4.


Local Newton scaling

A `local' Newton direction in a subspace R^{m_i} is obtained by setting the scaling matrix to the inverse of the ith diagonal block of the Hessian ∇²g, i.e. T_i = [∇²_{ii}g]⁻¹, provided that ∇²_{ii}g is well-conditioned. Under the assumption made in Problem 4.1 that the variations in g(y) due to local displacements in the feasible set have local impact, the diagonal blocks of ∇²g, and thus the local Newton directions, can be computed in closed form by the nodes using only local information. This is shown in Appendix B.5 for the dual of the NUM problem (Problem 3.3) introduced in Section 3.4.5. Using this strategy for i ∈ N, we have T̃(y)⁻¹ = D̃(y) near the point of convergence and the asymptotic rate (4.63) is minimised.

Since (4.58) always holds when local Newton scaling is used and σ < 1/2, it follows from Proposition 4.9 that sequences generated by Gauss-Seidel implementations of G converge linearly with step sizes identically equal to 1 near the point of convergence. We have the following proposition².

Proposition 4.10 (Local Newton scaling) Under Assumption 4.2, suppose that Problem 4.1 has a unique and finite solution y where strict complementarity holds, g is twice continuously differentiable, and the diagonal blocks of ∇²g are invertible and continuously differentiable. Let (y^k) be a sequence generated by S, where we use local Newton scaling T_i(y) = [∇²_{ii}g(y)]⁻¹ near y and set σ ∈ (0, 1/2) for the local line search. Then the step sizes become identically equal to 1 after a finite number of iterations. If in addition ∇²g(y) is positive definite, then (y^k) converges linearly to y at rate

R̃_S(y) = [D̃(y) − L̃(y)]⁻¹ L̃(y)ᵀ,    (4.85)

with ρ(R̃_S(y)) < 1.

Diagonal scaling

Drawbacks of local Newton scaling include the computation overhead and matrix conditioning issues due to the inversions of the ∇²_{ii}g matrices and the scaled projections on the sets Y_i (solutions of quadratic programs). Diagonal scaling is an effort-saving strategy which simplifies the computations by using diagonal approximations of the ∇²_{ii}g matrices. If Y is a box, the projections

2It can be seen from from (4.85) that convergence is superlinear iL(y˜ ) = 0. This degenerated case occurs for instance in Problem 3.3 when the constraints are not binding variables assigned to dierent nodes and the problem reduces to n independent optimisation problems, while local Newton scaling leads tonlocal executions of the Newton algorithm with quadratic convergence.