

5.4. Optimisation by scaled gradient projection with approximate line search 137

Consider now a vector z ∈ C ∩ AS(ρ). We have s̄(z) ∈ S(z) with ∥s̄(z)∥ ≤ ρ. Using (5.60) and proceeding similarly, we find (∇g(z) − s̄(z))⊤(x − z) ≥ 0 for all x ∈ Y. It follows that s̄(z) + ∇ĝk(z) − ∇g(z) ∈ Sk(z), and

∥s̄k(z)∥ ≤ ∥s̄(z) + ∇ĝk(z) − ∇g(z)∥ ≤ ∥s̄(z)∥ + ∥∇ĝk(z) − ∇g(z)∥ ≤ 3ρ/2,   (5.67)

which completes the proof.

The AS family is used in particular to show that the algorithms derived from gradient projection methods considered in Section 5.4 satisfy Condition 5.3. These algorithms are thus expected to converge in stochastic environments. Note that the algorithms studied in Section 5.4 involve point-to-point mappings of the type M : F(X) × Y → X, for which we simply write yk+1 = M(ĝk, yk).

5.4 Optimisation by scaled gradient projection with approximate line search

The convergence of (∇ĝk) (Condition 5.2-(ii)) and the Lipschitz continuity of ∇ĝk with the same finite Lipschitz constant for all k imply the Lipschitz continuity of ∇g. Moreover, if we assume that ∇g is Lipschitz continuous, and that the ∇ĝk are Lipschitz continuous for all k, then the true function ∇g and all the models ∇ĝk have a common Lipschitz constant, which we denote by L in our developments. In this section, the following assumption will be made.

Assumption 5.1 (Lipschitz continuity of the gradients) The gradient of the limit function g and the gradients of the models ĝk for all k are Lipschitz continuous on Y with constant L.

Given two positive constants λ and λ̄ with 0 < λ ≤ λ̄ < ∞, T(p) is defined in (4.8) as the set of symmetric, positive definite scaling matrices in R^{p×p} with eigenvalues bounded by the constants λ and λ̄, i.e. T ∈ R^{p×p} belongs to T(p) iff λ ≼ T ≼ λ̄. Notice that

λ^{1/2} ∥x∥ ≤ ∥x∥_T ≤ λ̄^{1/2} ∥x∥,  ∀x ∈ R^p, T ∈ T(p),   (5.70)

where ∥·∥_T denotes the scaled norm, defined by (3.19) in Example 3.4. We suggest the following stochastic optimisation algorithm, presented in the form of a mapping G and covering a broad family of optimisation methods based on gradient descent. The mapping relies on the scaled projection introduced in (3.20).
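As a quick sanity check, the bounds in (5.70) can be verified numerically. The sketch below assumes the scaled norm of (3.19) is ∥x∥_T = (x⊤Tx)^{1/2}; the matrix, bounds and test vector are illustrative choices, not taken from the text.

```python
import math

# Numerical illustration of inequality (5.70) for a 2x2 scaling matrix T
# with eigenvalues inside [lam, lam_bar]; the scaled norm is assumed to
# be ||x||_T = sqrt(x' T x), following (3.19).

def scaled_norm(T, x):
    # ||x||_T = sqrt(x' T x) for a symmetric positive definite T
    Tx = [T[0][0] * x[0] + T[0][1] * x[1], T[1][0] * x[0] + T[1][1] * x[1]]
    return math.sqrt(x[0] * Tx[0] + x[1] * Tx[1])

T = [[2.0, 0.5], [0.5, 1.0]]      # eigenvalues ~ 0.79 and 2.21
lam, lam_bar = 0.5, 2.5           # bounds with lam <= eig(T) <= lam_bar
x = [3.0, -1.0]
nx = math.hypot(x[0], x[1])       # Euclidean norm of x

assert math.sqrt(lam) * nx <= scaled_norm(T, x) <= math.sqrt(lam_bar) * nx
```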

Algorithm 5.1 (Scaled gradient projection) Consider a sequence of functions (ĝk) with ĝk ∈ G(Y) for all k, a scaling mapping T : G(Y) × Y → T(m), a step-size selection rule â : G(Y) × Y → (0,1], and an initial point y0 ∈ Y. A gradient projection algorithm is given by

yk+1 = G(ĝk, yk),  k = 0, 1, 2, ... ,   (5.71)

where we define the mapping G : G(Y) × Y → Y as

G(f, y) ≐ [y − â(f, y) T(f, y) ∇f(y)]^+_{Y,T(f,y)}.   (5.72)

In the context of distributed networks, Algorithm 5.1 enjoys the property that it can be implemented in parallel as long as the computations are executed synchronously by the nodes of the network and with global consensus on the step-sizes; this restriction is relaxed in Section 5.4.3, where a cyclic implementation of the distributed gradient projection method is proposed.
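One iteration of the mapping G in (5.72) can be sketched as follows, under the simplifying setting of a diagonal scaling matrix and a box feasible set, where the scaled projection reduces to a componentwise clip. The fixed step-size stands in for the line-search rule of Definition 5.4, and the objective and all numerical values are illustrative assumptions.

```python
# Minimal sketch of one iteration of Algorithm 5.1 (mapping G in (5.72)),
# assuming a box feasible set Y and a DIAGONAL scaling matrix T, in which
# case the scaled projection is a componentwise clip.  The constant step
# size `a` is a placeholder for the line search of Definition 5.4.

def gradient_projection_step(y, grad, T_diag, a, lower, upper):
    """One scaled projected gradient step y+ = [y - a*T*grad]^+_{Y,T}."""
    return [min(max(yi - a * ti * gi, lo), up)
            for yi, gi, ti, lo, up in zip(y, grad, T_diag, lower, upper)]

# Illustrative example: g(y) = (y1 - 1)^2 + (y2 + 2)^2 over the box [0, 3]^2,
# whose constrained minimiser is (1, 0).
grad = lambda y: [2.0 * (y[0] - 1.0), 2.0 * (y[1] + 2.0)]
y = [3.0, 3.0]
for _ in range(50):
    y = gradient_projection_step(y, grad(y), [1.0, 1.0], 0.25,
                                 [0.0, 0.0], [3.0, 3.0])
# y approaches the constrained minimiser (1, 0)
```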

Notice that, when ĝk = g for all k, we recover the gradient projection algorithm (3.80), for which the choice of the step-sizes and scaling matrices is critical in guaranteeing the convergence to a solution in deterministic contexts,


where constant or decreasing step-size policies often lead to slow convergence.

With gradient descent methods it is common to associate step-size selection rules based on line search, such as the Armijo rule already used in Chapter 4, which guarantees sufficient descent at each iteration. We use a similar step-size rule for the mapping G, with the particularity that the line search is done on the model functions ĝk and not on the true function g, which is unknown in Problem 5.1.

Definition 5.4 (Approximate line search) Given the fixed scalar parameters β, σ ∈ (0,1), a function f ∈ G(Y) and the scaling mapping T : G(Y) × Y → T(m), we consider the mapping â : G(Y) × Y → (0,1] where â(f, y) is the largest step-size a ∈ {β^m}_{m≥0} satisfying

f(y) − f(ŷ(a)) ≥ σ a^{−1} ∥ŷ(a) − y∥²_{T(f,y)^{−1}},   (5.73)

where ŷ(a) ≐ [y − a T(f, y) ∇f(y)]^+_{Y,T(f,y)}.
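The backtracking rule of Definition 5.4 can be sketched as follows, again assuming a diagonal scaling matrix and a box feasible set so that ŷ(a) is a componentwise clip. The parameters β, σ, the safeguard `max_m` and the quadratic example are illustrative assumptions.

```python
# Hedged sketch of the approximate line search of Definition 5.4: the
# largest a in {beta^m} satisfying the sufficient-descent test (5.73),
# evaluated on a model f rather than on the unknown true function.

def approx_line_search(f, grad, y, T_diag, lower, upper,
                       beta=0.5, sigma=0.1, max_m=30):
    g = grad(y)
    fy = f(y)
    a = 1.0
    for _ in range(max_m):
        # y_hat(a) = [y - a T grad f(y)]^+_{Y,T} (diagonal T: clip)
        y_hat = [min(max(yi - a * ti * gi, lo), up)
                 for yi, gi, ti, lo, up in zip(y, g, T_diag, lower, upper)]
        # squared displacement in the T^{-1}-scaled norm
        sq = sum((yh - yi) ** 2 / ti for yh, yi, ti in zip(y_hat, y, T_diag))
        if fy - f(y_hat) >= sigma * sq / a:      # test (5.73)
            return a, y_hat
        a *= beta
    return a, y_hat

# Usage on a simple quadratic model over the box [-5, 5]^2:
f = lambda y: y[0] ** 2 + y[1] ** 2
grad = lambda y: [2.0 * y[0], 2.0 * y[1]]
a, y_hat = approx_line_search(f, grad, [2.0, 0.0], [1.0, 1.0],
                              [-5.0, -5.0], [5.0, 5.0])
```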

The developments of Section 4.3.3 reveal that the step-size selection rule given in Definition 5.4 is well defined, in the sense that the computation of the step-sizes requires a finite number of evaluations of (5.73) at each step, and that the resulting step-sizes are bounded away from zero by a function of the Lipschitz constant L. Under Assumption 5.1, the Lipschitz constant L is common to the true function g and all the models ĝk, and there exists a lower bound a > 0 for the step-sizes of Algorithm 5.1.

In the basic optimisation problem where ĝk = g for all k, the above step-size policy is known to guarantee sufficient descent at each step and the global convergence of Algorithm 5.1. Establishing the convergence of the algorithm in the stochastic setting is the object of Section 5.4.2.

5.4.2 Convergence analysis

The aim of this section is to show that, under certain conditions, the gradient projection mapping G of Algorithm 5.1 used with the step-size rule â of Definition 5.4 satisfies Condition 5.3, and is thus convergent in stochastic settings in accordance with Proposition 5.1.

We now consider the family AS suggested in Section 5.3.3 and make the following observation.

Proposition 5.2 (Descent directions) Consider a sequence (ĝk) of functions in G(Y) satisfying Condition 5.2 and Assumption 5.1. For any compact C ⊂ Y and ρ > 0, one can find a k̄ < ∞ such that −T(ĝk, y)∇ĝk(y) is a descent direction of g at y for all y ∈ C \ AS(ρ) and k > k̄.

Proof A vector d ∈ R^m is a descent direction for g at a point y ∈ Y if ∇g(y)⊤d < 0. Noting that ∇ĝk(y) = ∇g(y) + [∇ĝk(y) − ∇g(y)] and using (5.70), we have, for any y ∈ Y and T̂ ∈ T(m),

∇g(y)⊤ T̂ ∇ĝk(y) = ∥∇g(y)∥²_T̂ + ∇g(y)⊤ T̂ [∇ĝk(y) − ∇g(y)]   (5.74)
  = ∥∇g(y)∥²_T̂ + (T̂^{1/2} ∇g(y))⊤ [T̂^{1/2} (∇ĝk(y) − ∇g(y))]   (5.75)
  ≥ ∥∇g(y)∥²_T̂ − ∥T̂^{1/2} ∇g(y)∥ ∥T̂^{1/2} (∇ĝk(y) − ∇g(y))∥   (5.76)
  = ∥∇g(y)∥²_T̂ − ∥∇g(y)∥_T̂ ∥∇ĝk(y) − ∇g(y)∥_T̂   (5.77)
  ≥ λ ∥∇g(y)∥² − λ̄ ∥∇g(y)∥ ∥∇ĝk(y) − ∇g(y)∥   (5.78)
  = λ ∥∇g(y)∥ [∥∇g(y)∥ − λ̄ λ^{−1} ∥∇ĝk(y) − ∇g(y)∥].   (5.79)

Let C be a compact subset of Y and ρ > 0. By Condition 5.2-(ii), one can find a k̄ < ∞ such that ∥∇ĝk(y) − ∇g(y)∥ < λ λ̄^{−1} ρ for all k > k̄. The proposition then follows from (5.79), T(ĝk, y) ∈ T(m), and the fact that we have ∥∇g(y)∥ ≥ ρ if y ∈ C \ AS(ρ) by (5.64).
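The chain (5.74)-(5.79) can be checked on concrete numbers. The scaling matrix, gradient and model error below are arbitrary illustrative values, chosen only so that the eigenvalue bounds hold.

```python
import math

# Numerical check of the bound (5.74)-(5.79): for a scaling matrix T_hat
# with eigenvalues in [lam, lam_bar], the inner product grad_g' T_hat grad_gk
# is bounded below by lam*||grad_g||*(||grad_g|| - (lam_bar/lam)*||delta||),
# where delta = grad_gk - grad_g is the model error.

def mat_vec(T, v):
    return [T[0][0] * v[0] + T[0][1] * v[1], T[1][0] * v[0] + T[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

T_hat = [[2.0, 0.5], [0.5, 1.0]]        # eigenvalues within [0.5, 2.5]
lam, lam_bar = 0.5, 2.5
grad_g = [4.0, 1.0]                     # gradient of the true function
delta = [0.3, -0.2]                     # model error grad_gk - grad_g
grad_gk = [grad_g[0] + delta[0], grad_g[1] + delta[1]]

lhs = dot(grad_g, mat_vec(T_hat, grad_gk))
ng, nd = math.hypot(*grad_g), math.hypot(*delta)
rhs = lam * ng * (ng - (lam_bar / lam) * nd)
assert lhs >= rhs > 0   # so -T_hat grad_gk is a descent direction for g
```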

Proposition 5.2 tells us that, given a ρ > 0 and for k large enough, the search directions of the gradient projection Algorithm 5.1 applied successively to the sequence of functions (ĝk) are valid descent directions for g in the periphery of the set AS(ρ). The algorithm is thus expected to approach AS(ρ), which contains the solutions of Problem 5.1, and the extent of which can be reduced by decreasing ρ.

In the setting of Section 5.3.2, however, convergence only occurs if lying outside AS(ρ) guarantees not only valid descent directions for the true function g, but also sufficient displacements in Y. This property is only true in some particular cases, and we need to consider the following assumption on the feasible set of the problem and on the scaling mapping of the projected gradient algorithm.

Assumption 5.2 (Restriction on (X, T)) The set X is a closed convex subset of a real space R^p, T is a scaling mapping G(X) × X → T(p), and (X, T) satisfies one of the following conditions:

(i) The set X is the whole real space, i.e. X ≡ R^p.

(ii) The scaling matrix T(f, x) is diagonal for all (f, x) ∈ G(X) × X, and the set X is a box of the type

X = {x ∈ R^p | b ≤ x ≤ b̄},   (5.80)

with b ∈ [−∞, ∞)^p, b̄ ∈ (−∞, ∞]^p and b ≤ b̄.
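The role of case (ii) can be illustrated numerically: with a diagonal T, the T-scaled projection onto a box decouples coordinate-wise and coincides with a componentwise clip. The crude random-search check below, with illustrative values, confirms that no box point is T-closer to x than the clipped point.

```python
import random

# With a DIAGONAL scaling matrix T, the scaled projection onto a box of
# the type (5.80) is just a componentwise clip, since the scaled distance
# separates over coordinates.  Random-search sanity check, illustrative
# values only.

def clip(x, lo, up):
    return [min(max(xi, l), u) for xi, l, u in zip(x, lo, up)]

def sq_dist_T(u, v, T_diag):
    # squared scaled distance ||u - v||_T^2 = sum_i t_i (u_i - v_i)^2
    return sum(t * (a - b) ** 2 for t, a, b in zip(T_diag, u, v))

random.seed(0)
T_diag = [2.0, 0.5]
lo, up = [0.0, 0.0], [3.0, 3.0]
x = [-1.0, 4.5]                      # point outside the box
p = clip(x, lo, up)                  # candidate scaled projection: (0, 3)
best = min(sq_dist_T(x, [random.uniform(l, u) for l, u in zip(lo, up)], T_diag)
           for _ in range(10000))
assert sq_dist_T(x, p, T_diag) <= best + 1e-9
```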


Notice, in particular, that this assumption holds for (Y, T) in Algorithm 5.1 when Problem 5.1 is unconstrained (Y ≡ R^m), or when the scaling function T of the algorithm is diagonal and Problem 5.1 is the dual of a convex stochastic optimisation problem.

The convergence analysis of Algorithm 5.1 is divided into two distinct parts.

In Appendix C.2 we derive properties of the scaled gradient projection operator under Assumption 5.2. These properties address the descent along a generic function over a generic set, and are not directly concerned with the stochastic optimisation framework, nor with the convergence issues inherent to the sequences of models used for the true function. The actual convergence of the mapping G in stochastic environments is studied in Appendix D.2, based on the results of Appendix C.2.

As explained in Appendix C.2, the convergence analysis of Algorithm 5.1 is facilitated by Assumption 5.2 for the reason that projected gradient descents are then expected to describe particular trajectories on the surface of Y, namely, broken lines composed of a finite number of line segments crossing subspaces of decreasing dimensions. Proposition C.1 shows in particular that these segments are projections of −â(f, y)T(f, y)∇f(y) on their respective subspaces, as illustrated in Figure 5.3.

A last result, stated by Corollary C.1, tells us that the magnitude of the displacements in the feasible space caused by (5.72) can be bounded below by a linear function of the norm of the `smallest' subgradient of f at the destination point G(f, y). On the basis of these observations, it is shown in Appendix D.2 that Assumption 5.2 guarantees both sufficiently small displacements for G inside AS sets, and displacements large enough outside those sets, which are the two features required by Condition 5.3 for the convergence of the algorithm in the sense of Proposition 5.1.

Note that the relation between the magnitudes of displacements and the sets AS is only established under Assumption 5.2. In the general case, though, vectors lying outside a set AS(ρ) may lead to extremely short displacements regardless of the value of ρ. This can be seen in the settings of Figures 5.3(b) and 5.3(c), where projected gradient descents from a vector (0, ϵ) outside AS(ρ) with ϵ arbitrarily small may yield the point (0, 0), and thus displacements of arbitrarily small lengths, for a broad range of values of ρ.

The convergence of Algorithm 5.1 can be stated as follows.

Result 5.3 (Convergence of Algorithm 5.1) Let Assumption 5.2 hold in R^m for (Y, T). Algorithm 5.1 satisfies Condition 5.3 on Y with the mapping G, the class of functions G(Y), the family {AS(ρ)}ρ≥0, and sequences of models satisfying Condition 5.2.

Proof Condition 5.3-(i) and Condition 5.3-(ii) are direct consequences of Lemmas D.2 and D.3 in Appendix D.2.

[Figure 5.3 shows three panels: (a) X box, T diagonal; (b) X non box; (c) T non diagonal.]

Figure 5.3: Impact of Assumption 5.2. Consider a closed convex set X ⊂ R², a function f ∈ G(X), a scaling matrix T ∈ T(2), a vector x ∈ X and the vector

z = [x − aT∇f(x)]^+_{X,T}   (5.81)

obtained by projected gradient descent from x with scaling matrix T and step-size a > 0. Assume that x = (1, 6) and aT∇f(x) = (4, 12).

First suppose that Assumption 5.2 is satisfied, and let X = R²_{≥0} be a box of the type (5.80) and T a diagonal scaling matrix. The vector z, depicted in Figure 5.3(a), is such that z − x = ẑ[1] + ẑ[2] + ẑ[3] with ẑ[3] = 0, where the segments ẑ[1], ẑ[2] and ẑ[3] are projections of fragments of −aT∇f(x) on subspaces of R² with decreasing dimensions 2, 1 and 0, respectively.

Suppose now that X is not a box and Assumption 5.2 does not hold. We see in Figure 5.3(b) that z − x = ẑ[1] + ẑ[2] + ẑ[3] + ẑ[4] with ẑ[3] = 0, where ẑ[1], ẑ[2], ẑ[3] and ẑ[4] are projections on subspaces with respective dimensions 2, 1, 0 and 1. Hence the dimensions of the successive subspaces are not necessarily decreasing.

The same observation can be made if the scaling matrix is not diagonal. In Figure 5.3(c) we set X = R²_{≥0} and

T = [ 2 2 ; 2 3 ],   (5.82)

and observe that the projections are no longer orthogonal and z − x = ẑ[1] + ẑ[2] + ẑ[3] + ẑ[4] with ẑ[3] = 0, where ẑ[1], ẑ[2], ẑ[3] and ẑ[4] are projections on subspaces with (not always decreasing) dimensions 2, 1, 0 and 1, respectively.

5.4.3 Cyclic gradient projection algorithm for stochastic problems

We now study the applicability of a sequential block-coordinate implementation of Algorithm 5.1, especially designed for the optimisation of networks in which node synchronisation is an issue.

The analysis is restricted to an implementation of the Gauss-Seidel type, which can be seen as an extension of the mapping S introduced in Definition 4.2 to stochastic network optimisation. Note that, unlike the Jacobi modes of implementation, the Gauss-Seidel modes cannot be considered as parallel, equivalent implementations of the global methods. Therefore the algorithm suggested in this section requires an individual analysis.

We consider, as in Section 5.2.3, a separable stochastic network with a node set N = {1, ..., n}. It is assumed that the feasible set is the Cartesian product set (5.22), with Yi ⊂ R^{mi} for i ∈ N, and that the true function g is the sum of locally computable terms gi with sparse dependency structures on y1, ..., yn, in accordance with (5.21).

In this section we would like to apply gradient projections along coordinate blocks to a sequence of models (ĝk) satisfying Condition 5.2 with ĝk ∈ G(Y). In distributed settings, it is sensible to assume that the models ĝk are sums of local terms, i.e.

ĝk = ∑_{i=1}^{n} ĝ_i^k,  k = 0, 1, 2, ...   (5.83)

where ĝ_i^k is, like gi, a function of a restricted number of components yj with j ∈ Ni ∪ {i}, and can therefore be computed locally by node i with the help of its neighbours. Although the function g is unknown in our stochastic framework, we know from Section 4.2.2 that the structure of Y enables the nodes to compute locally and individually directions of descent on the models ĝk.

The principle of the considered algorithm is that local projected gradient descents are processed individually and successively by the nodes. We restrict ourselves to the Gauss-Seidel mode of implementation of the algorithm, where the local descents are processed sequentially in a predefined order, and assume for simplicity that a new model ĝk+1 is computed after each cycle of n local operations on block coordinates².

The following algorithm is considered.

Algorithm 5.2 (Gauss-Seidel gradient projection) Consider a sequence of functions (ĝk) with ĝk ∈ G(Y) satisfying (5.83) for all k, a set of scaling mappings {Ti}i∈N such that Ti : G(Y) × Y → T(mi), a set of step-size selection rules {âi}i∈N with âi : G(Y) × Y → (0,1], and an initial point y0 ∈ Y. Consider the n mappings Gi : G(Y) × Y → Yi defined by

Gi(f, y) ≐ [yi − âi(f, y) Ti(f, y) ∇i f(y)]^+_{Yi,Ti(f,y)},  ∀i ∈ N,   (5.84)

² Note that it is possible to consider different settings where a new model for the true function is computed after every new local gradient projection.

and the n mappings Ĝi : G(Y) × Y → Y such that

Ĝi(f, y) ≐ (y1, ..., yi−1, Gi(f, y), yi+1, ..., yn),  ∀i ∈ N.   (5.85)

A sequential gradient projection algorithm is given by

yk+1 = S(ĝk, yk),  k = 0, 1, 2, ...,   (5.86)

where we define the mapping S : G(Y) × Y → Y as

S ≐ Ĝn ∘ Ĝn−1 ∘ ... ∘ Ĝ1.   (5.87)

Concretely, for any vector y ∈ Y and function f ∈ G(Y), S(f, y) is obtained by

(i) setting ẑ^0 = y,

(ii) computing successively, for i = 1, ..., n, the points ẑ^i such that

ẑ^i_j = [ẑ^{i−1}_i − âi(f, ẑ^{i−1}) Ti(f, ẑ^{i−1}) ∇i f(ẑ^{i−1})]^+_{Yi,Ti(f,ẑ^{i−1})}  if j = i,
ẑ^i_j = ẑ^{i−1}_j  if j ≠ i,   (5.88)

(iii) setting S(f, y) = ẑ^n.
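The steps (i)-(iii) above can be sketched compactly. For brevity the sketch assumes scalar blocks, identity scaling, a box feasible set and a common fixed step-size in place of the local line search of Definition 5.5; the objective and all values are illustrative.

```python
# Sketch of one sweep of Algorithm 5.2 (the mapping S in (5.87)): the
# blocks of y are updated sequentially, each update seeing the blocks
# already refreshed earlier in the cycle, as in (5.88).

def gauss_seidel_sweep(y, partial_grad, a, lower, upper):
    """Apply the block updates (5.88) in the fixed order i = 1, ..., n."""
    z = list(y)                              # z^0 = y
    for i in range(len(z)):
        gi = partial_grad(z, i)              # nabla_i f at the CURRENT point z
        z[i] = min(max(z[i] - a * gi, lower[i]), upper[i])
    return z                                 # S(f, y) = z^n

# Illustrative example: f(y) = (y0 - 1)^2 + (y0 - y1)^2 on the box [0, 2]^2,
# whose minimiser is (1, 1).
def partial_grad(z, i):
    if i == 0:
        return 2.0 * (z[0] - 1.0) + 2.0 * (z[0] - z[1])
    return -2.0 * (z[0] - z[1])

y = [2.0, 0.0]
for _ in range(100):
    y = gauss_seidel_sweep(y, partial_grad, 0.25, [0.0, 0.0], [2.0, 2.0])
# y approaches the minimiser (1, 1)
```

Note that, per the text, each node could use its own step-size âi here; a common constant is used purely to keep the sketch short.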

It can be seen that in Algorithm 5.2 the individual step-sizes are chosen independently by the nodes and without global consensus, which is an interesting property in the context of distributed networks.

A local Armijo-type step-size selection rule based on approximate line search is used on the models ĝk.

Definition 5.5 (Local line search) Given the scalar parameters β, σ ∈ (0,1) and a scaling mapping Ti : G(Y) × Y → T(mi) for each i ∈ N, we consider the mappings âi : G(Y) × Y → (0,1] such that, for any f ∈ G(Y) and y ∈ Y, âi(f, y) is the largest step-size a ∈ {β^m}_{m≥0} satisfying

f(y) − f(ŷ(a)) ≥ σ a^{−1} ∥ŷi(a) − yi∥²_{Ti(f,y)^{−1}},   (5.89)

where ŷi(a) = [yi − a Ti(f, y) ∇i f(y)]^+_{Yi,Ti(f,y)} and ŷj(a) = yj if j ≠ i.

We know from Section 4.3.3 that the step-size rule given in Definition 5.5 is well defined as long as the gradient of the function given as argument is Lipschitz continuous, and that the step-sizes are bounded away from zero by a function of the Lipschitz constant, which is identical for all the function models ĝk under Assumption 5.1.