

5.4. Optimisation by scaled gradient projection with approximate line search 137

Consider now a vector z ∈ C ∩ AS(ρ). We have s̄(z) ∈ S(z) with ∥s̄(z)∥ ≤ ρ. Using (5.60) and proceeding similarly, we find (∇g(z) − s̄(z))⊤(x − z) ≥ 0 for all x ∈ Y. It follows that s̄(z) + ∇ĝk(z) − ∇g(z) ∈ Sk(z), and

∥s̄k(z)∥ ≤ ∥s̄(z) + ∇ĝk(z) − ∇g(z)∥ ≤ ∥s̄(z)∥ + ∥∇ĝk(z) − ∇g(z)∥ ≤ 3ρ/2,   (5.67)

which completes the proof.

The AS family is used in particular to show that the algorithms derived from gradient projection methods considered in Section 5.4 satisfy Condition 5.3. These algorithms are thus expected to converge in stochastic environments. Note that the algorithms studied in Section 5.4 involve point-to-point mappings of the type M : F(X) × Y → X, for which we simply write yk+1 = M(ĝk, yk).

5.4 Optimisation by scaled gradient projection with approximate line search

The convergence of (∇ĝk) (Condition 5.2-(ii)) and the Lipschitz continuity of ∇ĝk with the same finite Lipschitz constant for all k imply the Lipschitz continuity of ∇g. Moreover, if we assume that ∇g is Lipschitz continuous, and that the ∇ĝk are Lipschitz continuous for all k, then the true function ∇g and all the models ∇ĝk have a common Lipschitz constant, which we denote by L in our developments. In this section, the following assumption will be made.

Assumption 5.1 (Lipschitz continuity of the gradients) The gradient of the limit function g and the gradients of the models ĝk for all k are Lipschitz continuous on Y with constant L.

Given two positive constants λ and λ̄ with 0 < λ ≤ λ̄ < ∞, T(p) is defined in (4.8) as the set of symmetric, positive definite scaling matrices in R^{p×p} with eigenvalues bounded by the constants λ and λ̄, i.e. T ∈ R^{p×p} belongs to T(p) iff λ ≼ T ≼ λ̄. Notice that

λ^{1/2} ∥x∥ ≤ ∥x∥_T ≤ λ̄^{1/2} ∥x∥,  ∀x ∈ R^p, T ∈ T(p),   (5.70)

where ∥·∥_T denotes the scaled norm, defined by (3.19) in Example 3.4. We suggest the following stochastic optimisation algorithm, presented in the form of a mapping G and covering a broad family of optimisation methods based on gradient descent. The mapping relies on the scaled projection introduced in (3.20).
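As a quick sanity check, the bounds in (5.70) can be verified numerically. The sketch below assumes the scaled norm of (3.19) is ∥x∥_T = (x⊤Tx)^{1/2}; the matrix, bounds and test vector are illustrative choices, not taken from the text.

```python
import math

# Numerical illustration of inequality (5.70) for a 2x2 scaling matrix T
# with eigenvalues inside [lam, lam_bar]; the scaled norm is assumed to
# be ||x||_T = sqrt(x' T x), following (3.19).

def scaled_norm(T, x):
    # ||x||_T = sqrt(x' T x) for a symmetric positive definite T
    Tx = [T[0][0] * x[0] + T[0][1] * x[1], T[1][0] * x[0] + T[1][1] * x[1]]
    return math.sqrt(x[0] * Tx[0] + x[1] * Tx[1])

T = [[2.0, 0.5], [0.5, 1.0]]      # eigenvalues ~ 0.79 and 2.21
lam, lam_bar = 0.5, 2.5           # bounds with lam <= eig(T) <= lam_bar
x = [3.0, -1.0]
nx = math.hypot(x[0], x[1])       # Euclidean norm of x

assert math.sqrt(lam) * nx <= scaled_norm(T, x) <= math.sqrt(lam_bar) * nx
```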

Algorithm 5.1 (Scaled gradient projection) Consider a sequence of functions (ĝk) with ĝk ∈ G(Y) for all k, a scaling mapping T : G(Y) × Y → T(m), a step-size selection rule â : G(Y) × Y → (0,1], and an initial point y0 ∈ Y. A gradient projection algorithm is given by

yk+1 = G(ĝk, yk),  k = 0, 1, 2, ... ,   (5.71)

where we define the mapping G : G(Y) × Y → Y as

G(f, y) ≐ [y − â(f, y) T(f, y) ∇f(y)]^+_{Y,T(f,y)}.   (5.72)

In the context of distributed networks, Algorithm 5.1 enjoys the property that it can be implemented in parallel as long as the computations are executed synchronously by the nodes of the network and with global consensus on the step-sizes; this restriction is relaxed in Section 5.4.3, where a cyclic implementation of the distributed gradient projection method is proposed.
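One iteration of the mapping G in (5.72) can be sketched as follows, under the simplifying setting of a diagonal scaling matrix and a box feasible set, where the scaled projection reduces to a componentwise clip. The fixed step-size stands in for the line-search rule of Definition 5.4, and the objective and all numerical values are illustrative assumptions.

```python
# Minimal sketch of one iteration of Algorithm 5.1 (mapping G in (5.72)),
# assuming a box feasible set Y and a DIAGONAL scaling matrix T, in which
# case the scaled projection is a componentwise clip.  The constant step
# size `a` is a placeholder for the line search of Definition 5.4.

def gradient_projection_step(y, grad, T_diag, a, lower, upper):
    """One scaled projected gradient step y+ = [y - a*T*grad]^+_{Y,T}."""
    return [min(max(yi - a * ti * gi, lo), up)
            for yi, gi, ti, lo, up in zip(y, grad, T_diag, lower, upper)]

# Illustrative example: g(y) = (y1 - 1)^2 + (y2 + 2)^2 over the box [0, 3]^2,
# whose constrained minimiser is (1, 0).
grad = lambda y: [2.0 * (y[0] - 1.0), 2.0 * (y[1] + 2.0)]
y = [3.0, 3.0]
for _ in range(50):
    y = gradient_projection_step(y, grad(y), [1.0, 1.0], 0.25,
                                 [0.0, 0.0], [3.0, 3.0])
# y approaches the constrained minimiser (1, 0)
```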

Notice that, when ĝk = g for all k, we recover the gradient projection algorithm (3.80), for which the choice of the step-sizes and scaling matrices is critical in guaranteeing the convergence to a solution in deterministic contexts,


where constant or decreasing step-size policies often lead to slow convergence.

With gradient descent methods it is common to associate step-size selection rules based on line search, such as the Armijo rule already used in Chapter 4, which guarantees sufficient descent at each iteration. We use a similar step-size rule for the mapping G, with the particularity that the line search is done on the model functions ĝk and not on the true function g, which is unknown in Problem 5.1.

Definition 5.4 (Approximate line search) Given the fixed scalar parameters β, σ ∈ (0,1), a function f ∈ G(Y) and the scaling mapping T : G(Y) × Y → T(m), we consider the mapping â : G(Y) × Y → (0,1] where â(f, y) is the largest step-size a ∈ {β^m}_{m≥0} satisfying

f(y) − f(ŷ(a)) ≥ σ a^{−1} ∥ŷ(a) − y∥²_{T(f,y)^{−1}},   (5.73)

where ŷ(a) ≐ [y − a T(f, y) ∇f(y)]^+_{Y,T(f,y)}.
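The backtracking rule of Definition 5.4 can be sketched as follows, again assuming a diagonal scaling matrix and a box feasible set so that ŷ(a) is a componentwise clip. The parameters β, σ, the safeguard `max_m` and the quadratic example are illustrative assumptions.

```python
# Hedged sketch of the approximate line search of Definition 5.4: the
# largest a in {beta^m} satisfying the sufficient-descent test (5.73),
# evaluated on a model f rather than on the unknown true function.

def approx_line_search(f, grad, y, T_diag, lower, upper,
                       beta=0.5, sigma=0.1, max_m=30):
    g = grad(y)
    fy = f(y)
    a = 1.0
    for _ in range(max_m):
        # y_hat(a) = [y - a T grad f(y)]^+_{Y,T} (diagonal T: clip)
        y_hat = [min(max(yi - a * ti * gi, lo), up)
                 for yi, gi, ti, lo, up in zip(y, g, T_diag, lower, upper)]
        # squared displacement in the T^{-1}-scaled norm
        sq = sum((yh - yi) ** 2 / ti for yh, yi, ti in zip(y_hat, y, T_diag))
        if fy - f(y_hat) >= sigma * sq / a:      # test (5.73)
            return a, y_hat
        a *= beta
    return a, y_hat

# Usage on a simple quadratic model over the box [-5, 5]^2:
f = lambda y: y[0] ** 2 + y[1] ** 2
grad = lambda y: [2.0 * y[0], 2.0 * y[1]]
a, y_hat = approx_line_search(f, grad, [2.0, 0.0], [1.0, 1.0],
                              [-5.0, -5.0], [5.0, 5.0])
```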

The developments of Section 4.3.3 reveal that the step-size selection rule given in Definition 5.4 is well defined, in the sense that the computation of the step-sizes requires a finite number of evaluations of (5.73) at each step, and that the resulting step-sizes are bounded away from zero by a function of the Lipschitz constant L. Under Assumption 5.1, the Lipschitz constant L is common to the true function g and all the models ĝk, and there exists a lower bound a > 0 for the step-sizes of Algorithm 5.1.

In the basic optimisation problem where ĝk = g for all k, the above step-size policy is known to guarantee sufficient descent at each step and the global convergence of Algorithm 5.1. Establishing the convergence of the algorithm in the stochastic setting is the object of Section 5.4.2.

5.4.2 Convergence analysis

The aim of this section is to show that, under certain conditions, the gradient projection mapping G of Algorithm 5.1 used with the step-size rule â of Definition 5.4 satisfies Condition 5.3, and is thus convergent in stochastic settings in accordance with Proposition 5.1.

We now consider the family AS suggested in Section 5.3.3 and make the following observation.

Proposition 5.2 (Descent directions) Consider a sequence (ĝk) of functions in G(Y) satisfying Condition 5.2 and Assumption 5.1. For any compact C ⊂ Y and ρ > 0, one can find a k̄ < ∞ such that −T(ĝk, y)∇ĝk(y) is a descent direction of g at y for all y ∈ C \ AS(ρ) and k > k̄.

Proof A vector d ∈ R^m is a descent direction for g at a point y ∈ Y if ∇g(y)⊤d < 0. Noting that ∇ĝk(y) = ∇g(y) + [∇ĝk(y) − ∇g(y)] and using (5.70), we have, for any y ∈ Y and T̂ ∈ T(m),

∇g(y)⊤ T̂ ∇ĝk(y) = ∥∇g(y)∥²_T̂ + ∇g(y)⊤ T̂ [∇ĝk(y) − ∇g(y)]   (5.74)
  = ∥∇g(y)∥²_T̂ + (T̂^{1/2} ∇g(y))⊤ [T̂^{1/2} (∇ĝk(y) − ∇g(y))]   (5.75)
  ≥ ∥∇g(y)∥²_T̂ − ∥T̂^{1/2} ∇g(y)∥ ∥T̂^{1/2} (∇ĝk(y) − ∇g(y))∥   (5.76)
  = ∥∇g(y)∥²_T̂ − ∥∇g(y)∥_T̂ ∥∇ĝk(y) − ∇g(y)∥_T̂   (5.77)
  ≥ λ ∥∇g(y)∥² − λ̄ ∥∇g(y)∥ ∥∇ĝk(y) − ∇g(y)∥   (5.78)
  = λ ∥∇g(y)∥ [∥∇g(y)∥ − λ̄ λ^{−1} ∥∇ĝk(y) − ∇g(y)∥].   (5.79)

Let C be a compact subset of Y and ρ > 0. By Condition 5.2-(ii), one can find a k̄ < ∞ such that ∥∇ĝk(y) − ∇g(y)∥ < λ λ̄^{−1} ρ for all k > k̄. The proposition then follows from (5.79), T(ĝk, y) ∈ T(m), and the fact that we have ∥∇g(y)∥ ≥ ρ if y ∈ C \ AS(ρ) by (5.64).
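The chain (5.74)-(5.79) can be checked on concrete numbers. The scaling matrix, gradient and model error below are arbitrary illustrative values, chosen only so that the eigenvalue bounds hold.

```python
import math

# Numerical check of the bound (5.74)-(5.79): for a scaling matrix T_hat
# with eigenvalues in [lam, lam_bar], the inner product grad_g' T_hat grad_gk
# is bounded below by lam*||grad_g||*(||grad_g|| - (lam_bar/lam)*||delta||),
# where delta = grad_gk - grad_g is the model error.

def mat_vec(T, v):
    return [T[0][0] * v[0] + T[0][1] * v[1], T[1][0] * v[0] + T[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

T_hat = [[2.0, 0.5], [0.5, 1.0]]        # eigenvalues within [0.5, 2.5]
lam, lam_bar = 0.5, 2.5
grad_g = [4.0, 1.0]                     # gradient of the true function
delta = [0.3, -0.2]                     # model error grad_gk - grad_g
grad_gk = [grad_g[0] + delta[0], grad_g[1] + delta[1]]

lhs = dot(grad_g, mat_vec(T_hat, grad_gk))
ng, nd = math.hypot(*grad_g), math.hypot(*delta)
rhs = lam * ng * (ng - (lam_bar / lam) * nd)
assert lhs >= rhs > 0   # so -T_hat grad_gk is a descent direction for g
```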

Proposition 5.2 tells us that, given a ρ > 0 and for k large enough, the search directions of the gradient projection Algorithm 5.1 applied successively to the sequence of functions (ĝk) are valid descent directions for g in the periphery of the set AS(ρ). The algorithm is thus expected to approach AS(ρ), which contains the solutions of Problem 5.1, and the extent of which can be reduced by decreasing ρ.

In the setting of Section 5.3.2, however, convergence only occurs if lying outside AS(ρ) guarantees not only valid descent directions for the true function g, but also sufficient displacements in Y. This property is only true in some particular cases, and we need to consider the following assumption on the feasible set of the problem and on the scaling mapping of the projected gradient algorithm.

Assumption 5.2 (Restriction on (X, T)) The set X is a closed convex subset of a real space R^p, T is a scaling mapping G(X) × X → T(p), and (X, T) satisfies one of the following conditions:

(i) The set X is the whole real space, i.e. X ≡ R^p.

(ii) The scaling matrix T(f, x) is diagonal for all (f, x) ∈ G(X) × X, and the set X is a box of the type

X = {x ∈ R^p | b ≤ x ≤ b̄},   (5.80)

with b ∈ [−∞, ∞)^p, b̄ ∈ (−∞, ∞]^p and b ≤ b̄.
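The role of case (ii) can be illustrated numerically: with a diagonal T, the T-scaled projection onto a box decouples coordinate-wise and coincides with a componentwise clip. The crude random-search check below, with illustrative values, confirms that no box point is T-closer to x than the clipped point.

```python
import random

# With a DIAGONAL scaling matrix T, the scaled projection onto a box of
# the type (5.80) is just a componentwise clip, since the scaled distance
# separates over coordinates.  Random-search sanity check, illustrative
# values only.

def clip(x, lo, up):
    return [min(max(xi, l), u) for xi, l, u in zip(x, lo, up)]

def sq_dist_T(u, v, T_diag):
    # squared scaled distance ||u - v||_T^2 = sum_i t_i (u_i - v_i)^2
    return sum(t * (a - b) ** 2 for t, a, b in zip(T_diag, u, v))

random.seed(0)
T_diag = [2.0, 0.5]
lo, up = [0.0, 0.0], [3.0, 3.0]
x = [-1.0, 4.5]                      # point outside the box
p = clip(x, lo, up)                  # candidate scaled projection: (0, 3)
best = min(sq_dist_T(x, [random.uniform(l, u) for l, u in zip(lo, up)], T_diag)
           for _ in range(10000))
assert sq_dist_T(x, p, T_diag) <= best + 1e-9
```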


Notice, in particular, that this assumption holds for (Y, T) in Algorithm 5.1 when Problem 5.1 is unconstrained (Y ≡ R^m), or when the scaling function T of the algorithm is diagonal and Problem 5.1 is the dual of a convex stochastic optimisation problem.

The convergence analysis of Algorithm 5.1 is divided into two distinct parts.

In Appendix C.2 we derive properties of the scaled gradient projection operator under Assumption 5.2. These properties address the descent along a generic function over a generic set, and are not directly concerned with the stochastic optimisation framework, nor with the convergence issues inherent to the sequences of models used for the true function. The actual convergence of the mapping G in stochastic environments is studied in Appendix D.2, based on the results of Appendix C.2.

As explained in Appendix C.2, the convergence analysis of Algorithm 5.1 is facilitated by Assumption 5.2 for the reason that projected gradient descents are then expected to describe particular trajectories on the surface of Y, namely, broken lines composed of a finite number of line segments crossing subspaces of decreasing dimensions. Proposition C.1 shows in particular that these segments are projections of −â(f, y)T(f, y)∇f(y) on their respective subspaces, as illustrated in Figure 5.3.

A last result, stated by Corollary C.1, tells us that the magnitude of the displacements in the feasible space caused by (5.72) can be bounded below by a linear function of the norm of the `smallest' subgradient of f at the destination point G(f, y). On the basis of these observations, it is shown in Appendix D.2 that Assumption 5.2 guarantees both sufficiently small displacements for G inside AS sets, and displacements large enough outside those sets, which are the two features required by Condition 5.3 for the convergence of the algorithm in the sense of Proposition 5.1.

Note that the relation between the magnitudes of displacements and the sets AS is only established under Assumption 5.2. In the general case, though, vectors lying outside a set AS(ρ) may lead to extremely short displacements regardless of the value of ρ. This can be seen in the settings of Figures 5.3(b) and 5.3(c), where projected gradient descents from a vector (0, ϵ) outside AS(ρ) with ϵ arbitrarily small may yield the point (0, 0), and thus displacements of arbitrarily small lengths, for a broad range of values of ρ.

The convergence of Algorithm 5.1 can be stated as follows.

Result 5.3 (Convergence of Algorithm 5.1) Let Assumption 5.2 hold in R^m for (Y, T). Algorithm 5.1 satisfies Condition 5.3 on Y with the mapping G, the class of functions G(Y), the family {AS(ρ)}ρ≥0, and sequences of models satisfying Condition 5.2.

Proof Condition 5.3-(i) and Condition 5.3-(ii) are direct consequences of Lemmas D.2 and D.3 in Appendix D.2.

[Figure 5.3 shows three panels: (a) X box, T diagonal; (b) X non box; (c) T non diagonal.]

Figure 5.3: Impact of Assumption 5.2. Consider a closed convex set X ⊂ R², a function f ∈ G(X), a scaling matrix T ∈ T(2), a vector x ∈ X and the vector

z = [x − aT∇f(x)]^+_{X,T}   (5.81)

obtained by projected gradient descent from x with scaling matrix T and step-size a > 0. Assume that x = (1, 6) and aT∇f(x) = (4, 12).

First suppose that Assumption 5.2 is satisfied, and let X = R²_{≥0} be a box of the type (5.80) and T a diagonal scaling matrix. The vector z, depicted in Figure 5.3(a), is such that z − x = ẑ[1] + ẑ[2] + ẑ[3] with ẑ[3] = 0, where the segments ẑ[1], ẑ[2] and ẑ[3] are projections of fragments of −aT∇f(x) on subspaces of R² with decreasing dimensions 2, 1 and 0, respectively.

Suppose now that X is not a box and Assumption 5.2 does not hold. We see in Figure 5.3(b) that z − x = ẑ[1] + ẑ[2] + ẑ[3] + ẑ[4] with ẑ[3] = 0, where ẑ[1], ẑ[2], ẑ[3] and ẑ[4] are projections on subspaces with respective dimensions 2, 1, 0 and 1. Hence the dimensions of the successive subspaces are not necessarily decreasing.

The same observation can be made if the scaling matrix is not diagonal. In Figure 5.3(c) we set X = R²_{≥0} and

T = [ 2 2 ; 2 3 ],   (5.82)

and observe that the projections are no longer orthogonal and z − x = ẑ[1] + ẑ[2] + ẑ[3] + ẑ[4] with ẑ[3] = 0, where ẑ[1], ẑ[2], ẑ[3] and ẑ[4] are projections on subspaces with (not always decreasing) dimensions 2, 1, 0 and 1, respectively.

5.4.3 Cyclic gradient projection algorithm for stochastic problems

We now study the applicability of a sequential block-coordinate implementation of Algorithm 5.1, especially designed for the optimisation of networks in which node synchronisation is an issue.

The analysis is restricted to an implementation of the Gauss-Seidel type, which can be seen as an extension of the mapping S introduced in Definition 4.2 to stochastic network optimisation. Note that, unlike the Jacobi modes of implementation, the Gauss-Seidel modes cannot be considered as parallel, equivalent implementations of the global methods. Therefore the algorithm suggested in this section requires an individual analysis.

We consider, as in Section 5.2.3, a separable stochastic network with a node set N = {1, ..., n}. It is assumed that the feasible set is the Cartesian product set (5.22), with Yi ⊂ R^{mi} for i ∈ N, and that the true function g is the sum of locally computable terms gi with sparse dependency structures on y1, ..., yn, in accordance with (5.21).

In this section we would like to apply gradient projections along coordinate blocks to a sequence of models (ĝk) satisfying Condition 5.2 with ĝk ∈ G(Y). In distributed settings, it is sensible to assume that the models ĝk are sums of local terms, i.e.

ĝk = ∑_{i=1}^{n} ĝ_i^k,  k = 0, 1, 2, ...   (5.83)

where ĝ_i^k is, like gi, a function of a restricted number of components yj with j ∈ Ni ∪ {i}, and can therefore be computed locally by node i with the help of its neighbours. Although the function g is unknown in our stochastic framework, we know from Section 4.2.2 that the structure of Y enables the nodes to compute locally and individually directions of descent on the models ĝk.

The principle of the considered algorithm is that local projected gradient descents are processed individually and successively by the nodes. We restrict ourselves to the Gauss-Seidel mode of implementation of the algorithm, where the local descents are processed sequentially in a predefined order, and assume for simplicity that a new model ĝk+1 is computed after each cycle of n local operations on block coordinates².

The following algorithm is considered.

Algorithm 5.2 (Gauss-Seidel gradient projection) Consider a sequence of functions (ĝk) with ĝk ∈ G(Y) satisfying (5.83) for all k, a set of scaling mappings {Ti}i∈N such that Ti : G(Y) × Y → T(mi), a set of step-size selection rules {âi}i∈N with âi : G(Y) × Y → (0,1], and an initial point y0 ∈ Y. Consider the n mappings Gi : G(Y) × Y → Yi defined by

Gi(f, y) ≐ [yi − âi(f, y) Ti(f, y) ∇i f(y)]^+_{Yi,Ti(f,y)},  ∀i ∈ N,   (5.84)

² Note that it is possible to consider different settings where a new model for the true function is computed after every new local gradient projection.

and the n mappings Ĝi : G(Y) × Y → Y such that

Ĝi(f, y) ≐ (y1, ..., yi−1, Gi(f, y), yi+1, ..., yn),  ∀i ∈ N.   (5.85)

A sequential gradient projection algorithm is given by

yk+1 = S(ĝk, yk),  k = 0, 1, 2, ...,   (5.86)

where we define the mapping S : G(Y) × Y → Y as

S ≐ Ĝn ∘ Ĝn−1 ∘ ... ∘ Ĝ1.   (5.87)

Concretely, for any vector y ∈ Y and function f ∈ G(Y), S(f, y) is obtained by

(i) setting ẑ^0 = y,

(ii) computing successively, for i = 1, ..., n, the points ẑ^i such that

ẑ^i_j = [ẑ^{i−1}_i − âi(f, ẑ^{i−1}) Ti(f, ẑ^{i−1}) ∇i f(ẑ^{i−1})]^+_{Yi,Ti(f,ẑ^{i−1})}  if j = i,
ẑ^i_j = ẑ^{i−1}_j  if j ≠ i,   (5.88)

(iii) setting S(f, y) = ẑ^n.
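The steps (i)-(iii) above can be sketched compactly. For brevity the sketch assumes scalar blocks, identity scaling, a box feasible set and a common fixed step-size in place of the local line search of Definition 5.5; the objective and all values are illustrative.

```python
# Sketch of one sweep of Algorithm 5.2 (the mapping S in (5.87)): the
# blocks of y are updated sequentially, each update seeing the blocks
# already refreshed earlier in the cycle, as in (5.88).

def gauss_seidel_sweep(y, partial_grad, a, lower, upper):
    """Apply the block updates (5.88) in the fixed order i = 1, ..., n."""
    z = list(y)                              # z^0 = y
    for i in range(len(z)):
        gi = partial_grad(z, i)              # nabla_i f at the CURRENT point z
        z[i] = min(max(z[i] - a * gi, lower[i]), upper[i])
    return z                                 # S(f, y) = z^n

# Illustrative example: f(y) = (y0 - 1)^2 + (y0 - y1)^2 on the box [0, 2]^2,
# whose minimiser is (1, 1).
def partial_grad(z, i):
    if i == 0:
        return 2.0 * (z[0] - 1.0) + 2.0 * (z[0] - z[1])
    return -2.0 * (z[0] - z[1])

y = [2.0, 0.0]
for _ in range(100):
    y = gauss_seidel_sweep(y, partial_grad, 0.25, [0.0, 0.0], [2.0, 2.0])
# y approaches the minimiser (1, 1)
```

Note that, per the text, each node could use its own step-size âi here; a common constant is used purely to keep the sketch short.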

It can be seen that in Algorithm 5.2 the individual step-sizes are chosen independently by the nodes and without global consensus, which is an interesting property in the context of distributed networks.

A local Armijo-type step-size selection rule based on approximate line search is used on the models ĝk.

Definition 5.5 (Local line search) Given the scalar parameters β, σ ∈ (0,1) and a scaling mapping Ti : G(Y) × Y → T(mi) for each i ∈ N, we consider the mappings âi : G(Y) × Y → (0,1] such that, for any f ∈ G(Y) and y ∈ Y, âi(f, y) is the largest step-size a ∈ {β^m}_{m≥0} satisfying

f(y) − f(ŷ(a)) ≥ σ a^{−1} ∥ŷi(a) − yi∥²_{Ti(f,y)^{−1}},   (5.89)

where ŷi(a) = [yi − a Ti(f, y) ∇i f(y)]^+_{Yi,Ti(f,y)} and ŷj(a) = yj if j ≠ i.

We know from Section 4.3.3 that the step-size rule given in Definition 5.5 is well defined as long as the gradient of the function given as argument is Lipschitz continuous, and that the step-sizes are bounded away from zero by a function of the Lipschitz constant, which is identical for all the function models ĝk under Assumption 5.1.