A Stochastic Quasigradient Algorithm with Variable Metric


Working Paper


S.P. Uryas'ev

WP-89-98

December 1989


International Institute for Applied Systems Analysis A-2361 Laxenburg Austria

Telephone: (0 22 36) 715 21 * 0   Telex: 079 137 iiasa a   Telefax: (0 22 36) 71313


Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.


Foreword

This paper deals with a new variable metric algorithm for stochastic optimization problems.

The essence of the algorithm is as follows: two stochastic quasigradient algorithms work simultaneously, the first in the main space and the second with respect to the matrices that modify the space variables. Almost sure convergence of the algorithm is proved for the case of a convex (possibly nonsmooth) objective function.

Alexander B. Kurzhanski Chairman System and Decision Sciences Program


1 Introduction

Stochastic quasigradient (or stochastic approximation) algorithms are used for the optimization of quite general stochastic systems with smooth, nonsmooth, and infinite-dimensional objective functions, for distributed systems and others (see, for example, [3], [4], [6]-[9], [11]-[13], [16], [20]). The structure of such algorithms is simple, and at each iteration only a few additional calculations are required. However, the simplest variants of these algorithms have a significant drawback: a slow practical convergence rate for ill-conditioned functions. This is connected not only with randomness; in the deterministic case the simple gradient algorithm is also quite inefficient for ill-conditioned functions. Variable metric algorithms are more complicated, but they have a considerably faster convergence rate. These algorithms are widely used for smooth deterministic optimization problems (see [2]). Several authors have generalized such algorithms to the stochastic case with a smooth objective function ([1], [5], [8], [10], [14], [17], [18] and [21]). In this paper, a variable metric algorithm for stochastic programming problems with a nonsmooth objective function is presented. Such algorithms were already proposed in [19].
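The remark about ill-conditioned functions can be illustrated with a small deterministic sketch. The quadratic test function, the constant `KAPPA` and the helper names below are assumptions chosen for illustration only; they do not come from the paper. With the scalar step 1/KAPPA the well-scaled coordinate contracts by only 1% per iteration, while a diagonal metric equal to the inverse curvature reaches the minimizer in one step.

```python
def grad(x, kappa):
    """Gradient of the illustrative quadratic f(x) = 0.5 * (x[0]**2 + kappa * x[1]**2)."""
    return [x[0], kappa * x[1]]


def descend(x0, kappa, steps, metric):
    """Iterate x <- x - H g with a fixed diagonal metric H = diag(metric)."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x, kappa)
        x = [xj - hj * gj for xj, hj, gj in zip(x, metric, g)]
    return x


KAPPA = 100.0  # condition number of the illustrative quadratic (assumption)

# Simple gradient method: one scalar step, the reciprocal of the largest curvature.
plain = descend([1.0, 1.0], KAPPA, steps=100, metric=[1.0 / KAPPA, 1.0 / KAPPA])

# Variable-metric idea: rescale each coordinate by its own inverse curvature.
scaled = descend([1.0, 1.0], KAPPA, steps=1, metric=[1.0, 1.0 / KAPPA])
```

After 100 iterations the first coordinate of `plain` is still about 0.99^100 of its starting value, whereas `scaled` is already at the minimizer.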

2 Basic Idea of the Algorithm

Here we consider the problem of minimizing a convex (possibly nonsmooth) function $f(x)$:

$$f(x) \to \min_{x \in R^n}, \tag{1}$$

where $R^n$ is the $n$-dimensional Euclidean space. In the class of problems considered here, exact values of gradients or generalized gradients of the function $f(x)$ are not available; instead, vectors are known which are statistical estimates of these quantities. (The exact values of the function and its gradients are very difficult to compute.) Such problems arise, for example, in the minimization of functions of the form

$$f(x) = E\,\varphi(x, \omega).$$

Here and below we assume that all random values are given on the probability space $(\Omega, \mathcal{F}, P)$.
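As a hedged illustration of such statistical estimates, the sketch below takes an objective of expectation form, f(x) = E φ(x, ω), with the assumed sample function φ(x, ω) = Σ_j |x_j − ω_j| and ω standard normal. Neither this φ nor the function names come from the paper. Each call to `sampled_subgradient` returns one subgradient of φ(·, ω) at x; averaging many such vectors approximates the gradient of f, which for this particular choice has the closed form erf(x_j / √2).

```python
import math
import random


def phi(x, w):
    """Assumed sample objective phi(x, w) = sum_j |x_j - w_j| (convex, nonsmooth)."""
    return sum(abs(xj - wj) for xj, wj in zip(x, w))


def sampled_subgradient(x, rng):
    """Draw w ~ N(0, I) and return a subgradient of phi(., w) at x, i.e. sign(x_j - w_j)."""
    w = [rng.gauss(0.0, 1.0) for _ in x]
    return [(xj > wj) - (xj < wj) for xj, wj in zip(x, w)]


def averaged_subgradient(x, samples, seed=0):
    """Average many sampled subgradients to approximate the gradient of f(x) = E phi(x, w)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(samples):
        xi = sampled_subgradient(x, rng)
        total = [t + v for t, v in zip(total, xi)]
    return [t / samples for t in total]


def exact_gradient(x):
    """d/dx_j E|x_j - w_j| = 2 * Phi(x_j) - 1 = erf(x_j / sqrt(2)) for standard normal w_j."""
    return [math.erf(xj / math.sqrt(2.0)) for xj in x]
```

A single sampled vector is a very noisy estimate (its entries are only ±1 or 0); it is the conditional expectation that matches the gradient.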

Under general assumptions, the generalized differential of the convex function $f(x)$ is calculated by the formula (see [15])

$$\partial f(x) = E\,\partial_x \varphi(x, \omega),$$

so $\partial_x \varphi(x, \omega)$ is a set of vectors that are statistical estimates of gradients of the function $f(x)$. We call these estimates stochastic quasigradients [3]. To solve problem (1), the following algorithm is used:

$$x^{s+1} = x^s - \rho_s H^s \xi^s, \quad s = 0, 1, \dots, \tag{3}$$

where $\rho_s$, $s = 0, 1, \dots$, is a sequence of positive random scalar stepsizes; $H^s$, $s = 0, 1, \dots$, is a sequence of $n \times n$ random square matrices; and $\xi^s$, $s = 0, 1, \dots$, is a sequence of stochastic quasigradients, i.e. the conditional mathematical expectation $E_s \xi^s$ is a generalized gradient:

$$E_s \xi^s \in \partial f(x^s),$$

where $E_s$ is the conditional mathematical expectation with respect to the $\sigma$-field defined by the random vector $x^s$. How can the matrix $H^s$ be chosen? There exists a natural criterion function $\Phi_s(H)$:

$$\Phi_s(H) = E_s f(x^s - \rho_s H \xi^s), \tag{4}$$

which characterizes the quality of the choice of the matrix $H$ at iteration $s$. The function $\Phi_s(H)$ is the mathematical expectation of the objective function $f$ at the point $x^{s+1}$. The best matrix $H$ at iteration $s$ is a solution to the problem

$$\Phi_s(H) \to \min_{H \in R^{n \times n}}. \tag{5}$$

Problem (5) is somewhat more complicated than problem (1). However, the optimal matrix $H$ is not needed at each iteration; it is enough to find some updating rule. Let us differentiate the function $\Phi_s(H)$ at some point $H_0^s$ (see formula (2)):

where $\xi^{sT}$ is the transpose of the vector $\xi^s$. We denote by $\xi_0^s$ some stochastic quasigradient at the point $x_0^s \stackrel{\mathrm{def}}{=} x^s - \rho_s H_0^s \xi^s$, i.e.

One can see that


thus $-\rho_s \xi_0^s \xi^{sT}$ is a stochastic quasigradient of the function $\Phi_s(H)$ at the point $H_0^s$. We consider that the matrix $H_0^s$ is known from the previous iteration $s - 1$. To modify the matrix $H_0^s$, we use the stochastic quasigradient method (see [3]):

Analogously, the next iteration can be done at the point $H_1^s$ and so on. Suppose that at iteration $s$ a total of $i(s) \ge 1$ iterations with respect to the matrix $H$ are made. We write this as follows:

where $\xi_i^s$, $i = 0, \dots, i(s)$, are stochastic quasigradients, i.e.

In formula (6), the matrix $H$ is modified additively, but multiplicative variants of this algorithm can also be developed (see [19]).
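The differentiation step of this section can be sketched as follows. This is a hedged reconstruction rather than the paper's own display: it assumes that differentiation and conditional expectation may be interchanged (as in [15]), writes $g(\cdot)$ for an element of $\partial f(\cdot)$, takes $\Phi_s(H)$ to be the expectation of $f$ at the point $x^{s+1}$, and uses $\lambda_{s0}$ for the first matrix stepsize.

```latex
\begin{align*}
\Phi_s(H) &= E_s\, f\bigl(x^s - \rho_s H \xi^s\bigr),\\
\partial_H f\bigl(x^s - \rho_s H \xi^s\bigr)\Big|_{H = H_0^s}
  &\ni -\rho_s\, g(x_0^s)\, \xi^{sT},
  \qquad x_0^s = x^s - \rho_s H_0^s \xi^s,\\
H_1^s &= H_0^s - \lambda_{s0}\bigl(-\rho_s\, \xi_0^s\, \xi^{sT}\bigr)
       = H_0^s + \lambda_{s0}\, \rho_s\, \xi_0^s\, \xi^{sT}.
\end{align*}
```

The second line is the chain rule: for a matrix direction $D$, the directional derivative of $H \mapsto f(x^s - \rho_s H \xi^s)$ is $\langle g, -\rho_s D \xi^s \rangle$, which equals the Frobenius inner product of $-\rho_s g\, \xi^{sT}$ with $D$. Replacing $g(x_0^s)$ by the stochastic quasigradient $\xi_0^s$ gives the additive update in the third line.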

3 Formal Description of the Algorithm and Necessary Conditions for Convergence

Define the optimal set $X^*$ for problem (1) as follows:

$$X^* = \{x^* \in R^n : f(x^*) = \min_{x \in R^n} f(x)\}.$$

Algorithm (3), (6) can solve the optimization problem (1) without constraints. To simplify the convergence proof of the algorithm, we assume that some convex compact set $X \subset R^n$ is known in advance such that $X^* \subset X$. This is not a serious restriction, since in practical situations such a set is usually known. This set could be very large. If $x^s \notin X$, then we assume that the approximation $x^s$ is very far from the extremal set $X^*$ and we restart the algorithm from the initial point $x^0$ with new initial algorithm parameters.

We also assume that the sequences $\{\varepsilon_s\}$, $s = 0, 1, \dots$, and $\{\lambda_{sl}\}$, $s = 0, 1, \dots$, $l = 0, 1, \dots$, are given before starting the algorithm. This predetermination is inconvenient from the practical point of view, but it can be relaxed later. Adaptive formulae could also be written for these sequences, but we do not want to overload the convergence proof with them now. The positive values $\varepsilon_s$ define $i(s)$ in the algorithm: iterations with respect to the matrix are stopped if

$$\rho_s \sum_{l=0}^{i(s)-1} \lambda_{sl} > \varepsilon_s.$$

To avoid misunderstandings, we present here a full formal description of the algorithm.


Algorithm 1

Step I Initialization

$s = 0$, $i = -1$, $x^0 = x_{\mathrm{init}}$, $H^{-1}_{i(-1)} = I$ is the unit matrix; $\xi^0$ is a stochastic quasigradient at the point $x^0$.

Step II Set $H_0^s = H_{i(s-1)}^{s-1}$.

Step III Set $i = 0$.

Step IV Compute the point $x_i^s$:

$$x_i^s = x^s - \rho_s H_i^s \xi^s.$$

Step V Compute $H_{i+1}^s = H_i^s + \lambda_{si} \rho_s \xi_i^s \xi^{sT}$; here $\xi_i^s$ is a stochastic quasigradient at the point $x_i^s$.

Step VI If $i \ge 1$ and $\rho_s \sum_{l=0}^{i-1} \lambda_{sl} > \varepsilon_s$, then $i(s) = i$; go to Step VIII.

Step VII Set $i = i + 1$ and return to Step IV.

Step VIII If $x_{i(s)}^s \in X$, then $x^{s+1} = x_{i(s)}^s$, $\xi^{s+1} = \xi_{i(s)}^s$; otherwise $x^{s+1} = x^0$, $\xi^{s+1} = \xi^0$.

Step IX Set $s = s + 1$ and return to Step II.
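The steps of Algorithm 1 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the set X is taken to be a Euclidean ball of radius `radius`, the stepsizes `rho`, `lam` and `eps` are user-supplied deterministic functions, the matrix update is read as H + λ ρ ξ_i ξ^T, the matrix is reset to the identity on a restart, and a cap `max_inner` guards against a non-terminating inner loop. The toy quasigradient `toy_quasigrad` is likewise hypothetical.

```python
import random


def matvec(H, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(h * c for h, c in zip(row, v)) for row in H]


def identity(n):
    return [[float(p == q) for q in range(n)] for p in range(n)]


def algorithm1(quasigrad, x_init, radius, outer_steps, rho, lam, eps,
               seed=0, max_inner=1000):
    """Sketch of Algorithm 1 with X = {x : ||x|| <= radius} (assumption)."""
    rng = random.Random(seed)
    n = len(x_init)
    x0 = list(x_init)
    xi0 = quasigrad(x0, rng)                        # Step I
    x, xi, H = list(x0), list(xi0), identity(n)
    for s in range(outer_steps):                    # Step IX advances s
        i, lam_sum = 0, 0.0                         # Steps II-III: H is H_0^s
        while True:
            d = matvec(H, xi)
            x_i = [a - rho(s) * b for a, b in zip(x, d)]   # Step IV
            xi_i = quasigrad(x_i, rng)              # quasigradient at x_i^s
            # Step VI, tested before the update: the matrix that would be
            # discarded at the stopping index is simply never formed.
            if (i >= 1 and rho(s) * lam_sum > eps(s)) or i >= max_inner:
                break
            coef = lam(s, i) * rho(s)               # Step V (assumed form)
            H = [[H[p][q] + coef * xi_i[p] * xi[q] for q in range(n)]
                 for p in range(n)]
            lam_sum += lam(s, i)
            i += 1                                  # Step VII
        if sum(c * c for c in x_i) ** 0.5 <= radius:       # Step VIII
            x, xi = x_i, xi_i
        else:
            # Restart; resetting H is an assumption about "new initial parameters".
            x, xi, H = list(x0), list(xi0), identity(n)
    return x, H


def toy_quasigrad(x, rng):
    """Subgradient sample for f(x) = E(|x_1 - w_1| + 10 |x_2 - w_2|), w ~ N(0, I)."""
    w = [rng.gauss(0.0, 1.0) for _ in x]
    return [k * ((a > b) - (a < b)) for k, a, b in zip([1.0, 10.0], x, w)]
```

A possible call is `algorithm1(toy_quasigrad, [2.0, 2.0], 10.0, 200, lambda s: 1.0 / (s + 1), lambda s, i: 0.1, lambda s: 0.15 / (s + 1))`, for which the inner loop stops at i(s) = 2 on every outer iteration. By construction the returned point always lies in X, since Step VIII either accepts a point of X or restarts from the initial point.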

Let us define $d(x, X^*)$ as the distance between a point $x$ and the set $X^*$:

$$d(x, X^*) = \min_{x^* \in X^*} \|x - x^*\|.$$

To prove the convergence of Algorithm 1, we shall use the following necessary conditions (see [20]) for convergence of stochastic algorithms. (These conditions are similar to the conditions in [12] but are more general.)

D1 There exists a compact set $X \subset R^n$ such that $\{x^s(\omega)\} \subset X$ a.s.

D2 $W : X \to R$ is a continuous function.

D3 If there exists an event $B \subset \Omega$ such that $P(B) > 0$ and for all $\omega \in B$ there exists a subsequence $\{x^{l_k(\omega)}(\omega)\}$ convergent to a point $x'(\omega)$ with $d(x'(\omega), X^*) > 0$, then for any random value $\varepsilon(\omega) > 0$ a.s. there exists a subsequence $\{\nu_k(\omega)\}$ such that

$$W(x^\tau) \le W(x'(\omega)) + \varepsilon(\omega) \quad \text{for } l_k(\omega) \le \tau \le \nu_k(\omega),$$

$$\liminf_{k \to \infty} W(x^{\nu_k(\omega)}(\omega)) = \underline{W}(\omega) < W(x'(\omega)).$$

D4 $(\underline{W}(\omega), W(x'(\omega))) \setminus W(X^*) \ne \emptyset$ for almost all $\omega \in B$, i.e. the open interval $(\underline{W}(\omega), W(x'(\omega)))$ does not belong to the set $W(X^*) \stackrel{\mathrm{def}}{=} \{W(x^*) : x^* \in X^*\}$ for almost all $\omega \in B$.

D5 For almost all subsequences $\{x^{s_k(\omega)}(\omega)\}$ such that $\lim_{k \to \infty} x^{s_k(\omega)}(\omega) = x^*(\omega)$, $x^*(\omega) \in X^*$, the condition

$$\max\{[W(x^{s_k(\omega)+1}(\omega)) - W(x^{s_k(\omega)}(\omega))], 0\} \to 0 \quad \text{for } k \to \infty$$

is satisfied.

Next is the theorem about these necessary conditions (see [20]).

Theorem 1 Let the stochastic sequence $\{x^s(\omega)\}$ satisfy conditions D1-D5; then $x^s(\omega) \to X^*$ a.s., i.e. $d(x^s(\omega), X^*) \to 0$ a.s.

4 Convergence of the Algorithm

Below we formulate the theorem on the convergence of algorithm 1.

Theorem 2 Let $f : R^n \to R$ be a convex (possibly nonsmooth) function, $X$ be a compact convex set such that $X^* \subset X \subset R^n$ and

$$\inf_{x \notin X,\; x^* \in X^*} \|x - x^*\| = C_1 > \min_{x^* \in X^*} \|x^0 - x^*\|;$$

let the sequences $\{\lambda_{sl}\}$ and $\{\varepsilon_s\}$ be given and let $\{\rho_s\}$ be a random sequence such that $\rho_s$ depends upon the random vectors

let the stochastic quasigradients and algorithm parameters satisfy the conditions

$$\|\xi_i^s\| \le C_2 \ \text{a.s.}, \quad i = 1, \dots, i(s); \quad s = 0, 1, \dots,$$

p b H b 1 0 a.s. for s + m ,


m

<

^A^{= const,} ^s= 0 , 1 ,

...

1=0

Then almost surely all the accumulation points of the sequence $\{x^s\}$ generated by Algorithm 1 belong to $X^*$.

Proof We use necessary conditions D1-D5 to prove the convergence of the algorithm. Define

$$W(x) = \min_{y \in X^*} \|x - y\|^2 = d^2(x, X^*).$$

Condition D1 is valid due to the algorithm construction and the compactness of the set $X$. It is easy to see that the function $W(x)$ is continuous and consequently condition D2 holds.
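For intuition about the function W, the following sketch evaluates the squared distance to a solution set in the special case where X* is a coordinate box, so that the nearest point is obtained by clamping. The box shape and the names `dist_to_box` and `W` are assumptions made for illustration; the paper does not restrict X* in this way.

```python
def dist_to_box(x, lower, upper):
    """Euclidean distance from x to the box {y : lower <= y <= upper} (assumed shape of X*)."""
    nearest = [min(max(c, lo), hi) for c, lo, hi in zip(x, lower, upper)]
    return sum((c - p) ** 2 for c, p in zip(x, nearest)) ** 0.5


def W(x, lower, upper):
    """Squared distance to the box, the analogue of W(x) = d^2(x, X*)."""
    return dist_to_box(x, lower, upper) ** 2
```

W vanishes exactly on the box and varies continuously with x, which is the property used for condition D2.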

Let us prove condition D3. Denote

$$\zeta^s = \xi^s - g(x^s), \quad \zeta_i^s = \xi_i^s - g(x_i^s), \quad U_\varepsilon(x) = \{y \in R^n : \|y - x\| \le \varepsilon\},$$

$$f^* = \min_{x \in R^n} f(x), \quad C_3 = \max_{z, y \in X} \|z - y\|,$$

$$x_*^s = \arg\min_{y \in X^*} \|x^s - y\|, \quad x_{i*}^s = \arg\min_{y \in X^*} \|x_i^s - y\|.$$

Let the probability of the event $B = \{\omega \in \Omega : \exists$ a subsequence $x^{l_k(\omega)}(\omega)$ of the sequence $x^s(\omega)$ such that $x^{l_k(\omega)}(\omega) \to x'(\omega) \notin X^*\}$ be greater than zero. For simplicity we shall omit the argument $\omega$ below. Steps IV, V and VI of the algorithm and conditions (9) and (10) of the theorem imply

Applying this estimate the proper number of times we obtain


Estimate W(x6) as follows:

Since the function f (x) is convex, then, with designations (20), we get

Substituting the two previous estimates into estimate (21) and (19)-(20) yields

If $x^s_{i(s)} \in X$, then we have from the algorithm formulae

Using this equality the proper number of times we get

If $x^s_{i(s)}, \dots, x^{m-1}_{i(m-1)} \in X$, $m > s$, then again applying this formula for $m - 1, \dots, s + 1$ we obtain

In view of conditions (9) and (10) of the theorem, step VI of the algorithm, and the last equality we can estimate

It also follows from (23), (24) and (25) that

Let us consider the events $\omega \in B$ such that there exists a subsequence $\{x^{l_k}\}$ with

$$x^{l_k} \to x', \quad W(x') > 0, \quad U_\delta(x') \subset X \quad \text{for } k \to \infty, \tag{27}$$

where 6 is some positive random value for almost all w E B. Denote r as some random value such that 0

<

r

<

for w E B. We define the index subsequence {v, ) (this subsequence depends upon w) such that

7=ln

the existence of this subsequence follows from the theorem conditions (11)-(13). In view of conditions (15)-(19) and step VI of the algorithm

Since p7llH;t7)1r;' + 0 a.s. for T + oo (see condition (14)), then (26) and (29) imply

From (30) and $x^{l_k} \to x'$ for $k \to \infty$ it follows that the approximations $x_l^\tau$, $l_k < \tau \le \nu_k - 1$, $0 \le l \le i(\tau)$, belong to the set $U_{2\eta}(x')$ for sufficiently large numbers $k$ (this $k$ depends upon $\omega$).

Since

then


It also follows from (27) that

Since the points $x_l^\tau$ for $l_k < \tau < \nu_k - 1$, $0 \le l \le i(\tau)$, belong to the set $U_{2\eta}(x')$ for sufficiently large $k$, (31) implies the existence of a random value $\alpha > 0$ a.s. such that

$$f^* - f(x_l^\tau) < -\alpha \quad \text{for } l_k \le \tau \le \nu_k - 1, \ 0 \le l \le i(\tau) \tag{32}$$

for sufficiently large $k$. Applying inequality (22) the necessary number of times with (32) we have

v,-1 2 def

+

^{C , ~ A}

C

^p, ⁼W ( X ' S )

+

^T2

+

^T3

+

^T4

+

^{T s}

+

^{T s}

^.

We estimate the lower limit of the terms in inequality (23). For the second term we have (see (14) and (28))

lim T2 = lim 2

C

p7C311Hit711 =

- -

,+w lc--rw

7=1*

For term T3


In view of algorithm step VI and the convexity of the function $\|\cdot\|^2$, for the fourth term in (33)

The martingale series

CT=o

^c7q7is convergent with conditions ( 1 1 ) - ( 1 3 ) and thus

For sufficiently large ^{K ,}the points z r , 1,

5

r

<

^v,^-^{1, 0}

5

¹

<

^{i ( r )}belong t o t h e convex set U z q ( x l ) and U Z ~ ( X ' ) n X * = 8. Consequently, for properly small q there exists a positive random value y

>

⁰a.s. such that ( g ( x 7 ) , x * - ^{X I )}

>

y ( ( x * - xlll

>

^{0 , x*}^E^{X * .}Further we get

v,-1 2

1 1 ~ *

^-^x1/1-'

C

c r ( g ( x 7 ) , x* - x') 7=1,

Combining the last inequality with (36) and (37) we obtain

It follows from conditions (9), (16) and (19) that

consequently the martingale series

is convergent. This fact implies


We have from condition (16) also that

Taking the lower limit for (33) and using (34), (35) and (38)-(40),

2 2 c - 9

lim

W(xY")

5

~ ( x l " ) - 2 a y 2 q ~ ; 9

<

^W(xf)^-^{2 a y q} ²

.

K + O O 6-OQ

This last inequality proves the necessary condition D3 for the subsequences that satisfy condition (27).

Now let us consider the case with

where $\partial X$ is the boundary of the set $X$. As in the previous case we define the index subsequence $\{\nu_k\}$ such that

We consider the following two possibilities:

1. There exists an infinite subsequence $\{\theta_m\}$ such that $l_m < \theta_m < \nu_m$, $x^{\theta_m} \in X$, $x^{\theta_m}_{i(\theta_m)} \notin X$, $x^{\theta_m + 1} = x^0$. In this case, condition (8) implies

and the subsequence $\{x^{\theta_m + 1}\}$ satisfies the necessary condition D3.

2. There exists a number $K$ such that $x^\tau \in X$ for $l_k \le \tau \le \nu_k$, $k \ge K$. For this case, the proof of condition D3 coincides with the proof where $x'$ belongs to the interior of the set $X$.

This proves condition D3.

Condition D4 is valid because the function W ( x ) is constant on X*.

Let us prove the last condition D5. We consider a subsequence $x^{s_k}$ such that $x^{s_k} \to x^*$, $x^* \in X^*$. It follows from estimate (25) that

Since (see conditions (12), (14), (16) and (19))


then (41) implies

for almost all $\omega$ such that $x^{s_k} \to x^*$, $x^* \in X^*$. The function $W(x)$ is continuous, thus (42) proves condition D5.

All conditions D1-D5 are checked and the theorem is proved.

5 References

1. Betro, B. and L. De Biase: A Newton-like Method for Stochastic Optimization. In: "Towards Global Optimization 2", North-Holland Publishing Company, 1978, pp. 269-289.

2. Dennis, J.E. and J.J. Moré: Quasi-Newton Methods, Motivation and Theory. SIAM Review, 19, 1977, pp. 46-89.

3. Ermoliev, Yu.M.: Stochastic Quasi-Gradient Methods and Their Applications to Systems Optimization. Stochastics, 4, 1983, pp. 1-37.

4. Gaivoronski, A.: Implementation of Stochastic Quasigradient Methods. In: "Numerical Techniques for Stochastic Optimization" (Yu. Ermoliev, R.J-B Wets Eds.), Springer-Verlag, 1988, pp. 313-353.

5. Gerencser, L.: Strong Consistency Theorems Related to Stochastic Quasi-Newton Methods. In: "Stochastic Optimization", Springer-Verlag Lecture Notes in Control and Information Science, 81, 1984.

6. Kaniovski, Yu.M., P.S. Knopov and Z.V. Nekrylova: Limit Theorems for Stochastic Programming Processes. Naukova Dumka, Kiev, 1980 (in Russian).

7. Kushner, H.J. and G. Jin: Stochastic Approximation Algorithms for Parallel and Distributed Processing. Stochastics, 22, 1987, pp. 219-250.

8. Ljung, L. and T. Soderstrom: Theory and Practice of Recursive Identification. MIT Press, 1983.

9. Marti, K.: Descent Stochastic Quasigradient Methods. In: "Numerical Techniques for Stochastic Optimization" (Yu. Ermoliev, R.J-B Wets Eds.), Springer-Verlag, 1988, pp. 393-401.

10. McAllister, P.M.: Adaptive Approaches to Stochastic Programming, this volume.

11. Novikova, N.M.: Some Stochastic Programming Methods in Hilbert Space. Zhurnal Vachislitelnoj Matematiki i Matematicheskoj Phisiki, 12, 25, 1985, pp. 1795-1813.

12. Nurminski, E.A.: Numerical Methods for Solving Deterministic and Stochastic Minimax Problems. Naukova Dumka, Kiev, 1979, 159 p. (in Russian).


13. Pflug, G.: Stepsize Rules, Stopping Times and Their Implementation in Stochastic Quasigradient Algorithms. In: "Numerical Techniques for Stochastic Optimization" (Yu. Ermoliev, R.J-B Wets Eds.), Springer-Verlag, 1988, pp. 353-373.

14. Polyak, B.T. and Ya.Z. Tsypkin: Adaptive Estimation Algorithms: Convergence, Optimality, Robustness. Automation and Remote Control, 3, 1979, pp. 71-84.

15. Rockafellar, R.T. and R.J-B Wets: On the Interchange of Subdifferentiation and Conditional Expectation for Convex Functionals. Stochastics, 7, 1982, pp. 173-182.

16. Ruszczynski, A. and W. Syski: A Method of Aggregate Stochastic Sub-Gradients with On-Line Stepsize Rules for Convex Stochastic Programming Problems. Mathematical Programming Study, 28, part II, 1985, pp. 113-131.

17. Saridis, G.S.: Learning Applied to Successive Approximation Algorithms. IEEE Trans. Syst. Sci. Cybern., SSC-6, Apr. 1970, pp. 97-103.

18. Sakrison, D.J.: The Use of Stochastic Approximation to Solve the System Identification Problem. IEEE Transactions on Automatic Control, AC-12, 1967, pp. 563-567.

19. Uryas'ev, S.P.: Stochastic Quasi-Gradient Algorithms with Adaptively Controlled Parameters. Working Paper, IIASA, WP-86-32, 1986.

20. Uryas'ev, S.P.: Adaptive Algorithms of Stochastic Optimization and Game Theory. Nauka, Moscow, 1990 (to appear, in Russian).

21. Wets, R.J-B: Modeling and Solution Strategies for Unconstrained Stochastic Optimization Problems. Annals of Operations Research, 1, 1984, pp. 3-22.
