
Working Paper

STOCHASTIC QUASI-GRADIENT
ALGORITHMS WITH ADAPTIVELY
CONTROLLED PARAMETERS

S.P. Urjas'ev

July 1986 WP-86-32

International Institute for Applied Systems Analysis

A-2361 Laxenburg, Austria


NOT FOR QUOTATION WITHOUT THE PERMISSION OF THE AUTHOR

STOCHASTIC QUASI-GRADIENT ALGORITHMS WITH ADAPTIVELY CONTROLLED PARAMETERS

S.P. Urjas'ev

July 1986 WP-86-32

Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.

INTERNATIONAL INSTITUTE FOR APPLIED SYSTEMS ANALYSIS 2361 Laxenburg, Austria


The paper deals with the choice of stepsize and other parameters in stochastic quasi-gradient methods for solving convex problems of stochastic optimization. The principal idea of the methods consists in using random estimates of gradients of the objective function to search for the point of extremum. To control the algorithm parameters, iterative adaptive procedures are suggested which are themselves quasi-gradient algorithms with respect to the parameters. The convergence is proved and estimates of the rate of convergence of such algorithms are given. The results of computations for several stochastic optimization problems are considered. The paper is part of the research on numerical techniques for stochastic optimization conducted in the Adaptation and Optimization project of the System and Decision Sciences Program.

Alexander B. Kurzhanski Chairman System and Decision Sciences Program


CONTENTS

Introduction

1 Quasi-Gradient Algorithm with Adaptive Parameter Control, Non-Formal Description and a Brief Review of Results
1.1 Description of the Algorithm and Various Approaches to Step Size Control

2 Use of Stochastic Quasi-Gradient Algorithms for Stochastic Algorithm Parameter Controls
2.1 Step Size Control for Stochastic Quasi-Gradient Algorithm [22], [23]
2.2 Stochastic Quasi-Gradient Algorithm with Variable Metric
2.3 Algorithm with the Averaging of Stochastic Quasi-Gradients

3 Convergence and Rate of Convergence of Stochastic Quasi-Gradient Algorithm
3.1 The Algorithm Convergence
3.2 Asymptotic Properties of Step Sizes
3.3 Rate of Algorithm Convergence

4 On Program Realization of Stochastic Quasi-Gradient Algorithm

References

STOCHASTIC QUASI-GRADIENT ALGORITHMS WITH ADAPTIVELY CONTROLLED PARAMETERS

S.P. Urjas'ev
Institute of Cybernetics, Academy of Sciences of the Ukrainian SSR
252207 Kiev 207, USSR

The paper is devoted to the development of iterative non-monotone optimization algorithms for problems of convex stochastic optimization with and without constraints. Most problems under discussion feature the lack of complete information about the objective and constraint functions and their derivatives, as well as the non-smooth nature of these functions. The central idea of the discussed numerical methods, called stochastic quasi-gradient methods, consists in the use of random directions instead of precise values of gradients. The random directions are statistical estimates of gradients (stochastic quasi-gradients). The definition of a stochastic quasi-gradient was introduced in the work by Ju.M. Ermoliev and Z.V. Nekrylova [1], and this concept was then developed in works by Ju.M. Ermoliev (see, e.g., [2], [3]).

Stochastic approximation algorithms (which stem from the work by H. Robbins and S. Monro [4]) and many random search algorithms, such as those presented in the work by L.A. Rastrigin [5] and others, are special cases of stochastic quasi-gradient algorithms.

Adaptive procedures are offered and studied by means of which the parameters of the algorithms discussed in this paper are controlled and the practical characteristics of these algorithms are improved. Adaptivity here means that these parameters depend on the process trajectory, in contrast to program procedures, where the parameters depend only on the iteration number.

In particular, step size controls and stopping criteria are suggested for the stochastic algorithms. It should be emphasized that these are exactly the aspects which are most difficult and problematic in the numerical implementation of these methods.

The main point of the suggested approach consists in the following. Almost each iterative algorithm has some parameters to be controlled. Usually there are also criteria which define the quality of the chosen controls. But it is difficult to satisfy these criteria in practice (to find the optimal control), because these quality criteria are difficult to compute. Nevertheless it is possible to vary these criteria by the parameters and to calculate their gradients or stochastic quasi-gradients. The obtained gradients (quasi-gradients) may be used in the construction of recurrence procedures to modify these parameters. In such an approach several gradient procedures operate in the algorithm: in the main space and with respect to the algorithm parameters, i.e., the adaptation of the algorithm parameters occurs.

The following notation will be used:

$R^n$ is an $n$-dimensional Euclidean space;

$\langle \cdot , \cdot \rangle$ is an inner product in $R^n$;

$\| \cdot \|$ is a norm in $R^n$;

$\partial f(x)$ is the subdifferential of the convex function $f : R^n \to R$ at the point $x$;

$(\Omega, \mathcal{F}, P)$ is a probability space on which all random values are defined;

$\omega$ is an elementary event belonging to the set $\Omega$;

a.s. means "almost surely";

$E\xi$ is the mathematical expectation of the random value $\xi$;

$E[\xi \mid \mathcal{F}_s]$ or $E_s \xi$ is the conditional mathematical expectation with respect to the $\sigma$-field $\mathcal{F}_s$;

$\Pi_X(\cdot)$ is the projection onto a convex closed set $X \subset R^n$.

1. QUASI-GRADIENT ALGORITHM WITH ADAPTIVE PARAMETER CONTROL. NON-FORMAL DESCRIPTION AND A BRIEF REVIEW OF RESULTS

Here we shall consider the problem of minimizing a convex (possibly non-smooth) function

$$f(x) \to \min_{x \in X},$$

where $X$ is a convex compact subset of $R^n$. In the considered class of problems, instead of exact values of gradients or generalized gradients of the function $f(x)$, vectors are known which are statistical estimates of these quantities, while the exact values of the function and of its gradients are very difficult to compute. Such problems present themselves, for example, in the minimization of functions of the form

$$f(x) = \int_{\Omega} q(x, \omega)\, \Theta(d\omega).$$

Considering that under the most general assumptions the generalized differential of the convex function $f(x)$ is calculated by the formula [6]

$$\partial f(x) = \int_{\Omega} \partial_x q(x, \omega)\, \Theta(d\omega),$$

the elements of $\partial_x q(x, \omega)$ are then the vectors which serve as statistical estimates of the gradients of the function $f(x)$.

EXAMPLE. A Random Location Equilibrium Problem.

The classic formulation of the Weber problem is as follows: given $n$ points $w^i$, $i = 1, \ldots, n$, in the two-dimensional Euclidean space $R^2$, it is required to find a point $x \in R^2$ such that the sum of the distances to all points $w^i \in R^2$, $i = 1, \ldots, n$, is minimal.

In the generalized statement [7] each point $w^i$, $i = 1, \ldots, n$, is assumed to be a random value specified by some probability measure $\Theta_i(w)$ on $R^2$. The problem consists in finding the point $x \in R^2$ which minimizes the sum of the mathematical expectations of the distances from the point $x$ to the points $w^i$, $i = 1, \ldots, n$:

$$f(x) = \sum_{i=1}^{n} \beta_i \int_{R^2} \| x - w \\|\, \Theta_i(dw) \to \min_{x \in R^2},$$

where $\beta_i > 0$, $i = 1, \ldots, n$. The random function

$$\xi(x, w) = \begin{cases} \dfrac{x - w}{\| x - w \|} & \text{for } x \neq w, \\[4pt] 0 & \text{for } x = w, \end{cases}$$

where the random value $w$ is specified by the probability measure $\Theta$, may be taken as a statistical estimate of the generalized gradients, i.e.

$$\int_{R^2} \xi(x, w)\, \Theta(dw) \in \partial f(x)$$

is satisfied.
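For illustration, the statistical estimate $\xi(x, w)$ above can be sampled directly: draw one point $w_i$ from each measure $\Theta_i$ and sum the weighted unit vectors pointing from $w_i$ towards $x$. The following minimal Python sketch does this for normally distributed $\Theta_i$ (the setting of Example 2 below); all function and variable names here are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def weber_quasi_gradient(x, means, devs, betas):
    """Sample one stochastic quasi-gradient of
    f(x) = sum_i beta_i * E||x - w_i|| at the point x.

    A point w_i is drawn from each Theta_i (here a normal law); the
    subgradient of the i-th summand is beta_i * (x - w_i)/||x - w_i||
    for x != w_i and 0 for x == w_i, so the sum of the samples is an
    unbiased estimate of a generalized gradient of f.
    """
    g = np.zeros_like(x)
    for mu, sigma, beta in zip(means, devs, betas):
        w = rng.normal(mu, sigma)      # w_i ~ Theta_i, a point of R^2
        d = x - w
        norm = np.linalg.norm(d)
        if norm > 0.0:
            g += beta * d / norm
    return g
```

By construction $E\,\xi(x, \cdot) \in \partial f(x)$, so such a sample can be fed directly into the recurrent scheme (1) of the next subsection.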

1.1. Description of the Algorithm and Various Approaches to Step Size Control

An unknown point of the minimum of the convex function $f(x)$ on the set $X$ is estimated by the recurrent sequence [2]

$$x^{s+1} = \Pi_X(x^s - \rho_s \xi^s), \quad s = 0, 1, \ldots, \tag{1}$$

where $X$ is a convex closed set in $R^n$; $\xi^s$ is a stochastic quasi-gradient, i.e. the conditional mathematical expectation of this vector satisfies the relation

$$E_s \xi^s = f_x(x^s) + b^s, \qquad f_x(x^s) \in \partial f(x^s);$$

the $\sigma$-field $\mathcal{F}_s$ is specified by the random vectors $(x^0, \xi^0, x^1, \xi^1, \ldots, x^s)$; $\rho_s$, $s = 0, 1, \ldots$ is some sequence of random values or random $n \times n$ matrices.

The algorithm suggested by H. Robbins and S. Monro [4] for the estimation of the root of a regression function (for $X = R^1$) is a special case of the algorithm (1).
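In code, one iteration of the scheme (1) is a single projected step along the sampled quasi-gradient. A minimal sketch, taking for concreteness a box as the convex compact set $X$ (the names and the choice of projection are ours):

```python
import numpy as np

def project_box(x, lower, upper):
    """Projection Pi_X onto the box X = {x : lower <= x <= upper}."""
    return np.clip(x, lower, upper)

def sqg_step(x, xi, rho, lower, upper):
    """One iteration of scheme (1): x^{s+1} = Pi_X(x^s - rho_s * xi^s)."""
    return project_box(x - rho * xi, lower, upper)
```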

The algorithm suggested for optimization problems in [8] by J. Kiefer and J. Wolfowitz is also a special case of the algorithm (1). In their work the gradient estimate is taken as finite-difference approximations with random estimates of the objective function. Further this approach was developed in the works of many authors (see, e.g., [9-11]) under the assumption of smoothness of the objective function and the absence of dependence of the parameters $\rho_s$, $s = 0, 1, \ldots$ on the process trajectory. Such parameter controls we will call program controls, i.e., the step size is equal to a constant or decreases monotonically by a pre-specified rule depending only on the iteration number.

Algorithms of the type (1) for the optimization of different classes of non-smooth functions are described in [2], [12], [13]. Various approaches to estimating the rate of convergence of schemes of type (1) with program step size controls are presented fairly completely in [10], [11], [14-17]. In practice it may occur that the initial step size is chosen small, and to increase the rate of convergence it should be enlarged. A program step size control naturally cannot take such a situation into account, though from the viewpoint of the asymptotic rate of convergence this control may be ideal. That is why adaptive step size controls are necessary which take into account the behavior of the objective function: such controls enlarge the step size far from the extremum if it is small and decrease it near the point of the minimum. R.J.-B. Wets proposed the stochastic quasi-Newton method [25]; these results were then developed by A. Gaivoronski. Below we consider a different approach.

The first step in this direction was made by H. Kesten [18], who suggested a program-adaptive control. He suggested choosing in the scheme (1), as step sizes $\rho_s$, a prespecified sequence $\{a_k\}$ which satisfies the conditions

$$a_k > 0, \qquad \sum_{k=0}^{\infty} a_k = \infty, \qquad \sum_{k=0}^{\infty} a_k^2 < \infty,$$

but changing the step not at each iteration, only in the case when

$$\langle \xi^s, \xi^{s-1} \rangle < 0.$$
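Kesten's rule is easy to mechanize: a pointer into the prespecified sequence $\{a_k\}$ advances only when two successive quasi-gradients form an obtuse angle. A sketch, assuming $a_k = 1/(k+1)$ as one admissible sequence (the class and names are ours):

```python
import numpy as np

class KestenStepSize:
    """Program-adaptive rule of Kesten [18]: move to the next member of
    the sequence a_k only when <xi^s, xi^{s-1}> < 0."""

    def __init__(self):
        self.k = 0
        self.prev_xi = None

    def step(self, xi):
        if self.prev_xi is not None and np.dot(xi, self.prev_xi) < 0.0:
            self.k += 1                  # the minimum was probably overshot
        self.prev_xi = xi
        # a_k = 1/(k+1) satisfies sum a_k = inf, sum a_k^2 < inf
        return 1.0 / (self.k + 1)
```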

In theoretical studies on the substantiation of stochastic quasi-gradient methods conducted at the V.M. Glushkov Institute of Cybernetics (see, e.g., the generalizing work [2]), the step sizes $\rho_s$ are assumed to depend on the process trajectory $(x^0, \ldots, x^s)$. In application studies, beginning in 1967, heuristic adaptive procedures were used to control the step sizes in the algorithm (1). At each iteration an unbiased estimate $z_s$ of the objective function value $f(x^s)$ is assumed to be known; denote by $\bar{f}_s$ an averaging of these estimates over the recent iterations. The value $\bar{f}_s$ may be used for dialogue or program step size control [2] (see also [19]). For example, the step size may be chosen according to the rule

$$\rho_{s+1} = \begin{cases} \rho_s / 2 & \text{if } |\bar{f}_{s-k} - \bar{f}_s| \le \delta, \\ \rho_s & \text{otherwise}. \end{cases}$$
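This halving rule is straightforward to implement once the averaged estimates $\bar{f}_s$ are stored; a sketch (the window $k$ and the threshold $\delta$ are tuning constants, and the list-based bookkeeping is our choice):

```python
def halve_if_stalled(rho, f_bar, k, delta):
    """Heuristic rule above: halve the step when the averaged objective
    estimates changed by at most delta over the last k iterations."""
    if len(f_bar) > k and abs(f_bar[-1 - k] - f_bar[-1]) <= delta:
        return rho / 2.0
    return rho
```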

A.M. Gupal and F. Mirzoakhmedov [20] suggested changing the step size $\rho_s$ according to the norm of the vectors $v^s$, which are convex combinations of the previous stochastic quasi-gradients $\xi^i$, $i = 0, 1, \ldots, s$.

For stochastic problems of quadratic programming, step size controls which develop H. Kesten's scheme were suggested and justified by G. Pflug [21].

The failing of the above-listed step size controls, except for the dialogue ones, consists in the high dependence of the efficiency of the algorithm upon the value of the initial step size $\rho_0$, since the step size can only decrease during the iteration process. In the scheme suggested and substantiated by the author in [22], [23] the step size can not only decrease but also increase. In the next section it is shown that this rule is the result of using a stochastic quasi-gradient algorithm to control this parameter.

2. USE OF STOCHASTIC QUASI-GRADIENT ALGORITHMS FOR STOCHASTIC ALGORITHM PARAMETER CONTROLS

Parameter control in stochastic algorithms is usually difficult because of the absence of objective function values, since only statistical estimates of these values are available. This circumstance makes impossible, for example, the realization of an efficient procedure of search for the function minimum along some chosen direction. The suggested approach consists in using gradient algorithms for the parameter controls. To use such procedures there is no need for additional computations of the objective function or its gradients.

2.1. Step Size Control for Stochastic Quasi-Gradient Algorithm [22], [23]

When constructing an adaptive step size control for the algorithm (1) we assume that the algorithm trajectory belongs to the interior of the admissible domain and $b^s = 0$, $s = 0, 1, \ldots$, i.e. $E_s \xi^s \in \partial f(x^s)$. In the algorithm (1) it is natural to take the step size $\rho_s$ as the point of minimum of the function $\Phi_s(\rho)$ with respect to $\rho$, where

$$\Phi_s(\rho) = E_s f(x^s - \rho \xi^s).$$

Usually it is difficult to calculate the values of the function $\Phi_s(\rho)$. Let us differentiate the function $f(x^s - \rho \xi^s)$ with respect to $\rho$ at the point $\rho_s$ (in the generalized sense):

$$\frac{\partial}{\partial \rho} f(x^s - \rho \xi^s)\Big|_{\rho = \rho_s} = -\langle f_x(x^{s+1}), \xi^s \rangle, \qquad f_x(x^{s+1}) \in \partial f(x^{s+1}).$$

Since

$$E_s \langle \xi^{s+1}, \xi^s \rangle = E_s \langle f_x(x^{s+1}), \xi^s \rangle,$$

then $-E_s \langle \xi^{s+1}, \xi^s \rangle \in \partial \Phi_s(\rho_s)$.

To modify the step size $\rho_s$ we may use the following gradient procedure:

$$\rho_{s+1} = \rho_s\, a_s^{\,-\langle \xi^{s+1}, \Delta x^{s+1} \rangle}, \qquad a_s > 1,$$

where $\Delta x^{s+1} = x^{s+1} - x^s$ (note that $\langle \xi^{s+1}, \Delta x^{s+1} \rangle = -\rho_s \langle \xi^{s+1}, \xi^s \rangle$ in the interior of $X$). To facilitate the proof of the algorithm convergence we rewrite the last relation in the form

$$\rho_{s+1} = \min\{\bar{\rho},\; \rho_s\, a_s^{\,-\langle \xi^{s+1}, \Delta x^{s+1} \rangle - \delta \rho_s}\}; \tag{2}$$

the constant $\bar{\rho}$ bounds the step size from above.

Note that the exponent is supplemented with the additional term $-\delta \rho_s$, which decreases the step size $\rho_s$. Here $\delta$ is some sufficiently small constant; therefore the additional decrease of the step size occurs in the case when the value $\langle \xi^{s+1}, \xi^s \rangle$ is sufficiently close to zero and is comparable with the value $\delta$. The formula (2) may be interpreted in the following manner. The value $\langle \xi^{s+1}, \Delta x^{s+1} \rangle$ gives some information about whether the minimum of the function $\Phi_s(\rho)$ with respect to $\rho$ was passed at the iteration or not. If $-\langle \xi^{s+1}, \Delta x^{s+1} \rangle > 0$, then with high probability the minimum was not passed and the step size increases due to the term $-\langle \xi^{s+1}, \Delta x^{s+1} \rangle$; otherwise the step size decreases. In [23] the Cesàro convergence of the algorithm (1), (2) was proved, i.e., the convergence to the optimal set with probability 1 of the sequence

$$\bar{x}^s = \left( \sum_{i=0}^{s} \rho_i \right)^{-1} \sum_{i=0}^{s} \rho_i x^i,$$

which is a convex combination of the trajectory points [16]. In this paper the convergence of the algorithm (1), (2) with probability 1 is proved, and an asymptotic estimate of the rate of the algorithm convergence for the case of a twice differentiable function $f(x)$ is obtained.

2.2. Stochastic Quasi-Gradient Algorithm with Variable Metric

Algorithms of the type (1) have a low practical rate of convergence when the function is ill-conditioned. This forces one to use more complex variants of the algorithms. In non-linear programming a wide spectrum of algorithms, called variable metric algorithms [24], has been developed which successfully operate in such situations. In the given case, however, the direct use of these algorithms is impossible because only statistical estimates of the values of the objective function and of its gradients are known.

Let it be required to minimize a convex, possibly non-smooth function $f(x)$ specified on the space $R^n$. Stochastic quasi-gradients of the function are known.

Approximations of the extremum point are constructed by the rule

$$x^{s+1} = x^s - \rho_s H^s \xi^s, \quad s = 0, 1, \ldots, \tag{3}$$

where $H^s$, $s = 0, 1, \ldots$ is a sequence of random square $n \times n$ matrices; $\xi^s$, $s = 0, 1, \ldots$ is a sequence of stochastic quasi-gradients, i.e. $E_s \xi^s \in \partial f(x^s)$, where the $\sigma$-field $\mathcal{F}_s$ is specified by the random values $(x^0, \xi^0, H^0, x^1, \xi^1, H^1, \ldots, x^s)$. The matrix $H^s$ is modified at each step in the following manner:

$$H^{s+1} = Q^s H^s,$$

where $Q^s$, $s = 0, 1, \ldots$ is a sequence of square matrices.

Denote $\Phi_s(Q) = f(x^s - \rho_s Q H^s \xi^s)$. The matrix $Q^s$ at the iteration $s$ can be chosen from the condition of the minimum of the following function of $n \times n$ variables:

$$\bar{\Phi}_s(Q) = E_s \Phi_s(Q).$$

However this problem is by complexity equivalent to the source problem.

We calculate the stochastic quasi-gradient of the function $\bar{\Phi}_s(Q)$ at the point $I$, where $I$ is the identity matrix. We differentiate the function $\Phi_s(Q)$ in a generalized sense with respect to $Q$ at the point $I$:

$$-\rho_s f_x(x^{s+1})\, \xi^{s,T} H^{s,T}, \qquad f_x(x^{s+1}) \in \partial f(x^{s+1});$$

here $H^{s,T}$ and $\xi^{s,T}$ denote the transposed matrix $H^s$ and the transposed vector column $\xi^s$.

Since

$$E_{s+1} \xi^{s+1} = f_x(x^{s+1}),$$

then

$$-\rho_s\, E_s [\xi^{s+1} \xi^{s,T} H^{s,T}] \in \partial \bar{\Phi}_s(I).$$

As the matrix $Q^s$ we may take the matrix which is formed when executing one step from the point $I$ in the direction of the stochastic quasi-gradient $\xi^{s+1} \xi^{s,T} H^{s,T}$, i.e.

$$Q^s = I + \gamma_s\, \xi^{s+1} \xi^{s,T} H^{s,T},$$

where $\gamma_s$ is a positive scalar. Then we may rewrite the formula for the matrix modification in the following manner:

$$H^{s+1} = Q^s H^s = H^s + \gamma_s\, \xi^{s+1} (H^s \xi^s)^T H^s.$$

Note that the last formula is close to the method with dilatation of the space along the generalized gradient suggested by N.Z. Shor [26, p. 92] with V.A. Skokov's modification.
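A sketch of the variable-metric recursion as reconstructed above: the point update (3) followed by the rank-one modification of $H^s$. Given the state of the source, the placement of the transposes is our reading rather than a verified transcription.

```python
import numpy as np

def variable_metric_step(x, H, xi, rho, gamma, quasi_grad):
    """One step of the variable-metric scheme of section 2.2:
        x^{s+1} = x^s - rho_s * H^s xi^s                  (scheme (3))
        Q^s     = I + gamma_s * xi^{s+1} (H^s xi^s)^T
        H^{s+1} = Q^s H^s
    Returns the new point, the new matrix and the quasi-gradient
    sampled at the new point."""
    x_new = x - rho * H @ xi
    xi_new = quasi_grad(x_new)
    Q = np.eye(len(x)) + gamma * np.outer(xi_new, H @ xi)
    return x_new, Q @ H, xi_new
```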

2.3. Algorithm with the Averaging of Stochastic Quasi-Gradients

The algorithm with the averaging of stochastic quasi-gradients was considered by many authors [27], [12], [28-30]. The advantage of this algorithm consists in the ease of its realization and also in a higher efficiency for ill-conditioned functions as compared to the stochastic quasi-gradient algorithm (1). The drawback of this algorithm consists in its "inertial motion", i.e., the direction of movement changes weakly from iteration to iteration; therefore for simple functions the algorithm may be less efficient than the algorithm (1).

Using the suggested approach, the authors of [31] developed recurrence schemes for the modification of two parameters of the algorithm: the step size and the aggregation coefficient. This made it possible to increase the practical rate of the algorithm convergence far from the extremum, leaving unchanged the local rate of convergence of the classical methods. The algorithm convergence was proved and the asymptotic rate of convergence was given.

Let us consider the problem of minimization of a convex, possibly non-smooth function $f(x)$ on the convex compact subset $X$ of the space $R^n$. Stochastic gradients of the function $f(x)$ are known.

The algorithm generates sequences of random directions $d^s$ and points $x^s \in R^n$, $s = 0, 1, \ldots$ according to the formulas

$$d^s = i_s \left[ (1 - \gamma_s)\, d^{s-1} + \gamma_s \xi^s \right], \tag{4}$$

$$x^{s+1} = \Pi_X(x^s - \rho_s d^s). \tag{5}$$

Here $\xi^s$ is a stochastic quasi-gradient, i.e., $E_s \xi^s \in \partial f(x^s)$, where the $\sigma$-field $\mathcal{F}_s$ is generated by the random values $(x^0, \xi^0, \ldots, x^s, \xi^s)$; $\rho_s$ is a positive step size; $\gamma_s$ is a positive aggregation coefficient; $i_s \in \{0, 1\}$ is a reset coefficient; $t \in (0, +\infty)$ is a constant.

At the initial point $x^0 \in X$ we assume $d^{-1} = 0$. From (4) it follows that the direction $d^s$ is a convex combination of the zero vector and the stochastic subgradients $\xi^i$, $i = 0, \ldots, s$.

The reset coefficient $i_s$ is defined by comparison with some fixed threshold $\bar{\delta}$.

To construct the recurrence relations for the modification of the parameters $\rho_s$, $\gamma_s$ we assume that the algorithm operates in the interior of the admissible domain $X$ and $t = +\infty$. For the given $x^s$, $d^s$ and $\lambda \ge 0$ we consider a regularized function $\varphi_s(\rho, \gamma)$ which characterizes the quality of the chosen parameters $\rho$ and $\gamma$; the point entering it is defined by the relations (4), (5). The values $\rho_s$ and $\gamma_s$ may be chosen from the condition of the minimum of the function $\Phi_s(\rho, \gamma) = E_{s-1} \varphi_s(\rho, \gamma)$. However, the program realization of such a search at each iteration is difficult. We therefore differentiate, in the generalized sense, the function $\varphi_s(\rho, \gamma)$ at the point $(\rho_s, \gamma_s)$; after simple transformations (with $\Delta x^s = x^s - x^{s-1}$) and suitable designations $u_s$, $v_s$, we find that the vector $(u_s, v_s)$ may be interpreted as a stochastic quasi-gradient of the function $\Phi_s$ at the point $(\rho_s, \gamma_s)$ with accuracy up to positive multipliers.

Similarly to the relation (2), the vector $(u_s, v_s)$ was used in [31] for the construction of the rule for calculating the step size:

$$\rho_0 > 0, \qquad \rho_s = \min\{\bar{\rho},\; \rho_{s-1} \exp[\min(q,\; -\alpha u_s - j_s \delta \rho_{s-1})]\}, \tag{6}$$

where $\bar{\rho} > 0$, $q > 0$, $\alpha > 0$, $\lambda \ge 0$ are fixed parameters; the coefficient $j_s$ in the last relation equals 1 when $|u_s|$ does not exceed a small positive value, and 0 otherwise.

The formula for the calculation of the aggregation coefficients $\gamma_s$ is written similarly:

$$\gamma_s = \min\{\bar{\gamma},\; \gamma_{s-1} \exp[\min(q,\; -\alpha v_s - j_s \lambda \gamma_{s-1})]\}. \tag{7}$$

In the relations (6), (7) the additional members $j_s \delta \rho_s$, $j_s \lambda \gamma_s$ increase the rate of decrease of the coefficients $\rho_s$, $\gamma_s$ in the case when the values $u_s$ and $v_s$ are close to zero.
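A sketch of the aggregation scheme (4)-(5). Since the exact expressions for $u_s$, $v_s$ and the reset test are lost in the source, the sketch keeps $\rho$ and $\gamma$ constant in place of the adaptive rules (6)-(7) and resets the direction when its norm falls below a threshold; these simplifications and all names are ours.

```python
import numpy as np

def averaged_sqg(x0, quasi_grad, project, rho=0.1, gamma=0.3,
                 reset_threshold=1e-3, n_iters=200):
    """Algorithm with averaging of stochastic quasi-gradients:
    the direction d^s is a convex combination of the previous direction
    and the fresh quasi-gradient (relation (4)); the point is moved by a
    projected step along d^s (relation (5))."""
    x = np.asarray(x0, dtype=float)
    d = np.zeros_like(x)                   # d^{-1} = 0 at the initial point
    for _ in range(n_iters):
        xi = quasi_grad(x)
        d = (1.0 - gamma) * d + gamma * xi     # relation (4)
        if np.linalg.norm(d) < reset_threshold:
            d = np.zeros_like(x)               # reset: i_s = 0
        x = project(x - rho * d)               # relation (5)
    return x
```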

The considered approach may be applied to other algorithms, stochastic and non-stochastic, in which parameter control is required. The author suggested and theoretically substantiated adaptive step size controls for the stochastic Arrow-Hurwicz algorithm of search for saddle points of convex-concave functions [32] and for the gradient algorithm of search for Nash equilibrium in non-cooperative many-person games [33].

3. CONVERGENCE AND RATE OF CONVERGENCE OF THE STOCHASTIC QUASI-GRADIENT ALGORITHM

We will prove the convergence with probability 1 of the stochastic quasi-gradient algorithm (1) with the step size control (2) to the extremal set of the convex function and estimate its asymptotic rate of convergence for twice differentiable functions.

We show that the sequence of step sizes chosen according to (2) satisfies the classical conditions

$$\sum_{s=0}^{\infty} \rho_s = \infty, \qquad \sum_{s=0}^{\infty} \rho_s^2 < \infty \quad \text{a.s.}$$

Note that the classical convergence theorems for the algorithm (1) with the step size control (2) cannot be used (see, e.g., [2]), because it is usually assumed that the step size $\rho_s$ depends only on the random vectors $(x^0, \ldots, x^s)$; in the given case this condition is broken, since the step size $\rho_s$ depends also on $\xi^{s+1}$.

Let us consider the problem of minimization of the convex function $f(x)$ on a convex compact subset $X \subset R^n$. We use the stochastic quasi-gradient algorithm (1) with the step size control (2) and $a_s = a > 1$, $s = 0, 1, \ldots$ for the search for the optimum of the function $f(x)$, i.e.

$$x^{s+1} = \Pi_X(x^s - \rho_s \xi^s), \quad s = 0, 1, \ldots, \tag{8}$$

$$\rho_{s+1} = \min\{\bar{\rho},\; \rho_s\, a^{\,-\langle \xi^{s+1}, \Delta x^{s+1} \rangle - \delta \rho_s}\}, \quad \rho_0 > 0, \tag{9}$$

where the $\sigma$-field $\mathcal{F}_s$ is specified by the random values $(x^0, \xi^0, \ldots, x^s, \xi^s)$. Denote

$$\varphi^s = \xi^s - f_x(x^s) - b^s, \qquad \text{where } f_x(x^s) \in \partial f(x^s),\ E_s \xi^s = f_x(x^s) + b^s.$$

3.1. The Algorithm Convergence

We will prove the convergence with probability 1 of the process (8), (9) to the extremal set of the function $f(x)$ on the admissible set $X$.

THEOREM 1. Let $f(x)$ be a convex (possibly non-smooth) function specified on the convex compact subset $X$ of the space $R^n$, let the function $f(x)$ satisfy the Lipschitz condition on $X$, and let the conditions (10)-(13) on the quasi-gradients $\xi^s$ (in particular $\|\xi^s\| \le C_1$ a.s., $s = 0, 1, \ldots$), on the bias terms $b^s$ and on the constant $\delta$ hold. Then with probability 1 all accumulation points of the sequence $\{x^s\}$ specified by the relations (8), (9) belong to the set

$$X^* = \{ x \in X : f(x) = \min_{y \in X} f(y) \}.$$

PROOF. Prior to proving the principal assertion of the theorem, let us establish several properties of the step sizes $\rho_s$, $s = 0, 1, \ldots$.

LEMMA 1 [23].

$$\sum_{s=0}^{\infty} \rho_s = \infty \quad \text{a.s.}$$

PROOF. Suppose the opposite, i.e., that there exists a constant $K$ for which the probability of the event

$$A = \left\{ \omega : \sum_{s=0}^{\infty} \rho_s \le K \right\}$$

is more than zero, $P(A) > 0$. From the relations (9), (11) it is easy to obtain

$$\rho_{s+1} \ge \rho_s\, a^{-C_3 \rho_s}, \qquad C_3 = (C_1^2 + \delta) > 0.$$

For elementary events $\omega \in A$ the last estimate yields

$$\rho_{s+1} \ge \rho_0\, a^{-C_3 K} > 0.$$

The obtained lower bound for the step size $\rho_{s+1}$ is inconsistent with the relation $\sum_{s=0}^{\infty} \rho_s \le K$. The lemma is proved.

LEMMA 2 [23].

$$\sum_{s=0}^{\infty} \rho_s^2 < \infty \quad \text{a.s.}$$

PROOF. Taking into account (8), (9), (11), the definition of the gradient of the convex function and the properties of the projection operation, we obtain a recurrent estimate for $\rho_{s+1}$; according to (13), the constant $\delta$ dominates the conditional expectation of the stochastic terms in its exponent. Since $0 \le \rho_s \le \bar{\rho}$, for $\bar{a} = (1 - \delta \bar{\rho} \ln a) > 0$ the relation

$$a^{-\delta \rho_s} \ge 1 - \delta \rho_s \ln a \ge \bar{a}$$

is fulfilled. By substituting this estimate into the previous inequality, introducing the designation $\tilde{\rho}_s = \rho_s \bar{a}$ and taking the mathematical expectation of both sides, we obtain an estimate with the constant $C_4 = \inf_{x \in X} f(x)$. Since $E f(x^s) - C_4 \ge 0$ and $f$ is bounded on the compact set $X$, the last estimate results in the assertion of the lemma.

COROLLARY [23]. $\rho_s \to 0$ a.s.

LEMMA 3 [23]. $\rho_{s+1} / \rho_s \to 1$ a.s.

PROOF. Since $\rho_s \to 0$ a.s., from the relation (9) it follows that for almost each elementary event $\omega \in \Omega$ there may be found a number $s(\omega)$ such that for $s > s(\omega)$ the bound $\bar{\rho}$ is not attained and

$$\rho_{s+1} = \rho_s\, a^{\,-\langle \xi^{s+1}, \Delta x^{s+1} \rangle - \delta \rho_s}. \tag{15}$$

Since $\rho_s \to 0$ a.s., then

$$-\langle \xi^{s+1}, \Delta x^{s+1} \rangle - \delta \rho_s \to 0 \quad \text{a.s.},$$

and the assertion of the lemma results from the relation (15).

To prove the main assertion of the theorem we use the conditions of convergence of stochastic programming algorithms [13] with insignificant modifications.

THEOREM 2. Let the random process $\{x^s(\omega)\}$ and a set of solutions $X^* \subset R^n$ be such that:

C1. Almost surely, for all subsequences $\{x^{n_k}(\omega)\}$ such that $\lim_{k \to \infty} x^{n_k}(\omega) \in X^*$ the relation

$$\lim_{k \to \infty} \| x^{n_k + 1}(\omega) - x^{n_k}(\omega) \| = 0$$

holds.

C2. There exists a compact set $X$ such that $x^s(\omega) \in X$ for all $s = 0, 1, \ldots$.

C3. If there exists an event $B \subset \Omega$ such that $P(B) > 0$ and for all $\omega \in B$ there exists a subsequence $\{x^{s_k}(\omega)\}$, $x^{s_k}(\omega) \to x'(\omega) \notin X^*$, then for almost all $\omega \in B$ there exists $\varepsilon_0(\omega) > 0$ such that for all $k$ and $0 < \varepsilon \le \varepsilon_0(\omega)$

$$m_k(\omega) = \inf\{ m : \| x^m(\omega) - x'(\omega) \| > \varepsilon,\ m > s_k \} < \infty. \tag{16}$$

C4. There exists a continuous function $W(x)$ such that for $\omega \in B$

$$\lim_{k \to \infty} W(x^{m_k(\omega)}) < W(x'(\omega)).$$

C5. The function $W(x)$ takes on $X^*$ at most a countable number of values.

Then the limit of any convergent subsequence belongs to the set $X^*$ almost for all $\omega$.

We assume

$$W(x) = \min_{y \in X^*} \| x - y \|^2, \qquad U_\varepsilon(x) = \{ y \in R^n : \| y - x \| \le \varepsilon \},$$

$$f^* = \min_{x \in X} f(x), \qquad x_s^* = \arg\min_{y \in X^*} \| x^s - y \|, \qquad \varphi^s = \xi^s - f_x(x^s) - b^s.$$

We test the satisfiability of the conditions C1-C5.

The condition C1 is satisfied obviously, since by virtue of the corollary of lemma 2 we have $\rho_s \to 0$ a.s., and hence $\| x^{s+1} - x^s \| \le \rho_s \| \xi^s \| \to 0$ a.s.

The condition C2 is satisfied by virtue of the conditions of theorem 1.

The condition C5 is satisfied since the function $W(x)$ is constant on the set $X^*$.

We test the condition C3. Let the probability of the event $B$ be more than zero, $\omega \in B$ and $x^{s_k}(\omega) \to x'(\omega) \notin X^*$. For brevity we will omit the argument $\omega$. If the condition (16) for the given $\omega$ is not satisfied, then there may be found an arbitrarily small $\varepsilon$ and a number $s_k$ such that $x^s \in U_\varepsilon(x')$ for all $s > s_k$. For $s > s_k$, from (8) and the non-expansiveness of the projection we obtain the one-step estimate

$$W(x^{s+1}) \le W(x^s) + 2 \rho_s \langle \xi^s,\, x_s^* - x^s \rangle + C_1^2 \rho_s^2, \tag{16'}$$

whence, summing from $s_k$ to $s$ and separating $\xi^i = f_x(x^i) + b^i + \varphi^i$,

$$W(x^{s+1}) \le W(x^{s_k}) + 2 \sum_{i=s_k}^{s} \rho_i \langle f_x(x^i) + b^i,\, x_i^* - x^i \rangle + 2 \sum_{i=s_k}^{s} \rho_i \langle \varphi^i,\, x_i^* - x^i \rangle + C_1^2 \sum_{i=s_k}^{s} \rho_i^2. \tag{17}$$

By virtue of the conditions (10), (11) the scalar product $\langle \varphi^s, x_s^* - x^s \rangle$ is bounded; therefore, taking into account (12) and lemma 3, the terms containing $b^i$ are dominated. From lemma 2 it follows that the martingale series

$$\sum_{i=s_k}^{\infty} \rho_i \langle \varphi^i,\, x_i^* - x^i \rangle$$

is convergent a.s.; therefore the last two sums in (17) are bounded by a constant $C$ which depends upon $\omega$. Consequently, for sufficiently large numbers $s$ and small $\varepsilon$, since $x' \notin X^*$ implies $f(x^i) - f^* \ge c_\varepsilon > 0$ on $U_\varepsilon(x')$ and $\langle f_x(x^i), x_i^* - x^i \rangle \le f^* - f(x^i)$ by convexity, the estimate

$$W(x^{s+1}) \le W(x^{s_k}) - 2 c_\varepsilon \sum_{i=s_k}^{s} \rho_i + C$$

holds. Taking into account lemma 1 and passing to the limit $s \to \infty$, we obtain a contradiction with the boundedness of $W(x)$ on the closed bounded set $U_\varepsilon(x')$. The contradiction proves C3.

We will prove C4. Since $\| \rho_s \xi^s \| \to 0$ a.s., by the construction of the index $m_k$, beginning with some number $k$,

$$\sum_{i=s_k}^{m_k - 1} \rho_i \| \xi^i \| > \varepsilon / 2.$$

By virtue of the condition (11) this yields $\sum_{i=s_k}^{m_k - 1} \rho_i > \varepsilon / (2 C_1)$. Substituting the last estimate into (17) for $s = m_k - 1$, we have

$$W(x^{m_k}) \le W(x^{s_k}) - c_\varepsilon\, \varepsilon / C_1 + o(1).$$

Since $W(x^{s_k}) \to W(x')$, for sufficiently large $k$

$$W(x^{m_k}) < W(x').$$

The last inequality proves C4.

3.2. Asymptotic Properties of Step Sizes

We now study the asymptotic properties of the sequence of step sizes $\rho_s$, $s = 0, 1, \ldots$ for the case of a twice continuously differentiable function. These results will be used for obtaining the asymptotic rate of convergence of the algorithm (8)-(9).

LEMMA 4. Let for the sequence $\{x^s\}$ specified by the relations (8)-(9) all conditions of theorem 1 be valid and $b^s = 0$, $s = 0, 1, \ldots$, and let the function $f(x)$ be twice continuously differentiable on an open set containing $X$. Then

$$\lim_{s \to \infty} (s+1)\, \rho_s = \frac{1}{\delta \ln a} \quad \text{a.s.}$$

PROOF. Denote $r_s = \log_a [(s+1) \rho_s]$, $s = 0, 1, \ldots$. According to the corollary of lemma 2, for sufficiently large numbers $s$ from the relation (9) we obtain

$$r_{s+1} = r_s + \log_a\left(1 + \frac{1}{s+1}\right) - \langle \xi^{s+1}, \Delta x^{s+1} \rangle - \delta \rho_s. \tag{18}$$

Consequently the deterministic part of the increment may be compared with $\frac{1}{s \ln a}$: it is obvious that the series

$$\sum_{s=1}^{\infty} \left[ \log_a\left(1 + \frac{1}{s}\right) - \frac{1}{s \ln a} \right]$$

is convergent. From lemma 2 the convergence of the martingale series $\sum_{s=1}^{\infty} \langle \varphi^s, \Delta x^s \rangle$ follows a.s. Since the function $f(x)$ is twice continuously differentiable, the equality

$$\langle \nabla f(x^s), \Delta x^s \rangle = f(x^{s+1}) - f(x^s) + q_s \| \Delta x^s \|^2$$

is satisfied, where $q_s$ is uniformly bounded for all $s$. The function $f(x)$ is bounded on the compact set $X$, and the series $\sum_{s=1}^{\infty} q_s \| \Delta x^s \|^2$ is convergent a.s. by virtue of lemma 2; therefore the series $\sum_{s=1}^{\infty} \langle \nabla f(x^s), \Delta x^s \rangle$ is also convergent. The relation (18) then may be rewritten as follows:

$$r_{s+1} = r_s + \frac{1}{s \ln a} \left( 1 - \delta \ln(a)\, a^{r_s} \right) + t_s,$$

where the series $\sum_{s=1}^{\infty} t_s$ is convergent a.s. The last formula is the Robbins-Monro algorithm for the solution of the equation $1 - \delta \ln(a)\, a^z = 0$. Using standard results about the convergence of stochastic approximation algorithms (see, e.g., [11]), we obtain $r_s \to \log_a (1 / (\delta \ln a))$ a.s., i.e.

$$\lim_{s \to \infty} (s+1)\, \rho_s = \frac{1}{\delta \ln a} \quad \text{a.s.}$$

Q.E.D.

3.3. Rate of Algorithm Convergence

For the case of a twice continuously differentiable function we estimate the asymptotic rate of convergence of the algorithm (8)-(9) in the non-stochastic case, i.e., for $\xi^s = \nabla f(x^s)$, $s = 0, 1, \ldots$.

THEOREM 3. Let all conditions of theorem 1 hold, $\xi^s = \nabla f(x^s)$, $s = 0, 1, \ldots$, let the function $f(x)$ be twice continuously differentiable and

$$f(x) - f(x^*) \ge B \| x - x^* \|^2, \qquad B > 0, \tag{19}$$

where $x^*$ is a unique point of minimum of the function $f(x)$ on the set $X$, and let $\ln(a)(C_1^2 + \delta)/2B < 1$. Then

$$W(x^s) = \| x^s - x^* \|^2 = O(1/s).$$

PROOF. We use the following lemma to prove the theorem.

LEMMA 5 [14]. Let there be sequences $v_s \ge 0$, $V_s \ge 0$ and $\beta_s > 0$, $s = 0, 1, \ldots$, such that

$$v_{s+1} \le v_s (1 - V_s) + V_s \beta_s, \qquad V_s \to 0, \qquad \sum_{s=0}^{\infty} V_s = \infty, \tag{20}$$

$$\lim_{s \to \infty} \frac{\beta_{s+1} - \beta_s}{V_s \beta_s} = -\bar{\delta}, \qquad \bar{\delta} < 1; \tag{21}$$

then

$$v_s \le \frac{\beta_s}{1 - \bar{\delta}} + o(\beta_s).$$

From the estimate (16') and the condition (19) we have

$$W(x^{s+1}) \le W(x^s) + 2 \rho_s (f^* - f(x^s)) + C_1^2 \rho_s^2 \le W(x^s)(1 - 2 \rho_s B) + C_1^2 \rho_s^2.$$

Denote $V_s = 2 \rho_s B$, $v_s = W(x^s)$, $\beta_s = C_1^2 \rho_s / (2B)$. The condition (20) of lemma 5 is satisfied obviously (by lemma 1 and the corollary of lemma 2). We test the condition (21). According to the corollary of lemma 2, from the relation (9) we have for sufficiently large numbers $s$

$$\frac{\beta_{s+1} - \beta_s}{V_s \beta_s} = \frac{a^{\,-\langle \xi^{s+1}, \Delta x^{s+1} \rangle - \delta \rho_s} - 1}{2 B \rho_s} = -\frac{\left( \langle \xi^{s+1}, \Delta x^{s+1} \rangle + \delta \rho_s \right) \ln a}{2 B \rho_s} + h_s \rho_s,$$

where $| h_s | \le C_5$. Consequently, since $| \langle \xi^{s+1}, \Delta x^{s+1} \rangle | \le C_1^2 \rho_s$,

$$\varlimsup_{s \to \infty} \left| \frac{\beta_{s+1} - \beta_s}{V_s \beta_s} \right| \le \frac{\ln(a)\,(C_1^2 + \delta)}{2B} < 1 \quad \text{a.s.}$$

by the condition of the theorem. The conditions of lemma 5 are thus verified; therefore

$$W(x^s) \le \frac{C_1^2 \rho_s}{2 B (1 - \bar{\delta})} + o(\rho_s) = O(1/s),$$

since $\rho_s = O(1/s)$ by lemma 4. Q.E.D.

4. ON PROGRAM REALIZATION OF THE STOCHASTIC QUASI-GRADIENT ALGORITHM

Program realization of algorithms in practice usually requires the introduction of some heuristic elements improving the algorithm operation.

Theorem 1 is proved provided that in the step size control (2) $a_s = \text{const}$, $s = 0, 1, \ldots$. This may result in a very speedy change of the step size $\rho_s$ at each iteration. In the program realization of the algorithm it is desirable to normalize the exponent in the relation (9) by some value $z_s$ which is an averaging of the value $|\langle \xi^{s+1}, \Delta x^{s+1} \rangle|$. The averaging is made by the following recurrent formula:

$$z_{s+1} = V z_s + (1 - V)\, |\langle \xi^{s+1}, \Delta x^{s+1} \rangle|, \qquad V \in [0, 1). \tag{22}$$

It is desirable to set some threshold coefficients which limit the maximal change of the step size $\rho_s$. In numerical experiments the author used the following step size rule [23]:

$$\rho_{s+1} = \begin{cases} 3 \rho_s & \text{if } \tilde{\rho}_{s+1} / \rho_s > 3, \\ \rho_s / 4 & \text{if } \tilde{\rho}_{s+1} / \rho_s < 1/4, \\ \tilde{\rho}_{s+1} & \text{otherwise}, \end{cases} \tag{23}$$

where $\tilde{\rho}_{s+1}$ is the step size proposed by the rule (9) with the normalized exponent. The recommended values of the parameters (24) are those used in the examples below: $a$ from 1.5 to 2, $V$ from 0 to 0.8, $D$ from 0.2 to 0.25, $\rho_0 = 1$.

In the relation (23) the additional reduction of the step size occurs only if the averaged value $\bar{T}_s$ is negative. The results of computation experiments show that the scheme (8), (22), (23), (24) rapidly leads to the point of the extremum if the objective function is not ill-conditioned, i.e., for non-"ravine" functions. In the case when the function $f(x)$ is very "ravine", the algorithm gets stuck "at the bottom of the ravine". This difficulty may be overcome by using more complex algorithms which employ matrices of space dilatation (3).
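The practical safeguards of this section are easy to add on top of the basic rule: the recurrent averaging (22) normalizes the exponent, and the clipping (23) limits the per-iteration change of the step. A sketch, assuming exponential smoothing for (22) and our reading of the thresholds in (23); the default constants follow the recommended ranges quoted above.

```python
import numpy as np

def practical_step_update(rho, xi_new, dx, z, a=1.5, V=0.0, D=0.25,
                          rho_max=1.0, eps=1e-12):
    """Safeguarded variant of the adaptive step rule:
    (22) z_{s+1} = V*z_s + (1-V)*|<xi^{s+1}, dx^{s+1}>| (exponent normalizer);
    (23) the proposed step may grow at most 3x and shrink at most 4x
    per iteration. Returns the new step size and the new average z."""
    inner = float(np.dot(xi_new, dx))
    z = V * z + (1.0 - V) * abs(inner)                 # recurrent averaging (22)
    proposed = min(rho_max, rho * a ** (-inner / max(z, eps) - D * rho))
    rho = min(max(proposed, rho / 4.0), 3.0 * rho)     # threshold clipping (23)
    return rho, z
```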

In practice, for such functions the scaling procedure suggested by Saridis [34] for stochastic approximation algorithms proved to be efficient. This procedure contains changes taking into account the projection operation and the adaptive step size control. In it the step along each coordinate is scaled by a factor which depends on $k(s+1)$, where $k(s+1)$, $0 \le k(s+1) \le n$, is the number of indices $i$ for which $\xi_i^{s+1} (x_i^{s+1} - x_i^s) > 0$, and $n$ is the dimension of the space to which the set $X$ belongs.

For this scheme the step size control is the same as in the previous case, i.e., (23), (24).

Note that the considered schemes have a natural criterion for the break of the iteration process. In the neighborhood of the extremum the value $\| \Delta x^{s+1} \|$ becomes small and tends to zero. Therefore for the break we may use an averaged value $\bar{q}_s$ of $\| \Delta x^{s+1} \|$ obtained by the same kind of recurrent averaging as in (22). If $\bar{q}_s \le \varepsilon_B$, the process is broken. Here $\varepsilon_B$ is some positive constant which characterizes the required precision of the solution.
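The break criterion can reuse the same recurrent averaging device; a sketch (the smoothing weight is our choice):

```python
import numpy as np

def stopping_value(q, dx, weight=0.9):
    """Averaged value of ||dx^{s+1}||; break the iteration process
    once it falls below the required precision eps_B."""
    return weight * q + (1.0 - weight) * float(np.linalg.norm(dx))

# usage: q = stopping_value(q, dx); stop when q <= eps_B
```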

We give the results of computation experiments for the scheme (8), (22), (23), (24).

EXAMPLE 1. The following problem statement arises in solving a multi-list inventory problem [23]. Let us consider a problem in which $\theta_i$ are random values uniformly distributed on the intervals $[A_i, B_i]$, $i = 1, \ldots, 5$, and the data are specified by the vectors $\alpha = (\alpha_1, \ldots, \alpha_5)$, $\beta = (\beta_1, \ldots, \beta_5)$, $A = (A_1, \ldots, A_5)$, $B = (B_1, \ldots, B_5)$.

The analytical form of the function $f(x)$ is used only for obtaining the explicit solution by one of the methods of quadratic programming:

$$f(x^*) = 98.10089, \qquad x^* = (41.88057,\ 7.00000,\ 2.48092,\ 41.27456,\ 22.33456).$$

The stochastic quasi-gradient is computed in closed form for each sampled value of $\theta$. To solve this problem the scheme (8), (22), (23), (24) was used with the parameters $a = 1.5$, $V = 0.0$, $D = 0.25$, $\rho_0 = 1$. The initial approximation is $x^0 = (0, 0, 0, 0, 0)$, $f(x^0) = 278.5$. At the 91st iteration the step size is $\rho_{91} = 0.15$; the values of the coordinates and of the function averaged over the 91st to 100th iterations were taken as the result.

Note that to obtain the final result it is desirable to average the solution approximations over the last iterations.

EXAMPLE 2. Random location equilibrium problem [7]. The calculations were performed by the author together with N. Roenko. This problem has been considered in section 1. It consists in the minimization of the function

$$f(x) = \sum_{i=1}^{n} \beta_i \int_{R^2} \| x - w \\|\, \Theta_i(dw) \to \min_{x \in R^2}.$$

The number of points to be located is $n = 30$; the probability measures $\Theta_i$, $i = 1, \ldots, n$ are bivariate normal distributions whose means and standard deviations are generated randomly in the range 0-20. The weights $\beta_i$ are also generated randomly in the range 0-10. The data are given in Table 1.

To solve the problem the scheme (8), (22), (23), (24) with the parameters $a = 2$, $V = 0.8$, $D = 0.2$, $\rho_0 = 1$ was used. The exact value of the point of extremum is $x^* = (8.36, 9.36)$. The initial approximation is $x^0 = (41, 87)$; the approximations $x^s$ were averaged over the 51st to 60th iteration and over the 191st to 200th iteration. With the initial approximation $x^0 = (54, 30)$ analogous solution approximations were obtained.

The results of the numerical experiments show that the approximations fall sufficiently quickly into a neighborhood of the solution, after which the accuracy of the approximation is practically not improved. It should be noted that this effect is connected with the asymmetry of the random number generators rather than with the choice of the step size control.

The suggested approach has some advantages as compared to [7], because to realize the computation process it is not necessary to integrate complex functions.

Table 1

x1 means:   3.02  6.07  9.77 16.26  6.12 14.80  7.24  7.52 15.91 13.57  2.08 12.70  0.16 15.78  3.95 11.89  4.68  6.    9.19 11.56 12.43 19.98 15.33 18.20  7.84  1.16  4.54 17.48 10.78  1.45
x2 means:   7.63  6.62 15.40 10.83  4.85 17.14  2.20  9.30 17.30 14.60  5.68  4.77 19.10 17.17  0.80 10.82 11.48 18.99  0.36  2.52 10.00  1.93 11.39 16.41 16.21  2.09 16.69  8.70 12.04  2.93
devs:      18.65 18.95  0.45 13.50 17.55  1.12 18.42  1.59 15.65  9.49 19.13 18.19 19.56 19.14 11.93  7.26  1.72 11.37  7.09 16.05 15.62  4.31 15.44  1.40  5.82  8.56 16.72  5.29 10.36 12.49
weights:    8.50  9.48  6.03  8.16  9.05  1.80  8.17  7.57  3.43  9.62  2.87  3.77  4.34  4.88  0.11  2.13  7.75  1.64  5.74  6.12  4.57  4.45  2.95  0.17  7.53  9.39  7.38  1.15  2.09  7.20

REFERENCES

1. Ju.M. Ermoliev, Z.V. Nekrylova. On Some Stochastic Optimization Methods. Kibernetika, 1966, No. 6 (in Russian).
2. Ju.M. Ermoliev. Methods of Stochastic Programming. Nauka, Moscow, 1976, 240 p. (in Russian).
3. Ju.M. Ermoliev. Stochastic Quasi-Gradient Methods and their Applications to Systems Optimization. Stochastics, 1983, No. 4.
4. H. Robbins, S. Monro. A Stochastic Approximation Method. Ann. Math. Statist., 1951, 22, 400-407.
5. L.A. Rastrigin. Theory of Statistical Search Methods. Nauka, 1968 (in Russian).
6. R.T. Rockafellar, R.J.-B. Wets. On the Interchange of Subdifferentiation and Conditional Expectation for Convex Functionals. Stochastics, 1982, Vol. 7, 173-182.
7. N. Katz, L. Cooper. An Always-Convergent Numerical Scheme for a Random Locational Equilibrium Problem. SIAM J. Numer. Anal., 1974, Vol. 11, No. 4, 683-692.
8. J. Kiefer, J. Wolfowitz. Stochastic Estimation of the Maximum of a Regression Function. Ann. Math. Statist., 1952, 23, 462-466.
9. J.R. Blum. Multidimensional Stochastic Approximation Methods. Ann. Math. Statist., 1954, 25, 737-744.
10. J. Sacks. Asymptotic Distribution of Stochastic Approximation Procedures. Ann. Math. Statist., 1958, 29, 373-405.
11. M.B. Nevelson, R.Z. Khasminski. Stochastic Approximation and Recursive Estimation. Nauka, Moscow, 1972 (in Russian).
12. A.M. Gupal. Stochastic Methods for Solving Non-Smooth Extremal Problems. Naukova Dumka, Kiev, 1979 (in Russian).
13. E.A. Nurminskij. Numerical Methods for Solving Deterministic and Stochastic Minimax Problems. Naukova Dumka, Kiev, 1979, 159 p. (in Russian).
14. B.T. Poljak. Convergence and Rate of Convergence of Iterative Stochastic Algorithms. I. General Case. Avtomatika i Telemekhanika, 1976, No. 12, 83-94 (in Russian).
15. Ju.M. Ermoliev, Ju.M. Kaniovskij. Asymptotic Properties of Some Methods of Stochastic Programming with Constant Step Size. Zhurn. vych. mat. i mat. fiziki, 1979, No. 2, 356-366 (in Russian).
16. A.S. Nemirovskij, D.B. Judin. Complexity of Problems and Efficiency of Optimization Methods. Nauka, Moscow, 1979, 384 p. (in Russian).
17. H.J. Kushner. Asymptotic Behavior of Stochastic Approximation and Large Deviations. Division of Appl. Math., Lefschetz Center for Dyn. Syst., Brown Univ., Providence, Rhode Island, 1983, 27 p.
18. H. Kesten. Accelerated Stochastic Approximation. Ann. Math. Statist., 1958, 29, 41-59.
19. Ju.M. Ermoliev, G. Leonardi, J. Vira. The Stochastic Quasi-Gradient Methods Applied to a Facility Location Model. Working Paper WP-81-14, 1981, Laxenburg, Austria: International Institute for Applied Systems Analysis.
20. A.M. Gupal, F. Mirzoakhmedov. On One Method of Step Size Control in Stochastic Programming Methods. Kibernetika, 1978, No. 1, 133-134 (in Russian).
21. G. Pflug. On the Determination of the Step Size in Stochastic Quasi-Gradient Methods. Working Paper, May 1983, Laxenburg, Austria: International Institute for Applied Systems Analysis, 24 p.
22. S.P. Urjas'ev. A Step Size Rule for Direct Methods of Stochastic Programming. Kibernetika (Kiev), 1980, No. 6, 96-98 (in Russian).
23. F. Mirzoakhmedov, S.P. Urjas'ev. Adaptive Step Size Control for Stochastic Optimization Algorithm. Zhurn. vych. mat. i mat. fiziki, 1983, No. 6, 1314-1325 (in Russian).
24. D. Himmelblau. Applied Nonlinear Programming. McGraw-Hill Book Company, 1972.
25. R.J.-B. Wets. Modeling and Solution Strategies for Unconstrained Stochastic Optimization Problems. Annals of Operations Research, 1984, No. 1, 3-22.
26. N.Z. Shor. Minimization Methods for Non-Differentiable Functions. Springer-Verlag, 1985, 162 p.
27. A.M. Gupal, L.T. Bazhenov. A Stochastic Analog of the Method of Conjugate Gradients. Kibernetika, 1972, 124-126 (in Russian).
28. A.P. Korostelev. On a Multi-Step Procedure of Stochastic Optimization. Avtomatika i Telemekhanika, 1981, 82-90 (in Russian).
29. H.J. Kushner, H. Huang. Asymptotic Properties of Stochastic Approximations with Constant Coefficients. SIAM Journal on Control and Optimization, 1981, 19, 87-105.
30. N.D. Chepurnoj. One Step Size Control in a Stochastic Method of Minimization of Non-Smooth Functions. Kibernetika, 1982, No. 4, 127-129 (in Russian).
31. A. Ruszczynski, W. Syski. A Method of Aggregate Stochastic Subgradients with On-Line Stepsize Rules for Convex Stochastic Programming Problems. Mathematical Programming Study (to appear).
32. S.P. Urjas'ev. Arrow-Hurwicz Algorithm with Adaptively Controlled Step Sizes. In: Operations Research and AMS, 1984, 24, 3-11 (in Russian).
33. Ju.M. Ermoliev, S.P. Urjas'ev. On the Search for Nash Equilibrium in Many-Person Games. Kibernetika, 1982, No. 3, 85-88 (in Russian).
34. G.M. Saridis. Learning Applied to Successive Approximation Algorithms. IEEE Trans. Syst. Sci. Cybern., 1970, Vol. SSC-6, 97-103.
