NOT FOR QUOTATION WITHOUT PERMISSION OF THE AUTHOR
STOCHASTIC QUASIGRADIENT METHODS AND THEIR IMPLEMENTATION

Yuri Ermoliev
Alexei Gaivoronski

July 1984 WP-84-55
Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.
INTERNATIONAL INSTITUTE FOR APPLIED SYSTEMS ANALYSIS 2361 Laxenburg, Austria
PREFACE
This paper discusses various stochastic quasigradient methods and considers their computer implementation. It is based on experience gained both at the V. Glushkov Institute of Cybernetics in Kiev and at IIASA.
The paper falls naturally into three parts. The first is concerned with problem definition and various ways of choosing the step size and step direction. A detailed description of an interactive stochastic optimization package (STO) available at IIASA forms the second part. Finally, the use of this package is demonstrated by application to three test problems that have arisen during the course of IIASA work; they include a facility location problem and a water management problem.
This work was carried out within the Adaptation and Optimization Project of the System and Decision Sciences Program.
ANDRZEJ WIERZBICKI
Chairman
System and Decision Sciences
ABSTRACT

A number of stochastic quasigradient methods are discussed from the point of view of implementation. The discussion revolves around the interactive package of stochastic optimization routines (STO) recently developed by the Adaptation and Optimization group at IIASA. (This package is based on the stochastic and nondifferentiable optimization package (NDO) developed at the V. Glushkov Institute of Cybernetics in Kiev.) The IIASA implementation is described and its use illustrated by application to three problems which have arisen in various IIASA projects.
STOCHASTIC QUASIGRADIENT METHODS AND THEIR IMPLEMENTATION

Yuri Ermoliev and Alexei Gaivoronski

1. INTRODUCTION
This paper discusses various stochastic quasigradient methods (see [1,2]) and considers their computer implementation. It is based on experience gained both at the V. Glushkov Institute of Cybernetics in Kiev and at IIASA.
We are concerned here mainly with questions of implementation, such as the best way to choose step directions and step sizes, and therefore little attention will be paid to theoretical aspects such as convergence theorems and their proofs. Readers interested in the theoretical side are referred to [1,2].
The paper is divided into five sections. After introducing the main problem in Section 1, we discuss the various ways of choosing the step size and step direction in Sections 2 and 3. A detailed description of an interactive stochastic optimization package (STO) currently available at IIASA is given in Section 4. This package represents one possible implementation of the methods described in the previous sections. Finally, Section 5 deals with the solution of some test problems using this package. These problems were brought to our attention by other IIASA projects and collaborating institutions and include a facility location problem, a water resources management problem, and the problem of choosing the parameters in a closed-loop control law for a stochastic dynamical system with delay.
We are mainly concerned with the problem

    min { F(z) = E f(z,ω) } ,  z ∈ X ,                    (1)

where z represents the variables to be chosen optimally, X is a set of constraints, and ω is a random variable belonging to some probability space (Ω, B, P). Here B is a Borel field and P is a probability measure.
There are currently two main approaches to this problem. In the first, we take the mathematical expectation in (1), which leads to multidimensional integration and involves the use of various approximation schemes [3-6]. This reduces problem (1) to a special kind of nonlinear programming problem which allows the application of deterministic optimization techniques. In this paper we concentrate on the second approach, in which we consider a very limited number of observations of the random function f(z,ω) at each iteration in order to determine the direction of the next step. The resulting errors are smoothed out until the optimization process terminates (which happens when the step size becomes sufficiently small). This approach was pioneered in [7,9].
We assume that the set X is defined in such a way that the projection operation z → Π_X(z) is comparatively inexpensive from a computational point of view, where

    Π_X(z) = arg min_{y ∈ X} || z − y || .

For instance, if X is defined by linear constraints, then projection is reduced to a quadratic programming problem which, although challenging if large scale, can nevertheless be solved in a finite number of iterations. In this case it is possible to implement a stochastic quasigradient algorithm of the following type:

    z^{s+1} = Π_X( z^s − ρ_s ξ^s ) ,  s = 1, 2, ...                    (2)
Here z^s is the current approximation of the optimal solution, ρ_s is the step size, and ξ^s is a random step direction. This step direction may, for instance, be a statistical estimate of the gradient (or subgradient in the nondifferentiable case) of the function F(z): then ξ^s is such that

    E( ξ^s | z^1, ..., z^s ) = F_z(z^s) + a_s ,                    (3)

where a_s decreases as the number of iterations increases, and the vector ξ^s is called a stochastic quasigradient of the function F(z). Usually ρ_s → 0 as s → ∞ and therefore ||z^{s+1} − z^s|| → 0 from (2). This suggests that we should take z^s as the initial point for the solution of the projection problem at iteration number s+1, thus reducing considerably the computational effort needed to solve the quadratic programming problem at each step s = 1, 2, ... .
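To make iteration (2) concrete, here is a minimal Python sketch for the simple case where X is a box, so that the projection is just componentwise clipping; the quadratic objective, the Gaussian noise model, and all constants are our own illustrative assumptions, not part of the paper.

```python
import random

def project_box(z, lower, upper):
    # Projection onto a box: componentwise clipping. For general linear
    # constraints this step would instead be a small quadratic program.
    return [min(max(v, lo), hi) for v, lo, hi in zip(z, lower, upper)]

def sqg_step(z, rho, grad_sample, lower, upper):
    # One iteration of (2): step along the sampled quasigradient, then project.
    xi = grad_sample(z)
    return project_box([zi - rho * g for zi, g in zip(z, xi)], lower, upper)

def run(n_iter=400, seed=0):
    rng = random.Random(seed)
    # Toy problem: F(z) = E[(z1 - 1)^2 + (z2 + 2)^2] plus observation noise;
    # each call returns one noisy observation of the gradient.
    def grad_sample(z):
        return [2 * (z[0] - 1) + rng.gauss(0, 0.5),
                2 * (z[1] + 2) + rng.gauss(0, 0.5)]
    lower, upper = [0.0, 0.0], [5.0, 5.0]   # constraint set X = [0,5] x [0,5]
    z = [4.0, 4.0]
    for s in range(1, n_iter + 1):
        z = sqg_step(z, 1.0 / s, grad_sample, lower, upper)
    return z
```

The unconstrained minimum sits at (1, −2), so the projected iterates should settle near the constrained solution (1, 0); note how the decreasing step size ρ_s = 1/s smooths the noise out.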
Algorithm (2)-(3) can also cope with problems with more general constraints formulated in terms of mathematical expectations,

    F_i(z) = E f_i(z,ω) ≤ 0 ,  i = 1, ..., m ,

by making use of penalty functions or the Lagrangian (for details see [1,2]).
The principal peculiarity of such methods is their nonmonotonicity, which may sometimes show itself in highly oscillatory behavior. In this case it is difficult to judge whether the algorithm has already approached a neighborhood of the optimal point or not, since exact values of the objective function are not available. The best way of dealing with such difficulties seems to be to use an interactive procedure to choose the step sizes and step directions, especially if it does not take much time to make one observation. More reasons for adopting an interactive approach and details of the implementation are given in the following sections.
Another characteristic of the algorithms described here is their pattern of convergence. Because of the probabilistic nature of the problem, their asymptotic rate of convergence is extremely slow and may be represented by

    || z^k − z* || ~ C / sqrt(k) .                    (4)

Here z* is the optimal point to which the sequence z^s converges and k is the number of observations of the random parameters ω, which in many cases is proportional to the number of iterations. In deterministic optimization a superlinear asymptotic convergence rate is generally expected; a rate such as (4) would be considered as nonconvergence. But no algorithm can do asymptotically any better than this for stochastic problem (1) in the presence of nondegenerate random disturbances, and therefore the aim is to reach some neighborhood of the solution rather than to find the precise value of the solution itself. Algorithm (2)-(3) is quite good enough for this purpose.
2. CHOICE OF STEP DIRECTION

In this section we shall discuss different ways of choosing the step direction in algorithm (2) and some closely related algorithms. We shall first discuss methods which are based on observations made at the current point z^s or in its immediate vicinity. More general ways are then presented which take into account observations made at previous points.
2.1. Gradients of the random function f(z,ω)
The simplest case arises when it is possible to obtain gradients (or subgradients in the nondifferentiable case) of the function f(z,ω) at fixed values of z and ω. In this case we can simply take

    ξ^s = f_z(z^s, ω^s) ,                    (5)

where ω^s is an observation of the random parameter ω made at step number s. If both the observation of the random parameters and the evaluation of gradients are computationally inexpensive then it is possible to take the average of some specified number N of gradient observations:

    ξ^s = (1/N) Σ_{i=1}^{N} f_z(z^s, ω^{i,s}) .                    (6)
These observations can be selected in two ways. The first is to choose the ω^{i,s} according to their probability distribution. If we do not know the form of the distribution function (as, for example, in Monte-Carlo simulation models) this is the only option. However, in this case the influence of low-probability high-cost events may not be properly taken into account. In addition, the asymptotic error of the gradient estimate is approximately proportional to 1/sqrt(N).
The second approach may be used when we know the distribution of the random parameters ω. In this case many other estimates can be derived; the use of pseudo-random numbers* in particular may lead to an asymptotic error approximately proportional to log(N)/N, which is considerably less than in the purely random case. However, more theoretical research and more computational experience are necessary before we can assess the true value of this approach. The main question here is whether the increase in the speed of convergence is sufficient to compensate for the additional computational effort required for more exact estimates of F_z(z^s).
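The 1/sqrt(N) behavior of the averaged estimate (6) is easy to observe numerically. In the following sketch the objective F(z) = E (z − ω)² with ω ~ N(0,1) is our own toy assumption; a single gradient observation is 2(z − ω), and averaging N of them shrinks the standard error by roughly sqrt(N).

```python
import math
import random

def averaged_gradient(z, n_samples, rng):
    # Average of N single-observation gradients, as in (6), for the toy
    # objective F(z) = E (z - w)^2 with w ~ N(0, 1).
    return sum(2 * (z - rng.gauss(0, 1)) for _ in range(n_samples)) / n_samples

def std_error(n_samples, n_trials=2000, seed=1):
    # Empirical standard deviation of the averaged estimate over many trials.
    rng = random.Random(seed)
    trials = [averaged_gradient(1.0, n_samples, rng) for _ in range(n_trials)]
    mean = sum(trials) / n_trials
    return math.sqrt(sum((t - mean) ** 2 for t in trials) / n_trials)
```

A single observation has standard deviation 2 here, and sixteen averaged observations bring it down to about 0.5, i.e. by the expected factor of sqrt(16).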
Unfortunately, our theoretical knowledge concerning the asymptotic behavior of processes of type (2) tells us little about the optimal number of samples, even for relatively well-studied cases. For instance, what would be the optimal number N of observations for the case in which the function F(z) is differentiable and there are no constraints? In this case we can establish both asymptotic normality and the value of the asymptotic variance. If, additionally, ρ_s ~ C/s, then the total number of observations required to obtain a given asymptotic variance is the same for all N << s. If s ρ_s → ∞ then the wait-and-see approach is asymptotically superior as long as N << s.

*A concept which arose from the use of quasi-Monte-Carlo techniques in multidimensional integration [9].
However, there is strong evidence that in constrained and/or nondifferentiable cases the value of N should be chosen adaptively. A very simple example provides some insight into the problem. Suppose that z ∈ R^1, X = [a, ∞), F(z) = z, f_z(z^s, ω^s) = 1 + ω^s, where the ω^s, s = 1, 2, ... are independent random variables with zero mean. The obvious solution of this problem is z = a. Suppose for simplicity that ρ_s =
ρ. This will not alter our argument greatly because ρ_s usually changes very slowly for large s. In this case method (2),(5) will be of the form

    z^{s+1} = max { a , z^s − ρ(1 + ω^s) } .                    (7)

Method (2),(6) requires us to choose a step size N times greater than ρ; otherwise its performance would be inferior to that of method (2),(5) (unless the initial point is in the immediate vicinity of the minimum). Method (2),(6) then becomes

    z^{s+1} = max { a , z^s − ρ Σ_{i=1}^{N} (1 + ω^{i,s}) } .                    (8)
In order to compare the two methods we shall let the index count observations rather than iterations and renumber the observations ω^{i,s} accordingly as ω^i. The process

    y^{i+1} = y^i − ρ(1 + ω^i) + x^i ,

    x^i = 0                           if i ≠ lN for l = 1, 2, ... or a < y^i − ρ(1 + ω^i) ,
    x^i = a − y^i + ρ(1 + ω^i)        otherwise ,

has the property that y^{sN} coincides with the iterates of method (8), and therefore it is sufficient to compare y^k with z^k for k = lN,
where z^k now denotes the iterates of process (7) driven by the same renumbered observations ω^k. Suppose that z^0 = y^0 ≠ a. Then if t_z = min { k : z^k = a } represents the time at which process z^k first encounters the optimal point and t_y = min { lN : y^{lN} = a } represents the time of the corresponding encounter of process y^k with the optimal point, it is clear that t_z ≤ t_y, because from (7) and (8) we have that y^k = z^k for k < t_z. This means that algorithm (2),(5) will get from some remote initial point to the vicinity of the optimal point faster than algorithm (2),(6) with N >
1. Now let us take z^0 = y^0 = a. Then (7) and (8) imply that x^k = 0 for k < N, while z^k may differ from a. Therefore in this case z^N > y^N, and the performance of algorithm (2),(6) with N > 1 becomes superior to that of algorithm (2),(5) after reaching the vicinity of the optimal point.

This simple example demonstrates several important properties of constrained stochastic optimization problems, although more work is necessary before we can make any firm theoretical recommendations concerning the choice of the number of samples N. Above all, an appropriate definition of the rate of convergence is needed; recent results by Kushner [10] may be useful in this regard.

A rather general adaptive way of changing the number N would be to begin with a small value of N for the first few iterations (N = 1, for example), and increase N if additional tests show that the current point is in the vicinity of the optimum. The following averaging procedure has been shown to be useful in tests of this type:

    v^s = (1 − a_s) v^{s−1} + a_s ξ^s ,                    (9)

where
ξ^s is defined by (5) or (6). It can be shown (see [1,2]) that ||v^s − F_z(z^s)|| → 0 under rather general conditions, which include ρ_s / a_s → 0. The decision as to whether to change N may then be based on the value of
r_s = || z^s − Π_X( z^s − v^s ) || . One possibility is to estimate r_s and its empirical variance σ_s at the same time, and to choose N such that σ_s ≤ β r_s, where the value of β is set before beginning the iterations. In practice it is sufficient to consider a constant β, β ~ 0.01-0.05, where the greater the randomness, the smaller the value of β. Our empirical recommendation for the initial value is of the order of 0.1 max_{z_1, z_2 ∈ X} || z_1 − z_2 || .
This method can be used to increase the number of samples per iteration automatically. Another possibility is to alter the value of N interactively; this is one of the options implemented in the interactive package STO, which has recently been developed at IIASA. Numerical experiments conducted with this package show that in problems where f_z(z,ω) has a high variance, choosing a value of N greater than one can bring about considerable improvements in performance.
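One way such an adaptive rule might look in code is sketched below: the projection residual serves as the "distance to optimum" signal, and the sample size is doubled while the noise level dominates it. The box-shaped constraint set, the doubling rule and the threshold constant are our own illustrative assumptions, not the actual logic of the STO package.

```python
def projection_residual(z, v, lower, upper):
    # ||z - Pi_X(z - v)|| for a box-shaped X: near a constrained optimum this
    # residual becomes small, which is the signal for increasing N.
    moved = [min(max(zi - vi, lo), hi)
             for zi, vi, lo, hi in zip(z, v, lower, upper)]
    return sum((zi - mi) ** 2 for zi, mi in zip(z, moved)) ** 0.5

def choose_sample_size(residual, noise_std, n_current, beta=0.05, n_max=64):
    # Hypothetical threshold rule: double N while the estimated noise level
    # of the averaged direction exceeds beta times the residual.
    while noise_std / n_current ** 0.5 > beta * residual and n_current < n_max:
        n_current *= 2
    return n_current
```

Far from the optimum the residual is large and N stays small; near the optimum the same rule drives N up to its cap. Note that at an active constraint the residual can be exactly zero even though the sampled direction is not.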
The method described above uses increasingly precise estimates of the gradient, and therefore shares some of the features of the approximation techniques developed in [3-6] for solving stochastic programming problems. All of the remarks made here concerning sampling are also valid for the other methods of choosing the step direction described below.
However, it is not always possible to use observations of the gradient f_z(z,ω) of the random function to compute a stochastic quasigradient. In many cases the analytic expression of f_z(z,ω) is not known, and even if it is, it may be difficult to create a subroutine to evaluate it, especially for large-scale problems. In this case it is necessary to use a method which relies only on observations of f(z,ω).
2.2. Finite-difference approximations

If the function F(z) is differentiable, one possibility is to use forward finite differences:

    ξ^s = Σ_{i=1}^{n} [ ( f(z^s + δ_s e_i , ω^{i,s}_1) − f(z^s , ω^{i,s}_2) ) / δ_s ] e_i ,                    (10)

or central finite differences:

    ξ^s = Σ_{i=1}^{n} [ ( f(z^s + δ_s e_i , ω^{i,s}_1) − f(z^s − δ_s e_i , ω^{i,s}_2) ) / (2 δ_s) ] e_i ,                    (11)

where the e_i are unit basis vectors from R^n.
The most important question here is the value of δ_s. In order to ensure convergence with probability one it is sufficient to take any sequence δ_s such that Σ_{s=1}^∞ ρ_s² / δ_s² < ∞. If it is possible to take ω^{i,s}_1 = ω^{i,s}_2, then any δ_s → 0 will do. However, the method may reach the vicinity of the optimal point much faster if δ_s is chosen adaptively. On the first few iterations δ_s should be large, decreasing as the current point approaches the optimal point. The main reason for this is that taking a large step δ_s when the current point is far from the solution may smooth out the randomness to some extent, and may also overcome some of the problems (such as curved valleys) caused by the erratic behavior of the deterministic function F(z). One possible way of implementing such a strategy in an unconstrained case is given below.
(i) Take a large initial value of δ_s, such as δ_1 ~ 0.1 max_{z_1, z_2 ∈ X} || z_1 − z_2 ||.

(ii) Proceed with iterations (2), where ξ^s is determined using (10) or (11). While doing this, compute an estimate of the gradient v^s from (9).
(iii) Decrease δ_s, for instance by setting δ_{s+1} = β_1 δ_s whenever the estimate v^s indicates that the current point is in the vicinity of the optimum, where the values of β_1 and β_2 should be chosen before beginning the iterative process.

It can be shown that this process converges when ω^{i,s}_1 = ω^{i,s}_2, although it will also produce a good approximation to the solution even if this requirement is not met. Estimate (9) is not the only possibility -
in fact, any of the estimates of algorithm performance given in Section 3 would do.

Another strategy is to relate changes in the finite-difference approximation step to changes in the step size. This is especially advisable if the step size is also chosen adaptively (see Section 3). In the simplest case one may fix β_1 > 0 before starting and choose δ_s = β_1 ρ_s, which, although contrary to theoretical recommendations, will nevertheless bring the current point reasonably close to the optimal point. To obtain a more precise solution it is necessary to reduce β_1 during the course of the iterations. This may be done either automatically or interactively; both of these options are currently available in the stochastic optimization package STO.

Finite-difference algorithms (10) and (11) have one major disadvantage, and this is that the stochastic quasigradient variance increases as δ_s decreases. This means that finite-difference algorithms converge more slowly than algorithms which use gradients (5). There are two ways of overcoming this problem. Firstly, if it is possible to make observations of the function f(z,ω) for
various values of z and fixed ω, it is a good idea to take the same values of ω for the differences (i.e., ω^{i,s}_1 = ω^{i,s}_2) when δ_s is small, because this reduces the variance of the estimates quite considerably. Another way of avoiding this increase in the variance is to increase the number of samples used to obtain ξ^s when approaching the optimal point, i.e., to use finite-difference analogues of (6). If there exists a γ > 0 such that N_s δ_s² > γ, where N_s is the number of samples taken at step number s, then the variance of ξ^s remains bounded. It is sometimes useful to normalize the ξ^s, especially when the variance is large.

Another disadvantage of the finite-difference approach is that it requires n + 1 evaluations of the objective function for forward differences and 2n for central differences, where n is the dimension of the vector z. This may not be acceptable in large-scale problems and in cases where function evaluation is computationally expensive. In this situation a stochastic quasigradient can be computed using some analogue of random search techniques.

2.3. Analogues of random search methods
When it is not feasible to compute n + 1 values of the objective function at each iteration, the following approach (which has something in common with the random search techniques developed for deterministic optimization problems) may be used:

    ξ^s = (1/M_s) Σ_{j=1}^{M_s} [ ( f(z^s + δ_s h_j , ω^{j,s}_1) − f(z^s , ω^{j,s}_2) ) / δ_s ] h_j .                    (12)

Here the h_j are vectors distributed uniformly on the unit sphere, M_s is the number of random points and δ_s is the step taken in the random search. The choice of M_s is determined by the computational facilities available, although it is advisable to increase M_s as δ_s decreases. This method of choosing ξ^s has much in common with finite-difference schemes, and the statements made above about the choice of δ_s in the finite-difference case also hold for (12).

2.4. Smoothing the objective function
Methods of choosing ξ^s which rely on finite-difference or random search techniques are only appropriate when the objective function F(z) is differentiable. The use of similar procedures in the nondifferentiable case would require some smoothing of the objective function. Suppose that the function F(z) is not differentiable but satisfies the Lipschitz condition, and consider the function

    F(z, r) = ∫ F(z + y) dH(y, r) ,                    (13)

where H(y, r) is a probability measure with support in a ball of radius r centered at zero. We shall assume for simplicity that H(y, r) has nonzero density inside this ball. The function F(z, r) is differentiable and F(z, r) → F(z) uniformly over every compact set as r → 0. It is now possible to minimize the nonsmooth function F(z) by computing stochastic quasigradients for the smooth functions F(z, r) and find the optimal solution of the initial problem by letting r → 0. This idea was proposed in [11] and studied further in [12]. It is not actually necessary to calculate the integral in (13) - it is sufficient to compute
ξ^s using equations (10)-(12), but at the point z^s + y^s rather than at the point z^s, where y^s is a random variable distributed according to H(y, r_s). In this case (10) becomes:

    ξ^s = Σ_{i=1}^{n} [ ( f(z^s + y^s + δ_s e_i , ω^{i,s}_1) − f(z^s + y^s , ω^{i,s}_2) ) / δ_s ] e_i .                    (14)

The most commonly used distribution H(y, r) is the uniform distribution on an n-dimensional cube of side r. If we want to have convergence with probability one we should choose r_s such that δ_s / r_s → 0 and (r_s − r_{s+1}) / ρ_s → 0. In practical computations it is also advisable to choose the smoothing parameter r_s in a similar way to δ_s, using one of the adaptive procedures discussed above.

Smoothing also has beneficial side effects in that it improves the behavior of the deterministic function F(z). In the case where F(z) may be written as the sum of two functions, one with a distinct global minimum and the other with highly oscillatory behavior, smoothing may help to overcome the influence of the oscillations, which may otherwise lead the process to local minima far from the global one. Thus it can sometimes be useful to smooth the objective function even if we can obtain a gradient f_z(z,ω). In this case we should take a large value for the smoothing parameter r_s on the first few iterations, decreasing it as we approach the optimal point. The points at which r_s should be decreased may be determined using the values of additional estimates, such as those described below in Section 3 or given by (9). Everything said about the choice of the finite-difference parameter δ_s is also valid for the choice of the
smoothing parameter, including the connection between the step size and the smoothing parameter and the possibility of interactive control of r_s. The only difference is that a decrease in r_s does not lead to an increase in the variance of ξ^s, and that it is preferable to have δ_s < r_s. This is also reflected in the stochastic optimization software developed at IIASA.

All of the methods discussed so far use only the information available at the current point or in its immediate vicinity. We shall now discuss some more general ways of choosing the step direction which take into account the information obtained at previous points.
2.5. Averaging over preceding iterations
The definition of a stochastic quasigradient given in (3) allows us to use information obtained at previous points as the iterations proceed; this information may sometimes lead to faster convergence to the vicinity of the optimal point. One possible way of using such information is to average the stochastic quasigradients obtained in preceding iterations via a procedure such as (9). The v^s obtained in this way may then be used in method (2). This is another way of smoothing out randomness and neutralizing such characteristics of deterministic behavior as curved valleys and oscillations. Methods of this type may be viewed as stochastic analogues of conjugate gradient methods and were first proposed in [13]. We can choose ξ^s according to any of (5), (6), (10), (11), (12), or (14). Since v^s → F_z(z^s) under rather general conditions (see [1,2]), method (9) can be considered as an alternative to method (6) for deriving precise estimates of the gradient F_z(z). This method has an advantage over (6) in that it provides a natural way of using rough estimates of F_z(z^s) on the first few iterations and then gradually increasing the accuracy as the current point approaches the optimal point. In this case (9) can be incorporated in the adaptive procedures used to choose the smoothing parameter and the step in the finite-difference approximation.

However, it is not necessary to always take a_s → 0, because we have convergence for any 0 ≤ a_s ≤ 1. Sometimes it is even advantageous to take a_s = a = constant, because in this case more emphasis is placed on information obtained in recent iterations. In general, the greater the randomness, the smaller the value of a that should be taken. Another averaging technique is given by

    v^s = (1/M_s) Σ_{i=s−M_s+1}^{s} ξ^i ,                    (15)

where M_s is the size of the memory, which may be fixed.
2.6. Using second-order information

There is strong evidence that in some cases setting

    z^{s+1} = Π_X( z^s − ρ_s H_s ξ^s )                    (16)

may bring about considerable improvements in performance. Here ξ^s can be chosen in any of the ways discussed above. The matrix H_s should be positive definite and take into account both the second-order behavior of the function F(z) and the structure of the random part of the problem. One possible way of obtaining second-order information is to use analogues of quasi-Newton methods to update the matrix H_s. To implement this approach, which was proposed by Wets in [3], it is necessary to have ||v^s − F_z(z^s)|| → 0.

3. CHOICE OF STEP SIZE

The simplest way of choosing the step-size sequence in (2) is to do it before starting the iterative process. Convergence theory suggests that any series with the properties

    ρ_s ≥ 0 ,   Σ_{s=1}^∞ ρ_s = ∞ ,   Σ_{s=1}^∞ ρ_s² < ∞                    (17)
can be used as a sequence of step sizes. In addition, it may be necessary to take into account relations between the step size and such things as the smoothing parameter or the step in a finite-difference approximation. Relations of this type have been briefly described in the preceding sections. In most cases the choice ρ_s ~ C/s, which obviously satisfies (17), provides the best possible asymptotic rate of convergence. However, since we are mainly concerned with reaching the vicinity of the solution, rule (17) is of limited use because a wide variety of sequences can be modified to satisfy it. The other disadvantage of choosing the step-size sequence in advance is that this approach does not make any use of the valuable information which accumulates during solution. These "programmed" methods thus perform relatively badly in the majority of cases.
The best strategy therefore seems to be to choose the step size using an interactive method. It is assumed that the user can monitor the progress of the optimization process and can intervene to change the value of the step size or other parameters. This decision should be based on the behavior of the estimates F̂(z^s) of the current value of the objective function. The estimates may be very rough and are generally calculated using only one observation per iteration, as in the following example:

    F̂^{s+1} = (1 − b_s) F̂^s + b_s f(z^s, ω^s) .                    (18)

It appears that although the observations f(z^s, ω^s) may vary greatly, the F̂^s display much more regular behavior. Monitoring the behavior of some components of the vector z^s in addition to the F̂^s also seems to be useful. One possible implementation of the interactive approach may proceed along the following lines:

(i) The user first chooses the value of the step size and keeps it constant for a number of iterations (usually 10-20). During this period the values of the estimate F̂^s and some of the components of the vector z^s are displayed, possibly with some additional information.
(ii) The user decides on a new value for the step size using the available information. Three different cases may occur:

- The current step size is too large. In this case both the values of the estimate F̂^s and the values of the monitored components of z^s exhibit random jumps. It is necessary to decrease the step size.

- The current step size is just right. In this case the estimates decrease steadily and some of the monitored components of the current vector z^s also exhibit regular behavior (steadily decrease or increase). This means that the user may keep the step size constant until oscillations occur in the estimate and/or in the components of the current vector z^s.

- The current step size is too small. In this case the estimate F̂^s will begin to change slowly, or simply fluctuate, after the first few iterations, while the change in z^s is negligible. It is necessary to increase the step size.

(iii) Continue with the iterations, periodically performing step (ii), until changes in the step size no longer result in any distinct trend in either the function estimate or the current vector z^s, which will oscillate around some point. This will indicate that the current point is close to the solution.
This method of choosing the step size requires an experienced user, but we have found that the necessary skills are quickly developed by trial and error. The main reasons for adopting an interactive approach may be summarized as follows:

- Interactive methods make the best use of the information which accumulates during the optimization process.

- Because the precise value of the objective function is not available, it is impossible to use the rules for changing the step size developed in deterministic optimization (e.g., line searches).

- Stochastic effects make it extremely difficult to define formally when the step size is "too big" or "too small"; theoretical research has not thrown any light on this problem.

The main disadvantage of the interactive approach is that much of the user's time is wasted if it takes the computer a long time to make one observation f(z^s, ω^s). For this reason a great effort has been made to develop automatic adaptive ways of choosing the step size, in which the value of the step size is chosen on the basis of information obtained at all or some of the previous points z^i, i = 1,...,s. Methods of this type are considered in [14-20]. The approach described in the following sections involves the estimation of some measures of algorithm performance which we denote by φ_i(Z^s, u^s), where Z^s represents the whole sequence {z^1, z^2, ..., z^s} and u^s the set of parameters used in the estimate. In general, algorithm performance measures are attempts to formalize the notions of "oscillatory behavior" and "regular behavior" used in interactive step-size regulation, and possess one or more of the following properties:

- the algorithm performance measure is quite large when the algorithm exhibits distinct regular behavior, i.e., when the estimates of the function value decrease or the components of the current vector z^s show a distinct trend;

- the algorithm performance measure becomes small and even changes its sign if the estimates of the current function value stop improving or if the current point starts to oscillate chaotically;

- the algorithm performance measure is large far from the solution and small in the immediate vicinity of the optimal point.
Automatic adaptive methods for choosing the step size begin with some reasonably large value of the step size, which is kept constant as long as the value of the algorithm performance measure remains high, and then decreases when the performance measure becomes less than some prescribed value. The behavior of the algorithm usually becomes regular again after a decrease in the step size, and the value of the performance measure increases; after a number of iterations oscillations set in and the value of the performance measure once again decreases. This is a sign that it is time to decrease the step size. A rather general convergence result concerning such adaptive piecewise-linear methods of changing the step size is given in [18]. However, in many cases it is difficult to determine how close the current point is to the optimal point using only one such measure; a more reliable decision can be made using several of the measures described below. Unfortunately, it is not possible to come to any general conclusions as to which performance measure is the "best" for all stochastic optimization problems. Moreover, both the values of the parameters used to estimate the performance measure and the value of the performance measure at which the step size should be decreased are different for different problems. Therefore if we fix these parameters once and for all we may achieve the same poor performance as if we had chosen the whole sequence of step sizes prior to the optimization process. Thus, it is necessary to tune the parameters of automatic adaptive methods to different classes of problems, and the interactive approach can be very useful here. An experienced user would have little difficulty in using the values of the performance measures to determine the correct points at which to change the step size, and in learning what type of performance measure behavior requires an increase or a decrease in the step size. The interactive approach is of particular use if one iteration is not very time-consuming and there are a number of similar problems to be solved. In this case the user can identify the most valuable measures of performance in the first few runs, fix their parameters and incorporate this knowledge in automatic adaptive step-size selection methods for the remaining problems.

Although interactive methods usually provide the quickest means of reaching the solution, they cannot always be implemented, and in this case automatic adaptive methods prove to be very useful. The stochastic optimization package STO developed at IIASA and the Kiev stochastic and nondifferentiable optimization package NDO both give the user the choice between automatic adaptive methods and interactive methods of determining the step size. Below we describe some particular measures of algorithm performance and methods of choosing the step size.
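The general automatic scheme just described (keep the step size constant while the performance measure stays high, cut it when the measure drops below a prescribed value) can be sketched as follows. This is a Python sketch with a hypothetical interface: step_fn stands in for one iteration of the method, perf_fn for one of the performance measures discussed in the following sections, and the defaults are illustrative rather than values from the paper.

```python
def run_with_adaptive_step(step_fn, perf_fn, rho0, n_iter=200,
                           every=10, alpha1=0.7, alpha2=0.01):
    """Skeleton of automatic adaptive step-size control: every `every`
    iterations a performance measure is evaluated, and the step size is
    cut by the factor alpha1 whenever the measure falls below alpha2."""
    rho = rho0
    history = []
    for s in range(n_iter):
        step_fn(rho)                      # one iteration z^{s+1} = z^s - rho*xi^s
        if (s + 1) % every == 0:
            phi = perf_fn()               # current performance measure
            if phi < alpha2:
                rho *= alpha1             # behavior became oscillatory: shrink
        history.append(rho)
    return rho, history
```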
The main indicators used to evaluate the performance of an algorithm are estimates of such things as the value of the objective function and its gradient. The averaging procedure (9) may be used to estimate the value of the gradient, as described earlier in this paper. The main advantage of this procedure is that it allows us to obtain estimates of the mean values of the random variables without extensive sampling at each iteration, since a very limited number of observations (usually only one) is made at each iteration. This estimate, although poor at the beginning, becomes more and more accurate as the iterations proceed. One example of such an estimate is (18), which is a special case of the more general formula (19). Any observation of the function value with the property (20) can be used instead of f(z^s, ω^s) in (19), where d_s → 0. For example, (6) would do. In order to get lim_{s→∞} |F̄^s − F(z^s)| = 0 it is necessary to have ρ_s/γ_s → 0. However, estimate (18) assigns all observations of function values the same weight. This sometimes leads to considerable bias in the estimate for all the iterations the user can afford to run. Therefore for practical purposes it is sometimes more useful to adopt procedures of the type described in Section 2 for the estimation of gradients. These include estimate (19) with fixed γ_s = γ, where γ ≈ 0.01-0.05, and the method in which the average is taken over the preceding M_s iterations, as in (21). Although these estimates do not converge asymptotically to F(z^s), they place more emphasis on observations made at recent points.

All of the estimates described here may also be used in an interactive mode to determine the step size, as described above. In addition, the values of the parameters used to determine the step size may also be chosen interactively. For example, the values of the parameters in these estimates can be made to depend on the observed behavior of the estimates themselves.
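The three function-value estimates discussed above can be sketched side by side: the equal-weight average of all observations as in (18), the exponential average (19) with fixed γ, and the moving average over the last M observations as in (21). This is a Python sketch; the class name and the parameter defaults are illustrative, not taken from the paper.

```python
from collections import deque

class FunctionEstimates:
    """Three ways of estimating F(z^s) from single noisy observations
    f(z^s, w^s): an equal-weight running mean (all observations weighted
    equally), an exponential average with fixed gamma, and a moving
    average over the last M observations."""
    def __init__(self, gamma=0.05, M=20):
        self.n = 0
        self.mean = 0.0          # equal-weight average of all observations
        self.expo = None         # exponential average with fixed gamma
        self.gamma = gamma
        self.window = deque(maxlen=M)
    def update(self, f_obs):
        self.n += 1
        self.mean += (f_obs - self.mean) / self.n
        self.expo = f_obs if self.expo is None else \
            (1 - self.gamma) * self.expo + self.gamma * f_obs
        self.window.append(f_obs)
        return self.mean, self.expo, sum(self.window) / len(self.window)
```

The equal-weight mean is unbiased for a fixed point but reacts slowly once the iterates move; the last two estimates trade asymptotic convergence for responsiveness to recent observations, as noted above.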
We shall now describe some automatic adaptive rules for choosing the step size. The important point as regards implementation is how to choose the initial value of the step size ρ_0. We suggest that the value of a stochastic quasigradient ξ^0 should first be computed at the initial point, and that the initial value of the step size should then be chosen such that

    l ρ_0 ||ξ^0|| ≈ D ,

where l ≈ 10-20 and D is a rough estimate of the size of the domain in which we believe the optimal solution to be located. This means that it is possible to reach the vicinity of each point in this domain within the first 20 iterations or so.
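Under this reading (l steps of length ρ_0||ξ^0|| covering a region of diameter D), the initial step size could be computed as in the following sketch. The formula and the names are the assumed reconstruction described above, not taken verbatim from the paper; D and l must be supplied by the user.

```python
def initial_step(xi0, D, l=15):
    """Initial step size rho_0 chosen so that l steps of length
    rho_0 * ||xi0|| can traverse a region of diameter D.  xi0 is the
    stochastic quasigradient observed at the initial point."""
    norm = sum(g * g for g in xi0) ** 0.5
    return D / (l * norm)
```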
3.1. Ratio of function estimate to path length
Before beginning the iterations we choose the initial step size ρ_0, two positive constants α_1 and α_2, a sequence M_s, and an integer N̄. After every N̄ iterations we revise the value of the step size in the following way:

(i) Compute the quantity

    φ_1(Z^s, u^s) = (F̄^{s-M_s} − F̄^s) / Σ_{i=s-M_s+1}^{s} ||z^i − z^{i-1}||                    (22)

Here the u^s are the averaging parameters used in the estimation of F̄^s, while Z^s is again the whole sequence of points preceding z^s. The quantity in the denominator is the length of the path taken by the algorithm during the preceding M_s iterations. The function φ_1(Z^s, u^s) is another example of a measure which can be used to assess algorithm performance.

(ii) Take a new value of the step size:

    ρ_{s+1} = α_1 ρ_s   if φ_1(Z^s, u^s) < α_2 ,
    ρ_{s+1} = ρ_s       otherwise.                                                              (24)
In this method the step size is changed at most once every N̄ iterations. This is essential because the function φ_1 changes slowly, and if its value is less than α_2 at iteration number s it is likely that the same will be true at iteration number s+1. Therefore N̄ should lie in the range 5-20. This procedure can be modified in various ways, such as continuing for N̄ iterations with a fixed step size, then starting to compare values until inequality (24) is satisfied, whereupon the step size is reduced. We then wait another N̄ iterations and repeat the procedure. Recommended values of α_1 and α_2 lie within the ranges 0.5-0.9 and 0.005-0.1, respectively. The number M_s may be chosen to be constant and equal to N̄.
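A sketch of the measure φ_1 and rule (24) in Python follows. The exact formula for φ_1 is an assumption consistent with the section title and the surrounding description (decrease of the function estimate divided by path length); the parameter defaults are within the recommended ranges.

```python
def _dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def phi1(F_hist, z_hist, M):
    """Performance measure of Section 3.1 (assumed form): decrease of
    the function estimate over the last M iterations divided by the
    path length of the trajectory over the same iterations."""
    pts = z_hist[-(M + 1):]
    path = sum(_dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1))
    return (F_hist[-(M + 1)] - F_hist[-1]) / path if path > 0 else 0.0

def update_rho(rho, phi, alpha1=0.7, alpha2=0.01):
    """Rule (24): multiply the step size by alpha1 when phi < alpha2."""
    return alpha1 * rho if phi < alpha2 else rho
```

On a steadily descending trajectory φ_1 stays large and ρ is kept constant; once the iterates begin to wander, the path length grows while the estimate stops decreasing, φ_1 drops, and the step size is cut.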
If we have a number of similar problems it is very useful to make the first run in a semi-automatic mode, i.e., to intervene in the optimization process to improve the values of the parameters α_1, α_2, N̄; the new values can then be used in a fully automatic mode to solve the remaining problems.

This algorithm is by no means convergent in the traditional sense, but it outperformed traditional choices such as C/s in numerical experiments because it normally reaches the vicinity of the optimal point more quickly. However, it is possible to safeguard convergence by considering a second sequence C/s, where C is small, and switching to this sequence if the step size recommended by (24) falls below a certain value. This step size regulation was introduced in [15].

3.2. Use of gradient estimates
Take φ_2(Z^s, u^s) = ||ξ̄^s|| instead of φ_1(Z^s, u^s) in (24), where ξ̄^s is one of the gradient estimates discussed above, and the u^s represent all the parameters used, including averaging parameters and the frequency of changes in the step size.
3.3. Ratio of progress and path
The quantity ||z^{s-M_s} − z^s|| represents the progress made by the algorithm between iteration number s − M_s and iteration number s. If we keep the step size constant, the algorithm begins to oscillate chaotically after reaching some neighborhood of the optimal point. The smaller the value of the step size, the smaller the neighborhood at which this occurs, and thus the total path between iterations s and s − M_s begins to grow compared with the distance between points z^{s-M_s} and z^s. This means that the function

    φ_3(Z^s, u^s) = ||z^{s-M_s} − z^s|| / Σ_{i=s-M_s+1}^{s} ||z^i − z^{i-1}||

can be used as a performance measure in equation (24).
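This progress-to-path ratio can be sketched as follows (Python; the formula is an assumption consistent with the description above). The measure is close to 1 while the trajectory moves steadily in one direction and drops toward 0 once the iterates oscillate around a point.

```python
def _dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def phi3(z_hist, M):
    """Ratio of the progress ||z^{s-M} - z^s|| to the total path length
    over the same M iterations (assumed form of the Section 3.3 measure)."""
    pts = z_hist[-(M + 1):]
    progress = _dist(pts[0], pts[-1])
    path = sum(_dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1))
    return progress / path if path > 0 else 0.0
```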
3.4. Analogues of line search techniques
The decision as to whether (and how) to change the step size may be based on the values of the scalar product of adjacent step directions. If we have (v^{s-1}, v^s) > 0, then this may be a sign that regular behavior prevails over stochastic behavior: the function is decreasing in the step direction and the step size should be increased. Due to stochastic effects the function will very often increase rather than decrease, but in the long run the number of bad choices will be less than the number of correct decisions. Analogously, if this inequality does not hold then the step size should be decreased. The rule for changing the step size is thus basically to increase it by a factor α_2 when the scalar product is positive and to decrease it by a factor α_3 otherwise, where the values of α_1, α_2, α_3 (recommended values α_1 ≈ 0.4-0.8, 1 < α_2 < 1.3 and 0.7 ≤ α_3 < 1) should be chosen before starting the iterations. It is also advisable to have upper and lower bounds on the step size to avoid divergence. Sometimes it is convenient to normalize the vectors of step directions, i.e., ||v^s|| = 1. The lower bound may decrease as the iterations proceed. This method may also be applied to the choice of a vector step size, treating some (or all) variables or groups of variables separately. A number of different methods based on the use of scalar products of adjacent step directions to control the step size have been developed by Uriasiev [19], Pflug [16], and Ruszczynski and Syski [20].
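Interpreting the rule as multiplying the step size by α_2 (> 1) when the scalar product of adjacent step directions is positive and by α_3 (< 1) otherwise, together with the recommended bounds, gives the following sketch. The precise role of α_1 in the original rule is not stated here, so this sketch uses only the increase and decrease factors; the defaults and bound values are illustrative.

```python
def update_rho_scalar(rho, v_prev, v_cur, alpha2=1.2, alpha3=0.8,
                      rho_min=1e-6, rho_max=1.0):
    """Section 3.4 rule (assumed form): increase the step size when
    adjacent step directions agree, decrease it otherwise, and keep the
    result within [rho_min, rho_max] to avoid divergence."""
    dot = sum(a * b for a, b in zip(v_prev, v_cur))
    rho = rho * alpha2 if dot > 0 else rho * alpha3
    return min(max(rho, rho_min), rho_max)
```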
The interactive stochastic optimization package implemented at IIASA (STO) is based on the same ideas as the package for stochastic and nondifferentiable optimization developed in Kiev (NDO). It allows the user to choose between interactive and automatic modes and makes available the stochastic quasigradient methods described in Sections 2 and 3. In the interactive mode the program offers the user the opportunity to change the step parameters and the methods by which the step size and step direction are chosen during the course of the iterations. The user can also stop the iterative process and obtain a more precise estimate of the value of the objective function before continuing. The package is written in FORTRAN-77.
Before initiating the optimization process the user has to:

(i) Provide a subroutine UF which calculates the value of the function f(z,ω) for fixed z and ω and, optionally, a subroutine UG which computes the gradient f_z(z,ω) of this function; the function evaluation subroutine should be of the form:

      FUNCTION UF(N,X)
      DIMENSION X(N)
C     Calculation of f(z,w)
      RETURN
      END
Here N is the dimension of the vector of variables X. (Note that the implementation on the IIASA VAX actually requires the subroutine to be entered in lower-case letters rather than capitals.) A description of the subroutine which calculates a quasigradient is given later in this paper.

(ii) Compile these subroutines with the source code to obtain an executable module.
(iii) Provide at least one of the following additional data files:

- algorithm control file (used only in the non-interactive option)
- parameter file (used only in the interactive option)
- initial data file (should always be present)

All of these files are described in some detail later in the paper.
The optimization process can then begin. The program first asks the user a series of questions regarding the required mode (interactive or automatic), method of step size regulation, choice of step direction, etc. These questions appear on the monitor and should be answered from the keyboard or by reference to a data file. We shall represent the dialogue as follows:

    Question?    Answer

with the user's response given in italics. The first question is

    Interactive mode? reply yes or no    yes/no

To choose the interactive option the user should type in yes (or y); to select the automatic option he should answer no (or n). In the latter case the program would ask no further questions, but would read all the necessary information from the algorithm control file (which is usually numbered 2; under UNIX conventions its name is fort.2). The iterative process would then begin, terminating after 10,000 iterations if no other stopping criterion is fulfilled. The algorithm control file must contain answers to all of the following questions except those concerned either with dialogue during the iterations or with the parameter file (such questions are marked with an asterisk below). This file is given a name only for ease of reference; the important thing for the user is its number.

Assume now that the user has chosen the interactive option by answering yes to the first question. The program then asks
    parameter file? (number)

The user should respond either with the number of the file of default parameters or with the number of the file in which the current values of the algorithm parameters are stored. The file of default parameters is provided with the program and has the name fort.12 (under UNIX conventions); thus, to refer the program to the default file the user should answer 12. The purpose of this file is to help the user to set the values of algorithm parameters in the ensuing dialogue and also to store such improved values as may be discovered by the user through trial and error. If the user assigns the algorithm parameters any values other than those in the default file, the new values become the default values in subsequent runs of the program. This file is optional.
The program then asks

    read parameter file? reply yes or no    yes/no

The answer yes implies that the file specified in the previous question exists, and that default parameter values are stored in this file. In this case, when asking the user about parameter values, the program will read the default option in the parameter file and reproduce it on the screen together with the question. If the user accepts this default value he should respond with 0 (zero); otherwise he should enter his own value, which will become the new default value.

The answer no means that no default values are available at the moment. In this case the program will form a new default file (labeled with the number given as an answer to the previous question); its contents will be based on the user's answers to future questions. This new default file, once formed, can be used in subsequent runs.
The next question is

    number of variables? (number)

to which the user should respond with the dimension of the vector of variables z. He is then asked

    initial data file? (number)

and should reply with the number of the initial data file. This file should contain the following elements (in exactly this order):

- The initial point, which should be a sequence of numbers separated by commas or other delimiters.
- Any additional data required by subroutines UF or UG if such data exists and the user chooses to put it in the initial data file (optional).
- Information about the constraints (described in more detail below).

The program then asks

    step size regulation?    is

Here is is a positive integer from the set {1,2,3,4,6,7}, where the different values of is correspond to different ways of choosing the step size. (The integer 5 is reserved for an option currently under development.)
1   Adaptive automatic step size regulation (24) based on algorithm performance function (22) and function estimate (18).

2   Manual step size regulation based on algorithm performance function (22) and function estimate (18).

3   Adaptive automatic step size regulation (24) using algorithm performance measure (22) and a function estimate based on a finite number of previous observations (21).

4   Manual step size regulation based on the same estimates of algorithm performance as for is = 3.

6   Automatic step size regulation (24) using algorithm performance measure (22) and function estimate (19) with fixed γ_s.

7   Manual step size regulation based on the same estimates of algorithm performance as for is = 6.

The difference between adaptive automatic and manual step size regulation (see is = 1, 2) is that in the first case the step size is chosen automatically, although the user may terminate the iterations at specified points and continue with another step size regulation, while in the second case the user changes the value of the step size himself. Both step size regulations are based on the same estimates of function value and algorithm performance.

The next question is
    step direction? (5 figures)    id1 id2 id3 id4 id5

The user has to respond with five figures which specify various ways of choosing the step direction, e.g., 11111. We shall refer to these figures as id1, id2, id3, id4 and id5. The subroutine which estimates the step direction makes some number of initial observations at each step; these are then averaged in some way to obtain the vector v̄^s, and the final step direction v^s is calculated using both v̄^s and the values of v^i for i < s.

The value of id1 specifies the nature of the initial observations:

id1  Definition
1   A direct observation of a stochastic quasigradient is available and the user has to specify a subroutine UG to calculate it:

      SUBROUTINE UG(N,X,G)
      DIMENSION X(N),G(N)
C     Calculation of a stochastic quasigradient
      RETURN
      END

where G(N) is an observation of a stochastic quasigradient.

2   Central finite-difference approximation of the gradient as in (11).
3   The v̄^s are calculated using random search techniques (12).

4   Forward finite-difference approximation of the initial observations as in (10).

5   Central finite-difference approximation of the gradient as in (11). All observations of the function used in one observation of v̄^s are made with the same values of the random parameters ω.

6   The v̄^s are calculated using random search techniques (12). All observations of the function used in one observation of v̄^s are made with the same values of the random parameters ω.

7   Forward finite-difference approximation of the initial observations as in (10). All observations of the function used in one observation of v̄^s are made with the same values of the random parameters ω.
Note that for id1 = 5, 6, 7 all observations of the function used in one observation of v̄^s are made with the same values of the random parameters ω. In this case the user should write a function UF which supports this feature as follows:

      FUNCTION UF(N,X)
      DIMENSION X(N)
      COMMON/OMEG/LO,MO
C     If LO = 1 and MO = 1 then obtain new values of the random factors w and set MO = 0.
C     Make an observation of the function at point z.
      RETURN
      END
The second figure id2 determines the point at which observations are made:

id2  Definition

1   The initial direction is calculated at the current point z^s.

2   The initial direction is calculated at a point chosen randomly from among those in the neighborhood of the current point z^s.

The value of id3 defines the way in which the step in a finite-difference or random search approximation of v̄^s is chosen:

id3  Definition

1   The approximation step is fixed. The observations of the objective function at point z^s originally used to obtain gradient observations are not used to update the estimate of the function employed for step size regulation.