NOT FOR QUOTATION WITHOUT PERMISSION OF THE AUTHOR
STOCHASTIC QUASIGRADIENT METHODS AND THEIR IMPLEMENTATION

Yuri Ermoliev
Alexei Gaivoronski

July 1984 WP-84-55
Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.
INTERNATIONAL INSTITUTE FOR APPLIED SYSTEMS ANALYSIS 2361 Laxenburg, Austria
PREFACE
This paper discusses various stochastic quasigradient methods and considers their computer implementation. It is based on experience gained both at the V. Glushkov Institute of Cybernetics in Kiev and at IIASA.
The paper falls naturally into three parts. The first is concerned with problem definition and various ways of choosing the step size and step direction. A detailed description of an interactive stochastic optimization package (STO) available at IIASA forms the second part. Finally, the use of this package is demonstrated by application to three test problems that have arisen during the course of IIASA work; they include a facility location problem and a water management problem.
This work was carried out within the Adaptation and Optimization Project of the System and Decision Sciences Program.
ANDRZEJ WIERZBICKI
Chairman
System and Decision Sciences
ABSTRACT

A number of stochastic quasigradient methods are discussed from the point of view of implementation. The discussion revolves around the interactive package of stochastic optimization routines (STO) recently developed by the Adaptation and Optimization group at IIASA. (This package is based on the stochastic and nondifferentiable optimization package (NDO) developed at the V. Glushkov Institute of Cybernetics in Kiev.) The IIASA implementation is described and its use illustrated by application to three problems which have arisen in various IIASA projects.
STOCHASTIC QUASIGRADIENT METHODS AND THEIR IMPLEMENTATION

Yuri Ermoliev and Alexei Gaivoronski

1. INTRODUCTION
This paper discusses various stochastic quasigradient methods (see [1,2]) and considers their computer implementation. It is based on experience gained both at the V. Glushkov Institute of Cybernetics in Kiev and at IIASA.
We are concerned here mainly with questions of implementation, such as the best way to choose step directions and step sizes, and therefore little attention will be paid to theoretical aspects such as convergence theorems and their proofs. Readers interested in the theoretical side are referred to [1,2].
The paper is divided into five sections. After introducing the main problem in Section 1, we discuss the various ways of choosing the step size and step direction in Sections 2 and 3. A detailed description of an interactive stochastic optimization package (STO) currently available at IIASA is given in Section 4. This package represents one possible implementation of the methods described in the previous sections. Finally, Section 5 deals with the solution of some test problems using this package. These problems were brought to our attention by other IIASA projects and collaborating institutions and include a facility location problem, a water resources management problem, and the problem of choosing the parameters in a closed-loop control law for a stochastic dynamical system with delay.
We are mainly concerned with the problem

    min { F(z) = E f(z,ω) } ,  z ∈ X ,                    (1)

where z represents the variables to be chosen optimally, X is a set of constraints, and ω is a random variable belonging to some probability space (Ω, B, P). Here B is a Borel field and P is a probability measure.
There are currently two main approaches to this problem. In the first, we take the mathematical expectation in (1), which leads to multidimensional integration and involves the use of various approximation schemes [3-6]. This reduces problem (1) to a special kind of nonlinear programming problem which allows the application of deterministic optimization techniques. In this paper we concentrate on the second approach, in which we consider a very limited number of observations of the random function f(z,ω) at each iteration in order to determine the direction of the next step. The resulting errors are smoothed out until the optimization process terminates (which happens when the step size becomes sufficiently small). This approach was pioneered in [7,9].
We assume that the set X is defined in such a way that the projection operation z → Π_X(z) is comparatively inexpensive from a computational point of view, where

    Π_X(z) = arg min_{y ∈ X} || z − y || .

For instance, if X is defined by linear constraints, then projection is reduced to a quadratic programming problem which, although challenging if large scale, can nevertheless be solved in a finite number of iterations. In this case it is possible to implement a stochastic quasigradient algorithm of the following type:

    z^{s+1} = Π_X( z^s − ρ_s ξ^s ) ,  s = 1, 2, ...                    (2)
Here z^s is the current approximation of the optimal solution, ρ_s is the step size, and ξ^s is a random step direction. This step direction may, for instance, be a statistical estimate of the gradient (or subgradient in the nondifferentiable case) of the function F(z): then ξ^s is such that

    E( ξ^s | z^1, ..., z^s ) = F_z(z^s) + a_s ,                    (3)

where a_s decreases as the number of iterations increases, and the vector ξ^s is called a stochastic quasigradient of the function F(z). Usually ρ_s → 0 as s → ∞ and therefore ||z^{s+1} − z^s|| → 0 from (2). This suggests that we should take z^s as the initial point for the solution of the projection problem at iteration number s+1, thus reducing considerably the computational effort needed to solve the quadratic programming problem at each step s = 1, 2, ... .
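To make iteration (2) concrete, here is a minimal Python sketch for the simple case where X is a box, so that the projection is just componentwise clipping; the quadratic objective, the Gaussian noise model, and all constants are our own illustrative assumptions, not part of the paper.

```python
import random

def project_box(z, lower, upper):
    # Projection onto a box: componentwise clipping. For general linear
    # constraints this step would instead be a small quadratic program.
    return [min(max(v, lo), hi) for v, lo, hi in zip(z, lower, upper)]

def sqg_step(z, rho, grad_sample, lower, upper):
    # One iteration of (2): step along the sampled quasigradient, then project.
    xi = grad_sample(z)
    return project_box([zi - rho * g for zi, g in zip(z, xi)], lower, upper)

def run(n_iter=400, seed=0):
    rng = random.Random(seed)
    # Toy problem: F(z) = E[(z1 - 1)^2 + (z2 + 2)^2] plus observation noise;
    # each call returns one noisy observation of the gradient.
    def grad_sample(z):
        return [2 * (z[0] - 1) + rng.gauss(0, 0.5),
                2 * (z[1] + 2) + rng.gauss(0, 0.5)]
    lower, upper = [0.0, 0.0], [5.0, 5.0]   # constraint set X = [0,5] x [0,5]
    z = [4.0, 4.0]
    for s in range(1, n_iter + 1):
        z = sqg_step(z, 1.0 / s, grad_sample, lower, upper)
    return z
```

The unconstrained minimum sits at (1, −2), so the projected iterates should settle near the constrained solution (1, 0); note how the decreasing step size ρ_s = 1/s smooths the noise out.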
Algorithm (2)-(3) can also cope with problems with more general constraints formulated in terms of mathematical expectations,

    F_i(z) = E f_i(z,ω) ≤ 0 ,  i = 1, ..., m ,

by making use of penalty functions or the Lagrangian (for details see [1,2]).
The principal peculiarity of such methods is their nonmonotonicity, which may sometimes show itself in highly oscillatory behavior. In this case it is difficult to judge whether the algorithm has already approached a neighborhood of the optimal point or not, since exact values of the objective function are not available. The best way of dealing with such difficulties seems to be to use an interactive procedure to choose the step sizes and step directions, especially if it does not take much time to make one observation. More reasons for adopting an interactive approach and details of the implementation are given in the following sections.
Another characteristic of the algorithms described here is their pattern of convergence. Because of the probabilistic nature of the problem, their asymptotic rate of convergence is extremely slow and may be represented by

    || z^k − z* || ~ C / sqrt(k) .                    (4)

Here z* is the optimal point to which the sequence z^s converges and k is the number of observations of the random parameters ω, which in many cases is proportional to the number of iterations. In deterministic optimization a superlinear asymptotic convergence rate is generally expected; a rate such as (4) would be considered as nonconvergence. But no algorithm can do asymptotically any better than this for stochastic problem (1) in the presence of nondegenerate random disturbances, and therefore the aim is to reach some neighborhood of the solution rather than to find the precise value of the solution itself. Algorithm (2)-(3) is quite good enough for this purpose.
2. CHOICE OF STEP DIRECTION

In this section we shall discuss different ways of choosing the step direction in algorithm (2) and some closely related algorithms. We shall first discuss methods which are based on observations made at the current point z^s or in its immediate vicinity. More general ways are then presented which take into account observations made at previous points.
2.1. Gradients of the random function f(z,ω)
The simplest case arises when it is possible to obtain gradients (or subgradients in the nondifferentiable case) of the function f(z,ω) at fixed values of z and ω. In this case we can simply take

    ξ^s = f_z(z^s, ω^s) ,                    (5)

where ω^s is an observation of the random parameter ω made at step number s. If both the observation of the random parameters and the evaluation of gradients are computationally inexpensive then it is possible to take the average of some specified number N of gradient observations:

    ξ^s = (1/N) Σ_{i=1}^{N} f_z(z^s, ω^{i,s}) .                    (6)
These observations can be selected in two ways. The first is to choose the ω^{i,s} according to their probability distribution. If we do not know the form of the distribution function (as, for example, in Monte-Carlo simulation models) this is the only option. However, in this case the influence of low-probability high-cost events may not be properly taken into account. In addition, the asymptotic error of the gradient estimate is approximately proportional to 1/sqrt(N).
The second approach may be used when we know the distribution of the random parameters ω. In this case many other estimates can be derived; the use of pseudo-random numbers* in particular may lead to an asymptotic error approximately proportional to log(N)/N, which is considerably less than in the purely random case. However, more theoretical research and more computational experience are necessary before we can assess the true value of this approach. The main question here is whether the increase in the speed of convergence is sufficient to compensate for the additional computational effort required for more exact estimates of F_z(z^s).
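The 1/sqrt(N) behavior of the averaged estimate (6) is easy to observe numerically. In the following sketch the objective F(z) = E (z − ω)² with ω ~ N(0,1) is our own toy assumption; a single gradient observation is 2(z − ω), and averaging N of them shrinks the standard error by roughly sqrt(N).

```python
import math
import random

def averaged_gradient(z, n_samples, rng):
    # Average of N single-observation gradients, as in (6), for the toy
    # objective F(z) = E (z - w)^2 with w ~ N(0, 1).
    return sum(2 * (z - rng.gauss(0, 1)) for _ in range(n_samples)) / n_samples

def std_error(n_samples, n_trials=2000, seed=1):
    # Empirical standard deviation of the averaged estimate over many trials.
    rng = random.Random(seed)
    trials = [averaged_gradient(1.0, n_samples, rng) for _ in range(n_trials)]
    mean = sum(trials) / n_trials
    return math.sqrt(sum((t - mean) ** 2 for t in trials) / n_trials)
```

A single observation has standard deviation 2 here, and sixteen averaged observations bring it down to about 0.5, i.e. by the expected factor of sqrt(16).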
Unfortunately, our theoretical knowledge concerning the asymptotic behavior of processes of type (2) tells us little about the optimal number of samples, even for relatively well-studied cases. For instance, what would be the optimal number N of observations for the case in which the function F(z) is differentiable and there are no constraints? In this case we can establish both asymptotic normality and the value of the asymptotic variance. If, additionally, ρ_s ~ C/s, then the total number of observations required to obtain a given asymptotic variance is the same for all N << s. If s ρ_s → ∞ then the wait-and-see approach is asymptotically superior as long as N << s.

*A concept which arose from the use of quasi-Monte-Carlo techniques in multidimensional integration [9].
However, there is strong evidence that in constrained and/or nondifferentiable cases the value of N should be chosen adaptively. A very simple example provides some insight into the problem. Suppose that z ∈ R^1, X = [a, ∞), F(z) = z, f_z(z^s, ω^s) = 1 + ω^s, where the ω^s, s = 1, 2, ... are independent random variables with zero mean. The obvious solution of this problem is z = a. Suppose for simplicity that ρ_s =
ρ. This will not alter our argument greatly because ρ_s usually changes very slowly for large s. In this case method (2),(5) will be of the form

    z^{s+1} = max { a , z^s − ρ(1 + ω^s) } .                    (7)

Method (2),(6) requires us to choose a step size N times greater than ρ; otherwise its performance would be inferior to that of method (2),(5) (unless the initial point is in the immediate vicinity of the minimum). Method (2),(6) then becomes

    z^{s+1} = max { a , z^s − ρ Σ_{i=1}^{N} (1 + ω^{i,s}) } .                    (8)
In order to compare the two methods we shall let the index count observations rather than iterations and renumber the observations ω^{i,s} accordingly as ω^i. The process

    y^{i+1} = y^i − ρ(1 + ω^i) + x^i ,

    x^i = 0                           if i ≠ lN for l = 1, 2, ... or a < y^i − ρ(1 + ω^i) ,
    x^i = a − y^i + ρ(1 + ω^i)        otherwise ,

has the property that y^{sN} coincides with the iterates of method (8), and therefore it is sufficient to compare y^k with z^k for k = lN,
where z^k now denotes the iterates of process (7) driven by the same renumbered observations ω^k. Suppose that z^0 = y^0 ≠ a. Then if t_z = min { k : z^k = a } represents the time at which process z^k first encounters the optimal point and t_y = min { lN : y^{lN} = a } represents the time of the corresponding encounter of process y^k with the optimal point, it is clear that t_z ≤ t_y, because from (7) and (8) we have that y^k = z^k for k < t_z. This means that algorithm (2),(5) will get from some remote initial point to the vicinity of the optimal point faster than algorithm (2),(6) with N >
1. Now let us take z^0 = y^0 = a. Then (7) and (8) imply that x^k = 0 for k < N, while z^k may differ from a. Therefore in this case z^N > y^N, and the performance of algorithm (2),(6) with N > 1 becomes superior to that of algorithm (2),(5) after reaching the vicinity of the optimal point.

This simple example demonstrates several important properties of constrained stochastic optimization problems, although more work is necessary before we can make any firm theoretical recommendations concerning the choice of the number of samples N. Above all, an appropriate definition of the rate of convergence is needed; recent results by Kushner [10] may be useful in this regard.

A rather general adaptive way of changing the number N would be to begin with a small value of N for the first few iterations (N = 1, for example), and increase N if additional tests show that the current point is in the vicinity of the optimum. The following averaging procedure has been shown to be useful in tests of this type:

    v^s = (1 − a_s) v^{s−1} + a_s ξ^s ,                    (9)

where
ξ^s is defined by (5) or (6). It can be shown (see [1,2]) that ||v^s − F_z(z^s)|| → 0 under rather general conditions, which include ρ_s / a_s → 0. The decision as to whether to change N may then be based on the value of
r_s = || z^s − Π_X( z^s − v^s ) || . One possibility is to estimate r_s and its empirical variance σ_s at the same time, and to choose N such that σ_s ≤ β r_s, where the value of β is set before beginning the iterations. In practice it is sufficient to consider a constant β, β ~ 0.01-0.05, where the greater the randomness, the smaller the value of β. Our empirical recommendation for the initial value is of the order of 0.1 max_{z_1, z_2 ∈ X} || z_1 − z_2 || .
This method can be used to increase the number of samples per iteration automatically. Another possibility is to alter the value of N interactively; this is one of the options implemented in the interactive package STO, which has recently been developed at IIASA. Numerical experiments conducted with this package show that in problems where f_z(z,ω) has a high variance, choosing a value of N greater than one can bring about considerable improvements in performance.
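One way such an adaptive rule might look in code is sketched below: the projection residual serves as the "distance to optimum" signal, and the sample size is doubled while the noise level dominates it. The box-shaped constraint set, the doubling rule and the threshold constant are our own illustrative assumptions, not the actual logic of the STO package.

```python
def projection_residual(z, v, lower, upper):
    # ||z - Pi_X(z - v)|| for a box-shaped X: near a constrained optimum this
    # residual becomes small, which is the signal for increasing N.
    moved = [min(max(zi - vi, lo), hi)
             for zi, vi, lo, hi in zip(z, v, lower, upper)]
    return sum((zi - mi) ** 2 for zi, mi in zip(z, moved)) ** 0.5

def choose_sample_size(residual, noise_std, n_current, beta=0.05, n_max=64):
    # Hypothetical threshold rule: double N while the estimated noise level
    # of the averaged direction exceeds beta times the residual.
    while noise_std / n_current ** 0.5 > beta * residual and n_current < n_max:
        n_current *= 2
    return n_current
```

Far from the optimum the residual is large and N stays small; near the optimum the same rule drives N up to its cap. Note that at an active constraint the residual can be exactly zero even though the sampled direction is not.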
The method described above uses increasingly precise estimates of the gradient, and therefore shares some of the features of the approximation techniques developed in [3-6] for solving stochastic programming problems. All of the remarks made here concerning sampling are also valid for the other methods of choosing the step direction described below.
However, it is not always possible to use observations of the gradient f_z(z,ω) of the random function to compute a stochastic quasigradient. In many cases the analytic expression of f_z(z,ω) is not known, and even if it is, it may be difficult to create a subroutine to evaluate it, especially for large-scale problems. In this case it is necessary to use a method which relies only on observations of f(z,ω).
2.2. Finite-difference approximations

If the function F(z) is differentiable, one possibility is to use forward finite differences:

    ξ^s = Σ_{i=1}^{n} [ ( f(z^s + δ_s e_i , ω^{i,s}_1) − f(z^s , ω^{i,s}_2) ) / δ_s ] e_i ,                    (10)

or central finite differences:

    ξ^s = Σ_{i=1}^{n} [ ( f(z^s + δ_s e_i , ω^{i,s}_1) − f(z^s − δ_s e_i , ω^{i,s}_2) ) / (2 δ_s) ] e_i ,                    (11)

where the e_i are unit basis vectors from R^n.
The most important question here is the value of δ_s. In order to ensure convergence with probability one it is sufficient to take any sequence δ_s such that Σ_{s=1}^∞ ρ_s² / δ_s² < ∞. If it is possible to take ω^{i,s}_1 = ω^{i,s}_2, then any δ_s → 0 will do. However, the method may reach the vicinity of the optimal point much faster if δ_s is chosen adaptively. On the first few iterations δ_s should be large, decreasing as the current point approaches the optimal point. The main reason for this is that taking a large step δ_s when the current point is far from the solution may smooth out the randomness to some extent, and may also overcome some of the problems (such as curved valleys) caused by the erratic behavior of the deterministic function F(z). One possible way of implementing such a strategy in an unconstrained case is given below.
(i) Take a large initial value of δ_s, such as δ_1 ~ 0.1 max_{z_1, z_2 ∈ X} || z_1 − z_2 ||.

(ii) Proceed with iterations (2), where ξ^s is determined using (10) or (11). While doing this, compute an estimate of the gradient v^s from (9).
(iii) Decrease δ_s, for instance by setting δ_{s+1} = β_1 δ_s whenever the estimate v^s indicates that the current point is in the vicinity of the optimum, where the values of β_1 and β_2 should be chosen before beginning the iterative process.

It can be shown that this process converges when ω^{i,s}_1 = ω^{i,s}_2, although it will also produce a good approximation to the solution even if this requirement is not met. Estimate (9) is not the only possibility -
in fact, any of the estimates of algorithm performance given in Section 3 would do.

Another strategy is to relate changes in the finite-difference approximation step to changes in the step size. This is especially advisable if the step size is also chosen adaptively (see Section 3). In the simplest case one may fix β_1 > 0 before starting and choose δ_s = β_1 ρ_s, which, although contrary to theoretical recommendations, will nevertheless bring the current point reasonably close to the optimal point. To obtain a more precise solution it is necessary to reduce β_1 during the course of the iterations. This may be done either automatically or interactively; both of these options are currently available in the stochastic optimization package STO.

Finite-difference algorithms (10) and (11) have one major disadvantage, and this is that the stochastic quasigradient variance increases as δ_s decreases. This means that finite-difference algorithms converge more slowly than algorithms which use gradients (5). There are two ways of overcoming this problem. Firstly, if it is possible to make observations of the function f(z,ω) for
various values of z and fixed ω, it is a good idea to take the same values of ω for the differences (i.e., ω^{i,s}_1 = ω^{i,s}_2) when δ_s is small, because this reduces the variance of the estimates quite considerably. Another way of avoiding this increase in the variance is to increase the number of samples used to obtain ξ^s when approaching the optimal point, i.e., to use finite-difference analogues of (6). If there exists a γ > 0 such that N_s δ_s² > γ, where N_s is the number of samples taken at step number s, then the variance of ξ^s remains bounded. It is sometimes useful to normalize the ξ^s, especially when the variance is large.

Another disadvantage of the finite-difference approach is that it requires n + 1 evaluations of the objective function for forward differences and 2n for central differences, where n is the dimension of the vector z. This may not be acceptable in large-scale problems and in cases where function evaluation is computationally expensive. In this situation a stochastic quasigradient can be computed using some analogue of random search techniques.

2.3. Analogues of random search methods
When it is not feasible to compute n + 1 values of the objective function at each iteration, the following approach (which has something in common with the random search techniques developed for deterministic optimization problems) may be used:

    ξ^s = (1/M_s) Σ_{j=1}^{M_s} [ ( f(z^s + δ_s h_j , ω^{j,s}_1) − f(z^s , ω^{j,s}_2) ) / δ_s ] h_j .                    (12)

Here the h_j are vectors distributed uniformly on the unit sphere, M_s is the number of random points and δ_s is the step taken in the random search. The choice of M_s is determined by the computational facilities available, although it is advisable to increase M_s as δ_s decreases. This method of choosing ξ^s has much in common with finite-difference schemes, and the statements made above about the choice of δ_s in the finite-difference case also hold for (12).

2.4. Smoothing the objective function
Methods of choosing ξ^s which rely on finite-difference or random search techniques are only appropriate when the objective function F(z) is differentiable. The use of similar procedures in the nondifferentiable case would require some smoothing of the objective function. Suppose that the function F(z) is not differentiable but satisfies the Lipschitz condition, and consider the function

    F(z, r) = ∫ F(z + y) dH(y, r) ,                    (13)

where H(y, r) is a probability measure with support in a ball of radius r centered at zero. We shall assume for simplicity that H(y, r) has nonzero density inside this ball. The function F(z, r) is differentiable and F(z, r) → F(z) uniformly over every compact set as r → 0. It is now possible to minimize the nonsmooth function F(z) by computing stochastic quasigradients for the smooth functions F(z, r) and find the optimal solution of the initial problem by letting r → 0. This idea was proposed in [11] and studied further in [12]. It is not actually necessary to calculate the integral in (13) - it is sufficient to compute
ξ^s using equations (10)-(12), but at the point z^s + y^s rather than at the point z^s, where y^s is a random variable distributed according to H(y, r_s). In this case (10) becomes:

    ξ^s = Σ_{i=1}^{n} [ ( f(z^s + y^s + δ_s e_i , ω^{i,s}_1) − f(z^s + y^s , ω^{i,s}_2) ) / δ_s ] e_i .                    (14)

The most commonly used distribution H(y, r) is the uniform distribution on an n-dimensional cube of side r. If we want to have convergence with probability one we should choose r_s such that δ_s / r_s → 0 and (r_s − r_{s+1}) / ρ_s → 0. In practical computations it is also advisable to choose the smoothing parameter r_s in a similar way to δ_s, using one of the adaptive procedures discussed above.

Smoothing also has beneficial side effects in that it improves the behavior of the deterministic function F(z). In the case where F(z) may be written as the sum of two functions, one with a distinct global minimum and the other with highly oscillatory behavior, smoothing may help to overcome the influence of the oscillations, which may otherwise lead the process to local minima far from the global one. Thus it can sometimes be useful to smooth the objective function even if we can obtain a gradient f_z(z,ω). In this case we should take a large value for the smoothing parameter r_s on the first few iterations, decreasing it as we approach the optimal point. The points at which r_s should be decreased may be determined using the values of additional estimates, such as those described below in Section 3 or given by (9). Everything said about the choice of the finite-difference parameter δ_s is also valid for the choice of the
smoothing parameter, including the connection between the step size and the smoothing parameter and the possibility of interactive control of r_s. The only difference is that a decrease in r_s does not lead to an increase in the variance of ξ^s, and that it is preferable to have δ_s < r_s. This is also reflected in the stochastic optimization software developed at IIASA.

All of the methods discussed so far use only the information available at the current point or in its immediate vicinity. We shall now discuss some more general ways of choosing the step direction which take into account the information obtained at previous points.
2.5. Averaging over preceding iterations
The definition of a stochastic quasigradient given in (3) allows us to use information obtained at previous points as the iterations proceed; this information may sometimes lead to faster convergence to the vicinity of the optimal point. One possible way of using such information is to average the stochastic quasigradients obtained in preceding iterations via a procedure such as (9). The v^s obtained in this way may then be used in method (2). This is another way of smoothing out randomness and neutralizing such characteristics of deterministic behavior as curved valleys and oscillations. Methods of this type may be viewed as stochastic analogues of conjugate gradient methods and were first proposed in [13]. We can choose ξ^s according to any of (5), (6), (10), (11), (12), or (14). Since v^s → F_z(z^s) under rather general conditions (see [1,2]), method (9) can be considered as an alternative to method (6) for deriving precise estimates of the gradient F_z(z). This method has an advantage over (6) in that it provides a natural way of using rough estimates of F_z(z^s) on the first few iterations and then gradually increasing the accuracy as the current point approaches the optimal point. In this case (9) can be incorporated in the adaptive procedures used to choose the smoothing parameter and the step in the finite-difference approximation.

However, it is not necessary to always take a_s → 0, because we have convergence for any 0 ≤ a_s ≤ 1. Sometimes it is even advantageous to take a_s = a = constant, because in this case more emphasis is placed on information obtained in recent iterations. In general, the greater the randomness, the smaller the value of a that should be taken. Another averaging technique is given by

    v^s = (1/M_s) Σ_{i=s−M_s+1}^{s} ξ^i ,                    (15)

where M_s is the size of the memory, which may be fixed.
2.6. Using second-order information

There is strong evidence that in some cases setting

    z^{s+1} = Π_X( z^s − ρ_s H_s ξ^s )                    (16)

may bring about considerable improvements in performance. Here ξ^s can be chosen in any of the ways discussed above. The matrix H_s should be positive definite and take into account both the second-order behavior of the function F(z) and the structure of the random part of the problem. One possible way of obtaining second-order information is to use analogues of quasi-Newton methods to update the matrix H_s. To implement this approach, which was proposed by Wets in [3], it is necessary to have ||v^s − F_z(z^s)|| → 0.

3. CHOICE OF STEP SIZE

The simplest way of choosing the step-size sequence in (2) is to do it before starting the iterative process. Convergence theory suggests that any series with the properties

    ρ_s ≥ 0 ,   Σ_{s=1}^∞ ρ_s = ∞ ,   Σ_{s=1}^∞ ρ_s² < ∞                    (17)
can be used as a sequence of step sizes. In addition, it may be necessary to take into account relations between the step size and such things as the smoothing parameter or the step in a finite-difference approximation. Relations of this type have been briefly described in the preceding sections. In most cases the choice ρ_s ~ C/s, which obviously satisfies (17), provides the best possible asymptotic rate of convergence. However, since we are mainly concerned with reaching the vicinity of the solution, rule (17) is of limited use because a wide variety of sequences can be modified to satisfy it. The other disadvantage of choosing the step-size sequence in advance is that this approach does not make any use of the valuable information which accumulates during solution. These "programmed" methods thus perform relatively badly in the majority of cases.
The best strategy therefore seems to be to choose the step size using an interactive method. It is assumed that the user can monitor the progress of the optimization process and can intervene to change the value of the step size or other parameters. This decision should be based on the behavior of the estimates F̂(z^s) of the current value of the objective function. The estimates may be very rough and are generally calculated using only one observation per iteration, as in the following example:

    F̂^{s+1} = (1 − b_s) F̂^s + b_s f(z^s, ω^s) .                    (18)

It appears that although the observations f(z^s, ω^s) may vary greatly, the F̂^s display much more regular behavior. Monitoring the behavior of some components of the vector z^s in addition to the F̂^s also seems to be useful. One possible implementation of the interactive approach may proceed along the following lines:

(i) The user first chooses the value of the step size and keeps it constant for a number of iterations (usually 10-20). During this period the values of the estimate F̂^s and some of the components of the vector z^s are displayed, possibly with some additional information.
(ii) The user decides on a new value for the step size using the available information. Three different cases may occur:

- The current step size is too large. In this case both the values of the estimate F̂^s and the values of the monitored components of z^s exhibit random jumps. It is necessary to decrease the step size.

- The current step size is just right. In this case the estimates decrease steadily and some of the monitored components of the current vector z^s also exhibit regular behavior (steadily decrease or increase). This means that the user may keep the step size constant until oscillations occur in the estimate and/or in the components of the current vector z^s.

- The current step size is too small. In this case the estimate F̂^s will begin to change slowly, or simply fluctuate, after the first few iterations, while the change in z^s is negligible. It is necessary to increase the step size.

(iii) Continue with the iterations, periodically performing step (ii), until changes in the step size no longer result in any distinct trend in either the function estimate or the current vector z^s, which will oscillate around some point. This will indicate that the current point is close to the solution.
This method of choosing the step size requires an experienced user, but we have found that the necessary skills are quickly developed by trial and error. The main reasons for adopting an interactive approach may be summarized as follows:

- Interactive methods make the best use of the information which accumulates during the optimization process.

- Because the precise value of the objective function is not available, it is impossible to use the rules for changing the step size developed in deterministic optimization (e.g., line searches).

- Stochastic effects make it extremely difficult to define formally when the step size is "too big" or "too small"; theoretical research has not thrown any light on this problem.

The main disadvantage of the interactive approach is that much of the user's time is wasted if it takes the computer a long time to make one observation f(z^s, ω^s). For this reason a great effort has been made to develop automatic adaptive ways of choosing the step size, in which the value of the step size is chosen on the basis of information obtained at all or some of the previous points z^i, i = 1,...,s. Methods of this type are considered in [14-20]. The approach described in the following sections involves the estimation of some measures of algorithm performance which we denote by φ_i(Z^s, u^s), where Z^s represents the whole sequence {z^1, z^2, ..., z^s} and u^s the set of parameters used in the estimate. In general, algorithm performance measures are attempts to formalize the notions of "oscillatory behavior" and "regular behavior" used in interactive step-size regulation, and possess one or more of the following properties:

- the algorithm performance measure is quite large when the algorithm exhibits distinct regular behavior, i.e., when the estimates of the function value decrease or the components of the current vector z^s show a distinct trend;

- the algorithm performance measure becomes small and even changes its sign if the estimates of the current function value stop improving or if the current point starts to oscillate chaotically;

- the algorithm performance measure is large far from the solution and small in the immediate vicinity of the optimal point.
Automatic adaptive methods for choosing the step size begin with some reasonably large value of the step size, which is kept constant as long as the value of the algorithm performance measure remains high, and then decreases when the performance measure becomes less than some prescribed value. The behavior of the algorithm usually becomes regular again after a decrease in the step size, and the value of the performance measure increases; after a number of iterations oscillations set in and the value of the performance measure once again decreases. This is a sign that it is time to decrease the step size. A rather general convergence result concerning such adaptive piecewise-linear methods of changing the step size is given in [18]. However, in many cases it is difficult to determine how close the current point is to the optimal point using only one such measure; a more reliable decision can be made using several of the measures described below. Unfortunately, it is not possible to come to any general conclusions as to which performance measure is the "best" for all stochastic optimization problems. Moreover, both the values of the parameters used to estimate the performance measure and the value of the performance measure at which the step size should be decreased are different for different problems. Therefore if we fix these parameters once and for all we may achieve the same poor performance as if we had chosen the whole sequence of step sizes prior to the optimization process. Thus, it is necessary to tune the parameters of automatic adaptive methods to different classes of problems, and the interactive approach can be very useful here. An experienced user would have little difficulty in using the values of the performance measures to determine the correct points at which to change the step size, and in learning what type of performance measure behavior requires an increase or a decrease in the step size. The interactive approach is of particular use if one iteration is not very time-consuming and there are a number of similar problems to be solved. In this case the user can identify the most valuable measures of performance in the first few runs, fix their parameters and incorporate this knowledge in automatic adaptive step-size selection methods for the remaining problems.

Although interactive methods usually provide the quickest means of reaching the solution, they cannot always be implemented, and in this case automatic adaptive methods prove to be very useful. The stochastic optimization package STO developed at IIASA and the Kiev stochastic and nondifferentiable optimization package NDO both give the user the choice between automatic adaptive methods and interactive methods of determining the step size. Below we describe some particular measures of algorithm performance and methods of choosing the step size.
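The general automatic scheme just described (keep the step size constant while the performance measure stays high, cut it when the measure drops below a prescribed value) can be sketched as follows. This is a Python sketch with a hypothetical interface: step_fn stands in for one iteration of the method, perf_fn for one of the performance measures discussed in the following sections, and the defaults are illustrative rather than values from the paper.

```python
def run_with_adaptive_step(step_fn, perf_fn, rho0, n_iter=200,
                           every=10, alpha1=0.7, alpha2=0.01):
    """Skeleton of automatic adaptive step-size control: every `every`
    iterations a performance measure is evaluated, and the step size is
    cut by the factor alpha1 whenever the measure falls below alpha2."""
    rho = rho0
    history = []
    for s in range(n_iter):
        step_fn(rho)                      # one iteration z^{s+1} = z^s - rho*xi^s
        if (s + 1) % every == 0:
            phi = perf_fn()               # current performance measure
            if phi < alpha2:
                rho *= alpha1             # behavior became oscillatory: shrink
        history.append(rho)
    return rho, history
```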
The main indicators used to evaluate the performance of an algorithm are estimates of such things as the value of the objective function and its gradient. The averaging procedure (9) may be used to estimate the value of the gradient, as described earlier in this paper. The main advantage of this procedure is that it allows us to obtain estimates of the mean values of the random variables without extensive sampling at each iteration, since a very limited number of observations (usually only one) is made at each iteration. This estimate, although poor at the beginning, becomes more and more accurate as the iterations proceed. One example of such an estimate is (18), which is a special case of the more general formula (19). Any observation of the function value with the property (20) can be used instead of f(z^s, ω^s) in (19), where d_s → 0. For example, (6) would do. In order to get lim_{s→∞} |F̄^s − F(z^s)| = 0 it is necessary to have ρ_s/γ_s → 0. However, estimate (18) assigns all observations of function values the same weight. This sometimes leads to considerable bias in the estimate for all the iterations the user can afford to run. Therefore for practical purposes it is sometimes more useful to adopt procedures of the type described in Section 2 for the estimation of gradients. These include estimate (19) with fixed γ_s = γ, where γ ≈ 0.01-0.05, and the method in which the average is taken over the preceding M_s iterations, as in (21). Although these estimates do not converge asymptotically to F(z^s), they place more emphasis on observations made at recent points.

All of the estimates described here may also be used in an interactive mode to determine the step size, as described above. In addition, the values of the parameters used to determine the step size may also be chosen interactively. For example, the values of the parameters in these estimates can be made to depend on the observed behavior of the estimates themselves.
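The three function-value estimates discussed above can be sketched side by side: the equal-weight average of all observations as in (18), the exponential average (19) with fixed γ, and the moving average over the last M observations as in (21). This is a Python sketch; the class name and the parameter defaults are illustrative, not taken from the paper.

```python
from collections import deque

class FunctionEstimates:
    """Three ways of estimating F(z^s) from single noisy observations
    f(z^s, w^s): an equal-weight running mean (all observations weighted
    equally), an exponential average with fixed gamma, and a moving
    average over the last M observations."""
    def __init__(self, gamma=0.05, M=20):
        self.n = 0
        self.mean = 0.0          # equal-weight average of all observations
        self.expo = None         # exponential average with fixed gamma
        self.gamma = gamma
        self.window = deque(maxlen=M)
    def update(self, f_obs):
        self.n += 1
        self.mean += (f_obs - self.mean) / self.n
        self.expo = f_obs if self.expo is None else \
            (1 - self.gamma) * self.expo + self.gamma * f_obs
        self.window.append(f_obs)
        return self.mean, self.expo, sum(self.window) / len(self.window)
```

The equal-weight mean is unbiased for a fixed point but reacts slowly once the iterates move; the last two estimates trade asymptotic convergence for responsiveness to recent observations, as noted above.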
We shall now describe some automatic adaptive rules for choosing the step size. The important point as regards implementation is how to choose the initial value of the step size ρ_0. We suggest that the value of a stochastic quasigradient ξ^0 should first be computed at the initial point, and that the initial value of the step size should then be chosen such that

    l ρ_0 ||ξ^0|| ≈ D ,

where l ≈ 10-20 and D is a rough estimate of the size of the domain in which we believe the optimal solution to be located. This means that it is possible to reach the vicinity of each point in this domain within the first 20 iterations or so.
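Under this reading (l steps of length ρ_0||ξ^0|| covering a region of diameter D), the initial step size could be computed as in the following sketch. The formula and the names are the assumed reconstruction described above, not taken verbatim from the paper; D and l must be supplied by the user.

```python
def initial_step(xi0, D, l=15):
    """Initial step size rho_0 chosen so that l steps of length
    rho_0 * ||xi0|| can traverse a region of diameter D.  xi0 is the
    stochastic quasigradient observed at the initial point."""
    norm = sum(g * g for g in xi0) ** 0.5
    return D / (l * norm)
```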
3.1. Ratio of function estimate to path length
Before beginning the iterations we choose the initial step size ρ_0, two positive constants α_1 and α_2, a sequence M_s, and an integer N̄. After every N̄ iterations we revise the value of the step size in the following way:

(i) Compute the quantity

    φ_1(Z^s, u^s) = (F̄^{s-M_s} − F̄^s) / Σ_{i=s-M_s+1}^{s} ||z^i − z^{i-1}||                    (22)

Here the u^s are the averaging parameters used in the estimation of F̄^s, while Z^s is again the whole sequence of points preceding z^s. The quantity in the denominator is the length of the path taken by the algorithm during the preceding M_s iterations. The function φ_1(Z^s, u^s) is another example of a measure which can be used to assess algorithm performance.

(ii) Take a new value of the step size:

    ρ_{s+1} = α_1 ρ_s   if φ_1(Z^s, u^s) < α_2 ,
    ρ_{s+1} = ρ_s       otherwise.                                                              (24)
In this method the step size is changed at most once every N̄ iterations. This is essential because the function φ_1 changes slowly, and if its value is less than α_2 at iteration number s it is likely that the same will be true at iteration number s+1. Therefore N̄ should lie in the range 5-20. This procedure can be modified in various ways, such as continuing for N̄ iterations with a fixed step size, then starting to compare values until inequality (24) is satisfied, whereupon the step size is reduced. We then wait another N̄ iterations and repeat the procedure. Recommended values of α_1 and α_2 lie within the ranges 0.5-0.9 and 0.005-0.1, respectively. The number M_s may be chosen to be constant and equal to N̄.
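A sketch of the measure φ_1 and rule (24) in Python follows. The exact formula for φ_1 is an assumption consistent with the section title and the surrounding description (decrease of the function estimate divided by path length); the parameter defaults are within the recommended ranges.

```python
def _dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def phi1(F_hist, z_hist, M):
    """Performance measure of Section 3.1 (assumed form): decrease of
    the function estimate over the last M iterations divided by the
    path length of the trajectory over the same iterations."""
    pts = z_hist[-(M + 1):]
    path = sum(_dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1))
    return (F_hist[-(M + 1)] - F_hist[-1]) / path if path > 0 else 0.0

def update_rho(rho, phi, alpha1=0.7, alpha2=0.01):
    """Rule (24): multiply the step size by alpha1 when phi < alpha2."""
    return alpha1 * rho if phi < alpha2 else rho
```

On a steadily descending trajectory φ_1 stays large and ρ is kept constant; once the iterates begin to wander, the path length grows while the estimate stops decreasing, φ_1 drops, and the step size is cut.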
If we have a number of similar problems it is very useful to make the first run in a semi-automatic mode, i.e., to intervene in the optimization process to improve the values of the parameters α_1, α_2, N̄; the new values can then be used in a fully automatic mode to solve the remaining problems.

This algorithm is by no means convergent in the traditional sense, but it outperformed traditional choices such as C/s in numerical experiments because it normally reaches the vicinity of the optimal point more quickly. However, it is possible to safeguard convergence by considering a second sequence C/s, where C is small, and switching to this sequence if the step size recommended by (24) falls below a certain value. This step size regulation was introduced in [15].

3.2. Use of gradient estimates
Take φ_2(Z^s, u^s) = ||ξ̄^s|| instead of φ_1(Z^s, u^s) in (24), where ξ̄^s is one of the gradient estimates discussed above, and the u^s represent all the parameters used, including averaging parameters and the frequency of changes in the step size.
3.3. Ratio of progress and path
The quantity ||z^{s-M_s} − z^s|| represents the progress made by the algorithm between iteration number s − M_s and iteration number s. If we keep the step size constant, the algorithm begins to oscillate chaotically after reaching some neighborhood of the optimal point. The smaller the value of the step size, the smaller the neighborhood at which this occurs, and thus the total path between iterations s and s − M_s begins to grow compared with the distance between points z^{s-M_s} and z^s. This means that the function

    φ_3(Z^s, u^s) = ||z^{s-M_s} − z^s|| / Σ_{i=s-M_s+1}^{s} ||z^i − z^{i-1}||

can be used as a performance measure in equation (24).
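This progress-to-path ratio can be sketched as follows (Python; the formula is an assumption consistent with the description above). The measure is close to 1 while the trajectory moves steadily in one direction and drops toward 0 once the iterates oscillate around a point.

```python
def _dist(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def phi3(z_hist, M):
    """Ratio of the progress ||z^{s-M} - z^s|| to the total path length
    over the same M iterations (assumed form of the Section 3.3 measure)."""
    pts = z_hist[-(M + 1):]
    progress = _dist(pts[0], pts[-1])
    path = sum(_dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1))
    return progress / path if path > 0 else 0.0
```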
3.4. Analogues of line search techniques
The decision as to whether (and how) to change the step size may be based on the values of the scalar product of adjacent step directions. If we have (v^{s-1}, v^s) > 0, then this may be a sign that regular behavior prevails over stochastic behavior: the function is decreasing in the step direction and the step size should be increased. Due to stochastic effects the function will very often increase rather than decrease, but in the long run the number of bad choices will be less than the number of correct decisions. Analogously, if this inequality does not hold then the step size should be decreased. The rule for changing the step size is thus basically to increase it by a factor α_2 when the scalar product is positive and to decrease it by a factor α_3 otherwise, where the values of α_1, α_2, α_3 (recommended values α_1 ≈ 0.4-0.8, 1 < α_2 < 1.3 and 0.7 ≤ α_3 < 1) should be chosen before starting the iterations. It is also advisable to have upper and lower bounds on the step size to avoid divergence. Sometimes it is convenient to normalize the vectors of step directions, i.e., ||v^s|| = 1. The lower bound may decrease as the iterations proceed. This method may also be applied to the choice of a vector step size, treating some (or all) variables or groups of variables separately. A number of different methods based on the use of scalar products of adjacent step directions to control the step size have been developed by Uriasiev [19], Pflug [16], and Ruszczynski and Syski [20].
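Interpreting the rule as multiplying the step size by α_2 (> 1) when the scalar product of adjacent step directions is positive and by α_3 (< 1) otherwise, together with the recommended bounds, gives the following sketch. The precise role of α_1 in the original rule is not stated here, so this sketch uses only the increase and decrease factors; the defaults and bound values are illustrative.

```python
def update_rho_scalar(rho, v_prev, v_cur, alpha2=1.2, alpha3=0.8,
                      rho_min=1e-6, rho_max=1.0):
    """Section 3.4 rule (assumed form): increase the step size when
    adjacent step directions agree, decrease it otherwise, and keep the
    result within [rho_min, rho_max] to avoid divergence."""
    dot = sum(a * b for a, b in zip(v_prev, v_cur))
    rho = rho * alpha2 if dot > 0 else rho * alpha3
    return min(max(rho, rho_min), rho_max)
```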
The interactive stochastic optimization package implemented at IIASA (STO) is based on the same ideas as the package for stochastic and nondifferentiable optimization developed in Kiev (NDO). It allows the user to choose between interactive and automatic modes and makes available the stochastic quasigradient methods described in Sections 2 and 3. In the interactive mode the program offers the user the opportunity to change the step parameters and the methods by which the step size and step direction are chosen during the course of the iterations. The user can also stop the iterative process and obtain a more precise estimate of the value of the objective function before continuing. The package is written in FORTRAN-77.
Before initiating the optimization process the user has to:

(i) Provide a subroutine UF which calculates the value of the function f(z,ω) for fixed z and ω and, optionally, a subroutine UG which computes the gradient f_z(z,ω) of this function; the function evaluation subroutine should be of the form:

      FUNCTION UF(N,X)
      DIMENSION X(N)
C     Calculation of f(z,w)
      RETURN
      END
Here N is the dimension of the vector of variables X. (Note that the implementation on the IIASA VAX actually requires the subroutine to be entered in lower-case letters rather than capitals.) A description of the subroutine which calculates a quasigradient is given later in this paper.

(ii) Compile these subroutines with the source code to obtain an executable module.
(iii) Provide at least one of the following additional data files:

- algorithm control file (used only in the non-interactive option)
- parameter file (used only in the interactive option)
- initial data file (should always be present)

All of these files are described in some detail later in the paper.
The optimization process can then begin. The program first asks the user a series of questions regarding the required mode (interactive or automatic), method of step size regulation, choice of step direction, etc. These questions appear on the monitor and should be answered from the keyboard or by reference to a data file. We shall represent the dialogue as follows:

    Question?    Answer

with the user's response given in italics. The first question is

    Interactive mode? reply yes or no    yes/no

To choose the interactive option the user should type in yes (or y); to select the automatic option he should answer no (or n). In the latter case the program would ask no further questions, but would read all the necessary information from the algorithm control file (which is usually numbered 2; under UNIX conventions its name is fort.2). The iterative process would then begin, terminating after 10,000 iterations if no other stopping criterion is fulfilled. The algorithm control file must contain answers to all of the following questions except those concerned either with dialogue during the iterations or with the parameter file (such questions are marked with an asterisk below). This file is given a name only for ease of reference; the important thing for the user is its number.

Assume now that the user has chosen the interactive option by answering yes to the first question. The program then asks
    parameter file? (number)

The user should respond either with the number of the file of default parameters or with the number of the file in which the current values of the algorithm parameters are stored. The file of default parameters is provided with the program and has the name fort.12 (under UNIX conventions); thus, to refer the program to the default file the user should answer 12. The purpose of this file is to help the user to set the values of algorithm parameters in the ensuing dialogue and also to store such improved values as may be discovered by the user through trial and error. If the user assigns the algorithm parameters any values other than those in the default file, the new values become the default values in subsequent runs of the program. This file is optional.
The program then asks

    read parameter file? reply yes or no    yes/no

The answer yes implies that the file specified in the previous question exists, and that default parameter values are stored in this file. In this case, when asking the user about parameter values, the program will read the default option in the parameter file and reproduce it on the screen together with the question. If the user accepts this default value he should respond with 0 (zero); otherwise he should enter his own value, which will become the new default value.

The answer no means that no default values are available at the moment. In this case the program will form a new default file (labeled with the number given as an answer to the previous question); its contents will be based on the user's answers to future questions. This new default file, once formed, can be used in subsequent runs.
The next question is

    number of variables? (number)

to which the user should respond with the dimension of the vector of variables z. He is then asked

    initial data file? (number)

and should reply with the number of the initial data file. This file should contain the following elements (in exactly this order):

- The initial point, which should be a sequence of numbers separated by commas or other delimiters.
- Any additional data required by subroutines UF or UG if such data exists and the user chooses to put it in the initial data file (optional).
- Information about the constraints (described in more detail below).

The program then asks

    step size regulation?    is

Here is is a positive integer from the set {1,2,3,4,6,7}, where the different values of is correspond to different ways of choosing the step size. (The integer 5 is reserved for an option currently under development.)
1   Adaptive automatic step size regulation (24) based on algorithm performance function (22) and function estimate (18).

2   Manual step size regulation based on algorithm performance function (22) and function estimate (18).

3   Adaptive automatic step size regulation (24) using algorithm performance measure (22) and a function estimate based on a finite number of previous observations (21).

4   Manual step size regulation based on the same estimates of algorithm performance as for is = 3.

6   Automatic step size regulation (24) using algorithm performance measure (22) and function estimate (19) with fixed γ_s.

7   Manual step size regulation based on the same estimates of algorithm performance as for is = 6.

The difference between adaptive automatic and manual step size regulation (see is = 1, 2) is that in the first case the step size is chosen automatically, although the user may terminate the iterations at specified points and continue with another step size regulation, while in the second case the user changes the value of the step size himself. Both step size regulations are based on the same estimates of function value and algorithm performance.

The next question is
    step direction? (5 figures)    id1 id2 id3 id4 id5

The user has to respond with five figures which specify various ways of choosing the step direction, e.g., 11111. We shall refer to these figures as id1, id2, id3, id4 and id5. The subroutine which estimates the step direction makes some number of initial observations at each step; these are then averaged in some way to obtain the vector v̄^s, and the final step direction v^s is calculated using both v̄^s and the values of v^i for i < s.

The value of id1 specifies the nature of the initial observations:

id1  Definition
1   A direct observation of a stochastic quasigradient is available and the user has to specify a subroutine UG to calculate it:

      SUBROUTINE UG(N,X,G)
      DIMENSION X(N),G(N)
C     Calculation of a stochastic quasigradient
      RETURN
      END

where G(N) is an observation of a stochastic quasigradient.

2   Central finite-difference approximation of the gradient as in (11).
3   The v̄^s are calculated using random search techniques (12).

4   Forward finite-difference approximation of the initial observations as in (10).

5   Central finite-difference approximation of the gradient as in (11). All observations of the function used in one observation of v̄^s are made with the same values of the random parameters ω.

6   The v̄^s are calculated using random search techniques (12). All observations of the function used in one observation of v̄^s are made with the same values of the random parameters ω.

7   Forward finite-difference approximation of the initial observations as in (10). All observations of the function used in one observation of v̄^s are made with the same values of the random parameters ω.
Note that for id1 = 5, 6, 7 all observations of the function used in one observation of v̄^s are made with the same values of the random parameters ω. In this case the user should write a function UF which supports this feature as follows:

      FUNCTION UF(N,X)
      DIMENSION X(N)
      COMMON/OMEG/LO,MO
C     If LO = 1 and MO = 1 then obtain new values of the random factors w and set MO = 0.
C     Make an observation of the function at point z.
      RETURN
      END
The second figure id2 determines the point at which observations are made:

id2  Definition

1   The initial direction is calculated at the current point z^s.

2   The initial direction is calculated at a point chosen randomly from among those in the neighborhood of the current point z^s.

The value of id3 defines the way in which the step in a finite-difference or random search approximation of v̄^s is chosen:

id3  Definition

1   The approximation step is fixed. The observations of the objective function at point z^s originally used to obtain gradient observations are not used to update the estimate of the function employed for step size regulation.