NOT FOR QUOTATION WITHOUT PERMISSION OF THE AUTHOR

STOCHASTIC QUASIGRADIENT METHODS AND THEIR IMPLEMENTATION

Yuri Ermoliev and Alexei Gaivoronski

July 1984 WP-84-55

Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.

INTERNATIONAL INSTITUTE FOR APPLIED SYSTEMS ANALYSIS, 2361 Laxenburg, Austria


PREFACE

This paper discusses various stochastic quasigradient methods and considers their computer implementation. It is based on experience gained both at the V. Glushkov Institute of Cybernetics in Kiev and at IIASA.

The paper falls naturally into three parts. The first is concerned with problem definition and various ways of choosing the step size and step direction. A detailed description of an interactive stochastic optimization package (STO) available at IIASA forms the second part. Finally, the use of this package is demonstrated by application to three test problems that have arisen during the course of IIASA work; they include a facility location problem and a water management problem.

This work was carried out within the Adaptation and Optimization Project of the System and Decision Sciences Program.

ANDRZEJ WIERZBICKI
Chairman

System and Decision Sciences


ABSTRACT

A number of stochastic quasigradient methods are discussed from the point of view of implementation. The discussion revolves around the interactive package of stochastic optimization routines (STO) recently developed by the Adaptation and Optimization group at IIASA. (This package is based on the stochastic and nondifferentiable optimization package (NDO) developed at the V. Glushkov Institute of Cybernetics in Kiev.) The IIASA implementation is described and its use illustrated by application to three problems which have arisen in various IIASA projects.


STOCHASTIC QUASIGRADIENT METHODS AND THEIR IMPLEMENTATION

Yuri Ermoliev and Alexei Gaivoronski

1. INTRODUCTION

This paper discusses various stochastic quasigradient methods (see [1,2]) and considers their computer implementation. It is based on experience gained both at the V. Glushkov Institute of Cybernetics in Kiev and at IIASA.

We are concerned here mainly with questions of implementation, such as the best way to choose step directions and step sizes, and therefore little attention will be paid to theoretical aspects such as convergence theorems and their proofs. Readers interested in the theoretical side are referred to [1,2].

The paper is divided into five sections. After introducing the main problem in Section 1, we discuss the various ways of choosing the step size and step direction in Sections 2 and 3. A detailed description of an interactive stochastic optimization package (STO) currently available at IIASA is given in Section 4.

This package represents one possible implementation of the methods described in the previous sections. Finally, Section 5 deals with the solution of some test problems using this package. These problems were brought to our attention by other IIASA projects and collaborating institutions and include a facility location problem, a water resources management problem, and the problem of choosing the parameters in a closed-loop control law for a stochastic dynamical system with delay.

We are mainly concerned with the problem

    minimize  F(z) = E_ω f(z, ω)   subject to  z ∈ X ,        (1)

where z represents the variables to be chosen optimally, X is a set of constraints, and ω is a random variable belonging to some probabilistic space (Ω, B, P). Here B is a Borel field and P is a probabilistic measure.

There are currently two main approaches to this problem. In the first, we take the mathematical expectation in (1), which leads to multidimensional integration and involves the use of various approximation schemes [3-6]. This reduces problem (1) to a special kind of nonlinear programming problem which allows the application of deterministic optimization techniques. In this paper we concentrate on the second approach, in which we consider a very limited number of observations of the random function f(z, ω) at each iteration in order to determine the direction of the next step. The resulting errors are smoothed out until the optimization process terminates (which happens when the step size becomes sufficiently small). This approach was pioneered in [7,9].

We assume that the set X is defined in such a way that the projection operation z → Π_X(z) is comparatively inexpensive from a computational point of view, where

    Π_X(z) = arg min_{y ∈ X} ‖z − y‖ .

For instance, if X is defined by linear constraints, then the projection reduces to a quadratic programming problem which, although challenging if large scale, can nevertheless be solved in a finite number of iterations. In this case it is possible to implement a stochastic quasigradient algorithm of the following type:

    z^{s+1} = Π_X( z^s − ρ_s v^s ) ,   s = 0, 1, 2, ... .        (2)

Here z^s is the current approximation of the optimal solution, ρ_s is the step size, and v^s is a random step direction. This step direction may, for instance, be a statistical estimate of the gradient (or subgradient in the nondifferentiable case) of the function F(z): then v^s is chosen such that

    E( v^s | z^0, z^1, ..., z^s ) = F_z(z^s) + a_s ,        (3)

where a_s decreases as the number of iterations increases, and the vector v^s is called a stochastic quasigradient of the function F(z). Usually ρ_s → 0 as s → ∞ and therefore ‖z^{s+1} − z^s‖ → 0 from (2). This suggests that we should take z^s as the initial point for the solution of the projection problem at iteration number s+1, thus reducing considerably the computational effort needed to solve the quadratic programming problem at each step s = 1, 2, ... .
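
To make the basic scheme concrete, here is a minimal Python sketch of iteration (2) with a user-supplied stochastic quasigradient in the sense of (3). The box projection, the step-size rule ρ_s = C/(s+1), and the quadratic test objective are illustrative assumptions, not the package described later in the paper.

    import numpy as np

    def project_box(z, lower, upper):
        # Projection onto X = {z : lower <= z <= upper}; assumed cheap, as in the text.
        return np.clip(z, lower, upper)

    def sqg_step(z, s, quasigradient, lower, upper, C=1.0):
        # One iteration of (2): z^{s+1} = Pi_X(z^s - rho_s * v^s), with rho_s = C/(s+1).
        rho = C / (s + 1)
        v = quasigradient(z)              # random direction satisfying (3) on average
        return project_box(z - rho * v, lower, upper)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Hypothetical test problem: F(z) = E ||z - omega||^2 with omega ~ N((1,1), 0.25 I).
        def quasigradient(z):
            omega = rng.normal(1.0, 0.5, size=z.shape)
            return 2.0 * (z - omega)      # gradient of f(z, omega) at the sampled omega
        z = np.array([5.0, -3.0])
        lower, upper = np.array([0.0, 0.0]), np.array([10.0, 10.0])
        for s in range(2000):
            z = sqg_step(z, s, quasigradient, lower, upper)
        print("final iterate:", np.round(z, 3))   # should be near the optimum (1, 1)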

Algorithm (2)-(3) can also cope with problems with more general constraints formulated in terms of mathematical expectations, such as

    F_i(z) = E_ω f_i(z, ω) ≤ 0 ,   i = 1, ..., m ,

by making use of penalty functions or the Lagrangian (for details see [1,2]).


The principal peculiarity of such methods is their nonmonotonicity, which may sometimes show itself in highly oscillatory behavior. In this case it is difficult to judge whether the algorithm has already approached a neighborhood of the optimal point or not, since exact values of the objective function are not available. The best way of dealing with such difficulties seems to be to use an interactive procedure to choose the step sizes and step directions, especially if it does not take much time to make one observation. More reasons for adopting an interactive approach and details of the implementation are given in the following sections.

Another characteristic of the algorithms described here is their pattern of convergence. Because of the probabilistic nature of the problem, their asymptotic rate of convergence is extremely slow and may be represented by

    ‖z^s − z*‖ ~ C / √k .        (4)

Here z* is the optimal point to which the sequence z^s converges and k is the number of observations of the random parameters ω, which in many cases is proportional to the number of iterations. In deterministic optimization a superlinear asymptotic convergence rate is generally expected; a rate such as (4) would be considered as nonconvergence. But no algorithm can do asymptotically any better than this for the stochastic problem (1) in the presence of nondegenerate random disturbances, and therefore the aim is to reach some neighborhood of the solution rather than to find the precise value of the solution itself. Algorithm (2)-(3) is quite good enough for this purpose.

2. CHOICE OF STEP DIRECTION

In this section we shall discuss different ways of choosing the step direction in algorithm (2) and some closely related algorithms. We shall first discuss methods which are based on observations made at the current point z^s or in its immediate vicinity. More general ways are then presented which take into account observations made at previous points.

2.1. Gradients of the random function f(z, ω)

The simplest case arises when it is possible to obtain gradients (or subgradients in the nondifferentiable case) of the function f(z, ω) at fixed values of z and ω. In this case we can simply take

    v^s = ξ^s = f_z(z^s, ω^s) ,        (5)

where ω^s is an observation of the random parameter ω made at step number s. If both the observation of the random parameters and the evaluation of gradients are computationally inexpensive then it is possible to take the average of some specified number N of gradient observations:

    ξ^s = (1/N) Σ_{i=1}^{N} f_z(z^s, ω^{i,s}) .        (6)

These observations can be selected in two ways. The first is to choose the ω^{i,s} according to their probability distribution. If we do not know the form of the distribution function (as, for example, in Monte-Carlo simulation models) this is the only option. However, in this case the influence of low-probability high-cost events may not be properly taken into account. In addition, the asymptotic error of the gradient estimate is approximately proportional to 1/√N.

The second approach may be used when we know the distribution of the random parameters ω. In this case many other estimates can be derived; the use of pseudo-random numbers* in particular may lead to an asymptotic error approximately proportional to log(N)/N, which is considerably less than in the purely random case. However, more theoretical research and more computational experience are necessary before we can assess the true value of this approach.

The main question here is whether the increase in the speed of convergence is sufficient to compensate for the additional computational effort required for more exact estimates of F_z(z^s).
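
The trade-off between N and the per-iteration cost can be seen directly by sampling. The short Python sketch below (the quadratic test model and the sample sizes are illustrative assumptions) estimates the gradient by the average (6) and shows the error shrinking roughly like 1/√N.

    import numpy as np

    def averaged_quasigradient(z, sample_gradient, N, rng):
        # Estimate (6): average of N independent gradient observations at the same point z.
        return np.mean([sample_gradient(z, rng) for _ in range(N)], axis=0)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        # Hypothetical model: f(z, omega) = ||z - omega||^2 with omega ~ N(0, I), so F_z(z) = 2z.
        sample_gradient = lambda z, rng: 2.0 * (z - rng.normal(size=z.shape))
        z = np.array([1.0, -2.0, 0.5])
        for N in (1, 10, 100, 1000):
            err = np.linalg.norm(averaged_quasigradient(z, sample_gradient, N, rng) - 2.0 * z)
            print(N, round(float(err), 3))        # error decays roughly as 1/sqrt(N)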

Unfortunately, our theoretical knowledge concerning the asymptotic behavior of processes of type (2) tells us little about the optimal number of samples, even for relatively well-studied cases. For instance, what would be the optimal number N of observations for the case in which the function F(z) is differentiable and there are no constraints? In this case we can establish both asymptotic normality and the value of the asymptotic variance. If, additionally, ρ_s ~ C/s, then the total number of observations required to obtain a given asymptotic variance is the same for all N << s. If sρ_s → ∞ then the wait-and-see approach is asymptotically superior as long as N << s.

*A concept which arose from the use of quasi-Monte-Carlo techniques in multidimensional integration [9].


However, there is strong evidence that in constrained and/or nondifferentiable cases the value of N should be chosen adaptively. A very simple example provides some insight into the problem. Suppose that z ∈ R^1, X = [a, ∞), F(z) = z, and f_z(z^s, ω^s) = 1 + ω^s, where the ω^s, s = 1, 2, ..., are independent random variables with zero mean. The obvious solution of this problem is z = a. Suppose for simplicity that ρ_s = ρ. This will not alter our argument greatly because ρ_s usually changes very slowly for large s. In this case method (2),(5) will be of the form

    z^{s+1} = z^s − ρ(1 + ω^s) + x^s ,   x^s = max{ 0 , a − z^s + ρ(1 + ω^s) } ,        (7)

i.e., z^{s+1} = max{ a , z^s − ρ(1 + ω^s) }, where x^s is the projection correction. Method (2),(6) requires us to choose a step size N times greater than ρ; otherwise its performance would be inferior to that of method (2),(5) (unless the initial point is in the immediate vicinity of the minimum). Method (2),(6) then becomes

    z^{s+1} = max{ a , z^s − ρ Σ_{i=1}^{N} (1 + ω^{i,s}) } .        (8)

In order to compare the two methods we shall let s in the last equation denote the number of observations rather than the number of iterations and renumber the observations ω^{i,s} consecutively as ω^1, ω^2, ... . The process

    y^{i+1} = y^i − ρ(1 + ω^i) + Δ^i ,
    Δ^i = 0                          if i ≠ lN for l = 1, 2, ... or a < y^i − ρ(1 + ω^i) ,
    Δ^i = a − y^i + ρ(1 + ω^i)       otherwise ,

has the property that y^{sN} = z^s, and therefore it is sufficient to compare y^k with the process z^k generated by method (2),(5) (for which the iteration and observation numbers coincide) for k = lN, where l = 1, 2, ... .

Suppose that z^0 = y^0 ≠ a. Then if t_1 = min{ k : z^k = a } represents the time at which the process z^k first encounters the optimal point and t_2 = min{ lN : y^{lN} = a } represents the time of the corresponding encounter of the process y^k with the optimal point, it is clear that t_1 ≤ t_2, because from (7) and (8) we have that y^k = z^k for k < t_1. This means that algorithm (2),(5) will get from some remote initial point to the vicinity of the optimal point faster than algorithm (2),(6) with N > 1. Now let us take z^0 = y^0 = a. Then (7) and (8) imply that the correction Δ^k is zero for k < N, while the correction x^k of the process z^k may differ from zero. Therefore in this case z^N ≥ y^N, where y^N corresponds to the first iterate of method (2),(6), and the performance of algorithm (2),(6) with N > 1 becomes superior to that of algorithm (2),(5) after reaching the vicinity of the optimal point. This simple example demonstrates several important properties of constrained stochastic optimization problems, although more work is necessary before we can make any firm theoretical recommendations concerning the choice of the number of samples N. Above all, an appropriate definition of the rate of convergence is needed; recent results by Kushner [10] may be useful in this regard.
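
The two regimes in this example are easy to reproduce numerically. The short Python simulation below (the noise distribution, step size, and horizon are illustrative assumptions) runs method (2),(5) and method (2),(6) on X = [a, ∞), F(z) = z and reports the average final point, once from a remote start and once from the optimum.

    import numpy as np

    def run_single(z0, a, rho, n_obs, rng):
        # Method (2),(5): one observation per iteration, projection onto [a, inf) at every step.
        z = z0
        for _ in range(n_obs):
            z = max(a, z - rho * (1.0 + rng.normal(0.0, 2.0)))
        return z

    def run_averaged(z0, a, rho, n_obs, N, rng):
        # Method (2),(6): N observations averaged per iteration, step size N * rho.
        z = z0
        for _ in range(n_obs // N):
            omega_bar = rng.normal(0.0, 2.0, size=N).mean()
            z = max(a, z - N * rho * (1.0 + omega_bar))
        return z

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        a, rho, n_obs, N, reps = 0.0, 0.05, 400, 20, 2000
        for z0, label in [(5.0, "remote start"), (0.0, "start at optimum")]:
            d1 = np.mean([run_single(z0, a, rho, n_obs, rng) for _ in range(reps)])
            dN = np.mean([run_averaged(z0, a, rho, n_obs, N, rng) for _ in range(reps)])
            print(label, " single-sample:", round(d1, 3), " averaged:", round(dN, 3))

Near the optimum the averaged variant shows a much smaller upward bias, while from a remote start the single-sample method is no slower per observation, which is the behavior described above.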

A rather general adaptive way of changing the number N would be to begin with a small value of N for the first few iterations (N = 1, for example), and increase N if additional tests show that the current point is in the vicinity of the optimum. The following averaging procedure has been shown to be useful in tests of this type:

    v^s = (1 − a_s) v^{s−1} + a_s ξ^s ,        (9)

where ξ^s is defined by (5) or (6). It can be shown (see [1,2]) that ‖v^s − F_z(z^s)‖ → 0 under rather general conditions, which include ρ_s / a_s → 0.

The decision as to whether to change iV m a y t h e n be based on t h e value of

r, =

\lzs

-

rrX(zs

-

vs)ll. One possibility is t o e s t i m a t e a n d its empirical vari- a n c e a t the s a m e time:

a n d choose N such t h a t o i S @ r s , where t h e value of @ is s e t before beginning t h e iterations. In practice it is s u f i c i e n t t o consider a constant a, a

-

0.01-0.05, where t h e g r e a t e r the r a n d o m n e s s , t h e smaller t h e value of a. Our empirical recommendation for t h e initial value of N is a

:

-

0.1 max 112,

-=,I\.

z l , r z a
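
A minimal Python sketch of this adaptive logic, assuming the averaging (9) and the test quantity r_s = ‖z^s − Π_X(z^s − v^s)‖; the thresholds, the doubling rule for N, and the box projection are illustrative choices rather than the package's exact rule.

    import numpy as np

    def adaptive_sqg(z0, quasigradient, project, a_seq, rho_seq, beta=0.05, n_iter=500):
        # Iteration (2) with the running average (9) used both as a gradient
        # estimate and to decide when to increase the sample size N.
        z, v, r_bar, sigma2, N = np.array(z0, float), None, 0.0, 0.0, 1
        for s in range(n_iter):
            a, rho = a_seq(s), rho_seq(s)
            xi = np.mean([quasigradient(z) for _ in range(N)], axis=0)   # estimate (6)
            v = xi if v is None else (1.0 - a) * v + a * xi              # averaging (9)
            r = np.linalg.norm(z - project(z - v))                       # r_s
            r_bar = (1.0 - a) * r_bar + a * r
            sigma2 = (1.0 - a) * sigma2 + a * (r - r_bar) ** 2
            if np.sqrt(sigma2) > beta * max(r_bar, 1e-12):               # too noisy: near optimum
                N = min(2 * N, 64)                                       # increase sampling
            z = project(z - rho * xi)                                    # step (2)
        return z, N

    if __name__ == "__main__":
        rng = np.random.default_rng(3)
        project = lambda z: np.clip(z, 0.0, 10.0)
        quasigradient = lambda z: 2.0 * (z - rng.normal(1.0, 1.0, size=z.shape))
        z, N = adaptive_sqg([8.0, 8.0], quasigradient, project,
                            a_seq=lambda s: 1.0 / (s + 1) ** 0.6,
                            rho_seq=lambda s: 0.5 / (s + 1))
        print("final iterate:", np.round(z, 3), " final N:", N)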


This method can be used to increase the number of samples per iteration automatically. Another possibility is to alter the value of N interactively; this is one of the options implemented in the interactive package STO, which has recently been developed at IIASA. Numerical experiments conducted with this package show that in problems where f_z(z, ω) has a high variance, choosing a value of N greater than one can bring about considerable improvements in performance.

The method described above uses increasingly precise estimates of the gradient, and therefore shares some of the features of the approximation techniques developed in [3-6] for solving stochastic programming problems. All of the remarks made here concerning sampling are also valid for the other methods of choosing ξ^s described below.

However, it is not always possible to use observations of the gradient f_z(z, ω) of the random function to compute a stochastic quasigradient. In many cases the analytic expression of f_z(z, ω) is not known, and even if it is, it may be difficult to create a subroutine to evaluate it, especially for large-scale problems. In this case it is necessary to use a method which relies only on observations of f(z, ω).

2.2. Finite-difference approximations

If the function F(z) is differentiable, one possibility is to use forward finite differences:

    ξ^s_i = [ f(z^s + δ_s e_i , ω^s_{i,1}) − f(z^s , ω^s_{i,2}) ] / δ_s ,   i = 1, ..., n ,        (10)

or central finite differences:

    ξ^s_i = [ f(z^s + δ_s e_i , ω^s_{i,1}) − f(z^s − δ_s e_i , ω^s_{i,2}) ] / (2 δ_s) ,   i = 1, ..., n ,        (11)

where the e_i are unit basis vectors from R^n.

The most important question here is the value of δ_s. In order to ensure convergence with probability one it is sufficient to take any sequence δ_s such that Σ_{s=1}^{∞} ρ_s² / δ_s² < ∞. If it is possible to take ω^s_{i,1} = ω^s_{i,2}, then any δ_s → 0 will do. However, the method may reach the vicinity of the optimal point much faster if δ_s is chosen adaptively. On the first few iterations δ_s should be large, decreasing as the current point approaches the optimal point. The main reason for this is that taking a large step δ_s when the current point is far from the solution may smooth out the randomness to some extent, and may also overcome some of the problems (such as curved valleys) caused by the erratic behavior of the deterministic function F(z). One possible way of implementing such a strategy in an unconstrained case is given below.

(i) Take a large initial value of δ_0, such as δ_0 ≈ 0.1 max_{z_1, z_2 ∈ X} ‖z_1 − z_2‖.

(ii) Proceed with iterations (2), where ξ^s is determined using (10) or (11). While doing this, compute an estimate of the gradient v^s from (9).

(iii) Take δ_{s+1} = β_2 δ_s whenever the estimate v^s indicates that the current point is in the vicinity of the optimum, and δ_{s+1} = δ_s otherwise, where the threshold β_1 used in this test and the reduction factor β_2 should be chosen before beginning the iterative process.

It can be shown that this process converges when ω^s_{i,1} = ω^s_{i,2}, although it will also produce a good approximation to the solution even if this requirement is not met. Estimate (9) is not the only possibility; in fact, any of the estimates of algorithm performance given in Section 3 would do.

Another strategy is to relate changes in the finite-difference approximation step to changes in the step size. This is especially advisable if the step size is also chosen adaptively (see Section 3). In the simplest case one may fix β_1 > 0 before starting and choose δ_s = β_1 ρ_s, which, although contrary to theoretical recommendations, will nevertheless bring the current point reasonably close to the optimal point. To obtain a more precise solution it is necessary to reduce β_1 during the course of the iterations. This may be done either automatically or interactively; both of these options are currently available in the stochastic optimization package STO.

Finite-difference algorithms (10) and (11) have one major disadvantage, and this is that the stochastic quasigradient variance increases as δ_s decreases. This means that finite-difference algorithms converge more slowly than algorithms which use gradients (5). There are two ways of overcoming this problem. Firstly, if it is possible to make observations of the function f(z, ω) for various values of z and fixed ω, it is a good idea to take the same values of ω for the differences (i.e., ω^s_{i,1} = ω^s_{i,2}) when δ_s is small because this reduces the variance of the estimates quite considerably. Another way of avoiding this increase in the variance is to increase the number of samples used to obtain ξ^s when approaching the optimal point, i.e., to use finite-difference analogues of (6). If there exists a γ > 0 such that N_s δ_s² > γ, where N_s is the number of samples taken at step number s, then the variance of ξ^s remains bounded.

It is sometimes useful to normalize the ξ^s, especially when the variance is large.

Another disadvantage of the finite-difference approach is that it requires n + 1 evaluations of the objective function for forward differences and 2n for central differences, where n is the dimension of the vector z. This may not be acceptable in large-scale problems and in cases where function evaluation is computationally expensive. In this situation a stochastic quasigradient can be computed using some analogue of random search techniques.

2.3. Analogues of random search methods

When it is not feasible to compute n + 1 values of the objective function at each iteration, the following approach (which has something in common with the random search techniques developed for deterministic optimization problems) may be used:

    ξ^s = (n / M_s) Σ_{j=1}^{M_s} [ ( f(z^s + δ_s h^j , ω^s_{j,1}) − f(z^s , ω^s_{j,2}) ) / δ_s ] h^j .        (12)

Here the h^j are vectors distributed uniformly on the unit sphere, M_s is the number of random points and δ_s is the step taken in the random search. The choice of M_s is determined by the computational facilities available, although it is advisable to increase M_s as δ_s decreases. This method of choosing ξ^s has much in common with finite-difference schemes, and the statements made above about the choice of δ_s in the finite-difference case also hold for (12).
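
A Python sketch of a random-search quasigradient in the spirit of (12); the normalization by n, the way directions are sampled, and the noisy test function are assumptions made for the illustration.

    import numpy as np

    def random_search_quasigradient(f, z, delta, M, rng):
        # M random directions on the unit sphere, one forward difference along each,
        # scaled by n so that the estimate is close to F_z(z) on average.
        n = z.size
        xi = np.zeros(n)
        for _ in range(M):
            h = rng.normal(size=n)
            h /= np.linalg.norm(h)                      # uniform direction on the sphere
            diff = (f(z + delta * h, rng) - f(z, rng)) / delta
            xi += (n / M) * diff * h
        return xi

    if __name__ == "__main__":
        rng = np.random.default_rng(5)
        f = lambda z, rng: float(z @ z) + 0.1 * rng.normal()   # noisy quadratic, F_z(z) = 2z
        z = np.array([1.0, -1.0, 2.0, 0.5])
        for M in (2, 8, 32):
            print(M, np.round(random_search_quasigradient(f, z, 0.1, M, rng), 2))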

2.4. Smoothing the objective function

Methods of choosing ξ^s which rely on finite-difference or random search techniques are only appropriate when the objective function F(z) is differentiable. The use of similar procedures in the nondifferentiable case would require some smoothing of the objective function. Suppose that the function F(z) is not differentiable but satisfies the Lipschitz condition, and consider the function

    F(z, r) = ∫ F(z + y) dH(y, r) ,        (13)

where H(y, r) is a probability measure with support in a ball of radius r centered at zero. We shall assume for simplicity that H(y, r) has nonzero density inside this ball. The function F(z, r) is differentiable and F(z, r) → F(z) uniformly over every compact set as r → 0. It is now possible to minimize the nonsmooth function F(z) by computing stochastic quasigradients for the smooth functions F(z, r) and find the optimal solution of the initial problem by letting r → 0. This idea was proposed in [11] and studied further in [12]. It is not actually necessary to calculate the integral in (13); it is sufficient to compute ξ^s using equations (10)-(12), but at the point z^s + y^s rather than the point z^s, where y^s is a random variable distributed according to H(y, r_s). In this case (10) becomes:

    ξ^s_i = [ f(z^s + y^s + δ_s e_i , ω^s_{i,1}) − f(z^s + y^s , ω^s_{i,2}) ] / δ_s ,   i = 1, ..., n .        (14)

The most commonly used distribution H(y, r) is the uniform distribution on an n-dimensional cube of side r. If we want to have convergence with probability one we should choose r_s such that δ_s / r_s → 0 and (r_s − r_{s+1}) / ρ_s → 0. In practical computations it is also advisable to choose the smoothing parameter r_s in a similar way to δ_s, using one of the adaptive procedures discussed above.
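
The following Python sketch evaluates the perturbed-point estimate (14) for a nondifferentiable test function; the uniform-cube perturbation and the piecewise-linear objective are illustrative assumptions.

    import numpy as np

    def smoothed_fd_quasigradient(f, z, delta, r, rng):
        # Estimate (14): forward differences taken at the randomly shifted point z + y,
        # where y is uniform on the cube [-r/2, r/2]^n (a common choice for H(y, r)).
        n = z.size
        y = rng.uniform(-r / 2.0, r / 2.0, size=n)
        base = f(z + y, rng)
        xi = np.empty(n)
        for i in range(n):
            e_i = np.zeros(n); e_i[i] = 1.0
            xi[i] = (f(z + y + delta * e_i, rng) - base) / delta
        return xi

    if __name__ == "__main__":
        rng = np.random.default_rng(6)
        # Nondifferentiable noisy objective: f(z, omega) = sum |z_i| + noise.
        f = lambda z, rng: float(np.sum(np.abs(z))) + 0.01 * rng.normal()
        z = np.array([0.3, -0.7])
        est = np.mean([smoothed_fd_quasigradient(f, z, 0.05, 0.2, rng) for _ in range(200)], axis=0)
        print(np.round(est, 2))   # close to the subgradient (1, -1) away from the kinks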

Smoothing also has beneficial side effects in that it improves the behavior of the deterministic function F(z). In the case where F(z) may be written as the sum of two functions, one with a distinct global minimum and the other with highly oscillatory behavior, smoothing may help to overcome the influence of the oscillations, which may otherwise lead the process to local minima far from the global one. Thus it can sometimes be useful to smooth the objective function even if we can obtain a gradient f_z(z, ω). In this case we should take a large value for the smoothing parameter r_s on the first few iterations, decreasing it as we approach the optimal point. The points at which r_s should be decreased may be determined using the values of additional estimates, such as those described below in Section 3 or given by (9). Everything said about the choice of the finite-difference parameter δ_s is also valid for the choice of the smoothing parameter, including the connection between the step size and the smoothing parameter and the possibility of interactive control of r_s. The only difference is that a decrease in r_s does not lead to an increase in the variance of ξ^s and that it is preferable to have δ_s < r_s. This is also reflected in the stochastic optimization software developed at IIASA.

All of the methods discussed so far use only the information available at the current point or in its immediate vicinity. We shall now discuss some more general ways of choosing the step direction which take into account the information obtained at previous points.

2.5. Averaging over preceding iterations

The definition of a stochastic quasigradient given in (3) allows us to use information obtained at previous points as the iterations proceed; this information may sometimes lead to faster convergence to the vicinity of the optimal point. One possible way of using such information is to average the stochastic quasigradients obtained in preceding iterations via a procedure such as (9).

The v^s obtained in this way may then be used in method (2). This is another way of smoothing out randomness and neutralizing such characteristics of deterministic behavior as curved valleys and oscillations. Methods of this type may be viewed as stochastic analogues of conjugate gradient methods and were first proposed in [13]. We can choose ξ^s according to any of (5), (6), (10), (11), (12), or (14). Since v^s → F_z(z^s) under rather general conditions (see [1,2]), method (9) can be considered as an alternative to method (6) for deriving precise estimates of the gradient F_z(z). This method has an advantage over (6) in that it provides a natural way of using rough estimates of F_z(z^s) on the first few iterations and then gradually increasing the accuracy as the current point approaches the optimal point. In this case (9) can be incorporated in the adaptive procedures used to choose the smoothing parameter and the step in the finite-difference approximation.

However, it is not always necessary to take a_s → 0, because we have convergence for any 0 ≤ a_s ≤ 1. Sometimes it is even advantageous to take a_s = a = constant, because in this case more emphasis is placed on information obtained in recent iterations. In general, the greater the randomness, the smaller the value of a that should be taken. Another averaging technique is given by

    v^s = (1 / M_s) Σ_{i = s − M_s + 1}^{s} ξ^i ,        (15)


where M_s is the size of the memory, which may be fixed.
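
A two-line Python sketch of the finite-memory average (15), using a deque as the sliding window (an implementation choice, not prescribed by the paper).

    from collections import deque
    import numpy as np

    def make_memory_average(M):
        # Keeps the last M quasigradient observations and returns their mean, as in (15).
        window = deque(maxlen=M)
        def update(xi):
            window.append(np.asarray(xi, dtype=float))
            return np.mean(list(window), axis=0)
        return update

    if __name__ == "__main__":
        avg = make_memory_average(M=3)
        for xi in ([2.0, 0.0], [0.0, 2.0], [4.0, 4.0], [0.0, 0.0]):
            print(avg(xi))   # mean over at most the 3 most recent observations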

2.6. Using second-order information

There is strong evidence that in some cases setting

    v^s = H_s ξ^s        (16)

may bring about considerable improvements in performance. Here ξ^s can be chosen in any of the ways discussed above. The matrix H_s should be positive definite and take into account both the second-order behavior of the function F(z) and the structure of the random part of the problem. One possible way of obtaining second-order information is to use analogues of quasi-Newton methods to update the matrix H_s. To implement this approach, which was proposed by Wets in [3], it is necessary to have ‖v^s − F_z(z^s)‖ → 0.

3. CHOICE OF STEP SIZE

The simplest way of choosing the step-size sequence in (2) is to do it before starting the iterative process. Convergence theory suggests that any series with the properties

    ρ_s ≥ 0 ,   Σ_{s=0}^{∞} ρ_s = ∞ ,   Σ_{s=0}^{∞} ρ_s² < ∞        (17)

can be used as a sequence of step sizes. In addition, it may be necessary to take into account relations between the step size and such things as the smoothing parameter or the step in a finite-difference approximation. Relations of this type have been briefly described in the preceding sections. In most cases the choice ρ_s ~ C/s, which obviously satisfies (17), provides the best possible asymptotic rate of convergence. However, since we are mainly concerned with reaching the vicinity of the solution, rule (17) is of limited use because a wide variety of sequences can be modified to satisfy it. The other disadvantage of choosing the step-size sequence in advance is that this approach does not make any use of the valuable information which accumulates during solution. These "programmed" methods thus perform relatively badly in the majority of cases.


The best strategy therefore seems to be to choose the step size using an interactive method. It is assumed that the user can monitor the progress of the optimization process and can intervene to change the value of the step size or other parameters. This decision should be based on the behavior of the estimates F̄^s of the current value of the objective function. The estimates may be very rough and are generally calculated using only one observation per iteration, as in the following example:

    F̄^s = (1/s) Σ_{k=1}^{s} f(z^k, ω^k) .        (18)
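
A Python sketch of the two kinds of function estimate used for monitoring: the equal-weight average (18) and a recency-weighted variant with fixed γ (formula (19) below); the particular γ value is only taken from the range suggested later in the text.

    class FunctionMonitor:
        """Running estimates of F(z^s) from one noisy observation per iteration."""

        def __init__(self, gamma=0.05):
            self.s = 0              # number of observations seen
            self.mean = 0.0         # estimate (18): equal-weight average
            self.ewma = None        # weighted estimate with fixed gamma: recent-biased
            self.gamma = gamma

        def update(self, f_obs):
            self.s += 1
            self.mean += (f_obs - self.mean) / self.s
            self.ewma = f_obs if self.ewma is None else (1 - self.gamma) * self.ewma + self.gamma * f_obs
            return self.mean, self.ewma

    if __name__ == "__main__":
        import random
        random.seed(0)
        mon = FunctionMonitor(gamma=0.05)
        for s in range(1, 201):
            noisy_value = 1.0 / s + random.gauss(0.0, 0.5)   # hypothetical decreasing objective + noise
            mean, ewma = mon.update(noisy_value)
        print(round(mean, 3), round(ewma, 3))   # the weighted estimate tracks recent values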

It appears that although the observations f(z^s, ω^s) may vary greatly, the estimates F̄^s display much more regular behavior. Monitoring the behavior of some components of the vector z^s in addition to the F̄^s also seems to be useful. One possible implementation of the interactive approach may proceed along the following lines:

(i) The user first chooses the value of the step size and keeps it constant for a number of iterations (usually 10-20). During this period the values of the estimate F̄^s and some of the components of the vector z^s are displayed, possibly with some additional information.

(ii) The user decides on a new value for the step size using the available information. Three different cases may occur:

- The current step size is too large. In this case both the values of the estimate F̄^s and the values of the monitored components of z^s exhibit random jumps. It is necessary to decrease the step size.

- The current step size is just right. In this case the estimates decrease steadily and some of the monitored components of the current vector z^s also exhibit regular behavior (steadily decrease or increase). This means that the user may keep the step size constant until oscillations occur in the estimate and/or in the components of the current vector z^s.

- The current step size is too small. In this case the estimate F̄^s will begin to change slowly, or simply fluctuate, after the first few iterations, while the change in z^s is negligible. It is necessary to increase the step size.


(iii) Continue with the iterations, periodically performing step (ii), until changes in the step size no longer result in any distinct trend in either the function estimate or the current vector z^s, which will oscillate around some point. This will indicate that the current point is close to the solution.

This method of choosing the step size requires an experienced user, but we have found that the necessary skills are quickly developed by trial and error. The main reasons for adopting an interactive approach may be summarized as follows:

- Interactive methods make the best use of the information which accumulates during the optimization process.

- Because the precise value of the objective function is not available, it is impossible to use the rules for changing the step size developed in deterministic optimization (e.g., line searches).

- Stochastic effects make it extremely difficult to define formally when the step size is "too big" or "too small"; theoretical research has not thrown any light on this problem.

The main disadvantage of the interactive approach is that much of the user's time is wasted if it takes the computer a long time to make one observation f(z^s, ω^s). For this reason a great effort has been made to develop automatic adaptive ways of choosing the step size, in which the value of the step size is chosen on the basis of information obtained at all or some of the previous points z^i, i = 1, ..., s. Methods of this type are considered in [14-20]. The approach described in the following sections involves the estimation of some measures of algorithm performance which we denote by φ_i(z̄^s, u^s), where z̄^s represents the whole sequence {z^1, z^2, ..., z^s} and u^s the set of parameters used in the estimate. In general, algorithm performance measures are attempts to formalize the notions of "oscillatory behavior" and "regular behavior" used in interactive step-size regulation, and possess one or more of the following properties:

- the algorithm performance measure is quite large when the algorithm exhibits distinct regular behavior, i.e., when the estimates of the function value decrease or the components of the current vector z^s show a distinct trend;

- the algorithm performance measure becomes small and even changes its sign if the estimates of the current function value stop improving or if the current point starts to oscillate chaotically;

- the algorithm performance measure is large far from the solution and small in the immediate vicinity of the optimal point.

Automatic adaptive methods for choosing the step size begin with some reasonably large value of the step size, which is kept constant as long as the value of the algorithm performance measure remains high, and then decreases when the performance measure becomes less than some prescribed value. The behavior of the algorithm usually becomes regular again after a decrease in the step size, and the value of the performance measure increases; after a number of iterations oscillations set in and the value of the performance measure once again decreases. This is a sign that it is time to decrease the step size. A rather general convergence result concerning such adaptive piecewise-linear methods of changing the step size is given in [18]. However, in many cases it is difficult to determine how close the current point is to the optimal point using only one such measure; a more reliable decision can be made using several of the measures described below. Unfortunately, it is not possible to come to any general conclusions as to which performance measure is the "best" for all stochastic optimization problems. Moreover, both the values of the parameters used to estimate the performance measure and the value of the performance measure at which the step size should be decreased are different for different problems. Therefore if we fix these parameters once and for all we may achieve the same poor performance as if we had chosen the whole sequence of step sizes prior to the optimization process. Thus, it is necessary to tune the parameters of automatic adaptive methods to different classes of problems, and the interactive approach can be very useful here. An experienced user would have little difficulty in using the values of the performance measures to determine the correct points at which to change the step size, and in learning what type of performance measure behavior requires an increase or a decrease in the step size. The interactive approach is of particular use if one iteration is not very time-consuming and there are a number of similar problems to be solved. In this case the user can identify the most valuable measures of performance in the first few runs, fix their parameters and incorporate this knowledge in automatic adaptive step-size selection methods for the remaining problems.


Although interactive methods usually provide the quickest means of reaching the solution, they cannot always be implemented, and in this case automatic adaptive methods prove to be very useful. The stochastic optimization package STO developed at IIASA and the Kiev stochastic and nondifferentiable optimization package NDO both give the user the choice between automatic adaptive methods and interactive methods of determining the step size. Below we describe some particular measures of algorithm performance and methods of choosing the step size.

The main indicators used to evaluate the performance of an algorithm are estimates of such things as the value of the objective function and its gradient.

The averaging procedure (9) may be used to estimate the value of the gradient, as described earlier in this paper. The main advantage of this procedure is that it allows us to obtain estimates of the mean values of the random variables without extensive sampling at each iteration, since a very limited number of observations (usually only one) is made at each iteration. This estimate, although poor at the beginning, becomes more and more accurate as the iterations proceed. One example of such an estimate is (18), which is a special case of the more general formula

    F̄^s = (1 − γ_s) F̄^{s−1} + γ_s φ^s .        (19)

Any observation φ^s with the property

    E( φ^s | z^1, ..., z^s ) = F(z^s) + d_s ,        (20)

where d_s → 0, can be used instead of f(z^s, ω^s) in (19). For example, (6) would do. In order to get lim_{s→∞} | F̄^s − F(z^s) | = 0 it is necessary to have ρ_s / γ_s → 0.

However, estimate (18) assigns all observations of function values the same weight. This sometimes leads to considerable bias in the estimate for all the iterations the user can afford to run. Therefore for practical purposes it is sometimes more useful to adopt procedures of the type described in Section 2 for the estimation of gradients. These include estimate (19) with fixed γ_s = γ, where γ ≈ 0.01-0.05, and the method in which the average is taken over the preceding M_s iterations:

    F̄^s = (1 / M_s) Σ_{k = s − M_s + 1}^{s} f(z^k, ω^k) .        (21)


Although these estimates do not converge asymptotically to F(z^s), they place more emphasis on observations made at recent points. All of the estimates F̄^s may also be used in an interactive mode to determine the step size, as described above. In addition, the values of the parameters used to determine the step size may also be chosen interactively. For example, the values of the parameters b_1 and b_2 appearing in the step-size rules described below can be made to depend on the behavior of F̄^s.

We shall now describe some automatic adaptive rules for choosing the step size. The important point as regards implementation is how to choose the initial value of the step size ρ_0. We suggest that the value of a stochastic quasigradient should first be computed at the initial point, and that the initial value of the step size should then be chosen such that

    l ρ_0 ‖ξ^0‖ ≈ D ,

where l ≈ 10-20 and D is a rough estimate of the size of the domain in which we believe the optimal solution to be located. This means that it is possible to reach the vicinity of each point in this domain within the first 20 iterations or so.

3.1. Ratio of function estimate to the path length

Before beginning the iterations we choose the initial step size ρ_0, two positive constants α_1 and α_2, a sequence M_s and an integer M̄. After every M̄ iterations we revise the value of the step size in the following way:

(i) Compute the quantity

    φ_1(z̄^s, u^s) = ( F̄^{s − M_s} − F̄^s ) / Σ_{k = s − M_s + 1}^{s} ‖ z^k − z^{k−1} ‖ .        (22)

Here the u^s are the averaging parameters used in the estimation of F̄^s together with the memory M_s, while z̄^s is again the whole sequence of points preceding z^s. The quantity

    Σ_{k = s − M_s + 1}^{s} ‖ z^k − z^{k−1} ‖        (23)

is the length of the path taken by the algorithm during the preceding M_s iterations. The function φ_1(z̄^s, u^s) is another example of a measure which can be used to assess algorithm performance.

(ii) Take a new value of the step size:

    ρ_{s+1} = α_1 ρ_s   if  φ_1(z̄^s, u^s) ≤ α_2 ,
    ρ_{s+1} = ρ_s       otherwise.        (24)

In this method the step size is changed at most once every M̄ iterations. This is essential because the function φ_1 changes slowly, and if its value is less than α_2 at iteration number s it is likely that the same will be true at iteration number s+1. Therefore M̄ should lie in the range 5-20. This procedure can be modified in various ways, such as continuing for M̄ iterations with a fixed step size, then starting to compare values until inequality (24) is satisfied, whereupon the step size is reduced. We then wait another M̄ iterations and repeat the procedure.

Recommended values of α_1 and α_2 lie within the ranges 0.5-0.9 and 0.005-0.1, respectively. The number M_s may be chosen to be constant and equal to M̄. If we have a number of similar problems it is very useful to make the first run in a semi-automatic mode, i.e., to intervene in the optimization process to improve the values of the parameters α_1, α_2, M̄; the new values can then be used in a fully automatic mode to solve the remaining problems.
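
A Python sketch of the rule of this section, assuming the ratio measure (22) computed from the running function estimate (18) and the path length over the last M̄ iterations; the parameter values are taken from the recommended ranges, and the driver problem is hypothetical.

    import numpy as np

    def ratio_step_size_controller(rho0, alpha1=0.7, alpha2=0.02, M_bar=10):
        # Revision rule (24): every M_bar iterations, multiply the step size by alpha1
        # if the performance measure (22) has dropped to alpha2 or below.
        state = {"rho": rho0, "f_bar": None, "f_bar_old": None, "path": 0.0, "count": 0}

        def update(f_obs, step_length, s):
            st = state
            st["f_bar"] = f_obs if st["f_bar"] is None else st["f_bar"] + (f_obs - st["f_bar"]) / (s + 1)
            st["path"] += step_length
            st["count"] += 1
            if st["count"] >= M_bar:
                if st["f_bar_old"] is not None and st["path"] > 0.0:
                    phi1 = (st["f_bar_old"] - st["f_bar"]) / st["path"]   # measure (22)
                    if phi1 <= alpha2:
                        st["rho"] *= alpha1                               # rule (24)
                st["f_bar_old"], st["path"], st["count"] = st["f_bar"], 0.0, 0
            return st["rho"]

        return update

    if __name__ == "__main__":
        rng = np.random.default_rng(7)
        z, rho = np.array([5.0, -5.0]), 0.5
        controller = ratio_step_size_controller(rho0=rho)
        for s in range(300):
            omega = rng.normal(0.0, 1.0, size=2)
            v = 2.0 * (z - omega)                            # quasigradient of E||z - omega||^2
            z_new = z - rho * v
            f_obs = float((z - omega) @ (z - omega))         # one noisy objective observation
            rho = controller(f_obs, float(np.linalg.norm(z_new - z)), s)
            z = z_new
        print("final step size:", round(rho, 4), " final iterate:", np.round(z, 3))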

This algorithm is by no means convergent in the traditional sense, but it outperformed traditional choices like C/s in numerical experiments because it normally reaches the vicinity of the optimal point more quickly. However, it is possible to safeguard convergence by considering a second sequence C/s, where C is small, and switching to this sequence if the step size recommended by (24) falls below a certain value. This step size regulation was introduced in [15].

3.2. Use of gradient estimates

Take φ_2(z̄^s, u^s) = ‖v^s‖ instead of φ_1(z̄^s, u^s) in (24), where v^s is one of the gradient estimates discussed above, and the u^s represent all the parameters used, including averaging parameters and the frequency of changes in the step size.


3.3. Ratio of progress and path

The quantity ‖z^{s − M_s} − z^s‖ represents the progress made by the algorithm between iteration number s − M_s and iteration number s. If we keep the step size constant, the algorithm begins to oscillate chaotically after reaching some neighborhood of the optimal point. The smaller the value of the step size, the smaller the neighborhood at which this occurs, and thus the total path between iterations s and s − M_s begins to grow compared with the distance between the points z^{s − M_s} and z^s. This means that the function

    φ_3(z̄^s, u^s) = ‖ z^{s − M_s} − z^s ‖ / Σ_{k = s − M_s + 1}^{s} ‖ z^k − z^{k−1} ‖        (25)

can be used as a performance measure in equation (24).
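
A short Python helper for the progress-to-path ratio of this subsection over a window of M_s iterates (the window handling is an implementation choice).

    import numpy as np

    def progress_to_path(points):
        # points: iterates z^{s-M}, ..., z^s. Returns the distance between the endpoints
        # divided by the total path length: near 1 for steady progress, near 0 once the
        # iterates oscillate around the optimum.
        pts = [np.asarray(p, dtype=float) for p in points]
        path = sum(np.linalg.norm(b - a) for a, b in zip(pts[:-1], pts[1:]))
        return float(np.linalg.norm(pts[-1] - pts[0]) / path) if path > 0.0 else 0.0

    if __name__ == "__main__":
        steady = [[0, 0], [1, 0], [2, 0], [3, 0]]
        oscillating = [[0, 0], [1, 0], [0, 1], [1, 1]]
        print(progress_to_path(steady), round(progress_to_path(oscillating), 3))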

3.4. Analogues of line search techniques

The decision as to whether (and how) to change the step size may be based on the values of the scalar product of adjacent step directions. If we have (v^{s−1}, v^s) > 0, then this may be a sign that regular behavior prevails over stochastic behavior, the function is decreasing in the step direction and the step size should be increased. Due to stochastic effects the function will very often increase rather than decrease, but in the long run the number of bad choices will be less than the number of correct decisions. Analogously, if this inequality does not hold then the step size should be decreased. The rule for changing the step size is thus basically as follows:

    ρ_{s+1} = α_2 ρ_s   if  (v^{s−1}, v^s) > 0 ,
    ρ_{s+1} = α_3 ρ_s   otherwise ,        (26)

where the values of α_1, α_2, α_3 (recommended values α_1 ≈ 0.4-0.8, 1 < α_2 < 1.3 and 0.7 ≤ α_3 < 1) should be chosen before starting the iterations. It is also advisable to have upper and lower bounds on the step size to avoid divergence.

Sometimes it is convenient to normalize the vectors of step directions, i.e., to take ‖v^s‖ = 1. The lower bound may decrease as the iterations proceed. This method may also be applied to the choice of a vector step size, treating some (or all) variables or groups of variables separately. A number of different methods based on the use of scalar products of adjacent step directions to control the step size have been developed by Uriasiev [19], Pflug [16], and Ruszczynski and Syski [20].
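
A Python sketch of the scalar-product heuristic of this subsection; the branch factors and the bounds are illustrative values from the recommended ranges, and the loop is a hypothetical driver rather than any of the cited methods.

    import numpy as np

    def scalar_product_step_size(rho, v_prev, v_curr, up=1.2, down=0.8,
                                 rho_min=1e-4, rho_max=1.0):
        # Increase the step size when adjacent step directions agree ((v^{s-1}, v^s) > 0),
        # decrease it otherwise, and keep rho within fixed bounds to avoid divergence.
        agree = v_prev is not None and float(v_prev @ v_curr) > 0.0
        rho *= up if agree else down
        return min(max(rho, rho_min), rho_max)

    if __name__ == "__main__":
        rng = np.random.default_rng(8)
        z, rho, v_prev = np.array([4.0, -4.0]), 0.2, None
        for s in range(500):
            v = 2.0 * (z - rng.normal(0.0, 1.0, size=2))   # quasigradient of E||z - omega||^2
            v = v / np.linalg.norm(v)                      # normalized direction, as suggested
            z = z - rho * v
            rho = scalar_product_step_size(rho, v_prev, v)
            v_prev = v
        print("final step size:", round(rho, 4), " final iterate:", np.round(z, 3))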

The interactive stochastic optimization package implemented at IIASA (STO) is based on the same ideas as the package for stochastic and nondifferentiable optimization developed in Kiev (NDO). It allows the user to choose between interactive and automatic modes and makes available the stochastic quasigradient methods described in Sections 2 and 3. In the interactive mode the program offers the user the opportunity to change the step parameters and the methods by which the step size and step direction are chosen during the course of the iterations. The user can also stop the iterative process and obtain a more precise estimate of the value of the objective function before continuing. The package is written in FORTRAN-77.

Before initiating the optimization process the user has to:

(i) Provide a subroutine UF which calculates the value of the function f(z, ω) for fixed z and ω and, optionally, a subroutine UG which computes the gradient f_z(z, ω) of this function; the function evaluation subroutine should be of the form:

        FUNCTION UF(N,X)
        DIMENSION X(N)
        ... calculation of f(z, ω) ...
        RETURN
        END

Here N is the dimension of the vector of variables X. (Note that the implementation on the IIASA VAX actually requires the subroutine to be entered in lower-case letters rather than capitals.) A description of the subroutine which calculates a quasigradient is given later in this paper.

(ii) Compile these subroutines with the source code to obtain an executable module.


(iii) Provide at least one of the following additional data files:

- algorithm control file (used only in the non-interactive option)
- parameter file (used only in the interactive option)
- initial data file (should always be present)

All of these files are described in some detail later in the paper.

The optimization process can then begin. The program first asks the user a series of questions regarding the required mode (interactive or automatic), method of step size regulation, choice of step direction, etc. These questions appear on the monitor and should be answered from the keyboard or by reference to a data file. We shall represent the dialogue as follows:

    Question?    Answer

with the user's response given in italics. The first question is

    Interactive mode? reply yes or no    yes/no

To choose the interactive option the user should type in yes (or y); to select the automatic option he should answer no (or n). In the latter case the program would ask no further questions, but would read all the necessary information from the algorithm control file (which is usually numbered 2; under UNIX conventions its name is fort.2). The iterative process would then begin, terminating after 10,000 iterations if no other stopping criterion is fulfilled. The algorithm control file must contain answers to all of the following questions except those concerned either with dialogue during the iterations or with the parameter file (such questions are marked with an asterisk below). This file is given a name only for ease of reference; the important thing for the user is its number.

Assume now t h a t t h e user has chosen t h e interactive option by answering yes t o the first question. The program then asks

    parameter file? (number)

The user should respond either with the number of the file of default parameters or with the number of the file in which the current values of the algorithm parameters are stored. The file of default parameters is provided with the program and has the name fort.12 (under UNIX conventions); thus, to refer the program to the default file the user should answer 12. The purpose of this file is to help the user to set the values of algorithm parameters in the ensuing dialogue and also to store such improved values as may be discovered by the user through trial and error. If the user assigns the algorithm parameters any values other than those in the default file, the new values become the default values in subsequent runs of the program. This file is optional.

The program then asks

    read parameter file? reply yes or no    yes/no

The answer yes implies that the file specified in the previous question exists, and that default parameter values are stored in this file. In this case, when asking the user about parameter values, the program will read the default option in the parameter file and reproduce it on the screen together with the question. If the user accepts this default value he should respond with 0 (zero); otherwise he should enter his own value, which will become the new default value.

The answer no means that no default values are available at the moment. In this case the program will form a new default file (labeled with the number given as an answer to the previous question); its contents will be based on the user's answers to future questions. This new default file, once formed, can be used in subsequent runs.

The next question is

    number of variables? (number)

to which the user should respond with the dimension of the vector of variables z. He is then asked

    initial data file? (number)

and should reply with the number of the initial data file. This file should contain the following elements (in exactly this order):

- The initial point, which should be a sequence of numbers separated by commas or other delimiters.
- Any additional data required by subroutines UF or UG if such data exists and the user chooses to put it in the initial data file (optional).
- Information about the constraints (described in more detail below).

The program then asks

    step size regulation? is

Here is is a positive integer from the set {1,2,3,4,6,7}, where the different values of is correspond to different ways of choosing the step size. (The integer 5 is reserved for an option currently under development.)


1  Adaptive automatic step size regulation (24) based on algorithm performance function (22) and function estimate (18).

2  Manual step size regulation based on algorithm performance function (22) and function estimate (18).

3  Adaptive automatic step size regulation (24) using algorithm performance measure (22) and a function estimate based on a finite number of previous observations (21).

4  Manual step size regulation based on the same estimates of algorithm performance as for is = 3.

6  Automatic step size regulation using algorithm performance measure (24) and function estimate (19) with fixed γ_s.

7  Manual step size regulation based on the same estimates of algorithm performance as for is = 6.

The difference between adaptive automatic and manual step size regulation (see is = 1, 2) is that in the first case the step size is chosen automatically, although the user may terminate the iterations at specified points and continue with another step size regulation, while in the second case the user changes the value of the step size himself. Both step size regulations are based on the same estimates of function value and algorithm performance.

The next question is

    step direction? (5 figures)    id1 id2 id3 id4 id5

The user has to respond with five figures which specify various ways of choosing the step direction, e.g., 11111. We shall refer to these figures as id1, id2, id3, id4 and id5. The subroutine which estimates the step direction makes some number of initial observations ξ^{i,s} at each step; these are then averaged in some way to obtain the vector ξ^s, and the final step direction v^s is calculated using both ξ^s and the values of v^i for i ≤ s.

The value of id1 specifies the nature of the initial observations ξ^{i,s}:

id1  Definition

1  A direct observation of a stochastic quasigradient is available, and the user has to specify a subroutine UG to calculate it:

        SUBROUTINE UG(N,X,G)
        DIMENSION X(N),G(N)
        ... calculation of a stochastic quasigradient ...
        RETURN
        END

   where G(N) is an observation of a stochastic quasigradient.

2  Central finite-difference approximation of the gradient as in (11).


3  The ξ^{i,s} are calculated using random search techniques (12).

4  Forward finite-difference approximation of the initial observations ξ^{i,s} as in (10).

5  Central finite-difference approximation of the gradient as in (11). All observations of the function used in one observation of ξ^{i,s} are made with the same values of the random parameters ω.

6  The ξ^{i,s} are calculated using random search techniques (12). All observations of the function used in one observation of ξ^{i,s} are made with the same values of the random parameters ω.

7  Forward finite-difference approximation of the initial observations ξ^{i,s} as in (10). All observations of the function used in one observation of ξ^{i,s} are made with the same values of the random parameters ω.

Note that for id1 = 5, 6, 7 all observations of the function used in one observation of ξ^{i,s} are made with the same values of the random parameters ω. In this case the user should write a function UF which supports this feature as follows:

        FUNCTION UF(N,X)
        DIMENSION X(N)
        COMMON/OMEG/LO,MO
        ... if LO = 1 and MO = 1 then obtain new values of the random factors ω and set MO = 0 ...
        ... make an observation of the function at the point z ...
        RETURN
        END

The second figure id2 determines the point at which observations are made:

id2  Definition

1  The initial direction is calculated at the current point z^s.

2  The initial direction is calculated at a point chosen randomly from among those in the neighborhood of the current point z^s.

The value of id3 defines the way in which the step in a finite-difference or random search approximation of ξ^{i,s} is chosen:

id3  Definition

1  The approximation step is fixed. The observations of the objective function at the point z^s originally used to obtain the gradient observations ξ^{i,s} are not used to update the estimate of the function employed for step size regulation.
