ADAPTIVE NONMONOTONIC METHODS WITH AVERAGING OF SUBGRADIENTS
N.D. Chepurnoj

July 1987
WP-87-62
Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.
INTERNATIONAL INSTITUTE FOR APPLIED SYSTEMS ANALYSIS A-2361 Laxenburg, Austria
FOREWORD
The numerical methods of nondifferentiable optimization are used for solving decision analysis problems in economics, engineering, the environment and agriculture. This paper is devoted to adaptive nonmonotonic methods with averaging of subgradients. A unified approach is suggested for the construction of new deterministic subgradient methods, their stochastic finite-difference analogs, and a posteriori estimates of the accuracy of solution.
Alexander B. Kurzhanski
Chairman
System and Decision Sciences Program
CONTENTS
1 Overview of Results in Nonmonotonic Subgradient Methods
2 Subgradient Methods with Program-Adaptive Step-Size Regulation
3 Methods with Averaging of Subgradients and Program-Adaptive Successive Step-Size Regulation
4 Stochastic Finite-Difference Analogs to Adaptive Nonmonotonic Methods with Averaging of Subgradients
5 A Posteriori Estimates of Accuracy of Solution to Adaptive Subgradient Methods and Their Stochastic Finite-Difference Analogs
References
ADAPTIVE NONMONOTONIC METHODS WITH AVERAGING OF SUBGRADIENTS

N.D. Chepurnoj
1. OVERVIEW OF RESULTS IN NONMONOTONIC SUBGRADIENT METHODS

Among the existing numerical methods for the solution of nondifferentiable optimization problems, the nonmonotonic subgradient methods hold an important position.
The pioneering work by N.Z. Shor [26] gave impetus to their explosive progress. In 1962, he suggested an iterative process for the minimization of a convex piecewise-linear function, named afterwards the generalized gradient descent (GGD):

x^{s+1} = x^s - r_s g^s ,   s = 0, 1, 2, ... ,   (1.1)

where g^s ∈ ∂f(x^s), ∂f(x^s) is the set of subgradients of the function f(x) at the point x^s, and r_s ≥ 0 is a step-size.
For differentiable functions this method agrees very closely with the well-known gradient method. The fundamental difference between them is that the motion direction (-g^s) in (1.1) is, as a rule, not a descent direction.
At the first attempts to substantiate theoretically the convergence of procedures of the type (1.1), researchers immediately faced two difficulties. For one thing, the objective function lacked the property of differentiability. For another, method (1.1) was not monotonic. These combined features rendered impractical the use of the known convergence theorems for gradient procedures. New theoretical approaches therefore became a must.
One more "misfortune" came on the heels of the others: numerical computations demonstrated that GGD has a low convergence rate.
Initially g r e a t hopes were pinned on t h e step-size selection s t r a t e g y as a way towards overcoming t h e crisis.
By the early 1970s the difficulties caused by the formal substantiation of convergence of nonmonotonic subgradient procedures had been mastered and different approaches to the step-size regulation had been offered [6, 7, 8, 19, 20, 26]. However, the computations continued to prove the poor convergence of GGD in practice.
It can be said that the first stage in GGD evolution was over in 1976.
Thereupon the numerical methods of nondifferentiable optimization developed in three directions, i.e., methods with space dilation, monotonic methods, and adaptive nonmonotonic methods were explored.

Let us dwell on each of these approaches.
In an effort to enhance the GGD efficiency, N.Z. Shor elaborated methods where the operation of space dilation in the direction of a subgradient, and of the difference between two successive subgradients, was employed. Literally the next few years were prolific in papers [27, 28, 29] investigating the space dilation operation in nondifferentiable function minimization problems. A high rate of convergence of the suggested methods was corroborated theoretically. Computational practice attested convincingly to the advantageousness of applying the algorithms with space dilation, especially the r-algorithm [29], as an alternative to GGD, provided the dimension of the space does not exceed 200 to 300.
However, if the dimension is large, first, a considerable amount of computation is spent on the transformation of the space dilation matrix and, second, some extra capacity of computer memory is required.
The monotonic methods became another essential direction.
Even though the first papers on the monotonic methods appeared back in 1968 (V.F. Dem'janov [30]), their progress reached its peak in the early 70's. Two classes of these algorithms should be distinguished here: the ε-steepest descent [5, 30] and the ε-subgradient algorithms [31-34]. We shall not examine them in detail but note that the monotonic methods offered a higher rate of convergence as against GGD. Just as with the methods using space dilation, vast dimensions of the problems to be solved still remained the Achilles' heel of the monotonic algorithms.
Thus, the nonmonotonic subgradient methods have come into particular importance in the solution of large-scale nondifferentiable optimization problems.

The nonmonotonic procedures have another important object of application, apart from the large-scale problems, i.e., the problems in which the subgradient cannot be precisely defined at a point. The latter encompass problems of identification, learning, and pattern recognition [1, 21]. The minimized function is there a mathematical expectation whose distribution law is unknown. Errors in subgradient calculation may stem from computation errors and many other real processes.
Ju.M. Ermol'ev and Z.V. Nekrylova [9] were the first to investigate such procedures. Stochastic programming problems have increasingly drawn attention to the nonmonotonic subgradient methods.
However, as pointed out earlier, GGD, although widely used, resistant to errors in subgradient computations, and saving of memory capacity, still had a poor rate of convergence. Of great importance therefore was the construction of nonmonotonic methods that, on the one hand, retain all the advantages of GGD and, on the other, possess a high rate of convergence.

It has been this requirement that has led to the elaboration of the adaptive nonmonotonic procedures.
An analysis revealed that the Markov nature of GGD is the chief cause of its slow convergence. It is quite obvious that the use of the most intimate knowledge of the progress of the computations is indispensable to the selection of the direction and the regulation of the step-size.
Several ideas provided the basis for the development of adaptive nonmonotonic methods.

The major concept of all techniques for selecting the direction and regulating the step-size was the use of information about the fulfillment of the necessary extremum conditions for the function. Its implementation is the methods with averaging of the subgradients.

In the most general case, by the operation of averaging is meant a procedure of "taking" the convex hull of an arbitrary finite number of vectors.
The operation of averaging in the numerical methods was first applied by Ja.Z. Cypkin [22] and Ju.M. Ermol'ev [11].
The paper by A.M. Gupal and L.G. Bazhenov [3], also dealing with the use of the operation of averaging of stochastic estimates of the generalized gradients, appeared in 1972.
However, all the above papers considered the program regulation of the step-size, i.e., a sequence {r_s} independent of the computations was selected such that

r_s ≥ 0 ,   r_s → 0 ,   Σ_{s=0}^∞ r_s = ∞ .
The next natural stage in the evolution of this concept was the construction of adaptive step-size regulation using the operation of averaging of the preceding subgradients.
In 1974, E.A. Nurminskij and A.A. Zhelikovskij [18] suggested a successive program-adaptive regulation of the step-size for the quasi-gradient method of minimization of a weakly convex function.
The crux of this regulation consists in the following. Let an iterative sequence be constructed according to the rule

x^{s+1} = x^s - r_0 g^s ,   s = 0, 1, 2, ... ,

where g^s ∈ ∂f(x^s) is a quasi-gradient of the function f(x) at the point x^s, and r_0 is a constant step-size.
Assume that there exist x̄ ∈ E^n and numerical parameters ε > 0, δ > 0 such that for any s = 0, 1, 2, ... we have ||x^s - x̄|| ≤ δ. Let us suppose also that a convex combination of the subgradients {g^i}, i ≤ s_0, exists such that

||e^{s_0}|| ≤ ε ,   e^{s_0} ∈ conv {g^i : i ≤ s_0} .

Then the point x̄ is sufficiently close to the set X* = Argmin f(x) according to the necessary extremum conditions. In the given case the step-size has to be reduced and the procedure repeated with the new step-size value r_1, starting at the obtained point x^{s_0}. The numerical realization of the described algorithm requires a specific rule for constructing the vectors e^s. In [18] the vector e^s is constructed by the rule

e^s = Proj (0 | conv {g^k : k ≤ s}) ,

that is, all quasi-gradients are included into the convex hull starting from the most recent instant of the step-size change. Numerical computations bore out the expediency of such regulation. However, a grave disadvantage was inherent in it: the great laboriousness of an iteration. Considering that the approach as a whole holds promise, averaging schemes had to be developed for efficient use when selecting the direction and regulating the step-size.
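The projection step in this rule is itself a small optimization problem: finding the minimum-norm point of the convex hull of the accumulated quasi-gradients. A minimal numpy sketch, using the Frank-Wolfe method with exact line search (the solver choice is an illustration here, not the paper's prescription):

```python
import numpy as np

def proj_zero_onto_hull(G, iters=100):
    """Approximate Proj(0 | conv{rows of G}): minimize ||G^T lam||^2
    over the simplex by the Frank-Wolfe method with exact line search."""
    m = G.shape[0]
    lam = np.zeros(m)
    lam[0] = 1.0                       # start at the first vertex
    for _ in range(iters):
        e = G.T @ lam                  # current point of the hull
        k = int(np.argmin(G @ e))      # vertex minimizing the linearization
        d = G[k] - e                   # Frank-Wolfe direction
        denom = d @ d
        if denom == 0.0:
            break
        gamma = np.clip(-(e @ d) / denom, 0.0, 1.0)   # exact line search
        lam *= (1.0 - gamma)
        lam[k] += gamma
    return G.T @ lam

# two subgradients of f(x) = |x1| + |x2| taken near the kink line x1 = 0:
e = proj_zero_onto_hull(np.array([[1.0, 1.0], [-1.0, 1.0]]))
print(e)   # -> [0. 1.], the minimum-norm convex combination
```

If the resulting norm ||e|| is below the current tolerance, the necessary extremum condition is nearly satisfied and the step-size is reduced, exactly as in the rule above.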
This paper treats such averaging schemes. They serve as a foundation for new nonmonotonic subgradient methods, for the description of stochastic finite-difference analogs, and for a posteriori estimates of solution accuracy. Prior to discussing results, let us make some general assumptions. Presume that the minimization problem is being solved on the entire space:

min f(x) ,   x ∈ E^n ,   (*)

where E^n is an n-dimensional Euclidean space. The function f(x) will everywhere be thought of as a proper convex function, dom f = E^n, with the sets {x : f(x) ≤ C} bounded for any constant C. The set of solutions of problem (*) will be denoted by X* = Argmin f(x).
2. SUBGRADIENT METHODS WITH PROGRAM-ADAPTIVE STEP-SIZE REGULATION
The concept of adaptive successive step-size regulation has already been set forth. In [23] a way of determining the instant of the step-size variation was suggested. Central to it was the simplest scheme of averaging of the preceding subgradients. This method is easy to implement and effects a saving in computer memory capacity. Compared to the program regulation, the adaptive regulation improves the convergence of the subgradient methods.
Description of Algorithm 1

Let x^0 be an arbitrary initial point, b > 0 be a constant, {ε_k}, {r_k} be number sequences such that ε_k > 0, ε_k → 0, r_k > 0, r_k → 0. Put s = 0, j = 0, k = 0, e^0 = g^0 ∈ ∂f(x^0).

Step 1. Construct

x^{s+1} = x^s - r_k g^s ,   g^s ∈ ∂f(x^s) .

Step 2. If f(x^{s+1}) > f(x^0) + b, then select x^{s+1} ∈ {x : f(x) ≤ f(x^0)} and go to Step 5.

Step 3. Define

e^{s+1} = e^s + (s - j + 2)^{-1} (g^{s+1} - e^s)

(the average of the subgradients computed since the last step-size change).

Step 4. If ||e^{s+1}|| > ε_k, then s = s + 1 and go to Step 1.

Step 5. Set k = k + 1, j = s + 1, s = s + 1 and go to Step 1.

THEOREM 1.1 Assume that the problem (*) is solved by Algorithm 1. Then all limit points of the sequence {x^s} belong to X*.
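In code, Algorithm 1 amounts to a subgradient loop whose step-size is reduced whenever the running mean of the subgradients since the last change becomes small. The sketch below is an illustration under assumed choices: the test function, the geometric sequences ε_k = r_k = 2^{-k}, and the use of the best recorded point for the reset in Step 2 are not taken from the paper.

```python
import numpy as np

def f(x):
    return abs(x[0]) + 2 * abs(x[1])          # illustrative nonsmooth convex f

def subgrad(x):
    # one subgradient of f at x (the choice of sign at kinks is arbitrary)
    return np.array([1.0 if x[0] >= 0 else -1.0,
                     2.0 if x[1] >= 0 else -2.0])

def algorithm1(x0, b=1.0, n_iter=4000):
    x = np.asarray(x0, float)
    f0, best = f(x), x.copy()
    k = 0                                     # index of eps_k = r_k = 2^{-k}
    g = subgrad(x)
    e, cnt = g.copy(), 1                      # running mean of subgradients
    for _ in range(n_iter):
        x = x - 0.5 ** k * g                  # Step 1
        if f(x) > f0 + b:                     # Step 2: reset to a record point
            x = best.copy()
            k += 1; cnt = 0
        g = subgrad(x)
        e = g if cnt == 0 else e + (g - e) / (cnt + 1)   # Step 3: average
        cnt += 1
        if np.linalg.norm(e) <= 0.5 ** k:     # Steps 4-5: shrink the step
            k += 1; cnt = 0
        if f(x) < f(best):
            best = x.copy()
    return best

x_best = algorithm1([3.0, -2.0])
print(f(x_best))   # small: the record values approach min f = 0
```

The key nonmonotonic feature is visible in the trace: individual iterates oscillate across the kinks, but the averaged subgradient then nearly cancels, which is precisely the signal used to halve the step.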
PROOF Denote the instants of step-size variations by s_m. Let us prove that the step-size r_k varies an infinite number of times. Suppose it is not so, i.e., the step-size does not vary starting from an instant s_m and is equal to r_m. Then the points x^s for s ≥ s_m belong to the set

{x : f(x) ≤ f(x^0) + b}

and are related by

x^{s+1} = x^s - r_m g^s .

Considering that the step-size does not vary, ||e^s|| > ε_m > 0 for s ≥ s_m. In passing to the limit as s → ∞ in the resulting inequality for f(x^s) we obtain a contradiction with the boundedness of the set above.

The further proof of Theorem 1.1 amounts to checking the general conditions of algorithm convergence derived by E.A. Nurminskij [17].
NURMINSKIJ THEOREM Let the sequence {x^s} and the set of solutions X* be such that the following conditions are satisfied:

D1. For any subsequence {x^{s_k}} converging to a point x', ||x^{s_k+1} - x^{s_k}|| → 0.

D2. There exists a closed bounded set S such that the sequence {x^s} belongs to S.

D3. For any subsequence {x^{n_k}} such that x^{n_k} → x' ∉ X*, there exists ε_0 > 0 such that for all 0 < ε ≤ ε_0 and any k

inf {m > n_k : ||x^m - x^{n_k}|| > ε} = m_k < ∞ .

D4. A continuous function W(x) exists such that for an arbitrary subsequence {x^{n_k}} with x^{n_k} → x' ∉ X*, and for the subsequence {x^{m_k}} corresponding to it by condition D3, for arbitrary 0 < ε ≤ ε_0

lim sup_{k→∞} W(x^{m_k}) < lim_{k→∞} W(x^{n_k}) .

D5. The function W(x) of condition D4 assumes no more than a countable number of values on the set X*.

Then all limit points of the sequence {x^s} belong to X*.

Select the function f(x) as the function W(x). Conditions D1, D5 are satisfied in view of the algorithm structure and the earlier assumptions.
The rest of the conditions will be verified by the following scheme. We will prove that conditions D3, D4 hold for the inner points of the set

S = {x : f(x) ≤ f(x^0)} .

It is therewith obvious that

max_{x ∈ S} W(x) ≤ inf_{x ∉ S} W(x) .

Then the sequence {x^s} falls outside the set S only a finite number of times. Consequently, condition D2 is satisfied and this automatically entails the validity of D3 and D4.

So, let a subsequence {x^{n_p}} exist such that x^{n_p} → x' ∉ X*. Assume at this stage of the proof that x' ∈ int S. We will prove that there exists ε_0 > 0 such that for all 0 < ε ≤ ε_0 and an arbitrary p:

inf {m > n_p : ||x^m - x^{n_p}|| > ε} = m_p < ∞ .   (2.1)
E 3 r0 at a n a r b i t r a r y p:Now s u p p o s e condition (2.1) i s not s a t i s f i e d , t h a t i s , f o r a n y r
>
0 t h e r e e x i s t s n p such t h a t l1zs-
zn41 3 r f o r all s>
np.W e have
f o r sufficiently l a r g e n p and s
>
n,,. By the supposition 0 Z B f ( x ' ) . By v i r t u e of t h e closedness, convexity and u p p e r semi-continuity of t h e many-valued mapping a f ( x ) t h e r e e x i s t s E>
0 such t h a t 0=
conv G q c ( z '), where conv1.1
i s a convex hull and G 4 r ( x ' ) i s a setI t i s easily s e e n t h a t E
>
0 can b e always s e l e c t e d in s u c h a way t h a t U I e ( x ') C int S, where ( z)=
x : z-
x1
5 4.
Let δ = min {||ḡ|| : ḡ ∈ conv G_{4ε}(x')}. Obviously δ > 0. As ε_k → 0, there exists an integer K(δ) such that for k ≥ K(δ) we have ε_k ≤ δ/2. Put n_p ≥ K(δ). Then it is readily seen that for s ≥ n_p the step-size r_k can vary no more than once within the set U_{4ε}(x'). Examine the sequence {x^s} separately on the intervals n_p ≤ s < s_p*, where

s_p* = min {s_m : s_m ≥ n_p} .

When n_p ≤ s < s_p* the points x^s are related as follows

x^{s+1} = x^s - r_l g^s ,

where the index l is determined with respect to s_p*. Let us consider the scalar products (e^s, g^{s+1}), where e^{n_p} = g^{n_p}. Since e^s ∈ conv G_{4ε}(x') for s ≥ n_p, it is possible to prove that there exists an index N_1 such that

(e^{N_1}, g^{N_1+1}) ≥ γ ,   γ = δ²/2 .
We next consider the scalar products

d_s = (x^{N_1+1} - x^s, g^s) = r_l (s - N_1 - 1)(e^{s-1}, g^s) ,   s ≥ N_1 + 1 .

An index N_2 exists such that (e^{N_2}, g^{N_2+1}) ≥ γ and d_{N_2+1} ≥ r_l (N_2 - N_1) γ. Then in a similar way we can prove the existence of indices N_t (t ≥ 3) with the same property. It is easy to prove that N_{t+1} - N_t ≤ N < ∞, t = 1, 2, ... .
Let N_{t_0} be the maximal of the indices N_t that does not exceed s_p*. Since s_p* - N_{t_0} ≤ N, as p → ∞ the last term on the right-hand side of the resulting inequality approaches zero. We finally obtain

f(x^{s_p*}) - f(x^{n_p}) ≤ -r_l (s_p* - n_p) γ + ε_p' ,   (2.2)

where ε_p' → 0 as p → ∞.

It is not difficult to notice that the reasoning which underlies the derivation of inequality (2.2) may also be repeated without changes for the interval s ≥ s_p* to get

f(x^m) - f(x^{s_p*}) ≤ -r_{l+1} (m - s_p*) γ + ε_p'' .   (2.3)

Adding (2.2) to (2.3) we obtain

f(x^m) - f(x^{n_p}) ≤ -r_{l+1} (m - n_p) γ + ε̄_p .   (2.4)

In passing to the limit as m → ∞ in inequality (2.4) we are led to a contradiction with the boundedness of a continuous function on the closed bounded set U_{4ε}(x'). Consequently, condition (2.1) is proved.
Let

m_p = inf {m > n_p : ||x^m - x^{n_p}|| > ε} .

By construction x^{m_p} ∉ U_ε(x^{n_p}), but for sufficiently large p all the reasoning involved in the derivation of inequality (2.4) remains valid for the instant m_p, that is, we have

f(x^{m_p}) - f(x^{n_p}) ≤ -r_{l+1} (m_p - n_p) γ + ε̄_p .

In passing to the limit as p → ∞ we get

lim sup_{p→∞} W(x^{m_p}) < lim_{p→∞} W(x^{n_p}) .
The further proof of this theorem follows from the Nurminskij theorem.
To fix more precisely the instant when the iteration process gets into the neighborhood of the solution, we can employ the following modification of Algorithm 1, provided the computer capacity allows.
Let x^0 be an arbitrary initial point, d > 0 be a constant, {ε_k}, {r_k} be number sequences such that ε_k > 0, ε_k → 0, r_k > 0, r_k → 0; let k_1, k_2, ..., k_m be positive bounded integer constants. Put s = 0, j = 0, k = 0, e^0 = g^0 ∈ ∂f(x^0).

Step 1 Construct

x^{s+1} = x^s - r_k g^s ,   g^s ∈ ∂f(x^s) .

Step 2 If f(x^{s+1}) > f(x^0) + d, then select x^{s+1} ∈ {x : f(x) ≤ f(x^0)} and go to Step 5.

Step 3 Define

e_0^{s+1} = e_0^s + (s - j + 2)^{-1} (g^{s+1} - e_0^s) ,
e_i^{s+1} = P_i(g^{s-k_i+1}, ..., g^{s+1}) ,   i = 1, ..., m ,

where each of the notations P_i(·, ..., ·) designates an arbitrary convex combination of a finite number of the indicated preceding subgradients. Find

ρ_{s+1} = min_{0 ≤ i ≤ m} ||e_i^{s+1}|| .

Step 4 If ρ_{s+1} > ε_k, then s = s + 1 and go to Step 1.

Step 5 Set k = k + 1, j = s + 1, s = s + 1, e^s = g^s and go to Step 1.

THEOREM 2.1 Suppose that the problem (*) is solved by the modified Algorithm 1. Then all limit points of the sequence {x^s} belong to X*.
3. METHODS WITH AVERAGING OF SUBGRADIENTS AND PROGRAM-ADAPTIVE SUCCESSIVE STEP-SIZE REGULATION
Successive Step-Size Regulation

As noted in a number of works [2, 3, 12, 16], it is expedient to average the subgradients calculated at the previous iterations so that the subgradient methods will be more regular. For instance, when "ravine"-type functions are minimized, the averaged direction points the way along the bottom of the "ravine".

It will be demonstrated in Section 5 that the operation of averaging enables the improvement of a posteriori estimates of the solution accuracy along with the upgrading of the regularity of the described methods.

Methods with averaging of subgradients and successive program-adaptive regulation of the step-size are set forth in this section. Results obtained here stem from [24].
Description of Algorithm 2

Let x^0 be an arbitrary initial approximation; b̄ > 0 be a constant; {ε_k}, {r_k} be number sequences such that ε_k > 0, ε_k → 0, r_k > 0, r_k → 0. Put s = 0, j = 0, k = 0, e^0 = v^0 = g^0 ∈ ∂f(x^0).

Step 1 Construct

x^{s+1} = x^s - r_k v^s .

Step 2 If f(x^{s+1}) > f(x^0) + b̄, then go to Step 7.

Step 3 Define v^{s+1} according to the schemes a) or b).

Step 4 Construct e^{s+1} = e^s + (s - j + 1)^{-1} (v^{s+1} - e^s).

Step 5 If ||e^{s+1}|| > ε_k, then s = s + 1 and go to Step 1.

Step 6 Set k = k + 1, j = s + 1, s = s + 1, e^s = v^s and go to Step 1.

Step 7 Set x^{s+1} ∈ {x : f(x) ≤ f(x^0)}, s = s + 1, j = s, k = k + 1 and go to Step 1.

In the construction of the direction v^s the following schemes of subgradient averaging are dealt with.

a) The "moving" average. Let K + 1 be an integer. Then

v^s = Σ_{i=s-K}^{s} λ_{i,s} g^i ,   Σ_{i=s-K}^{s} λ_{i,s} = 1 ,   λ_{i,s} ≥ 0 ,

where g^i ∈ ∂f(x^i).

b) The "weighted" average. Let M + 1 be an integer. Then

v^s = g^s + λ_s (v^{s-1} - g^s) ,

where 0 ≤ λ_s ≤ 1 for s ≢ 0 (mod M), and 0 ≤ λ_s ≤ δ̄ < 1 for s ≡ 0 (mod M).

THEOREM 3.1 Assume that the problem (*) is solved by Algorithm 2. Then all limit points of the sequence {x^s} belong to the set X*.
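Scheme b) is a one-line recursion, so its smoothing effect is easy to see in isolation. The sketch below feeds it the alternating subgradients that arise when a descent method zig-zags across a "ravine"; the particular values of λ_s, M, δ̄ and the test sequence are illustrative assumptions, not the paper's:

```python
import numpy as np

def weighted_average(gs, lam=0.8, M=10, delta=0.5):
    """Scheme b): v^s = g^s + lam_s (v^{s-1} - g^s), with lam_s damped to
    delta < 1 on every M-th step; lam, M, delta are illustrative choices."""
    v = np.array(gs[0], float)
    for s in range(1, len(gs)):
        lam_s = delta if s % M == 0 else lam
        v = gs[s] + lam_s * (v - gs[s])
    return v

# Zig-zag across the "ravine" x2 = 0 of f(x) = x1 + 10|x2|: the raw
# subgradients alternate between (1, 10) and (1, -10) from step to step
gs = [np.array([1.0, 10.0 * (-1.0) ** s]) for s in range(50)]
v = weighted_average(gs)
print(v)   # first component stays 1; the oscillating component is damped
```

The averaged direction keeps the stable component (along the ravine bottom) at full strength while the oscillating transverse component is reduced by an order of magnitude, which is the regularizing behavior the section describes.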
4. STOCHASTIC FINITE-DIFFERENCE ANALOGS TO ADAPTIVE NONMONOTONIC METHODS WITH AVERAGING OF SUBGRADIENTS
I t should b e emphasized t h a t t h e p r a c t i c a l value of t h e subgradient-type methods essentially depends upon t h e existence of t h e i r finite-difference analogs.
The finite-difference methods are of great importance primarily in situations when subgradient computation programs are unavailable. This generally occurs in the solution of large-scale problems. Construction of the finite-difference methods in nonsmooth optimization originated two approaches: the deterministic and the stochastic one. Each of them has its own advantages and disadvantages. The stochastic approach is favored here.
One of t h e advantages of t h e introduced averaging operation i s t h e f a c t t h a t t h e construction of stochastic analogs t o subgradient methods p r e s e n t s no special problems.
The offered methods are close to those with smoothing [4] which, in their turn, are closest to the schemes of stochastic quasi-gradient methods [12]. Research into the stochastic quasi-gradient methods with successive step-size regulation is quite a new and underdeveloped field. Ju.M. Ermol'ev spurred the first investigations in this direction. His and Ju.M. Kaniovskij's results [13] are undoubtedly of theoretical interest. However, implementation of the methods described in [14] creates complications, as there is no rule to regulate variations in the step-size.
Let us first dwell on functions f(x, i) of the form

f(x, i) = (2α_i)^{-n} ∫_{x_1-α_i}^{x_1+α_i} ... ∫_{x_n-α_i}^{x_n+α_i} f(y_1, ..., y_n) dy_1 ... dy_n ,

where α_i > 0. Properties of the functions f(x, i) have been studied by A.M. Gupal [4], proceeding from the assumption that f(x) satisfies a local Lipschitz condition.

THEOREM 4.1 If f(x) is a proper convex function, dom f = E^n, then f(x, i) is also a proper convex function, dom f(x, i) = E^n, for any α_i > 0.

THEOREM 4.2 The sequence of functions f(x, i) converges uniformly to f(x) as α_i → 0 in any bounded domain X.
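For the one-dimensional function f(x) = |x|, the uniform (Steklov-type) average over [x - α, x + α] can be written in closed form, which makes Theorem 4.2 easy to check numerically; the closed form below is standard calculus, not taken from the paper:

```python
import numpy as np

def f(x):
    return np.abs(x)

def f_smooth(x, alpha):
    # (1/(2*alpha)) * integral of |y| over [x - alpha, x + alpha]:
    # equals (x^2 + alpha^2)/(2*alpha) near the kink, |x| elsewhere
    return np.where(np.abs(x) <= alpha,
                    (x**2 + alpha**2) / (2 * alpha),
                    np.abs(x))

xs = np.linspace(-2.0, 2.0, 4001)
for alpha in (0.5, 0.1, 0.02):
    gap = np.max(np.abs(f_smooth(xs, alpha) - f(xs)))
    print(alpha, gap)   # sup-gap equals alpha/2, so convergence is uniform
```

The largest deviation sits at the kink x = 0 and equals α/2, vanishing with α; away from the kink the smoothed and original functions coincide, so f(x, i) is also convex, as Theorem 4.1 asserts.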
Now we shall go on to the description of stochastic finite-difference analogs to the algorithms with successive program-adaptive regulation of the step-size and with averaging of the direction.
Description of Algorithm 3

Let x^0 be an arbitrary initial approximation, b > 0 be a constant, {ε_i}, {r_i}, {α_i}, {ρ_i} be number sequences. Put s = 0, i = 0, j = 0.

Step 1 Compute the stochastic estimate ξ^s with the components

ξ_k^s = (2α_i)^{-1} [f(x̃_1^s, ..., x_k^s + α_i, ..., x̃_n^s) - f(x̃_1^s, ..., x_k^s - α_i, ..., x̃_n^s)] ,   k = 1, ..., n ,

where x̃_k^s, k = 1, ..., n, are independent random values distributed uniformly on the intervals [x_k^s - α_i, x_k^s + α_i], α_i > 0.

Step 2 Construct e^s in compliance with the schemes a) and b), where the subgradients are replaced by their stochastic estimates ξ^s.

Step 3 Find x^{s+1} = x^s - r_i e^s.

Step 4 If f(x^{s+1}) > f(x^0) + b, then go to Step 9.

Step 5 Define z̄^{s+1} = z̄^s + (s - j + 1)^{-1} (e^s - z̄^s).

Step 6 If s - j < ρ_i, then s = s + 1 and go to Step 1.

Step 7 If ||z̄^{s+1}|| > ε_i, then s = s + 1 and go to Step 1.

Step 8 Put i = i + 1, j = s + 1, s = s + 1 and go to Step 1.

Step 9 Set x^{s+1} ∈ {x : f(x) ≤ f(x^0)}, j = s + 1, i = i + 1, s = s + 1 and go to Step 1.
THEOREM 4.3 Let the problem (*) be solved by Algorithm 3 and the number sequences {ε_i}, {r_i}, {α_i}, {ρ_i} satisfy the conditions ε_i > 0, ε_i → 0, r_i > 0, r_i → 0, α_i > 0, α_i → 0, ρ_i → ∞. Then for almost all ω the sequence f(x^s(ω)) converges and all limit points of the sequence {x^s(ω)} belong to the set of solutions X*.
Theorem 4.3 is proved in detail in [25].

5. A POSTERIORI ESTIMATES OF ACCURACY OF SOLUTION TO ADAPTIVE SUBGRADIENT METHODS AND THEIR STOCHASTIC FINITE-DIFFERENCE ANALOGS
In the numerical solution of extremum problems of nondifferentiable optimization, strong emphasis is placed on the check of the obtained solution accuracy. Given the solution accuracy estimates, first, a very efficient rule of algorithm stopping can be formulated; second, the obtained estimates can form the basis for justified conclusions with respect to the strategy of selection of the algorithm parameters.

Using a rather simple procedure, a posteriori estimates of solution accuracy for the introduced adaptive algorithms are constructed here. The estimates provide a means for strictly evaluating the efficiency of the use of the averaging operation.
Thus, assume that the convex function minimization problem (*) is being solved. Suppose the set X* contains only one point x*.

To solve the problem (*) consider Algorithm 1. The spin-off from the proof of Theorem 1.1 is the proof that the sequence {x^s} falls outside the set

{x : f(x) ≤ f(x^0) + b}

only a finite number of times. Therefore, s̄ ≥ 0 exists such that for s ≥ s̄ the points x^s remain in this set. Then the step-size will vary only if the condition ||e^{s+1}|| ≤ ε_k is satisfied, where e^{s+1} is the average of the subgradients computed since the last step-size change. Without loss of generality we will assume that the first instant of the change from the step r_0 to r_1 occurred just because the condition

||e^{s_0}|| ≤ ε_0

is satisfied.
From the convexity of the function f(x) it is inferred that

f(x^i) - f(x*) ≤ (g^i, x^i - x*) ,   i = 0, 1, ..., s_0 .   (5.1)-(5.3)

Summation of the inequalities (5.1), (5.2), ..., (5.3) yields

(s_0 + 1)^{-1} Σ_{i=0}^{s_0} [f(x^i) - f(x*)] ≤ (s_0 + 1)^{-1} Σ_{i=0}^{s_0} (g^i, x^i - x^0) + (e^{s_0}, x^0 - x*) .

Denote the expression (s_0 + 1)^{-1} Σ_{i=0}^{s_0} (g^i, x^i - x^0) by Δ_{s_0}. We have the obvious inequalities

min_{0 ≤ i ≤ s_0} f(x^i) - f(x*) ≤ Δ_{s_0} + ε_0 ||x^0 - x*|| ,

where for s_0 ≤ s ≤ s_1 the points x^s are related by x^{s+1} = x^s - r_1 g^s. For these values of s it is possible to derive a similar estimate with the starting point

x̄^{s_0+1} ∈ {x : f(x) ≤ min [f(x^{s_0}), ..., f(x^{s_1})]} .

Thus, for s_k + 1 ≤ s ≤ s_{k+1} we have

min_{0 ≤ i ≤ s_{k+1}} f(x^i) - f(x*) ≤ Δ_k + ε_k ||x̄^{s_k+1} - x*|| ,

where

x̄^{s_k+1} ∈ {x : f(x) ≤ min [f(x^{s_k+1}), ..., f(x^{s_{k+1}})]} .

It is easily proved that Δ_k → 0.
THEOREM 5.1 Assume that the problem (*) is solved by Algorithm 1. Then the inequalities

min_{0 ≤ i ≤ s_k} f(x^i) - f(x*) ≤ Δ_k + ε_k ||x̄^{s_{k-1}+1} - x*||

hold for those instants s_k at which the step-size varies because the condition ||e^{s_k}|| ≤ ε_k is satisfied.

REMARK It follows from Theorem 5.1 that the same estimate occurs both for the subsequence of "records" {min_{0 ≤ i ≤ s_k} f(x^i)} and for the subsequence {x̄^{s_k}}.
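The mechanism behind these estimates can be checked numerically: summing the convexity inequalities f(x^i) - f(x*) ≤ (g^i, x^i - x*) over the iterations and splitting x^i - x* = (x^i - x^0) + (x^0 - x*) bounds the record value of f through the quantity Δ and the averaged subgradient. The sketch below is an illustration with an assumed test function and a fixed step-size, not the paper's algorithm:

```python
import numpy as np

def f(x):
    return np.abs(x).sum()                     # min f = 0 at x* = 0

def subgrad(x):
    return np.where(x >= 0, 1.0, -1.0)

x, x_star, r = np.array([2.0, -1.5]), np.zeros(2), 0.05
pts, gs = [], []
for _ in range(200):                           # plain subgradient iterations
    g = subgrad(x)
    pts.append(x.copy()); gs.append(g)
    x = x - r * g
pts, gs = np.array(pts), np.array(gs)

e = gs.mean(axis=0)                            # averaged subgradient
delta = np.mean(np.einsum('ij,ij->i', gs, pts - pts[0]))   # the Delta term
bound = delta + e @ (pts[0] - x_star)          # a posteriori bound
record = np.abs(pts).sum(axis=1).min()         # best f value seen so far
print(record <= bound + 1e-12)   # the record value is certified by the bound
```

The bound is computable from quantities the algorithm already stores (the subgradients and the averaged direction), which is why it serves both as a stopping rule and as a yardstick for the averaging weights discussed next.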
Let the problem (*) be solved by Algorithm 2, where the operation of averaging of the preceding subgradients is used. Denote the instants of changes in the step-size by s_i, i = 0, 1, 2, .... Suppose the first instant of the change from r_0 to r_1 takes place because the inequality ||e^{s_0}|| ≤ ε_0 holds. Examine the scheme of averaging by the "moving" average. We have

v^s = Σ_{i=s-K}^{s} λ_{i,s} g^i ,   Σ_{i=s-K}^{s} λ_{i,s} = 1 ,   λ_{i,s} ≥ 0 .

Designate the expression Σ_i λ_{i,s} (g^i, x^i - x^0) by Δ_s'. Then, by the same convexity argument as above, for s ≤ K and for s > K we obtain estimates of the form

min_i f(x^i) - f(x*) ≤ Δ_s' + ε_0 ||x^0 - x*|| .

From this formula the following recommendations can be offered with respect to the selection of the parameters λ_{i,s}: minimize

Σ_i λ_{i,s} (g^i, x^i - x^0)   subject to   Σ_i λ_{i,s} = 1 ,   λ_{i,s} ≥ 0 .

The subgradient averaging thereby allows improving the a posteriori estimates of the solution accuracy. This may substantiate formally that it is of advantage to introduce and study the operation of subgradient averaging.
For an arbitrary instant of step-size variation s_i > K we can easily obtain an estimate of the same form (5.9).

THEOREM 5.2 Let the problem (*) be solved by Algorithm 2 with the use of averaging scheme a). Then for the instants s_i for which ||e^{s_i}|| ≤ ε_i, inequality (5.9) holds.

The scheme of averaging by the "weighted" average b) is treated in a similar way.
The a posteriori estimates of the solution accuracy attained for the adaptive subgradient methods can be extended to their stochastic finite-difference analogs with a minimum of alterations. The way of getting them is illustrated with Algorithm 3. We will use the notations introduced in Section 4. When proving Theorem 4.3 it is possible to demonstrate that the step-size r_i varies an infinite number of times. As Algorithm 3 converges with probability one, for almost all ω it is possible to indicate s̄(ω) such that for s ≥ s̄(ω) the iterates remain in the bounded level set of Step 9. Therefore, for s ≥ s̄(ω) the step-size r_i varies because the condition

||z̄^{s_i}|| ≤ ε_i

holds, where s_i ≥ ρ_i + j, z̄^{s_i} = z̄^{s_i-1} + (s_i - j)^{-1} (e^{s_i} - z̄^{s_i-1}), the sequences {ε_i} and {ρ_i} comply with the properties formulated in Theorem 4.3, and j is determined by s_i.
Consider the event that the averaged estimate z̄^{s_i} deviates from the corresponding average of exact subgradients, where s_t is the instant of step-size change that precedes s_i. There exists a constant 0 < C < ∞ such that with probability greater than 1 - Cδ_i it is possible to state that this deviation is small. Then for the instant s_i the analogous a posteriori inequality holds with the same probability.
Theorem 5.3 is readily formulated and proved: assume that the problem (*) is solved by Algorithm 3. Then for almost all ω it is possible to isolate a subsequence of points {x^{s_i}(ω)} for which, with probability greater than 1 - Cδ_i, the corresponding inequalities hold, where

f*_{i-1} = min_{x ∈ E^n} f(x, i-1) ,   x*_{i-1} ∈ Argmin f(x, i-1) .
REFERENCES
1 Ajzerman, M.A., E.M. Braverman and L.I. Rozonoer: Potential Functions Method in Machine Learning Theory. M.: Nauka, 1970, p. 384.
2 Glushkova, O.V. and A.M. Gupal: About Nonmonotonic Methods of Nonsmooth Function Minimization with Averaging of Subgradients. Kibernetika, 1980, No. 6, pp. 128-129.
3 Gupal, A.M. and L.G. Bazhenov: Stochastic Analog to Conjugate Gradient Method. Kibernetika, 1972, No. 1, pp. 125-126.
4 Gupal, A.M.: Stochastic Methods of Solution of Nonsmooth Extremum Problems. Kiev: Naukova dumka, 1979, p. 152.
5 Dem'janov, V.F. and V.N. Malozemov: Introduction to Minimax. M.: Nauka, 1972, p. 368.
6 Eremin, I.I.: The Relaxation Method of Solving Systems of Inequalities with Convex Functions on the Left Side. Dokl. AN SSSR, 1965, Vol. 160, No. 5, pp. 994-996.
7 Ermol'ev, Ju.M.: Methods of Solution of Nonlinear Extremum Problems. Kibernetika, 1966, No. 4, pp. 1-17.
8 Ermol'ev, Ju.M. and N.Z. Shor: On Minimization of Nondifferentiable Functions. Kibernetika, 1967, No. 1, pp. 101-102.
9 Ermol'ev, Ju.M. and Z.V. Nekrylova: Some Methods of Stochastic Optimization. Kibernetika, 1966, No. 6, pp. 96-98.
10 Ermol'ev, Ju.M.: On the Method of Generalized Stochastic Gradients and Stochastic Quasi-Fejer Sequences. Kibernetika, 1969, No. 2, pp. 73-83.
11 Ermol'ev, Ju.M.: On One General Problem of Stochastic Programming. Kibernetika, 1971, No. 3, pp. 47-50.
12 Ermol'ev, Ju.M.: Stochastic Programming Methods. M.: Nauka, 1976, p. 240.
13 Ermol'ev, Ju.M. and Ju.M. Kaniovskij: Asymptotic Properties of Some Stochastic Programming Methods with Constant Step-Size. Zhurn. Vych. Mat. i Mat. Fiziki, 1979, Vol. 19, No. 2, pp. 356-366.
14 Kaniovskij, Ju.M., P.S. Knopov and Z.V. Nekrylova: Limit Theorems for Stochastic Programming. Kiev: Naukova dumka, 1980, p. 156.
15 Loève, M.: Probability Theory. M.: Izd-vo inostr. lit., 1967, p. 720.
16 Norkin, V.N.: Method of Nondifferentiable Function Minimization with Averaging of Generalized Gradients. Kibernetika, 1980, No. 6, pp. 86-89, 102.
17 Nurminskij, E.A.: Convergence Conditions for Nonlinear Programming Algorithms. Kibernetika, 1973, No. 1, pp. 122-125.
18 Nurminskij, E.A. and A.A. Zhelikovskij: Investigation of One Regulation of Step in Quasi-Gradient Method for Minimizing Weakly Convex Functions. Kibernetika, 1974, No. 6, pp. 101-105.
19 Poljak, B.T.: Generalized Method of Solving Extremum Problems. Dokl. AN SSSR, 1967, Vol. 174, No. 1, pp. 33-36.
20 Poljak, B.T.: Minimization of Nonsmooth Functionals. Zhurn. vychisl. mat. i mat. fiziki, 1969, Vol. 9, No. 3, pp. 509-521.
21 Tsypkin, Ja.Z.: Adaptation and Learning in Automatic Systems. M.: Nauka, 1968.
22 Tsypkin, Ja.Z.: Generalized Learning Algorithms. Avtomatika i telemekhanika, 1970, No. 1, pp. 97-103.
23 Chepurnoj, N.D.: One Successive Step-Size Regulation for Quasi-Gradient Method of Weakly Convex Function Minimization. In: Issledovanie Operacij i ASU. Kiev: Vyshcha shkola, 1981, No. 19, pp. 13-15.
24 Chepurnoj, N.D.: Averaged Quasi-Gradient Method with Successive Step-Size Regulation to Minimize Weakly Convex Functions. Kibernetika, 1981, No. 6, pp. 131-132.
25 Chepurnoj, N.D.: One Successive Step-Size Regulation in Stochastic Method of Nonsmooth Function Minimization. Kibernetika, 1982, No. 4, pp. 127-129.
26 Shor, N.Z.: Application of Gradient Descent Method for Solution of Network Transportation Problem. In: Materialy nauchnogo seminara po prikladnym voprosam kibernetiki i issledovanija operacij. Nauchnyj sovet po kibernetike IK AN USSR, Kiev, 1962, vypusk 1, pp. 9-17.
27 Shor, N.Z.: Investigation of Space Dilation Operation in Convex Function Minimization Problems. Kibernetika, 1970, No. 1, pp. 6-12.
28 Shor, N.Z. and N.G. Zhurbenko: Minimization Method Using Space Dilation in the Direction of Difference of Two Successive Gradients. Kibernetika, 1971, No. 3, pp. 51-59.
29 Shor, N.Z.: Nondifferentiable Function Minimization Methods and Their Applications. Kiev: Nauk. dumka, 1979, p. 200.
30 Demjanov, V.F.: Algorithms for Some Minimax Problems. Journal of Computer and System Sciences, 1968, 2, No. 4, pp. 342-380.
31 Lemarechal, C.: An Algorithm for Minimizing Convex Functions. In: Information Processing '74 (ed. Rosenfeld), 1974, North-Holland, Amsterdam, pp. 552-556.
32 Lemarechal, C.: Nondifferentiable Optimization: Subgradient and Epsilon Subgradient Methods. Lecture Notes in Economics and Mathematical Systems (ed. Oettli, W.), 1975, 117, Springer, Berlin, pp. 191-199.
33 Bertsekas, D.P. and S.K. Mitter: A Descent Numerical Method for Optimization Problems with Nondifferentiable Cost Functions. SIAM Journal on Control, 1973, 11, No. 4, pp. 637-652.
34 Wolfe, P.: A Method of Conjugate Subgradients for Minimizing Nondifferentiable Functions. In: Nondifferentiable Optimization (eds. Balinski, M.L., Wolfe, P.), Mathematical Programming Study 3, 1975, North-Holland, Amsterdam, pp. 145-173.