ADAPTIVE NONMONOTONIC METHODS WITH AVERAGING OF SUBGRADIENTS
N.D. Chepurnoj

July 1987
WP-87-62
Working Papers are interim reports on work of the International Institute for Applied Systems Analysis and have received only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute or of its National Member Organizations.
INTERNATIONAL INSTITUTE FOR APPLIED SYSTEMS ANALYSIS A-2361 Laxenburg, Austria
FOREWORD
The numerical methods of nondifferentiable optimization are used for solving decision analysis problems in economics, engineering, the environment and agriculture. This paper is devoted to adaptive nonmonotonic methods with averaging of subgradients. A unified approach is suggested for the construction of new deterministic subgradient methods, their stochastic finite-difference analogs, and a posteriori estimates of the accuracy of solution.
Alexander B. Kurzhanski
Chairman
System and Decision Sciences Program
CONTENTS
1 Overview of Results in Nonmonotonic Subgradient Methods
2 Subgradient Methods with Program-Adaptive Step-Size Regulation
3 Methods with Averaging of Subgradients and Program-Adaptive Successive Step-Size Regulation
4 Stochastic Finite-Difference Analogs to Adaptive Nonmonotonic Methods with Averaging of Subgradients
5 A Posteriori Estimates of Accuracy of Solution to Adaptive Subgradient Methods and Their Stochastic Finite-Difference Analogs
References
ADAPTIVE NONMONOTONIC METHODS WITH AVERAGING OF SUBGRADIENTS

N.D. Chepurnoj
1. OVERVIEW OF RESULTS IN NONMONOTONIC SUBGRADIENT METHODS

Among the existing numerical methods for the solution of nondifferentiable optimization problems, the nonmonotonic subgradient methods hold an important position.
The pioneering work by N.Z. Shor [26] gave impetus to their explosive progress. In 1962, he suggested an iterative process for the minimization of a convex piecewise-linear function, named afterwards the generalized gradient descent (GGD):

x^{s+1} = x^s - r_s g^s ,   s = 0, 1, 2, ... ,   (1.1)

where g^s ∈ ∂f(x^s), ∂f(x^s) is the set of subgradients of the function f(x) at the point x^s, and r_s ≥ 0 is a step-size.
For differentiable functions this method agrees very closely with the well-known gradient method. The fundamental difference between them is that the motion direction (-g^s) in (1.1) is, as a rule, not a descent direction.
At the first attempts to substantiate theoretically the convergence of procedures of the type (1.1), researchers immediately faced two difficulties. For one thing, the objective function lacked the property of differentiability. For another, method (1.1) was not monotonic. These combined features rendered impractical the use of the known convergence theorems for gradient procedures. New theoretical approaches therefore became a must.
One more "misfortune" came on the heels of the others: numerical computations demonstrated that GGD has a low convergence rate.
Initially g r e a t hopes were pinned on t h e step-size selection s t r a t e g y as a way towards overcoming t h e crisis.
By the early 1970s the difficulties caused by the formal substantiation of convergence of nonmonotonic subgradient procedures had been mastered and different approaches to the step-size regulation had been offered [6, 7, 8, 19, 20, 26]. However, the computations continued to prove the poor convergence of GGD in practice.
It can be said that the first stage in GGD evolution was over in 1976.
Thereupon the numerical methods of nondifferentiable optimization developed in three directions, i.e., methods with space dilation, monotonic methods, and adaptive nonmonotonic methods were explored.

Let us dwell on each of these approaches.
In an effort to enhance the GGD efficiency, N.Z. Shor elaborated methods where the operation of space dilation in the direction of a subgradient, and of the difference between two successive subgradients, was employed. Literally the next few years were prolific in papers [27, 28, 29] investigating the space dilation operation in nondifferentiable function minimization problems. A high rate of convergence of the suggested methods was corroborated theoretically. Computational practice attested convincingly to the advantageousness of applying the algorithms with space dilation, especially the r-algorithm [29], as an alternative to GGD, provided the dimension of the space does not exceed 200 to 300.
However, if the dimension is large, first, a considerable amount of computation is spent on the transformation of the space dilation matrix and, second, some extra capacity of computer memory is required.
The monotonic methods became another essential direction.
Even though the first papers on the monotonic methods appeared back in 1968 (V.F. Dem'janov [30]), their progress reached its peak in the early 70's. Two classes of these algorithms should be distinguished here: the ε-steepest descent [5, 30] and the ε-subgradient algorithms [31-34]. We shall not examine them in detail but note that the monotonic methods offered a higher rate of convergence as against GGD. Just as with the methods using space dilation, vast dimensions of the problems to be solved still remained the Achilles' heel of the monotonic algorithms.
Thus, the nonmonotonic subgradient methods have come into particular importance in the solution of large-scale nondifferentiable optimization problems.

The nonmonotonic procedures have another important object of application, apart from the large-scale problems, i.e., the problems in which the subgradient cannot be precisely defined at a point. The latter encompass problems of identification, learning, and pattern recognition [1, 21]. The minimized function is there a mathematical expectation whose distribution law is unknown. Errors in subgradient calculation may stem from computation errors and many other real processes.
Ju.M. Ermol'ev and Z.V. Nekrylova [9] were the first to investigate such procedures. Stochastic programming problems have increasingly drawn attention to the nonmonotonic subgradient methods.
However, as pointed out earlier, GGD, although widely used, resistant to errors in subgradient computations, and saving of memory capacity, still had a poor rate of convergence. Of great importance therefore was the construction of nonmonotonic methods that, on the one hand, retain all the advantages of GGD and, on the other, possess a high rate of convergence.

It has been this requirement that has led to the elaboration of the adaptive nonmonotonic procedures.
An analysis revealed that the Markov nature of GGD is the chief cause of its slow convergence. It is quite obvious that the use of the most intimate knowledge of the progress of the computations is indispensable to the selection of the direction and the regulation of the step-size.
Several ideas provided the basis for the development of adaptive nonmonotonic methods.

The major concept of all techniques for selecting the direction and regulating the step-size was the use of information about the fulfillment of the necessary extremum conditions for the function. Its implementation is the methods with averaging of the subgradients.

In the most general case, by the operation of averaging is meant a procedure of "taking" the convex hull of an arbitrary finite number of vectors.
The operation of averaging in the numerical methods was first applied by Ja.Z. Cypkin [22] and Ju.M. Ermol'ev [11].
The paper by A.M. Gupal and L.G. Bazhenov [3], also dealing with the use of the operation of averaging of stochastic estimates of the generalized gradients, appeared in 1972.
However, all the above papers considered the program regulation of the step-size, i.e., a sequence {r_s} independent of the computations was selected such that

r_s ≥ 0 ,   r_s → 0 ,   Σ_{s=0}^∞ r_s = ∞ .
The next natural stage in the evolution of this concept was the construction of adaptive step-size regulation using the operation of averaging of the preceding subgradients.
In 1974, E.A. Nurminskij and A.A. Zhelikovskij [18] suggested a successive program-adaptive regulation of the step-size for the quasi-gradient method of minimization of a weakly convex function.
The crux of this regulation consists in the following. Let an iterative sequence be constructed according to the rule

x^{s+1} = x^s - r_0 g^s ,   s = 0, 1, 2, ... ,

where g^s ∈ ∂f(x^s) is a quasi-gradient of the function f(x) at the point x^s, and r_0 is a constant step-size.
Assume that there exist x̄ ∈ E^n and numerical parameters ε > 0, δ > 0 such that for any s = 0, 1, 2, ... we have ||x^s - x̄|| ≤ δ. Let us suppose also that a convex combination of the subgradients {g^i}, i ≤ s_0, exists such that

||e^{s_0}|| ≤ ε ,   e^{s_0} ∈ conv {g^i : i ≤ s_0} .

Then the point x̄ is sufficiently close to the set X* = Argmin f(x) according to the necessary extremum conditions. In the given case the step-size has to be reduced and the procedure repeated with the new step-size value r_1, starting at the obtained point x^{s_0}. The numerical realization of the described algorithm requires a specific rule for constructing the vectors e^s. In [18] the vector e^s is constructed by the rule

e^s = Proj (0 | conv {g^k : k ≤ s}) ,

that is, all quasi-gradients are included into the convex hull starting from the most recent instant of the step-size change. Numerical computations bore out the expediency of such regulation. However, a grave disadvantage was inherent in it: the great laboriousness of an iteration. Considering that the approach as a whole holds promise, averaging schemes had to be developed for efficient use when selecting the direction and regulating the step-size.
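The projection step in this rule is itself a small optimization problem: finding the minimum-norm point of the convex hull of the accumulated quasi-gradients. A minimal numpy sketch, using the Frank-Wolfe method with exact line search (the solver choice is an illustration here, not the paper's prescription):

```python
import numpy as np

def proj_zero_onto_hull(G, iters=100):
    """Approximate Proj(0 | conv{rows of G}): minimize ||G^T lam||^2
    over the simplex by the Frank-Wolfe method with exact line search."""
    m = G.shape[0]
    lam = np.zeros(m)
    lam[0] = 1.0                       # start at the first vertex
    for _ in range(iters):
        e = G.T @ lam                  # current point of the hull
        k = int(np.argmin(G @ e))      # vertex minimizing the linearization
        d = G[k] - e                   # Frank-Wolfe direction
        denom = d @ d
        if denom == 0.0:
            break
        gamma = np.clip(-(e @ d) / denom, 0.0, 1.0)   # exact line search
        lam *= (1.0 - gamma)
        lam[k] += gamma
    return G.T @ lam

# two subgradients of f(x) = |x1| + |x2| taken near the kink line x1 = 0:
e = proj_zero_onto_hull(np.array([[1.0, 1.0], [-1.0, 1.0]]))
print(e)   # -> [0. 1.], the minimum-norm convex combination
```

If the resulting norm ||e|| is below the current tolerance, the necessary extremum condition is nearly satisfied and the step-size is reduced, exactly as in the rule above.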
This paper treats such averaging schemes. They serve as a foundation for new nonmonotonic subgradient methods, for the description of stochastic finite-difference analogs, and for a posteriori estimates of solution accuracy. Prior to discussing results, let us make some general assumptions. Presume that the minimization problem is being solved on the entire space:

min f(x) ,   x ∈ E^n ,   (*)

where E^n is an n-dimensional Euclidean space. The function f(x) will everywhere be thought of as a proper convex function, dom f = E^n, with the sets {x : f(x) ≤ C} bounded for any constant C. The set of solutions of problem (*) will be denoted by X* = Argmin f(x).
2. SUBGRADIENT METHODS WITH PROGRAM-ADAPTIVE STEP-SIZE REGULATION
The concept of adaptive successive step-size regulation has already been set forth. In [23] a way of determining the instant of the step-size variation was suggested. Central to it was the simplest scheme of averaging of the preceding subgradients. This method is easy to implement and effects a saving in computer memory capacity. Compared to the program regulation, the adaptive regulation improves the convergence of the subgradient methods.
Description of Algorithm 1

Let x^0 be an arbitrary initial point, b > 0 be a constant, {ε_k}, {r_k} be number sequences such that ε_k > 0, ε_k → 0, r_k > 0, r_k → 0. Put s = 0, j = 0, k = 0, e^0 = g^0 ∈ ∂f(x^0).

Step 1. Construct

x^{s+1} = x^s - r_k g^s ,   g^s ∈ ∂f(x^s) .

Step 2. If f(x^{s+1}) > f(x^0) + b, then select x^{s+1} ∈ {x : f(x) ≤ f(x^0)} and go to Step 5.

Step 3. Define

e^{s+1} = e^s + (s - j + 2)^{-1} (g^{s+1} - e^s)

(the average of the subgradients computed since the last step-size change).

Step 4. If ||e^{s+1}|| > ε_k, then s = s + 1 and go to Step 1.

Step 5. Set k = k + 1, j = s + 1, s = s + 1 and go to Step 1.

THEOREM 1.1 Assume that the problem (*) is solved by Algorithm 1. Then all limit points of the sequence {x^s} belong to X*.
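In code, Algorithm 1 amounts to a subgradient loop whose step-size is reduced whenever the running mean of the subgradients since the last change becomes small. The sketch below is an illustration under assumed choices: the test function, the geometric sequences ε_k = r_k = 2^{-k}, and the use of the best recorded point for the reset in Step 2 are not taken from the paper.

```python
import numpy as np

def f(x):
    return abs(x[0]) + 2 * abs(x[1])          # illustrative nonsmooth convex f

def subgrad(x):
    # one subgradient of f at x (the choice of sign at kinks is arbitrary)
    return np.array([1.0 if x[0] >= 0 else -1.0,
                     2.0 if x[1] >= 0 else -2.0])

def algorithm1(x0, b=1.0, n_iter=4000):
    x = np.asarray(x0, float)
    f0, best = f(x), x.copy()
    k = 0                                     # index of eps_k = r_k = 2^{-k}
    g = subgrad(x)
    e, cnt = g.copy(), 1                      # running mean of subgradients
    for _ in range(n_iter):
        x = x - 0.5 ** k * g                  # Step 1
        if f(x) > f0 + b:                     # Step 2: reset to a record point
            x = best.copy()
            k += 1; cnt = 0
        g = subgrad(x)
        e = g if cnt == 0 else e + (g - e) / (cnt + 1)   # Step 3: average
        cnt += 1
        if np.linalg.norm(e) <= 0.5 ** k:     # Steps 4-5: shrink the step
            k += 1; cnt = 0
        if f(x) < f(best):
            best = x.copy()
    return best

x_best = algorithm1([3.0, -2.0])
print(f(x_best))   # small: the record values approach min f = 0
```

The key nonmonotonic feature is visible in the trace: individual iterates oscillate across the kinks, but the averaged subgradient then nearly cancels, which is precisely the signal used to halve the step.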
PROOF Denote the instants of step-size variations by s_m. Let us prove that the step-size r_k varies an infinite number of times. Suppose it is not so, i.e., the step-size does not vary starting from an instant s_m and is equal to r_m. Then the points x^s for s ≥ s_m belong to the set

{x : f(x) ≤ f(x^0) + b}

and are related by

x^{s+1} = x^s - r_m g^s .

Considering that the step-size does not vary, ||e^s|| > ε_m > 0 for s ≥ s_m. In passing to the limit as s → ∞ in the resulting inequality for f(x^s) we obtain a contradiction with the boundedness of the set above.

The further proof of Theorem 1.1 amounts to checking the general conditions of algorithm convergence derived by E.A. Nurminskij [17].
NURMINSKIJ THEOREM Let the sequence {x^s} and the set of solutions X* be such that the following conditions are satisfied:

D1. For any subsequence {x^{s_k}} converging to a point x', ||x^{s_k+1} - x^{s_k}|| → 0.

D2. There exists a closed bounded set S such that the sequence {x^s} belongs to S.

D3. For any subsequence {x^{n_k}} such that x^{n_k} → x' ∉ X*, there exists ε_0 > 0 such that for all 0 < ε ≤ ε_0 and any k

inf {m > n_k : ||x^m - x^{n_k}|| > ε} = m_k < ∞ .

D4. A continuous function W(x) exists such that for an arbitrary subsequence {x^{n_k}} with x^{n_k} → x' ∉ X*, and for the subsequence {x^{m_k}} corresponding to it by condition D3, for arbitrary 0 < ε ≤ ε_0

lim sup_{k→∞} W(x^{m_k}) < lim_{k→∞} W(x^{n_k}) .

D5. The function W(x) of condition D4 assumes no more than a countable number of values on the set X*.

Then all limit points of the sequence {x^s} belong to X*.

Select the function f(x) as the function W(x). Conditions D1, D5 are satisfied in view of the algorithm structure and the earlier assumptions.
The rest of the conditions will be verified by the following scheme. We will prove that conditions D3, D4 hold for the inner points of the set

S = {x : f(x) ≤ f(x^0)} .

It is therewith obvious that

max_{x ∈ S} W(x) ≤ inf_{x ∉ S} W(x) .

Then the sequence {x^s} falls outside the set S only a finite number of times. Consequently, condition D2 is satisfied and this automatically entails the validity of D3 and D4.

So, let a subsequence {x^{n_p}} exist such that x^{n_p} → x' ∉ X*. Assume at this stage of the proof that x' ∈ int S. We will prove that there exists ε_0 > 0 such that for all 0 < ε ≤ ε_0 and an arbitrary p:

inf {m > n_p : ||x^m - x^{n_p}|| > ε} = m_p < ∞ .   (2.1)
E 3 r0 at a n a r b i t r a r y p:Now s u p p o s e condition (2.1) i s not s a t i s f i e d , t h a t i s , f o r a n y r
>
0 t h e r e e x i s t s n p such t h a t l1zs-
zn41 3 r f o r all s>
np.W e have
f o r sufficiently l a r g e n p and s
>
n,,. By the supposition 0 Z B f ( x ' ) . By v i r t u e of t h e closedness, convexity and u p p e r semi-continuity of t h e many-valued mapping a f ( x ) t h e r e e x i s t s E>
0 such t h a t 0=
conv G q c ( z '), where conv1.1
i s a convex hull and G 4 r ( x ' ) i s a setI t i s easily s e e n t h a t E
>
0 can b e always s e l e c t e d in s u c h a way t h a t U I e ( x ') C int S, where ( z)=
x : z-
x1
5 4.
Let δ = min {||ḡ|| : ḡ ∈ conv G_{4ε}(x')}. Obviously δ > 0. As ε_k → 0, there exists an integer K(δ) such that for k ≥ K(δ) we have ε_k ≤ δ/2. Put n_p ≥ K(δ). Then it is readily seen that for s ≥ n_p the step-size r_k can vary no more than once within the set U_{4ε}(x'). Examine the sequence {x^s} separately on the intervals n_p ≤ s < s_p*, where

s_p* = min {s_m : s_m ≥ n_p} .

When n_p ≤ s < s_p* the points x^s are related as follows

x^{s+1} = x^s - r_l g^s ,

where the index l is determined with respect to s_p*. Let us consider the scalar products (e^s, g^{s+1}), where e^{n_p} = g^{n_p}. Since e^s ∈ conv G_{4ε}(x') for s ≥ n_p, it is possible to prove that there exists an index N_1 such that

(e^{N_1}, g^{N_1+1}) ≥ γ ,   γ = δ²/2 .
We next consider the scalar products

d_s = (x^{N_1+1} - x^s, g^s) = r_l (s - N_1 - 1)(e^{s-1}, g^s) ,   s ≥ N_1 + 1 .

An index N_2 exists such that (e^{N_2}, g^{N_2+1}) ≥ γ and d_{N_2+1} ≥ r_l (N_2 - N_1) γ. Then in a similar way we can prove the existence of indices N_t (t ≥ 3) with the same property. It is easy to prove that N_{t+1} - N_t ≤ N < ∞, t = 1, 2, ... .
Let N_{t_0} be the maximal of the indices N_t that does not exceed s_p*. Since s_p* - N_{t_0} ≤ N, as p → ∞ the last term on the right-hand side of the resulting inequality approaches zero. We finally obtain

f(x^{s_p*}) - f(x^{n_p}) ≤ -r_l (s_p* - n_p) γ + ε_p' ,   (2.2)

where ε_p' → 0 as p → ∞.

It is not difficult to notice that the reasoning which underlies the derivation of inequality (2.2) may also be repeated without changes for the interval s ≥ s_p* to get

f(x^m) - f(x^{s_p*}) ≤ -r_{l+1} (m - s_p*) γ + ε_p'' .   (2.3)

Adding (2.2) to (2.3) we obtain

f(x^m) - f(x^{n_p}) ≤ -r_{l+1} (m - n_p) γ + ε̄_p .   (2.4)

In passing to the limit as m → ∞ in inequality (2.4) we are led to a contradiction with the boundedness of a continuous function on the closed bounded set U_{4ε}(x'). Consequently, condition (2.1) is proved.
Let

m_p = inf {m > n_p : ||x^m - x^{n_p}|| > ε} .

By construction x^{m_p} ∉ U_ε(x^{n_p}), but for sufficiently large p all the reasoning involved in the derivation of inequality (2.4) remains valid for the instant m_p, that is, we have

f(x^{m_p}) - f(x^{n_p}) ≤ -r_{l+1} (m_p - n_p) γ + ε̄_p .

In passing to the limit as p → ∞ we get

lim sup_{p→∞} W(x^{m_p}) < lim_{p→∞} W(x^{n_p}) .
The further proof of this theorem follows from the Nurminskij theorem.
To fix more precisely the instant when the iteration process gets into the neighborhood of the solution, we can employ the following modification of Algorithm 1, provided the computer capacity allows.
Let x^0 be an arbitrary initial point, d > 0 be a constant, {ε_k}, {r_k} be number sequences such that ε_k > 0, ε_k → 0, r_k > 0, r_k → 0; let k_1, k_2, ..., k_m be positive bounded integer constants. Put s = 0, j = 0, k = 0, e^0 = g^0 ∈ ∂f(x^0).

Step 1 Construct

x^{s+1} = x^s - r_k g^s ,   g^s ∈ ∂f(x^s) .

Step 2 If f(x^{s+1}) > f(x^0) + d, then select x^{s+1} ∈ {x : f(x) ≤ f(x^0)} and go to Step 5.

Step 3 Define

e_0^{s+1} = e_0^s + (s - j + 2)^{-1} (g^{s+1} - e_0^s) ,
e_i^{s+1} = P_i(g^{s-k_i+1}, ..., g^{s+1}) ,   i = 1, ..., m ,

where each of the notations P_i(·, ..., ·) designates an arbitrary convex combination of a finite number of the indicated preceding subgradients. Find

ρ_{s+1} = min_{0 ≤ i ≤ m} ||e_i^{s+1}|| .

Step 4 If ρ_{s+1} > ε_k, then s = s + 1 and go to Step 1.

Step 5 Set k = k + 1, j = s + 1, s = s + 1, e^s = g^s and go to Step 1.

THEOREM 2.1 Suppose that the problem (*) is solved by the modified Algorithm 1. Then all limit points of the sequence {x^s} belong to X*.
3. METHODS WITH AVERAGING OF SUBGRADIENTS AND PROGRAM-ADAPTIVE SUCCESSIVE STEP-SIZE REGULATION
Successive Step-Size Regulation

As noted in a number of works [2, 3, 12, 16], it is expedient to average the subgradients calculated at the previous iterations so that the subgradient methods will be more regular. For instance, when "ravine"-type functions are minimized, the averaged direction points the way along the bottom of the "ravine".

It will be demonstrated in Section 5 that the operation of averaging enables the improvement of a posteriori estimates of the solution accuracy along with the upgrading of the regularity of the described methods.

Methods with averaging of subgradients and successive program-adaptive regulation of the step-size are set forth in this section. Results obtained here stem from [24].
Description of Algorithm 2

Let x^0 be an arbitrary initial approximation; b̄ > 0 be a constant; {ε_k}, {r_k} be number sequences such that ε_k > 0, ε_k → 0, r_k > 0, r_k → 0. Put s = 0, j = 0, k = 0, e^0 = v^0 = g^0 ∈ ∂f(x^0).

Step 1 Construct

x^{s+1} = x^s - r_k v^s .

Step 2 If f(x^{s+1}) > f(x^0) + b̄, then go to Step 7.

Step 3 Define v^{s+1} according to the schemes a) or b).

Step 4 Construct e^{s+1} = e^s + (s - j + 1)^{-1} (v^{s+1} - e^s).

Step 5 If ||e^{s+1}|| > ε_k, then s = s + 1 and go to Step 1.

Step 6 Set k = k + 1, j = s + 1, s = s + 1, e^s = v^s and go to Step 1.

Step 7 Set x^{s+1} ∈ {x : f(x) ≤ f(x^0)}, s = s + 1, j = s, k = k + 1 and go to Step 1.

In the construction of the direction v^s the following schemes of subgradient averaging are dealt with.

a) The "moving" average. Let K + 1 be an integer. Then

v^s = Σ_{i=s-K}^{s} λ_{i,s} g^i ,   Σ_{i=s-K}^{s} λ_{i,s} = 1 ,   λ_{i,s} ≥ 0 ,

where g^i ∈ ∂f(x^i).

b) The "weighted" average. Let M + 1 be an integer. Then

v^s = g^s + λ_s (v^{s-1} - g^s) ,

where 0 ≤ λ_s ≤ 1 for s ≢ 0 (mod M), and 0 ≤ λ_s ≤ δ̄ < 1 for s ≡ 0 (mod M).

THEOREM 3.1 Assume that the problem (*) is solved by Algorithm 2. Then all limit points of the sequence {x^s} belong to the set X*.
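Scheme b) is a one-line recursion, so its smoothing effect is easy to see in isolation. The sketch below feeds it the alternating subgradients that arise when a descent method zig-zags across a "ravine"; the particular values of λ_s, M, δ̄ and the test sequence are illustrative assumptions, not the paper's:

```python
import numpy as np

def weighted_average(gs, lam=0.8, M=10, delta=0.5):
    """Scheme b): v^s = g^s + lam_s (v^{s-1} - g^s), with lam_s damped to
    delta < 1 on every M-th step; lam, M, delta are illustrative choices."""
    v = np.array(gs[0], float)
    for s in range(1, len(gs)):
        lam_s = delta if s % M == 0 else lam
        v = gs[s] + lam_s * (v - gs[s])
    return v

# Zig-zag across the "ravine" x2 = 0 of f(x) = x1 + 10|x2|: the raw
# subgradients alternate between (1, 10) and (1, -10) from step to step
gs = [np.array([1.0, 10.0 * (-1.0) ** s]) for s in range(50)]
v = weighted_average(gs)
print(v)   # first component stays 1; the oscillating component is damped
```

The averaged direction keeps the stable component (along the ravine bottom) at full strength while the oscillating transverse component is reduced by an order of magnitude, which is the regularizing behavior the section describes.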
4. STOCHASTIC FINITE-DIFFERENCE ANALOGS TO ADAPTIVE NONMONOTONIC METHODS WITH AVERAGING OF SUBGRADIENTS
I t should b e emphasized t h a t t h e p r a c t i c a l value of t h e subgradient-type methods essentially depends upon t h e existence of t h e i r finite-difference analogs.
The finite-difference methods are of great importance primarily in situations when subgradient computation programs are unavailable. This generally occurs in the solution of large-scale problems. Construction of the finite-difference methods in nonsmooth optimization originated two approaches: the deterministic and the stochastic one. Each of them has its own advantages and disadvantages. The stochastic approach is favored here.
One of t h e advantages of t h e introduced averaging operation i s t h e f a c t t h a t t h e construction of stochastic analogs t o subgradient methods p r e s e n t s no special problems.
The offered methods are close to those with smoothing [4] which, in their turn, are closest to the schemes of stochastic quasi-gradient methods [12]. Research into the stochastic quasi-gradient methods with successive step-size regulation is quite a new and underdeveloped field. Ju.M. Ermol'ev spurred the first investigations in this direction. His and Ju.M. Kaniovskij's results [13] are undoubtedly of theoretical interest. However, implementation of the methods described in [14] creates complications, as there is no rule to regulate variations in the step-size.
Let us first dwell on functions f(x, i) of the form

f(x, i) = (2α_i)^{-n} ∫_{x_1-α_i}^{x_1+α_i} ... ∫_{x_n-α_i}^{x_n+α_i} f(y_1, ..., y_n) dy_1 ... dy_n ,

where α_i > 0. Properties of the functions f(x, i) have been studied by A.M. Gupal [4], proceeding from the assumption that f(x) satisfies a local Lipschitz condition.

THEOREM 4.1 If f(x) is a proper convex function, dom f = E^n, then f(x, i) is also a proper convex function, dom f(x, i) = E^n, for any α_i > 0.

THEOREM 4.2 The sequence of functions f(x, i) converges uniformly to f(x) as α_i → 0 in any bounded domain X.
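For the one-dimensional function f(x) = |x|, the uniform (Steklov-type) average over [x - α, x + α] can be written in closed form, which makes Theorem 4.2 easy to check numerically; the closed form below is standard calculus, not taken from the paper:

```python
import numpy as np

def f(x):
    return np.abs(x)

def f_smooth(x, alpha):
    # (1/(2*alpha)) * integral of |y| over [x - alpha, x + alpha]:
    # equals (x^2 + alpha^2)/(2*alpha) near the kink, |x| elsewhere
    return np.where(np.abs(x) <= alpha,
                    (x**2 + alpha**2) / (2 * alpha),
                    np.abs(x))

xs = np.linspace(-2.0, 2.0, 4001)
for alpha in (0.5, 0.1, 0.02):
    gap = np.max(np.abs(f_smooth(xs, alpha) - f(xs)))
    print(alpha, gap)   # sup-gap equals alpha/2, so convergence is uniform
```

The largest deviation sits at the kink x = 0 and equals α/2, vanishing with α; away from the kink the smoothed and original functions coincide, so f(x, i) is also convex, as Theorem 4.1 asserts.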
Now we shall go on to the description of stochastic finite-difference analogs to the algorithms with successive program-adaptive regulation of the step-size and with averaging of the direction.
Description of Algorithm 3

Let x^0 be an arbitrary initial approximation, b > 0 be a constant, {ε_i}, {r_i}, {α_i}, {ρ_i} be number sequences. Put s = 0, i = 0, j = 0.

Step 1 Compute the stochastic estimate ξ^s with the components

ξ_k^s = (2α_i)^{-1} [f(x̃_1^s, ..., x_k^s + α_i, ..., x̃_n^s) - f(x̃_1^s, ..., x_k^s - α_i, ..., x̃_n^s)] ,   k = 1, ..., n ,

where x̃_k^s, k = 1, ..., n, are independent random values distributed uniformly on the intervals [x_k^s - α_i, x_k^s + α_i], α_i > 0.

Step 2 Construct e^s in compliance with the schemes a) and b), where the subgradients are replaced by their stochastic estimates ξ^s.

Step 3 Find x^{s+1} = x^s - r_i e^s.

Step 4 If f(x^{s+1}) > f(x^0) + b, then go to Step 9.

Step 5 Define z̄^{s+1} = z̄^s + (s - j + 1)^{-1} (e^s - z̄^s).

Step 6 If s - j < ρ_i, then s = s + 1 and go to Step 1.

Step 7 If ||z̄^{s+1}|| > ε_i, then s = s + 1 and go to Step 1.

Step 8 Put i = i + 1, j = s + 1, s = s + 1 and go to Step 1.

Step 9 Set x^{s+1} ∈ {x : f(x) ≤ f(x^0)}, j = s + 1, i = i + 1, s = s + 1 and go to Step 1.
THEOREM 4.3 Let the problem (*) be solved by Algorithm 3 and the number sequences {ε_i}, {r_i}, {α_i}, {ρ_i} satisfy the conditions ε_i > 0, ε_i → 0, r_i > 0, r_i → 0, α_i > 0, α_i → 0, ρ_i → ∞. Then for almost all ω the sequence f(x^s(ω)) converges and all limit points of the sequence {x^s(ω)} belong to the set of solutions X*.
Theorem 4.3 is proved in detail in [25].

5. A POSTERIORI ESTIMATES OF ACCURACY OF SOLUTION TO ADAPTIVE SUBGRADIENT METHODS AND THEIR STOCHASTIC FINITE-DIFFERENCE ANALOGS
In the numerical solution of extremum problems of nondifferentiable optimization, strong emphasis is placed on the check of the obtained solution accuracy. Given the solution accuracy estimates, first, a very efficient rule of algorithm stopping can be formulated; second, the obtained estimates can form the basis for justified conclusions with respect to the strategy of selection of the algorithm parameters.

Using a rather simple procedure, a posteriori estimates of solution accuracy for the introduced adaptive algorithms are constructed here. The estimates provide a means for strictly evaluating the efficiency of the use of the averaging operation.
Thus, assume that the convex function minimization problem (*) is being solved. Suppose the set X* contains only one point x*.

To solve the problem (*) consider Algorithm 1. The spin-off from the proof of Theorem 1.1 is the proof that the sequence {x^s} falls outside the set

{x : f(x) ≤ f(x^0) + b}

only a finite number of times. Therefore, s̄ ≥ 0 exists such that for s ≥ s̄ the points x^s remain in this set. Then the step-size will vary only if the condition ||e^{s+1}|| ≤ ε_k is satisfied, where e^{s+1} is the average of the subgradients computed since the last step-size change. Without loss of generality we will assume that the first instant of the change from the step r_0 to r_1 occurred just because the condition

||e^{s_0}|| ≤ ε_0

is satisfied.
From the convexity of the function f(x) it is inferred that

f(x^i) - f(x*) ≤ (g^i, x^i - x*) ,   i = 0, 1, ..., s_0 .   (5.1)-(5.3)

Summation of the inequalities (5.1), (5.2), ..., (5.3) yields

(s_0 + 1)^{-1} Σ_{i=0}^{s_0} [f(x^i) - f(x*)] ≤ (s_0 + 1)^{-1} Σ_{i=0}^{s_0} (g^i, x^i - x^0) + (e^{s_0}, x^0 - x*) .

Denote the expression (s_0 + 1)^{-1} Σ_{i=0}^{s_0} (g^i, x^i - x^0) by Δ_{s_0}. We have the obvious inequalities

min_{0 ≤ i ≤ s_0} f(x^i) - f(x*) ≤ Δ_{s_0} + ε_0 ||x^0 - x*|| ,

where for s_0 ≤ s ≤ s_1 the points x^s are related by x^{s+1} = x^s - r_1 g^s. For these values of s it is possible to derive a similar estimate with the starting point

x̄^{s_0+1} ∈ {x : f(x) ≤ min [f(x^{s_0}), ..., f(x^{s_1})]} .

Thus, for s_k + 1 ≤ s ≤ s_{k+1} we have

min_{0 ≤ i ≤ s_{k+1}} f(x^i) - f(x*) ≤ Δ_k + ε_k ||x̄^{s_k+1} - x*|| ,

where

x̄^{s_k+1} ∈ {x : f(x) ≤ min [f(x^{s_k+1}), ..., f(x^{s_{k+1}})]} .

It is easily proved that Δ_k → 0.
THEOREM 5.1 Assume that the problem (*) is solved by Algorithm 1. Then the inequalities

min_{0 ≤ i ≤ s_k} f(x^i) - f(x*) ≤ Δ_k + ε_k ||x̄^{s_{k-1}+1} - x*||

hold for those instants s_k at which the step-size varies because the condition ||e^{s_k}|| ≤ ε_k is satisfied.

REMARK It follows from Theorem 5.1 that the same estimate occurs both for the subsequence of "records" {min_{0 ≤ i ≤ s_k} f(x^i)} and for the subsequence {x̄^{s_k}}.
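The mechanism behind these estimates can be checked numerically: summing the convexity inequalities f(x^i) - f(x*) ≤ (g^i, x^i - x*) over the iterations and splitting x^i - x* = (x^i - x^0) + (x^0 - x*) bounds the record value of f through the quantity Δ and the averaged subgradient. The sketch below is an illustration with an assumed test function and a fixed step-size, not the paper's algorithm:

```python
import numpy as np

def f(x):
    return np.abs(x).sum()                     # min f = 0 at x* = 0

def subgrad(x):
    return np.where(x >= 0, 1.0, -1.0)

x, x_star, r = np.array([2.0, -1.5]), np.zeros(2), 0.05
pts, gs = [], []
for _ in range(200):                           # plain subgradient iterations
    g = subgrad(x)
    pts.append(x.copy()); gs.append(g)
    x = x - r * g
pts, gs = np.array(pts), np.array(gs)

e = gs.mean(axis=0)                            # averaged subgradient
delta = np.mean(np.einsum('ij,ij->i', gs, pts - pts[0]))   # the Delta term
bound = delta + e @ (pts[0] - x_star)          # a posteriori bound
record = np.abs(pts).sum(axis=1).min()         # best f value seen so far
print(record <= bound + 1e-12)   # the record value is certified by the bound
```

The bound is computable from quantities the algorithm already stores (the subgradients and the averaged direction), which is why it serves both as a stopping rule and as a yardstick for the averaging weights discussed next.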
Let the problem (*) be solved by Algorithm 2, where the operation of averaging of the preceding subgradients is used. Denote the instants of changes in the step-size by s_i, i = 0, 1, 2, .... Suppose the first instant of the change from r_0 to r_1 takes place because the inequality ||e^{s_0}|| ≤ ε_0 holds. Examine the scheme of averaging by the "moving" average. We have

v^s = Σ_{i=s-K}^{s} λ_{i,s} g^i ,   Σ_{i=s-K}^{s} λ_{i,s} = 1 ,   λ_{i,s} ≥ 0 .

Designate the expression Σ_i λ_{i,s} (g^i, x^i - x^0) by Δ_s'. Then, by the same convexity argument as above, for s ≤ K and for s > K we obtain estimates of the form

min_i f(x^i) - f(x*) ≤ Δ_s' + ε_0 ||x^0 - x*|| .

From this formula the following recommendations can be offered with respect to the selection of the parameters λ_{i,s}: minimize

Σ_i λ_{i,s} (g^i, x^i - x^0)   subject to   Σ_i λ_{i,s} = 1 ,   λ_{i,s} ≥ 0 .

The subgradient averaging thereby allows improving the a posteriori estimates of the solution accuracy. This may substantiate formally that it is of advantage to introduce and study the operation of subgradient averaging.
For an arbitrary instant of step-size variation s_i > K we can easily obtain an estimate of the same form (5.9).

THEOREM 5.2 Let the problem (*) be solved by Algorithm 2 with the use of averaging scheme a). Then for the instants s_i for which ||e^{s_i}|| ≤ ε_i, inequality (5.9) holds.

The scheme of averaging by the "weighted" average b) is treated in a similar way.
The a posteriori estimates of the solution accuracy attained for the adaptive subgradient methods can be extended to their stochastic finite-difference analogs with a minimum of alterations. The way of getting them is illustrated with Algorithm 3. We will use the notations introduced in Section 4. When proving Theorem 4.3 it is possible to demonstrate that the step-size r_i varies an infinite number of times. As Algorithm 3 converges with probability one, for almost all ω it is possible to indicate s̄(ω) such that for s ≥ s̄(ω) the iterates remain in the bounded level set of Step 9. Therefore, for s ≥ s̄(ω) the step-size r_i varies because the condition

||z̄^{s_i}|| ≤ ε_i

holds, where s_i ≥ ρ_i + j, z̄^{s_i} = z̄^{s_i-1} + (s_i - j)^{-1} (e^{s_i} - z̄^{s_i-1}), the sequences {ε_i} and {ρ_i} comply with the properties formulated in Theorem 4.3, and j is determined by s_i.
Consider the event that the averaged estimate z̄^{s_i} deviates from the corresponding average of exact subgradients, where s_t is the instant of step-size change that precedes s_i. There exists a constant 0 < C < ∞ such that with probability greater than 1 - Cδ_i it is possible to state that this deviation is small. Then for the instant s_i the analogous a posteriori inequality holds with the same probability.
Theorem 5.3 is readily formulated and proved: assume that the problem (*) is solved by Algorithm 3. Then for almost all ω it is possible to isolate a subsequence of points {x^{s_i}(ω)} for which, with probability greater than 1 - Cδ_i, the corresponding inequalities hold, where

f*_{i-1} = min_{x ∈ E^n} f(x, i-1) ,   x*_{i-1} ∈ Argmin f(x, i-1) .
REFERENCES
1 Ajzerman, M.A., E.M. Braverman and L.I. Rozonoer: Potential Functions Method in Machine Learning Theory. M.: Nauka, 1970, p. 384.
2 Glushkova, O.V. and A.M. Gupal: About Nonmonotonic Methods of Nonsmooth Function Minimization with Averaging of Subgradients. Kibernetika, 1980, No. 6, pp. 128-129.
3 Gupal, A.M. and L.G. Bazhenov: Stochastic Analog to Conjugate Gradient Method. Kibernetika, 1972, No. 1, pp. 125-126.
4 Gupal, A.M.: Stochastic Methods of Solution of Nonsmooth Extremum Problems. Kiev: Naukova dumka, 1979, p. 152.
5 Dem'janov, V.F. and V.N. Malozemov: Introduction to Minimax. M.: Nauka, 1972, p. 368.
6 Eremin, I.I.: The Relaxation Method of Solving Systems of Inequalities with Convex Functions on the Left Side. Dokl. AN SSSR, 1965, Vol. 160, No. 5, pp. 994-996.
7 Ermol'ev, Ju.M.: Methods of Solution of Nonlinear Extremum Problems. Kibernetika, 1966, No. 4, pp. 1-17.
8 Ermol'ev, Ju.M. and N.Z. Shor: On Minimization of Nondifferentiable Functions. Kibernetika, 1967, No. 1, pp. 101-102.
9 Ermol'ev, Ju.M. and Z.V. Nekrylova: Some Methods of Stochastic Optimization. Kibernetika, 1966, No. 6, pp. 96-98.
10 Ermol'ev, Ju.M.: On the Method of Generalized Stochastic Gradients and Stochastic Quasi-Fejer Sequences. Kibernetika, 1969, No. 2, pp. 73-83.
11 Ermol'ev, Ju.M.: On One General Problem of Stochastic Programming. Kibernetika, 1971, No. 3, pp. 47-50.
12 Ermol'ev, Ju.M.: Stochastic Programming Methods. M.: Nauka, 1976, p. 240.
13 Ermol'ev, Ju.M. and Ju.M. Kaniovskij: Asymptotic Properties of Some Stochastic Programming Methods with Constant Step-Size. Zhurn. Vych. Mat. i Mat. Fiziki, 1979, Vol. 19, No. 2, pp. 356-366.
14 Kaniovskij, Ju.M., P.S. Knopov and Z.V. Nekrylova: Limit Theorems for Stochastic Programming. Kiev: Naukova dumka, 1980, p. 156.
15 Loève, M.: Probability Theory. M.: Izd-vo inostr. lit., 1967, p. 720.
16 Norkin, V.N.: Method of Nondifferentiable Function Minimization with Averaging of Generalized Gradients. Kibernetika, 1980, No. 6, pp. 86-89, 102.
17 Nurminskij, E.A.: Convergence Conditions for Nonlinear Programming Algorithms. Kibernetika, 1973, No. 1, pp. 122-125.
18 Nurminskij, E.A. and A.A. Zhelikovskij: Investigation of One Regulation of Step in Quasi-Gradient Method for Minimizing Weakly Convex Functions. Kibernetika, 1974, No. 6, pp. 101-105.
19 Poljak, B.T.: Generalized Method of Solving Extremum Problems. Dokl. AN SSSR, 1967, Vol. 174, No. 1, pp. 33-36.
20 Poljak, B.T.: Minimization of Nonsmooth Functionals. Zhurn. vychisl. mat. i mat. fiziki, 1969, Vol. 9, No. 3, pp. 509-521.
21 Tsypkin, Ja.Z.: Adaptation and Learning in Automatic Systems. M.: Nauka, 1968.
22 Tsypkin, Ja.Z.: Generalized Learning Algorithms. Avtomatika i telemekhanika, 1970, No. 1, pp. 97-103.
23 Chepurnoj, N.D.: One Successive Step-Size Regulation for Quasi-Gradient Method of Weakly Convex Function Minimization. In: Issledovanie Operacij i ASU. Kiev: Vyshcha shkola, 1981, No. 19, pp. 13-15.
24 Chepurnoj, N.D.: Averaged Quasi-Gradient Method with Successive Step-Size Regulation to Minimize Weakly Convex Functions. Kibernetika, 1981, No. 6, pp. 131-132.
25 Chepurnoj, N.D.: One Successive Step-Size Regulation in Stochastic Method of Nonsmooth Function Minimization. Kibernetika, 1982, No. 4, pp. 127-129.
26 Shor, N.Z.: Application of Gradient Descent Method for Solution of Network Transportation Problem. In: Materialy nauchnogo seminara po prikladnym voprosam kibernetiki i issledovanija operacij. Nauchnyj sovet po kibernetike IK AN USSR, Kiev, 1962, vypusk 1, pp. 9-17.
27 Shor, N.Z.: Investigation of Space Dilation Operation in Convex Function Minimization Problems. Kibernetika, 1970, No. 1, pp. 6-12.
28 Shor, N.Z. and N.G. Zhurbenko: Minimization Method Using Space Dilation in the Direction of Difference of Two Successive Gradients. Kibernetika, 1971, No. 3, pp. 51-59.
29 Shor, N.Z.: Nondifferentiable Function Minimization Methods and Their Applications. Kiev: Nauk. dumka, 1979, p. 200.
30 Demjanov, V.F.: Algorithms for Some Minimax Problems. Journal of Computer and System Sciences, 1968, 2, No. 4, pp. 342-380.
31 Lemarechal, C.: An Algorithm for Minimizing Convex Functions. In: Information Processing '74 (ed. Rosenfeld), 1974, North-Holland, Amsterdam, pp. 552-556.
32 Lemarechal, C.: Nondifferentiable Optimization: Subgradient and Epsilon Subgradient Methods. Lecture Notes in Economics and Mathematical Systems (ed. Oettli, W.), 1975, 117, Springer, Berlin, pp. 191-199.
33 Bertsekas, D.P. and S.K. Mitter: A Descent Numerical Method for Optimization Problems with Nondifferentiable Cost Functions. SIAM Journal on Control, 1973, 11, No. 4, pp. 637-652.
34 Wolfe, P.: A Method of Conjugate Subgradients for Minimizing Nondifferentiable Functions. In: Nondifferentiable Optimization (eds. Balinski, M.L., Wolfe, P.), Mathematical Programming Study 3, 1975, North-Holland, Amsterdam, pp. 145-173.