

$$E\|\xi^0(s)\|^2 \le \mathrm{const} < \infty.$$

6.2.2 The Lagrange Multiplier Method

The method is characterized by the relations

$$x^{s+1} = \pi_X\Bigl(x^s - \rho_s\Bigl[\xi^0(s) + \sum_{i=1}^m u_i^s\,\xi^i(s)\Bigr]\Bigr),$$
$$u_i^{s+1} = \max\bigl\{0,\ u_i^s + \delta_s\,\eta_i(s)\bigr\},\qquad i = 1:m,$$

and when $X = R^n$, $\delta_s \equiv \rho_s \equiv \mathrm{const}$, $\xi^\nu(s) = F_x^\nu(x^s)$, $\eta_i(s) = F^i(x^s)$, $i = 1:m$, and the $F^\nu(x)$, $\nu = 0:m$, are smooth, it is the deterministic algorithm proposed in [52]. The stochastic version of this method was studied in [1], [5], where it was proved that $\min_{k\le s} F^0(x^k)$ converges to $\min F^0(x)$ with probability 1, provided that $F^0(x)$ is strictly convex and $\delta_s \equiv \rho_s$. The convergence for convex functions $F^0(x)$ (not necessarily strictly convex) was studied in [21] under the assumption that $\rho_s/\delta_s \to 0$.
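A minimal numerical sketch of this primal-dual recursion may help. Everything concrete below (the toy problem, the noise model, the step sizes, the box $X$) is our own assumption, not from the text; the constraint is written as $F^1(x) \le 0$ so that the multiplier update as printed performs dual ascent.

```python
import random

random.seed(0)

# Toy problem (our own illustration, not from the text):
#   minimize F0(x) = x^2  subject to F1(x) = 1 - x <= 0  (i.e. x >= 1),
# with noisy observations of the gradient and of the constraint value.
# Saddle point of the Lagrangian: x* = 1, u* = 2.

def project(x, lo=-5.0, hi=5.0):      # pi_X for X = [-5, 5]
    return max(lo, min(hi, x))

x, u = 0.0, 0.0
for s in range(20000):
    rho = delta = (s + 1) ** -0.7     # step sizes rho_s = delta_s -> 0
    xi0 = 2.0 * x + random.gauss(0.0, 0.1)    # estimate of F0_x(x^s)
    xi1 = -1.0                                 # gradient of F1(x) = 1 - x
    eta1 = (1.0 - x) + random.gauss(0.0, 0.1)  # estimate of F1(x^s)
    x = project(x - rho * (xi0 + u * xi1))     # primal step
    u = max(0.0, u + delta * eta1)             # multiplier step

print(round(x, 2), round(u, 2))   # should drift toward the saddle point (1, 2)
```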

Stochastic Optimization Problems

6.2.3 Penalty Function Methods. Averaging Operation

Constraints of type (6.2) of the general problem (6.1)-(6.3) can be taken into account by means of penalty functions: instead of the original problem, we can minimize a penalized function, for instance

$$\Psi(x,c) = F^0(x) + \frac{c}{2}\sum_{i=1}^m \bigl(\min\{0,\ F^i(x)\}\bigr)^2$$

on the set $X$, where $c$ is a sufficiently large number. A generalized gradient of $\Psi(x,c)$ at $x = x^s$ is

$$F_x^0(x^s) + c\sum_{i=1}^m \min\{0,\ F^i(x^s)\}\,F_x^i(x^s).$$

If the exact values of $F^\nu(x^s)$, $F_x^0(x^s)$, $F_x^i(x^s)$ are known, then a deterministic generalized gradient procedure can be used for minimizing $\Psi(x,c)$. Penalty function methods for problems with known values of the constraint functions $F^i(x^s)$ were considered in [61], [63]. In such cases the projection method (6.11) is applicable to minimizing $\Psi(x,c)$, since an estimate of the subgradient $\Psi_x(x^s,c)$ is the vector

$$\xi^0(s) + c\sum_{i=1}^m \min\{0,\ F^i(x^s)\}\,\xi^i(s).$$

In general, if instead of the values $F^\nu(x^s)$, $F_x^\nu(x^s)$, $\nu = 0:m$, only statistical estimates $\eta_\nu(s)$, $\xi^\nu(s)$ are available, it is impossible to find $\min\{0, F^i(x^s)\}$ exactly. How to handle this situation was studied in [4], [5].

Consider the following variant of the iterative scheme studied in the previous section.

$$x^{s+1} = \pi_X\Bigl(x^s - \rho_s\Bigl[\xi^0(s) + c\sum_{i=1}^m \min\{0,\ \bar F_i(s)\}\,\xi^i(s)\Bigr]\Bigr),  (6.16)$$

$$\bar F_i(s+1) = \psi_s\,\eta_i(s+1) + (1 - \psi_s)\bar F_i(s),\qquad i = 1:m,  (6.17)$$

where $\psi_s$ is a step-size, $0 \le \psi_s \le 1$, $\bar F_i(0) = \eta_i(0)$,

$$E\{\eta_i(s)\mid x^0,\ldots,x^s\} = F^i(x^s) + a_i(s),$$
$$F^\nu(x) - F^\nu(x^s) \ge \bigl(E\{\xi^\nu(s)\mid x^0,\ldots,x^s\},\ x - x^s\bigr) + \gamma_\nu(s).$$

For convergence with probability 1 of these kinds of procedures, in addition to (6.15) we must demand that with probability 1

$$\psi_s \ge 0,\qquad \rho_s/\psi_s \to 0,\qquad \sum_{s=0}^\infty E\psi_s^2 < \infty,$$
$$\sum_{s=0}^\infty E\bigl\{\rho_s|\gamma_\nu(s)| + \psi_s|a_i(s)|\bigr\} < \infty,\qquad i = 1:m.$$
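A one-dimensional sketch of the scheme (6.16)-(6.17) may clarify how the averaged constraint estimate enters the step. The toy problem, the noise model, and all constants below are our own assumptions; we read the penalty quadratically, so the penalized stationary point solves $2x + c\min\{0, x-1\} = 0$, i.e. $x = c/(c+2)$ (quadratic penalties are inexact for finite $c$).

```python
import random

random.seed(1)

# Our own toy setup for the scheme (6.16)-(6.17):
#   minimize F0(x) = x^2 subject to F1(x) = x - 1 >= 0,
# using only noisy observations eta1(s) of F1 and xi0(s) of F0_x.
# For c = 10 the penalized stationary point is x = c/(c+2) ~ 0.833.

c = 10.0
x = 0.0
fbar = (x - 1.0) + random.gauss(0.0, 0.1)   # Fbar_1(0) = eta_1(0)
for s in range(4000):
    rho = 0.1 * (s + 1) ** -0.7             # rho_s
    psi = (s + 1) ** -0.5                   # psi_s, with rho_s/psi_s -> 0
    xi0 = 2.0 * x + random.gauss(0.0, 0.1)  # estimate of F0_x(x^s)
    xi1 = 1.0                               # gradient of F1(x) = x - 1
    # step (6.16), with the averaged estimate Fbar in place of F1(x^s):
    x = max(-5.0, min(5.0, x - rho * (xi0 + c * min(0.0, fbar) * xi1)))
    # averaging operation (6.17) at the new point:
    eta1 = (x - 1.0) + random.gauss(0.0, 0.1)
    fbar = psi * eta1 + (1.0 - psi) * fbar

print(round(x, 2))   # should settle near c/(c+2)
```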

It is worthwhile to note that the above-mentioned method may not converge when $\psi_s \equiv 1$, i.e., for $\bar F_i(s) = \eta_i(s)$. If $\psi_s = 1/(s+1)$, then

$$\bar F_i(s) = \frac{1}{s+1}\sum_{k=0}^{s}\eta_i(k).$$
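That the recursion (6.17) with $\psi_s = 1/(s+1)$ reproduces the ordinary arithmetic mean of the observations can be checked directly (a sanity check of our own):

```python
import random

random.seed(2)

# Sanity check (our own): with psi_s = 1/(s+1), the recursion (6.17)
# reproduces the arithmetic mean of the observations eta(0), ..., eta(s).
eta = [random.gauss(0.0, 1.0) for _ in range(500)]

fbar = eta[0]                       # Fbar(0) = eta(0)
for s in range(1, len(eta)):
    psi = 1.0 / (s + 1)
    fbar = psi * eta[s] + (1.0 - psi) * fbar

mean = sum(eta) / len(eta)
print(abs(fbar - mean) < 1e-9)      # True: the two computations agree
```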

This is why (6.17) is called the averaging operation. In the case when $F^i(x) = Ef^i(x,\omega)$,

$$\bar F_i(s+1) = \psi_s\,f^i(x^{s+1}, \omega^{s+1}) + (1 - \psi_s)\bar F_i(s),\qquad i = 1:m.  (6.18)$$

The averaging procedure has proved to be very useful in stochastic and nondifferentiable optimization; the following general fact is decisive for this operation. Consider the auxiliary procedure (6.17) by itself, for a given sequence $\{x^s\}_{s=0}^\infty$. The procedure (6.17) has the following general form:

$$\beta(s+1) = \beta(s) - \psi_s\bigl[\beta(s) - \eta(s+1)\bigr],\qquad s = 0,1,\ldots  (6.19)$$

where $\psi_s$ is a $B_s$-measurable function and $\eta(s)$ is a random observation of a vector $V(s)$:

$$E\{\eta(s)\mid B_s\} = V(s) + a(s),  (6.20)$$

which in the case of the method (6.16), (6.17) takes the form $V(s) = F(x^s) = (F^1(x^s),\ldots,F^m(x^s))$. Under rather general assumptions (see, for instance, [10], p. 46), provided that with probability 1

$$\|V(s+1) - V(s)\|/\psi_s \to 0,  (6.21)$$
$$\psi_s \ge 0,\qquad \sum_{s=0}^\infty E\psi_s^2 < \infty,\qquad \|a(s)\|/\psi_s \to 0,  (6.22)$$

it can be shown that with probability 1

$$\|\beta(s) - V(s)\| \to 0 \quad\text{for } s \to \infty.  (6.23)$$

Therefore $\beta(s)$ estimates the vector $V(s)$ with increasing precision, and we can "substitute" $\beta(s)$ for the unknown $V(s)$. If the $F^i(x)$, $i = 1:m$, are Lipschitz continuous functions in $X$, the points $x^{s+1}$, $x^s$ are connected through equation (6.16), and $|\eta_i(s)| < \mathrm{const}$, $\|\xi^\nu(s)\| < \mathrm{const}$, $i = 1:m$, $\nu = 0:m$, then assumption (6.21) follows from the condition

$$\rho_s/\psi_s \to 0 \quad\text{for } s \to \infty  (6.24)$$

with probability 1.

The assertion (6.23) has close connections with the general Theorem 5 concerning the convergence of nonstationary optimization procedures, since the step direction

$$2\bigl[\beta(s) - \eta(s+1)\bigr]$$

of (6.19) is a stochastic quasigradient of the time-dependent function

$$\Phi^s(\beta) = E\|\beta - \eta(s+1)\|^2  (6.25)$$

at $\beta = \beta(s)$.

The averaging operation enables us to elaborate many stochastic analogues of known deterministic methods. Gupal [8] has studied the following stochastic version of the deterministic procedure described in paper [36]:

$$x^{s+1} = \pi_X\bigl[x^s - \rho_s r^s\bigr],  (6.26)$$

$$r^s = \begin{cases} \xi^0(s), & \text{if } \bar F_{i_s}(s) = \max_{1\le i\le m}\bar F_i(s) \le 0,\\ \xi^{i_s}(s), & \text{if } \bar F_{i_s}(s) > 0. \end{cases}$$

The requirements for convergence of this method are similar to those for the method (6.16).
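The switching rule above can be sketched on a one-dimensional toy problem. All concrete choices below (problem, noise, step sizes) are our own; the single constraint is written as $F^1(x) \le 0$, so a positive averaged estimate signals violation.

```python
import random

random.seed(3)

# Sketch of the switching rule (6.26) on a toy problem (our own setup):
#   minimize F0(x) = x^2 subject to F1(x) = 1 - x <= 0  (i.e. x >= 1).
# The averaged estimate fbar of F1 decides whether the step uses the
# objective quasigradient or the quasigradient of the violated constraint.

x = 3.0
fbar = (1.0 - x) + random.gauss(0.0, 0.1)
for s in range(5000):
    rho = 0.5 * (s + 1) ** -0.7
    psi = (s + 1) ** -0.5
    if fbar <= 0.0:                        # constraint estimate satisfied
        d = 2.0 * x + random.gauss(0.0, 0.1)   # xi0(s)
    else:                                  # violated constraint drives the step
        d = -1.0 + random.gauss(0.0, 0.1)      # xi1(s), gradient of 1 - x
    x = max(-5.0, min(5.0, x - rho * d))
    eta1 = (1.0 - x) + random.gauss(0.0, 0.1)
    fbar = psi * eta1 + (1.0 - psi) * fbar

print(round(x, 1))   # should oscillate toward the optimum x* = 1
```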

Consider now some other methods for which the averaging operation appeared to be crucial.

6.2.4 Mixed Stochastic Quasigradient Method

Bajenov and Gupal [25] were the first to apply the averaging procedure to step directions. The method is defined by the relations

$$x^{s+1} = \pi_X\bigl[x^s - \rho_s d^s\bigr],  (6.27)$$
$$d^{s+1} = \delta_s\,\xi^0(s+1) + (1 - \delta_s)d^s = d^s + \delta_s\bigl[\xi^0(s+1) - d^s\bigr],  (6.28)$$
$$E\{\xi^0(s)\mid x^0, d^0, \ldots, x^s, d^s\} = F_x^0(x^s) + b^0(s),  (6.29)$$

$s = 0,1,\ldots$, with initial $d^0 = \xi^0(0)$. Such types of methods have also been studied in [10], [70], [71], [73].

The sequence $\{x^s\}$ converges with probability 1 to an optimal solution provided that, in addition to requirements a), b), c) of Theorem 2, the scalars $\rho_s$, $\delta_s$ are chosen so as to satisfy, with probability 1,

$$\rho_s \ge 0,\quad \delta_s \ge 0,\quad \rho_s/\delta_s \to 0,\quad \|b^0(s)\| \to 0,$$
$$\sum_{s=0}^\infty \rho_s = \infty,\qquad \sum_{s=0}^\infty E(\rho_s^2 + \delta_s^2) < \infty.  (6.30)$$

The vector $d^s$ defined by the recurrence (6.28) is called the averaged, aggregated, or mixed stochastic quasigradient.
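The mixed SQG iteration (6.27)-(6.28) can be sketched as follows. The toy objective, the noise level, and the step-size exponents are our own assumptions; they satisfy $\rho_s/\delta_s \to 0$ as required by (6.30).

```python
import random

random.seed(4)

# Sketch of the mixed SQG method (6.27)-(6.28) (our own toy problem):
#   minimize F0(x) = (x1 - 1)^2 + (x2 + 2)^2 over X = [-3, 3]^2,
# observing the gradient only through noise; d^s is the averaged
# ("mixed") stochastic quasigradient.

def grad(x):                    # exact F0_x, used only inside the oracle
    return [2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]

def oracle(x):                  # xi0(s): gradient plus noise
    return [g + random.gauss(0.0, 0.3) for g in grad(x)]

x = [0.0, 0.0]
d = oracle(x)                   # d^0 = xi0(0)
for s in range(3000):
    rho = (s + 1) ** -0.8       # rho_s
    delta = (s + 1) ** -0.5     # delta_s, with rho_s/delta_s -> 0
    x = [max(-3.0, min(3.0, xi - rho * di)) for xi, di in zip(x, d)]
    xi0 = oracle(x)             # xi0(s+1)
    d = [di + delta * (gi - di) for di, gi in zip(d, xi0)]   # (6.28)

print([round(v, 2) for v in x])   # should end near the minimizer (1, -2)
```

The averaging in $d^s$ is what allows larger effective steps than raw stochastic gradients would: the noise is smoothed over roughly $1/\delta_s$ iterations.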

Stochastic Quasigradient Methods

6.3 Nonconvex Nondifferentiable Functions - Finite Difference Approximation Schemes

The convergence of SQG methods for nonconvex objective and constraint functions has been studied by many authors (see [5], [10], [12]). In [12], Nurminski generalized the method (6.11) to the case of nonconvex and nondifferentiable objective functions satisfying the inequality

$$F^0(z) - F^0(x) \ge (F_x^0(x),\ z - x) + o(\|z - x\|) \quad\text{when } z \to x$$

for all $x$ from a compact set. Such functions are called weakly convex. The class of weakly convex functions includes convex functions as well as nonconvex differentiable ones. Moreover, the maximum of a collection of weakly convex functions is also weakly convex. Significant results in elaborating SQG methods for nonconvex and nondifferentiable functions were obtained in [9], [10], [31]. In these papers the following stochastic versions of finite difference approximation schemes were proposed.

If the values of the functions $F^\nu(x)$, $\nu = 0:m$, can easily be calculated and the $F^\nu(x)$ are differentiable, then there exist methods using finite difference approximations of the gradients $F_x^\nu(x^s)$ at the current point $x^s$:

$$F_x^\nu(x^s) \approx \sum_{i=1}^n \frac{F^\nu(x^s + \Delta e^i) - F^\nu(x^s)}{\Delta}\, e^i,  (6.31)$$

$$F_x^\nu(x^s) \approx \sum_{i=1}^n \frac{F^\nu(x^s + \Delta e^i) - F^\nu(x^s - \Delta e^i)}{2\Delta}\, e^i,  (6.32)$$

where $e^i$ is the unit vector on the $i$-th axis and $\Delta > 0$.

Although finite difference approximations exist for nondifferentiable functions, their use does not guarantee the convergence of optimization procedures. The proposed modification of the finite-difference approximation schemes consists in a slight randomization:

$$\xi^\nu(s) = \sum_{i=1}^n \frac{F^\nu(\tilde x^s + \Delta_s e^i) - F^\nu(\tilde x^s)}{\Delta_s}\, e^i \approx F_x^\nu(x^s),  (6.33)$$

$$\xi^\nu(s) = \sum_{i=1}^n \frac{F^\nu(\tilde x^{si} + \Delta_s e^i) - F^\nu(\tilde x^{si} - \Delta_s e^i)}{2\Delta_s}\, e^i \approx F_x^\nu(x^s),  (6.34)$$

where $F_x^\nu(x)$ is a subgradient, $\tilde x^s = (x_1^s + h_1^s,\ \ldots,\ x_n^s + h_n^s)$, $\tilde x^{si} = (x_1^s + h_1^s,\ \ldots,\ x_{i-1}^s + h_{i-1}^s,\ x_i^s,\ x_{i+1}^s + h_{i+1}^s,\ \ldots,\ x_n^s + h_n^s)$, $i = 1:n$, and the $h_i^s$ are independent random quantities uniformly distributed on the interval $[-\Delta_s/2,\ \Delta_s/2]$.
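The effect of the randomized forward-difference scheme (6.33) can be seen on a tiny example of our own (the function, the point, and $\Delta$ are arbitrary choices, not from the text):

```python
import random

random.seed(5)

# Randomized forward-difference estimator in the spirit of (6.33):
# f(x) = |x1| + |x2| at x = (1, -1), with the base point perturbed
# by h_i uniform on [-Delta/2, Delta/2].

def f(x):
    return abs(x[0]) + abs(x[1])

x, delta, n = [1.0, -1.0], 0.1, 2
h = [random.uniform(-delta / 2, delta / 2) for _ in range(n)]
xt = [xi + hi for xi, hi in zip(x, h)]        # perturbed point x-tilde

xi_est = []
for i in range(n):
    xp = list(xt)
    xp[i] += delta                            # x-tilde + Delta * e_i
    xi_est.append((f(xp) - f(xt)) / delta)

print(xi_est)   # close to [1.0, -1.0], a subgradient of f at (1, -1)
```

Here the perturbation is small enough that no sign change occurs, so a single sample already recovers the subgradient; near a kink the randomization is what makes the estimator informative on average.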

The convergence of the corresponding optimization procedures is based on the fact that with probability 1

$$\min_{F_x^\nu(x^s)}\Bigl\|\,E\Bigl\{\sum_{i=1}^n \frac{F^\nu(\tilde x^{si} + \Delta_s e^i) - F^\nu(\tilde x^{si} - \Delta_s e^i)}{2\Delta_s}\, e^i \,\Big|\, x^s\Bigr\} - F_x^\nu(x^s)\Bigr\| \to 0  (6.35)$$

when $\Delta_s \to 0$ and the $F^\nu(x)$ are locally Lipschitz functions (the minimum is taken over the subgradients $F_x^\nu(x^s)$). Therefore the vectors $\xi^\nu(s)$ defined by (6.33), (6.34) are also statistical estimates of the subgradient $F_x^\nu(x^s)$, satisfying the general requirements (6.4), (6.5), (6.6).

For stochastic programming problems, when $F^\nu(x) = Ef^\nu(x,\omega)$, we have analogues of (6.33), (6.34):

$$\xi^\nu(s) = \sum_{i=1}^n \frac{f^\nu(\tilde x^s + \Delta_s e^i,\ \omega^{si}) - f^\nu(\tilde x^s,\ \omega^{s0})}{\Delta_s}\, e^i,  (6.36)$$

$$\xi^\nu(s) = \sum_{i=1}^n \frac{f^\nu(\tilde x^{si} + \Delta_s e^i,\ \omega^{si}) - f^\nu(\tilde x^{si} - \Delta_s e^i,\ \omega^{s,n+i})}{2\Delta_s}\, e^i,  (6.37)$$

which also satisfy relation (6.35).

Different generalizations of SQG methods to the case of locally Lipschitz functions $F^\nu(x)$, making use of approximations of the types (6.33), (6.34), (6.36), (6.37), have been studied in papers [10], [13], [14]. Let us discuss the general idea of such procedures in more detail.

6.4 Simultaneous Optimization and Approximation Procedures. Nonstationary Optimization

Suppose we have to minimize a function $f^0(x)$ of a rather complex nature; for example, one that does not have continuous derivatives. Consider a sequence of "good" functions $\{F^0(x,s)\}$, for instance smooth ones, converging to $f^0(x)$ for $s \to \infty$, and the procedure

$$x^{s+1} = x^s - \rho_s F_x^0(x^s, s),\qquad s = 0,1,\ldots  (6.38)$$

Under rather general conditions ($\rho_s \downarrow 0$, $\sum\rho_s = \infty$) it is possible to show (see [5], [14], [11] and Theorem 6.3) that $F^0(x^s,s) \to \min f^0(x)$.

Often the approximate functions may have the form of mathematical expectations

$$F^0(x,s) = \int f^0(x+h)\,P_s(dh) = Ef^0\bigl(x + h(s)\bigr),  (6.39)$$

where the measure $P_s(dh)$ is, for $s \to \infty$, centered at the point $0$. Hence, instead of the procedure given by (6.38), which requires the exact value of the gradient of the mathematical expectation, we can use the ideas of the stochastic quasigradient methods.

For example (see [9]), let $h(s)$ be random vectors with independent components uniformly distributed on $[-\Delta_s/2,\ \Delta_s/2]$, $\Delta_s \to 0$ for $s \to \infty$, and suppose that $F^0(x,s)$ is continuously differentiable and $F^0(x,s) \to f^0(x)$ uniformly on any bounded domain. Consider the stochastic procedure with the (6.34)-type approximation

$$x^{s+1} = x^s - \rho_s\,\xi^0(s),\qquad s = 0,1,\ldots$$
$$\xi^0(s) = \sum_{i=1}^n \frac{f^0(\tilde x^{si} + \Delta_s e^i) - f^0(\tilde x^{si} - \Delta_s e^i)}{2\Delta_s}\, e^i.  (6.40)$$

It can be shown that

$$E\{\xi^0(s)\mid x^s\} = F_x^0(x^s, s),$$

where $F^0(x,s)$ is defined by (6.39).
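A sketch of procedure (6.40) on a separable test function of our own: for $f^0(x) = |x_1| + |x_2|$ the $i$-th central difference depends only on $x_i$ (the perturbations of the other coordinates cancel), so this run is effectively deterministic. All constants are our own assumptions.

```python
import random

random.seed(6)

# Sketch of procedure (6.40) (our own setup): minimize f0(x) = |x1| + |x2|
# using the randomized central differences of (6.34) with Delta_s -> 0.

def f0(x):
    return abs(x[0]) + abs(x[1])

x = [2.0, -1.5]
for s in range(4000):
    rho = 0.3 * (s + 1) ** -0.7
    delta = (s + 1) ** -0.25
    xi = []
    for i in range(2):
        # x-tilde^{si}: other coordinates perturbed, the i-th left exact
        xt = [xj + random.uniform(-delta / 2, delta / 2) for xj in x]
        xt[i] = x[i]
        xp, xm = list(xt), list(xt)
        xp[i] += delta
        xm[i] -= delta
        xi.append((f0(xp) - f0(xm)) / (2.0 * delta))
    x = [xj - rho * g for xj, g in zip(x, xi)]

print([round(v, 2) for v in x])   # both coordinates should end near 0
```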

In other words, the method (6.40) is a stochastic analogue of the method (6.38). Procedures (6.38), (6.40) are examples of simultaneous optimization and estimation procedures. The development of such procedures is connected with the following general problem of nonstationary optimization [15]-[20], [53], [75].

The objective function $F^0(x,s)$ and the feasible set $X_s$ of the nonstationary problem depend on the iteration number $s = 0,1,\ldots$. It is necessary to create a sequence of approximate solutions $\{x^s\}$ that tends, in some sense, to follow the time path of the optimal solutions: for $s \to \infty$,

$$\lim\bigl[F^0(x^s, s) - \min\{F^0(x,s)\mid x\in X_s\}\bigr] = 0.$$

The case when there exist $\lim F^0(x,s)$ and $\lim X_s$ (in some sense) for $s \to \infty$ was called the limit extremal problem [14], [17], [5]. Optimization problems with time-varying functions and a known trend of the optimal solutions are considered in [53], [54], [60].

To illustrate the ideas involved in the proofs of convergence results, let us consider the following simple case.

Theorem 6.3. Assume that:
(a) $F^0(x,s)$, $f^0(x)$ are convex continuous functions,
(b) $X$ is a convex compact set,
(c) $F^0(x,s) \to f^0(x)$ uniformly in $X$,
(d) $\|F_x^0(x^s,s)\| \le \mathrm{const}$,

$$x^{s+1} = \pi_X\bigl[x^s - \rho_s F_x^0(x^s, s)\bigr]  (6.41)$$

and the parameters $\rho_s$ satisfy the conditions

$$\rho_s \to 0,\qquad \sum_{s=0}^\infty \rho_s = \infty.$$

Then $F^0(x^s, s) \to f^0(x^*) = \min\{f^0(x)\mid x\in X\}$.
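Theorem 6.3 can be illustrated numerically. Our own choice of the approximating family: $F^0(x,s) = \sqrt{x^2 + \Delta_s^2}$ with $\Delta_s \to 0$ converges uniformly to $f^0(x) = |x|$ on $X = [-2,2]$, and projected gradient steps with $\rho_s \to 0$, $\sum\rho_s = \infty$ track the minimizer $x^* = 0$.

```python
import math

# Noise-free illustration of procedure (6.41) (our own example):
# F0(x, s) = sqrt(x^2 + Delta_s^2) is smooth and converges uniformly
# to f0(x) = |x|; the iterates approach the minimizer x* = 0.

x = 2.0
for s in range(3000):
    rho = (s + 1) ** -0.7
    delta = (s + 1) ** -0.2
    g = x / math.sqrt(x * x + delta * delta)   # F0_x(x^s, s)
    x = max(-2.0, min(2.0, x - rho * g))       # projection onto X = [-2, 2]

print(abs(x) < 1e-2)   # True: F0(x^s, s) approaches min |x| = 0
```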

The principal difficulties associated with the convergence of procedure (6.41) are connected with the choice of the step-size $\rho_s$. There is no guarantee that the new approximate solution $x^{s+1}$ will belong to the domain of smaller values of the functions $F^0(x,t)$ for $t \ge s+1$ (see Figure 6.1). Therefore, even for $X = R^n$ and continuously differentiable functions $F^0(x,s)$, the procedure (6.41) is essentially nonmonotonic. There is one more difficulty. In the general case, without assumption (c), the aim of $\{x^s\}$ is to track the set of optimal solutions

$$X_s^* = \{x\mid F^0(x,s) = \min F^0(x,s),\ x\in X_s\}.$$

Unfortunately, the Hausdorff distance between $X_s^*$ and $X_{s+1}^*$ may be large even for a small distance between $F^0(x,s)$ and $F^0(x,s+1)$, as shown in Figure 6.2.

Figure 6.1. (Level sets of $F^0(x,s+1)$: the step along $-F_x^0(x^s,s)$ need not decrease $F^0(\cdot,s+1)$.)

Figure 6.2. (The optimal sets $X_s^*$ and $X_{s+1}^*$ may be far apart even when $F^0(\cdot,s)$ and $F^0(\cdot,s+1)$ are close.)

The convergence study of procedures of the types (6.38), (6.40), (6.41) in the general case involves the sets of $\epsilon_s$-solutions (see [18], [15], [16]). Essentially nonmonotonic solution procedures need an appropriate technique to prove their convergence. Often the necessary analysis can be based on the following result [5], [11].

Theorem 6.4. [5, p. 181] Suppose that $X^* \subset R^n$ is closed and $\{x^s\}$ is a sequence of vectors in $R^n$ such that:
(1) for all $s$, $x^s \in K$, with $K$ compact;
(2) for any subsequence $\{x^{s_k}\}$ with $\lim x^{s_k} = x'$:
(a) if $x' \in X^*$, then $\|x^{s_k+1} - x^{s_k}\| \to 0$ as $k \to \infty$;
(b) if $x' \notin X^*$, then for $\epsilon$ sufficiently small and for any $s_k$,

$$\tau_k = \min\{\tau \mid \tau \ge s_k,\ \|x^{s_k} - x^\tau\| \ge \epsilon\} < \infty;$$

(3) there exists a continuous function $V(x)$ attaining on $X^*$ an at most countable set of values, such that

$$\lim_{k\to\infty} V(x^{\tau_k}) < \lim_{k\to\infty} V(x^{s_k}).$$

Then the sequence $\{V(x^s)\}$ converges and all accumulation points of the sequence $\{x^s\}$ belong to $X^*$.

The conditions of this theorem are similar to the necessary and sufficient convergence conditions proposed by Zangwill (see [65]). However, Zangwill's conditions are very difficult to verify for a nonmonotonic procedure. Condition (2) of Theorem 6.4 prevents the whole sequence $\{x^s\}$ from converging to a limit point $x'$ which does not belong to the set $X^*$. However, condition (2) alone does not prevent "cycling", i.e., such behavior of $\{x^s\}$ that it visits any neighborhood of $x' \notin X^*$ infinitely many times. To exclude such a case, condition (3) is imposed, which guarantees that the sequence $\{x^s\}$ leaves a neighborhood of $x'$ with decreasing values of some Lyapunov function $V(x)$. Let us now illustrate the use of this theorem.

Proof of Theorem 6.3: Conditions (1) and (2a) of Theorem 6.4 are fulfilled. It suffices to verify conditions (2b) and (3). Let $x^{s_k} \to x' \notin X^*$; we need to show that $\tau_k < \infty$. We argue by contradiction and suppose that $\tau_k = \infty$. Consider the function $V(x) = \min_{x^*\in X^*}\|x^* - x\|^2$. We have

$$V(x^{s+1}) = \min_{x^*\in X^*}\|x^* - x^{s+1}\|^2 = \|x^*(s+1) - x^{s+1}\|^2 \le \|x^*(s) - x^{s+1}\|^2$$
$$\le V(x^s) + 2\rho_s\bigl(F_x^0(x^s,s),\ x^*(s) - x^s\bigr) + \rho_s^2\,\|F_x^0(x^s,s)\|^2.$$

Since $x^{s_k} \to x' \notin X^*$ and, by the assumption $\tau_k = \infty$, $\|x^s - x^{s_k}\| < \epsilon$ for all $s \ge s_k$ with $k$ large, there exists $\delta > 0$ such that $f^0(x^*) - f^0(x^s) < -\delta$, and for sufficiently large $s$ we have

$$\bigl(F_x^0(x^s,s),\ x^* - x^s\bigr) \le F^0(x^*,s) - F^0(x^s,s)$$
$$= \bigl[F^0(x^*,s) - f^0(x^*)\bigr] + \bigl[f^0(x^*) - f^0(x^s)\bigr] + \bigl[f^0(x^s) - F^0(x^s,s)\bigr] < -\frac{\delta}{2},$$

using the uniform convergence (c). Therefore

$$V(x^{s+1}) \le V(x^s) - \delta\rho_s + C\rho_s^2 = V(x^s) - \rho_s(\delta - C\rho_s) \le V(x^{s_k}) - \frac{\delta}{2}\sum_{t=s_k}^{s}\rho_t,$$

and for sufficiently large $s$ this contradicts the fact that $|V(x^s)| < \mathrm{const}$ when $x^s \in K$. So condition (2) is satisfied. Looking at condition (3), it is easy to see that

$$V(x^{\tau_k}) \le V(x^{s_k}) - \frac{\delta}{2}\sum_{t=s_k}^{\tau_k - 1}\rho_t.$$

Hence, in view of the properties of $\pi_X$,

$$\epsilon < \|x^{\tau_k} - x^{s_k}\| \le \sum_{s=s_k}^{\tau_k-1}\|x^{s+1} - x^s\| \le C\sum_{s=s_k}^{\tau_k-1}\rho_s,$$

where $C$ is a constant. Then

$$V(x^{\tau_k}) \le V(x^{s_k}) - \frac{\delta\epsilon}{2C},$$

or equivalently $\lim V(x^{\tau_k}) < \lim V(x^{s_k})$, and this completes the proof.

Consider now the more general procedure

$$x^{s+1} = \pi_X\bigl[x^s - \rho_s\,\xi^0(s)\bigr],\qquad s = 0,1,\ldots,$$
$$E\{\xi^0(s)\mid x^0,\ldots,x^s\} = F_x^0(x^s,s) + a^0(s).$$

Theorem 6.5. [19] Assume that
(a) the $F^0(x,s)$ are convex continuous functions,
(b) $X$ is a convex compact set,
(c) $\max_{x\in X}\bigl|F^0(x,s+1) - F^0(x,s)\bigr| \le \delta_s$, $E\|\xi^0(s)\| < \mathrm{const}$,
(d) with probability 1

$$\delta_s/\rho_s \to 0,\quad \|a^0(s)\| \to 0,\quad \rho_s \ge 0,\quad \sum_{s=0}^\infty \rho_s = \infty,\quad \sum_{s=0}^\infty E\rho_s^2 < \infty.$$

Then with probability 1

$$\bigl|F^0(x^s,s) - \min\{F^0(x,s)\mid x\in X\}\bigr| \to 0 \quad\text{for } s\to\infty.  (6.42)$$

6.5 Feasible Directions Methods

Consider the minimization of a continuously differentiable function $F^0(x)$ on a compact convex set $X$. If $F^0(x^s)$ and $F_x^0(x^s)$ are known, then the standard linearization method is defined by the relations

$$x^{s+1} = x^s + \rho_s(\bar x^s - x^s),\qquad (F_x^0(x^s),\ \bar x^s) = \min\{(F_x^0(x^s),\ x)\mid x\in X\},$$
$$F^0(x^{s+1}) = \min_{0\le\rho\le 1} F^0\bigl(x^s + \rho(\bar x^s - x^s)\bigr).$$

The stochastic variant of this method has been studied in [5], [6], [10], [30] and is defined by the relations

$$x^{s+1} = x^s + \rho_s(\bar x^s - x^s),\qquad (d^s,\ \bar x^s) = \min\{(d^s,\ x)\mid x\in X\},$$
$$d^{s+1} = \delta_s\,\xi^0(s+1) + (1 - \delta_s)d^s = d^s + \delta_s\bigl[\xi^0(s+1) - d^s\bigr],  (6.43)$$

where $\rho_s$, $\delta_s$ satisfy conditions similar to those of Section 6.2. Notice that if instead of $d^s$ the vectors $\xi^0(s)$ are used ($\delta_s \equiv 1$), then simple examples show that the method may not converge.

The linearization method is usually applied when $X$ is defined by linear constraints. In this case the method requires at each iteration the solution of a linear subproblem, in contrast to the projection method (6.11), which requires the solution of a quadratic subproblem. Notice that only small perturbations occur in the objective function of the subproblem at each $s > 0$; therefore, for $s > 0$, only small adjustments of the preceding solution are needed in order to obtain a solution of the current subproblem.
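A sketch of the stochastic linearization method (6.43) on a box, where the linear subproblem $\min\{(d,x)\mid x\in X\}$ has a closed-form per-coordinate solution. The toy objective, noise level, and step-size exponents are our own assumptions.

```python
import random

random.seed(7)

# Sketch of the stochastic linearization method (6.43) (our own toy):
#   minimize F0(x) = (x1 - 0.5)^2 + (x2 - 2)^2 over the box X = [0, 1]^2.
# The linear subproblem min{(d, x) : x in X} is solved per coordinate;
# d^s is the averaged stochastic quasigradient.

def oracle(x):                       # xi0(s): noisy gradient of F0
    g = [2.0 * (x[0] - 0.5), 2.0 * (x[1] - 2.0)]
    return [gi + random.gauss(0.0, 0.3) for gi in g]

x = [0.0, 0.0]
d = oracle(x)
for s in range(5000):
    rho = (s + 1) ** -0.8
    delta = (s + 1) ** -0.5          # rho_s / delta_s -> 0
    xbar = [0.0 if di > 0.0 else 1.0 for di in d]  # argmin of (d, x) over X
    x = [xi + rho * (bi - xi) for xi, bi in zip(x, xbar)]
    d = [di + delta * (gi - di) for di, gi in zip(d, oracle(x))]

print([round(v, 2) for v in x])   # should end near (0.5, 1), the box minimum
```

Note that only vertices of the box are produced by the subproblem; it is the interpolation step $x^{s+1} = x^s + \rho_s(\bar x^s - x^s)$ together with the averaging in $d^s$ that yields convergence to an interior coordinate like $x_1 = 0.5$.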

Consider now the case when $X = \{x\mid F^i(x)\le 0,\ i = 1:m\}$. Assume that the $F^\nu(x)$, $\nu = 0:m$, are continuously differentiable functions, the set $X$ is compact, and the gradient $F_x^0(\cdot)$ is Lipschitz continuous on $X$. Let the sequences $\{x^s\}$ and $\{v^s\}$ be defined by the relations [10], [78]-[80]:

$$x^{s+1} = x^s + \rho_s v^s,  (6.44)$$
$$d^{s+1} = d^s + \delta_s\bigl(\xi^0(s+1) - d^s\bigr),\qquad d^0 = \xi^0(0),$$
$$E\{\xi^0(s)\mid B_s\} = F_x^0(x^s) + b^0(s),$$

where $B_s$ is the $\sigma$-field generated by the points $\{(x^0,v^0,d^0),\ldots,(x^s,v^s,d^s)\}$ and $v^s$ is a solution of the subproblem

$$\max\bigl\{r \mid (d^s,\ v) + r \le 0,\ (F_x^i(x^s),\ v) + r \le 0,\ i\in I_s,\ -1\le v_j\le 1,\ j = 1:n\bigr\},  (6.45)$$

$$I_s = \{i:\ -c_s \le F^i(x^s) \le 0\},\qquad c_s \downarrow 0.  (6.46)$$

Therefore it is assumed that we can calculate the exact values $F^i(x)$, $i = 1:m$.

Consider

$$\rho_s'' = \max\{\rho \mid x^s + \rho v^s \in X,\ \rho \ge 0\}$$

and let $\rho_s = \min\{\rho_s',\ \rho_s''\}$, where $\rho_s' \ge 0$.

Theorem 6.6. (see [10], p. 113) If with probability 1

$$\rho_s' \ge 0,\quad \sum_{s=0}^\infty \rho_s' = \infty,\quad \rho_s'/\delta_s \to 0,\quad \sum_{s=0}^\infty E\delta_s^2 < \infty,\quad c_s \to 0,$$
$$E\|\xi^0(s)\|^2 \le C < \infty,\quad E\|d^s\|^2 \le C,\quad E\{\|b^0(s)\|\mid B_s\} \to 0$$

for some constant $C$, then the sequence $\{F^0(x^s)\}$ converges with probability 1 and all cluster points of the sequence $\{x^s\}$ satisfy the necessary optimality conditions of the problem.

Ruszczynski [80] modified the method (6.43) for nonconvex objective functions with the following property: there exist $\delta \ge 0$ and $\mu \ge 0$ such that for all $x\in X$ and all $z$ satisfying $\|z - x\| \le \delta$,

$$F^0(z) - F^0(x) \ge (F_x^0(x),\ z - x) - \mu\|z - x\|^2,$$

where $X$ is a compact set and $F_x^0(x)$ is a subgradient. This class of functions is identical with the family of functions which, in some open neighborhood of $x$, have a representation [81]

$$F^0(x) = \max_{u\in U}\, g(x,u),$$

where $U$ is compact and $g(\cdot,u)$ has second derivatives continuous in $(x,u)$.

In this method the following direction-finding subproblem is used instead of the subproblem (6.45):

$$\min_{y}\Bigl\{(d^s,\ y - x^s) + \tfrac{1}{2}\|y - x^s\|^2 \ \Big|\ F^i(x^s) + (F_x^i(x^s),\ y - x^s) \le 0,\ i = 1:m,\ y\in X\Bigr\},$$

where the $F^i(x)$, $i = 1:m$, are supposed to be convex and differentiable on $X$, and $X$ is a convex compact set. If $y^s$ is a solution of the subproblem, then $v^s = y^s - x^s$ is used in equation (6.44). The convergence theorem is similar to that for the method (6.44), provided that, in addition to the alterations mentioned above, with probability 1

$$b^0(s) \equiv 0,\qquad \delta_s = a\rho_s,\qquad 0 < \rho_s \le \min(1,\ 1/a),$$
$$\sum_{s=0}^\infty \rho_s = \infty,\qquad \sum_{s=0}^\infty E\rho_s^2 < \infty,$$

where the scalar $\rho_s$ may, as usual, depend on the past history generated by $(x^0, d^0, \ldots, x^s, d^s)$.

The paper [80] contains a rather general requirement on the choice of the direction $v^s$, which enables different modifications of the subproblem. In papers [10], [30], procedures (6.43), (6.44) were generalized to the minimization of locally Lipschitz functions, making use of the approximations (6.33), (6.34), (6.36), (6.37).