IIASA
International Institute for Applied Systems Analysis, A-2361 Laxenburg, Austria
Tel: +43 2236 807  Fax: +43 2236 71313  E-mail: info@iiasa.ac.at  Web: www.iiasa.ac.at

INTERIM REPORT IR-97-021 / April

Stochastic generalized gradient method with application to insurance risk management*

Yuri M. Ermoliev (ermoliev@iiasa.ac.at)
Vladimir I. Norkin (norkin@umc.kiev.ua)

*We would like to thank Gordon MacDonald and Joanne Linnerooth-Bayer for their helpful comments.

Approved by

Gordon MacDonald (macdon@iiasa.ac.at), Director, IIASA

Interim Reports on work of the International Institute for Applied Systems Analysis receive only limited review. Views or opinions expressed herein do not necessarily represent those of the Institute, its National Member Organizations, or other organizations supporting the work.


Abstract

Recently [9] we analyzed important classes of nonsmooth and nonconvex risk control problems which cannot be solved by standard optimization techniques. The aim of this article is to develop computational procedures enabling us to bypass some of the obstacles identified there. We illustrate this by using insurance risk processes with insolvency (stopping time).

Key words: discrete event systems, stochastic gradient method, generalized differentiable functions, risk processes, insurance.


Contents

1 Introduction
2 Insurance risk control processes
3 Generalized differentiable functions
4 Deterministic generalized gradient method with projection on a nonconvex feasible set
5 Stochastic generalized gradient method
6 Concluding Remarks



1 Introduction

In a rather general form the problems analyzed in [9] can be formulated in the following way:

minimize [F(x) = E f(x, θ)]   (1)

subject to

x ∈ X ⊂ Rⁿ,   (2)

where x is a vector of decision variables, θ is a random parameter defined on a probability space (Θ, Σ, P), f(x, θ) is a random performance function, F(x) is the expected performance function, and X is a feasible set. The essential feature of these problems is the lack of analytical structure of f(·, θ), in particular its highly discontinuous character, which makes the deterministic approximation

minimize [F_N(x) = (1/N) Σ_{i=1}^{N} f(x, θ_i)]   (3)

subject to

x ∈ X ⊂ Rⁿ,   (4)

meaningless, where θ_i, i = 1, …, N, are i.i.d. observations of θ, since F_N(x) also lacks analytical structure. The nonconvex and nonsmooth character of the random function f leads to a highly multiextremal, nonsmooth and even discontinuous function F_N(x) with local minima having nothing in common with local minima of F(x), which may be a continuously differentiable and even convex function. In such a case random search procedures based on direct estimation of F(x) and its derivatives are required. The case of continuously differentiable expectation functions F(x) was considered by Glynn [14], Ho and Cao [17], Suri [26], Gaivoronski [13], and Rubinstein and Shapiro [24].

In the case of nonsmooth stochastic systems an important factor is the concept of Lipschitz expectation functions (see Gupal [15], Ermoliev and Gaivoronski [8], Gaivoronski [13]). Moreover, as was shown in [9], we often deal not with the general class of Lipschitz functions but with a subclass generated from some basic (continuously differentiable) functions by means of maximum, minimum and smooth transformation operations.

These functions belong to the class of so-called generalized differentiable functions. In Section 2 we briefly discuss important insurance risk control problems with such functions. Section 3 formally introduces the class of generalized differentiable functions. In Sections 4 and 5 we prove convergence of the deterministic and stochastic generalized gradient methods with orthogonal projection on nonconvex feasible sets. Section 6 concludes.



2 Insurance risk control processes

Even a simple situation illustrates the complexity of insurance risk control problems.

Assume that an insurer has the initial capital x1. Claims arrive at random time moments τ1, τ2, … with random sizes L1, L2, …. The risk reserve R(x, t) at time t is the difference between the initial capital x1 plus the accumulated premium P(x2, t) and the aggregated claim C(x3, t):

R(x, t) = x1 + P(x2, t) − C(x3, t),  0 ≤ t ≤ T,

where the premium income is P(x2, t) = x2 t. The aggregated claim is

C(x3, t) = Σ_{k=1}^{N(t)} min{L_k, x3},

where N(t) is the random number of claims in [0, t) and x3 is the variable defined by excess-of-loss reinsurance. Ruin occurs at the random stopping time τ(x) = min{0 < t ≤ T : R(x, t) < 0}; if R(x, t) ≥ 0 for all t ∈ [0, T], then by convention τ(x) = T + 1.

Ruin can be mitigated by the choice of the policy variables x = (x1, x2, x3) from a feasible set. Assume that τ1, τ2, … and L1, L2, … are defined on some probability space (Θ, Σ, P). An important performance indicator of this process is the following risk function: F(x) = E f(x, θ), where θ denotes all random variables involved in the problem and

f(x, θ) = R(x, τ).
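This risk process is easy to simulate directly. The sketch below is our own illustration, not part of the paper: it assumes Poisson claim arrivals and exponential claim sizes (both hypothetical modeling choices), draws one trajectory of R(x, t), and reports the stopping time τ(x):

```python
import random

def simulate_ruin(x1, x2, x3, T, arrival_rate, claim_mean, rng):
    """One trajectory of R(x, t) = x1 + x2*t - sum_k min{L_k, x3}; returns (tau, R at tau)."""
    t, claims = 0.0, 0.0
    while True:
        t += rng.expovariate(arrival_rate)            # next claim epoch (assumed Poisson arrivals)
        if t > T:
            return T + 1, x1 + x2 * T - claims        # no ruin: tau(x) = T + 1 by convention
        claims += min(rng.expovariate(1.0 / claim_mean), x3)   # excess-of-loss cap at x3
        R = x1 + x2 * t - claims
        if R < 0:
            return t, R                               # ruin at the stopping time tau(x)

tau, R_tau = simulate_ruin(x1=10.0, x2=2.0, x3=5.0, T=50.0,
                           arrival_rate=1.0, claim_mean=1.8, rng=random.Random(0))
```

Since the reserve decreases only at claim epochs, checking ruin at those epochs suffices; τ(x) = T + 1 signals that no ruin occurred on [0, T].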

The function f(x, θ) is defined by means of min and −min operations. This becomes more evident from a further simplification of the problem. Consider the case of two time epochs: the current time moment and the future. For a fixed current policy variable x = (x1, x2, x3) the future risk reserve is

R(x) = x1 + x2 − min{L, x3},

where L is a random claim. The risk function

F(x) = E f(x, θ),  f(x, θ) = min{0, x1 + x2 − min[L, x3]}

is nonconvex and nonsmooth. The random function f(x, θ) is generated by min and −min operations from linear functions.

Assume now that Prob{R(x, t) = 0} = 0 for all x and t (we can always achieve this by adding some small independent random noise with density to R(x, t)). Then with probability 1 the function f(x, θ) is generalized differentiable (see the next section) with generalized gradients

g(x, θ) = (1, τ(x), −n(x3)),  if τ(x) ≤ T,
g(x, θ) = 0 ∈ R³,            if τ(x) > T,

where n(x3) is the number of cases when L_t > x3, 0 < t ≤ τ(x).
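Under the same hypothetical claim model as above, one simulation run delivers both the performance value f(x, θ) and the stochastic generalized gradient g(x, θ) = (1, τ(x), −n(x3)) just described; a sketch:

```python
import random

def sample_gradient(x, T, arrival_rate, claim_mean, rng):
    """Simulate one trajectory; return (f, g) with g = (1, tau(x), -n(x3)) on ruin, else g = 0."""
    x1, x2, x3 = x
    t, claims, n_exceed = 0.0, 0.0, 0
    while True:
        t += rng.expovariate(arrival_rate)
        if t > T:
            return x1 + x2 * T - claims, (0.0, 0.0, 0.0)   # tau(x) > T: zero gradient
        L = rng.expovariate(1.0 / claim_mean)
        claims += min(L, x3)
        if L > x3:
            n_exceed += 1                                  # a case with L_k > x3
        R = x1 + x2 * t - claims
        if R < 0:
            return R, (1.0, t, -float(n_exceed))           # ruin: g = (1, tau(x), -n(x3))

f_val, g = sample_gradient((2.0, 0.5, 4.0), T=50.0, arrival_rate=1.0,
                           claim_mean=2.0, rng=random.Random(1))
```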

The stochastic jumping process R(x, t) has a rather complicated structure, and a purely analytical analysis of its characteristics and of appropriate policy variables x is possible only in special cases. In realistic situations the parameters of these processes may be time dependent, and there may be a variety of policy variables interconnecting different lines of the insurance industry. Extreme and catastrophic events such as fires, floods, windstorms, and human-made accidents and disasters produce highly correlated claims, which should be properly diversified in time and space. All this requires the analysis of multidimensional interdependent insurance risk processes, which is formally often equivalent to the analysis of a large number of integro-differential equations with "trajectories" depending on policy variables. These equations are analytically tractable only in very special cases. Of course, it is possible to use Monte Carlo simulation techniques in a straightforward manner for any given collection of policy variables, but unfortunately the number of possible combinations grows exponentially. For example, for 10 policy alternatives (say, levels of contracts with reinsurers) and 10 scenarios the number of combinations is 10^10. Procedure (35)-(37) confronts this complexity. It allows us to simulate stochastic processes directly, without solving differential equations, and to generate feedback to policy variables after each random simulation, forcing these variables to converge towards better values, for example values that decrease the insolvencies of companies and increase their profits and the satisfaction of individuals.

We analyze these aspects in [12].

Let us discuss a simple example. Consider the process R(x, t) and assume for the sake of simplicity that the variables x1, x2 are fixed, say x1 = R0, x2 = a. Hence the policy variable is the level of the contract with the reinsurer, x3 = x, and let c(x) be the related cost. A decrease in x reduces the chance of insolvency, but at the same time it increases the cost c(x). Consider the following risk function:

F(x) = c(x) + r E R(x, τ(x)),

where the expectation is taken with respect to the randomness involved in τ and r is a risk parameter. The function F(x) reflects, in a sense, a trade-off between the risk of insolvency and the cost of the risk reduction measure x. It is possible to show that for a given r > 0 the minimization of F(x) can be viewed as the minimization of c(x) subject to the constraint that the probability of insolvency does not exceed a given level. The minimization of F(x) is in general not possible by standard techniques. In particular, the deterministic approximation (3) is impossible because τ(x) is an implicit random function of x. Procedure (35)-(37) starts with a given initial value x0 of the reinsurance contract and sequentially updates this value after each simulation run. Assume x_k is the value of x0 after k simulations. The new value x_{k+1} is calculated as follows. For given x_k the random process R(x_k, t), 0 ≤ t ≤ T, is simulated and τ(x_k) is observed. The value x_k is adjusted according to the feedback

x_{k+1} = min{0, x_k − (c/(k+1)) [c′(x_k) − n(x_k)]},  if τ(x_k) ≤ T,
x_{k+1} = min{0, x_k − (c/(k+1)) c′(x_k)},            if τ(x_k) > T,

where c is a positive constant. Since the event τ(x_k) ≤ T may be rather rare for some levels x_k, special measures are required to increase the frequency of the cases τ(x_k) ≤ T. We discuss this in more detail in [12]. After a finite number of adjustments k the value x_k stabilizes around the desirable value. It is important that the number of simulations required for this type of adaptive adjustment usually has the same order of magnitude as that required to estimate F(x) at a given initial value x0.
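A minimal sketch of this adaptive loop, under the same hypothetical claim model as before. The cost function c(x) = 1/(1 + x), the step constant, and all parameter values are illustrative assumptions of ours, and we implement the projection step simply by clipping x to the nonnegative half-line:

```python
import random

def simulate_tau_n(x, R0, a, T, rate, mean, rng):
    """One run of R(t) = R0 + a*t - sum_k min{L_k, x}; returns (tau, n), n = #{claims with L_k > x}."""
    t, claims, n = 0.0, 0.0, 0
    while True:
        t += rng.expovariate(rate)                   # assumed Poisson claim arrivals
        if t > T:
            return T + 1, n                          # no ruin on [0, T]
        L = rng.expovariate(1.0 / mean)              # assumed exponential claim sizes
        claims += min(L, x)
        if L > x:
            n += 1                                   # claim exceeded the retention level x
        if R0 + a * t - claims < 0:
            return t, n                              # ruin at time tau

def adapt(x0, steps, rng, R0=3.0, a=0.8, T=30.0, rate=1.0, mean=1.5, c=0.5):
    c_prime = lambda x: -1.0 / (1.0 + x) ** 2        # illustrative cost c(x) = 1/(1+x), decreasing in x
    x = x0
    for k in range(steps):
        tau, n = simulate_tau_n(x, R0, a, T, rate, mean, rng)
        grad = c_prime(x) - (n if tau <= T else 0)   # stochastic generalized gradient estimate
        x = max(0.0, x - (c / (k + 1)) * grad)       # step c/(k+1); clip to x >= 0
    return x

x_final = adapt(x0=2.0, steps=200, rng=random.Random(2))
```

One trajectory per update suffices; no estimate of F(x) itself is ever formed, which is the point of the procedure.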

3 Generalized differentiable functions

Let us introduce a class of functions that is closed under the operations min and max (−min) and under smooth transformations. Continuously differentiable functions belong to this class. As we show in Sections 4 and 5, there is a simple gradient-type procedure for the optimization of such functions.


Definition 3.1 (Norkin [21]) A function f: Rⁿ → R is called generalized differentiable (GD) at x ∈ Rⁿ if in a vicinity of x there exists an upper semicontinuous multivalued mapping ∂f with closed convex compact values ∂f(x) such that

f(y) = f(x) + ⟨g, y − x⟩ + o(x, y, g),   (5)

where ⟨·,·⟩ denotes the inner product of two vectors in Rⁿ, g ∈ ∂f(y), and

lim_k |o(x, y^k, g^k)| / ‖y^k − x‖ = 0   (6)

for any sequences y^k → x, g^k → g with g^k ∈ ∂f(y^k). The function f is called generalized differentiable if it is generalized differentiable at each point x ∈ Rⁿ; ∂f(x) is called a subdifferential of f at x.

Example 3.1 The function |x|, x ∈ R, is generalized differentiable with

∂|x| = +1 for x > 0,  [−1, +1] for x = 0,  −1 for x < 0.

Its expansion (5) at x = 0 has the form

|y| = |0| + sign(y)·(y − 0) + 0.

Generalized differentiable (GD) functions possess the following properties (see Norkin [21], Mikhalevich, Gupal and Norkin [19]):

- They are locally Lipschitzian, but generally not directionally differentiable.
- Continuously differentiable, convex and concave functions are generalized differentiable, and gradients and subgradients of these functions can be taken as generalized gradients.
- The class of GD-functions is closed with respect to finite max, min operations and superpositions; for example,

  ∂ max(f1(x), f2(x)) = co {∂fi(x) | fi(x) = max(f1(x), f2(x))},   (7)

  and the subdifferential of a composite function f0(f1, …, fm) is calculated by the chain rule.
- The class of GD-functions is closed with respect to taking expectations: ∂F(x) = E ∂f(x, ω) for F(x) = E f(x, ω), where f(·, ω) is a generalized differentiable function. Thus the expectation functions discussed in Section 2 are indeed generalized differentiable.
- The subdifferential ∂f(x) is not defined uniquely, but the Clarke subdifferential always satisfies Definition 3.1 and is contained in any subdifferential ∂f(x) satisfying it; moreover, ∂f(x) is a singleton almost everywhere in Rⁿ.
- Some elements of ∂f(x) for a composite function f(x), such as f(x) = max(f1(x), f2(x)), f(x) = min(f1(x), f2(x)) and f(x) = f0(f1(x), …, fm(x)), can be calculated by the lexicographic method (Nesterov [20]).
- There is the following analog of the Newton-Leibniz formula:

  f(y) − f(x) = ∫₀¹ ⟨g((1 − t)x + ty), y − x⟩ dt,  where g((1 − t)x + ty) ∈ ∂f((1 − t)x + ty).

These properties of generalized differentiable functions make them suitable for modeling various nonsmooth stochastic systems.
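As a small illustration of the max rule (7), the following sketch (with toy functions of our own choosing) returns one valid generalized gradient of max(f1, f2) at a point by selecting the gradient of an active component:

```python
def grad_max(fs, grads, x):
    """One generalized gradient of max_i f_i at x: by (7), any convex combination of
    gradients of active functions is valid; here we simply pick the first active one."""
    vals = [f(x) for f in fs]
    i = vals.index(max(vals))            # an index with f_i(x) = max_j f_j(x)
    return grads[i](x)

# Toy example: f(x) = max(x^2 - 1, -x) on R (both pieces continuously differentiable)
fs = [lambda x: x * x - 1.0, lambda x: -x]
grads = [lambda x: 2.0 * x, lambda x: -1.0]

g_at_2 = grad_max(fs, grads, 2.0)    # x^2 - 1 is active: gradient 4.0
g_at_0 = grad_max(fs, grads, 0.0)    # -x is active (0 > -1): gradient -1.0
```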


4 Deterministic generalized gradient method with projection on a nonconvex feasible set

Let us first analyze the deterministic procedure to demonstrate the convergence analysis technique. Consider the problem

f(x) → min,  x ∈ X,   (8)

where

X = {x ∈ Rⁿ | ψ(x) ≤ 0},   (9)

and f(x) and ψ(x) are generalized differentiable functions. Let ∂f(x) and ∂ψ(x) be subdifferentials of f(x) and ψ(x); in particular, they may coincide with Clarke's subdifferentials. Assume that

ρ(0, ∂ψ(x)) = inf_{g ∈ ∂ψ(x)} ‖g‖ > 0   (10)

for all x such that ψ(x) = 0.

The necessary optimality condition for this problem has the form [19]:

0 ∈ ∂f(x) + N_X(x),

where

N_X(x) = {λ ∂ψ(x) | λ ≥ 0} if ψ(x) = 0,  N_X(x) = {0} if ψ(x) < 0.

Let X* = {x ∈ X | 0 ∈ ∂f(x) + N_X(x)} and f* = {f(x) | x ∈ X*}. Consider the following conceptual iterative search procedure:

x⁰ ∈ X,   (11)
x^{k+1} ∈ Π_X(x^k − ρ_k g^k),   (12)
g^k ∈ ∂f(x^k),  k = 0, 1, …,   (13)

where Π_X is a (multivalued) projection operator onto the set X, i.e. z ∈ Π_X(y) iff y − z ∈ N_X(z), and the nonnegative numbers ρ_k satisfy the conditions

lim_{k→∞} ρ_k = 0,  Σ_{k=0}^{∞} ρ_k = ∞.   (14)
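To make iteration (11)-(13) concrete, here is a sketch on a toy instance of our own choosing: f(x) = |x1 − 2| + |x2| over the unit ball X = {x | x1² + x2² ≤ 1}, whose constrained minimizer is (1, 0). The sign selections at kinks and the step rule ρ_k = 1/(k+1), which satisfies (14), are assumptions of the example:

```python
import math

def g_f(x):
    """A generalized gradient of f(x) = |x1 - 2| + |x2| (a sign selection at kinks)."""
    s = lambda v: 1.0 if v > 0 else (-1.0 if v < 0 else 1.0)  # any value in [-1, 1] is valid at 0
    return (s(x[0] - 2.0), s(x[1]))

def project(y):
    """Projection onto X = {x : x1^2 + x2^2 <= 1}, i.e. psi(x) = x1^2 + x2^2 - 1 <= 0."""
    r = math.hypot(y[0], y[1])
    return y if r <= 1.0 else (y[0] / r, y[1] / r)

x = (0.0, 0.5)                        # x0 in X, as in (11)
for k in range(5000):
    rho = 1.0 / (k + 1)               # rho_k -> 0 and sum rho_k = infinity, as in (14)
    g = g_f(x)                        # g^k, as in (13)
    x = project((x[0] - rho * g[0], x[1] - rho * g[1]))   # iteration (12)
```

Each iteration takes one generalized gradient, steps against it, and projects back onto X, exactly the pattern of (11)-(13).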

Remark 4.1 Method (11)-(13) is an extension of the projection subgradient method of Shor, Ermoliev and Polyak (see further references in [1], pp. 143-144) to nonconvex problems. Dorofeev [4], [5] studied a similar method for the class of subdifferentially regular (quasidifferentiable) functions, which does not cover important applications (for instance, this class includes convex, weakly convex [22] and max-functions, but does not include concave and min-functions).

Theorem 4.1 The sequence {x^k} generated by method (11)-(13) converges to the solution set of problem (8): the cluster points of {x^k} that are minimal in f belong to X*, and all cluster points of {f(x^k)} constitute an interval in f*. If the set f* does not contain intervals (for instance, if f* is finite or countable), then all cluster points of {x^k} belong to a connected subset of X* and {f(x^k)} has a limit in f*.


The proof of convergence is based on nonsmooth nonconvex Lyapunov functions and the techniques developed by Nurminski [22], Ermoliev [7], Dorofeev [5], and Mikhalevich, Gupal and Norkin [19].

Lemma 4.1 Assume that lim_{s→∞} x^{k_s} = y ∈ X with y ∉ X*. Then for any ε > 0 there exist indices l_s > k_s such that ‖x^k − y‖ ≤ ε for all k ∈ [k_s, l_s) and

lim sup_s f(x^{l_s}) < f(y) = lim_s f(x^{k_s}).   (15)

Proof. Denote x̄^{k+1} = x^k − ρ_k g^k and represent

x^{k+1} = Π_X(x̄^{k+1}) = x^k − ρ_k (g^k + h^k) = x^k − ρ_k Q^k,

where

Q^k = g^k + h^k,  h^k = h^k(x̄^{k+1}) = (1/ρ_k)(x̄^{k+1} − Π_X(x̄^{k+1})) ∈ N_X(x^{k+1}).   (16)

Then

‖h^k‖ = (1/ρ_k) ‖x̄^{k+1} − Π_X(x̄^{k+1})‖ ≤ (1/ρ_k) ‖x̄^{k+1} − x^k‖ = ‖g^k‖,
‖Q^k‖ = (1/ρ_k) ‖x^{k+1} − x^k‖ ≤ (1/ρ_k) ‖x̄^{k+1} − x^k‖ = ‖g^k‖.

We have to consider two cases: ψ(y) < 0 and ψ(y) = 0. In the first case, for k ≥ k_s method (12) operates in a sufficiently small vicinity of y as an unconstrained subgradient method, and the statement of the lemma is known (see [21], [19]). In what follows we consider the new case ψ(y) = 0 (the case ψ(y) < 0 may be considered as a simplification of the case ψ(y) = 0). For y = lim_s x^{k_s} define

μ = ρ(0, ∂ψ(y)) = inf {‖g‖ | g ∈ ∂ψ(y)},   (17)
ν = ρ(0, ∂f(y) + N_X(y)) = inf {‖g‖ | g ∈ ∂f(y) + N_X(y)},   (18)
γ = sup {‖g‖ | g ∈ ∂f(y)}.   (19)

Due to the upper semicontinuity of ∂f and ∂ψ there exists an ε₁-vicinity of y such that

sup {‖g‖ | g ∈ ∂f(z), ‖z − y‖ ≤ ε₁} ≤ 2γ = Γ,   (20)
sup {‖g‖ | g ∈ ∂ψ(z), ‖z − y‖ ≤ ε₁} ≤ 2γ = Γ.   (21)

Define

N̄(z) = {g ∈ N_X(z) | ‖g‖ ≤ Γ},  G(z) = ∂f(z) + N̄(z).

Obviously,

ρ(0, G(y)) = inf {‖g‖ | g ∈ G(y)} ≥ ν.

Due to the upper semicontinuity of ∂ψ and G there exists an ε₂-vicinity (ε₂ ≤ ε₁) of y such that for all z with ‖z − y‖ ≤ ε₂,

ρ(∂ψ(z), ∂ψ(y)) ≤ μ/2,   (22)

where ρ(·,·) is the Hausdorff distance between sets.

Due to the generalized differentiability of f and ψ, for c = ν² / (64 Γ (1 + 2Γ/μ)) there exists ε₃ ≤ ε₂ such that for ‖z − y‖ ≤ ε₃:

f(z) ≤ f(y) + ⟨g, z − y⟩ + c ‖z − y‖,   (23)
ψ(z) ≤ ψ(y) + ⟨d, z − y⟩ + c ‖z − y‖ = ⟨d, z − y⟩ + c ‖z − y‖,   (24)

for all g ∈ ∂f(z), d ∈ ∂ψ(z). Now set ε̄ = ε₃ and fix some ε ≤ ε̄. Set ρ̄ = ε/(3Γ). Let ‖x^{k_s} − y‖ ≤ ε/3 and ρ_s ≤ ρ̄ for s ≥ S. Denote

m_s = sup {m | ‖x^k − y‖ ≤ 2ε/3 ∀ k ∈ [k_s, m)}.

We now show that m_s < ∞ for s ≥ S. Indeed, if ‖x^k − y‖ ≤ 2ε/3 for all k, then we obtain the contradiction

2ε/3 ≥ ‖x^k − y‖ ≥ ‖x^k − x^{k_s}‖ − ‖x^{k_s} − y‖ ≥ (ν/2) Σ_{r=k_s}^{k−1} ρ_r − ε/3 → ∞

as k → ∞. Furthermore,

‖x^{m_s} − y‖ ≤ ‖x^{m_s−1} − y‖ + ρ_{m_s−1} ‖Q^{m_s−1}‖ ≤ ε.

Since

ε/3 ≤ ‖Σ_{k=k_s}^{m_s−1} ρ_k Q^k‖ ≤ Γ Σ_{k=k_s}^{m_s−1} ρ_k,

then

Σ_{k=k_s}^{m_s−1} ρ_k ≥ ε/(3Γ).

For k ∈ [k_s, m_s], s ≥ S, x^k and g^k ∈ ∂f(x^k), from (23) it follows that

f(x^k) ≤ f(y) + ⟨g^k, x^k − y⟩ + c ‖x^k − y‖
      ≤ f(y) + ⟨g^k, x^k − x^{k_s}⟩ + c ‖x^k − x^{k_s}‖ + (Γ + c) ‖x^{k_s} − y‖
      = f(y) + ⟨g^k + h^k, x^k − x^{k_s}⟩ − ⟨h^k, x^k − x^{k_s}⟩ + c ‖x^k − x^{k_s}‖ + (Γ + c) ‖x^{k_s} − y‖,   (25)

where h^k is defined by (16). Let us estimate the term u_k = −⟨h^k, x^k − x^{k_s}⟩. If ψ(x̄^k) ≤ 0 then h^k = 0 and u_k = 0. Consider the case ψ(x̄^k) > 0, i.e. h^k ≠ 0. Since

h^k ∈ N_X(x^k) = {λ g | g ∈ ∂ψ(x^k), λ ≥ 0},

then

h^k = λ_k d^k,  d^k ∈ ∂ψ(x^k),  λ_k > 0,

and

0 < λ_k = ‖h^k‖ / ‖d^k‖ ≤ Γ / (μ/2) = 2Γ/μ.

Substituting x^k = Π_X(x̄^k) and d^k into (24):

0 = ψ(x^k) ≤ ⟨d^k, x^k − y⟩ + c ‖x^k − y‖.   (26)


Now, multiplying (26) by λ_k, we obtain

−⟨h^k, x^k − y⟩ ≤ λ_k c ‖x^k − y‖ ≤ (2cΓ/μ) ‖x^k − y‖
             ≤ (2cΓ/μ) ‖x^k − x^{k_s}‖ + (2cΓ/μ) ‖x^{k_s} − y‖.   (27)

Using inequality (27), we can rewrite (25) in the following form:

f(x^k) ≤ f(y) + ⟨g^k + h^k, x^k − x^{k_s}⟩ + (1 + 2Γ/μ) c ‖x^k − x^{k_s}‖ + (Γ + c + 2cΓ/μ) ‖x^{k_s} − y‖.   (28)

Now we have to estimate the scalar products

⟨g^k + h^k, x^k − x^{k_s}⟩ = −⟨g^k + h^k, Σ_{i=k_s}^{k−1} ρ_i (g^i + h^i)⟩.

Lemma 4.2 (see Mikhalevich, Gupal and Norkin [19]) Let P be a convex set in Rⁿ such that 0 < γ₀ ≤ ‖p‖ ≤ Γ₀ < +∞ for all p ∈ P. Then for an arbitrary collection of vectors {p^r ∈ P | r = k, …, m} and any collection of non-negative numbers {ρ_r ∈ R¹ | r = k, …, m−1} such that

Σ_{r=k}^{m−1} ρ_r ≥ σ₀ > 0,  sup_{k≤r≤m} ρ_r ≤ σ₀ γ₀² / (12 Γ₀²),

there exists an index l ∈ (k, m] such that

⟨p^l, Σ_{r=k}^{l−1} ρ_r p^r / Σ_{r=k}^{l−1} ρ_r⟩ ≥ γ₀²/4,  Σ_{r=k}^{l−1} ρ_r ≥ σ₀ γ₀ / (3 Γ₀).

Proof. For completeness we give the proof of the lemma.

Choose indices t and m₀ such that

Σ_{r=k}^{t−1} ρ_r < γ₀σ₀/(3Γ₀) ≤ Σ_{r=k}^{t} ρ_r,  Σ_{r=k}^{m₀−1} ρ_r < σ₀ ≤ Σ_{r=k}^{m₀} ρ_r.   (29)

Suppose the opposite of the statement of the lemma is true, i.e. for all l ∈ (t, m₀]

⟨p^l, Σ_{r=k}^{l−1} ρ_r p^r⟩ / Σ_{r=k}^{l−1} ρ_r < γ₀²/4.   (30)

We have

‖Σ_{r=k}^{l} ρ_r p^r‖² = ‖Σ_{r=k}^{l−1} ρ_r p^r‖² + 2 ρ_l ⟨p^l, Σ_{r=k}^{l−1} ρ_r p^r⟩ + ρ_l² ‖p^l‖²

and hence

‖Σ_{r=k}^{m₀} ρ_r p^r‖² = ‖Σ_{r=k}^{t} ρ_r p^r‖² + 2 Σ_{l=t+1}^{m₀} ρ_l ⟨p^l, Σ_{r=k}^{l−1} ρ_r p^r⟩ + Σ_{l=t+1}^{m₀} ρ_l² ‖p^l‖².   (31)

Since P is convex and ‖p‖ ≥ γ₀ on P, we have ‖Σ_{r=k}^{m₀} ρ_r p^r‖ ≥ γ₀ Σ_{r=k}^{m₀} ρ_r ≥ γ₀σ₀. Substituting (29), (30) into (31), we obtain

γ₀²σ₀² ≤ Γ₀² (Σ_{r=k}^{t} ρ_r)² + (γ₀²/2) Σ_{l=t+1}^{m₀} ρ_l Σ_{r=k}^{l−1} ρ_r + Γ₀² sup_{k≤r≤m₀} ρ_r Σ_{l=t+1}^{m₀} ρ_l ≤ (11/12) γ₀²σ₀².

This contradiction proves the lemma. □

Now let us come back to the proof of Lemma 4.1. Set

P = co {G(z) | ‖z − y‖ ≤ ε},  p^r = g^r + h^r,  k = k_s ≤ r ≤ m = m_s,  γ₀ = ν/2,  Γ₀ = Γ.

We have

Σ_{k=k_s}^{m_s} ρ_k ≥ Σ_{k=k_s}^{m_s−1} ρ_k ≥ ‖x^{m_s} − x^{k_s}‖ / Γ ≥ ε/(3Γ) = σ₀ > 0,  lim_{s→∞} sup_{k≥k_s} ρ_k = 0.

By Lemma 4.2, for all sufficiently large s there exist indices l_s, k_s < l_s ≤ m_s, such that

⟨g^{l_s} + h^{l_s}, Σ_{k=k_s}^{l_s−1} ρ_k (g^k + h^k) / Σ_{k=k_s}^{l_s−1} ρ_k⟩ ≥ ν²/16,  Σ_{k=k_s}^{l_s−1} ρ_k ≥ εν/(18Γ²).

Substituting these estimates for k = l_s into inequality (28), we obtain the final estimate with c = ν² / (64 Γ (1 + 2Γ/μ)):

f(x^{l_s}) ≤ f(y) − (ν²/16) Σ_{k=k_s}^{l_s−1} ρ_k + Γ (1 + 2Γ/μ) c Σ_{k=k_s}^{l_s−1} ρ_k + (Γ + c + 2cΓ/μ) ‖x^{k_s} − y‖

         ≤ f(y) − εν³/(600Γ²) + (Γ + c + 2cΓ/μ) ‖x^{k_s} − y‖.   (32)

Thus we have proved that for all sufficiently small ε ≤ ε̄ and sufficiently large s there exist indices l_s such that ‖x^k − y‖ ≤ ε for k ∈ [k_s, l_s) and f(x^{l_s}) satisfies (32). From here the statement of the lemma follows. □

Proof of Theorem 4.1. The proof is based on Lemma 4.1.

1°. Obviously, the sequence {x^k} belongs to the compact set X.

2°. By the boundedness of the subgradients ∂f(x) on the compact set X we obtain

lim_{k→∞} ‖x^{k+1} − x^k‖ ≤ sup_{g ∈ ∂f(x), x ∈ X} ‖g‖ · lim_{k→∞} ρ_k = 0.

From here it follows that the cluster points of {x^k} constitute a connected set in X.

3°. The sequence {x^k} from the compact set X has a closed set of limit points X₀. The continuous function f(x) achieves its minimum on X₀, say at some point x⁰. The point x⁰ = lim_{s→∞} x^{k_s} belongs to X*, because otherwise, due to Lemma 4.1, it would not be minimal in the above sense. Thus lim inf_{k→∞} f(x^k) ∈ f*.

4°. Let us now prove that the limit points of the sequence {f(x^k)} constitute an interval in f*. If lim sup_{k→∞} f(x^k) = lim inf_{k→∞} f(x^k), then the statement follows from 3°. Suppose

lim sup_{k→∞} f(x^k) > lim inf_{k→∞} f(x^k) = f₀ ∈ f*.

Assume the opposite of the statement of the theorem. Then there exists a number f₁ ∉ f* such that f₁ < lim sup_{k→∞} f(x^k). Let us choose a number f₂ such that

lim inf_{k→∞} f(x^k) = f₀ < f₁ < f₂ < lim sup_{k→∞} f(x^k).

The sequence {f(x^k)} intersects the interval (f₁, f₂) from below infinitely many times, so there exist subsequences {x^{k_s}} and {x^{n_s}} such that

f(x^{k_s}) ≤ f₁ < f(x^k) < f₂ ≤ f(x^{n_s}),  k_s < k < n_s.   (33)

Without loss of generality we can assume that x^{k_s} → x⁰. Due to 2° and the continuity of f we have

lim_{s→∞} f(x^{k_s}) = f(x⁰) = f₁ ∉ f*.

Hence x⁰ = lim_{s→∞} x^{k_s} ∉ X*. Now we can apply Lemma 4.1 to the subsequences {x^k}_{k=k_s}. Choose ε such that

sup_{y: ‖y − x⁰‖ ≤ ε} f(y) < f₂.

Then (15) contradicts inequalities (33). Hence the limit points of {f(x^k)} fill the interval

[lim inf_{k→∞} f(x^k), lim sup_{k→∞} f(x^k)],

and since X* and f* are closed sets,

[lim inf_{k→∞} f(x^k), lim sup_{k→∞} f(x^k)] ⊆ f*.

5°. Suppose now that f* does not contain intervals; for instance, f* is finite or countable. From statement 4° we have

lim_{k→∞} f(x^k) = f₀ ∈ f*.   (34)

If a cluster point x⁰ = lim_{s→∞} x^{k_s} did not belong to X*, then due to Lemma 4.1 we would obtain a contradiction with (34). □

Remark 4.2 The convergence result of Theorem 4.1 remains true for the generalized gradient method (11), (12) with

g^k ∈ ∂f(x̃^k),  ‖x̃^k − x^k‖ ≤ δ_k,  lim_k δ_k = 0.

In this case the basic Lemma 4.1 follows from the stability result of Lemma 5.4. If the points x̃^k are taken at random, then with probability one the subdifferential ∂f(x̃^k) is a singleton coinciding with the Clarke subdifferential, and the method converges to X* = {x | 0 ∈ ∂f(x) + N_X(x)}. In the latter case we can use formula (7) and the chain rule to calculate g^k ∈ ∂f(x̃^k). The use of ∂f(x̃^k) resembles the concept of mollifier gradients [9].


5 Stochastic generalized gradient method

Consider now the stochastic optimization problem (1), (2), where the objective function F(x) is generalized differentiable and the set X = {x | ψ(x) ≤ 0} is given by a generalized differentiable function ψ(x) satisfying the regularity condition (10). Define X* = {x | 0 ∈ ∂F(x) + N_X(x)} and F* = {F(x) | x ∈ X*}.

Consider the following procedure:

x⁰ ∈ X,   (35)
x^{k+1}(ω) ∈ Π_X(x^k − ρ_k s^k(ω)),  k = 0, 1, …,   (36)
s^k(ω) = (1/n_k) Σ_{i=r_k}^{k} ξ^i(ω),  n_k = k − r_k + 1 ≥ 1,   (37)

where all random quantities x^k(ω), ξ^k(ω), s^k(ω), k = 0, 1, …, are defined on some probability space (Ω, Σ, P), and ξ^i(ω), i = 0, 1, …, are random vectors (stochastic generalized gradients) such that

E{ξ^i(ω) | x⁰(ω), …, x^i(ω)} = g^i(ω) ∈ ∂F(x^i(ω)),  ‖ξ^i(ω)‖ ≤ C < +∞;   (38)

Π_X is a (multivalued) projection operator onto the set X, i.e. z ∈ Π_X(y) iff y − z ∈ N_X(z); and the non-negative numbers r_k, n_k and ρ_k satisfy the conditions

n_k = k + 1 − r_k ≤ m < +∞;   (39)

Σ_{k=0}^{∞} ρ_k = +∞,  Σ_{k=0}^{∞} ρ_k² < +∞.   (40)

Remark 5.1 Method (35)-(37) combines ideas of the projection stochastic quasigradient method of Ermoliev (see details and further references in [11], pp. 142-185) and of stochastic gradient averaging methods [1], [5], [7], [15], [19]. It is easy to extend the convergence analysis to biased estimates of generalized gradients (stochastic quasigradients).
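To illustrate procedure (35)-(37), here is a sketch on a toy stochastic problem of our own choosing: F(x) = E|x − θ| with θ uniform on [0, 1], minimized over X = [0, 2]; the minimizer is the median 0.5. The window length m, the step rule, and the starting point are illustrative assumptions:

```python
import random

def sgg(x0, iters, m, rng):
    """Sketch of (35)-(37) for F(x) = E|x - theta|, theta ~ U[0,1], over X = [0, 2].
    xi^k = sign(x^k - theta^k) is a stochastic generalized gradient of |x - theta^k| at x^k."""
    x, xis = x0, []
    for k in range(iters):
        theta = rng.random()
        xis.append(1.0 if x > theta else -1.0)
        window = xis[max(0, k - m + 1):]           # (37): average the last n_k <= m gradients
        s = sum(window) / len(window)
        rho = 1.0 / (k + 1)                        # (40): sum rho_k = inf, sum rho_k^2 < inf
        x = min(2.0, max(0.0, x - rho * s))        # (36): projection onto X = [0, 2]
    return x

x_star = sgg(x0=1.8, iters=20000, m=5, rng=random.Random(3))
```

Averaging over the sliding window of the last n_k ≤ m stochastic gradients is exactly the smoothing that s^k(ω) performs in (37).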

Theorem 5.1 Let f(x) and ψ(x) be generalized differentiable functions, and let the sequence x^k(ω) be generated by method (35)-(37), where r_k, n_k, ρ_k satisfy (39), (40). Then the cluster points of {x^k(ω)} that are minimal in F a.s. belong to X*, and all cluster points of {F(x^k(ω))} a.s. constitute an interval in F*. If the set F* does not contain intervals (for instance, if F* is finite or countable), then all cluster points of {x^k(ω)} a.s. belong to a connected subset of X* and {F(x^k(ω))} has a limit in F*.

Proof. Denote x̄^{k+1} = x^k − ρ_k s^k and represent

x^{k+1} = Π_X(x̄^{k+1}) = x^k − ρ_k (s^k + h^k) = x^k − ρ_k Q^k,

where

Q^k = s^k + h^k,  h^k = h^k(x̄^{k+1}) = (1/ρ_k)(x̄^{k+1} − Π_X(x̄^{k+1})) ∈ N_X(x^{k+1}),

‖h^k‖ = (1/ρ_k) ‖x̄^{k+1} − Π_X(x̄^{k+1})‖ ≤ (1/ρ_k) ‖x̄^{k+1} − x^k‖ = ‖s^k‖,
‖Q^k‖ = (1/ρ_k) ‖x^{k+1} − x^k‖ ≤ (1/ρ_k) ‖x̄^{k+1} − x^k‖ = ‖s^k‖.

Now fix a subsequence {x^{k_s}(ω)}. For k > k_s,

x^{k+1}(ω) = x^{k_s}(ω) − Σ_{t=k_s}^{k} ρ_t Q^t(ω)
          = [x^{k_s}(ω) − Σ_{t=k_s}^{k} ρ_t Q̄^t(ω)] − ζ_{k_s}^{k+1}(ω)
          = y_{k_s}^{k+1}(ω) − ζ_{k_s}^{k+1}(ω),   (41)

where

y_{k_s}^{k_s}(ω) = x^{k_s}(ω),   (42)
y_{k_s}^{k+1}(ω) = y_{k_s}^{k}(ω) − ρ_k Q̄^k(ω),  k ≥ k_s;   (43)
Q̄^k(ω) = (1/n_k) Σ_{r=r_k}^{k} (g^r(ω) + h^r(ω)),   (44)
g^r(ω) = E{ξ^r(ω) | x⁰(ω), …, x^r(ω)} ∈ ∂F(x^r(ω)),   (45)
h^r(ω) = (1/ρ_r)(x̄^{r+1}(ω) − Π_X(x̄^{r+1}(ω))) ∈ N_X(x^{r+1}(ω)),   (46)
ζ_n^m(ω) = Σ_{t=n}^{m−1} ρ_t (1/n_t) Σ_{r=r_t}^{t} (ξ^r(ω) − g^r(ω)).   (47)

Instead of {x^k(ω)} we shall study the behavior of the close sequence {y_{k_s}^{k}(ω)}_{k≥k_s}, s = 0, 1, …, generated by the deterministic (for fixed ω) procedure (43)-(47). This procedure uses subgradients g^r(ω) of the function F taken not at the points y_{k_s}^{r}(ω) but at the close points x^r(ω). Besides, the vector h^r(ω) is normal to X not at the point y_{k_s}^{r+1}(ω) but at the close point x^{r+1}(ω). We have the estimate

‖y_{k_s}^{k}(ω) − x^k(ω)‖ = ‖ζ_{k_s}^{k}(ω)‖ ≤ sup_{k≥k_s} ‖ζ_{k_s}^{k}(ω)‖ = δ_{k_s}(ω).

Let us show (Lemma 5.1) that lim_{s→∞} δ_{k_s}(ω) = 0 a.s. Notice that

|F(x^k(ω)) − F(y_{k_s}^{k}(ω))| ≤ L_F ‖x^k(ω) − y_{k_s}^{k}(ω)‖ = L_F δ_{k_s}(ω),   (48)

where L_F is a Lipschitz constant of the function F over the set X. Then the difference |F(x^k(ω)) − F(y_{k_s}^{k}(ω))|, k ≥ k_s, is arbitrarily small for s sufficiently large. The remaining part of the proof we subdivide into several separate lemmas.

Lemma 5.1 The random sequence {ζ_0^k(ω)}_{k=0}^{∞},

ζ_0^k(ω) = Σ_{t=0}^{k−1} ρ_t (1/n_t) Σ_{r=r_t}^{t} (ξ^r(ω) − g^r(ω)),  n_t ≤ m,   (49)

a.s. has a limit.


Proof. Denote

λ_{tr} = 1/n_t for r_t ≤ r ≤ t,  λ_{tr} = 0 otherwise.

Then

ζ_0^k = Σ_{t=0}^{k−1} ρ_t Σ_{r=r_t}^{t} λ_{tr} (ξ^r − g^r) = Σ_{r=0}^{k−1} (Σ_{t=r}^{k−1} λ_{tr} ρ_t)(ξ^r − g^r)
     = Σ_{r=0}^{k−1} (Σ_{t=r}^{∞} λ_{tr} ρ_t)(ξ^r − g^r) − Σ_{r=0}^{k−1} (Σ_{t=k}^{∞} λ_{tr} ρ_t)(ξ^r − g^r).

The sequence

ζ̄_0^k = Σ_{r=0}^{k−1} (Σ_{t=r}^{∞} λ_{tr} ρ_t)(ξ^r − g^r)   (50)

is a martingale with respect to the σ-field generated by {x^k(ω)}_{k=0}^{∞}. Denote

Γ = sup {‖g‖ | g ∈ ∂F(x), x ∈ X} < +∞.

Then

E ‖ζ̄_0^k(ω)‖² ≤ (Γ + C)² Σ_{r=0}^{∞} (Σ_{t=r}^{∞} λ_{tr} ρ_t)² ≤ (Γ + C)² Σ_{r=0}^{∞} (Σ_{t=r}^{r+m−1} ρ_t)²
             ≤ (Γ + C)² m² Σ_{t=0}^{∞} ρ_t² < +∞

and

E ‖ζ̄_0^k(ω)‖ ≤ 1 + E ‖ζ̄_0^k(ω)‖² < +∞.

Hence the martingale (50) a.s. has a finite limit. For the remainder term

α^k(ω) = Σ_{r=0}^{k−1} (Σ_{t=k}^{∞} λ_{tr} ρ_t)(ξ^r − g^r)

the following estimates hold true:

‖α^k(ω)‖ ≤ Σ_{r=0}^{k−1} (Σ_{t=k}^{∞} λ_{tr} ρ_t)(‖ξ^r‖ + ‖g^r‖)
         ≤ (Γ + C) Σ_{r=0}^{k−1} (Σ_{t=k}^{∞} λ_{tr} ρ_t) = (Γ + C) Σ_{t=k}^{∞} ρ_t (Σ_{r=0}^{k−1} λ_{tr})
         = (Γ + C) Σ_{t=k}^{∞} ρ_t (Σ_{r=r_t}^{k−1} λ_{tr}) ≤ (Γ + C) Σ_{t=k}^{k+m} ρ_t → 0  as k → ∞.

Hence the sequence {ζ_0^k(ω) = ζ̄_0^k(ω) − α^k(ω)} a.s. has a limit. □

Corollary 5.1 For any subsequence of indices {k_s} → ∞,

δ_{k_s}(ω) = sup_{k≥k_s} ‖ζ_{k_s}^{k}(ω)‖ → 0 a.s. as s → ∞.

Remark 5.2 Lemma 5.1 and Corollary 5.1 remain true if r_k = k in (37) and the boundedness assumption in (38) is replaced by E ‖ξ^i(ω)‖² < +∞.

Lemma 5.2 Let ω be such that {ζ_0^k(ω)}_{k=0}^{∞} has a limit. Assume that lim_{s→∞} x^{k_s}(ω) = x(ω) ∈ X with x(ω) ∉ X*. Denote

m_s(ε, ω) = sup {m | ‖x^k(ω) − x(ω)‖ ≤ ε for k ∈ [k_s, m)}.

Then a.s. there exists ε̄(ω) such that for any ε ∈ (0, ε̄] there exist indices l_s(ω) ∈ [k_s(ω), m_s(ε, ω)] with

F(x(ω)) = lim_{s→∞} F(x^{k_s}(ω)) > lim sup_{s→∞} F(x^{l_s}(ω)).   (51)


Lemma 5.2, due to (41), (48) and Corollary 5.1, follows from the analogous property of the sequences {y_{k_s}^{k}(ω)}_{k≥k_s} generated by (43)-(45). We formulate this property as a separate lemma.

Lemma 5.3 Let ω be such that {ζ_0^k(ω)}_{k=0}^{∞} has a limit. Assume that lim_{s→∞} x^{k_s}(ω) = x(ω) ∈ X with x(ω) ∉ X*. Denote

m_s(ε, ω) = sup {m | ‖y_{k_s}^{k}(ω) − x(ω)‖ ≤ ε for k ∈ [k_s, m)}.

Then a.s. there exists ε̄(ω) such that for any ε ∈ (0, ε̄] there exist indices l_s(ω) ∈ [k_s(ω), m_s(ε, ω)] with

F(x(ω)) = lim_{s→∞} F(x^{k_s}(ω)) > lim sup_{s→∞} F(y_{k_s}^{l_s}(ω)).   (52)

Lemma 5.3 follows from the following stability property of the deterministic subgradient method.

Lemma 5.4 Let a sequence of starting points {y_s} converge to y = lim_{s→∞} y_s. For each s consider a sequence {y_s^k}_{k=k_s}^{n_s} such that

y_s^{k_s} = y_s,
y_s^{k+1} = y_s^k − ρ_s^k (g_s^k + h_s^k),  k_s ≤ k < n_s,

g_s^k ∈ G_{δ_s^k}(y_s^k) = co {g ∈ ∂f(z) | ‖z − y_s^k‖ ≤ δ_s^k},
h_s^k ∈ {(z − Π_X(z))/ρ_s^k | ‖z − ȳ_s^k‖ ≤ δ_s^k},  ȳ_s^k = y_s^k − ρ_s^k g_s^k.

Denote

ρ̄_s = sup_{k_s ≤ k ≤ n_s} ρ_s^k,  δ̄_s = sup_{k_s ≤ k ≤ n_s} δ_s^k,  σ_s = Σ_{k=k_s}^{n_s−1} ρ_s^k.

If 0 ∉ ∂f(y) + N_X(y) and σ_s ≥ σ > 0, then for any sufficiently small ε there exist ρ̄ = ρ̄(y, ε) and δ̄ = δ̄(y, ε) such that for sequences {y_s^k}_{k=k_s}^{n_s} with δ_s^k ≤ δ̄ and ρ_s^k ≤ ρ̄ there exist indices l_s such that ‖y_s^k − y‖ ≤ ε for k ∈ [k_s, l_s) and

f(y) = lim_{s→∞} f(y_s) > lim sup_{s→∞} f(y_s^{l_s}).

Proof. The proof is similar to the proof of Lemma 4.1. We have to consider again two cases: ψ(y) < 0 and ψ(y) = 0. In the first case the subgradient method operates in a sufficiently small vicinity of y as an unconstrained method, and the statement of the lemma is known (see [19]). In what follows we consider the new case ψ(y) = 0 (the case ψ(y) < 0 may also be considered as a simple repetition of the case ψ(y) = 0). As in the proof of Lemma 4.1, for y = lim_s y_s define μ, ν, γ by (17)-(19) and ε₁, ε₂, ε₃, c such that (20)-(24) hold.

Now set

ε̄ = min{ε₃, σν/2}

and fix some ε ≤ ε̄. Set δ̄₁ = ε/4, ρ̄₁ = ε/(4Γ). Let ‖y_s − y‖ ≤ ε/4, δ̄_s ≤ δ̄₁ and ρ̄_s ≤ ρ̄₁ for s ≥ S.

Define the index

m_s = sup {m | ‖y_s^r − y‖ ≤ ε/2 ∀ r ∈ [k_s, m)}.


We now show that ε/2 ≤ ‖y_s^{m_s} − y‖ ≤ 3ε/4. First we prove the left inequality. If ‖y_s^{m_s} − y‖ ≤ ε/2, then m_s = n_s and we obtain a contradiction:

3ε/4 ≥ ‖y_s^{n_s} − y_s‖ ≥ (ν/2) σ_s ≥ σν/2 ≥ ε̄ ≥ ε.

Furthermore,

‖y_s^{m_s} − y‖ ≤ ‖y_s^{m_s−1} − y‖ + ρ_s^{m_s−1} ‖g_s^{m_s−1} + h_s^{m_s−1}‖ ≤ 3ε/4.

Since

ε/4 ≤ ‖Σ_{k=k_s}^{m_s−1} ρ_s^k (g_s^k + h_s^k)‖ ≤ Γ Σ_{k=k_s}^{m_s−1} ρ_s^k,

then

Σ_{k=k_s}^{m_s−1} ρ_s^k ≥ ε/(4Γ).

Let g_s^k ∈ G_{δ_s^k}(y_s^k); then

g_s^k = Σ_{i=1}^{n+1} λ_s^{ki} g_s^{ki},  Σ_{i=1}^{n+1} λ_s^{ki} = 1,
g_s^{ki} ∈ ∂f(y_s^{ki}),  ‖y_s^{ki} − y_s^k‖ ≤ δ_s^k.

If ‖y_s − y‖ ≤ ε/4, δ̄_s ≤ ε/4, k_s ≤ k ≤ m_s, 1 ≤ i ≤ n+1, then

‖y_s^{ki} − y‖ ≤ ‖y_s^{ki} − y_s^k‖ + ‖y_s^k − y‖ ≤ δ_s^k + 3ε/4 ≤ ε ≤ ε₃.

For y_s^{ki} we can use (23):

f(y_s^{ki}) ≤ f(y) + ⟨g_s^{ki}, y_s^{ki} − y_s⟩ + c ‖y_s^{ki} − y_s‖ + (Γ + c) ‖y_s − y‖.

If we replace y_s^{ki} (1 ≤ i ≤ n+1) by the close point y_s^k, then

f(y_s^k) ≤ f(y) + ⟨g_s^{ki}, y_s^k − y_s⟩ + c ‖y_s^k − y_s‖ + (2Γ + c) δ̄_s + (Γ + c) ‖y_s − y‖.

Multiplying these inequalities by λ_s^{ki} and summing in i, we obtain

f(y_s^k) ≤ f(y) + ⟨g_s^k, y_s^k − y_s⟩ + c ‖y_s^k − y_s‖ + (2Γ + c) δ̄_s + (Γ + c) ‖y_s − y‖
        = f(y) + ⟨g_s^k + h_s^k, y_s^k − y_s⟩ − ⟨h_s^k, y_s^k − y_s⟩ + c ‖y_s^k − y_s‖ + (2Γ + c) δ̄_s + (Γ + c) ‖y_s − y‖,   (53)

where

h_s^k = (ỹ_s^k − z_s^k)/ρ_s^k,  ‖ỹ_s^k − ȳ_s^k‖ ≤ δ_s^k,  z_s^k = Π_X(ỹ_s^k).

Let us evaluate the term u_s^k = −⟨h_s^k, y_s^k − y_s⟩. If ψ(ỹ_s^k) ≤ 0 then h_s^k = 0 and u_s^k = 0. Consider the case ψ(ỹ_s^k) > 0, i.e. h_s^k ≠ 0. Since

h_s^k ∈ N_X(z_s^k) = {λ g | g ∈ ∂ψ(z_s^k), λ ≥ 0},

then

h_s^k = λ_s^k d_s^k,  d_s^k ∈ ∂ψ(z_s^k),  λ_s^k > 0.
