A flexible framework for cubic regularization algorithms for non-convex optimization in function space


Anton Schiela

November 3, 2017

Abstract

We propose a cubic regularization algorithm that is constructed to deal with non-convex minimization problems in function space. It allows for a flexible choice of the regularization term and thus accounts for the fact that in such problems one often has to deal with more than one norm. Global and local convergence results are established in a general framework.

AMS MSC 2000: 49M37, 90C30

Keywords: non-convex optimization, optimization in function space, cubic regularization

1 Introduction

In broad terms, non-linear optimization algorithms rely on two types of models: a local model $q_x$ at a given point $x$ of the functional $f$ to be minimized, which can be treated with techniques of linear algebra, and a rough parametrized model for the remaining difference between $q_x$ and the functional, i.e., the local error $f - q_x$. Usually such an error model is based on a norm $\|\cdot\|$. In classical trust region methods (cf. e.g. [4]) the error model is $0$ inside a ball of varying radius around the current iterate (the trust region) and $+\infty$ outside.

In cubic regularization methods the error model is chosen according to the assumption that $f - q_x$ is of third order. Thus, a scaling of $\|\cdot\|^3$ by an algorithmic parameter (called $\omega$ in the following) is taken as a model for the error.

The reason for introducing such an error model (in contrast to a line-search approach) is the wish to transfer information about the error $f - q_x$ attained at sampling points (i.e., at trial steps) to a whole neighborhood of the current iterate. The implicit assumption behind this reasoning is that the error indeed behaves more or less isotropically with respect to the chosen norm. Of course, this cannot be guaranteed in general, but in those cases where the error model predicts the actual error well, we expect a stable and efficient behavior of our algorithm. It is thus important, in particular for large scale problems, to choose the error model, and thus the underlying norm, carefully. Many state-of-the-art optimization codes incorporate this idea by allowing the use of a preconditioner for the problem, which in turn defines a problem related norm $\|\cdot\|$.

In this paper we consider a cubic error model. This idea is not new and, to the best knowledge of the author, was first proposed by Griewank [9] in an unpublished technical report. Independently, Weiser, Deuflhard, and Erdmann [20] proposed a cubic regularization in an attempt to generalize the works [6, 7] on convex optimization to the non-convex case. Focus was laid on the construction of estimates for the third order remainder term. Even more recently, Cartis, Gould, and Toint proposed an algorithmic framework, similar to trust-region methods, but with a cubic regularization term, and provided a detailed first and second order convergence analysis [2] and a complexity analysis [3]. The common idea of all these methods is the cubic regularization, but apart from this basic idea the proposed methods differ significantly.

Compared to trust-region methods cubic regularization methods have some appealing features which caused recent interest in them. First, as already indicated, the cubic term can straightforwardly be interpreted as a model for the local error f −qx as long as the model is second order consistent. At least in the smooth case Taylor expansion shows that this difference is of third order. This allows for an elegant update of the scaling parameter ω (cf. (26), below). Second, [3] have shown that cubic regularization methods exhibit better worst case complexity bounds than trust-region methods. Nevertheless, both classes of methods have many things in common, and even admit a unified convergence theory, as shown in [16].

Many large scale optimization problems are discretizations of problems with partial differential equations (PDEs). These may comprise problems of energy minimization, such as nonlinear elliptic problems, or problems from optimal control of PDEs. If one wants to apply cubic regularization methods in this setting, one is naturally led to versions that work in function space, and the need for a convergence theory in function space arises. In trust-region methods this topic is well understood and several works have been published that pursue this line of thought (cf. e.g. [15, 11, 19, 18]). Concerning cubic regularization, a global convergence theory as performed in [2] may be lifted to function space in a relatively straightforward way. If $f$ is continuously differentiable and the second order term is bounded above and below with respect to a norm $\|\cdot\|$, one ends up with a globally convergent cubic regularization method that is equipped with the error model $f - q_x \sim \|\cdot\|^3$. Thus, the error model is based on the same norm that is used to define directions of steepest descent and thus Cauchy points, which are needed to define acceptable search directions.

However, analysis reveals, as we will sketch in Section 2.2, that there are large classes of infinite dimensional problems where the behaviour of $f - q_x$ is not described adequately by the norm $\|\cdot\|$. The error term would appear highly anisotropic when compared to the model $\|\cdot\|^3$. Rather, one can find different third order models, which we call $R_x(\cdot)$, that are much better suited as a model: $f - q_x \sim R_x(\cdot)$. This analytic structure calls for decoupling the norm $\|\cdot\|$ used to define directions of steepest descent from the model $R_x(\cdot)$ for the third order terms. In cubic regularization methods this can be realized easily. Of course, $R_x(\cdot)$ cannot be chosen arbitrarily, but has to be compatible with the problem. The aim of this paper is to find and analyse such compatibility conditions on the choice of $R_x(\cdot)$. Our framework employs two norms which are used to formulate the required regularity assumptions on $f$. Then conditions depending on these norms are imposed on $R_x(\cdot)$ which allow for a global and a local convergence analysis.

We introduce our flexible analytic framework in Section 2, where we also discuss some examples to illustrate and motivate the abstract concepts. In Section 3 we develop our algorithmic framework. It resembles in a couple of points the classical trust-region algorithms, with the usual fraction of model decrease acceptance criterion and a fraction of Cauchy decrease condition. However, the latter condition has to be modified to take into account the non-equivalence of norms. Also here it was our aim to leave as much flexibility as possible for concrete implementations of algorithms, concerning updates of regularization parameters and computation of steps. Within this framework we show global and local convergence results in Section 4.

We emphasize that the focus of this paper is to establish a framework for algorithms rather than to propose concrete implementations. In particular, we will only briefly address the issue of step computation and rather provide minimal requirements that acceptable steps have to fulfill. There are excellent candidates available in the literature (cf. e.g. [4, Chapt. 5] in a general context and [9, 20, 2] for cubic regularization) that certainly can be used within our framework. We would like to postpone the treatment of the arising algorithmic issues, as well as numerical testing, to a future publication.

2 Functional analytic framework

Consider for a given function $f: X \to \mathbb{R}$ on a linear space $X$ the minimization problem
\[
\min_{x\in X} f(x).
\]
Suppose that we can compute for each $x \in X$ a quadratic model, consisting of a linear functional $f'_x: X \to \mathbb{R}$ and a bilinear form $H_x: X\times X \to \mathbb{R}$:
\[
q_x(\delta x) := f(x) + f'_x\,\delta x + \tfrac12 H_x(\delta x, \delta x). \tag{1}
\]

Further, let us denote the error of the quadratic model as follows:
\[
w_x(\delta x) := f(x+\delta x) - q_x(\delta x) \tag{2}
\]
with the help of a function $w_x: X \to \mathbb{R}$. Later, we will impose various smoothness conditions on $f$, i.e., make assumptions on the limiting behavior of $w_x$ for small $\delta x$. Depending on the smoothness of the problem, $w_x(\lambda\,\delta x)$ may be of higher order locally, such as $o(\lambda)$, $o(\lambda^2)$ or even $O(\lambda^3)$ as $\lambda \to 0$. The last case motivates the construction of the following cubic model for $f$ with parameter $\omega > 0$:
\[
f(x+\delta x) - f(x) \approx m^\omega_x(\delta x) := f'_x\,\delta x + \tfrac12 H_x(\delta x, \delta x) + \frac{\omega}{6}R_x(\delta x). \tag{3}
\]
Here $R_x$ is a functional which is homogeneous of order 3:
\[
R_x(\lambda\,\delta x) = |\lambda|^3 R_x(\delta x) \quad \forall \lambda \in \mathbb{R} \tag{4}
\]
and positive:
\[
R_x(\delta x) > 0 \quad \forall \delta x \ne 0. \tag{5}
\]
In (3) the parameter $\omega > 0$ is updated adaptively during the course of the algorithm in order to globalize the method. Comparison of (2) and (3) yields that $(\omega/6)R_x$ can be seen as a model for $w_x$. The subscript in $R_x$ indicates that $R_x$ may vary from point to point, as long as the conditions imposed below hold independently of $x$.
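For a concrete picture, the following minimal sketch evaluates the cubic model (3), assuming $X = \mathbb{R}^n$ with the Euclidean norm and the classical choice $R_x(v) = \|v\|^3$; the quadratic model data are made up for illustration.

```python
import numpy as np

# Minimal sketch on X = R^n: grad represents f'_x, hess the bilinear form H_x, and R is a
# positive functional that is homogeneous of degree 3, cf. (4)-(5).
def cubic_model(grad, hess, R, omega, dx):
    """m_x^omega(dx) = f'_x dx + 1/2 H_x(dx, dx) + (omega/6) R_x(dx), cf. (3)."""
    return grad @ dx + 0.5 * dx @ (hess @ dx) + (omega / 6.0) * R(dx)

R = lambda v: np.linalg.norm(v) ** 3            # classical choice R_x(v) = ||v||^3
grad = np.array([1.0, -2.0])                    # made-up data for f'_x
hess = np.array([[2.0, 0.0], [0.0, -1.0]])      # indefinite: the quadratic model is non-convex
print(cubic_model(grad, hess, R, omega=10.0, dx=np.array([-0.5, 1.0])))
```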

2.1 Assumptions for global and local convergence

If $X$ is equipped with the norm $\|\cdot\|$, the classic cubic regularization method uses $R_x(\delta x) := \|\delta x\|^3$. However, in most function space problems an adequate analysis requires the use of several non-equivalent norms. There are a couple of different issues, each of which may require a separate choice on its own. This is why we aim for a theoretical framework that is flexible with respect to choosing more than one norm.


Assumptions for global convergence. Let us collect the following set of assumptions for later reference, which are needed to show global convergence, i.e., $\liminf_{k\to\infty}\|f'_{x_k}\| = 0$, for our algorithm.

(i) Let $(X,\|\cdot\|)$ be a Hilbert space and $X^*$ its normed dual. The primary norm $\|\cdot\|$ on $X$ has to be strong enough that $f$ is continuously differentiable on $X$. This means in particular that we have for each $x \in X$ the property:
\[
\|f'_x\| := \sup_{\|\delta x\| = 1} |f'_x\,\delta x| < \infty, \tag{6}
\]
and, moreover, that $x_k \to x$ in $(X,\|\cdot\|)$ implies $f'_{x_k} \to f'_x$ in $X^*$, i.e., w.r.t. the norm (6).

(ii) The secondary norm $|\cdot|$ of $X$ is used to describe possible non-convexity of the quadratic model $q_x$. We assume that $|\cdot|$ is weaker than $\|\cdot\|$:
\[
\exists\, C < \infty: \quad |v| \le C\|v\| \quad \forall v \in X.
\]
With the help of our two norms we impose a condition of Gårding type:
\[
\exists\, \gamma > -\infty,\ \Gamma < \infty: \quad \gamma|v|^2 \le H_x(v,v) \le \Gamma\|v\|^2 \quad \forall v \in X. \tag{7}
\]
We do not assume completeness of $(X,|\cdot|)$, which allows us to choose $|\cdot|$ strictly weaker than $\|\cdot\|$. Hence, $H_x$ is assumed to be bounded below in a weaker norm than it is bounded above. Similar conditions appear, for example, in the theory of linear monotone operators (cf. e.g. [22, Chap. 22]). In the next section we will discuss some examples where this condition is fulfilled.

(iii) The main purpose of $R_x$ is to compensate the possible non-convexity of the quadratic models and to model the remainder term. Thus, we impose the following flexible boundedness and coercivity condition (without a constant in the left inequality, for simplicity):
\[
|v|^3 \le R_x(v) \le C\|v\|^3 \quad \forall v \in X. \tag{8}
\]

Among these assumptions the only standing assumption we will use is the existence of $f'_x$ in $X^*$. All other assumptions will be referenced later, when needed.

Lemma 2.1. Consider a sequence $x_k$ in $X$ that converges to some limit $x$ and a sequence $\delta v_k \to 0$ in $X$. Assume that $f$ is continuously differentiable and (7) holds. Then for the remainder term, defined in (2), we conclude
\[
\lim_{k\to\infty}\frac{w_{x_k}(\delta v_k)}{\|\delta v_k\|} = 0. \tag{9}
\]

Proof. Concerning (9), we conclude from a standard result of analysis (cf. e.g. [14, Thm. 25.23], an application of the fundamental theorem of calculus) that
\[
\lim_{k\to\infty}\|\delta v_k\| = 0 \quad\Rightarrow\quad \lim_{k\to\infty}\frac{f(x_k+\delta v_k) - f(x_k) - f'_{x_k}\delta v_k}{\|\delta v_k\|} = 0.
\]
Moreover, if $\delta v_k \to 0$, then (7) implies $H_{x_k}(\delta v_k, \delta v_k) = o(\|\delta v_k\|)$. Hence
\[
w_{x_k}(\delta v_k) = f(x_k+\delta v_k) - q_{x_k}(\delta v_k) = f(x_k+\delta v_k) - f(x_k) - f'_{x_k}\delta v_k + o(\|\delta v_k\|).
\]
Combining these two results yields (9).


Our global convergence results will be proved by contradiction. The following lemma, which uses reflexivity of the Hilbert space X, will serve as a key facility to obtain this contradiction (see Theorem 4.3).

Lemma 2.2. Let $x_k \in X$ be a sequence such that $|x_k| \to 0$ and $\|x_k\|$ is bounded. Then by reflexivity of $X$:
\[
x_k \rightharpoonup 0 \quad \text{weakly in } (X,\|\cdot\|).
\]

Proof. Since $(X,\|\cdot\|)$ is reflexive, $x_k$ has a weakly convergent subsequence, say $x_{k_j} \rightharpoonup x$. Since $|x_{k_j}| \to 0$ we conclude that for each $\varepsilon > 0$, $x_{k_j}$ is eventually contained in a ball of $|\cdot|$-radius $\varepsilon$, and thus also $x$ has to be contained in that ball. It follows that $x = 0$. This also shows that every possible weak accumulation point of our sequence is $0$, so by a standard argument the whole sequence converges weakly to $0$.

Assumptions for fast local convergence. If we want to show fast local convergence we need the following additional assumptions, which strengthen assumptions (i) and (ii) from the above list:

(i)_loc Setting $\delta x_k = x_{k+1} - x_k$ we need a second order approximation error estimate in (2):
\[
\lim_{\|x_k - x\|\to 0}\frac{w_{x_k}(\delta x_k)}{\|\delta x_k\|^2} = 0 \tag{10}
\]
close to a local minimizer, which is fulfilled in particular if $f$ is twice continuously differentiable and $H_x = f''_x$.

(ii)_loc Locally, we have to impose stronger assumptions on $H_x$. Close to a minimizer we assume, in addition to (7), ellipticity of $H_x$ with respect to the strong norm $\|\cdot\|$:
\[
\exists\, \gamma > 0: \quad \gamma\|\delta x\|^2 \le H_x(\delta x, \delta x). \tag{11}
\]

2.2 Examples

To get a feeling for the peculiarities of the class of problems that fit into our framework, we will discuss a couple of typical examples. The purpose of Section 2.2.1 is to show that several norms appear naturally in non-convex optimization problems in function space, while Section 2.2.2 illustrates how $R_x$ can be chosen and why the additional flexibility of our new framework is beneficial. In Section 2.2.3 a further important class of problems is introduced that falls into our framework.

2.2.1 Two illustrative examples

The following well known simple examples serve as an illustration of why the choice of two norms in (7) is quite natural in infinite dimensional optimization.

In contrast to finite dimensional problems, where existence of minimizers of lower semi-continuous functions is obtained by the classical theorem of Weierstrass, infinite dimensional problems are notoriously hard to analyse. The main reason is the lack of compactness of closed and bounded sets in infinite dimensions. By turning to weak lower semi-continuity, existence of minimizers can often still be attained, but this property only holds under certain convexity or compactness assumptions on the problem, or subtle combinations of both. As a rule, non-convex problems are tractable as long as there are other, additional compactness results available, such as compact Sobolev embeddings, e.g., $E: H^1_0 \hookrightarrow L^p$ (cf. e.g. [1]).

The interested reader is referred to the textbooks [8, 5] for a thorough introduction.


Let us consider the following two functionals:

\[
\varphi(v) := \int_0^1 \tfrac12 v(t)^2\,dt, \qquad \psi(v) := \int_0^1 \bigl(v(t)^2 - 1\bigr)^2\,dt.
\]
We observe that $\varphi: L^2(0,1) \to \mathbb{R}$ is convex, while $\psi: L^4(0,1) \to \mathbb{R}$ is non-convex (the integrand has minima at $\pm 1$), and both functionals are non-negative.

The following minimization problem, which involves the first derivative $\dot u = du/dt$, is well defined in $H^1_0(0,1)$:
\[
\min_{u\in H^1_0(0,1)} f(u) := \varphi(\dot u) + \psi(Eu).
\]
The non-convex part of this problem appears together with the compact Sobolev embedding $E$, which suffices to show weak lower semi-continuity of $f$ and conclude existence of minimizers.

In contrast, the following minimization problem, commonly attributed to Bolza, which is well defined on $W^{1,4}_0(0,1)$:
\[
\min_{u\in W^{1,4}_0(0,1)} \tilde f(u) := \varphi(Eu) + \psi(\dot u),
\]
does not admit a global minimizer, as is well known.

In view of (7), let us consider the (formal) second derivatives of $f$ and $\tilde f$, for example at $u = 0$:
\[
f''_u(\delta u,\delta u) = \int_0^1 (\delta u')^2 - 4\,\delta u^2\,dt \quad\Rightarrow\quad -4\|\delta u\|^2_{L^2} \le f''_u(\delta u,\delta u) \le \|\delta u\|^2_{H^1},
\]
\[
\tilde f''_u(\delta u,\delta u) = \int_0^1 \delta u^2 - 4(\delta u')^2\,dt \quad\Rightarrow\quad -4\|\delta u\|^2_{H^1} \le \tilde f''_u(\delta u,\delta u) \le \|\delta u\|^2_{H^1}.
\]
We observe that (7) is fulfilled with two different norms for $f$, with the weaker norm measuring the non-convexity. We may set $|\cdot| = \|\cdot\|_{L^2}$ and $\|\cdot\| = \|\cdot\|_{H^1}$. For $\tilde f$ the choice of a weaker norm for the lower bound is not possible.
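This difference can also be observed numerically. The following finite-difference sketch (grid, quadrature, and the oscillatory test directions $\delta u(t) = \sin(\nu\pi t)$ are assumptions made for illustration only) evaluates both quadratic forms at $u = 0$: the ratio $f''_0(\delta u,\delta u)/\|\delta u\|^2_{L^2}$ stays bounded below by $-4$, while the corresponding ratio for $\tilde f$ is unbounded below and only becomes bounded when measured in the $H^1$-norm.

```python
import numpy as np

# Finite-difference check of the two quadratic forms at u = 0 (grid, quadrature and the
# test directions du(t) = sin(nu*pi*t) are assumptions for illustration only).
n = 1000
t = np.linspace(0.0, 1.0, n + 1)
h = t[1] - t[0]

L2_sq = lambda v: np.trapz(v ** 2, t)                            # ||v||_{L2}^2
H1_sq = lambda v: np.trapz(np.gradient(v, h) ** 2 + v ** 2, t)   # ||v||_{H1}^2

for nu in [1, 10, 100]:
    du = np.sin(nu * np.pi * t)                 # satisfies the zero boundary conditions
    ddu = np.gradient(du, h)
    Hf = np.trapz(ddu ** 2 - 4.0 * du ** 2, t)  # f''_0(du, du)
    Hg = np.trapz(du ** 2 - 4.0 * ddu ** 2, t)  # f~''_0(du, du)
    print(nu, Hf / L2_sq(du), Hg / L2_sq(du), Hg / H1_sq(du))
# First column stays >= -4; the second is unbounded below as nu grows; the third stays
# bounded below, reflecting that f~ needs the H1-norm in the lower bound of (7).
```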

The bottom line is that compactness, an important principle on which existence of minimizers for non-convex problems rests, is closely related to the presence of two norms in (7), where the lower bound is measured in a weaker norm than the upper bound.

2.2.2 Semi-linear elliptic PDEs

In the following let $\Omega \subset \mathbb{R}^d$ ($1 \le d \le 3$) be a smoothly bounded open domain. Further, let $H^1_0(\Omega)$ be the usual Sobolev space of weakly differentiable functions on $\Omega$ with zero boundary conditions. By the Sobolev embedding theorem there is a continuous embedding $H^1_0(\Omega) \hookrightarrow L^6(\Omega)$ for $d \le 3$. Further, denote by $v\cdot w$ the Euclidean scalar product of $v, w \in \mathbb{R}^d$. We denote the spatial variable by $s \in \mathbb{R}^d$.

As a prototypical example we consider the following energy functional of a semi-linear elliptic PDE, $f: H^1_0(\Omega) \to \mathbb{R}$:
\[
f(u) := \int_\Omega \tfrac12 \nabla u(s)\cdot\nabla u(s) + a(u(s), s)\,ds.
\]

Here $a: \mathbb{R}\times\Omega \to \mathbb{R}$ is a Carathéodory function that is twice continuously differentiable with respect to $u$. For more information on the theory of semi-linear PDEs and wider classes of problems we refer the reader to the textbooks [23, 12]; variational approaches can be found e.g. in [8, 21, 5].

In the following discussion we will consider a second order model $q_u$, setting $H_u = f''_u$. We are looking for the solution of the minimization problem
\[
\min_{u\in H^1_0(\Omega)} f(u).
\]

Its (formal) first and second derivatives are given by:
\[
f'_u\,\delta u = \int_\Omega \nabla u\cdot\nabla\delta u + \frac{\partial}{\partial u}a(u,s)\,\delta u\,ds,
\qquad
f''_u(\delta u_1,\delta u_2) = \int_\Omega \nabla\delta u_1\cdot\nabla\delta u_2 + \frac{\partial^2}{\partial u^2}a(u,s)\,\delta u_1\delta u_2\,ds.
\]

Let us analyse these functionals. We may assume that $u \in H^1_0(\Omega)$, which implies that
\[
\Bigl|\int_\Omega \nabla u\cdot\nabla\delta u\,ds\Bigr| \le c(u)\|\delta u\|_{H^1}
\]
and similarly
\[
0 \le \int_\Omega \nabla\delta u\cdot\nabla\delta u\,ds \le \|\delta u\|^2_{H^1}.
\]

Let us now assume for simplicity that $\frac{\partial}{\partial u}a(u,\cdot) \in L^2$ and $\frac{\partial^2}{\partial u^2}a(u,\cdot) \in L^\infty$. We obtain the following estimates for the second parts of the derivatives via the Hölder inequality:
\[
\Bigl|\int_\Omega \frac{\partial}{\partial u}a(u,s)\,\delta u\,ds\Bigr| \le \Bigl\|\frac{\partial}{\partial u}a(u,\cdot)\Bigr\|_{L^2}\|\delta u\|_{L^2} \le c(u)\|\delta u\|_{H^1},
\]
\[
\Bigl|\int_\Omega \frac{\partial^2}{\partial u^2}a(u,s)\,\delta u_1\delta u_2\,ds\Bigr| \le \Bigl\|\frac{\partial^2}{\partial u^2}a(u,\cdot)\Bigr\|_{L^\infty}\|\delta u_1\|_{L^2}\|\delta u_2\|_{L^2}.
\]

Taking these estimates together, we obtain the following results:
\[
|f'_u\,\delta u| \le c(u)\|\delta u\|_{H^1},
\]
\[
c_0(u)\|\delta u\|^2_{L^2} \le f''_u(\delta u,\delta u) \le c_1(u)\|\delta u\|^2_{H^1},
\]
where $c_0(u) > -\infty$ may be negative, and $c(u), c_1(u) < +\infty$ are positive. Our first observation is that $f'$ and $f''$ can be bounded (from above) via a strong norm $\|\cdot\| := \|\cdot\|_{H^1}$, while it only takes a weaker norm $|\cdot| := \|\cdot\|_{L^2}$ to formulate a lower bound on $f''$.

Thus, the condition (8) on $R_u$ reads in this example
\[
\|\delta u\|^3_{L^2} \le R_u(\delta u) \le C\|\delta u\|^3_{H^1}.
\]

Let us take into account that $R_u(\delta u)$ should model the qualitative behaviour of the difference $w_u$ of $f$ and its quadratic model $q_u$ (cf. (2)). Under the assumption that $\frac{\partial^2}{\partial u^2}a$ is Lipschitz w.r.t. $u$ with Lipschitz constant $\omega$, repeated application of the fundamental theorem of calculus yields
\[
|w_u(\delta u)| = |f(u+\delta u) - q_u(\delta u)| = \Bigl|f(u+\delta u) - f(u) - f'_u\,\delta u - \tfrac12 f''_u(\delta u,\delta u)\Bigr|
\]
\[
= \Bigl|\int_\Omega a(u+\delta u,s) - a(u,s) - \frac{\partial}{\partial u}a(u,s)\,\delta u - \tfrac12\frac{\partial^2}{\partial u^2}a(u,s)\,\delta u^2\,ds\Bigr|
\le \frac{\omega}{6}\int_\Omega |\delta u|^3\,ds = \frac{\omega}{6}\|\delta u\|^3_{L^3}.
\]


In particular, we observe that $|w_u(\delta u)|$ only depends on the values of $\delta u$, but not on its derivatives. Thus, an appropriate choice for $R_u$ in this context is
\[
R_u(\delta u) := \|\delta u\|^3_{L^3}.
\]
This is allowed in our flexible framework, while without the added flexibility we would have to choose
\[
\tilde R_u(\delta u) = \|\delta u\|^3_{H^1}.
\]
These two error models differ significantly. In particular, for corrections $\delta u$ that are highly oscillating (like for example $\delta u(s) := c\sin(\nu|s|)$, where the frequency $\nu$ is large), we get that $\tilde R_u(\delta u) \gg R_u(\delta u)$, while for smooth corrections $\delta u$ we expect $\tilde R_u(\delta u) \approx R_u(\delta u)$. Thus, in view of our calculation, $\tilde R_u$ would tend to overestimate $|w_u|$ for rough corrections, so that $|w_u|$ may appear highly anisotropic w.r.t. $\tilde R_u(\delta u)$. This example also demonstrates that the norm induced by a good preconditioner (which would be of $H^1$-type here) is not automatically a suitable norm for measuring remainder terms.
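The effect can also be seen numerically. The sketch below (a one-dimensional finite-difference illustration; grid, quadrature, and the choice $\delta u(s) = 0.1\sin(\nu\pi s)$ are assumptions made for this illustration only) compares the two candidate error models for increasingly oscillatory corrections.

```python
import numpy as np

# Compare the error models R_u(du) = ||du||_{L3}^3 and R~_u(du) = ||du||_{H1}^3 for smooth
# vs. highly oscillatory corrections on (0,1). Grid, quadrature and du(s) = 0.1*sin(nu*pi*s)
# are assumptions for illustration only.
n = 2000
s = np.linspace(0.0, 1.0, n + 1)
h = s[1] - s[0]

R_L3 = lambda du: np.trapz(np.abs(du) ** 3, s)                            # R_u(du)
R_H1 = lambda du: np.trapz(np.gradient(du, h) ** 2 + du ** 2, s) ** 1.5   # R~_u(du)

for nu in [1, 10, 100]:
    du = 0.1 * np.sin(nu * np.pi * s)
    print(nu, R_L3(du), R_H1(du), R_H1(du) / R_L3(du))
# R_u stays essentially constant while R~_u grows like nu^3: the H1-based model heavily
# penalizes rough corrections, although the remainder bound |w_u(du)| <= (omega/6)||du||_{L3}^3
# does not depend on derivatives of du.
```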

2.2.3 Nonlinear optimal control: black-box approach

The aim in (PDE constrained) optimal control is to minimize a cost functional subject to a (partial) differential equation as an equality constraint. For an introduction to this topic, we refer to the textbooks [13, 17, 10, 18]. Usually, the optimization variable is divided into a control $u$, which enters the differential equation as data, and the state $y$, which is the corresponding solution. This relation can be described by a nonlinear operator via $y = S(u)$. Elimination of $y$ then yields an optimization problem of the following form:
\[
\min_{u\in U} f(S(u), u).
\]

This general problem, however, is hardly tractable theoretically, and thus one often restricts considerations to the following special case:
\[
f(S(u), u) = g_1(S(u)) + g_2(u) = g_1(S(u)) + \frac{\alpha}{2}\|u\|^2_U.
\]

If, for example, $S$ is the solution operator of a non-linear elliptic PDE and $U = L^2(\Omega)$, then an appropriate choice of norms would be
\[
\|v\| := \|v\|_{L^2(\Omega)}, \qquad |v| := \|S'(u)v\|_{H^1_0}.
\]
Here $|\cdot|$ depends on $u$, an issue that is encountered frequently. We will ignore this, however, for the sake of simplicity. Usually, $S'(u)$ is a compact linear operator, so that $|\cdot|$ is strictly weaker than $\|\cdot\|$.

Due to the special form of $f$, which only allows non-convexity in $g_1\circ S$, we obtain, similarly as above, the following estimates:
\[
|f'_u\,\delta u| \le c(u)\|\delta u\|,
\]
\[
c_0(u)|\delta u|^2 \le f''_u(\delta u,\delta u) \le c_1(u)\|\delta u\|^2.
\]

3 Algorithmic framework

In this work we will follow to some extent the ideas of [20] and [2] and consider algorithms that are based on successive computation of trial steps, acceptance or rejection of these steps, and update of the model parameters. We consider in the following Algorithm 3.1, which consists of a simple outer iteration in which accepted steps $\delta x_k$ are added to iterates $x_k$, and an inner loop, summarized here in the subroutine "CompAccStep". At a given point $x$, "CompAccStep" computes an increasing sequence of parameters $\omega_i$ and corresponding trial steps $\delta x_i$ until one of them is found acceptable. Moreover, it returns a parameter $\omega_{i+1}$, which is used in the next call of "CompAccStep" as a starting value $\omega_0$. The boolean variable $\iota_{\mathrm{mod}}$ which appears in the algorithm is used in Condition 3.5 (below) to switch a modification of a "fraction-of-Cauchy-decrease" condition on or off.

Concerning the naming of the indices we use the convention that an index $k$ always refers to the outer loop, while the index $i$ is connected to the inner loop at a given point $x$.

Algorithm 3.1.
Input: initial guess $x_0$, initial parameter $\omega_0$

$k \leftarrow -1$, $\iota_{\mathrm{mod}} \leftarrow 0$
repeat (outer loop)
    $k \leftarrow k+1$
    $(\delta x_k,\ \omega_{k+1,0},\ \iota_{\mathrm{mod}},\ \omega_k) = \mathrm{CompAccStep}(x_k,\ \omega_{k,0},\ \iota_{\mathrm{mod}})$
    $x_{k+1} = x_k + \delta x_k$   ($\delta x_k$ is an accepted step for the model $m^{\omega_k}_{x_k}$)
until convergence test satisfied
Output: $x_{k+1}$

subroutine CompAccStep($x$, $\omega_0$, $\iota_{\mathrm{mod}}$)
Input: current iterate $x$, current parameter $\omega_0$, current value of $\iota_{\mathrm{mod}}$

$i \leftarrow -1$
Specify at $x$ the quantities $f'_x$, $H_x$, $R_x$
Compute a direction $\Delta x^C$ that satisfies Condition 3.4
repeat (inner loop)
    $i \leftarrow i+1$
    Create model function $m^{\omega_i}_x$
    Compute $\delta x^C_i$ along the direction $\Delta x^C$ (cf. Condition 3.4)
    if $\omega_i R_x(\delta x^C_i)/\|\delta x^C_i\|^2 \ge C_{\mathrm{mod}}$ then $\iota_{\mathrm{mod}} \leftarrow 1$
    Compute $\delta x_i$ that satisfies Condition 3.5 (which depends on $\iota_{\mathrm{mod}}$)
    Compute $\omega_{i+1}$ as described in Section 3.4
until $\delta x_i$ satisfies Condition 3.7 with $m^{\omega_i}_x$
if $\omega_i R_x(\delta x^C_i)/\|\delta x^C_i\|^2 < C_{\mathrm{mod}}$ then $\iota_{\mathrm{mod}} \leftarrow 0$
Output: $(\delta x_i,\ \omega_{i+1},\ \iota_{\mathrm{mod}},\ \omega_i)$
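To make the control flow concrete, here is a minimal finite-dimensional sketch of the outer loop and of "CompAccStep", under strong simplifying assumptions: $X = \mathbb{R}^n$ with $|\cdot| = \|\cdot\|$ the Euclidean norm, $R_x(v) = \|v\|^3$, $\Delta x^C = -f'_x$, and the quasi Cauchy step itself is used as trial step (so Condition 3.5 and $\iota_{\mathrm{mod}}$ are trivially satisfied); the $\omega$ update is the simple example rule from Section 3.4. This is an illustration of the structure, not the implementation proposed in the paper; the test function and all constants are made up.

```python
import numpy as np

def directional_minimizer(g, H, omega, d):
    """Return lambda*d, lambda >= 0, minimizing m_x^omega on span{d} for a descent
    direction d with g @ d <= 0; lambda solves the scalar quadratic equation (17)."""
    c, b, a = g @ d, d @ (H @ d), 0.5 * omega * np.linalg.norm(d) ** 3
    lam = (-b + np.sqrt(b * b + 4.0 * a * abs(c))) / (2.0 * a)
    return lam * d

def cubic_model(g, H, omega, dx):                 # m_x^omega(dx), cf. (3)
    return g @ dx + 0.5 * dx @ (H @ dx) + omega / 6.0 * np.linalg.norm(dx) ** 3

def comp_acc_step(f, g, H, f_x, x, omega, eta_bar=0.1, eta_good=0.9, rho=0.5):
    """Inner loop 'CompAccStep': raise omega until Condition 3.7 accepts a trial step."""
    dC = -g                                       # direction of significant descent, cf. (18)
    for _ in range(100):
        dx = directional_minimizer(g, H, omega, dC)
        eta = (f(x + dx) - f_x) / cubic_model(g, H, omega, dx)        # cf. (24)
        q = f_x + g @ dx + 0.5 * dx @ (H @ dx)                        # q_x(dx), cf. (1)
        omega_raw = 6.0 * (f(x + dx) - q) / np.linalg.norm(dx) ** 3   # cf. (26)
        omega_next = omega if eta > eta_good else max(rho * omega, min(omega / rho, omega_raw))
        if eta >= eta_bar:                        # acceptance, cf. (25)
            return dx, omega_next
        omega = omega_next                        # rejected: omega increases, cf. Thm. 3.8 (i)
    raise RuntimeError("inner loop did not terminate")

def cubic_regularization(f, grad, hess, x0, omega0=1.0, tol=1e-8, max_iter=500):
    x, omega = np.asarray(x0, float), omega0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:              # convergence test of the outer loop
            break
        dx, omega = comp_acc_step(f, g, hess(x), f(x), x, omega)
        x = x + dx
    return x

# Usage on a small non-convex test problem with minimizers at (+-1, 0):
f    = lambda x: (x[0]**2 - 1.0)**2 + x[1]**2
grad = lambda x: np.array([4.0*x[0]*(x[0]**2 - 1.0), 2.0*x[1]])
hess = lambda x: np.array([[12.0*x[0]**2 - 4.0, 0.0], [0.0, 2.0]])
print(cubic_regularization(f, grad, hess, [0.5, 1.0]))
```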

This general algorithm offers room for a large variety of implementations. They may differ in the way $\delta x$ is computed, $\omega$ is updated, and iterates are accepted. In this section we will discuss the main features of this algorithm and show that the subroutine "CompAccStep" terminates after finitely many iterations. Global and local convergence properties of the overall algorithm are discussed in Section 4.

In practical implementations, a convergence test (last line of the outer loop) may check whether $\|f'_{x_{k+1}}\|$ is sufficiently small. For our theoretical purpose, where we consider a possibly infinite sequence of iterates, it is sufficient to check whether $f'_{x_{k+1}} = 0$.


3.1 Directional model minimizers

As a minimal requirement, we suppose that any trial step $\delta x$ computed in Algorithm 3.1 minimizes $m^\omega_x$ (defined in (3)) on $\mathrm{span}\{\delta x\}$. We call any such correction $\delta x$ a directional minimizer of $m^\omega_x$. Directional minimizers are easy to compute and have nice properties.

Existence of a minimizer of $m^\omega_x$ in $X$ may not hold due to a possible lack of $\|\cdot\|$-coercivity (consider e.g. the case $m^\omega_x(\delta x) = f'_x\,\delta x + (\omega/6)|\delta x|^3$). However, directional minimizers of $m^\omega_x$ always exist.

Lemma 3.2. For a directional minimizer $\delta x$ of $m^\omega_x$ it holds $f'_x\,\delta x \le 0$ and
\[
0 = f'_x\,\delta x + H_x(\delta x,\delta x) + \frac{\omega}{2}R_x(\delta x), \tag{12}
\]
\[
m^\omega_x(\delta x) = \tfrac12 f'_x\,\delta x - \frac{\omega}{12}R_x(\delta x) \tag{13}
\]
\[
\phantom{m^\omega_x(\delta x)} = -\tfrac12 H_x(\delta x,\delta x) - \frac{\omega}{3}R_x(\delta x). \tag{14}
\]

Proof. Since the term $\tfrac12 H_x(\delta x,\delta x) + (\omega/6)R_x(\delta x)$ is the same for $+\delta x$ and $-\delta x$, it follows that $m^\omega_x(-\delta x) < m^\omega_x(\delta x)$ if $f'_x\,\delta x > 0$. Hence, a directional minimizer of $m^\omega_x$ satisfies $f'_x\,\delta x \le 0$.

As first order optimality condition for a minimizer $\delta x$ of $m^\omega_x$ we compute:
\[
0 = (m^\omega_x)'(\delta x)v = f'_x v + H_x(\delta x, v) + \frac{\omega}{6}R'_x(\delta x)v \quad \forall v \in \mathrm{span}\{\delta x\}. \tag{15}
\]
By homogeneity (4) of $R_x$ we have $R'_x(\delta x)\,\delta x = 3R_x(\delta x)$, so that testing (15) with $v = \delta x$ yields (12). Inserting this into the definition of $m^\omega_x$, we obtain (13)-(14).

The following basic property is a simple consequence:

Lemma 3.3. Denote by $\delta x(\omega) = \lambda(\omega)\Delta x$ directional model minimizers along a fixed direction $\Delta x$ for varying $\omega > 0$. We have
\[
\lim_{\omega\to\infty}\|\delta x(\omega)\| = \lim_{\omega\to\infty}\lambda(\omega) = 0. \tag{16}
\]

Proof. Fix $\omega_0 > 0$ and denote the corresponding directional minimizer in our direction by $\Delta x$. For any other $\omega > 0$ we have $\delta x(\omega) = \lambda\Delta x$ with some $\lambda > 0$. Inserting this into (12) and dividing by $\lambda$, we obtain the following quadratic equation for $\lambda$:
\[
0 = f'_x\,\Delta x + H_x(\Delta x,\Delta x)\,\lambda + \frac{\omega}{2}R_x(\Delta x)\,\lambda^2, \tag{17}
\]
which depends on the parameter $\omega > 0$ and is of the form $0 = c + 2b\lambda + a_\omega\lambda^2$ with $c = f'_x\,\Delta x < 0$ and $a_\omega > 0$. This yields exactly one non-negative solution
\[
\lambda(\omega) = \frac{1}{a_\omega}\Bigl(-b + \sqrt{b^2 + |c|\,a_\omega}\Bigr).
\]
Considering the limit $\omega\to\infty$ we find that (16) holds.
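A quick numerical check of this scaling behavior (the scalar coefficients below, including a negative curvature term $H_x(\Delta x,\Delta x) < 0$, are made-up test data):

```python
import numpy as np

# Check that lambda(omega) -> 0 as omega -> infinity, cf. (16)/(17): c = f'_x Dx < 0,
# 2b = H_x(Dx, Dx) and R = R_x(Dx) are made-up scalar test data.
c, two_b, R = -1.0, -3.0, 1.0
b = two_b / 2.0
for omega in [1e-2, 1.0, 1e2, 1e4]:
    a = omega / 2.0 * R              # a_omega in the quadratic equation (17)
    lam = (-b + np.sqrt(b * b + a * abs(c))) / a   # the unique non-negative root
    print(omega, lam)                # lam decreases towards 0
```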

3.2 Acceptable steps

In addition to being a directional minimizer, a trial step $\delta x$ has to satisfy a "fraction of Cauchy decrease" type condition. Classically, this involves the explicit computation of a direction of steepest descent $\Delta x^{SD}$ in each step of the outer loop. Its purpose is to establish a link between primal quantities $\delta x$ and dual quantities $f'_x$. We emphasize that steepest descent directions depend on the choice of the norm $\|\cdot\|$.

In many cases the analytically straightforward choice of $\|\cdot\|$ will lead to a rather expensive computation of $\Delta x^{SD}$. For example, if $\|\cdot\|_{H^1}$ is used, then $\Delta x^{SD}$ has to be computed from $f'_x$ via the solution of an elliptic partial differential equation. It is sufficient, however, to compute directions of significant descent:

Condition 3.4. Let $1 \ge \mu > 0$ be fixed. We compute at the beginning of each inner loop a fixed direction $\Delta x^C$ that satisfies
\[
f'_x\,\Delta x^C \le -\mu\,\|f'_x\|\,\|\Delta x^C\|, \tag{18}
\]
and call $\Delta x^C$ a direction of significant descent. For given $\omega > 0$ the directional minimizer of $m^\omega_x$ in the direction of $\Delta x^C$ is called the quasi Cauchy step $\delta x^C$.

Often such steps are much cheaper to compute via a preconditioner than exact steepest descent directions. The corresponding parameter $\mu$ does not need to be specified explicitly.

Note that $\delta x^C$ results from a scaling of $\Delta x^C$ (cf. Lemma 3.3):
\[
\delta x^C = \lambda(\omega)\,\Delta x^C, \quad\text{for some } \lambda(\omega) > 0.
\]
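For illustration, here is a small finite-dimensional sketch of such a direction obtained from a cheap preconditioner (the SPD matrix $A$ defining $\|\cdot\|$, the Jacobi preconditioner $P$, and the vector $g$ representing $f'_x$ are made-up test data); it reports the ratio appearing in (18) for the exact steepest descent direction and for the preconditioned one.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 5.0]])           # stand-in for the operator defining ||v||^2 = v'Av
g = np.array([1.0, -2.0, 0.5])            # stand-in for the derivative f'_x

dxSD = -np.linalg.solve(A, g)             # exact steepest descent w.r.t. ||.|| (needs a solve with A)
P = np.diag(np.diag(A))                   # cheap Jacobi preconditioner
dxC = -np.linalg.solve(P, g)              # direction of significant descent

def descent_ratio(d):                     # the ratio -f'_x d / (||f'_x|| ||d||) appearing in (18)
    dual_norm_g = np.sqrt(g @ np.linalg.solve(A, g))   # dual norm of f'_x w.r.t. ||.||
    return -(g @ d) / (dual_norm_g * np.sqrt(d @ (A @ d)))

print(descent_ratio(dxSD), descent_ratio(dxC))  # 1.0 for steepest descent, close to 1 for dxC
```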

In our flexible framework $R_x$ can be chosen quite independently of $\|\cdot\|$. This results in a modification of the classical Cauchy decrease condition. This modification penalizes irregular search directions, i.e., directions where $\|\delta x\|^3 \gg R_x(\delta x)$, and thus prevents the iterates from leaving $(X,\|\cdot\|)$.

However, since such a modification might exclude useful search directions, we will only employ it to enforce global convergence in difficult cases. To this end we introduce the logical variable $\iota_{\mathrm{mod}} \in \{0,1\}$, which is set to 1 if during an inner loop the relation
\[
\frac{\omega_i R_x(\delta x^C_i)}{\|\delta x^C_i\|^2} \ge C_{\mathrm{mod}} \quad\text{for some constant } C_{\mathrm{mod}} > 0 \tag{19}
\]
holds, and it is set to 0 if after acceptance of a step the converse relation holds (cf. Algorithm 3.1). We will show in the course of our analysis that the left hand side of this inequality tends to $\infty$ (thus $\iota_{\mathrm{mod}} = 1$ eventually) if global convergence is delayed, while it tends to 0 (thus $\iota_{\mathrm{mod}} = 0$ eventually) in case of fast local convergence.

Condition 3.5. Let $1 \ge \beta > 0$ be fixed and $\delta x^C$ be the quasi Cauchy step of $m^\omega_x$. For a given directional minimizer $\delta x$ define
\[
\sigma := \frac{R_x(\delta x^C)}{\|\delta x^C\|^3}\cdot\frac{\|\delta x\|^3}{R_x(\delta x)}, \qquad
\overline\beta := \begin{cases} \beta & \text{if } \iota_{\mathrm{mod}} = 0,\\[1mm] \beta\max\{1,\sqrt{\sigma}\} & \text{if } \iota_{\mathrm{mod}} = 1. \end{cases} \tag{20}
\]
Then choose $\delta x$ as a directional minimizer of $m^\omega_x$ such that
\[
m^\omega_x(\delta x) \le \overline\beta\, m^\omega_x(\delta x^C). \tag{21}
\]

The criterion (20) reduces to $\overline\beta = \beta$ if either $\delta x = \delta x^C$, so that $\delta x^C$ is always acceptable, or $R_x(\cdot) = \|\cdot\|^3$, which is the classical case.
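The effect of the modification can be illustrated with a small sketch (the discrete $\ell^3$-type choice for $R_x$ and the two test vectors below are assumptions for illustration): for directions with $\|\delta x\|^3 \gg R_x(\delta x)$ the factor $\sigma$, and hence $\overline\beta$, becomes large, so that (21) is harder to satisfy.

```python
import numpy as np

R = lambda v: np.sum(np.abs(v) ** 3)       # a discrete L3-type choice for R_x (illustrative)

def beta_bar(dx, dxC, beta=0.1, iota_mod=1):
    """The factor in (20): beta if iota_mod = 0, else beta * max{1, sqrt(sigma)}."""
    sigma = (R(dxC) / np.linalg.norm(dxC) ** 3) * (np.linalg.norm(dx) ** 3 / R(dx))
    return beta if iota_mod == 0 else beta * max(1.0, np.sqrt(sigma))

dxC = np.zeros(100); dxC[0] = 1.0          # made-up quasi Cauchy step
dx_regular = dxC.copy()                    # same direction: sigma = 1, beta_bar = beta
dx_irregular = np.full(100, 0.1)           # ||dx||^3 = 1 but R(dx) = 0.1: sigma = 10
print(beta_bar(dx_regular, dxC), beta_bar(dx_irregular, dxC))
```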


Lemma 3.6. The following inequality holds for $\delta x^C$, as defined in Condition 3.4:
\[
\mu\|f'_x\|\,\|\delta x^C\| \le H_x(\delta x^C, \delta x^C) + \frac{\omega}{2}R_x(\delta x^C). \tag{22}
\]
Let $\delta x$ be a directional minimizer that satisfies Condition 3.5 with $\iota_{\mathrm{mod}} = 1$. Then there is $c(\mu,\beta) > 0$ such that
\[
\frac{R_x(\delta x)}{\|\delta x\|} \ge c\,\frac{R_x(\delta x^C)}{\|\delta x^C\|}. \tag{23}
\]

Proof. From (12) and (18) we conclude (22):
\[
\mu\|f'_x\|\,\|\delta x^C\| \overset{(18)}{\le} |f'_x\,\delta x^C| \overset{(12)}{=} H_x(\delta x^C,\delta x^C) + \frac{\omega}{2}R_x(\delta x^C).
\]
To show (23), assume first that $f'_x\,\delta x + \overline\beta\,|f'_x\,\delta x^C| \le 0$. Then
\[
\|f'_x\|\,\|\delta x\| \ge |f'_x\,\delta x| = -f'_x\,\delta x \ge \overline\beta\,\mu\,\|f'_x\|\,\|\delta x^C\|,
\]
and thus $\|\delta x\| \ge \overline\beta\mu\,\|\delta x^C\|$. Inserting (20) we obtain
\[
\Bigl(\frac{R_x(\delta x)}{\|\delta x\|}\Bigr)^{1/2} \ge \beta\mu\Bigl(\frac{R_x(\delta x^C)}{\|\delta x^C\|}\Bigr)^{1/2},
\]
which implies (23) in this case.

Otherwise, we use (13) for $\delta x$, (21), and (13) for $\delta x^C$ to compute
\[
\frac{\omega}{6}R_x(\delta x) = f'_x\,\delta x - 2m^\omega_x(\delta x) \ge f'_x\,\delta x - 2\overline\beta\, m^\omega_x(\delta x^C)
= \underbrace{f'_x\,\delta x + \overline\beta\,|f'_x\,\delta x^C|}_{\ge 0} + \overline\beta\,\frac{\omega}{6}R_x(\delta x^C) \ge \overline\beta\,\frac{\omega}{6}R_x(\delta x^C),
\]
and thus $R_x(\delta x) \ge \overline\beta\,R_x(\delta x^C)$. Inserting once again (20) we get
\[
\Bigl(\frac{R_x(\delta x)}{\|\delta x\|}\Bigr)^{3/2} \ge \beta\Bigl(\frac{R_x(\delta x^C)}{\|\delta x^C\|}\Bigr)^{3/2},
\]
and thus also (23).

3.3 Acceptance of trial steps

After a directional minimizer of our model has been computed and serves as a trial step, we have to decide whether this trial step is acceptable as an optimization step. For this purpose we impose the following relative acceptance criterion, which is well known and popular in trust-region methods. To this end, let us define the ratio of the decrease in $f$ and in the model $m^\omega_x$:
\[
\eta := \frac{f(x+\delta x) - f(x)}{m^\omega_x(\delta x)} = \frac{f(x+\delta x) - f(x)}{m^\omega_x(\delta x) - m^\omega_x(0)}. \tag{24}
\]
Recall that we have chosen $m^\omega_x$ in a way that $m^\omega_x(0) = 0$. Since $m^\omega_x(\delta x) < 0$, we see that $\eta > 0$ yields a decrease of $f$, and $\eta = 0$ means that $f$ has remained constant. This yields the following classical condition:

Condition 3.7. Choose $\overline\eta \in\ ]0,1[$. A trial step $\delta x$ is accepted if it satisfies the condition
\[
\eta \ge \overline\eta. \tag{25}
\]
Otherwise it is rejected.


3.4 Adaptive choice of ω

In this section we discuss the adaptive choice of the sequence of regularization parameters $\omega_i$ in $m^{\omega_i}_x$, needed in the subroutine "CompAccStep" at a given point $x$. We assume that $\omega_i$ and $\delta x_i$ have already been computed, and the update $\omega_{i+1}$ has to be made.

In [2] the choice of regularization parameters is made according to a classification of the steps into "unsuccessful", "successful" and "very successful". Here we follow [20] and base our considerations on the idea that $(\omega_i/6)R_x$ should be a third order model for the difference $f - q_x$, which leads to the assignment:
\[
\omega_{i,\mathrm{raw}} := \frac{6\bigl(f(x+\delta x_i) - q_x(\delta x_i)\bigr)}{R_x(\delta x_i)} \overset{(2)}{=} \frac{6\,w_x(\delta x_i)}{R_x(\delta x_i)}. \tag{26}
\]
Of course, $\omega_{i,\mathrm{raw}}$ cannot be used directly as $\omega_{i+1}$ in our algorithm. We have to introduce some safeguard restrictions.

In order to guarantee positivity of $\omega_{i+1}$ and to avoid oscillatory behavior, we assume that the algorithm provides restrictions on the updates $\omega_i \to \omega_{i+1}$ which guarantee:
\[
\omega_{i+1} > 0, \qquad
\omega_{i+1} \le \frac{1}{\rho}\Bigl(\omega_i + C_\omega + \frac{2C_{f'}|f'_x\,\delta x_i| + |H_x(\delta x_i,\delta x_i)|}{R_x(\delta x_i)}\Bigr)
\quad\text{for constants } 0 < \rho < 1,\ C_{f'} \ge 0,\ C_\omega \ge 0. \tag{27}
\]
Positivity of $\omega$ is, of course, a basic requirement which guarantees that the term $R_x$ is present throughout the computation. The second condition in (27) prevents $\omega$ from being increased so quickly that the next trial step has to be chosen much shorter than the previous one. However, in a certain range (corresponding to $C_\omega$) the increase can be performed freely. If $R_x$ is much smaller than the remaining terms of $m^{\omega_i}_x$, a fast increase of $\omega$ is also possible. Technically, this restriction enters into the global convergence proof in (54), below.

The following theory will cover algorithms that respect these restrictions and increase $\omega$ after a rejected trial step according to
\[
\text{After a rejected trial step:}\quad \omega_{i+1} \ge \min\{\omega_{i,\mathrm{raw}},\ \rho^{-1}\omega_i\} \quad (\rho \text{ as defined in (27)}). \tag{28}
\]
Any algorithm that does not allow an increase like this is likely to get stuck in an inner loop. Technically this condition is used at the beginning of the proof of Theorem 3.8. Next, we impose the restriction on our algorithm that after an increase of $\omega$, $\omega_{i+1}$ should not be chosen larger than $\omega_{i,\mathrm{raw}}$:
\[
\omega_{i+1} \le \max\{\omega_i,\ \omega_{i,\mathrm{raw}}\}.
\]
By (26) this implies the following estimate:
\[
\omega_{i+1} \ge \omega_i \;\Rightarrow\; \omega_{i+1}R_x(\delta x_i) \le 6\,w_x(\delta x_i). \tag{29}
\]
To obtain fast local convergence under weak assumptions on $R_x$ we do not increase $\omega$ if $\eta_i$ (defined via (24) for $\delta x_i$) is very close to 1. Let us choose $\hat\eta \in [\overline\eta, 1[$ and state:
\[
\text{If } \eta_i > \hat\eta, \text{ then } \omega_{i+1} \le \omega_i. \tag{30}
\]

If ηi > η then ωi+1≤ωi. (30) The following simple example update that satisfies all these requirements is the following rule (where we could replace the simple upper boundρ−1ωiby the right hand side of (27)):

(14)

- If ηi≤η: ωi+1= max{ρωi,min{ρ−1ωi, ωi,raw}}for some 0< ρ <1 - If ηi> η: ωi+1i.

In our framework we deliberately dispense with a-priori restrictions likeω≤ωifor some very small lower bound 0< ω1. Such restrictions can, and should of course, be added in finite precision arithmetic.
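For illustration, the example rule can be written as a small update function (the values of $\rho$ and of the threshold $\hat\eta$ below are assumed; the paper leaves such constants to the implementation). Since the rule never returns more than $\rho^{-1}\omega_i$, the safeguard (27) holds automatically, and (28)-(30) are satisfied as discussed above.

```python
def update_omega(omega, omega_raw, eta, rho=0.5, eta_good=0.9):
    """Example update omega_i -> omega_{i+1} from Section 3.4; omega_raw is (26)."""
    if eta > eta_good:               # cf. (30): do not increase omega when eta is close to 1
        return omega
    return max(rho * omega, min(omega / rho, omega_raw))   # clip the raw value (26)
```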

3.5 Finite termination of inner loops

Next, we show that each inner loop, i.e., the subroutine "CompAccStep" of our algorithm, accepts a finite $\omega$ after finitely many updates and thus terminates finitely. Hence, in the following we consider a fixed $x$ and a sequence $\omega_i$ of parameters and $\delta x_i$ of trial steps, computed by the subroutine "CompAccStep".

Theorem 3.8. Assume that $f$ is Fréchet differentiable at $x$ and $f'_x \ne 0$. Moreover, assume that the left inequalities of (7) and (8) hold at $x$. Then:

(i) If a trial step $\delta x_i$ is rejected, then $\omega_{i+1} \ge \min\{\rho^{-1},\ (1-\overline\eta)/2 + 1\}\,\omega_i > \omega_i$.

(ii) The inner loop terminates successfully after finitely many iterations.

Proof. In view of (28), assume that $\omega_{i+1} < \rho^{-1}\omega_i$. Then violation of (25) implies
\[
\omega_{i+1} \ge \omega_{i,\mathrm{raw}} = \frac{6}{R_x(\delta x_i)}\bigl(f(x+\delta x_i) - q_x(\delta x_i)\bigr)
= \frac{6}{R_x(\delta x_i)}\Bigl(f(x+\delta x_i) - f(x) - m^{\omega_i}_x(\delta x_i) + \frac{\omega_i}{6}R_x(\delta x_i)\Bigr)
\]
\[
> \frac{6}{R_x(\delta x_i)}(\overline\eta - 1)\, m^{\omega_i}_x(\delta x_i) + \omega_i
= \frac{6}{R_x(\delta x_i)}(1-\overline\eta)\Bigl(-\tfrac12 f'_x\,\delta x_i + \frac{\omega_i}{12}R_x(\delta x_i)\Bigr) + \omega_i
\ge \bigl((1-\overline\eta)/2 + 1\bigr)\omega_i.
\]
Hence, (i) is shown: each rejection is followed by an increase of $\omega$ by at least a fixed factor.

Next, assume for contradiction that (25) fails infinitely often during successive updates from $\omega_i$ to $\omega_{i+1}$. Then we have $\lim_{i\to\infty}\omega_i = \infty$, so that Lemma 3.3 yields for the quasi Cauchy steps (which are all scalings of a fixed direction: $\delta x^C_i = \lambda(\omega_i)\Delta x^C$)
\[
\lim_{i\to\infty}\|\delta x^C_i\| = \lim_{i\to\infty}\lambda(\omega_i) = 0
\]
and thus, by (22),
\[
\liminf_{i\to\infty} \frac{\omega_i R_x(\delta x^C_i)}{2\|\delta x^C_i\|}
\ge \mu\|f'_x\| - \lim_{i\to\infty}\frac{H_x(\delta x^C_i,\delta x^C_i)}{\|\delta x^C_i\|}
= \mu\|f'_x\| - \lim_{i\to\infty}\lambda(\omega_i)\frac{H_x(\Delta x^C,\Delta x^C)}{\|\Delta x^C\|}
= \mu\|f'_x\| > 0,
\]
since $\|f'_x\| \ne 0$. This implies that
\[
\lim_{i\to\infty}\frac{\omega_i R_x(\delta x^C_i)}{\|\delta x^C_i\|^2} = \infty
\]
and thus $\iota_{\mathrm{mod}} = 1$ by (19) for sufficiently large $i$. It thus follows from (23) that there exists a constant $W_0 > 0$ such that
\[
\frac{\omega_i R_x(\delta x_i)}{\|\delta x_i\|} \ge W_0 > 0 \quad \forall i \in \mathbb{N}. \tag{31}
\]
Since $\omega_{i+1} > \omega_i$, (29) holds for $\omega_{i+1}$. Thus, we conclude that
\[
0 < W_0 \overset{(31)}{\le} \frac{\omega_i R_x(\delta x_i)}{\|\delta x_i\|} < \frac{\omega_{i+1} R_x(\delta x_i)}{\|\delta x_i\|} \overset{(29)}{\le} \frac{6\, w_x(\delta x_i)}{\|\delta x_i\|}, \tag{32}
\]
and hence by Fréchet differentiability and Lemma 2.1 (with $x_k = x$ the constant sequence) there is $D_0 > 0$ such that $\|\delta x_i\| \ge D_0$, and hence, by (32), also
\[
\omega_i R_x(\delta x_i) \ge W_0 D_0 > 0. \tag{33}
\]
With this, we compute from (12) (using $f'_x\,\delta x_i = -|f'_x\,\delta x_i|$)
\[
\frac{|f'_x\,\delta x_i|}{\omega_i R_x(\delta x_i)} = \frac{H_x(\delta x_i,\delta x_i)}{\omega_i R_x(\delta x_i)} + \frac12.
\]
By (7), a possibly negative contribution of the term involving the Hessian vanishes asymptotically:
\[
\lim_{i\to\infty}\frac{|\min\{H_x(\delta x_i,\delta x_i),0\}|}{\omega_i R_x(\delta x_i)}
\overset{(7)}{\le} \lim_{i\to\infty}\frac{|\gamma|\,|\delta x_i|^2}{\omega_i R_x(\delta x_i)}
\overset{(8)}{\le} \lim_{i\to\infty}\frac{|\gamma|}{\omega_i^{2/3}\bigl(\omega_i R_x(\delta x_i)\bigr)^{1/3}}
\overset{(33)}{=} 0.
\]
Thus we conclude that
\[
\liminf_{i\to\infty}\frac{|f'_x\,\delta x_i|}{\omega_i R_x(\delta x_i)} \ge \frac12 > 0 \tag{34}
\]
and thus via (31) that also
\[
\liminf_{i\to\infty}\frac{|f'_x\,\delta x_i|}{\|\delta x_i\|} \ge \frac{W_0}{2} > 0. \tag{35}
\]
However, as a consequence of (33) and (8) we have:
\[
\omega_i R_x(\delta x_i) \overset{(8)}{\ge} \bigl(\omega_i R_x(\delta x_i)\bigr)^{2/3}\omega_i^{1/3}|\delta x_i| \overset{(33)}{\ge} (W_0 D_0)^{2/3}\,\omega_i^{1/3}\,|\delta x_i|. \tag{36}
\]
Thus, by (34) we conclude
\[
\lim_{i\to\infty}\frac{\|\delta x_i\|}{|\delta x_i|}
\ge \lim_{i\to\infty}\frac{|f'_x\,\delta x_i|}{|\delta x_i|\,\|f'_x\|}
\overset{(34)}{\ge} \lim_{i\to\infty}\frac12\,\frac{\omega_i R_x(\delta x_i)}{|\delta x_i|\,\|f'_x\|}
\overset{(36)}{\ge} \frac12\,\frac{(W_0 D_0)^{2/3}}{\|f'_x\|}\lim_{i\to\infty}\omega_i^{1/3} = \infty.
\]
Thus, $\lim_{i\to\infty}|\delta x_i|/\|\delta x_i\| = 0$, and by Lemma 2.2 we conclude weak convergence $\delta x_i/\|\delta x_i\| \rightharpoonup 0$ in $(X,\|\cdot\|)$. This, however, implies
\[
\lim_{i\to\infty}\frac{|f'_x\,\delta x_i|}{\|\delta x_i\|} = 0,
\]
in contradiction to (35).


4 Convergence Theory

In this section we will establish first order global convergence and second order local convergence results. In the following we will consider the sequence $x_k$ generated by Algorithm 3.1, corresponding derivatives $f'_{x_k} \in X^*$, and accepted corrections $\delta x_k$. We denote by $\omega_k$ that parameter for which $\delta x_k$ has been computed as an accepted directional minimizer of $m^{\omega_k}_{x_k}$. At the end of the subroutine "CompAccStep", after the inner loop, this parameter appears as $\omega_i$, to be distinguished from the update $\omega_{i+1}$ that corresponds to $\omega_{k+1,0}$ in the outer loop and is computed after acceptance of $\delta x_k$. Similarly, $\delta x^C_k$ denotes a quasi Cauchy step for $m^{\omega_k}_{x_k}$ and $\eta_k$ is defined by (24) for $\delta x_k$.

Throughout this section we use that the computed quantities satisfy Conditions 3.4, 3.5, and 3.7 and the update conditions from Section 3.4. Moreover, we assume throughout the basic properties introduced at the beginning of Section 2. From the assumptions of Section 2.1 we only use the existence of $f'_x \in X^*$ throughout, as well as all assumptions needed to show finite termination of the inner loops, i.e., existence of the sequence $x_k$. All other assumptions will be referenced explicitly when needed.

In the whole section we exclude the trivial case that $f'_{x_k} = 0$ for some $k$, which leads to finite termination of our algorithm. Moreover, we may assume that the sequence of computed function values $f(x_k)$ is bounded from below. Otherwise our algorithm, which enforces descent, fulfills its purpose of minimization by generating a sequence $x_k$ with $\lim_{k\to\infty} f(x_k) = -\infty$.

4.1 Global Convergence

Under mild assumptions we will show that our algorithm cannot converge to non-stationary points, while slightly stronger assumptions yield convergence of the derivatives to 0. Our technique will be to derive a contradiction to the case that $x_k$ converges to a non-stationary point, so that in particular $\|f'_{x_k}\|$ remains bounded away from zero.

Lemma 4.1. Assume that Algorithm 3.1 generates an infinite sequence $x_k$ such that $f(x_k)$ is bounded from below. Then
\[
\sum_{k=0}^{\infty}\|f'_{x_k}\|\,\|\delta x^C_k\| < \infty, \tag{37}
\]
\[
\sum_{k=0}^{\infty}\omega_k R_{x_k}(\delta x^C_k) < \infty, \qquad \sum_{k=0}^{\infty}\omega_k R_{x_k}(\delta x_k) < \infty. \tag{38}
\]

Proof. We use (25), (13), (21) (using only $\overline\beta \ge \beta$), again (13), and (18) to compute
\[
f(x_{k+1}) - f(x_k) \overset{(25)}{\le} \overline\eta\, m^{\omega_k}_{x_k}(\delta x_k)
\overset{(13)}{=} \overline\eta\Bigl(\tfrac12 f'_{x_k}\delta x_k - \frac{\omega_k}{12}R_{x_k}(\delta x_k)\Bigr)
\overset{(21)}{\le} \overline\eta\,\overline\beta\, m^{\omega_k}_{x_k}(\delta x^C_k) \le \overline\eta\,\beta\, m^{\omega_k}_{x_k}(\delta x^C_k)
\]
\[
\overset{(13)}{=} \overline\eta\,\beta\Bigl(\tfrac12 f'_{x_k}\delta x^C_k - \frac{\omega_k}{12}R_{x_k}(\delta x^C_k)\Bigr)
\le \frac{\overline\eta\,\beta}{2}\, f'_{x_k}\delta x^C_k
\overset{(18)}{\le} -\frac{\mu\,\overline\eta\,\beta}{2}\,\|f'_{x_k}\|\,\|\delta x^C_k\| \le 0.
\]
By monotonicity and boundedness
\[
\sum_{k=0}^{\infty}\bigl(f(x_{k+1}) - f(x_k)\bigr) = \inf_k f(x_k) - f(x_0) > -\infty,
\]
and by our chain of inequalities we conclude (37) and (38).

The main observation here is that $\|f'_{x_k}\|\,\|\delta x^C_k\| \to 0$, and to obtain $\|f'_{x_k}\| \to 0$ it remains to prevent $\|\delta x^C_k\|$ from becoming too small compared to $\|f'_{x_k}\|$. The extraordinary role of $\delta x^C_k$ has its origin in the acceptance criterion (21), which compares all steps to the quasi Cauchy steps.

To obtain a quick understanding of the situation, take a look at (22) and observe the following relation:
\[
\mu\|f'_{x_k}\| \overset{(22)}{\le} \frac{H_{x_k}(\delta x^C_k,\delta x^C_k)}{\|\delta x^C_k\|} + \frac{\omega_k}{2}\,\frac{R_{x_k}(\delta x^C_k)}{\|\delta x^C_k\|}.
\]
The undesired case is that $\|f'_{x_k}\|$ is bounded away from zero, which in turn implies that $\lim_{k\to\infty}\|\delta x^C_k\| = 0$ by (37). Taking into account the upper bounds (7) for $H_x$ and (8) for $R_x$, we see that
\[
\limsup_{k\to\infty}\frac{H_{x_k}(\delta x^C_k,\delta x^C_k)}{\|\delta x^C_k\|} \le 0, \tag{39}
\]
\[
\lim_{k\to\infty}\frac{R_{x_k}(\delta x^C_k)}{\|\delta x^C_k\|} = 0. \tag{40}
\]

In fact for all global convergence results we may replace the upper bounds of (7) and (8) by these weaker assumptions.

In view of (39) and (40), which exclude that our iteration is stalled by $H_x$ and $R_x$ being overly large, the "bad case" can only happen if $\omega_k$ is increased too rapidly. Under smoothness assumptions on $f$ that imply boundedness of $\omega_k$ (a global Lipschitz condition on $H_x = f''(x)$) we would be finished at this point. To cover the more general case, we have to invest some more theoretical work. Let us start with collecting some simple consequences of $\|f'_{x_k}\|$ being bounded away from 0:

Lemma 4.2. Assume that the sequence $f(x_k)$ is bounded from below. For fixed $\nu > 0$ assume that the following set of indices is infinite:
\[
\mathcal{L}_\nu := \{k : \|f'_{x_k}\| \ge \nu\}.
\]
Assume that (39) and (40) hold for $\delta x^C_k$ along the sequence of iterates $x_k$ for $k \in \mathcal{L}_\nu$. Then
\[
\lim_{k\in\mathcal{L}_\nu,\,k\to\infty}\frac{\omega_k R_{x_k}(\delta x^C_k)}{\|\delta x^C_k\|^2} = \infty, \tag{41}
\]
\[
\lim_{k\in\mathcal{L}_\nu,\,k\to\infty}\omega_k = \infty. \tag{42}
\]
Let $\delta v_k$ be any directional minimizer of $m^{\omega_k}_{x_k}$ that satisfies Condition 3.5 with $\iota_{\mathrm{mod}} = 1$. Then
\[
\inf_{k\in\mathcal{L}_\nu}\frac{\omega_k R_{x_k}(\delta v_k)}{\|f'_{x_k}\|\,\|\delta v_k\|} > 0. \tag{43}
\]
For iteration $k \in \mathcal{L}_\nu$ let $\delta x_k$ be the accepted step. Eventually (for sufficiently large $k$), $\delta x_k$ satisfies Condition 3.5 with $\iota_{\mathrm{mod}} = 1$ and it holds:
\[
\sum_{k\in\mathcal{L}_\nu}\|\delta x_k\| < \infty. \tag{44}
\]
