
Research Collection
Journal Article

Non-convergence of stochastic gradient descent in the training of deep neural networks

Author(s): Cheridito, Patrick; Jentzen, Arnulf; Rossmannek, Florian
Publication Date: 2021-06
Permanent Link: https://doi.org/10.3929/ethz-b-000454958
Originally published in: Journal of Complexity 64, http://doi.org/10.1016/j.jco.2020.101540
Rights / License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International


Journal of Complexity 64 (2021) 101540


Non-convergence of stochastic gradient descent in the training of deep neural networks

Patrick Cheridito^a, Arnulf Jentzen^b, Florian Rossmannek^{a,*}

^a Department of Mathematics, ETH Zurich, Switzerland
^b Faculty of Mathematics and Computer Science, University of Münster, Germany

Article info

Article history:
Received 14 June 2020
Received in revised form 10 November 2020
Accepted 19 November 2020
Available online 27 November 2020

Keywords: Machine learning; Deep neural networks; Stochastic gradient descent; Empirical risk minimization; Non-convergence

Abstract

Deep neural networks have successfully been trained in various application areas with stochastic gradient descent. However, there exists no rigorous mathematical explanation why this works so well. The training of neural networks with stochastic gradient descent has four different discretization parameters: (i) the network architecture; (ii) the amount of training data; (iii) the number of gradient steps; and (iv) the number of randomly initialized gradient trajectories. While it can be shown that the approximation error converges to zero if all four parameters are sent to infinity in the right order, we demonstrate in this paper that stochastic gradient descent fails to converge for ReLU networks if their depth is much larger than their width and the number of random initializations does not increase to infinity fast enough.

©2020 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Deep learning has produced remarkable results in different practical applications such as image classification, speech recognition, machine translation, and game intelligence. In this paper, we analyze it in the context of a supervised learning task, though it has also successfully been applied in unsupervised learning and reinforcement learning.

Communicated by E. Novak.

Corresponding author.

E-mail addresses: patrick.cheridito@math.ethz.ch (P. Cheridito), ajentzen@uni-muenster.de (A. Jentzen), florian.rossmannek@math.ethz.ch (F. Rossmannek).

https://doi.org/10.1016/j.jco.2020.101540


Deep learning is usually implemented with a stochastic gradient descent (SGD) method based on training data. Gradient descent methods have long been known to work, even with good rates, for convex problems; see, e.g., [3]. However, the training of a deep neural network (DNN) is a non-convex problem, and questions about guarantees and convergence rates of SGD in this context are currently among the most important research topics in the mathematical theory of machine learning.

To obtain optimal approximation results with a DNN, several hyper-parameters have to be fine-tuned. First, the architecture of the network determines what type of functions can be approximated. To be able to efficiently approximate complicated functions, it needs to be sufficiently wide and deep. Secondly, the goal is to approximate the target function with respect to the true risk, but the algorithm only has access to the empirical risk. The gap between the two goes to zero as the amount of training data increases to infinity. Thirdly, the gradient method attempts to minimize the empirical risk, and the chance of finding a good approximate minimum increases with the number of gradient steps. Finally, since a single gradient trajectory may not yield good results, it is common to run several of them with different random initializations. It has been shown in [2,10] that general networks converge if their size, the amount of training data, and the number of random initializations are increased to infinity in the correct way, albeit with an extremely slow speed of convergence. In general, one cannot hope to overcome this slow speed of convergence; see [16]. On the other hand, it has been shown that, for the training error, faster convergence can be guaranteed with certain probabilities if over-parametrized networks are used; see [4,6,18,22,23] and the references therein. A different approach to the convergence problem relies on landscape analysis of the loss surface. For example, it is known that there are no local minima if the networks are linear; see [1,11].

This is no longer true for non-linear networks¹; see [15]. But in this case, there are results about the frequency of local minima; see, e.g., [5,7,14,15,20,21]. The initialization method is important for any type of network. But for ReLU networks it plays a special role due to the particular form of the ReLU activation function; see [8,9,13,17].

The main contribution of this paper is a demonstration that SGD fails to converge for ReLU networks if the number of random initializations does not increase fast enough compared to the size of the network. To illustrate our findings, we present a special case of our main result, Theorem 5.3, in Theorem 1.1.

We denote by $d \in \mathbb{N} = \{1,2,\dots\}$ the dimension of the input domain of the approximation problem. The set $\mathbf{A}_d = \bigcup_{D\in\mathbb{N}}(\{d\}\times\mathbb{N}^{D-1}\times\{1\})$ represents all network architectures with input dimension $d$ and output dimension 1. In particular, a vector $a = (a_0,\dots,a_D)\in\mathbf{A}_d$ describes the depth $D$ of a network and the number of neurons $a_0,\dots,a_D$ in the different layers. For any such architecture $a$, the quantity $P(a)=\sum_{j=1}^{D} a_j(a_{j-1}+1)$ counts the number of real parameters, that is, the number of weights and biases of a DNN with architecture $a$. We consider networks with ReLU activation in the hidden layers and a linear read-out map. That is, the realization function $\mathcal{R}^\theta_a:\mathbb{R}^d\to\mathbb{R}$ of a fully connected feedforward DNN with architecture $a=(a_0,\dots,a_D)\in\mathbf{A}_d$ and weights and biases $\theta\in\mathbb{R}^{P(a)}$ is given by
$$\mathcal{R}^\theta_a = \mathcal{A}^{\theta,\sum_{i=1}^{D-1}a_i(a_{i-1}+1)}_{a_D,a_{D-1}} \circ \rho \circ \mathcal{A}^{\theta,\sum_{i=1}^{D-2}a_i(a_{i-1}+1)}_{a_{D-1},a_{D-2}} \circ \rho \circ \dots \circ \mathcal{A}^{\theta,a_1(a_0+1)}_{a_2,a_1} \circ \rho \circ \mathcal{A}^{\theta,0}_{a_1,a_0}, \tag{1.1}$$
where $\mathcal{A}^{\theta,k}_{m,n}:\mathbb{R}^n\to\mathbb{R}^m$ denotes the affine mapping
$$(x_1,\dots,x_n) \mapsto \begin{pmatrix} \theta_{k+1} & \theta_{k+2} & \cdots & \theta_{k+n} \\ \theta_{k+n+1} & \theta_{k+n+2} & \cdots & \theta_{k+2n} \\ \vdots & \vdots & \ddots & \vdots \\ \theta_{k+(m-1)n+1} & \theta_{k+(m-1)n+2} & \cdots & \theta_{k+mn} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \theta_{k+mn+1} \\ \theta_{k+mn+2} \\ \vdots \\ \theta_{k+mn+m} \end{pmatrix} \tag{1.2}$$
and $\rho:\bigcup_{k\in\mathbb{N}}\mathbb{R}^k\to\bigcup_{k\in\mathbb{N}}\mathbb{R}^k$ is the ReLU function $(x_1,\dots,x_k)\mapsto(\max\{x_1,0\},\dots,\max\{x_k,0\})$.
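To make the parameter layout behind (1.1) and (1.2) concrete, the following sketch (an illustration only, not part of the paper; NumPy-based, with hypothetical helper names) evaluates $\mathcal{R}^\theta_a$ from a flat parameter vector ordered as in (1.2): for each layer, the weight matrix row by row, followed by the bias vector.

```python
import numpy as np

def num_params(a):
    # P(a) = sum_j a_j * (a_{j-1} + 1), i.e. all weights and biases
    return sum(a[j] * (a[j - 1] + 1) for j in range(1, len(a)))

def realization(theta, a, x):
    """Evaluate R_a^theta(x) for a fully connected ReLU network with
    architecture a = (a_0, ..., a_D) and flat parameter vector theta."""
    y, k = np.asarray(x, dtype=float), 0
    D = len(a) - 1
    for j in range(1, D + 1):
        m, n = a[j], a[j - 1]
        W = theta[k:k + m * n].reshape(m, n)     # weight matrix of layer j, row-major as in (1.2)
        b = theta[k + m * n:k + m * (n + 1)]     # bias vector of layer j
        y = W @ y + b
        if j < D:                                # ReLU on hidden layers, linear read-out
            y = np.maximum(y, 0.0)
        k += m * (n + 1)
    return y

a = (3, 5, 5, 1)                                 # an example architecture in A_3
theta = np.random.uniform(-2.0, 2.0, size=num_params(a))
print(realization(theta, a, np.array([0.2, 0.5, 0.9])))
```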

¹ Unless the loss is measured with respect to a finite data set on which the network is heavily overfitted by, e.g., greatly over-parametrizing the last hidden layer; see [12,19].


In the following description of the SGD algorithm, $n\in\mathbb{N}$ is the index of the trajectory, $t\in\mathbb{N}_0$ represents the index of the step along the trajectory, $m\in\mathbb{N}$ denotes the batch size of the empirical risk, and $a\in\mathbf{A}_d$ describes the architecture under consideration. We assume the training data is given by functions $X^{n,t}_j:\Omega\to[0,1]^d$ and $Y^{n,t}_j:\Omega\to[0,1]$, $j,n,t\in\mathbb{N}_0$, on a given probability space $(\Omega,\mathcal{F},\mathbb{P})$. In a typical learning problem, $(X^{n,t}_j,Y^{n,t}_j)$, $j,n,t\in\mathbb{N}_0$, are i.i.d. random variables. But for Theorem 1.1 to hold, it is enough if $(X^{0,0}_j,Y^{0,0}_j)$, $j\in\mathbb{N}_0$, are i.i.d. random variables, whereas $(X^{n,t}_j,Y^{n,t}_j):\Omega\to[0,1]^{d+1}$ are arbitrary mappings for $(n,t)\neq(0,0)$. The target function $\mathcal{E}:[0,1]^d\to[0,1]$ we are trying to learn is the factorized conditional expectation given ($\mathbb{P}$-a.s.) by $\mathcal{E}(X^{0,0}_0)=\mathbb{E}[Y^{0,0}_0\mid X^{0,0}_0]$. The empirical risk used for training is
$$\mathcal{L}^{n,t}_{a,m}(\theta) = \frac{1}{m}\sum_{j=1}^{m}\bigl|(c\circ\mathcal{R}^\theta_a)(X^{n,t}_j)-Y^{n,t}_j\bigr|^2, \tag{1.3}$$
where we compose the network realization with the clipping function $c(x)=\max\{0,\min\{x,1\}\}$. This composition inside the risk is equivalent to a non-linear read-out map of the network. However, it is more convenient for us to view $c$ as part of the risk criterion instead of the network. But this is only a matter of notation. Observe that (1.3) is a supervised learning task with noise since, in general, the best possible least squares approximation of $Y^{0,0}_0$ with a deterministic function of $X^{0,0}_0$ is $\mathcal{E}(X^{0,0}_0)$, which is only equal to $Y^{0,0}_0$ in the special case where $Y^{0,0}_0$ is $X^{0,0}_0$-measurable. We let $G^{n,t}_{a,m}:\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}^{P(a)}$ be a function that is equal to the gradient of $\mathcal{L}^{n,t}_{a,m}$ where it exists. The trajectories of the SGD algorithm are given by random variables $\Theta^{n,t}_{a,m}:\Omega\to\mathbb{R}^{P(a)}$ satisfying the defining relation
$$\Theta^{n,t}_{a,m} = \Theta^{n,t-1}_{a,m} - \gamma_t\, G^{n,t}_{a,m}\bigl(\Theta^{n,t-1}_{a,m}\bigr) \tag{1.4}$$
for given step sizes $(\gamma_t)_{t\in\mathbb{N}}\subseteq\mathbb{R}$.
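The recursion (1.4) with the clipped empirical risk (1.3) can be sketched as follows. This is a minimal illustration, not the paper's setup: it reuses `num_params` and `realization` from the previous sketch, uses a synthetic target, and replaces the exact gradient $G^{n,t}_{a,m}$ by a forward-difference surrogate purely for simplicity.

```python
import numpy as np

# assumes num_params and realization from the previous sketch are in scope

def clip01(x):
    # clipping function c(x) = max{0, min{x, 1}} from (1.3)
    return np.clip(x, 0.0, 1.0)

def empirical_risk(theta, a, X, Y):
    # (1/m) * sum_j |c(R_a^theta(X_j)) - Y_j|^2, as in (1.3)
    preds = np.array([clip01(realization(theta, a, x))[0] for x in X])
    return float(np.mean((preds - Y) ** 2))

def sgd_trajectory(a, X, Y, steps, gamma, rng):
    # one trajectory of (1.4), with a forward-difference surrogate for the gradient
    theta = rng.uniform(-2.0, 2.0, size=num_params(a))
    eps = 1e-6
    for _ in range(steps):
        base = empirical_risk(theta, a, X, Y)
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            shifted = theta.copy()
            shifted[i] += eps
            grad[i] = (empirical_risk(shifted, a, X, Y) - base) / eps
        theta = theta - gamma * grad
    return theta, empirical_risk(theta, a, X, Y)

rng = np.random.default_rng(0)
a = (1, 4, 4, 1)
X = rng.uniform(0.0, 1.0, size=(32, 1))       # synthetic inputs in [0, 1]
Y = X[:, 0] ** 2                              # a toy target, E(x) = x^2
theta, risk = sgd_trajectory(a, X, Y, steps=50, gamma=0.1, rng=rng)
print("empirical risk after 50 steps:", risk)
```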

Now, we are ready to state the following result, which is a consequence of Theorem 6.5 in [10] and Corollary 5.4.

Theorem 1.1. Assume that the target function $\mathcal{E}$ is Lipschitz continuous and that $\mathcal{E}(X^{0,0}_0)$ is not $\mathbb{P}$-a.s. constant. Suppose that, for all $a\in\mathbf{A}_d$ and $m\in\mathbb{N}$, the random initializations $\Theta^{n,0}_{a,m}$, $n\in\mathbb{N}$, are independent and uniformly distributed on $[-c,c]^{P(a)}$, where $c\in[2,\infty)$ is larger than the Lipschitz constant of $\mathcal{E}$. Let $k_{a,M,N,T}:\Omega\to\mathbb{N}\times\mathbb{N}_0$ be random variables satisfying
$$k_{a,M,N,T}(\omega) \in \operatorname*{arg\,min}_{\substack{(n,t)\in\{1,\dots,N\}\times\{0,\dots,T\},\\ \Theta^{n,t}_{a,M}(\omega)\in[-c,c]^{P(a)}}} \mathcal{L}^{0,0}_{a,M}\bigl(\Theta^{n,t}_{a,M}(\omega),\omega\bigr). \tag{1.5}$$
Then, one has

$$\limsup_{\substack{a=(a_0,\dots,a_D)\in\mathbf{A}_d\\ \min\{D,a_1,\dots,a_{D-1}\}\to\infty}}\;\limsup_{\substack{M,N\in\mathbb{N}\\ \min\{M,N\}\to\infty}}\;\sup_{T\in\mathbb{N}_0}\;\mathbb{E}\Bigl[\min\Bigl\{\int_{[0,1]^d}\Bigl|\bigl(c\circ\mathcal{R}^{\Theta^{k_{a,M,N,T}}_{a,M}}_a\bigr)(x)-\mathcal{E}(x)\Bigr|\,\mathbb{P}_{X^{0,0}_0}(dx),\,1\Bigr\}\Bigr]=0 \tag{1.6}$$
and
$$\inf_{N\in\mathbb{N}}\;\limsup_{\substack{a=(a_0,\dots,a_D)\in\mathbf{A}_d\\ \min\{D,a_1,\dots,a_{D-1}\}\to\infty}}\;\inf_{\substack{M\in\mathbb{N}\\ T\in\mathbb{N}_0}}\;\mathbb{E}\Bigl[\min\Bigl\{\int_{[0,1]^d}\Bigl|\bigl(c\circ\mathcal{R}^{\Theta^{k_{a,M,N,T}}_{a,M}}_a\bigr)(x)-\mathcal{E}(x)\Bigr|\,\mathbb{P}_{X^{0,0}_0}(dx),\,1\Bigr\}\Bigr]>0. \tag{1.7}$$

The integrals in (1.6) and (1.7) describe the true risk. Note that in Theorem 1.1 the random initializations of the different trajectories are assumed to be independent and uniformly distributed on the hypercube $[-c,c]^{P(a)}$, but our main result, Theorem 5.3, also covers more general cases. The random variable $k_{a,M,N,T}$ determines the specific trajectory and gradient step among the first $N$ trajectories and $T$ steps which minimize the empirical risk corresponding to batch size $M$. Note that $\mathcal{E}(X^{0,0}_0)$ not being a.s. constant is a weak assumption since it merely means that the learning task is non-trivial. Moreover, the stronger condition that $\mathcal{E}$ must be Lipschitz continuous is made


only to ensure the validity of the positive result (1.6), whereas our new contribution (1.7) does not need this requirement. Similarly, we use the clipping function $c$ to ensure the validity of (1.6), which in [10] is formulated for networks with clipping function as read-out map.

Our arguments are based on an analysis of regions in the parameter space related to ''inactive'' neurons. In these regions, the realization function is constant not only in its argument but also in the network parameter. For example, if $\theta$ contains only strictly negative parameters, then $\rho\circ\mathcal{A}^{\theta,a_1(a_0+1)}_{a_2,a_1}\circ\rho(x)$ is constantly zero in $x$ and in a neighborhood of $\theta$. As a consequence, SGD will not be able to escape from such a region. The fact that random initialization can render parts of a ReLU network inactive has already been noticed in [13,17]. While the focus of [13,17] is on the design of alternative random initialization schemes to make the training more efficient, we here give precise estimates on the probability that the whole network becomes inactive and deduce that SGD fails to converge if the number of random initializations does not increase fast enough. Note that in (1.7) we take the limit superior over all architectures $(a_0,\dots,a_D)\in\mathbf{A}_d$ whose depth $D$ and minimal width $\min\{a_1,\dots,a_{D-1}\}$ both tend to infinity. In particular, to prove (1.7), it is sufficient to construct a single sequence of such architectures over which the limit is positive. For the sequence we use, the depth grows much faster than the maximal width $\max\{a_1,\dots,a_{D-1}\}$. This imbalance between depth and width has the effect that the training procedure does not converge.
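This effect is easy to check numerically. The sketch below (illustrative only; it reuses `num_params` and `realization` from the sketch in the introduction) forces all weights and biases of the second layer to be strictly negative and verifies that the realization is then constant in the input and unchanged under small perturbations of that layer's parameters.

```python
import numpy as np

# reuses num_params and realization from the sketch in the introduction

rng = np.random.default_rng(1)
a = (2, 3, 3, 1)
theta = rng.uniform(-1.0, 1.0, size=num_params(a))

# make every weight and bias of the second layer strictly negative
start = a[1] * (a[0] + 1)                   # the first layer occupies the first a_1 (a_0 + 1) entries
end = start + a[2] * (a[1] + 1)
theta[start:end] = -np.abs(theta[start:end]) - 0.1

# the realization is now constant in the input x ...
outs = [realization(theta, a, x)[0] for x in rng.uniform(-5.0, 5.0, size=(10, 2))]
print("constant in x:", np.allclose(outs, outs[0]))

# ... and locally constant in the second layer's parameters
perturbed = theta.copy()
perturbed[start:end] += rng.uniform(-0.05, 0.0, size=end - start)   # stays strictly negative
print("constant in theta:", np.isclose(realization(theta, a, np.ones(2))[0],
                                        realization(perturbed, a, np.ones(2))[0]))
```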

The remainder of this article is organized as follows: In Section 2, we provide an abstract version of the SGD algorithm for training neural networks in a supervised learning framework. Section 3 contains preliminary results on inactive neurons and constant network realization functions. In Section 4, we discuss the consequences of these preliminary results for the convergence of the SGD method, and Section 5 contains our main results, Theorem 5.3 and Corollary 5.4.

2. Mathematical description of the SGD method

In this section, we give a mathematical description of an abstract version of the SGD algorithm for training neural networks in a supervised learning framework. To do that, we slightly generalize the setup of the introduction. We begin with an informal description and give a precise formulation afterwards. First, fix a network architecture $a=(a_0,\dots,a_D)\in\mathbf{A}_d$. Let $X:\Omega\to[u,v]^d$ and $B:\Omega\to[\mathbf{u},\mathbf{v}]$ be random variables on a probability space $(\Omega,\mathcal{F},\mathbb{P})$, on which the true risk $\mathcal{L}(\theta)=\mathbb{E}[|(c\circ\mathcal{R}^\theta_a)(X)-B|]$ of a network $\theta\in\mathbb{R}^{P(a)}$ is based. Here, $c:\mathbb{R}\to\mathbb{R}$ can be any continuous function, which covers the case of network realizations with non-linear read-out maps. In the context of the introduction, $B$ stands for the random variable $\mathcal{E}(X^{0,0}_0)$. Throughout, $n\in\mathbb{N}$ will denote the index of the gradient trajectory and $t\in\mathbb{N}_0$ the index of the gradient step. $\mathbf{L}^{n,t}$ denotes the empirical risk defined on the space of functions $C(\mathbb{R}^d,\mathbb{R})$. In this general setting, $\mathbf{L}^{n,t}$ can be any function from $C(\mathbb{R}^d,\mathbb{R})\times\Omega$ to $\mathbb{R}$, but the specific example we have in mind is

$$\mathbf{L}^{n,t}(f) = \frac{1}{m}\sum_{j=1}^{m}\bigl|f(X^{n,t}_j)-Y^{n,t}_j\bigr|^2 \tag{2.1}$$
for a given batch size $m\in\mathbb{N}$. $\mathcal{L}^{n,t}$ is the empirical risk defined on the space of network parameters, given in terms of $\mathbf{L}^{n,t}$ by $\mathcal{L}^{n,t}(\theta)=\mathbf{L}^{n,t}(c\circ\mathcal{R}^\theta_a)$. Let $G^{n,t}:\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}^{P(a)}$ be a function that agrees with the gradient of $\mathcal{L}^{n,t}$ where it exists. Then, we can introduce the gradient trajectories $\Theta^{n,t}:\Omega\to\mathbb{R}^{P(a)}$ satisfying
$$\Theta^{n,t} = \Theta^{n,t-1} - \gamma_t\, G^{n,t}\bigl(\Theta^{n,t-1}\bigr) \tag{2.2}$$
for given step sizes $\gamma_t$. The $N$ random initializations $\Theta^{n,0}$, $n\in\{1,\dots,N\}$, are assumed to be i.i.d. in $n$ and to have independent marginals. Lastly, $k:\Omega\to\mathbb{N}\times\mathbb{N}_0$ specifies the output of the algorithm, consisting of a pair of indices for a gradient trajectory and a gradient step. The expected true risk is $V=\mathbb{E}[\min\{\mathcal{L}(\Theta^k),1\}]$. In the following, we present the formal algorithm.

Setting 2.1. Let $u,\mathbf{u}\in\mathbb{R}$, $v\in(u,\infty)$, $\mathbf{v}\in(\mathbf{u},\infty)$, $c\in C(\mathbb{R},\mathbb{R})$, $d,D,N\in\mathbb{N}$, $a=(a_0,\dots,a_D)\in\mathbf{A}_d$, and $(\gamma_t)_{t\in\mathbb{N}}\subseteq\mathbb{R}$. Consider random variables $X:\Omega\to[u,v]^d$ and $B:\Omega\to[\mathbf{u},\mathbf{v}]$ on a probability space $(\Omega,\mathcal{F},\mathbb{P})$. Let $\mathcal{L}:\mathbb{R}^{P(a)}\to[0,\infty]$ be given by $\mathcal{L}(\theta)=\mathbb{E}[|(c\circ\mathcal{R}^\theta_a)(X)-B|]$. For all $n\in\mathbb{N}$ and $t\in\mathbb{N}_0$,


let $\mathbf{L}^{n,t}$ be a function from $C(\mathbb{R}^d,\mathbb{R})\times\Omega$ to $\mathbb{R}$, and denote by $\mathcal{L}^{n,t}:\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}$ the mapping given by $\mathcal{L}^{n,t}(\theta)=\mathbf{L}^{n,t}(c\circ\mathcal{R}^\theta_a)$. Let $G^{n,t}=(G^{n,t}_1,\dots,G^{n,t}_{P(a)}):\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}^{P(a)}$ be a function satisfying
$$G^{n,t}_i(\theta,\omega) = \frac{\partial}{\partial\theta_i}\mathcal{L}^{n,t}(\theta,\omega) \tag{2.3}$$
for all $n,t\in\mathbb{N}$, $i\in\{1,\dots,P(a)\}$, $\omega\in\Omega$, and
$$\theta=(\theta_1,\dots,\theta_{P(a)}) \in \Bigl\{\vartheta=(\vartheta_1,\dots,\vartheta_{P(a)})\in\mathbb{R}^{P(a)} : \mathcal{L}^{n,t}(\vartheta_1,\dots,\vartheta_{i-1},(\cdot),\vartheta_{i+1},\dots,\vartheta_{P(a)},\omega)\text{ as a function }\mathbb{R}\to\mathbb{R}\text{ is differentiable at }\vartheta_i\Bigr\}. \tag{2.4}$$
Let $\Theta^{n,t}=(\Theta^{n,t}_1,\dots,\Theta^{n,t}_{P(a)}):\Omega\to\mathbb{R}^{P(a)}$, $n\in\mathbb{N}$, $t\in\mathbb{N}_0$, be random variables such that $\Theta^{1,0},\dots,\Theta^{N,0}$ are i.i.d., $\Theta^{1,0}_1,\dots,\Theta^{1,0}_{P(a)}$ are independent, and
$$\Theta^{n,t} = \Theta^{n,t-1} - \gamma_t\, G^{n,t}\bigl(\Theta^{n,t-1}\bigr) \tag{2.5}$$
for all $n,t\in\mathbb{N}$. Let $k:\Omega\to\{1,\dots,N\}\times\mathbb{N}_0$ be a random variable, and denote $V=\mathbb{E}[\min\{\mathcal{L}(\Theta^k),1\}]$.

Note that, by [10, Lemma 6.2] and Tonelli's theorem, it follows from Setting 2.1 that $\mathcal{L}(\Theta^k):\Omega\to[0,\infty]$ is measurable and, as a consequence, $V=\mathbb{E}[\min\{\mathcal{L}(\Theta^k),1\}]$ is well-defined.

3. DNNs with constant realization functions

In this section, we study a subset of the parameter space, specified in Definition 3.1, for which neurons in a DNN become ''inactive'', rendering the realization function of the DNN constant. We deduce a few properties of such DNNs in Lemmas 3.2–3.4. The material in this section is related to the findings in [13,17].

Definition 3.1. Let $D\in\mathbb{N}$ and $a=(a_0,\dots,a_D)\in\mathbb{N}^{D+1}$. For all $j\in\mathbb{N}\cap(0,D)$, let $I_{a,j}\subseteq\mathbb{R}^{P(a)}$ be the set
$$I_{a,j} = \Bigl\{\theta=(\theta_1,\dots,\theta_{P(a)})\in\mathbb{R}^{P(a)} : \forall\, k\in\mathbb{N}\cap\Bigl(\textstyle\sum_{i=1}^{j-1}a_i(a_{i-1}+1),\ \sum_{i=1}^{j}a_i(a_{i-1}+1)\Bigr] : \theta_k<0\Bigr\}, \tag{3.1}$$
and denote $I_a=\bigcup_{j\in\mathbb{N}\cap(1,D)}I_{a,j}$.
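Definition 3.1 translates into a simple membership test. The following sketch (illustrative only, with hypothetical helper names) checks whether a flat parameter vector lies in $I_{a,j}$, i.e. whether every weight and bias of the $j$-th layer is strictly negative, and whether it lies in the union $I_a$.

```python
import numpy as np

def layer_slice(a, j):
    # index range of the parameters of layer j inside the flat vector, as in Definition 3.1
    start = sum(a[i] * (a[i - 1] + 1) for i in range(1, j))
    return start, start + a[j] * (a[j - 1] + 1)

def in_I_aj(theta, a, j):
    # theta lies in I_{a,j} iff every weight and bias of layer j is strictly negative
    start, end = layer_slice(a, j)
    return bool(np.all(theta[start:end] < 0))

def in_I_a(theta, a):
    # I_a is the union of I_{a,j} over j = 2, ..., D-1
    return any(in_I_aj(theta, a, j) for j in range(2, len(a) - 1))

a = (2, 3, 3, 3, 1)
P = sum(a[j] * (a[j - 1] + 1) for j in range(1, len(a)))
theta = np.random.uniform(-1.0, 1.0, size=P)
print(in_I_a(theta, a))
```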

First, we verify that the realization function is constant in both the argument and the network parameter on certain subsets of $I_{a,j}$.

Lemma 3.2. Let $D\in\mathbb{N}$, $j\in\mathbb{N}\cap(1,D)$, $a=(a_0,\dots,a_D)\in\mathbb{N}^{D+1}$, $\theta=(\theta_1,\dots,\theta_{P(a)})$, $\vartheta=(\vartheta_1,\dots,\vartheta_{P(a)})\in I_{a,j}$, $x\in\mathbb{R}^{a_0}$, and assume that $\theta_k=\vartheta_k$ for all $k\in\mathbb{N}\cap\bigl(\sum_{i=1}^{j}a_i(a_{i-1}+1),P(a)\bigr]$. Then $\mathcal{R}^\theta_a(0)=\mathcal{R}^\theta_a(x)=\mathcal{R}^\vartheta_a(x)=\mathcal{R}^\vartheta_a(0)$.

Proof. For all $k\in\{1,\dots,D\}$, denote $m_k=\sum_{i=1}^{k}a_i(a_{i-1}+1)$. Since, by assumption, $\theta,\vartheta\in I_{a,j}$, one has for all $k\in\mathbb{N}\cap(m_{j-1},m_j]$ that $\theta_k<0$ and $\vartheta_k<0$. This and $\rho(\mathbb{R}^{a_{j-1}})=[0,\infty)^{a_{j-1}}$ imply for all $y\in\mathbb{R}^{a_{j-1}}$, $\varphi\in\{\theta,\vartheta\}$ that $\mathcal{A}^{\varphi,m_{j-1}}_{a_j,a_{j-1}}\circ\rho(y)\in(-\infty,0]^{a_j}$. This ensures for all $y\in\mathbb{R}^{a_{j-1}}$, $\varphi\in\{\theta,\vartheta\}$ that $\rho\circ\mathcal{A}^{\varphi,m_{j-1}}_{a_j,a_{j-1}}\circ\rho(y)=0$. Moreover, the assumption that $\theta_k=\vartheta_k$ for all $k\in\mathbb{N}\cap\bigl(\sum_{i=1}^{j}a_i(a_{i-1}+1),P(a)\bigr]$ yields $\mathcal{A}^{\theta,m_{k-1}}_{a_k,a_{k-1}}=\mathcal{A}^{\vartheta,m_{k-1}}_{a_k,a_{k-1}}$ for all $k\in\mathbb{N}\cap(j,D]$. This implies that $\mathcal{R}^\theta_a(y)=\mathcal{R}^\vartheta_a(z)$ for all $y,z\in\mathbb{R}^{a_0}$, which completes the proof of Lemma 3.2.

The next lemma shows that networks with parameters in $I_a$ cannot perform better than a constant solution to the learning task.


Lemma 3.3. Assume Setting 2.1 and let $\theta\in I_a$. Then $\mathcal{L}(\theta)\ge\inf_{b\in\mathbb{R}}\mathbb{E}[|b-B|]$.

Proof. Let $\zeta\in\Omega$. By Lemma 3.2, one has $\mathcal{R}^\theta_a(x)=\mathcal{R}^\theta_a(0)$ for all $x\in\mathbb{R}^d$. Therefore, we obtain $\mathcal{R}^\theta_a(X(\omega))=\mathcal{R}^\theta_a(X(\zeta))$ for all $\omega\in\Omega$. In particular, $\mathcal{L}(\theta)=\mathbb{E}\bigl[|(c\circ\mathcal{R}^\theta_a)(X(\zeta))-B|\bigr]\ge\inf_{b\in\mathbb{R}}\mathbb{E}[|b-B|]$.

Finally, we show that SGD cannot escape from $I_a$.

Lemma 3.4. Assume Setting 2.1 and let $n,t\in\mathbb{N}$, $\omega\in\Omega$, $j\in\mathbb{N}\cap(1,D)$. Suppose that $\Theta^{n,0}(\omega)\in I_{a,j}$. Then $\Theta^{n,t}(\omega)\in I_{a,j}$.

Proof. Denote $m_0=\sum_{i=1}^{j-1}a_i(a_{i-1}+1)$ and $m_1=\sum_{i=1}^{j}a_i(a_{i-1}+1)$. We prove by induction that for all $s\in\mathbb{N}_0$ we have $\Theta^{n,s}(\omega)\in I_{a,j}$. The case $s=0$ is true by assumption. Now suppose that $s\in\mathbb{N}_0$ and $\theta=(\theta_1,\dots,\theta_{P(a)})\in\mathbb{R}^{P(a)}$ satisfy $\theta=\Theta^{n,s}(\omega)\in I_{a,j}$. Let $U\subseteq\mathbb{R}^{P(a)}$ be the set given by $U=\{(\theta_1,\dots,\theta_{m_0})\}\times(-\infty,0)^{m_1-m_0}\times\{(\theta_{m_1+1},\dots,\theta_{P(a)})\}$. Then $\theta\in U\subseteq I_{a,j}$. By Lemma 3.2, we have $\mathcal{R}^\varphi_a(x)=\mathcal{R}^\theta_a(x)$ for all $\varphi\in U$ and $x\in\mathbb{R}^d$. Hence, $\mathcal{L}^{n,s+1}(\varphi,\omega)=\mathcal{L}^{n,s+1}(\theta,\omega)$ for all $\varphi\in U$ and, as a consequence, $\frac{\partial}{\partial\theta_k}\mathcal{L}^{n,s+1}(\theta,\omega)=0$ for all $k\in\mathbb{N}\cap(m_0,m_1]$. So, it follows from (2.3), (2.5), and the induction hypothesis that $\Theta^{n,s+1}(\omega)\in I_{a,j}$, which completes the proof of Lemma 3.4.

4. Quantitative lower bounds for the SGD method in the training of DNNs

In this section, we establish in Proposition 4.2 a quantitative lower bound for the error of the SGD method in the training of DNNs.

Lemma 4.1. Assume Setting 2.1 and suppose $D\ge3$. For all $j\in\{1,\dots,D-1\}$, denote $k_j=\sum_{i=1}^{j}a_i(a_{i-1}+1)$, $p=\inf_{i\in\{1,\dots,P(a)\}}\mathbb{P}(\Theta^{1,0}_i<0)$, and $W=\max\{a_1,\dots,a_{D-1}\}$. Then
$$\mathbb{P}\bigl(\forall\, n\in\{1,\dots,N\},\,t\in\mathbb{N}_0:\Theta^{n,t}\in I_a\bigr) = \Bigl[1-\prod_{j=2}^{D-1}\Bigl(1-\prod_{i=1+k_{j-1}}^{k_j}\mathbb{P}(\Theta^{1,0}_i<0)\Bigr)\Bigr]^N \ge \bigl[1-(1-p^{W(W+1)})^{D-2}\bigr]^N. \tag{4.1}$$

Proof. It follows from the independence of $\Theta^{1,0}_1,\dots,\Theta^{1,0}_{P(a)}$ that
$$\mathbb{P}(\Theta^{1,0}\in I_a) = \mathbb{P}\bigl(\exists\, j\in\mathbb{N}\cap(1,D)\colon\forall\, i\in\mathbb{N}\cap(k_{j-1},k_j]\colon\Theta^{1,0}_i<0\bigr) = 1-\prod_{j=2}^{D-1}\Bigl(1-\prod_{i=1+k_{j-1}}^{k_j}\mathbb{P}(\Theta^{1,0}_i<0)\Bigr). \tag{4.2}$$
By definition of $p$ and $W$, the right hand side is greater than or equal to $1-(1-p^{W(W+1)})^{D-2}$. Moreover, Lemma 3.4 and the assumption that $\Theta^{1,0},\dots,\Theta^{N,0}$ are i.i.d. yield
$$\mathbb{P}\bigl(\forall\, n\in\{1,\dots,N\},\,t\in\mathbb{N}_0:\Theta^{n,t}\in I_a\bigr) = \mathbb{P}\bigl(\forall\, n\in\{1,\dots,N\}:\Theta^{n,0}\in I_a\bigr) = \bigl(\mathbb{P}(\Theta^{1,0}\in I_a)\bigr)^N, \tag{4.3}$$
which completes the proof of Lemma 4.1.
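Equation (4.2) can be sanity-checked by simulation. In the sketch below (an illustration with assumed values, not an experiment from the paper), the coordinates of $\Theta^{1,0}$ are i.i.d. and symmetric around zero, so $\mathbb{P}(\Theta^{1,0}_i<0)=1/2$ for every $i$, and the empirical frequency of the event $\{\Theta^{1,0}\in I_a\}$ is compared with the closed-form product.

```python
import numpy as np

rng = np.random.default_rng(2)
a = (2, 2, 2, 2, 1)                                        # D = 4, three hidden layers of width 2
D = len(a) - 1
sizes = [a[j] * (a[j - 1] + 1) for j in range(1, D + 1)]   # number of parameters per layer
P = sum(sizes)

# closed form from (4.2) with P(Theta_i^{1,0} < 0) = 1/2 for every coordinate
closed_form = 1.0 - np.prod([1.0 - 0.5 ** s for s in sizes[1:D - 1]])

def in_I_a(theta):
    # theta lies in I_a iff some layer j in {2, ..., D-1} has only negative parameters
    k = sizes[0]
    for s in sizes[1:D - 1]:
        if np.all(theta[k:k + s] < 0):
            return True
        k += s
    return False

samples = rng.uniform(-1.0, 1.0, size=(200_000, P))
empirical = np.mean([in_I_a(th) for th in samples])
print(f"closed form: {closed_form:.5f}, Monte Carlo: {empirical:.5f}")
```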

Proposition 4.2. Under the same assumptions as in Lemma 4.1, one has
$$V = \mathbb{E}[\min\{\mathcal{L}(\Theta^k),1\}] \ge \Bigl[1-\prod_{j=2}^{D-1}\Bigl(1-\prod_{i=1+k_{j-1}}^{k_j}\mathbb{P}(\Theta^{1,0}_i<0)\Bigr)\Bigr]^N \min\Bigl\{\inf_{b\in\mathbb{R}}\mathbb{E}[|b-B|],\,1\Bigr\} \ge \bigl[1-(1-p^{W(W+1)})^{D-2}\bigr]^N \min\Bigl\{\inf_{b\in\mathbb{R}}\mathbb{E}[|b-B|],\,1\Bigr\}. \tag{4.4}$$


Proof. Denote $C=\min\{\inf_{b\in\mathbb{R}}\mathbb{E}[|b-B|],1\}$ and observe that Lemma 3.3 implies for all $\omega\in\Omega$ with $\Theta^{k(\omega)}(\omega)\in I_a$ that $\min\{\mathcal{L}(\Theta^{k(\omega)}(\omega)),1\}\ge C$. Markov's inequality hence ensures that
$$C\,\mathbb{P}(\Theta^k\in I_a) \le C\,\mathbb{P}\bigl(\min\{\mathcal{L}(\Theta^k),1\}\ge C\bigr) \le V. \tag{4.5}$$
Combining this with Lemma 4.1 and the fact that $\mathbb{P}(\Theta^k\in I_a)\ge\mathbb{P}(\forall\, n\in\{1,\dots,N\},\,t\in\mathbb{N}_0:\Theta^{n,t}\in I_a)$ establishes (4.4).

Let us briefly discuss how the inequality in Proposition 4.2 relates to prior work in the literature. Fix a depth $D\in\mathbb{N}$ and consider the problem of distributing a given number of neurons among the $D-1$ hidden layers. In order to minimize the chance of starting with an inactive network, one needs to minimize the quantity $1-\prod_{j=2}^{D-1}\bigl(1-\prod_{i=1+k_{j-1}}^{k_j}\mathbb{P}(\Theta^{1,0}_i<0)\bigr)$ from (4.4). Under the assumption that $\mathbb{P}(\Theta^{1,0}_i<0)$ does not depend on $i$, this can be achieved by choosing the same number of neurons in each layer.

The effects of initialization and architecture on early training have also been studied in [8,9]. While [8] investigates the problem of vanishing and exploding gradients, [9] studies two failure modes associated with poor starting conditions. Both find that, given a total number of neurons to spend, distributing them evenly among the hidden layers yields the best results. This is in line with our findings.

5. Main results

In this section, we prove the paper's main results, Theorem 5.3 and Corollary 5.4. While Theorem 5.3 provides precise quantitative conditions under which SGD does not converge in the training of DNNs, Corollary 5.4 is a qualitative result. To prove them, we need the following elementary result. Throughout, log denotes the natural logarithm.

Lemma 5.1. Let $D,N,W\in(0,\infty)$ and $\kappa,p\in(0,1)$ be such that $D\ge|\log(p)|\,W\,p^{-W}$ and $N\le|\log(\kappa)|\,(1-p^W)^{1-D}$. Then $\bigl[1-(1-p^W)^D\bigr]^N\ge\kappa$.

Proof. Let the functions $f:[0,1)\to\mathbb{R}$ and $g:[0,1)\to\mathbb{R}$ be given by $f(x)=x+\log(1-x)$ and $g(x)=(1-p^W)^{-1}x+\log(1-x)$. Since $f(0)=0$ and $f'(x)=1-(1-x)^{-1}<0$ for all $x\in(0,1)$, one has $|\log(1-x)|^{-1}<x^{-1}$ for all $x\in(0,1)$. Hence, $D>|\log(p)|\,W\,|\log(1-p^W)|^{-1}$, from which it follows that $(1-p^W)^D<p^W$. In addition, $g(0)=0$ and $g'(x)=(1-p^W)^{-1}-(1-x)^{-1}>0$ for all $x\in(0,p^W)$, which implies that $|\log(1-x)|<(1-p^W)^{-1}x$ for all $x\in(0,p^W)$. Hence, we deduce from $(1-p^W)^D<p^W$ that $N\,|\log(1-(1-p^W)^D)|<N\,(1-p^W)^{D-1}\le|\log(\kappa)|$, and taking the exponential yields the desired statement.
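A quick numerical check of Lemma 5.1 with assumed example values: choose $p$, $W$, and $\kappa$, take $D$ and $N$ at the extremes allowed by the hypotheses, and verify the conclusion.

```python
import math

# assumed example values
p, W, kappa = 0.5, 3, 0.5
D = math.ceil(abs(math.log(p)) * W * p ** (-W))                  # D >= |log p| * W * p^{-W}
N = math.floor(abs(math.log(kappa)) * (1 - p ** W) ** (1 - D))   # N <= |log kappa| * (1 - p^W)^{1-D}
value = (1 - (1 - p ** W) ** D) ** N
print(D, N, value, value >= kappa)                               # D = 17, N = 5, and the bound of Lemma 5.1 holds
```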

We proved Proposition 4.2 in the abstract framework of Setting 2.1. For the sake of concreteness, we now return to the setup of the introduction. We quickly recall it below.

Setting 5.2. Let $u,\mathbf{u}\in\mathbb{R}$, $v\in(u,\infty)$, $\mathbf{v}\in(\mathbf{u},\infty)$, $c\in C(\mathbb{R},\mathbb{R})$, $d\in\mathbb{N}$, and $(\gamma_t)_{t\in\mathbb{N}}\subseteq\mathbb{R}$. Consider functions $X^{n,t}_j:\Omega\to[u,v]^d$ and $Y^{n,t}_j:\Omega\to[\mathbf{u},\mathbf{v}]$, $j,n,t\in\mathbb{N}_0$, on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that $X^{0,0}_0$ and $Y^{0,0}_0$ are random variables. Let $\mathcal{E}:[u,v]^d\to[\mathbf{u},\mathbf{v}]$ be a measurable function such that $\mathbb{P}$-a.s. $\mathcal{E}(X^{0,0}_0)=\mathbb{E}[Y^{0,0}_0\mid X^{0,0}_0]$. Let $\mathcal{L}^{n,t}_{a,m}:\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}$, $m\in\mathbb{N}$, $n,t\in\mathbb{N}_0$, $a\in\mathbf{A}_d$, be given by
$$\mathcal{L}^{n,t}_{a,m}(\theta) = \frac{1}{m}\sum_{j=1}^{m}\bigl|(c\circ\mathcal{R}^\theta_a)(X^{n,t}_j)-Y^{n,t}_j\bigr|^2, \tag{5.1}$$
and assume $G^{n,t}_{a,m}=(G^{n,t}_{a,m,1},\dots,G^{n,t}_{a,m,P(a)}):\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}^{P(a)}$ are mappings satisfying
$$G^{n,t}_{a,m,i}(\theta,\omega) = \frac{\partial}{\partial\theta_i}\mathcal{L}^{n,t}_{a,m}(\theta,\omega) \tag{5.2}$$
