
Research Collection
Journal Article

Non-convergence of stochastic gradient descent in the training of deep neural networks

Author(s): Cheridito, Patrick; Jentzen, Arnulf; Rossmannek, Florian
Publication Date: 2021-06
Permanent Link: https://doi.org/10.3929/ethz-b-000454958
Originally published in: Journal of Complexity 64, http://doi.org/10.1016/j.jco.2020.101540
Rights / License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International


Journal of Complexity 64 (2021) 101540


Non-convergence of stochastic gradient descent in the training of deep neural networks

Patrick Cheridito^a, Arnulf Jentzen^b, Florian Rossmannek^{a,*}

^a Department of Mathematics, ETH Zurich, Switzerland
^b Faculty of Mathematics and Computer Science, University of Münster, Germany

Article info

Article history:
Received 14 June 2020
Received in revised form 10 November 2020
Accepted 19 November 2020
Available online 27 November 2020

Keywords: Machine learning; Deep neural networks; Stochastic gradient descent; Empirical risk minimization; Non-convergence

Abstract

Deep neural networks have successfully been trained in various application areas with stochastic gradient descent. However, there exists no rigorous mathematical explanation why this works so well. The training of neural networks with stochastic gradient descent has four different discretization parameters: (i) the network architecture; (ii) the amount of training data; (iii) the number of gradient steps; and (iv) the number of randomly initialized gradient trajectories. While it can be shown that the approximation error converges to zero if all four parameters are sent to infinity in the right order, we demonstrate in this paper that stochastic gradient descent fails to converge for ReLU networks if their depth is much larger than their width and the number of random initializations does not increase to infinity fast enough.

©2020 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Deep learning has produced remarkable results in different practical applications such as image classification, speech recognition, machine translation, and game intelligence. In this paper, we analyze it in the context of a supervised learning task, though it has also successfully been applied in unsupervised learning and reinforcement learning.

Communicated by E. Novak.

Corresponding author.

E-mail addresses: patrick.cheridito@math.ethz.ch (P. Cheridito), ajentzen@uni-muenster.de (A. Jentzen), florian.rossmannek@math.ethz.ch (F. Rossmannek).

https://doi.org/10.1016/j.jco.2020.101540


Deep learning is usually implemented with a stochastic gradient descent (SGD) method based on training data. Gradient descent methods have long been known to work, even with good rates, for convex problems; see, e.g., [3]. However, the training of a deep neural network (DNN) is a non-convex problem, and questions about guarantees and convergence rates of SGD in this context are currently among the most important research topics in the mathematical theory of machine learning.

To obtain optimal approximation results with a DNN, several hyper-parameters have to be fine-tuned. First, the architecture of the network determines what type of functions can be approximated. To be able to efficiently approximate complicated functions, it needs to be sufficiently wide and deep. Secondly, the goal is to approximate the target function with respect to the true risk, but the algorithm only has access to the empirical risk. The gap between the two goes to zero as the amount of training data increases to infinity. Thirdly, the gradient method attempts to minimize the empirical risk, and the chance of finding a good approximate minimum increases with the number of gradient steps. Finally, since a single gradient trajectory may not yield good results, it is common to run several of them with different random initializations. It has been shown in [2,10] that general networks converge if their size, the amount of training data, and the number of random initializations are increased to infinity in the correct way, albeit with an extremely slow speed of convergence. In general, one cannot hope to overcome this slow speed of convergence; see [16]. On the other hand, it has been shown that, for the training error, faster convergence can be guaranteed with certain probabilities if over-parametrized networks are used; see [4,6,18,22,23] and the references therein. A different approach to the convergence problem relies on landscape analysis of the loss surface. For example, it is known that there are no local minima if the networks are linear; see [1,11].

This is no longer true for non-linear networks¹; see [15]. But in this case, there are results about the frequency of local minima; see, e.g., [5,7,14,15,20,21]. The initialization method is important for any type of network. But for ReLU networks it plays a special role due to the particular form of the ReLU activation function; see [8,9,13,17].

The main contribution of this paper is a demonstration that SGD fails to converge for ReLU networks if the number of random initializations does not increase fast enough compared to the size of the network. To illustrate our findings, we present a special case of our main result, Theorem 5.3, in Theorem 1.1.

We denote by $d \in \mathbb{N} = \{1,2,\dots\}$ the dimension of the input domain of the approximation problem. The set $\mathbf{A}_d = \bigcup_{D\in\mathbb{N}}(\{d\}\times\mathbb{N}^{D-1}\times\{1\})$ represents all network architectures with input dimension $d$ and output dimension 1. In particular, a vector $a = (a_0,\dots,a_D)\in\mathbf{A}_d$ describes the depth $D$ of a network and the number of neurons $a_0,\dots,a_D$ in the different layers. For any such architecture $a$, the quantity $P(a)=\sum_{j=1}^{D} a_j(a_{j-1}+1)$ counts the number of real parameters, that is, the number of weights and biases of a DNN with architecture $a$. We consider networks with ReLU activation in the hidden layers and a linear read-out map. That is, the realization function $\mathcal{R}^\theta_a:\mathbb{R}^d\to\mathbb{R}$ of a fully connected feedforward DNN with architecture $a=(a_0,\dots,a_D)\in\mathbf{A}_d$ and weights and biases $\theta\in\mathbb{R}^{P(a)}$ is given by
$$\mathcal{R}^\theta_a = \mathcal{A}^{\theta,\sum_{i=1}^{D-1}a_i(a_{i-1}+1)}_{a_D,a_{D-1}} \circ \rho \circ \mathcal{A}^{\theta,\sum_{i=1}^{D-2}a_i(a_{i-1}+1)}_{a_{D-1},a_{D-2}} \circ \rho \circ \dots \circ \mathcal{A}^{\theta,a_1(a_0+1)}_{a_2,a_1} \circ \rho \circ \mathcal{A}^{\theta,0}_{a_1,a_0}, \tag{1.1}$$
where $\mathcal{A}^{\theta,k}_{m,n}:\mathbb{R}^n\to\mathbb{R}^m$ denotes the affine mapping
$$(x_1,\dots,x_n) \mapsto \begin{pmatrix} \theta_{k+1} & \theta_{k+2} & \cdots & \theta_{k+n} \\ \theta_{k+n+1} & \theta_{k+n+2} & \cdots & \theta_{k+2n} \\ \vdots & \vdots & \ddots & \vdots \\ \theta_{k+(m-1)n+1} & \theta_{k+(m-1)n+2} & \cdots & \theta_{k+mn} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \theta_{k+mn+1} \\ \theta_{k+mn+2} \\ \vdots \\ \theta_{k+mn+m} \end{pmatrix} \tag{1.2}$$
and $\rho:\bigcup_{k\in\mathbb{N}}\mathbb{R}^k\to\bigcup_{k\in\mathbb{N}}\mathbb{R}^k$ is the ReLU function $(x_1,\dots,x_k)\mapsto(\max\{x_1,0\},\dots,\max\{x_k,0\})$.
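To make the parameter layout behind (1.1) and (1.2) concrete, the following sketch (an illustration only, not part of the paper; NumPy-based, with hypothetical helper names) evaluates $\mathcal{R}^\theta_a$ from a flat parameter vector ordered as in (1.2): for each layer, the weight matrix row by row, followed by the bias vector.

```python
import numpy as np

def num_params(a):
    # P(a) = sum_j a_j * (a_{j-1} + 1), i.e. all weights and biases
    return sum(a[j] * (a[j - 1] + 1) for j in range(1, len(a)))

def realization(theta, a, x):
    """Evaluate R_a^theta(x) for a fully connected ReLU network with
    architecture a = (a_0, ..., a_D) and flat parameter vector theta."""
    y, k = np.asarray(x, dtype=float), 0
    D = len(a) - 1
    for j in range(1, D + 1):
        m, n = a[j], a[j - 1]
        W = theta[k:k + m * n].reshape(m, n)     # weight matrix of layer j, row-major as in (1.2)
        b = theta[k + m * n:k + m * (n + 1)]     # bias vector of layer j
        y = W @ y + b
        if j < D:                                # ReLU on hidden layers, linear read-out
            y = np.maximum(y, 0.0)
        k += m * (n + 1)
    return y

a = (3, 5, 5, 1)                                 # an example architecture in A_3
theta = np.random.uniform(-2.0, 2.0, size=num_params(a))
print(realization(theta, a, np.array([0.2, 0.5, 0.9])))
```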

¹ Unless the loss is measured with respect to a finite data set on which the network is heavily overfitted by, e.g., greatly over-parametrizing the last hidden layer; see [12,19].


In the following description of the SGD algorithm, $n\in\mathbb{N}$ is the index of the trajectory, $t\in\mathbb{N}_0$ represents the index of the step along the trajectory, $m\in\mathbb{N}$ denotes the batch size of the empirical risk, and $a\in\mathbf{A}_d$ describes the architecture under consideration. We assume the training data is given by functions $X^{n,t}_j:\Omega\to[0,1]^d$ and $Y^{n,t}_j:\Omega\to[0,1]$, $j,n,t\in\mathbb{N}_0$, on a given probability space $(\Omega,\mathcal{F},\mathbb{P})$. In a typical learning problem, $(X^{n,t}_j,Y^{n,t}_j)$, $j,n,t\in\mathbb{N}_0$, are i.i.d. random variables. But for Theorem 1.1 to hold, it is enough if $(X^{0,0}_j,Y^{0,0}_j)$, $j\in\mathbb{N}_0$, are i.i.d. random variables, whereas $(X^{n,t}_j,Y^{n,t}_j):\Omega\to[0,1]^{d+1}$ are arbitrary mappings for $(n,t)\neq(0,0)$. The target function $\mathcal{E}:[0,1]^d\to[0,1]$ we are trying to learn is the factorized conditional expectation given ($\mathbb{P}$-a.s.) by $\mathcal{E}(X^{0,0}_0)=\mathbb{E}[Y^{0,0}_0\mid X^{0,0}_0]$. The empirical risk used for training is
$$\mathcal{L}^{n,t}_{a,m}(\theta) = \frac{1}{m}\sum_{j=1}^{m}\bigl|(c\circ\mathcal{R}^\theta_a)(X^{n,t}_j)-Y^{n,t}_j\bigr|^2, \tag{1.3}$$
where we compose the network realization with the clipping function $c(x)=\max\{0,\min\{x,1\}\}$. This composition inside the risk is equivalent to a non-linear read-out map of the network. However, it is more convenient for us to view $c$ as part of the risk criterion instead of the network. But this is only a matter of notation. Observe that (1.3) is a supervised learning task with noise since, in general, the best possible least squares approximation of $Y^{0,0}_0$ with a deterministic function of $X^{0,0}_0$ is $\mathcal{E}(X^{0,0}_0)$, which is only equal to $Y^{0,0}_0$ in the special case where $Y^{0,0}_0$ is $X^{0,0}_0$-measurable. We let $G^{n,t}_{a,m}:\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}^{P(a)}$ be a function that is equal to the gradient of $\mathcal{L}^{n,t}_{a,m}$ where it exists. The trajectories of the SGD algorithm are given by random variables $\Theta^{n,t}_{a,m}:\Omega\to\mathbb{R}^{P(a)}$ satisfying the defining relation
$$\Theta^{n,t}_{a,m} = \Theta^{n,t-1}_{a,m} - \gamma_t\, G^{n,t}_{a,m}\bigl(\Theta^{n,t-1}_{a,m}\bigr) \tag{1.4}$$
for given step sizes $(\gamma_t)_{t\in\mathbb{N}}\subseteq\mathbb{R}$.
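The recursion (1.4) with the clipped empirical risk (1.3) can be sketched as follows. This is a minimal illustration, not the paper's setup: it reuses `num_params` and `realization` from the previous sketch, uses a synthetic target, and replaces the exact gradient $G^{n,t}_{a,m}$ by a forward-difference surrogate purely for simplicity.

```python
import numpy as np

# assumes num_params and realization from the previous sketch are in scope

def clip01(x):
    # clipping function c(x) = max{0, min{x, 1}} from (1.3)
    return np.clip(x, 0.0, 1.0)

def empirical_risk(theta, a, X, Y):
    # (1/m) * sum_j |c(R_a^theta(X_j)) - Y_j|^2, as in (1.3)
    preds = np.array([clip01(realization(theta, a, x))[0] for x in X])
    return float(np.mean((preds - Y) ** 2))

def sgd_trajectory(a, X, Y, steps, gamma, rng):
    # one trajectory of (1.4), with a forward-difference surrogate for the gradient
    theta = rng.uniform(-2.0, 2.0, size=num_params(a))
    eps = 1e-6
    for _ in range(steps):
        base = empirical_risk(theta, a, X, Y)
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            shifted = theta.copy()
            shifted[i] += eps
            grad[i] = (empirical_risk(shifted, a, X, Y) - base) / eps
        theta = theta - gamma * grad
    return theta, empirical_risk(theta, a, X, Y)

rng = np.random.default_rng(0)
a = (1, 4, 4, 1)
X = rng.uniform(0.0, 1.0, size=(32, 1))       # synthetic inputs in [0, 1]
Y = X[:, 0] ** 2                              # a toy target, E(x) = x^2
theta, risk = sgd_trajectory(a, X, Y, steps=50, gamma=0.1, rng=rng)
print("empirical risk after 50 steps:", risk)
```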

Now, we are ready to state the following result, which is a consequence of Theorem 6.5 in [10] and Corollary 5.4.

Theorem 1.1. Assume that the target function $\mathcal{E}$ is Lipschitz continuous and that $\mathcal{E}(X^{0,0}_0)$ is not $\mathbb{P}$-a.s. constant. Suppose that, for all $a\in\mathbf{A}_d$ and $m\in\mathbb{N}$, the random initializations $\Theta^{n,0}_{a,m}$, $n\in\mathbb{N}$, are independent and uniformly distributed on $[-c,c]^{P(a)}$, where $c\in[2,\infty)$ is larger than the Lipschitz constant of $\mathcal{E}$. Let $k_{a,M,N,T}:\Omega\to\mathbb{N}\times\mathbb{N}_0$ be random variables satisfying
$$k_{a,M,N,T}(\omega) \in \operatorname*{arg\,min}_{\substack{(n,t)\in\{1,\dots,N\}\times\{0,\dots,T\},\\ \Theta^{n,t}_{a,M}(\omega)\in[-c,c]^{P(a)}}} \mathcal{L}^{0,0}_{a,M}\bigl(\Theta^{n,t}_{a,M}(\omega),\omega\bigr). \tag{1.5}$$
Then, one has

$$\limsup_{\substack{a=(a_0,\dots,a_D)\in\mathbf{A}_d\\ \min\{D,a_1,\dots,a_{D-1}\}\to\infty}}\;\limsup_{\substack{M,N\in\mathbb{N}\\ \min\{M,N\}\to\infty}}\;\sup_{T\in\mathbb{N}_0}\;\mathbb{E}\Bigl[\min\Bigl\{\int_{[0,1]^d}\Bigl|\bigl(c\circ\mathcal{R}^{\Theta^{k_{a,M,N,T}}_{a,M}}_a\bigr)(x)-\mathcal{E}(x)\Bigr|\,\mathbb{P}_{X^{0,0}_0}(dx),\,1\Bigr\}\Bigr]=0 \tag{1.6}$$
and
$$\inf_{N\in\mathbb{N}}\;\limsup_{\substack{a=(a_0,\dots,a_D)\in\mathbf{A}_d\\ \min\{D,a_1,\dots,a_{D-1}\}\to\infty}}\;\inf_{\substack{M\in\mathbb{N}\\ T\in\mathbb{N}_0}}\;\mathbb{E}\Bigl[\min\Bigl\{\int_{[0,1]^d}\Bigl|\bigl(c\circ\mathcal{R}^{\Theta^{k_{a,M,N,T}}_{a,M}}_a\bigr)(x)-\mathcal{E}(x)\Bigr|\,\mathbb{P}_{X^{0,0}_0}(dx),\,1\Bigr\}\Bigr]>0. \tag{1.7}$$

The integrals in (1.6) and (1.7) describe the true risk. Note that in Theorem 1.1 the random initializations of the different trajectories are assumed to be independent and uniformly distributed on the hypercube $[-c,c]^{P(a)}$, but our main result, Theorem 5.3, also covers more general cases. The random variable $k_{a,M,N,T}$ determines the specific trajectory and gradient step among the first $N$ trajectories and $T$ steps which minimize the empirical risk corresponding to batch size $M$. Note that $\mathcal{E}(X^{0,0}_0)$ not being a.s. constant is a weak assumption since it merely means that the learning task is non-trivial. Moreover, the stronger condition that $\mathcal{E}$ must be Lipschitz continuous is made


only to ensure the validity of the positive result (1.6), whereas our new contribution (1.7) does not need this requirement. Similarly, we use the clipping function $c$ to ensure the validity of (1.6), which in [10] is formulated for networks with clipping function as read-out map.

Our arguments are based on an analysis of regions in the parameter space related to ''inactive'' neurons. In these regions, the realization function is constant not only in its argument but also in the network parameter. For example, if $\theta$ contains only strictly negative parameters, then $\rho\circ\mathcal{A}^{\theta,a_1(a_0+1)}_{a_2,a_1}\circ\rho(x)$ is constantly zero in $x$ and in a neighborhood of $\theta$. As a consequence, SGD will not be able to escape from such a region. The fact that random initialization can render parts of a ReLU network inactive has already been noticed in [13,17]. While the focus of [13,17] is on the design of alternative random initialization schemes to make the training more efficient, we here give precise estimates on the probability that the whole network becomes inactive and deduce that SGD fails to converge if the number of random initializations does not increase fast enough. Note that in (1.7) we take the limit superior over all architectures $(a_0,\dots,a_D)\in\mathbf{A}_d$ whose depth $D$ and minimal width $\min\{a_1,\dots,a_{D-1}\}$ both tend to infinity. In particular, to prove (1.7), it is sufficient to construct a single sequence of such architectures over which the limit is positive. For the sequence we use, the depth grows much faster than the maximal width $\max\{a_1,\dots,a_{D-1}\}$. This imbalance between depth and width has the effect that the training procedure does not converge.
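This effect is easy to check numerically. The sketch below (illustrative only; it reuses `num_params` and `realization` from the sketch in the introduction) forces all weights and biases of the second layer to be strictly negative and verifies that the realization is then constant in the input and unchanged under small perturbations of that layer's parameters.

```python
import numpy as np

# reuses num_params and realization from the sketch in the introduction

rng = np.random.default_rng(1)
a = (2, 3, 3, 1)
theta = rng.uniform(-1.0, 1.0, size=num_params(a))

# make every weight and bias of the second layer strictly negative
start = a[1] * (a[0] + 1)                   # the first layer occupies the first a_1 (a_0 + 1) entries
end = start + a[2] * (a[1] + 1)
theta[start:end] = -np.abs(theta[start:end]) - 0.1

# the realization is now constant in the input x ...
outs = [realization(theta, a, x)[0] for x in rng.uniform(-5.0, 5.0, size=(10, 2))]
print("constant in x:", np.allclose(outs, outs[0]))

# ... and locally constant in the second layer's parameters
perturbed = theta.copy()
perturbed[start:end] += rng.uniform(-0.05, 0.0, size=end - start)   # stays strictly negative
print("constant in theta:", np.isclose(realization(theta, a, np.ones(2))[0],
                                        realization(perturbed, a, np.ones(2))[0]))
```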

The remainder of this article is organized as follows: In Section 2, we provide an abstract version of the SGD algorithm for training neural networks in a supervised learning framework. Section 3 contains preliminary results on inactive neurons and constant network realization functions. In Section 4, we discuss the consequences of these preliminary results for the convergence of the SGD method, and Section 5 contains our main results, Theorem 5.3 and Corollary 5.4.

2. Mathematical description of the SGD method

In this section, we give a mathematical description of an abstract version of the SGD algorithm for training neural networks in a supervised learning framework. To do that, we slightly generalize the setup of the introduction. We begin with an informal description and give a precise formulation afterwards. First, fix a network architecture $a=(a_0,\dots,a_D)\in\mathbf{A}_d$. Let $X:\Omega\to[u,v]^d$ and $B:\Omega\to[\mathbf{u},\mathbf{v}]$ be random variables on a probability space $(\Omega,\mathcal{F},\mathbb{P})$, on which the true risk $\mathcal{L}(\theta)=\mathbb{E}[|(c\circ\mathcal{R}^\theta_a)(X)-B|]$ of a network $\theta\in\mathbb{R}^{P(a)}$ is based. Here, $c:\mathbb{R}\to\mathbb{R}$ can be any continuous function, which covers the case of network realizations with non-linear read-out maps. In the context of the introduction, $B$ stands for the random variable $\mathcal{E}(X^{0,0}_0)$. Throughout, $n\in\mathbb{N}$ will denote the index of the gradient trajectory and $t\in\mathbb{N}_0$ the index of the gradient step. $\mathbf{L}^{n,t}$ denotes the empirical risk defined on the space of functions $C(\mathbb{R}^d,\mathbb{R})$. In this general setting, $\mathbf{L}^{n,t}$ can be any function from $C(\mathbb{R}^d,\mathbb{R})\times\Omega$ to $\mathbb{R}$, but the specific example we have in mind is

$$\mathbf{L}^{n,t}(f) = \frac{1}{m}\sum_{j=1}^{m}\bigl|f(X^{n,t}_j)-Y^{n,t}_j\bigr|^2 \tag{2.1}$$
for a given batch size $m\in\mathbb{N}$. $\mathcal{L}^{n,t}$ is the empirical risk defined on the space of network parameters, given in terms of $\mathbf{L}^{n,t}$ by $\mathcal{L}^{n,t}(\theta)=\mathbf{L}^{n,t}(c\circ\mathcal{R}^\theta_a)$. Let $G^{n,t}:\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}^{P(a)}$ be a function that agrees with the gradient of $\mathcal{L}^{n,t}$ where it exists. Then, we can introduce the gradient trajectories $\Theta^{n,t}:\Omega\to\mathbb{R}^{P(a)}$ satisfying
$$\Theta^{n,t} = \Theta^{n,t-1} - \gamma_t\, G^{n,t}\bigl(\Theta^{n,t-1}\bigr) \tag{2.2}$$
for given step sizes $\gamma_t$. The $N$ random initializations $\Theta^{n,0}$, $n\in\{1,\dots,N\}$, are assumed to be i.i.d. in $n$ and to have independent marginals. Lastly, $k:\Omega\to\mathbb{N}\times\mathbb{N}_0$ specifies the output of the algorithm, consisting of a pair of indices for a gradient trajectory and a gradient step. The expected true risk is $V=\mathbb{E}[\min\{\mathcal{L}(\Theta^k),1\}]$. In the following, we present the formal algorithm.

Setting 2.1. Let $u,\mathbf{u}\in\mathbb{R}$, $v\in(u,\infty)$, $\mathbf{v}\in(\mathbf{u},\infty)$, $c\in C(\mathbb{R},\mathbb{R})$, $d,D,N\in\mathbb{N}$, $a=(a_0,\dots,a_D)\in\mathbf{A}_d$, and $(\gamma_t)_{t\in\mathbb{N}}\subseteq\mathbb{R}$. Consider random variables $X:\Omega\to[u,v]^d$ and $B:\Omega\to[\mathbf{u},\mathbf{v}]$ on a probability space $(\Omega,\mathcal{F},\mathbb{P})$. Let $\mathcal{L}:\mathbb{R}^{P(a)}\to[0,\infty]$ be given by $\mathcal{L}(\theta)=\mathbb{E}[|(c\circ\mathcal{R}^\theta_a)(X)-B|]$. For all $n\in\mathbb{N}$ and $t\in\mathbb{N}_0$,


let $\mathbf{L}^{n,t}$ be a function from $C(\mathbb{R}^d,\mathbb{R})\times\Omega$ to $\mathbb{R}$, and denote by $\mathcal{L}^{n,t}:\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}$ the mapping given by $\mathcal{L}^{n,t}(\theta)=\mathbf{L}^{n,t}(c\circ\mathcal{R}^\theta_a)$. Let $G^{n,t}=(G^{n,t}_1,\dots,G^{n,t}_{P(a)}):\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}^{P(a)}$ be a function satisfying
$$G^{n,t}_i(\theta,\omega) = \frac{\partial}{\partial\theta_i}\mathcal{L}^{n,t}(\theta,\omega) \tag{2.3}$$
for all $n,t\in\mathbb{N}$, $i\in\{1,\dots,P(a)\}$, $\omega\in\Omega$, and
$$\theta=(\theta_1,\dots,\theta_{P(a)}) \in \Bigl\{\vartheta=(\vartheta_1,\dots,\vartheta_{P(a)})\in\mathbb{R}^{P(a)} : \mathcal{L}^{n,t}(\vartheta_1,\dots,\vartheta_{i-1},(\cdot),\vartheta_{i+1},\dots,\vartheta_{P(a)},\omega)\text{ as a function }\mathbb{R}\to\mathbb{R}\text{ is differentiable at }\vartheta_i\Bigr\}. \tag{2.4}$$
Let $\Theta^{n,t}=(\Theta^{n,t}_1,\dots,\Theta^{n,t}_{P(a)}):\Omega\to\mathbb{R}^{P(a)}$, $n\in\mathbb{N}$, $t\in\mathbb{N}_0$, be random variables such that $\Theta^{1,0},\dots,\Theta^{N,0}$ are i.i.d., $\Theta^{1,0}_1,\dots,\Theta^{1,0}_{P(a)}$ are independent, and
$$\Theta^{n,t} = \Theta^{n,t-1} - \gamma_t\, G^{n,t}\bigl(\Theta^{n,t-1}\bigr) \tag{2.5}$$
for all $n,t\in\mathbb{N}$. Let $k:\Omega\to\{1,\dots,N\}\times\mathbb{N}_0$ be a random variable, and denote $V=\mathbb{E}[\min\{\mathcal{L}(\Theta^k),1\}]$.

Note that, by [10, Lemma 6.2] and Tonelli's theorem, it follows from Setting 2.1 that $\mathcal{L}(\Theta^k):\Omega\to[0,\infty]$ is measurable and, as a consequence, $V=\mathbb{E}[\min\{\mathcal{L}(\Theta^k),1\}]$ is well-defined.

3. DNNs with constant realization functions

In this section, we study a subset of the parameter space, specified in Definition 3.1, for which neurons in a DNN become ''inactive'', rendering the realization function of the DNN constant. We deduce a few properties of such DNNs in Lemmas 3.2–3.4. The material in this section is related to the findings in [13,17].

Definition 3.1. Let $D\in\mathbb{N}$ and $a=(a_0,\dots,a_D)\in\mathbb{N}^{D+1}$. For all $j\in\mathbb{N}\cap(0,D)$, let $I_{a,j}\subseteq\mathbb{R}^{P(a)}$ be the set
$$I_{a,j} = \Bigl\{\theta=(\theta_1,\dots,\theta_{P(a)})\in\mathbb{R}^{P(a)} : \forall\, k\in\mathbb{N}\cap\Bigl(\textstyle\sum_{i=1}^{j-1}a_i(a_{i-1}+1),\ \sum_{i=1}^{j}a_i(a_{i-1}+1)\Bigr] : \theta_k<0\Bigr\}, \tag{3.1}$$
and denote $I_a=\bigcup_{j\in\mathbb{N}\cap(1,D)}I_{a,j}$.
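Definition 3.1 translates into a simple membership test. The following sketch (illustrative only, with hypothetical helper names) checks whether a flat parameter vector lies in $I_{a,j}$, i.e. whether every weight and bias of the $j$-th layer is strictly negative, and whether it lies in the union $I_a$.

```python
import numpy as np

def layer_slice(a, j):
    # index range of the parameters of layer j inside the flat vector, as in Definition 3.1
    start = sum(a[i] * (a[i - 1] + 1) for i in range(1, j))
    return start, start + a[j] * (a[j - 1] + 1)

def in_I_aj(theta, a, j):
    # theta lies in I_{a,j} iff every weight and bias of layer j is strictly negative
    start, end = layer_slice(a, j)
    return bool(np.all(theta[start:end] < 0))

def in_I_a(theta, a):
    # I_a is the union of I_{a,j} over j = 2, ..., D-1
    return any(in_I_aj(theta, a, j) for j in range(2, len(a) - 1))

a = (2, 3, 3, 3, 1)
P = sum(a[j] * (a[j - 1] + 1) for j in range(1, len(a)))
theta = np.random.uniform(-1.0, 1.0, size=P)
print(in_I_a(theta, a))
```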

First, we verify that the realization function is constant in both the argument and the network parameter on certain subsets of $I_{a,j}$.

Lemma 3.2. Let $D\in\mathbb{N}$, $j\in\mathbb{N}\cap(1,D)$, $a=(a_0,\dots,a_D)\in\mathbb{N}^{D+1}$, $\theta=(\theta_1,\dots,\theta_{P(a)})$, $\vartheta=(\vartheta_1,\dots,\vartheta_{P(a)})\in I_{a,j}$, $x\in\mathbb{R}^{a_0}$, and assume that $\theta_k=\vartheta_k$ for all $k\in\mathbb{N}\cap\bigl(\sum_{i=1}^{j}a_i(a_{i-1}+1),P(a)\bigr]$. Then $\mathcal{R}^\theta_a(0)=\mathcal{R}^\theta_a(x)=\mathcal{R}^\vartheta_a(x)=\mathcal{R}^\vartheta_a(0)$.

Proof. For all $k\in\{1,\dots,D\}$, denote $m_k=\sum_{i=1}^{k}a_i(a_{i-1}+1)$. Since, by assumption, $\theta,\vartheta\in I_{a,j}$, one has for all $k\in\mathbb{N}\cap(m_{j-1},m_j]$ that $\theta_k<0$ and $\vartheta_k<0$. This and $\rho(\mathbb{R}^{a_{j-1}})=[0,\infty)^{a_{j-1}}$ imply for all $y\in\mathbb{R}^{a_{j-1}}$, $\varphi\in\{\theta,\vartheta\}$ that $\mathcal{A}^{\varphi,m_{j-1}}_{a_j,a_{j-1}}\circ\rho(y)\in(-\infty,0]^{a_j}$. This ensures for all $y\in\mathbb{R}^{a_{j-1}}$, $\varphi\in\{\theta,\vartheta\}$ that $\rho\circ\mathcal{A}^{\varphi,m_{j-1}}_{a_j,a_{j-1}}\circ\rho(y)=0$. Moreover, the assumption that $\theta_k=\vartheta_k$ for all $k\in\mathbb{N}\cap\bigl(\sum_{i=1}^{j}a_i(a_{i-1}+1),P(a)\bigr]$ yields $\mathcal{A}^{\theta,m_{k-1}}_{a_k,a_{k-1}}=\mathcal{A}^{\vartheta,m_{k-1}}_{a_k,a_{k-1}}$ for all $k\in\mathbb{N}\cap(j,D]$. This implies that $\mathcal{R}^\theta_a(y)=\mathcal{R}^\vartheta_a(z)$ for all $y,z\in\mathbb{R}^{a_0}$, which completes the proof of Lemma 3.2.

The next lemma shows that networks with parameters in $I_a$ cannot perform better than a constant solution to the learning task.


Lemma 3.3. Assume Setting 2.1 and let $\theta\in I_a$. Then $\mathcal{L}(\theta)\ge\inf_{b\in\mathbb{R}}\mathbb{E}[|b-B|]$.

Proof. Let $\zeta\in\Omega$. By Lemma 3.2, one has $\mathcal{R}^\theta_a(x)=\mathcal{R}^\theta_a(0)$ for all $x\in\mathbb{R}^d$. Therefore, we obtain $\mathcal{R}^\theta_a(X(\omega))=\mathcal{R}^\theta_a(X(\zeta))$ for all $\omega\in\Omega$. In particular, $\mathcal{L}(\theta)=\mathbb{E}\bigl[|(c\circ\mathcal{R}^\theta_a)(X(\zeta))-B|\bigr]\ge\inf_{b\in\mathbb{R}}\mathbb{E}[|b-B|]$.

Finally, we show that SGD cannot escape from $I_a$.

Lemma 3.4. Assume Setting 2.1 and let $n,t\in\mathbb{N}$, $\omega\in\Omega$, $j\in\mathbb{N}\cap(1,D)$. Suppose that $\Theta^{n,0}(\omega)\in I_{a,j}$. Then $\Theta^{n,t}(\omega)\in I_{a,j}$.

Proof. Denote $m_0=\sum_{i=1}^{j-1}a_i(a_{i-1}+1)$ and $m_1=\sum_{i=1}^{j}a_i(a_{i-1}+1)$. We prove by induction that for all $s\in\mathbb{N}_0$ we have $\Theta^{n,s}(\omega)\in I_{a,j}$. The case $s=0$ is true by assumption. Now suppose that $s\in\mathbb{N}_0$ and $\theta=(\theta_1,\dots,\theta_{P(a)})\in\mathbb{R}^{P(a)}$ satisfy $\theta=\Theta^{n,s}(\omega)\in I_{a,j}$. Let $U\subseteq\mathbb{R}^{P(a)}$ be the set given by $U=\{(\theta_1,\dots,\theta_{m_0})\}\times(-\infty,0)^{m_1-m_0}\times\{(\theta_{m_1+1},\dots,\theta_{P(a)})\}$. Then $\theta\in U\subseteq I_{a,j}$. By Lemma 3.2, we have $\mathcal{R}^\varphi_a(x)=\mathcal{R}^\theta_a(x)$ for all $\varphi\in U$ and $x\in\mathbb{R}^d$. Hence, $\mathcal{L}^{n,s+1}(\varphi,\omega)=\mathcal{L}^{n,s+1}(\theta,\omega)$ for all $\varphi\in U$ and, as a consequence, $\frac{\partial}{\partial\theta_k}\mathcal{L}^{n,s+1}(\theta,\omega)=0$ for all $k\in\mathbb{N}\cap(m_0,m_1]$. So, it follows from (2.3), (2.5), and the induction hypothesis that $\Theta^{n,s+1}(\omega)\in I_{a,j}$, which completes the proof of Lemma 3.4.

4. Quantitative lower bounds for the SGD method in the training of DNNs

In this section, we establish in Proposition 4.2 a quantitative lower bound for the error of the SGD method in the training of DNNs.

Lemma 4.1. Assume Setting 2.1 and suppose $D\ge3$. For all $j\in\{1,\dots,D-1\}$, denote $k_j=\sum_{i=1}^{j}a_i(a_{i-1}+1)$, $p=\inf_{i\in\{1,\dots,P(a)\}}\mathbb{P}(\Theta^{1,0}_i<0)$, and $W=\max\{a_1,\dots,a_{D-1}\}$. Then
$$\mathbb{P}\bigl(\forall\, n\in\{1,\dots,N\},\,t\in\mathbb{N}_0:\Theta^{n,t}\in I_a\bigr) = \Bigl[1-\prod_{j=2}^{D-1}\Bigl(1-\prod_{i=1+k_{j-1}}^{k_j}\mathbb{P}(\Theta^{1,0}_i<0)\Bigr)\Bigr]^N \ge \bigl[1-(1-p^{W(W+1)})^{D-2}\bigr]^N. \tag{4.1}$$

Proof. It follows from the independence of $\Theta^{1,0}_1,\dots,\Theta^{1,0}_{P(a)}$ that
$$\mathbb{P}(\Theta^{1,0}\in I_a) = \mathbb{P}\bigl(\exists\, j\in\mathbb{N}\cap(1,D)\colon\forall\, i\in\mathbb{N}\cap(k_{j-1},k_j]\colon\Theta^{1,0}_i<0\bigr) = 1-\prod_{j=2}^{D-1}\Bigl(1-\prod_{i=1+k_{j-1}}^{k_j}\mathbb{P}(\Theta^{1,0}_i<0)\Bigr). \tag{4.2}$$
By definition of $p$ and $W$, the right hand side is greater than or equal to $1-(1-p^{W(W+1)})^{D-2}$. Moreover, Lemma 3.4 and the assumption that $\Theta^{1,0},\dots,\Theta^{N,0}$ are i.i.d. yield
$$\mathbb{P}\bigl(\forall\, n\in\{1,\dots,N\},\,t\in\mathbb{N}_0:\Theta^{n,t}\in I_a\bigr) = \mathbb{P}\bigl(\forall\, n\in\{1,\dots,N\}:\Theta^{n,0}\in I_a\bigr) = \bigl(\mathbb{P}(\Theta^{1,0}\in I_a)\bigr)^N, \tag{4.3}$$
which completes the proof of Lemma 4.1.
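Equation (4.2) can be sanity-checked by simulation. In the sketch below (an illustration with assumed values, not an experiment from the paper), the coordinates of $\Theta^{1,0}$ are i.i.d. and symmetric around zero, so $\mathbb{P}(\Theta^{1,0}_i<0)=1/2$ for every $i$, and the empirical frequency of the event $\{\Theta^{1,0}\in I_a\}$ is compared with the closed-form product.

```python
import numpy as np

rng = np.random.default_rng(2)
a = (2, 2, 2, 2, 1)                                        # D = 4, three hidden layers of width 2
D = len(a) - 1
sizes = [a[j] * (a[j - 1] + 1) for j in range(1, D + 1)]   # number of parameters per layer
P = sum(sizes)

# closed form from (4.2) with P(Theta_i^{1,0} < 0) = 1/2 for every coordinate
closed_form = 1.0 - np.prod([1.0 - 0.5 ** s for s in sizes[1:D - 1]])

def in_I_a(theta):
    # theta lies in I_a iff some layer j in {2, ..., D-1} has only negative parameters
    k = sizes[0]
    for s in sizes[1:D - 1]:
        if np.all(theta[k:k + s] < 0):
            return True
        k += s
    return False

samples = rng.uniform(-1.0, 1.0, size=(200_000, P))
empirical = np.mean([in_I_a(th) for th in samples])
print(f"closed form: {closed_form:.5f}, Monte Carlo: {empirical:.5f}")
```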

Proposition 4.2. Under the same assumptions as in Lemma 4.1, one has
$$V = \mathbb{E}[\min\{\mathcal{L}(\Theta^k),1\}] \ge \Bigl[1-\prod_{j=2}^{D-1}\Bigl(1-\prod_{i=1+k_{j-1}}^{k_j}\mathbb{P}(\Theta^{1,0}_i<0)\Bigr)\Bigr]^N \min\Bigl\{\inf_{b\in\mathbb{R}}\mathbb{E}[|b-B|],\,1\Bigr\} \ge \bigl[1-(1-p^{W(W+1)})^{D-2}\bigr]^N \min\Bigl\{\inf_{b\in\mathbb{R}}\mathbb{E}[|b-B|],\,1\Bigr\}. \tag{4.4}$$


Proof. Denote $C=\min\{\inf_{b\in\mathbb{R}}\mathbb{E}[|b-B|],1\}$ and observe that Lemma 3.3 implies for all $\omega\in\Omega$ with $\Theta^{k(\omega)}(\omega)\in I_a$ that $\min\{\mathcal{L}(\Theta^{k(\omega)}(\omega)),1\}\ge C$. Markov's inequality hence ensures that
$$C\,\mathbb{P}(\Theta^k\in I_a) \le C\,\mathbb{P}\bigl(\min\{\mathcal{L}(\Theta^k),1\}\ge C\bigr) \le V. \tag{4.5}$$
Combining this with Lemma 4.1 and the fact that $\mathbb{P}(\Theta^k\in I_a)\ge\mathbb{P}(\forall\, n\in\{1,\dots,N\},\,t\in\mathbb{N}_0:\Theta^{n,t}\in I_a)$ establishes (4.4).

Let us briefly discuss how the inequality in Proposition 4.2 relates to prior work in the literature. Fix a depth $D\in\mathbb{N}$ and consider the problem of distributing a given number of neurons among the $D-1$ hidden layers. In order to minimize the chance of starting with an inactive network, one needs to minimize the quantity $1-\prod_{j=2}^{D-1}\bigl(1-\prod_{i=1+k_{j-1}}^{k_j}\mathbb{P}(\Theta^{1,0}_i<0)\bigr)$ from (4.4). Under the assumption that $\mathbb{P}(\Theta^{1,0}_i<0)$ does not depend on $i$, this can be achieved by choosing the same number of neurons in each layer.

The effects of initialization and architecture on early training have also been studied in [8,9]. While [8] investigates the problem of vanishing and exploding gradients, [9] studies two failure modes associated with poor starting conditions. Both find that, given a total number of neurons to spend, distributing them evenly among the hidden layers yields the best results. This is in line with our findings.

5. Main results

In this section, we prove the paper's main results, Theorem 5.3 and Corollary 5.4. While Theorem 5.3 provides precise quantitative conditions under which SGD does not converge in the training of DNNs, Corollary 5.4 is a qualitative result. To prove them, we need the following elementary result. Throughout, log denotes the natural logarithm.

Lemma 5.1. Let $D,N,W\in(0,\infty)$ and $\kappa,p\in(0,1)$ be such that $D\ge|\log(p)|\,W\,p^{-W}$ and $N\le|\log(\kappa)|\,(1-p^W)^{1-D}$. Then $\bigl[1-(1-p^W)^D\bigr]^N\ge\kappa$.

Proof. Let the functions $f:[0,1)\to\mathbb{R}$ and $g:[0,1)\to\mathbb{R}$ be given by $f(x)=x+\log(1-x)$ and $g(x)=(1-p^W)^{-1}x+\log(1-x)$. Since $f(0)=0$ and $f'(x)=1-(1-x)^{-1}<0$ for all $x\in(0,1)$, one has $|\log(1-x)|^{-1}<x^{-1}$ for all $x\in(0,1)$. Hence, $D>|\log(p)|\,W\,|\log(1-p^W)|^{-1}$, from which it follows that $(1-p^W)^D<p^W$. In addition, $g(0)=0$ and $g'(x)=(1-p^W)^{-1}-(1-x)^{-1}>0$ for all $x\in(0,p^W)$, which implies that $|\log(1-x)|<(1-p^W)^{-1}x$ for all $x\in(0,p^W)$. Hence, we deduce from $(1-p^W)^D<p^W$ that $N\,|\log(1-(1-p^W)^D)|<N\,(1-p^W)^{D-1}\le|\log(\kappa)|$, and taking the exponential yields the desired statement.
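A quick numerical check of Lemma 5.1 with assumed example values: choose $p$, $W$, and $\kappa$, take $D$ and $N$ at the extremes allowed by the hypotheses, and verify the conclusion.

```python
import math

# assumed example values
p, W, kappa = 0.5, 3, 0.5
D = math.ceil(abs(math.log(p)) * W * p ** (-W))                  # D >= |log p| * W * p^{-W}
N = math.floor(abs(math.log(kappa)) * (1 - p ** W) ** (1 - D))   # N <= |log kappa| * (1 - p^W)^{1-D}
value = (1 - (1 - p ** W) ** D) ** N
print(D, N, value, value >= kappa)                               # D = 17, N = 5, and the bound of Lemma 5.1 holds
```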

We proved Proposition 4.2 in the abstract framework of Setting 2.1. For the sake of concreteness, we now return to the setup of the introduction. We quickly recall it below.

Setting 5.2. Let $u,\mathbf{u}\in\mathbb{R}$, $v\in(u,\infty)$, $\mathbf{v}\in(\mathbf{u},\infty)$, $c\in C(\mathbb{R},\mathbb{R})$, $d\in\mathbb{N}$, and $(\gamma_t)_{t\in\mathbb{N}}\subseteq\mathbb{R}$. Consider functions $X^{n,t}_j:\Omega\to[u,v]^d$ and $Y^{n,t}_j:\Omega\to[\mathbf{u},\mathbf{v}]$, $j,n,t\in\mathbb{N}_0$, on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that $X^{0,0}_0$ and $Y^{0,0}_0$ are random variables. Let $\mathcal{E}:[u,v]^d\to[\mathbf{u},\mathbf{v}]$ be a measurable function such that $\mathbb{P}$-a.s. $\mathcal{E}(X^{0,0}_0)=\mathbb{E}[Y^{0,0}_0\mid X^{0,0}_0]$. Let $\mathcal{L}^{n,t}_{a,m}:\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}$, $m\in\mathbb{N}$, $n,t\in\mathbb{N}_0$, $a\in\mathbf{A}_d$, be given by
$$\mathcal{L}^{n,t}_{a,m}(\theta) = \frac{1}{m}\sum_{j=1}^{m}\bigl|(c\circ\mathcal{R}^\theta_a)(X^{n,t}_j)-Y^{n,t}_j\bigr|^2, \tag{5.1}$$
and assume $G^{n,t}_{a,m}=(G^{n,t}_{a,m,1},\dots,G^{n,t}_{a,m,P(a)}):\mathbb{R}^{P(a)}\times\Omega\to\mathbb{R}^{P(a)}$ are mappings satisfying
$$G^{n,t}_{a,m,i}(\theta,\omega) = \frac{\partial}{\partial\theta_i}\mathcal{L}^{n,t}_{a,m}(\theta,\omega) \tag{5.2}$$
