
4.3 Matrix Bernstein inequality in the subexponential case

As we mentioned above, one of the prominent applications of the uniform Hanson-Wright inequalities is the recent concentration result in the Gaussian covariance estimation problem.

It is known that covariance estimation problems may alternatively be approached via the matrix Bernstein inequality. Following the truncation approach taken above, we provide a version of the matrix Bernstein inequality that does not require uniformly bounded matrices. The standard version of the inequality (see Tropp (2012) and the references therein) may be formulated as follows: consider independent mean-zero symmetric random matrices $X_1, \ldots, X_N \in \mathbb{R}^{n\times n}$ satisfying $\|X_i\| \le U$ almost surely; then, with $\sigma^2 = \big\|\sum_{i=1}^N \mathbb{E}X_i^2\big\|$, for all $u \ge 0$,

$$\mathbb{P}\Big(\Big\|\sum_{i=1}^N X_i\Big\| \ge u\Big) \le 2n\exp\Big(-\frac{u^2}{2(\sigma^2 + Uu/3)}\Big).$$

The first problem with this result is that it does not hold in the general case when only $\max_i \big\|\|X_i\|\big\|_{\psi_1}$ or $\max_i \big\|\|X_i\|\big\|_{\psi_2}$ is bounded.

The second problem is the dependence on the dimension $n$, which does not allow applying it to operators in Hilbert spaces. For a positive definite real square matrix $A$ we define the effective rank as $\tilde r(A) = \mathrm{tr}(A)/\|A\|$. We show the following bound.
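As a quick numerical illustration (not part of the original argument; it assumes NumPy is available), the effective rank of a covariance matrix with a fast-decaying spectrum stays bounded even when the ambient dimension $n$ grows:

```python
import numpy as np

def effective_rank(A: np.ndarray) -> float:
    """Effective rank r(A) = tr(A) / ||A||, with ||.|| the operator (spectral) norm."""
    return np.trace(A) / np.linalg.norm(A, ord=2)

# Covariance with eigenvalues 2^{-k}: the trace is at most 2 regardless of n,
# so the effective rank stays close to 2 while the dimension is 50.
n = 50
A = np.diag(2.0 ** (-np.arange(n)))
print(effective_rank(A))  # ~ 2, much smaller than n
```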

Proposition 4.3. Suppose we have independent symmetric random matrices $X_1, \ldots, X_N \in \mathbb{R}^{n\times n}$, each satisfying $\big\|\|X_i\|\big\|_{\psi_1} < \infty$. Set $M = \big\|\max_{i\le N}\|X_i\|\big\|_{\psi_1}$ and let a positive-definite matrix $R$ be such that $\mathbb{E}\sum_{i=1}^N X_i^2 \preceq R$. Finally, set $\sigma^2 = \|R\|$. There are absolute constants $c, C > 0$ for which the following deviation bound holds.

Remark 4.8. Using the well-known bound for the maximum of subexponential random variables (see Ledoux and Talagrand (2013)) we have

$$\Big\|\max_{i\le N}\|X_i\|\Big\|_{\psi_1} \lesssim \log N\,\max_{i\le N}\big\|\|X_i\|\big\|_{\psi_1}.$$

When $n=1$ the effective rank plays no role and our bound recovers the version of the classical Bernstein inequality which is due to Adamczak (2008). In that paper it is also shown that the $\log N$ factor cannot be removed in general, meaning that $M = \big\|\max_{i\le N}\|X_i\|\big\|_{\psi_1}$ cannot be replaced by $\max_{i\le N}\big\|\|X_i\|\big\|_{\psi_1}$ in general.
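A small simulation (not from the original text, assuming NumPy) illustrates why the $\log N$ factor appears: the expected maximum of $N$ independent exponential (hence subexponential) variables grows like $\log N$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical mean of the maximum of N standard exponential variables versus log N.
# The maximum of N subexponential variables typically grows like log N, which is
# exactly the gap between M and max_i ||X_i||_{psi_1} discussed in Remark 4.8.
for N in (10, 100, 1000, 10000):
    maxima = rng.exponential(size=(1000, N)).max(axis=1)
    print(f"N={N:6d}  mean max={maxima.mean():.2f}  log N={np.log(N):.2f}")
```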

Proof. Fix $U > 0$ and consider the decomposition

$$X_i = Y_i + Z_i, \qquad Y_i = X_i\,\mathbb{I}(\|X_i\| \le U), \qquad Z_i = X_i\,\mathbb{I}(\|X_i\| > U),$$

so that the matrices $Y_i$ are uniformly bounded by $U$ in operator norm. By the triangle inequality and the union bound,

$$\mathbb{P}\Big(\Big\|\sum_{i=1}^N (X_i - \mathbb{E}X_i)\Big\| \ge 2u\Big) \le \mathbb{P}\Big(\Big\|\sum_{i=1}^N (Y_i - \mathbb{E}Y_i)\Big\| \ge u\Big) + \mathbb{P}\Big(\Big\|\sum_{i=1}^N (Z_i - \mathbb{E}Z_i)\Big\| \ge u\Big),$$

so that the two parts can be treated separately. Throughout the proof $c > 0$ is an absolute constant which may change from line to line. It is known that uniformly bounded random matrices satisfy a Bernstein-type inequality for $u \ge 1$ (see Theorem 3.1 in Minsker (2017)). Since we need the variance term to be expressed through the original matrices $X_i$ rather than the truncated matrices $Y_i$, we use the following modification of the proof of Minsker's theorem. Using the notation of his proof, it follows from Lemma 3.1 in Minsker (2017) that

$$\log\mathbb{E}\exp\big(\theta(Y_i - \mathbb{E}Y_i)\big) \preceq \frac{\varphi(\theta U)}{U^2}\,\mathbb{E}(Y_i - \mathbb{E}Y_i)^2 \preceq \frac{\varphi(\theta U)}{U^2}\,2\,\mathbb{E}Y_i^2 \preceq \frac{\varphi(\theta U)}{U^2}\,2\,\mathbb{E}X_i^2.$$

Now, using the same lines of the proof, instead of formula (3.4) we have

$$\mathbb{E}\,\mathrm{tr}\,\varphi\Big(\theta\sum_{i=1}^N (Y_i - \mathbb{E}Y_i)\Big) \le \dots,$$

where $\sigma^2 = \|R\|$. Following the last lines of the proof of Theorem 3.1 we finally have

$$\mathbb{P}\Big(\Big\|\sum_{i=1}^N (Y_i - \mathbb{E}Y_i)\Big\| \ge u\Big) \le \dots$$


Thus, we can apply Proposition 6.8 from Ledoux and Talagrand (2013) to the $Z_i$, taking values in the Banach space $(\mathbb{R}^{n\times n}, \|\cdot\|)$ equipped with the spectral norm. We have

$$\mathbb{E}\Big\|\sum_{i=1}^N Z_i\Big\| \le \dots,$$

which implies, with some constant $K > 0$,

$$\mathbb{E}\dots$$

Using Theorem 6.21 from Ledoux and Talagrand (2013) in $(\mathbb{R}^{n\times n}, \|\cdot\|)$ we have

$$\mathbb{P}\Big(\Big\|\sum_{i=1}^N (Z_i - \mathbb{E}Z_i)\Big\| \ge u\Big) \le \dots,$$

where $c > 0$ is an absolute constant. Combining it with (4.35), and using that for some absolute $C > 0$ we have $U \le C\dots$, the claim follows.

To the best of our knowledge, Proposition 4.3 is the first to combine two important properties: it simultaneously captures the effective rank instead of the dimension $n$ and is valid for matrices with subexponential operator norm (previously, the matrix Bernstein inequality in the unbounded case was available under the so-called Bernstein moment condition; we refer to Tropp (2012) and the references therein). We should also compare our results with

Proposition 2 of Koltchinskii (2011), which has the same form as our bound, but with the original dimension $n$ used instead of the effective rank and with $M = \big\|\max_{i\le N}\|X_i\|\big\|_{\psi_1}$ replaced by $\max_{i\le N}\big\|\|X_i\|\big\|_{\psi_1}$ multiplied by a logarithmic factor involving $N$ and $\max_{i\le N}\big\|\|X_i\|\big\|_{\psi_1}^2$.

Application to covariance estimation with missing observations

Now we turn to the problem studied in Koltchinskii and Lounici (2017) and Lounici (2014).

Suppose we want to estimate the covariance structure of a centered subgaussian random vector $X \in \mathbb{R}^n$ based on $N$ i.i.d. observations $X_1, \ldots, X_N$. For the sake of brevity we work with the finite-dimensional case, while, as in Koltchinskii and Lounici (2017), our results will not depend explicitly on the dimension $n$. Recall that a centered random vector $X \in \mathbb{R}^n$ is subgaussian if for all $u \in \mathbb{R}^n$ it holds

$$\|\langle X, u\rangle\|_{\psi_2} \lesssim \big(\mathbb{E}\langle X, u\rangle^2\big)^{1/2}, \qquad (4.36)$$

which does not require any independence of the components of $X$.

In what follows we discuss a more general framework suggested by Lounici (2014). Let $\delta_{i,j}$, $i \le N$, $j \le n$, be independent Bernoulli random variables with mean $\delta$. We assume that instead of observing $X_1, \ldots, X_N$ we observe vectors $Y_1, \ldots, Y_N$, defined coordinatewise as $Y_{ij} = \delta_{i,j}X_{ij}$. This means that some components of the vectors $X_1, \ldots, X_N$ are missing (replaced by zero), each with probability $1-\delta$. Since $\delta$ can be easily estimated, we assume that it is known. Following Lounici (2014), denote

$$\hat\Sigma^{(\delta)} = \frac{1}{N}\sum_{i=1}^N Y_iY_i^\top.$$

It can be easily shown that the estimator

$$\hat\Sigma = (\delta^{-1} - \delta^{-2})\,\mathrm{Diag}\big(\hat\Sigma^{(\delta)}\big) + \delta^{-2}\,\hat\Sigma^{(\delta)}$$

is an unbiased estimator of $\Sigma = \mathbb{E}X_iX_i^\top$. In particular,

$$\Sigma = (\delta^{-1} - \delta^{-2})\,\mathrm{Diag}\big(\mathbb{E}Y_iY_i^\top\big) + \delta^{-2}\,\mathbb{E}Y_iY_i^\top. \qquad (4.37)$$
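The identity (4.37) is easy to check numerically; the following sketch (not part of the original text, assuming NumPy) generates Gaussian data with missing entries and confirms that the debiased estimator $\hat\Sigma$ approaches $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of the debiasing formula (4.37): the estimator
# (1/delta - 1/delta^2) Diag(Sigma_delta) + (1/delta^2) Sigma_delta
# is unbiased for Sigma when entries are observed independently w.p. delta.
n, N, delta = 5, 200_000, 0.7
A = rng.standard_normal((n, n))
Sigma = A @ A.T                              # true covariance
X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
mask = rng.random((N, n)) < delta            # delta_{i,j} ~ Bernoulli(delta)
Y = X * mask                                 # observations with missing entries

Sigma_delta = Y.T @ Y / N                    # \hat{Sigma}^{(delta)}
Sigma_hat = (1/delta - 1/delta**2) * np.diag(np.diag(Sigma_delta)) \
            + (1/delta**2) * Sigma_delta
print(np.abs(Sigma_hat - Sigma).max())       # small; shrinks as N grows
```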

Theorem 4.3. Under the assumptions defined above, it holds with probability at least $1 - e^{-t}$, for $t \ge 1$,

Remark 4.9. The upper bound above provides an important improvement upon Proposition 3 in Lounici (2014), which reads

$$\|\hat\Sigma - \Sigma\| \lesssim \|\Sigma\|\max\big(\dots\big). \qquad (4.38)$$

The bound (4.38) depends on $n$ and therefore is not applicable in infinite-dimensional scenarios. It also contains a term proportional to $t^2$, which appears due to a straightforward truncation of each observation. Moreover, this result has an unnecessary factor $\tilde r(\Sigma)$ in the term involving $\sqrt{\tilde r(\Sigma)\,t}$. Finally, when $\delta = 1$, tighter results may be obtained using high-probability generic chaining bounds for quadratic processes. In particular, Theorem 9 in Koltchinskii and Lounici (2017) implies the bound (4.39). Unfortunately, this analysis may not be applied for $\delta < 1$ in general, since the assumption (4.36) will not hold for the vectors $Y_i$ defined by $Y_{ij} = \delta_{i,j}X_{ij}$. Therefore, our technique is a reasonable alternative which works for general $\delta$ and is almost as tight as (4.39) when $\delta = 1$.

To prove Theorem 4.3 we need the following technical lemma, parts of which may also be found in Lounici (2014). For a matrix $A$ let $\mathrm{Diag}(A)$ denote its diagonal part and define $\mathrm{Off}(A) = A - \mathrm{Diag}(A)$.

Lemma 4.9. Let $X \in \mathbb{R}^n$ satisfy (4.36) with covariance matrix $\Sigma$ and let $Y = (\delta_1 X_1, \ldots, \delta_n X_n)$, where $\delta_i$, $i \le n$, are independent Bernoulli random variables with mean $\delta$. Then it holds

$$\big\|\|\mathrm{Diag}(YY^\top)\|\big\|_{\psi_1} \lesssim \tilde r(\Sigma)\|\Sigma\|, \qquad \big\|\|\mathrm{Off}(YY^\top)\|\big\|_{\psi_1} \lesssim \tilde r(\Sigma)\|\Sigma\|.$$

Additionally, it holds for some absolute constant $C > 0$

$$\mathbb{E}\,\mathrm{Off}(YY^\top)^2 \preceq C\delta^2\,\mathrm{tr}(\Sigma)\big(\Sigma + \mathrm{Diag}(\Sigma)\big), \qquad \mathbb{E}\,\mathrm{Diag}(YY^\top)^2 \preceq C\delta\,\mathrm{tr}(\Sigma)\,\mathrm{Diag}(\Sigma). \qquad (4.40)$$

Proof. Observe that $\|\mathrm{Diag}(YY^\top)\| \le \|Y\|^2$ and $\|\mathrm{Off}(YY^\top)\| \le \|YY^\top\| + \|\mathrm{Diag}(YY^\top)\| \le 2\|Y\|^2$. Therefore,

$$\big\|\|\mathrm{Off}(YY^\top)\|\big\|_{\psi_1} \le 2\big\|\|Y\|\big\|_{\psi_2}^2 \le 2\big\|\|X\|\big\|_{\psi_2}^2 \lesssim \mathrm{tr}(\Sigma),$$

and the same bound holds for $\big\|\|\mathrm{Diag}(YY^\top)\|\big\|_{\psi_1}$.

Let $A$ be an arbitrary symmetric matrix and let us calculate $\mathbb{E}(A\circ\delta\delta^\top)^2$, where $\circ$ denotes the Hadamard product and $\delta = (\delta_1, \ldots, \delta_n)$ is a vector with independent components having Bernoulli distribution with mean $\delta$. We have

$$\big[\mathbb{E}(A\circ\delta\delta^\top)^2\big]_{ii} = \mathbb{E}\sum_k A_{ik}\delta_i\delta_k A_{ki}\delta_i\delta_k = \sum_k A_{ik}A_{ki}\,\mathbb{E}\,\delta_i^2\delta_k^2 = \delta^2[A^2]_{ii} + (\delta - \delta^2)A_{ii}^2.$$

For the element at position $ij$ with $i \ne j$ we have

$$\big[\mathbb{E}(A\circ\delta\delta^\top)^2\big]_{ij} = \mathbb{E}\sum_k A_{ik}\delta_i\delta_k A_{kj}\delta_j\delta_k = \sum_k A_{ik}A_{kj}\,\mathbb{E}\,\delta_i\delta_j\delta_k^2 = \delta^3[A^2]_{ij} + (\delta^2 - \delta^3)(A_{ii}A_{ij} + A_{ij}A_{jj}).$$

This can be put together in the following expression,

$$\mathbb{E}(\delta\delta^\top\circ A)^2 = \delta^3 A^2 + (\delta^2 - \delta^3)\big(\mathrm{Diag}(A^2) + \mathrm{Off}(A)\mathrm{Diag}(A) + \mathrm{Diag}(A)\mathrm{Off}(A)\big) + (\delta - \delta^2)\mathrm{Diag}(A)^2.$$

Note that all of these matrices are positive semi-definite, apart from the term $\mathrm{Off}(A)\mathrm{Diag}(A) + \mathrm{Diag}(A)\mathrm{Off}(A)$, which we can obviously bound by $\frac12\big(\mathrm{Off}(A) + \mathrm{Diag}(A)\big)^2 = A^2/2$. Taking into account $\delta \le 1$, we have a simple bound

$$\mathbb{E}(\delta\delta^\top\circ A)^2 \preceq \tfrac12(\delta^3 + \delta^2)A^2 + (\delta^2 - \delta^3)\mathrm{Diag}(A^2) + (\delta - \delta^2)\mathrm{Diag}(A)^2 \preceq \delta^2\big(A^2 + \mathrm{Diag}(A^2)\big) + \delta\,\mathrm{Diag}(A)^2.$$
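The exact identity for $\mathbb{E}(\delta\delta^\top\circ A)^2$ derived above can be verified mechanically; the following sketch (not part of the original proof, assuming NumPy) enumerates all $2^n$ masking patterns for a small symmetric $A$ and compares the exact expectation with the closed form:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, p = 4, 0.3                          # dimension and Bernoulli mean delta
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # arbitrary symmetric matrix

# Exact E[(A o dd^T)^2] by enumerating all 2^n patterns d in {0,1}^n.
expected = np.zeros((n, n))
for bits in itertools.product([0, 1], repeat=n):
    d = np.array(bits, dtype=float)
    prob = np.prod(np.where(d == 1, p, 1 - p))
    M = A * np.outer(d, d)             # Hadamard product A o dd^T
    expected += prob * (M @ M)

# Closed-form expression from the display above.
D = np.diag(np.diag(A))                # Diag(A)
O = A - D                              # Off(A)
closed = (p**3 * A @ A
          + (p**2 - p**3) * (np.diag(np.diag(A @ A)) + O @ D + D @ O)
          + (p - p**2) * D @ D)

print(np.allclose(expected, closed))   # True
```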

Now recall that $Y = \mathrm{diag}(\delta)X$, therefore $\mathrm{Off}(YY^\top) = \delta\delta^\top\circ\mathrm{Off}(XX^\top)$. Since the latter has zero diagonal, the term with $\delta$ in the formula above disappears. Therefore,

$$\mathbb{E}\,\mathrm{Off}(YY^\top)^2 \preceq \delta^2\Big[\mathbb{E}\,\mathrm{Off}(XX^\top)^2 + \mathrm{Diag}\big(\mathbb{E}\,\mathrm{Off}(XX^\top)^2\big)\Big]. \qquad (4.41)$$

It holds $\mathbb{E}\,\mathrm{Off}(XX^\top)^2 \preceq 2\,\mathbb{E}(XX^\top)^2 + 2\,\mathbb{E}\,\mathrm{Diag}(XX^\top)^2$, and we also have from Lounici (2014) that $\mathbb{E}(XX^\top)^2 \preceq C\,\mathrm{tr}(\Sigma)\,\Sigma$. Additionally, due to the subgaussianity (4.36) we have $\mathbb{E}X_i^4 \lesssim \Sigma_{ii}^2$. Finally, the following bound holds

$$\mathbb{E}\,\mathrm{Diag}(XX^\top)^2 \preceq C\,\mathrm{Diag}(\Sigma)^2 \preceq C\,\mathrm{tr}(\Sigma)\,\mathrm{Diag}(\Sigma).$$

Plugging these bounds into (4.41) we get the bound on $\mathbb{E}\,\mathrm{Off}(YY^\top)^2$ in (4.40).

As for the diagonal, we have for $A = \mathrm{Diag}(XX^\top)$,

$$\mathbb{E}\,\mathrm{Diag}(YY^\top)^2 \preceq 3\delta\,\mathbb{E}\,\mathrm{Diag}(XX^\top)^2 \preceq C\delta\,\mathrm{tr}(\Sigma)\,\mathrm{Diag}(\Sigma).$$

Lemma 4.10. For $Y$ as in Lemma 4.9 and any unit vector $u \in \mathbb{R}^n$ it holds

$$\|u^\top\mathrm{Off}(YY^\top)u\|_{L_2} \lesssim \delta\|\Sigma\|, \qquad \|u^\top\mathrm{Diag}(YY^\top)u\|_{L_2}^2 \lesssim \delta\|\Sigma\|^2.$$

Proof. Let $v \in \mathbb{R}^n$ be an arbitrary unit vector as well. First we want to check that

$$\|u^\top\mathrm{Diag}(XX^\top)v\|_{L_4} \lesssim \|\Sigma\|, \qquad \|u^\top\mathrm{Off}(XX^\top)v\|_{L_4} \lesssim \|\Sigma\|. \qquad (4.42)$$

Obviously, $\|u^\top XX^\top v\|_{L_4} \le \|u^\top X\|_{L_8}\|v^\top X\|_{L_8} \lesssim \|\Sigma\|$, so it is enough to check the claim for the diagonal part only. Let us apply a symmetrization argument. Suppose $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_d)^\top$ are independent Rademacher variables; then

$$u^\top\mathrm{Diag}(XX^\top)v = \mathbb{E}_\varepsilon\,\varepsilon^\top\mathrm{diag}(u)\,XX^\top\mathrm{diag}(v)\,\varepsilon = \mathbb{E}_\varepsilon\,u_\varepsilon^\top XX^\top v_\varepsilon,$$

where $u_\varepsilon = (u_1\varepsilon_1, \ldots, u_d\varepsilon_d)^\top$ and $\mathbb{E}_\varepsilon$ denotes expectation conditioned on $X$. Then, by the Jensen and Hölder inequalities,

$$\mathbb{E}\big(u^\top\mathrm{Diag}(XX^\top)v\big)^4 \le \mathbb{E}\big(u_\varepsilon^\top XX^\top v_\varepsilon\big)^4 \le \mathbb{E}_\varepsilon\,\mathbb{E}^{1/2}\big[(u_\varepsilon^\top X)^8\,\big|\,\varepsilon\big]\,\mathbb{E}^{1/2}\big[(v_\varepsilon^\top X)^8\,\big|\,\varepsilon\big] \lesssim \|\Sigma\|^4,$$

thus implying (4.42).
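The symmetrization identity used above, $u^\top\mathrm{Diag}(XX^\top)v = \mathbb{E}_\varepsilon\,u_\varepsilon^\top XX^\top v_\varepsilon$, can be checked directly; here is a small sketch (not part of the original text, assuming NumPy) that enumerates all Rademacher sign patterns for a fixed $X$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
d = 4
u, v, X = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)

# Average of (u o eps)^T X X^T (v o eps) over all sign patterns eps in {-1,+1}^d:
# the cross terms cancel and only the diagonal of X X^T survives.
total = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=d):
    eps = np.array(signs)
    total += ((u * eps) @ X) * (X @ (v * eps))
average = total / 2 ** d

diag_form = u @ (np.diag(X ** 2) @ v)        # u^T Diag(X X^T) v
print(np.isclose(average, diag_form))        # True
```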

Next, let us consider a symmetric matrix $B$ with zero diagonal. We have

$$\dots$$

Therefore, due to the fact that $B$ is symmetric, we have

$$\mathbb{E}\big(\delta^\top B\delta\big)^2 \le \dots$$

As for the diagonal, we have

$$\mathbb{E}\dots$$

Before we proceed with the proof of the deviation bound let us present the following version of Talagrand's concentration inequality for empirical processes, which will help us to capture the tail behavior in the subgaussian regime. Remarkably, the following result can be proven using very similar techniques: first one may use the modified logarithmic Sobolev inequality to prove a version of Talagrand's concentration inequality in the bounded case and then use truncation as in the proof of Theorem 4.1 to get the result in the unbounded case.

Theorem 4.4 (Theorem 4 in Adamczak (2008)). Let $X_1, \ldots, X_N \in \mathcal{X}$ be an independent sample

Proof of Theorem 4.3. At first, using (4.37) we have

$$\|\hat\Sigma - \Sigma\| \lesssim \delta^{-2}\big\|\mathrm{Off}(\hat\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Off}(\hat\Sigma^{(\delta)})\big\| + \delta^{-1}\big\|\mathrm{Diag}(\hat\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Diag}(\hat\Sigma^{(\delta)})\big\|.$$

We have $\tilde r(R) \le 2\tilde r(\Sigma)$ and $\|R\| \lesssim N\delta^2\,\mathrm{tr}(\Sigma)\|\Sigma\|$. Therefore, with probability at least $1 - e^{-t}$ the corresponding deviation bound holds for the off-diagonal part. Integrating this bound (see, e.g., Theorem 2.3 in Boucheron et al. (2013)) we easily get

$$\mathbb{E}\big\|\mathrm{Off}(\hat\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Off}(\hat\Sigma^{(\delta)})\big\| \lesssim \|\Sigma\|\max\big(\dots\big).$$

We proceed with the diagonal term. Applying Proposition 4.3 to the sum $N\,\mathrm{Diag}(\hat\Sigma^{(\delta)}) = \sum_{i=1}^N \mathrm{Diag}(Y_iY_i^\top)$ with $R = CN\delta\,\mathrm{tr}(\Sigma)\,\mathrm{Diag}(\Sigma)$ we have $\tilde r(R) \lesssim \tilde r(\Sigma)$ and $\|R\| \lesssim N\delta\,\mathrm{tr}(\Sigma)\|\Sigma\|$. Thus, with probability at least $1 - e^{-t}$ we get

$$\big\|\mathrm{Diag}(\hat\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Diag}(\hat\Sigma^{(\delta)})\big\| \lesssim \|\Sigma\|\max\big(\dots\big).$$

Again, integrating this inequality we get a bound for the expectation,

$$\mathbb{E}\big\|\mathrm{Diag}(\hat\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Diag}(\hat\Sigma^{(\delta)})\big\| \lesssim \|\Sigma\|\max\big(\dots\big).$$

We have $\|u^\top\mathrm{Diag}(Y_iY_i^\top)u\|_{L_2}^2 \lesssim \delta\|\Sigma\|^2$ and $\big\|\max_i\|\mathrm{Off}(Y_iY_i^\top)\|\big\|_{\psi_1} \lesssim \tilde r(\Sigma)\|\Sigma\|\log N$ by Lemma 4.10 and Lemma 4.9. By Theorem 4.4 we have, with probability at least $1 - e^{-t}$,

$$\big\|\mathrm{Diag}(\hat\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Diag}(\hat\Sigma^{(\delta)})\big\| \le 2\,\mathbb{E}\big\|\mathrm{Diag}(\hat\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Diag}(\hat\Sigma^{(\delta)})\big\| + \|\Sigma\|\sqrt{\frac{\delta t}{N}} + \|\Sigma\|\frac{\tilde r(\Sigma)\,t\log N}{N}$$

$$\lesssim \|\Sigma\|\max\left(\sqrt{\frac{\delta\,\tilde r(\Sigma)\log\tilde r(\Sigma)}{N}},\ \sqrt{\frac{\delta t}{N}},\ \frac{\tilde r(\Sigma)\big(\log\tilde r(\Sigma) + t\big)\log N}{N}\right).$$

It is left to combine the off-diagonal and diagonal bounds,

$$\|\hat\Sigma - \Sigma\| \le \delta^{-2}\big\|\mathrm{Off}(\hat\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Off}(\hat\Sigma^{(\delta)})\big\| + \delta^{-1}\big\|\mathrm{Diag}(\hat\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Diag}(\hat\Sigma^{(\delta)})\big\|.$$

4.4 Approximation argument for non-smooth functions

In this section we explain how one can rigorously apply the Sobolev inequality to functions that are not everywhere differentiable. In order to use Assumption (4.6), we need to take smooth approximations of the function

$$Z(X) = \sup_{A}\big(X^\top AX - \mathbb{E}X^\top AX\big).$$

Notice that we have

$$|Z(X) - Z(Y)| \le \|X - Y\|\Big(\sup_{A}\|AX\| + \sup_{A}\|AY\|\Big).$$
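For the reader's convenience, here is the elementary estimate behind this bound (a short verification, not spelled out in the original; we assume the matrices $A$ in the supremum are symmetric, otherwise $\|AY\|$ is replaced by $\|A^\top Y\|$): for any such $A$,

$$X^\top AX - Y^\top AY = (X - Y)^\top AX + Y^\top A(X - Y),$$

so by the Cauchy-Schwarz inequality $|X^\top AX - Y^\top AY| \le \|X - Y\|\big(\|AX\| + \|AY\|\big)$, and taking suprema over $A$ (using $\sup_A f_A - \sup_A g_A \le \sup_A(f_A - g_A)$ in both directions) gives the displayed inequality.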

The following simple lemma shows how to apply the logarithmic Sobolev inequality to non-differentiable functions that satisfy such an inequality.

Lemma 4.11. Suppose a random vector $X$ satisfies Assumption 4.1. Let $f:\mathbb{R}^n\to\mathbb{R}$ be such that

$$|f(x) - f(y)| \le |x - y|\max\big(L(x), L(y)\big),$$

for some continuous $L(x) \ge 0$. Then, for some absolute constant $C > 0$ and any $\lambda \in \mathbb{R}$ it holds

$$\mathrm{Ent}\big(e^{\lambda f}\big) \le CK^2\lambda^2\,\mathbb{E}\,L(X)^2 e^{\lambda f(X)}.$$

Proof. Set $h(x) = x^2(1 - x)_+^2$ and consider a smoothing kernel supported on the unit ball,

$$\dots$$

Moreover, we have due to the symmetry

$$\nabla F_m(x) = \dots$$

Then, by Assumption 4.1,

$$\mathrm{Ent}(F_m^2) \le K^2\,\mathbb{E}\|\nabla F_m(x)\|^2 \le 2C_g K^2\,\mathbb{E}\,L_m^2(x)\,\widetilde F_m(x)^2,$$

and taking the limit $m\to\infty$ gives the required inequality.

Appendix A Technical tools

A.1 Lasso and missing observations

Suppose we observe a signal $y \in \mathbb{R}^n$ of the form

$$y = \Phi b + \varepsilon,$$

where $\Phi = [\varphi_1, \ldots, \varphi_p] \in \mathbb{R}^{n\times p}$ is a dictionary of words $\varphi_j \in \mathbb{R}^n$ and $b$ is a sparse parameter with support $\Lambda \subset \{1, \ldots, p\}$. We want to recover the exact sparse representation by solving the quadratic program

$$\frac12\|y - \Phi b\|^2 + \gamma\|b\|_1 \to \min_{b\in\mathbb{R}^p}. \qquad (A.1)$$

Denote by $\mathbb{R}^\Lambda$ the set of vectors with elements indexed by $\Lambda$; for $x \in \mathbb{R}^p$ let $x_\Lambda \in \mathbb{R}^\Lambda$ be the result of taking only the elements indexed by $\Lambda$. With some abuse of notation we will also associate each vector $x_\Lambda \in \mathbb{R}^\Lambda$ with the vector $x$ from $\mathbb{R}^p$ that has the same coefficients on $\Lambda$ and zeros elsewhere. Let also $\Phi_\Lambda = [\varphi_j]_{j\in\Lambda}$ be the subdictionary composed of the words indexed by $\Lambda$ and let $P_\Lambda$ be the projector onto the corresponding subspace.

The following sufficient conditions for the global minimizer of (A.1) to be supported on $\Lambda$ are due to Tropp (2006), who uses the notion of the exact recovery coefficient,

$$\mathrm{ERC}_\Phi(\Lambda) = 1 - \max_{j\notin\Lambda}\|\Phi_\Lambda^{+}\varphi_j\|_1,$$

where $\Phi_\Lambda^{+}$ denotes the pseudoinverse of $\Phi_\Lambda$.
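As a concrete illustration (not part of the original text, assuming NumPy), the exact recovery coefficient of a given support can be computed directly from its definition; for incoherent, near-orthogonal dictionaries and small supports it is typically positive:

```python
import numpy as np

def erc(Phi: np.ndarray, support: list) -> float:
    """ERC_Phi(Lambda) = 1 - max_{j not in Lambda} ||Phi_Lambda^+ phi_j||_1."""
    p = Phi.shape[1]
    pinv = np.linalg.pinv(Phi[:, support])          # pseudoinverse of the subdictionary
    outside = [j for j in range(p) if j not in support]
    return 1.0 - max(np.abs(pinv @ Phi[:, j]).sum() for j in outside)

rng = np.random.default_rng(4)
Phi = rng.standard_normal((200, 50))
Phi /= np.linalg.norm(Phi, axis=0)                  # normalize the dictionary words
print(erc(Phi, support=[0, 1, 2]))                  # typically positive in this regime
```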

The results are summarized in the next theorem.

Theorem A.1 (Tropp (2006)). Let $\tilde b$ be a solution to (A.1). Suppose that $\|\Phi^\top\varepsilon\|_\infty \le \gamma\,\mathrm{ERC}(\Lambda)$. Then,

• the support of $\tilde b$ is contained in $\Lambda$;

• the distance between $\tilde b$ and the optimal (non-penalized) parameter satisfies

$$\|\tilde b - b\|_\infty \le \|\Phi_\Lambda^{+}\varepsilon\|_\infty + \gamma\big\|(\Phi_\Lambda^\top\Phi_\Lambda)^{-1}\big\|_{1,\infty}, \qquad \|\Phi_\Lambda(\tilde b - b) - P_\Lambda\varepsilon\|_2 \le \gamma\big\|(\Phi_\Lambda^{+})^\top\big\|_{2,\infty}.$$

In what follows we want to extend this result to allow for the missing observations model. Observe that the program (A.1) is equivalent to

$$\frac12 b^\top[\Phi^\top\Phi]b - b^\top[\Phi^\top y] + \gamma\|b\|_1 \to \min_{b\in\mathbb{R}^p},$$

so that for the minimization procedure only knowing $D = \Phi^\top\Phi$ and $c = \Phi^\top y$ is required.

Suppose that instead we only have access to some estimators $\hat D \ge 0$ and $\hat c$ that are close enough to the original matrix and vector, which may come, e.g., from the missing observations model. Then we can solve instead the following problem,

$$\frac12 b^\top\hat D b - b^\top\hat c + \gamma\|b\|_1 \to \min_{b\in\mathbb{R}^p}. \qquad (A.2)$$

In what follows we provide a slight extension of Tropp's result towards missing observations; the proof mainly follows the same steps.

Further, for a matrix $D$ and two sets of indices $A, B$ we denote the corresponding submatrix by $D_{A,B}$, and for a vector $c$ the corresponding subvector is $c_A$.

Lemma A.1. Suppose that

$$\big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\hat c_\Lambda - \hat c_{\Lambda^c}\big\|_\infty \le \gamma\Big(1 - \big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty}\Big).$$

Then the solution $\tilde b$ to (A.2) is supported on $\Lambda$.

Proof. Let $\tilde b$ be the solution to (A.2) under the restriction $\mathrm{supp}(b) \subset \Lambda$. Since $\hat D \ge 0$, this is a convex problem and therefore the solution is unique and satisfies

$$\hat D_{\Lambda,\Lambda}\tilde b - \hat c_\Lambda + \gamma g = 0, \qquad g \in \partial\|\tilde b\|_1,$$

where $\partial f(b)$ denotes the subdifferential of a convex function $f$ at a point $b$; in the case of the $\ell_1$ norm we have $\|g\|_\infty \le 1$. Thus,

$$\tilde b = \hat D_{\Lambda,\Lambda}^{-1}\hat c_\Lambda - \gamma\hat D_{\Lambda,\Lambda}^{-1}g. \qquad (A.3)$$

Next, we want to check that $\tilde b$ is a global minimizer. To do so, let us compare the objective function at a point $b = \tilde b + \delta e_j$ for an arbitrary index $j \notin \Lambda$. Since $\|b\|_1 = \|\tilde b\|_1 + |\delta|$, we have

$$L(b) - L(\tilde b) = \frac12 b^\top\hat D b - \frac12\tilde b^\top\hat D\tilde b - \hat c^\top(b - \tilde b) + \gamma|\delta| = \frac{\delta^2}{2}e_j^\top\hat D e_j + |\delta|\gamma + \delta e_j^\top\hat D\tilde b - \delta\hat c_j \ge |\delta|\gamma + \delta\big(e_j^\top\hat D\tilde b - \hat c_j\big),$$

where the latter comes from the fact that $\hat D$ is positive semi-definite. Applying the equality (A.3) yields

$$e_j^\top\hat D\tilde b = \hat D_{j,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\hat c_\Lambda - \gamma\hat D_{j,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}g,$$

therefore, taking into account $\|g\|_\infty \le 1$, we have

$$L(b) - L(\tilde b) \ge |\delta|\Big[\gamma\Big(1 - \big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty}\Big) - \big|\hat D_{j,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\hat c_\Lambda - \hat c_j\big|\Big],$$

where the right-hand side is nonnegative by the condition of the lemma. Since $j \notin \Lambda$ is arbitrary, $\tilde b$ is a global solution as well.

Remark A.1. It is not hard to see that in the exact case $\hat D = \Phi^\top\Phi$ and $\hat c = \Phi^\top y$ the condition of the lemma above turns into the condition $\|\Phi_{\Lambda^c}^\top(P_\Lambda - I)\varepsilon\|_\infty \le \gamma\,\mathrm{ERC}(\Lambda)$ of Theorem A.1.

Since we are particularly interested in applications to time series, the feature matrix $\Phi$ should in fact be random, so imposing an ERC-like condition directly on it might result in additional unnecessary technical difficulties. Instead, let us assume that there is some other matrix $\bar D$, potentially the expectation of $\Phi^\top\Phi$, which is close enough to $\hat D$ (with some probability, but we state all the results deterministically in this section), and the value that controls the exact recovery is

$$\mathrm{ERC}(\Lambda;\bar D) = 1 - \big\|\bar D_{\Lambda^c,\Lambda}\bar D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty}.$$

Additionally, we set $\bar c = \bar D b = \bar D_{\cdot,\Lambda}b_\Lambda$, the vector that $\hat c$ is intended to approximate. Note that in this case we have $\bar D_{\Lambda^c,\Lambda}\bar D_{\Lambda,\Lambda}^{-1}\bar c_\Lambda - \bar c_{\Lambda^c} = \bar D_{\Lambda^c,\Lambda}b_\Lambda - \bar c_{\Lambda^c} = 0$, thus the conditions of Lemma A.1 hold for $\bar D, \bar c$ once $\mathrm{ERC}(\Lambda;\bar D)$ and $\gamma$ are nonnegative. In what follows we control the values appearing in the lemma for $\hat D$ and $\hat c$ through the differences between $\bar c, \bar D$ and $\hat c, \hat D$, respectively, thus allowing the exact recovery of the sparsity pattern.

Corollary A.1. Let $\bar D$ and $\bar c$ be such that $\bar c = \bar D b$. Assume that

$$\|\hat c - \bar c\|_\infty \le \delta_c, \quad \big\|\bar D_{\Lambda,\Lambda}^{-1}(\hat c_\Lambda - \bar c_\Lambda)\big\|_\infty \le \delta_c', \quad \big\|\bar D_{\Lambda,\Lambda}^{-1}(\hat D_{\Lambda,\cdot} - \bar D_{\Lambda,\cdot})\big\|_{\infty,\infty} \le \delta_D,$$
$$\big\|(\hat D_{\cdot,\Lambda} - \bar D_{\cdot,\Lambda})b_\Lambda\big\|_\infty \le \delta_D', \quad \big\|\bar D_{\Lambda,\Lambda}^{-1}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})b_\Lambda\big\|_\infty \le \delta_D''.$$

Suppose $\mathrm{ERC}(\Lambda) \ge 3/4$ and

$$3\delta_c + 3\delta_D' \le \gamma, \qquad s\delta_D \le \frac{1}{16},$$

where $|\Lambda| = s$. Then the solution to (A.2) is supported on $\Lambda$ and satisfies

$$\tilde b_\Lambda = \hat D_{\Lambda,\Lambda}^{-1}\hat c_\Lambda - \gamma\hat D_{\Lambda,\Lambda}^{-1}g, \qquad (A.4)$$

with some $g \in \mathbb{R}^s$ satisfying $\|g\|_\infty \le 1$, and the max-norm error satisfies

$$\|\tilde b - b\|_\infty \le 2\big(\delta_D'' + \delta_c' + \gamma\|\bar D_{\Lambda,\Lambda}^{-1}\|_{1,\infty}\big),$$

while the $\ell_2$-norm error satisfies

$$\|\tilde b - b\| \le 2\sqrt{s}\big(\delta_D'' + \delta_c' + \gamma\,\sigma_{\min}^{-1}\big).$$

If additionally $2\big(\delta_D'' + \delta_c' + \gamma\|\bar D_{\Lambda,\Lambda}^{-1}\|_{1,\infty}\big) \le \min_{j\in\Lambda}|b_j|$, then we have the exact recovery, so that the following equality takes place

$$\tilde b_\Lambda = \hat D_{\Lambda,\Lambda}^{-1}\hat c_\Lambda - \gamma\hat D_{\Lambda,\Lambda}^{-1}s_\Lambda, \qquad \text{where } s = \mathrm{sign}(b).$$

Proof. First observe that $D_{\Lambda^c,\Lambda}D_{\Lambda,\Lambda}^{-1}c_\Lambda - c_{\Lambda^c} = \Phi_{\Lambda^c}^\top(\Phi_\Lambda\Phi_\Lambda^{+}y - y) = \Phi_{\Lambda^c}^\top(P_\Lambda - I)\varepsilon$. By Lemma A.2 we have

$$\big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty} \le \big\|\bar D_{\Lambda^c,\Lambda}\bar D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty} + 4s\delta_D \le 1/2,$$

while, since $\bar c_{\Lambda^c} = \bar D_{\Lambda^c,\Lambda}b_\Lambda = \bar D_{\Lambda^c,\Lambda}\bar D_{\Lambda,\Lambda}^{-1}\bar c_\Lambda$,

$$\big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\hat c_\Lambda - \hat c_{\Lambda^c}\big\|_\infty \le \big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\hat c_\Lambda - \bar D_{\Lambda^c,\Lambda}\bar D_{\Lambda,\Lambda}^{-1}\bar c_\Lambda\big\|_\infty + \|\hat c_{\Lambda^c} - \bar c_{\Lambda^c}\|_\infty$$
$$\le \big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}(\hat c_\Lambda - \bar c_\Lambda)\big\|_\infty + \big\|\hat D_{\Lambda^c,\Lambda}(\hat D_{\Lambda,\Lambda}^{-1} - \bar D_{\Lambda,\Lambda}^{-1})\bar c_\Lambda\big\|_\infty + \big\|(\hat D_{\Lambda^c,\Lambda} - \bar D_{\Lambda^c,\Lambda})\bar D_{\Lambda,\Lambda}^{-1}\bar c_\Lambda\big\|_\infty + \delta_c$$
$$\le \big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}(\hat c_\Lambda - \bar c_\Lambda)\big\|_\infty + \big\|\hat D_{\Lambda^c,\Lambda}(\hat D_{\Lambda,\Lambda}^{-1} - \bar D_{\Lambda,\Lambda}^{-1})\bar c_\Lambda\big\|_\infty + \delta_D' + \delta_c.$$

Here, $\big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}(\hat c_\Lambda - \bar c_\Lambda)\big\|_\infty \le \delta_c/2$ due to $\big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty} \le 1/2$. Moreover, we have

$$\big\|\hat D_{\Lambda^c,\Lambda}(\hat D_{\Lambda,\Lambda}^{-1} - \bar D_{\Lambda,\Lambda}^{-1})\bar c_\Lambda\big\|_\infty = \big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})\bar D_{\Lambda,\Lambda}^{-1}\bar c_\Lambda\big\|_\infty \le \big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty}\,\big\|(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})\bar D_{\Lambda,\Lambda}^{-1}\bar c_\Lambda\big\|_\infty \le \delta_D'/2.$$

Using the condition on $\gamma$, we get that

$$\big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\hat c_\Lambda - \hat c_{\Lambda^c}\big\|_\infty \le \frac32(\delta_D' + \delta_c) \le \frac{\gamma}{2} \le \gamma\Big(1 - \big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty}\Big),$$

so that the conditions of Lemma A.1 are satisfied and (A.4) takes place. This allows us to write

$$\tilde b_\Lambda - b_\Lambda = \hat D_{\Lambda,\Lambda}^{-1}\hat c_\Lambda - \bar D_{\Lambda,\Lambda}^{-1}\bar c_\Lambda - \gamma\hat D_{\Lambda,\Lambda}^{-1}g = \hat D_{\Lambda,\Lambda}^{-1}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})\bar D_{\Lambda,\Lambda}^{-1}\bar c_\Lambda + \hat D_{\Lambda,\Lambda}^{-1}(\hat c_\Lambda - \bar c_\Lambda) - \gamma\hat D_{\Lambda,\Lambda}^{-1}g$$
$$= \hat D_{\Lambda,\Lambda}^{-1}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})b_\Lambda + \hat D_{\Lambda,\Lambda}^{-1}(\hat c_\Lambda - \bar c_\Lambda) - \gamma\hat D_{\Lambda,\Lambda}^{-1}g$$
$$= \hat D_{\Lambda,\Lambda}^{-1}\bar D_{\Lambda,\Lambda}\Big[\bar D_{\Lambda,\Lambda}^{-1}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})b_\Lambda + \bar D_{\Lambda,\Lambda}^{-1}(\hat c_\Lambda - \bar c_\Lambda) - \gamma\bar D_{\Lambda,\Lambda}^{-1}g\Big].$$

By Lemma A.2 we have $\|\hat D_{\Lambda,\Lambda}^{-1}\bar D_{\Lambda,\Lambda}\|_{\infty\to\infty} \le 2$, so that

$$\|\tilde b_\Lambda - b_\Lambda\|_\infty \le 2\big\|\bar D_{\Lambda,\Lambda}^{-1}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})b_\Lambda\big\|_\infty + 2\big\|\bar D_{\Lambda,\Lambda}^{-1}(\hat c_\Lambda - \bar c_\Lambda)\big\|_\infty + 2\gamma\big\|\bar D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty},$$

and since we also have $\|\hat D_{\Lambda,\Lambda}^{-1}\bar D_{\Lambda,\Lambda}\|_{\mathrm{op}} \le 2$ and $\|g\| \le \sqrt{s}$, it holds

$$\|\tilde b_\Lambda - b_\Lambda\| \le 2\sqrt{s}\Big(\big\|\bar D_{\Lambda,\Lambda}^{-1}(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})b_\Lambda\big\|_\infty + \big\|\bar D_{\Lambda,\Lambda}^{-1}(\hat c_\Lambda - \bar c_\Lambda)\big\|_\infty + \gamma\big\|\bar D_{\Lambda,\Lambda}^{-1}\big\|_{\mathrm{op}}\Big).$$

The proof above relies on the following technical lemma, which collects some elementary inequalities.

Lemma A.2. Set $\delta_c = \|\hat c - \bar c\|_\infty$ and $\delta_D = \big\|(\hat D_{\Lambda^c,\Lambda} - \bar D_{\Lambda^c,\Lambda})\bar D_{\Lambda,\Lambda}^{-1}\big\|_{\infty,\infty}$. Suppose $\big\|\bar D_{\Lambda^c,\Lambda}\bar D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty} \le 1$ and $s\delta_D \le 1/2$. It holds:

• for each $q \ge 1$,

$$\big\|\bar D_{\Lambda,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{q\to q} \le 2, \qquad \big\|\hat D_{\Lambda,\Lambda}^{-1}\bar D_{\Lambda,\Lambda}\big\|_{q\to q} \le 2;$$

• moreover,

$$\big\|\hat D_{\Lambda^c,\Lambda}\hat D_{\Lambda,\Lambda}^{-1} - \bar D_{\Lambda^c,\Lambda}\bar D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty} \le 4s\delta_D.$$

Proof. First, we have

$$\big\|\bar D_{\Lambda,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{q\to q} = \big\|I + (\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})\hat D_{\Lambda,\Lambda}^{-1}\big\|_{q\to q} \le 1 + \big\|(\bar D_{\Lambda,\Lambda} - \hat D_{\Lambda,\Lambda})\bar D_{\Lambda,\Lambda}^{-1}\big\|_{q\to q}\,\big\|\bar D_{\Lambda,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{q\to q} \le 1 + s\delta_D\,\big\|\bar D_{\Lambda,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{q\to q},$$

which, solving the inequality and since $s\delta_D \le 1/2$, turns into

$$\big\|\bar D_{\Lambda,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{q\to q} \le \frac{1}{1 - s\delta_D} \le 2.$$

Similarly, $\big\|\hat D_{\Lambda,\Lambda}^{-1}\bar D_{\Lambda,\Lambda}\big\|_{q\to q} \le 2$.

Furthermore,

$$\big\|(\hat D_{\Lambda^c,\Lambda} - \bar D_{\Lambda^c,\Lambda})\hat D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty} \le \big\|(\hat D_{\Lambda^c,\Lambda} - \bar D_{\Lambda^c,\Lambda})\bar D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty}\,\big\|\bar D_{\Lambda,\Lambda}\hat D_{\Lambda,\Lambda}^{-1}\big\|_{1\to 1} \le 2s\delta_D$$

and

$$\big\|\bar D_{\Lambda^c,\Lambda}(\bar D_{\Lambda,\Lambda}^{-1} - \hat D_{\Lambda,\Lambda}^{-1})\big\|_{1,\infty} \le \big\|\bar D_{\Lambda^c,\Lambda}\bar D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty}\,\big\|\hat D_{\Lambda,\Lambda}^{-1}(\hat D_{\Lambda,\Lambda} - \bar D_{\Lambda,\Lambda})\big\|_{1\to 1} \le \big\|\bar D_{\Lambda^c,\Lambda}\bar D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty}\,\big\|\hat D_{\Lambda,\Lambda}^{-1}\bar D_{\Lambda,\Lambda}\big\|_{1\to 1}\,\big\|\bar D_{\Lambda,\Lambda}^{-1}(\hat D_{\Lambda,\Lambda} - \bar D_{\Lambda,\Lambda})\big\|_{1\to 1} \le 2s\delta_D\,\big\|\bar D_{\Lambda^c,\Lambda}\bar D_{\Lambda,\Lambda}^{-1}\big\|_{1,\infty},$$

which together give us the second inequality.

A.2 Gaussian approximation for change point statistic

Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be a martingale difference sequence (MDS) with $\beta$-mixing coefficients $b_k$, and set $\dots$ Let additionally, with probability one,

$$|X_{ij}| \le D_n, \qquad 1 \le i \le n,\ 1 \le j \le p.$$

Theorem A.2 (Chernozhukov et al. (2013), Theorem B.1). Suppose positive $r, q$ are such that $r + q \le n/2$ and for some $c_1, C_1 > 0$ and $0 < c_2 < 1/4$,

$$c_1 \le \sigma(q) \le \sigma(q)\vee\sigma(r) \le C_1\dots$$

Suppose we have another MDS $X_1', \ldots, X_n'$, from which we construct a statistic $\check T'$ analogous to (A.5). Suppose the sequence has $\beta$-mixing coefficients bounded by the same values $b_k$ and the values of the vectors are bounded a.s. by the same $D_n$. Finally, let us set

$$\Sigma' = \frac{1}{n}\sum_{i=1}^n \mathbb{E}X_i'X_i'^\top.$$

Combining the result above with Gaussian comparison and anti-concentration bounds we get the following corollary.

Lemma A.3. Suppose there are positive $q, r$ such that $q + r < n/2$ and there are $c_1, C_1 > 0$ such that $\dots$

Proof. Simply apply Theorem A.2 together with Theorem 2 of Chernozhukov et al. (2015) and Theorem 1 of Chernozhukov et al. (2017).

Let now $X_1, \ldots, X_n \in \mathbb{R}^p$ be a martingale difference sequence with $\beta$-mixing coefficients $b_k$, which we want to cast into the above form. Following Zhilova (2015) we consider the following approximation. Let $G_\varepsilon$ be an $\varepsilon$-net of the unit sphere in $\mathbb{R}^p$, such that every $a \in \mathbb{R}^p$ is approximated by maximizing $\langle a, u\rangle$ over $u \in G_\varepsilon$, and assume that for each such $I$ it holds

$$\|V_I' - V_I\| \le \Delta_I, \qquad \Delta_q = \max_{|I|=q}\Delta_I.$$

Denote by analogy the test statistic $\hat T'$ and the vectors $\widetilde X_i'$. In what follows we assume that the dimension $p$ is constant and the size of $S$ is growing with $n$. Moreover, assume that $|X_{ij}|, |X_{ij}'| \le D_n$ for each $i, j$ and that $\hat T, \hat T' \le A_n$, all with probability at least $1 - 1/n$.

Moreover, assume $\Delta_r, \Delta_q \le c_1/2$. Then, for any $C_2 > 0$ there are $c, C > 0$ that only depend on the constants above and on the covariance difference $\Delta$. We have that (assuming $s_1 \le s_2$) the difference between the two is bounded by

$$|\Sigma_{jk} - \Sigma'_{jk}| \le a^2 s_1\dots$$

Bibliography

Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability.

Adamczak, R. (2015). A note on the Hanson-Wright inequality for random vectors with dependencies. Electronic Communications in Probability.

Adamczak, R., Kotowski, M., Polaczyk, B., and Strzelecki, M. (2018a). A note on concentration for polynomials in the Ising model. arxiv.org/abs/1809.03187.

Adamczak, R., Latała, R., and Meller, R. (2018b). Hanson-Wright inequality in Banach spaces. arXiv preprint arXiv:1811.00353.

Adamczak, R. and Wolff, P. (2015). Concentration inequalities for non-Lipschitz functions with bounded derivatives of higher order. Probab. Theory Relat. Fields.

Adams, Z., Füss, R., and Gropp, R. (2014). Spillover effects among financial institutions: A state-dependent sensitivity value-at-risk approach. Journal of Financial and Quantitative Analysis, 49(3):575–598.

Arcones, M. and Giné, E. (1993). On decoupling, series expansions, and tail behavior of chaos processes. Journal of Theoretical Probability.

Avanesov, V. and Buzun, N. (2016). Change-point detection in high-dimensional covariance structure. arXiv preprint arXiv:1610.03783.

Avery, C. N., Chevalier, J. A., and Zeckhauser, R. J. (2016). The “CAPS” Prediction System and Stock Market Returns. Review of Finance, 20(4):1363–1381.

Baele, L. and Inghelbrecht, K. (2010). Time-varying integration, interdependence and contagion. Journal of International Money and Finance, 29(5):791–818.

Bauwens, L., Laurent, S., and Rombouts, J. V. (2006). Multivariate GARCH models: a survey. Journal of Applied Econometrics, 21(1):79–109.

Borell, C. (1984). On the Taylor series of a Wiener polynomial. Seminar Notes on multiple stochastic integration, polynomial chaos and their integration. Case Western Reserve Univ., Cleveland.

Boucheron, S., Bousquet, O., and Lugosi, G. (2005a). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375.

Boucheron, S., Bousquet, O., Lugosi, G., and Massart, P. (2005b). Moment inequalities for functions of independent random variables. The Annals of Probability.

Boucheron, S., Lugosi, G., and Massart, P. (2003). Concentration inequalities using the entropy method. The Annals of Probability.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.

Brody, S. and Diakopoulos, N. (2011). Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: Using word lengthening to detect sentiment in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 562–570, Stroudsburg, PA, USA. Association for Computational Linguistics.

Cha, M., Haddadi, H., Benevenuto, F., and Gummadi, K. P. (2010). Measuring user influence in Twitter: The million follower fallacy. In Fourth International AAAI Conference on Weblogs and Social Media.

Chen, C. Y., Després, R., Guo, L., and Renault, T. (2019a). What makes cryptocurrencies special? Investor sentiment and price predictability during the bubble. working paper.

Chen, C. Y.-H., Härdle, W. K., and Okhrin, Y. (2019b). Tail event driven networks of SIFIs. Journal of Econometrics, 208(1):282–298.

Chen, S. and Schienle, M. (2019). Pre-screening and reduced rank regression for high-dimensional cointegration. KIT working paper.

Chen, X. and Fan, Y. (2006a). Estimation and model selection of semiparametric copula-based multivariate dynamic models under copula misspecification. Journal of Econometrics, 135(1-2):125–154.

Chen, X. and Fan, Y. (2006b). Estimation of copula-based semiparametric time series models. Journal of Econometrics, 130(2):307–335.

Chen, Y., Härdle, W. K., and Pigorsch, U. (2010). Localized realized volatility modeling. Journal of the American Statistical Association, 105(492):1376–1393.

Chen, Y. and Niu, L. (2014). Adaptive dynamic Nelson–Siegel term structure model with applications. Journal of Econometrics, 180(1):98–115.

Chen, Y., Trimborn, S., and Zhang, J. (2018). Discover regional and size effects in global bitcoin blockchain via sparse-group network autoregressive modeling. Available at SSRN 3245031.

Chernozhukov, V., Chetverikov, D., and Kato, K. (2013). Testing many moment inequalities. arXiv preprint arXiv:1312.7614.

Chernozhukov, V., Chetverikov, D., and Kato, K. (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probability Theory and Related Fields, 162(1-2):47–70.

Chernozhukov, V., Chetverikov, D., and Kato, K. (2017). Detailed proof of Nazarov’s inequality. arXiv preprint arXiv:1711.10696.

Chernozhukov, V., Härdle, W. K., Huang, C., and Wang, W. (2018). Lasso-driven inference in time and space. arXiv preprint arXiv:1806.05081.