

$$\left( n^{-1}(2k_n+1) \right)^{\min\{r/p,\,r/2\}} (\log n)^{r/2}.$$

This concludes the proof for $q = a\sqrt{\log n}$.

Finally, we consider the choice of threshold $q = q(\beta)$. The corresponding assertions follow readily from the proof above, by noting that $q(\beta) \le a\sqrt{\log n}$ for some constant $a$, due to (3.12), and that $\mathbb{P}\{T_{\mathcal{I}}(y_n; f) > q(\beta)\} = O(n^{-r})$ by the choice of $\beta = O(n^{-r})$.

Remark 3.1.5. In the above theorem, we note that the choice of the only tuning parameter $q$ is universal, i.e., completely independent of the (unknown) true regression function. One can easily obtain a lower bound of order $(k_n/n)^{\min\{1/2,\,1/p\}}$ on the best possible rate in terms of $L^p$-loss, $0 < p < \infty$, by standard arguments based on testing many hypotheses and information inequalities (cf. Tsybakov, 2009; Li et al., 2016). Thus, the multiscale change-point segmentation method adapts to the underlying complexity of the truth, and is, up to a log-factor, minimax optimal over the classes $\mathcal{S}_L(k_n)$ for different choices of $k_n$ with $k_n = o(n)$, in particular $k_n \asymp n^{\theta}$, $0 \le \theta < 1$. This includes the case $\theta = 0$, where, by convention, $k_n$ is finite. Moreover, we point out that the choice of threshold $q$ is independent of the specific loss function, but depends on the order $r$ of the moments of the loss.
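For instance, reading off the rate displayed above for $k_n \asymp n^{\theta}$, $0 \le \theta < 1$, the upper bound becomes

$$\left( n^{-1}(2k_n+1) \right)^{\min\{r/p,\,r/2\}} (\log n)^{r/2} \asymp n^{-(1-\theta)\min\{r/p,\,r/2\}} (\log n)^{r/2},$$

while the lower bound of order $(k_n/n)^{\min\{1/2,\,1/p\}} \asymp n^{-(1-\theta)\min\{1/2,\,1/p\}}$ on the $L^p$-loss, raised to the $r$-th power, is $n^{-(1-\theta)\min\{r/2,\,r/p\}}$; the two agree up to the $(\log n)^{r/2}$ factor.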

3.2 Robustness to model misspecification

In practical applications the underlying function in model (2.3) is usually not a precise step function. As a robustness study, we next consider the convergence behavior of the multiscale change-point segmentation methods for more general functions. Using the terminology introduced in Section 2.3, we consider the following approximation spaces.

$$\mathcal{A}^{\gamma} := \left\{ f \in \mathcal{D}([0,1)) : \sup_{k \ge 1} k^{\gamma}\, \Gamma_k(f) < \infty \right\}, \quad \text{for } \gamma > 0,$$

and the subclasses

$$\mathcal{A}^{\gamma}_L := \left\{ f \in \mathcal{D}([0,1)) : \sup_{k \ge 1} k^{\gamma}\, \Gamma_k(f) \le L, \text{ and } \|f\|_{L^{\infty}} \le L \right\}, \quad \text{for } \gamma > 0 \text{ and } L > 0,$$

where $\Gamma_k(f)$ is the approximation error defined by

$$\Gamma_k(f) := \inf\left\{ \|f - g\|_{L^{\infty}} : g \in \mathcal{S}([0,1)),\ \#J(g) \le k \right\}. \qquad (3.14)$$

Note that $\mathcal{A}^{\gamma} = \bigcup_{L>0} \mathcal{A}^{\gamma}_L$. The order $\gamma$ of these spaces (or classes) reflects the speed of approximation of $f$ by step functions as the number of change-points increases. In addition, it is worth noting that if we consider the $L^q$-loss only for a fixed $q$, then we can replace $\|f - g\|_{L^{\infty}}$ by $\|f - g\|_{L^q}$ in the definition (3.14) of the approximation error $\Gamma_k$. This will slightly enlarge the approximation spaces.
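As a numerical illustration (not part of the original development), the approximation error $\Gamma_k(f)$ has a natural discrete surrogate on a grid of $n$ sample points: the best sup-norm approximation by a step function with at most $k$ jumps can be computed by a small dynamic program over block boundaries, taking the midrange on each block. The sketch below assumes this surrogate; the example function and all names are chosen only for illustration.

```python
import numpy as np

def gamma_k(fvals, k):
    """Discrete surrogate of Gamma_k(f): best sup-norm approximation of the
    samples `fvals` by a step function with at most k jumps (k+1 blocks)."""
    n = len(fvals)
    # cost[i, j] = sup-norm error of the best constant (the midrange) on samples i..j-1
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        lo = hi = fvals[i]
        for j in range(i + 1, n + 1):
            lo, hi = min(lo, fvals[j - 1]), max(hi, fvals[j - 1])
            cost[i, j] = 0.5 * (hi - lo)
    # D[m, j] = smallest achievable maximal block error using m blocks for samples 0..j-1
    D = np.full((k + 2, n + 1), np.inf)
    D[0, 0] = 0.0
    for m in range(1, k + 2):
        for j in range(1, n + 1):
            D[m, j] = min(max(D[m - 1, i], cost[i, j]) for i in range(j))
    return min(D[m, n] for m in range(1, k + 2))

# f(x) = sqrt(x) is Hoelder continuous with exponent 1/2, so k^(1/2) * Gamma_k(f)
# should stay bounded as k grows (cf. Example 3.2.3 (i) below).
x = np.linspace(0, 1, 300, endpoint=False)
f = np.sqrt(x)
for k in (1, 2, 4, 8, 16):
    print(k, round(k**0.5 * gamma_k(f, k), 4))
```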

The rates of convergence for these spaces (or classes) are provided in the following theorem.

Theorem 3.2.1. Assume model (2.3) and that Assumption 1 holds with constants $c > 1$ and $\delta > 0$. Let $0 < p, r < \infty$, and let $\hat{f}_n$ be the multiscale change-point segmentation estimator from (2.4) with threshold

$$q = a\sqrt{\log n} \ \text{ for some } a > \delta + \sigma\sqrt{2r+4}, \qquad \text{or} \qquad q = q(\beta) \text{ as in (2.6) with } \beta = O(n^{-r}).$$

Then it holds that

$$\|\hat{f}_n - f\|_{L^p}^r = O\left( n^{-\frac{2\gamma}{2\gamma+1}\min\{1/p,\,1/2\}\, r}\, (\log n)^{\frac{\gamma + (1/2 - 1/p)_+}{2\gamma+1}\, r} \right), \quad \text{a.s.},$$

uniformly for $f \in \mathcal{A}^{\gamma}_L$. Furthermore, the same result also holds in expectation,

$$\mathbb{E}\left[ \|\hat{f}_n - f\|_{L^p}^r \right] = O\left( n^{-\frac{2\gamma}{2\gamma+1}\min\{1/p,\,1/2\}\, r}\, (\log n)^{\frac{\gamma + (1/2 - 1/p)_+}{2\gamma+1}\, r} \right),$$

uniformly for $f \in \mathcal{A}^{\gamma}_L$.
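For orientation, a concrete instance of this rate, obtained by simply plugging values into the exponents above: for the $L^2$-loss ($p = 2$, $r = 2$) and $\gamma = 1$, the bound reads

$$\|\hat{f}_n - f\|_{L^2}^2 = O\left( n^{-2/3} (\log n)^{2/3} \right) \quad \text{a.s.},$$

since $\frac{2\gamma}{2\gamma+1}\min\{1/p,\,1/2\}\, r = \frac{2}{3}$ and $(1/2 - 1/p)_+ = 0$.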

Proof. The idea behind the proof is to first approximate the truth $f$ by a step function $f_{k_n}$ with $O(k_n)$ jumps, and then treat $f_{k_n}$ as the underlying "true" signal in model (2.3) (with an additional approximation error). This allows us to employ techniques similar to those used in the proof of Theorem 3.1.4. To be rigorous, we give a detailed proof as follows.

First, we consider the choice of threshold $q = a\sqrt{\log n}$.

(i) Good noise case. Assume for the moment that the observations $y_n = \{y_i^n\}_{i=0}^{n-1}$ from model (2.3) are close to the truth $f$ in the sense that the event

$$G_n := \left\{ y_n \,:\, \sup_{I \in \mathcal{I}} \left( \frac{1}{\sqrt{n|I|}} \left| \sum_{i/n \in I} \left( y_i^n - f\left(\tfrac{i}{n}\right) \right) \right| - s_I \right) \le a_0 \sqrt{\log n} \right\}$$

holds with $a_0 = \delta + \sigma\sqrt{2r+4}$. Now let

$$k_n := \left\lceil \left( \frac{2L}{a - a_0} \right)^{2/(2\gamma+1)} \left( \frac{n}{\log n} \right)^{1/(2\gamma+1)} \right\rceil.$$
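As a purely numerical illustration of this choice (the values of $L$, $a$, $a_0$, $\gamma$ below are hypothetical and serve only to show the order of magnitude of $k_n$), a short sketch:

```python
import math

def k_n(n, L, a, a0, gamma):
    """k_n = ceil( (2L/(a - a0))^(2/(2*gamma+1)) * (n / log n)^(1/(2*gamma+1)) )."""
    return math.ceil((2 * L / (a - a0)) ** (2 / (2 * gamma + 1))
                     * (n / math.log(n)) ** (1 / (2 * gamma + 1)))

# With this choice the approximation term 2*sqrt(n)*k_n^(-gamma-1/2)*L used below
# is at most (a - a0)*sqrt(log n), so the total stays below a*sqrt(log n).
L, a, a0, gamma = 1.0, 3.0, 2.0, 1.0
for n in (10**3, 10**4, 10**5):
    k = k_n(n, L, a, a0, gamma)
    bias = 2 * math.sqrt(n) * k ** (-gamma - 0.5) * L
    print(n, k, bias <= (a - a0) * math.sqrt(math.log(n)))
```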

Note that $f \in \mathcal{A}^{\gamma}_L$, so for every $n$, by a compactness argument, there exists a step function $\tilde{f}_{k_n} \in \mathcal{S}([0,1))$ with $\#J(\tilde{f}_{k_n}) \le k_n$ such that $\|f - \tilde{f}_{k_n}\|_{L^{\infty}} \le k_n^{-\gamma} L$. By introducing additional change-points at $\{i/k_n\}_{i=1}^{k_n - 1}$ into $\tilde{f}_{k_n}$, one can construct another step function $f_{k_n}$ with $\#J(f_{k_n}) \le 2k_n$ such that its largest segment length is at most $1/k_n$ and $\|f - f_{k_n}\|_{L^{\infty}} \le 2 k_n^{-\gamma} L$. Then

$$T_{\mathcal{I}}(y_n; f_{k_n}) \le \sup_{\substack{I \in \mathcal{I} \\ f_{k_n} \equiv c_I \text{ on } I}} \frac{1}{\sqrt{n|I|}} \left| \sum_{i/n \in I} \left( f\left(\tfrac{i}{n}\right) - c_I \right) \right| + \sup_{I \in \mathcal{I}} \left( \frac{1}{\sqrt{n|I|}} \left| \sum_{i/n \in I} \left( y_i^n - f\left(\tfrac{i}{n}\right) \right) \right| - s_I \right)$$

$$\le 2 n^{1/2} k_n^{-\gamma - 1/2} L + a_0 \sqrt{\log n} \le a \sqrt{\log n}.$$

That is, $f_{k_n}$ satisfies the constraint in (2.4). Thus, by definition, $\#J(\hat{f}_n) \le \#J(f_{k_n}) \le 2 k_n$. Let the intervals $\{I_i\}_{i=0}^{m}$ be the partition of $[0,1)$ induced by $J(\hat{f}_n) \cup J(f_{k_n})$, with $m \le 3 k_n$. Then

$$\|\hat{f}_n - f_{k_n}\|_{L^p}^p = \sum_{i=0}^{m} |\hat{\theta}_i - \theta_i|^p\, |I_i|, \quad \text{where } \hat{f}_n|_{I_i} \equiv \hat{\theta}_i \text{ and } f_{k_n}|_{I_i} \equiv \theta_i.$$

If $|I_i| > c/n$, there is $\tilde{I}_i \in \mathcal{I}$ such that $\tilde{I}_i \subseteq I_i$ and $|\tilde{I}_i| \ge |I_i|/c$. Then,

$$|\tilde{I}_i|^{1/2} \left| \theta - \frac{1}{n|\tilde{I}_i|} \sum_{j/n \in \tilde{I}_i} y_j^n \right| \le (a+\delta) \sqrt{\frac{\log n}{n}} \quad \text{for } \theta = \theta_i \text{ or } \hat{\theta}_i,$$

which, together with $|\tilde{I}_i| \ge |I_i|/c$, implies

$$|I_i|^{1/2}\, |\hat{\theta}_i - \theta_i| \le 2(a+\delta) \sqrt{\frac{c \log n}{n}}.$$

If $|I_i| \le c/n$, then we have, for some $i_0$ and $j_0$,

$$|\hat{\theta}_i| \le |\hat{\theta}_i - y_{i_0}^n| + \left| y_{i_0}^n - f\left(\tfrac{i_0}{n}\right) \right| + \|f\|_{L^{\infty}} \le 2(a+\delta)\sqrt{\log n} + L,$$

and

$$|\theta_i| = \left| f_{k_n}\left(\tfrac{j_0}{n}\right) \right| \le \|f - f_{k_n}\|_{L^{\infty}} + \|f\|_{L^{\infty}} \le (2 k_n^{-\gamma} + 1) L,$$

which lead to

$$|\hat{\theta}_i - \theta_i| \le |\hat{\theta}_i| + |\theta_i| \le 2(a+\delta)\sqrt{\log n} + 2(k_n^{-\gamma} + 1) L.$$

Thus, by combining these two cases, we obtain

$$\|\hat{f}_n - f_{k_n}\|_{L^p}^p \le \sum_{i:\, |I_i| > c/n} \left( 2(a+\delta) \sqrt{\frac{c \log n}{n |I_i|}} \right)^p |I_i| \;+\; \sum_{i:\, |I_i| \le c/n} \left( 2(a+\delta)\sqrt{\log n} + 2(k_n^{-\gamma} + 1) L \right)^p \frac{c}{n}.$$


Then, with a similar argument as for (3.11), we obtain, as $n \to \infty$,

$$\|\hat{f}_n - f_{k_n}\|_{L^p}^p \le 2 \left( 4(a+\delta)^2 \log n \right)^{p/2} \left( (3 k_n + 1)\, \frac{c}{n} \right)^{\min\{1,\,p/2\}} \left( 1 + o(1) \right),$$

which together with a triangle inequality leads to

$$\|\hat{f}_n - f\|_{L^p}^r \le 2^{(2/p+1)r} \left( 4(a+\delta)^2 \log n \right)^{r/2} \left( (3 k_n + 1)\, \frac{c}{n} \right)^{\min\{r/p,\,r/2\}} \left( 1 + o(1) \right). \qquad (3.15)$$

(ii) Rates of convergence. The almost sure rate of convergence is a consequence of (3.15) and the fact that, due to (3.12),

$$\limsup_{n \to \infty} \mathbb{P}\{G_n^c\} \le \limsup_{n \to \infty} \mathbb{P}\left\{ \sup_{I \in \mathcal{I}} \frac{1}{\sqrt{n|I|}} \left| \sum_{i/n \in I} \xi_i^n \right| > (a_0 - \delta) \sqrt{\log n} \right\} = 0.$$

Similar to proof step (iii) of Theorem 3.1.4, we derive from (3.15) that, as $n \to \infty$,

$$\mathbb{E}\left[ \|\hat{f}_n - f\|_{L^p}^r \right] = \mathbb{E}\left[ \|\hat{f}_n - f\|_{L^p}^r ;\, G_n \right] + \mathbb{E}\left[ \|\hat{f}_n - f\|_{L^p}^r ;\, G_n^c \right]$$

$$\le \mathbb{E}\left[ \|\hat{f}_n - f\|_{L^p}^r ;\, G_n \right] + 2^{r/p} n^{r/2}\, \mathbb{P}\{G_n^c\} + \int_{2 n^{p/2}}^{\infty} \mathbb{P}\left\{ \|\hat{f}_n - f\|_{L^p}^p \ge u \right\} \tfrac{r}{p}\, u^{r/p - 1}\, du$$

$$\le O\left( (\log n)^{r/2} \left( n^{-1} k_n \right)^{\min\{r/p,\,r/2\}} \right) + O\left( n^{-r/2} \right) = O\left( (\log n)^{r/2} \left( n^{-1} k_n \right)^{\min\{r/p,\,r/2\}} \right),$$

which shows the rate of convergence in expectation.
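For completeness, substituting the choice $k_n \asymp (n/\log n)^{1/(2\gamma+1)}$ made above into this bound recovers the rate stated in Theorem 3.2.1:

$$(\log n)^{r/2} \left( n^{-1} k_n \right)^{\min\{r/p,\,r/2\}} \asymp (\log n)^{r/2} \left( n^{-\frac{2\gamma}{2\gamma+1}} (\log n)^{-\frac{1}{2\gamma+1}} \right)^{\min\{r/p,\,r/2\}} = n^{-\frac{2\gamma}{2\gamma+1}\min\{1/p,\,1/2\}\, r}\, (\log n)^{\frac{\gamma + (1/2 - 1/p)_+}{2\gamma+1}\, r}.$$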

Lastly, for the choice of threshold $q = q(\beta)$, the proof follows in the same way as above, based on the facts that $q(\beta) \le a\sqrt{\log n}$ for some constant $a$, due to (3.12), and that $\mathbb{P}\{G_n^c\} = O(n^{-r})$ by the choice of $\beta = O(n^{-r})$.

Remark 3.2.2. Similar to Theorem 3.1.4, the above theorem shows that the multiscale change-point segmentation method with a universal threshold automatically adapts to the smoothness of the approximation spaces, in the sense that it attains a faster rate for a larger order $\gamma$. However, unlike in Theorem 3.1.4, we require the constant $a$ in the definition of the threshold $q$ to be strictly greater than $\delta + \sigma\sqrt{2r+4}$; this is necessary since the constant hidden in the $O$ notation tends to infinity as $a \to \delta + \sigma\sqrt{2r+4}$.

Example 3.2.3. (i) (Piecewise) Hölder functions. For $0 < \alpha \le 1$ and $L > 0$, we consider the Hölder function classes

$$H^{\alpha}_L([0,1)) := \left\{ f \in \mathcal{D}([0,1)) : \|f\|_{L^{\infty}} \le L \text{ and } |f(x_1) - f(x_2)| \le L |x_1 - x_2|^{\alpha} \text{ for all } x_1, x_2 \in [0,1) \right\},$$

and the piecewise Hölder function classes with at most $\kappa$ jumps

$$H^{\alpha}_{\kappa,L}([0,1)) := \left\{ f \in \mathcal{D}([0,1)) : \text{there is a partition } \{I_i\}_{i=0}^{l} \text{ of } [0,1), \text{ with } l \le \kappa, \text{ such that } f|_{I_i} \in H^{\alpha}_L(I_i) \text{ for all possible } i \right\}.$$

Obviously, the latter contains the former as a special case when $\kappa = 0$, that is, $H^{\alpha}_{0,L}([0,1)) \equiv H^{\alpha}_L([0,1))$. It is easy to see that $H^{\alpha}_L([0,1)) \subseteq \mathcal{A}^{\alpha}_{L_0}$ with $L_0 \ge L$, and $H^{\alpha}_{\kappa,L}([0,1)) \subseteq \mathcal{A}^{\alpha}_{L_0}$ with $L_0 \ge L(\kappa+1)^{\alpha+1/2}$ (cf. Boysen et al., 2009).

It is known that the fastest possible rate over $H^{\alpha}_L([0,1))$, $0 < \alpha \le 1$, with respect to the $L^p$-loss, $0 < p < \infty$, is at most of order $n^{-\frac{2\alpha}{2\alpha+1}\min\{1/2,\,1/p\}}$ (see, e.g., Ibragimov and Has'minskiĭ, 1981; Ibragimov and Khas'minskiĭ, 1982). Thus, as a consequence of Theorem 3.2.1, the multiscale change-point segmentation method with a universal threshold is simultaneously minimax optimal (up to a log-factor) over $H^{\alpha}_L([0,1))$ and $H^{\alpha}_{\kappa,L}([0,1))$ for every $\kappa \in \mathbb{N}_0$, $0 < \alpha \le 1$ and $L > 0$, that is, it adapts to the smoothness order $\alpha$ of the underlying function.
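A quick numerical check of the inclusion $H^{\alpha}_L([0,1)) \subseteq \mathcal{A}^{\alpha}_{L_0}$ (not part of the original text): for a Hölder function, the piecewise-constant (midrange) approximation on $k+1$ equispaced blocks already has sup-norm error of order $k^{-\alpha}$, so $k^{\alpha}\, \Gamma_k(f)$ stays bounded. The sketch below, with a hypothetical example function, illustrates this.

```python
import numpy as np

alpha, L = 0.5, 1.0
x = np.linspace(0, 1, 2000, endpoint=False)
f = L * x**alpha                      # Hoelder continuous with exponent alpha on [0,1)

def step_error(fvals, k):
    """Sup-norm error of the midrange step approximation on k+1 equispaced blocks,
    i.e. of a step function with at most k jumps."""
    blocks = np.array_split(fvals, k + 1)
    return max(0.5 * (b.max() - b.min()) for b in blocks)

# k^alpha * (approximation error) should stay bounded; this upper-bounds
# k^alpha * Gamma_k(f), which is all the inclusion requires.
for k in (1, 2, 4, 8, 16, 32):
    print(k, round(k**alpha * step_error(f, k), 4))
```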

(ii) Bounded variation functions. Recall that the (total) variation $\|\cdot\|_{\mathrm{TV}}$ of a function $f$ is defined as

$$\|f\|_{\mathrm{TV}} := \sup\left\{ \sum_{i=0}^{m} |f(x_{i+1}) - f(x_i)| : 0 = x_0 < \cdots < x_{m+1} = 1,\ m \in \mathbb{N} \right\}.$$

We introduce the càdlàg bounded variation classes

$$\mathrm{BV}_L([0,1)) := \left\{ f \in \mathcal{D}([0,1)) : \|f\|_{L^{\infty}} \le L \text{ and } \|f\|_{\mathrm{TV}} \le L \right\} \quad \text{for } L > 0.$$

An elementary calculation, together with the Jordan decomposition, implies that $\mathrm{BV}_L([0,1)) \subseteq \mathcal{A}^{1}_{L_0}$ for $L_0 \ge L$.

Since the Hölder class $H^{1}_L([0,1)) \subseteq \mathrm{BV}_L([0,1))$, the best possible rate for $\mathrm{BV}_L([0,1))$ cannot be faster than that for $H^{1}_L([0,1))$, which is of order $n^{-\frac{2}{3}\min\{1/2,\,1/p\}}$. Then, Theorem 3.2.1 implies that the multiscale change-point segmentation method attains the minimax optimal rate (up to a log-factor) over the bounded variation classes $\mathrm{BV}_L([0,1))$ for $L > 0$.

We point out that the convergence rates of the multiscale change-point segmentation methods in the examples above coincide with the rates reported in Boysen et al. (2009) for jump-penalized least squares estimators, while they are faster than the rates reported in Fryzlewicz (2007) for the unbalanced Haar wavelet based estimator, the difference being in log-factors. All these examples concern the approximation spaces $\mathcal{A}^{\gamma}$ for $\gamma \le 1$. Note, however,

3.3 Implications of the convergence rates