\[
\lambda_1 + \sum_{j=2}^{\infty}\left(\lambda_2\, j^{d_o-1} + \lambda_3\, c_o \log(j)\, j^{d_o-1}\right) X_{t-j} = -\lambda_2\, \varepsilon_{t-1}.
\]
The left-hand side is measurable with respect to $\mathcal{F}_{t-2}$ and thus independent of the right-hand side, which implies $\lambda_2 = 0$. Hence,
\[
\lambda_1 + \sum_{j=2}^{\infty} \lambda_3\, c_o \log(j)\, j^{d_o-1}\, X_{t-j} = 0.
\]
Taking expectations yields $\lambda_1 = 0$, whereas considering the variance leads to $\lambda_3 = 0$.
6.2.3 Estimation given the finite past
Given a finite sample $X_1, \ldots, X_n$, we cannot calculate the infinite series $\sigma_t(\theta)$, and thus the objective function of the estimator $\theta_n^{(h)}$ is infeasible. Therefore we replace $\sigma_t(\theta)$ by $\bar\sigma_t(\theta)$, see (6.4), and define the computable version of the modified conditional maximum likelihood estimator $\theta_n^{(h)}$:
Definition 6.2 Let $h > 0$. For a sample $X_1, \ldots, X_n$, the feasible estimator of the parameter vector $\theta$ is defined by
\[
\bar\theta_n^{(h)} := \arg\min_{\theta \in \Theta} \bar L_{n,h}(\theta),
\]
where the feasible objective function is given by
\[
\bar L_{n,h}(\theta) := \frac{1}{n} \sum_{t=1}^{n} \left[ \frac{X_t^2 + h}{\bar\sigma_t^2(\theta) + h} + \ln\!\left(\bar\sigma_t^2(\theta) + h\right) \right] \tag{6.21}
\]
and $\bar\sigma_t(\theta)$ is given by (6.4).
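To make the computation concrete, the following is a minimal sketch of the feasible objective. It assumes a hypothetical LARCH parameterization $\theta = (a, c, d)$ with $\sigma_t(\theta) = a + \sum_{j \ge 1} b_j(\theta) X_{t-j}$ and $b_j(\theta) = c\, j^{d-1}$, which matches the coefficient asymptotics used in this chapter; the exact form of $\bar\sigma_t(\theta)$ in (6.4) may differ in detail.

```python
import numpy as np

def sigma_bar(theta, X):
    """Truncated volatility approximation: for each t, only the observed
    past X_1, ..., X_{t-1} is used instead of the infinite series.
    Hypothetical parameterization: sigma_t = a + sum_j c * j**(d-1) * X_{t-j}."""
    a, c, d = theta
    n = len(X)
    sbar = np.empty(n)
    for t in range(n):                      # index t corresponds to time t+1
        j = np.arange(1, t + 1)             # available lags 1, ..., t
        b = c * j ** (d - 1.0)              # assumed coefficients b_j(theta)
        sbar[t] = a + np.dot(b, X[t - j])   # empty sum for the first observation
    return sbar

def L_bar(theta, X, h):
    """Feasible objective (6.21), to be minimized over theta in Theta."""
    sbar2 = sigma_bar(theta, X) ** 2
    return np.mean((X**2 + h) / (sbar2 + h) + np.log(sbar2 + h))
```

A minimizer can then be obtained with any standard numerical optimizer, e.g. scipy.optimize.minimize applied to the map $\theta \mapsto$ L_bar(theta, X, h).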
The statistical properties of $\bar\theta_n^{(h)}$ certainly depend on the asymptotic behavior of the difference of the objective functions, $|L_{n,h}(\theta) - \bar L_{n,h}(\theta)|$, as $n$ tends to infinity.
We will show that this error converges to zero uniformly in $\theta$, implying consistency of $\bar\theta_n^{(h)}$. However, it will turn out below that the convergence is too slow to deduce the asymptotic distribution of $\bar\theta_n^{(h)}$ from the asymptotic normality of the infeasible estimator $\theta_n^{(h)}$.
Denote the gradient of $\bar L_{n,h}(\theta)$ by $\bar L'_{n,h}(\theta)$ and the corresponding Hessian matrix by $\bar L''_{n,h}(\theta)$. These functions are defined analogously to $L'_{n,h}(\theta)$ and $L''_{n,h}(\theta)$, with $\sigma_t(\theta)$ replaced by $\bar\sigma_t(\theta)$; see (6.12)–(6.14).
Lemma 6.6 Let $h > 0$ and assumptions (A5.1), (B) and (S) hold.
(a) If further (M3) or (M3′) holds, then
\[
E\Big[\sup_{\theta \in \Theta} \big|L_{n,h}(\theta) - \bar L_{n,h}(\theta)\big|\Big] \to 0 \quad \text{as } n \to \infty. \tag{6.22}
\]
(b) If further (M4′′) holds, then
\[
E\Big[\sup_{\theta \in \Theta} \big\|L'_{n,h}(\theta) - \bar L'_{n,h}(\theta)\big\|\Big] \to 0 \tag{6.23}
\]
and
\[
E\Big[\sup_{\theta \in \Theta} \big\|L''_{n,h}(\theta) - \bar L''_{n,h}(\theta)\big\|\Big] \to 0 \tag{6.24}
\]
as $n \to \infty$.
Proof: From the mean value theorem applied to the functions $(x^2+h)^{-1}$ and $\ln(x^2+h)$ (note that the corresponding derivatives are bounded) we get
\[
\sup_{\theta \in \Theta} \big|L_{n,h}(\theta) - \bar L_{n,h}(\theta)\big| \le \frac{K}{n} \sum_{t=1}^{n} \big(X_t^2 + 1\big) \sup_{\theta \in \Theta} \big|\bar\sigma_t(\theta) - \sigma_t(\theta)\big|,
\]
where $K$ is a finite constant. Next, by Lemma 6.2(b), assumption (M3) respectively (M3′) implies
\[
E\Big[\sup_{\theta \in \Theta} \big|\bar\sigma_t(\theta) - \sigma_t(\theta)\big|^3\Big] \to 0 \quad \text{as } t \to \infty.
\]
Together with Hölder's inequality and Cesàro summability, this implies
\[
E\Big[\sup_{\theta \in \Theta} \big|L_{n,h}(\theta) - \bar L_{n,h}(\theta)\big|\Big] \to 0 \quad \text{as } n \to \infty,
\]
and thus the proof of (a) is finished since $E[|X_t|^3] < \infty$.
For (6.23), consider the decomposition of $L'_{n,h}(\theta) - \bar L'_{n,h}(\theta)$ into its summands. An application of the mean value theorem then bounds $\sup_{\theta \in \Theta} \|L'_{n,h}(\theta) - \bar L'_{n,h}(\theta)\|$ by an average of terms involving $\sup_{\theta \in \Theta} |\bar\sigma_t(\theta) - \sigma_t(\theta)|$ and the corresponding gradients. As above, assumption (M4′′) ensures that all expected values are finite and further, by using Lemma 6.2(b), one gets
\[
E\Big[\sup_{\theta \in \Theta} \big\|L'_{n,h}(\theta) - \bar L'_{n,h}(\theta)\big\|\Big] \to 0 \quad \text{as } n \to \infty,
\]
whereby the proof is finished by Cesàro summability. The remaining part (6.24) can be proven analogously.
The preceding result and Lemma 6.4 can now be combined to derive consistency of the feasible estimator $\bar\theta_n^{(h)}$.
Theorem 6.4 Let $h > 0$ and assume that (A5.1), (B) and (S) hold. If further (M3) respectively (M3′) holds, then
\[
\bar\theta_n^{(h)} \to \theta_o \quad \text{as } n \to \infty,
\]
where convergence holds in probability.
Proof: By Lemmas 6.4 and 6.6, we get uniform convergence in probability of $\bar L_{n,h}(\theta)$ to $L_h(\theta)$. Thus the conditions described in (4.22) are fulfilled.
Obtaining the asymptotic distribution of $\bar\theta_n^{(h)}$ is more complicated. The reason is the slow convergence of $|\sigma_t(\theta) - \bar\sigma_t(\theta)|$ to zero. To be more specific, note that
\[
E\big[\sigma_t(\theta) - \bar\sigma_t(\theta)\big]^2 = E[\sigma_t^2] \sum_{j=t}^{\infty} b_j^2(\theta) \sim c_1 t^{2d-1} \quad \text{as } t \to \infty,
\]
with a constant $c_1 < \infty$.
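For readers who want to verify the stated rate, here is a one-line check; it assumes the coefficient asymptotics $b_j(\theta) \sim c\, j^{d-1}$ (with $d < 1/2$) that appear throughout this chapter, e.g. in the identifiability argument above:
\[
\sum_{j=t}^{\infty} b_j^2(\theta) \sim c^2 \sum_{j=t}^{\infty} j^{2d-2} \sim c^2 \int_t^{\infty} x^{2d-2}\, dx = \frac{c^2}{1-2d}\, t^{2d-1},
\]
so that $c_1 = c^2 E[\sigma_t^2]/(1-2d)$. For $d < 0$ the tail decays faster than $t^{-1}$, whereas for $d$ close to $1/2$ it decays arbitrarily slowly.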
As in the proof of Theorem 6.3, a Taylor series expansion yields
\[
0 = \bar L'_{n,h}\big(\bar\theta_n^{(h)}\big) = \bar L'_{n,h}(\theta_o) + \widetilde{\bar L}''_n \cdot \big(\bar\theta_n^{(h)} - \theta_o\big).
\]
Again, $\widetilde{\bar L}''_n$ coincides with the Hessian matrix $\bar L''_{n,h}(\theta)$, where the latter is evaluated in each row at a value $\tilde\theta_n^{(j)}$ with $\|\tilde\theta_n^{(j)} - \theta_o\| \le \|\bar\theta_n^{(h)} - \theta_o\|$, $j = 1, 2, 3$. By Lemmas 6.4 and 6.6, together with consistency of $\bar\theta_n^{(h)}$, the matrix $\widetilde{\bar L}''_n$ converges to $H_h$ in probability as $n$ tends to infinity, and thus the asymptotic distribution of $\bar\theta_n^{(h)}$ is given, up to the factor $H_h^{-1}$, by the asymptotic distribution of $\bar L'_{n,h}(\theta_o)$. The latter is the same as that of $L'_{n,h}(\theta_o)$, provided that
\[
D_n := \sqrt{n}\, \big(L'_{n,h}(\theta_o) - \bar L'_{n,h}(\theta_o)\big) \to 0 \tag{6.26}
\]
in probability as $n \to \infty$. However, it is not clear whether (6.26) holds in general, in particular if $d_o$ is positive and thus the convergence of $|\sigma_t(\theta) - \bar\sigma_t(\theta)|$ to zero is very slow. First, we study the asymptotic behavior of $D_n$, and thus the asymptotic distribution of $\bar\theta_n^{(h)}$, in the short memory case where $d_o < 0$:
Proposition 6.2 Let $h > 0$ and $\theta_o$ be an interior point of $\Theta$ with $d_o < 0$. Then, under assumptions (A5.1), (B), (S) and (M6),
\[
n^{1/2}\big(\bar\theta_n^{(h)} - \theta_o\big) \xrightarrow{d} \mathcal{N}\big(0,\, H_h^{-1} G_h H_h^{-1}\big)
\]
as $n \to \infty$; i.e., in the short memory situation, the feasible estimator $\bar\theta_n^{(h)}$ is asymptotically normally distributed with $n^{-1/2}$-rate of convergence.
Proof: The proposition follows by proving $D_n \to 0$ in the $L^1$-norm. Recall that $\sigma_t = \sigma_t(\theta_o)$ and denote $\bar\sigma_t := \bar\sigma_t(\theta_o)$, $\dot\sigma_t := \frac{\partial}{\partial\theta}\sigma_t(\theta_o)$ and $\dot{\bar\sigma}_t := \frac{\partial}{\partial\theta}\bar\sigma_t(\theta_o)$. Using these abbreviations, $D_n$ can be decomposed as $D_n = \frac{1}{\sqrt n}(\Sigma_1 + \Sigma_2)$. The sum $\Sigma_2$ consists of uncorrelated random variables and thus
\[
E\Big\|\frac{1}{\sqrt n}\Sigma_2\Big\|^2 \to 0
\]
as $n$ tends to infinity by Lemma 6.2(b) and Cesàro summability. We further decompose $\Sigma_1$ into $\Sigma_{1,1} + \Sigma_{1,2}$, where the constant $K$ below originates from the application of the mean value theorem to the function $\frac{x^2}{(x^2+h)^2}$. By assumption (M6), all expected values are finite and $E[\sigma_t - \bar\sigma_t]^4 \to 0$, implying $E\big\|\frac{1}{\sqrt n}\Sigma_{1,2}\big\| \to 0$. For the remaining term, one obtains a bound of the form
\[
E\Big\|\frac{1}{\sqrt n}\Sigma_{1,1}\Big\| \le \frac{K}{\sqrt n} \sum_{t=1}^{n} \big(E[\sigma_t - \bar\sigma_t]^2\big)^{1/2}. \tag{6.27}
\]
Since $E[\sigma_t - \bar\sigma_t]^2 \sim c_1 t^{2d_o-1}$, this bound is of order $n^{d_o}$; in the short memory situation $d_o < 0$, the bound (6.27) therefore converges to zero.
Note that assumption (M6) was needed in the proof of Proposition 6.2 to express the upper bound (6.27) in terms of $E[\sigma_t - \bar\sigma_t]^2$, for which the rate of decay is known.
In the long-memory case, i.e. $d_o > 0$, the proof of the preceding proposition does not hold anymore, since the upper bound (6.27) for $E\|D_n\|$ (more precisely, for $E\|\frac{1}{\sqrt n}\Sigma_{1,1}\|$) does not converge to zero. We therefore propose an alternative estimator, at the cost of a slower rate of convergence:
Definition 6.3 Let $h > 0$ and $0 < \beta < 1$. Define $m(n) = \lfloor n^\beta \rfloor - 1$. For a sample $X_1, \ldots, X_n$, the truncated estimator of the parameter vector $\theta$ is defined by
\[
\theta_n^{(h,\beta)} := \arg\min_{\theta \in \Theta} \tilde L_{n,h}(\theta),
\]
where the truncated objective function is given by
\[
\tilde L_{n,h}(\theta) := \frac{1}{m(n)+1} \sum_{t=n-m(n)}^{n} \left[ \frac{X_t^2 + h}{\bar\sigma_t^2(\theta) + h} + \ln\!\left(\bar\sigma_t^2(\theta) + h\right) \right].
\]
Here, $\lfloor \cdot \rfloor$ denotes the floor function, i.e. $\lfloor x \rfloor$ is the largest integer not exceeding $x$.
The purpose of the additional function $m(n)$ can be explained as follows. As described above, the asymptotic distributions of $\theta_n^{(h)}$ and $\bar\theta_n^{(h)}$ may differ due to the slow convergence of $|\sigma_t(\theta_o) - \bar\sigma_t(\theta_o)|$ to zero. More precisely, the difference of the objective functions $D_n$ mainly comes from poor estimates of $\sigma_t(\theta_o)$ for small $t$, since then only a small number of past values $X_1, \ldots, X_{t-1}$ is used for the calculation of the approximation $\bar\sigma_t(\theta_o)$, and thus a large deviation of $\bar\sigma_t(\theta_o)$ from $\sigma_t(\theta_o)$ can occur. In the definition of $\theta_n^{(h,\beta)}$, we try to avoid this problem by skipping the first $n - m(n) - 1$ summands, i.e. we only use the part of the sum which is based on the most reliable values of $\bar\sigma_t(\theta_o)$. The function $m(n)$ is chosen in such a way that the corresponding bound (6.27) of the difference converges to zero (see the proof of Theorem 6.5 below). Note, however, that all available values of $X_s$, $s = 1, \ldots, t-1$, are still used for the calculation of $\bar\sigma_t(\theta_o)$, $t = 1, \ldots, n$. The estimator of Definition 6.3 has the following properties:
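As a minimal illustration of Definition 6.3, the following sketch computes the truncated objective; it reuses the hypothetical sigma_bar() from the sketch after Definition 6.2 and, like it, rests on an assumed parameterization not spelled out here.

```python
import numpy as np

def m_of_n(n, beta):
    """Truncation level m(n) = floor(n**beta) - 1 from Definition 6.3."""
    return int(np.floor(n ** beta)) - 1

def L_tilde(theta, X, h, beta):
    """Truncated objective: averages the likelihood terms only over
    t = n - m(n), ..., n, where sigma_bar is based on a long observed
    past and hence most reliable."""
    n = len(X)
    m = m_of_n(n, beta)
    sbar2 = sigma_bar(theta, X) ** 2   # each sigma_bar_t still uses all of X_1, ..., X_{t-1}
    idx = np.arange(n - m - 1, n)      # 0-based indices for times n - m(n), ..., n
    return np.mean((X[idx] ** 2 + h) / (sbar2[idx] + h) + np.log(sbar2[idx] + h))
```

Note that the truncation affects only which summands enter the average, not how each $\bar\sigma_t(\theta)$ is computed.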
Theorem 6.5 Let $h > 0$ and $\theta_o$ be in the interior of $\Theta$. Further assume that (A5.1), (B) and (S) hold.
(a) If (M3) or (M3′) holds and $0 < \beta < 1$, then $\theta_n^{(h,\beta)}$ converges to $\theta_o$ in the $L^1$-norm, i.e. $\theta_n^{(h,\beta)}$ is a weakly consistent estimator.
(b) If (M6′) holds and $0 < \beta < 1 - 2d_o$, then
\[
n^{\beta/2}\big(\theta_n^{(h,\beta)} - \theta_o\big) \xrightarrow{d} \mathcal{N}\big(0,\, H_h^{-1} G_h H_h^{-1}\big)
\]
as $n$ tends to infinity.
Proof: (a) Define
\[
\check L_{n,h}(\theta) := \frac{1}{m(n)+1} \sum_{t=n-m(n)}^{n} \left[ \frac{X_t^2 + h}{\sigma_t^2(\theta) + h} + \ln\!\left(\sigma_t^2(\theta) + h\right) \right].
\]
Then, the same arguments as in the proof of Lemma 6.6 can be applied to show that $\sup_{\theta \in \Theta} |\check L_{n,h}(\theta) - \tilde L_{n,h}(\theta)|$ converges to zero in the $L^1$-norm, and that the analogous results hold for the gradient and Hessian matrix of $\tilde L_{n,h}(\theta)$. Moreover, since $X_t$ and $\sigma_t(\theta)$ are stationary, the distributions of $\check L_{n,h}(\theta)$ and
\[
\check L^{*}_{n,h}(\theta) := \frac{1}{m(n)+1} \sum_{t=1}^{m(n)+1} \left[ \frac{X_t^2 + h}{\sigma_t^2(\theta) + h} + \ln\!\left(\sigma_t^2(\theta) + h\right) \right]
\]
coincide. This leads to $\sup_{\theta \in \Theta} |\check L^{*}_{n,h}(\theta) - L_h(\theta)| \to 0$, since $\check L^{*}_{n,h}(\theta) = L_{m(n)+1,h}(\theta)$ is a subsequence of $L_{n,h}(\theta)$. Altogether, $\tilde L_{n,h}(\theta)$ converges to $L_h(\theta)$ in probability, uniformly in $\theta$, and thus consistency of $\theta_n^{(h,\beta)}$ follows.
(b) Again, by Taylor expansion,
\[
0 = \tilde L'_{n,h}\big(\theta_n^{(h,\beta)}\big) = \tilde L'_{n,h}(\theta_o) + \widetilde{\tilde L}''_n \cdot \big(\theta_n^{(h,\beta)} - \theta_o\big),
\]
where $\tilde L'_{n,h}(\theta)$ denotes the gradient of $\tilde L_{n,h}(\theta)$ and $\widetilde{\tilde L}''_n$ coincides with the Hessian matrix of $\tilde L_{n,h}(\theta)$, evaluated in each row in such a way that $\widetilde{\tilde L}''_n \to H_h$ in probability (compare with the preceding proofs). Thus the asymptotic distribution of $\tilde L'_{n,h}(\theta_o)$ has to be derived. First note again that the distributions of $\check L'_{n,h}(\theta_o)$ and $\check L^{*\prime}_{n,h}(\theta_o)$ coincide, and that the latter is a subsequence of $L'_{n,h}(\theta_o)$, which is a martingale difference. Thus
\[
\sqrt{m(n)+1}\; \check L'_{n,h}(\theta_o) \xrightarrow{d} \mathcal{N}(0, G_h).
\]
It remains to show that
\[
\tilde D_n := \sqrt{m(n)+1}\; \big(\tilde L'_{n,h}(\theta_o) - \check L'_{n,h}(\theta_o)\big)
\]
converges to zero. This can be shown by decomposing $\tilde D_n$ as in the proof of Proposition 6.2 (recall $D_n = \frac{1}{\sqrt n}(\Sigma_{1,1} + \Sigma_{1,2} + \Sigma_2)$), changing the upper bound (6.27) into
\[
K\, \frac{m(n)+1}{\sqrt{m(n)+1}}\, n^{d_o - \frac12} = K\, \frac{\lfloor n^\beta \rfloor}{\sqrt{\lfloor n^\beta \rfloor}}\, n^{d_o - \frac12} \sim K\, n^{d_o - \frac12 + \frac\beta2} \to 0
\]
as $n$ tends to infinity (note that $\frac\beta2 < \frac12 - d_o$).
Thus the asymptotic properties of $\theta_n^{(h,\beta)}$ depend on the value of $d_o$. If $d_o > 0$ is close to zero, then the best achievable rate of convergence $n^{\beta/2}$ is close to $n^{1/2}$. However, for strong long memory with $d_o$ close to $1/2$, the upper bound for $\beta$, given by $1 - 2d_o$, is very small. Thus, the number of $\bar\sigma_t$'s used for estimation is very small compared to $n$, and the rate of convergence of $\theta_n^{(h,\beta)}$ is very slow. Though consistency holds for all $\beta \in (0,1)$, the asymptotic distribution of $\theta_n^{(h,\beta)}$ for $\beta \ge 1 - 2d_o$ remains an open problem.
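To make this trade-off tangible with concrete (purely illustrative) numbers, Theorem 6.5(b) gives, for example,
\[
d_o = 0.05:\quad \beta < 0.9,\ \text{best rate close to } n^{0.45}; \qquad d_o = 0.45:\quad \beta < 0.1,\ \text{best rate close to } n^{0.05}.
\]
In the latter case, even a sample of size $n = 10^6$ yields only about $m(n) + 1 = \lfloor n^\beta \rfloor < 10^{0.6} \approx 4$ usable summands in $\tilde L_{n,h}$, which illustrates how severely strong long memory restricts the truncated estimator.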