

$\max_{m \in M,\, t \in T}(s_{m,i,t}) - L$.

The rounding error $\nu_{j,i,t}$ can be assumed to be independently and identically distributed (i.i.d.). In particular, it is independent of the total search volume $s_{j,i,t}$.

Even though Google limits the length of the time frame the user is allowed to choose, the structure of the SVI as outlined in Equations (1.1) and (1.2) allows one to construct a consistent multi-annual SVI of arbitrary length by downloading overlapping SVIs.

To do so, one can exploit the linear relationship between the SVIs obtained for two time frames $T$ and $T'$ for the same point in time $t \in T \cap T'$, which is formally described as

$$SVI_{j,i,t|M,T} = \gamma + \delta\, SVI_{j,i,t|M',T'} + \varepsilon_{j,i,t}. \qquad (1.3)$$

$\delta$ and $\gamma$ are the parameters of this linear relation and clearly depend on the region $i$, the point in time $t$, the time frames $T$ and $T'$, and the sets of simultaneously downloaded search terms $M$ and $M'$. For simplicity, all these dependencies are suppressed in the notation of Equation (1.3).

Again, the rounding error $\varepsilon_{j,i,t}$ is assumed to be i.i.d. More details on the derivation of Equation (1.3) are provided in the appendix. We will use this linear relationship in the next section to construct consistent multi-annual Google Trends time series.
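To build intuition for why this relationship is linear, consider the following sketch; it assumes that Equation (1.1) scales the raw search volume by the maximum over the chosen set of terms and time frame, and it ignores rounding:

$$SVI_{j,i,t|M,T} \approx \frac{100\, s_{j,i,t}}{\max_{m\in M,\,\tau\in T}(s_{m,i,\tau})}, \qquad SVI_{j,i,t|M',T'} \approx \frac{100\, s_{j,i,t}}{\max_{m\in M',\,\tau\in T'}(s_{m,i,\tau})}.$$

On the common time range, both indices are proportional to the same $s_{j,i,t}$, so

$$SVI_{j,i,t|M,T} \approx \frac{\max_{m\in M',\,\tau\in T'}(s_{m,i,\tau})}{\max_{m\in M,\,\tau\in T}(s_{m,i,\tau})}\; SVI_{j,i,t|M',T'},$$

which is the relation in Equation (1.3), with $\delta$ equal to the ratio of the two maxima and with $\gamma$ and $\varepsilon_{j,i,t}$ absorbing the rounding effects.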

1.1.2 Linear Regression and Evaluation

As Google makes little explanation available on how exactly the SVI is calculated, and since the scientific literature that uses daily Google Trends SVIs is rather unconcerned with a detailed explanation of how to construct coherent time series, we deem it necessary to clearly describe how we arrive at our algorithm. We assume, according to the description Google provides, that Google adjusts the search volume according to Equation (1.1) for a single search-term.

Another possibility, used by Google up to the end of 2011, is to standardize the time series of search volume index values. To distinguish this standardization approach, we denote the resulting index by $v_{j,t}$, for some search-term $j$ and some point in time $t \in [t_0, T]$. Back then, Google subtracted the mean $\mu_{t_0,T}$ and divided by the standard deviation $\sigma_{t_0,T}$ of the number of searches within a certain time frame. Google then transformed the series to unit mean $\bar{\mu} = 1$ and unit standard deviation $\bar{\sigma} = 1$ to obtain the index

$$v_{j,t} = \frac{n_{j,t} - \mu_{t_0,T}}{\sigma_{t_0,T}}\,\bar{\sigma} + \bar{\mu}. \qquad (1.4)$$

We know that Google made SVIs available according to Equation (1.4) in 2011.^5 Back then, the user could choose the time frame on which the mean $\mu_{t_0,T}$ and standard deviation $\sigma_{t_0,T}$ would be calculated. In 'relative mode', mean and standard deviation were calculated on the chosen time frame $[t_0, T]$, whereas in 'fixed mode' the user could choose a reference time period $[\tau_0, \tau_1]$. The fixed mode allowed the construction of multi-annual, consistent time series. Unfortunately, this is no longer the case: only (a variant of the former) 'relative mode' is available which, in our understanding, can be formalized by Equation (1.1).

Due to Equation (1.3), however, we can knit together separately scaled time series that are downloadable from Google, provided there are overlapping points in the data sets. In theory, two overlapping points in time would suffice to identify the parameters $\gamma$ and $\delta$ in Equation (1.3). Since the relationship only holds approximately, we suggest at least 30 overlapping days. We estimate the parameters via standard ordinary least-squares (OLS) regression. If the overlapping points contain many zeros in both sets, an even longer overlapping period is advisable. In our algorithm, we require that there are at least 30 days in the overlapping window on which at least one of the two data sets has a non-zero value. Furthermore, we require that, within the overlapping time period, each of the two data sets taken alone exhibits at least 20 non-zero values.
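As an illustration, the following minimal sketch (in Python; all names are ours, not part of any Google API) encodes these overlap requirements for two series that have already been aligned on their common days:

```python
import numpy as np

def overlap_is_valid(svi_a, svi_b, min_any_nonzero=30, min_each_nonzero=20):
    """Check whether the overlap of two SVI series is usable for
    estimating Equation (1.3).

    svi_a, svi_b: 1-d arrays holding the two SVIs on the same
    overlapping days, in the same order.
    """
    svi_a = np.asarray(svi_a, dtype=float)
    svi_b = np.asarray(svi_b, dtype=float)
    if svi_a.shape != svi_b.shape:
        raise ValueError("series must cover the same overlapping days")
    # at least 30 days on which at least one of the two series is non-zero
    days_any_nonzero = np.sum((svi_a != 0) | (svi_b != 0))
    # each series taken alone must exhibit at least 20 non-zero values
    return (days_any_nonzero >= min_any_nonzero
            and np.count_nonzero(svi_a) >= min_each_nonzero
            and np.count_nonzero(svi_b) >= min_each_nonzero)
```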

Depending on whether we start with the youngest or the oldest time frame when knitting the time series together, we distinguish between the backward and the forward method.

Furthermore, for each concatenation step, i.e., each time Equation (1.3) is used, we can test whether our estimate of the constant parameter $\gamma$ is statistically significant at the 5% level. To calculate the test statistic, we use robust standard errors. If the null hypothesis is not rejected at the 5% level in a two-sided test, we can choose to re-estimate the linear relationship based on the model

$$SVI_{j,i,t|M,T} = \delta\, SVI_{j,i,t|M',T'} + \varepsilon_{j,i,t}. \qquad (1.5)$$
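The estimation and the intercept test can be sketched as follows (Python with statsmodels; we use HC1 heteroskedasticity-robust standard errors as one concrete choice of 'robust standard errors', and all function names are ours):

```python
import numpy as np
import statsmodels.api as sm

def estimate_link(svi_a, svi_b, alpha=0.05):
    """Estimate Equation (1.3), svi_a = gamma + delta * svi_b + eps, on the
    overlap; fall back to Equation (1.5) if gamma is insignificant."""
    y = np.asarray(svi_a, dtype=float)
    x = np.asarray(svi_b, dtype=float)
    fit = sm.OLS(y, sm.add_constant(x)).fit(cov_type="HC1")
    gamma, delta = fit.params
    # two-sided test of H0: gamma = 0 at the 5% level
    if fit.pvalues[0] >= alpha:
        fit = sm.OLS(y, x).fit(cov_type="HC1")  # re-estimate without intercept
        gamma, delta = 0.0, float(fit.params[0])
    return gamma, delta
```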

5 Source: Question 8 on https://web.archive.org/web/20101229150233/http://www.google.de:80/intl/en/trends/about.html (Last access: February 13, 2018.)

Figure 1.2: The Regression-Based Construction Algorithm

The figure illustrates the forward method of the regression-based construction algorithm.

[Left panel: flowchart — download overlapping SVIs → estimate parameters on the overlap → predict beyond the overlap → concatenate → SVIs left? (yes: repeat; no: stop). Right panel: abstract illustration of data sets A, B1, B2, B3, B4, ..., in which $SVI_{A,t_T} = \gamma + \delta\, SVI_{B1,t_T} + \varepsilon_t$ is estimated on the overlap and $\widehat{SVI}_{A,t_B} = \hat{\gamma} + \hat{\delta}\, SVI_{B1,t_B}$ is predicted beyond it.]

The regression-based construction algorithm can be summarized in the following steps:

1. Download 270-day SVI data sets from Google for the time period of interest. Make sure that every two subsequent data sets overlap in at least 30 non-missing values.

2. Estimate Equation (1.3) on the overlapping data points (do not exclude zeros). Begin with the two data sets containing the youngest (backward method) or oldest (forward method) SVI observations for a search-term. We call the data set containing the starting point A and denote the values in it by $SVI_{j,i,t|M,T_A}$. The subsequent 270-day data set is called B and the SVI values in it are denoted by $SVI_{j,i,t|M,T_B}$.

Test whether the null hypothesis for the intercept, $H_0\colon \gamma = 0$, can be rejected. If so, keep the estimates for Equation (1.3). If not, estimate Equation (1.5).

3. Predict $SVI_{j,i,t|M,T_A}$ out of sample (over the time range of $SVI_{j,i,t|M,T_B}$ without the overlap) by using the estimates $\hat{\gamma}$ and $\hat{\delta}$ for the relation in Equation (1.3), or only $\hat{\delta}$ if Equation (1.5) is used.

4. Concatenate the original $SVI_{j,i,t|M,T_A}$ and the predicted values $\widehat{SVI}_{j,i,t|M,T_B}$ into one data set. This data set takes the place of data set A, whereas B is replaced with the next data set to be attached.

5. Repeat steps 2 to 4 until there are no further data sets left.

Figure 1.2 summarizes the steps of the algorithm (left) and illustrates the implementation in an abstract way (right).
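For illustration, the following sketch (Python; the data structures are hypothetical, and it reuses overlap_is_valid and estimate_link as defined above) implements steps 2 to 5 for the forward method:

```python
import numpy as np

def rbc_concatenate(datasets):
    """Forward-method sketch of the regression-based construction (RBC).

    datasets: list of dicts {"t": integer day indices, "svi": values},
    ordered from oldest to youngest, each overlapping its predecessor.
    Returns the concatenated days and values on the scale of the first set.
    """
    base_t = np.asarray(datasets[0]["t"])
    base_v = np.asarray(datasets[0]["svi"], dtype=float)
    for nxt in datasets[1:]:
        nxt_t = np.asarray(nxt["t"])
        nxt_v = np.asarray(nxt["svi"], dtype=float)
        # step 2: align the overlapping days of A (base) and B (next set)
        _common, idx_a, idx_b = np.intersect1d(base_t, nxt_t,
                                               return_indices=True)
        if not overlap_is_valid(base_v[idx_a], nxt_v[idx_b]):
            raise ValueError("overlap violates the non-zero requirements")
        gamma, delta = estimate_link(base_v[idx_a], nxt_v[idx_b])
        # step 3: predict A-scale values over B's range beyond the overlap
        new = ~np.isin(nxt_t, base_t)
        predicted = gamma + delta * nxt_v[new]
        # step 4: concatenate; the result plays the role of A in the next pass
        base_t = np.concatenate([base_t, nxt_t[new]])
        base_v = np.concatenate([base_v, predicted])
    return base_t, base_v
```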

Table 1.2: Correlations of Constructed and Original SVI

The table reports the correlation coefficients of the RBC SVI using the respective method with the original search volume index as downloaded in 2012 by Dimpfl and Jank (2016).

                With Intercept           Optional Intercept
Index         forward    backward      forward    backward
CAC           0.9786     0.9777        0.9813     0.9804
DAX           0.9578     0.9758        0.9704     0.9779
DJIA          0.9911     0.9854        0.9913     0.9886
FTSE          0.9471     0.9610        0.9642     0.9615

We have two options to evaluate the accuracy of our proposed algorithm. First, we can compare a data set constructed in this way to a data set obtained from Google at a time when immediate concatenation was still possible. Second, we can aggregate the RBC SVI to a lower frequency and compare it to an SVI at this frequency obtained directly from Google.

The first option relies on the data sets used by Dimpfl and Jank (2016). In 2011, when the authors collected the data, it was possible to download Google Trends SVIs scaled to a fixed reference date and simply string them together. Back then, the SVI was also not rounded. Dimpfl and Jank (2016) downloaded data sets for the search-terms CAC (related to the French stock index CAC40), DAX (related to the German stock market index), Dow Jones, and FTSE (related to the British Financial Times Stock Exchange Index). The data cover Google's SVI from July 3, 2006 until January 30, 2011 for searches originating from the country in which the respective market is located.

For the construction of the SVI from currently accessible Google Trends time series, we downloaded 24 separate data sets reaching back to 2004. Each data set contains 270 days and overlaps with the previous data set in at least 30 non-zero observations. We use the data from Google Trends based on searches originating from the country in which the respective index is located. The time zone is fixed to UTC+1.^6

As we can either use the forward or the backward method, and either always include an intercept or include it only if it is found to be statistically significant, we have four options to construct the time series. Table 1.2 reports the correlation coefficients of the four methods with the benchmark SVI time series. For all methods and search-terms, we find correlation coefficients larger than 0.94. It turns out that we can increase the accuracy of the RBC SVI time series by only optionally including the intercept parameter in the estimation.

Figure 1.3 compares the backward (upper panel) and forward (lower panel) RBC SVI for the search-term Dow Jones, when we always include an intercept, to the benchmark time series.

6 With the HTTP request to Google Trends, a parameter tz is set to 60 if the request is made from Germany, which corresponds to a time-zone offset of 1 hour. We extended the gtrendsR-package available for R to include the possibility to fix the time zone.

Table 1.3: Correlation Between Naively Concatenated and RBC SVI with the Original SVI

The table presents the correlation of the naively concatenated SVI and of the RBC SVI with the original SVI, in levels and in returns. The RBC SVI is calculated using the backward method including an intercept.

The biased returns are dropped from the naively concatenated SVI. In returns, the backward method including an intercept consistently exhibits a higher correlation with the original SVI than the naively concatenated SVI; for the backward method with optional intercept, this is not always the case. When considering levels, the RBC SVI with the backward method and optional intercept also has a high correlation with the original SVI.

              In Levels              Returns
Index       RBC       Naive        RBC       Naive
CAC         0.9777    0.2432       0.5078    0.4584
DAX         0.9758    0.2628       0.6358    0.5961
DJIA        0.9854    0.4036       0.7294    0.6496
FTSE        0.9610    0.2285       0.5837    0.5374

Figure 1.4 compares the two methods when the intercept is only included if it turns out statistically significant in step 2 of the algorithm. Comparing Figures 1.3 and 1.4, as well as Table 1.2, we can see that for the search-terms CAC, DAX, Dow Jones and FTSE, all methods perform well, but it seems admissible to include the intercept in the concatenation only if it is statistically significant.

When using SVIs in empirical work, usually the logarithmic growth rates of the SVI, i.e., logarithmic first differences, are used. To evaluate our method, we therefore report in Table 1.3 the correlations, in levels and in first differences, of the original SVIs of Dimpfl and Jank (2016) with the RBC SVI and with a naive concatenation in which downloaded series are attached to each other without adjustment. We interpret a correlation coefficient smaller than 1 as a measure of the loss of information from the construction of the index.

As can be seen, the correlation between our RBC index in levels and the original one is very close to one. In contrast, the naive concatenation comes at the cost of a huge loss of information. This is in line with Figure 1.1, which shows that the naive concatenation method results in an SVI time series that does not correspond to the original SVI series at all. When using returns, the backward RBC SVI (with intercept) consistently exhibits a higher correlation with the original time series than the naive SVI log-returns.
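For concreteness, the correlations in Table 1.3 can be computed along the following lines (a sketch; the series names are placeholders, and the log-return step assumes strictly positive levels):

```python
import numpy as np

def level_and_return_correlation(original, constructed):
    """Correlation of two SVI series in levels and in log-returns
    (logarithmic first differences)."""
    orig = np.asarray(original, dtype=float)
    cons = np.asarray(constructed, dtype=float)
    corr_levels = np.corrcoef(orig, cons)[0, 1]
    # log-returns require strictly positive levels in both series
    r_orig = np.diff(np.log(orig))
    r_cons = np.diff(np.log(cons))
    corr_returns = np.corrcoef(r_orig, r_cons)[0, 1]
    return corr_levels, corr_returns
```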

In order to evaluate whether our proposed regression-based construction method preserves the statistical properties of the SVI, we calculate kernel densities and moments based on the log-returns of the original SVI, of the RBC SVI, and of the naively concatenated SVI. The kernel densities are displayed in Figure 1.5. For the return series, it turns out that constructing the SVI backwards and always including an intercept is the best choice for all series, as this kernel density is closest to that of the original data. The naive concatenation always results in the worst approximation of the original data, even if the returns across the border points at which adjacent time frames are concatenated are excluded.
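The kernel densities underlying Figure 1.5 can be estimated, for instance, with a Gaussian kernel (a sketch; the bandwidth choice and the series names are our assumptions, not taken from the text):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_on_grid(log_returns, grid):
    """Gaussian kernel density estimate of SVI log-returns, evaluated on a
    common grid so that densities of different series can be compared."""
    return gaussian_kde(np.asarray(log_returns, dtype=float))(grid)

# hypothetical usage: evaluate all densities on the same grid
# grid = np.linspace(-1.0, 1.5, 400)
# dens_original = kde_on_grid(returns_original, grid)
# dens_rbc = kde_on_grid(returns_rbc, grid)
```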

Figure 1.3: Comparison of RBC SVI and Original Google SVI – Search-Term Dow Jones

Google's original SVI as downloaded on 30-1-2011 (right scale, black line) compared to the RBC SVI based on currently available data (left scale, blue line). For the construction, a linear transformation is used that always contains a constant.

[Two panels, each plotting the RBC SVI (left scale, blue line) against the original SVI (right scale, black line) over 2007–2012: (a) SVI for search-term 'Dow Jones', backward constructed; (b) SVI for search-term 'Dow Jones', forward constructed.]

Figure 1.4: Comparison of RBC SVI (Optional Intercept) and Original Google SVI – Search-Term Dow Jones

Google's original SVI as downloaded on 30-1-2011 (right scale, black line) compared to the RBC SVI based on currently available time series (left scale, blue line). When constructing the SVI, in this case the constant was excluded from the linear transformation whenever the hypothesis $\gamma = 0$ could not be rejected by a t-test with robust standard errors.

[Two panels, each plotting the RBC SVI (left scale) against the original SVI (right scale) over 2007–2012: (a) backward RBC SVI compared to original SVI; (b) forward RBC SVI compared to original SVI.]

Table 1.4: Moments of the Original, Naive and RBC SVI

The table displays the mean µ, standard deviation σ, as well as the skewness and kurtosis of the returns of the original SVI (Original) and of the backward regression-based constructed SVI (RBC) for various search-terms. When constructing the SVI returns backwards, an intercept is always included. The third line (Naive) presents the moments if returns are calculated on a naively concatenated SVI time series. As the naive concatenation simply chains data time frames of 270 days together, the fourth line (Naive Ex.) reports the moments if the biased inter-time-frame returns are excluded from the naively concatenated time series.

Query   Series       µ       σ      Skewness   Kurtosis
CAC     Original     0.00    0.15   0.84        9.60
        RBC          0.00    0.15   0.60        7.27
        Naive       -0.00    0.26   0.18        4.40
        Naive Ex.   -0.00    0.25   0.15        4.01
DAX     Original     0.00    0.15   1.51       19.12
        RBC         -0.00    0.15   0.81       10.38
        Naive       -0.00    0.23   0.28        9.28
        Naive Ex.    0.00    0.22   0.53        7.61
DJIA    Original     0.00    0.17   1.67       15.57
        RBC          0.00    0.20   0.95        9.53
        Naive       -0.00    0.27   0.43       10.60
        Naive Ex.    0.00    0.26   0.75        8.95
FTSE    Original    -0.00    0.16   1.52       14.73
        RBC         -0.00    0.14   0.60        7.72
        Naive       -0.00    0.25   0.41        5.90
        Naive Ex.   -0.00    0.24   0.40        5.43

The comparison of moments is presented in Table 1.4.

The means of the logarithmic growth rates of the original as well as of all RBC and naive SVIs are centered around zero. However, the log-returns of the naive SVI are (in some cases considerably) more volatile. Also, naive concatenation reduces skewness and kurtosis by much more than our proposed algorithm, moving the distributional properties further away from those of the original data. Considering volatility, skewness, and kurtosis together, the returns from the backward RBC SVI (with intercept) reflect the moments of the original SVI best and, in particular, much better than the returns from the naively concatenated SVI.
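The moments in Table 1.4 can be reproduced schematically as follows (a sketch; we read the reported kurtosis as the raw, non-excess moment, which is an assumption on our part):

```python
import numpy as np
from scipy import stats

def return_moments(log_returns):
    """Mean, standard deviation, skewness and kurtosis of SVI log-returns."""
    r = np.asarray(log_returns, dtype=float)
    return (r.mean(),
            r.std(ddof=1),
            stats.skew(r),
            stats.kurtosis(r, fisher=False))  # non-excess (Pearson) kurtosis
```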

Based on all the criteria above, we conclude that the regression-based construction of the SVI according to our algorithm is sensible and useful. It is able to mimic the statistical properties of the hypothetical time series that Google could provide. This is most important if the data are to be used in levels (which is often the case in forecasting applications). If first differences are used, our methodology still performs better than a naive concatenation, but the differences are no longer as pronounced as for the levels.

Figure 1.5: Density Comparison of the Logarithmic Growth Rates of SVIs

This figure compares the kernel density of the logarithmic growth rates of Google's original SVI as downloaded on 30-1-2011 (black line) to the kernel density of the logarithmic growth rates of the RBC SVI based on currently available data (blue line). For the construction, the backward method is used.

The density of a normal distribution with the same mean and standard deviation as the original SVI is displayed as a dotted red line. In green, the kernel density estimate for the naively concatenated SVI returns is displayed; it is almost identical to the kernel density estimate of the naively concatenated SVI returns without the biased inter-time-frame returns. The latter is depicted by the orange dashed line.

[Two panels showing kernel densities of ∆SVI_t: (a) kernel density of the SVI for the search-term DAX; (b) kernel density of the SVI for the search-term CAC.]

Figure 1.6: Comparison of Original and RBC Weekly SVI – Search-Term "DAX"

The graph compares Google's original weekly SVI (black line) and our transformed, aggregated weekly RBC SVI (red line) for the term "DAX".

[Plot of the aggregated RBC SVI and the original weekly SVI over 2005–2015.]

As Google makes SVI time series available for longer time horizons at weekly resolution, and in order to evaluate the RBC algorithm with another data set directly obtained from Google, we aggregate our constructed time series to a weekly frequency. For this comparison, we limit ourselves to the SVI which turned out most accurate in the evaluation above, i.e., the SVI based on the backward construction with optional estimation of the intercept. We aggregate it by taking the weekly sum of the daily observations.

After this aggregation step we still need to account for the scaling of the time series.

Therefore, we regress the downloaded weekly time series on the aggregated RBC SVI and calculate the fitted values. The success of the method is illustrated in Figure 1.6 for the DAX, in which the fitted values and the downloaded SVI series are shown. The two time series can hardly be distinguished by the naked eye. The high fit is also supported by the high $R^2$ values that result from the auxiliary regressions (not reported); these are above 98% for all considered search-terms.
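The aggregation and rescaling step can be sketched as follows (Python with pandas and statsmodels; the series names and the weekly anchor are our assumptions):

```python
import pandas as pd
import statsmodels.api as sm

def rescale_to_weekly(rbc_daily, google_weekly):
    """Aggregate a daily RBC SVI (pandas Series with a DatetimeIndex) to
    weekly sums and map it onto the scale of Google's weekly SVI via an
    auxiliary OLS regression; returns the fitted values and the R^2."""
    rbc_weekly = rbc_daily.resample("W").sum()  # weekly sum of daily values
    df = pd.concat([google_weekly, rbc_weekly], axis=1, join="inner")
    df.columns = ["google", "rbc"]
    fit = sm.OLS(df["google"], sm.add_constant(df["rbc"])).fit()
    return fit.fittedvalues, fit.rsquared
```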