
Case II - Effect Does Exist

4.1 Modus Operandi

Do not go where the path may lead; go instead where there is no path and leave a trail.

Ralph W. Emerson

We use a large arsenal of criteria to judge how well a model performs. These criteria are split into two parts. First, we use various loss functions (described in subsection 4.1.1) to measure the difference between the estimated and actual values. These functions almost all relate to the error terms of each model and help us to identify important properties of a model’s performance.

Second, we calculate several classification rates (see subsection 4.1.2). These describe how well the endogenous variable is estimated by the models with respect to some more general and very important categories. The reasoning is that it may be more important to replicate, for example, a significant and negative (normalized) t-value than to approximate its actual value.

To study the quality of the various estimators we employ several approaches. To study the fit of the models, we use the whole data set to calculate the fitted values. We also use bootstrapping to obtain the first and second moments of the values of the loss functions. To predict the data, and for the bootstrapping, we randomly partition the data into a training set and a test set. The same estimators as described in subsections 3.6.3, 3.6.4, 3.6.5 and 3.6.6 are studied.

We use two ways to construct the test and training sets. First, we randomly assign 50% of the data to the training set and the remaining 50% to the test set. Although a share of 50% is unusually large for a test set, it is necessary in our case because the values are most likely not independent (many studies will be present in both sets). Second, we assign 90% of all studies to the training set. The latter approach better corresponds to the idea of forecasting the outcome of an unknown (i.e., not included) study, while the first relates more to the prediction of random data in general. Each forecast or fit is calculated with ten independently and randomly generated training sets. The ten sets are the same for each estimator within the same approach (e.g., a training set of 50% or 90%).
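
As an illustration of the partitioning scheme, the following Python sketch generates ten random splits per approach, by observation for the 50% scheme and by study for the 90% scheme, so that every estimator is evaluated on the same splits. Variable names and the example data sizes are illustrative assumptions; the original computations need not have been done in Python.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_by_observation(n_obs, train_share=0.5, n_reps=10):
    """Ten random splits assigning train_share of all observations
    (t-values) to the training set."""
    n_train = int(round(train_share * n_obs))
    splits = []
    for _ in range(n_reps):
        perm = rng.permutation(n_obs)
        splits.append((perm[:n_train], perm[n_train:]))
    return splits

def split_by_study(study_ids, train_share=0.9, n_reps=10):
    """Ten random splits assigning train_share of all studies (and hence
    all of their t-values) to the training set."""
    studies = np.unique(study_ids)
    n_train = int(round(train_share * len(studies)))
    splits = []
    for _ in range(n_reps):
        chosen = rng.permutation(studies)[:n_train]
        mask = np.isin(study_ids, chosen)
        splits.append((np.where(mask)[0], np.where(~mask)[0]))
    return splits

# Placeholder usage: study_ids stands in for the actual study labels.
study_ids = np.repeat(np.arange(70), 10)   # hypothetical: 70 studies, 10 t-values each
splits_50 = split_by_observation(len(study_ids), train_share=0.5)
splits_90 = split_by_study(study_ids, train_share=0.9)
```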

4.1.1 Loss Functions

Just because you have a choice, it doesn’t mean that any of them has to be right.

Norton Juster, The Phantom Tollbooth, 1961

We employ a wide variety of loss functions to distinguish various characteristics of the estimators. A summary of fifty loss functions, from which we have taken some, is given by Andres and Spiwoks (2000). Each loss function has its own merits and justification. Most of them are symmetrical (punishing deviations in both directions equally); asymmetrical loss functions are also possible but are, at least to some extent, quite arbitrary. A short code sketch of several of these functions follows the list below.

• RMSE: the root mean squared error $\sqrt{\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{N}}$. The most commonly used loss function. The lower its value, the better the estimates are. Small errors ($<1$) are less important than larger ($>1$) errors.

• Cor.: the Pearson correlation between $y$ and $\hat{y}$. There shouldn’t be any negative values; the closer to one, the better the estimates are.

• Adj. $R^2$: the classic adjusted R squared, $1-(1-R^2)\frac{N-1}{N-k-1}$, with $k$ being the number of regressors. It is the same as the correlation but adjusted for the ratio between the number of regressors and observations.

• U: Theil’s (new) inequality coefficient $\sqrt{\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{N}y_i^2}}$ and its decomposition

– U.bias $= \frac{(\bar{y}-\bar{\hat{y}})^2}{\frac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}$,

– U.var $= \frac{(s_y-s_{\hat{y}})^2}{\frac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}$,

– U.cov $= \frac{2(1-\rho)\,s_y\,s_{\hat{y}}}{\frac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}$.

U should be zero in the case of a perfect estimator and one for the most naive estimator (the constant; any values above one indicate that the estimator is worse than the naive estimator). The estimation errors are divided into U.bias (systematic error in the mean value), U.var (systematic error in the variance) and U.cov (unsystematic random error). These should add up to one (except for rounding errors). The perfect estimator has a U.cov of one.

• RMSPE: the root mean squared proportional error, $\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{y_i-\hat{y}_i}{y_i}\right)^2}$.

It is similar to the RMSE but measures the error relative to the true values. Therefore, it is independent of scaling.

• CI.hit: the fraction of predicted values in a $c\cdot s_y$ interval of $y$. In our case we set $c$ to 0.5. This measures whether or not the predicted values are “near” the actual values but does not consider the size of the errors (similar to Koop and Potter (2003) but with a tighter bandwidth).

• Sign.: the percentage of correctly classified significant (normalized) t-values at a 5% level (the categories are negative and significant as well as positive and significant). Although this measure belongs to the classification ratings, it is also included in the loss functions because a similar measure (the percentage of negative significant (normalized) t-values) is included in many of the tables in section 3.5 as well.

• Neg2pos4: a loss function which punishes large deviations much harder in the case of positive values, $\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\mathbb{1}_{[y_i\le 0]}(y_i-\hat{y}_i)^2+\mathbb{1}_{[y_i>0]}(y_i-\hat{y}_i)^4\right)}$.

We implement this loss function because we have an abundance of negative but only rela-tively few positive values. With this function we see whether an estimator fares better or worse with positive values.

• FsRMSE: false sign root mean squared error, $\sqrt{\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{[y_i\cdot\hat{y}_i<0]}(y_i-\hat{y}_i)^2}$.

This function is very similar to the RMSE-function but only punishes those estimates which carry the wrong sign. We implemented this because the sign of an estimate is, to some degree, even more important than the extent of its deviation.

• Min. dev. and Max. dev.: the maximum $\max_i(y_i-\hat{y}_i)$ and minimum $\min_i(y_i-\hat{y}_i)$ deviation. Since there are some outliers in the data which are not caught by any model, the minimum and maximum deviation are only of academic importance.

• Mean pos.: the mean positive deviation $\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{[\hat{y}_i>y_i]}\cdot|y_i-\hat{y}_i|$,

• Mean neg.: the mean negative deviation $\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{[\hat{y}_i<y_i]}\cdot|y_i-\hat{y}_i|$,

• Mean abs.: the mean absolute deviation $\frac{1}{N}\sum_{i=1}^{N}|y_i-\hat{y}_i|$.

By comparing the mean absolute and the mean squared error, we can judge whether the estimator tends to vary around the actual values more closely with some large errors (larger RMSE and smaller mean absolute error) or has fewer large errors but deviates more in general (smaller RMSE and larger mean absolute error).

• A log-predictive score LPS, somewhat similar to the original (see Good (1952)), but defined as $-\frac{1}{N}\sum_{i=1}^{N}\log\!\left(2\left(1-\mathcal{N}\!\left(|\hat{y}_i-y_i|/\hat{s}\right)\right)\right)$; $\mathcal{N}$ is the CDF of a standard normally distributed random variable. A perfect fit would result in an LPS of 0.

• An adjusted LPS obtained by adding a penalty $2k/N$ for the number of used regressors $k$ (similar to the AIC).
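
To make the definitions above concrete, the following sketch implements a selection of the loss functions for arrays y (actual values) and yhat (estimates). The function names and the scale estimate s_hat are illustrative assumptions, not taken from the original computations.

```python
import numpy as np
from scipy.stats import norm

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def theil_u(y, yhat):
    """Theil's (new) inequality coefficient and its decomposition."""
    mse = np.mean((y - yhat) ** 2)
    u = np.sqrt(np.sum((y - yhat) ** 2) / np.sum(y ** 2))
    rho = np.corrcoef(y, yhat)[0, 1]
    u_bias = (y.mean() - yhat.mean()) ** 2 / mse
    u_var = (y.std() - yhat.std()) ** 2 / mse
    u_cov = 2 * (1 - rho) * y.std() * yhat.std() / mse
    return u, u_bias, u_var, u_cov        # u_bias + u_var + u_cov should be ~1

def rmspe(y, yhat):
    return np.sqrt(np.mean(((y - yhat) / y) ** 2))

def ci_hit(y, yhat, c=0.5):
    """Fraction of predictions within c * sd(y) of the actual values."""
    return np.mean(np.abs(y - yhat) <= c * y.std())

def neg2pos4(y, yhat):
    """Squared errors for non-positive y, fourth-power errors for positive y."""
    e = y - yhat
    return np.sqrt(np.mean(np.where(y <= 0, e ** 2, e ** 4)))

def fs_rmse(y, yhat):
    """RMSE counting only predictions with the wrong sign."""
    return np.sqrt(np.mean(np.where(y * yhat < 0, (y - yhat) ** 2, 0.0)))

def lps(y, yhat, s_hat):
    """Log-predictive score; 0 for a perfect fit."""
    return -np.mean(np.log(2 * (1 - norm.cdf(np.abs(yhat - y) / s_hat))))
```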

Additionally, we perform a general encompassing test (somewhat simplified compared to Clements and Harvey (2004)) by regressing $y=\sum_{i=1}^{\#E}\beta_i\hat{y}^{(i)}+\varepsilon$ and analyzing the calculated coefficients $c_{\text{Encomp.}}$ and the respective p-values $p_{\text{Encomp.}}$ ($\#E$ is the number of competing estimators $\hat{y}^{(i)}$).

Good estimators should have coefficients near one and low p-values because they should contain most of the required information (in the sense of minimizing the sum of squared errors).
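
A minimal sketch of the encompassing regression, assuming the competing estimates are stacked column-wise in a matrix yhat_all (a hypothetical name) and using an OLS fit without a constant, as in the specification above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
y = rng.normal(size=200)                       # placeholder actual (normalized) t-values
yhat_all = np.column_stack([                   # placeholder competing estimates
    y + rng.normal(scale=0.5, size=200),
    y + rng.normal(scale=1.0, size=200),
])

# Regress y on all competing estimates (no constant), following
# y = sum_i beta_i * yhat^(i) + eps.
fit = sm.OLS(y, yhat_all).fit()
c_encomp = fit.params     # coefficients: ideally near one for informative estimators
p_encomp = fit.pvalues    # p-values: ideally low
```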

When possible, we also calculate the 95% confidence intervals of the mean of each loss function to see whether any method is significantly superior to the naive estimators. We call a model A superior (inferior) to model B in regard to a certain loss function f, if the confidence interval of f for model A does not include the mean value of f for model B and the mean of f(A) is better (worse) than the mean of f(B).
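
The superiority rule can be written down directly. The sketch below assumes a loss function for which lower values are better and takes as input the loss values obtained across the repeated training/test splits; names are illustrative.

```python
import numpy as np
from scipy import stats

def superior(loss_a, loss_b, alpha=0.05):
    """Model A is superior to B for a loss (lower is better) if the 95% CI of
    A's mean loss excludes B's mean loss and A's mean is smaller."""
    mean_a, mean_b = np.mean(loss_a), np.mean(loss_b)
    half = stats.t.ppf(1 - alpha / 2, df=len(loss_a) - 1) * stats.sem(loss_a)
    ci_lo, ci_hi = mean_a - half, mean_a + half
    return (mean_b < ci_lo or mean_b > ci_hi) and mean_a < mean_b
```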

4.1.2 Classification Ratings

While the loss functions reflect the behavior of the error terms, it is also important to know whether the estimators are able to capture more general characteristics and behavior of the data. For example, it might be more important to correctly estimate a t-value to be negative, or negative and significant, than whether it is $-2.5$ or $-5.5$. To study these characteristics we employ several categories (a short code sketch of the categorization follows the list):

Sign All (normalized) t-values are categorized to be negative or positive. We use this to look at the general direction of the estimates. The correct sign might be considered more important than an exact estimate. Moreover, we examine each sign separately to see whether an estimator fares better with results which tend to approve or disapprove of the deterrence hypothesis.

Pos. Sign A (normalized) t-value belongs to this category if it is significant and positive.

Neg. Sign A (normalized) t-value belongs to this category if it is significant and negative.

20% sign. A (normalized) t-value belongs to this category if it is significant at a 20% level (i.e., $|t|>1.28$). Since values around zero might be considered noise, we ignore these results in this category.

20% pos. A (normalized) t-value belongs to this category if it is significant at a 20% level and positive (i.e., $t>1.28$).

20% neg. A (normalized) t-value belongs to this category if it is significant at a 20% level and negative (i.e., $t<-1.28$).

5% sign. A (normalized) t-value belongs to this category if it is significant at a 5% level (i.e., $|t|>1.96$). Naturally, it is very interesting to see how well the estimators handle the significant (normalized) t-values. We use the 5% level to discard results which do not significantly approve (disapprove) of the deterrence hypothesis.

5% pos. A (normalized) t-value belongs to this category if it is significant at a 5% level and positive (i.e., $t>1.96$).

5% neg. A (normalized) t-value belongs to this category if it is significant at a 5% level and negative (i.e., $t<-1.96$).
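
The category definitions translate into simple predicates on the (normalized) t-values. A sketch using the thresholds quoted above (the Pos. Sign and Neg. Sign categories follow the same pattern and are omitted here):

```python
import numpy as np

def categorize(t):
    """Boolean membership of each (normalized) t-value in the categories."""
    t = np.asarray(t)
    return {
        "pos":       t > 0,
        "neg":       t < 0,
        "20% sign.": np.abs(t) > 1.28,
        "20% pos.":  t > 1.28,
        "20% neg.":  t < -1.28,
        "5% sign.":  np.abs(t) > 1.96,
        "5% pos.":   t > 1.96,
        "5% neg.":   t < -1.96,
    }
```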

For these categories, which are assumed to be non-empty sets, we calculate the following measures (a short code sketch follows the list):

Precision The percentage of values which are correctly classified (e.g., if a category actually contains $n$ values, and of these $\hat{n}$ are not estimated correctly, the precision is $(n-\hat{n})/n$). In the perfect case the precision is one, and zero in the worst case.

Error The error rate (if $m$ values are estimated to belong to a category, but of these $\hat{m}$ are not correct, the error rate is $\hat{m}$ divided by $m$). In the perfect case the error rate is zero, and one in the worst case. In the case that $m=0$ the error rate is defined to be zero.

Total miss rate The sum of not correctly and falsely classified values, divided by the total number of values belonging to that category (e.g., $[(1-\text{precision})\,n+(\text{error rate})\,m]$ divided by $n$). In the perfect case the total miss rate is zero; in the worst case it is limited by $\hat{m}$.
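
A sketch of these measures for a single category, given boolean vectors marking which values actually belong to the category and which are estimated to belong to it (names are illustrative):

```python
import numpy as np

def classification_rates(in_cat_true, in_cat_pred):
    """Precision, error rate and total miss rate for one category."""
    n = in_cat_true.sum()                                # actual members (assumed > 0)
    m = in_cat_pred.sum()                                # estimated members
    hits = np.logical_and(in_cat_true, in_cat_pred).sum()
    precision = hits / n                                 # (n - n_hat) / n
    error = (m - hits) / m if m > 0 else 0.0             # defined as zero if m = 0
    total_miss = ((n - hits) + (m - hits)) / n           # [(1-precision)n + (error)m] / n
    return precision, error, total_miss
```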

As in subsection 4.1.1, we calculate the 95% confidence intervals of the mean of each classification rate to see whether any method is significantly superior (inferior) to the naive estimators.