the $\chi^2$ test
Klaus Abberger, University of Konstanz, Germany
Abstract
To estimate cell probabilities for ordered sparse contingency tables, several
smoothing techniques have been investigated. It has been recognized that
nonparametric smoothing methods provide estimators of cell probabilities that
perform better than the pure frequency estimators. With the help of simulation
examples, this paper shows that these smoothing techniques may help to obtain
tests that are more powerful than the $\chi^2$ test with raw data. However, the
distribution of the $\chi^2$ statistic after smoothing is unknown. This
distribution can also be estimated by simulation methods.
Keywords: nonparametric estimation, local polynomial smoothers, local
likelihood, sparse contingency tables, $\chi^2$ test, independence test
1 Introduction
There is a vast literature on nonparametric regression smoothers for continuous
dependent and independent variables. Many different methods for estimating
regression curves have been proposed, including kernel, local polynomial, spline
and wavelet estimators. In this paper smoothing is applied to the estimation of
probabilities in categorical data. In contrast to the situation of continuous
data, where the benefits of smoothing (in the form of scatterplot smoothers, for
example) are obvious, the applicability of smoothing methods to discrete data is
less clear.
For a d-dimensional contingency table with $k_j$ ordered cells in the j-th
dimension ($j = 1, 2, \ldots, d$), cell probabilities are usually estimated by
frequency estimators. Tables which have small-to-moderate cell counts are called
sparse tables. Such sparse tables occur when $k = \prod_{j=1}^{d} k_j$ (the total
number of cells) and n (the total number of observations) are both large. For
sparse tables it is recognized that nonparametric smoothing techniques provide
estimators for the cell probabilities with better performance than frequency
estimators (see Aerts et al. (1997) for discussion).
This paper examines the consequences of smoothing on statistical inference, in
particular the behaviour of the $\chi^2$ test of independence for
two-dimensional sparse contingency tables. In the next section two smoothing
methods for categorical data recently discussed in the literature are presented.
Section 3 contains the main part of this paper and shows power simulations for
the $\chi^2$ test of independence.
2 Smoothing methods for ordinal contingency tables
In this section two nonparametric estimators for ordinal contingency tables are
presented. For a more comprehensive treatise on smoothing methods for discrete
data see Simonoff and Tutz (2000).
Weighted least-squares polynomial fitting is one way to smooth contingency
tables. This is a well-known method for smoothing scatterplots (Fan and Gijbels,
1996). For example, a local linear estimator $\hat{\pi}_{ij}$ for the
probability of falling in the (i,j)th cell of an $R \times C$ two-dimensional
table is $\hat{\beta}_0$, where $\hat{\beta}$ is the minimizer of
$$\sum_{k=1}^{R} \sum_{l=1}^{C} \left[ p_{kl} - \beta_0 - \beta_1 \left( \frac{i}{R} - \frac{k}{R} \right) - \beta_2 \left( \frac{j}{C} - \frac{l}{C} \right) \right]^2 K_{h_R,h_C}(i,j,k,l,R,C), \tag{1}$$
with $p_{kl}$ the relative frequencies and $K_{h_R,h_C}(\cdot)$ a
two-dimensional kernel function, with $h_R$ and $h_C$ the smoothing parameters
for rows and columns, respectively. A common technique for generating $K_d$ is
to use the product of univariate kernels:
$$K_d(u) = \prod_{j=1}^{d} K_1(u_j). \tag{2}$$
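
The local linear fit can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the simulations: the Gaussian choice of the univariate kernel, the way the kernel arguments are scaled by the smoothing parameters, and the names `gaussian_kernel` and `local_linear_smooth` are all hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Univariate kernel K_1 (Gaussian; an illustrative assumption)."""
    return np.exp(-0.5 * u ** 2)

def local_linear_smooth(p, h_r, h_c, kernel=gaussian_kernel):
    """Local linear estimate for every cell of an R x C table.

    p        : R x C array of relative frequencies p_kl
    h_r, h_c : smoothing parameters for rows and columns
    """
    p = np.asarray(p, dtype=float)
    R, C = p.shape
    rows = np.arange(1, R + 1) / R   # k / R
    cols = np.arange(1, C + 1) / C   # l / C
    est = np.empty((R, C))
    for i in range(R):
        for j in range(C):
            dr = rows[i] - rows      # i/R - k/R for all k
            dc = cols[j] - cols      # j/C - l/C for all l
            # product of univariate kernels, one factor per dimension
            w = np.outer(kernel(dr / h_r), kernel(dc / h_c)).ravel()
            DR, DC = np.meshgrid(dr, dc, indexing="ij")
            X = np.column_stack([np.ones(R * C), DR.ravel(), DC.ravel()])
            # weighted least squares via square-root weights
            sw = np.sqrt(w)
            beta, *_ = np.linalg.lstsq(X * sw[:, None], p.ravel() * sw,
                                       rcond=None)
            est[i, j] = beta[0]      # the intercept is the cell estimate
    return est
```

Nothing in this fit constrains the intercept to be nonnegative, so for a sparse table individual cell estimates can fall below zero.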
A difficulty with local polynomial probability estimates is that while an
arbitrary regression function can take on positive or negative values, a
probability vector cannot take on negative values. The problem is that the
estimator is based on the minimization of a local least squares criterion, which
is appropriate for regression data, but not for categorical data.
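To overcome these difficulties Simonoff (1998) introduced an estimator which is
based on local likelihood, rather than local least squares. The local linear
likelihood estimator for a two-dimensional table is $\exp(\hat{\beta}_0)$, where
$\hat{\beta}_0$ is the constant term of the maximizer of
$$\sum_{k=1}^{R} \sum_{l=1}^{C} \left\{ n_{kl} \left[ \beta_0 + \beta_1 \left( \frac{i}{R} - \frac{k}{R} \right) + \beta_2 \left( \frac{j}{C} - \frac{l}{C} \right) \right] - \exp \left[ \beta_0 + \beta_1 \left( \frac{i}{R} - \frac{k}{R} \right) + \beta_2 \left( \frac{j}{C} - \frac{l}{C} \right) \right] \right\} K_{h_R,h_C}(i,j,k,l,R,C). \tag{3}$$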
Thus it is guaranteed that the estimates will be nonnegative. For a detailed
motivation and discussion of this estimator see Simonoff and Tutz (2000).
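
A rough Python sketch of a local linear likelihood fit of this kind is given below. Several things in it are assumptions rather than part of the estimator as published: the response is taken to be the relative frequency of each cell, so that the exponentiated intercept is on the probability scale; the kernel is a Gaussian product kernel; and the optimizer is a plain Newton iteration with step halving. The function name `local_likelihood_smooth` is hypothetical.

```python
import numpy as np

def local_likelihood_smooth(p, h_r, h_c, n_iter=50, tol=1e-10):
    """Local linear likelihood estimate exp(beta_0) for every cell.

    p is an R x C array of relative frequencies (assumed response) with at
    least one positive entry.
    """
    p = np.asarray(p, dtype=float)
    R, C = p.shape
    rows = np.arange(1, R + 1) / R
    cols = np.arange(1, C + 1) / C
    y = p.ravel()
    est = np.empty((R, C))

    def objective(beta, X, w):
        # kernel-weighted local log-likelihood: sum w * (y * eta - exp(eta))
        eta = X @ beta
        return np.sum(w * (y * eta - np.exp(eta)))

    for i in range(R):
        for j in range(C):
            dr, dc = rows[i] - rows, cols[j] - cols
            w = np.outer(np.exp(-0.5 * (dr / h_r) ** 2),
                         np.exp(-0.5 * (dc / h_c) ** 2)).ravel()
            DR, DC = np.meshgrid(dr, dc, indexing="ij")
            X = np.column_stack([np.ones(R * C), DR.ravel(), DC.ravel()])
            # start at the intercept-only solution (kernel-weighted mean)
            beta = np.array([np.log(np.sum(w * y) / np.sum(w)), 0.0, 0.0])
            for _ in range(n_iter):
                mu = np.exp(X @ beta)
                grad = X.T @ (w * (y - mu))
                neg_hess = X.T @ (X * (w * mu)[:, None])
                step = np.linalg.solve(neg_hess, grad)
                # halve the Newton step until the objective does not decrease
                t, f0 = 1.0, objective(beta, X, w)
                while objective(beta + t * step, X, w) < f0 and t > 1e-8:
                    t *= 0.5
                beta = beta + t * step
                if np.max(np.abs(t * step)) < tol:
                    break
            est[i, j] = np.exp(beta[0])   # positive by construction
    return est
```

Because the estimate is an exponential, it is strictly positive even for cells with zero counts, which is the property the likelihood approach is meant to deliver.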
Although we prefer the likelihood method proposed by Simonoff, the simulations
in the next section are calculated with the LOESS procedure, which is based on
local polynomial estimation. LOESS is used because of its fast implementation
in S-Plus. For the simulation studies this is very important, since power
simulations require a very large number of repetitions.
3 Power simulations for the $\chi^2$ test
Given the advantages of smoothing frequencies to estimate probabilities in
sparse ordered contingency tables, the purpose of this simulation study is to
examine the effect of smoothing on the usual $\chi^2$ test of independence. Do
the improved estimates yield more powerful tests?
In the simulation examples the following data pattern is chosen. The dimension
of the table is $5 \times 5$ and the total number of observations is always
n = 100. For easy control of the dependency structure the underlying random
process is bivariate normal with varying correlations. In the independence
situation the correlation coefficient is set to zero. The 100 observations are
generated from this bivariate standard normal distribution. The resulting sample
is standardized by the span so that the observed values lie between $-1$ and
$1$. This bivariate data set is then categorized. For the first dimension we
have 5 categories. The observation falls in category I if
$-1 \le x_i < -0.3$, in category II if $-0.3 \le x_i < -0.05$, in category III
if $-0.05 \le x_i < 0.05$, in category IV if $0.05 \le x_i < 0.3$, and in
category V if $0.3 \le x_i \le 1$. The same categorization is applied to the
second dimension. In the independence situation this procedure yields a
$5 \times 5$ contingency table with independent dimensions. A typical data set
is shown in Table 1.
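
The data generating step can be sketched in Python as follows. The sketch reads "standardized by the span" as a linear rescaling of each coordinate by its range onto the interval from -1 to 1; that reading is an assumption, as are the function names `rescale` and `simulate_table`.

```python
import numpy as np

# category boundaries taken from the text
CUTS = np.array([-0.3, -0.05, 0.05, 0.3])

def rescale(x):
    """Map a sample linearly onto [-1, 1] using its range (assumed reading of
    'standardized by the span')."""
    return 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0

def simulate_table(rho, n=100, rng=None):
    """Draw n bivariate standard normal pairs with correlation rho and
    cross-classify them into a 5 x 5 table of counts."""
    rng = np.random.default_rng() if rng is None else rng
    cov = np.array([[1.0, rho], [rho, 1.0]])
    xy = rng.multivariate_normal(np.zeros(2), cov, size=n)
    # digitize returns 0..4, corresponding to categories I..V
    r = np.digitize(rescale(xy[:, 0]), CUTS)
    c = np.digitize(rescale(xy[:, 1]), CUTS)
    table = np.zeros((5, 5), dtype=int)
    np.add.at(table, (r, c), 1)   # count the pairs falling in each cell
    return table

table = simulate_table(rho=0.0, rng=np.random.default_rng(1))
print(table)
print("total:", table.sum())
```

Setting `rho` to a nonzero value gives tables from the dependent case used later for the power study.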
It is possible to use a $\chi^2$ test to test the independence of these data.
Since the counts are small, and sometimes even zero, smoothing the table may be
of advantage. As mentioned in the previous section, the LOESS procedure is used
for smoothing. The polynomial degree is fixed at one, so that we arrive at local
linear smoothing. We chose the default smoothing parameter implemented in
S-Plus, which is span = 2/3, where span is the fraction of the total number of
points used in the smoothing. Both the estimation method and the choice of
smoothing parameter can be further improved and calibrated. But as we will see
below, even this straightforward but very fast smoothing method leads to
appealing results.

         I    II   III    IV     V
I        1     1     1     4     0
II       4    10    10    18     2
III      1    10     3     6     0
IV       1     6     4    10     2
V        1     1     2     2     0
                          total: 100

Table 1: Example of cell counts of a categorized random sample from an
uncorrelated bivariate normal distribution
The data generating algorithm described above is replicated 10,000 times to get
an impression of the behaviour of the $\chi^2$ statistic.
Figure 1 shows the estimated densities of the $\chi^2$ statistics, first for
the raw data and second for the smoothed data. For the $\chi^2$ statistic of
the raw data there is nothing exceptional. Testing for independence with
$\alpha = 0.05$ and $4 \times 4 = 16$ degrees of freedom leads to a simulation
based estimate of $\hat{\alpha}_{da} = 0.0538$, that is, 538 of the 10,000 tests
are significant. The fixed $\alpha$ is kept very well, although the usual rule
of thumb that all cell counts should have a minimum size of five is violated.
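
For reference, the raw-data test can be sketched in Python as below. The function name `pearson_chi2` is hypothetical, and the handling of cells with zero expected count (they are skipped) is an assumption of this sketch. The constant 26.296 is the 0.95 quantile of the chi-square distribution with 16 degrees of freedom.

```python
import numpy as np

def pearson_chi2(table):
    """Pearson chi-square statistic for independence in a two-way table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # expected counts under independence: row total * column total / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    mask = expected > 0   # skip cells whose expected count is zero
    return float(np.sum((table[mask] - expected[mask]) ** 2 / expected[mask]))

# cell counts of Table 1
table1 = np.array([[1, 1, 1, 4, 0],
                   [4, 10, 10, 18, 2],
                   [1, 10, 3, 6, 0],
                   [1, 6, 4, 10, 2],
                   [1, 1, 2, 2, 0]])

stat = pearson_chi2(table1)
print("statistic:", stat, "reject at 5%:", stat > 26.296)
```

Repeating this over many simulated independent tables and averaging the rejection indicator gives a simulation based estimate of the actual level.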
Also shown in Figure 1 is the estimated density of the $\chi^2$ statistic after
smoothing. Unsurprisingly, the usual $\chi^2$ behaviour is destroyed. The
$\chi^2$ statistic after smoothing is not $\chi^2$ distributed. In particular,
the scale is completely changed and quite different from the scale of the usual
$\chi^2$ statistic. So the standard $\chi^2$ tables are not applicable to the
smoothed $\chi^2$ statistic.
This problem will be discussed further at the end of this section. For the
power simulations the critical value can be estimated from the simulated density
in Figure 1. Since the simulations are done under the null hypothesis of
independence, the $1 - \alpha$ quantile of this density can be used as an
estimate of the critical value. For $\alpha = 0.05$ the estimated critical value
is 4.163184, in comparison with 26.3, which is the critical value of the
$\chi^2$ distribution with 16 degrees of freedom.
After fixing the critical values for both procedures, the correlation
coefficient of the data generating bivariate normal process can be varied to
study the power of the two procedures. 10,000 repetitions are carried out for
each of the correlation coefficients considered, and the proportion of
significant tests is used as an estimate of the power.

[Figure 1: Monte Carlo estimated densities of $\chi^2$ statistics for raw and
smoothed data. Panels: Chi**2 statistic for raw data; Chi**2 statistic after
smoothing. Vertical axis: estimated density.]

[Figure 2: Monte Carlo estimated power of the $\chi^2$ test for raw and smoothed
data. Horizontal axis: correlation; vertical axis: power. Curves: raw data,
smoothed.]
Figure 2 shows the results of these calculations. The figure illustrates the
benefits of smoothing very clearly, because the power function after smoothing
the frequencies is much steeper than the power function of the usual $\chi^2$
test. Thus smoothing leads to a considerable improvement in the power of the
common $\chi^2$ test.
[Figure 3: Monte Carlo estimated densities of $\chi^2$ statistics for smoothed
data with independent normal and independent uniform data generating process.
Horizontal axis: Chi**2 statistic; vertical axis: estimated density. Curves:
normal, uniform.]
The price we have to pay for this improvement is that the $\chi^2$ distribution
table can no longer be used. Instead we have to use more complicated methods.
Figure 3 shows again the density of the $\chi^2$ statistic after smoothing
independent bivariate normal data, already included in Figure 1. In addition,
Figure 3 shows the simulation based estimate of the density of $\chi^2$
statistics for smoothed categorized data generated by two independent uniform
distributions. The two densities do not coincide. Thus the density of the
$\chi^2$ statistic, and therefore the critical value, depends on the data
generating process. It also depends on the kind of smoothing, especially on the
chosen bandwidth. Therefore the suitable critical value depends on the specific
problem at hand.
A Monte Carlo based estimation method for this critical value of a specific
table consists of the following steps:
1. Take the marginal distributions as fixed.
2. Choose the smoothing method and the smoothing parameter.
3. Draw bivariate observations from two independent uniform distributions.
4. Discretize the data according to the relative marginal frequencies from
step 1.
5. Calculate the $\chi^2$ statistic.
Now repeat the steps 1-5 many times to obtain an estimate of the specific
distribution of the statistic under the null hypothesis and choose the
$(1 - \alpha)$ quantile of this distribution as critical value.
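
The steps above can be sketched in Python as follows. This is a minimal illustration under several assumptions: the smoother is a simple product-kernel weighted average standing in for LOESS, the statistic after smoothing is taken to be the Pearson statistic computed from the renormalized smoothed probabilities, and the bandwidth `h = 0.3` and all function names are hypothetical.

```python
import numpy as np

def kernel_weights(m, h):
    """Row-normalised Gaussian weights between m ordered categories."""
    g = np.arange(1, m + 1) / m
    w = np.exp(-0.5 * ((g[:, None] - g[None, :]) / h) ** 2)
    return w / w.sum(axis=1, keepdims=True)

def smooth(table, h):
    """Stand-in smoother (not LOESS): product-kernel weighted average of the
    relative frequencies, renormalised to sum to one."""
    p = table / table.sum()
    R, C = p.shape
    est = kernel_weights(R, h) @ p @ kernel_weights(C, h).T
    return est / est.sum()

def chi2_from_probs(p, n):
    """Pearson statistic computed from (smoothed) cell probabilities."""
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))
    return n * np.sum((p - expected) ** 2 / expected)

def mc_critical_value(row_freq, col_freq, n, h=0.3, alpha=0.05,
                      n_rep=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # step 1: marginal relative frequencies are held fixed
    row_cum, col_cum = np.cumsum(row_freq), np.cumsum(col_freq)
    R, C = len(row_freq), len(col_freq)
    stats = np.empty(n_rep)
    for b in range(n_rep):
        # step 3: independent uniform observations
        u = rng.uniform(size=(n, 2))
        # step 4: discretise according to the fixed marginals
        r = np.minimum(np.searchsorted(row_cum, u[:, 0], side="right"), R - 1)
        c = np.minimum(np.searchsorted(col_cum, u[:, 1], side="right"), C - 1)
        table = np.zeros((R, C))
        np.add.at(table, (r, c), 1)
        # steps 2 and 5: smooth with the chosen method, compute the statistic
        stats[b] = chi2_from_probs(smooth(table, h), n)
    return float(np.quantile(stats, 1.0 - alpha)), stats

# marginal relative frequencies of Table 1
row_freq = np.array([7, 44, 20, 23, 6]) / 100
col_freq = np.array([8, 28, 20, 40, 4]) / 100
crit, stats = mc_critical_value(row_freq, col_freq, n=100,
                                rng=np.random.default_rng(0))
print("estimated critical value:", crit)
```

Any other smoother can be substituted for `smooth`; the resulting critical value is only valid for the smoother and smoothing parameter actually used in the test.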
[Figure 4: Estimated densities of Monte Carlo based estimates of critical
values. Panels: smoothed normal data; smoothed uniform data. Vertical axis:
estimated density. Curves: statistic, critical value.]
Figure 4 illustrates the results of a simulation experiment based on the
algorithm described above. The two data generating processes, independent normal
and independent uniform, which were already used in Figure 3, are used again.
For both processes we first draw one sample of size 100, which is used in
step 1. Then steps 3-5 are repeated 1,000 times each to generate a density and
an estimate of the critical value. The whole procedure is then repeated 100
times to get an estimate of the density of the estimated critical values. These
densities are shown in Figure 4 together with the densities of the $\chi^2$
statistics from Figure 3. From these calculations one can conclude that the
algorithm described above yields quite accurate estimates of the critical value.
To sum up the various simulations in this section, we can state first that
smoothing ordered sparse contingency tables may lead to a more powerful
$\chi^2$ test than testing without smoothing. The price we have to pay for this
improvement is uncertainty about the test distribution and, consequently, about
the suitable critical value. The critical value may be determined with
simulation methods. For this purpose an algorithm is proposed which seems to
give suitable results. The whole procedure can be improved further, in
particular through the estimation method and the choice of smoothing parameter.
4 References
Aerts M., Augustyns I., Janssen P. (1997): Smoothing Sparse Multinomial
Data Using Local Polynomial Fitting, Nonparametric Statistics, 8, 127-147.
Aerts M., Augustyns I., Janssen P. (1997): Local Polynomial Estimation
of Contingency Table Cell Probabilities, Statistics, 30, 127-148.
Aerts M., Augustyns I., Janssen P. (1997): Sparse Consistency and
Smoothing for Multinomial Data, Statistics and Probability Letters, 33, 41-48.
Cleveland W. S. (1979): Robust Locally Weighted Regression and Smoothing
Scatterplots, J. Amer. Statist. Assoc., 74, 829-836.
Fan J., Gijbels I. (1996): Local Polynomial Modelling and its Applications,
Chapman and Hall, London.
Simonoff J. S. (1995): Smoothing Categorical Data, J. Statist. Plann. Inf.,
47, 41-69.
Simonoff J. S. (1998): Three Sides of Smoothing: Categorical Data Smoothing,
Nonparametric Regression, and Density Estimation, International Statistical
Review, 66, 137-156.
Simonoff J. S., Tutz G. (2000): Smoothing Methods for Discrete Data,
in: Smoothing and Regression: Approaches, Computation, and Application (Ed.:
Schimek M. G.), 193-228, Wiley, New York.