Smoothing ordered sparse contingency tables and the $\chi^2$ test

Klaus Abberger, University of Konstanz, Germany

Abstract

To estimate cell probabilities for ordered sparse contingency tables, several smoothing techniques have been investigated. It has been recognized that nonparametric smoothing methods provide estimators of cell probabilities that perform better than the pure frequency estimators. With the help of simulation examples, this paper shows that these smoothing techniques may help to obtain tests which are more powerful than the $\chi^2$ test with raw data. However, the distribution of the $\chi^2$ statistic after smoothing is unknown. This distribution can also be estimated by simulation methods.

Keywords: nonparametric estimation, local polynomial smoothers, local likelihood, sparse contingency tables, $\chi^2$ test, independence test

1 Introduction

There is a vast literature on nonparametric regression smoothers for continuous dependent and independent variables. Many different methods for estimating regression curves have been proposed, including kernel, local polynomial, spline and wavelet estimators. In this paper smoothing is applied to the estimation of probabilities in categorical data. In contrast to the situation of continuous data, where the benefits of smoothing (in the form of scatterplot smoothers, for example) are obvious, the applicability of smoothing methods to discrete data is less clear.

For a $d$-dimensional contingency table with $k_j$ ordered cells in the $j$-th dimension ($j = 1, 2, \ldots, d$), cell probabilities are usually estimated by frequency estimators. Tables which have small-to-moderate cell counts are called sparse tables. Such sparse tables occur when $k = \prod_{j=1}^{d} k_j$ (the total number of cells) and $n$ (the total number of observations) are both large. For sparse tables it is recognized that nonparametric smoothing techniques provide estimators for the cell probabilities with better performance than frequency estimators (see Aerts et al. (1997) for discussion).

This paper investigates the consequences of smoothing on statistical inference, in particular on the behaviour of the $\chi^2$ test of independence for two-dimensional sparse contingency tables. In the next section two smoothing methods for categorical data recently discussed in the literature are presented. Section 3 contains the main part of this paper and shows power simulations for the $\chi^2$ test of independence.

2 Smoothing methods for ordinal contingency tables

In this section two nonparametric estimators for ordinal contingency tables are presented. For a more comprehensive treatise on smoothing methods for discrete data see Simonoff and Tutz (2000).

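Weighted least-squares polynomial fitting is one way to smooth contingency tables. This is a well-known method for smoothing scatterplots (Fan and Gijbels, 1996). For example, a local linear estimator $\hat{\pi}_{ij}$ for the probability of falling in the $(i,j)$th cell of an $R \times C$ two-dimensional table is $\hat{\beta}_0$, where $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)$ is the minimizer of

$$\sum_{k=1}^{R} \sum_{l=1}^{C} \left[ p_{kl} - \beta_0 - \beta_1 \left( \frac{i}{R} - \frac{k}{R} \right) - \beta_2 \left( \frac{j}{C} - \frac{l}{C} \right) \right]^2 K_{h_R,h_C}(i,j,k,l,R,C), \qquad (1)$$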

with $p_{kl}$ the relative frequencies and $K_{h_R,h_C}(\cdot)$ a two-dimensional kernel function, where $h_R$ and $h_C$ are the smoothing parameters for rows and columns, respectively. A common technique for generating a $d$-dimensional kernel $K_d$ is to use the product of univariate kernels:

$$K_d(u) = \prod_{j=1}^{d} K_1(u_j). \qquad (2)$$
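As an illustration, criterion (1) with the product kernel (2) can be sketched in Python. This is a minimal sketch under stated assumptions, not the LOESS routine used for the simulations in this paper: the tricube function as univariate kernel $K_1$, the bandwidths `h_r = h_c = 0.5` on the scaled grids, and the names `tricube` and `local_linear_smooth` are choices made here for illustration only. The example counts are those of Table 1.

```python
import numpy as np

def tricube(u):
    """Tricube kernel, zero outside |u| < 1 (an assumed choice for K_1)."""
    u = np.abs(u)
    return np.where(u < 1.0, (1.0 - u**3) ** 3, 0.0)

def local_linear_smooth(p, h_r, h_c):
    """Local linear estimate for every cell of an R x C table.

    p        : R x C array of relative frequencies p_kl
    h_r, h_c : bandwidths on the scaled grids k/R and l/C (assumed values)
    """
    p = np.asarray(p, dtype=float)
    R, C = p.shape
    rows = np.arange(1, R + 1) / R            # scaled row positions k/R
    cols = np.arange(1, C + 1) / C            # scaled column positions l/C
    kk, ll = np.meshgrid(rows, cols, indexing="ij")
    y = p.ravel()
    est = np.empty((R, C))
    for i in range(R):
        for j in range(C):
            dr = (rows[i] - kk).ravel()       # i/R - k/R
            dc = (cols[j] - ll).ravel()       # j/C - l/C
            w = tricube(dr / h_r) * tricube(dc / h_c)   # product kernel, eq. (2)
            X = np.column_stack([np.ones_like(dr), dr, dc])
            sw = np.sqrt(w)                   # weighted least squares via sqrt weights
            beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
            est[i, j] = beta[0]               # intercept = estimate at cell (i, j)
    return est

# Counts of Table 1, converted to relative frequencies.
table1 = np.array([[1, 1, 1, 4, 0],
                   [4, 10, 10, 18, 2],
                   [1, 10, 3, 6, 0],
                   [1, 6, 4, 10, 2],
                   [1, 1, 2, 2, 0]])
smoothed = local_linear_smooth(table1 / table1.sum(), h_r=0.5, h_c=0.5)
```

Because the fit is an unconstrained least squares fit, individual entries of `smoothed` may be negative and the entries need not sum to one.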

A difficulty with local polynomial probability estimates is that while an arbitrary regression function can take on positive or negative values, a probability vector cannot take on negative values. The problem is that the estimator is based on the minimization of a local least squares criterion, which is appropriate for regression data, but not for categorical data.

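To overcome these difficulties Simonoff (1998) introduced an estimator which is based on local likelihood rather than local least squares. The local linear likelihood estimator for a two-dimensional table is $\exp(\hat{\beta}_0)$, where $\hat{\beta}_0$ is the constant term of the maximizer of

$$\sum_{k=1}^{R} \sum_{l=1}^{C} \left\{ n_{kl} \left[ \beta_0 + \beta_1 \left( \frac{i}{R} - \frac{k}{R} \right) + \beta_2 \left( \frac{j}{C} - \frac{l}{C} \right) \right] - \exp \left[ \beta_0 + \beta_1 \left( \frac{i}{R} - \frac{k}{R} \right) + \beta_2 \left( \frac{j}{C} - \frac{l}{C} \right) \right] \right\} K_{h_R,h_C}(i,j,k,l,R,C). \qquad (3)$$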

Thus it is guaranteed that the estimates will be nonnegative. For a detailed motivation and discussion of this estimator see Simonoff and Tutz (2000).
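A short worked special case may make the nonnegativity concrete. Assume, purely for illustration, a local constant fit ($\beta_1 = \beta_2 = 0$ in criterion (3)), a nonnegative kernel and nonnegative $n_{kl}$; these restrictions are not part of the estimator described above. Writing $K_{kl}$ as shorthand for $K_{h_R,h_C}(i,j,k,l,R,C)$, setting the derivative with respect to $\beta_0$ to zero gives:

```latex
\sum_{k=1}^{R}\sum_{l=1}^{C} \left( n_{kl} - e^{\beta_0} \right) K_{kl} = 0
\quad\Longrightarrow\quad
e^{\hat{\beta}_0}
  = \frac{\sum_{k=1}^{R}\sum_{l=1}^{C} n_{kl}\, K_{kl}}
         {\sum_{k=1}^{R}\sum_{l=1}^{C} K_{kl}}
```

In this special case the estimate is a kernel-weighted average of the $n_{kl}$ and hence nonnegative. In the general local linear case nonnegativity follows directly from the exponential form $\exp(\hat{\beta}_0)$.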

Although we prefer the likelihood method proposed by Simonoff, the simulations in the next section are calculated with the LOESS procedure, which is based on local polynomial estimation. LOESS is used because of its fast implementation in S-Plus. For the simulation studies this is very important, since power simulations require a huge number of repetitions.

3 Power simulations for the $\chi^2$ test

Given the advantages of smoothing frequencies to estimate probabilities in sparse ordered contingency tables, the purpose of this simulation study is to examine the effect of smoothing on the usual $\chi^2$ test of independence. Do the improved estimates yield more powerful tests?

In the simulation examples the following data pattern is chosen. The dimension of the table is $5 \times 5$ and the total number of observations is always $n = 100$. For easy control of the dependency structure the underlying random process is bivariate normal with varying correlations. In the independence situation the correlation coefficient is set to zero. The 100 observations are generated from this bivariate standard normal distribution. The resulting sample is standardized by the span so that the observed values lie between $-1$ and $1$. This bivariate data set is then categorized. For the first dimension we have 5 categories. An observation falls in category I if $-1 \le x_i < -0.3$, in category II if $-0.3 \le x_i < -0.05$, in category III if $-0.05 \le x_i < 0.05$, in category IV if $0.05 \le x_i < 0.3$, and in category V if $0.3 \le x_i \le 1$. The same categorization is applied to the second dimension. In the independence situation this procedure yields a $5 \times 5$ contingency table with independent rows and columns. A typical data set is shown in Table 1.
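The data generating step can be sketched as follows. This is a hedged illustration, not the original S-Plus code: the text does not spell out how the sample is "standardized by the span", so the sketch assumes that each coordinate is divided by its largest absolute value, which places all values in $[-1, 1]$. The names `EDGES` and `simulate_table` are hypothetical.

```python
import numpy as np

# Category limits I..V taken from the text; the last interval is closed at 1.
EDGES = np.array([-1.0, -0.3, -0.05, 0.05, 0.3, 1.0])

def simulate_table(rho, n=100, rng=None):
    """Draw n bivariate normal points with correlation rho and return the 5 x 5 counts."""
    rng = np.random.default_rng() if rng is None else rng
    cov = np.array([[1.0, rho], [rho, 1.0]])
    xy = rng.multivariate_normal(np.zeros(2), cov, size=n)
    # Assumption: rescale each coordinate by its largest absolute value.
    xy = xy / np.abs(xy).max(axis=0)
    # histogram2d uses half-open bins, except that the last bin includes its right edge.
    counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=(EDGES, EDGES))
    return counts.astype(int)

# One table under independence (rho = 0); the seed is arbitrary.
table = simulate_table(rho=0.0, rng=np.random.default_rng(1))
```

Setting `rho` to a nonzero value gives tables under dependence, as needed for the power study.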

It is possible to use a $\chi^2$ test to test the independence of these data. Since the counts are small and sometimes even zero, smoothing the table may be of advantage. As mentioned in the previous section, the LOESS procedure is used for smoothing. The polynomial degree is fixed at one, so that we arrive at local linear smoothing. We chose the default smoothing parameter implemented in S-Plus, which is span = 2/3, where span is the fraction of the total number of points used in the smoothing. Both the estimation method and the choice of smoothing parameter can be further improved and calibrated. But as we will see below, even this straightforward but very fast smoothing method leads to appealing results.

         I   II  III   IV    V
I        1    1    1    4    0
II       4   10   10   18    2
III      1   10    3    6    0
IV       1    6    4   10    2
V        1    1    2    2    0
                     Total 100

Table 1: Example of cell counts of a categorized random sample from an uncorrelated bivariate normal distribution

The data generating algorithm described above is replicated 10,000 times to study the behaviour of the $\chi^2$ statistic.

Figure 1 shows the estimated densities of the $\chi^2$ statistics, once for the raw data and once for the smoothed data. For the $\chi^2$ statistic of the raw data there is nothing exceptional. Testing for independence with $\alpha = 0.05$ and $4 \times 4 = 16$ degrees of freedom leads to a simulation-based estimate of $\hat{\alpha} = 0.0538$, that is, 538 of the 10,000 tests are significant. The fixed $\alpha$ is kept very well, although the usual rule of thumb that all cell counts should have a minimum size of five is violated.
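For reference, the usual Pearson $\chi^2$ statistic for a raw table can be computed as in the following sketch. The function name `pearson_chi2` is hypothetical, and skipping cells whose expected count is zero is an assumption made here so that tables with an empty row or column do not cause a division by zero. The value 26.3 is the tabulated 0.95 quantile of the $\chi^2$ distribution with 16 degrees of freedom, rounded to one decimal.

```python
import numpy as np

def pearson_chi2(table):
    """Pearson chi-square statistic for independence in a two-way table of counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: (row total * column total) / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    mask = expected > 0          # assumption: skip cells with an empty margin
    return float(((table - expected)[mask] ** 2 / expected[mask]).sum())

# Counts of Table 1.
table1 = np.array([[1, 1, 1, 4, 0],
                   [4, 10, 10, 18, 2],
                   [1, 10, 3, 6, 0],
                   [1, 6, 4, 10, 2],
                   [1, 1, 2, 2, 0]])
stat = pearson_chi2(table1)
reject = stat > 26.3             # test at alpha = 0.05 with 16 degrees of freedom
```

Repeating this over many simulated tables and averaging `reject` gives a simulation-based estimate of the rejection rate of the kind reported above.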

Also shown in Figure 1 is the estimated density of the $\chi^2$ statistic after smoothing. Unsurprisingly, the usual $\chi^2$ behaviour is destroyed. The $\chi^2$ statistic after smoothing is not $\chi^2$ distributed. In particular, the scale is completely changed and quite different from the scale of the usual $\chi^2$ statistic. So the standard $\chi^2$ tables are not applicable to the smoothed $\chi^2$ statistic.

This problem will be discussed further at the end of this section. For the power simulations the critical value can be estimated from the simulated density in Figure 1. Since the simulations are done under the null hypothesis of independence, the $(1-\alpha)$ quantile of this density can be used as an estimate of the critical value. For $\alpha = 0.05$ the estimated critical value is 4.163184, in comparison with 26.3, which is the critical value of the $\chi^2$ distribution with 16 degrees of freedom.

After fixing the critical values for both procedures, the correlation coefficient of the data generating bivariate normal process can be varied to study the power of the two procedures. For each correlation coefficient considered, 10,000 repetitions are carried out and the fraction of significant tests is used as an estimate of the power.

[Figure 1: two panels, "Chi**2 statistic for raw data" and "Chi**2 statistic after smoothing", each showing the estimated density.]

Figure 1: Monte Carlo estimated densities of $\chi^2$ statistics for raw and smoothed data

[Figure 2: power plotted against correlation (0.0 to 0.6) for raw data and for smoothed data.]

Figure 2: Monte Carlo estimated power of the $\chi^2$ test for raw and smoothed data

Figure 2 shows the results of these calculations. The figure illustrates the benefits of smoothing very clearly, because the power function after smoothing the frequencies is much steeper than the power function of the usual $\chi^2$ test. Thus smoothing leads to a considerable improvement in the power of the common $\chi^2$ test.

[Figure 3: two estimated densities of the Chi**2 statistic, labelled "normal" and "uniform".]

Figure 3: Monte Carlo estimated densities of $\chi^2$ statistics for smoothed data with independent normal and independent uniform data generating process

The price we have to pay for this improvement is that the $\chi^2$ distribution table can no longer be used. Instead we have to use more complicated methods.

Figure 3 shows again the density of the $\chi^2$ statistic after smoothing independent bivariate normal data, already included in Figure 1. In addition, Figure 3 shows the simulation-based estimate of the density of $\chi^2$ statistics for smoothed categorized data generated by two independent uniform distributions. The two densities do not coincide. Thus the density of the $\chi^2$ statistic, and therefore the critical value, depends on the data generating process. The critical value also depends on the kind of smoothing, especially on the chosen bandwidth. Therefore the suitable critical value depends on the specific problem at hand.

A Monte Carlo based estimation method for this critical value of a specific table consists of the following steps:

1. Take the marginal distributions as fixed.
2. Choose the smoothing method and the smoothing parameter.
3. Draw bivariate observations from two independent uniform distributions.
4. Discretize the data according to the relative marginal frequencies from step 1.
5. Calculate the $\chi^2$ statistic.

Now repeat steps 3 to 5 many times to obtain an estimate of the specific distribution of the statistic under the null hypothesis, and choose the $(1-\alpha)$ quantile of this distribution as critical value.
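The steps above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the procedure actually run for this paper: the smoother is a plain local linear fit with a tricube product kernel and bandwidth 0.5 instead of S-Plus LOESS with span 2/3, and the statistic after smoothing is assumed to be the Pearson form computed from the smoothed relative frequencies (clipped at zero and renormalized). All function names are hypothetical, and the resulting numbers are not expected to match the values reported in the text.

```python
import numpy as np

def tricube(u):
    u = np.abs(u)
    return np.where(u < 1.0, (1.0 - u**3) ** 3, 0.0)

def smoother_matrix(R, C, h_r, h_c):
    """Matrix S with smoothed = S @ p.ravel() for local linear smoothing (step 2)."""
    kk, ll = [g.ravel() for g in np.meshgrid(np.arange(1, R + 1) / R,
                                             np.arange(1, C + 1) / C, indexing="ij")]
    S = np.empty((R * C, R * C))
    for m in range(R * C):
        dr, dc = kk[m] - kk, ll[m] - ll
        w = tricube(dr / h_r) * tricube(dc / h_c)        # product kernel weights
        X = np.column_stack([np.ones(R * C), dr, dc])
        XtW = X.T * w
        S[m] = np.linalg.solve(XtW @ X, XtW)[0]          # row giving the intercept
    return S

def smoothed_chi2(counts, S):
    """Assumed statistic: Pearson form on smoothed, clipped, renormalized frequencies."""
    n = counts.sum()
    p = np.clip((S @ (counts.ravel() / n)).reshape(counts.shape), 0.0, None)
    p = p / p.sum()
    e = np.outer(p.sum(axis=1), p.sum(axis=0))           # product of smoothed margins
    mask = e > 0
    return float(n * ((p - e)[mask] ** 2 / e[mask]).sum())

def mc_critical_value(table, S, alpha=0.05, reps=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    table = np.asarray(table)
    n = int(table.sum())
    # Step 1: cut points on [0, 1] from the observed relative marginal frequencies.
    row_cuts = np.cumsum(table.sum(axis=1))[:-1] / n
    col_cuts = np.cumsum(table.sum(axis=0))[:-1] / n
    stats = np.empty(reps)
    for b in range(reps):
        u = rng.random((n, 2))                                   # step 3
        r = np.searchsorted(row_cuts, u[:, 0], side="right")     # step 4
        c = np.searchsorted(col_cuts, u[:, 1], side="right")
        counts = np.zeros(table.shape)
        np.add.at(counts, (r, c), 1)
        stats[b] = smoothed_chi2(counts, S)                      # step 5
    return float(np.quantile(stats, 1 - alpha))

table1 = np.array([[1, 1, 1, 4, 0], [4, 10, 10, 18, 2], [1, 10, 3, 6, 0],
                   [1, 6, 4, 10, 2], [1, 1, 2, 2, 0]])
S = smoother_matrix(5, 5, h_r=0.5, h_c=0.5)
crit = mc_critical_value(table1, S, reps=200, rng=np.random.default_rng(0))
```

Since local linear smoothing is linear in the relative frequencies, the smoother matrix `S` is computed once and reused in every replication, which keeps the repeated steps cheap.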

[Figure 4: two panels, "smoothed normal data" and "smoothed uniform data", each showing the estimated density of the statistic and of the critical value.]

Figure 4: Estimated densities of Monte Carlo based estimates of critical values

Figure 4 illustrates the results of a simulation experiment based on the algorithm described above. The two data generating processes, independent normal and independent uniform, which are already used in Figure 3, are used again. For both processes we first draw one sample of size 100, which is used in step 1. Then steps 3-5 are repeated 1,000 times each to generate a density and an estimate of the critical value. The whole procedure is then repeated 100 times to get an estimated density of the critical value estimates, which is shown in Figure 4 together with the densities of $\chi^2$ statistics from Figure 3. From these calculations one can conclude that the algorithm described above yields quite accurate estimates of the critical value.

To sum up the various simulations in this section, we can state first that smoothing ordered sparse contingency tables may lead to a more powerful $\chi^2$ test than testing without smoothing. The price we have to pay for this improvement is uncertainty about the test distribution and hence about the suitable critical value. The critical value may be determined with simulation methods. For this purpose an algorithm is proposed which seems to give suitable results. The whole procedure may be improved further, especially through the estimation method and the choice of smoothing parameter.

4 References

Aerts M., Augustyns I., Janssen P. (1997): Smoothing Sparse Multinomial Data Using Local Polynomial Fitting, Nonparametric Statistics, 8, 127-147.

Aerts M., Augustyns I., Janssen P. (1997): Local Polynomial Estimation of Contingency Table Cell Probabilities, Statistics, 30, 127-148.

Aerts M., Augustyns I., Janssen P. (1997): Sparse Contingency and Smoothing for Multinomial Data, Statistics and Probability Letters, 33, 41-48.

Cleveland W.S. (1979): Robust Locally Weighted Regression and Smoothing Scatterplots, J. Amer. Statist. Assoc., 74, 829-836.

Fan J., Gijbels I. (1996): Local Polynomial Modeling and its Applications, Chapman and Hall, London.

Simonoff J.S. (1995): Smoothing Categorical Data, J. Statist. Plann. Inf., 47, 41-69.

Simonoff J.S. (1998): Three Sides of Smoothing: Categorical Data Smoothing, Nonparametric Regression, and Density Estimation, International Statistical Review, 66, 137-156.

Simonoff J.S., Tutz G. (2000): Smoothing Methods for Discrete Data, in: Smoothing and Regression: Approaches, Computation, and Application (Ed.: Schimek M. G.), 193-228, Wiley, New York.