
3.3 Chi-squared Test on a Contingency Table

A statistical test provides a mechanism for deciding whether there is enough evidence to “reject” a conjecture or hypothesis about the process. The conjecture is called the
null hypothesis. Not rejecting may be a good result if we want to continue to act as if we “believe” the null hypothesis is true. Or it may be a disappointing result, possibly indicating we may not yet have enough data to “prove” something by rejecting the null hypothesis. A classic use of a statistical test is to check whether two random variables, $X$ and $Y$, are independent or not. The hypotheses of the test are hence defined as follows: “null hypothesis”, the variables are independent; “alternative hypothesis”, the variables are dependent. The null hypothesis is so called because it proposes something initially presumed true. It is rejected only when it becomes evidently false, that is, when there exists a certain degree of confidence that the data do not support the null hypothesis. This confidence is reached through the application of some statistical test, in our case the Chi-squared Test on a Contingency Table.

In other words, the null hypothesis is a statement about a belief. We may doubt that the null hypothesis is true, which might be why we are “testing” it. The alternative hypothesis might, in fact, be what we believe to be true. The test procedure is constructed so that the risk of rejecting the null hypothesis, when it is in fact true, is small.

This risk $\alpha$ is often referred to as the significance level of the test. By having a test with a small value of $\alpha$, we feel that we have actually “proved” something when we reject the null hypothesis. The choice of $\alpha$ is somewhat arbitrary, although in practice values of $1\%$, $5\%$ and $10\%$ are common. A value $\alpha = 5\%$ implies that the null hypothesis is rejected $5\%$ of the time when it is in fact true. Hence, the significance level is the probability that the null hypothesis will be rejected in error when it is true (a decision known as a Type I error, or “false positive”).
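To make the meaning of $\alpha$ concrete, the following minimal Python sketch (hypothetical data, using scipy's built-in version of the test introduced later in this section) simulates repeated experiments in which the null hypothesis is true and counts how often it is wrongly rejected; the empirical rate comes out close to the chosen $\alpha$.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 1000, 2000
rejections = 0
for _ in range(trials):
    x = rng.integers(0, 2, n)              # two genuinely independent
    y = rng.integers(0, 2, n)              # binary variables
    table = np.zeros((2, 2), dtype=int)
    np.add.at(table, (x, y), 1)            # fill the 2x2 contingency table
    p_value = chi2_contingency(table, correction=False)[1]
    rejections += p_value < alpha          # Type I error: H0 wrongly rejected
print(rejections / trials)                 # empirically close to alpha = 0.05
```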

The risk of failing to reject the null hypothesis when it is in fact false is not chosen by the user but is determined, as one might expect, by the magnitude of the real discrepancy. This risk $\beta$ is usually referred to as the Type II error. Large discrepancies between reality and the null hypothesis are easier to detect and lead to small errors of the second kind, while small discrepancies are more difficult to detect and lead to large errors of the second kind. The risk $\beta$ also increases as the risk $\alpha$ decreases.

Different $\alpha$-levels have different advantages and disadvantages. With a very small $\alpha$-level (say $1\%$), the test statistic is less likely to be more extreme than the critical value by chance alone, so a rejection is more significant than with a high $\alpha$-level (say $10\%$). However, smaller $\alpha$-levels run greater risks of failing to reject a false null hypothesis (a Type II error, or “false negative”), and so have less statistical power.
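This trade-off can be illustrated numerically. The sketch below is an assumed setup, not data from this work: two binary variables agree slightly more often than chance, so the null hypothesis is false, and the estimated $\beta$ grows as $\alpha$ shrinks.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(3)
n, trials = 200, 2000
pvals = []
for _ in range(trials):
    x = rng.integers(0, 2, n)
    agree = rng.random(n) < 0.55            # y agrees with x slightly
    y = np.where(agree, x, 1 - x)           # more often than chance
    table = np.zeros((2, 2), dtype=int)
    np.add.at(table, (x, y), 1)
    pvals.append(chi2_contingency(table, correction=False)[1])
pvals = np.array(pvals)
for alpha in (0.10, 0.05, 0.01):            # beta grows as alpha shrinks
    beta = (pvals >= alpha).mean()          # fraction of missed rejections
    print(f"alpha = {alpha:.2f}: estimated beta = {beta:.3f}")
```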

In particular, a Chi-squared Test is any statistical hypothesis test in which the test statistic has a Chi-squared distribution when the null hypothesis is true, or any test in which the probability distribution of the test statistic (assuming the null hypothesis is true) can be approximated by a Chi-squared distribution as closely as desired by making the sample size large enough. A quantity has a Chi-squared distribution with $r$ degrees of freedom if it can be written as the sum of the squares of $r$ standard normal independent variables (mean zero and variance equal to one).
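This characterization is easy to check numerically; a minimal sketch, assuming only numpy and scipy:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
r, samples = 4, 200_000
# Sums of the squares of r independent standard normal variables
t = (rng.standard_normal((samples, r)) ** 2).sum(axis=1)
print(t.mean(), chi2.mean(df=r))                   # both close to r = 4
print(np.quantile(t, 0.95), chi2.ppf(0.95, df=r))  # matching 95% quantiles
```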

Given a significance level $\alpha$, the outcome of the test is compared with the critical value of a Chi-squared variable with the corresponding number of degrees of freedom. The null hypothesis is rejected if the value of the test is bigger than the critical value.
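As a sketch of this decision rule (the helper function is hypothetical, not code from this work):

```python
from scipy.stats import chi2

def reject_null(T: float, dof: int, alpha: float = 0.05) -> bool:
    """Reject H0 iff the test value exceeds the critical value."""
    critical = chi2.ppf(1.0 - alpha, df=dof)
    return T > critical

# With 1 degree of freedom the 5% critical value is about 3.841.
print(reject_null(T=5.2, dof=1))   # True
print(reject_null(T=2.0, dof=1))   # False
```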


An easy test to check the independence of variables is the Chi-squared Test on a Contingency Table (see [77]). This is based on the concept of conditional probability.

From a data set we consider two random variables $X$ and $Y$. To test whether they are independent, we define two hypotheses:

“null hypothesis”: the variables are independent;

“alternative hypothesis”: the variables are dependent.

We can think of the following situation: the values of the random variables $X$ and $Y$ can be subdivided into $r$ and $s$ classes, respectively. As notation we have:

• $n$: total number of observations;

• $h$ and $v$: class indexes for $X$ and $Y$;

• $n_{hv}$: number of elements belonging to the intersection of the classes $X_h$ and $Y_v$.

We can write the Contingency Table as
$$
\begin{array}{ccccccc|c}
n_{11} & n_{12} & n_{13} & \cdots & n_{1v} & \cdots & n_{1s} & n_{1\cdot} \\
n_{21} & n_{22} & n_{23} & \cdots & n_{2v} & \cdots & n_{2s} & n_{2\cdot} \\
\vdots & \vdots & \vdots &        & \vdots &        & \vdots & \vdots \\
n_{h1} & n_{h2} & n_{h3} & \cdots & n_{hv} & \cdots & n_{hs} & n_{h\cdot} \\
\vdots & \vdots & \vdots &        & \vdots &        & \vdots & \vdots \\
n_{r1} & n_{r2} & n_{r3} & \cdots & n_{rv} & \cdots & n_{rs} & n_{r\cdot} \\
\hline
n_{\cdot 1} & n_{\cdot 2} & n_{\cdot 3} & \cdots & n_{\cdot v} & \cdots & n_{\cdot s} & n
\end{array}
$$

where $n_{h\cdot}$ and $n_{\cdot v}$ are the partial sums on a line and on a column respectively, i.e.
$$
n_{h\cdot} = \sum_{v} n_{hv} \qquad\text{and}\qquad n_{\cdot v} = \sum_{h} n_{hv}.
$$
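A minimal numerical illustration of the table and its marginal sums, with hypothetical counts:

```python
import numpy as np

counts = np.array([[30, 10],        # n_hv for r = s = 2 classes
                   [15, 45]])
row_sums = counts.sum(axis=1)       # n_h. : sum over v
col_sums = counts.sum(axis=0)       # n_.v : sum over h
n = counts.sum()                    # total number of observations
print(row_sums, col_sums, n)        # [40 60] [45 55] 100
```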

The joint distribution of $X$ and $Y$ is given by the probabilities $\pi_{hv} = P(X = h, Y = v)$, i.e. the probability that a single observation belongs to both the classes $X_h$ and $Y_v$. We suppose that the observations are independent, i.e. every observation “chooses”
the classes $h$ and $v$ without being influenced by any previous observation. Then the quantities $N_{hv}$, i.e. the random variables that correspond to the results $n_{hv}$, can be represented with a multinomial distribution of size $n$ and parameters $\pi_{hv}$:

$$
[N_{11}, N_{12}, \ldots, N_{rs}] \sim \mathcal{M}(n, \pi_{11}, \pi_{12}, \ldots, \pi_{rs})
$$
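One realisation of such a multinomial sample can be drawn as follows (the cell probabilities are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
pi = np.array([0.3, 0.2, 0.1, 0.4])            # assumed pi_hv, flattened, r = s = 2
counts = rng.multinomial(n, pi).reshape(2, 2)  # one realisation of the N_hv
print(counts, counts.sum())                    # cell counts, summing to n
```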

We are interested in the null hypothesis that the variables $X$ and $Y$ are independent, i.e.
$$
P(X = h,\, Y = v) = P(X = h)\, P(Y = v) \tag{3.1}
$$
If we define

$$
\pi_{h\cdot} = P(X = h) = \sum_{v} \pi_{hv} \qquad\text{and}\qquad \pi_{\cdot v} = P(Y = v) = \sum_{h} \pi_{hv}
$$
we can then rewrite the “null hypothesis” (i.e. the variables are independent) as
$$
\pi_{hv} = \pi_{h\cdot}\, \pi_{\cdot v} \qquad\text{for every } h \text{ and } v.
$$

The Chi-squared test statistic is
$$
T = \sum_{h,v} \frac{\left(N_{hv} - n\, \pi_{h\cdot}\, \pi_{\cdot v}\right)^2}{n\, \pi_{h\cdot}\, \pi_{\cdot v}}
$$

The parameters $\pi_{h\cdot}$ and $\pi_{\cdot v}$ are nuisance parameters, since we are not interested in them. We approximate them through the data:
$$
\hat{\pi}_{h\cdot} = \frac{N_{h\cdot}}{n} \qquad\text{and}\qquad \hat{\pi}_{\cdot v} = \frac{N_{\cdot v}}{n}
$$

Thus the test can be rewritten as
$$
T = \sum_{h,v} \frac{\left(N_{hv} - \dfrac{N_{h\cdot}\, N_{\cdot v}}{n}\right)^2}{\dfrac{N_{h\cdot}\, N_{\cdot v}}{n}} \tag{3.2}
$$
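Formula (3.2) can be implemented directly; the following sketch (hypothetical counts) cross-checks the hand-computed statistic against scipy's chi2_contingency with the Yates continuity correction disabled, which computes the same plain statistic:

```python
import numpy as np
from scipy.stats import chi2_contingency

counts = np.array([[30, 10],
                   [15, 45]])
n = counts.sum()
# Expected counts under independence: N_h. * N_.v / n
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
T = ((counts - expected) ** 2 / expected).sum()

T_scipy, p_value, dof, _ = chi2_contingency(counts, correction=False)
print(T, T_scipy)    # identical values of the statistic
print(dof)           # (2-1)(2-1) = 1, as derived below
```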

The quantity $T$ has a Chi-squared distribution with a number of degrees of freedom equal to
$$
\nu = (\text{number of classes}) - 1 - (\text{number of nuisance parameters})
$$
The number of classes is given by the product $rs$. Concerning the nuisance parameters, we observe that $\hat{\pi}_{r\cdot}$ is fixed once we define $\hat{\pi}_{1\cdot}, \hat{\pi}_{2\cdot}, \ldots, \hat{\pi}_{(r-1)\cdot}$, since the probability is normalized to one, i.e. $\sum_{h} \hat{\pi}_{h\cdot} = 1$ (and analogously for the $\hat{\pi}_{\cdot v}$). Therefore we need to approximate just $r-1$ and $s-1$ parameters, respectively. Then we have
$$
\nu = rs - 1 - (r-1) - (s-1) = (r-1)(s-1)
$$

Given a desired significance level, we can compare the value of (3.2) with the critical values of a Chi-squared variable characterized by $(r-1)(s-1)$ degrees of freedom: the independence (null) hypothesis is rejected if the value of $T$ is bigger than the critical value.

In the delay problem the variables $X$ and $Y$ can be seen as two random variables corresponding to the delays of two events in the graph. In our case the (two) classes for every variable correspond to the states:

• punctuality (e.g. earlier arrival or delays smaller than $3$ minutes);

• delay (e.g. delays bigger than $3$ minutes).

Thus in our case $r = s = 2$ and $\nu = 1$.
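A sketch of this discretisation, with hypothetical delay observations in minutes and the 3-minute threshold named above:

```python
import numpy as np

delays_x = np.array([0.0, 5.2, 1.1, 7.4, 2.9, 12.0, -1.0, 4.5])
delays_y = np.array([1.5, 6.1, 0.0, 8.0, 3.5, 9.9, 0.2, 2.0])

x = (delays_x > 3).astype(int)   # 0 = punctual, 1 = delayed
y = (delays_y > 3).astype(int)
table = np.zeros((2, 2), dtype=int)
np.add.at(table, (x, y), 1)      # 2x2 contingency table of (X, Y)
print(table)
```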

Applying the Contingency Table Test directly to every possible pair of variables, we can check whether there exists a dependency between the variables. What we would like to do is to look for more complex dependencies: we want to see not only if two variables are independent, but also if they are independent given all the other variables, and in particular given a third variable, i.e. if there exists a third variable that can explain the dependence. For example, given at one station two guaranteed connections between the arrival event of train $t_1$ and both the departure events of trains $t_2$ and $t_3$, we do not expect any direct dependence between the two departure events, since their similar behavior can be explained through the arrival event.

The idea of considering triples of variables arises from the Tri-graph, a stochastic graphical model that will be explained later in this chapter (Subsection 3.5.3). Therefore, given two variables $X$ and $Y$ corresponding to two different events in the system, we will consider a third variable $Z$ (chosen among all other variables of the system except $X$ and $Y$) and we will define two contingency tables $P_Z$ and $D_Z$. In each of these tables we will insert the observed values of the pair $(X, Y)$ corresponding to the days on which the variable $Z$ is punctual ($P_Z$) or delayed ($D_Z$), as defined above.

Then the test given by Formula (3.2) will be applied to both $P_Z$ and $D_Z$ to check the “null hypothesis” under the constraint of punctuality/delay of the third variable. We will repeat this step for every pair of events $(X, Y)$ and for every possible choice of the third variable $Z$ except the variables forming the considered pair, i.e. $Z \neq X, Y$. The alternative hypothesis ($X$ and $Y$ are dependent) will be “accepted” if the direct contingency table test rejected the null hypothesis and, for every third variable $Z$, the null hypothesis is rejected in at least one of the tables $P_Z$ or $D_Z$ (i.e. there does not exist any third variable $Z$ that can explain the dependence between $X$ and $Y$). This procedure should allow a direct comparison between the results of this method and those of the stochastic methods that will be introduced later in this chapter.
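A condensed sketch of the whole procedure follows. The data layout (one 0/1 punctual/delayed column per event), the helper names, and the handling of degenerate sub-tables are assumptions for illustration, not the implementation of this work:

```python
import numpy as np
from scipy.stats import chi2_contingency

def independent(x, y, alpha=0.05):
    """True if the plain contingency-table test does NOT reject H0."""
    table = np.zeros((2, 2), dtype=int)
    np.add.at(table, (x, y), 1)
    if (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
        return True   # degenerate table: no evidence of dependence
    p_value = chi2_contingency(table, correction=False)[1]
    return p_value >= alpha

def directly_dependent(data, i, j, alpha=0.05):
    """Accept 'X_i and X_j are dependent' only if the direct test rejects
    H0 and no third variable Z explains the dependence via P_Z and D_Z."""
    x, y = data[:, i], data[:, j]
    if independent(x, y, alpha):              # direct test must reject H0
        return False
    for k in range(data.shape[1]):
        if k in (i, j):                       # Z must differ from X and Y
            continue
        punctual = data[:, k] == 0            # rows entering the table P_Z
        delayed = data[:, k] == 1             # rows entering the table D_Z
        if independent(x[punctual], y[punctual], alpha) and \
           independent(x[delayed], y[delayed], alpha):
            return False                      # Z explains the dependence
    return True
```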
