AP Statistics Curriculum 2007 Contingency Indep

Revision as of 02:33, 4 March 2008 by IvoDinov.

General Advanced-Placement (AP) Statistics Curriculum - Contingency Tables: Independence and Homogeneity

Contingency Tables: Independence and Homogeneity

The chi-square test may also be used to assess independence and association between variables.

Motivational example

Suppose 200 randomly selected cancer patients were asked whether their primary diagnosis was brain cancer and whether they owned a cell phone before their diagnosis. The results are presented in the table below.

Suppose we want to analyze the association, if any, between brain cancer and cell phone use. The 2x2 table below lists two possible outcomes for each variable (each variable is dichotomous). We have the following population parameters:

P(CP|BC) = true probability of owning a cell phone (CP) given that the patient had brain cancer (BC). It is estimated from the sample by $\hat{P}(CP|BC) = 18/25 = 0.72$.
P(CP|NBC) = true probability of owning a cell phone given that the patient had another cancer (NBC). It is estimated from the sample by $\hat{P}(CP|NBC) = 80/175 \approx 0.46$.
                        Brain cancer
                        Yes     No      Total
Cell Phone Use  Yes     18      80      98
                No      7       95      102
                Total   25      175     200

Does there appear to be an association between brain cancer and cell phone use? Of the 25 brain cancer patients, 18 owned a cell phone before their diagnosis, so the estimated probability of owning a cell phone given that the patient has brain cancer is $\hat{P}(CP|BC) = 18/25 = 0.72$.

Of the 175 other cancer patients, 80 owned a cell phone before their diagnosis, so the estimated probability of owning a cell phone given that the patient has another cancer is $\hat{P}(CP|NBC) = 80/175 \approx 0.46$.
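The two estimated conditional probabilities can be reproduced directly from the table counts. The following is a minimal Python sketch; the dictionary layout and the helper name `conditional_proportion` are illustrative assumptions, not part of the original page.

```python
# Observed counts from the 2x2 table.
# Keys are (cell phone use, cancer type); labels are illustrative.
observed = {
    ("CP", "BC"): 18, ("CP", "NBC"): 80,
    ("NoCP", "BC"): 7, ("NoCP", "NBC"): 95,
}

def conditional_proportion(counts, row, column):
    """Estimate P(row | column): the cell count divided by its column total."""
    column_total = sum(n for (r, c), n in counts.items() if c == column)
    return counts[(row, column)] / column_total

p_cp_given_bc = conditional_proportion(observed, "CP", "BC")    # 18 / 25
p_cp_given_nbc = conditional_proportion(observed, "CP", "NBC")  # 80 / 175

print(f"Estimated P(CP|BC)  = {p_cp_given_bc:.2f}")
print(f"Estimated P(CP|NBC) = {p_cp_given_nbc:.2f}")
```

The gap between the two sample proportions is what suggests an association; the chi-square test assesses whether a gap of this size could plausibly arise by chance.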

Calculations

• The hypotheses:
$H_o$: there is no association between variable 1 and variable 2 (independence).
$P(BC|CP) = P(BC)$, that is, brain cancer (BC) is independent of cell-phone (CP) usage.
$H_a$: there is an association between variable 1 and variable 2 (dependence).
$P(BC|CP)={P(BC \cap CP) \over P(CP) } \not= P(BC).$
• Test statistic:

$\chi_o^2 = \sum_{\text{all cells}}{(O-E)^2 \over E} \sim \chi_{(df)}^2$ (approximately, under $H_o$), where O and E are the observed and expected cell counts and df = (# rows - 1)(# columns - 1).
Expected cell counts can be calculated by
$E = { (\text{row total})(\text{column total}) \over \text{grand total}}$
• Results:

For the Brain-cancer and cell-phone usage data we have:

$\chi_o^2 = {(18-12.25)^2\over 12.25} + {(7-12.75)^2\over 12.75} + {(80-85.75)^2\over 85.75}+ {(95-89.25)^2\over89.25}$
$\chi_o^2 = 6.048 \sim \chi_{(df=1)}^2$
P-value: $P(\chi_1^2 > \chi_o^2) \approx 0.014$, so we can reject the null hypothesis at α = 0.05.
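The calculation above can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, only the standard library is used, and the P-value relies on the identity $P(\chi_1^2 > x) = \mathrm{erfc}(\sqrt{x/2})$, which holds only for 1 degree of freedom.

```python
import math

def expected_counts(observed):
    """E = (row total)(column total) / grand total, for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over all cells of the table."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Rows: cell phone Yes / No; columns: brain cancer Yes / No.
cell_phone = [[18, 80],
              [7, 95]]

expected = expected_counts(cell_phone)
chi2 = chi_square_statistic(cell_phone)
p_value = p_value_df1(chi2)

print("Expected counts:", expected)
print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value (df = 1): {p_value:.4f}")
```

No continuity correction is applied here, matching the hand calculation. Library routines may apply one by default for 2x2 tables and would then report a somewhat smaller statistic.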

Notes

• CAUTION: Association does not imply Causality!

$r\times k$ Contingency Tables

We now consider tables that are larger than 2x2 (more than 2 groups or more than 2 categories), called $r\times k$ contingency tables. The testing procedure is the same as for a 2x2 contingency table; it just involves more computation and does not allow a directional alternative. The goal of analyzing an $r\times k$ contingency table is to investigate the relationship between the row and column variables.

Note: $H_o$ is a compound hypothesis because it contains more than one independent assertion. This is true for all $r\times k$ tables larger than 2x2. Consequently, the alternative hypothesis for $r\times k$ tables larger than 2x2 is always non-directional.

Example

Many factors are considered when purchasing earthquake insurance. One factor of interest may be location with respect to a major earthquake fault. Suppose a survey was mailed to California residents in four counties (data shown below). Is there a statistically significant association between county of residence and purchase of earthquake insurance? Test using α = 0.05.

                              County
                              Contra Costa (CC)   Santa Clara (SC)   Los Angeles (LA)   San Bernardino (SB)   Total
Earthquake Insurance  Yes     117                 222                133                109                   581
                      No      404                 334                204                263                   1205
                      Total   521                 556                337                372                   1786
• Hypotheses:
$H_o$: There is no association between earthquake insurance and county of residence in California. That is:
P(Y|CC) = P(Y|SC) = P(Y|LA) = P(Y|SB)
P(N|CC) = P(N|SC) = P(N|LA) = P(N|SB)
$H_a$: There is an association between earthquake insurance and county of residence in California. That is, the probability of having earthquake insurance is not the same in every county.
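The page stops at the hypotheses. One way the test could be carried through for the survey data is sketched below in Python. The variable and function names are hypothetical. The P-value uses the closed-form upper tail of the chi-square distribution with 3 degrees of freedom, $P(\chi_3^2 > x) = \mathrm{erfc}(\sqrt{x/2}) + \sqrt{2x/\pi}\,e^{-x/2}$, so the helper is valid only for this 2x4 table (df = 3).

```python
import math

counties = ["CC", "SC", "LA", "SB"]
observed = [
    [117, 222, 133, 109],  # earthquake insurance: Yes
    [404, 334, 204, 263],  # earthquake insurance: No
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under H_o: (row total)(column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all eight cells.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Degrees of freedom: (# rows - 1)(# columns - 1) = (2 - 1)(4 - 1) = 3.
df = (len(observed) - 1) * (len(observed[0]) - 1)

def chi2_upper_tail_df3(x):
    """P(X > x) for a chi-square variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

p_value = chi2_upper_tail_df3(chi2)

# Sample proportion of insured respondents in each county, for interpretation.
yes_share = {name: yes / total
             for name, yes, total in zip(counties, observed[0], col_totals)}

print(f"chi-square = {chi2:.2f}, df = {df}, P-value = {p_value:.2e}")
print({name: round(share, 3) for name, share in yes_share.items()})
```

With these counts the statistic comes out near 47 on 3 degrees of freedom, far beyond the α = 0.05 critical value of about 7.81, so the sketch points toward rejecting $H_o$. A library routine such as `scipy.stats.chi2_contingency` should report the same statistic for this table.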

Chi-Square Test Conditions

Conditions for validity of the $\chi^2$ test are:

• Design conditions
  • For a goodness-of-fit test, it must be reasonable to regard the data as a random sample of categorical observations from a large population.
  • For a contingency table, it must be appropriate to view the data in one of the following ways:
    • as two or more independent random samples, observed with respect to a categorical variable;
    • as one random sample, observed with respect to two categorical variables.
  • For either type of test, the observations within a sample must be independent of one another.
• Sample conditions
  • The $\chi^2$ critical values are reliable only if each expected cell count is greater than 5.

• TBD
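The sample condition can be checked mechanically before running the test. The sketch below is one possible way to do so in Python. The function names and the second, small table are hypothetical illustrations, and the threshold of 5 is taken from the condition stated above and exposed as a parameter.

```python
def expected_counts(observed):
    """E = (row total)(column total) / grand total, for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def sample_condition_met(observed, minimum=5.0):
    """Return True if every expected cell count is greater than `minimum`."""
    return all(e > minimum for row in expected_counts(observed) for e in row)

# Cell-phone / brain-cancer table: smallest expected count is 12.25.
cell_phone = [[18, 80], [7, 95]]

# Hypothetical small table (made-up counts) whose expected counts are too small.
tiny_table = [[3, 2], [1, 4]]

print(sample_condition_met(cell_phone))  # condition holds
print(sample_condition_met(tiny_table))  # condition fails
```

The design conditions (random sampling and independent observations) cannot be verified from the counts alone; they depend on how the data were collected.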