AP Statistics Curriculum 2007 Hypothesis Basics
Current revision as of 17:48, 18 November 2015
General Advance-Placement (AP) Statistics Curriculum - Fundamentals of Hypothesis Testing
Fundamentals of Hypothesis Testing
A (statistical) Hypothesis Test is a method of making statistical decisions about populations or processes based on experimental data. Hypothesis testing answers the question of how well the findings fit the possibility that chance alone might be responsible for the observed discrepancy between the theoretical model and the empirical observations. This is accomplished by asking and answering a hypothetical question: what is the likelihood of the observed summary statistics of interest, if the data did come from the distribution specified by the null hypothesis? One use of hypothesis testing is deciding whether experimental results contain enough information to cast doubt on conventional wisdom.
- Example: Consider determining whether a suitcase contains some radioactive material. Placed under a Geiger counter, the suitcase produces 10 clicks (counts) per minute. The null hypothesis is that there is no radioactive material in the suitcase and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects in a suitcase. We can then calculate how likely it is to observe 10 counts per minute if the null hypothesis is true. If it is likely, for example if the null hypothesis predicts on average 9 counts per minute, we say that the suitcase is compatible with the null hypothesis (which does not imply that there is no radioactive material; we just cannot determine it from the 1-minute sample we took using this specific method). On the other hand, if the null hypothesis predicts, for example, 1 count per minute, then the suitcase is not compatible with the null hypothesis, and other factors are likely responsible for the increased radioactive counts.
Hypothesis Testing is also known as Statistical Significance Testing. The null hypothesis is a conjecture that exists solely to be disproved, rejected or falsified by the sample statistics used to estimate the unknown population parameters. Statistical significance is a possible finding of the test: that the sample is unlikely to have occurred in this process by chance, given the truth of the null hypothesis. The name of the test describes its formulation and its possible outcome. One characteristic of hypothesis testing is its crisp decision about the null hypothesis: reject or do not reject (which is not the same as accept).
Null and Alternative (Research) Hypotheses
A Null Hypothesis is a thesis set up to be nullified or refuted in order to support an Alternative (research) Hypothesis. The null hypothesis is presumed true until statistical evidence, in the form of a hypothesis test, indicates otherwise. In science, the null hypothesis is used to test differences between treatment and control groups, and the assumption at the outset of the experiment is that no difference exists between the two groups for the variable of interest (e.g., population means). The null hypothesis proposes something initially presumed true, and it is rejected only when it becomes evidently false, that is, when a researcher has a certain degree of confidence, usually 95% to 99%, that the data do not support the null hypothesis.
Example 1: Gender effects
If we want to compare the test scores of two random samples of men and women, a null hypothesis would be that the mean score of the male population was the same as the mean score of the female population:
- H_{0} : μ_{men} = μ_{women}
where:
- H_{0} = the null hypothesis
- μ_{men} = the mean of the males (population 1), and
- μ_{women} = the mean of the females (population 2).
Alternatively, the null hypothesis can postulate that the two samples are drawn from the same population, so that the center, variance and shape of the distributions are equal.
Formulation of the null hypothesis is a vital step in testing statistical significance. Having formulated such a hypothesis, one can establish the probability of observing the obtained data, or data even more extreme, if the null hypothesis is true. That probability is what is commonly called the observed significance level (p-value) of the results.
In many scientific experimental designs we predict that a particular factor will produce an effect on our dependent variable — this is our alternative hypothesis. We then consider how often we would expect to observe our experimental results, or results even more extreme, if we were to take many samples from a population where there was no effect (i.e. we test against our null hypothesis). If we find that this happens rarely (up to, say, 5% of the time), we can conclude that our results support our experimental prediction — we reject our null hypothesis.
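The idea of asking how often a result at least as extreme would arise when there is no effect can be illustrated with an exact permutation test. This is only one way to make the idea concrete and is not a method named in the text; the data values, group names and the function `permutation_p_value` are hypothetical and chosen purely for illustration.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Under the null hypothesis of no effect, group labels are exchangeable,
    so every split of the pooled data into groups of the same sizes is
    equally likely.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical measurements (made up for illustration, not from the text).
treatment = [12, 14, 15, 16]
control = [8, 9, 10, 11]

p_value = permutation_p_value(treatment, control)
reject_null = p_value <= 0.05  # the 5% threshold mentioned in the text
print(p_value, reject_null)
```

With these made-up values only 2 of the 70 possible splits are as extreme as the observed one, so the null hypothesis of no effect would be rejected at the 5% level.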
Type I Error, Type II Error and Power
Directly related to hypothesis testing are the following 3 concepts:
- Type I Error: The false positive (Type I) Error of rejecting the null hypothesis given that it is actually true; e.g., A court finding a person guilty of a crime that they did not actually commit.
- Type II Error: The Type II Error (false negative) is the error of failing to reject the null hypothesis given that the alternative hypothesis is actually true; e.g., A court finding a person not guilty of a crime that they did actually commit.
- Statistical Power: The Power of a Statistical Test is the probability that the test will reject a false null hypothesis (that it will not make a Type II Error). As power increases, the chances of a Type II error decrease. The probability of a Type II error is referred to as the false negative rate (β). Therefore power is equal to 1 − β. You can also see this SOCR Power Activity.
 | | Actual condition: Absent (H_{o} is true) | Actual condition: Present (H_{1} is true) |
---|---|---|---|
Test Result | Negative (fail to reject H_{o}) | Condition absent + Negative result = True (accurate) Negative (TN, 0.98505) | Condition present + Negative result = False (invalid) Negative (FN, 0.00025), Type II error (β) |
Test Result | Positive (reject H_{o}) | Condition absent + Positive result = False Positive (FP, 0.00995), Type I error (α) | Condition present + Positive result = True Positive (TP, 0.00475) |
Test Interpretation | Power = 1 − β = 1 − FN/(FN+TP) = 1 − 0.05 = 0.95 | Specificity: TN/(TN+FP) = 0.98505/(0.98505 + 0.00995) = 0.99 | Sensitivity: TP/(TP+FN) = 0.00475/(0.00475 + 0.00025) = 0.95 |
- Remarks:
- A Specificity of 100% means that the test recognizes all healthy individuals as (normal) healthy. The maximum is trivially achieved by a test that claims everybody is healthy regardless of the true condition. Therefore, the specificity alone does not tell us how well the test recognizes positive cases.
- False positive rate (α)= \(\frac{FP}{FP+TN} = \frac{0.00995}{0.00995 + 0.98505}=0.01 \)= 1 - Specificity.
- Sensitivity is a measure of how well a test correctly identifies a condition, whether this is medical screening tests picking up on a disease, or quality control in factories deciding if a new product is good enough to be sold. Sensitivity = \(\frac{TP}{TP+FN} = \frac{0.00475}{0.00475+ 0.00025}= 0.95.\)
- False Negative Rate (β)= \(\frac{FN}{FN+TP} = \frac{0.00025}{0.00025+0.00475}=0.05 \)= 1 - Sensitivity.
- Power = 1 − β = 0.95; see Power Analysis for Normal Distribution.
- Both (Type I (\(\alpha\)) and Type II (\(\beta\))) errors are proportions in the range [0,1], so they represent error-rates. The reason they are listed in the corresponding cells in the table is that they are directly proportionate to the numerical values of the FP and FN, respectively.
- The two alternative definitions of power are equivalent:
- power \(=1-\beta\), and
- power=sensitivity
- This is because power= \(1-\beta=1-\frac{FN}{FN+TP}=\frac{FN+TP}{FN+TP} - \frac{FN}{FN+TP}=\frac{TP}{FN+TP}=\) sensitivity.
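The rates in the remarks above can be recomputed directly from the four cell values of the table (TN, FN, FP, TP). The short sketch below uses only those values from the text; the variable names are chosen for illustration. It also checks numerically that power and sensitivity coincide.

```python
# Cell values (joint proportions) taken from the table in the text.
TN = 0.98505  # condition absent, negative result
FN = 0.00025  # condition present, negative result
FP = 0.00995  # condition absent, positive result
TP = 0.00475  # condition present, positive result

specificity = TN / (TN + FP)
sensitivity = TP / (TP + FN)

alpha = FP / (FP + TN)  # false positive rate, Type I error rate
beta = FN / (FN + TP)   # false negative rate, Type II error rate
power = 1 - beta

print(f"specificity = {specificity:.2f}, alpha = {alpha:.2f}")
print(f"sensitivity = {sensitivity:.2f}, beta = {beta:.2f}, power = {power:.2f}")
```

Note that α and β are conditional rates (each is a cell value divided by its column total), which is why they differ from the raw FP and FN proportions in the table.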
Example 2: Sodium content in hot-dogs
Use the Hot-dog dataset to see if there are statistically significant differences in the sodium content of the poultry vs. meat hotdogs.
- Formulate Hypotheses: \(H_o: \mu_p = \mu_m\) vs. \(H_1: \mu_p \not= \mu_m\), where \(\mu_p\) and \(\mu_m\) represent the mean sodium content in poultry and meat hotdogs, respectively.
- Plugging the data into SOCR Analyses under the Two Independent Sample T-Test (Unpooled) generates results as shown in the figure below (Two-Sided P-Value (Unpooled) = 0.196, which does not provide strong evidence to reject the null hypothesis that the two types of hot-dogs have the same mean sodium content).
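For readers who want to see what an unpooled two-sample t-test computes, here is a minimal sketch of the standard unpooled (Welch) test statistic and its approximate degrees of freedom. It is not the SOCR implementation; the function name `welch_t` is hypothetical, and the sample values are made up rather than taken from the hot-dog dataset, so the output is unrelated to the P-value of 0.196 reported above. Converting the statistic to a P-value additionally requires the t-distribution CDF, which is not in the Python standard library and is omitted here.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Unpooled (Welch) two-sample t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-group squared standard errors, using sample variances (n - 1 denominator).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    return t_stat, df

# Made-up numbers for illustration only (NOT the SOCR hot-dog data).
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 4, 6, 8, 10]

t_stat, df = welch_t(sample_a, sample_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The unpooled form does not assume the two groups share a common variance, which is why the degrees of freedom are approximated rather than set to n_a + n_b − 2.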
Example 3: Rapid testing in strep-throat
This study investigated the accuracy of rapid diagnosis of group A β-streptococcal pharyngitis by commercial immunochemical antigen test kits in the setting of recent streptococcal pharyngitis. Specifically, it explored whether the false-positive rate of the rapid test was increased because of presumed antigen persistence.
The study used 443 patients who had clinical pharyngitis diagnosed as group A β-hemolytic streptococcus infection in the past 28 days and compared them with 232 control patients who had symptoms of pharyngitis but no recent diagnosis of streptococcal pharyngitis. The aim was narrowly focused on comparing the rapid strep test with the culture method used in clinical practice.
The study found that the rapid strep test in this setting showed no difference in specificity (0.96 vs. 0.98). Hence, the assertion that rapid antigen testing had higher false-positive rates in those with recent infection was not confirmed. It also found that in patients who had recent streptococcal pharyngitis, the rapid strep test appeared to be more reliable (sensitivity 0.91 vs. 0.70, P < .001) than in those patients who had not had recent streptococcal pharyngitis. These findings indicated that the rapid strep test is both sensitive and specific in the setting of recent group A β-hemolytic streptococcal pharyngitis, and its use might allow earlier treatment in this subgroup of patients.
Table 1. Sensitivity and Specificity of Laboratory Culture and Rapid Strep Test in Patients With Recently Treated Cases of Streptococcal Pharyngitis (N=443).
Results | Culture Negative | Culture Positive |
---|---|---|
Rapid strep test negative | 93 | 10 |
Rapid strep test positive | 4 | 104 |
 | Estimate | 95% CI |
Sensitivity | 104/(104+10) = 0.91 | 0.84, 0.96 |
Specificity | 93/(93+4) = 0.96 | 0.90, 0.99 |
Positive predictive value | 0.96 | 0.91, 0.99 |
Negative predictive value | 0.90 | 0.83, 0.95 |
False-positive rate | 0.04 | 0.01, 0.10 |
False-negative rate | 0.09 | 0.04, 0.15 |
Table 2. Sensitivity and Specificity of Laboratory Culture and Rapid Strep Test in Patients With No Recently Treated Cases of Streptococcal Pharyngitis (N=232).
Results | Culture Negative | Culture Positive |
---|---|---|
Rapid strep test negative | 165 | 19 |
Rapid strep test positive | 4 | 44 |
 | Estimate | 95% CI |
Sensitivity | 44/(44+19) = 0.70 | 0.57, 0.81 |
Specificity | 165/(165+4) = 0.98 | 0.94, 0.99 |
Positive predictive value | 0.92 | 0.80, 0.99 |
Negative predictive value | 0.90 | 0.84, 0.94 |
False-positive rate | 0.02 | 0.01, 0.06 |
False-negative rate | 0.30 | 0.19, 0.43 |
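The point estimates in Tables 1 and 2 follow from the four counts in each 2×2 table, treating the culture result as the reference standard. The sketch below recomputes them; the function name `diagnostic_summary` and the dictionary keys are illustrative choices, and the confidence intervals reported in the tables are not reproduced here.

```python
def diagnostic_summary(tn, fn, fp, tp):
    """Point estimates for a 2x2 diagnostic table (reference standard: culture)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Counts from Table 1 (recently treated streptococcal pharyngitis).
recent = diagnostic_summary(tn=93, fn=10, fp=4, tp=104)

# Counts from Table 2 (no recently treated streptococcal pharyngitis).
no_recent = diagnostic_summary(tn=165, fn=19, fp=4, tp=44)

for name in recent:
    print(f"{name}: {recent[name]:.2f} vs {no_recent[name]:.2f}")
```

Rounded to two decimals, these values agree with the estimates listed in the two tables.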
Problems
References
Robert D. Sheeler, MD, Margaret S. Houston, MD, Sharon Radke, RN, Jane C. Dale, MD, and Steven C. Adamson, MD. (2002) Accuracy of Rapid Strep Testing in Patients Who Have Had Recent Streptococcal Pharyngitis. JABFP, 2002, 15(4), 261-265.
- SOCR Home page: http://www.socr.ucla.edu