# AP Statistics Curriculum 2007 Limits CLT

Revision as of 00:01, 3 February 2008 by IvoDinov (previous revision: 18:52, 14 June 2007).

## General Advanced-Placement (AP) Statistics Curriculum - The Central Limit Theorem

### Motivation

The following example motivates the need to study the sampling distribution of the sample average, i.e., the distribution of $\overline{X_n}={1\over n}\sum_{i=1}^n{X_i}$, as we vary the sample $\{X_1, X_2, X_3, \cdots, X_n\}$.

Suppose there are 10 world-renowned laboratories, each of which is charged with conducting the same experiment (e.g., sequencing the genome of Drosophila, the fruit fly) using the same protocols. One of the outcomes of this study that the sponsoring agency is interested in could be the average rate of occurrence of the ATC codon per 10,000 base-pairs in the Drosophila genome. After completing the sequencing of the genome, each lab selects a random segment of 1,000,000 base-pairs and counts the number of ATC codons in every sub-segment of 10,000 base-pairs (there are 100 such sub-segments). Finally, each lab computes the average of the 100 counts it obtained. The sponsoring agency receives 10 average counts from the 10 distinct laboratories. Most likely there will be differences between these averages.

The most important question the funding agency poses is: Can we predict how much variation (i.e., discrepancy) there would be between the 10 lab averages if we had only been able to fund/conduct 1 experiment at one site (due to resource/budgetary limitations)? In other words, if the sponsoring organization supports only one lab to carry out the experiment, can it estimate the possible error committed by using the sample average (of 100 counts) obtained from the chosen lab?

The answer is yes: the agency can accurately estimate the true average count of ATC codons per 10,000 base-pairs in the Drosophila genome from a single lab experiment, and quantify the likely error of that estimate, because the sampling distribution of the average (across labs) is known to be (approximately) Normal.
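A small simulation can make the agency's question concrete. The sketch below assumes, purely for illustration, that the count in each 10,000 base-pair segment follows a Poisson distribution with a hypothetical mean of 150; the true rate is not given here, and `poisson`, `TRUE_RATE` and the other names are invented for this example. It compares the spread actually seen between the 10 lab averages with the spread predicted from the first lab's 100 counts alone (its sample standard deviation divided by $\sqrt{100}$).

```python
import math
import random
import statistics

TRUE_RATE = 150.0   # hypothetical mean ATC count per 10,000 base-pairs (assumption)
N_LABS = 10         # number of laboratories
N_SEGMENTS = 100    # segments counted by each laboratory


def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


rng = random.Random(2008)  # fixed seed so the run is repeatable

# Each lab records 100 segment counts and reports their average.
lab_counts = [[poisson(rng, TRUE_RATE) for _ in range(N_SEGMENTS)]
              for _ in range(N_LABS)]
lab_averages = [statistics.fmean(counts) for counts in lab_counts]

# Spread actually observed between the 10 reported averages.
observed_spread = statistics.stdev(lab_averages)

# Spread predicted from a single lab: sample SD of its counts / sqrt(n).
predicted_spread = statistics.stdev(lab_counts[0]) / math.sqrt(N_SEGMENTS)

print("lab averages:", [round(a, 2) for a in lab_averages])
print("observed SD between lab averages:", round(observed_spread, 3))
print("SD predicted from lab 1 alone:   ", round(predicted_spread, 3))
```

Under these assumptions the two spreads should be of similar size, which is the sense in which one lab's data suffice to gauge the lab-to-lab variation. With only 10 averages, the observed spread is itself a fairly noisy estimate.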

### The Central Limit Theorem

The Central Limit Theorem (CLT) states that the distribution of the sum or average of independent observations from the same random process (with finite mean and variance) will be approximately Normal (i.e., a bell-shaped curve). That is, the CLT expresses the fact that any sum or average of (many) independent and identically-distributed random variables will tend to be distributed according to a particular attractor distribution; the Normal distribution effectively represents the core of the universe of all (nice) distributions.

### Statement of the Central Limit Theorem

The formal statement of the CLT is described in the Wikipedia article on the central limit theorem (http://en.wikipedia.org/wiki/Central_limit_theorem); however, the following statement is more appropriate for many undergraduate and graduate classes:

Let $\{X_1, X_2, \cdots, X_n\}$ be a random sample (IID) from a (native) distribution with well-defined and finite mean $\mu_X$ and variance $\sigma_X^2$. Then, as $n$ increases, the sampling distributions of the sample average $\overline{X_n}={1\over n}\sum_{i=1}^n{X_i}$ and the total sum $\overline{T_n}=\sum_{i=1}^n{X_i}$ are increasingly well approximated by Normal distributions with corresponding means and variances:

$\mu_{\overline{X_n}}=\mu_X; \quad \sigma_{\overline{X_n}}^2={\sigma_X^2\over n};$

$\mu_{\overline{T_n}}=n\times\mu_X; \quad \sigma_{\overline{T_n}}^2={n\times\sigma_X^2}.$
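
The means and variances listed above follow from linearity of expectation and, for the variances, from the independence of the $X_i$. A short derivation under the IID assumption of the statement:

$$
\begin{aligned}
\mu_{\overline{X_n}} &= E\left[{1\over n}\sum_{i=1}^n X_i\right] = {1\over n}\sum_{i=1}^n E[X_i] = {1\over n}\times n\times\mu_X = \mu_X, \\
\sigma_{\overline{X_n}}^2 &= Var\left({1\over n}\sum_{i=1}^n X_i\right) = {1\over n^2}\sum_{i=1}^n Var(X_i) = {1\over n^2}\times n\times\sigma_X^2 = {\sigma_X^2\over n}, \\
\mu_{\overline{T_n}} &= \sum_{i=1}^n E[X_i] = n\times\mu_X, \\
\sigma_{\overline{T_n}}^2 &= \sum_{i=1}^n Var(X_i) = n\times\sigma_X^2.
\end{aligned}
$$

These identities hold exactly for every $n$. What the CLT adds is the approximately Normal shape, often written in standardized form: the distribution of $Z_n = {\overline{X_n}-\mu_X \over \sigma_X/\sqrt{n}}$ approaches the standard Normal distribution $N(0,1)$ as $n$ grows.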

In essence, the CLT implies that the Normal distribution is the center of the universe of all nice distributions. This is the reason why we so frequently encounter estimates involving arithmetic averaging: the pathway from a nice distribution to the Normal distribution is paved by sample averages. In other words, the CLT provides a unifying framework for all (nice) distributions, the way the Grand Unified Theory attempts to unite the theory behind three of the fundamental forces in physics.
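
The pathway from a non-Normal native distribution to the Normal can be checked empirically. The following is a minimal sketch, assuming an Exponential native distribution with rate 1 (so $\mu_X=1$ and $\sigma_X^2=1$) and a sample size of 50; these choices, and all variable names, are illustrative assumptions rather than part of the SOCR materials. It draws many samples, then compares the simulated mean and variance of the averages and totals with the values in the statement of the theorem, and reports how often an average falls within 1.96 standard errors of $\mu_X$ (about 95% for a Normal distribution).

```python
import math
import random
import statistics

MU, SIGMA2 = 1.0, 1.0   # mean and variance of the Exponential(rate=1) native distribution
N = 50                  # sample size (assumed for illustration)
REPS = 20000            # number of simulated samples

rng = random.Random(42)  # fixed seed so the run is repeatable

averages, totals = [], []
for _ in range(REPS):
    sample = [rng.expovariate(1.0) for _ in range(N)]
    total = sum(sample)
    totals.append(total)
    averages.append(total / N)

# Simulated moments of the sampling distributions.
mean_avg = statistics.fmean(averages)
var_avg = statistics.variance(averages)
mean_tot = statistics.fmean(totals)
var_tot = statistics.variance(totals)

# Fraction of averages within 1.96 standard errors of the true mean.
se = math.sqrt(SIGMA2 / N)
coverage = sum(abs(a - MU) <= 1.96 * se for a in averages) / REPS

print(f"average: mean {mean_avg:.4f} (theory {MU}), "
      f"variance {var_avg:.5f} (theory {SIGMA2 / N})")
print(f"total:   mean {mean_tot:.3f} (theory {N * MU}), "
      f"variance {var_tot:.3f} (theory {N * SIGMA2})")
print(f"share of averages within 1.96 SE of the mean: {coverage:.3f}")
```

Changing `N` or replacing `rng.expovariate(1.0)` with another generator from Python's `random` module (and updating `MU` and `SIGMA2` to match) gives a quick way to explore other native distributions.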

### Are there CLTs for other sample statistics?

The ramifications of the CLT go beyond the scope of this interpretation. For example, one frequently wonders if there are other types of population parameters or sample statistics that yield similar limiting behavior. How large does the sample size have to be to ensure approximate normality of the sample average or total sum? Does the convergence depend on the characteristics of the native distribution (e.g., shape, center, dispersion)? How about weighted averages, non-linear combinations or more general functions of the random sample? Many other interesting questions are frequently asked by people exposed to the CLT. Some have known theoretical answers (exact or approximate); others may be better addressed empirically by simulations and experiments (see the SOCR CLT Applet with Activity).
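
Two of these questions, how large the sample must be and whether the shape of the native distribution matters, lend themselves to a quick empirical look. The sketch below is one possible approach, not the SOCR applet's method: it uses the skewness of the simulated sampling distribution of the average as a rough measure of departure from normality (a Normal distribution has skewness 0) and compares a symmetric native distribution, Uniform(0,1), with a skewed one, Exponential with rate 1. The helper names `skewness` and `skew_of_averages`, the sample sizes and the number of repetitions are all illustrative assumptions.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment divided by the SD cubed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values, mu=mean)
    return statistics.fmean([(v - mean) ** 3 for v in values]) / sd ** 3


def skew_of_averages(draw, n, reps, rng):
    """Skewness of the simulated sampling distribution of the average of n draws."""
    averages = [statistics.fmean([draw(rng) for _ in range(n)])
                for _ in range(reps)]
    return skewness(averages)


# Native distributions to compare (one symmetric, one skewed).
NATIVE = {
    "uniform(0,1)": lambda rng: rng.random(),
    "exponential(1)": lambda rng: rng.expovariate(1.0),
}
SAMPLE_SIZES = (1, 4, 16, 64)
REPS = 5000

rng = random.Random(7)  # fixed seed so the run is repeatable
results = {
    name: {n: skew_of_averages(draw, n, REPS, rng) for n in SAMPLE_SIZES}
    for name, draw in NATIVE.items()
}

for name, by_n in results.items():
    for n, skew in by_n.items():
        print(f"{name:>15}  n={n:<3}  skewness of averages = {skew:+.3f}")
```

For the average of $n$ IID observations the skewness equals the native skewness divided by $\sqrt{n}$, so for the Exponential case (native skewness 2) it is $2/\sqrt{n}$, i.e., 0.25 at $n=64$, whereas the Uniform case is symmetric at every $n$. Skewness is only one aspect of shape, so a small value here does not by itself establish normality.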

### CLT Applications

A number of applications of the Central Limit Theorem are included in the SOCR CLT Activity.

To start the SOCR CLT Experiment, go to SOCR Experiments (http://www.socr.ucla.edu/htmls/SOCR_Experiments.html) and select the SOCR Sampling Distribution CLT Experiment from the drop-down list of experiments in the left panel. The image below shows the interface to this experiment. Notice the main control widgets on this image (boxed in blue and pointed to by arrows). The generic control buttons at the top allow you to run one or multiple steps/runs, and to stop and reset the experiment. The two tabs in the main frame provide graphical access to the results of the experiment (Histograms and Summaries) or the Distribution selection panel (Distributions). Remember that choosing sample-sizes <= 16 will animate the samples (second graphing row), whereas larger sample-sizes (N>20) will only show the updates of the sampling distributions (bottom two graphing rows).

[Image: SOCR_Activities_General_CLT_Dinov_012207_Fig1.jpg, the interface of the SOCR Sampling Distribution CLT Experiment]