AP Statistics Curriculum 2007 GLM Predict

From Socr


Current revision as of 20:46, 28 June 2010

General Advanced-Placement (AP) Statistics Curriculum - Variation and Prediction Intervals

Inference on Linear Models

Suppose again that we have n pairs (X,Y), {X_1, X_2, X_3, \cdots, X_n} and {Y_1, Y_2, Y_3, \cdots, Y_n}, of observations of the same process. In the previous section, we discussed how to fit a line to the data. The main question is how to determine the best line.

Airfare Example

We can see from the scatterplot that greater distance is associated with higher airfare. In other words, airports that are farther from Baltimore tend to have more expensive airfares. To decide on the best-fitting line, we use the least-squares method to fit the least squares (regression) line.

Confidence Interval Estimation of the Slope and Intercept of the Linear Model

The parameters (a and b) of the linear regression line, Y = a + bX, are estimated using Least Squares. The least squares technique finds the line that minimizes the sum of the squares of the regression residuals, \hat{\varepsilon_i}=\hat{y}_{i}-y_i, that is, it minimizes \sum_{i=1}^N {\hat{\varepsilon_i}^2} = \sum_{i=1}^N (\hat{y}_{i}-y_i)^2, where y_i and \hat{y}_{i}=a+bx_i are the observed and the predicted values of Y for x_i, respectively.

The minimization problem can be solved using calculus, by finding the first-order partial derivatives with respect to a and b and setting them equal to zero. The solution gives the slope and y-intercept of the regression line:

  • Regression line Slope:
 \hat{b} = \frac {\sum_{i=1}^{N}  (x_{i} - \bar{x})(y_{i} - \bar{y}) }  {\sum_{i=1}^{N} (x_{i} - \bar{x}) ^2}
 \hat{b} = \frac {\sum_{i=1}^{N} {(x_{i}y_{i})} - N \bar{x} \bar{y}}  {\sum_{i=1}^{N} (x_{i})^2 - N \bar{x}^2}  = \rho_{X,Y} \frac {s_y}{s_x}, where \rho_{X,Y} is the sample correlation coefficient and s_x and s_y are the sample standard deviations of X and Y.
  • Y-intercept:
 \hat{a} = \bar{y} - \hat{b} \bar{x}
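
The two formulas above translate directly into code. The following is a minimal Python sketch, not part of the original SOCR materials: the function name fit_line and the five data points are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (a_hat, b_hat), the least-squares intercept and slope."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar  # the line passes through (x_bar, y_bar)
    return a_hat, b_hat


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a_hat, b_hat = fit_line(xs, ys)
print(f"intercept = {a_hat:.3f}, slope = {b_hat:.3f}")
```

For these made-up points the sketch gives an intercept of 2.200 and a slope of 0.600.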

If the error terms are Normally distributed, the estimate of the slope coefficient has a normal distribution with mean equal to b and standard error given by:

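 s_{\hat{b}} = \sqrt { {1\over (N-2)} \frac {\sum_{i=1}^N \hat{\varepsilon_i}^2} {\sum_{i=1}^N (x_i - \bar{x})^2} }.

A 100(1-\alpha)% confidence interval for b can then be constructed using the T-distribution with N-2 degrees of freedom:

 [ \hat{b} - s_{\hat{b}} t_{(\alpha/2, N-2)},\hat{b} + s_{\hat{b}} t_{(\alpha/2, N-2)}]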

In other words, for the airfare example worked out below, a 1-mile increase in distance is associated with an increase in airfare of between $0.054 and $0.180, with 95% confidence.
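
As an illustration of the standard error and interval formulas, here is a minimal Python sketch. The function name slope_confidence_interval and the data are hypothetical, and the critical value t_{(\alpha/2, N-2)} is assumed to be looked up in a t-table and passed in by the caller (3.182 for 95% confidence with 3 degrees of freedom).

```python
import math


def slope_confidence_interval(xs, ys, t_crit):
    """Return (b_hat, se_b, (low, high)) for the regression slope.

    t_crit is the critical value t_(alpha/2, N-2), supplied by the caller.
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a_hat = y_bar - b_hat * x_bar
    # Sum of squared residuals, (y_hat_i - y_i)^2
    sse = sum((a_hat + b_hat * x - y) ** 2 for x, y in zip(xs, ys))
    se_b = math.sqrt(sse / (n - 2) / sxx)
    margin = t_crit * se_b
    return b_hat, se_b, (b_hat - margin, b_hat + margin)


# Hypothetical data, for illustration only (N = 5, so df = 3)
b_hat, se_b, (low, high) = slope_confidence_interval(
    [1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_crit=3.182
)
print(f"slope = {b_hat:.3f}, SE = {se_b:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With so few points the interval is wide and includes zero, so these made-up data would not give evidence of a nonzero slope.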

  • Significance testing: If X is not useful for predicting Y, then the true slope is zero. In a hypothesis test, our status quo null hypothesis is that there is no linear relationship between X and Y.
Hypotheses: H_0: b = 0 vs. H_1: b \not= 0 (or H_1: b > 0 or H_1: b < 0).
Test statistic: t_o={b-0\over SE(b)}, where, under H_0, t_o \sim t_{(df=n-2)}, the T-distribution with n-2 degrees of freedom.
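
The test can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the SOCR implementation: the function names are hypothetical, and the two-sided p-value is approximated by integrating the T-distribution density numerically with Simpson's rule instead of calling a statistics library. The inputs are the slope estimate and standard error quoted in the airfare example of this section.

```python
import math


def t_density(x, df):
    """Density of Student's T-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_obs, df, steps=2000):
    """Approximate P(|T| >= |t_obs|) using Simpson's rule on [0, |t_obs|]."""
    upper = abs(t_obs)
    if upper == 0:
        return 1.0
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of subintervals
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central_area = total * h / 3  # P(0 <= T <= |t_obs|)
    return max(0.0, 1.0 - 2.0 * central_area)


def slope_t_statistic(b_hat, se_b):
    """Test statistic for H_0: b = 0."""
    return (b_hat - 0.0) / se_b


# Slope estimate and standard error from the airfare example (n = 12, df = 10)
t_o = slope_t_statistic(0.11738, 0.02832)
p_value = two_sided_p_value(t_o, df=10)
print(f"t_o = {t_o:.3f}, two-sided p-value = {p_value:.3f}")
```

For a one-sided alternative (H_1: b > 0 or H_1: b < 0), the p-value is half of the two-sided value when the sign of t_o agrees with the alternative.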

Example

For the distance vs. airfare example, we can compute the standard error of the slope coefficient (b), SE(b):

SE(b)={37.83 \over \sqrt{1786499}}=0.0283.
  • Then a 95% confidence interval for b is given by:
CI(b): b \pm t_{(\alpha/2, df=10)}SE(b)=0.11738 \pm 2.228\times 0.02832=[0.054 , 0.180].
  • Significance testing:
t_o={b-0\over SE(b)}={0.11738-0 \over 0.02832}=4.145 and p-value = 0.002.
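
The arithmetic in this example can be checked with a few lines of Python. The variable names are hypothetical, and reading 37.83 as the residual standard deviation and 1786499 as \sum (x_i - \bar{x})^2 is an interpretation based on the standard error formula above, not something stated explicitly in the example. The interval and test statistic use the rounded value SE(b) = 0.02832 exactly as quoted.

```python
import math

# Quantities quoted in the airfare example
residual_sd = 37.83   # assumed: square root of SSE / (N - 2)
sxx = 1786499         # assumed: sum of squared deviations of distance
b_hat = 0.11738       # estimated slope
se_quoted = 0.02832   # SE(b) as used in the interval and test statistic
t_crit = 2.228        # t_(alpha/2, df=10) for 95% confidence

se_b = residual_sd / math.sqrt(sxx)
ci_low = b_hat - t_crit * se_quoted
ci_high = b_hat + t_crit * se_quoted
t_o = (b_hat - 0) / se_quoted

print(f"SE(b) = {se_b:.4f}")
print(f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print(f"t_o = {t_o:.3f}")
```

The directly computed standard error agrees with the quoted 0.0283 to four decimal places; the small difference from 0.02832 presumably comes from rounding in the quoted inputs.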

Earthquake Example

Use the SOCR Earthquake Dataset to formulate and test a research hypothesis about the slope of the best linear fit between the Longitude and the Latitude of the California Earthquakes since 1900. You can see the SOCR Geomap of these Earthquakes. The image below shows how to use the Simple Linear Regression in SOCR Analyses to calculate the regression line and make inference on the slope.


Problems



