Current revision as of 06:31, 14 December 2010

1 Probability and Statistics EBook Practice Problems
2 I. Introduction to Statistics
3 II. Describing, Exploring, and Comparing Data
4 III. Probability
5 IV. Probability Distributions
6 V. Normal Probability Distribution
7 VI. Relations Between Distributions
8 VII. Point and Interval Estimates
9 VIII. Hypothesis Testing
10 IX. Inferences from Two Samples
11 X. Correlation and regression
12 XI. Analysis of Variance (ANOVA)
- 12.1 One-Way ANOVA
  - 12.1.1 Problems
- 12.2 Two-Way ANOVA
  - 12.2.1 Problems
13 XII. Non-Parametric Inference
14 XIII. Multinomial Experiments and Contingency Tables

Probability and Statistics EBook Practice Problems

The problems provided below may be useful for practicing the concepts, methods and analysis protocols, and for self-evaluation of learning of the materials presented in the EBook.

I. Introduction to Statistics

The Nature of Data and Variation

Natural phenomena are often unpredictable, and even carefully designed experiments generate data that vary because of intrinsic (internal to the system) or extrinsic (due to the ambient environment) effects. How many natural processes or phenomena can we describe with an exact closed-form mathematical description that is completely deterministic? How do we model the remaining processes, which are unpredictable and have random characteristics?

@@ Line 17: / Line 17: @@
 ===[[EBook_Problems_EDA_IntroDesign | Problems]]===
-==[[AP_Statistics_Curriculum_2007_IntroTools |Statistics with Tools (Calculators and Computers)]]===
+==[[AP_Statistics_Curriculum_2007_IntroTools |Statistics with Tools (Calculators and Computers)]]==
 All methods for data analysis, understanding, or visualization are based on models that often have compact analytical representations (e.g., formulas, symbolic equations, etc.). Models are used to study processes theoretically. The utility of a model is validated empirically by inputting data and executing tests of the model. This validation step may be done manually, by computing the model prediction or model inference from recorded measurements, but this is feasible by hand only for small numbers of observations (<10). In practice, we write (or use existing) algorithms and computer programs that automate these calculations for better efficiency, accuracy and consistency in applying models to larger datasets.
 ===[[EBook_Problems_EDA_IntroTools | Problems]]===
@@ Line 115: / Line 115: @@
 In addition to being able to compute probability (p) values, we often need to estimate the critical values of the Normal Distribution for a given p-value.
 ===[[EBook_Problems_Normal_Critical | Problems]]===
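The relation between a p-value and a critical value of the standard Normal distribution can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name <code>normal_critical_value</code> and the choice of a two-sided default are assumptions made for this example and do not come from the EBook.

```python
from statistics import NormalDist


def normal_critical_value(alpha: float, two_sided: bool = True) -> float:
    """Return z such that the tail area beyond z is alpha.

    For a two-sided test, alpha is split evenly between both tails.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    tail = alpha / 2.0 if two_sided else alpha
    # inv_cdf is the quantile function of the standard Normal distribution
    return NormalDist().inv_cdf(1.0 - tail)


if __name__ == "__main__":
    print(round(normal_critical_value(0.05), 3))         # two-sided, 5%
    print(round(normal_critical_value(0.05, False), 3))  # one-sided, 5%
```

For a significance level of 0.05 this gives the familiar values of about 1.96 (two-sided) and about 1.645 (one-sided).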
+==[[AP_Statistics_Curriculum_2007_MultivariateNormal |Multivariate Normal Distribution]]==
+The multivariate normal distribution (also known as the multivariate Gaussian distribution) is a generalization of the [[AP_Statistics_Curriculum_2007_Normal_Prob|univariate (one-dimensional) normal distribution]] to higher dimensions (2D, 3D, etc.). The multivariate normal distribution is useful in studies of correlated real-valued random variables.
+===[[EBook_Problems_MultivariateNormal | Problems]]===
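As a small illustration of the two-dimensional case, the sketch below evaluates the standard textbook density of a bivariate normal distribution with correlation <code>rho</code>. The function name and parameter names are hypothetical and chosen only for this example. When <code>rho</code> is zero, the density reduces to the product of two univariate normal densities, which is a convenient sanity check.

```python
import math
from statistics import NormalDist


def bivariate_normal_pdf(x, y, mu_x=0.0, mu_y=0.0, sd_x=1.0, sd_y=1.0, rho=0.0):
    """Density of a bivariate normal distribution at the point (x, y)."""
    if sd_x <= 0 or sd_y <= 0:
        raise ValueError("standard deviations must be positive")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Standardize each coordinate
    zx = (x - mu_x) / sd_x
    zy = (y - mu_y) / sd_y
    one_minus_rho2 = 1.0 - rho * rho
    # Quadratic form in the exponent
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / one_minus_rho2
    norm_const = 2.0 * math.pi * sd_x * sd_y * math.sqrt(one_minus_rho2)
    return math.exp(-0.5 * quad) / norm_const


if __name__ == "__main__":
    # With rho = 0 the joint density factors into two univariate densities
    joint = bivariate_normal_pdf(0.5, -1.0)
    product = NormalDist().pdf(0.5) * NormalDist().pdf(-1.0)
    print(joint, product)
```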
 =VI. Relations Between Distributions=
@@ Line 163: / Line 167: @@
 ===[[EBook_Problems_StudentsT | Problems]]===
-===[[AP_Statistics_Curriculum_2007_Estim_Proportion |Estimating a Population Proportion]]===
+==[[AP_Statistics_Curriculum_2007_Estim_Proportion |Estimating a Population Proportion]]==
 The '''Normal Distribution''' is an appropriate model for proportions when the sample size is large enough. In this section, we demonstrate how to obtain point and interval estimates for a population proportion.
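A point estimate and a Normal-approximation (Wald) interval for a population proportion might be computed as in the following sketch. The function name <code>proportion_interval</code> and the sample counts (60 successes out of 200) are illustrative assumptions, not data from the EBook, and the Wald interval is only one of several interval methods.

```python
import math
from statistics import NormalDist


def proportion_interval(successes: int, n: int, confidence: float = 0.95):
    """Return (p_hat, lower, upper) using the Normal approximation."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    p_hat = successes / n                            # point estimate
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # critical value
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    # Clip to [0, 1] because a proportion cannot leave that range
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)


if __name__ == "__main__":
    # Hypothetical sample: 60 successes in 200 trials
    p_hat, lower, upper = proportion_interval(60, 200)
    print(f"p_hat={p_hat:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```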
@@ Line 185: / Line 189: @@
 ==[[AP_Statistics_Curriculum_2007_Hypothesis_S_Mean |Testing a Claim about a Mean: Small Samples]]==
 We continue with the discussion on inference for the population mean for small samples.
-'''1. To test the claim that the average home in a certain town is within 5.5 miles of the nearest fire station, an insurance company measured the distances from 25 randomly selected homes to the nearest fire station and found x-bar = 5.8 miles and sd = 2.4 miles. Determine what the insurance company found out with a test of significance. Check all that apply.'''
-'''Choose at least one answer.'''
-:''(a) There is no evidence in the data to conclude that the distance is different from 5.5.''
-:''(b) The average of 5.8 miles observed is by chance.''
-:''(c) We cannot reject the null.''
-:''(d) There is evidence in the data to conclude that the distance is 5.5.''
 ===[[EBook_Problems_Hypothesis_S_Mean | Problems]]===
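The small-sample test statistic can be worked through for the fire-station exercise in this revision (n = 25, x-bar = 5.8, sd = 2.4, hypothesized mean 5.5). The sketch below is an assumption-laden illustration: the helper name <code>one_sample_t</code> is hypothetical, a two-sided alternative is assumed, and the critical value 2.064 is the tabulated 97.5th percentile of the t distribution with 24 degrees of freedom rather than something computed here.

```python
import math


def one_sample_t(sample_mean: float, sample_sd: float, n: int, mu0: float):
    """Return the one-sample t statistic and its degrees of freedom."""
    if n < 2 or sample_sd <= 0:
        raise ValueError("need n >= 2 and a positive standard deviation")
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error, n - 1


# Tabulated value t_{0.975, 24}, assuming a two-sided test at the 5% level
T_CRIT_975_DF24 = 2.064

t_stat, df = one_sample_t(5.8, 2.4, 25, 5.5)
reject_null = abs(t_stat) > T_CRIT_975_DF24

if __name__ == "__main__":
    print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_null}")
```

Here the statistic is about 0.625, well inside the critical value, so under these assumptions the null hypothesis is not rejected.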
 ==[[AP_Statistics_Curriculum_2007_Hypothesis_Proportion |Testing a Claim about a Proportion]]==
 When the sample size is large, the sampling distribution of the sample proportion <math>\hat{p}</math> is approximately Normal, by the [[AP_Statistics_Curriculum_2007_Limits_CLT | CLT]]. This helps us formulate hypothesis testing protocols and compute the appropriate statistics and p-values to assess significance.
-'''1. A random sample of 1000 Americans aged 65 and older was collected in 1980 and found that 15% had "hazardous" levels of drinking, which is defined as regularly drinking an amount of alcohol that could cause health problems given the subject's medical conditions. Researchers wanted to know if this proportion has changed since 1980 and so collected a random sample of 1500 Americans aged 65 and older in 2004. They found that 12% drank at hazardous levels. Which of the following is closest to the value of a test statistic that could be used to test the hypothesis that the proportion of hazardous drinkers over the age of 65 has declined since 1980?'''
-'''Choose one answer.'''
-:''(a) -2.13''
-:''(b) 0.014''
-:''(c) 0.418''
-:''(d) 4.54''
 ===[[EBook_Problems_Hypothesis_Proportion | Problems]]===
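For the hazardous-drinking exercise in this revision (15% of 1000 in 1980, 12% of 1500 in 2004), a z statistic for the difference of two proportions can be sketched as follows. The function name <code>two_proportion_z</code> is hypothetical, and both the unpooled and the pooled standard errors are shown because the exercise does not say which one is intended.

```python
import math


def two_proportion_z(p1: float, n1: int, p2: float, n2: int, pooled: bool = False) -> float:
    """Return the z statistic for p2 - p1 under a Normal approximation."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    if pooled:
        # Common proportion estimated from both samples combined
        p = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = math.sqrt(p * (1.0 - p) * (1.0 / n1 + 1.0 / n2))
    else:
        # Each sample contributes its own variance estimate
        se = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    return (p2 - p1) / se


z_unpooled = two_proportion_z(0.15, 1000, 0.12, 1500)
z_pooled = two_proportion_z(0.15, 1000, 0.12, 1500, pooled=True)

if __name__ == "__main__":
    print(f"unpooled z = {z_unpooled:.2f}, pooled z = {z_pooled:.2f}")
```

The unpooled version gives roughly -2.13 and the pooled version roughly -2.17; both are negative, consistent with a decline in the proportion.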
@@ Line 243: / Line 223: @@
 ==[[AP_Statistics_Curriculum_2007_GLM_Corr |Correlation]]==
 The '''Correlation''' between X and Y represents the first bivariate model of association which may be used to make predictions.
-'''1. A positive correlation between two variables X and Y means that if X increases, this will cause the value of Y to increase.'''
-:''(a) This is always true.''
-:''(b) This is sometimes true.''
-:''(c) This is never true.''
-{{hidden|Answer|(c)}}
-'''2. The correlation between high school algebra and geometry scores was found to be + 0.8. Which of the following statements is not true?'''
-:''(a) Most of the students who have above average scores in algebra also have above average scores in geometry. ''
-:''(b) Most people who have above average scores in algebra will have below average scores in geometry ''
-:''(c) If we increase a student's score in algebra (i.e., with extra tutoring in algebra), then the student's geometry scores will always increase accordingly.''
-:''(d) Most students who have below average scores in algebra also have below average scores in geometry. ''
-{{hidden|Answer|(c)}}
-'''3. Researchers discover that the correlation between miles run per week and cardiovascular endurance is +0.75. They also discover that the correlation between hours spent watching television per week and cardiovascular endurance is -0.75. What is the conclusion that best characterizes the result of this study?'''
-'''Choose one answer.'''
-:''(a) Most people who spend a lot of hours watching television have low cardiovascular endurance.''
-:''(b) Most people who have good cardiovascular endurance spend a lot of time running and little time watching television.''
-:''(c) Based on the correlation, if you increase your running hours per week, your cardiovascular endurance will decrease.''
-:''(d) Based on the correlation, if you increase your television watching time, your cardiovascular endurance will decrease.''
-:''(e) Most people with a lot of miles run per week have high cardiovascular endurance.''
-'''4. The correlation between working out and body fat was found to be exactly -1.0. Which of the following would not be true about the corresponding scatterplot?'''
-'''Choose one answer.'''
-:''(a) The slope of the best line of fit should be -1.0.''
-:''(b) All the points would lie along a perfect straight line, with no deviation at all.''
-:''(c) The best fitting line would have a downhill (negative) slope.''
-:''(d) 100% of the variance in body fat can be predicted from workout.''
-'''5. Suppose that the correlation between working out and body fat was found to be exactly -1.0. Which of the following would NOT be true, about the corresponding scatterplot?'''
-'''Choose one answer.'''
-:''(a) All points would lie along a straight line, with no deviation at all.''
-:''(b) 100% of the variance in body fat can be predicted from the workout.''
-:''(c) The slope of the linear model is -1.0.''
-:''(d) The best fitting line would have a negative slope.''
-'''6. A recent article in an educational research journal reports a correlation of +0.8 between math achievement and overall math aptitude. It also reports a correlation of -0.8 between math achievement and a math anxiety test. Which of the following interpretations is the most correct?'''
-'''Choose one answer'''
-:''(a) You cannot compare a positive and a negative correlation.''
-:''(b) The correlation of +0.8 indicates a stronger relationship than the correlation of -0.8.''
-:''(c) The correlation of +0.8 is just as strong as the correlation of -0.8.''
-:''(d) It is impossible to tell which correlation is stronger.''
-{{hidden|Answer|(c)}}
-'''7. Psychologists have shown that there is a relationship between stress levels and productivity. As stress levels increase, productivity also increases up to a certain point, and after that productivity decreases as stress levels increase. Suppose you were given this data for a random sample of 200 adults. If you calculated the Pearson coefficient of correlation, what would you expect to find?'''
-'''Choose one answer.'''
-:''(a) I would expect r to be between -0.50 to -0.70.''
-:''(b) I would expect r to be -1.''
-:''(c) I would expect r to be between 0.50 and 0.70.''
-:''(d) I would expect r to be +1.''
-:''(e) I would expect r to be zero.''
-'''8. If the correlation coefficient is 0.80, then:'''
-'''Choose one answer.'''
-:''(a) The explanatory variable is usually less than the response variable.''
-:''(b) The explanatory variable is usually more than the response variable.''
-:''(c) None of the statements are correct.''
-:''(d) Below-average values of the explanatory variable are more often associated with below-average values of the response variable.''
-:''(e) Below-average values of the explanatory variable are more often associated with above-average values of the response variable.''
-'''9. Given the following data, what is the best estimate for the coefficient of correlation between the ages of the husbands and wives?'''
-'''There are 50 couples (husband and wife). The age range for men is from 50 to 70 years old. The age range for women is from 48 to 68 years old. For all of the couples, the husband is two years older than the wife. For instance, in one couple the husband is 50 years old and the wife is 48 years old.'''
-'''Choose one answer.'''
-:''(a) The coefficient of correlation between the age of husband and wife is equal to +1.''
-:''(b) We need the actual data to compute the coefficient of correlation between the age of the husband and wife.''
-:''(c) The coefficient of correlation between the age of husband and wife is equal to zero.''
-:''(d) The coefficient of correlation between the age of husband and wife is equal to +0.50.''
-:''(e) The coefficient of correlation between the age of husband and wife is equal to -1.''
 ===[[EBook_Problems_GLM_Corr | Problems]]===
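Two of the ideas exercised in the correlation problems of this revision can be checked numerically: an exact linear relation (each husband two years older than his wife) yields r = +1, while a symmetric rise-then-fall relation (as in the stress and productivity exercise) can yield r near zero even though the variables are strongly related. The sketch below computes the Pearson coefficient directly from its definition; the function name and both small datasets are hypothetical and were made up for this illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ages: every wife is exactly two years younger than her husband
husbands = list(range(50, 71))
wives = [h - 2 for h in husbands]
r_ages = pearson_r(husbands, wives)

# Hypothetical inverted-U relation, symmetric around a stress level of 0
stress = list(range(-5, 6))
productivity = [-(s * s) for s in stress]
r_stress = pearson_r(stress, productivity)

if __name__ == "__main__":
    print(r_ages, r_stress)
```

The second case is a reminder that the Pearson coefficient measures only linear association.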
 ==[[AP_Statistics_Curriculum_2007_GLM_Regress |Regression]]==
 We are now ready to discuss the modeling of linear relations between two variables using '''Regression Analysis'''. This section demonstrates this methodology for the SOCR California Earthquake dataset.
-'''1. Use the information from the Heights of Fathers and Sons to write the linear model that best predicts the height of the son from the height of the father.'''
-'''Choose one answer.'''
-:''(a) Son's height = 35 + 0.5*Father's height''
-:''(b) Son's height = 1.00 + 1.00* Father's height''
-:''(c) The model cannot be determined without the actual data''
-:''(d) Son's height = 0.5 + 35*Father's height''
-'''2. A congressional report investigates the relationship between income of parents and educational attainment of their daughters. Data are from a sample of families with daughters age 18-24. Average parental income is $29,300, average educational attainment of the daughters is 13.1 years of schooling completed, and the correlation is 0.37.
-The regression line for predicting daughter’s education from parental income is reported as: Predicted education = 0.000617*(income) + 8.1
-Is the following statement true or false? "The above line is the regression line to predict education from income."'''
-:''(a) True.''
-:''(b) False.''
-'''3. Heights of Fathers and Sons'''
-'''In the early 1900s, when Francis Galton and Karl Pearson measured 1078 pairs of fathers and their grown-up sons, they calculated that the mean height for fathers was about 68 inches with a standard deviation of 3 inches. For their sons, the mean height was 69 inches with a standard deviation of 3 inches. (The actual numbers are slightly smaller, but we will work with these values to keep the calculations simple.) The correlation coefficient was 0.50. Use the information to calculate the slope of the linear model that predicts the height of the son from the height of the father.'''
-'''Choose one answer.'''
-:''(a) 0.50''
-:''(b) The slope cannot be determined without the actual data''
-:''(c) 35.00''
-:''(d) 3/3 = 1.00''
-'''4. The National Highway Safety Administration is interested in the effect of seat belt use on saving lives. One study reported statistics on children under the age of 5 who were involved in motor vehicle accidents in which at least one fatality occurred. 7,060 such accidents between 1985 and 1989 were studied. Of those who survived, 1129 weren't wearing a seat belt, 432 were wearing an adult seat belt and 733 had a children's carseat belt. Of those with fatalities, 509 had no belt, 73 had an adult seat belt, and 139 had a children's carseat belt.'''
-'''Are seat belt status and the outcome of the accidents independent?'''
-'''Choose one answer.'''
-:''(a) Yes''
-:''(b) No''
-:''(c) Can't tell with the information provided''
-'''5. Suppose that wildlife researchers monitor the local alligator population by taking aerial photographs on a regular schedule. They determine that the best fitting linear model to predict weight in pounds from the length of the gators in inches is:'''
-'''Weight = -393 + 5.9*Length with <math>r^2</math> = 0.836.'''
-'''Which of the following statements is true?'''
-'''Choose one answer.'''
-:''(a) A gator that is about 10 inches above average in length is about 59 pounds above the average weight of these gators.''
-:''(b) The correlation between a gator's length and weight is 0.836.''
-:''(c) The correlation between a gator's height and weight cannot be determined without the actual data.''
-:''(d) The correlation between a gator's height and weight is about -0.914.''
 ===[[EBook_Problems_GLM_Regress | Problems]]===
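When only summary statistics are available, as in the Heights of Fathers and Sons exercises in this revision, the least-squares line follows from slope = r * (sd of y) / (sd of x) and intercept = (mean of y) - slope * (mean of x). The sketch below applies these formulas to the stated values (means 68 and 69 inches, standard deviations of 3 inches, r = 0.50). The function names are hypothetical, and the second helper simply recovers r from r-squared using the sign of the slope, as in the alligator exercise.

```python
import math


def regression_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the least-squares line of y on x."""
    if sd_x <= 0:
        raise ValueError("sd_x must be positive")
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept


def r_from_r_squared(r_squared, slope):
    """Recover r from r^2; r has the same sign as the fitted slope."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie between 0 and 1")
    return math.copysign(math.sqrt(r_squared), slope)


# Fathers (x) and sons (y), using the rounded values given in the exercise
slope, intercept = regression_from_summary(68, 3, 69, 3, 0.5)

# Alligator model: Weight = -393 + 5.9*Length with r^2 = 0.836
r_gator = r_from_r_squared(0.836, 5.9)

if __name__ == "__main__":
    print(f"Son's height = {intercept} + {slope}*Father's height")
    print(f"gator r is about {r_gator:.3f}")
```

With these inputs the slope is 0.5 and the intercept is 35, and the alligator correlation is about +0.914.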
 ==[[AP_Statistics_Curriculum_2007_GLM_Predict |Variation and Prediction Intervals]]==
 In this section, we discuss point and interval estimates about the slope of linear models.
-'''1. Two researchers are going to take a sample of data from the same population of physics students. Researcher A will select a random sample of students from among all students taking physics. Researcher B's sample will consist only of the students in her class. Both researchers will construct a 95% confidence interval for the mean score on the physics final exam using their own sample data. Which researcher's method has a 95% chance of capturing the true mean of the population of all students taking physics?'''
-'''Choose one answer.'''
-:''(a) Researcher B''
-:''(b) Researcher A''
-:''(c) Both methods have a 95% chance of capturing the true mean''
-:''(d) Neither''
-'''2. A random sample of 150 UCLA students found that 35% of the respondents wanted an elevator to replace Bruin Walk. A 95% confidence interval for the percentage of all UCLA students who feel this way is approximately:'''
-'''Choose one answer.'''
-:''(a) (24%, 46%)''
-:''(b) (32%, 38%)''
-:''(c) The sample size is too small to compute a confidence interval.''
-:''(d) (27%, 43%)''
-'''3. According to Terry Pratchett, the shortest unit of time in the multiverse is the New York second, defined as the time interval between the light turning green and the cab behind you honking. A magazine took a poll of 100 New Yorkers and found that 90 people agree with that statement wholeheartedly. Which of the following is a 90% confidence interval for the proportion of people who agree with that statement?'''
-'''Choose one answer.'''
-:''(a) 0.9 +/- 0.50''
-:''(b) 0.9 +/- 0.05''
-:''(c) 0.9 +/- 0.03''
-:''(d) 0.9 +/- 0.06''
-'''4. A national poll found that 62% of all Americans agreed that more attention should be paid to mental health of war veterans. If a simple random sample of 326 people was used to make a 95% confidence interval of (0.57,0.67), what is the margin of error?'''
-'''Choose one answer.'''
-:''(a) 0.03''
-:''(b) 0.05''
-:''(c) 0.12''
-:''(d) In order to calculate the margin of error, we need the p-value of the population.''
-'''5. Hermione Granger is on a mission this year to complain about the astronomical cost of wizarding books to the Hogwarts board of administrators. Given a population mean book cost of 10 galleons and a standard deviation of 2 galleons, if Hermione were to take a simple random sample of 49 students and make a 68% confidence interval, what would be the range of values for the sample mean (x-bar)?'''
-'''Choose one answer.'''
-:''(a) 8 and 12 galleons''
-:''(b) 9.4 and 10.6 galleons''
-:''(c) 6 and 14 galleons''
-:''(d) 9.7 and 10.3 galleons''
-'''6. A 95% confidence interval indicates that:'''
-'''Choose one answer:'''
-:''(a) 95% of the intervals constructed using this process based on samples from this population will include the population mean''
-:''(b) 95% of the time the interval will include the sample mean''
-:''(c) 95% of the possible population means will be included by the interval''
-:''(d) 95% of the possible sample means will be included by the interval''
-'''7. Suppose we want to find out if a coin is not fair. To test this hypothesis we flip the coin 100 times, and in 63 out of 100 flips we get heads. We construct the confidence interval and find it to be (.53,.73). Interpret this confidence interval.'''
-'''Choose one answer.'''
-:''(a) 95 is the Z score that corresponds to our distribution of sample means''
-:''(b) Confidence is something you learn at fraternity parties''
-:''(c) 95% of the time the true proportion of flips that are heads is between .53 and .73''
-:''(d) If we were to repeat this experiment over and over again, 95 times out of 100 our confidence interval would cover the true proportion of flips that are heads''
-'''8. A 95% confidence interval is calculated for a sample of weights of 100 randomly selected pigs, and is (42 pounds, 48 pounds). Will the sample mean weight fall within the confidence interval?'''
-'''Choose one answer.'''
-:''(a) Yes''
-:''(b) We need more information to determine if this is true.''
-:''(c) No''
-'''9. The average number of fruit candies in a large bag is estimated. The 95% confidence interval is (40, 48). Based on this information, you know that the best estimate of the population mean is:'''
-'''Choose one answer.'''
-:''(a) 43''
-:''(b) 40''
-:''(c) 45''
-:''(d) none of the above.''
-:''(e) 44''
-'''10. Suppose we plan to take a random sample of adults in the U.S. and determine the percent of them who have attended church in the last 30 days. We calculate a 90% confidence interval for the proportion of all adults in the U.S. who attended church in the last 30 days. Which of the following changes in our plans would result in a wider confidence interval? Check all that apply.'''
-'''Choose at least one answer.'''
-:''(a) Using an 85% confidence level.''
-:''(b) Using a 95% confidence level.''
-:''(c) Using a larger sample.''
-:''(d) Using a smaller sample.''
-'''11. Kevin has always, ever since he was a wee lad, wondered what proportion of the candies in M&M chocolate candy bags are yellow. However, his persistent calls to the M&M headquarters were of no avail. Now that he wields the awesome power of being a TA for Stat 10, he makes each of his 200 students go buy an M&M bag, count the colors, and compute a 99% confidence interval for the yellow candy proportion. Assuming that each M&M bag is a random sample, approximately how many of the 200 confidence intervals will not capture the true population proportion for yellow M&M's?'''
-'''Choose one answer.'''
-:''(a) Not enough information for an answer''
-:''(b) 0 to 4''
-:''(c) 4 to 8''
-:''(d) 12 to 14''
-:''(e) 8 to 12''
-'''12. A 95% confidence interval for the proportion of U.S. adults who favor the death penalty  is given by (0.03, 0.09). Is the following statement true or false?'''
-'''"There is a 95% probability that an adult in the US is in favor of the death penalty."'''
-:''(a) True''
-:''(b) False''
 ===[[EBook_Problems_GLM_Predict | Problems]]===
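Several of the interval exercises in this revision reduce to two small calculations: the range mean +/- z * sd / sqrt(n) for a sample mean, and the margin of error as half the width of a reported interval. The sketch below applies them to the book-cost exercise (mean 10, standard deviation 2, n = 49, 68% level) and to the poll interval (0.57, 0.67). The function names are hypothetical, and the exact Normal quantile is used where a hand calculation would typically take z = 1 for the 68% level.

```python
import math
from statistics import NormalDist


def mean_interval(mu: float, sigma: float, n: int, confidence: float):
    """Central range for the sample mean under a Normal model."""
    if n <= 0 or sigma <= 0:
        raise ValueError("need n > 0 and sigma > 0")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / math.sqrt(n)  # z times the standard error
    return mu - margin, mu + margin


def margin_of_error(lower: float, upper: float) -> float:
    """Half the width of a symmetric confidence interval."""
    if upper < lower:
        raise ValueError("upper must not be smaller than lower")
    return (upper - lower) / 2.0


low, high = mean_interval(10, 2, 49, 0.68)
moe = margin_of_error(0.57, 0.67)

if __name__ == "__main__":
    print(f"68% range for x-bar: ({low:.2f}, {high:.2f})")
    print(f"margin of error: {moe:.2f}")
```

Under these assumptions the range is roughly 9.7 to 10.3 galleons and the margin of error is about 0.05.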


EBook Problems

From Socr
