# AP Statistics Curriculum 2007 Estim MOM MLE

## General Advanced-Placement (AP) Statistics Curriculum - Method of Moments and Maximum Likelihood Estimation

Suppose we flip a coin 8 times and observe the number of heads (successes) in the outcomes. How would we estimate the true (unknown) probability of a Head (P(H)=?) for this specific coin? There are a number of other similar situations where we need to evaluate, predict or estimate a population (or process) parameter of interest using an observed data sample.

There are many ways to obtain point (value) estimates of various population parameters of interest, using observed data from the specific process we study. The method of moments and maximum likelihood estimation are among the most frequently used in practice.

Some practical demonstrations of parameter estimation are shown in the SOCR Modeler Normal and Beta Distribution Model Fitting Activity, which uses the SOCR Modeler applets.

### Method of Moments (MOM) Estimation

Parameter estimation using the method of moments is both intuitive and easy to calculate. The idea is to use the sample data to calculate some sample moments and then set these equal to their corresponding population counterparts. Typically the latter involve the parameter(s) that we are interested in estimating, and thus we obtain a computationally tractable protocol for their estimation. Summarizing the MOM:

• First: Determine the k parameters of interest and the specific (model) distribution for this process;
• Second: Compute the first k (or more) sample-moments;
• Third: Set the sample-moments equal to the population moments and solve the resulting (linear or non-linear) system of k equations in k unknowns.

#### MOM Proportion Example

Let's look at the motivational problem we discussed above. We want to flip a coin 8 times, observe the number of heads (successes) in the outcomes and use that to infer the true (unknown) probability of a Head (P(H)=?) for this specific coin.

• Hypothetical solution: Suppose we observe the following sequence of outcomes {T,H,T,H,H,T,H,H}. Using the MOM protocol we obtain:
  • There is one parameter of interest, p=P(H), and the process is a Binomial experiment.
  • The first population moment of a Bernoulli process is p=E(Y). Therefore, if the random variable X = {# H’s} is Binomially distributed, then the expected value of X is E(X)=np=8p. The observed (sample) value of X is Sample#H’s = 5. Equating the first sample moment to the first population moment yields $8p \approx 5$. Hence, we would estimate the unknown $p=P(H) \approx MOM(p)=\hat{p}={5\over 8}$.
• Experimental solution: We can also use SOCR Experiments to demonstrate the MOM estimation technique. You can refer to the SOCR Coin Sample Experiment for more information about this SOCR applet. In the illustrated run, a coin is flipped 8 times and 5 Heads are observed. The number of Heads follows a Binomial(n=8, p=0.65) distribution. However, let's pretend for a minute that we did not know the actual p=P(H) value! Then we have a good approximation $0.65=p=P(H) \approx MOM(p)=\hat{p}={5\over 8}=0.625$. Of course, if we run this experiment again, our MOM estimate for p would change!
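The hypothetical solution above can be reproduced in a few lines. The following is a minimal Python sketch, not part of the SOCR tools; the function name `mom_proportion` is a hypothetical helper introduced here for illustration.

```python
from fractions import Fraction


def mom_proportion(outcomes, success="H"):
    """Method-of-moments estimate of p = P(success).

    Equating the observed number of successes to E(X) = n * p
    gives p_hat = (number of successes) / n.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    successes = sum(1 for o in outcomes if o == success)
    return Fraction(successes, n)


# The observed sequence from the example: 5 Heads in 8 flips.
flips = ["T", "H", "T", "H", "H", "T", "H", "H"]
p_hat = mom_proportion(flips)
print(p_hat, float(p_hat))  # 5/8 0.625
```

Using `Fraction` keeps the estimate exact (5/8), which makes it easy to compare with the hand calculation.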

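More information about the method of moments for parameter (point) estimation may be found in the Wikipedia article on the method of moments (http://en.wikipedia.org/wiki/Method_of_moments_%28statistics%29).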

#### MOM Beta Distribution Example

Suppose we have the following 10 observations, which we suspect came from a Beta distribution.

| Data | 0.055 | 1.005 | 0.075 | 0.005 | 0.075 | 1.005 | 0.005 | 0.025 | 0.035 | 0.225 |
|---|---|---|---|---|---|---|---|---|---|---|

The Beta Distribution mean and variance are defined explicitly in terms of the 2 parameters of the distribution (see the SOCR Beta Distribution Calculator Applet):

mean: $\mu=\frac{\alpha}{\alpha+\beta}\!$
variance: $\sigma^2=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\!$

The sample-mean and sample-variance are:

sample-mean: $\overline{x}=0.251$
sample-variance: $s^2=0.16187$

Solve the 2 equations (for the first two moments according to the MOM parameter estimation protocol) for the unknown parameters (α and β):

$\overline{x}=\mu$ and
$s^2=\sigma^2$.
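These two equations have a closed-form solution. Since $\sigma^2=\mu(1-\mu)/(\alpha+\beta+1)$, we get $\alpha+\beta=\overline{x}(1-\overline{x})/s^2-1$, and then $\hat{\alpha}=\overline{x}(\alpha+\beta)$ and $\hat{\beta}=(1-\overline{x})(\alpha+\beta)$. The Python sketch below applies this to the ten observations; the function name `beta_mom` is a hypothetical helper, and the data list is the set of ten values that reproduces the stated sample mean and variance.

```python
from statistics import mean, variance


def beta_mom(xs):
    """Method-of-moments estimates (alpha, beta) for a Beta distribution.

    Solves  mean = a / (a + b)  and  var = a*b / ((a + b)^2 (a + b + 1)).
    """
    m = mean(xs)
    v = variance(xs)            # sample variance (n - 1 denominator)
    total = m * (1 - m) / v - 1  # this is alpha + beta
    if total <= 0:
        raise ValueError("moment equations have no solution with alpha, beta > 0")
    return m * total, (1 - m) * total


# Ten observations; they give sample mean 0.251 and sample variance 0.16187.
data = [0.055, 1.005, 0.075, 0.005, 0.075, 1.005, 0.005, 0.025, 0.035, 0.225]
alpha_hat, beta_hat = beta_mom(data)
print(round(alpha_hat, 4), round(beta_hat, 4))  # roughly 0.04 and 0.12
```

Note that two of the listed values (1.005) lie slightly outside the interval [0, 1] on which the Beta density is defined; the moment equations can still be solved, which is one reason MOM estimates should be checked against the model.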

You can use the SOCR Modeler to generate other similar examples for many different distributions (see the SOCR RNG Activity).

### Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation (MLE) is another popular statistical technique for parameter estimation. Estimating distribution parameters by maximum likelihood, based on observed real-world data, offers a way of tuning the free parameters of the model to provide an optimum fit. Summarizing the MLE:

Suppose we observe a sample $x_1,x_2,\dots,x_n$ of n values from one distribution with probability density/mass function $f_\theta$, and we are trying to estimate the (vector of) parameter(s) θ. We can compute the (multivariate) probability density associated with our observed data, $f_\theta(x_1,\dots,x_n \mid \theta).\,\!$

As a function of θ with $x_1,\dots,x_n$ fixed, the likelihood function is:

$\mathcal{L}(\theta) = f_{\theta}(x_1,\dots,x_n \mid \theta).\,\!$

The method of maximum likelihood estimates θ by finding the value of θ that maximizes $\mathcal{L}(\theta)$. Thus, the maximum likelihood estimator (MLE) of θ is:

$\hat{\theta} = \arg \max_{\theta}{\mathcal{L}(\theta)}.$

The outcome of a maximum likelihood analysis is the maximum likelihood estimate $\hat{\theta}$. One typically assumes that the observed data are independent and identically distributed (IID) with unknown parameters (θ). This considerably simplifies the problem because the likelihood can then be written as a product of n univariate probability densities:

$\mathcal{L}(\theta) = \prod_{i=1}^n f_{\theta}(x_i \mid \theta)$

and since maxima are unaffected by monotone transformations, one can take the logarithm of this expression to turn it into a sum:

$\mathcal{L}^*(\theta) = \sum_{i=1}^n \log f_{\theta}(x_i \mid \theta).$

The maximum of this expression can then be found numerically using various optimization algorithms.

Note that the maximum likelihood estimator is not guaranteed to exist and, when it does exist, it may not be unique.

#### MLE Proportion Example

Let's look again at the motivational problem of flipping a coin 8 times, observing the number of heads (successes) in the outcomes and using this to infer (based on MLE) the true (unknown) probability of a Head (P(H)=?) for this specific coin.

Suppose again we observe the same sequence of 8 outcomes {T,H,T,H,H,T,H,H}. Using the MLE protocol we obtain:

• Likelihood function: $f(x|\theta=p)={8\choose 5}p^5(1-p)^3$
• Log-likelihood function: $\mathcal{L}^*(\theta) = \ln{{8\choose 5}p^5(1-p)^3}=\ln{8\choose 5} + 5\ln{p} +3\ln{(1-p)}$.
• Maximize the log-likelihood function by setting its first derivative to zero: $0={d \left(\ln{8\choose 5} + 5\ln{p} +3\ln{(1-p)}\right) \over dp}={5\over p}-{3\over 1-p}$. Thus, $5(1-p)-3p=0$, and $\hat{p}={5\over 8}$.

In this case MOM(p) = MLE(p); however, this is not true in general.
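The same maximum can also be located numerically, which is how one would proceed when no closed-form solution is available. The Python sketch below uses golden-section search, one possible choice among the "various optimization algorithms" mentioned earlier; the function names are hypothetical, and the additive constant $\ln{8\choose 5}$ is dropped because it does not affect the location of the maximum.

```python
import math


def log_likelihood(p, heads=5, tails=3):
    """Binomial log-likelihood for 5 Heads and 3 Tails, without the constant term."""
    return heads * math.log(p) + tails * math.log(1 - p)


def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)  # left interior point
        d = a + inv_phi * (b - a)  # right interior point
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


# Search strictly inside (0, 1), where the logarithms are defined.
p_hat = golden_section_max(log_likelihood, 1e-9, 1 - 1e-9)
print(p_hat)  # close to 5/8 = 0.625
```

The search relies on the log-likelihood being unimodal on (0, 1), which holds here because its second derivative, $-5/p^2-3/(1-p)^2$, is negative everywhere.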

#### Normal Mean MLE Estimation Example

Suppose we have observed IID {$x_1, \cdots, x_n$}={0.5,0.3,0.6,0.1,0.2}, subjects' weights, coming from $N(\mu, \sigma^2=1)$, with marginal density function $f(x|\mu)$. Then the joint density is $f(x_1, \cdots,x_n| \mu) =f(x_1|\mu)\times \cdots \times f(x_n|\mu)$ and the likelihood function $L(\mu) = f(x_1, \cdots,x_n|\mu)$.

Of course, we are trying to estimate the average weight for a subject from this population -- i.e., we are trying to find the MLE(μ).

$L(\mu) = (2\pi)^{-5/2}\, e^{-{(0.5-\mu)^2+(0.3-\mu)^2+(0.6-\mu)^2+(0.1-\mu)^2+(0.2-\mu)^2\over 2}}$
$\ln{L(\mu)} = -{5\over 2}\ln{(2\pi)} -{1\over 2}({(0.5-\mu)^2+(0.3-\mu)^2+(0.6-\mu)^2+(0.1-\mu)^2+(0.2-\mu)^2})$
$0={d\ln{L(\mu)} \over d\mu} = (0.5-\mu)+(0.3-\mu)+(0.6-\mu)+(0.1-\mu)+(0.2-\mu) = -5\mu +1.7$
Thus, $\hat{\mu}=0.34$
Validate that this value indeed maximizes the log-likelihood function, i.e., ${d^2 \ln{L(\mu)} \over d\mu^2}(\mu=\hat{\mu}) = -5 <0$.
• How does this estimate, $\hat{\mu}=0.34$, compare to the sample average of the 5 observations?
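The derivation can be checked numerically. The Python sketch below is illustrative only, with hypothetical function names: it evaluates the score (the first derivative of the log-likelihood) at the sample average and confirms that the log-likelihood is lower at nearby values of μ.

```python
import math

weights = [0.5, 0.3, 0.6, 0.1, 0.2]


def log_likelihood(mu, xs):
    """Log-likelihood of IID N(mu, 1) observations."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - mu) ** 2 for x in xs)


def score(mu, xs):
    """d/dmu of the log-likelihood: the sum of (x_i - mu)."""
    return sum(x - mu for x in xs)


# Solving score(mu) = 0 gives the sample average.
mu_hat = sum(weights) / len(weights)
print(round(mu_hat, 2))                # 0.34
print(abs(score(mu_hat, weights)) < 1e-12)  # True: the score vanishes at mu_hat

# The second derivative is -n = -5 for every mu, so mu_hat is a maximum.
second_derivative = -len(weights)
```

For this model the MLE of μ coincides with the sample average, which is also the MOM estimate based on the first moment.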

### MOM vs. MLE

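• The MOM is generally regarded as inferior to Fisher's MLE method, because maximum likelihood estimators typically have a higher probability of being close to the quantities to be estimated (under standard regularity conditions they are asymptotically efficient).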
• MLE may be intractable in some situations, whereas the MOM estimates can be quickly and easily calculated by hand or using a computer.
• MOM estimates may be used as the first approximations to the solutions of the MLE method, and successive improved approximations may then be found by the Newton-Raphson method. In this respect, the MOM and MLE are symbiotic.
• Sometimes, MOM estimates may fall outside of the parameter space, in which case they are unreliable. This cannot happen with the MLE method, which maximizes the likelihood over the parameter space by construction.
• MOM estimates are not necessarily sufficient statistics, i.e., they sometimes fail to take into account all relevant information in the sample.
• MOM may be preferred to MLE for estimating some structural parameters (e.g., parameters of a utility function, instead of parameters of a known probability distribution), when appropriate probability distributions are unknown.
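As an illustration of the symbiosis between the two methods, the Python sketch below uses a MOM estimate as the starting point for Newton-Raphson iterations towards the MLE. The model is an assumption chosen for illustration and does not come from the examples above: a logistic distribution with unknown location μ and scale fixed at 1, for which the MOM estimate is the sample mean but the MLE has no closed form. The data values and function names are hypothetical.

```python
import math
from statistics import mean


def score(mu, xs):
    """First derivative of the logistic(mu, scale=1) log-likelihood."""
    return sum(math.tanh((x - mu) / 2) for x in xs)


def score_derivative(mu, xs):
    """Second derivative of the log-likelihood (negative for every mu)."""
    return -0.5 * sum(1 / math.cosh((x - mu) / 2) ** 2 for x in xs)


def newton_mle(xs, start, tol=1e-10, max_iter=50):
    """Newton-Raphson iterations for the root of the score equation."""
    mu = start
    for _ in range(max_iter):
        step = score(mu, xs) / score_derivative(mu, xs)
        mu -= step
        if abs(step) < tol:
            return mu
    raise RuntimeError("Newton-Raphson did not converge")


sample = [-1.2, 0.4, 0.9, 1.1, 2.3, 6.0]  # hypothetical data
mom_start = mean(sample)                   # MOM estimate of the location
mle = newton_mle(sample, mom_start)
print(mom_start, mle)                      # the MLE is pulled less by the outlier 6.0
```

The MOM estimate is cheap to compute and usually lands near the maximum, so only a few Newton-Raphson steps are needed from there.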

### Parameter Estimation Examples

The SOCR Modeler and the corresponding SOCR Modeler Activities provide a number of interesting examples of parameter (point) estimation in terms of fitting best models to observed data.