# SOCR EduMaterials Activities 2D PointSegmentation EM Mixture

(Difference between revisions)
 Revision as of 20:12, 4 April 2007 (view source)IvoDinov (Talk | contribs)← Older edit Current revision as of 20:55, 25 May 2016 (view source)IvoDinov (Talk | contribs) (→Data Import Format) (26 intermediate revisions not shown) Line 1: Line 1: - == [[SOCR_EduMaterials_Activities | SOCR Educational Materials - Activities ]] - SOCR Activity Demonstrating Expectation Maximization and Mixture Modeling == + == [[SOCR_EduMaterials_Activities | SOCR Educational Materials - Activities]] - SOCR Activity Demonstrating Expectation Maximization and Mixture Modeling == == Summary == == Summary == - This is activity demonstrates mixture modeling and expectation maximization (EM) applied to the problem of 2D point cluster segmentation. + This activity demonstrates mixture modeling and expectation maximization (EM) applied to the problem of 2D point cluster segmentation. == Background== == Background== - You may find useful this review the mathematics of [http://www.stat.ucla.edu/%7Edinov/courses_students.dir/04/Spring/Stat233.dir/STAT233_notes.dir/EM_Tutorial.pdf Expectation Maximization and Mixture Modeling]. In this activity we will demonstrate how the EM and mixture modeling may be used to obtain cluster classification of points in 2D. There are a number of problems where such segmentation is very important for solving a practical problem. 1D and 3D applications of the EM mixture modeling are also included at the end. + It may be useful to review the mathematics of [http://repositories.cdlib.org/socr/EM_MM Expectation Maximization and Mixture Modeling]. In this activity, we will demonstrate how the EM and mixture modeling may be used to obtain cluster classification of points in 2D. There are problems where such segmentation is very important for solving a practical problem. 1D and 3D applications of the EM mixture modeling are also included at the end. ==Exercises== ==Exercises== === '''Exercise 1''': SOCR Charts Activity=== === '''Exercise 1''': SOCR Charts Activity=== - * This exercise demonstrates the applications of expectation maximization and mixure modeling for cluster classification of points in 2D. Go to [http://www.socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts] and select '''Line Charts --> SOCR EM MixtureModelChart'''. The image below demonstrates the overall look-and-feel of the SOCR EM Mixture Model applet. Try adding several clusters of points and select different number of the kernels/mixtures you want to fit to this data. Notice the adaptive behavior of the algorithm. + * This exercise demonstrates the applications of expectation maximization and mixture modeling for cluster classification of points in 2D. Go to [http://www.socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts] and select '''Line Charts --> SOCR EM MixtureModelChart'''. The image below demonstrates the overall look-and-feel of the SOCR EM Mixture Model applet. Try adding several clusters of points and select different number of the [http://en.wikipedia.org/wiki/Gaussian_kernel kernels]/[http://en.wikipedia.org/wiki/Mixture_model mixtures] you want to fit to this data. Notice the adaptive behavior of this iterative algorithm - the kernels will dynamically adjust to fit the data. The image below shows the main widgets/controls of this applet and the results of fitting a mixture of 3 kernels to the defaults data.
[[Image:SOCR_Activities_EMMixtureModel_Dinov_040407_Fig1.jpg|400px]]
[[Image:SOCR_Activities_EMMixtureModel_Dinov_040407_Fig1.jpg|400px]]
- * ''Data'': You can enter data (paired X,Y observations) either via the Data-Tab, using copy-and-paste, or via the add random points button ('''RandomPts'''). This applet was designed specifically so that the user may enter his/her own paired observations in tabular form. If you are not interested in data analysis, but want to explore more the properties of the EM mixture model you may try the second SOCR applet ([http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments], select '''Mixture Model EM Experiment''' from the drop down list of [http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments] on the top-left), which allows you to manually add points by clicking on the graphing canvas + * ''Data'': You can enter data (paired X,Y observations) either via the Data-Tab, using copy-and-paste, or via the add random points button ('''RandomPts'''). This applet was designed specifically so that you may enter your own paired observations in tabular form. If you are not interested in data analysis, but want to explore more the properties of the EM mixture model, you may try the second SOCR applet ([http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments], select '''Mixture Model EM Experiment''' from the drop down list of [http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments] on the top-left), which allows you to manually add points by clicking on the graphing canvas. At the end of this document, the [[SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture#Step-by-Step_Data_Analysis_Protocol | step-by-step protocol]] is included for running the algorithm on any dataset. + ** You can also use the [[SOCR_Data_KneePainData_041409 | 8,000 knee-pain data centroids available here]]. - * ''Applet specifications'': You can fit one of two models to your data (Linear or elliptical Gaussian), see the '''GaussianMix''' button. The '''Normal''' button allows you to choose Slow, Normal and Fast speed for the iterative EM estimation. The '''ClearPts''' will remove all data and reset the applet. The InitKernels button re-initializes the kernels (location and size) and is useful for shaking the EM algorithm to escape local minima. The '''drop-down list''' on the top-right allows you to select the number of kernels you want to fit to this data. There is no exact method to determine this number for all situations (manually or automatically). Typically, one uses the physical or biological properties of the data to determine the appropriate integer value. The '''Step''', '''Run''' and '''Stop''' buttons on the top have the natural functions associated with driving the iterative EM algorithm. The '''Segment''' button allows labeling/classification of the points once the EM algorithm has converged or is stopped. + * ''Applet Specifications'': You can fit one of two models to your data (Linear or elliptical Gaussian), see the '''GaussianMix''' button. Linear model will only show the main direction of variation for a sub-cluster. A Gaussian model will illustrate additional information, including the scope of the kernel coverage. The '''Normal''' button allows you to choose ''Slow'', ''Normal'' or ''Fast'' speed for the iterative EM estimation. The '''ClearPts''' will remove all data and reset the applet. The '''InitKernels''' button re-initializes the kernels (location and size) and is useful for shaking the EM algorithm to escape the local minimum. A kernel is roughly a weighting function used in non-parametric techniques. Kernel locations are generally non trivial to choose - we use random initial kernel locations ('''InitKernels''' button). The '''drop-down list''' on the top-right allows you to select the number of kernels you want to fit to this data. There is no exact method for determining this number in all situations (manually or automatically). Typically, one uses the physical or biological properties of the studied data/process to determine the appropriate integer value. On the top, the '''Step''', '''Run''' and '''Stop''' buttons have the natural functions associated with driving the iterative EM algorithm. The '''Segment''' button allows labeling/classification of the points once the EM algorithm has converged or stopped. - === '''Exercise 2'''=== + * ''Segmentation Results'': Once the EM algorithm converges to a visually satisfactory result you should stop the iterative process ('''Stop''' button) and click on the '''Segment''' button. You will obtain a color classification of all points in 2D based on which of the kernels is most likely to contain the point in its neighborhood. In addition to this visual classification, the Data tab-panel will contain a couple of result columns that contain the complete analytical description of the kernels (as 2D Gaussian), the [http://en.wikipedia.org/wiki/Mixture_model mixture-model] weight coefficients, the [http://en.wikipedia.org/wiki/Log-likelihood log-likelihood function] (quantifying the match between the mixture model and the data), and color-coded membership of all data points to one of the kernels. +
[[Image:SOCR_Activities_EMMixtureModel_Dinov_040407_Fig4.jpg|400px]]
+ + === '''Exercise 2''': 1D and 3D Examples of [[SOCR]] EM Mixture Modeling=== You may also see the action of the same SOCR EM Mixture modeling algorithm for analyzing 1D or 3D data. You may also see the action of the same SOCR EM Mixture modeling algorithm for analyzing 1D or 3D data. - * For 1D data, you can see the EM mixture model fitting used by the [http://www.socr.ucla.edu/htmls/SOCR_Modeler.html SOCR Modeler] to fit a polynomial, spectral or distribution model to (randomly sampled or observed) data. To see this, go to [http://www.socr.ucla.edu/htmls/SOCR_Modeler.html SOCR Modeler] and select '''MixedFit_Modeler''' from the drop-down list of models on the top-left. The figure below shows the result of fitting a 3-kernel Mixture of Normal (Gaussian) distributions to the histogram of a random sample of 100 Cauchy random variables. + * For 1D data, you can see the EM mixture model fitting used by the [http://www.socr.ucla.edu/htmls/SOCR_Modeler.html SOCR Modeler] to fit a polynomial, spectral or distribution model for randomly sampled or observed data. To see this, go to [http://www.socr.ucla.edu/htmls/SOCR_Modeler.html SOCR Modeler] and select '''MixedFit_Modeler''' from the drop-down list of models on the top-left. The figure below shows the result of fitting a 3-kernel Mixture of Normal (Gaussian) distributions to the histogram of a random sample of 100 Cauchy random variables.
[[Image:SOCR_Activities_EMMixtureModel_Dinov_040407_Fig2.jpg|400px]]
[[Image:SOCR_Activities_EMMixtureModel_Dinov_040407_Fig2.jpg|400px]]
- * A demonstration of a 3D data analysis using the SOCR EM Mixture model is included in the [http://www.loni.ucla.edu/download/LOVE/LOVE_User_Guide.pdf LONI Viz Manual]. This shows how 3D brain imaging data may be segmented into three tissue types (White Matter, Gray Matter and Cerebrospinal Fluid). This is achieved by LONI Viz sending the segmentation tasks to SOCR and SOCR returning back the 3D segmented volumes which are superimposed dynamically on top of the initial anatomical brain imaging data in real time. The figure below illustrates this functionality. Other external computational tools could also invoke SOCR statistical computing resources directly by using the [http://www.socr.ucla.edu/htmls/SOCR_Download.html SOCR JAR binaries] and the [http://www.socr.ucla.edu/docs/SOCR_Documentation.html SOCR Documentation]. + * A demonstration of a 3D data analysis using the SOCR EM Mixture model is included in the [http://www.loni.ucla.edu/download/LOVE/LOVE_User_Guide.pdf LONI Viz Manual]. This shows how 3D brain imaging data may be segmented into three tissue types (White Matter, Gray Matter and Cerebra-spinal Fluid). This is achieved by LONI Viz (Dinov et al., 2006) sending the segmentation tasks to SOCR and SOCR returning back the 3D segmented volumes, which are superimposed dynamically on top of the initial anatomical brain imaging data in real time. The figure below illustrates this functionality. Other external computational tools could also invoke SOCR statistical computing resources directly by using the [http://www.socr.ucla.edu/htmls/SOCR_Download.html SOCR JAR binaries] and the [http://www.socr.ucla.edu/docs/SOCR_Documentation.html SOCR Documentation].
[[Image:SOCR_Activities_EMMixtureModel_Dinov_040407_Fig3.jpg|400px]]
[[Image:SOCR_Activities_EMMixtureModel_Dinov_040407_Fig3.jpg|400px]]
- ==Questions== ==Questions== * How stable is the EM algorithm in finding a solution? Does this solution appear to be unique? * How stable is the EM algorithm in finding a solution? Does this solution appear to be unique? * What affects the convergence properties of the EM algorithm (data characteristics, number and properties of the starting initial kernel(s), etc.)? * What affects the convergence properties of the EM algorithm (data characteristics, number and properties of the starting initial kernel(s), etc.)? + + + ==Step-by-Step Data Analysis Protocol== + The following 10 steps show the protocol for entering and analyzing new data using the SOCR EM Mixture model Chart tool. + * Go to: http://www.socr.ucla.edu/htmls/SOCR_Charts.html --> Line Charts --> SOCR EM Mixture Chart. + * '''ClearPnts''' (remove default data). + * '''Data-Tab''': Click the top-left cell & paste in your data (in two column tabular format) using the paste button. + * '''Mapping Tab''': Map the two columns (C1 and C2) accordingly to X & Y axes. + * Click '''UpdateChart''' button. + * Select '''Number of Kernels''' (drop-down list of integers on the top-right). + * InitKernels (choose an appropriate starting configuration – could be arbitrary) + * Use the '''Step''' or '''Run''' buttons to go one iteration or continuously in the EM Mixture algorithm. + * Use '''Stop''' button to stop (for continuous running). + * Click the '''Segment''' button to find point association to kernels. + * Finally, you can right-click on the SOCR Graph to adjust the display or save/print the result. + + Your quantitative results will be stored as new columns in the data table ('''Data Tab'''). + + ===Manual Kernel Initialization=== + Note that you can manually initialize the Gaussian kernels as follows: + To add (up to 9) kernels manually, hold down the '''ALT''' key, click and drag the mouse to select a rectangle for each of the kernels. While holding the '''ALT''' key, you can repeat this to add more manually chosen kernels (up to 9). There will be 2 new buttons that appear in this manual kernel-selection mode. The ''removeLastKernel'' button will remove the last manually added kernel, and the ''autoInitKernel'' button will switch back to the auto kernel-selection mode, where kernels are randomly chosen. The figure below illustrates the manual kernel selection process + +
[[Image:SOCR_Activities_EMMixtureModel_Dinov_102308_Fig3.jpg|400px]]
+ + === Data Import Format=== + Remember that in the [http://www.socr.ucla.edu/htmls/chart/EM_MixtureModeling_Chart.html 2D mixture modeling applet] you only need 2D data (2 columns). If you have more remove the unnecessary columns first (e.g., using Excel), or load all columns but later specify which 2 columns shoud be considered as X and Y. When File-importing data, note that your data must be in CSV format. If you have a header Column, it must start with a hashtag (#), like this: + #X Y + 11 73 + 20 88 + 19 73 + 15 65 + 21 57 + 26 101 + 24 117 + 35 106 + 37 96 + + The easiest way is to copy-paste only the values (no headers) into the applet by clicking on the first cell (1,1) and clicking the paste button (do not paste in data using short keys), and then [[SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture#Step-by-Step_Data_Analysis_Protocol|follow the step-by-step protocol]] to generate the plot. Remember after you paste in (or File-load in) the data, you need to go to the '''Mapping Tab''' to select your 2 columns (X,Y), click “'''Update Chart'''”, etc., according to the step-by-step protocol. + + Try this protocol with the [[SOCR_Data_KneePainData_041409|SOCR Knee Pain dataset]]. + + ==See also== + * [http://socr.ucla.edu/htmls/dist/Mixture_Distribution.html SOCR Mixture-Distribution applet] + * [[SOCR_EduMaterials_ModelerActivities_MixtureModel_1 | SOCR 1D Mixture Modeling Activity]] + * [[SOCR_EduMaterials_AnalysisActivities_HierarchicalClustering|SOCR Hierarchical Clustering Analysis Activity]] + * [[SOCR_Data_KneePainData_041409| Knee Pain Simulated Data]] + * See also the Bivariate Normal Distribution [[SOCR_BivariateNormal_JS_Activity|activity]] and [http://socr.ucla.edu/htmls/HTML5/BivariateNormal/ webapp].

==References== ==References== - * [http://www.stat.ucla.edu/%7Edinov/courses_students.dir/04/Spring/Stat233.dir/STAT233_notes.dir/EM_Tutorial.pdf  The Mathematics of EM Mixture Modeling] + * Ivo D. Dinov, [http://repositories.cdlib.org/socr/EM_MM Expectation Maximization and Mixture Modeling Tutorial] (December 9, 2008). Statistics Online Computational Resource. Paper EM_MM, http://repositories.cdlib.org/socr/EM_MM. * [http://www.cs.tut.fi/~jupeto/2504tohkaj.pdf Genetic Algorithms and Mixture Modeling] * [http://www.cs.tut.fi/~jupeto/2504tohkaj.pdf Genetic Algorithms and Mixture Modeling] + * [http://www.stat.ucla.edu/%7Edinov/courses_students.dir/07/Winter/PN284.dir/NITP_PN_M284_Inference2.pdf Lecture notes on Statistical Methods in Neuroimaging] + * [http://socr.ucla.edu/htmls/SOCR_Modeler.html Web-based executable Applet]: http://socr.ucla.edu/htmls/SOCR_Modeler.html + * [[SOCR_EduMaterials_ModelerActivities | SOCR EM Wiki activities]] + * [http://www.socr.ucla.edu/docs/edu/ucla/stat/SOCR/modeler/RiceFit_Modeler.html HTML JavaDocs] + * Source code + ** http://socr.googlecode.com/ + ** http://code.google.com/p/socr/source/browse/#svn/trunk/SOCR2.0/src/edu/ucla/stat/SOCR + ** http://code.google.com/p/socr/source/browse/trunk/SOCR2.0/src/edu/ucla/stat/SOCR/modeler/RiceFit_Modeler.java + * [[AP_Statistics_Curriculum_2007_Estim_MOM_MLE | EBook Section on MOM and MLE estimation]]

## Summary

This activity demonstrates mixture modeling and expectation maximization (EM) applied to the problem of 2D point cluster segmentation.

## Background

It may be useful to review the mathematics of Expectation Maximization and Mixture Modeling. In this activity, we will demonstrate how the EM and mixture modeling may be used to obtain cluster classification of points in 2D. There are problems where such segmentation is very important for solving a practical problem. 1D and 3D applications of the EM mixture modeling are also included at the end.

## Exercises

### Exercise 1: SOCR Charts Activity

• This exercise demonstrates the applications of expectation maximization and mixture modeling for cluster classification of points in 2D. Go to SOCR Charts and select Line Charts --> SOCR EM MixtureModelChart. The image below demonstrates the overall look-and-feel of the SOCR EM Mixture Model applet. Try adding several clusters of points and select different number of the kernels/mixtures you want to fit to this data. Notice the adaptive behavior of this iterative algorithm - the kernels will dynamically adjust to fit the data. The image below shows the main widgets/controls of this applet and the results of fitting a mixture of 3 kernels to the defaults data.
• Data: You can enter data (paired X,Y observations) either via the Data-Tab, using copy-and-paste, or via the add random points button (RandomPts). This applet was designed specifically so that you may enter your own paired observations in tabular form. If you are not interested in data analysis, but want to explore more the properties of the EM mixture model, you may try the second SOCR applet (SOCR Experiments, select Mixture Model EM Experiment from the drop down list of SOCR Experiments on the top-left), which allows you to manually add points by clicking on the graphing canvas. At the end of this document, the step-by-step protocol is included for running the algorithm on any dataset.
• Applet Specifications: You can fit one of two models to your data (Linear or elliptical Gaussian), see the GaussianMix button. Linear model will only show the main direction of variation for a sub-cluster. A Gaussian model will illustrate additional information, including the scope of the kernel coverage. The Normal button allows you to choose Slow, Normal or Fast speed for the iterative EM estimation. The ClearPts will remove all data and reset the applet. The InitKernels button re-initializes the kernels (location and size) and is useful for shaking the EM algorithm to escape the local minimum. A kernel is roughly a weighting function used in non-parametric techniques. Kernel locations are generally non trivial to choose - we use random initial kernel locations (InitKernels button). The drop-down list on the top-right allows you to select the number of kernels you want to fit to this data. There is no exact method for determining this number in all situations (manually or automatically). Typically, one uses the physical or biological properties of the studied data/process to determine the appropriate integer value. On the top, the Step, Run and Stop buttons have the natural functions associated with driving the iterative EM algorithm. The Segment button allows labeling/classification of the points once the EM algorithm has converged or stopped.
• Segmentation Results: Once the EM algorithm converges to a visually satisfactory result you should stop the iterative process (Stop button) and click on the Segment button. You will obtain a color classification of all points in 2D based on which of the kernels is most likely to contain the point in its neighborhood. In addition to this visual classification, the Data tab-panel will contain a couple of result columns that contain the complete analytical description of the kernels (as 2D Gaussian), the mixture-model weight coefficients, the log-likelihood function (quantifying the match between the mixture model and the data), and color-coded membership of all data points to one of the kernels.

### Exercise 2: 1D and 3D Examples of SOCR EM Mixture Modeling

You may also see the action of the same SOCR EM Mixture modeling algorithm for analyzing 1D or 3D data.

• For 1D data, you can see the EM mixture model fitting used by the SOCR Modeler to fit a polynomial, spectral or distribution model for randomly sampled or observed data. To see this, go to SOCR Modeler and select MixedFit_Modeler from the drop-down list of models on the top-left. The figure below shows the result of fitting a 3-kernel Mixture of Normal (Gaussian) distributions to the histogram of a random sample of 100 Cauchy random variables.
• A demonstration of a 3D data analysis using the SOCR EM Mixture model is included in the LONI Viz Manual. This shows how 3D brain imaging data may be segmented into three tissue types (White Matter, Gray Matter and Cerebra-spinal Fluid). This is achieved by LONI Viz (Dinov et al., 2006) sending the segmentation tasks to SOCR and SOCR returning back the 3D segmented volumes, which are superimposed dynamically on top of the initial anatomical brain imaging data in real time. The figure below illustrates this functionality. Other external computational tools could also invoke SOCR statistical computing resources directly by using the SOCR JAR binaries and the SOCR Documentation.

## Questions

• How stable is the EM algorithm in finding a solution? Does this solution appear to be unique?
• What affects the convergence properties of the EM algorithm (data characteristics, number and properties of the starting initial kernel(s), etc.)?

## Step-by-Step Data Analysis Protocol

The following 10 steps show the protocol for entering and analyzing new data using the SOCR EM Mixture model Chart tool.

• Go to: http://www.socr.ucla.edu/htmls/SOCR_Charts.html --> Line Charts --> SOCR EM Mixture Chart.
• ClearPnts (remove default data).
• Data-Tab: Click the top-left cell & paste in your data (in two column tabular format) using the paste button.
• Mapping Tab: Map the two columns (C1 and C2) accordingly to X & Y axes.
• Click UpdateChart button.
• Select Number of Kernels (drop-down list of integers on the top-right).
• InitKernels (choose an appropriate starting configuration – could be arbitrary)
• Use the Step or Run buttons to go one iteration or continuously in the EM Mixture algorithm.
• Use Stop button to stop (for continuous running).
• Click the Segment button to find point association to kernels.
• Finally, you can right-click on the SOCR Graph to adjust the display or save/print the result.

Your quantitative results will be stored as new columns in the data table (Data Tab).

### Manual Kernel Initialization

Note that you can manually initialize the Gaussian kernels as follows: To add (up to 9) kernels manually, hold down the ALT key, click and drag the mouse to select a rectangle for each of the kernels. While holding the ALT key, you can repeat this to add more manually chosen kernels (up to 9). There will be 2 new buttons that appear in this manual kernel-selection mode. The removeLastKernel button will remove the last manually added kernel, and the autoInitKernel button will switch back to the auto kernel-selection mode, where kernels are randomly chosen. The figure below illustrates the manual kernel selection process

### Data Import Format

Remember that in the 2D mixture modeling applet you only need 2D data (2 columns). If you have more remove the unnecessary columns first (e.g., using Excel), or load all columns but later specify which 2 columns shoud be considered as X and Y. When File-importing data, note that your data must be in CSV format. If you have a header Column, it must start with a hashtag (#), like this:

#X	Y
11	73
20	88
19	73
15	65
21	57
26	101
24	117
35	106
37	96


The easiest way is to copy-paste only the values (no headers) into the applet by clicking on the first cell (1,1) and clicking the paste button (do not paste in data using short keys), and then follow the step-by-step protocol to generate the plot. Remember after you paste in (or File-load in) the data, you need to go to the Mapping Tab to select your 2 columns (X,Y), click “Update Chart”, etc., according to the step-by-step protocol.

Try this protocol with the SOCR Knee Pain dataset.