RDS Analyst Manual

From HPMRG

Revision as of 03:36, 29 May 2010 by Handcock (Talk | contribs)


The interface of RDS Analyst is similar to that of SPSS. RDS Analyst is a free, easy-to-use alternative to proprietary data analysis software such as SPSS, STATA, SAS/JMP, and Minitab. It has a menu system for common data manipulation and analysis tasks, and an Excel-like spreadsheet in which to view and edit data.

RDS Analyst is meant for users who want to use state-of-the-art techniques for estimation and quantification of uncertainty from data collected via respondent-driven sampling (RDS). It is advanced, comprehensive, open-source software to visualize, model and conduct sensitivity analyses for RDS data.

RDS Analyst is an intuitive, cross-platform graphical system for the analysis of RDS data. It uses menus and dialogs to guide the user efficiently through the data manipulation and analysis process, and has an Excel-like spreadsheet for easy data frame visualization and editing. It is also a front-end to the R command-line interface, and through it to the extensive capabilities of the R statistical language.

Current State

This is an alpha version under continuous development. It should only be distributed within the group (Lisa, Krista, Whipple, Cori and Mark), because this version has advanced code in it and will have bad bugs. We want to support the program and do not want immature versions floating around that give us a bad name and that we will not be able to stamp out.
A release version will be made available sometime before the Workshop.
The release version number will be 0.5. Numbers below 0.5 are for internal use only.
The blurb above is a draft of what version 0.5 will be, not a description of the current state.

Basic facts

RDS is written for the R statistical environment. The current development version is for Windows. Macintosh and Linux versions will be available as installers when it is released publicly.

The purpose of the initial alpha < 0.5 stage is to:

Find basic installation and running problems.
See if the GUI design will work for your RDS users. How can we improve on it?
See if the current (basic) features work as you expect.
Suggest new features necessary for the workshop.
Suggest features that we can add in versions after the workshop.

Notes for experienced users:

You should install R with this package (even if you already have R installed separately). This creates a private version of R for RDS to use and ensures RDS has the right version of R available for its use. The two versions will peacefully coexist and you can use the other version of R just as you were originally.

Installation

The installer is at:

http://hpmrg.org/software/RDSASetup.0.03.exe

Download the installer and double-click on it to install the software.

This can install all the programs and utilities needed. If you already have some elements installed you can deselect (or cancel) them during the installation. It is recommended that you install everything the first time. The installer is over 90 MB in size and will take time to download.

Subsequently, use this updater to keep your installation up to date with the latest versions of the packages:

http://hpmrg.org/software/RDSAUpdater.0.03.exe

This installs just the core packages (that is, anything that has changed since the full installer was made). It will typically be a few MB in size. The very first time you install, please run both Setup and Updater (in that order). The release version will not require this; for now it just saves download time.

A reboot is not required. You do not need to uninstall any components to update (this includes R, graphviz, and Java). However, neither the RDS application nor the R application may be running when you update.

There is now an alpha version for the Macintosh. To install:

Download and install R-2.11.0 from here:

http://hpmrg.org/software/R-2.11.0.pkg

Download the RDS Analyst Installer (as a zip file):

http://hpmrg.org/software/RDSAnalystInstaller.mpkg.zip

Double-click on it to uncompress it and then double-click on the installer (i.e., "RDSAnalystInstaller") to install the software.

Here is an installation video (coming).

Starting RDS

To start RDS, select the RDS menu under Programs, and select the RDS program there. Alternatively, if you installed the desktop icon or the Quick Launch icon (the default), you can double-click on one of these to start the program. This starts the graphical user interface. Startup may take a minute, but the program is fast once loaded.

Once the application starts up, focus on the (top) Data Viewer window. This is where most of the analysis takes place. The other window, the Console, records a log of all the commands and output. It can be ignored for now.

Quick start demo

Reading the NY Jazz dataset from RDSAT

Select the Open Data menu item from the File menu.
Use the dialog boxes to select C:\Program Files\R\R-2.11.0\library\RDSdevelopment\extdata\nyjazz.rdsat

Viewing and editing the data in the spreadsheet (the Data menu)

Go to the Data Viewer window and select the nyjazz.data.frame data set from the Data Set menu (center of the window pane).
Click on the Variable View tab. Click the value for Gender(MF) under the Type column and select Factor value. Repeat for Race(WBO), Airplay(yn), and Union(yn). This makes sure that the program recognizes them as categorical variables.

Running an RDS analysis (the RDS menu)

Select Point Estimates from the RDS menu.
Select Gender(MF) as the Outcome Variables using the arrow buttons.
Select network.size as the Network Variable using the arrow buttons.
Click the Run button.
Look in the Console window for the results.

Exploratory analysis (the Sample menu)

Select Contingency Tables from the Sample menu.
Select Gender(MF) as the Row and Race(WBO) as the Column.
Click the Run button.
Look in the Console window for the results.

Saving the results

To save the results in a file, choose Save from the File menu in Console, and make sure Results is selected under Options.
To save the commands used to create the output in a file, choose Save from the File menu in Console, and make sure Commands is selected under Options.
To save the complete output (results interspersed with the commands that produced them) in a file, choose Save from the File menu in Console, and make sure Complete output is selected under Options.

Getting started (seriously)

Once the application starts up, focus on the (top) Data Viewer window. This is where you read in the data. The other window, the Console, is where most of the analysis takes place. It also records a log of all the commands and output. It can be ignored for now.

Loading RDS data

RDS can read a wide range of data formats from other packages, including SPSS (*.sav), SAS export (*.xpt), and Excel (via comma-separated *.csv files). For a general description of this, see Open Data.

It also directly reads in RDSAT files (they need to be renamed with a *.rdsat extension to be recognized).

If it is not an RDSAT file, RDS expects the data to be in a spreadsheet format containing the RDS survey data with recruitment information. This should represent valid RDS survey data. The sheet must have one row for each respondent (i.e., case), and one column for each survey response variable. In addition, the recruitment information can be specified in two ways:

Coupon format: Basically like RDSAT but without the two header lines.
Recruiter ID format: Here it expects columns with the following names:
id: A column of integers giving unique ids for each respondent (i.e., row of the spreadsheet).
recruiter.id: A column of ids giving, for each respondent, the id of the person who recruited them. Recruiters are identified by their value in the id column; seeds, who have no recruiter, are coded as 0.

If you read in an RDSAT file, the RDS data set is created automatically. If you use a CSV file (e.g., from Excel) you will need to use the Create RDS Data Set menu to set the maximum number of coupons and other data set characteristics before the RDS data set is ready for use. The package creates a data frame with additional information in it, such as wave: a column of integers giving the recruitment wave for each row. A value of 0 means the person was a seed.
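The relationship between the id and recruiter.id columns and the wave column can be illustrated with a short sketch. It is written in Python purely for illustration and is not the package's R code; the function name compute_waves and the seed_code parameter are hypothetical, and it assumes seeds are coded as 0 in recruiter.id, as described above.

```python
def compute_waves(ids, recruiter_ids, seed_code=0):
    """Return the recruitment wave of each respondent (0 for seeds).

    ids           -- unique respondent ids, one per row
    recruiter_ids -- id of each respondent's recruiter, seed_code for seeds
    """
    recruiter_of = dict(zip(ids, recruiter_ids))
    if len(recruiter_of) != len(ids):
        raise ValueError("respondent ids must be unique")

    waves = {}
    for rid in ids:
        # Walk up the recruitment chain until we reach a seed or a
        # respondent whose wave is already known.
        chain = []
        current = rid
        while current not in waves and recruiter_of[current] != seed_code:
            chain.append(current)
            current = recruiter_of[current]
            if current not in recruiter_of:
                raise ValueError(f"recruiter {current} is not a respondent id")
            if current in chain:
                raise ValueError("recruitment chain contains a cycle")
        if current not in waves:
            waves[current] = 0  # current is a seed
        base = waves[current]
        # Assign waves back down the chain, nearest to the seed first.
        for step, node in enumerate(reversed(chain), start=1):
            waves[node] = base + step
    return [waves[rid] for rid in ids]


# Respondents 1 and 5 are seeds; 2 and 3 were recruited by 1; 4 by 2.
print(compute_waves([1, 2, 3, 4, 5], [0, 1, 1, 2, 0]))  # [0, 1, 1, 2, 0]
```

The same walk also serves as a basic validity check: a recruiter id that does not appear in the id column, or a chain that loops back on itself, raises an error.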

Saving RDS data

RDS can save the data sets it creates in a wide range of data formats. We recommend saving them in the internal data format for R (*.rda) so they can be read back in easily later.

For a description of this see Save Data.

The "Data Viewer" window

The RDS Analyst Data Viewer window has menus (at the top) for the basic capabilities of the package and a data viewer (below the menus) for looking at the data.

The data viewer provides an easy-to-use, spreadsheet-like environment to view and edit RDS data (or in fact, any spreadsheet data you load). Copying and pasting is supported, and is compatible with Excel 2003/2007, so data can be moved from Excel to R by simply copying it into the data viewer. Contextual menus are used to insert, delete and copy rows and columns.

If there are any data frames loaded in the R session, they can be viewed by selecting them from the Data Set list. Data can be loaded into the R session by clicking the Open Data button in the top left-hand corner. The currently viewed data set can be saved using the Save Data button directly to the right of the Open Data button. The currently viewed data set can be removed from the R session by clicking the button in the upper right.

The data viewer has two modes, Data view and Variable view, which you can switch between freely using the tabs. The Variable view enables you to edit the variable types of the data read in. Categorical variables (including binary variables) should be set to type "Factor" by clicking on their entry and selecting it from the menu.

For details see data viewer.

Important: When a menu item is chosen it opens a dialog box where the variables are selected, options are set and the computation is done by clicking the Run button. This creates output in the Console window (you will need to click over to it to see it). These results and output can be saved at any point (typically at the end of the session) by using the Save command (see below).

The menu structure of the Data Viewer window

The top menus:

File:
Edit:
Workspace:
Data:
Sample:
Diagnostics:
Population:
Packages & Data:
Window:
Help:

File Menu

Create a new data set, open a data set and save the current data set in a file.
A text editor: View, create and edit a text file. Use Open Document or New Document items to do this.
Run a text file of R commands directly in R using the Source ... item.
Print a file
Quit

Edit Menu

Copy, Cut and Paste text
Preferences: To change the font, set the default "working" directory where files are looked for and saved, etc. There is a separate panel for RDS-specific features. This enables you to change the defaults (like the location of the graphviz binary if it is installed in a non-standard place).

Workspace Menu

Objects that you create during an RDS session are held in computer memory. The collection of objects that you currently have is called the workspace. This workspace is not saved on disk unless you tell RDS to do so. This means that your objects are lost if you close R without saving them or, worse, if R or your system crashes during a session.

When you exit RDS, you will be asked if you want to save your workspace. This will allow you to have the same data sets available the next time you start RDS. This will help to resume work on a project at the same point.

You can open (previously saved) and save workspaces from this menu. So if you have multiple projects you can save the entire workspace for each project in a separate file. Then open them from this menu.

The Clear all item empties the workspace (that is, removes all objects). So you can Clear all items and then open a complete workspace you have saved before.

Note that an opened workspace is added to the current one. So if you only want the objects from the saved workspace, you should Clear all first.

Data Menu

This is to recode and modify the data in the data viewer. This means you do not have to go back to SPSS, SAS or Excel to recode, etc.

The term "factor" in R designates a categorical variable. In general factors are nominal but we use factors to represent both ordinal and nominal variables. By labeling a variable as a factor, RDS will treat it appropriately when analyzing it.

Click on the links below to get help on the following capabilities:

Edit Factor: Add or remove values of a categorical variable.
Recode Variables: You can recode variables into variables with new names. From the Recode Variables dialog, select the recode you want to re-target from the Variables to Recode list (e.g. "sex -> sex"). Then click on the Target button on the right. That will let you type in the name of the new variable (e.g., "sexMF"). The recode will then show something like "sex -> sexMF". The original "sex" variable will be unchanged and "sexMF" will appear in the Data Viewer window.
Transform: Transform a variable. Transformations can be very complex, if needed.
Reset Row Names
Sort
Merge Data
Transpose
Subset: Use this to create a version of the data set with the cases of your choice excluded. In the "Subset Expression" box enter an expression for those you want to retain (e.g., HIV < 2) and click "OK". This will create a data set whose name has the suffix ".sub". Now select this in the Data Viewer and run any procedure (e.g. Point Estimates) to see results only for the retained cases.
RDS Data Attributes: Specify here the characteristics of the RDS data like the maximum number of coupons, the network size variable, the missing data symbol, and population size estimates.
Create RDS Data Set: If you read in data from a CSV or other spreadsheet file (i.e., not an RDSAT data file) you will need to specify the maximum number of coupons here (at least). When you do this the RDS data set is formed from the spreadsheet. If the spreadsheet is called foo (say), the RDS data set is called foo.data.frame. You should base all analysis on foo.data.frame.

Sample Menu

This is for exploratory analysis of the RDS data.

It deals with continuous data, categorical data and descriptive data.

Frequencies: Tables of one or more variables, possibly stratified by others (like SPSS)
Descriptives: This uses RDS weights to compute population estimates (rather than sample averages).
Contingency Tables: Cross tabs. They include tests (which are dubious because of the dependence).
One Sample Test:
Two Sample Test:
K-Sample Test:
Correlation:
Linear Model:
Logistic Model:
Generalized Linear Model:

Frequencies, Descriptives and Contingency Tables use RDS weights to compute population estimates (rather than sample averages of the RDS data). The other entries act as if the data are an independent sample and are not (yet) RDS aware.

The basic descriptives, cross-tabs and frequencies use the RDS weights. These are stored in the data.frame and added whenever estimates are computed. The default is the Gile SS weights, or the VH weights if the population size is not specified.
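To make the difference between a weighted population estimate and a plain sample average concrete, here is a minimal sketch of inverse-degree weighting of the VH (Volz-Heckathorn, RDS-II) kind. It is Python written for illustration only, not the package's R implementation; the function names vh_weights and weighted_proportions are hypothetical. The Gile SS weights additionally require a population size and are not shown.

```python
def vh_weights(network_sizes):
    """Weights proportional to 1 / network size, normalized to sum to 1."""
    if any(d <= 0 for d in network_sizes):
        raise ValueError("network sizes must be positive")
    inverse = [1.0 / d for d in network_sizes]
    total = sum(inverse)
    return [w / total for w in inverse]


def weighted_proportions(values, weights):
    """Weighted share of each level of a categorical variable."""
    totals = {}
    for value, weight in zip(values, weights):
        totals[value] = totals.get(value, 0.0) + weight
    grand_total = sum(totals.values())
    return {level: t / grand_total for level, t in totals.items()}


# Four respondents: gender and self-reported network size.
gender = ["M", "F", "F", "M"]
network_size = [1, 2, 4, 4]

weights = vh_weights(network_size)
print(weights)                                # [0.5, 0.25, 0.125, 0.125]
print(weighted_proportions(gender, weights))  # {'M': 0.625, 'F': 0.375}
```

In this toy sample the unweighted proportion of "M" is 0.5, while the weighted estimate is 0.625, because the respondent with the smallest network size carries the largest weight.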

For the others, see the Deducer manual.

The Diagnostics menu

Plot Recruitment Tree: Produces a publication quality graphics plot of the recruitment tree.
Bar plot of the number of recruits by wave
Scatter plot of the respondent degree versus wave
Histogram of the number of recruits for each respondent
Bar Chart of the number of recruits from each seed
Make all Diagnostics: Does all of the above (at their current settings) and produces a publication quality PDF plot.
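The tallies behind two of these plots are simple counts over the recruitment data. The following Python sketch shows one way they could be computed; it is an illustration under stated assumptions, not the package's code. The function names are hypothetical, seeds are assumed to be coded as 0 in recruiter.id, and the wave of each respondent is assumed to be already available.

```python
from collections import Counter


def respondents_by_wave(waves):
    """Number of respondents in each wave, indexed by wave (0 = seeds)."""
    if not waves:
        return []
    counts = Counter(waves)
    return [counts.get(w, 0) for w in range(max(waves) + 1)]


def recruits_per_respondent(ids, recruiter_ids, seed_code=0):
    """Number of people each respondent recruited (0 if none)."""
    counts = Counter(r for r in recruiter_ids if r != seed_code)
    return {rid: counts.get(rid, 0) for rid in ids}


ids = [1, 2, 3, 4, 5]
recruiter_ids = [0, 1, 1, 2, 0]   # 1 and 5 are seeds
waves = [0, 1, 1, 2, 0]

print(respondents_by_wave(waves))                   # [2, 2, 1]
print(recruits_per_respondent(ids, recruiter_ids))  # {1: 2, 2: 1, 3: 0, 4: 0, 5: 0}
```

The first result is what a bar plot by wave would display; a histogram of the values in the second result gives the distribution of the number of recruits per respondent.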

The Population menu

This computes estimates of the population characteristics.

It deals with continuous data, categorical data and descriptive data and complements the Sample menu that describes the sample.

Frequencies, Descriptives and Contingency Tables use RDS weights to compute population estimates (rather than sample averages of the RDS data). The other entries act as if the data are an independent sample and are not (yet) RDS aware.

The basic descriptives, cross-tabs and frequencies use the current weights. These are stored in the data.frame and added whenever estimates are computed. The default is the Gile SS weights.

Packages & Data Menu

We do not use this very much.
This enables you to install additional packages for R and look at the packages currently loaded.
This also allows you to edit and view any "objects" in the workspace, such as RDS data sets, spreadsheets (data.frames), functions, etc.

Window Menu

We do not use this very much.
Lists the currently open windows to choose from.
You can go here to choose a window to bring to the front to work on.

Help Menu

To get help, choose R help from the Help menu. The first time it is used it will start a browser for help (which will take a minute to load).

Saving the results, the output and/or the batch commands that produced them!

When a dialog box is run (by clicking its Run button), it creates output in the Console window.

To save the results in a file (typically at the end of a session), choose Save from the File menu in Console, and make sure Results is selected under Options.

To save the commands used to create the output in a file (typically at the end of a session), choose Save from the File menu in Console, and make sure Commands is selected under Options.

To save the complete output (results interspersed with the commands that produced them) in a file (typically at the end of a session), choose Save from the File menu in Console, and make sure Complete output is selected under Options.

Example data

We have three data sets already stored within R and the example data file from RDSAT (nyjazz.rdsat) as examples.

To find out about the faux, fauxmadrona, and fauxsycamore data sets, just use help (see above) and search for them by name. There are manual pages on them :-)

The nyjazz.rdsat file is stored in "C:\Program Files\R\R-2.11.0\library\RDSdevelopment\extdata\nyjazz.rdsat". It can be opened from the "Open Data" dialog box (for example). It is the same as the RDSAT nyjazz.txt with the extension changed to .rdsat so the package will recognize it.

Tips and FAQ

By default, RDS stores all your files on your Desktop. To change this, just choose Set Working Directory from the File menu.

Tutorials

These are pages created by Ian Fellows that illustrate simple exploratory analyses.

Getting Help from within the package

To get help, choose R help from the Help menu. The first time it is used it will start a browser for help (which will take a minute to load).

Click on an item to get help about R and to get started with R. "An Introduction to R" is a particularly useful reference for beginners.
Click on the packages tab to get specific help on packages (like RDS).
Click on the RDSgui package to get help with the commands underlying the menus (except the RDS menu).
Click on the RDSdevelopment package to get help with the commands underlying the RDS menu.

   * This gives help on the data sets.
   * Click on e.g., RDS.I.estimates to get help on the RDS-I estimate function.

To get the online manual for the package, choose RDS help from the Help menu in the Data Viewer window.

Credits for RDS

RDS is based on the Deducer software (written by Ian Fellows). We also use JGR, the Java-based R GUI, which is closely integrated with Deducer. These guys deserve most of the credit for what we see.

Common installation problems and bugs

Graphviz does not install or run?

TO DO and discussion

A variance and/or confidence interval menu: We could add Salganik BS (bad as assumes Markov 1st order); V-H delta method is an approximation to it. What about MA? Whipple will add Salganik to RDSdevelopment and Mark will then add it to the GUI to be automatically computed with the point estimates.
Add screen to Open Data dialog that tries to determine if the opened CSV or TXT file is an RDS data set (i.e., has an id and recruiter.id variable). Tries to guess the variable names and has the opportunity to designate a default network size variable, a population size estimate and other meta-information for the data set (e.g., notes). Once processed these are saved in the data.frame. We need a button for "This is not an RDS data set" so they can read in regular data.frames.
Default weights: It would be good to store a primary/default weights vector in the data set and use that for descriptives, etc. What to choose as the weights?
Probably add a Properties tab to the Data Viewer window that has the meta-data in it and some data-set-specific choices. In particular, it will be opened by default at the data load to specify "id" and "recruiter.id" variables. These will include

   - a free-form text window for notes, data source, comments, etc.
   - the names of the "id" and "recruiter.id" variables.
   - a check box for the primary weights (equal, SS, RDS-II, etc)
   - name of the default network size variable
   - a check box for if full weights are to be used in the subsetted data sets
   - missing data codes?
   - ??

Currently the Subset procedure is unsatisfactory as a replacement for the Ratio Composition procedure in RDSAT. The issue is if the weights should be based on the full data or the subset. For the V-H this does not matter much. For us the exclusion will break the RDS structure. So I think it is better to use the full data to compute the weights and then subset the weights in the computations. Thoughts?
Documentation: Word or LaTeX based manual using wiki as a start (Cori?)
Video: I will make a web video of basic use and post it on this page.
Missing data: This is a major issue. See the nyjazz.rdsat example data set, for example. The mi package for "multiple imputation" looks promising. Statistically, multiple imputation is an unnecessary middle-person in a statistical analysis. However, practically and in terms of communicating to practitioners how to think about missing data it is a nice device.

The default is to use some form of "complete case" analysis. What this means depends on where the data is missing. If it is an outcome variable then the idea is to compute the weights for all respondents and then compute H-T acting as if they were not sampled (i.e. with weight 0). If it is a network size variable then the cases are just dropped. Etc. It is all ad hoc and unsatisfactory.

The advantage of multiple imputation is that it retains the referral structure of RDS. So even if the imputes are off it gets the structure of the dependency right. And, of course, multiple imputation helps add the uncertainty back (compared to single imputation or complete case analysis).

If the proportion missing is low I think it will work.

Rather than reconstructing the RDSAT diagnostics, we need a good answer to: How to "Assess Bias in the Sample"? In general, this means sensitivity analysis that suggests when the standard estimates (RDS-I, RDS-II, SS) are poor. One solution, obviously, is to use the MA framework to assess all the other estimators. Thoughts?
Work with X.data.frame with X an "rds.data.frame"
Add Whipple's regression stuff for June.
homophily measure: Use Gile measure based on the full network and work out how to estimate it from the RDS data.

   It is still unclear what the homophily measure is in RDSAT. Ask Cyprian?

Trimming of weights (Not a good idea). A good discussion topic.
If we could ever build a "dot_static.exe" (for Windows) we could dispense with the full graphviz install. This is a low priority, but would allow us not to install graphviz separately (most folks would not use it separately) and would also allow RDS to be installed anywhere. So we would not need Administrator privileges to install it. Indeed, it may be possible to install it on a USB drive or a CD and run it right off that. The working directory would need to be set to something like the Desktop of the user, but this is something we can consider later if write privileges become an issue.
Add sophisticated help. For the final version I need to install the Microsoft HTML Help Workshop, available from [1]. This is needed to build the .chm help files.
How about using "prettyR" to make the output nicer?
Lisa asked about UCINET output. DL files are hard, but netdraw and UCINET (and pajek, for that matter) read in pajek "*.paj" project files. I have written a simple R program to write *.paj files. A *.paj file is a *.net file plus a *.clu file. The latter has a clustering of the nodes (for us, by their seeds). netdraw says it can read in .net and .clu, but it may also read in .paj, which is the concatenated .net and .clu (we will see). The file is write.pajek.r.
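For readers unfamiliar with the Pajek formats mentioned in the last item, here is a rough sketch of what the *.net and *.clu contents look like for a recruitment tree. It is written in Python for illustration and is not write.pajek.r; the function names are hypothetical, seeds are assumed to be coded as 0 in recruiter.id, and the recruitment data are assumed to form valid trees (no cycles, every recruiter present).

```python
def seed_clusters(ids, recruiter_ids, seed_code=0):
    """Cluster number of each respondent: the index (from 1) of its seed."""
    recruiter_of = dict(zip(ids, recruiter_ids))
    seeds = [rid for rid in ids if recruiter_of[rid] == seed_code]
    cluster_no = {seed: k for k, seed in enumerate(seeds, start=1)}
    clusters = []
    for rid in ids:
        current = rid
        while recruiter_of[current] != seed_code:  # walk up to the seed
            current = recruiter_of[current]
        clusters.append(cluster_no[current])
    return clusters


def pajek_net(ids, recruiter_ids, seed_code=0):
    """Text of a Pajek .net file with one arc per recruitment."""
    index = {rid: i for i, rid in enumerate(ids, start=1)}
    lines = [f"*Vertices {len(ids)}"]
    lines += [f'{index[rid]} "{rid}"' for rid in ids]
    lines.append("*Arcs")
    for rid, recruiter in zip(ids, recruiter_ids):
        if recruiter != seed_code:
            lines.append(f"{index[recruiter]} {index[rid]}")
    return "\n".join(lines) + "\n"


def pajek_clu(clusters):
    """Text of a Pajek .clu file: one cluster number per vertex."""
    lines = [f"*Vertices {len(clusters)}"] + [str(c) for c in clusters]
    return "\n".join(lines) + "\n"


ids = [10, 20, 30, 40]
recruiter_ids = [0, 10, 10, 0]   # 10 and 40 are seeds
print(pajek_net(ids, recruiter_ids))
print(pajek_clu(seed_clusters(ids, recruiter_ids)))
```

Vertices are numbered from 1 in row order, and each arc runs from recruiter to recruit. Writing the two strings to files with the .net and .clu extensions gives the pair of files described above; how they are combined into a .paj project file is left out of this sketch.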

Lisa's Issues about current RDSAT

Cannot use and reuse syntax; cannot easily rerun data. Each time you want to run a ‘partition analysis’ you must point and click. Cannot view multiple analyses at one time (have to copy and paste each analysis into a Word doc).
Very difficult to format the data for a text file and then to get it into RDSAT. Text file/RDSAT will not take empty spaces, a space in a variable title.
Not able to do exploratory analysis such as cross tabs. We have to do this in SPSS. Also use SPSS to recode data which means if you find a variable you want to recode because of some RDSAT output, you have to open your SPSS file, recode the variable, download into excel, save into text tab delimited, and then download into RDSAT.
Need to be able to look at associations with regression.
Not sure what to do with missing data. If there are a few missing data (say, 5 in a sample of 400) then we just code missing data as such. If there are a lot of missing data (say, more than 10 in a sample of 400), we code missing to a value and run the partition analysis, and then conduct a ‘prevalence analysis’ using just the actual data values in the numerator and denominator to get the point estimate and CIs. A lot of missing data often occurs due to skip patterns in questionnaires; I am having a lot of countries redesign their questionnaires when doing RDS to avoid this problem.
Does not do well with continuous data and descriptive data (weighted means, modes, etc.)
Keep in mind that most people in public health use SPSS, SAS or STATA for recodes and cross tabs (exploratory analysis), then dump this into Excel (but need to cut file down to 256 variables for Excel) and then into a text file and THEN into RDSAT.
Most overseas countries do not use Macs.
Overall, RDSAT is very tedious and takes a great deal of time to use.

Workshop Proposal

Here is information on the workshop. The proposal (on the page below) has details of the package. Basically, we should model the package to meet the needs of the workshop and vice versa.

Workshop on the Analysis of Respondent-driven Sampling Data (UCSF, June 15-16, 2010)

References

Development

The svn has:

   * RDSgui: The R package for the GUI
   * org: The Java code needed for the GUI (used by RDSgui)
   * RDSdevelopment: The RDS package (development version)

The Deducer website is http://www.deducer.org/
A presentation on Deducer by Ian Fellows (Deducer uclaa 011210.pdf).
What is needed to compile Deducer?

   - Get Rtools for Windows. Remember to let the Rtools installer (optionally) edit your PATH variable as follows:

PATH=c:\Rtools\bin;c:\Rtools\perl\bin;c:\Rtools\MinGW\bin;c:\R\bin;<others>

   - Edit .cshrc to add the line "setenv NOAWT 1"
   - Get org and Deducer from svn. Use our svn, as the code there has been edited.