Report Properties
Owner: websterwest
Created: Aug 4, 2012
Simulation in StatCrunch just got easier

StatCrunch has always had very good simulation capabilities. Most of these capabilities are contained in the Sample columns and Simulate data options under the Data menu. With these options, one can collect multiple samples from columns in the data table or from specified distributions and store these samples in the data table. Both menu options offer a stacking option so that the samples appear stacked one on top of another in the data table, with a separate column containing the sample id number. The beauty of this approach to simulation is that summary statistics can be easily computed for each sample by grouping by the sample id with the Stat > Summary Stats > Columns menu option. The summary statistics can also be saved to the data table for further analysis.

There are, however, two issues with this approach. First, the summary statistics procedure is limited to working with numeric columns. This means that simulations involving categorical values require recoding the categories as numerical values, which can be a big hassle. Second, this approach requires that all of the sample values be stored in the StatCrunch data table, which for a large number of samples or a large sample size can cause StatCrunch to use an extreme amount of memory. In some cases, the size of the simulation has to be limited so that StatCrunch does not exceed its allocated memory.

Both the Sample columns and Simulate data menu options are now equipped with a new option that is designed to overcome these two drawbacks.   This new option can now be found as the third option for storing simulation results (after the split and stacked options).  When a user chooses this selection, they can then specify a statistic to be computed for each sample and only the resulting values of this statistic will be stored in the data table.  Not only does this option allow one to skip the summary stats step in the simulation process but it also makes using larger sample sizes and a larger number of samples in your simulation study much easier to do with StatCrunch.  The use of this new option for both procedures is shown below in the context of two simple examples. 

Suppose a nursing school has 26 great applicants (10 of whom are female and 16 of whom are male). The nursing school decides to choose 6 applicants to admit by randomly selecting them from the group of 26. After completing this process, 4 of those chosen were found to be female and 2 were male. The male applicants cry foul because they feel that too many females were accepted, and they think this should not happen if the selection were really random, since women were not in the majority in the applicant pool. In StatCrunch, we can easily simulate the random selection process and see just how often this sort of result happens in this situation. The first column in the data set shown below lists the 10 female and 16 male applicants.

Data set 1. Nursing School Applicants   [Info]

We take a sample from the pool of applicants using the Sample columns option under the Data menu as shown below.  Note that the Applicants column has been selected for sampling.  In this case, we are taking 10,000 separate samples each of size 6.  The new storage option for computing a statistic is turned on and the statistic entered is sum("Sample(Applicants)"="Female").  This StatCrunch expression will count the number of females in each sample that is selected.  This sort of calculation involving samples made up of text values is more difficult to work with using other StatCrunch simulation techniques as described above.   Note that the sample is referred to in this expression using "Sample(Applicants)" with the double quotes being required.  While this notation may seem a bit awkward initially, it mimics that of other simulation storage options in StatCrunch and properly describes the values that are being input into the expression.   

Result 1: Snapshot of Sample Columns Dialog   [Info]
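
For readers who want to check the logic outside StatCrunch, the same simulation can be sketched in Python. This is only an illustration, not StatCrunch's implementation: the function name and the seed are assumptions made here, and sampling is done without replacement to match the random selection of 6 applicants from the pool of 26. Only one count per sample is kept, which mirrors the idea of storing just the statistic.

```python
import random


def simulate_female_counts(num_samples=10_000, sample_size=6, seed=0):
    """Draw samples without replacement and keep one statistic per sample."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    applicants = ["Female"] * 10 + ["Male"] * 16  # the pool of 26 applicants
    counts = []
    for _ in range(num_samples):
        sample = rng.sample(applicants, sample_size)
        # Rough analogue of the expression sum("Sample(Applicants)"="Female")
        counts.append(sum(value == "Female" for value in sample))
    return counts


counts = simulate_female_counts()
proportion_4_or_more = sum(c >= 4 for c in counts) / len(counts)
print(f"Proportion of samples with 4 or more females: {proportion_4_or_more:.3f}")
```

The printed proportion varies with the seed, but it should land close to the value reported from the StatCrunch run.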

After clicking the Sample Column(s) button, the 10,000 values for the number of females in each random sample of size 6 are stored in the data table. This column has been renamed (by clicking and editing the column header) to Females. A simple bar plot, shown below, can now be constructed using the Graphics > Bar plot > with data menu option to address the question of interest. In this case, the highlighted bars show that the number of females is 4 (the observed value) or larger in a proportion 0.123 (12.3%) of the samples. While the observed result is not that likely, it does happen more than once in every ten random samples. This means that the male applicants do not have a very strong case in this situation. The exact probability of 4 or more females can be found using the Hypergeometric calculator under the Stat > Calculators menu option with n=26, m=10 and k=6.

Result 2: Number of females in each random sample of size 6   [Info]
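
The exact hypergeometric probability mentioned above can also be verified by hand. The sketch below counts the favorable selections directly with binomial coefficients; the function and parameter names are illustrative choices and are not the labels used by the StatCrunch calculator.

```python
from math import comb


def prob_at_least(k_min, population=26, successes=10, draws=6):
    """P(X >= k_min) when `draws` items are taken without replacement."""
    total = comb(population, draws)  # all ways to choose 6 of the 26 applicants
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(k_min, upper + 1)
    )
    return favorable / total


print(f"{prob_at_least(4):.4f}")  # prints 0.1279
```

The exact value, 29442/230230 or about 0.1279, is close to the simulated proportion of 0.123.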

The central limit theorem is a concept that is often taught using simulation techniques.  For example, consider comparing the distribution of the mean of samples from an exponential distribution across samples of size 5, 10 and 100.  The Data > Simulate Data > Exponential menu option in StatCrunch allows one to easily carry out such a simulation as shown below.  For this procedure, columns of data are thought of as samples and the number of rows in each column is the corresponding sample size.    In this case, we will simulate 10,000 samples each of size 5, so the number of columns is set to 10,000 and the number of rows is set to 5.  The mean of the exponential used below is set to 1.  The new option for computing a statistic for each sample is turned on, and the expression entered is mean(Exponential).   This expression uses the name of the distribution being simulated as a reference to the sample data.  When simulating data from other distributions, this reference would change accordingly. 

Result 3: Snapshot of Exponential Samples Dialog   [Info]
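
A rough Python analogue of this setup is sketched below, assuming an exponential distribution with mean 1 as in the dialog. The function name and seed are assumptions for illustration. Each sample is generated, reduced to its mean, and discarded, so only the 10,000 means are held in memory.

```python
import random
import statistics


def simulate_sample_means(num_samples, sample_size, mean=1.0, seed=0):
    """Return one sample mean per simulated exponential sample."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    rate = 1.0 / mean  # random.expovariate takes the rate, which is 1 / mean
    return [
        # Rough analogue of the expression mean(Exponential)
        statistics.fmean(rng.expovariate(rate) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


means_n5 = simulate_sample_means(num_samples=10_000, sample_size=5)
print(len(means_n5), round(statistics.fmean(means_n5), 2))
```

Calling the function again with sample_size set to 10 or 100 produces the other columns of means.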

Clicking the Simulate button will add a new column containing the 10,000 sample means to the data table.  Simulating means based on samples of sizes 10 and 100 can be easily accomplished by choosing the Option > Edit menu option on the notification window and altering the inputs shown above for the other sample sizes.   One can then easily generate a data set similar to the one below where the column names have been modified to indicate the sample sizes used in each case. 

Data set 2. Means of exponential samples with different sample   [Info]

The Graphics > Histogram menu option can then be used to easily stack histograms of the means for each sample size as shown below.   The histograms show that the distribution of the sample mean becomes more symmetric as the sample size increases and the variability of the sample mean decreases as sample size increases.  One might also choose to overlay the best fitting normal distribution for each histogram to better indicate the convergence to a normal distribution as sample size increases.

Result 4: Histogram of means of exponential samples with different sample sizes   [Info]
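
The two patterns visible in the histograms can also be checked numerically. For an exponential distribution with mean 1, the mean of n observations has standard deviation 1/sqrt(n) and skewness 2/sqrt(n), so both should shrink as n grows. The sketch below, with helper names and a seed chosen here for illustration, compares simulated values against that theory for n = 5, 10 and 100.

```python
import math
import random
import statistics


def sample_means(num_samples, sample_size, rng):
    """Means of simulated exponential(mean 1) samples."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


def skewness(values):
    """Moment-based skewness: average cubed standardized deviation."""
    center = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return statistics.fmean(((v - center) / spread) ** 3 for v in values)


rng = random.Random(1)  # fixed seed (an assumption) for repeatability
summary = {}
for n in (5, 10, 100):
    means = sample_means(10_000, n, rng)
    summary[n] = (statistics.stdev(means), skewness(means))
    print(
        f"n={n:>3}: sd={summary[n][0]:.3f} (theory {1 / math.sqrt(n):.3f}), "
        f"skewness={summary[n][1]:.3f} (theory {2 / math.sqrt(n):.3f})"
    )
```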

As the two examples above illustrate, the new option allowing for the computation of a statistic for each sample greatly enhances the simulation capabilities of StatCrunch. Please feel free to share your thoughts or recommendations on this new feature in a comment below.

HTML link:
<A href="http://www.statcrunch.com/5.0/viewreport.php?reportid=20880">Simulation in StatCrunch just got easier</A>

Comments
By websterwest
Oct 25, 2011

Yes Ken is correct. I can change the interface so that if there is a space in the distribution name then it displays it in double quotes.
By butler@utsc.utoronto.ca
Oct 25, 2011

In reply to "msullivan01": I guessed, and I was right (sometimes it helps to think like a programmer). Discrete Uniform is two words referring to one thing (the simulated data you never see), so it needs to have quotes around it, as in mean("Discrete Uniform").

I think you got done in by the interface, since there was nothing there to say that "Discrete Uniform" needs the quotes. I'm wondering if the name can be collapsed into one word like DUniform, and then you would be able to say mean(DUniform) and quotes wouldn't be necessary to make the meaning clear (from a programmer's point of view).

I got another thought: what if you want to calculate *two* statistics, not just one? I'm thinking of a simulation to show that in a one-sample CI for the mean, using the sample SD and the z-value (1.96, say) gives only 89% or whatever of the intervals covering the population mean, so you have to do something to make the intervals wider (and there's your motivation for the t-distribution). I know I can do it in two stages the "old" way, but wouldn't it be so nice to have a column of sample means and next to it a column of sample SDs (or even lower confidence limit and upper confidence limit)?
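
The coverage simulation described in this comment can be sketched in Python, where computing two statistics per sample is straightforward. This is an illustration only: it assumes a normal population and samples of size 5 (neither is specified in the comment), and the function name and seed are made up for the example.

```python
import math
import random
import statistics


def z_interval_coverage(num_samples=10_000, sample_size=5, mu=0.0,
                        sigma=1.0, z=1.96, seed=0):
    """Share of intervals mean +/- z * sd / sqrt(n) that contain mu."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    covered = 0
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mean = statistics.fmean(sample)  # first statistic per sample
        sd = statistics.stdev(sample)    # second statistic per sample
        half_width = z * sd / math.sqrt(sample_size)
        if mean - half_width <= mu <= mean + half_width:
            covered += 1
    return covered / num_samples


coverage = z_interval_coverage()
print(f"Coverage of the z-based interval with n = 5: {coverage:.3f}")
```

Under these assumptions the coverage comes out near 88% rather than 95%, which is the shortfall the comment refers to.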
By msullivan01
Oct 25, 2011

Thank you for sharing this feature. I'd love to use it with the Central Limit Theorem lab in which students simulate rolling groups of dice, computing the mean of each dice roll, and then creating histograms of the means for different sample sizes. However, I don't seem to be able to create a column of means for Discrete Uniform data as you did for the Exponential data. Here's a screenshot: http://screencast.com/t/z3tv4chxu5I I keep getting the following error: Error computing expression, Invalid statement (DiscreteUniform) Invalid Function call mean

Any ideas?
By butler@utsc.utoronto.ca
Oct 21, 2011

hi Sue,
Actually it is there, but takes a bit more finding. When you select Data, Simulate Data and click on the ?, you get a lot of links to Wikipedia about the distributions, then if you scroll down to Computing a Statistic, there's the same link to Computing an Expression that I described before. Maybe that's new since you last looked.

I'm a "mere user", not a programmer, but I agree that it would be nice (until the construction tool is attached to the Compute Statistic box) if the Compute Expression help were easier to find.

Otherwise, the syntax for what goes into the Compute Expression box is actually easier than for sampling from a population. If you're sampling from a normal distribution, std(Normal) gets the SD for each sample, or if you are sampling from a gamma distribution, std(Gamma) does that, and so on. When typing something into this box, you kind of have to know what the simulated data is going to be called, which is easier for simulating from a distribution than sampling from a population.
By sgrapevine
Oct 21, 2011

Thanks for the help. I did find the help for building expressions by going through the ? button in the Sample Columns dialog box. When I tried to find the same information through the ? button in the Simulate Data dialog box, I could not find it. Since it is now possible to build expressions to compute statistics in the Simulate Data function, it might be nice to be able to get to that help. Just a suggestion....
By butler@utsc.utoronto.ca
Oct 20, 2011

Some more odds and ends:

1. When I was writing my report, I realized that I had to do all my analysis first, save the data set with the results in it, and *then* start on my report. Starting the report sooner would mean it only had access to my results that were saved as of when I started the report. (I suspect that's an inevitability of the programming.)

2. I found the "rows" and "columns" of the Simulate dialog box confusing. I wanted 1000 samples of size 5, but I didn't know which should be rows and which columns. The way I thought about it was "if you were storing the data split across columns, which way would it go?" Then I'd want 1000 columns each of 5 rows, so that each sample is 1 column. Saving the statistic, though, actually gives you 1 column with 1000 *rows*. So I think, instead of "rows" and "columns", "number of samples" and "size of each sample" would be just the thing.

3. The function names (like "mean" and "std" and the others) look like the same names as R. Is that deliberate, or just the way things worked out?
By butler@utsc.utoronto.ca
Oct 20, 2011

Sorry, Webster. I think it was the same gremlin that had me post my message twice. My report is now shared.
By websterwest
Oct 20, 2011

Ken, I don't think you are sharing your report.
By websterwest
Oct 20, 2011

Sue, Ken has given you a great response on this one. Indeed there are a number of functions you can use to compute statistics of interest.

By butler@utsc.utoronto.ca
Oct 20, 2011

I have another comment on this, which I *will* do as a report. I think you'll find the report I did here:

http://www.statcrunch.com/5.0/viewreport.php?reportid=22043
By butler@utsc.utoronto.ca
Oct 20, 2011

sgrapevine: you can find the syntax you need by clicking on the "?" help button in the Sample Columns dialog box, then look down at item #7. In there, there is a link to "StatCrunch expression". Click that, and scroll all the way down to Column Functions. The one you want is "std".

For example, suppose you have a column x that contains the population you want to sample from, and you want to look at the sampling distribution of the sample SD for samples of size 5 from that population. Select Data and Sample Columns, select x as the column to sample from, fill in the sampling details, select Compute Statistic, and enter std("Sample(x)") as the statistic to compute. This reads as "the standard deviation of the samples from x". Make sure you get the brackets and quotes in the right place!

When I did this, I got a distribution of sample SDs that was a little skewed to the right. (Of course, it's the sample *variance* that will look more right-skewed because it has a (scaled) chi-squared distribution when your population is normal.)

I definitely should have written this as a report.
By sgrapevine
Oct 19, 2011

And what is the syntax for finding other statistics for the samples? What if I wanted to find the standard deviation for each sample created with a simulation?
By websterwest
Sep 19, 2011

Sure, how about posting them to the discussion board on the StatCrunch facebook page?


http://www.facebook.com/pages/StatCrunch/273194530391

By butler@utsc.utoronto.ca
Sep 19, 2011

Thanks for your comments, "websterwest". I can certainly demonstrate the two-step process first, and then do it the quicker way afterwards. I discovered the "stacked histogram" thing myself later, the way you described.

I have an entirely unrelated suggestion about stem and leaf plots. Where would be a good place to direct that?
By websterwest
Sep 19, 2011

Eventually all StatCrunch expression boxes will have a construction tool attached to them. This is in the works.


I also agree about the two stage process potentially being better pedagogically. However, the previous method was too limiting as described above. I think doing an initial small scale simulation study in two steps might allow one to use the new methodology in subsequent studies for faster results.


After selecting your columns for a histogram just click next a couple of times and turn on the Use same X axis option.

By butler@utsc.utoronto.ca
Sep 18, 2011

In the grand scheme of things, I like this, though I have two thoughts:

- the syntax (especially for counting females) is more awkward than I would like to get students using. Is there a way of having a "construction" dialog as for Compute Expression?
- doing a simulation in one stage, while convenient, obscures the process a little: for myself, I like to be clear that I am taking "many" simulated samples of the right size, and then, when I have them, computing some statistic for each sample (as you describe at the top of your report).

I love the "stacked" histograms. This is such a clear illustration of (a) the spread becoming smaller and (b) the shape becoming more normal. How did you get the x-axis scales to stay consistent?
