StatCrunch has always had very good simulation capabilities. Most of these capabilities are contained in the Sample columns and Simulate data options under the Data menu. With these options, one can collect multiple samples from columns in the data table or from specified distributions and store these samples in the data table. Both of these menu options offer the stacking option so that the samples will appear stacked one on top of another in the data table with a separate column containing the sample id number. The beauty of this approach to simulation is that summary statistics can be easily computed for each sample by grouping by the sample id with the Stat > Summary Stats > Columns menu option. The summary statistics can also be saved to the data table for further analysis. There are, however, two issues with this approach. First, the summary statistics procedure is limited to working with numeric columns. This means that simulations involving categorical values require recoding the categories as numerical values, which can be a big hassle. Second, this approach requires that all of the sample values be stored in the StatCrunch data table, which for a large number of samples or a large sample size can cause StatCrunch to use a very large amount of memory. In some cases, the size of the simulation has to be limited so that StatCrunch will not exceed its allocated amount of memory.
Both the Sample columns and Simulate data menu options are now equipped with a new option that is designed to overcome these two drawbacks. This new option can now be found as the third option for storing simulation results (after the split and stacked options). When a user chooses this selection, they can then specify a statistic to be computed for each sample and only the resulting values of this statistic will be stored in the data table. Not only does this option allow one to skip the summary stats step in the simulation process but it also makes using larger sample sizes and a larger number of samples in your simulation study much easier to do with StatCrunch. The use of this new option for both procedures is shown below in the context of two simple examples.
Suppose we have a situation where a nursing school has 26 great applicants (10 of whom are female and 16 of whom are male). The nursing school decides to choose 6 applicants to admit by randomly selecting them from the group of 26. After completing this process, 4 of those chosen were found to be female and 2 were male. The male applicants cry foul because they feel that too many females were accepted, and they think this should not happen if the selection was truly random, since women were not in the majority in the applicant pool. In StatCrunch, we can easily simulate the process of the random selection and see just how often this sort of result happens in this situation. The first column in the data set shown below lists the 10 female and 16 male applicants.
We take a sample from the pool of applicants using the Sample columns option under the Data menu as shown below. Note that the Applicants column has been selected for sampling. In this case, we are taking 10,000 separate samples each of size 6. The new storage option for computing a statistic is turned on and the statistic entered is sum("Sample(Applicants)"="Female"). This StatCrunch expression will count the number of females in each sample that is selected. This sort of calculation involving samples made up of text values is more difficult to work with using other StatCrunch simulation techniques as described above. Note that the sample is referred to in this expression using "Sample(Applicants)" with the double quotes being required. While this notation may seem a bit awkward initially, it mimics that of other simulation storage options in StatCrunch and properly describes the values that are being input into the expression.
After clicking the Sample Column(s) button, the 10,000 values for the number of females in each random sample of size 6 are stored in the data table. This column has been renamed (by clicking and editing the column header) to Females. A simple bar plot shown below can now be constructed using the Graphics > Bar plot > with data menu option to address the question of interest. In this case, the highlighted bars show the proportion of times that the number of females is 4 (the observed value) or larger to be 0.123 (12.3%). While the observed result is not that likely, it does happen more than once in every ten random samples. This means that the male applicants do not have a very strong case in this situation. The exact probability of being 4 or larger can be found using the Hypergeometric calculator under the Stat > Calculators menu option with n=26, m=10 and k=6.
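For readers who want to check the numbers outside StatCrunch, here is a minimal Python sketch of the same simulation together with the exact hypergeometric tail probability. It is not how StatCrunch works internally; the function names, the seed, and the assumption that sampling is done without replacement are choices made for this illustration.

```python
import random
from math import comb

def simulate_female_counts(num_samples=10_000, sample_size=6, seed=1):
    """Draw samples without replacement and count the females in each one."""
    rng = random.Random(seed)
    applicants = ["Female"] * 10 + ["Male"] * 16  # the 26 applicants
    return [rng.sample(applicants, sample_size).count("Female")
            for _ in range(num_samples)]

def exact_tail(population=26, females=10, sample_size=6, at_least=4):
    """Hypergeometric probability of at_least or more females in the sample."""
    total = comb(population, sample_size)
    favorable = sum(
        comb(females, k) * comb(population - females, sample_size - k)
        for k in range(at_least, min(females, sample_size) + 1)
    )
    return favorable / total

counts = simulate_female_counts()
simulated = sum(c >= 4 for c in counts) / len(counts)
exact = exact_tail()
print(f"simulated P(X >= 4) = {simulated:.3f}")
print(f"exact     P(X >= 4) = {exact:.4f}")
```

The exact tail probability works out to 29442/230230, or about 0.128, which is in line with the simulated proportion of 0.123 reported above.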
The central limit theorem is a concept that is often taught using simulation techniques. For example, consider comparing the distribution of the mean of samples from an exponential distribution across samples of size 5, 10 and 100. The Data > Simulate Data > Exponential menu option in StatCrunch allows one to easily carry out such a simulation as shown below. For this procedure, columns of data are thought of as samples and the number of rows in each column is the corresponding sample size. In this case, we will simulate 10,000 samples each of size 5, so the number of columns is set to 10,000 and the number of rows is set to 5. The mean of the exponential used below is set to 1. The new option for computing a statistic for each sample is turned on, and the expression entered is mean(Exponential). This expression uses the name of the distribution being simulated as a reference to the sample data. When simulating data from other distributions, this reference would change accordingly.
Clicking the Simulate button will add a new column containing the 10,000 sample means to the data table. Simulating means based on samples of sizes 10 and 100 can be easily accomplished by choosing the Option > Edit menu option on the notification window and altering the inputs shown above for the other sample sizes. One can then easily generate a data set similar to the one below where the column names have been modified to indicate the sample sizes used in each case.
The Graphics > Histogram menu option can then be used to easily stack histograms of the means for each sample size as shown below. The histograms show that the distribution of the sample mean becomes more symmetric as the sample size increases and the variability of the sample mean decreases as sample size increases. One might also choose to overlay the best fitting normal distribution for each histogram to better indicate the convergence to a normal distribution as sample size increases.
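The same experiment can be sketched in a few lines of Python for anyone who wants to see the numbers behind the histograms. This is only an illustration under stated assumptions: the function name and seeds are made up for the example, and only the sample means are kept, mirroring the new storage option.

```python
import random
import statistics

def simulate_sample_means(sample_size, num_samples=10_000, mean=1.0, seed=0):
    """Return the mean of each of num_samples exponential samples."""
    rng = random.Random(seed)
    rate = 1.0 / mean  # expovariate is parameterized by the rate, 1 / mean
    return [
        statistics.fmean([rng.expovariate(rate) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

# One column of 10,000 sample means for each sample size.
results = {n: simulate_sample_means(n, seed=n) for n in (5, 10, 100)}

for n, means in results.items():
    print(f"n={n:>3}: mean of means = {statistics.fmean(means):.3f}, "
          f"SD of means = {statistics.stdev(means):.3f}")
```

For an exponential distribution with mean 1, the standard deviation of the sample mean is 1/sqrt(n), so the printed SDs should shrink toward roughly 0.45, 0.32 and 0.10 as the sample size grows.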
As these two examples above illustrate, the new option allowing for the computation of a statistic for each sample greatly enhances the simulation capabilities of StatCrunch. Please feel free to express your thoughts or recommendations on this new feature with a comment below.
Yes, Ken is correct. I can change the interface so that if there is a space in the distribution name then it displays it in double quotes.
In reply to "msullivan01": I guessed, and I was right (sometimes it helps to think like a programmer). Discrete Uniform is two words referring to one thing (the simulated data you never see), so it needs to have quotes around, as in mean("Discrete Uniform").
I think you got done in by the interface, since there was nothing there to say that "Discrete Uniform" needs the quotes. I'm wondering if the name can be collapsed into one word like DUniform, and then you would be able to say mean(DUniform) and quotes wouldn't be necessary to make the meaning clear (from a programmer's point of view).
I got another thought: what if you want to calculate *two* statistics, not just one? I'm thinking of a simulation to show that in a one-sample CI for the mean, using the sample SD and the z-value (1.96, say) gives only 89% or whatever of the intervals covering the population mean, so you have to do something to make the intervals wider (and there's your motivation for the t-distribution). I know I can do it in two stages the "old" way, but wouldn't it be so nice to have a column of sample means and next to it a column of sample SDs (or even lower confidence limit and upper confidence limit)?
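As a rough illustration of the two-statistic simulation described in this comment, here is a Python sketch rather than a StatCrunch feature. The normal population, sample size of 5, seed, and function name are all assumptions made for the example. Each row holds the sample mean, the sample SD, and the lower and upper limits of the z-based interval.

```python
import random
import statistics

def z_interval_coverage(sample_size=5, num_samples=10_000,
                        mu=0.0, sigma=1.0, z=1.96, seed=2):
    """Simulate z-based intervals built from the sample SD and report coverage."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)           # sample SD, not the true sigma
        half_width = z * s / sample_size ** 0.5
        rows.append((xbar, s, xbar - half_width, xbar + half_width))
    covered = sum(lower <= mu <= upper for _, _, lower, upper in rows)
    return rows, covered / num_samples

rows, coverage = z_interval_coverage()
print(f"proportion of intervals covering the mean: {coverage:.3f}")
```

With samples this small, the coverage comes out noticeably below the nominal 95%, which is the motivation for the t-distribution that the comment has in mind.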
Thank you for sharing this feature. I'd love to use it with the Central Limit Theorem lab in which students simulate rolling groups of dice, computing the mean of each dice roll, and then creating histograms of the means for different sample sizes. However, I don't seem to be able to create a column of means for Discrete Uniform data as you did for the Exponential data. Here's a screenshot: http://screencast.com/t/z3tv4chxu5I I keep getting the following error: Error computing expression, Invalid statement (DiscreteUniform) Invalid Function call mean
Any ideas?
hi Sue,
Actually it is there, but takes a bit more finding. When you select Data, Simulate Data and click on the ?, you get a lot of links to Wikipedia about the distributions, then if you scroll down to Computing a Statistic, there's the same link to Computing an Expression that I described before. Maybe that's new since you last looked.
I'm a "mere user", not a programmer, but I agree that it would be nice (until the construction tool is attached to the Compute Statistic box) if the Compute Expression help were easier to find.
Otherwise, the syntax for what goes into the Compute Expression box is actually easier than for sampling from a population. If you're sampling from a normal distribution, std(Normal) gets the SD for each sample, or if you are sampling from a gamma distribution, std(Gamma) does that, and so on. When typing something into this box, you kind of have to know what the simulated data is going to be called, which is easier for simulating from a distribution than sampling from a population.
Thanks for the help. I did find the help for building expressions by going through the ? button in the Sample Columns dialog box. When I tried to find the same information through the ? button in the Simulate Data dialog box, I could not find it. Since it is now possible to build expressions to compute statistics in the Simulate Data function, it might be nice to be able to get to that help. Just a suggestion....
Some more odds and ends:
1. When I was writing my report, I realized that I had to do all my analysis first, save the data set with the results in it, and *then* start on my report. Starting the report sooner would mean it only had access to my results that were saved as of when I started the report. (I suspect that's an inevitability of the programming.)
2. I found the "rows" and "columns" of the Simulate dialog box confusing. I wanted 1000 samples of size 5, but I didn't know which should be rows and which columns. The way I thought about it was "if you were storing the data split across columns, which way would it go?" Then I'd want 1000 columns each of 5 rows, so that each sample is 1 column. Saving the statistic, though, actually produces you 1 column with 1000 *rows*. So I think, instead of "rows" and "columns", "number of samples" and "size of each sample" would be just the thing.
3. The function names (like "mean" and "std" and the others) look like the same names as R. Is that deliberate, or just the way things worked out?
Sorry, Webster. I think it was the same gremlin that had me post my message twice. My report is now shared.
Ken, I don't think you are sharing your report.
Sue, Ken has given you a great response on this one. Indeed, there are a number of functions you can use to compute statistics of interest.
I have another comment on this, which I *will* do as a report. I think you'll find the report I did here:
http://www.statcrunch.com/5.0/viewreport.php?reportid=22043
sgrapevine: you can find the syntax you need by clicking on the "?" help button in the Sample Columns dialog box, then look down at item #7. In there, there is a link to "StatCrunch expression". Click that, and scroll all the way down to Column Functions. The one you want is "std".
For example, suppose you have a column x that contains the population you want to sample from, and you want to look at the sampling distribution of the sample SD for samples of size 5 from that population. Select Data and Sample Columns, select x as the column to sample from, fill in the sampling details, select Compute Statistic, and enter std("Sample(x)") as the statistic to compute. This reads as "the standard deviation of the samples from x". Make sure you get the brackets and quotes in the right place!
When I did this, I got a distribution of sample SDs that was a little skewed to the right. (Of course, it's the sample *variance* that will look more right-skewed because it has a (scaled) chi-squared distribution when your population is normal.)
I definitely should have written this as a report.
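The steps in this comment can also be sketched in Python for comparison. Everything here is an assumption made for illustration: the column x is a made-up, roughly normal population of 1,000 values, samples are drawn without replacement, and skewness is computed with a simple moment formula.

```python
import random
import statistics

def skewness(values):
    """Moment-based skewness: third central moment over SD cubed."""
    m = statistics.fmean(values)
    m2 = statistics.fmean([(v - m) ** 2 for v in values])
    m3 = statistics.fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

rng = random.Random(3)
x = [rng.gauss(50, 10) for _ in range(1000)]  # hypothetical population column

# One sample SD per sample of size 5, like std("Sample(x)") per sample.
sample_sds = [statistics.stdev(rng.sample(x, 5)) for _ in range(10_000)]
sample_vars = [s ** 2 for s in sample_sds]

print(f"skewness of sample SDs:       {skewness(sample_sds):.2f}")
print(f"skewness of sample variances: {skewness(sample_vars):.2f}")
```

Both skewness values should come out positive, with the variances more strongly right-skewed than the SDs, matching the observation above.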
And what is the syntax for finding other statistics for the samples? What if I wanted to find the standard deviation for each sample created with a simulation?
Sure, how about posting them to the discussion board on the StatCrunch Facebook page?
http://www.facebook.com/pages/StatCrunch/273194530391
Thanks for your comments, "websterwest". I can certainly demonstrate the two-step process first, and then do it the quicker way afterwards. I discovered the "stacked histogram" thing myself later, the way you described.
I have an entirely unrelated suggestion about stem and leaf plots. Where would be a good place to direct that?
Eventually all StatCrunch expression boxes will have a construction tool attached to them. This is in the works.
I also agree about the two stage process potentially being better pedagogically. However, the previous method was too limiting as described above. I think doing an initial small scale simulation study in two steps might allow one to use the new methodology in subsequent studies for faster results.
After selecting your columns for a histogram, just click Next a couple of times and turn on the Use same X axis option.
In the grand scheme of things, I like this, though I have two thoughts:
- the syntax (especially for counting females) is more awkward than I would like to get students using. Is there a way of having a "construction" dialog as for Compute Expression?
- doing a simulation in one stage, while convenient, obscures the process a little: for myself, I like to be clear that I am taking "many" simulated samples of the right size, and then, when I have them, computing some statistic for each sample (as you describe at the top of your report).
I love the "stacked" histograms. This is such a clear illustration of (a) the spread becoming smaller and (b) the shape becoming more normal. How did you get the x-axis scales to stay consistent?