Randomization test for two means

This example covers the basics of using a StatCrunch applet to conduct a hypothesis test for comparing two means using randomization techniques. Suppose an instructor thinks yellow is a happy color, and because of this students might perform better on exams printed on yellow paper. She decides to do a simple experiment with an exam in her course to test out her idea. In a class with 20 students, she prints out 10 exams on standard white paper, 10 exams on yellow paper and then randomly assigns the exams to students as they enter the room to take the exam. The resulting scores are contained in the Yellow_White exam data. This data set contains the scores in decreasing order and the corresponding color of the exam for each of the 20 students.

Comparing the two means

To summarize the scores on the yellow and white versions of the exam, choose Stat > Summary Stats > Columns. Select the Score column and then specify Exam as the Group by column. Click Compute! to view the resulting summary statistics for the two sets of exam scores as displayed below. In this case, it makes sense to use the mean as a measure of a typical score for each group. The mean score of the students taking the yellow exam is 77.3, and the mean score of the students taking the white exam is 71, so the observed difference between the two means (yellow minus white) is 6.3. A large positive difference would provide evidence in favor of the instructor's idea. For the observed difference of 6.3 to be statistically significant in support of that idea, it must be unusually large compared to what one would expect to see if color really has no impact on the typical score. The goal of the randomization approach is to quantify how likely a difference this large between the means would be if the color of the exam really has no impact on the typical score.
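Outside StatCrunch, the same grouped summary can be sketched in a few lines of Python. The rows below are hypothetical values made up for illustration, not the Yellow_White data, and the helper name `group_means` is likewise an assumption.

```python
from statistics import mean

# Hypothetical (score, exam color) rows for illustration only;
# these are NOT the actual Yellow_White exam scores.
rows = [
    (90, "Yellow"), (85, "White"),
    (80, "Yellow"), (75, "White"),
    (70, "Yellow"), (65, "White"),
]

def group_means(rows):
    """Return the mean score for each exam color."""
    groups = {}
    for score, color in rows:
        groups.setdefault(color, []).append(score)
    return {color: mean(scores) for color, scores in groups.items()}

means = group_means(rows)
# Observed statistic: mean of the yellow scores minus mean of the white scores.
observed_diff = means["Yellow"] - means["White"]
print(means, observed_diff)
```

With the real data, the same calculation would give the means of 77.3 and 71 and the difference of 6.3 reported above.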

Constructing the randomization applet

The scenario where color has no impact on the typical exam score can be easily simulated in StatCrunch using the Applets > Resampling > Randomization test for two means menu option. After selecting this option, specify the two samples that you wish to compare in the resulting dialog window as shown below. For the first sample, select the Score column and enter a Where expression of Exam = Yellow (note the case sensitivity of the column names/values). For the second sample, also select the Score column and enter a Where expression of Exam = White. Click Compute! to construct the applet as shown below.

Understanding the randomization process

With the randomization approach, the goal is to understand the types of differences between means one might see if color has no impact on the typical score. To better understand this basic premise of the randomization approach, begin by clicking the 1 time button in the applet. A new window will then appear that illustrates the shuffling of the color labels in the Exam column as shown below, the idea being that colors can be randomly assigned to scores to simulate the scenario where exam color has no impact on the outcome. The new window also shows the difference between the mean scores for the shuffled data. After the random reassignment is completed, this difference is "dropped" into the graph in the original applet.
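A single shuffle of the kind the 1 time button performs can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the scores and labels are hypothetical, the function name `shuffled_difference` is invented here, and nothing below reflects how the applet is actually implemented.

```python
import random

def mean(values):
    return sum(values) / len(values)

def shuffled_difference(scores, labels, rng):
    """Shuffle the color labels once and return mean(Yellow) - mean(White)."""
    shuffled = list(labels)   # copy so the original labels are left untouched
    rng.shuffle(shuffled)     # random reassignment of colors to scores
    yellow = [s for s, c in zip(scores, shuffled) if c == "Yellow"]
    white = [s for s, c in zip(scores, shuffled) if c == "White"]
    return mean(yellow) - mean(white)

# Hypothetical scores and labels for illustration; NOT the Yellow_White data.
scores = [90, 85, 80, 75, 70, 65]
labels = ["Yellow", "White", "Yellow", "White", "Yellow", "White"]

rng = random.Random(1)  # seeded only so the sketch is repeatable
diff = shuffled_difference(scores, labels, rng)
print(diff)
```

Because only the labels move, each group keeps its original size, and the scores themselves never change.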

Graphing the randomization results

If the randomized difference in means is at least as large in magnitude (absolute value) as the observed difference, it will be displayed in red. Otherwise, it will be displayed in gray. In this case, a difference in means that is 6.3 or larger in magnitude will be colored in red. The difference of 2.5 from the first randomization shown above is smaller in magnitude, so it is shown in gray. Clicking the 5 times button in the applet will repeat the process of shuffling and recomputing the difference of means five more times in an animated fashion. The screenshot below shows the results after five additional randomizations have been added to the applet. Three of the six total randomizations were as extreme as or more extreme than the observed value: one difference below -6.3 and two above 6.3. The number and proportion of the randomizations falling into each of these regions are also tabled above the graph. The Runs table to the left lists the individual randomizations, color-coded in the same fashion. The results of an individual randomization can be inspected by clicking on a number in the Runs table. A bar in the graph may also be clicked to display a listing of all associated randomizations. An individual randomization may also be selected from this listing for inspection.
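The coloring rule amounts to a comparison of absolute values. A minimal Python sketch, assuming the "6.3 or larger in magnitude" reading of the rule (the function name `display_color` is hypothetical):

```python
def display_color(randomized_diff, observed_diff):
    """Return "red" when the randomized difference is at least as large
    in magnitude as the observed one, otherwise "gray"."""
    if abs(randomized_diff) >= abs(observed_diff):
        return "red"
    return "gray"

print(display_color(2.5, 6.3))   # the first randomization described above
print(display_color(-7.0, 6.3))  # a hypothetical difference below -6.3
```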

Ramping up the number of randomizations

After one understands the randomization process, pressing the 1000 times button will repeat the shuffling/recomputing process one thousand times very quickly. This allows one to build a better picture of the distribution of the differences between mean scores if color has no impact. Clicking this button repeatedly allows a more and more detailed distribution of this difference under the no impact scenario to emerge. The screenshot below shows the distribution after the 1000 times button has been clicked ten consecutive times, making the total number of randomizations in the applet 10,006 (the 6 earlier randomizations plus ten sets of 1,000). As one might expect, the randomized differences between the mean of the yellow scores and the mean of the white scores are centered around zero, since the randomization approach simulates the scenario where color has no impact on the typical score.
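Repeating the shuffle many times can be sketched in Python as below. The two score lists are hypothetical, the function name `randomization_distribution` is an assumption, and shuffling the pooled scores before splitting them is used here as an equivalent stand-in for shuffling the color labels.

```python
import random

def randomization_distribution(group_a, group_b, n_runs, seed=None):
    """Return n_runs differences in means under random reassignment."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    diffs = []
    for _ in range(n_runs):
        rng.shuffle(pooled)                   # random reassignment
        a, b = pooled[:n_a], pooled[n_a:]     # same group sizes as before
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    return diffs

# Hypothetical scores for illustration; NOT the Yellow_White data.
yellow = [90, 80, 70, 60]
white = [85, 75, 65, 55]

diffs = randomization_distribution(yellow, white, n_runs=2000, seed=0)
center = sum(diffs) / len(diffs)
print(len(diffs), center)
```

The average of the randomized differences should land close to zero, which mirrors the centering described above.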

Interpreting the randomization results

Using the large number of randomizations shown above, it is possible to consider the instructor's idea that yellow exams will lead to a higher typical score. The randomization approach applied above shows the types of values one would expect to see for the difference between means if color really has no impact on the typical score. The proportion of the randomized mean differences that are at or above 6.3 quantifies how extreme this observed difference is in this context. If this proportion is very small, then the observed mean difference is unusually large for a scenario in which exam color has no impact, which would provide strong evidence in favor of the instructor's idea. If, on the other hand, this proportion is not very small, then the observed mean difference is not unusual when exam color has no impact, and there is not a great deal of evidence to support the instructor's claim.

The results of the randomizations show that 1,338 of the 10,006 total randomizations produced a difference of 6.3 or larger. As a proportion, this works out to be 1,338/10,006 = 0.1337, or about 13% as a percentage. This means there is about a 13% chance of a difference of 6.3 or larger if color really has no impact on the typical exam score.

As chances go, 13% is relatively large, implying that a difference of 6.3 is not that unusual if color has no impact on score. By common convention, the chance of the observed difference or something more extreme would need to be smaller than 5% to be considered unusual. Therefore, the observed data do not provide strong evidence for the instructor's idea that yellow exams will typically lead to higher scores.
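The arithmetic behind this conclusion can be checked with a short Python sketch. The counts come from the results reported above; the 5% cutoff is the conventional threshold mentioned in the text, and the variable names are illustrative.

```python
count_at_or_above = 1338   # randomizations with a difference of 6.3 or larger
total_runs = 10006         # total randomizations in the applet

proportion = count_at_or_above / total_runs
threshold = 0.05           # conventional cutoff for "unusual"

print(round(proportion, 4))
is_unusual = proportion < threshold
print("unusual" if is_unusual else "not unusual")
```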

More about using the applet

The applet described above can be used to answer a variety of questions when comparing means with data. In the situation described above, the instructor thought yellow exams might increase the typical score, so the focus was placed on the upper proportion of the randomized mean differences that were at the observed value of 6.3 or even larger. In other situations, it may be appropriate to consider the lower proportion of randomized mean differences at or below the observed value, and in still other cases, one may want to consider the lower and upper proportions added together. To accommodate these different scenarios, each of the proportions is tabled in the applet.
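The three tabled proportions can be sketched in Python as follows. The list of differences is hypothetical, and the function name `tail_proportions` and its dictionary keys are assumptions made for this illustration, not names used by the applet.

```python
def tail_proportions(diffs, observed_diff):
    """Return the lower, upper, and combined tail proportions."""
    cutoff = abs(observed_diff)
    n = len(diffs)
    lower_count = sum(1 for d in diffs if d <= -cutoff)  # at or below -cutoff
    upper_count = sum(1 for d in diffs if d >= cutoff)   # at or above +cutoff
    return {
        "lower": lower_count / n,
        "upper": upper_count / n,
        "both": (lower_count + upper_count) / n,
    }

# Hypothetical randomized differences for illustration only.
example_diffs = [-7.0, 2.5, 6.5, 8.0, 0.0, -1.0]
tails = tail_proportions(example_diffs, observed_diff=6.3)
print(tails)
```

The upper proportion suits a claim that yellow raises scores, the lower proportion suits a claim that it lowers them, and the combined proportion suits a claim that color simply makes a difference.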

Due to the nature of the randomization approach, it is also important to note that users should not expect to get the exact same results as those shown above. The tabled proportions, however, will converge over a large number of randomizations such as the 10,006 shown above. Differences between these results and those of a user conducting a similar number of randomizations should be minimal.
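One rough way to gauge how much the tabled proportion might vary from run to run is the standard error of a simulated proportion, sqrt(p(1 - p)/N). The Python sketch below applies that textbook formula to the counts reported above; it is an approximation added for illustration and is not something the applet reports.

```python
import math

count_at_or_above = 1338   # from the results reported above
total_runs = 10006

p_hat = count_at_or_above / total_runs
# Approximate run-to-run variability of the simulated proportion.
standard_error = math.sqrt(p_hat * (1 - p_hat) / total_runs)
print(round(standard_error, 4))
```

A standard error of a few thousandths suggests that two users who each run about 10,000 randomizations should see upper proportions that differ only in the second or third decimal place.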
