Randomization test for correlation

This example covers the basics of using a StatCrunch applet to conduct a hypothesis test for a correlation between two quantitative variables. Suppose a nutritionist is interested in showing there is a significant positive correlation between the fat content and calorie content of chicken sandwiches. She collects the nutritional information of chicken sandwiches for a sample of 7 restaurants. The resulting data are available in the data set "Fat and calorie content for a sample of seven chicken sandwiches". Do these data support her hypothesis of a significant positive correlation between the two variables?

Examining the sample correlation

The correlation coefficient is appropriate when the relationship between two variables is linear. A linear relationship can be verified by examining a scatter plot of the two variables. To construct a scatter plot, choose Graph > Scatter Plot. Select the Fat column for the X variable and Calories for the Y variable. Click Compute to view the resulting scatter plot as shown below. Inspection of the scatter plot reveals a roughly linear relationship.

With linearity verified, it is appropriate to consider the sample correlation between the two variables. To compute the sample correlation, choose Stat > Summary Stats > Correlation. Select the Fat and Calories columns as shown below. Click Compute to view the resulting sample correlation of approximately 0.73. This sample value is compared to the value of 0, which would indicate no correlation between the two variables. For the observed sample correlation of 0.73 to be statistically significant, it must be unusually large compared to what one would expect to occur if there were no correlation between the two variables. The goal of the randomization approach is to quantify how likely a sample correlation this large would be if there were really no correlation between the two variables when the entire population of chicken sandwiches is considered.
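For readers who want to see what the Correlation menu option computes, the following is a minimal Python sketch of the usual sample (Pearson) correlation formula. It is not StatCrunch code. The function name pearson_r and the seven Fat and Calories values are hypothetical placeholders chosen for illustration; they are not the data set used in this example, so the printed value will not be 0.73.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical placeholder values, NOT the data set from this example.
fat = [20, 29, 16, 33, 27, 24, 39]
calories = [430, 550, 500, 520, 610, 470, 640]

print(round(pearson_r(fat, calories), 2))
```

A value near 0 indicates little linear association, while values near 1 or -1 indicate a strong positive or negative linear association.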

Constructing the randomization applet

The scenario where there is no correlation between the two variables can be simulated in StatCrunch using the Applets > Resampling > Randomization test for correlation menu option. In the resulting dialog, select the Fat column for the X variable and the Calories column for the Y variable. Click Compute! to construct the applet as shown below.

Understanding the randomization process

With the randomization approach, the goal is to understand the values of the sample correlation one might expect to see if there is really no correlation between the two variables. To better understand this, begin by clicking the 1 time button at the top of the applet. A new window will appear with the original Fat column along with a Calories column that is randomly shuffled in an animated fashion. The idea is to randomly reassign each value for Calories to a value for Fat to simulate no correlation between the two variables. The new window also shows the sample correlation for the shuffled data. After the random reassignment is completed, the sample correlation is "dropped" into the graph in the original applet. Note that your results won't match any of the following screenshots, because a different random number seed is used every time.

Graphing the randomization results

If the magnitude (absolute value) of the randomized sample correlation is greater than or equal to the observed value of 0.73, it will be displayed in red. Otherwise, it will be displayed in gray. The correlation of 0.004719 from the first randomization shown above is much smaller in magnitude than 0.73, so it is shown in gray. Clicking the 5 times button in the applet will repeat the process of shuffling and recomputing the correlation five more times in an animated fashion. The screenshot below shows the results after 70 additional randomizations have been added to the applet by clicking the 5 times button 14 times. None of the 71 total randomizations produced a correlation at least as extreme in magnitude as the observed correlation. The number of randomized correlations falling into each of the two extreme regions, at or below -0.73 (extreme negative correlation) and at or above 0.73 (extreme positive correlation), is also tabulated above the graph.

The Runs table to the left of the graph lists the individual randomizations, color-coded in the same fashion. The results of an individual randomization can be inspected by clicking on a number in the Runs table. A bar in the graph may also be clicked to display a listing of all associated randomizations. An individual randomization may also be selected from this listing for inspection.

Ramping up the number of randomizations

After one understands the randomization process, pressing the 1000 times button will repeat the shuffling/recomputing process one thousand times very quickly. This allows one to build a better picture of the distribution of the sample correlation if there is really no correlation between the two variables. Clicking this button repeatedly produces a more and more detailed picture of this distribution under the no correlation scenario. The screenshot below shows the distribution after the 1000 times button has been clicked ten consecutive times, making the total number of randomizations in the applet 10,071. As one might expect, the randomized correlations are centered around zero, since the randomization approach simulates the scenario where there is no real correlation between the two variables.

Interpreting the randomization results

The randomization approach applied above shows the types of values one would expect to see for the sample correlation if there is really no correlation between the two variables at the population level. In this case, the nutritionist is trying to show a positive correlation exists, so she is particularly interested in the proportion of the randomized correlations that are 0.73 or larger. This proportion quantifies how extreme the observed correlation is in the proper context of her hypothesis. If this proportion is very small, then the observed correlation is unusually large under the no correlation scenario. This would imply the data provide strong evidence that the variables are positively correlated. If, on the other hand, this proportion is not very small, then the observed correlation is not that unusual under the no correlation scenario. This would imply there is not a great deal of evidence to support the idea of a positive correlation.

The results in this case show that only 64 of the 10,071 total randomizations produced a correlation greater than or equal to the observed correlation of 0.73. This works out to a proportion of 64/10,071 = 0.0064, or about 0.64% as a percentage. This means that there is less than a 1% chance of seeing a sample correlation of 0.73 or larger if there is really no correlation between the two variables overall. As chances go, 0.64% is quite low, implying that the correlation between Fat and Calories is significantly greater than 0.
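The arithmetic behind the reported proportion can be checked with a few lines of Python. The counts 64 and 10,071 come from the results described above; the function name one_sided_proportion is a hypothetical helper introduced only for this illustration.

```python
def one_sided_proportion(extreme_count, total_runs):
    """Proportion of randomized correlations at or above the observed value."""
    return extreme_count / total_runs

# Counts reported in the results above.
p = one_sided_proportion(64, 10_071)

print(f"{p:.4f}")   # → 0.0064
print(f"{p:.2%}")   # → 0.64%
```

Because a different set of shuffles is drawn on each run of the applet, the count of extreme randomizations, and therefore this proportion, will vary slightly from one run to the next.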
