Computing correlations between columns

This tutorial covers the steps for computing correlations between columns in StatCrunch. To begin, load the Home prices in Albuquerque data set, which will be used throughout this tutorial. This data set contains eight columns of data taken from 117 home sales in Albuquerque, New Mexico in 1993. All of the columns will be used in this tutorial, with a focus on their correlation with the sales price of the home. PRICE represents the sales price of each home in hundreds of dollars, so the first PRICE value of 2,050 represents a sales price of $205,000. SQFT represents the square footage of the living space in each home, and AGE is the age of each home in years. FEATS gives the number of features, such as a dishwasher, microwave, or dryer, included in the home. NE, CUST, and COR are indicator variables with a value of 1 if the home has the characteristic and a value of 0 if it does not: NE indicates a home in the northeast sector of the city, CUST a custom-built home, and COR a home on a corner lot. TAX represents the annual property taxes for the house in dollars.

Creating a correlation matrix between variables

To create a correlation matrix between variables in this data set, choose the Stat > Summary Stats > Correlation menu option. Select all of the columns in the data set under Select column(s) and click Compute! to view the resulting correlation matrix. Each cell contains the sample correlation between two variables. For example, the value of 0.8447951 in the cell between PRICE and SQFT is the correlation between the sales price of a home and its square footage. By default, duplicate correlations and the correlation of each variable with itself are left out of the matrix.
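
For readers who want to see what each cell of such a matrix contains, the following is a minimal Python sketch of a sample (Pearson) correlation matrix that, like the default display described above, leaves out duplicate pairs and the correlation of a variable with itself. It is not StatCrunch code. The function names and the toy values are invented for illustration and are not taken from the Albuquerque data set.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return {(row, col): r} for each pair of columns, listing every pair once."""
    names = list(columns)
    matrix = {}
    for i, row in enumerate(names):
        # Only pair `row` with columns that come before it, so duplicates
        # and self-correlations never appear.
        for col in names[:i]:
            matrix[(row, col)] = pearson(columns[row], columns[col])
    return matrix

# Toy values invented for illustration; NOT the Albuquerque data.
toy = {
    "PRICE": [10, 20, 30, 40, 50],
    "SQFT": [1, 2, 3, 4, 6],
    "AGE": [5, 4, 3, 2, 1],
}

matrix = correlation_matrix(toy)
for (row, col), r in matrix.items():
    print(f"{row:>5} vs {col:<5} {r: .4f}")
```

With three columns the sketch produces three cells, one per pair, which mirrors the triangular layout of the default matrix.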

Adding P-values to the correlation matrix

Adding a P-value to each sample correlation in the matrix can help identify where there is a statistically significant association between variables. In the previous results window, choose Options > Edit to reopen the dialog window. Under Display, check Two-sided P-value and click Compute!. Each cell of the resulting correlation matrix now has an additional line that contains the two-sided P-value for the correlation between the corresponding variables.
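
This tutorial does not state how StatCrunch computes the two-sided P-value. A common approach, assumed here, tests the null hypothesis of zero correlation with the statistic t = r * sqrt((n - 2) / (1 - r^2)), which is compared with a t distribution on n - 2 degrees of freedom. The Python sketch below follows that assumption and uses simple numerical integration (Simpson's rule) in place of a statistics library. The function names are hypothetical, and n is assumed to be the number of rows with values in both columns.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(r, n, steps=2000):
    """Two-sided P-value for H0: correlation = 0, ASSUMING the usual t test.

    r is the sample correlation and n the number of paired observations.
    steps must be even because Simpson's rule is used.
    """
    df = n - 2
    if abs(r) >= 1.0:
        return 0.0  # perfect linear relationship
    t = abs(r) * math.sqrt(df / (1.0 - r * r))
    # Simpson's rule for the area under the density between 0 and t.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_area = 2.0 * total * h / 3.0  # area between -t and t
    return max(0.0, 1.0 - central_area)

# Example call using the correlation quoted earlier in this tutorial and
# assuming all 117 homes have both a PRICE and a SQFT value.
p = two_sided_p_value(0.8447951, 117)
print(p)
```

A small P-value indicates that a correlation this far from zero would be unlikely if the two variables were uncorrelated in the population.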

Limiting the columns displayed in the correlation matrix

A potential home seller might wish to identify the characteristics that have a strong association with sales price as they prepare their house to be put on the market. In the previous results window, choose Options > Edit to reopen the dialog window. Under Display columns, check the Selected option. Under Selected, select the PRICE column to isolate this variable in the correlation matrix. Click Compute! to view the altered correlation matrix. The matrix now displays a single column showing the correlation of PRICE with each of the other characteristics.
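
The idea of isolating one column can be sketched in Python as follows. Because a triangular matrix stores each pair only once, the target variable may appear as either member of a pair, so both positions are checked. The function name and the correlation values are hypothetical, chosen for illustration rather than copied from StatCrunch output.

```python
# Hypothetical correlations, invented for illustration; NOT StatCrunch output.
lower_triangle = {
    ("SQFT", "PRICE"): 0.8,
    ("AGE", "PRICE"): -0.2,
    ("AGE", "SQFT"): 0.1,
}

def column_for(target, matrix):
    """Collect every correlation involving `target`, keyed by the other variable."""
    column = {}
    for (row, col), r in matrix.items():
        if row == target:
            column[col] = r
        elif col == target:
            column[row] = r
    return column

price_column = column_for("PRICE", lower_triangle)
for name, r in price_column.items():
    print(f"{name:<5} {r: .2f}")
```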

Sorting the correlation matrix

Additional options allow the correlation matrix to be sorted so that strong associations are easier to identify. To sort the previous results, choose Options > Edit to reopen the dialog window. Consider the options available under the Sort rows by correlation with heading. Under Column, select PRICE to sort the matrix by each variable's correlation with the price of the home. By default, the matrix is sorted in ascending order based on the correlation with the PRICE variable. To sort in descending order instead, choose Descending for the Order option. For this example, leave Order at the default Ascending value and click Compute!. The rows of the new correlation matrix are sorted in ascending order, from the variable with the lowest correlation with PRICE (AGE) to the variable with the highest correlation with PRICE (TAX).
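
The effect of the Order option can be illustrated with a short Python sketch that sorts one column of correlations in ascending or descending order. The function name, the `descending` parameter, and the correlation values are hypothetical and do not come from the Albuquerque results.

```python
# Hypothetical correlations with PRICE, invented for illustration.
price_correlations = {"SQFT": 0.8, "AGE": -0.2, "FEATS": 0.4}

def sort_rows(column, descending=False):
    """Return (variable, correlation) pairs ordered by correlation value."""
    return sorted(column.items(), key=lambda item: item[1], reverse=descending)

ascending = sort_rows(price_correlations)
descending = sort_rows(price_correlations, descending=True)

for name, r in ascending:
    print(f"{name:<5} {r: .2f}")
```

Note that this sorts by the signed value of each correlation, so a strong negative correlation lands at the start of an ascending list rather than next to the strong positive ones.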
