Creating scatter plots

This tutorial covers the steps for creating a scatter plot in StatCrunch. To begin, load the Asking prices for 4-bedroom homes in Bryan-College Station TX data set, which will be used throughout this tutorial. The data set was collected in order to compare four-bedroom homes listed for sale in the two adjoining cities of Bryan, Texas, and College Station, Texas. Using a real estate website, fifteen homes were randomly selected from the four-bedroom homes listed for sale in Bryan, and fifteen homes were randomly selected from the four-bedroom homes listed for sale in College Station. The Sqft column contains the square footage of each home, and the Location column lists the city where the home is located. The Price column contains the asking price of the home in thousands of dollars. The first home in the data set has an asking price of $2,400,000 (2.4 million dollars), which appears in the data set as 2400. To convert an asking price in the data set to dollars, multiply the value shown by 1,000 (that is, append three zeros).
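
The unit conversion for the Price column can be checked outside StatCrunch with a small Python sketch. The function name price_in_dollars is hypothetical and is not part of StatCrunch; the value 2400 is the first home's Price from the data set.

```python
def price_in_dollars(price_in_thousands):
    """Convert a Price value recorded in thousands of dollars to dollars."""
    return price_in_thousands * 1000


# The first home in the data set is recorded as 2400.
first_home = price_in_dollars(2400)
print(f"${first_home:,}")  # prints "$2,400,000"
```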

Building and interacting with a scatter plot

A scatter plot is the natural tool for examining the relationship between square footage and asking price. To construct this plot, choose the Graph > Scatter Plot menu option. Select the Sqft column for the X column and the Price column for the Y column. Click Compute! to generate the scatter plot shown below. The clear outlier in the plot can be identified by clicking and dragging the mouse around the point. The row containing the associated value is then highlighted in the data table. In this case, the outlier is in the first row of the data set. This highlighting can be cleared using the Clear button in the row selection navigation tool that appears in the lower left-hand corner.

Removing outliers

Outliers can have a large impact on graphics because they may greatly expand the range of one or more of the axes, obscuring the details in the plot. To remove the outlier in this example, choose Options > Edit to reopen the dialog window. A Where expression can be used to filter the data values that are included in the plot. In this case, the outlier could be excluded from the plot using any of several such expressions. Since the point has been identified as being in the first row, an expression of Row != 1 (meaning where Row is not equal to 1) will eliminate the outlier from the plot. The above scatter plot also reveals that the outlier is the only home in the data set with Price above 2000 ($2,000,000). Therefore, the outlier can also be removed by considering only the homes where Price is less than 2000, using the expression Price < 2000. Enter this expression for the optional Where input, and click Compute! to get the updated scatter plot shown below without the outlier.
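
The effect of the two Where expressions can be illustrated with a minimal Python sketch. This is not how StatCrunch evaluates Where expressions internally. The helper where and the sample rows are hypothetical: only the first row's Price of 2400 comes from the data set, and every other value is made up for illustration.

```python
# Hypothetical sample rows. Only the Price of 2400 in row 1 comes from the
# tutorial's data set; all other values are invented for illustration.
homes = [
    {"Row": 1, "Sqft": 6000, "Price": 2400},
    {"Row": 2, "Sqft": 2500, "Price": 300},
    {"Row": 3, "Sqft": 3100, "Price": 410},
    {"Row": 4, "Sqft": 2800, "Price": 355},
]


def where(rows, condition):
    """Keep only the rows for which the condition holds."""
    return [row for row in rows if condition(row)]


# Analogue of the Where expression: Row != 1
by_row = where(homes, lambda row: row["Row"] != 1)

# Analogue of the Where expression: Price < 2000
by_price = where(homes, lambda row: row["Price"] < 2000)

# Both conditions drop the same single outlier in this sample.
print(by_row == by_price)
print([row["Row"] for row in by_price])
```

The row-based condition depends on the outlier's position in the table, while the price-based condition depends only on its value, so the second form keeps working if the rows are reordered.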

Color coding points with a group by column

To compare the nature of the relationship between square footage and asking price across the two cities, the points in the scatter plot can be color-coded according to the location of each home. To construct such a plot, choose Options > Edit to reopen the dialog window. Under Group by, select the Location column and click Compute!. The resulting scatter plot shown below has the homes in Bryan shown in blue and the homes in College Station shown in red. Note that under Grouping Option, there is also an option to produce a separate plot for each group. In this case, this option would yield a separate scatter plot for each city.
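
Conceptually, a group by column splits the (Sqft, Price) points into one series per Location value, and each series is then drawn in its own color. The sketch below shows that idea in Python. The function group_points and the sample rows are hypothetical, all numeric values are made up, and the color mapping simply mirrors the plot described above rather than any documented StatCrunch rule.

```python
from collections import defaultdict

# Hypothetical sample rows; all numeric values are invented for illustration.
homes = [
    {"Sqft": 2500, "Price": 300, "Location": "Bryan"},
    {"Sqft": 3100, "Price": 410, "Location": "College Station"},
    {"Sqft": 2800, "Price": 355, "Location": "Bryan"},
    {"Sqft": 2200, "Price": 280, "Location": "College Station"},
]


def group_points(rows, group_key, x_key, y_key):
    """Split rows into one list of (x, y) points per group value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append((row[x_key], row[y_key]))
    return dict(groups)


groups = group_points(homes, "Location", "Sqft", "Price")

# Colors as described for the plot above: Bryan in blue, College Station in red.
colors = {"Bryan": "blue", "College Station": "red"}

for location, points in groups.items():
    print(location, colors[location], points)
```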

Overlaying a line of best fit

It is often useful to examine a best fitting function that characterizes the relationship between the two variables. StatCrunch allows for overlaying the best fitting polynomial up to the fourth order. A line is a polynomial of order one. To overlay a line of best fit for each location, choose Options > Edit to reopen the dialog window. Under Overlay polynomial order, choose 1 and click Compute!. The resulting scatter plot shown below shows the best fitting line for Bryan in blue and the best fitting line for College Station in red. The equation of each line can be obtained by double-clicking on the lines in the plot.
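
For readers who want to see what a line of best fit involves, the sketch below fits a polynomial of order one by ordinary least squares in Python. It assumes that "best fitting" means least squares, which the tutorial does not state explicitly. The function fit_line is hypothetical, and the sample points are made up and chosen to lie exactly on a line so the result is easy to verify by hand.

```python
def fit_line(points):
    """Return (slope, intercept) of the ordinary least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of squared deviations in x, and sum of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (Sqft, Price) points that lie exactly on Price = 50 + 0.1 * Sqft.
sample = [(1000, 150), (2000, 250), (3000, 350)]
slope, intercept = fit_line(sample)
print(f"Price = {intercept:.1f} + {slope:.1f} * Sqft")
```

To mirror the grouped plot, fit_line would be called once per location, giving one equation for the Bryan homes and one for the College Station homes.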
