The two quantitative variables in this data are the state populations and the number of area codes each state has.
Looking at the two variables, State Population (x) and Number of Area Codes (y), there is a strong positive correlation between population size and number of area codes, and the relationship between the two variables is in fact linear. Since the number of area codes increases with population size, California may appear to be an outlier with its 16 area codes, but because it falls within the general trend, it is not one. What makes California stand out from the rest of the states is its large population. As with the other states, the greater the population, the greater the number of area codes. An appropriate significance level would be .01, which accepts a 1% risk of concluding that a relationship exists when it does not; this stricter level is chosen because a few states have a larger population but the same number of area codes as states with smaller populations.
Simple linear regression results:
Dependent Variable: Number of Area Codes
Independent Variable: Population (2000)
Number of Area Codes = 0.81333482 + 4.8088711e-7 Population (2000)
Sample size: 50
R (correlation coefficient) = 0.96382456
Rsq = 0.92895778
Estimate of error standard deviation: 0.83268969
Parameter estimates:
Analysis of variance table for regression model:

The correlation coefficient is .9638 (rounded to four decimal places). A correlation coefficient this close to 1 represents a strong positive correlation. With a sample of 50 states, a correlation this strong is significant at the .01 level. The line of best fit, y = mx + b, is y = 0.00000048089x + 0.8133, where x is the population and y is the number of area codes (the slope is on the order of 10^-7 because population is counted in individual people). This means the predicted number of area codes is .8133 plus about 0.48 for every additional one million people. Rsq, the square of the correlation coefficient, tells how close the data are to the "line of best fit". Since Rsq is .929, about 92.9% of the variation in the number of area codes is explained by population, so knowing x helps predict y.
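As a rough check of the numbers above, the sketch below plugs the reported estimates into the fitted line and computes the usual t statistic for testing a correlation, t = r·sqrt(n − 2)/sqrt(1 − r²). It assumes the slope is 4.8088711 × 10^-7 area codes per person, which is the scale implied by the data; the function names and the 10 million example population are illustrative only.

```python
import math

# Values taken from the regression output in this report.
INTERCEPT = 0.81333482   # reported intercept
SLOPE = 4.8088711e-7     # assumed slope per person (negative exponent is an assumption)
R = 0.96382456           # reported correlation coefficient
N = 50                   # sample size (50 states)


def predict_area_codes(population):
    """Predicted number of area codes for a given population."""
    return INTERCEPT + SLOPE * population


def correlation_t_statistic(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


if __name__ == "__main__":
    # Hypothetical state with 10 million residents.
    print("Predicted area codes:", round(predict_area_codes(10_000_000), 2))
    print("R squared:", round(R ** 2, 4))
    print("t statistic:", round(correlation_t_statistic(R, N), 1))
```

The t statistic comes out near 25 on 48 degrees of freedom, far beyond the two-sided critical value at the .01 level (roughly 2.7), which is consistent with calling the correlation significant.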
Analyzing the line of best fit and the scatter plot, the line is a good fit for the data. The data are strongly correlated, although correlation alone does not establish causation; the overall trend shows that as population increases, so does the number of area codes.
Looking at the QQ plot of residuals, the residuals appear to follow a normal distribution. You can tell because the data points fall along a fairly straight line across the quantiles.
Looking at the graph, the residual plot suggests that the linear model may not be a fully adequate fit. On this graph, the data points are not evenly distributed and show a clear pattern (they cluster around the bottom right), whereas a well-fitting linear model would leave residuals scattered randomly around zero with no relationship to the predicted values.
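For readers who want to see where the points on a residual plot come from, the sketch below fits a least-squares line and subtracts each fitted value from the observed value. The five (population, area code) pairs are invented placeholders, not the 2000 state data, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys):
    """Observed y minus fitted y for each point."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    # Made-up illustration data: population in millions, area codes.
    populations = [1.0, 2.0, 5.0, 10.0, 30.0]
    area_codes = [1, 2, 3, 6, 15]
    for pop, res in zip(populations, residuals(populations, area_codes)):
        print(f"population {pop:5.1f}M  residual {res:+.3f}")
```

Plotting these residuals against the fitted values gives the residual plot; a random scatter around zero supports the linear model, while a visible pattern or cluster is a sign that something is not captured by the line.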
Nov 18, 2017
Nice report.