Game of Thrones Part 2
Generated Nov 12, 2017 by amf17d
<data1>
<result1>
Above is the scatterplot for the two quantitative variables User Rating and User Votes. From this scatterplot, we may observe that the relationship between the two variables appears to be nonlinear: the variables are related, but the points do not follow a straight-line pattern. The relationship becomes especially prominent at the higher values of ratings and votes, where the highest-rated episodes (y-axis) also have the highest numbers of user votes (x-axis). So essentially, the more user votes an episode has, the more likely it is to have a higher rating. In addition, there are outliers that affect this relationship.
<result2>
Removing the user-vote outliers identified in part 1 of this project gives the new scatterplot above. This scatterplot differs from the previous one, which included the outliers, in that it now shows a positive linear relationship rather than a nonlinear one. Although the relationship is positive, it is a weak one.
To evaluate my data, I will use a significance level of 0.05, given the level of variability in my data.
The formula for my line of best fit, taken from the regression output, is UserRating = 8.3546 + 0.0000328 × UserVotes.
<result3>
Using the results above, we can first analyze the correlation coefficient, which is 0.55. This correlation coefficient suggests a weak positive correlation between the variables User Rating and User Votes. Further, if we look at the p-values of the intercept and slope, we find that they are both <0.0001. Using the significance level of 0.05, we can conclude that both estimates are statistically significant, since their p-values are less than 0.05.
Looking at the line of best fit, we observe that it denotes a positive correlation. However, because the correlation is weak (r = 0.55), this line cannot give a completely accurate depiction of the relationship. A line of best fit cannot give good predictions unless the correlation is strong and there are many data points.
Lastly, there is the proportion of the scatter in the data accounted for by the best-fit line (R-sq). Using the above results, it can be seen that this is 0.31, meaning the line accounts for about 31% of the variation in User Rating. This suggests that the data are only weakly fitted by the regression line.
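As a quick check on the numbers above, the short Python sketch below squares the correlation coefficient to recover R-sq and uses the fitted equation to make a prediction. The intercept, slope, and r are copied from the regression output in this report; the function name `predict_rating` and the vote count of 20,000 are hypothetical and chosen only for illustration.

```python
# Values copied from the simple linear regression output in this report.
intercept = 8.3545845
slope = 0.000032784998
r = 0.55295314

# For simple linear regression, R-sq is the square of the correlation coefficient.
r_squared = r ** 2
print(round(r_squared, 4))  # 0.3058, i.e. about 31% of the variation

def predict_rating(user_votes):
    """Predicted User Rating from the fitted line (hypothetical helper name)."""
    return intercept + slope * user_votes

# Hypothetical vote count, used only to illustrate how the line is applied.
print(round(predict_rating(20000), 2))  # 9.01
```

Because R-sq is only about 0.31, a prediction like this should be read as a rough central estimate rather than a precise value.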
<result4>
Using the linear model and the new graph, one might say that the line is just an okay fit for the data. Due to the weakness of the correlation, the graph cannot depict the relationship with complete accuracy. However, with the data points we do have, I feel it adequately shows that there is a positive correlation, even if it is a weak one.
My data is correlated at 0.55, which is a weak positive correlation. Further, although the data is weakly positively correlated, correlation does not imply causation. There may be underlying causes, or the pattern may be coincidental. With this data, and the correlation between User Votes and User Ratings, it is likely there is an underlying cause. This underlying cause may pertain to the episodes themselves: when they aired, what the content of each episode was, and so on. Such a cause may have resulted in both better ratings and more people voting for an episode, as they liked it more for various reasons.
E.C. #1 Create a QQ Plot of your line of best fit
<result5>
My values do not follow a normal distribution.
E.C. #3 Reevaluate your linear model at a different significance level. Do you get different results?
Even if we evaluate the linear model at the 0.01 level, it is still significant because the p-values for the intercept and slope are both <0.0001.
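The reasoning above can be sketched numerically. In simple linear regression, the t statistic for the slope can be computed from r and the sample size as t = r·sqrt(n − 2)/sqrt(1 − r²). The sketch below uses r and n from the regression output in this report; the critical values are rounded two-sided t-table values for 59 degrees of freedom and are an assumption of this example, not part of the StatCrunch output. It covers only the slope, not the intercept.

```python
import math

# Values copied from the regression output in this report.
r = 0.55295314
n = 61

# t statistic for the slope in simple linear regression.
df = n - 2
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
print(round(t_stat, 2))

# Rounded two-sided critical t values for 59 degrees of freedom
# (assumed from a standard t table, not from the report's output).
critical_values = {0.05: 2.001, 0.01: 2.662}

# The slope is significant at a given level if |t| exceeds the critical value.
significant = {alpha: abs(t_stat) > t_crit for alpha, t_crit in critical_values.items()}
print(significant)
```

Since the statistic exceeds both critical values by a wide margin, changing the significance level from 0.05 to 0.01 does not change the conclusion for the slope.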
E.C. #4 Create a linear model for a cluster sample. Is the model different than for the whole dataset? Explain
<result6>
As we can see, using a cluster sample of size 30, the model is quite different from the model for the entire data set. Instead of a weak positive correlation, there is a weak negative correlation between the variables User Votes and User Rating.
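For readers unfamiliar with the method, the sketch below illustrates how a cluster sample is drawn: whole groups are selected at random and every unit in each chosen group is kept. The report does not say how the clusters were defined, so treating seasons as clusters, the layout of 6 seasons with 10 episodes each, and the function name `cluster_sample` are all assumptions made only for illustration.

```python
import random

def cluster_sample(rows, cluster_key, n_clusters, seed=None):
    """Pick n_clusters whole clusters at random and keep every row in them."""
    rng = random.Random(seed)
    clusters = sorted({cluster_key(row) for row in rows})
    chosen = set(rng.sample(clusters, n_clusters))
    return [row for row in rows if cluster_key(row) in chosen]

# Hypothetical layout: 6 seasons of 10 episodes, stored as (season, episode) pairs.
episodes = [(season, ep) for season in range(1, 7) for ep in range(1, 11)]

# Choosing 3 whole seasons yields a sample of 30 episodes.
sample = cluster_sample(episodes, cluster_key=lambda row: row[0], n_clusters=3, seed=0)
print(len(sample))  # 30
print(sorted({season for season, _ in sample}))
```

Because entire clusters are kept or dropped together, a cluster sample can miss groups that drive the overall trend, which is one way a sample of 30 can show a different relationship from the full data set.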
Simple linear regression results:
Dependent Variable: UserRating
Independent Variable: UserVotes
UserRating = 8.3545845 + 0.000032784998 UserVotes
Sample size: 61
R (correlation coefficient) = 0.55295314
R-sq = 0.30575718
Estimate of error standard deviation: 0.31852325
Parameter estimates:
Analysis of variance table for regression model:
