Report Properties

Owner: amf17d
Created: Nov 12, 2017
Game of Thrones Part 2

Data set 1. Game of Thrones IMDB Ratings Dataset



Result 1: Scatter Plot

Above is the scatterplot for the two quantitative variables User Rating and User Votes. The scatterplot suggests a nonlinear relationship between the two variables: they are related, but the points do not follow a straight-line pattern. The relationship becomes especially prominent at the higher values of ratings and votes, where the highest rated episodes (y axis) also have the most user votes (x axis). So essentially, the more user votes an episode has, the more likely it is to have a higher rating. There are also outliers that affect this relationship.



Result 2: Scatter Plot Without Outliers

Using the outliers for user votes that were calculated during part 1 of this project, we can remove them to get the new scatterplot above. This scatterplot differs from the previous one, which included the outliers, in that it now shows a positive linear relationship rather than a nonlinear one. Although the relationship is positive, it is a weak one.
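
The outlier rule used in part 1 is not restated in this report. As a minimal sketch, assuming the common 1.5 x IQR fence rule and a short made-up list of vote counts (not values from the dataset), removing outliers could look like this:

```python
import statistics

# Made-up vote counts for illustration only; not values from the dataset.
votes = [10, 12, 13, 15, 16, 18, 20, 95]

# Quartiles; statistics.quantiles uses the "exclusive" method by default.
q1, _median, q3 = statistics.quantiles(votes, n=4)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

kept = [v for v in votes if lower_fence <= v <= upper_fence]
removed = [v for v in votes if v < lower_fence or v > upper_fence]

print("kept:", kept)
print("removed:", removed)
```

If part 1 used a different rule (for example, a z-score cutoff), only the two fence lines would need to change.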

To evaluate my data, I use a significance level of 0.05, given the level of variability in the data.

The formula for my line of best fit is y = 0.0000328x + 8.35, where x is User Votes and y is User Rating (from the regression output in Result 3).


Result 3: Simple Linear Regression Removed Outliers
Simple linear regression results:
Dependent Variable: UserRating
Independent Variable: UserVotes
UserRating = 8.3545845 + 0.000032784998 UserVotes
Sample size: 61
R (correlation coefficient) = 0.55295314
R-sq = 0.30575718
Estimate of error standard deviation: 0.31852325

Parameter estimates:
Parameter   Estimate         Std. Err.         Alternative   DF   T-Stat      P-value
Intercept   8.3545845        0.1370052         ≠ 0           59   60.980053   <0.0001
Slope       0.000032784998   0.0000064315629   ≠ 0           59   5.0975165   <0.0001

Analysis of variance table for regression model:

Using the results above, we can first look at the correlation coefficient, which is 0.55. This suggests a weak positive correlation between User Rating and User Votes. Further, the p-values of the intercept and slope are both <0.0001. At the 0.05 significance level, both are statistically significant, since each p-value is less than 0.05.
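
The test statistics in Result 3 can be checked by hand. The sketch below uses only the values reported in that output: each T-Stat is the estimate divided by its standard error, and the slope's T-Stat can also be recovered from r and the sample size. The variable names are chosen here for illustration.

```python
import math

# Values copied from the Result 3 output.
n = 61
r = 0.55295314
intercept, intercept_se = 8.3545845, 0.1370052
slope, slope_se = 0.000032784998, 0.0000064315629

df = n - 2  # degrees of freedom for simple linear regression

# T-Stat = estimate / standard error.
t_intercept = intercept / intercept_se
t_slope = slope / slope_se

# For simple linear regression, the slope's t statistic also equals
# r * sqrt(n - 2) / sqrt(1 - r^2).
t_slope_from_r = r * math.sqrt(df) / math.sqrt(1 - r**2)

print(f"df = {df}")
print(f"t (intercept) = {t_intercept:.4f}")
print(f"t (slope) = {t_slope:.4f}")
print(f"t (slope, from r) = {t_slope_from_r:.4f}")
```

Both routes to the slope's t statistic agree with the reported 5.0975165 up to rounding of the inputs.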

The line of best fit has a positive slope, which reflects the positive correlation. However, because r is weak (0.55), the line cannot give a completely accurate depiction of the relationship. A line of best fit cannot give good predictions unless the correlation is strong and there are many data points.
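
To make the idea of prediction concrete, here is a minimal sketch that plugs a vote count into the fitted equation reported in Result 3. The vote count of 20,000 is a hypothetical input chosen for illustration, and the function name is an assumption of this sketch.

```python
# Coefficients copied from the Result 3 output.
INTERCEPT = 8.3545845
SLOPE = 0.000032784998


def predict_rating(user_votes: float) -> float:
    """Predicted User Rating from the fitted line for a given vote count."""
    return INTERCEPT + SLOPE * user_votes


# Hypothetical vote count chosen only for illustration.
example_votes = 20_000
predicted = predict_rating(example_votes)
print(f"Predicted rating for {example_votes} votes: {predicted:.2f}")
```

A prediction like this is only meaningful for vote counts inside the range the line was fitted on, and with r at 0.55 the individual episodes can fall well above or below the line.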

Lastly, there is the proportion of the scatter in the data accounted for by the best-fit line, R-sq. From the results above, it is 0.31, so the line accounts for about 31% of the variation in User Rating. This suggests that the data is only weakly fitted by the regression line.
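
The analysis of variance quantities behind R-sq can be rebuilt from the summary values in Result 3. The sketch below is derived from those reported numbers, not copied from StatCrunch's own table; it shows how the error standard deviation, R-sq and the slope's T-Stat fit together.

```python
# Values copied from the Result 3 output.
n = 61
r_squared = 0.30575718
error_sd = 0.31852325
slope_t = 5.0975165

df_model = 1
df_error = n - 2

# Error sum of squares: s^2 = SSE / (n - 2).
sse = error_sd**2 * df_error

# R-sq = 1 - SSE / SST, so SST = SSE / (1 - R-sq).
sst = sse / (1 - r_squared)

# Model sum of squares is what the line accounts for.
ssr = sst - sse

# F statistic; in simple linear regression it equals the slope's t squared.
f_stat = (ssr / df_model) / (sse / df_error)

print(f"SSR = {ssr:.4f}, SSE = {sse:.4f}, SST = {sst:.4f}")
print(f"Share of variation explained = {ssr / sst:.4f}")
print(f"F = {f_stat:.4f}, t^2 = {slope_t**2:.4f}")
```

The share of variation explained comes back as the reported R-sq, which is the sense in which the line accounts for about 31% of the scatter.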


Result 4: Line of Best Fit Removed Outliers

Using the linear model and the new graph, one might say that the model is just an okay fit for the data. Because the correlation is weak, the graph cannot depict the relationship with complete accuracy. However, with the data points we do have, I feel it adequately shows that there is a positive correlation, even if it is a weak one.

My data is correlated at 0.55, which is a weak positive correlation. Even so, correlation does not imply causation. There may be underlying causes, or the pattern may be coincidental. With this data, and the correlation between User Votes and User Ratings, it is likely there is an underlying cause. This cause may pertain to the episodes themselves: when they aired, what their content was, and so on. It may have resulted in both better ratings and more people voting for an episode, as they liked it more for various reasons.

E.C. #1 Create a QQ Plot of your line of best fit

Result 5: QQ Plot Linear Regression

My values do not follow a normal distribution.
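
For readers unfamiliar with how a QQ plot is built, here is a minimal sketch. It assumes the plot is drawn for the regression residuals and uses a short list of made-up residuals (not values from this report), with plotting positions (i - 0.5) / n, which is one common convention.

```python
from statistics import NormalDist

# Made-up residuals for illustration only; not values from the report.
residuals = [-0.42, 0.10, 0.31, -0.05, 0.58, -0.21, 0.02, -0.33]

n = len(residuals)
ordered = sorted(residuals)

# Theoretical standard normal quantiles at plotting positions (i - 0.5) / n.
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each QQ point pairs a theoretical quantile with an ordered residual.
qq_points = list(zip(theoretical, ordered))
for quantile, observed in qq_points:
    print(f"{quantile:7.3f}  {observed:6.2f}")
```

If the residuals were normal, these points would fall close to a straight line; systematic curvature or stray points at the ends are what indicate a departure from normality.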


E.C. #3 Re-evaluate your linear model at a different significance level. Do you get different results?

Even if we evaluate the linear model at the 0.01 level, it is still significant, because the p-values for the intercept and slope are both <0.0001.
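
The same comparison can be written out for several significance levels at once. Since the output reports the p-values only as "<0.0001", this sketch works with that upper bound; the 0.001 level is added here purely as an extra illustration.

```python
# The p-values are reported only as "<0.0001", so use that upper bound.
p_value_upper_bound = 0.0001

decisions = {}
for alpha in (0.05, 0.01, 0.001):
    # p < 0.0001 <= alpha, so the result is significant at this level.
    decisions[alpha] = p_value_upper_bound <= alpha

for alpha, significant in decisions.items():
    verdict = "significant" if significant else "not significant"
    print(f"alpha = {alpha}: {verdict}")
```

The conclusion would only change for a significance level below 0.0001, which is far stricter than the levels normally used.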



E.C. #4 Create a linear model for a cluster sample. Is the model different than for the whole dataset? Explain

Result 6: Linear Model Sample

As we can see, with a cluster sample of size 30, the model is quite different from the model for the entire data set. Instead of a weak positive correlation, there is a weak negative correlation between User Votes and User Rating.
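
A cluster sample picks whole groups at random and keeps every observation inside the chosen groups. The sketch below assumes seasons as the clusters and uses a small made-up table of episodes (not values from the dataset); the helper `fit_line` is a plain least-squares fit written for this illustration.

```python
import random

# Made-up rows (season, user_votes, user_rating) for illustration only.
episodes = [
    (1, 21000, 9.0), (1, 16000, 8.7), (1, 15500, 8.6),
    (2, 14000, 8.8), (2, 13500, 8.5), (2, 15000, 9.1),
    (3, 17000, 8.9), (3, 22000, 9.4), (3, 14500, 8.7),
    (4, 16500, 9.0), (4, 20000, 9.5), (4, 15800, 8.8),
]


def fit_line(points):
    """Ordinary least squares fit; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(0)  # fixed seed so the sketch is repeatable
seasons = sorted({season for season, _, _ in episodes})
chosen = random.sample(seasons, k=2)  # pick whole clusters at random

# Keep every episode in the chosen clusters, then fit a line to them.
sample = [(votes, rating) for season, votes, rating in episodes if season in chosen]
sample_intercept, sample_slope = fit_line(sample)
full_intercept, full_slope = fit_line([(v, r) for _, v, r in episodes])

print("chosen seasons:", chosen)
print(f"sample slope = {sample_slope:.8f}, full slope = {full_slope:.8f}")
```

Because only a few clusters are drawn, the sample line can differ noticeably from the full-data line, which is the kind of difference described above.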


By xg15
Nov 15, 2017

For the cluster sample: it should be selected by cluster, like a model for episode 1 or a model for episode 2.
It would be better to explain that R^2 is the percentage of the variance of the response explained by the linear model.
