William Denno
Description of the Data: “2008 Moneymaking movies”
The data set is a random sample of eighty in-theater movies that generated revenue in 2008. The list includes movies released prior to 2008 that had 2008 in-theater revenue.
Source of the Data:
The data originally come from http://www.thenumbers.com/. TheNumbers.com is a website that officially launched on October 17th, 1997 as a free resource for industry professionals, the investment community, and movie fans to track business information on movies. The site has grown to become the largest freely available database of movie industry information on the web.
Description of the Variables:
Nine variables, four quantitative and five qualitative, make up the random sample chosen in this study.
Four quantitative variables include:
 2008 Rank of in-theater movies
 2008 Gross Sales
 2008 Number of tickets sold
 2008 Inflation-adjusted gross sales.
Five qualitative variables include:
 Movie name
 Genre
 Release date
 MPAA Rating
 Distributor of the movie. The distributor of the movie is an important variable in that investors may use that information to help make a good investment decision.
The objectives of this project include:
 Display/summarize the data to better understand successful in-theater movies during 2008.
 Provide a framework for investors to consider when investing in certain movie distributor companies, investing in future releases, or to create a business plan for independent movies.
 Provide information that can be used to identify distribution strategies that are appropriate for a particular movie or type of movie.
Analysis and Charts:
Below you will find several charts and accompanying analysis to help explain specifics about the 2008 Movies studied.
The pie chart below displays the relative frequencies of the movie genres in this study. Note how Drama and Comedy movies represent over 64% of the movies.

The In-Theater Movie Ratings (MPAA) chart below displays the ratings of the movies represented in this study. Movies that have an "R" rating or are "Not Rated" are amongst the most common movies in this study.

The chart below shows the breakdown of the distributors tied to the movies in this study. It is important to note that five of the most well-known distributors make up the top five on the list and account for over 91% (or $1.6 billion) of the gross sales during 2008.
Generally speaking, dividing gross sales by tickets sold gives the same average price, $7.18, for every movie. Realistically, ticket prices vary, and people do use discount tickets when they go to the movie theater. So either gross sales or the number of tickets is a true number, and the other one was backed into using the average price. Additionally, in this data set the inflation-adjusted gross sales mirror the gross sales, so no inflation was factored in.
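To illustrate what "backed into" means here, the short sketch below derives ticket counts from gross sales using one average price, then shows that the implied price per ticket comes out the same for every movie. The gross figures and function names are hypothetical and are not taken from the study's data; only the $7.18 price comes from this report.

```python
# Hypothetical gross sales in dollars; NOT values from the report's data set.
hypothetical_gross = [533_345_358, 31_457_946, 1_250_000]

# Average ticket price stated in the report.
ASSUMED_PRICE = 7.18


def implied_tickets(gross, price=ASSUMED_PRICE):
    """Back into a ticket count by dividing gross sales by one average price."""
    return round(gross / price)


def implied_price(gross, tickets):
    """Recover the price per ticket from gross sales and a ticket count."""
    return gross / tickets


# Every movie ends up with the same implied price, which is the pattern
# seen in the data: one column was probably derived from the other.
prices = [
    round(implied_price(g, implied_tickets(g)), 2) for g in hypothetical_gross
]
print(prices)
```

If both columns had been measured independently, the implied price would be expected to vary from movie to movie.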

The histogram below depicts the 2008 gross sales in this study. This histogram and the two that follow are all skewed to the right.

The Ticket Sales Histogram (below) tells virtually the same story as the previous Histogram, both visually and based on the information shared earlier.

The Inflation-Adjusted Gross Sales Histogram (below) tells virtually the same story as the previous 2 Histograms, both visually and based on the information shared earlier.

The following charts detail the statistics for Gross Sales, Ticket Sales and Inflation-Adjusted Gross Sales. It's important to note that, because the histograms are skewed right, the median is a better measure of central tendency than the mean. As such, the 5-number summary is a better tool to use when you want to analyze these distributions. The distribution is skewed, so it is neither symmetric nor bell-shaped. Therefore, the standard deviation is not the best measure to use to describe the distribution. Standard deviation works well with symmetric, bell-shaped curves.
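As a small illustration of why the median is preferred for right-skewed data, the sketch below compares the mean and median of a made-up, right-skewed list and builds a 5-number summary. The values are hypothetical (in millions of dollars) and are not the study's data; the quartiles use the default method of Python's statistics.quantiles, which may differ slightly from StatCrunch's.

```python
import statistics

# Hypothetical right-skewed gross sales, in millions of dollars.
hypothetical_gross_millions = [0.2, 0.5, 1.1, 2.3, 4.0, 9.5, 18.0, 60.0, 533.0]

mean_gross = statistics.mean(hypothetical_gross_millions)
median_gross = statistics.median(hypothetical_gross_millions)

# Quartiles (default "exclusive" method); only Q1 and Q3 are used here.
q1, _, q3 = statistics.quantiles(hypothetical_gross_millions, n=4)

five_number_summary = (
    min(hypothetical_gross_millions),
    q1,
    median_gross,
    q3,
    max(hypothetical_gross_millions),
)

# One very large value pulls the mean far above the median.
print("mean:", round(mean_gross, 2), "median:", median_gross)
print("5-number summary:", five_number_summary)
```

The single large value drags the mean well above the typical movie, while the median stays near the middle of the data.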
Summary statistics:

Summary statistics:

Summary statistics:

Below you will see a relative frequency table which displays the share each Genre has from the movies in this study. This table supports the first graph in this report.
Frequency table results for Genre:

The frequency table below reflects the (MPAA) Ratings that the movies represent. This table supports the second graph in this report.
Frequency table results for MPAA:

The frequency table below reflects the Distributors associated with the movies in this study. The data supports the third graph in this report. This report would be useful if a group of investors wanted to see how many movies key distributors financed in 2008. If key distributors invested in several successful movies, the investors might consider investing in those distributors given the opportunity.
Frequency table results for Distributor:

Simple Linear Regression results:
Generally speaking, the stats below include a dependent variable and an independent variable. The dependent variable can be considered the “outcome” variable, and the independent variable can be considered the “predictor” variable. Sample size for the regression results is 79.
Rank and Gross Sales regression and scatter plot analysis:
The correlation coefficient is -.4676. This indicates a moderate negative relation between the two variables. The r-squared value is .21862683. The r-squared value is a good indicator of the strength of the relationship. That r-squared value indicates that movie rank explains 21.9% of the variance in gross sales.
The P-value comes from a null hypothesis significance test for each coefficient. Since the P-value here is less than .05, there is a low probability of getting a result like this through random sampling when there is no effect, that is, when the coefficient is 0 in the population. The P-value here lets us know the negative coefficient is a reliable number.
Note: The model P-value and the slope P-value are the same because there is only one predictor variable.
In the parameter estimates section, you’ll notice the slope (-155,258) and intercept (74.2M).
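The sketch below shows how the fitted line can be used, taking the coefficients from the regression output in this section (7.4230448E7 and 155258.17). The slope is entered as negative because the report describes a negative relation; the function and variable names are only illustrative.

```python
# Coefficients from the regression output (2008 Gross regressed on Rank).
INTERCEPT = 74_230_448.0  # 7.4230448E7
SLOPE = -155_258.17       # negative: gross falls as the rank number rises
R = -0.4676               # correlation coefficient


def predicted_gross(rank):
    """Predicted 2008 gross sales (dollars) for a given rank."""
    return INTERCEPT + SLOPE * rank


# Squaring r gives the share of variance explained (about 21.9%).
r_squared = R ** 2

# Rank at which the straight line predicts zero gross sales.
break_even_rank = -INTERCEPT / SLOPE

print(round(r_squared, 4))
print(round(predicted_gross(100)))
print(round(break_even_rank, 1))
```

The line predicts negative gross sales for ranks beyond roughly 478, which is one sign that a straight line fits this skewed data only loosely.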
Simple linear regression results:
Dependent Variable: 2008 Gross
Independent Variable: Rank
2008 Gross = 7.4230448E7 - 155258.17 Rank
Sample size: 79
R (correlation coefficient) = -0.4676
R-sq = 0.21862683
Estimate of error standard deviation: 6.2275852E7
Parameter estimates:
Analysis of variance table for regression model:


Gross Sales and Tickets Sold regression and scatter plot analysis:
The correlation coefficient is 1, which indicates a perfect positive linear relation between the two variables. The r-squared value is also 1, which means that with this data, tickets sold accounts for 100% of the variance in gross sales.
The P-value is the same value as in the Rank and Gross Sales regression analysis. Since the P-value here is less than .05, there is a low probability of getting a result like this through random sampling when there is no effect, that is, when the coefficient is 0 in the population.
In the parameter estimates section, you’ll notice the slope (7.18) and intercept (.44534105).
Simple linear regression results:
Dependent Variable: 2008 Gross
Independent Variable: Tickets Sold
2008 Gross = 0.44534105 + 7.18 Tickets Sold
Sample size: 79
R (correlation coefficient) = 1
R-sq = 1
Estimate of error standard deviation: 2.883
Parameter estimates:
Analysis of variance table for regression model:


Simple Linear Regression results (after outliers have been removed):
After removing outliers from the data, we see the following changes to the linear regression results. In the Rank and Gross Sales results (below), the r value (correlation coefficient) changed to -.6751, which indicates a stronger negative linear relation than the one we have with the outliers included (-.4676). By omitting the outliers, the highest-ranked movies are no longer included in the analysis. The highest-ranked movies are more of an anomaly compared to the many "average" ranked movies. The greatest (highly ranked) movies usually cost the most to make, and very few movies can be made with such a high price tag. When you take those few movies out of the mix, you are left with the many remaining movies that potentially have similar rankings based on a smaller price tag and their success in the theater.
The r-squared value also changed to .4557867 from .21862683. This means that movie rank now explains 45.6% of the variance in gross sales, a higher percentage than when the data included the outliers. The outliers were certainly skewing the results, and I expected to see this type of change after they were removed. You'll also notice that the P-values have not changed.
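This report does not state which rule was used to flag outliers. As one possible approach, the sketch below applies the common 1.5 x IQR fence rule to a made-up list of gross sales. The rule choice, the data, and the names are all assumptions for illustration, not the method actually used for this project.

```python
import statistics

# Hypothetical gross sales in millions of dollars (not the study's data).
hypothetical_gross_millions = [0.2, 0.5, 1.1, 2.3, 4.0, 9.5, 18.0, 60.0, 533.0]


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower_fence, upper_fence = iqr_fences(hypothetical_gross_millions)

# Keep only the values inside the fences; everything else is an outlier.
kept = [v for v in hypothetical_gross_millions if lower_fence <= v <= upper_fence]
removed = [v for v in hypothetical_gross_millions if v not in kept]

print("fences:", (lower_fence, upper_fence))
print("removed:", removed)
```

With right-skewed data like this, the rule removes only values from the high end, which matches the observation that the top-grossing movies drop out of the analysis.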
Simple linear regression results:
Dependent Variable: 2008 Gross
Independent Variable: Rank
2008 Gross = 1.0063407E7 - 19039.344 Rank
Sample size: 66
R (correlation coefficient) = -0.6751
R-sq = 0.4557867
Estimate of error standard deviation: 3817854.2
Parameter estimates:
Analysis of variance table for regression model:

In the Ticket Sales and Gross Sales results, there is no change to the correlation coefficient or r-squared values. I would not expect a change here because, as mentioned earlier in this report, the price of a ticket for every movie in this study is exactly $7.18. Whether the movie did very well or sold few tickets, the relationship between ticket sales and gross sales was the same. Taking out the outliers had no impact on the r and r-squared values. The new r value is 1: a perfect positive linear relation. We had this same r value when the outliers were included in the analysis. Note that the output below uses Tickets Sold as the dependent variable (the reverse of the earlier fit), so the slope appears as 0.13927576, which is 1/7.18. This is the same $7.18-per-ticket relationship found with the outliers included.
Simple linear regression results:
Dependent Variable: Tickets Sold
Independent Variable: 2008 Gross
Tickets Sold = 0.07323955 + 0.13927576 2008 Gross
Sample size: 66
R (correlation coefficient) = 1
R-sq = 1
Estimate of error standard deviation: 0.29656646
Parameter estimates:
Analysis of variance table for regression model:

The following two Summary Stats tables reflect the data without the outliers. When you compare them to the original summary reports (which include the outliers), it is clear that the stats have changed. By omitting the outliers, your maximum changes, which will affect other stats of the data (median, Q1, Q3, std deviation). This is true for gross sales and ticket sales.
Summary statistics:

Summary statistics:

Note: A new Dataset (StatProj1) was added to reflect the original data minus the outliers.
Phase 4: Testing a Hypothesis and Confidence Interval
Note: For checking purposes, calculations were done on a TI-83 calculator, manually, and in StatCrunch.
Testing a Hypothesis
1. Hypothesis:
According to the widely used movie data website www.thenumbers.com/market, the population mean (number of tickets sold) of all in-theater movies during 2008 is 1,876,928.711. I believe a subset of that data (the data on which my report is based: 2008 Moneymaking Movies), which represents a random sample of the movies across all genres, will have a mean that is lower than the population mean. I believe the mean will be lower because a total of 738 movies had theater ticket sales in 2008, and the subset I used for my project represents 79 movies. I believe a 10.7% sample is too small to align with the population mean.
Formally,
Ho: μ = 1.9M tickets
H1: μ < 1.9M tickets
2. Assumptions:
The sample is obtained by using simple random sampling. The sample size is greater than 30. The population mean, µ, is the parameter being tested. This is a left-tailed test.
3. Test Statistics
xbar = 3,170,930.2
σ = 9,749,089
n = 79
µ0 = 1,876,928.711
Calculator: Test, #1: Z-Test
Result: z: 1.1797345
To find the test statistic, z0, use the equation z0 = (xbar - µ0) / (σ / √n).
(3,170,930.2 - 1,876,928.711) / (9,749,089 / √79) = (1,294,001.489) / (1,096,858.208) = 1.1797345
normalcdf(-9999, 1.1797345, 0, 1)
p = .8809
4. P-Value: The probability of obtaining a sample mean of 3,170,930.2 or less from a population whose mean is 1.9M is .8809. This means that approximately 88 samples out of 100 would give a mean as low as or lower than the one obtained if the population mean were 1,876,928.711.
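As a cross-check of the calculator steps above, the sketch below recomputes the test statistic and the left-tail P-value from the same four inputs using Python's standard library. The variable names are my own; the numbers are the ones listed under Test Statistics.

```python
from math import sqrt
from statistics import NormalDist

# Inputs from the Test Statistics step.
x_bar = 3_170_930.2   # sample mean of tickets sold
sigma = 9_749_089     # standard deviation used in the Z-Test
n = 79                # sample size
mu_0 = 1_876_928.711  # hypothesized population mean

# Standard error of the mean, then the z statistic.
standard_error = sigma / sqrt(n)
z = (x_bar - mu_0) / standard_error

# Left-tailed test: P-value is the area to the left of z.
p_value = NormalDist().cdf(z)

print(round(z, 4), round(p_value, 4))
```

Because the sample mean is above the hypothesized mean, z is positive and the left-tail area is large, so the test cannot support the claim that the mean is lower.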
StatCrunch Calculations:
Hypothesis test results:
μ : mean of Variable
H0 : μ = 1876928.8
HA : μ < 1876928.8
Std. Dev. = 9749089

5. Conclusion:
P>α, or .8809>.05. I do not reject Ho (the null hypothesis). There is not sufficient evidence that the average in-theater ticket sales is less than 1,876,928.711. The p-value is far higher than α.
Confidence Interval:
1. Level: 95%
Xbar: 3,170,930.2
Std Dev: 9,749,089
n = 79
C-Level: .95
Calculator: Test #7 result: (1021127.56, 5320732.5)
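The interval can also be reproduced by hand from the inputs listed above. The sketch below is one way to do that in Python, using the normal critical value for 95% confidence; the variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

# Inputs from the confidence interval step.
x_bar = 3_170_930.2
sigma = 9_749_089
n = 79
confidence = 0.95

# Critical value z* leaves (1 - confidence) / 2 in each tail (about 1.96).
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Margin of error and interval endpoints.
margin = z_star * sigma / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

# The population mean from the hypothesis test, for comparison.
mu_0 = 1_876_928.711
contains_mu_0 = lower <= mu_0 <= upper

print(round(lower, 2), round(upper, 2), contains_mu_0)
```

The endpoints agree with the calculator result to within rounding.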
95% confidence interval results:
μ : mean of Variable
Std. Dev. = 9749089

2. Explanation of the Confidence Interval:
I am 95% confident that the population mean number of tickets sold is between 1,021,127.56 and 5,320,732.5. The interval is centered on the sample mean, 3,170,930.2, and the population mean (1,876,928.711) does in fact fall within it.
3. 95% confidence means:
95% of all sample means lie within 1.96 standard errors of the population mean, and 2.5% of the sample means lie in each tail. Additionally, a 95% level of confidence implies that if 100 different confidence intervals were constructed from 100 different random samples, I would expect about 95 of the intervals to include the population mean.
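The "95 out of 100 intervals" idea can be checked with a small simulation. The sketch below draws many samples from a made-up normal population with a known mean and counts how often the 95% interval captures that mean. The population values, sample size, and seed are all hypothetical and unrelated to the movie data.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage_rate(true_mean, sigma, n, trials, seed=0):
    """Fraction of 95% z-intervals (known sigma) that contain true_mean."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(0.975)
    half_width = z_star * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one sample and compute its mean.
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        # The interval covers true_mean when the sample mean is close enough.
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / trials


# Hypothetical population: mean 100, standard deviation 15.
rate = coverage_rate(true_mean=100.0, sigma=15.0, n=30, trials=2000)
print(rate)
```

The printed rate should land close to 0.95, though it varies a little with the seed and the number of trials.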
4. Confidence in my result:
I am confident in my results. As mentioned, the population mean falls within the confidence interval. Based on all of the tests performed on this set of data, I have no reason not to feel comfortable with the results.
Feb 15, 2010
Sorry about that, statcrunch kept giving me an error message!
Feb 15, 2010
Bill, you want to discuss the skewness of the distribution (outliers) and why that might. Is there a particular genre that is more popular than others or production company? Are the results as you expected?
Jan 18, 2010
Be sure to put the units of measure associated with your quantitative variables (i.e. millions)