
Data sets shared by StatCrunch members
Showing 1 to 15 of 204 data sets matching regression
Data Set/Description Owner Last edited Size Views
Movie Budgets and Box Office Earnings (Updated Fall 2016)
This data all comes from the following website, which tracks the financial performance of movies:

The “Budget”, “Domestic Gross”, and “Worldwide Gross” columns are each in millions of dollars.

ntorno8 Mar 14, 2017 266KB 2893
Times World University Rankings (2011-2016)
This data comes from the annual Times magazine rankings of universities across the world. The webpage for the Times 2016 rankings is listed above in the source.
The formula for the 2016 rankings is as follows:
30% for Teaching Rating
7.5% for International Outlook Rating
30% for Research Rating
30% for Citations Rating
2.5% for Industry Income Rating.
The “Total Score” from 2016 can be recreated using this formula.
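The weighting above can be sketched in a few lines of Python. This is only an illustration: the key names follow the columns of this data set, while the `total_score` helper and the example ratings are hypothetical and not taken from the data.

```python
# Weights from the 2016 formula described above (they sum to 100%).
WEIGHTS = {
    "Teaching_Rating": 0.30,
    "Inter_Outlook_Rating": 0.075,
    "Research_Rating": 0.30,
    "Citations_Rating": 0.30,
    "Industry_Income_Rating": 0.025,
}


def total_score(ratings):
    """Weighted sum of the five component ratings (each on a 0-100 scale)."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


# Hypothetical ratings for one university, made up for illustration.
example = {
    "Teaching_Rating": 80.0,
    "Inter_Outlook_Rating": 60.0,
    "Research_Rating": 90.0,
    "Citations_Rating": 70.0,
    "Industry_Income_Rating": 40.0,
}

print(round(total_score(example), 1))  # 24 + 4.5 + 27 + 21 + 1 = 77.5
```

Small differences from the published score are possible if the published component ratings were rounded.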

World_Rank: University rank for a given year
University_Name: The name of the university
Country: Location of university
Teaching_Rating: Rating on a 0-100 scale of the quality of teaching at the university. This rating is based on the institution’s reputation for teaching, its student/staff ratio, its ratio of PhDs to undergraduate degrees awarded, and its institutional income/academic staff ratio.
Inter_Outlook_Rating: Rating on a 0-100 scale of the international makeup of a university. This rating is based on the international student percentage, the international staff percentage, and the percentage of research papers from the university that include at least one international author.
Research_Rating: Rating on a 0-100 scale of the quality of research at the university. This rating is based on the university’s reputation, its research income/academic staff ratio, and its production of scholarly papers.
Citations_Rating: Rating on a 0-100 scale based on the normalized average number of citations per paper from the university (how often the research from the university is cited by other papers).
Industry_Income_Rating: Rating on a 0-100 scale grading how much companies are willing to invest in the university’s research. The rating is calculated based on the research income from businesses per academic staff member.
Total_Score: The final score used to determine the university ranking, based on Teaching_Rating, Inter_Outlook_Rating, Research_Rating, Citations_Rating, and Industry_Income_Rating.
Num_Students: Total number of students in a given year
Student/Staff_Ratio: Number of students per academic staff member
%_Inter_Students: Percentage of the student body that comes from a foreign country
%_Female_Students: Percentage of the student body that is female
Year: Academic year that the ranking was released. For example, 2016 denotes the 2015-2016 academic year.
statcrunchhelp Apr 5, 2016 254KB 1921
Report on the Loss of the ‘Titanic’ (S.S.) (1990), British Board of Trade Inquiry Report (reprint), Gloucester, UK: Allan Sutton Publishing. Taken from the Journal of Statistics Education archive. Dr. Craig Slinkman has recoded the data as self-explanatory nominal variables.
craig_slinkman Mar 23, 2010 61KB 1601
California Home Prices, 2009
This dataset is a collection of real estate listings from San Luis Obispo County, California, and some surrounding locations in 2009. The prices are the list prices at the time this dataset was created. For more information about this data, go to the website source listed above.
statcrunchhelp Mar 11, 2016 46KB 1361
Mother and Daughter Heights.xls
This data set is Galton's Mother and Daughter data set as used in Sanford Weisberg's Applied Linear Regression, 3rd Edition.
craig_slinkman Apr 10, 2010 13KB 4124
Federal Food Assistance Participation
This data primarily comes from the following source: United States Department of Agriculture, Food and Nutrition Service. This dataset also incorporates data from another StatCrunch dataset: US Workforce Participation

Year: The year for each data value
Average Federal Food Assistance Participation in Thousands: Number of individuals in the US, in thousands, who took part in SNAP (Supplemental Nutrition Assistance Program) during the given year.
% US Population on Federal Food Assitance: Percentage of the US population that is in the SNAP program and receiving food aid.
Change of % (US Population on Federal Food Assistance): The change in the percentage of the US population that is receiving food assistance from SNAP.
Presidential Control: Political party of the president.
Senate Control: Political party of the Senate majority.
House Control: Political party of the House of Representatives majority.
Legislative Branch (House and Senate): Combined control of the Senate and House of Representatives.
Male Inactivity Rate Aged 25-54: Defined as the proportion of the male population aged 25-54 that is not in the labour force. Common reasons for leaving the labour force: college, retirement, staying at home, or being unable to find work and no longer trying.
Change of Rate (Male Inactivity Rate Aged 25-54): The change in the inactivity rate, calculated as the current year minus the previous year.
Female Inactivity Rate Aged 25-54: Defined as the proportion of the female population aged 25-54 that is not in the labour force.
Change of Rate (Female Inactivity Rate Aged 25-54): The change in the inactivity rate, calculated as the current year minus the previous year.
Annual Average Workforce Participation Rate: Defined by the Bureau of Labor Statistics as "the percentage of the population [16 years and older] that is either employed or unemployed (that is, either working or actively seeking work)." Note that 2015's Annual Average is calculated using the first 11 months.
Change of Rate (Annual Workforce Participation Rate): The change in the workforce participation rate, calculated as the current year minus the previous year.
statcrunchhelp Jan 8, 2016 10KB 1036
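The "Change of Rate" columns in the data set above are described as the current year minus the previous year. A minimal Python sketch of that calculation follows. The `year_over_year_change` helper and the sample rates are hypothetical, and skipping years whose predecessor is missing is an assumption of this sketch, not something the data set description states.

```python
def year_over_year_change(values_by_year):
    """Map each year to (value that year) - (value the previous year).

    The earliest year has no previous value, so it is left out. Years whose
    immediate predecessor is missing are also left out (an assumption here).
    """
    years = sorted(values_by_year)
    return {
        year: values_by_year[year] - values_by_year[prev]
        for prev, year in zip(years, years[1:])
        if year - prev == 1
    }


# Hypothetical rates in percent, made up for illustration.
rates = {2012: 11.0, 2013: 11.5, 2014: 11.25}
changes = year_over_year_change(rates)
print(changes)  # {2013: 0.5, 2014: -0.25}
```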
Top 100 Retailers 2015
This dataset comes from the National Retail Federation and tracks the top retail chains in the US for 2015 based on their 2014 sales. The original data can be found at the webpage listed as the source. Note that these retailers' sales include all channels, including internet sales.
statcrunchhelp Mar 14, 2016 7KB 2694
Stats from the major league baseball teams for 2013. The last column I added denotes AL for American League and NL for National League. One could possibly conduct a two-sample means test, for example, to find out whether the average runs for the two leagues are equal. Or there are of course lots of regressions one could run.
eykolo@stat.tamu.edu Nov 4, 2013 3KB 1707
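The two-sample means test suggested in the description above could be sketched as follows. This uses Welch's unequal-variance t statistic, which is one common choice and an assumption of this sketch; the description does not say which version of the test to use. The `welch_t` helper and the run totals are hypothetical and are not taken from the data set.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for Welch's test."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical season run totals per team, made up for illustration.
al_runs = [700, 650, 720, 680, 710]
nl_runs = [640, 660, 690, 620, 655]

t_stat, dof = welch_t(al_runs, nl_runs)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

The p-value would then come from a t distribution with `dof` degrees of freedom, for example from a statistics package or a t table.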
MLB Home Attendance vs. Runs Scored 2015
This data comes from the 2015 baseball season and tracks, for each team, the number of home games, the total attendance at home games, the number of runs scored by that team, the runs scored against that team, the league the team plays in, and the number of wins the team recorded in the regular season.
frompearsonbooks Jun 14, 2016 1KB 1090
Roller Coasters Data
This dataset looks at some of the roller coasters across the US and various other countries.
Name: Name of roller coaster
Park: Amusement park for roller coaster
City: City for amusement park
State: State abbreviation
Country: Country of the roller coaster. US: United States, MX: Mexico, CR: Costa Rica, GT: Guatemala, CO: Colombia, VE: Venezuela, BR: Brazil, AR: Argentina, CL: Chile, EQ: Ecuador, PE: Peru, F: France, D: Germany
Type: S: Steel, W: Wood
Constructor: Type of build for the roller coaster
Height: Height in meters
Speed: Speed in miles per hour (mph)
Length: Length in meters
Inversions: Yes if there are inversions, no if not
Duration: Duration of ride in seconds
GForce: Max g-force
Opened: Year it opened
Region: Geographic region for the roller coaster
ntorno8 Sep 15, 2016 48KB 12644
National Longitudinal Youth Survey
The Youth survey consists of a nationally representative sample of youths who were 14 to 20 years old as of December 31, 1999.
This dataset tracks the Age, Height (in inches), Weight (in pounds), Gender, and the self-reported multiple-choice answers to "How would you describe your weight?" for the individuals.
statcrunchhelp Mar 8, 2016 330KB 896
Body Temperature
Data taken from the Journal of Statistics Education online data archive. That archive in turn got the data from an article in the Journal of the American Medical Association. (Mackowiak, et al., "A Critical Appraisal of 98.6 Degrees F …", vol. 268, pp. 1578-80, 1992).
"Body Temp" is measured in degrees Fahrenheit
"Heart rate" is the resting beats per minute
statcrunchhelp Mar 8, 2016 2KB 2804
Seating Choice versus GPA (For 3 rows, with Text and Indicator Columns)
This dataset contains hypothetical (I believe) data on GPA for students who sit in the front, middle, and back rows of a classroom, as well as a hypothetical gender variable. The data are shown using both text variables (e.g., "front" and "middle") and 0/1 indicator variables for the row and gender variables. This dataset is useful for demonstrating the different ways that StatCrunch can compare means based on two factors: (a) the text factor columns can be used in a two-way ANOVA; and (b) the 0/1 indicator columns can be used in multiple regression. (Because of StatCrunch's current limitation on equal cells, the 0/1 variables only use the first and middle rows.) Both procedures give the same p-value and the same conclusion (as long as the interaction term is centered), thus highlighting the similarity of statistical procedures and StatCrunch's flexibility.
bartonpoulson Apr 7, 2010 1KB 3878
Seating Choice versus GPA (Stacked & Split Columns for Front & Back Rows)
This dataset contains hypothetical (I believe) data on GPA for students who sit in the front and back rows of a classroom. The data are shown in several ways: (a) two separate columns (one for the front row GPA and another for the back row GPA); (b) stacked with one column to indicate front or back row and another column with the GPAs; and (c) the row column repeated as a 0/1 indicator variable. This dataset is useful for comparing the different ways that StatCrunch can compare the means of two groups: (a) the two columns of scores (front and back) can be used in the 2-sample t-test or a one-way ANOVA; (b) the stacked text column (front/back) with a separate column for GPA can also be used for one-way ANOVA; and (c) the 0/1 indicator column and stacked GPAs can be used with correlation and regression. Every procedure gives the same p-value and the same conclusion, thus highlighting the similarity of statistical procedures and StatCrunch's flexibility.
bartonpoulson Apr 7, 2010 465B 1833
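The equivalence described above can be checked numerically. The sketch below compares the pooled-variance two-sample t statistic with the t statistic for the slope of a regression of GPA on a 0/1 row indicator. It assumes the pooled (equal-variance) form of the t-test, since that is the form that matches regression exactly. The helper names and GPA values are hypothetical and are not taken from the data set.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(front, back):
    """Pooled-variance two-sample t statistic for mean(back) - mean(front)."""
    n1, n2 = len(front), len(back)
    sp2 = ((n1 - 1) * variance(front) + (n2 - 1) * variance(back)) / (n1 + n2 - 2)
    return (mean(back) - mean(front)) / sqrt(sp2 * (1 / n1 + 1 / n2))


def slope_t(x, y):
    """t statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope / sqrt(sse / (n - 2) / sxx)


# Hypothetical GPAs, made up for illustration.
front = [3.6, 3.2, 3.8, 3.4]
back = [2.9, 3.1, 2.6, 3.0]

# Stacked layout: 0/1 indicator (0 = front, 1 = back) plus one GPA column.
indicator = [0] * len(front) + [1] * len(back)
gpa = front + back

t_two_sample = pooled_t(front, back)
t_regression = slope_t(indicator, gpa)
print(t_two_sample, t_regression)
```

Both statistics have n - 2 degrees of freedom, so equal t values imply equal p-values. The one-way ANOVA F statistic for two groups is the square of this t value.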
Home Runs and Strike Outs for 2004 Boston Red Sox by Handedness
These data show home runs and strike outs for the 12 players from the Boston Red Sox who had more than 200 at-bats in the 2004 season (the first year they won the World Series after the 86-year Curse of the Bambino). It also shows whether the players bat left-handed or as switch hitters, both of which are coded as 0/1 (No/Yes, respectively) indicator variables (also known as dummy variables), as well as a text L/R/LR variable. These data were used for a demonstration for bivariate and multiple regression.
bartonpoulson Nov 3, 2009 375B 1048
