
Data sets shared by StatCrunch members
Showing 1 to 15 of 161 data sets matching regression
Data Set/Description  Share  Owner  Last edited  Size  Views
MLB Home Attendance vs. Runs Scored 2015
This data comes from the 2015 baseball season and tracks the number of home games, the total attendance at home games, the number of runs scored by that team, the runs scored on that team, the league they play in, and the number of wins the team recorded in the regular season.  yes  frompearsonbooks  Jun 14, 2016  1KB  4 
Times World University Rankings (2011-2016)
This data comes from the annual Times magazine rankings of universities across the world. The webpage for the Times 2016 rankings is listed above in the source.
The formula for the 2016 rankings is as follows: 30% for Teaching Rating, 7.5% for International Outlook Rating, 30% for Research Rating, 30% for Citations Rating, and 2.5% for Industry Income Rating. The “Total Score” from 2016 can be recreated using this formula.
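As a rough sketch of how the 2016 "Total Score" could be recreated from that formula, the snippet below applies the quoted weights to one row of ratings. The dictionary keys follow the column names listed in the table below; the example ratings are hypothetical values invented for illustration and are not taken from the data set.

```python
# Weights taken from the 2016 formula quoted above (they sum to 1.0).
WEIGHTS = {
    "Teaching_Rating": 0.30,
    "Inter_Outlook_Rating": 0.075,
    "Research_Rating": 0.30,
    "Citations_Rating": 0.30,
    "Industry_Income_Rating": 0.025,
}


def total_score(row):
    """Weighted sum of the five ratings for one university (one row)."""
    return sum(weight * row[column] for column, weight in WEIGHTS.items())


# Hypothetical ratings, invented purely for illustration (not real data).
example_row = {
    "Teaching_Rating": 80.0,
    "Inter_Outlook_Rating": 60.0,
    "Research_Rating": 90.0,
    "Citations_Rating": 70.0,
    "Industry_Income_Rating": 40.0,
}

print(round(total_score(example_row), 1))
```

Comparing the recomputed value with the Total_Score column for the 2016 rows is one way to check the formula; small differences may come from rounding in the published ratings.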
Column  Description
World_Rank  University rank for a given year
University_Name  The name of the university
Country  Location of the university
Teaching_Rating  Rating on a 0-100 scale of the quality of teaching at the university. This rating is based on the institution’s reputation for teaching, its student/staff ratio, its ratio of PhDs to undergraduate degrees awarded, and its ratio of institutional income to academic staff.
Inter_Outlook_Rating  Rating on a 0-100 scale of the international makeup of a university. This rating is based on the international student percentage, the international staff percentage, and the percentage of research papers from the university that include at least one international author.
Research_Rating  Rating on a 0-100 scale of the quality of research at the university. This rating is based on the university’s reputation, its ratio of research income to academic staff, and its production of scholarly papers.
Citations_Rating  Rating on a 0-100 scale based on the normalized average number of citations per paper from the university (how often the university's research is cited by other papers).
Industry_Income_Rating  Rating on a 0-100 scale grading how much companies are willing to invest in the university’s research. The rating is calculated from the research income from businesses per academic staff member.
Total_Score  The final score used to determine the university ranking, based on Teaching_Rating, Inter_Outlook_Rating, Research_Rating, Citations_Rating, and Industry_Income_Rating.
Num_Students  Total number of students in a given year
Student/Staff_Ratio  Number of students per academic staff member
%_Inter_Students  Percentage of the student body who come from a foreign country
%_Female_Students  Percentage of the student body that is female
Year  Academic year in which the ranking was released. For example, 2016 denotes the 2015-2016 academic year.
 yes  statcrunchhelp  Apr 5, 2016  254KB  663 
Top 100 Retailers 2015
This dataset comes from the National Retail Federation and tracks the top retail chains in the US for 2015 based on their 2014 sales. The original data can be found at the webpage listed as the source. Note that these retailers' sales include all sorts of avenues, including internet sales.  yes  statcrunchhelp  Mar 14, 2016  7KB  570 
California Home Prices, 2009
This dataset is a collection of real estate listings from 2009 for San Luis Obispo County, California, and some locations around it. The prices are the list prices at the time this dataset was created. For more information about this data, go to the website source listed above.  yes  statcrunchhelp  Mar 11, 2016  46KB  764 
National Longitudinal Youth Survey
The Youth survey consists of a nationally representative sample of youths who were 14 to 20 years old as of December 31, 1999.
This dataset tracks the Age, Height (in inches), Weight (in pounds), Gender, and the self-reported multiple-choice answers to "How would you describe your weight?" for the individuals.  yes  statcrunchhelp  Mar 8, 2016  330KB  428 
Body Temperature
Data taken from the Journal of Statistics Education online data archive. That archive in turn got the data from an article in the Journal of the American Medical Association (Mackowiak, et al., "A Critical Appraisal of 98.6 Degrees F …", vol. 268, pp. 1578-80, 1992).
"Body Temp" is measured in degrees Fahrenheit.
"Heart rate" is the resting heart rate in beats per minute.  yes  statcrunchhelp  Mar 8, 2016  2KB  812 
Regression and Correlation worksheet.xlsx
M4_Regression and Correlation  yes  hollypet  Feb 3, 2016  139B  432 
Federal Food Assistance Participation
This primarily comes from the following source: United States Department of Agriculture: Food and Nutrition Service. This dataset also incorporates data from another StatCrunch dataset: US Workforce Participation
Column  Description
Year  The year for each data value
Average Federal Food Assistance Participation in Thousands  Number of individuals in the US who took part in SNAP (Supplemental Nutrition Assistance Program) during the given year.
% US Population on Federal Food Assistance  % of the US population that is in the SNAP program and receiving aid with food.
Change of % (US Population on Federal Food Assistance)  The change in the percentage of the US population that is receiving food assistance from SNAP.
Presidential Control  Political party of the president.
Senate Control  Political party of the Senate majority.
House Control  Political party of the House of Representatives majority.
Legislative Branch (House and Senate)  Combined control of the Senate and House of Representatives.
Male Inactivity Rate Aged 25-54  Defined as the proportion of the male population aged 25-54 that is not in the labour force. Common reasons for leaving the labour force: college, retirement, staying at home, or being unable to find work and no longer trying.
Change of Rate (Male Inactivity Rate Aged 25-54)  The change in the inactivity rate, calculated as the current year minus the previous year.
Female Inactivity Rate Aged 25-54  Defined as the proportion of the female population aged 25-54 that is not in the labour force.
Change of Rate (Female Inactivity Rate Aged 25-54)  The change in the inactivity rate, calculated as the current year minus the previous year.
Annual Average Workforce Participation Rate  Defined by the Bureau of Labor Statistics as "the percentage of the population [16 years and older] that is either employed or unemployed (that is, either working or actively seeking work)." Note that 2015's Annual Average is calculated using the first 11 months.
Change of Rate (Annual Workforce Participation Rate)  The change in the workforce participation rate, calculated as the current year minus the previous year.
 yes  statcrunchhelp  Jan 8, 2016  10KB  540 
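The "Change of Rate" columns in the Federal Food Assistance Participation data set are described as the current year minus the previous year. A minimal sketch of that calculation follows; the function name and the sample rates are hypothetical, chosen only to illustrate the arithmetic.

```python
def yearly_change(values):
    """Return current-minus-previous differences for a yearly series.

    The first year has no previous year, so its change is None.
    """
    changes = [None]
    for previous, current in zip(values, values[1:]):
        changes.append(current - previous)
    return changes


# Hypothetical rates (in percent) for four consecutive years; not real data.
rates = [10.0, 10.5, 10.25, 11.0]
print(yearly_change(rates))  # [None, 0.5, -0.25, 0.75]
```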
Text Messaging Activity  yes  12266555_ecollege_kentmlp  Sep 26, 2015  2KB  1334 
Nonlinear_Regression_world_population (1).xlsx  yes  ealgephantom  Dec 4, 2014  313B  542 
Baseball2013.xlsx
Stats from the Major League Baseball teams for 2013. The last column, which I added, denotes AL for American League and NL for National League. One could conduct a two-sample means test, for example, to find out whether the average runs for the two leagues are equal. There are, of course, also lots of regressions one could run.  yes  eykolo@stat.tamu.edu  Nov 4, 2013  3KB  1577 
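For the two-sample means test suggested in the Baseball2013.xlsx description, one possible sketch is the Welch (unequal-variance) t statistic computed with Python's standard library. Using the Welch form is an assumption here, since the description does not say which variant is intended, and the run totals below are hypothetical placeholders rather than values from the file.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch two-sample t statistic for the difference in means (a minus b)."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical season run totals for a few teams per league; not real data.
al_runs = [700, 650, 720, 680]
nl_runs = [640, 660, 630, 670]
print(welch_t(al_runs, nl_runs))
```

Turning the statistic into a p-value requires the t distribution, which a statistics package such as StatCrunch handles directly.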
Regression: Cigarettes Lung Kidney Leukemia Bladder
"Cigarette smoking and cancers of the urinary tract: Geographic variation in the United States"
Journal of the National Cancer Institute (vol. 41, no. 5, November, 1968), pp. 1205-1211; table from pp. 1206-1207.
Joseph F. Fraumeni, Jr.
Oxford University Press
Units: cigarettes sold per capita, cancer deaths per 100,000  yes  phil_larson  Sep 22, 2013  2KB  2574 
Low Birth Weight Study
SOURCE: Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition
Data were collected at Baystate
Medical Center, Springfield, Massachusetts during 1986.
DESCRIPTIVE ABSTRACT:
The goal of this study was to identify risk factors associated with
giving birth to a low birth weight baby (weighing less than 2500 grams).
Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.
LIST OF VARIABLES:
Columns  Variable                                                   Abbreviation

2-4      Identification Code                                        ID
10       Low Birth Weight (0 = Birth Weight >= 2500g,               LOW
         1 = Birth Weight < 2500g)
17-18    Age of the Mother in Years                                 AGE
23-25    Weight in Pounds at the Last Menstrual Period              LWT
32       Race (1 = White, 2 = Black, 3 = Other)                     RACE
40       Smoking Status During Pregnancy (1 = Yes, 0 = No)          SMOKE
48       History of Premature Labor (0 = None, 1 = One, etc.)       PTL
55       History of Hypertension (1 = Yes, 0 = No)                  HT
61       Presence of Uterine Irritability (1 = Yes, 0 = No)         UI
67       Number of Physician Visits During the First Trimester      FTV
         (0 = None, 1 = One, 2 = Two, etc.)
73-76    Birth Weight in Grams                                      BWT
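The LOW indicator in the variable list is defined directly from BWT, so it can be rebuilt from the birth weights as a consistency check. The sketch below assumes the 2500 g cutoff stated above; the function name and the sample weights are hypothetical.

```python
LOW_CUTOFF_GRAMS = 2500  # cutoff stated in the variable list


def low_birth_weight(bwt_grams):
    """Return 1 if the birth weight is below 2500 g, otherwise 0."""
    return 1 if bwt_grams < LOW_CUTOFF_GRAMS else 0


# Hypothetical birth weights in grams; not rows from the data set.
weights = [2499, 2500, 3100]
print([low_birth_weight(w) for w in weights])  # [1, 0, 0]
```

Applied to the full BWT column, the indicator should total 59 ones and 130 zeros if it matches the counts given in the abstract.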

PEDAGOGICAL NOTES:
These data have been used as an example of fitting a multiple
logistic regression model.
STORY BEHIND THE DATA:
Low birth weight is an outcome that has been of concern to physicians
for years. This is due to the fact that infant mortality rates and birth
defect rates are very high for low birth weight babies. A woman's behavior
during pregnancy (including diet, smoking habits, and receiving prenatal care)
can greatly alter the chances of carrying the baby to term and, consequently,
of delivering a baby of normal birth weight.
The variables identified in the code sheet given in the table have been
shown to be associated with low birth weight in the obstetrical literature. The
goal of the current study was to ascertain if these variables were important
in the population being served by the medical center where the data were
collected.
References:
1. Hosmer and Lemeshow, Applied Logistic Regression, Wiley, (1989).
 yes  wikipeterson  Jul 23, 2012  6KB  4356 
Baseball data for correlation and regression
This table shows the total number of runs scored, at bats, hits, etc. for each of the 30 MLB teams for the 2009-2011 seasons.
Correlations and linear regression models can be calculated between the different numeric variables. A good exercise is to see which variables correlate most strongly with runs_scored.
As emphasized in the movie Moneyball, some of the classic metrics such as batting_avg are not as good as newer metrics like OBP (on base percentage), SLG (slugging percentage), or OPS (on base plus slugging).
A guide to a few of the variables that may not be self-explanatory:
Runs_Scored: The total of all runs (points) the baseball team scored by the end of the season.
Batting_avg: This is equal to the number of hits divided by at_bats.
OBP: On Base Percentage. Similar to batting average, except that it takes into account walks and hit-by-pitch. Some players who don't have high batting averages manage to get walked quite frequently.
SLG: Slugging. This weights hits to first base as 1 point, hits to second base as 2 points, third as 3, home runs as 4, and divides the total by the number of at bats.
OPS: On Base Plus Slugging. This is just OBP added to SLG.  yes  mileschen  Apr 17, 2012  6KB  1907 
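The batting metrics described for the baseball correlation data set can be sketched as small functions. Batting average, SLG, and OPS follow the descriptions above. The OBP formula, including its sacrifice-fly term, is the commonly used definition and is an assumption here, since the description only says that OBP accounts for walks and hit-by-pitch. All counts in the example are hypothetical.

```python
def batting_avg(hits, at_bats):
    """Hits divided by at bats."""
    return hits / at_bats


def obp(hits, walks, hit_by_pitch, at_bats, sac_flies=0):
    """On base percentage (commonly used definition; assumed, see lead-in)."""
    reached_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sac_flies
    return reached_base / opportunities


def slg(singles, doubles, triples, home_runs, at_bats):
    """Slugging: total bases (1, 2, 3, 4 per hit type) divided by at bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def ops(obp_value, slg_value):
    """On base plus slugging."""
    return obp_value + slg_value


# Hypothetical counts: 150 hits = 100 singles, 30 doubles, 5 triples, 15 home runs.
on_base = obp(hits=150, walks=50, hit_by_pitch=5, at_bats=500, sac_flies=5)
slugging = slg(singles=100, doubles=30, triples=5, home_runs=15, at_bats=500)
print(batting_avg(150, 500), on_base, slugging, ops(on_base, slugging))
```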
Cigarette Consumption vs CHD Mortality
Now that cigarette smoking has been clearly tied to lung cancer, researchers are focusing on possible links to other diseases. The data below show annual rates of cigarette consumption and deaths from coronary heart disease for several nations. Some public health officials are urging that the US adopt a national goal of cutting cigarette consumption in half over the next decade.
Examine these data and write a report. In your report you should:
1. Include appropriate graphs (e.g. scatterplot, residual plot) and statistics (e.g. mean and SD);
2. Describe the association between cigarette smoking and coronary heart disease;
3. Create a linear model;
4. Evaluate the strength and appropriateness of your model;
5. Interpret the slope and y-intercept of the line;
6. Use your model to estimate the potential benefits of reaching the national goal proposed for the US. That is, based on your linear model, if the US were to cut its cigarette consumption in half (from 3900 to 1950), what does the linear model predict would happen to the CHD rate?
7. You should use StatCrunch to generate nice-looking graphs and output as needed. Be sure to size them appropriately. There is no need for an 8x10 scatterplot; make your graphs about 3x3. You should scale them in StatCrunch first, then copy and paste into Word.
 yes  smcdaniel04  Sep 29, 2011  267B  3550 
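Steps 3 and 6 of the report outline above can be sketched as a least-squares fit followed by a prediction at the two consumption levels. The data points below are hypothetical and lie exactly on a line so the arithmetic is easy to follow; they are not the values from the data set, and the function names are illustrative only.

```python
def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x_value):
    """Predicted response from the fitted line."""
    return intercept + slope * x_value


# Hypothetical (consumption, CHD rate) pairs; invented, not from the data set.
consumption = [1000, 2000, 3000, 4000]
chd_rate = [100, 150, 200, 250]

slope, intercept = fit_line(consumption, chd_rate)
# Predicted change in the CHD rate when consumption drops from 3900 to 1950.
change = predict(slope, intercept, 1950) - predict(slope, intercept, 3900)
print(slope, intercept, change)
```

The predicted change equals the slope times the change in consumption (1950 minus 3900), which is why interpreting the slope in step 5 matters for step 6.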

