Data sets shared by StatCrunch members
Showing 1 to 15 of 137 data sets matching regression
Data Set/Description | Share | Owner | Last edited | Size | Views
Regression and Correlation worksheet.xlsx
M4_Regression and Correlation
Share: yes | Owner: hollypet | Last edited: Feb 3, 2016 | Size: 139B | Views: 44
Federal Food Assistance Participation
These data come primarily from the United States Department of Agriculture, Food and Nutrition Service. This dataset also incorporates data from another StatCrunch dataset: US Workforce Participation.

Year: The year for each data value.
Average Federal Food Assistance Participation in Thousands: Number of individuals in the US who took part in SNAP (Supplemental Nutrition Assistance Program) during the given year.
% US Population on Federal Food Assitance: % of the US population that is in the SNAP program and receiving food aid.
Change of % (US Population on Federal Food Assistance): The change in the percentage of the US population that is receiving food assistance from SNAP.
Presidential Control: Political party of the president.
Senate Control: Political party of the Senate majority.
House Control: Political party of the House of Representatives majority.
Legislative Branch (House and Senate): Combined control of the Senate and House of Representatives.
Male Inactivity Rate Aged 25-54: Defined as the proportion of the male population aged 25-54 that is not in the labour force. Common reasons for leaving the labour force: college, retirement, staying at home, or being unable to find work and no longer trying.
Change of Rate (Male Inactivity Rate Aged 25-54): The change in the inactivity rate, calculated as the current year minus the previous year.
Female Inactivity Rate Aged 25-54: Defined as the proportion of the female population aged 25-54 that is not in the labour force.
Change of Rate (Female Inactivity Rate Aged 25-54): The change in the inactivity rate, calculated as the current year minus the previous year.
Annual Average Workforce Participation Rate: Defined by the Bureau of Labor Statistics as "the percentage of the population [16 years and older] that is either employed or unemployed (that is, either working or actively seeking work)." Note that 2015's Annual Average is calculated using the first 11 months.
Change of Rate (Annual Workforce Participation Rate): The change in the workforce participation rate, calculated as the current year minus the previous year.
Share: yes | Owner: statcrunchhelp | Last edited: Jan 8, 2016 | Size: 10KB | Views: 96
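The "Change of Rate" columns above are described as the current year minus the previous year. A minimal sketch of that arithmetic follows; the rates in `rates_by_year` are hypothetical placeholders, not values from the data set, and the function name is invented for illustration.

```python
# Hypothetical yearly rates (percent). These are NOT values from the data set.
rates_by_year = {2012: 11.4, 2013: 11.7, 2014: 11.9, 2015: 11.8}


def year_over_year_change(series):
    """Return {year: value for that year minus value for the previous listed year}."""
    years = sorted(series)
    return {
        curr: round(series[curr] - series[prev], 4)
        for prev, curr in zip(years, years[1:])
    }


changes = year_over_year_change(rates_by_year)
for year, delta in changes.items():
    print(year, delta)
```

The earliest year has no predecessor, so it gets no change value. The sketch also assumes the years are consecutive; a gap in the series would silently produce a multi-year difference.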
Text Messaging Activity
Share: yes | Owner: 12266555_ecollege_kentmlp | Last edited: Sep 26, 2015 | Size: 2KB | Views: 885
Nonlinear_Regression_world_population (1).xlsx
Share: yes | Owner: ealgephantom | Last edited: Dec 4, 2014 | Size: 313B | Views: 455
Stats from the major league baseball teams for 2013. The last column I added denotes AL for American League and NL for National League. One could possibly conduct a two-sample means test, for example, to find out whether the average runs for the two leagues are equal. Or there are of course lots of regressions one could run.
Share: yes | Owner: eykolo@stat.tamu.edu | Last edited: Nov 4, 2013 | Size: 3KB | Views: 1358
Regression: Cigarettes Lung Kidney Leukemia Bladder
Joseph F. Fraumeni, Jr., "Cigarette smoking and cancers of the urinary tract: Geographic variation in the United States," Journal of the National Cancer Institute, vol. 41, no. 5 (November 1968), pp. 1205-1211; table from pp. 1206-1207. Oxford University Press. Units: cigarettes sold per capita; cancer deaths per 100,000.
Share: yes | Owner: phil_larson | Last edited: Sep 22, 2013 | Size: 2KB | Views: 2309
Low Birth Weight Study
SOURCE: Hosmer and Lemeshow (2000), Applied Logistic Regression: Second Edition. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.
DESCRIPTIVE ABSTRACT: The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.
LIST OF VARIABLES:
Columns | Variable | Abbreviation
2-4 | Identification Code | ID
10 | Low Birth Weight (0 = Birth Weight >= 2500g, 1 = Birth Weight < 2500g) | LOW
17-18 | Age of the Mother in Years | AGE
23-25 | Weight in Pounds at the Last Menstrual Period | LWT
32 | Race (1 = White, 2 = Black, 3 = Other) | RACE
40 | Smoking Status During Pregnancy (1 = Yes, 0 = No) | SMOKE
48 | History of Premature Labor (0 = None, 1 = One, etc.) | PTL
55 | History of Hypertension (1 = Yes, 0 = No) | HT
61 | Presence of Uterine Irritability (1 = Yes, 0 = No) | UI
67 | Number of Physician Visits During the First Trimester (0 = None, 1 = One, 2 = Two, etc.) | FTV
73-76 | Birth Weight in Grams | BWT
PEDAGOGICAL NOTES: These data have been used as an example of fitting a multiple logistic regression model.
STORY BEHIND THE DATA: Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman's behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight. The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected.
References: 1. Hosmer and Lemeshow, Applied Logistic Regression, Wiley (1989).
Share: yes | Owner: wikipeterson | Last edited: Jul 23, 2012 | Size: 6KB | Views: 3664
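The code sheet in the Low Birth Weight description gives column positions for a fixed-width text file. As a rough sketch, and assuming the raw file follows that layout exactly (1-indexed, inclusive columns, integer fields), one record could be sliced as below. The helper names and the sample record are hypothetical; the data set as shared on StatCrunch is already tabular and would not need this step.

```python
# (abbreviation, first column, last column), 1-indexed and inclusive,
# transcribed from the code sheet in the description.
LAYOUT = [
    ("ID", 2, 4), ("LOW", 10, 10), ("AGE", 17, 18), ("LWT", 23, 25),
    ("RACE", 32, 32), ("SMOKE", 40, 40), ("PTL", 48, 48), ("HT", 55, 55),
    ("UI", 61, 61), ("FTV", 67, 67), ("BWT", 73, 76),
]
RECORD_WIDTH = 76


def parse_record(line):
    """Slice one fixed-width line into a dict of integer fields."""
    padded = line.ljust(RECORD_WIDTH)
    return {name: int(padded[start - 1:end]) for name, start, end in LAYOUT}


def format_record(values):
    """Inverse of parse_record: right-align each value inside its columns."""
    chars = [" "] * RECORD_WIDTH
    for name, start, end in LAYOUT:
        width = end - start + 1
        chars[start - 1:end] = str(values[name]).rjust(width)
    return "".join(chars)


# Hypothetical record, invented for illustration only.
example = {"ID": 7, "LOW": 1, "AGE": 24, "LWT": 128, "RACE": 2, "SMOKE": 0,
           "PTL": 0, "HT": 0, "UI": 1, "FTV": 1, "BWT": 2410}
line = format_record(example)
parsed = parse_record(line)
print(parsed)
```

Building the sample line with `format_record` avoids hand-counting spaces, and the round trip gives a quick check that the layout table was transcribed consistently.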
Baseball data for correlation and regression
This table shows the total number of runs scored, at bats, hits, etc. for each of the 30 MLB teams for the 2009-2011 seasons.
Correlations and linear regression models can be calculated between the different numeric variables. A good exercise is to see which variables correlate most strongly with runs_scored.
As emphasized in the movie Moneyball, some of the classic metrics such as batting_avg are not as good as the newer metrics like OBP (on base percentage), SLG (slugging percentage), or OPS (on base plus slugging).
A guide to a few of the variables that may not be self-explanatory:
Runs_Scored: The total of all runs (points) the baseball team scored by the end of the season.
Batting_avg: The number of hits divided by at_bats.
OBP: On Base Percentage. Similar to batting average, except that it takes into account walks and hit-by-pitch. Some players who don't have high batting averages manage to get walked quite frequently.
SLG: Slugging. This weights hits to first base as 1 point, hits to second base as 2 points, third as 3, and home runs as 4, and divides the total by the number of at bats.
OPS: On Base Plus Slugging. This is just OBP added to the SLG numbers.
Share: yes | Owner: mileschen | Last edited: Apr 17, 2012 | Size: 6KB | Views: 1349
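The metric definitions in the baseball description can be written out as a short calculation. This is a sketch only: the team totals are hypothetical, the argument names are invented, and the OBP denominator uses the standard formula (at bats + walks + hit-by-pitch + sacrifice flies), which is an assumption because the description does not spell it out.

```python
def batting_metrics(at_bats, singles, doubles, triples, home_runs,
                    walks, hit_by_pitch, sac_flies):
    """Compute batting average, OBP, SLG and OPS from season totals."""
    hits = singles + doubles + triples + home_runs
    batting_avg = hits / at_bats
    # Standard OBP formula (assumed; not stated in the data set description).
    obp = (hits + walks + hit_by_pitch) / (
        at_bats + walks + hit_by_pitch + sac_flies)
    # SLG weights a single as 1, a double as 2, a triple as 3, a home run as 4.
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    slg = total_bases / at_bats
    return {"batting_avg": batting_avg, "OBP": obp, "SLG": slg,
            "OPS": obp + slg}


# Hypothetical team totals, invented for illustration only.
metrics = batting_metrics(at_bats=5500, singles=950, doubles=300, triples=30,
                          home_runs=170, walks=520, hit_by_pitch=50,
                          sac_flies=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```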
Cigarette Consumption vs CHD Mortality
Now that cigarette smoking has been clearly tied to lung cancer, researchers are focusing on possible links to other diseases. The data below show annual rates of cigarette consumption and deaths from coronary heart disease for several nations. Some public health officials are urging that the US adopt a national goal of cutting cigarette consumption in half over the next decade. Examine these data and write a report. In your report you should:
1. Include appropriate graphs (e.g. scatterplot, residual plot) and statistics (e.g. mean and SD);
2. Describe the association between cigarette smoking and coronary heart disease;
3. Create a linear model;
4. Evaluate the strength and appropriateness of your model;
5. Interpret the slope and y-intercept of the line;
6. Use your model to estimate the potential benefits of reaching the national goal proposed for the US. That is, based on your linear model, if the US were to cut its cigarette consumption in half (from 3900 to 1950), what does the linear model predict would happen to the CHD rate?
7. Use StatCrunch to generate nice-looking graphs and output as needed. Be sure to size them appropriately. There is no need for an 8x10 scatterplot; make your graphs about 3x3. Scale them in StatCrunch first, then copy and paste into Word.
Share: yes | Owner: smcdaniel04 | Last edited: Sep 29, 2011 | Size: 267B | Views: 3182
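The prediction asked for in the report (consumption cut from 3900 to 1950) amounts to fitting a least-squares line and evaluating it at two points. A minimal sketch, assuming made-up numbers: the `cigarettes` and `chd_rate` lists are hypothetical placeholders and are NOT the values in the data set, so the printed figures say nothing about the real answer.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# HYPOTHETICAL placeholder data (cigarettes per adult per year, CHD deaths
# per 100,000). Replace with the actual columns from the data set.
cigarettes = [1000, 2000, 3000, 4000]
chd_rate = [80, 130, 170, 230]

slope, intercept = fit_line(cigarettes, chd_rate)
pred_now = intercept + slope * 3900
pred_goal = intercept + slope * 1950
print(f"slope={slope:.4f}, intercept={intercept:.2f}")
print(f"predicted CHD rate at 3900: {pred_now:.2f}")
print(f"predicted CHD rate at 1950: {pred_goal:.2f}")
print(f"predicted change: {pred_goal - pred_now:.2f}")
```

Under a linear model the predicted change is just the slope times the change in consumption, here slope × (1950 − 3900). Whether a between-country association supports that kind of prediction for a single country is part of what the report is meant to evaluate.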
Anscombe's 4 data sets for regression. They are very different, yet have the same correlation and regression coefficients.
Share: yes | Owner: butler@utsc.utoronto.ca | Last edited: May 31, 2011 | Size: 360B | Views: 547
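The claim that the four sets share correlation and regression coefficients can be checked directly. The sketch below uses the values of Anscombe's quartet as commonly reproduced from Anscombe (1973); it is an assumption that the StatCrunch data set contains exactly these values.

```python
from math import sqrt

# Anscombe's quartet as commonly published (assumed to match the data set).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (slope, intercept, r) for a simple linear regression of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / sqrt(sxx * syy)


results = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept, r) in results.items():
    print(f"{name}: slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

The coefficients agree only to about two or three decimal places, not exactly, and the summary statistics hide how different the scatterplots look, which is the point of the data set.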
Mother and Daughter Heights.xls
This data set is Galton's Mother and Daughter data set as used in Sanford Weisberg's Applied Linear Regression, 3rd Edition.
Share: yes | Owner: craig_slinkman | Last edited: Apr 10, 2010 | Size: 13KB | Views: 3334
Seating Choice versus GPA (For 3 rows, with Text and Indicator Columns)
This dataset contains hypothetical (I believe) data on GPA for students who sit in the front, middle, and back rows of a classroom, as well as a hypothetical gender variable. The data are shown using both text variables (e.g., "front" and "middle") and 0/1 indicator variables for the row and gender variables. This dataset is useful for demonstrating the different ways that StatCrunch can compare means based on two factors: (a) the text factor columns can be used in a two-way ANOVA; and (b) the 0/1 indicator columns can be used in multiple regression. (Because of StatCrunch's current limitation on equal cells, the 0/1 variables only use the first and middle rows.) Both procedures give the same p-value and the same conclusion (as long as the interaction term is centered), thus highlighting the similarity of statistical procedures and StatCrunch's flexibility.
Share: yes | Owner: bartonpoulson | Last edited: Apr 8, 2010 | Size: 1KB | Views: 2726
Seating Choice versus GPA (Stacked & Split Columns for Front & Back Rows)
This dataset contains hypothetical (I believe) data on GPA for students who sit in the front and back row of a classroom. The data are shown in several ways: (a) two separate columns (one for the front row GPA and another for the back row GPA); (b) stacked with one column to indicate front or back row and another column with the GPAs; and (c) the row column repeated as a 0/1 indicator variable. This dataset is useful for comparing the different ways that StatCrunch can compare the means of two groups: (a) the two columns of scores (front and back) can be used in the 2-sample t-test or a one-way ANOVA; (b) the stacked text column (front/back) with a separate column for GPA can also be used for one-way ANOVA; and (c) the 0/1 indicator column and stacked GPAs can be used with correlation and regression. Every procedure gives the same p-value and the same conclusion, thus highlighting the similarity of statistical procedures and StatCrunch's flexibility.
Share: yes | Owner: bartonpoulson | Last edited: Apr 8, 2010 | Size: 465B | Views: 1353
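The equivalence described for the two-group case can be illustrated numerically: the pooled two-sample t statistic, the t statistic for the slope on a 0/1 indicator, and the square root of the one-way ANOVA F statistic coincide, so they lead to the same p-value. The sketch below uses hypothetical GPAs (not the values in the data set) and assumes the pooled-variance form of the t-test; an unpooled (Welch) test would generally differ.

```python
from math import sqrt

# Hypothetical GPAs, invented for illustration only.
front = [3.6, 3.2, 3.8, 3.4, 3.5]
back = [2.9, 3.1, 2.7, 3.3, 3.0]


def mean(values):
    return sum(values) / len(values)


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance."""
    ss = sum((v - mean(a)) ** 2 for v in a) + sum((v - mean(b)) ** 2 for v in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def slope_t(xs, ys):
    """t statistic for the slope in a simple linear regression of y on x."""
    n, mx, my = len(xs), mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope / sqrt(sse / (n - 2) / sxx)


def anova_f(groups):
    """One-way ANOVA F statistic."""
    everything = [v for g in groups for v in g]
    grand = mean(everything)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(everything) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Stacked layout: indicator 1 = front row, 0 = back row.
indicator = [1] * len(front) + [0] * len(back)
stacked_gpa = front + back

t_test = pooled_t(front, back)
t_regression = slope_t(indicator, stacked_gpa)
f_anova = anova_f([front, back])
print(t_test, t_regression, f_anova)
```

With a 0/1 indicator the fitted slope is simply the difference between the two group means, which is why the regression reproduces the t-test.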
Report on the Loss of the ‘Titanic’ (S.S.) (1990), British Board of Trade Inquiry Report (reprint), Gloucester, UK: Allan Sutton Publishing. Taken from the Journal on Statistical Education Archive. Dr. Craig Slinkman has recoded the data as self-explanatory nominal variables.
Share: yes | Owner: craig_slinkman | Last edited: Mar 23, 2010 | Size: 61KB | Views: 1295
Home Runs and Strike Outs for 2004 Boston Red Sox by Handedness
These data show home runs and strike outs for the 12 players from the Boston Red Sox who had more than 200 at-bats in the 2004 season (the first year they won the World Series after the 86-year Curse of the Bambino). It also shows whether the players bat left-handed or as switch hitters, both of which are coded as 0/1 (No/Yes, respectively) indicator variables (also known as dummy variables), as well as a text L/R/LR variable. These data were used for a demonstration of bivariate and multiple regression.
Share: yes | Owner: bartonpoulson | Last edited: Nov 3, 2009 | Size: 375B | Views: 870
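The indicator coding described for handedness can be sketched as a mapping from the text L/R/LR value to two 0/1 columns, with right-handed batters as the baseline (both indicators 0). The column names `left` and `switch` and the sample values are hypothetical; the data set's actual column names may differ.

```python
def handedness_indicators(bats):
    """Map a text code (L, R or LR) to two 0/1 dummy variables."""
    code = bats.strip().upper()
    if code not in {"L", "R", "LR"}:
        raise ValueError(f"unexpected handedness code: {bats!r}")
    return {"left": int(code == "L"), "switch": int(code == "LR")}


# Hypothetical text values, invented for illustration only.
sample_codes = ["L", "R", "LR"]
coded = [handedness_indicators(code) for code in sample_codes]
for code, row in zip(sample_codes, coded):
    print(code, row)
```

Three categories need only two indicators; adding a third column for right-handed batters would make the columns perfectly collinear with the intercept in a multiple regression.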
