
Data sets shared by StatCrunch members
Showing 1 to 15 of 74 data sets matching linear
Data Set/Description  Owner  Last edited  Size  Views
Mean Weights of Boys Ages 2 to 12
I'm using this for Modeling Linear Associations. It has a decent linear correlation coefficient. A linear regression produces the stats and a scatter plot with a polynomial of order one trend line overlay, which can be used to illustrate extrapolation/interpolation, error estimates, and model breakdown. For over/underestimates and error, interpolate mean weights for 3 and 5 year olds and compare with the observed mean weights of 31.0 pounds and 40.5 pounds, respectively. For model breakdown, adjust the x-axis of the scatter plot to range between 0 and 20, with integer tick marks, and the y-axis to range between 0 and 200, with tick marks 0, 10, 20, ..., 200. An extrapolation for mean weight at age 20 will then suggest a weight somewhere near 135 lbs for a 20 year old male.
 kcramer  Oct 26, 2019  110B  543 
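The interpolation, error, and extrapolation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the StatCrunch procedure: the function names `fit_line` and `predict` are made up for this sketch, and the `ages` and `weights` lists are hypothetical placeholder values, not the contents of the shared data set. Only the observed mean of 31.0 pounds at age 3 comes from the description.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Value of the fitted line at x."""
    return intercept + slope * x


# Hypothetical (age, mean weight) pairs, NOT the values in the shared data set.
ages = [2, 4, 6, 8, 10, 12]
weights = [27.0, 36.0, 46.0, 57.0, 70.0, 86.0]

slope, intercept = fit_line(ages, weights)

# Interpolation error at age 3: observed mean (from the description) minus prediction.
# A positive residual means the line underestimates; a negative one means it overestimates.
residual_age3 = 31.0 - predict(slope, intercept, 3)

# Extrapolation far outside the observed ages, where the linear model breaks down.
weight_age20 = predict(slope, intercept, 20)

print(f"residual at age 3: {residual_age3:.2f} lb")
print(f"extrapolated mean weight at age 20: {weight_age20:.1f} lb")
```

With the real data loaded in place of the placeholders, the same two calls to `predict` give the interpolated and extrapolated values to compare against the scatter plot.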
StatCrunch Instruction Sheet Linear Corr and Reg Example  S. Lohse
This data set was included in a text book I was using at the time this example sheet was written.  slohse9395  Oct 7, 2019  179B  426 
Mother and Daughter Heights.xls
This data set is Galton's Mother and Daughter data set as used in Sanford Weisberg's Applied Linear Regression, 3rd Edition.  craig_slinkman  Apr 10, 2010  13KB  7301 
Cigarette Consumption vs CHD Mortality
Now that cigarette smoking has been clearly tied to lung cancer, researchers are focusing on possible links to other diseases. The data below show annual rates of cigarette consumption and deaths from coronary heart disease for several nations. Some public health officials are urging that the US adopt a national goal of cutting cigarette consumption in half over the next decade.
Examine these data and write a report. In your report you should:
1. Include appropriate graphs (e.g. scatterplot, residual plot) and statistics (e.g. mean and SD);
2. Describe the association between cigarette smoking and coronary heart disease;
3. Create a linear model;
4. Evaluate the strength and appropriateness of your model;
5. Interpret the slope and y-intercept of the line;
6. Use your model to estimate the potential benefits of reaching the national goal proposed for the US. That is, based on your linear model, if the US were to cut its cigarette consumption in half (from 3900 to 1950), what does the linear model predict would happen to the CHD rate?
7. You should use StatCrunch to generate nice looking graphs and output as needed. Be sure to size them appropriately. No need for an 8x10 scatterplot; make your graphs about 3x3. You should scale them in StatCrunch first, then copy and paste into Word.
 smcdaniel04  Sep 29, 2011  267B  5788 
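For item 6 of the assignment above, it may help to see that the predicted change depends only on the slope, because the intercept cancels when two predictions are subtracted. The sketch below assumes a fitted line of the form CHD = intercept + slope * consumption; the function names and the numeric slope and intercept are placeholders for illustration, not estimates from this data set.

```python
def predict(intercept, slope, x):
    """Value of the fitted line at x."""
    return intercept + slope * x


def predicted_change(slope, x_from, x_to):
    """Change in the predicted response when x moves from x_from to x_to."""
    return slope * (x_to - x_from)


# Placeholder coefficients, NOT fitted values from the data set.
hypothetical_intercept = 10.0
hypothetical_slope = 0.05

# Halving consumption from 3900 to 1950, as in the assignment.
change = predicted_change(hypothetical_slope, 3900, 1950)

# The same number falls out of subtracting the two predictions directly.
direct = (predict(hypothetical_intercept, hypothetical_slope, 1950)
          - predict(hypothetical_intercept, hypothetical_slope, 3900))

print(f"predicted change in CHD rate: {change:.1f}")
print(f"difference of predictions:    {direct:.1f}")
```

Substituting the slope from the actual regression output gives the model's predicted change in the CHD rate; whether that prediction can be read causally is part of what the report is asked to evaluate.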
Rebound Regression
Here is a data set that students in one group of our introductory course generated for this activity. Only three drops at each of eleven heights were made here, but this should provide an idea of the type of data that would be collected.  smcdanie%sc  Jul 4, 2008  219B  902 
Baseball data for correlation and regression
This table shows the total number of runs scored, at bats, hits, etc. for each of the 30 MLB teams for the 2009-2011 seasons.
Correlations and linear regression models can be calculated between the different numeric variables. A good exercise is to see which variables correlate most strongly with runs_scored.
As emphasized in the movie Moneyball, some of the classic metrics such as batting_avg are not as good as the newer metrics like OBP (on base percentage), SLG (slugging percentage), or OPS (on base plus slugging).
A guide to a few of the variables that may not be self explanatory.
Runs_Scored: The total of all runs (points) the baseball team scored by the end of the season.
Batting_avg: This is equal to the number of hits divided by at_bats
OBP: On Base Percentage. Similar to batting average, except that it takes into account walks and hit-by-pitch. Some players who don't have high batting averages manage to get walked quite frequently.
SLG: Slugging  This weights hits to first base as 1 point, hits to second base as 2 points, third as 3, home runs as 4, and divides the total by the number of at bats.
OPS  On Base Plus Slugging  this is just OBP added to the SLG numbers.  mileschen  Apr 17, 2012  6KB  4590 
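The variable definitions above translate directly into arithmetic. The following is a small sketch of batting average, slugging, and OPS as described in the guide; the function names and the season totals are hypothetical and do not come from the data set. OBP is taken as an input because the guide does not give its full formula.

```python
def batting_avg(hits, at_bats):
    """Hits divided by at bats."""
    return hits / at_bats


def slugging(singles, doubles, triples, home_runs, at_bats):
    """Total bases (1, 2, 3, 4 points per hit type) divided by at bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def ops(obp, slg):
    """On base percentage plus slugging."""
    return obp + slg


# Hypothetical season totals for one team, for illustration only.
singles, doubles, triples, home_runs = 1000, 300, 30, 170
at_bats = 5500
hits = singles + doubles + triples + home_runs
hypothetical_obp = 0.330  # assumed value; the guide does not define its formula

avg = batting_avg(hits, at_bats)
slg = slugging(singles, doubles, triples, home_runs, at_bats)

print(f"batting_avg = {avg:.3f}")
print(f"SLG = {slg:.3f}")
print(f"OPS = {ops(hypothetical_obp, slg):.3f}")
```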
realestate
This example was used during a slide presentation on simple linear regression descriptive statistics in STAT 215 at WVU. This data table lists the selling price (in $1000), size (in 100 ft^2), and condition (from 1 to 10) of n=10 homes sold in 1986 in some market.  kjryan  Nov 25, 2017  138B  662 
71 Discussion LinearRegression_SampleTests2
71 Discussion added Max Temperature MAT 240, 17EW1, R1026 course  smii  Oct 19, 2017  803B  361 
chapter 9
This data set is Galton's Mother and Daughter data set as used in Sanford Weisberg's Applied Linear Regression, 3rd Edition.  katcrowe  Apr 12, 2019  847B  83 
71 Discussion LinearRegression_SampleTests SMM1
71 Discussion Linear Regression, level of humidity between mid-September and beginning October for years 2016 and 2017 for MAT 270, R1026, 17EW1 course.  smii  Oct 19, 2017  543B  159 
Singfat Chu diamond ring data
NAME: Diamond Ring Pricing Using Linear Regression
TYPE: Random sample
SIZE: 48 observations, 2 variables
DESCRIPTIVE ABSTRACT:
This dataset contains the prices of ladies' diamond rings and the carat
size of their diamond stones. The rings are made with gold of 20
carats purity and are each mounted with a single diamond stone.
SOURCE:
The source of the data is a full page advertisement placed in the
_Straits Times_ newspaper issue of February 29, 1992, by a
Singapore-based retailer of diamond jewelry.
VARIABLE DESCRIPTIONS:
Columns
6 - 8 Size of diamond in carats (1 carat = .2 gram)
16 - 19 Price of ring in Singapore dollars
Values are aligned and delimited by blanks. There are no missing values.
STORY BEHIND THE DATA:
Data presented in a newspaper advertisement suggest the use of simple
linear regression to relate the prices of diamond rings to the weights
of their diamond stones. The intercept of the resulting regression
line is negative and significantly different from zero. This finding
raises questions about an assumed pricing mechanism and motivates
consideration of remedial actions.
PEDAGOGICAL NOTES:
This dataset can be used to illustrate model-building in linear
regression. A possibly counterintuitive negative intercept may be
avoided by using a multiplicative or exponential regression model.
These regression models are intrinsically linear, and they are
estimated using standard linear regression technology after a suitable
transformation of the data.
Additional information about these data can be found in the "Datasets
and Stories" article "Diamond Ring Pricing Using Linear Regression" in
the _Journal of Statistics Education_ (Chu 1996).
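The remark in the pedagogical notes that an exponential model is "intrinsically linear" can be made concrete with a short sketch. Assuming a model of the form price = a * exp(b * carat), taking logarithms gives log(price) = log(a) + b * carat, which is an ordinary straight-line fit. The function names below are invented for this illustration, and the data are synthetic values generated from known parameters, not the 48 observations in the data set.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by regressing log(y) on x."""
    b, log_a = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(log_a), b


# Synthetic, noise-free data generated from a = 100, b = 5 (NOT the ring data).
carats = [0.15, 0.20, 0.25, 0.30, 0.35]
prices = [100.0 * math.exp(5.0 * c) for c in carats]

a, b = fit_exponential(carats, prices)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because a is recovered as an exponential, every predicted price is positive, which is how this family of models sidesteps the negative intercept discussed in the notes.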
SUBMITTED BY:
Singfat Chu
Department of Decision Sciences
National University of Singapore
10 Kent Ridge Crescent
Singapore 119260
fbachucl@nus.sg  worths1  Oct 29, 2008  1KB  852 
US Crime
These data are crime-related and demographic statistics for 47 US states in 1960. The data were collected from the FBI's Uniform Crime Report and other government agencies to determine how the variable crime rate depends on the other variables measured in the study.
Number of cases: 47 Reference: Vandaele, W. (1978) Participation in illegitimate activities: Erlich revisited. In Deterrence and incapacitation, Blumstein, A., Cohen, J. and Nagin, D., eds., Washington, D.C.: National Academy of Sciences, 270-335. Methods: A Primer, New York: Chapman & Hall, 11. Also found in: Hand, D.J., et al. (1994) A Handbook of Small Data Sets, London: Chapman & Hall, 101-103.
[Collinearity , Correlation , Causation , Lurking variable , Regression]
Variable  Description 
R  Crime rate # of offenses reported to police per million population 
Age  The number of males of age 14-24 per 1000 population 
S  Indicator variable for Southern states (0 = No, 1 = Yes) 
Ed  Mean # of years of schooling x 10 for persons of age 25 or older 
Ex0  1960 per capita expenditure on police by state and local government 
Ex1  1959 per capita expenditure on police by state and local government 
LF  Labor force participation rate per 1000 civilian urban males age 14-24 
M  The number of males per 1000 females 
N  State population size in hundred thousands 
NW  The number of nonwhites per 1000 population 
U1  Unemployment rate of urban males per 1000 of age 14-24 
U2  Unemployment rate of urban males per 1000 of age 35-39 
W  Median value of transferable goods and assets or family income in tens of $ 
X  The number of families per 1000 earning below 1/2 the median income 
 ds231%sc  Aug 11, 2008  2KB  2385 
Wages and Hours
The data are from a national sample of 6000 households with a male head earning less than $15,000 annually in 1966. The data were classified into 39 demographic groups for analysis. The study was undertaken in the context of proposals for a guaranteed annual wage (negative income tax). At issue was the response of labor supply (average hours) to increasing hourly wages. The study was undertaken to estimate this response from available data. [Regression , Outlier , Collinearity , Assumptions, regression]
Variable  Description 
HRS  Average hours worked during the year 
WAGE  Average hourly wage ($) 
ERSP  Average yearly earnings of spouse ($) 
ERNO  Average yearly earnings of other family members ($) 
NEIN  Average yearly non-earned income 
ASSET  Average family asset holdings (Bank account, etc.) ($) 
AGE  Average age of respondent 
DEP  Average number of dependents 
RACE  Percent of white respondents 
SCHOOL  Average highest grade of school completed 
 ds231%sc  Aug 11, 2008  2KB  1660 
Smoking and Cancer
The data are per capita numbers of cigarettes smoked (sold) by 43 states and the District of Columbia in 1960 together with death rates per thousand population from various forms of cancer.
Number of cases: 44 Reference: J.F. Fraumeni, "Cigarette Smoking and Cancers of the Urinary Tract: Geographic Variations in the United States," Journal of the National Cancer Institute, 41, 1205-1211.
[Outlier , Regression , Residuals , Transformation , Nonlinear regression , Dummy variable]
Variable  Description 
CIG  Number of cigarettes smoked (hundreds per capita) 
BLAD  Deaths per 100K population from bladder cancer 
LUNG  Deaths per 100K population from lung cancer 
KID  Deaths per 100K population from kidney cancer 
LEUK  Deaths per 100K population from leukemia 
 ds231%sc  Aug 11, 2008  1KB  1688 
Responses to Sleep Survey
Topic: Sleeping Habits
Course: STA 220 (statistics)
Semester: Fall 2013
Name: Tiffany Turner
Introduction:
Sleep is a behavioral state that is a natural part of everybody's life. Humans spend about 1/3 of their lives asleep. People generally know little about the importance of sleep. Sleep is not just something to fill time when a person is inactive. Sleep is a required activity, not an option. Even though the precise functions of sleep remain a mystery, sleep is important for normal motor and cognitive function. We all recognize and feel the need to sleep. After sleeping, we recognize changes that have occurred, as we feel rested and more alert. Sleep actually appears to be required for survival.
Methodology:
Data was collected through a survey in which individuals were asked about their sleeping habits and their age. The survey was given to people in my family and to some of the people on StatCrunch who participated. Fourteen people participated in my survey. The data obtained was analyzed with the StatCrunch data analysis package available at www.statcrunch.com
Analysis and Results:
Both descriptive and inferential analyses were done at www.statcrunch.com
A. Descriptive Data Analysis: A pie chart was used to describe the sample data since the participants were all of different ages.
B. Test analysis (inferential statistics): Regression analysis was done using StatCrunch to identify the existence of a correlation between the participants' weekday and weekend sleeping habits. A similar analysis was also done for a possible correlation between age and hours slept. The result indicates almost zero correlation between an individual's age and sleeping habits.
Conclusion:
The linear regression results obtained contradicted my initial belief that an individual's age will not increase the hours an individual will sleep. The respondents could have provided inaccurate data since there is no way to verify the information obtained from the survey. Also, the 14 individuals from my family and HCTC who were surveyed may have been predominantly age observers who sleep more during weekends than weekdays. Given the above, it would not be accurate to conclude that there is no correlation between age and how much sleep one gets. Proper methods of data collection such as observation may be appropriate for this type of study.
 tturner0090  Jan 28, 2013  572B  1616 

