
Data sets shared by StatCrunch members
Showing 1 to 15 of 355 data sets matching EXAMPLE
Data Set/Description Owner Last edited Size Views
All MLB Salaries (1985-2015)
This data has all MLB player salaries between 1985-2015, including the team played for, the city, and a unique ID for each player. In total, this includes 25,575 salaries for 4,963 different baseball players.
The player ID is the first 5 letters from the last name, followed by the first two letters from the first name, followed by a number in case of duplicate names. For example, bondsba01 stands for Barry Bonds with "01" because he's the first with the "bondsba" name ID.
Owner: statcrunch_featured | Last edited: Jun 27, 2017 | Size: 1MB | Views: 4380
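The player ID rule described for the MLB salaries data can be sketched in a few lines of Python. This is only an illustration of the rule as stated (first five letters of the last name, first two of the first name, then a two-digit number); the function name `make_player_id` is hypothetical, and lowercasing plus dropping non-letter characters is an assumption the description does not confirm.

```python
def make_player_id(first_name: str, last_name: str, sequence: int = 1) -> str:
    """Build an ID such as 'bondsba01' from a player's name."""
    # Assumption: IDs are lowercase and contain letters only; the description
    # does not say how spaces, apostrophes, or hyphens are handled.
    last = "".join(ch for ch in last_name.lower() if ch.isalpha())
    first = "".join(ch for ch in first_name.lower() if ch.isalpha())
    # Up to 5 letters of the last name, 2 of the first name, then a
    # zero-padded number that separates players with the same name stem.
    return f"{last[:5]}{first[:2]}{sequence:02d}"


print(make_player_id("Barry", "Bonds"))  # bondsba01
```

A second player with the same name stem would presumably be passed `sequence=2`, giving an ID ending in "02".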
Median sales price vs Median rent for housing in 50 cities
This data is obtained from Zillow and includes the median sales price and the median price to rent a home in 50 cities, as of July 2018, taken from https://www.zillow.com/research/local-market-reports/. This will be an excellent data set to use to introduce correlation and regression. Can we predict the median rent in a city based on the median price of homes sold in the city? It is also a good example to discuss the effect of outliers.
Owner: rosenthi | Last edited: Sep 11, 2018 | Size: 1KB | Views: 496
Times World University Rankings (2011-2016)
This data comes from the annual Times magazine rankings of universities across the world. The webpage for the Times 2016 rankings is listed above in the source.
The formula for the 2016 rankings is as follows:
30% for Teaching Rating
7.5% for International Outlook Rating
30% for Research Rating
30% for Citations Rating
2.5% for Industry Income Rating.
The “Total Score” from 2016 can be recreated using this formula.
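As a rough sketch of how the 2016 formula could be applied, the weights above can be stored in a dictionary keyed by the data set's rating columns and combined as a weighted sum. The function name `total_score` and the example ratings are made up for illustration; they are not values from the data set.

```python
# Weights from the 2016 formula listed above (they sum to 100%).
WEIGHTS = {
    "Teaching_Rating": 0.30,
    "Inter_Outlook_Rating": 0.075,
    "Research_Rating": 0.30,
    "Citations_Rating": 0.30,
    "Industry_Income_Rating": 0.025,
}


def total_score(ratings: dict) -> float:
    """Weighted sum of the five 0-100 ratings."""
    return sum(weight * ratings[name] for name, weight in WEIGHTS.items())


# Hypothetical ratings for one university (not taken from the data set).
example = {
    "Teaching_Rating": 90.0,
    "Inter_Outlook_Rating": 60.0,
    "Research_Rating": 80.0,
    "Citations_Rating": 100.0,
    "Industry_Income_Rating": 40.0,
}
print(round(total_score(example), 1))
```

Comparing this value with the Total_Score column for 2016 rows is one way to check that the formula really reproduces the published score.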

Column: Description
World_Rank: University rank for a given year
University_Name: The name of the university
Country: Location of university
Teaching_Rating: Rating on a 0-100 scale of the quality of teaching at the university. This rating is based on the institution’s reputation for teaching, its student/staff ratio, its ratio of PhDs to undergraduate degrees awarded, and its ratio of institutional income to academic staff.
Inter_Outlook_Rating: Rating on a 0-100 scale of the international makeup of a university. This rating is based on the international student percentage, the international staff percentage, and the percentage of research papers from the university that include at least one international author.
Research_Rating: Rating on a 0-100 scale of the quality of research at the university. This rating is based on the university’s reputation, its ratio of research income to academic staff, and its production of scholarly papers.
Citations_Rating: Rating on a 0-100 scale based on the normalized average number of citations per paper from the university (how often the research from the university is cited by other papers).
Industry_Income_Rating: Rating on a 0-100 scale grading how much companies are willing to invest in the university’s research. The rating is calculated based on the research income from businesses per academic staff member.
Total_Score: The final score used to determine the university ranking, based on Teaching_Rating, Inter_Outlook_Rating, Research_Rating, Citations_Rating, and Industry_Income_Rating.
Num_Students: Total number of students in a given year
Student/Staff_Ratio: Number of students per academic staff member
%_Inter_Students: Percentage of the student body who come from a foreign country
%_Female_Students: Percentage of the student body that is female
Year: Academic year that the ranking was released. For example, 2016 denotes the 2015-2016 academic year.
Owner: statcrunchhelp | Last edited: Apr 5, 2016 | Size: 254KB | Views: 3696
Baseball2013.xlsx
Stats from the major league baseball teams for 2013. The last column I added denotes AL for American League and NL for National League. One could possibly conduct a two-sample means test, for example, to find out whether the average runs for the two leagues are equal. Or there are of course lots of regressions one could run.
Owner: eykolo | Last edited: Nov 4, 2013 | Size: 3KB | Views: 1986
All MLB Salaries (1985-2015)
This data has all MLB player salaries between 1985-2015, including the team played for, the city, and a unique ID for each player. In total, this includes 25,575 salaries for 4,963 different baseball players.
The player ID is the first 5 letters from the last name, followed by the first two letters from the first name, followed by a number in case of duplicate names. For example, bondsba01 stands for Barry Bonds with "01" because he's the first with the "bondsba" name ID.
Owner: statcrunchhelp | Last edited: Mar 15, 2016 | Size: 1MB | Views: 1483
RegisteredNursesSurvey.xlsx
For the survey that produced it, see http://www.statcrunch.com/5.0/survey.php?surveyid=8178&code=YINVQ and the inputs of all teammates. Towards the end, some validation was done, deleting data where working hours were less than a work day or were outliers relative to legally admissible work days. Finally, arbitrarily long chains that were less likely to be encountered in draws of simulated data (M/F, degrees, etc.) were discarded. A total of 12 observations were thus thrown out. All credit goes to Team 3, the instructor, our unnamed friends in the nursing profession who enthusiastically did a last-minute push through their extended social media groups for data, and the respondents who kindly took time out for the survey. Another thought is about the distribution of hours worked. Even if random, it "should be" "centered on" certain hours a day * number of days, with deviations from the center penalized when picking a sample. The observation 38 appears many times, for example, without an explainable reason (we are talking about the work distribution among a sample of nursing staff). So do the "primes" 47, 37, and 29. This is not to argue that they "shouldn't occur", but there has to be some reason for their being so significant. At this stage we may conclude that most of the respondents may not have been in full-time nursing employment in the strict sense of the term. If 42, 48, 72, 60, 50, and 40 appeared more often, the data would show less variation and more regularity. Since we haven't tried stratification, we do not know "how often they should occur". We thus do not re-draw observations.
Owner: ugoagwu | Last edited: Jun 14, 2014 | Size: 2KB | Views: 953
Alcohol data from adults
My group and I designed a survey to find out, among adults who drink, why they drink, their age, their education level, and how many drinks they have per day. The data was gathered individually and put together in StatCrunch by one member of the group. This survey shows the number of drinking adults and what motivates them to drink. Our survey questions are below.
1. Do you drink alcohol? Circle one: Y N
2. What is your age? ____ years
3. What is your gender? Circle one: Male Female
4. Are you having an increasing number of A. Financial problems B. Family problems C. Work problems D. Health problems E. Financial and family problems F. Financial, health, and family problems G. Family and work problems H. Financial, family, and work problems I. None of the above. Circle one.
5. How many drinks do you have a week? _____ drinks
6. Education: What is the highest degree or level of school you have completed? If currently enrolled, mark the previous grade or highest degree received. A. No schooling completed B. Nursery school to 8th grade C. 9th, 10th or 11th grade D. 12th grade, no diploma E. High school graduate - high school diploma or the equivalent (for example: GED) F. Some college credit, but less than 1 year G. 1 or more years of college, no degree H. Associate degree (for example: AA, AS) I. Bachelor's degree (for example: BA, AB, BS) J. Master's degree (for example: MA, MS, MEng, MEd, MSW, MBA) K. Professional degree (for example: MD, DDS, DVM, LLB, JD). Circle one.
----- Original Message -----
Sent on: Tuesday, May 22, 2012 11:46 PM
Hi. It looks good. Change: 2. What is your gender? Circle one: Male Female Other to 2. What is your gender? Circle one: Male Female Other, since I do not think you will get someone answering as Other. In #3, I forgot another option: 3. Are you having an increasing number of A. Financial problems B. Family problems C. Work problems D. Financial and family problems E. Financial and family problems F. Family and work problems G. Financial, family, and work problems H. None of the above. Circle one.
Owner: rosesege | Last edited: Jun 21, 2012 | Size: 9KB | Views: 4631
Housing Price Data
This is an example of the relationship between housing prices and the square footage of the house, the age of the house, and whether the house has a finished basement.
Owner: jpalmateer | Last edited: Nov 7, 2013 | Size: 3KB | Views: 1673
Treatment Effects of a Drug on Cognitive Functioning in Children with Mental Retardation and ADHD
Research conducted by: Pearson et al.
Case study prepared by: David Lane and Emily Zitek

Overview
This study investigated the cognitive effects of stimulant medication in children with mental retardation and Attention-Deficit/Hyperactivity Disorder. This case study shows the data for the Delay of Gratification (DOG) task. Children were given various dosages of a drug, methylphenidate (MPH), and then completed this task as part of a larger battery of tests. The order of doses was counterbalanced so that each dose appeared equally often in each position. For example, six children received the lowest dose first, six received it second, etc. The children were on each dose one week before testing. This task, adapted from the preschool delay task of the Gordon Diagnostic System (Gordon, 1983), measures the ability to suppress or delay impulsive behavioral responses. Children were told that a star would appear on the computer screen if they waited long enough to press a response key. If a child responded less than four seconds after their previous response, they did not earn a star, and the 4-second counter restarted. The DOG differentiates children with and without ADHD of normal intelligence (e.g., Mayes et al., 2001), and is sensitive to MPH treatment in these children (Hall & Kataria, 1992).

Questions to Answer
Does higher dosage lead to higher cognitive performance (measured by the number of correct responses to the DOG task)?

Design Issues
This is a repeated-measures design because each participant performed the task after each dosage.

Variable Description
Placebo: Number of correct responses after taking a placebo
d15: Number of correct responses after taking .15 mg/kg of the drug
d30: Number of correct responses after taking .30 mg/kg of the drug
d60: Number of correct responses after taking .60 mg/kg of the drug
Owner: kari.taylor | Last edited: Oct 22, 2014 | Size: 434B | Views: 1387
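The scoring rule of the DOG task in the case study above (a star only when at least four seconds have passed since the previous response, with the counter restarting after every response) can be sketched as follows. The function name `count_stars`, the assumption that the counter starts at time 0, and the example response times are all hypothetical; the actual task software may differ.

```python
def count_stars(response_times, wait_seconds=4.0):
    """Count responses made at least `wait_seconds` after the previous one."""
    stars = 0
    previous = 0.0  # assumption: the first wait is measured from time 0
    for t in response_times:
        if t - previous >= wait_seconds:
            stars += 1  # waited long enough, so a star is earned
        previous = t  # every response restarts the 4-second counter
    return stars


# Hypothetical key-press times in seconds (not data from the study).
print(count_stars([5.0, 7.0, 12.0, 15.5, 20.0]))  # 3
```

In this made-up sequence the presses at 7.0 and 15.5 seconds come too soon after the previous press, so only three of the five responses count as correct.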
Low Birth Weight Study
SOURCE: Hosmer and Lemeshow (2000), Applied Logistic Regression: Second Edition. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.

DESCRIPTIVE ABSTRACT: The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.

LIST OF VARIABLES:
Columns   Variable                                                 Abbreviation
-----------------------------------------------------------------------------
2-4       Identification Code                                      ID
10        Low Birth Weight (0 = Birth Weight >= 2500g,             LOW
          1 = Birth Weight < 2500g)
17-18     Age of the Mother in Years                               AGE
23-25     Weight in Pounds at the Last Menstrual Period            LWT
32        Race (1 = White, 2 = Black, 3 = Other)                   RACE
40        Smoking Status During Pregnancy (1 = Yes, 0 = No)        SMOKE
48        History of Premature Labor (0 = None, 1 = One, etc.)     PTL
55        History of Hypertension (1 = Yes, 0 = No)                HT
61        Presence of Uterine Irritability (1 = Yes, 0 = No)       UI
67        Number of Physician Visits During the First Trimester    FTV
          (0 = None, 1 = One, 2 = Two, etc.)
73-76     Birth Weight in Grams                                    BWT
-----------------------------------------------------------------------------

PEDAGOGICAL NOTES: These data have been used as an example of fitting a multiple logistic regression model.

STORY BEHIND THE DATA: Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman's behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight. The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected.

References:
1. Hosmer and Lemeshow, Applied Logistic Regression, Wiley (1989).
Owner: wikipeterson | Last edited: Jul 23, 2012 | Size: 6KB | Views: 7199
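The column positions in the low birth weight code sheet describe a fixed-width record, which could be read along the lines of the sketch below. It assumes the positions are 1-indexed and inclusive and that every field is an integer. The helper names and the sample record are invented for illustration, and the StatCrunch copy of the data may already be split into columns.

```python
# (name, first column, last column), 1-indexed and inclusive, from the code sheet.
FIELDS = [
    ("ID", 2, 4), ("LOW", 10, 10), ("AGE", 17, 18), ("LWT", 23, 25),
    ("RACE", 32, 32), ("SMOKE", 40, 40), ("PTL", 48, 48), ("HT", 55, 55),
    ("UI", 61, 61), ("FTV", 67, 67), ("BWT", 73, 76),
]


def parse_record(line: str) -> dict:
    """Slice one fixed-width line into integer fields."""
    return {name: int(line[start - 1:end].strip()) for name, start, end in FIELDS}


def build_record(values: dict) -> str:
    """Lay values out at the documented columns (used here to make test input)."""
    chars = [" "] * 76
    for name, start, end in FIELDS:
        width = end - start + 1
        chars[start - 1:end] = str(values[name]).rjust(width)
    return "".join(chars)


# Made-up record for illustration only (not a row from the study).
sample = {"ID": 101, "LOW": 1, "AGE": 25, "LWT": 120, "RACE": 1, "SMOKE": 1,
          "PTL": 0, "HT": 0, "UI": 0, "FTV": 2, "BWT": 2400}
parsed = parse_record(build_record(sample))
print(parsed["BWT"], parsed["LOW"])
```

The LOW flag is redundant with BWT (it is 1 exactly when BWT is below 2500), so comparing the two after parsing is a cheap consistency check.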
Violent Crimes by State
http://www.census.gov/statab/ranks/rank21.html
State Rankings -- Statistical Abstract of the United States
VIOLENT CRIMES PER 100,000 POPULATION -- 2006
[When states share the same rank, the next lower rank is omitted. Because of rounded data, states may have identical values shown, but different ranks.]

Cautionary note about rankings
The ranks in some tables are based on estimates derived from a sample(s). Because of sampling and nonsampling errors associated with the estimates, the ranking of the estimates does not necessarily reflect the correct ranking of the unknown true values. Thus, caution should be used when making inferences or statements about the states' true values based on a ranking of the estimates. As an example, the estimated total (average, percent, ratio, etc.) for State A may be larger than the estimates for all other states. This does not necessarily mean that the true total (average, percent, ratio, etc.) for State A is larger than those for all other states. Such an inference typically depends on, among other factors, the size of the difference(s) between the estimates in question and the size of their associated standard errors.

In other tables, the ranks are based on a complete enumeration of the target population, or on complete administrative reporting from the population. In such cases, sampling is not used, and there is no sampling error component in the estimates. Still, care should be taken when making inferences or statements based on the rankings. The table values may still exhibit nonsampling error originating from such sources as coverage problems (missing units or duplicates), nonresponse, misreporting, and others.

Last Revised: September 27, 2011 at 09:43:17 AM
Owner: phil_larson | Last edited: Jan 16, 2013 | Size: 881B | Views: 3212
Jealousy file.xlsx
Do men and women differ in jealousy about their romantic partners? Research by Buss, Larsen, Westen, and Semmelroth (Exp. 1, 1992) suggested that the answer is yes. In that study, heterosexual men and women in the United States imagined their romantic partners engaged in emotional or sexual affairs with another person, and then indicated which scenario would be more upsetting to them. Men reported being more distressed when imagining their partners involved in sexual infidelity, whereas women were more distressed when they imagined their partners involved in emotional infidelity. Buss et al. concluded that their findings supported their hypotheses, which were derived from evolutionary theory. Subsequent research either supported the Buss et al. (1992) findings or found limitations to their conclusions (Harris, 2003). For example, although Buss et al. used a forced-choice method in their study (e.g., “Which of these two scenarios is more upsetting?”), others have not found such clear sex differences when rating scales are used instead (DeSteno, Bartlett, Braverman, & Salovey, 2002). In addition, cultural differences have been found. For example, European and Asian men are more likely to choose emotional infidelity as worse, compared to American men (Harris, 2004). The purpose of this study was to see (a) whether we would replicate the original Buss et al. (1992) findings using an Australian sample in 2015, and (b) whether asking participants to rate their feelings would reveal the same sex differences that were reported in the original work. We therefore had separate hypotheses regarding the differences between men and women with respect to emotional infidelity and sexual infidelity.
Owner: e.vanman | Last edited: May 7, 2017 | Size: 7KB | Views: 980
LAX-JFK_AA & UA flights 6-2012
Data set for population t-Test project

BACKGROUND
The data contained in the compressed file has been extracted from the On-Time Performance data table of the "On-Time" database from the TranStats data library. The time period is indicated in the name of the compressed file; for example, XXX_XXXXX_2001_1 contains data of the first month of the year 2001.

RECORD LAYOUT
Below are fields in the order that they appear on the records:
Year: Year
Quarter: Quarter (1-4)
Month: Month
DayofMonth: Day of Month
DayOfWeek: Day of Week
FlightDate: Flight Date (yyyymmdd)
UniqueCarrier: Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
AirlineID: An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
Carrier: Code assigned by IATA and commonly used to identify a carrier. As the same code may have been assigned to different carriers over time, the code is not always unique. For analysis, use the Unique Carrier Code.
TailNum: Tail Number
FlightNum: Flight Number
OriginCityName: Origin Airport, City Name
DestCityName: Destination Airport, City Name
CRSDepTime: CRS Departure Time (local time: hhmm)
DepTime: Actual Departure Time (local time: hhmm)
DepDelay: Difference in minutes between scheduled and actual departure time. Early departures show negative numbers.
DepDelayMinutes: Difference in minutes between scheduled and actual departure time. Early departures set to 0.
DepDel15: Departure Delay Indicator, 15 Minutes or More (1=Yes)
DepartureDelayGroups: Departure Delay intervals, every 15 minutes from <-15 to >180
DepTimeBlk: CRS Departure Time Block, Hourly Intervals
TaxiOut: Taxi Out Time, in Minutes
WheelsOff: Wheels Off Time (local time: hhmm)
WheelsOn: Wheels On Time (local time: hhmm)
TaxiIn: Taxi In Time, in Minutes
CRSArrTime: CRS Arrival Time (local time: hhmm)
ArrTime: Actual Arrival Time (local time: hhmm)
ArrDelay: Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
ArrDelayMinutes: Difference in minutes between scheduled and actual arrival time. Early arrivals set to 0.
ArrDel15: Arrival Delay Indicator, 15 Minutes or More (1=Yes)
ArrivalDelayGroups: Arrival Delay intervals, every 15 minutes from <-15 to >180
ArrTimeBlk: CRS Arrival Time Block, Hourly Intervals
CRSElapsedTime: CRS Elapsed Time of Flight, in Minutes
ActualElapsedTime: Elapsed Time of Flight, in Minutes
AirTime: Flight Time, in Minutes
Flights: Number of Flights
Distance: Distance between airports (miles)
CarrierDelay: Carrier Delay, in Minutes
WeatherDelay: Weather Delay, in Minutes
NASDelay: National Air System Delay, in Minutes
SecurityDelay: Security Delay, in Minutes
LateAircraftDelay: Late Aircraft Delay, in Minutes
Owner: skyviewflier | Last edited: Oct 6, 2012 | Size: 83KB | Views: 691
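The three departure-delay fields in the record layout above are related in a simple way, which the following sketch makes explicit. It takes the signed DepDelay in minutes as input and derives the other two fields from their descriptions; the function name is hypothetical, and the same logic would presumably apply to the ArrDelay fields.

```python
def departure_delay_fields(dep_delay: int) -> dict:
    """Derive DepDelayMinutes and DepDel15 from the signed DepDelay (minutes)."""
    return {
        "DepDelay": dep_delay,                    # negative for early departures
        "DepDelayMinutes": max(dep_delay, 0),     # early departures set to 0
        "DepDel15": 1 if dep_delay >= 15 else 0,  # 15 minutes or more (1=Yes)
    }


print(departure_delay_fields(-5))
print(departure_delay_fields(20))
```

For a t-test on delays, the choice between DepDelay and DepDelayMinutes matters, because the second field removes the negative values that early departures contribute.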
Responses to Survey for Class Examples Spring 2011
This data was collected at the beginning of the spring term for an introductory statistics course at the University of South Carolina.
Owner: petkewic | Last edited: May 17, 2011 | Size: 23KB | Views: 686
Annual Movie Data 2008 Random Sampling.txt
This data is a random sampling of movies that played in theaters in 2008. It includes movies released in previous years that earned money during 2008. For example, a movie released over Thanksgiving in 2007 will most likely earn money in 2007 and 2008. Each box office year ends on the first Sunday of the following year. The next year starts the following day (Monday). For example, the "2004 box office year" ended on Sunday, January 2, 2005. Inflation-adjusted figures are based on ticket sale estimates, and may not be precise due to rounding errors.
Owner: wikipeterson | Last edited: Oct 7, 2009 | Size: 8KB | Views: 472
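The box office year rule in the movie data description (the year ends on the first Sunday of the following calendar year, and the next one starts on the Monday after) can be checked with a short date calculation. The function names below are hypothetical, and the sketch follows only the rule as worded in the description.

```python
from datetime import date, timedelta


def box_office_year_end(year: int) -> date:
    """First Sunday of the calendar year after `year`."""
    jan1 = date(year + 1, 1, 1)
    # date.weekday() uses Monday=0 ... Sunday=6.
    days_until_sunday = (6 - jan1.weekday()) % 7
    return jan1 + timedelta(days=days_until_sunday)


def box_office_year_start(year: int) -> date:
    """The Monday after the previous box office year ended."""
    return box_office_year_end(year - 1) + timedelta(days=1)


print(box_office_year_end(2004))  # 2005-01-02
```

This reproduces the example in the description: the 2004 box office year ends on Sunday, January 2, 2005.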
