Baseball data for correlation and regression
Last edited: Apr 17, 2012
This table shows the total number of runs scored, at bats, hits, etc. for each of the 30 MLB teams in each of the 2009-2011 seasons.
Correlations and linear regression models can be calculated between the different numeric variables. A good exercise is to see which variables correlate most strongly with runs_scored.
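As a starting point for that exercise, here is a minimal sketch in plain Python that computes the Pearson correlation of each column with runs_scored and sorts the columns from strongest to weakest. The helper names pearson_r and rank_by_correlation are hypothetical, and the numbers in the toy dictionary are made-up placeholders, not values from this table; substitute the real columns once the data is loaded.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(columns, target):
    """Return (name, r) pairs sorted by |r| with the target column, strongest first."""
    scores = [
        (name, pearson_r(values, columns[target]))
        for name, values in columns.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)


# Made-up placeholder values for illustration only, NOT taken from the data set.
toy = {
    "runs_scored": [650, 700, 720, 780, 810],
    "batting_avg": [0.250, 0.262, 0.255, 0.270, 0.268],
    "OBP": [0.315, 0.325, 0.330, 0.342, 0.348],
}

for name, r in rank_by_correlation(toy, "runs_scored"):
    print(f"{name}: r = {r:.3f}")
```

Sorting by the absolute value of r keeps strong negative correlations near the top of the list as well, which matters if pitching or fielding columns are included.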
As emphasized in the movie Moneyball, some of the classic metrics such as batting_avg are not as good as the newer metrics like OBP (on base percentage), SLG (slugging percentage), or OPS (on base plus slugging).
A guide to a few of the variables that may not be self-explanatory:
Runs_Scored: The total of all runs (points) the baseball team scored by the end of the season.
Batting_avg: The number of hits divided by at_bats.
OBP: On Base Percentage. Similar to batting average, except that it takes into account walks and hit-by-pitch. Some players who don't have high batting averages manage to get walked quite frequently.
SLG: Slugging Percentage. This weights singles as 1 point, doubles as 2 points, triples as 3, and home runs as 4, and divides the total by the number of at bats.
OPS: On Base Plus Slugging. This is just OBP added to SLG.
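The definitions above can be written out as a short Python sketch. The function and parameter names are hypothetical and do not necessarily match the column names in this table. The on_base_pct function uses the standard OBP formula, (hits + walks + hit-by-pitch) divided by (at bats + walks + hit-by-pitch + sacrifice flies), which goes beyond the short description given here. The input numbers in the usage lines are made up for illustration.

```python
def batting_avg(hits, at_bats):
    """Hits divided by at bats."""
    return hits / at_bats


def slugging(singles, doubles, triples, home_runs, at_bats):
    """Total bases (1 per single, 2 per double, 3 per triple, 4 per home run) per at bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def on_base_pct(hits, walks, hit_by_pitch, at_bats, sac_flies=0):
    """Standard OBP formula; sac_flies defaults to 0 if that count is unavailable."""
    times_on_base = hits + walks + hit_by_pitch
    plate_appearances = at_bats + walks + hit_by_pitch + sac_flies
    return times_on_base / plate_appearances


def ops(obp, slg):
    """On base plus slugging: the two rates simply added together."""
    return obp + slg


# Made-up season totals for illustration only, NOT taken from the data set.
avg = batting_avg(hits=1400, at_bats=5500)
slg = slugging(singles=950, doubles=280, triples=30, home_runs=140, at_bats=5500)
obp = on_base_pct(hits=1400, walks=500, hit_by_pitch=50, at_bats=5500, sac_flies=40)
print(f"AVG {avg:.3f}  OBP {obp:.3f}  SLG {slg:.3f}  OPS {ops(obp, slg):.3f}")
```

Note that OBP and SLG have different denominators, so OPS is a convenient sum of two rates rather than a single rate with its own natural interpretation.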