Descriptive Statistics EbL by Melissa Smith
Generated May 7, 2019 by m.smith96

I chose to perform my analysis on workforce fatalities.

I first ran summary statistics and obtained the following results:

### Summary statistics:

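| Column | n | Mean | Variance | Std. dev. | Std. err. | Median | Range | Min | Max | Q1 | Q3 | IQR | Coef. of var. (%) | Mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fatalities | 51 | 100.84314 | 9896.3349 | 99.480324 | 13.930032 | 77 | 526 | 8 | 534 | 35 | 125 | 90 | 98.648581 | Multiple modes |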

Data was obtained from bls.gov and covers all 50 states plus the District of Columbia, for a total of 51 observations.

Central Tendency

The average or mean number of workplace deaths by state in 2017 was 100.8.

The median, or middle value, was 77 deaths, well below the mean, which was pulled upward by the extreme high values of California (376), Florida (299), New York (313), and Texas (534).  A mean that sits well above the median suggests a right-skewed distribution, as confirmed by the boxplot below.  I wonder what makes these states such dangerous places to work.  Further analysis would be required to help answer that question.

The mode is listed as "Multiple modes" in my summary statistics, so I sorted the data to easily identify which values repeat.  I found modes of 20, 32, 35, 72, and 90, each appearing twice in the dataset.
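Repeated values can also be found without sorting by hand. The sketch below uses Python's `statistics.multimode` on a partial subset of the counts quoted in this report (the minimum, the five repeated values, and the four largest states). It is only an illustration, since the full 51-value dataset is not reproduced here.

```python
from statistics import multimode

# Partial subset of values quoted in this report, NOT the full dataset:
# the minimum (8), the five repeated values, and the four largest counts.
subset = [8, 20, 20, 32, 32, 35, 35, 72, 72, 90, 90, 299, 313, 376, 534]

# multimode returns every value tied for the highest frequency,
# in the order each value first appears in the data.
modes = multimode(subset)
print(modes)  # [20, 32, 35, 72, 90]
```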

Variation

The standard deviation of about 99.5 deaths seems rather large.  Roughly speaking, a typical state's count differs from the mean by about 99.5 deaths.  I find this value large because it is almost as big as the mean itself, which is why the coefficient of variation is about 98.6%.  The variance of 9896.3 is the standard deviation squared.
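As a quick consistency check, the related measures of spread can be recomputed from the reported mean, standard deviation, and sample size. This is a minimal sketch using only the values from the summary table; the variable names are my own.

```python
import math

# Values as reported in the first summary table.
n = 51
mean = 100.84314
std_dev = 99.480324

variance = std_dev ** 2              # standard deviation squared
std_err = std_dev / math.sqrt(n)     # standard error of the mean
coef_var = std_dev / mean * 100      # coefficient of variation, in percent

print(round(variance, 1), round(std_err, 2), round(coef_var, 1))
```

The results agree with the reported variance (9896.3349), standard error (13.930032), and coefficient of variation (98.648581) to the precision shown in the table.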

The range of the data is 526, meaning there is a difference of 526 deaths between the smallest (min) number of deaths per state of 8 and the largest (max) number of deaths per state of 534.  That is quite a large difference.  I wonder what factors contribute to low and high workplace fatalities.

Position

The five-number summary is (8, 35, 77, 125, 534), which is what the boxplot would display had I not chosen to identify outliers with fences.  The interquartile range is 90, meaning the middle 50% of the data (deaths per state) spans 90 deaths, from 35 up to 125.  The remaining 50% of the data lies below 35 or above 125 workplace deaths per state.
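The fences can be worked out from the reported quartiles. The sketch below assumes the common 1.5 × IQR rule, which is my assumption about how the fences were drawn, and checks the four large counts named earlier against the upper fence.

```python
# Quartiles as reported in the first summary table.
q1, q3 = 35, 125
iqr = q3 - q1                    # 90

# Assumption: fences follow the common 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr     # -100.0, so no low outliers are possible
upper_fence = q3 + 1.5 * iqr     # 260.0

# The four large counts named earlier in the report.
high_values = {"California": 376, "Florida": 299, "New York": 313, "Texas": 534}
outliers = {state: v for state, v in high_values.items() if v > upper_fence}

print(lower_fence, upper_fence, sorted(outliers))
```

Under that assumption, all four of these states lie above the upper fence of 260 and would be flagged as outliers.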

The boxplot shows the right skew of the distribution with a longer right whisker.  It also shows the outliers mentioned above.

Since there are outliers present, it is good practice to rerun the analysis without them to see how strongly they influence the results.

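I should have noted which values were removed.  The sample size dropped from 51 to 47, which matches removing the four extreme values named above: California (376), Florida (299), New York (313), and Texas (534).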

### Summary statistics:

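| Column | n | Mean | Variance | Std. dev. | Std. err. | Median | Range | Min | Max | Q1 | Q3 | IQR | Mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sort(fatalities) | 47 | 77.042553 | 2621.3025 | 51.198657 | 7.4680917 | 72 | 186 | 8 | 194 | 33 | 108 | 75 | Multiple modes |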

I can see that most of the statistics have shrunk drastically; only the minimum of 8 is unchanged.  The mean dropped from 100.8 to 77.0, the standard deviation was cut nearly in half from 99.5 to 51.2, the range decreased by 340 from 526 to 186, and the IQR decreased by 15 from 90 to 75.
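The drop in the mean can be checked by hand from the two summary tables. The sketch below assumes the removed values are the four large counts named earlier (California, Florida, New York, and Texas); that is my inference from the sample size falling from 51 to 47, not something stated in the output.

```python
# Reported statistics before and after removing outliers.
n_all, mean_all = 51, 100.84314
n_trim, mean_trim = 47, 77.042553

# Assumption: the removed values are the four high counts named earlier.
removed = [376, 299, 313, 534]

total_all = n_all * mean_all     # sum of all 51 counts, about 5143
implied_mean = (total_all - sum(removed)) / (n_all - len(removed))

range_drop = 526 - 186           # change in range
iqr_drop = 90 - 75               # change in IQR

print(round(implied_mean, 2), range_drop, iqr_drop)
```

The implied mean matches the reported 77.04, which supports the assumption about which four values were dropped.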

Looking at the new boxplot, there is still a right skew in the dataset.

All in all, I was interested in this dataset because I lost a relative who died while working.  I also wanted to see where Illinois ranked, which turned out to be 43rd of 51, indicating our state is on the upper end of workplace fatalities, lying between Q3 and the max, so in the top 25%.  I already knew that Illinois is a dangerous place to work, but I didn't realize it was so high up on the list.  I will advise my family of this ranking and ask those who work in dangerous conditions to please exercise extreme caution at all times.

Result 1: 2017 workforce fatalities by state

Result 2: 2017 work fatalities wo outliers Boxplot