How Much Does A Sunflower Seed Weigh?
Generated May 9, 2014 by 1078261_blackboardht_cwi
Introduction:
Sunflower seeds are a very popular snack among consumers, and there is a large variety of available brands and flavors. Since the sunflower seed population demonstrates such diversity, I thought that it would be interesting to perform an observational study on a sample of seeds. Several attributes were observed for each seed, including weight, length, brand, and flavor. This data was analyzed statistically, and a few major observations will be highlighted within this report. The statistical analysis will focus on the primary variable, "weight with shell." We will explore the distribution of sample seed weights, estimate various population parameters, test a claim about the population's mean seed weight, test for differences among brands, and test for correlation between the primary variable and an auxiliary variable.
Sampling Method:
For the purposes of this study, a sample of 50 seeds was obtained through a stratified sampling method; 5 seeds were chosen at random from each of 10 different packages of sunflower seeds. The packages varied in brand and flavor, which contributed to obtaining a sample that was representative of the population. However, for the sake of convenience, not all brands and flavors were included in the study. Thus, the sample cannot truly be considered a simple random sample, which may ultimately affect the reliability of the data.
Variable Classification:
This report analyzes several characteristics of the seeds within the sample. The primary variable that will be analyzed in this study is “weight with shell.” This variable – which will be measured in milligrams – is quantitative, continuous, and at the ratio level of measurement.
The first auxiliary variable is “weight without shell.” This variable – which will be measured in milligrams – is quantitative, continuous, and at the ratio level of measurement.
The second auxiliary variable is “length with shell.” This variable – which will be measured in millimeters – is quantitative, continuous, and at the ratio level of measurement.
The third auxiliary variable is “brand.” This variable is qualitative and at the nominal level of measurement.
The fourth auxiliary variable is “flavor.” This variable is qualitative and at the nominal level of measurement.
Raw Data:
The data obtained from the sample is shown below.
<data1>
Sample Distribution of Primary Variable:
Now that the data has been obtained, it can be thoroughly organized and analyzed. In order to simplify the interpretation process, the data was grouped into classes. This resulted in a distribution with 12 classes, each having a width of 15 mg. This information is summarized in the Grouped Frequency Distribution Table below.
<data2>
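The grouping step can be sketched in a few lines of Python. This is only an illustration: the weights are read off the stem-and-leaf plot at the end of the report, so they are rounded to whole milligrams, and the lower limit of the first class (90 mg) is an assumption inferred from the minimum weight and the stated 12 classes of width 15 mg.

```python
# Weights with shell (mg), read off the stem-and-leaf plot and therefore
# rounded to whole milligrams; these are not the raw measurements.
WEIGHTS = [
    91, 100, 103, 107, 109, 111, 112, 116, 120, 124,
    124, 127, 129, 134, 135, 135, 137, 138, 139, 140,
    141, 142, 145, 145, 146, 148, 149, 150, 150, 150,
    152, 152, 156, 160, 160, 161, 170, 172, 175, 177,
    181, 187, 188, 189, 190, 211, 214, 218, 243, 257,
]


def grouped_frequencies(values, start=90, width=15, n_classes=12):
    """Count the values in each class [lower, lower + width).

    `start=90` is an assumed lower limit for the first class.
    """
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
    return [
        (start + i * width, start + (i + 1) * width, count)
        for i, count in enumerate(counts)
    ]


table = grouped_frequencies(WEIGHTS)
for lower, upper, count in table:
    print(f"{lower:>3} to under {upper:<3} mg: {count}")
```

Because the inputs are rounded, a seed that sits very close to a class boundary could land in a neighboring class compared with the report's own table.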
The information in the Grouped Frequency Distribution Table is shown graphically in the following Histogram.
<result1>
The center of the distribution lies around 135-150 mg; this is the weight range that has the highest frequency. Overall, the seed weights range from 90.9 mg to 257.2 mg. There do not appear to be any obvious outliers, though there are a few seeds that are significantly heavier than the rest.
The Frequency Polygon below shows the overall shape of the distribution.
<result2>
The distribution of the data is somewhat normal. The frequencies start out low, increase to a maximum, and then decrease. However, the distribution is not symmetrical. The right tail of the distribution is longer than the left tail, which means the distribution is skewed to the right. If the distribution were normal, the mean weight would align with the weight range that has the highest frequency; this would result in a mean weight between 135 mg and 150 mg. However, since the distribution is skewed to the right, the mean weight falls above this range (mean = 152.1 mg).
Looking at the data, it appears that there may be two populations represented within the sample. Many (but not all) of the heavier seeds come from the "David's Jumbo" brand, while the rest come from the other brands. This may be one reason that the distribution is skewed to the right. This speculation will be formally tested later in the report.
The Ogive below shows the cumulative frequency distribution of the data. Within the first few classes, the cumulative frequency increases slowly. It begins to increase rapidly near the middle, but then levels off at the last few classes. This S-shaped pattern is consistent with a roughly bell-shaped distribution.
<result3>
The Pie Chart below shows the relative frequency of each class.
<result4>
Measures of Central Tendency for Primary Variable:
The distribution can be more thoroughly analyzed by interpreting its measures of central tendency. Below is a Summary Statistics Report, which includes several measures of center. Though not included in the table, the data set's midrange is 174.1 mg.
<result5>
Along with the summary report above, it can also be useful to explore the 5-Number Summary.
<result6>
The following Boxplot is a graphical representation of the 5-Number Summary.
<result7>
Another useful summary of the distribution can be found in the following Stem-and-Leaf Plot. It has a similar shape to the histogram shown previously in the report.
<result8>
The best measure of central tendency for the data set is the median (146.7 mg). Even though the distribution is skewed to the right, the median is relatively resistant to the skewed nature of the data. This is especially evident when looking at the histogram, frequency polygon, and stem-and-leaf plot. The weight range with the highest frequency (Histogram/Frequency Polygon: 135-149.9 mg, Stem plot: 140-149 mg) contains the median, but it doesn’t contain the other measures of central tendency (Mean: 152.1 mg, Midrange: 174.1 mg). In addition, 70% of the data values are within one standard deviation (35.7 mg) of the median. This shows that the data is closely distributed around the median.
The worst measure of central tendency for the data set is the midrange (174.1 mg). It is significantly affected by the right-skewed nature of the distribution. Thus, it does not fall within the weight range that has the highest frequency of data values (refer to the paragraph above). Additionally, only 56% of the data values are within one standard deviation of the midrange. This shows that the data is not as closely distributed around the midrange as it is around the median. Lastly, the midrange is above Q3 (refer to the 5-number summary and boxplot). This means that over 75% of the data values are below the midrange, making it a poor reflection of the data’s center.
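As a rough check on the comparison above, the measures of center can be recomputed from the values in the stem-and-leaf plot. This is a sketch under one stated assumption: the plot rounds each weight to a whole milligram, so the results agree with the reported figures only approximately, and the share of values near a cutoff can shift by a seed or two. The helper name `share_within` is introduced here purely for illustration.

```python
import statistics

# Weights with shell (mg), read off the stem-and-leaf plot and therefore
# rounded to whole milligrams; these are not the raw measurements.
WEIGHTS = [
    91, 100, 103, 107, 109, 111, 112, 116, 120, 124,
    124, 127, 129, 134, 135, 135, 137, 138, 139, 140,
    141, 142, 145, 145, 146, 148, 149, 150, 150, 150,
    152, 152, 156, 160, 160, 161, 170, 172, 175, 177,
    181, 187, 188, 189, 190, 211, 214, 218, 243, 257,
]

mean = statistics.mean(WEIGHTS)
median = statistics.median(WEIGHTS)
midrange = (min(WEIGHTS) + max(WEIGHTS)) / 2
sd = statistics.stdev(WEIGHTS)  # sample standard deviation (divides by n - 1)


def share_within(values, center, spread):
    """Fraction of values that lie no farther than `spread` from `center`."""
    return sum(abs(v - center) <= spread for v in values) / len(values)


print(f"mean = {mean:.1f} mg, median = {median:.1f} mg, midrange = {midrange:.1f} mg")
print(f"sample standard deviation = {sd:.1f} mg")
print(f"share within 1 SD of the median:   {share_within(WEIGHTS, median, sd):.0%}")
print(f"share within 1 SD of the midrange: {share_within(WEIGHTS, midrange, sd):.0%}")
```

The mean sitting above the median is the usual numerical signature of the right skew described earlier.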
Estimation of Population Parameters
After examining the sample distribution for the primary variable, we can use the sample data to estimate various population parameters. We will begin by estimating the population mean seed weight with shell, using a significance level of α = .10. This corresponds to a 90% confidence interval:
143.68 mg < μ < 160.60 mg (E: 8.46 mg)
I am 90% confident that this interval actually contains the true population mean.
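The interval for the mean can be approximately reproduced from the rounded summary statistics in this report. The sketch below assumes a t critical value of 1.6766 (49 degrees of freedom, 0.05 in the upper tail) taken from a standard table rather than computed; the variable names are illustrative.

```python
import math

n = 50            # sample size
mean = 152.1      # sample mean (mg), rounded as reported
sd = 35.7         # sample standard deviation (mg), rounded as reported
T_CRIT = 1.6766   # assumed table value: t, 49 df, 0.05 in the upper tail

# Margin of error E = t * s / sqrt(n), then mean +/- E.
margin = T_CRIT * sd / math.sqrt(n)
lower, upper = mean - margin, mean + margin

print(f"E = {margin:.2f} mg")
print(f"{lower:.2f} mg < mu < {upper:.2f} mg")
```

Because the inputs are rounded to one decimal place, the endpoints can differ from the reported interval in the second decimal.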
Now I will estimate the population standard deviation of seed weights with shell, using a significance level of α = .10. This corresponds to a 90% confidence interval:
30.67 mg < σ < 42.88 mg (E left: 5.01 mg, E right: 7.20 mg)
I am 90% confident that this interval actually contains the true population standard deviation.
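The interval for the standard deviation is based on the chi-square distribution, which is why its two margins of error are unequal. A minimal sketch follows, assuming table critical values for 49 degrees of freedom (33.930 and 66.339) and the rounded sample standard deviation from this report.

```python
import math

n = 50
sd = 35.7             # sample standard deviation (mg), rounded as reported
CHI2_LEFT = 33.930    # assumed table value: chi-square, 49 df, 0.05 area to the left
CHI2_RIGHT = 66.339   # assumed table value: chi-square, 49 df, 0.95 area to the left

df = n - 1
# sqrt((n - 1) s^2 / chi2_right) < sigma < sqrt((n - 1) s^2 / chi2_left)
lower = sd * math.sqrt(df / CHI2_RIGHT)
upper = sd * math.sqrt(df / CHI2_LEFT)

print(f"{lower:.2f} mg < sigma < {upper:.2f} mg")
print(f"E left = {sd - lower:.2f} mg, E right = {upper - sd:.2f} mg")
```

Small differences from the reported limits come from using the rounded value 35.7 mg.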
Finally, I will estimate the proportion of the population that falls within a certain range of weights. For the purposes of this estimation, I will consider seeds that fall within 100.0 to 150.0 mg. For the sample, this proportion is .58. A significance level of α = .10 was used to construct a 90% confidence interval:
.465 < p < .695 (E: .115)
I am 90% confident that this interval actually contains the true population proportion.
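The proportion interval follows the usual normal-approximation formula. The sketch below uses only the sample proportion and sample size stated above; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 50
p_hat = 0.58        # sample proportion of seeds between 100.0 and 150.0 mg
confidence = 0.90

# Two-sided critical value: leaves alpha / 2 in each tail of the standard normal.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"E = {margin:.3f}")
print(f"{lower:.3f} < p < {upper:.3f}")
```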
It is important to note that these estimates may not be as accurate as my level of confidence suggests. The accuracy of a confidence interval depends on the sample used for its construction. As I mentioned earlier, my sample is not truly a simple random sample. Thus, the sample may not be representative of the population as a whole. This possible discrepancy could affect the reliability of the confidence intervals.
Hypothesis Test for Population Mean Seed Weight With Shell
Now that we've estimated the population mean using a confidence interval, we can formally test a claim about the population mean. A certain online source (Answers.com) claims that the average weight of a sunflower seed is 90.9 mg. This estimate will serve as the hypothesized population mean for the following statistical test; a hypothesis test will be used to test the claim that the average weight of a sunflower seed is 90.9 mg.
H_{0}: μ = 90.9 mg (This is the claim)
H_{1}: μ > 90.9 mg
This one-tailed test will use a significance level of α = .05, and it will use the Student's t-distribution. The results are shown below.
<result9>
Since the p-value (.0001) is less than α (.05), I must reject the null hypothesis. There is sufficient evidence to suggest that the population mean seed weight with shell is higher than 90.9 mg. This conclusion is supported by the previously constructed 90% confidence interval, since it does not contain the claimed value (90.9 mg). I am not sure why there is such a discrepancy between the claim from Answers.com and the results I obtained. Perhaps the claim did not come from a reputable source, making it unreliable. Or perhaps the population has changed significantly since the claim was made. Either way, there is sufficient evidence to refute the claim.
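The size of the test statistic makes the decision easy to see. The following sketch recomputes it from the rounded summary statistics and compares it with an assumed table critical value (1.6766 for 49 degrees of freedom at α = .05, one tail); it does not compute the p-value itself.

```python
import math

n = 50
mean = 152.1      # sample mean (mg), rounded as reported
sd = 35.7         # sample standard deviation (mg), rounded as reported
mu_0 = 90.9       # hypothesized population mean (mg)
T_CRIT = 1.6766   # assumed table value: t, 49 df, 0.05 in the upper tail

# One-sample t statistic: distance from the claim in standard-error units.
t_stat = (mean - mu_0) / (sd / math.sqrt(n))
reject_null = t_stat > T_CRIT

print(f"t = {t_stat:.2f}, critical value = {T_CRIT}")
print("reject H0" if reject_null else "fail to reject H0")
```

The sample mean lies roughly twelve standard errors above the claimed value, which is why the p-value is so small.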
Hypothesis Test for Differences Among Brands
As I mentioned earlier in the report, I suspect that there may be a difference in seed weights among the brands. In order to test this theory, I divided the seeds into two groups based on the brand represented by each seed: David’s Jumbo Brand (15 seeds) and Other Brands (35 seeds). I will use a hypothesis test to test the claim that David’s Jumbo brand seeds have a higher mean weight with shell than the seeds from other brands. For reference, the means and standard deviations for each group are included below.
<result10>
For the purposes of this test, the David's Jumbo Brand group will be considered population 1.
H_{0}: μ_{1} - μ_{2} = 0
H_{1}: μ_{1} - μ_{2} > 0 (This is the claim)
This one-tailed test will use a significance level of α = .01, and it will use the Student's t-distribution. The results are shown below.
<result11>
Since the P-value (.0001) is less than α (.01), I must reject the null hypothesis. There is sufficient evidence to support the claim that the population mean weight with shell of David’s Jumbo brand sunflower seeds is higher than that of the other brands.
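The test output for this comparison is labeled "without pooled variances", which corresponds to a Welch-type two-sample t-test. The sketch below shows how such a statistic and its approximate degrees of freedom are computed. The two input lists are hypothetical toy numbers chosen only to exercise the function; they are not the seed weights, and `welch_t` is an illustrative helper name.

```python
import math
import statistics


def welch_t(sample1, sample2):
    """Return (t statistic, Welch-Satterthwaite df) for mean1 - mean2."""
    n1, n2 = len(sample1), len(sample2)
    # Each term is a sample variance divided by its sample size.
    v1 = statistics.variance(sample1) / n1
    v2 = statistics.variance(sample2) / n2
    t_stat = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


# Hypothetical toy data, NOT measurements from this study.
group1 = [10.0, 12.0, 14.0]
group2 = [1.0, 2.0, 3.0, 4.0, 5.0]

t_stat, df = welch_t(group1, group2)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

For a right-tailed claim such as the one tested here, the statistic would then be compared with the upper-tail critical value for the computed degrees of freedom.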
The same conclusion can also be drawn using the confidence interval method. But we must be careful to keep things consistent. The significance level (α = .01) corresponds to the area under the t-distribution curve to the right of the critical value. This area represents the probability of getting a t-statistic above the critical value, given that the null hypothesis is true. This same probability must be represented by the area above the upper limit of the confidence interval. Thus, a 98% confidence interval should be used because it has an area of .01 above its upper limit. The interval is shown below.
20.1 mg < μ_{1} - μ_{2} < 72.3 mg
Since the interval does not contain zero, there is sufficient evidence to support the claim that the population mean weight with shell of David’s Jumbo brand sunflower seeds is higher than that of the other brands. The reason for this difference has yet to be determined. Perhaps the David's Jumbo brand seeds are genetically modified to grow larger than other seeds. Or perhaps they are treated with certain chemicals during the maturation process in order to maximize their growth. Either way, there is a significant difference between the weight of David's Jumbo brand seeds and those of other brands.
Test for Correlation Between Primary and Auxiliary Variables
Now that we've thoroughly studied the primary variable, it may be of interest to test for a correlation between the primary variable and one of the auxiliary variables. For this test, the independent variable will be "weight with shell" and the dependent variable will be "weight without shell." A Scatter Plot of the paired data is shown below.
<result12>
There appears to be a strong relationship between the variables, but a thorough regression analysis should be performed in order to clearly elucidate the relationship. I will start by testing for a linear relationship. The linear regression line is overlaid on the scatter plot below.
<result13>
The results of the hypothesis test for linear regression are shown below; the linear regression line has an R^{2} value of .773.
<result14>
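To make the linear model concrete, the fitted equation and correlation coefficient listed in the regression output at the end of this report can be used directly. The function name below is illustrative, and the 180.0 mg input is just an example weight.

```python
# Coefficients and correlation copied from the simple linear regression output.
INTERCEPT = 14.66263     # mg
SLOPE = 0.35479131       # mg without shell per mg with shell
R = 0.87925475           # correlation coefficient


def predict_without_shell(weight_with_shell_mg):
    """Predicted weight without shell (mg) from the fitted straight line."""
    return INTERCEPT + SLOPE * weight_with_shell_mg


r_squared = R ** 2  # share of variation in shelled-out weight explained by the line

print(f"R^2 = {r_squared:.3f}")
print(f"linear prediction at 180.0 mg: {predict_without_shell(180.0):.1f} mg")
```

The slope suggests that, on average, each additional milligram of total weight corresponds to roughly 0.35 mg of kernel under the linear model.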
Now we will test for a quadratic relationship. The quadratic regression line is overlaid on the scatter plot below.
<result15>
The results of the hypothesis test for quadratic regression are shown below; the quadratic regression line has an R^{2} value of .785.
<result16>
We can also test for a cubic relationship. The cubic regression line is overlaid on the scatter plot below.
<result17>
The results of the hypothesis test for cubic regression are shown below; the cubic regression line has an R^{2} value of .790.
<result18>
Finally, I will test for a 4th-order polynomial relationship. The 4th-order polynomial regression line is overlaid on the scatter plot below.
<result19>
The results of the hypothesis test for 4th-order polynomial regression are shown below; the 4th-order polynomial regression line has an R^{2} value of .803.
<result20>
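Since the 4th-order polynomial regression line has the highest R^{2} value (.803) out of those that were tested, it should be chosen as the model of best fit for this relationship. Adding terms to a polynomial can never lower R^{2}, so it is worth noting that the adjusted R^{2} reported for the polynomial fits is also highest for the 4th-order model (.7857, compared with .7757 for the quadratic and .7762 for the cubic). The regression line fits the scatter plot reasonably well, so it is logical to choose this model. There may be other models that fit the relationship better, but for the sake of convenience, we will only consider the 4 regression lines that were tested above.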
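Since each added polynomial term can only raise R^{2}, adjusted R^{2} is a common way to compare models of different order. The sketch below applies the standard adjustment formula to the R^{2} values listed in the regression output at the end of this report, assuming each polynomial of degree k uses k predictor terms plus an intercept. The function name is illustrative.

```python
def adjusted_r_squared(r_squared, n, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)


N = 50  # sample size

# R^2 by polynomial degree, copied from the regression output.
R_SQUARED_BY_DEGREE = {1: 0.77308891, 2: 0.7849, 3: 0.7899, 4: 0.8032}

adjusted = {
    degree: adjusted_r_squared(r2, N, degree)
    for degree, r2 in R_SQUARED_BY_DEGREE.items()
}
best_degree = max(adjusted, key=adjusted.get)

for degree, value in adjusted.items():
    print(f"degree {degree}: adjusted R^2 = {value:.4f}")
print(f"highest adjusted R^2: degree {best_degree}")
```

The improvement from one degree to the next is small, so the choice of the 4th-order model rests on a modest difference.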
Now that we have chosen a model of best fit, we can use it to make a prediction. Given that a seed weighs 180.0 mg with its shell, we can predict its weight without the shell. The 95% prediction interval can be found in the last line of the results shown below.
<result21>
The best point estimate for the weight of the seed without its shell is 81.8 mg. The prediction interval is fairly narrow, suggesting that the regression model is a good fit for the data.
Conclusion
After a thorough analysis of the data, several important observations have been made. The sample distribution of the primary variable (weight with shell) was found to be approximately normal, with a slight skew to the right. The median was found to be the best measure of central tendency, while the midrange was found to be the worst. The right-skewed nature of the data was initially thought to be caused by a brand-related difference among the seeds. A formal hypothesis test showed that there is a significant difference between the weights of David's Jumbo brand seeds and those of other brands, supporting the initial speculation. The sample data was also used to estimate population parameters for the primary variable. Additionally, a hypothesis test was used to test the claim that the average weight of a sunflower seed is 90.9 mg. The test showed that there is sufficient evidence to suggest that the population mean seed weight with shell is higher than 90.9 mg, resulting in the rejection of the claim. Finally, a correlation was found between "weight with shell" and "weight without shell"; the model of best fit was found to be a 4th-order polynomial. Using this model, it was estimated that a seed weighing 180.0 mg with its shell would weigh 81.8 mg without its shell. Although several significant observations were made within this report, the results are still subject to question. Since the sample is not truly a simple random sample, the results may not be completely reliable.
Summary statistics:

Variable: Weight with Shell (mg)
Decimal point is 1 digit(s) to the right of the colon. Leaf unit = 1
 9 : 1
10 : 0379
11 : 126
12 : 04479
13 : 455789
14 : 01255689
15 : 000226
16 : 001
17 : 0257
18 : 1789
19 : 0
20 :
21 : 148
22 :
23 :
24 : 3
25 : 7
Hypothesis test results:
μ : Mean of variable
H_{0} : μ = 90.9
H_{A} : μ > 90.9

Summary statistics:

Hypothesis test results:
μ_{1} : Mean of David's Jumbo
μ_{2} : Mean of Other Brands
μ_{1} - μ_{2} : Difference between two means
H_{0} : μ_{1} - μ_{2} = 0
H_{A} : μ_{1} - μ_{2} > 0
(without pooled variances)

Simple linear regression results:
Dependent Variable: Weight without Shell (mg)
Independent Variable: Weight with Shell (mg)
Weight without Shell (mg) = 14.66263 + 0.35479131 Weight with Shell (mg)
Sample size: 50
R (correlation coefficient) = 0.87925475
R-sq = 0.77308891
Estimate of error standard deviation: 6.9302244
Parameter estimates:
Analysis of variance table for regression model:

Polynomial Regression Results:
Dependent Variable: Weight without Shell (mg) Independent Variable: Weight with Shell (mg)
Analysis of variance table for polynomial regression model:
Summary of fit:
Root MSE: 6.8189601
R-squared: 0.7849
R-squared (adjusted): 0.7757
Polynomial Regression Results:
Dependent Variable: Weight without Shell (mg) Independent Variable: Weight with Shell (mg)
Analysis of variance table for polynomial regression model:
Summary of fit:
Root MSE: 6.8113495
R-squared: 0.7899
R-squared (adjusted): 0.7762
Polynomial Regression Results:
Dependent Variable: Weight without Shell (mg) Independent Variable: Weight with Shell (mg)
Analysis of variance table for polynomial regression model:
Summary of fit:
Root MSE: 6.6658202
R-squared: 0.8032
R-squared (adjusted): 0.7857
Polynomial Regression Results:
Dependent Variable: Weight without Shell (mg) Independent Variable: Weight with Shell (mg)
Analysis of variance table for polynomial regression model:
Summary of fit:
Root MSE: 6.6658202
R-squared: 0.8032
R-squared (adjusted): 0.7857
Predicted values:
