
How Much Does A Sunflower Seed Weigh?
Generated May 9, 2014 by -107826-1_blackboardht_cwi

Introduction:

Sunflower seeds are a very popular snack among consumers, and there is a large variety of available brands and flavors. Since the sunflower seed population demonstrates such diversity, I thought that it would be interesting to perform an observational study on a sample of seeds. Several attributes were observed for each seed, including weight, length, brand, and flavor. This data was analyzed statistically, and a few major observations will be highlighted within this report. The statistical analysis will focus on the primary variable, "weight with shell." We will explore the distribution of sample seed weights, estimate various population parameters, test a claim about the population's mean seed weight, test for differences among brands, and test for correlation between the primary variable and an auxiliary variable.

Sampling Method:

For the purposes of this study, a sample of 50 seeds was obtained through a stratified sampling method; 5 seeds were chosen at random from each of 10 different packages of sunflower seeds. The packages varied in brand and flavor, which contributed to obtaining a sample that was representative of the population. However, for the sake of convenience, not all brands and flavors were included in the study. Thus, the sample cannot truly be considered a simple random sample, which may ultimately affect the reliability of the data.

Variable Classification:

This report analyzes several characteristics of the seeds within the sample. The primary variable that will be analyzed in this study is “weight with shell.” This variable – which will be measured in milligrams – is quantitative, continuous, and at the ratio level of measurement.

The first auxiliary variable is “weight without shell.” This variable – which will be measured in milligrams – is quantitative, continuous, and at the ratio level of measurement.

The second auxiliary variable is “length with shell.” This variable – which will be measured in millimeters – is quantitative, continuous, and at the ratio level of measurement.

The third auxiliary variable is “brand.” This variable is qualitative and at the nominal level of measurement.

The fourth auxiliary variable is “flavor.” This variable is qualitative and at the nominal level of measurement.

Raw Data:

The data obtained from the sample is shown below.

<data1>

Sample Distribution of Primary Variable:

Now that the data has been obtained, it can be thoroughly organized and analyzed. In order to simplify the interpretation process, the data was grouped into classes. This resulted in a distribution with 12 classes, each having a width of 15 mg. This information is summarized in the Grouped Frequency Distribution Table below.
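As a rough illustration of how 12 classes of width 15 mg can be laid out, the sketch below generates the class limits. The first lower limit of 90.0 mg is an assumption (the report does not state it); it was chosen because it reproduces the 135-149.9 mg class mentioned later and covers the observed range of 90.9 mg to 257.2 mg.

```python
# Sketch only: FIRST_LOWER_MG is an assumed starting point, not a value from the report.
FIRST_LOWER_MG = 90.0
CLASS_WIDTH_MG = 15.0
NUM_CLASSES = 12


def class_limits(first_lower, width, count, precision=0.1):
    """Return (lower, upper) limits for equal-width classes.

    The upper limit is one measurement unit (precision) below the next
    class's lower limit, e.g. 135.0 - 149.9 for data recorded to 0.1 mg.
    """
    limits = []
    for i in range(count):
        lower = first_lower + i * width
        upper = round(lower + width - precision, 1)
        limits.append((lower, upper))
    return limits


classes = class_limits(FIRST_LOWER_MG, CLASS_WIDTH_MG, NUM_CLASSES)
for lower, upper in classes:
    print(f"{lower:5.1f} - {upper:5.1f} mg")
```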

<data2>

The information in the Grouped Frequency Distribution Table is shown graphically in the following Histogram.

<result1>

The center of the distribution lies around 135-150 mg; this is the weight range that has the highest frequency. Overall, the seed weights range from 90.9 mg to 257.2 mg. There do not appear to be any obvious outliers, though there are a few seeds that are significantly heavier than the rest.
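One common convention for screening outliers, which this report does not apply, is the 1.5 × IQR rule. The sketch below uses the quartiles listed in the 5-Number Summary (Result 6). Under that rule the maximum of 257.2 mg would lie above the upper fence, which is in line with the remark that a few seeds are noticeably heavier than the rest.

```python
# Values taken from the 5-Number Summary (Result 6), in milligrams.
minimum, q1, q3, maximum = 90.9, 129.3, 171.7, 257.2

# 1.5 x IQR fences: a conventional screening rule, not part of the original analysis.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr:.1f} mg")
print(f"Fences: {lower_fence:.1f} mg to {upper_fence:.1f} mg")
print("Minimum below lower fence:", minimum < lower_fence)
print("Maximum above upper fence:", maximum > upper_fence)
```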

The Frequency Polygon below shows the overall shape of the distribution.

<result2>

The distribution of the data is somewhat normal. The frequencies start out low, increase to a maximum, and then decrease. However, the distribution is not symmetrical. The right tail of the distribution is longer than the left tail, which means the distribution is skewed to the right. If the distribution were normal, the mean weight would align with the weight range that has the highest frequency; this would result in a mean weight between 135 mg and 150 mg. However, since the distribution is skewed to the right, the mean weight falls above this range (mean = 152.1 mg).
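The direction of the skew can also be checked numerically. The sketch below uses Pearson's second (median) skewness coefficient, which is a rough measure chosen here for illustration and not a statistic from the original report. The inputs come from the Summary Statistics (Result 5). A positive value points to a right skew.

```python
# Summary statistics from Result 5, in milligrams.
mean = 152.144
median = 146.65
std_dev = 35.68484

# Pearson's second skewness coefficient: 3 * (mean - median) / s.
# Positive values suggest a longer right tail.
pearson_skew = 3 * (mean - median) / std_dev

print(f"Pearson median skewness: {pearson_skew:.3f}")
```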

Looking at the data, it appears that there may be two populations represented within the sample. Many (but not all) of the heavier seeds come from the "David's Jumbo" brand, while the rest come from the other brands. This may be one reason that the distribution is skewed to the right. This speculation will be formally tested later in the report.

The Ogive below shows the cumulative frequency distribution of the data. Within the first few classes, the frequency increases slowly. It begins to increase rapidly near the middle, but then levels off at the last few classes. This pattern is indicative of a normal distribution.

<result3>

The Pie Chart below shows the relative frequency of each class.

<result4>

Measures of Central Tendency for Primary Variable:

The distribution can be more thoroughly analyzed by interpreting its measures of central tendency. Below is a Summary Statistics Report, which includes several measures of center. Though not included in the table, the data set's midrange is 174.1 mg.

<result5>

Along with the summary report above, it can also be useful to explore the 5-Number Summary.

<result6>

The following Boxplot is a graphical representation of the 5-Number Summary.

<result7>

Another useful summary of the distribution can be found in the following Stem-and-Leaf Plot. It has a similar shape to the histogram shown previously in the report.

<result8>

The best measure of central tendency for the data set is the median (146.7 mg). Even though the distribution is skewed to the right, the median is relatively resistant to the skewed nature of the data. This is especially evident when looking at the histogram, frequency polygon, and stem-and-leaf plot. The weight range with the highest frequency (Histogram/Frequency Polygon: 135-149.9 mg, Stem plot: 140-149 mg) contains the median, but it doesn’t contain the other measures of central tendency (Mean: 152.1 mg, Mid-range: 174.1 mg). In addition, 70% of the data values are within one standard deviation (35.7 mg) of the median. This shows that the data is closely distributed around the median.

The worst measure of central tendency for the data set is the mid-range (174.1 mg). It is significantly affected by the right-skewed nature of the distribution. Thus, it does not fall within the weight range that has the highest frequency of data values (refer to the paragraph above). Additionally, only 56% of the data values are within one standard deviation of the mid-range. This shows that the data is not as closely distributed around the mid-range as it is around the median. Lastly, the mid-range is above Q3 (refer to the 5-number summary and boxplot). This means that over 75% of the data values are below the mid-range, making it a poor reflection of the data’s center.
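The midrange is not in the summary table, so a short sketch of how it follows from the 5-Number Summary (Result 6) may help. It also confirms the comparison with Q3 made above.

```python
# Values from the 5-Number Summary (Result 6), in milligrams.
minimum, q3, maximum = 90.9, 171.7, 257.2
mean, median = 152.144, 146.65

# Midrange: the midpoint between the smallest and largest observations.
midrange = (minimum + maximum) / 2

print(f"Midrange: {midrange:.2f} mg")
print("Midrange above Q3:", midrange > q3)
print(f"Distance from median: {midrange - median:.2f} mg")
```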

Estimation of Population Parameters

After examining the sample distribution for the primary variable, we can use the sample data to estimate various population parameters. We will begin by estimating the population mean seed weight with shell, using a significance level of α = .10. This corresponds to a 90% confidence interval:

143.68 mg < μ < 160.60 mg (E: 8.46 mg)

I am 90% confident that this interval actually contains the true population mean.
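The interval above can be reproduced from the summary statistics. In this sketch the critical value 1.6766 is a t-table value for 49 degrees of freedom that I am supplying as an assumption; it is not listed in the report.

```python
import math

# Summary statistics from Result 5.
n = 50
sample_mean = 152.144   # mg
sample_sd = 35.68484    # mg

# Assumed table value: t critical for 90% confidence (area .05 per tail), 49 df.
T_CRITICAL = 1.6766

std_err = sample_sd / math.sqrt(n)
margin = T_CRITICAL * std_err
lower, upper = sample_mean - margin, sample_mean + margin

print(f"E = {margin:.2f} mg")
print(f"{lower:.2f} mg < mu < {upper:.2f} mg")
```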

Now I will estimate the population standard deviation of seed weights with shell, using a significance level of α = .10. This corresponds to a 90% confidence interval:

30.67 mg < σ < 42.88 mg (E left: 5.01 mg, E right: 7.20 mg)

I am 90% confident that this interval actually contains the true population standard deviation.
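The standard deviation interval is asymmetric because it is built from the chi-square distribution. The sketch below shows the calculation. The two chi-square critical values for 49 degrees of freedom are table values I am supplying as assumptions; they do not appear in the report.

```python
import math

# Summary statistics from Result 5.
n = 50
sample_sd = 35.68484  # mg
df = n - 1

# Assumed table values for 49 df, with area .05 in each tail.
CHI2_RIGHT = 66.339   # cuts off the upper 5%
CHI2_LEFT = 33.930    # cuts off the lower 5%

lower = math.sqrt(df * sample_sd**2 / CHI2_RIGHT)
upper = math.sqrt(df * sample_sd**2 / CHI2_LEFT)

print(f"{lower:.2f} mg < sigma < {upper:.2f} mg")
print(f"E left: {sample_sd - lower:.2f} mg, E right: {upper - sample_sd:.2f} mg")
```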

Finally, I will estimate the proportion of the population that falls within a certain range of weights. For the purposes of this estimation, I will consider seeds that fall within 100.0 to 150.0 mg. For the sample, this proportion is .58. A significance level of α = .10 was used to construct a 90% confidence interval:

.465 < p < .695 (E: .115)

I am 90% confident that this interval actually contains the true population proportion.
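For the proportion, a minimal sketch of the normal-approximation interval is given below. It assumes the usual z-based formula was used, which matches the reported margin of error.

```python
import math
from statistics import NormalDist

n = 50
p_hat = 0.58          # sample proportion of seeds between 100.0 and 150.0 mg
confidence = 0.90

# z critical value leaving (1 - confidence) / 2 in each tail.
z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z_critical * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"E = {margin:.3f}")
print(f"{lower:.3f} < p < {upper:.3f}")
```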

It is important to note that these estimates may not be as accurate as my level of confidence suggests. The accuracy of a confidence interval depends on the sample used for its construction. As I mentioned earlier, my sample is not truly a simple random sample. Thus, the sample may not be representative of the population as a whole. This possible discrepancy could affect the reliability of the confidence intervals.

Hypothesis Test for Population Mean Seed Weight With Shell

Now that we've estimated the population mean using a confidence interval, we can formally test a claim about the population mean. A certain online source (Answers.com) claims that the average weight of a sunflower seed is 90.9 mg. This estimate will serve as the hypothesized population mean for the following statistical test; a hypothesis test will be used to test the claim that the average weight of a sunflower seed is 90.9 mg.

H0: μ = 90.9 mg (This is the claim)

H1: μ > 90.9 mg

This one-tailed test will use a significance level of α = .05, and it will utilize the student t-distribution. The results are shown below.

<result9>

Since the p-value (< .0001) is less than α (.05), I must reject the null hypothesis. There is sufficient evidence to suggest that the population mean seed weight with shell is higher than 90.9 mg. This conclusion is supported by the previously constructed 90% confidence interval, since it does not contain the claimed value (90.9 mg). I am not sure why there is such a discrepancy between the claim from Answers.com and the results I obtained. Perhaps the claim did not come from a reputable source, making it unreliable. Or perhaps the population has changed significantly since the claim was made. Either way, there is sufficient evidence to refute the claim.
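The test statistic in Result 9 can be recomputed by hand from the summary statistics. The sketch below does that and compares it with a critical value. The critical value 1.6766 is a t-table value for 49 degrees of freedom that I am supplying as an assumption; the report itself uses the P-value method.

```python
import math

# Summary statistics from Result 5 and the claimed mean from the hypotheses.
n = 50
sample_mean = 152.144   # mg
sample_sd = 35.68484    # mg
claimed_mean = 90.9     # mg

std_err = sample_sd / math.sqrt(n)
t_stat = (sample_mean - claimed_mean) / std_err

# Assumed table value: right-tail area .05 with 49 degrees of freedom.
T_CRITICAL = 1.6766
reject_null = t_stat > T_CRITICAL

print(f"Std. Err. = {std_err:.4f}, T-Stat = {t_stat:.4f}")
print("Reject H0:", reject_null)
```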

Hypothesis Test for Differences Among Brands

As I mentioned earlier in the report, I suspect that there may be a difference in seed weights among the brands. In order to test this theory, I divided the seeds into two groups based on the brand represented by each seed: David’s Jumbo Brand (15 seeds) and Other Brands (35 seeds). I will use a hypothesis test to test the claim that David’s Jumbo brand seeds have a higher mean weight with shell than the seeds from other brands. For reference, the means and standard deviations for each group are included below.

<result10>

For the purposes of this test, the David's Jumbo Brand group will be considered population 1.

H0: μ1 - μ2 = 0

H1: μ1 - μ2 > 0 (This is the claim)

This one-tailed test will use a significance level of α = .01, and it will use the student t-distribution. The results are shown below.

<result11>

Since the P-value (.0001) is less than α (.01), I must reject the null hypothesis. There is sufficient evidence to support the claim that the population mean weight with shell of David’s Jumbo brand sunflower seeds is higher than that of the other brands.
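Result 11 was run without pooled variances, which corresponds to Welch's t-test. The sketch below recomputes the standard error, test statistic, and Welch-Satterthwaite degrees of freedom from the group statistics in Result 10 and the group sizes given above.

```python
import math

# Group sizes from the text; means and standard deviations from Result 10 (mg).
n1, mean1, sd1 = 15, 184.48667, 36.376717   # David's Jumbo
n2, mean2, sd2 = 35, 138.28286, 25.120855   # Other Brands

# Per-group variance of the sample mean.
v1 = sd1**2 / n1
v2 = sd2**2 / n2

std_err = math.sqrt(v1 + v2)
t_stat = (mean1 - mean2) / std_err

# Welch-Satterthwaite approximation for the degrees of freedom.
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"Sample Diff. = {mean1 - mean2:.5f}")
print(f"Std. Err. = {std_err:.4f}, DF = {df:.3f}, T-Stat = {t_stat:.4f}")
```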

The same conclusion can also be drawn using the confidence interval method, but we must be careful to keep things consistent. The significance level (α = .01) corresponds to the area under the t-distribution curve to the right of the critical value. This area represents the probability of getting a t-statistic above the critical value, given that the null hypothesis is true. For a two-sided confidence interval to use the same critical value, it must leave this same area (.01) in each tail. Thus, a 98% confidence interval should be used. The interval is shown below.

20.1 mg < μ1 - μ2 < 72.3 mg

Since the interval does not contain zero, there is sufficient evidence to support the claim that the population mean weight with shell of David’s Jumbo brand sunflower seeds is higher than that of the other brands. The reason for this difference has yet to be determined. Perhaps the David's Jumbo brand seeds are genetically modified to grow larger than other seeds. Or perhaps they are treated with certain chemicals during the maturation process in order to maximize their growth. Either way, there is a significant difference between the weight of David's Jumbo brand seeds and those of other brands.
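The 98% interval can be approximated from the values in Result 11. In this sketch the critical value 2.528 is the t-table value for 20 degrees of freedom, used as an assumed stand-in for the fractional Welch degrees of freedom (about 19.96), so the limits agree with the reported ones only to about one decimal place.

```python
# Values from Result 11 (mg).
sample_diff = 46.20381
std_err = 10.307663

# Assumed table value: area .01 in each tail, 20 df (approximating df = 19.96).
T_CRITICAL = 2.528

margin = T_CRITICAL * std_err
lower, upper = sample_diff - margin, sample_diff + margin

print(f"{lower:.1f} mg < mu1 - mu2 < {upper:.1f} mg")
print("Interval excludes zero:", lower > 0)
```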

Test for Correlation Between Primary and Auxiliary Variables

Now that we've thoroughly studied the primary variable, it may be of interest to test for a correlation between the primary variable and one of the auxiliary variables. For this test, the independent variable will be "weight with shell" and the dependent variable will be "weight without shell." A Scatter Plot of the paired data is shown below.

<result12>

There appears to be a strong relationship between the variables, but a thorough regression analysis should be performed in order to clearly elucidate the relationship. I will start by testing for a linear relationship. The linear regression line is overlaid on the scatter plot below.

<result13>

The results of the hypothesis test for linear regression are shown below; the linear regression line has an R² value of .773.

<result14>

Now we will test for a quadratic relationship. The quadratic regression line is overlaid on the scatter plot below.

<result15>

The results of the hypothesis test for quadratic regression are shown below; the quadratic regression line has an R² value of .785.

<result16>

We can also test for a cubic relationship. The cubic regression line is overlaid on the scatter plot below.

<result17>

The results of the hypothesis test for cubic regression are shown below; the cubic regression line has an R² value of .790.

<result18>

Finally, I will test for a 4th-order polynomial relationship. The 4th-order polynomial regression line is overlaid on the scatter plot below.

<result19>

The results of the hypothesis test for 4th-order polynomial regression are shown below; the 4th-order polynomial regression line has an R² value of .803.

<result20>

Since the 4th-order polynomial regression line has the highest R² value (.803) out of those that were tested, it should be chosen as the model of best fit for this relationship. The regression line fits the scatter plot reasonably well, so it is logical to choose this model. There may be other models that fit the relationship better, but for the sake of convenience, we will only consider the 4 regression lines that were tested above.
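Because R² cannot decrease when terms are added to a polynomial model, a comparison that penalizes extra terms can be a useful cross-check. The sketch below computes both R² and adjusted R² from the sums of squares in the analysis of variance tables (Results 14 to 20). Adjusted R² is my addition and is not part of the original analysis. Under these numbers the 4th-order model still scores highest, although the margin over the simpler models is small.

```python
# Sums of squares from the ANOVA tables in the results section.
N = 50
SS_TOTAL = 10159.682
SS_ERROR = {1: 2305.3445, 2: 2185.4162, 3: 2134.1462, 4: 1999.4922}  # keyed by degree


def r_squared(sse, sst):
    """Proportion of variation explained by the model."""
    return 1 - sse / sst


def adjusted_r_squared(sse, sst, n, degree):
    """R-squared penalized for the number of polynomial terms."""
    return 1 - (sse / (n - degree - 1)) / (sst / (n - 1))


results = {
    degree: (r_squared(sse, SS_TOTAL), adjusted_r_squared(sse, SS_TOTAL, N, degree))
    for degree, sse in SS_ERROR.items()
}

for degree, (r2, adj_r2) in results.items():
    print(f"degree {degree}: R-sq = {r2:.4f}, adjusted R-sq = {adj_r2:.4f}")

best_degree = max(results, key=lambda d: results[d][1])
print("Highest adjusted R-sq at degree:", best_degree)
```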

Now that we have chosen a model of best fit, we can use it to make a prediction. Given that a seed weighs 180.0 mg with its shell, we can predict its weight without the shell. The 95% prediction interval can be found in the last line of the results shown below.

<result21>

The best point-estimate for the weight of the seed without its shell is 81.8 mg. The prediction interval is fairly narrow, suggesting that the regression model is a good fit for the data.
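The point estimate can be reproduced by evaluating the fitted 4th-order polynomial at 180.0 mg. The sketch below uses the coefficient estimates from Result 20 and Horner's method; the function name is illustrative only. The result should agree with the reported 81.8 mg up to rounding of the printed coefficients.

```python
# Coefficient estimates from Result 20, ordered from the intercept up to X^4.
COEFFS = [409.75088, -9.8078995, 0.093469799, -0.00036488224, 5.1160107e-7]


def predict_weight_without_shell(weight_with_shell, coeffs=COEFFS):
    """Evaluate the fitted polynomial with Horner's method (weights in mg)."""
    result = 0.0
    for coefficient in reversed(coeffs):
        result = result * weight_with_shell + coefficient
    return result


predicted = predict_weight_without_shell(180.0)
print(f"Predicted weight without shell: {predicted:.1f} mg")
```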

Conclusion

After a thorough analysis of the data, several important observations have been made. The sample distribution of the primary variable (weight with shell) was found to be approximately normal, with a slight skew to the right. The median was found to be the best measure of central tendency, while the midrange was found to be the worst. The right-skewed nature of the data was initially thought to be caused by a brand-related difference among the seeds. A formal hypothesis test showed that there is a significant difference between the weights of David's Jumbo brand seeds and those of other brands, supporting the initial speculation. The sample data was also used to estimate population parameters for the primary variable. Additionally, a hypothesis test was used to test the claim that the average weight of a sunflower seed is 90.9 mg. The test showed that there is sufficient evidence to suggest that the population mean seed weight with shell is higher than 90.9 mg, resulting in the rejection of the claim. Finally, a correlation was found between "weight with shell" and "weight without shell"; the model of best fit was found to be a 4th order polynomial. Using this model, it was estimated that a seed weighing 180.0 mg with its shell would weigh 81.8 mg without its shell. Although several significant observations were made within this report, the results are still subject to question. Since the sample is not truly a simple random sample, the results may not be completely reliable.

Result 1: Histogram

Result 2: Frequency Polygon

Result 3: Ogive

Result 4: Pie Chart

Result 5: Summary Stats
Summary statistics:
Column                  Mean     Median  Mode       Std. dev.
Weight with Shell (mg)  152.144  146.65  No unique  35.68484

Result 6: 5-Number Summary
Summary statistics:
Column                  Min   Q1     Median  Q3     Max
Weight with Shell (mg)  90.9  129.3  146.65  171.7  257.2

Result 7: Boxplot

Result 8: Stem and Leaf Plot
Variable: Weight with Shell (mg)
Decimal point is 1 digit(s) to the right of the colon.
Leaf unit = 1
```
 9 : 1
10 : 0379
11 : 126
12 : 04479
13 : 455789
14 : 01255689
15 : 000226
16 : 001
17 : 0257
18 : 1789
19 : 0
20 :
21 : 148
22 :
23 :
24 : 3
25 : 7
```

Result 9: Hypothesis Test (Sunflower Seed Weight w/ Shell)
Hypothesis test results:
μ : Mean of variable
H0 : μ = 90.9
HA : μ > 90.9
Variable                Sample Mean  Std. Err.  DF  T-Stat     P-value
Weight with Shell (mg)  152.144      5.0465985  49  12.135699  <0.0001

Result 10: Summary Stats: Brand Subgroups
Summary statistics:
Column         Mean       Std. dev.
David's Jumbo  184.48667  36.376717
Other Brands   138.28286  25.120855

Result 11: Hypothesis Test for Difference Among Brands
Hypothesis test results:
μ1 : Mean of David's Jumbo
μ2 : Mean of Other Brands
μ1 - μ2 : Difference between two means
H0 : μ1 - μ2 = 0
HA : μ1 - μ2 > 0
(without pooled variances)
Difference  Sample Diff.  Std. Err.  DF        T-Stat    P-value
μ1 - μ2     46.20381      10.307663  19.96415  4.482472  0.0001

Result 12: Scatter Plot

Result 13: Scatter Plot- Linear

Result 14: Simple Linear Regression
Simple linear regression results:
Dependent Variable: Weight without Shell (mg)
Independent Variable: Weight with Shell (mg)
Weight without Shell (mg) = 14.66263 + 0.35479131 Weight with Shell (mg)
Sample size: 50
R (correlation coefficient) = 0.87925475
R-sq = 0.77308891
Estimate of error standard deviation: 6.9302244

Parameter estimates:
Parameter  Estimate    Std. Err.    Alternative  DF  T-Stat     P-Value
Intercept  14.66263    4.333337     ≠ 0          48  3.3836811  0.0014
Slope      0.35479131  0.027743772  ≠ 0          48  12.788143  <0.0001

Analysis of variance table for regression model:
Source  DF  SS         MS         F-stat    P-value
Model   1   7854.3373  7854.3373  163.5366  <0.0001
Error   48  2305.3445  48.02801
Total   49  10159.682

Result 15: Scatter Plot- Quadratic

Result 16: Quadratic Regression
Polynomial Regression Results:
Dependent Variable: Weight without Shell (mg)
Independent Variable: Weight with Shell (mg)
Parameter  Estimate        Std. Err.     Alternative  DF  T-Stat       P-Value
Intercept  -8.8205814      15.231232     ≠ 0          47  -0.57911148  0.5653
X          0.65096841      0.18642974    ≠ 0          47  3.4917627    0.0011
X^2        -0.00088451342  0.0005507591  ≠ 0          47  -1.6059897   0.115

Analysis of variance table for polynomial regression model:
Source  DF  SS         MS         F-stat     P-value
Model   2   7974.2656  3987.1328  85.748081  <0.0001
Error   47  2185.4162  46.498216
Total   49  10159.682

Summary of fit:
Root MSE: 6.8189601
R-squared: 0.7849

Result 17: Scatter Plot- Cubic

Result 18: Cubic Regression
Polynomial Regression Results:
Dependent Variable: Weight without Shell (mg)
Independent Variable: Weight with Shell (mg)
Parameter  Estimate        Std. Err.       Alternative  DF  T-Stat       P-Value
Intercept  47.107863       55.335408       ≠ 0          46  0.851315     0.399
X          -0.42518105     1.0405032       ≠ 0          46  -0.40863023  0.6847
X^2        0.0057130084    0.0063000575    ≠ 0          46  0.90681846   0.3692
X^3        -0.00001287074  0.000012243484  ≠ 0          46  -1.0512319   0.2986

Analysis of variance table for polynomial regression model:
Source  DF  SS         MS         F-stat     P-value
Model   3   8025.5356  2675.1785  57.661568  <0.0001
Error   46  2134.1462  46.394482
Total   49  10159.682

Summary of fit:
Root MSE: 6.8113495
R-squared: 0.7899

Result 19: Scatter Plot- 4th-order Polynomial

Result 20: 4th-order Polynomial Regression
Polynomial Regression Results:
Dependent Variable: Weight without Shell (mg)
Independent Variable: Weight with Shell (mg)
Parameter  Estimate        Std. Err.      Alternative  DF  T-Stat      P-Value
Intercept  409.75088       215.24         ≠ 0          45  1.903693    0.0634
X          -9.8078995      5.485146       ≠ 0          45  -1.7880836  0.0805
X^2        0.093469799     0.050786563    ≠ 0          45  1.8404435   0.0723
X^3        -0.00036488224  0.00020256386  ≠ 0          45  -1.8013196  0.0784
X^4        5.1160107e-7    2.9388366e-7   ≠ 0          45  1.7408285   0.0885

Analysis of variance table for polynomial regression model:
Source  DF  SS         MS         F-stat     P-value
Model   4   8160.1896  2040.0474  45.912724  <0.0001
Error   45  1999.4922  44.43316
Total   49  10159.682

Summary of fit:
Root MSE: 6.6658202
R-squared: 0.8032

Result 21: 4th-order Polynomial Regression- P.I.
Polynomial Regression Results:
Dependent Variable: Weight without Shell (mg)
Independent Variable: Weight with Shell (mg)
Parameter  Estimate        Std. Err.      Alternative  DF  T-Stat      P-Value
Intercept  409.75088       215.24         ≠ 0          45  1.903693    0.0634
X          -9.8078995      5.485146       ≠ 0          45  -1.7880836  0.0805
X^2        0.093469799     0.050786563    ≠ 0          45  1.8404435   0.0723
X^3        -0.00036488224  0.00020256386  ≠ 0          45  -1.8013196  0.0784
X^4        5.1160107e-7    2.9388366e-7   ≠ 0          45  1.7408285   0.0885

Analysis of variance table for polynomial regression model:
Source  DF  SS         MS         F-stat     P-value
Model   4   8160.1896  2040.0474  45.912724  <0.0001
Error   45  1999.4922  44.43316
Total   49  10159.682

Summary of fit:
Root MSE: 6.6658202
R-squared: 0.8032