BIOL 570 Lab 3: Confidence Intervals and the Sampling Distribution of the Mean
Review summary of the chapter.
From students’ measurements of shells in lab 1, we have compiled 26 samples with n=20. We also have access to measurements from an entire population, so we can assess how accurate our estimates are. The first steps of the lab will produce an interval estimate for a single data set, just as a biologist would do when analyzing his/her data. In the later parts of the lab we will use information from the entire population to demonstrate that the confidence interval is reasonable.
Step 1: Look at the data posted at http://phylo.bio.ku.edu/biostats/Lab3.csv. The first column (pop) consists of one length (in cm) for every shell in our population. The next 26 columns (s1, s2, …, s26) are different samples (each with n=20) from this population.
Step 2: Get out your printout of the lab 3 worksheet. Pretend that you only had time to collect twenty measurements and want to infer the population’s mean length. Randomly select one of the twenty-six samples to treat as “your” sample. You can use RStudio (or http://www.r-fiddle.org/#/, if you are not in the lab) to choose a random number with:
sample(1:26, 1)
Question #1: What is the number of the sample that you randomly selected?
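Step 3: Now you can grab the vector of data for “your” sample using syntax similar to the following (but you should replace “100” with your random sample number). The first line reads the file into a data frame, the second line extracts the column for your sample, and the third line removes any missing values (NA):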
data <- read.csv(url("http://phylo.bio.ku.edu/biostats/Lab3.csv"))
d = data$s100
d = d[!is.na(d)]
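The last of these lines drops missing values. A minimal sketch of the same idiom on a made-up vector may help show what it does (the name toy and its values are hypothetical and are not taken from Lab3.csv):

```r
# Hypothetical vector standing in for one sample column; NA marks a row with no data
toy <- c(2.1, 3.4, NA, 2.8, NA)

# is.na() returns TRUE for missing entries; ! flips it, so we keep only real values
toy.clean <- toy[!is.na(toy)]

print(toy.clean)          # 2.1 3.4 2.8
print(length(toy.clean))  # 3
print(mean(toy))          # NA, because mean() does not skip missing values by default
print(mean(toy.clean))    # mean of the three remaining values
```

Without this cleaning step, functions such as mean and sd would return NA for any vector that still contains a missing value.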
Step 4: Recall from previous labs that we can use
summary(d)
sd(d)
to calculate the summary statistics for our vector d (with sd showing you the standard deviation). R also allows other software developers to write packages of code to extend the core language. The package pastecs has a nice summary function called stat.desc. To use it, we have to tell R that we want to load the package:
require(pastecs)
stat.desc(d)
If you see a message which ends in something like “there is no package called ‘pastecs’” then you will need to execute the command
install.packages("pastecs");
And then try the previous 2 lines again.
Does the standard error of the mean (SE.mean) equal $s/\sqrt{n}$?
Question #2: Write a sentence that you would put into a scientific publication to describe what you would conclude about the mean length of the population of shells based on your sample of lengths. Remember to include a measure of location, spread, and the sample size.
Step 5: As we discussed in class, the 95% confidence interval for a statistic is a range of plausible values. We say that we are 95% confident that the population parameter value is within our 95% confidence interval. We also noted that when we are estimating the mean of a population from a random sample, we can use the $\bar{x} \pm 2\,\mathrm{SE}$ rule as the basis for the confidence interval. The term 2SE is referred to as the "margin of error."
Calculate the confidence interval for the mean for your sample using this 2SE rule-of-thumb. The margin of error will not agree exactly with the CI.mean.0.95 that is shown in the R output; we will learn about how R calculates the margin of error later in the course.
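As a worked illustration of the 2SE rule, the sketch below computes the interval one step at a time. It uses a simulated stand-in for a sample (the vector d.demo and the values passed to rnorm are assumptions made up for this example); with your own data you would use d in its place.

```r
# Simulated stand-in for a sample of 20 shell lengths (hypothetical values, in cm)
set.seed(570)
d.demo <- rnorm(20, mean = 5, sd = 1)

n.demo   <- length(d.demo)             # sample size
xbar     <- mean(d.demo)               # sample mean
se.demo  <- sd(d.demo) / sqrt(n.demo)  # standard error of the mean: s / sqrt(n)
moe.demo <- 2 * se.demo                # margin of error under the 2SE rule

# Lower and upper limits of the rule-of-thumb 95% confidence interval
ci.demo <- c(lower = xbar - moe.demo, upper = xbar + moe.demo)
print(ci.demo)
```

The interval is symmetric around the sample mean, and its total width is four standard errors.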
Step 6: Go to the board and draw a line above the number line to depict the confidence interval for your sample and add a tick mark to denote the mean.
Step 7: All of the analyses that we have done to this point in this lab are representative of the steps that you could perform on a typical dataset. Now we are going to do some analyses that exploit the fact that in this (very unusual) case we have access to the entire population. We will compare our estimate of the mean to the true value for the population parameter. Use:
true.val = mean(data$pop)
pop.sd = sd(data$pop)
print(paste("true mean = ", true.val, ". sd =", pop.sd))
to examine the parameter values for the entire population (the data in the ‘pop’ column).
Question #3: What are the values of μ (the population mean) and σ (the population standard deviation)? Is μ contained within the confidence interval that you drew on the board?
Step 8: Because we have 26 different samples with n = 20, we can calculate 26 different CIs. First we'll isolate the 26 columns that were random samples, and we'll only keep the first 20 rows of each because there is no data in the later rows. The subset function with a select argument will choose the columns that we want. The logical test “data$s1 > 0” will trim off the rows with no data. We'll store the result in a data frame called "r":
meanv = 0; sdv = 0;
r = subset(data, data$s1>0, select=2:27)
Then, to calculate the CIs, it will be convenient to get a vector of means and a vector of standard deviations from the 26 random samples. We can do all of that with the following syntax:
for (i in 1:26) {
  samp = r[[i]]
  meanv[i] = mean(samp)
  sdv[i] = sd(samp)
}
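The subset call and the loop can be tried out on a small made-up data frame to see what each piece does. In this sketch the data frame toy.data and its values are hypothetical; sapply is a base R function, shown here only as an alternative way to get the same per-column summaries as the loop.

```r
# Hypothetical miniature of Lab3.csv: a long pop column and two shorter samples
# padded with NA where there is no data
toy.data <- data.frame(pop = c(4.0, 5.0, 6.0, 7.0, 8.0),
                       s1  = c(5.0, 7.0, 6.0, NA, NA),
                       s2  = c(4.0, 8.0, 6.0, NA, NA))

# Keep rows where s1 has a value (NA > 0 is NA, which subset treats as FALSE)
# and keep only the sample columns (columns 2 and 3 in this toy example)
toy.r <- subset(toy.data, toy.data$s1 > 0, select = 2:3)

# sapply applies a function to every column and returns a named vector
toy.meanv <- sapply(toy.r, mean)
toy.sdv   <- sapply(toy.r, sd)

print(toy.r)
print(toy.meanv)  # s1 = 6, s2 = 6
print(toy.sdv)    # s1 = 1, s2 = 2
```

Both samples have the same mean here but different standard deviations, so they would give confidence intervals with the same center and different widths.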
Now, given n=20 for each sample, we can calculate a vector of SE for each sample and use the 2 standard errors trick to get a vector of margins of error, and from that the upper and lower values of the CI:
n = 20
se.v = sdv/sqrt(n)
moe.v = 2*se.v
lower.ci.v = meanv - moe.v
upper.ci.v = meanv + moe.v
We can get a vector of T/F values for the cases in which the true value is above the lower limit of the CI and below the upper limit of the CI. sum will count the number of TRUE values, which is the number of CIs that contain the true population mean.
contains.true.val = lower.ci.v < true.val & true.val < upper.ci.v
print(paste(sum(contains.true.val), " of the 26 CIs contained the true mean"))
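To see how the element-wise comparison and sum work together, here is a small sketch with three made-up intervals (lo.demo, up.demo and true.demo are hypothetical values chosen for illustration):

```r
# Three hypothetical intervals and a hypothetical true value
lo.demo   <- c(1.0, 2.0, 3.0)
up.demo   <- c(2.0, 4.0, 5.0)
true.demo <- 2.5

# Element-wise test: is the true value above the lower limit AND below the upper limit?
inside <- lo.demo < true.demo & true.demo < up.demo
print(inside)        # FALSE TRUE FALSE

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
print(sum(inside))   # 1
# mean() of a logical vector gives the proportion of TRUE values (here 1/3)
print(mean(inside))
```

Only the second interval contains the true value: the first ends below it and the third starts above it.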
Question #4: How many of the confidence intervals contain the population mean?
Step 9: As we saw in previous labs, R can randomly sample from a population.
We can put this logic inside a “for” loop to do 200 repetitions of the sampling procedure. Here we’ll generate 200 samples each with n=80. This code may take a while to execute:
s.contains.true.val = 0
s.mean = 0
s.up = 0
s.lo = 0
s.n = 80
s.se = 0
reps = 1:200
for (i in reps) {
  s.sample = sample(data$pop, s.n)
  ms = mean(s.sample)
  s.mean[i] = ms
  s.se[i] = sd(s.sample)/sqrt(s.n)
  s.up[i] = ms + 2*s.se[i]
  s.lo[i] = ms - 2*s.se[i]
  s.contains.true.val[i] = s.lo[i] < true.val & true.val < s.up[i]
}
print(paste(sum(s.contains.true.val), " of the 200 CIs contained the true mean"))
Step 10: How much variability is there in the estimates of the mean? In the code above, we stored the means from each simulation in a vector called s.mean and we stored the upper and lower values of the CIs in s.lo and s.up vectors. We can make a plot with a point for every replicate’s mean and a CI for every replicate that did not overlap with the population mean using the following code:
# plot each mean against its rep number
plot(reps, s.mean, ylim=c(min(s.lo), max(s.up)), ylab="mean", xlab="sim. rep. #")
# we already figured out which CIs contain the true value; plot segments only for the CIs that do not
excl = !s.contains.true.val
segments(reps[excl], s.lo[excl], reps[excl], s.up[excl])
# we can make a line corresponding to the y-intercept of the true value with slope=0
abline(a=true.val, b=0)
Recall that the sd function will report the standard deviation for a vector.
Question #5: Based on the 200 sample means we have drawn and stored in s.mean, what is the standard error of the mean for a sample size of 80 shells drawn from our population?
Note that in question 5 you are not reporting the standard error itself; you are reporting an estimate of the standard error.
Step 11: Usually we estimate the standard error of the mean from a single sample, using the formula $\mathrm{SE} = s/\sqrt{n}$. The estimates of the standard errors are stored in the vector s.se. If you print out that vector, are the estimates similar to the value you calculated for question 5?
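The comparison in Steps 10 and 11 can be previewed on a simulated population. In this sketch the population pop.demo and all of the .demo names are hypothetical stand-ins, not the lab's shell data; the point is only that the spread of many sample means and the single-sample formula are two estimates of the same quantity.

```r
set.seed(11)
# Hypothetical population of 5000 lengths (not the lab's shell data)
pop.demo  <- rnorm(5000, mean = 5, sd = 1)
n.demo    <- 80
reps.demo <- 200

means.demo <- numeric(reps.demo)  # one sample mean per replicate
se.demo    <- numeric(reps.demo)  # one single-sample SE estimate per replicate
for (i in 1:reps.demo) {
  smp           <- sample(pop.demo, n.demo)
  means.demo[i] <- mean(smp)
  se.demo[i]    <- sd(smp) / sqrt(n.demo)
}

# Estimate 1: standard deviation of the 200 sample means
print(sd(means.demo))
# Estimate 2: average of the 200 single-sample estimates s / sqrt(n)
print(mean(se.demo))
# Approximate reference value based on the whole population
print(sd(pop.demo) / sqrt(n.demo))
```

The three printed numbers should be close to one another but not identical, because each is computed from a different finite set of random draws.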
Step 12: Use print(s.mean[1]), print(s.lo[1]) and print(s.up[1]) to see the mean and the limits of the first simulated 95% confidence interval that you created for n=80. Go to the board and draw a line below the number line to depict the confidence interval for this sample; add a tick mark to denote the mean.
Question #6: Look at the confidence intervals drawn on the board. Which sampling scheme (n = 20 or n = 80) results in a tighter confidence interval? Do the results reflect what you expected to see?
Remember: a 95% confidence interval for a larger sample size should encompass a smaller range (the interval should be tighter) than the 95% confidence interval based on a small sample size. In the long run, 95% of all 95% confidence intervals should contain the population mean, whether the sample size is small or large. Better sampling shrinks our confidence interval; it does not increase the chance that the interval contains the population parameter value!
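A short simulation can illustrate both halves of this reminder. The sketch below assumes a made-up normal population (pop.demo and the helper function sim.ci are hypothetical names introduced for this example) and uses the 2SE rule from this lab.

```r
set.seed(3)
pop.demo <- rnorm(10000, mean = 5, sd = 1)  # hypothetical population
mu.demo  <- mean(pop.demo)                  # its true mean

# Hypothetical helper: draw `reps` samples of size n, build a 2SE interval for each,
# and return the average interval width and the fraction that contain mu.demo
sim.ci <- function(n, reps = 2000) {
  width <- numeric(reps)
  hit   <- logical(reps)
  for (i in 1:reps) {
    smp      <- sample(pop.demo, n)
    m        <- mean(smp)
    moe      <- 2 * sd(smp) / sqrt(n)
    width[i] <- 2 * moe
    hit[i]   <- (m - moe) < mu.demo && mu.demo < (m + moe)
  }
  c(mean.width = mean(width), coverage = mean(hit))
}

res20 <- sim.ci(20)
res80 <- sim.ci(80)
print(rbind(n20 = res20, n80 = res80))
```

The n = 80 intervals should come out roughly half as wide as the n = 20 intervals (because sqrt(80/20) = 2), while the coverage should be near 95% for both. Because the 2SE rule is only a rule of thumb, the coverage for n = 20 may fall slightly below 95%.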