We want to infer information about an entire population just from
knowledge of a relatively small sample. In general, variations on
the Law of Large Numbers tell us that if what we want to
know is something averaged or aggregated, there is a computation you
can do from the sample which gives an estimate that has a very
small chance of being more than a very small distance
away from the thing we are trying to approximate, assuming that
the sample was chosen at random. ``Chosen at random'' means that it
was chosen in such a fashion that every individual is equally likely
to be chosen, and one individual's being in the sample does not
change any other individual's chances of being in the sample. This
is extremely hard to do in practice, and is a major limitation on what
conclusions you can draw. What ``very small chance'' and ``very small
distance'' mean in practice are subtle, and depend on the problem.
All of our conclusions will be probabilistic; that is, they will
give probabilities. The thing to remember is this: In all these
cases one imagines that the actual truth about the population is
fixed (though unknown, like what is behind the curtain in the Monty
Hall problem).
A confidence interval is a very general tool. Whenever you want
to measure anything numerical in the real world, there will be
inaccuracy. The best you can give is a range for the value, and even
then you cannot say for certain that the true value will fall in that
range. The best you can hope to do is give an exact probability for
your chance of being right when you say it falls within the exact
range of values that you give. That is a confidence interval. For
example, if I say ``The 99% confidence interval for the median
number of hours of TV watched per week by American college students is
between 9.7 and 15.5,'' I am saying that whatever the true value
for that parameter is, I am 99% sure that it is somewhere between 9.7
and 15.5. What I mean by that is that I went through some
procedure to come up with the numbers 9.7 and 15.5, which presumably
involved taking a random sample of college students, asking them how
many hours of TV they watched, and doing some calculation with those
numbers. If I did this many times, each time my sample would be
different (because it is chosen randomly) and so each time my
interval would be different. When I make the original claim I am
saying that if I repeated that procedure thousands of times and came
up with thousands of different intervals, 99% of them would be
correct. It is really a statement about the procedure you
used to compute those numbers, not about the numbers themselves.
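To make the ``repeated procedure'' picture concrete, here is a minimal simulation sketch (in Python, which is not what this course uses; the population value, the use of a mean rather than a median, the standard deviation, and the sample size are all made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    true_value = 12.6    # hypothetical "true" population value
    sigma = 8.0          # hypothetical population standard deviation
    n = 100              # sample size
    z = 2.576            # z* for a 99% confidence level

    hits = 0
    trials = 10000
    for _ in range(trials):
        sample = rng.normal(true_value, sigma, n)
        center = sample.mean()
        margin = z * sigma / np.sqrt(n)
        if center - margin <= true_value <= center + margin:
            hits += 1

    print(hits / trials)    # should come out close to 0.99

Each pass through the loop is one run of the ``procedure''; the true value never changes, only the intervals do, and about 99% of them catch it.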
When you give a 95% confidence interval, you are saying there is a
5% chance that you will be wrong due to the randomness of the
sample. There is an additional chance that you will be wrong
because you messed up the calculation, or the numbers weren't
recorded correctly, or whatever. In addition, every procedure
requires certain assumptions to be true, such as that the sample is a
Simple Random Sample and that some variable has a normal
distribution, which at best are usually true only approximately. You
have to assess this subjectively. Finally, you need to be clear on
the difference between the actual quantity you are getting the
confidence interval for and the one you would like it to be for. If
students are consistently underestimating the TV they watch because
of embarrassment, the interval I come up with will not be a 95%
confidence interval for the median number of hours they actually
watch; it will be a confidence interval for the median number of
hours that all college students would say they watched if asked
in the manner the students in your sample were asked. Of course it
is ridiculous to say all these caveats every time you give a
confidence interval, but it is important to be aware of them and to
have a sense of which issues might be limiting the accuracy of the
estimate in your particular case. No one will warn you of this, and
no computer program will help you assess it; you are on your own.
Confidence intervals are expressed in two different forms.
Sometimes they are given as ``between 9.7 and 15.5,'' or simply
``[9.7, 15.5],'' but sometimes they are given as 12.6 +/- 2.9, the
first number being the center of the interval and the second being
half the width, which we call the margin of error.
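To convert between the two forms in this example: the center is (9.7 + 15.5)/2 = 12.6 and the margin of error is (15.5 - 9.7)/2 = 2.9, so ``between 9.7 and 15.5'' and ``12.6 +/- 2.9'' describe exactly the same interval.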
We want to consider some claim (in this class, a claim about a
population, but the logic of this works any time you have
probabilistic evidence for a claim), and all we have is information
about a sample. We would like to assess quantitatively the strength
of this evidence, and we would like to have an objective procedure
for deciding if the evidence is good enough to believe it. Two
situations where such an objective procedure is valuable: scientific
evidence, where objectivity is of paramount importance, and highly
automated decision-making processes, such as spam filtering or industrial
processes.
To make it concrete, suppose you are flipping a coin, and trying
to decide if the coin is improperly weighted. If you flip it 10
times and it comes up tails 7 of them, you would probably chalk it
up to chance. Likewise if you flipped it 100 times and it came up
tails 53 times. But if it came up tails 700 out of 1000 times, you
would probably consider this good evidence that the coin is weighted
to favor tails. So what is the cutoff? If you think carefully about
how you are deciding, you will probably agree that you are asking
yourself ``What are the chances that a set of flips like this would
happen by chance, i.e. if the coin were properly weighted?'' For
example, seven tails in 10 flips does not seem like an unlikely thing
to occur (assuming the coin is properly weighted), but 700 tails in
1000 flips seems incredibly unlikely, unless the coin is
mis-weighted. This is the central idea of Hypothesis
Testing. To assess the evidence your sample (or experiment,
or whatever) offers for a claim, you calculate the probability that
you would see a sample like the one you got assuming the
claim was false. This probability is called the p-value, and if it
is very small, it is unlikely your results would have occurred by
chance (i.e. if the claim were false), so the fact that they did occur
is evidence for the claim. If it is high, you do not have good
evidence for the claim.
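Here is a rough sketch of that coin-flip calculation (in Python with scipy, which is an assumption; the course itself works in Excel). It computes the chance of getting at least that many tails from a fair coin:

    from scipy.stats import binom

    # P(7 or more tails in 10 flips of a fair coin): not so unlikely
    p_10 = 1 - binom.cdf(6, 10, 0.5)          # about 0.17

    # P(700 or more tails in 1000 flips of a fair coin): astronomically small
    p_1000 = 1 - binom.cdf(699, 1000, 0.5)

    print(p_10, p_1000)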
The claim you are assessing the evidence for is called the
Alternate Hypothesis, abbreviated H1 or Ha. Its negation, the
thing you assume in order to compute the probability, is called the Null
Hypothesis. You assume that the Null Hypothesis is true, and
using that you compute the probability that you would get
results similar to yours (that is, the percentage of times, out of
many random samples, that you could expect to get results similar to what
your particular sample happens to give). That is the p-value.
In informal hypothesis testing (common in business settings),
this is the end of the story. You report your p-value (saying ``The
p-value is 0.01'' or, more precisely, ``The probability of getting
results like these if the null hypothesis were true is 1%'') and
make a subjective judgement of how strong the evidence is for the
alternate hypothesis. In our class, just to get into the habit of
doing this, we usually say that a p-value smaller than 0.5% is very
strong evidence, a p-value from 0.5% to 10% is evidence, and a p-value
of more than 10% is very weak evidence for the alternate hypothesis.
But that is very arbitrary.
In formal hypothesis testing you are given a cutoff
called the significance level, usually written as the Greek letter
``alpha.'' In class this will be given to you in the question
(that's how you will know it is supposed to be formal!), but in
practice there are usually standard values in a given field that everyone
uses. Common significance levels are 1% and 5%. If your p-value is
less than the significance level you say ``this data is significant
evidence that ...[Alternate Hypothesis here],'' otherwise you say ``this data
is not significant evidence that ...[Alternate Hypothesis here].''
For reasons that we will talk about later, you should never look
at a p-value and pick a significance level that makes it
significant. That is called ``Letting your data tell you what
question to ask.'' An old-fashioned way to express these two
conclusions is ``Reject the null hypothesis'' and ``fail to reject
the null hypothesis.''
There are two ways for your conclusion to fail to reflect reality
in hypothesis testing. If the alternate hypothesis is false and you
find there is significant evidence for the alternate, this is called
an Error of Type I. If the alternate hypothesis is true and you find
the data is not significant, this is an Error of Type II. By
adjusting the significance level, you can balance the trade-off
between the two types of errors. Often the Type I error is the
bigger error (for example, convicting the innocent is a Type I error;
setting the guilty free is a Type II error).
There is a nice interpretation of the p-value and significance
levels you should know. If you do many significance tests at a
significance level alpha, then out of the tests in which the null
hypothesis is actually true, on average a fraction alpha of them will
still come out significant (a Type I error). So your
significance level is where you control your chances of making a Type
I error. Of course, the lower your significance level, the higher
your chances of making a Type II error.
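A quick simulation sketch of that interpretation (Python again; the test used, a one-sample z-test with a known sigma, and all the numbers are made-up stand-ins): when the null hypothesis is true, a test at level alpha comes out significant about alpha of the time.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    alpha, mu0, sigma, n, trials = 0.05, 50.0, 10.0, 30, 20000

    false_alarms = 0
    for _ in range(trials):
        sample = rng.normal(mu0, sigma, n)        # the null hypothesis really is true
        z = (sample.mean() - mu0) / (sigma / np.sqrt(n))
        p_value = 2 * (1 - norm.cdf(abs(z)))      # two-tailed p-value
        if p_value < alpha:
            false_alarms += 1

    print(false_alarms / trials)                  # close to alpha = 0.05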
People can usually tell what the alternate and null hypotheses
are, but often have trouble telling which is which. Here are three
criteria:
A really simple case is when you want to give a confidence
interval or do a hypothesis test for the population mean of a
numerical variable (let's say
it is normally distributed) and you somehow know its standard
deviation. The procedures in this case are very straightforward, and
you can see the logic more clearly.
Let's say the mean is μ (which we don't know) and the
standard deviation is σ (which we do know). Thus the distribution of the
original variable is a normal distribution with mean μ and
standard deviation σ. Now suppose we take many samples of size n
from the population, and compute the sample mean X-bar of each.
These numbers X-bar will have a distribution (the sampling
distribution) which is again normal, still has mean μ, but now has
standard deviation σ/sqrt(n), sigma over the square root of n.
So that I don't have to write that out every time, I will call it the standard
error. Remember that for a normal distribution with mean μ and
standard deviation σ you can compute any probabilities you like
(the Excel formulas are listed under ``One Population, One Numerical
Variable, Sigma is Known'' below).
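A small numerical check of the standard-error fact above (a Python sketch with made-up values of μ, σ, and n): the sample means of many samples of size n have a standard deviation close to σ/sqrt(n).

    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma, n = 100.0, 15.0, 25

    # 50,000 samples of size n; take the mean of each
    xbars = rng.normal(mu, sigma, size=(50000, n)).mean(axis=1)

    print(xbars.std())            # close to ...
    print(sigma / np.sqrt(n))     # ... sigma/sqrt(n) = 3.0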
Given some particular number μ0, called the test
mean, we can test three different alternate hypotheses:
H1: μ > μ0, H1: μ < μ0, or H1: μ ≠ μ0.
Usually, you do not know the population standard deviation
σ, so you cannot use the methods of the previous paragraphs.
If your sample size is large, the sample standard deviation s coming
from your sample is a good substitute for the population standard deviation
σ. If your sample size is not large, this adds yet another
source of randomness to your calculation, so you would expect your
confidence intervals to be wider and your p-values to be higher to
account for that. The way to do this is to replace the normal
distribution (this is, for example, the distribution that gives you
z* in the confidence interval) with another
distribution, the t-distribution. The
t-distribution, like the normal distribution, is a
mathematical formula
that depends on certain parameters, from which you can determine the
probability that a variable following this distribution will fall in
a certain range of values. Instead of depending on the mean and
standard deviation, the t distribution depends on the size of the
sample. For mathematical and historic reasons we do not write the
dependence in terms of the sample size n, but what we call
the degrees of freedom, n-1. Degrees of freedom is a notion
we will see again and again. When the degrees of freedom is large,
the t-distribution is very close to the standard normal
distribution (i.e. the one with mean 0 and s.d. 1), but when the
degrees of freedom are small, it is broader, with fatter tails. The
Excel formulas to calculate the confidence interval and p-value are
almost the same as in the sigma known case.
However, we will use the t
procedure template to do the calculation for us. If you have raw
data, enter it in column A in the "Data" tab of the template. Then
on the "t-test" tab enter your test mean in the text box provided,
enter the significance level below it, and click on the correct form
of the alternate hypothesis or enter the confidence level. The
confidence interval and/or the p-value and conclusion appear to the
right in green. If you only have summary statistics (i.e., X-bar, s, n),
enter them in the textboxes provided on the "t-test" tab and click the
"Use summary statistics" box.
Every statistical procedure involves modeling a real situation by
a theoretical mathematical model. This amounts to making
assumptions about the situation that are generally true only
approximately at best. It is crucial that you know what these
assumptions are and how to assess how closely they are met, in order
to have a sense of the reliability of the conclusions. We will
generally choose a rule of thumb for judging when the assumption is
close enough. For the two
versions of the one population, one numerical variable procedures
(z-procedure for sigma known, t-procedure for sigma unknown) there
are three assumptions.
Confidence Intervals
Hypothesis Testing
The best way to decide is to say ``If I have good enough evidence to
convince me that ... is true, I will ....'' Whichever claim makes
more sense in such a sentence should be your alternate hypothesis.
One Population, One Numerical Variable, Sigma is Known
P(X < a) "=NORMDIST(a, &mu, &sigma, TRUE)"
P(X > a) "=1-NORMDIST(a, &mu, &sigma, TRUE)"
P(a < X < b) "=NORMDIST(a, &mu, &sigma, TRUE)-NORMDIST(b, &mu,
&sigma, TRUE)"
and finally, the probability that X is farther away from the
&mu than a is:
"=2*MIN(NORMDIST(a, &mu, &sigma, TRUE),1-NORMDIST(a, &mu, &sigma,
TRUE))"
On the other hand, if you have a probability p,
"=NORMINV(p, μ, σ)"
will give you the value that is at the pth percentile in this
distribution.
"=NORMINV(p,0,1)"
will tell you how many standard deviations you have to go up from
the mean so that a fraction p of the data is below you. So
"=NORMINV((1+p)/2,0,1)"
will tell you how many standard deviations you have to go so that
a fraction p of the data is closer to the mean than you are. So if you do
that with p=.95, you will find that 95% of the data is within 1.96
standard deviations of the mean.
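The same calculations can be done outside Excel; here is a Python sketch (scipy's norm.cdf and norm.ppf play the roles of NORMDIST and NORMINV, and the numbers are made up):

    from scipy.stats import norm

    mu, sigma, a = 12.0, 3.0, 15.0
    print(norm.cdf(a, mu, sigma))        # P(X < a), like NORMDIST(a, mu, sigma, TRUE)
    print(1 - norm.cdf(a, mu, sigma))    # P(X > a)
    print(norm.ppf(0.975, 0, 1))         # like NORMINV((1+.95)/2, 0, 1) -- about 1.96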
With all that in hand, how do you find a 95% confidence interval
for the mean if you have a sample of size n with a sample mean of
X-bar? Well, we know that 95% of all sample means from samples of size
n will be within 1.96 standard errors of the mean, that is, will differ
from the mean by less than 1.96 σ/sqrt(n). To put it another way, we are
95% sure that the mean is within 1.96 σ/sqrt(n) of our
particular X-bar, so the confidence interval is
X-bar +/- 1.96 σ/sqrt(n)
For confidence level α it is
X-bar +/- z σ/sqrt(n)
where z=NORMINV((1+α)/2,0,1).
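A minimal sketch of that formula in Python (made-up values of X-bar, σ, n, and confidence level):

    import numpy as np
    from scipy.stats import norm

    xbar, sigma, n, conf = 12.6, 8.0, 100, 0.95
    z = norm.ppf((1 + conf) / 2)          # 1.96 for a 95% confidence level
    margin = z * sigma / np.sqrt(n)
    print(xbar - margin, xbar + margin)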
(Recall the three possible alternate hypotheses: H1: μ > μ0,
H1: μ < μ0, and H1: μ ≠ μ0.) The first two are called one-tailed
tests, the last a two-tailed test. You of course assume the Null Hypothesis
H0: μ = μ0
from which you can compute the p-value, the probability that a
random sample would give results at least as much in favor of the
alternate hypothesis as your X-bar. These follow the same pattern as
the NORMDIST probabilities above, with the standard error σ/sqrt(n)
playing the role of the standard deviation (see the sketch below).
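A sketch of those three p-values in Python (all numbers made up):

    import numpy as np
    from scipy.stats import norm

    xbar, mu0, sigma, n = 13.1, 12.0, 8.0, 100
    se = sigma / np.sqrt(n)                     # standard error

    p_greater = 1 - norm.cdf(xbar, mu0, se)     # for H1: mu > mu0
    p_less = norm.cdf(xbar, mu0, se)            # for H1: mu < mu0
    p_two_sided = 2 * min(p_greater, p_less)    # for H1: mu != mu0
    print(p_greater, p_less, p_two_sided)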
One Sample, One Numerical Variable: The t-procedure
Confidence Interval (confidence level α):
X-bar +/- t s/sqrt(n)
where t=TINV(1-α,n-1).
Hypothesis Testing:
Assumptions For the One Population, One Numerical Variable Tests
Theory | Practice
---|---
The sample is a Simple Random Sample | Those individuals more likely to be chosen in the sample do not appear to differ from those less likely in a way that would affect the measured variable(s)
Sampling is with replacement or from an infinite population | The population is at least 20 times the sample size (this is almost always met, and when it isn't there is a correction that some of the templates include).
The distribution of X-bar is normal | The sample satisfies the 0:15:40 Rule of Cool described below.
To help you remember the last assumption, which we will keep seeing
in many tests, I have devised
Steve's Patented Rule of Cool!
What does it take to be cool? It depends on your age. In statistical
terms (this is the ``0:15:40'' in the rule's name): if the sample size
is under 15, the data must look very close to normal; if it is between
15 and 40, there should be no strong skewness or extreme outliers; if
it is 40 or more, the shape of the sample data hardly matters.
The t procedure template gives a rough histogram of the sample
data on the "Hist" page, quite sufficient for determining whether the
middle criterion is met (not too skewed, no extreme outliers), though
it will only work on samples under 3000.
The confidence interval for a single numerical variable is used very
frequently, but hypothesis testing is less frequent: It is not that
often that you want to compare the average value of something to some
fixed thing (the test mean). More often you want to compare two
things. If I am asking whether Treatment A is more effective than
Treatment B, I will want to look at, for example, the average
recovery time for people who receive A and compare it to those who
receive B. If I get statistically significant evidence that the
average recovery time for A is shorter than for B, I conclude that A
is more effective.
Here we think of there being two populations (in this case,
"people treated by Treatment A" and "people treated by Treatment B")
and a random variable X on both. The means of the random variable in
the two populations are μ1 and μ2
respectively. Generally our null hypothesis is that
μ1 = μ2, which we often write as
μ1 - μ2 = 0, but every once in a while
we want a more general null hypothesis μ1 -
μ2 = d, where d, the test difference, is whatever
number you like. The alternate hypothesis can then take any of the three
usual forms: μ1 - μ2 > d, μ1 - μ2 < d, or μ1 - μ2 ≠ d.
The template (Two Sample t-test on the Excel Templates page) will do the
calculation. If you have the raw data from two samples, just enter
it in two columns, COLUMN A and COLUMN B, in the "Data" tab. It is
a very good idea to put labels at the top so you can remember which
is which. The "t-test" tab will then give you a confidence interval
if you enter a confidence level, and will give you a p-value. You
leave the space for "test difference" blank unless you are in the
rare situation where you have a null hypothesis like "the first mean
is exactly 20 more than the second mean". You can enter a significance
level as well if you like. If you only have the summary statistics,
enter these six numbers in the space provided on the t-test page and
click on the "summary statistics" button.
The assumptions of the two sample t procedure are the same as for the
one sample t-procedure, applied to each sample. That is, we assume
in practice that each sample behaves as if it were a random sample,
is from a population of reasonable size, and satisfies the Rule of
Cool. You can check the histograms of your two samples on the
"Hist" page of the template for skewness and outliers. In practice
we are generally more relaxed about the size limitations for two
samples, so that if each is close to 40, or is close to 15 and
reasonably symmetric, it is considered OK. There is one additional
assumption for the two sample test, given in the table below.
Two Populations (two samples), One Numerical Variable: the t-procedure again
We will have a sample from each population, one of size
n1, mean X-bar1, and sample
s.d. s1, the other of size
n2, mean X-bar2, and sample
s.d. s2. It turns out that you can show that if
you took many pairs of samples and computed their X-bars and
s's, the difference X-bar1 - X-bar2, suitably standardized, would follow
something close to a t-distribution.
From this you can get a p-value, or a confidence interval for what
the difference μ1 -
μ2 is.
Theory | Practice
---|---
Obscure | Each sample should have size at least 5.
Suppose you have a population, and some proportion p of it has a property, which we will call success. When you choose an individual at random from this population, you will have probability p of getting a success and probability 1-p of getting a failure. If you take a sample of n individuals from a population and count up how many successes you get, that is a binomial experiment, and as you run through many samples you will get a binomial distribution for the number m of successes. m will have a mean of np a standard deviation of the square root of the quantity np(1-p) and will be roughly normal when np and n(1-p) are both at least 5. That means that the variable m/n, which is the sample proportion p-hat, will be normal with a mean of p and a standard deviation of the square root of the quantity p(1-p)/n.
Now turn this around. Suppose you take a sample of size n from a population where the proportion of successes is unknown, and you find the sample proportion is some number p-hat. What does that suggest about the true proportion p? Well, presumably the true proportion p is pretty close to p-hat, so the true standard deviation is pretty close to the square root of p-hat(1 - p-hat)/n (in fact the standard deviation changes much more slowly than p). So just as with the confidence interval for a numerical variable with sigma known, we have
Confidence Interval for a Proportion:
p-hat +/- z sqrt(p-hat(1 - p-hat)/n)
where z is the z-score associated to the confidence level.
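A minimal sketch of that formula in Python (made-up counts: 130 successes out of n = 400, 95% confidence):

    import numpy as np
    from scipy.stats import norm

    successes, n, conf = 130, 400, 0.95
    p_hat = successes / n
    z = norm.ppf((1 + conf) / 2)                      # 1.96 for 95%
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    print(p_hat - margin, p_hat + margin)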
The situation is even simpler for hypothesis
testing. Here we will have some test proportion
p0, and our null hypothesis will be
p = p0. We will have three possible alternate
hypotheses, namely p > p0, p < p0, and p ≠ p0.
Assuming the null hypothesis, that is, that the population
proportion is
p0, we know that the sample proportion has a
mean of p0 and a standard deviation of
SQRT( p0(1 - p0)/n), and is normally
distributed. So we get a p-value with no further fuss (see the sketch below):
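A sketch of that p-value calculation in Python (made-up numbers: test proportion p0 = 0.5 and 230 successes in n = 400):

    import numpy as np
    from scipy.stats import norm

    p0, successes, n = 0.5, 230, 400
    p_hat = successes / n
    se = np.sqrt(p0 * (1 - p0) / n)           # sd of p-hat under the null hypothesis

    p_greater = 1 - norm.cdf(p_hat, p0, se)   # for H1: p > p0
    p_less = norm.cdf(p_hat, p0, se)          # for H1: p < p0
    p_two_sided = 2 * min(p_greater, p_less)  # for H1: p != p0
    print(p_greater, p_less, p_two_sided)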
ANOVA stands for analysis of variance. Here you have I samples
taken from I different populations (often it is one population under
I different circumstances, such as people being given one of several
different treatments). Generally I is more than 2, because if it is 1
or 2 you are better off doing one of the t-procedures. Let's say
the samples are of size n1,
n2, ..., nI. On each
population the variable of interest X has some mean
μi, and some standard deviation σ.
We have to assume the standard deviations are all the
same for ANOVA to work! This is too many parameters to
summarize with any kind of confidence interval, and in fact there is only
one null and one alternate hypothesis we can use, so there is no decision
to make. Our null hypothesis will always be that all these means are
the same, and our alternate hypothesis is that there is some
difference among them. As always we will compute a p-value, which is
the probability you would get samples whose sample means are this far apart
assuming the null hypothesis (that all the population means are
equal).
The way this is computed is actually very instructive. It turns
out that when you can think of all the variation or error in a
problem as coming from two or more distinct sources, often you can
associate a kind of variance to these (it will always be the sum or average of
the squares of the difference between the values and some predicted
value) so that the total variance is a sum of the variance due to
each of these effects. Here, if we assume the null hypothesis, we
can think of all of these samples, since they come from populations with
the same mean μ and standard deviation σ, as one big sample
from one big population. The total variation or SST (Sum Square
error Total) is the sum of the squares of the differences of each
data point from the sample mean of the whole sample. This is the sum of
two contributions, the SSG or Sum Square error for groups, which is
the sum over all the data points of the squares of the differences between
the mean of that point's sample and the
total sample mean, and the SSE or Sum Square Error within the
samples, which is the sum of the squares of the differences of each
data point from the mean of its own sample. All this is to say
SST = SSG + SSE
where SSG represents how much the sample means vary amongst
themselves and SSE represents how much the points in each sample
vary amongst themselves. If they all came from populations with the
same mean, the SSG would just come from natural variation from one
sample to the next, and you would expect it to make up some
predetermined fraction of the total variation (depending on the sizes
of the samples of course). If SSG represents too big a fraction of
the total variation, you start to believe that it is because the
means are actually different. Specifically, some math will tell you
that if the null hypothesis is true the quantity
F = (SSG/DFG)/(SSE/DFE)
as we run through many different samples will have a
characteristic distribution, called the F distribution. Here
DFG = I-1 is the Group Degrees of Freedom and DFE = N-I is the Error
Degrees of Freedom, and these two numbers are the parameters the F
distribution depends on.
Software or really long tables will tell you, once you know F, DFG
and DFE, the probability that you would get that big an F score or
bigger in a collection of samples (of the correct sizes) from
populations in which all the means are equal. That is your p-value.
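Here is a Python sketch of the whole computation on made-up samples, checking that SST = SSG + SSE and getting the p-value from the F distribution (scipy's f_oneway does the same thing in one line; none of this is the course template):

    import numpy as np
    from scipy.stats import f, f_oneway

    samples = [np.array([23.0, 25, 21, 24, 26, 22]),
               np.array([27.0, 29, 26, 28, 30, 27]),
               np.array([24.0, 23, 25, 26, 22, 24])]

    all_data = np.concatenate(samples)
    grand_mean = all_data.mean()
    N, I = len(all_data), len(samples)

    sst = ((all_data - grand_mean) ** 2).sum()
    ssg = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
    sse = sum(((s - s.mean()) ** 2).sum() for s in samples)
    print(np.isclose(sst, ssg + sse))          # SST = SSG + SSE

    dfg, dfe = I - 1, N - I
    F = (ssg / dfg) / (sse / dfe)
    p_value = 1 - f.cdf(F, dfg, dfe)
    print(F, p_value)
    print(f_oneway(*samples))                  # scipy agrees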
All you have to do is enter the data from the separate samples in
separate columns in the "Data" tab of the ANOVA
template (make sure not to label the columns with numbers; Excel will
think it is data), then read off the p-value. Or, enter the means,
s.d.s and counts in the "Data Summary" tab.
The assumptions for the ANOVA test are a little more
complicated. Each sample must satisfy all the assumptions of the t
procedure. That is, it must be a reasonable approximation of a
simple random sample, it must be sampled from a sufficiently large
population, and it must satisfy the 0:15:40 Rule of Cool (in
practice, we think of the different samples as providing additional
"averaging out," so that we can be more relaxed about this rule;
three or four samples of size 20 are probably OK short of the most
outrageous outliers or skewness). In addition, the assumed equality
of the standard deviations enforces the following additional rule of
thumb:
What are the assumptions for the one proportion z-procedure? Depends
slightly on whether you are doing a confidence interval or a test.
Theory | Practice
---|---
The sample is a Simple Random Sample | Those individuals more likely to be chosen in the sample do not appear to differ from those less likely in a way that would affect the measured variable(s)
Sampling is with replacement or from an infinite population | The population is at least 20 times the sample size (this is almost always met, and when it isn't there is a correction that some of the templates include).
The distribution of p-hat is normal | Rule of 15: for a CI, the number of successes and failures is at least 15; for a HT, the expected number of successes and failures (n p0 and n (1 - p0)) is at least 15.
Many Populations, One Numerical Variable: ANOVA