We want to infer information about an entire population just from knowledge of a relatively small sample. In general, variations on the Law of Large Numbers can usually tell us that if what we want to know is something averaged or aggregated, there is a computation you can do from the sample which gives an estimate which has a very small chance of being more than a very small distance away from the thing we are trying to approximate, assuming that the sample was chosen at random. Chosen at random means that it was chosen in such a fashion that every individual is equally likely to be chosen, and certain individuals being in the sample do not change any other individual's chances of being in the sample. This is extremely hard to do in practice, and is a major limitation on what conclusions you can draw. What ``very small chance'' and ``very small distance'' mean in practice are subtle, and depend on the problem. All of our conclusions will be probabilistic, that is they will give probabilities. The thing to remember is this: In all these cases one imagines that the actual truth about the population is fixed (though unknown, like what is behind the curtain in the Monty Hall problem).
A confidence interval is a very general tool. Whenever you want to measure anything numerical in the real world, there will be inaccuracy. The best you can give is a range for the value, and even then you cannot say for certain that the true value will fall in that range. The best you can hope to do is give an exact probability for your chance of being right when you say it falls within the exact range of values that you give. That is a confidence interval. For example, if I say ``The 99% confidence interval for the median number of hours of TV watched per week by American college students is between 9.7 and 15.5,'' I am saying that whatever the true value for that parameter is, I am 99% sure that it is somewhere between 9.7 and 15.5. What I mean by that is that I went through some procedure to come up with the numbers 9.7 and 15.5, which presumably involved taking a random sample of college students, asking them how many hours of TV they watched, and doing some calculation with those numbers. If I did this many times, each time my sample would be different (because it is chosen randomly) and so each time my interval would be different. When I make the original claim I am saying that if I repeated that procedure thousands of times and came up with thousands of different intervals, about 99% of them would contain the true value. It is really a statement about the procedure you used to compute those numbers, not about the numbers themselves.
When you give a 95% confidence interval, you are saying there is a 5% chance that you will be wrong due to the randomness of the sample. There is an additional chance that you will be wrong because you messed up the calculation, or the numbers weren't recorded correctly, or whatever. In addition, every procedure requires certain assumptions to be true, such as that the sample is a Simple Random Sample and that some variable has a normal distribution, which at best are usually true only approximately. You have to assess this subjectively. Finally, you need to be clear on the difference between the actual quantity you are getting the confidence interval for and the one you would like it to be for. If students are consistently underestimating the TV they watch because of embarrassment, the interval I come up with will not be a 95% confidence interval for the median number of hours they watch; it will be a confidence interval for the median number of hours that all college students would say they watched if asked in the manner the students in the sample were asked. Of course it is ridiculous to state all these caveats every time you give a confidence interval, but it is important to be aware of them and to have a sense of which issues might be limiting the accuracy of the estimate in your particular case. No one will warn you of this, and no computer program will help you assess it. You are on your own.
Confidence intervals are expressed in two different forms. Sometimes they are given as ``between 9.7 and 15.5,'' or simply ``[9.7, 15.5],'' but sometimes they are given as 12.6 ± 2.9, the first number being the center of the interval and the second being half the width, which we call the margin of error.
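Converting between the two forms is simple arithmetic. Here is a small Python sketch (the function names are made up for illustration) using the TV-hours interval from above:

```python
def to_center_margin(low, high):
    """Convert an interval [low, high] to (center, margin of error)."""
    center = (low + high) / 2
    margin = (high - low) / 2
    return center, margin


def to_endpoints(center, margin):
    """Convert center +/- margin back to the endpoints (low, high)."""
    return center - margin, center + margin


center, margin = to_center_margin(9.7, 15.5)
print(f"{center:.1f} +/- {margin:.1f}")  # prints: 12.6 +/- 2.9
```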
We want to consider some claim about a population (in this class, about a population, but the logic of this works any time you have probabilistic evidence for a claim), and all we have is information about a sample. We would like to assess quantitatively the strength of this evidence, and we would like to have an objective procedure for deciding if the evidence is good enough to believe it. Two situations where such an objective procedure is valuable: Scientific evidence, where objectivity is of paramount importance, and highly automated decision making processes, such as spam filtering or industrial processes.
To make it concrete, suppose you are flipping a coin, and trying to decide if the coin is improperly weighted. If you flip it 10 times and it comes up tails 7 of them, you would probably chalk it up to chance. Likewise if you flipped it 100 times and it came up tails 53 times. But if it came up tails 700 out of 1000 times, you would probably consider this good evidence that the coin is weighted to favor tails. What is the cutoff? If you think carefully about how you are deciding, you will probably agree that you are asking yourself ``What are the chances that a set of flips like this would happen by chance, i.e. if the coin were properly weighted?'' For example, seven tails in 10 flips does not seem like an unlikely thing to occur (assuming the coin is properly weighted), but 700 tails in 1000 flips seems incredibly unlikely, unless the coin is misweighted. This is the central idea of Hypothesis Testing. To assess the evidence your sample (or experiment or whatever) offers for a claim, you calculate the probability that you would see a sample like the one you got assuming this claim was false. This probability is called the p-value, and if it is very small, it is unlikely your results would have occurred by chance (i.e. if the claim were false), so the fact that they did occur is evidence for the claim. If it is high, you do not have good evidence for the claim.
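To see the three coin examples in numbers, here is a minimal Python sketch. It assumes that ``a sample like this'' means ``at least this many tails'' (a one-sided question), and the function name is made up for illustration. The counting is exact because for a fair coin every sequence of n flips is equally likely.

```python
from math import comb


def prob_at_least(k, n):
    """P(at least k tails in n flips of a fair coin), computed exactly."""
    favorable = sum(comb(n, j) for j in range(k, n + 1))
    return favorable / 2 ** n


for k, n in [(7, 10), (53, 100), (700, 1000)]:
    print(f"{k} or more tails in {n} flips: {prob_at_least(k, n):.3g}")
```

The first two probabilities are far too large to count as evidence of a weighted coin, while the third is astronomically small.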
The claim you are assessing the evidence for is called the Alternate Hypothesis, abbreviated H1 or Ha. Its negation, the thing you assume in order to compute the probability, is called the Null Hypothesis, abbreviated H0. You assume that the Null Hypothesis is true, and using that you compute the probability that you would get results similar to yours (that is, the percentage of times out of many random samples you could expect to get results similar to what your particular sample happens to give). That is the p-value.
In informal hypothesis testing (common in business settings), this is the end of the story. You report your p-value (saying ``The p-value is 0.01'' or more precisely ``The probability of getting results like these if the null hypothesis were true is 1%'') and make a subjective judgement of how strong the evidence is for the alternate hypothesis. In our class, just to get into the habit of doing this, we usually say that a p-value smaller than .5% is very strong evidence, a p-value from .5% to 10% is evidence, and a p-value of more than 10% is very weak evidence for the alternate hypothesis. But that is very arbitrary.
In formal hypothesis testing you are given a cutoff called the significance level, usually written as the Greek letter ``alpha.'' In class this will be given to you in the question (that's how you will know it is supposed to be formal!), but in practice there are usually standard values that everyone in a given field uses. Common significance levels are 1% and 5%. If your p-value is less than the significance level you say ``this data is significant evidence that [alternate hypothesis here],'' otherwise you say ``this data is not significant evidence that [alternate hypothesis here].'' For reasons that we will talk about later, you should never look at a p-value and then pick a significance level that makes it significant. That is called ``letting your data tell you what question to ask.'' An old-fashioned way to express these two conclusions is ``reject the null hypothesis'' and ``fail to reject the null hypothesis.''
There are two ways for your conclusion to fail to reflect reality in hypothesis testing. If the alternate hypothesis is false and you find there is significant evidence for the alternate, this is called an Error of Type I. If the alternate hypothesis is true and you find the data is not significant, this is an Error of Type II. By adjusting the significance level, you can balance the tradeoff between the two types of errors. Often the Type I error is the bigger error (example, convicting the innocent is a type I error, setting free the guilty is type II).
There is a nice interpretation of the p-value and significance levels you should know. If you do many significance tests at a significance level alpha, then out of all the times when the null hypothesis is actually true, on average a fraction alpha of them will nonetheless give significant evidence (a Type I error). So your significance level is where you control your chances of making a Type I error. Of course, the lower your significance level, the higher your chances of making a Type II error.
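You can check this interpretation by simulation. The Python sketch below makes several assumptions purely for illustration: the population is normal with mean 0 and known standard deviation 1, the null hypothesis (mean = 0) really is true, each sample has size 25, and the test is two-sided. It counts how often the test comes out significant anyway.

```python
import random
from statistics import NormalDist


def type_one_error_rate(alpha=0.05, n=25, trials=20000, seed=0):
    """Fraction of tests that are significant even though the null is true."""
    rng = random.Random(seed)
    mu0, sigma = 0.0, 1.0            # the null hypothesis is TRUE here
    std_error = sigma / n ** 0.5
    standard_normal = NormalDist()
    rejections = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (xbar - mu0) / std_error
        # Two-sided p-value: chance of a sample mean at least this far from mu0
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1
    return rejections / trials


rate = type_one_error_rate()
print(f"Significant in {rate:.1%} of tests where the null was true")
```

The printed rate should come out close to the chosen alpha of 5%, which is the Type I error rate described above.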
People can usually tell what the alternate and null hypotheses are, but often have trouble telling which is which. Here are three criteria:
A really simple case is when you want to give a confidence interval or do a hypothesis test for the population mean of a numerical variable (let's say it is normally distributed) and you somehow know its standard deviation. The procedures in this case are very straightforward, and you can see the logic more clearly.
Let's say the mean is μ (which we don't know) and the standard deviation is σ (which we do know). Thus the distribution of the original variable is a normal distribution with mean μ and standard deviation σ. Now suppose we take many samples of size n from the population, and compute the sample mean Xbar of each. These numbers Xbar will have a distribution (the sampling distribution) which is again normal, still has mean μ, but now has standard deviation σ/sqrt(n), sigma over the square root of n. So that I don't have to write that a lot, I will call it the standard error. Remember that for a normal distribution with mean μ and standard deviation σ you can compute any probabilities like
P(X < a) "=NORMDIST(a, μ, σ, TRUE)"
P(X > a) "=1-NORMDIST(a, μ, σ, TRUE)"
P(a < X < b) "=NORMDIST(b, μ, σ, TRUE)-NORMDIST(a, μ, σ, TRUE)"
and finally, the probability that X is at least as far from μ as a is (in either direction) is:
"=2*MIN(NORMDIST(a, μ, σ, TRUE), 1-NORMDIST(a, μ, σ, TRUE))"
On the other hand if you have a probability p,
"=NORMINV(p, μ, σ)"
will give you the value below which a fraction p of this distribution lies (the 100p-th percentile).
"=NORMINV(p, 0, 1)"
will tell you how many standard deviations above the mean you have to go so that a fraction p of the data is below you. So
"=NORMINV((1+p)/2, 0, 1)"
will tell you how many standard deviations you have to go from the mean to have a fraction p of the data closer to the mean than you are. If you do that with p=.95, you will find that 95% of the data is within 1.96 standard deviations of the mean.
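If you prefer to check these calculations outside Excel, Python's standard library has a `statistics.NormalDist` class whose `cdf` and `inv_cdf` methods play the roles of NORMDIST(..., TRUE) and NORMINV. The numbers below (mean 100, standard deviation 15, cutoffs 85 and 130) are invented for illustration.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0          # hypothetical mean and standard deviation
X = NormalDist(mu, sigma)
a, b = 85.0, 130.0               # hypothetical cutoffs

p_below = X.cdf(a)                               # P(X < a)
p_above = 1 - X.cdf(a)                           # P(X > a)
p_between = X.cdf(b) - X.cdf(a)                  # P(a < X < b)
p_two_sided = 2 * min(X.cdf(a), 1 - X.cdf(a))    # at least as far from mu as a

value_at_90th = X.inv_cdf(0.90)                  # like NORMINV(0.90, mu, sigma)
z_95 = NormalDist().inv_cdf((1 + 0.95) / 2)      # like NORMINV((1+p)/2, 0, 1)

print(round(z_95, 2))  # prints: 1.96
```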
With all that in hand, how do you find a 95% confidence interval for the mean if you have a sample of size n with a sample mean of Xbar? We know that 95% of all sample means from samples of size n will be within 1.96 standard errors of the population mean, so will differ from μ by less than 1.96 σ/sqrt(n). To put it another way, we are 95% sure that μ is within 1.96 σ/sqrt(n) of our particular Xbar, so the confidence interval is
Xbar ± 1.96 σ/sqrt(n)
For a general confidence level α (written as a decimal, so α=.95 for 95%; here α is the confidence level, not the significance level of a hypothesis test) it is
Xbar ± z σ/sqrt(n)
where z=NORMINV((1+α)/2, 0, 1).
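The same confidence interval can be sketched in Python. The function name and the example numbers (a sample mean of 12.6 from 36 students, with a known standard deviation of 5) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)   # 1.96 for 95%
    margin = z * sigma / sqrt(n)                     # z standard errors
    return xbar - margin, xbar + margin


low, high = z_confidence_interval(xbar=12.6, sigma=5.0, n=36)
print(f"[{low:.2f}, {high:.2f}]")
```

Notice that quadrupling the sample size only halves the margin of error, because n appears under a square root.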
Given some particular number μ_0 called the test mean, we can test three different alternate hypotheses:
Usually, you do not know the population standard deviation σ, so you cannot use the methods of the previous paragraphs. If your sample size is large, the sample standard deviation s coming from your sample is a good substitute for the population standard deviation σ. If your sample size is not large, this adds yet another source of randomness to your calculation, so you would expect your confidence intervals to be wider and your p-values to be higher to account for that. The way to do this is to replace the normal distribution (this is, for example, the distribution that gives you z in the confidence interval) with another distribution, the t-distribution. The t-distribution, like the normal distribution, is a mathematical formula that depends on certain parameters, from which you can determine the probability that a variable following this distribution will fall in a certain range of values. Instead of depending on the mean and standard deviation, the t-distribution depends on the size of the sample. For mathematical and historical reasons we do not write the dependence in terms of the sample size n, but in terms of what we call the degrees of freedom, n-1. Degrees of freedom is a notion we will see again and again. When the degrees of freedom is large, the t-distribution is very close to the standard normal distribution (i.e. the one with mean 0 and s.d. 1), but when the degrees of freedom is small, it is broader, with fatter tails. The Excel formulas to calculate the confidence interval and p-value are almost the same as in the sigma known case.
Confidence Interval (confidence level α):
Xbar ± t s/sqrt(n)
where t=TINV(1-α, n-1) and s is the sample standard deviation, which takes the place of the unknown σ.
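Python's standard library has no t-distribution, so the sketch below builds one by hand: it integrates the t density numerically (Simpson's rule) and searches for the critical value by bisection. This is an illustration under those assumptions, not how Excel or a statistics package computes it, and it is only meant for ordinary confidence levels and degrees of freedom. The function names and the data are hypothetical.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T < x): Simpson's rule on [0, |x|], plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_critical(confidence, df):
    """Plays the role of TINV(1 - confidence, df)."""
    target = (1 + confidence) / 2
    low, high = 0.0, 1.0
    while t_cdf(high, df) < target:      # widen until the answer is bracketed
        high *= 2
    for _ in range(60):                  # bisection
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def t_confidence_interval(data, confidence=0.95):
    """Confidence interval for the mean when sigma is unknown."""
    n = len(data)
    xbar, s = mean(data), stdev(data)    # stdev divides by n-1
    margin = t_critical(confidence, n - 1) * s / math.sqrt(n)
    return xbar - margin, xbar + margin


hours = [12, 9, 15, 11, 14, 10, 13, 16, 8, 12]   # hypothetical sample
low, high = t_confidence_interval(hours)
print(f"t = {t_critical(0.95, 9):.3f}, interval = [{low:.2f}, {high:.2f}]")
```

With only 9 degrees of freedom the critical value is about 2.26 rather than 1.96, which is exactly the widening described above.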
Hypothesis Testing:
However, we will use the t procedure template to do the calculation for us. If you have raw data, enter it in column A in the "Data" tab of the template. Then on the "t-test" tab enter your test mean in the text box provided, enter the significance level below it, and click on the correct form of the alternate hypothesis or enter the confidence level. The confidence interval and/or the p-value and conclusion appear to the right in green. If you only have summary statistics (i.e., Xbar, n), enter them in the text boxes provided on the "t-test" tab and click the "Use summary statistics" box.
Every statistical procedure involves modeling a real situation by a theoretical mathematical model. This amounts to making assumptions about the situation that are generally true only approximately at best. It is crucial that you know what these assumptions are and how to assess how closely they are met, in order to have a sense of the reliability of the conclusions. We will generally choose a rule of thumb for judging when the assumption is close enough. For the two versions of the one population, one numerical variable procedures (z-procedure for sigma known, t-procedure for sigma unknown) there are three assumptions.
Theory | Practice
The sample is a Simple Random Sample | Those individuals more likely to be chosen in the sample do not appear to differ from those less likely in a way that would affect the measured variable(s)
Sampling is with replacement or from an infinite population | The population is at least 20 times the sample size (this is almost always met, and when it isn't there is a correction that some of the templates include).
The distribution of Xbar is normal | Rule of Cool

To help you remember the last assumption, which we will keep seeing in many tests, I have devised the 0:15:40 Rule of Cool.
The t procedure template gives a rough histogram of the sample data on the "Hist" page, quite sufficient for determining if the middle criterion is met (not too skew, no extreme outliers), though it will only work on samples under 3000.
The confidence interval for a single numerical variable is used very frequently, but hypothesis testing is less frequent: It is not that often that you want to compare the average value of something to some fixed thing (the test mean). More often you want to compare two things. If I am asking whether Treatment A is more effective than Treatment B, I will want to look at, for example, the average recovery time for people who receive A and compare it to those who receive B. If I get statistically significant evidence that the average recovery time for A is shorter than for B, I conclude that A is more effective.
Here we think of there being two populations (in this case, "people treated by Treatment A" and "people treated by Treatment B") and a random variable X on both. The means of the random variable in the two populations are μ_1 and μ_2 respectively. Generally our null hypothesis is that μ_1 = μ_2, which we often write as μ_1 - μ_2 = 0, but every once in a while we want to have a more general null hypothesis μ_1 - μ_2 = d, where d, the test difference, is whatever number you like. The alternate hypothesis can be
The template (Two Sample t-test on the Excel Templates page) will do the calculation. If you have the raw data from two samples, just enter it in two columns, COLUMN A and COLUMN B, in the "Data" tab. It is a very good idea to put labels at the top so you can remember which is which. The "t-test" tab will then give you a confidence interval if you enter a confidence level, and will give you a p-value. You leave the space for "test difference" blank unless you are in the rare situation where you have a null hypothesis like "the first mean is exactly 20 more than the second mean." You can enter a significance level as well if you like. If you only have the summary statistics, enter these six numbers in the space provided on the t-test page and click on the "summary statistics" button. The assumptions of the two sample t-procedure are the same as for the one sample t-procedure, applied to each sample. That is, we assume in practice that each sample behaves as if it were a random sample, is from a population of reasonable size, and satisfies the Rule of Cool. You can check the histograms of your two samples on the "Hist" page of the template for skewness and outliers. In practice we generally are more relaxed about the size limitations for two samples, so that if each sample has size close to 40, or close to 15 and is reasonably symmetric, it is considered OK. There is one additional assumption for the two sample test:
Theory | Practice
Obscure | Each sample should have size at least 5.
Suppose you have a population, and some proportion p of it has a property, which we will call success. When you choose an individual at random from this population, you will have probability p of getting a success and probability 1-p of getting a failure. If you take a sample of n individuals from a population and count up how many successes you get, that is a binomial experiment, and as you run through many samples you will get a binomial distribution for the number m of successes. m will have a mean of np and a standard deviation of the square root of the quantity np(1-p), and will be roughly normal when np and n(1-p) are both at least 5. That means that the variable m/n, which is the sample proportion phat, will be roughly normal with a mean of p and a standard deviation of the square root of the quantity p(1-p)/n.
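As a check on the mean and standard deviation formulas, this Python sketch computes both directly from the binomial probabilities and compares them with np and the square root of np(1-p). The values n = 50 and p = 0.3 are arbitrary choices for illustration.

```python
from math import comb, sqrt


def binomial_pmf(m, n, p):
    """P(exactly m successes in a sample of n) when each success has chance p."""
    return comb(n, m) * p ** m * (1 - p) ** (n - m)


n, p = 50, 0.3                    # hypothetical sample size and proportion
pmf = [binomial_pmf(m, n, p) for m in range(n + 1)]

mean_m = sum(m * q for m, q in enumerate(pmf))
sd_m = sqrt(sum((m - mean_m) ** 2 * q for m, q in enumerate(pmf)))

print(f"mean of m: {mean_m:.4f}   formula np: {n * p:.4f}")
print(f"s.d. of m: {sd_m:.4f}   formula sqrt(np(1-p)): {sqrt(n * p * (1 - p)):.4f}")

# The rule of thumb from the text for the normal approximation
normal_ok = n * p >= 5 and n * (1 - p) >= 5
print("np and n(1-p) both at least 5:", normal_ok)
```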
Now turn this around. Suppose you take a sample of size n from a population where the proportion of successes is unknown, and you find the sample proportion is some number phat. What does that suggest about the true proportion p? Presumably the true proportion p is pretty close to phat, so the true standard deviation is pretty close to the square root of phat(1-phat)/n (in fact the standard deviation changes much more slowly than p). So just as with the confidence interval for a numerical variable with sigma known, we have
Confidence Interval for a Proportion:
phat ± z sqrt(phat(1-phat)/n)
where z is the z-score associated to the confidence level.
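Here is the proportion interval as a short Python sketch. The function name and the example counts (120 successes in a sample of 400) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)   # phat stands in for p
    return p_hat - margin, p_hat + margin


low, high = proportion_confidence_interval(successes=120, n=400)
print(f"[{low:.3f}, {high:.3f}]")
```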
The situation is even simpler for hypothesis testing. Here we will have some test proportion p_0, and our null hypothesis will be p = p_0. We will have three possible alternate hypotheses, namely
Assuming the null hypothesis, that is that the population proportion is p_0, we know that the sample proportion has a mean of p_0 and a standard deviation of sqrt(p_0(1-p_0)/n), and is normally distributed. So we get a p-value with no further fuss:
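A minimal Python sketch of this p-value calculation follows. The function name and the `alternative` labels ("less", "greater", "two-sided") are my own choices for illustration, and the example reuses the 53 tails in 100 flips from the coin discussion with a test proportion of 0.5.

```python
from math import sqrt
from statistics import NormalDist


def proportion_p_value(successes, n, p0, alternative="two-sided"):
    """p-value for a sample proportion, assuming the null p = p0 is true."""
    p_hat = successes / n
    # Under the null, phat is roughly normal with this mean and s.d.
    null_dist = NormalDist(p0, sqrt(p0 * (1 - p0) / n))
    below = null_dist.cdf(p_hat)
    if alternative == "less":
        return below
    if alternative == "greater":
        return 1 - below
    if alternative == "two-sided":
        return 2 * min(below, 1 - below)
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


p = proportion_p_value(53, 100, 0.5, alternative="greater")
print(f"p-value: {p:.4f}")   # large, so no real evidence the coin favors tails
```

Because this uses the normal approximation, the answer will differ a little from an exact binomial calculation on the same data.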
Theory | Practice
The sample is a Simple Random Sample | Those individuals more likely to be chosen in the sample do not appear to differ from those less likely in a way that would affect the measured variable(s)
Sampling is with replacement or from an infinite population | The population is at least 20 times the sample size (this is almost always met, and when it isn't there is a correction that some of the templates include).
The distribution of phat is normal | Rule of 15

ANOVA stands for analysis of variance. Here you have I samples taken from I different populations (often it is one population under I different circumstances, such as people being given one of several different treatments). Generally I is more than 2, because if it is 1 or 2 you are better off doing one of the t-procedures. Let's say the samples are of size n_1, n_2, ..., n_I. On each population the variable of interest X has some mean μ_i, and some standard deviation σ. We have to assume the standard deviations are all the same for ANOVA to work! This is too much information to compute any kind of confidence interval, and in fact there is only one null and alternate hypothesis we can use, so there is no decision to make. Our null hypothesis will always be that all these means are the same, and our alternate hypothesis is that there is some difference among them. As always we will compute a p-value, which is the probability you would get samples whose sample means are this far apart assuming the null hypothesis (that all the population means are equal).
The way this is computed is actually very instructive. It turns out that when you can think of all the variation or error in a problem as coming from two or more distinct sources, often you can associate a kind of variance to these (it will always be the sum or average of the squares of the difference between the values and some predicted value) so that the total variance is a sum of the variance due to each of these effects. Here, if we assume the null hypothesis, we can think of all of these samples, since they come from populations with the same mean μ and standard deviation σ, as one big sample from one big population. The total variation or SST (Sum Square error Total) is the sum of the squares of the difference of each point with the sample mean of the whole sample. This is the sum of two contributions. The first is the SSG or Sum Square error for Groups, which is the sum, over every data point, of the square of the difference between that point's sample mean and the total sample mean (so each sample mean is counted once for each point in its sample). The second is the SSE or Sum Square Error within each sample, which is the sum of the squares of the differences of each data point minus the mean of its sample. All this is to say
SST = SSG + SSE
where SSG represents how much the sample means vary amongst themselves and SSE represents how much the points in each sample vary amongst themselves. If they all came from populations with the same mean, the SSG would just come from natural variation from one sample to the next, and you would expect it to make up some predetermined fraction of the total variation (depending on the sizes of the samples of course). If SSG represents too big a fraction of the total variation, you start to believe that it is because the means are actually different. Specifically some math will tell you that if the null hypothesis is true the quantity
F = (SSG/DFG) / (SSE/DFE)
as we run through many different samples will have a characteristic distribution, called the F distribution. Here DFG = I-1 is the Group Degrees of Freedom and DFE = N-I is the Error Degrees of Freedom, where N = n_1 + n_2 + ... + n_I is the total number of data points; these are the two parameters the F distribution depends on. Since SST = SSG + SSE, F is large exactly when SSG is a large fraction of SST. Software or really long tables will tell you, once you know F, DFG and DFE, the probability that you would get that big an F score or bigger in a collection of samples (of the correct sizes) from populations whose means are all equal. That is your p-value.
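To make the bookkeeping concrete, here is a Python sketch that computes the sums of squares for three tiny made-up samples and confirms that SST = SSG + SSE. It assumes the conventional form of the F statistic, the ratio of mean squares (SSG/DFG)/(SSE/DFE). The standard library has no F distribution, so the sketch stops at F; with SciPy installed, `scipy.stats.f.sf(F, DFG, DFE)` would be one way to get the p-value.

```python
def anova_table(samples):
    """Sums of squares, degrees of freedom and F for a list of samples."""
    all_points = [x for sample in samples for x in sample]
    N, I = len(all_points), len(samples)
    grand_mean = sum(all_points) / N
    means = [sum(sample) / len(sample) for sample in samples]

    # Between groups: each sample mean counted once per point in its sample
    ssg = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Within groups: each point compared with the mean of its own sample
    sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    # Total: each point compared with the mean of everything
    sst = sum((x - grand_mean) ** 2 for x in all_points)

    dfg, dfe = I - 1, N - I
    f = (ssg / dfg) / (sse / dfe)
    return {"SSG": ssg, "SSE": sse, "SST": sst, "DFG": dfg, "DFE": dfe, "F": f}


# Three hypothetical treatment groups
table = anova_table([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```

In this example the third group's mean sits well away from the other two, so SSG makes up most of SST and F is large.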
All you have to do is enter the data from the separate samples in separate columns in the "Data" tab of the ANOVA template (make sure not to label the columns with numbers, or Excel will think the labels are data), then read off the p-value. Or, enter the means, s.d.s and counts in the "Data Summary" tab.
The assumptions for the ANOVA test are a little more complicated. Each sample must satisfy all the assumptions of the t procedure. That is, it must be a reasonable approximation of a simple random sample, it must be sampled from a sufficiently large population, and it must satisfy the 0:15:40 Rule of Cool. (In practice, we think of the different samples as providing additional "averaging out," so that we can be more relaxed about this rule. Three or four samples of size 20 are probably OK short of the most outrageous outliers or skewness.) In addition, the assumed equality of the standard deviations imposes the following additional rule of thumb: