Inferential Statistics Notes


Inference

We want to infer information about an entire population from knowledge of a relatively small sample. In general, variations on the Law of Large Numbers can usually tell us that if what we want to know is something averaged or aggregated, there is a computation you can do from the sample that gives an estimate with a very small chance of being more than a very small distance away from the thing we are trying to approximate, provided the sample was chosen at random. Chosen at random means that it was chosen in such a fashion that every individual is equally likely to be chosen, and certain individuals being in the sample do not change any other individual's chances of being in the sample. This is extremely hard to do in practice, and is a major limitation on what conclusions you can draw. What ``very small chance'' and ``very small distance'' mean in practice are subtle questions, and depend on the problem. All of our conclusions will be probabilistic, that is, they will give probabilities. The thing to remember is this: in all these cases one imagines that the actual truth about the population is fixed (though unknown, like what is behind the curtain in the Monty Hall problem).

Confidence Intervals

A confidence interval is a very general tool. Whenever you want to measure anything numerical in the real world, there will be inaccuracy. The best you can give is a range for the value, and even then you cannot say for certain that the true value will fall in that range. The best you can hope to do is give an exact probability for your chance of being right when you say it falls within the exact range of values that you give. That is a confidence interval. For example, if I say ``The 99% confidence interval for the median number of hours of TV watched per week by American college students is between 9.7 and 15.5,'' I am saying that whatever the true value of that parameter is, I am 99% sure that it is somewhere between 9.7 and 15.5. What I mean by that is that I went through some procedure to come up with the numbers 9.7 and 15.5, which presumably involved taking a random sample of college students, asking them how many hours of TV they watched, and doing some calculation with those numbers. If I did this many times, each time my sample would be different (because it is chosen randomly) and so each time my interval would be different. When I make the original claim I am saying that if I repeated that procedure thousands of times and came up with thousands of different intervals, 99% of them would be correct. It is really a statement about the procedure you used to compute those numbers, not about the numbers themselves.

When you give a 95% confidence interval, you are saying there is a 5% chance that you will be wrong due to the randomness of the sample. There is an additional chance that you will be wrong because you messed up the calculation, or the numbers weren't recorded correctly, or whatever. In addition, every procedure requires certain assumptions to be true, such as that the sample is a Simple Random Sample and that some variable has a normal distribution, which at best are usually true only approximately. You have to assess this subjectively. Finally, you need to be clear on the difference between the actual quantity you are getting the confidence interval for and the one you would like it to be for. If students are consistently underestimating the TV they watch because of embarrassment, the interval I come up with will not be a 95% confidence interval for the median number of hours they actually watch; it will be a confidence interval for the median number of hours that all college students would say they watched if asked in the manner the students in your sample were asked. Of course it is ridiculous to say all these caveats every time you give a confidence interval, but it is important to be aware of them and to have a sense of which issues might be limiting the accuracy of the estimate in your particular case. No one will warn you of this, and no computer program will help you assess it; you are on your own.

Confidence intervals are expressed in two different forms. Sometimes they are given as ``between 9.7 and 15.5,'' or simply ``[9.7, 15.5],'' but sometimes they are given as 12.6 +/- 2.9, the first number being the center of the interval and the second being half the width, which we call the margin of error.
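To convert between the two forms, the center is the average of the endpoints and the margin of error is half the distance between them: center = (9.7 + 15.5)/2 = 12.6 and margin of error = (15.5 - 9.7)/2 = 2.9, so ``between 9.7 and 15.5'' and ``12.6 +/- 2.9'' describe the same interval.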

If we were not provided with the knack of being wrong, we could never get anything useful done. We think our way along by choosing between right and wrong alternatives, and the wrong choices have to be made as frequently as the right ones.

Hypothesis Testing

We want to consider some claim about a population (in this class it will always be a claim about a population, but the logic works any time you have probabilistic evidence for a claim), and all we have is information about a sample. We would like to assess quantitatively the strength of this evidence, and we would like to have an objective procedure for deciding if the evidence is good enough to believe the claim. Two situations where such an objective procedure is valuable: scientific evidence, where objectivity is of paramount importance, and highly automated decision-making processes, such as spam filtering or industrial processes.

To make it concrete, suppose you are flipping a coin and trying to decide if the coin is improperly weighted. If you flip it 10 times and it comes up tails 7 of them, you would probably chalk it up to chance. Likewise if you flipped it 100 times and it came up tails 53 times. But if it came up tails 700 out of 1000 times, you would probably consider this good evidence that the coin is weighted to favor tails. What is the cutoff? If you think carefully about how you are deciding, you will probably agree that you are asking yourself ``What are the chances that a set of flips like this would happen by chance, i.e. if the coin were properly weighted?'' For example, seven tails in 10 flips does not seem like an unlikely thing to occur (assuming the coin is properly weighted), but 700 tails in 1000 flips seems incredibly unlikely, unless the coin is mis-weighted. This is the central idea of Hypothesis Testing. To assess the evidence your sample (or experiment or whatever) offers for a claim, you calculate the probability that you would see a sample like the one you got assuming the claim was false. This probability is called the p-value, and if it is very small, it is unlikely your results would have occurred by chance (i.e. if the claim were false), so the fact that they did occur is evidence for the claim. If it is high, you do not have good evidence for the claim.

The claim you are assessing the evidence for is called the Alternate Hypothesis, abbreviated H1 or Ha. Its negation, the thing you assume in order to compute the probability, is called the Null Hypothesis. You assume that the Null Hypothesis is true, and using that you compute the probability that you would get results similar to yours (that is, the percentage of times, out of many random samples, that you could expect to get results similar to what your particular sample happens to give). That is the p-value.
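Everything in these notes is done with the Excel templates, but if you want to see this calculation concretely, here is a short sketch in Python (using the scipy library, purely as a supplement) for the coin example above: it asks how likely it is to get at least 700 tails in 1000 flips if the coin is actually fair.

    # p-value for the coin example: the chance of at least 700 tails
    # in 1000 flips if the coin is fair (the null hypothesis)
    from scipy.stats import binom

    print(binom.sf(700 - 1, 1000, 0.5))   # P(at least 700 tails): astronomically small
    print(binom.sf(7 - 1, 10, 0.5))       # P(at least 7 tails in 10 flips): about 0.17, nothing unusual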

In informal hypothesis testing (common in business settings), this is the end of the story. You report your p-value (saying ``The p-value is 0.01,'' or more precisely ``The probability of getting results like these if the null hypothesis were true is 1%'') and make a subjective judgement of how strong the evidence is for the alternate hypothesis. In our class, just to get into the habit of doing this, we usually say that a p-value smaller than .5% is very strong evidence, a p-value from .5% to 10% is evidence, and a p-value of more than 10% is very weak evidence for the alternate hypothesis. But that is very arbitrary.

In formal hypothesis testing you are given a cutoff called the significance level, usually written as the Greek letter ``alpha.'' In class this will be given to you in the question (that's how you will know it is supposed to be formal!), but in practice there are usually standard values in a given field that everyone uses. Common significance levels are 1% and 5%. If your p-value is less than the significance level you say ``this data is significant evidence that [alternate hypothesis here],'' otherwise you say ``this data is not significant evidence that [alternate hypothesis here].'' For reasons that we will talk about later, you should never look at a p-value and then pick a significance level that makes it significant. That is called ``letting your data tell you what question to ask.'' An old-fashioned way to express these two conclusions is ``reject the null hypothesis'' and ``fail to reject the null hypothesis.''

There are two ways for your conclusion to fail to reflect reality in hypothesis testing. If the alternate hypothesis is false and you find there is significant evidence for it, this is called an Error of Type I. If the alternate hypothesis is true and you find the data is not significant, this is an Error of Type II. By adjusting the significance level, you can balance the trade-off between the two types of errors. Often the Type I error is the more serious error (for example, convicting the innocent is a Type I error, while setting free the guilty is a Type II error).

There is a nice interpretation of the p-value and significance levels that you should know. If you do many significance tests at a significance level alpha, then, out of the times when the null hypothesis is actually true, the percentage in which you nevertheless find significant evidence (a Type I error) will on average be alpha. So the significance level is where you control your chances of making a Type I error. Of course, the lower your significance level, the higher your chances of making a Type II error.
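If you want to see this interpretation in action, here is a small simulation sketch in Python (numpy and scipy, again only a supplement to the templates). It looks ahead to the sigma-known test for a mean described below, runs that test on thousands of samples drawn from a population where the null hypothesis really is true, and counts how often the result comes out significant at alpha = 5%.

    # Simulate many tests in a world where the null hypothesis is true,
    # and count how often we (wrongly) find significant evidence at level alpha.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    alpha, mu0, sigma, n, trials = 0.05, 50.0, 10.0, 30, 10_000   # made-up population and sample size

    type_i_errors = 0
    for _ in range(trials):
        sample = rng.normal(mu0, sigma, n)                  # the null hypothesis really is true
        z = (sample.mean() - mu0) / (sigma / np.sqrt(n))
        p_value = 2 * min(norm.cdf(z), 1 - norm.cdf(z))     # two-tailed p-value
        if p_value < alpha:
            type_i_errors += 1

    print(type_i_errors / trials)   # comes out close to alpha = 0.05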

People can usually tell what the alternate and null hypotheses are, but often have trouble telling which is which. Here are three criteria:

The best way to decide is to say ``If I have good enough evidence to convince me that ... is true, I will ....'' Whichever claim makes more sense in such a sentence should be your alternate hypothesis.

One Population, One Numerical Variable, Sigma is Known

A really simple case is when you want to give a confidence interval or do a hypothesis test for the population mean of a numerical variable (let's say it is normally distributed) and you somehow know its standard deviation. The procedures in this case are very straightforward, and you can see the logic more clearly.

Let's say the mean is μ (which we don't know) and the standard deviation is σ (which we do know). Thus the distribution of the original variable is a normal distribution with mean μ and standard deviation σ. Now suppose we take many samples of size n from the population and compute the sample mean X-bar of each. These numbers X-bar will have a distribution (the sampling distribution) which is again normal, still has mean μ, but now has standard deviation σ/sqrt(n), sigma over the square root of n. So that I don't have to write that a lot, I will call it the standard error. Remember that for a normal distribution with mean μ and standard deviation σ you can compute any probabilities like
P(X < a) "=NORMDIST(a, μ, σ, TRUE)"
P(X > a) "=1-NORMDIST(a, μ, σ, TRUE)"
P(a < X < b) "=NORMDIST(b, μ, σ, TRUE)-NORMDIST(a, μ, σ, TRUE)"
and finally, the probability that X is farther away from the mean μ than a is:
"=2*MIN(NORMDIST(a, μ, σ, TRUE),1-NORMDIST(a, μ, σ, TRUE))"
On the other hand, if you have a probability p,
"=NORMINV(p,μ,σ)"
will give you the value that is at the pth percentile of this distribution.
"=NORMINV(p,0,1)"
will tell you how many standard deviations you have to go up from the mean to guarantee that a proportion p of the data will be below you. So
"=NORMINV((1+p)/2,0,1)"
will tell you how many standard deviations you have to go to have a proportion p of the data closer to the mean than you are. So if you do that with p=.95, you will find that 95% of the data is within 1.96 standard deviations of the mean.
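If you ever want to check these Excel formulas outside the spreadsheet, the equivalents in Python's scipy library (offered only as a cross-check) are norm.cdf for NORMDIST with TRUE and norm.ppf for NORMINV:

    # scipy equivalents of the NORMDIST/NORMINV formulas above
    from scipy.stats import norm

    mu, sigma = 0, 1
    print(norm.cdf(1.5, mu, sigma))        # P(X < 1.5), like =NORMDIST(1.5, mu, sigma, TRUE)
    print(norm.ppf(0.95, mu, sigma))       # 95th percentile, like =NORMINV(0.95, mu, sigma)
    print(norm.ppf((1 + 0.95) / 2, 0, 1))  # about 1.96: 95% of the data is within 1.96 s.d. of the mean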
With all that in hand, how do you find a 95% confidence interval for the mean if you have a sample of size n with a sample mean of X-bar? Well, we know that 95% of all samples of size n will have an X-bar within 1.96 standard errors of the mean, that is, they will differ from the mean by less than 1.96 σ/sqrt(n). To put it another way, we are 95% sure that the mean is within 1.96 σ/sqrt(n) of our particular X-bar, so the confidence interval is
X-bar +/- 1.96 σ/sqrt(n)
For confidence level α it is
X-bar +/- z σ/sqrt(n)
where z=NORMINV((1+α)/2,0,1).
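Here is the same confidence interval calculation as a Python sketch (scipy again, with a made-up sample mean of 12.6, a known sigma of 8, and a sample size of 30, just for illustration):

    # Confidence interval for the mean when sigma is known: X-bar +/- z*sigma/sqrt(n)
    from math import sqrt
    from scipy.stats import norm

    x_bar, sigma, n = 12.6, 8.0, 30        # made-up sample mean, known sigma, sample size
    conf_level = 0.95

    z = norm.ppf((1 + conf_level) / 2)     # 1.96 for 95% confidence
    margin = z * sigma / sqrt(n)
    print(x_bar - margin, x_bar + margin)  # the 95% confidence interval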

Given some particular number μ0, called the test mean, we can test three different alternate hypotheses:
Ha: μ > μ0
Ha: μ < μ0
Ha: μ ≠ μ0

The first two are called one-tailed tests, the last two-tailed. You of course assume the Null Hypothesis
H0: μ = μ0
from which you can compute the p-value, the probability that a random sample would give results at least as much in favor of the alternate hypothesis as your X-bar. These are, respectively,
"=1-NORMDIST(X-bar, μ0, σ/SQRT(n), TRUE)"
"=NORMDIST(X-bar, μ0, σ/SQRT(n), TRUE)"
"=2*MIN(NORMDIST(X-bar, μ0, σ/SQRT(n), TRUE), 1-NORMDIST(X-bar, μ0, σ/SQRT(n), TRUE))"

One Sample, One Numerical Variable: The t-procedure

Usually, you do not know the population standard deviation σ, so you cannot use the methods of the previous paragraphs. If your sample size is large, the sample standard deviation s coming from your sample is a good substitute for the population standard deviation σ. If your sample size is not large, this adds yet another source of randomness to your calculation, so you would expect your confidence intervals to be wider and your p-values to be higher to account for that. The way to do this is to replace the normal distribution (this is, for example, the distribution that gives you the z in the confidence interval) with another distribution, the t-distribution. The t-distribution, like the normal distribution, is a mathematical formula that depends on certain parameters, from which you can determine the probability that a variable following this distribution will fall in a certain range of values. Instead of depending on the mean and standard deviation, the t-distribution depends on the size of the sample. For mathematical and historical reasons we do not write the dependence in terms of the sample size n, but in terms of what we call the degrees of freedom, n-1. Degrees of freedom is a notion we will see again and again. When the degrees of freedom are large, the t-distribution is very close to the standard normal distribution (i.e. the one with mean 0 and s.d. 1), but when the degrees of freedom are small, it is broader, with fatter tails. The Excel formulas to calculate the confidence interval and p-value are almost the same as in the sigma-known case.
Confidence Interval (confidence level α):
X-bar +/- t s/sqrt(n)
where t=TINV(1-α,n-1).
Hypothesis Testing:
Exactly as in the sigma-known case, you assume the null hypothesis H0: μ = μ0 and compute the p-value, except that the sample standard deviation s takes the place of σ and the t-distribution with n-1 degrees of freedom takes the place of the normal distribution.

However, we will use the t procedure template to do the calculation for us. If you have raw data, enter it in column A in the "Data" tab of the template. Then on the "t-test" tab enter your test mean in the text box provided, enter the significance level below it, and click on the correct form of the alternate hypothesis, or enter the confidence level. The confidence interval and/or the p-value and conclusion appear to the right in green. If you only have summary statistics (i.e., X-bar, s, and n), enter them in the text boxes provided on the "t-test" tab and click the "Use summary statistics" box.
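If you are curious what the template is doing behind the scenes, here is a sketch of the same one-sample t calculations in Python (scipy); the data values are made up for illustration, and the built-in ttest_1samp is included only as a cross-check on the hand calculation.

    # One-sample t procedure: confidence interval and p-value by hand, plus a cross-check
    import numpy as np
    from scipy.stats import t, ttest_1samp

    data = np.array([11.2, 9.5, 14.1, 10.8, 12.3, 13.0, 9.9, 11.7])   # made-up sample
    mu0 = 10.0                                  # test mean
    n = len(data)
    x_bar, s = data.mean(), data.std(ddof=1)    # sample mean and sample standard deviation

    # 95% confidence interval: X-bar +/- t*s/sqrt(n), with t from the t-distribution with n-1 d.f.
    t_star = t.ppf((1 + 0.95) / 2, n - 1)
    margin = t_star * s / np.sqrt(n)
    print(x_bar - margin, x_bar + margin)

    # two-tailed p-value for H0: mu = mu0 against Ha: mu is not mu0
    t_stat = (x_bar - mu0) / (s / np.sqrt(n))
    p_value = 2 * min(t.cdf(t_stat, n - 1), 1 - t.cdf(t_stat, n - 1))
    print(t_stat, p_value)
    print(ttest_1samp(data, mu0))               # scipy gives the same statistic and p-value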

Assumptions For the One Population, One Numerical Variable Tests

Every statistical procedure involves modeling a real situation by a theoretical mathematical model. This amounts to making assumptions about the situation that are generally true only approximately at best. It is crucial that you know what these assumptions are and how to assess how closely they are met, in order to have a sense of the reliability of the conclusions. We will generally choose a rule of thumb for judging when the assumption is close enough. For the two versions of the one population, one numerical variable procedures (z-procedure for sigma known, t-procedure for sigma unknown) there are three assumptions.
In theory: The sample is a Simple Random Sample.
In practice: Those individuals more likely to be chosen in the sample do not appear to differ from those less likely in a way that would affect the measured variable(s).

In theory: Sampling is with replacement or from an infinite population.
In practice: The population is at least 20 times the sample size (this is almost always met, and when it isn't there is a correction that some of the templates include).

In theory: The distribution of X-bar is normal.
In practice: Either
the sample size is at least 40,
or the sample size is at least 15 and the histogram of the sample shows no extreme skewness or outliers,
or the original variable X is known to be normal.

To help you remember the last assumption, which we will keep seeing in many tests, I have devised

Steve's Patented Rule of Cool!

What does it take to be cool? It depends on your age.

The t procedure template gives a rough histogram of the sample data on the "Hist" page, quite sufficient for determining whether the middle criterion is met (not too skewed, no extreme outliers), though it will only work on samples under 3000.

Two Populations (two samples), One Numerical Variable: the t-procedure again

The confidence interval for a single numerical variable is used very frequently, but hypothesis testing is less frequent: It is not that often that you want to compare the average value of something to some fixed thing (the test mean). More often you want to compare two things. If I am asking whether Treatment A is more effective than Treatment B, I will want to look at, for example, the average recovery time for people who receive A and compare it to those who receive B. If I get statistically significant evidence that the average recovery time for A is shorter than for B, I conclude that A is more effective.

Here we think of there being two populations (in this case, "people treated by Treatment A" and "people treated by Treatment B") and a random variable X on both. The means of the random variable in the two populations are μ1 and μ2 respectively. Generally our null hypothesis is that μ1 = μ2, which we often write as μ1 - μ2 = 0, but every once in a while we want to have a more general null hypothesis μ1 - μ2 = d, where d, the test difference, is whatever number you like. The alternate hypothesis can be
Ha: μ1 - μ2 > d
Ha: μ1 - μ2 < d
Ha: μ1 - μ2 ≠ d
(with d = 0 unless you are using a test difference).

We will have a sample from each population, one of size n1, mean X-bar1, and sample s.d. s1, the other of size n2, mean X-bar2, and sample s.d. s2. It turns out that you can show that if you took many pairs of samples and computed the difference of their sample means, these differences, suitably standardized, would follow a distribution close to a t-distribution. From this you can get a p-value, or a confidence interval for the difference μ1 - μ2.

The template (Two Sample t-test on the Excel Templates page) will do the calculation. If you have the raw data from two samples, just enter it in two columns, COLUMN A and COLUMN B, in the "Data" tab. It is a very good idea to put labels at the top so you can remember which is which. The "t-test" tab will then give you a confidence interval if you enter a confidence level, and will give you a p-value. You leave the space for "test difference" blank unless you are in the rare situation where you have a null hypothesis like "the first mean is exactly 20 more than the second mean." You can enter a significance level as well if you like. If you only have the summary statistics, enter these six numbers in the space provided on the "t-test" page and click on the "summary statistics" button.

The assumptions of the two sample t procedure are the same as for the one sample t-procedure, applied to each sample. That is, we assume in practice that each sample behaves as if it were a random sample, is from a population of reasonable size, and satisfies the Rule of Cool. You can check the histograms of your two samples on the "Hist" page of the template for skewness and outliers. In practice we are generally more relaxed about the size limitations for two samples, so that if each is close to 40, or each is close to 15 and reasonably symmetric, it is considered OK. There is one additional assumption for the two sample test:
In theory: Obscure.
In practice: Each sample should have size at least 5.
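As with the one-sample case, here is a sketch of the kind of calculation the two-sample template performs, in Python with scipy; the two data lists are invented, and equal_var=False requests the unequal-variance (Welch) form of the two-sample t-test, a common default (I am not claiming it matches the template's formula exactly).

    # Two-sample t-test: is the mean recovery time for Treatment A shorter than for Treatment B?
    import numpy as np
    from scipy.stats import ttest_ind

    sample_a = np.array([23.1, 19.8, 25.4, 22.0, 24.7, 21.3])   # made-up recovery times, Treatment A
    sample_b = np.array([26.5, 28.0, 24.9, 27.2, 29.1, 25.8])   # made-up recovery times, Treatment B

    # Welch two-sample t-test of H0: mu1 = mu2 against Ha: mu1 < mu2
    # (the alternate hypothesis that Treatment A has the smaller mean)
    result = ttest_ind(sample_a, sample_b, equal_var=False, alternative='less')
    print(result.statistic, result.pvalue)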

One Population, One Categorical (Yes/No) Variable: The z-procedure

Suppose you have a population, and some proportion p of it has a property, which we will call success. When you choose an individual at random from this population, you will have probability p of getting a success and probability 1-p of getting a failure. If you take a sample of n individuals from the population and count up how many successes you get, that is a binomial experiment, and as you run through many samples you will get a binomial distribution for the number m of successes. m will have a mean of np, a standard deviation of the square root of the quantity np(1-p), and will be roughly normal when np and n(1-p) are both at least 5. That means that the variable m/n, which is the sample proportion p-hat, will be roughly normal with a mean of p and a standard deviation of the square root of the quantity p(1-p)/n.
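A quick simulation (Python with numpy, using made-up values p = 0.3 and n = 50) backs this up: the simulated p-hats average out to p and have a spread very close to the square root of p(1-p)/n.

    # Simulate the sampling distribution of the sample proportion p-hat
    import numpy as np

    rng = np.random.default_rng(0)
    p, n, samples = 0.3, 50, 100_000                # made-up true proportion and sample size

    p_hat = rng.binomial(n, p, size=samples) / n    # one p-hat per simulated sample
    print(p_hat.mean())                             # close to p = 0.3
    print(p_hat.std())                              # close to sqrt(p*(1-p)/n)
    print(np.sqrt(p * (1 - p) / n))                 # about 0.065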

Now turn this around. Suppose you take a sample of size n from a population where the proportion of successes is unknown, and you find the sample proportion is some number p-hat. What does that suggest about the true proportion p? Well, presumably the true proportion p is pretty close to p-hat, so the true standard deviation is pretty close to the square root of p-hat(1 - p-hat)/n (in fact the standard deviation changes much more slowly than p does). So just as with the confidence interval for a numerical variable with sigma known, we have

Confidence Interval for a Proportion:

p-hat +/- z sqrt(p-hat(1 - p-hat)/n)

where z is the z-score associated with the confidence level, z = NORMINV((1+α)/2,0,1) just as before.
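Here is that formula as a Python sketch (scipy, with made-up numbers: 84 successes in a sample of 200):

    # Confidence interval for a proportion: p-hat +/- z*sqrt(p-hat*(1-p-hat)/n)
    from math import sqrt
    from scipy.stats import norm

    successes, n = 84, 200                 # made-up sample: 84 successes out of 200
    conf_level = 0.95

    p_hat = successes / n
    z = norm.ppf((1 + conf_level) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    print(p_hat - margin, p_hat + margin)  # the 95% confidence interval for p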

The situation is even simpler for hypothesis testing. Here we will have some test proportion p0, and our null hypothesis will be p = p0. We will have three possible alternate hypotheses, namely