Inferential Statistics Notes

Inference

We want to infer information about an entire population just from knowledge of a relatively small sample. In general, variations on the Law of Large Numbers can usually tell us that if what we want to know is something averaged or aggregated, there is a computation you can do from the sample which gives an estimate which has a very small chance of being more than a very small distance away from the thing we are trying to approximate, assuming that the sample was chosen at random. Chosen at random means that it was chosen in such a fashion that every individual is equally likely to be chosen, and certain individuals being in the sample do not change any other individual's chances of being in the sample. This is extremely hard to do in practice, and is a major limitation on what conclusions you can draw. What ``very small chance'' and ``very small distance'' mean in practice are subtle, and depend on the problem. All of our conclusions will be probabilistic, that is they will give probabilities. The thing to remember is this: In all these cases one imagines that the actual truth about the population is fixed (though unknown, like what is behind the curtain in the Monty Hall problem).

Confidence Intervals

A confidence interval is a very general tool. Whenever you want to measure anything numerical in the real world, there will be inaccuracy. The best you can give is a range for the value, and even then you cannot say for certain that the true value will fall in that range. The best you can hope to do is give an exact probability for your chance of being right when you say it falls within the exact range of values that you give. That is a confidence interval. For example, if I say ``The 99% confidence interval for the median number of hours of TV watched per week by American college students is between 9.7 and 15.5,'' I am saying that whatever the true value for that parameter is, I am 99% sure that it is somewhere between 9.7 and 15.5. What I mean by that is that I went through some procedure to come up with the numbers 9.7 and 15.5, which presumably involved taking a random sample of college students, asking them how many hours of TV they watched, and doing some calculation with those numbers. If I did this many times, each time my sample would be different (because it is chosen randomly) and so each time my interval would be different. When I make the original claim I am saying that if I repeated that procedure thousands of times and came up with thousands of different intervals, about 99% of them would contain the true value. It is really a statement about the procedure you used to compute those numbers, not about the numbers themselves.
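The "repeat the procedure thousands of times" reading can be checked by simulation. The sketch below is only an illustration under stated assumptions: instead of the median in the TV example, it uses the simplest interval available (the mean of a normal variable whose standard deviation is known), and the population values (mean 12, standard deviation 8, samples of 50) are made up.

```python
import random
from statistics import NormalDist

def coverage(true_mu=12.0, sigma=8.0, n=50, level=0.99, trials=2000, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf((1 + level) / 2)    # critical value for this level
    margin = z * sigma / n ** 0.5                # half-width of every interval
    hits = 0
    for _ in range(trials):
        # draw a fresh random sample and build its interval
        x_bar = sum(rng.gauss(true_mu, sigma) for _ in range(n)) / n
        if x_bar - margin <= true_mu <= x_bar + margin:
            hits += 1
    return hits / trials

rate = coverage()
print(f"{rate:.3f} of the intervals contained the true mean")
```

Each individual interval either contains the true mean or it does not; the 99% describes how often the procedure succeeds over many repetitions.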

When you give a 95% confidence interval, you are saying there is a 5% chance that you will be wrong due to the randomness of the sample. There is an additional chance that you will be wrong because you messed up the calculation, or the numbers weren't recorded correctly, or whatever. In addition, every procedure requires certain assumptions to be true, such as that the sample is a Simple Random Sample and that some variable has a normal distribution, which at best are usually true only approximately. You have to assess this subjectively. Finally, you need to be clear on the difference between the actual quantity you are getting the confidence interval for and the one you would like it to be for. If students are consistently underestimating the TV they watch because of embarrassment, the interval I come up with will not be a 95% confidence interval for the median number of hours they watch; it will be a confidence interval for the median number of hours that all college students would say they watched if asked in the manner the students in my sample were asked. Of course it is ridiculous to state all these caveats every time you give a confidence interval, but it is important to be aware of them and to have a sense of which issues might be limiting the accuracy of the estimate in your particular case. No one will warn you of this, and no computer program will help you assess it; you are on your own.

Confidence intervals are expressed in two different forms. Sometimes they are given as ``between 9.7 and 15.5,'' or simply ``[9.7, 15.5],'' but sometimes they are given as 12.6 +/- 2.9, the first number being the center of the interval and the second being half the width, which we call the margin of error.
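Converting between the two forms is simple arithmetic. A minimal sketch, with hypothetical function names chosen for this example:

```python
def to_center_margin(low, high):
    """Convert an interval [low, high] to (center, margin of error)."""
    return (low + high) / 2, (high - low) / 2

def to_endpoints(center, margin):
    """Convert center +/- margin to the endpoints (low, high)."""
    return center - margin, center + margin

center, margin = to_center_margin(9.7, 15.5)
print(round(center, 1), round(margin, 1))   # 12.6 2.9
```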

If we were not provided with the knack of being wrong, we could never get anything useful done. We think our way along by choosing between right and wrong alternatives, and the wrong choices have to be made as frequently as the right ones.

Hypothesis Testing

We want to consider some claim about a population (in this class, about a population, but the logic of this works any time you have probabilistic evidence for a claim), and all we have is information about a sample. We would like to assess quantitatively the strength of this evidence, and we would like to have an objective procedure for deciding if the evidence is good enough to believe it. Two situations where such an objective procedure is valuable: Scientific evidence, where objectivity is of paramount importance, and highly automated decision making processes, such as spam filtering or industrial processes.

To make it concrete, suppose you are flipping a coin, and trying to decide if the coin is improperly weighted. If you flip it 10 times and it comes up tails 7 of them, you would probably chalk it up to chance. Likewise if you flipped it 100 times and it came up tails 53 times. But if it came up tails 700 out of 1000 times, you would probably consider this good evidence that the coin is weighted to favor tails. What is the cutoff? If you think carefully about how you are deciding, you will probably agree that you are asking yourself ``What are the chances that a set of flips like this would happen by chance, i.e. if the coin were properly weighted?'' For example, seven tails in 10 flips does not seem like an unlikely thing to occur (assuming the coin is properly weighted), but 700 tails in 1000 flips seems incredibly unlikely, unless the coin is mis-weighted. This is the central idea of Hypothesis Testing. To assess the evidence your sample (or experiment or whatever) offers for a claim, you calculate the probability that you would see a sample at least as favorable to the claim as the one you got, assuming the claim was false. This probability is called the p-value, and if it is very small, it is unlikely your results would have occurred by chance (i.e. if the claim were false), so the fact that they did occur is evidence for the claim. If it is high, you do not have good evidence for the claim.
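For the coin example these probabilities can be computed exactly from the binomial distribution. The sketch below (the function name is hypothetical) computes the chance that a fair coin gives at least the observed number of tails, which is the p-value for the claim "the coin favors tails."

```python
from math import comb

def tails_p_value(tails, flips):
    """P(at least `tails` tails in `flips` flips of a fair coin)."""
    # count the equally likely flip sequences with that many tails or more
    favorable = sum(comb(flips, k) for k in range(tails, flips + 1))
    return favorable / 2 ** flips

for tails, flips in [(7, 10), (53, 100), (700, 1000)]:
    print(f"{tails} tails in {flips} flips: p = {tails_p_value(tails, flips):.3g}")
```

Seven or more tails in 10 flips happens about 17% of the time with a fair coin, so it is weak evidence; 700 or more in 1000 is so improbable that chance is not a believable explanation.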

The claim you are assessing the evidence for is called the Alternate Hypothesis, abbreviated H1 or Ha. Its negation, the thing you assume to compute the probability, is called the Null Hypothesis, abbreviated H0. You assume that the Null Hypothesis is true, and using that you have to compute the probability that you would get results similar to yours (that is, the percentage of times out of many random samples you could expect to get results similar to what your particular sample happens to give). That is the p-value.

In informal hypothesis testing (common in business settings), this is the end of the story. You report your p-value (saying ``The p-value is 0.01'' or more precisely ``The probability of getting results like these if the null hypothesis were true is 1%'') and make a subjective judgement of how strong the evidence is for the alternate hypothesis. In our class, just to get into the habit of doing this, we usually say that a p-value smaller than .5% is very strong evidence, a p-value from .5% to 10% is evidence, and a p-value of more than 10% is very weak evidence for the alternate hypothesis. But that is very arbitrary.

In formal hypothesis testing you are given a cutoff called the significance level, usually written as the Greek letter ``alpha.'' In class this will be given to you in the question (that's how you will know it is supposed to be formal!), but in practice there are usually standard values in a given field that everyone uses. Common significance levels are 1% and 5%. If your p-value is less than the significance level you say ``this data is significant evidence that ...**Alt. Hyp. here***'', otherwise you say ``this data is not significant evidence that ...**Alt. Hyp. here***.'' For reasons that we will talk about later, you should never look at a p-value and pick a significance level that makes it significant. That is called ``Letting your data tell you what question to ask.'' An old-fashioned way to express these two conclusions is ``Reject the null hypothesis'' and ``fail to reject the null hypothesis.''

There are two ways for your conclusion to fail to reflect reality in hypothesis testing. If the alternate hypothesis is false and you find there is significant evidence for the alternate, this is called an Error of Type I. If the alternate hypothesis is true and you find the data is not significant, this is an Error of Type II. By adjusting the significance level, you can balance the trade-off between the two types of errors. Often the Type I error is the bigger error (for example, convicting the innocent is a Type I error, setting free the guilty is a Type II error).

There is a nice interpretation of the p-value and significance levels you should know. If you do many significance tests at a significance level alpha, then on average alpha will be the percentage, out of the times when the null hypothesis is true, that you actually got significant evidence (Type I error). So your significance level is where you control your chances of making a Type I error. Of course, the lower your significance level, the higher your chances of making a Type II error.
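This interpretation can be checked by simulation. The sketch below makes assumptions not in the notes: it uses a two-tailed test for a mean with known standard deviation, and every simulated data set is generated with the null hypothesis true, so every significant result is a Type I error.

```python
import random
from statistics import NormalDist

def type_one_rate(alpha=0.05, n=30, trials=5000, seed=2):
    """Share of two-tailed z-tests that come out significant when H0 is true."""
    rng = random.Random(seed)
    mu0, sigma = 0.0, 1.0               # H0 is true: the data really have mean mu0
    std_normal = NormalDist()
    significant = 0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (x_bar - mu0) / (sigma / n ** 0.5)
        p_value = 2 * min(std_normal.cdf(z), 1 - std_normal.cdf(z))
        if p_value < alpha:
            significant += 1            # a Type I error
    return significant / trials

print(type_one_rate(alpha=0.05))        # should land near 0.05
```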

People can usually tell what the alternate and null hypotheses are, but often have trouble telling which is which. Here are three criteria:

The null hypothesis must be specific, because you have to be able to use it to compute a probability. So if one of the two statements involves an equals sign (specific) and the other involves not equals, less than, or greater than (not specific), the one with the equals sign is the null hypothesis.
If it is clear that one of the statements is the one you are assessing the evidence in favor of, that is the alternate hypothesis. That is the basic meaning of the alternate hypothesis.
If one of the statements is the thing you would continue to assume until you see evidence otherwise, or is the ``default assumption,'' or suggests maintaining the status quo rather than actively responding, or involves less terrible consequences if you believe it mistakenly, that is the null hypothesis.

The best way to decide is to say ``If I have good enough evidence to convince me that ... is true, I will ...'' Whichever claim makes more sense in such a sentence should be your alternate hypothesis.

One Population, One Numerical Variable, Sigma is Known

A really simple case is when you want to give a confidence interval or do a hypothesis test for the population mean of a numerical variable (let's say it is normally distributed) and you somehow know its standard deviation. The procedures in this case are very straightforward, and you can see the logic more clearly.

Let's say the mean is μ (we don't know) and the standard deviation is σ (we know). Thus the distribution of the original variable is a normal distribution with a mean μ and a standard deviation σ. Now suppose we take many samples of size n from the population, and compute the sample mean X-bar of each. These numbers X-bar will have a distribution (the sampling distribution) which is again normal, still has mean μ, but now has standard deviation σ/sqrt(n), sigma over the square root of n. So I don't have to write that a lot I will call it the standard error. Remember that for a normal distribution with mean μ and standard deviation σ you can compute any probabilities like
P(X < a) "=NORMDIST(a, μ, σ, TRUE)"
P(X > a) "=1-NORMDIST(a, μ, σ, TRUE)"
P(a < X < b) "=NORMDIST(b, μ, σ, TRUE)-NORMDIST(a, μ, σ, TRUE)"
and finally, the probability that X is at least as far from μ as a is:
"=2*MIN(NORMDIST(a, μ, σ, TRUE),1-NORMDIST(a, μ, σ, TRUE))"
On the other hand if you have a probability p,
"=NORMINV(p,μ,σ)"
will give you the value that has a fraction p of this distribution below it (the 100p-th percentile).
"=NORMINV(p,0,1)"
will tell you how many standard deviations you have to go up from the mean so that a fraction p of the data will be below you. So
"=NORMINV((1+p)/2,0,1)"
will tell you how many standard deviations you have to go to have a fraction p of the data closer to the mean than you are. So if you do that with p=.95, you will find that 95% of the data is within 1.96 standard deviations of the mean.
With all that in hand, how do you find a 95% confidence interval for the mean if you have a sample of size n with a sample mean of X-bar? Well, we know that 95% of all samples of size n will have a sample mean within 1.96 standard errors of μ, so it will differ from μ by less than 1.96 σ/sqrt(n). To put it another way, we are 95% sure that μ is within 1.96 σ/sqrt(n) of our particular X-bar, so the confidence interval is
X-bar +/- 1.96 σ/sqrt(n)
For a general confidence level C it is
X-bar +/- z σ/sqrt(n)
where z=NORMINV((1+C)/2,0,1).
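The same interval can be computed outside Excel. A minimal sketch using Python's standard library, where `NormalDist().inv_cdf` plays the role of NORMINV with mean 0 and standard deviation 1; the function name and the sample numbers are made up for illustration.

```python
from statistics import NormalDist

def z_interval(x_bar, sigma, n, level=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf((1 + level) / 2)   # same role as NORMINV((1+level)/2, 0, 1)
    margin = z * sigma / n ** 0.5               # z times the standard error
    return x_bar - margin, x_bar + margin

low, high = z_interval(x_bar=12.6, sigma=8.0, n=64, level=0.95)
print(f"[{low:.2f}, {high:.2f}]")
```

With these made-up numbers the standard error is 8/sqrt(64) = 1, so the interval is about 12.6 +/- 1.96.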

Given some particular number μ₀ called the test mean, we can test three different alternate hypotheses:

μ < μ₀
μ > μ₀
μ <> μ₀

The first two are called one-tailed tests, the last two-tailed. You of course assume the Null Hypothesis
H0: μ = μ₀
from which you can compute the p-value, the probability that a random sample would give results at least as much in favor of the alternate hypothesis as your X-bar. These are respectively

If Ha: μ < μ₀, p=NORMDIST(X-bar,μ₀,σ/sqrt(n),TRUE)
If Ha: μ > μ₀, p=1-NORMDIST(X-bar,μ₀,σ/sqrt(n),TRUE)
If Ha: μ <> μ₀, p=2*MIN(NORMDIST(X-bar,μ₀,σ/sqrt(n),TRUE), 1-NORMDIST(X-bar,μ₀,σ/sqrt(n),TRUE)).
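The three p-values can be sketched in Python as follows. The function name and the `alternative` labels are inventions for this example; `NormalDist(mu0, se).cdf` plays the role of NORMDIST with its last argument TRUE.

```python
from statistics import NormalDist

def z_test_p_value(x_bar, mu0, sigma, n, alternative):
    """p-value for H0: mu = mu0 when sigma is known.

    alternative is "less", "greater", or "two-sided".
    """
    sampling_dist = NormalDist(mu0, sigma / n ** 0.5)   # distribution of X-bar under H0
    lower = sampling_dist.cdf(x_bar)                    # P(X-bar <= observed value)
    upper = 1 - lower                                   # P(X-bar >= observed value)
    if alternative == "less":
        return lower
    if alternative == "greater":
        return upper
    if alternative == "two-sided":
        return 2 * min(lower, upper)
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

# Made-up example: X-bar = 52 from n = 100, test mean 50, sigma = 10.
print(z_test_p_value(52, 50, 10, 100, "greater"))
print(z_test_p_value(52, 50, 10, 100, "two-sided"))
```

Here X-bar is 2 standard errors above the test mean, so the one-tailed p-value is a little over 2% and the two-tailed p-value is twice that.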

One Sample, One Numerical Variable: The t-procedure

Usually, you do not know the population standard deviation σ, so you cannot use the methods of the previous paragraphs. If your sample size is large, the sample standard deviation s coming from your sample represents a good substitute for the population standard deviation σ. If your sample size is not large, this adds yet another source of randomness to your calculation, so you would expect your confidence intervals to be wider and your p-values to be higher to account for that. The way to do this is to replace the normal distribution (this is, for example, the distribution that gives you z in the confidence interval) with another distribution, the t-distribution. The t-distribution, like the normal distribution, is a mathematical formula that depends on certain parameters, from which you can determine the probability that a variable following this distribution will fall in a certain range of values. Instead of depending on the mean and standard deviation, the t-distribution depends on the size of the sample. For mathematical and historic reasons we do not write the dependence in terms of the sample size n, but what we call the degrees of freedom, n-1. Degrees of freedom is a notion we will see again and again. When the degrees of freedom is large, the t-distribution is very close to the standard normal distribution (i.e. the one with mean 0 and s.d. 1), but when the degrees of freedom are small, it is broader, with fatter tails. The Excel formulas to calculate the confidence interval and p-value are almost the same as in the sigma known case, with the sample standard deviation s in place of σ.
Confidence Interval (confidence level C):
X-bar +/- t s/sqrt(n)
where t=TINV(1-C,n-1).
Hypothesis Testing:

If Ha: μ < μ₀, p=1-TDIST(t,n-1,1)
If Ha: μ > μ₀, p=TDIST(t,n-1,1)
If Ha: μ <> μ₀, p=2*MIN(TDIST(t,n-1,1), 1-TDIST(t,n-1,1))

where t=(X-bar-μ₀)*sqrt(n)/s. Excel's TDIST only accepts a first argument that is zero or positive. When t is negative, use the symmetry of the t-distribution and replace TDIST(t,n-1,1) by 1-TDIST(-t,n-1,1).
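To see why the normal critical value is not good enough once s stands in for the population standard deviation, here is a small simulation sketch. It is only an illustration: the population (normal, mean 0, standard deviation 1), the sample sizes, and the function name are made up. It builds intervals with the normal critical value 1.96 and counts how often they contain the true mean.

```python
import random
from statistics import NormalDist

def coverage_with_z(n, use_sample_sd, trials=20000, seed=3):
    """Share of intervals X-bar +/- 1.96*sd/sqrt(n) that contain the true mean.

    The data are simulated as normal with true mean 0 and true sd 1.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)           # about 1.96
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        x_bar = sum(sample) / n
        if use_sample_sd:
            # sample standard deviation s, with n-1 in the denominator
            sd = (sum((x - x_bar) ** 2 for x in sample) / (n - 1)) ** 0.5
        else:
            sd = 1.0                           # pretend sigma is known
        if abs(x_bar) <= z * sd / n ** 0.5:
            hits += 1
    return hits / trials

small_known = coverage_with_z(5, use_sample_sd=False)
small_estimated = coverage_with_z(5, use_sample_sd=True)
large_estimated = coverage_with_z(100, use_sample_sd=True, trials=5000)
print(small_known, small_estimated, large_estimated)
```

With samples of size 5, the intervals built from s and 1.96 contain the true mean noticeably less often than 95% of the time, which is the shortfall the wider t critical value corrects. With samples of size 100 the difference nearly disappears.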

However, we will use the t procedure template to do the calculation for us. If you have raw data, enter it in column A in the "Data" tab of the template. Then on the "t-test" tab enter your test mean in the text box provided, enter the significance level below it, and click on the correct form of the alternate hypothesis or enter the confidence level. The confidence interval and/or the p-value and conclusion appear to the right in green. If you only have summary statistics (i.e., X-bar, s, n) enter them in the text boxes provided on the "t-test" tab and click the "Use summary statistics" box.

Assumptions For the One Population, One Numerical Variable Tests

Every statistical procedure involves modeling a real situation by a theoretical mathematical model. This amounts to making assumptions about the situation that are generally true only approximately at best. It is crucial that you know what these assumptions are and how to assess how closely they are met, in order to have a sense of the reliability of the conclusions. We will generally choose a rule of thumb for judging when the assumption is close enough. For the two versions of the one population, one numerical variable procedures (z-procedure for sigma known, t-procedure for sigma unknown) there are three assumptions.

Theory	Practice
The sample is a Simple Random Sample	Those individuals more likely to be chosen in the sample do not appear to differ from those less likely in a way that would affect the measured variable(s)
Sampling is with replacement or from an infinite population	The population is at least 20 times the sample size (this is almost always met, and when it isn't there is a correction that some of the templates include)
The distribution of X-bar is normal	Either the sample size is at least 40, or the sample size is at least 15 and the histogram of the sample shows no extreme skewness or outliers, or the original variable X is known to be normal

To help you remember the last assumption, which we will keep seeing in many tests, I have devised

Steve's Patented Rule of Cool!

What does it take to be cool? It depends on your age.

If you are under 15, you have to be known to be normal to be considered cool
If you are between 15 and 40, you have to look good to be considered cool
If you are over 40, you don't need to do anything, you just are cool!

The t procedure template gives a rough histogram of the sample data on the "Hist" page, quite sufficient for determining if the middle criterion is met (not too skew, no extreme outliers), though it will only work on samples under 3000.

Two Populations (two samples), One Numerical Variable: the t-procedure again

The confidence interval for a single numerical variable is used very frequently, but hypothesis testing is less frequent: It is not that often that you want to compare the average value of something to some fixed thing (the test mean). More often you want to compare two things. If I am asking whether Treatment A is more effective than Treatment B, I will want to look at, for example, the average recovery time for people who receive A and compare it to those who receive B. If I get statistically significant evidence that the average recovery time for A is shorter than for B, I conclude that A is more effective.

Here we think of there being two populations (in this case, "people treated by Treatment A" and "people treated by Treatment B") and a random variable X on both. The means of the random variable in the two populations are μ₁ and μ₂ respectively. Generally our null hypothesis is that μ₁ = μ₂, which we often write as μ₁ - μ₂ = 0, but every once in a while we want to have a more general null hypothesis μ₁ - μ₂ = d, where d, the test difference, is whatever number you like. The alternate hypothesis can be

μ₁ > μ₂,
μ₁ < μ₂,
μ₁ <> μ₂.

We will have a sample from each population, one of size n₁, mean X-bar₁, and sample s.d. s₁, the other of size n₂, mean X-bar₂, and sample s.d. s₂. It turns out that you can show that if you took many pairs of samples and computed the difference X-bar₁ - X-bar₂ for each pair, these differences (once divided by a standard error computed from s₁, s₂, n₁ and n₂) would follow close to a t-distribution. From this you can get a p-value, or a confidence interval for the difference μ₁ - μ₂.
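As a rough sketch of the quantity being compared to the t-distribution, the code below standardizes the difference of two sample means. Two things here are assumptions and not statements about what the class template does: the standard error is the unpooled form sqrt(s₁²/n₁ + s₂²/n₂), and the p-value uses the normal distribution as a large-sample stand-in for the t-distribution. The summary statistics are made up.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_statistic(n1, mean1, sd1, n2, mean2, sd2, test_difference=0.0):
    """Standardized difference of two sample means (unpooled standard error)."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2 - test_difference) / standard_error

# Made-up summary statistics: recovery times under Treatment A and Treatment B.
t = two_sample_statistic(50, 10.0, 3.0, 50, 12.0, 4.0)

# Large-sample shortcut for Ha: mu1 < mu2, treating t as standard normal.
p_approx = NormalDist().cdf(t)
print(round(t, 3), p_approx)
```

For small samples the normal shortcut understates the p-value, which is why the t-distribution is used in practice.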

The template (Two Sample t-test in the Excel Templates page) will do the calculation. If you have the raw data from two samples, just enter it in two columns, COLUMN A and COLUMN B, in the "Data" tab. It is a very good idea to put labels at the top so you can remember which is which. The "t-test" tab will then give you a confidence interval if you enter a confidence level, and will give you a p-value. You leave the space for "test difference" blank unless you are in the rare situation where you have a null hypothesis like "the first mean is exactly 20 more than the second mean". You can enter a significance level as well if you like. If you only have the summary statistics, enter these six numbers in the space provided on the t-test page and click on the "summary statistics" button.

The assumptions of the two sample t-procedure are the same as for the one sample t-procedure, applied to each sample. That is, we assume in practice that each sample behaves as if it were a random sample, is from a population of reasonable size, and satisfies the Rule of Cool. You can check the histograms of your two samples on the "Hist" page of the template for skewness and outliers. In practice we generally are more relaxed about the size limitations for two samples, so that if each is close to 40, or is close to 15 and reasonably symmetric, it is considered OK. There is one additional assumption for the two sample test:

Theory	Practice
Obscure	Each sample should have size at least 5.

One Population, One Categorical (Yes/No) Variable: The z-procedure

Suppose you have a population, and some proportion p of it has a property, which we will call success. When you choose an individual at random from this population, you will have probability p of getting a success and probability 1-p of getting a failure. If you take a sample of n individuals from a population and count up how many successes you get, that is a binomial experiment, and as you run through many samples you will get a binomial distribution for the number m of successes. m will have a mean of np, a standard deviation of the square root of the quantity np(1-p), and will be roughly normal when np and n(1-p) are both at least 5. That means that the variable m/n, which is the sample proportion p-hat, will be roughly normal with a mean of p and a standard deviation of the square root of the quantity p(1-p)/n.

Now turn this around. Suppose you take a sample of size n from a population where the proportion of successes is unknown, and you find the sample proportion is some number p-hat. What does that suggest about the true proportion p? Well, presumably the true proportion p is pretty close to p-hat, so the true standard deviation is pretty close to the square root of p-hat(1 - p-hat)/n (in fact the standard deviation changes much more slowly than p). So just as with the confidence interval for a numerical variable with sigma known, we have

Confidence Interval for a Proportion:

p-hat +/- z sqrt(p-hat(1 - p-hat)/n)

where z is the z-score associated to the confidence level.
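A minimal Python sketch of this interval, with a made-up sample (120 successes out of 400) and a hypothetical function name:

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, level=0.95):
    """Confidence interval for a population proportion (z-procedure)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf((1 + level) / 2)        # z-score for the confidence level
    margin = z * sqrt(p_hat * (1 - p_hat) / n)       # z times the estimated standard deviation
    return p_hat - margin, p_hat + margin

low, high = proportion_interval(successes=120, n=400)
print(f"[{low:.3f}, {high:.3f}]")
```

Here p-hat is 0.30 and the margin of error comes out to roughly 0.045.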

The situation is even simpler for hypothesis testing. Here we will have some test proportion, p₀, and our null hypothesis will be p = p₀. We will have three possible alternate hypotheses, namely

p < p₀
p > p₀
p <> p₀.

Assuming the null hypothesis, that is that the population proportion is p₀, we know that the sample proportion has a mean of p₀ and a standard deviation of SQRT( p₀(1 - p₀)/n), and is normally distributed. So we get a p value with no further fuss:

If Ha: p < p₀, p-value = NORMDIST(p-hat,p₀,SQRT(p₀(1 - p₀)/n),TRUE)
If Ha: p > p₀, p-value = 1-NORMDIST(p-hat,p₀,SQRT(p₀(1 - p₀)/n),TRUE)
If Ha: p <> p₀, p-value = 2*MIN(of the two p-values above)

What are the assumptions for the one proportion z-procedure? They depend slightly on whether you are doing a confidence interval or a test.

Theory	Practice
The sample is a Simple Random Sample	Those individuals more likely to be chosen in the sample do not appear to differ from those less likely in a way that would affect the measured variable(s)
Sampling is with replacement or from an infinite population	The population is at least 20 times the sample size (this is almost always met, and when it isn't there is a correction that some of the templates include)
The distribution of p-hat is normal	Rule of 15. CI: the number of successes and the number of failures are each at least 15. HT: the expected numbers of successes and failures (n p₀ and n (1 - p₀)) are each at least 15
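The hypothesis test for a proportion, together with the Rule of 15 check for a test, can be sketched as follows. The function name, the `alternative` labels, and the example numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def proportion_test(successes, n, p0, alternative):
    """Return (p-value, rule_of_15_ok) for H0: p = p0.

    alternative is "less", "greater", or "two-sided".
    """
    # Rule of 15 for a test: expected successes and failures under H0
    rule_of_15_ok = n * p0 >= 15 and n * (1 - p0) >= 15
    sampling_dist = NormalDist(p0, sqrt(p0 * (1 - p0) / n))   # p-hat under H0
    lower = sampling_dist.cdf(successes / n)
    upper = 1 - lower
    p_values = {"less": lower, "greater": upper, "two-sided": 2 * min(lower, upper)}
    return p_values[alternative], rule_of_15_ok

# Made-up example: 60 successes in 100 trials, test proportion 0.5.
p_value, ok = proportion_test(60, 100, 0.5, "greater")
print(p_value, ok)
```

In this example p-hat = 0.6 is two standard deviations above p₀ = 0.5, so the one-tailed p-value is a little over 2%.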

Many Populations, One Numerical Variable: ANOVA

ANOVA stands for analysis of variance. Here you have I samples taken from I different populations (often it is one population under I different circumstances, such as people being given one of several different treatments). Generally I is more than 2, because if it is 1 or 2 you are better off doing one of the t-procedures. Let's say the samples are of size n₁, n₂, ... n_I. On each population the variable of interest X has some mean μ_i, and some standard deviation σ. We have to assume the standard deviations are all the same for ANOVA to work! This is too much information to compute any kind of confidence interval, and in fact there is only one null and alternate hypothesis we can use, so there is no decision to make. Our null hypothesis will always be that all these means are the same, and our alternate hypothesis is that there is some difference among them. As always we will compute a p-value, which is the probability you would get samples whose sample means are this far apart assuming the null hypothesis (that all the population means are equal).
The way this is computed is actually very instructive. It turns out that when you can think of all the variation or error in a problem as coming from two or more distinct sources, often you can associate a kind of variance to these (it will always be the sum or average of the squares of the difference between the values and some predicted value) so that the total variance is a sum of the variance due to each of these effects. Here, if we assume the null hypothesis, we can think of all of these samples, since they come from populations with the same mean μ and standard deviation σ, as one big sample from one big population. The total variation or SST (Sum Square error Total) is the sum of the squares of the difference of each point with the sample mean of the whole sample. This is the sum of two contributions: the SSG or Sum Square error for Groups, which is the sum over all data points of the square of the difference between the mean of that point's sample and the total sample mean (so each sample mean is counted once for every point in its sample), and the SSE or Sum Square Error within each sample, which is the sum of the squares of the differences of each data point minus the mean of its sample. All this is to say
SST = SSG + SSE
where SSG represents how much the sample means vary amongst themselves and SSE represents how much the points in each sample vary amongst themselves. If they all came from populations with the same mean, the SSG would just come from natural variation from one sample to the next, and you would expect it to make up some predetermined fraction of the total variation (depending on the sizes of the samples of course). If SSG represents too big a fraction of the total variation, you start to believe that it is because the means are actually different. Specifically some math will tell you that if the null hypothesis is true the quantity
F=(SSG/DFG)/(SSE/DFE)
as we run through many different samples will have a characteristic distribution, called the F distribution. Here DFG = I-1 is the Group Degrees of Freedom and DFE = N-I is the Error Degrees of Freedom, where N = n₁ + n₂ + ... + n_I is the total number of data points; these are the two parameters the F distribution depends on. Since SST = SSG + SSE, F is large exactly when SSG is a large fraction of SST. Software or really long tables will tell you, once you know F, DFG and DFE, the probability that you would get that big an F score or bigger in a collection of samples (of the correct sizes) from populations whose means are all equal. That is your p-value.
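The decomposition is easy to verify on a small data set. The sketch below (hypothetical function name, made-up data) computes SST, SSG and SSE directly from their definitions, with F taken as the ratio of the group mean square SSG/DFG to the error mean square SSE/DFE. It stops at F because Python's standard library has no F distribution; the p-value would still come from software or a table.

```python
def anova_table(groups):
    """Sums of squares, degrees of freedom and F for a list of samples."""
    all_points = [x for g in groups for x in g]
    N, I = len(all_points), len(groups)
    grand_mean = sum(all_points) / N
    group_means = [sum(g) / len(g) for g in groups]
    # total variation: every point against the overall mean
    sst = sum((x - grand_mean) ** 2 for x in all_points)
    # between groups: each sample mean counted once per point in its sample
    ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # within groups: every point against its own sample mean
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    dfg, dfe = I - 1, N - I
    f = (ssg / dfg) / (sse / dfe)
    return {"SST": sst, "SSG": ssg, "SSE": sse, "DFG": dfg, "DFE": dfe, "F": f}

table = anova_table([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```

For this made-up data SSG = 26 and SSE = 6, which add up to SST = 32, and F = (26/2)/(6/6) = 13.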
All you have to do is enter the data from the separate samples in separate columns in the "Data" tab of the ANOVA template (make sure not to label the columns by numbers, or Excel will think they are data). Then read off the p-value. Or, enter the means, s.d.s and counts in the "Data Summary" tab.
The assumptions for the ANOVA test are a little more complicated. Each sample must satisfy all the assumptions of the t-procedure. That is, it must be a reasonable approximation of a simple random sample, it must be sampled from a sufficiently large population, and it must satisfy the 0:15:40 Rule of Cool. (In practice, we think of the different samples as providing additional "averaging out," so that we can be more relaxed about this rule. Three or four samples of size 20 are probably OK short of the most outrageous outliers or skewness.) In addition, the assumed equality of the standard deviations enforces the following additional rules of thumb:

The sample standard deviations of the individual samples should not be "too far apart." Specifically, the ratio of the largest to the smallest should be at most 2. The "Data Summary" page of the template checks this for you.
Each sample should in any case have size at least 5.
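These two rules of thumb are simple enough to check by hand or in a few lines of code. A minimal sketch, with a hypothetical function name and made-up summary statistics:

```python
def anova_rules_of_thumb(sds, sizes):
    """Check the equal-spread and minimum-size rules of thumb for ANOVA."""
    ratio = max(sds) / min(sds)              # largest s.d. over smallest s.d.
    return {
        "sd_ratio": ratio,
        "sd_ok": ratio <= 2,                 # ratio should be at most 2
        "sizes_ok": min(sizes) >= 5,         # every sample has at least 5 points
    }

check = anova_rules_of_thumb(sds=[2.1, 3.0, 3.9], sizes=[20, 18, 22])
print(check)
```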

sawin
Last modified: Wed Nov 16 21:08:57 EST 2011