INTRO TO HYPOTHESIS TESTING AND STATISTICAL THEORY
Hypothesis = a specific statement that can be empirically tested.
Hypotheses flow from the research question guiding the research study.
There are two hypotheses in statistical analysis: the null (H0) and the research hypothesis (H1).
Null hypothesis = Conceptually, this hypothesis argues that there is no relationship between two or more variables. X does not influence Y.
Research hypothesis = Conceptually, this hypothesis argues that there is a relationship between two or more variables. X does influence Y. In general, you need to argue one of three different statements:
On Y, group 1 is larger than group 2
On Y, group 1 is smaller than group 2
On Y, group 1 is different from group 2
Need to state both the null and the research hypothesis in statistical terms too.
Example:
Research Question = does gender influence alcohol consumption?
Null Hypothesis
Conceptually: Gender (x) does not influence alcohol consumption (y).
Statistically: The average alcohol consumption among men is not different from the average alcohol consumption among women. mean1 = mean2
Research Hypothesis
Conceptually: Gender (x) does influence alcohol consumption (y).
Statistically: Need to argue one of the following:
The average alcohol consumption among men is different from the average alcohol consumption among women. mean1 ≠ mean2
The average alcohol consumption among men is greater than the average alcohol consumption among women. mean1 > mean2
The average alcohol consumption among men is less than the average alcohol consumption among women. mean1 < mean2
In hypothesis testing we either reject or fail to reject the null. We never prove the null or the research hypothesis. We build all of our statistical analyses around the research hypothesis and make it as difficult as possible to reject the null, thereby minimizing error -- specifically, the chance that we say there is a relationship between two variables when there is not (we minimize Type I error, or alpha error).
Conceptually, in general, to test the research hypothesis we:
1. Calculate a statistic on a sample, such as a mean, a percentage, or a correlation.
2. Compare the sample statistic to a comparable statistic calculated under the assumption that there is no relationship between the two variables. For example, looking at the influence of gender on alcohol consumption:
If there were no difference between men's and women's alcohol consumption:
mean1 - mean2 = 0 (or close to it).
If there were a difference between men's and women's alcohol consumption:
mean1 - mean2 ≠ 0.
3. From this comparison, we determine if there is a relationship between two variables. This comparison (and the theory behind it) assumes the sample represents the population accurately.
Seems simple. The tricky part is determining how different the two statistics (in this case means) have to be before we will conclude that they are not equal. How far from 0 does mean1 - mean2 have to be before we say there is a relationship? Due to random sampling variability and measurement error, we know that even if there is no relationship between gender and alcohol consumption the two means may not be exactly the same.
For example, you collect a random sample of 100 UNCW students and ask them their sex and number of alcoholic beverages consumed last week. From this data you calculate the average weekly consumption of men and women.
Men = 8
Women = 5
How different do these two statistics have to be before you will feel confident in concluding that men tend to consume more alcohol than women? Is a difference of 3 drinks big enough to conclude this? What if the difference was 10? In general, the bigger the difference, the more confident we are in concluding "there is a relationship." But keep in mind that, by chance, your sample could have included one man who drank 20 drinks per week, which would inflate the mean for men. Hence, even if you found a difference of 10 drinks you could still be inaccurately describing the pattern in the population of all UNCW students (to which you want to generalize from your sample) by concluding that there is a difference between men and women students' alcohol consumption. It is this type of error that we try to minimize in statistical analysis -- the error of concluding there is a relationship when there is not.
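To make the arithmetic concrete, here is a minimal sketch of the calculation. The drink counts below are made up for illustration, not real UNCW data:

```python
# Made-up weekly drink counts (illustrative only, not real UNCW data).
men = [8, 10, 6, 9, 7]
women = [5, 6, 4, 5, 5]

mean_men = sum(men) / len(men)        # 8.0
mean_women = sum(women) / len(women)  # 5.0

# The sample statistic we will test: is this far enough from 0?
diff = mean_men - mean_women          # 3.0
print(mean_men, mean_women, diff)
```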
Theoretically, one
way to gain more confidence in your conclusions would be to take repeated
samples from the population and compare the statistics calculated on each (in
this case average weekly alcohol consumption). If they were all similar
(you wouldn't expect them all to be exactly the same due to sampling
variability), then you might be more confident that the value you got in your
sample truly reflects the population.
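This repeated-sampling idea can be simulated. The sketch below assumes a hypothetical population of men whose true mean consumption is 8 drinks per week; the ten sample means cluster near 8 but differ from one another purely because of sampling variability:

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical population of men's weekly drink counts, true mean = 8.
population = [random.gauss(8, 3) for _ in range(10_000)]

# Draw 10 repeated samples of n=100 and record each sample mean.
sample_means = []
for _ in range(10):
    sample = random.sample(population, 100)
    sample_means.append(sum(sample) / len(sample))

# The means cluster near 8 but are not identical -- pure sampling variability.
print([round(m, 2) for m in sample_means])
```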
Let's say, using the same sampling technique, you take nine more samples of UNCW students and calculate the average weekly alcohol consumption of men and women. We will look at the men's data first.
Your sample (sample
1) means = men 8, women 5
Sample # | Men's Average Weekly Alcohol Consumption | Women's Average Weekly Alcohol Consumption |
Sample 1 | 8 | 5 |
Sample 2 | 7 | 5 |
Sample 3 | 7.5 | 5.5 |
Sample 4 | 8 | 4 |
Sample 5 | 10 | 5 |
Sample 6 | 8.5 | 7 |
Sample 7 | 6.5 | 2 |
Sample 8 | 8 | 5 |
Sample 9 | 8 | 4 |
Sample 10 | 8 | 5 |
Including your original sample, you compose the following frequency table from your 10 sample means.
Men
x   | f
6.5 | 1
7   | 1
7.5 | 1
8   | 5
8.5 | 1
10  | 1
Average = 7.95, n = 10
Women
x   | f
2   | 1
4   | 2
5   | 5
5.5 | 1
7   | 1
Average = 4.75, n = 10
From the above frequency tables we can create a sampling distribution.
A sampling distribution is a probability distribution of a statistic. In this case it is for a mean, but it could be for any statistic -- a median, a percentage, etc. It is a distribution of the probabilities for each value of the statistic from the repeated samples.
So all we need to
do is calculate the probabilities for each of the values in the above frequency
tables. To calculate the probabilities, we use the frequency of each value
divided by the total number of samples. Let's add this probability column
to our tables.
Men
x   | f | p
6.5 | 1 | .1 (= 1/10)
7   | 1 | .1
7.5 | 1 | .1
8   | 5 | .5
8.5 | 1 | .1
10  | 1 | .1
Average = 7.95, n = 10
Women
x   | f | p
2   | 1 | .1
4   | 2 | .2
5   | 5 | .5
5.5 | 1 | .1
7   | 1 | .1
Average = 4.75, n = 10
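The probability column is just each value's frequency divided by the total number of samples. A minimal sketch using the men's ten sample means from the table above:

```python
from collections import Counter

# The men's ten sample means from the table above.
men_means = [8, 7, 7.5, 8, 10, 8.5, 6.5, 8, 8, 8]

freq = Counter(men_means)  # value -> frequency
n_samples = len(men_means)

# Probability of each value = its frequency / total number of samples.
probs = {x: f / n_samples for x, f in sorted(freq.items())}
print(probs)  # the value 8 occurred 5 times -> p = .5
```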
We could also graph the sampling distributions, like we did with variable distributions:
Diagrams:
(see board)
Application of the Sampling Distribution
We compare our sample statistic to the sampling distribution.
See diagrams on board.
Ho: The average alcohol consumption among men is not different from the average alcohol consumption among women. mean1 = mean2
H1: The average alcohol consumption among men is greater than the average alcohol consumption among women. mean1 > mean2
In our comparison, we ask ourselves: Are the two values (in this case our sample mean and the mean of the sampling distribution) different enough to say that the difference is not due to random sampling variation in our sample?
If our sample statistic varies dramatically from the sampling distribution we conclude either:*
1. The independent variable does influence the dependent variable. "There is a difference." We reject the null hypothesis. The difference between the sample statistic and the sampling distribution is so big that it must reflect a real difference between men's and women's alcohol consumption. In general, the more different the sample statistic is from the sampling distribution (which serves as a proxy for the population value), the lower the probability associated with our statistic. This low probability means there is little chance of error in concluding "X influences Y."
2. Our sample is not representative of the population. We fail to reject the null hypothesis. In this case, although the statistic is much different from the sampling distribution, there is a high probability associated with the statistic. This high probability means there is a great chance of error in concluding that "X influences Y." This high probability usually occurs because 1) it is a small sample, and/or 2) the standard deviation on the dependent variable is large. Combined, the sample size and the standard deviation for the variable of interest indicate the "precision of the statistic," or standard error. The standard error is a measure of the sampling variability due to data collection methods.
*Whether we conclude with answer 1 or answer 2 depends on how much error we are willing to take in concluding there is a relationship between X and Y.
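The standard error mentioned above is commonly estimated as the sample standard deviation divided by the square root of n. A sketch with made-up drink counts:

```python
import math
import statistics

# Hypothetical weekly drink counts for one sample of men (made up).
drinks = [8, 12, 5, 9, 7, 10, 6, 8, 11, 4]

n = len(drinks)
sd = statistics.stdev(drinks)  # sample standard deviation
se = sd / math.sqrt(n)         # standard error of the mean

# Larger n or smaller sd -> smaller standard error -> a more precise statistic.
print(round(se, 4))
```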
A related question to whether the sample
statistic is different enough from the sampling distribution is:
Is the difference big enough to matter? By
extension, is the difference between the two groups (from which the statistics
were calculated) big enough to make a difference in our lives? Is there a
substantive difference? Here's
an example.
Say that we collect data on boys' and girls' math test scores in the USA. The scores show that boys do better in math.
Boys
mean math test score = 82
Girls mean math test score = 79
So there is a difference of 3 points out of 100 points. Is this difference
big enough to make a difference in the collective lives of boys and girls? Does
a difference of 3 points out of 100 points lead to boys doing better in school,
in general, than girls? Doing better in college? In their
careers? In their ability to do their taxes? Etc..
Why is the sampling distribution important? We use the sampling distribution of a statistic to determine the probability that the value of the statistic is like other possible sample values. It helps us determine the likelihood of error in concluding there is a relationship when there is not, or in concluding that two statistics are different.
The sampling
distribution is derived assuming the null hypothesis is correct. The
sampling distribution
says, if there is no
relationship between x and y, these are the statistics we would expect and their
associated probabilities.
In reality, we don't have time or money to draw repeated samples of a population to calculate sampling distributions of statistics. Fortunately, theoretical sampling distributions have already been calculated for nearly all test statistics (you will learn of these shortly). These distributions assume the null hypothesis. They can be found in the back of any statistical textbook, on-line (http://www.statsoftinc.com/textbook/sttable.html), and statistical software packages (such as SPSS) calculate them.
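For example, the standard normal distribution -- one such theoretical sampling distribution, appropriate for large samples -- is built into Python's standard library. The sketch below looks up a familiar cutoff and a tail probability rather than reading them from a printed table:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, sd 1

# Critical value leaving 2.5% in each tail (alpha = .05, two-tailed).
critical = z.inv_cdf(0.975)

# One-tailed probability of a z-score of at least 1.96.
p_one_tail = 1 - z.cdf(1.96)

print(round(critical, 2), round(p_one_tail, 3))
```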
The Influence of Data Collection Methods on Hypothesis Testing: Sampling Design and Measurement
Sampling distributions are theoretical. That means all the probabilities we obtain from probability tables or the computer are theoretical. The theory behind these probabilities doesn’t hold if the sample doesn't represent the population. In general, the best way to accurately represent a population with a sample is by using random probability sampling methods, and often we must stratify on important characteristics such as race, gender and age.
For example: You want to study men and women's alcohol consumption. You draw a simple random sample using a telephone book.
Who
are you going to get using this sampling design?
1. People who have phones. This is probably not a big problem as the
percent of people without phones is low in most areas throughout the USA.
2.
People who answer phone calls and are willing to cooperate with a survey.
Who are they? They are disproportionately young, elderly, and women, and
more likely to either work at home or be unemployed (especially if you call
primarily during day time hours). So your sample may have
disproportionately higher numbers of young, elderly, women, unemployed, small
business owners, telecommuters, people in occupations that permit working at
home, domestic engineers, and people with either no kids or high numbers of
kids. All of these characteristics influence
alcohol consumption, so your sample may be biased and may inaccurately depict
alcohol consumption among men and women.
How to avoid this problem? Stratify the random sample on gender and age.
How to determine if your sample is biased?
Always calculate descriptive
statistics on key demographic variables in your sample before doing any
inferential analysis so that you can determine if the sample represents the
population. Even
if you don’t have exact data on your population for the research question you
are studying you can usually obtain demographic data from the census, county
data, previous research on the population on a different question, etc..
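A sketch of this kind of representativeness check, using made-up sample and benchmark percentages (a real check would compare against actual census or institutional figures):

```python
# Hypothetical sample demographics vs. made-up population benchmarks
# (the percentages below are invented for illustration).
sample = {"women": 0.68, "age_18_24": 0.55}
population = {"women": 0.52, "age_18_24": 0.58}

flags = {}
for key in sample:
    gap = sample[key] - population[key]
    # Flag any group over- or under-represented by more than 5 points.
    flags[key] = "possible bias" if abs(gap) > 0.05 else "ok"
    print(f"{key}: sample {sample[key]:.0%} vs population {population[key]:.0%} -> {flags[key]}")
```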
What Influences the Ability to Reject the Null (i.e., to find statistically significant results)
1. The standard deviation: This takes into account the range and the
variation of the variable.
High
variation, harder to find significance (harder to reject null).
No
variation, can’t do test.
2. Sample Size: the smaller the n, the harder to find significant relationships/differences
In general, larger samples
have more power. Power =
the ability to find significant relationships.
With small samples you have to have a huge difference before you will
be able to find it statistically significant or before you can say either
“this difference holds in the population” or “this difference is
significant, or meaningful”. So
you will only be able to look for large effects.
Like a difference of 20 drinks consumed between men and women.
Often you need to be able to find smaller differences.
Caution: with really big
sample sizes you can find very, very small differences significant. At
some point, these differences become meaningless.
For example, if you had a sample of 1000 men and women, you might have
the statistical power to find differences of .10 (1/10 of a drink)
statistically different. This little bit
more of a drink probably has no effect on your health or behavior. So with very large
samples, researchers often randomly split the sample into a smaller sub-sample
so that the sample size won’t inflate the statistical significance of the
tests.
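The caution above can be seen in the arithmetic of a simple two-sample t statistic (this sketch assumes equal group sizes and equal standard deviations): the same tiny 0.1-drink difference goes from nowhere near significant to far past the usual cutoff purely because n grows.

```python
import math

def t_stat(diff, sd, n):
    """Two-sample t with equal group sizes and equal sd (illustrative only)."""
    return diff / (sd * math.sqrt(2 / n))

# The same 0.1-drink difference, sd of 3 drinks, at two sample sizes per group.
small = t_stat(0.1, 3, 50)       # tiny t: nowhere near significant
huge = t_stat(0.1, 3, 500_000)   # enormous t: wildly "significant"

print(round(small, 3), round(huge, 2))
```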
Combined, sample size and
the standard deviation tell you about the precision of the statistic.
3. Alpha level: higher the alpha, the easier
to reject (more error willing to accept)
4. 1 or 2 tailed test: 1 tailed test easier to reject.
n=100, alpha=.05, 2-tailed: t critical = 1.96
n=100, alpha=.05, 1-tailed: t critical = 1.66
See diagram on board
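These cutoffs can be computed rather than looked up. The sketch below uses the normal approximation from Python's standard library; note it gives 1.645 for the one-tailed value, slightly below the 1.66 listed above, which is the t-distribution value for 99 degrees of freedom:

```python
from statistics import NormalDist

z = NormalDist()
alpha = 0.05

two_tailed = z.inv_cdf(1 - alpha / 2)  # alpha split across both tails
one_tailed = z.inv_cdf(1 - alpha)      # all of alpha in one tail

print(round(two_tailed, 2), round(one_tailed, 3))

# The one-tailed cutoff is smaller, so the same test statistic
# clears it more easily -- a 1-tailed test is easier to reject with.
```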