INTRO TO HYPOTHESIS TESTING and STATISTICAL THEORY

Hypothesis = a specific statement that can be empirically tested.

Hypotheses flow from the research question guiding the research study.

There are two hypotheses in statistical analysis: the null (H0) and the research hypothesis (H1).

Null hypothesis = Conceptually, this hypothesis argues that there is no relationship between two or more variables.  X does not influence Y.

Research hypothesis = Conceptually, this hypothesis argues that there is a relationship between two or more variables.  X does influence Y.  In general, you need to argue one of three different statements:

On Y, group 1 is larger than group 2.
On Y, group 1 is smaller than group 2.
On Y, group 1 is different from group 2.

You need to state both the null and the research hypothesis in statistical terms as well.

Example:

Research Question = does gender influence alcohol consumption?

Null Hypothesis

Conceptually: Gender (x) does not influence alcohol consumption (y).

Statistically:  The average alcohol consumption among men is not different from the average alcohol consumption among women.  mean1 = mean2

Research Hypothesis

Conceptually: Gender (x) does influence alcohol consumption (y).

Statistically:  Need to argue one of the following:

The average alcohol consumption among men is different from the average alcohol consumption among women.  mean1 ≠ mean2 (Men = group 1,  Women = group 2)

The average alcohol consumption among men is greater than the average alcohol consumption among women.  mean1 > mean2 (Men = group 1,  Women = group 2)

The average alcohol consumption among men is less than the average alcohol consumption among women.  mean1 < mean2 (Men = group 1,  Women = group 2)

In hypothesis testing we either accept or reject the null.  We never prove the null or the research hypothesis.  We build all of our statistical analyses around the research hypothesis and make it as difficult as possible to reject the null.  We thereby minimize error, specifically the chance that we say there is a relationship between two variables when there is not (we minimize Type I error, or alpha error).

 


Conceptually, in general, to test the research hypothesis we:

1. Calculate a statistic on a sample, such as a mean, a percentage, or a correlation.

2. Compare the sample statistic to a comparable statistic calculated under the assumption that there is no relationship between the two variables.  For example, looking at the influence of gender on alcohol consumption:

If there was no difference between men and women's alcohol consumption:

mean1-mean2 = 0 (or close to it).

If there was a difference between men and women's alcohol consumption:  

mean1-mean2 ≠ 0

3. From this comparison, we determine if there is a relationship between two variables. This comparison (and the theory behind it) assumes the sample represents the population accurately.
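The steps above can be sketched in a few lines of Python.  The drink counts below are hypothetical illustration data (chosen so the group means come out to 8 and 5, as in the example), not data from an actual study.

```python
# Sketch of steps 1-2: calculate the sample statistic (the difference
# in means) and compare it to the value expected under the null (0).
# These drink counts are made up for illustration.
from statistics import mean

men = [8, 10, 6, 9, 7, 8, 9, 8, 7, 8]     # drinks last week, hypothetical
women = [5, 4, 6, 5, 5, 4, 6, 5, 5, 5]

mean1, mean2 = mean(men), mean(women)
diff = mean1 - mean2   # under H0 we expect this to be at or near 0

print(mean1, mean2, diff)
```

Here the difference is 3 drinks; step 3 is deciding whether 3 is far enough from 0 to conclude there is a relationship.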

 

Seems simple.  The tricky part is determining how different the two statistics (in this case, means) have to be before we will conclude that they are not equal.  How different from 0 does mean1-mean2 have to be before we say there is a relationship?  Due to random sampling variability and measurement error, we know that even if there is no relationship between gender and alcohol consumption, the two means may not be exactly the same.

For example, you collect a random sample of 100 UNCW students and ask them their sex and the number of alcoholic beverages they consumed last week.  From these data you calculate the average weekly consumption of men and women.

Men = 8
Women = 5

How different do these two statistics have to be before you will feel confident in concluding that men tend to consume more alcohol than women?  Is a difference of 3 drinks big enough?  What if the difference were 10?  In general, the bigger the difference, the more confident we are in concluding "there is a relationship."  But keep in mind that, by chance, your sample could have included one man who drank 20 drinks per week, which would inflate the mean for men.  Hence, even if you found a difference of 10 drinks, you could still be inaccurately describing the pattern in the population of all UNCW students (to which you want to generalize from your sample) by concluding that there is a difference between men and women students' alcohol consumption.  It is this type of error that we try to minimize in statistical analysis -- the error of concluding there is a relationship when there is not.
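A small simulation makes the point about sampling variability.  Below, both "groups" are drawn from the same hypothetical population, so there is no real difference, yet the difference in sample means wanders around 0 rather than sitting exactly on it.

```python
# Simulation: draw pairs of samples from the SAME population (no real
# group difference exists) and watch mean1 - mean2 vary by chance.
# The population of weekly drink counts is hypothetical.
import random

random.seed(1)
population = list(range(0, 11)) * 100   # hypothetical drink counts, 0-10

diffs = []
for _ in range(1000):
    sample1 = random.sample(population, 100)
    sample2 = random.sample(population, 100)
    diffs.append(sum(sample1) / 100 - sum(sample2) / 100)

# The differences cluster near 0 but are rarely exactly 0.
print(min(diffs), max(diffs))
```

The spread of these chance differences is exactly what a hypothesis test has to account for before declaring a real difference.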


Theoretically, one way to gain more confidence in your conclusions would be to take repeated samples from the population and compare the statistics calculated on each (in this case average weekly alcohol consumption).  If they were all similar (you wouldn't expect them all to be exactly the same due to sampling variability), then you might be more confident that the value you got in your sample truly reflects the population.

Let's say, using the same sampling technique, you take nine more samples of UNCW students and calculate the average weekly alcohol consumption of men and women.  We will look at the men's data first. 

Your sample (sample 1) means = men 8, women 5 

Sample #    Men's Average Weekly       Women's Average Weekly
            Alcohol Consumption        Alcohol Consumption
Sample 1    8                          5
Sample 2    7                          5
Sample 3    7.5                        5.5
Sample 4    8                          4
Sample 5    10                         5
Sample 6    8.5                        7
Sample 7    6.5                        2
Sample 8    8                          5
Sample 9    8                          4
Sample 10   8                          5

 

Including your original sample, you compose the following frequency table from your 10 sample means. 

men
x      f
6.5    1
7      1
7.5    1
8      5
8.5    1
10     1
Average = 7.95   n = 10

 

women
x      f
2      1
4      2
5      5
5.5    1
7      1
Average = 4.75   n = 10

 

From the above frequency tables we can create a sampling distribution.

A sampling distribution is a probability distribution of a statistic.  In this case it is for a mean, but it could be for any statistic (a median, a percentage, etc.).  It is a distribution of the probabilities for each value of the statistic across the repeated samples.

So all we need to do is calculate the probabilities for each of the values in the above frequency tables.  To calculate the probabilities, we use the frequency of each value divided by the total number of samples.  Let's add this probability column to our tables. 

 

men
x      f    p
6.5    1    .1 (= 1/10)
7      1    .1
7.5    1    .1
8      5    .5
8.5    1    .1
10     1    .1
Average = 7.95   n = 10

   

women
x      f    p
2      1    .1
4      2    .2
5      5    .5
5.5    1    .1
7      1    .1
Average = 4.75   n = 10
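The frequency and probability columns above can be rebuilt mechanically from the ten sample means with a simple counter:

```python
# Rebuild the men's and women's frequency/probability tables from the
# ten sample means listed earlier (p = f divided by number of samples).
from collections import Counter

men_means = [8, 7, 7.5, 8, 10, 8.5, 6.5, 8, 8, 8]
women_means = [5, 5, 5.5, 4, 5, 7, 2, 5, 4, 5]

for label, means in [("men", men_means), ("women", women_means)]:
    freq = Counter(means)
    n = len(means)
    print(label)
    for value in sorted(freq):
        print(f"  x={value}  f={freq[value]}  p={freq[value] / n}")
    print(f"  average = {sum(means) / n}")
```

Running this reproduces the tables, including the averages of 7.95 and 4.75.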

We could also graph the sampling distributions, like we did with variable distributions:

Diagrams: (see board)  

 

Application of the Sampling Distribution

We compare our sample statistic to the sampling distribution.

See diagrams on board.

 

Ho: The average alcohol consumption among men is not different from the average alcohol consumption among women.  mean1 = mean2

H1: The average alcohol consumption among men is greater than the average alcohol consumption among women.  mean1 >  mean2

In our comparison, we ask ourselves:  Are the two distributions (in this case the sample mean and the sampling distribution mean) different enough to say that the difference is not due to random sampling variation in our sample?

If our sample statistic varies dramatically from the sampling distribution we conclude either:*

1.  The independent variable does influence the dependent variable.  "There is a difference."  We reject the null hypothesis.  The difference between the sample statistic and the sampling distribution is so big that it must reflect a real difference between men and women's alcohol consumption.  In general, the more different the sample statistic is from the sampling distribution (which serves as a proxy for the population value) the lower the probability associated with our statistic.  This low probability means there is low error in concluding "X influences Y." 

2.  Our sample is not representative of the population.  We accept the null hypothesis.  In this case, although the statistic is much different from the sampling distribution, there is a high probability associated with the statistic.  This high probability means there is a great chance of error in concluding that "X influences Y."  This high probability usually occurs because 1) it is a small sample, and/or 2) the standard deviation on the dependent variable is large.  Combined, the sample size and the standard deviation for the variable of interest indicate the "precision of the statistic," or standard error.  The standard error is a measure of the sampling variability due to data collection methods.

*Whether we conclude with answer 1 or answer 2 depends on how much error we are willing to take in concluding there is a relationship between X and Y.


A related question to whether the sample statistic is different enough from the sampling distribution is:  Is the difference big enough to matter?  By extension, is the difference between the two groups (from which the statistics were calculated) big enough to make a difference in our lives?  Is there a substantive difference?  Here's an example.

Say that we collect data on boys and girls math test scores in the USA.  The scores show that boys do better in math.

Boys mean math test score = 82   
Girls mean math test score =  79 

So there is a difference of 3 points out of 100 points.  Is this difference big enough to make a difference in the collective lives of boys and girls? Does a difference of 3 points out of 100 points lead to boys doing better in school, in general, than girls?  Doing better in college?  In their careers?  In their ability to do their taxes?  Etc..  

We may find, if we have a big enough sample, that a difference of 3 points is statistically significant.  But a 3-point advantage on a 0-100 scale probably doesn't amount to much in real life.  It is not going to make boys do better in life.  It is not a meaningful difference.  So we always have to assess whether our statistical analyses are substantively significant, as well as statistically significant.

Why is the sampling distribution important?  We use the sampling distribution of a statistic to determine how likely our statistic's value is relative to the other possible sample values.  It helps us determine the likelihood of error in concluding there is a relationship when there is not, or in concluding that two statistics are different.

The sampling distribution is derived assuming the null hypothesis is correct. 
The sampling distribution says, if there is no relationship between x and y, these are the statistics we would expect and their associated probabilities.

In reality, we don't have time or money to draw repeated samples of a population to calculate sampling distributions of statistics.  Fortunately, theoretical sampling distributions have already been calculated for nearly all test statistics (you will learn of these shortly). These distributions assume the null hypothesis.  They can be found in the back of any statistical textbook, on-line (http://www.statsoftinc.com/textbook/sttable.html), and statistical software packages (such as SPSS) calculate them. 
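A sketch of how those theoretical probabilities are looked up in software.  Here the standard normal distribution stands in for the reference distribution (for large samples the t distribution is nearly normal); the test statistic value of 2.1 is hypothetical.

```python
# Probability of a test statistic under H0, using the standard normal
# as a large-sample stand-in for the t distribution.
# z = (mean1 - mean2) / standard error; the 2.1 here is hypothetical.
from statistics import NormalDist

z = 2.1
p_one_tailed = 1 - NormalDist().cdf(z)   # P(Z > 2.1) if H0 is true
p_two_tailed = 2 * p_one_tailed          # P(|Z| > 2.1) if H0 is true

print(round(p_one_tailed, 4), round(p_two_tailed, 4))
```

A small probability here means a small chance of error in rejecting the null; a table lookup or SPSS output gives the same numbers.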


The Influence of Data Collection Methods on Hypothesis Testing: Sampling Design and Measurement 

Sampling distributions are theoretical.  That means all the probabilities we obtain from probability tables or the computer are theoretical.  The theory behind these probabilities doesn’t hold if the sample doesn't represent the population. In general, the best way to accurately represent a population with a sample is by using random probability sampling methods, and often we must stratify on important characteristics such as race, gender and age. 

For example: You want to study men's and women's alcohol consumption.  You draw a simple random sample using a telephone book.

Who are you going to get using this sampling design?

1. People who have phones.  This is probably not a big problem as the percent of people without phones is low in most areas throughout the USA.

2. People who answer phone calls and are willing to cooperate with a survey.  Who are they?  They are disproportionately young, elderly, and women, and more likely to either work at home or be unemployed (especially if you call primarily during day time hours).  So your sample may have disproportionately higher numbers of young, elderly, women, unemployed, small business owners, telecommuters, people in occupations that permit working at home, domestic engineers, and people with either no kids or high numbers of kids.  All of these characteristics influence alcohol consumption, so your sample may be biased and may inaccurately depict alcohol consumption among men and women.

How to avoid this problem?  Stratify the random sample on gender and age.

How to determine if your sample is biased? 

Always calculate descriptive statistics on key demographic variables in your sample before doing any inferential analysis, so that you can determine whether the sample represents the population.  Even if you don't have exact data on your population for the research question you are studying, you can usually obtain demographic data from the census, county data, previous research on the population on a different question, etc.


What Influences the Ability to Reject the Null (i.e., to find statistically significant results)

1. The standard deviation: This takes into account the range and the variation of the variable.

High variation, harder to find significance (harder to reject null).

No variation, can’t do test.

2. Sample Size: the smaller the n, the harder to find significant relationships/differences 

In general, larger samples have more power.  Power = the ability to detect significant relationships.  With a small sample, a difference has to be huge before you will be able to find it statistically significant, or before you can say "this difference holds in the population."  So you will only be able to look for large effects, like a difference of 20 drinks consumed between men and women.  Often you need to be able to find smaller differences.

Caution: with really big sample sizes you can find very, very small differences significant.  At some point, these differences become meaningless.  For example, if you had a sample of 1000 men and women, you might have the statistical power to find a difference of .10 (1/10 of a drink) statistically significant.  This little bit more of a drink probably has no effect on your health or behavior.  So with very large samples, researchers often randomly split the sample into smaller sub-samples so that the sample size won't inflate the statistical significance of the tests.

Combined, sample size and the standard deviation tell you about the precision of the statistic. 
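The precision idea can be shown numerically: the standard error of a mean is s divided by the square root of n, so it shrinks as the sample grows or as the spread shrinks.  The standard deviation of 4 drinks below is hypothetical.

```python
# Standard error of a mean = s / sqrt(n).
# Bigger n (or smaller spread) -> smaller standard error -> a given
# difference is easier to detect.  s = 4 drinks is hypothetical.
from math import sqrt

s = 4.0  # sample standard deviation of weekly drinks (hypothetical)

for n in [25, 100, 400]:
    se = s / sqrt(n)
    print(f"n={n}: standard error = {se}")
```

Quadrupling the sample size halves the standard error, which is why large samples can detect small differences.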


3. Alpha level:  higher the alpha, the easier to reject (more error willing to accept)

4.  1 or 2 tailed test: 1 tailed test easier to reject.

n=100, alpha=.05, 2 tailed: t critical = 1.96             

n=100, alpha=.05, 1 tailed: t critical =1.66  
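The one- versus two-tailed cutoffs can be computed from the standard normal, which serves here as a large-sample stand-in for the t table (the notes' 1.66 comes from the t table with roughly 99 degrees of freedom; the normal gives about 1.64).

```python
# Critical values for alpha = .05 from the standard normal.  The
# one-tailed cutoff is lower, so a one-tailed test rejects more easily.
from statistics import NormalDist

alpha = 0.05
two_tailed = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
one_tailed = NormalDist().inv_cdf(1 - alpha)      # ~1.64

print(round(two_tailed, 2), round(one_tailed, 2))
```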

See diagram on board