Bivariate Correlations (Pearson's r)

Used when both the dependent variable (DV) and the independent variable (IV) are continuous.  (Pearson's r is robust to minor violations of its distributional assumptions.)

A correlation indicates the direction and strength of the linear relationship between two variables.  It indicates how the two variables covary.

A positive correlation means that as one variable goes up in value, the other variable goes up too.  Or as one variable goes down in value, the other variable goes down too.  A positive correlation means the two variables vary in the same direction (either they both go up when one changes, or they both go down when one changes).

See scatterplot on board (x= education, y = income)

A negative correlation means that as one variable goes up in value, the other variable goes down.  Or as one variable goes down in value, the other variable goes up.  A negative correlation means the two variables vary in opposite directions.

See scatterplot on board.  (x= age, y = crime)

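Correlations (denoted with the symbol "r") range from -1 to +1.  A -1 means there is a perfect negative linear relationship between the two variables (every point falls exactly on a downward-sloping line).  A +1 means there is a perfect positive linear relationship between the two variables (every point falls exactly on an upward-sloping line).  A 0 correlation means that there is no linear relationship between the two variables.  The closer r is to -1 or +1, the stronger the linear relationship.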

See scatterplot on board (x = age, y = shoe size)
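The two ends and the middle of this range can be checked on a computer.  What follows is only a minimal sketch in Python with made-up numbers; the helper name pearson_r is hypothetical and is not part of SPSS or of these notes.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed from deviations around the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                       # made-up scores
r_pos = pearson_r(x, [2, 4, 6, 8, 10])    # points exactly on a rising line
r_neg = pearson_r(x, [10, 8, 6, 4, 2])    # points exactly on a falling line
r_zero = pearson_r(x, [2, 1, 4, 1, 2])    # no upward or downward trend

print(round(r_pos, 2), round(r_neg, 2), round(r_zero, 2))
```

The first two data sets give +1 and -1 only because every point sits exactly on a straight line; real survey data almost never do.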

A correlation of +1 or -1 in the social sciences is rare. Usually it only occurs if you are using an IV and a DV that are essentially the same thing.  For example, IV= age, DV= cohort.

Very strong correlations are rare in the social sciences (social behavior is complicated).

The size of the correlation depends on:

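1. Sample size: A bigger n does not by itself push r higher, but r's from small samples are unstable (they bounce around from sample to sample), and with a bigger n even a small r can be statistically significant.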

2. Distribution of your variables:  Pearson's r may not work well if your variables are not normally distributed.

3. Unit of analysis:  Large units of analysis (e.g., countries, businesses, even households) lead to higher r's, because there is usually less variation in any x and y among large units of analysis (differences between individual people average out within each unit).

Example:  IV = health care expenditures (in dollars),  DV = health status (0-10)

Unit of analysis = people; r = .20

Unit of analysis = countries; r = .75

With small units of analysis (such as the individual respondents in the GSS), we usually do not find correlations higher than .3 or so.
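Why averaging raises r can be seen in a toy data set.  The sketch below is an illustration only: the 5 "countries" of 5 "people" each are invented, the helper pearson_r is hypothetical, and the numbers are not meant to reproduce the .20 and .75 in the example above.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed from deviations around the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented person-level spread around each country's average
dx = [-20, -10, 0, 10, 20]   # spending, relative to the country average
dy = [2, -1, -2, -1, 2]      # health status, relative to the country average

people_x, people_y = [], []
country_x, country_y = [], []
for g in range(1, 6):                      # five hypothetical countries
    xs = [30 + 10 * g + d for d in dx]     # each person's spending
    ys = [3 + g + d for d in dy]           # each person's health (stays in 0-10)
    people_x += xs
    people_y += ys
    country_x.append(sum(xs) / len(xs))    # country average spending
    country_y.append(sum(ys) / len(ys))    # country average health

r_people = pearson_r(people_x, people_y)        # unit of analysis = people
r_countries = pearson_r(country_x, country_y)   # unit of analysis = countries
print(round(r_people, 2), round(r_countries, 2))
```

In this toy case the person-to-person scatter inside each country disappears once the scores are averaged, so the country-level r comes out much higher than the person-level r.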

Explained Variation (r²)

You can also calculate the amount of variation that the IV explains of the DV by squaring the correlation.

r² = r * r

r² * 100 = The percent of variation in the DV that the IV explains.

Example: IV = health care expenditures (in dollars),  DV = health status (0-10)

Unit of analysis = people; r = .20, r² = .04.  Convert .04 to a percentage (.04 * 100; or move the decimal two places to the right).  Interpretation:  Health care expenditures explain 4% of the variation in people's health status.

Unit of analysis = countries; r = .75, r² = .56 (rounded from .5625).  Convert .56 to a percentage (.56 * 100; or move the decimal two places to the right).  Interpretation:  Health care expenditures explain 56% of the variation in countries' health status.
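The squaring step is easy to check by machine.  A minimal sketch in Python follows; the function name explained_variation_pct is made up for illustration.

```python
def explained_variation_pct(r):
    """Percent of variation in the DV that the IV explains: r squared times 100."""
    return r * r * 100

pct_people = explained_variation_pct(0.20)      # unit of analysis = people
pct_countries = explained_variation_pct(0.75)   # unit of analysis = countries

print(f"people: {pct_people:.2f}%  countries: {pct_countries:.2f}%")
```

Note that .75 squared is exactly .5625; the 56% above comes from rounding r² to two decimals before converting.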

See board for Venn diagram.

Limitations of Correlations

1. Correlation does not prove causation.  In the social sciences, many independent variables could also be dependent variables. For example, what influences what -- does education determine income, or does income determine education, or both?  If you specify that x = education and y = income and find a significant positive correlation, you should not say that education causes income to increase (unless you are able to establish time ordering, which you usually can't with survey data).  The correlation does not indicate causation, only covariance.  The correlation between x = income and y = education would be the same as that between x = education and y = income.

2. Correlations only show whether a linear relationship occurs between two variables.  Many relationships in the social sciences are not linear. For example, look at the possible relationships between age and the consumption of pornography.  See board for scatterplot.
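Both limitations can be demonstrated with a few made-up numbers.  The sketch below (Python; hypothetical helper pearson_r, invented data) shows that swapping x and y leaves r unchanged, and that a perfectly predictable but curved relationship can still give r = 0.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed from deviations around the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Limitation 1: r is symmetric, so it cannot say which variable is the cause.
education = [10, 12, 12, 14, 16, 18]   # years of schooling (invented)
income = [22, 30, 27, 41, 50, 48]      # thousands of dollars (invented)
r_xy = pearson_r(education, income)
r_yx = pearson_r(income, education)

# Limitation 2: an inverted-U pattern (rises, peaks, falls) has no linear trend.
age = [20, 30, 40, 50, 60, 70, 80]
y_curved = [1, 6, 9, 10, 9, 6, 1]      # invented scores that peak at age 50
r_curved = pearson_r(age, y_curved)

print(round(r_xy, 2), round(r_yx, 2), round(r_curved, 2))
```

A scatterplot would reveal the curve immediately, which is one reason to look at the plot before trusting r.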

Calculating Pearson's Correlation by Hand

See board for formula.
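The formula itself is on the board and is not reproduced in these notes.  Assuming it is the usual computational (raw-score) form, which uses the same sums as the table in the example below (the sums of x, y, xy, x squared, and y squared), it would read:

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[\,n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```

Here n is the number of cases, and each sum runs over all n cases.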

Example:  n =10, x = number of absences, y = final grade in SOC 301 course

absences   grade
    1        92
    4        93
    2        90
   10        45
   15        36
    6        79
    0        92
    1        93
    2        95
    9        44

Calculate: Use a table to organize the computations.

         x      y     xy     x²      y²
         1     92     92      1    8464
         4     93    372     16    8649
         2     90    180      4    8100
        10     45    450    100    2025
        15     36    540    225    1296
         6     79    474     36    6241
         0     92      0      0    8464
         1     93     93      1    8649
         2     95    190      4    9025
         9     44    396     81    1936
sum     50    759   2787    468   62849

See board for formula calculations.

r = -.94
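The same arithmetic can be reproduced in a few lines of Python.  This sketch assumes the standard computational formula for Pearson's r (the one built from the five column sums); the variable names are chosen here for illustration.

```python
import math

x = [1, 4, 2, 10, 15, 6, 0, 1, 2, 9]            # number of absences
y = [92, 93, 90, 45, 36, 79, 92, 93, 95, 44]    # final grade

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))   # column xy
sum_x2 = sum(a * a for a in x)              # column x squared
sum_y2 = sum(b * b for b in y)              # column y squared

# Computational formula: cross-product term over the two spread terms
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 2))   # -0.94
```

The five sums match the bottom row of the table, and the result rounds to the r = -.94 given above.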

*People rarely compute correlations by hand. Nor do they calculate the t-test on a correlation by hand. Thus, we will do all correlations on the computer.

Significance Tests

We use a t test to determine whether a correlation is different from 0. There are 3 research hypotheses we can test. You must choose one:

• There is a relationship between the IV and the DV. r ≠ 0.  (two tailed test)
• There is a positive relationship between the IV and the DV.  As the IV goes up, the DV goes up too (or as the IV goes down, the DV goes down too).  r > 0. (one tailed test, right hand side).
• There is a negative relationship between the IV and the DV.  As the IV goes up, the DV goes down (or as the IV goes down, the DV goes up).  r < 0. (one tailed test, left hand side.)

Then draw your diagram using the alpha that you set ahead of time.  Be sure to draw the number of tails that correspond to your research hypothesis, and to split alpha in half for a two tailed test.
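The t value that gets placed on the diagram is not spelled out in these notes.  As a sketch, the standard t statistic for a correlation is t = r * sqrt(n - 2) / sqrt(1 - r²), with n - 2 degrees of freedom.  The Python below applies it to the absences example (r = -.94, n = 10); the function name t_for_r is hypothetical, and the critical value 2.306 is the usual t-table entry for a two tailed test at alpha = .05 with 8 degrees of freedom.

```python
import math

def t_for_r(r, n):
    """t statistic for testing whether a correlation differs from 0 (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = -0.94, 10            # absences and final grade example
df = n - 2
t = t_for_r(r, n)

critical_t = 2.306          # t table: two tailed, alpha = .05, df = 8
in_rejection_region = abs(t) > critical_t

print(df, round(t, 2), in_rejection_region)
```

Here t is close to -7.8, far beyond the critical value, so it falls in the left-hand rejection region of the diagram.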

If p is lower than alpha, reject the null.  If p is higher than alpha, fail to reject the null (often loosely called accepting the null).  Then give your interpretation.  If you fail to reject the null, you say that you found no evidence of a correlation between the IV and the DV.  If you reject the null, you say that there is a correlation between the IV and the DV, and then tell us what that relationship is.  Is it a positive or negative correlation?  As the IV increases, what happens in the DV?  How much variation in the DV does the IV explain?
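The decision rule can be written out as a short sketch.  The Python below uses hypothetical function names and made-up p values.  It also assumes the p value comes from a two tailed test (as software commonly reports) and that the t distribution is symmetric, so a one tailed p is half the two tailed p when r points in the hypothesized direction.

```python
def one_tailed_p(p_two_tailed, direction_matches):
    """Convert a two tailed p to a one tailed p (symmetric t distribution assumed)."""
    half = p_two_tailed / 2
    # If r has the opposite sign from the research hypothesis, p is large
    return half if direction_matches else 1 - half

def decide(p, alpha):
    """Apply the rule: reject the null only when p is lower than alpha."""
    return "reject the null" if p < alpha else "fail to reject the null"

# Made-up p values, for illustration only
result_05 = decide(0.03, 0.05)                              # two tailed, alpha = .05
result_01 = decide(0.03, 0.01)                              # same p, stricter alpha
result_one_tailed = decide(one_tailed_p(0.08, True), 0.05)  # .08 / 2 = .04

print(result_05, "|", result_01, "|", result_one_tailed)
```

The same p value can lead to different decisions under different alphas, which is why alpha has to be set ahead of time.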

If the r is significant, then the r² is too, in a bivariate analysis.

Example 1.

I think that education influences the number of children that people have.

IV = education level (0 to 20)

DV = number of children (0 to 8+)

Null Hypothesis: There is no linear relationship between education and the number of children that people have.  r = 0.

Research Hypothesis:  There is a linear relationship between education and the number of children that people have.  r ≠ 0.

Alpha = .05.  Two tailed test.  Draw diagram.

From SPSS (using the GSS), we learn that  r = -.21, p = .000  (SPSS rounds to three decimals, so read .000 as p < .001)

r² = .0441

Reject the null.   There is a weak negative relationship between education and the number of children that people have.  As education increases, the number of children that people have tends to decrease slightly. Education explains 4.41% of the variation in the number of children.

Example 2.

I think that age influences how many siblings that people have. Specifically, I think that older people tend to have more siblings than younger people.

IV = age (18 to 89)

DV = number of siblings (0 to 24)

Null Hypothesis: There is no linear relationship, or there is a negative relationship, between age and the number of siblings that people have.  r = 0 or r < 0 (that is, r ≤ 0).

Research Hypothesis:  There is a linear positive relationship between age and the number of siblings that people have.  r > 0.

Alpha = .05.  One tailed test.  Draw diagram.

From SPSS (using the GSS), we learn that  r = .14, p = .000

r² = .0196

Reject the null.   There is a weak positive relationship between age and the number of siblings that people have.  As age increases, the number of siblings that people have tends to increase a little. Age explains 1.96% of the variation in the number of siblings.

What if alpha was .01?  Still reject the null, because p = .000 is lower than .01.  The conclusion above does not change.

Example 3.

I think that the number of hours that people work per week influences how many times they have sex.

IV = hours worked (3 to 89)

DV = sex frequency (0 to 6)

Null Hypothesis: There is no linear relationship between the number of hours that people work per week and the number of times they have sex.  r = 0.

Research Hypothesis:  There is a linear relationship between the number of hours that people work per week and the number of times they have sex.  r ≠ 0.

Alpha = .05.  Two tailed test.  Draw diagram.

From SPSS (using the GSS), we learn that  r = .06, p = .027

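Reject the null, because p = .027 is lower than alpha = .05.  There is a very weak positive relationship between the number of hours that people work per week and the number of times that people have sex.  r² = .0036, so hours worked explains only 0.36% of the variation in sex frequency.  (If alpha had been .01, we would accept the null, because .027 is higher than .01.)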

Take Home Example

I think people with higher income (measured in dollars) watch less television (measured in hours) than people with lower incomes.

Alpha = .05.

r = -.19, p = .000 (two tailed)