STATISTICAL ANALYSIS OF ECOLOGICAL DATA

I. Objectives:

  1. Discuss why scientists employ statistics to understand ecological problems.
  2. Calculate the following descriptive statistics: mean, variance, standard deviation.
  3. Describe the concept of normal distribution.
  4. Understand and apply the t-test to detect differences between populations.
  5. Understand and apply the chi-square test to frequency or count data.

II. Introduction:

Ecologists are often concerned with numbers of organisms (density) and their patterns of distribution in nature. This makes ecology a quantitative science. However, ecologists cannot count and determine the location of every organism in a given area. Rather, ecologists must collect and analyze data from samples taken within the population. The quantitative data collected by ecologists can be (actually, must be) analyzed using statistics.

A. Terminology:
Before we begin our work, some definitions are needed:
    Σ = summation ("the sum of")
    x = single observation or data point
    n = sample size (number of data points)
    s² = variance
    s = standard deviation
    SS = sum of squares
    df = degrees of freedom, often n - 1

B. Descriptive Statistics:
Descriptive statistics summarize some aspect of the population. The most commonly used are the mean, median, mode, variance and standard deviation. For ecological studies, the mean, variance and standard deviation are most often used.

B1. Mean:

The mean is a measure of the central tendency (average) for a population.
    Mean = Σx/n
Example 1: In population #1, the following numbers of trees are counted in 5 quadrats: 1, 6, 11, 16, 21
    Mean = 55/5 = 11
Example 2: In population #2, the following numbers of trees are counted in 5 quadrats: 10, 11, 11, 11, 12
    Mean = 55/5 = 11
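
If you want to check this arithmetic with a computer, here is a minimal sketch in Python (not required for the lab) using the quadrat counts from Examples 1 and 2:

    # Minimal sketch: computing the mean of quadrat counts (Python).
    pop1 = [1, 6, 11, 16, 21]    # population #1 from Example 1
    pop2 = [10, 11, 11, 11, 12]  # population #2 from Example 2

    mean1 = sum(pop1) / len(pop1)   # 55 / 5 = 11.0
    mean2 = sum(pop2) / len(pop2)   # 55 / 5 = 11.0
    print(mean1, mean2)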

B2. Variance:
As shown above, two populations with the same mean may have quite different variation in numbers. In the example given above, population #2 has a narrow range of abundances per quadrat (10-12) while population #1 has a relatively wide range of numbers per quadrat (1-21). The variance is a measure of this spread around the mean.
    s² = SS/df
SS can be calculated as Σ(x - mean)², but this is a cumbersome equation to use when there are large numbers of data points. A simpler way of calculating SS on most calculators is:
    SS = Σx² - [(Σx)²/n]
Example 1: For population #1 described earlier:
    SS = 855 - [3025/5] = 855 - 605 = 250
    s² = 250/(5 - 1) = 62.5
Example 2: For population #2 described earlier:
    SS = 607 - 605 = 2
    s² = 2/(5 - 1) = 0.5
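
The two ways of calculating SS give the same answer, which is easy to confirm with a short script. The sketch below (Python, using the population #1 counts) computes SS both ways and then the variance:

    # Minimal sketch: sum of squares and variance for population #1 (Python).
    x = [1, 6, 11, 16, 21]
    n = len(x)
    mean = sum(x) / n

    ss_definition = sum((xi - mean) ** 2 for xi in x)           # sum of (x - mean)^2
    ss_shortcut = sum(xi ** 2 for xi in x) - (sum(x) ** 2) / n  # Σx² - (Σx)²/n

    variance = ss_shortcut / (n - 1)              # SS / df
    print(ss_definition, ss_shortcut, variance)   # 250.0 250.0 62.5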

B3. Standard Deviation:
The standard deviation is determined as the square root of the variance. For population #1, the standard deviation would be 7.9. For population #2, the standard deviation would be 0.7. The standard deviation is important because it provides an easily visualized measure of the variation from the mean for normally distributed data.

What does "normally distributed" mean? A normal distribution is the familiar bell curve, with the peak of the curve corresponding to the mean. A bell curve can be narrow and tall or broad and short, depending on whether the data have a low or high variance, and the standard deviation provides an easily understood estimate of this variability. For normally distributed data, approximately 95% of all possible observations (such as counts in quadrats) will lie within 2 standard deviations of the mean. This range is often known as the 95% confidence limits. For example, population #2 has a standard deviation of 0.7, so two standard deviations would be 1.4. Therefore, the 95% confidence limits for this population are 11 (the mean) ± 1.4, or 9.6-12.4. This means that if we took further quadrat samples from this population, on average 95% of these additional quadrats would have densities between 9.6 and 12.4 trees per quadrat.
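
These descriptive statistics are also easy to verify with a short script. The sketch below (a minimal Python example, not part of the lab procedure) computes the standard deviation of population #2 and the approximate 95% confidence limits as mean ± 2 standard deviations:

    # Minimal sketch: standard deviation and approximate 95% limits (Python).
    import math

    x = [10, 11, 11, 11, 12]   # population #2
    n = len(x)
    mean = sum(x) / n
    ss = sum((xi - mean) ** 2 for xi in x)
    sd = math.sqrt(ss / (n - 1))   # about 0.71

    lower = mean - 2 * sd          # about 9.6
    upper = mean + 2 * sd          # about 12.4
    print(round(sd, 2), round(lower, 1), round(upper, 1))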

C. Comparative Statistics

C1. What are comparative statistics?
Statistics can also be used to determine whether populations (or measurements of population characteristics) are similar or different. For example:
    Is the density of pine trees in two areas similar or different?
    Is the number of crabs in the Cape Fear estuary greater now than it was a decade ago?

This use of statistics is called significance testing. Even with statistics, the scientific method cannot prove anything; statistics can only demonstrate that an event is very unlikely. Typically, the investigator establishes a hypothesis and then tries to determine whether that hypothesis is likely by showing that the alternative (called the null hypothesis) is unlikely. For example, one may have a hypothesis that densities of pine trees are different between two forests. However, because statistics do not prove differences, the investigator actually seeks to show that the null hypothesis of no difference between the forests is unlikely to be true (confusing, isn't it!).

C2. Example using the t-test.

Let’s run through an example:

Step 1: A researcher develops the following hypothesis:
Ha = There is a difference in the density of pine trees between a recently burned forest and a forest that has not been burned for 25 years.

Step 2: A null hypothesis is formed:
Ho (the null hypothesis) = There is no difference in densities between the two forests.

Step 3: Data collection:
Next, the researcher collects data. In this case, quadrat sampling is appropriate and the following counts of pine trees per 100 m2 quadrats are recorded:
Unburned forest (4 quadrats): 5, 2, 3, 8
Burned forest (5 quadrats): 15, 25, 20, 11, 15

Step 4: Analysis of data:
Now the densities (no. per quadrat) of pines can be compared statistically to determine if there is a difference. Since these data represent replicate measures from two groups, an appropriate test for comparing the groups is the t-test.

Calculation by Hand:
The t-test has the following formula:
    t = |mean1 - mean2| / s(x1-x2)

where s(x1-x2) = sqrt[(sp²/n1) + (sp²/n2)] and sp² = (SS1 + SS2)/(df1 + df2)

The following numbers are calculated to determine the t-statistic for the two populations (b = burned, u = unburned):

    mean(u) = 18/4 = 4.5            mean(b) = 86/5 = 17.2
    n(u) = 4                        n(b) = 5
    SS(u) = 102 - (324/4) = 21      SS(b) = 1596 - (7396/5) = 116.8
    sp² = (21 + 116.8)/(3 + 4) = 19.7
    s(x1-x2) = sqrt[(19.7/4) + (19.7/5)] = sqrt(8.86) = 2.98
    t = |4.5 - 17.2| / 2.98 = 4.26

The degrees of freedom (df) for this test are (n(u) - 1) + (n(b) - 1) = (4 - 1) + (5 - 1) = 7

This t-value can now be compared with the critical value from a t-table. If the calculated value is greater than the table value in the row for your df at the 0.05 probability level, then you reject the null hypothesis and conclude there is a significant difference between the burned and unburned forests. In this case, the table value for 7 df and the 0.05 significance level is 1.895. Since 4.26 is greater than 1.895, we reject the null hypothesis and conclude that pine tree density is greater in the burned forest.
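
If Python is available, the same test can be reproduced outside of JMP IN. The sketch below is a minimal check of the hand calculation, not part of the lab procedure; the cross-check at the end assumes the scipy package is installed.

    # Minimal sketch: pooled two-sample t-test for the pine density data (Python).
    import math

    unburned = [5, 2, 3, 8]
    burned = [15, 25, 20, 11, 15]

    def ss(data):
        """Sum of squares around the sample mean."""
        m = sum(data) / len(data)
        return sum((x - m) ** 2 for x in data)

    n1, n2 = len(unburned), len(burned)
    mean1 = sum(unburned) / n1                                  # 4.5
    mean2 = sum(burned) / n2                                    # 17.2
    sp2 = (ss(unburned) + ss(burned)) / ((n1 - 1) + (n2 - 1))   # pooled variance, about 19.7
    se = math.sqrt(sp2 / n1 + sp2 / n2)                         # about 2.98
    t = abs(mean1 - mean2) / se                                 # about 4.27 (4.26 with rounded values)
    df = (n1 - 1) + (n2 - 1)                                    # 7
    print(t, df)

    # Cross-check (assumes scipy is installed); equal variances are assumed by default.
    from scipy import stats
    t_scipy, p = stats.ttest_ind(unburned, burned)
    print(t_scipy, p)   # t is negative because unburned is listed first; |t| matches, p < 0.05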

Calculation using JMP IN:

In this course, we can also calculate a t-test using a standard commercial statistical package, JMP IN. To do this, follow these steps:

  1. Double click on the JMP icon
  2. A data table will appear. If there is only one column (labeled "Column 1"), you will need to create a second column by double-clicking in the space to the right of "Column 1".
  3. Click on the column 1 heading and then click on the name. Type in "forest type". Do the same for column 2, typing in "density". For "forest type", change the default modeling type (the letter in the small box above the name) from "C" (continuous) to "N" (nominal). Leave "density" as "C".
  4. Enter the data in the following format (u=unburned forest, b=burned forest):
        Forest type    Density
    1       U             5
    2       U             2
    3       U             3
    4       U             8
    5       B            15
    6       B            25
    7       B            20
    8       B            11
    9       B            15
  5. From the menu bar, choose Analyze – Fit Y by X.
  6. Choose "forest type" as X and "density" as Y.
  7. Do the group means/one-way ANOVA comparison (which will be the default comparison for your data).
  8. You will get a graph of the data. Choose means, ANOVA/t-test.
  9. The results of the t-test will be displayed along with the results of several other tests. Please note that the calculated t-value is the same as the one we calculated by hand, and a p-value of less than 0.05 is shown, indicating a significant difference.

 

C3. Chi-square (χ²) test

Another test that is useful for comparing totals, counts, or frequencies is the χ² test. Using the χ² test, scientists can determine if observed values are the same as the values expected for a given situation. For example, you might survey the number of crabs under "large" rocks and "small" rocks in a swift current to determine if there is a difference in the number of crabs under each rock type.

The total number of crabs under 20 rocks was:

                     Large rocks        Small rocks
    Observed             200                 10
    Expected             105                105

The expected number is established by determining the number of crabs expected if the null hypothesis were true. In this case there is a hypothesis (Ha) of a difference between the rocks and a null hypothesis (Ho) of no difference. So, if there are a total of 210 crabs collected, with no difference in the number found under each rock type, there must be an expected number of 105 for both large and small rocks (105+105 = 210).

The χ² statistic is then calculated by:

    χ² = Σ[(observed - expected)²/expected]

In this example, χ² = (200 - 105)²/105 + (10 - 105)²/105 = 171.9

For this case, the degrees of freedom (df) for the test are determined by the number of groups minus 1 (2 - 1 = 1). For 1 degree of freedom at the 0.05 significance level, the critical table value is 3.84. Since the calculated value is greater than the table value, the null hypothesis is rejected and you conclude that there is a difference in the number of crabs under large rocks versus small rocks.
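
For those working in Python rather than with a calculator and table, here is a minimal sketch of the same calculation; the cross-check at the end assumes the scipy package is installed.

    # Minimal sketch: chi-square test for the crab counts (Python).
    observed = [200, 10]                # large rocks, small rocks
    total = sum(observed)               # 210 crabs in all
    expected = [total / 2, total / 2]   # null hypothesis of no difference: 105 and 105

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    print(chi2)   # about 171.9

    # Cross-check (assumes scipy is installed); df = number of groups - 1 = 1.
    from scipy import stats
    chi2_scipy, p = stats.chisquare(observed, f_exp=expected)
    print(chi2_scipy, p)   # p is far below 0.05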
