Analysis of Frequency Data in CoStat
(Including Cross Tabulation, Calculating Expected Values, Goodness-Of-Fit Tests, Tests of Independence, Chi-Square, Likelihood Ratio, Log-Linear, and Fisher's Exact Tests)
Analysis of Frequency Data deals with data
that has been tabulated; that is, counts of the sampled items that fall
into different categories. The categories can be based on 1 criterion
("1 way", for example, sex), 2 criteria ("2 way", for example, sex and race),
or 3 criteria ("3 way", for example, sex, race, and religion). For 2 way and 3
way tabulations, the process is often called cross tabulation. The
process of tabulation is also called binning, since it is analogous to
sorting or categorizing items and putting them into bins.
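As a rough illustration of binning numeric values, here is a minimal Python sketch. It is not CoStat's code: the function name tabulate, the sample heights, and the rule that a value lying exactly on a class limit goes into the upper class are all assumptions made for this example.

```python
import math
from collections import Counter

def tabulate(values, lower_limit, class_width):
    """Count how many values fall into each class (bin).

    Each class is labeled by its lower limit. ASSUMPTION: a value lying
    exactly on a class limit is counted in the upper class.
    """
    counts = Counter()
    for x in values:
        k = math.floor((x - lower_limit) / class_width)
        counts[lower_limit + k * class_width] += 1
    return dict(sorted(counts.items()))

# Hypothetical heights, invented for illustration only.
heights = [62.0, 68.5, 71.0, 79.9, 80.0, 93.2, 95.0, 101.4]
table = tabulate(heights, lower_limit=60, class_width=10)
for limit, count in table.items():
    print(limit, count)
```

Each printed row is one bin: the class's lower limit followed by the number of items that fell into it.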
This type of frequency analysis is quite different from an FFT
(Fast Fourier Transform), which finds the component frequencies
(as in cycles per second) in a time series.
There are several procedures in CoStat related
to frequency data:
- Cross Tabulation - Tabulate the data if not tabulated already.
- 1 Way, Calculate Expected Values
  - For 1 way tabulations, calculate the expected values based
    on the normal, binomial, or Poisson distributions.
  - Print a table of observed and expected frequencies and
    descriptive statistics.
- Analysis
  - 1 Way Tests
    - Calculate descriptive statistics:
      the mean, standard deviation, skewness, and kurtosis of the data.
    - Perform the Kolmogorov-Smirnov test of goodness-of-fit (before
      pooling).
    - Test goodness-of-fit (how closely the expected values
      match the observed values) with the Chi-Square
      and Likelihood Ratio tests.
  - 2 Way Tests
    - Print margin totals.
    - Test the independence (the lack of interaction)
      of the two factors with the Chi-Square test,
      the Likelihood Ratio test, and Fisher's Exact Test.
  - 3 Way Tests
    - Print margin totals.
    - Test the independence (the lack of interaction) of the
      three factors with Log-Linear models.
Analysis of Frequency Data in the CoStat Manual
CoStat's manual has:
- An introduction to analysis of frequency data.
- A description of the calculation methods that are used by the program.
- 7 complete sample runs.
The sample runs show how to do 7 different types of analysis
of frequency data. Here is sample run #1:
Sample Run 1 - 1 Way, Not-Yet-Tabulated Data, Normal Distribution
In this example, the raw, untabulated data is from the
wheat experiment.
In the wheat experiment (page 233),
three varieties of wheat were grown at four locations.
At each of the locations, there were four blocks, within
each of which were small plots for each of the varieties.
The Height and Yield of each plot were measured.
The goal is to
visualize the distribution of plant heights and compare this
distribution to a normal distribution. The analysis will
indicate if the distribution of heights is significantly different
from the normal distribution.
For this sample run, the values of one column, Height,
need to be tabulated. Open the wheat.dt data file and specify:
- From the menu bar, choose: Statistics : Frequency Analysis : Cross Tabulation
- Keep If:
- Column 1: 4)Height (This automatically sets:
Numeric (checked), Lower Limit=60, Class width=10,
New Name=Height Classes).
- Insert Results At: (the end)
- Frequency Name: Observed
- Print Frequencies: (not checked) (they'll be printed later)
- OK
The printed results are:
CROSS TABULATION
2000-08-03 12:19:18
Using: c:\cohort6\wheat.dt
n Way: 1
Keep If:
n Data Points = 48
Column         Numeric    Lower Limit    Class Width    New Name       n Classes
-------------  ---------  -------------  -------------  -------------  ---------
4) Height      true       60             10             Height Classe  10
The procedure then calculates descriptive statistics for the population
and asks you which distribution to use when calculating expected
frequencies: normal, Poisson, or binomial. (The Poisson and
binomial distributions are only options when the class width is 1 and
the lowest limit is -0.5.)
Most data has an expected normal distribution. The significance
tests for many statistics (for example, product moment correlation
coefficient) assume that the population is normally distributed. In
this example, we will test the fidelity of the height distribution to
normality by looking at the skewness and kurtosis of the distribution.
The theoretical normal distribution
(based on the mean and standard deviation) appears as a straight line
on this graph. The Poisson and binomial distributions are discussed in
the next 2 sample runs.
The procedure can use the observed descriptive statistics to calculate
the expected values (an intrinsic hypothesis), or you can enter other
values to be used when calculating the expected values (an extrinsic
hypothesis). The distinction between testing an intrinsic and an
extrinsic hypothesis is important because they are tested with slightly
different goodness-of-fit tests (see Sokal and Rohlf, 1981 or 1995, for
more information).
The normal distribution uses estimates of 2 parameters from the
population (the mean and the standard deviation) when calculating the
expected frequencies.
Differences from Descriptive statistics - If you start an analysis
with Statistics : Frequency Analysis : 1 Way, Calculate Expected
using already tabulated data (and not with raw data and
Statistics : Frequency Analysis : Cross Tabulation),
the mean and standard deviation calculated here will be based on the
tabulated data and will differ somewhat from the mean and standard
deviation as calculated in Statistics : Descriptive.
The statistics calculated on tabulated data
assume that all items in a given bin have
a value equal to the bin's lower limit plus 1/2 the class width.
So if you have the raw data and want to know the mean and
standard deviation, use the statistics calculated in
Statistics : Descriptive, since they are more accurate.
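The midpoint assumption can be checked with a short Python sketch. It uses the class limits and observed frequencies printed later in this sample run and treats every item as sitting at its bin's lower limit plus half the class width. The variable names are ours, and dividing by n - 1 for the variance is an assumption that happens to reproduce the printed values.

```python
import math

# Tabulated Height data from this sample run (class width 10).
lower_limits = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
observed = [6, 5, 5, 15, 2, 5, 4, 2, 1, 3]
class_width = 10

# Every item in a bin is assumed to equal lower limit + 1/2 class width.
midpoints = [lo + class_width / 2 for lo in lower_limits]

n = sum(observed)
mean = sum(f * m for f, m in zip(observed, midpoints)) / n
sum_sq = sum(f * (m - mean) ** 2 for f, m in zip(observed, midpoints))
variance = sum_sq / (n - 1)   # ASSUMPTION: sample variance (n - 1 divisor)
sd = math.sqrt(variance)

print(n, min(midpoints), max(midpoints))
print(mean, sd, variance)
```

Because the bin midpoints stand in for the raw values, the minimum and maximum come out as 65 and 155 rather than the actual extreme heights.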
Continuing with the sample run, we will choose to
calculate expected values based on the
normal distribution, using the mean and standard deviation calculated from the
data. On the Frequency 1 Expected dialog:
- Lower Limit: 6) Height Classes
- Observed: 7) Observed
- Distribution: Normal
- Mean: (use default)
- Standard Deviation: (use default)
- Save Expected: (checked)
- OK
The results are:
1 WAY FREQUENCY ANALYSIS - Calculate Expected Values
2000-08-03 12:21:33
Using: c:\cohort6\wheat.dt
Lower Limit Column: 6) Height Classes
Observed Column: 7) Observed
Distribution: Normal
Mean: 99.5833333333
Standard Deviation: 24.92186371
n Data Points = 48
n Classes = 10
Descriptive Statistics (for the tabulated data)
Testing skewness=0 and kurtosis=0 tests if the numbers have a
normal distribution.
(Poisson distributed data should have significant positive skewness.)
(Binomially distributed data may or may not have significant skewness.)
If the probability that skewness equals 0 ('P(g1=0)') is <=0.05,
the distribution is probably not normally distributed.
If the probability that kurtosis equals 0 ('P(g2=0)') is <=0.05,
the distribution is probably not normally distributed.
Descriptive Statistics fit a normal distribution to the data:
Mean is the arithmetic mean (or 'average') of the values.
Standard Deviation is a measure of the dispersion of the distribution.
Variance is the square of the standard deviation.
Skewness is a measure of the symmetry of the distribution.
Kurtosis is a measure of the peakedness of the distribution.
If skewness or kurtosis is significantly greater or less than 0 (P<=0.05),
it indicates that the population is probably not normally distributed.
n data points = 48
Min = 65.0
Max = 155.0
Mean = 99.5833333333
Standard deviation = 24.92186371
Variance = 621.09929078
Skewness = 0.62821922472 Standard Error = 0.3431493092
Two-tailed test of hypothesis that skewness = 0 (df = infinity) :
P = .0672 ns
Kurtosis = -0.1752294896 Standard Error = 0.67439742269
Two-tailed test of hypothesis that kurtosis = 0 (df = infinity) :
P = .7950 ns
Height Cl  Observed   Percent    Expected   Deviation
---------  ---------  ---------  ---------  -------------
60         6          12.500     5.6450522  0.35494776137
70         5          10.417     4.7227305  0.27726948384
80         5          10.417     6.4461811  -1.4461810931
90         15         31.250     7.5061757  7.49382431075
100        2          4.167      7.4566562  -5.45665624
110        5          10.417     6.31944    -1.319439972
120        4          8.333      4.5689842  -0.5689841573
130        2          4.167      2.8181393  -0.8181393174
140        1          2.083      1.4828591  -0.4828590617
150        3          6.250      1.0337817  1.96621828556
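The skewness and kurtosis lines in the output above can be reproduced with the following Python sketch. It assumes the usual sample formulas for g1 and g2 and their standard errors (as given in Sokal and Rohlf) and a normal approximation for the two-tailed test (df = infinity). These are our assumptions about the method, chosen because they match the printed statistics; the helper names are hypothetical.

```python
import math

lower_limits = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
observed = [6, 5, 5, 15, 2, 5, 4, 2, 1, 3]
midpoints = [lo + 5 for lo in lower_limits]   # lower limit + 1/2 class width

n = sum(observed)
mean = sum(f * m for f, m in zip(observed, midpoints)) / n

def dev_sum(power):
    """Frequency-weighted sum of deviations from the mean, raised to a power."""
    return sum(f * (m - mean) ** power for f, m in zip(observed, midpoints))

s = math.sqrt(dev_sum(2) / (n - 1))

# ASSUMPTION: standard sample skewness (g1) and kurtosis (g2) formulas.
g1 = n * dev_sum(3) / ((n - 1) * (n - 2) * s ** 3)
g2 = ((n + 1) * n * dev_sum(4) / ((n - 1) * (n - 2) * (n - 3) * s ** 4)
      - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

se_g1 = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_g2 = math.sqrt(24 * n * (n - 1) ** 2
                  / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

def two_tailed_p(statistic, standard_error):
    """Two-tailed P for statistic = 0, using the standard normal distribution."""
    return math.erfc(abs(statistic / standard_error) / math.sqrt(2))

p_skew = two_tailed_p(g1, se_g1)
p_kurt = two_tailed_p(g2, se_g2)
print(g1, se_g1, p_skew)
print(g2, se_g2, p_kurt)
```

Neither P value falls below 0.05, which agrees with the "ns" flags in the output.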
Pooling - When expected
frequencies for the normal and binomial distributions are calculated,
the probabilities in the left and right tails (beyond the lowest and
highest class limits) are added to the expected
frequencies of the lowest and highest classes, respectively.
The methods for calculating the expected frequencies can be found in
Sokal and Rohlf (1981 or 1995).
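For the normal distribution, the calculation can be sketched in Python as follows. This is our reading of the description above, not CoStat's code: each class gets n times the normal probability between its limits, and the two outermost classes also absorb the tails. The function names are hypothetical.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of the normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def expected_normal(lower_limits, class_width, n, mean, sd):
    """Expected frequency of each class under a normal distribution."""
    expected = []
    last = len(lower_limits) - 1
    for i, lo in enumerate(lower_limits):
        # The lowest class absorbs the whole left tail and the
        # highest class absorbs the whole right tail.
        p_low = 0.0 if i == 0 else normal_cdf(lo, mean, sd)
        p_high = 1.0 if i == last else normal_cdf(lo + class_width, mean, sd)
        expected.append(n * (p_high - p_low))
    return expected

lower_limits = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
expected = expected_normal(lower_limits, class_width=10, n=48,
                           mean=99.5833333333, sd=24.92186371)
for lo, e in zip(lower_limits, expected):
    print(lo, round(e, 4))
```

Because the tails are folded into the end classes, the expected frequencies add up to the number of data points (48 here).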
The final stage of the sample run sets up the goodness of fit tests.
On the Statistics : Frequency Analysis : 1 Way Tests dialog, choose:
- Observed: 7) Observed
- Expected: 8) Expected
- n Intrinsic: 2 (In this case, two parameters
which were calculated from the data, mean
and standard deviation, were used to compute the expected values.)
- OK
The results are:
1 WAY FREQUENCY ANALYSIS - Goodness-Of-Fit Tests
2000-08-03 12:23:34
Using: c:\cohort6\wheat.dt
Observed Column: 7) Observed
Expected Column: 8) Expected
n Intrinsic (parameters estimated from the data): 2
n Observed = 48
n Expected = 48
n Classes Before Pooling = 10
n Classes After Pooling = 6
These tests test the goodness-of-fit of the observed and expected values.
If P<=0.05, the expected distribution is probably not a good fit of the
data.
Kolmogorov-Smirnov Test
(not recommended for discrete data; recommended for continuous data)
D obs = 0.13916375964
n = 48
Since n<=100, see Table Y in Rohlf & Sokal (1995) for critical
values for an intrinsic hypothesis.
Likelihood Ratio Test
(ok for discrete data; ok for continuous data)
G = 12.0082419926
df (nClasses-nIntrinsic-1) = 3
P = .0074 **
Likelihood Ratio Test with Williams' Correction
(recommended for discrete data; ok for continuous data)
G (corrected) = 11.5407353521
df (nClasses-nIntrinsic-1) = 3
P = .0091 **
Chi-Square Test
(ok for discrete data; ok for continuous data)
X2 = 12.0297034449
df (nClasses-nIntrinsic-1) = 3
P = .0073 **
All of these tests confirm that this is not a normally
distributed population, which is not surprising since it has a very
heterogeneous source (three varieties grown at four locations).
The test statistics are calculated as follows (from
Sokal and Rohlf, 1981 or 1995):
For the Kolmogorov-Smirnov test:
D = dmax/n
where:
- d = the absolute difference between expected and observed cumulative
frequencies
- dmax = the maximum of the differences
- n = the total number of data points (48 in this sample run)
If the number of data points is less than 100, critical values of
D can be found for extrinsic hypotheses in Table 32 (Rohlf and Sokal,
1981) (but not Table X in Rohlf and Sokal, 1995, which is a slightly
different table). For intrinsic hypotheses, see Table 33 (Rohlf and
Sokal, 1981) (but not Table Y in Rohlf and Sokal, 1995, which is a
slightly different table). Or, see other books of statistical
tables. If the total number of tabulated data points is greater than
99, the critical values of D are calculated by the procedure from the
following equation:
D_alpha = sqrt(-ln(alpha/2) / (2n))
- where alpha = the significance level
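A short Python sketch of the Kolmogorov-Smirnov calculation, using the observed and expected frequencies of this sample run before pooling. Here n is taken to be the total number of data points (48), which reproduces the printed D obs. The function name critical_d is ours.

```python
import math
from itertools import accumulate

# Observed and expected frequencies of the 10 classes (before pooling).
observed = [6, 5, 5, 15, 2, 5, 4, 2, 1, 3]
expected = [5.6450522, 4.7227305, 6.4461811, 7.5061757, 7.4566562,
            6.31944, 4.5689842, 2.8181393, 1.4828591, 1.0337817]

n = sum(observed)
cum_obs = list(accumulate(observed))
cum_exp = list(accumulate(expected))

# d = absolute difference between cumulative frequencies; D = dmax / n.
d_max = max(abs(o - e) for o, e in zip(cum_obs, cum_exp))
D = d_max / n
print(D)

def critical_d(alpha, n):
    """Large-sample critical value of D; the text uses it only when n > 99."""
    return math.sqrt(-math.log(alpha / 2) / (2 * n))

print(critical_d(0.05, 100))
```

With n = 48 the large-sample formula does not apply to this sample run, so the second call uses n = 100 purely as an illustration.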
For the likelihood ratio test:
G = 2 SUM( f_i ln(f_i / fhat_i) )
For the Chi-square test:
X2 = SUM( f_i^2 / fhat_i ) - n
- where f_i is the observed frequency and fhat_i is the expected
frequency of class i, and n is the total number of observations.
The test statistics G and X2 can be compared with
tabulated values of the Chi-square distribution. The degrees of
freedom equals the number of classes (after pooling) minus the number
of parameters estimated from the population to calculate the expected
frequencies (in this case 2, that is, the mean and the standard
deviation) minus 1. In this sample run, df = 6-2-1 = 3.
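The G and X2 statistics and the degrees of freedom can be sketched in Python as below. The output does not say which classes were pooled, so the grouping used here (60; 70+80; 90; 100; 110+120; 130+140+150) is an assumption; it was picked because it gives 6 classes and reproduces the printed G and X2. The helper chi2_sf_df3 is also ours and is a closed form that holds only for df = 3.

```python
import math

# ASSUMED pooling of the 10 classes into 6:
# [60], [70+80], [90], [100], [110+120], [130+140+150]
observed = [6, 5 + 5, 15, 2, 5 + 4, 2 + 1 + 3]
expected = [5.6450522,
            4.7227305 + 6.4461811,
            7.5061757,
            7.4566562,
            6.31944 + 4.5689842,
            2.8181393 + 1.4828591 + 1.0337817]

n = sum(observed)

# Likelihood ratio statistic: G = 2 * SUM(f * ln(f / fhat))
G = 2 * sum(f * math.log(f / fhat) for f, fhat in zip(observed, expected))

# Chi-square statistic: X2 = SUM(f^2 / fhat) - n
X2 = sum(f ** 2 / fhat for f, fhat in zip(observed, expected)) - n

n_intrinsic = 2                             # mean and standard deviation
df = len(observed) - n_intrinsic - 1        # 6 - 2 - 1

def chi2_sf_df3(x):
    """Upper-tail chi-square probability; closed form valid only for df = 3."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

print(G, X2, df)
print(chi2_sf_df3(G), chi2_sf_df3(X2))
```

For other degrees of freedom, a statistics library's chi-square distribution function would be the natural replacement for the closed form.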
Williams' Correction for the Likelihood Ratio test
(for intrinsic and extrinsic hypotheses) is used because it
leads to a closer approximation of a chi-square
distribution. See
Sokal and Rohlf, Section 17.2.
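The correction itself is not spelled out here. A minimal Python sketch, assuming the form q = 1 + (a^2 - 1) / (6 n v), where a is the number of classes after pooling, n is the number of data points, and v is the degrees of freedom, reproduces the corrected G printed in the sample run. Treat the formula, the choice of v, and the function name as our assumptions; see Sokal and Rohlf for the authoritative description.

```python
def williams_corrected_g(g, n_classes, n, df):
    """Divide G by Williams' correction factor q.

    ASSUMPTION: q = 1 + (a^2 - 1) / (6 * n * v), with a = n_classes
    (after pooling) and v = df.
    """
    q = 1 + (n_classes ** 2 - 1) / (6 * n * df)
    return g / q

# Values taken from the goodness-of-fit output of this sample run.
g_corrected = williams_corrected_g(12.0082419926, n_classes=6, n=48, df=3)
print(g_corrected)
```

Since q is always greater than 1, the corrected G is slightly smaller than the uncorrected value, which makes the test a little more conservative.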
Yates' Correction for Continuity -
Unlike earlier versions of CoStat, the new CoStat does
not do Yates' Correction for Continuity. It is now
thought to result in excessively conservative tests and
is not recommended. (See
Sokal and Rohlf, 1995, pg. 703.)
If there are no expected values, the goodness of fit tests will be
skipped.