
Polynomial Regression in CoStat

Polynomial equations have the general form:

y = b₀ + b₁x¹ + b₂x² + b₃x³ + b₄x⁴ + b₅x⁵ + ... + bₙxⁿ

where b₀ is an optional constant term and b₁ through bₙ are the coefficients of increasing powers of x. You must specify the order (degree) of the polynomial that you want to fit to your data.

A linear equation (y = b₀ + b₁x) is called a first order polynomial.
A quadratic polynomial equation (y = b₀ + b₁x + b₂x²) is called a second order polynomial.
A cubic polynomial equation (y = b₀ + b₁x + b₂x² + b₃x³) is called a third order polynomial.
Higher order (4th or 5th) polynomials are useful when the goal is to describe the data points as closely as possible, but their terms generally cannot be meaningfully interpreted in any biological or physical sense. Higher order terms can also lead to odd and unreasonable results, especially beyond the range of the x values.
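As a small illustration of the general form, the following sketch evaluates a polynomial from its coefficients b₀ through bₙ using Horner's rule. The function name eval_poly and the coefficient values are assumptions made for this example; they are not part of CoStat.

```python
def eval_poly(coefs, x):
    """Evaluate b0 + b1*x + ... + bn*x**n, with coefs = [b0, b1, ..., bn]."""
    result = 0.0
    # Horner's rule: work from the highest power down to the constant term.
    for b in reversed(coefs):
        result = result * x + b
    return result

# Hypothetical coefficients for a second order polynomial (illustration only).
quadratic = [0.32, 0.15, 0.02]

# Within a modest range of x, all three terms contribute noticeably.
y_inside = eval_poly(quadratic, 2.0)

# Far from that range, the highest power dominates the result.
y_far = eval_poly(quadratic, 100.0)
```

The second call hints at why extrapolating a fitted polynomial beyond the range of the x values can give unreasonable results: the highest order term quickly outweighs all the others.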

If your goal is to describe a smooth curve through a large number of data points, also consider splines (see Graph : Dataset : Representations in CoPlot) or other smoothing methods (for example, "Transformations : Smooth").

Data Format

There must be at least two numeric columns of data; you can designate any column as the x column and any column as the y column. Rows of data with missing values in the x or y column are rejected.
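The row-rejection rule can be pictured with a short sketch. It assumes that a missing value is represented by None and that rows are (x, y) pairs; both are conventions chosen for this example and say nothing about how CoStat stores data internally.

```python
# Hypothetical rows of (x, y); None stands in for a missing value.
rows = [(1, 2.0), (2, None), (None, 8.0), (4, 17.0)]

def complete_rows(rows):
    """Keep only the rows where both x and y are present."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

used = complete_rows(rows)

# These two counts correspond to the "Total number of data points" and
# "Number of data points used" lines in the procedure's printed results.
total_points = len(rows)
points_used = len(used)
```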

Options on the Statistics : Regression : Polynomial dialog box:

X Column: Choose the x column (the independent variable) from a list of the columns.
Y Column: Choose the y column (the dependent variable) from a list of the columns.
Degree: Specify the polynomial order. For example, Degree=2 will generate a quadratic equation (for example, y = 0.32 + 0.15*x + 0.02*x^2).
Keep If: Lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the expression evaluates to true, that row of data is used in the calculations; if false, that row is ignored. See "Using Equations", "the A button", and "the f() button".
Calculate Constant: In most cases, this should be checked. If it is not checked, the constant term b₀ is omitted and the fitted curve is forced through the origin (x=0, y=0).
Print Residuals: Prints the X values, Y observed, Y expected, and Residual (Y observed - Y expected). Printing these lets you see whether the residuals appear to be random (that's good) or show some trend (that's bad; some other type of equation may be more suitable).
Save Residuals: Optionally inserts two new columns in the data file with the expected Y's and the residuals. You can then use CoPlot to plot X vs. Y Observed and Y Expected, or plot X vs. the residuals.
OK: Press this to run the procedure when all of the settings above are correct.
Close: Close the dialog box.
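To make the Keep If option concrete, here is a minimal Python sketch of the same row-by-row test, using the example expression (col(1)>50) and (col(2)<col(3)). The helper col, the function keep_if, and the three-column data are hypothetical stand-ins; CoStat evaluates the expression with its own equation language.

```python
def col(row, n):
    """Return the value in column n of a row, counting columns from 1."""
    return row[n - 1]

def keep_if(row):
    # Mirrors the example expression (col(1)>50) and (col(2)<col(3)).
    return (col(row, 1) > 50) and (col(row, 2) < col(row, 3))

# Hypothetical three-column data file (illustration only).
data = [
    (60, 1.0, 2.0),  # kept: 60 > 50 and 1.0 < 2.0
    (40, 1.0, 2.0),  # ignored: 40 is not greater than 50
    (70, 5.0, 3.0),  # ignored: 5.0 is not less than 3.0
]

# Only rows for which the expression is true take part in the calculations.
kept = [row for row in data if keep_if(row)]
```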

The Sample Run

The data for the sample run is a made-up set of x and y data points:

PRINT DATA
2000-08-04 16:17:44
Using: c:\cohort6\expdata.dt
  First Column: 1) X
  Last Column:  2) Y
  First Row:    1
  Last Row:     8

    X         Y     
--------- --------- 
        1         2 
        2       3.5 
        3         8 
        4        17 
        5        28 
        6        39 
        7        54 
        8        70

For the sample run, use File : Open to open the file called expdata.dt in the cohort directory. Then:

From the menu bar, choose: Statistics : Regression : Polynomial regression
X Column: 1) X
Y Column: 2) Y
Degree: 2
Keep If:
Calculate Constant: (checked)
Print Residuals: (checked)
Save Residuals: (don't)
OK


REGRESSION: POLYNOMIAL
2002-09-26 16:11:26
Using: C:\cohort6\expdata.dt
X Column: 1) X
Y Column: 2) Y
Degree: 2
Keep If: 
Calculate Constant: true

Total number of data points = 8
Number of data points used = 8
Regression equation: 
y = 0.54464285714
  -0.5625*x^1
  +1.16369047619*x^2
 
R^2 is the coefficient of multiple determination.  It is the fraction
of total variation of Y which is explained by the regression:
R^2=SSregression/SStotal.  It ranges from 0 (no explanation of the
variation) to 1 (a perfect explanation).

R^2 = 0.99893689645

For each term in the ANOVA table below, if P<=0.05, that term was a
significant source of Y's variation.

Source                              SS       df        MS         F     P
------------------------ ------------- -------- --------- --------- ---------
Regression               4352.83630952        2 2176.4182 2349.1054 .0000 ***
x^1                      4125.33482143        1 4125.3348 4452.6582 .0000 ***
x^2                      227.501488095        1 227.50149 245.55252 .0000 ***
Error                    4.63244047619        5 0.9264881
------------------------ ------------- -------- --------- --------- ---------
Total                       4357.46875        7

Table of Statistics for the Regression Coefficients:

Column                       Coef.  Std Error  t(Coef=0)      P      +/-95% CL
------------------------ ---------  ---------  ---------  ---------  ---------
Intercept                0.5446429   1.342886  0.4055764  .7018 ns   3.4519984
x^1                        -0.5625  0.6846597  -0.821576  .4487 ns   1.7599737
x^2                      1.1636905  0.0742618  15.670116  .0000 ***   0.190896

Degrees of freedom for two-tailed t tests = 5
If P<=0.05, the coefficient is significantly different from 0.

Residuals:

      Row              X     Y observed     Y expected       Residual
---------  -------------  -------------  -------------  -------------
        1              1              2  1.14583333333  0.85416666667
        2              2            3.5   4.0744047619  -0.5744047619
        3              3              8  9.33035714286  -1.3303571429
        4              4             17  16.9136904762  0.08630952381
        5              5             28  26.8244047619   1.1755952381
        6              6             39        39.0625        -0.0625
        7              7             54  53.6279761905  0.37202380952
        8              8             70  70.5208333333  -0.5208333333
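The equation, R^2, and residuals printed above can be cross-checked independently. The sketch below fits the same eight data points by ordinary least squares, solving the normal equations with a small Gaussian elimination routine. It is a standard-library Python illustration, not CoStat's own algorithm; the function names solve, polyfit, and predict are assumptions made for this example.

```python
def solve(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for i in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= f * m[i][c]
    # Back substitution.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z

def polyfit(xs, ys, degree, constant=True):
    """Least-squares polynomial fit; returns {power of x: coefficient}."""
    powers = list(range(0 if constant else 1, degree + 1))
    # Normal equations (X'X) b = X'y, where column j of X holds x**powers[j].
    xtx = [[sum(x ** (p + q) for x in xs) for q in powers] for p in powers]
    xty = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in powers]
    return dict(zip(powers, solve(xtx, xty)))

def predict(coefs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(b * x ** p for p, b in coefs.items())

# The sample data from expdata.dt, as listed in the PRINT DATA output.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 3.5, 8, 17, 28, 39, 54, 70]

coefs = polyfit(xs, ys, degree=2)
expected = [predict(coefs, x) for x in xs]
residuals = [y - e for y, e in zip(ys, expected)]  # Y observed - Y expected

# R^2 = SSregression / SStotal, with SStotal measured about the mean of Y.
mean_y = sum(ys) / len(ys)
ss_total = sum((y - mean_y) ** 2 for y in ys)
ss_error = sum(r ** 2 for r in residuals)
r_squared = (ss_total - ss_error) / ss_total

print(coefs)
print(r_squared)
```

Calling polyfit(xs, ys, degree=2, constant=False) in this sketch omits the x⁰ column, which corresponds to running the procedure with Calculate Constant unchecked.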

If you uncheck Calculate Constant, the constant term is not calculated and the curve is forced through the origin (x=0, y=0). The results are then:

REGRESSION: POLYNOMIAL
2002-09-26 16:14:38
Using: C:\cohort6\expdata.dt
X Column: 1) X
Y Column: 2) Y
Degree: 2
Keep If: 
Calculate Constant: false

Total number of data points = 8
Number of data points used = 8
Regression equation: 
y = 
 -0.3076671035*x^1
  +1.13870685889*x^2
 
R^2 is the coefficient of multiple determination.  It is the fraction
of total variation of Y which is explained by the regression:
R^2=SSregression/SStotal.  It ranges from 0 (no explanation of the
variation) to 1 (a perfect explanation).

R^2 = 0.99954387736

For each term in the ANOVA table below, if P<=0.05, that term was a
significant source of Y's variation.

Source                              SS       df        MS         F     P
------------------------ ------------- -------- --------- --------- ---------
Regression               10485.4651595        2 5242.7326 6574.1784 .0000 ***
x^1                      9787.10294118        1 9787.1029 12272.638 .0000 ***
x^2                      698.362218282        1 698.36222 875.71848 .0000 ***
Error                    4.78484054172        6 0.7974734
------------------------ ------------- -------- --------- --------- ---------
Total                         10490.25        8

Table of Statistics for the Regression Coefficients:

Column                       Coef.  Std Error  t(Coef=0)      P      +/-95% CL
------------------------ ---------  ---------  ---------  ---------  ---------
x^1                      -0.307667  0.2523271  -1.219318  .2685 ns   0.6174222
x^2                      1.1387069  0.0384795  29.592541  .0000 ***   0.094156

Degrees of freedom for two-tailed t tests = 6
If P<=0.05, the coefficient is significantly different from 0.

Residuals:

      Row              X     Y observed     Y expected       Residual
---------  -------------  -------------  -------------  -------------
        1              1              2  0.83103975535  1.16896024465
        2              2            3.5  3.93949322848  -0.4394932285
        3              3              8   9.3253604194  -1.3253604194
        4              4             17  16.9886413281  0.01135867191
        5              5             28  26.9293359546  1.07066404543
        6              6             39  39.1474442988  -0.1474442988
        7              7             54  53.6429663609  0.35703363914
        8              8             70  70.4159021407  -0.4159021407

Note that the Total degrees of freedom now equals the number of data points (8, one more than before), because no degree of freedom is used to estimate the mean of Y. The R^2 value is also higher than the R^2 value for the model with a constant term, even though the Error SS is slightly larger (4.78 vs. 4.63). This is because the R^2 value is calculated in a different way when there is no constant term (see "Regression - Details - R^2" and "Regression - Constant term"), so the two R^2 values are not directly comparable.
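The Total line of the second ANOVA table (SS = 10490.25 on 8 df) equals the plain sum of the squared Y values, which suggests that the no-constant R^2 uses a total sum of squares measured from zero rather than from the mean of Y. The sketch below checks that reading against the printed numbers. It is an interpretation of the output shown here, not a statement of CoStat's internal method (the referenced help topics are the authority); the variable names are assumptions made for this example.

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 3.5, 8, 17, 28, 39, 54, 70]

# Coefficients copied from the printed no-constant regression equation.
b1, b2 = -0.3076671035, 1.13870685889

expected = [b1 * x + b2 * x ** 2 for x in xs]
ss_error = sum((y - e) ** 2 for y, e in zip(ys, expected))

# Total SS measured from zero: matches Total = 10490.25 (df = 8).
ss_total_uncorrected = sum(y ** 2 for y in ys)

# Total SS measured from the mean of Y: matches Total = 4357.46875 (df = 7)
# in the run that included a constant term.
mean_y = sum(ys) / len(ys)
ss_total_corrected = sum((y - mean_y) ** 2 for y in ys)

# R^2 on each basis for the same no-constant fit.
r2_uncorrected = 1 - ss_error / ss_total_uncorrected
r2_corrected = 1 - ss_error / ss_total_corrected

print(r2_uncorrected, r2_corrected)
```

Under these assumptions, r2_uncorrected agrees with the printed R^2 of 0.99954387736, while r2_corrected comes out slightly below the 0.99893689645 reported for the model with a constant term. The larger denominator, rather than a better fit, accounts for the higher printed value.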
