
# Multiple Regression in CoStat

Multiple regression is the simultaneous linear regression of one y column of data (the dependent variable) on several x columns of data (the independent variables). The general form of the resulting equation is:

$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$

where the b values are the coefficients for which the regression finds optimal (least squares) values.
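As a rough illustration of what "least squares values" means for the equation above, the following Python sketch fits the b coefficients with NumPy. It is not CoStat's implementation; the function name `fit_multiple_regression` and the small data set are made up for the example.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Return [b0, b1, ..., bn] that minimize the sum of squared residuals."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that b0 (the constant) is estimated too.
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

# Hypothetical data: two x columns and a y generated without noise.
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [1 + 2 * a + 3 * b for a, b in X]

b = fit_multiple_regression(X, y)
print(b)  # the fitted b0, b1, b2
```

Because the example y values contain no noise, the fit should recover the generating coefficients up to floating-point rounding.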

CoStat can do a multiple regression of a full model (where all of the x columns are in the model) or one subset (where you specify a subset of the x columns).

Note that some of the x columns may have been created from other x columns with CoStat's 'Transformations' procedure. For example, you could make a column with x1^2 (x1 squared), or a column with x1*x2. In this way, you can make a model of a "response surface" or other more complex models.
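The idea of derived columns can be sketched outside CoStat as well. The Python example below builds a squared column and a product column and then fits them like any other x columns; the helper name `response_surface_design`, the 3x3 grid of settings, and the response values are all assumptions made for illustration.

```python
import numpy as np

def response_surface_design(x1, x2):
    """Build the columns 1, x1, x2, x1^2 and x1*x2 from two raw x columns."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2])

# Hypothetical 3x3 grid of settings and a noise-free response.
g1, g2 = np.meshgrid([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
x1, x2 = g1.ravel(), g2.ravel()
y = 2.0 + 1.0 * x1 - 1.0 * x2 + 0.5 * x1**2 + 0.25 * x1 * x2

# The model is still linear in the coefficients, so ordinary least squares applies.
coefs, *_ = np.linalg.lstsq(response_surface_design(x1, x2), y, rcond=None)
print(np.round(coefs, 6))
```

The model is nonlinear in x1 and x2 but linear in the b values, which is why a transformed column can be treated as just another x column.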

Often, an experimenter has a large number of x columns and wishes to know if there is a smaller, simpler model with a subset of these x columns which adequately explains the dependent variable. For this type of multiple regression problem, see Subset Selection in Multiple Regression.

## Sample Run

In the sample run, we will estimate the relationship of the employment level with several economic variables (unemployment rate, GNP, etc.). The data is from an article testing computational accuracy (Longley, 1967).

```
PRINT DATA
2000-08-05 11:06:07
Using: c:\cohort6\longley.dt
First Column: 1) GNP def
Last Column:  7) Employment
First Row:    1
Last Row:     16

GNP def     GNP  Unemployment  Armed Forces  14 yrs  Time  Employment
-------  ------  ------------  ------------  ------  ----  ----------
     83  234289          2356          1590  107608  1947       60323
   88.5  259426          2325          1456  108632  1948       61122
   88.2  258054          3682          1616  109773  1949       60171
   89.5  284599          3351          1650  110929  1950       61187
   96.2  328975          2099          3099  112075  1951       63221
   98.1  346999          1932          3594  113270  1952       63639
     99  365385          1870          3547  115094  1953       64989
    100  363112          3578          3350  116219  1954       63761
  101.2  397469          2904          3048  117388  1955       66019
  104.6  419180          2822          2857  118734  1956       67857
  108.4  442769          2936          2798  120445  1957       68169
  110.8  444546          4681          2637  121950  1958       66513
  112.6  482704          3813          2552  123366  1959       68655
  114.2  502601          3931          2514  125368  1960       69564
  115.7  518173          4806          2572  127852  1961       69331
  116.9  554894          4007          2827  130081  1962       70551
```

Longley ran this seemingly routine regression on several mainframe computers and found widely varying answers, largely because the x values are large relative to their standard error and because of mild collinearity among the x values. CoHort's Regression compares quite well: the estimated coefficients are accurate to 10 significant figures.
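One common way to reduce this kind of numerical trouble is to center and scale each x column before solving, then convert the coefficients back to the original units. The Python sketch below does this for the Longley data listed above. It is an illustration of the general technique, not a description of CoStat's algorithm, and the function name `fit_standardized` is an assumption.

```python
import numpy as np

# Longley (1967) data as listed above: six x columns, then Employment (y).
longley = np.array([
    [83.0, 234289, 2356, 1590, 107608, 1947, 60323],
    [88.5, 259426, 2325, 1456, 108632, 1948, 61122],
    [88.2, 258054, 3682, 1616, 109773, 1949, 60171],
    [89.5, 284599, 3351, 1650, 110929, 1950, 61187],
    [96.2, 328975, 2099, 3099, 112075, 1951, 63221],
    [98.1, 346999, 1932, 3594, 113270, 1952, 63639],
    [99.0, 365385, 1870, 3547, 115094, 1953, 64989],
    [100.0, 363112, 3578, 3350, 116219, 1954, 63761],
    [101.2, 397469, 2904, 3048, 117388, 1955, 66019],
    [104.6, 419180, 2822, 2857, 118734, 1956, 67857],
    [108.4, 442769, 2936, 2798, 120445, 1957, 68169],
    [110.8, 444546, 4681, 2637, 121950, 1958, 66513],
    [112.6, 482704, 3813, 2552, 123366, 1959, 68655],
    [114.2, 502601, 3931, 2514, 125368, 1960, 69564],
    [115.7, 518173, 4806, 2572, 127852, 1961, 69331],
    [116.9, 554894, 4007, 2827, 130081, 1962, 70551],
])
X, y = longley[:, :6], longley[:, 6]

def fit_standardized(X, y):
    """Least squares on centered, unit-scaled columns, mapped back to raw units."""
    x_mean, x_scale = X.mean(axis=0), X.std(axis=0)
    Z = (X - x_mean) / x_scale
    # Centered columns are orthogonal to the constant, so no ones column is needed here.
    slopes_z, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
    slopes = slopes_z / x_scale
    intercept = y.mean() - slopes @ x_mean
    return intercept, slopes

intercept, slopes = fit_standardized(X, y)

# Condition numbers show how much harder the raw problem is numerically.
raw_cond = np.linalg.cond(np.column_stack([np.ones(len(y)), X]))
std_cond = np.linalg.cond((X - X.mean(axis=0)) / X.std(axis=0))
print(intercept, slopes)
print(raw_cond, std_cond)
```

The coefficients this produces can be compared with the regression equation in the CoStat output for the sample run.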

There is a fascinating follow-up article by Beaton, et al. (1976), which points out that a greater source of inaccuracy may be the data itself. Slight variations in the original data cause large variations in the results. This is an important consideration and further investigation of the matter is encouraged before accepting the results of any regression.
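The sensitivity described here can be explored with a small simulation. The Python sketch below does not use the Longley data or reproduce the analysis of Beaton et al.; it uses made-up columns to compare how much the fitted slopes move when every x value is jittered slightly, once for two nearly collinear x columns and once for two unrelated ones. All names and noise levels are assumptions.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients [b0, b1, ...] with a constant term."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

def slope_spread(X, y, noise, trials=200, seed=0):
    """Std. dev. of each slope when every x value is jittered by up to +/- noise."""
    rng = np.random.default_rng(seed)
    fits = [fit(X + rng.uniform(-noise, noise, X.shape), y)[1:]
            for _ in range(trials)]
    return np.std(fits, axis=0)

x1 = np.linspace(0.0, 10.0, 20)
wiggle = np.cos(np.arange(20))

# Hypothetical data: x2 is almost a copy of x1 in the first case only.
collinear = np.column_stack([x1, x1 + 0.01 * wiggle])
unrelated = np.column_stack([x1, 5.0 + 3.0 * wiggle])

y_collinear = 1.0 + 2.0 * collinear[:, 0] + 3.0 * collinear[:, 1]
y_unrelated = 1.0 + 2.0 * unrelated[:, 0] + 3.0 * unrelated[:, 1]

spread_collinear = slope_spread(collinear, y_collinear, noise=0.005)
spread_unrelated = slope_spread(unrelated, y_unrelated, noise=0.005)
print(spread_collinear, spread_unrelated)
```

With the same tiny jitter, the slopes for the nearly collinear columns should vary far more than those for the unrelated columns, which is the practical reason to check how stable a regression is before relying on its coefficients.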

For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:

1. From the menu bar, choose: Statistics : Regression : Multiple (Full Model)
2. Keep If: (leave blank)
3. Calculate constant: (checked)
4. Print Residuals: (checked)
5. Save residuals: (unchecked)
6. Validation Method: Bootstrap
7. Validate N Times: 100
8. OK

```
REGRESSION: MULTIPLE (FULL MODEL)
2002-09-26 16:06:26
Using: C:\cohort6\LONGLEY.DT
X Columns:
1) GNP def         3) Unemployment    5) 14 yrs
2) GNP             4) Armed Forces    6) Time
Y Column: 7) Employment
Keep If:
Calculate Constant: true

Total number of data points = 16
Number of data points used = 16
Regression equation:
col(7)[Employment] = -3482258.6346
+15.0618722714*col(1)[GNP def]
-0.0358191793*col(2)[GNP]
-2.0202298038*col(3)[Unemployment]
-1.0332268672*col(4)[Armed Forces]
-0.0511041057*col(5)[14 yrs]
+1829.15146461*col(6)[Time]

R^2     = 0.99547900458    AIC = 187.828836554    MSEP    = 172802.886467
adj R^2 = 0.99246500763    BIC =   199.5078489    PRESS   = 2886892.54145
PRE R^2 = 0.98482749195    MAE = 179.371521174    LOO MAE = 333.425946693

For each term in the ANOVA table below, if P<=0.05, that term was a
significant source of Y's variation.

Source                              SS       df        MS         F     P
------------------------ ------------- -------- --------- --------- ---------
Regression               184172401.944        6  30695400 330.28534 .0000 ***
col(1)[GNP def]          174397449.779        1 1.74397e8 1876.5326 .0000 ***
col(2)[GNP]              4787181.04445        1   4787181  51.51051 .0001 ***
col(3)[Unemployment]     2263971.10982        1 2263971.1 24.360538 .0008 ***
col(4)[Armed Forces]     876397.161861        1 876397.16 9.4301143 .0133 *
col(5)[14 yrs]            348589.39965        1  348589.4 3.7508541 .0848 ns
col(6)[Time]             1498813.44959        1 1498813.4 16.127371 .0030 **
Error                    836424.055506        9 92936.006
------------------------ ------------- -------- --------- --------- ---------
Total                        185008826       15

Table of Statistics for the Regression Coefficients:

Column                       Coef.  Std Error  t(Coef=0)      P      +/-95% CL
------------------------ ---------  ---------  ---------  ---------  ---------
Intercept                 -3482259  890420.38  -3.910803  .0036 **   2014270.8
col(1)[GNP def]          15.061872  84.914926   0.177376  .8631 ns   192.09091
col(2)[GNP]              -0.035819   0.033491  -1.069516  .3127 ns   0.0757619
col(3)[Unemployment]      -2.02023  0.4883997  -4.136427  .0025 **   1.1048368
col(4)[Armed Forces]     -1.033227  0.2142742  -4.821985  .0009 ***  0.4847218
col(5)[14 yrs]           -0.051104  0.2260732  -0.226051  .8262 ns   0.5114131
col(6)[Time]             1829.1515   455.4785  4.0158898  .0030 **   1030.3639

Degrees of freedom for two-tailed t tests = 9
If P<=0.05, the coefficient is significantly different from 0.

Residuals:

Row     Y observed     Y expected       Residual
---------  -------------  -------------  -------------
1          60323  60055.6599702  267.340029759
2          61122  61216.0139424  -94.013942399
3          60171  60124.7128322  46.2871677573
4          61187  61597.1146219  -410.11462193
5          63221  62911.2854092   309.71459076
6          63639  63888.3112153  -249.31121533
7          64989  65153.0489564   -164.0489564
8          63761  63774.1803569  -13.180356867
9          66019  66004.6952274  14.3047726001
10          67857  67401.6059054  455.394094552
11          68169  68186.2689271  -17.268927115
12          66513  66552.0550425  -39.055042523
13          68655  68810.5499736  -155.54997359
14          69564   69649.671308  -85.671308042
15          69331   68989.068486   341.93151396
16          70551  70757.7578252  -206.75782519

Validation Method: Bootstrap
Validate N Times:  100
Leave-Group-Out PRESS   = 7966592.18225
Leave-Group-Out PRE R^2 = 0.9801802
Leave-Group-Out MAE     = 564.585568568

(The validation method randomly assigns rows of data to validation groups,
so the Leave-Group-Out statistics printed above will vary.
You can reduce the variability by increasing 'Validate N Times'.)

Group  Leave-Group-Out Validation Equations
-----  ----------------------------------------------------------------------
1  10958996.4318 +246.007245432*col(1)[GNP def] +0.1405383348*col(2)[GNP] +0.8217102887*col(3)[Unemployment] -1.2388561368*col(4)[Armed Forces] +2.63894001237*col(5)[14 yrs] -5770.9069564*col(6)[Time]
2  -3913158.8845 +138.451642639*col(1)[GNP def] -0.0747379129*col(2)[GNP] -2.7571974387*col(3)[Unemployment] -1.060357447*col(4)[Armed Forces] +0.21777855892*col(5)[14 yrs] +2036.04532194*col(6)[Time]
3  -2393179.0155 +15.0310564322*col(1)[GNP def] -0.0261600929*col(2)[GNP] -1.898797727*col(3)[Unemployment] -0.800027798*col(4)[Armed Forces] +0.2010804385*col(5)[14 yrs] +1254.37302405*col(6)[Time]
...
100  -4161802.887 +4.28677342928*col(1)[GNP def] -0.0522384443*col(2)[GNP] -2.6041771478*col(3)[Unemployment] -0.9534097692*col(4)[Armed Forces] -0.0074760369*col(5)[14 yrs] +2178.89210068*col(6)[Time]

(The validation method randomly assigns rows of data to validation groups,
so the LGO Validation Equations will vary. Since these equations are
generated during the individual validation runs, increasing
'Validate N Times' will not decrease the variability of the coefficients.)
```
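Several of the summary statistics in the output can be checked by hand from the ANOVA table. The Python sketch below recomputes R^2, adjusted R^2 and the overall F from the printed sums of squares using the standard textbook formulas. The AIC line uses the form n*ln(SSE/n) + 2k, which appears consistent with the printed value, but that is an inference from the numbers and not a documented CoStat definition. The other statistics (BIC, MSEP, PRESS, PRE R^2) are not reproduced here because their exact definitions are not given.

```python
import math

n = 16   # data points used
p = 6    # x columns in the model (the constant is counted separately)

# Sums of squares copied from the ANOVA table above.
ss_regression = 184172401.944
ss_error = 836424.055506
ss_total = ss_regression + ss_error

df_error = n - p - 1                                   # 9, as in the ANOVA table
r2 = ss_regression / ss_total                          # share of variation explained
adj_r2 = 1 - (1 - r2) * (n - 1) / df_error             # penalized for model size
f_stat = (ss_regression / p) / (ss_error / df_error)   # MS regression / MS error

# Assumed AIC form (k = p + 1 estimated coefficients); see the note above.
aic = n * math.log(ss_error / n) + 2 * (p + 1)

print(f"R^2     = {r2:.11f}")
print(f"adj R^2 = {adj_r2:.11f}")
print(f"F       = {f_stat:.5f}")
print(f"AIC     = {aic:.6f}")
```

Note that adjusted R^2 is lower than R^2 because it charges for the six x columns; with only nine error degrees of freedom, that penalty is noticeable.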
