Go to: CoHort Software | CoStat | CoStat Statistics
Multiple Regression in CoStat

Multiple regression is the simultaneous linear regression of one y column of data (the dependent variable) on several x columns of data (the independent variables). The general form of the resulting equation is:

y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn

where the b values are the coefficients for which the regression finds optimal (least squares) values.

CoStat can do a multiple regression of a full model (where all of the x columns are in the model) or of one subset (where you specify a subset of the x columns). Note that some of the x columns may have been created from other x columns with CoStat's 'Transformations' procedure. For example, you could make a column with x1^2, or a column with x1*x2. In this way, you can make a model of a "response surface" or other more complex models.

Often, an experimenter has a large number of x columns and wishes to know whether a smaller, simpler model with a subset of these x columns adequately explains the dependent variable. For this type of multiple regression problem, see Subset Selection in Multiple Regression.

Sample Run

In the sample run, we will estimate the relationship between the employment level and several economic variables (unemployment rate, GNP, etc.). The data are from an article testing computational accuracy (Longley, 1967).
PRINT DATA
2000-08-05 11:06:07
Using: c:\cohort6\longley.dt
First Column: 1) GNP def
Last Column: 7) Employment
First Row: 1
Last Row: 16
  GNP def       GNP Unemployment Armed Forces    14 yrs      Time Employment
--------- --------- ------------ ------------ --------- --------- ----------
       83    234289         2356         1590    107608      1947      60323
     88.5    259426         2325         1456    108632      1948      61122
     88.2    258054         3682         1616    109773      1949      60171
     89.5    284599         3351         1650    110929      1950      61187
     96.2    328975         2099         3099    112075      1951      63221
     98.1    346999         1932         3594    113270      1952      63639
       99    365385         1870         3547    115094      1953      64989
      100    363112         3578         3350    116219      1954      63761
    101.2    397469         2904         3048    117388      1955      66019
    104.6    419180         2822         2857    118734      1956      67857
    108.4    442769         2936         2798    120445      1957      68169
    110.8    444546         4681         2637    121950      1958      66513
    112.6    482704         3813         2552    123366      1959      68655
    114.2    502601         3931         2514    125368      1960      69564
    115.7    518173         4806         2572    127852      1961      69331
    116.9    554894         4007         2827    130081      1962      70551
Longley ran this seemingly routine regression on several mainframe computers and found widely varying answers, largely because the x values are large relative to their standard error and because of mild collinearity among the x values. CoHort's Regression compares quite well: the estimated coefficients are accurate to 10 significant figures.

A fascinating follow-up article by Beaton et al. (1976) points out that a greater source of inaccuracy may be the data themselves: slight variations in the original data cause large variations in the results. This is an important consideration, and further investigation of the matter is encouraged before accepting the results of any regression.

For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:
REGRESSION: MULTIPLE (FULL MODEL)
2002-09-26 16:06:26
Using: C:\cohort6\LONGLEY.DT
X Columns:
  1) GNP def         3) Unemployment    5) 14 yrs
  2) GNP             4) Armed Forces    6) Time
Y Column: 7) Employment
Keep If:
Calculate Constant: true
Total number of data points = 16
Number of data points used = 16
Regression equation:
col(7)[Employment] = -3482258.6346
+15.0618722714*col(1)[GNP def]
-0.0358191793*col(2)[GNP]
-2.0202298038*col(3)[Unemployment]
-1.0332268672*col(4)[Armed Forces]
-0.0511041057*col(5)[14 yrs]
+1829.15146461*col(6)[Time]
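The equation above can be cross-checked outside CoStat. The following is a minimal Python sketch, not CoStat's algorithm: it fits the same full model to the printed data by least squares, using only the standard library. The function name fit_ols and the approach (center and scale each x column, then solve the normal equations by Gaussian elimination) are choices made for this illustration. Centering and scaling is used because the raw x values are large relative to their spread, which is what makes this data set numerically awkward.

```python
# Longley data as printed above: six x columns, then Employment (y).
DATA = [
    (83.0, 234289, 2356, 1590, 107608, 1947, 60323),
    (88.5, 259426, 2325, 1456, 108632, 1948, 61122),
    (88.2, 258054, 3682, 1616, 109773, 1949, 60171),
    (89.5, 284599, 3351, 1650, 110929, 1950, 61187),
    (96.2, 328975, 2099, 3099, 112075, 1951, 63221),
    (98.1, 346999, 1932, 3594, 113270, 1952, 63639),
    (99.0, 365385, 1870, 3547, 115094, 1953, 64989),
    (100.0, 363112, 3578, 3350, 116219, 1954, 63761),
    (101.2, 397469, 2904, 3048, 117388, 1955, 66019),
    (104.6, 419180, 2822, 2857, 118734, 1956, 67857),
    (108.4, 442769, 2936, 2798, 120445, 1957, 68169),
    (110.8, 444546, 4681, 2637, 121950, 1958, 66513),
    (112.6, 482704, 3813, 2552, 123366, 1959, 68655),
    (114.2, 502601, 3931, 2514, 125368, 1960, 69564),
    (115.7, 518173, 4806, 2572, 127852, 1961, 69331),
    (116.9, 554894, 4007, 2827, 130081, 1962, 70551),
]


def fit_ols(xs, y):
    """Least-squares fit with a constant term; returns [b0, b1, ..., bk]."""
    n, k = len(xs), len(xs[0])
    # Center and scale every x column to keep the normal equations well behaved.
    means = [sum(r[j] for r in xs) / n for j in range(k)]
    scales = [sum((r[j] - means[j]) ** 2 for r in xs) ** 0.5 for j in range(k)]
    z = [[(r[j] - means[j]) / scales[j] for j in range(k)] for r in xs]
    ybar = sum(y) / n
    # Augmented normal equations: (Z'Z) g = Z'(y - ybar).
    a = [[sum(zi[p] * zi[q] for zi in z) for q in range(k)]
         + [sum(zi[p] * (yi - ybar) for zi, yi in zip(z, y))]
         for p in range(k)]
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(c + 1, k):
            f = a[r][c] / a[c][c]
            for q in range(c, k + 1):
                a[r][q] -= f * a[c][q]
    # Back substitution for the coefficients of the scaled columns.
    g = [0.0] * k
    for c in reversed(range(k)):
        g[c] = (a[c][k] - sum(a[c][q] * g[q] for q in range(c + 1, k))) / a[c][c]
    # Undo the scaling to get coefficients for the original columns.
    slopes = [g[j] / scales[j] for j in range(k)]
    intercept = ybar - sum(s * m for s, m in zip(slopes, means))
    return [intercept] + slopes


coef = fit_ols([row[:6] for row in DATA], [row[6] for row in DATA])
names = ["Intercept", "GNP def", "GNP", "Unemployment",
         "Armed Forces", "14 yrs", "Time"]
for name, b in zip(names, coef):
    print(f"{name:<13}{b: .10g}")
```

The printed values can be compared term by term with the coefficients in the regression equation above. Solving the raw (uncentered) normal equations instead would lose many digits on this data set.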
    R^2 = 0.99547900458     AIC = 187.828836554        MSEP = 172802.886467
adj R^2 = 0.99246500763     BIC = 199.5078489         PRESS = 2886892.54145
PRE R^2 = 0.98482749195     MAE = 179.371521174     LOO MAE = 333.425946693
For each term in the ANOVA table below, if P<=0.05, that term was a
significant source of Y's variation.
Source                              SS       df        MS         F P
------------------------ ------------- -------- --------- --------- ---------
Regression               184172401.944        6  30695400 330.28534 .0000 ***
col(1)[GNP def]          174397449.779        1 1.74397e8 1876.5326 .0000 ***
col(2)[GNP]              4787181.04445        1   4787181  51.51051 .0001 ***
col(3)[Unemployment]     2263971.10982        1 2263971.1 24.360538 .0008 ***
col(4)[Armed Forces]     876397.161861        1 876397.16 9.4301143 .0133 *
col(5)[14 yrs]            348589.39965        1  348589.4 3.7508541 .0848 ns
col(6)[Time]             1498813.44959        1 1498813.4 16.127371 .0030 **
Error                    836424.055506        9 92936.006
------------------------ ------------- -------- --------- --------- ---------
Total                        185008826       15
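Several of the printed statistics follow directly from the sums of squares in this table. The sketch below recomputes them in Python from the table values. The R^2, adjusted R^2 and F formulas are the standard ones. The AIC line uses one common definition, n*ln(SSerror/n) + 2*(number of coefficients), which is an assumption here: it happens to reproduce the printed AIC, but this page does not state which formula CoStat uses. The six term sums of squares add up to the regression sum of squares, as sequential sums of squares do.

```python
import math

n, k = 16, 6  # data points used; number of x columns
ss_error = 836424.055506
# Sums of squares for the six terms (1 df each), in table order.
ss_terms = {
    "GNP def": 174397449.779,
    "GNP": 4787181.04445,
    "Unemployment": 2263971.10982,
    "Armed Forces": 876397.161861,
    "14 yrs": 348589.39965,
    "Time": 1498813.44959,
}

ss_regression = sum(ss_terms.values())  # the "Regression" row
ss_total = ss_regression + ss_error     # the "Total" row
df_error = n - k - 1                    # 9, because a constant is also fitted
ms_error = ss_error / df_error          # the "Error" mean square

# Each F is a mean square divided by the error mean square.
f_regression = (ss_regression / k) / ms_error
f_terms = {name: ss / ms_error for name, ss in ss_terms.items()}

r2 = ss_regression / ss_total
adj_r2 = 1 - (1 - r2) * (n - 1) / df_error
aic = n * math.log(ss_error / n) + 2 * (k + 1)  # assumed definition, see above

print(f"F(regression) = {f_regression:.5f}")
for name, f in f_terms.items():
    print(f"F({name}) = {f:.6g}")
print(f"R^2 = {r2:.11f}  adj R^2 = {adj_r2:.11f}  AIC = {aic:.9f}")
```

The P values in the table additionally need the F distribution (with 1 or 6 numerator and 9 denominator degrees of freedom), which the Python standard library does not provide, so they are left out of this sketch.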
Table of Statistics for the Regression Coefficients:
Column                       Coef. Std Error t(Coef=0) P         +/-95% CL
------------------------ --------- --------- --------- --------- ---------
Intercept                 -3482259 890420.38 -3.910803 .0036 **  2014270.8
col(1)[GNP def]          15.061872 84.914926  0.177376 .8631 ns  192.09091
col(2)[GNP]              -0.035819  0.033491 -1.069516 .3127 ns  0.0757619
col(3)[Unemployment]      -2.02023 0.4883997 -4.136427 .0025 **  1.1048368
col(4)[Armed Forces]     -1.033227 0.2142742 -4.821985 .0009 *** 0.4847218
col(5)[14 yrs]           -0.051104 0.2260732 -0.226051 .8262 ns  0.5114131
col(6)[Time]             1829.1515  455.4785 4.0158898 .0030 **  1030.3639
Degrees of freedom for two-tailed t tests = 9
If P<=0.05, the coefficient is significantly different from 0.
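The t and +/-95% CL columns can be recomputed from the first two columns: t is the coefficient divided by its standard error, and the confidence limit is the standard error times the two-tailed 5% critical value of t with 9 degrees of freedom. The sketch below is a Python illustration using the rounded values displayed in the table, so the last digit or two may differ from CoStat's output. The critical value is hard-coded from a standard t table because the Python standard library has no t distribution.

```python
# Two-tailed 5% critical value of Student's t with 9 df (standard table value).
T_CRIT_9DF = 2.262157163

# (term, coefficient, standard error) as displayed in the table above.
TABLE = [
    ("Intercept", -3482259.0, 890420.38),
    ("col(1)[GNP def]", 15.061872, 84.914926),
    ("col(2)[GNP]", -0.035819, 0.033491),
    ("col(3)[Unemployment]", -2.02023, 0.4883997),
    ("col(4)[Armed Forces]", -1.033227, 0.2142742),
    ("col(5)[14 yrs]", -0.051104, 0.2260732),
    ("col(6)[Time]", 1829.1515, 455.4785),
]

results = {}
for term, coef, se in TABLE:
    t_stat = coef / se              # t(Coef=0)
    half_width = T_CRIT_9DF * se    # +/-95% CL
    # |t| above the critical value is equivalent to P <= 0.05.
    significant = abs(t_stat) > T_CRIT_9DF
    results[term] = (t_stat, half_width, significant)
    flag = "significant" if significant else "ns"
    print(f"{term:<22}{t_stat:>10.4f}{half_width:>14.7g}  {flag}")
```

A coefficient whose confidence interval (coefficient plus or minus the CL) includes 0 is the same coefficient that is marked ns in the table.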
Residuals:
      Row    Y observed    Y expected      Residual
--------- ------------- ------------- -------------
        1         60323 60055.6599702 267.340029759
        2         61122 61216.0139424 -94.013942399
        3         60171 60124.7128322 46.2871677573
        4         61187 61597.1146219 -410.11462193
        5         63221 62911.2854092  309.71459076
        6         63639 63888.3112153 -249.31121533
        7         64989 65153.0489564  -164.0489564
        8         63761 63774.1803569 -13.180356867
        9         66019 66004.6952274 14.3047726001
       10         67857 67401.6059054 455.394094552
       11         68169 68186.2689271 -17.268927115
       12         66513 66552.0550425 -39.055042523
       13         68655 68810.5499736 -155.54997359
       14         69564  69649.671308 -85.671308042
       15         69331  68989.068486  341.93151396
       16         70551 70757.7578252 -206.75782519
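The residual table ties back to two numbers printed earlier in the output: the sum of the squared residuals is the Error sum of squares in the ANOVA table, and the mean of the absolute residuals is the MAE. A short Python sketch of that arithmetic, using the Y observed and Y expected columns as printed:

```python
observed = [60323, 61122, 60171, 61187, 63221, 63639, 64989, 63761,
            66019, 67857, 68169, 66513, 68655, 69564, 69331, 70551]
expected = [60055.6599702, 61216.0139424, 60124.7128322, 61597.1146219,
            62911.2854092, 63888.3112153, 65153.0489564, 63774.1803569,
            66004.6952274, 67401.6059054, 68186.2689271, 66552.0550425,
            68810.5499736, 69649.671308, 68989.068486, 70757.7578252]

# Residual = Y observed - Y expected, row by row.
residuals = [o - e for o, e in zip(observed, expected)]

sse = sum(r * r for r in residuals)                   # Error SS
mae = sum(abs(r) for r in residuals) / len(residuals)  # mean absolute error
residual_sum = sum(residuals)  # near 0 because a constant was fitted

print(f"Error SS = {sse:.3f}")
print(f"MAE      = {mae:.6f}")
print(f"Sum of residuals = {residual_sum:.6f}")
```

Because the expected values are printed to a limited number of digits, the recomputed Error SS agrees with the ANOVA table only to about that precision.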
Validation Method: Bootstrap
Validate N Times: 100
Leave-Group-Out PRESS = 7966592.18225
Leave-Group-Out PRE R^2 = 0.9801802
Leave-Group-Out MAE = 564.585568568
(The validation method randomly assigns rows of data to validation groups,
so the Leave-Group-Out statistics printed above will vary.
You can reduce the variability by increasing 'Validate N Times'.)
Group Leave-Group-Out Validation Equations
----- ----------------------------------------------------------------------
1 10958996.4318 +246.007245432*col(1)[GNP def] +0.1405383348*col(2)[GNP] +0.8217102887*col(3)[Unemployment] -1.2388561368*col(4)[Armed Forces] +2.63894001237*col(5)[14 yrs] -5770.9069564*col(6)[Time]
2 -3913158.8845 +138.451642639*col(1)[GNP def] -0.0747379129*col(2)[GNP] -2.7571974387*col(3)[Unemployment] -1.060357447*col(4)[Armed Forces] +0.21777855892*col(5)[14 yrs] +2036.04532194*col(6)[Time]
3 -2393179.0155 +15.0310564322*col(1)[GNP def] -0.0261600929*col(2)[GNP] -1.898797727*col(3)[Unemployment] -0.800027798*col(4)[Armed Forces] +0.2010804385*col(5)[14 yrs] +1254.37302405*col(6)[Time]
...
100 -4161802.887 +4.28677342928*col(1)[GNP def] -0.0522384443*col(2)[GNP] -2.6041771478*col(3)[Unemployment] -0.9534097692*col(4)[Armed Forces] -0.0074760369*col(5)[14 yrs] +2178.89210068*col(6)[Time]
(The validation method randomly assigns rows of data to validation groups,
so the LGO Validation Equations will vary. Since these equations are
generated during the individual validation runs, increasing
'Validate N Times' will not decrease the variability of the coefficients.)
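The idea behind these validation statistics is to refit the model with some rows held out and then predict the held-out rows. This page does not describe how CoStat forms its bootstrap groups, so the Python sketch below uses the simplest deterministic variant instead: plain leave-one-out, where each row is held out once. The helper fit_ols and its method (centered and scaled columns, normal equations, Gaussian elimination) are choices made for this illustration, not CoStat's implementation.

```python
# Longley data as printed earlier: six x columns, then Employment (y).
DATA = [
    (83.0, 234289, 2356, 1590, 107608, 1947, 60323),
    (88.5, 259426, 2325, 1456, 108632, 1948, 61122),
    (88.2, 258054, 3682, 1616, 109773, 1949, 60171),
    (89.5, 284599, 3351, 1650, 110929, 1950, 61187),
    (96.2, 328975, 2099, 3099, 112075, 1951, 63221),
    (98.1, 346999, 1932, 3594, 113270, 1952, 63639),
    (99.0, 365385, 1870, 3547, 115094, 1953, 64989),
    (100.0, 363112, 3578, 3350, 116219, 1954, 63761),
    (101.2, 397469, 2904, 3048, 117388, 1955, 66019),
    (104.6, 419180, 2822, 2857, 118734, 1956, 67857),
    (108.4, 442769, 2936, 2798, 120445, 1957, 68169),
    (110.8, 444546, 4681, 2637, 121950, 1958, 66513),
    (112.6, 482704, 3813, 2552, 123366, 1959, 68655),
    (114.2, 502601, 3931, 2514, 125368, 1960, 69564),
    (115.7, 518173, 4806, 2572, 127852, 1961, 69331),
    (116.9, 554894, 4007, 2827, 130081, 1962, 70551),
]


def fit_ols(xs, y):
    """Least-squares fit with a constant term; returns [b0, b1, ..., bk]."""
    n, k = len(xs), len(xs[0])
    means = [sum(r[j] for r in xs) / n for j in range(k)]
    scales = [sum((r[j] - means[j]) ** 2 for r in xs) ** 0.5 for j in range(k)]
    z = [[(r[j] - means[j]) / scales[j] for j in range(k)] for r in xs]
    ybar = sum(y) / n
    # Augmented normal equations on the centered, scaled columns.
    a = [[sum(zi[p] * zi[q] for zi in z) for q in range(k)]
         + [sum(zi[p] * (yi - ybar) for zi, yi in zip(z, y))]
         for p in range(k)]
    for c in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(c + 1, k):
            f = a[r][c] / a[c][c]
            for q in range(c, k + 1):
                a[r][q] -= f * a[c][q]
    g = [0.0] * k
    for c in reversed(range(k)):  # back substitution
        g[c] = (a[c][k] - sum(a[c][q] * g[q] for q in range(c + 1, k))) / a[c][c]
    slopes = [g[j] / scales[j] for j in range(k)]
    return [ybar - sum(s * m for s, m in zip(slopes, means))] + slopes


press = 0.0
abs_errors = []
for i, held_out in enumerate(DATA):
    train = DATA[:i] + DATA[i + 1:]  # every row except the held-out one
    b = fit_ols([r[:6] for r in train], [r[6] for r in train])
    predicted = b[0] + sum(bj * xj for bj, xj in zip(b[1:], held_out[:6]))
    error = held_out[6] - predicted
    press += error ** 2              # prediction sum of squares
    abs_errors.append(abs(error))
loo_mae = sum(abs_errors) / len(abs_errors)

print(f"leave-one-out PRESS = {press:.2f}")
print(f"leave-one-out MAE   = {loo_mae:.4f}")
```

If the PRESS and LOO MAE values printed near the top of the output are the usual leave-one-out quantities, the two numbers from this sketch should land close to them; that correspondence is an assumption. Both are necessarily larger than the in-sample Error SS and MAE, since each row is predicted by a model that never saw it.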