Go to:
CoHort Software |
CoStat |
CoStat Statistics
Multiple Regression in CoStat

Multiple regression is the simultaneous linear regression of one y column of data (the dependent variable) on several x columns of data (the independent variables). The general form of the resulting equation is:

  y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn

where the b values are the coefficients for which the regression finds optimal (least squares) values.

CoStat can do a multiple regression of a full model (where all of the x columns are in the model) or of one subset (where you specify a subset of the x columns). Note that some of the x columns may have been created from other x columns with CoStat's 'Transformations' procedure. For example, you could make a column with x1^2, or a column with x1*x2. In this way, you can make a model of a "response surface" or other more complex models.

Often, an experimenter has a large number of x columns and wishes to know whether a smaller, simpler model with a subset of these x columns adequately explains the dependent variable. For this type of multiple regression problem, see Subset Selection in Multiple Regression.

Sample Run

In the sample run, we will estimate the relationship of the employment level with several economic variables (unemployment rate, GNP, etc.). The data is from an article testing computational accuracy (Longley, 1967).
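As an aside, the general equation and the idea of transformed columns (x1^2, x1*x2) can be illustrated outside CoStat. The following is a minimal Python sketch, assuming NumPy is available. The data are synthetic, and the names x1, x2, X and b are hypothetical; nothing here is CoStat's own implementation.

```python
import numpy as np

# Synthetic data, for illustration only (not from CoStat or Longley).
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
noise = rng.normal(0, 0.1, 50)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + 0.5 * x1**2 + 0.25 * x1 * x2 + noise

# Design matrix: a constant column, the original x columns, and two
# transformed columns (x1^2 and x1*x2), as in a response-surface model.
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2])

# Least squares estimates of b0..b4 in y = b0 + b1*x1 + ... + b4*(x1*x2).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
```

The model stays linear in the coefficients even though it is curved in x1 and x2, which is why ordinary multiple regression can fit it once the transformed columns exist.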
PRINT DATA
2000-08-05 11:06:07
Using: c:\cohort6\longley.dt
  First Column: 1) GNP def
  Last Column:  7) Employment
  First Row:    1
  Last Row:     16

  GNP def       GNP Unemployment Armed Forces    14 yrs      Time Employment
--------- --------- ------------ ------------ --------- --------- ----------
       83    234289         2356         1590    107608      1947      60323
     88.5    259426         2325         1456    108632      1948      61122
     88.2    258054         3682         1616    109773      1949      60171
     89.5    284599         3351         1650    110929      1950      61187
     96.2    328975         2099         3099    112075      1951      63221
     98.1    346999         1932         3594    113270      1952      63639
       99    365385         1870         3547    115094      1953      64989
      100    363112         3578         3350    116219      1954      63761
    101.2    397469         2904         3048    117388      1955      66019
    104.6    419180         2822         2857    118734      1956      67857
    108.4    442769         2936         2798    120445      1957      68169
    110.8    444546         4681         2637    121950      1958      66513
    112.6    482704         3813         2552    123366      1959      68655
    114.2    502601         3931         2514    125368      1960      69564
    115.7    518173         4806         2572    127852      1961      69331
    116.9    554894         4007         2827    130081      1962      70551

Longley ran this seemingly routine regression on several mainframe computers and found incredibly varied answers, largely because the x values are large relative to their standard error and because of mild collinearity among the x values. CoHort's Regression compares quite well - the estimated coefficients are accurate to 10 significant figures.

There is a fascinating follow-up article by Beaton et al. (1976), which points out that a greater source of inaccuracy may be the data itself. Slight variations in the original data cause large variations in the results. This is an important consideration and further investigation of the matter is encouraged before accepting the results of any regression.

For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:
REGRESSION: MULTIPLE (FULL MODEL)
2002-09-26 16:06:26
Using: C:\cohort6\LONGLEY.DT
X Columns:
  1) GNP def        3) Unemployment    5) 14 yrs
  2) GNP            4) Armed Forces    6) Time
Y Column: 7) Employment
Keep If:
Calculate Constant: true

Total number of data points = 16
Number of data points used = 16

Regression equation:
col(7)[Employment] = -3482258.6346
  +15.0618722714*col(1)[GNP def]
  -0.0358191793*col(2)[GNP]
  -2.0202298038*col(3)[Unemployment]
  -1.0332268672*col(4)[Armed Forces]
  -0.0511041057*col(5)[14 yrs]
  +1829.15146461*col(6)[Time]

R^2     = 0.99547900458    AIC = 187.828836554    MSEP    = 172802.886467
adj R^2 = 0.99246500763    BIC = 199.5078489      PRESS   = 2886892.54145
PRE R^2 = 0.98482749195    MAE = 179.371521174    LOO MAE = 333.425946693

For each term in the ANOVA table below, if P<=0.05, that term was a significant source of Y's variation.

Source                              SS       df        MS         F         P
------------------------ ------------- -------- --------- --------- ---------
Regression               184172401.944        6  30695400 330.28534     .0000 ***
col(1)[GNP def]          174397449.779        1 1.74397e8 1876.5326     .0000 ***
col(2)[GNP]              4787181.04445        1   4787181  51.51051     .0001 ***
col(3)[Unemployment]     2263971.10982        1 2263971.1 24.360538     .0008 ***
col(4)[Armed Forces]     876397.161861        1 876397.16 9.4301143     .0133 *
col(5)[14 yrs]            348589.39965        1  348589.4 3.7508541     .0848 ns
col(6)[Time]             1498813.44959        1 1498813.4 16.127371     .0030 **
Error                    836424.055506        9 92936.006
------------------------ ------------- -------- --------- --------- ---------
Total                        185008826       15

Table of Statistics for the Regression Coefficients:

Column                       Coef. Std Error t(Coef=0)         P      +/-95% CL
------------------------ --------- --------- --------- --------- ---------
Intercept                 -3482259 890420.38 -3.910803     .0036 **  2014270.8
col(1)[GNP def]          15.061872 84.914926  0.177376     .8631 ns  192.09091
col(2)[GNP]              -0.035819  0.033491 -1.069516     .3127 ns  0.0757619
col(3)[Unemployment]      -2.02023 0.4883997 -4.136427     .0025 **  1.1048368
col(4)[Armed Forces]     -1.033227 0.2142742 -4.821985     .0009 *** 0.4847218
col(5)[14 yrs]           -0.051104 0.2260732 -0.226051     .8262 ns  0.5114131
col(6)[Time]             1829.1515  455.4785 4.0158898     .0030 **  1030.3639

Degrees of freedom for two-tailed t tests = 9
If P<=0.05, the coefficient is significantly different from 0.

Residuals:

      Row    Y observed    Y expected      Residual
--------- ------------- ------------- -------------
        1         60323 60055.6599702 267.340029759
        2         61122 61216.0139424 -94.013942399
        3         60171 60124.7128322 46.2871677573
        4         61187 61597.1146219 -410.11462193
        5         63221 62911.2854092  309.71459076
        6         63639 63888.3112153 -249.31121533
        7         64989 65153.0489564  -164.0489564
        8         63761 63774.1803569 -13.180356867
        9         66019 66004.6952274 14.3047726001
       10         67857 67401.6059054 455.394094552
       11         68169 68186.2689271 -17.268927115
       12         66513 66552.0550425 -39.055042523
       13         68655 68810.5499736 -155.54997359
       14         69564  69649.671308 -85.671308042
       15         69331  68989.068486  341.93151396
       16         70551 70757.7578252 -206.75782519

Validation Method: Bootstrap
Validate N Times: 100
Leave-Group-Out PRESS = 7966592.18225
Leave-Group-Out PRE R^2 = 0.9801802
Leave-Group-Out MAE = 564.585568568
(The validation method randomly assigns rows of data to validation groups, so the Leave-Group-Out statistics printed above will vary. You can reduce the variability by increasing 'Validate N Times'.)
Group Leave-Group-Out Validation Equations
----- ----------------------------------------------------------------------
    1 10958996.4318
        +246.007245432*col(1)[GNP def]
        +0.1405383348*col(2)[GNP]
        +0.8217102887*col(3)[Unemployment]
        -1.2388561368*col(4)[Armed Forces]
        +2.63894001237*col(5)[14 yrs]
        -5770.9069564*col(6)[Time]
    2 -3913158.8845
        +138.451642639*col(1)[GNP def]
        -0.0747379129*col(2)[GNP]
        -2.7571974387*col(3)[Unemployment]
        -1.060357447*col(4)[Armed Forces]
        +0.21777855892*col(5)[14 yrs]
        +2036.04532194*col(6)[Time]
    3 -2393179.0155
        +15.0310564322*col(1)[GNP def]
        -0.0261600929*col(2)[GNP]
        -1.898797727*col(3)[Unemployment]
        -0.800027798*col(4)[Armed Forces]
        +0.2010804385*col(5)[14 yrs]
        +1254.37302405*col(6)[Time]
  ...
  100 -4161802.887
        +4.28677342928*col(1)[GNP def]
        -0.0522384443*col(2)[GNP]
        -2.6041771478*col(3)[Unemployment]
        -0.9534097692*col(4)[Armed Forces]
        -0.0074760369*col(5)[14 yrs]
        +2178.89210068*col(6)[Time]
(The validation method randomly assigns rows of data to validation groups, so the LGO Validation Equations will vary. Since these equations are generated during the individual validation runs, increasing 'Validate N Times' will not decrease the variability of the coefficients.)
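The full-model fit in the sample run can be cross-checked outside CoStat. The following is a minimal Python sketch, assuming NumPy is available. It refits the Longley data printed earlier by least squares and recomputes a few of the summary statistics. The variable names are hypothetical, and the formulas used for adj R^2, MAE and AIC are common textbook forms that appear to reproduce the printed values; CoStat's own definitions are not stated here.

```python
import numpy as np

# Longley data as printed in the sample run. Columns:
# GNP def, GNP, Unemployment, Armed Forces, 14 yrs, Time, Employment
data = np.array([
    [83.0, 234289, 2356, 1590, 107608, 1947, 60323],
    [88.5, 259426, 2325, 1456, 108632, 1948, 61122],
    [88.2, 258054, 3682, 1616, 109773, 1949, 60171],
    [89.5, 284599, 3351, 1650, 110929, 1950, 61187],
    [96.2, 328975, 2099, 3099, 112075, 1951, 63221],
    [98.1, 346999, 1932, 3594, 113270, 1952, 63639],
    [99.0, 365385, 1870, 3547, 115094, 1953, 64989],
    [100.0, 363112, 3578, 3350, 116219, 1954, 63761],
    [101.2, 397469, 2904, 3048, 117388, 1955, 66019],
    [104.6, 419180, 2822, 2857, 118734, 1956, 67857],
    [108.4, 442769, 2936, 2798, 120445, 1957, 68169],
    [110.8, 444546, 4681, 2637, 121950, 1958, 66513],
    [112.6, 482704, 3813, 2552, 123366, 1959, 68655],
    [114.2, 502601, 3931, 2514, 125368, 1960, 69564],
    [115.7, 518173, 4806, 2572, 127852, 1961, 69331],
    [116.9, 554894, 4007, 2827, 130081, 1962, 70551],
], dtype=float)

# Design matrix with a leading constant column (the intercept).
X = np.column_stack([np.ones(len(data)), data[:, :6]])
y = data[:, 6]
n, p = X.shape  # 16 rows, 7 coefficients including the constant

# Least squares fit (SVD-based), then residuals and sums of squares.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
sse = float(resid @ resid)
sst = float(((y - y.mean()) ** 2).sum())

r2 = 1.0 - sse / sst
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
mae = float(np.abs(resid).mean())
# Assumed AIC form: n*ln(SSE/n) + 2p.
aic = n * np.log(sse / n) + 2 * p

print("coefficients:", b)
print("R^2 =", r2, " adj R^2 =", adj_r2)
print("MAE =", mae, " AIC =", aic)
```

In double precision the coefficients should agree with the regression equation above to several significant figures. Because the data are ill-conditioned, the exact number of matching digits depends on the solver.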