
Subset Selection in Multiple Regression
in CoStat

CoStat includes several methods for finding the best models which are subsets of a full multiple regression model. (This is also known as "attribute selection", "feature selection", and "variable selection".)

Summary: For subset selection in multiple regression with more than 40 X variables (when All Subsets starts to become too slow), the Simons 2 procedure does a dramatically better job of finding the best subset models than any other approximate subset selection procedure available anywhere. This procedure is only available in the copy of CoStat that comes with CoPlot.

Details

Problems Associated With Selecting Subsets:

Problem #1 - The number of possible subsets can be huge.
The number of possible subsets grows very quickly with the number of X columns. If there are k X columns, there will be 2^k - 1 possible models. Usually, your goal is to find a model with a small subset of X's that provides a good fit to the data. By limiting the maximum number of X's in the models (Max N X's In Model), you can limit the search to fewer possible models, but the number can still be huge (the number of models with exactly m X's is "k choose m", which grows roughly like k^m). Ideally, you would use the All Subsets method to check all of the possible subsets with 1 to Max N X's In Model X's. But when you ask for models with, for example, up to 9 X's selected from a file with 40 or more X's, the computer time for checking all subsets (about 3.7x10^8 models for 40 X's, and far more for larger files) becomes prohibitive (days, weeks, years, 100's of years, ...). Fortunately, there are alternatives to the All Subsets method: various approximate methods (Simons 2, Simons 1, Replace 1, Forward, and other methods which are not in CoStat) for efficiently finding good models. All of the approximate methods are faster than All Subsets, but none is guaranteed to find the best models.
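The growth is easy to verify with a short calculation. This is an illustrative sketch in Python (`n_subset_models` is an invented helper name, not a CoStat function):

```python
from math import comb

def n_subset_models(k, max_nx):
    # Number of distinct subset models using 1..max_nx of the k X columns.
    return sum(comb(k, j) for j in range(1, max_nx + 1))

# With no size limit there are 2^k - 1 non-empty subsets:
assert n_subset_models(20, 20) == 2**20 - 1

# The example from the text: models with up to 9 X's chosen from 40 candidates.
print(n_subset_models(40, 9))  # 373585603
```

Hundreds of millions of candidate models for even a modest 40-column file is why the approximate methods matter.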

Two of the approximate methods, Simons 2 and Simons 1 (which use algorithms developed by Robert Simons of CoHort Software and are only available in CoHort's software products), are dramatically better than Replace 1 (which is essentially the same as SAS's best method, MaxR), Forward Selection (which is also in SAS), other commonly available methods (Backward Elimination and Stepwise), and less common methods (Garrote, Ridge, Forward Stagewise, LARS, Lasso, and even Miller's Replace 2). Simons 2 and Simons 1 are much better at finding the best models, and they do so in a reasonable amount of time.

We did a Monte Carlo test of the subset selection methods in CoStat. In the test, 1000 random data files were generated. Each had 100 rows, 39 X columns, and 1 Y column. Each of the selection methods was asked to find the best subsets with 1, 2, 3, 4, 5, 6, and 7 X's. The results from each method were compared to the results from All Subsets (which by definition finds the best available subsets). Below is a tally of how many times each method failed to find all 7 of the subsets ranked #1, and how often each failed to find all 140 of the subsets ranked #1 through #20. (The times are on a 200 MHz AMD K6 computer.)

 Method                     Time (sec)  N Failures Rank#1  N Failures in Top 20
 -------------------------  ----------  -----------------  --------------------
 Forward                            78  663                 1000
 Replace 1 (like SAS MaxR)         134  322                 1000 
 Replace 2                         292   81                  651
 Simons 1                          578    9                  333
 Simons 2                         1923    1                   23
 All Subsets                     43451    0 by definition      0 by definition

Results: For the test of finding the best models (Rank#1), the Replace 1 procedure (which is the equivalent of MaxR, the best approximate procedure available in SAS) failed with 322 out of the 1,000 data files. In contrast, the new Simons 2 procedure, the best approximate procedure available in CoStat, failed with only 1 out of the 1,000 data files.

Recommendations:

  • Use Method: All Subsets whenever possible.
  • Use Method: Simons 2 whenever All Subsets is too slow.
  • Run the procedure overnight, if necessary.
The extra computer time needed for the better selection methods will often yield better results.

Availability:

  • The stand-alone version of CoStat includes the All Subsets, Forward and Replace 1 procedures.
  • CoPlot includes everything in CoStat and also the Replace 2, Simons 1 and Simons 2 procedures.

Problem #2 - Often, there are several good models.
Because there are so many possible subsets when the number of X columns is large, there are often several almost equally good models to choose from. Sometimes, if you were to remove a few data points, a different model would be ranked #1.

Recommendations:

  • Print out nBest=20 (or more) models for each subset size.
    You need to choose from several good models. Don't blindly choose the model ranked #1.
  • Think carefully about which model is the "best" model.
    When picking the "best" model (the one you will continue to work with), different statisticians and researchers have different external criteria (for example, you might need the smallest model with R^2 > 0.9) and different favorite statistics (Cp? MSEP? PreSS? LGO Pre R^2?). If you have favorite statistics, use them. We recommend looking for the model with the lowest Cp (because it reflects all of the data), LOO PreSS (because it reflects Leave One Out validation), and LGO PreSS (because it reflects Leave Group Out validation). But you should read about the different statistics, consult a statistician, and choose the statistics that are right for your situation.
  • Use Validation Method: Bootstrap (or Leave 20% Out)
    By testing with different subsets of the data, validation gives you a measure of the robustness of the models. The validation methods aren't perfect but at least they give you a rough idea of how robust the models are.
  • Use Validate N Times: 100
    Validation gives you a measure of robustness of the different models with different subsets of the data. The validation process randomly picks which rows will be used, so using larger values of Validate N Times helps stabilize the validation statistics.

Problem #3 - The best X columns may be no better than random values.
A common situation is for the data file to have many X columns (40, 100, 400, 1000, ...) and relatively few rows of data (roughly, less than twice as many rows as there are columns). In this situation, the probability that a good subset of X's will exist (by chance) is very high. In fact, if you add some X columns with random numbers to a data file with relatively few rows, you will see that they are sometimes included in the "best" models.
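A quick simulation makes this concrete. The sketch below (Python with numpy; the sizes and seed are arbitrary assumptions for illustration) generates pure-noise data files with many more X columns than rows and records the R^2 of the best single-X model in each file:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_x = 20, 100        # few rows, many candidate X columns (arbitrary sizes)
best_r2 = []
for _ in range(200):         # 200 simulated data files of pure random noise
    X = rng.standard_normal((n_rows, n_x))
    y = rng.standard_normal(n_rows)
    # R^2 of each one-X model is the squared correlation of that X with Y
    r = np.corrcoef(X.T, y)[-1, :-1]
    best_r2.append(float((r ** 2).max()))
mean_best = float(np.mean(best_r2))   # typically above 0.3, despite pure noise
```

Even though every column is random noise, the best single X routinely "explains" a third or more of the variation in Y, which is exactly why validation is needed.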

Clearly, the more rows of data you have the better. But in any case, you need to do some sort of validation tests to ensure that the "best" models are unlikely to have occurred by chance and that they remain the best models with other samples of data.

A commonly proposed solution to these problems is to divide the data into 2 parts (for example, use 1/2 of the rows of data for subset selection and 1/2 for cross validation and for estimation of the regression coefficients). But unless you have lots of rows of data, you are then ignoring valuable information in each part of the process and the result is less than optimal. A simulation study by Roecker (1991) showed that it is better to use all of the data for both parts. In the end, as Miller (2002, page 191) says, "the bias is present in the complete data set; cross-validation cannot remove it."

CoStat has options to do validation tests on the N Best subset models that it finds.

Recommendations:

  • Use More Data.
    The more rows of data you have, the more reliable the results. Collecting data is often expensive, but more data does lead to more reliable conclusions. More data doesn't significantly increase the run time. If you do have relatively few rows of data, be very wary of the results.
  • See the recommendations for Problem #2 above.

Problem #4 - The regression statistics and regression coefficients are biased.
All of the statistics generated by this procedure (R^2, etc.) and the regression coefficients in the models are biased. The true R^2 values (and the other statistics) probably aren't as high as they appear here. Similarly, the absolute values of the true regression coefficients are probably somewhat smaller. This occurs because the statistical tests were designed for use on one or a few pre-specified models, but this procedure uses them to compare millions of models. Also, we are using the same data values for subset selection and for generating the statistics and the coefficients. This is actually one of the causes of Problems #2 and #3 (above). Techniques to calculate unbiased statistics and regression coefficients have not yet been developed.

Recommendations:

  • Just be aware of the problem.
  • Using the Bootstrap validation method or Leave X% Out (where X>=20%) and the Print: Coefficients Table can give you a rough estimate of the variability of the coefficients.

Validation
CoStat takes a slightly non-standard approach to validation. We recommend using the best selection method possible (All Subsets, Simons 2, or Simons 1) to identify the nBest models. The validation procedure then repeatedly chooses a subset of the data and generates the validation statistics (LGO PreSS, LGO Pre R^2, and/or LGO MAE) for each of the nBest models (which have already been selected).

The standard validation procedure recommended by many authors and other statistics programs uses a poor selection method (such as Forward, Replace 1, or Stepwise) and then repeats the subset selection procedure for each validation replication. But these poor selection methods may never find the best models! We argue that it is better to use the best selection method possible to find the best models and then use validation only to compare the models which have already been identified. (Miller, 2002, uses the standard validation approach throughout his book, but on page 189 he mentions the alternative that CoStat uses.)

References - Alan J. Miller's Subset Selection in Regression (Second Edition) (Chapman & Hall/CRC, 2002) is an excellent book which covers all aspects of subset selection.

Data Format - As with Regression : Multiple (Full Model), there must be three or more columns of data in the data file. The initial columns must be the X columns. The final column must be the Y column.

Options in the Statistics : Regression : Multiple (Subset Selection) dialog box

Keep If:
lets you enter a boolean expression (for example, (col(1)>50) and (col(2)<col(3))). Each row of the data file is tested. If the expression evaluates to true, that row of data will be used in the calculations. If false, that row of data will be ignored. See "Using Equations", "the A button", and "the f() button".
Method:
Usually, you should use the "All Subsets" method, because it is the only method guaranteed to find the best possible models. It actually tests all of the possible subset models. For large problems, this procedure may take a long time and you should consider letting a problem run overnight.

If All Subsets is too slow (for example, some large problems might take days, weeks, or years), use one of the approximate methods.

Simons 2, Simons 1, and Replace 2 are only available in the version of CoStat that comes with CoPlot. Simons 2 takes longer to run than Simons 1, which takes longer than Replace 1, but the Simons methods are much more likely to find the best models. (See Problem #1 above.) (Forward is sometimes useful when you want to do a very quick pilot test before running All Subsets or one of the Simons methods.)
Max N X's In Model:
is the maximum number of X's in the models. For example, your data file might have 20 X columns, but you might want to restrict the search to subsets which have a maximum of 7 X's. For all of the Methods (and especially All Subsets), the procedure takes much longer with larger values of Max N X's In Model.
N Best:
is the number of top models of each subset size which will be saved and printed. Increases in N Best make the procedure only a little bit slower.

If you use one of the validation methods and N Best is less than 20, CoStat will automatically increase N Best to 20. Validation only makes sense if it compares several models.

Print X's As:
The X's in the subset models can be printed as Names, Numbers, or Number:[Names].
Print:
You can choose which information you want to have printed for each model. For all of the formulas:
  • Yobs = an observed Y value.
  • Yexp = an expected Y value (from one of the regression models).
  • variance = the variance, estimated from the Error Mean Square from fitting the whole model. (If there are fewer rows than columns in the datafile, this is not available.)
  • RSS = the Residual Sum of Squares = the Error Sum of Squares from one of the models.
  • i = 1 (representing the constant that is always in these models).
  • n = the number of rows of data used in the regression.
  • p = nXsInModel+i
  • Leave One Out predictedY = a Y value predicted (as if) from a regression done without that row of data. In other words:
    1. Do a regression without one row of data;
    2. Calculate the predicted Y for the row left out;
    3. Then repeat steps 1 and 2 for every other row of data.
    In CoStat the calculation is made a different way, but the numeric results are the same.
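The three steps above can be sketched directly (a Python/numpy sketch with invented names; as the text notes, CoStat itself uses an algebraic shortcut that produces the same numbers):

```python
import numpy as np

def loo_predictions(X, y):
    """Brute-force Leave One Out: refit without each row, predict that row."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])   # constant term plus the X's
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)  # step 1
        preds[i] = A[i] @ b                                    # step 2
    return preds                                               # step 3: every row

# PRESS = sum of squared Leave-One-Out prediction errors
rng = np.random.default_rng(1)
X = rng.standard_normal((12, 2))
y = 2.0 + X @ np.array([1.5, -0.5]) + 0.1 * rng.standard_normal(12)
press = float(np.sum((y - loo_predictions(X, y)) ** 2))
```

The shortcut form divides each ordinary residual by (1 - leverage) instead of refitting n times, which is why LOO statistics add little to the run time.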

Here is a description of each of the statistics. (For statistics for which there are variations in how they are calculated, we have chosen the variation supported by SAS.)

  • R^2 = the Coefficient of Multiple Determination.
    R^2 = SSModel/SSTotal = (SSTotal-RSS)/SSTotal
    R^2 ranges from 0 (the regression accounts for none of the variation in Y values) to 1 (the regression accounts for all of the variation).
  • Adj R^2 = the Adjusted R^2 (a variation of R^2 which adjusts for the number of Xs in the model so that Adj R^2 for models of different sizes can be directly compared).
    Adj R^2 = 1 - (n-i)*(1-R^2)/(n-p)
  • Cp = Mallows' Cp statistic.
    Cp = RSS/variance - (n-2p)
    If there are fewer rows than columns in the datafile, the variance estimate is not available, so this statistic is not available.
  • MSEP = Mean Squared Error of Prediction (for a random model).
    MSEP = (RSS*(n+1)*(n-2)) / ((n-p)*n*(n-p-1))
  • AIC = Akaike Information Criterion.
    AIC = n*ln(RSS/n)+2p
  • BIC = Bayesian Information Criterion.
    BIC = n*ln(RSS/n)+2*(p+2)*q-2*q^2
    where q = variance / (RSS/n)
    If there are fewer rows than columns in the datafile, the variance estimate is not available, so this statistic is not available.
  • Mean Absolute Error
    MAE = average(abs(Yobs-Yexp))
  • Leave-One-Out Prediction SS = Prediction Sum of Squares = PRESS.
    LOO PreSS = sum((Yobs-predictedY)^2)
    This is the standard PRESS statistic.
  • Leave-One-Out Prediction R^2 =
    LOO PreR^2 is the square of Pearson's Product Moment Correlation Coefficient (r) calculated from all of the LOO predicted Y's and the corresponding observed Y's.
    This statistic was suggested by Hall (2002).
  • Leave-One-Out Mean Absolute Error =
    LOO MAE = average(abs(Yobs-predictedY))
  • Equations With Column Names prints the equations for the selected models with their column names.
  • Equations With Column Numbers prints the equations for the selected models with their column numbers. These equations can be used with "Transformations : Transform (Numeric)" to calculate expected Y values.
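Several of the formulas above are easy to check numerically. Here is a sketch (Python/numpy; `subset_stats` is an invented name and the data are random, not from CoStat):

```python
import numpy as np

def subset_stats(X, y, full_variance):
    """R^2, Adj R^2, Cp, and AIC for one subset model, per the formulas above."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])   # i = 1: the constant term
    p = A.shape[1]                         # p = nXsInModel + i
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ b) ** 2))
    ss_total = float(np.sum((y - y.mean()) ** 2))
    r2 = (ss_total - rss) / ss_total
    adj_r2 = 1 - (n - 1) * (1 - r2) / (n - p)
    cp = rss / full_variance - (n - 2 * p)
    aic = n * np.log(rss / n) + 2 * p
    return r2, adj_r2, cp, aic

# For the full model, Cp works out to exactly p, because the variance estimate
# is RSS/(n-p) from that same model (the sample run shows Cp = 7 for 6 X's).
rng = np.random.default_rng(2)
X = rng.standard_normal((16, 6))
y = rng.standard_normal(16)
A = np.column_stack([np.ones(16), X])
rss_full = float(np.sum((y - A @ np.linalg.lstsq(A, y, rcond=None)[0]) ** 2))
var_full = rss_full / (16 - 7)             # Error Mean Square of the full model
r2, adj_r2, cp, aic = subset_stats(X, y, var_full)
```

A subset model is competitive when its Cp is near (or below) its own p, which is why Cp is useful for comparing models of different sizes.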
Validation Method:
Validation gives you a measure of robustness of the different models with different subsets of the data. The validation process retests the selected models with randomly picked subsets of the data. (See the discussion of validation above, and Problems #2 and #3 above.)

Bootstrap - For each validation replication, the Bootstrap method randomly picks rows from the original data file, with replacement. Each replication picks the same number of rows as were in the original data file.
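One consequence of sampling with replacement, sketched below (Python/numpy; the row count and seed are arbitrary assumptions), is that each replicate omits roughly a third of the original rows, so the replicates really do test the models on varying subsets of the data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 16                        # rows in the original data file (arbitrary)
fracs = []
for _ in range(1000):         # 1000 simulated Bootstrap replicates
    rows = rng.integers(0, n, size=n)          # n rows drawn with replacement
    fracs.append(1 - len(np.unique(rows)) / n) # fraction of rows left out
mean_frac = float(np.mean(fracs))   # about (1 - 1/n)^n, roughly 0.36 here
```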

Leave X% Out - There are several Leave X% Out validation methods. Each repeatedly reruns the regression with X% of the rows of data removed. Breiman and Spector (1992) did a simulation study which found Leave 20% Out to be a good choice.

The Leave X% Out validation replicates are done in balanced groups of replicates (for example, Leave 20% Out works with groups of 5 replicates), so that each row of data is left out of exactly 1 of the replicates in each group of replicates. The use of balanced groups (as opposed to repeatedly randomly choosing which X% to omit) leads to more stable results with fewer validation replications.
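The balanced-group idea can be sketched as follows (Python/numpy; `balanced_group` is an invented name): shuffle the row order once, then cut it into 1/X folds, so that each row is left out of exactly one replicate in the group.

```python
import numpy as np

def balanced_group(n, leave_out_frac=0.20, rng=None):
    """One balanced group of Leave-X%-Out replicates: each fold lists the rows
    left out of one replicate, and every row appears in exactly one fold."""
    rng = rng if rng is not None else np.random.default_rng()
    order = rng.permutation(n)
    return np.array_split(order, round(1 / leave_out_frac))

folds = balanced_group(16, 0.20, np.random.default_rng(4))
assert len(folds) == 5                 # Leave 20% Out -> a group of 5 replicates
left_out = sorted(np.concatenate(folds).tolist())
assert left_out == list(range(16))     # every row left out exactly once
```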

We strongly recommend using a validation method. We recommend the Bootstrap method or Leave 20% Out.

Validate N Times:
For the Bootstrap validation method, this is the number of times the procedure resamples the data. For the Leave X% Out validation methods, this is the number of times that the results will be validated by selecting and testing a balanced group of subsets of the data.

The validation process randomly picks which rows will be used, so using larger values of Validate N Times (we recommend 100) helps stabilize the validation statistics.

Print:
You can choose which validation information you want to have printed:
  • N Times Ranked #1 = a count of the number of validation replications where each model was ranked #1 (by comparing R^2 values with other models of the same size). The tally is printed in the Rank#1 column of the results. With Bootstrap or Leave 20% Out (or more), very robust models will always be ranked #1. Moderately robust models will be ranked #1 more than twice as often as the next best model. Not-robust models will often capture fewer #1 rankings than other models. With the Leave 10% Out (or less) methods, validation is more likely to continually pick the same best model as was found by the subset selection procedure, making this statistic less useful.
  • Leave-Group-Out Prediction SS
    LGO PreSS = sum((Yobs-predictedY)^2)
    where the predictedY is estimated from a regression done without that row of data (1 group at a time) during validation.
    The value printed for this statistic is the average value for all the validation replications.
    This is not the standard PRESS statistic (see Leave-One-Out Prediction SS).
  • Leave-Group-Out Prediction R^2
    LGO PreR^2 = the square of Pearson's Product Moment Correlation Coefficient (r) calculated from all of the predicted Y's (calculated during LeaveGroupOut validation) and the corresponding observed Y's.
    The value printed for this statistic is the average value for all the validation replications.
    This statistic was suggested by Hall (2002), who calls it Q^2.
  • Leave-Group-Out Mean Absolute Error
    LGO MAE = average(abs(Yobs-predictedY))
    where the predictedY is estimated from a regression done without each row of data (1 group at a time, during validation).
    The value printed for this statistic is the average value for all the validation replications.
  • Coefficients Table prints a table which indicates the variability of the regression coefficients in the original best model of each size. The table displays the mean and standard deviation of each coefficient for validation replicates where that model is ranked #1, and the mean and standard deviation for each coefficient for all validation replicates. The difference between the means (Mean #1 - Mean All) (when considered with their standard deviations) is a measure of the bias of the coefficients. The difference is larger when the Leave X% Out percentage is higher. The unique statistical properties of the Bootstrap method make its results the best measure of the bias. See Problem #4 above.
  • Validation Equations - For Leave-X%-Out validation methods, the equations generated are from the first balanced validation group. For bootstrap validation, the equations generated are from each of the validation runs (up to Validate N Times = 100).

The Sample Run

The data for the sample run is the Longley Data, which has 6 X columns and 1 Y column. With so few X columns, even Method: All Subsets runs very quickly.

For the sample run, use File : Open to open the file called longley.dt in the cohort directory. Then:

  1. From the menu bar, choose: Statistics : Regression : Multiple (Subset Selection).
  2. Keep If:
  3. Method: All Subsets
  4. Max N X's In Model: 6
  5. N Best: 20
  6. Print X's As: Numbers
  7. Print:
    • R^2
    • Cp
    • Leave-One-Out Prediction SS
    • Equations With Column Numbers
  8. Validation Method: Bootstrap
  9. Validate N Times: 100
  10. Print:
    • N Times Ranked #1
    • Leave-Group-Out Prediction SS
    • Coefficients Table
  11. OK
REGRESSION : MULTIPLE : SUBSET SELECTION
2002-10-14 18:42:31
Using: C:\cohort6\LONGLEY.DT
  X Columns:
       1) GNP def         3) Unemployment    5) 14 yrs      
       2) GNP             4) Armed Forces    6) Time        
  Y Column: 7) Employment

Method: All Subsets
Max N X's In Model: 6
Keep If: 
Validation Method: Bootstrap
Validate N Times: 100

Total number of data rows = 16
Number of data rows used = 16

The best models:
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   1    1   2
   1    2   6
   1    3   1
   1    4   5
   1    5   3
   1    6   4
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   2    1   3, 6
   2    2   2, 3
   2    3   2, 5
   2    4   2, 6
   2    5   3, 5
...
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   3    1   3, 4, 6
   3    2   2, 3, 4
   3    3   2, 4, 5
   3    4   1, 3, 6
   3    5   2, 3, 6
...
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   4    1   2, 3, 4, 6
   4    2   3, 4, 5, 6
   4    3   1, 3, 4, 6
   4    4   2, 3, 4, 5
   4    5   1, 2, 4, 5
...
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   5    1   2, 3, 4, 5, 6
   5    2   1, 2, 3, 4, 6
   5    3   1, 3, 4, 5, 6
   5    4   1, 2, 3, 4, 5
   5    5   1, 2, 4, 5, 6
   5    6   1, 2, 3, 5, 6
                                                                           
n Xs Rank   X Columns                                                      
---- ----   ---------------------------------------------------------------
   6    1   1, 2, 3, 4, 5, 6
                                                                           
The statistics for the best models:
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   1    1   0.9673738  52.949425  7589201.1      97  3096552.4
   1    2   0.9434809  100.51322   13016160       3  5214933.2
   1    3   0.9426439  102.17939   13500813       0  5173452.8
   1    4   0.9223501  142.57869   18421169       0  7827711.3
   1    5   0.2525043  1476.0486  1.69617e8       0   69522036
   1    6   0.2091301  1562.3943  1.88314e8       0   84857519
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   2    1   0.9823137  25.208364    4679317      58  2120576.2
   2    2   0.9806546  28.511069  5076800.9      15    2239282
   2    3   0.9790585  31.688487  5674302.5       8  2452798.7
   2    4   0.9734556  42.842209  6981785.1       3    3053396
   2    5   0.9688932  51.924638  8876891.5       9  4088699.4
...
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   3    1   0.992847   6.2394837  2132127.4      87  1072037.2
   3    2   0.9850996  21.662472  4386976.5       2  2151847.5
   3    3   0.9835103  24.826233  4941030.8       4  2274526.7
   3    4   0.9828873  26.066385  5149873.3       1  2629974.1
   3    5   0.9824913  26.854818  5741368.7       3  2962273.2
...
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   4    1   0.9953587  3.2394804  1998041.1      60  1113358.9
   4    2   0.994672   4.6064343  2216518.9      27  1748728.6
   4    3   0.992854   8.2256744  2540766.1      10  1427904.9
   4    4   0.9872082  19.464804  5184195.3       0  3717538.5
   4    5   0.9863071  21.258566  4951193.5       0  2793420.3
...
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   5    1   0.9954632  5.0314623  2561221.4      42  2573200.5
   5    2   0.9954533  5.0510991  2344384.2      34  1363439.1
   5    3   0.9949044  6.1438652  2581127.2      22  3728562.8
   5    4   0.9873777  21.127371  5852906.9       1  5618362.2
   5    5   0.9868841  22.110031  5832091.7       1  8692038.2
   5    6   0.983799   28.251542  7676622.8       0  5942862.2
                                                              
n_Xs Rank         R^2         Cp  LOO_Press  Rank#1  LGO_Press
---- ----   ---------  ---------  ---------  ------  ---------
   6    1   0.995479           7  2886892.5     100  4877001.5
                                                              
(The validation method randomly assigns rows of data to validation groups,
so the Rank#1 and LGO statistics printed above will vary.
You can reduce the variability by increasing 'Validate N Times'.)
                                                              
                                                                           
n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   1    1   col(7) = 51843.5897819 +0.03475229435*col(2)
   1    2   col(7) = -1335105.2441 +716.511764706*col(6)
   1    3   col(7) = 33189.1733796 +315.966086377*col(1)
   1    4   col(7) = 8380.67418338 +0.48487809832*col(5)
   1    5   col(7) = 59286.3553982 +1.88852315637*col(3)
   1    6   col(7) = 59301.2646582 +2.30780841269*col(4)
                                                                           
n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   2    1   col(7) = -1587138.9078 -0.9955303213*col(3) +847.088742485*col(6)
   2    2   col(7) = 52382.1670501 +0.03784032702*col(2) -0.5435743321*col(3)
   2    3   col(7) = 88938.7983051 +0.06317243566*col(2) -0.4097429223*col(5)
   2    4   col(7) = 1198708.11085 +0.06299295723*col(2) -592.38341363*col(6)
   2    5   col(7) = -135.32549508 -1.1151492576*col(3) +0.5877277691*col(5)
...
                                                                           
n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   3    1   col(7) = -1797221.1122 -1.4696711189*col(3) -0.7722814913*col(
   3    2   col(7) = 53306.4611883 +0.04078799732*col(2) -0.7968165793*col(
   3    3   col(7) = 109470.955483 +0.07992714636*col(2) -0.4978793189*col(
   3    4   col(7) = -1879252.7716 -64.775010369*col(1) -1.0519056696*col(
   3    5   col(7) = -1198891.4309 +0.00899056576*col(2) -0.8904086536*col(
...

n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   4    1   col(7) = -3598729.3743 -0.0401904697*col(2) -2.0883907318*col(
   4    2   col(7) = -2446174.695 -1.5004764434*col(3) -0.9343638696*col(
   4    3   col(7) = -1828915.7377 -7.2827116254*col(1) -1.4734185066*col(
   4    4   col(7) = 82613.0992041 +0.06210170815*col(2) -0.5198036017*col(
   4    5   col(7) = 120323.684241 -136.32592586*col(1) +0.09659841249*col(
...
                                                                           
n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   5    1   col(7) = -3449891.5997 -0.0319613069*col(2) -1.9721499421*col(
   5    2   col(7) = -3564921.8744 +27.7148784578*col(1) -0.042127114*col(
   5    3   col(7) = -2705054.5008 -43.916959962*col(1) -1.5262904441*col(
   5    4   col(7) = 92461.3078244 -48.462828184*col(1) +0.07200384932*col(
   5    5   col(7) = -403186.16429 -179.87874985*col(1) +0.09517876035*col(
   5    6   col(7) = -1121975.8255 -127.76330578*col(1) +0.03985731002*col(
                                                                           
n Xs Rank   Equations With Column Numbers                                    
---- ----   -----------------------------------------------------------------
   6    1   col(7) = -3482258.6346 +15.0618722714*col(1) -0.0358191793*col(
                                                                           
Coefficients Table                                                      
                                                                           
Variability of Biased Regression Coefficients:
When the same data is used to search for the best subset models and to
estimate the regression coefficients, the estimated regression coefficients
are biased. A measure of the bias is Mean #1 - Mean All: the mean of
the values of a coefficient when it is in validation replicates where
that model is ranked #1 and the mean of the values of that coefficient for
all validation replicates. Usually, the absolute values of the biased 
coefficients are too large. The difference is larger when the Leave X% Out
percentage is higher. The unique statistical properties of the Bootstrap
method make its results the best measure of the bias.
                                                                           
(The validation method randomly assigns rows of data to validation groups,
so the following results will vary.
You can reduce the variability by increasing 'Validate N Times'.)
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   51843.59                                                      
   2  0.0347523  0.0352033   0.001667  0.0351513  0.0016724  5.2062e-5
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   -1587139                                                      
   3   -0.99553  -1.019034  0.2133845  -1.024537     0.2057  0.0055035
   6  847.08874  852.93932  37.350316   854.7123  36.720343   -1.77298
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   -1797221                                                      
   3  -1.469671  -1.496287  0.1371075  -1.516463  0.1779194   0.020176
   4  -0.772281  -0.802561  0.1441071  -0.798892  0.1911141  -0.003669
   6   956.3798  965.36394  38.794978  966.61598  41.199368  -1.252044
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   -3598729                                                      
   2   -0.04019  -0.053328  0.0250326  -0.042098  0.0268491   -0.01123
   3  -2.088391  -2.288518  0.3259705  -2.113154  0.4043208  -0.175364
   4  -1.014639   -1.08571  0.1863755  -1.016285  0.2048972  -0.069425
   6  1887.4095  2197.3399  568.26124   1933.823  616.52441  263.51689
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   -3449892                                                      
   2  -0.031961  -0.056069   0.056141  -0.043624  0.0454442  -0.012445
   3   -1.97215  -2.319053  0.6524732  -2.135074  0.6091017  -0.183979
   4  -1.019969  -1.037111  0.2179187  -1.046027  0.2589043  0.0089161
   5  -0.077537  0.0731863  0.2800716  -0.028957  0.3865007  0.1021435
   6  1814.1014  2154.0816  1009.3157  2013.4852  861.44673  140.59641
                                                                           
         Biased       Mean   Std.Dev.   Mean for   Std.Dev.  Mean #1 -
   X      Coef.    When #1    When #1   All Reps   All Reps   Mean All
----  ---------  ---------  ---------  ---------  ---------  ---------
 Int   -3482259                                                      
   1  15.061872  25.654696  109.00544  25.654696  109.00544          0
   2  -0.035819  -0.049325  0.0580191  -0.049325  0.0580191          0
   3   -2.02023  -2.229122  0.7783844  -2.229122  0.7783844          0
   4  -1.033227  -1.111287  0.3492111  -1.111287  0.3492111          0
   5  -0.051104   -0.02682   0.569697   -0.02682   0.569697          0
   6  1829.1515  2082.7935  994.94113  2082.7935  994.94113          0

The model with n Xs=4 and Rank=1 (the X columns are 2, 3, 4, 6) looks to be a good model. It has the lowest Cp and LOO_PreSS values of any of the models, and one of the lowest LGO_PreSS values. And it was ranked #1 among models with 4 X's more than twice as often as the next best model with 4 X's.

 

