Subset Selection in Multiple Regression in CoStat
CoStat includes several
methods for finding the best models which are subsets
of a full multiple regression model.
(This is also known as "attribute selection", "feature selection",
and "variable selection".)
Summary: For subset selection in multiple regression
with more than 40 X variables (when All Subsets starts to
become too slow),
the Simons 2 procedure does a
dramatically better
job of finding the
best subset models than any other approximate subset selection procedure
available anywhere. This procedure is only available in
the copy of CoStat that comes with CoPlot.
Details
Problems Associated With Selecting Subsets:
Problem #1 -
The number of possible subsets can be huge.
The number of possible subsets
grows very quickly with the number of X columns.
If there are k X columns, there will be 2^k -1 possible models. Usually,
your goal is to find a model with a small subset of X's
which provide a good fit of the data.
By limiting the maximum number of X's in the models (Max N X's In Model),
you can limit the search to fewer possible models, but the number
can still be huge (it approaches k^(Max N X's In Model)).
Ideally, you would use the All Subsets method
to check all of the possible subsets with 1 to Max N X's In Model X's.
But, when you ask for models with, for example, 9 or more X's selected
from a file with 40 or more X's,
the computer time for checking all subsets
(that is about 2x10^14 models) becomes prohibitive
(days, weeks, years, 100's of years, ...).
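For small cases the exact count is easy to check. The following is a rough Python sketch, not part of CoStat (the function name count_subset_models is made up for illustration). It counts the candidate models as the sum of the binomial coefficients C(k, j) for j = 1 up to the largest model size allowed:

```python
from math import comb

def count_subset_models(k, max_xs=None):
    """Number of non-empty subsets of k X columns with at most max_xs X's."""
    if max_xs is None:
        max_xs = k              # no limit: every subset size is allowed
    return sum(comb(k, size) for size in range(1, max_xs + 1))

print(count_subset_models(6))      # all subsets of 6 X columns
print(count_subset_models(6, 2))   # only models with 1 or 2 X's
```

With 6 X columns and no size limit this gives 2^6 - 1 = 63 models, and limiting the models to 1 or 2 X's leaves 6 + 15 = 21.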
Fortunately, there are alternatives to the All Subsets
method: various approximate methods (Simons 2, Simons 1,
Replace 1, Forward, and other methods which are not in CoStat) for
efficiently finding good models.
All of the approximate methods are faster than All Subsets,
but none is guaranteed to find the best models.
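To illustrate why an approximate method can miss the best model, here is a minimal Python sketch that compares an exhaustive search with a simple forward selection on a small made-up data set. It is only an illustration under stated assumptions: the helper names (solve, rss, all_subsets_best, forward_best) and the data are hypothetical, the fit is ordinary least squares with a constant, columns are numbered from 0, and none of this is CoStat's own implementation (the Simons methods are not shown at all).

```python
from itertools import combinations

def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x

def rss(rows, y, cols):
    """Residual sum of squares for Y ~ constant + the chosen X columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)      # normal equations; fine for a tiny example
    return sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, yi in zip(design, y))

def all_subsets_best(rows, y, size):
    """Check every subset of the given size and keep the lowest RSS."""
    k = len(rows[0])
    return min(combinations(range(k), size), key=lambda cols: rss(rows, y, cols))

def forward_best(rows, y, size):
    """Greedy forward selection: add the single best X at each step."""
    k = len(rows[0])
    chosen = []
    while len(chosen) < size:
        best = min((c for c in range(k) if c not in chosen),
                   key=lambda c: rss(rows, y, chosen + [c]))
        chosen.append(best)
    return tuple(sorted(chosen))

# Made-up data: Y equals column 1 + column 2 exactly; column 0 is Y plus noise.
x0 = [1.5, 1.5, 3.5, 3.5, 4.5, 6.5]
x1 = [10, -10, 10, -10, 10, -10]
x2 = [-9, 12, -7, 14, -5, 16]
y = [1, 2, 3, 4, 5, 6]
rows = list(zip(x0, x1, x2))

print("All Subsets:", all_subsets_best(rows, y, 2))
print("Forward:    ", forward_best(rows, y, 2))
```

In this constructed example column 0 is the best single predictor, so the forward search keeps it in its 2-X model, while the exhaustive search finds that columns 1 and 2 together fit Y exactly.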
Two of the approximate methods, Simons 2 and Simons 1
(which use algorithms
developed by Robert Simons of CoHort Software
and are only available in CoHort's software products),
are dramatically better than
Replace 1 (which is essentially the same as SAS's best method, MaxR),
Forward Selection (which is also in SAS), and other commonly
available (Backward Elimination and Stepwise)
and not commonly available
(Garrote, Ridge, Forward Stagewise, LARS, Lasso,
and even Miller's Replace 2) approximate methods.
Simons 2 and Simons 1 are
much better at finding the best models
and they do so in a reasonable amount of time.
We did a Monte Carlo test of the subset selection methods in CoStat.
In the test, 1000 random data files were generated. Each had 100 rows,
39 X columns, and 1 Y column. Each of the selection methods was
asked to find the best subsets with 1, 2, 3, 4, 5, 6, and 7 X's.
The results from each method were compared to the results from
All Subsets
(which by definition finds the best available subsets).
Below is a tally of how many times each method failed to find
all 7 of the subsets ranked #1, and how often each failed to
find all 140 of the subsets ranked #1 through #20.
(The times are on a 200 MHz AMD K6 computer.)
Method                      Time (sec)   N Failures Rank#1   N Failures in Top 20
-------------------------   ----------   -----------------   --------------------
Forward                             78                 663                   1000
Replace 1 (like SAS MaxR)          134                 322                   1000
Replace 2                          292                  81                    651
Simons 1                           578                   9                    333
Simons 2                          1923                   1                     23
All Subsets                      43451     0 by definition        0 by definition
Results: For the test of finding the best models (Rank#1),
the Replace 1 procedure (which is the equivalent of MaxR, the best approximate
procedure available in SAS) failed with 322 out of the 1,000 data files. In
contrast, the new Simons 2 procedure, the best approximate procedure available
in CoStat, failed with only 1 out of the 1,000 data files.
Recommendations:
- Use Method: All Subsets whenever possible.
- Use Method: Simons 2 whenever All Subsets is too slow.
- Run the procedure overnight, if necessary.
The extra computer time needed for the better selection methods
will often yield better results.
Availability:
- The stand-alone version of CoStat
includes the All Subsets, Forward and Replace 1 procedures.
- CoPlot includes everything in CoStat
and also the Replace 2, Simons 1 and Simons 2 procedures.
Problem #2 - Often, there are several good models.
Because there are so many possible subsets when the number
of X columns is large, there are often several almost equally good
models to choose from. Sometimes, if you were to remove
a few data points, a different model would be ranked #1.
Recommendations:
- Print out nBest=20 (or more) models for each subset size.
You need to choose from several good models. Don't blindly
choose the model ranked #1.
- Think carefully about which model is the "best" model.
When picking the "best" model (the one you will continue to work with),
different statisticians and researchers have different external
criteria (for example, you might need the smallest model with R^2 > 0.9)
and different favorite statistics (Cp? MSEP? PreSS? LGO Pre R^2?).
If you have favorite statistics, use them.
We recommend looking for the model with the
lowest Cp (because it reflects all of the data),
LOO PreSS (because it reflects Leave One Out validation),
and LGO PreSS (because it reflects Leave Group Out validation).
But you should read about
the different statistics, consult a statistician, and choose the
statistics that are right for your situation.
- Use Validation Method: Bootstrap (or Leave 20% Out)
By testing with different subsets of the data,
validation gives you a measure of the robustness of the models.
The validation methods aren't perfect but at least
they give you a rough idea of how robust the models are.
- Use Validate N Times: 100
Validation gives you a measure of robustness of the different
models with different subsets of the data.
The validation process randomly picks which rows will be used,
so using larger values of
Validate N Times helps stabilize the validation statistics.
Problem #3 - The best X columns may be no better than random values.
A common situation is for the data file to have many X columns
(40, 100, 400, 1000, ...) and relatively few rows of data
(roughly, less than twice as many rows as there are columns).
In this situation, the probability that a good subset of X's will exist
(by chance) is very high. In fact, if you add some X columns with
random numbers to a data file with relatively few rows,
you will see that they are sometimes included in the "best" models.
Clearly, the more rows of data you have the better. But in any case,
you need to do some sort of validation tests to ensure
that the "best" models are unlikely to have occurred by chance
and that they remain the best models with other samples of data.
A commonly proposed solution to these problems is to divide the data into 2 parts
(for example, use 1/2 of the rows of data for subset
selection and 1/2 for cross validation and for estimation of the regression
coefficients). But unless you have lots of rows of data, you are then
ignoring valuable information in each part of the process and
the result is less than optimal.
A simulation study by Roecker (1991)
showed that it is better to use all of the data for both parts.
In the end, as Miller (2002, page 191) says, "the bias is present in the
complete data set; cross-validation cannot remove it."
CoStat has options to do validation tests on the N Best
subset models that it finds.
Recommendations:
- Use More Data.
The more rows of data you have, the more reliable the results.
Collecting data is often
expensive, but more data does lead to more reliable conclusions.
More data doesn't significantly increase the run time.
If you do have relatively few rows of data, be very wary of the results.
- See the recommendations for Problem #2 above.
Problem #4 - The regression statistics and regression coefficients are biased.
All of the statistics generated by this procedure (R^2, etc.) and
the regression coefficients in the models are biased.
The true R^2 values (and the other statistics) probably aren't
as high as they appear here.
Similarly, the absolute values of the true regression
coefficients are probably somewhat smaller.
This occurs because the statistical tests were designed for use on one or a few
pre-specified models,
but this procedure uses them to compare millions of models.
Also, we are using the same data values for subset selection
and for generating the statistics and the coefficients.
This is actually one of the causes of Problems #2 and #3 (above).
Techniques to calculate unbiased statistics and regression coefficients
have not yet been developed.
Recommendations:
- Just be aware of the problem.
- Using the Bootstrap
validation method or Leave X% Out (where X>=20%) and the
Print: Coefficients Table
can give you a rough estimate of the variability of the coefficients.
Validation
CoStat takes a slightly non-standard approach to validation.
We recommend using the best selection method possible
(All Subsets, Simons 2, or Simons 1)
to identify the nBest models. The validation procedure then repeatedly
chooses a subset of the
data and generates the validation statistics (LGO PreSS, LGO Pre R^2,
and/or LGO MAE) for each of the nBest models
(which have already been selected).
The standard validation procedure recommended by many authors
and other statistics programs uses a poor selection method
(such as Forward, Replace 1, or Stepwise)
and then repeats the subset
selection procedure for each validation replication.
But these poor selection methods may never find the best models!
We argue that it is better to use the best selection method possible
to find the best models and then just use validation to compare
the models which have already been identified.
(Miller, 2002, uses the standard validation approach
throughout his book,
but he mentions the alternative that CoStat uses on page 189).
References -
Alan J. Miller's Subset Selection in Regression (Second Edition)
(Chapman & Hall/CRC, 2002) is an excellent book which covers all
aspects of subset selection.
Data Format - As with Regression : Multiple (Full Model),
there must be three or more columns of data in the data file.
The initial columns must be the X columns.
The final column must be the Y column.
Options in the
Statistics : Regression : Multiple (Subset Selection)
dialog box
- Keep If:
- lets you enter a boolean expression (for example,
(col(1)>50) and (col(2)<col(3))).
Each row of the data file is tested. If the equation evaluates to
true, that row of data will be used in the calculations.
If false, that row of data will be ignored.
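The effect of a Keep If expression can be pictured with a short Python sketch. This is not CoStat's equation language; the col helper and the three data rows are hypothetical and only mimic the example expression above, using 1-based column numbers:

```python
rows = [
    (60, 1, 5),
    (40, 1, 5),
    (75, 9, 2),
]

def col(row, n):
    """Return column n of a row, using 1-based column numbers."""
    return row[n - 1]

def keep_if(row):
    # Mirrors the example expression (col(1)>50) and (col(2)<col(3)).
    return (col(row, 1) > 50) and (col(row, 2) < col(row, 3))

# Only rows for which the expression is true are used in the calculations.
kept = [row for row in rows if keep_if(row)]
print(kept)
```

Only the first row passes both conditions, so the other two rows would be ignored.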
See
"Using Equations",
"the A button", and
"the f() button".
- Method:
- Usually, you should use the
"All Subsets"
method, because
it is the only method guaranteed to find the best possible models.
It actually tests all of the possible subset models.
For large problems, this procedure may take a long time and you should consider
letting a problem run overnight.
If All Subsets is too slow (for example, some large problems
might take days, weeks, or years), use one of the approximate methods:
Simons 2, Simons 1, Replace 2, Replace 1, or Forward.
Simons 2, Simons 1, and Replace 2 are only available in the version of CoStat that
comes with CoPlot.
Simons 2 takes longer to run than Simons 1, which
takes longer than Replace 1, but the Simons
methods are much more likely to find the best models.
(See Problem #1 above.)
(Forward
is sometimes useful when you want to do a very quick
pilot test before running All Subsets or
one of the Simons methods.)
- Max N X's In Model:
- is the maximum number of X's in the models.
For example, your data file might have 20 X columns, but you might want
to restrict the search to subsets which have a maximum of 7 X's.
For all of the Methods (and especially All Subsets),
the procedure takes much longer with larger values of Max N X's In Model.
- N Best:
- is the number of top models of each subset size
which will be saved and printed.
Increases in N Best make the procedure only a little bit slower.
If you use one of the validation methods and N Best is less than 20,
CoStat will automatically increase N Best to 20.
Validation only makes sense if it compares several models.
- Print X's As:
- The X's in the subset models can be printed as
Names, Numbers, or Number:[Names].
- Print:
- You can choose which information you want to have printed for each model.
For all of the formulas:
- Yobs = an observed Y value.
- Yexp = an expected Y value (from one of the regression models).
- variance = the variance, estimated from the Error Mean Square
from fitting the whole model.
(If there are fewer rows than columns in the datafile, this is not available.)
- RSS = the Residual Sum of Squares = the Error Sum of Squares
from one of the models.
- i = 1 (representing the constant that is always in these models).
- n = the number of rows of data used in the regression.
- p = nXsInModel+i
- Leave One Out predictedY = a Y value predicted
(as if) from a regression done without that row of data.
In other words:
1. Do a regression without one row of data;
2. Calculate the predicted Y for the row left out;
3. Then repeat steps 1 and 2 for every other row of data.
In CoStat the calculation is made a different way, but the
numeric results are the same.
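The Leave One Out steps above can be sketched directly in Python. This is a brute-force illustration for a model with a single X column; the helper names fit_line and loo_predictions and the data are hypothetical, and CoStat, as noted, arrives at the same numbers by a different calculation.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def loo_predictions(xs, ys):
    """For each row, refit without that row and predict its Y."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a + b * xs[i])
    return preds

xs = [0.0, 1.0, 2.0]            # made-up data
ys = [0.0, 1.0, 4.0]
preds = loo_predictions(xs, ys)
loo_press = sum((y - p) ** 2 for y, p in zip(ys, preds))
loo_mae = sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)
print(preds, loo_press, loo_mae)
```

Refitting once per row is slow for large files, which is one reason a program would compute the same values another way.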
Here is a description of each of the statistics. (For statistics for which
there are variations in how they are calculated,
we have chosen the variation supported by SAS.)
- R^2 = the Coefficient of Multiple Determination.
R^2 = SSModel/SSTotal = (SSTotal-RSS)/SSTotal
R^2 ranges from 0 (the regression accounts for none of the
variation in Y values) to 1 (the regression accounts for
all of the variation).
- Adj R^2
= the Adjusted R^2 (a variation of R^2 which
adjusts for the number of Xs in the model so that Adj R^2 for models
of different sizes can be directly compared).
Adj R^2 = 1 - (n-i)*(1-R^2)/(n-p)
- Cp =
Mallows' Cp statistic.
Cp = RSS/variance - (n-2p)
If there are fewer rows than columns in the datafile,
the variance estimate is not available, so this statistic is not available.
- MSEP
= Mean Squared Error of Prediction
(for a random model).
MSEP = (RSS*(n+1)*(n-2)) / ((n-p)*n*(n-p-1))
- AIC =
Akaike Information Criterion.
AIC = n*ln(RSS/n)+2p
- BIC =
Bayesian Information Criterion.
BIC = n*ln(RSS/n)+2*(p+2)*q-2*q^2
where q = variance / (RSS/n)
If there are fewer rows than columns in the datafile,
the variance estimate is not available, so this statistic is not available.
- Mean Absolute Error
MAE
= average(abs(Yobs-Yexp))
- Leave-One-Out Prediction SS =
Prediction Sum of Squares =
PRESS.
LOO PreSS
= sum((Yobs-predictedY)^2)
This is the standard PRESS statistic.
- Leave-One-Out Prediction R^2 =
LOO PreR^2 is the square of Pearson's
Product Moment Correlation Coefficient
(r) calculated from all of the LOO predicted Y's
and the corresponding observed Y's.
This statistic was suggested by
"Hall (2002)".
- Leave-One-Out Mean Absolute Error =
LOO MAE = average(abs(Yobs-predictedY))
- Equations With Column Names prints the equations
for the selected models with their column names.
- Equations With Column Numbers prints the equations
for the selected models with their column numbers.
These equations can be used with
"Transformations : Transform (Numeric)"
to calculate expected Y values.
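The formulas listed above for R^2, Adj R^2, Cp, MSEP, AIC, and BIC can be collected into one small Python function. This is a sketch that simply transcribes those formulas; the function name subset_statistics, its argument names, and the numbers in the example call are hypothetical.

```python
from math import log

def subset_statistics(rss, ss_total, n, n_xs, variance=None):
    """Statistics for one subset model, following the formulas listed above.

    variance is the Error Mean Square of the full model; pass None when it
    is not available (fewer rows than columns), and Cp and BIC are skipped.
    """
    i = 1                       # the constant that is always in the model
    p = n_xs + i
    r2 = (ss_total - rss) / ss_total
    stats = {
        "R^2": r2,
        "Adj R^2": 1 - (n - i) * (1 - r2) / (n - p),
        "MSEP": (rss * (n + 1) * (n - 2)) / ((n - p) * n * (n - p - 1)),
        "AIC": n * log(rss / n) + 2 * p,
    }
    if variance is not None:
        q = variance / (rss / n)
        stats["Cp"] = rss / variance - (n - 2 * p)
        stats["BIC"] = n * log(rss / n) + 2 * (p + 2) * q - 2 * q ** 2
    return stats

# Hypothetical numbers: 16 rows and a full model with 6 X's, so the
# variance estimate is the full model's RSS / (n - p) = 70 / 9.
full = subset_statistics(rss=70.0, ss_total=1000.0, n=16, n_xs=6,
                         variance=70.0 / 9)
print(full["Cp"])
```

For the full model the variance estimate equals RSS/(n-p), so Cp reduces to p. That is consistent with the Cp of 7 printed for the 6-X model in the sample run below.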
- Validation Method:
- Validation gives you a measure of robustness of the different
models with different subsets of the data.
The validation process retests the selected models with
randomly picked subsets of the data.
(See the discussion of
validation above, and
Problems #2 and #3
above.)
Bootstrap -
For each validation replication, the Bootstrap method randomly
picks rows from the original data file, with replacement.
For each validation replication, it picks the same number
of rows as were in the original
data file.
Leave X% Out -
There are several Leave X% Out validation methods. Each
repeatedly reruns the regression with X% of the rows of data removed.
Breiman and Spector (1992) did a simulation study
which found Leave 20% Out to be a good choice.
The Leave X% Out
validation replicates are done in balanced groups of replicates
(for example, Leave 20% Out works with groups of 5 replicates),
so that each row of data is left out of exactly 1 of the replicates
in each group of replicates.
The use of balanced groups (as opposed to repeatedly randomly choosing
which X% to omit) leads to more stable results with fewer validation replications.
We strongly recommend using a validation method.
We recommend the Bootstrap method or
Leave 20% Out.
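The two row-picking schemes described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and how CoStat forms the groups when the number of rows does not divide evenly is not stated here, so the sketch simply lets group sizes differ by one.

```python
import random

def bootstrap_rows(n_rows, rng):
    """One Bootstrap replicate: n_rows row indices drawn with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def balanced_leave_out_groups(n_rows, n_groups, rng):
    """One balanced group of Leave X% Out replicates.

    Returns n_groups lists of left-out row indices; every row appears in
    exactly one list. n_groups=5 corresponds to Leave 20% Out.
    """
    order = list(range(n_rows))
    rng.shuffle(order)
    return [sorted(order[g::n_groups]) for g in range(n_groups)]

rng = random.Random(0)          # fixed seed only to make the sketch repeatable
sample = bootstrap_rows(16, rng)
left_out = balanced_leave_out_groups(16, 5, rng)
for rows_out in left_out:
    rows_used = [r for r in range(16) if r not in rows_out]
    print(len(rows_used), "rows used,", len(rows_out), "left out")
```

A Bootstrap sample usually repeats some rows and omits others, whereas each balanced group leaves every row out exactly once across its 5 replicates.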
- Validate N Times:
- For the Bootstrap validation method, this is the number of times
the procedure resamples the data. For the Leave X% Out validation
methods, this is the number of times that the results will be validated
by selecting and testing a balanced group of subsets of the data.
The validation process randomly picks which rows will be used,
so using larger values of
Validate N Times (we recommend 100) helps stabilize
the validation statistics.
- Print:
- You can choose which validation information you want to have printed:
- N Times Ranked #1
= a count of the number of validation replications where each model
was ranked #1 (by comparing R^2 values with other models of the
same size).
The tally is printed in the
Rank#1
column of the results. With Bootstrap or
Leave 20% Out (or more),
very robust models will always be ranked #1.
Moderately robust models will be ranked #1 more than twice as often
as the next best model. Not-robust models will often capture fewer
#1 rankings than other models.
With the Leave 10% Out (or less) methods,
validation is more likely to continually pick the same best
model as was found by the subset selection procedure,
making this statistic less useful.
- Leave-Group-Out Prediction SS
LGO PreSS
= sum((Yobs-predictedY)^2)
where the predictedY is estimated from a regression done
without that row of data (1 group at a time) during validation.
The value printed for this statistic is the average value
for all the validation replications.
This is not the standard PRESS statistic
(see Leave-One-Out Prediction SS).
- Leave-Group-Out Prediction R^2
LGO PreR^2
= the square of Pearson's Product Moment Correlation Coefficient
(r) calculated from all of the predicted Y's (calculated during
LeaveGroupOut validation) and the corresponding observed Y's.
The value printed for this statistic is the average value
for all the validation replications.
This statistic was suggested by
"Hall (2002)", who
calls this Q^2.
- Leave-Group-Out Mean Absolute Error
LGO MAE = average(abs(Yobs-predictedY))
where the predictedY is estimated from a regression done
without each row of data (1 group at a time, during validation).
The value printed for this statistic is the average value
for all the validation replications.
- Coefficients Table
prints a table which
indicates the variability of the regression coefficients
in the original best model of each size. The table displays
the mean and standard deviation of each coefficient
for validation replicates where that model is ranked #1,
and the mean and standard deviation for each coefficient
for all validation replicates. The difference between the
means (Mean #1 - Mean All)
(when considered with their standard deviations)
is a measure of the bias of the coefficients.
The difference is larger when the Leave X% Out percentage
is higher. The unique statistical properties of the Bootstrap
method make its results the best measure of the bias.
See Problem #4
above.
- Validation Equations
- For Leave-X%-Out validation methods, the equations generated
are from the first balanced validation group.
For Bootstrap validation, the equations generated are
from each of the validation runs
(up to Validate N Times = 100).
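The N Times Ranked #1 tally and one row of the Coefficients Table can be pictured with the following Python sketch. All names and numbers are hypothetical, and it assumes the sample (n-1) form of the standard deviation, which may differ from what CoStat prints.

```python
from collections import Counter
from statistics import mean, stdev

def tally_rank1(r2_by_replicate):
    """Count how often each model has the highest R^2 in a replicate."""
    tally = Counter()
    for r2 in r2_by_replicate:
        tally[max(r2, key=r2.get)] += 1
    return tally

def coefficient_row(values, ranked_first):
    """Summarize one coefficient of one model over the replicates."""
    when_first = [v for v, first in zip(values, ranked_first) if first]
    return {
        "Mean When #1": mean(when_first),
        "Std.Dev. When #1": stdev(when_first),
        "Mean for All Reps": mean(values),
        "Std.Dev. All Reps": stdev(values),
        "Mean #1 - Mean All": mean(when_first) - mean(values),
    }

# Hypothetical R^2 values for two models of the same size in 4 replicates.
reps = [{"A": 0.98, "B": 0.97}, {"A": 0.96, "B": 0.975},
        {"A": 0.99, "B": 0.98}, {"A": 0.97, "B": 0.95}]
tally = tally_rank1(reps)

# Hypothetical values of one coefficient of model A in the same replicates.
coefs = [1.2, 0.8, 1.3, 1.1]
row = coefficient_row(coefs, [max(r, key=r.get) == "A" for r in reps])
print(tally["A"], tally["B"], row["Mean #1 - Mean All"])
```

In this made-up case model A is ranked #1 in 3 of the 4 replicates, and its coefficient averages higher in those replicates than over all replicates, which is the kind of difference the last column of the table reports.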
The Sample Run
The data for the sample run is the
Longley Data,
which has 6 X columns and 1 Y column.
With so few X columns, even Method: All Subsets runs
very quickly.
For the sample run, use File : Open to open the file called
longley.dt in the cohort directory.
Then:
- From the menu bar, choose:
Statistics : Regression : Multiple (Subset Selection).
- Keep If:
- Method: All Subsets
- Max N X's In Model: 6
- N Best: 20
- Print X's As: Numbers
- Print:
- R^2
- Cp
- Leave-One-Out Prediction SS
- Equations With Column Numbers
- Validation Method: Bootstrap
- Validate N Times: 100
- Print:
- N Times Ranked #1
- Leave-Group-Out Prediction SS
- Coefficients Table
- OK
REGRESSION : MULTIPLE : SUBSET SELECTION
2002-10-14 18:42:31
Using: C:\cohort6\LONGLEY.DT
X Columns:
1) GNP def 3) Unemployment 5) 14 yrs
2) GNP 4) Armed Forces 6) Time
Y Column: 7) Employment
Method: All Subsets
Max N X's In Model: 6
Keep If:
Validation Method: Bootstrap
Validate N Times: 100
Total number of data rows = 16
Number of data rows used = 16
The best models:
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
1 1 2
1 2 6
1 3 1
1 4 5
1 5 3
1 6 4
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
2 1 3, 6
2 2 2, 3
2 3 2, 5
2 4 2, 6
2 5 3, 5
...
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
3 1 3, 4, 6
3 2 2, 3, 4
3 3 2, 4, 5
3 4 1, 3, 6
3 5 2, 3, 6
...
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
4 1 2, 3, 4, 6
4 2 3, 4, 5, 6
4 3 1, 3, 4, 6
4 4 2, 3, 4, 5
4 5 1, 2, 4, 5
...
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
5 1 2, 3, 4, 5, 6
5 2 1, 2, 3, 4, 6
5 3 1, 3, 4, 5, 6
5 4 1, 2, 3, 4, 5
5 5 1, 2, 4, 5, 6
5 6 1, 2, 3, 5, 6
n Xs Rank X Columns
---- ---- ---------------------------------------------------------------
6 1 1, 2, 3, 4, 5, 6
The statistics for the best models:
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
1 1 0.9673738 52.949425 7589201.1 97 3096552.4
1 2 0.9434809 100.51322 13016160 3 5214933.2
1 3 0.9426439 102.17939 13500813 0 5173452.8
1 4 0.9223501 142.57869 18421169 0 7827711.3
1 5 0.2525043 1476.0486 1.69617e8 0 69522036
1 6 0.2091301 1562.3943 1.88314e8 0 84857519
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
2 1 0.9823137 25.208364 4679317 58 2120576.2
2 2 0.9806546 28.511069 5076800.9 15 2239282
2 3 0.9790585 31.688487 5674302.5 8 2452798.7
2 4 0.9734556 42.842209 6981785.1 3 3053396
2 5 0.9688932 51.924638 8876891.5 9 4088699.4
...
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
3 1 0.992847 6.2394837 2132127.4 87 1072037.2
3 2 0.9850996 21.662472 4386976.5 2 2151847.5
3 3 0.9835103 24.826233 4941030.8 4 2274526.7
3 4 0.9828873 26.066385 5149873.3 1 2629974.1
3 5 0.9824913 26.854818 5741368.7 3 2962273.2
...
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
4 1 0.9953587 3.2394804 1998041.1 60 1113358.9
4 2 0.994672 4.6064343 2216518.9 27 1748728.6
4 3 0.992854 8.2256744 2540766.1 10 1427904.9
4 4 0.9872082 19.464804 5184195.3 0 3717538.5
4 5 0.9863071 21.258566 4951193.5 0 2793420.3
...
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
5 1 0.9954632 5.0314623 2561221.4 42 2573200.5
5 2 0.9954533 5.0510991 2344384.2 34 1363439.1
5 3 0.9949044 6.1438652 2581127.2 22 3728562.8
5 4 0.9873777 21.127371 5852906.9 1 5618362.2
5 5 0.9868841 22.110031 5832091.7 1 8692038.2
5 6 0.983799 28.251542 7676622.8 0 5942862.2
n_Xs Rank R^2 Cp LOO_Press Rank#1 LGO_Press
---- ---- --------- --------- --------- ------ ---------
6 1 0.995479 7 2886892.5 100 4877001.5
(The validation method randomly assigns rows of data to validation groups,
so the Rank#1 and LGO statistics printed above will vary.
You can reduce the variability by increasing 'Validate N Times'.)
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
1 1 col(7) = 51843.5897819 +0.03475229435*col(2)
1 2 col(7) = -1335105.2441 +716.511764706*col(6)
1 3 col(7) = 33189.1733796 +315.966086377*col(1)
1 4 col(7) = 8380.67418338 +0.48487809832*col(5)
1 5 col(7) = 59286.3553982 +1.88852315637*col(3)
1 6 col(7) = 59301.2646582 +2.30780841269*col(4)
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
2 1 col(7) = -1587138.9078 -0.9955303213*col(3) +847.088742485*col(6)
2 2 col(7) = 52382.1670501 +0.03784032702*col(2) -0.5435743321*col(3)
2 3 col(7) = 88938.7983051 +0.06317243566*col(2) -0.4097429223*col(5)
2 4 col(7) = 1198708.11085 +0.06299295723*col(2) -592.38341363*col(6)
2 5 col(7) = -135.32549508 -1.1151492576*col(3) +0.5877277691*col(5)
...
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
3 1 col(7) = -1797221.1122 -1.4696711189*col(3) -0.7722814913*col(
3 2 col(7) = 53306.4611883 +0.04078799732*col(2) -0.7968165793*col(
3 3 col(7) = 109470.955483 +0.07992714636*col(2) -0.4978793189*col(
3 4 col(7) = -1879252.7716 -64.775010369*col(1) -1.0519056696*col(
3 5 col(7) = -1198891.4309 +0.00899056576*col(2) -0.8904086536*col(
...
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
4 1 col(7) = -3598729.3743 -0.0401904697*col(2) -2.0883907318*col(
4 2 col(7) = -2446174.695 -1.5004764434*col(3) -0.9343638696*col(
4 3 col(7) = -1828915.7377 -7.2827116254*col(1) -1.4734185066*col(
4 4 col(7) = 82613.0992041 +0.06210170815*col(2) -0.5198036017*col(
4 5 col(7) = 120323.684241 -136.32592586*col(1) +0.09659841249*col(
...
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
5 1 col(7) = -3449891.5997 -0.0319613069*col(2) -1.9721499421*col(
5 2 col(7) = -3564921.8744 +27.7148784578*col(1) -0.042127114*col(
5 3 col(7) = -2705054.5008 -43.916959962*col(1) -1.5262904441*col(
5 4 col(7) = 92461.3078244 -48.462828184*col(1) +0.07200384932*col(
5 5 col(7) = -403186.16429 -179.87874985*col(1) +0.09517876035*col(
5 6 col(7) = -1121975.8255 -127.76330578*col(1) +0.03985731002*col(
n Xs Rank Equations With Column Numbers
---- ---- -----------------------------------------------------------------
6 1 col(7) = -3482258.6346 +15.0618722714*col(1) -0.0358191793*col(
Coefficients Table
Variability of Biased Regression Coefficients:
When the same data is used to search for the best subset models and to
estimate the regression coefficients, the estimated regression coefficients
are biased. A measure of the bias is Mean #1 - Mean All: the mean of
the values of a coefficient when it is in validation replicates where
that model is ranked #1 and the mean of the values of that coefficient for
all validation replicates. Usually, the absolute values of the biased
coefficients are too large. The difference is larger when the Leave X% Out
percentage is higher. The unique statistical properties of the Bootstrap
method make its results the best measure of the bias.
(The validation method randomly assigns rows of data to validation groups,
so the following results will vary.
You can reduce the variability by increasing 'Validate N Times'.)
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int 51843.59
2 0.0347523 0.0352033 0.001667 0.0351513 0.0016724 5.2062e-5
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int -1587139
3 -0.99553 -1.019034 0.2133845 -1.024537 0.2057 0.0055035
6 847.08874 852.93932 37.350316 854.7123 36.720343 -1.77298
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int -1797221
3 -1.469671 -1.496287 0.1371075 -1.516463 0.1779194 0.020176
4 -0.772281 -0.802561 0.1441071 -0.798892 0.1911141 -0.003669
6 956.3798 965.36394 38.794978 966.61598 41.199368 -1.252044
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int -3598729
2 -0.04019 -0.053328 0.0250326 -0.042098 0.0268491 -0.01123
3 -2.088391 -2.288518 0.3259705 -2.113154 0.4043208 -0.175364
4 -1.014639 -1.08571 0.1863755 -1.016285 0.2048972 -0.069425
6 1887.4095 2197.3399 568.26124 1933.823 616.52441 263.51689
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int -3449892
2 -0.031961 -0.056069 0.056141 -0.043624 0.0454442 -0.012445
3 -1.97215 -2.319053 0.6524732 -2.135074 0.6091017 -0.183979
4 -1.019969 -1.037111 0.2179187 -1.046027 0.2589043 0.0089161
5 -0.077537 0.0731863 0.2800716 -0.028957 0.3865007 0.1021435
6 1814.1014 2154.0816 1009.3157 2013.4852 861.44673 140.59641
Biased Mean Std.Dev. Mean for Std.Dev. Mean #1 -
X Coef. When #1 When #1 All Reps All Reps Mean All
---- --------- --------- --------- --------- --------- ---------
Int -3482259
1 15.061872 25.654696 109.00544 25.654696 109.00544 0
2 -0.035819 -0.049325 0.0580191 -0.049325 0.0580191 0
3 -2.02023 -2.229122 0.7783844 -2.229122 0.7783844 0
4 -1.033227 -1.111287 0.3492111 -1.111287 0.3492111 0
5 -0.051104 -0.02682 0.569697 -0.02682 0.569697 0
6 1829.1515 2082.7935 994.94113 2082.7935 994.94113 0
The model with n Xs=4 and Rank=1
(the X columns are 2, 3, 4, 6) looks to be a good model.
It has the lowest Cp and LOO_PreSS values of any of the models,
and its LGO_PreSS value is close to the lowest
(among the models shown in this run, only the n Xs=3, Rank=1 model has a lower LGO_PreSS).
And it was Ranked #1 among models with 4 X's
more than twice as often as the next best model with 4 X's.
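Any of the complete equations printed in the sample run can also be evaluated outside CoStat. As a small Python sketch, here is the n Xs=1, Rank=1 equation (col(7) = 51843.5897819 +0.03475229435*col(2)) applied to one made-up row; the function name expected_y and the input value are hypothetical.

```python
def expected_y(col):
    """Evaluate the n Xs=1, Rank=1 equation from the sample run.

    col maps 1-based column numbers to the values in one row of data.
    """
    return 51843.5897819 + 0.03475229435 * col[2]

row = {2: 400000.0}   # hypothetical GNP value, not a row from longley.dt
print(expected_y(row))
```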