Following the advice, I tried the rms package.
To recap, I have data on 104 patients (x6.df), consisting of 5 explanatory variables and one binary outcome (poor/good) (the model 2 strategy from my previous post). The outcome comprises 25 poor results and 79 good results, so my events per variable (EPV) is only 5, much less than the rule of thumb of 10.
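(Just to spell out the arithmetic behind that number:)

events <- 25   # poor outcomes
p      <- 5    # candidate predictors
events / p     # EPV = 5, well below the suggested minimum of 10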

My questions are about validate() and pentrace() in the rms package.
I present some code and results below.
I appreciate any help in advance.

> x6.lrm <- lrm(outcome ~ stenosis+x1+x2+procedure+ClinicalScore, data=x6.df, x=T, y=T)

> x6.lrm
...
Obs            104    LR chi2      29.24    R2      0.367    C       0.816
 negative       79    d.f.             5    g       1.633    Dxy     0.632
 positive       25    Pr(> chi2) <0.0001    gr      5.118    gamma   0.632
max |deriv|  1e-08                          gp      0.237    tau-a   0.233
                                            Brier   0.127

               Coef    S.E.   Wald Z Pr(>|Z|)
Intercept      -5.5328 2.6287 -2.10  0.0353
stenosis       -0.0150 0.0284 -0.53  0.5979
x1              3.0425 0.9100  3.34  0.0008
x2             -0.7534 0.4519 -1.67  0.0955
procedure       1.2085 0.5717  2.11  0.0345
ClinicalScore   0.3762 0.2287  1.65  0.0999

This does not look too bad. Next, validation by bootstrap ...

> validate(x6.lrm, B=200, bw=F)
          index.orig training    test optimism index.corrected   n
Dxy           0.6324   0.6960  0.5870   0.1091          0.5233 200
R2            0.3668   0.4370  0.3154   0.1216          0.2453 200
Intercept     0.0000   0.0000 -0.2007   0.2007         -0.2007 200
Slope         1.0000   1.0000  0.7565   0.2435          0.7565 200
Emax          0.0000   0.0000  0.0999   0.0999          0.0999 200
D             0.2716   0.3368  0.2275   0.1093          0.1623 200
U            -0.0192  -0.0192  0.0369  -0.0561          0.0369 200
Q             0.2908   0.3560  0.1906   0.1654          0.1254 200
B             0.1272   0.1155  0.1384  -0.0229          0.1501 200
g             1.6328   2.0740  1.4647   0.6093          1.0235 200
gp            0.2367   0.2529  0.2189   0.0341          0.2026 200

The apparent Dxy is 0.63 and the bias-corrected Dxy is 0.52. The maximum absolute calibration error (Emax) is estimated at about 0.10. The changes in intercept and slope are also substantial. All in all, this looks like evidence that I am somewhat overfitting the data, right?
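(For my own bookkeeping: the bias-corrected figures are just the apparent indexes minus the estimated optimism, and Somers' Dxy converts to a c-index via C = (Dxy + 1)/2. A quick check with the numbers printed above:)

dxy.apparent  <- 0.6324
dxy.optimism  <- 0.1091
dxy.corrected <- dxy.apparent - dxy.optimism   # 0.5233, as in the table
(dxy.apparent  + 1)/2                          # 0.816, matches the C statistic from lrm
(dxy.corrected + 1)/2                          # about 0.76 after optimism correction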

Next, the same validation but with backwards step-down variable selection ...

> validate(x6.lrm, B=200, bw=T)

                Backwards Step-down - Original Model

 Deleted        Chi-Sq d.f. P      Residual d.f. P      AIC
 stenosis       0.28   1    0.5979 0.28     1    0.5979 -1.72
 ClinicalScore  2.60   1    0.1068 2.88     2    0.2370 -1.12
 x2             2.86   1    0.0910 5.74     3    0.1252 -0.26

Approximate Estimates after Deleting Factors

             Coef   S.E. Wald Z         P
Intercept  -5.865 1.4136 -4.149 3.336e-05
x1          2.915 0.8685  3.357 7.889e-04
procedure   1.072 0.5590  1.918 5.508e-02

Factors in Final Model

[1] x1         procedure
          index.orig training    test optimism index.corrected   n
Dxy           0.5661   0.6755  0.5559   0.1196          0.4464 200
R2            0.2876   0.4085  0.2784   0.1301          0.1575 200
Intercept     0.0000   0.0000 -0.2459   0.2459         -0.2459 200
Slope         1.0000   1.0000  0.7300   0.2700          0.7300 200
Emax          0.0000   0.0000  0.1173   0.1173          0.1173 200
D             0.2038   0.3130  0.1970   0.1160          0.0877 200
U            -0.0192  -0.0192  0.0382  -0.0574          0.0382 200
Q             0.2230   0.3323  0.1589   0.1734          0.0496 200
B             0.1441   0.1192  0.1452  -0.0261          0.1702 200
g             1.2628   1.9524  1.3222   0.6302          0.6326 199
gp            0.2041   0.2430  0.2043   0.0387          0.1654 199

If only the two selected variables (x1 and procedure) are kept, the bias-corrected Dxy drops to 0.45.

[Question 1]
Given the EPV problem, should I keep the full (5-variable) model, or can I use the 2-variable model (x1 and procedure) that validate() with step-down selection suggests?

[Question 2]
If I use the 2-variable model, should I refit it, i.e.
x2.lrm <- lrm(outcome ~ x1 + procedure, data=x6.df, x=T, y=T)
or should I keep the values shown above by the validate() function?
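(For concreteness, this is the refit I have in mind, a sketch reusing the variable names from the lrm() call above; I realize that revalidating the already-selected model this way leaves the selection step outside the bootstrap:)

x2.lrm <- lrm(outcome ~ x1 + procedure, data=x6.df, x=TRUE, y=TRUE)
coef(x2.lrm)              # exact refit, to compare with the approximate estimates above
validate(x2.lrm, B=200)   # naive: the earlier step-down is not repeated in these resamples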

Next, shrinkage ...

> pentrace(x6.lrm, seq(0, 5.0, by=0.05))
Best penalty:
penalty         df
   3.05   4.015378
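(In case it is useful, I also looked at the whole trace rather than only the optimum, using what I understand are the print and plot methods for pentrace objects:)

pen <- pentrace(x6.lrm, seq(0, 5.0, by=0.05))
print(pen)   # best penalty plus the AIC/BIC path over the whole grid
plot(pen)    # information criteria as a function of the penalty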

The best penalty is 3.05, so I update the model with this penalty to obtain the corresponding penalized fit:

> x6.lrm.pen <- update(x6.lrm, penalty=3.05, x=T, y=T)
> x6.lrm.pen
.....
Penalty factors

 simple nonlinear interaction nonlinear.interaction
   3.05      3.05        3.05                  3.05
Final penalty on -2 log L
     [,1]
[1,]  3.8

Obs            104    LR chi2      28.18    R2      0.313    C       0.818
 negative       79    d.f.         4.015    g       1.264    Dxy     0.635
 positive       25    Pr(> chi2) <0.0001    gr      3.538    gamma   0.637
max |deriv|  3e-05                          gp      0.201    tau-a   0.234
                                            Brier   0.129

               Coef    S.E.   Wald Z Pr(>|Z|) Penalty Scale
Intercept      -4.7246 2.2429 -2.11  0.0352    0.0000
stenosis       -0.0105 0.0240 -0.44  0.6621   17.8021
x1              2.3605 0.7254  3.25  0.0011    0.6054
x2             -0.5385 0.3653 -1.47  0.1404    1.2851
procedure       0.9247 0.4844  1.91  0.0563    0.8576
ClinicalScore   0.3046 0.1874  1.63  0.1041    2.4779

Arranging the coefficients of the two models side by side, together with the absolute differences between them:

> cbind(coef(x6.lrm), coef(x6.lrm.pen), abs(coef(x6.lrm)-coef(x6.lrm.pen)))
                      [,1]        [,2]        [,3]
Intercept      -5.53281808 -4.72464766 0.808170417
stenosis       -0.01496757 -0.01050797 0.004459599
x1              3.04248257  2.36051833 0.681964238
x2             -0.75335619 -0.53854750 0.214808685
procedure       1.20847252  0.92474708 0.283725441
ClinicalScore   0.37623189  0.30457557 0.071656322
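(As a further check, and assuming I am right that effective.df() and validate() accept a penalized lrm fit, I also looked at the degrees of freedom the penalized model actually spends and at its bootstrap validation:)

effective.df(x6.lrm.pen)      # about 4.0 effective d.f. versus 5 in the unpenalized fit
validate(x6.lrm.pen, B=200)   # assuming validate() handles the penalized fit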

[Question 3]
Is this penalized model the one I should present to my colleagues? I still have the EPV problem. Or is the EPV problem acceptable once I use penalization?

I am still wondering what I can do to mitigate the EPV problem. Collecting new data would take a long time and a great deal of work...
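(One more outcome-blinded reduction I could still try, along the lines of the varclus suggestion quoted below; a sketch assuming the Hmisc package is available:)

library(Hmisc)
vc <- varclus(~ stenosis + x1 + x2 + procedure + ClinicalScore, data=x6.df)
plot(vc)   # keep one representative per tight cluster, chosen without looking at the outcome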


(11/04/22 1:46), khos...@med.kobe-u.ac.jp wrote:
Thank you for your comment.
I forgot to mention that varclus and pvclust showed similar results for
my data.

BTW, I did not realize rms is a replacement for the Design package.
I appreciate your suggestion.
--
KH

(11/04/21 8:00), Frank Harrell wrote:
I think it's OK. You can also use the Hmisc package's varclus function.
Frank


細田弘吉 wrote:

Dear Prof. Harrell,

Thank you very much for your quick advice.
I will try rms package.

Regarding model reduction, is my model 2 method (clustering and recoding
that are blinded to the outcome) permissible?

Sincerely,

--
KH

(11/04/20 22:01), Frank Harrell wrote:
Deleting variables is a bad idea unless you make that a formal part of the BMA so that the attempt to delete variables is penalized for. Instead of BMA I recommend simple penalized maximum likelihood estimation (see the lrm function in the rms package) or pre-modeling data reduction that is blinded to the outcome variable.
Frank


細田弘吉 wrote:

Hi everybody,
I apologize in advance for the long mail.

I have data on 104 patients, consisting of 15 explanatory variables and one binary outcome (poor/good). The outcome comprises 25 poor results and 79 good results. I tried to analyze the data with logistic regression. However, 15 variables and 25 events mean that the events per variable (EPV) is much less than 10 (the rule of thumb). Therefore, I used the R package "BMA" to perform logistic regression with Bayesian model averaging to avoid this problem.

model 1 (full model):
x1, x2, x3, x4 are continuous variables and others are binary data.

x16.bic.glm <- bic.glm(outcome ~ ., data=x16.df,
                       glm.family="binomial", OR=20, strict=FALSE)
summary(x16.bic.glm)
(The output below has been cut off at the right edge to save space)

62 models were selected
Best 5 models (cumulative posterior probability = 0.3606 ):

                 p!=0      EV          SD        model 1   model 2
Intercept       100      -5.1348545   1.652424   -4.4688   -5.15  -5.1536
age               3.3     0.0001634   0.007258      .
sex               4.0
 .M                      -0.0243145   0.220314      .
side             10.8
 .R                       0.0811227   0.301233      .
procedure        46.9    -0.5356894   0.685148      .      -1.163
symptom           3.8    -0.0099438   0.129690      .         .
stenosis          3.4    -0.0003343   0.005254      .
x1                3.7    -0.0061451   0.144084      .
x2              100.0     3.1707661   0.892034    3.2221     3.11
x3               51.3    -0.4577885   0.551466   -0.9154      .
HT                4.6
 .positive                0.0199299   0.161769      .         .
DM                3.3
 .positive               -0.0019986   0.105910      .         .
IHD               3.5
 .positive                0.0077626   0.122593      .         .
smoking           9.1
 .positive                0.0611779   0.258402      .         .
hyperlipidemia   16.0
 .positive                0.1784293   0.512058      .         .
x4                8.2     0.0607398   0.267501      .         .


nVar                    2          2          1          3          3
BIC             -376.9082  -376.5588  -376.3094  -375.8468  -374.5582
post prob           0.104      0.087      0.077      0.061      0.032

[Question 1]
Is it O.K. to calculate an odds ratio and its 95% confidence interval from "EV" (the posterior mean) and "SD" (the posterior standard deviation)?
For example, the odds ratio and 95% CI for x2 can be calculated as:
exp(3.1707661)
[1] 23.82573    -----> odds ratio
exp(3.1707661 + 1.96*0.892034)
[1] 136.8866
exp(3.1707661 - 1.96*0.892034)
[1] 4.146976
------------------> 95% CI (4.1 to 136.9)
Is this O.K.?

[Question 2]
Is it permissible to delete variables with small values of "p!=0" and "EV", such as age (3.3% and 0.0001634), in order to reduce the number of explanatory variables, and then rebuild the model without those variables in a new BMA session?

model 2 (reduced model):
I used the R package "pvclust" to reduce the model. The result suggested that x1, x2 and x4 belonged to the same cluster, so I kept only x2. Based on subject-matter knowledge, I then made a simple unweighted sum by counting the number of clinical features present. For 9 features (sex, side, HT2, hyperlipidemia, DM, IHD, smoking, symptom, age), the sum ranges from 0 to 9; I defined this score as ClinicalScore. Consequently, I made a new data set (x6.df), which consists of 5 variables (stenosis, x2, x3, procedure, and ClinicalScore) and one binary outcome (poor/good). Then, for an alternative BMA session...

BMAx6.glm <- bic.glm(postopDWI_HI ~ ., data=x6.df,
                     glm.family="binomial", OR=20, strict=FALSE)
summary(BMAx6.glm)
(The output below has been cut off at the right edge to save space)
Call:
bic.glm.formula(f = postopDWI_HI ~ ., data = x6.df, glm.family =
"binomial", strict = FALSE, OR = 20)


13 models were selected
Best 5 models (cumulative posterior probability = 0.7626 ):

p!=0 EV SD model 1 model 2
Intercept 100 -5.6918362 1.81220 -4.4688 -6.3166
stenosis 8.1 -0.0008417 0.00815 . .
x2 100.0 3.0606165 0.87765 3.2221 3.1154
x3 46.5 -0.3998864 0.52688 -0.9154 .
procedure 49.3 0.5747013 0.70164 . 1.1631
ClinicalScore 27.1 0.0966633 0.19645 . .


nVar                2          2          1          3          3
BIC         -376.9082  -376.5588  -376.3094  -375.8468  -375.5025
post prob       0.208      0.175      0.154      0.122      0.103

[Question 3]
Am I doing this correctly? That is, is this kind of model reduction permissible with BMA?

[Question 4]
I still have 5 variables, which violates the rule of thumb "EPV > 10". Is it permissible to delete the "stenosis" variable because of its small "EV"? Or is the EPV violation acceptable because this is BMA?

Sorry for the long post.

I appreciate your help very much in advance.

--
KH




-----
Frank Harrell
Department of Biostatistics, Vanderbilt University


