The survival package uses [.terms a fair bit, in particular it makes use of the
index
returned in the 'specials' attribute, but the base R code has at least two
problems.
First, it does not account for offset terms, if present, and also fails for a
formula such
as y ~ age + (sex=='male'). Yes, the user should have used I(sex=='male'),
but they
very often don't and the formula works fine without it, but not [.terms.
Users of
coxph regularly remind me of these flaws :-)
I first reported this in a bugzilla in 2016. I think that I (finally) have a
working fix
for all the issues. This is found below in an Sweave document (which is short
enough
that I've pasted it directly in) along with some further discussion. Three
particular
questions are 1: my solution for (sex=="male") is a bit of a hack, but it works
and I
don't see anything better that is simple, and 2: we should then add
documentation that
both [.terms and drop.terms also remove any offset() terms. (This is currently
true
anyway.) and 3. Whether this will break other code the currently works around
the flaws.
If this looks good I could try to create a formal patch, though that is not
something I
have done before.
Terry T.
--
\documentclass [11pt]{article}
\newcommand{\code}[1]{\texttt{#1}}
\title{Subscripting a terms object}
\author{Terry Therneau}
\begin{document}
\maketitle
<>=
options(continue=' ')
@
\section{Introduction}
The survival package makes use of the the subscripting method for terms.
Consider the following code:
<>=
library(survival)
fit1 <- coxph(Surv(time, status) ~ age + strata(inst), data=lung)
Terms <- terms(fit1)
@
In a Cox model the strata term acts as a per-institution intercept, but it
is not a part of the $X$ matrix.
There are two strategies:
\begin{itemize}
\item If there are strata by covariate interactions, the strata columns
are removed from the $X$ matrix after the call to model.matrix.
They (and the intercept) are necessary to correctly form interactions.
\item If there are no strata by covariate interactions, the strata term is
removed from the terms object before calling model.matrix.
Some models may have thousands of strata, a nested case-control design for
instance, so this is important for efficiency.
\end{itemize}
The approach is to add \code{specials='strata'} to our call to \code{terms},
which causes any strata call to be marked: the \code{specials} attribute
of the terms object will then contain 3 for the example above (the specials
index counts the response),
and \code{Terms[-2]} will create a version without the strata term.
\section{Errors in [.terms}
Our problem is that the subscripting operator for a terms object fails in 2
cases.
Consider the following increasingly complex formulas:
<>=
library(survival)
f1 <- model.frame(terms(Surv(time, status) ~ ph.ecog + strata(inst) + age,
specials="strata"), data=lung)
f2 <- model.frame(terms(Surv(time, status) ~ offset(sex) + age +
strata(ph.ecog), specials="strata"), data= lung)
f3 <- model.frame(terms(~ pspline(wt.loss) + (ph.ecog==1) + strata(inst) +
(wt.loss + meal.cal)*sex, specials="strata"), lung)
test1 <- terms(f1)
test2 <- terms(f2)
test3 <- terms(f3)
@
The root problem is the need for multiple subscripts. Consider \code{test1}.
\begin{itemize}
\item The object itself is a formula of length 5, with `~' as the first
element.
\item Attributes are
\begin{itemize}
\item The variables and predvars attributes are call objects, each a
list()
with 4 elments: the response and all 3 predictors.
\item The factors attribute has 4 rows and 2 columns, with row labels
corresponding to the \code{variables} list.
\item The specials, response, and offset (if present) attributes give
the index of those terms in the variables attribute.
\item The term.labels and order attributes omit the resonse and the
offset,
so have length 2.
\item The dataClasses attribute is a character vector of length 4.
\end{itemize}
\end{itemize}
So the ideal result of term1[remove the specials] would use subscript of
\begin{itemize}
\item \code{[-5]} on the formula itself, variables and predvars attributes
\item \code{[-2]} for term.labels
\item \code{[-4 , -2, drop=FALSE]} for the factor attribute
\item \code{[-2]} for order attribute
\item \code{[-4]} for the dataClasses attribute
\end{itemize}
That will recreate the formula that ``would have been'' had there been no
strata term. Now look at the first portion of the code in models.R
<<>>=
`[.terms` <- function (termobj, i)
{
resp <- if (attr(termobj, "response")) termobj[[2L]]
newformula <- attr(termobj, "term.labels")[i]
if (length(newformula) == 0L) newformula <- "1"
newformula <- reformulate(newformula, resp, attr(termobj, "intercept"),