The survival package uses [.terms a fair bit, in particular it makes use of the index returned in the 'specials' attribute, but the base R code has at least two problems. First, it does not account for offset terms, if present, and also fails for a formula such as y ~ age + (sex=='male'). Yes, the user should have used I(sex=='male'), but they very often don't and the formula works fine without it, but not [.terms. Users of coxph regularly remind me of these flaws :-)
I first reported this in a bugzilla in 2016. I think that I (finally) have a working fix for all the issues. This is found below in an Sweave document (which is short enough that I've pasted it directly in) along with some further discussion. Three particular questions are 1: my solution for (sex=="male") is a bit of a hack, but it works and I don't see anything better that is simple, and 2: we should then add documentation that both [.terms and drop.terms also remove any offset() terms. (This is currently true anyway.) and 3. Whether this will break other code the currently works around the flaws. If this looks good I could try to create a formal patch, though that is not something I have done before. Terry T. ------------------ \documentclass [11pt]{article} \newcommand{\code}[1]{\texttt{#1}} \title{Subscripting a terms object} \author{Terry Therneau} \begin{document} \maketitle <<echo=FALSE>>= options(continue=' ') @ \section{Introduction} The survival package makes use of the the subscripting method for terms. Consider the following code: <<test1>>= library(survival) fit1 <- coxph(Surv(time, status) ~ age + strata(inst), data=lung) Terms <- terms(fit1) @ In a Cox model the strata term acts as a per-institution intercept, but it is not a part of the $X$ matrix. There are two strategies: \begin{itemize} \item If there are strata by covariate interactions, the strata columns are removed from the $X$ matrix after the call to model.matrix. They (and the intercept) are necessary to correctly form interactions. \item If there are no strata by covariate interactions, the strata term is removed from the terms object before calling model.matrix. Some models may have thousands of strata, a nested case-control design for instance, so this is important for efficiency. \end{itemize} The approach is to add \code{specials='strata'} to our call to \code{terms}, which causes any strata call to be marked: the \code{specials} attribute of the terms object will then contain 3 for the example above (the specials index counts the response), and \code{Terms[-2]} will create a version without the strata term. \section{Errors in [.terms} Our problem is that the subscripting operator for a terms object fails in 2 cases. Consider the following increasingly complex formulas: <<case1>>= library(survival) f1 <- model.frame(terms(Surv(time, status) ~ ph.ecog + strata(inst) + age, specials="strata"), data=lung) f2 <- model.frame(terms(Surv(time, status) ~ offset(sex) + age + strata(ph.ecog), specials="strata"), data= lung) f3 <- model.frame(terms(~ pspline(wt.loss) + (ph.ecog==1) + strata(inst) + (wt.loss + meal.cal)*sex, specials="strata"), lung) test1 <- terms(f1) test2 <- terms(f2) test3 <- terms(f3) @ The root problem is the need for multiple subscripts. Consider \code{test1}. \begin{itemize} \item The object itself is a formula of length 5, with `~' as the first element. \item Attributes are \begin{itemize} \item The variables and predvars attributes are call objects, each a list() with 4 elments: the response and all 3 predictors. \item The factors attribute has 4 rows and 2 columns, with row labels corresponding to the \code{variables} list. \item The specials, response, and offset (if present) attributes give the index of those terms in the variables attribute. \item The term.labels and order attributes omit the resonse and the offset, so have length 2. \item The dataClasses attribute is a character vector of length 4. \end{itemize} \end{itemize} So the ideal result of term1[remove the specials] would use subscript of \begin{itemize} \item \code{[-5]} on the formula itself, variables and predvars attributes \item \code{[-2]} for term.labels \item \code{[-4 , -2, drop=FALSE]} for the factor attribute \item \code{[-2]} for order attribute \item \code{[-4]} for the dataClasses attribute \end{itemize} That will recreate the formula that ``would have been'' had there been no strata term. Now look at the first portion of the code in models.R <<>>= `[.terms` <- function (termobj, i) { resp <- if (attr(termobj, "response")) termobj[[2L]] newformula <- attr(termobj, "term.labels")[i] if (length(newformula) == 0L) newformula <- "1" newformula <- reformulate(newformula, resp, attr(termobj, "intercept"), environment(termobj)) result <- terms(newformula, specials = names(attr(termobj, "specials"))) # Edit the optional attributes } @ The use of reformulate() is a nice trick, and correctly creates all the attributes except predvars and dataClasses. However, the index reported in the specials attribute is generated with reference to the variables attribute, or equivalently the row names of factors, not with respect to the term.labels attribute; the latter lacks the response and any offset terms. Thus our code works for test1 but fails for test2: specials points to the wrong variable in term.labels. The reformulate trick breaks in another way in test3 due to the \code{(ph.ecog==1)} term. In the term.labels attribute the parentheses disappear, and the result of the reformulate call is not a proper formula. The + binds tighter than == leading to an error message that will confuse most users. We can argue, and I probably would, that the user should have used \code{I(ph.ecog==1)}. But they didn't, and without the I() it is a legal formula, or at least one that currently works. Fixing this issue is a harder, since only the underlying formula retains the necessary syntax. Short of deparsing the formula, a hack that appears to work is to surround every term with a set of parenthesis. @ The impact of an offset term was overlooked in second portion of the subscript routine as well, i.e., the ``edit optional attribute'' section which attempts to amend the predvars and dataClasses attributes. Here is an updated subscript routine which works for the examples above. <<newsub>>= mytermsub <- function (termobj, i) { # In the terms object for # model.frame(mpg ~ cyl + offset(-.04*disp) + wt*factor(carb), mtcars) # The subscript 'i' in the call counts variables on the right hand side, # to drop "wt" use a subscript of -3. # Using reformulate() with term.labels is the primary strategy, since the # latter includes all the interactions. But the offset is missing from # term.labels, so we have to be more clever with indexing. rvar <- if (attr(termobj, "response") ==1) termobj[[2L]] j <- seq.int(length(attr(termobj, "term.labels")) + length(attr(termobj, "offset"))) if (!is.null(attr(termobj, "offset"))) { # k = where offset() would have appeared in term.labels, before removals k <- attr(termobj, "response") - attr(termobj, "offset") index1 <- match(j, j[k], nomatch=0)[i] } else index1 <- j[i] newformula <- attr(termobj, "term.labels")[index1] # Adding () around each term is for a formula with + (sex=='male') in the # formula. newformula <- reformulate(paste0("(", newformula, ")"), response= rvar, intercept = attr(termobj, "intercept"), env = environment(termobj)) if (length(newformula) == 0L) newformula <- "1" # addition of an extra specials label causes no harm result <- terms(newformula, specials = names(attr(termobj, "specials"))) # now add back the predvars and dataClasses attributes; which do contain # the response and offset. index2 <- j[i] if (attr(termobj, "response")==1) index2 <- c(1, index2 +1) # if 'i' drops an interaction it won't change predvars or dataClasses index2 <- index2[index2 <= length(rownames(attr(termobj, "factors")))] if (!is.null(attr(termobj, "offset"))) index2 <- index2[-attr(termobj, "offset")] if (!is.null(attr(termobj, "predvars"))) attr(result, "predvars") <- attr(termobj, "predvars")[c(1, index2 +1)] if (!is.null(attr(termobj, "dataClasses"))) attr(result, "dataClasses") <- attr(termobj, "dataClasses")[index2] result } @ Now test this out by dropping the strata and offset terms. <<testit>>= f1b <- model.frame(terms(Surv(time, status) ~ ph.ecog + age, specials="strata"), data=lung) f2b <- model.frame(terms(Surv(time, status) ~ age, specials="strata"), data= lung) f3b <- model.frame(terms(~ pspline(wt.loss) + (ph.ecog==1) + (wt.loss + meal.cal)*sex, specials="strata"), lung) all.equal(attributes(terms(f1b)), attributes(mytermsub(test1, -2))) all.equal(attributes(terms(f2b)), attributes(mytermsub(test2, -3))) all.equal(attributes(terms(f3b)), attributes(mytermsub(test3, -3))) @ The formula itself changes due to the extra parentheses, hence the check of only the attributes. The drop.terms function shares much of the same code. <<drop>>= drop.terms <- function(termobj, dropx = NULL, keep.response = FALSE) { if (is.null(dropx)) return(termobj) if(!inherits(termobj, "terms")) stop(gettextf("'termobj' must be a object of class %s", dQuote("terms")), domain = NA) rvar <- if (keep.response & attr(termobj, "response") ==1) termobj[[2L]] j <- seq.int(length(attr(termobj, "term.labels")) + length(attr(termobj, "offset"))) if (!is.null(attr(termobj, "offset"))) { # k = where offset() would have appeared in term.labels k <- attr(termobj, "response") - attr(termobj, "offset") index1 <- match(j, j[k], nomatch=0)[-dropx] } else index1 <- j[-dropx] newformula <- attr(termobj, "term.labels")[index1] # Adding () around each term is for a formula with + (sex=='male') newformula <- reformulate(paste0("(", newformula, ")"), response = rvar, intercept = attr(termobj, "intercept"), env = environment(termobj)) if (length(newformula) == 0L) newformula <- "1" # addition of an extra specials label causes no harm result <- terms(newformula, specials = names(attr(termobj, "specials"))) # now add back the predvars and dataClasses attributes; which do contain # the response and offset. index2 <- j[-dropx] if (attr(termobj, "response")==1) index2 <- c(1, index2 +1) # if 'i' drops an interaction it won't change predvars or dataClasses index2 <- index2[index2 <= length(rownames(attr(termobj, "factors")))] if (!is.null(attr(termobj, "offset"))) index2 <- index2[-attr(termobj, "offset")] if (!is.null(attr(termobj, "predvars"))) attr(result, "predvars") <-attr(termobj, "predvars")[c(1, index2 +1)] if (!is.null(attr(termobj, "dataClasses"))) attr(result, "dataClasses") <- attr(termobj, "dataClasses")[index2] result } @ \end{document} [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel