[Rd] problems with [.terms

2021-08-21 Thread Therneau, Terry M., Ph.D. via R-devel
The survival package uses [.terms a fair bit, in particular it makes use of the 
index 
returned in the 'specials' attribute, but the base R code has at least two 
problems.
First, it does not account for offset terms, if present, and also fails for a 
formula such 
as  y ~ age + (sex=='male').   Yes, the user should have used I(sex=='male'), 
but they 
very often don't and the formula works fine without it, but not [.terms. 
Users of 
coxph regularly remind me of these flaws :-)

I first reported this in a bugzilla in 2016.   I think that I (finally) have a 
working fix 
for all the issues.   This is found below in an Sweave document (which is short 
enough 
that I've pasted it directly in) along with some further discussion.   Three 
particular 
questions are 1: my solution for (sex=="male") is a bit of a hack, but it works 
and I 
don't see anything better that is simple, and 2: we should then add 
documentation that 
both [.terms and drop.terms also remove any offset() terms.  (This is currently 
true 
anyway.) and 3. Whether this will break other code the currently works around 
the flaws.

If this looks good I could try to create a formal patch, though that is not 
something I 
have done before.

Terry T.

--
\documentclass [11pt]{article}
\newcommand{\code}[1]{\texttt{#1}}

\title{Subscripting a terms object}
\author{Terry Therneau}
\begin{document}
\maketitle
<>=
options(continue=' ')
@

\section{Introduction}
The survival package makes use of the the subscripting method for terms.
Consider the following code:
<>=
library(survival)
fit1 <- coxph(Surv(time, status) ~ age + strata(inst), data=lung)
Terms <- terms(fit1)
@

In a Cox model the strata term acts as a per-institution intercept, but it
is not a part of the $X$ matrix.
There are two strategies:
\begin{itemize}
   \item If there are strata by covariate interactions, the strata columns
 are removed from the $X$ matrix after the call to model.matrix.
 They (and the intercept) are necessary to correctly form interactions.
   \item If there are no strata by covariate interactions, the strata term is
 removed from the terms object before calling model.matrix.
 Some models may have thousands of strata, a nested case-control design for
 instance, so this is important for efficiency.
\end{itemize}

The approach is to add \code{specials='strata'} to our call to \code{terms},
which causes any strata call to be marked: the \code{specials} attribute
of the terms object will then contain 3 for the example above (the specials
index counts the response),
and \code{Terms[-2]} will create a version without the strata term.

\section{Errors in [.terms}
Our problem is that the subscripting operator for a terms object fails in 2
cases.
Consider the following increasingly complex formulas:

<>=
library(survival)
f1 <- model.frame(terms(Surv(time, status) ~ ph.ecog + strata(inst) + age,
specials="strata"), data=lung)
f2 <- model.frame(terms(Surv(time, status) ~ offset(sex) + age +
strata(ph.ecog), specials="strata"), data= lung)
f3 <- model.frame(terms(~ pspline(wt.loss) + (ph.ecog==1) + strata(inst) +
 (wt.loss + meal.cal)*sex, specials="strata"), lung)
test1 <- terms(f1)
test2 <- terms(f2)
test3 <- terms(f3)
@

The root problem is the need for multiple subscripts.  Consider \code{test1}.
\begin{itemize}
   \item The object itself is a formula of length 5, with `~' as the first
 element.
   \item Attributes are
 \begin{itemize}
   \item The variables and predvars attributes are call objects, each a 
list()
 with 4 elments: the response and all 3 predictors.
   \item The factors attribute has 4 rows and 2 columns, with row labels
 corresponding to the \code{variables} list.
   \item The specials, response, and offset (if present) attributes give
 the index of those terms in the variables attribute.
   \item The term.labels and order attributes omit the resonse and the 
offset,
 so have length 2.
   \item The dataClasses attribute is a character vector of length 4.
 \end{itemize}
\end{itemize}
So the ideal result of  term1[remove the specials] would use subscript of
\begin{itemize}
   \item \code{[-5]} on the formula itself, variables and predvars attributes
   \item \code{[-2]} for term.labels
   \item \code{[-4 , -2, drop=FALSE]} for the factor attribute
   \item \code{[-2]} for order attribute
   \item \code{[-4]} for the dataClasses attribute
\end{itemize}

That will recreate the formula that ``would have been'' had there been no
strata term.  Now look at the first portion of the code in models.R

<<>>=
`[.terms` <- function (termobj, i)
{
 resp <- if (attr(termobj, "response")) termobj[[2L]]
 newformula <- attr(termobj, "term.labels")[i]
 if (length(newformula) == 0L) newformula <- "1"
 newformula <- reformulate(newformula, resp, attr(termobj, "intercept"),
  

[Rd] Add keywords to descriptions of R packages

2021-08-21 Thread beliavsky--- via R-devel
It's great that there are almost 20,000 R packages and that experts have 
curated the R packages in task views, but the volume of packages is such that 
task view maintainers may limit additions because of time constraints or 
because they don't want the task views to become unwieldy. I suggest that 
keywords, chosen by package authors from an approved list, be allowed in 
package descriptions. The keywords could be the same as task view names (for 
example "timeseries") or they could be more specialized (for example "garch"). 
Users would be able to search packages by keyword. Ideally, all R packages 
would be indexed and discoverable this way.
I maintain a list of time series packages that were not in the task view at the 
time I added them: 
https://github.com/Beliavsky/R-Time-Series-Task-View-Supplement/blob/main/README.md.
 There are almost 130 entries. I also have one for finance: 
https://github.com/Beliavsky/R-Finance-Task-View-Supplement .
[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Add keywords to descriptions of R packages

2021-08-21 Thread Duncan Murdoch

On 21/08/2021 3:11 p.m., beliavsky--- via R-devel wrote:

It's great that there are almost 20,000 R packages and that experts have curated the R packages in 
task views, but the volume of packages is such that task view maintainers may limit additions 
because of time constraints or because they don't want the task views to become unwieldy. I suggest 
that keywords, chosen by package authors from an approved list, be allowed in package descriptions. 
The keywords could be the same as task view names (for example "timeseries") or they 
could be more specialized (for example "garch"). Users would be able to search packages 
by keyword. Ideally, all R packages would be indexed and discoverable this way.
I maintain a list of time series packages that were not in the task view at the time I added them: https://github.com/Beliavsky/R-Time-Series-Task-View-Supplement/blob/main/README.md. There are almost 130 entries. I also have one for finance: https://github.com/Beliavsky/R-Finance-Task-View-Supplement 


There are already keywords, and (in my opinion) they are not very 
useful. See Writing R Extensions, section 2.1.1.  Even Roxygen2 supports 
them, using the @keywords token.


Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Add keywords to descriptions of R packages

2021-08-21 Thread Gábor Csárdi
FWIW R-hub indexes package metadata, and you can search it online:
https://r-pkg.org/search.html?q=time+series
or from within R:
https://r-hub.github.io/pkgsearch/

Gabor

On Sun, Aug 22, 2021 at 12:13 AM beliavsky--- via R-devel
 wrote:
>
> It's great that there are almost 20,000 R packages and that experts have 
> curated the R packages in task views, but the volume of packages is such that 
> task view maintainers may limit additions because of time constraints or 
> because they don't want the task views to become unwieldy. I suggest that 
> keywords, chosen by package authors from an approved list, be allowed in 
> package descriptions. The keywords could be the same as task view names (for 
> example "timeseries") or they could be more specialized (for example 
> "garch"). Users would be able to search packages by keyword. Ideally, all R 
> packages would be indexed and discoverable this way.
> I maintain a list of time series packages that were not in the task view at 
> the time I added them: 
> https://github.com/Beliavsky/R-Time-Series-Task-View-Supplement/blob/main/README.md.
>  There are almost 130 entries. I also have one for finance: 
> https://github.com/Beliavsky/R-Finance-Task-View-Supplement .
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel