[Rd] issue with data()

2021-02-16 Thread Therneau, Terry M., Ph.D. via R-devel
I am testing out the next release of survival, which involves running
R CMD check on 868 CRAN packages that import, depend on, or suggest it.

The survival package has a lot of data sets, most of which are non-trivial
real examples (something I'm proud of).  To save space I've bundled many of
them; e.g., data/cancer.rda has 19 different data frames.

This caused failures in 4 packages, each because they have a line such as
data(lung) or data(breast, package = "survival"), and the data() command
looks for a file name.
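
A minimal sketch of the bundling situation, using small made-up data frames rather than the real survival data. It only illustrates that one .rda file can hold several objects while the file name on disk reflects the bundle, not its contents. The file name cancer.rda mirrors the message above; the data values and the temporary directory are assumptions for illustration.

```r
# Hypothetical stand-ins for two of the bundled data sets (values are made up)
lung   <- data.frame(time = c(306, 455), status = c(2, 2))
breast <- data.frame(time = c(12, 30),  status = c(0, 1))

# Bundle both objects into a single file, as with data/cancer.rda
bundle <- file.path(tempdir(), "cancer.rda")
save(lung, breast, file = bundle)

# load() returns the names of every object it restored
env <- new.env()
restored <- load(bundle, envir = env)
print(restored)

# Only the bundle exists on disk; no file is named after an individual data set
print(basename(bundle))
```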

This is a question about which option is considered the best (perhaps more
of a poll), between two choices:

1. Unbundle them again (bundling does save 1/3 of the space, and I do get
complaints from R CMD build about size).
2. Send notes to the 4 maintainers.  The help files for the data sets
document the usage as "lung" or "breast", not data(lung), so I can
technically claim they have a mistake.
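
A short sketch of the documented usage from option 2, assuming the survival package is installed and makes its data sets available without a data() call, as the help-file usage described above implies. The guard is only there so the snippet does nothing when survival is absent.

```r
# Refer to the data set by name through the package namespace,
# instead of calling data(lung)
if (requireNamespace("survival", quietly = TRUE)) {
  lung <- survival::lung
  print(class(lung))
}
```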

A third option, making the data sets a separate package, is not on the
table.  I use them heavily in my help files and test suite, and since
survival is a recommended package I can't add library(x) statements for
!(x %in% recommended).  I am guessing that this would also break many
dependent packages.

Terry T.

-- 
Terry M Therneau, PhD
Department of Health Science Research
Mayo Clinic
thern...@mayo.edu

"TERR-ree THUR-noh"


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] issue with data()

2021-02-16 Thread Michael Dewey

Dear Terry

Option 2 looks the best to me. They have a relatively simple change to 
make and there are only four of them.


Michael

--
Michael
http://www.dewey.myzen.co.uk/home.html


Re: [Rd] Corrupt internal row names when creating a data.frame with `attributes<-`

2021-02-16 Thread Kevin Ushey
Strictly speaking, I don't think this is a "corrupt" representation,
given that any APIs used to access that internal representation will
call abs() on the row count encoded within. At least, as far as I can
tell, there aren't any adverse downstream effects from having the row
names attribute encoded with this particular internal representation.

On the other hand, the documentation in ?.row_names_info states, for
the 'type' argument:

integer. Currently type = 0 returns the internal "row.names" attribute
(possibly NULL), type = 2 the number of rows implied by the attribute,
and type = 1 the latter with a negative sign for ‘automatic’ row
names.

so one could argue that it's incorrect in light of that documentation
(the row names are "automatic", but the row count is not marked with a
negative sign). Or perhaps this is a different "type" of internal
automatic row name, since it was generated from an already-existing
integer sequence rather than "automatically" in a call to
data.frame().
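
To make the three 'type' values concrete, here is a small sketch comparing a data frame with automatic row names against one with explicit row names. The object names are hypothetical; the calls are the ones quoted from the documentation above.

```r
# Automatic row names: stored compactly as c(NA, -n)
auto <- data.frame(x = 1:3)
# Explicit (character) row names
named <- data.frame(x = 1:3, row.names = c("a", "b", "c"))

.row_names_info(auto, type = 0L)   # internal attribute: NA -3
.row_names_info(auto, type = 1L)   # -3, negative because the names are automatic
.row_names_info(auto, type = 2L)   # 3, the number of rows

.row_names_info(named, type = 1L)  # 3, positive because the names are explicit
```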

Kevin

On Sun, Feb 14, 2021 at 6:51 AM Davis Vaughan wrote:
>
> Hi all,
>
> I believe that the internal row names object created at this line in
> `row_names_gets()` should be using `-n`, not `n`.
> https://github.com/wch/r-source/blob/b30641d3f58703bbeafee101f983b6b263b7f27d/src/main/attrib.c#L71
>
> This can currently generate corrupt internal row names when using
> `attributes<-` or `structure()`, which calls `attributes<-`.
>
> # internal row names are typically `c(NA, -n)`
> df <- data.frame(x = 1:3)
> .row_names_info(df, type = 0L)
> #> [1] NA -3
>
> # using `attributes()` materializes their non-internal form
> attrs <- attributes(df)
> attrs
> #> $names
> #> [1] "x"
> #>
> #> $class
> #> [1] "data.frame"
> #>
> #> $row.names
> #> [1] 1 2 3
>
> # let's make a data frame from scratch with `attributes<-`
> data <- list(x = 1:3)
> attributes(data) <- attrs
>
> # oh no!
> .row_names_info(data, type = 0L)
> #> [1] NA  3
>
> # Note: Must have `nrow(df) > 2` to demonstrate this bug, as otherwise
> # internal row names are not attempted to be created in the C level
> # `row_names_gets()`
>
> Thanks,
> Davis


Re: [Rd] Corrupt internal row names when creating a data.frame with `attributes<-`

2021-02-16 Thread Bill Dunlap
as.matrix.data.frame does not take the absolute value of that number:
  > dPos <- structure(list(X=101:103,201:203),class="data.frame",row.names=c(NA_integer_,+3L))
  > dNeg <- structure(list(X=101:103,201:203),class="data.frame",row.names=c(NA_integer_,-3L))
  > rownames(as.matrix(dPos))
  [1] "1" "2" "3"
  > rownames(as.matrix(dNeg))
  NULL

-Bill


Re: [Rd] Corrupt internal row names when creating a data.frame with `attributes<-`

2021-02-16 Thread Davis Vaughan
This originally came up in this dplyr issue:
https://github.com/tidyverse/dplyr/issues/5745

There, `tibble::column_to_rownames()` failed because it eventually checks
`.row_names_info(.data) > 0L` to decide whether the row names are automatic,
which is in line with the documentation that Kevin pointed out: "type = 1
the latter with a negative sign for ‘automatic’ row names."
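
A sketch of that kind of check, together with one way to build a data frame by hand so that the check reports automatic row names. `has_rownames` here is a hypothetical helper written for illustration, not tibble's actual code; `.set_row_names()` is the base R helper that produces the compact `c(NA, -n)` form.

```r
# Hypothetical helper mirroring the check described above:
# a positive value means the row names are not automatic
has_rownames <- function(df) .row_names_info(df) > 0L

# Build a data frame by hand, setting compact automatic row names explicitly
n <- 3L
manual <- list(x = 1:3)
attr(manual, "row.names") <- .set_row_names(n)  # c(NA, -n)
class(manual) <- "data.frame"

.row_names_info(manual, type = 0L)
has_rownames(manual)

# For comparison, explicit row names make the check return TRUE
has_rownames(data.frame(x = 1:3, row.names = c("a", "b", "c")))
```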

Davis


[Rd] NCOL, as.matrix, cbind and NULL

2021-02-16 Thread Gabriel Becker
Hi all,

I've known for a while that NROW(NULL) gives 0 (whereas nrow(NULL) gives
NULL), so I naively expected NCOL to do the same.

Of course, it does not, and is documented* (more on this in a bit) as not
doing so. For those reading without the documentation open, it gives 1.

The relevant doc states:

‘nrow’ and ‘ncol’ return the number of rows or columns present in ‘x’.
 ‘NCOL’ and ‘NROW’ do the same treating a vector as 1-column matrix, even a
0-length vector, compatibly with ‘as.matrix()’ or ‘cbind()’, see the
example.

But there are a couple of fiddly bits here. First is that it says "even a
0-length *vector*" (emphasis mine), but we have

> is.vector(NULL)
[1] FALSE

As opposed, of course, to, e.g., numeric(0).

Next is the claim of compatibility with as.matrix and cbind, but in both my
released version of R (4.0.2) and devel that I just built from trunk, we
have

> NCOL(NULL)
[1] 1
> cbind(NULL)
NULL
> as.matrix(NULL)
Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x),  :
  'data' must be of a vector type, was 'NULL'


So in fact each function is treating NULL completely differently.
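
For contrast, a small sketch of the same calls on a genuine zero-length vector, where the documented compatibility with as.matrix() does hold, next to the NULL results reported above. The variable name is arbitrary.

```r
x <- numeric(0)

is.vector(x)          # TRUE: a genuine zero-length vector
c(NROW(x), NCOL(x))   # 0 rows, 1 column
dim(as.matrix(x))     # as.matrix() agrees: a 0 x 1 matrix

is.vector(NULL)             # FALSE
c(NROW(NULL), NCOL(NULL))   # same counts as for numeric(0)
is.null(cbind(NULL))        # TRUE: cbind() returns NULL, not a 0 x 1 matrix
```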


The fix (to change behavior or to add a mention in the documentation that
NULL is treated as a 0-length vector) would be easy to do. Should I file a
bug with a patch for this?


Best,

~G


[Rd] issue with print()ing multibyte characters on R 4.0.4

2021-02-16 Thread Hiroaki Yutani
Hi all,

I saw several people using a Japanese locale report that, on R 4.0.4,
print() doesn't display Japanese characters correctly. This seems to happen
only on Windows and macOS (I usually use Linux and don't see this problem).

For example, in the result below, "鬼" and "外" are displayed in "\u"
format. Curiously, "は" is displayed as is.

> "鬼は外"
[1] "\u9b3cは\u5916"

But if I use functions such as message() or cat(), the string is displayed
as is.

> message("鬼は外")
鬼は外
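
A small sketch for checking that the string itself is intact and that only its printed representation differs. The string is written with \u escapes so the snippet is ASCII-only; the code points for "鬼" and "外" are the ones shown in the print() output above. What print() and cat() actually display depends on the platform and locale, so no output is claimed for those two calls.

```r
# The string from the example, written with escapes
x <- "\u9b3c\u306f\u5916"

nchar(x)                          # 3 characters
sprintf("U+%04X", utf8ToInt(x))   # code point of each character

# Compare the two display paths discussed above
print(x)
cat(x, "\n")
```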

Since only Windows and macOS seem to be affected, I suspect this is somehow
related to the following change described in the release notes (though I
have no idea what the change actually is):

The internal table for iswprint (used on Windows, macOS and AIX) has been
updated to include many recent Unicode characters.
(https://cran.r-project.org/doc/manuals/r-release/NEWS.html)

Before filing this issue on Bugzilla, I'd like to confirm whether this is
an intended change and, if it is, discuss how to improve this behaviour.

Best,
Hiroaki Yutani
