Re: [Rd] request for discussion on lonely doc patch suggestion

2025-03-24 Thread Duncan Murdoch
I sent some comments directly to Ben.  I just want to reply publicly to 
this part:


On 2025-03-24 1:18 p.m., Ben Bolker wrote:


The patch file is attached (also available at bugzilla, if it doesn't
get through to the list). I find the patch format a little hard to read,
so I'm reproducing just the *new* text below.


I agree absolutely about the lack of readability of patch files.  A side 
by side display is much nicer.  If anyone out there isn't using one, you 
should.


I really like the one I use ("Beyond Compare"), but it's not open 
source.  I've been using it for a very long time (20 years or more, I 
think), and I suspect there are very good open source competitors out 
there now (and may have been for all the time I've been using BC). 
Suggestions?


Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Why does NextMethod() pick up duplicate arguments in '...' if given positionally at top level?

2025-03-24 Thread Michael Chirico
Consider:

foo <- function(x, y, ...) {
  UseMethod("foo")
}

foo.default <- function(x, y = 0, ...) {
  cat(sprintf("%s: x=%s, y=%s\n", as.character(match.call()[[1L]]), x, y))
  if (...length()) str(list(...))
}

foo.C <- function(x, y = 3, ...) {
  cat(sprintf("%s: x=%s, y=%s\n", as.character(match.call()[[1L]]), x, y))
  if (...length()) str(list(...))
  NextMethod("foo", x = x, y = y)
}

c <- structure(class = "C", 1)

# 'x' winds up in ..1
foo(c)
# foo.C: x=1, y=3
# foo.default: x=1, y=3
# List of 1
#  $ : 'C' num 1

# empty ...!
foo(x=c)
# foo.C: x=1, y=3
# foo.default: x=1, y=3

# now x is ..1 and y is ..2
foo(c, 4)
# foo.C: x=1, y=4
# foo.default: x=1, y=4
# List of 2
#  $ : 'C' num 1
#  $ : num 4

# perhaps predictably, ...length()==0
foo(x=c, y=4)
# foo.C: x=1, y=4
# foo.default: x=1, y=4

I've tried re-reading ?NextMethod a few times as well as R-lang [1] &
can't make heads or tails of this. I've also come across a related 2012
(!) thread [2] and a tangentially related bug [3].

Is this intended behavior? If so, might I reiterate Henrik's long-ago
request for better documentation of how to work around this?

For some added context: where I actually encountered this, my S3
method is mainly written to override the defaults of a parent class's
method.
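
One possible workaround, sketched here under the assumption that the child method only needs to change defaults: skip NextMethod() and re-dispatch explicitly with every argument named, after removing the current class. The foo.default below is a simplified stand-in that returns what it received instead of printing it; it is not the original code, and this approach is not a drop-in replacement for NextMethod() in every class hierarchy.

```r
foo <- function(x, y, ...) UseMethod("foo")

# Simplified stand-in: returns what it received instead of printing it.
foo.default <- function(x, y = 0, ...) {
  list(x = x, y = y, n_dots = ...length())
}

foo.C <- function(x, y = 3, ...) {
  # Remove "C" so that the next dispatch moves on to the remaining classes
  # (or to the default method if none are left).
  remaining <- setdiff(class(x), "C")
  x <- if (length(remaining)) structure(x, class = remaining) else unclass(x)
  # Every argument is passed by name, so nothing from the original
  # positional call can be appended to '...'.
  foo(x = x, y = y, ...)
}

obj <- structure(1, class = "C")
str(foo(obj))     # y takes foo.C's default of 3, '...' is empty
str(foo(obj, 4))  # y is 4, '...' is empty
```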

Mike C

[1] https://cran.r-project.org/doc/manuals/r-devel/R-lang.html#NextMethod
[2] https://stat.ethz.ch/pipermail/r-devel/2012-October/065016.html
[3] https://bugs.r-project.org/show_bug.cgi?id=15654



Re: [Rd] request for discussion on lonely doc patch suggestion

2025-03-24 Thread Tim Taylor
FWIW, on the command line I’m a happy 'delta' user for a quick side by side 
comparison (https://github.com/dandavison/delta)

> On 24 Mar 2025, at 19:32, J C Nash wrote:
> 
> For Linux users, meld is quite nice for side by side editing, though I've
> never tried using it for display. Just checking now suggests it isn't
> obvious how to "print" side by side display.
> 
> I've made meld easier for my own use by creating an icon in Double
> Commander (DC allows the user to create iconized links to scripts and
> programs). There are two panes in the DC file manager. I highlight one
> file in each, then click. This saves typing two full paths in a command
> 
>   meld path/to/file1 path/to/file2
> 
> I suspect the highlight and click makes my use of meld reasonably
> attractive. I'm not sure I'd use it in the raw command line mode.
> 
> Like Duncan, I welcome suggestions for similar tools, especially if
> there's a display option.
> 
> John Nash
> 
>> On 2025-03-24 15:21, Duncan Murdoch wrote:
>> I sent some comments directly to Ben.  I just want to reply publicly to this 
>> part:
>>> On 2025-03-24 1:18 p.m., Ben Bolker wrote:
>>> The patch file is attached (also available at bugzilla, if it doesn't
>>> get through to the list). I find the patch format a little hard to read,
>>> so I'm reproducing just the *new* text below.
>> I agree absolutely about the lack of readability of patch files.  A side by 
>> side display is much nicer.  If anyone out there isn't using one, you should.
>> I really like the one I use ("Beyond Compare"), but it's not open source.  
>> I've been using it for a very long time (20 years or more, I think), and I 
>> suspect there are very good open source competitors out there now (and may 
>> have been for all the time I've been using BC). Suggestions?
>> Duncan Murdoch


Re: [Rd] table() and as.character() performance for logical values

2025-03-24 Thread Sebastian Meyer

On 21.03.25 at 15:42, Aidan Lakshman via R-devel wrote:

After investigating the source of table, I ended up on the reason being 
“as.character()”:


This is specifically happening within the conversion of the input to type 
factor, which is where the as.character conversion happens.


Yes, I also think 'factor' could do a bit better for unclassed integers 
(such as when called from 'cut') as well as for logical input (such as 
from 'summary' -> 'table').


Note that 'as.factor' already has a "fast track" for plain integers 
(originally for 'split.default' from 'tapply'), so can be used instead 
of 'factor' when there is no need for custom 'levels', 'labels', or 
'exclude'. (Thanks for already mentioning 'tabulate'.)
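
As a small illustration of the point above (a sketch; the vector 'i' is made up for the example): for plain integer input with no custom 'levels', 'labels', or 'exclude', 'as.factor' and 'factor' should give the same levels and codes, so the former can be swapped in.

```r
set.seed(1)
i <- sample(1:3, 1e5, replace = TRUE)  # plain (unclassed) integer vector

f_full <- factor(i)     # general path
f_fast <- as.factor(i)  # path described above as the integer "fast track"

# Same levels and same integer codes either way
stopifnot(identical(levels(f_full), levels(f_fast)),
          identical(as.integer(f_full), as.integer(f_fast)))
```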


A 'factor' patch would apply more broadly, e.g.:

===
--- src/library/base/R/factor.R (revision 88042)
+++ src/library/base/R/factor.R (working copy)
@@ -20,14 +20,18 @@
                    exclude = NA, ordered = is.ordered(x), nmax = NA)
 {
     if(is.null(x)) x <- character()
+    directmatch <- !is.object(x) &&
+        (is.character(x) || is.integer(x) || is.logical(x))
     nx <- names(x)
     if (missing(levels)) {
         y <- unique(x, nmax = nmax)
         ind <- order(y)
-        levels <- unique(as.character(y)[ind])
+        if (!directmatch)
+            y <- as.character(y)
+        levels <- unique(y[ind])
     }
     force(ordered) # check if original x is an ordered factor
-    if(!is.character(x))
+    if(!directmatch)
         x <- as.character(x)
     ## levels could be a long vector, but match will not handle that.
     levels <- levels[is.na(match(levels, exclude))]
     f <- match(x, levels)
===

This skips as.character() also for integer/logical 'x' and would indeed 
bring table() runtimes "in order":


set.seed(1)
C <- sample(c("no", "yes"), 10^7, replace = TRUE)
F <- as.factor(C)
L <- F == "yes"
I <- as.integer(L)
N <- as.numeric(I)

## Median system.time(table(.)) in ms:
## table(F)   256
## table(I)   384   # not  696
## table(L)   409   # not 1159
## table(C)   591
## table(N)  3324
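
The message does not say how the medians were obtained. One way to collect such numbers, using a hypothetical helper 'median_ms' that is not part of the original post, might be:

```r
# Hypothetical helper: median elapsed time of f() over 'times' runs, in ms
median_ms <- function(f, times = 5L) {
  elapsed <- vapply(seq_len(times),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
  1000 * median(elapsed)
}

set.seed(1)
L_small <- sample(c(TRUE, FALSE), 10^5, replace = TRUE)  # smaller than in the post
median_ms(function() table(L_small))
```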

The (seemingly) small patch passes check-all, but maybe it overlooks 
some edge cases. I'd test it on a subset of CRAN/BIOC packages.
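
The core idea of the patch, matching values against levels of the same type rather than converting both to character first, can be checked on a toy logical vector. This is only a sketch of the principle with made-up data, not a test of the patch itself:

```r
x <- c(TRUE, FALSE, NA, TRUE)
lev <- sort(unique(x))  # FALSE, TRUE; sort() drops the NA

codes_direct <- match(x, lev)                              # no conversion
codes_char   <- match(as.character(x), as.character(lev))  # via character

stopifnot(identical(codes_direct, codes_char))
```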


Best,

Sebastian Meyer



   # Timing is all on my local machine (OSX)
   N_v <- sample(c(1,0), 10^7, replace = TRUE)
   L_v <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
                                         #  user  system elapsed
   system.time(table(N_v))               # 2.155   0.039   2.192
   system.time(table(L_v))               # 0.806   0.030   0.838

   system.time(N_fv <- as.factor(N_v))   # 2.026   0.024   2.050
   system.time(L_fv <- as.factor(L_v))   # 0.668   0.015   0.683

   system.time(table(N_fv))              # 0.133   0.022   0.156
   system.time(table(L_fv))              # 0.134   0.018   0.151


The performance for integers, and especially for logicals, is quite surprising.


Of note is that the performance is significantly better if using `tabulate`, 
since this doesn't involve a conversion to factor (though input must be 
numeric/factor, results aren't named, and it has worse handling of NA values). 
If you have performance critical calls like this you could consider using 
`tabulate` instead.

   system.time(tabulate(N_v))             # 0.054   0.002   0.056
   system.time(tabulate(as.integer(L_v))) # 0.052   0.002   0.055
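
A caveat when swapping in `tabulate` (a sketch with made-up data): it counts only positive integers, so zeros and NA are silently skipped. For 0/1 or logical input the values therefore need shifting to 1 and 2 to get both counts:

```r
x <- c(TRUE, FALSE, TRUE, NA, TRUE)

tabulate(as.integer(x))  # counts only the 1s (TRUE); 0 and NA are ignored

# Shift FALSE/TRUE to bins 1/2 and name the result by hand
counts <- tabulate(as.integer(x) + 1L, nbins = 2L)
names(counts) <- c("FALSE", "TRUE")
counts
```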


I don't know if this is a known issue or not; most of my colleagues are aware 
of the slow-down and use `tabulate` when performance is required. My 
understanding was that the slower speed is a trade-off for more 
consistent behavior (better output, better handling of ambiguities/NA, 
etc.), and that speed isn't the highest priority with `table`. Maybe someone 
else has a better understanding of the history of the function.

As for improving the speed, it would basically come down to refactoring `table` 
to not use a `factor` conversion. I'd be concerned about introducing a lot of 
edge cases with that, but it's theoretically possible. Based on 30 seconds of 
thinking, it may be possible to do something like:

   ## just a sketch of a barebones non-factor implementation
   test_tab <- function(x){
     lookup <- unique(x)
     counts <- tabulate(match(x, lookup))
     names(counts) <- as.character(lookup)
     counts
   }

   system.time(test_tab(L_v))  # 0.101   0.006   0.107
   system.time(test_tab(N_v))  # 0.129   0.015   0.144

This is also faster in the case where there are lots of categories with few 
entries per category:

   N_v2 <- 1:1e7
   system.time(test_tab(N_v2)) # 0.383   0.024   0.411
   system.time(table(N_v2))    # 6.122   0.228   6.398
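
Note that `unique` returns values in order of first appearance, so `test_tab` reports categories in that order rather than sorted as `table` does. A hypothetical variant (`test_tab_sorted`, not from the original message) that reorders the counts, and otherwise shares the limitations of `test_tab`, might look like:

```r
test_tab_sorted <- function(x) {
  lookup <- unique(x)
  counts <- tabulate(match(x, lookup), nbins = length(lookup))
  names(counts) <- as.character(lookup)
  counts[order(lookup)]  # sort categories as table() would
}

x <- c("b", "a", "b", "c", "a", "b")
test_tab_sorted(x)
```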

Obviously there are some big shortcomings:
- it's missing a lot of error checking etc. that the standard `table` has
- it only works with 1D vectors
- NA handli