date:20230901

[Rd] Updated `dendrapply`

2023-09-01 Thread Lakshman, Aidan H

Hi everyone,
Ivan and I had a few discussions several months ago regarding `dendrapply`, but 
now that I've had the chance to work on it more specifically and discuss it 
Martin Maechler at the R Project Sprint, I figured it would be a good idea to 
start a new email chain.
I've refactored `dendrapply`, and the current implementation is available in my 
bugzilla report (https://bugs.r-project.org/show_bug.cgi?id=18480). This 
project started due to the help page for `dendrapply`, which specifically 
requested users to contribute improvements to the function. I'm including a lot 
of writeup here because I'm very aware that this is a relatively large 
contribution for someone that doesn't have a history of contributing a lot of 
code to base, and I'd like to justify the inclusion.
Feel free to skip everything I've written and instead use the following links. 
A thorough discussion follows.
- Bugzilla with patch: https://bugs.r-project.org/show_bug.cgi?id=18480
- R Checks: https://github.com/r-devel/r-svn/pull/111
- Discussion at R Project Sprint: 
https://github.com/r-devel/r-project-sprint-2023/discussions/6
- Original blog post about this (long, code out of date, but has a simpler 
explanation of the implementation): 
https://www.ahl27.com/posts/2023/02/dendrapply/

Responses to common questions:
- Why does this project need to be done?
The current implementation in `stats::dendrapply` is recursive, and thus has 
issues with deeply nested dendrogram objects. As of 2015, users experienced 
issues with recursive operations on dendrograms causing stack overflow issues 
(see https://bugs.r-project.org/show_bug.cgi?id=15215). This has been 
alleviated by better computers and short-term workarounds, but many users have 
limited resources and/or need for large trees. Even with sufficient memory, a 
recursive implementation incurs a nontrivial amount of unneccessary 
computational overhead. I'll also add that this is a feature that was requested 
in R itself (see Note section in `?dendrapply`), and Martin Maechler has been 
supportive of the work thus far on it.
- What does this implementation do?
There are a few improvements in this implementation. The first is that function 
application to dendrogram objects is no longer recursive. This implementation 
is also based in C, providing a performance boost of at least 4x (see later 
question for details). Finally, iterative application of functions in C allows 
for flexibility in *how* the dendrogram is traversed, which gives end-users a 
significant amount of power to work with tree-like structures in an intuitive 
way. An easy example is subsetting based on leaf values--if a user wanted to 
subset a dendrogram to only certain leaves, there isn't a good way to do this 
in base R (even with dendrapply). `how='post.order'` fixes this problem. I'm 
happy to provide additional examples if needed.
- Why C? This is harder to maintain than R.
This is a valid point. I did my best to include as much documentation as 
possible, and I'm also volunteering myself to help maintain this function. C 
provides a lot of power in working with dendrograms, since R's toolkit for 
tree-like structures is relatively lacking. This refactor is theoretically 
doable in R, but the implementation would involve an immense amount of memory 
overhead to ensure we can preserve the node states as we traverse the tree. 
There is precedence for a C implementation of dendrapply (see other `*apply` 
functions). Additionally, this decreases function application overhead, and 
allows future extensions to be written purely in R in a much simpler way. I 
think this tradeoff is worth it, but I am happy to discuss implementation 
specifics with anyone that is interested.
- Ivan previously mentioned issues with user specific `[[.dendrogram` 
implementations, and it doesn't seem that you've fixed that.
This is correct. I discovered during the R project sprint that 
`stats::dendrapply` does not respect user-specific implementations of 
`[[.dendrogram`. stats::`[[.dendrogram` has its own issues; if the user defines 
multiple classes for a dendrogram object, double bracket subsetting will remove 
the class (a bug report will be filed for this shortly). My implementation 
exactly replicates the performance of stats::`[[.dendrogram`, and if users are 
in need of a function that can respect custom subset overloading, I can address 
those feature requests if/when they are submitted.
- Backwards compatibility?
>From current testing, this is a drop-in replacement for `stats::dendrapply`. R 
>builds successfully, and all >400 tests in the CRAN package that uses 
>`dendrapply` the most (dendextend) pass with no changes from the original. The 
>additional argument `how=c('pre.order', 'post.order')` is the same syntax as 
>`rapply`, and additional documentation has been added to the `dendrapply.Rd` 
>to reflect this change. This is still an unfinished TODO; the internal R 
>testing for `dendrapply` is very sparse. I haven'

Re: [Rd] Updated `dendrapply`

2023-09-01 Thread Ivan Krylov

В Fri, 1 Sep 2023 09:17:57 +
"Lakshman, Aidan H"  пишет:

> - Ivan previously mentioned issues with user specific `[[.dendrogram`
> implementations, and it doesn't seem that you've fixed that.

> This is correct. I discovered during the R project sprint that
> `stats::dendrapply` does not respect user-specific implementations of
> `[[.dendrogram`. stats::`[[.dendrogram` has its own issues; if the
> user defines multiple classes for a dendrogram object, double bracket
> subsetting will remove the class (a bug report will be filed for this
> shortly).

True, my warning about not handing potential subclasses of dendrogram
was purely theoretical.

(A hypothetical subclass of dendrogram could work with the current
[[.dendrogram if it ensured that its own class name always precedes
'dendrogram' in the class vector, thus never being downstream from
stats:::`[[.dendrogram` in a chain of NextMethod() calls. But that's
still hypothetical.)

I see that your current implementation very nicely bounds the PROTECT()
stack usage and avoids the need to deallocate arbitrary SEXPs, which is
awkward to do with R's garbage collector API. Congratulations!

-- 
Best regards,
Ivan

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] Updated `dendrapply`

Re: [Rd] Updated `dendrapply`

2 matches

Site Navigation

Mail list logo

Footer information