[Rd] Updated `dendrapply`

Lakshman, Aidan H Fri, 01 Sep 2023 02:18:46 -0700

Hi everyone,
Ivan and I had a few discussions several months ago regarding `dendrapply`, but 
now that I've had the chance to work on it more specifically and discuss it 
Martin Maechler at the R Project Sprint, I figured it would be a good idea to 
start a new email chain.
I've refactored `dendrapply`, and the current implementation is available in my 
bugzilla report (https://bugs.r-project.org/show_bug.cgi?id=18480). This 
project started due to the help page for `dendrapply`, which specifically 
requested users to contribute improvements to the function. I'm including a lot 
of writeup here because I'm very aware that this is a relatively large 
contribution for someone that doesn't have a history of contributing a lot of 
code to base, and I'd like to justify the inclusion.
Feel free to skip everything I've written and instead use the following links. 
A thorough discussion follows.
- Bugzilla with patch: https://bugs.r-project.org/show_bug.cgi?id=18480
- R Checks: https://github.com/r-devel/r-svn/pull/111
- Discussion at R Project Sprint: 
https://github.com/r-devel/r-project-sprint-2023/discussions/6
- Original blog post about this (long, code out of date, but has a simpler 
explanation of the implementation): 
https://www.ahl27.com/posts/2023/02/dendrapply/


Responses to common questions:
- Why does this project need to be done?
The current implementation in `stats::dendrapply` is recursive, and thus has 
issues with deeply nested dendrogram objects. As of 2015, users experienced 
issues with recursive operations on dendrograms causing stack overflow issues 
(see https://bugs.r-project.org/show_bug.cgi?id=15215). This has been 
alleviated by better computers and short-term workarounds, but many users have 
limited resources and/or need for large trees. Even with sufficient memory, a 
recursive implementation incurs a nontrivial amount of unneccessary 
computational overhead. I'll also add that this is a feature that was requested 
in R itself (see Note section in `?dendrapply`), and Martin Maechler has been 
supportive of the work thus far on it.
- What does this implementation do?
There are a few improvements in this implementation. The first is that function 
application to dendrogram objects is no longer recursive. This implementation 
is also based in C, providing a performance boost of at least 4x (see later 
question for details). Finally, iterative application of functions in C allows 
for flexibility in *how* the dendrogram is traversed, which gives end-users a 
significant amount of power to work with tree-like structures in an intuitive 
way. An easy example is subsetting based on leaf values--if a user wanted to 
subset a dendrogram to only certain leaves, there isn't a good way to do this 
in base R (even with dendrapply). `how='post.order'` fixes this problem. I'm 
happy to provide additional examples if needed.
- Why C? This is harder to maintain than R.
This is a valid point. I did my best to include as much documentation as 
possible, and I'm also volunteering myself to help maintain this function. C 
provides a lot of power in working with dendrograms, since R's toolkit for 
tree-like structures is relatively lacking. This refactor is theoretically 
doable in R, but the implementation would involve an immense amount of memory 
overhead to ensure we can preserve the node states as we traverse the tree. 
There is precedence for a C implementation of dendrapply (see other `*apply` 
functions). Additionally, this decreases function application overhead, and 
allows future extensions to be written purely in R in a much simpler way. I 
think this tradeoff is worth it, but I am happy to discuss implementation 
specifics with anyone that is interested.
- Ivan previously mentioned issues with user specific `[[.dendrogram` 
implementations, and it doesn't seem that you've fixed that.
This is correct. I discovered during the R project sprint that 
`stats::dendrapply` does not respect user-specific implementations of 
`[[.dendrogram`. stats::`[[.dendrogram` has its own issues; if the user defines 
multiple classes for a dendrogram object, double bracket subsetting will remove 
the class (a bug report will be filed for this shortly). My implementation 
exactly replicates the performance of stats::`[[.dendrogram`, and if users are 
in need of a function that can respect custom subset overloading, I can address 
those feature requests if/when they are submitted.
- Backwards compatibility?
>From current testing, this is a drop-in replacement for `stats::dendrapply`. R 
>builds successfully, and all >400 tests in the CRAN package that uses 
>`dendrapply` the most (dendextend) pass with no changes from the original. The 
>additional argument `how=c('pre.order', 'post.order')` is the same syntax as 
>`rapply`, and additional documentation has been added to the `dendrapply.Rd` 
>to reflect this change. This is still an unfinished TODO; the internal R 
>testing for `dendrapply` is very sparse. I haven't been able to find any 
>differences between stats::dendrapply and this implementation, but I am 
>planning to run a full check against all CRAN packages that use `dendrapply`. 
>I'm also planning to add additional regression testing to R either as part of 
>this patch in a separate patch.
- You mentioned there was more to the listed '4x improvement'
Yes. I haven't yet put together a comprehensive benchmark for highly unbalanced 
trees, and in truth there are so many possible tree structures that it would be 
challenging to test them all. However, on trees with 5 leaves the performance 
is roughly identical to that of `stats`, and benchmarking with `microbenchmark` 
demonstrates performance gains of roughly 5x on fully balanced trees with 
10-5000 leaves. This should be a lower bound for performance, since fully 
balanced trees minimize internal nodes and thus have less recursion...so on 
reasonably sized trees of arbitrary structure we should have at least around a 
4x improvement. I'll also stress that the focus of this patch is not a runtime 
improvement--it's nice that we get a speedup, but the added value here is the 
removal of recursive calls.
- Why not just put this in a package?
I think there's value in fleshing out the structures included in base R. 
Dendrograms are a general tree structure, and few programming languages provide 
support for these out of the box. Dendrogram objects are currently rarely used, 
but with a little bit of additional functionality, they could be a very 
powerful tool for users. These have applications in a variety of fields that 
are not just phylogenetics; implementations of domain-specific tools (e.g. 
`ape`, `DECIPHER`) are better suited to 3rd party packages. However, 
`dendrograms` already exist in base and have poor support, which is even 
admitted in their help files. `dendrapply`. While it's currently limited to 
acting on dendrogram objects, a solid implementation would open the door to 
generalizing dendrapply to work on any nested list. This is my personal 
opinion, and there is certainly an argument to be made that `dendrapply` (and 
even `dendrogram` as a whole) could live outside of base.
- Why/how were the included traversal strategies chosen?
The default, pre.order, was chosen because it replicates existing 
functionality. post.order was included because it lends itself well to a lot of 
applications. Between these two methods, we have a way to apply a function to 
trees ensuring that parents are evaluated before children, and ensuring 
children are evaluated before parents. Future extensions to support an in.order 
or BFS/level.order traversal is definitely an option, but I don't think the 
added implementation effort and complexity adds a lot of functionality over the 
two that have been included.
There's also been previous comments regarding the structure of dendrograms. 
While hclust will return a bifurcating tree, dendrogram objects (and 
dendrapply) support arbitrary multifurcations and edge weights.
Apologies for the lengthy email. If you've read even half, thanks for your 
time. There�s likely some optimizations that could be made in the C code 
dealing with R structures; I�m still learning the intricacies of some of the 
more specific points of this. One that comes to mind is converting the repeated 
`lang2` calls to instead initialize a `lang2` call and then change the symbol 
in the call, as is done in `lapply` (and was mentioned by Ivan in my last 
submission). I�ll test that as well.
Further feedback is welcome and much appreciated.
-Aidan

-----------------------
Aidan Lakshman (he/him)<https://www.ahl27.com/>
Doctoral Fellow, Wright Lab<https://www.wrightlabscience.com/>
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
ah...@pitt.edu
(724) 612-9940


        [[alternative HTML version deleted]]

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] Updated `dendrapply`

Reply via email to