Hi everyone,
Ivan and I had a few discussions several months ago regarding `dendrapply`, but
now that I've had the chance to work on it more specifically and discuss it
Martin Maechler at the R Project Sprint, I figured it would be a good idea to
start a new email chain.
I've refactored `dendrapply`, and the current implementation is available in my
bugzilla report (https://bugs.r-project.org/show_bug.cgi?id=18480). This
project started due to the help page for `dendrapply`, which specifically
requested users to contribute improvements to the function. I'm including a lot
of writeup here because I'm very aware that this is a relatively large
contribution for someone that doesn't have a history of contributing a lot of
code to base, and I'd like to justify the inclusion.
Feel free to skip everything I've written and instead use the following links.
A thorough discussion follows.
- Bugzilla with patch: https://bugs.r-project.org/show_bug.cgi?id=18480
- R Checks: https://github.com/r-devel/r-svn/pull/111
- Discussion at R Project Sprint:
https://github.com/r-devel/r-project-sprint-2023/discussions/6
- Original blog post about this (long, code out of date, but has a simpler
explanation of the implementation):
https://www.ahl27.com/posts/2023/02/dendrapply/
Responses to common questions:
- Why does this project need to be done?
The current implementation in `stats::dendrapply` is recursive, and thus has
issues with deeply nested dendrogram objects. As of 2015, users experienced
issues with recursive operations on dendrograms causing stack overflow issues
(see https://bugs.r-project.org/show_bug.cgi?id=15215). This has been
alleviated by better computers and short-term workarounds, but many users have
limited resources and/or need for large trees. Even with sufficient memory, a
recursive implementation incurs a nontrivial amount of unneccessary
computational overhead. I'll also add that this is a feature that was requested
in R itself (see Note section in `?dendrapply`), and Martin Maechler has been
supportive of the work thus far on it.
- What does this implementation do?
There are a few improvements in this implementation. The first is that function
application to dendrogram objects is no longer recursive. This implementation
is also based in C, providing a performance boost of at least 4x (see later
question for details). Finally, iterative application of functions in C allows
for flexibility in *how* the dendrogram is traversed, which gives end-users a
significant amount of power to work with tree-like structures in an intuitive
way. An easy example is subsetting based on leaf values--if a user wanted to
subset a dendrogram to only certain leaves, there isn't a good way to do this
in base R (even with dendrapply). `how='post.order'` fixes this problem. I'm
happy to provide additional examples if needed.
- Why C? This is harder to maintain than R.
This is a valid point. I did my best to include as much documentation as
possible, and I'm also volunteering myself to help maintain this function. C
provides a lot of power in working with dendrograms, since R's toolkit for
tree-like structures is relatively lacking. This refactor is theoretically
doable in R, but the implementation would involve an immense amount of memory
overhead to ensure we can preserve the node states as we traverse the tree.
There is precedence for a C implementation of dendrapply (see other `*apply`
functions). Additionally, this decreases function application overhead, and
allows future extensions to be written purely in R in a much simpler way. I
think this tradeoff is worth it, but I am happy to discuss implementation
specifics with anyone that is interested.
- Ivan previously mentioned issues with user specific `[[.dendrogram`
implementations, and it doesn't seem that you've fixed that.
This is correct. I discovered during the R project sprint that
`stats::dendrapply` does not respect user-specific implementations of
`[[.dendrogram`. stats::`[[.dendrogram` has its own issues; if the user defines
multiple classes for a dendrogram object, double bracket subsetting will remove
the class (a bug report will be filed for this shortly). My implementation
exactly replicates the performance of stats::`[[.dendrogram`, and if users are
in need of a function that can respect custom subset overloading, I can address
those feature requests if/when they are submitted.
- Backwards compatibility?
>From current testing, this is a drop-in replacement for `stats::dendrapply`. R
>builds successfully, and all >400 tests in the CRAN package that uses
>`dendrapply` the most (dendextend) pass with no changes from the original. The
>additional argument `how=c('pre.order', 'post.order')` is the same syntax as
>`rapply`, and additional documentation has been added to the `dendrapply.Rd`
>to reflect this change. This is still an unfinished TODO; the internal R
>testing for `dendrapply` is very sparse. I haven'