[Rd] `dendrapply` Enhancements

2023-02-24 Thread Lakshman, Aidan H
Hi everyone,

My apologies if this isn�t the right place to submit this�I�m new to the 
R-devel community and still figuring out what is where.

If people want to skip my writeup and just look at the code, I�ve made a 
repository for it here: https://github.com/ahl27/new_dendrapply/tree/master. 
I�m not quite sure how to integrate it into a fork of R-devel; the package 
structure is different from what I�m used to.

I had written a slightly improved version of dendrapply for one of my research 
projects, and my advisor encouraged me to submit it to the R project. It took 
me longer than I expected, but I�ve finally gotten my implementation to be a 
drop-in replacement for `stats::dendrapply`. The man page for 
`stats::dendrapply` says �The implementation is somewhat experimental and 
suggestions for enhancements (or nice examples of usage) are very welcome,� so 
I figured this had the potential to be a worthwhile contribution. I wanted to 
send it out to R-devel to see if this was something worth pursuing as an 
enhancement to R.

The implementation I have is based in C, which I understand implies an 
increased burden of maintenance over pure R code. However, it does come with 
the following benefits:

- Completely eliminates recursion, so no memory overhead from function calls or 
possibility of stack overflows (this was a major issue reported on some of the 
functions in one of our Bioconductor packages that previously used 
`dendrapply`).
- Modest runtime improvement, around 2x on my computer (2021 MBP, 32GB RAM). 
I�m relatively confident this could be optimized more.
- Seemingly significant reduction in memory reduction, still working on a 
robust benchmark. Suggestions for the best way to do that are welcome.
- Support for applying functions with an inorder traversal (as in 
`stats::dendrapply`) as well as using a postorder traversal.

This implementation was tested manually as well as running all the unit tests 
in `dendextend`, which comprises a lot of applications of `dendrapply`.

The postorder traversal would be a significant new functionality to dendrapply, 
as it would allow for functions that use the child nodes to correctly execute. 
A toy example of this is something like:
```
exFunc <- function(x){
  attr(x, 'newA') <- 'a'
  if(is.null(attr(x, 'leaf'))){
cat(attr(x[[1]], 'newA'), attr(x[[2]], 'newA'))
cat('\n')
  }
  x
})

dendrapply(dend, exFunc)
```

With the current version of dendrapply, this prints nothing, but the postorder 
traversal version will print �a� twice for each internal branch. If this would 
be a worthwhile addition, I can refactor the code for brevity and add a 
`how=c("in.order", "post.order")`, with the default value �in.order� to 
maintain backwards compatibility. A preorder traversal version should also be 
possible, I just haven�t gotten to it yet.

I think the runtime could be optimized more as well.

Thank you in advance for looking at my code and offering feedback; I�m excited 
at the possibility of helping contribute to the R project! I�m happy to discuss 
more either here, on GitHub, or on the R Contributors Slack.

Sincerely,
Aidan Lakshman

---
Aidan Lakshman (he/him)
Doctoral Candidate, Wright Lab
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
ah...@pitt.edu
(724) 612-9940


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] `dendrapply` Enhancements

2023-02-24 Thread Toby Hocking
Hi Aidan, I think you are on the right email list.
I'm not R-core, but this looks like an interesting/meaningful/significant
contribution to base R.
I'm not sure what the original dendrapply looks like in terms of code style
(variable names/white space formatting/etc) but in my experience it is
important that your code contribution makes minimal changes in that area.
Did you hear about the R project sprint 2023?
https://contributor.r-project.org/r-project-sprint-2023/ Your work falls
into the "new developments" category so I think you could apply for that
funding to participate.
Toby

On Fri, Feb 24, 2023 at 3:47 AM Lakshman, Aidan H  wrote:

> Hi everyone,
>
> My apologies if this isn’t the right place to submit this—I’m new to the
> R-devel community and still figuring out what is where.
>
> If people want to skip my writeup and just look at the code, I’ve made a
> repository for it here:
> https://github.com/ahl27/new_dendrapply/tree/master. I’m not quite sure
> how to integrate it into a fork of R-devel; the package structure is
> different from what I’m used to.
>
> I had written a slightly improved version of dendrapply for one of my
> research projects, and my advisor encouraged me to submit it to the R
> project. It took me longer than I expected, but I’ve finally gotten my
> implementation to be a drop-in replacement for `stats::dendrapply`. The man
> page for `stats::dendrapply` says “The implementation is somewhat
> experimental and suggestions for enhancements (or nice examples of usage)
> are very welcome,” so I figured this had the potential to be a worthwhile
> contribution. I wanted to send it out to R-devel to see if this was
> something worth pursuing as an enhancement to R.
>
> The implementation I have is based in C, which I understand implies an
> increased burden of maintenance over pure R code. However, it does come
> with the following benefits:
>
> - Completely eliminates recursion, so no memory overhead from function
> calls or possibility of stack overflows (this was a major issue reported on
> some of the functions in one of our Bioconductor packages that previously
> used `dendrapply`).
> - Modest runtime improvement, around 2x on my computer (2021 MBP, 32GB
> RAM). I’m relatively confident this could be optimized more.
> - Seemingly significant reduction in memory reduction, still working on a
> robust benchmark. Suggestions for the best way to do that are welcome.
> - Support for applying functions with an inorder traversal (as in
> `stats::dendrapply`) as well as using a postorder traversal.
>
> This implementation was tested manually as well as running all the unit
> tests in `dendextend`, which comprises a lot of applications of
> `dendrapply`.
>
> The postorder traversal would be a significant new functionality to
> dendrapply, as it would allow for functions that use the child nodes to
> correctly execute. A toy example of this is something like:
> ```
> exFunc <- function(x){
>   attr(x, 'newA') <- 'a'
>   if(is.null(attr(x, 'leaf'))){
> cat(attr(x[[1]], 'newA'), attr(x[[2]], 'newA'))
> cat('\n')
>   }
>   x
> })
>
> dendrapply(dend, exFunc)
> ```
>
> With the current version of dendrapply, this prints nothing, but the
> postorder traversal version will print ‘a’ twice for each internal branch.
> If this would be a worthwhile addition, I can refactor the code for brevity
> and add a `how=c("in.order", "post.order")`, with the default value
> “in.order” to maintain backwards compatibility. A preorder traversal
> version should also be possible, I just haven’t gotten to it yet.
>
> I think the runtime could be optimized more as well.
>
> Thank you in advance for looking at my code and offering feedback; I’m
> excited at the possibility of helping contribute to the R project! I’m
> happy to discuss more either here, on GitHub, or on the R Contributors
> Slack.
>
> Sincerely,
> Aidan Lakshman
>
> ---
> Aidan Lakshman (he/him)
> Doctoral Candidate, Wright Lab
> University of Pittsburgh School of Medicine
> Department of Biomedical Informatics
> ah...@pitt.edu
> (724) 612-9940
>
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Compilation Error when DEBUG_approx Toggled on in RISC-V

2023-02-24 Thread Jane He
Hi all,

While compiling R to RISC-V64 architecture and debugging in R's C source
codes, I think I have found a small bug. Can anyone please verify whether
it is a real bug?

The possible bug lies in the file `R-4.2.2/src/library/stats/src/approx.c`
in function `R_approxfun` around line 148:

#ifdef DEBUG_approx
REprintf("R_approxfun(x,y, nxy = %.0f, .., nout = %.0f, method = %d,
...)",
(double)nxy, (double)nout, Meth->kind);
#endif

However, there is no `Meth` defined in this function, causing a compilation
error when `DEBUG_approx` is toggled on. The real `Meth` is actually
defined in the `approx1` function, and the Meth seen here is probably a
historical artifact that was never erased. To fix the error, I suggest
changing `Meth->kind` to `method` in the `R_approxfun` function as follows:

#ifdef DEBUG_approx
REprintf("R_approxfun(x,y, nxy = %.0f, .., nout = %.0f, method = %d,
...)",
(double)nxy, (double)nout, method);
#endif

If this is really an error, I hope this small fix will be helpful to the R
community.

Thank you for your time and consideration.

Best,
Jane He

_
University of California, Irvine
Student of Master of Software Engineering
2022-2023 cohort

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Possible NA Propagation Failure in RISC-V64 CPU?

2023-02-24 Thread Jane He
Hi all,

I am currently compiling R to RISC-V64 CPU and I think I have discovered a
NA propagation failure.

How R implements NA (not available) and NaN (not-a-number) is explained in
detail here:
https://stat.ethz.ch/pipermail/r-devel/2014-February/068380.html.

In short, according to my understanding of R's convention, any calculation
involving NA but no NaN should result in NA (called NA propagation), and
any calculation involving NaN but no NA should result in NaN. Calculations
involving both NA and NaN can result in either value.

While many R functions handle this logic in their source codes, basic
arithmetic operations such as +-/* throw it to the hardware to handle.
However, the RISC-V64 CPU does not behave as expected, at least the CPU I
am using (Starfive JH7100-7110).

Here are the relevant bit patterns. From my understanding, as IEEE only
regulates the bit patterns of NaN, R picks one of the bit patterns (ending
with 07a2) and declares it as NA.

# print_hex is a function to print the bit pattern in hex
> print_hex(0.1)
3fba # same for RISC-V
> print_hex(NaN)
7ff8 # same for RISC-V
> print_hex(NA)
7ff007a2 # same for RISC-V
> print_hex(NA+1)
7ff807a2 # 7ff8 in RISC-V
> print_hex(NA*1)
7ff807a2 # 7ff8 in RISC-V
> print_hex(NaN*1)
7ff8 # same for RISC-V


Therefore, in RISC-V64, all basic arithmetic operations involving NA give
NaN.

> NA+1
[1] NaN

This failure in NA propagation may cause many R packages like mice to not
work properly, and results in the `make check` test in the `stats` package
to fail. Example from the make check test:

xn <- 1:4
yn <- c(1,NA,3:4)
xout <- (1:9)/2
data.frame(approx(xn,yn, xout, na.rm=FALSE, rule = 2)) # failure, some
values should be NA but it turns out NaN

I am reaching out to the R community looking for help in solving this
problem. Does anyone around here have any hints or ideas on how to solve
this issue?

Currently, my hacky implementation is to stop the NA operand before it goes
to the hardware and directly return NA as the output. However, this
solution may penalize performance significantly, so I am looking for any
alternative idea.

Thank you for your time and consideration!

Best regards,
Jane He

_
University of California, Irvine
Student of Master of Software Engineering
2022-2023 cohort

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] `dendrapply` Enhancements

2023-02-24 Thread Lakshman, Aidan H
Hi Toby,

Thanks for your reply! I haven�t heard about the R project sprint, but I�ll 
definitely check it out. UK is going to be a little hard for me to get to 
funding-wise, but I�ll try to apply for funding.

I appreciate your other comments. As far as coding style, I did do everything I 
could think of to make sure it�s a drop-in replacement for the current version 
with the default settings, so all the user-exposed arguments/variables should 
be identical. I used the conventions in 
https://github.com/wch/r-source/wiki/Contributing for commenting and 
whitespace, so hopefully that all looks okay. I�m realizing there may be some 
differences in tab widths, but I can fix that later today.

-Aidan

---
Aidan Lakshman (he/him)
Doctoral Candidate, Wright Lab
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
ah...@pitt.edu
(724) 612-9940


From: Toby Hocking 
Date: Friday, February 24, 2023 at 06:57
To: Lakshman, Aidan H 
Cc: R-devel@r-project.org 
Subject: Re: [Rd] `dendrapply` Enhancements
Hi Aidan, I think you are on the right email list.
I'm not R-core, but this looks like an interesting/meaningful/significant 
contribution to base R.
I'm not sure what the original dendrapply looks like in terms of code style 
(variable names/white space formatting/etc) but in my experience it is 
important that your code contribution makes minimal changes in that area.
Did you hear about the R project sprint 2023? 
https://contributor.r-project.org/r-project-sprint-2023/
 Your work falls into the "new developments" category so I think you could 
apply for that funding to participate.
Toby

On Fri, Feb 24, 2023 at 3:47 AM Lakshman, Aidan H 
mailto:ah...@pitt.edu>> wrote:
Hi everyone,

My apologies if this isn�t the right place to submit this�I�m new to the 
R-devel community and still figuring out what is where.

If people want to skip my writeup and just look at the code, I�ve made a 
repository for it here: 
https://github.com/ahl27/new_dendrapply/tree/master.
 I�m not quite sure how to integrate it into a fork of R-devel; the package 
structure is different from what I�m used to.

I had written a slightly improved version of dendrapply for one of my research 
projects, and my advisor encouraged me to submit it to the R project. It took 
me longer than I expected, but I�ve finally gotten my implementation to be a 
drop-in replacement for `stats::dendrapply`. The man page for 
`stats::dendrapply` says �The implementation is somewhat experimental and 
suggestions for enhancements (or nice examples of usage) are very welcome,� so 
I figured this had the potential to be a worthwhile contribution. I wanted to 
send it out to R-devel to see if this was something worth pursuing as an 
enhancement to R.

The implementation I have is based in C, which I understand implies an 
increased burden of maintenance over pure R code. However, it does come with 
the following benefits:

- Completely eliminates recursion, so no memory overhead from function calls or 
possibility of stack overflows (this was a major issue reported on some of the 
functions in one of our Bioconductor packages that previously used 
`dendrapply`).
- Modest runtime improvement, around 2x on my computer (2021 MBP, 32GB RAM). 
I�m relatively confident this could be optimized more.
- Seemingly significant reduction in memory reduction, still working on a 
robust benchmark. Suggestions for the best way to do that are welcome.
- Support for applying functions with an inorder traversal (as in 
`stats::dendrapply`) as well as using a postorder traversal.

This implementation was tested manually as well as running all the unit tests 
in `dendextend`, which comprises a lot of applications of `dendrapply`.

The postorder traversal would be a significant new functionality to dendrapply, 
as it would allow for functions that use the child nodes to correctly execute. 
A toy example of this is something like:
```
exFunc <- function(x){
  attr(x, 'newA') <- 'a'
  if(is.null(attr(x, 'leaf'))){
cat(attr(x[[1]], 'newA'), attr(x[

Re: [Rd] Possible NA Propagation Failure in RISC-V64 CPU?

2023-02-24 Thread Ivan Krylov
On Thu, 23 Feb 2023 15:39:08 -0800
Jane He  wrote:

> In short, according to my understanding of R's convention, any
> calculation involving NA but no NaN should result in NA (called NA
> propagation), and any calculation involving NaN but no NA should
> result in NaN. Calculations involving both NA and NaN can result in
> either value.

The ?NA help page already contains a warning that "future CPUs and/or
compilers" may prevent NA from resulting in computations with NA. A
similar problem has been encountered on Apple processors, but a
workaround was found there:
https://blog.r-project.org/2020/11/02/will-r-work-on-apple-silicon/#nanan-payload-propagation

> This failure in NA propagation may cause many R packages like mice to
> not work properly, and results in the `make check` test in the
> `stats` package to fail.

Perhaps the way forward is to update the tests.

Regarding R packages, since is.na() is already documented to return
TRUE for all NaNs, not only the NA value, it should be possible to make
them work even if they currently fail.

-- 
Best regards,
Ivan

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Possible NA Propagation Failure in RISC-V64 CPU?

2023-02-24 Thread Tomas Kalibera



On 2/24/23 00:39, Jane He wrote:

Hi all,

I am currently compiling R to RISC-V64 CPU and I think I have discovered a
NA propagation failure.

How R implements NA (not available) and NaN (not-a-number) is explained in
detail here:
https://stat.ethz.ch/pipermail/r-devel/2014-February/068380.html.

In short, according to my understanding of R's convention, any calculation
involving NA but no NaN should result in NA (called NA propagation), and
any calculation involving NaN but no NA should result in NaN. Calculations
involving both NA and NaN can result in either value.

While many R functions handle this logic in their source codes, basic
arithmetic operations such as +-/* throw it to the hardware to handle.
However, the RISC-V64 CPU does not behave as expected, at least the CPU I
am using (Starfive JH7100-7110).

Here are the relevant bit patterns. From my understanding, as IEEE only
regulates the bit patterns of NaN, R picks one of the bit patterns (ending
with 07a2) and declares it as NA.

# print_hex is a function to print the bit pattern in hex

print_hex(0.1)

3fba # same for RISC-V

print_hex(NaN)

7ff8 # same for RISC-V

print_hex(NA)

7ff007a2 # same for RISC-V

print_hex(NA+1)

7ff807a2 # 7ff8 in RISC-V

print_hex(NA*1)

7ff807a2 # 7ff8 in RISC-V

print_hex(NaN*1)

7ff8 # same for RISC-V


Therefore, in RISC-V64, all basic arithmetic operations involving NA give
NaN.


NA+1

[1] NaN

This failure in NA propagation may cause many R packages like mice to not
work properly, and results in the `make check` test in the `stats` package
to fail. Example from the make check test:

xn <- 1:4
yn <- c(1,NA,3:4)
xout <- (1:9)/2
data.frame(approx(xn,yn, xout, na.rm=FALSE, rule = 2)) # failure, some
values should be NA but it turns out NaN

I am reaching out to the R community looking for help in solving this
problem. Does anyone around here have any hints or ideas on how to solve
this issue?

Currently, my hacky implementation is to stop the NA operand before it goes
to the hardware and directly return NA as the output. However, this
solution may penalize performance significantly, so I am looking for any
alternative idea.

Thank you for your time and consideration!


Please see this bug report for a related discussion: 
https://bugs.r-project.org/show_bug.cgi?id=18416


This is a known problem and has been discussed a number of times before 
RISC-V. Still, RISC-V is a bit unique architecture in that it 
deliberately does not propagate NaN payloads at all and it cannot be 
enabled. R's distinction of NA vs NaN however depends on that propagation.


The propagation of NAs through computations in R is not reliable even on 
other current platforms (due to what the hardware does, but also what 
the compilers and libraries do, see e.g. ?NaN) and technically it is not 
guaranteed and R packages are advised not to depend on it, but various 
package checks and existing software adjusted to how the propagation 
happens to work on widely used platforms, now probably mostly x86_64. 
64-bit ARM is even a bit more friendly to R than that (when the CPU is 
switched to such mode, which R does). If RISC-V supported such mode, R 
would enable it as well, but it deliberately does not support it.


Good defensive portable code would not depend on the propagation to 
happen in hardware/compilers but add checks in software, if propagation 
was needed. I doubt that the performance overhead would be too high if 
done well, particularly for vectorized computations, but it will be 
noticeable and in some cases may be high. This would have to include 
special handling of data including NAs when they are passed to external 
numerical libraries, which would be a lot of work to implement and 
maintain, it seems unrealistic. At the same time there is no strong 
incentive to do so: people who want the propagation seem to be 
reasonably happy with how it happens to work on major platforms.


For the current R to work well on RISC-V, really, the platform would 
have to support the propagation, at least optionally, perhaps like 
64-bit ARM. Unless major platforms in the future stop supporting the 
propagation, I am skeptical anything major could change on the R side.


Best
Tomas




Best regards,
Jane He

_
University of California, Irvine
Student of Master of Software Engineering
2022-2023 cohort

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Compilation Error when DEBUG_approx Toggled on in RISC-V

2023-02-24 Thread Martin Maechler
> Jane He 
> on Thu, 23 Feb 2023 15:33:23 -0800 writes:

> Hi all,

> While compiling R to RISC-V64 architecture and debugging in R's C source
> codes, I think I have found a small bug. Can anyone please verify whether
> it is a real bug?

> The possible bug lies in the file `R-4.2.2/src/library/stats/src/approx.c`
> in function `R_approxfun` around line 148:

> #ifdef DEBUG_approx
>REprintf("R_approxfun(x,y, nxy = %.0f, .., nout = %.0f, method = %d, 
...)",
> (double)nxy, (double)nout, Meth->kind);
> #endif

> However, there is no `Meth` defined in this function, causing a 
compilation
> error when `DEBUG_approx` is toggled on. The real `Meth` is actually
> defined in the `approx1` function, and the Meth seen here is probably a
> historical artifact that was never erased. To fix the error, I suggest
> changing `Meth->kind` to `method` in the `R_approxfun` function as 
follows:

> #ifdef DEBUG_approx
> REprintf("R_approxfun(x,y, nxy = %.0f, .., nout = %.0f, method = %d,
> ...)",
> (double)nxy, (double)nout, method);
> #endif

> If this is really an error, I hope this small fix will be helpful to the R
> community.

well,  you are right this has been a cut'n'paste thinko (almost
surely by me) that has probably never shown but to the 0, 1, or
2 people in the world who may have manually compiled *and*
defined DEBUG_approx ..

> Thank you for your time and consideration.

and to you, too!
The fixed is part of the source now.


Best,
Martin

> Best,
> Jane He

> _
> University of California, Irvine
> Student of Master of Software Engineering
> 2022-2023 cohort

> [[alternative HTML version deleted]]

> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel