Re: [Rd] as.matrix.dist patch (performance)

2024-01-16 Thread Ivan Krylov
Dear Tim,

В Thu, 10 Aug 2023 22:38:44 +0100
Tim Taylor  пишет:

> Submitting here in the first instance but happy to move to Bugzilla
> if more appropriate.

It's a fine patch. The 1.7 times speed up from not transposing the
return value shouldn't be sneezed at. I think it's time to move it to
Bugzilla so that it won't be completely forgotten.

-- 
Best regards,
Ivan

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] cwilcox - new version

2024-01-16 Thread Aidan Lakshman
New email client, hoping that Outlook won’t continue to HTMLify my emails and 
mess up my attachments without my consent :/

I’m re-attaching the patch file here, compressed as a .gz in case that was the 
issue with the previous attachment. Thanks to Ivan for pointing out that my 
attachment didn’t go through correctly on the first try.

I’ll be working on seeing if I can optimize this to actually improve 
performance during the day today.

-Aidan

On 15 Jan 2024, at 5:51, Andreas Löffler wrote:

> I am a newbie to C (although I did some programming in Perl and Pascal) so
> forgive me for the language that follows. I am writing because I have a
> question concerning the *implementation of *(the internal function)*
> cwilcox* in
> https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c.
> This function is used to determine the test statistics of the
> Wilcoxon-Mann-Whitney U-test.
>
> ___
> The function `cwilcox` has three inputs: k, m and n. To establish the test
> statistics `cwilcox` determines the number of possible sums of natural
> numbers with the following restrictions:
> - the sum of the elements must be k,
> - the elements (summands) must be in a decreasing (or increasing) order,
> - every element must be less then m and
> - there must be at most n summands.
>
> The problem with the evaluation of this function `cwilcox(k,m,n)` is that
> it requires a recursion where in fact a three-dimensional array has to be
> evaluated (see the code around line 157). This requires a lot of memory and
> takes time and seems still an issue even with newer machines, see the
> warning in the documentation
> https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test
> .
>
> In an old paper I have written a formula where the evaluation can be done
> in a one-dimensional array that uses way less memory and is much faster.
> The paper itself is in German (you find a copy here:
> https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf),
> so I uploaded a translation into English (to be found in
> https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf).
>
> I also looked at the code in R and wrote a new version of the code that
> uses my formulae so that a faster implementation is possible (code with
> comments is below). I have several questions regarding the code:
>
>1. A lot of commands in the original R code are used to handle the
>memory. I do not know how to do this so I skipped memory handling
>completely and simply allocated space to an array (see below int
>cwilcox_distr[(m*n+1)/2];). Since my array is 1-dimensional instead of
>3-dimensional I think as a first guess that will be ok.
>2. I read the documentation when and why code in R should be changed. I
>am not familiar enough with R to understand how this applies here. My code
>uses less memory - is that enough? Or should another function be defined
>that uses my code? Or shall it be disregarded?
>3. I was not able to file my ideas in bugzilla.r-project or the like and
>I hope that this mailing list is a good address for my question.
>
> I also have written an example of a main function where the original code
> from R is compared to my code. I do not attach this example because this
> email is already very long but I can provide that example if needed.
>
> Maybe we can discuss this further. Best wishes, Andreas Löffler.
>
>
> ```
> #include 
> #include 
>
> /* The traditional approch to determine the Mann-Whitney statistics uses a
> recursion formular for partitions of natural numbers, namely in the line
> w[i][j][k] = cwilcox(k - j, i - 1, j) + cwilcox(k, i, j - 1);
> (taken from
> https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c).
> This approach requires a huge number of partitions to be evaluated because
> the second variable (j in the left term) and the third variable (k in the
> left term) in this recursion are not constant but change as well. Hence, a
> three dimensional array is evaluated.
>
> In an old paper a recursion equations was shown that avoids this
> disadvantage. The recursion equation of that paper uses only an array where
> the second as well as the third variable remain constant. This implies
> faster evaluation and less memory used. The original paper is in German and
> can be found in
> https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf
> and the author has uploaded a translation into English in
> https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf. This
> function uses this approach. */
>
> static int
> cwilcox_sigma(int k, int m, int n) { /* this relates to the sigma function
> below */
>   int s, d;
>
>   s=0;
>   for (d = 1; d <= m; d++) {
>   if (k%d == 0) {
>   s=s+d;
>   }
>   }
>   for (d = n

Re: [Rd] as.matrix.dist patch (performance)

2024-01-16 Thread Tim Taylor
Cheers Ivan

Heather Turner has improved in it further in the R contributors slack. I’ll 
take another look and ensure something is added to Bugzilla, with attribution, 
in the next few days.

Tim

> On 16 Jan 2024, at 14:36, Ivan Krylov  wrote:
> 
> Dear Tim,
> 
> В Thu, 10 Aug 2023 22:38:44 +0100
> Tim Taylor  пишет:
> 
>> Submitting here in the first instance but happy to move to Bugzilla
>> if more appropriate.
> 
> It's a fine patch. The 1.7 times speed up from not transposing the
> return value shouldn't be sneezed at. I think it's time to move it to
> Bugzilla so that it won't be completely forgotten.
> 
> --
> Best regards,
> Ivan

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-16 Thread Dipterix Wang
Could you recommend any packages/functions that compute hash such that the 
source references and sexpinfo_struct are ignored? Basically a version of 
`serialize` that convert R objects to raw without storing the ancillary source 
reference and sexpinfo.

I think most people would think of `digest` but that package uses `serialize` 
(see discussion 
https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)

> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera  wrote:
> 
> 
> On 1/12/24 06:11, Dipterix Wang wrote:
>> Dear R devs,
>> 
>> I was digging into a package issue today when I realized R serialize 
>> function not always generate the same results on equivalent objects when 
>> users choose to run differently. For example, the following code
>> 
>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
>> 
>> generates different results when I copy-paste into console vs when I use 
>> ctrl+shift+enter to source the file in RStudio.
>> 
>> With a deeper inspect into the cause, I found that function and language get 
>> source reference when getOption("keep.source") is TRUE. This means the 
>> source reference will make the functions different while in most cases, 
>> whether keeping function source might not impact how a function behaves.
>> 
>> While it's OK that function serialize generates different results, functions 
>> such as `rlang::hash` and `digest::digest`, which depend on `serialize` 
>> might eventually deliver false positives on same inputs. I've checked source 
>> code in digest package hoping to get around this issue (for example 
>> serialize(..., refhook = ...)). However, my workaround did not work. It 
>> seems that the markers to the objects are different even if I used `refhook` 
>> to force srcref to be the same. I also tried `removeSource` and 
>> `rlang::zap_srcref`. None of them works directly on nested environments with 
>> multiple functions.
>> 
>> I wonder how hard it would be to have options to discard source when 
>> serializing R objects?
>> 
>> Currently my analyses heavily depend on digest function to generate file 
>> caches and automatically schedule pipelines (to update cache) when changes 
>> are detected. The pipelines save the hashes of source code, inputs, and 
>> outputs together so other people can easily verify the calculation without 
>> accessing the original data (which could be sensitive), or running hour-long 
>> analyses, or having to buy servers. All of these require `serialize` to 
>> produce the same results regardless of how users choose to run the code.
>> 
>> It would be great if this feature could be in the future R. Other pipeline 
>> packages such as `targets` and `drake` can also benefit from it.
> 
> I don't think such functionality would belong to serialize(). This function 
> is not meant to produce stable results based on the input, the serialized 
> representation may even differ based on properties not seen by users.
> 
> I think an option to ignore source code would belong to a function that 
> computes the hash, as other options of identical().
> 
> Tomas
> 
> 
>> Thanks,
>> 
>> - Dipterix
>>  [[alternative HTML version deleted]]
>> 
>> __
>> R-devel@r-project.org  mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] cwilcox - new version

2024-01-16 Thread Aidan Lakshman
I’ve been looking at this for a couple hours--it ended up being trickier than I 
expected to implement well.

I’ve attached a new patch here. This version scales significantly better than 
the existing method for both `pwilcox` and `dwilcox`. Examples are included 
below.

I can’t think of any other ways to improve runtime at this point. That’s not to 
say there aren’t any, though—I’m hopeful inspection by someone else could 
identify more ways to save some time.

This version uses Andreas’s algorithm for computing the value of `cwilcox`, but 
evaluates it lazily and caches results. Memory usage of this should be orders 
of magnitude less than the current version, and recursion is eliminated. As far 
as I can tell, the numerical accuracy of the results is unchanged.

Performance statistics are interesting. If we assume the two populations have a 
total of `m` members, then this implementation runs slightly slower for m < 20, 
and much slower for 50 < m < 100. However, this implementation works 
significantly *faster* for m > 200. The breakpoint is precisely when each 
population has a size of 50; `qwilcox(0.5,50,50)` runs in 8 microseconds in the 
current version, but `qwilcox(0.5, 50, 51)` runs in 5 milliseconds. The new 
version runs in roughly 1 millisecond for both. This is probably because of 
internal logic that requires many more `free/calloc` calls if either population 
is larger than `WILCOX_MAX`, which is set to 50.

I’m hopeful that this can be optimized to be suitable for inclusion in R. Lower 
performance for population sizes below 50 is not ideal, since `wilcox.test` 
switches to non-exact testing for population sizes above 50.

-Aidan

Benchmarking results on my machine using `microbenchmark`:
(median runtimes over 100 iterations, us=microseconds, ms=milliseconds, 
s=seconds)

`qwilcox(0.5, n, n)`

  n: 10  25 50 100 200
old version:  1.2us   2.9us  9.0us  87.4ms2.3s
Andreas version:  2.7us  68.6us  1.1ms  16.9ms 264.3ms

`dwilcox((n*n)%/%2, n, n)`
  n: 10  25 50 100 200
old version:  1.4us   0.9us  0.9us  43.2ms 851.6ms
Andreas version:  2.3us  53.9us  1.0ms  16.4ms 261.7ms


`pwilcox(1:100, 10, 10)`:
old version:  62.9us
Andreas version:  22.9us


---
Aidan Lakshman (he/him)
PhD Candidate, Wright Lab
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
http://www.ahl27.com/
ah...@pitt.edu | (724) 612-9940

On 16 Jan 2024, at 9:47, Aidan Lakshman wrote:

> New email client, hoping that Outlook won’t continue to HTMLify my emails and 
> mess up my attachments without my consent :/
>
> I’m re-attaching the patch file here, compressed as a .gz in case that was 
> the issue with the previous attachment. Thanks to Ivan for pointing out that 
> my attachment didn’t go through correctly on the first try.
>
> I’ll be working on seeing if I can optimize this to actually improve 
> performance during the day today.
>
> -Aidan
>
> On 15 Jan 2024, at 5:51, Andreas Löffler wrote:
>
>> I am a newbie to C (although I did some programming in Perl and Pascal) so
>> forgive me for the language that follows. I am writing because I have a
>> question concerning the *implementation of *(the internal function)*
>> cwilcox* in
>> https://svn.r-project.org/R/!svn/bc/81274/branches/R-DL/src/nmath/wilcox.c.
>> This function is used to determine the test statistics of the
>> Wilcoxon-Mann-Whitney U-test.
>>
>> ___
>> The function `cwilcox` has three inputs: k, m and n. To establish the test
>> statistics `cwilcox` determines the number of possible sums of natural
>> numbers with the following restrictions:
>> - the sum of the elements must be k,
>> - the elements (summands) must be in a decreasing (or increasing) order,
>> - every element must be less then m and
>> - there must be at most n summands.
>>
>> The problem with the evaluation of this function `cwilcox(k,m,n)` is that
>> it requires a recursion where in fact a three-dimensional array has to be
>> evaluated (see the code around line 157). This requires a lot of memory and
>> takes time and seems still an issue even with newer machines, see the
>> warning in the documentation
>> https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test
>> .
>>
>> In an old paper I have written a formula where the evaluation can be done
>> in a one-dimensional array that uses way less memory and is much faster.
>> The paper itself is in German (you find a copy here:
>> https://upload.wikimedia.org/wikipedia/commons/f/f5/LoefflerWilcoxonMannWhitneyTest.pdf),
>> so I uploaded a translation into English (to be found in
>> https://upload.wikimedia.org/wikipedia/de/1/19/MannWhitney_151102.pdf).
>>
>> I also looked at the code in R and wrote a new version of the code that
>> uses my formulae so that a fas

Re: [Rd] ADA Compliance

2024-01-16 Thread Robert Baer

On 1/12/2024 1:50 PM, Hunter, Zayne via R-devel wrote:

Hello,


I am working with Ball State University to obtain a license of R.


R is open source so there really should be no need to "obtain a license 
for R".  This can hopefully reduce the hoops you must jump through.   
Mostly, I believe you can think of the R license as GPL2/GPL3, but you 
can read more about other R and R package licenses here: 
https://www.r-project.org/Licenses/#:~:text=R%20as%20a%20package%20is,%2C%20which%20includes%20GPL%2D3%20.




As part of our requirements for obtaining new software, we must review the VPAT 
for ADA compliance. Can you provide this information for me?

Thanks,


Zayne Hunter
Technology Advisor & Vendor Relations Manager
Ball State University
zayne.hun...@bsu.edu
(765)285-7853






[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel