Re: [Rd] Fast Kendall's Tau

2012-06-28 Thread David Simcha
Sorry for the late reply.  I've been extremely busy, out of town, etc.
lately.  I wrote the core algorithm for this function back in 2010, but did
not make any attempt to integrate it into R-base, for several reasons:

1.  I was having difficulty figuring out what all the missing value options
(which I never use) are supposed to do and how to efficiently adapt my code
to them.

2.  The cor() function (IIRC; this is from memory from over a year ago)
implements Pearson, Spearman and Kendall very messily all in one function.
I felt that this could use some refactoring to make the integration of a
new Kendall algorithm sane, but didn't know the codebase well enough to do
it myself w/o a significant learning curve.

3.  Testing the integration would require setting up a build environment
for R.

Basically, I wanted to contribute this algorithm but didn't want to go
through the necessary learning curve to become a regular R-base
contributor.  I was hoping the integration work would be trivial to someone
who contributes to this codebase regularly.

On Tue, Jun 26, 2012 at 5:44 PM, Duncan Murdoch wrote:

> On 12-06-25 2:48 PM, Adler, Avraham wrote:
>
>> Hello.
>>
>> Has any further action been taken regarding implementing David Simcha's
>> fast Kendall tau code (now found in the package pcaPP as cor.fk) into
>> R-base? It is literally hundreds of times faster, although I am uncertain
>> as to whether he wrote code for testing the significance of the parameter.
>> The last mention I have seen of this was in 2010> pipermail/r-devel/2010-**February/056745.html
>> >.
>>
>
> You could check the NEWS file, but I don't remember anything being done
> along these lines.  If the code is in a CRAN package, there doesn't seem to
> be any need to move it to base R.
>
> Duncan Murdoch
>

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] O(N log N) Kendall Tau

2009-12-13 Thread David Simcha
I've noticed that the implementation of Kendall's Tau in R is O(N^2).  
The following reference describes how it can be done in O(N log N):


A Computer Method for Calculating Kendall's Tau with Ungrouped Data
William R. Knight
Journal of the American Statistical Association, Vol. 61, No. 314, Part 
1 (Jun., 1966), pp. 436-439

http://www.jstor.org/pss/2282833

I'm interested in contributing an implementation of this algorithm to 
R.  However, I have a few questions:


1.  Would it likely be accepted if well-implemented?
2.  cov.c in the main/ directory of the R source tree seems to contain 
at least 4 different but nearly identical implementations of Kendall's 
Tau:  cov_complete1, cov_complete2, cov_na1, and cov_na2.  I can't tell 
why.  The file is very difficult to read because of its lack of 
comments, (ab)use of macros and improper indentation.  As I will 
probably need to modify this file, can some seasoned veteran please 
explain what the heck is going on in this file?


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] O(N log N) impl of Kendall's Tau

2010-02-24 Thread David Simcha
I've attached an O(N log N) implementation of Kendall's Tau, which I 
hope will eventually replace the O(N^2) version currently implemented in 
R.  For N = 30,000 it's several hundred times faster than the old 
version.  The core algorithm comes with a lot of tests, which are 
included in the kendall.c file.  However, I've not yet integrated this 
code into the rest of R because, honestly, the code in cor.c is 
inscrutable and intermingles computing Kendall's Tau with dealing with 
missing values and computing other kinds of correlation.  I'd like one 
of the core devs' help with the integration.  The details of the 
algorithm and how to use it are explained in the comments inside kendall.c.


Please let me know what else I can do to help get this improvement 
folded into R.


--David Simcha
/* This file contains code to calculate Kendall's Tau in O(N log N) time in
 * a manner similar to the following reference:
 *
 * A Computer Method for Calculating Kendall's Tau with Ungrouped Data
 * William R. Knight Journal of the American Statistical Association, Vol. 61,
 * No. 314, Part 1 (Jun., 1966), pp. 436-439
 *
 * Copyright 2010 David Simcha
 *
 * License:
 * Boost Software License - Version 1.0 - August 17th, 2003
 *
 * Permission is hereby granted, free of charge, to any person or organization
 * obtaining a copy of the software and accompanying documentation covered by
 * this license (the "Software") to use, reproduce, display, distribute,
 * execute, and transmit the Software, and to prepare derivative works of the
 * Software, and to permit third-parties to whom the Software is furnished to
 * do so, all subject to the following:
 *
 * The copyright notices in the Software and this entire statement, including
 * the above license grant, this restriction and the following disclaimer,
 * must be included in all copies of the Software, in whole or in part, and
 * all derivative works of the Software, unless such copies or derivative
 * works are solely in the form of machine-executable object code generated by
 * a source language processor.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
 * SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
 * FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
 * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
 * DEALINGS IN THE SOFTWARE.
 *
 */

#include 
#include 
#include 
#include 


uint64_t insertionSort(double*, size_t);

#define kendallTest
#ifdef kendallTest

#include 
#include 
#include 



/* Kludge:  In testing mode, just forward R_rsort to insertionSort to make this
 * module testable without having to include (and compile) a bunch of other
 * stuff.
 */
void R_rsort(double* arr, int len) {
insertionSort(arr, len);
}
#else

#include   /* For R_rsort. */

#endif

/* Sorts in place, returns the bubble sort distance between the input array
 * and the sorted array.
 */
uint64_t insertionSort(double* arr, size_t len) {
size_t maxJ, i;
uint64_t swapCount = 0;

if(len < 2) {
return 0;
}

maxJ = len - 1;
for(i = len - 2; i < len; --i) {
size_t j = i;
double val = arr[i];

for(; j < maxJ && arr[j + 1] < val; ++j) {
arr[j] = arr[j + 1];
}

arr[j] = val;
swapCount += (j - i);
}

return swapCount;
}

static uint64_t merge(double* from, double* to, size_t middle, size_t len) {
size_t bufIndex, leftLen, rightLen;
uint64_t swaps;
double* left;
double* right;

bufIndex = 0;
swaps = 0;

left = from;
right = from + middle;
rightLen = len - middle;
leftLen = middle;

while(leftLen && rightLen) {
if(right[0] < left[0]) {
to[bufIndex] = right[0];
swaps += leftLen;
rightLen--;
right++;
} else {
to[bufIndex] = left[0];
leftLen--;
left++;
}
bufIndex++;
}

if(leftLen) {
memcpy(to + bufIndex, left, leftLen * sizeof(double));
} else if(rightLen) {
memcpy(to + bufIndex, right, rightLen * sizeof(double));
}

return swaps;
}

/* Sorts in place, returns the bubble sort distance between the input array
 * and the sorted array.
 */
uint64_t mergeSort(double* x, double* buf, size_t len) {
uint64_t swaps;
size_t half;

if(len < 10) {
return insertionSort(x, len);
}

swaps = 0;

if(len < 2) {
return 0;
}

half = len / 2;
swaps += mergeSort(x, buf, half);
swaps += mergeSort(x + half, buf + half, len - half);
swaps += merge(x, buf, half, len);

memcpy(x, buf, len * sizeof(doubl