[Rd] Possible bug in Spearman correlation with use="pairwise.complete.obs"

2011-01-22 Thread Simon Anders

Hi,

I have just encountered a strange behaviour from 'cor' with regards to 
the treatment of NAs when calculating Spearman correlations. I guess it 
is a subtle bug.


If I understand the help page correctly, the two modes 'complete.obs' 
and 'pairwise.complete.obs' specify how to deal with correlation 
coefficients when calculating a correlation _matrix_. When calculating a 
single (scalar) correlation coefficient for two data vectors x and y, 
both should give the same result.


For Pearson correlation, this is in fact the case:


x <- runif( 10 )
y <- runif( 10 )
y[5] <- NA



cor( x, y, use="complete.obs" )

[1] 0.407858

cor( x, y, use="pairwise.complete.obs" )

[1] 0.407858

For Spearman correlation, we do NOT get the same results


cor( x, y, method="spearman", use="complete.obs" )

[1] 0.3416009

cor( x, y, method="spearman", use="pairwise.complete.obs" )

[1] 0.333

To see the likely reason for this possible bug, observe:


goodobs <- !is.na(x) & !is.na(y)



cor( rank(x)[goodobs], rank(y)[goodobs] )

[1] 0.3416009

cor( rank(x[goodobs]), rank(y[goodobs]) )

[1] 0.333

I would claim that only the calculation resulting in 0. is a proper 
Spearman correlation, while the line resulting in 0.3416 is not. After 
all, the following is not a complete set of ranks because there are 9 
observations, numbered from 1 to 10, skipping the 3:



rank(x)[goodobs]

[1] 10  6  8  7  4  5  1  9  2

Would you hence agree that 'method="spearman"' with 
'use="pairwise.complete.obs"' is incorrect?


Cheers
  Simon



sessionInfo()

R version 2.12.0 (2010-10-15)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.utf8   LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8   LC_NAME=C
 [9] LC_ADDRESS=C  LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] pspearman_0.2-5 SuppDists_1.1-8

loaded via a namespace (and not attached):
[1] tools_2.12.0




+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sand...@fs.tum.de

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] R_ext manual: type cast in C function registra tion

2010-01-07 Thread Simon Anders
Hi,

the "Writing R extensions" manual currently advises to register a .Call
function as follows:

   R_CallMethodDef callMethods[]  = {
   {"myCall", &myCall, 3},
   {NULL, NULL, 0}
   };

This produces a compiler warning, at least on my GCC, because the second
slot in the R_CallMethodDef is declared as DL_FUNC (which is declared as
'typedef void * (*DL_FUNC)();').

I'd suggest to change the example code in the manual to include an
explicit cast, i.e., 


   R_CallMethodDef callMethods[]  = {
   {"myCall", (DL_FUNC) &myCall, 3},
   {NULL, NULL, 0}
   };

in order to get rid of an unnecessary compiler warning.

Cheers
  Simon


+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sand...@fs.tum.de

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] 'merge' function: behavior w.r.t. NAs in the key column

2008-03-14 Thread Simon Anders
Hi,

I recently ran into a problem with 'merge' that stems from the way how 
missing values in the key column (i.e., the column specified
in the "by" argument) are handled. I wonder whether the current behavior 
is fully consistent.

Please have a look at this example:

> x <- data.frame( key = c(1:3,3,NA,NA), val = 10+1:6 )
> y <- data.frame( key = c(NA,2:5,3,NA), val = 20+1:7 )

> x
   key val
1   1  11
2   2  12
3   3  13
4   3  14
5  NA  15
6  NA  16

> y
   key val
1  NA  21
2   2  22
3   3  23
4   4  24
5   5  25
6   3  26
7  NA  27

> merge( x, y, by="key" )
   key val.x val.y
1   21222
2   31323
3   31326
4   31423
5   31426
6  NA1521
7  NA1527
8  NA1621
9  NA1627

As one should expect, there are now four lines with key value '3',
because the key '3' appears twice both in x and in y. According to the
logic of merge, a row should be produced in the output for each pairing
of a row from x and a row from y where the values of 'key' are equal.

However, the 'NA' values are treated exactly the same way. It seems that 
'merge' considers the pairing of lines with 'NA' in both 'key' columns 
an allowed match. IMHO, this runs against the convention that two NAs 
are not considered equal. ('NA==NA' does not evaluate to 'TRUE'.)

Is might be more consistent if merge did not include any rows into the 
output with an "NA" in the key column.

Maybe, one could add a flag argument to 'merge' to switch between this 
behaviour and the current one? A note in the help page might be nice, too.

Best regards
   Simon



+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
| office phone +44-1223-494478, mobile phone +44-7505-841692
| preferred (permanent) e-mail: [EMAIL PROTECTED]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 'merge' function: behavior w.r.t. NAs in the key column

2008-03-19 Thread Simon Anders
Hi Bill,

Bill Dunlap wrote:
> Splus (versions 8.0, 7.0, and 6.2) gives:
>> merge( x, y, by="key" )
>  key val.x val.y
>1   21222
>2   31323
>3   31423
>4   31326
>5   31426
> Is that what you expect?  There is no argument
> to Splus's merge to make it include the NA's
> in the way R's merge does.  Should there be such
> an argument?

Yes, this is what I would expect.

Would it be reasonable to consider Splus's behavior as correct and R's 
behavior as inconsistent, and hence ask R's 'merge' function to be fixed?

Cheers
   Simon

> On Fri, 14 Mar 2008, Simon Anders wrote:
>> I recently ran into a problem with 'merge' that stems from the way how
>> missing values in the key column (i.e., the column specified
>> in the "by" argument) are handled. I wonder whether the current behavior
>> is fully consistent.
>> ...
>>> x <- data.frame( key = c(1:3,3,NA,NA), val = 10+1:6 )
>>> y <- data.frame( key = c(NA,2:5,3,NA), val = 20+1:7 )
>> ...
>>> merge( x, y, by="key" )
>>key val.x val.y
>> 1   21222
>> 2   31323
>> 3   31326
>> 4   31423
>> 5   31426
>> 6  NA1521
>> 7  NA1527
>> 8  NA1621
>> 9  NA1627
>>
>> As one should expect, there are now four lines with key value '3',
>> because the key '3' appears twice both in x and in y. According to the
>> logic of merge, a row should be produced in the output for each pairing
>> of a row from x and a row from y where the values of 'key' are equal.
>>
>> However, the 'NA' values are treated exactly the same way. It seems that
>> 'merge' considers the pairing of lines with 'NA' in both 'key' columns
>> an allowed match. IMHO, this runs against the convention that two NAs
>> are not considered equal. ('NA==NA' does not evaluate to 'TRUE'.)
>>
>> Is might be more consistent if merge did not include any rows into the
>> output with an "NA" in the key column.
>>
>> Maybe, one could add a flag argument to 'merge' to switch between this
>> behaviour and the current one? A note in the help page might be nice, too.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] manual 'Writing R Extensions': bug in example

2008-04-06 Thread Simon Anders
Hi,

I would like to report a small bug in one of the code examples given in 
the 'Writing R extensions' manual. Section 5.9.2 ("Calling .External") 
defines the function "showArgs" in order to demonstrate how to access in 
C the pair-listed arguments passed by .External. The function aims at 
printing echoing all passed arguments to the screen but only prints 
every second argument.

The loop to go through the pair list starts as follows

   args = CDR(args); /* skip 'name' */
   for(i = 0; args != R_NilValue; i++, args = CDR(args)) {
  args = CDR(args);/* <--- This line is wrong. */

As you can see, "args = CDR(args)" is called twice, resulting in every 
other argument skipped. The marked line should be deleted.

Furthermore, the function requires all arguments to be named, as it 
gives an error if TAG returns NIL in the line

name = CHAR(PRINTNAME(TAG(args)));

It might improve the example to introduce here a test "if( TAG(args) != 
R_NilValue )".

May I suggest to the maintainer of the manual to change the example to 
the following code? (Changes to original: lines 1, 2, 15, 17, 18 added; 
one line deleted after line 13.)

  1  #include 
  2  #include 
  3  #include 
  4
  5  SEXP showArgs(SEXP args)
  6  {
  7int i, nargs;
  8Rcomplex cpl;
  9const char *name;
10SEXP el;
11
12args = CDR(args); /* skip 'name' */
13for(i = 0; args != R_NilValue; i++, args = CDR(args)) {
14
15  if( TAG(args) != R_NilValue )
16 name = CHAR(PRINTNAME(TAG(args)));
17  else
18 name = "";
19  switch(TYPEOF(CAR(args))) {
20  case REALSXP:
21Rprintf("[%d] '%s' %f\n", i+1, name, REAL(CAR(args))[0]);
22break;
23  case LGLSXP:
24  case INTSXP:
25Rprintf("[%d] '%s' %d\n", i+1, name, INTEGER(CAR(args))[0]);
26break;
27  case CPLXSXP:
28cpl = COMPLEX(CAR(args))[0];
29Rprintf("[%d] '%s' %f + %fi\n", i+1, name, cpl.r, cpl.i);
30break;
31  case STRSXP:
32Rprintf("[%d] '%s' %s\n", i+1, name,
33   CHAR(STRING_ELT(CAR(args), 0)));
34    break;
35  default:
36Rprintf("[%d] '%s' R type\n", i+1, name);
37  }
38}
39return(R_NilValue);
40  }
41

Best regards
   Simon Anders


+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
| office phone +44-1223-494478, mobile phone +44-7505-841692
| preferred (permanent) e-mail: [EMAIL PROTECTED]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Rterm.exe, windows event loop running in multiple threads?

2008-09-18 Thread Simon Anders
Hi Michael,

Michael Lawrence wrote:
>>> For processing events, RGtk2 has moved away from using the old tcl_do
hack
>>> and now synchronizes on the Windows event loop. In Rgui.exe this works
>>> fine
>>> (well mostly), but in Rterm.exe, users report the following warning from
>>> GLib: "main loop already active in another thread". The most obvious way
>>> this could occur is if the Windows event loop were iterating in multiple
>>> threads. This does not seem to be the case from my casual inspection of
>>> the
>>> R source.

Following your advice, I have changed my HilbertCurveDisplay package from
the tcl_do hack to your threading technique.

IIRC, you start this second thread whose event handler gets called
periodically by a timer. It checks whether there are any pending Gtk
events, and if so, sends a message to the R main thread in order to cause
one GTK event loop iteration.

I have writen my gtkmm code the same way and got the same error message,
repeated many times. However, after I removed the check for pending events
from the handler of the timer event, the error message vanishes. It seems
that GTK does not like it if gtk_events_pending() is called from two
threads even if both threads refer to the same main loop. (Actually, it
does not mind, as the error is just a warning, and fucntionality ois not
disrupted. I guess that Rgui.exe gets the same warning but fails to
display it.)

Could this be the problem for you as well?

[Thinking a bout this again: Why, actually, do we need a second thread.
Could we not set up a Windows timer that dispatches a message to the main
thread's event handler peridoically to trigger a polling of the Gtk event
loop?]

  Simon



+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
| office phone +44-1223-494478, mobile phone +44-7505-841692
| preferred (permanent) e-mail: [EMAIL PROTECTED]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Possible bug: How does libR.so find RHOME?

2009-04-06 Thread Simon Anders
Hi

While installing RPy2, I had curious problems which, if I traced them
back correctly, have their root in the way that libR.so find the R
installation directory. I wonder if this might be a subtle bug that only
causes problems when one has several versions of R on one system but
then can be very annoying and cause very strange behaviour.

It seems to me that libR.so asks the Unix shell to execute "R RHOME" or
something similar, and this causes problems, as detailed below.

We have several R installations on our server (a CentOS Linux, version
4.6, x86_64 machine), and in my shell environment, the symlink "R" that
is found in the system search path points to a static build of R-2.8.1,
while the symlink "R-2.9" points to today's R-2.9.beta, built with
'--enable-shlib'. (What follows is true, however, for R 2.8.1, built as
a library, as well.)

The setup.py script of RPy2 calls "R RHOME" to find R and then compiles
its C code to a shared object that is linked to libR.so. It uses the
'-R' linker option to store the full path to libR.so.

As "R RHOME" would have returned the path to my static R-2.8
installation, I specified the RHOME path to my R-2.9 library build
explicitly to RPy2's setup.py by means of an environment variable that
is checked by the script. This resulted in an RPy2 installation properly
linked to my R-2.9.

However, on starting RPy2 in Python, I got this strange error:

  >>> from rpy2 import robjects
  Error in grep("(_US|_CA)", lcpaper) :
8 arguments passed to .Internal(grep) which requires 9
  [...]

RPy2 started up nevertheless, but failed to load the base packages, so
that all standard functions were missing.

Changing the "R" symlink in my search path to point to the R-2.9 version
against which RPy2 is linked solved the problem. It seems that, before,
the R-2.9 libR.so tried to load the base environment of R-2.8, because
R-2.8 was in the search path.

As a further check, I put an empty script called "R" that does nothing
into the search path. Then I get this:

   >>> from rpy2 import robjects
   cannot find system Renviron
   Fatal error: unable to open the base package

My guess is that it is libR.so (and not RPy2) that locates the R
executable in the path. If this is so, this seems to be a bug to me: An
application (let alone a shared library) should not rely on the names of
symlinks in the user's search path.

Best regards
  Simon


+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
| office phone +44-1223-492680, mobile phone +44-7505-841692
| preferred (permanent) e-mail: sand...@fs.tum.de

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Possible bug: How does libR.so find RHOME?

2009-04-06 Thread Simon Anders
Hi

please disregard the previous mail; I realized that my conclusion that
the bug is in R and not in Rpy2, was quite wrong, as I just noticed
after abit more thinking. The problem is at a pretty obvious place in
RPy2's initialization routine. I'll ask the RPy2 mailing list for help.

  Simon

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel