[Rd] Possible bug in Spearman correlation with use="pairwise.complete.obs"
Hi, I have just encountered a strange behaviour from 'cor' with regards to the treatment of NAs when calculating Spearman correlations. I guess it is a subtle bug. If I understand the help page correctly, the two modes 'complete.obs' and 'pairwise.complete.obs' specify how to deal with correlation coefficients when calculating a correlation _matrix_. When calculating a single (scalar) correlation coefficient for two data vectors x and y, both should give the same result. For Pearson correlation, this is in fact the case: x <- runif( 10 ) y <- runif( 10 ) y[5] <- NA cor( x, y, use="complete.obs" ) [1] 0.407858 cor( x, y, use="pairwise.complete.obs" ) [1] 0.407858 For Spearman correlation, we do NOT get the same results cor( x, y, method="spearman", use="complete.obs" ) [1] 0.3416009 cor( x, y, method="spearman", use="pairwise.complete.obs" ) [1] 0.333 To see the likely reason for this possible bug, observe: goodobs <- !is.na(x) & !is.na(y) cor( rank(x)[goodobs], rank(y)[goodobs] ) [1] 0.3416009 cor( rank(x[goodobs]), rank(y[goodobs]) ) [1] 0.333 I would claim that only the calculation resulting in 0. is a proper Spearman correlation, while the line resulting in 0.3416 is not. After all, the following is not a complete set of ranks because there are 9 observations, numbered from 1 to 10, skipping the 3: rank(x)[goodobs] [1] 10 6 8 7 4 5 1 9 2 Would you hence agree that 'method="spearman"' with 'use="pairwise.complete.obs"' is incorrect? Cheers Simon sessionInfo() R version 2.12.0 (2010-10-15) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C [3] LC_TIME=en_US.utf8LC_COLLATE=en_US.utf8 [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8 [7] LC_PAPER=en_US.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] pspearman_0.2-5 SuppDists_1.1-8 loaded via a namespace (and not attached): [1] tools_2.12.0 +--- | Dr. Simon Anders, Dipl.-Phys. | European Molecular Biology Laboratory (EMBL), Heidelberg | office phone +49-6221-387-8632 | preferred (permanent) e-mail: sand...@fs.tum.de __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] R_ext manual: type cast in C function registra tion
Hi, the "Writing R extensions" manual currently advises to register a .Call function as follows: R_CallMethodDef callMethods[] = { {"myCall", &myCall, 3}, {NULL, NULL, 0} }; This produces a compiler warning, at least on my GCC, because the second slot in the R_CallMethodDef is declared as DL_FUNC (which is declared as 'typedef void * (*DL_FUNC)();'). I'd suggest to change the example code in the manual to include an explicit cast, i.e., R_CallMethodDef callMethods[] = { {"myCall", (DL_FUNC) &myCall, 3}, {NULL, NULL, 0} }; in order to get rid of an unnecessary compiler warning. Cheers Simon +--- | Dr. Simon Anders, Dipl.-Phys. | European Molecular Biology Laboratory (EMBL), Heidelberg | office phone +49-6221-387-8632 | preferred (permanent) e-mail: sand...@fs.tum.de __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] 'merge' function: behavior w.r.t. NAs in the key column
Hi, I recently ran into a problem with 'merge' that stems from the way how missing values in the key column (i.e., the column specified in the "by" argument) are handled. I wonder whether the current behavior is fully consistent. Please have a look at this example: > x <- data.frame( key = c(1:3,3,NA,NA), val = 10+1:6 ) > y <- data.frame( key = c(NA,2:5,3,NA), val = 20+1:7 ) > x key val 1 1 11 2 2 12 3 3 13 4 3 14 5 NA 15 6 NA 16 > y key val 1 NA 21 2 2 22 3 3 23 4 4 24 5 5 25 6 3 26 7 NA 27 > merge( x, y, by="key" ) key val.x val.y 1 21222 2 31323 3 31326 4 31423 5 31426 6 NA1521 7 NA1527 8 NA1621 9 NA1627 As one should expect, there are now four lines with key value '3', because the key '3' appears twice both in x and in y. According to the logic of merge, a row should be produced in the output for each pairing of a row from x and a row from y where the values of 'key' are equal. However, the 'NA' values are treated exactly the same way. It seems that 'merge' considers the pairing of lines with 'NA' in both 'key' columns an allowed match. IMHO, this runs against the convention that two NAs are not considered equal. ('NA==NA' does not evaluate to 'TRUE'.) Is might be more consistent if merge did not include any rows into the output with an "NA" in the key column. Maybe, one could add a flag argument to 'merge' to switch between this behaviour and the current one? A note in the help page might be nice, too. Best regards Simon +--- | Dr. Simon Anders, Dipl. Phys. | European Bioinformatics Institute, Hinxton, Cambridgeshire, UK | office phone +44-1223-494478, mobile phone +44-7505-841692 | preferred (permanent) e-mail: [EMAIL PROTECTED] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] 'merge' function: behavior w.r.t. NAs in the key column
Hi Bill, Bill Dunlap wrote: > Splus (versions 8.0, 7.0, and 6.2) gives: >> merge( x, y, by="key" ) > key val.x val.y >1 21222 >2 31323 >3 31423 >4 31326 >5 31426 > Is that what you expect? There is no argument > to Splus's merge to make it include the NA's > in the way R's merge does. Should there be such > an argument? Yes, this is what I would expect. Would it be reasonable to consider Splus's behavior as correct and R's behavior as inconsistent, and hence ask R's 'merge' function to be fixed? Cheers Simon > On Fri, 14 Mar 2008, Simon Anders wrote: >> I recently ran into a problem with 'merge' that stems from the way how >> missing values in the key column (i.e., the column specified >> in the "by" argument) are handled. I wonder whether the current behavior >> is fully consistent. >> ... >>> x <- data.frame( key = c(1:3,3,NA,NA), val = 10+1:6 ) >>> y <- data.frame( key = c(NA,2:5,3,NA), val = 20+1:7 ) >> ... >>> merge( x, y, by="key" ) >>key val.x val.y >> 1 21222 >> 2 31323 >> 3 31326 >> 4 31423 >> 5 31426 >> 6 NA1521 >> 7 NA1527 >> 8 NA1621 >> 9 NA1627 >> >> As one should expect, there are now four lines with key value '3', >> because the key '3' appears twice both in x and in y. According to the >> logic of merge, a row should be produced in the output for each pairing >> of a row from x and a row from y where the values of 'key' are equal. >> >> However, the 'NA' values are treated exactly the same way. It seems that >> 'merge' considers the pairing of lines with 'NA' in both 'key' columns >> an allowed match. IMHO, this runs against the convention that two NAs >> are not considered equal. ('NA==NA' does not evaluate to 'TRUE'.) >> >> Is might be more consistent if merge did not include any rows into the >> output with an "NA" in the key column. >> >> Maybe, one could add a flag argument to 'merge' to switch between this >> behaviour and the current one? A note in the help page might be nice, too. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] manual 'Writing R Extensions': bug in example
Hi, I would like to report a small bug in one of the code examples given in the 'Writing R extensions' manual. Section 5.9.2 ("Calling .External") defines the function "showArgs" in order to demonstrate how to access in C the pair-listed arguments passed by .External. The function aims at printing echoing all passed arguments to the screen but only prints every second argument. The loop to go through the pair list starts as follows args = CDR(args); /* skip 'name' */ for(i = 0; args != R_NilValue; i++, args = CDR(args)) { args = CDR(args);/* <--- This line is wrong. */ As you can see, "args = CDR(args)" is called twice, resulting in every other argument skipped. The marked line should be deleted. Furthermore, the function requires all arguments to be named, as it gives an error if TAG returns NIL in the line name = CHAR(PRINTNAME(TAG(args))); It might improve the example to introduce here a test "if( TAG(args) != R_NilValue )". May I suggest to the maintainer of the manual to change the example to the following code? (Changes to original: lines 1, 2, 15, 17, 18 added; one line deleted after line 13.) 1 #include 2 #include 3 #include 4 5 SEXP showArgs(SEXP args) 6 { 7int i, nargs; 8Rcomplex cpl; 9const char *name; 10SEXP el; 11 12args = CDR(args); /* skip 'name' */ 13for(i = 0; args != R_NilValue; i++, args = CDR(args)) { 14 15 if( TAG(args) != R_NilValue ) 16 name = CHAR(PRINTNAME(TAG(args))); 17 else 18 name = ""; 19 switch(TYPEOF(CAR(args))) { 20 case REALSXP: 21Rprintf("[%d] '%s' %f\n", i+1, name, REAL(CAR(args))[0]); 22break; 23 case LGLSXP: 24 case INTSXP: 25Rprintf("[%d] '%s' %d\n", i+1, name, INTEGER(CAR(args))[0]); 26break; 27 case CPLXSXP: 28cpl = COMPLEX(CAR(args))[0]; 29Rprintf("[%d] '%s' %f + %fi\n", i+1, name, cpl.r, cpl.i); 30break; 31 case STRSXP: 32Rprintf("[%d] '%s' %s\n", i+1, name, 33 CHAR(STRING_ELT(CAR(args), 0))); 34 break; 35 default: 36Rprintf("[%d] '%s' R type\n", i+1, name); 37 } 38} 39return(R_NilValue); 40 } 41 Best regards Simon Anders +--- | Dr. Simon Anders, Dipl. Phys. | European Bioinformatics Institute, Hinxton, Cambridgeshire, UK | office phone +44-1223-494478, mobile phone +44-7505-841692 | preferred (permanent) e-mail: [EMAIL PROTECTED] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Rterm.exe, windows event loop running in multiple threads?
Hi Michael, Michael Lawrence wrote: >>> For processing events, RGtk2 has moved away from using the old tcl_do hack >>> and now synchronizes on the Windows event loop. In Rgui.exe this works >>> fine >>> (well mostly), but in Rterm.exe, users report the following warning from >>> GLib: "main loop already active in another thread". The most obvious way >>> this could occur is if the Windows event loop were iterating in multiple >>> threads. This does not seem to be the case from my casual inspection of >>> the >>> R source. Following your advice, I have changed my HilbertCurveDisplay package from the tcl_do hack to your threading technique. IIRC, you start this second thread whose event handler gets called periodically by a timer. It checks whether there are any pending Gtk events, and if so, sends a message to the R main thread in order to cause one GTK event loop iteration. I have writen my gtkmm code the same way and got the same error message, repeated many times. However, after I removed the check for pending events from the handler of the timer event, the error message vanishes. It seems that GTK does not like it if gtk_events_pending() is called from two threads even if both threads refer to the same main loop. (Actually, it does not mind, as the error is just a warning, and fucntionality ois not disrupted. I guess that Rgui.exe gets the same warning but fails to display it.) Could this be the problem for you as well? [Thinking a bout this again: Why, actually, do we need a second thread. Could we not set up a Windows timer that dispatches a message to the main thread's event handler peridoically to trigger a polling of the Gtk event loop?] Simon +--- | Dr. Simon Anders, Dipl. Phys. | European Bioinformatics Institute, Hinxton, Cambridgeshire, UK | office phone +44-1223-494478, mobile phone +44-7505-841692 | preferred (permanent) e-mail: [EMAIL PROTECTED] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Possible bug: How does libR.so find RHOME?
Hi While installing RPy2, I had curious problems which, if I traced them back correctly, have their root in the way that libR.so find the R installation directory. I wonder if this might be a subtle bug that only causes problems when one has several versions of R on one system but then can be very annoying and cause very strange behaviour. It seems to me that libR.so asks the Unix shell to execute "R RHOME" or something similar, and this causes problems, as detailed below. We have several R installations on our server (a CentOS Linux, version 4.6, x86_64 machine), and in my shell environment, the symlink "R" that is found in the system search path points to a static build of R-2.8.1, while the symlink "R-2.9" points to today's R-2.9.beta, built with '--enable-shlib'. (What follows is true, however, for R 2.8.1, built as a library, as well.) The setup.py script of RPy2 calls "R RHOME" to find R and then compiles its C code to a shared object that is linked to libR.so. It uses the '-R' linker option to store the full path to libR.so. As "R RHOME" would have returned the path to my static R-2.8 installation, I specified the RHOME path to my R-2.9 library build explicitly to RPy2's setup.py by means of an environment variable that is checked by the script. This resulted in an RPy2 installation properly linked to my R-2.9. However, on starting RPy2 in Python, I got this strange error: >>> from rpy2 import robjects Error in grep("(_US|_CA)", lcpaper) : 8 arguments passed to .Internal(grep) which requires 9 [...] RPy2 started up nevertheless, but failed to load the base packages, so that all standard functions were missing. Changing the "R" symlink in my search path to point to the R-2.9 version against which RPy2 is linked solved the problem. It seems that, before, the R-2.9 libR.so tried to load the base environment of R-2.8, because R-2.8 was in the search path. As a further check, I put an empty script called "R" that does nothing into the search path. Then I get this: >>> from rpy2 import robjects cannot find system Renviron Fatal error: unable to open the base package My guess is that it is libR.so (and not RPy2) that locates the R executable in the path. If this is so, this seems to be a bug to me: An application (let alone a shared library) should not rely on the names of symlinks in the user's search path. Best regards Simon +--- | Dr. Simon Anders, Dipl. Phys. | European Bioinformatics Institute, Hinxton, Cambridgeshire, UK | office phone +44-1223-492680, mobile phone +44-7505-841692 | preferred (permanent) e-mail: sand...@fs.tum.de __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Possible bug: How does libR.so find RHOME?
Hi please disregard the previous mail; I realized that my conclusion that the bug is in R and not in Rpy2, was quite wrong, as I just noticed after abit more thinking. The problem is at a pretty obvious place in RPy2's initialization routine. I'll ask the RPy2 mailing list for help. Simon __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel