[Rd] Bug in URLencode and patch

2015-01-11 Thread Thomas J. Leeper
I believe the implementation of utils::URLencode is non-compliant with
RFC 3986, which it claims to implement
(http://tools.ietf.org/html/rfc3986). Specifically, its percent
encoding uses the lowercase letters a-f where it should use the
uppercase letters A-F.

Here's what URLencode currently produces:

library("utils")
URLencode("*+,;=:/?", reserved = TRUE)
# "%2a%2b%2c%3b%3d%3a%2f%3f"

According to RFC 3986 (the relevant passages are quoted below), these should be uppercase:

toupper(URLencode("*+,;=:/?", reserved = TRUE))
# "%2A%2B%2C%3B%3D%3A%2F%3F"


This is a problem for me because I'm working with a web API that
authenticates using, in part, a hashed version of the URL-escaped
query arguments and this bug yields different hashes even though the
URLs are substantively the same. Here's a trivial example using just a
colon:

library("digest")
URLencode(":", reserved = TRUE)
# [1] "%3a"
digest("%3a")
# [1] "77fff19a933ae715d006469545892caf"
digest("%3A")
# [1] "8f270f6ac6fe3260f52293ea1d911093"

As an aside, I know that RCurl::curlEscape implements this correctly,
but I don't see any reason why URLencode shouldn't comply with RFC
3986.
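For comparison, here is RCurl's output for the same input (assuming
RCurl is installed; the exact set of characters escaped depends on the
libcurl version, but the hex digits are always uppercase):

library("RCurl")
curlEscape("*+,;=:/?")
# [1] "%2A%2B%2C%3B%3D%3A%2F%3F"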


The fix should be relatively simple. Here's updated code for URLencode
that simply adds a call to `toupper`:

function (URL, reserved = FALSE)
{
    ## Characters that do NOT need encoding; reserved characters are
    ## left alone unless reserved = TRUE.
    OK <- paste0("[^", if (!reserved)
        "][!$&'()*+,;=:/?@#", "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
        "abcdefghijklmnopqrstuvwxyz0123456789._~-",
        "]")
    x <- strsplit(URL, "")[[1L]]
    z <- grep(OK, x)
    if (length(z)) {
        ## toupper() is the only change: RFC 3986 recommends uppercase
        ## hexadecimal digits in percent-encodings.
        y <- sapply(x[z], function(x)
            paste0("%", toupper(as.character(charToRaw(x))),
                   collapse = ""))
        x[z] <- y
    }
    paste(x, collapse = "")
}
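Assigning the function above to a throwaway name (urlencode_patched
below, a name used only for this check) gives uppercase output for the
earlier examples:

## urlencode_patched is the patched function shown above, assigned to
## a throwaway name for this check only.
urlencode_patched("*+,;=:/?", reserved = TRUE)
# [1] "%2A%2B%2C%3B%3D%3A%2F%3F"
urlencode_patched(":", reserved = TRUE)
# [1] "%3A"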


The relevant parts of RFC 3986 are (emphasis added):
2.1: "The uppercase hexadecimal digits 'A' through 'F' are equivalent
to the lowercase digits 'a' through 'f', respectively.  If two URIs
differ only in the case of hexadecimal digits used in percent-encoded
octets, they are equivalent.  For consistency, URI producers and
normalizers should use **uppercase** hexadecimal digits for all
percent-encodings."

6.2.2.1: "For all URIs, the hexadecimal digits within a
percent-encoding triplet (e.g., "%3a" versus "%3A") are
case-insensitive and therefore should be normalized to use
**uppercase** letters for the digits A-F."
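Until URLencode itself is fixed, 6.2.2.1 also suggests a client-side
stopgap: normalize whatever URLencode returns by uppercasing each
percent-encoded triplet. A minimal sketch in base R (normalize_pct is
just a name made up for illustration):

## Uppercase the hex digits of every %xx triplet (RFC 3986, 6.2.2.1).
## The \U case conversion in the replacement requires perl = TRUE.
normalize_pct <- function(url)
    gsub("(%[0-9a-fA-F]{2})", "\\U\\1", url, perl = TRUE)

normalize_pct(URLencode(":", reserved = TRUE))
# [1] "%3A"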


Best,
-Thomas


Thomas J. Leeper
http://www.thomasleeper.com



Re: [Rd] Cost of garbage collection seems excessive

2015-01-11 Thread luke-tierney

This is a known issue that is being looked into. The primary culprit
seems to be the case labels that are created and need to be scanned by
the GC.

Best,

luke

On Fri, 9 Jan 2015, Nathan Kurz wrote:


When doing repeated regressions on large data sets, I'm finding that
the time spent on garbage collection often exceeds the time spent on
the regression itself. Consider this test program, which I'm running
on an Intel Haswell i7-4470 processor under Linux 3.13 using R 3.1.2
compiled with ICPC 14.1:

nate@haswell:~$ cat > gc.R
library(speedglm)
createData <- function(n) {
    int <- -5
    x <- rnorm(n, 50, 7)
    e <- rnorm(n, 0, 1)
    y <- int + (1.2 * x) + e
    return(data.frame(y, x))
}
gc.time()
data <- createData(500000)
data.y <- as.matrix(data[1])
data.x <- model.matrix(y ~ ., data)
for (i in 1:100) speedglm.wfit(X=data.x, y=data.y, family=gaussian())
gc.time()

nate@haswell:~$ time Rscript gc.R
Loading required package: Matrix
Loading required package: methods
[1] 0 0 0 0 0
[1] 10.410  0.024 10.441  0.000  0.000
real 0m17.167s
user 0m16.996s
sys 0m0.176s

The total execution time is 17 seconds, and the time spent on garbage
collection is almost 2/3 of that. My actual use case is a package
that creates an ensemble from a variety of cross-validated
regressions, and it exhibits the same poor performance. Is this
expected behavior?

I've found that I can reduce the garbage collection time to a
tolerable level by setting the R_VSIZE environment variable to a
large enough value:

nate@haswell:~$ time R_VSIZE=1GB Rscript gc.R
Loading required package: Matrix
Loading required package: methods
[1] 0 0 0 0 0
[1] 0.716 0.025 0.739 0.000 0.000
real 0m7.694s
user 0m7.388s
sys 0m0.309s

I can do slightly better with even higher values, and by also setting
R_GC_MEM_GROW=3. But while these environment variables solve the
issue for me, I fear that the end users of my package won't be able
to set them. Is there a way to achieve this higher performance from
within R rather than from the command line?
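The closest thing to an in-R workaround I've come up with is to
pre-grow the vector heap by allocating and then discarding one large
object at startup, on the theory that the collector lowers its
trigger levels only gradually. A sketch (pregrow_heap and the 1 GB
figure are arbitrary choices, and I don't know how reliable this is):

## Unofficial sketch: expand the vector heap once up front so the
## fitting loop triggers fewer full collections. The helper name and
## default size are arbitrary; this is not a documented interface.
pregrow_heap <- function(bytes = 1e9) {
    x <- raw(bytes)    # one large allocation forces the heap to grow
    invisible(NULL)    # x is dropped on return; shrinkage is gradual
}
pregrow_heap()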

Thanks!

--nate




--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                    Phone: 319-335-3386
Department of Statistics and          Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                    email: luke-tier...@uiowa.edu
Iowa City, IA 52242                   WWW:   http://www.stat.uiowa.edu
