[Rd] Generate reproducible output independently of the build path

2017-04-25 Thread Ximin Luo
(please keep me CCd, I am not subscribed)

Dear R developers,

At the Reproducible Builds project we've been trying to get build tools and 
packages generate bit-for-bit identical output, even under different build 
paths. This is beneficial for users because they can more easily compare their 
builds with others, as well as other reasons.

At the moment about 400 out of 26000 Debian packages are unreproducible due to 
how R writes paths.rds files as well as RDB files. An example diff is here:

https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/diffoscope-results/r-cran-tensor.html

I've attached a patch (applies to both 3.3.3 and 3.4) that fixes this issue; 
however I know it's not perfect and would welcome feedback on how to make it 
acceptable to the R project.

For example, I've tried to limit the effects of the patch only to the RDB 
loading/saving code, but I'm not familiar with the codebase so it would be good 
if someone could verify that I did this correctly. Then, ideally we would also 
add some tests to ensure that unreproduciblity does not crop back in "by 
accident". R code heavily relies on absolute paths, and I went down several 
dead-ends chasing and editing variables containing absolute paths, before I 
finally managed to get this working patch, so I suspect that without specific 
reproducibility tests, this issue might recur in the future.

I've checked that the existing tests still pass, with this patch applied to the 
Debian package. I have some errors like:

-  Warning message:
-  In utils::packageDescription(basename(dir), dirname(dir)) :
-no package 'cluster' was found
-* checking R files for non-ASCII characters ... OK
-* checking R files for syntax errors ... OK
:* checking whether the package can be loaded ... ERROR

but I also get the same errors when I build the unpatched Debian package. And 
if I run e.g. `Rscript -e 'library(cluster)'` with a patched Rscript, there is 
no error and the exit code is 0.

Ximin

-- 
GPG: ed25519/56034877E1F87C35
GPG: rsa4096/1318EFAC5FBBDBCE
https://github.com/infinity0/pubkeys.git
diff -u r-base-3.3.3/debian/changelog r-base-3.3.3/debian/changelog
--- r-base-3.3.3.orig/src/library/base/R/namespace.R
+++ r-base-3.3.3/src/library/base/R/namespace.R
@@ -190,7 +190,8 @@
 
 loadNamespace <- function (package, lib.loc = NULL,
keep.source = getOption("keep.source.pkgs"),
-   partial = FALSE, versionCheck = NULL)
+   partial = FALSE, versionCheck = NULL,
+   relpath = FALSE)
 {
 libpath <- attr(package, "LibPath")
 package <- as.character(package)[[1L]]
@@ -246,9 +247,9 @@
 attr(dimpenv, "name") <- paste("lazydata", name, sep = ":")
 setNamespaceInfo(env, "lazydata", dimpenv)
 setNamespaceInfo(env, "imports", list("base" = TRUE))
-## this should be an absolute path
-setNamespaceInfo(env, "path",
- normalizePath(file.path(lib, name), "/", TRUE))
+path <- if (relpath) file.path(".", name)
+else normalizePath(file.path(lib, name), "/", TRUE)
+setNamespaceInfo(env, "path", path)
 setNamespaceInfo(env, "dynlibs", NULL)
 setNamespaceInfo(env, "S3methods", matrix(NA_character_, 0L, 3L))
 env$.__S3MethodsTable__. <-
--- r-base-3.3.3.orig/src/library/tools/R/admin.R
+++ r-base-3.3.3/src/library/tools/R/admin.R
@@ -785,7 +785,6 @@
 .install_package_Rd_objects <-
 function(dir, outDir, encoding = "unknown")
 {
-dir <- file_path_as_absolute(dir)
 mandir <- file.path(dir, "man")
 manfiles <- if(!dir.exists(mandir)) character()
 else list_files_with_type(mandir, "docs")
--- r-base-3.3.3.orig/src/library/tools/R/makeLazyLoad.R
+++ r-base-3.3.3/src/library/tools/R/makeLazyLoad.R
@@ -29,7 +29,7 @@
 if (packageHasNamespace(package, dirname(pkgpath))) {
 if (! is.null(.getNamespace(as.name(package
 stop("namespace must not be already loaded")
-ns <- suppressPackageStartupMessages(loadNamespace(package, lib.loc, keep.source, partial = TRUE))
+ns <- suppressPackageStartupMessages(loadNamespace(package, lib.loc, keep.source, partial = TRUE, relpath = TRUE))
 makeLazyLoadDB(ns, dbbase, compress = compress)
 }
 else
--- r-base-3.3.3.orig/src/library/tools/R/parseRd.R
+++ r-base-3.3.3/src/library/tools/R/parseRd.R
@@ -62,6 +62,7 @@
 basename <- basename(srcfile$filename)
 srcfile$encoding <- encoding
 srcfile$Enc <- "UTF-8"
+srcfile$wd <- "."
 
 if (encoding == "ASCII") {
 if (any(is.na(iconv(lines, "", "ASCII"
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Generate reproducible output independently of the build path

2017-05-03 Thread Ximin Luo
Ximin Luo:
> [..]
> 
> I've attached a patch (applies to both 3.3.3 and 3.4) that fixes this issue; 
> however I know it's not perfect and would welcome feedback on how to make it 
> acceptable to the R project.
> 

Hi all, attached is an updated version of the patch.

We've tested this on our jenkins infrastructure and it makes 463/478 Debian R 
packages reproducible:

https://tests.reproducible-builds.org/debian/issues/unstable/randomness_in_r_rdb_rds_databases_issue.html

The previous version of the patch was slightly flawed, it made 2 of these 
packages fail-to-build-from-source (r-bioc-biobase, r-cran-shinybs). This is 
fixed in the current patch attached, and these packages reproduce with it.

The remaining FTBFS (r-cran-randomfields) are due to incompatibilities between 
r-base 3.3.3 and 3.4.0, being discussed in Debian bug 861333, and are not 
caused by this patch.

The remaining 14 unreproducible packages are likely unreproducible due to 
issues specifically in those packages. For example 
r-cran-runit-0.4.31/man/checkFuncs.Rd contains an explicit absolute path, and 
making this relative fixes the unreproducibility. I have not yet checked the 
other packages.

> For example, I've tried to limit the effects of the patch only to the RDB 
> loading/saving code, but I'm not familiar with the codebase so it would be 
> good if someone could verify that I did this correctly. Then, ideally we 
> would also add some tests to ensure that unreproduciblity does not crop back 
> in "by accident". R code heavily relies on absolute paths, and I went down 
> several dead-ends chasing and editing variables containing absolute paths, 
> before I finally managed to get this working patch, so I suspect that without 
> specific reproducibility tests, this issue might recur in the future.
> 

I've been talking with Dirk Eddelbuettel off-thread and he suggested that the 
rest of the patch could also be guarded by something like 
getOption("useRelativePath", bool).

It would be good if other members of R Core could comment and give me some more 
guidance along these lines. :)

> I've checked that the existing tests still pass, with this patch applied to 
> the Debian package. I have some errors like:
> [..]
> :* checking whether the package can be loaded ... ERROR
> [..]

We also figured out that this was a previous issue with Debian R 3.3.3, the 
error goes away with 3.4.0 either patched or unpatched.

X

-- 
GPG: ed25519/56034877E1F87C35
GPG: rsa4096/1318EFAC5FBBDBCE
https://github.com/infinity0/pubkeys.git
--- r-base-3.4.0.orig/src/library/base/R/namespace.R
+++ r-base-3.4.0/src/library/base/R/namespace.R
@@ -192,7 +192,8 @@
 
 loadNamespace <- function (package, lib.loc = NULL,
keep.source = getOption("keep.source.pkgs"),
-   partial = FALSE, versionCheck = NULL)
+   partial = FALSE, versionCheck = NULL,
+   relpath = FALSE)
 {
 libpath <- attr(package, "LibPath")
 package <- as.character(package)[[1L]]
@@ -506,6 +505,12 @@
 # else warning(gettextf("package %s contains no R code",
 #sQuote(package)), call. = FALSE, domain = NA)
 
+# if relpath is asked for, set this here. we do this *after* the
+# sys.source() step just above, because some packages' top-level code
+# like to call things like packageDescription() which requires an
+# actually-existing path info for them to work.
+if (relpath) setNamespaceInfo(ns, "path", file.path(".", package))
+
 ## partial loading stops at this point
 ## -- used in preparing for lazy-loading
 if (partial) return(ns)
--- r-base-3.4.0.orig/src/library/tools/R/admin.R
+++ r-base-3.4.0/src/library/tools/R/admin.R
@@ -788,7 +788,6 @@
 .install_package_Rd_objects <-
 function(dir, outDir, encoding = "unknown")
 {
-dir <- file_path_as_absolute(dir)
 mandir <- file.path(dir, "man")
 manfiles <- if(!dir.exists(mandir)) character()
 else list_files_with_type(mandir, "docs")
--- r-base-3.4.0.orig/src/library/tools/R/makeLazyLoad.R
+++ r-base-3.4.0/src/library/tools/R/makeLazyLoad.R
@@ -28,7 +28,7 @@
 if (packageHasNamespace(package, dirname(pkgpath))) {
 if (! is.null(.getNamespace(as.name(package
 stop("namespace must not be already loaded")
-ns <- suppressPackageStartupMessages(loadNamespace(package, lib.loc, keep.source, partial = TRUE))
+ns <- suppressPackageStartupMessages(loadNamespace(package, lib.loc, keep.source, partial = TRUE, relpath = TRUE))
 makeLazyLoadDB(ns, dbbase, compress = compress)
 }
 else
--- r-base-3.4.0.orig/src/library/tools/R/parseRd.R
+++ r-base-3.