Hi Uwe,

On 11-12-07 12:34 AM, Uwe Ligges wrote:


> On 06.12.2011 23:28, Hervé Pagès wrote:
>> Hi,
>>
>> Recently added to doc/NEWS.Rd:
>>
>>   'R CMD check' now gives a warning rather than a note if it finds
>>   inefficiently compressed datasets. With 'bzip2' and 'xz' compression
>>   having been available since R 2.10.0, there is no excuse for not
>>   using them.
>>
>> Why isn't a note enough for this?
>>
>> Generally speaking, warnings are for things that are dangerous,
>> or unsafe, or unportable, or for anything that could potentially
>> cause trouble. I don't see how using gzip instead of bzip2 or xz
>> could fall into that category (and BTW gzip is the default for
>> save() and for 'R CMD build' resave-data feature).
>>
>> The problem is that bzip2 and xz compressions are slower and also
>> require more memory than gzip. Bioconductor has big data packages
>> and sometimes it makes sense to use gzip and not bzip2 or xz. For
>> example, when loading Human chromosome 1 from disk, bzip2 and xz
>> are 7 and 3.4 times slower than gzip, respectively:
>>
>> > system.time(load("chr1-gzip.rda"))
>>    user  system elapsed
>>   1.210   0.180   1.384
>>
>> > system.time(load("chr1-bzip2.rda"))
>>    user  system elapsed
>>   9.500   0.160   9.674
>>
>> > system.time(load("chr1-xz.rda"))
>>    user  system elapsed
>>    4.46    0.20    4.69
>>
>> hpages@latitude:~/testing$ ls -lhtr chr1-*.rda
>> -rw-r--r-- 1 hpages hpages 61M 2011-12-06 12:13 chr1-gzip.rda
>> -rw-r--r-- 1 hpages hpages 55M 2011-12-06 12:15 chr1-bzip2.rda
>> -rw-r--r-- 1 hpages hpages 49M 2011-12-06 12:25 chr1-xz.rda
>>
>> This is with R-2.14.0 on a 64-bit Ubuntu laptop with 8GB of RAM.
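
For anyone who wants to try this kind of comparison on their own data, here is a minimal sketch. It is not the script behind the timings above: the object `chr`, the random sequence standing in for a chromosome, and the temporary file names are all made up for illustration, so the numbers will not match the chr1 figures.

```r
# Hypothetical stand-in for a chromosome: 1 million random bases in one string.
set.seed(42)
chr <- paste(sample(c("A", "C", "G", "T"), 1e6, replace = TRUE), collapse = "")

methods <- c("gzip", "bzip2", "xz")
files <- file.path(tempdir(), paste("chr-", methods, ".rda", sep = ""))
names(files) <- methods

# Save the same object once per compression method.
for (m in methods) save(chr, file = files[[m]], compress = m)

# Time load() for each file, loading into a scratch environment.
load_sec <- sapply(files, function(f) {
  system.time(load(f, envir = new.env()))[["elapsed"]]
})

# Size on disk next to load time.
print(data.frame(bytes = file.info(files)$size, load_sec = load_sec))

# tools::checkRdaFiles() reports the compression each file actually uses.
print(tools::checkRdaFiles(files))
```

On a small object like this the differences may be tiny; the gap only becomes obvious with objects the size of a real chromosome.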

>> The size on disk doesn't really matter and it doesn't matter either
>> that the source tarball for the full Human genome ends up being 20%
>> bigger when using gzip instead of xz: the 20% extra time it takes to
>> download it (which needs to be done only once) will largely be
>> compensated by the fact that most analyses will run faster e.g. in
>> 40-45 sec. instead of more than 2 minutes (for many short analyses,
>> loading the chromosomes into memory is the bottleneck).
>
> Oh, from a European side this 20% extra time may be an hour when
> downloading from the BioC master rather than a mirror.

I guess that's why we have mirrors.

> And space and traffic is an issue for CRAN.
>
>> Is there a way to turn this warning off? If not, could an option be
>> added to 'R CMD check' to turn this warning off? Something along the
>> lines of the --no-resave-data option for 'R CMD build'.
>
> The manual tells us:
>
> "The following environment variables can be used to customize the
> operation of check: a convenient place to set these is the file
> ‘~/.R/check.Renviron’.

Ah I see, this is in the "R Internals" manual. Good to know.

> [...]
>
> _R_CHECK_COMPACT_DATA2_
>
> If true, check data for ascii and uncompressed saves, and also check if
> using bzip2 or xz compression would be significantly better. Implies
> _R_CHECK_COMPACT_DATA_ is true. Default: true."

Not with current R-devel: _R_CHECK_COMPACT_DATA2_ is gone (has been
merged with _R_CHECK_COMPACT_DATA_).
I guess we could always use _R_CHECK_COMPACT_DATA_ to turn this off
but that would mean we also turn off checking data for ascii and
uncompressed saves...
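
In case it is useful, here is a small sketch of how that variable could be set and inspected from R. It writes to a temporary file instead of the real ~/.R/check.Renviron so nothing on your system is touched; the temporary path is only for illustration, and FALSE is the value I would expect to disable the check (I have not verified it against R-devel).

```r
# Illustration only: a temporary file stands in for ~/.R/check.Renviron.
renviron <- file.path(tempdir(), "check.Renviron")
writeLines("_R_CHECK_COMPACT_DATA_=FALSE", renviron)

# Read the file into the current session and confirm the value is visible.
readRenviron(renviron)
Sys.getenv("_R_CHECK_COMPACT_DATA_")
```

For 'R CMD check' itself the same line would go in ~/.R/check.Renviron, or the variable could be set in the shell before running check.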

Thanks,
H.



> Uwe
>
>> Thanks,
>> H.



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
