In addition to the major points that others have made: if you care about speed, don't use compression. With today's fast disks, using compression is an order of magnitude slower:
> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
> system.time(saveRDS(d, file="test.rds.gz"))
   user  system elapsed
 17.210   0.148  17.397
> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed
  0.482   0.355   0.929

The above example is intentionally well compressible; in real life the differences are even bigger. As people who deal with big data know well, disks are no longer the bottleneck - it's the CPU now.

Cheers,
Simon

BTW: why in the world would you use ascii=TRUE? It's pretty much the slowest possible serialization you can use - it will even overshadow compression:

> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed
  0.459   0.383   0.940
> system.time(saveRDS(d, file="test-a.rds", compress=F, ascii=T))
   user  system elapsed
 36.713   0.140  36.929

and the same goes for reading:

> system.time(readRDS("test-a.rds"))
   user  system elapsed
 27.616   0.275  27.948
> system.time(readRDS("test.rds"))
   user  system elapsed
  0.609   0.184   0.795


> On Jan 15, 2015, at 7:45 AM, Stewart Morris <stewart.mor...@igmm.ed.ac.uk> wrote:
>
> Hi,
>
> I am dealing with very large datasets and it takes a long time to save a
> workspace image.
>
> The options to save compressed data are "gzip", "bzip2" or "xz", the default
> being gzip. I wonder if it's possible to include the pbzip2
> (http://compression.ca/pbzip2/) algorithm as an option when saving.
>
> "PBZIP2 is a parallel implementation of the bzip2 block-sorting file
> compressor that uses pthreads and achieves near-linear speedup on SMP
> machines. The output of this version is fully compatible with bzip2 v1.0.2 or
> newer."
>
> I tested this as follows with one of my smaller datasets, having only read in
> the raw data:
>
> ============
> # Dumped an ascii image
> save.image(file='test', ascii=TRUE)
>
> # At the shell prompt:
> ls -l test
> -rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test
>
> time bzip2 -9 test
> 364.702u 3.148s 6:14.01 98.3% 0+0k 48+1273976io 1pf+0w
>
> time pbzip2 -9 test
> 422.080u 18.708s 0:11.49 3836.2% 0+0k 0+1274176io 0pf+0w
> ============
>
> As you can see, bzip2 on its own took over 6 minutes, whereas pbzip2 took 11
> seconds, admittedly on a 64-core machine (running at 50% load). Most modern
> machines are multicore, so everyone would get some speedup.
>
> Is this feasible/practical? I am not a developer, so I'm afraid this would be
> down to someone else...
>
> Thoughts?
>
> Cheers,
>
> Stewart
>
> --
> Stewart W. Morris
> Centre for Genomic and Experimental Medicine
> The University of Edinburgh
> United Kingdom
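
PS: On the pbzip2 suggestion in the quoted message - you can get most of that benefit today without any change to R, because save() and load() both accept connections and can therefore be pointed at a pipe to an external compressor. A minimal sketch, assuming pbzip2 is on the PATH (the file name is just a placeholder):

## write the workspace through pbzip2 instead of R's built-in compression
con <- pipe("pbzip2 -c -9 > workspace.RData.bz2", open = "wb")
save(list = ls(all.names = TRUE), file = con, envir = .GlobalEnv)
close(con)

## read it back through the same route
con <- pipe("pbzip2 -dc workspace.RData.bz2", open = "rb")
load(con, envir = .GlobalEnv)
close(con)

When save() writes to a connection it does not apply its own compression (the compress= argument only applies to named files), so the heavy lifting is done by pbzip2 in parallel.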
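The same trick applies to saveRDS()/readRDS(), which likewise accept connections - again only a sketch, with a placeholder file name and reusing the object d from the timings above:

con <- pipe("pbzip2 -c > test.rds.bz2", open = "wb")
saveRDS(d, file = con)   # R compresses only when given a file name, so pbzip2 does the work here
close(con)

con <- pipe("pbzip2 -dc test.rds.bz2", open = "rb")
d2 <- readRDS(con)
close(con)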