Why not write the RDS file more atomically - write it to a temporary file and rename that file to its final name when it is completely written? E.g.,

  saveRDS.atomically <- function(object, file, ...) {
      # Create the temporary file in the destination directory, so the
      # rename below does not have to cross file systems.
      tfile <- tempfile(basename(file), dirname(file))
      on.exit(if (file.exists(tfile)) unlink(tfile))
      retval <- saveRDS(object, tfile, ...)
      if (!file.rename(tfile, file)) {
          # perhaps want an if(file.exists(file))unlink(file) first
          stop("Cannot rename temporary file ", tfile, " to ", file)
      }
      invisible(retval)
  }

(The file.rename may be tripped up by an overeager virus checker looking at the newly created tfile. I don't know the best way to deal with that.)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On Behalf Of Henrik Bengtsson
> Sent: Saturday, September 15, 2012 10:22 AM
> To: R-devel
> Subject: [Rd] Risk of readRDS() not detecting race conditions with parallel saveRDS()?
>
> I hardly know anything about the format used in (non-compressed)
> serialization/RDS, but I am hoping someone with more knowledge could
> give me some feedback.
>
> Consider two R processes running in parallel on the same unknown file
> system. Both of them write to and read from the same RDS file foo.rds
> (without compression) at random times using saveRDS(object,
> file="foo.rds", compress=FALSE) and object2 <-
> readRDS(file="foo.rds"). This happens frequently enough that
> there is a risk of the two processes writing to the same "foo.rds"
> file at the same time (here one needs to acknowledge that file updates
> are neither atomic nor instant).
>
> To simulate the event that two processes write to the same file at
> the same time (and non-atomically), resulting in an interleaved/appended
> "foo.rds" file, I manually corrupted "foo.rds" by
> inserting/dropping/replacing a single random byte. It appears that
> readRDS() will detect this simple event by throwing an error on
> "unknown input format", which is what I want.
> My question is now, is
> it reasonable to assume that if two or more processes happen to write
> to the same RDS file at the same time, it is extremely unlikely (*)
> that they would generate a file that would pass as valid by readRDS()?
>
> (*) extremely unlikely = if all of us would run this toy example we
> would not end up with an undetected but still corrupt "foo.rds" file
> in, say, 10000 years.
>
> Background: The R.cache package allows memoization (caching of
> results) to file such that the cache is persistent across R sessions.
> The persistent part is achieved by writing cache files to the same
> file directory. This is safe when you run a single process, and even
> if readRDS() would fail to read a cache file it is no big deal; the
> memoization will just fail and the results will be recalculated and
> resaved. The question is what happens if you run this in parallel
> and push it to the extreme; is there a risk that the memoization will
> properly return but with invalid results? I prefer not having to
> synchronize this with a mutex/semaphore/common server, but instead
> rely on this try-and-see approach (cf. the Ethernet protocol on shared
> medium). My guess (and hope) is that the risk is extremely unlikely
> (*), but I'd like to hear if someone else thinks otherwise.
>
> Thanks,
>
> Henrik
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
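
For illustration only, the try-and-see reading described in the quoted message might look roughly like the sketch below. The helper name read_or_recompute and its arguments are hypothetical, not part of R.cache or base R. It recovers only from corruption that readRDS() actually reports as an error, so it does not address the open question of a corrupt file that still reads as valid.

```r
# Hypothetical helper (an assumption for illustration, not R.cache's API):
# return the cached value if the file reads cleanly; otherwise recompute
# the value with compute() and save it again.
read_or_recompute <- function(file, compute) {
  failed <- FALSE
  value <- tryCatch(
    # A missing file raises a warning as well as an error; silence the warning.
    suppressWarnings(readRDS(file)),
    error = function(e) {
      failed <<- TRUE  # remember that the read failed
      NULL
    }
  )
  if (failed) {
    value <- compute()
    saveRDS(value, file, compress = FALSE)
  }
  value
}

f <- tempfile(fileext = ".rds")

# No cache file yet: the value is computed and saved.
x <- read_or_recompute(f, function() 1:10)

# Simulate a damaged cache file by overwriting it with a few junk bytes.
writeBin(as.raw(c(0x00, 0x01, 0x02, 0x03)), f)

# readRDS() rejects the junk, so the value is recomputed and resaved.
y <- read_or_recompute(f, function() 1:10)
```

The plain saveRDS() call inside the helper could be swapped for the temporary-file-and-rename writer suggested earlier in this thread, so that readers are less likely to see a partially written file.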