Hi, I'm running into more issues when reading data from a gzfile connection. If I read the data sequentially with successive calls to readBin(), the data I get looks ok. But if I call seek() between the successive calls to readBin(), I get corrupted data.

Here is a (hopefully) reproducible example. See my sessionInfo() at the
end (I'm not on Windows, where, according to the man page, seek() is
broken).

  ## Generate data with a repeated easy-to-recognize byte pattern
  ## of length 26:
  mydata <- rep(charToRaw(paste(letters, collapse="")), 400)

  ## Write the data to test.gz file:
  con <- gzfile("test.gz", open="wb")
  writeBin(mydata, con)
  close(con)

  ## Read the data from test.gz file. We'll read blocks of 26 bytes
  ## located at various offsets that are multiple of 26, so we expect
  ## to see our original pattern ("abc...xyz").
  con <- gzfile("test.gz", open="rb")

  ## Offset 0: ok
  > rawToChar(readBin(con, "raw", n=26))
  [1] "abcdefghijklmnopqrstuvwxyz"

  ## Offset 78: still ok
  > seek(con, where=78)
  [1] 26
  > seek(con)
  [1] 78
  > rawToChar(readBin(con, "raw", n=26))
  [1] "abcdefghijklmnopqrstuvwxyz"

  ## Offset 520: data is messed up
  > seek(con, where=520)
  [1] 104
  > seek(con)
  [1] 520
  > rawToChar(readBin(con, "raw", n=26))
  [1] "zabcdefghijklmnopqrstuvvuv"

  ## Offset 2600: very messed up
  > seek(con, where=2600)
  [1] 546
  > seek(con)
  [1] 2600
  > rawToChar(readBin(con, "raw", n=26))
  [1] "xxxxxmpxxxxxxesxxxxxxxxxxp"

  ## Offset 10400: see previous email (subject: "error when calling
  ## seek() twice on a gzfile connection")
  > seek(con, where=10400)
  [1] 2626
  Warning message:
  In seek.connection(con, where = 10400) :
    seek on a gzfile connection returned an internal error

  close(con)

Reading the data sequentially with no calls to seek() returns the
expected pattern 400 times:

  con <- gzfile("test.gz", open="rb")
  blocks <- sapply(1:400, function(i) rawToChar(readBin(con, "raw", n=26)))

  ## Check the result:
  > readBin(con, "raw", n=26)  # no more data
  raw(0)
  > seek(con)
  [1] 10400
  > table(blocks)
  blocks
  abcdefghijklmnopqrstuvwxyz
                         400

Thanks,
H.

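
P.S. Since sequential reads behave correctly in the example above, one possible workaround is to move forward by reading and discarding bytes instead of calling seek(). This is only a minimal sketch: skip_bytes() is a hypothetical helper (not something in base R), its chunk size is an arbitrary choice, and it can only move forward from the current position. It rebuilds the same test data in a temporary file so it can be run on its own.

```r
## Hypothetical helper (not part of base R): advance a read-mode
## connection by 'n' bytes by reading and discarding them, so that
## seek() is never called.
skip_bytes <- function(con, n, chunk = 65536L) {
    while (n > 0) {
        ## Read at most 'chunk' bytes at a time to bound memory use.
        got <- length(readBin(con, "raw", n = min(n, chunk)))
        if (got == 0L)
            stop("reached end of data before skipping all bytes")
        n <- n - got
    }
    invisible(NULL)
}

## Same data as in the example above, written to a temporary file.
path <- tempfile(fileext = ".gz")
mydata <- rep(charToRaw(paste(letters, collapse = "")), 400)
con <- gzfile(path, open = "wb")
writeBin(mydata, con)
close(con)

## Read the 26-byte block at offset 2600 without calling seek().
con <- gzfile(path, open = "rb")
skip_bytes(con, 2600)
block <- rawToChar(readBin(con, "raw", n = 26))
close(con)
block
```

To reach an offset that lies before the current position, the connection would have to be closed and reopened before skipping again, so this does not replace a working seek().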
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel