Hi,

I'm running into more issues when reading data from a gzfile connection.
If I read the data sequentially with successive calls to readBin(), the
data I get looks ok. But if I call seek() between the successive calls
to readBin(), I get corrupted data.

Here is a (hopefully) reproducible example. My sessionInfo() is at the
end. Note that I'm on Linux, not on Windows, where, according to the
man page, seek() is broken.

  ## Generate data with a repeated easy-to-recognize byte pattern
  ## of length 26:
  mydata <- rep(charToRaw(paste(letters, collapse="")), 400)

  ## Write the data to test.gz file:
  con <- gzfile("test.gz", open="wb")
  writeBin(mydata, con)
  close(con)

  ## Read the data from the test.gz file. We'll read blocks of 26 bytes
  ## located at various offsets that are multiples of 26, so we expect
  ## to see our original pattern ("abc...xyz") each time.
  con <- gzfile("test.gz", open="rb")

  ## Offset 0: ok
  > rawToChar(readBin(con, "raw", n=26))
  [1] "abcdefghijklmnopqrstuvwxyz"

  ## Offset 78: still ok
  > seek(con, where=78)
  [1] 26
  > seek(con)
  [1] 78
  > rawToChar(readBin(con, "raw", n=26))
  [1] "abcdefghijklmnopqrstuvwxyz"

  ## Offset 520: data is messed up
  > seek(con, where=520)
  [1] 104
  > seek(con)
  [1] 520
  > rawToChar(readBin(con, "raw", n=26))
  [1] "zabcdefghijklmnopqrstuvvuv"


  ## Offset 2600: very messed up
  > seek(con, where=2600)
  [1] 546
  > seek(con)
  [1] 2600
  > rawToChar(readBin(con, "raw", n=26))
  [1] "xxxxxmpxxxxxxesxxxxxxxxxxp"

  ## Offset 10400: see previous email (subject: "error when calling
  ## seek() twice on a gzfile connection")
  > seek(con, where=10400)
  [1] 2626
  Warning message:
  In seek.connection(con, where = 10400) :
    seek on a gzfile connection returned an internal error

  close(con)
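
Since sequential reads on the gzfile connection behave correctly, one possible way to get a block at a given offset without calling seek() is to open a fresh connection, read and discard the bytes before that offset, and then read the block. The following is only a minimal sketch of that idea, not a fix for seek(). The helper name read_at() and its chunk argument are hypothetical and not part of base R.

```r
## Hypothetical helper (not part of base R): read `n` bytes starting at
## uncompressed offset `offset`, using sequential reads only (no seek()).
read_at <- function(path, offset, n, chunk = 65536L) {
  con <- gzfile(path, open = "rb")
  on.exit(close(con))          # always release the connection
  remaining <- offset
  while (remaining > 0) {
    ## Read and discard up to `chunk` bytes at a time.
    skipped <- readBin(con, "raw", n = min(remaining, chunk))
    if (length(skipped) == 0L) break   # reached end of data early
    remaining <- remaining - length(skipped)
  }
  readBin(con, "raw", n = n)
}
```

With the test.gz file created above, rawToChar(read_at("test.gz", 520, 26)) should give back the original pattern, at the cost of decompressing everything before the requested offset on every call.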

Reading the data sequentially with no calls to seek() returns the
expected pattern 400 times:

  con <- gzfile("test.gz", open="rb")
  blocks <- sapply(1:400, function(i) rawToChar(readBin(con, "raw", n=26)))

  ## Check the result:

  > readBin(con, "raw", n=26)  # no more data
  raw(0)

  > seek(con)
  [1] 10400

  > table(blocks)
  blocks
  abcdefghijklmnopqrstuvwxyz
                         400
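
The same sequential check can also be done without inspecting the blocks by eye, by comparing everything read back against the bytes that were written. This is a small self-contained sketch. It writes to a temporary file rather than test.gz, and the variable names path and roundtrip are just illustrative.

```r
## Regenerate the 26-byte pattern repeated 400 times (10400 bytes).
mydata <- rep(charToRaw(paste(letters, collapse = "")), 400)

## Write it to a temporary gzip-compressed file.
path <- tempfile(fileext = ".gz")
con <- gzfile(path, open = "wb")
writeBin(mydata, con)
close(con)

## Read everything back sequentially, with no call to seek().
## Asking for one byte more than was written would expose trailing data.
con <- gzfile(path, open = "rb")
roundtrip <- readBin(con, "raw", n = length(mydata) + 1L)
close(con)

## TRUE when the bytes read back match the bytes written exactly.
identical(roundtrip, mydata)
```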

Thanks,
H.

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
