Re: [Rd] Small inconsistency in serialize() between R versions and implications on digest()
On Wed, 7 Mar 2007, Henrik Bengtsson wrote:

> To follow up, I went ahead and generated "random" objects to scan for a
> common header for a given R version, and it seems that at most
> the first 18 bytes are non-data specific, which could be the length of
> the serialization header.
>
> Here is my code for this:
>
> scanSerialize <- function(object, hdr=NULL, ...) {
>   # Serialize object
>   raw <- serialize(object, connection=NULL, ascii=TRUE);
>
>   # First run?
>   if (is.null(hdr))
>     return(raw);
>
>   # Find differences between current longest header and new raw vector
>   n <- length(hdr);
>   diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
>
>   # No differences?
>   if (!any(diffs))
>     return(hdr);
>
>   # Position of first difference
>   idx <- which(diffs)[1];
>
>   # Keep common header
>   hdr <- hdr[seq_len(idx-1)];
>
>   hdr;
> };
>
> # Serialize a first "random" object
> hdr <- scanSerialize(NA);
> for (kk in 1:100)
>   hdr <- scanSerialize(kk, hdr=hdr);
> for (kk in 1:100) {
>   x <- sample(letters, size=sample(100), replace=TRUE);
>   hdr <- scanSerialize(x, hdr=hdr);
> }
> for (kk in 1:100) {
>   hdr <- scanSerialize(kk, hdr=hdr);
>   hdr <- scanSerialize(hdr, hdr=hdr);
> }
>
> cat("Length:", length(hdr), "\n");
> print(hdr);
> print(rawToChar(hdr));
>
> On R v2.5.0 devel, this gives:
> Length: 18
> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
> [1] "A\n2\n132352\n131840\n"
>
> However, it would still be good to get an "official" statement from
> one in the R-code team about the serialization header and where the
> data section starts. Again, I want to cut out as much as possible for
> consistency between R versions without losing data-dependent bytes.

An official, and definitive, statement from the _R-core_ team has been
available to you all along at

https://svn.r-project.org/R/trunk/src/main/serialize.c

My unofficial and non-definitive interpretation of that statement is
that there is a header of four items:

    A format code 'A' or 'X' ('B' also possible in older formats)
    Version number of the format
    Packed integer containing the R version that did the serializing
    Packed integer containing the oldest R version that can read the format

You can see this if you look at the ascii version as text:

    > serialize(1, stdout(), ascii=TRUE)
    A
    2
    132097
    131840
    14
    1
    1
    NULL
    > serialize(as.integer(1), stdout(), ascii=TRUE)
    A
    2
    132097
    131840
    13
    1
    1
    NULL

In the non-ascii 'X' (as in xdr) format this will constitute 13 bytes.
In ascii format I believe it is currently 18 bytes but this could
change with the version number of R -- I'd have to read the official
and definitive statement to see how the integer packing is done and
work out whether that could change the number of bytes. The number of
bytes would also change if we reached format version 10, but something
about the format would also change of course. A safer way to look at
the header in the ascii version is as the first four lines.

Best,

luke

>
> Thanks
>
> /Henrik
>
> On 3/7/07, Henrik Bengtsson wrote:
>> Hi,
>>
>> I noticed that serialize() gives different results depending on R
>> version, which has implications for the digest() function in the digest
>> package. Note, it does give the same output across platforms. I know
>> that serialize() is under development, but is this expected, e.g. is
>> there some kind of header in the result that specifies "who" generated
>> the stream, and if so, exactly what bytes are they?
>>
>> SETUP:
>>
>> R versions:
>> A) R v2.4.0 (2006-10-03)
>> B) R v2.4.1pat (2007-01-13 r40470)
>> C) R v2.5.0dev (2006-12-12 r40167)
>>
>> This is on WinXP and I start R with Rterm --vanilla.
>>
>> Example: Identical serialize() calls using the different R versions.
>>
>> > raw <- serialize(1, connection=NULL, ascii=TRUE)
>> > print(raw)
>>
>> gives:
>>
>> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>>
>> Note the difference in raw bytes 8 to 10, i.e.
>>
>> > raw[7:11]
>> (A): [1] 32 30 39 36 0a
>> (B): [1] 32 30 39 37 0a
>> (C): [1] 32 33 35 32 0a
>>
>> Do bytes 8, 9 and 10 in the raw vector somehow contain information
>> about the R version or similar? The following poor man's test says
>> that is the only difference:
>>
>> On all R versions, the following gives identical results:
>>
>> > raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
>> > raw <- as.integer(raw[-c(8:10)])
>> > sum(raw)
>> [1] 2147884
>> > sum(log(raw))
>> [1] 177201.2
>>
>> If it is true that there is an R-version-specific header in serialized
>> objects, then the digest() function should exclude such a header in
>> order to produce consistent results across R versions, because now
>> digest
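
A minimal sketch of how the two packed integers in the header can be read, assuming the packing is major*65536 + minor*256 + patch. That assumption agrees with the values seen in this thread (132096 for R 2.4.0, 132097 for 2.4.1, 132352 for 2.5.0), but it should be checked against serialize.c. The helper name unpackVersion is hypothetical and not part of R.

```r
# Unpack a version integer, ASSUMING the layout
#   major * 65536 + minor * 256 + patch
# (an assumption consistent with the values quoted in this thread).
unpackVersion <- function(v) {
  v <- as.integer(v)
  c(major = v %/% 65536L,
    minor = (v %/% 256L) %% 256L,
    patch = v %% 256L)
}

unpackVersion(132096)  # major 2, minor 4, patch 0
unpackVersion(132097)  # major 2, minor 4, patch 1
unpackVersion(132352)  # major 2, minor 5, patch 0
unpackVersion(131840)  # major 2, minor 3, patch 0 (oldest reader version)
```

Under this reading, the three bytes that differ between the R versions above are simply the last decimal digits of the writer's packed version number.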
Re: [Rd] Small inconsistency in serialize() between R versions and implications on digest()
On Fri, 9 Mar 2007, Paul Murrell wrote:

> Hi
>
> Luke Tierney wrote:
>> An official, and definitive, statement from the _R-core_ team has been
>> available to you all along at
>>
>> https://svn.r-project.org/R/trunk/src/main/serialize.c
>
> There's also a bit of info on this in Section 1.7 of the "R Internals"
> Manual.
>
> Paul

Thanks -- I'd forgotten about that. Looking at that shows that my
unofficial and non-definitive interpretation was not quite right for
the binary case -- the header there is 14 bytes (I forgot that there
is a \n after the X even in the binary case).

Best,

luke
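
The corrected count for the binary case can be checked by hand: 'X' plus '\n' is 2 bytes, and the three header integers take 4 bytes each in XDR, giving 2 + 3*4 = 14. The sketch below rebuilds such a header, assuming 4-byte big-endian integers and reusing the version values shown in the ascii output earlier in the thread. It reconstructs the bytes rather than calling serialize(), so it is an illustration under those assumptions and not the actual implementation.

```r
# Rebuild a binary-format header by hand (illustration only).
# Assumes XDR integers are 4 bytes, big-endian.
hdr <- c(
  charToRaw("X\n"),                          # format code plus newline: 2 bytes
  writeBin(2L,      raw(), endian = "big"),  # format version: 4 bytes
  writeBin(132097L, raw(), endian = "big"),  # R version that wrote the stream
  writeBin(131840L, raw(), endian = "big")   # oldest R version able to read it
)
length(hdr)  # 14
```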
Re: [Rd] Small inconsistency in serialize() between R versions and implications on digest()
Hi

Luke Tierney wrote:
> An official, and definitive, statement from the _R-core_ team has been
> available to you all along at
>
> https://svn.r-project.org/R/trunk/src/main/serialize.c

There's also a bit of info on this in Section 1.7 of the "R Internals"
Manual.

Paul
Re: [Rd] Small inconsistency in serialize() between R versions and implications on digest()
On 3/8/07, Luke Tierney wrote:
> On Fri, 9 Mar 2007, Paul Murrell wrote:
>
> > There's also a bit of info on this in Section 1.7 of the "R Internals"
> > Manual.
> >
> > Paul
>
> Thanks -- I'd forgotten about that. Looking at that shows that my
> unofficial and non-definitive interpretation was not quite right for
> the binary case -- the header there is 14 bytes (I forgot that there
> is a \n after the X even in the binary case).

Luke and Paul,

thank you for this. Searching for the 4th newline seems to be the most
robust thing to do in the ASCII case.

/Henrik
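
The "4th newline" idea could be sketched as follows. This assumes the ascii header is exactly the first four lines, as described earlier in the thread for format version 2; other format versions may carry additional header fields. The function name stripAsciiHeader is hypothetical, and the two input vectors are typed in from the byte dumps for R 2.4.0 and R 2.5.0dev quoted in the thread rather than produced by serialize().

```r
# Drop everything up to and including the nlines-th newline of an
# ascii serialization. ASSUMES the header is the first four lines.
stripAsciiHeader <- function(raw, nlines = 4L) {
  stopifnot(is.raw(raw))
  nl <- which(raw == as.raw(0x0a))   # positions of all newline bytes
  if (length(nl) < nlines)
    stop("fewer than ", nlines, " newlines found")
  raw[-seq_len(nl[nlines])]          # keep only the data section
}

# Byte streams for serialize(1, NULL, ascii=TRUE) as reported above
rawA <- charToRaw("A\n2\n132096\n131840\n14\n1\n1\n")  # R 2.4.0
rawC <- charToRaw("A\n2\n132352\n131840\n14\n1\n1\n")  # R 2.5.0dev

identical(rawA, rawC)                                      # FALSE
identical(stripAsciiHeader(rawA), stripAsciiHeader(rawC))  # TRUE
```

Searching for the newline rather than cutting a fixed 18 bytes keeps the result stable if a packed version number ever gains or loses a decimal digit.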