date:20070308

Re: [Rd] Small inconsistency in serialize() between R versions and implications on digest()

2007-03-08 Thread Luke Tierney

On Wed, 7 Mar 2007, Henrik Bengtsson wrote:

> To follow up, I went ahead and generated "random" objects to scan for a
> common header for a given R version, and it seems that at most
> the first 18 bytes are non-data specific, which could be the length of
> the serialization header.
>
> Here is my code for this:
>
> scanSerialize <- function(object, hdr=NULL, ...) {
>  # Serialize object
>  raw <- serialize(object, connection=NULL, ascii=TRUE);
>
>  # First run?
>  if (is.null(hdr))
>    return(raw);
>
>  # Find differences between current longest header and new raw vector
>  n <- length(hdr);
>  diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
>
>  # No differences?
>  if (!any(diffs))
>    return(hdr);
>
>  # Position of first difference
>  idx <- which(diffs)[1];
>
>  # Keep common header
>  hdr <- hdr[seq_len(idx-1)];
>
>  hdr;
> };
>
> # Serialize a first "random" object
> hdr <- scanSerialize(NA);
> for (kk in 1:100)
>  hdr <- scanSerialize(kk, hdr=hdr);
> for (kk in 1:100) {
>  x <- sample(letters, size=sample(100, 1), replace=TRUE);
>  hdr <- scanSerialize(x, hdr=hdr);
> }
> for (kk in 1:100) {
>  hdr <- scanSerialize(kk, hdr=hdr);
>  hdr <- scanSerialize(hdr, hdr=hdr);
> }
>
> cat("Length:", length(hdr), "\n");
> print(hdr);
> print(rawToChar(hdr));
>
> On R v2.5.0 devel, this gives:
> Length: 18
> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
> [1] "A\n2\n132352\n131840\n"
>
> However, it would still be good to get an "official" statement from
> one in the R-code team about the serialization header and where the
> data section starts.  Again, I want to cut out as much as possible for
> consistency between R versions without losing data-dependent bytes.

An official, and definitive, statement from the _R-core_ team has been
available to you all along at

https://svn.r-project.org/R/trunk/src/main/serialize.c

My unofficial and non-definitive interpretation of that statement is
that there is a header of four items,

 A format code 'A' or 'X' ('B' also possible in older formats)
 version number of the format
 Packed integer containing the R version that did the serializing
 Packed integer containing the oldest R version that can read the format

You can see this if you look at the ascii version as text:

 > serialize(1, stdout(), ascii=TRUE)
 A
 2
 132097
 131840
 14
 1
 1
 NULL
 > serialize(as.integer(1), stdout(), ascii=TRUE)
 A
 2
 132097
 131840
 13
 1
 1
 NULL

In the non-ascii 'X' (as in xdr) format this will constitute 13 bytes.
In ascii format I believe it is currently 18 bytes but this could
change with the version number of R -- I'd have to read the official
and definitive statement to see how the integer packing is done and
work out whether that could change the number of bytes. The number of
bytes would also change if we reached format version 10, but something
about the format would also change of course.  A safer way to look at
the header in the ascii version is as the first four lines.
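
A minimal sketch of that reading, not taken from serialize.c itself: split the ASCII stream on newlines, take the first four fields, and unpack the two version integers. The packing major*65536 + minor*256 + patch is an assumption here, though it agrees with the three values reported in this thread (132096, 132097 and 132352 for R 2.4.0, 2.4.1 and 2.5.0). The helper name decodeVersion is made up for illustration, and version=2 is passed explicitly because newer R releases may default to a different format version.

```r
# Hypothetical helper: unpack a version integer, ASSUMING the packing
# major*65536 + minor*256 + patch (consistent with the values in this thread).
decodeVersion <- function(packed) {
  packed <- as.integer(packed)
  major <- packed %/% 65536L
  minor <- (packed %% 65536L) %/% 256L
  patch <- packed %% 256L
  paste(major, minor, patch, sep=".")
}

# Serialize in ASCII, format version 2, and read the header as four lines.
raw <- serialize(1, connection=NULL, ascii=TRUE, version=2)
fields <- strsplit(rawToChar(raw), "\n", fixed=TRUE)[[1]][1:4]

cat("format code:    ", fields[1], "\n")
cat("format version: ", fields[2], "\n")
cat("written by R:   ", decodeVersion(fields[3]), "\n")
cat("readable from R:", decodeVersion(fields[4]), "\n")

decodeVersion(132097)  # "2.4.1"
decodeVersion(131840)  # "2.3.0"
```

Reading the header by lines rather than by byte offset means the sketch does not care how many digits the packed integers happen to have.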

Best,

luke

>
> Thanks
>
> /Henrik
>
> On 3/7/07, Henrik Bengtsson <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I noticed that serialize() gives different results depending on R
>> version, which has implications to the digest() function in the digest
>> package.  Note, it does give the same output across platforms.  I know
>> that serialize() is under development, but is this expected, e.g. is
>> there some kind of header in the result that specifies "who" generated
>> the stream, and if so, exactly what bytes are they?
>>
>> SETUP:
>>
>> R versions:
>> A) R v2.4.0 (2006-10-03)
>> B) R v2.4.1pat (2007-01-13 r40470)
>> C) R v2.5.0dev (2006-12-12 r40167)
>>
>> This is on WinXP and I start R with Rterm --vanilla.
>>
>> Example: Identical serialize() calls using the different R versions.
>>
>> > raw <- serialize(1, connection=NULL, ascii=TRUE)
>> > print(raw)
>>
>> gives:
>>
>> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>>
>> Note the difference in raw bytes 8 to 10, i.e.
>>
>> > raw[7:11]
>> (A): [1] 32 30 39 36 0a
>> (B): [1] 32 30 39 37 0a
>> (C): [1] 32 33 35 32 0a
>>
>> Do bytes 8, 9 and 10 in the raw vector somehow contain information
>> about the R version or similar?  The following poor man's test says
>> that is the only difference:
>>
>> On all R versions, the following gives identical results:
>>
>> > raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
>> > raw <- as.integer(raw[-c(8:10)])
>> > sum(raw)
>> [1] 2147884
>> > sum(log(raw))
>> [1] 177201.2
>>
>> If it is true that there is an R-version-specific header in serialized
>> objects, then the digest() function should exclude such a header in
>> order to produce consistent results across R versions, because now
>> digest

Re: [Rd] Small inconsistency in serialize() between R versions and implications on digest()

2007-03-08 Thread Luke Tierney

On Fri, 9 Mar 2007, Paul Murrell wrote:

> Hi
>
>
> Luke Tierney wrote:
>> On Wed, 7 Mar 2007, Henrik Bengtsson wrote:
>>
>>> To follow up, I went ahead and generated "random" objects to scan for a
>>> common header for a given R version, and it seems that at most
>>> the first 18 bytes are non-data specific, which could be the length of
>>> the serialization header.
>>>
>>> Here is my code for this:
>>>
>>> scanSerialize <- function(object, hdr=NULL, ...) {
>>>  # Serialize object
>>>  raw <- serialize(object, connection=NULL, ascii=TRUE);
>>>
>>>  # First run?
>>>  if (is.null(hdr))
>>>    return(raw);
>>>
>>>  # Find differences between current longest header and new raw vector
>>>  n <- length(hdr);
>>>  diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
>>>
>>>  # No differences?
>>>  if (!any(diffs))
>>>    return(hdr);
>>>
>>>  # Position of first difference
>>>  idx <- which(diffs)[1];
>>>
>>>  # Keep common header
>>>  hdr <- hdr[seq_len(idx-1)];
>>>
>>>  hdr;
>>> };
>>>
>>> # Serialize a first "random" object
>>> hdr <- scanSerialize(NA);
>>> for (kk in 1:100)
>>>  hdr <- scanSerialize(kk, hdr=hdr);
>>> for (kk in 1:100) {
>>>  x <- sample(letters, size=sample(100, 1), replace=TRUE);
>>>  hdr <- scanSerialize(x, hdr=hdr);
>>> }
>>> for (kk in 1:100) {
>>>  hdr <- scanSerialize(kk, hdr=hdr);
>>>  hdr <- scanSerialize(hdr, hdr=hdr);
>>> }
>>>
>>> cat("Length:", length(hdr), "\n");
>>> print(hdr);
>>> print(rawToChar(hdr));
>>>
>>> On R v2.5.0 devel, this gives:
>>> Length: 18
>>> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
>>> [1] "A\n2\n132352\n131840\n"
>>>
>>> However, it would still be good to get an "official" statement from
>>> one in the R-code team about the serialization header and where the
>>> data section starts.  Again, I want to cut out as much as possible for
>>> consistency between R versions without losing data-dependent bytes.
>>
>> An official, and definitive, statement from the _R-core_ team has been
>> available to you all along at
>>
>>  https://svn.r-project.org/R/trunk/src/main/serialize.c
>
>
> There's also a bit of info on this in Section 1.7 of the "R Internals"
> Manual.
>
> Paul

Thanks -- I'd forgotten about that.  Looking at that shows that my
unofficial and non-definitive interpretation was not quite right for
the binary case -- the header there is 14 bytes (I forgot that there
is a \n after the X even in the binary case).
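
As a rough check of the 14-byte figure, assuming format version 2 with the default XDR encoding: two bytes for "X\n" followed by three 4-byte big-endian integers gives 2 + 3*4 = 14. The variable names below are illustrative only, and version=2 is passed explicitly because newer R releases may default to a different format version with a longer header.

```r
# Serialize in binary (XDR) form, explicitly requesting format version 2.
raw <- serialize(1, connection=NULL, ascii=FALSE, version=2)

# First 14 bytes: "X\n" plus three 4-byte big-endian integers.
hdr <- raw[1:14]
fmt <- rawToChar(hdr[1:2])
ints <- readBin(hdr[3:14], what="integer", n=3L, size=4L, endian="big")

cat("format code:         ", sub("\n", "", fmt, fixed=TRUE), "\n")
cat("format version:      ", ints[1], "\n")
cat("writer (packed):     ", ints[2], "\n")
cat("min reader (packed): ", ints[3], "\n")
```

The three integers correspond to lines two to four of the ASCII header.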

Best,

luke


Re: [Rd] Small inconsistency in serialize() between R versions and implications on digest()

2007-03-08 Thread Paul Murrell

Hi


Luke Tierney wrote:
> On Wed, 7 Mar 2007, Henrik Bengtsson wrote:
> 
>> To follow up, I went ahead and generated "random" objects to scan for a
>> common header for a given R version, and it seems that at most
>> the first 18 bytes are non-data specific, which could be the length of
>> the serialization header.
>>
>> Here is my code for this:
>>
>> scanSerialize <- function(object, hdr=NULL, ...) {
>>  # Serialize object
>>  raw <- serialize(object, connection=NULL, ascii=TRUE);
>>
>>  # First run?
>>  if (is.null(hdr))
>>    return(raw);
>>
>>  # Find differences between current longest header and new raw vector
>>  n <- length(hdr);
>>  diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
>>
>>  # No differences?
>>  if (!any(diffs))
>>    return(hdr);
>>
>>  # Position of first difference
>>  idx <- which(diffs)[1];
>>
>>  # Keep common header
>>  hdr <- hdr[seq_len(idx-1)];
>>
>>  hdr;
>> };
>>
>> # Serialize a first "random" object
>> hdr <- scanSerialize(NA);
>> for (kk in 1:100)
>>  hdr <- scanSerialize(kk, hdr=hdr);
>> for (kk in 1:100) {
>>  x <- sample(letters, size=sample(100, 1), replace=TRUE);
>>  hdr <- scanSerialize(x, hdr=hdr);
>> }
>> for (kk in 1:100) {
>>  hdr <- scanSerialize(kk, hdr=hdr);
>>  hdr <- scanSerialize(hdr, hdr=hdr);
>> }
>>
>> cat("Length:", length(hdr), "\n");
>> print(hdr);
>> print(rawToChar(hdr));
>>
>> On R v2.5.0 devel, this gives:
>> Length: 18
>> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
>> [1] "A\n2\n132352\n131840\n"
>>
>> However, it would still be good to get an "official" statement from
>> one in the R-code team about the serialization header and where the
>> data section starts.  Again, I want to cut out as much as possible for
>> consistency between R versions without losing data-dependent bytes.
> 
> An official, and definitive, statement from the _R-core_ team has been
> available to you all along at
> 
>   https://svn.r-project.org/R/trunk/src/main/serialize.c


There's also a bit of info on this in Section 1.7 of the "R Internals"
Manual.

Paul



Re: [Rd] Small inconsistency in serialize() between R versions and implications on digest()

2007-03-08 Thread Henrik Bengtsson

On 3/8/07, Luke Tierney <[EMAIL PROTECTED]> wrote:
> On Fri, 9 Mar 2007, Paul Murrell wrote:
>
> > Hi
> >
> >
> > Luke Tierney wrote:
> >> On Wed, 7 Mar 2007, Henrik Bengtsson wrote:
> >>
> >>> To follow up, I went ahead and generated "random" objects to scan for a
> >>> common header for a given R version, and it seems that at most
> >>> the first 18 bytes are non-data specific, which could be the length of
> >>> the serialization header.
> >>>
> >>> Here is my code for this:
> >>>
> >>> scanSerialize <- function(object, hdr=NULL, ...) {
> >>>  # Serialize object
> >>>  raw <- serialize(object, connection=NULL, ascii=TRUE);
> >>>
> >>>  # First run?
> >>>  if (is.null(hdr))
> >>>    return(raw);
> >>>
> >>>  # Find differences between current longest header and new raw vector
> >>>  n <- length(hdr);
> >>>  diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
> >>>
> >>>  # No differences?
> >>>  if (!any(diffs))
> >>>    return(hdr);
> >>>
> >>>  # Position of first difference
> >>>  idx <- which(diffs)[1];
> >>>
> >>>  # Keep common header
> >>>  hdr <- hdr[seq_len(idx-1)];
> >>>
> >>>  hdr;
> >>> };
> >>>
> >>> # Serialize a first "random" object
> >>> hdr <- scanSerialize(NA);
> >>> for (kk in 1:100)
> >>>  hdr <- scanSerialize(kk, hdr=hdr);
> >>> for (kk in 1:100) {
> >>>  x <- sample(letters, size=sample(100, 1), replace=TRUE);
> >>>  hdr <- scanSerialize(x, hdr=hdr);
> >>> }
> >>> for (kk in 1:100) {
> >>>  hdr <- scanSerialize(kk, hdr=hdr);
> >>>  hdr <- scanSerialize(hdr, hdr=hdr);
> >>> }
> >>>
> >>> cat("Length:", length(hdr), "\n");
> >>> print(hdr);
> >>> print(rawToChar(hdr));
> >>>
> >>> On R v2.5.0 devel, this gives:
> >>> Length: 18
> >>> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
> >>> [1] "A\n2\n132352\n131840\n"
> >>>
> >>> However, it would still be good to get an "official" statement from
> >>> one in the R-code team about the serialization header and where the
> >>> data section starts.  Again, I want to cut out as much as possible for
> >>> consistency between R versions without losing data-dependent bytes.
> >>
> >> An official, and definitive, statement from the _R-core_ team has been
> >> available to you all along at
> >>
> >>  https://svn.r-project.org/R/trunk/src/main/serialize.c
> >
> >
> > There's also a bit of info on this in Section 1.7 of the "R Internals"
> > Manual.
> >
> > Paul
>
> Thanks -- I'd forgotten about that.  Looking at that shows that my
> unofficial and non-definitive interpretation was not quite right for
> the binary case -- the header there is 14 bytes (I forgot that there
> is a \n after the X even in the binary case).

Luke and Paul, thank you for this.  Searching for the 4th newline
seems to be the most robust thing to do in the ASCII case.
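
A minimal sketch of the fourth-newline approach, assuming an ASCII serialization whose header is exactly four lines (format version 2). The function name stripAsciiHeader is hypothetical and is not part of the digest package. The two test streams are the byte sequences (A) and (C) from the original post, written as text.

```r
# Hypothetical helper: drop everything up to and including the nlines-th
# newline of an ASCII serialization, keeping only the data section.
stripAsciiHeader <- function(raw, nlines=4L) {
  nl <- which(raw == as.raw(0x0a))
  if (length(nl) < nlines)
    stop("fewer than ", nlines, " newlines; not an ASCII serialization?")
  raw[-seq_len(nl[nlines])]
}

# Streams (A) and (C) from the original post: serialize(1, ascii=TRUE)
# under R 2.4.0 and R 2.5.0 devel. They differ only in the third line.
streamA <- charToRaw("A\n2\n132096\n131840\n14\n1\n1\n")
streamC <- charToRaw("A\n2\n132352\n131840\n14\n1\n1\n")

bodyA <- stripAsciiHeader(streamA)
bodyC <- stripAsciiHeader(streamC)

identical(streamA, streamC)  # FALSE
identical(bodyA, bodyC)      # TRUE
rawToChar(bodyA)             # "14\n1\n1\n"
```

Hashing the stripped body instead of the full stream would give the same value for both R versions in this example, without relying on the header being 18 bytes long.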

/Henrik

