[Rd] Request: tools::md5sum should accept connections and finally in-memory objects

2020-05-01 Thread Dénes Tóth



AFAIK there is no hashing utility in base R which can create hash 
digests of arbitrary R objects. However, as also described by Henrik 
Bengtsson in [1], we have tools::md5sum() which calculates MD5 hashes of 
files. Calculating hashes of in-memory objects is a very common task in 
several areas, as demonstrated by the popularity of the 'digest' package 
(~850,000 downloads/month).


Upon inspection of the relevant files in the R sources (e.g., [2] and 
[3]), it seems that all the building blocks are already implemented, so 
hashing need not be restricted to files. I would like to ask:


1) Why is md5_buffer unused?
In src/library/tools/src/md5.c [see 2], md5_buffer is implemented, which 
seems to be the counterpart of md5_stream for non-file inputs:


---
#ifdef UNUSED
/* Compute MD5 message digest for LEN bytes beginning at BUFFER.  The
   result is always in little endian byte order, so that a byte-wise
   output yields to the wanted ASCII representation of the message
   digest.  */
static void *
md5_buffer (const char *buffer, size_t len, void *resblock)
{
  struct md5_ctx ctx;

  /* Initialize the computation context.  */
  md5_init_ctx (&ctx);

  /* Process whole buffer but last len % 64 bytes.  */
  md5_process_bytes (buffer, len, &ctx);

  /* Put result in desired memory area.  */
  return md5_finish_ctx (&ctx, resblock);
}
#endif
---

2) How can the R community help to make this feature available in 
package 'tools'?


Suggestions:
As a first step, it would be great if tools::md5sum would support 
connections (credit goes to Henrik for the idea). E.g., instead of the 
signature tools::md5sum(files), we could have 
tools::md5sum(files, conn = NULL), which would allow:


x <- runif(10)
tools::md5sum(conn = rawConnection(serialize(x, NULL)))

To avoid the inconsistency between 'files' (which computes the hash 
digests in a vectorized manner, that is, one for each file) and 'conn' 
(which expects a single connection), and to make it easier to extend the 
hashing for other algorithms without changing the main R interface, a 
more involved solution would be to introduce tools::hash and 
tools::hashes, in a similar vein to digest::digest and digest::getVDigest.
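
To illustrate the vectorized side of that inconsistency, here is a minimal sketch of how the existing tools::md5sum(files) behaves today: it returns one digest per file path. The temporary files and their contents are made up for the example.

```r
# Two small temporary files with different contents (illustrative only).
f1 <- tempfile()
f2 <- tempfile()
writeLines("first file", f1)
writeLines("second file", f2)

# md5sum() is vectorized over 'files': one 32-character hex digest per path,
# returned as a character vector named by the paths.
sums <- tools::md5sum(c(f1, f2))
print(sums)

# Clean up the temporary files.
unlink(c(f1, f2))
```

A 'conn' argument taking a single connection would return a single digest, which is the mismatch a separate pair of functions would avoid.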


Regards,
Denes


[1]: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/21
[2]: 
https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/tools/src/md5.c#L172
[3]: 
https://github.com/wch/r-source/blob/5a156a0865362bb8381dcd69ac335f5174a4f60c/src/library/tools/src/Rmd5.c#L27


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Request: tools::md5sum should accept connections and finally in-memory objects

2020-05-01 Thread John Mount
Perhaps use the digest package? Isn't "R the R packages?"


---
John Mount
http://www.win-vector.com/  
Our book: Practical Data Science with R
http://practicaldatascience.com  








Re: [Rd] Request: tools::md5sum should accept connections and finally in-memory objects

2020-05-01 Thread Dénes Tóth




On 5/1/20 11:09 PM, John Mount wrote:

> Perhaps use the digest package? Isn't "R the R packages?"


I think it is clear that I am aware of the existence of the digest 
package and also of other packages with similar functionality, e.g. the 
fastdigest package. (And I actually do use digest, as I guess 99% of R 
developers do, at least as an indirect dependency.) The point is 
that
a) digest is a wonderful and very stable package, but still, it is a 
user-contributed package, whereas
b) 'tools' is a base package which is included by default in all R 
installations, and
c) tools::md5sum already exists, with almost all building blocks to 
allow its extension to calculate MD5 hashes of R objects, and
d) there is high demand in the R community for being able to calculate 
hashes.


So yes, if one wants to use all the utilities or the various algos that 
the digest package provides, one should install and load it. But if one 
can live with MD5 hashes, why not use the built-in R function? (Well, 
without serializing an object to a file, calling tools::md5sum, and then 
cleaning up the file.)
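
For reference, the file-based detour mentioned in the parentheses can be sketched as follows. This is only an illustration of the workaround; the helper name md5_object() is hypothetical and is not part of 'tools' or any other package.

```r
# Hypothetical helper: MD5 digest of an in-memory R object, obtained by
# serializing it to a temporary file and hashing that file.
md5_object <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)     # clean up the file afterwards
  writeBin(serialize(x, NULL), tmp)    # write the raw serialization to disk
  unname(tools::md5sum(tmp))           # hash the file, drop the path name
}

x <- runif(10)
h <- md5_object(x)
print(h)
```

The digest is that of the serialized byte stream, not of the object in any abstract sense, so it depends on how serialize() encodes the object.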






Re: [Rd] Request: tools::md5sum should accept connections and finally in-memory objects

2020-05-01 Thread Duncan Murdoch
The tools package is not for users, it's for functions that R uses in 
installing packages, checking them, etc.  If you want a function for 
users, it would belong in utils.  But what's wrong with the digest 
package?  What's the argument that R Core should take this on?


Duncan Murdoch



Re: [Rd] Request: tools::md5sum should accept connections and finally in-memory objects

2020-05-01 Thread John Mount


> So yes, if one wants to use all the utilities or the various algos that the 
> digest package provides, one should install and load it. But if one can live 
> with MD5 hashes, why not use the built-in R function? (Well, without 
> serializing an object to a file, calling tools::md5sum, and then cleaning up 
> the file.)

Doesn't that assume that the serialization method is deterministic? Is that a 
documented property of the serialization tools?
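
A small sketch of why this matters, using only serialize() from base R. It shows what can be observed in one session, not a documented guarantee: the same call repeated on the same object yields the same bytes, while changing the serialization format version changes the bytes and hence any hash computed from them. The example vector is arbitrary.

```r
x <- c(1.5, 2.5, 3.5)

# Same object, same call, same session: compare the two byte streams.
same_call <- identical(serialize(x, NULL), serialize(x, NULL))

# Same object, different serialization format versions (version 3 needs
# R >= 3.5.0): the headers differ, so the byte streams differ too.
v2 <- serialize(x, NULL, version = 2)
v3 <- serialize(x, NULL, version = 3)
across_versions <- identical(v2, v3)

print(c(same_call = same_call, across_versions = across_versions))
```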



Re: [Rd] Request: tools::md5sum should accept connections and finally in-memory objects

2020-05-01 Thread Dénes Tóth




On 5/1/20 11:35 PM, Duncan Murdoch wrote:
> The tools package is not for users, it's for functions that R uses in 
> installing packages, checking them, etc. 


I think the target group for this functionality is the group of R 
developers, not regular R users.


> If you want a function for 
> users, it would belong in utils.  But what's wrong with the digest 
> package?  What's the argument that R Core should take this on?


There is nothing wrong with the digest package except for being an extra 
dependency which could be avoided if an already implemented C function 
were available at the R level.


I do understand that, given the load on R Core, they take on new 
features and the related maintenance burden only if absolutely 
necessary. This is why I asked first whether there is a particular 
reason not to expose an already existing (base-R) implementation. I 
think it is reasonable to assume that 'md5_buffer' exists for a reason - 
but probably there is a reason why it never became part of any exported 
function. Now I checked the history of the md5.c file; it was last 
edited 8 years ago. Somewhat surprisingly, md5_buffer was already 
included in the original file (created 17 years ago), but marked as 
UNUSED 12 years ago.


Just to clarify: I do not want to suggest that the R Core team should 
take over all the functionality of the digest package. I am really 
focusing only on computing MD5 digests, which is already possible for 
files. My suggestion for a more general function was meant to keep 
potential further enhancements in mind.




