Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

Tomas Kalibera Wed, 17 Jan 2024 01:32:18 -0800

On 1/16/24 20:16, Dipterix Wang wrote:

Could you recommend any packages/functions that compute a hash such that the source references and sexpinfo_struct are ignored? Basically a version of `serialize` that converts R objects to raw without storing the ancillary source reference and sexpinfo. I think most people would think of `digest`, but that package uses `serialize` (see discussion https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)

I think one could implement hashing on the fly without any serialization, similarly to how identical() works, but I am not aware of any existing implementation. Again, if that wasn't clear: I don't think trying to compute a hash of an object from its serialized representation is a good idea. It is of course convenient, but it has problems like the one you have run into.

In some applications it may still be good enough: if various tweaks, such as ensuring source references are off in your case, get you to a state where false alarms (identical objects having different hashes) are rare, and hence unnecessary re-computation is rare, then maybe that is good enough.


Tomas

On Jan 12, 2024, at 11:33 AM, Tomas Kalibera <[email protected]> wrote:
On 1/12/24 06:11, Dipterix Wang wrote:
Dear R devs,
I was digging into a package issue today when I realized that R's serialize function does not always generate the same results on equivalent objects when users choose to run the code differently. For example, the following code
serialize(with(new.env(), { function(){} }), NULL, TRUE)
generates different results when I copy-paste it into the console vs. when I use ctrl+shift+enter to source the file in RStudio.
Looking deeper into the cause, I found that functions and language objects get source references when getOption("keep.source") is TRUE. This means the source references make the functions differ, while in most cases keeping the function source does not affect how a function behaves.
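A minimal sketch of the effect, assuming the function text is parsed explicitly so that keep.source can be set per call instead of through the global option. The names code, f_src, f_nosrc and has_srcref are made up for illustration:

```r
# Parse and evaluate the same function text with and without source references.
code <- "function(x) x + 1"
f_src   <- eval(parse(text = code, keep.source = TRUE))
f_nosrc <- eval(parse(text = code, keep.source = FALSE))

# A closure created from source-annotated code carries a "srcref" attribute.
has_srcref <- function(f) !is.null(attr(f, "srcref"))
has_srcref(f_src)
has_srcref(f_nosrc)

# The serialized bytes differ because the srcref is serialized along with
# the function, even though both functions behave the same.
bytes_src   <- serialize(f_src, NULL)
bytes_nosrc <- serialize(f_nosrc, NULL)
identical(bytes_src, bytes_nosrc)
```

Any hash computed from those bytes would differ in the same way.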
While it's OK that serialize generates different results, functions such as `rlang::hash` and `digest::digest`, which depend on `serialize`, might eventually deliver false positives (different hashes) on the same inputs. I've checked the source code of the digest package hoping to get around this issue (for example serialize(..., refhook = ...)). However, my workaround did not work. It seems that the markers to the objects are different even if I used `refhook` to force srcref to be the same. I also tried `removeSource` and `rlang::zap_srcref`. Neither of them works directly on nested environments with multiple functions.
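For nested environments, one possible approach is to walk the bindings and apply utils::removeSource to each closure. The sketch below is only an illustration under assumptions: strip_srcref_env is a hypothetical helper, not part of any package, it modifies the environments in place, it forces promises and active bindings when reading them, and it does not descend into the enclosing environments of the closures themselves. It also does not make serialize() stable in general, only removes one known source of differences.

```r
# Hypothetical helper: remove source references from every closure bound in
# `env`, descending into nested environments. Modifies `env` in place.
strip_srcref_env <- function(env, state = new.env()) {
  # Track visited environments so cyclic references do not recurse forever.
  visited <- if (is.null(state$visited)) list() else state$visited
  for (e in visited) if (identical(e, env)) return(invisible(env))
  state$visited <- c(visited, list(env))

  for (nm in ls(env, all.names = TRUE)) {
    obj <- get(nm, envir = env, inherits = FALSE)
    if (is.function(obj) && !is.primitive(obj)) {
      assign(nm, utils::removeSource(obj), envir = env)
    } else if (is.environment(obj)) {
      strip_srcref_env(obj, state)
    }
  }
  invisible(env)
}

# Usage: an environment holding a function and a nested environment.
outer <- new.env()
outer$f <- eval(parse(text = "function(x) { x + 1 }", keep.source = TRUE))
outer$inner <- new.env()
outer$inner$g <- eval(parse(text = "function(y) { y * 2 }", keep.source = TRUE))

strip_srcref_env(outer)
is.null(attr(outer$f, "srcref"))
is.null(attr(outer$inner$g, "srcref"))
```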
I wonder how hard it would be to have options to discard source when serializing R objects?
Currently my analyses heavily depend on the digest function to generate file caches and automatically schedule pipelines (to update the cache) when changes are detected. The pipelines save the hashes of source code, inputs, and outputs together so other people can easily verify the calculation without accessing the original data (which could be sensitive), or running hour-long analyses, or having to buy servers. All of these require `serialize` to produce the same results regardless of how users choose to run the code.
It would be great if this feature could be in a future version of R. Other pipeline packages such as `targets` and `drake` could also benefit from it.
I don't think such functionality would belong to serialize(). This function is not meant to produce stable results based on the input; the serialized representation may even differ based on properties not seen by users.
I think an option to ignore source code would belong to a function that computes the hash, much like the options of identical().
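For reference, identical() already exposes such an option for closures through its ignore.srcref argument. A small sketch, with code, f_src and f_nosrc as made-up names:

```r
# Build the same function once with and once without source references.
code <- "function(x) x + 1"
f_src   <- eval(parse(text = code, keep.source = TRUE))
f_nosrc <- eval(parse(text = code, keep.source = FALSE))

# By default identical() ignores srcref when comparing closures.
same_default <- identical(f_src, f_nosrc)

# With ignore.srcref = FALSE the srcref attribute takes part in the comparison.
same_strict <- identical(f_src, f_nosrc, ignore.srcref = FALSE)

c(default = same_default, strict = same_strict)
```

A hashing function with a comparable switch would be the analogue of this behaviour; no such function is assumed to exist here.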
Tomas
Thanks,

- Dipterix

______________________________________________
[email protected] list
https://stat.ethz.ch/mailman/listinfo/r-devel

