On 1/16/24 20:16, Dipterix Wang wrote:
Could you recommend any packages or functions that compute a hash such
that the source references and sexpinfo_struct are ignored? Basically, a
version of `serialize` that converts R objects to raw without storing
the ancillary source references and sexpinfo.
I think most people would think of `digest`, but that package uses
`serialize` (see discussion
https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875).
I think one could implement hashing on the fly without any
serialization, similarly to how identical works, but I am not aware of
any existing implementation. Again, if that wasn't clear: I don't think
trying to compute a hash of an object from its serialized representation
is a good idea - it is of course convenient, but has problems like the
one you have run into.
In some applications it may still be good enough: if various tweaks,
such as ensuring source references are off in your case, get you to a
state where false alarms (identical objects having different hashes)
are rare, and hence unnecessary re-computation is rare, that may be
good enough.
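One way to make sure source references are off is sketched below. It is only an illustration: the temporary script and the function `f` are made-up stand-ins, and `source(..., keep.source = FALSE)` affects only the code loaded by that call.

```r
# Write a small script to a temporary file (hypothetical stand-in for a
# real pipeline script).
script <- tempfile(fileext = ".R")
writeLines("f <- function(x) {\n  x + 1\n}", script)

# Source it into a fresh environment without keeping source references.
env <- new.env()
source(script, local = env, keep.source = FALSE)

# The function defined by the script carries no "srcref" attribute.
has_srcref <- !is.null(attr(env$f, "srcref"))
print(has_srcref)

unlink(script)
```

Setting `options(keep.source = FALSE)` early in a session changes the default for code parsed afterwards; functions that were already created with source references keep them.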
Tomas
On Jan 12, 2024, at 11:33 AM, Tomas Kalibera
<tomas.kalib...@gmail.com> wrote:
On 1/12/24 06:11, Dipterix Wang wrote:
Dear R devs,
I was digging into a package issue today when I realized that R's
`serialize` function does not always generate the same results for
equivalent objects; the result depends on how users choose to run the
code. For example, the following code
serialize(with(new.env(), { function(){} }), NULL, TRUE)
generates different results when I copy-paste it into the console vs.
when I use ctrl+shift+enter to source the file in RStudio.
After a closer look at the cause, I found that functions and language
objects get source references when getOption("keep.source") is TRUE.
This means the source references make the functions differ, even
though in most cases keeping the function source does not affect how
a function behaves.
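The effect can be reproduced without RStudio. The sketch below is only an illustration: `make_fun` is a made-up helper that parses the same source text with and without source references, which mimics keep.source being TRUE or FALSE.

```r
# Build the same function from identical source text, with or without
# source references (hypothetical helper, for illustration only).
make_fun <- function(keep) {
  eval(parse(text = "function(x) {\n  x + 1\n}", keep.source = keep),
       envir = globalenv())
}

f_src   <- make_fun(TRUE)
f_nosrc <- make_fun(FALSE)

# Only the first closure carries a "srcref" attribute.
print(is.null(attr(f_src, "srcref")))
print(is.null(attr(f_nosrc, "srcref")))

# Compare the serialized bytes of the two closures.
same_bytes <- identical(serialize(f_src, NULL), serialize(f_nosrc, NULL))
print(same_bytes)

# Both functions still compute the same thing.
print(f_src(1) == f_nosrc(1))
```

Because the two closures differ only in their source reference attributes, any hash computed from the serialized bytes would differ as well.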
While it's OK that `serialize` generates different results,
functions such as `rlang::hash` and `digest::digest`, which depend
on `serialize`, might end up reporting false positives (spurious
changes) for the same inputs. I've checked the source code of the
digest package hoping to get around this issue (for example with
serialize(..., refhook = ...)). However, my workaround did not work.
It seems that the markers to the objects are different even if I used
`refhook` to force srcref to be the same. I also tried `removeSource`
and `rlang::zap_srcref`. Neither works directly on nested
environments with multiple functions.
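For reference, a recursive walk over nested environments could look roughly like the sketch below. `strip_srcref` is a made-up helper, not part of any package. It modifies environments in place, does not descend into the enclosing environments of closures, and assumes bindings are not locked. It only removes the source references; whether the serialized bytes then match across sessions is not guaranteed.

```r
# Hypothetical helper: remove source references from every closure reachable
# through lists and environments. Environments are modified IN PLACE.
strip_srcref <- function(x, state = new.env()) {
  if (is.function(x)) {
    # removeSource() returns primitives unchanged.
    return(utils::removeSource(x))
  }
  if (is.environment(x)) {
    # Guard against cycles: skip environments that were already visited.
    for (s in state$seen) {
      if (identical(s, x)) return(invisible(x))
    }
    state$seen <- c(state$seen, list(x))
    for (nm in ls(x, all.names = TRUE)) {
      value <- get(nm, envir = x, inherits = FALSE)
      assign(nm, strip_srcref(value, state), envir = x)
    }
    return(invisible(x))
  }
  if (is.list(x)) {
    x[] <- lapply(x, strip_srcref, state = state)
    return(x)
  }
  x
}

# Usage: a nested environment with two functions and a cycle.
e <- new.env()
e$f <- eval(parse(text = "function(x) {\n  x + 1\n}", keep.source = TRUE))
e$inner <- new.env()
e$inner$g <- eval(parse(text = "function() {\n  NULL\n}", keep.source = TRUE))
e$inner$up <- e  # cycle back to the outer environment

strip_srcref(e)
print(is.null(attr(e$f, "srcref")))
print(is.null(attr(e$inner$g, "srcref")))
```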
I wonder how hard it would be to have an option to discard source
references when serializing R objects?
Currently my analyses depend heavily on the `digest` function to generate
file caches and automatically schedule pipelines (to update cache)
when changes are detected. The pipelines save the hashes of source
code, inputs, and outputs together so other people can easily verify
the calculation without accessing the original data (which could be
sensitive), or running hour-long analyses, or having to buy servers.
All of these require `serialize` to produce the same results
regardless of how users choose to run the code.
It would be great if this feature could be included in a future
version of R. Other pipeline packages such as `targets` and `drake`
could also benefit from it.
I don't think such functionality would belong in serialize(). This
function is not meant to produce stable results based on the input;
the serialized representation may even differ based on properties not
seen by users.
I think an option to ignore source code would belong in a function
that computes the hash, much like the existing options of identical().
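For comparison, identical() already exposes such a switch through its `ignore.srcref` argument. A small sketch, where the two functions are made up for illustration:

```r
# The same source text, parsed with and without source references.
f_src   <- eval(parse(text = "function(x) x + 1", keep.source = TRUE),
                envir = globalenv())
f_nosrc <- eval(parse(text = "function(x) x + 1", keep.source = FALSE),
                envir = globalenv())

# Comparison that ignores source references (the default for identical()).
same_ignoring_srcref <- identical(f_src, f_nosrc, ignore.srcref = TRUE)

# Comparison that takes source references into account.
same_with_srcref <- identical(f_src, f_nosrc, ignore.srcref = FALSE)

print(same_ignoring_srcref)
print(same_with_srcref)
```

A hashing function with an analogous option would treat the two closures as equal in the first case only.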
Tomas
Thanks,
- Dipterix
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel