On 1/12/24 06:11, Dipterix Wang wrote:
Dear R devs,
I was digging into a package issue today when I realized that R's serialize
function does not always generate the same results for equivalent objects,
depending on how users choose to run the code. For example, the following code
serialize(with(new.env(), { function(){} }), NULL, TRUE)
generates different results when I copy-paste into console vs when I use
ctrl+shift+enter to source the file in RStudio.
Looking deeper into the cause, I found that functions and language objects get
source references when getOption("keep.source") is TRUE. This means the source
references make the functions differ, even though in most cases keeping the
function source does not affect how a function behaves.
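
A minimal sketch of the effect described above, assuming that parsing the same
text with and without `keep.source` is a fair stand-in for console input versus
sourcing a file. The variable names are illustrative only.

```r
# Parse and evaluate the same function text with and without source references.
f_src   <- eval(parse(text = "function() {}", keep.source = TRUE))
f_nosrc <- eval(parse(text = "function() {}", keep.source = FALSE))

# Only the first closure carries a "srcref" attribute.
has_src   <- !is.null(attr(f_src, "srcref"))
has_nosrc <- !is.null(attr(f_nosrc, "srcref"))

# The serialized bytes differ because the srcref (and its srcfile) is
# written out together with the function.
bytes_src   <- serialize(f_src, NULL)
bytes_nosrc <- serialize(f_nosrc, NULL)
differs <- !identical(bytes_src, bytes_nosrc)
```

Any hash computed from these bytes will therefore differ as well, although both
closures behave the same when called.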
While it's OK that the serialize function generates different results, functions
such as `rlang::hash` and `digest::digest`, which depend on `serialize`, might
eventually report false positives on the same inputs. I've checked the source
code of the digest package hoping to get around this issue (for example serialize(...,
refhook = ...)). However, my workaround did not work. It seems that the markers
to the objects are different even if I used `refhook` to force srcref to be the
same. I also tried `removeSource` and `rlang::zap_srcref`. Neither of them works
directly on nested environments with multiple functions.
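
One possible way to apply `removeSource` across nested environments is to walk
the bindings by hand. The helper below, `strip_srcrefs`, is hypothetical (it is
not part of base R, digest, or rlang) and is only a sketch: it modifies the
environment in place and removes source references only, so the serialized
bytes may still differ for other reasons.

```r
# Hypothetical helper: strip source references from every closure bound in
# `env`, descending into environment-valued bindings. Modifies `env` in place.
strip_srcrefs <- function(env) {
  seen <- list()  # environments already visited, so cycles terminate
  visit <- function(e) {
    for (s in seen) if (identical(s, e)) return(invisible(NULL))
    seen[[length(seen) + 1L]] <<- e
    for (name in ls(e, all.names = TRUE)) {
      value <- get(name, envir = e, inherits = FALSE)
      if (is.environment(value)) {
        visit(value)
      } else if (is.function(value) && !is.primitive(value)) {
        assign(name, utils::removeSource(value), envir = e)
      }
    }
    invisible(NULL)
  }
  visit(env)
  invisible(env)
}

# Usage: two functions parsed with source references, in nested environments.
outer <- new.env()
outer$f <- eval(parse(text = "function(x) { x + 1 }", keep.source = TRUE))
outer$inner <- new.env()
outer$inner$g <- eval(parse(text = "function() {}", keep.source = TRUE))
outer$inner$back <- outer  # a cycle, to exercise the `seen` check
strip_srcrefs(outer)
```

The sketch does not follow the enclosing environments of the closures
themselves, and it does not handle locked or active bindings.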
I wonder how hard it would be to have options to discard source when
serializing R objects?
Currently my analyses depend heavily on the digest function to generate file caches
and automatically schedule pipelines (to update cache) when changes are
detected. The pipelines save the hashes of source code, inputs, and outputs
together so other people can easily verify the calculation without accessing
the original data (which could be sensitive), or running hour-long analyses, or
having to buy servers. All of these require `serialize` to produce the same
results regardless of how users choose to run the code.
It would be great if this feature could be in a future version of R. Other
pipeline packages such as `targets` and `drake` could also benefit from it.
I don't think such functionality belongs in serialize(). This function is
not meant to produce stable results based on the input; the serialized
representation may even differ based on properties not seen by users.
I think an option to ignore source code would belong to a function that
computes the hash, similar to the existing options of identical().
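
For reference, a short sketch of the identical() option alluded to here. The
`ignore.srcref` argument is part of base R; the variable names are illustrative
only.

```r
# Same function text, parsed with and without source references.
f_src   <- eval(parse(text = "function(x) x + 1", keep.source = TRUE))
f_nosrc <- eval(parse(text = "function(x) x + 1", keep.source = FALSE))

# By default (ignore.srcref = TRUE) identical() disregards source references
# when comparing functions.
same_default <- identical(f_src, f_nosrc)

# Requesting a strict comparison takes the srcref attribute into account.
same_strict <- identical(f_src, f_nosrc, ignore.srcref = FALSE)
```

A hashing function could expose a comparable switch, for instance by removing
source references before the object is serialized.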
Tomas
Thanks,
- Dipterix
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel