Could you recommend any packages or functions that compute a hash while 
ignoring the source references and the sexpinfo_struct? Basically, I am 
looking for a version of `serialize` that converts R objects to raw vectors 
without storing the ancillary source references and sexpinfo.
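
To make the request concrete, here is a minimal sketch of the difference I 
mean. The variable names are only for illustration; the same code is parsed 
twice, with and without source references, and then compared.

```r
# Build the same function twice: once with and once without source references.
code <- "function(x) x + 1"
f_src   <- eval(parse(text = code, keep.source = TRUE))
f_nosrc <- eval(parse(text = code, keep.source = FALSE))

# Only the first one carries a "srcref" attribute.
has_srcref <- c(src   = !is.null(attr(f_src, "srcref")),
                nosrc = !is.null(attr(f_nosrc, "srcref")))

# The serialized bytes differ because the attribute is serialized as well.
same_bytes <- identical(serialize(f_src, NULL), serialize(f_nosrc, NULL))

# identical() can compare with or without the source reference.
same_ignoring_srcref <- identical(f_src, f_nosrc, ignore.srcref = TRUE)
same_with_srcref     <- identical(f_src, f_nosrc, ignore.srcref = FALSE)
```

So `identical` can already look past the source reference, but anything 
built on the bytes from `serialize` cannot.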

I think most people would think of `digest`, but that package uses `serialize` 
(see the discussion at 
https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875).
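
As a stopgap, something like the sketch below might work for plain 
functions. It assumes that hashing the deparsed code is acceptable: 
`code_fingerprint` is a hypothetical helper, not part of any package, and it 
ignores the enclosing environment and any non-srcref attributes. It relies on 
`deparse` rebuilding the text from the parsed code rather than from the stored 
source, and it uses `tools::md5sum` on a temporary file only to stay within 
base R.

```r
# Hypothetical helper: fingerprint a function by its deparsed code instead of
# its serialized bytes, so source references and comments do not enter the hash.
code_fingerprint <- function(fn) {
  stopifnot(is.function(fn))
  tmp <- tempfile()
  on.exit(unlink(tmp))
  # With its default 'control', deparse() does not use the stored source.
  writeLines(deparse(fn), tmp)
  unname(tools::md5sum(tmp))
}

code <- "function(x) {\n  # add one\n  x + 1\n}"
f_src   <- eval(parse(text = code, keep.source = TRUE))
f_nosrc <- eval(parse(text = code, keep.source = FALSE))
f_other <- function(x) x + 2

# Compare fingerprints with and without source references.
same_code      <- code_fingerprint(f_src) == code_fingerprint(f_nosrc)
different_code <- code_fingerprint(f_src) != code_fingerprint(f_other)
```

This does not cover nested environments or arbitrary objects, which is why I 
am asking whether a more general function exists.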

> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
> 
> 
> On 1/12/24 06:11, Dipterix Wang wrote:
>> Dear R devs,
>> 
>> I was digging into a package issue today when I realized that R's serialize 
>> function does not always generate the same results for equivalent objects 
>> when users choose to run the code differently. For example, the following code
>> 
>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
>> 
>> generates different results when I copy-paste into console vs when I use 
>> ctrl+shift+enter to source the file in RStudio.
>> 
>> Looking deeper into the cause, I found that functions and language objects 
>> get source references when getOption("keep.source") is TRUE. This means the 
>> source references make the functions differ, even though in most cases 
>> keeping the source does not affect how a function behaves.
>> 
>> While it's OK that serialize generates different results, functions such as 
>> `rlang::hash` and `digest::digest`, which depend on `serialize`, might 
>> end up reporting differences for identical inputs (false positives). I've 
>> checked the source code of the digest package hoping to get around this 
>> issue (for example, serialize(..., refhook = ...)). However, my workaround 
>> did not work. It seems that the markers to the objects are different even 
>> when I used `refhook` to force the srcref to be the same. I also tried 
>> `removeSource` and `rlang::zap_srcref`. Neither of them works directly on 
>> nested environments with multiple functions.
>> 
>> I wonder how hard it would be to have options to discard source when 
>> serializing R objects?
>> 
>> Currently my analyses depend heavily on the digest function to generate file 
>> caches and automatically schedule pipelines (to update cache) when changes 
>> are detected. The pipelines save the hashes of source code, inputs, and 
>> outputs together so other people can easily verify the calculation without 
>> accessing the original data (which could be sensitive), or running hour-long 
>> analyses, or having to buy servers. All of these require `serialize` to 
>> produce the same results regardless of how users choose to run the code.
>> 
>> It would be great if this feature could be included in a future version of 
>> R. Other pipeline packages such as `targets` and `drake` could also benefit 
>> from it.
> 
> I don't think such functionality would belong to serialize(). This function 
> is not meant to produce stable results based on the input; the serialized 
> representation may even differ based on properties not seen by users.
> 
> I think an option to ignore source code would belong to a function that 
> computes the hash, similar to the existing options of identical().
> 
> Tomas
> 
> 
>> Thanks,
>> 
>> - Dipterix
>> 
>> ______________________________________________
>> R-devel@r-project.org <mailto:R-devel@r-project.org> mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
