Could you recommend any packages/functions that compute a hash such that the source references and sexpinfo_struct are ignored? Basically, a version of `serialize` that converts R objects to raw without storing the ancillary source references and sexpinfo.
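To make the question concrete, here is a minimal sketch in base R. It assumes a recent R version in which `identical()` has the `ignore.srcref` argument, and `make_fn` is a hypothetical helper written only for this illustration. It shows the serialized bytes of one and the same closure changing when a source reference is attached, while `identical()` can be told to ignore that difference. It does not produce a hash, and it does not cover nested environments holding several functions.

```r
# Build the same closure twice, once with and once without source references.
# `make_fn` is a hypothetical helper used only for this illustration.
make_fn <- function(keep_source) {
  expr <- parse(text = "function(x) { x + 1 }", keep.source = keep_source)
  # Evaluate in the global environment so both closures share one enclosure.
  eval(expr[[1]], envir = globalenv())
}

f_src   <- make_fn(TRUE)   # carries a "srcref" attribute
f_plain <- make_fn(FALSE)  # no source reference attached

# The closure parsed with keep.source = TRUE has a source reference.
has_srcref <- !is.null(attr(f_src, "srcref"))

# serialize() stores the source reference, so the raw vectors differ.
same_bytes <- identical(serialize(f_src, NULL), serialize(f_plain, NULL))

# identical() can compare the closures with or without source references.
same_ignoring_srcref <- identical(f_src, f_plain, ignore.srcref = TRUE)
same_with_srcref     <- identical(f_src, f_plain, ignore.srcref = FALSE)
```

What I am after is the behaviour of `ignore.srcref = TRUE`, but as a hash or raw vector that can be stored and compared later, not as a pairwise comparison.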
I think most people would think of `digest`, but that package uses `serialize` (see discussion https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875).

> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
> On 1/12/24 06:11, Dipterix Wang wrote:
>> Dear R devs,
>>
>> I was digging into a package issue today when I realized R serialize
>> function not always generate the same results on equivalent objects when
>> users choose to run differently. For example, the following code
>>
>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
>>
>> generates different results when I copy-paste into console vs when I use
>> ctrl+shift+enter to source the file in RStudio.
>>
>> With a deeper inspect into the cause, I found that function and language get
>> source reference when getOption("keep.source") is TRUE. This means the
>> source reference will make the functions different while in most cases,
>> whether keeping function source might not impact how a function behaves.
>>
>> While it's OK that function serialize generates different results, functions
>> such as `rlang::hash` and `digest::digest`, which depend on `serialize`
>> might eventually deliver false positives on same inputs. I've checked source
>> code in digest package hoping to get around this issue (for example
>> serialize(..., refhook = ...)). However, my workaround did not work. It
>> seems that the markers to the objects are different even if I used `refhook`
>> to force srcref to be the same. I also tried `removeSource` and
>> `rlang::zap_srcref`. None of them works directly on nested environments with
>> multiple functions.
>>
>> I wonder how hard it would be to have options to discard source when
>> serializing R objects?
>>
>> Currently my analyses heavily depend on digest function to generate file
>> caches and automatically schedule pipelines (to update cache) when changes
>> are detected. The pipelines save the hashes of source code, inputs, and
>> outputs together so other people can easily verify the calculation without
>> accessing the original data (which could be sensitive), or running hour-long
>> analyses, or having to buy servers. All of these require `serialize` to
>> produce the same results regardless of how users choose to run the code.
>>
>> It would be great if this feature could be in the future R. Other pipeline
>> packages such as `targets` and `drake` can also benefit from it.
>
> I don't think such functionality would belong to serialize(). This function
> is not meant to produce stable results based on the input, the serialized
> representation may even differ based on properties not seen by users.
>
> I think an option to ignore source code would belong to a function that
> computes the hash, as other options of identical().
>
> Tomas
>
>> Thanks,
>>
>> - Dipterix
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel