> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation

We have one in vctrs but it's not exported:
https://github.com/r-lib/vctrs/blob/main/src/hash.c

The main use is vectorised hashing:

```
# Non-vectorised
vctrs:::obj_hash(1:10)
#> [1] 1e 77 ce 48

# Vectorised
vctrs:::vec_hash(1L)
#> [1] 70 a2 85 ef

vctrs:::vec_hash(1:2)
#> [1] 70 a2 85 ef bf 3c 2c cf

# vctrs semantics so dfs are vectors of rows
length(vctrs:::vec_hash(mtcars)) / 4
#> [1] 32

nrow(mtcars)
#> [1] 32
```

Best,
Lionel

On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
> On 1/16/24 20:16, Dipterix Wang wrote:
> > Could you recommend any packages/functions that compute a hash such that
> > the source references and sexpinfo_struct are ignored? Basically a
> > version of `serialize` that converts R objects to raw without storing
> > the ancillary source reference and sexpinfo.
> > I think most people would think of `digest` but that package uses
> > `serialize` (see discussion
> > https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)
>
> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation. Again, if that wasn't clear: I don't think
> trying to compute a hash of an object from its serialized representation
> is a good idea - it is of course convenient, but has problems like the
> one you have run into.
>
> In some applications it may still be good enough: if by various tweaks,
> such as ensuring source references are off in your case, you achieve a
> state where false alarms are rare (identical objects have different
> hashes), and hence say unnecessary re-computation is rare, maybe it is
> good enough.
>
> Tomas
>
> >
> >> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera
> >> <tomas.kalib...@gmail.com> wrote:
> >>
> >>
> >> On 1/12/24 06:11, Dipterix Wang wrote:
> >>> Dear R devs,
> >>>
> >>> I was digging into a package issue today when I realized that R's
> >>> serialize function does not always generate the same results on
> >>> equivalent objects when users run the code in different ways. For
> >>> example, the following code
> >>>
> >>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
> >>>
> >>> generates different results when I copy-paste it into the console vs
> >>> when I use ctrl+shift+enter to source the file in RStudio.
> >>>
> >>> Looking deeper into the cause, I found that functions and language
> >>> objects get a source reference when getOption("keep.source") is TRUE.
> >>> This means the source reference will make the functions different,
> >>> while in most cases whether the function source is kept might not
> >>> impact how a function behaves.
> >>>
> >>> While it's OK that serialize generates different results,
> >>> functions such as `rlang::hash` and `digest::digest`, which depend
> >>> on `serialize`, might eventually deliver false positives on the same
> >>> inputs. I've checked the source code of the digest package hoping to
> >>> get around this issue (for example serialize(..., refhook = ...)).
> >>> However, my workaround did not work. It seems that the markers to
> >>> the objects are different even if I used `refhook` to force srcref
> >>> to be the same. I also tried `removeSource` and `rlang::zap_srcref`.
> >>> Neither of them works directly on nested environments with multiple
> >>> functions.
> >>>
> >>> I wonder how hard it would be to have options to discard source when
> >>> serializing R objects?
> >>>
> >>> Currently my analyses heavily depend on the digest function to
> >>> generate file caches and automatically schedule pipelines (to update
> >>> the cache) when changes are detected. The pipelines save the hashes
> >>> of source code, inputs, and outputs together so other people can
> >>> easily verify the calculation without accessing the original data
> >>> (which could be sensitive), or running hour-long analyses, or having
> >>> to buy servers. All of these require `serialize` to produce the same
> >>> results regardless of how users choose to run the code.
> >>>
> >>> It would be great if this feature could be in a future R. Other
> >>> pipeline packages such as `targets` and `drake` could also benefit
> >>> from it.
> >>
> >> I don't think such functionality would belong to serialize(). This
> >> function is not meant to produce stable results based on the input,
> >> the serialized representation may even differ based on properties not
> >> seen by users.
> >>
> >> I think an option to ignore source code would belong to a function
> >> that computes the hash, as other options of identical().
> >>
> >> Tomas
> >>
> >>
> >>> Thanks,
> >>>
> >>> - Dipterix
> >>>
> >>> ______________________________________________
> >>> R-devel@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
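
For reference, the srcref effect discussed in this thread can be reproduced with a minimal base R sketch. It parses the same function text with and without `keep.source`, so only one of the two closures carries a `srcref` attribute. The variable names (`src`, `f_src`, `f_nosrc`) are illustrative and not taken from any package; the sketch only demonstrates the difference, it is not a hashing implementation.

```r
# Same function text, parsed with and without source references.
src <- "function(x) x + 1"
f_src   <- eval(parse(text = src, keep.source = TRUE))
f_nosrc <- eval(parse(text = src, keep.source = FALSE))

# Only the first closure carries a "srcref" attribute.
has_srcref_kept    <- !is.null(attr(f_src, "srcref"))
has_srcref_dropped <- !is.null(attr(f_nosrc, "srcref"))

# identical() ignores srcrefs on closures by default (ignore.srcref = TRUE).
same_default <- identical(f_src, f_nosrc)

# With ignore.srcref = FALSE the extra attribute is taken into account.
same_strict <- identical(f_src, f_nosrc, ignore.srcref = FALSE)

# serialize() writes the srcref attribute, so the raw vectors differ,
# and any hash computed from them differs as well.
same_bytes <- identical(serialize(f_src, NULL), serialize(f_nosrc, NULL))
```

Under these assumptions the two closures compare equal through `identical()` while their serialized forms do not, which is the mismatch that hashes built on `serialize()` inherit.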