Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects
On Tue, 16 Jan 2024 14:16:19 -0500, Dipterix Wang wrote:

> Could you recommend any packages/functions that compute hash such
> that the source references and sexpinfo_struct are ignored? Basically
> a version of `serialize` that convert R objects to raw without
> storing the ancillary source reference and sexpinfo.

I can show how this can be done, but it's not currently on CRAN or even
a well-defined package API. I have adapted a copy of R's serialize() [*]
with the following changes:

 * Function bytecode and flags are ignored:

f <- function() invisible()
depcache:::hash(f, 2) # This is plain FNV1a-64 of serialize() output
# [1] "9b7a1af5468deba4"
.Call(depcache:::C_hash2, f) # This is the new hash
# [1] 91 5f b8 a1 b0 6b cb 40

f() # called once: function gets the MAYBEJIT_MASK flag
depcache:::hash(f, 2)
# [1] "7d30e05546e7a230"
.Call(depcache:::C_hash2, f)
# [1] 91 5f b8 a1 b0 6b cb 40

f() # called twice: function now has bytecode
depcache:::hash(f, 2)
# [1] "2a2cba4150e722b8"
.Call(depcache:::C_hash2, f)
# [1] 91 5f b8 a1 b0 6b cb 40 # new hash stays the same

 * Source references are ignored:

.Call(depcache:::C_hash2, \( ) invisible( ))
# [1] 91 5f b8 a1 b0 6b cb 40 # compare vs. above

# For quoted function definitions, source references have to be handled
# differently
.Call(depcache:::C_hash2, quote(function(){}))
# [1] 58 0d 44 8e d4 fd 37 6f
.Call(depcache:::C_hash2, quote(\( ){ }))
# [1] 58 0d 44 8e d4 fd 37 6f

 * ALTREP is ignored:

identical(1:10, 1:10+0L)
# [1] TRUE
identical(serialize(1:10, NULL), serialize(1:10+0L, NULL))
# [1] FALSE
identical(
 .Call(depcache:::C_hash2, 1:10),
 .Call(depcache:::C_hash2, 1:10+0L)
)
# [1] TRUE

 * Strings not marked as bytes are encoded into UTF-8:

identical('\uff', iconv('\uff', 'UTF-8', 'latin1'))
# [1] TRUE
identical(
 serialize('\uff', NULL),
 serialize(iconv('\uff', 'UTF-8', 'latin1'), NULL)
)
# [1] FALSE
identical(
 .Call(depcache:::C_hash2, '\uff'),
 .Call(depcache:::C_hash2, iconv('\uff', 'UTF-8', 'latin1'))
)
# [1] TRUE

 * NaNs with different payloads (except NA_numeric_) are replaced by
   R_NaN.

One of the many downsides to the current approach is that we rely on
the non-API entry point getPRIMNAME() in order to hash builtins.
Looking at the source code for identical() is no help here, because it
uses the private PRIMOFFSET macro.

The bitstream being hashed is also, unfortunately, not exactly
compatible with R serialization format version 2: I had to ignore the
LEVELS of the language objects being hashed both because identical()
seems to ignore those and because I was missing multiple private
definitions (e.g. the MAYBEJIT flag) to handle them properly.

Then there's also the problem of immediate bindings [**]: I've seen bits
of vctrs, rstudio, rlang blow up when calling CAR() on SEXP objects that
are not safe to handle this way, but R_expand_binding_value() (used by
serialize()) is again a private function that is not accessible from
packages. identical() won't help here, because it compares reference
objects (which may or may not contain such immediate bindings) by their
pointer values instead of digging down into them.

Dropping the (already violated) requirement to be compatible with R
serialization bitstream will make it possible to simplify the code
further.

Finally:

a <- new.env()
b <- new.env()
a$x <- b$x <- 42
identical(a, b)
# [1] FALSE
.Call(depcache:::C_hash2, a)
# [1] 44 21 f1 36 5d 92 03 1b
.Call(depcache:::C_hash2, b)
# [1] 44 21 f1 36 5d 92 03 1b

...but that's unavoidable when looking at frozen object contents instead
of their live memory layout.

If you're interested, here's the development version of the package:
install.packages('depcache',contriburl='https://aitap.github.io/Rpackages')

-- 
Best regards,
Ivan

[*] https://github.com/aitap/depcache/blob/serialize_canonical/src/serialize.c

[**] https://svn.r-project.org/R/trunk/doc/notes/immbnd.md

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
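For readers who want to try the source-reference part of this without the development package, here is a minimal base-R sketch. It is not the depcache code: it only strips srcrefs with utils::removeSource() before calling serialize(), and it does nothing about bytecode, flags, ALTREP, encodings or NaN payloads. The functions are created with parse(keep.source = TRUE) so that they carry source references whatever the session options are; the names f1, f2 and strip_and_serialize are made up for the illustration.

```r
# Two closures with the same code but different source text.
# keep.source = TRUE makes sure each one carries a "srcref" attribute.
f1 <- eval(parse(text = "function(x) x + 1", keep.source = TRUE), globalenv())
f2 <- eval(parse(text = "function(x)    x+1", keep.source = TRUE), globalenv())

# The raw serialize() output includes the source references,
# so the two byte streams are not expected to match.
identical(serialize(f1, NULL), serialize(f2, NULL))

# Hypothetical helper: drop source references, then serialize.
strip_and_serialize <- function(fn) serialize(utils::removeSource(fn), NULL)

# With the srcrefs gone, the two byte streams can be compared again.
identical(strip_and_serialize(f1), strip_and_serialize(f2))
```

This only holds as long as neither function has been called: as shown above, a call changes the flags and later the bytecode, which plain serialize() still records.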
Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects
> Date: Wed, 17 Jan 2024 11:35:02 -0500
> From: Dipterix Wang
> To: Lionel Henry, Tomas Kalibera
> Cc: r-devel@r-project.org
> Subject: Re: [Rd] Choices to remove `srcref` (and its buddies) when
>         serializing objects
> Message-ID: <3cf4ca2d-9f72-4c7b-90aa-4d2e9f745...@gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> > On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera wrote:
> >
> > I think one could implement hashing on the fly without any
> > serialization, similarly to how identical works, but I am not aware of
> > any existing implementation. Again, if that wasn't clear: I don't think
> > trying to compute a hash of an object from its serialized representation
> > is a good idea - it is of course convenient, but has problems like the
> > one you have run into.
> >
> > In some applications it may still be good enough: if by various tweaks,
> > such as ensuring source references are off in your case, you achieve a
> > state when false alarms are rare (identical objects have different
> > hashes), and hence say unnecessary re-computation is rare, maybe it is
> > good enough.
>
> I really appreciate you answering my questions and solving my puzzles. I
> went back and read the R internal code for `serialize` and totally agree
> on this, that serialization is not a good idea for digesting R objects,
> especially on environments, expressions, and functions.
>
> What I want is a function that can produce the same and stable hash for
> identical objects. However, there is no function (to the best of our
> knowledge) on the market that can do this. `digest::digest` and
> `rlang::hash` are the first functions that come to my mind. Both are
> widely used, but they use serialize. The author of `digest` said:
>
> > "As you know, digest takes and (ahem) "digests" what serialize gives it,
> > so you would have to look into what serialize lets you do."
>
> vctrs:::obj_hash is probably the closest to the implementation of
> `identical`, but the above examples give different results for identical
> objects.
>
> The existence of digest::digest and rlang::hash shows that there is a huge
> demand for this "ideal" hash function. However, I bet most people are using
> digest/hash "incorrectly".

Please read the full discussion of this old bug report:
https://bugs.r-project.org/show_bug.cgi?id=18178

Quoting briefly:

    Serialization is not intended to be used this way. What
    serialization tries to provide is that x and
    unserialize(serialize(x, NULL)) will be identical() while
    preserving internal representation where possible. Two objects
    that are considered identical() can have very different internal
    representations, and their serializations will reflect this.

You will see that it is not as simple as just removing the srcref or the
bytecode from functions.

The issue with the `identical()` function in that context was eventually
patched, but the comment by R-Core that serialization is not intended to
be used to produce a reliable hash stands.

Neither `identical()` nor `serialize()` is designed to guarantee a
byte-identical representation that can be hashed. This is echoed by
Tomas' comment above. But we note that it is 'good enough' in most cases.

FWIW `nanonext::sha256()` and family hash character strings and raw
objects directly, but use the same approach as `digest::digest()`
elsewhere. So if someone comes up with a canonical binary representation
of R objects, it will be able to hash it reliably.
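The contract quoted from the bug report can be checked directly. A minimal sketch, using a made-up object `x` that contains no environments or functions: the round trip gives back an identical() object, which is a statement about values and not about bytes.

```r
# A made-up object without reference semantics (no environments, no closures).
x <- list(a = 1:10, s = "\u00ff", n = c(1.5, NA, NaN))

# What serialize() promises: the round trip is identical() to the input.
y <- unserialize(serialize(x, NULL))
identical(x, y)

# What it does not promise: identical() objects need not share their bytes.
# In current R versions 1:10 is a compact sequence, 1:10 + 0L a plain vector.
identical(1:10, 1:10 + 0L)
# Reported as FALSE earlier in this thread:
identical(serialize(1:10, NULL), serialize(1:10 + 0L, NULL))
```

A hash computed over the serialize() output inherits the second property, which is why two identical() objects can end up with different digests.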
Re: [Rd] [External] Re: Choices to remove `srcref` (and its buddies) when serializing objects
On Thu, 18 Jan 2024, Ivan Krylov via R-devel wrote:

> On Tue, 16 Jan 2024 14:16:19 -0500, Dipterix Wang wrote:
>
>> Could you recommend any packages/functions that compute hash such
>> that the source references and sexpinfo_struct are ignored? Basically
>> a version of `serialize` that convert R objects to raw without
>> storing the ancillary source reference and sexpinfo.
>
> [...]
>
> Then there's also the problem of immediate bindings [**]: I've seen bits
> of vctrs, rstudio, rlang blow up when calling CAR() on SEXP objects that
> are not safe to handle this way, but R_expand_binding_value() (used by
> serialize()) is again a private function that is not accessible from
> packages. identical() won't help here, because it compares reference
> objects (which may or may not contain such immediate bindings) by their
> pointer values instead of digging down into them.

What does 'blow up' mean? If it is anything other than signal a "bad
binding access" error then it would be good to have more details.

Best,

luke

> [...]

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
Re: [Rd] [External] Re: Choices to remove `srcref` (and its buddies) when serializing objects
On Thu, 18 Jan 2024 09:59:31 -0600 (CST), luke-tier...@uiowa.edu wrote:

> What does 'blow up' mean? If it is anything other than signal a "bad
> binding access" error then it would be good to have more details.

My apologies for not being precise enough. I meant the "bad binding
access" error in all such cases.

-- 
Best regards,
Ivan
[Rd] Should subsetting named vector return named vector including named unmatched elements?
Subsetting a vector (including lists) returns the same number of elements
as the subsetting vector, including unmatched elements, which are
reported as `NA` or `NULL` (in the case of lists).

Consider:

```
menu = list(
  "bacon" = "foo",
  "eggs" = "bar",
  "beans" = "baz"
  )

select = c("bacon", "eggs", "spam")

menu[select]
# $bacon
# [1] "foo"
#
# $eggs
# [1] "bar"
#
# $<NA>
# NULL
```

Wouldn't it be more logical to return a named vector/list including the
names of unmatched elements when subsetting using names? After all, the
unmatched elements are already returned. I.e., the output would look like
this:

```
menu[select]
# $bacon
# [1] "foo"
#
# $eggs
# [1] "bar"
#
# $spam
# NULL
```

The simple fix `menu[select] |> setNames(select)` solves this, but it
feels to me like something that could be a default behaviour.

On a slightly unrelated note, when I was asking whether there is a better
solution, `menu[select]` seemed to allocate more memory than
`menu_env = list2env(menu); mget(select, envir = menu_env, ifnotfound =
list(NULL))`, or the sapply solution. Is this a benchmarking artifact?

https://stackoverflow.com/q/77828678/4868692
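The workaround mentioned above can be wrapped in a small helper. This is only a sketch of the `setNames()` fix, not a proposal for how `[` should be changed; the function name subset_by_name is made up, and it assumes every requested name should appear in the result exactly as asked for.

```r
menu <- list(bacon = "foo", eggs = "bar", beans = "baz")
select <- c("bacon", "eggs", "spam")

# Hypothetical helper: subset by name, then label every element of the
# result with the name that was asked for, matched or not.
subset_by_name <- function(x, nms) stats::setNames(x[nms], nms)

res <- subset_by_name(menu, select)
names(res)
```

With this helper the unmatched element is still `NULL`, but it is stored under "spam" rather than under an `NA` name, so `res[["spam"]]` can be looked up like the matched elements.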
Re: [Rd] Should subsetting named vector return named vector including named unmatched elements?
Jiří,

For your first question, the NA names make sense if you think of
indexing with a character vector as the same as
menu[match(select, names(menu))]. You're not indexing with "spam";
rather, "spam" becomes NA because it's not in the names of menu. (This
is how it's documented in ?`[`: "Character vectors will be matched to
the names of the object...")

Steve

On Thursday, January 18th, 2024 at 2:51 PM, Jiří Moravec wrote:

> [...]
>
> Wouldn't it be more logical to return named vector/list including names of
> unmatched elements when subsetting using names? After all, the unmatched
> elements are already returned.
>
> [...]
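The equivalence described in this reply can be checked on the example from the original question. A minimal sketch; the variable names pos, by_name and by_pos are only for illustration.

```r
menu <- list(bacon = "foo", eggs = "bar", beans = "baz")
select <- c("bacon", "eggs", "spam")

# Positions of the requested names; "spam" has no match and becomes NA.
pos <- match(select, names(menu))
pos

# Indexing by those positions gives the same result as indexing by name,
# including the NA name on the unmatched element.
by_name <- menu[select]
by_pos <- menu[pos]
identical(by_name, by_pos)
is.na(names(by_name))
```

Seen this way, the name of the unmatched element is taken from position NA of names(menu), which is why it comes out as NA and not as the string that was requested.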
Re: [Rd] Should subsetting named vector return named vector including named unmatched elements?
Never been a big fan of this behavior either, but maybe the intention was
to make it easier to distinguish between 2 types of NAs in the result:
those that were present in the original vector vs those that are
introduced by an unmatched subscript. Like in this example:

    x <- setNames(c(101:108, NA), letters[1:9])
    x
    #   a   b   c   d   e   f   g   h   i
    # 101 102 103 104 105 106 107 108  NA

    x[c("g", "k", "a", "i")]
    #    g <NA>    a    i
    #  107   NA  101   NA

The first NA is the result of an unmatched subscript, while the second
one comes from 'x'.

This is of limited interest though. In most real world applications I've
worked on, we actually need to "fix" the names of the result.

Best,

H.

On 1/18/24 11:51, Jiří Moravec wrote:

> [...]
>
> Wouldn't it be more logical to return named vector/list including
> names of unmatched elements when subsetting using names? After all,
> the unmatched elements are already returned.
>
> [...]

-- 
Hervé Pagès

Bioconductor Core Team
hpages.on.git...@gmail.com
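The distinction between the two kinds of NA, and the later "fix" of the names, can be sketched as follows. It assumes that 'x' itself has no NA names (otherwise the test on the names is ambiguous); idx, res and unmatched are names chosen for the illustration.

```r
x <- setNames(c(101:108, NA), letters[1:9])
idx <- c("g", "k", "a", "i")
res <- x[idx]

# An NA *name* marks an unmatched subscript; an NA *value* alone does not.
unmatched <- is.na(names(res))
unmatched

# "Fix" the names afterwards, keeping the record of what was missing.
names(res) <- idx
res[!unmatched]
```

Recording `unmatched` before overwriting the names keeps both pieces of information: the result is fully named, and the NA that came from 'x' (element "i") can still be told apart from the one introduced by the unmatched "k".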