Re: [Rd] Possible `substr` bug in UTF-8 Corner Case
Thanks, fixed in R-devel (by checking validity of UTF-8 strings for substr/substring). Tomas On 03/29/2018 03:53 AM, brodie gaslam via R-devel wrote: I think there is a memory bug in `substr` that is triggered by a UTF-8 corner case: an incomplete UTF-8 byte sequence at the end of a string. With a valgrind level 2 instrumented build of R-devel I get: string <- "abc\xEE" # \xEE indicates the start of a 3 byte UTF-8 sequence Encoding(string) <- "UTF-8" substr(string, 1, 10) ==15375== Invalid read of size 1 ==15375== at 0x45B3F0: substr (character.c:286) ==15375== by 0x45B3F0: do_substr (character.c:342) ==15375== by 0x4CFCB9: bcEval (eval.c:6775) ==15375== by 0x4D95AF: Rf_eval (eval.c:624) ==15375== by 0x4DAD12: R_execClosure (eval.c:1764) ==15375== by 0x4D9561: Rf_eval (eval.c:747) ==15375== by 0x507008: Rf_ReplIteration (main.c:258) ==15375== by 0x5073E7: R_ReplConsole (main.c:308) ==15375== by 0x507494: run_Rmainloop (main.c:1082) ==15375== by 0x41A8E6: main (Rmain.c:29) ==15375== Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd ==15375== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==15375== by 0x51033E: GetNewPage (memory.c:888) ==15375== by 0x511FC0: Rf_allocVector3 (memory.c:2691) ==15375== by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577) ==15375== by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007) ==15375== by 0x4657AC: coerceToVectorList (coerce.c:892) ==15375== by 0x4657AC: Rf_coerceVector (coerce.c:1293) ==15375== by 0x4660EB: ascommon (coerce.c:1369) ==15375== by 0x4667C0: do_asvector (coerce.c:1544) ==15375== by 0x4CFCB9: bcEval (eval.c:6775) ==15375== by 0x4D95AF: Rf_eval (eval.c:624) ==15375== by 0x4DAD12: R_execClosure (eval.c:1764) ==15375== by 0x515EF7: dispatchMethod (objects.c:408) ==15375== by 0x516379: Rf_usemethod (objects.c:458) ==15375== by 0x516694: do_usemethod (objects.c:543) ==15375== [1] "abc" Here is a patch for the native version of `substr` that highlights the problem and a possible fix. Basically `substr` computes the byte width of a UTF-8 character based on the leading byte ("\xEE" here, which implies 3 bytes) and reads/writes that entire byte width irrespective of whether the string actually ends before the theoretical end of the UTF-8 "character". Index: src/main/character.c === --- src/main/character.c(revision 74482) +++ src/main/character.c(working copy) @@ -283,7 +283,7 @@ for (i = 0; i < so && str < end; i++) { int used = utf8clen(*str); if (i < sa - 1) { str += used; continue; } -for (j = 0; j < used; j++) *buf++ = *str++; +for (j = 0; j < used && str < end; j++) *buf++ = *str++; } } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) { for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++; The change above removed the valgrind error for me. I re-built R with the change and ran "make check" which seemed to work fine. I also ran some simple checks on UTF-8 strings and things seem to work okay. I have very limited experience making changes to R (this is my first attempt at a patch) so please take all of the above with extreme skepticism. Apologies in advance if this turns out to be a false alarm caused by an error on my part. Best, Brodie. PS: apologies also if the formatting of this e-mail is bad. I have not figured out how to get plaintext working properly with yahoo. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.pairlist does not convert call objects
On 03/28/2018 08:31 AM, Jialin Ma wrote: Dear all, It seems that as.pairlist does not convert call objects, producing results like the following: is.pairlist(as.pairlist(quote(x + y))) [1] FALSE Should this behavior be expected? The documentation says that the behavior of as.pairlist is undocumented in this case: " ‘as.pairlist’ is implemented as ‘as.vector(x, "pairlist")’, and hence will dispatch methods for the generic function ‘as.vector’. Lists are copied element-by-element into a pairlist and the names of the list used as tags for the pairlist: the return value for other types of argument is undocumented. " as.pairlist implementation does currently nothing for a language object (because it is internally represented using a linked list). is.pairlist implementation is checking whether it's argument is a user-level pairlist, which language object is not, so it returns FALSE in the example. These functions are rather low-level and should not be needed in user programs. Certainly programs should not depend on undocumented behavior. Tomas Thanks, Jialin sessionInfo() R version 3.4.1 (2017-06-30) Platform: x86_64-suse-linux-gnu (64-bit) Running under: openSUSE Tumbleweed Matrix products: default BLAS: /usr/lib64/R/lib/libRblas.so LAPACK: /usr/lib64/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods [7] base other attached packages: [1] magrittr_1.5 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Possible `substr` bug in UTF-8 Corner Case
Thank you for the quick response and for the quick fix (and for the rchk vagrant image I used to build and test the below!). One thing I'll note about the fix is that it may start breaking things that used to "work". I think it is fair to say that character count is not well defined with illegal UTF-8 sequences (and noteworthy that `nchar` does actually stop when it encounters them), but there may be a bit of code out there that relied on being able to successfully complete (albeit while potentially corrupting memory) that will now produce errors. It may be worth highlighting this in the release notes. Best, Brodie. On Thursday, March 29, 2018, 9:11:15 AM EDT, Tomas Kalibera wrote: Thanks, fixed in R-devel (by checking validity of UTF-8 strings for substr/substring). Tomas On 03/29/2018 03:53 AM, brodie gaslam via R-devel wrote: > I think there is a memory bug in `substr` that is triggered by a UTF-8 corner > case: an incomplete UTF-8 byte sequence at the end of a string. With a > valgrind level 2 instrumented build of R-devel I get: > >> string <- "abc\xEE" # \xEE indicates the start of a 3 byte UTF-8 sequence >> Encoding(string) <- "UTF-8" >> substr(string, 1, 10) > ==15375== Invalid read of size 1 > ==15375== at 0x45B3F0: substr (character.c:286) > ==15375== by 0x45B3F0: do_substr (character.c:342) > ==15375== by 0x4CFCB9: bcEval (eval.c:6775) > ==15375== by 0x4D95AF: Rf_eval (eval.c:624) > ==15375== by 0x4DAD12: R_execClosure (eval.c:1764) > ==15375== by 0x4D9561: Rf_eval (eval.c:747) > ==15375== by 0x507008: Rf_ReplIteration (main.c:258) > ==15375== by 0x5073E7: R_ReplConsole (main.c:308) > ==15375== by 0x507494: run_Rmainloop (main.c:1082) > ==15375== by 0x41A8E6: main (Rmain.c:29) > ==15375== Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 > alloc'd > ==15375== at 0x4C2DB8F: malloc (in > /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) > ==15375== by 0x51033E: GetNewPage (memory.c:888) > ==15375== by 0x511FC0: Rf_allocVector3 (memory.c:2691) > ==15375== by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577) > ==15375== by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007) > ==15375== by 0x4657AC: coerceToVectorList (coerce.c:892) > ==15375== by 0x4657AC: Rf_coerceVector (coerce.c:1293) > ==15375== by 0x4660EB: ascommon (coerce.c:1369) > ==15375== by 0x4667C0: do_asvector (coerce.c:1544) > ==15375== by 0x4CFCB9: bcEval (eval.c:6775) > ==15375== by 0x4D95AF: Rf_eval (eval.c:624) > ==15375== by 0x4DAD12: R_execClosure (eval.c:1764) > ==15375== by 0x515EF7: dispatchMethod (objects.c:408) > ==15375== by 0x516379: Rf_usemethod (objects.c:458) > ==15375== by 0x516694: do_usemethod (objects.c:543) > ==15375== > [1] "abc" > > Here is a patch for the native version of `substr` that highlights the > problem and a possible fix. Basically `substr` computes the byte width of a > UTF-8 character based on the leading byte ("\xEE" here, which implies 3 > bytes) and reads/writes that entire byte width irrespective of whether the > string actually ends before the theoretical end of the UTF-8 "character". > > Index: src/main/character.c > === > --- src/main/character.c(revision 74482) > +++ src/main/character.c(working copy) > @@ -283,7 +283,7 @@ > for (i = 0; i < so && str < end; i++) { > int used = utf8clen(*str); > if (i < sa - 1) { str += used; continue; } > -for (j = 0; j < used; j++) *buf++ = *str++; > +for (j = 0; j < used && str < end; j++) *buf++ = *str++; > } > } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) { > for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++; > > The change above removed the valgrind error for me. I re-built R with the > change and ran "make check" which seemed to work fine. I also ran some simple > checks on UTF-8 strings and things seem to work okay. > > I have very limited experience making changes to R (this is my first attempt > at a patch) so please take all of the above with extreme skepticism. > > Apologies in advance if this turns out to be a false alarm caused by an error > on my part. > > Best, > > Brodie. > > PS: apologies also if the formatting of this e-mail is bad. I have not > figured out how to get plaintext working properly with yahoo. > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Objects not gc'ed due to caching (?) in R's S3 dispatch mechanism
Now also committed to the release branch. Best, luke On Tue, 27 Mar 2018, luke-tier...@uiowa.edu wrote: I have committed a change to R-devel that addresses this. To be on the safe side I need to run some more extensive tests before deciding if this can be ported to the release branch for R 3.5.0. Should know in a day or two. Best, luke On Tue, 27 Mar 2018, luke-tier...@uiowa.edu wrote: This has nothing to do with printing or dispatch per se. It is the result of an internal register (R_ReturnedValue) being protected. It gets rewritten whenever there is a jump, e.g. by an explicit return call. So a simplified example is new_foo <- function() { e <- new.env() reg.finalizer(e, function(e) message("Finalizer called")) e } bar <- function(x) return(x) bar(new_foo()) gc() # still in .Last.value gc() # nothing UseMethod essentially does a return call so you see the effect there. The R_ReturnedValue register could probably be safely cleared in more places but it isn't clear exactly where. As things stand it will be cleared on the next use of a non-local transfer of control, and those happen frequently enough that I'm not convinced this is worth addressing, at least not at this point in the release cycle. Best, luke On Mon, 26 Mar 2018, Iñaki Úcar wrote: Hi, I initially opened an issue in the R6 repo because my issue was with an R6 object. But Winston (thanks!) further simplified my example, and it turns out that the issue (whether a feature or a bug is yet to be seen) had to do with S3 dispatching. The following example, by Winston, depicts the issue: print.foo <- function(x, ...) { cat("print.foo called\n") invisible(x) } new_foo <- function() { e <- new.env() reg.finalizer(e, function(e) message("Finalizer called")) class(e) <- "foo" e } new_foo() gc() # still in .Last.value gc() # nothing I would expect that the second call to gc() should free 'e', but it's not. However, if we call now *any* S3 method, then the object can be finally gc'ed: print(1) gc() # Finalizer called So the hypothesis is that there is some kind of caching (?) mechanism going on. Intended behaviour or not, this is something that was introduced between R 3.2.3 and 3.3.2 (the first succeeds; from the second on, the example fails as described above). Regards, Iñaki PS: Further discussion and examples in https://github.com/r-lib/R6/issues/140 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Luke Tierney Ralph E. Wareham Professor of Mathematical Sciences University of Iowa Phone: 319-335-3386 Department of Statistics andFax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: luke-tier...@uiowa.edu Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Objects not gc'ed due to caching (?) in R's S3 dispatch mechanism
Now also committed to the release branch. Best, luke On Tue, 27 Mar 2018, luke-tier...@uiowa.edu wrote: I have committed a change to R-devel that addresses this. To be on the safe side I need to run some more extensive tests before deciding if this can be ported to the release branch for R 3.5.0. Should know in a day or two. Best, luke On Tue, 27 Mar 2018, luke-tier...@uiowa.edu wrote: This has nothing to do with printing or dispatch per se. It is the result of an internal register (R_ReturnedValue) being protected. It gets rewritten whenever there is a jump, e.g. by an explicit return call. So a simplified example is new_foo <- function() { e <- new.env() reg.finalizer(e, function(e) message("Finalizer called")) e } bar <- function(x) return(x) bar(new_foo()) gc() # still in .Last.value gc() # nothing UseMethod essentially does a return call so you see the effect there. The R_ReturnedValue register could probably be safely cleared in more places but it isn't clear exactly where. As things stand it will be cleared on the next use of a non-local transfer of control, and those happen frequently enough that I'm not convinced this is worth addressing, at least not at this point in the release cycle. Best, luke On Mon, 26 Mar 2018, Iñaki Úcar wrote: Hi, I initially opened an issue in the R6 repo because my issue was with an R6 object. But Winston (thanks!) further simplified my example, and it turns out that the issue (whether a feature or a bug is yet to be seen) had to do with S3 dispatching. The following example, by Winston, depicts the issue: print.foo <- function(x, ...) { cat("print.foo called\n") invisible(x) } new_foo <- function() { e <- new.env() reg.finalizer(e, function(e) message("Finalizer called")) class(e) <- "foo" e } new_foo() gc() # still in .Last.value gc() # nothing I would expect that the second call to gc() should free 'e', but it's not. However, if we call now *any* S3 method, then the object can be finally gc'ed: print(1) gc() # Finalizer called So the hypothesis is that there is some kind of caching (?) mechanism going on. Intended behaviour or not, this is something that was introduced between R 3.2.3 and 3.3.2 (the first succeeds; from the second on, the example fails as described above). Regards, Iñaki PS: Further discussion and examples in https://github.com/r-lib/R6/issues/140 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- Luke Tierney Ralph E. Wareham Professor of Mathematical Sciences University of Iowa Phone: 319-335-3386 Department of Statistics andFax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: luke-tier...@uiowa.edu Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Base R examples that write to current working directory
Hi all, Given the recent CRAN push to prevent examples writing to the working directory, is there any interest in fixing base R examples that write to the working directory? A few candidates are the graphics devices, file.create(), writeBin(), writeChar(), write(), and saveRDS(). I'm sure there are many more. One way to catch these naughty examples would be to search for unlink() in examples: e.g., https://github.com/wch/r-source/search?utf8=✓&q=unlink+extension%3ARd&type=. Of course, simply cleaning up after yourself is not sufficient because if those files existed before the examples were run, the examples will destroy them. Hadley -- http://hadley.nz __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Base R examples that write to current working directory
On 29/03/2018 5:23 PM, Hadley Wickham wrote: Hi all, Given the recent CRAN push to prevent examples writing to the working directory, is there any interest in fixing base R examples that write to the working directory? A few candidates are the graphics devices, file.create(), writeBin(), writeChar(), write(), and saveRDS(). I'm sure there are many more. One way to catch these naughty examples would be to search for unlink() in examples: e.g., https://github.com/wch/r-source/search?utf8=✓&q=unlink+extension%3ARd&type=. Of course, simply cleaning up after yourself is not sufficient because if those files existed before the examples were run, the examples will destroy them. Why not put together a patch that fixes these? This doesn't seem to be something that needs discussion, fixing the bad examples would be a good idea. Duncan Murdoch __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Nice names in deparse
I am raising this again. As as.character(list(c(one = "1"))) is still "1" in R 3.5.0 alpha, could as.character(list(c(one = 1))) be "1" , too, as before? The case here is the list component is atomic with length 1. On Sat, 16/12/17, Suharto Anggono Suharto Anggono wrote: Subject: Nice names in deparse To: r-devel@r-project.org Date: Saturday, 16 December, 2017, 11:09 PM Tags (argument names) in call to 'list' becomes names of the result. It is not necessarily so with call to 'c'. The default method of 'c' has 'recursive' and 'use.names' arguments. In R devel r73778, with x <- 0; names(x) <- "recursive" , dput(x) or even dput(x, control = "all") gives c(recursive = 0) However, actual result of c(recursive = 0) is NULL. Also with x <- 0; names(x) <- "recursive" , dput(x, control = c("keepNA", "keepInteger", "showAttributes")) in R devel r73778 gives structure(c(0), .Names = "recursive") The 'control' is suggested by an example for output as in R < 3.5.0. However, the output is slightly different from dput(x) in R 3.3.2: structure(0, .Names = "recursive") Part of NEWS item related with "niceNames" control option: as.character(list( c (one = 1))) now includes the name, as as.character(list(list(one = 1))) has always done. Please reconsider. As as.numeric(list(c(one = 1))) gives 1 , I expect that as.character(list(c(one = "1"))) gives "1" . It does in R devel r73778. Why does as.character(list(c(one = 1))) give "c(one = 1)" ? as.numeric(list(c(one = "1"))) gives 1 . list(list(one = 1)) is not quite the same. as.numeric(list(list(one = 1))) gives NA . __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel