Re: [Rd] Possible `substr` bug in UTF-8 Corner Case

2018-03-29 Thread Tomas Kalibera
Thanks, fixed in R-devel (by checking validity of UTF-8 strings for 
substr/substring).

Tomas

On 03/29/2018 03:53 AM, brodie gaslam via R-devel wrote:

I think there is a memory bug in `substr` that is triggered by a UTF-8 corner 
case: an incomplete UTF-8 byte sequence at the end of a string.  With a 
valgrind level 2 instrumented build of R-devel I get:


string <- "abc\xEE"    # \xEE indicates the start of a 3 byte UTF-8 sequence
Encoding(string) <- "UTF-8"
substr(string, 1, 10)

==15375== Invalid read of size 1
==15375==    at 0x45B3F0: substr (character.c:286)
==15375==    by 0x45B3F0: do_substr (character.c:342)
==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
==15375==    by 0x4D9561: Rf_eval (eval.c:747)
==15375==    by 0x507008: Rf_ReplIteration (main.c:258)
==15375==    by 0x5073E7: R_ReplConsole (main.c:308)
==15375==    by 0x507494: run_Rmainloop (main.c:1082)
==15375==    by 0x41A8E6: main (Rmain.c:29)
==15375==  Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd
==15375==    at 0x4C2DB8F: malloc (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15375==    by 0x51033E: GetNewPage (memory.c:888)
==15375==    by 0x511FC0: Rf_allocVector3 (memory.c:2691)
==15375==    by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577)
==15375==    by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007)
==15375==    by 0x4657AC: coerceToVectorList (coerce.c:892)
==15375==    by 0x4657AC: Rf_coerceVector (coerce.c:1293)
==15375==    by 0x4660EB: ascommon (coerce.c:1369)
==15375==    by 0x4667C0: do_asvector (coerce.c:1544)
==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
==15375==    by 0x515EF7: dispatchMethod (objects.c:408)
==15375==    by 0x516379: Rf_usemethod (objects.c:458)
==15375==    by 0x516694: do_usemethod (objects.c:543)
==15375==
[1] "abc"

Here is a patch for the native version of `substr` that highlights the problem and a possible fix.  
Basically `substr` computes the byte width of a UTF-8 character based on the leading byte 
("\xEE" here, which implies 3 bytes) and reads/writes that entire byte width irrespective 
of whether the string actually ends before the theoretical end of the UTF-8 "character".

Index: src/main/character.c
===
--- src/main/character.c(revision 74482)
+++ src/main/character.c(working copy)
@@ -283,7 +283,7 @@
 for (i = 0; i < so && str < end; i++) {
 int used = utf8clen(*str);
 if (i < sa - 1) { str += used; continue; }
-for (j = 0; j < used; j++) *buf++ = *str++;
+for (j = 0; j < used && str < end; j++) *buf++ = *str++;
 }
  } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
 for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;

The change above removed the valgrind error for me.  I re-built R with the change and ran 
"make check" which seemed to work fine. I also ran some simple checks on UTF-8 
strings and things seem to work okay.

I have very limited experience making changes to R (this is my first attempt at 
a patch) so please take all of the above with extreme skepticism.

Apologies in advance if this turns out to be a false alarm caused by an error 
on my part.

Best,

Brodie.

PS: apologies also if the formatting of this e-mail is bad.  I have not figured 
out how to get plaintext working properly with yahoo.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] as.pairlist does not convert call objects

2018-03-29 Thread Tomas Kalibera


On 03/28/2018 08:31 AM, Jialin Ma wrote:

Dear all,

It seems that as.pairlist does not convert call objects, producing
results like the following:


is.pairlist(as.pairlist(quote(x + y)))

[1] FALSE

Should this behavior be expected?
The documentation says that the behavior of as.pairlist is undocumented 
in this case:


"
 ‘as.pairlist’ is implemented as ‘as.vector(x, "pairlist")’, and
 hence will dispatch methods for the generic function ‘as.vector’.
 Lists are copied element-by-element into a pairlist and the names
 of the list used as tags for the pairlist: the return value for
 other types of argument is undocumented.
"

as.pairlist implementation does currently nothing for a language object 
(because it is internally represented using a linked list). is.pairlist 
implementation is checking whether it's argument is a user-level 
pairlist, which language object is not, so it returns FALSE in the example.


These functions are rather low-level and should not be needed in user 
programs. Certainly programs should not depend on undocumented behavior.


Tomas

Thanks,
Jialin



sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
  [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods
[7] base

other attached packages:
[1] magrittr_1.5

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Possible `substr` bug in UTF-8 Corner Case

2018-03-29 Thread brodie gaslam via R-devel
Thank you for the quick response and for the quick fix (and for the rchk 
vagrant image I used to build and test the below!).
One thing I'll note about the fix is that it may start breaking things that 
used to "work".  I think it is fair to say that character count is not well 
defined with illegal UTF-8 sequences (and noteworthy that `nchar` does actually 
stop when it encounters them), but there may be a bit of code out there that 
relied on being able to successfully complete (albeit while potentially 
corrupting memory) that will now produce errors.  It may be worth highlighting 
this in the release notes.

Best,
Brodie.


On Thursday, March 29, 2018, 9:11:15 AM EDT, Tomas Kalibera 
 wrote:  
 
 Thanks, fixed in R-devel (by checking validity of UTF-8 strings for 
substr/substring).
Tomas

On 03/29/2018 03:53 AM, brodie gaslam via R-devel wrote:
> I think there is a memory bug in `substr` that is triggered by a UTF-8 corner 
> case: an incomplete UTF-8 byte sequence at the end of a string.  With a 
> valgrind level 2 instrumented build of R-devel I get:
>
>> string <- "abc\xEE"    # \xEE indicates the start of a 3 byte UTF-8 sequence
>> Encoding(string) <- "UTF-8"
>> substr(string, 1, 10)
> ==15375== Invalid read of size 1
> ==15375==    at 0x45B3F0: substr (character.c:286)
> ==15375==    by 0x45B3F0: do_substr (character.c:342)
> ==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
> ==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
> ==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
> ==15375==    by 0x4D9561: Rf_eval (eval.c:747)
> ==15375==    by 0x507008: Rf_ReplIteration (main.c:258)
> ==15375==    by 0x5073E7: R_ReplConsole (main.c:308)
> ==15375==    by 0x507494: run_Rmainloop (main.c:1082)
> ==15375==    by 0x41A8E6: main (Rmain.c:29)
> ==15375==  Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 
> alloc'd
> ==15375==    at 0x4C2DB8F: malloc (in 
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==15375==    by 0x51033E: GetNewPage (memory.c:888)
> ==15375==    by 0x511FC0: Rf_allocVector3 (memory.c:2691)
> ==15375==    by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577)
> ==15375==    by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007)
> ==15375==    by 0x4657AC: coerceToVectorList (coerce.c:892)
> ==15375==    by 0x4657AC: Rf_coerceVector (coerce.c:1293)
> ==15375==    by 0x4660EB: ascommon (coerce.c:1369)
> ==15375==    by 0x4667C0: do_asvector (coerce.c:1544)
> ==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
> ==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
> ==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
> ==15375==    by 0x515EF7: dispatchMethod (objects.c:408)
> ==15375==    by 0x516379: Rf_usemethod (objects.c:458)
> ==15375==    by 0x516694: do_usemethod (objects.c:543)
> ==15375==
> [1] "abc"
>
> Here is a patch for the native version of `substr` that highlights the 
> problem and a possible fix.  Basically `substr` computes the byte width of a 
> UTF-8 character based on the leading byte ("\xEE" here, which implies 3 
> bytes) and reads/writes that entire byte width irrespective of whether the 
> string actually ends before the theoretical end of the UTF-8 "character".
>
> Index: src/main/character.c
> ===
> --- src/main/character.c(revision 74482)
> +++ src/main/character.c(working copy)
> @@ -283,7 +283,7 @@
>  for (i = 0; i < so && str < end; i++) {
>  int used = utf8clen(*str);
>  if (i < sa - 1) { str += used; continue; }
> -for (j = 0; j < used; j++) *buf++ = *str++;
> +for (j = 0; j < used && str < end; j++) *buf++ = *str++;
>  }
>   } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
>  for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;
>
> The change above removed the valgrind error for me.  I re-built R with the 
> change and ran "make check" which seemed to work fine. I also ran some simple 
> checks on UTF-8 strings and things seem to work okay.
>
> I have very limited experience making changes to R (this is my first attempt 
> at a patch) so please take all of the above with extreme skepticism.
>
> Apologies in advance if this turns out to be a false alarm caused by an error 
> on my part.
>
> Best,
>
> Brodie.
>
> PS: apologies also if the formatting of this e-mail is bad.  I have not 
> figured out how to get plaintext working properly with yahoo.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


  
[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Objects not gc'ed due to caching (?) in R's S3 dispatch mechanism

2018-03-29 Thread luke-tierney

Now also committed to the release branch.

Best,

luke

On Tue, 27 Mar 2018, luke-tier...@uiowa.edu wrote:


I have committed a change to R-devel that addresses this. To be on the
safe side I need to run some more extensive tests before deciding if
this can be ported to the release branch for R 3.5.0. Should know in a
day or two.

Best,

luke

On Tue, 27 Mar 2018, luke-tier...@uiowa.edu wrote:


This has nothing to do with printing or dispatch per se. It is the
result of an internal register (R_ReturnedValue) being protected. It
gets rewritten whenever there is a jump, e.g. by an explicit return
call. So a simplified example is

new_foo <- function() {
 e <- new.env()
   reg.finalizer(e, function(e) message("Finalizer called"))
 e
 }

bar <- function(x) return(x)

bar(new_foo())
gc() # still in .Last.value
gc() # nothing

UseMethod essentially does a return call so you see the effect there.

The R_ReturnedValue register could probably be safely cleared in more
places but it isn't clear exactly where. As things stand it will be
cleared on the next use of a non-local transfer of control, and those
happen frequently enough that I'm not convinced this is worth
addressing, at least not at this point in the release cycle.

Best,

luke

On Mon, 26 Mar 2018, Iñaki Úcar wrote:


Hi,

I initially opened an issue in the R6 repo because my issue was with
an R6 object. But Winston (thanks!) further simplified my example, and
it turns out that the issue (whether a feature or a bug is yet to be
seen) had to do with S3 dispatching.

The following example, by Winston, depicts the issue:

print.foo <- function(x, ...) {
 cat("print.foo called\n")
 invisible(x)
}

new_foo <- function() {
 e <- new.env()
 reg.finalizer(e, function(e) message("Finalizer called"))
 class(e) <- "foo"
 e
}

new_foo()
gc() # still in .Last.value
gc() # nothing

I would expect that the second call to gc() should free 'e', but it's
not. However, if we call now *any* S3 method, then the object can be
finally gc'ed:

print(1)
gc() # Finalizer called

So the hypothesis is that there is some kind of caching (?) mechanism
going on. Intended behaviour or not, this is something that was
introduced between R 3.2.3 and 3.3.2 (the first succeeds; from the
second on, the example fails as described above).

Regards,
Iñaki

PS: Further discussion and examples in 
https://github.com/r-lib/R6/issues/140


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel









--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Department of Statistics andFax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Objects not gc'ed due to caching (?) in R's S3 dispatch mechanism

2018-03-29 Thread luke-tierney

Now also committed to the release branch.

Best,

luke

On Tue, 27 Mar 2018, luke-tier...@uiowa.edu wrote:


I have committed a change to R-devel that addresses this. To be on the
safe side I need to run some more extensive tests before deciding if
this can be ported to the release branch for R 3.5.0. Should know in a
day or two.

Best,

luke

On Tue, 27 Mar 2018, luke-tier...@uiowa.edu wrote:


This has nothing to do with printing or dispatch per se. It is the
result of an internal register (R_ReturnedValue) being protected. It
gets rewritten whenever there is a jump, e.g. by an explicit return
call. So a simplified example is

new_foo <- function() {
 e <- new.env()
   reg.finalizer(e, function(e) message("Finalizer called"))
 e
 }

bar <- function(x) return(x)

bar(new_foo())
gc() # still in .Last.value
gc() # nothing

UseMethod essentially does a return call so you see the effect there.

The R_ReturnedValue register could probably be safely cleared in more
places but it isn't clear exactly where. As things stand it will be
cleared on the next use of a non-local transfer of control, and those
happen frequently enough that I'm not convinced this is worth
addressing, at least not at this point in the release cycle.

Best,

luke

On Mon, 26 Mar 2018, Iñaki Úcar wrote:


Hi,

I initially opened an issue in the R6 repo because my issue was with
an R6 object. But Winston (thanks!) further simplified my example, and
it turns out that the issue (whether a feature or a bug is yet to be
seen) had to do with S3 dispatching.

The following example, by Winston, depicts the issue:

print.foo <- function(x, ...) {
 cat("print.foo called\n")
 invisible(x)
}

new_foo <- function() {
 e <- new.env()
 reg.finalizer(e, function(e) message("Finalizer called"))
 class(e) <- "foo"
 e
}

new_foo()
gc() # still in .Last.value
gc() # nothing

I would expect that the second call to gc() should free 'e', but it's
not. However, if we call now *any* S3 method, then the object can be
finally gc'ed:

print(1)
gc() # Finalizer called

So the hypothesis is that there is some kind of caching (?) mechanism
going on. Intended behaviour or not, this is something that was
introduced between R 3.2.3 and 3.3.2 (the first succeeds; from the
second on, the example fails as described above).

Regards,
Iñaki

PS: Further discussion and examples in 
https://github.com/r-lib/R6/issues/140


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel









--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Department of Statistics andFax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Base R examples that write to current working directory

2018-03-29 Thread Hadley Wickham
Hi all,

Given the recent CRAN push to prevent examples writing to the working
directory, is there any interest in fixing base R examples that write
to the working directory? A few candidates are the graphics devices,
file.create(), writeBin(), writeChar(), write(), and saveRDS(). I'm
sure there are many more.

One way to catch these naughty examples would be to search for
unlink() in examples: e.g.,
https://github.com/wch/r-source/search?utf8=✓&q=unlink+extension%3ARd&type=.
Of course, simply cleaning up after yourself is not sufficient because
if those files existed before the examples were run, the examples will
destroy them.

Hadley

-- 
http://hadley.nz

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Base R examples that write to current working directory

2018-03-29 Thread Duncan Murdoch

On 29/03/2018 5:23 PM, Hadley Wickham wrote:

Hi all,

Given the recent CRAN push to prevent examples writing to the working
directory, is there any interest in fixing base R examples that write
to the working directory? A few candidates are the graphics devices,
file.create(), writeBin(), writeChar(), write(), and saveRDS(). I'm
sure there are many more.

One way to catch these naughty examples would be to search for
unlink() in examples: e.g.,
https://github.com/wch/r-source/search?utf8=✓&q=unlink+extension%3ARd&type=.
Of course, simply cleaning up after yourself is not sufficient because
if those files existed before the examples were run, the examples will
destroy them.



Why not put together a patch that fixes these?  This doesn't seem to be 
something that needs discussion, fixing the bad examples would be a good 
idea.


Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Nice names in deparse

2018-03-29 Thread Suharto Anggono Suharto Anggono via R-devel
I am raising this again.

As
as.character(list(c(one = "1")))
is still
"1"
in R 3.5.0 alpha, could
as.character(list(c(one = 1)))
be
"1"
, too, as before?

The case here is the list component is atomic with length 1.


On Sat, 16/12/17, Suharto Anggono Suharto Anggono  
wrote:

 Subject: Nice names in deparse
 To: r-devel@r-project.org
 Date: Saturday, 16 December, 2017, 11:09 PM

Tags (argument names) in call to 'list' becomes names of the result. It is not 
necessarily so with call to 'c'. The default method of 'c' has 'recursive' and 
'use.names' arguments.

In R devel r73778, with
x <- 0; names(x) <- "recursive"  ,
dput(x)
or even
dput(x, control = "all")
gives
c(recursive = 0)
However, actual result of c(recursive = 0) is NULL.

Also with
x <- 0; names(x) <- "recursive"  ,
dput(x, control = c("keepNA", "keepInteger", "showAttributes"))
in R devel r73778
gives
structure(c(0), .Names = "recursive")
The 'control' is suggested by an example for output as in R < 3.5.0. However, 
the output is slightly different from
dput(x)
in R 3.3.2:
structure(0, .Names = "recursive")


Part of NEWS item related with "niceNames" control option:
as.character(list( c (one = 1))) now includes the name, as 
as.character(list(list(one = 1))) has always done.

Please reconsider.
As
as.numeric(list(c(one = 1)))
gives
1 ,
I expect that
as.character(list(c(one = "1")))
gives
"1" .
It does in R devel r73778.
Why does
as.character(list(c(one = 1)))
give
"c(one = 1)" ?

as.numeric(list(c(one = "1")))
gives
1 .

list(list(one = 1))
is not quite the same.
as.numeric(list(list(one = 1)))
gives
NA .

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel