date:20150720

Re: [Rd] Errors on Windows with grep(fixed=TRUE) on UTF-8 strings

2015-07-20 Thread suimong

Thank you Winston for the solution! The only workaround I come up with is to
set options(encoding = "UTF-8"), which is generally undesirable.

I'm wondering is there any chance this patch will be included in future R
version? I have been running into this problem from time to time and the
latest R 3.2.1 still hasn't handled this issue properly.


Winston Chang wrote
> After a bit more investigation, I think I've found the cause of the bug,
> and I have a patch.
> 
> This bug happens with grep(), when:
> * Running on Windows.
> * The search uses fixed=TRUE.
> * The search pattern is a single byte.
> * The current locale has a multibyte encoding.
> 
> ===
> Here's an example that demonstrates the bug:
> 
> # First, create a 3-byte UTF-8 character
> y <- rawToChar(as.raw(c(0xe6, 0xb8, 0x97)))
> Encoding(y) <- "UTF-8"
> y
> # [1] "渗"
> 
> # In my default locale, grep with a single-char pattern and fixed=TRUE
> # returns integer(0), as expected.
> Sys.getlocale("LC_CTYPE")
> # [1] "English_United States.1252"
> grep("a", y, fixed = TRUE)
> # integer(0)
> 
> # When the using a multibyte locale, grep with a single-char
> # pattern and fixed=TRUE results in an error.
> Sys.setlocale("LC_CTYPE", "chinese")
> grep("a", y, fixed = TRUE)
> # Error in grep("a", y, fixed = TRUE) : invalid multibyte string at '<97>'
> 
> 
> ===
> 
> I believe the problem is in the main/grep.c file, in the fgrep_one
> function. It tests for a multi-byte character string locale
> `mbcslocale`, and then for the `use_UTF8`, like so:
> 
> if (!useBytes && mbcslocale) {
> ...
> } else if (!useBytes && use_UTF8) {
> ...
> } else ...
> 
> This can be seen at
> https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L668-L692
> 
> A similar pattern occurs in the fgrep_one_bytes function, at
> https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L718-L736
> 
> 
> I believe that the test order should be reversed; it should test first
> for `use_UTF8`, and then for `mbcslocale`. This pattern occurs in a few
> places in grep.c. It looks like this:
> 
> if (!useBytes && use_UTF8) {
> ...
> } else if (!useBytes && mbcslocale) {
> ...
> } else ...
> 
> 
> ===
> This patch does what I described; it simply tests for `use_UTF8` first,
> and then `mbcslocale`, in both fgrep_one and fgrep_one_bytes. I made
> this patch against the 3.1.2 sources, and tested the example code above.
> In both cases, grep() returned integer(0), as expected.
> 
> (The reason I made this change against 3.1.2 is because I had problems
> getting the current trunk to compile on both Linux or Windows.)
> 
> 
> diff --git src/main/grep.c src/main/grep.c
> index 6e6ec3e..348c63d 100644
> --- src/main/grep.c
> +++ src/main/grep.c
> @@ -664,27 +664,27 @@ static int fgrep_one(const char *pat, const char
> *target,
>   }
>   return -1;
>  }
> -if (!useBytes && mbcslocale) { /* skip along by chars */
> - mbstate_t mb_st;
> +if (!useBytes && use_UTF8) {
>   int ib, used;
> - mbs_init(&mb_st);
>   for (ib = 0, i = 0; ib <= len-plen; i++) {
>   if (strncmp(pat, target+ib, plen) == 0) {
>   if (next != NULL) *next = ib + plen;
>   return i;
>   }
> - used = (int) Mbrtowc(NULL,  target+ib, MB_CUR_MAX, &mb_st);
> + used = utf8clen(target[ib]);
>   if (used <= 0) break;
>   ib += used;
>   }
> -} else if (!useBytes && use_UTF8) {
> +} else if (!useBytes && mbcslocale) { /* skip along by chars */
> + mbstate_t mb_st;
>   int ib, used;
> + mbs_init(&mb_st);
>   for (ib = 0, i = 0; ib <= len-plen; i++) {
>   if (strncmp(pat, target+ib, plen) == 0) {
>   if (next != NULL) *next = ib + plen;
>   return i;
>   }
> - used = utf8clen(target[ib]);
> + used = (int) Mbrtowc(NULL,  target+ib, MB_CUR_MAX, &mb_st);
>   if (used <= 0) break;
>   ib += used;
>   }
> @@ -714,21 +714,21 @@ static int fgrep_one_bytes(const char *pat, const
> char *target, int len,
>   if (*p == pat[0]) return i;
>   return -1;
>  }
> -if (!useBytes && mbcslocale) { /* skip along by chars */
> - mbstate_t mb_st;
> +if (!useBytes && use_UTF8) { /* not really needed */
>   int ib, used;
> - mbs_init(&mb_st);
>   for (ib = 0, i = 0; ib <= len-plen; i++) {
>   if (strncmp(pat, target+ib, plen) == 0) return ib;
> - used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st);
> + used = utf8clen(target[ib]);
>   if (used <= 0) break;
>   ib += used;
>   }
> -} else if (!useBytes && use_UTF8) { /* not really needed */
> +} else if (!useBytes && mbcslocale) { /* skip along by chars */
> + mbstate_t mb_st;
>   int ib, used;
> + mbs_init(&mb

Re: [Rd] Defining a `show` function breaks the print-ing of S4 object -- bug or expected?

2015-07-20 Thread luke-tierney


This fixed in R-devel in r68702; in R-3-2-branch in r68705.

Best,

luke

On Tue, 30 Jun 2015, Duncan Murdoch wrote:


On 30/06/2015 7:04 PM, Paul Gilbert wrote:



On 06/30/2015 11:33 AM, Duncan Murdoch wrote:

On 30/06/2015 5:27 PM, Lorenz, David wrote:

There is something I'm really missing here. The function show is a
standardGeneric function, so the correct way to write it as method like
this:


That describes methods::show.  The problem is that the default print
mechanism isn't calling methods::show() (or base::print() as Luke says),
it's calling show() or print() in the global environment, so the user's
function overrides the generic, and you get the error.


These are two different problems aren't they? I can see that you might
want to ensure that base::print() calls methods::show(), but forcing the
default print to go to base::print(), rather than whatever print() is
first on the search path, would seem like a real change of philosophy.
What about all the other base functions that can be overridden by
something in the global environment?


I'd guess it's a minority of R users who know that print() or show() is
being called when you just evaluate an expression.  Most would think R
just shows you the value of the expression.  That's why they'd be
surprised when their local function suddenly stops the display of
variables from working.

On the other hand, if someone defined a print or show *method* in the
global environment, I think that one should override one defined in a
package namespace.  It does now, and I wouldn't change that.  The
difference is that I'd expect someone defining a method to know what
they're doing, but just defining a function doesn't imply that.

Duncan Murdoch



Paul


Luke, are you going to look at this, or should I?

Duncan Murdoch



setMethod("show",  "Person", function(object) {

for an object of class "Person" for example.




Dave

On Tue, Jun 30, 2015 at 10:11 AM,  wrote:


Same thing happens with S3 if you redefine print(). I thought that
code was actually calculating the function to call rather than the
symbol to use, but apparently not. Shouldn't be too hard to fix.

luke

On Tue, 30 Jun 2015, Hadley Wickham wrote:

  On Tue, Jun 30, 2015 at 2:20 PM, Duncan Murdoch

 wrote:


On 30/06/2015 1:57 PM, Hadley Wickham wrote:


A slightly simpler formulation of the problem is:

show <- function(...) stop("My show!")
methods::setClass("Person", slots = list(name = "character"))
methods::new("Person", name = "Tom")
#> Error in (function (...)  : My show!



Just to be clear:  the complaint is that the auto-called show() is not
methods::show?  I.e. after

x <- methods::new("Person", name = "Tom")

you would expect

show(x)

to give the error, but not

x

??



Correct - I'd expect print() to always call methods::show(), not
whatever show() is first on the search path.

Hadley




--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Department of Statistics andFax:   319-335-3017
Actuarial Science
241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Department of Statistics andFax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Errors on Windows with grep(fixed=TRUE) on UTF-8 strings

Re: [Rd] Defining a `show` function breaks the print-ing of S4 object -- bug or expected?

2 matches

Site Navigation

Mail list logo

Footer information