[Rd] [PATCH] Improve utf8clen and remove utf8_table4
Given a char `c' which should be the start byte of a utf8 character, the utf8clen function returns the byte length of the utf8 character. Before this patch, the utf8clen function would return either: * 1 if `c' was an ascii character or a utf8 continuation byte * An int in the range [2, 6] indicating the byte length of the utf8 character With this patch, the utf8clen function will now return either: * -1 if `c' is not a valid utf8 start byte * The byte length of the utf8 character (the number of leading 1's, really) I believe returning -1 for continuation bytes makes utf8clen less error prone. The utf8_table4 array is no longer needed and has been removed. Sahil Index: src/main/util.c === --- src/main/util.c (revision 72365) +++ src/main/util.c (working copy) @@ -1183,18 +1183,23 @@ return TRUE; } -/* Number of additional bytes */ -static const unsigned char utf8_table4[] = { - 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, - 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, - 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, - 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5 }; - +/* + * If `c' is not a valid utf8 start byte, return -1. + * Otherwise, return the number of bytes in the utf8 string with start byte `c' + */ int attribute_hidden utf8clen(char c) { -/* This allows through 8-bit chars 10xx, which are invalid */ -if ((c & 0xc0) != 0xc0) return 1; -return 1 + utf8_table4[c & 0x3f]; +int n = 0; /* number of leading 1's */ +int m = 0x80; /* byte mask */ + +while (c & m) { +++n; +m >>= 1; +} + +if (n == 0) return 1; /* an ascii char of the form 0xxx */ +else if (n == 1) return -1; /* invalid start byte of the form 10xx */ +else return n; } /* These return the result in wchar_t, but does assume Index: src/main/valid_utf8.h === --- src/main/valid_utf8.h (revision 72365) +++ src/main/valid_utf8.h (working copy) @@ -75,7 +75,7 @@ if (c < 0xc0) return 1; /* Isolated 10xx byte */ if (c >= 0xfe) return 1; /* Invalid 0xfe or 0xff bytes */ - ab = utf8_table4[c & 0x3f]; /* Number of additional bytes */ + ab = utf8clen(c) - 1; /* Number of additional bytes */ if (length < ab) return 1; length -= ab; /* Length remaining */ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [PATCH] Improve utf8clen and remove utf8_table4
Some of the code that uses utf8clen checks the validity of the utf8 string before making the call. However, there were some hairy areas where I felt that the new semantics may cause issues (if not now, then in future changes). I've attached two patches: * new_semantics.diff keeps the new semantics and updates those hairy areas above. * old_semantics.diff maintains the old semantics (return 1 even for continuation bytes). I don't think the new semantics will cause issues, especially with the updates, but we can err on the side of caution and keep the old semantics. I feel that the new semantics provide a clearer interface though (the function expects a start byte and should return an error if a start byte is not supplied). In either case, the utf8_table4 array has been removed. Sahil On 03/19/2017 05:38 AM, Duncan Murdoch wrote: On 19/03/2017 2:31 AM, Sahil Kang wrote: Given a char `c' which should be the start byte of a utf8 character, the utf8clen function returns the byte length of the utf8 character. Before this patch, the utf8clen function would return either: * 1 if `c' was an ascii character or a utf8 continuation byte * An int in the range [2, 6] indicating the byte length of the utf8 character With this patch, the utf8clen function will now return either: * -1 if `c' is not a valid utf8 start byte * The byte length of the utf8 character (the number of leading 1's, really) I believe returning -1 for continuation bytes makes utf8clen less error prone. The utf8_table4 array is no longer needed and has been removed. utf8clen is used internally by R in more than a dozen places, and is likely used in packages as well. Have you checked that this change in semantics won't break any of those uses? Duncan Murdoch Index: src/main/util.c === --- src/main/util.c (revision 72365) +++ src/main/util.c (working copy) @@ -1183,18 +1183,23 @@ return TRUE; } -/* Number of additional bytes */ -static const unsigned char utf8_table4[] = { - 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, - 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, - 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, - 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5 }; - +/* + * If `c' is not a valid utf8 start byte, return 1. + * Otherwise, return the number of bytes in the utf8 string with start byte `c' + */ int attribute_hidden utf8clen(char c) { -/* This allows through 8-bit chars 10xx, which are invalid */ -if ((c & 0xc0) != 0xc0) return 1; -return 1 + utf8_table4[c & 0x3f]; +int n = 0; /* number of leading 1's */ +int m = 0x80; /* byte mask */ + +while (c & m) { +++n; +m >>= 1; +} + +if (n == 0) return 1; /* an ascii char of the form 0xxx */ +else if (n == 1) return 1; /* invalid start byte of the form 10xx */ +else return n; } /* These return the result in wchar_t, but does assume Index: src/main/valid_utf8.h === --- src/main/valid_utf8.h (revision 72365) +++ src/main/valid_utf8.h (working copy) @@ -75,7 +75,7 @@ if (c < 0xc0) return 1; /* Isolated 10xx byte */ if (c >= 0xfe) return 1; /* Invalid 0xfe or 0xff bytes */ - ab = utf8_table4[c & 0x3f]; /* Number of additional bytes */ + ab = utf8clen(c) - 1; /* Number of additional bytes */ if (length < ab) return 1; length -= ab; /* Length remaining */ Index: src/main/character.c === --- src/main/character.c (revision 72365) +++ src/main/character.c (working copy) @@ -276,7 +276,9 @@ if (ienc == CE_UTF8) { const char *end = str + strlen(str); for (i = 0; i < so && str < end; i++) { - int used = utf8clen(*str); + used = utf8clen(*str); + if (used < 0) used = 1; /* gobble up invalid utf8 start byte */ + if (i < sa - 1) { str += used; continue; } for (j = 0; j < used; j++) *buf++ = *str++; } @@ -459,10 +461,18 @@ int i, in = 0, out = 0; if (ienc == CE_UTF8) { - for (i = 1; i < sa; i++) buf += utf8clen(*buf); + int len; + for (i = 1; i < sa; i++) { + len = utf8clen(*buf); + buf += len < 0 ? 1 : len; + } for (i = sa; i <= so && in < strlen(str); i++) { - in += utf8clen(str[in]); - out += utf8clen(buf[out]); + len = utf8clen(str[in]); + in += len < 0 ? 1 : len; + + len = utf8clen(buf[out]); + out += len < 0 ? 1 : len; + if (!str[in]) break; } if (in != out) memmove(buf+in, buf+out, strlen(buf+out)+1); Index: src/main/connections.c === --- src/main/connections.c (revision 72365) +++ src/main/connections.c (working copy) @@ -4408,6 +4408,7 @@ if (iread >= nbytes) b
[Rd] [PATCH] Ensure correct order of evaluation in macro
Hello, I'd like to contribute this small patch (attached) that I think will help prevent some future bugs from occurring in paste.c. By wrapping the macro's arguments in parentheses, we can ensure that the correct order of evaluation will take place during preprocessing. To illustrate, we can use the == operator which has lower evaluation precedence than the < operator: * With the current macro, imax2(3==4, 1) expands to 0. * After applying this patch, imax2(3==4, 1) expands to 1 as expected. Since I'm still relatively new to the mailing list, I've kept this patch small. I did notice other macros that have this same issue, so I can start sending additional patches if this seems okay. Thanks, Sahil Index: src/main/paste.c === --- src/main/paste.c (revision 72713) +++ src/main/paste.c (working copy) @@ -31,7 +31,7 @@ #include "Defn.h" #include -#define imax2(x, y) ((x < y) ? y : x) +#define imax2(x, y) (((x) < (y)) ? (y) : (x)) #include "Print.h" #include "RBufferUtils.h" __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [PATCH] Ensure correct order of evaluation in macro
Hi Duncan, Would you merge this patch? I'm planning on sending larger patches in the next few days that fix other macros I've seen, but I figured it'd be best to start with a smaller patch. Thanks, Sahil __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [PATCH] Ensure correct order of evaluation in macro
Hi Martin, Would you feel comfortable merging my small patch in? I'd like to send additional patches that will improve these types of function macros so I'm hoping to start with this small change. Thanks, Sahil __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] SUGGESTION: Use JIRA for Bug Reporting, Package Development and Project Management
I enjoy the simplicity of a mailing list; I'd prefer not to create a new user account to participate in a software project. Plenty of other software communities utilize mailing lists, and I think you'll also begin to prefer email as you get more involved with software projects. R is also a GNU package so I don't think it's a good idea to use a closed source product like JIRA. I also think it's easier for the core devs to manage a mailing list instead of a JIRA instance. I'm not sure what you mean about privacy concerns for package maintainers; can you elaborate? Sahil __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] [Bug Fix] Default values not applied to ... arguments
Hi Duncan, Martin Here's a small patch that fixes bug 15199 reported at: https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15199 I was able to reproduce the bug as Duncan had outlined just fine, but I did notice that when we debug(f), the problem went away. I later realized that f(1,,3) behaved correctly the first time it was executed, but misbehaved as documented on subsequent calls. This narrowed the problem down to the byte-compilation that occurs on subsequent function calls. This patch prevents byte-compilation on closure objects. Although it's a less than ideal solution, this patch fixes the bug at least until the underlying byte-compilation issue can be found (I'm currently scrutinizing the promiseArgs function at eval.c:2771). Thanks, Sahil Index: src/main/eval.c === --- src/main/eval.c (revision 72894) +++ src/main/eval.c (working copy) @@ -1059,8 +1059,8 @@ SEXP body = BODY(fun); -if (R_jit_enabled > 0 && TYPEOF(body) != BCODESXP && - ! R_disable_bytecode && ! NOJIT(fun)) { +if (R_jit_enabled > 0 && TYPEOF(fun) != CLOSXP && TYPEOF(body) != BCODESXP + && !R_disable_bytecode && ! NOJIT(fun)) { if (MAYBEJIT(fun)) { /* function marked as MAYBEJIT the first time now seen __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [Bug Fix] Default values not applied to ... arguments
Yes, I see what you mean. My patch only disables JIT compilation for closures. If a user manually compiles a closure, however, the bug pops up again. I think the bug may either lie in how the byte-compiler compiles closures, or how closures with compiled body's are executed by the AST interpreter (without my patch applied): compiler::enableJIT(0) # turn off JIT compilation f <- function(...) { g(...) } g <- function(a=4, b=5, c=6) { print(c(missing(a), missing(b), missing(c))) b } f(1,,3) # works fine # [1] FALSE TRUE FALSE # [1] 5 f(1,,3) # works fine # [1] FALSE TRUE FALSE # [1] 5 ff_1 <- compiler::compile(f) # compile f ff_2 <- compiler::cmpfun(f) # compile body of closure eval(ff_1)(1,,3) # works fine # [1] FALSE TRUE FALSE # [1] 5 eval(ff_2)(1,,3) # bug shows up again # [1] FALSE TRUE FALSE # Error in g(...) : argument is missing, with no default Thanks, Sahil On 07/06/2017 09:29 AM, Tomas Kalibera wrote: > > Thanks for the report, I've fixed 15199 in the AST interpreter in > 72664, I will fix it in the byte-code interpreter as well. > > If you ever needed to disable the JIT, there is API for that, see > ?compiler. Note though that by disabling the JIT you won't disable the > byte-code interpreter, because code also gets compiled when packages > are installed or when the compiler is invoked explicitly. > > Best, > Tomas > > On 07/06/2017 04:40 PM, Sahil Kang wrote: >> Hi Duncan, Martin >> >> Here's a small patch that fixes bug 15199 reported at: >> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15199 >> >> I was able to reproduce the bug as Duncan had outlined just fine, but >> I did notice that when we debug(f), the problem went away. >> I later realized that f(1,,3) behaved correctly the first time it was >> executed, but misbehaved as documented on subsequent calls. >> This narrowed the problem down to the byte-compilation that occurs on >> subsequent function calls. >> >> This patch prevents byte-compilation on closure objects. >> Although it's a less than ideal solution, this patch fixes the bug at >> least until the underlying byte-compilation issue can be found (I'm >> currently scrutinizing the promiseArgs function at eval.c:2771). >> >> Thanks, >> Sahil >> >> >> __ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel > > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [Bug Fix] Default values not applied to ... arguments
I'm glad I could help, and I look forward to reading your patch so that I can learn more about the R internals. It'd be nice to close this 4 year old bug. Sahil Original Message From: Tomas Kalibera Sent: July 7, 2017 11:10:05 AM PDT To: Sahil Kang Subject: Re: [Rd] [Bug Fix] Default values not applied to ... arguments As I said the old behavior had been for several years both in AST and BC interpreter. I fixed it in AST and I will fix it in BC. I know exactly where it is and how to fix it, there is no more help I could use on this issue and I don't think there is any need to discuss this further, and particularly so on R-devel. Please do not send speculative emails to R-devel, in the interest of saving time of people following the list. Thanks Tomas On 07/07/2017 11:42 AM, Sahil Kang wrote: > Yes, I see what you mean. My patch only disables JIT compilation for > closures. If a user manually compiles a closure, however, the bug pops > up again. > > I think the bug may either lie in how the byte-compiler compiles > closures, or how closures with compiled body's are executed by the AST > interpreter (without my patch applied): > > compiler::enableJIT(0) # turn off JIT compilation > f <- function(...) { g(...) } > g <- function(a=4, b=5, c=6) { > print(c(missing(a), missing(b), missing(c))) > b > } > > f(1,,3) # works fine > # [1] FALSE TRUE FALSE > # [1] 5 > > f(1,,3) # works fine > # [1] FALSE TRUE FALSE > # [1] 5 > > ff_1 <- compiler::compile(f) # compile f > ff_2 <- compiler::cmpfun(f) # compile body of closure > > eval(ff_1)(1,,3) # works fine > # [1] FALSE TRUE FALSE > # [1] 5 > > eval(ff_2)(1,,3) # bug shows up again > # [1] FALSE TRUE FALSE > # Error in g(...) : argument is missing, with no default > > Thanks, > Sahil > > On 07/06/2017 09:29 AM, Tomas Kalibera wrote: >> >> Thanks for the report, I've fixed 15199 in the AST interpreter in >> 72664, I will fix it in the byte-code interpreter as well. >> >> If you ever needed to disable the JIT, there is API for that, see >> ?compiler. Note though that by disabling the JIT you won't disable >> the byte-code interpreter, because code also gets compiled when >> packages are installed or when the compiler is invoked explicitly. >> >> Best, >> Tomas >> >> On 07/06/2017 04:40 PM, Sahil Kang wrote: >>> Hi Duncan, Martin >>> >>> Here's a small patch that fixes bug 15199 reported at: >>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15199 >>> >>> I was able to reproduce the bug as Duncan had outlined just fine, >>> but I did notice that when we debug(f), the problem went away. >>> I later realized that f(1,,3) behaved correctly the first time it >>> was executed, but misbehaved as documented on subsequent calls. >>> This narrowed the problem down to the byte-compilation that occurs >>> on subsequent function calls. >>> >>> This patch prevents byte-compilation on closure objects. >>> Although it's a less than ideal solution, this patch fixes the bug >>> at least until the underlying byte-compilation issue can be found >>> (I'm currently scrutinizing the promiseArgs function at eval.c:2771). >>> >>> Thanks, >>> Sahil >>> >>> >>> __ >>> R-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >> >> > __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel