I think there is a memory bug in `substr` that is triggered by a UTF-8 corner case: an incomplete UTF-8 byte sequence at the end of a string. With a valgrind level 2 instrumented build of R-devel I get:

> string <- "abc\xEE"  # \xEE indicates the start of a 3 byte UTF-8 sequence
> Encoding(string) <- "UTF-8"
> substr(string, 1, 10)
==15375== Invalid read of size 1
==15375==    at 0x45B3F0: substr (character.c:286)
==15375==    by 0x45B3F0: do_substr (character.c:342)
==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
==15375==    by 0x4D9561: Rf_eval (eval.c:747)
==15375==    by 0x507008: Rf_ReplIteration (main.c:258)
==15375==    by 0x5073E7: R_ReplConsole (main.c:308)
==15375==    by 0x507494: run_Rmainloop (main.c:1082)
==15375==    by 0x41A8E6: main (Rmain.c:29)
==15375==  Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd
==15375==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15375==    by 0x51033E: GetNewPage (memory.c:888)
==15375==    by 0x511FC0: Rf_allocVector3 (memory.c:2691)
==15375==    by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577)
==15375==    by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007)
==15375==    by 0x4657AC: coerceToVectorList (coerce.c:892)
==15375==    by 0x4657AC: Rf_coerceVector (coerce.c:1293)
==15375==    by 0x4660EB: ascommon (coerce.c:1369)
==15375==    by 0x4667C0: do_asvector (coerce.c:1544)
==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
==15375==    by 0x515EF7: dispatchMethod (objects.c:408)
==15375==    by 0x516379: Rf_usemethod (objects.c:458)
==15375==    by 0x516694: do_usemethod (objects.c:543)
==15375==
[1] "abc<ee>"

Here is a patch for the native version of `substr` that highlights the problem and a possible fix. Basically `substr` computes the byte width of a UTF-8 character based on the leading byte ("\xEE" here, which implies 3 bytes) and reads/writes that entire byte width irrespective of whether the string actually ends before the theoretical end of the UTF-8 "character".
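
Before the patch itself, here is a small standalone sketch of the same idea outside of R, in case it helps to see the bounds check in isolation. It is not R's code: `lead_byte_len` is a simplified, hypothetical stand-in for `utf8clen`, and `copy_chars_bounded` is a made-up name for a loop shaped like the one in `substr`. The only point is the extra `str < end` test in the inner loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for R's utf8clen(): the number of
 * bytes a UTF-8 sequence should occupy, judged from its lead byte only. */
static int lead_byte_len(unsigned char c)
{
    if (c < 0x80) return 1;              /* ASCII */
    if ((c & 0xE0) == 0xC0) return 2;    /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;    /* 1110xxxx, e.g. 0xEE */
    if ((c & 0xF8) == 0xF0) return 4;    /* 11110xxx */
    return 1;                            /* stray continuation or invalid byte */
}

/* Copy up to nchars "characters" from [str, end) into buf and return the
 * number of bytes written. The inner loop stops at end even when the lead
 * byte promises more bytes than the string actually contains.
 * Assumes buf has room for at least (end - str) bytes. */
static size_t copy_chars_bounded(const char *str, const char *end,
                                 int nchars, char *buf)
{
    char *out = buf;
    assert(str <= end);
    for (int i = 0; i < nchars && str < end; i++) {
        int used = lead_byte_len((unsigned char) *str);
        /* Without "&& str < end" this would read past the terminator
         * when the final sequence is truncated. */
        for (int j = 0; j < used && str < end; j++) *out++ = *str++;
    }
    return (size_t) (out - buf);
}
```

With the input "abc\xEE" (4 bytes) and a request for 10 characters, this copies exactly 4 bytes; dropping the inner `str < end` test would make the last iteration try to read 3 bytes starting at the final one.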

Index: src/main/character.c
===================================================================
--- src/main/character.c	(revision 74482)
+++ src/main/character.c	(working copy)
@@ -283,7 +283,7 @@
 	    for (i = 0; i < so && str < end; i++) {
 		int used = utf8clen(*str);
 		if (i < sa - 1) { str += used; continue; }
-		for (j = 0; j < used; j++) *buf++ = *str++;
+		for (j = 0; j < used && str < end; j++) *buf++ = *str++;
 	    }
 	} else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
 	    for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;

The change above removed the valgrind error for me. I re-built R with the change and ran "make check" which seemed to work fine. I also ran some simple checks on UTF-8 strings and things seem to work okay.

I have very limited experience making changes to R (this is my first attempt at a patch) so please take all of the above with extreme skepticism.

Apologies in advance if this turns out to be a false alarm caused by an error on my part.

Best,

Brodie.

PS: apologies also if the formatting of this e-mail is bad. I have not figured out how to get plaintext working properly with yahoo.