Re: [Rd] R 2.9.2 crashes when sorting latin1-encoded strings

Prof Brian Ripley Mon, 05 Oct 2009 07:16:49 -0700

This was a missing PROTECT() in do_order.

But I'll echo what Simon Urbanek said: don't do that but rather use the documented ways to re-encode the file as you read it. (Latin-1 used to be needed for collation on Mac OS X as C-level collation in UTF-8 was completely broken -- but we have worked around that.)

We provided fileEncoding= in read.table for those who failed to RTFM and thought encoding= was to set the file encoding, but it seems that encodings are simply too hard a concept for some R users.
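
For readers following the thread, the difference between the two read.table arguments can be sketched roughly as below. This is an illustration under assumptions, not code from the thread: the temporary file, its contents, and the names tmp, d1 and d2 are made up for the example. The result of fileEncoding= depends on the session's native encoding, and a UTF-8 or Latin-1 locale is assumed for that part.

```r
# Hypothetical example file: a one-column table stored as Latin-1 bytes.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x77, 0x6f, 0x72, 0x64, 0x0a,   # "word\n" (header)
                  0x61, 0xfc, 0x0a,               # "a\xfc\n"
                  0x62, 0xe4, 0x0a)),             # "b\xe4\n"
         tmp)

# fileEncoding= declares the encoding of the file itself; the contents are
# re-encoded to the session's native encoding while the file is read.
d1 <- read.table(tmp, header = TRUE, fileEncoding = "latin1",
                 stringsAsFactors = FALSE)
print(charToRaw(d1$word[1]))   # bytes depend on the native encoding

# encoding= does not re-encode anything; it only marks the strings that
# were read as being in the given encoding.
d2 <- read.table(tmp, header = TRUE, encoding = "latin1",
                 stringsAsFactors = FALSE)
print(charToRaw(d2$word[1]))   # 61 fc: the file's bytes, unchanged
print(Encoding(d2$word))       # "latin1" "latin1"

unlink(tmp)
```

In short, fileEncoding= answers "what encoding is the file in?", whereas encoding= answers "what should the strings be labelled as once read?".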


On Wed, 30 Sep 2009, Stefan Evert wrote:

Hi everyone!

I think I stumbled over a bug in the latest R 2.9.2 patched for OS X:
R version 2.9.2 Patched (2009-09-24 r49861)
i386-apple-darwin9.8.0
When I try to sort latin1-encoded character vectors, R sometimes crashes with a segmentation fault. I'm running OS X 10.5.8 and have observed this behaviour both with the i386 and x86_64 builds, in the R.app GUI as well as on the command line.
Here's a minimal example that reliably triggers the crash on my machine:

=====
print(sessionInfo())

words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
str(words)

print(table(Encoding(words)))
Encoding(words) <- "latin1"  # this is the correct encoding!
print(table(Encoding(words)))

N <- 1000
words <- rep(words, length.out=N)

print(N)
for (i in 1:N) {
  x <- words[1:i]
  # the following line will crash for some i, depending on the particular
  # strings in <words> and the subset selected for <x> above
  order(x)
}
=====
The output I get from this code is appended at the end of the mail. Note that R incorrectly declares the latin1 strings in <words> to have UTF-8 encoding (this seems wrong to me because the \x escapes insert raw bytes into the string). The crash only occurs if the correct "latin1" encoding (or "unknown") is explicitly specified. Otherwise the string handling code appears to ignore everything after the first invalid multibyte character.
I haven't been able to trigger the bug without some kind of loop. The crash always occurs at the same iteration, but this changes depending on the contents of <words> and the specific subset selected in each loop iteration. Also note that the 64-bit version of R gives a different error message. If I omit the unrelated statement "print(N)", the 64-bit version segfaults and the 32-bit version just hangs with high CPU load. All this suggests to me that there must be some insidious memory corruption or stack/range overflow in the internal ordering code.
Can other people reproduce this problem on different platforms and possibly with different versions of R?
BTW, I ran into the crash when trying to read.delim() a file in latin1 encoding, using either encoding="latin1" or fileEncoding="latin1", and then converting it back and forth between a character vector and a factor. I still don't understand what's going on there. The behaviour of read.delim() seems to depend very much on my locale settings when running R, which is rather unpleasant. Is there a way to find out how strings are stored internally (i.e. getting the exact byte representation) and whether R believes them to be in UTF-8 or latin1 encoding?
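
On the closing question, base R does expose both pieces of information: charToRaw() shows the bytes actually stored, and Encoding() shows the declared encoding mark. A minimal sketch follows; the variable names x and u are illustrative only and do not come from the thread.

```r
x <- "a\xfc"                      # byte 0xFC is "ü" in Latin-1
Encoding(x) <- "latin1"           # declare the encoding; bytes are untouched
print(charToRaw(x))               # 61 fc
print(Encoding(x))                # "latin1"

# Re-encoding, by contrast, rewrites the bytes and updates the mark.
u <- iconv(x, from = "latin1", to = "UTF-8")
print(charToRaw(u))               # 61 c3 bc
print(Encoding(u))                # "UTF-8"
```

Assigning to Encoding() only relabels the existing bytes, whereas iconv() (or enc2utf8()) produces a new byte sequence in the target encoding.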
Best regards,
Stefan Evert

[ [email protected] | http://purl.org/stefan.evert ]





Output of sample code on my machine:
print(sessionInfo())
R version 2.9.2 Patched (2009-09-24 r49861)
i386-apple-darwin9.8.0

locale:
en_GB/en_GB/C/C/en_GB/en_GB

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
str(words)
chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
print(table(Encoding(words)))
unknown   UTF-8
   2       5
Encoding(words) <- "latin1"  # this is the correct encoding!
print(table(Encoding(words)))
latin1 unknown
   5       2
N <- 1000
words <- rep(words, length.out=N)

print(N)
[1] 1000
for (i in 1:N) {
+   x <- words[1:i]
+   # the following line will crash for some i, depending on the particular
+   # strings in <words> and the subset selected for <x> above
+   order(x)
+ }

*** caught bus error ***
address 0x86, cause 'non-existent physical address'

Traceback:
1: order(x)
aborting ...
Bus error
64-bit version:
print(sessionInfo())
R version 2.9.2 Patched (2009-09-24 r49861)
x86_64-apple-darwin9.8.0

locale:
en_GB/en_GB/C/C/en_GB/en_GB

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
str(words)
chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
print(table(Encoding(words)))
unknown   UTF-8
   2       5
Encoding(words) <- "latin1"  # this is the correct encoding!
print(table(Encoding(words)))
latin1 unknown
   5       2
N <- 1000
words <- rep(words, length.out=N)

print(N)
[1] 1000
for (i in 1:N) {
+   x <- words[1:i]
+   # the following line will crash for some i, depending on the particular
+   # strings in <words> and the subset selected for <x> above
+   order(x)
+ }
Error in order(x) : 'translateCharUTF8' must be called on a CHARSXP
Execution halted
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


--
Brian D. Ripley,                  [email protected]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

