On Friday, 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
> This is a condensed version of the same question on Stack Overflow here:
> http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
> If you've already stumbled upon it, feel free to ignore it.
>
> My problem is that R on US Windows does not read *any* text file that
> contains *any* foreign characters. It simply reads the first n consecutive
> ASCII characters and then throws a warning once it reaches a foreign
> character:
>
> > test <- read.table("test.txt", sep=";", dec=",", quote="",
> +                    fileEncoding="UTF-8")
> Warning messages:
> 1: In read.table("test.txt", sep = ";", dec = ",", quote = "",
>    fileEncoding = "UTF-8") :
>   invalid input found on input connection 'test.txt'
> 2: In read.table("test.txt", sep = ";", dec = ",", quote = "",
>    fileEncoding = "UTF-8") :
>   incomplete final line found by readTableHeader on 'test.txt'
> > print(test)
>        V1
> 1 english
>
> > Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>
> It is important to note that R on Linux will read UTF-8 as well as
> exotic character sets without a problem. I've tried it with the exact
> same files (one was UTF-8 and another was OEM866 Cyrillic).
>
> If I do not include the fileEncoding parameter, read.table will read the
> whole CSV file, but naturally it will read it wrong because it does not
> know the encoding. Whenever I specify fileEncoding, R throws the warnings
> above and stops once it reaches a foreign character. It's the same story
> with all international character encodings. Other users on Stack Overflow
> have reported exactly the same issue.
>
> Is anyone here who is on a US version of Windows able to import files
> with foreign characters? Please let me know.

A reproducible example would have helped, as requested by the posting guide.
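Something along these lines would have served (a self-contained sketch; the file and the Cyrillic test word are invented here, chosen to match the OEM866 case you mention):

tmp <- tempfile(fileext = ".txt")
con <- file(tmp, "w", encoding = "UTF-8")
writeLines(c("english;1", "\u043F\u0440\u0438\u0432\u0435\u0442;2"), con)
close(con)
# Reportedly stops at the first non-ASCII character on a Windows/CP1252
# locale with "invalid input found on input connection":
read.table(tmp, sep = ";", dec = ",", quote = "", fileEncoding = "UTF-8")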
Though I am also experiencing the same problem after saving the data below to a CSV file encoded in UTF-8 (you can do this even with Notepad):

"Ա","Բ"
1,10
2,20

This is on a Windows 7 box using a French locale, but with the same codepage 1252 as yours. What is interesting is that reading the file using

readLines(file("myFile.csv", encoding="UTF-8"))

gives no invalid characters, so there must be a bug in read.table().

I must note, though, that I have no issues with French accented characters like "é" ("\Ue9"). Armenian characters like "Ա" ("\U531"), on the other hand, give weird results: the character is printed as <U+0531> instead of Ա. Self-contained example, writing the file from R and reading it back:

tmpfile <- tempfile()
con <- file(tmpfile, "w", encoding="UTF-8")
writeLines("\U531", con)
close(con)  # close explicitly so the output is flushed to disk
readLines(file(tmpfile, encoding="UTF-8"))
# [1] "<U+0531>"

The same phenomenon happens when creating a data frame from this character (as noted on Stack Overflow):

data.frame("\U531")

So my conclusion is that Windows may not really support Unicode characters that are not "relevant" to your current locale, and that this may have created bugs in the way read.table() handles them. The R developers can probably tell us more about it.

Regards
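P.S. A possible workaround, untested on my side beyond the sketch below: read.table() also has an encoding argument which, unlike fileEncoding, does not re-encode the input to the native codepage but only marks the strings it reads as UTF-8, so it may sidestep the "invalid input" failure:

# Sketch only: encoding= declares the encoding of input strings rather
# than translating them the way fileEncoding= does, so characters
# outside CP1252 should survive (even if they still print as <U+0531>
# on Windows).
test <- read.table("myFile.csv", sep=",", header=TRUE, encoding="UTF-8")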