On Friday, 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
> This is a condensed version of the same question on Stack Overflow here:
> http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
> If you've already stumbled upon it, feel free to ignore it.
>
> My problem is that R on US Windows does not read *any* text file that
> contains *any* foreign characters. It simply reads the first n consecutive
> ASCII characters and then throws a warning once it reaches a foreign
> character:
>
> > test <- read.table("test.txt", sep=";", dec=",", quote="",
> +                    fileEncoding="UTF-8")
> Warning messages:
> 1: In read.table("test.txt", sep = ";", dec = ",", quote = "",
>    fileEncoding = "UTF-8") :
>   invalid input found on input connection 'test.txt'
> 2: In read.table("test.txt", sep = ";", dec = ",", quote = "",
>    fileEncoding = "UTF-8") :
>   incomplete final line found by readTableHeader on 'test.txt'
> > print(test)
>        V1
> 1 english
>
> > Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>
> It is important to note that R on Linux will read UTF-8 as well as
> exotic character sets without a problem. I've tried it with the exact
> same files (one was UTF-8 and another was OEM866 Cyrillic).
>
> If I do not include the fileEncoding parameter, read.table will read the
> whole CSV file, but naturally it will read it wrong because it does not
> know the encoding. Whenever I specify fileEncoding, R throws the warnings
> above and stops once it reaches a foreign character. It's the same story
> with all international character encodings. Other users on Stack Overflow
> have reported exactly the same issue.
>
> Is anyone here who is on a US version of Windows able to import files
> with foreign characters? Please let me know.

A reproducible example would have helped, as requested by the posting guide.
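Something along these lines would have served (a self-contained sketch; the file and the Cyrillic test word are invented here, chosen to match the OEM866 case you mention):

tmp <- tempfile(fileext = ".txt")
con <- file(tmp, "w", encoding = "UTF-8")
writeLines(c("english;1", "\u043F\u0440\u0438\u0432\u0435\u0442;2"), con)
close(con)
# Reportedly stops at the first non-ASCII character on a Windows/CP1252
# locale with "invalid input found on input connection":
read.table(tmp, sep = ";", dec = ",", quote = "", fileEncoding = "UTF-8")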
Though I am also experiencing the same problem after saving the data below to a CSV file encoded in UTF-8 (you can do this even with Notepad):

"Ա","Բ"
1,10
2,20

This is on a Windows 7 box using a French locale, but with the same codepage 1252 as yours. What is interesting is that reading the file using

readLines(file("myFile.csv", encoding="UTF-8"))

gives no invalid characters, so there must be a bug in read.table().

I must note, though, that I have no issues with French accented characters like "é" ("\Ue9"). Armenian characters like "Ա" ("\U531"), on the other hand, give weird results: the character is printed as <U+0531> instead of Ա. Self-contained example, writing the file from R and reading it back:

tmpfile <- tempfile()
con <- file(tmpfile, "w", encoding="UTF-8")
writeLines("\U531", con)
close(con)  # close explicitly so the output is flushed to disk
readLines(file(tmpfile, encoding="UTF-8"))
# [1] "<U+0531>"

The same phenomenon happens when creating a data frame from this character (as noted on Stack Overflow):

data.frame("\U531")

So my conclusion is that Windows may not really support Unicode characters that are not "relevant" to your current locale, and that this may have created bugs in the way read.table() handles them. The R developers can probably tell us more about it.

Regards
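P.S. A possible workaround, untested on my side beyond the sketch below: read.table() also has an encoding argument which, unlike fileEncoding, does not re-encode the input to the native codepage but only marks the strings it reads as UTF-8, so it may sidestep the "invalid input" failure:

# Sketch only: encoding= declares the encoding of input strings rather
# than translating them the way fileEncoding= does, so characters
# outside CP1252 should survive (even if they still print as <U+0531>
# on Windows).
test <- read.table("myFile.csv", sep=",", header=TRUE, encoding="UTF-8")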