On Monday 16 September 2013 at 10:40 +0200, Milan Bouchet-Valat wrote:
> On Friday 13 September 2013 at 23:38 +0400, Maxim Linchits wrote:
> > This is a condensed version of the same question on stackexchange here:
> > http://stackoverflow.com/questions/18789330/r-on-windows-character-encoding-hell
> > If you've already stumbled upon it feel free to ignore.
> >
> > My problem is that R on US Windows does not read *any* text file that
> > contains *any* foreign characters. It simply reads the first consecutive n
> > ASCII characters and then throws a warning once it reaches a foreign
> > character:
> >
> > > test <- read.table("test.txt", sep=";", dec=",", quote="",
> >     fileEncoding="UTF-8")
> > Warning messages:
> > 1: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding = "UTF-8") :
> >   invalid input found on input connection 'test.txt'
> > 2: In read.table("test.txt", sep = ";", dec = ",", quote = "", fileEncoding = "UTF-8") :
> >   incomplete final line found by readTableHeader on 'test.txt'
> > > print(test)
> >        V1
> > 1 english
> >
> > > Sys.getlocale()
> > [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> >
> >
> > It is important to note that R on Linux will read UTF-8 as well as
> > exotic character sets without a problem. I've tried it with the exact same
> > files (one was UTF-8 and another was OEM866 Cyrillic).
> >
> > If I do not include the fileEncoding parameter, read.table will read the
> > whole CSV file. But naturally it will read it wrong because it does not
> > know the encoding. So whenever I try to specify the fileEncoding, R will
> > throw the warnings and stop once it reaches a foreign character. It's the
> > same story with all international character encodings.
> > Other users on stackexchange have reported exactly the same issue.
> >
> >
> > Is anyone here who is on a US version of Windows able to import files with
> > foreign characters? Please let me know.
> A reproducible example would have helped, as requested by the posting
> guide.
>
> That said, I am experiencing the same problem after saving the data
> below to a CSV file encoded in UTF-8 (you can do this even with
> Notepad):
> "Ա","Բ"
> 1,10
> 2,20
>
> This is on a Windows 7 box using a French locale, but with the same
> codepage 1252 as yours. What is interesting is that reading the file using
> readLines(file("myFile.csv", encoding="UTF-8"))
> gives no invalid characters. So there must be a bug in read.table().
>
>
> But I must note I do not experience issues with French accented
> characters like "é" ("\Ue9"). In contrast, reading Armenian
> characters like "Ա" ("\U531") gives weird results: the character appears
> as <U+0531> instead of Ա.
>
> Self-contained example, writing the file and reading it back from R:
> tmpfile <- tempfile()
> writeLines("\U531", file(tmpfile, "w", encoding="UTF-8"))
> readLines(file(tmpfile, encoding="UTF-8"))
> # "<U+0531>"
>
> The same phenomenon happens when creating a data frame from this
> character (as noted on StackExchange):
> data.frame("\U531")
>
> So my conclusion is that maybe Windows does not really support Unicode
> characters that are not "relevant" for your current locale. And that may
> have created bugs in the way R handles them in read.table(). R
> developers can probably tell us more about it.

After some more investigation, one part of the problem can be traced back to
scan() (with myFile.csv filled as described above):
scan("myFile.csv", encoding="UTF-8", sep=",", nlines=1)
# Read 2 items
# [1] "Ա" "Բ"

Equivalent, but nonsensical to me:
scan("myFile.csv", fileEncoding="CP1252", encoding="UTF-8", sep=",", nlines=1)
# Read 2 items
# [1] "Ա" "Բ"

scan("myFile.csv", fileEncoding="UTF-8", sep=",", nlines=1)
# Read 0 items
# character(0)
# Warning message:
# In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
#   invalid input found on input connection 'myFile.csv'

So there seem to be two parts to the issue: one in scan(), which for some
reason does not work when passed fileEncoding="UTF-8"; and another in
read.table(), which transforms "Ա" ("\U531") into "X.U.0531.", probably via
make.names(), since:
make.names("\U531")
# "X.U.0531."

Does this make sense to R-core members?


Regards

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
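
A possible workaround follows from the two observations above; this is only a sketch, not verified on a Windows box with codepage 1252. It assumes that encoding="UTF-8" merely marks the strings as UTF-8 without re-encoding them (which would match the scan() call that works), and that the mangled header comes from make.names(), which check.names=FALSE skips. The file contents are the sample data from earlier in this thread; the variable names (csv_lines, tmp, dat) are made up for illustration.

```r
# Sketch under stated assumptions; behaviour on Windows/CP1252 is not verified.

# Write the sample CSV as raw UTF-8 bytes, independently of the current locale.
csv_lines <- c('"\u0531","\u0532"', "1,10", "2,20")
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(csv_lines), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks input strings as UTF-8 instead of converting them
# to the native code page (unlike fileEncoding, which re-encodes on input).
# check.names = FALSE keeps the header as read, bypassing make.names().
dat <- read.table(tmp, sep = ",", header = TRUE, encoding = "UTF-8",
                  check.names = FALSE, stringsAsFactors = FALSE)

names(dat)            # column names as stored in the data frame
Encoding(names(dat))  # declared encoding of each column name
```

Whether the names then print as Ա and Բ or as <U+0531> escapes on Windows is a separate display question that this sketch does not address.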