I can reproduce this with read.table(encoding="UTF-8") in RGui on Windows 10, reading a file that contains the two UTF-8 characters. The table is read into R correctly, as documented: both characters are represented in UTF-8 and marked as such. The conversion of the infinity symbol to 8 and of Zhe to <U+0436> happens later, during printing with print.data.frame(). For instance, it currently does not happen with print(as.matrix()). As I wrote in more detail in another email in this thread, R sometimes needs to convert strings to the current native encoding. Windows converts the infinity symbol to 8 by default as a "best fit" but fails to convert Zhe, so R displays <U+0436>.
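To check on your side that the data are intact and only the printout is affected, something along these lines may help. This is a minimal sketch: the temporary file and its contents are made up for illustration, and what the two print() calls show depends on the platform and the native encoding, so no output is given for them.

```r
# Hypothetical test file: one line holding U+221E (infinity) and U+0436 (zhe).
# \u escapes plus useBytes = TRUE write UTF-8 bytes regardless of the locale.
tmp <- tempfile(fileext = ".csv")
writeLines("\u221e,\u0436", tmp, useBytes = TRUE)

df <- read.table(tmp, sep = ",", encoding = "UTF-8",
                 stringsAsFactors = FALSE)

# Inspect the stored code points instead of trusting the printout.
utf8ToInt(df$V1)   # 8734, i.e. U+221E
utf8ToInt(df$V2)   # 1078, i.e. U+0436

# Declared encoding of each string ("UTF-8" when marked as such).
Encoding(df$V1)
Encoding(df$V2)

# Compare the two printing paths mentioned above.
print(df)
print(as.matrix(df))
```

If the code points are 8734 and 1078, the file was read correctly and any 8 or <U+0436> in the output comes from the conversion to the native encoding at print time.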
It is easiest to use only input files in the current native encoding: one could convert them before passing them to R, making sure that the conversion does not have similar problems, or use R on a non-Windows platform. Relying on which R functions/packages happen to work with non-native encodings may be brittle, but of course any R function documented to work with non-native encodings (like read.table(encoding=)) should do so. If it does not, it will be fixed following a bug report.

I am not sure if that is what you had in mind, but conversion of character (string) to double is a different matter. as.double() currently returns NA for "∞" (on Linux), as documented in ?as.double.

Best
Tomas

On 2/7/19 11:17 AM, David Byrne wrote:
> Bug
> Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> file containing the infinity symbol (' ∞ ') results in the infinity
> symbol imported as the number 8. Other Unicode characters seem
> unaffected, example, Zhe: ж
>
> Expected Behavior:
> The imported data.frame should represent the infinity symbol as the
> expected 'Inf' so that normal mathematical operations can be processed
>
> Stack Overflow Post:
> I created a question on Stack Overflow where one other member was able
> to reproduce the same issues I was having. This question can be found
> at:
> https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
>
> Method to Reproduce - 1:
> A simple method to reproduce this issues is to use R-Studio: In the
> console, type the following:
>> read.table(text=" ∞", encoding="UTF-8")
> The result should be a data.frame with a single value of '8'
>
> Repeating the same with ж Results in correct expected behavior
>
> Method to Reproduce - 2:
> Create a .csv file containing the infinity and Zhe characters (I have
> attached the file for convenience, hopefully it is no rejected by your
> email service). Launch an interactive session using
>
>> r --vanilla
> Enter the following statement taking care to replace the
> <path-to-file> with the appropriate one:
>
>> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
>
> This should result in a two element data.frame; the first being the
> incorrect value of 8 with an additional <U+FEFF> and the second the
> correct value of Zhe.
>
> Note the additional <U+FEFF> prefixed to the front of the '8'. This
> appears to be a hidden character for the purposes of letting editors
> know the encoding. The following link has some explanation however, it
> states this is caused by excel. The file I created was done so using
> notepad and not Excel.
>
> https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
>
> System Details:
> OS:
>> Windows 10.0.17134 Build 17134
>
> R Version:
>> platform x86_64-w64-mingw32
>> arch x86_64
>> os mingw32
>> system x86_64, mingw32
>> status
>> major 3
>> minor 4.1
>> year 2017
>> month 06
>> day 30
>> svn rev 72865
>> language R
>> version.string R version 3.4.1 (2017-06-30)
>> nickname Single Candle
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel