Windows Notepad prefixes UTF-8 files with a Byte Order Mark (U+FEFF). Per https://en.wikipedia.org/wiki/Byte_order_mark, this is permitted in UTF-8 but not required. I suppose that other Windows programs besides Excel and Notepad do likewise.
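For what it's worth, here is a minimal sketch of how such a BOM could be checked for and skipped when reading a file in R. The temporary file and its contents are made up for illustration; "UTF-8-BOM" is the encoding name described in ?file. Note that this only deals with the BOM, not with the "∞" to "8" conversion discussed further down.

```r
# Write a small UTF-8 file that starts with a BOM, as Notepad would.
# (Hypothetical example file; the contents are plain ASCII on purpose.)
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))   # U+FEFF encoded as UTF-8
writeBin(c(bom, charToRaw("a,b\n1,2\n")), tmp)

# Check whether the first three bytes of the file are the UTF-8 BOM.
has_bom <- identical(readBin(tmp, what = "raw", n = 3L), bom)
print(has_bom)

# fileEncoding = "UTF-8-BOM" asks the connection to drop a leading BOM
# if one is present, so it does not end up in the first column name.
df <- read.table(tmp, sep = ",", header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(df))   # expected: "a" "b"
```

Note that `fileEncoding` controls how the file is re-encoded as it is read, while `encoding` only marks the resulting strings, so the two arguments are not interchangeable.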
"The Unicode Standard permits the BOM in UTF-8 <https://en.wikipedia.org/wiki/UTF-8>,[3] <https://en.wikipedia.org/wiki/Byte_order_mark#cite_note-3> but does not require or recommend its use.[4] <https://en.wikipedia.org/wiki/Byte_order_mark#cite_note-4> Byte order has no meaning in UTF-8,[5] <https://en.wikipedia.org/wiki/Byte_order_mark#cite_note-utf-8-bom-5> so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[6] <https://en.wikipedia.org/wiki/Byte_order_mark#cite_note-6>[7] <https://en.wikipedia.org/wiki/Byte_order_mark#cite_note-7> The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[8] <https://en.wikipedia.org/wiki/Byte_order_mark#cite_note-rfc3629-8>" On Thu, Feb 7, 2019 at 8:10 AM Daniel Possenriede <possenri...@gmail.com> wrote: > There seems to be something odd with "∞" on Windows (and not only with > read.table) > In native encoding (cp-1252 in my case), "∞" gets converted to "8" > > x <- "∞" > Encoding(x) > #> [1] "unknown" > print(x) > #> [1] "8" > charToRaw(x) > #> [1] 38 > > "∞" is indeed "8" > > identical(x, "8") > #> [1] TRUE > > Everything seems fine if "∞" is UTF-8 encoded. > > y <- "\u221E" > Encoding(y) > #> [1] "UTF-8" > print(y) > #> [1] "∞" > charToRaw(y) > #> [1] e2 88 9e > > Unless the string is converted back to native encoding. 
>
> format(y)
> #> [1] "8"
>
> This ought to be "<U+221E>", equivalently to
>
> format("∝")
> #> [1] "<U+221D>"
>
> Session Info:
>
> si <- sessionInfo()
> si$running
> #> [1] "Windows 10 x64 (build 17134)"
> si$R.version$version.string
> #> [1] "R version 3.5.2 (2018-12-20)"
> si$locale
> #> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
>
> On Thu, Feb 7, 2019 at 14:33, David Byrne <david.byrne...@gmail.com> wrote:
>
> > I can confirm that it doesn't happen on Ubuntu 18.04.1 so Peter is
> > most likely correct; it looks like it's Windows specific.
> >
> > On Thu, 7 Feb 2019 at 12:55, peter dalgaard <pda...@gmail.com> wrote:
> > >
> > > This doesn't seem to be happening on MacOS, neither in Terminal nor
> > > RStudio, (R 3.5.1, R-devel, R-patched). So probably Windows specific.
> > >
> > > -pd
> > >
> > > > On 7 Feb 2019, at 11:17 , David Byrne <david.byrne...@gmail.com> wrote:
> > > >
> > > > Bug
> > > > Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> > > > file containing the infinity symbol (' ∞ ') results in the infinity
> > > > symbol imported as the number 8. Other Unicode characters seem
> > > > unaffected, for example, Zhe: ж
> > > >
> > > > Expected Behavior:
> > > > The imported data.frame should represent the infinity symbol as the
> > > > expected 'Inf' so that normal mathematical operations can be processed
> > > >
> > > > Stack Overflow Post:
> > > > I created a question on Stack Overflow where one other member was able
> > > > to reproduce the same issues I was having.
> > > > This question can be found at:
> > > > https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
> > > >
> > > > Method to Reproduce - 1:
> > > > A simple method to reproduce this issue is to use R-Studio: In the
> > > > console, type the following:
> > > >> read.table(text=" ∞", encoding="UTF-8")
> > > >
> > > > The result should be a data.frame with a single value of '8'
> > > >
> > > > Repeating the same with ж results in the correct expected behavior
> > > >
> > > > Method to Reproduce - 2:
> > > > Create a .csv file containing the infinity and Zhe characters (I have
> > > > attached the file for convenience, hopefully it is not rejected by your
> > > > email service). Launch an interactive session using
> > > >
> > > >> r --vanilla
> > > >
> > > > Enter the following statement taking care to replace the
> > > > <path-to-file> with the appropriate one:
> > > >
> > > >> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
> > > >
> > > > This should result in a two element data.frame; the first being the
> > > > incorrect value of 8 with an additional <U+FEFF> and the second the
> > > > correct value of Zhe.
> > > >
> > > > Note the additional <U+FEFF> prefixed to the front of the '8'. This
> > > > appears to be a hidden character for the purposes of letting editors
> > > > know the encoding. The following link has some explanation; however, it
> > > > states this is caused by Excel. The file I created was done so using
> > > > Notepad and not Excel.
> > > >
> > > > https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
> > > >
> > > > System Details:
> > > > OS:
> > > >> Windows 10.0.17134 Build 17134
> > > >
> > > > R Version:
> > > >> platform       x86_64-w64-mingw32
> > > >> arch           x86_64
> > > >> os             mingw32
> > > >> system         x86_64, mingw32
> > > >> status
> > > >> major          3
> > > >> minor          4.1
> > > >> year           2017
> > > >> month          06
> > > >> day            30
> > > >> svn rev        72865
> > > >> language       R
> > > >> version.string R version 3.4.1 (2017-06-30)
> > > >> nickname       Single Candle
> > > > ______________________________________________
> > > > R-devel@r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-devel
> > >
> > > --
> > > Peter Dalgaard, Professor,
> > > Center for Statistics, Copenhagen Business School
> > > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > > Phone: (+45)38153501
> > > Office: A 4.23
> > > Email: pd....@cbs.dk Priv: pda...@gmail.com