[Rd] source(), parse(), and foreign UTF-8 characters

Kirill Müller Tue, 09 May 2017 00:43:25 -0700

Hi

I'm having trouble sourcing or parsing a UTF-8 file that contains characters that are not representable in the current locale ("foreign characters") on Windows. The source() function stops with an error, the parse() function reencodes all foreign characters using the <U+xxxx> notation. I have added a reproducible example below the message.

This seems well within the bounds of documented behavior, although the documentation to source() could mention that the file can't contain foreign characters. Still, I'd prefer if UTF-8 "just worked" in R, and I'm willing to invest substantial time to help with that. Before starting to write a detailed proposal, I feel that I need a better understanding of the problem, and I'm grateful for any feedback you might have.

I have looked into character encodings in the context of the dplyr package, and I have observed the following behavior:


- Strings are treated preferentially in the native encoding

- Only upon specific request (via translateCharUTF8() or enc2utf8() or ...), they are translated to UTF-8 and marked as such

- On UTF-8 systems, strings are never marked as UTF-8

- ASCII strings are marked as ASCII internally, but this information doesn't seem to be available, e.g., Encoding() returns "unknown" for such strings

- Most functions in R are encoding-agnostic: they work the same regardless of whether they receive a native or a UTF-8 encoded string, as long as it is properly tagged

- One important exception is symbols, which must be in the native encoding (and are always converted to native encoding, using <U+xxxx> escapes)

- I/O is centered around the native encoding, e.g., writeLines() always reencodes to the native encoding

- There is the "bytes" encoding which avoids reencoding.
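
A minimal sketch of how some of these observations can be checked in an interactive session. Most of the results depend on the locale of the session, so no output is shown; the variable names are illustrative only.

```r
# ASCII-only strings: Encoding() reports "unknown" even though they are
# flagged as ASCII internally.
ascii <- "happiness"
print(Encoding(ascii))

# A non-ASCII string created from a \u escape.
x <- "Gl\u00fcck"
print(Encoding(x))

# Explicit conversion between the native encoding and UTF-8.
x_native <- enc2native(x)
x_utf8 <- enc2utf8(x_native)
print(Encoding(c(x_native, x_utf8)))

# The underlying bytes show whether a conversion really changed the data.
print(charToRaw(x_native))
print(charToRaw(x_utf8))

# Marking a non-ASCII string as "bytes" asks R to treat it as raw bytes
# rather than as text in some encoding.
x_bytes <- x
Encoding(x_bytes) <- "bytes"
print(Encoding(x_bytes))
```

Comparing the two charToRaw() results is a quick way to see whether the round trip through the native encoding lost information in the current locale.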

I haven't looked into serialization or plot devices yet.

The conclusion to the "UTF-8 manifesto" [1] suggests "... to use UTF-8 narrow strings everywhere and convert them back and forth when using platform APIs that don’t support UTF-8 ...". (It is written in the context of the UTF-16 encoding used internally on Windows, but seems to apply just the same here for the native encoding.) I think that Unicode support in R could be greatly improved if we follow these guidelines. This seems to mean:

- Convert strings to UTF-8 as soon as possible, and mark them as such (also on systems where UTF-8 is the native encoding)

- Translate to native only upon specific request, e.g., in calls to API functions or perhaps for .C()

- Use UTF-8 for symbols

- Avoid the forced round-trip to the native encoding in I/O functions and for parsing (but still read/write native by default)

- Carefully look into serialization and plot devices

- Add helper functions that simplify mundane tasks such as reading/writing a UTF-8 encoded file
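
To illustrate the last item, here is a rough sketch of what such helpers might look like. The names write_lines_utf8() and read_lines_utf8() are hypothetical and do not exist in base R; the sketch also assumes that options("encoding") is left at its default of "native.enc", so that readLines() does not reencode the input.

```r
# Hypothetical helper: write a character vector to a file as UTF-8.
# useBytes = TRUE stops writeLines() from reencoding to the native encoding.
write_lines_utf8 <- function(text, path) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeLines(enc2utf8(text), con, useBytes = TRUE)
  invisible(path)
}

# Hypothetical helper: read a UTF-8 file and mark the lines as UTF-8.
read_lines_utf8 <- function(path) {
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

# Round trip with a string that many native encodings cannot represent.
id <- "\u5e78\u798f"
path <- tempfile(fileext = ".txt")
write_lines_utf8(id, path)
back <- read_lines_utf8(path)

# Compare the raw bytes rather than the printed form, which is
# locale-dependent.
print(identical(charToRaw(back), charToRaw(id)))
```

The binary-mode connection is a design choice: it keeps line endings and bytes exactly as written, independent of the platform.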

I'm sure I've missed many potential pitfalls; your input is greatly appreciated. Thanks for your attention.

Further resources: A write-up by Prof. Ripley [2], a section in R-ints [3], a blog post by Ista Zahn [4], a StackOverflow search [5].



Best regards

Kirill



[1] http://utf8everywhere.org/#conclusions

[2] https://developer.r-project.org/Encodings_and_R.html

[3] https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Encodings-for-CHARSXPs

[4] http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encoding-in-r/

[5] http://stackoverflow.com/search?tab=votes&q=%5br%5d%20encoding%20windows%20is%3aquestion




# Use one of the following:
id <- "Gl\u00fcck"
id <- "\u5e78\u798f"
id <- "\u0441\u0447\u0430\u0441\u0442\u044c\u0435"
id <- "\ud589\ubcf5"

file_contents <- paste0('"', id, '"')
Encoding(file_contents)
raw_file_contents <- charToRaw(file_contents)

path <- tempfile(fileext = ".R")
writeBin(raw_file_contents, path)
file.size(path)
length(raw_file_contents)

# Escapes the string
parse(text = file_contents)

# Throws an error
print(source(path, encoding = "UTF-8"))

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
