Hi Tomas,

Thank you very much for the detailed explanation! I think I now have a
somewhat better understanding of how things work; at least I now know that
I had not understood the concept of the "active code page". I'll follow
your advice when I need to fix packages that need some tweaks to handle
UTF-8 properly.

Sorry, I'd like to ask one more question, related to locales. If I copy the
following text and execute `read.csv("clipboard")`, it returns "uao" instead
of "úáö" (the characters are transliterated).

"col1","col2"
"úáö","úáö"

While this is probably the status quo on Latin-1 encoding (R 4.1 behaves
the same way), things are worse on CJK locales. If I try

"col1","col2"
"あ","い"

I get the following error:

> read.csv("clipboard")
Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec, :
  invalid multibyte string at '<82><a0>'

Is this supposed to work? It seems the characters are encoded as CP932 (my
system locale) but marked as UTF-8.

> x <- utils:::readClipboard()
> x
[1] "\"col1\",\"col2\"" "\"\x82\xa0\",\"\x82\xa2\""
> iconv(x, from = "CP932", to = "UTF-8")
[1] "\"col1\",\"col2\"" "\"あ\",\"い\""

I read the source code of readClipboard() in
src/library/utils/src/windows/util.c, but have no idea whether anything
there needs to be fixed.

Best,
Yutani

On Tue, Dec 21, 2021 at 17:26, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>
> Hi Yutani,
>
> On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
> > Hi,
> >
> > I'm more than excited about the announcement about the upcoming UTF-8
> > R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
> > work on Windows with non-UTF-8 encoding as the system locale? I think
> > this blog post indicates so (as this describes the older Windows than
> > the UTF-8 era), but I'm not fully confident if I understand the
> > details correctly.
>
> R 4.2 will automatically use UTF-8 as the active code page (system
> locale) and the C library encoding and the R current native encoding on
> systems which allow this (recent Windows 10 and newer, Windows Server
> 2022, etc). There is no way to opt-out from that, and of course no
> reason to, either.
> It does not matter what the system locale is set to in Windows for the
> whole system - these recent Windows versions allow individual
> applications to override the system-wide setting to UTF-8, which is what
> R does. Typically the system-wide setting will not be UTF-8, because
> many applications will not work with that.
>
> On older systems, R 4.2 will run in some other system locale and the
> same C library encoding and R current native encoding - the same system
> default as R 4.1 would run on that system. So for some time, encoding
> support for this in R will have to stay, but eventually will be removed.
> But yes, R 4.2 is still supposed to work on such systems.
>
> > https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html
> >
> > If so, I'm curious what the package authors should do when the locales
> > are different between OS and R. For example (disclaimer: I don't
> > intend to blame processx at all. Just for an example), the CRAN check
> > on the processx package currently fails with this warning on R-devel
> > Windows.
> >
> >> 1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte character
> >> at end of stream ignored
> > https://cran.r-project.org/web/checks/check_results_processx.html
> >
> > As far as I know, processx launches an external process and captures
> > its output, and I suspect the problem is that the output of the
> > process is encoded in non-UTF-8 while R assumes it's UTF-8. I
> > experienced similar problems with other packages as well, which
> > disappear if I switch the locale to the same one as the OS by
> > Sys.setlocale(). So, I think it would be great if there's some
> > guidance for the package authors on how to handle these properly.
>
> Incidentally I've debugged this case and sent a detailed analysis to the
> maintainer, so he knows about the problem.
>
> In short, you cannot assume in Windows that different applications use
> the same system encoding.
> That is not true, at least since the invention of the fusion manifests,
> which allow an application to switch to UTF-8 as its system encoding,
> which R does. So, when using an external application on Windows, you
> need to know and respect the specific encoding used by that application
> on input and output.
>
> As an example based on processx, you have an application which prints
> its arguments to standard output. If you do it this way:
>
> $ cat pr.c
> #include <stdio.h>
> #include <locale.h>
> #include <string.h>
>
> int main(int argc, char **argv) {
>
>     printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
>     int i;
>     for(i = 0; i < argc; i++) {
>         printf("Argument %d\n", i);
>         printf("%s\n", argv[i]);
>         /* dump each byte of the argument in hex and in decimal */
>         for(int j = 0; j < strlen(argv[i]); j++) {
>             printf("byte[%d] is %x (%d)\n", j,
>                    (unsigned char)argv[i][j], (unsigned char)argv[i][j]);
>         }
>     }
>     return 0;
> }
>
> the argument and hence output will be in the current native encoding of
> pr.c, because that's the encoding in which the argument will be received
> from Windows, so by default the system locale encoding, so by default
> not UTF-8 (on my system in Latin-1, as well as on CRAN check systems).
> One should also only use such programs with characters representable in
> Latin-1 on such systems. When you call such an application from R with
> UTF-8 as native encoding, Windows will automatically convert the
> arguments to Latin-1.
>
> The old Windows way to avoid this problem is to use the wide-character
> API (now UTF-16LE):
>
> $ cat prw.c
> #include <stdio.h>
> #include <locale.h>
> #include <string.h>
> #include <wchar.h>
>
> int wmain(int argc, wchar_t **argv) {
>
>     int i;
>     for(i = 0; i < argc; i++) {
>         wprintf(L"Argument %d\n", i);
>         /* print via a format string so '%' in the argument is safe */
>         wprintf(L"%ls\n", argv[i]);
>         /* dump each UTF-16 code unit of the argument in hex */
>         for(int j = 0; j < wcslen(argv[i]); j++)
>             wprintf(L"Word[%d] %x\n", j, (unsigned)argv[i][j]);
>     }
>     return 0;
> }
>
> When you call such a program from R with UTF-8 as native encoding,
> Windows will convert the arguments to UTF-16LE (so all characters will
> be representable).
> But you need to write Windows-specific code for this.
>
> The new Windows way to avoid this problem is to use UTF-8 as the native
> encoding via the fusion manifest, as R does. You can use the "pr.c" as
> above, but with something like
>
> $ cat pr.rc
> #include <windows.h>
> CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"
>
> $ cat pr.manifest
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
>   <assemblyIdentity
>     version="1.0.0.0"
>     processorArchitecture="amd64"
>     name="pr.exe"
>     type="win32"
>   />
>   <application>
>     <windowsSettings>
>       <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
>     </windowsSettings>
>   </application>
> </assembly>
>
> windres.exe -i pr.rc -o pr_rc.o
> gcc -o pr pr.c pr_rc.o
>
> When you build the application this way, it will use UTF-8 as native
> encoding, so when you call it from R (with UTF-8 as native encoding), no
> input conversion will occur. However, when you do this, the output from
> the application will also be in UTF-8.
>
> So, for applications you control, my recommendation would be to make
> them use Unicode one of these two ways. Preferably the new one, with the
> fusion manifest. Only if it were a Windows-only application, and had to
> work on older Windows, then the wide-character version (but such apps
> are probably not in R packages).
>
> When working with external applications you don't control, it is harder
> - you need to know which encoding they are expecting and producing, in
> whatever interface you use, and convert that, e.g. using iconv(). By the
> interface I mean that e.g., the command-line arguments are converted by
> Windows, but the input/output sent over a file/stream will not be.
>
> Of course, this works the other way around as well.
> If you were using R with some other external applications expecting a
> different encoding, you would need to handle that (by conversions).
> With applications you control, it would make sense to use this
> opportunity to switch to UTF-8. But, in principle, you can use iconv()
> from R directly or indirectly to convert input/output streams to/from
> a known encoding.
>
> I am happy to give more suggestions if there is interest, but for that
> it would be useful to have a specific example (with processx, it is
> clear what the options are; there the application is controlled by the
> package).
>
> Best
> Tomas
>
> > Any suggestions?
> >
> > Best,
> > Yutani
> >
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
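
As a footnote to the advice in the quoted reply (find out which encoding an external program produces, then convert it explicitly), here is a minimal sketch of that conversion step in C. It assumes a POSIX iconv(3) implementation such as glibc or GNU libiconv is available; this is not what R or processx do internally, and the helper name to_utf8 is hypothetical, introduced only for illustration.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of R or processx): convert the
   NUL-terminated string `in`, encoded in `from_enc`, to UTF-8.
   Writes at most outcap bytes (including the terminating NUL) to `out`.
   Returns the number of bytes written, excluding the NUL, or -1 on error
   (unknown encoding, invalid input, or output buffer too small). */
long to_utf8(const char *from_enc, const char *in, char *out, size_t outcap)
{
    if (outcap == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;        /* iconv() takes a non-const pointer */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outcap - 1;   /* keep one byte for the NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return (long)(outp - out);
}
```

For example, calling to_utf8("ISO-8859-1", "\xfa\xe1\xf6", buf, sizeof buf) on the Latin-1 bytes of "úáö" fills buf with the six UTF-8 bytes c3 ba c3 a1 c3 b6. Where the iconv implementation knows CP932, passing "CP932" and the bytes "\x82\xa0" from the clipboard example should yield the UTF-8 encoding of "あ", matching what iconv(x, from = "CP932", to = "UTF-8") does in R.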