Hi Yutani,
On 12/21/21 6:34 AM, Hiroaki Yutani wrote:
Hi,
I'm more than excited about the announcement about the upcoming UTF-8
R on Windows. Let me confirm my understanding. Is R 4.2 supposed to
work on Windows with non-UTF-8 encoding as the system locale? I think
this blog post indicates so (as this describes the older Windows than
the UTF-8 era), but I'm not fully confident if I understand the
details correctly.
R 4.2 will automatically use UTF-8 as the active code page (system
locale), the C library encoding, and the R current native encoding on
systems which allow this (recent Windows 10 and newer, Windows Server
2022, etc.). There is no way to opt out of that, and of course no
reason to, either. It does not matter what the system locale is set to
for the whole Windows system - these recent versions of Windows allow
individual applications to override the system-wide setting to UTF-8,
which is what R does. Typically the system-wide setting will not be
UTF-8, because many applications would not work with that.
On older systems, R 4.2 will run in some other system locale, with the
corresponding C library encoding and R current native encoding - the
same system default that R 4.1 would use on that system. So for some
time, support for such encodings will have to stay in R, but eventually
it will be removed. But yes, R 4.2 is still supposed to work on such
systems.
https://developer.r-project.org/Blog/public/2021/12/07/upcoming-changes-in-r-4.2-on-windows/index.html
If so, I'm curious what the package authors should do when the locales
are different between OS and R. For example (disclaimer: I don't
intend to blame processx at all. Just for an example), the CRAN check
on the processx package currently fails with this warning on R-devel
Windows.
1. UTF-8 in stdout (test-utf8.R:85:3) - Invalid multi-byte character at
end of stream ignored
https://cran.r-project.org/web/checks/check_results_processx.html
As far as I know, processx launches an external process and captures
its output, and I suspect the problem is that the output of the
process is encoded in non-UTF-8 while R assumes it's UTF-8. I
experienced similar problems with other packages as well, which
disappear if I switch the locale to the same one as the OS by
Sys.setlocale(). So, I think it would be great if there's some
guidance for the package authors on how to handle these properly.
Incidentally I've debugged this case and sent a detailed analysis to the
maintainer, so he knows about the problem.
In short, on Windows you cannot assume that different applications use
the same system encoding. That has not been true at least since the
introduction of fusion manifests, which allow an application to switch
to UTF-8 as its system encoding - which is what R does. So, when using
an external application on Windows, you need to know and respect the
specific encoding used by that application on input and output.
As an example based on processx, you have an application which prints
its argument to standard output. If you do it this way:
$ cat pr.c
#include <stdio.h>
#include <locale.h>
#include <string.h>

int main(int argc, char **argv) {
    printf("Locale set to: %s\n", setlocale(LC_ALL, ""));
    int i;
    for(i = 0; i < argc; i++) {
        printf("Argument %d\n", i);
        printf("%s\n", argv[i]);
        for(int j = 0; j < (int)strlen(argv[i]); j++)
            printf("byte[%d] is %x (%d)\n", j,
                   (unsigned char)argv[i][j], (unsigned char)argv[i][j]);
    }
    return 0;
}
the argument, and hence the output, will be in the current native
encoding of pr.c, because that's the encoding in which the argument
will be received from Windows - so by default the system locale
encoding, hence by default not UTF-8 (Latin-1 on my system, as well as
on the CRAN check systems). One should therefore only use such programs
with characters representable in Latin-1 on such systems. When you call
such an application from R with UTF-8 as the native encoding, Windows
will automatically convert the arguments to Latin-1.
The old Windows way to avoid this problem is to use the wide-character
API (now UTF-16LE):
$ cat prw.c
#include <stdio.h>
#include <wchar.h>

/* with MinGW-w64, build with -municode so that wmain is the entry point */
int wmain(int argc, wchar_t **argv) {
    int i;
    for(i = 0; i < argc; i++) {
        wprintf(L"Argument %d\n", i);
        wprintf(L"%ls\n", argv[i]);
        for(int j = 0; j < (int)wcslen(argv[i]); j++)
            wprintf(L"Word[%d] %x\n", j, (unsigned)argv[i][j]);
    }
    return 0;
}
When you call such a program from R with UTF-8 as native encoding,
Windows will convert the arguments to UTF-16LE (so all characters will
be representable). But you need to write Windows-specific code for this.
The new Windows way to avoid this problem is to use UTF-8 as the native
encoding via the fusion manifest, as R does. You can use the "pr.c" as
above, but with something like
$ cat pr.rc
#include <windows.h>
CREATEPROCESS_MANIFEST_RESOURCE_ID RT_MANIFEST "pr.manifest"
$ cat pr.manifest
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
<assemblyIdentity
version="1.0.0.0"
processorArchitecture="amd64"
name="pr.exe"
type="win32"
/>
<application>
<windowsSettings>
<activeCodePage
xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
</windowsSettings>
</application>
</assembly>
$ windres.exe -i pr.rc -o pr_rc.o
$ gcc -o pr pr.c pr_rc.o
When you build the application this way, it will use UTF-8 as the
native encoding, so when you call it from R (with UTF-8 as native
encoding), no input conversion will occur. However, when you do this,
the output from the application will also be in UTF-8.
So, for applications you control, my recommendation would be to make
them use Unicode in one of these two ways - preferably the new one,
with the fusion manifest. Only if it were a Windows-only application
that had to work on older Windows would the wide-character version make
sense (but such apps are probably not in R packages).
When working with external applications you don't control, it is harder
- you need to know which encoding they expect and produce in whatever
interface you use, and convert accordingly, e.g. using iconv(). By the
interface I mean that, for example, command-line arguments are converted
by Windows, but input/output sent over a file/stream will not be.
Of course, this works the other way around as well. If you were using R
with some other external applications expecting a different encoding,
you would need to handle that (by conversions). With applications you
control, it would make sense to use this opportunity to switch to
UTF-8. But, in principle, you can use iconv() from R, directly or
indirectly, to convert input/output streams to/from a known encoding.
I am happy to give more suggestions if there is interest, but for that
it would be useful to have a specific example (with processx, it is
clear what the options are, as the application there is controlled by
the package).
Best
Tomas
Any suggestions?
Best,
Yutani
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel