[R-pkg-devel] Package Encoding and Literal Strings

2020-12-16 Thread jo...@jorisgoosen.nl
Hello All,

Some context: I am one of the programmers of a software package (
https://jasp-stats.org/) that uses an embedded instance of R to do
statistics, and makes that a bit easier for people who are intimidated by R
or who prefer something more GUI-oriented.


We have been working on translating the interface, but ran into several
problems related to string encoding. We prefer to use UTF-8 for
everything, and this works wonderfully on Unix systems, as is to be expected.

Windows, however, is a different matter. Currently I am working on some local
changes to "do_gettext" and some related internal functions of R, to be able
to get UTF-8 encoded output from there.

But I ran into a bit of a problem, and I think this mailing list is probably
the best place to start.

It seems that if I have an R package that specifies "Encoding: UTF-8" in
DESCRIPTION, the literal strings inside the package are converted to the
local codeset/codepage regardless of what I want.
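For illustration, the effect can be made visible with a couple of base R calls (the string below is just an example, not taken from our package):

```r
# A \u escape always produces a UTF-8-marked string in R, so this shows the
# bytes a UTF-8 literal is expected to keep:
s <- "Math\u00f4t!"
Encoding(s)      # "UTF-8"
charToRaw(s)     # "ô" stored as the two bytes c3 b4
# The same literal loaded from a package on Windows instead comes back
# translated to the native codepage, with "ô" as the single byte f4.
```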

Is it possible to keep the strings in UTF-8 internally in such a package
somehow?

Best regards,
Joris Goosen
University of Amsterdam


__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] Package Encoding and Literal Strings

2020-12-16 Thread jo...@jorisgoosen.nl
David,

Thanks for the response!

So the problem is a bit worse than just setting `encoding="UTF-8"` on
functions like readLines.
I'll describe our setup a bit:
We run R embedded in a separate executable and, through a whole bunch of
C(++) magic, connect that to the main executable that runs the actual interface.
All the code that isn't R basically uses UTF-8. This works well, and we've
made sure that all of our source code is encoded properly; I've verified
that, for this particular problem at least, my source file is definitely
encoded in UTF-8 (I've checked a hexdump).
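For reference, one quick way to check the on-disk bytes of such a string (assuming a POSIX shell with `od` available):

```shell
# Dump the bytes of the example string: in a UTF-8 source file the "ô"
# must show up as the byte pair c3 b4; latin-1 would store the single byte f4.
printf 'Mathôt!' | od -An -tx1
```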

The simplest solution, which we initially took, to get R+Windows to
cooperate with everything was to simply set the locale to "C" before
starting R. That way R simply assumes UTF-8 is native, and everything worked
splendidly. Until, of course, a file needed to be opened in R that contained
some non-ASCII characters. I noticed the problem because a Korean user had
hangul in his username and that broke everything, because R was trying
to convert to a different locale than Windows was using.
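In terms of code, that workaround was essentially the following, run before R does any real work (a sketch, not our exact embedding code):

```r
# Old workaround sketch: force the C locale so R stops translating strings
# to the Windows codepage. This is what later broke on non-ASCII paths,
# because Windows system calls still use the real codepage.
Sys.setlocale("LC_ALL", "C")
Sys.getlocale("LC_CTYPE")   # should now report "C"
```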

The solution I've now been working on is this:
I took the source code of R 4.0.3 and changed the backend of "gettext" to
add an `encoding="something something"` option, plus a bit of extra stuff
like `bind_textdomain_codeset` in case I need to tweak the codeset/charset
that gettext uses.
I think I've got that working properly now, and once I solve the problem of
the encoding in a package I will open a bug report / feature request and
attach a patch that implements it.

The problem I'm stuck with now is simply this:
I have an R package here that I want to test the translations with; the code
is definitely saved as UTF-8, the package has "Encoding: UTF-8" in the
DESCRIPTION, and it all loads and works. The particular problem I have is
that the R code literally contains: `mathotString <- "Mathôt!"`
The actual file contains the hexadecimal representation of ô as proper
UTF-8, "0xC3 0xB4", but R turns it into "0xF4".
Seemingly this happens on loading the package, because I haven't done
anything with it except pass it to my debug C function to print its contents
as hexadecimals...

The only thing I want to achieve here is that when R loads the package it
keeps those strings in their original UTF-8 encoding, without converting them
to "native" or to the strange Unicode code point it seemingly placed there
instead. Otherwise I cannot get gettext to work fully in UTF-8 mode.
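For strings that are representable in the native codepage there is at least a partial workaround on the R side, though it cannot recover characters the codepage never could represent:

```r
# Force a (representable) natively-encoded string back to UTF-8 on access.
mathotString <- iconv("Math\u00f4t!", "UTF-8", "latin1")  # simulate the damage
utf8Copy <- enc2utf8(mathotString)
Encoding(utf8Copy)    # "UTF-8"
charToRaw(utf8Copy)   # c3 b4 is back for "ô"
```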

Is this already possible in R?

Cheers,
Joris


On Wed, 16 Dec 2020 at 20:15, David Bosak  wrote:

> Joris:
>
>
>
> I’ve fought with encoding problems on Windows a lot.  Here are some
> general suggestions.
>
>
>
>1. Put “@encoding UTF-8” on any Roxygen comments.
>2. Put “encoding = “UTF-8” on any functions like writeLines or
>readLines that read/write to a text file.
>3. This post:
>https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
>
>
>
> If you have a more specific problem, please describe and we can try to
> help.
>
>
>
> David
>
>
>


Re: [R-pkg-devel] Package Encoding and Literal Strings

2020-12-17 Thread jo...@jorisgoosen.nl
On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera 
wrote:

> On 12/16/20 11:07 PM, jo...@jorisgoosen.nl wrote:
>
> Setting locale to "C" does not make R assume UTF-8 is the native
> encoding, there is no way to make UTF-8 the current native encoding in R
> on the current builds of R on Windows. This is an old limitation of
> Windows, only recently fixed by Microsoft in recent Windows 10 and with
> UCRT Windows runtime (see my blog post [1] for more - to make R support
> this we need a new toolchain to build R).
>
> If you set the locale to C encoding, you are telling R the native
> encoding is C/POSIX (essentially ASCII), not UTF-8. Encoding-sensitive
> operations, including conversions, including those conversions that
> happen without user control e.g. for interacting with Windows, will
> produce incorrect results (garbage) or, in better cases, errors, warnings,
> or omitted, substituted or transliterated characters.
>
> In principle setting the encoding via locale is dangerous on Windows,
> because Windows has two current encodings, not just one. By setting
> locale you set the one used in the C runtime, but not the other one used
> by the system calls. If all code (in R, packages, external libraries)
> was perfect, this would still work as long as all strings used were
> representable in both encodings. For other strings it won't work, and
> the code is not perfect in this regard: it is usually written assuming
> there is one current encoding, which common sense dictates should be the
> case. With the recent UTF-8 support ([1]), one can switch both of these
> to UTF-8.
>

Well, this is exactly why I want to get rid of the situation. It messes up
the output because everything else expects UTF-8, which is why I'm looking
for some kind of solution.



>
> A number of similar "shortcuts" have been added to R in the past, but
> they make the code more complex, harder to maintain and use, and can't
> realistically solve all of these problems anyway. Strings will
> eventually be assumed to be in the current native encoding by the C
> library, whether in R, in any external code R uses, or in code R packages
> use. Now that Microsoft is finally supporting UTF-8, the way to get out of
> this is switching to UTF-8. This needs only small changes to R source
> code compared to those "shortcuts" (or to using UTF-16LE). I'd be
> against polluting the code with any more "shortcuts".
>

I think the addition of "bind_textdomain_codeset" is not strictly
necessary and can be left out, because I think setting an environment
variable such as "OUTPUT_CHARSET=UTF-8" gives the same result for us.
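For reference, OUTPUT_CHARSET is an environment variable honored by GNU libintl, so the idea is roughly the following (a sketch; it needs to be set before the first translation is requested):

```r
# Ask libintl to emit translated messages in UTF-8 regardless of locale.
Sys.setenv(OUTPUT_CHARSET = "UTF-8")
# Any subsequent gettext lookup should then return UTF-8 output, e.g.:
gettext("cannot open file", domain = "R")
```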
The addition of the "encoding" option to the internal "do_gettext" is just
a few lines of code, and I also undid some duplication between do_gettext
and do_ngettext, which should make it easier to maintain. But all of that
is moot if there is

Re: [R-pkg-devel] Package Encoding and Literal Strings

2020-12-17 Thread jo...@jorisgoosen.nl
PS: I will try to have a go at using your experimental version to see if
that could help us out. If I run into trouble I will mail you personally.

On Thu, 17 Dec 2020 at 17:17, jo...@jorisgoosen.nl 
wrote:

Re: [R-pkg-devel] Package Encoding and Literal Strings

2020-12-17 Thread jo...@jorisgoosen.nl
On Thu, 17 Dec 2020 at 18:22, Tomas Kalibera 
wrote:


Re: [R-pkg-devel] Package Encoding and Literal Strings

2020-12-18 Thread jo...@jorisgoosen.nl
Hello Tomas,

I have made a minimal example that demonstrates my problem:
https://github.com/JorisGoosen/utf8StringsPkg

This package is encoded in UTF-8, as is Test.R. There is a little Rcpp
function in there that I wrote to display the bytes straight from R's CHAR,
to be sure no conversion is happening.
I would expect mathotString to contain "C3 B4" for "ô", but instead it gets
"F4", as you can see when you run
`utf8StringsPkg::testutf8_in_locale()`.
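The two byte sequences can also be reproduced directly in R with iconv, which is what makes the F4 byte recognizable as latin-1:

```r
# "ô" in its two encodings, shown as raw bytes.
charToRaw(enc2utf8("\u00f4"))                  # c3 b4 (UTF-8)
charToRaw(iconv("\u00f4", "UTF-8", "latin1"))  # f4    (latin-1)
```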

Cheers,
Joris



On Fri, 18 Dec 2020 at 11:48, Tomas Kalibera 
wrote:


Re: [R-pkg-devel] Package Encoding and Literal Strings

2020-12-21 Thread jo...@jorisgoosen.nl
Hello Tomas,

Thank you for the feedback; your summary of how things now work, and of what
goes wrong for the tao and mathot strings, confirms all of my suspicions.
It also describes my exact problem fairly well.

It seems it does come down to R not keeping the UTF-8 encoding of the
literal strings on Windows with a "typical codepage" when loading a
package, despite reading them from file in that particular encoding and
despite the same being specified in DESCRIPTION.
Meanwhile, `eval(parse(..., encoding="UTF-8"))` *does* keep the encoding on
the literal strings, which means there is some discrepancy between the two.
Does that mean loading a package uses a different path than
`eval(parse(..., encoding="UTF-8"))`?
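A self-contained way to see the parse() behavior, with a temp file standing in for the package source (a sketch):

```r
# parse(encoding = "UTF-8") marks string literals as UTF-8, unlike the
# normal package install/load path on Windows.
src <- tempfile(fileext = ".R")
writeLines('mathotString <- "Math\\u00f4t!"', src)  # ASCII-only on disk
eval(parse(src, encoding = "UTF-8"))
Encoding(mathotString)   # "UTF-8"
charToRaw(mathotString)  # contains c3 b4 for "ô"
```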

You mention:
> Strings that cannot be represented in the native encoding like tao will
get the escapes, and so cannot be converted back to UTF-8. This is not
great, but I  see it was the case already in 3.6 (so not a recent
regression) and I don't think it would be worth the time trying to fix that
- as discussed earlier, only switching to UTF-8 would fix all of these
translations, not just one.

By "not a recent regression", do you mean it used to work the same for both,
keeping the UTF-8 encoding?
I've tried R 3 and it already doesn't work there; I also tried 2.8 but
couldn't get my test package (simplified to use "charToRaw" instead of a
C call) to install there.
However, having this work would already be quite useful, as our custom GUI
on top of R is fully UTF-8 anyhow.

I would certainly be up for figuring out how to fix the regression so
that we can use this until your work on the UTF-8 version with UCRT is
released.
On the other hand, maybe this would not be the wisest investment of my
time.

I've tried using the installer and toolchain you linked to in
https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
and used that to compile our software.
This normally works with the Rtools toolchain, but it seems that "make" is
missing from your toolchain. When I build (our project with RInside in it)
with your toolchain at the beginning of PATH, using mingw32-make from
rtools40, I run into problems with a missing "cc1plus".

If I read https://mxe.cc/ correctly, it seems it is meant for
cross-compiling, not for building locally on Windows?
Maybe that is what is going wrong.
Despite trying for quite a bit, I couldn't get our software to compile in
such a way that it could link with R,
which means I couldn't test whether it solves our problem...

Cheers,
Joris


On Fri, 18 Dec 2020 at 18:05, Tomas Kalibera 
wrote:

> Hi Joris,
>
> thanks for the example. You can actually simply have Test.R assign the two
> variables and then run
>
> Encoding(utf8StringsPkg1::mathotString)
> charToRaw(utf8StringsPkg1::mathotString)
> Encoding(utf8StringsPkg1::tao)
> charToRaw(utf8StringsPkg1::tao)
>
> I tried on Linux, Windows/UTF-8 (the experimental version) and
> Windows/latin-1 (released version). In all cases, both strings are
> converted to native encoding. The mathotString is converted to latin-1
> fine, because it is representable there. The tao string when running in
> latin-1 locale gets the escapes :
>
> "<99><86>"
>
> Btw, the parse(,encoding="UTF-8") hack works, when you parse the modified
> Test.R file (with the two assignments), and eval the output, you will get
> those strings in UTF-8. But when you don't eval and print the parse tree in
> Rgui, it will not be printed correctly (again a limitation of these hacks,
> they could only do so much).
>
> When accessing strings from C, you should always be prepared for any
> encoding in a CHARSXP, so when you want UTF-8, use "translateCharUTF8()"
> instead of "CHAR()". That will work fine on representable strings like
> mathotString, and that is conceptually the correct way to access them.
>
> Strings that cannot be represented in the native encoding like tao will
> get the escapes, and so cannot be converted back to UTF-8. This is not
> great, but I  see it was the case already in 3.6 (so not a recent
> regression) and I don't think it would be worth the time trying to fix that
> - as discussed earlier, only switching to UTF-8 would fix all of these
> translations, not just one. Btw, the example works fine on the
> experimentation UTF-8 build on Windows.
>
> I am sorry there is not a simple fix for non-representable characters.
>
> Best
> Tomas
>
>
>

Re: [R-pkg-devel] Package Encoding and Literal Strings

2020-12-22 Thread jo...@jorisgoosen.nl
Hello Tomas,

On Mon, 21 Dec 2020 at 21:21, Tomas Kalibera 
wrote:

> Hi Joris,
>
> On 12/21/20 7:33 PM, jo...@jorisgoosen.nl wrote:
>
> Hello Tomas,
>
> Thank you for the feedback and your summary of how things now work and
> what goes wrong for the tao- and mathot-string confirms all of my
> suspicions. And it also describes my exact problem fairly well.
>
> It seems it does come down to R not keeping the UTF-8 encoding of the
> literal strings on Windows with a "typical codepage" when loading a
> package.
> This despite reading it from file in that particular encoding and also
> specifying the same in DESCRIPTION.
> While `eval(parse(..., encoding="UTF-8"))` *does* keep the encoding on the
> literal strings. Which means there is some discrepancy between the two.
> That means the way a package is loaded it uses a different path then when
> using `eval(parse(..., encoding="UTF-8"))`?
>
> Yes, it must be a different path. The DESCRIPTION field defines what
> encoding is the input in, so that R can read it. It does not tell R how it
> should represent the strings internally. The behavior is ok, well except
> for non-representable characters.
>
> You mention:
> > Strings that cannot be represented in the native encoding like tao will
> get the escapes, and so cannot be converted back to UTF-8. This is not
> great, but I  see it was the case already in 3.6 (so not a recent
> regression) and I don't think it would be worth the time trying to fix that
> - as discussed earlier, only switching to UTF-8 would fix all of these
> translations, not just one.
>
> Not a recent regression means it used to work the same for both and
> keeping the UTF-8 encoding?
> I've tried R 3 and it already doesnt work there, I also tried 2.8 but
> couldnt get my testpkg (simplified to use "charToRaw" instead of a C-call)
> to install there.
> However, having this work would already be quite useful as our custom GUI
> on top of R is fully UTF-8 anyhow.
>
> By "not a recent regression" I meant it wasn't broken recently. It
> probably never worked the way you (and me and probably everyone else) would
> like it to work, that is it probably always translated to native encoding,
> because that was the only option except rewriting all of our code,
> packages, external libraries to use UTF-16LE (as discussed before).
>

Too bad, but that was what I was afraid of in the first place.

> And I would certainly be up for figuring out how to fix the regression so
> that we can use this until your work on the UTF-8 version with UCRT is
> released.
> On the other hand, maybe this would not be the wisest investment of my
> time.
>
> I bet your applications do more than just load a package and then access
> string literals in the code. And as soon as you do anything with those
> strings, R may translate them to native encoding (well unless we document
> this does not happen, typically some code around connections, file paths,
> etc). So, providing a shortcut for this case I am afraid wouldn't help you
> much. If the problem was just parsing, you could also use "\u" escapes as
> workaround in the literals. Remember, the parse(, encoding="UTF-8")
> approach could only work in single-byte encodings.
>
Ah yeah, the original problem with that was that the `xgettext` parsing
script doesn't know how to handle those escapes. But that means we will
just have to fix that then.
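For completeness, the escape workaround means keeping the package source pure ASCII, so install-time re-encoding has nothing to translate:

```r
# The parser expands \u escapes to UTF-8 at parse time; the source file
# itself then contains only ASCII bytes, so nothing can be mis-translated.
mathotString <- "Math\u00f4t!"   # the same string as the literal "Mathôt!"
Encoding(mathotString)           # "UTF-8"
```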

> I've tried using the installer and toolchain you linked to in
> https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
> and use that to compile our software.
> This normally works with the Rtools toolchain, but it seems that "make" is
> missing from your toolchain. When I build (our project with Riniside in it)
> using your toolchain in the beginning of PATH and using mingw32-make from
> rtools40 I run into problems of a missing "cc1plus".
>
> Sorry, building native code is still involved with that demo. You would
> have to set PATHs and well maybe alter the installation or build from
> source, as described in
>
> https://svn.r-project.org/R-dev-web/trunk/WindowsBuilds/winutf8/winutf8.html
>
> What might be actually easier, you could try a current development
> version, I will send you a link.
>

I got the link and will have a go at that, and will reply there with any
remarks or questions.

Cheers,
Joris

> If I read https://mxe.cc/ it seems it is meant for cross-compiling, not
> locally on Windows?
> Maybe that is what is going wrong.
> But despite trying for quite a bit I couldn't get our software to compile
> in such a way it could link with R.
> Which means I cou