Re: [Rd] Windows iconv() "failure" in certain locales

2017-06-28 Thread Duncan Murdoch

On 27/06/2017 11:36 AM, Martin Maechler wrote:

This is a continuation of the R-devel thread with subject
 "suggestion to fix packageDescription() for Windows users" :

As I said there, a patch should address the underlying
problem in packageDescription() rather than be a kludgy
workaround patch for citation().
(For that same reason, Ben Marwick proposed to fix
 packageDescription() rather than the symptom seen in citation().)

It's not hard to see that the problem is that  iconv() in
Windows does not always succeed in translating from "UTF-8" to the
"current locale", in the case mentioned there.

I'm giving some easier reproducible examples:  no need to install
half of tidyverse just to get citation("readr") :


x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")



Sys.setlocale("LC_CTYPE", "Chinese")

[1] "Chinese (Simplified)_People's Republic of China.936"


iconv(x1, "latin1", "") # NA NA NA

[1] NA NA NA

iconv(xU, "UTF-8", "") # NA NA NA

[1] NA NA NA

iconv(xU, "UTF-8", "//TRANSLIT")

[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"

iconv(xU, "UTF-8", "", sub = "byte")

[1] "Ekstr<c3><b8>m" "J<c3><b6>reskog" "bi<c3><9f>chen Z¨¹rcher"



Sys.setlocale("LC_CTYPE", "Arabic")

[1] "Arabic_Saudi Arabia.1256"

iconv(x1, "latin1", "")  # NA NA NA

[1] NA NA NA

iconv(xU, "UTF-8", "")  # NA NA NA

[1] NA NA NA

iconv(xU, "UTF-8", "//TRANSLIT")

[1] "Ekstr\370m" "J\366reskog" "bißchen Zürcher"

iconv(xU, "UTF-8", "", sub="byte")

[1] "Ekstr<c3><b8>m" "J<c3><b6>reskog" "bi<c3><9f>chen Zürcher"

iconv(xU, "UTF-8", "", sub="?")

[1] "Ekstr??m" "J??reskog" "bi??chen Zürcher"

Etc... .  As the above is typically garbled between e-mail
transfer agents, I append both the iconv-Windows.R R script and
the corresponding iconv-Windows.Rout  R transcript to this
e-mail (using MIME type text/plain (easy using emacs for mail..)),
and they contain a bit more than the above.

Note that the above shows that using 'sub = *' and using
"//TRANSLIT" in case of a previous NA  result helps quite a bit,
in the sense that it gives much more information to see
  "J?reskog"  instead of  NA.

I'm considering updating  packageDescription() to try these in
case it first returns NA.   This would make the citation() hack
unnecessary.
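
The fallback idea described above could be sketched roughly as follows. This is only an illustration, not the actual packageDescription() change: the helper name iconv_fallback() is hypothetical, and "ASCII" is used as a stand-in target so that the failing case can be reproduced without switching to a Chinese or Arabic locale on Windows.

```r
# Hypothetical helper, not part of base R: retry iconv() where it returned NA.
iconv_fallback <- function(x, from, to = "", sub = "?") {
  res <- iconv(x, from, to)
  bad <- is.na(res) & !is.na(x)
  if (any(bad)) {
    # Some iconv implementations reject "//TRANSLIT"; treat that as "still NA".
    res[bad] <- tryCatch(
      iconv(x[bad], from, paste0(to, "//TRANSLIT")),
      error = function(e) rep(NA_character_, sum(bad))
    )
    bad <- is.na(res) & !is.na(x)
  }
  if (any(bad)) {
    # Last resort: substitute each non-convertible byte.
    res[bad] <- iconv(x[bad], from, to, sub = sub)
  }
  res
}

x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")

# "ASCII" stands in for a locale encoding that lacks these letters.
out <- iconv_fallback(xU, "UTF-8", "ASCII")
print(out)
```

What the transliteration step produces depends on the platform's iconv implementation, so the exact strings may differ between systems; the point is only that no element stays NA.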


I agree with the general sentiment (fix the underlying problem).  I 
haven't traced through this one, but the usual cause of problems like 
this is that we too frequently convert to the local encoding even when 
that loses information.
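
A small sketch of what "loses information" means here, under the assumption that "ASCII" can stand in for a local encoding that lacks the letters in question (the variable names are illustrative only):

```r
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")

# UTF-8 can represent every character here, so the round trip
# latin1 -> UTF-8 -> latin1 gives back the original bytes.
back <- iconv(xU, "UTF-8", "latin1")
lossless <- identical(lapply(back, charToRaw), lapply(x1, charToRaw))

# Converting to an encoding without these letters (with substitution)
# yields strings from which the original text cannot be recovered.
lossy <- iconv(xU, "UTF-8", "ASCII", sub = "?")
recoverable <- identical(lossy, xU)

print(lossless)
print(recoverable)
```

Keeping strings in UTF-8 internally and converting only at the point of output avoids the lossy step for as long as possible.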


Kirill Müller and I are gradually working through internal code and 
fixing these issues.  I don't know if this one will be fixed sooner or 
later, but I would hope it would be fixed by 3.5.0.


So in order that we don't hide it, I'd ask you not to apply the patch in 
R-devel.


Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Windows iconv() "failure" in certain locales

2017-06-28 Thread Uwe Ligges



On 27.06.2017 17:36, Martin Maechler wrote:

This is a continuation of the R-devel thread with subject
  "suggestion to fix packageDescription() for Windows users" :

As I said there, a patch should address the underlying
problem in packageDescription() rather than be a kludgy
workaround patch for citation().
(For that same reason, Ben Marwick proposed to fix
  packageDescription() rather than the symptom seen in citation().)

It's not hard to see that the problem is that  iconv() in
Windows does not always succeed in translating from "UTF-8" to the
"current locale", in the case mentioned there.

I'm giving some easier reproducible examples:  no need to install
half of tidyverse just to get citation("readr") :


x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")



Sys.setlocale("LC_CTYPE", "Chinese")

[1] "Chinese (Simplified)_People's Republic of China.936"


iconv(x1, "latin1", "") # NA NA NA

[1] NA NA NA

iconv(xU, "UTF-8", "") # NA NA NA

[1] NA NA NA

iconv(xU, "UTF-8", "//TRANSLIT")

[1] "Ekstrøm" "Jöreskog" "bißchen Zürcher"


Interesting, I get Chinese characters here.

Besides the comments from Duncan Murdoch:
iconv(x1, "latin1", "", sub="?")
etc. would be an alternative in case some characters really cannot be 
converted into the target encoding, and should perhaps be considered 
once Duncan has committed the fix for the underlying problem.


Best,
Uwe









iconv(xU, "UTF-8", "", sub = "byte")

[1] "Ekstr<c3><b8>m" "J<c3><b6>reskog" "bi<c3><9f>chen Z¨¹rcher"



Sys.setlocale("LC_CTYPE", "Arabic")

[1] "Arabic_Saudi Arabia.1256"

iconv(x1, "latin1", "")  # NA NA NA

[1] NA NA NA

iconv(xU, "UTF-8", "")  # NA NA NA

[1] NA NA NA

iconv(xU, "UTF-8", "//TRANSLIT")

[1] "Ekstr\370m" "J\366reskog" "bißchen Zürcher"

iconv(xU, "UTF-8", "", sub="byte")

[1] "Ekstr<c3><b8>m" "J<c3><b6>reskog" "bi<c3><9f>chen Zürcher"

iconv(xU, "UTF-8", "", sub="?")

[1] "Ekstr??m" "J??reskog" "bi??chen Zürcher"

Etc... .  As the above is typically garbled between e-mail
transfer agents, I append both the iconv-Windows.R R script and
the corresponding iconv-Windows.Rout  R transcript to this
e-mail (using MIME type text/plain (easy using emacs for mail..)),
and they contain a bit more than the above.

Note that the above shows that using 'sub = *' and using
"//TRANSLIT" in case of a previous NA  result helps quite a bit,
in the sense that it gives much more information to see
   "J?reskog"  instead of  NA.

I'm considering updating  packageDescription() to try these in
case it first returns NA.   This would make the citation() hack
unnecessary.

Martin


iconv-Windows.R


#### iconv() behavior depending on Locales  LC_CTYPE  in Windows
###
### In a *shell* in Windows (emacs), after doing R.home() in R, use that to do something like
###   c:/PROGRA~1/R/R-devel/bin/R CMD BATCH iconv-Windows.R
###   ==> producing  iconv-Windows.Rout
###
sessionInfo() ## does not matter so much
## -- should be Windows to exhibit the problems

## From  help(iconv) 's  example : Using "latin1" European language letters:
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")


## 2 locales that do not work well : -
Sys.setlocale("LC_CTYPE", "Chinese")

iconv(x1, "latin1", "") # NA NA NA
iconv(x1, "latin1", "//TRANSLIT") # perfect for Chinese
iconv(x1, "latin1", "", sub = "byte")
iconv(xU, "UTF-8", "") # NA NA NA
iconv(xU, "UTF-8", "//TRANSLIT")
iconv(xU, "UTF-8", "", sub = "byte")
##--
Sys.setlocale("LC_CTYPE", "Arabic")
iconv(x1, "latin1", "")  # NA NA NA
iconv(x1, "latin1", "//TRANSLIT") # not bad, but not perfect
iconv(x1, "latin1", "", sub="byte")
iconv(x1, "latin1", "", sub="?")
iconv(xU, "UTF-8", "")  # NA NA NA
iconv(xU, "UTF-8", "//TRANSLIT")
iconv(xU, "UTF-8", "", sub="byte")
iconv(xU, "UTF-8", "", sub="?")

## 2 locales that work well for these examples (no wonder) ---

Sys.setlocale("LC_CTYPE", "German_Switzerland")
iconv(x1, "latin1", "")
iconv(x1, "latin1", "//TRANSLIT")
iconv(x1, "latin1", "", sub="?")
iconv(xU, "UTF-8", "")
iconv(xU, "UTF-8", "//TRANSLIT")
iconv(xU, "UTF-8", "", sub="?")
##--
Sys.setlocale("LC_CTYPE", "English")
iconv(x1, "latin1", "")
iconv(x1, "latin1", "//TRANSLIT")
iconv(x1, "latin1", "", sub="?")
iconv(xU, "UTF-8", "")
iconv(xU, "UTF-8", "//TRANSLIT")
iconv(xU, "UTF-8", "", sub="?")


iconv-Windows.Rout



R Under development (unstable) (2017-06-25 r72854) -- "Unsuffered Consequences"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is

[Rd] regexec() bug in R 3.4.0

2017-06-28 Thread Weeks, Nathan
Hi,

In R 3.4.0, the "Pattern Matching and Replacement" documentation that describes 
regexec(), gregexpr(), etc. states that the "text" argument to regexec is a 
character vector, "or an object which can be coerced by as.character to a 
character vector":

 regexec(pattern, text, ignore.case = FALSE, perl = FALSE,
 fixed = FALSE, useBytes = FALSE)

 x, text: a character vector where matches are sought, or an object
 which can be coerced by as.character to a character vector.
 Long vectors are supported.

However, in R 3.4.0, this coercion doesn't seem to automatically occur for the 
text argument of regexec(), whereas it does for gregexpr(), regexpr(), etc:


$ R --vanilla

R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

...
> text <- as.factor("foobar")
> regexec("foo", text)
Error in regexec("foo", text) : invalid 'text' argument
> regexec("foo", as.character(text))
[[1]]
[1] 1
attr(,"match.length")
[1] 3
attr(,"useBytes")
[1] TRUE

> gregexpr("foo", text)
[[1]]
[1] 1
attr(,"match.length")
[1] 3
attr(,"useBytes")
[1] TRUE


Is this a documentation issue, a bug in regexec(), or am I misunderstanding how 
it's supposed to behave?
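
Whichever of these it turns out to be, a workaround consistent with the transcript above is to coerce explicitly before calling regexec(). A minimal sketch, where the wrapper name regexec_chr() is hypothetical and not part of base R:

```r
# Hypothetical wrapper: coerce `text` with as.character() so that factors
# (and other coercible objects) are accepted, as the documentation describes.
regexec_chr <- function(pattern, text, ...) {
  regexec(pattern, as.character(text), ...)
}

text <- as.factor("foobar")
m <- regexec_chr("foo", text)

# regmatches() also needs the character form of the input.
matched <- regmatches(as.character(text), m)
print(matched)
```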

Thanks,

--
Nathan Weeks
IT Specialist
USDA-ARS Corn Insects and Crop Genetics Research Unit
Crop Genome Informatics Laboratory
Iowa State University






