[Rd] source(), parse(), and foreign UTF-8 characters

2017-05-09 Thread Kirill Müller

Hi


I'm having trouble sourcing or parsing a UTF-8 file that contains 
characters that are not representable in the current locale ("foreign 
characters") on Windows. The source() function stops with an error, the 
parse() function reencodes all foreign characters using the <U+xxxx> 
notation. I have added a reproducible example below the message.


This seems well within the bounds of documented behavior, although the 
documentation of source() could mention that the file can't contain 
foreign characters. Still, I'd prefer it if UTF-8 "just worked" in R, and 
I'm willing to invest substantial time to help with that. Before 
starting to write a detailed proposal, I feel that I need a better 
understanding of the problem, and I'm grateful for any feedback you 
might have.


I have looked into character encodings in the context of the dplyr 
package, and I have observed the following behavior:


- Strings are treated preferentially in the native encoding
- Only upon specific request (via translateCharUTF8() or enc2utf8() or 
...), they are translated to UTF-8 and marked as such

- On UTF-8 systems, strings are never marked as UTF-8
- ASCII strings are marked as ASCII internally, but this information 
doesn't seem to be available, e.g., Encoding() returns "unknown" for 
such strings
- Most functions in R are encoding-agnostic: they work the same 
regardless of whether they receive a native or a UTF-8 encoded string, 
provided it is properly tagged
- One important difference is symbols, which must be in the native 
encoding (and are always converted to native encoding, using <U+xxxx> 
escapes)
- I/O is centered around the native encoding, e.g., writeLines() always 
reencodes to the native encoding

- There is the "bytes" encoding which avoids reencoding.
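
A small session illustrating the marking behavior described above (a sketch only; the exact output depends on the platform and locale):

```r
# Sketch: how strings are (not) marked (results vary by OS/locale)
x <- "Gl\u00fcck"
Encoding(x)        # typically "UTF-8" for a non-ASCII \u-escaped literal
Encoding("abc")    # "unknown": the internal ASCII flag is not exposed
y <- enc2utf8(x)   # explicit translation, marked as UTF-8
identical(x, y)    # contents agree; only the encoding mark may differ
```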

I haven't looked into serialization or plot devices yet.

The conclusion to the "UTF-8 manifesto" [1] suggests "... to use UTF-8 
narrow strings everywhere and convert them back and forth when using 
platform APIs that don’t support UTF-8 ...". (It is written in the 
context of the UTF-16 encoding used internally on Windows, but seems to 
apply just the same here for the native encoding.) I think that Unicode 
support in R could be greatly improved if we follow these guidelines. 
This seems to mean:


- Convert strings to UTF-8 as soon as possible, and mark them as such 
(also on systems where UTF-8 is the native encoding)
- Translate to native only upon specific request, e.g., in calls to API 
functions or perhaps for .C()

- Use UTF-8 for symbols
- Avoid the forced round-trip to the native encoding in I/O functions 
and for parsing (but still read/write native by default)

- Carefully look into serialization and plot devices
- Add helper functions that simplify mundane tasks such as 
reading/writing a UTF-8 encoded file
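
As a sketch of the last point, helpers along these lines (hypothetical names read_utf8()/write_utf8(), not part of base R) could wrap the connection handling; writing in binary mode with useBytes = TRUE avoids the round-trip through the native encoding:

```r
# Hypothetical helpers, not part of base R
read_utf8 <- function(path) {
  con <- file(path, encoding = "UTF-8")
  on.exit(close(con))
  enc2utf8(readLines(con, warn = FALSE))
}

write_utf8 <- function(text, path) {
  con <- file(path, open = "wb")  # binary mode: the connection does not reencode
  on.exit(close(con))
  writeLines(enc2utf8(text), con, useBytes = TRUE)
}
```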


I'm sure I've missed many potential pitfalls, your input is greatly 
appreciated. Thanks for your attention.


Further resources: a write-up by Prof. Ripley [2], a section in R-ints 
[3], a blog post by Ista Zahn [4], and a StackOverflow search [5].



Best regards

Kirill



[1] http://utf8everywhere.org/#conclusions

[2] https://developer.r-project.org/Encodings_and_R.html

[3] 
https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Encodings-for-CHARSXPs


[4] 
http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encoding-in-r/


[5] 
http://stackoverflow.com/search?tab=votes&q=%5br%5d%20encoding%20windows%20is%3aquestion




# Use one of the following:
id <- "Gl\u00fcck"
id <- "\u5e78\u798f"
id <- "\u0441\u0447\u0430\u0441\u0442\u044c\u0435"
id <- "\ud589\ubcf5"

file_contents <- paste0('"', id, '"')
Encoding(file_contents)
raw_file_contents <- charToRaw(file_contents)

path <- tempfile(fileext = ".R")
writeBin(raw_file_contents, path)
file.size(path)
length(raw_file_contents)

# Escapes the string
parse(text = file_contents)

# Throws an error
print(source(path, encoding = "UTF-8"))

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-09 Thread Hilmar Berger

Hi,

On 08/05/17 16:37, Ista Zahn wrote:

One of the key strengths of R is that packages are not akin to "fan
created mods". They are a central and necessary part of the R system.

I would tend to disagree here. R packages are in their majority not 
maintained by the core R developers. Concepts, features and lifetime 
depend mainly on the maintainers of the package (even though in theory 
the GPL allows anybody to take over at any time). Several packages that 
are critical for processing big data and providing "modern" 
visualizations introduce concepts quite different from the legacy S/R 
language. I do feel that in a way, current core R strongly shows its 
origin in S, while modern concepts (e.g. data.table, dplyr, ggplot, ...) 
are often only available via extension packages. This is fine if one 
considers R to be a statistical toolkit; as a programming language, 
however, it introduces inconsistencies and uncertainties which could be 
avoided if some of the "modern" parts (including language concepts) 
were better integrated into core R.


Best regards,
Hilmar

--
Dr. Hilmar Berger, MD
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin
GERMANY

Phone:  + 49 30 28460 430
Fax:+ 49 30 28460 401
 
E-Mail: ber...@mpiib-berlin.mpg.de

Web   : www.mpiib-berlin.mpg.de


Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-09 Thread Joris Meys
On Tue, May 9, 2017 at 9:47 AM, Hilmar Berger 
wrote:

> Hi,
>
> On 08/05/17 16:37, Ista Zahn wrote:
>
>> One of the key strengths of R is that packages are not akin to "fan
>> created mods". They are a central and necessary part of the R system.
>>
>> I would tend to disagree here. R packages are in their majority not
> maintained by the core R developers. Concepts, features and lifetime depend
> mainly on the maintainers of the package (even though in theory GPL will
> allow to somebody to take over anytime). Several packages that are critical
> for processing big data and providing "modern" visualizations introduce
> concepts quite different from the legacy S/R language. I do feel that in a
> way, current core R shows strongly its origin in S, while modern concepts
> (e.g. data.table, dplyr, ggplot, ...) are often only available via
> extension packages. This is fine if one considers R to be a statistical
> toolkit; as a programming language, however, it introduces inconsistencies
> and uncertainties which could be avoided if some of the "modern" parts
> (including language concepts) could be more integrated in core-R.
>
> Best regards,
> Hilmar
>

And I would tend to disagree here. R is built upon the paradigm of a
functional programming language, and falls in the same group as Clojure,
Haskell and the like. It is a Turing-complete programming language in its
own right. That's quite a bit more than "a statistical toolkit". You can say
that about e.g. the macro language of SPSS, but not about R.

Second, there's little "modern" about the ideas behind the tidyverse.
Piping is about as old as Unix itself. The grammar of graphics, on which
ggplot is based, stems from the SYSTAT graphics system from the nineties.
Hadley and colleagues did (and do) a great job implementing these ideas in
R, but the ideas do have a respectable age.

Third, there's a lot of nonstandard evaluation going on in all these
packages. Using them inside your own functions requires serious attention
(eg the difference between aes() and aes_() in ggplot2). Actually, even
though I definitely see the merits of these packages in data analysis, the
tidyverse feels like a (clean and powerful) macro language on top of R. And
that's good, but that doesn't mean these parts are essential to transform R
into a programming language. Rather the contrary actually: too heavily
relying on these packages does complicate things when you start to develop
your own packages in R.

Fourth, the tidyverse masks quite a few native R functions. Obviously they
took great care to keep the functionality as close as one would expect,
but that's not always the case. The lag() function of dplyr masks an S3
generic from the stats package, for example. So if you work with time series
using the stats package, loading the tidyverse gives you trouble.
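
The masking can be seen in a fresh session (a sketch; the exact result depends on the dplyr version in use):

```r
x <- ts(1:5)
stats::lag(x, k = 1)  # shifts the time base; the result is still a "ts"
library(dplyr)        # dplyr::lag() now masks stats::lag()
lag(x, 1)             # different semantics (or an error, depending on the version)
```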

Fifth, many of the tidyverse packages are at version 0.x.y: they're still
in beta development and their functionality might (and will) change.
Functions disappear, arguments are renamed, tags change, ... Often
the changes improve the packages, but they have broken older code for me more
than once. You can't expect the R core team to incorporate something that
is bound to change.

Last but not least, the tidyverse actually sometimes works against new R
users, at least R users who go beyond the classic data workflow. I
literally rewrote some code (from a consultant) that abused the _ply
functions to create nested loops. Removing all that stuff and rewriting the
code using a simple list in combination with a simple for loop sped up the
code by a factor of 150. That has nothing to do with dplyr itself, which is
very fast. It has everything to do with that person having a hammer and
thinking everything he sees is a nail. The tidyverse is no reason not to
learn the concepts of the language it's built upon.

The one thing I would like to see, though, is adapting the
statistical toolkit so that it can work with data.table and tibble objects
directly, as opposed to having to convert to a data.frame once you start
building models. And I believe that eventually there will be a
replacement for the data.frame that increases R's performance and reduces
its memory footprint.

So all in all, I do admire the tidyverse and how it speeds up data
preparation for analysis. But the tidyverse is a powerful data toolkit, not a
programming language. And it won't make R a programming language either,
because R already is one.

Cheers
Joris

>
> --
> Dr. Hilmar Berger, MD
> Max Planck Institute for Infection Biology
> Charitéplatz 1
> D-10117 Berlin
> GERMANY
>
> Phone:  + 49 30 28460 430
> Fax:+ 49 30 28460 401
>  E-Mail: ber...@mpiib-berlin.mpg.de
> Web   : www.mpiib-berlin.mpg.de
>
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling,

Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-09 Thread Lionel Henry
> Third, there's a lot of nonstandard evaluation going on in all these
> packages. Using them inside your own functions requires serious attention
> (eg the difference between aes() and aes_() in ggplot2). Actually, even
> though I definitely see the merits of these packages in data analysis, the
> tidyverse feels like a (clean and powerful) macro language on top of R.

That is going to change as we have put a lot of effort into learning
how to deal with capturing functions. See the tidyeval framework which
will enable full and flexible programmability of tidyverse grammars.

That said I agree that data analysis and package programming often
require different sets of tools.

Lionel



Re: [Rd] A few suggestions and perspectives from a PhD student

2017-05-09 Thread Hilmar Berger


On 09/05/17 11:22, Joris Meys wrote:
>
>
> On Tue, May 9, 2017 at 9:47 AM, Hilmar Berger 
> mailto:ber...@mpiib-berlin.mpg.de>> wrote:
>
> Hi,
>
> On 08/05/17 16:37, Ista Zahn wrote:
>
> One of the key strengths of R is that packages are not akin to
> "fan
> created mods". They are a central and necessary part of the R
> system.
>
> I would tend to disagree here. R packages are in their majority
> not maintained by the core R developers. Concepts, features and
> lifetime depend mainly on the maintainers of the package (even
> though in theory GPL will allow to somebody to take over anytime).
> Several packages that are critical for processing big data and
> providing "modern" visualizations introduce concepts quite
> different from the legacy S/R language. I do feel that in a way,
> current core R shows strongly its origin in S, while modern
> concepts (e.g. data.table, dplyr, ggplot, ...) are often only
> available via extension packages. This is fine if one considers R
> to be a statistical toolkit; as a programming language, however,
> it introduces inconsistencies and uncertainties which could be
> avoided if some of the "modern" parts (including language
> concepts) could be more integrated in core-R.
>
> Best regards,
> Hilmar
>
>
> And I would tend to disagree here. R is build upon the paradigm of a 
> functional programming language, and falls in the same group as 
> clojure, haskell and the likes. It is a turing complete programming 
> language on its own. That's quite a bit more than "a statistical 
> toolkit". You can say that about eg the macro language of SPSS, but 
> not about R.
>
My point was that inconsistencies are harder to tolerate when using R as 
a programming language as opposed to a toolkit that just has to do a job.
> Second, there's little "modern" about the ideas behind the tidyverse. 
> Piping is about as old as unix itself. The grammar of graphics, on 
> which ggplot is based, stems from the SYStat graphics system from the 
> nineties. Hadley and colleagues did (and do) a great job implementing 
> these ideas in R, but the ideas do have a respectable age.
Those ideas still seem more modern than, e.g., stock R graphics, 
probably designed in the seventies or eighties. Those still do their job 
for lots and lots of applications; however, the fact that many newer 
packages use ggplot instead of plot() forces users to learn and use 
different paradigms for things as simple as drawing a line.

I also would like to make clear that I do not advocate including the 
whole tidyverse in core R. I just believe that having core concepts well 
supported in core R instead of implemented in a package might make 
things more consistent. E.g., method chaining ("%>%") is a core language 
feature in many languages.
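
For illustration, a minimal pipe is a one-liner in R (a sketch of the idea only; magrittr's %>% additionally supports placeholders and non-standard evaluation):

```r
`%>%` <- function(lhs, rhs) rhs(lhs)  # minimal pipe: rhs must be a function
c(1, 4, 9) %>% sqrt %>% sum           # 6
```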
>
> The one thing I would like to see though, is the adaptation of the 
> statistical toolkit so that it can work with data.table and tibble 
> objects directly, as opposed to having to convert to a data.frame once 
> you start building the models. And I believe that eventually there 
> will be a replacement for the data.frame that increases R's 
> performance and lessens its burden on the memory.
>
Which is a perfect example of what I mean: improved functionality should 
find its way into core R at some point, replacing or extending 
outdated functionality. Otherwise, I don't know how hard it will be to 
develop 21st-century methods on top of a 1980s/90s language core, 
although I admit that the R developers are doing a great job of making it 
possible.

Best,
Hilmar

> So all in all, I do admire the tidyverse and how it speeds up data 
> preparation for analysis. But tidyverse is a powerful data toolkit, 
> not a programming language. And it won't make R a programming language 
> either. Because R is already.
>
> Cheers
> Joris
>
>
> -- 
> Dr. Hilmar Berger, MD
> Max Planck Institute for Infection Biology
> Charitéplatz 1
> D-10117 Berlin
> GERMANY
>
> Phone: + 49 30 28460 430 
> Fax: + 49 30 28460 401 
>  E-Mail: ber...@mpiib-berlin.mpg.de
> 
> Web   : www.mpiib-berlin.mpg.de 
>
>
> __
> R-devel@r-project.org  mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
>
>
>
>
> -- 
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel :  +32 (0)9 264 61 79
> joris.m...@ugent.be
> ---
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

-- 
Dr. Hilmar Berger, MD
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin
GERMANY

Phone:  + 49 30 28460 43

[Rd] Potential Bug in convolve {stats}

2017-05-09 Thread George Anastasiou
Dear All,

I think there is a bug in the convolve function of the stats package. Running
the following:

a <- convolve(c(1,1,1,1), 1, type="filter")
a

the answer is:

[1] 1 1

whereas it should be:

[1] 1 1 1 1

Looking at the code of convolve, the bug is on line 22 at:

[-c(1L:n1, (n - n1 + 1L):n)]/n


which is not correct when the second input argument has only one element
(n1=0).
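
The root cause is that R's `a:b` operator counts downwards, so `1L:n1` is c(1L, 0L) rather than an empty vector when n1 is 0. A sketch of the effect, plus a guarded form of the subset using seq_len(), which does return an empty vector for 0 (a possible fix, not the official patch):

```r
n <- 4; n1 <- 0
1L:n1                           # c(1, 0), NOT empty: `:` counts downwards
x <- c(1, 1, 1, 1)
x[-c(1L:n1, (n - n1 + 1L):n)]   # drops elements 1 and 4 by accident: c(1, 1)

# Guarded version of the same subset (sketch of a possible fix):
idx <- c(seq_len(n1), n - n1 + seq_len(n1))
if (length(idx)) x[-idx] else x  # keeps all four elements when n1 == 0
```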

When I run R.version I get the following output:

platform   i686-pc-linux-gnu
arch   i686
os linux-gnu
system i686, linux-gnu
status
major  3
minor  4.0
year   2017
month  04
day21
svn rev72570
language   R
version.string R version 3.4.0 (2017-04-21)
nickname   You Stupid Darkness


Best,

George Anastasiou



Re: [Rd] source(), parse(), and foreign UTF-8 characters

2017-05-09 Thread Duncan Murdoch

On 09/05/2017 3:42 AM, Kirill Müller wrote:

Hi


I'm having trouble sourcing or parsing a UTF-8 file that contains
characters that are not representable in the current locale ("foreign
characters") on Windows. The source() function stops with an error, the
parse() function reencodes all foreign characters using the <U+xxxx>
notation. I have added a reproducible example below the message.

This seems well within the bounds of documented behavior, although the
documentation to source() could mention that the file can't contain
foreign characters. Still, I'd prefer if UTF-8 "just worked" in R, and
I'm willing to invest substantial time to help with that. Before
starting to write a detailed proposal, I feel that I need a better
understanding of the problem, and I'm grateful for any feedback you
might have.

I have looked into character encodings in the context of the dplyr
package, and I have observed the following behavior:

- Strings are treated preferentially in the native encoding
- Only upon specific request (via translateCharUTF8() or enc2utf8() or
...), they are translated to UTF-8 and marked as such
- On UTF-8 systems, strings are never marked as UTF-8
- ASCII strings are marked as ASCII internally, but this information
doesn't seem to be available, e.g., Encoding() returns "unknown" for
such strings
- Most functions in R are encoding-agnostic: they work the same
regardless if they receive a native or UTF-8 encoded string if they are
properly tagged
- One important difference are symbols, which must be in the native
encoding (and are always converted to native encoding, using <U+xxxx>
escapes)
- I/O is centered around the native encoding, e.g., writeLines() always
reencodes to the native encoding
- There is the "bytes" encoding which avoids reencoding.

I haven't looked into serialization or plot devices yet.

The conclusion to the "UTF-8 manifesto" [1] suggests "... to use UTF-8
narrow strings everywhere and convert them back and forth when using
platform APIs that don’t support UTF-8 ...". (It is written in the
context of the UTF-16 encoding used internally on Windows, but seems to
apply just the same here for the native encoding.) I think that Unicode
support in R could be greatly improved if we follow these guidelines.
This seems to mean:

- Convert strings to UTF-8 as soon as possible, and mark them as such
(also on systems where UTF-8 is the native encoding)
- Translate to native only upon specific request, e.g., in calls to API
functions or perhaps for .C()
- Use UTF-8 for symbols
- Avoid the forced round-trip to the native encoding in I/O functions
and for parsing (but still read/write native by default)
- Carefully look into serialization and plot devices
- Add helper functions that simplify mundane tasks such as
reading/writing a UTF-8 encoded file


Those are good long term goals, though I think the effort is easier than 
you think.  Rather than attempting to do it all at once, you should look 
for ways to do it gradually and submit self-contained patches.  In many 
cases it doesn't matter if strings are left in the local encoding, 
because the encoding doesn't matter.  The problems arise when UTF-8 
strings are converted to the local encoding before it's necessary, 
because that's a lossy conversion.  So a simple way to proceed is to 
identify where these conversions occur, and remove them one-by-one.


Currently I'm working on bug 16098, "Windows doesn't handle high Unicode 
code points".  It doesn't require many changes at all to handle input of 
those characters; all the remaining issues are avoiding the problems you 
identify above.  The origin of the issue is the fact that in Windows 
wchar_t is only 16 bits (not big enough to hold all Unicode code 
points).  As far as I know, Windows has no standard type to hold a 
Unicode code point, most of the run-time functions still use the 16 bit 
wchar_t.


I think once that bug is dealt with, 90+% of the remaining issues could 
be solved by avoiding translateChar on Windows.  This could be done by 
avoiding it everywhere, or by acting as though Windows is running in a 
UTF-8 locale until you actually need to write to a file.  Other systems 
tend to have UTF-8 locales in common use, so they're already fine.


You offered to spend time on this.  I'd appreciate some checks of the 
patch I'm developing for 16098, and also some research into how certain 
things (e.g. the iswprint function) are handled on Windows.


Duncan Murdoch


I'm sure I've missed many potential pitfalls, your input is greatly
appreciated. Thanks for your attention.

Further ressources: A write-up by Prof. Ripley [2], a section in R-ints
[3], a blog post by Ista Zahn [4], a StackOverflow search [5].


Best regards

Kirill



[1] http://utf8everywhere.org/#conclusions

[2] https://developer.r-project.org/Encodings_and_R.html

[3]
https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Encodings-for-CHARSXPs

[4]
http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encod

[Rd] registering Fortran routines in R packages

2017-05-09 Thread Christophe Dutang
Dear list,

I’m trying to register Fortran routines in randtoolbox (in src/init.c file), 
see 
https://r-forge.r-project.org/scm/viewvc.php/pkg/randtoolbox/src/init.c?view=markup&root=rmetrics.
 

Reading 
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Registering-native-routines
 and looking at what is done in stats package, I first thought that the 
following code will do the job:

static const R_FortranMethodDef FortEntries[] = {
 {"halton", (DL_FUNC) &F77_NAME(HALTON),  7},
 {"sobol", (DL_FUNC) &F77_NAME(SOBOL),  11},
 {NULL, NULL, 0}
};

But I got error messages when building: use of undeclared identifier ‘SOBOL_’. 
I also tried lowercase sobol and halton.

Looking at the expm package 
https://r-forge.r-project.org/scm/viewvc.php/pkg/src/init.c?view=markup&revision=94&root=expm,
I tried

static const R_FortranMethodDef FortEntries[] = {
 {"halton", (DL_FUNC) &F77_SUB(HALTON),  7},
 {"sobol", (DL_FUNC) &F77_SUB(SOBOL),  11},
 {NULL, NULL, 0}
};

But the problem remains the same.

Is there a way to have a header file for Fortran code? How do I declare the routines 
defined in my Fortran file src/LowDiscrepancy.f?

Any help appreciated

Regards, Christophe
---
Christophe Dutang
LMM, UdM, Le Mans, France
web: http://dutangc.free.fr


Re: [Rd] registering Fortran routines in R packages

2017-05-09 Thread Berend Hasselman

> On 9 May 2017, at 13:44, Christophe Dutang  wrote:
> 
> Dear list,
> 
> I’m trying to register Fortran routines in randtoolbox (in srt/init.c file), 
> see 
> https://r-forge.r-project.org/scm/viewvc.php/pkg/randtoolbox/src/init.c?view=markup&root=rmetrics.
>  
> 
> Reading 
> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Registering-native-routines
>  and looking at what is done in stats package, I first thought that the 
> following code will do the job:
> 
> static const R_FortranMethodDef FortEntries[] = {
> {"halton", (DL_FUNC) &F77_NAME(HALTON),  7},
> {"sobol", (DL_FUNC) &F77_NAME(SOBOL),  11},
> {NULL, NULL, 0}
> };
> 
> But I got error messages when building : use of undeclared identifier 
> ‘SOBOL_’. I also tried in lower case sobol and halton.
> 
> Looking at expm package 
> https://r-forge.r-project.org/scm/viewvc.php/pkg/src/init.c?view=markup&revision=94&root=expm,
>  I try  
> 
> static const R_FortranMethodDef FortEntries[] = {
> {"halton", (DL_FUNC) &F77_SUB(HALTON),  7},
> {"sobol", (DL_FUNC) &F77_SUB(SOBOL),  11},
> {NULL, NULL, 0}
> };
> 
> But the problem remains the same.
> 
> Is there a way to have header file for Fortran codes? how to declare routines 
> defined in my Fortran file src/LowDiscrepancy.f?
> 

Lowercase routine names? The manual does mention that.
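
For the record, a registration block along those lines might look as follows (a sketch: lowercase names inside F77_NAME(), plus prototypes; the argument lists are left unspecified here and should be filled in from src/LowDiscrepancy.f):

```c
#include <R_ext/RS.h>        /* F77_NAME */
#include <R_ext/Rdynload.h>

/* C89-style prototypes with unspecified arguments; replace with the
   real dummy-argument types from the Fortran source. */
void F77_NAME(halton)();
void F77_NAME(sobol)();

static const R_FortranMethodDef FortEntries[] = {
    {"halton", (DL_FUNC) &F77_NAME(halton),  7},
    {"sobol",  (DL_FUNC) &F77_NAME(sobol),  11},
    {NULL, NULL, 0}
};

void R_init_randtoolbox(DllInfo *dll)
{
    R_registerRoutines(dll, NULL, NULL, FortEntries, NULL);
    R_useDynamicSymbols(dll, FALSE);
}
```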

Berend Hasselman


> Any help appreciated
> 
> Regards, Christophe
> ---
> Christophe Dutang
> LMM, UdM, Le Mans, France
> web: http://dutangc.free.fr
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Bug simulate.lm() --> needs credential to report it

2017-05-09 Thread Alexandre Courtiol
Dear R developers,

I did not get any reply to my email from last week concerning the
bug I found in stats::simulate.lm(). The bug shows up when it is called on a
GLM with family gaussian(). I am confident it is a genuine bug, related to a
mix-up between weights and prior weights, that only affects the gaussian
family (other families have their own simulate functions defined
elsewhere).

I cannot add the bug in Bugzilla as I have no credential.
Could someone please help me to get credentials so that I can add the bug
in bugzilla?

Thanks a lot,

Simple demonstration for the bug:

set.seed(1L)
y <- 10 + rnorm(n = 100)
mean(y) ##  10.10889
var(y)  ##   0.8067621

mod_glm  <- glm(y ~ 1, family = gaussian(link = "log"))
new.y <- simulate(mod_glm)[, 1]
mean(new.y) ## 10.10553
var(new.y)  ##  0.007243695  # WRONG #

mod_glm$weights <- mod_glm$prior.weights  ## ugly hack showing where the
issue is
new.y <- simulate(mod_glm)[, 1]
mean(new.y) ## 10.13554
var(new.y)  ##  0.8629975  # OK #


-- 
Alexandre Courtiol

http://sites.google.com/site/alexandrecourtiol/home

*"Science is the belief in the ignorance of experts"*, R. Feynman



[Rd] R-3.3.3/R-3.4.0 change in sys.call(sys.parent())

2017-05-09 Thread William Dunlap via R-devel
Some formula methods for S3 generic functions use the idiom
returnValue$call <- sys.call(sys.parent())
to show how to recreate the returned object or to use as a label on a
plot.  It is often followed by
 returnValue$call[[1]] <- quote(myName)
E.g., I see it in packages "latticeExtra" and "leaps", and I suspect it is
used in "lattice" as well.

This idiom has not done good things for quite a while (ever?) but I noticed
while running tests that it acts differently in R-3.4.0 than in R-3.3.3.
Neither the old nor the new behavior is nice.  E.g., in R-3.3.3 we get

> parseEval <- function(text, envir) eval(parse(text=text), envir=envir)
> parseEval('lattice::xyplot(mpg~hp, data=datasets::mtcars)$call',
envir=new.env())
xyplot(expr, envir, enclos)

and

> evalInEnvir <- function(call, envir) eval(call, envir=envir)
> evalInEnvir(quote(lattice::xyplot(mpg~hp, data=datasets::mtcars)$call),
envir=new.env())
xyplot(expr, envir, enclos)

while in R-3.4.0 we get
> parseEval <- function(text, envir) eval(parse(text=text), envir=envir)
> parseEval('lattice::xyplot(mpg~hp, data=datasets::mtcars)$call',
envir=new.env())
xyplot(parse(text = text), envir = envir)

and

> evalInEnvir <- function(call, envir) eval(call, envir=envir)
> evalInEnvir(quote(lattice::xyplot(mpg~hp, data=datasets::mtcars)$call),
envir=new.env())
xyplot(call, envir = envir)

Should these packages be fixed up to use just sys.call()?

Bill Dunlap
TIBCO Software
wdunlap tibco.com



Re: [Rd] source(), parse(), and foreign UTF-8 characters

2017-05-09 Thread Kirill Müller

On 09.05.2017 13:19, Duncan Murdoch wrote:

On 09/05/2017 3:42 AM, Kirill Müller wrote:

Hi


I'm having trouble sourcing or parsing a UTF-8 file that contains
characters that are not representable in the current locale ("foreign
characters") on Windows. The source() function stops with an error, the
parse() function reencodes all foreign characters using the <U+xxxx>
notation. I have added a reproducible example below the message.

This seems well within the bounds of documented behavior, although the
documentation to source() could mention that the file can't contain
foreign characters. Still, I'd prefer if UTF-8 "just worked" in R, and
I'm willing to invest substantial time to help with that. Before
starting to write a detailed proposal, I feel that I need a better
understanding of the problem, and I'm grateful for any feedback you
might have.

I have looked into character encodings in the context of the dplyr
package, and I have observed the following behavior:

- Strings are treated preferentially in the native encoding
- Only upon specific request (via translateCharUTF8() or enc2utf8() or
...), they are translated to UTF-8 and marked as such
- On UTF-8 systems, strings are never marked as UTF-8
- ASCII strings are marked as ASCII internally, but this information
doesn't seem to be available, e.g., Encoding() returns "unknown" for
such strings
- Most functions in R are encoding-agnostic: they work the same
regardless if they receive a native or UTF-8 encoded string if they are
properly tagged
- One important difference are symbols, which must be in the native
encoding (and are always converted to native encoding, using <U+xxxx>
escapes)
- I/O is centered around the native encoding, e.g., writeLines() always
reencodes to the native encoding
- There is the "bytes" encoding which avoids reencoding.

I haven't looked into serialization or plot devices yet.

The conclusion to the "UTF-8 manifesto" [1] suggests "... to use UTF-8
narrow strings everywhere and convert them back and forth when using
platform APIs that don’t support UTF-8 ...". (It is written in the
context of the UTF-16 encoding used internally on Windows, but seems to
apply just the same here for the native encoding.) I think that Unicode
support in R could be greatly improved if we follow these guidelines.
This seems to mean:

- Convert strings to UTF-8 as soon as possible, and mark them as such
(also on systems where UTF-8 is the native encoding)
- Translate to native only upon specific request, e.g., in calls to API
functions or perhaps for .C()
- Use UTF-8 for symbols
- Avoid the forced round-trip to the native encoding in I/O functions
and for parsing (but still read/write native by default)
- Carefully look into serialization and plot devices
- Add helper functions that simplify mundane tasks such as
reading/writing a UTF-8 encoded file


Those are good long term goals, though I think the effort is easier 
than you think.  Rather than attempting to do it all at once, you 
should look for ways to do it gradually and submit self-contained 
patches.  In many cases it doesn't matter if strings are left in the 
local encoding, because the encoding doesn't matter.  The problems 
arise when UTF-8 strings are converted to the local encoding before 
it's necessary, because that's a lossy conversion.  So a simple way to 
proceed is to identify where these conversions occur, and remove them 
one-by-one.
Thanks, Duncan, this looks like a good start indeed. Did you really mean 
to say "the effort is easier than I think"? It would be great if I had 
overestimated the effort; I seldom do. That said, I'd be grateful if you 
could review/integrate/... future patches of mine towards parsing and 
sourcing of UTF-8 files with foreign characters; this problem seems to 
be self-contained (but perhaps not that easy).


I still think symbols should be in UTF-8, and this change might be 
difficult to split into smaller changes, especially if taking into 
account serialization and other potential pitfalls.




Currently I'm working on bug 16098, "Windows doesn't handle high 
Unicode code points".  It doesn't require many changes at all to 
handle input of those characters; all the remaining work is in 
avoiding the problems you identify above.  The origin of the issue is 
the fact that on Windows wchar_t is only 16 bits (not big enough to 
hold all Unicode code points).  As far as I know, Windows has no 
standard type to hold a Unicode code point; most of the run-time 
functions still use the 16-bit wchar_t.

I didn't mention non-BMP characters; they are an important issue as well.



I think once that bug is dealt with, 90+% of the remaining issues 
could be solved by avoiding translateChar on Windows.  This could be 
done by avoiding it everywhere, or by acting as though Windows is 
running in a UTF-8 locale until you actually need to write to a file.  
Other systems tend to have UTF-8 locales in common use, so they're 
already fine.
I'd argue against platform-specific switches.

Re: [Rd] source(), parse(), and foreign UTF-8 characters

2017-05-09 Thread Duncan Murdoch

On 09/05/2017 5:46 PM, Kirill Müller wrote:

[... earlier messages in the thread, quoted in full above ...]

Thanks, Duncan, this looks like a good start indeed. Did you really mean
to say "the effort is easier than I think"? It would be great if I had
overestimated the effort; I seldom do. That said, I'd be grateful if you
could review/integrate/... future patches of mine towards parsing and
sourcing of UTF-8 files with foreign characters; this problem seems to
be self-contained (but perhaps not that easy).


I'll definitely look at small ones.  I'm not sure I'll have enough time 
to do really big ones, so it's best to try to break things up into small 
bites.




[... remainder of the exchange, quoted in full above ...]

Re: [Rd] registering Fortran routines in R packages

2017-05-09 Thread Christophe Dutang
Thanks for your email.

I tried changing the name to lowercase, but it conflicts with a C implementation 
also named halton. So I renamed the C functions to halton2() and sobol2(), while the 
Fortran functions are HALTON() and SOBOL() (I also tried lowercase in the Fortran 
code). Unfortunately, it does not help, since I get

init.c:97:25: error: use of undeclared identifier 'halton_'; did you mean 
'halton2'?
  {"halton", (DL_FUNC) &F77_SUB(halton),  7},

My current solution is to comment out the FortEntries array and use 
R_useDynamicSymbols(dll, TRUE) for a dynamic search of Fortran routines.

Regards, Christophe
---
Christophe Dutang
LMM, UdM, Le Mans, France
web: http://dutangc.free.fr 
> On 9 May 2017, at 14:32, Berend Hasselman  wrote:
> 
> 
>> On 9 May 2017, at 13:44, Christophe Dutang  wrote:
>> 
>> Dear list,
>> 
>> I’m trying to register Fortran routines in randtoolbox (in the src/init.c file), 
>> see 
>> https://r-forge.r-project.org/scm/viewvc.php/pkg/randtoolbox/src/init.c?view=markup&root=rmetrics.
>>  
>> 
>> Reading 
>> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Registering-native-routines
>>  and looking at what is done in the stats package, I first thought that the 
>> following code would do the job:
>> 
>> static const R_FortranMethodDef FortEntries[] = {
>> {"halton", (DL_FUNC) &F77_NAME(HALTON),  7},
>> {"sobol", (DL_FUNC) &F77_NAME(SOBOL),  11},
>> {NULL, NULL, 0}
>> };
>> 
>> But I got error messages when building: use of undeclared identifier 
>> ‘SOBOL_’. I also tried lowercase sobol and halton.
>> 
>> Looking at the expm package 
>> https://r-forge.r-project.org/scm/viewvc.php/pkg/src/init.c?view=markup&revision=94&root=expm,
>>  I tried
>> 
>> static const R_FortranMethodDef FortEntries[] = {
>> {"halton", (DL_FUNC) &F77_SUB(HALTON),  7},
>> {"sobol", (DL_FUNC) &F77_SUB(SOBOL),  11},
>> {NULL, NULL, 0}
>> };
>> 
>> But the problem remains the same.
>> 
>> Is there a way to have a header file for Fortran code? How do I declare 
>> routines defined in my Fortran file src/LowDiscrepancy.f?
>> 
> 
> Lowercase routine names? The manual does mention that.
> 
> Berend Hasselman
> 
> 
>> Any help appreciated
>> 
>> Regards, Christophe
>> ---
>> Christophe Dutang
>> LMM, UdM, Le Mans, France
>> web: http://dutangc.free.fr
>> 
> 



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] registering Fortran routines in R packages

2017-05-09 Thread Berend Hasselman
Christophe, 

> On 10 May 2017, at 08:08, Christophe Dutang  wrote:
> 
> [... message quoted in full above ...]

Have a look at my package geigen and its init.c.
Could it be that you are missing extern declarations for the Fortran routines?


Berend



Re: [Rd] registering Fortran routines in R packages

2017-05-09 Thread Jari Oksanen
Have you tried using tools:::package_native_routine_registration_skeleton()? If 
you don't like its output, you can easily edit its results and still avoid most 
pitfalls.

Cheers, Jari Oksanen

From: R-devel  on behalf of Berend Hasselman 

Sent: 10 May 2017 09:48
To: Christophe Dutang
Cc: r-devel@r-project.org
Subject: Re: [Rd] registering Fortran routines in R packages


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel