[Rd] How to print UTF-8 encoded strings from a C routine to R's output?

2016-09-05 Thread Lixin Gong
Dear R experts,

According to
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Printing,
Rprintf must be used when printing from a C routine to guarantee that the
text is written to R’s output.

However, if a string is UTF-8 encoded, non-ASCII characters (e.g., the
infinity symbol, http://www.fileformat.info/info/unicode/char/221e/index.htm)
are misprinted.
Is this unsupported, or is there a workaround for this limitation?

Thanks!

Michael

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] How to print UTF-8 encoded strings from a C routine to R's output?

2016-09-05 Thread Duncan Murdoch

On 05/09/2016 12:40 AM, Lixin Gong wrote:

> Dear R experts,
>
> According to
> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Printing,
> Rprintf must be used when printing from a C routine to guarantee that the
> text is written to R’s output.
>
> However, if a string is UTF-8 encoded, non-ASCII characters (e.g., the
> infinity symbol, http://www.fileformat.info/info/unicode/char/221e/index.htm)
> are misprinted.
> Is this unsupported, or is there a workaround for this limitation?


If you are working in a UTF-8 locale (as on most Unix-like systems), you 
should be fine.  If not (as is normal on Windows), you'll need to 
translate the string to the local encoding.  The Writing R Extensions 
manual section 6.11 tells you how to do the re-encoding.


Duncan Murdoch
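
Before re-encoding, it can help to confirm which of the two cases applies. The following R-level sketch only checks whether the session runs in a UTF-8 locale; it is a diagnostic and does not replace the C-level re-encoding described in the manual. The helper name `needs_reencoding` is hypothetical and not part of R.

```r
# Hypothetical helper (not part of R): TRUE when the session is not in a
# UTF-8 locale, i.e. when UTF-8 strings would need translating to the
# native encoding before being printed from C code.
needs_reencoding <- function() {
  !isTRUE(l10n_info()[["UTF-8"]])
}

needs_reencoding()
```

`l10n_info()` is an existing base R function; its `UTF-8` component reports whether the current locale is UTF-8.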

Re: [Rd] Defragmentation of memory

2016-09-05 Thread luke-tierney

On Mon, 5 Sep 2016, Måns Magnusson wrote:


> Dear all developers,
>
> I'm working with a lot of textual data in R and need to handle it batch
> by batch. I read in batches of 10,000 documents and do some calculations
> (unigrams, 2-grams and 3-grams) that result in objects that consume quite
> a lot of memory. In every iteration a new object (~500 MB) is created (I
> can't control the size, so a new object needs to be created each
> iteration). The speed of this computation decreases with every iteration
> (first iteration 7 sec, after 30 iterations 20-30 minutes per iteration).
>
> I think I have localized the problem to R's memory handling and that my
> approach is fragmenting the memory. If I do this batch handling in Bash,
> starting up a new R session for each batch, it takes ~7 sec per batch, so
> the individual batches are not the cause. The garbage collector does not
> seem to handle this (potential) fragmentation.
>
> Can the reason for the poor performance after a couple of iterations be
> that I'm fragmenting the memory? If so, is there a solution that can be
> used to handle this within R, such as defragmentation or restarting R
> from within R?


Highly unlikely. Fragmentation is rarely an issue on a 64-bit OS and
the symptoms would be different.

To get help with what is actually happening please post a minimal
reproducible example, and please not in html.

Best,

luke
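
A minimal reproducible example of the kind requested might look like the sketch below. It is an assumption about how the problem could be reduced, not the poster's code: `process_batch` is a hypothetical toy stand-in that counts bigrams in random "documents", and the loop records the elapsed time and memory in use per iteration so that any slowdown becomes visible.

```r
# Hypothetical toy stand-in for one batch: count bigrams in n_docs
# random "documents" made of single-letter words.
process_batch <- function(n_docs, words_per_doc = 50L) {
  docs <- replicate(n_docs,
                    paste(sample(letters, words_per_doc, replace = TRUE),
                          collapse = " "))
  tokens <- strsplit(docs, " ", fixed = TRUE)
  bigrams <- unlist(lapply(tokens, function(tok) {
    paste(tok[-length(tok)], tok[-1L])
  }))
  table(bigrams)
}

n_iter <- 5L
elapsed <- numeric(n_iter)
for (i in seq_len(n_iter)) {
  elapsed[i] <- system.time(res <- process_batch(1000L))[["elapsed"]]
  rm(res)
  mem <- gc()  # run the garbage collector and get the memory summary
  cat(sprintf("iteration %d: %.2f s, %.1f MB in use\n",
              i, elapsed[i], sum(mem[, 2L])))
}
```

If the per-iteration time stays flat in a reduced example like this, the slowdown more likely comes from something that accumulates across iterations in the real code.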



> With kind regards
> Måns Magnusson
>
> PhD Student, Statistics, Linköping University.



--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Department of Statistics and        Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email: luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu

Re: [Rd] How to print UTF-8 encoded strings from a C routine to R's output?

2016-09-05 Thread Lixin Gong
Hi Duncan,

Thanks a lot for your quick reply pointing out the Re-encoding section that
I missed!

Before trying out R's C-level interface to iconv's encoding conversion
capabilities, I did some quick tests with Encoding() and iconv() on Windows
with Rgui and Rterm.
After Encoding(), non-ASCII characters are fine in Rgui but still wrong in
Rterm.
After iconv(), non-ASCII characters are still misprinted in both Rgui and
Rterm.

Here is the code that I used:

(neg_inf_utf8_hex <- as.raw(c(0x2d, 0xe2, 0x88, 0x9e)))
(neg_inf_utf8 <- rawToChar(neg_inf_utf8_hex))
Encoding(neg_inf_utf8)

Encoding(neg_inf_utf8) <- "UTF-8"
Encoding(neg_inf_utf8)
neg_inf_utf8

charToRaw(neg_inf_utf8)
iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = FALSE)
iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = TRUE)

Here is what I got with Rgui:

> (neg_inf_utf8_hex <- as.raw(c(0x2d, 0xe2, 0x88, 0x9e)))
[1] 2d e2 88 9e
> (neg_inf_utf8 <- rawToChar(neg_inf_utf8_hex))
[1] "-∞"
> Encoding(neg_inf_utf8)
[1] "unknown"
>
> Encoding(neg_inf_utf8) <- "UTF-8"
> Encoding(neg_inf_utf8)
[1] "UTF-8"
> neg_inf_utf8
[1] "-∞"
>
> charToRaw(neg_inf_utf8)
[1] 2d e2 88 9e
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = FALSE)
[1] "-8"
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = TRUE)
[[1]]
[1] 2d 38
>

Here is what I got with Rterm:

> (neg_inf_utf8_hex <- as.raw(c(0x2d, 0xe2, 0x88, 0x9e)))
[1] 2d e2 88 9e
> (neg_inf_utf8 <- rawToChar(neg_inf_utf8_hex))
[1] "-â^z"
> Encoding(neg_inf_utf8)
[1] "unknown"
>
> Encoding(neg_inf_utf8) <- "UTF-8"
> Encoding(neg_inf_utf8)
[1] "UTF-8"
> neg_inf_utf8
[1] "-8"
>
> charToRaw(neg_inf_utf8)
[1] 2d e2 88 9e
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = FALSE)
[1] "-8"
> iconv(neg_inf_utf8, from = "UTF-8", to = "", toRaw = TRUE)
[[1]]
[1] 2d 38
>

Here is the sessionInfo:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
>

Am I missing something obvious?  Thanks a lot for your help and your time!

Michael
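
One possible explanation for the "-8" results, not confirmed in this thread, is that code page 1252 has no infinity symbol, so the conversion to the native encoding substitutes a look-alike character instead of failing. The sketch below is a way to make such unconvertible characters visible. It targets "ASCII" explicitly rather than the native encoding so that the outcome does not depend on the locale, although results can still differ between iconv implementations.

```r
# Same string as in the message above: "-" followed by U+221E (infinity).
neg_inf_utf8 <- rawToChar(as.raw(c(0x2d, 0xe2, 0x88, 0x9e)))
Encoding(neg_inf_utf8) <- "UTF-8"

# Without `sub`, a string that cannot be converted becomes NA.
strict <- iconv(neg_inf_utf8, from = "UTF-8", to = "ASCII")

# With sub = "byte", each unconvertible byte is shown as <xx>
# instead of being silently replaced.
marked <- iconv(neg_inf_utf8, from = "UTF-8", to = "ASCII", sub = "byte")

is.na(strict)
marked
```

Checking for `NA` or for `<xx>` markers before printing gives a chance to fall back to an ASCII spelling such as "-Inf" when the target encoding cannot represent the character.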


[Rd] mget call can trigger C stack usage error

2016-09-05 Thread Alexandre Courtiol
Hi all, I'm not sure whether you will call this a bug or something else, but
the following silly call triggers a low-level error:

foo <- list(x=1)
class(foo) <- "new"
print.new <- function(x, ...) print(mget(names(formals())))
foo

> Error: C stack usage  7969412 is too close to the limit
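
A plausible reading of this error, not confirmed in the thread: inside `print.new`, `formals()` with no argument refers to the function it is called from, so `mget()` returns a list that contains `x` itself, and printing that list dispatches to `print.new` again without end. The sketch below shows the first step and one way to avoid the re-dispatch; `show_formals` is a hypothetical name used only for illustration.

```r
# formals() with no argument inspects the function it is called from.
show_formals <- function(x, ...) names(formals())
show_formals(1)

foo <- structure(list(x = 1), class = "new")

# Dropping the class before printing avoids re-dispatching to print.new().
print.new <- function(x, ...) print(unclass(x))
out <- capture.output(print(foo))
out
```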



-- 
Alexandre Courtiol

http://sites.google.com/site/alexandrecourtiol/home

*"Science is the belief in the ignorance of experts"*, R. Feynman



Re: [Rd] A bug in the R Mersenne Twister (RNG) code?

2016-09-05 Thread Martin Maechler
> Gabriel Becker 
> on Thu, 1 Sep 2016 08:34:31 -0700 writes:

> I wonder how useful a (set of?) "time machine" functions
> which look up /infer things like this based on a date
> would be. Could ease the pain of changes generally, though
> not remove it completely.

Such a set (possibly of size one) may be quite useful, notably
if it had an intuitive interface.
I'd recommend partly following options() here, i.e., the call
  oc <- compatibilityR("2000-02-29")

would set random number generators (and other changeable
defaults) to those that were in effect when R 1.0.0 was released,
*and* a later call

  compatibilityR (oc)  # reset to previous state

would do what the comment says.
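
The proposed interface could be sketched as follows. This is a hypothetical illustration only: `compatibilityR` does not exist in R, the sketch covers random number generators and no other defaults, and the small table of release dates is an illustrative, incomplete assumption. It relies on the existing base functions `RNGkind()` and `RNGversion()`.

```r
# Hypothetical sketch of the proposed interface; not part of R.
compatibilityR <- function(x) {
  if (inherits(x, "compat_state")) {
    # Restore a state returned by an earlier call.
    suppressWarnings(do.call(RNGkind, as.list(x$rng)))
    return(invisible(NULL))
  }
  # Illustrative, incomplete table of releases and their dates.
  releases <- c("1.0.0" = "2000-02-29",
                "1.7.0" = "2003-04-16",
                "3.6.0" = "2019-04-26")
  eligible <- which(as.Date(releases) <= as.Date(x))
  if (length(eligible) == 0L) stop("date precedes R 1.0.0")
  old <- structure(list(rng = RNGkind()), class = "compat_state")
  # Selecting old generators may emit warnings, which are silenced here.
  suppressWarnings(RNGversion(names(releases)[max(eligible)]))
  invisible(old)
}

oc <- compatibilityR("2000-02-29")  # RNG defaults as of R 1.0.0
RNGkind()
compatibilityR(oc)                  # reset to previous state
```

Returning the previous state invisibly mirrors how `options()` returns the old values so that they can be passed back later.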


> On Wed, Aug 31, 2016 at 5:45 PM, Paul Gilbert
>  wrote:

>> 
>> 
>> On 08/30/2016 06:29 PM, Duncan Murdoch wrote:
>> 
>>> I don't see evidence of a bug.  There have been several
>>> versions of the MT; we may be using a different version
>>> than you are.  Ours is the 1999/10/28 version; the web
>>> page you cite uses one from 2002.
>>> 
>>> Perhaps the newer version fixes some problems, and then
>>> it would be worth considering a change.  But changing
>>> the default RNG definitely introduces problems in
>>> reproducibility,
>>> 
>> 
>> Well "problems in reproducibility" is a bit
>> vague. Results would always be reproducible by specifying
>> kind="Mersenne-Twister" or kind="Buggy Kinderman-Ramage"
>> for older results, so there is no problem reproducing
>> results. The only problem is that users expecting to
>> reproduce results twenty years later will need to know
>> what random generator they used. (BTW, they may also need
>> to record information about the normal or other
>> generator, as well as the seed.) Of course, these changes
>> are recorded pretty well for R, so the history of
>> "default" can always be found.
>> 
>> I think it is a mistake to encourage users into thinking
>> they do not need to keep track of some information if
>> they want reproducibility. Perhaps the default should be
>> changed more often in order to encourage better user
>> habits.
>> 
>> More seriously, I think "default" should continue to be
>> something that is currently considered to be good. So, if
>> there really is a known problem, then I think "default"
>> should be changed.
>> 
>> (And, no I did not get burned by the R 1.7.0 change in
>> the default generator. I got burned by a much earlier,
>> unadvertised, and more subtle change in the Splus
>> generator.)
>> 
>> Paul Gilbert
>> 
>> 
>> so it's not obvious that we
>> 
>>> would do it.
>>> 
>>> Duncan Murdoch
>>> 
>>> 
>>> On 30/08/2016 5:45 PM, Mark Roberts wrote:
>>> 
>>>> Whomever,
>>>>
>>>> I recently sent the "bug report" below
>>>> to r-c...@r-project.org and have just been asked to
>>>> instead submit it to you.
>>>>
>>>> Although I am basically not an R user, I have installed
>>>> version 3.3.1 and am also the author of a statistics
>>>> program written in Visual Basic that contains a
>>>> component which correctly implements the Mersenne
>>>> Twister (MT) algorithm.  I believe that it is not
>>>> possible to generate the correct stream of pseudorandom
>>>> numbers using the MT default random number generator in
>>>> R, and am not the first person to notice this.  Here is
>>>> a posted 2013 entry
>>>> (www.r-bloggers.com/reproducibility-and-randomness/) on
>>>> an R website that asserts that the SAS computer program
>>>> implementation of the MT algorithm produces different
>>>> numbers than R does when using the same starting seed
>>>> number.  The author of this post didn’t get anyone to
>>>> respond to his query about the reason for this SAS
>>>> vs. R discrepancy.
>>>>
>>>> There are two ways of initializing the original MT
>>>> computer program (written in C) so that an identical
>>>> stream of numbers can be repeatedly generated: 1) with
>>>> a particular integer seed number, and 2) with a
>>>> particular array of integers.  In the 'compilation and
>>>> usage' section of this webpage
>>>> (https://github.com/cslarsen/mersenne-twister) there is
>>>> a listing of the first 200 random numbers the MT
>>>> algorithm should produce for seed number = 1.  The
>>>> inventors of the Mersenne Twister random number
>>>> generator provided two different sets of the first 1000
>>>> numbers produced by a correctly coded 32-bit
>>>> implementation of the MT algorithm when initializing it
>>>> with a particular array of integers at:
>>>> www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/CODES/mt19937ar.out.
>>>> [There is a link to this output at:
>>>> www.math.sci.hiroshima-u.ac.j