Re: [Rd] [parallel] fixes load balancing of parLapplyLB

2018-03-01 Thread Christian Krause
Dear Tomas,

Thanks for your commitment to fix this issue and also to add the chunk size as 
an argument. If you want our input, let us know ;)

Best Regards

On 02/26/2018 04:01 PM, Tomas Kalibera wrote:
> Dear Christian and Henrik,
> 
> thank you for spotting the problem and suggestions for a fix. We'll probably 
> add a chunk.size argument to parLapplyLB and parLapply to follow OpenMP 
> terminology, which has already been an inspiration for the present code 
> (parLapply already implements static scheduling via internal function 
> staticClusterApply, yet with a fixed chunk size; parLapplyLB already 
> implements dynamic scheduling via internal function dynamicClusterApply, but 
> with a fixed chunk size set to an unlucky value so that it behaves like 
> static scheduling). The default chunk size for parLapplyLB will be set
> so that there is some dynamism in the schedule even by default. I am now 
> testing a patch with these changes.
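> 
> For a rough feel of what such a chunk.size argument would control, a
> small sketch (the name and the default in the final patch may differ):
> 
> chunk.size <- 5
> nbrOfElements <- 97
> chunks <- split(seq_len(nbrOfElements),
>                 ceiling(seq_len(nbrOfElements) / chunk.size))
> length(chunks)  # 20 chunks, handed out to workers as they become free
> 
> chunk.size = 1 gives fully dynamic scheduling (maximal balancing, maximal
> overhead), while ceiling(97/5) = 20 gives one chunk per worker, i.e.
> effectively static scheduling with 5 workers.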
> 
> Best
> Tomas
> 
> 
> On 02/20/2018 11:45 AM, Christian Krause wrote:
>> Dear Henrik,
>>
>> The rationale is just that it is within these extremes and that it is really 
>> simple to calculate, without making any assumptions and knowing that it 
>> won't be perfect.
>>
>> The extremes A and B you are mentioning are special cases based on 
>> assumptions. Case A is based on the assumption that the function has a long 
>> runtime or varying runtime, then you are likely to get the best load 
>> balancing with really small chunks. Case B is based on the assumption that 
>> the function runtime is the same for each list element, i.e. where you don't 
>> actually need load balancing, i.e. just use `parLapply` without load 
>> balancing.
>>
>> This new default is **not the best one**. It's just a better one than we had 
>> before. There is no best one we can use as default because **we don't know 
>> the function runtime and how it varies**. The user needs to decide that 
>> because he/she knows the function. As mentioned before, I will write a patch 
>> that makes the chunk size an optional argument, so the user can decide 
>> because only he/she has all the information to choose the best chunk size, 
>> just like you did with the `future.scheduling` parameter.
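>>
>> For illustration, a call might then look like this (the argument name is
>> hypothetical until the patch exists; `cl`, `X` and `slowFun` are
>> placeholders):
>>
>> parLapplyLB(cl, X, slowFun, chunk.size = 5)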
>>
>> Best Regards
>>
>> On February 19, 2018 10:11:04 PM GMT+01:00, Henrik Bengtsson wrote:
>>> Hi, I'm trying to understand the rationale for your proposed amount of
>>> splitting and more precisely why that one is THE one.
>>>
>>> If I put labels on your example numbers in one of your previous posts:
>>>
>>> nbrOfElements <- 97
>>> nbrOfWorkers <- 5
>>>
>>> With these, there are two extremes in how you can split up the
>>> processing in chunks such that all workers are utilized:
>>>
>>> (A) Each worker, called multiple times, processes one element each
>>> time:
>>>
>>>> nbrOfElements <- 97
>>>> nbrOfWorkers <- 5
>>>> nbrOfChunks <- nbrOfElements
>>>> sapply(parallel:::splitList(1:nbrOfElements, nbrOfChunks), length)
>>> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>> [30] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>> [59] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>>> [88] 1 1 1 1 1 1 1 1 1 1
>>>
>>>
>>> (B) Each worker, called once, processes multiple elements:
>>>
>>>> nbrOfElements <- 97
>>>> nbrOfWorkers <- 5
>>>> nbrOfChunks <- nbrOfWorkers
>>>> sapply(parallel:::splitList(1:nbrOfElements, nbrOfChunks), length)
>>> [1] 20 19 19 19 20
>>>
>>> I understand that neither of these two extremes may be the best when
>>> it comes to orchestration overhead and load balancing. Instead, the
>>> best might be somewhere in-between, e.g.
>>>
>>> (C) Each worker, called multiple times, processes multiple elements:
>>>
>>>> nbrOfElements <- 97
>>>> nbrOfWorkers <- 5
>>>> nbrOfChunks <- nbrOfElements / nbrOfWorkers
>>>> sapply(parallel:::splitList(1:nbrOfElements, nbrOfChunks), length)
>>> [1] 5 5 5 5 4 5 5 5 5 5 4 5 5 5 5 4 5 5 5 5
>>>
>>> However, there are multiple alternatives between the two extremes, e.g.
>>>
>>>> nbrOfChunks <- scale * nbrOfElements / nbrOfWorkers
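>>>
>>> For instance, the resulting number of chunks at scale = 1, 2, 4 would be
>>> (using ceiling() for illustration; splitList() may round differently):
>>>
>>>> sapply(c(1, 2, 4), function(scale) ceiling(scale * nbrOfElements / nbrOfWorkers))
>>> [1] 20 39 78
>>>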
>>> So, is there a reason why you argue for scale = 1.0 to be the optimal?
>>>
>>> FYI, in future.apply::future_lapply(X, FUN, ...) there is a
>>> 'future.scheduling' scale factor(*) argument, where the default
>>> future.scheduling = 1 corresponds to (B) and future.scheduling = +Inf
>>> to (A).  Using future.scheduling = 4 achieves the amount of
>>> load-balancing you propose in (C).
>>> (*) Different definition from the above 'scale'. (Disclaimer: I'm the author)
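>>>
>>> A quick usage sketch (assuming a multisession plan; sqrt is just a
>>> placeholder function):
>>>
>>>> library(future.apply)
>>>> plan(multisession, workers = 5)
>>>> y <- future_lapply(1:97, sqrt, future.scheduling = 4)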
>>>
>>> /Henrik
>>>
>>> On Mon, Feb 19, 2018 at 10:21 AM, Christian Krause wrote:
>>>> Dear R-Devel List,
>>>>
>>>> I have installed R 3.4.3 with the patch applied on our cluster and
>>>> ran a *real-world* job of one of our users to confirm that the patch
>>>> works to my satisfaction. Here are the results.
>>>>
>>>> The original was a series of jobs, all essentially doing the same
>>>> stuff using bootstrapped data, so for the original there 

[Rd] Small program embedding R crashes in 64 bits

2018-03-01 Thread William
Hi everyone,

I'm trying to create a small C++ program which embeds R, but I'm having
problems when I try to do it on 64-bit Windows. I have created a
minimal reproducible example, which is just the
src/gnuwin32/front-ends/rtest.c file with the R_ReplDLLdo1() loop; the
only difference is that I set the interactive mode to TRUE. Here is
the cpp file: https://gist.github.com/anonymous/08b42e83c949e250f60b068d58a3ec51

When compiled in 32 bits, everything works: I enter R commands and
nothing crashes. When compiled in 64 bits (mingw64 and R x64 libs, and
executed with R x64 in the PATH), everything works except when a
command entered by the user raises an error in R. Typically, entering "a"
shows "Error: object 'a' not found" and then the program immediately
crashes. Typing stop() also triggers a crash.

The exit code returned by the program is 0xC0000028, which is
STATUS_BAD_STACK, with the description: "An invalid or unaligned stack
was encountered during an unwind operation". I'm not really good at
C++ or makefile/compiler stuff, but I can't get it to work. I'm
guessing this has to do with the longjmps used to return to the prompt
when there is an error, but I don't know how to fix it.

Compiling in 32 bits:
P:/Rtools/mingw_32/bin/g++ -O3 -Wall -pedantic -IP:/R/R-3.4.3/include
-c testr.cpp -o testr.o
P:/Rtools/mingw_32/bin/g++ -o ./32.exe ./testr.o
-LP:/R/R-3.4.3/bin/i386 -lR -lRgraphapp

Results in:
C:\test> 32.exe
> a
Error: object 'a' not found
> # it works!

But compiling in 64 bits:
P:/Rtools/mingw_64/bin/g++ -O3 -Wall -pedantic -IP:/R/R-3.4.3/include
-c testr.cpp -o testr.o
P:/Rtools/mingw_64/bin/g++ -o ./64.exe ./testr.o
-LP:/R/R-3.4.3/bin/x64 -lR -lRgraphapp

Fails like this:
C:\test> 64.exe
> b <- 1
> b
[1] 1
> a
Error: object 'a' not found


I've tried lots of -std= flags, -DWIN64, -D_WIN64 and lots of other
defines I could find or think of, but with no luck. What is missing?

Thanks,

William.



[Rd] Bug report - duplicate row names with as.data.frame()

2018-03-01 Thread Ron
Hello,

I'd like to report what I think is a bug: using as.data.frame() we can
create duplicate row names in a data frame. R version 3.4.3 (current stable
release).

Rather than paste code in an email, please see the example formatted code
here:
https://stackoverflow.com/questions/49031523/duplicate-row-names-in-r-using-as-data-frame

I posted to StackOverflow, and consensus was that we should proceed with
this as a bug report.
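
In essence the linked example boils down to something like this (a minimal
sketch; see the link for the original code):

m <- matrix(1:4, nrow = 2, dimnames = list(c("a", "a"), c("x", "y")))
df <- as.data.frame(m)   # succeeds silently in R 3.4.3
rownames(df)             # "a" "a" -- duplicate row names in a data frame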

Thanks,
Ron



Re: [Rd] Bug report - duplicate row names with as.data.frame()

2018-03-01 Thread Martyn Plummer
On Thu, 2018-03-01 at 09:36 -0500, Ron wrote:
> Hello,
> 
> I'd like to report what I think is a bug: using as.data.frame() we can
> create duplicate row names in a data frame. R version 3.4.3 (current stable
> release).
> 
> Rather than paste code in an email, please see the example formatted code
> here:
> https://stackoverflow.com/questions/49031523/duplicate-row-names-in-r-using-as-data-frame
> 
> I posted to StackOverflow, and consensus was that we should proceed with
> this as a bug report.

Yes, that is definitely a bug.

The end of the as.data.frame.matrix method has:

attr(value, "row.names") <- row.names
class(value) <- "data.frame"
value

Changing this to:

class(value) <- "data.frame"
row.names(value) <- row.names
value

ensures that the row.names<-.data.frame method is called, with its
built-in check for duplicate names.
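
With that change, the conversion should fail up front; e.g. (a sketch of
the expected behaviour, not a transcript):

m <- matrix(1:4, nrow = 2, dimnames = list(c("a", "a"), NULL))
as.data.frame(m)
## Error: duplicate 'row.names' are not allowed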

There are quite a few as.data.frame methods so this could be a
recurring problem. I will check.

Martyn


> Thanks,
> Ron
> 


Re: [Rd] scale.default gives an incorrect error message when is.numeric() fails on a dgeMatrix

2018-03-01 Thread Martin Maechler
> Michael Chirico on Tue, 27 Feb 2018 20:18:34 +0800 writes:

Slightly amended 'Subject': (unimportant mistake: a dgeMatrix is *not* sparse)

MM: modified to commented R code,  slightly changed from your post:


## I am attempting to use the lars package with a sparse input feature matrix,
## but the following fails:

library(Matrix)
library(lars)
data(diabetes) # from 'lars'
##UA Aagghh! not like this -- both attach() *and* as.data.frame() are horrific!
##UA  attach(diabetes)
##UA  x = as(as.matrix(as.data.frame(x)), 'dgCMatrix')
x <- as(unclass(diabetes$x), "dgCMatrix")
lars(x, y, intercept = FALSE)
## Error in scale.default(x, FALSE, normx) :
##   length of 'scale' must equal the number of columns of 'x'

## More specifically, scale.default fails as called from lars():
normx <- new("dgeMatrix",
  x = c(4, 0, 9, 1, 1, -1, 4, -2, 6, 6)*1e-14, Dim = c(1L, 10L),
  Dimnames = list(NULL,
  c("x.age", "x.sex", "x.bmi", "x.map", "x.tc",
"x.ldl", "x.hdl", "x.tch", "x.ltg", "x.glu")))
scale.default(x, center=FALSE, scale = normx)
## Error in scale.default(x, center = FALSE, scale = normx) :
##   length of 'scale' must equal the number of columns of 'x'

>  The problem is that this check fails because is.numeric(normx) is FALSE:

>  if (is.numeric(scale) && length(scale) == nc)

>  So, the error message is misleading. In fact length(scale) is the same as
>  nc.

Correct, twice.

>  At a minimum, the error message needs to be repaired; do we also want to
>  attempt as.numeric(normx) (which I believe would have allowed scale to work
>  in this case)?

It seems sensible to allow  both 'center' and 'scale' to only
have to *obey*  as.numeric(.)  rather than fulfill is.numeric(.).
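
A sketch of the kind of relaxation meant (a toy helper for illustration,
not the actual scale.default code):

## coerce first, then apply the existing length check
checkScale <- function(scale, nc) {
    if (!is.logical(scale))
        scale <- as.numeric(scale)
    is.numeric(scale) && length(scale) == nc
}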

Though that is not a bug in scale()  as its help page has always
said that 'center' and 'scale' should either be a logical value
or a numeric vector.

For that reason one can really claim a bug in 'lars', which should
really not use

   scale(x, FALSE, normx)

but rather

   scale(x, FALSE, scale = as.numeric(normx))

and then all would work.

> -

>  (I'm aware that there are some import issues in lars, as the offending line
>  to create normx *should* work, as is.numeric(sqrt(drop(rep(1, nrow(x)) %*%
>  (x^2)))) is TRUE -- it's simply that lars doesn't import the appropriate S4
>  methods)

>  Michael Chirico

Yes, 'lars' has _not_ been updated since  Spring 2013, notably
because its authors have been saying (for rather more than 5
years I think) that one should really use 

 require("glmnet")

instead.

Your point is still valid that it would be easy to enhance
base :: scale.default()  so it'd work in more cases.

Thank you for that.  I do plan to consider such a change in
R-devel (planned to become R 3.5.0 in April).

Martin Maechler,
ETH Zurich

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] scale.default gives an incorrect error message when is.numeric() fails on a dgeMatrix

2018-03-01 Thread Michael Chirico
Thanks. I know the setup code is a mess; I just duct-taped something together
from the examples in lars (which are a mess in turn). In fact, when I messaged
Prof. Hastie he recommended using glmnet. I wonder why lars is kept on CRAN if
they've no intention of maintaining it... but I digress...

On Mar 2, 2018 1:52 AM, "Martin Maechler" wrote:

> [...]



[Rd] Repeated use of dyn.load().

2018-03-01 Thread Rolf Turner


I sent this enquiry to r-help and received several sympathetic replies, 
none of which were definitive.


It was kindly suggested to me that I might get better mileage out of 
r-devel, so I'm trying here.  I hope that this is not inappropriate.


My original enquiry to r-help:

==
I am working with a function "foo" that explicitly dynamically loads a
shared object library or "DLL", doing something like dyn.load("bar.so").
This is a debugging exercise, so I make changes to the underlying
Fortran code (yes, I acknowledge that I am a dinosaur), remake the DLL
"bar.so" and then run foo again.  This is all *without* quitting and
restarting R.  (I'm going to have to do this a few brazillion times, and
I want the iterations to be as quick as possible.)

This seems to work --- i.e. foo seems to obtain the latest version of
bar.so.  But have I just been lucky so far?  (I have not experimented
heavily.)


Am I running risks of leading myself down the garden path?  Are there 
Traps for Young (or even Old) Players lurking about?


I would appreciate Wise Counsel.
==

One of the replies that I received from r-help indicated that it might 
be safer if I were to apply dyn.unload() on each iteration.  So I 
thought I might put in the line of code


on.exit(dyn.unload("bar.so"))

immediately after my call to dyn.load().
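
In context, the whole thing would look something like this (a sketch; the
routine name is mine):

foo <- function(...) {
    dyn.load("bar.so")
    on.exit(dyn.unload("bar.so"))
    .Fortran("bar", ...)
}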

Comments?

Another reply pointed out that "Writing R Extensions" indicates that 
there could be problems under Solaris, but does not single out any other 
OS for comment.  Might I infer that I am "safe" as long as I don't use 
Solaris?  (Which I certainly *won't* be doing.)


Thanks.

cheers,

Rolf Turner

--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
