Re: [Rd] most robust way to call R API functions from a secondary thread

2019-05-20 Thread Stepan

Hi Andreas,

note that with the introduction of ALTREP, as far as I understand, calls 
as "simple" as DATAPTR can execute arbitrary code (R or native). Even 
without ALTREP, if you execute user-provided R code via Rf_eval and such 
on some custom thread, you may end up executing native code of some 
package, which may assume it is executed only from the R main thread.


Could you (1) decompose your problem in a way that in some initial phase 
you pull all the necessary data from R, then start the parallel 
computation, and then again in the R main thread "submit" the results 
back to the R world?


If you want something really robust, you could (2) "send" the requests 
for R API usage to the R main thread and pause the worker thread until 
it receives the results back. This looks similar to what the "later" 
package does. Maybe you can even use that package for your purposes?
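
For illustration, the request-and-wait pattern could look roughly like this 
(a hedged sketch with hypothetical names like r_task and submit_to_main; it 
handles a single pending request, and the mechanism that wakes the main 
thread's event loop, e.g. via the "later" package, is omitted):

#include <pthread.h>
#include <stddef.h>

/* one pending request; a real implementation would use a queue */
typedef struct {
    void *(*fun)(void *);   /* R API work, to run on the main thread only */
    void *arg;
    void *result;
    int   done;
} r_task;

static r_task *pending = NULL;
static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cv  = PTHREAD_COND_INITIALIZER;

/* worker side: publish the task, then sleep until the main thread is done */
void *submit_to_main(r_task *t)
{
    pthread_mutex_lock(&q_mtx);
    pending = t;
    while (!t->done)
        pthread_cond_wait(&q_cv, &q_mtx);
    pthread_mutex_unlock(&q_mtx);
    return t->result;
}

/* main-thread side: called from the R event loop */
void drain_requests(void)
{
    pthread_mutex_lock(&q_mtx);
    if (pending) {
        r_task *t = pending;
        pending = NULL;
        t->result = t->fun(t->arg);  /* safe: we are on the R main thread */
        t->done = 1;
        pthread_cond_broadcast(&q_cv);
    }
    pthread_mutex_unlock(&q_mtx);
}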


Do you want to parallelize your code to achieve better performance? Even 
with your proposed solution, you need synchronization and chances are 
that excessive synchronization will severely affect the expected 
performance benefits of parallelization. If you do not need to 
synchronize that much, then the question is whether you can make do with (1) or (2).


Best regards,
Stepan

On 19/05/2019 11:31, Andreas Kersting wrote:

Hi,

As the subject suggests, I am looking for the most robust way to call an 
(arbitrary) function from the R API from another but the main POSIX thread in a 
package's code.

I know that "[c]alling any of the R API from threaded code is ‘for experts only’ and strongly discouraged. 
Many functions in the R API modify internal R data structures and might corrupt these data structures if called 
simultaneously from multiple threads. Most R API functions can signal errors, which must only happen on the R 
main thread." 
(https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support)

Let me start with my understanding of the related issues and possible solutions:

1) R API functions are generally not thread-safe and hence one must ensure, 
e.g. by using mutexes, that no two threads use the R API simultaneously

2) R uses longjmps on error and interrupts as well as for condition handling 
and it is undefined behaviour to do a longjmp from one thread to another; 
interrupts can be suspended before creating the threads by setting 
R_interrupts_suspended = TRUE; by wrapping the calls to functions from the R 
API with R_ToplevelExec(), longjmps across thread boundaries can be avoided; 
the only reason for R_ToplevelExec() itself to fail with an R-style error 
(longjmp) is a pointer protection stack overflow

3) R_CheckStack() might be executed (indirectly), which will (probably) signal a stack overflow because it only 
works correctly when called from the main thread (see 
https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Threading-issues); 
in particular, any function that does allocations, e.g. via allocVector3(), might end up calling it via GC -> 
finalizer -> ... -> eval; the only way around this problem which I could find is to adjust R_CStackLimit, 
which is outside of the official API; it can be set to -1 to disable the check or be changed to a value 
appropriate for the current thread

4) R sets signal handlers for several signals and some of them make use of the 
R API; hence, issues 1) - 3) apply; signal masks can be used to block delivery 
of signals to secondary threads in general and to the main thread while other 
threads are using the R API; a combined sketch of mitigations 1) - 4) follows 
below
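
The sketch (hedged and condensed; it omits interrupt suspension, error 
handling and thread shutdown, and R_CStackLimit is, as noted in 3), outside 
the official API):

#include <pthread.h>
#include <signal.h>
#include <stdint.h>
#include <Rinternals.h>

extern uintptr_t R_CStackLimit;  /* not part of the official API */

static pthread_mutex_t r_api_mutex = PTHREAD_MUTEX_INITIALIZER;

/* body for R_ToplevelExec(): any longjmp raised in here is caught before
   it can cross a thread boundary */
static void call_r_api(void *data)
{
    Rf_mkCharCE((const char *) data, CE_UTF8);
}

void worker_calls_r_api(void *data)
{
    sigset_t all, old;
    sigfillset(&all);
    pthread_sigmask(SIG_BLOCK, &all, &old);   /* 4) no signals on this thread */

    pthread_mutex_lock(&r_api_mutex);         /* 1) exclusive R API access */
    uintptr_t saved = R_CStackLimit;
    R_CStackLimit = (uintptr_t) -1;           /* 3) disable the stack check */
    R_ToplevelExec(call_r_api, data);         /* 2) contain longjmps */
    R_CStackLimit = saved;
    pthread_mutex_unlock(&r_api_mutex);

    pthread_sigmask(SIG_SETMASK, &old, NULL);
}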


I basically have the following questions:

a) Is my understanding of the issues accurate?
b) Are there more things to consider when calling the R API from secondary 
threads?
c) Are the solutions proposed appropriate? Are there scenarios in which they 
will fail to solve the issue? Or might they even cause new problems?
d) Are there alternative/better solutions?

Any feedback on this is highly appreciated.

Below you can find a template which combines the proposed solutions (and skips 
all non-illustrative checks of return values). Additionally, 
R_CheckUserInterrupt() is used in combination with R_UnwindProtect() to 
regularly check for interrupts from the main thread, while still being able to 
cleanly cancel the threads before fun_running_in_main_thread() is left via a 
longjmp. This is e.g. required if the secondary threads use memory which was 
allocated in fun_running_in_main_thread() using e.g. R_alloc().

Best regards,
Andreas Kersting



#include 

Re: [Rd] Give update.formula() an option not to simplify or reorder the result -- request for comments

2019-05-20 Thread Danny Smith
Hi Abs,

Re: your last point:

> You made an interesting comment.
>
> > This is not
> > always the desired behavior, because formulas are increasingly used
> > for purposes other than specifying linear models.
>
> Can I ask what these purposes are?



Not sure how relevant these are/what Pavel was referring to specifically,
but there are a few alternative uses that I'm familiar with in the
tidyverse packages.

Since formulas store both an expression and an environment, they're really
useful for complex evaluation. rlang's "quosures" are a subclass of formula.

Otherwise, the main tidyverse use is as a shorthand for specifying anonymous
functions (this is used extensively, particularly in purrr). From
?dplyr::mutate_at:
# You can also pass formulas to create functions on the spot, purrr-style:
starwars %>% mutate_at(c("height", "mass"), ~scale2(., na.rm = TRUE))

Also see ?dplyr::case_when:
x <- 1:50
case_when(
  x %% 35 == 0 ~ "fizz buzz",
  x %% 5 == 0 ~ "fizz",
  x %% 7 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)

And in base R, formulas are used in the plotting functions, e.g.:
## boxplot on a formula:
boxplot(count ~ spray, data = InsectSprays, col = "lightgray")

Cheers,
Danny

On Mon, May 20, 2019 at 12:12 PM Abby Spurdle  wrote:

> Hi Pavel
> (Back On List)
>
> And my two cents...
>
> > At this time, the update.formula() method always performs a number of
> > transformations on the results, eliminating redundant variables and
> > reordering interactions to be after the main effects.
> > Thus the proposal is to add an option simplify= (defaulting to TRUE,
> > for backwards compatibility) that if FALSE will skip the simplification
> > step.
> > Any thoughts? One particular question that Martin raised is whether the
> > UI should be just a single logical argument, or something else.
>
> Firstly, note that the constructor for formula objects behaves differently
> to the update method, so I think any changes should be consistent between
> the two functions.
> > #constructor - doesn't simplify
> > y ~ x + x
> y ~ x + x
> > #update method - does simplify
> > update (y ~ x, ~. + x)
> y ~ x
>
> Interestingly, this doesn't simplify.
> > update (y ~ I (x), ~. + x)
> y ~ I(x) + x
>
> I think that simplification could mean different things.
> So, there could be something like:
> > update (y ~ x, ~. + x, strip=FALSE)
> y ~ I (2 * x)
>
> I don't know how easy that would be to implement.
> (Symbolic computation on par with computer algebra systems is a discussion
> in itself...).
> And you could have one argument (say, method="simplify") rather than two or
> more logical arguments.
>
> It would also be possible to allow partial forms of simplification, by
> specifying which terms should be collapsed; however, I doubt any possible
> usefulness of this would justify the complexity.
> However, feel free to disagree.
>
> You made an interesting comment.
>
> > This is not
> > always the desired behavior, because formulas are increasingly used
> > for purposes other than specifying linear models.
>
> Can I ask what these purposes are?
>
>
> kind regards
> Abs
>


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Race condition on parallel package's mcexit and rmChild

2019-05-20 Thread Sun Yijiang
I've been hacking with parallel package for some time and built a
parallel processing framework with it.  However, although very rarely,
I did notice "ignoring SIGPIPE signal" error every now and then.
After a deep dig into the source code, I think I found something worth
noticing.

In short, writing to the pipe in the C function mc_exit(SEXP sRes) may cause
a SIGPIPE.  Code from src/library/parallel/src/fork.c:

SEXP NORET mc_exit(SEXP sRes)
{
int res = asInteger(sRes);
... ...
if (master_fd != -1) { /* send 0 to signify that we're leaving */
size_t len = 0;
/* assign result for Fedora security settings */
ssize_t n = write(master_fd, &len, sizeof(len));
... ...
}

So a pipe write is made in mc_exit, and here's how this function is
used in src/library/parallel/R/unix/mcfork.R:

mcexit <- function(exit.code = 0L, send = NULL)
{
if (!is.null(send)) try(sendMaster(send), silent = TRUE)
.Call(C_mc_exit, as.integer(exit.code))
}

Between sendMaster() and mc_exit() calls, which are made in the child
process, the master process may call readChild() followed by
rmChild().  rmChild closes the pipe on the master side, and if it's
called before child calls mc_exit, a SIGPIPE will be raised when child
tries to write to the pipe in mc_exit.
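
A toy reproduction of the underlying mechanism (hypothetical code, not
parallel's actual implementation): once the reader end is closed, as rmChild
does, a write() to the pipe raises SIGPIPE; with SIGPIPE ignored it instead
fails with errno == EPIPE, which the child could tolerate on exit.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) != 0) return 1;
    close(fd[0]);                      /* "rmChild": the reader end goes away */

    signal(SIGPIPE, SIG_IGN);          /* without this, the write kills us */
    size_t len = 0;
    if (write(fd[1], &len, sizeof len) < 0 && errno == EPIPE)
        fprintf(stderr, "got EPIPE instead of a fatal SIGPIPE\n");
    return 0;
}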

rmChild is defined but not used in the parallel package, so this problem
won't surface in most cases.  However, it is a useful API and may be
used by users like me for advanced control over child processes.  I
hope we can discuss a solution for it.

In fact, I don't see why we need to write to the pipe on child exit,
or how it has anything to do with "Fedora security settings" as
suggested in the comments.  Removing it, IMHO, would be a good and
clean way to solve this problem.

Regards,
Yijiang

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] most robust way to call R API functions from a secondary thread

2019-05-20 Thread Simon Urbanek
Stepan,

Andreas put a lot more thought into the issues you question in your reply. His 
question was how you can avoid what you were proposing and have proper 
threading under safe conditions. Having dealt with this before, I think 
Andreas' write-up is pretty much the most complete analysis I have seen. I'd 
wait for Luke to chime in as the ultimate authority if he gets to it.

The "classic" approach which you mention is to collect and allocate everything, 
then execute parallel code and then return. What Andreas is proposing is 
obviously much more efficient: you only synchronize on R API calls, which are 
likely a small fraction of the entire time, while you keep all threads alive. 
His question was how to do that safely. (BTW: I really like the touch of 
counting frames that toplevel exec can use ;) - it may make sense to deal with 
that edge-case in R if we can ...).

Cheers,
Simon




> On May 20, 2019, at 5:45 AM, Stepan  wrote:
> 
> Hi Andreas,
> 
> note that with the introduction of ALTREP, as far as I understand, calls as 
> "simple" as DATAPTR can execute arbitrary code (R or native). Even without 
> ALTREP, if you execute user-provided R code via Rf_eval and such on some 
> custom thread, you may end up executing native code of some package, which 
> may assume it is executed only from the R main thread.
> 
> Could you (1) decompose your problem in a way that in some initial phase you 
> pull all the necessary data from R, then start the parallel computation, and 
> then again in the R main thread "submit" the results back to the R world?
> 
> If you want something really robust, you could (2) "send" the requests for R 
> API usage to the R main thread and pause the worker thread until it receives 
> the results back. This looks similar to what the "later" package does. Maybe 
> you can even use that package for your purposes?
> 
> Do you want to parallelize your code to achieve better performance? Even with 
> your proposed solution, you need synchronization and chances are that 
> excessive synchronization will severely affect the expected performance 
> benefits of parallelization. If you do not need to synchronize that much, 
> then the question is whether you can make do with (1) or (2).
> 
> Best regards,
> Stepan
> 
> On 19/05/2019 11:31, Andreas Kersting wrote:
>> Hi,
>> As the subject suggests, I am looking for the most robust way to call an 
>> (arbitrary) function from the R API from another but the main POSIX thread 
>> in a package's code.
>> I know that "[c]alling any of the R API from threaded code is ‘for experts 
>> only’ and strongly discouraged. Many functions in the R API modify internal 
>> R data structures and might corrupt these data structures if called 
>> simultaneously from multiple threads. Most R API functions can signal 
>> errors, which must only happen on the R main thread." 
>> (https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support)
>> Let me start with my understanding of the related issues and possible 
>> solutions:
>> 1) R API functions are generally not thread-safe and hence one must ensure, 
>> e.g. by using mutexes, that no two threads use the R API simultaneously
>> 2) R uses longjmps on error and interrupts as well as for condition handling 
>> and it is undefined behaviour to do a longjmp from one thread to another; 
>> interrupts can be suspended before creating the threads by setting 
>> R_interrupts_suspended = TRUE; by wrapping the calls to functions from the R 
>> API with R_ToplevelExec(), longjmps across thread boundaries can be avoided; 
>> the only reason for R_ToplevelExec() itself to fail with an R-style error 
>> (longjmp) is a pointer protection stack overflow
>> 3) R_CheckStack() might be executed (indirectly), which will (probably) 
>> signal a stack overflow because it only works correctly when called from the 
>> main thread (see 
>> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Threading-issues);
>>  in particular, any function that does allocations, e.g. via allocVector3() 
>> might end up calling it via GC -> finalizer -> ... -> eval; the only way 
>> around this problem which I could find is to adjust R_CStackLimit, which is 
>> outside of the official API; it can be set to -1 to disable the check or be 
>> changed to a value appropriate for the current thread
>> 4) R sets signal handlers for several signals and some of them make use of 
>> the R API; hence, issues 1) - 3) apply; signal masks can be used to block 
>> delivery of signals to secondary threads in general and to the main thread 
>> while other threads are using the R API

Re: [Rd] Race condition on parallel package's mcexit and rmChild

2019-05-20 Thread Tomas Kalibera
This issue has already been addressed in 76462 (R-devel) and also ported 
to R-patched. In fact rmChild() is used in mccollect(wait=FALSE).


Best
Tomas

On 5/19/19 11:39 AM, Sun Yijiang wrote:

I've been hacking with parallel package for some time and built a
parallel processing framework with it.  However, although very rarely,
I did notice "ignoring SIGPIPE signal" error every now and then.
After a deep dig into the source code, I think I found something worth
noticing.

In short, writing to the pipe in the C function mc_exit(SEXP sRes) may cause
a SIGPIPE.  Code from src/library/parallel/src/fork.c:

SEXP NORET mc_exit(SEXP sRes)
{
 int res = asInteger(sRes);
... ...
 if (master_fd != -1) { /* send 0 to signify that we're leaving */
 size_t len = 0;
 /* assign result for Fedora security settings */
 ssize_t n = write(master_fd, &len, sizeof(len));
... ...
}

So a pipe write is made in mc_exit, and here's how this function is
used in src/library/parallel/R/unix/mcfork.R:

mcexit <- function(exit.code = 0L, send = NULL)
{
 if (!is.null(send)) try(sendMaster(send), silent = TRUE)
 .Call(C_mc_exit, as.integer(exit.code))
}

Between sendMaster() and mc_exit() calls, which are made in the child
process, the master process may call readChild() followed by
rmChild().  rmChild closes the pipe on the master side, and if it's
called before child calls mc_exit, a SIGPIPE will be raised when child
tries to write to the pipe in mc_exit.

rmChild is defined but not used in parallel package, so this problem
won't surface in most cases.  However, it is a useful API and may be
used by users like me for advanced control over child processes.  I
hope we can discuss a solution on it.

In fact, I don't see why we need to write to the pipe on child exit
and how it has anything to do with "Fedora security settings" as
suggested in the comments.  Removing it, IMHO, would be a good and
clean way to solve this problem.

Regards,
Yijiang


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Race condition on parallel package's mcexit and rmChild

2019-05-20 Thread Sun Yijiang
I have read the latest code, but I still don't understand why mc_exit
needs to write a zero on exit.  If a child closes its pipe, the parent will
know that on the next select.
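
What I mean would look roughly like this on the parent side (a hypothetical
sketch, not parallel's actual protocol; Simon's reply below explains why the
zero-length write is still needed): after the child's end is closed, select()
marks the descriptor readable and read() returns 0 (EOF).

#include <sys/select.h>
#include <unistd.h>

/* returns 1 once the child has closed its end of the pipe; a real
   implementation would buffer any byte it happens to read */
int child_gone(int fd)
{
    fd_set rs;
    FD_ZERO(&rs);
    FD_SET(fd, &rs);
    if (select(fd + 1, &rs, NULL, NULL, NULL) > 0 && FD_ISSET(fd, &rs)) {
        char buf[1];
        return read(fd, buf, 1) == 0;  /* 0 bytes == EOF == pipe closed */
    }
    return 0;
}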

Best,
Yijiang

Tomas Kalibera  wrote on Mon, May 20, 2019 at 10:52 PM:
>
> This issue has already been addressed in 76462 (R-devel) and also ported
> to R-patched. In fact rmChild() is used in mccollect(wait=FALSE).
>
> Best
> Tomas
>
> On 5/19/19 11:39 AM, Sun Yijiang wrote:
> > I've been hacking with parallel package for some time and built a
> > parallel processing framework with it.  However, although very rarely,
> > I did notice "ignoring SIGPIPE signal" error every now and then.
> > After a deep dig into the source code, I think I found something worth
> > noticing.
> >
> > In short, writing to the pipe in the C function mc_exit(SEXP sRes) may cause
> > a SIGPIPE.  Code from src/library/parallel/src/fork.c:
> >
> > SEXP NORET mc_exit(SEXP sRes)
> > {
> >  int res = asInteger(sRes);
> > ... ...
> >  if (master_fd != -1) { /* send 0 to signify that we're leaving */
> >  size_t len = 0;
> >  /* assign result for Fedora security settings */
> >  ssize_t n = write(master_fd, &len, sizeof(len));
> > ... ...
> > }
> >
> > So a pipe write is made in mc_exit, and here's how this function is
> > used in src/library/parallel/R/unix/mcfork.R:
> >
> > mcexit <- function(exit.code = 0L, send = NULL)
> > {
> >  if (!is.null(send)) try(sendMaster(send), silent = TRUE)
> >  .Call(C_mc_exit, as.integer(exit.code))
> > }
> >
> > Between sendMaster() and mc_exit() calls, which are made in the child
> > process, the master process may call readChild() followed by
> > rmChild().  rmChild closes the pipe on the master side, and if it's
> > called before child calls mc_exit, a SIGPIPE will be raised when child
> > tries to write to the pipe in mc_exit.
> >
> > rmChild is defined but not used in parallel package, so this problem
> > won't surface in most cases.  However, it is a useful API and may be
> > used by users like me for advanced control over child processes.  I
> > hope we can discuss a solution on it.
> >
> > In fact, I don't see why we need to write to the pipe on child exit
> > and how it has anything to do with "Fedora security settings" as
> > suggested in the comments.  Removing it, IMHO, would be a good and
> > clean way to solve this problem.
> >
> > Regards,
> > Yijiang

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Race condition on parallel package's mcexit and rmChild

2019-05-20 Thread Simon Urbanek
Because that's the communication protocol between the parent and child. There 
is a difference between unsolicited exit and empty result exit.

Cheers,
Simon


> On May 20, 2019, at 11:22 AM, Sun Yijiang  wrote:
> 
> Have read the latest code, but I still don't understand why mc_exit
> needs to write zero on exit.  If a child closes its pipe, parent will
> know that on next select.
> 
> Best,
> Yijiang
> 
> Tomas Kalibera  wrote on Mon, May 20, 2019 at 10:52 PM:
>> 
>> This issue has already been addressed in 76462 (R-devel) and also ported
>> to R-patched. In fact rmChild() is used in mccollect(wait=FALSE).
>> 
>> Best
>> Tomas
>> 
>> On 5/19/19 11:39 AM, Sun Yijiang wrote:
>>> I've been hacking with parallel package for some time and built a
>>> parallel processing framework with it.  However, although very rarely,
>>> I did notice "ignoring SIGPIPE signal" error every now and then.
>>> After a deep dig into the source code, I think I found something worth
>>> noticing.
>>> 
>>> In short, writing to the pipe in the C function mc_exit(SEXP sRes) may cause
>>> a SIGPIPE.  Code from src/library/parallel/src/fork.c:
>>> 
>>> SEXP NORET mc_exit(SEXP sRes)
>>> {
>>> int res = asInteger(sRes);
>>> ... ...
>>> if (master_fd != -1) { /* send 0 to signify that we're leaving */
>>> size_t len = 0;
>>> /* assign result for Fedora security settings */
>>> ssize_t n = write(master_fd, &len, sizeof(len));
>>> ... ...
>>> }
>>> 
>>> So a pipe write is made in mc_exit, and here's how this function is
>>> used in src/library/parallel/R/unix/mcfork.R:
>>> 
>>> mcexit <- function(exit.code = 0L, send = NULL)
>>> {
>>> if (!is.null(send)) try(sendMaster(send), silent = TRUE)
>>> .Call(C_mc_exit, as.integer(exit.code))
>>> }
>>> 
>>> Between sendMaster() and mc_exit() calls, which are made in the child
>>> process, the master process may call readChild() followed by
>>> rmChild().  rmChild closes the pipe on the master side, and if it's
>>> called before child calls mc_exit, a SIGPIPE will be raised when child
>>> tries to write to the pipe in mc_exit.
>>> 
>>> rmChild is defined but not used in parallel package, so this problem
>>> won't surface in most cases.  However, it is a useful API and may be
>>> used by users like me for advanced control over child processes.  I
>>> hope we can discuss a solution on it.
>>> 
>>> In fact, I don't see why we need to write to the pipe on child exit
>>> and how it has anything to do with "Fedora security settings" as
>>> suggested in the comments.  Removing it, IMHO, would be a good and
>>> clean way to solve this problem.
>>> 
>>> Regards,
>>> Yijiang

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [External] most robust way to call R API functions from a secondary thread

2019-05-20 Thread Tierney, Luke
Your analysis looks pretty complete to me and your solutions seem
plausible.  That said, I don't yet have the level of confidence that we
haven't missed an important point, which I would want before going down
this route.

Losing stack checking is risky; it might eventually be possible to
provide some support for handling this via a thread-local
variable. Ensuring that R_ToplevelExec can't jump before entering the
body function would be a good idea; if you want to propose a patch we
can have a look.
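
A userland guard along those lines might look as follows (a hedged sketch,
not a patch to R itself; safe_toplevel_exec and the headroom parameter are
hypothetical, and R_PPStackTop/R_PPStackSize are outside the official API,
as in the template quoted below):

#include <Rinternals.h>

extern int R_PPStackTop, R_PPStackSize;  /* not part of the official API */

/* refuse to enter R_ToplevelExec() unless the pointer-protection stack has
   headroom, since PP-stack overflow is the one way it can longjmp before
   running the body */
int safe_toplevel_exec(void (*fun)(void *), void *data, int headroom)
{
    if (R_PPStackSize - R_PPStackTop < headroom)
        return -1;  /* would risk a longjmp across thread boundaries */
    return R_ToplevelExec(fun, data) ? 0 : 1;
}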

Best,

luke

On Sun, 19 May 2019, Andreas Kersting wrote:

> Hi,
>
> As the subject suggests, I am looking for the most robust way to call an 
> (arbitrary) function from the R API from another but the main POSIX thread in 
> a package's code.
>
> I know that "[c]alling any of the R API from threaded code is ‘for experts 
> only’ and strongly discouraged. Many functions in the R API modify internal R 
> data structures and might corrupt these data structures if called 
> simultaneously from multiple threads. Most R API functions can signal errors, 
> which must only happen on the R main thread." 
> (https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support)
>
> Let me start with my understanding of the related issues and possible 
> solutions:
>
> 1) R API functions are generally not thread-safe and hence one must ensure, 
> e.g. by using mutexes, that no two threads use the R API simultaneously
>
> 2) R uses longjmps on error and interrupts as well as for condition handling 
> and it is undefined behaviour to do a longjmp from one thread to another; 
> interrupts can be suspended before creating the threads by setting 
> R_interrupts_suspended = TRUE; by wrapping the calls to functions from the R 
> API with R_ToplevelExec(), longjmps across thread boundaries can be avoided; 
> the only reason for R_ToplevelExec() itself to fail with an R-style error 
> (longjmp) is a pointer protection stack overflow
>
> 3) R_CheckStack() might be executed (indirectly), which will (probably) 
> signal a stack overflow because it only works correctly when called from the 
> main thread (see 
> https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Threading-issues);
>  in particular, any function that does allocations, e.g. via allocVector3() 
> might end up calling it via GC -> finalizer -> ... -> eval; the only way 
> around this problem which I could find is to adjust R_CStackLimit, which is 
> outside of the official API; it can be set to -1 to disable the check or be 
> changed to a value appropriate for the current thread
>
> 4) R sets signal handlers for several signals and some of them make use of 
> the R API; hence, issues 1) - 3) apply; signal masks can be used to block 
> delivery of signals to secondary threads in general and to the main thread 
> while other threads are using the R API
>
>
> I basically have the following questions:
>
> a) Is my understanding of the issues accurate?
> b) Are there more things to consider when calling the R API from secondary 
> threads?
> c) Are the solutions proposed appropriate? Are there scenarios in which they 
> will fail to solve the issue? Or might they even cause new problems?
> d) Are there alternative/better solutions?
>
> Any feedback on this is highly appreciated.
>
> Below you can find a template which combines the proposed solutions (and 
> skips all non-illustrative checks of return values). Additionally, 
> R_CheckUserInterrupt() is used in combination with R_UnwindProtect() to 
> regularly check for interrupts from the main thread, while still being able 
> to cleanly cancel the threads before fun_running_in_main_thread() is left via 
> a longjmp. This is e.g. required if the secondary threads use memory which 
> was allocated in fun_running_in_main_thread() using e.g. R_alloc().
>
> Best regards,
> Andreas Kersting
>
>
>
> #include 
> #include 
> #include 
> #include 
>
> extern uintptr_t R_CStackLimit;
> extern int R_PPStackTop;
> extern int R_PPStackSize;
>
> #include 
> LibExtern Rboolean R_interrupts_suspended;
> LibExtern int R_interrupts_pending;
> extern void Rf_onintr(void);
>
> // mutex for exclusive access to the R API:
> static pthread_mutex_t r_api_mutex = PTHREAD_MUTEX_INITIALIZER;
>
> // a wrapper around R_CheckUserInterrupt() which can be passed to 
> R_UnwindProtect():
> SEXP check_interrupt(void *data) {
>  R_CheckUserInterrupt();
>  return R_NilValue;
> }
>
> // a wrapper around Rf_onintr() which can be passed to R_UnwindProtect():
> SEXP my_onintr(void *data) {
>  Rf_onintr();
>  return R_NilValue;
> }
>
> // function called by R_UnwindProtect() to cleanup on interrupt
> void cleanfun(void *data, Rboolean jump) {
>  if (jump) {
>// terminate threads cleanly ...
>  }
> }
>
> void fun_calling_R_API(void *data) {
>  // call some R API function, e.g. mkCharCE() ...
> }
>
> void *threaded_fun(void *td) {
>
>  // ...
>
>  pthread_mutex_lock(&r_api_mutex);
>
>  // avoid false stack overflow error

[Rd] WISH: Built-in R session-specific universally unique identifier (UUID)

2019-05-20 Thread Henrik Bengtsson
# Proposal

Provide a built-in mechanism for obtaining an identifier for the
current R session, e.g.

> Sys.info()[["session_uuid"]]
[1] "4258db4d-d4fb-46b3-a214-8c762b99a443"

The identifier should be "unique" in the sense that the probability
for two R sessions(*) having the same identifier should be extremely
small.  There's no need for reproducibility, i.e. the algorithm for
producing the identifier may be changed at any time.

(*) Two R sessions running at different times (seconds, minutes, days,
years, ...) or on different machines (locally or anywhere in the
world).


# Use cases

In parallel-processing workflows, R objects may be "exported"
(serialized) to background R processes ("workers") for further
processing.  In other workflows, objects may be saved to file to be
reloaded in a future R session.  However, certain types of objects in
R may only be relevant, or valid, in the R session that created
them.  Attempts to use them in other R processes may give an obscure
error or in the worst case produce garbage results.

Having an identifier that is unique to each R process will make it
possible to detect when an object is used in the wrong context.  This
can be done by attaching the session identifier to the object.  For
example,

obj <- 42L
attr(obj, "owner") <- Sys.info()[["session_uuid"]]

With this, it is easy to validate the "ownership" later;

stopifnot(identical(attr(obj, "owner"), Sys.info()[["session_uuid"]]))

I argue that such an identifier should be part of base R for easy
access and avoid each developer having to roll their own.


# Possible implementation

One proposal would be to bring in Simon Urbanek's 'uuid' package
(https://cran.r-project.org/package=uuid) into base R.  This package
provides:

> uuid::UUIDgenerate()
[1] "b7de6182-c9c1-47a8-b5cd-e5c8307a8efb"

based on Theodore Ts'o's libuuid
(https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/).  From
'man uuid_generate':

"The uuid_generate function creates a new universally unique
identifier (UUID). The uuid will be generated based on high-quality
randomness from /dev/urandom, if available. If it is not available,
then uuid_generate will use an alternative algorithm which uses the
current time, the local ethernet MAC address (if available), and
random data generated using a pseudo-random generator.
[...]
The UUID is 16 bytes (128 bits) long, which gives approximately
3.4x10^38 unique values (there are approximately 10^80 elementary
particles in the universe according to Carl Sagan's Cosmos). The new
UUID can reasonably be considered unique among all UUIDs created on
the local system, and among UUIDs created on other systems in the past
and in the future."
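
At the C level, generating and formatting such a UUID with libuuid takes only
a few lines (a sketch of plain libuuid usage, as described in the man page
quoted above; link with -luuid):

#include <stdio.h>
#include <uuid/uuid.h>

int main(void)
{
    uuid_t u;
    char s[37];               /* 36 characters plus the terminating NUL */
    uuid_generate(u);         /* /dev/urandom-based if available */
    uuid_unparse_lower(u, s);
    printf("%s\n", s);
    return 0;
}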

An alternative, that does not require adding a dependency on the
libuuid library, would be to roll a poor man's version based on a set
of semi-unique attributes, e.g.

make_id <- function(...) {
  args <- list(...)
  saveRDS(args, file = f <- tempfile())
  on.exit(file.remove(f))
  unname(tools::md5sum(f))
}

session_id <- local({
  id <- NULL
  function() {
    if (is.null(id)) {
      id <<- make_id(
        info    = Sys.info(),
        pid     = Sys.getpid(),
        tempdir = tempdir(),
        time    = Sys.time(),
        random  = sample.int(.Machine$integer.max, size = 1L)
      )
    }
    id
  }
})

Example:

> session_id()
[1] "8d00b17384e69e7c9ecee47e0426b2a5"

> session_id()
[1] "8d00b17384e69e7c9ecee47e0426b2a5"

/Henrik

PS. Having a built-in make_id() function would be handy too, e.g. when
creating object-specific identifiers for other purposes.

PPS. It would be neat if there was an object, or connection, interface
for tools::md5sum(), which currently only operates on files sitting on
the file system. The digest package provides this functionality.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] WISH: Built-in R session-specific universally unique identifier (UUID)

2019-05-20 Thread William Dunlap via R-devel
I think a machine-specific input to the UUID, like the MAC address, is
essential.  S+ used to make a seed for the random number generator based on
the current time and process ID.  A customer complained that all
machines in his cluster generated the same random number stream.  The
machines were rebooted each night, simultaneously, and S+ was started
during the boot process, so times and process IDs were identical, hence the
seeds were identical.
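
The failure mode is easy to reproduce in miniature (toy code, not what S+
did): if every node boots in lockstep and starts the process at the same
moment with the same PID, a seed derived only from time and PID is identical
everywhere.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    /* same time + same pid on every cluster node => same seed => same stream */
    unsigned seed = (unsigned) time(NULL) ^ (unsigned) getpid();
    srand(seed);
    printf("seed=%u first draw=%d\n", seed, rand());
    return 0;
}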

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Mon, May 20, 2019 at 4:48 PM Henrik Bengtsson 
wrote:

> # Proposal
>
> Provide a built-in mechanism for obtaining an identifier for the
> current R session, e.g.
>
> > Sys.info()[["session_uuid"]]
> [1] "4258db4d-d4fb-46b3-a214-8c762b99a443"
>
> The identifier should be "unique" in the sense that the probability
> for two R sessions(*) having the same identifier should be extremely
> small.  There's no need for reproducibility, i.e. the algorithm for
> producing the identifier may be changed at any time.
>
> (*) Two R sessions running at different times (seconds, minutes, days,
> years, ...) or on different machines (locally or anywhere in the
> world).
>
>
> # Use cases
>
> In parallel-processing workflows, R objects may be "exported"
> (serialized) to background R processes ("workers") for further
> processing.  In other workflows, objects may be saved to file to be
> reloaded in a future R session.  However, certain types of objects in
> R may only be relevant, or valid, in the R session that created
> them.  Attempts to use them in other R processes may give an obscure
> error or in the worst case produce garbage results.
>
> Having an identifier that is unique to each R process will make it
> possible to detect when an object is used in the wrong context.  This
> can be done by attaching the session identifier to the object.  For
> example,
>
> obj <- 42L
> attr(obj, "owner") <- Sys.info()[["session_uuid"]]
>
> With this, it is easy to validate the "ownership" later;
>
> stopifnot(identical(attr(obj, "owner"), Sys.info()[["session_uuid"]]))
>
> I argue that such an identifier should be part of base R for easy
> access and avoid each developer having to roll their own.
>
>
> # Possible implementation
>
> One proposal would be to bring in Simon Urbanek's 'uuid' package
> (https://cran.r-project.org/package=uuid) into base R.  This package
> provides:
>
> > uuid::UUIDgenerate()
> [1] "b7de6182-c9c1-47a8-b5cd-e5c8307a8efb"
>
> based on Theodore Ts'o's libuuid
> (https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/).  From
> 'man uuid_generate':
>
> "The uuid_generate function creates a new universally unique
> identifier (UUID). The uuid will be generated based on high-quality
> randomness from /dev/urandom, if available. If it is not available,
> then uuid_generate will use an alternative algorithm which uses the
> current time, the local ethernet MAC address (if available), and
> random data generated using a pseudo-random generator.
> [...]
> The UUID is 16 bytes (128 bits) long, which gives approximately
> 3.4x10^38 unique values (there are approximately 10^80 elementary
> particles in the universe according to Carl Sagan's Cosmos). The new
> UUID can reasonably be considered unique among all UUIDs created on
> the local system, and among UUIDs created on other systems in the past
> and in the future."
>
> An alternative, that does not require adding a dependency on the
> libuuid library, would be to roll a poor man's version based on a set
> of semi-unique attributes, e.g.
>
> make_id <- function(...) {
>   args <- list(...)
>   saveRDS(args, file = f <- tempfile())
>   on.exit(file.remove(f))
>   unname(tools::md5sum(f))
> }
>
> session_id <- local({
>   id <- NULL
>   function() {
>     if (is.null(id)) {
>       id <<- make_id(
>         info    = Sys.info(),
>         pid     = Sys.getpid(),
>         tempdir = tempdir(),
>         time    = Sys.time(),
>         random  = sample.int(.Machine$integer.max, size = 1L)
>       )
>     }
>     id
>   }
> })
>
> Example:
>
> > session_id()
> [1] "8d00b17384e69e7c9ecee47e0426b2a5"
>
> > session_id()
> [1] "8d00b17384e69e7c9ecee47e0426b2a5"
>
> /Henrik
>
> PS. Having a built-in make_id() function would be handy too, e.g. when
> creating object-specific identifiers for other purposes.
>
> PPS. It would be neat if there was an object, or connection, interface
> for tools::md5sum(), which currently only operates on files sitting on
> the file system. The digest package provides this functionality.
>


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel