Re: [Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

2019-04-12 Thread Iñaki Ucar
On Thu, 11 Apr 2019 at 22:07, Henrik Bengtsson
 wrote:
>
> ISSUE:
> Using *forks* for parallel processing in R is not always safe.
> [...]
> Comments?

Using fork() is never safe. The reference provided by Kevin [1] is
pretty compelling (I kindly encourage anyone who ever forked a process
to read it). Therefore, I'd go beyond Henrik's suggestion, and I'd
advocate for deprecating fork clusters and eventually removing them
from parallel.

[1] 
https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf

-- 
Iñaki Úcar



Re: [Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

2019-04-12 Thread Travers Ching
Just throwing my two cents in:

I think removing/deprecating fork would be a bad idea for two reasons:

1) There are no performant alternatives
2) Removing fork would break existing workflows

Even if replaced with something using the same interface (e.g., a
function that automatically detects variables to export as in the
amazing `future` package), the lack of copy-on-write functionality
would cause scripts everywhere to break.

A simple example illustrating these two points:
`x <- 5e8; mclapply(1:24, sum, x, 8)`

Using fork, `mclapply` takes 5 seconds.  Using "psock", `clusterApply`
does not complete.
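
(A runnable sketch of that comparison, for anyone who wants to reproduce
it -- note this assumes x was meant to be a large vector and 8 the core
count; as literally written above, x and 8 are just extra arguments
passed on to sum, so each call would be trivial:)

    library(parallel)

    x <- rnorm(5e8)  # ~4 GB of doubles; adjust to taste

    ## Fork: children share x copy-on-write, so nothing is copied.
    system.time(mclapply(1:24, function(i) sum(x), mc.cores = 8))

    ## PSOCK: x must be serialized to each of the 8 fresh R processes,
    ## i.e. up to 8 extra ~4 GB copies -- this is what exhausts memory.
    cl <- makeCluster(8, type = "PSOCK")
    clusterExport(cl, "x")
    system.time(parLapply(cl, 1:24, function(i) sum(x)))
    stopCluster(cl)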

Travers

On Fri, Apr 12, 2019 at 2:32 AM Iñaki Ucar  wrote:
>
> On Thu, 11 Apr 2019 at 22:07, Henrik Bengtsson
>  wrote:
> >
> > ISSUE:
> > Using *forks* for parallel processing in R is not always safe.
> > [...]
> > Comments?
>
> Using fork() is never safe. The reference provided by Kevin [1] is
> pretty compelling (I kindly encourage anyone who ever forked a process
> to read it). Therefore, I'd go beyond Henrik's suggestion, and I'd
> advocate for deprecating fork clusters and eventually removing them
> from parallel.
>
> [1] 
> https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf
>
> --
> Iñaki Úcar



Re: [Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

2019-04-12 Thread Iñaki Ucar
On Fri, 12 Apr 2019 at 21:32, Travers Ching  wrote:
>
> Just throwing my two cents in:
>
> I think removing/deprecating fork would be a bad idea for two reasons:
>
> 1) There are no performant alternatives

"Performant"... in terms of what. If the cost of copying the data
predominates over the computation time, maybe you didn't need
parallelization in the first place.

> 2) Removing fork would break existing workflows

I don't see why mclapply could not be rewritten using PSOCK clusters.
And as a side effect, this would enable those workflows on Windows,
which doesn't support fork.
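
(A minimal sketch of what I mean -- illustrative only, ignoring the mc.*
options, error handling and worker reuse that a real rewrite would need:)

    library(parallel)

    ## An mclapply-shaped front end over a PSOCK cluster; also runs on
    ## Windows, where fork is unavailable.
    mclapply_psock <- function(X, FUN, ..., mc.cores = 2L) {
      cl <- makeCluster(mc.cores, type = "PSOCK")
      on.exit(stopCluster(cl), add = TRUE)
      parLapply(cl, X, FUN, ...)
    }

    mclapply_psock(1:4, function(i) i^2, mc.cores = 2L)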

> Even if replaced with something using the same interface (e.g., a
> function that automatically detects variables to export as in the
> amazing `future` package), the lack of copy-on-write functionality
> would cause scripts everywhere to break.

To implement copy-on-write, Linux overcommits virtual memory, and this
is what causes scripts to break unexpectedly: everything works fine,
until you change a small unimportant bit and... boom, out of memory.
And in general, running forks in any GUI would cause things everywhere
to break.

> A simple example illustrating these two points:
> `x <- 5e8; mclapply(1:24, sum, x, 8)`
>
> Using fork, `mclapply` takes 5 seconds.  Using "psock", `clusterApply`
> does not complete.

I'm not sure how you set that up, but it does complete. Or do you
mean that you ran out of memory? Then try replacing "x" with, e.g.,
"x+1" in your mclapply example and see what happens (hint: save your
work first).
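
(Spelled out, assuming a large vector as in the example -- and do save
your work first:)

    library(parallel)

    x <- rnorm(5e8)
    ## Read-only access: copy-on-write keeps memory flat.
    ok <- mclapply(1:24, function(i) sum(x), mc.cores = 8)
    ## x + 1 allocates a fresh full-size vector in every child, so
    ## memory use multiplies by the number of workers -- likely OOM.
    boom <- mclapply(1:24, function(i) sum(x + 1), mc.cores = 8)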

--
Iñaki Úcar



Re: [Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

2019-04-12 Thread Travers Ching
Hi Iñaki,

> "Performant"... in terms of what. If the cost of copying the data
> predominates over the computation time, maybe you didn't need
> parallelization in the first place.

Performant in terms of speed.  There's no copying in that example
using `mclapply` and so it is significantly faster than other
alternatives.

It is a very simple and contrived example, but there are lots of
applications that depend on processing of large data and benefit from
multithreading.  For example, reading in large sequencing data with
`Rsamtools` and checking the sequences for a set of motifs.
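
(A sketch of that kind of workload -- the file name and motifs here are
placeholders, and this assumes the Bioconductor packages Rsamtools and
Biostrings:)

    library(parallel)
    library(Rsamtools)     # Bioconductor
    library(Biostrings)    # Bioconductor

    ## Load the reads once in the parent; forked workers then scan them
    ## for motifs without copying the data.
    reads  <- scanBam("example.bam")[[1]]$seq   # a DNAStringSet
    motifs <- c("TATAAA", "GGCCAATCT")
    hits <- mclapply(motifs,
                     function(m) sum(vcountPattern(m, reads) > 0),
                     mc.cores = 8)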

> I don't see why mclapply could not be rewritten using PSOCK clusters.

Because it would be much slower.

> To implement copy-on-write, Linux overcommits virtual memory, and this
>  is what causes scripts to break unexpectedly: everything works fine,
> until you change a small unimportant bit and... boom, out of memory.
> And in general, running forks in any GUI would cause things everywhere
> to break.

> I'm not sure how you set that up, but it does complete. Or do you
> mean that you ran out of memory? Then try replacing "x" with, e.g.,
> "x+1" in your mclapply example and see what happens (hint: save your
> work first).

Yes, I meant that it ran out of memory on my desktop.  I understand
the limits, and it is not perfect because of the GUI issue you
mention, but I don't see a better alternative in terms of speed.

Regards,
Travers




On Fri, Apr 12, 2019 at 3:45 PM Iñaki Ucar  wrote:
>
> On Fri, 12 Apr 2019 at 21:32, Travers Ching  wrote:
> >
> > Just throwing my two cents in:
> >
> > I think removing/deprecating fork would be a bad idea for two reasons:
> >
> > 1) There are no performant alternatives
>
> "Performant"... in terms of what. If the cost of copying the data
> predominates over the computation time, maybe you didn't need
> parallelization in the first place.
>
> > 2) Removing fork would break existing workflows
>
> I don't see why mclapply could not be rewritten using PSOCK clusters.
> And as a side effect, this would enable those workflows on Windows,
> which doesn't support fork.
>
> > Even if replaced with something using the same interface (e.g., a
> > function that automatically detects variables to export as in the
> > amazing `future` package), the lack of copy-on-write functionality
> > would cause scripts everywhere to break.
>
> To implement copy-on-write, Linux overcommits virtual memory, and this
> is what causes scripts to break unexpectedly: everything works fine,
> until you change a small unimportant bit and... boom, out of memory.
> And in general, running forks in any GUI would cause things everywhere
> to break.
>
> > A simple example illustrating these two points:
> > `x <- 5e8; mclapply(1:24, sum, x, 8)`
> >
> > Using fork, `mclapply` takes 5 seconds.  Using "psock", `clusterApply`
> > does not complete.
>
> I'm not sure how you set that up, but it does complete. Or do you
> mean that you ran out of memory? Then try replacing "x" with, e.g.,
> "x+1" in your mclapply example and see what happens (hint: save your
> work first).
>
> --
> Iñaki Úcar



Re: [Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

2019-04-12 Thread Kevin Ushey
I think it's worth saying that mclapply() works as documented: it
relies on forking, and so doesn't work well in environments where it's
unsafe to fork. This is spelled out explicitly in the documentation of
?mclapply:

It is strongly discouraged to use these functions in GUI or embedded
environments, because it leads to several processes sharing the same
GUI which will likely cause chaos (and possibly crashes). Child
processes should never use on-screen graphics devices.

I believe the expectation is that users who need more control over the
kind of cluster that's used for parallel computations would instead
create the cluster themselves with e.g. `makeCluster()` and then use
`clusterApply()` / `parLapply()` or other APIs as appropriate.
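
(That explicit workflow, as a minimal sketch -- workers are fresh R
processes, so anything they need has to be shipped over explicitly:)

    library(parallel)

    fac <- 10
    cl <- makeCluster(4)            # default type is "PSOCK"
    clusterExport(cl, "fac")        # copy 'fac' to each worker
    res <- parLapply(cl, 1:8, function(i) i * fac)
    stopCluster(cl)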

In environments where forking works, `mclapply()` is nice because you
don't need to think -- the process is forked, and anything available
in your main session is automatically available in the child
processes. This is a nice convenience for when you know it's safe to
fork R (and know what you're doing is safe to do within a forked
process). When it's not safe, it's better to prefer the other APIs
available for computation on a cluster.

Forking can be unsafe and dangerous, but it's also convenient and
sometimes that convenience can outweigh the other concerns.

Finally, I want to add: the onus should be on the front-end to work
well with R, and not the other way around. I don't think it's fair to
impose extra work / an extra maintenance burden on the R Core team for
something that's already clearly documented ...

Best,
Kevin


On Fri, Apr 12, 2019 at 6:04 PM Travers Ching  wrote:
>
> Hi Iñaki,
>
> > "Performant"... in terms of what. If the cost of copying the data
> > predominates over the computation time, maybe you didn't need
> > parallelization in the first place.
>
> Performant in terms of speed.  There's no copying in that example
> using `mclapply` and so it is significantly faster than other
> alternatives.
>
> It is a very simple and contrived example, but there are lots of
> applications that depend on processing of large data and benefit from
> multithreading.  For example, reading in large sequencing data with
> `Rsamtools` and checking the sequences for a set of motifs.
>
> > I don't see why mclapply could not be rewritten using PSOCK clusters.
>
> Because it would be much slower.
>
> > To implement copy-on-write, Linux overcommits virtual memory, and this
> >  is what causes scripts to break unexpectedly: everything works fine,
> > until you change a small unimportant bit and... boom, out of memory.
> > And in general, running forks in any GUI would cause things everywhere
> > to break.
>
> > I'm not sure how you set that up, but it does complete. Or do you
> > mean that you ran out of memory? Then try replacing "x" with, e.g.,
> > "x+1" in your mclapply example and see what happens (hint: save your
> > work first).
>
> Yes, I meant that it ran out of memory on my desktop.  I understand
> the limits, and it is not perfect because of the GUI issue you
> mention, but I don't see a better alternative in terms of speed.
>
> Regards,
> Travers
>
>
>
>
> On Fri, Apr 12, 2019 at 3:45 PM Iñaki Ucar  wrote:
> >
> > On Fri, 12 Apr 2019 at 21:32, Travers Ching  wrote:
> > >
> > > Just throwing my two cents in:
> > >
> > > I think removing/deprecating fork would be a bad idea for two reasons:
> > >
> > > 1) There are no performant alternatives
> >
> > "Performant"... in terms of what. If the cost of copying the data
> > predominates over the computation time, maybe you didn't need
> > parallelization in the first place.
> >
> > > 2) Removing fork would break existing workflows
> >
> > I don't see why mclapply could not be rewritten using PSOCK clusters.
> > And as a side effect, this would enable those workflows on Windows,
> > which doesn't support fork.
> >
> > > Even if replaced with something using the same interface (e.g., a
> > > function that automatically detects variables to export as in the
> > > amazing `future` package), the lack of copy-on-write functionality
> > > would cause scripts everywhere to break.
> >
> > To implement copy-on-write, Linux overcommits virtual memory, and this
> > is what causes scripts to break unexpectedly: everything works fine,
> > until you change a small unimportant bit and... boom, out of memory.
> > And in general, running forks in any GUI would cause things everywhere
> > to break.
> >
> > > A simple example illustrating these two points:
> > > `x <- 5e8; mclapply(1:24, sum, x, 8)`
> > >
> > > Using fork, `mclapply` takes 5 seconds.  Using "psock", `clusterApply`
> > > does not complete.
> >
> > I'm not sure how you set that up, but it does complete. Or do you
> > mean that you ran out of memory? Then try replacing "x" with, e.g.,
> > "x+1" in your mclapply example and see what happens (hint: save your
> > work first).
> >
> > --
> > Iñaki Úcar

Re: [Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()

2019-04-12 Thread Simon Urbanek
I fully agree with Kevin. Front-ends can always use pthread_atfork() to close 
descriptors and suspend threads in children.

Anyone who thinks you can use PSOCK clusters has obviously not used mclapply()
in real applications - trying to save the workspace and restore it in 20 new
processes is not only incredibly wasteful (no shared memory whatsoever) but
also slow. If you want to use PSOCK just do it (I never do - you might as well
just use a full cluster instead); multicore is for the cases where you want to
parallelize something quickly, and it works really well for that purpose.

I'd like to separate the issues here - the fact that RStudio has issues is
really not R's fault - there is no technical reason why it shouldn't be able to
handle it correctly. That is not to say that there are no cases where fork() is
dangerous, but in most cases it's not, and the benefits outweigh the risk.

That said, I do acknowledge the idea of having an ability to prevent forking if
desired - I think that's a good idea, in particular if there is a standard that
packages can also adhere to (yes, there are also packages that use fork()
explicitly). I just think that the motivation is wrong (i.e., I don't think it
would be wise for RStudio to prevent parallelization by default).
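
(One shape that could take, sketched with an invented option name --
"parallel.allow.fork" is hypothetical, not an existing R option -- that
fork-using code, in packages or elsewhere, could consult:)

    safe_fork_lapply <- function(X, FUN, ..., mc.cores = 2L) {
      if (isTRUE(getOption("parallel.allow.fork", TRUE))) {
        parallel::mclapply(X, FUN, ..., mc.cores = mc.cores)
      } else {
        ## Forking disabled (e.g. by a GUI front-end): fall back to a
        ## PSOCK cluster instead of failing.
        cl <- parallel::makeCluster(mc.cores, type = "PSOCK")
        on.exit(parallel::stopCluster(cl), add = TRUE)
        parallel::parLapply(cl, X, FUN, ...)
      }
    }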

Also I'd like to point out that the main problem came about when packages 
started using parallel implicitly - the good citizens out there expose it as a 
parameter to the user, but not all packages do, which means you can hit
forked code without knowing it. If you use mclapply() in user code, you 
typically know what you're doing, but if a package author does it for you, it's 
a different story.

Cheers,
Simon


> On Apr 12, 2019, at 21:50, Kevin Ushey  wrote:
> 
> I think it's worth saying that mclapply() works as documented: it
> relies on forking, and so doesn't work well in environments where it's
> unsafe to fork. This is spelled out explicitly in the documentation of
> ?mclapply:
> 
> It is strongly discouraged to use these functions in GUI or embedded
> environments, because it leads to several processes sharing the same
> GUI which will likely cause chaos (and possibly crashes). Child
> processes should never use on-screen graphics devices.
> 
> I believe the expectation is that users who need more control over the
> kind of cluster that's used for parallel computations would instead
> create the cluster themselves with e.g. `makeCluster()` and then use
> `clusterApply()` / `parLapply()` or other APIs as appropriate.
> 
> In environments where forking works, `mclapply()` is nice because you
> don't need to think -- the process is forked, and anything available
> in your main session is automatically available in the child
> processes. This is a nice convenience for when you know it's safe to
> fork R (and know what you're doing is safe to do within a forked
> process). When it's not safe, it's better to prefer the other APIs
> available for computation on a cluster.
> 
> Forking can be unsafe and dangerous, but it's also convenient and
> sometimes that convenience can outweigh the other concerns.
> 
> Finally, I want to add: the onus should be on the front-end to work
> well with R, and not the other way around. I don't think it's fair to
> impose extra work / an extra maintenance burden on the R Core team for
> something that's already clearly documented ...
> 
> Best,
> Kevin
> 
> 
> On Fri, Apr 12, 2019 at 6:04 PM Travers Ching  wrote:
>> 
>> Hi Iñaki,
>> 
>>> "Performant"... in terms of what. If the cost of copying the data
>>> predominates over the computation time, maybe you didn't need
>>> parallelization in the first place.
>> 
>> Performant in terms of speed.  There's no copying in that example
>> using `mclapply` and so it is significantly faster than other
>> alternatives.
>> 
>> It is a very simple and contrived example, but there are lots of
>> applications that depend on processing of large data and benefit from
>> multithreading.  For example, reading in large sequencing data with
>> `Rsamtools` and checking the sequences for a set of motifs.
>> 
>>> I don't see why mclapply could not be rewritten using PSOCK clusters.
>> 
>> Because it would be much slower.
>> 
>>> To implement copy-on-write, Linux overcommits virtual memory, and this
>>> is what causes scripts to break unexpectedly: everything works fine,
>>> until you change a small unimportant bit and... boom, out of memory.
>>> And in general, running forks in any GUI would cause things everywhere
>>> to break.
>> 
>>> I'm not sure how you set that up, but it does complete. Or do you
>>> mean that you ran out of memory? Then try replacing "x" with, e.g.,
>>> "x+1" in your mclapply example and see what happens (hint: save your
>>> work first).
>> 
>> Yes, I meant that it ran out of memory on my desktop.  I understand
>> the limits, and it is not perfect because of the GUI issue you
>> mention, but I don't see a better alternative in terms of speed.
>> 
>> Regards,
>> Travers