Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Frank Harrell
To me it boils down to one simple question: is an update to a package on 
CRAN more likely to (1) fix a bug, (2) introduce a bug or backward 
incompatibility, or (3) add a new feature or fix a compatibility problem 
without introducing a bug?  I think the probability of (1) or (3) is much 
greater than the probability of (2), hence the current approach 
maximizes user benefit.


Frank
--
Frank E Harrell Jr, Professor and Chairman
Department of Biostatistics, School of Medicine, Vanderbilt University



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Joshua Ulrich
On Tue, Mar 18, 2014 at 3:24 PM, Jeroen Ooms  wrote:

> ## Summary
>
> Extending the r-release cycle to CRAN seems like a solution that would
> be easy to implement. Package updates would simply be pushed only to the
> r-devel branch of CRAN, rather than to r-release and r-release-old.
> This separates development from production/use in a way that is common
> sense in most open source communities. Benefits for R include:
>
Nothing is ever as simple as it seems (especially from the perspective
of one who won't be doing the work).

There is nothing preventing you (or anyone else) from creating
repositories that do what you suggest.  Create a CRAN mirror (or more
than one) that includes only the package versions you think it
should.  Then have your production servers use it (them) instead of
CRAN.

Better yet, make those repositories public.  If many people like your
idea, they will use your new repositories instead of CRAN.  There is
no reason to impose this change on all world-wide CRAN users.
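
For instance, a production box can be pointed at such a mirror with a
couple of lines in Rprofile.site. A minimal sketch, assuming a
hypothetical mirror URL:

    local({
      r <- getOption("repos")
      r["CRAN"] <- "http://cran-frozen.example.org"  # placeholder URL
      options(repos = r)
    })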

Best,
--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Duncan Murdoch
I don't see why CRAN needs to be involved in this effort at all.  A 
third party could take snapshots of CRAN at R release dates, and make 
those available to package users in a separate repository.  It is not 
hard to set a different repository than CRAN as the default location 
from which to obtain packages.


The only objection I can see to this is that it requires extra work by 
the third party, rather than extra work by the CRAN team. I don't think 
the total amount of work required is much different.  I'm very 
unsympathetic to proposals to dump work on others.


Duncan Murdoch

On 18/03/2014 4:24 PM, Jeroen Ooms wrote:

This came up again recently with an irreproducible paper. Below is an
attempt to make a case for extending the r-devel/r-release cycle to
CRAN packages. These suggestions are not in any way intended as
criticism of anyone or of the status quo.

The proposal described in [1] is to freeze a snapshot of CRAN along
with every release of R. In this design, updates for contributed
packages are treated the same as updates for base packages, in the sense
that they are only published to the r-devel branch of CRAN and do not
affect users of "released" versions of R. Thereby all users, stacks
and applications using a particular version of R will by default be
using the identical version of each CRAN package. The Bioconductor
project uses similar policies.

This system has several important advantages:

## Reproducibility

Currently R/Sweave/knitr scripts are unstable because of the ambiguity
introduced by constantly changing CRAN packages. This causes scripts
to break or change behavior when upstream packages are updated, which
makes reproducing old results extremely difficult.

A common counter-argument is that script authors should document the
package versions used in the script using sessionInfo(). However, even
if authors did this manually, reconstructing the author's
environment from this information is cumbersome and often nearly
impossible: binary packages might no longer be available, dependencies
may conflict, etc. See [1] for a worked example. In practice,
the current system causes many results or documents generated with R
not to be reproducible, sometimes after only a few months.
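
(As an aside, recording the versions is itself easy enough; a minimal
sketch using only base R, printed at the end of a script:

    info <- sessionInfo()
    pkgs <- c(info$otherPkgs, info$loadedOnly)
    vapply(pkgs, function(p) p$Version, character(1))

The hard part, as argued above, is reconstructing a working environment
from that output afterwards.)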

In a system where contributed packages inherit the r-base release
cycle, scripts will behave the same across users, systems, and time for a
given version of R. This greatly reduces the ambiguity of R behavior, and
has the potential to make reproducibility a natural part of the
language, rather than a tedious exercise.

## Repository Management

Just like scripts suffer from upstream changes, so do packages
depending on other packages. A particular package that has been
developed and tested against the current version of a particular
dependency is not guaranteed to work against *any future version* of
that dependency. Therefore, packages inevitably break over time as
their dependencies are updated.

One recent example is the Rcpp 0.11 release, which required all
reverse dependencies to be rebuilt or modified. This update caused some
serious disruption on our production servers. Initially we refrained
from updating Rcpp on these servers, to prevent currently installed
packages depending on Rcpp from breaking. However, soon after the
Rcpp 0.11 release, many other CRAN packages started to require Rcpp >=
0.11, and our users started complaining about not being able to
install those packages. This resulted in the impossible situation
where currently installed packages would not work with the new Rcpp,
but newly installed packages would not work with the old Rcpp.

Current CRAN policies blame this problem on package authors. However,
as explained in [1], this policy does not solve anything, is
unsustainable as the repository grows, and sets exactly the
wrong incentives for contributing code. Progress comes with breaking
changes, and the system should be able to accommodate this. Much of
the trouble could have been prevented by a system that does not push
bleeding-edge updates straight to end users, but has a devel branch
where conflicts are resolved before publishing them in the next
r-release.

## Reliability

Another example, this time on a very small scale. We recently
discovered that R code plotting medal counts from the Sochi Olympics
generated different results for users on OSX than it did on
Linux/Windows. After some debugging, we narrowed it down to the XML
package. The application used the following code to scrape results
from the Sochi website:

XML::readHTMLTable("http://www.sochi2014.com/en/speed-skating", which=2, skip=1)

This code was developed and tested on Mac, but results in a different
winner on Windows/Linux. This happens because the current version of
the XML package on CRAN is 3.98, but the latest Mac binary is 3.95.
Apparently the new version of XML introduces a tiny change that
causes HTML table headers to become column names, rather than a row in the
matrix, resulting in different medal counts.
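
A hedged workaround for this particular case is to pin the behavior
explicitly via the header argument of readHTMLTable(), rather than rely
on the version-dependent default (a sketch only, untested against both
versions):

    XML::readHTMLTable("http://www.sochi2014.com/en/speed-skating",
                       which = 2, skip = 1, header = TRUE)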

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Kasper Daniel Hansen
Our experience in Bioconductor is that this is a pretty hard problem.

What the OP presumably wants is some guarantee that all packages on CRAN
work well together.  A good example is when Rcpp was updated: it broke
other packages (quick note: the Rcpp developers do an incredible amount of
work to deal with this; it is almost impossible not to have a few days of
chaos).  Ensuring this is not a trivial task, and it requires some buy-in
both from the "repository" and from the developers.

For Bioconductor it is even harder, as the dependency graph of Bioconductor
is much more involved than the one for CRAN, where most packages depend
only on a few other packages.  This is why we need a coordinated release for Bioc.

Based on my experience with CRAN I am not sure I see a need for a
coordinated release (or rather, I can sympathize with the need, but I don't
think the effort is worth it).

What would be more useful in terms of reproducibility is the capability of
installing a specific version of a package from a repository using
install.packages(), which would require archiving older versions in a
coordinated fashion. I know CRAN archives old versions, but I am not aware
if we can programmatically query the repository about this.
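
(The Archive area does have a predictable layout, so one can sketch such
an installer today; the URL scheme below is an observation about the
current layout, not a documented API, and the version string is
illustrative:

    install_archived <- function(pkg, version) {
      url <- sprintf("http://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz",
                     pkg, pkg, version)
      install.packages(url, repos = NULL, type = "source")
    }
    # install_archived("XML", "3.95-0.2")

Dependencies, of course, still have to be resolved by hand.)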

Best,
Kasper




Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Dirk Eddelbuettel

Piling on:

On 19 March 2014 at 07:52, Joshua Ulrich wrote:
| There is nothing preventing you (or anyone else) from creating
| repositories that do what you suggest.  Create a CRAN mirror (or more
| than one) that includes only the package versions you think it
| should.  Then have your production servers use it (them) instead of
| CRAN.
| 
| Better yet, make those repositories public.  If many people like your
| idea, they will use your new repositories instead of CRAN.  There is
| no reason to impose this change on all world-wide CRAN users.

On 19 March 2014 at 08:52, Duncan Murdoch wrote:
| I don't see why CRAN needs to be involved in this effort at all.  A 
| third party could take snapshots of CRAN at R release dates, and make 
| those available to package users in a separate repository.  It is not 
| hard to set a different repository than CRAN as the default location 
| from which to obtain packages.
| 
| The only objection I can see to this is that it requires extra work by 
| the third party, rather than extra work by the CRAN team. I don't think 
| the total amount of work required is much different.  I'm very 
| unsympathetic to proposals to dump work on others.


And to a first approximation some of those efforts already exist:

  -- 200+ r-cran-* packages in Debian proper

  -- 2000+ r-cran-* packages in Michael's c2d4u (via launchpad)

  -- 5000+ r-cran-* packages in Don's debian-r.debian.net

The only difference here is that Jeroen wants to organize source packages.
But that is just a matter of stacking them in directory trees and calling:

setwd("/path/to/root/of/your/repo/version")
tools::write_PACKAGES(".", type = "source")

to create PACKAGES and PACKAGES.gz.
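
After which, presumably, clients just point install.packages() at that
tree (path purely illustrative):

    install.packages("Rcpp",
        contriburl = "file:///path/to/root/of/your/repo/version")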

Dirk

-- 
Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Hadley Wickham
> What would be more useful in terms of reproducibility is the capability of
> installing a specific version of a package from a repository using
> install.packages(), which would require archiving older versions in a
> coordinated fashion. I know CRAN archives old versions, but I am not aware
> if we can programmatically query the repository about this.

See devtools::install_version().

The main caveat is that you also need to be able to build the package,
and ensure you have dependencies that work with that version.
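
A hedged usage sketch (package and version purely illustrative):

    # install.packages("devtools")
    devtools::install_version("Rcpp", version = "0.10.6")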

Hadley


-- 
http://had.co.nz/



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Geoff Jentry

using the identical version of each CRAN package. The Bioconductor
project uses similar policies.


While I agree that this can be an issue, I don't think it is fair to 
compare CRAN to BioC. Unless things have changed, the latter has a more 
rigorous barrier to entry, which includes buy-in to various ideals (e.g. 
interoperability with other BioC packages, making use of BioC constructs, 
the official release cycle). All of that requires extra management 
overhead (read: human effort), which seems unlikely to happen given that 
CRAN isn't exactly swimming in spare cycles.


It seems like one could set up a curated CRAN-a-like quite easily, 
advertise the heck out of it and let the "market" decide. That is, IMO, 
the beauty of open source.


-J



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Jeroen Ooms
On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch wrote:

> I don't see why CRAN needs to be involved in this effort at all.  A third
> party could take snapshots of CRAN at R release dates, and make those
> available to package users in a separate repository.  It is not hard to set
> a different repository than CRAN as the default location from which to
> obtain packages.
>

I am happy to see many people giving this some thought and engaging in the
discussion.

Several have suggested that staging & freezing can simply be done by a
third party. This solution and its limitations are also described in the
paper [1] in the section titled "R: downstream staging and repackaging".

If this solved the problem without affecting CRAN, we would obviously have
done it already. In fact, as described in the paper and pointed out by
some people, initiatives such as Debian or Revolution Enterprise already
include a frozen library of R packages. Also, companies like Google maintain
their own internal repository with packages that are used throughout the
company.

The problem with this approach is that when you are using some 3rd party
package snapshot, your R/Sweave scripts will still only be
reliable/reproducible for other users of that specific snapshot. E.g. for
the examples above, a script that is written in R 3.0 by a Debian user is
not guaranteed to work on R 3.0 at Google, or on R 3.0 with some other 3rd
party CRAN snapshot. Hence this solution merely redefines the problem from
"this script depends on pkgA 1.1 and pkgB 0.2.3" to "this script depends on
repository foo 2.0". And given that most users would still be pulling
packages straight from CRAN, it would still be terribly difficult to
reproduce a 5-year-old Sweave script from e.g. JSS.

For this reason I believe the only effective place to organize this staging
is all the way upstream, on CRAN. Imagine a world where your R/Sweave
script would be reliable/reproducible, out of the box, on any system, any
platform, in any company using R 3.0. No need to investigate which
specific packages or CRAN snapshot the author was using at the time of
writing the script, or to try to reconstruct such libraries for each
script you want to reproduce. No ambiguity about which package versions are
used by R 3.0. However, for better or worse, I think this could only be
accomplished with a CRAN release cycle (i.e. "universal snapshots")
accompanying the already existing R releases.



> The only objection I can see to this is that it requires extra work by the
> third party, rather than extra work by the CRAN team. I don't think the
> total amount of work required is much different.  I'm very unsympathetic to
> proposals to dump work on others.


I am merely trying to discuss a technical issue in an attempt to improve
reliability of our software and reproducibility of papers created with R.



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Spencer Graves
  What about meeting this need with something like an
expansion of R-Forge?  We could have packages submitted to R-Forge
rather than CRAN, and people who wanted the latest could get it from
R-Forge.  If changes I make on R-Forge break a reverse dependency,
emails explaining the problem are sent to both me and the maintainer of
the package I broke.



  The budget for R-Forge would almost certainly need to be
increased: they currently disable many of the tests they once ran.



  Regarding budget, the R Project would get more donations if they
asked for them and made it easier to contribute.  I've tried multiple
times without success to find a way to donate.  I didn't try hard, but
it shouldn't be hard ;-)  (And donations should be accepted in US
dollars and Euros -- and maybe other currencies.) There should be a
procedure whereby anyone could receive a pro forma invoice, which they
can pay or ignore as they choose.  I mention this because many grants
could cover a reasonable fee provided they have an invoice.



  Spencer Graves




Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Joshua Ulrich
On Wed, Mar 19, 2014 at 12:59 PM, Jeroen Ooms  wrote:
> On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch 
> wrote:
>
>> I don't see why CRAN needs to be involved in this effort at all.  A third
>> party could take snapshots of CRAN at R release dates, and make those
>> available to package users in a separate repository.  It is not hard to set
>> a different repository than CRAN as the default location from which to
>> obtain packages.
>>
>
> I am happy to see many people giving this some thought and engaging in the
> discussion.
>
> Several have suggested that staging & freezing can simply be done by a
> third party. This solution and its limitations are also described in the
> paper [1] in the section titled "R: downstream staging and repackaging".
>
> If this solved the problem without affecting CRAN, we would obviously have
> done it already. In fact, as described in the paper and pointed out by
> some people, initiatives such as Debian or Revolution Enterprise already
> include a frozen library of R packages. Also, companies like Google maintain
> their own internal repository with packages that are used throughout the
> company.
>
The suggested solution is not described in the referenced article.  It
was not suggested that it be the operating system's responsibility to
distribute snapshots, nor was it suggested to create binary
repositories for specific operating systems, nor was it suggested to
freeze only a subset of CRAN packages.

> The problem with this approach is that when you are using some 3rd party
> package snapshot, your R/Sweave scripts will still only be
> reliable/reproducible for other users of that specific snapshot. E.g. for
> the examples above, a script that is written in R 3.0 by a Debian user is
> not guaranteed to work on R 3.0 at Google, or on R 3.0 with some other 3rd
> party CRAN snapshot. Hence this solution merely redefines the problem from
> "this script depends on pkgA 1.1 and pkgB 0.2.3" to "this script depends on
> repository foo 2.0". And given that most users would still be pulling
> packages straight from CRAN, it would still be terribly difficult to
> reproduce a 5-year-old Sweave script from e.g. JSS.
>
This can be solved by the third party making the repository public.

> For this reason I believe the only effective place to organize this staging
> is all the way upstream, on CRAN. Imagine a world where your R/Sweave
> script would be reliable/reproducible, out of the box, on any system, any
> platform, in any company using R 3.0. No need to investigate which
> specific packages or CRAN snapshot the author was using at the time of
> writing the script, or to try to reconstruct such libraries for each
> script you want to reproduce. No ambiguity about which package versions are
> used by R 3.0. However, for better or worse, I think this could only be
> accomplished with a CRAN release cycle (i.e. "universal snapshots")
> accompanying the already existing R releases.
>
This could be done by a public third-party repository, independent of
CRAN.  However, you would need to find a way to actively _prevent_
people from installing newer versions of packages with the stable R
releases.

--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Carl Boettiger
Dear list,

I'm curious what people would think of a more modest proposal at this time:

State the version of the dependencies used by the package authors when the
package was built.

Eventually CRAN could enforce that such a statement be present in the
DESCRIPTION. We encourage users to declare the versions of the packages they
use in publications, so why not have the same expectation of developers?
This would help address the problem of archived packages that Jeroen
raises, as it is currently impossible to reliably install archived
packages because their dependencies have since been updated and are no
longer compatible.  (Even if a package passes checks and installs, we have
no way of knowing whether the upstream changes have introduced a bug.)  This
information would be relatively straightforward to capture, shouldn't
change the way anyone currently uses CRAN, and should address a major pain
point that anyone trying to install archived versions from CRAN has probably
encountered.  What am I overlooking?
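
(Capturing the versions at build time would be straightforward with base
R alone; a minimal sketch, with illustrative dependency names:

    deps <- c("Rcpp", "XML")  # a package's declared dependencies
    vapply(deps, function(p) as.character(packageVersion(p)), character(1))

The open question is only where such a statement would live in the
DESCRIPTION, since no standard field exists for it today.)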

Carl



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Jeroen Ooms
On Wed, Mar 19, 2014 at 7:00 AM, Kasper Daniel Hansen
 wrote:
> Our experience in Bioconductor is that this is a pretty hard problem.
>
> What the OP presumably wants is some guarantee that all packages on CRAN work 
> well together.

Obviously we cannot guarantee that all packages on CRAN work
together. But what we can do is prevent problems that are introduced
by version ambiguity. If an author develops and tests a script/package
with dependency Rcpp 0.10.6, the best chance of making that script or
package work for other users is using Rcpp 0.10.6.

This especially holds if there is a big time difference between the
author creating the pkg/script and someone using it. In practice, most
Sweave/knitr scripts used for generating papers and articles cannot
be reproduced after a while because the dependency packages have
changed in the meantime. These problems can largely be mitigated with
a release cycle.

I am not arguing that anyone should put manual effort into testing
that packages work together. On the contrary: a system that separates
development from released branches prevents you from having to
continuously test all reverse dependencies for every package update.

My argument is simply that many problems introduced by version
ambiguity can be prevented if we can unite the entire R community
around using a single version of each CRAN package for every specific
release of R, similar to how Linux distributions use a single version
of each software package in a particular release of the distribution.



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Hervé Pagès

Hi,

On 03/19/2014 07:00 AM, Kasper Daniel Hansen wrote:

Our experience in Bioconductor is that this is a pretty hard problem.


What's hard and requires a substantial amount of human resources is to
run our build system (set up the build machines, keep up with changes
in R, babysit the builds, assist developers with build issues, etc...)

But *freezing* the CRAN packages for each version of R is *very* easy
to do. The CRAN maintainers already do it for the binary packages.
What could be the reason for not doing it for source packages too?
Maybe in prehistoric times there was this belief that a source package
was meant to remain compatible with all versions of R, present and
future, but that dream is dead and gone...

Right now the layout of the CRAN package repo is:

  ├── src
  │   └── contrib
  └── bin
      ├── windows
      │   └── contrib
      │       ├── ...
      │       ├── 3.0
      │       ├── 3.1
      │       └── ...
      └── macosx
          └── contrib
              ├── ...
              ├── 3.0
              ├── 3.1
              └── ...

when it could be:

  ├── 3.0
  │   ├── src
  │   │   └── contrib
  │   └── bin
  │       ├── windows
  │       │   └── contrib
  │       └── macosx
  │           └── contrib
  ├── 3.1
  │   ├── src
  │   │   └── contrib
  │   └── bin
  │       ├── windows
  │       │   └── contrib
  │       └── macosx
  │           └── contrib
  ├── ...

That is: the split by version is done at the top, not at the bottom.

It doesn't use more disk space than the current layout (you can just
throw the src/contrib/Archive/ folder away, there is no more need
for it).

install.packages() and family would need to be modified a little bit
to work with this new layout. And that's all!
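
To give a feel for how small that modification is, here is a hedged
sketch of the path construction under the proposed layout (the function
name is hypothetical; compare utils::contrib.url):

    contrib_url_versioned <- function(repos) {
      rver <- paste(R.version$major,
                    strsplit(R.version$minor, ".", fixed = TRUE)[[1]][1],
                    sep = ".")
      paste(repos, rver, "src/contrib", sep = "/")
    }
    # contrib_url_versioned("http://cran.r-project.org")
    # -> "http://cran.r-project.org/3.0/src/contrib"  (under R 3.0.x)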

The never-ending changes in Mac OS X binary formats could be handled
in a cleaner way, i.e. no more symlinks under bin/macosx to keep
backward compatibility with different binary formats and with old
versions of install.packages().

Then, 10 years from now, you can reproduce an analysis that you
did today with R-3.0, because when you install R-3.0 and the
packages required for this analysis, you'll end up with exactly
the same package versions as today.

Cheers,
H.






--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Jeroen Ooms
On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich 
wrote:
>
> The suggested solution is not described in the referenced article.  It
> was not suggested that it be the operating system's responsibility to
> distribute snapshots, nor was it suggested to create binary
> repositories for specific operating systems, nor was it suggested to
> freeze only a subset of CRAN packages.


IMO this is an implementation detail. If we could all agree on a particular
set of CRAN packages to be used with a certain release of R, then it
doesn't matter how the 'snapshotting' gets implemented. It could be a
separate repository, or a directory on CRAN with symbolic links, or a page
somewhere with hyperlinks to the respective source packages. Or you can put
all packages in a big zip file, or include it in your OS distribution. You
can even distribute your entire repo on CD-ROMs (Debian style!) or do all of
the above.

The hard problem is not implementation. The hard part is that for
reproducibility to work, we need community-wide conventions on which
versions of CRAN packages are used by a particular release of R. Local
downstream solutions are impractical, because they result in
scripts/packages that only work within your niche using this particular
snapshot. I expect that requiring every script to be executed in the
context of dependencies from some particular third-party repository will
make reproducibility even less common. Therefore I am trying to make a
case for a solution that would naturally improve the
reliability/reproducibility of R code without any effort by the end user.



[Rd] Memcheck: Invalid read of size 4

2014-03-19 Thread Christophe Genolini

Hi the list,

One of my packages has a memory issue that I do not manage to understand. The
Memcheck notes are here:


Here is the message that I get from Memcheck:

--- 8< 
 ~ Fast KmL ~
==27283== Invalid read of size 4
==27283==at 0x10C5DF28: kml1 (kml.c:183)
...
==27283==by 0x10C5DE4F: kml1 (kml.c:151)
...
==27283==at 0x10C5DF90: kml1 (kml.c:198)
--- 8< 


Here is the function kml1 from the file kml.c (I added some comments to tag
lines 151, 183 and 198):

--- 8< 
void kml1(double *traj, int *nbInd, int *nbTime, int *nbClusters, int *maxIt,
          int *clusterAffectation1, int *convergenceTime){

    int i=0, iter=0;
    int *clusterAffectation2 = malloc(*nbInd * sizeof(int));        // line 151
    double *trajMean = malloc(*nbClusters * *nbTime * sizeof(double));

    for(i = 0; i < *nbClusters * *nbTime; i++){trajMean[i] = 0.0;};
    for(i = 0; i < *nbInd; i++){clusterAffectation2[i] = 0;};

    for(iter = 0; iter < *maxIt; iter+=2){
        calculMean(traj,nbInd,nbTime,clusterAffectation1,nbClusters,trajMean);
        affecteIndiv(traj,nbInd,nbTime,trajMean,nbClusters,clusterAffectation2);

        i = 0;
        while(clusterAffectation1[i]==clusterAffectation2[i] && i<*nbInd){i++;}; // line 183
        if(i == *nbInd){
            *convergenceTime = iter + 1;
            break;
        }else{};

        calculMean(traj,nbInd,nbTime,clusterAffectation2,nbClusters,trajMean);
        affecteIndiv(traj,nbInd,nbTime,trajMean,nbClusters,clusterAffectation1);

        i = 0;
        while(clusterAffectation1[i]==clusterAffectation2[i] && i<*nbInd){i++;}; // line 198
        if(i == *nbInd){
            *convergenceTime = iter + 2;
            break;
        }else{};
    }
}
--- 8< 

Do you know what is wrong in my C code?
Thanks

Christophe

--
Christophe Genolini
Maître de conférences en bio-statistique
Université Paris Ouest Nanterre La Défense
INSERM UMR 1027



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Joshua Ulrich
On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms  wrote:
> On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich 
> wrote:
>>
>> The suggested solution is not described in the referenced article.  It
>> was not suggested that it be the operating system's responsibility to
>> distribute snapshots, nor was it suggested to create binary
>> repositories for specific operating systems, nor was it suggested to
>> freeze only a subset of CRAN packages.
>
>
> IMO this is an implementation detail. If we could all agree on a particular
> set of CRAN packages to be used with a certain release of R, then it
> doesn't matter how the 'snapshotting' gets implemented. It could be a
> separate repository, or a directory on CRAN with symbolic links, or a page
> somewhere with hyperlinks to the respective source packages. Or you can put
> all packages in a big zip file, or include it in your OS distribution. You
> can even distribute your entire repo on CD-ROMs (Debian style!) or do all
> of the above.
>
> The hard problem is not implementation. The hard part is that for
> reproducibility to work, we need community-wide conventions on which
> versions of CRAN packages are used by a particular release of R. Local
> downstream solutions are impractical, because they result in
> scripts/packages that only work within your niche using this particular
> snapshot. I expect that requiring every script to be executed in the
> context of dependencies from some particular third-party repository will
> make reproducibility even less common. Therefore I am trying to make a
> case for a solution that would naturally improve the
> reliability/reproducibility of R code without any effort by the end user.
>
So implementation isn't a problem.  The problem is that you need a way
to force people not to be able to use different package versions than
what existed at the time of each R release.  I said this in my
previous email, but you removed and did not address it: "However, you
would need to find a way to actively _prevent_ people from installing
newer versions of packages with the stable R releases."  Frankly, I
would stop using CRAN if this policy were adopted.

I suggest you go build this yourself.  You have all the code available
on CRAN, and the dates at which each package was published.  If others
who care about reproducible research find what you've built useful,
you will create the very community you want.  And you won't have to
force one single person to change their workflow.

Best,
--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Dan Tenenbaum


- Original Message -
> From: "Joshua Ulrich" 
> To: "Jeroen Ooms" 
> Cc: "r-devel" 
> Sent: Wednesday, March 19, 2014 2:59:53 PM
> Subject: Re: [Rd] [RFC] A case for freezing CRAN
> 
> So implementation isn't a problem.  The problem is that you need a way
> to force people not to be able to use different package versions than
> what existed at the time of each R release.  I said this in my
> previous email, but you removed and did not address it: "However, you
> would need to find a way to actively _prevent_ people from installing
> newer versions of packages with the stable R releases."  Frankly, I
> would stop using CRAN if this policy were adopted.
> 

I don't see how the proposal forces anyone to do anything. If you have an old 
version of R and you still want to install newer versions of packages, you can 
download them from their CRAN landing page. As I understand it, the proposal 
only addresses what packages would be installed **by default** for a given 
version of R.

People would be free to override those default settings (by downloading newer 
packages as described above) but they should then not expect to be able to 
reproduce an earlier analysis since they'll have the wrong package versions. If 
they don't care, that's fine (provided that no other problems arise, such as 
the newer package depending on a feature of R that doesn't exist in the version 
you're running).

Dan



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Jeroen Ooms
On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich  wrote:
>
> So implementation isn't a problem.  The problem is that you need a way
> to force people not to be able to use different package versions than
> what existed at the time of each R release.  I said this in my
> previous email, but you removed and did not address it: "However, you
> would need to find a way to actively _prevent_ people from installing
> newer versions of packages with the stable R releases."  Frankly, I
> would stop using CRAN if this policy were adopted.

I am not proposing to "force" anything on anyone; those are your
words. Please read the proposal more carefully before derailing the
discussion. Below, *verbatim*, is a section from the paper:

To fully make the transition to a staged CRAN, the default behavior of
the package manager must be modified to download packages from the
stable branch of the current version of R, rather than the latest
development release. As such, all users on a given version of R will
be using the same version of each CRAN package, regardless on when it
was installed. The user could still be given an option to try and
install the development version from the unstable branch, for example
by adding an additional parameter to install.packages named
devel=TRUE. However when installing an unstable package, it must be
flagged, and the user must be warned that this version is not properly
tested and might not be working as expected. Furthermore, when loading
this package a warning could be shown with the version number so that
it is also obvious from the output that results were produced using a
non-standard version of the contributed package. Finally, users that
would always like to use the very latest versions of all packages,
e.g. developers, could install the r-devel release of R. This version
contains the latest commits by R Core and downloads packages from the
devel branch on CRAN, but should not be used in production or
reproducible research settings.
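
In code terms the proposal amounts to something like this, where devel=
is the proposed (and currently nonexistent) argument:

    install.packages("Rcpp")                # stable branch for this R release
    install.packages("Rcpp", devel = TRUE)  # opt in to the unstable branch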



Re: [Rd] Memcheck: Invalid read of size 4

2014-03-19 Thread peter dalgaard

On 19 Mar 2014, at 22:58 , Christophe Genolini  wrote:

> Hi the list,
> 
> One of my packages has a memory issue that I do not manage to understand. The
> Memcheck notes are here:
> 
> 
> Here is the message that I get from Memcheck:
> 
> --- 8< 
> ~ Fast KmL ~
> ==27283== Invalid read of size 4
> ==27283==at 0x10C5DF28: kml1 (kml.c:183)
> ...
> ==27283==by 0x10C5DE4F: kml1 (kml.c:151)
> ...
> ==27283==at 0x10C5DF90: kml1 (kml.c:198)
> --- 8< 
> 
> 
> Here is the function kml1 from the file kml.c (I added some comments to tag
> lines 151, 183 and 198):
> 
> --- 8< 
> void kml1(double *traj, int *nbInd, int *nbTime, int *nbClusters, int *maxIt, 
> int *clusterAffectation1, int *convergenceTime){
> 
>int i=0,iter=0;
>    int *clusterAffectation2=malloc(*nbInd * sizeof(int));   // line 151
>double *trajMean=malloc(*nbClusters * *nbTime * sizeof(double));
> 
>for(i = 0; i < *nbClusters * *nbTime; i++){trajMean[i] = 0.0;};
>for(i = 0; i < *nbInd; i++){clusterAffectation2[i] = 0;};
> 
>for(iter = 0; iter < *maxIt; iter+=2){
>   calculMean(traj,nbInd,nbTime,clusterAffectation1,nbClusters,trajMean);
>   
> affecteIndiv(traj,nbInd,nbTime,trajMean,nbClusters,clusterAffectation2);
> 
>   i = 0;
>    while(clusterAffectation1[i]==clusterAffectation2[i] && i<*nbInd){i++;}; // line 183
>   if(i == *nbInd){
>   *convergenceTime = iter + 1;
>   break;
>   }else{};
> 
>   calculMean(traj,nbInd,nbTime,clusterAffectation2,nbClusters,trajMean);
>   affecteIndiv(traj,nbInd,nbTime,trajMean,nbClusters,clusterAffectation1);
> 
>   i = 0;
>    while(clusterAffectation1[i]==clusterAffectation2[i] && i<*nbInd){i++;}; // line 198
>   if(i == *nbInd){
>   *convergenceTime = iter + 2;
>   break;
>   }else{};
>}
> }
> --- 8< 
> 
> Do you know what is wrong in my C code?

Yes. You need to reverse the operands of &&. Otherwise you'll be indexing with
i == *nbInd before finding that (i < *nbInd) is false.
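
That is, as a sketch of the fix rather than tested code:

    i = 0;
    while (i < *nbInd && clusterAffectation1[i] == clusterAffectation2[i]) {
        i++;  /* bound is checked first, so i never reaches *nbInd inside [] */
    }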


-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Joshua Ulrich
On Wed, Mar 19, 2014 at 5:16 PM, Jeroen Ooms  wrote:
> On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich  
> wrote:
>>
>> So implementation isn't a problem.  The problem is that you need a way
>> to force people not to be able to use different package versions than
>> what existed at the time of each R release.  I said this in my
>> previous email, but you removed and did not address it: "However, you
>> would need to find a way to actively _prevent_ people from installing
>> newer versions of packages with the stable R releases."  Frankly, I
>> would stop using CRAN if this policy were adopted.
>
> I am not proposing to "force" anything on anyone; those are your
> words. Please read the proposal more carefully before derailing the
> discussion. Below, *verbatim*, is a section from the paper:
>


Yes "force" is too strong a word.  You want a barrier (however small)
to prevent people from installing newer (or older) versions of
packages than those that correspond to a given R release.

I still think you're going to have a very hard time convincing CRAN
maintainers to take up your cause, even if you were to build support
for it, especially because there's nothing stopping anyone else from
doing it.

--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Hervé Pagès



On 03/19/2014 02:59 PM, Joshua Ulrich wrote:



So implementation isn't a problem.  The problem is that you need a way
to force people not to be able to use different package versions than
what existed at the time of each R release.  I said this in my
previous email, but you removed and did not address it: "However, you
would need to find a way to actively _prevent_ people from installing
newer versions of packages with the stable R releases."  Frankly, I
would stop using CRAN if this policy were adopted.

I suggest you go build this yourself.  You have all the code available
on CRAN, and the dates at which each package was published.  If others
who care about reproducible research find what you've built useful,
you will create the very community you want.  And you won't have to
force one single person to change their workflow.


Yeah, we've already heard this "do it yourself" kind of answer. Not a
very productive one, honestly.

Well, actually, that's what we've done for the Bioconductor repositories:
we freeze the BioC packages for each version of Bioconductor. But since
this freezing doesn't happen at the CRAN level, and many BioC packages
depend on CRAN packages, the freezing is only at the surface. It would be
much better if the freezing went all the way down to the bottom of the
sea. (Note that it already does if you install binary packages only.)

Yes, it's technically possible to work around this by also hosting
frozen versions of CRAN, one per version of Bioconductor, and having
biocLite() (the tool BioC users use for installing packages) point to
these frozen versions of CRAN in order to get the correct dependencies
for any given version of BioC. However, we don't do that because it
would mean extra costs for us in terms of storage space and bandwidth,
and also because we believe that it would be more effective, and would
ultimately benefit the entire R community (and not just the BioC
community), if this problem were addressed upstream.

H.






--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:(206) 667-1319



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Romain Francois
Weighting in. FWIW, I find the proposal conceptually quite interesting. 

For package developers, it does not have to be a frustration to have to wait a 
new version of R to release their code. Anticipated frustration was my initial 
reaction. Thinking about this more, I think this could be changed into 
opportunity. 

Since the pattern here is to use Rcpp as an example of something causing 
compatibility headaches, and I have some responsibility there, maybe I can 
comment on this. I would find it extremely valuable if there was only one 
unique version of Rcpp for a given released version of R. 

Users would have to wait longer to have the new stuff, but one can argue that 
at least they get something that is more tested. 

Would it be helpful for authors of packages that have lots of dependencies to 
start using stricter Depends declarations in their DESCRIPTION files, e.g.: 

Depends: R (== 3.1.0)

?

Romain


For example, personally I’m waiting for 3.1.0 to release Rcpp11, because I 
want to leverage some C++11 support that has been included in R. It has been 
frustrating to have to wait, but it does change the way I make changes to the 
codebase. Perhaps it is a good habit to adopt. And it does not demand « more work 
» from others, just more discipline and self-control from people implementing 
this pattern. 

Also, declaring a strict dependency requirement against a released version of R 
could perhaps put an end to the drama of « you were asked to test this against a 
very recent version of R-devel, and guess what, a few hours ago I’ve just added a 
new test that makes your package no longer R CMD check worthy ». So less work for 
CRAN maintainers then. 

On 19 March 2014 at 23:57, Hervé Pagès wrote:

> 
> 
> On 03/19/2014 02:59 PM, Joshua Ulrich wrote:
>> On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms  
>> wrote:
>>> On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich 
>>> wrote:
 
>>>> The suggested solution is not described in the referenced article.  It
>>>> was not suggested that it be the operating system's responsibility to
>>>> distribute snapshots, nor was it suggested to create binary
>>>> repositories for specific operating systems, nor was it suggested to
>>>> freeze only a subset of CRAN packages.
>>> 
>>> 
>>> IMO this is an implementation detail. If we could all agree on a particular
>>> set of cran packages to be used with a certain release of R, then it doesn't
>>> matter how the 'snapshotting' gets implemented. It could be a separate
>>> repository, or a directory on cran with symbolic links, or a page somewhere
>>> with hyperlinks to the respective source packages. Or you can put all
>>> packages in a big zip file, or include it in your OS distribution. You can
>>> even distribute your entire repo on cdroms (debian style!) or do all of the
>>> above.
>>> 
>>> The hard problem is not implementation. The hard part is that for
>>> reproducibility to work, we need community wide conventions on which
>>> versions of cran packages are used by a particular release of R. Local
>>> downstream solutions are impractical, because this results in
>>> scripts/packages that only work within your niche using this particular
>>> snapshot. I expect that requiring every script be executed in the context of
>>> dependencies from some particular third party repository will make
>>> reproducibility even less common. Therefore I am trying to make a case for a
>>> solution that would naturally improve reliability/reproducibility of R code
>>> without any effort by the end-user.
>>> 
>> So implementation isn't a problem.  The problem is that you need a way
>> to force people not to be able to use different package versions than
>> what existed at the time of each R release.  I said this in my
>> previous email, but you removed and did not address it: "However, you
>> would need to find a way to actively _prevent_ people from installing
>> newer versions of packages with the stable R releases."  Frankly, I
>> would stop using CRAN if this policy were adopted.
>> 
>> I suggest you go build this yourself.  You have all the code available
>> on CRAN, and the dates at which each package was published.  If others
>> who care about reproducible research find what you've built useful,
>> you will create the very community you want.  And you won't have to
>> force one single person to change their workflow.
> 
> Yeah we've already heard this "do it yourself" kind of answer. Not a
> very productive one honestly.
> 
> Well actually that's what we've done for the Bioconductor repositories:
> we freeze the BioC packages for each version of Bioconductor. But since
> this freezing doesn't happen at the CRAN level, and many BioC packages
> depend on CRAN packages, the freezing is only at the surface. Would be
> much better if the freezing was all the way down to the bottom of the
> sea. (Note that it is already if you install binary packages only.)
> 
> Yes it's technically possible to work around this by also hosting
> frozen versions

[Rd] possible bug: graphics::image seems to ignore getOption("preferRaster")

2014-03-19 Thread Dr Gregory Jefferis

the details section of ?image says:

If useRaster is not specified, raster images are used when the 
getOption("preferRaster") is true, the grid is regular and either 
dev.capabilities("raster") is "yes" or it is "non-missing" and there 
are no missing values.


but in my experience this is never the case and 
getOption("preferRaster") is ignored. As far as I can see, the logic for 
checking this in image is broken here:


  ras <- dev.capabilities("raster")
  if (identical(ras, "yes"))
      useRaster <- TRUE

because dev.capabilities("raster") returns a list like this (on my 
machine, R.version in footer)

$rasterImage
[1] "yes"


You can test this by doing:

  ras=structure(list(rasterImage = "yes"), .Names = "rasterImage")
  identical(ras,'yes') # returns FALSE

so the test would need to be something like:

  ras <- dev.capabilities("raster")[[1]]
  if (identical(ras, "yes"))
      useRaster <- TRUE
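
Or, to avoid relying on the element's position in the returned list,
the named component could be extracted before comparing (a sketch,
assuming the $rasterImage naming shown above):

  ## pick out the named component rather than the first element
  ras <- dev.capabilities("rasterImage")$rasterImage
  if (identical(ras, "yes"))
      useRaster <- TRUE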

I can't find any relevant changes in R news

http://stat.ethz.ch/R-manual/R-devel/doc/html/NEWS.html

This discussion

https://www.mail-archive.com/r-devel@r-project.org/msg22811.html

suggests that Simon Urbanek may have added the useRaster option and 
looking at git blame on this mirror repo:



https://github.com/wch/r-source/blame/c3ba5b0be36d3a1290e18fe189142c88f1e43236/src/library/graphics/R/image.R#L111-L120

suggests that Brian Ripley's svn commit 56949 was the last to touch 
these lines:



https://github.com/wch/r-source/commit/b9012424f895bf681daf1b85255942547d495bcd

Thanks for any pointers if I am missing something!

Best wishes,

Greg Jefferis.



R.version

               _
platform       x86_64-apple-darwin10.8.0
arch           x86_64
os             darwin10.8.0
system         x86_64, darwin10.8.0
status
major          3
minor          0.3
year           2014
month          03
day            06
svn rev        65126
language       R
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy

--
Gregory Jefferis, PhD
Division of Neurobiology
MRC Laboratory of Molecular Biology
Francis Crick Avenue
Cambridge Biomedical Campus
Cambridge, CB2 OQH, UK

http://www2.mrc-lmb.cam.ac.uk/group-leaders/h-to-m/g-jefferis
http://jefferislab.org
http://flybrain.stanford.edu

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Gavin Simpson
"What am I overlooking?"

That this is already available and possible in R today, but perhaps
not widely used. Developers do tend to only include a lower bound if
they include any bounds at all on package dependencies.

As I mentioned elsewhere, R packages often aren't "built" against
other R packages, and developers may have a range of versions
being tested against, some of which may not be on CRAN yet.

Technically, all packages on CRAN would need to have a dependency cap
on R-devel, but as that is a moving target until it is released I
don't see in practice how enforcing an upper limit on the R dependency
would  work. The way CRAN works, you can't just set a dependency on R
== 3.0.x say. (As far as I understand CRAN's policies.)

For packages it is quite trivial for the developers to manually add
the required info for the upper bound, less so the lower bound, but you
could just pick a known working version. An upper range on the
dependencies could be stated as whatever version is current on CRAN.
But then what happens? Unbeknownst to you, a few days after you
release to CRAN your package foo with stated dependency on bar >= 1.2,
bar <= 1.8, the developer of bar releases bar v 2.0 and your package
no longer passes checks, CRAN gets in touch and you have to resubmit
another version. This could be desirable in terms of helping
contribute to reproducibility exercises, but incurs more effort on the
CRAN maintainers and package maintainers. Now, this might be an issue
because of the desire on CRAN's behalf to have some elements of human
intervention in the submission process, but you either work with CRAN
or do your own thing.
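
For concreteness, the hypothetical foo-on-bar bounds above would look
something like this in a DESCRIPTION file (listing a package more than
once to give a lower and an upper bound is, as far as I know, allowed):

  Package: foo
  Depends: R (>= 3.0.0)
  Imports: bar (>= 1.2), bar (<= 1.8)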

As Bioconductor have shown (for example) it is possible, if people
want to put in time and effort and have a community buy into an ethos,
to achieve staged releases etc.

G

On 19 March 2014 12:58, Carl Boettiger  wrote:
> Dear list,
>
> I'm curious what people would think of a more modest proposal at this time:
>
> State the version of the dependencies used by the package authors when the
> package was built.
>
> Eventually CRAN could enforce such a statement be present in the
> description. We encourage users to declare the version of the packages they
> use in publications, so why not have the same expectation of developers?
>  This would help address the problem of archived packages that Jeroen
> raises, as it is currently impossible to reliably install archived
> packages because their dependencies have since been updated and are no
> longer compatible.  (Even if it passes checks and installs, we have no way
> of knowing if the upstream changes have introduced a bug).  This
> information would be relatively straightforward to capture, shouldn't
> change the way anyone currently uses CRAN, and should address a major pain
> point anyone trying to install archived versions from CRAN has probably
> encountered.  What am I overlooking?
>
> Carl
>
>
> On Wed, Mar 19, 2014 at 11:36 AM, Spencer Graves <
> spencer.gra...@structuremonitoring.com> wrote:
>
>>   What about having this purpose met with something like an expansion
>> of R-Forge?  We could have packages submitted to R-Forge rather than CRAN,
>> and people who wanted the latest could get it from R-Forge.  If changes I
>> make on R-Forge break a reverse dependency, emails explaining the problem
>> are sent to both me and the maintainer for the package I broke.
>>
>>
>>   The budget for R-Forge would almost certainly need to be increased:
>>  They currently disable many of the tests they once ran.
>>
>>
>>   Regarding budget, the R Project would get more donations if they
>> asked for them and made it easier to contribute.  I've tried multiple times
>> without success to find a way to donate.  I didn't try hard, but it
>> shouldn't be hard ;-)  (And donations should be accepted in US dollars and
>> Euros -- and maybe other currencies.) There should be a procedure whereby
>> anyone could receive a pro forma invoice, which they can pay or ignore as
>> they choose.  I mention this, because many grants could cover a reasonable
>> fee provided they have an invoice.
>>
>>
>>   Spencer Graves
>>
>>
>> On 3/19/2014 10:59 AM, Jeroen Ooms wrote:
>>
>>> On Wed, Mar 19, 2014 at 5:52 AM, Duncan Murdoch wrote:
>>>
>>>> I don't see why CRAN needs to be involved in this effort at all.  A third
>>>> party could take snapshots of CRAN at R release dates, and make those
>>>> available to package users in a separate repository.  It is not hard to
>>>> set a different repository than CRAN as the default location from which to
>>>> obtain packages.
>>>
>>> I am happy to see many people giving this some thought and engaging in the
>>> discussion.
>>>
>>> Several have suggested that staging & freezing can be simply done by a
>>> third party. This solution and its limitations is also described in the
>>> paper [1] in the section titled "R: downstream staging and repackaging".
>>>
>>> If this would solve the problem with

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Gavin Simpson
Given that R is (has) moved to a 12 month release cycle, I don't want
to either i) wait a year to get new packages (or allow users to use
new versions of my packages), or ii) have to run R-devel just to use
new packages. (or be on R-testing for that matter).

People then will start finding ways around these limitations and then
we're back to square one of having people use a set of R packages and
R versions that could potentially be all over the place.

As a package developer, it is pretty easy to say I've tested my
package works with these other packages and their versions, and set
DESCRIPTION to reflect only those versions as allowed (or a range as a
package matures and the maintainer has tested against more versions of
the dependencies). CRAN may well not like this if your package no
longer builds/checks on their system but then you have a choice to
make: stick to your reproducibility guns & forsake CRAN in favour of
something else (github, one's own repo), or relent and meet CRAN's
requirements.

On 19 March 2014 16:57, Hervé Pagès  wrote:
>
>
> On 03/19/2014 02:59 PM, Joshua Ulrich wrote:
>>
>> On Wed, Mar 19, 2014 at 4:28 PM, Jeroen Ooms 
>> wrote:
>>>
>>> On Wed, Mar 19, 2014 at 11:50 AM, Joshua Ulrich 
>>> wrote:


>>>> The suggested solution is not described in the referenced article.  It
>>>> was not suggested that it be the operating system's responsibility to
>>>> distribute snapshots, nor was it suggested to create binary
>>>> repositories for specific operating systems, nor was it suggested to
>>>> freeze only a subset of CRAN packages.
>>>
>>>
>>>
>>> IMO this is an implementation detail. If we could all agree on a
>>> particular
>>> set of cran packages to be used with a certain release of R, then it
>>> doesn't
>>> matter how the 'snapshotting' gets implemented. It could be a separate
>>> repository, or a directory on cran with symbolic links, or a page
>>> somewhere
>>> with hyperlinks to the respective source packages. Or you can put all
>>> packages in a big zip file, or include it in your OS distribution. You
>>> can
>>> even distribute your entire repo on cdroms (debian style!) or do all of
>>> the
>>> above.
>>>
>>> The hard problem is not implementation. The hard part is that for
>>> reproducibility to work, we need community wide conventions on which
>>> versions of cran packages are used by a particular release of R. Local
>>> downstream solutions are impractical, because this results in
>>> scripts/packages that only work within your niche using this particular
>>> snapshot. I expect that requiring every script be executed in the context
>>> of
>>> dependencies from some particular third party repository will make
>>> reproducibility even less common. Therefore I am trying to make a case
>>> for a
>>> solution that would naturally improve reliability/reproducibility of R
>>> code
>>> without any effort by the end-user.
>>>
>> So implementation isn't a problem.  The problem is that you need a way
>> to force people not to be able to use different package versions than
>> what existed at the time of each R release.  I said this in my
>> previous email, but you removed and did not address it: "However, you
>> would need to find a way to actively _prevent_ people from installing
>> newer versions of packages with the stable R releases."  Frankly, I
>> would stop using CRAN if this policy were adopted.
>>
>> I suggest you go build this yourself.  You have all the code available
>> on CRAN, and the dates at which each package was published.  If others
>> who care about reproducible research find what you've built useful,
>> you will create the very community you want.  And you won't have to
>> force one single person to change their workflow.
>
>
> Yeah we've already heard this "do it yourself" kind of answer. Not a
> very productive one honestly.
>
> Well actually that's what we've done for the Bioconductor repositories:
> we freeze the BioC packages for each version of Bioconductor. But since
> this freezing doesn't happen at the CRAN level, and many BioC packages
> depend on CRAN packages, the freezing is only at the surface. Would be
> much better if the freezing was all the way down to the bottom of the
> sea. (Note that it is already if you install binary packages only.)
>
> Yes it's technically possible to work around this by also hosting
> frozen versions of CRAN, one per version of Bioconductor, and have
> biocLite() (the tool BioC users use for installing packages) point to
> these frozen versions of CRAN in order to get the correct dependencies
> for any given version of BioC. However we don't do that because that
> would mean extra costs for us in terms of storage space and bandwidth.
> And also because we believe that it would be more effective and would
> ultimately benefit the entire R community (and not just the BioC
> community) if this problem was addressed upstream.
>
>
> H.
>
>>
>> Best,
>> --
>> Joshua Ulrich  |  about.me/joshuaulrich
>> FOSS Trading  |  www

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Michael Weylandt


On Mar 19, 2014, at 18:42, Joshua Ulrich  wrote:

> On Wed, Mar 19, 2014 at 5:16 PM, Jeroen Ooms  
> wrote:
>> On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich  
>> wrote:
>>> 
>>> So implementation isn't a problem.  The problem is that you need a way
>>> to force people not to be able to use different package versions than
>>> what existed at the time of each R release.  I said this in my
>>> previous email, but you removed and did not address it: "However, you
>>> would need to find a way to actively _prevent_ people from installing
>>> newer versions of packages with the stable R releases."  Frankly, I
>>> would stop using CRAN if this policy were adopted.
>> 
>> I am not proposing to "force" anything to anyone, those are your
>> words. Please read the proposal more carefully before derailing the
>> discussion. Below *verbatim* a section from the paper:
> 
> 
> Yes "force" is too strong a word.  You want a barrier (however small)
> to prevent people from installing newer (or older) versions of
> packages than those that correspond to a given R release.


Jeroen,

Reading this thread again, is it a fair summary of your position to say 
"reproducibility by default is more important than giving users access to the 
newest bug fixes and features by default?" It's certainly arguable, but I'm not 
sure I'm convinced: I'd imagine that the ratio of new work being done vs 
reproductions is rather high and the current setup optimizes for that already. 

What I'm trying to figure out is why the standard "install the following list 
of package versions" isn't good enough in your eyes? Is it the lack of CRAN 
provided binaries or the fact that the user has to proactively set up their 
environment to replicate that of published results?

In your XML example, it seems the problem was that the reproducer didn't check 
that they used the same package versions as the original author, and instead 
assumed that 'latest' would be the same. Annoying, yes, but easy to solve. 

Michael

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Gavin Simpson
Michael,

I think the issue is that Jeroen wants to take that responsibility out
of the hands of the person trying to reproduce a work. If it used R
3.0.x and packages A, B and C then it would be trivial to install
that version of R and then pull down the stable versions of A, B and C
for that version of R. At the moment, one might note the packages used
and even their versions, but what about the versions of the packages
that the used packages rely upon & so on? What if developers don't
state known working versions of dependencies?

The problem is how the heck do you know which versions of packages are
needed if developers don't record these dependencies in sufficient
detail? The suggested solution is to freeze CRAN at intervals
alongside R releases. Then you'd know what the stable versions were.

Or we could just get package developers to be more thorough in
documenting dependencies. Or R CMD check could refuse to pass if a
package is listed as a dependency but with no version qualifiers. Or
have R CMD build add an upper bound (from the current, at build-time
version of dependencies on CRAN) if the package developer didn't
include an upper bound. Or... The first is unlikely to happen
consistently, and no-one wants *more* checks and hoops to jump through
:-)

To my mind it is incumbent upon those wanting reproducibility to build
the tools to enable users to reproduce works. When you write a paper
or release a tool, you will have tested it with a specific set of
packages. It is relatively easy to work out what those versions are
(there are tools in R for this). What is required is an automated way
to record that info in an agreed-upon format in an approved
file/location, and a tool that facilitates setting up a package
library sufficient to reproduce a work. That approval
doesn't need to come from CRAN or R Core - we can store anything in
./inst.
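
A rough sketch of such a recording tool: dump the loaded package
versions from sessionInfo() into a file shipped with the work (the
function and file names here are made up for illustration):

  record_deps <- function(file = "inst/DEPENDENCIES") {
      si <- sessionInfo()
      ## attached plus loaded-but-not-attached packages
      pkgs <- c(si$otherPkgs, si$loadedOnly)
      writeLines(c(paste("R", getRversion()),
                   vapply(pkgs,
                          function(p) paste(p$Package, p$Version),
                          character(1))),
                 file)
  }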

Reproducibility is a very important part of doing "science", but not
everyone using CRAN is doing that. Why force everyone to march to the
reproducibility drum? I would place the onus elsewhere to make this
work.

Gavin
A scientist, very much interested in reproducibility of my work and others.

On 19 March 2014 19:55, Michael Weylandt  wrote:
>
>
> On Mar 19, 2014, at 18:42, Joshua Ulrich  wrote:
>
>> On Wed, Mar 19, 2014 at 5:16 PM, Jeroen Ooms  
>> wrote:
>>> On Wed, Mar 19, 2014 at 2:59 PM, Joshua Ulrich  
>>> wrote:

>>>> So implementation isn't a problem.  The problem is that you need a way
>>>> to force people not to be able to use different package versions than
>>>> what existed at the time of each R release.  I said this in my
>>>> previous email, but you removed and did not address it: "However, you
>>>> would need to find a way to actively _prevent_ people from installing
>>>> newer versions of packages with the stable R releases."  Frankly, I
>>>> would stop using CRAN if this policy were adopted.
>>>
>>> I am not proposing to "force" anything to anyone, those are your
>>> words. Please read the proposal more carefully before derailing the
>>> discussion. Below *verbatim* a section from the paper:
>> 
>>
>> Yes "force" is too strong a word.  You want a barrier (however small)
>> to prevent people from installing newer (or older) versions of
>> packages than those that correspond to a given R release.
>
>
> Jeroen,
>
> Reading this thread again, is it a fair summary of your position to say 
> "reproducibility by default is more important than giving users access to the 
> newest bug fixes and features by default?" It's certainly arguable, but I'm 
> not sure I'm convinced: I'd imagine that the ratio of new work being done vs 
> reproductions is rather high and the current setup optimizes for that already.
>
> What I'm trying to figure out is why the standard "install the following list 
> of package versions" isn't good enough in your eyes? Is it the lack of CRAN 
> provided binaries or the fact that the user has to proactively set up their 
> environment to replicate that of published results?
>
> In your XML example, it seems the problem was that the reproducer didn't 
> check that the same package versions as the reproducee and instead assumed 
> that 'latest' would be the same. Annoying yes, but easy to solve.
>
> Michael
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



-- 
Gavin Simpson, PhD

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] modifying data in a package

2014-03-19 Thread Ross Boylan
I've tweaked Rmpi and want to have some variables that hold data in the
package.  One of the R files starts
mpi.isend.obj <- vector("list", 500)    # mpi.request.maxsize()
mpi.isend.inuse <- rep(FALSE, 500)      # mpi.request.maxsize()

and then functions update those variables with <<-.  When run:
  Error in mpi.isend.obj[[i]] <<- .force.type(x, type) :
  cannot change value of locked binding for 'mpi.isend.obj'

I'm writing to ask the proper way to accomplish this objective (getting
a variable I can update in package namespace--or at least somewhere
useful and hidden from the outside).

I think the problem is that the package namespace is locked.  So how do
I achieve the same effect?
http://www.r-bloggers.com/package-wide-variablescache-in-r-packages/
recommends creating an environment and then updating it.  Is that the
preferred route?  (It seems odd that the list should be locked but the
environment would be manipulable.  I know environments are special.)
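
For reference, the pattern that post describes boils down to something
like this (all names made up):

  ## Created at the top level of a package R file.  The namespace
  ## locks this *binding*, but the environment's contents can still
  ## be modified at run time.
  .cache <- new.env(parent = emptyenv())

  set_cached <- function(name, value) assign(name, value, envir = .cache)
  get_cached <- function(name) get(name, envir = .cache)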

The comments indicate that 500 "should" be mpi.request.maxsize().  That
doesn't work because mpi.request.maxsize calls a C function, and there
is an error that the function isn't loaded.  I guess the R code is
evaluated before the C libraries are loaded. The package's zzz.R starts
.onLoad <- function (lib, pkg) {
library.dynam("Rmpi", pkg, lib)

So would moving the code into .onLoad after that work?  In that case,
how do I get the environment into the  proper scope?  Would
 .onLoad <- function (lib, pkg) {
     library.dynam("Rmpi", pkg, lib)
     assign("mpi.globals", new.env(), environment(mpi.isend))
     assign("mpi.isend.obj", vector("list", mpi.request.maxsize()),
            mpi.globals)
 }
work?

mpi.isend is a function in Rmpi.  But I'd guess the first assign will
fail because the environment is locked.

Thanks.
Ross Boylan

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Michael Weylandt


On Mar 19, 2014, at 22:17, Gavin Simpson  wrote:

> Michael,
> 
> I think the issue is that Jeroen wants to take that responsibility out
> of the hands of the person trying to reproduce a work. If it used R
> 3.0.x and packages A, B and C then it would be trivial to to install
> that version of R and then pull down the stable versions of A B and C
> for that version of R. At the moment, one might note the packages used
> and even their versions, but what about the versions of the packages
> that the used packages rely upon & so on? What if developers don't
> state know working versions of dependencies?

Doesn't sessionInfo() give all of this?

If you want to be very worried about every last bit, I suppose it should also 
include options(), compiler flags, compiler version, BLAS details, etc.  (Good 
talk on the dregs of a floating point number and how hard it is to reproduce 
them across processors http://www.youtube.com/watch?v=GIlp4rubv8U)

> 
> The problem is how the heck do you know which versions of packages are
> needed if developers don't record these dependencies in sufficient
> detail? The suggested solution is to freeze CRAN at intervals
> alongside R releases. Then you'd know what the stable versions were.

Only if you knew which R release was used. 

> 
> Or we could just get package developers to be more thorough in
> documenting dependencies. Or R CMD check could refuse to pass if a
> package is listed as a dependency but with no version qualifiers. Or
> have R CMD build add an upper bound (from the current, at build-time
> version of dependencies on CRAN) if the package developer didn't
> include and upper bound. Or... The first is unliekly to happen
> consistently, and no-one wants *more* checks and hoops to jump through
> :-)
> 
> To my mind it is incumbent upon those wanting reproducibility to build
> the tools to enable users to reproduce works.

But the tools already allow it with minimal effort. If the author can't even 
include session info, how can we be sure the version of R is known? If we can't 
know which version of R, can we ever change R at all? And so on, to absurdity. 

My (serious) point is that the tools are in place, but ramming them down folks' 
throats by intentionally keeping them on older versions by default is too much. 

> When you write a paper
> or release a tool, you will have tested it with a specific set of
> packages. It is relatively easy to work out what those versions are
> (there are tools in R for this). What is required is an automated way
> to record that info in an agreed upon way in an approved
> file/location, and have a tool that facilitates setting up a package
> library sufficient with which to reproduce a work. That approval
> doesn't need to come from CRAN or R Core - we can store anything in
> ./inst.

I think the package version and published paper cases are different. 

For the latter, the recipe is simple: if you want the same results, use the 
same  software (as noted by sessionInfoPlus() or equiv)

For the former, I think you start straying into this NP-complete problem: 
http://people.debian.org/~dburrows/model.pdf 

Yes, a good config can (and should be recorded) but isn't that exactly what 
sessionInfo() gives?

> 
> Reproducibility is a very important part of doing "science", but not
> everyone using CRAN is doing that. Why force everyone to march to the
> reproducibility drum? I would place the onus elsewhere to make this
> work.
> 

Agreed: reproducibility is the onus of the author, not the reader


> Gavin
> A scientist, very much interested in reproducibility of my work and others.

Michael
In finance, where we call it "Auditability" and care very much as well :-)


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Jeroen Ooms
On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
 wrote:
> Reading this thread again, is it a fair summary of your position to say 
> "reproducibility by default is more important than giving users access to the 
> newest bug fixes and features by default?" It's certainly arguable, but I'm 
> not sure I'm convinced: I'd imagine that the ratio of new work being done vs 
> reproductions is rather high and the current setup optimizes for that already.

I think that separating development from released branches can give us
both reliability/reproducibility (stable branch) as well as new
features (unstable branch). The user gets to pick (and you can pick
both!). The same is true for r-base: when using a 'released' version
you get 'stable' base packages that are up to 12 months old. If you
want to have the latest stuff you download a nightly build of r-devel.
For regular users and reproducible research it is recommended to use
the stable branch. However if you are a developer (e.g. package
author) you might want to develop/test/check your work with the latest
r-devel.

I think that extending the R release cycle to CRAN would result both
in more stable released versions of R, as well as more freedom for
package authors to implement rigorous change in the unstable branch.
When writing a script that is part of a production pipeline, or sweave
paper that should be reproducible 10 years from now, or a book on
using R, you use stable version of R, which is guaranteed to behave
the same over time. However when developing packages that should be
compatible with the upcoming release of R, you use r-devel which has
the latest versions of other CRAN and base packages.


> What I'm trying to figure out is why the standard "install the following list 
> of package versions" isn't good enough in your eyes?

Almost nobody does this because it is cumbersome and impractical. We
can do so much better than this. Note that in order to install old
packages you also need to investigate which versions of dependencies
of those packages were used. On win/osx, users need to manually build
those packages, which can be a pain. All in all it makes reproducible
research difficult, expensive and error-prone. At the end of the
day most published results obtained with R just won't be reproducible.
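
For example, just fetching a single old package from the archives looks
something like this (the version number is purely illustrative), and the
same dance must then be repeated, at the right version, for every
dependency:

  url <- "http://cran.r-project.org/src/contrib/Archive/XML/XML_3.95-0.1.tar.gz"
  download.file(url, destfile = "XML_3.95-0.1.tar.gz")
  install.packages("XML_3.95-0.1.tar.gz", repos = NULL, type = "source")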

Also I believe that keeping it simple is essential for solutions to be
practical. If every script has to be run inside an environment with
custom libraries, it takes away much of its power. Running a bash or
python script in Linux is so easy and reliable that entire
distributions are based on it. I don't understand why we make our
lives so difficult in R.

In my estimation, a system where stable versions of R pull packages
from a stable branch of CRAN will naturally resolve the majority of
the reproducibility and reliability problems with R. And in contrast
to what some people here are suggesting it does not introduce any
limitations. If you want to get the latest stuff, you either grab a
copy of r-devel, or just enable the testing branch and off you go.
Debian 'testing' works in a similar way, see
http://www.debian.org/devel/testing.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Michael Weylandt
On Mar 19, 2014, at 22:45, Jeroen Ooms  wrote:

> On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
>  wrote:
>> Reading this thread again, is it a fair summary of your position to say 
>> "reproducibility by default is more important than giving users access to 
>> the newest bug fixes and features by default?" It's certainly arguable, but 
>> I'm not sure I'm convinced: I'd imagine that the ratio of new work being 
>> done vs reproductions is rather high and the current setup optimizes for 
>> that already.
> 
> I think that separating development from released branches can give us
> both reliability/reproducibility (stable branch) as well as new
> features (unstable branch). The user gets to pick (and you can pick
> both!). The same is true for r-base: when using a 'released' version
> you get 'stable' base packages that are up to 12 months old. If you
> want to have the latest stuff you download a nightly build of r-devel.
> For regular users and reproducible research it is recommended to use
> the stable branch. However if you are a developer (e.g. package
> author) you might want to develop/test/check your work with the latest
> r-devel.

I think where you are getting push back (e.g., Frank Harrell and Josh Ulrich) 
is from saying that 'stable' is the right branch for 'regular users.' And I 
tend to agree: I think most folks need features and bug fixes more than they 
need to reproduce a particular paper with no effort on their end. 

> 
> I think that extending the R release cycle to CRAN would result both
> in more stable released versions of R, as well as more freedom for
> package authors to implement rigorous change in the unstable branch.

Not sure what exactly you mean by this sentence. 

> When writing a script that is part of a production pipeline, or sweave
> paper that should be reproducible 10 years from now, or a book on
> using R, you use stable version of R, which is guaranteed to behave
> the same over time.

Only if you never upgrade anything... But that's the case already, isn't it?


> However when developing packages that should be
> compatible with the upcoming release of R, you use r-devel which has
> the latest versions of other CRAN and base packages.
> 
> 
>> What I'm trying to figure out is why the standard "install the following 
>> list of package versions" isn't good enough in your eyes?
> 
> Almost nobody does this because it is cumbersome and impractical. We
> can do so much better than this. Note that in order to install old
> packages you also need to investigate which versions of dependencies
> of those packages were used. On win/osx, users need to manually build
> those packages which can be a pain. All in all it makes reproducible
> research difficult and expensive and error prone. At the end of the
> day most published results obtain with R just won't be reproducible.

So you want CRAN to host old binaries ad infinitum? I think that's entirely 
reasonable/doable if (big if) storage and network are free. 

> 
> Also I believe that keeping it simple is essential for solutions to be
> practical. If every script has to be run inside an environment with
> custom libraries, it takes away much of its power. Running a bash or
> python script in Linux is so easy and reliable that entire
> distributions are based on it. I don't understand why we make our
> lives so difficult in R.

Because for a Debian-style (stop-the-world-on-release) distro, there are no 
upgrades within a release. And that's only halfway reasonable because of 
Debian's shockingly good QA. 

It's certainly not true for, e.g., Arch. 

I've been looking at python incompatibilities across different RHEL versions 
lately. There's simply no way to get around explicit version pinning (either by 
release number or date, but when you have many moving pieces, picking a set of 
release numbers is much easier than finding a single day when they all happened 
to work together) if it has to work exactly as it used to. 

> 
> In my estimation, a system where stable versions of R pull packages
> from a stable branch of CRAN will naturally resolve the majority of
> the reproducibility and reliability problems with R.

And what everyone else is saying is "if you want to reproduce results made with 
old software, download and use the old software." Both can be made to work -- 
it's just a matter of pros and cons of different defaults. 


> And in contrast
> to what some people here are suggesting it does not introduce any
> limitations. If you want to get the latest stuff, you either grab a
> copy of r-devel, or just enable the testing branch and off you go.
> Debian 'testing' works in a similar way, see
> http://www.debian.org/devel/testing.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Karl Millar
I think what you really want here is the ability to easily identify
and sync to CRAN snapshots.

The easy way to do this is setup a CRAN mirror, but back it up with
version control, so that it's easy to reproduce the exact state of
CRAN at any given point in time.  CRAN's not particularly large and
doesn't churn a whole lot, so most version control systems should be
able to handle that without difficulty.

Using svn, mod_dav_svn and (maybe) mod_rewrite, you could setup the
server so that e.g.:
   http://my.cran.mirror/repos/2013-01-01/
is a mirror of how CRAN looked at midnight 2013-01-01.

Users can then set their repository to that URL, and will have a
stable snapshot to work with, and can have all their packages built
with that snapshot if they like.  For reproducibility purposes, all
users need to do is to agree on the same date to use.  For publication
purposes, the date of the snapshot should be sufficient.

We'd need a version of update.packages() that force-syncs all the
packages to the version in the repository, even if they're downgrades,
but otherwise it ought to be fairly straightforward.
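
On the client side, pointing an R session at a snapshot would then be a
one-liner (the URL follows the hypothetical scheme above; it is not a
real service):

  options(repos = c(CRAN = "http://my.cran.mirror/repos/2013-01-01"))
  install.packages("zoo")  # resolves against the 2013-01-01 snapshot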

FWIW, we do something similar internally at Google.  All the packages
that a user has installed come from the same source control revision,
where we know that all the package versions are mutually compatible.
It saves a lot of headaches, and users can rollback to any previous
point in time easily if they run into problems.


On Wed, Mar 19, 2014 at 7:45 PM, Jeroen Ooms  wrote:
> On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
>  wrote:
>> Reading this thread again, is it a fair summary of your position to say 
>> "reproducibility by default is more important than giving users access to 
>> the newest bug fixes and features by default?" It's certainly arguable, but 
>> I'm not sure I'm convinced: I'd imagine that the ratio of new work being 
>> done vs reproductions is rather high and the current setup optimizes for 
>> that already.
>
> I think that separating development from released branches can give us
> both reliability/reproducibility (stable branch) as well as new
> features (unstable branch). The user gets to pick (and you can pick
> both!). The same is true for r-base: when using a 'released' version
> you get 'stable' base packages that are up to 12 months old. If you
> want to have the latest stuff you download a nightly build of r-devel.
> For regular users and reproducible research it is recommended to use
> the stable branch. However if you are a developer (e.g. package
> author) you might want to develop/test/check your work with the latest
> r-devel.
>
> I think that extending the R release cycle to CRAN would result both
> in more stable released versions of R, as well as more freedom for
> package authors to implement rigorous change in the unstable branch.
> When writing a script that is part of a production pipeline, or sweave
> paper that should be reproducible 10 years from now, or a book on
> using R, you use stable version of R, which is guaranteed to behave
> the same over time. However when developing packages that should be
> compatible with the upcoming release of R, you use r-devel which has
> the latest versions of other CRAN and base packages.
>
>
>> What I'm trying to figure out is why the standard "install the following 
>> list of package versions" isn't good enough in your eyes?
>
> Almost nobody does this because it is cumbersome and impractical. We
> can do so much better than this. Note that in order to install old
> packages you also need to investigate which versions of dependencies
> of those packages were used. On win/osx, users need to manually build
> those packages which can be a pain. All in all it makes reproducible
> research difficult and expensive and error prone. At the end of the
> day most published results obtain with R just won't be reproducible.
>
> Also I believe that keeping it simple is essential for solutions to be
> practical. If every script has to be run inside an environment with
> custom libraries, it takes away much of its power. Running a bash or
> python script in Linux is so easy and reliable that entire
> distributions are based on it. I don't understand why we make our
> lives so difficult in R.
>
> In my estimation, a system where stable versions of R pull packages
> from a stable branch of CRAN will naturally resolve the majority of
> the reproducibility and reliability problems with R. And in contrast
> to what some people here are suggesting it does not introduce any
> limitations. If you want to get the latest stuff, you either grab a
> copy of r-devel, or just enable the testing branch and off you go.
> Debian 'testing' works in a similar way, see
> http://www.debian.org/devel/testing.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/l

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread David Winsemius

On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote:

> On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
>  wrote:
>> Reading this thread again, is it a fair summary of your position to say 
>> "reproducibility by default is more important than giving users access to 
>> the newest bug fixes and features by default?" It's certainly arguable, but 
>> I'm not sure I'm convinced: I'd imagine that the ratio of new work being 
>> done vs reproductions is rather high and the current setup optimizes for 
>> that already.
> 
> I think that separating development from released branches can give us
> both reliability/reproducibility (stable branch) as well as new
> features (unstable branch). The user gets to pick (and you can pick
> both!). The same is true for r-base: when using a 'released' version
> you get 'stable' base packages that are up to 12 months old. If you
> want to have the latest stuff you download a nightly build of r-devel.
> For regular users and reproducible research it is recommended to use
> the stable branch. However if you are a developer (e.g. package
> author) you might want to develop/test/check your work with the latest
> r-devel.
> 
> I think that extending the R release cycle to CRAN would result both
> in more stable released versions of R, as well as more freedom for
> package authors to implement rigorous change in the unstable branch.
> When writing a script that is part of a production pipeline, or sweave
> paper that should be reproducible 10 years from now, or a book on
> using R, you use stable version of R, which is guaranteed to behave
> the same over time. However when developing packages that should be
> compatible with the upcoming release of R, you use r-devel which has
> the latest versions of other CRAN and base packages.


As I remember ... The example demonstrating the need for this was an XML 
package update that caused headers in a website extract to be 
misinterpreted as data in one version of pkg:XML and not in another. That seems 
fairly unconvincing. Data cleaning and validation is a basic task of data 
analysis. It also seems excessive to assert that it is the responsibility of 
CRAN to maintain a synced binary archive that will be available in ten years. 
Bug fixes would be inhibited for years, not unlike SAS and Excel. What next? 
Perhaps all bugs should be labeled as features?  Surely this CRAN-of-the-future 
would be offering something that no other statistical package currently offers, 
nicht wahr?

Why not leave it to the authors to specify which packages and version numbers 
were used in their publications? The authors of the packages would get 
recognition and the dependencies would be recorded.

-- 
David.
> 
> 
>> What I'm trying to figure out is why the standard "install the following 
>> list of package versions" isn't good enough in your eyes?
> 
> Almost nobody does this because it is cumbersome and impractical. We
> can do so much better than this. Note that in order to install old
> packages you also need to investigate which versions of dependencies
> of those packages were used. On win/osx, users need to manually build
> those packages which can be a pain. All in all it makes reproducible
> research difficult and expensive and error prone. At the end of the
> day most published results obtain with R just won't be reproducible.
> 
> Also I believe that keeping it simple is essential for solutions to be
> practical. If every script has to be run inside an environment with
> custom libraries, it takes away much of its power. Running a bash or
> python script in Linux is so easy and reliable that entire
> distributions are based on it. I don't understand why we make our
> lives so difficult in R.
> 
> In my estimation, a system where stable versions of R pull packages
> from a stable branch of CRAN will naturally resolve the majority of
> the reproducibility and reliability problems with R. And in contrast
> to what some people here are suggesting it does not introduce any
> limitations. If you want to get the latest stuff, you either grab a
> copy of r-devel, or just enable the testing branch and off you go.
> Debian 'testing' works in a similar way, see
> http://www.debian.org/devel/testing.
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

David Winsemius
Alameda, CA, USA

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-19 Thread Dan Tenenbaum


- Original Message -
> From: "David Winsemius" 
> To: "Jeroen Ooms" 
> Cc: "r-devel" 
> Sent: Wednesday, March 19, 2014 11:03:32 PM
> Subject: Re: [Rd] [RFC] A case for freezing CRAN
> 
> 
> On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote:
> 
> > On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
> >  wrote:
> >> Reading this thread again, is it a fair summary of your position
> >> to say "reproducibility by default is more important than giving
> >> users access to the newest bug fixes and features by default?"
> >> It's certainly arguable, but I'm not sure I'm convinced: I'd
> >> imagine that the ratio of new work being done vs reproductions is
> >> rather high and the current setup optimizes for that already.
> > 
> > I think that separating development from released branches can give
> > us
> > both reliability/reproducibility (stable branch) as well as new
> > features (unstable branch). The user gets to pick (and you can pick
> > both!). The same is true for r-base: when using a 'released'
> > version
> > you get 'stable' base packages that are up to 12 months old. If you
> > want to have the latest stuff you download a nightly build of
> > r-devel.
> > For regular users and reproducible research it is recommended to
> > use
> > the stable branch. However if you are a developer (e.g. package
> > author) you might want to develop/test/check your work with the
> > latest
> > r-devel.
> > 
> > I think that extending the R release cycle to CRAN would result
> > both
> > in more stable released versions of R, as well as more freedom for
> > package authors to implement rigorous change in the unstable
> > branch.
> > When writing a script that is part of a production pipeline, or
> > sweave
> > paper that should be reproducible 10 years from now, or a book on
> > using R, you use stable version of R, which is guaranteed to behave
> > the same over time. However when developing packages that should be
> > compatible with the upcoming release of R, you use r-devel which
> > has
> > the latest versions of other CRAN and base packages.
> 
> 
> As I remember ... The example demonstrating the need for this was an
> XML package that cause an extract from a website where the headers
> were misinterpreted as data in one version of pkg:XML and not in
> another. That seems fairly unconvincing. Data cleaning and
> validation is a basic task of data analysis. It also seems excessive
> to assert that it is the responsibility of CRAN to maintain a synced
> binary archive that will be available in ten years. 


CRAN already does this; the bin/windows/contrib directory has subdirectories 
going back to 1.7, with packages dated October 2004. I don't see why it is 
burdensome to continue to archive these. It would be nice if source versions 
had a similar archive.

Dan




> Bug fixes would
> be inhibited for years not unlike SAS and Excel. What next?
> Perhaps al bugs should be labeled as features?  Surely this
> CRAN-of-the-future would be offering something that no other
> statistical package currently offers, nicht wahr?
> 
> Why not leave it to the authors to specify the packages which version
> numbers were used in their publications. The authors of the packages
> would get recognition and the dependencies would be recorded.
> 
> --
> David.
> > 
> > 
> >> What I'm trying to figure out is why the standard "install the
> >> following list of package versions" isn't good enough in your
> >> eyes?
> > 
> > Almost nobody does this because it is cumbersome and impractical.
> > We
> > can do so much better than this. Note that in order to install old
> > packages you also need to investigate which versions of
> > dependencies
> > of those packages were used. On win/osx, users need to manually
> > build
> > those packages which can be a pain. All in all it makes
> > reproducible
> > research difficult and expensive and error prone. At the end of the
> > day most published results obtain with R just won't be
> > reproducible.
> > 
> > Also I believe that keeping it simple is essential for solutions to
> > be
> > practical. If every script has to be run inside an environment with
> > custom libraries, it takes away much of its power. Running a bash
> > or
> > python script in Linux is so easy and reliable that entire
> > distributions are based on it. I don't understand why we make our
> > lives so difficult in R.
> > 
> > In my estimation, a system where stable versions of R pull packages
> > from a stable branch of CRAN will naturally resolve the majority of
> > the reproducibility and reliability problems with R. And in
> > contrast
> > to what some people here are suggesting it does not introduce any
> > limitations. If you want to get the latest stuff, you either grab a
> > copy of r-devel, or just enable the testing branch and off you go.
> > Debian 'testing' works in a similar way, see
> > http://www.debian.org/devel/testing.
> > 
> > __
> > R-devel@r-project.org ma

Re: [Rd] modifying data in a package [a solution]

2014-03-19 Thread Ross Boylan
On Wed, 2014-03-19 at 19:22 -0700, Ross Boylan wrote:
> I've tweaked Rmpi and want to have some variables that hold data in the
> package.  One of the R files starts
> mpi.isend.obj <- vector("list", 500) #mpi.request.maxsize())  
> 
> mpi.isend.inuse <- rep(FALSE, 500) #mpi.request.maxsize())
> 
> and then functions update those variables with <<-.  When run:
>   Error in mpi.isend.obj[[i]] <<- .force.type(x, type) :  
>   
>   cannot change value of locked binding for 'mpi.isend.obj'
> 
> I'm writing to ask the proper way to accomplish this objective (getting
> a variable I can update in package namespace--or at least somewhere
> useful and hidden from the outside).
> 
I've discovered one way to do it:
In one of the regular R files
mpi.global <- new.env()

Then at the end of .onLoad in zzz.R:
assign("mpi.isend.obj", vector("list", mpi.request.maxsize()),
mpi.global)
and similarly for the logical vector mpi.isend.inuse

Access with functions like this:
## Next 2 functions have 3 modes:
##   foo()               returns foo from mpi.global
##   foo(request)        returns foo[request] from mpi.global
##   foo(request, value) set foo[request] to value
mpi.isend.inuse <- function(request, value) {
if (missing(request))
return(get("mpi.isend.inuse", mpi.global))
i <- request+1L
parent.env(mpi.global) <- environment()
if (missing(value))
return(evalq(mpi.isend.inuse[i], mpi.global))
return(evalq(mpi.isend.inuse[i] <- value, mpi.global))
}

# request, if present, must be a single value
mpi.isend.obj <- function(request, value){
if (missing(request))
return(get("mpi.isend.obj", mpi.global))
i <- request+1L
parent.env(mpi.global) <- environment()
if (missing(value))
return(evalq(mpi.isend.obj[[i]], mpi.global))
return(evalq(mpi.isend.obj[[i]] <- value, mpi.global))
}

This is pretty awkward; I'd love to know a better way.  Some of the
names probably should change too: mpi.isend.obj() sounds too much as if
it actually sends something, like mpi.isend.Robj().
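
One variant that might be less awkward: avoid the parent.env() trick by
copying the vector out of the environment, modifying it, and assigning
it back (at the cost of a copy):

  mpi.isend.inuse <- function(request, value) {
      inuse <- get("mpi.isend.inuse", envir = mpi.global)
      if (missing(request))
          return(inuse)
      i <- request + 1L
      if (missing(value))
          return(inuse[i])
      inuse[i] <- value
      assign("mpi.isend.inuse", inuse, envir = mpi.global)
      invisible(value)
  }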

Ross

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel