Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Rainer M Krug
Michael Weylandt  writes:

> On Mar 19, 2014, at 22:17, Gavin Simpson  wrote:
>
>> Michael,
>> 
>> I think the issue is that Jeroen wants to take that responsibility out
>> of the hands of the person trying to reproduce a work. If it used R
>> 3.0.x and packages A, B and C then it would be trivial to install
>> that version of R and then pull down the stable versions of A B and C
>> for that version of R. At the moment, one might note the packages used
>> and even their versions, but what about the versions of the packages
>> that the used packages rely upon & so on? What if developers don't
>> state known working versions of dependencies?
>
> Doesn't sessionInfo() give all of this?
>
> If you want to be very worried about every last bit, I suppose it
> should also include options(), compiler flags, compiler version, BLAS
> details, etc.  (Good talk on the dregs of a floating point number and
> how hard it is to reproduce them across processors
> http://www.youtube.com/watch?v=GIlp4rubv8U)

In principle yes - but this calls for a package which extracts the info
and stores it in a human-readable format, which can then be used to
re-install (automatically) all the versions for (hopefully)
reproducibility - because if external libraries are involved, you HAVE
problems.
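
To make that concrete, a minimal sketch of what such a helper could look like
(the function name record_versions is invented here, it is not an existing
package) - it writes the R version and the attached/loaded package versions to
a plain-text DCF file using only base functions:

    ## sketch only: record the session's package versions in a readable file
    record_versions <- function(file = "R-versions.dcf") {
      si   <- sessionInfo()
      pkgs <- c(si$otherPkgs, si$loadedOnly)   # attached packages + loaded namespaces
      info <- data.frame(
        Package = c("R", vapply(pkgs, `[[`, character(1), "Package")),
        Version = c(paste(si$R.version$major, si$R.version$minor, sep = "."),
                    vapply(pkgs, `[[`, character(1), "Version")),
        stringsAsFactors = FALSE
      )
      write.dcf(info, file = file)             # human readable, machine parsable
      invisible(info)
    }

Reading the file back with read.dcf() would give the reproducer a table of the
exact versions to hunt down - though, as said, external libraries are not
covered.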

>
>> 
>> The problem is how the heck do you know which versions of packages are
>> needed if developers don't record these dependencies in sufficient
>> detail? The suggested solution is to freeze CRAN at intervals
>> alongside R releases. Then you'd know what the stable versions were.
>
> Only if you knew which R release was used. 

Well - that would be easier to specify in a paper than the version info
of all packages needed - and which of the installed packages are actually
needed? OK - the ones specified in library() calls. But wait - there are
dependencies, imports, ... That is a lot of digging - I would not know how
to do this off the top of my head, except by digging through the
DESCRIPTION files of the packages...
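
That digging can at least be scripted with functions that already exist in base
R and the tools package - a rough sketch (the helper name deps_of_session is
made up for illustration):

    ## sketch: recursively resolve Depends/Imports/LinkingTo of the attached packages
    deps_of_session <- function() {
      attached <- names(sessionInfo()$otherPkgs)   # packages pulled in via library()
      db   <- available.packages()                 # current CRAN metadata
      deps <- tools::package_dependencies(attached, db = db,
                                          which = c("Depends", "Imports", "LinkingTo"),
                                          recursive = TRUE)
      sort(unique(c(attached, unlist(deps))))
    }

This gives the package names, but not the versions that were actually installed
at the time, which is exactly the part that still needs recording.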

>
>> 
>> Or we could just get package developers to be more thorough in
>> documenting dependencies. Or R CMD check could refuse to pass if a
>> package is listed as a dependency but with no version qualifiers. Or
>> have R CMD build add an upper bound (from the current, at build-time
>> version of dependencies on CRAN) if the package developer didn't
>> include an upper bound. Or... The first is unlikely to happen
>> consistently, and no-one wants *more* checks and hoops to jump through
>> :-)
>> 
>> To my mind it is incumbent upon those wanting reproducibility to build
>> the tools to enable users to reproduce works.
>
> But the tools already allow it with minimal effort. If the author
> can't even include session info, how can we be sure the version of R
> is known? If we can't know which version of R, can we ever change R at
> all? Etc. to absurdity.
>
> My (serious) point is that the tools are in place, but ramming them
> down folks' throats by intentionally keeping them on older versions by
> default is too much.
>
>> When you write a paper
>> or release a tool, you will have tested it with a specific set of
>> packages. It is relatively easy to work out what those versions are
>> (there are tools in R for this). What is required is an automated way
>> to record that info in an agreed upon way in an approved
>> file/location, and have a tool that facilitates setting up a package
>> library sufficient with which to reproduce a work. That approval
>> doesn't need to come from CRAN or R Core - we can store anything in
>> ./inst.
>
> I think the package version and published paper cases are different. 
>
> For the latter, the recipe is simple: if you want the same results,
> use the same software (as noted by sessionInfoPlus() or equiv)

Dependencies, imports, package versions, ... not that straightforward, I
would say.

>
> For the former, I think you start straying into this NP-complete problem: 
> http://people.debian.org/~dburrows/model.pdf 
>
> Yes, a good config can (and should) be recorded, but isn't that exactly what 
> sessionInfo() gives?
>
>> 
>> Reproducibility is a very important part of doing "science", but not
>> everyone using CRAN is doing that. Why force everyone to march to the
>> reproducibility drum? I would place the onus elsewhere to make this
>> work.
>> 
>
> Agreed: reproducibility is the onus of the author, not the reader

Exactly - but the onus is also on the authors of software that is aimed at
being used in the context of reproducibility - the tools should be there to
make it easy!

My points are:

1) I think the snapshot idea for CRAN is a good one which should be
followed.
2) The snapshots should be hosted at CRAN, as I assume that CRAN
will be there longer than any third-party repository.
3) The default for the user should *not* change, i.e. normal users will
always get the newest packages, as it is now.
4) If this can / will not be done because of workload, storage space,
... 

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Rainer M Krug
Hadley Wickham  writes:

>> What would be more useful in terms of reproducibility is the capability of
>> installing a specific version of a package from a repository using
>> install.packages(), which would require archiving older versions in a
>> coordinated fashion. I know CRAN archives old versions, but I am not aware
>> of whether we can programmatically query the repository about this.
>
> See devtools::install_version().
>
> The main caveat is that you also need to be able to build the package,
> and ensure you have dependencies that work with that version.

Compilation will always be a problem when using older source
packages, whatever is done.

But for the dependencies: automatic parsing of the dependency fields
(Depends, Imports, ...) would help a lot. 

Add to that a command which scans the packages loaded in the session and
stores them in a parsable, human-readable format, so that all required
packages (at the specified versions) can be installed with one command, and
I think the problem would be much closer to being solved.
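
A rough sketch of that one-command reinstall step, assuming devtools is
available and that the versions were previously stored in a two-column CSV
(Package, Version) - the file name is made up for illustration:

    ## sketch: reinstall the recorded versions one by one via devtools
    versions <- read.csv("package-versions.csv", stringsAsFactors = FALSE)
    for (i in seq_len(nrow(versions))) {
      devtools::install_version(versions$Package[i],
                                version = versions$Version[i])
    }

As Hadley notes above, this still assumes the old versions can be built from
source and that their own dependencies resolve to compatible versions.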

Rainer

>
> Hadley

-- 
Rainer M. Krug
email: Rainerkrugsde
PGP: 0x0F52F982


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Time format in parameters

2014-03-20 Thread Cathy Lee Gierke
Hi,

I'm having trouble building the help file for my package.  One of the
parameters is a time format, and the % seems to blow things up when I do a
build.

I have tried \%, \\%, and as many different things as I can think of.  But
everything after the % disappears -- it doesn't show up in the help file.

Here is the line I'm having trouble with.
\usage{
CATCosinor(TimeCol=2, Y=c(4,6,7), Components=3, window="noFilt",
  RefDateTime="20130203", timeFormat="%Y%m%d%H%M",
  RangeDateTime=list(Start=0, End=0), Units='Hour', dt=0,
  Progressive=list(Span=0, Increment=0),
  Period=list(Set=c(24,12,8), Start=0, Increment=1, End=0), header=F, Skip=0,
  Colors="BW", Graphics="pdf",
  Output=list(Txt=F, Dat=T, TxtAmp=F, Doc=T, HeatMap=F, LineGraphs=F),
  Console=F, Debug=FALSE, fileName=fileName, functionName='CallCos3varProg')
}

Thanks for your help!
Cathy Lee Gierke


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Duncan Murdoch

On 14-03-20 2:15 AM, Dan Tenenbaum wrote:



- Original Message -

From: "David Winsemius" 
To: "Jeroen Ooms" 
Cc: "r-devel" 
Sent: Wednesday, March 19, 2014 11:03:32 PM
Subject: Re: [Rd] [RFC] A case for freezing CRAN


On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote:


On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
 wrote:

Reading this thread again, is it a fair summary of your position
to say "reproducibility by default is more important than giving
users access to the newest bug fixes and features by default?"
It's certainly arguable, but I'm not sure I'm convinced: I'd
imagine that the ratio of new work being done vs reproductions is
rather high and the current setup optimizes for that already.


I think that separating development from released branches can give us
both reliability/reproducibility (stable branch) as well as new
features (unstable branch). The user gets to pick (and you can pick
both!). The same is true for r-base: when using a 'released' version
you get 'stable' base packages that are up to 12 months old. If you
want to have the latest stuff you download a nightly build of r-devel.
For regular users and reproducible research it is recommended to use
the stable branch. However if you are a developer (e.g. package
author) you might want to develop/test/check your work with the latest
r-devel.

I think that extending the R release cycle to CRAN would result both
in more stable released versions of R, as well as more freedom for
package authors to implement rigorous change in the unstable branch.
When writing a script that is part of a production pipeline, or a Sweave
paper that should be reproducible 10 years from now, or a book on
using R, you use a stable version of R, which is guaranteed to behave
the same over time. However when developing packages that should be
compatible with the upcoming release of R, you use r-devel which has
the latest versions of other CRAN and base packages.



As I remember ... The example demonstrating the need for this was an
XML package that caused an extract from a website where the headers
were misinterpreted as data in one version of pkg:XML but not in
another. That seems fairly unconvincing. Data cleaning and
validation is a basic task of data analysis. It also seems excessive
to assert that it is the responsibility of CRAN to maintain a synced
binary archive that will be available in ten years.



CRAN already does this; the bin/windows/contrib directory has subdirectories 
going back to 1.7, with packages dated October 2004. I don't see why it is 
burdensome to continue to archive these. It would be nice if source versions 
had a similar archive.


The bin/windows/contrib directories are updated every day for active R 
versions.  It's only when Uwe decides that a version is no longer worth 
active support that he stops doing updates, and it "freezes".  A 
consequence of this is that the snapshots preserved in those older 
directories are unlikely to match what someone who keeps up to date with 
R releases is using.  Their purpose is to make sure that those older 
versions aren't completely useless, but they aren't what Jeroen was 
asking for.


Karl Millar's suggestion seems like an ideal solution to this problem. 
Any CRAN mirror could implement it.  If someone sets this up and commits 
to maintaining it, I'd be happy to work on the necessary changes to the 
install.packages/update.packages code to allow people to use it from 
within R.


Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Roger Bivand
Gavin Simpson  writes:

> 
...
> 
> 
> To my mind it is incumbent upon those wanting reproducibility to build
> the tools to enable users to reproduce works. When you write a paper
> or release a tool, you will have tested it with a specific set of
> packages. It is relatively easy to work out what those versions are
> (there are tools in R for this). What is required is an automated way
> to record that info in an agreed upon way in an approved
> file/location, and have a tool that facilitates setting up a package
> library sufficient with which to reproduce a work. That approval
> doesn't need to come from CRAN or R Core - we can store anything in
> ./inst.

Gavin,

Thanks for contributing useful insights. With reference to Jeroen's proposal
and the discussion so far, I can see where the problem lies, but the
proposed solutions are very invasive. What might offer a less invasive
resolution is a robust and predictable schema for sessionInfo()
content, permitting ready parsing, so that (using Hadley's interjection) the
reproducer could reconstruct the original execution environment at least as
far as R and package versions are concerned.

In fact, I'd argue that the responsibility for securing reproducibility lies
with the originating author or organisation, so work where
reproducibility is desired should include such a standardised record. 

There is an additional problem, not addressed directly in this thread but
mentioned in some contributions, upstream of R: the external dependencies
and compilers, and beyond those the hardware. So raising consciousness about
the importance of being able to query version information to enable
reproducibility is important.

Next, encapsulating the information permitting its parsing would perhaps
enable the original execution environment to be reconstructed locally by
installing external dependencies, then R, then packages from source, using
the same versions of build-chain components if possible (and noting
mismatches if not). Maybe resurrect StatDataML in addition to RData
serialization of the version dependencies? Of course, current R and package
versions may provide reproducibility, but if they don't, one would use the
parseable record of the original development environment.
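
As a sketch of what one standardised, parseable record could look like, written
and read back with base tools only (the field names and file name are invented
here, not an agreed schema):

    ## sketch: serialise the execution environment to a DCF record
    si   <- sessionInfo()
    pkgs <- c(si$otherPkgs, si$loadedOnly)
    rec  <- c(R.version = si$R.version$version.string,
              platform  = si$platform,
              locale    = Sys.getlocale(),
              setNames(vapply(pkgs, `[[`, character(1), "Version"),
                       paste0("package.", names(pkgs))))
    write.dcf(rbind(rec), "execution-environment.dcf")

    ## the reproducer can parse the same file back into a one-row character matrix
    parsed <- read.dcf("execution-environment.dcf")

External dependencies and compiler versions would still have to be added by
hand, since sessionInfo() does not record them.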

> 
> Reproducibility is a very important part of doing "science", but not
> everyone using CRAN is doing that. Why force everyone to march to the
> reproducibility drum? I would place the onus elsewhere to make this
> work.

Exactly.

Roger

> 
> Gavin
> A scientist, very much interested in reproducibility of my work and others.
> 
...
> >
> > __
> > R-devel  r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread S Ellison
>  If we could all agree on a particular set
> of cran packages to be used with a certain release of R, then it doesn't 
> matter
> how the 'snapshotting' gets implemented.

This is pretty much the sticking point, though. I see no practical way of 
reaching that agreement without the kind of decision authority (and effort) 
that Linux distro maintainers put into the internal consistency of each 
distribution.

CRAN doesn't try to do that; it's just a place to access packages offered by 
maintainers. 

As a package maintainer, I think support for critical version dependencies in 
the imports or dependency lists is a good idea that individual package 
maintainers could relatively easily manage, but I think freezing CRAN as a 
whole or adopting single release cycles for CRAN would be thoroughly 
impractical.

S Ellison






__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] The case for freezing CRAN

2014-03-20 Thread Therneau, Terry M., Ph.D.

There is a central assertion to this argument that I don't follow:


At the end of the day most published results obtained with R just won't be 
reproducible.


This is a very strong assertion. What is the evidence for it?

I write a lot of Sweave/knitr in-house as a way of documenting complex analyses, and a 
glm()-based logistic regression looks the same yesterday as it will tomorrow.


Terry Therneau

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] The case for freezing CRAN

2014-03-20 Thread Michael Weylandt
On Mar 20, 2014, at 8:19, "Therneau, Terry M., Ph.D."  wrote:

> There is a central assertion to this argument that I don't follow:
> 
>> At the end of the day most published results obtained with R just won't be 
>> reproducible.
> 
> This is a very strong assertion. What is the evidence for it?

If I've understood Jeroen correctly, his point might be alternatively phrased 
as "won't be reproducED" (i.e., end user difficulties, not software 
availability).

Michael

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] The case for freezing CRAN

2014-03-20 Thread Therneau, Terry M., Ph.D.



On 03/20/2014 07:48 AM, Michael Weylandt wrote:

On Mar 20, 2014, at 8:19, "Therneau, Terry M., Ph.D."  wrote:


There is a central assertion to this argument that I don't follow:


At the end of the day most published results obtained with R just won't be 
reproducible.


This is a very strong assertion. What is the evidence for it?


If I've understood Jeroen correctly, his point might be alternatively phrased as 
"won't be reproducED" (i.e., end user difficulties, not software availability).

Michael



That was my point as well.  Of the 30+ Sweave documents that I've produced I can't think 
of one that will change its output with a new version of R.  My 0/30 estimate is at odds 
with the "nearly all" assertion.  Perhaps I only do dull things?


Terry T.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] The case for freezing CRAN

2014-03-20 Thread Kevin Coombes


On 3/20/2014 9:00 AM, Therneau, Terry M., Ph.D. wrote:



On 03/20/2014 07:48 AM, Michael Weylandt wrote:
On Mar 20, 2014, at 8:19, "Therneau, Terry M., Ph.D." 
 wrote:



There is a central assertion to this argument that I don't follow:

At the end of the day most published results obtained with R just 
won't be reproducible.


This is a very strong assertion. What is the evidence for it?


If I've understood Jeroen correctly, his point might be alternatively 
phrased as "won't be reproducED" (i.e., end user difficulties, not 
software availability).


Michael



That was my point as well.  Of the 30+ Sweave documents that I've 
produced I can't think of one that will change its output with a new 
version of R.  My 0/30 estimate is at odds with the "nearly all" 
assertion.  Perhaps I only do dull things?


Terry T.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


The only concrete example that comes to mind from my own Sweave reports 
was actually caused by Bioconductor and not CRAN. I had a set of 
analyses that used DNAcopy, and the results changed substantially with a 
new release of the package in which they changed the default values of 
the main function call.  As a result, I've taken to writing out more of 
the defaults that I previously just accepted.  There have been a few 
minor issues similar to this one (with changes to parts of the Mclust 
package ??). So my estimates are somewhat higher than 0/30 but are still 
a long way from "almost all".


Kevin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] The case for freezing CRAN

2014-03-20 Thread Dirk Eddelbuettel

No attempt to summarize the thread, but a few highlighted points:

 o Karl's suggestion of versioned / dated access to the repo by adding a
   layer to web access is (as usual) nice.  It works on the 'supply' side. But
   Jeroen's problem is on the demand side.  Even when we know that an
   analysis was done on 20xx-yy-zz, and we reconstruct CRAN that day, it only
   gives us a 'ceiling' estimate of what was on the machine.  In production
   or lab environments, installations get stale.  Maybe packages were already
   a year old?  To me, this is an issue that needs to be addressed on the
   'demand' side of the user. But just writing out version numbers is not
   good enough.

 o Roger correctly notes that R scripts and packages are just one issue.
   Compilers, libraries and the OS matter.  To me, the natural approach these
   days would be to think of something based on Docker or Vagrant or (if you
   must, VirtualBox).  The newer alternatives make snapshotting very cheap
   (eg by using Linux LXC).  That approach reproduces a full environment as
   best as we can while still ignoring the hardware layer (and some readers
   may recall the infamous Pentium bug of two decades ago).

 o Reproducibility will probably remain the responsibility of study
   authors. If an investigator on a mega-grant wants to (or needs to) freeze,
   they do have the tools now.  Requiring the need of a few to push work on
   those already overloaded (ie CRAN) and changing the workflow of everybody
   is a non-starter.

 o As Terry noted, Jeroen made some strong claims about exactly how flawed
   the existing system is and keeps coming back to the example of 'a JSS
   paper that cannot be re-run'.  I would really like to see empirics on
   this.  Studies of reproducibility appear to be publishable these days, so
   maybe some enterprising grad student wants to run with the idea of
   actually _testing_ this.  We may be above Terry's 0/30 and nearer to
   Kevin's 'low'/30.  But let's bring some data to the debate.

 o Overall, I would tend to think that our CRAN standards of releasing with
   tests, examples, and checks on every build and release already do a much
   better job of keeping things tidy and workable than most if not all
   other related / similar open source projects. I would of course welcome
   contradictory examples.

Dirk
 
-- 
Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Jari Oksanen

On 20/03/2014, at 14:14 PM, S Ellison wrote:

>> If we could all agree on a particular set
>> of cran packages to be used with a certain release of R, then it doesn't 
>> matter
>> how the 'snapshotting' gets implemented.
> 
> This is pretty much the sticking point, though. I see no practical way of 
> reaching that agreement without the kind of decision authority (and effort) 
> that Linux distro maintainers put in to the internal consistency of each 
> distribution.
> 
> CRAN doesn't try to do that; it's just a place to access packages offered by 
> maintainers. 
> 
> As a package maintainer, I think support for critical version dependencies in 
> the imports or dependency lists is a good idea that individual package 
> maintainers could relatively easily manage, but I think freezing CRAN as a 
> whole or adopting single release cycles for CRAN would be thoroughly 
> impractical.
> 

I have a feeling that this discussion has floated between two different 
arguments in favour of freezing: discontent with package authors who break 
their packages within the R release cycle, and the ability to reproduce old results. In 
the beginning the first argument was more prominent, but now the discussion has 
drifted to reproducing old results. 

I cannot see how freezing CRAN would help with package authors who do not 
separate development and CRAN release branches but introduce broken code, or 
code that breaks other packages. Freezing a broken snapshot would only mean 
that the situation cannot be cured before the next R release, and then new 
breakage could be introduced. The result would be a dysfunctional CRAN. I 
think that quite a few of the package updates are bug fixes and minor 
enhancements. Further, I do think that these should be "backported" to 
previous versions of R: users of previous versions of R should also benefit 
from bug fixes. This also is the current CRAN policy and I think this is a 
good policy. Personally, I try to keep my packages in such a condition that 
they will also work in previous versions of R so that people do not need to 
upgrade R to have bug fixes in packages. 

The policy is the same with Linux maintainers: they do not just build a 
consistent release, but maintain the release by providing bug fixes. In Linux 
distributions, end of life equals freezing, or not providing new versions of 
software.

Another issue is reproducing old analyses. This is a valuable thing, and 
sessionInfo and the ability to get certain versions of packages certainly are 
steps forward. It looks like guaranteed reproduction is a hard task, though. 
For instance, R 2.14.2 is the oldest version of R that I can build out of the 
box on my Linux desktop. I have earlier built older, even much older, R 
versions, but something has happened in my OS that crashes the build process. 
To reproduce an old analysis, I would also have to install an older version of 
my OS, then build old R and then get the old versions of packages. It would be 
nice if the last step were made easier.

Cheers, Jari Oksanen

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Memcheck: Invalid read of size 4

2014-03-20 Thread Christophe Genolini

Thanks a lot. Your correction works just fine.

Any idea of what goes wrong on line 151, which is

   int *clusterAffectation2 = malloc(*nbInd * sizeof(int));   // line 151






On 19 Mar 2014, at 22:58 , Christophe Genolini  wrote:


Hi the list,

One of my packages has a memory issue that I do not manage to understand. The
Memtest notes are here:


Here is the message that I get from Memtest

--- 8< 
~ Fast KmL ~
==27283== Invalid read of size 4
==27283==at 0x10C5DF28: kml1 (kml.c:183)
...
==27283==by 0x10C5DE4F: kml1 (kml.c:151)
...
==27283==at 0x10C5DF90: kml1 (kml.c:198)
--- 8< 


Here is the function kml1 from the file kml.c (I added some comments to tag
lines 151, 183 and 198)

--- 8< 
void kml1(double *traj, int *nbInd, int *nbTime, int *nbClusters, int *maxIt,
          int *clusterAffectation1, int *convergenceTime){

    int i=0, iter=0;
    int *clusterAffectation2 = malloc(*nbInd * sizeof(int));            // line 151
    double *trajMean = malloc(*nbClusters * *nbTime * sizeof(double));

    for(i = 0; i < *nbClusters * *nbTime; i++){trajMean[i] = 0.0;};
    for(i = 0; i < *nbInd; i++){clusterAffectation2[i] = 0;};

    for(iter = 0; iter < *maxIt; iter+=2){
        calculMean(traj,nbInd,nbTime,clusterAffectation1,nbClusters,trajMean);
        affecteIndiv(traj,nbInd,nbTime,trajMean,nbClusters,clusterAffectation2);

        i = 0;
        while(clusterAffectation1[i]==clusterAffectation2[i] && i<*nbInd){i++;}; // line 183
        if(i == *nbInd){
            *convergenceTime = iter + 1;
            break;
        }else{};

        calculMean(traj,nbInd,nbTime,clusterAffectation2,nbClusters,trajMean);
        affecteIndiv(traj,nbInd,nbTime,trajMean,nbClusters,clusterAffectation1);

        i = 0;
        while(clusterAffectation1[i]==clusterAffectation2[i] && i<*nbInd){i++;}; // line 198
        if(i == *nbInd){
            *convergenceTime = iter + 2;
            break;
        }else{};
    }
}
--- 8< 

Do you know what is wrong in my C code?

Yes. You need to reverse the operands of &&. Otherwise you'll be indexing with
i == *nbInd before finding that (i < *nbInd) is false.


Thanks

Christophe

--
Christophe Genolini
Maître de conférences en bio-statistique
Université Paris Ouest Nanterre La Défense
INSERM UMR 1027

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



--
Christophe Genolini
Maître de conférences en bio-statistique
Université Paris Ouest Nanterre La Défense
INSERM UMR 1027

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Memcheck: Invalid read of size 4

2014-03-20 Thread peter dalgaard

On 20 Mar 2014, at 16:56 , Christophe Genolini  wrote:

> Thanks a lot. Your correction works just fine.
> 
> Any idea of what goes wrong for the line 151, which is
> 
>   int *clusterAffectation2=malloc(*nbInd * sizeof(int));  
> // lines 151
> 

Nothing. It's just that memcheck marks the point of allocation for you: There's 
a discrepancy between what you allocate and what you access, but it can't 
really tell whether the allocation was too short or the access steps past the 
end.

-pd

> 
> 
> 
>> On 19 Mar 2014, at 22:58 , Christophe Genolini  wrote:
>> 
>>> Hi the list,
>>> 
>>> One of my package has a memory issue that I do not manage to understand. 
>>> The Memtest notes is here:
>>> 
>>> 
>>> Here is the message that I get from Memtest
>>> 
>>> --- 8< 
>>> ~ Fast KmL ~
>>> ==27283== Invalid read of size 4
>>> ==27283==at 0x10C5DF28: kml1 (kml.c:183)
>>> ...
>>> ==27283==by 0x10C5DE4F: kml1 (kml.c:151)
>>> ...
>>> ==27283==at 0x10C5DF90: kml1 (kml.c:198)
>>> --- 8< 
>>> 
>>> 
>>> Here is the function kml1 from the file kml.c (I add some comments to tag 
>>> the lines 151, 183 and 198)
>>> 
>>> --- 8< 
>>> void kml1(double *traj, int *nbInd, int *nbTime, int *nbClusters, int 
>>> *maxIt, int *clusterAffectation1, int *convergenceTime){
>>> 
>>>int i=0,iter=0;
>>>int *clusterAffectation2=malloc(*nbInd * sizeof(int));   
>>>// lines 151
>>>double *trajMean=malloc(*nbClusters * *nbTime * sizeof(double));
>>> 
>>>for(i = 0; i < *nbClusters * *nbTime; i++){trajMean[i] = 0.0;};
>>>for(i = 0; i < *nbInd; i++){clusterAffectation2[i] = 0;};
>>> 
>>>for(iter = 0; iter < *maxIt; iter+=2){
>>> calculMean(traj,nbInd,nbTime,clusterAffectation1,nbClusters,trajMean);
>>> 
>>> affecteIndiv(traj,nbInd,nbTime,trajMean,nbClusters,clusterAffectation2);
>>> 
>>> i = 0;
>>> while(clusterAffectation1[i]==clusterAffectation2[i] && i 
>>> <*nbInd){i++;}; // lines 183
>>> if(i == *nbInd){
>>> *convergenceTime = iter + 1;
>>> break;
>>> }else{};
>>> 
>>> calculMean(traj,nbInd,nbTime,clusterAffectation2,nbClusters,trajMean);
>>> affecteIndiv(traj,nbInd,nbTime,trajMean,nbClusters,clusterAffectation1);
>>> 
>>> i = 0;
>>> while(clusterAffectation1[i]==clusterAffectation2[i] && 
>>> i<*nbInd){i++;}; // lines 198
>>> if(i == *nbInd){
>>> *convergenceTime = iter + 2;
>>> break;
>>> }else{};
>>>}
>>> }
>>> --- 8< 
>>> 
>>> Do you know what is wrong in my C code?
>> Yes. You need to reverse operands of &&. Otherwise you'll be indexing with 
>> i==*nbind before finding that (i < *nbind) is false.
>> 
>>> Thanks
>>> 
>>> Christophe
>>> 
>>> -- 
>>> Christophe Genolini
>>> Maître de conférences en bio-statistique
>>> Université Paris Ouest Nanterre La Défense
>>> INSERM UMR 1027
>>> 
>>> __
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 
> -- 
> Christophe Genolini
> Maître de conférences en bio-statistique
> Université Paris Ouest Nanterre La Défense
> INSERM UMR 1027

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] The case for freezing CRAN

2014-03-20 Thread Greg Snow
On Thu, Mar 20, 2014 at 7:32 AM, Dirk Eddelbuettel  wrote:
[snip]

>  (and some readers
>may recall the infamous Pentium bug of two decades ago).

It was a "Flaw" not a "Bug".  At least I remember the Intel people
making a big deal about that distinction.

But I do remember the time well, I was a biostatistics Ph.D. student
at the time and bought one of the flawed Pentiums.  My attempts at
getting the chip replaced resulted in a major runaround, and each
person that I talked to would first try to explain that I really did
not need the fix because the only people likely to be affected were
large corporations and research scientists.  I will admit that I was
not a large corporation, but if a Ph.D. student in biostatistics is
not a research scientist, then I did not know what they defined one
as.  When I pointed this out they would usually then say that it still
would not matter, unless I did a few thousand floating point
operations I was unlikely to encounter one of the problematic
divisions.  I would then point out that some days I did over 10,000
floating point operations before breakfast (I had checked after the
1st person told me this and 10,000 was a low estimate of a lower bound
of one set of simulations) at which point they would admit that I had
a case and then send me to talk to someone else who would start the
process over.



[snip]
> --
> Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



-- 
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] The case for freezing CRAN

2014-03-20 Thread Carl Boettiger
There seems to be some question of how frequently changes to software
packages result in irreproducible results.

I am sure Terry is correct that research using functions like `glm` and
other functions that are shipped with base R are quite reliable; and after
all they already benefit from being versioned with R releases as Jeroen
argues.

In my field of ecology and evolution, the situation is quite different.
 Packages are frequently developed by scientists without any background in
programming and become widely used, such as [geiger](
http://cran.r-project.org/web/packages/geiger/), with 463 papers citing it
and probably many more using it that do not cite it (both because it is
sometimes used only as a dependency of another package or just because our
community isn't great at citing packages).  The package has changed
substantially over the time it has been on CRAN and many functions that
would once run based on older versions could no longer run on newer ones.
Its dependencies, notably the phylogenetics package ape, have changed
continually over that interval with both bug fixes and substantial changes
to the basic data structure.  The ape package has 1,276 citations (again a
lower bound).  I suspect that correctly identifying the right version of
the software used in any of these thousands of papers would prove difficult
and for a large fraction the results would simply not execute successfully.
It would be much harder to track down cases where the bug fixes would have
any impact on the result.  I have certainly seen both problems in the
hundreds of Sweave/knitr files I have produced over the years that use
these packages.

Even work that simply relies on a package that has been archived becomes a
substantial challenge to reproducibility by other scientists even when an
expert familiar with the packages (e.g. the original author) would not have
a problem, as the informatics team at the Evolutionary Synthesis center
recently concluded in an exercise trying to reproduce several papers
including my own that used a package that had been archived (odesolve,
whose replacement, deSolve, does not use quite the same function call for
the same `lsoda` function).

New methods are being published all the time, and I think it is excellent
that in ecology and evolution it is increasingly standard to publish R
packages implementing those methods, as a scan of any table of contents in
"methods in Ecology and Evolution", for instance, will quickly show.  But
unlike `glm`, these methods have a long way to go before they are fully
tested and debugged, and reproducing any work based on them requires a
close eye to the versions (particularly when unit tests and even detailed
changelogs are not common). The methods are invariably built by
"user-developers", researchers developing the code for their own needs, and
thus these packages can themselves fall afoul of changes as they depend and
build upon work of other nascent ecology and evolution packages.

Detailed reproducibility studies of published work in this area are still
hard to come by, not least because the actual code used by the researchers
is seldom published (other than when it is published as its own R
package).  But incompatibilities between successive versions of the 100s of
packages in our domain, along with the interdependencies of those packages
might provide some window into the difficulties of computational
reproducibility.  I suspect changes in these fast-moving packages are far
more culprit than differences in compilers and operating systems.

Cheers,

Carl








On Thu, Mar 20, 2014 at 10:23 AM, Greg Snow <538...@gmail.com> wrote:

> On Thu, Mar 20, 2014 at 7:32 AM, Dirk Eddelbuettel  wrote:
> [snip]
>
> >  (and some readers
> >may recall the infamous Pentium bug of two decades ago).
>
> It was a "Flaw" not a "Bug".  At least I remember the Intel people
> making a big deal about that distinction.
>
> But I do remember the time well, I was a biostatistics Ph.D. student
> at the time and bought one of the flawed pentiums.  My attempts at
> getting the chip replaced resulted in a major run around and each
> person that I talked to would first try to explain that I really did
> not need the fix because the only people likely to be affected were
> large corporations and research scientists.  I will admit that I was
> not a large corporation, but if a Ph.D. student in biostatistics is
> not a research scientist, then I did not know what they defined one
> as.  When I pointed this out they would usually then say that it still
> would not matter, unless I did a few thousand floating point
> operations I was unlikely to encounter one of the problematic
> divisions.  I would then point out that some days I did over 10,000
> floating point operations before breakfast (I had checked after the
> 1st person told me this and 10,000 was a low estimate of a lower bound
> of one set of simulations) at which point they would admit that I had
> a case and then

Re: [Rd] The case for freezing CRAN

2014-03-20 Thread Marc Schwartz

On Mar 20, 2014, at 12:23 PM, Greg Snow <538...@gmail.com> wrote:

> On Thu, Mar 20, 2014 at 7:32 AM, Dirk Eddelbuettel  wrote:
> [snip]
> 
>> (and some readers
>>   may recall the infamous Pentium bug of two decades ago).
> 
> It was a "Flaw" not a "Bug".  At least I remember the Intel people
> making a big deal about that distinction.
> 
> But I do remember the time well, I was a biostatistics Ph.D. student
> at the time and bought one of the flawed pentiums.  My attempts at
> getting the chip replaced resulted in a major run around and each
> person that I talked to would first try to explain that I really did
> not need the fix because the only people likely to be affected were
> large corporations and research scientists.  I will admit that I was
> not a large corporation, but if a Ph.D. student in biostatistics is
> not a research scientist, then I did not know what they defined one
> as.  When I pointed this out they would usually then say that it still
> would not matter, unless I did a few thousand floating point
> operations I was unlikely to encounter one of the problematic
> divisions.  I would then point out that some days I did over 10,000
> floating point operations before breakfast (I had checked after the
> 1st person told me this and 10,000 was a low estimate of a lower bound
> of one set of simulations) at which point they would admit that I had
> a case and then send me to talk to someone else who would start the
> process over.


Further segue:

That (1994) was a watershed moment for Intel as a company, a time during which 
Intel's future was quite literally at stake. Intel's internal response to that 
debacle, which fundamentally altered their own perception of just who their 
customer was (the OEMs like IBM, COMPAQ and Dell versus the end users like 
us), took time to be realized, as the impact of increasingly negative PR took 
hold. It was also a good example of the impact of public perception (a flawed 
product) versus the realities of how infrequently the flaw would be observed in 
"typical" computing. "Perception is reality", as some would observe.

Intel ultimately spent somewhere in the neighborhood of $500 million (in 1994 
U.S. dollars), as I recall, to implement a large scale Pentium chip replacement 
infrastructure targeted to end users. The "Intel Inside" marketing campaign was 
also an outgrowth of that time period.

Regards,

Marc Schwartz


> [snip]
>> --
>> Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com
>> 
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 
> 
> -- 
> Gregory (Greg) L. Snow Ph.D.
> 538...@gmail.com
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] The case for freezing CRAN

2014-03-20 Thread Karl Millar
Given the version / dated snapshots of CRAN, and an agreement that
reproducibility is the responsibility of the study author, the author
simply needs to sync all their packages to a chosen date, run the analysis
and publish the chosen date.  It is true that this doesn't include
compilers, OS, system packages etc, but in my experience those are
significantly more stable than CRAN packages.


Also, my previous description of how to serve up a dated CRAN was way too
complicated.  Since most of the files on CRAN never change, they don't need
version control.  Only the metadata about which versions are current really
needs to be tracked, and that's small enough that it could be stored in
static files.
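
From the user's side, such a dated view could be consumed with nothing more
than a repos option - a purely hypothetical sketch, since no such URL scheme
exists on CRAN today:

    ## sketch: point install.packages() at a dated snapshot of a mirror
    snapshot <- "2014-03-20"
    options(repos = c(CRAN = paste0("http://cran.example.org/snapshot/", snapshot)))
    install.packages("Matrix")   # would then fetch the version current on that date

Publishing the snapshot date alongside the paper would then be enough to point
readers at the matching package set.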




On Thu, Mar 20, 2014 at 6:32 AM, Dirk Eddelbuettel  wrote:

>
> No attempt to summarize the thread, but a few highlighted points:
>
>  o Karl's suggestion of versioned / dated access to the repo by adding a
>layer to webaccess is (as usual) nice.  It works on the 'supply' side.
> But
>Jeroen's problem is on the demand side.  Even when we know that an
>analysis was done on 20xx-yy-zz, and we reconstruct CRAN that day, it
> only
>gives us a 'ceiling' estimate of what was on the machine.  In production
>or lab environments, installations get stale.  Maybe packages were
> already
>a year old?  To me, this is an issue that needs to be addressed on the
>'demand' side of the user. But just writing out version numbers is not
>good enough.
>
>  o Roger correctly notes that R scripts and packages are just one issue.
>Compilers, libraries and the OS matter.  To me, the natural approach
> these
>days would be to think of something based on Docker or Vagrant or (if
> you
>must, VirtualBox).  The newer alternatives make snapshotting very cheap
>(eg by using Linux LXC).  That approach reproduces a full environemnt as
>best as we can while still ignoring the hardware layer (and some readers
>may recall the infamous Pentium bug of two decades ago).
>
>  o Reproduciblity will probably remain the responsibility of study
>authors. If an investigator on a mega-grant wants to (or needs to)
> freeze,
>they do have the tools now.  Requiring the need of a few to push work on
>those already overloaded (ie CRAN) and changing the workflow of
> everybody
>is a non-starter.
>
>  o As Terry noted, Jeroen made some strong claims about exactly how flawed
>the existing system is and keeps coming back to the example of 'a JSS
>paper that cannot be re-run'.  I would really like to see empirics on
>this.  Studies of reproducibility appear to be publishable these days,
> so
>maybe some enterprising grad student wants to run with the idea of
>actually _testing_ this.  We maybe be above Terry's 0/30 and nearer to
>Kevin's 'low'/30.  But let's bring some data to the debate.
>
>  o Overall, I would tend to think that our CRAN standards of releasing with
>tests, examples, and checks on every build and release already do a much
>better job of keeping things tidy and workable than in most if not all
>other related / similar open source projects. I would of course welcome
>contradictory examples.
>
> Dirk
>
> --
> Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Hervé Pagès

On 03/20/2014 03:52 AM, Duncan Murdoch wrote:

On 14-03-20 2:15 AM, Dan Tenenbaum wrote:



- Original Message -

From: "David Winsemius" 
To: "Jeroen Ooms" 
Cc: "r-devel" 
Sent: Wednesday, March 19, 2014 11:03:32 PM
Subject: Re: [Rd] [RFC] A case for freezing CRAN


On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote:


On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
 wrote:

Reading this thread again, is it a fair summary of your position
to say "reproducibility by default is more important than giving
users access to the newest bug fixes and features by default?"
It's certainly arguable, but I'm not sure I'm convinced: I'd
imagine that the ratio of new work being done vs reproductions is
rather high and the current setup optimizes for that already.


I think that separating development from released branches can give
us
both reliability/reproducibility (stable branch) as well as new
features (unstable branch). The user gets to pick (and you can pick
both!). The same is true for r-base: when using a 'released'
version
you get 'stable' base packages that are up to 12 months old. If you
want to have the latest stuff you download a nightly build of
r-devel.
For regular users and reproducible research it is recommended to
use
the stable branch. However if you are a developer (e.g. package
author) you might want to develop/test/check your work with the
latest
r-devel.

I think that extending the R release cycle to CRAN would result
both
in more stable released versions of R, as well as more freedom for
package authors to implement rigorous change in the unstable
branch.
When writing a script that is part of a production pipeline, or
sweave
paper that should be reproducible 10 years from now, or a book on
using R, you use stable version of R, which is guaranteed to behave
the same over time. However when developing packages that should be
compatible with the upcoming release of R, you use r-devel which
has
the latest versions of other CRAN and base packages.



As I remember ... The example demonstrating the need for this was an
XML package that cause an extract from a website where the headers
were misinterpreted as data in one version of pkg:XML and not in
another. That seems fairly unconvincing. Data cleaning and
validation is a basic task of data analysis. It also seems excessive
to assert that it is the responsibility of CRAN to maintain a synced
binary archive that will be available in ten years.



CRAN already does this, the bin/windows/contrib directory has
subdirectories going back to 1.7, with packages dated October 2004. I
don't see why it is burdensome to continue to archive these. It would
be nice if source versions had a similar archive.


The bin/windows/contrib directories are updated every day for active R
versions.  It's only when Uwe decides that a version is no longer worth
active support that he stops doing updates, and it "freezes".  A
consequence of this is that the snapshots preserved in those older
directories are unlikely to match what someone who keeps up to date with
R releases is using.  Their purpose is to make sure that those older
versions aren't completely useless, but they aren't what Jeroen was
asking for.


But it is almost completely useless from a reproducibility point of
view to get random package versions. For example if some people try
to use R-2.13.2 today to reproduce an analysis that was published
2 years ago, they'll get Matrix 1.0-4 on Windows, Matrix 1.0-3 on Mac,
and Matrix 1.1-2-2 on Unix. And none of them of course is what was used
by the authors of the paper (they used Matrix 1.0-1, which is what was
current when they ran their analysis).

A big improvement from a reproducibility point of view would be to
(a) have a clear cut for the freezes, (b) freeze the source
packages as well as the binary packages, and (c) freeze the same
versions of source and binaries. For example the freeze of
bin/windows/contrib/x.y, bin/macosx/contrib/x.y and contrib/x.y
could happen when the R-x.y series itself freezes (i.e. no more
minor versions planned for this series).

Cheers,
H.



Karl Millar's suggestion seems like an ideal solution to this problem.
Any CRAN mirror could implement it.  If someone sets this up and commits
to maintaining it, I'd be happy to work on the necessary changes to the
install.packages/update.packages code to allow people to use it from
within R.

Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] The case for freezing CRAN

2014-03-20 Thread Marc Schwartz

On Mar 20, 2014, at 1:02 PM, Marc Schwartz  wrote:

> 
> On Mar 20, 2014, at 12:23 PM, Greg Snow <538...@gmail.com> wrote:
> 
>> On Thu, Mar 20, 2014 at 7:32 AM, Dirk Eddelbuettel  wrote:
>> [snip]
>> 
>>>(and some readers
>>>  may recall the infamous Pentium bug of two decades ago).
>> 
>> It was a "Flaw" not a "Bug".  At least I remember the Intel people
>> making a big deal about that distinction.
>> 
>> But I do remember the time well, I was a biostatistics Ph.D. student
>> at the time and bought one of the flawed pentiums.  My attempts at
>> getting the chip replaced resulted in a major run around and each
>> person that I talked to would first try to explain that I really did
>> not need the fix because the only people likely to be affected were
>> large corporations and research scientists.  I will admit that I was
>> not a large corporation, but if a Ph.D. student in biostatistics is
>> not a research scientist, then I did not know what they defined one
>> as.  When I pointed this out they would usually then say that it still
>> would not matter, unless I did a few thousand floating point
>> operations I was unlikely to encounter one of the problematic
>> divisions.  I would then point out that some days I did over 10,000
>> floating point operations before breakfast (I had checked after the
>> 1st person told me this and 10,000 was a low estimate of a lower bound
>> of one set of simulations) at which point they would admit that I had
>> a case and then send me to talk to someone else who would start the
>> process over.
> 
> 
> Further segue:
> 
> That (1994) was a watershed moment for Intel as a company. A time during 
> which Intel's future was quite literally at stake. Intel's internal response 
> to that debacle, which fundamentally altered their own perception of just who 
> their customer was (the OEM's like IBM, COMPAQ and Dell versus the end users 
> like us), took time to be realized, as the impact of increasingly negative PR 
> took hold. It was also a good example of the impact of public perception (a 
> flawed product) versus the realities of how infrequently the flaw would be 
> observed in "typical" computing. "Perception is reality", as some would 
> observe.
> 
> Intel ultimately spent somewhere in the neighborhood of $500 million (in 1994 
> U.S. dollars), as I recall, to implement a large scale Pentium chip 
> replacement infrastructure targeted to end users. The "Intel Inside" 
> marketing campaign was also an outgrowth of that time period.
> 


Quick correction, thanks to Peter, on my assertion that the "Intel Inside" 
campaign arose from the 1994 Pentium issue. It actually started in 1991.

I had a faulty recollection from my long ago reading of Andy Grove's 1996 book, 
"Only The Paranoid Survive", that the slogan arose from Intel's reaction to the 
Pentium fiasco. It actually pre-dated that time frame by a few years.

Thanks Peter!

Regards,

Marc

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Ted Byers
On Thu, Mar 20, 2014 at 3:14 PM, Hervé Pagès  wrote:

> On 03/20/2014 03:52 AM, Duncan Murdoch wrote:
>
>> On 14-03-20 2:15 AM, Dan Tenenbaum wrote:
>>
>>>
>>>
>>> - Original Message -
>>>
 From: "David Winsemius" 
 To: "Jeroen Ooms" 
 Cc: "r-devel" 
 Sent: Wednesday, March 19, 2014 11:03:32 PM
 Subject: Re: [Rd] [RFC] A case for freezing CRAN


 On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote:

  On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
>  wrote:
>
>> Reading this thread again, is it a fair summary of your position
>> to say "reproducibility by default is more important than giving
>> users access to the newest bug fixes and features by default?"
>> It's certainly arguable, but I'm not sure I'm convinced: I'd
>> imagine that the ratio of new work being done vs reproductions is
>> rather high and the current setup optimizes for that already.
>>
>
> I think that separating development from released branches can give
> us
> both reliability/reproducibility (stable branch) as well as new
> features (unstable branch). The user gets to pick (and you can pick
> both!). The same is true for r-base: when using a 'released'
> version
> you get 'stable' base packages that are up to 12 months old. If you
> want to have the latest stuff you download a nightly build of
> r-devel.
> For regular users and reproducible research it is recommended to
> use
> the stable branch. However if you are a developer (e.g. package
> author) you might want to develop/test/check your work with the
> latest
> r-devel.
>
> I think that extending the R release cycle to CRAN would result
> both
> in more stable released versions of R, as well as more freedom for
> package authors to implement rigorous change in the unstable
> branch.
> When writing a script that is part of a production pipeline, or
> sweave
> paper that should be reproducible 10 years from now, or a book on
> using R, you use stable version of R, which is guaranteed to behave
> the same over time. However when developing packages that should be
> compatible with the upcoming release of R, you use r-devel which
> has
> the latest versions of other CRAN and base packages.
>


 As I remember ... The example demonstrating the need for this was an
 XML package that cause an extract from a website where the headers
 were misinterpreted as data in one version of pkg:XML and not in
 another. That seems fairly unconvincing. Data cleaning and
 validation is a basic task of data analysis. It also seems excessive
 to assert that it is the responsibility of CRAN to maintain a synced
 binary archive that will be available in ten years.

>>>
>>>
>>> CRAN already does this, the bin/windows/contrib directory has
>>> subdirectories going back to 1.7, with packages dated October 2004. I
>>> don't see why it is burdensome to continue to archive these. It would
>>> be nice if source versions had a similar archive.
>>>
>>
>> The bin/windows/contrib directories are updated every day for active R
>> versions.  It's only when Uwe decides that a version is no longer worth
>> active support that he stops doing updates, and it "freezes".  A
>> consequence of this is that the snapshots preserved in those older
>> directories are unlikely to match what someone who keeps up to date with
>> R releases is using.  Their purpose is to make sure that those older
>> versions aren't completely useless, but they aren't what Jeroen was
>> asking for.
>>
>
> But it is almost completely useless from a reproducibility point of
> view to get random package versions. For example if some people try
> to use R-2.13.2 today to reproduce an analysis that was published
> 2 years ago, they'll get Matrix 1.0-4 on Windows, Matrix 1.0-3 on Mac,
> and Matrix 1.1-2-2 on Unix. And none of them of course is what was used
> by the authors of the paper (they used Matrix 1.0-1, which is what was
> current when they ran their analysis).
>

Initially this discussion brought back nightmares of DLL hell on Windows.
Those as ancient as I will remember that well.  But now, the focus seems to
be on reproducibility, but with what strikes me as a seriously flawed
notion of what reproducibility means.

Herve Pages mentions the risk of irreproducibility across three minor
revisions of version 1.0 of Matrix.  My gut reaction would be that if the
results are not reproducible across such minor revisions of one library,
they are probably just so much BS.  I am trained in mathematical ecology,
with more than a couple decades of post-doc experience working with risk
assessment in the private sector.  When I need to do an analysis, I will
repeat it myself in multiple products, as well as C++ or FORTRAN code I
have hand-crafted myself (and when I wrote number crunching code myself, I
would do so in m

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Jeroen Ooms
On Thu, Mar 20, 2014 at 1:28 PM, Ted Byers  wrote:
>
> Herve Pages mentions the risk of irreproducibility across three minor
> revisions of version 1.0 of Matrix.  My gut reaction would be that if the
> results are not reproducible across such minor revisions of one library,
> they are probably just so much BS.
>

Perhaps this is just terminology, but what you refer to I would generally
call 'replication'. Of course being able to replicate results with other
data or other software is important to validate claims. But being able to
reproduce how the original results were obtained is an important part of
this process.

If someone is publishing results that I think are questionable and I cannot
replicate them, I want to know exactly how those outcomes were obtained in
the first place, so that I can 'debug' the problem. It's quite important to
be able to trace whether incorrect results were the result of a bug,
incompetence or fraud.

Let's take the example of the Reinhart and Rogoff case. The results
obviously were not replicable, but without more information it was just the
word of a grad student vs two Harvard professors. Only after reproducing
the original analysis was it possible to point out the errors and prove
that the original results were incorrect.
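
As an illustration of what that kind of tracing requires in practice, here is a
minimal sketch of pulling one specific historical package version from the CRAN
source archive into its own library (the version number and paths are only
illustrative; the Archive URL follows CRAN's usual pattern):

    ## fetch the exact Matrix version the authors reported using
    url <- "http://cran.r-project.org/src/contrib/Archive/Matrix/Matrix_1.0-1.tar.gz"
    download.file(url, destfile = "Matrix_1.0-1.tar.gz")
    ## install it into a separate library so the current Matrix is left alone
    dir.create("paper-lib", showWarnings = FALSE)
    install.packages("Matrix_1.0-1.tar.gz", repos = NULL, type = "source",
                     lib = "paper-lib")
    library(Matrix, lib.loc = "paper-lib")

Doing that by hand for every dependency of every dependency is exactly the part
that does not scale, which is why a frozen snapshot is attractive.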



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Tim Triche, Jr.
That doesn't make sense.

If an API changes (e.g. in Matrix) and a program written against the old
API can no longer run, that is a very different issue than if the same
numbers (data) give different results.  The latter is what I am guessing
you address.  The former is what I believe most people are concerned about
here.  Or at least I hope that's so.

It's more an issue of usability than reproducibility in such a case, far as
I can tell (see e.g.
http://liorpachter.wordpress.com/2014/03/18/reproducibility-vs-usability/).
 If the same data produces substantially different results (not
attributable to e.g. better handling of machine precision and so forth,
although that could certainly be a bugaboo in many cases... anyone who has
programmed numerical routines in FORTRAN already knows this) then yes,
that's a different type of bug.  But in order to uncover the latter type of
bug, the code has to run in the first place.  After a while it becomes
rather impenetrable if no thought is given to these changes.

So the Bioconductor solution, as Herve noted, is to have freezes and
releases.  There can be old bugs enshrined in people's code due to using
old versions, and those can be traced even after many releases have come
and gone, because there is a point-in-time snapshot of about when these
things occurred.  As with (say) ANSI C++, deprecation notices stay in place
for a year before anything is actually done to remove a function or break
an API.  It's not impossible, it just requires more discipline than
declaring that the same program should be written multiple times on
multiple platforms every time.  The latter isn't an efficient use of
anyone's time.
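
For concreteness, this is roughly what that split looks like from the user's
side (a sketch of the BiocInstaller workflow current at the time of writing; the
package name is illustrative):

    ## standard Bioconductor bootstrap: biocLite() installs from the release
    ## branch that matches the running version of R
    source("http://bioconductor.org/biocLite.R")
    biocLite("SomeBiocPackage")      # illustrative package name

    ## package developers (or the adventurous) can switch to the devel branch
    BiocInstaller::useDevel()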

Most of these analyses are not about putting a man on the moon or making
sure a dam does not break.  They're relatively low-consequence exploratory
sorties.  If something comes of them, it would be nice to have a
point-in-time reference to check and see whether the original results were
hooey.  That's a lot quicker and more efficient than rewriting everything
from scratch (which, in some fields, simply ensures things won't get
checked).

My $0.02, since we do still have those to bedevil cashiers.



Statistics is the grammar of science.
Karl Pearson 


On Thu, Mar 20, 2014 at 1:28 PM, Ted Byers  wrote:

> On Thu, Mar 20, 2014 at 3:14 PM, Hervé Pagès  wrote:
>
> > On 03/20/2014 03:52 AM, Duncan Murdoch wrote:
> >
> >> On 14-03-20 2:15 AM, Dan Tenenbaum wrote:
> >>
> >>>
> >>>
> >>> - Original Message -
> >>>
>  From: "David Winsemius" 
>  To: "Jeroen Ooms" 
>  Cc: "r-devel" 
>  Sent: Wednesday, March 19, 2014 11:03:32 PM
>  Subject: Re: [Rd] [RFC] A case for freezing CRAN
> 
> 
>  On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote:
> 
>   On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
> >  wrote:
> >
> >> Reading this thread again, is it a fair summary of your position
> >> to say "reproducibility by default is more important than giving
> >> users access to the newest bug fixes and features by default?"
> >> It's certainly arguable, but I'm not sure I'm convinced: I'd
> >> imagine that the ratio of new work being done vs reproductions is
> >> rather high and the current setup optimizes for that already.
> >>
> >
> > I think that separating development from released branches can give
> > us
> > both reliability/reproducibility (stable branch) as well as new
> > features (unstable branch). The user gets to pick (and you can pick
> > both!). The same is true for r-base: when using a 'released'
> > version
> > you get 'stable' base packages that are up to 12 months old. If you
> > want to have the latest stuff you download a nightly build of
> > r-devel.
> > For regular users and reproducible research it is recommended to
> > use
> > the stable branch. However if you are a developer (e.g. package
> > author) you might want to develop/test/check your work with the
> > latest
> > r-devel.
> >
> > I think that extending the R release cycle to CRAN would result
> > both
> > in more stable released versions of R, as well as more freedom for
> > package authors to implement rigorous change in the unstable
> > branch.
> > When writing a script that is part of a production pipeline, or
> > sweave
> > paper that should be reproducible 10 years from now, or a book on
> > using R, you use stable version of R, which is guaranteed to behave
> > the same over time. However when developing packages that should be
> > compatible with the upcoming release of R, you use r-devel which
> > has
> > the latest versions of other CRAN and base packages.
> >
> 
> 
>  As I remember ... The example demonstrating the need for this was an
>  XML package that cause an extract from a website where the headers
>  were mis

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Ted Byers
On Thu, Mar 20, 2014 at 4:53 PM, Jeroen Ooms wrote:

> On Thu, Mar 20, 2014 at 1:28 PM, Ted Byers  wrote:
>>
>> Herve Pages mentions the risk of irreproducibility across three minor
>> revisions of version 1.0 of Matrix.  My gut reaction would be that if the
>> results are not reproducible across such minor revisions of one library,
>> they are probably just so much BS.
>>
>
> Perhaps this is just terminology, but what you refer to I would generally
> call 'replication'. Of course being able to replicate results with other
> data or other software is important to validate claims. But being able to
> reproduce how the original results were obtained is an important part of
> this process.
>
Fair enough.


> If someone is publishing results that I think are questionable and I
> cannot replicate them, I want to know exactly how those outcomes were
> obtained in the first place, so that I can 'debug' the problem. It's quite
> important to be able to trace back if incorrect results were a result of a
> bug, incompetence or fraud.
>
OK.  That is where archives come in.  When I had to deal with that sort of
thing, I provided copies of both data and code to whoever asked.  It ought
not be hard for authors to make an archive, to e.g. an optical disk, that
includes the software used along with the data, and store it like any other
backup, so it can be provided to anyone upon request.


> Let's take the example of the Reinhart and Rogoff case. The results
> obviously were not replicable, but without more information it was just the
> word of a grad students vs two Harvard professors. Only after reproducing
> the original analysis it was possible to point out the errors and proof
> that the original were incorrect.
>
>
>
>
Ok, but, if the practice I used were followed, then a copy of the optical disk
to which everything relevant was stored would solve that problem (and it
would be extremely easy for the researcher or his/her supervisor to do).  I
once had a reviewer complain he couldn't reproduce my results, so I sent
him my code, which, translated into any of the Algol family of languages,
would allow  him, or anyone else, to replicate my results regardless of
their programming language of choice.  Once he had my code, he found his
error and reported back that he had finally replicated my results.  Several
of my colleagues used the same practice, with the same consequences
(whenever questioned, they just provide their code, and related software,
and then their results were reproduced).  There is nothing like backups
with due attention to detail.

Cheers

Ted

-- 
R.E.(Ted) Byers, Ph.D.,Ed.D.



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Ted Byers
On Thu, Mar 20, 2014 at 5:11 PM, Tim Triche, Jr. wrote:

> That doesn't make sense.
>
> If an API changes (e.g. in Matrix) and a program written against the old
> API can no longer run, that is a very different issue than if the same
> numbers (data) give different results.  The latter is what I am guessing
> you address.  The former is what I believe most people are concerned about
> here.  Or at least I hope that's so.
>
The problem you describe is the classic case of a failure of backward
compatibility.  That is completely different from the question of
reproducibility or replicability.  And, since I, among others, noticed the
question of reproducibility had arisen, I felt a need to primarily address
that.

I do not have a quibble with anything else you wrote (or with anything in
this thread related to the issue of backward compatibility), and I have
enough experience to know both that it is a hard problem and that there are
a number of different solutions people have used.  Appropriate management
of deprecation of features is one, and the use of code freezes is another.
Version control is a third.  Each option carries its own advantages and
disadvantages.


> It's more an issue of usability than reproducibility in such a case, far
> as I can tell (see e.g.
> http://liorpachter.wordpress.com/2014/03/18/reproducibility-vs-usability/).  
> If the same data produces substantially different results (not
> attributable to e.g. better handling of machine precision and so forth,
> although that could certainly be a bugaboo in many cases... anyone who has
> programmed numerical routines in FORTRAN already knows this) then yes,
> that's a different type of bug.  But in order to uncover the latter type of
> bug, the code has to run in the first place.  After a while it becomes
> rather impenetrable if no thought is given to these changes.
>
> So the Bioconductor solution, as Herve noted, is to have freezes and
> releases.  There can be old bugs enshrined in people's code due to using
> old versions, and those can be traced even after many releases have come
> and gone, because there is a point-in-time snapshot of about when these
> things occurred.  As with (say) ANSI C++, deprecation notices stay in place
> for a year before anything is actually done to remove a function or break
> an API.  It's not impossible, it just requires more discipline than
> declaring that the same program should be written multiple times on
> multiple platforms every time.  The latter isn't an efficient use of
> anyone's time.
>
> Most of these analyses are not about putting a man on the moon or making
> sure a dam does not break.  They're relatively low-consequence exploratory
> sorties.  If something comes of them, it would be nice to have a
> point-in-time reference to check and see whether the original results were
> hooey.  That's a lot quicker and more efficient than rewriting everything
> from scratch (which, in some fields, simply ensures things won't get
> checked).
>
> My $0.02, since we do still have those to bedevil cashiers.
>
>
>
> Statistics is the grammar of science.
> Karl Pearson 
>
>
Cheers

Ted



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Tim Triche, Jr.
> There is nothing like backups with due attention to detail.

Agreed, although given the complexity of dependencies among packages, this
might entail several GB of snapshots per paper (if not several TB for some
papers) in various cases.  Anyone who is reasonably prolific then gets the
exciting prospect of managing these backups.

At least if I grind out a vignette with a bunch of Bioconductor packages
and call sessionInfo() at the end, I can find out later on (if, say, things
stop working) what was the state of the tree when it last worked, and what
might have changed since then.  If a self-contained C++ or FORTRAN program
is sufficient to perform an entire analysis, that's awesome, and it ought
to be stuffed into revision control (doesn't everyone already do this?).
 But once you start using tools that depend on other tools, it becomes
substantially more difficult to ensure that

1) a comprehensive snapshot is taken
2) reviewers, possibly on different platforms and/or major versions, can
run using that snapshot
3) some means of a quick sanity check ("does this analysis even return
sensible results?") can be run
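
On the sessionInfo() point above, a minimal sketch of the kind of record I
mean, dropped at the end of a script or vignette (file names are illustrative):

    ## record the package state alongside the results
    si <- sessionInfo()
    print(si)                                    # ends up in the rendered document
    writeLines(capture.output(print(si)), "sessionInfo.txt")
    saveRDS(si, "sessionInfo.rds")               # machine-readable copy for later comparison

That helps with point 1 only partially (it records versions but does not
preserve the packages themselves), and with points 2 and 3 not at all.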

Hopefully this is better articulated than my previous missive.

I believe we fundamentally agree; some of the particulars may be an issue
of notation or typical workflow.



Statistics is the grammar of science.
Karl Pearson 


On Thu, Mar 20, 2014 at 2:13 PM, Ted Byers  wrote:

> On Thu, Mar 20, 2014 at 4:53 PM, Jeroen Ooms  >wrote:
>
> > On Thu, Mar 20, 2014 at 1:28 PM, Ted Byers 
> wrote:
> >>
> >> Herve Pages mentions the risk of irreproducibility across three minor
> >> revisions of version 1.0 of Matrix.  My gut reaction would be that if
> the
> >> results are not reproducible across such minor revisions of one library,
> >> they are probably just so much BS.
> >>
> >
> > Perhaps this is just terminology, but what you refer to I would generally
> > call 'replication'. Of course being able to replicate results with other
> > data or other software is important to validate claims. But being able to
> > reproduce how the original results were obtained is an important part of
> > this process.
> >
> > Fair enough.
>
>
> > If someone is publishing results that I think are questionable and I
> > cannot replicate them, I want to know exactly how those outcomes were
> > obtained in the first place, so that I can 'debug' the problem. It's
> quite
> > important to be able to trace back if incorrect results were a result of
> a
> > bug, incompetence or fraud.
> >
> > OK.  That is where archives come in.  When I had to deal with that sort
> of
> thing, I provided copies of both data and code to whoever asked.  It ought
> not be hard for authors to make an archive, to e.g. an optical disk, that
> includes the software used along with the data, and store it like any other
> backup, so it can be provided to anyone upon request.
>
>
> > Let's take the example of the Reinhart and Rogoff case. The results
> > obviously were not replicable, but without more information it was just
> the
> > word of a grad students vs two Harvard professors. Only after reproducing
> > the original analysis it was possible to point out the errors and proof
> > that the original were incorrect.
> >
> >
> >
> >
> > Ok, but, if the practice I used were used, then a copy of the optical
> disk
> to which everything relevant was stored would solve that problem (and it
> would be extremely easy for the researcher or his/her supervisor to do).  I
> once had a reviewer complain he couldn't reproduce my results, so I sent
> him my code, which, translated into any of the Algol family of languages,
> would allow  him, or anyone else, to replicate my results regardless of
> their programming language of choice.  Once he had my code, he found his
> error and reported back that he had finally replicated my results.  Several
> of my colleagues used the same practice, with the same consequences
> (whenever questioned, they just provide their code, and related software,
> and then their results were reproduced).  There is nothing like backups
> with due attention to detail.
>
> Cheers
>
> Ted
>
> --
> R.E.(Ted) Byers, Ph.D.,Ed.D.
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Ted Byers
On Thu, Mar 20, 2014 at 5:27 PM, Tim Triche, Jr. wrote:

> > There is nothing like backups with due attention to detail.
>
> Agreed, although given the complexity of dependencies among packages, this
> might entail several GB of snapshots per paper (if not several TB for some
> papers) in various cases.  Anyone who is reasonably prolific then gets the
> exciting prospect of managing these backups.
>
Isn't that what support staff is for?  ;-)  But, storage space is cheap,
and as tedious as managing backups can be (definitely not fun), it is
managable.


> At least if I grind out a vignette with a bunch of Bioconductor packages
> and call sessionInfo() at the end, I can find out later on (if, say, things
> stop working) what was the state of the tree when it last worked, and what
> might have changed since then.  If a self-contained C++ or FORTRAN program
> is sufficient to perform an entire analysis, that's awesome, and it ought
> to be stuffed into revision control (doesn't everyone already do this?).
>  But once you start using tools that depend on other tools, it becomes
> substantially more difficult to ensure that
>
> 1) a comprehensive snapshot is taken
> 2) reviewers, possibly on different platforms and/or major versions, can
> run using that snapshot
> 3) some means of a quick sanity check ("does this analysis even return
> sensible results?") can be run
>
> Hopefully this is better articulated than my previous missive.
>
Tell me about it.  Oh, wait, you already did.  ;-)

I understand this, as I routinely work with complex distributed systems
involving multiple programming languages and other diverse tools.  But such
is part of the overhead of doing quality work.


> I believe we fundamentally agree; some of the particulars may be an issue
> of notation or typical workflow.
>
>
I agree that we fundamentally agree  ;-)

From my experience, the issues addressed in this thread are probably best
handled by the package developers and the authors that use their
packages, rather than imposing additional work on those responsible for
CRAN, especially when the means for doing things a little differently than
how CRAN does it are readily available.

Cheers

Ted
R.E.(Ted) Byers, Ph.D.,Ed.D.


>
> Statistics is the grammar of science.
> Karl Pearson 
>
>
> On Thu, Mar 20, 2014 at 2:13 PM, Ted Byers  wrote:
>
>> On Thu, Mar 20, 2014 at 4:53 PM, Jeroen Ooms > >wrote:
>>
>> > On Thu, Mar 20, 2014 at 1:28 PM, Ted Byers 
>> wrote:
>> >>
>> >> Herve Pages mentions the risk of irreproducibility across three minor
>> >> revisions of version 1.0 of Matrix.  My gut reaction would be that if
>> the
>> >> results are not reproducible across such minor revisions of one
>> library,
>> >> they are probably just so much BS.
>> >>
>> >
>> > Perhaps this is just terminology, but what you refer to I would
>> generally
>> > call 'replication'. Of course being able to replicate results with other
>> > data or other software is important to validate claims. But being able
>> to
>> > reproduce how the original results were obtained is an important part of
>> > this process.
>> >
>> > Fair enough.
>>
>>
>> > If someone is publishing results that I think are questionable and I
>> > cannot replicate them, I want to know exactly how those outcomes were
>> > obtained in the first place, so that I can 'debug' the problem. It's
>> quite
>> > important to be able to trace back if incorrect results were a result
>> of a
>> > bug, incompetence or fraud.
>> >
>> > OK.  That is where archives come in.  When I had to deal with that sort
>> of
>> thing, I provided copies of both data and code to whoever asked.  It ought
>> not be hard for authors to make an archive, to e.g. an optical disk, that
>> includes the software used along with the data, and store it like any
>> other
>> backup, so it can be provided to anyone upon request.
>>
>>
>> > Let's take the example of the Reinhart and Rogoff case. The results
>> > obviously were not replicable, but without more information it was just
>> the
>> > word of a grad students vs two Harvard professors. Only after
>> reproducing
>> > the original analysis it was possible to point out the errors and proof
>> > that the original were incorrect.
>> >
>> >
>> >
>> >
>> > Ok, but, if the practice I used were used, then a copy of the optical
>> disk
>> to which everything relevant was stored would solve that problem (and it
>> would be extremely easy for the researcher or his/her supervisor to do).
>>  I
>> once had a reviewer complain he couldn't reproduce my results, so I sent
>> him my code, which, translated into any of the Algol family of languages,
>> would allow  him, or anyone else, to replicate my results regardless of
>> their programming language of choice.  Once he had my code, he found his
>> error and reported back that he had finally replicated my results.
>>  Several
>> of my colleagues used the same practice, with the same conse

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Hervé Pagès



[...]

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Uwe Ligges



[...]

Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Hervé Pagès

[...]

[Rd] Memcheck: error in a switch using getGraphicsEvent

2014-03-20 Thread Christophe Genolini

Hi the list,

One of my packages has an(other) error detected by memcheck that I do not manage
to understand.
Here is the message that I get from memcheck:

--- 8< 
> try(choice(cld1))
Error in switch(EXPR = choix, Up = { : EXPR must be a length 1 vector
--- 8< 

The choice function does call the choiceChangeParam function, which is:

--- 8< 
choiceChangeParam <- function(paramChoice){

texte <- paste(" ~ Choice : menu~\n",sep="")

choix <- getGraphicsEvent(texte,onKeybd=function(key){return(key)})
switch(EXPR=choix,
   "Up"= {
   if(xy[1]>1){
   paramChoice['toDo'] <- "xy"
   xy[2]<-1
   xy[1]<-xy[1]-1
   paramChoice['xy']<-xy
   }else{paramChoice['toDo'] <- ""}
   },
   "Down"  = {
   if(xy[1]


Re: [Rd] Memcheck: error in a switch using getGraphicsEvent

2014-03-20 Thread Duncan Murdoch

On 2014-03-20, 8:02 PM, Christophe Genolini wrote:

Hi the list,

One of my package has an (other) error detected by memtest that I do not manage 
to understand.
Here is the message that I get from Memtest

--- 8< 
  > try(choice(cld1))
Error in switch(EXPR = choix, Up = { : EXPR must be a length 1 vector
--- 8< 

The choice function does call the choiceChangeParam function, which is:

--- 8< 
choiceChangeParam <- function(paramChoice){

  texte <- paste(" ~ Choice : menu~\n",sep="")

  choix <- getGraphicsEvent(texte,onKeybd=function(key){return(key)})
  switch(EXPR=choix,
 "Up"= {
 if(xy[1]>1){
 paramChoice['toDo'] <- "xy"
 xy[2]<-1
 xy[1]<-xy[1]-1
 paramChoice['xy']<-xy
 }else{paramChoice['toDo'] <- ""}
 },
 "Down"  = {
 if(xy[1]

It can also return NULL, but to be sure, why not print the value?
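
For example, a minimal sketch of that check inside choiceChangeParam(), reusing
the names from the snippet above:

    choix <- getGraphicsEvent(texte, onKeybd = function(key) return(key))
    ## see exactly what came back before dispatching on it
    cat("class:", class(choix), "length:", length(choix), "\n")
    print(choix)
    if (is.null(choix) || length(choix) != 1) {
        ## getGraphicsEvent() can return NULL (e.g. the device was closed),
        ## which is what makes switch() stop with "EXPR must be a length 1 vector"
        return(paramChoice)
    }
    ## ... then dispatch with switch(EXPR = choix, ...) as before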

Duncan Murdoch



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Gábor Csárdi
Much of the discussion was about reproducibility so far. Let me emphasize
another point from Jeroen's proposal.

This is hard to measure of course, but I think I can say that the existence
and the quality of CRAN and its packages contributed immensely to the
success of R and the success of people using R. Having one central, well
controlled and tested package repository is a huge advantage for the users.
(I know that there are other repositories, but they are either similarly
well controlled and specialized (BioC), or less used.) It would be great to
keep it like this.

I also think that the current CRAN policy is not ideal for further growth.
In particular, updating a package with many reverse dependencies is a
frustrating process, for everybody. As a maintainer with ~150 reverse
dependencies, I think not twice, but ten times if I really want to publish
a new version on CRAN. I cannot speak for other maintainers of course, but
I have a feeling that I am not alone.
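
As an aside, it is easy to see how large that surface is for any given package
(a sketch; 'igraph' is only used as an illustrative name here):

    ## CRAN packages that must keep working after an update
    db <- available.packages()       # uses getOption("repos")
    revdeps <- tools::package_dependencies("igraph", db = db,
                                           which = c("Depends", "Imports", "LinkingTo"),
                                           reverse = TRUE)
    length(revdeps[["igraph"]])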

Tying CRAN packages to R releases would help, because then I would not have
to worry about breaking packages in the stable version of CRAN, only in
CRAN-devel.

Somebody mentioned that it is good not to do this because then users get
bug fixes and new features earlier. Well, in my case, the opposite is true.
As I am not updating, they actually get them (much) later. If it wasn't such
a hassle, I would definitely update more often, about once a month. Now my
goal is more like once a year.

Again, I cannot speak for others, but I believe the current policy does not
help progress, and is not sustainable in the long run. It penalizes the
maintainers of "more important" packages (= many reverse dependencies, which
probably also means many users), and I fear they will slowly move
away from CRAN. I don't think this is what anybody in the R community would
want.

Best,
Gabor



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread William Dunlap
> In particular, updating a package with many reverse dependencies is a
> frustrating process, for everybody. As a maintainer with ~150 reverse
> dependencies, I think not twice, but ten times if I really want to publish
> a new version on CRAN.

It might be easier if more of those packages came with good test suites.
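
For instance, even the simplest layout helps: any R script under tests/ is run
by R CMD check, so a downstream package can fail loudly when something it relies
on changes. A sketch (package and function names are illustrative):

    ## tests/check-basics.R in the downstream package
    library(mypkg)
    x <- some_exported_function(1:10)    # illustrative call into 'mypkg'
    stopifnot(length(x) == 10, !any(is.na(x)))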

Bill Dunlap
TIBCO Software
wdunlap tibco.com


> -Original Message-
> From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On 
> Behalf
> Of Gábor Csárdi
> Sent: Thursday, March 20, 2014 6:24 PM
> To: r-devel
> Subject: Re: [Rd] [RFC] A case for freezing CRAN
> 
> Much of the discussion was about reproducibility so far. Let me emphasize
> another point from Jeroen's proposal.
> 
> This is hard to measure of course, but I think I can say that the existence
> and the quality of CRAN and its packages contributed immensely to the
> success of R and the success of people using R. Having one central, well
> controlled and tested package repository is a huge advantage for the users.
> (I know that there are other repositories, but they are either similarly
> well controlled and specialized (BioC), or less used.) It would be great to
> keep it like this.
> 
> I also think that the current CRAN policy is not ideal for further growth.
> In particular, updating a package with many reverse dependencies is a
> frustrating process, for everybody. As a maintainer with ~150 reverse
> dependencies, I think not twice, but ten times if I really want to publish
> a new version on CRAN. I cannot speak for other maintainers of course, but
> I have a feeling that I am not alone.
> 
> Tying CRAN packages to R releases would help, because then I would not have
> to worry about breaking packages in the stable version of CRAN, only in
> CRAN-devel.
> 
> Somebody mentioned that it is good not to do this because then users get
> bug fixes and new features earlier. Well, in my case, the opposite it true.
> As I am not updating, they actually get it (much) later. If it wasn't such
> a hassle, I would definitely update more often, about once a month. Now my
> goal is more like once a year.
> 
> Again, I cannot speak for others, but I believe the current policy does not
> help progress, and is not sustainable in the long run. It penalizes the
> maintainers of "more important" (= many rev. dependencies, that is, which
> probably also means many users) packages, and I fear they will slowly move
> away from CRAN. I don't think this is what anybody in the R community would
> want.
> 
> Best,
> Gabor
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Gábor Csárdi
On Thu, Mar 20, 2014 at 9:45 PM, William Dunlap  wrote:

> > In particular, updating a package with many reverse dependencies is a
> > frustrating process, for everybody. As a maintainer with ~150 reverse
> > dependencies, I think not twice, but ten times if I really want to
> publish
> > a new version on CRAN.
>
> It might be easier if more of those packages came with good test suites.
>

Test suites are great, but I don't think this would make my job easier.
More tests means more potential breakage. The extreme of not having any
examples and tests in these 150 packages would be the easiest for _me_,
actually. Not for the users, though.

What would really help is either fully versioned package dependencies
(daydreaming here), or having a CRAN-devel repository that changes and
might break often, and a CRAN-stable that does not change (much).
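
For reference, this is how far versioned dependencies currently go in a
DESCRIPTION file (a sketch, with illustrative names and numbers):

    Package: mypaperanalysis
    Version: 0.1.0
    Depends: R (>= 3.0.2)
    Imports: Matrix (>= 1.0-1), XML (>= 3.95-0)

These record minimum versions that are honoured at install and check time;
pinning the exact versions a result was produced with is the part that still
needs a frozen repository, a snapshot, or a private library.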

Gabor

[...]



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Tim Triche, Jr.
Heh, you just described BioC

--t

> On Mar 20, 2014, at 7:15 PM, Gábor Csárdi  wrote:
> 
> On Thu, Mar 20, 2014 at 9:45 PM, William Dunlap  wrote:
> 
>>> In particular, updating a package with many reverse dependencies is a
>>> frustrating process, for everybody. As a maintainer with ~150 reverse
>>> dependencies, I think not twice, but ten times if I really want to
>> publish
>>> a new version on CRAN.
>> 
>> It might be easier if more of those packages came with good test suites.
> 
> Test suites are great, but I don't think this would make my job easier.
> More tests means more potential breakage. The extreme of not having any
> examples and tests in these 150 packages would be the easiest for _me_,
> actually. Not for the users, though.
> 
> What would really help is either fully versioned package dependencies
> (daydreaming here), or having a CRAN-devel repository, that changes and
> might break often, and a CRAN-stable that does not change (much).
> 
> Gabor
> 
> [...]
> 
>[[alternative HTML version deleted]]
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Tim Triche, Jr.
Except that tests (as vignettes) are mandatory for BioC. So if something blows 
up you hear about it right quick :-)

--t

> On Mar 20, 2014, at 7:15 PM, Gábor Csárdi  wrote:
> 
> On Thu, Mar 20, 2014 at 9:45 PM, William Dunlap  wrote:
> 
>>> In particular, updating a package with many reverse dependencies is a
>>> frustrating process, for everybody. As a maintainer with ~150 reverse
>>> dependencies, I think not twice, but ten times if I really want to
>> publish
>>> a new version on CRAN.
>> 
>> It might be easier if more of those packages came with good test suites.
> 
> Test suites are great, but I don't think this would make my job easier.
> More tests means more potential breakage. The extreme of not having any
> examples and tests in these 150 packages would be the easiest for _me_,
> actually. Not for the users, though.
> 
> What would really help is either fully versioned package dependencies
> (daydreaming here), or having a CRAN-devel repository, that changes and
> might break often, and a CRAN-stable that does not change (much).
> 
> Gabor
> 
> [...]
> 
>[[alternative HTML version deleted]]
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



Re: [Rd] [RFC] A case for freezing CRAN

2014-03-20 Thread Dan Tenenbaum


- Original Message -
> From: "Gábor Csárdi" 
> To: "r-devel" 
> Sent: Thursday, March 20, 2014 6:23:33 PM
> Subject: Re: [Rd] [RFC] A case for freezing CRAN
> 
> Much of the discussion was about reproducibility so far. Let me
> emphasize
> another point from Jeroen's proposal.
> 
> This is hard to measure of course, but I think I can say that the
> existence
> and the quality of CRAN and its packages contributed immensely to the
> success of R and the success of people using R. Having one central,
> well
> controlled and tested package repository is a huge advantage for the
> users.
> (I know that there are other repositories, but they are either
> similarly
> well controlled and specialized (BioC), or less used.) It would be
> great to
> keep it like this.
> 
> I also think that the current CRAN policy is not ideal for further
> growth.
> In particular, updating a package with many reverse dependencies is a
> frustrating process, for everybody. As a maintainer with ~150 reverse
> dependencies, I think not twice, but ten times if I really want to
> publish
> a new version on CRAN. I cannot speak for other maintainers of
> course, but
> I have a feeling that I am not alone.
> 
> Tying CRAN packages to R releases would help, because then I would
> not have
> to worry about breaking packages in the stable version of CRAN, only
> in
> CRAN-devel.
> 
> Somebody mentioned that it is good not to do this because then users
> get
> bug fixes and new features earlier. Well, in my case, the opposite it
> true.
> As I am not updating, they actually get it (much) later. If it wasn't
> such
> a hassle, I would definitely update more often, about once a month.
> Now my
> goal is more like once a year.
> 

These are good points. Not only do maintainers think twice (or more) before
updating packages, but it also seems that there are CRAN policies that
discourage frequent updates. Bioconductor, by contrast, welcomes frequent updates
because they usually fix problems and help us understand
interoperability/dependency issues. Probably the main reason for this
difference is the existence of a devel branch where breakage can happen and
it's not the end of the world.





> Again, I cannot speak for others, but I believe the current policy
> does not
> help progress, and is not sustainable in the long run. It penalizes
> the
> maintainers of "more important" (= many rev. dependencies, that is,
> which
> probably also means many users) packages, and I fear they will slowly
> move
> away from CRAN. I don't think this is what anybody in the R community
> would
> want.
> 
> Best,
> Gabor
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
