[R-pkg-devel] Ensuring permanence and SHA consistency of released CRAN packages for validated software

2022-03-16 Thread Borini, Stefano
Hello,

Validated software must be able to demonstrate a consistent, reproducible 
environment, potentially years later, when the audit comes. For this reason, 
we record the SHA256 of every package we download from CRAN, so that we can 
detect whether a package has changed after the fact. Such a change may signal 
that the package has been corrupted or that malicious code has been added. 
The recorded hashes also assure the auditors that the packages are exactly 
as they were at the time of release.

Currently I am dealing with a package that I downloaded once in the past, 
MASS_7.3-54. This package used to have SHA256

b800ccd5b5c2709b1559cf5eab126e4935c4f8826cf7891253432bb6a056e821  MASS_7.3-54.tar.gz

The current package instead has SHA256:

eb644c0e94b447c46387aa22436ef5a43192960ee9cfd0df2940f4a4116179ae  MASS_7.3-54.tar.gz
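
The check described above, comparing a downloaded tarball against the SHA256 
recorded at first download, can be sketched roughly as follows. This is a 
minimal illustration in Python, not anyone's actual validation tooling; the 
function names `sha256_of` and `verify_tarball` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected_sha256):
    """Return True if the file's SHA-256 matches the recorded value."""
    return sha256_of(path) == expected_sha256.strip().lower()


# Hypothetical usage, with the digest recorded at the first download:
# verify_tarball(
#     "MASS_7.3-54.tar.gz",
#     "b800ccd5b5c2709b1559cf5eab126e4935c4f8826cf7891253432bb6a056e821",
# )
```

A mismatch only says that the bytes differ; it does not say which files 
inside the tarball changed.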

This triggers all sorts of alarms. It is established poor practice to replace 
a package after the fact, exactly for these reasons. Once a package is 
released, it should remain immutable. Subsequent builds can be introduced 
with a different build number.

The change appears to be due to the fact that CRAN occasionally rebuilds 
packages, for reasons unknown to me. Diffing the old and the new 
MASS_7.3-54.tar.gz reveals the change to be due to this:

$ diff -Naur MASS_1/ MASS_2/
diff -Naur MASS_1/DESCRIPTION MASS_2/DESCRIPTION
--- MASS_1/DESCRIPTION  2021-05-03 10:03:00.0 +0100
+++ MASS_2/DESCRIPTION  2021-05-03 10:03:50.0 +0100
@@ -33,4 +33,4 @@
   David Firth [ctb]
 Maintainer: Brian Ripley 
 Repository: CRAN
-Date/Publication: 2021-05-03 09:03:00 UTC
+Date/Publication: 2021-05-03 09:03:50 UTC
diff -Naur MASS_1/MD5 MASS_2/MD5
--- MASS_1/MD5  2021-05-03 10:03:00.0 +0100
+++ MASS_2/MD5  2021-05-03 10:03:50.0 +0100
@@ -1,4 +1,4 @@
-560f72bfd93ac57532d2cf113078d2e7 *DESCRIPTION
+ecf84f78aac3c625898be45513307d79 *DESCRIPTION
 35aff05a505ecf7e81e0473767794ca9 *INDEX
 c7acdc0fa828f781a0a5586ab9d4fa1b *LICENCE.note
 0ac7b30ad35a4c19ea69d76a6a366b02 *NAMESPACE
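
The same comparison can be done without unpacking, by hashing each member of 
the two tarballs. The following is a minimal sketch in Python; the function 
names are hypothetical, and it assumes both tarballs unpack into the same 
top-level directory.

```python
import hashlib
import tarfile


def member_digests(tar_path):
    """Map each regular file in a tarball to the SHA-256 of its contents."""
    digests = {}
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            if member.isfile():
                data = archive.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def changed_members(old_tar, new_tar):
    """Return the sorted names of files added, removed, or modified."""
    old = member_digests(old_tar)
    new = member_digests(new_tar)
    return sorted(
        name for name in old.keys() | new.keys()
        if old.get(name) != new.get(name)
    )
```

Going by the diff above, one would expect this to report only the DESCRIPTION 
and MD5 files for the two MASS tarballs.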

Please prevent SHA changes of released packages on CRAN. Once a package is 
released, it should not be touched again.

--

Stefano Borini
Principal Analytical Tools Developer
AstraZeneca R&D BioPharmaceuticals | Data Science & AI | Early Biometrics & 
Statistical Innovation

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] Ensuring permanence and SHA consistency of released CRAN packages for validated software

2022-03-16 Thread Duncan Murdoch

On 16/03/2022 5:01 p.m., Henrik Bengtsson wrote:

Hi,

I think this is a valid concern and feature request, and I believe it
has been raised by others previously on one of our mailing lists.


And what solution or resources for producing one did they offer?

Here's a trivial solution that could even be implemented by a 
pharmaceutical company:  rename the file to include its SHA when you 
download it, and keep a copy and a record of the new name as part of any 
document that is produced with it.


There, it's solved.
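
The renaming scheme described above can be sketched as follows. This is an 
illustration only, in Python; the function name `archive_with_sha`, the 
"<sha256>_<original name>" naming pattern, and the MANIFEST.txt record file 
are all assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def archive_with_sha(source, archive_dir):
    """Copy `source` into `archive_dir` under a name that embeds its SHA-256.

    Returns the path of the archived copy.
    """
    source = Path(source)
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"{digest}_{source.name}"
    if not target.exists():
        shutil.copy2(source, target)
    # Append the new name to a plain-text record kept next to the copies.
    with open(archive_dir / "MANIFEST.txt", "a", encoding="utf-8") as manifest:
        manifest.write(f"{target.name}\n")
    return target
```

Two downloads of the same version with different contents then end up as two 
distinct files, rather than one silently replacing the other.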

Duncan Murdoch



Related to this, there's also been discussion (here or on R-devel), of
having `R CMD build` produce identical tarballs when the input doesn't
change, but the injection of `Packaged: <timestamp>; <username>` to the
`DESCRIPTION` file prevents this. If I recall correctly, there was at
least some discussion on being able to control, or anonymize, the
<username> part.

MRAN (https://mran.microsoft.com/timemachine) provides a daily
snapshot of CRAN, and it goes back several years, but I'm not sure if
that would solve your problem. It's only stable for a particular date,
but I'd guess that in this case it could pick up one build one day,
and the other one the next day.

There are a few working groups over at the R Consortium
(https://www.r-consortium.org/projects/isc-working-groups) who are
interested in reproducibility of R packages. I suspect the 'R
Validation Hub' working group (https://www.pharmar.org/overview/)
would be interested in these types of hiccups, even if it's just to
collect rare "incidents" like this one. I suggest you ping them as
well.

/Henrik

On Wed, Mar 16, 2022 at 12:45 PM Duncan Murdoch wrote:



I don't know the reason that MASS was built again 50 seconds after the
first build, and it would be more convenient for you and some other
people if it hadn't been, but your request comes across as unreasonably
demanding.

You work for a company with a very large budget.  CRAN is run by
volunteers, and as far as I know, your company has not contributed
financially to running it.

If you want to guarantee that a CRAN package can be re-installed years
from now, *you* should be archiving a copy of it.  You may be negligent
by not doing so:  there's no guarantee that CRAN will still be
distributing *any* version of MASS when the auditors show up.

Duncan Murdoch




Re: [R-pkg-devel] Ensuring permanence and SHA consistency of released CRAN packages for validated software

2022-03-16 Thread Dirk Eddelbuettel


On 16 March 2022 at 14:01, Henrik Bengtsson wrote:
| Related to this, there's also been discussion (here or on R-devel), of
| having `R CMD build` produce identical tarballs when the input doesn't
| change, but the injection of `Packaged: <timestamp>; <username>` to the
| `DESCRIPTION` file prevents this. If I recall correctly, there was at
| least some discussion on being able to control, or anonymize, the
| <username> part.

It's much bigger than R:  https://reproducible-builds.org/

Started within Debian, but grew fairly quickly beyond one distribution to
many. We patched the build to use the (fixed) time from debian/changelog
(rather than the current build time) and a few more things, and were at some
point compliant. But there is still more to do, and the package I stand
behind as far as Debian is concerned currently fails this goal of
reproducible (i.e. binary-identical) builds [1] (and I have limited time to
chase this, but the initiative is very, very good).
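
The idea of replacing the build time with a fixed time can be illustrated
with a small sketch. This is Python, and it is not how `R CMD build` or the
Debian tooling is implemented; the function name is hypothetical. It reads
the fixed time from the SOURCE_DATE_EPOCH environment variable defined by the
reproducible-builds project, and falls back to 0 as an assumption.

```python
import gzip
import io
import os
import tarfile


def build_reproducible_tarball(src_dir, out_path, epoch=None):
    """Write a .tar.gz of `src_dir` with all timestamps set to a fixed value."""
    if epoch is None:
        epoch = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))

    def normalize(info):
        # Drop metadata that varies between machines and build times.
        info.mtime = epoch
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        return info

    base = os.path.basename(os.path.normpath(src_dir))
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        # Recursive add visits directory entries in sorted order.
        archive.add(src_dir, arcname=base, filter=normalize)

    with open(out_path, "wb") as raw:
        # An empty filename and a fixed mtime keep the gzip header stable.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb",
                           mtime=epoch) as gz:
            gz.write(buffer.getvalue())
```

With the same Python and zlib versions, two runs over unchanged input should
give byte-identical output, so the SHA of the tarball stays the same. File
permission bits are left as found, so they still need to match.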

If someone wants to help, please get in touch off-list. It should just require
some patience and diligence, and I may teach you Debian builds in the
process.  The r-cran-* packages generally pass, which is good.

Dirk

[1] https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/r-base.html


-- 
https://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
