> On Sep 27, 2022, at 8:25 AM, Iñaki Ucar <iu...@fedoraproject.org> wrote:
>
> On Sat, 24 Sept 2022 at 01:55, Simon Urbanek
> <simon.urba...@r-project.org> wrote:
>>
>> Iñaki,
>>
>> I fully agree, this is a very common issue, since the vast majority of server
>> deployments I have encountered don't allow internet access. In practice this
>> means that such packages are effectively banned.
>>
>> I would argue that not even (1) or (2) are really an issue, because in fact
>> the CRAN policy doesn't impose any absolute limits on size, it only states
>> that the package should be "of minimum necessary size" which means it
>> shouldn't waste space. If there is no way to reduce the size without
>> impacting functionality, it's perfectly fine.
>
> "Packages should be of the minimum necessary size" is subject to
> interpretation. And in practice, there is an issue with e.g. packages
> that "bundle" big third-party libraries. There are also packages that
> require downloading precompiled code, JARs... at installation time.
>
JARs are part of the package, so that's a valid use, no question there; that's
how Java packages do this already.

Downloading pre-compiled binaries is something that shouldn't be done and is a
whole can of worms (since those are not sources and they *are* specific to the
platform, OS, etc.). That is an entirely separate issue, but worth its own
discussion.

So I still don't see any use cases for actual sources. I do see a need for
better specification of external dependencies which are not part of the package,
such that those can be satisfied automatically, but that's not the problem you
asked about.

>> That said, there are exceptions such as very large datasets (e.g., as
>> distributed by Bioconductor) which are orders of magnitude larger than what
>> is sustainable. I agree that it would be nice to have a mechanism for
>> specifying such sources. So yes, I like the idea, but I'd like to see more
>> real use cases to justify the effort.
>
> "More real use cases" like in "more use cases" or like in "the
> previous ones are not real ones"? :)
>
>> The issue with any online downloads, though, is that there is no guarantee
>> of availability - which is a real issue for reproducibility. So one could
>> argue that if such external sources are required then they should be on a
>> well-defined, independent, permanent storage such as Zenodo. This could be a
>> matter of policy as opposed to the technical side above which would be
>> adding such support to R CMD INSTALL.
>
> Not necessarily. If the package declares the additional sources in the
> DESCRIPTION (probably with hashes), that's a big improvement over the
> current state of things, in which basically we don't know what the
> package tries to download, then it may fail, and finally there's no
> guarantee that it's what the author intended in the first place.
>
> But on top of this, R could add a CMD to download those, and then some
> lookaside storage could be used on CRAN. This is e.g. how RPM
> packaging works: the spec declares all the sources, they are
> downloaded once, hashed and stored in a lookaside cache. Then package
> building doesn't need general Internet connectivity, just access to
> the cache.
>
Sure, I fully agree that it would be a good first step, but I'm still waiting
for examples ;).
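
To make the mechanism discussed above concrete (not a use case, just an
illustration), here is a minimal Python sketch of the "declare sources with
hashes, then build from a cache" idea. Everything in it is an assumption made
for illustration: the "<url> sha256=<digest>" entry syntax, the cache layout
keyed by file name, and both function names are hypothetical, and none of this
is something R CMD INSTALL does today.

```python
import hashlib
from pathlib import Path


def parse_additional_sources(field: str) -> list[tuple[str, str]]:
    """Split a comma-separated field value into (url, sha256) pairs.

    Assumed (hypothetical) entry syntax: "<url> sha256=<hex digest>".
    URLs containing commas are not handled by this sketch.
    """
    pairs = []
    for entry in field.split(","):
        entry = entry.strip()
        if not entry:
            continue
        url, _, digest = entry.partition(" sha256=")
        if not digest:
            raise ValueError(f"missing hash for source: {entry}")
        pairs.append((url.strip(), digest.strip().lower()))
    return pairs


def resolve_from_cache(url: str, sha256: str, cache_dir: Path) -> Path:
    """Return the cached copy of `url` after verifying its hash.

    This function never touches the network: a missing file is an error.
    """
    # Hypothetical cache layout: files are stored under their URL basename.
    candidate = cache_dir / url.rsplit("/", 1)[-1]
    if not candidate.is_file():
        raise FileNotFoundError(f"{candidate.name} is not in the cache")
    actual = hashlib.sha256(candidate.read_bytes()).hexdigest()
    if actual != sha256:
        raise ValueError(f"hash mismatch for {candidate.name}")
    return candidate
```

A builder without general Internet access would call resolve_from_cache() for
each declared pair and fail early if a file is missing or its hash differs.
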
Cheers,
Simon
> Iñaki
>
>>
>> Cheers,
>> Simon
>>
>>
>>> On Sep 24, 2022, at 3:22 AM, Iñaki Ucar <iu...@fedoraproject.org> wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to open this debate here, because IMO this is a big issue.
>>> Many packages do this for various reasons, some more legitimate than
>>> others, but I think that this shouldn't be allowed, because it
>>> basically means that installation fails on a machine without Internet
>>> access (which happens e.g. in Linux distro builders for security
>>> reasons).
>>>
>>> Now, what if connection is suppressed during package load? There are
>>> basically three use cases out there:
>>>
>>> (1) The package requires additional files for the installation (e.g.
>>> the source code of an external library) that cannot be bundled into
>>> the package due to CRAN restrictions (size).
>>> (2) The package requires additional files for using it (e.g.,
>>> datasets, a JAR...) that cannot be bundled into the package due to
>>> CRAN restrictions (size).
>>> (3) Other spurious reasons (e.g. the maintainer decided that package
>>> load was a good place to check an online service availability, etc.).
>>>
>>> Again IMO, (3) shouldn't be allowed in any case; (2) should be a
>>> separate function that the user actively calls to download the files,
>>> and those files should be placed into the user dir; and (1) is the
>>> only legitimate use, but then another mechanism should be provided to
>>> avoid connections during package load.
>>>
>>> My proposal to support (1) would be to add a new field in the
>>> DESCRIPTION, "Additional_sources", which would be a comma-separated
>>> list of additional resources to download during R CMD INSTALL. Those
>>> sources would be downloaded by R CMD INSTALL if not provided via an
>>> option (to support offline installations), and would be placed in a
>>> predefined place for the package to find and configure them (via an
>>> environment variable or in a predefined subdirectory).
>>>
>>> This proposal has several advantages. Apart from the obvious one
>>> (Internet access during package load can be limited without losing
>>> current functionalities), it gives more visibility to the resources
>>> that packages are using during the installation phase, and thus makes
>>> those installations more reproducible and more secure.
>>>
>>> Best,
>>> --
>>> Iñaki Úcar
>>>
>>> ______________________________________________
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>
>
> --
> Iñaki Úcar
>