> On 27/09/2022, at 10:21 AM, Iñaki Ucar <iu...@fedoraproject.org> wrote:
>
> On Mon, 26 Sept 2022 at 23:07, Simon Urbanek
> <simon.urba...@r-project.org> wrote:
>>
>> Iñaki,
>>
>> I'm not sure I understand - system dependencies are an entirely different
>> topic and I would argue a far more important one (very happy to start a
>> discussion about that), but that has nothing to do with declaring downloads.
>> I assumed your question was about large files that packages avoid shipping
>> and download instead, so declaring them would be useful.
>
> Exactly. Maybe there's a misunderstanding, because I didn't talk about system
> dependencies (alas there are packages that try to download things that are
> declared as system dependencies, as Gabe noted). :)
>
Ok, understood. I would like to tackle those as well, but let's start that
conversation in a few weeks when I have a lot more time.
>> And for that, the obvious answer is they shouldn't do that - if a package
>> needs a file to run, it should include it. So an easy solution is to
>> disallow it.
>
> Then we completely agree. My proposal about declaring additional sources was
> because, given that so many packages do this, I thought that I would find
> strong opposition to this. But if R Core / CRAN is ok with just limiting net
> access at install time, then that's perfect for me. :)
>
Yes we do agree :). I started looking at your list, and so far those seem to
be simply bugs or design deficiencies in the packages (and outright policy
violations). I think the only reason they exist is that they don't get
detected in CRAN incoming; it's certainly not intentional.
Cheers,
Simon
> Iñaki
>
>> But so far all examples were just (ab)use of downloads for binary
>> dependencies which is an entirely different issue that needs a different
>> solution (in a naive way declaring such dependencies, but we know it's not
>> that simple - and download URLs don't help there).
>>
>> Cheers,
>> Simon
>>
>>
>>> On 27/09/2022, at 8:25 AM, Iñaki Ucar <iu...@fedoraproject.org> wrote:
>>>
>>> On Sat, 24 Sept 2022 at 01:55, Simon Urbanek
>>> <simon.urba...@r-project.org> wrote:
>>>>
>>>> Iñaki,
>>>>
>>>> I fully agree, this is a very common issue since the vast majority of
>>>> server deployments I have encountered don't allow internet access. In
>>>> practice this means that such packages are effectively banned.
>>>>
>>>> I would argue that not even (1) or (2) are really an issue, because in
>>>> fact the CRAN policy doesn't impose any absolute limits on size, it only
>>>> states that the package should be "of minimum necessary size" which means
>>>> it shouldn't waste space. If there is no way to reduce the size without
>>>> impacting functionality, it's perfectly fine.
>>>
>>> "Packages should be of the minimum necessary size" is subject to
>>> interpretation. And in practice, there is an issue with e.g. packages
>>> that "bundle" big third-party libraries. There are also packages that
>>> require downloading precompiled code, JARs... at installation time.
>>>
>>>> That said, there are exceptions such as very large datasets (e.g., as
>>>> distributed by Bioconductor) which are orders of magnitude larger than
>>>> what is sustainable. I agree that it would be nice to have a mechanism for
>>>> specifying such sources. So yes, I like the idea, but I'd like to see more
>>>> real use cases to justify the effort.
>>>
>>> "More real use cases" like in "more use cases" or like in "the
>>> previous ones are not real ones"? :)
>>>
>>>> The issue with any online downloads, though, is that there is no guarantee
>>>> of availability - which is a real issue for reproducibility. So one could
>>>> argue that if such external sources are required then they should be on a
>>>> well-defined, independent, permanent storage such as Zenodo. This could be
>>>> a matter of policy as opposed to the technical side above which would be
>>>> adding such support to R CMD INSTALL.
>>>
>>> Not necessarily. If the package declares the additional sources in the
>>> DESCRIPTION (probably with hashes), that's a big improvement over the
>>> current state of things, in which basically we don't know what the
>>> package tries to download, then it may fail, and finally there's no
>>> guarantee that it's what the author intended in the first place.
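[Illustrative note, not part of the original message.] The point about hashes can be made concrete with a minimal sketch, written in Python purely for illustration. It checks a fetched file against a declared digest. The function name `verify_sha256` and the choice of SHA-256 are assumptions; the thread does not name a hash algorithm.

```python
import hashlib
from pathlib import Path


def verify_sha256(path: Path, expected_hex: str) -> bool:
    """Return True if the file's SHA-256 digest matches the declared one."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large sources need not fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    # Hex digests are compared case-insensitively.
    return digest.hexdigest() == expected_hex.lower()
```

An installer could refuse to continue when this returns False, which addresses the "no guarantee that it's what the author intended" concern regardless of where the file came from.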
>>>
>>> But on top of this, R could add a CMD to download those, and then some
>>> lookaside storage could be used on CRAN. This is e.g. how RPM
>>> packaging works: the spec declares all the sources, they are
>>> downloaded once, hashed and stored in a lookaside cache. Then package
>>> building doesn't need general Internet connectivity, just access to
>>> the cache.
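[Illustrative note, not part of the original message.] The lookaside idea described above (download once, hash, store, then build from the cache only) might be sketched as follows. This is a sketch of the general pattern in Python, not of the actual RPM tooling; `cache_source`, `load_source`, the injected `fetch` callable, and the use of SHA-256 as the cache key are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_source(url: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> str:
    """Fetch a source once, store it under its SHA-256, and return the digest."""
    data = fetch(url)  # the only step that needs network access
    digest = hashlib.sha256(data).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / digest).write_bytes(data)
    return digest


def load_source(digest: str, cache_dir: Path) -> bytes:
    """Read a source from the cache at build time; no network involved."""
    data = (cache_dir / digest).read_bytes()
    # Re-hash on read so a corrupted or tampered entry is rejected.
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError(f"cache entry {digest} is corrupted")
    return data
```

Passing `fetch` in as a parameter keeps the network step separate from the build step, so the builder itself only ever needs access to the cache directory.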
>>>
>>> Iñaki
>>>
>>>>
>>>> Cheers,
>>>> Simon
>>>>
>>>>
>>>>> On Sep 24, 2022, at 3:22 AM, Iñaki Ucar <iu...@fedoraproject.org> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to open this debate here, because IMO this is a big issue.
>>>>> Many packages do this for various reasons, some more legitimate than
>>>>> others, but I think that this shouldn't be allowed, because it
>>>>> basically means that installation fails on a machine without Internet
>>>>> access (which happens e.g. in Linux distro builders for security
>>>>> reasons).
>>>>>
>>>>> Now, what if connection is suppressed during package load? There are
>>>>> basically three use cases out there:
>>>>>
>>>>> (1) The package requires additional files for the installation (e.g.
>>>>> the source code of an external library) that cannot be bundled into
>>>>> the package due to CRAN restrictions (size).
>>>>> (2) The package requires additional files for using it (e.g.,
>>>>> datasets, a JAR...) that cannot be bundled into the package due to
>>>>> CRAN restrictions (size).
>>>>> (3) Other spurious reasons (e.g. the maintainer decided that package
>>>>> load was a good place to check an online service availability, etc.).
>>>>>
>>>>> Again IMO, (3) shouldn't be allowed in any case; (2) should be a
>>>>> separate function that the user actively calls to download the files,
>>>>> and those files should be placed into the user dir, and (1) is the
>>>>> only legitimate use, but then another mechanism should be provided to
>>>>> avoid connections during package load.
>>>>>
>>>>> My proposal to support (1) would be to add a new field in the
>>>>> DESCRIPTION, "Additional_sources", which would be a comma-separated
>>>>> list of additional resources to download during R CMD INSTALL. Those
>>>>> sources would be downloaded by R CMD INSTALL if not provided via an
>>>>> option (to support offline installations), and would be placed in a
>>>>> predefined place for the package to find and configure them (via an
>>>>> environment variable or in a predefined subdirectory).
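[Illustrative note, not part of the original message.] A minimal sketch of how such a field might be read and resolved, in Python purely for illustration. The "Additional_sources" field is only a proposal in this thread and is not an existing DESCRIPTION field; the function names, the example.org URLs, and the rule "match a pre-supplied file by its base name" are assumptions.

```python
from pathlib import Path
from typing import List, Optional


def parse_additional_sources(description: str) -> List[str]:
    """Extract the comma-separated URLs of the proposed Additional_sources field."""
    value_lines: List[str] = []
    in_field = False
    for line in description.splitlines():
        if line.startswith("Additional_sources:"):
            in_field = True
            # Split on the first colon only, so colons in URLs survive.
            value_lines.append(line.split(":", 1)[1])
        elif in_field and line[:1] in (" ", "\t"):
            value_lines.append(line)  # continuation line of the same field
        else:
            in_field = False
    joined = " ".join(value_lines)
    return [item.strip() for item in joined.split(",") if item.strip()]


def resolve_source(url: str, offline_dir: Optional[Path]) -> Optional[Path]:
    """Return a pre-supplied local copy if one exists, else None (download needed)."""
    if offline_dir is None:
        return None
    candidate = offline_dir / url.rsplit("/", 1)[-1]
    return candidate if candidate.is_file() else None


# Hypothetical DESCRIPTION content, for illustration only.
example = (
    "Package: demo\n"
    "Additional_sources: https://example.org/a.tar.gz,\n"
    "    https://example.org/b.jar\n"
)
urls = parse_additional_sources(example)
```

Here `resolve_source` models the "if not provided via an option" part: an offline installation supplies a directory of files, and only sources missing from it would need to be downloaded.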
>>>>>
>>>>> This proposal has several advantages. Apart from the obvious one
>>>>> (Internet access during package load can be limited without losing
>>>>> current functionalities), it gives more visibility to the resources
>>>>> that packages are using during the installation phase, and thus makes
>>>>> those installations more reproducible and more secure.
>>>>>
>>>>> Best,
>>>>> --
>>>>> Iñaki Úcar
>>>>>
>>>>> ______________________________________________
>>>>> R-devel@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>
>>>>
>>>
>>>
>>> --
>>> Iñaki Úcar
>>
>
>
> --
> Iñaki Úcar
>