[Rd] Suggestion: Install packages on non-appendable file systems (e.g. databricks volumes)

2025-03-26 Thread Sergio Oller
Hello,

I would like to submit a patch to R. Following chapter 5, "Submitting Feature
Requests", of the R Development Guide, I would like to ask for feedback before
proceeding with a formal submission on Bugzilla. This is my first attempt at
contributing to R, and I do not currently have a Bugzilla account.

I work at a company where we use R with Databricks. We want to install
some packages on a distributed filesystem that is not fully POSIX
compliant, as it does not support opening files in append mode. In C terms,
`fopen(filename, "a")` gives an error. I guess other distributed file
systems beyond the ones in Databricks may have issues with append mode as
well.
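
A quick way to check whether a given directory shows this limitation is to try an append-mode open there. The following is a minimal Python sketch, not taken from the patch; the helper name `supports_append` is made up for illustration.

```python
import os
import tempfile


def supports_append(directory):
    """Return True if a file inside `directory` can be opened in append mode."""
    # mkstemp creates the probe file with a plain create/read-write open.
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        # This is the operation the message reports as failing.
        with open(path, "a") as handle:
            handle.write("probe\n")
        return True
    except OSError:
        return False
    finally:
        os.remove(path)
```

Calling `supports_append()` on a local directory and on the distributed mount point should show the difference, assuming the mount rejects the open or the write with an ordinary OS error.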

Our current workaround is to install all packages on a local folder, and
then copy/move the folder to the distributed file system.
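
The workaround can be sketched as follows, in Python for illustration only. The `install` callback stands in for whatever performs the real installation (for example a call to R); it and the function name are assumptions, not part of any existing tool.

```python
import shutil
import tempfile
from pathlib import Path


def install_via_local_staging(install, target_lib):
    """Run `install(local_dir)` on local disk, then copy the result to `target_lib`."""
    with tempfile.TemporaryDirectory() as local_dir:
        # All writes made by the installer, including appends, stay local.
        install(Path(local_dir))
        # copyfile writes each destination file from scratch ("wb" mode),
        # so the target filesystem never sees an append-mode open.
        shutil.copytree(local_dir, target_lib, dirs_exist_ok=True,
                        copy_function=shutil.copyfile)
```

Note that the final copy is not atomic: other sessions reading `target_lib` while it runs may see partially copied packages.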

If I understand package installation correctly, when a package is
installed, the installation happens inside a 00LOCK directory, and then the
outcome is moved to the final destination.

The contribution I would like to submit allows users/sysadmins to set an
environment variable named PKG_LOCKDIR_PREFIX, that defines the location
where the "00LOCK-" directories are created. The patch is backwards
compatible and it consists of +28,-10 lines, hopefully easy enough to
review.

https://github.com/r-devel/r-svn/pull/196.diff

When I use this patch, I can successfully install packages on a distributed
file system by setting PKG_LOCKDIR_PREFIX to a directory in my local
filesystem. R does all the file appending in the local file system and
finally copies the package files to the distributed file system.
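
The idea of redirecting the lock directory can be pictured with a short Python sketch. This is not the code of the patch, whose exact logic may differ; the function `lock_dir` is hypothetical and only shows how an unset variable would keep the current location while a set one would move the "00LOCK-" directory elsewhere.

```python
import os


def lock_dir(lib, pkg, environ=os.environ):
    """Return the directory where the lock/staging directory for `pkg` would go."""
    # PKG_LOCKDIR_PREFIX is the variable proposed in the message.
    # Unset or empty means the lock directory stays inside the library itself.
    prefix = environ.get("PKG_LOCKDIR_PREFIX") or lib
    return os.path.join(prefix, "00LOCK-" + pkg)
```

With the variable unset, the result stays under `lib`, which is what makes the proposal backwards compatible.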

This setting makes package installation transparent for all data
scientists, since they may not even know that PKG_LOCKDIR_PREFIX has been
set. Package installation just works as expected.

I feel the patch has some added value over our workaround: even if we
implement the workaround with a simple wrapper around install.packages(), any
third-party package that depends on install.packages() (such as renv or
others) won't use our workaround. Besides, with this patch merged, any other
R user would benefit from being able to install packages on those filesystems.

Any feedback is very much appreciated.

Thanks for your time,

Sergio

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Suggestion: Install packages on non-appendable file systems (e.g. databricks volumes)

2025-03-26 Thread Tomas Kalibera



On 3/26/25 17:47, Sergio Oller wrote:

> I work at a company where we use R with Databricks. We want to install
> some packages on a distributed filesystem that is not fully POSIX
> compliant, as it does not support opening files in append mode. In C terms,
> `fopen(filename, "a")` gives an error. I guess other distributed file
> systems beyond the ones in Databricks may have issues with append mode as
> well.
>
> Our current workaround is to install all packages on a local folder, and
> then copy/move the folder to the distributed file system.


This is something we try to keep working in R if possible, so that
users can move installed packages by moving the installation directories.
If this practice works for you, it is probably fine.


Currently, installing a binary package just means unpacking it to the
target directory. You could probably also do this via binary packages:
build binary packages on a local filesystem, and then install them to
the non-POSIX filesystem (provided the unpacking/installation works
on such a filesystem). If the installation of a binary package doesn't
work but could be made to work (possibly as an option), that might be of
interest.



> If I understand package installation correctly, when a package is
> installed, the installation happens inside a 00LOCK directory, and then the
> outcome is moved to the final destination.
>
> The contribution I would like to submit allows users/sysadmins to set an
> environment variable named PKG_LOCKDIR_PREFIX, that defines the location
> where the "00LOCK-" directories are created. The patch is backwards
> compatible and it consists of +28,-10 lines, hopefully easy enough to
> review.
>
> https://github.com/r-devel/r-svn/pull/196.diff
>
> When I use this patch, I can successfully install packages on a distributed
> file system by setting PKG_LOCKDIR_PREFIX to a directory in my local
> filesystem. R does all the file appending in the local file system and
> finally copies the package files to the distributed file system.


I am not excited about the idea of combining this with the locking
mechanism and staged installation in the described way. The current
implementation takes advantage of the fact that, on a single filesystem,
a move operation is either atomic (POSIX) or at least very fast (Windows).
Copying an installed package to a different filesystem isn't. There is a
risk that some other R session could see a partial installation of a
package. Then, if the library was on a distributed filesystem accessed
from different machines, there could even be corruption due to
concurrent installation from multiple machines. In principle, this could
happen even on a single machine (checking for the existence of a directory
on one filesystem and creating it on another wouldn't be atomic).
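
The difference between a move within one filesystem and a move across filesystems can be seen in a small Python sketch. It is illustrative only and does not reproduce R's installation code; `move_into_place` is a hypothetical helper.

```python
import errno
import os
import shutil


def move_into_place(staged, final):
    """Move `staged` to `final`; report whether it was a rename or a copy."""
    try:
        # A single step when both paths are on the same filesystem.
        os.rename(staged, final)
        return "renamed"
    except OSError as err:
        # EXDEV means the two paths are on different filesystems.
        if err.errno != errno.EXDEV:
            raise
    # Fallback: while this copy runs, readers can observe a partial `final`.
    shutil.copytree(staged, final)
    shutil.rmtree(staged)
    return "copied"
```

Only the first branch gives other sessions an all-or-nothing view of the package directory; the fallback branch is where a partial installation can become visible.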


Perhaps the staging/locking could be implemented in some special way on 
the target filesystem, some second-level staging and installation - but 
it is questionable whether it is worth the effort/maintenance in base R. 
Also keep in mind this could hardly be regularly tested as such 
filesystems are rare.
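
One possible shape of such a second-level staging step is sketched below in Python. It is a thought experiment, not a proposal for base R: the directory name "00STAGE-" and the function are hypothetical, and the sketch assumes the target filesystem offers an atomic `mkdir` and `rename`, which is not guaranteed for the filesystems discussed here.

```python
import os
import shutil


def install_with_target_staging(local_pkg_dir, target_lib):
    """Copy a locally built package into `target_lib` via a staging directory there."""
    name = os.path.basename(os.path.normpath(local_pkg_dir))
    staging = os.path.join(target_lib, "00STAGE-" + name)  # hypothetical name
    final = os.path.join(target_lib, name)
    # Raises FileExistsError if another installer already holds this slot.
    os.mkdir(staging)
    try:
        staged_pkg = os.path.join(staging, name)
        # The slow, non-atomic copy happens out of sight of readers of `final`.
        shutil.copytree(local_pkg_dir, staged_pkg, copy_function=shutil.copyfile)
        # Publish with a rename that stays inside the target filesystem.
        os.rename(staged_pkg, final)
    finally:
        shutil.rmtree(staging, ignore_errors=True)
    return final
```

The sketch also assumes the package is not yet present in `target_lib`; replacing an existing installation would need additional steps.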


Best
Tomas

P.S.

about staged installation: 
https://developer.r-project.org/Blog/public/2019/02/14/staged-install/index.html





