[R-pkg-devel] R-extension requirement about third-party random number generators (RNG)

2024-09-27 Thread John Clarke
Hi folks,

Reference from the manual:

   "Nor should the C++11 random number library be used nor any other
   third-party random number generators such as those in GSL."


I am working on wrapping an existing statistical model with Rcpp so it can
be used in R. The model is written in C++ and uses the well-known Mersenne
Twister RNG C++ class. We instantiate 4 instances of the RNG to keep the
random variates independent.

I understand from the r-release manual that we should not use third-party
RNGs and should instead use R's interfaces to R's internal random number
generator. If this is a correct reading of the requirement for an R
package/extension, it will add quite a bit of complexity to the wrapper
because we'll probably have to pass in references to something like
*rstream* objects to ensure the streams/substreams remain independent. I
fear that it may also negatively affect the performance of the model
because (if I understand correctly) we need to maintain the state of a
single RNG engine and, due to the nature of our model, we don't
pre-generate all the random numbers but rather generate just a few before
switching to another stream. This would mean getting/setting the RNG state
with every single draw unless substantial changes were made to
batch-generate the numbers a priori.
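
To make the setup concrete, here is a minimal sketch of four locally
instantiated engines. It assumes std::mt19937 stands in for the model's own
Mersenne Twister class; ModelStreams, kStreams and the seeding scheme are
hypothetical names chosen for illustration, and distinct seeds alone do not
guarantee statistically independent streams.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical holder for the model's four random streams.
// std::mt19937 is an assumption standing in for the model's own RNG class.
class ModelStreams {
public:
    static constexpr std::size_t kStreams = 4;

    explicit ModelStreams(std::uint32_t base_seed) {
        for (std::size_t i = 0; i < kStreams; ++i) {
            // Mix the base seed with the stream index so that each engine
            // starts from a different state.
            std::seed_seq seq{base_seed, static_cast<std::uint32_t>(i)};
            engines_[i].seed(seq);
        }
    }

    // Draw one uniform variate from the requested stream.
    // Throws std::out_of_range if the stream index is invalid.
    double uniform(std::size_t stream) {
        return unif_(engines_.at(stream));
    }

private:
    std::array<std::mt19937, kStreams> engines_;
    std::uniform_real_distribution<double> unif_{0.0, 1.0};
};
```

Because every engine keeps its own state, switching between streams after
just a few draws needs no saving or restoring of a shared global state.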

Before we tackle this problem, I want to confirm that it is indeed
forbidden to use the Mersenne Twister C++ class in an Rcpp R package if we
want to publish it on CRAN. Then, if it is a requirement, and assuming I
want to use the *rstream* RNG package, how can I instantiate an rstream
object in C++? I've succeeded at instantiating it in R and passing it
through, but this isn't ideal because it breaks encapsulation/ease-of-use
for the user.

Thanks,

-John

[[alternative HTML version deleted]]

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] R-extension requirement about third-party random number generators (RNG)

2024-09-27 Thread John Clarke
Hi Dirk,

Thanks very much for your thoughtful reply. This is a relief for me as my
'solution' for replacing the existing RNG in my C++ was convoluted and I
couldn't see the advantage in doing so. Thank you for the search tip --
that is really helpful and I did not know that there was this kind of
mirroring. Thanks also for the mention of the rcpp-devel -- I think I will
use that instead next time.

One point of contention though -- I think the Writing R-Extensions page
could be more explicit in this regard and perhaps suggest times/examples
when it is OK to include an existing (3rd party) RNG and when it is not.

I also think that it would be helpful if R offered a way to instantiate
multiple instances of its internal RNG manager out of the box -- the RNG
state management strategy appears almost 'magical' to me especially
inside Rcpp. It is possible, I just don't understand how to use it.

Best,

-John

On Fri, Sep 27, 2024 at 4:19 PM Dirk Eddelbuettel wrote:

>
> Hi John,
>
> I think you are reading the text too literally. The intent of WRE is to
> ensure that standard use of a RNG in an extension package uses the RNGs
> that come with R (which include an updated Mersenne Twister algorithm) so
> that users are not "surprised". It explicitly mentions the problem of
> multiple seeding. An example of how this can be done is in RcppArmadillo:
> a long time ago we worked out a scheme where, in the R use case, the RNG
> is 'dropped in' so that when Armadillo code uses randu(5) you get what
> runif(5) in R would give you (given the same seed from R).
>
> On the other hand, when you know what you are doing and properly (locally)
> instantiate another PRNG for local use, you can. For quick checks I often
> use a query at GitHub in the 'cran' organisation mirroring the CRAN repos.
> For example, the following shows where std::mt19937 is used in C++ to
> deploy the Mersenne Twister. So as you can see, when done carefully it is
> in fact allowed.
>
>    https://github.com/search?q=org%3Acran%20mt19937&type=code
>
> Hope this helps, and allow me to mention that there is also the rcpp-devel
> for more Rcpp-specific questions.
>
> Cheers, Dirk
>
>
> --
> dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
>



Re: [R-pkg-devel] Rcpp: how best to include source from another Github repository

2025-02-03 Thread John Clarke
Thanks Dirk and others -- I see the tradeoffs now. In my case, I think a
simple copy of the source (just a few files) will do, plus maybe some sort
of commit/tag reference to indicate where/when the source is coming from.
-John

On Thu, Jan 23, 2025 at 3:47 PM Dirk Eddelbuettel wrote:

>
> On 21 January 2025 at 12:04, Jonathan Berrisch wrote:
> | first of all: I'm not an expert on this and don't really know if there
> | is a recommended way.
> |
> | However, you may look at my 'rcpptimer' package and how it includes
> | 'cpptimer' as a submodule.
> |
> | You can find the repository here: https://github.com/BerriJ/rcpptimer
>
> I had written a longer (private) email to John expressing the view that git
> submodules "were once more 'en vogue'" but one sees them less these days.
> One reason is that they break some (somewhat standard) workflows, see [1]
> below.
>
> Overall, this is a "no win" situation. You can include the files in the
> package as a copy [2], enlarging the package, build process, etc. but
> arguably making it more robust, or you can keep it external, which is
> cleaner -- but harder, as you now have to ensure users (and CRAN!) can
> get / have that library.
>
> So it is all tradeoffs one has to make.
>
> Dirk
>
>
> [1] Log from a standard r2u Ubuntu container, `git` and `ssh` added as needed:
>
> root@4163d5544547:/# installGithub.r https://github.com/BerriJ/rcpptimer
> Downloading GitHub repo BerriJ/rcpptimer@HEAD
> '/usr/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules g...@github.com:BerriJ/cpptimer.git /tmp/remotes257564d64e0/BerriJ-rcpptimer-35ca024/inst/include/cpptimer
> Cloning into '/tmp/remotes257564d64e0/BerriJ-rcpptimer-35ca024/inst/include/cpptimer'...
> The authenticity of host 'github.com (140.82.113.3)' can't be established.
> ED25519 key fingerprint is SHA256:+DiY3wvvV6TuJJhbpZisF/zLDA0zPMSvHdkr4UvCOqU.
> This key is not known by any other names.
> Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
> Warning: Permanently added 'github.com' (ED25519) to the list of known hosts.
> g...@github.com: Permission denied (publickey).
> fatal: Could not read from remote repository.
>
> Please make sure you have the correct access rights
> and the repository exists.
> Error: Failed to install 'rcpptimer' from GitHub:
>   Command failed (128)
> In addition: Warning message:
> In system(full, intern = TRUE, ignore.stderr = quiet) :
>   running command ''/usr/bin/git' clone --depth 1 --no-hardlinks --recurse-submodules g...@github.com:BerriJ/cpptimer.git /tmp/remotes257564d64e0/BerriJ-rcpptimer-35ca024/inst/include/cpptimer' had status 128 and error message 'Function not implemented'
> root@4163d5544547:/#
> exit
>
>
> [2] The "Rcpp-library" vignette John refers to also mentions (IIRC) that
> this is preferable for smaller libraries; its 'corels' example fits that
> description. These days other authors also vendor entire applications such
> as whole SQL engines: ¯\_(ツ)_/¯  I just updated qlcal on CRAN; it
> explicitly copies the calendaring (subset) from QuantLib as I learned over
> 20 years that users have difficulties with that large library. Tradeoffs.
>
> --
> dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org


Re: [R-pkg-devel] Is it possible to install a pre-compiled R package from Github?

2025-02-03 Thread John Clarke
Thanks Ivan, Iñaki, and Dirk for your answers -- I'm happy to know about
https://r-universe.dev/. I also deduce that I can compile my own binaries
and host them as releases in Github and have R users install from these
binaries as well. Kind regards, -John

On Tue, Jan 28, 2025 at 3:37 PM Ivan Krylov wrote:

> On Tue, 28 Jan 2025 14:25:23 +0100
> John Clarke wrote:
>
> > I'm wondering if there is a way to point an R package installer to a
> > pre-compiled release on Github rather than rely on CRAN.
>
> From the point of view of install.packages(), a repository is a
> collection of package files plus an index file arranged in a certain
> directory structure:
>
> https://cran.r-project.org/doc/manuals/R-admin.html#Setting-up-a-package-repository
>
> You can create these index files yourself using tools::write_PACKAGES()
> or the 'drat' package, then publish them on any free web hosting:
> https://search.r-project.org/R/refmans/tools/html/writePACKAGES.html
> https://cran.r-project.org/package=drat
>
> install.packages() will then be able to use this repository using
> either its contriburl=... or the repos=... argument.
>
> The above-mentioned R-Universe will, indeed, not only help you host
> your packages, but also build your source packages into binary packages
> for a number of platforms.
>
> --
> Best regards,
> Ivan
>



Re: [R-pkg-devel] Rcpp: how best to include source from another Github repository

2025-01-22 Thread John Clarke
Thanks Ivan, this is helpful. I'll do some more research. It would be nice
to have an Rcpp standard/recommended way to do this. I don't want to have a
non-standard ./src or ./data folder structure for my Rcpp package, but
these are the two relevant folders in my original repository. Maybe with
some sort of symlinking I could achieve this 'custom' folder/file mapping.
-John

On Tue, Jan 21, 2025 at 1:05 PM Ivan Krylov wrote:

> On Tue, 21 Jan 2025 11:57:46 +0100
> John Clarke wrote:
>
> > Ideally, it would be nice to be able to pull the files from the
> > source repo using a tag/hash so that the only code change in the Rcpp
> > repo would be that reference rather than all the changes to the
> > shared source.
>
> I've been using Git submodules for this purpose:
>
> https://codeberg.org/aitap/Ropj/src/branch/master/src
>
> https://git-scm.com/book/en/v2/Git-Tools-Submodules
>
> Every time the upstream changes I have to update the commit pointer in
> my repository too, but other than that, it's been working fine. My
> .Rbuildignore filters out all the unnecessary files included in the
> upstream repository, leaving only the relevant source code in the
> resulting source package.
>
> The resulting repository must be cloned with --recurse-submodules (or,
> if forgotten, must be initialised with git submodule update --init);
> further updates to the tracked commit pointer must be applied with git
> submodule update. If the referenced repository becomes unavailable, it
> will be impossible to build the package.
>
> --
> Best regards,
> Ivan
>



Re: [R-pkg-devel] Rcpp: how best to include source from another Github repository

2025-01-23 Thread John Clarke
Thank you Dirk and Thibault for your additional tips and ideas.

I took a look at the vignette for *rcppcorels* and noted that this is
exactly the model/pattern I used to create my package. It appears that
the /src files are simply copied from the external *corels* C++ library
into the rcppcorels /src folder and committed twice: once in the external
C++ repo and again in the rcppcorels repo. That is what I do now, but I
wasn't satisfied with the double commits, and I'd also really like users of
the R package to have a reference/version/tag to confirm that what they
have in the R package is the same as a particular release of the C++ repo.

But in the end, it appears that there is no easy (and robust) way to do
this other than copying the files (via script or manually) and re-committing
changes in both repos. I suppose I could add a reference to the tag from
the C++ library, but this step can easily get lost/forgotten and is a
definite DRY violation. Maybe an import/copy script could help with that.
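
One way such a tag reference might be carried along with the copied sources
is a small header that the import/copy script rewrites on every sync. This
is only a sketch under that assumption: the namespace upstream, the
constants kTag and kCommit, their placeholder values and the function
upstream_version() are all hypothetical names, not part of any existing
package.

```cpp
#include <string>

// Hypothetical provenance constants. An import/copy script would rewrite
// these two values each time the sources are copied from the C++ repo.
namespace upstream {
constexpr const char* kTag = "v0.0.0";      // placeholder release tag
constexpr const char* kCommit = "0000000";  // placeholder short commit hash
}  // namespace upstream

// Builds a human-readable provenance string that the wrapper could
// expose to R users so they can check which upstream release they have.
std::string upstream_version() {
    return std::string(upstream::kTag) + " (" + upstream::kCommit + ")";
}
```

Since the script writes the values, the reference cannot be forgotten as
long as the copy itself goes through the script.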

Thanks again for your help and suggestions.

-John

On Wed, Jan 22, 2025 at 8:53 PM Thibault Vatter wrote:

> There is a balance between DRY, safety, and customization needs. The
> symlinkish approach would be "dangerous" imo, because you can't guarantee
> the wrapper.cpp will stay compatible with changes in the underlying C++
> library.
>
> The submodule approach works well. Alternatives that I know of are:
>
>    - a script that pulls the latest sources in the standalone C++ library
>      and does things like adding a preprocessor macro, see e.g. rvinecopulib
>    - a "patches" folder with patch files in diff format (.patch or
>      .diff), see e.g. RcppEigen
>
> Either way, such scripts or patches folders have to be excluded from being
> put into the package via the .Rbuildignore.


Re: [R-pkg-devel] Rcpp: how best to include source from another Github repository

2025-01-23 Thread John Clarke
Thanks Jonathan, this is helpful. I'm just a bit concerned based on other
comments that submodules might not be compatible with CRAN and/or other CI
runners. Plus, I only want a few cpp files copied over.

On Thu, Jan 23, 2025 at 3:19 PM Jonathan Berrisch wrote:

> Hi John,
>
> first of all: I'm not an expert on this and don't really know if there
> is a recommended way.
>
> However, you may look at my 'rcpptimer' package and how it includes
> 'cpptimer' as a submodule.
>
> You can find the repository here: https://github.com/BerriJ/rcpptimer
>
> And here you can see 'cpptimer' which is included as a submodule:
> https://github.com/BerriJ/rcpptimer/tree/main/inst/include
>
> Whenever I make commits to 'cpptimer' I have to update the submodule
> (basically changing which commit of the submodule is included, via
> 'git submodule update --remote'). You'll find good documentation about
> submodules online.
>
> Maybe that helps.
>
> Best Regards,
>
> Jonathan / BerriJ


[R-pkg-devel] Is it possible to install a pre-compiled R package from Github?

2025-01-28 Thread John Clarke
Hi all,

I'm wondering if there is a way to point an R package installer to a
pre-compiled release on Github rather than rely on CRAN. I will likely use
CRAN, but I'm curious if installing via pre-compiled versions is limited to
CRAN or whether there is another way. This is related to an Rcpp project I'm
working on (so compiling C++), but I think the question is general enough
that it can be asked on this list.

Thank you,

-John

John Clarke | Senior Technical Advisor |
Cornerstone Systems Northwest | john.cla...@cornerstonenw.com



[R-pkg-devel] Rcpp: how best to include source from another Github repository

2025-01-21 Thread John Clarke
Hi folks,

I have an Rcpp package I'm developing. All but one of the cpp source code
files are pulled from the original/authoritative (CLI) version of the
application. The only unique cpp source code to the Rcpp package is my
wrapper.cpp which contains the Rcpp interface. This approach works fine,
but every time we make changes to the original CLI repository, it requires
a manual (and duplicate) commit to my Rcpp repo. This is not ideal from a
source tracking perspective and is a DRY violation.

What is the recommended way of maintaining the shared cpp code in an Rcpp
repo? Ideally, it would be nice to be able to pull the files from the
source repo using a tag/hash so that the only code change in the Rcpp repo
would be that reference rather than all the changes to the shared source.
It would also be nice to use some sort of symlinkish setup during
development to allow quick testing before making commits in the original
CLI repo. Is this possible? Recommended? (We can assume dev is on macOS or
Linux.)

Thanks,

-John

John Clarke | Senior Technical Advisor |
Cornerstone Systems Northwest | john.cla...@cornerstonenw.com



[R-pkg-devel] Retrieving versioned csv datasets for use in an R package

2025-02-14 Thread John Clarke
Hi folks,

I've looked around for this particular question, but haven't found a good
answer. I have a versioned dataset that includes about 6 csv files that
total about 15MB for each version. The versions get updated every few years
or so and are used to drive the model which was written in C++ but is now
inside an Rcpp wrapper. Apart from the fact that CRAN does not permit large
files, I want to have a better way for users to access particular versions
of the dataset.

Usage idea:
 # The following would hopefully also download the default/most recent
 # version of the csv files from CRAN (if allowed) or Github or some other
 # repository for academic open source data.
install.packages("MyPackage")
mypackage = new(MyPackage)

Then, if necessary, the user could change the dataset used with something
like:
mypackage$dataset("2.1.0")
which would retrieve new csv files if they haven't already been downloaded
and update the data_folder path internally to point to the 2.1.0 directory.
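
On the C++ side, the "already downloaded?" check behind such a call could
be sketched as follows. This is an illustration only: dataset_dir() and
dataset_is_cached() are hypothetical helper names, the
"<cache_root>/<version>" layout is an assumption, and the download itself
is left to the R side.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: builds "<cache_root>/<version>" for a dataset version.
std::string dataset_dir(const std::string& cache_root,
                        const std::string& version) {
    return cache_root + "/" + version;
}

// Returns true only if every expected csv file in `dir` can be opened for
// reading; a false result would signal that a download is still needed.
bool dataset_is_cached(const std::string& dir,
                       const std::vector<std::string>& files) {
    for (const auto& name : files) {
        std::ifstream in(dir + "/" + name);
        if (!in.good()) {
            return false;
        }
    }
    return true;
}
```

The wrapper would then only need to update data_folder to the returned
directory once the check passes.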

Requirements:
- The dataset is csv (not a R data object) and the Rcpp MyPackage expects
this format
- Would be nice to properly include citations for the data as they will
likely be initially released through a journal publication

What is the best practice for this sort of dataset management for a package
in R? Is it okay to use Github to store and version the data? Or is it
preferred to use an R package (ignoring the file size limit)? Or some other
open source data hosting? I see https://r-universe.dev/ as an option as
well. In any case, what is the proper mechanism for retrieving/caching the
data?

Thanks,

-John

John Clarke | Senior Technical Advisor |
Cornerstone Systems Northwest | john.cla...@cornerstonenw.com



Re: [R-pkg-devel] Retrieving versioned csv datasets for use in an R package

2025-02-14 Thread John Clarke
Thanks so much Rafael, I think piggyback is exactly what I was looking for.
I wonder if it is possible/best practice to include a call to it during the
install.packages('MyPackage') process so that the data is available prior
to running tests in the R CMD build Github Action (and also so that users
have the default/most recent dataset downloaded alongside the package).
-John

On Fri, Feb 14, 2025 at 4:08 PM Rafael H. M. Pereira <
rafa.pereira...@gmail.com> wrote:

> Hi John,
>
> There are different alternatives on where to host the data (e.g. OSF, a
> proprietary server, Github etc). The solution I've been adopting in most of
> my packages is to use a combination of a proprietary server and Github.
> So the data is first downloaded from our own server and only if our server
> is offline, then the download is redirected to Github. This is what I try
> to do so our packages do not overload Github. Of course, this creates some
> additional work from our side to make sure the files in our server are
> always mirrored on Github.
>
> A key point to pay attention to when hosting the data on Github is to host
> it as an attachment to a *release*. A good way to manage the files and
> releases is using the {piggyback} package, by Carl Boettiger et al at
> ROpenSci. The documentation of the package is a really great guide on how
> to host data on Github and it has some really convenient functions to
> create releases, upload and download files. Kudos to them!
> https://docs.ropensci.org/piggyback/
>
> Best,
>
> Rafael Pereira