[Rd] changes in R-devel and zero-extent objects in Rcpp

2024-06-08 Thread Ben Bolker



   A change to R-devel (SVN r86629 or 
https://github.com/r-devel/r-svn/commit/92c1d5de23c93576f55062e26d446feface07250 
has changed the handling of pointers to zero-length objects, leading to 
ASAN issues with a number of Rcpp-based packages (the commit message 
reads, in part, "Also define STRICT_TYPECHECK when compiling inlined.c.")


  I'm interested in discussion from the community.

  Details/diagnosis for the issues in the lme4 package are here: 
https://github.com/lme4/lme4/issues/794, with a bit more discussion 
about how zero-length objects should be handled.


  The short(ish) version is that r86629 enables the 
CATCH_ZERO_LENGTH_ACCESS definition. This turns on the CHKZLN macro 
, 
which returns a trivial pointer (rather than the data pointer that would 
be returned in the normal control flow) if an object has length 0:


/* Attempts to read or write elements of a zero length vector will
   result in a segfault, rather than read and write random memory.
   Returning NULL would be more natural, but Matrix seems to assume
   that even zero-length vectors have non-NULL data pointers, so
   return (void *) 1 instead. Zero-length CHARSXP objects still have a
   trailing zero byte so they are not handled. */

  In the Rcpp context this leads to an inconsistency, where `REAL(x)` 
is a 'real' external pointer and `DATAPTR(x)` is 0x1, which in turn 
leads to ASAN warnings like


runtime error: reference binding to misaligned address 0x0001 
for type 'const double', which requires 8 byte alignment

0x0001: note: pointer points here

   I'm in over my head and hoping for insight into whether this problem 
should be resolved by changing R, Rcpp, or downstream Rcpp packages ...


  cheers
   Ben Bolker

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] changes in R-devel and zero-extent objects in Rcpp

2024-06-08 Thread Kevin Ushey
IMHO, this should be changed in both Rcpp and downstream packages:

1. Rcpp could check for out-of-bounds accesses in cases like these, and
emit an R warning / error when such an access is detected;

2. The downstream packages unintentionally making these out-of-bounds
accesses should be fixed to avoid doing that.

That is, I think this is ultimately a bug in the affected packages, but
Rcpp could do better in detecting and handling this for client packages
(avoiding a segfault).

Best,
Kevin


On Sat, Jun 8, 2024, 3:06 PM Ben Bolker  wrote:

>
> A change to R-devel (SVN r86629 or
>
> https://github.com/r-devel/r-svn/commit/92c1d5de23c93576f55062e26d446feface07250
> has changed the handling of pointers to zero-length objects, leading to
> ASAN issues with a number of Rcpp-based packages (the commit message
> reads, in part, "Also define STRICT_TYPECHECK when compiling inlined.c.")
>
>I'm interested in discussion from the community.
>
>Details/diagnosis for the issues in the lme4 package are here:
> https://github.com/lme4/lme4/issues/794, with a bit more discussion
> about how zero-length objects should be handled.
>
>The short(ish) version is that r86629 enables the
> CATCH_ZERO_LENGTH_ACCESS definition. This turns on the CHKZLN macro
> <
> https://github.com/r-devel/r-svn/blob/4ef83b9dc3c6874e774195d329cbb6c11a71c414/src/main/memory.c#L4090-L4104>,
>
> which returns a trivial pointer (rather than the data pointer that would
> be returned in the normal control flow) if an object has length 0:
>
> /* Attempts to read or write elements of a zero length vector will
> result in a segfault, rather than read and write random memory.
> Returning NULL would be more natural, but Matrix seems to assume
> that even zero-length vectors have non-NULL data pointers, so
> return (void *) 1 instead. Zero-length CHARSXP objects still have a
> trailing zero byte so they are not handled. */
>
>In the Rcpp context this leads to an inconsistency, where `REAL(x)`
> is a 'real' external pointer and `DATAPTR(x)` is 0x1, which in turn
> leads to ASAN warnings like
>
> runtime error: reference binding to misaligned address 0x0001
> for type 'const double', which requires 8 byte alignment
> 0x0001: note: pointer points here
>
> I'm in over my head and hoping for insight into whether this problem
> should be resolved by changing R, Rcpp, or downstream Rcpp packages ...
>
>cheers
> Ben Bolker
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] changes in R-devel and zero-extent objects in Rcpp

2024-06-08 Thread Ben Bolker
  The ASAN errors occur *even if the zero-length object is not actually 
accessed*/is used in a perfectly correct manner, i.e. it's perfectly 
legal in base R to define `m <- numeric(0)` or `m <- matrix(nrow = 0, 
ncol = 0)`, whereas doing the equivalent in Rcpp will (now) lead to an 
ASAN error.


  i.e., these are *not* previously cryptic out-of-bounds accesses that 
are now being revealed, but instead sensible and previously legal 
definitions of zero-length objects that are now causing problems.


   I'm pretty sure I'm right about this, but it's absolutely possible 
that I'm just confused at this point; I don't have a super-simple 
example to show you at the moment. The closest is this example by Mikael 
Jagan: https://github.com/lme4/lme4/issues/794#issuecomment-2155093049


  which shows that if x is a pointer to a zero-length vector (in plain 
C++ for R, no Rcpp is involved), DATAPTR(x) and REAL(x) evaluate to 
different values.


  Mikael further points out that "Rcpp seems to cast a (void *) 
returned by DATAPTR to (double *) when constructing a Vector 
from a SEXP, rather than using the (double *) returned by REAL." So 
perhaps R-core doesn't want to guarantee that these operations give 
identical answers, in which case Rcpp will have to change the way it 
does things ...


  cheers
   Ben



On 2024-06-08 6:39 p.m., Kevin Ushey wrote:

IMHO, this should be changed in both Rcpp and downstream packages:

1. Rcpp could check for out-of-bounds accesses in cases like these, and 
emit an R warning / error when such an access is detected;


2. The downstream packages unintentionally making these out-of-bounds 
accesses should be fixed to avoid doing that.


That is, I think this is ultimately a bug in the affected packages, but 
Rcpp could do better in detecting and handling this for client packages 
(avoiding a segfault).


Best,
Kevin


On Sat, Jun 8, 2024, 3:06 PM Ben Bolker > wrote:



     A change to R-devel (SVN r86629 or
https://github.com/r-devel/r-svn/commit/92c1d5de23c93576f55062e26d446feface07250 

has changed the handling of pointers to zero-length objects, leading to
ASAN issues with a number of Rcpp-based packages (the commit message
reads, in part, "Also define STRICT_TYPECHECK when compiling
inlined.c.")

    I'm interested in discussion from the community.

    Details/diagnosis for the issues in the lme4 package are here:
https://github.com/lme4/lme4/issues/794
, with a bit more discussion
about how zero-length objects should be handled.

    The short(ish) version is that r86629 enables the
CATCH_ZERO_LENGTH_ACCESS definition. This turns on the CHKZLN macro

>,
which returns a trivial pointer (rather than the data pointer that
would
be returned in the normal control flow) if an object has length 0:

/* Attempts to read or write elements of a zero length vector will
     result in a segfault, rather than read and write random memory.
     Returning NULL would be more natural, but Matrix seems to assume
     that even zero-length vectors have non-NULL data pointers, so
     return (void *) 1 instead. Zero-length CHARSXP objects still have a
     trailing zero byte so they are not handled. */

    In the Rcpp context this leads to an inconsistency, where `REAL(x)`
is a 'real' external pointer and `DATAPTR(x)` is 0x1, which in turn
leads to ASAN warnings like

runtime error: reference binding to misaligned address 0x0001
for type 'const double', which requires 8 byte alignment
0x0001: note: pointer points here

     I'm in over my head and hoping for insight into whether this
problem
should be resolved by changing R, Rcpp, or downstream Rcpp packages ...

    cheers
     Ben Bolker

__
R-devel@r-project.org  mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel




--
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics
> E-mail is sent at my convenience; I don't expect replies outside of 
working hours.


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [External] Re: changes in R-devel and zero-extent objects in Rcpp

2024-06-08 Thread luke-tierney--- via R-devel

On Sat, 8 Jun 2024, Ben Bolker wrote:

 The ASAN errors occur *even if the zero-length object is not actually 
accessed*/is used in a perfectly correct manner, i.e. it's perfectly legal in 
base R to define `m <- numeric(0)` or `m <- matrix(nrow = 0, ncol = 0)`, 
whereas doing the equivalent in Rcpp will (now) lead to an ASAN error.


 i.e., these are *not* previously cryptic out-of-bounds accesses that are 
now being revealed, but instead sensible and previously legal definitions of 
zero-length objects that are now causing problems.


  I'm pretty sure I'm right about this, but it's absolutely possible that 
I'm just confused at this point; I don't have a super-simple example to show 
you at the moment. The closest is this example by Mikael Jagan: 
https://github.com/lme4/lme4/issues/794#issuecomment-2155093049


 which shows that if x is a pointer to a zero-length vector (in plain C++ 
for R, no Rcpp is involved), DATAPTR(x) and REAL(x) evaluate to different 
values.


 Mikael further points out that "Rcpp seems to cast a (void *) returned by 
DATAPTR to (double *) when constructing a Vector from a SEXP, rather 
than using the (double *) returned by REAL." So perhaps R-core doesn't want 
to guarantee that these operations give identical answers, in which case Rcpp 
will have to change the way it does things ...


It looks like REAL and friends should also get this check, but it's
not high priority at this point, at least to me. DATAPTR has been
using this check for a while in a barrier build, so you might want to
test there as well. I expect we will activate more integrity checks
from the barrier build on the API client side as things are tidied up.

However: DATAPTR is not in the API and can't be at least in this form:
It allows access to a writable pointer to STRSXP and VECSXP data and
that is too dangerous for memory manager integrity. I'm not sure
exactly how this will be resolve, but be prepared for changes.

Best,

luke



 cheers
  Ben



On 2024-06-08 6:39 p.m., Kevin Ushey wrote:

IMHO, this should be changed in both Rcpp and downstream packages:

1. Rcpp could check for out-of-bounds accesses in cases like these, and 
emit an R warning / error when such an access is detected;


2. The downstream packages unintentionally making these out-of-bounds 
accesses should be fixed to avoid doing that.


That is, I think this is ultimately a bug in the affected packages, but 
Rcpp could do better in detecting and handling this for client packages 
(avoiding a segfault).


Best,
Kevin


On Sat, Jun 8, 2024, 3:06 PM Ben Bolker > wrote:



     A change to R-devel (SVN r86629 or
https://github.com/r-devel/r-svn/commit/92c1d5de23c93576f55062e26d446feface07250 


has changed the handling of pointers to zero-length objects, leading to
ASAN issues with a number of Rcpp-based packages (the commit message
reads, in part, "Also define STRICT_TYPECHECK when compiling
inlined.c.")

    I'm interested in discussion from the community.

    Details/diagnosis for the issues in the lme4 package are here:
https://github.com/lme4/lme4/issues/794
, 
with a bit more discussion

about how zero-length objects should be handled.

    The short(ish) version is that r86629 enables the
CATCH_ZERO_LENGTH_ACCESS definition. This turns on the CHKZLN macro
>,

which returns a trivial pointer (rather than the data pointer that
would
be returned in the normal control flow) if an object has length 0:

/* Attempts to read or write elements of a zero length vector will
     result in a segfault, rather than read and write random memory.
     Returning NULL would be more natural, but Matrix seems to assume
     that even zero-length vectors have non-NULL data pointers, so
     return (void *) 1 instead. Zero-length CHARSXP objects still have 
a

     trailing zero byte so they are not handled. */

    In the Rcpp context this leads to an inconsistency, where `REAL(x)`
is a 'real' external pointer and `DATAPTR(x)` is 0x1, which in turn
leads to ASAN warnings like

runtime error: reference binding to misaligned address 0x0001
for type 'const double', which requires 8 byte alignment
0x0001: note: pointer points here

     I'm in over my head and hoping for insight into whether this
problem
should be resolved by changing R, Rcpp, or downstream Rcpp packages ...

    cheers
     Ben Bolker

__
R-devel@r-project.org  mailing list
https://stat.ethz

Re: [Rd] clarifying and adjusting the C API for R

2024-06-08 Thread Hiroaki Yutani
Thanks so much for your wonderful work, Luke!
I didn't expect such a clarification to happen this soon. This is really
great.

For convenience, I created a quick web page to search the result of
tools:::funAPI().

https://yutannihilation.github.io/R-fun-API/

Hope this helps those who are too lazy to install R-devel to check.

Best,
Yutani

2024年6月6日(木) 23:47 luke-tierney--- via R-devel :

> This is an update on some current work on the C API for use in R
> extensions.
>
> The internal R implementation makes use of tens of thousands of C
> entry points. On Linux and Windows, which support visibility
> restrictions, most of these are visible only within the R executble or
> shared library. About 1500 are not hidden and are visible to
> dynamically loaded shared libraries, such as ones in packages, and to
> embedding applications.
>
> There are two main reasons for limiting access to entry points in a
> software framework:
>
> - Some entry points are very easy to use in ways that corrupt internal
>data, leading to segfaults or, worse, incorrect computations without
>segfaults.
>
> - Some entry point expose internal structure and other implementation
>details, which makes it hard to make improvements without breaking
>client code that has come to depend on these details.
>
> The API of C entry points that can be used in R extensions, both for
> packages and embedding, has evolved organically over many years. The
> definition for the current release expressed in the Writing R
> Extensions manual (WRE) is roughly:
>
>  An entry point can be used if (1) it is declared in a header file
>  in R.home("include"), and (2) if it is documented for use in WRE.
>
> Ideally, (1) would be necessary and sufficient, but for a variety of
> reasons that isn't achievable, at least not in the near term. (2) can
> be challenging to determine; in particular, it is not amenable to a
> computational answer.
>
> An experimental effort is underway to add annotations to the WRE
> Texinfo source to allow (2) to be answered unambiguously. The
> annotations so far mostly reflect my reading or WRE and may be revised
> as they are reviewed by others. The annotated document can be used for
> programmatically identifying what is currently considered part of the C
> API. The result so far is an experimental function tools:::funAPI():
>
>  > head(tools:::funAPI())
>  nameloc apitype
>  1 Rf_AdobeSymbol2utf8 R_ext/GraphicsDevice.heapi
>  2alloc3DArrayWRE api
>  3  allocArrayWRE api
>  4   allocLangWRE api
>  5   allocListWRE api
>  6 allocMatrixWRE api
>
> The 'apitype' field has three possible levels
>
>  | api  | stable (ideally) API |
>  | eapi | experimental API |
>  | emb  | embedding API|
>
> Entry points in the embedded API would typically only be used in
> applications embedding R or providing new front ends, but might be
> reasonable to use in packages that support embedding.
>
> The 'loc' field indicates how the entry point is identified as part of
> an API: explicit mention in WRE, or declaration in a header file
> identified as fully part of an API.
>
> [tools:::funAPI() may not be completely accurate as it relies on
> regular expressions for examining header files considered part of the
> API rather than proper parsing. But it seems to be pretty close to
> what can be achieved with proper parsing.  Proper parsing would add
> dependencies on additional tools, which I would like to avoid for
> now. One dependency already present is that a C compiler has to be on
> the search path and cc -E has to run the C pre-processor.]
>
> Two additional experimental functions are available for analyzing
> package compliance: tools:::checkPkgAPI and tools:::checkAllPkgsAPI.
> These examine installed packages.
>
> [These may produce some false positives on macOS; they may or may not
> work on Windows at this point.]
>
> Using these tools initially showed around 200 non-API entry points
> used across packages on CRAN and BIOC. Ideally this number should be
> reduced to zero. This will require a combination of additions to the
> API and changes in packages.
>
> Some entry points can safely be added to the API. Around 40 have
> already been added to WRE with API annotations; another 40 or so can
> probably be added after review.
>
> The remainder mostly fall into two groups:
>
> - Entry points that should never be used in packages, such as
>SET_OBJECT or SETLENGTH (or any non-API SETXYZ functions for that
>matter) that can create inconsistent or corrupt internal state.
>
> - Entry points that depend on the existence of internal structure that
>might be subject to change, such as the existence of promise objects
>or internal structure of environments.
>
> Many, i

Re: [Rd] [External] Re: changes in R-devel and zero-extent objects in Rcpp

2024-06-08 Thread Hiroaki Yutani
Sorry to ask about a bit drifted topic, but will there be an alternative
API to DATAPTR?

> DATAPTR is not in the API and can't be at least in this form

I believe it's vital for ALTREP to return the pointer to the expanded
version of a SEXP just like the implementation in base R does [1].
At least, VECSXP has no other measure to expose the pointer if I understand
correctly.

Best,
Yutani

[1]:
https://github.com/r-devel/r-svn/blob/a3508b75d28164b0e5bcb2c87f816ce5169729a4/src/main/altclasses.c#L186


2024年6月9日(日) 10:43 luke-tierney--- via R-devel :

> On Sat, 8 Jun 2024, Ben Bolker wrote:
>
> >  The ASAN errors occur *even if the zero-length object is not actually
> > accessed*/is used in a perfectly correct manner, i.e. it's perfectly
> legal in
> > base R to define `m <- numeric(0)` or `m <- matrix(nrow = 0, ncol = 0)`,
> > whereas doing the equivalent in Rcpp will (now) lead to an ASAN error.
> >
> >  i.e., these are *not* previously cryptic out-of-bounds accesses that
> are
> > now being revealed, but instead sensible and previously legal
> definitions of
> > zero-length objects that are now causing problems.
> >
> >   I'm pretty sure I'm right about this, but it's absolutely possible
> that
> > I'm just confused at this point; I don't have a super-simple example to
> show
> > you at the moment. The closest is this example by Mikael Jagan:
> > https://github.com/lme4/lme4/issues/794#issuecomment-2155093049
> >
> >  which shows that if x is a pointer to a zero-length vector (in plain
> C++
> > for R, no Rcpp is involved), DATAPTR(x) and REAL(x) evaluate to
> different
> > values.
> >
> >  Mikael further points out that "Rcpp seems to cast a (void *) returned
> by
> > DATAPTR to (double *) when constructing a Vector from a SEXP,
> rather
> > than using the (double *) returned by REAL." So perhaps R-core doesn't
> want
> > to guarantee that these operations give identical answers, in which case
> Rcpp
> > will have to change the way it does things ...
>
> It looks like REAL and friends should also get this check, but it's
> not high priority at this point, at least to me. DATAPTR has been
> using this check for a while in a barrier build, so you might want to
> test there as well. I expect we will activate more integrity checks
> from the barrier build on the API client side as things are tidied up.
>
> However: DATAPTR is not in the API and can't be at least in this form:
> It allows access to a writable pointer to STRSXP and VECSXP data and
> that is too dangerous for memory manager integrity. I'm not sure
> exactly how this will be resolve, but be prepared for changes.
>
> Best,
>
> luke
>
> >
> >  cheers
> >   Ben
> >
> >
> >
> > On 2024-06-08 6:39 p.m., Kevin Ushey wrote:
> >> IMHO, this should be changed in both Rcpp and downstream packages:
> >>
> >> 1. Rcpp could check for out-of-bounds accesses in cases like these, and
> >> emit an R warning / error when such an access is detected;
> >>
> >> 2. The downstream packages unintentionally making these out-of-bounds
> >> accesses should be fixed to avoid doing that.
> >>
> >> That is, I think this is ultimately a bug in the affected packages, but
> >> Rcpp could do better in detecting and handling this for client packages
> >> (avoiding a segfault).
> >>
> >> Best,
> >> Kevin
> >>
> >>
> >> On Sat, Jun 8, 2024, 3:06 PM Ben Bolker  >> > wrote:
> >>
> >>
> >>  A change to R-devel (SVN r86629 or
> >>
> https://github.com/r-devel/r-svn/commit/92c1d5de23c93576f55062e26d446feface07250
> >> <
> https://github.com/r-devel/r-svn/commit/92c1d5de23c93576f55062e26d446feface07250
> >
> >> has changed the handling of pointers to zero-length objects,
> leading to
> >> ASAN issues with a number of Rcpp-based packages (the commit message
> >> reads, in part, "Also define STRICT_TYPECHECK when compiling
> >> inlined.c.")
> >>
> >> I'm interested in discussion from the community.
> >>
> >> Details/diagnosis for the issues in the lme4 package are here:
> >> https://github.com/lme4/lme4/issues/794
> >> ,
> >> with a bit more discussion
> >> about how zero-length objects should be handled.
> >>
> >> The short(ish) version is that r86629 enables the
> >> CATCH_ZERO_LENGTH_ACCESS definition. This turns on the CHKZLN macro
> >> <
> https://github.com/r-devel/r-svn/blob/4ef83b9dc3c6874e774195d329cbb6c11a71c414/src/main/memory.c#L4090-L4104
> >> <
> https://github.com/r-devel/r-svn/blob/4ef83b9dc3c6874e774195d329cbb6c11a71c414/src/main/memory.c#L4090-L4104
> >>,
> >> which returns a trivial pointer (rather than the data pointer that
> >> would
> >> be returned in the normal control flow) if an object has length 0:
> >>
> >> /* Attempts to read or write elements of a zero length vector will
> >>  result in a segfault, rather than read and write random memory.
> >>  Returning NULL would be more natural, but Matrix se