Re: [Rd] Inconsistent behavior for the C AP's R_ParseVector() ?

2019-12-14 Thread Laurent Gautier
Le lun. 9 déc. 2019 à 09:57, Tomas Kalibera  a
écrit :

> On 12/9/19 2:54 PM, Laurent Gautier wrote:
>
>
>
> Le lun. 9 déc. 2019 à 05:43, Tomas Kalibera  a
> écrit :
>
>> On 12/7/19 10:32 PM, Laurent Gautier wrote:
>>
>> Thanks for the quick response Tomas.
>>
>> The same error is indeed happening when trying to have a zero-length
>> variable name in an environment. The surprising bit is then "why is this
>> happening during parsing" (that is why are variables assigned to an
>> environment) ?
>>
>> The emitted R error (in the R console) is not a parse (syntax) error, but
>> an error emitted during parsing when the parser tries to intern a name -
>> look it up in a symbol table. Empty string is not allowed as a symbol name,
>> and hence the error. In the call "list(''=1)" , the empty name is what
>> could eventually become a name of a local variable inside list(), even
>> though not yet during parsing.
>>
>
> Thanks Tomas.
>
> I guess this has do with R expressions being lazily evaluated, and names
> of arguments in a call are also part of the expression. Now the puzzling
> part is why is that at all part of the parsing: I would have expected
> R_ParseVector() to be restricted to parsing... Now it feels like
> R_ParseVector() is performing parsing, and a first level of evalution for
> expressions that "should never work" (the empty name).
>
> Think of it as an exception in say Python. Some failures during parsing
> result in an exception (called error in R and implemented using a long
> jump). Any time you are calling into R you can get an error; out of memory
> is also signalled as R error.
>


The surprising bit for me was that I had expected the function to solely
perform parsing. I did expect an exception (and a jmp smashing the stack)
when the function concerned is in the C-API, is parsing a string, and is
using a parameter (pointer) to store whether parsing was a failure or a
success.

Since you are making a comparison with Python, the distinction I am making
between parsing and evaluation seem to apply there. For example:

```
>>> import parser
>>> parser.expr('1+')
  Traceback (most recent call last):
  File "", line 1, in 
  File "", line 1
1+
 ^
SyntaxError: unexpected EOF while parsing
>>> p = parser.expr('list(""=1)')
>>> p

>>> eval(p)
Traceback (most recent call last):
  File "", line 1, in 
TypeError: eval() arg 1 must be a string, bytes or code object

>>> list(""=1)
  File "", line 1
SyntaxError: keyword can't be an expression
```


> There is probably some error in how the external code is handling R
>> errors  (Fatal error: unable to initialize the JIT, stack smashing, etc)
>> and possibly also how R is initialized before calling ParseVector. Probably
>> you would get the same problem when running say "stop('myerror')". Please
>> note R errors are implemented as long-jumps, so care has to be taken when
>> calling into R, Writing R Extensions has more details (and section 8
>> specifically about embedding R). This is unlike parse (syntax) errors
>> signaled via return value to ParseVector()
>>
>
> The issue is that the segfault (because of stack smashing, therefore
> because of what also suspected to be an incontrolled jump) is happening
> within the execution of R_ParseVector(). I would think that an issue with
> the initialization of R is less likely because the project is otherwise
> used a fair bit and is well covered by automated continuous tests.
>
> After looking more into R's gram.c I suspect that an execution context is
> required for R_ParseVector() to know to properly work (know where to jump
> in case of error) when the parsing code decides to fail outside what it
> thinks is a syntax error. If the case, this would make R_ParseVector()
> function well when called from say, a C-extension to an R package, but fail
> the way I am seeing it fail when called from an embedded R.
>
> Yes, contexts are used internally to handle errors. For external use
> please see Writing R Extensions, section 6.12.
>

I have wrapped my call to R_ParseVector() in a R_tryCatchError(), and this
is seems to help me overcome the issue. Thanks for the pointer.

Best,


Laurent


> Best
> Tomas
>
>
> Best,
>
> Laurent
>
>> Best,
>> Tomas
>>
>>
>> We are otherwise aware that the error is not occurring in the R console,
>> but can be traced to a call to R_ParseVector() in R's C API:(
>> https://github.com/rpy2/rpy2/blob/master/rpy2/rinterface_lib/_rinterface_capi.py#L509
>> ).
>>
>> Our specific setup is calling an embedded R from Python, using the cffi
>> library. An error on end was the first possibility considered, but the
>> puzzling specificity of the error (as shown below other parsing errors are
>> handled properly) and the difficulty tracing what is in happening in
>> R_ParseVector() made me ask whether someone on this list had a suggestion
>> about the possible issue"
>>
>> ```
>>
>> >>> import rpy2.rinterface as ri>>> ri.initr()>>> e = ri.parse("list(''=1+") 
>> >>> -

Re: [Rd] Inconsistent behavior for the C AP's R_ParseVector() ?

2019-12-14 Thread Simon Urbanek
Laurent,

the main point here is that ParseVector() just like any other R API has to be 
called in a correct context since it can raise errors so the issue was that 
your C code has a bug of not setting R correctly (my guess would be your'e not 
creating the initial context necessary in embedded R). There are many different 
errors, your is just one of many that can occur - any R API call that does 
allocation (and parsing obviously does) can cause errors. Note that this is 
true for pretty much all R API functions.

Cheers,
Simon



> On Dec 14, 2019, at 11:25 AM, Laurent Gautier  wrote:
> 
> Le lun. 9 déc. 2019 à 09:57, Tomas Kalibera  a
> écrit :
> 
>> On 12/9/19 2:54 PM, Laurent Gautier wrote:
>> 
>> 
>> 
>> Le lun. 9 déc. 2019 à 05:43, Tomas Kalibera  a
>> écrit :
>> 
>>> On 12/7/19 10:32 PM, Laurent Gautier wrote:
>>> 
>>> Thanks for the quick response Tomas.
>>> 
>>> The same error is indeed happening when trying to have a zero-length
>>> variable name in an environment. The surprising bit is then "why is this
>>> happening during parsing" (that is why are variables assigned to an
>>> environment) ?
>>> 
>>> The emitted R error (in the R console) is not a parse (syntax) error, but
>>> an error emitted during parsing when the parser tries to intern a name -
>>> look it up in a symbol table. Empty string is not allowed as a symbol name,
>>> and hence the error. In the call "list(''=1)" , the empty name is what
>>> could eventually become a name of a local variable inside list(), even
>>> though not yet during parsing.
>>> 
>> 
>> Thanks Tomas.
>> 
>> I guess this has do with R expressions being lazily evaluated, and names
>> of arguments in a call are also part of the expression. Now the puzzling
>> part is why is that at all part of the parsing: I would have expected
>> R_ParseVector() to be restricted to parsing... Now it feels like
>> R_ParseVector() is performing parsing, and a first level of evalution for
>> expressions that "should never work" (the empty name).
>> 
>> Think of it as an exception in say Python. Some failures during parsing
>> result in an exception (called error in R and implemented using a long
>> jump). Any time you are calling into R you can get an error; out of memory
>> is also signalled as R error.
>> 
> 
> 
> The surprising bit for me was that I had expected the function to solely
> perform parsing. I did expect an exception (and a jmp smashing the stack)
> when the function concerned is in the C-API, is parsing a string, and is
> using a parameter (pointer) to store whether parsing was a failure or a
> success.
> 
> Since you are making a comparison with Python, the distinction I am making
> between parsing and evaluation seem to apply there. For example:
> 
> ```
 import parser
 parser.expr('1+')
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "", line 1
>1+
> ^
> SyntaxError: unexpected EOF while parsing
 p = parser.expr('list(""=1)')
 p
> 
 eval(p)
> Traceback (most recent call last):
>  File "", line 1, in 
> TypeError: eval() arg 1 must be a string, bytes or code object
> 
 list(""=1)
>  File "", line 1
> SyntaxError: keyword can't be an expression
> ```
> 
> 
>> There is probably some error in how the external code is handling R
>>> errors  (Fatal error: unable to initialize the JIT, stack smashing, etc)
>>> and possibly also how R is initialized before calling ParseVector. Probably
>>> you would get the same problem when running say "stop('myerror')". Please
>>> note R errors are implemented as long-jumps, so care has to be taken when
>>> calling into R, Writing R Extensions has more details (and section 8
>>> specifically about embedding R). This is unlike parse (syntax) errors
>>> signaled via return value to ParseVector()
>>> 
>> 
>> The issue is that the segfault (because of stack smashing, therefore
>> because of what also suspected to be an incontrolled jump) is happening
>> within the execution of R_ParseVector(). I would think that an issue with
>> the initialization of R is less likely because the project is otherwise
>> used a fair bit and is well covered by automated continuous tests.
>> 
>> After looking more into R's gram.c I suspect that an execution context is
>> required for R_ParseVector() to know to properly work (know where to jump
>> in case of error) when the parsing code decides to fail outside what it
>> thinks is a syntax error. If the case, this would make R_ParseVector()
>> function well when called from say, a C-extension to an R package, but fail
>> the way I am seeing it fail when called from an embedded R.
>> 
>> Yes, contexts are used internally to handle errors. For external use
>> please see Writing R Extensions, section 6.12.
>> 
> 
> I have wrapped my call to R_ParseVector() in a R_tryCatchError(), and this
> is seems to help me overcome the issue. Thanks for the pointer.
> 
> Best,
> 
> 
> Laurent
> 
> 
>> Best
>> Tomas
>> 
>> 
>> Best,
>> 
>> Laurent
>> 
>>> Best,
>>> Tomas
>

Re: [Rd] Inconsistencies in wilcox.test

2019-12-14 Thread Martin Maechler
> Martin Maechler 
> on Thu, 12 Dec 2019 17:20:47 +0100 writes:

> Karolis Koncevičius 
> on Mon, 9 Dec 2019 23:43:36 +0200 writes:

>> So I tried adding Infinity support for all cases.  And it
>> is (as could be expected) more complicated than I
>> thought.

> "Of course !"  Thank you, Karolis, in any case!

>> It is easy to add Inf support for the test. The problems
>> start with conf.int=TRUE.

>> Currently confidence intervals are computed via
>> `uniroot()` and, in the case of infinities, we are
>> computationally looking for roots over infinite interval
>> which results in an error. I suspect this is the reason
>> Inf values were removed in the first place.

> Maybe. It's still wrong to be done "up front".  I'm sure
> 98% (or so ;-) of all calls to wilcox.test() do *not* use
> conf.int = TRUE


>> Just a note, I found a few more errors/inconsistencies
>> when requesting confidence intervals with paired=TRUE
>> (due to Infinities being left in).

>> Current error in Inf-Inf scenario:

>> wilcox.test(c(1,2,Inf), c(4,8,Inf), paired=TRUE,
>> conf.int=TRUE) Error in if (ZEROES) x <- x[x != 0] :
>> missing value where TRUE/FALSE needed

> Good catch .. notably as it also happens when
> conf.int=FALSE as by default.  My version of wilcox.test()
> now does give the same as when the to 'Inf' are left away.

>> NaN confidence intervals:

>> wilcox.test(c(1:9,Inf), c(21:28,Inf,30), paired=TRUE,
>> conf.int=TRUE)

>> Wilcoxon signed rank test with continuity correction

>> data: c(1:9, Inf) and c(21:28, Inf, 30) V = 9.5, p-value
>> = 0.0586 alternative hypothesis: true location shift is
>> not equal to 0 0 percent confidence interval: NaN NaN
>> sample estimates: midrange NaN

> I don't see a big problem here. The NaN's in some sense
> show the best that can be computed with this data.
> Statistic and P-value, but no conf.int.


>> The easiest "fix" for consistency would be to simply
>> remove Infinity support from the paired=TRUE case.

> I strongly disagree.  We are not pruning good
> functionality just for some definition of consistency.

>> But going with the more desirable approach of adding
>> Infinity support for non-paired cases - it is currently
>> not clear to me what confidence intervals and
>> pseudomedian should be. Specially when Infinities are on
>> both sides.

> I deem that not to be a big deal.  They are not defined
> given the default formulas and that is reflected by NA /
> NaN in those parts of the result.

>> Regards, Karolis Koncevičius.

> But I have also spent a few hours now on
> wilcox.test.default() behavior notably also looking at the
> "rounding" / "machine precision" situation, and also on
> your remark that the 'method: ...' does not indicate well
> enough what was computed.

> In my (not yet committed) but hereby proposed enhancement
> of wilcox.test(), I have a new argument, 'digits.rank =
> Inf' (the default 'Inf' corresponding to the current
> behavior) with help page documentation:

> digits.rank: a number; if finite, ‘rank(signif(r,
> digits.rank))’ will be used to compute ranks for the test
> statistic instead of (the default) ‘rank(r)’.

> and then in 'Details :'

>  For stability reasons, it may be advisable to use
> rounded data or to set ‘digits.rank = 7’, say, such that
> determination of ties does not depend on very small
> numeric differences (see the example).

> and then in 'Examples: '

>  ## accuracy in ties determination via 'digits.rank':
> wilcox.test( 4:2, 3:1, paired=TRUE) # Warning: cannot
> compute exact p-value with ties wilcox.test((4:2)/10,
> (3:1)/10, paired=TRUE) # no ties => *no* warning
> wilcox.test((4:2)/10, (3:1)/10, paired=TRUE, digits.rank =
> 9) # same ties as (4:2, 3:1)
 
> --

> Lastly, I propose to replace "test" by "exact test" in the
> 'method' component (and print out) in case exact
> computations were used.  This information should be part
> of the returned "htest" object, and not only visible from
> the arguments and warnings that are printed during the
> computations.  This last change is in some sense the "most
> back-incompatible" change of these, because many
> wilcox.test() printouts would slightly change, e.g.,

>> w0 <- wilcox.test( 1:5, 4*(0:4), paired=TRUE)

> Wilcoxon signed rank exact test

>   data: 1:5 and 4 * (0:4) V = 1, p-value = 0.125
> alternative hypothesis: true location shift is not equal
> to 0

> where before (in R <= 3.6.x) it is just

> Wilcoxon signed rank test

>   data: .  ...  ...

> but I think R 4.0.0 is a good occasion for such a change.

> Constructive

Re: [Rd] Inconsistent behavior for the C AP's R_ParseVector() ?

2019-12-14 Thread Laurent Gautier
Hi Simon,

Widespread errors would have caught my earlier as the way that code is
using only one initialization of the embedded R, is used quite a bit, and
is covered by quite a few unit tests. This is the only situation I am aware
of in which an error occurs.

What is a "correct context", or initial context, the code should from ?
Searching for "context" in the R-exts manual does not return much.

Best,

Laurent


Le sam. 14 déc. 2019 à 12:20, Simon Urbanek  a
écrit :

> Laurent,
>
> the main point here is that ParseVector() just like any other R API has to
> be called in a correct context since it can raise errors so the issue was
> that your C code has a bug of not setting R correctly (my guess would be
> your'e not creating the initial context necessary in embedded R). There are
> many different errors, your is just one of many that can occur - any R API
> call that does allocation (and parsing obviously does) can cause errors.
> Note that this is true for pretty much all R API functions.
>
> Cheers,
> Simon
>
>
>
> > On Dec 14, 2019, at 11:25 AM, Laurent Gautier 
> wrote:
> >
> > Le lun. 9 déc. 2019 à 09:57, Tomas Kalibera  a
> > écrit :
> >
> >> On 12/9/19 2:54 PM, Laurent Gautier wrote:
> >>
> >>
> >>
> >> Le lun. 9 déc. 2019 à 05:43, Tomas Kalibera 
> a
> >> écrit :
> >>
> >>> On 12/7/19 10:32 PM, Laurent Gautier wrote:
> >>>
> >>> Thanks for the quick response Tomas.
> >>>
> >>> The same error is indeed happening when trying to have a zero-length
> >>> variable name in an environment. The surprising bit is then "why is
> this
> >>> happening during parsing" (that is why are variables assigned to an
> >>> environment) ?
> >>>
> >>> The emitted R error (in the R console) is not a parse (syntax) error,
> but
> >>> an error emitted during parsing when the parser tries to intern a name
> -
> >>> look it up in a symbol table. Empty string is not allowed as a symbol
> name,
> >>> and hence the error. In the call "list(''=1)" , the empty name is what
> >>> could eventually become a name of a local variable inside list(), even
> >>> though not yet during parsing.
> >>>
> >>
> >> Thanks Tomas.
> >>
> >> I guess this has do with R expressions being lazily evaluated, and names
> >> of arguments in a call are also part of the expression. Now the puzzling
> >> part is why is that at all part of the parsing: I would have expected
> >> R_ParseVector() to be restricted to parsing... Now it feels like
> >> R_ParseVector() is performing parsing, and a first level of evalution
> for
> >> expressions that "should never work" (the empty name).
> >>
> >> Think of it as an exception in say Python. Some failures during parsing
> >> result in an exception (called error in R and implemented using a long
> >> jump). Any time you are calling into R you can get an error; out of
> memory
> >> is also signalled as R error.
> >>
> >
> >
> > The surprising bit for me was that I had expected the function to solely
> > perform parsing. I did expect an exception (and a jmp smashing the stack)
> > when the function concerned is in the C-API, is parsing a string, and is
> > using a parameter (pointer) to store whether parsing was a failure or a
> > success.
> >
> > Since you are making a comparison with Python, the distinction I am
> making
> > between parsing and evaluation seem to apply there. For example:
> >
> > ```
>  import parser
>  parser.expr('1+')
> >  Traceback (most recent call last):
> >  File "", line 1, in 
> >  File "", line 1
> >1+
> > ^
> > SyntaxError: unexpected EOF while parsing
>  p = parser.expr('list(""=1)')
>  p
> > 
>  eval(p)
> > Traceback (most recent call last):
> >  File "", line 1, in 
> > TypeError: eval() arg 1 must be a string, bytes or code object
> >
>  list(""=1)
> >  File "", line 1
> > SyntaxError: keyword can't be an expression
> > ```
> >
> >
> >> There is probably some error in how the external code is handling R
> >>> errors  (Fatal error: unable to initialize the JIT, stack smashing,
> etc)
> >>> and possibly also how R is initialized before calling ParseVector.
> Probably
> >>> you would get the same problem when running say "stop('myerror')".
> Please
> >>> note R errors are implemented as long-jumps, so care has to be taken
> when
> >>> calling into R, Writing R Extensions has more details (and section 8
> >>> specifically about embedding R). This is unlike parse (syntax) errors
> >>> signaled via return value to ParseVector()
> >>>
> >>
> >> The issue is that the segfault (because of stack smashing, therefore
> >> because of what also suspected to be an incontrolled jump) is happening
> >> within the execution of R_ParseVector(). I would think that an issue
> with
> >> the initialization of R is less likely because the project is otherwise
> >> used a fair bit and is well covered by automated continuous tests.
> >>
> >> After looking more into R's gram.c I suspect that an execution context
> is
> >> required for R_ParseVector() to know to properly work (kno

Re: [Rd] Inconsistent behavior for the C AP's R_ParseVector() ?

2019-12-14 Thread Simon Urbanek
Laurent,


> On Dec 14, 2019, at 5:29 PM, Laurent Gautier  wrote:
> 
> Hi Simon,
> 
> Widespread errors would have caught my earlier as the way that code is
> using only one initialization of the embedded R, is used quite a bit, and
> is covered by quite a few unit tests. This is the only situation I am aware
> of in which an error occurs.
> 

It may or may not be "widespread" - almost all R API functions can raise errors 
(e.g., unable to allocate). You'll only find out once they do and that's too 
late ;).


> What is a "correct context", or initial context, the code should from ?
> Searching for "context" in the R-exts manual does not return much.
> 

It depends which embedded API use - see R-ext 8.1 the two options are 
run_Rmainloop() and R_ReplDLLinit() which both setup the top-level context with 
SETJMP. If you don't use either then you have to use one of the advanced R APIs 
that do it such as R_ToplevelExec() or R_UnwindProtect(), otherwise your point 
to abort to on error doesn't exist. Embedding R is much more complex than many 
think ...

Cheers,
Simon



> Best,
> 
> Laurent
> 
> 
> Le sam. 14 déc. 2019 à 12:20, Simon Urbanek  a
> écrit :
> 
>> Laurent,
>> 
>> the main point here is that ParseVector() just like any other R API has to
>> be called in a correct context since it can raise errors so the issue was
>> that your C code has a bug of not setting R correctly (my guess would be
>> your'e not creating the initial context necessary in embedded R). There are
>> many different errors, your is just one of many that can occur - any R API
>> call that does allocation (and parsing obviously does) can cause errors.
>> Note that this is true for pretty much all R API functions.
>> 
>> Cheers,
>> Simon
>> 
>> 
>> 
>>> On Dec 14, 2019, at 11:25 AM, Laurent Gautier 
>> wrote:
>>> 
>>> Le lun. 9 déc. 2019 à 09:57, Tomas Kalibera  a
>>> écrit :
>>> 
 On 12/9/19 2:54 PM, Laurent Gautier wrote:
 
 
 
 Le lun. 9 déc. 2019 à 05:43, Tomas Kalibera 
>> a
 écrit :
 
> On 12/7/19 10:32 PM, Laurent Gautier wrote:
> 
> Thanks for the quick response Tomas.
> 
> The same error is indeed happening when trying to have a zero-length
> variable name in an environment. The surprising bit is then "why is
>> this
> happening during parsing" (that is why are variables assigned to an
> environment) ?
> 
> The emitted R error (in the R console) is not a parse (syntax) error,
>> but
> an error emitted during parsing when the parser tries to intern a name
>> -
> look it up in a symbol table. Empty string is not allowed as a symbol
>> name,
> and hence the error. In the call "list(''=1)" , the empty name is what
> could eventually become a name of a local variable inside list(), even
> though not yet during parsing.
> 
 
 Thanks Tomas.
 
 I guess this has do with R expressions being lazily evaluated, and names
 of arguments in a call are also part of the expression. Now the puzzling
 part is why is that at all part of the parsing: I would have expected
 R_ParseVector() to be restricted to parsing... Now it feels like
 R_ParseVector() is performing parsing, and a first level of evalution
>> for
 expressions that "should never work" (the empty name).
 
 Think of it as an exception in say Python. Some failures during parsing
 result in an exception (called error in R and implemented using a long
 jump). Any time you are calling into R you can get an error; out of
>> memory
 is also signalled as R error.
 
>>> 
>>> 
>>> The surprising bit for me was that I had expected the function to solely
>>> perform parsing. I did expect an exception (and a jmp smashing the stack)
>>> when the function concerned is in the C-API, is parsing a string, and is
>>> using a parameter (pointer) to store whether parsing was a failure or a
>>> success.
>>> 
>>> Since you are making a comparison with Python, the distinction I am
>> making
>>> between parsing and evaluation seem to apply there. For example:
>>> 
>>> ```
>> import parser
>> parser.expr('1+')
>>> Traceback (most recent call last):
>>> File "", line 1, in 
>>> File "", line 1
>>>   1+
>>>^
>>> SyntaxError: unexpected EOF while parsing
>> p = parser.expr('list(""=1)')
>> p
>>> 
>> eval(p)
>>> Traceback (most recent call last):
>>> File "", line 1, in 
>>> TypeError: eval() arg 1 must be a string, bytes or code object
>>> 
>> list(""=1)
>>> File "", line 1
>>> SyntaxError: keyword can't be an expression
>>> ```
>>> 
>>> 
 There is probably some error in how the external code is handling R
> errors  (Fatal error: unable to initialize the JIT, stack smashing,
>> etc)
> and possibly also how R is initialized before calling ParseVector.
>> Probably
> you would get the same problem when running say "stop('myerror')".
>> Please
> note R errors are implemented as long-jumps, so care has to be taken
>> w