[Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-20 Thread Michael Chirico
Currently, substring defaults to last=100L, which strongly
suggests the intent is to default to "nchar(x)" without having to
compute/allocate that up front.

Unfortunately, this default makes no sense for "very large" strings
which may exceed 100L in "width".

The max width of a string is .Machine$integer.max-1:

# works
x = strrep(" ", .Machine$integer.max-1L)
# fails
x = strrep(" ", .Machine$integer.max)
Error in strrep(" ", .Machine$integer.max) :
  'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)

(see also the comment in src/main/character.c: "Character strings in R
are less than 2^31-1 bytes, so we use int not size_t.")

So it seems to me either .Machine$integer.max or
.Machine$integer.max-1L would be a more sensible default. Am I missing
something?

Mike C

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-20 Thread brodie gaslam via R-devel


> On Sunday, June 20, 2021, 6:21:22 PM EDT, Michael Chirico 
>  wrote:
>
> Currently, substring defaults to last=100L, which strongly
> suggests the intent is to default to "nchar(x)" without having to
> compute/allocate that up front.
>
> Unfortunately, this default makes no sense for "very large" strings
> which may exceed 100L in "width".
>
> The max width of a string is .Machine$integer.max-1:

I think the max width is .Machine$integer.max.  What happened below is a
bug due to buffer overflow in `strrep`:

> # works
> x = strrep(" ", .Machine$integer.max-1L)
> # fails
> x = strrep(" ", .Machine$integer.max)
> Error in strrep(" ", .Machine$integer.max) :
>   'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)

Notice the very large number that was tried to be Calloc'ed.  That's
(size_t) -1.

The problem is (src/include/R_ext/RS.h@85):

    #define CallocCharBuf(n) (char *) R_chk_calloc((R_SIZE_T) ((n)+1), 
sizeof(char))

The `((n) + 1)` overflows `int` and produces -1 (well, undefined behavior
so who knows), which when cast to size_t produces that very large number
which can't be allocated.

I think this should be:

    #define CallocCharBuf(n) (char *) R_chk_calloc(((R_SIZE_T)(n))+1, 
sizeof(char))

I can reproduce the failure before the change.  After the change I get:

    > x = strrep(" ", .Machine$integer.max)
    Error in strrep(" ", .Machine$integer.max) :
  'Calloc' could not allocate memory (2147483648 of 1 bytes)

I believe this to be the expected result on a machine that doesn't have
enough memory to allocate INT_MAX + 1 bytes, as happens to be the case on
my R build system (it's a VM that gets 2GB total as the host machine can
barely spare that to begin with).

> (see also the comment in src/main/character.c: "Character strings in R
> are less than 2^31-1 bytes, so we use int not size_t.")

FWIW WRE states:

> Note that R character strings are restricted to 2^31 - 1 bytes

This is INT_MAX or .Machine$integer.max, at least on machines for which
`int` is 32 bits, which I think typical for machines R builds on.   From
having looked at the code a while ago I think WRE is right (so maybe the
comment in the code is wrong), but it was a while ago and I haven't tried
to allocate an INT_MAX long string.

Sorry this doesn't answer your original question.

Best,

Brodie.

>
>
> So it seems to me either .Machine$integer.max or
> .Machine$integer.max-1L would be a more sensible default. Am I missing
> something?
>
> Mike C
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-20 Thread brodie gaslam via R-devel
> On Sunday, June 20, 2021, 9:29:28 PM EDT, brodie gaslam via R-devel 
>  wrote:
>
>> On Sunday, June 20, 2021, 6:21:22 PM EDT, Michael Chirico 
>>  wrote:
>>
>> The max width of a string is .Machine$integer.max-1:
>
> I think the max width is .Machine$integer.max.  What happened below is a
> bug due to buffer overflow in `strrep`:

Sorry, integer overflow.

>> # works
>> x = strrep(" ", .Machine$integer.max-1L)
>> # fails
>> x = strrep(" ", .Machine$integer.max)
>> Error in strrep(" ", .Machine$integer.max) :
>>   'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)
>> (see also the comment in src/main/character.c: "Character strings in R
>> are less than 2^31-1 bytes, so we use int not size_t.")
>
> FWIW WRE states:
>
>> Note that R character strings are restricted to 2^31 - 1 bytes
>
> This is INT_MAX or .Machine$integer.max, at least on machines for which
> `int` is 32 bits, which I think typical for machines R builds on.   From
> having looked at the code a while ago I think WRE is right (so maybe the
> comment in the code is wrong), but it was a while ago and I haven't tried
> to allocate an INT_MAX long string.

So I tried it on a machine with more memory, and it works:

    > x <- strrep(" ", .Machine$integer.max-1L)
    > x <- paste0(x, " ")
    > nchar(x)
    [1] 2147483647
    > nchar(x) == .Machine$integer.max
    [1] TRUE

B.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel