On 6/21/21 9:25 PM, Bill Dunlap wrote:
> NULL cannot be in an integer or numeric vector so it would not be a good fit for substring's 'first' or 'last' argument (or substr's 'start' and 'stop').
Yes, that would only work if used as a scalar, such as in the default for 'last' where 1000000L is used now.
In other cases, users already had to provide their own values for 'last' explicitly, and hence they would know if they provided a value too small given their data.
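To make the scalar NULL idea concrete, here is a rough R-level sketch. It is an illustration only: `substring_null_default` is a hypothetical name, not part of base R, and a real change would presumably treat NULL as a special case inside substring() itself rather than call nchar().

```r
# Hypothetical wrapper, not part of base R: last = NULL means
# "up to the end of each string", counted in characters, not bytes.
substring_null_default <- function(text, first, last = NULL) {
  if (is.null(last)) {
    # type = "chars" counts characters, so multibyte strings are not cut short
    last <- nchar(text, type = "chars")
  }
  substring(text, first, last)
}

x <- strrep("a", 1000005L)              # longer than 1000000 characters
nchar(substring(x, 2, 1000000L))        # a 'last' of 1000000L drops the tail
nchar(substring_null_default(x, 2))     # reaches the end of the string
```

The wrapper pays for the nchar() call that a built-in special case would avoid, so it only shows the intended semantics, not the intended cost.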
> Also, it is conceivable that string lengths may be 64 bit integers in the future, so why not use Inf as the default? Then the following would give 4 identical results with no warning:
Yes, that would work also in vector use, but integers over 2^53 won't be representable as doubles exactly, so we would have to revisit/change the interface when moving to 64 bit integers.
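The representability limit can be checked directly. The snippet below is only a sketch of the arithmetic and is not specific to substring(); the variable name `two53` is mine.

```r
# Doubles represent every integer exactly only up to 2^53.
two53 <- 2^53
two53 - 1 == two53    # FALSE: the two values are still distinguishable
two53 + 1 == two53    # TRUE: 2^53 + 1 rounds back to 2^53

# Inf is a double, so an Inf default would make 'last' a double as well.
is.integer(Inf)       # FALSE
is.double(Inf)        # TRUE
```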
Yet another option would be, say, using -1, which would also work with vector use and with integers. But negative indexes (and zero) are currently treated as the start of the string (1), and while this is not documented, it is perhaps good/intuitive behavior.
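A short sketch of that current behavior, as I understand it, which a -1 sentinel for 'last' would have to change:

```r
# Non-positive 'first' values behave like 1.
substring("abcde", 0, 3)     # "abc"
substring("abcde", -1, 3)    # "abc"

# With first = 3, a negative 'last' gives the empty string,
# so -1 cannot mean "end of string" without changing this result.
substring("abcde", 3, -1)    # ""
```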
Tomas
substring("abcde", 3, c(10, 2^31-1, 2^31, Inf))
[1] "cde" "cde" NA NA
Warning message:
In substring("abcde", 3, c(10, 2^31 - 1, 2^31, Inf)) :
  NAs introduced by coercion to integer range

-Bill

On Mon, Jun 21, 2021 at 10:22 AM Michael Chirico <[email protected]> wrote:

Thanks all, great points well taken. Indeed it seems the default of 1000000 predates SVN tracking in 1997.

I think a NULL default behaving as "end of string" regardless of encoding makes sense and avoids the overheads of a $ call and a much heavier nchar() calculation.

Mike C

On Mon, Jun 21, 2021 at 1:32 AM Martin Maechler <[email protected]> wrote:

Tomas Kalibera on Mon, 21 Jun 2021 10:08:37 +0200 writes:

> On 6/21/21 9:35 AM, Martin Maechler wrote:
>>>>>>> Michael Chirico
>>>>>>> on Sun, 20 Jun 2021 15:20:26 -0700 writes:
>> > Currently, substring defaults to last=1000000L, which
>> > strongly suggests the intent is to default to "nchar(x)"
>> > without having to compute/allocate that up front.
>>
>> > Unfortunately, this default makes no sense for "very
>> > large" strings which may exceed 1000000L in "width".
>>
>> Yes; and I tend to agree with you that this default is outdated
>> (Remember : R was written to work and run on 2 (or 4?) MB of RAM on the
>> student lab Macs in Auckland in ca 1994).
>>
>> > The max width of a string is .Machine$integer.max-1:
>>
>> (which Brodie showed was only almost true)
>>
>> > So it seems to me either .Machine$integer.max or
>> > .Machine$integer.max-1L would be a more sensible default. Am I missing
>> > something?
>>
>> The "drawback" is of course that .Machine$integer.max is still
>> a function call (as R beginners may forget) contrary to <nnnnn>L,
>> but that may even be inlined by the byte compiler (? how would we check ?)
>> and even if it's not, it does more clearly convey the concept
>> and idea *and* would probably even port automatically if ever
>> integer would be increased in R.

> We still have the problem that we need to count characters, not bytes,
> if we want the default semantics of "until the end of the string".
> I think we would have to fix this either by really using
> "nchar(type="c"))" or by using e.g. NULL and then treating this as a
> special case, that would be probably faster.

> Tomas

You are right, as always, Tomas.
I agree that would be better and we should do it if/when we change the default there.

Martin

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
