I was going to recommend the regular-expressions.info website, but you got
to it first.
I tried looking at the source, but it is kind of dense. I think stepping
through the code below may illustrate why the zero width match returned
from "\\b" cannot be allowed as-is or the strsplit algorithm will go
into an infinite loop.
slowstrsplit1 <- function( x, split ) {
  result <- list()
  right <- x
  # regexpr() returns -1 when there is no match, which ends the loop
  while ( 0 < nchar( right )
          && 0 < ( idx <- regexpr( split, right ) )
        ) {
    # index of beginning of pattern
    patidx <- c( idx )
    # number of matched characters in the string corresponding
    # to the pattern
    patlen <- attr( idx, "match.length" )
    if ( 0 == patlen && 1 == patidx ) {
      # a zero-width match at the very start consumes nothing; if it
      # were used as-is then right would never shrink and the loop
      # would never end, so emit one character and step past it
      left <- substr( right, 1, 1 )
      rightidx <- 2
    } else {
      # everything before the match is the next chunk; resume just
      # after the matched text
      left <- substr( right, 1, patidx - 1 )
      rightidx <- patidx + patlen
    }
    # remember the left chunk
    result <- append( result, left )
    right <- substr( right, rightidx, nchar( right ) )
  }
  if ( 0 != nchar( right ) ) {
    result <- append( result, right )
  }
  unlist( result )
}

slowstrsplit <- function( x, split ) {
  lapply( x, function( x1 ) { slowstrsplit1( x1, split ) } )
}
# test
teststr <- "red green"
pat <- "\\b"
slowstrsplit( teststr, pat )
pat <- " "
slowstrsplit( teststr, pat )
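
To see the "no progress" problem in isolation, here is a minimal sketch. count_steps() is a hypothetical helper of mine, not anything in base R or in the strsplit source, and the max_iter cap is only there so the bad case stops instead of hanging. It counts how many regexpr() calls it takes to walk through the string, with and without forcing at least one character of progress per match.

```r
# Hypothetical helper (not part of base R): walk through x by repeatedly
# matching split against the remaining text, and count the iterations.
count_steps <- function( x, split, force_progress, max_iter = 20L ) {
  pos <- 1L
  iter <- 0L
  while ( pos <= nchar( x ) && iter < max_iter ) {
    m <- regexpr( split, substr( x, pos, nchar( x ) ) )
    if ( as.integer( m ) < 0L ) break
    # characters consumed: text before the match plus the match itself
    step <- ( as.integer( m ) - 1L ) + attr( m, "match.length" )
    # a zero-width match at the start consumes nothing at all
    if ( force_progress && 0L == step ) step <- 1L
    pos <- pos + step
    iter <- iter + 1L
  }
  iter
}

stuck <- count_steps( "red green", "\\b", force_progress = FALSE )
moving <- count_steps( "red green", "\\b", force_progress = TRUE )
```

Without the forced step, the call only stops because it runs into max_iter; with it, the position advances on every pass and the loop finishes on its own.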
On Sat, 11 Jul 2015, David Winsemius wrote:
On Jul 11, 2015, at 11:05 AM, David Winsemius wrote:
On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
I noticed the following:
strsplit("red green","\\b")
[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"
After reading the ?regex help page, I didn't understand why `\b` would split within
sequences of "word"-characters, either. I expected this to be the result:
[[1]]
[1] "red" " " "green"
There is a warning in that paragraph: "(The interpretation of 'word' depends on the
locale and implementation.)"
I got the expected result with only one of "\\>" and "\\<"
strsplit("red green","\\<")
[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"
strsplit("red green","\\>")
[[1]]
[1] "red" " green"
The result with "\\<" seems decidedly unexpected.
I wondered whether the "original" regex documentation uses the same language as
the R help page. So I went to the cited website and found:
=======
An assertion-character can be any of the following:
  <  - Beginning of word
  >  - End of word
  b  - Word boundary
  B  - Non-word boundary
  d  - Digit character (equivalent to [[:digit:]])
  D  - Non-digit character (equivalent to [^[:digit:]])
  s  - Space character (equivalent to [[:space:]])
  S  - Non-space character (equivalent to [^[:space:]])
  w  - Word character (equivalent to [[:alnum:]_])
  W  - Non-word character (equivalent to [^[:alnum:]_])
========
The word "word" appears nowhere else on that page.
This page:
http://www.regular-expressions.info/wordboundaries.html
implies that naked boundaries were not expected to be used and that "\B" and "\b" were expected to
be "flanking" patterns with the real "meat" either sandwiched between them or perhaps at either end.
> strsplit( " red green blue", split="\\b \\b")
[[1]]
[1] " red green" "blue"
So perhaps there is an implicit "any-word" that follows the "\\b" assertion?
strsplit( "redgreen", split="\\bgreen")
[[1]]
[1] "redgreen"
strsplit( "redgreen", split="green\\b")
[[1]]
[1] "red"
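
If the goal is the "red" " " "green" result, one way to avoid splitting on a bare boundary is to extract the pieces instead. This is only a sketch of an alternative, not what strsplit does internally; it assumes perl = TRUE (PCRE) so that the \w, \W and \b escapes behave as described on the regular-expressions.info page.

```r
x <- "red green"

# alternate runs of word characters and runs of non-word characters
pieces <- regmatches( x, gregexpr( "\\w+|\\W+", x, perl = TRUE ) )[[1]]

# \b used as a flanking assertion around a real pattern: whole words only
words <- regmatches( x, gregexpr( "\\b\\w+\\b", x, perl = TRUE ) )[[1]]
```

Here pieces keeps the separator as its own element, while words drops it.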
--
David.
strsplit("red green","\\W")
[[1]]
[1] "red" "green"
`\W` matches a single non-word character, which has nonzero width. So the " "-character
would be consumed as the separator and discarded.
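
A small sketch of that difference, using regexpr() to look at the first match each pattern reports. The variable names are just for illustration; the start position and the "match.length" attribute are what regexpr() documents as its return value.

```r
x <- "red green"

# \W consumes one character: the space
m_W <- regexpr( "\\W", x )
# \b is an assertion: it matches a position and consumes nothing
m_b <- regexpr( "\\b", x )

W_info <- c( start = as.integer( m_W ), length = attr( m_W, "match.length" ) )
b_info <- c( start = as.integer( m_b ), length = attr( m_b, "match.length" ) )
```

A match of length one gives a splitter something to remove; a match of length zero does not.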
I would have thought that "\\b" should give what "\\W" did. Note that:
grep("\\bred\\b","red green")
[1] 1
## as expected
Does strsplit use a different regex engine than grep()? Or more
likely, what am I misunderstanding?
Thanks.
Bert
David Winsemius
Alameda, CA, USA
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k