Re: [R] Do grep() and strsplit() use different regex engines?

David Winsemius Sat, 11 Jul 2015 11:19:08 -0700

On Jul 11, 2015, at 11:05 AM, David Winsemius wrote:

> 
> On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
> 
>> I noticed the following:
>> 
>>> strsplit("red green","\\b")
>> [[1]]
>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
> 
> After reading the ?regex help page, I didn't understand why `\b` would split 
> within sequences of "word"-characters, either. I expected this to be the 
> result:
> 
> [[1]]
> [1] "red"  " "  "green"
> 
> There is a warning in that paragraph: "(The interpretation of ‘word’ depends 
> on the locale and implementation.)"
> 
> Of "\\<" and "\\>", only "\\>" splits at a word edge, as I expected:
> 
>> strsplit("red green","\\<")
> [[1]]
> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
> 
>> strsplit("red green","\\>")
> [[1]]
> [1] "red"    " green"
> 
> The result with "\\<" seems decidedly unexpected.
> 
> I wondered whether the "original" regex documentation uses the same language
> as the R help page, so I went to the cited website and found:
> =======
> An assertion-character can be any of the following:
> 
>       • < – Beginning of word
>       • > – End of word
>       • b – Word boundary
>       • B – Non-word boundary
>       • d – Digit character (equivalent to [[:digit:]])
>       • D – Non-digit character (equivalent to [^[:digit:]])
>       • s – Space character (equivalent to [[:space:]])
>       • S – Non-space character (equivalent to [^[:space:]])
>       • w – Word character (equivalent to [[:alnum:]_])
>       • W – Non-word character (equivalent to [^[:alnum:]_])
> ========
> 
> The word "word" appears nowhere else on that page.
>


This page:

http://www.regular-expressions.info/wordboundaries.html

implies that naked boundaries were not expected to be used on their own, and
that "\B" and "\b" were expected to flank a pattern, with the real "meat"
either sandwiched between them or perhaps at either end.

> strsplit( "     red green   blue", split="\\b   \\b")
[[1]]
[1] "     red green" "blue"  
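
A small sketch of that "flanking" usage, using only base R functions. The test strings are made up for illustration. With "\\b" on each side of a literal, the pattern should match the stand-alone word but not the same letters inside a longer word:

```r
# Hypothetical test strings, chosen only for illustration.
x <- c("red green", "redgreen")

# \\b on both sides: "red" must stand alone as a word.
whole_word <- grepl("\\bred\\b", x)

# Replace only the stand-alone word; "redgreen" is left untouched.
replaced <- gsub("\\bred\\b", "X", x)

print(whole_word)
print(replaced)
```

This is the same kind of use as the grep("\\bred\\b", "red green") call quoted further down, which behaved as expected.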


So perhaps there is an implicit "any-word" that follows the "\\b" assertion?

> strsplit( "redgreen", split="\\bgreen")
[[1]]
[1] "redgreen"

> strsplit( "redgreen", split="green\\b")
[[1]]
[1] "red"
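
Another possible reading is that the regex engine is not the issue, and that the results come from how strsplit() walks the string. The sketch below assumes the loop described in the Details of ?strsplit (find a match, take the text before it as a token, remove token and match from the input, repeat). It also assumes that an empty match at the very start of the remaining input yields a one-character token. That second rule is my inference from the outputs in this thread, not something I have confirmed. split_like_strsplit() is a hypothetical helper, not a base R function.

```r
# Sketch of the loop that ?strsplit describes. split_like_strsplit() is a
# hypothetical helper written for this illustration, not a base R function.
split_like_strsplit <- function(x, split, perl = FALSE) {
  tokens <- character(0)
  while (nzchar(x)) {
    m <- regexpr(split, x, perl = perl)
    start <- as.integer(m)                  # 1-based start, or -1 if no match
    if (start == -1L) {
      tokens <- c(tokens, x)                # no match: the rest is the last token
      break
    }
    # Number of characters from the start of the input to the end of the match.
    end <- start - 1L + attr(m, "match.length")
    if (end > 0L) {
      # Token is the text before the match; drop token and match from the input.
      tokens <- c(tokens, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Assumed special case: an empty match at the very start of the
      # remaining input yields a one-character token.
      tokens <- c(tokens, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  tokens
}

x <- "red green"
by_boundary <- split_like_strsplit(x, "\\b")   # compare with strsplit(x, "\\b")
by_word_end <- split_like_strsplit(x, "\\>")   # compare with strsplit(x, "\\>")

# Extracting runs of word and non-word characters avoids empty matches entirely.
runs <- regmatches(x, gregexpr("\\w+|\\W+", x))[[1]]

print(by_boundary)
print(by_word_end)
print(runs)
```

If the assumptions hold, the first two results should reproduce the strsplit() outputs quoted in this thread. The regmatches() call is one way to get the "red", " ", "green" pieces without splitting on an empty match.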


-- 
David.
> 
>>> strsplit("red green","\\W")
>> [[1]]
>> [1] "red"   "green"
> 
> `\W` matches a single non-word character, so the " " character is consumed as
> the separator and discarded.
> 
>> 
>> I would have thought that "\\b" should give what "\\W" did. Note that:
>> 
>>> grep("\\bred\\b","red green")
>> [1] 1
>> ## as expected
>> 
>> Does strsplit use a different regex engine than grep()? Or more
>> likely, what am I misunderstanding?
>> 
>> Thanks.
>> 
>> Bert
>> 
>> 
> 
> 
> David Winsemius
> Alameda, CA, USA
> 
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

