On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:

> I noticed the following:
>
>> strsplit("red green","\\b")
> [[1]]
> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"

After reading the ?regex help page, I didn't understand why `\b` would split within sequences of "word" characters, either. I expected this to be the result:

[[1]]
[1] "red" " " "green"

There is a warning in that paragraph: "(The interpretation of ‘word’ depends on the locale and implementation.)"

I got the expected result with only one of "\\>" and "\\<":

> strsplit("red green","\\<")
[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"

> strsplit("red green","\\>")
[[1]]
[1] "red" " green"

The result with "\\<" seems decidedly unexpected.

I wondered whether the "original" regex documentation uses the same language as the R help page, so I went to the cited website and found:

=======
An assertion-character can be any of the following:

• < – Beginning of word
• > – End of word
• b – Word boundary
• B – Non-word boundary
• d – Digit character (equivalent to [[:digit:]])
• D – Non-digit character (equivalent to [^[:digit:]])
• s – Space character (equivalent to [[:space:]])
• S – Non-space character (equivalent to [^[:space:]])
• w – Word character (equivalent to [[:alnum:]_])
• W – Non-word character (equivalent to [^[:alnum:]_])
========

The word "word" appears nowhere else on that page.

>> strsplit("red green","\\W")
> [[1]]
> [1] "red" "green"

`\W` matches the non-word characters themselves rather than a zero-width position, so the " " character is consumed and discarded.

>
> I would have thought that "\\b" should give what "\\W" did. Note that:
>
>> grep("\\bred\\b","red green")
> [1] 1
> ## as expected
>
> Does strsplit use a different regex engine than grep()? Or more
> likely, what am I misunderstanding?
>
> Thanks.
>
> Bert
>
>

David Winsemius
Alameda, CA, USA
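
P.S. A possible workaround, offered as a sketch and not as an explanation of the engine's word-boundary semantics. The ?strsplit page says that if empty matches occur, x is split into single characters, which seems consistent with the zero-width `\b` and `\<` results above. Instead of splitting on a boundary, one can extract alternating runs of word and non-word characters with gregexpr() and regmatches(). The helper name split_words_keep_sep is my own invention, purely for illustration.

```r
# Hypothetical helper (not part of base R): return alternating runs of
# word and non-word characters instead of splitting on a zero-width "\\b".
split_words_keep_sep <- function(x) {
  # "\\w+|\\W+" matches either a run of word characters or a run of
  # non-word characters, so every character lands in exactly one piece.
  regmatches(x, gregexpr("\\w+|\\W+", x))
}

res <- split_words_keep_sep("red green")
print(res)
```

On the example string this should return a list whose first element is "red", " ", "green", which is the result I had expected from "\\b". Because the pieces are matched and not split, the separator is kept as its own element instead of being discarded as it is with "\\W".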