David/Jeff: Thank you both.
You seem to confirm that my observation of an "infelicity" in strsplit() is real. That is most helpful.

I found nothing surprising in the code in David's second message. That is, the splits shown there conform to what I would expect from "\\b", but not to what I originally showed and David enlarged upon in his first message. I still don't really get why a split should occur at every letter. Jeff may very well have found the explanation, but I have not gone through his code.

If the infelicities noted (are there more?) by David and me are not really bugs -- and I would be frankly surprised if they were -- I would suggest that perhaps they deserve mention in the strsplit() man page. Something to the effect that "\b and \< should not be used as split characters...".

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
   -- Clifford Stoll

On Sat, Jul 11, 2015 at 11:05 AM, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
>
>> I noticed the following:
>>
>>> strsplit("red green","\\b")
>> [[1]]
>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>
> After reading the ?regex help page, I didn't understand why `\b` would split
> within sequences of "word"-characters, either. I expected this to be the
> result:
>
> [[1]]
> [1] "red" " " "green"
>
> There is a warning in that paragraph: "(The interpretation of ‘word’ depends
> on the locale and implementation.)"
>
> I got the expected result with only one of "\\>" and "\\<"
>
>> strsplit("red green","\\<")
> [[1]]
> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>
>> strsplit("red green","\\>")
> [[1]]
> [1] "red" " green"
>
> The result with "\\<" seems decidedly unexpected.
>
> I wondered if the "original" regex documentation uses the same language as
> the R help page. So I went to the cited website and found:
> =======
> An assertion-character can be any of the following:
>
> • < – Beginning of word
> • > – End of word
> • b – Word boundary
> • B – Non-word boundary
> • d – Digit character (equivalent to [[:digit:]])
> • D – Non-digit character (equivalent to [^[:digit:]])
> • s – Space character (equivalent to [[:space:]])
> • S – Non-space character (equivalent to [^[:space:]])
> • w – Word character (equivalent to [[:alnum:]_])
> • W – Non-word character (equivalent to [^[:alnum:]_])
> ========
>
> The word "word" appears nowhere else on that page.
>
>
>>> strsplit("red green","\\W")
>> [[1]]
>> [1] "red" "green"
>
> `\W` matches the byte-width non-word characters. So the " "-character would
> be discarded.
>
>>
>> I would have thought that "\\b" should give what "\\W" did. Note that:
>>
>>> grep("\\bred\\b","red green")
>> [1] 1
>> ## as expected
>>
>> Does strsplit use a different regex engine than grep()? Or more
>> likely, what am I misunderstanding?
>>
>> Thanks.
>>
>> Bert
>>
>>
>
>
> David Winsemius
> Alameda, CA, USA
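
For anyone reproducing the calls quoted in this thread, here is a minimal base-R sketch of one possible workaround. It assumes the goal is the "red" " " "green" result David expected. It does not explain the strsplit() behavior; it only avoids using zero-width assertions such as "\\b" or "\\<" as the split pattern, by extracting matches with gregexpr() and regmatches() instead. The variable names are illustrative and nothing here was proposed in the thread itself.

```r
x <- "red green"

# Splitting on a non-word character drops the separator,
# as in the "\\W" call quoted in the thread.
words_split <- strsplit(x, "\\W")[[1]]

# Alternative that avoids zero-width split patterns entirely:
# extract runs of word characters rather than splitting at boundaries.
words_match <- regmatches(x, gregexpr("[[:alnum:]_]+", x))[[1]]

# Keep the separators as well, by matching either a run of word
# characters or a run of non-word characters.
pieces <- regmatches(x, gregexpr("[[:alnum:]_]+|[^[:alnum:]_]+", x))[[1]]

print(words_split)
print(words_match)
print(pieces)
```

The character classes are the equivalents of \w and \W given in the list David quoted, so the sketch does not depend on how a particular engine interprets "\\b".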