On Jul 11, 2015, at 3:07 PM, Bert Gunter wrote:

> David/Jeff:
>
> Thank you both.
>
> You seem to confirm that my observation of an "infelicity" in
> strsplit() is real. That is most helpful.
>
> I found nothing in David's message 2 code that was surprising. That
> is, the splits shown conform to what I would expect from "\\b" . But
> not to what I originally showed and David enlarged upon in his first
> message. I still don't really get why a split should occur at every
> letter.
>
> Jeff may very well have found the explanation, but I have not gone
> through his code.
>
> If the infelicities noted (are there more?) by David and me are not
> really bugs -- and I would be frankly surprised if they were -- I
> would suggest that perhaps they deserve mention in the strsplit() man
> page. Something to the effect that "\b and \< should not be used as
> split characters..." .

It's more of a regex infelicity or what appears (to us both at a minimum) as a
violation of a 'least surprise principle':

> gsub("\\b", " ", " This is a test case")
[1] " T h i s i s a t e s t c a s e "

--
David.

>
> Bert Gunter
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
>    -- Clifford Stoll
>
>
> On Sat, Jul 11, 2015 at 11:05 AM, David Winsemius
> <dwinsem...@comcast.net> wrote:
>>
>> On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
>>
>>> I noticed the following:
>>>
>>>> strsplit("red green","\\b")
>>> [[1]]
>>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>>
>> After reading the ?regex help page, I didn't understand why `\b` would split
>> within sequences of "word"-characters, either. I expected this to be the
>> result:
>>
>> [[1]]
>> [1] "red" " " "green"
>>
>> There is a warning in that paragraph: "(The interpretation of ‘word’ depends
>> on the locale and implementation.)"
>>
>> I got the expected result with only one of "\\>" and "\\<"
>>
>>> strsplit("red green","\\<")
>> [[1]]
>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>>
>>> strsplit("red green","\\>")
>> [[1]]
>> [1] "red" " green"
>>
>> The result with "\\<" seems decidedly unexpected.
>>
>> I wondered if the "original" regex documentation uses the same language as
>> the R help page. So I went to the cited website and found:
>> =======
>> An assertion-character can be any of the following:
>>
>> • < – Beginning of word
>> • > – End of word
>> • b – Word boundary
>> • B – Non-word boundary
>> • d – Digit character (equivalent to [[:digit:]])
>> • D – Non-digit character (equivalent to [^[:digit:]])
>> • s – Space character (equivalent to [[:space:]])
>> • S – Non-space character (equivalent to [^[:space:]])
>> • w – Word character (equivalent to [[:alnum:]_])
>> • W – Non-word character (equivalent to [^[:alnum:]_])
>> ========
>>
>> The word "word" appears nowhere else on that page.
>>
>>
>>>> strsplit("red green","\\W")
>>> [[1]]
>>> [1] "red" "green"
>>
>> `\W` matches the byte-width non-word characters. So the " "-character would
>> be discarded.
>>
>>>
>>> I would have thought that "\\b" should give what "\\W" did. Note that:
>>>
>>>> grep("\\bred\\b","red green")
>>> [1] 1
>>> ## as expected
>>>
>>> Does strsplit use a different regex engine than grep()? Or more
>>> likely, what am I misunderstanding?
>>>
>>> Thanks.
>>>
>>> Bert
>>>
>>>
>>
>>
>> David Winsemius
>> Alameda, CA, USA
>>

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
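
A possible workaround for the splits discussed in this thread, offered as a
minimal sketch and not as a statement of how strsplit() is meant to be used:
instead of splitting on a zero-length assertion such as "\\b", extract the runs
of word and non-word characters with gregexpr() and regmatches(). The variable
names and the choice of perl = TRUE are assumptions made for illustration;
neither appears in the messages above.

```r
# Illustrative only: 'x', 'pieces' and 'words' are hypothetical names.
x <- "red green"

# Match maximal runs of word characters or of non-word characters,
# then pull the matched substrings out of the input.
pieces <- regmatches(x, gregexpr("\\w+|\\W+", x, perl = TRUE))[[1]]
print(pieces)

# If the separators are not needed, splitting on a non-empty pattern
# such as "\\W+" avoids zero-length matches altogether.
words <- strsplit(x, "\\W+")[[1]]
print(words)
```

With the input used in the thread, the first call should return "red", " " and
"green", which is the result David expected from "\\b"; the second returns
"red" and "green", matching the "\\W" example.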