omigosh -- you're right.

-- Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll


On Sat, Jul 11, 2015 at 3:31 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Jul 11, 2015, at 3:07 PM, Bert Gunter wrote:
>
>> David/Jeff:
>>
>> Thank you both.
>>
>> You seem to confirm that my observation of an "infelicity" in
>> strsplit() is real. That is most helpful.
>>
>> I found nothing in David's message 2 code that was surprising. That
>> is, the splits shown conform to what I would expect from "\\b" . But
>> not to what I originally showed and David enlarged upon in his first
>> message. I still don't really get why a split should occur at every
>> letter.
>>
>> Jeff may very well have found the explanation, but I have not gone
>> through his code.
>>
>> If the infelicities noted (are there more?) by David and me are not
>> really bugs -- and I would be frankly surprised if they were -- I
>> would suggest that perhaps they deserve mention in the strsplit() man
>> page. Something to the effect that "\b and \< should not be used as
>> split characters..." .
>
> It's more of a regex infelicity or what appears (to us both at a minimum) as
> a violation of a 'least surprise principle':
>
> > gsub("\\b", " ", " This is a test case")
> [1] " T h i s i s a t e s t c a s e "
>
>
> --
> David.
>
>>
>> Bert Gunter
>>
>> "Data is not information. Information is not knowledge. And knowledge
>> is certainly not wisdom."
>>    -- Clifford Stoll
>>
>>
>> On Sat, Jul 11, 2015 at 11:05 AM, David Winsemius
>> <dwinsem...@comcast.net> wrote:
>>>
>>> On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
>>>
>>>> I noticed the following:
>>>>
>>>> > strsplit("red green","\\b")
>>>> [[1]]
>>>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>>>
>>> After reading the ?regex help page, I didn't understand why `\b` would
>>> split within sequences of "word"-characters, either. I expected this to be
>>> the result:
>>>
>>> [[1]]
>>> [1] "red" " " "green"
>>>
>>> There is a warning in that paragraph: "(The interpretation of ‘word’
>>> depends on the locale and implementation.)"
>>>
>>> I got the expected result with only one of "\\>" and "\\<"
>>>
>>> > strsplit("red green","\\<")
>>> [[1]]
>>> [1] "r" "e" "d" " " "g" "r" "e" "e" "n"
>>>
>>> > strsplit("red green","\\>")
>>> [[1]]
>>> [1] "red" " green"
>>>
>>> The result with "\\<" seems decidedly unexpected.
>>>
>>> I wondered if the "original" regex documentation uses the same language
>>> as the R help page. So I went to the cited website and found:
>>> =======
>>> An assertion-character can be any of the following:
>>>
>>> • < – Beginning of word
>>> • > – End of word
>>> • b – Word boundary
>>> • B – Non-word boundary
>>> • d – Digit character (equivalent to [[:digit:]])
>>> • D – Non-digit character (equivalent to [^[:digit:]])
>>> • s – Space character (equivalent to [[:space:]])
>>> • S – Non-space character (equivalent to [^[:space:]])
>>> • w – Word character (equivalent to [[:alnum:]_])
>>> • W – Non-word character (equivalent to [^[:alnum:]_])
>>> ========
>>>
>>> The word "word" appears nowhere else on that page.
>>>
>>>
>>>> > strsplit("red green","\\W")
>>>> [[1]]
>>>> [1] "red" "green"
>>>
>>> `\W` matches the byte-width non-word characters. So the " "-character would
>>> be discarded.
>>>
>>>>
>>>> I would have thought that "\\b" should give what "\\W" did. Note that:
>>>>
>>>> > grep("\\bred\\b","red green")
>>>> [1] 1
>>>> ## as expected
>>>>
>>>> Does strsplit use a different regex engine than grep()? Or more
>>>> likely, what am I misunderstanding?
>>>>
>>>> Thanks.
>>>>
>>>> Bert
>>>>
>>>
>>> David Winsemius
>>> Alameda, CA, USA
>>>
>
> David Winsemius
> Alameda, CA, USA
>
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
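
One possible reading of the letter-by-letter result in this thread is that the split loop re-runs the pattern on whatever text remains after each piece, so a zero-width match such as "\\b" is found again at the start of every remaining tail. The sketch below models such a loop in Python, chosen only to keep the example self-contained and runnable; it is not R's source code. The name `split_on_tail` and the rule "emit one character when the match ends at offset 0" are assumptions made for illustration, and Python's `re` module is assumed to treat `\b` and `\W` like the engine in the thread for this input.

```python
import re


def split_on_tail(text, pattern):
    """Hypothetical model of a split loop that re-matches on the remaining tail."""
    regex = re.compile(pattern)
    pieces = []
    while text:
        match = regex.search(text)  # the tail is searched as if it were a fresh string
        if match is None:
            break
        if match.end() > 0:
            # The match ends past offset 0: keep the text to its left,
            # then drop that text together with the match itself.
            pieces.append(text[:match.start()])
            text = text[match.end():]
        else:
            # Zero-width match at offset 0: emit a single character
            # (assumed rule) so that the loop always makes progress.
            pieces.append(text[0])
            text = text[1:]
    if text:
        pieces.append(text)  # whatever is left after the last match
    return pieces


print(split_on_tail("red green", r"\b"))
print(split_on_tail("red green", r"\W"))
```

Under these assumptions the loop breaks "red green" into single characters for `\b`, because the start of each tail counts as a word boundary whenever the tail begins with a letter, while `\W` yields "red" and "green". Both results match what the thread reports for strsplit().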