[R] strsplit("dia ma", "\\b") splits characterwise

Suharto Anggono Thu, 08 Jul 2010 04:52:06 -0700

\b matches a word boundary.
But, unexpectedly, strsplit("dia ma", "\\b") splits the string character by character.


> strsplit("dia ma", "\\b")
[[1]]
[1] "d" "i" "a" " " "m" "a"

> strsplit("dia ma", "\\b", perl=TRUE)
[[1]]
[1] "d" "i" "a" " " "m" "a"


How can that be?
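
One possible explanation, stated as an assumption rather than something confirmed from the R sources: 'strsplit' may look for the first match, cut off a token, and then run the search again on the remainder of the string. Under that assumption the start of each remainder is itself a word boundary whenever it begins with a word character, so a zero-length match is found at the very front every time. The sketch below imitates such a loop. The helper name simulate_split and the "emit one character on an empty match at the front" rule are hypothetical, chosen only to illustrate the idea.

```r
# Hypothetical helper, not part of R: imitates a "find the first match,
# cut off a token, search the remainder again" loop.
simulate_split <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    if (m == -1) {
      # No match at all: the remainder is the last token.
      out <- c(out, x)
      break
    }
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start - 1 + len > 0) {
      # The match ends past the front of the string:
      # emit the text to the left of the match and drop the match.
      out <- c(out, substr(x, 1, start - 1))
      x <- substring(x, start + len)
    } else {
      # Empty match at the very front (assumed rule): emit one character.
      out <- c(out, substr(x, 1, 1))
      x <- substring(x, 2)
    }
  }
  out
}

simulate_split("dia ma", "\\b")
# [1] "d" "i" "a" " " "m" "a"
```

In this sketch the remainders "dia ma", "ia ma", "a ma", "ma" and "a" each start with a word character, so \b matches at their first position with length zero, and only " ma" yields a match further in.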

This is the output of 'gregexpr'.

> gregexpr("\\b", "dia ma")
[[1]]
[1] 1 2 3 4 5 6
attr(,"match.length")
[1] 0 0 0 0 0 0

> gregexpr("\\b", "dia ma", perl=TRUE)
[[1]]
[1] 1 4 5 7
attr(,"match.length")
[1] 0 0 0 0


The output from gregexpr("\\b", "dia ma", perl=TRUE) is what I expect. I expect 
'strsplit' to split at those points.
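
The expected split can be sketched by cutting the string directly at the positions that 'gregexpr' reports. This is only a minimal workaround under the assumption that gregexpr(..., perl=TRUE) returns the boundary positions shown above. The helper name split_at_boundaries is hypothetical and is not an R function.

```r
# Hypothetical helper, not part of R: split a single string at the
# zero-length word-boundary positions reported by gregexpr().
split_at_boundaries <- function(x) {
  pos <- gregexpr("\\b", x, perl = TRUE)[[1]]
  if (pos[1] == -1) return(x)          # no boundary found: nothing to split
  # Keep only positions that start a piece inside the string; a boundary
  # reported just past the last character does not start a new piece.
  starts <- unique(c(1, pos[pos <= nchar(x)]))
  # Each piece ends right before the next one starts.
  ends <- c(starts[-1] - 1, nchar(x))
  substring(x, starts, ends)
}

split_at_boundaries("dia ma")
# [1] "dia" " "   "ma"
```

The helper handles one string at a time; for a character vector it could be applied with lapply().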

This is on Windows. R was installed from the binary distribution.

> sessionInfo()
R version 2.11.1 (2010-05-31)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base



R 2.8.1 shows the same 'strsplit' behavior, but the behavior of default 
'gregexpr' (i.e. perl=FALSE) is different.

> strsplit("dia ma", "\\b")
[[1]]
[1] "d" "i" "a" " " "m" "a"

> strsplit("dia ma", "\\b", perl=TRUE)
[[1]]
[1] "d" "i" "a" " " "m" "a"

> gregexpr("\\b", "dia ma")
[[1]]
[1] 1 4 5 7
attr(,"match.length")
[1] 0 0 0 0

> gregexpr("\\b", "dia ma", perl=TRUE)
[[1]]
[1] 1 4 5 7
attr(,"match.length")
[1] 0 0 0 0

> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252


attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base




