On Wednesday 10 April 2013 at 13:17 +0200, Ingo Feinerer wrote:
> On Wed, Apr 10, 2013 at 10:29:27AM +0200, Milan Bouchet-Valat wrote:
> > Thanks for the reproducible example. Indeed, it does not work here
> > either (Linux with UTF-8 locale). The problem seems to be in the call to
> > gsub() in removeWords: the pattern "\\b" does not match anything when
> > perl=TRUE. With perl=FALSE, it works.
>
> The \b versus perl versus UTF-8 issue seems to be known, and it is
> advised to use perl = TRUE with \b. See e.g. the warning in the gsub
> help page (?gsub):
>
> ---8<--------------------------------------------------------------------------
> Warning:
>
> POSIX 1003.2 mode of ‘gsub’ and ‘gregexpr’ does not work correctly with
> repeated word-boundaries (e.g. ‘pattern = "\b"’). Use ‘perl = TRUE’ for
> such matches (but that may not work as expected with non-ASCII inputs,
> as the meaning of ‘word’ is system-dependent).
> ---8<--------------------------------------------------------------------------

Thanks for the pointer. It led me to the PCRE_UCP (Unicode Character
Properties) flag, which changes matching behavior so that Unicode
alphanumerics are treated as word characters and no longer create word
boundaries.
This flag should probably be used by R when calling pcre_compile() in
gsub() and friends. At the moment, R's behavior is inconsistent across
platforms:

- on Fedora 18, R 2.15.3:
> gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "élégramme"

- on Windows 2008, R 2.15.1 and 3.0.0:
> gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "télégramme"

Luckily, the bug can be fixed at tm's level by adding (*UCP) at the
beginning of the pattern. This works for our examples:

> gsub(sprintf("\\b(%s)\\b", "който"), "", "който", perl=TRUE)
[1] "който"
> gsub(sprintf("(*UCP)\\b(%s)\\b", "който"), "", "който", perl=TRUE)
[1] ""
> gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "élégramme"
> gsub("(*UCP)\\bt\\b", "", "télégramme", perl=TRUE)
[1] "télégramme"

Regards
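
P.S. For illustration, here is a minimal sketch of what such a fix could look like as a standalone function. The name remove_words_ucp is hypothetical and this is not tm's actual removeWords code; it also assumes the words contain no regular expression metacharacters. The strings are written with \u escapes only so that the snippet does not depend on the encoding of the source file.

```r
# Hypothetical helper (not part of tm): remove whole words with PCRE,
# enabling Unicode character properties through the (*UCP) pattern prefix.
remove_words_ucp <- function(x, words) {
  # Join the words into a single alternation between word boundaries.
  # Assumption: the words contain no regex metacharacters.
  pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|"))
  # Convert the input to UTF-8 so that PCRE runs in UTF-8 mode.
  gsub(pattern, "", enc2utf8(x), perl = TRUE)
}

telegramme <- "t\u00e9l\u00e9gramme"          # "télégramme"
koito      <- "\u043a\u043e\u0439\u0442\u043e"  # "който"

# "t" is not a standalone word here, so the string is left unchanged.
remove_words_ucp(telegramme, "t")

# The whole string is one word, so it is removed and "" is returned.
remove_words_ucp(koito, koito)
```

Without the (*UCP) prefix the result of the first call depends on the platform, as shown above, so I have not written an expected output for that case.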