I just wanted to confirm that Milan's suggestion of adding (*UCP), as in the
example below, solved all my problems (under openSUSE Linux 12.3 64-bit,
R 2.15.2):

gsub(sprintf("(*UCP)\\b(%s)\\b", "който"), "", "който", perl=TRUE)

I re-encoded the input files and the stop word list in UTF-8, and stop words
are now properly removed using the suggested syntax:

sme.corpus <- tm_map(sme.corpus, removeWords.PlainTextDocument, stoplist)

where:

removeWords.PlainTextDocument <- function (x, words)
    gsub(sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|")), "", x, perl=TRUE)

and stoplist is a character vector of stop words.

The wordcloud function now also accepts the preprocessed corpus without
warnings or errors.

Now, if only I could do stemming in Bulgarian, that would be priceless!

Thanks again, this has been a tremendous help indeed!

Vince

On Wednesday 10 April 2013 20:43:27 Milan Bouchet-Valat wrote:
> On Wednesday 10 April 2013 at 13:17 +0200, Ingo Feinerer wrote:
> > On Wed, Apr 10, 2013 at 10:29:27AM +0200, Milan Bouchet-Valat wrote:
> > > Thanks for the reproducible example. Indeed, it does not work here
> > > either (Linux with UTF-8 locale). The problem seems to be in the call to
> > > gsub() in removeWords: the pattern "\\b" does not match anything when
> > > perl=TRUE. With perl=FALSE, it works.
> >
> > The \b versus perl versus UTF-8 issue seems to be known, and it is
> > advised to use perl = TRUE with \b. See e.g. the warning in the gsub
> > help page (?gsub):
> >
> > ---8<--------------------------------------------------------------------------
> > Warning:
> >
> > POSIX 1003.2 mode of ‘gsub’ and ‘gregexpr’ does not work correctly with
> > repeated word-boundaries (e.g. ‘pattern = "\b"’). Use ‘perl = TRUE’ for
> > such matches (but that may not work as expected with non-ASCII inputs,
> > as the meaning of ‘word’ is system-dependent).
> > ---8<--------------------------------------------------------------------------
>
> Thanks for the pointer.
> Indeed, this allowed me to discover the existence of the PCRE_UCP
> (Unicode Character Properties) flag, which changes matching behavior so
> that Unicode alphanumerics are not considered as word boundaries.
>
> This flag should probably be used by R when calling pcre_compile() in
> gsub() and friends. At the moment, R's behavior is inconsistent across
> platforms:
> - on Fedora 18, R 2.15.3:
> gsub("\\bt\\b", "", "télégramme", perl=TRUE)
> [1] "élégramme"
>
> - on Windows 2008, R 2.15.1 and 3.0.0:
> gsub("\\bt\\b", "", "télégramme", perl=TRUE)
> [1] "télégramme"
>
> Luckily, the bug can be fixed at tm's level by adding (*UCP) at the
> beginning of the pattern. This works for our examples:
> gsub(sprintf("\\b(%s)\\b", "който"), "", "който", perl=TRUE)
> [1] "който"
>
> gsub(sprintf("(*UCP)\\b(%s)\\b", "който"), "", "който", perl=TRUE)
> [1] ""
>
> gsub("\\bt\\b", "", "télégramme", perl=TRUE)
> [1] "élégramme"
> gsub("(*UCP)\\bt\\b", "", "télégramme", perl=TRUE)
> [1] "télégramme"
>
> Regards

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
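
For anyone who wants to try the (*UCP) approach from this thread without tm, here is a minimal sketch in base R. The helper name remove_stop_words, the two-word stop list and the sample sentence are made up for illustration; only the pattern itself comes from the messages above. It assumes the strings are UTF-8 encoded and that R's PCRE build supports (*UCP).

```r
# Hypothetical helper (not part of tm): remove whole-word stop words from a
# character vector. The (*UCP) prefix asks PCRE to use Unicode character
# properties, so \b treats Cyrillic letters as word characters.
remove_stop_words <- function(x, words) {
  pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Assumed sample data, UTF-8 encoded
stoplist <- c("който", "и")
text <- "човекът който чете и пише"

cleaned <- remove_stop_words(text, stoplist)
print(cleaned)
```

Removed words leave their surrounding spaces behind, so a whitespace-stripping step may be wanted afterwards. Note also that the stop words are pasted into the pattern as-is, so a word containing regex metacharacters would need escaping first.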