On Wednesday 10 April 2013 at 10:50 +0300, Ventseslav Kozarev, MPP wrote:
> Hi,
>
> Thanks for taking the time. Here is a more reproducible example of the
> entire process:
>
> # Creating a vector source - stupid text in the Bulgarian language
> bg <- c('Днес е хубав и слънчев ден, в който всички искат да бъдат навън.',
>         'Утре ще бъде още по-хубав ден.')
>
> # Converting strings from the vector source to UTF-8. Without this step
> # in my setup, I don't see Cyrillic letters, even if I set the default
> # code page to CP1251.
> bg <- iconv(bg, to = 'UTF-8')
>
> # Load the tm library
> library(tm)
>
> # Create the corpus from the vector source
> corp <- Corpus(VectorSource(bg), readerControl = list(language = 'bulgarian'))
>
> # Create a custom stop list based on the example vector source,
> # converted to UTF-8
> stoplist <- c('е','и','в','който','всички','да','бъдат','навън','ще','бъде','още')
> stoplist <- iconv(stoplist, to = 'UTF-8')
>
> # Preprocessing
> corp <- tm_map(corp, removePunctuation)
> corp <- tm_map(corp, removeNumbers)
> corp <- tm_map(corp, tolower)
> corp <- tm_map(corp, removeWords, stoplist)
>
> # End of code here
>
> Now, if I run inspect(corp), I still see all the stop words intact
> inside the corpus. I can't figure out why. I tried experimenting with
> file encodings, with and without explicit statements of encoding, and it
> never works. As far as I can tell, my code is not wrong, and the
> function stopwords('language') returns a character vector, so just
> replacing it with a different character vector should do the trick. Alas,
> no list of stop words for the Bulgarian language is available as part of
> the tm package (not surprisingly).
>
> In the above example, I also tried to read in the list of stop words
> from a file using the scan function, per the example in my original
> message. That also fails to remove stop words, without any warnings or
> error messages.
>
> An alternative I tried was to convert to a term-document matrix, and
> then loop through the words inside and remove those that are also on the
> stop list. That only partially works, for two reasons. The TDM is
> actually a list, and I am not sure what code I need to use if I delete
> words but do not update the underlying indices. And second, some of the
> words still don't get removed even though they are on the list. But that
> is another issue altogether...
>
> Thanks for your attention and for your help!
> Vince

Thanks for the reproducible example. Indeed, it does not work here either (Linux with a UTF-8 locale). The problem seems to be in the call to gsub() in removeWords: the pattern "\\b" does not match anything when perl=TRUE. With perl=FALSE, it works:
gsub("днес", "", "днес е хубав")
# [1] " е хубав"
gsub("днес", "", "днес е хубав", perl=TRUE)
# [1] " е хубав"
gsub("\\bднес\\b", "", "днес е хубав")
# [1] " е хубав"
gsub("\\bднес\\b", "", "днес е хубав", perl=TRUE)
# [1] "днес е хубав"

It looks like some non-ASCII characters like é or € are supported, but not others like œ or the Cyrillic characters you provided.

As a temporary solution, you can define this function to replace the one provided by tm:

removeWords.PlainTextDocument <- function(x, words)
    gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "", x)

I have CCed tm's developer, Ingo Feinerer, to see if he has an idea about how to fix the problem in tm; but this looks like a bug in R (or in PCRE regular expressions).


Regards

> On 9.4.2013 at 22:55, Milan Bouchet-Valat wrote:
> > On Tuesday 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
> >> Hi,
> >>
> >> I bumped into a serious issue while trying to analyse some texts in
> >> the Bulgarian language (with the tm package). I import a tab-separated
> >> csv file, which holds a total of 22 variables, most of which are text
> >> cells (not factors), using the read.delim function:
> >>
> >> data <- read.delim("bigcompanies_ascii.csv",
> >>                    header=TRUE,
> >>                    quote="'",
> >>                    sep="\t",
> >>                    as.is=TRUE,
> >>                    encoding='CP1251',
> >>                    fileEncoding='CP1251')
> >>
> >> (I also tried the above with UTF-8 encoding on a UTF-8-saved file.)
> >>
> >> I have my list of stop words written in a separate text file, one word
> >> per line, which I read into R using the scan function:
> >>
> >> stoplist <- scan(file='stoplist_ascii.txt',
> >>                  what='character',
> >>                  strip.white=TRUE,
> >>                  blank.lines.skip=TRUE,
> >>                  fileEncoding='CP1251',
> >>                  encoding='CP1251')
> >>
> >> (also tried with UTF-8 here on a correspondingly encoded file)
> >>
> >> I currently only test with a corpus based on the contents of just one
> >> variable, and I construct the corpus from a VectorSource.
> >> When I run inspect, all seems fine and I can see the text properly,
> >> with Unicode characters present:
> >>
> >> data.corpus <- Corpus(VectorSource(data$variable, encoding='UTF-8'),
> >>                       readerControl=list(language='bulgarian'))
> >>
> >> However, no matter what I do - like which encoding I select - UTF-8 or
> >> CP1251, which is the typical code page for Bulgarian texts, I cannot
> >> get the stop words removed from my corpus. The issue is present on
> >> both Linux and Windows, and across the computers I use R on, and I
> >> don't think it is related to bad configuration. Removal of
> >> punctuation, white space, and numbers is flawless, but the inability
> >> to remove stop words prevents me from further analysing the texts.
> >>
> >> Has somebody had experience with languages other than English, for
> >> which there is no predefined stop list available through the stopwords
> >> function? I will highly appreciate any tips and advice!
> > Well, at least show us the code that you use to remove stopwords... Can
> > you provide a reproducible example with a toy corpus?
>
> >> Thanks in advance,
> >> Vince
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
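To put the pieces together, here is a minimal sketch of the workaround applied to the toy corpus above, plus the term-document-matrix alternative Vince mentioned. It assumes the tm 0.5.x API in use in this thread (where tm_map() applies a plain function to each document and passes extra arguments through to it); the final (*UCP) line additionally assumes an R build whose PCRE library was compiled with Unicode property support.

```r
library(tm)

# Toy corpus and stop list from the example above
bg <- iconv(c('Днес е хубав и слънчев ден, в който всички искат да бъдат навън.',
              'Утре ще бъде още по-хубав ден.'), to = 'UTF-8')
stoplist <- iconv(c('е','и','в','който','всички','да','бъдат','навън',
                    'ще','бъде','още'), to = 'UTF-8')

corp <- Corpus(VectorSource(bg), readerControl = list(language = 'bulgarian'))

# Workaround: same regexp tm builds internally, but without perl = TRUE,
# so \b is handled by the engine that matches around Cyrillic letters
removeWords.PlainTextDocument <- function(x, words)
  gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "", x)

corp <- tm_map(corp, tolower)
corp <- tm_map(corp, removeWords.PlainTextDocument, stoplist)
inspect(corp)

# Alternative (instead of removeWords): drop the stop-word rows from the
# term-document matrix directly, rather than looping over its internals
tdm <- TermDocumentMatrix(corp)
tdm <- tdm[!Terms(tdm) %in% stoplist, ]

# Root cause: with PCRE, \b considers only ASCII word characters unless
# Unicode properties are enabled; prefixing the pattern with (*UCP) makes
# perl = TRUE behave as expected on builds where PCRE supports it
gsub("(*UCP)\\bднес\\b", "", "днес е хубав", perl = TRUE)
```

The (*UCP) prefix and the TDM subsetting are illustrative suggestions, not part of the fix discussed above; perl = FALSE remains the simplest workaround on affected setups.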