On Tuesday, 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
> Hi,
>
> I bumped into a serious issue while trying to analyse some texts in
> Bulgarian language (with the tm package). I import a tab-separated csv
> file, which holds a total of 22 variables, most of which are text cells
> (not factors), using the read.delim function:
>
> data<-read.delim("bigcompanies_ascii.csv",
>                  header=TRUE,
>                  quote="'",
>                  sep="\t",
>                  as.is=TRUE,
>                  encoding='CP1251',
>                  fileEncoding='CP1251')
>
> (I also tried the above with UTF-8 encoding on a UTF-8-saved file.)
>
> I have my list of stop words written in a separate text file, one word
> per line, which I read into R using the scan function:
>
> stoplist<-scan(file='stoplist_ascii.txt',
>                what='character',
>                strip.white=TRUE,
>                blank.lines.skip=TRUE,
>                fileEncoding='CP1251',
>                encoding='CP1251')
>
> (also tried with UTF-8 here on a correspondingly encoded file)
>
> I currently only test with a corpus based on the contents of just one
> variable, and I construct the corpus from a VectorSource. When I run
> inspect, all seems fine and I can see the text properly, with unicode
> characters present:
>
> data.corpus<-Corpus(VectorSource(data$variable,encoding='UTF-8'),
>                     readerControl=list(language='bulgarian'))
>
> However, no matter what I do - like which encoding I select - UTF-8 or
> CP1251, which is the typical code page for Bulgarian texts, I cannot get
> to remove the stop words from my corpus. The issue is present in both
> Linux and Windows, and across the computers I use R on, and I don't
> think it is related to bad configuration. Removal of punctuation, white
> space, and numbers is flawless, but the inability to remove stop words
> prevents me from further analysing the texts.
>
> Has somebody had experience with languages other than English, and for
> which there is no predefined stop list available through the stopwords
> function? I will highly appreciate any tips and advice!
Well, at least show us the code that you use to remove stopwords... Can you provide a reproducible example with a toy corpus?
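For instance, a minimal sketch of what such an example might look like, assuming a reasonably recent version of the tm package is installed. The two toy documents and the two-word stop list are made up for illustration (they are not your data), and the Cyrillic text is written with \u escapes so that the script does not depend on the encoding of the file it is saved in. I make no claim about what it prints on your machines; the point is that everyone can run exactly the same thing and compare.

```r
library(tm)

# Hypothetical toy documents (not the poster's data), in Bulgarian:
# "kotka i kuche" and "kuche na kotka"
docs <- c(
  "\u043a\u043e\u0442\u043a\u0430 \u0438 \u043a\u0443\u0447\u0435",
  "\u043a\u0443\u0447\u0435 \u043d\u0430 \u043a\u043e\u0442\u043a\u0430"
)

# Hypothetical toy stop list: "i" (and) and "na" (of/on)
stoplist <- c("\u0438", "\u043d\u0430")

# Build the corpus and try to remove the stop words
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, removeWords, stoplist)

# Pull the text back out of the corpus so the result can be pasted in a mail
cleaned <- vapply(corpus,
                  function(d) paste(content(d), collapse = " "),
                  character(1))
print(cleaned)

# Context that helps others reproduce the problem
print(Encoding(docs))
print(Encoding(stoplist))
print(sessionInfo())
```

If the stop words survive in `cleaned`, please post that output together with the result of sessionInfo(), since the locale and the tm version may matter here.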
> Thanks in advance,
> Vince
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.