Hi,

Thanks for taking the time. Here is a more reproducible example of the entire process:

# Creating a vector source - some throwaway text in Bulgarian
bg<-c('Днес е хубав и слънчев ден, в който всички искат да бъдат навън.','Утре ще бъде още по-хубав ден.')

# Converting strings from the vector source to UTF-8. Without this step
# in my setup, I don't see Cyrillic letters, even if I set the default
# code page to CP1251.
bg<-iconv(bg,to='UTF-8')
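
# Sanity check (base R): Encoding() reports the declared encoding, and
# after the iconv() call above both elements should report "UTF-8"
Encoding(bg)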

# Load the tm library
library(tm)

# Create the corpus from the vector source
corp<-Corpus(VectorSource(bg),readerControl=list(language='bulgarian'))
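
# At this point inspecting the corpus looks fine - both documents show
# up with the Cyrillic intact
inspect(corp)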

# Create a custom stop list based on the example vector source
# Converting to UTF-8
stoplist<-c('е','и','в','който','всички','да','бъдат','навън','ще','бъде','още')
stoplist<-iconv(stoplist,to='UTF-8')
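
# Rough sanity check that every stop word actually occurs in the sample
# text (this assumes naive splitting on whitespace/punctuation, and
# tolower() behaviour depends on the locale)
tokens <- unlist(strsplit(tolower(bg), '[[:space:][:punct:]]+'))
all(stoplist %in% tokens)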

# Preprocessing
corp<-tm_map(corp,removePunctuation)
corp<-tm_map(corp,removeNumbers)
corp<-tm_map(corp,tolower)
corp<-tm_map(corp,removeWords,stoplist)
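
# Extra diagnostic, to isolate the problem from the corpus machinery:
# if I read tm's documentation right, removeWords also accepts a plain
# character vector, so the same substitution can be tested outside tm_map
removeWords(tolower(bg), stoplist)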

# End of code here

Now, if I run inspect(corp), I still see all the stop words intact inside the corpus, and I can't figure out why. I have experimented with file encodings, with and without explicit encoding declarations, and it never works. As far as I can tell, my code is not wrong: the function stopwords('language') returns a character vector, so replacing it with a different character vector should do the trick. Alas, no stop word list for Bulgarian ships with the tm package (not surprisingly).
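
The built-in lists really are plain character vectors, which is why I expected the substitution to be mechanical:

# The English list, for instance, is just a character vector
class(stopwords('english'))
# and removing it goes through exactly the same call:
# tm_map(corp, removeWords, stopwords('english'))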

In the above example, I also tried reading the list of stop words from a file with the scan function, per the example in my original message below. That also fails to remove the stop words, with no warnings or error messages.
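
For reference, this is the pattern I used (the file name here is a placeholder); comparing the scanned vector against the inline one should expose any encoding mismatch introduced by the file:

# Read one stop word per line and compare with the inline vector
stoplist2 <- scan(file='stoplist_utf8.txt', what='character',
                  blank.lines.skip=TRUE, encoding='UTF-8')
identical(stoplist2, stoplist)
Encoding(stoplist2)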

An alternative I tried was to convert to a term-document matrix and then loop over the terms, removing those that also appear on the stop list. That only partially works, for two reasons. First, the TDM is really a list, and I am not sure how to delete terms without also updating the underlying indices. Second, some of the words still don't get removed even though they are on the list - but that is another issue altogether...
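
For what it's worth, here is a sketch of the subsetting route, which avoids looping and re-indexing by hand (assuming, as the tm docs suggest, that a TermDocumentMatrix can be subset with integer row indices):

# Build the TDM, then keep only the rows whose terms are not stop words
tdm <- TermDocumentMatrix(corp)
keep <- which(!Terms(tdm) %in% stoplist)
tdm <- tdm[keep, ]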

Thanks for your attention and for your help!
Vince

On 9.4.2013 at 22:55, Milan Bouchet-Valat wrote:
On Tuesday, 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
Hi,

I bumped into a serious issue while trying to analyse some texts in
Bulgarian with the tm package. I import a tab-separated CSV file,
which holds a total of 22 variables, most of which are text columns
(not factors), using the read.delim function:

data<-read.delim("bigcompanies_ascii.csv",
                  header=TRUE,
                  quote="'",
                  sep="\t",
                  as.is=TRUE,
                  encoding='CP1251',
                  fileEncoding='CP1251')

(I also tried the above with UTF-8 encoding on a UTF-8-saved file.)

I have my list of stop words written in a separate text file, one word
per line, which I read into R using the scan function:

stoplist<-scan(file='stoplist_ascii.txt',
                 what='character',
                 strip.white=TRUE,
                 blank.lines.skip=TRUE,
                 fileEncoding='CP1251',
                 encoding='CP1251')

(also tried with UTF-8 here on a correspondingly encoded file)

I currently test only with a corpus based on the contents of just one
variable, and I construct the corpus from a VectorSource. When I run
inspect, all seems fine and I can see the text properly, with Unicode
characters present:

data.corpus<-Corpus(VectorSource(data$variable,encoding='UTF-8'),
                     readerControl=list(language='bulgarian'))

However, no matter which encoding I select - UTF-8 or CP1251, the
typical code page for Bulgarian text - I cannot remove the stop words
from my corpus. The issue is present on both Linux and Windows, and
across the computers I use R on, so I don't think it is related to a
bad configuration. Removal of punctuation, white space, and numbers
works flawlessly, but the inability to remove stop words prevents me
from analysing the texts any further.

Has somebody had experience with languages other than English, for
which no predefined stop list is available through the stopwords
function? I will highly appreciate any tips and advice!

Well, at least show us the code that you use to remove stopwords... Can
you provide a reproducible example with a toy corpus?

Thanks in advance,
Vince

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

