Hi,

Thanks for taking the time. Here is a more reproducible example of the entire process:

# Creating a vector source - some throwaway text in Bulgarian
bg<-c('Днес е хубав и слънчев ден, в който всички искат да бъдат навън.','Утре ще бъде още по-хубав ден.')

# Converting strings from the vector source to UTF-8. Without this step
# in my setup, I don't see Cyrillic letters, even if I set the default
# code page to CP1251.
bg<-iconv(bg,to='UTF-8')
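
# Sanity check (base R): Encoding() reports the declared encoding, and
# after the iconv() call above both elements should report "UTF-8"
Encoding(bg)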

# Load the tm library
library(tm)

# Create the corpus from the vector source
corp<-Corpus(VectorSource(bg),readerControl=list(language='bulgarian'))
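
# At this point inspecting the corpus looks fine - both documents show
# up with the Cyrillic intact
inspect(corp)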

# Create a custom stop list based on the example vector source
# Converting to UTF-8
stoplist<-c('е','и','в','който','всички','да','бъдат','навън','ще','бъде','още')
stoplist<-iconv(stoplist,to='UTF-8')
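
# Rough sanity check that every stop word actually occurs in the sample
# text (this assumes naive splitting on whitespace/punctuation, and
# tolower() behaviour depends on the locale)
tokens <- unlist(strsplit(tolower(bg), '[[:space:][:punct:]]+'))
all(stoplist %in% tokens)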

# Preprocessing
corp<-tm_map(corp,removePunctuation)
corp<-tm_map(corp,removeNumbers)
corp<-tm_map(corp,tolower)
corp<-tm_map(corp,removeWords,stoplist)
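
# Extra diagnostic, to isolate the problem from the corpus machinery:
# if I read tm's documentation right, removeWords also accepts a plain
# character vector, so the same substitution can be tested outside tm_map
removeWords(tolower(bg), stoplist)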

# End of code here

Now, if I run inspect(corp), I still see all the stop words intact inside the corpus, and I can't figure out why. I have experimented with file encodings, with and without explicit encoding declarations, and it never works. As far as I can tell, my code is not wrong: the function stopwords('language') returns a character vector, so replacing it with a different character vector should do the trick. Alas, no stop word list for Bulgarian ships with the tm package (not surprisingly).
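
The built-in lists really are plain character vectors, which is why I expected the substitution to be mechanical:

# The English list, for instance, is just a character vector
class(stopwords('english'))
# and removing it goes through exactly the same call:
# tm_map(corp, removeWords, stopwords('english'))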

In the above example, I also tried reading the list of stop words from a file with the scan function, per the example in my original message below. That also fails to remove the stop words, with no warnings or error messages.
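
For reference, this is the pattern I used (the file name here is a placeholder); comparing the scanned vector against the inline one should expose any encoding mismatch introduced by the file:

# Read one stop word per line and compare with the inline vector
stoplist2 <- scan(file='stoplist_utf8.txt', what='character',
                  blank.lines.skip=TRUE, encoding='UTF-8')
identical(stoplist2, stoplist)
Encoding(stoplist2)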

An alternative I tried was to convert to a term-document matrix and then loop over the terms, removing those that also appear on the stop list. That only partially works, for two reasons. First, the TDM is really a list, and I am not sure how to delete terms without also updating the underlying indices. Second, some of the words still don't get removed even though they are on the list - but that is another issue altogether...
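
For what it's worth, here is a sketch of the subsetting route, which avoids looping and re-indexing by hand (assuming, as the tm docs suggest, that a TermDocumentMatrix can be subset with integer row indices):

# Build the TDM, then keep only the rows whose terms are not stop words
tdm <- TermDocumentMatrix(corp)
keep <- which(!Terms(tdm) %in% stoplist)
tdm <- tdm[keep, ]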

Thanks for your attention and for your help!
Vince

On 9.4.2013 at 22:55, Milan Bouchet-Valat wrote:
On Tuesday, 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
Hi,

I bumped into a serious issue while trying to analyse some texts in
Bulgarian with the tm package. I import a tab-separated CSV file,
which holds a total of 22 variables, most of which are text columns
(not factors), using the read.delim function:

data<-read.delim("bigcompanies_ascii.csv",
                  header=TRUE,
                  quote="'",
                  sep="\t",
                  as.is=TRUE,
                  encoding='CP1251',
                  fileEncoding='CP1251')

(I also tried the above with UTF-8 encoding on a UTF-8-saved file.)

I have my list of stop words written in a separate text file, one word
per line, which I read into R using the scan function:

stoplist<-scan(file='stoplist_ascii.txt',
                 what='character',
                 strip.white=TRUE,
                 blank.lines.skip=TRUE,
                 fileEncoding='CP1251',
                 encoding='CP1251')

(also tried with UTF-8 here on a correspondingly encoded file)

I currently test only with a corpus based on the contents of just one
variable, and I construct the corpus from a VectorSource. When I run
inspect, all seems fine and I can see the text properly, with Unicode
characters present:

data.corpus<-Corpus(VectorSource(data$variable,encoding='UTF-8'),
                     readerControl=list(language='bulgarian'))

However, no matter which encoding I select - UTF-8 or CP1251, the
typical code page for Bulgarian text - I cannot remove the stop words
from my corpus. The issue is present on both Linux and Windows, and
across the computers I use R on, so I don't think it is related to a
bad configuration. Removal of punctuation, white space, and numbers
works flawlessly, but the inability to remove stop words prevents me
from analysing the texts any further.

Has somebody had experience with languages other than English, for
which no predefined stop list is available through the stopwords
function? I will highly appreciate any tips and advice!

Well, at least show us the code that you use to remove stopwords... Can
you provide a reproducible example with a toy corpus?

Thanks in advance,
Vince

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

