Thank you so much! You made it look (almost) easy. I greatly appreciate it!

On 10.4.2013 at 11:29, Milan Bouchet-Valat wrote:
On Wednesday, 10 April 2013 at 10:50 +0300, Ventseslav Kozarev, MPP
wrote:
Hi,

Thanks for taking the time. Here is a more reproducible example of the
entire process:

# Create a vector source - some throwaway text in Bulgarian
bg<-c('Днес е хубав и слънчев ден, в който всички искат да бъдат навън.',
      'Утре ще бъде още по-хубав ден.')

# Convert the strings to UTF-8. Without this step, on my setup I don't
# see Cyrillic letters, even if I set the default code page to CP1251.
bg<-iconv(bg,to='UTF-8')

# Load the tm library
library(tm)

# Create the corpus from the vector source
corp<-Corpus(VectorSource(bg),readerControl=list(language='bulgarian'))

# Create a custom stop list based on the example vector source
# Converting to UTF-8
stoplist<-c('е','и','в','който','всички','да','бъдат','навън','ще','бъде','още')
stoplist<-iconv(stoplist,to='UTF-8')

# Preprocessing
corp<-tm_map(corp,removePunctuation)
corp<-tm_map(corp,removeNumbers)
corp<-tm_map(corp,tolower)
corp<-tm_map(corp,removeWords,stoplist)

# End of code here

Now, if I run inspect(corp), I still see all the stop words intact
inside the corpus, and I can't figure out why. I tried experimenting
with file encodings, with and without explicit encoding declarations,
and it never works. As far as I can tell, my code is not wrong: the
function stopwords('language') returns a character vector, so simply
replacing it with a different character vector should do the trick.
Alas, no stop word list for Bulgarian ships with the tm package (not
surprisingly).
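
For reference, a custom vector really should be a drop-in replacement,
since the built-in lists are themselves plain character vectors. A
quick check (illustration only, not from my original code):

is.character(stopwords('english'))  # TRUE - just a character vector
head(stopwords('english'))          # first few entries of the built-in list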

In the above example, I also tried to read in the list of stop words
from a file using the scan function, per the example in my original
message. It also fails to remove stop words, without any warnings or
error messages.

An alternative I tried was to convert to a term-document matrix and
then loop over the words inside, removing those that are also on the
stop list. That only partially works, for two reasons. First, the TDM
is actually a list, and I am not sure what code to use to delete words
without leaving the underlying indices stale. Second, some of the
words still don't get removed even though they are on the list. But
that is another issue altogether...
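
For concreteness, here is roughly the kind of code I mean (a rough
sketch only, using the corp and stoplist objects from the example
above); as far as I can tell, subsetting the matrix with '[' should
keep the internal indices consistent:

tdm<-TermDocumentMatrix(corp)
keep<-which(!(Terms(tdm) %in% stoplist))  # rows whose terms are not stop words
tdm<-tdm[keep,]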

Thanks for your attention and for your help!
Vince
Thanks for the reproducible example. Indeed, it does not work here
either (Linux with UTF-8 locale). The problem seems to be in the call to
gsub() in removeWords: the pattern "\\b" does not match anything when
perl=TRUE. With perl=FALSE, it works.

gsub("днес", "", "днес е хубав")
# [1] " е хубав"
gsub("днес", "", "днес е хубав", perl=TRUE)
# [1] " е хубав"
gsub("\\bднес\\b", "", "днес е хубав")
# [1] " е хубав"
gsub("\\bднес\\b", "", "днес е хубав", perl=TRUE)
# [1] "днес е хубав"

It looks like some non-ASCII characters like é or € are supported, but
not others like œ or the Cyrillic characters you provided.

As a temporary solution, you can define this function to override the
one provided by tm:

removeWords.PlainTextDocument <- function (x, words) {
    # Identical to tm's version except that perl = TRUE is dropped, so
    # gsub() uses the default engine, where "\\b" handles these characters
    gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "", x)
}
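
Defined at the top level like this, the method is found by S3 dispatch
ahead of the one in tm's namespace, so your earlier preprocessing call
should then work unchanged (a quick sketch reusing the corp and
stoplist objects from your example):

corp<-tm_map(corp,removeWords,stoplist)
inspect(corp)  # the stop words should now be gone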

I have CCed tm's developer, Ingo Feinerer, to see whether he has an
idea of how to fix the problem in tm; but this looks like a bug in R
(or in PCRE, the Perl-compatible regexp engine).


Regards

On 9.4.2013 at 22:55, Milan Bouchet-Valat wrote:
On Tuesday, 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
Hi,

I bumped into a serious issue while trying to analyse some texts in
Bulgarian (with the tm package). I import a tab-separated CSV file,
which holds a total of 22 variables, most of them text cells (not
factors), using the read.delim function:

data<-read.delim("bigcompanies_ascii.csv",
                   header=TRUE,
                   quote="'",
                   sep="\t",
                   as.is=TRUE,
                   encoding='CP1251',
                   fileEncoding='CP1251')

(I also tried the above with UTF-8 encoding on a UTF-8-saved file.)

I have my list of stop words written in a separate text file, one word
per line, which I read into R using the scan function:

stoplist<-scan(file='stoplist_ascii.txt',
                  what='character',
                  strip.white=TRUE,
                  blank.lines.skip=TRUE,
                  fileEncoding='CP1251',
                  encoding='CP1251')

(also tried with UTF-8 here on a correspondingly encoded file)

I currently test with a corpus based on the contents of just one
variable, constructed from a VectorSource. When I run inspect, all
seems fine and I can see the text properly, with Unicode characters
present:

data.corpus<-Corpus(VectorSource(data$variable,encoding='UTF-8'),
                      readerControl=list(language='bulgarian'))

However, no matter which encoding I select (UTF-8 or CP1251, the
typical code page for Bulgarian texts), I cannot get the stop words
removed from my corpus. The issue is present on both Linux and
Windows, and across the computers I run R on, so I don't think it is
caused by a bad configuration. Removal of punctuation, white space,
and numbers works flawlessly, but the inability to remove stop words
keeps me from analysing the texts any further.
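
In case it matters, here is the kind of check I mean when I say I
experimented with encodings (a sketch only; data and stoplist are the
objects defined above): ask R which encodings it has marked the
strings with, and normalise both sides to UTF-8 before comparing.

Encoding(head(data$variable))         # declared encoding of the corpus text
Encoding(head(stoplist))              # declared encoding of the stop words
stoplist<-iconv(stoplist,to='UTF-8')  # normalise the stop list to UTF-8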

Has somebody had experience with languages other than English, for
which no predefined stop list is available through the stopwords
function? I would highly appreciate any tips and advice!
Well, at least show us the code that you use to remove stopwords... Can
you provide a reproducible example with a toy corpus?

Thanks in advance,
Vince

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
