Re: [R] package "tm" fails to remove "the" with remove stopwords

Mark Kimpel Fri, 13 Nov 2009 08:47:57 -0800

Sam,

Thanks for the example. Removing stop words when (or after) the
DocumentTermMatrix is created works fine if one is working with single
words, but what if one is creating a dtm of combinations of words? Wouldn't
one want to remove the stop words from the corpus itself, before the dtm is
built?
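
To make the difference concrete, here is a minimal base-R sketch. It does not
use tm at all: the stop list is a small hand-picked one, and the helper names
tokenize() and bigrams() are made up for illustration. It only shows that
dropping stop words before forming word pairs gives different terms than
filtering the pairs afterwards.

```r
# Illustration only: a tiny hand-picked stop list, not tm's stopwords("english").
stop_words <- c("the", "in", "on", "and", "up", "to", "a", "of")

docs <- c("the rain in Spain", "falls mainly on the plain")

# Hypothetical helper: split a document into lower-case word tokens.
tokenize <- function(x) strsplit(tolower(x), "[^a-z]+")[[1]]

# Hypothetical helper: build adjacent word pairs (bigrams) from a token vector.
bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

# Option A: drop stop words from the text first, then form bigrams.
before <- lapply(docs, function(d) {
  tok <- tokenize(d)
  bigrams(tok[!tok %in% stop_words])
})

# Option B: form bigrams first, then drop every bigram containing a stop word.
after <- lapply(docs, function(d) {
  bg <- bigrams(tokenize(d))
  has_stop <- vapply(strsplit(bg, " "),
                     function(p) any(p %in% stop_words), logical(1))
  bg[!has_stop]
})

print(before)
print(after)
```

With option A the first document yields the pair "rain spain"; with option B
it yields nothing, because every pair in it contains a stop word. That is why
I would expect to need the removal at the corpus level.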


Mark

Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please


On Thu, Nov 12, 2009 at 12:04 PM, Sam Thomas <sam.tho...@revelanttech.com> wrote:

> I'm not sure what's wrong with your approach, but this seems to strip
> "the":
>
> require(tm)
>
> params <- list(minDocFreq = 1,
>                removeNumbers = TRUE,
>                stemming = TRUE,
>                stopwords = TRUE,
>                weighting = weightTf)
>
> myDocument <- c("the rain in Spain", "falls mainly on the plain",
>                 "jack and jill ran up the hill", "to fetch a pail of water")
>
> text.corp <- Corpus(VectorSource(myDocument))
> dtm <- DocumentTermMatrix(text.corp, control = params)
> dtm
>
> dtm.mat <- as.matrix(dtm)
> dtm.mat
>
> *From:* Mark Kimpel [mailto:mwkim...@gmail.com]
> *Sent:* Thursday, November 12, 2009 11:30 AM
> *To:* r-help@r-project.org; feine...@logic.at; Sam Thomas
> *Subject:* package "tm" fails to remove "the" with remove stopwords
>
>
>
> I am using code that previously worked to remove stopwords using package
> "tm". Even manually adding "the" to the list does not work to remove "the".
> This package has undergone extensive redevelopment with changes to the
> function syntax, so perhaps I am just missing something.
>
>
>
> Please see my simple example, output, and sessionInfo() below.
>
>
>
> Thanks!
>
> Mark
>
>
>
> require(tm)
>
> myDocument <- c("the rain in Spain", "falls mainly on the plain",
>                 "jack and jill ran up the hill", "to fetch a pail of water")
>
> text.corp <- Corpus(VectorSource(myDocument))
>
> #########################
> text.corp <- tm_map(text.corp, stripWhitespace)
> text.corp <- tm_map(text.corp, removeNumbers)
> text.corp <- tm_map(text.corp, removePunctuation)
> ## text.corp <- tm_map(text.corp, stemDocument)
> text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
>
> dtm <- DocumentTermMatrix(text.corp)
> dtm
>
> dtm.mat <- as.matrix(dtm)
> dtm.mat
>
>
>
> > dtm.mat
>     Terms
> Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
>    1     0     0    0    0    0      0    0     0    1   0     1   1     0
>    2     1     0    0    0    0      1    0     1    0   0     0   0     0
>    3     0     0    1    1    1      0    0     0    0   1     0   0     0
>    4     0     1    0    0    0      0    1     0    0   0     0   0     1
>
>
>
> R version 2.10.0 Patched (2009-10-27 r50222)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
>
> other attached packages:
> [1] chron_2.3-33 RWeka_0.3-23 tm_0.5-1
>
> loaded via a namespace (and not attached):
> [1] grid_2.10.0  rJava_0.8-1  slam_0.1-6   tools_2.10.0


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
