Sam,

Thanks for the example. Removing stop words after the DocumentTermMatrix has been created works fine if one is working with single words, but what if one is creating a dtm of possible combinations of words? Wouldn't one want to remove them from the corpus?
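To make the concern concrete, here is a minimal base-R sketch (no tm calls). The stop word list, the `tokenize()` and `bigrams()` helpers, and the choice of bigrams as the "combinations of words" are all assumptions made for illustration; this is not how tm builds its terms internally.

```r
# Hypothetical, hand-picked stop word list (NOT tm's stopwords("english")).
stop_words <- c("the", "in", "on", "and", "up", "to", "a", "of")

# Split a string into lower-case word tokens.
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}

# Join each pair of adjacent tokens into a two-word term.
bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

doc <- "jack and jill ran up the hill"
tokens <- tokenize(doc)

# Option 1: drop stop words from the text first, then build bigrams.
before <- bigrams(tokens[!tokens %in% stop_words])

# Option 2: build bigrams first, then drop terms found in the stop list.
all_bigrams <- bigrams(tokens)
after <- all_bigrams[!all_bigrams %in% stop_words]

print(before)  # "jack jill" "jill ran" "ran hill"
print(after)   # all six bigrams survive, including "up the" and "the hill"
```

In this sketch, filtering the finished term list removes nothing, because no two-word term is itself equal to a stop word. That is why removal from the corpus looks like the step that matters for multi-word terms.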
Mark

Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN 46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please

On Thu, Nov 12, 2009 at 12:04 PM, Sam Thomas <sam.tho...@revelanttech.com> wrote:

> I'm not sure what's wrong with your approach, but this seems to strip "the"
>
> require(tm)
> params <- list(minDocFreq = 1,
>                removeNumbers = TRUE,
>                stemming = TRUE,
>                stopwords = TRUE,
>                weighting = weightTf)
>
> myDocument <- c("the rain in Spain", "falls mainly on the plain",
>                 "jack and jill ran up the hill", "to fetch a pail of water")
> text.corp <- Corpus(VectorSource(myDocument))
> dtm <- DocumentTermMatrix(text.corp, control = params)
> dtm
> dtm.mat <- as.matrix(dtm)
> dtm.mat
>
> *From:* Mark Kimpel [mailto:mwkim...@gmail.com]
> *Sent:* Thursday, November 12, 2009 11:30 AM
> *To:* r-help@r-project.org; feine...@logic.at; Sam Thomas
> *Subject:* package "tm" fails to remove "the" with remove stopwords
>
> I am using code that previously worked to remove stopwords using package
> "tm". Even manually adding "the" to the list does not work to remove "the".
> This package has undergone extensive redevelopment with changes to the
> function syntax, so perhaps I am just missing something.
>
> Please see my simple example, output, and sessionInfo() below.
>
> Thanks!
> Mark
>
> require(tm)
> myDocument <- c("the rain in Spain", "falls mainly on the plain",
>                 "jack and jill ran up the hill", "to fetch a pail of water")
> text.corp <- Corpus(VectorSource(myDocument))
> #########################
> text.corp <- tm_map(text.corp, stripWhitespace)
> text.corp <- tm_map(text.corp, removeNumbers)
> text.corp <- tm_map(text.corp, removePunctuation)
> ## text.corp <- tm_map(text.corp, stemDocument)
> text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
> dtm <- DocumentTermMatrix(text.corp)
> dtm
> dtm.mat <- as.matrix(dtm)
> dtm.mat
>
> > dtm.mat
>     Terms
> Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
>    1     0     0    0    0    0      0    0     0    1   0     1   1     0
>    2     1     0    0    0    0      1    0     1    0   0     0   0     0
>    3     0     0    1    1    1      0    0     0    0   1     0   0     0
>    4     0     1    0    0    0      0    1     0    0   0     0   0     1
>
> R version 2.10.0 Patched (2009-10-27 r50222)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
>
> other attached packages:
> [1] chron_2.3-33 RWeka_0.3-23 tm_0.5-1
>
> loaded via a namespace (and not attached):
> [1] grid_2.10.0  rJava_0.8-1  slam_0.1-6   tools_2.10.0
>
> Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
>
> 15032 Hunter Court, Westfield, IN 46074
>
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 399-1219 Skype No Voicemail please

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.