Hi,

I am trying to construct a document-term matrix from a corpus. The commands I used are:

> library(parallel)
> library(tm)
> library(RWeka)
> library(topicmodels)
> library(RTextTools)
> cl=makeCluster(detectCores())
> invisible(clusterEvalQ(cl, library(tm)))
> invisible(clusterEvalQ(cl, library(RWeka)))
> invisible(clusterEvalQ(cl, library(topicmodels)))
> invisible(clusterEvalQ(cl, library(RTextTools)))
> myCorpus <- Corpus(DirSource("/home/neeph/Test/DMOZ_Business"), encoding="UTF-8", readerControl=list(reader=readPlain))
> removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
> myCorpus <- tm_map(myCorpus, removeURL)
> removeAmp <- function(x) gsub("&", "", x)
> myCorpus <- tm_map(myCorpus, removeAmp)
> removeWWW <- function(x) gsub("www[[:alnum:]]*", "", x)
> myCorpus <- tm_map(myCorpus, removeWWW)
> myCorpus <- tm_map(myCorpus, tolower)
> myCorpus <- tm_map(myCorpus, removeNumbers)
> myCorpus <- tm_map(myCorpus, removePunctuation)
> myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
> myCorpus <- tm_map(myCorpus, removeWords, stopwords("SMART"))
> myCorpus <- tm_map(myCorpus, stripWhitespace)
> myDtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths = c(1,Inf)))

Everything works fine up to this stage, as long as I do not include tokenizing. However, when I run the code with the following alteration:

> dictCorpus <- myCorpus
> myDtm <- DocumentTermMatrix(myCorpus, control = list(wordlengths=c(1,Inf), tokenize=NGramTokenizer, dictionary=dictCorpus))

it hangs. I kept it running overnight, but it produced no results.

Any help would be much appreciated.

Thanks,
Neep Hazarika
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.