[R] R hangs at NGramTokenizer

Neep Hazarika Thu, 26 Sep 2013 07:17:12 -0700

Hi:
I am trying to construct a Document-Term Matrix from a corpus. The commands I
used are:
> library(parallel)
> library(tm)
> library(RWeka)
> library(topicmodels)
> library(RTextTools)
> cl=makeCluster(detectCores())
> invisible(clusterEvalQ(cl, library(tm)))
> invisible(clusterEvalQ(cl, library(RWeka)))
> invisible(clusterEvalQ(cl, library(topicmodels)))
> invisible(clusterEvalQ(cl, library(RTextTools)))
> myCorpus <- Corpus(DirSource("/home/neeph/Test/DMOZ_Business"),
+     encoding="UTF-8", readerControl=list(reader=readPlain))
> removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
> myCorpus <- tm_map(myCorpus, removeURL)
> removeAmp <- function(x) gsub("&amp;", "", x)
> myCorpus <- tm_map(myCorpus, removeAmp)
> removeWWW <- function(x) gsub("www[[:alnum:]]*", "", x)
> myCorpus <- tm_map(myCorpus, removeWWW)
> myCorpus <- tm_map(myCorpus, tolower)
> myCorpus <- tm_map(myCorpus, removeNumbers)
> myCorpus <- tm_map(myCorpus, removePunctuation)
> myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
> myCorpus <- tm_map(myCorpus, removeWords, stopwords("SMART"))
> myCorpus <- tm_map(myCorpus, stripWhitespace)
> myDtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths = c(1,Inf)))
Everything works fine up to this stage, as long as I do not include
tokenizing. However, when I run the code with the following alteration:
> dictCorpus <- myCorpus
> myDtm <- DocumentTermMatrix(myCorpus, control =
+     list(wordlengths=c(1,Inf), tokenize=NGramTokenizer, dictionary=dictCorpus))
it hangs. I left it running overnight, but it produced no results. Any help
would be much appreciated.
Thanks,
Neep Hazarika


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
