Dear Martin, Thanks for testing the code. You are right. I modified the code:
If I test it on a sample text,

txt1<-"Romney A.K. different, (= than other people. Is it?"

OP's code:
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt
#words
#        A different        Is        it         K     other    people    Romney
#        1         1         1         1         1         1         1         1
#     than
#        1

#My code:
words.txt1<-sort(table(gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s")))[grepl("\\b\\w+\\b",gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s"))))]))
words.txt1
#       ak different        is        it     other    people    romney      than
#        1         1         1         1         1         1         1         1

Here, as you can see, OP's code splits A.K. into two words, but my code joins it into one ("ak"). I didn't fix it because the main concern is to minimize memory usage.

I tested the new code again on a text of 22393 words:
sapply(strsplit(txt1," "),length)
#[1] 4850 9072 6400 2071
sum(sapply(strsplit(txt1," "),length))
#[1] 22393

#OP's code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
#   user  system elapsed
# 12.056   0.000  12.066

#My code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
words.txt<-sort(table(gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s")))[grepl("\\b\\w+\\b",gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s"))))]),decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
#   user  system elapsed
#  0.148   0.000   0.150

The new code is much faster (0.150 vs. 12.066 seconds elapsed), and the output looks similar. The code could probably still be improved.

A.K.
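One possible direction for the memory concern, as a rough sketch only: read the file in chunks of lines and add each chunk's counts to a running total, so the whole text is never held in memory at once. The function name count_words_chunked, the chunk size, and the choice to treat a "word" as a run of ASCII letters (so "A.K." gives "a" and "k", as with the OP's pattern) are my assumptions, not something from the thread. I have not measured the memory use of this version.

```r
# Sketch only: count word frequencies chunk by chunk.
# count_words_chunked and chunk_lines are hypothetical names for illustration.
count_words_chunked <- function(path, chunk_lines = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  counts <- integer(0)  # named integer vector: word -> running count
  repeat {
    lines <- readLines(con, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break
    # Split on runs of non-letters, so "different," yields "different"
    # and "(=" yields nothing.
    words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
    words <- words[nzchar(words)]  # drop empty strings from leading separators
    if (length(words) == 0L) next
    chunk <- table(words)
    # Merge this chunk's counts into the running totals.
    nm <- names(chunk)
    old <- counts[nm]
    old[is.na(old)] <- 0L  # words not seen in earlier chunks
    counts[nm] <- old + as.integer(chunk)
  }
  sort(counts, decreasing = TRUE)
}

# Small check on the sample sentence used above.
tmp <- tempfile()
writeLines("Romney A.K. different, (= than other people. Is it?", tmp)
freq <- count_words_chunked(tmp)
```

The result is a named vector, so the same paste(names(freq),freq,sep="\t") and cat(...) lines as before can write it to a file. Two caveats: chunking by lines does not help if the file is one or a few very long lines, and non-ASCII letters are treated as separators here.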
----- Original Message -----
From: Martin Maechler <maech...@stat.math.ethz.ch>
To: arun <smartpink...@yahoo.com>
Cc: mcelis <mce...@lightminersystems.com>; R help <r-help@r-project.org>
Sent: Tuesday, September 25, 2012 9:07 AM
Subject: Re: [R] Memory usage in R grows considerably while calculating word frequencies

>>>>> arun <smartpink...@yahoo.com>
>>>>>     on Mon, 24 Sep 2012 19:59:35 -0700 writes:

    > HI,
    > In the previous email, I forgot to add unlist().
    > With four paragraphs,
    > sapply(strsplit(txt1," "),length)
    > #[1] 4850 9072 6400 2071
    > #Your code:
    > system.time({
    > txt1<-tolower(scan("text_file","character",sep="\n"))
    > pattern <- "(\\b[A-Za-z]+\\b)"
    > match <- gregexpr(pattern,txt1)
    > words.txt <- regmatches(txt1,match)
    > words.txt<-unlist(words.txt)
    > words.txt<-table(words.txt,dnn="words")
    > words.txt<-sort(words.txt,decreasing=TRUE)
    > words.txt<-paste(names(words.txt),words.txt,sep="\t")
    > cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
    > })
    > #Read 4 items
    > #   user  system elapsed
    > # 11.781   0.004  11.799

    > #Modified code:
    > system.time({
    > txt1<-tolower(scan("text_file","character",sep="\n"))
    > words.txt<-sort(table(unlist(strsplit(tolower(txt1),"\\s"))),decreasing=TRUE)
    > words.txt<-paste(names(words.txt),words.txt,sep="\t")
    > cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
    > })
    > #Read 4 items
    > #   user  system elapsed
    > #  0.036   0.008   0.043

    > A.K.

Well, dear A.K., your definition of "word" is really different, and in my view clearly much too simplistic, compared to what the OP (= original-poster) asked for.

E.g., from the above paragraph, your method will get words such as "A.K.," "different," or "(=" clearly wrongly.
Martin Maechler, ETH Zurich

> ----- Original Message -----
> From: mcelis <mce...@lightminersystems.com>
> To: r-help@r-project.org
> Cc:
> Sent: Monday, September 24, 2012 7:29 PM
> Subject: [R] Memory usage in R grows considerably while calculating word frequencies
>
> I am working with some large text files (up to 16 GBytes). I am interested
> in extracting the words and counting how many times each word appears in the
> text. I have written a very simple R program by following some suggestions
> and examples I found online.
>
> If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory
> when executing the program on a 64-bit system running CentOS 6.3.
> Why is R using so much memory? Is there a better way to do this that will
> minimize memory usage?
>
> I am very new to R, so I would appreciate some tips on how to improve my
> program or a better way to do it.
>
> R program:
> # Read in the entire file and convert all words in text to lower case
> words.txt<-tolower(scan("text_file","character",sep="\n"))
>
> # Extract words
> pattern <- "(\\b[A-Za-z]+\\b)"
> match <- gregexpr(pattern,words.txt)
> words.txt <- regmatches(words.txt,match)
>
> # Create a vector from the list of words
> words.txt<-unlist(words.txt)
>
> # Calculate word frequencies
> words.txt<-table(words.txt,dnn="words")
>
> # Sort by frequency, not alphabetically
> words.txt<-sort(words.txt,decreasing=TRUE)
>
> # Put into some readable form, "Name of word" and "Number of times it occurs"
> words.txt<-paste(names(words.txt),words.txt,sep="\t")
>
> # Results to a file
> cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-tp4644053.html
> Sent from the R help mailing list archive at Nabble.com.
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.