Re: [R] Memory usage in R grows considerably while calculating word frequencies

arun Tue, 25 Sep 2012 13:04:48 -0700

Dear Martin,

Thanks for testing the code.  You are right.
I modified the code:


Testing it on a sample text:

txt1<-"Romney A.K. different, (= than other people.  Is it?"
#OP's code:
pattern <- "(\\b[A-Za-z]+\\b)"
 match <- gregexpr(pattern,txt1)
 words.txt <- regmatches(txt1,match)
 words.txt<-unlist(words.txt)
 words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt
#words
#        A different        Is        it         K     other    people    Romney 
#        1         1         1         1         1         1         1         1 
#     than 
#        1 


#My code:

 words.txt1<-sort(table(gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s")))[grepl("\\b\\w+\\b",gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s"))))]))
#       ak different        is        it     other    people    romney      than 
#        1         1         1         1         1         1         1         1 
 

Here, as you can see, OP's code splits A.K. into two words ("A" and "K"),
whereas my code joins it into one ("ak").  I didn't fix that because the main
concern is to minimize memory usage.
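
The one-liner above computes the cleaned tokens twice, once for the values and
once for the filter.  A minimal sketch of the same steps with the intermediate
result stored once is given here; the name `tokens` is introduced only for
illustration, and nzchar() is swapped in for the grepl() filter, which should
be equivalent once all non-word characters have been removed:

```r
txt1 <- "Romney A.K. different, (= than other people.  Is it?"

# Split on whitespace, then strip every non-word character from each piece.
tokens <- gsub("\\W", "", unlist(strsplit(tolower(txt1), "\\s")))

# Drop the pieces that became empty (e.g. "(=" and the gap between two spaces).
tokens <- tokens[nzchar(tokens)]

# Tabulate and sort, as in the one-liner.
words.txt1 <- sort(table(tokens))
words.txt1
```

As before, "A.K." ends up as the single token "ak".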

I tested the new code again on a text of four paragraphs, 22393 words in total:
 sapply(strsplit(txt1," "),length)
#[1] 4850 9072 6400 2071
 sum(sapply(strsplit(txt1," "),length))
#[1] 22393

#OP's code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n"))
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern,txt1)
words.txt <- regmatches(txt1,match)
words.txt<-unlist(words.txt)
words.txt<-table(words.txt,dnn="words")
words.txt<-sort(words.txt,decreasing=TRUE)
words.txt<-paste(names(words.txt),words.txt,sep="\t")
cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
})
#Read 4 items
 # user  system elapsed 
# 12.056   0.000  12.066 

#My code:
system.time({
txt1<-tolower(scan("text_file","character",sep="\n")) 
 words.txt<-sort(table(gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s")))[grepl("\\b\\w+\\b",gsub("\\W","",unlist(strsplit(tolower(txt1),"\\s"))))]),decreasing=TRUE)
 words.txt<-paste(names(words.txt),words.txt,sep="\t")
 cat("Word\tFREQ",words.txt,file="frequencies",sep="\n") 
})
#Read 4 items
  # user  system elapsed 
 # 0.148   0.000   0.150 

There is a clear improvement in speed (about 0.15 s elapsed versus about 12 s
for OP's code).  The output also looked similar.  The code could probably be
improved further.
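
Since the original question was about memory rather than speed, one possible
further step is to avoid reading the whole file at once.  The following is only
a sketch under stated assumptions, not a tested solution for 16 GByte files:
the function name `count_words_chunked` and its argument `chunk_lines` are
hypothetical, the word definition is the same simplistic one as in my code
above, and the table of distinct words is assumed to fit in memory.

```r
# Hypothetical helper: count words while reading the file in chunks, so that
# only `chunk_lines` lines of text are held in memory at any one time.
count_words_chunked <- function(path, chunk_lines = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  counts <- numeric(0)                 # named vector: word -> running count
  repeat {
    lines <- readLines(con, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break     # end of file
    tokens <- gsub("\\W", "", unlist(strsplit(tolower(lines), "\\s")))
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) == 0L) next     # chunk contained no words
    chunk <- table(tokens)
    # Append this chunk's counts to the running totals, then sum per word.
    merged <- c(counts, setNames(as.numeric(chunk), names(chunk)))
    totals <- tapply(merged, names(merged), sum)
    counts <- setNames(as.vector(totals), names(totals))
  }
  sort(counts, decreasing = TRUE)
}
```

It could be used in place of the scan() call, for example
words.txt <- count_words_chunked("text_file"), followed by the same paste()
and cat() lines as above.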
A.K.
   




----- Original Message -----
From: Martin Maechler <maech...@stat.math.ethz.ch>
To: arun <smartpink...@yahoo.com>
Cc: mcelis <mce...@lightminersystems.com>; R help <r-help@r-project.org>
Sent: Tuesday, September 25, 2012 9:07 AM
Subject: Re: [R] Memory usage in R grows considerably while calculating word frequencies

>>>>> arun  <smartpink...@yahoo.com>
>>>>>     on Mon, 24 Sep 2012 19:59:35 -0700 writes:

    > HI,
    > In the previous email, I forgot to add unlist().
    > With four paragraphs,
    > sapply(strsplit(txt1," "),length)
    > #[1] 4850 9072 6400 2071


    > #Your code:
    > system.time({
    > txt1<-tolower(scan("text_file","character",sep="\n")) 
    > pattern <- "(\\b[A-Za-z]+\\b)"
    > match <- gregexpr(pattern,txt1)
    > words.txt <- regmatches(txt1,match)
    > words.txt<-unlist(words.txt)
    > words.txt<-table(words.txt,dnn="words")
    > words.txt<-sort(words.txt,decreasing=TRUE)
    > words.txt<-paste(names(words.txt),words.txt,sep="\t")
    > cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
    > })

    > #Read 4 items
    > #   user  system elapsed 
    > # 11.781   0.004  11.799 


    > #Modified code:
    > system.time({
    > txt1<-tolower(scan("text_file","character",sep="\n")) 
    >  words.txt<-sort(table(unlist(strsplit(tolower(txt1),"\\s"))),decreasing=TRUE)
    >  words.txt<-paste(names(words.txt),words.txt,sep="\t")
    >  cat("Word\tFREQ",words.txt,file="frequencies",sep="\n") 
    > })
    > #Read 4 items
    >  #user  system elapsed 
    >  # 0.036   0.008   0.043 


    > A.K.

Well, dear A.K., your definition of "word" is really different,
and in my view clearly much too simplistic, compared to what the
OP (= original-poster) asked for.

For example, from the above paragraph, your method would extract "words" such as
"A.K.,"   "different,"  or  "(="  
which is clearly wrong.
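
A small sketch illustrating this; the sample string `txt` is made up from the
paragraph above, and the lowercasing mirrors the tolower() call in the modified
code:

```r
txt <- "Well, dear A.K., your definition is really different, says the OP (= original-poster)"

# The modified code: lowercase, then split on whitespace only.
tokens <- unlist(strsplit(tolower(txt), "\\s"))

# Tokens that still contain something other than a letter.
tokens[grepl("[^a-z]", tokens)]
```

Punctuation stays attached to the neighbouring letters, so "different," and
"different" would be counted as two distinct words.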

Martin Maechler, ETH Zurich



    > ----- Original Message -----
    > From: mcelis <mce...@lightminersystems.com>
    > To: r-help@r-project.org
    > Cc: 
    > Sent: Monday, September 24, 2012 7:29 PM
    > Subject: [R] Memory usage in R grows considerably while calculating word frequencies

    > I am working with some large text files (up to 16 GBytes).  I am interested
    > in extracting the words and counting the number of times each word appears
    > in the text. I have written a very simple R program by following some
    > suggestions and examples I found online.

    > If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory
    > when executing the program on a 64-bit system running CentOS 6.3. Why is R
    > using so much memory? Is there a better way to do this that will minimize
    > memory usage?

    > I am very new to R, so I would appreciate some tips on how to improve my
    > program or a better way to do it.

    > R program:
    > # Read in the entire file and convert all words in text to lower case
    > words.txt<-tolower(scan("text_file","character",sep="\n"))

    > # Extract words
    > pattern <- "(\\b[A-Za-z]+\\b)"
    > match <- gregexpr(pattern,words.txt)
    > words.txt <- regmatches(words.txt,match)

    > # Create a vector from the list of words
    > words.txt<-unlist(words.txt)

    > # Calculate word frequencies
    > words.txt<-table(words.txt,dnn="words")

    > # Sort by frequency, not alphabetically
    > words.txt<-sort(words.txt,decreasing=TRUE)

    > # Put into some readable form, "Name of word" and "Number of times it occurs"
    > words.txt<-paste(names(words.txt),words.txt,sep="\t")

    > # Results to a file
    > cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")



    > --
    > View this message in context: http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-tp4644053.html
    > Sent from the R help mailing list archive at Nabble.com.

    > ______________________________________________
    > R-help@r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-help
    > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
    > and provide commented, minimal, self-contained, reproducible code.




