On 25/09/12 01:29, mcelis wrote:
> I am working with some large text files (up to 16 GBytes). I am interested in
> extracting the words and counting how many times each word appears in the
> text. I have written a very simple R program by following some suggestions
> and examples I found online.

Just an idea (I have no experience with what you want to do, so it might not
work):

What about putting the text into a database (SQLite comes to mind), with each
word stored as one entry? You could then use SQL to query the database, which
should need much less memory than holding all the words in R at once.

In addition, it should make further processing much easier.
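
A minimal sketch of what I have in mind, untested on files of your size. It
assumes the DBI and RSQLite packages are installed; the helper names
load_words and word_frequencies, the table name "words", and the chunk size
are made up for illustration. The file is read a block of lines at a time, so
only one block is held in memory, and the counting is left to SQLite.

```r
# Sketch only: assumes the DBI and RSQLite packages are installed.
library(DBI)

# Hypothetical helper: stream the file at `path` into an SQLite table,
# one row per word, reading `chunk_lines` lines at a time.
load_words <- function(con, path, chunk_lines = 10000L) {
  dbExecute(con, "CREATE TABLE IF NOT EXISTS words (word TEXT NOT NULL)")
  input <- file(path, open = "r")
  on.exit(close(input))
  repeat {
    lines <- readLines(input, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break          # end of file reached
    lines <- tolower(lines)
    # Same word pattern as in the original program
    found <- unlist(regmatches(lines, gregexpr("\\b[A-Za-z]+\\b", lines)))
    if (length(found) > 0L) {
      dbWriteTable(con, "words",
                   data.frame(word = found, stringsAsFactors = FALSE),
                   append = TRUE)
    }
  }
  invisible(NULL)
}

# Hypothetical helper: let SQLite count and sort, most frequent word first.
word_frequencies <- function(con) {
  dbGetQuery(con, paste(
    "SELECT word, COUNT(*) AS freq FROM words",
    "GROUP BY word ORDER BY freq DESC, word"))
}

# Possible usage (file names taken from the original program):
# con <- dbConnect(RSQLite::SQLite(), "words.sqlite")
# load_words(con, "text_file")
# write.table(word_frequencies(con), file = "frequencies",
#             sep = "\t", quote = FALSE, row.names = FALSE)
# dbDisconnect(con)
```

Storing one row per word keeps R's memory use low, but the database file on
disk will grow with the size of the input, so this trades memory for disk
space.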

Cheers,

Rainer

> 
> If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory when
> executing the program on a 64-bit system running CentOS 6.3. Why is R using
> so much memory? Is there a better way to do this that will minimize memory
> usage?
> 
> I am very new to R, so I would appreciate some tips on how to improve my
> program or a better way to do it.
> 
> R program:
> 
> # Read in the entire file and convert all words in text to lower case
> words.txt<-tolower(scan("text_file","character",sep="\n"))
> 
> # Extract words
> pattern <- "(\\b[A-Za-z]+\\b)"
> match <- gregexpr(pattern,words.txt)
> words.txt <- regmatches(words.txt,match)
> 
> # Create a vector from the list of words
> words.txt<-unlist(words.txt)
> 
> # Calculate word frequencies
> words.txt<-table(words.txt,dnn="words")
> 
> # Sort by frequency, not alphabetically
> words.txt<-sort(words.txt,decreasing=TRUE)
> 
> # Put into some readable form, "Name of word" and "Number of times it occurs"
> words.txt<-paste(names(words.txt),words.txt,sep="\t")
> 
> # Results to a file
> cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")
> 
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-tp4644053.html
> Sent from the R help mailing list archive at Nabble.com.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
