Hi,

In the previous email I forgot to add unlist(). With four paragraphs:

sapply(strsplit(txt1, " "), length)
#[1] 4850 9072 6400 2071
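To see what the unlist() changes, here is a tiny made-up stand-in for the scanned file (two short "paragraphs", not the real data). strsplit() returns a list with one character vector per line, and table() needs a single flat vector:

txt1 <- c("the quick brown fox", "jumps over the lazy dog")

strsplit(txt1, " ")                      # a list: one word vector per paragraph
sapply(strsplit(txt1, " "), length)      # words per paragraph, as above
#[1] 4 5

# unlist() flattens the list so table() counts across the whole text:
sort(table(unlist(strsplit(txt1, " "))), decreasing = TRUE)
# "the" gets count 2; every other word appears once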
#Your code:
system.time({
  txt1 <- tolower(scan("text_file", "character", sep = "\n"))
  pattern <- "(\\b[A-Za-z]+\\b)"
  match <- gregexpr(pattern, txt1)
  words.txt <- regmatches(txt1, match)
  words.txt <- unlist(words.txt)
  words.txt <- table(words.txt, dnn = "words")
  words.txt <- sort(words.txt, decreasing = TRUE)
  words.txt <- paste(names(words.txt), words.txt, sep = "\t")
  cat("Word\tFREQ", words.txt, file = "frequencies", sep = "\n")
})
#Read 4 items
#   user  system elapsed
# 11.781   0.004  11.799

#Modified code:
system.time({
  txt1 <- tolower(scan("text_file", "character", sep = "\n"))
  # txt1 is already lower case, so no second tolower() is needed;
  # splitting on "\\s+" instead of "\\s" would avoid counting empty
  # strings where words are separated by runs of whitespace
  words.txt <- sort(table(unlist(strsplit(txt1, "\\s"))), decreasing = TRUE)
  words.txt <- paste(names(words.txt), words.txt, sep = "\t")
  cat("Word\tFREQ", words.txt, file = "frequencies", sep = "\n")
})
#Read 4 items
#   user  system elapsed
#  0.036   0.008   0.043

A.K.

----- Original Message -----
From: mcelis <mce...@lightminersystems.com>
To: r-help@r-project.org
Sent: Monday, September 24, 2012 7:29 PM
Subject: [R] Memory usage in R grows considerably while calculating word frequencies

I am working with some large text files (up to 16 GB). I am interested in extracting the words and counting how many times each word appears in the text. I have written a very simple R program by following some suggestions and examples I found online.

If my input file is 1 GB, I see that R uses up to 11 GB of memory when executing the program on a 64-bit system running CentOS 6.3. Why is R using so much memory? Is there a better way to do this that will minimize memory usage?

I am very new to R, so I would appreciate some tips on how to improve my program, or a better way to do it.

R program:

# Read in the entire file and convert all words in the text to lower case
words.txt <- tolower(scan("text_file", "character", sep = "\n"))

# Extract words
pattern <- "(\\b[A-Za-z]+\\b)"
match <- gregexpr(pattern, words.txt)
words.txt <- regmatches(words.txt, match)

# Create a vector from the list of words
words.txt <- unlist(words.txt)

# Calculate word frequencies
words.txt <- table(words.txt, dnn = "words")

# Sort by frequency, not alphabetically
words.txt <- sort(words.txt, decreasing = TRUE)

# Put into a readable form: "name of word" and "number of times it occurs"
words.txt <- paste(names(words.txt), words.txt, sep = "\t")

# Write the results to a file
cat("Word\tFREQ", words.txt, file = "frequencies", sep = "\n")
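For the 16 GB files mentioned above, even the faster one-shot version still has to hold the whole file plus the word vector in memory at once. A rough sketch of a chunked variant of the same whitespace counting, reading through a connection and merging counts as it goes; "text_file" and the 100,000-line chunk size are placeholders to adjust:

con <- file("text_file", "r")
counts <- setNames(integer(0), character(0))   # running named word counts
repeat {
  lines <- readLines(con, n = 1e5)             # one chunk of lines
  if (length(lines) == 0) break
  words <- unlist(strsplit(tolower(lines), "\\s+"))
  words <- words[nzchar(words)]                # drop empty strings
  tab <- table(words)
  # merge this chunk's counts into the running totals
  all.words <- union(names(counts), names(tab))
  merged <- setNames(integer(length(all.words)), all.words)
  merged[names(counts)] <- counts
  merged[names(tab)] <- merged[names(tab)] + as.integer(tab)
  counts <- merged
}
close(con)
counts <- sort(counts, decreasing = TRUE)
cat("Word\tFREQ", paste(names(counts), counts, sep = "\t"),
    file = "frequencies", sep = "\n")

Memory use is then bounded by the chunk size and the vocabulary, not by the size of the file.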