Jim and Dennis,
Thanks for your suggestions. Almost 24 hours later, the script has
finished a bit more than half the reports. Free RAM varies between
1.2 GB and a few MB. I hesitate to interrupt it to implement the
improvements you have suggested, in case they do not decrease the
execution time by at least an order of magnitude; however, I will
definitely implement and test both your improvements and my own.
Regards,
Richard
On Nov 13, 2009, at 0:53, jim holtman wrote:
Run the script on a small subset of the data and use Rprof to profile
the code. This will give you an idea of where time is being spent and
where to focus for improvement. I would suggest that you do not
convert the output of 'table(t)' to a data frame. You can just
extract the 'names' to get the words. You might be spending some of
the time accessing the information in the data frame, which is
really not necessary for your code.
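For example, something along these lines (an untested sketch; the
output file name is arbitrary, and the loop itself is only indicated
by a comment):

Rprof("loop-profile.out")         # start collecting timing samples
## ... run the loop on a small subset of 'tokens' here ...
Rprof(NULL)                       # stop profiling
summaryRprof("loop-profile.out")  # shows where the time is spent

## Instead of as.data.frame(table(t)), work with the table directly:
tab   <- table(t)
words <- names(tab)       # the unique tokens
freqs <- as.vector(tab)   # their frequencies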
On Thu, Nov 12, 2009 at 2:12 AM, Richard R. Liu <richard....@pueo-owl.ch> wrote:
I am running the following code on a MacBook Pro 17" Unibody early
2009 with 8 GB RAM, OS X 10.5.8, and R 2.10.0 Patch from Nov. 2, 2009,
in 64-bit mode.
freq.stopwords <- numeric(0)
freq.nonstopwords <- numeric(0)
token.tables <- list(0)
i.ss <- c(0)
cat("Beginning at ", date(), ".\n")
for (i.d in 1:length(tokens)) {
    tt <- list(0)
    for (i.s in 1:length(tokens[[i.d]])) {
        t <- tolower(tokens[[i.d]][[i.s]])
        t <- sub("^[[:punct:]]*", "", t)
        t <- sub("[[:punct:]]*$", "", t)
        t <- as.data.frame(table(t))
        i.m <- match(t$t, stopwords)
        i.m.is.na <- is.na(i.m)
        i.ss <- i.ss + 1
        freq.stopwords[i.ss] <- sum(t$Freq * !i.m.is.na)
        freq.nonstopwords[i.ss] <- sum(t$Freq * i.m.is.na)
        tt[[i.s]] <- data.frame(token = t$t, freq = t$Freq,
                                matches.stopword = i.m)
    }
    token.tables[[i.d]] <- tt
    if (i.d %% 5 == 0) cat(i.d, "reports completed at ", date(), ".\n")
}
cat("Terminating at ", date(), ".\n")
The objects in the innermost loop are:
* tokens: a list of lists. In the expression tokens[[i.d]][[i.s]],
the first index runs over 1697 reports, the second over the sentences
in the report, each of which consists of a vector of tokens, i.e.,
the character strings between the white spaces in the sentence. One
of the largest reports takes up 58 MB on the hard disk. Thus, the
number of sentences can be quite large, and some of the sentences are
quite long (measured in tokens as well as in characters).
* stopwords: a vector of 571 words that occur very often in written
English.
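For concreteness, a toy version of these objects might look like the
following (the contents are made up, not taken from the real data):

tokens <- list(
    list(c("The", "cat", "sat."), c("It", "purred.")),             # report 1, two sentences
    list(c("Results", "were", "inconclusive,", "unfortunately."))  # report 2, one sentence
)
stopwords <- c("the", "it", "were", "a", "of")  # in reality, 571 common English words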
The code operates on sentences, converting each token in the sentence
to lowercase, removing punctuation at the beginning and end of the
token, tabulating the frequency of the unique tokens, and generating
an array that indicates which tokens correspond to stopwords. It also
sums the frequencies of the stopwords and those of the non-stopwords.
The result is a list of lists of data frames.
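To illustrate, the body of the inner loop applied to a single made-up
sentence (using the toy 'stopwords' vector above) amounts to:

s <- c("The", "cat", "sat", "on", "the", "mat.")
t <- tolower(s)
t <- sub("^[[:punct:]]*", "", t)    # strip leading punctuation
t <- sub("[[:punct:]]*$", "", t)    # strip trailing punctuation
tab <- as.data.frame(table(t))      # columns 't' (token) and 'Freq'
i.m <- match(tab$t, stopwords)      # NA where the token is not a stopword
sum(tab$Freq * !is.na(i.m))         # total frequency of stopwords in the sentence
sum(tab$Freq * is.na(i.m))          # total frequency of non-stopwords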
I began running on Thursday, Nov. 12, 2009 at 01:56:36. As of
7:52:00, 510 reports had been processed. The Activity Monitor
indicates no memory bottleneck. R is using 4.31 GB of real memory and
7.23 GB of virtual memory, and 1.67 GB of real memory are free.
I admit that I am an R newbie. From my understanding of the "apply"
functions (e.g., lapply), I see no way to use them to simplify the
loops. I would appreciate any suggestions about making the code more
"R-like" and, above all, much faster.
Regards,
Richard
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.