[R] How can this code be improved?

Richard R. Liu Wed, 11 Nov 2009 23:52:16 -0800

I am running the following code on a MacBook Pro 17" Unibody early 2009 with 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in 64-bit mode.


freq.stopwords <- numeric(0)
freq.nonstopwords <- numeric(0)
token.tables <- list(0)
i.ss <- c(0)
cat("Beginning at ", date(), ".\n")
for (i.d in 1:length(tokens)) {
        tt <- list(0)
        for (i.s in 1:length(tokens[[i.d]])) {
                t <- tolower(tokens[[i.d]][[i.s]])
                t <- sub("^[[:punct:]]*", "", t)
                t <- sub("[[:punct:]]*$", "", t)
                t <- as.data.frame(table(t))
                i.m <- match(t$t, stopwords)
                i.m.is.na <- is.na(i.m)
                i.ss <- i.ss + 1
                freq.stopwords[i.ss] <- sum(t$Freq * !i.m.is.na)
                freq.nonstopwords[i.ss] <- sum(t$Freq * i.m.is.na)

                tt[[i.s]] <- data.frame(token = t$t, freq = t$Freq, matches.stopword = i.m)

        }
        token.tables[[i.d]] <- tt
        if (i.d %% 5 == 0) cat(i.d, "reports completed at ", date(), ".\n")
}
cat("Terminating at ", date(), ".\n")


The objects used in the innermost loop are:

* tokens: a list of lists. In the expression tokens[[i.d]][[i.s]], the first index runs over 1697 reports, the second over the sentences in the report, each of which consists of a vector of tokens, i.e., the character strings between the white spaces in the sentence. One of the largest reports takes up 58MB on the hard disk. Thus, the number of sentences can be quite large, and some of the sentences are quite long (measured in tokens as well as in characters).
* stopwords: a vector of 571 words that occur very often in written English.
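
To make the structure concrete, here is a tiny made-up example of the two objects. The contents are invented purely for illustration; the real tokens list holds 1697 reports and the real stopwords vector has 571 entries.

```r
# Toy stand-ins for the real objects (contents invented for illustration).
# tokens[[i.d]][[i.s]] is the character vector of tokens for sentence i.s
# of report i.d.
tokens <- list(
        list(c("The", "patient", "was", "seen", "today."),
             c("No", "fever,", "no", "cough.")),
        list(c("Follow-up", "in", "the", "clinic."))
)
stopwords <- c("the", "was", "no", "in")

length(tokens)          # number of reports
length(tokens[[1]])     # number of sentences in the first report
tokens[[1]][[2]]        # tokens of the second sentence of the first report
```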

The code operates on sentences, converting each token in the sentence to lowercase, removing punctuation at the beginning and end of the token, tabulating the frequency of the unique tokens, and generating an array that indicates which tokens correspond to stopwords. It also sums the frequencies of the stopwords and those of the non-stopwords. The result is a list of lists of data frames.
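
As a minimal, self-contained sketch of that per-sentence step, the body of the inner loop can be pulled out into a function and run on one invented sentence. The name tabulate.sentence and the sample data are assumptions made for this illustration; they are not part of the code above.

```r
# Hypothetical helper: the per-sentence step from the loop body, on its own.
tabulate.sentence <- function(sentence, stopwords) {
        t <- tolower(sentence)
        t <- sub("^[[:punct:]]*", "", t)        # strip leading punctuation
        t <- sub("[[:punct:]]*$", "", t)        # strip trailing punctuation
        t <- as.data.frame(table(t), stringsAsFactors = FALSE)
        i.m <- match(t$t, stopwords)            # NA where token is not a stopword
        data.frame(token = t$t, freq = t$Freq, matches.stopword = i.m)
}

# Made-up sentence and stopword list, for illustration only.
sentence <- c("No", "fever,", "no", "cough.")
stopwords <- c("the", "was", "no", "in")

res <- tabulate.sentence(sentence, stopwords)
res
sum(res$freq[!is.na(res$matches.stopword)])     # frequency of stopwords
sum(res$freq[is.na(res$matches.stopword)])      # frequency of non-stopwords
```

Here "No" and "no" collapse into one token after lowercasing, so the sentence yields three unique tokens, one of which matches a stopword.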

I began running on Thursday Nov. 12, 2009 at 01:56:36. As of 7:52:00, 510 reports had been processed. The Activity Monitor indicates no memory bottleneck. R is using 4.31 GB of real memory and 7.23 GB of virtual memory, and 1.67 GB of real memory is free.
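
For a sense of scale, a rough extrapolation from those figures is sketched below. It assumes the processing rate stays constant, which may well not hold since reports differ greatly in size; the time zone is set to UTC only so that the subtraction is unambiguous.

```r
# Rough extrapolation from the figures above, assuming a constant rate.
# tz = "UTC" is an arbitrary choice; only the time difference matters.
start <- as.POSIXct("2009-11-12 01:56:36", tz = "UTC")
now   <- as.POSIXct("2009-11-12 07:52:00", tz = "UTC")
elapsed.secs <- as.numeric(difftime(now, start, units = "secs"))

done  <- 510            # reports processed so far
total <- 1697           # reports in all
secs.per.report <- elapsed.secs / done
remaining.hours <- (total - done) * secs.per.report / 3600

secs.per.report         # roughly 42 seconds per report
remaining.hours         # roughly 14 hours still to go
```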

I admit that I am an R newbie. From my understanding of the "apply" functions (e.g., lapply), I see no way to use them to simplify the loops. I would appreciate any suggestions about making the code more "R-like" and, above all, much faster.


Regards,
Richard

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
