Jim, Dennis,

Once again, thanks for all your suggestions. After developing a more R-like version of the script, I terminated the running one after 976 (of 1697) reports had been processed. At that point, the script had been running for approx. 33.5 hours! Here is the new version:
library(filehash)

db <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_TXT", type = "RDS")
dbLoad(db)
dba <- dbInit("/Volumes/Work on RDR Test Documents/R Databases/DB_Aux", type = "RDS")
dbLoad(dba)

tokens <- sentences.all.tokenized
stopwords <- stopwords.pubmed

# Convert to lowercase, remove beginning and end punctuation, tabulate
my.func <- function(sent, stop, ...) {
    list(
        freq.table = (temp.table <- table(
            sub("[[:punct:]]*$", "",
                sub("^[[:punct:]]*", "", tolower(sent))))),
        stopword.matches = (temp.matches <- match(names(temp.table), stop)),
        stopword.summary = array(tapply(temp.table, !is.na(temp.matches), sum),
            dim = 2,
            dimnames = list(c("no.non.stopwords", "no.stopwords")))
    )
}

cat("Beginning at ", date(), ".\n", sep = "")
token.tables <- lapply(1:length(tokens), function(i.d, doc, stop, func, ...) {
    if ((i.d - 1) %% 10 == 0)
        cat((i.d - 1), " report(s) completed at ", date(), ".\n", sep = "")
    lapply(1:length(doc[[i.d]]), function(i.s, sent, stop, func, ...) {
        func(sent[[i.s]], stop, ...)
    }, sent = doc[[i.d]], stop = stop, func = func, ...)
}, doc = tokens, stop = stopwords, func = my.func)
cat("Terminating at ", date(), ".\n", sep = "")

This script reaches the same point in approx. 1:09 hours, a little under 70 minutes!

What I am noticing now is a severe lack of real memory. Activity Monitor shows about 20 MB of real memory free. R, running in 64-bit mode, is using 6.75 GB of real and 10 GB of virtual memory. I see lots of disk activity, which is undoubtedly the swapping between real and virtual memory. CPU activity is very low.

I suppose I could run the script twice, each time on half the tokens. That would give me two lists, which I would have to combine into a single one.
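Something along the following lines is what I have in mind for that two-pass approach (an untested sketch; process.half is just a made-up name for a wrapper around the lapply call above):

# Untested sketch of the two-pass idea.  process.half() is only a
# placeholder wrapper around the lapply call above, applied to a
# subset of 'tokens'.
process.half <- function(tok, stop, func) {
    lapply(1:length(tok), function(i.d, doc, stop, func) {
        lapply(1:length(doc[[i.d]]), function(i.s, sent, stop, func) {
            func(sent[[i.s]], stop)
        }, sent = doc[[i.d]], stop = stop, func = func)
    }, doc = tok, stop = stop, func = func)
}

half <- ceiling(length(tokens) / 2)

# Pass 1: first half of the reports; save the result to disk and free it
token.tables.1 <- process.half(tokens[1:half], stopwords, my.func)
save(token.tables.1, file = "token_tables_1.RData")
rm(token.tables.1)
gc()

# Pass 2: the remaining reports (could equally well be a fresh R session)
token.tables.2 <- process.half(tokens[(half + 1):length(tokens)], stopwords, my.func)

# Combine the two halves into a single list, preserving the report order
load("token_tables_1.RData")
token.tables <- c(token.tables.1, token.tables.2)

Since c() on two lists simply concatenates them, token.tables ends up with one element per report in the original order; of course the combined list still has to fit in memory once the two halves are put back together.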
Regards,
Richard


On Thu, 12 Nov 2009 18:53:34 -0500, jim holtman wrote:

> Run the script on a small subset of the data and use Rprof to profile
> the code. This will give you an idea of where time is being spent
> and where to focus for improvement. I would suggest that you do not
> convert the output of 'table(t)' to a dataframe. You can just
> extract the 'names' to get the words. You might be spending some of
> the time in accessing the information in the dataframe, which is
> really not necessary for your code.
>
> On Thu, Nov 12, 2009 at 2:12 AM, Richard R. Liu <richard....@pueo-owl.ch> wrote:
> > I am running the following code on a MacBook Pro 17" Unibody early 2009 with
> > 8GB RAM, OS X 10.5.8, R 2.10.0 Patched from Nov. 2, 2009, in 64-bit mode.
> >
> > freq.stopwords <- numeric(0)
> > freq.nonstopwords <- numeric(0)
> > token.tables <- list(0)
> > i.ss <- c(0)
> > cat("Beginning at ", date(), ".\n")
> > for (i.d in 1:length(tokens)) {
> >     tt <- list(0)
> >     for (i.s in 1:length(tokens[[i.d]])) {
> >         t <- tolower(tokens[[i.d]][[i.s]])
> >         t <- sub("^[[:punct:]]*", "", t)
> >         t <- sub("[[:punct:]]*$", "", t)
> >         t <- as.data.frame(table(t))
> >         i.m <- match(t$t, stopwords)
> >         i.m.is.na <- is.na(i.m)
> >         i.ss <- i.ss + 1
> >         freq.stopwords[i.ss] <- sum(t$Freq * !i.m.is.na)
> >         freq.nonstopwords[i.ss] <- sum(t$Freq * i.m.is.na)
> >         tt[[i.s]] <- data.frame(token = t$t, freq = t$Freq,
> >             matches.stopword = i.m)
> >     }
> >     token.tables[[i.d]] <- tt
> >     if (i.d %% 5 == 0) cat(i.d, "reports completed at ", date(), ".\n")
> > }
> > cat("Terminating at ", date(), ".\n")
> >
> > The objects in the innermost loop are:
> > * tokens: a list of lists. In the expression tokens[[i.d]][[i.s]], the
> > first index runs over 1697 reports, the second over the sentences in the
> > report, each of which consists of a vector of tokens, i.e., the character
> > strings between the white spaces in the sentence. One of the largest
> > reports takes up 58 MB on the hard disk. Thus, the number of sentences can
> > be quite large, and some of the sentences are quite long (measured in
> > tokens as well as in characters).
> > * stopwords: a vector of 571 words that occur very often in written
> > English.
> >
> > The code operates on sentences, converting each token in the sentence to
> > lowercase, removing punctuation at the beginning and end of the token,
> > tabulating the frequency of the unique tokens, and generating an array that
> > indicates which tokens correspond to stopwords. It also sums the
> > frequencies of the stopwords and those of the non-stopwords. The result is
> > a list of lists of dataframes.
> >
> > I began running on Thursday, Nov. 12, 2009 at 01:56:36. As of 7:52:00, 510
> > reports had been processed. The Activity Monitor indicates no memory
> > bottleneck. R is using 4.31 GB of real memory, 7.23 GB of virtual memory,
> > and 1.67 GB of real memory are free.
> >
> > I admit that I am an R newbie. From my understanding of the "apply"
> > functions (e.g., lapply), I see no way to use them to simplify the loops.
> > I would appreciate any suggestions about making the code more "R-like"
> > and, above all, much faster.
> >
> > Regards,
> > Richard
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?

--
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.: +41 61 331 10 47
Email: richard....@pueo-owl.ch

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.