My goal is simple: calcuate GC content of each sequence in a list of nucleotide sequences. I have figured out how to vectorize, but all my attempts at memoization failed.
Can you show me how to properly memoize my function? There is a StackOverflow post on the subject of memoization, but it does not help me: http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r I haven't been able to find any other discussions on this subject. Searching for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching for those keywords at http://r-project.markmail.org/ does not return helpful discussions. Here's my data: seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC") Some sequences are missing, so they're blank `""`. I have a function for calculating GC content: > GC <- function(s) { if (!is.character(s)) return(NA) n <- nchar(s) if (n == 0) return(NA) m <- gregexpr('[GCSgcs]', s)[[1]] if (m[1] < 1) return(0) return(100.0 * length(m) / n) } It works: > GC('') [1] NA > GC('G') [1] 100 > GC('GAG') [1] 66.66667 > sapply(seqs, GC) G C CCC T TTCCT C NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 NA 100.00000 CTC 66.66667 I want to memoize it. Then, I want to vectorize it. Should be easy, right? Apparently, I must have the wrong mindset for using the `memoise` or `R.cache` R packages: > system.time(dummy <- sapply(rep(seqs,100), GC)) user system elapsed 0.044 0.000 0.054 > > library(memoise) > GCm1 <- memoise(GC) > system.time(dummy <- sapply(rep(seqs,100), GCm1)) user system elapsed 0.164 0.000 0.173 > > library(R.cache) > GCm2 <- addMemoization(GC) > system.time(dummy <- sapply(rep(seqs,100), GCm2)) user system elapsed 10.601 0.252 10.926 Notice that the memoized functions are several orders of magnitude slower. I tried the `hash` package, but things seem to be happening behind the scenes and I don't understand the output: > cache <- hash() > GCc <- function(s) { if (!is.character(s) || nchar(s) == 0) { return(NA) } if(exists(s, cache)) { return(cache[[s]]) } result <- GC(s) cache[[s]] <<- result return(result) } > sapply(seqs,GCc) [[1]] [1] NA $G [1] 100 $C NULL $CCC [1] 100 $T NULL [[6]] [1] NA $TTCCT [1] 40 [[8]] [1] NA $C NULL $CTC [1] 66.66667 At least I figured out how to vectorize: > GCv <- Vectorize(GC) > GCv(seqs) G C CCC T TTCCT C 0.00000 100.00000 100.00000 100.00000 0.00000 0.00000 40.00000 0.00000 100.00000 CTC 66.66667 [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.