My goal is simple: calcuate GC content of each sequence in a list of
nucleotide
sequences. I have figured out how to vectorize, but all my attempts at
memoization failed.

Can you show me how to properly memoize my function?

There is a StackOverflow post on the subject of memoization, but it does not
help me:
http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r

I haven't been able to find any other discussions on this subject. Searching
for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
for those keywords at http://r-project.markmail.org/ does not return helpful
discussions.

Here's my data:

    seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")

Some sequences are missing, so they're blank `""`.

I have a function for calculating GC content:

    > GC <- function(s) {
        if (!is.character(s)) return(NA)
        n <- nchar(s)
        if (n == 0) return(NA)
        m <- gregexpr('[GCSgcs]', s)[[1]]
        if (m[1] < 1) return(0)
        return(100.0 * length(m) / n)
    }

It works:

    > GC('')
    [1] NA
    > GC('G')
    [1] 100
    > GC('GAG')
    [1] 66.66667
    > sapply(seqs, GC)
                      G         C       CCC         T               TTCCT
                C
           NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000
     NA 100.00000
          CTC
     66.66667

I want to memoize it. Then, I want to vectorize it. Should be easy, right?

Apparently, I must have the wrong mindset for using the `memoise` or
`R.cache`
R packages:

    > system.time(dummy <- sapply(rep(seqs,100), GC))
       user  system elapsed
      0.044   0.000   0.054
    >
    > library(memoise)
    > GCm1 <- memoise(GC)
    > system.time(dummy <- sapply(rep(seqs,100), GCm1))
       user  system elapsed
      0.164   0.000   0.173
    >
    > library(R.cache)
    > GCm2 <- addMemoization(GC)
    > system.time(dummy <- sapply(rep(seqs,100), GCm2))
       user  system elapsed
     10.601   0.252  10.926

Notice that the memoized functions are several orders of magnitude slower.

I tried the `hash` package, but things seem to be happening behind the
scenes
and I don't understand the output:

    > cache <- hash()
    > GCc <- function(s) {
        if (!is.character(s) || nchar(s) == 0) {
            return(NA)
        }
        if(exists(s, cache)) {
            return(cache[[s]])
        }
        result <- GC(s)
        cache[[s]] <<- result
        return(result)
    }
    > sapply(seqs,GCc)
    [[1]]
    [1] NA

    $G
    [1] 100

    $C
    NULL

    $CCC
    [1] 100

    $T
    NULL

    [[6]]
    [1] NA

    $TTCCT
    [1] 40

    [[8]]
    [1] NA

    $C
    NULL

    $CTC
    [1] 66.66667

At least I figured out how to vectorize:

    > GCv <- Vectorize(GC)
    > GCv(seqs)
                      G         C       CCC         T               TTCCT
                C
      0.00000 100.00000 100.00000 100.00000   0.00000   0.00000  40.00000
0.00000 100.00000
          CTC
     66.66667

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to