On 04/26/2012 03:21 PM, Kamil Slowikowski wrote:
My goal is simple: calcuate GC content of each sequence in a list of
nucleotide
sequences. I have figured out how to vectorize, but all my attempts at
memoization failed.

Can you show me how to properly memoize my function?

There is a StackOverflow post on the subject of memoization, but it does not
help me:
http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r

I haven't been able to find any other discussions on this subject. Searching
for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
for those keywords at http://r-project.markmail.org/ does not return helpful
discussions.

Here's my data:

     seqs<- c("","G","C","CCC","T","","TTCCT","","C","CTC")

Some sequences are missing, so they're blank `""`.

I have a function for calculating GC content:

     >  GC<- function(s) {
         if (!is.character(s)) return(NA)
         n<- nchar(s)
         if (n == 0) return(NA)
         m<- gregexpr('[GCSgcs]', s)[[1]]
         if (m[1]<  1) return(0)
         return(100.0 * length(m) / n)
     }

It works:

     >  GC('')
     [1] NA
     >  GC('G')
     [1] 100
     >  GC('GAG')
     [1] 66.66667
     >  sapply(seqs, GC)
                       G         C       CCC         T               TTCCT
                 C
            NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000
      NA 100.00000
           CTC
      66.66667

I want to memoize it. Then, I want to vectorize it. Should be easy, right?

Hi Kamil --

Not really an answer to your question, but looking at

  http://bioconductor.org/packages/2.10/bioc/html/Biostrings.html

will tell you to install Biostrings with

  source("http://bioconductor.org/biocLite.R";)
  biocLite("Biostrings")

and then

  library(Biostrings)
  dna = DNAStringSet(c("","G","C","CCC","T","","TTCCT","","C","CTC"))
  alf = alphabetFrequency(dna, as.prob=TRUE, baseOnly=TRUE)
  rowSums(alf[,c("G", "C")])

will give you GC content of each string.

> rowSums(alf[,c("G", "C")])
 [1]       NaN 1.0000000 1.0000000 1.0000000 0.0000000       NaN 0.4000000
 [8]       NaN 1.0000000 0.6666667

this will be fast and scalable; Biostrings and other Bioconductor (http://bioconductor.org) packages have many useful functions for working with DNA.

See the Bioconductor mailing list for more help if this is a promising direction.

  http://bioconductor.org/help/mailing-list/

Martin


Apparently, I must have the wrong mindset for using the `memoise` or
`R.cache`
R packages:

     >  system.time(dummy<- sapply(rep(seqs,100), GC))
        user  system elapsed
       0.044   0.000   0.054
     >
     >  library(memoise)
     >  GCm1<- memoise(GC)
     >  system.time(dummy<- sapply(rep(seqs,100), GCm1))
        user  system elapsed
       0.164   0.000   0.173
     >
     >  library(R.cache)
     >  GCm2<- addMemoization(GC)
     >  system.time(dummy<- sapply(rep(seqs,100), GCm2))
        user  system elapsed
      10.601   0.252  10.926

Notice that the memoized functions are several orders of magnitude slower.

I tried the `hash` package, but things seem to be happening behind the
scenes
and I don't understand the output:

     >  cache<- hash()
     >  GCc<- function(s) {
         if (!is.character(s) || nchar(s) == 0) {
             return(NA)
         }
         if(exists(s, cache)) {
             return(cache[[s]])
         }
         result<- GC(s)
         cache[[s]]<<- result
         return(result)
     }
     >  sapply(seqs,GCc)
     [[1]]
     [1] NA

     $G
     [1] 100

     $C
     NULL

     $CCC
     [1] 100

     $T
     NULL

     [[6]]
     [1] NA

     $TTCCT
     [1] 40

     [[8]]
     [1] NA

     $C
     NULL

     $CTC
     [1] 66.66667

At least I figured out how to vectorize:

     >  GCv<- Vectorize(GC)
     >  GCv(seqs)
                       G         C       CCC         T               TTCCT
                 C
       0.00000 100.00000 100.00000 100.00000   0.00000   0.00000  40.00000
0.00000 100.00000
           CTC
      66.66667

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to