On 04/26/2012 03:21 PM, Kamil Slowikowski wrote:
My goal is simple: calcuate GC content of each sequence in a list of
nucleotide
sequences. I have figured out how to vectorize, but all my attempts at
memoization failed.
Can you show me how to properly memoize my function?
There is a StackOverflow post on the subject of memoization, but it does not
help me:
http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r
I haven't been able to find any other discussions on this subject. Searching
for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
for those keywords at http://r-project.markmail.org/ does not return helpful
discussions.
Here's my data:
seqs<- c("","G","C","CCC","T","","TTCCT","","C","CTC")
Some sequences are missing, so they're blank `""`.
I have a function for calculating GC content:
> GC<- function(s) {
if (!is.character(s)) return(NA)
n<- nchar(s)
if (n == 0) return(NA)
m<- gregexpr('[GCSgcs]', s)[[1]]
if (m[1]< 1) return(0)
return(100.0 * length(m) / n)
}
It works:
> GC('')
[1] NA
> GC('G')
[1] 100
> GC('GAG')
[1] 66.66667
> sapply(seqs, GC)
G C CCC T TTCCT
C
NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000
NA 100.00000
CTC
66.66667
I want to memoize it. Then, I want to vectorize it. Should be easy, right?
Hi Kamil --
Not really an answer to your question, but looking at
http://bioconductor.org/packages/2.10/bioc/html/Biostrings.html
will tell you to install Biostrings with
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
and then
library(Biostrings)
dna = DNAStringSet(c("","G","C","CCC","T","","TTCCT","","C","CTC"))
alf = alphabetFrequency(dna, as.prob=TRUE, baseOnly=TRUE)
rowSums(alf[,c("G", "C")])
will give you GC content of each string.
> rowSums(alf[,c("G", "C")])
[1] NaN 1.0000000 1.0000000 1.0000000 0.0000000 NaN 0.4000000
[8] NaN 1.0000000 0.6666667
this will be fast and scalable; Biostrings and other Bioconductor
(http://bioconductor.org) packages have many useful functions for
working with DNA.
See the Bioconductor mailing list for more help if this is a promising
direction.
http://bioconductor.org/help/mailing-list/
Martin
Apparently, I must have the wrong mindset for using the `memoise` or
`R.cache`
R packages:
> system.time(dummy<- sapply(rep(seqs,100), GC))
user system elapsed
0.044 0.000 0.054
>
> library(memoise)
> GCm1<- memoise(GC)
> system.time(dummy<- sapply(rep(seqs,100), GCm1))
user system elapsed
0.164 0.000 0.173
>
> library(R.cache)
> GCm2<- addMemoization(GC)
> system.time(dummy<- sapply(rep(seqs,100), GCm2))
user system elapsed
10.601 0.252 10.926
Notice that the memoized functions are several orders of magnitude slower.
I tried the `hash` package, but things seem to be happening behind the
scenes
and I don't understand the output:
> cache<- hash()
> GCc<- function(s) {
if (!is.character(s) || nchar(s) == 0) {
return(NA)
}
if(exists(s, cache)) {
return(cache[[s]])
}
result<- GC(s)
cache[[s]]<<- result
return(result)
}
> sapply(seqs,GCc)
[[1]]
[1] NA
$G
[1] 100
$C
NULL
$CCC
[1] 100
$T
NULL
[[6]]
[1] NA
$TTCCT
[1] 40
[[8]]
[1] NA
$C
NULL
$CTC
[1] 66.66667
At least I figured out how to vectorize:
> GCv<- Vectorize(GC)
> GCv(seqs)
G C CCC T TTCCT
C
0.00000 100.00000 100.00000 100.00000 0.00000 0.00000 40.00000
0.00000 100.00000
CTC
66.66667
[[alternative HTML version deleted]]
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.