On 11/05/2010 09:13 AM, Changbin Du wrote: > HI, Phil, > > I used the following codes and run it overnight for 15 hours, this morning, > I stopped it. It seems it is still not efficient. > > >> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth", > sep="\t", skip=0, header=F,fill=T) # >> names(matt)<-c("id","reads") > >> dim(matt) > [1] 3384766 2
[snip] >>> On Thu, 4 Nov 2010, Changbin Du wrote: >>> >>> HI, Dear R community, >>>> >>>> I have one data set like this, What I want to do is to calculate the >>>> cumulative coverage. The following codes works for small data set (#rows >>>> = >>>> 100), but when feed the whole data set, it still running after 24 hours. >>>> Can someone give some suggestions for long vector? >>>> >>>> id reads >>>> Contig79:1 4 >>>> Contig79:2 8 >>>> Contig79:3 13 >>>> Contig79:4 14 >>>> Contig79:5 17 >>>> Contig79:6 20 >>>> Contig79:7 25 >>>> Contig79:8 27 >>>> Contig79:9 32 >>>> Contig79:10 33 >>>> Contig79:11 34 >>>> >>>> >>>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth", >>>> sep="\t", skip=0, header=F,fill=T) # >>>> dim(matt) >>>> [1] 3384766 2 >>>> >>>> matt_plot<-function(matt, outputfile) { >>>> names(matt)<-c("id","reads") >>>> >>>> cover<-matt$reads >>>> >>>> >>>> #calculate the cumulative coverage. >>>> + cover_per<-function (data) { >>>> + output<-numeric(0) >>>> + for (i in data) { >>>> + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data)) >>>> + output<-c(output, x) >>>> + } >>>> + return(output) >>>> + } >>>> >>>> >>>> result<-cover_per(cover) Hi Changbin If I understand correctly, your contigs 'start' at position 1, and have 'width' equal to matt$reads. You'd like to know the coverage at the last covered location of each contig in matt$reads. ## first time only source("http://bioconductor.org") biocLite("IRanges") ## library(IRanges) contigs = IRanges(start=1, width=matt$reads) cvg = coverage(contigs) ## an RLE summarizing coverage, from position 1 as.vector(cvg[matt$reads]) / nrow(matt) ## at the end of each contig for a larger data set: > matt=data.frame(reads=ceiling(as.integer(runif(3384766, 1, 100)))) > contigs = IRanges(start=1, width=matt$reads) > system.time(cvg <- coverage(contigs)) user system elapsed 5.145 0.050 5.202 Martin >>>> >>>> >>>> Thanks so much! >>>> >>>> >>>> -- >>>> Sincerely, >>>> Changbin >>>> -- >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________________________ >>>> R-help@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide >>>> http://www.R-project.org/posting-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>>> >>>> >> >> >> -- >> Sincerely, >> Changbin >> -- >> >> Changbin Du >> DOE Joint Genome Institute >> Bldg 400 Rm 457 >> 2800 Mitchell Dr >> Walnut Creet, CA 94598 >> Phone: 925-927-2856 >> >> >> > > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.