On 06/23/2010 07:46 PM, Martin Morgan wrote: > On 06/23/2010 06:55 PM, G FANG wrote: >> Hi, >> >> I want to group a large list (20 million) of strings into categories >> based on string similarity? >> >> The specific problem is: given a list of DNA sequence as below >> >> ACTCCCGCCGTTCGCGCGCAGCATGATCCTG >> ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN >> CAGGATCATGCTGCGCGCGAACGGCGGGAGT >> CAGGATCATGCTGCGCGCGAANNNNNNNNNN >> CAGGATCATGCTGCGCGCGNNNNNNNNNNNN >> ...... >> ..... >> NNNNNNNCCGTTCGCGCGCAGCATGATCCTG >> NNNNNNNNNNNNCGCGCGCAGCATGATCCTG >> NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT >> NNNNNNNNNNNNNNCGCGCAGCATGATCCTG >> NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT >> NNNNNNNNNNTTCGCGCGCAGCATGATCCTG >> >> 'N' is the missing letter >> >> It can be seen that some strings are the same except for those N's >> (i.e. N can match with any base) >> >> given this list of string, I want to have >> >> 1) a vector corresponding to each row (string), for each string assign >> an id, such that similar strings (those only differ at N's) have the >> same id >> 2) also get a mapping list from unique strings ('unique' in term of >> the same similarity defined above) to the ids >> >> I am a matlab user shifting to R. Please advice on efficient ways to do this. > > The Bioconductor Biostrings package has many tools for this sort of > operation. See http://bioconductor.org/packages/release/Software.html > > Maybe a one-time install > > source('http://bioconductor.org/biocLite.R') > biocLite('Biostrings') > > then > > library(Biostrings) > x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG", > "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN", > "CAGGATCATGCTGCGCGCGAACGGCGGGAGT", > "CAGGATCATGCTGCGCGCGAANNNNNNNNNN", > "NCAGGATCATGCTGCGCGCGAANNNNNNNNN", > "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN", > "NNNCAGGATCATGCTGCGCGCGAANNNNNNN") > names(x) <- seq_along(x) > dna <- DNAStringSet(x) > while (!all(width(dna) == > width(dna <- trimLRPatterns("N", "N", dna)))) {} > names(dna)[rank(dna)]
oops, maybe closer to names(dna)[order(dna)[rank(dna, ties.method="min")]] > although there might be a faster way (e.g., match 8, 4, 2, 1 N's). Also, > your sequences likely come from a fasta file (Biostrings::readFASTA) or > a text file with a column of sequences (ShortRead::readXStringColumns) > or from alignment software (ShortRead::readAligned / > ShortRead::readFastq). If you go this route you'll want to address > questions to the Bioconductor mailing list > > http://bioconductor.org/docs/mailList.html > > Martin > >> Thanks! >> >> Gang >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.