On 06/23/2010 06:55 PM, G FANG wrote:
> Hi,
>
> I want to group a large list (20 million) of strings into categories
> based on string similarity.
>
> The specific problem is: given a list of DNA sequences as below
>
> ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
> ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
> CAGGATCATGCTGCGCGCGAACGGCGGGAGT
> CAGGATCATGCTGCGCGCGAANNNNNNNNNN
> CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
> ......
> .....
> NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
> NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
> NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
> NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
> NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
> NNNNNNNNNNTTCGCGCGCAGCATGATCCTG
>
> 'N' is the missing letter.
>
> It can be seen that some strings are the same except for those N's
> (i.e. N can match any base).
>
> Given this list of strings, I want to have
>
> 1) a vector corresponding to each row (string), assigning each string
> an id, such that similar strings (those that only differ at N's) have
> the same id
> 2) a mapping list from unique strings ('unique' in terms of the
> similarity defined above) to the ids
>
> I am a matlab user shifting to R. Please advise on efficient ways to do this.

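To make the matching rule concrete, here is a minimal base-R sketch of one way to read it. It is not part of Biostrings: compatible() and group_reads() are names made up for illustration. It assumes all reads have the same length, as in your example, and that a read joins the first group whose representative it matches. Matching with N's is not transitive (a read that is mostly N's can be compatible with two different groups), so this greedy assignment is only one possible choice.

```r
# Two equal-length reads are compatible when they agree at every position
# where neither read has an N.
compatible <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  all(a == b | a == "N" | b == "N")
}

# Greedy grouping: visit reads from fewest N's to most, and give each read
# the id of the first group representative it is compatible with.
group_reads <- function(x) {
  n_count <- nchar(x) - nchar(gsub("N", "", x, fixed = TRUE))
  id <- integer(length(x))
  reps <- character(0)
  for (i in order(n_count)) {
    hit <- 0L
    for (j in seq_along(reps)) {
      if (compatible(x[i], reps[j])) {
        hit <- j
        break
      }
    }
    if (hit == 0L) {
      # No compatible group yet: this read starts a new one.
      reps <- c(reps, x[i])
      hit <- length(reps)
    }
    id[i] <- hit
  }
  list(id = id, representatives = reps)
}

x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
       "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
       "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
       "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
       "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
       "NNNNNNNCCGTTCGCGCGCAGCATGATCCTG")
res <- group_reads(x)
res$id               # 1 1 2 2 2 1
res$representatives  # the two fully specified reads, x[1] and x[3]
```

res$id is the per-row vector from your point 1), and res$representatives (indexed by id) is the mapping from point 2). Each read is compared against every representative found so far, so this is only meant to pin down the rule on a small example; it would need more work before being run on 20 million reads.
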
The Bioconductor Biostrings package has many tools for this sort of
operation. See

  http://bioconductor.org/packages/release/Software.html

Maybe a one-time install

  source('http://bioconductor.org/biocLite.R')
  biocLite('Biostrings')

then

  library(Biostrings)

  x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
         "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
         "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
         "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
         "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
         "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
         "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
  names(x) <- seq_along(x)
  dna <- DNAStringSet(x)
  ## each pass trims at most one N from each end; repeat until no
  ## sequence gets any shorter
  while (!all(width(dna) ==
              width(dna <- trimLRPatterns("N", "N", dna)))) {}
  names(dna)[rank(dna)]

although there might be a faster way (e.g., match 8, 4, 2, 1 N's).

Also, your sequences likely come from a fasta file
(Biostrings::readFASTA) or a text file with a column of sequences
(ShortRead::readXStringColumns) or from alignment software
(ShortRead::readAligned / ShortRead::readFastq).

If you go this route you'll want to address questions to the
Bioconductor mailing list

  http://bioconductor.org/docs/mailList.html

Martin

> Thanks!
>
> Gang
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793