On 06/26/2010 11:42 AM, G FANG wrote: > Hi Martin, > > Thanks a lot for your advice. > > I tried the process you suggested as below, it worked, but in a > different way that I planned. > > library(Biostrings) > x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG", > "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN", > "CAGGATCATGCTGCGCGCGAACGGCGGGAGT", > "CAGGATCATGCTGCGCGCGAANNNNNNNNNN", > "NCAGGATCATGCTGCGCGCGAANNNNNNNNN", > "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN", > "NNNCAGGATCATGCTGCGCGCGAANNNNNNN") > names(x) <- seq_along(x) > dna <- DNAStringSet(x) > while (!all(width(dna) == width(dna <- trimLRPatterns("N", "N", dna)))) {} > names(dna)[order(dna)[rank(dna, ties.method="min")]] > > The output is, > "1" "2" "3" "4" "4" "6" "4", this is the right answer after trimining > N's, i.e. without considering N, which strings are the same. > > But actually, the match I planned is position-to-position match, i.e. > 1st and 2nd strings are the same except for the N's > > So, the expected output is 1 1 2 2 3 2 4 > > Please advice.
what would you expect of ACTG ACTC ACTN ? I'd guess an elegant solution would require a fancy data structure / algorithm (suffix arrays, I guess), and that you'll end up doing something ad-hoc, along the lines of x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG", "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN", "CAGGATCATGCTGCGCGCGAACGGCGGGAGT", "CAGGATCATGCTGCGCGCGAANNNNNNNNNN", "NCAGGATCATGCTGCGCGCGAANNNNNNNNN", "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN", "NNNCAGGATCATGCTGCGCGCGAANNNNNNN") names(x) = seq_along(x) dna = DNAStringSet(x) wd <- unique(width(dna)) stopifnot(1 == length(wd)) nm <- names(dna)[order(dna)[rank(dna)]] for (i in rev(seq_len(wd)[-1])) { seq = narrow(dna, 1, i-1) n = narrow(dna, i, len) allN = alphabetFrequency(n, baseOnly=TRUE)[,"other"] == width(n) nm[allN] <- names(seq)[order(seq)[rank(seq)]][allN] } which gives > as.integer(factor(nm)) [1] 1 1 2 2 3 2 4 (this doesn't deal with the leading N's, but the logic might be the same). >>> If you go this route you'll want to address >>> questions to the Bioconductor mailing list >>> >>> http://bioconductor.org/docs/mailList.html Martin > > Thanks! > > --gang > > On Wed, Jun 23, 2010 at 7:55 PM, Martin Morgan <mtmor...@fhcrc.org> wrote: >> On 06/23/2010 07:46 PM, Martin Morgan wrote: >>> On 06/23/2010 06:55 PM, G FANG wrote: >>>> Hi, >>>> >>>> I want to group a large list (20 million) of strings into categories >>>> based on string similarity? >>>> >>>> The specific problem is: given a list of DNA sequence as below >>>> >>>> ACTCCCGCCGTTCGCGCGCAGCATGATCCTG >>>> ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN >>>> CAGGATCATGCTGCGCGCGAACGGCGGGAGT >>>> CAGGATCATGCTGCGCGCGAANNNNNNNNNN >>>> CAGGATCATGCTGCGCGCGNNNNNNNNNNNN >>>> ...... >>>> ..... >>>> NNNNNNNCCGTTCGCGCGCAGCATGATCCTG >>>> NNNNNNNNNNNNCGCGCGCAGCATGATCCTG >>>> NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT >>>> NNNNNNNNNNNNNNCGCGCAGCATGATCCTG >>>> NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT >>>> NNNNNNNNNNTTCGCGCGCAGCATGATCCTG >>>> >>>> 'N' is the missing letter >>>> >>>> It can be seen that some strings are the same except for those N's >>>> (i.e. N can match with any base) >>>> >>>> given this list of string, I want to have >>>> >>>> 1) a vector corresponding to each row (string), for each string assign >>>> an id, such that similar strings (those only differ at N's) have the >>>> same id >>>> 2) also get a mapping list from unique strings ('unique' in term of >>>> the same similarity defined above) to the ids >>>> >>>> I am a matlab user shifting to R. Please advice on efficient ways to do >>>> this. >>> >>> The Bioconductor Biostrings package has many tools for this sort of >>> operation. See http://bioconductor.org/packages/release/Software.html >>> >>> Maybe a one-time install >>> >>> source('http://bioconductor.org/biocLite.R') >>> biocLite('Biostrings') >>> >>> then >>> >>> library(Biostrings) >>> x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG", >>> "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN", >>> "CAGGATCATGCTGCGCGCGAACGGCGGGAGT", >>> "CAGGATCATGCTGCGCGCGAANNNNNNNNNN", >>> "NCAGGATCATGCTGCGCGCGAANNNNNNNNN", >>> "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN", >>> "NNNCAGGATCATGCTGCGCGCGAANNNNNNN") >>> names(x) <- seq_along(x) >>> dna <- DNAStringSet(x) >>> while (!all(width(dna) == >>> width(dna <- trimLRPatterns("N", "N", dna)))) {} >>> names(dna)[rank(dna)] >> >> oops, maybe closer to >> >> names(dna)[order(dna)[rank(dna, ties.method="min")]] >> >>> although there might be a faster way (e.g., match 8, 4, 2, 1 N's). Also, >>> your sequences likely come from a fasta file (Biostrings::readFASTA) or >>> a text file with a column of sequences (ShortRead::readXStringColumns) >>> or from alignment software (ShortRead::readAligned / >>> ShortRead::readFastq). If you go this route you'll want to address >>> questions to the Bioconductor mailing list >>> >>> http://bioconductor.org/docs/mailList.html >>> >>> Martin >>> >>>> Thanks! >>>> >>>> Gang >>>> >>>> ______________________________________________ >>>> R-help@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide >>>> http://www.R-project.org/posting-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>> >>> >> >> >> -- >> Martin Morgan >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 >> -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.