I don't have an answer; I'm replying mainly to solicit more input and to raise a few additional questions.

> From: tal.gal...@gmail.com
> Date: Tue, 21 Dec 2010 11:21:03 +0200
> To: r-help@r-project.org; bioconduc...@r-project.org
> Subject: [R] Performing basic Multiple Sequence Alignment in R?
>
> Hello everyone,
>
> I am not sure if this should go on the general R mailing list (for example,
> if there is a text mining solution that might work here) or the bioconductor
> mailing list (since I wasn't able to find a solution to my question on
> searching their lists) - so this time I tried both, and in the future I'll
> know better (in case it should go to only one of the two).
>

I take it you don't want an R interface to Clustal. I seem to recall, from doing this a few years ago, that alignment by exact string matching was a bit of a research area (I think you can find papers on CiteSeer, for example). You do seem to be asking about exact string matches as alignment markers, since your left-hand sequences appear exactly somewhere on the right, but your overall goal is not very clear.

I never got my code fully working, but I was happy that it could handle different strains of E. coli (or something in the 5-10 Mbp genome range) very quickly (seconds, as I recall), and presumably you could also find similar segments that had moved a long way.

Earlier, someone came here with a task and was pointed to the bio packages. I thought there might be something in computational linguistics or text mining better suited to their needs, but no one ever volunteered anything.

> The task I'm trying to achieve is to align several sequences together.
> I don't have a basic pattern to match to. All that I know is that the
> "True" pattern should be of length "30" and that the sequences I'm looking
> at, have had missing values introduced to them at random points.

Alternatively, I guess someone could write an R interface to the various BLAST programs. Sometimes the help desk at NCBI can get questions like this to the right person internally.
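
To make the exact-match idea mentioned above a bit more concrete, here is a minimal sketch of finding shared exact substrings (k-mers) between two sequences, which could then serve as alignment markers. It is written in Python rather than R purely for illustration and should port to R easily. The function name `shared_kmers` and the choice `k=6` are my own assumptions, not anything from an existing package, and this is not the code I described above.

```python
from collections import defaultdict

def shared_kmers(a, b, k=6):
    """Return (pos_in_a, pos_in_b, kmer) for every k-mer occurring in both strings."""
    # Index every k-mer of `a` by its start position(s).
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    # Scan `b` and report each k-mer that also occurs in `a`.
    hits = []
    for j in range(len(b) - k + 1):
        kmer = b[j:j + k]
        for i in index.get(kmer, ()):
            hits.append((i, j, kmer))
    return hits

# Observed sequences 1 and 4 from the original post.
seq1 = "CGCAATACTAACAGCTGACTTACGCACCG"
seq4 = "CGCAATACTAACCACTAACTCGCTGCG"
anchors = shared_kmers(seq1, seq4, k=6)
for i, j, kmer in anchors:
    print(i, j, kmer)
```

Hits where the two positions differ by a constant offset suggest a shared stretch. A change in that offset between neighbouring hits suggests a gap somewhere in between.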

> Here is an example of such sequences, were on the left we see what is the
> real location of the missing values, and on the right we see the sequence
> that we will be able to observe. My goal is to reconstruct the left column
> using only the sequences I've got on the right column (based on the fact
> that many of the letters in each position are the same)
>
>   Real_sequence                   The_sequence_we_see
> 1 CGCAATACTAAC-AGCTGACTTACGCACCG  CGCAATACTAACAGCTGACTTACGCACCG
> 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG  CGCAATACTAGCAGGTGACTTCCCTCG
> 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG  CGCAATGATCACGGTGGCTCCCGGTGCG
> 4 CGCAATACTAACCA-CTAACT--CGCTGCG  CGCAATACTAACCACTAACTCGCTGCG
> 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG  CGCACGGGTAAGAACGTGATTACGCTCAG
> 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG  CGCTATACTAACAAGTGCTTAGGCCTG
> 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG  CCCACCTAAACGGTGACTTACGCTCCG
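
For the task as stated (a known true length of 30, gaps inserted at unknown positions), one possible heuristic is to repeatedly re-fit each observed sequence against the column-wise letter counts of all the others. The sketch below is in Python rather than R and should be straightforward to port. It is only an illustration under my own assumptions: the names `fit_to_profile` and `realign`, the scoring (raw letter counts per column), and the number of rounds are all hypothetical choices. It is not Clustal or any established method, and nothing guarantees it recovers the true gap positions.

```python
from collections import Counter

GAP = "-"

def fit_to_profile(seq, profile, width):
    """Place the letters of seq, in order, into `width` columns so that the
    summed profile counts of the chosen (column, letter) pairs are maximal."""
    n = len(seq)
    assert n <= width, "sequence is longer than the target width"
    NEG = float("-inf")
    # best[i][j]: best score using the first i letters and the first j columns
    best = [[NEG] * (width + 1) for _ in range(n + 1)]
    for j in range(width + 1):
        best[0][j] = 0
    for i in range(1, n + 1):
        for j in range(i, width + 1):
            skip = best[i][j - 1]  # column j is left as a gap
            use = best[i - 1][j - 1] + profile[j - 1][seq[i - 1]]
            best[i][j] = max(skip, use)
    # Trace back from the last cell to recover the gapped string.
    out = []
    i, j = n, width
    while j > 0:
        if i > 0 and best[i][j] == best[i - 1][j - 1] + profile[j - 1][seq[i - 1]]:
            out.append(seq[i - 1])
            i -= 1
        else:
            out.append(GAP)
        j -= 1
    return "".join(reversed(out))

def realign(seqs, width=30, rounds=5):
    """Iteratively re-fit each sequence against the counts from all the others."""
    # Start with every sequence left-justified and padded with gaps.
    aligned = [s.ljust(width, GAP) for s in seqs]
    for _ in range(rounds):
        for k, s in enumerate(seqs):
            others = aligned[:k] + aligned[k + 1:]
            # Per-column letter counts, ignoring gaps.
            profile = [Counter(row[c] for row in others if row[c] != GAP)
                       for c in range(width)]
            aligned[k] = fit_to_profile(s, profile, width)
    return aligned

# The observed (right-hand) sequences from the original post.
observed = [
    "CGCAATACTAACAGCTGACTTACGCACCG",
    "CGCAATACTAGCAGGTGACTTCCCTCG",
    "CGCAATGATCACGGTGGCTCCCGGTGCG",
    "CGCAATACTAACCACTAACTCGCTGCG",
    "CGCACGGGTAAGAACGTGATTACGCTCAG",
    "CGCTATACTAACAAGTGCTTAGGCCTG",
    "CCCACCTAAACGGTGACTTACGCTCCG",
]
aligned = realign(observed)
for row in aligned:
    print(row)
```

Each output row has exactly 30 characters and reduces to the observed sequence once the gaps are removed. Whether the gaps land where they truly belong depends on how much signal the columns carry, so the result is best treated as a starting point to compare against a proper alignment tool.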