Re: [R] Performing basic Multiple Sequence Alignment in R?

David Winsemius Tue, 21 Dec 2010 05:22:55 -0800

Tal; I'm trimming the BioC posting. On the R lists it is considered spamming to cross-post. (Please re-read the Posting Guide.)


On Dec 21, 2010, at 4:21 AM, Tal Galili wrote:

Hello everyone,
I am not sure if this should go on the general R mailing list (for example,
if there is a text mining solution that might work here) or the Bioconductor
mailing list (since I wasn't able to find a solution to my question on
searching their lists) - so this time I tried both, and in the future I'll
know better (in case it should go to only one of the two).


The task I'm trying to achieve is to align several sequences together.
I don't have a basic pattern to match to.  All that I know is that the
"True" pattern should be of length "30" and that the sequences I'm looking
at have had missing values introduced to them at random points.
Here is an example of such sequences, where on the left we see the real
locations of the missing values, and on the right we see the sequence that
we will be able to observe. My goal is to reconstruct the left column using
only the sequences I've got in the right column (based on the fact
that many of the letters in each position are the same).

                    Real_sequence           The_sequence_we_see
1   CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2   CGCAATACTAGC-AGGTGACTTCC-CT-CG   CGCAATACTAGCAGGTGACTTCCCTCG
3   CGCAATGATCAC--GGTGGCTCCCGGTGCG  CGCAATGATCACGGTGGCTCCCGGTGCG
4   CGCAATACTAACCA-CTAACT--CGCTGCG   CGCAATACTAACCACTAACTCGCTGCG
5   CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6   CGCTATACTAACAA-GTG-CTTAGGC-CTG   CGCTATACTAACAAGTGCTTAGGCCTG
7   CCCA-C-CTAA-ACGGTGACTTACGCTCCG   CCCACCTAAACGGTGACTTACGCTCCG
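
To make the premise concrete (most letters agree within each position), here is a minimal sketch in base R that computes the majority letter per column from the seven gapped sequences in the left column. It assumes the gapped alignment is already known, so it illustrates the target rather than solving the alignment problem; the names `real`, `aln` and `consensus` are made up for this illustration.

```r
# The seven gapped sequences from the Real_sequence column above
real <- c("CGCAATACTAAC-AGCTGACTTACGCACCG",
          "CGCAATACTAGC-AGGTGACTTCC-CT-CG",
          "CGCAATGATCAC--GGTGGCTCCCGGTGCG",
          "CGCAATACTAACCA-CTAACT--CGCTGCG",
          "CGCACGGGTAAGAACGTGA-TTACGCTCAG",
          "CGCTATACTAACAA-GTG-CTTAGGC-CTG",
          "CCCA-C-CTAA-ACGGTGACTTACGCTCCG")

# One row per sequence, one column per position
aln <- do.call(rbind, strsplit(real, ""))

# Most frequent non-gap letter in each column
# (ties go to the alphabetically first letter)
consensus <- apply(aln, 2, function(col) {
  col <- col[col != "-"]
  if (length(col) == 0) return("-")
  names(which.max(table(col)))
})

consensus.string <- paste(consensus, collapse = "")
print(consensus.string)
```

With only seven noisy sequences the majority letter need not equal the true pattern at every position; with more sequences the vote should become more reliable.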

The agrep function allows one to specify which sorts of differences to consider in calculating a Levenshtein edit distance. Insertions are one possible distance component. You could take a look at its code (in C, in the sources) and perhaps rejigger it to spit out the location of the deletions.

> agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence, max.distance=list(deletions=0, substitutions=0, insertions=0))

integer(0)

> agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence, max.distance=list(deletions=0, substitutions=0, insertions=1))

[1] 1
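
As a possible alternative to modifying the C code, adist() from the utils package (available in more recent versions of R) can report the edit operations directly when called with counts = TRUE. A minimal sketch, using the first row of the table and assuming the gapped sequence is available for comparison, as in the agrep calls above; the variable names are illustrative only.

```r
observed <- "CGCAATACTAACAGCTGACTTACGCACCG"    # row 1, right column
real     <- "CGCAATACTAAC-AGCTGACTTACGCACCG"   # row 1, left column

# Generalized Levenshtein distance, keeping the edit bookkeeping
d <- adist(observed, real, counts = TRUE)

# Number of insertions, deletions and substitutions needed to turn
# `observed` into `real`
edit.counts <- drop(attr(d, "counts"))

# Transformation string: M = match, I = insertion, D = deletion,
# S = substitution
trafo <- attr(d, "trafos")[1, 1]

# Where a character had to be inserted, i.e. where the "-" sits.
# With no deletions in the string, the index is the position in `real`.
gap.positions <- which(strsplit(trafo, "")[[1]] == "I")

print(edit.counts)
print(gap.positions)
```

Note that this only locates gaps relative to a known target string; it does not by itself align several sequences to each other.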

--
David.


Here is example code to reproduce the example above:

ATCG <- c("A","T","C","G")
set.seed(40)

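original.seq <- sample(ATCG, 30, replace = TRUE)

# 200 identical copies of the original sequence, one per row
seqS <- matrix(original.seq, nrow = 200, ncol = 30, byrow = TRUE)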

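# Overwrite a random number (between 1 and number.of.changes) of randomly
# chosen positions in x with letters drawn from letters.to.change.with.
change.letters <- function(x, number.of.changes = 15,
                           letters.to.change.with = ATCG)
{
   number.of.changes <- sample(seq_len(number.of.changes), 1)
   new.letters <- sample(letters.to.change.with, number.of.changes,
                         replace = TRUE)
   where.to.change.the.letters <- sample(seq_along(x), number.of.changes,
                                         replace = FALSE)

   x[where.to.change.the.letters] <- new.letters
   return(x)
}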

change.letters(original.seq)

insert.missing.values <- function(x) change.letters(x, 3, "-")

insert.missing.values(original.seq)

seqS2 <- t(apply(seqS, 1, change.letters))

seqS3 <- t(apply(seqS2, 1, insert.missing.values))

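seqS4 <- apply(seqS3, 1, function(x) {paste(x, collapse = "")})
library(stringr)
# library(help=stringr)

# In current stringr versions str_replace() replaces only the first match,
# so str_replace_all() is needed to drop every "-"
all.seqS <- str_replace_all(seqS4, fixed("-"), "")

# how do we align this?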

data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)

I understand that if all I had was a string and a pattern I would be able to
use

library(Biostrings)
pairwiseAlignment(...)

But in the case I present we are dealing with many sequences to align to one
another (instead of aligning them to one pattern).
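
For the single-pattern case just mentioned, the underlying idea can be sketched in base R without Biostrings. The function below is a hypothetical helper (not part of any package, and not a description of what pairwiseAlignment does internally): given a reference of the full length and a shorter observed sequence, it uses dynamic programming to place gaps in the observed sequence so that as many letters as possible match the reference. It assumes that only gaps in the observed sequence are needed; it does not address the multi-sequence problem where no pattern is known.

```r
# Hypothetical helper: pad `obs` with "-" so it lines up with `ref`,
# maximizing the number of matching letters. No gaps are placed in `ref`.
align.to.reference <- function(obs, ref) {
  s <- strsplit(obs, "")[[1]]
  r <- strsplit(ref, "")[[1]]
  m <- length(s)
  n <- length(r)
  stopifnot(n >= 1, m <= n)

  # score[i + 1, j + 1]: best match count when the first i letters of ref
  # are aligned with the first j letters of obs (-Inf = impossible, j > i)
  score <- matrix(-Inf, n + 1, m + 1)
  score[, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(min(i, m))) {
      use  <- score[i, j] + (r[i] == s[j])  # pair ref[i] with obs[j]
      skip <- score[i, j + 1]               # put a gap opposite ref[i]
      score[i + 1, j + 1] <- max(use, skip)
    }
  }

  # Trace back from the end to recover one optimal gap placement
  out <- character(n)
  j <- m
  for (i in n:1) {
    if (j > 0 && score[i + 1, j + 1] == score[i, j] + (r[i] == s[j])) {
      out[i] <- s[j]
      j <- j - 1
    } else {
      out[i] <- "-"
    }
  }
  paste(out, collapse = "")
}

# Toy example: one letter is missing from the observed string
print(align.to.reference("ACGACGT", "ACGTACGT"))  # "ACG-ACGT"
```

When several gap placements score equally, the traceback returns just one of them, so the recovered gap positions are not guaranteed to be the true ones.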

Is there a known method for doing this in R?


Thanks,

Tal



----------------Contact Details:-------------------------------------------------------
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
West Hartford, CT
