How are you supposed to interpret the string that drives the parsing (Str)? Does each stem have the same number of ">" characters in its opening run as "<" characters in its closing run? That is what it appears to be, judging by the way Stem 3 is parsed. You will have to provide a little more insight into how to interpret the symbols. Does the parsing always start with a partial Stem 0, as your example shows? Is there a way of making sure you have the right sequences when you start? Is there a chance of an error in the middle of the string that you would have to restart from?

How long are the strings that you want to parse? Is each one a self-contained sequence like the one in your example, or do they go on for thousands of characters? Is there always at least one '.' between stems? A full set of rules for how the parsing should be done would be useful. Do you have a BNF grammar for the format?
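Until those rules are pinned down, here is a rough sketch in base R of two building blocks that might help. It is not a definitive parser. It assumes that Str contains only ">", "<" and ".", that Seq and Str have the same length, and that every ">" is closed by a later "<" in properly nested order. The helper names split_runs and pair_brackets are made up for illustration. The first cuts Seq wherever the symbol in Str changes; the second pairs each ">" with its "<" using a stack, which also answers the question of whether the opening and closing counts agree.

```r
# Example strings from the original question.
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Hypothetical helper: split Str into runs of identical symbols and
# return the matching piece of Seq for each run.
split_runs <- function(seq, str) {
  stopifnot(nchar(seq) == nchar(str))
  runs  <- rle(strsplit(str, "")[[1]])
  end   <- cumsum(runs$lengths)
  start <- end - runs$lengths + 1
  data.frame(symbol = runs$values,
             start  = start,
             end    = end,
             bases  = substring(seq, start, end),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: for every position in Str, return the position of
# its partner bracket (NA for "."). Stops if the brackets do not balance.
pair_brackets <- function(str) {
  ch    <- strsplit(str, "")[[1]]
  mate  <- rep(NA_integer_, length(ch))
  stack <- integer(0)                      # positions of still-open ">"
  for (i in seq_along(ch)) {
    if (ch[i] == ">") {
      stack <- c(stack, i)                 # push
    } else if (ch[i] == "<") {
      if (length(stack) == 0) stop("unmatched '<' at position ", i)
      j     <- stack[length(stack)]        # most recent open ">"
      stack <- stack[-length(stack)]       # pop
      mate[i] <- j
      mate[j] <- i
    }
  }
  if (length(stack) > 0) stop("unmatched '>' at position ", stack[1])
  mate
}

runs <- split_runs(Seq, Str)
mate <- pair_brackets(Str)
```

On the example, runs has 14 rows rather than 15, because the closing of Stem 3 (5 characters) and the closing of Stem 0 (7 characters) are adjacent and merge into a single run of 12 "<". Splitting on runs alone is therefore not enough; the pairing in mate is what separates them, since position 65 pairs with position 49 (Stem 3) while position 66 pairs with position 7 (Stem 0).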
On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.gal...@gmail.com> wrote:

> Hello all,
>
> For some work I am doing on RNA, I want to use R to do string parsing that
> (I think) is like a simplistic HTML parsing.
>
> For example, let's say we have the following two variables:
>
> Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
>
> Say that I want to parse "Seq" according to "Str", by using the legend here:
>
> Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
> Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
>      |     |  |              | |               |     |               ||     |
>      +-----+  +--------------+ +---------------+     +---------------++-----+
>         |          Stem 1           Stem 2                Stem 3         |
>         |                                                                |
>         +----------------------------------------------------------------+
>                                       Stem 0
>
> Assume that we always have 4 stems (0 to 3), but that the number of letters
> before and after each of them can vary.
>
> The output should be something like the following list structure:
>
> list(
>   "Stem 0 opening" = "GCCTCGA",
>   "before Stem 1" = "TA",
>   "Stem 1" = list(opening = "GCTC",
>                   inside = "AGTTGGGA",
>                   closing = "GAGC"
>                  ),
>   "between Stem 1 and 2" = "G",
>   "Stem 2" = list(opening = "TACGA",
>                   inside = "CTGAAGA",
>                   closing = "TCGTA"
>                  ),
>   "between Stem 2 and 3" = "AGGtC",
>   "Stem 3" = list(opening = "ACCAG",
>                   inside = "TTCGATC",
>                   closing = "CTGGT"
>                  ),
>   "After Stem 3" = "",
>   "Stem 0 closing" = "TCGGGGC"
> )
>
> I don't have any experience with programming a parser, and would like
> advice on what strategy to use when programming something like this (and
> any recommended R commands to use).
>
> What I was thinking of is to first get rid of "Stem 0", then go through
> the inner string with a recursive function (let's call it "seperate.stem")
> that each time will split the string into:
> 1. before stem
> 2. opening stem
> 3. inside stem
> 4. closing stem
> 5. after stem
>
> where the "after stem" part is then recursively passed to the same
> function ("seperate.stem").
>
> The thing is that I am not sure how to do this coding without using
> a loop.
>
> Any advice will be most welcome.
>
> ----------------Contact Details:----------------
> Contact me: tal.gal...@gmail.com | 972-52-7275845
> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
> www.r-statistics.com (English)
> ------------------------------------------------
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?