Hi Jim, Thanks for the questions, here are my answers: *Q: Does each sequence have the same number of ">>>>" for the opening sequence as it does for "<<<<" on the ending sequence? * A: Yes
*Q: Does the parsing always start with a partial stem 0 as your example shows? * A: No. Sometimes it will start with a few "." *Q: Is there a way of making sure you have the right sequences when you start? * A: I am not sure I understand what you mean. *Q: Is there a chance of error in the middle of the string that you have to restart from?* A: Sadly, yes. In which case, I'll need to ignore one of the inner stems... *Q: How long are these strings that you want to parse? * A: Each string has between 60 to 150 characters (and I have tens of thousands of them...) *Q: Is each one a self contained sequence like you show in your example, or do they go on for thousands of characters? * A: each sequence is self contained. *Q: Is there always at least one '.' between stems? * A: No. *Q: A full set of rules as to how the parsing should be done would be useful.* A: I agree. But since I don't have even a basic idea on how to start coding this, I thought first to have some help on the beginning and try to tweak with the other cases that will come up before turning back for help. *Q: Do you have the BNF syntax for parsing?* A: No. Your e-mail is the first time I came across it ( http://en.wikipedia.org/wiki/BackusNaur_Form<http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form> ). Thanks for the help, Tal ----------------Contact Details:------------------------------------------------------- Contact me: tal.gal...@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English) ---------------------------------------------------------------------------------------------- On Tue, Mar 16, 2010 at 1:59 PM, jim holtman <jholt...@gmail.com> wrote: > How are you supposed to interprete the string that is doing the parsing? > Does each sequence have the same number of ">>>>" for the opening sequence > as it does for "<<<<" on the ending sequence? That what it appears to be > looking at the way stem 3 is parsed. You will have to provide a little more > insight on how to interprete the symbols. Does the parsing always start > with a partial stem 0 as your example shows? Is there a way of making sure > you have the right sequences when you start? Is there a chance of error in > the middle of the string that you have to restart from? How long are these > strings that you want to parse? Is each one a self contained sequence like > you show in your example, or do they go on for thousands of characters? Is > there always at least one '.' between stems? A full set of rules as to how > the parsing should be done would be useful. Do you have the BNF syntax for > parsing? > > On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.gal...@gmail.com> wrote: > >> Hello all, >> >> For some work I am doing on RNA, I want to use R to do string parsing that >> (I think) is like a simplistic HTML parsing. >> >> >> For example, let's say we have the following two variables: >> >> Seq <- >> >> "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA" >> Str <- >> >> ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<." >> >> Say that I want to parse "Seq" According to "Str", by using the legend >> here >> >> Seq: >> GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA >> Str: >> >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<. >> >> | | | | | | | || >> | >> >> +-----+ +--------------+ +---------------+ >> +---------------++-----+ >> >> | Stem 1 Stem 2 Stem 3 | >> >> | | >> >> +----------------------------------------------------------------+ >> >> Stem 0 >> >> Assume that we always have 4 stems (0 to 3), but that the length of >> letters >> before and after each of them can very. >> >> The output should be something like the following list structure: >> >> >> list( >> "Stem 0 opening" = "GCCTCGA", >> "before Stem 1" = "TA", >> "Stem 1" = list(opening = "GCTC", >> inside = "AGTTGGGA", >> closing = "GAGC" >> ), >> "between Stem 1 and 2" = "G", >> "Stem 2" = list(opening = "TACGA", >> inside = "CTGAAGA", >> closing = "TCGTA" >> ), >> "between Stem 2 and 3" = "AGGtC", >> "Stem 3" = list(opening = "ACCAG", >> inside = "TTCGATC", >> closing = "CTGGT" >> ), >> "After Stem 3" = "", >> "Stem 0 closing" = "TCGGGGC" >> ) >> >> >> I don't have any experience with programming a parser, and would like >> advices as to what strategy to use when programming something like this >> (and >> any recommended R commands to use). >> >> >> What I was thinking of is to first get rid of the "Stem 0", then go >> through >> the inner string with a recursive function (let's call it "seperate.stem") >> that each time will split the string into: >> 1. before stem >> 2. opening stem >> 3. inside stem >> 4. closing stem >> 5. after stem >> >> Where the "after stem" will then be recursively entered into the same >> function ("seperate.stem") >> >> The thing is that I am not sure how to try and do this coding without >> using >> a loop. >> >> Any advices will be most welcomed. >> >> >> ----------------Contact >> Details:------------------------------------------------------- >> Contact me: tal.gal...@gmail.com | 972-52-7275845 >> Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | >> www.r-statistics.com (English) >> >> ---------------------------------------------------------------------------------------------- >> >> [[alternative HTML version deleted]] >> >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html> >> and provide commented, minimal, self-contained, reproducible code. >> > > > > -- > Jim Holtman > Cincinnati, OH > +1 513 646 9390 > > What is the problem that you are trying to solve? > [[alternative HTML version deleted]]
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.