On Oct 2, 2012, at 3:19 PM, Florian Huber wrote:
> Thanks guys, for the answers. :-)
>
> I'm sorry I posted a shortened version of the code as I thought it'd make it
> easier to read while still getting the message across. So here's the actual
> example and the corresponding output:
>
> The string is:
>
> >ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG
>
> So I'm trying to retrieve 'ENSG00000112365', 'ENST00000230122' and the
> sequence bit, starting with a 'T' and get rid of the junk in between.
>
> code:
>
> /#!/usr/bin/perl//
> //
> //use strict;//
> //use warnings;//
> //
> //my $gene;//
> //my @elements = <>;//
> //
> //foreach $gene (@elements) {//
> // $gene =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;//
> // print "$1 $2 $3\n";//
> //}/
Why all the / characters? Did you put those there, or is it some artifact of
your email client or mine?

In the future, try posting a complete program that people can run without
having to generate a data file. In this case, just assign a scalar variable
with your data line and modify your program to parse that:

    my $element =
      q(ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAA...CTTCAAGCATTATTTTCAAG);

etc.
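
For instance, a self-contained version of the posted program might look like
the sketch below. The data line is copied from the original post, including
the leading '>' that the pattern expects; the sub name parse_as_posted is made
up for this example. The regex is unchanged, so this should reproduce the
problem rather than fix it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Data line from the original post, assigned directly so no input file is needed.
my $element =
  q(>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG);

# Hypothetical helper: applies the pattern exactly as posted.
# With /x, the literal spaces inside the pattern are ignored.
# In list context the match returns the three captures ($1, $2, $3).
sub parse_as_posted {
    my ($text) = @_;
    return $text =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;
}

my ( $gene, $transcript, $sequence ) = parse_as_posted($element);

# Brackets make an empty capture visible in the output.
print "gene=[$gene] transcript=[$transcript] sequence=[$sequence]\n";
```

The brackets around each field are only there so that an empty third capture
shows up as [] instead of disappearing into trailing whitespace.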
> This will print "ENSG00000112365 ENST00000230122"
>
> without the sequence. Originally I had .* before the ([ACGT]) so I figured
> it's greedy and will eat the sequence away. ? makes it nongreedy, doesn't it?
> Still doesn't work.
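What you are missing is that [AGCT]* means "zero or more characters from the
set A, G, C, and T". The non-greedy .*? first tries to match nothing, and at
that point (right after the ENST number, where the next character is '|')
[AGCT]* can match zero characters, so the whole pattern succeeds with an empty
$3. You are getting a zero-character match because that is what you are asking
for. Try ([AGCT]+), which insists on at least one matching character and will
match the longest run of consecutive A, G, C, and T characters.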
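
A minimal sketch of the + version, assuming the same data line as in the
original post. The sub name parse_header is invented for the example; the only
change to the pattern is * becoming + in the last group.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Data line from the original post.
my $element =
  q(>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG);

# Hypothetical helper: same pattern as posted, but the last group uses +
# so it must match at least one of A, G, C, T.
# The non-greedy .*? now has to skip the '|' and digit characters
# before [AGCT]+ can match anything.
sub parse_header {
    my ($text) = @_;
    return $text =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]+)/x;
}

my ( $gene, $transcript, $sequence ) = parse_header($element);

if ( defined $sequence ) {
    print "$gene $transcript $sequence\n";
}
else {
    print "no match\n";
}
```

Checking that the match succeeded before using the captures avoids printing
stale or undefined values when a line does not fit the pattern.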
>
> Other results:
>
> with ([AGCT])* it says that $3 is uninitialised - so here it didn't match at
> all???
>
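You are telling it "zero or more", so matching nothing at all is fine. The
non-greedy .*? tries the empty match first, [AGCT] cannot match the '|' that
follows, so the group repeats zero times and the pattern still succeeds.
Because the group never took part in the match, $3 is never set.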
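
The difference between ([AGCT]*) and ([AGCT])* can be seen side by side in the
sketch below, again assuming the data line from the original post. The array
names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Data line from the original post.
my $element =
  q(>ENSG00000112365|ENST00000230122|109783797|109787053TGTTTCACAATTCATTTTCTACTAAATGTGTACCATTTTTTAAATTGTTTTAACAGAAAGCTGAGGAATGAAAAAACTTCAAGCATTATTTTCAAG);

# Quantifier inside the group: the group takes part in the match
# and captures an empty string.
my @star_inside = $element =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT]*)/x;

# Quantifier outside the group: the group is repeated zero times,
# so it never takes part in the match and its capture is undef.
my @star_outside = $element =~ />(ENSG\d*) \| (ENST\d*) .*? ([AGCT])*/x;

printf "([AGCT]*): third capture is %s\n",
  defined( $star_inside[2] ) ? 'defined but empty' : 'undef';
printf "([AGCT])*: third capture is %s\n",
  defined( $star_outside[2] ) ? 'defined' : 'undef';
```

Both patterns match successfully; they differ only in whether the third
capture is an empty string or undefined, which is why only the second form
triggers the "uninitialized" warning.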
> with ([AGCT]{5}) it works fine - it returns TGTTT.
>
>
> This I found kinda strange - looks like I've got something with the
> greediness/precedence wrong?
What you have wrong is telling the RE engine that you don't care about matching
any characters!