On Fri, Feb 27, 2009 at 2:22 AM, spir <denis.s...@free.fr> wrote:
> Anyway for a startup exploration you can use regular expressions (regex) to
> extract individual data item. For instance:
>
> from re import compile as Pattern
> pattern = Pattern(r""".*<ID>(.+)<.+>.*""")
> line = "text text text <ID>Joseph</text text text>"
> print pattern.findall(line)
> text = """\
> text text text <ID>Joseph</text text text>
> text text text <ID>Jodia</text text text>
> text text text <ID>Joobawap</text text text>
> """
> print pattern.findall(text)
> ==>
> ['Joseph']
> ['Joseph', 'Jodia', 'Joobawap']

You need to be a bit careful with wildcards. Your regex doesn't work
correctly if there are two <ID>s on a line:

In [7]: re.findall(r""".*<ID>(.+)<.+>.*""", 'text <ID>Joseph</ID><ID>Mary</ID>')
Out[7]: ['Mary']

The problem is that the initial .* is greedy and first matches the whole
line; the regex then backtracks only as far as the last <ID>, finds a
match there and stops.

Taking out the initial .* shows another problem:

In [8]: re.findall(r"""<ID>(.+)<.+>""", 'text <ID>Joseph</ID><ID>Mary</ID>')
Out[8]: ['Joseph</ID><ID>Mary']

Now (.+) matches to the end of the line, then backs up to find the last <.

One way to fix this is to use non-greedy matching:

In [10]: re.findall(r"""<ID>(.+?)<""", 'text <ID>Joseph</ID><ID>Mary</ID>')
Out[10]: ['Joseph', 'Mary']

Another way is to specifically exclude the terminating character (<)
from the wildcard match:

In [11]: re.findall(r"""<ID>([^<]+)<""", 'text <ID>Joseph</ID><ID>Mary</ID>')
Out[11]: ['Joseph', 'Mary']

Kent
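
P.S. For anyone who wants to try the patterns above side by side, here is a
small self-contained sketch. It is written for Python 3, and the dictionary
labels and the multi-line sample string are my own illustrative names, not
anything from the original messages.

```python
import re

line = 'text <ID>Joseph</ID><ID>Mary</ID>'

# Hypothetical labels for the four patterns discussed in this thread.
patterns = {
    'leading .*':    r'.*<ID>(.+)<.+>.*',   # greedy prefix swallows the first <ID>
    'greedy .+':     r'<ID>(.+)<.+>',       # group runs on to the last '<'
    'non-greedy':    r'<ID>(.+?)<',         # group stops at the first '<'
    'negated class': r'<ID>([^<]+)<',       # group cannot contain '<' at all
}

results = {name: re.findall(rx, line) for name, rx in patterns.items()}
for name, found in results.items():
    print('%-14s %r' % (name, found))

# The two fixes differ when the value spans a newline: by default '.' does
# not match '\n', while a negated character class does.
wrapped = '<ID>Jo\nseph</ID>'
non_greedy_wrapped = re.findall(patterns['non-greedy'], wrapped)
negated_wrapped = re.findall(patterns['negated class'], wrapped)
print(non_greedy_wrapped, negated_wrapped)
```

On single-line input the two fixes give the same answer; the negated class
also avoids backtracking inside the group, which can matter on long lines.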