Re: [Tutor] Extract strings from a text file

spir Thu, 26 Feb 2009 23:23:10 -0800

On Thu, 26 Feb 2009 21:53:43 -0800,
Mohamed Hassan <linuxlove...@gmail.com> wrote:


> Hi all,
> 
> I am new to Python and still trying to figure out some things. Here is the
> situation:
> 
> There is a text file that looks like this:
> 
> text text text <ID>Joseph</text text text>
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text <Full name> Joseph Smith</text text text>
> text text text <Rights> 1</text text text>
> text text text <LDAP> 0</text text text>
> 
> 
> This text file is very long, however all the entries in it look the same as
> the above.
> 
> What I am trying to do is:
> 
> 1. I need to extract the name and the full name from this text file. For
> example: ( ID is Joseph & Full name is Joseph Smith).
> 
> 
> - I am thinking I need to write something that will check the whole text
> file line by line which I have done already.
> - Now what I am trying to figure out is : How can I write a function that
> will check to see if the line contains the word ID between < > then copy the
> letters after > until > and dump it to a text file.
> 
> Can somebody help please. I know this might sound easy for some people, but
> again I am new to Python and still figuring out things.
> 
> Thank you

This is a typical text-parsing job, and there are tools for it. To point you 
to the most appropriate one, though, we would need a bit more information about 
the real structure of the text and, above all, about what you want to do with 
the data afterwards. I guess there is a higher-level structure that groups IDs, 
names, rights, etc. into sections, and that you will need to keep them together 
for further processing.
Anyway, as a first exploration you can use regular expressions (regex) to 
extract individual data items. For instance:

from re import compile as Pattern

# capture whatever sits between <ID> and the next <...> tag
pattern = Pattern(r""".*<ID>(.+)<.+>.*""")

# a single line
line = "text text text <ID>Joseph</text text text>"
print(pattern.findall(line))

# several lines at once: '.' never matches a newline, so each line matches separately
text = """\
text text text <ID>Joseph</text text text>
text text text <ID>Jodia</text text text>
text text text <ID>Joobawap</text text text>
"""
print(pattern.findall(text))
==>
['Joseph']
['Joseph', 'Jodia', 'Joobawap']

There are good regex tutorials online that are easy to find. The key points 
in this example are:

        r""".*<ID>(.+)<.+>.*"""
* the pattern between """...""" expresses the overall format to be matched
* whatever is between (...) will be extracted by findall
* '.' means 'any character' (except a newline, by default); '*' means zero or 
more of what is just before; '+' means one or more of what is just before.
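
To see these elements in isolation, here is a small sketch. The sample strings 
and variable names are made up for illustration; only the re calls themselves 
are standard.

```python
import re

# '+' needs at least one character; '*' also accepts none.
plus_match = re.match(r"<ID>.+", "<ID>")   # None: nothing follows <ID>
star_match = re.match(r"<ID>.*", "<ID>")   # a match: zero characters is fine

# Parentheses mark the part that findall returns.
sample = "<ID>Joseph</text>"               # made-up sample line
captured = re.findall(r"<ID>(.+)<.+>", sample)

# '.' does not match a newline by default, so a match never spans two lines.
two_lines = "<ID>Ann</x>\n<ID>Bob</x>"     # made-up sample text
per_line = re.findall(r".*<ID>(.+)<.+>.*", two_lines)

print(plus_match, star_match is not None, captured, per_line)
```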

So the pattern will look for strings that contain a sequence formed of:

1. possible start chars
2. <ID> literally
3. one or more chars -- to return
4. something between <...>
5. possible end chars
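
Applied to the original question, the same kind of pattern could be used line 
by line to pair each ID with its full name. This is only a sketch: the helper 
names (tag_pattern, extract_records) are hypothetical, and it assumes that 
every record starts with an <ID> line followed later by a <Full name> line, 
which may not hold for the real file.

```python
import re

def tag_pattern(tag):
    """Build the pattern shown above for an arbitrary tag name (hypothetical helper)."""
    # re.escape guards against tag names containing regex-special characters
    return re.compile(r".*<" + re.escape(tag) + r">(.+)<.+>.*")

ID_PATTERN = tag_pattern("ID")
NAME_PATTERN = tag_pattern("Full name")

def extract_records(lines):
    """Return (id, full_name) pairs; assumes each <ID> line opens a new record."""
    records = []
    current_id = None
    for line in lines:
        id_match = ID_PATTERN.match(line)
        if id_match:
            current_id = id_match.group(1).strip()
            continue
        name_match = NAME_PATTERN.match(line)
        if name_match and current_id is not None:
            # strip() drops the space that follows <Full name> in the sample
            records.append((current_id, name_match.group(1).strip()))
            current_id = None
    return records

sample_lines = [
    "text text text <ID>Joseph</text text text>\n",
    "text text text text text text\n",
    "text text text <Full name> Joseph Smith</text text text>\n",
    "text text text <Rights> 1</text text text>\n",
]
records = extract_records(sample_lines)
print(records)
```

An open file object can be passed in place of sample_lines, since iterating 
over a file yields its lines one at a time; each pair can then be written to 
an output file with its write method.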

Denis
------
la vita e estrany
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
