Re: [Tutor] Another regular expression question

Kent Johnson Wed, 14 Sep 2005 08:20:56 -0700

Bernard Lebel wrote:
> The file size is 112 KB. Most lines look this way:
> 
> <parameter name="roty" type="Parameter" sourceclassname="nosource">
> 
> 
> I'll give ElementTree a try.


To get you started:

from xml.etree import ElementTree

doc = ElementTree.parse('myfile.xml')
# './/sceneobject' matches <sceneobject> elements at any depth
for sceneobject in doc.findall('.//sceneobject'):
    if sceneobject.get('type') == 'CameraRoot':
        # this is a sceneobject that you want
        print(sceneobject.get('name'))

One gotcha - if your XML uses namespaces, you have to prefix the tag name in
findall() with the namespace URI in curly braces. It will look something like
  doc.findall('.//{http://www.imsproject.org/xsd/imscp_rootv1p1p2}resource')
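
To make the namespace gotcha concrete, here is a minimal sketch. The sample
document, its namespace URI (http://example.com/scene) and the prefix "s" are
all made up for illustration; only the findall() behaviour is the point.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the namespace URI is invented for this sketch.
XML = """<scene xmlns="http://example.com/scene">
  <sceneobject name="Camera_Root" type="CameraRoot"/>
  <sceneobject name="light" type="Light"/>
</scene>"""

root = ET.fromstring(XML)

# Without the namespace, the bare tag name matches nothing.
plain = root.findall('.//sceneobject')

# Namespace URI in curly braces, followed by the local tag name.
qualified = root.findall('.//{http://example.com/scene}sceneobject')

# Equivalent form using a prefix map (the prefix "s" is arbitrary).
ns = {'s': 'http://example.com/scene'}
mapped = root.findall('.//s:sceneobject', ns)

camera_names = [el.get('name') for el in qualified
                if el.get('type') == 'CameraRoot']
print(len(plain), len(qualified), len(mapped), camera_names)
```

The prefix-map form is usually easier to read when the same namespace shows
up in many queries.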

Let us know how long that takes...
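
One way to measure it, as a rough sketch: the document below is generated in
memory as a stand-in for the real file (about 2600 lines shaped like the
sample line quoted above, wrapped in a made-up <scene> root), so the number it
prints only gives a ballpark for a file of that size.

```python
import time
import xml.etree.ElementTree as ET

# Build a hypothetical test document of a few thousand lines; the <scene>
# root and the parameter names are assumptions made for this sketch.
lines = ['<scene>']
for i in range(2600):
    lines.append(
        '<parameter name="p%d" type="Parameter" sourceclassname="nosource"/>' % i
    )
lines.append('</scene>')
xml_text = '\n'.join(lines)

# Time the parse plus one search over the whole tree.
start = time.perf_counter()
root = ET.fromstring(xml_text)
params = root.findall('.//parameter')
elapsed = time.perf_counter() - start

print('parsed %d parameters in %.4f seconds' % (len(params), elapsed))
```

For a real file, swap ET.fromstring(xml_text) for ET.parse('myfile.xml') and
search from the returned tree instead.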

Kent

> 
> 
> Bernard
> 
> 
> 
> On 9/14/05, Kent Johnson <[EMAIL PROTECTED]> wrote:
> 
>>Bernard Lebel wrote:
>>
>>>Thanks for that pointer Kent, I'll check it out. Also thanks for
>>>letting me know I'm not nuts! :-)
>>>
>>>Alan's suggestion about BeautifulSoup is actually excellent. The
>>>documentation is nice and the tool is very easy to use.
>>>
>>>However, is it normal that parsing a 2618-line XML file takes
>>>20-30 seconds or so?
>>
>>That seems slow to me unless the lines are really long! How many bytes is the 
>>file? But I don't have much experience with BeautifulSoup.
>>
>>ElementTree is fast and cElementTree (the C implementation) is really fast. I 
>>have used it to read, process and write a 28 MB XML file; that took about 10 
>>seconds.
>>
>>Kent
>>
>>
>>>
>>>Thanks
>>>Bernard
>>>
>>>
>>>
>>>On 9/14/05, Kent Johnson <[EMAIL PROTECTED]> wrote:
>>>
>>>
>>>>Bernard Lebel wrote:
>>>>
>>>>
>>>>>Thanks Alan,
>>>>>
>>>>>I'll check BeautifulSoup asap.
>>>>>
>>>>>I'm using regex simply because I have no clue where to start to parse
>>>>>XML. I have read about the various xml tools available in the Python
>>>>>library, however I'm at a complete loss as to what to make of them. Many
>>>>>of them seem to use some programming standards, which I am completely
>>>>>unfamiliar with (this is the first time that I have dug into XML writing
>>>>>and parsing).
>>>>>
>>>>>I don't know where to start to learn about all these standards, and as
>>>>>usual with new programming things, the documentation is hard to
>>>>>swallow (it usually is written more as a reference than a proper user
>>>>>guide/tutorial). I have to admit this is very frustrating, so if I'm
>>>>>looking at things from a wrong perspective please advise me, I need
>>>>>it.
>>>>
>>>>I agree that the Python XML story is confusing even for the files in the 
>>>>standard library. Worse, the (IMO) best solutions are not to be found in 
>>>>the standard lib or PyXML at all.
>>>>
>>>>The std lib and PyXML are based on the DOM and SAX standards. These 
>>>>standards were designed to be "language-neutral" - there are 
>>>>implementations in Python, Java and other languages. The good side of this 
>>>>is, if you learn how to use them, the knowledge is pretty portable to other 
>>>>languages. The bad side is, the APIs defined by the standard are IMO clunky 
>>>>and painful to use, especially in Python.
>>>>
>>>>There is a current thread on comp.lang.python discussing this with good 
>>>>suggestions and pointers to more info:
>>>>http://groups.google.com/group/comp.lang.python/browse_frm/thread/a48891aa645ead13/dcd8fdc20b4b191b?hl=en#dcd8fdc20b4b191b
>>>>
>>>>My personal preference is ElementTree. Beautiful Soup is good too though I 
>>>>have only tried it with HTML. If I was running on Linux I would try lxml 
>>>>which uses the ElementTree API and adds full XPath support. Amara looks 
>>>>like the Cadillac solution - big and cushy. I haven't tried it. Uche's 
>>>>articles (referenced in the thread above) have pointers to many other 
>>>>choices but these seem to be the most popular.
>>>>
>>>>My favorite XML lib is actually dom4j which is in Java. It works great with 
>>>>Jython.
>>>>
>>>>Kent
>>>>
>>>>
>>>>
>>>>>So right now I'm just taking a shortcut and using an ultra-simple
>>>>>re-based parser to retrieve the tags I'm looking for. I know it will
>>>>>probably be slow, but hopefully I'll get familiar with sophisticated
>>>>>parsing in the future and improve my code. As it stands right now,
>>>>>even the re syntax is not super easy to learn.
>>>>
>>>>For what you are doing re seems fine to me. You can get in trouble using 
>>>>re's with XML because of nested tags, variations in spelling and order, 
>>>>and probably a bunch of other things. But for simple stuff it can work fine.
>>>>
>>>>Kent
>>>>
>>>>
>>>>
>>>>>Kent: That works (of course!). Thanks a bunch once again!
>>>>>
>>>>>
>>>>>Thanks
>>>>>Bernard
>>>>>
>>>>>On 9/14/05, Alan G <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Hi Bernard,
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>Hello, yet another regular expression question :-)
>>>>>>>
>>>>>>>So I have this xml file that I'm trying to find a
>>>>>>>specific tag in.
>>>>>>
>>>>>>I'm always suspicious when I see regular expression
>>>>>>and xml/html in the same context. regex are not good
>>>>>>for parsing xml/html files and it's usually much easier
>>>>>>to use a proper parser - such as beautiful soup.
>>>>>>
>>>>>>http://www.crummy.com/software/BeautifulSoup/
>>>>>>
>>>>>>Is there any special reason why you are using a regex
>>>>>>sledgehammer to crack this particular nut? Or is it
>>>>>>just to gain experience using regex?
>>>>>>
>>>>>>Alan G.
>>>>>>
>>>>>
>>>>>_______________________________________________
>>>>>Tutor maillist  -  Tutor@python.org
>>>>>http://mail.python.org/mailman/listinfo/tutor
>>>>>
>>>>>
