Re: [Tutor] Another regular expression question

Kent Johnson Wed, 14 Sep 2005 07:57:24 -0700

Bernard Lebel wrote:
> Thanks for that pointer Kent, I'll check it out. Also thanks for
> letting me know I'm not nuts! :-)
> 
> Alan's suggestion about BeautifulSoup is actually excellent. The
> documentation is nice and the tool is very easy to use.
> 
> However, is it normal for parsing a 2618-line XML file to take
> 20-30 seconds or so?


That seems slow to me unless the lines are really long! How many bytes is the 
file? But I don't have much experience with BeautifulSoup.
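
One rough way to answer the size question is to measure the file directly. The sketch below uses only the standard library; the function name `describe_file` is made up for illustration and is not part of any XML tool mentioned in this thread.

```python
import os


def describe_file(path):
    """Return (size_in_bytes, line_count, longest_line_in_bytes) for a file."""
    size = os.path.getsize(path)
    line_count = 0
    longest = 0
    # Read in binary mode so lengths are counted in bytes, not characters.
    with open(path, "rb") as f:
        for line in f:
            line_count += 1
            longest = max(longest, len(line))
    return size, line_count, longest
```

If the longest line turns out to be very large compared with the average, that would suggest the line count understates how much data the parser has to handle.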

ElementTree is fast and cElementTree (the C implementation) is really fast. I 
have used it to read, process and write a 28 MB XML file; it took about 10 
seconds.
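
For readers who want to see what a read/process/write pass might look like, here is a minimal sketch. It assumes a current Python, where ElementTree ships in the standard library as `xml.etree.ElementTree` and the C accelerator is used automatically when available; at the time of this thread it was a separate download. The function names, the tag names and the "rename a tag" processing step are all invented for illustration and are not the actual job described above.

```python
import time
import xml.etree.ElementTree as ET


def rename_tags(src_path, dst_path, old_tag, new_tag):
    """Read an XML file, rename every <old_tag> element, write the result.

    Returns the number of elements renamed.
    """
    tree = ET.parse(src_path)
    # Collect the matches first so the tree is not modified while iterating.
    matches = list(tree.iter(old_tag))
    for elem in matches:
        elem.tag = new_tag
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)
    return len(matches)


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

Calling `timed(rename_tags, "in.xml", "out.xml", "item", "entry")` (hypothetical file and tag names) returns the count of renamed elements together with the wall-clock time, which makes it easy to compare against other parsers on the same file.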

Kent

> 
> 
> Thanks
> Bernard
> 
> 
> 
> On 9/14/05, Kent Johnson <[EMAIL PROTECTED]> wrote:
> 
>>Bernard Lebel wrote:
>>
>>>Thanks Alan,
>>>
>>>I'll check BeautifulSoup asap.
>>>
>>>I'm using regex simply because I have no clue where to start to parse
>>>XML. I have read about the various XML tools available in the Python
>>>library, however I'm at a complete loss as to what to make of them. Many
>>>of them seem to use some programming standards, which I am completely
>>>unfamiliar with (this is the first time that I dig into XML writing
>>>and parsing).
>>>
>>>I don't know where to start to learn about all these standards, and as
>>>usual with new programming things, the documentation is hard to
>>>swallow (it usually is written more as a reference than a proper user
>>>guide/tutorial). I have to admit this is very frustrating, so if I'm
>>>looking at things from a wrong perspective please advise me, I need
>>>it.
>>
>>I agree that the Python XML story is confusing even for the files in the 
>>standard library. Worse, the (IMO) best solutions are not to be found in the 
>>standard lib or PyXML at all.
>>
>>The std lib and PyXML are based on the DOM and SAX standards. These standards 
>>were designed to be "language-neutral" - there are implementations in Python, 
>>Java and other languages. The good side of this is, if you learn how to use 
>>them, the knowledge is pretty portable to other languages. The bad side is, 
>>the APIs defined by the standard are IMO clunky and painful to use, 
>>especially in Python.
>>
>>There is a current thread on comp.lang.python discussing this with good 
>>suggestions and pointers to more info:
>>http://groups.google.com/group/comp.lang.python/browse_frm/thread/a48891aa645ead13/dcd8fdc20b4b191b?hl=en#dcd8fdc20b4b191b
>>
>>My personal preference is ElementTree. Beautiful Soup is good too though I 
>>have only tried it with HTML. If I was running on Linux I would try lxml 
>>which uses the ElementTree API and adds full XPath support. Amara looks like 
>>the Cadillac solution - big and cushy. I haven't tried it. Uche's articles 
>>(referenced in the thread above) have pointers to many other choices but 
>>these seem to be the most popular.
>>
>>My favorite XML lib is actually dom4j which is in Java. It works great with 
>>Jython.
>>
>>Kent
>>
>>
>>>So right now I'm just taking a shortcut and using an ultra-simple
>>>re-based parser to retrieve the tags I'm looking for. I know it will
>>>probably be slow, but hopefully I'll get familiar with sophisticated
>>>parsing in the future and improve my code. As it stands right now,
>>>even the re syntax is not super easy to learn.
>>
>>For what you are doing re seems fine to me. You can get in trouble using re's 
>>with XML because of nested tags, variations in spelling and order, and probably 
>>a bunch of other things. But for simple stuff it can work fine.
>>
>>Kent
>>
>>
>>>
>>>Kent: That works (of course!). Thanks a bunch once again!
>>>
>>>
>>>Thanks
>>>Bernard
>>>
>>>On 9/14/05, Alan G <[EMAIL PROTECTED]> wrote:
>>>
>>>
>>>>Hi Bernard,
>>>>
>>>>
>>>>
>>>>>Hello, yet another regular expression question :-)
>>>>>
>>>>>So I have this xml file that I'm trying to find a
>>>>>specific tag in.
>>>>
>>>>I'm always suspicious when I see regular expression
>>>>and xml/html in the same context. Regexes are not good
>>>>for parsing xml/html files and it's usually much easier
>>>>to use a proper parser - such as Beautiful Soup.
>>>>
>>>>http://www.crummy.com/software/BeautifulSoup/
>>>>
>>>>Is there any special reason why you are using a regex
>>>>sledgehammer to crack this particular nut? Or is it
>>>>just to gain experience using regex?
>>>>
>>>>Alan G.
>>>>
>>>
>>>_______________________________________________
>>>Tutor maillist  -  Tutor@python.org
>>>http://mail.python.org/mailman/listinfo/tutor
>>>
>>>
>>
> 
