Re: [Tutor] man pages parsing (still)

Tiago Saboga Mon, 11 Sep 2006 13:01:44 -0700

On Monday 11 September 2006 12:59, Kent Johnson wrote:
> Tiago Saboga wrote:
> > On Monday 11 September 2006 12:24, Kent Johnson wrote:
> >> Tiago Saboga wrote:
> >>> On Monday 11 September 2006 11:15, Kent Johnson wrote:
> >>>> Tiago Saboga wrote:
> >>>> How big is the XML? 25 seconds is a long time...I would look at
> >>>> cElementTree (implementation of ElementTree in C), it is pretty fast.
> >>>> http://effbot.org/zone/celementtree.htm
> >>>
> >>> It's about 10k. Hey, it seems easy, but I'd like not to start over
> >>> again. Of course, if it's the only solution... 25 (28, in fact, for the
> >>> cp man page) isn't really acceptable.
> >>
> >> That's tiny! No way it should take 25 seconds to parse a 10k file.
> >>
> >> Have you tried saving the file separately and parsing from disk? That
> >> would help determine if the interprocess pipe is the problem.
> >
> > Just tried, and - incredible - it took even longer: 46s. But in the
> > second run it came back to 25s. I really don't understand what's going
> > on. I did some other tests, and I found that all the code before
> > "parser.parse(stout)" runs almost instantly; it then spends all the
> > running time somewhere between this call and the first event; and the rest is
> > almost instantly again. Any ideas?
>
> What did you try, buffering or reading from a file? If parsing from a
> file takes 25 secs, I am amazed...


I read from a file, and before you ask, no, I'm not working on a 286 and 
compiling my kernel at the same time... ;-)

In fact, I decided to strip down both my code and the xml file. I stripped 
the code to almost nothing and it still took 23s. The same happened with the 
xml file... until I cut out the second line, the one with the DTD [1]. And 
surprise: the run time is fine. So I put it all together again, but with one 
caveat: there is now an error that was not raised before:

Traceback (most recent call last):
  File "./liftopy.py", line 130, in ?
    parser.parse(stout)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: /home/tiago/Computador/python/opy/manraw/doclift/cp.1.xml.stripped:279:16: undefined entity

Ok, the guilty line (279) has a "&copy;" that was probably defined in the DTD, 
and without the DOCTYPE line the parser has no way of knowing which DTD that 
is... But wait... How does python read the DTD? Does it fetch it from the net? 
I tried it (disconnected), and the answer is yes, it fetches it from the net. 
So that's the problem!

But how do I avoid it? I'll keep searching, but if you can spare some time to 
point me in the right direction, you'll make me a little happier. 
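
One possible way to avoid the download, as a minimal sketch only: the SAX API 
has a feature flag, feature_external_ges, that controls whether the expat 
reader loads external general entities, and that includes the external DTD 
named in the DOCTYPE. The sketch below is written for a current Python 3 and 
the standard library xml.sax; it has not been tried against the Python 2.3 
_xmlplus parser from the traceback. SAMPLE, Collector and parse_offline are 
hypothetical names made up for the illustration, and SAMPLE is a tiny stand-in 
for the real cp.1.xml file. Recent Python 3 releases already leave this 
feature off by default, so setting it here mostly documents the intent.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical stand-in for the real man page XML: same DOCTYPE line,
# plus one entity (&copy;) that is only declared in the external DTD.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                   "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<refentry><para>Copyright &copy; 2006</para></refentry>
"""


class Collector(ContentHandler):
    """Collects character data and the names of entities the parser skipped."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.skipped = []

    def characters(self, content):
        self.text.append(content)

    def skippedEntity(self, name):
        # Called for entities whose declaration was never read, e.g. because
        # the external DTD was not loaded.
        self.skipped.append(name)


def parse_offline(data):
    """Parse XML bytes without loading the external DTD."""
    handler = Collector()
    parser = xml.sax.make_parser()
    # Do not load external general entities (this covers the external DTD).
    parser.setFeature(feature_external_ges, False)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler


if __name__ == "__main__":
    result = parse_offline(SAMPLE)
    print("text:", "".join(result.text))
    print("skipped entities:", result.skipped)
```

With the DOCTYPE line kept in place and the DTD left unread, the parser is 
expected to report "&copy;" through skippedEntity() instead of raising 
"undefined entity", so the handler can decide what to substitute for it.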

[1] - The line is as follows:
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
                   "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

Thanks!

Tiago.
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
