According to Geoff Hutchison:
> At 5:41 PM -0500 11/18/99, Tom Metro wrote:
> >  > printf("PDF::parse: cannot find pdf parser %s\n", arg0.get());
> >BTW, is that going to STDOUT as it appears, rather than STDERR? Is
> >that normal practice for htdig's error messages?
> 
> The error messages are somewhat inconsistent in this regard. It 
> should probably go to STDERR.

Most of htdig's messages go to stdout, but it's a bit of a mixed bag right
now.  Given that htdig doesn't normally have any output other than error
and debugging messages, it would probably make sense for all messages to
go to stdout, instead of stderr, to make piping and redirection easier.
Stderr is best used when the program would otherwise have a normal output
stream on stdout, in which you don't want to bury error messages if the
output stream is piped or redirected.
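
To make the choice concrete, here is a minimal sketch of routing a message
to one stream or the other.  The helper names missingParserMessage() and
report() are made up for illustration and are not htdig code; only the
message text comes from the printf quoted above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not in htdig): build the parser-not-found message.
std::string missingParserMessage(const std::string &parser)
{
    return "PDF::parse: cannot find pdf parser " + parser + "\n";
}

// Hypothetical helper (not in htdig): write a diagnostic to the given stream.
// Pass stdout to keep all messages in one stream, or stderr to keep
// diagnostics out of piped or redirected normal output.
// Returns a non-negative value on success, as std::fputs does.
int report(std::FILE *stream, const std::string &message)
{
    return std::fputs(message.c_str(), stream);
}
```

With everything on stdout, a plain "htdig -v > log" captures all messages;
with errors on stderr you would need "2>&1" as well.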

> >BTW, if you have a bad_extensions directive, why add .cgi to
> >exclude_urls?
> >
> >     exclude_urls:   /cgi-bin/ .cgi
> 
> Because most people don't think of .cgi as an extension. Or that 
> would be my guess.

There's another reason.  You can have additional path info on a cgi program,
which gets passed to it via the PATH_INFO environment variable. E.g.:

        http://www.xyz.com/data/view.cgi/unitb/foo.dat

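I believe the bad_extensions are only checked at the very end of the URL,
so they wouldn't catch the .cgi in a URL like this one, while the substring
match in exclude_urls would.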
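
A rough sketch of the difference between the two kinds of check, assuming
bad_extensions is a suffix test and exclude_urls is a substring test as
described in this thread.  The function names endsWithAny() and
containsAny() are hypothetical and are not htdig's actual code.

```cpp
#include <string>
#include <vector>

// Suffix test: true if the URL ends with any of the given extensions.
// (Assumed behaviour of bad_extensions, per the discussion above.)
bool endsWithAny(const std::string &url, const std::vector<std::string> &exts)
{
    for (const auto &ext : exts) {
        if (url.size() >= ext.size() &&
            url.compare(url.size() - ext.size(), ext.size(), ext) == 0)
            return true;
    }
    return false;
}

// Substring test: true if any of the given strings occurs anywhere in the URL.
// (Assumed behaviour of exclude_urls, per the discussion above.)
bool containsAny(const std::string &url, const std::vector<std::string> &subs)
{
    for (const auto &s : subs) {
        if (url.find(s) != std::string::npos)
            return true;
    }
    return false;
}
```

For http://www.xyz.com/data/view.cgi/unitb/foo.dat, the suffix test against
".cgi" fails because the URL ends in ".dat", while the substring test
against "/cgi-bin/" and ".cgi" succeeds.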

> >Also, the documentation for exclude_urls makes mention of "patterns",
> >yet if I understand correctly (I haven't checked the code) it simply
> >performs a (case sensitive?) sub-string match. To me, pattern implies
> >the inclusion of wildcards or other meta characters.
> 
> The documentation is not necessarily perfect. As many people will 
> point out, developers are often not the best at writing documentation.

Yeah, right now the term "pattern" is used quite liberally throughout the
docs and code, for the StringMatch class's substring matching.  It's called
pattern because it can be string1|string2|string3...  Most of this will be
replaced with regular expression handling in 3.2, so the term pattern will
be even more applicable then.
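
To illustrate what "pattern" means in that limited sense, here's a small
sketch of matching a string1|string2|string3 list as plain substrings.
This is not the StringMatch implementation, just an illustration of the
behaviour described above; splitAlternatives() and matchesAny() are
made-up names.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split "string1|string2|string3" into its literal alternatives,
// skipping empty pieces.
std::vector<std::string> splitAlternatives(const std::string &pattern)
{
    std::vector<std::string> out;
    std::istringstream in(pattern);
    std::string part;
    while (std::getline(in, part, '|')) {
        if (!part.empty())
            out.push_back(part);
    }
    return out;
}

// True if any alternative occurs in the text as a literal, case-sensitive
// substring.  Characters such as '*' or '?' get no special treatment.
bool matchesAny(const std::string &text, const std::string &pattern)
{
    for (const auto &alt : splitAlternatives(pattern)) {
        if (text.find(alt) != std::string::npos)
            return true;
    }
    return false;
}
```

Note that under this scheme "*.cgi" would not match "/a.cgi", since the
asterisk is just another character to look for.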

> >The question that comes to mind is why is pdf_parser treated specially
> >and not implemented via the generalized external parser interface?
> 
> Gilles can probably answer this more effectively than I, but at the 
> time of PDF.cc being contributed, acroread was essentially the only 
> reliable technique around for translating PDF to text. At this point, 
> xpdf is probably a better program (for a variety of reasons, some of 
> them license-related).
> 
> Of course having a builtin parser is almost always faster than an 
> external parser.

Don't know for sure, as I only came on the scene about 14 months ago,
but I suspect that Sylvain's PDF.cc code may predate the external parser
support, or perhaps he had some reservations about using external parsers
back then (they were buggy until about April, if I recall, and still not
super efficient).

I've been giving this whole internal parsers vs. external parsers issue
some thought lately.  I don't think we want to introduce radical changes
to the way things are done in 3.1.x, or even for the upcoming 3.2.0b1
release, but here's what I'd like to see later in 3.2:

- The whole semi-internal, semi-external pdf_parser support has been a
frequent source of confusion - I'd like to see it go.  It could now be
replaced with an external converter that spits out a text/plain version
of the PDF's contents, using either pdftotext, or acroread -toPostScript
piped through an Acrobat PostScript to text converter based on Sylvain's
code.  That way, those who prefer acroread to xpdf aren't left out in the
cold.

- It's a bit of a pain to maintain multiple internal parsers, and it leads
to a certain amount of duplication of code.  We got rid of the PostScript
parser, because it never did work, and now we can get rid of PDF.cc.  That
leaves Plaintext.cc and HTML.cc, which isn't bad.  If you think about it,
though, if you SGMLify plain text (at least the <, >, and &) you can pass
it through the HTML parser - that way, you'd only need a single internal
parser to maintain.  That would probably greatly simplify things internally.

All other parsing can be done externally, or better yet, externally convert
any document type you want to text/html or text/plain, and leave the actual
parsing and word separation to the one builtin parser, to be assured of
consistent treatment of words regardless of the source document type.
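
As a sketch of the "SGMLify plain text" step mentioned above, escaping just
the three characters <, >, and & might look like the following.  The
function name escapeForHtml() is hypothetical; nothing like it is assumed
to exist in htdig today.

```cpp
#include <string>

// Replace the three characters that are special to an HTML/SGML parser
// with their entity references, leaving everything else untouched.
std::string escapeForHtml(const std::string &text)
{
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
        case '<': out += "&lt;";  break;
        case '>': out += "&gt;";  break;
        case '&': out += "&amp;"; break;
        default:  out += c;       break;
        }
    }
    return out;
}
```

The escaped text could then be handed to the HTML parser, which would see
no tags and treat the whole thing as document body text.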

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You'll receive a message confirming the unsubscription.
