Update:
Thanks to everyone for their assistance. It would appear that this is
now working almost as expected - I ran it again over the weekend, and
nothing major went wrong. (There are still occasional 'timing' issues,
with the OS holding on to a file for too long, or something similar.)
To summarise the problem:
Documents were being served to htdig with a 200 'Success' code, but
contained only a plain-text error message rather than the 'text/html'
advertised in the Content-Type header. The documents are served by a
third-party application which cannot be modified.
The environment:
Htdig 3.1.6 (pre-compiled binary produced by third-party)
Windows 2000 server
Apache 2.0.xx
Perl 5.6.1
The solution:
added/modified entry within external_parsers:
text/html->text/html-internal C:/htdig/bin/cpol2html.bat
This line tells htdig to use cpol2html.bat as an external filter,
then to parse the returned content as text/html with the built-in
parser.
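For anyone reproducing this, the entry lives in htdig's configuration file; a minimal sketch, assuming a standard htdig.conf layout (the conf path is my assumption, not from the original setup):

```
# In the htdig configuration file, e.g. C:/htdig/conf/htdig.conf
# (path is an assumption). Format per entry:
#   <incoming type>-><type to parse result as> <path to converter>
external_parsers: text/html->text/html-internal C:/htdig/bin/cpol2html.bat
```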
cpol2html.bat in turn feeds the contents of htdig's temporary file into
cpol.pl, then cleans up the temp file. (This clean-up seems to be
unreliable for some reason.)
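The batch file might look something like the following - a hypothetical sketch, since the original wasn't posted; the perl path, and the assumption that htdig passes the temp file name as the first argument, are mine:

```bat
@echo off
rem Hypothetical sketch of cpol2html.bat - paths are assumptions.
rem %1 is assumed to be htdig's temporary content file.
C:\Perl\bin\perl.exe C:\htdig\bin\cpol.pl %1

rem Clean up htdig's temp file. This is the step that proved
rem unreliable, possibly because the OS still holds a handle open.
if exist %1 del %1
```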
cpol.pl reads the file line by line, looking for a handful of HTML
constructs, i.e. '<html', '<head', '<body', '<meta', '<title'.
Once/if three of these are found, the search terminates and the entire
document is dumped to standard out, without modification.
In this case the logic could be reversed, as the content of the error
message is almost completely static, but this method is more flexible,
and even allows slightly malformed documents to pass through without
causing a problem.
If the document's content proves inappropriate, then an empty string is
written to standard out. This is still seen by htdig as a valid
document, and appears as such in the statistics, but is deleted by
htmerge for having no content.
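The filter logic described above could be sketched roughly as below. This is a hypothetical reconstruction, not the original cpol.pl - the variable names and usage line are my assumptions:

```perl
#!/usr/bin/perl -w
# Hypothetical sketch of the cpol.pl filtering logic.
use strict;

my $file = shift @ARGV or die "usage: cpol.pl <tempfile>\n";
open my $fh, '<', $file or exit 0;   # treat an unreadable file as empty

# HTML constructs that suggest the document is a real page.
my @constructs = ('<html', '<head', '<body', '<meta', '<title');
my $found = 0;

while (my $line = <$fh>) {
    foreach my $c (@constructs) {
        $found++ if index(lc $line, $c) >= 0;
    }
    last if $found >= 3;
}

if ($found >= 3) {
    # Looks like genuine HTML: rewind and dump the whole file unmodified.
    seek $fh, 0, 0;
    print while <$fh>;
}
# Otherwise print nothing: htdig still sees a valid (empty) document,
# and htmerge later deletes it for having no content.

close $fh;
exit 0;   # the missing exit was the suspected cause of the 'Fork Error'
```

Counting matches rather than requiring an exact sequence is what lets slightly malformed documents through.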
cpol.pl was originally a copy of doc2html.pl, which is used elsewhere on
this site.
The logic described above could easily be changed to exclude 404 or 500
pages on sites that insist on presenting helpful search forms, or to
handle any other situation where unwanted content appears. The
processing overhead seems fairly low, though I was not able to test
this empirically, as the site in question needs to be throttled to
avoid overloading. At the last test, 117,000 URLs were tested, of which
40,000 to 50,000 were valid pages. I believe that my earlier issues
with 'Fork Error' may have been due to a missing 'exit' statement in
cpol.pl - I am not a Perl guru by any means, so this is not a pretty
piece of code.
Mike
_______________________________________________
ht://Dig general mailing list: <[email protected]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general