Update:
Thanks to everyone for their assistance. It would appear that this is
now working almost as expected - I ran it again over the weekend, and
nothing major went wrong. (There are still occasional 'timing' issues,
with the OS holding on to a file for too long, or something similar.)
To summarise the problem:
Documents were being served to htdig with a 200 'Success' code, but
contained only a plain-text error message rather than the 'text/html'
advertised in the Content-Type header. The documents are served by a
third-party application which cannot be modified.
The environment:
Htdig 3.1.6 (pre-compiled binary produced by third-party)
Windows 2000 server
Apache 2.0.xx
Perl 5.6.1
The solution:
added/modified entry within external_parsers:
text/html->text/html-internal C:/htdig/bin/cpol2html.bat
This line tells htdig to use cpol2html.bat as an external filter,
then to parse the returned content as text/html with the built-in
parser.
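For anyone reproducing this, the entry lives in htdig's configuration file; a minimal sketch, assuming a standard htdig.conf layout (the conf path is my assumption, not from the original setup):

```
# In the htdig configuration file, e.g. C:/htdig/conf/htdig.conf
# (path is an assumption). Format per entry:
#   <incoming type>-><type to parse result as> <path to converter>
external_parsers: text/html->text/html-internal C:/htdig/bin/cpol2html.bat
```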
cpol2html.bat in turn feeds the contents of htdig's temporary file into
cpol.pl, then cleans up the temp file. (This clean-up seems to be
unreliable for some reason.)
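The batch file might look something like the following - a hypothetical sketch, since the original wasn't posted; the perl path, and the assumption that htdig passes the temp file name as the first argument, are mine:

```bat
@echo off
rem Hypothetical sketch of cpol2html.bat - paths are assumptions.
rem %1 is assumed to be htdig's temporary content file.
C:\Perl\bin\perl.exe C:\htdig\bin\cpol.pl %1

rem Clean up htdig's temp file. This is the step that proved
rem unreliable, possibly because the OS still holds a handle open.
if exist %1 del %1
```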
cpol.pl reads the file line by line, looking for a handful of HTML
constructs, i.e. '<html', '<head', '<body', '<meta', '<title'.
Once/if three of these are found, the search terminates and the entire
document is dumped to standard out, without modification.
In this case the logic could be reversed, as the content of the error
message is almost completely static, but this method is more flexible,
and even allows slightly malformed documents to pass through without
causing a problem.
If the document's content proves inappropriate, then an empty string is
written to standard out. This is still seen by htdig as a valid
document, and appears as such in the statistics, but is deleted by
htmerge for having no content.
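The filter logic described above could be sketched roughly as below. This is a hypothetical reconstruction, not the original cpol.pl - the variable names and usage line are my assumptions:

```perl
#!/usr/bin/perl -w
# Hypothetical sketch of the cpol.pl filtering logic.
use strict;

my $file = shift @ARGV or die "usage: cpol.pl <tempfile>\n";
open my $fh, '<', $file or exit 0;   # treat an unreadable file as empty

# HTML constructs that suggest the document is a real page.
my @constructs = ('<html', '<head', '<body', '<meta', '<title');
my $found = 0;

while (my $line = <$fh>) {
    foreach my $c (@constructs) {
        $found++ if index(lc $line, $c) >= 0;
    }
    last if $found >= 3;
}

if ($found >= 3) {
    # Looks like genuine HTML: rewind and dump the whole file unmodified.
    seek $fh, 0, 0;
    print while <$fh>;
}
# Otherwise print nothing: htdig still sees a valid (empty) document,
# and htmerge later deletes it for having no content.

close $fh;
exit 0;   # the missing exit was the suspected cause of the 'Fork Error'
```

Counting matches rather than requiring an exact sequence is what lets slightly malformed documents through.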
cpol.pl was originally a copy of doc2html.pl, which is used elsewhere on
this site.
The logic described above could easily be changed to exclude 404 or 500
pages on sites that insist on presenting helpful search forms, or to
handle any other situation where unwanted content appears. The
processing overhead seems fairly low, though I was not able to test
this empirically, as the site in question needs to be throttled to
avoid overloading. At the last test, 117,000 URLs were tested, of which
40,000 to 50,000 were valid pages. I believe that my earlier issues
with 'Fork Error' may have been due to a missing 'exit' statement in
cpol.pl - I am not a Perl guru by any means, so this is not a pretty
piece of code.
Mike
_______________________________________________
ht://Dig general mailing list: <[email protected]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general