[htdig] htdig 3.2.0b6 and PDFs

Robert Isaac Tue, 09 Aug 2005 11:53:53 -0700

For some unknown reason everything is working correctly now, and I didn’t change anything. The only explanation I can think of is hardware: when the new server was unplugged from the desk where it was being set up, swapped with the outgoing server and plugged into the Internet, it would not boot because the CPU had died. A new (and more powerful) one was fitted. Thanks for your help.

Bob

_____________________________________________________________________________________________

Robert Isaac

Director and Web Administrator

Volvo Owners Club

www.volvoclub.org.uk

++++++++++++++++++++++++++++++++++++++++++++++++++

Please provide more details on what happened. What do you mean it deleted the PDFs?

Could you provide the few lines of stdout printed while htdig is parsing a PDF?

When you run doc2html.pl or acroconv.pl from the command line on a given PDF, do you see any text returned? Not all PDFs have text in them; some are actually images of text. These types of PDFs need OCR software to extract indexable text.
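As a rough way to run that check over many files, here is a minimal Python sketch. It assumes the `pdftotext` tool from the xpdf package is on the PATH; the function names and the 20-character threshold are hypothetical choices for illustration, not anything htdig or doc2html.pl uses.

```python
import shutil
import subprocess


def has_indexable_text(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the text has at least min_chars non-whitespace characters.

    The threshold is an arbitrary assumption; scanned PDFs usually yield
    nothing but whitespace and form feeds.
    """
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def extract_text(pdf_path: str) -> str:
    """Run xpdf's pdftotext on pdf_path and return what it writes to stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH (is xpdf installed?)")
    # "-" as the output file tells pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        errors="replace",
        check=False,
    )
    return result.stdout


def report(pdf_paths):
    """Print one line per PDF saying whether any text could be extracted."""
    for path in pdf_paths:
        found = has_indexable_text(extract_text(path))
        status = "text found" if found else "no text (possibly scanned images)"
        print(f"{path}: {status}")
```

Calling `report(["some.pdf"])` on a handful of the PDFs that went missing from the index should show whether they contain any extractable text at all.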

You can get more speed by disabling index compression:

wordlist_compress_zlib: false

wordlist_compress: false

Thanks.

On 7/31/05, Robert Isaac <[EMAIL PROTECTED]> wrote:

> I am setting up a new ProLiant DL360 G4 server with Red Hat ES Linux 4 and Apache 2.0.x.
>
> I had copied over htdig 3.1.6 from the old server, but decided to install 3.2.0b6 with the view of using it when the server goes live in a few days.
>
> What a nightmare.
>
> The htdig web site (http://www.htdig.org/dev/htdig-3.2/) is ambiguous about 3.2.0b6 and PDF indexing. In the FAQ 1.13 it refers to FAQ 4.9. I have the xpdf package installed, used it with 3.1.6.
>
> When I indexed our web site - 3200 pages half of them PDF's - it took over 13 hours - yes thirteen hours!! And then it deleted every one of the PDF's. That was using:
>
> external_parsers: application/pdf->text/html /var/www/cgi-bin/doc2html.pl
>
> in htdig.conf.
>
> I also tried acroconv.pl but it didn't work at all.
>
> I would appreciate some help with this.
>
> Thanks
>
> Bob
>
> [EMAIL PROTECTED]
