Re: indexing pdf documents

Bess Sadler Tue, 13 May 2008 06:19:25 -0700

C.B., are you saying you have metadata about your PDF files (i.e., title, author, etc.) separate from the PDF file itself, or are you saying you want to extract that information from the PDF file? The first of these is pretty easy, the second of these can be difficult or impossible, depending on how your PDF file was generated and how consistent your files are.
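
For the first case, where the labels already live outside the PDF, a minimal sketch of building the Solr XML update message might look like this. It assumes the field names from the question (document_id, document_name, document_text) are defined in your schema.xml and that the text has already been extracted by some other tool. build_add_xml is a hypothetical helper, not part of Solr, and the sample values are made up.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> update message from a list of {field_name: value} dicts.

    Hypothetical helper: the field names must match whatever your schema.xml defines.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > on serialization
    return ET.tostring(add, encoding="unicode")


# Assumed sample record; the text would come from your own PDF extraction step.
xml_message = build_add_xml([
    {
        "document_id": "42",
        "document_name": "annual-report.pdf",
        "document_text": "Text extracted from the PDF by a separate tool.",
    }
])
print(xml_message)
```

The resulting string can then be POSTed to Solr's update handler, followed by a commit, the same way as any other XML update.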

It's a bit of a hack, but I've had great success in the past with using XTF (http://www.cdlib.org/inside/projects/xtf/) to index my PDF files, and then pointing solr at the resulting lucene index. It's worth checking to see if this would do the trick for you.


Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

On May 13, 2008, at 3:58 AM, Cam Bazz wrote:

yes, I have seen the documentation on RichDocumentRequestHandler at the
http://wiki.apache.org/solr/UpdateRichDocuments page.
However, from what I understand this just feeds documents to solr. How can I
construct something like: document_id, document_name, document_text and feed
it in. (i.e. my documents have labels)

Best.
-C.B.
On Tue, May 13, 2008 at 1:30 AM, Chris Harris <[EMAIL PROTECTED]> wrote:
Solr does not have this support built in, but there's a patch for it:

https://issues.apache.org/jira/browse/SOLR-284

On Mon, May 12, 2008 at 2:02 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
Hello,
Before making a little program to extract the txt from my pdfs and feed it
into solr with xml, I just wanted to check if solr has capability to digest
pdf files apart from xml?

 Best Regards,
 -C.B.
