C.B., are you saying you have metadata about your PDF files (title, author, etc.) stored separately from the PDF files themselves, or are you saying you want to extract that information from the PDF files? The first of these is pretty easy; the second can be difficult or impossible, depending on how your PDF files were generated and how consistent they are.

It's a bit of a hack, but I've had great success in the past using XTF (http://www.cdlib.org/inside/projects/xtf/) to index my PDF files and then pointing Solr at the resulting Lucene index. It's worth checking whether this would do the trick for you.
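
For the first case (metadata kept separately from the PDFs, with the text already extracted), here is a minimal sketch of building a Solr XML update message in Python. The field names document_id, document_name and document_text are taken from the question below and are assumed to be defined in your schema.xml; build_add_xml is a hypothetical helper name, not part of Solr.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> update message from a list of field dicts.

    Each dict maps a field name to a string value. The field names are
    assumptions and must match fields declared in your schema.xml.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical sample record; the text would come from your own
    # PDF-to-text extraction step.
    message = build_add_xml([
        {
            "document_id": "42",
            "document_name": "annual-report.pdf",
            "document_text": "Text extracted from the PDF goes here.",
        }
    ])
    print(message)
```

The resulting string would then be POSTed to Solr's update handler with a Content-Type of text/xml, followed by a <commit/> message. The exact URL depends on your installation, so it is left out here.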

Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

On May 13, 2008, at 3:58 AM, Cam Bazz wrote:
Yes, I have seen the documentation on RichDocumentRequestHandler at the http://wiki.apache.org/solr/UpdateRichDocuments page.
However, from what I understand, this just feeds documents to Solr. How can I construct something like document_id, document_name, document_text and feed it in? (That is, my documents have labels.)

Best.
-C.B.

On Tue, May 13, 2008 at 1:30 AM, Chris Harris <[EMAIL PROTECTED]> wrote:

Solr does not have this support built in, but there's a patch for it:

https://issues.apache.org/jira/browse/SOLR-284

On Mon, May 12, 2008 at 2:02 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
Hello,

Before writing a little program to extract the text from my PDFs and feed it into Solr as XML, I just wanted to check whether Solr is able to digest PDF files in addition to XML.

 Best Regards,
 -C.B.




