C.B., are you saying you have metadata about your PDF files (e.g.,
title, author, etc.) separate from the PDF files themselves, or are you
saying you want to extract that information from the PDF files? The
first of these is pretty easy; the second can be difficult
or impossible, depending on how your PDF files were generated and how
consistent they are.
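
For the first case (metadata kept separately from the PDF), here is a minimal sketch in Python of building a Solr XML update message by hand. The field names document_id, document_name and document_text are taken from the question quoted below and are assumed to be defined in your schema.xml; the update URL and the sample values are placeholders, not anything from a real installation.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field_name: value} dicts."""
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in d.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


def post_to_solr(xml_payload, url="http://localhost:8983/solr/update"):
    """POST an XML message to Solr. The default URL is an assumption."""
    req = urllib.request.Request(
        url,
        data=xml_payload.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


# Hypothetical sample document; the text would come from your own PDF extractor.
docs = [
    {
        "document_id": "42",
        "document_name": "report.pdf",
        "document_text": "Text extracted from the PDF & escaped <here>.",
    }
]
payload = build_add_xml(docs)
print(payload)
```

To actually index, you would call post_to_solr(payload) and then post_to_solr("<commit/>") against a running Solr instance. Building the XML with a library rather than string concatenation matters here because extracted PDF text often contains characters such as & and < that must be escaped.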
It's a bit of a hack, but I've had great success in the past with
using XTF (http://www.cdlib.org/inside/projects/xtf/) to index my PDF
files, and then pointing Solr at the resulting Lucene index. It's
worth checking to see if this would do the trick for you.
Bess
Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904
On May 13, 2008, at 3:58 AM, Cam Bazz wrote:
Yes, I have seen the documentation on RichDocumentRequestHandler at the
http://wiki.apache.org/solr/UpdateRichDocuments page.
However, from what I understand this just feeds documents to Solr. How can I
construct something like: document_id, document_name, document_text and feed
it in? (i.e. my documents have labels)
Best.
-C.B.
On Tue, May 13, 2008 at 1:30 AM, Chris Harris <[EMAIL PROTECTED]> wrote:
Solr does not have this support built in, but there's a patch for it:
https://issues.apache.org/jira/browse/SOLR-284
On Mon, May 12, 2008 at 2:02 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
Hello,
Before writing a little program to extract the text from my PDFs and feed it
into Solr as XML, I just wanted to check whether Solr is able to digest
PDF files in addition to XML.
Best Regards,
-C.B.