Re: indexing pdf documents

Brian Carmalt Wed, 14 May 2008 04:13:54 -0700

Hello Cam,

The wiki page for RichDocuments explains how you can pass metadata to the
RDUpdater:
http://wiki.apache.org/solr/UpdateRichDocuments


I have used the patch to index docs and their metadata, but it was not
exactly what we needed.
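
For the case Cam describes below, where the metadata and the extracted
text are both already in hand, a minimal sketch of building the XML update
message by hand could look like this. It only uses Python's standard
xml.etree.ElementTree module. The function name build_add_xml and the field
names id, author and content are illustrative assumptions, not part of Solr
or of the patch; the field names would have to match whatever is defined in
your schema.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Return a Solr <add> message holding one <doc> built from `fields`.

    `fields` maps field names (assumed to exist in your schema) to strings.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        # ElementTree escapes &, < and > when the tree is serialized.
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical metadata plus text that was already extracted from the PDF.
xml_message = build_add_xml({
    "id": "doc-1",
    "author": "someone",
    "content": "content of my pdf file",
})
print(xml_message)
```

The resulting string is what would be posted to Solr's XML update handler.
Building it with an XML library rather than string concatenation keeps
characters such as & and < in the extracted text from breaking the message.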

Brian. 

On Wednesday, 14.05.2008, at 12:38 +0300, Cam Bazz wrote:
> Hello Elizabeth;
> 
> Yes, I have PDF files, and metadata about them already extracted.
> 
> so I need something like:
> 
> <add><doc>
> <field name="author">someone</field>
> <field name=...
> <field name="content">content of my pdf file</field>
> </doc></add>
> 
> it seems that the updaterichdocument patch can only accept pdfs in raw form
> - so it is not possible to feed metadata.
> 
> Have you found a solution other than manually converting the PDFs to text
> and then building the XML?
> 
> Best Regards,
> -C.B.
> 
> On Tue, May 13, 2008 at 4:15 PM, Bess Sadler <[EMAIL PROTECTED]> wrote:
> 
> > C.B., are you saying you have metadata about your PDF files (i.e., title,
> > author, etc) separate from the PDF file itself, or are you saying you want
> > to extract that information from the PDF file? The first of these is pretty
> > easy, the second of these can be difficult or impossible, depending on how
> > your PDF file was generated and how consistent your files are.
> >
> > It's a bit of a hack, but I've had great success in the past with using
> > XTF (http://www.cdlib.org/inside/projects/xtf/) to index my PDF files, and
> > then pointing solr at the resulting lucene index.  It's worth checking to
> > see if this would do the trick for you.
> >
> > Bess
> >
> > Elizabeth (Bess) Sadler
> > Research and Development Librarian
> > Digital Scholarship Services
> > Box 400129
> > Alderman Library
> > University of Virginia
> > Charlottesville, VA 22904
> >
> >
> > On May 13, 2008, at 3:58 AM, Cam Bazz wrote:
> >
> > > > Yes, I have seen the documentation on RichDocumentRequestHandler at the
> > > > http://wiki.apache.org/solr/UpdateRichDocuments page.
> > > > However, from what I understand, this just feeds documents to Solr. How
> > > > can I construct something like document_id, document_name, document_text
> > > > and feed it in? (i.e., my documents have labels)
> > >
> > > Best.
> > > -C.B.
> > >
> > > On Tue, May 13, 2008 at 1:30 AM, Chris Harris <[EMAIL PROTECTED]> wrote:
> > >
> > >  Solr does not have this support built in, but there's a patch for it:
> > > >
> > > > https://issues.apache.org/jira/browse/SOLR-284
> > > >
> > > > On Mon, May 12, 2008 at 2:02 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > >  Before making a little program to extract the txt from my pdfs and
> > > > > feed
> > > > >
> > > > it
> > > >
> > > > >  into solr with xml, I just wanted to check if solr has capability
> > > > > to
> > > > >
> > > > digest
> > > >
> > > > >  pdf files apart from xml?
> > > > >
> > > > >  Best Regards,
> > > > >  -C.B.
> > > > >
> > > > >
> > > >
> >
> >
> >
