Re: indexing pdf documents

Cam Bazz Wed, 14 May 2008 02:39:25 -0700

Hello Elizabeth;

Yes, I have PDF files, and metadata about them already extracted.


So I need to post something like:

<add><doc>
<field name="author">someone</field>
<field name=...
<field name="content">content of my pdf file</field>
</doc></add>

It seems that the UpdateRichDocuments patch can only accept PDFs in raw form,
so it is not possible to feed metadata along with them.

Have you found a solution other than manually converting each PDF to text and
then building the XML?
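
For reference, the manual route I have in mind would look roughly like the
sketch below. It assumes the text has already been extracted from the PDF by
some other tool, and the field names ("id", "author", "content") are only
placeholders that would have to match whatever is declared in schema.xml.
build_add_xml is a hypothetical helper name, not part of Solr.

```python
import xml.etree.ElementTree as ET


def build_add_xml(metadata, content):
    """Build a Solr <add><doc> update message from metadata plus extracted text.

    metadata: dict mapping field name to value (names are assumptions and
              must match the fields declared in the Solr schema).
    content:  plain text already extracted from the PDF.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in metadata.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    body = ET.SubElement(doc, "field", name="content")
    body.text = content
    # ElementTree escapes &, < and > in the text, so raw PDF text is safe here.
    return ET.tostring(add, encoding="unicode")


xml_payload = build_add_xml(
    {"id": "doc-1", "author": "someone"},
    "content of my pdf file, with & and < characters",
)
print(xml_payload)
```

The resulting string would then be posted to the Solr update handler,
followed by a commit.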

Best Regards,
-C.B.

On Tue, May 13, 2008 at 4:15 PM, Bess Sadler <[EMAIL PROTECTED]> wrote:

> C.B., are you saying you have metadata about your PDF files (i.e., title,
> author, etc) separate from the PDF file itself, or are you saying you want
> to extract that information from the PDF file? The first of these is pretty
> easy, the second of these can be difficult or impossible, depending on how
> your PDF file was generated and how consistent your files are.
>
> It's a bit of a hack, but I've had great success in the past with using
> XTF (http://www.cdlib.org/inside/projects/xtf/) to index my PDF files, and
> then pointing solr at the resulting lucene index.  It's worth checking to
> see if this would do the trick for you.
>
> Bess
>
> Elizabeth (Bess) Sadler
> Research and Development Librarian
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904
>
>
> On May 13, 2008, at 3:58 AM, Cam Bazz wrote:
>
> > yes, I have seen the documentation on RichDocumentRequestHandler at the
> > http://wiki.apache.org/solr/UpdateRichDocuments page.
> > However, from what I understand this just feeds documents to solr. How
> > can I construct something like: document_id, document_name, document_text
> > and feed it in. (i.e. my documents have labels)
> >
> > Best.
> > -C.B.
> >
> > On Tue, May 13, 2008 at 1:30 AM, Chris Harris <[EMAIL PROTECTED]> wrote:
> >
> > > Solr does not have this support built in, but there's a patch for it:
> > >
> > > https://issues.apache.org/jira/browse/SOLR-284
> > >
> > > On Mon, May 12, 2008 at 2:02 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hello,
> > > >
> > > >  Before making a little program to extract the txt from my pdfs and
> > > >  feed it into solr with xml, I just wanted to check if solr has
> > > >  capability to digest pdf files apart from xml?
> > > >
> > > >  Best Regards,
> > > >  -C.B.
> > > >
> > > >
> > >
>
>
>
