Hello Cam,

The wiki page for RichDocuments explains how you can add metadata to the RDUpdater:
http://wiki.apache.org/solr/UpdateRichDocuments
I have used the patch to index docs and their metadata, but it was not exactly what we needed.

Brian.

On Wednesday, 14.05.2008, at 12:38 +0300, Cam Bazz wrote:
> Hello Elizabeth;
>
> Yes, I have PDF files, and metadata about them already extracted.
>
> So I need something like:
>
> <doc><add>
> <field id="author">someone</field>
> <field id=...
> <field id="content">content of my pdf file</field>
> </add></doc>
>
> It seems that the updaterichdocument patch can only accept PDFs in raw form,
> so it is not possible to feed metadata.
>
> Have you found a solution other than manually converting PDF into txt and then
> forming XMLs?
>
> Best Regards,
> -C.B.
>
> On Tue, May 13, 2008 at 4:15 PM, Bess Sadler <[EMAIL PROTECTED]> wrote:
>
> > C.B., are you saying you have metadata about your PDF files (i.e., title,
> > author, etc) separate from the PDF file itself, or are you saying you want
> > to extract that information from the PDF file? The first of these is pretty
> > easy, the second of these can be difficult or impossible, depending on how
> > your PDF file was generated and how consistent your files are.
> >
> > It's a bit of a hack, but I've had great success in the past with using
> > XTF (http://www.cdlib.org/inside/projects/xtf/) to index my PDF files, and
> > then pointing solr at the resulting lucene index. It's worth checking to
> > see if this would do the trick for you.
> >
> > Bess
> >
> > Elizabeth (Bess) Sadler
> > Research and Development Librarian
> > Digital Scholarship Services
> > Box 400129
> > Alderman Library
> > University of Virginia
> > Charlottesville, VA 22904
> >
> > On May 13, 2008, at 3:58 AM, Cam Bazz wrote:
> >
> > > Yes, I have seen the documentation on RichDocumentRequestHandler at the
> > > http://wiki.apache.org/solr/UpdateRichDocuments page.
> > > However, from what I understand this just feeds documents to solr. How
> > > can I construct something like: document_id, document_name, document_text
> > > and feed it in? (i.e. my documents have labels)
> > >
> > > Best.
> > > -C.B.
> > >
> > > On Tue, May 13, 2008 at 1:30 AM, Chris Harris <[EMAIL PROTECTED]> wrote:
> > >
> > > > Solr does not have this support built in, but there's a patch for it:
> > > >
> > > > https://issues.apache.org/jira/browse/SOLR-284
> > > >
> > > > On Mon, May 12, 2008 at 2:02 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Before making a little program to extract the txt from my pdfs and
> > > > > feed it into solr with xml, I just wanted to check if solr has
> > > > > capability to digest pdf files apart from xml?
> > > > >
> > > > > Best Regards,
> > > > > -C.B.
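
For reference, the case Cam describes above (metadata already extracted, PDF text converted separately) comes down to forming one XML update message per document. Below is a minimal sketch in Python, not a tested recipe for any particular Solr version. The field names id, author and content are assumptions for illustration and would have to match fields defined in your own schema.xml, and build_add_message is a hypothetical helper, not part of Solr or the SOLR-284 patch. Note that Solr's XML update format nests <doc> inside <add> and identifies each field with a name attribute.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Build a Solr XML <add> message for one document.

    `fields` maps field names to string values. The names used below
    are illustrative assumptions and must exist in your schema.xml.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Metadata you already have, plus text extracted from the PDF elsewhere.
message = build_add_message({
    "id": "doc-1",
    "author": "someone",
    "content": "content of my pdf file",
})
print(message)
```

The resulting string is what you would send to Solr's XML update handler. Building it with ElementTree rather than string concatenation means characters such as & and < in the extracted text are escaped for you.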