Hello Elizabeth,

Yes, I have the PDF files, and the metadata about them has already been extracted.
So I need something like:

    <add><doc>
      <field name="author">someone</field>
      <field name=...
      <field name="content">content of my pdf file</field>
    </doc></add>

It seems that the updaterichdocument patch can only accept PDFs in raw form, so it is not possible to feed it metadata. Have you found a solution other than manually converting the PDF to txt and then forming the XML?

Best Regards,
-C.B.

On Tue, May 13, 2008 at 4:15 PM, Bess Sadler <[EMAIL PROTECTED]> wrote:

> C.B., are you saying you have metadata about your PDF files (i.e., title,
> author, etc) separate from the PDF file itself, or are you saying you want
> to extract that information from the PDF file? The first of these is pretty
> easy, the second of these can be difficult or impossible, depending on how
> your PDF file was generated and how consistent your files are.
>
> It's a bit of a hack, but I've had great success in the past with using
> XTF (http://www.cdlib.org/inside/projects/xtf/) to index my PDF files, and
> then pointing solr at the resulting lucene index. It's worth checking to
> see if this would do the trick for you.
>
> Bess
>
> Elizabeth (Bess) Sadler
> Research and Development Librarian
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904
>
> On May 13, 2008, at 3:58 AM, Cam Bazz wrote:
>
> > yes, I have seen the documentation on RichDocumentRequestHandler at the
> > http://wiki.apache.org/solr/UpdateRichDocuments page.
> > However, from what I understand this just feeds documents to solr. How can I
> > construct something like: document_id, document_name, document_text and feed
> > it in. (i.e. my documents have labels)
> >
> > Best.
> > -C.B.
> >
> > On Tue, May 13, 2008 at 1:30 AM, Chris Harris <[EMAIL PROTECTED]> wrote:
> >
> > > Solr does not have this support built in, but there's a patch for it:
> > >
> > > https://issues.apache.org/jira/browse/SOLR-284
> > >
> > > On Mon, May 12, 2008 at 2:02 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hello,
> > > >
> > > > Before making a little program to extract the txt from my pdfs and feed it
> > > > into solr with xml, I just wanted to check if solr has capability to digest
> > > > pdf files apart from xml?
> > > >
> > > > Best Regards,
> > > > -C.B.
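
For reference, the manual route mentioned at the top of this message (extract the text, then form the XML) could be sketched roughly as below. This is only an illustration under stated assumptions: the "author" and "content" field names come from the example above, the "id" field and its value are hypothetical, and the text is assumed to have been extracted from the PDF already by some other tool. The field names would have to match whatever is declared in the Solr schema.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Return a Solr <add><doc>...</doc></add> update message as a string.

    `fields` maps field names to string values. The names are assumed
    to match fields declared in the Solr schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        # ElementTree escapes &, < and > when the tree is serialized,
        # so raw extracted text is safe to assign here.
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical record: separately stored metadata plus text that was
# already extracted from the PDF by another tool.
record = {
    "id": "doc-001",  # assumed unique key field, not from the thread
    "author": "someone",
    "content": "content of my pdf file, with <, > & escaped",
}

xml_message = build_add_xml(record)
print(xml_message)
```

The resulting string would then be POSTed to the Solr update handler as text/xml, followed by a commit message; the exact URL depends on the installation.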