Hi, we want to use Solr as our intranet search engine. I downloaded the nightly build of Solr 1.4. PDF extraction is done via Solr Cell/Tika, and I can send a PDF to Solr via curl.
We have a large set of meta tags for all our intranet documents, including PDF, PPT, etc. To import HTML files from our CMS, I have access to all of these meta tags and create an XML document which I send to Solr, e.g.:

<?xml version='1.0' encoding='UTF-8'?>
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">this is the title</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="title">this is another title</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="title">this is the third title</field>
  </doc>
</add>

This works fine with HTML files, where I can grab all the meta tags, including "body". So my question is: can I also use this XML document to send a PDF file? OK, one way would be to use the extract handler with extract only and put the extracted data in the "body" field. Is there any other way?

-- 
Best regards
Markus Rietzler - <rietzler_software/>
Rechenzentrum der Finanzverwaltung NRW
0211/4572-2130
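
To make the extract-only workaround a bit more concrete, here is a rough sketch of its second half: merging the CMS meta tags with text that has already been extracted from the PDF into the same kind of <add> message as above. It is only an illustration under assumptions: the function name build_add_xml, the id value "4" and the sample title and body text are made up, and the call to the extract handler itself is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field_name: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            # ElementTree escapes &, < and > when the tree is serialized.
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical input: meta tags from the CMS plus the plain text that was
# previously extracted from the PDF (the extraction step is not shown here).
pdf_doc = {
    "id": "4",
    "title": "this is a pdf title",
    "body": "plain text extracted from the PDF & more",
}

xml_message = build_add_xml([pdf_doc])
print(xml_message)
```

The resulting string could then be posted to the update handler in the same way as the XML for the HTML files.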