solr cell/tika: pdf import with xml metatags

Markus.Rietzler Tue, 27 Oct 2009 03:36:43 -0700

hi,

we want to use Solr as our intranet search engine.
i downloaded the nightly build of Solr 1.4. pdf extraction is done via Solr
Cell/Tika, and i can send a pdf to Solr via curl.
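
for reference, the curl upload amounts to a POST of the raw pdf bytes to the extracting request handler. a minimal sketch in Python, assuming the handler is mapped to /update/extract (as in the example solrconfig.xml) and that Solr runs at http://localhost:8983/solr; the base url, the document id and both helper names are made up for illustration:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_url(solr_base, doc_id, commit=True):
    """Build the URL for posting one binary document to Solr Cell.

    Assumption: the ExtractingRequestHandler is mapped to /update/extract.
    """
    # literal.<fieldname> attaches a fixed field value to the extracted document.
    params = [("literal.id", doc_id)]
    if commit:
        params.append(("commit", "true"))
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def build_extract_request(solr_base, doc_id, pdf_bytes):
    """Prepare (but do not send) the POST request carrying the PDF body."""
    return Request(
        build_extract_url(solr_base, doc_id),
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
    )


if __name__ == "__main__":
    # Hypothetical base URL and document id, for illustration only.
    print(build_extract_url("http://localhost:8983/solr", "doc1"))
```

passing the prepared request to urllib.request.urlopen would actually send it; that step is left out here because it needs a running Solr instance.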


we have a large set of meta tags for all our intranet documents, including
PDF, PPT etc. when importing html files from our CMS, i have access to all
of these meta tags and create an xml document which i send to Solr, e.g.:

<?xml version='1.0' encoding='UTF-8'?>
<add>
<doc>
<field name="id">1</field>
<field name="title">this is the title</field>
</doc>
<doc>
<field name="id">2</field>
<field name="title">this is another title</field>
</doc>
<doc>
<field name="id">3</field>
<field name="title">this is the third title</field>
</doc>
</add>

this works fine with html files where i can grab all the meta tags, including 
"body".
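
an update message like the one above can be generated from the CMS meta tags roughly as follows. this is only a sketch: the function name build_add_xml and the dict-per-document input shape are assumptions, not part of our actual import code.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Turn a list of {field name: value} dicts into a Solr <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            # ElementTree escapes &, < and > when the tree is serialized.
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_xml([
        {"id": 1, "title": "this is the title"},
        {"id": 2, "title": "this is another title"},
    ]))
```

building the message with an xml library instead of string concatenation keeps titles containing "&" or "<" from producing invalid xml.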

so my question is: can i also use this xml document to send a pdf file? ok,
one way would be to call the extract handler in extract-only mode and put
the extracted text into the "body" field myself.
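
the extract-only route could be sketched in Python like this, assuming the handler lives at /update/extract and accepts the extractOnly and extractFormat parameters. the helper names and the sample values are hypothetical, and fetching and parsing the handler response is deliberately left out because its exact shape depends on the Solr build and response writer:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_extract_only_url(solr_base):
    """URL that asks Solr Cell to return the extracted text without indexing it."""
    params = {"extractOnly": "true", "extractFormat": "text"}
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def build_doc_with_body(metadata, extracted_text):
    """Merge CMS meta tags and extracted PDF text into one <add> message."""
    fields = dict(metadata)          # copy, so the caller's dict is untouched
    fields["body"] = extracted_text  # same "body" field the html import fills
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_extract_only_url("http://localhost:8983/solr"))
    print(build_doc_with_body({"id": 4, "title": "a pdf"}, "text from the pdf"))
```

the drawback is two round trips per pdf: one to extract the text, one to post the merged document.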

is there any other way? 





--
Kind regards

Markus Rietzler - <rietzler_software/>
Rechenzentrum der Finanzverwaltung NRW
0211/4572-2130
