ExtractingRequestHandler configuration

alessandro.ri...@virgilio.it Sun, 05 Dec 2010 07:42:35 -0800

Hi all,
I added the ExtractingRequestHandler to my Solr 1.4.1 instance, using the default 
configuration that I found on the wiki 
(http://wiki.apache.org/solr/ExtractingRequestHandler):


<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
  <!-- Optional. Specify a path to a Tika configuration file. See the Tika
       docs for details. -->
  <!--<str name="tika.config">/my/path/to/tika.config</str>-->

  <!-- Optional. Specify one or more date formats to parse. See
       DateUtil.DEFAULT_DATE_FORMATS for default date formats. -->
  <!--
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
  -->
</requestHandler>

Now, when I ingest HTML and PDF documents via the SolrJ API, the documents I find in the 
Solr index look like this:


stored/uncompressed,indexed,tokenized<Content-Type:application/pdf>
stored/uncompressed,indexed,omitNorms<PID:eims-document:25445#objects/eims-document:226946/datastreams/PDF/content>

stored/uncompressed,indexed,tokenized<content:  stream_size 1168557   
Content-Type application/pdf         >
stored/uncompressed,indexed,tokenized<stream_size:1168557>
stored/uncompressed,indexed,omitNorms<timestamp:2010-12-05T12:34:44.423>


How can I configure the handler to extract the text of the PDF/HTML content and store it in 
the content field?
Also, in order to update a document in the index, is it possible to ingest 
multiple binary objects with the same PID?
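
To make the first question concrete, here is a rough sketch of the kind of change I have in mind. It assumes that an fmap.content mapping is what routes the body text extracted by Tika to my "content" schema field, and that my schema has an ignored_* dynamic field for unknown metadata. Neither assumption is confirmed, so this is only a guess at the shape of the configuration, not a known fix:

```xml
<!-- Sketch only: a variant of the handler above. The fmap.content line is an
     assumption about what might be needed, not a confirmed solution. -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Assumed: map the body text extracted by Tika to the "content" schema field -->
    <str name="fmap.content">content</str>
    <str name="fmap.Last-Modified">last_modified</str>
    <!-- Prefix fields not defined in the schema; assumes an ignored_* dynamic field exists -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```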

Regards
Alessandro
