Hi all,

I added the ExtractingRequestHandler to my Solr 1.4.1 instance, using the default configuration that I found on the wiki (http://wiki.apache.org/solr/ExtractingRequestHandler).

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
  <!--Optional. Specify a path to a tika configuration file. See the Tika docs for details.-->
  <!--<str name="tika.config">/my/path/to/tika.config</str>-->
  <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS for default date formats -->
  <!--
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
  -->
</requestHandler>

Now, when I ingest HTML and PDF documents via the SolrJ API, the documents I find in the Solr index look like this:

stored/uncompressed,indexed,tokenized<Content-Type:application/pdf>
stored/uncompressed,indexed,omitNorms<PID:eims-document:25445#objects/eims-document:226946/datastreams/PDF/content>
stored/uncompressed,indexed,tokenized<content: stream_size 1168557 Content-Type application/pdf >
stored/uncompressed,indexed,tokenized<stream_size:1168557>
stored/uncompressed,indexed,omitNorms<timestamp:2010-12-05T12:34:44.423>

The content field holds only the stream metadata (stream_size and Content-Type), not the text of the document.

1. How can I configure the handler to extract the text of the PDF/HTML documents and add it to the content field?

2. In order to update a document in the index, is it possible to ingest multiple binary objects with the same PID?

Regards,
Alessandro