Greetings Solr folk,

How can I instruct the extract request handler to ignore metadata/headers etc. when it constructs the "content" of the document I send to it?

For example, I created an MS Word document containing just the word "SEARCHWORD" and nothing else. However, when I ship this doc to my Solr server, here's what's thrown in the index:

    <str name="meta"> Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 108600000000 Creation-Date 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD </str>

All I want is the body of the document, in this case the word "SEARCHWORD".

For further reference, here's my extraction handler:

    <requestHandler name="/update/extract"
                    startup="lazy"
                    class="solr.extraction.ExtractingRequestHandler" >
      <lst name="defaults">
        <!-- All the main content goes into "text"... if you need to return
             the extracted text or do highlighting, use a stored field. -->
        <str name="fmap.content">meta</str>
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>
      </lst>
    </requestHandler>

(Ironically, "meta" is the field in the Solr schema to which I'm attempting to extract the body of the document. Don't ask.)

Thanks in advance for any pointers you can provide me.

--
- Joe