My standard answer when you really want to customize how this kind of thing works is to do the Tika processing on the client with SolrJ. That lets you ignore, modify, or otherwise rework any field you want. It also moves the parsing load off the Solr node, which scales much better. Here's an example: http://lucidworks.com/blog/indexing-with-solrj/
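To make the idea a bit more concrete, here is a rough sketch of just the merge step such a client would perform: start from the database fields, add the extracted body text, and copy over only the metadata you explicitly allow. Everything named here is hypothetical (the class ClientSideMerge, the method buildDoc, the "content" field). The plain maps stand in for Tika's metadata and for a SolrInputDocument; no Tika or SolrJ calls are made, so treat it as an illustration of the filtering logic rather than a working indexer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ClientSideMerge {

    /**
     * Builds the field map for one document (hypothetical helper).
     *
     * dbFields          values read from the database record
     * extractedText     body text produced by the parser
     * extractedMetadata metadata the parser found in the binary document
     * allowedMetadata   metadata name -> index field name; anything not
     *                   listed here is dropped
     */
    public static Map<String, Object> buildDoc(Map<String, Object> dbFields,
                                               String extractedText,
                                               Map<String, String> extractedMetadata,
                                               Map<String, String> allowedMetadata) {
        // Database values go in first and are never overwritten below.
        Map<String, Object> doc = new LinkedHashMap<>(dbFields);

        // The extracted body text is the only thing taken unconditionally.
        doc.put("content", extractedText);

        // Copy only metadata that was explicitly allowed, and only into
        // fields the database record did not already fill.
        for (Map.Entry<String, String> e : allowedMetadata.entrySet()) {
            String value = extractedMetadata.get(e.getKey());
            if (value != null) {
                doc.putIfAbsent(e.getValue(), value);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> db = new LinkedHashMap<>();
        db.put("id", "42");
        db.put("category", "contracts");

        // Pretend the parser found a colliding "category" entry in the PDF.
        Map<String, String> meta = new HashMap<>();
        meta.put("category", "Brochure");
        meta.put("Author", "someone");

        // An empty allow-list means no extracted metadata reaches the index.
        Map<String, Object> doc =
                buildDoc(db, "Body text of the PDF", meta, Collections.emptyMap());
        System.out.println(doc);
    }
}
```

With an empty allow-list the category field keeps only the database value, and unknown metadata names can never collide with index fields because they are never copied.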
IOW, I don't know how to do what you're asking for from within the Extracting Request Handler. Not quite sure whether "literals" would work for you, see: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Best,
Erick

On Wed, Apr 15, 2015 at 10:26 AM, Patrick Savelberg <patrick.savelb...@schulinck.nl> wrote:
> I use Solr to index different kinds of database tables. I have a Solr index
> containing a field named category. I make sure that the category field in
> Solr is populated with the right value depending on the table. This I can
> use to build facet queries, which works fine.
>
> The problem I have is with tables that contain records which represent binary
> documents like PDFs. I use the extract query (TIKA) to index the contents of
> the binary document along with the data from the database record. Tika
> sometimes finds metadata in the document which has the same name as one of
> the index fields in my schema.xml, like category. I end up with the
> category field being a multi-value field containing the category data from my
> database record AND the additional data from the category (meta)field
> extracted by TIKA from the actual binary document. It seems that the
> extract handler adds every field it may find to my index if there is a
> corresponding field in my index.
>
> How can I prevent this from happening? All I need is the textual
> representation of the binary document added as content and not the extra
> (meta?) fields. I don't want the extra data TIKA may find to be added to any
> field in my index. However, I do want to keep the data in the category field
> which comes from my database record. So adding a fmap.category="ignored_"
> won't help me because then the data of my database record will be ignored as
> well.
>
> Another reason for wanting to prevent this is that I cannot know in advance
> which other fields TIKA might come up with when the document is extracted. In
> other words, choosing more elaborate names (such as a namespace-like prefix)
> for my index fields will never rule out field name collisions 100%.
>
> So, how can I prevent the data the extract comes up with from being added to
> my index fields, or am I missing a point here?