I use Solr to index several kinds of database tables. My Solr index contains a field named category, and I make sure that it is populated with the right value for each table. I use this field to build facet queries, which work fine.
The problem is with tables whose records represent binary documents such as PDFs. I use the extract request handler (Tika) to index the contents of the binary document along with the data from the database record. Tika sometimes finds metadata in the document that has the same name as one of the index fields in my schema.xml, such as category. I end up with category as a multi-valued field containing the category data from my database record AND the additional data from the category metadata field that Tika extracted from the binary document. It seems that the extract handler adds every field it finds to my index whenever the index has a field with the same name.
How can I prevent this from happening? All I need is the textual representation of the binary document added as content, not the extra metadata fields. I don't want any of the extra data Tika finds to be added to any field in my index. However, I do want to keep the data in the category field that comes from my database record. So adding fmap.category="ignored_" won't help me, because then the data from my database record will be ignored as well.
Another reason for wanting to prevent this is that I cannot know in advance which other fields Tika might come up with when a document is extracted. In other words, choosing more elaborate names for my index fields (for example with a namespace-like prefix) can never fully rule out field name collisions.
So, how can I prevent the data that the extraction comes up with from being added to my index fields, or am I missing a point here?
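For reference, the kind of request described above can be sketched roughly as follows. This is only an illustration under assumptions: it supposes the database values are passed as literal.* parameters to the /update/extract handler, and the base URL, document id and category value are hypothetical placeholders, not taken from my actual setup.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, category):
    """Build the URL for a Solr extract request.

    Assumption: values from the database record are sent as literal.*
    parameters, while the binary document itself is posted as the body.
    """
    params = {
        "literal.id": doc_id,          # hypothetical unique key field
        "literal.category": category,  # value from the database record
        "commit": "true",
    }
    return f"{base_url}/update/extract?{urlencode(params)}"


# Hypothetical host, id and category, for illustration only.
url = build_extract_url("http://localhost:8983/solr", "doc-1", "invoices")
print(url)
```

If the posted document carries its own category metadata, the category field ends up holding both the literal value and the value Tika extracted, which is the collision described above.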