I use Solr to index different kinds of database tables. I have a Solr index 
containing a field named category. I make sure that the category field in Solr 
is populated with the right value depending on the table. I can use this to 
build facet queries, which works fine.

The problem I have is with tables whose records represent binary documents 
such as PDFs. I use the extracting request handler (Tika) to index the contents 
of the binary document along with the data from the database record. Tika 
sometimes finds metadata in the document that has the same name as one of the 
fields in my schema.xml, such as category. I end up with the category field 
being a multi-valued field containing the category data from my database 
record AND the additional data from the category (meta)field extracted by Tika 
from the actual binary document. It seems that the extracting handler adds 
every field it finds to my index if there is a corresponding field in my schema.
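
For illustration, a request of the kind I am describing could be built roughly 
as follows. This is only a sketch: the host, the document id, and the category 
value are made-up placeholders, and it assumes the database values are passed 
to the /update/extract handler as literal.* parameters, with the binary 
document itself sent as the request body.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, category):
    """Build the URL for a Solr extract request.

    The database values are passed as literal.<fieldname> parameters;
    the binary document would be sent separately as the request body.
    """
    params = {
        "literal.id": doc_id,            # hypothetical unique key from the database
        "literal.category": category,    # category value from the database record
        "commit": "true",
    }
    return f"{base_url}/update/extract?{urlencode(params)}"


# Placeholder host and values, for illustration only.
url = build_extract_url("http://localhost:8983/solr", "doc-42", "invoices")
print(url)
```

If the PDF posted with this request carries its own category metadata, the 
indexed document ends up with two values in category: the literal one and the 
one Tika extracted.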

How can I prevent this from happening? All I need is the textual representation 
of the binary document added as content, not the extra (meta) fields. I don't 
want the extra data Tika may find to be added to any field in my index. 
However, I do want to keep the data in the category field that comes from my 
database record. So adding a fmap.category="ignored_" won't help me, because 
then the data from my database record will be ignored as well.

Another reason for wanting to prevent this is that I cannot know in advance 
which other fields Tika might come up with when the document is extracted. In 
other words, choosing more elaborate names (for example, with a namespace-like 
prefix) for my index fields will never rule out field name collisions 100%.

So, how can I prevent the data that the extraction comes up with from being 
added to my index fields, or am I missing a point here?
