Re: How do I tell Tika to not complement a field's value defined in my Solr schema when indexing a binary document?

Erick Erickson Wed, 15 Apr 2015 14:47:35 -0700

My standard answer when you want to really customize how stuff like
this works is to do the Tika processing in SolrJ. That lets you
ignore/modify/whatever anything you want. It also moves the parsing
load off the Solr node, which scales much better. Here's an example:
http://lucidworks.com/blog/indexing-with-solrj/
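
A minimal sketch of the client-side idea, written against plain Java
collections so it runs without the Tika or SolrJ jars. It assumes the
client has already run Tika itself and holds the extracted body text plus
a map of whatever metadata Tika found. DocumentFieldMerger, mergeFields
and the sample values are hypothetical names for illustration, not part of
either library. In a real SolrJ client the resulting map would be copied
into a SolrInputDocument before sending it to Solr.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika or SolrJ.
final class DocumentFieldMerger {

    // Builds the fields for one document. Database fields are kept as-is,
    // the extracted text goes into "content", and Tika metadata is copied
    // only if its name is explicitly allowed AND does not collide with a
    // field that is already present.
    static Map<String, Object> mergeFields(Map<String, Object> dbFields,
                                           String bodyText,
                                           Map<String, String> tikaMetadata,
                                           Set<String> allowedMetadata) {
        Map<String, Object> fields = new LinkedHashMap<>(dbFields);
        fields.put("content", bodyText);
        for (Map.Entry<String, String> entry : tikaMetadata.entrySet()) {
            if (allowedMetadata.contains(entry.getKey())) {
                // putIfAbsent leaves an existing database value untouched.
                fields.putIfAbsent(entry.getKey(), entry.getValue());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> dbFields = new LinkedHashMap<>();
        dbFields.put("id", "42");
        dbFields.put("category", "contracts");

        // Stand-in for the metadata Tika would report for a PDF.
        Map<String, String> tikaMetadata = new LinkedHashMap<>();
        tikaMetadata.put("category", "Misc");
        tikaMetadata.put("Author", "Jane Doe");

        // An empty allow-list drops every metadata field Tika found.
        Map<String, Object> fields = mergeFields(
                dbFields, "text extracted from the PDF", tikaMetadata,
                Collections.<String>emptySet());
        System.out.println(fields);
    }
}
```

With an empty allow-list, category keeps only the database value no matter
which metadata names turn up in the document, so there is no need to know
them in advance.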


IOW, I don't know how to do what you're asking for from within the
Extracting Request Handler. Not quite sure whether "literals" would
work for you, see:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
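
For reference, a rough sketch of what a "literals" request to the
Extracting Request Handler could look like. It only assembles the URL and
does not contact Solr. The parameter names literal.<fieldname>,
literalsOverride and uprefix come from the Solr Cell documentation; the
host, core name, field values and the ExtractUrlBuilder class are
hypothetical, and uprefix=ignored_ assumes the schema defines a matching
ignored_* dynamic field. Whether this fully covers the case in the
question is not confirmed here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that only builds a URL string.
final class ExtractUrlBuilder {

    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Turns each database field into a literal.<name> parameter and adds
    // the two Solr Cell options discussed above.
    static String build(String handlerUrl, Map<String, String> dbFields) {
        Map<String, String> params = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : dbFields.entrySet()) {
            params.put("literal." + field.getKey(), field.getValue());
        }
        // Documented default is true: literal values replace same-named
        // values extracted by Tika instead of being appended to them.
        params.put("literalsOverride", "true");
        // Applies only to extracted fields NOT defined in the schema.
        params.put("uprefix", "ignored_");

        StringBuilder url = new StringBuilder(handlerUrl);
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dbFields = new LinkedHashMap<>();
        dbFields.put("id", "42");
        dbFields.put("category", "contracts");
        System.out.println(build(
                "http://localhost:8983/solr/mycore/update/extract", dbFields));
    }
}
```

Note that uprefix does nothing for metadata whose name matches some other
schema field that is not sent as a literal; those values would still be
indexed, which is one reason the SolrJ route gives more control.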

Best,
Erick

On Wed, Apr 15, 2015 at 10:26 AM, Patrick Savelberg
<patrick.savelb...@schulinck.nl> wrote:
> I use Solr to index different kinds of database tables. I have a Solr index
> containing a field named category. I make sure that the category field in
> Solr is populated with the right value depending on the table. I can use
> this to build facet queries, which works fine.
>
> The problem I have is with tables that contain records which represent binary
> documents like PDFs. I use the extract query (TIKA) to index the contents of
> the binary document along with the data from the database record. Tika
> sometimes finds metadata in the document which has the same name as one of
> the index fields in my schema.xml, like category. I end up with the
> category field being a multi-value field containing the category data from my
> database record AND the additional data from the category (meta)field
> extracted by TIKA from the actual binary document. It seems that the
> extract handler adds every field it finds to my index if there is a
> corresponding field in my index.
>
> How can I prevent this from happening? All I need is the textual
> representation of the binary document added as content and not the extra
> (meta?) fields. I don't want the extra data TIKA may find to be added to any
> field in my index. However, I do want to keep the data in the category field
> which comes from my database record. So adding a fmap.category="ignored_"
> won't help me, because then the data of my database record will be ignored as
> well.
>
> Another reason for wanting to prevent this is that I cannot know in advance
> which other fields TIKA might come up with when the document is extracted. In
> other words, choosing more elaborate names (such as a namespace-like prefix)
> for my index fields will never rule out field name collisions completely.
>
> So, how can I prevent the data the extraction comes up with from being added
> to my index fields, or am I missing a point here?
>
