Hi, I'm looking into customizing index-time de-duplication. Here's my use case and what I'm trying to achieve.
I have identical documents coming from different release years of a given product. I need to index all of them in Solr because they are required in the context of each individual year. However, there is also a generic search that spans all years and therefore brings back duplicate/identical content. My goal is to return only the latest document and filter out the rest. For example, if product A has identical documents for 2015, 2014 and 2013, the search should return only the 2015 document (the latest) and filter out the others.

What I'm thinking of doing (if possible) at index time: index all documents, but add a special tag (e.g. dedup=true) to the 2013 and 2014 content, leaving 2015 (the latest release) untouched. At query time, I'll add a filter that excludes content tagged with "dedup".

Is this achievable, perhaps by extending UpdateRequestProcessorFactory or customizing SignatureUpdateProcessorFactory? Any pointers will be appreciated.

Regards,
Shamik
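
P.S. To make the idea concrete, here is a rough sketch of the tagging logic I have in mind, written as plain Java with no Solr classes. Everything in it is an assumption for illustration only: the Doc class, the field names (product, year, content, dedup) and the method tagOlderDuplicates are hypothetical, and the raw content string stands in for whatever signature (e.g. a hash) would really be used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupTagger {

    /** Hypothetical stand-in for an indexed document; not a Solr class. */
    public static final class Doc {
        public final String id;
        public final String product;
        public final int year;
        public final String content;
        public boolean dedup; // true = older duplicate, excluded by the query-time filter

        public Doc(String id, String product, int year, String content) {
            this.id = id;
            this.product = product;
            this.year = year;
            this.content = content;
        }
    }

    /**
     * Within each group of documents sharing the same product and content,
     * leaves the latest-year document untagged and sets dedup=true on the rest.
     */
    public static void tagOlderDuplicates(List<Doc> docs) {
        Map<String, Doc> latestByKey = new HashMap<>();
        for (Doc d : docs) {
            // Assumption: the raw content acts as the duplicate signature.
            String key = d.product + '\u0000' + d.content;
            Doc latest = latestByKey.get(key);
            if (latest == null || d.year > latest.year) {
                if (latest != null) {
                    latest.dedup = true; // previous "latest" is now an older duplicate
                }
                d.dedup = false;
                latestByKey.put(key, d);
            } else {
                d.dedup = true; // an equal or newer copy has already been seen
            }
        }
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("A-2013", "A", 2013, "identical text"));
        docs.add(new Doc("A-2015", "A", 2015, "identical text"));
        docs.add(new Doc("A-2014", "A", 2014, "identical text"));
        tagOlderDuplicates(docs);
        for (Doc d : docs) {
            System.out.println(d.id + " dedup=" + d.dedup);
        }
    }
}
```

The generic search would then add a filter query such as fq=-dedup:true, while the year-specific searches would leave it out. The part I'm unsure about is doing this inside an update processor, where documents arrive one at a time and an older copy may already be in the index when the newer one shows up.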