Yes, you can use SignatureUpdateProcessorFactory to hash the content into a signature field, and then group on that signature field at search time.
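
As a rough sketch of the query side: assuming the hash is stored in a field called `signature` and the release year in a field called `release_year` (both field names and the core URL below are hypothetical, so adjust them to your schema), the grouped request could be built like this:

```python
from urllib.parse import urlencode

# Hypothetical core URL; replace with your own Solr endpoint.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_grouped_query(user_query, signature_field="signature",
                        year_field="release_year"):
    """Return a select URL that collapses identical documents into one hit."""
    params = {
        "q": user_query,
        "group": "true",                     # enable result grouping
        "group.field": signature_field,      # one group per content signature
        "group.limit": 1,                    # keep a single document per group
        "group.sort": f"{year_field} desc",  # prefer the latest release year
        "group.main": "true",                # flatten groups into a plain result list
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_grouped_query("product A"))
```

This only works if every yearly copy stays in the index, so the update chain would need `overwriteDupes` set to `false`, and the fields used to compute the signature should not include the release year, otherwise identical documents from different years would hash differently.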
You can find more information here:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

I have been using this method to group documents with duplicated content in the index, and it is working fine.

Regards,
Edwin

On 30 October 2015 at 07:20, Shamik Bandopadhyay <sham...@gmail.com> wrote:

> Hi,
>
> I'm looking to customize index-time de-duplication. Here's my use case
> and what I'm trying to achieve.
>
> I have identical documents coming from different release years of a given
> product. I need to index them in Solr as they are required in individual
> year context. But there's a generic search which spans all the years
> and hence brings back duplicate/identical content. My goal is to only return
> the latest document and filter out the rest. E.g. if product A has
> identical documents for 2015, 2014 and 2013, search should only return 2015
> (latest document) and filter out the rest.
>
> What I'm thinking (if possible) during index time:
>
> Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
> 2014 content, keeping 2015 (the latest release) untouched. During query
> time, I'll add a filter which will exclude content tagged with "dedup".
>
> Just wondering if this is achievable by perhaps extending
> UpdateRequestProcessorFactory or
> customizing SignatureUpdateProcessorFactory?
>
> Any pointers will be appreciated.
>
> Regards,
> Shamik
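
For the tagging idea described in the quoted message, a minimal client-side sketch might look like the following. It is not a Solr UpdateRequestProcessor; it just preprocesses a batch of documents before they are sent to Solr, and the field names (`content`, `release_year`, `signature`, `dedup`) are assumptions for illustration only:

```python
import hashlib


def tag_older_duplicates(docs, content_field="content", year_field="release_year"):
    """Set dedup=True on every document except the latest year per identical content."""
    latest = {}  # content hash -> highest release year seen in this batch
    for doc in docs:
        sig = hashlib.md5(doc[content_field].encode("utf-8")).hexdigest()
        doc["signature"] = sig
        latest[sig] = max(latest.get(sig, doc[year_field]), doc[year_field])
    for doc in docs:
        # Older copies are tagged; the newest copy of each content stays untagged.
        doc["dedup"] = doc[year_field] < latest[doc["signature"]]
    return docs


if __name__ == "__main__":
    batch = [
        {"id": "a-2013", "content": "same text", "release_year": 2013},
        {"id": "a-2014", "content": "same text", "release_year": 2014},
        {"id": "a-2015", "content": "same text", "release_year": 2015},
    ]
    for doc in tag_older_duplicates(batch):
        print(doc["id"], doc["dedup"])
```

The generic search would then add a filter such as `fq=-dedup:true`. One caveat with this approach: it only sees the batch it is given, so when a newer release arrives later, the previously untagged copy already in the index would have to be re-tagged and re-indexed.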