Re: Question on index time de-duplication

Zheng Lin Edwin Yeo Thu, 29 Oct 2015 20:59:48 -0700

Yes, you can use the SignatureUpdateProcessorFactory to hash the content
into a signature field, and then group on that signature field at search
time.
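
To illustrate the idea only (this is not Solr's implementation), here is a
minimal Python sketch of "hash the content, then group by the hash and keep
the newest". The field names `id`, `content` and `year`, the helper names,
the whitespace/case normalization and the use of MD5 are all assumptions made
for the example.

```python
import hashlib


def signature(text: str) -> str:
    """Hash normalized content so identical documents share one signature."""
    # Assumption: collapsing whitespace and lowercasing is "identical enough".
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def latest_per_signature(docs):
    """Group documents by signature and keep the highest release year."""
    latest = {}
    for doc in docs:
        sig = signature(doc["content"])
        current = latest.get(sig)
        if current is None or doc["year"] > current["year"]:
            latest[sig] = doc
    return list(latest.values())


# Hypothetical documents: product A is identical across three release years.
docs = [
    {"id": "A-2013", "year": 2013, "content": "Install guide for product A"},
    {"id": "A-2014", "year": 2014, "content": "Install guide for product A"},
    {"id": "A-2015", "year": 2015, "content": "Install  guide for Product A"},
    {"id": "B-2015", "year": 2015, "content": "Install guide for product B"},
]

result_ids = [doc["id"] for doc in latest_per_signature(docs)]
```

In Solr the hashing happens in the update chain at index time and the
"keep the newest" step is done by result grouping at query time, so nothing
has to be removed from the index.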


You can find more information here:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

I have been using this method to group documents with duplicated content in
my index, and it is working fine.
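
For the generic cross-year search, the grouped request could be built roughly
like this. The field names `signature` and `release_year` and the helper name
are assumptions; `group`, `group.field`, `group.sort` and `group.limit` are
standard Solr result-grouping parameters. Note that, as far as I recall, the
processor's overwriteDupes option would need to be false, otherwise the older
years would be overwritten instead of kept for the per-year searches.

```python
from urllib.parse import urlencode


def grouped_query_params(user_query, signature_field="signature",
                         year_field="release_year"):
    """Build Solr params that return one document per signature group."""
    return {
        "q": user_query,
        "group": "true",                     # enable result grouping
        "group.field": signature_field,      # one group per content signature
        "group.sort": f"{year_field} desc",  # newest release first in a group
        "group.limit": "1",                  # keep only that newest document
    }


params = grouped_query_params("product A")
query_string = urlencode(params)  # append to the /select URL of your core
```

Searches scoped to a single year would simply leave the group parameters out
and filter on the year field instead.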

Regards,
Edwin


On 30 October 2015 at 07:20, Shamik Bandopadhyay <sham...@gmail.com> wrote:

> Hi,
>
>   I'm looking to customize index time de-duplication. Here's my use case
> and what I'm trying to achieve.
>
> I've identical documents coming from different release years of a given
> product. I need to index them in Solr as they are required in individual
> year context. But there's a generic search which spans across all the years
> and hence brings back duplicate/identical content. My goal is to only return
> the latest document and filter out the rest. For example, if product A has
> identical documents for 2015, 2014 and 2013, search should only return 2015
> (latest document) and filter out the rest.
>
> What I'm thinking of doing (if possible) at index time:
>
> Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
> 2014 content, keeping 2015 (the latest release) untouched. During query
> time, I'll add a filter which will exclude content tagged with "dedup".
>
> Just wondering if this is achievable by perhaps extending
> UpdateRequestProcessorFactory or customizing SignatureUpdateProcessorFactory?
>
> Any pointers will be appreciated.
>
> Regards,
> Shamik
>
