Hello - keep in mind that both SignatureUpdateProcessorFactory and field 
collapsing do not work in distributed search unless you map identical 
signatures to identical shards.
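
One way to get identical signatures onto the same shard, sketched under assumptions: the collection uses SolrCloud's compositeId router, and the client computes its own content hash and prepends it to the document id as a routing prefix (`prefix!id`). The field names (`title`, `content`, `signature`) and the MD5 hash are illustrative only; this is a client-side stand-in, not what SignatureUpdateProcessorFactory computes internally.

```python
import hashlib


def content_signature(doc, fields=("title", "content")):
    """Hash the chosen fields into a hex digest (hypothetical client-side signature)."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    return h.hexdigest()


def with_routed_id(doc, fields=("title", "content")):
    """Return a copy of doc whose id carries the signature as a compositeId prefix."""
    sig = content_signature(doc, fields)
    routed = dict(doc)
    routed["signature"] = sig
    # Assumption: compositeId routing, where the part before "!" picks the shard.
    routed["id"] = f"{sig}!{doc['id']}"
    return routed
```

Documents with the same hashed content then share a routing prefix, so collapsing on `signature` sees all duplicates on one shard.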
Markus
 
-----Original message-----
> From: Scott Stults <sstu...@opensourceconnections.com>
> Sent: Friday 30th October 2015 11:58
> To: solr-user@lucene.apache.org
> Subject: Re: Question on index time de-duplication
> 
> At the top of the De-Duplication wiki page is a note about collapsing
> results. Once you have the signature (identical for each of the duplicates)
> you'll want to collapse your results, keeping the one with max date.
> 
> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
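
A minimal sketch of the collapse query described above, assuming a single-valued `signature` field and a numeric `release_year` field (both names are hypothetical, as is the base URL in the usage note). It uses the collapse query parser's `field` and `max` local parameters to keep the document with the highest year in each group.

```python
from urllib.parse import urlencode


def collapse_params(user_query, signature_field="signature", date_field="release_year"):
    """Build request parameters that collapse duplicates, keeping the max date_field."""
    return {
        "q": user_query,
        # Local-params syntax for the collapse query parser, applied as a filter query.
        "fq": f"{{!collapse field={signature_field} max={date_field}}}",
    }


def select_url(base_url, params):
    """Append URL-encoded parameters to the collection's /select handler."""
    return f"{base_url}/select?{urlencode(params)}"
```

For example, `select_url("http://localhost:8983/solr/mycollection", collapse_params("product A"))` yields a URL that can be fetched with any HTTP client.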
> 
> 
> k/r,
> Scott
> 
> On Thu, Oct 29, 2015 at 11:59 PM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
> 
> > Yes, you can try using the SignatureUpdateProcessorFactory to hash the
> > content into a signature field, and then group on the signature field at
> > search time.
> >
> > You can find more information here:
> > https://cwiki.apache.org/confluence/display/solr/De-Duplication
> >
> > I have been using this method to group the index with duplicated content,
> > and it is working fine.
> >
> > Regards,
> > Edwin
> >
> >
> > On 30 October 2015 at 07:20, Shamik Bandopadhyay <sham...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > >   I'm looking to customize index-time de-duplication. Here's my use case
> > > and what I'm trying to achieve.
> > >
> > > I have identical documents coming from different release years of a given
> > > product. I need to index them in Solr as they are required in individual
> > > year context. But there's a generic search which spans all the years
> > > and hence brings back duplicate/identical content. My goal is to return only
> > > the latest document and filter out the rest. E.g., if product A has
> > > identical documents for 2015, 2014 and 2013, search should only return 2015
> > > (the latest document) and filter out the rest.
> > >
> > > What I'm thinking of doing (if possible) at index time:
> > >
> > > Index all documents, but add a special tag (e.g. dedup=true) to 2013 and
> > > 2014 content, keeping 2015 (the latest release) untouched. During query
> > > time, I'll add a filter which will exclude contents tagged with "dedup".
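
The tagging idea above could be sketched client-side, before documents are sent to Solr, as follows. This is an illustration only: the `signature`, `release_year` and `dedup` field names are assumptions, and it presumes all duplicates of a document are available in one batch.

```python
def tag_older_duplicates(docs, key_field="signature", year_field="release_year"):
    """Return copies of docs, marking every non-latest duplicate with dedup=True."""
    # First pass: find the latest year seen for each duplicate group.
    latest = {}
    for doc in docs:
        key = doc[key_field]
        if key not in latest or doc[year_field] > latest[key]:
            latest[key] = doc[year_field]
    # Second pass: tag everything older than its group's latest year.
    tagged = []
    for doc in docs:
        copy = dict(doc)
        if doc[year_field] < latest[doc[key_field]]:
            copy["dedup"] = True
        tagged.append(copy)
    return tagged


# Query-time filter that excludes the tagged documents (negative filter query).
EXCLUDE_TAGGED_FQ = "-dedup:true"
```

One consequence of this design: when a newer release is indexed later, the previously latest documents have to be re-tagged and re-indexed, which the collapse approach avoids.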
> > >
> > > Just wondering if this is achievable by perhaps extending
> > > UpdateRequestProcessorFactory or customizing
> > > SignatureUpdateProcessorFactory?
> > >
> > > Any pointers will be appreciated.
> > >
> > > Regards,
> > > Shamik
> > >
> >
> 
> 
> 
> -- 
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
> | 434.409.2780
> http://www.opensourceconnections.com
> 
