That's what I observed as well. Perhaps there's a way to customize
SignatureUpdateProcessorFactory to support my use case. I'll look into the
source code and figure out if there's a way to do it.
Hi Shamik,
I'm using most of the configuration out of the box, but I'm also looking at
tagging an identifier or something so that it will always show the latest
documents.
At first I thought it would automatically show the one that was indexed
later, but it seems that is not the case. It will just
Thanks for your reply. Have you customized SignatureUpdateProcessorFactory or
are you using the configuration out of the box? I know it works for simple
dedup, but my requirement is a tad different, as I need to tag an identifier
to the latest document. My goal is to understand if that's possible usi
Thanks Markus. I've been using field collapsing till now but the performance
constraint is forcing me to think about index time de-duplication. I've been
using a composite router to make sure that duplicate documents are routed to
the same shard. Won't that work for SignatureUpdateProcessorFactory
Thanks Scott. I could directly use field collapsing on the adskdedup field
without the signature field. The problem with field collapsing is the
performance overhead: it slows the query down tenfold.
CollapsingQParserPlugin is a better option; unfortunately, it doesn't
support an ngroups equivalent, which
At the top of the De-Duplication wiki page is a note about collapsing
results. Once you have the signature (identical for each of the duplicates)
you'll want to collapse your results, keeping the one with max date.
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
k/r,
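
For illustration, the collapse-and-keep-the-latest idea above could be
expressed as query parameters along these lines. This is only a sketch: the
field names `signature` and `last_modified_ts` are hypothetical, and it
assumes the timestamp is stored in a numeric field, since the `max` parameter
of the collapse parser is documented for numeric fields and function queries.

```python
from urllib.parse import urlencode


def collapse_params(signature_field, timestamp_field, query="*:*"):
    """Build Solr query parameters that collapse on a signature field,
    keeping the document with the highest timestamp in each group."""
    return {
        "q": query,
        # Local-params syntax of the CollapsingQParserPlugin.
        "fq": "{!collapse field=%s max=%s}" % (signature_field, timestamp_field),
    }


# Hypothetical field names; replace with the ones in your schema.
params = collapse_params("signature", "last_modified_ts")
query_string = urlencode(params)
print(query_string)
```

The resulting string can be appended to a /select request. Only one document
per signature value is returned, so the duplicates with older timestamps are
hidden at query time rather than removed from the index.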
Yes, you can try using SignatureUpdateProcessorFactory to hash the content
into a signature field, and then group on the signature field during your
search.
You can find more information here:
https://cwiki.apache.org/confluence/display/solr/De-Duplication
I have been using this method to g
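
To make the hash-then-group idea concrete, here is a small standalone sketch
in plain Python. It does not use Solr or reproduce the signature algorithm
Solr applies; MD5 over the joined field values is an assumption made purely
for illustration, and the document fields (`title`, `body`, `updated`) are
hypothetical.

```python
import hashlib
from collections import defaultdict


def signature(doc, fields):
    """Hash the chosen fields into one hex digest (illustrative only)."""
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def latest_per_signature(docs, fields, date_key):
    """Group documents by signature and keep the newest one in each group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[signature(doc, fields)].append(doc)
    return [max(group, key=lambda d: d[date_key]) for group in groups.values()]


# Hypothetical documents; "a" and "b" have identical content.
docs = [
    {"id": "a", "title": "Guide", "body": "same text", "updated": "2015-10-01"},
    {"id": "b", "title": "Guide", "body": "same text", "updated": "2015-11-01"},
    {"id": "c", "title": "Other", "body": "different", "updated": "2015-09-15"},
]
winners = latest_per_signature(docs, ["title", "body"], "updated")
print(sorted(d["id"] for d in winners))
```

Documents with identical content share a signature, so only the most recently
updated one from each group survives, which is the behaviour the thread is
trying to get from Solr.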