Re: Question on index time de-duplication

2015-11-01 Thread shamik
That's what I observed as well. Perhaps there's a way to customize SignatureUpdateProcessorFactory to support my use case. I'll look into the source code and figure out whether there's a way to do it.

Re: Question on index time de-duplication

2015-10-31 Thread Zheng Lin Edwin Yeo
Hi Shamik, I'm using most of the configuration out of the box, but I'm also looking at tagging an identifier or something similar so that it always shows the latest documents. At first I thought it would automatically show the one that was indexed later, but it seems that is not the case. It will just

Re: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks for your reply. Have you customized SignatureUpdateProcessorFactory, or are you using the configuration out of the box? I know it works for simple dedup, but my requirement is a tad different, as I need to tag an identifier to the latest document. My goal is to understand if that's possible usi

RE: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks, Markus. I've been using field collapsing until now, but its performance cost is forcing me to consider index time de-duplication. I've been using a composite router to make sure that duplicate documents are routed to the same shard. Won't that work for SignatureUpdateProcessorFactory

Re: Question on index time de-duplication

2015-10-30 Thread shamik
Thanks, Scott. I could use field collapsing directly on the adskdedup field without the signature field. The problem with field collapsing is the performance overhead: it slows the query down tenfold. CollapsingQParserPlugin is a better option; unfortunately, it doesn't support an ngroups equivalent, which

RE: Question on index time de-duplication

2015-10-30 Thread Markus Jelsma
> At the top of the De-Duplication wiki page is a note about collapsing results. Once you have the signature (identical for each of the duplicates) you'll want to collapse your results, keeping the

Re: Question on index time de-duplication

2015-10-30 Thread Scott Stults
At the top of the De-Duplication wiki page is a note about collapsing results. Once you have the signature (identical for each of the duplicates) you'll want to collapse your results, keeping the one with max date. https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results k/r,
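
For reference, the collapse suggested above could be expressed as a filter query along these lines. This is only a minimal sketch: the field names `signature` and `indexed_date` and the helper `build_collapse_params` are assumptions for illustration and do not come from this thread. The `{!collapse field=... max=...}` local-params form is the one described on the Collapse and Expand Results page.

```python
from urllib.parse import urlencode


def build_collapse_params(query, signature_field="signature", date_field="indexed_date"):
    """Build Solr request parameters that collapse duplicates.

    Documents sharing the same value in `signature_field` are collapsed
    to a single result, keeping the one with the largest `date_field`.
    Both field names are hypothetical placeholders.
    """
    collapse_filter = "{!collapse field=%s max=%s}" % (signature_field, date_field)
    return {"q": query, "fq": collapse_filter}


# Example: parameters for a search on a hypothetical `content` field.
params = build_collapse_params("content:solr")
# URL-encoded form, ready to append to a /select request.
query_string = urlencode(params)
```

The date field used with `max` needs to be a numeric or date field in your schema for the comparison to be meaningful.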

Re: Question on index time de-duplication

2015-10-29 Thread Zheng Lin Edwin Yeo
Yes, you can try using SignatureUpdateProcessorFactory to hash the content into a signature field, and then group on the signature field at search time. You can find more information here: https://cwiki.apache.org/confluence/display/solr/De-Duplication I have been using this method to g
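
To make the idea concrete, here is a rough client-side sketch of what "hash the content into a signature field" and "tag the latest document" could look like before documents are sent to Solr. It is not how SignatureUpdateProcessorFactory is implemented; Solr ships its own signature classes, and MD5 is used here only for illustration. The field names (`content`, `date`, `signature`, `is_latest`) and both helper functions are assumptions.

```python
import hashlib


def signature(doc, fields):
    """Return a hex digest computed over the chosen fields of a document."""
    # Join field values with a separator that is unlikely to appear in text.
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def tag_latest(docs, fields, date_key="date"):
    """Add `signature` and `is_latest` to each document (modifies docs in place).

    Within each group of documents sharing a signature, only the one with
    the greatest `date_key` value is tagged is_latest=True. Dates are
    assumed to be ISO-8601 strings, which sort correctly as plain text.
    """
    latest = {}
    for doc in docs:
        sig = signature(doc, fields)
        doc["signature"] = sig
        doc["is_latest"] = False
        if sig not in latest or doc[date_key] > latest[sig][date_key]:
            latest[sig] = doc
    for doc in latest.values():
        doc["is_latest"] = True
    return docs


docs = [
    {"id": "a", "content": "same text", "date": "2015-10-01"},
    {"id": "b", "content": "same text", "date": "2015-10-29"},
    {"id": "c", "content": "other text", "date": "2015-10-15"},
]
tagged = tag_latest(docs, ["content"])
```

With a flag like `is_latest` indexed, a plain filter on that field could stand in for query-time collapsing, at the cost of having to re-tag the older duplicates whenever a newer one arrives.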