Re: Question on index time de-duplication

2015-11-01 Thread shamik

That's what I observed as well. Perhaps there's a way to customize SignatureUpdateProcessorFactory to support my use case. I'll look into the source code and figure out whether there's a way to do it.

Re: Question on index time de-duplication

2015-10-31 Thread Zheng Lin Edwin Yeo

Hi Shamik, I'm using most of the configuration out of the box, but I'm also looking at tagging an identifier or something similar so that it will always show the latest documents. At first I thought it would automatically show the one that was indexed later, but it seems that is not the case. It will just

Re: Question on index time de-duplication

2015-10-30 Thread shamik

Thanks for your reply. Have you customized SignatureUpdateProcessorFactory, or are you using the configuration out of the box? I know it works for simple dedup, but my requirement is a tad different, as I need to tag an identifier to the latest document. My goal is to understand if that's possible usi
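The "tag the latest document" requirement can be illustrated with a small client-side sketch. This is not something SignatureUpdateProcessorFactory is shown to do in this thread; the field names `signature`, `indexed_at`, and `is_latest` are hypothetical, and the dates are assumed to be ISO 8601 strings so that plain string comparison orders them correctly.

```python
def tag_latest(docs, key_field="signature", date_field="indexed_at",
               flag_field="is_latest"):
    """Flag the newest document in each duplicate group (hypothetical field names)."""
    latest = {}
    for doc in docs:
        key = doc[key_field]
        # Keep the document with the greatest date seen so far for this key.
        if key not in latest or doc[date_field] > latest[key][date_field]:
            latest[key] = doc
    for doc in docs:
        doc[flag_field] = latest[doc[key_field]] is doc
    return docs


docs = tag_latest([
    {"id": "a1", "signature": "s1", "indexed_at": "2015-10-01T00:00:00Z"},
    {"id": "a2", "signature": "s1", "indexed_at": "2015-10-29T00:00:00Z"},
    {"id": "b1", "signature": "s2", "indexed_at": "2015-09-15T00:00:00Z"},
])
latest_ids = [d["id"] for d in docs if d["is_latest"]]
```

In a real index, adding a newer duplicate would also mean clearing the flag on the previously flagged document, which is the part that would need custom logic.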

RE: Question on index time de-duplication

2015-10-30 Thread shamik

Thanks Markus. I've been using field collapsing until now, but the performance constraint is forcing me to think about index-time de-duplication. I've been using a composite router to make sure that duplicate documents are routed to the same shard. Won't that work for SignatureUpdateProcessorFactory
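As a rough illustration of the composite routing mentioned here: with Solr's compositeId router, a document ID of the form `prefix!id` is routed by its prefix, so documents sharing a prefix land on the same shard. The sketch below only builds such IDs; the helper name `composite_id` and the sample dedup key are hypothetical.

```python
def composite_id(shard_key, doc_id):
    """Build a 'shardKey!docId' style ID so duplicates share a routing prefix."""
    if "!" in shard_key:
        # '!' is the separator, so it cannot appear inside the routing prefix.
        raise ValueError("shard key must not contain '!'")
    return f"{shard_key}!{doc_id}"


# Two duplicates that share a dedup key (hypothetical value) get the same prefix.
id_one = composite_id("guide-123", "a1")
id_two = composite_id("guide-123", "a2")
same_prefix = id_one.split("!", 1)[0] == id_two.split("!", 1)[0]
```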

Re: Question on index time de-duplication

2015-10-30 Thread shamik

Thanks Scott. I could directly use field collapsing on the adskdedup field without the signature field. The problem with field collapsing is the performance overhead: it slows the query down tenfold. CollapsingQParserPlugin is a better option; unfortunately, it doesn't support an ngroups equivalent, which

RE: Question on index time de-duplication

2015-10-30 Thread Markus Jelsma

> At the top of the De-Duplication wiki page is a note about collapsing results. Once you have the signature (identical for each of the duplicates) you'll want to collapse your results, keeping the

Re: Question on index time de-duplication

2015-10-30 Thread Scott Stults

At the top of the De-Duplication wiki page is a note about collapsing results. Once you have the signature (identical for each of the duplicates) you'll want to collapse your results, keeping the one with max date. https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results k/r,
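A minimal sketch of the suggestion above, building request parameters for a collapse filter that keeps one document per signature, chosen by the maximum value of another field. The field names `signature` and `timestamp_l` are hypothetical, and the sketch assumes the `max` field is numeric (for example an epoch timestamp); whether a date field can be used directly depends on the Solr version.

```python
from urllib.parse import urlencode, parse_qs


def collapse_params(query, signature_field, max_field):
    """Return query parameters that collapse on signature_field, keeping the max of max_field."""
    return {
        "q": query,
        # Local-params syntax of the collapse query parser.
        "fq": "{!collapse field=%s max=%s}" % (signature_field, max_field),
    }


params = collapse_params("*:*", "signature", "timestamp_l")
query_string = urlencode(params)  # URL-encoded, ready to append to a /select request
```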

Re: Question on index time de-duplication

2015-10-29 Thread Zheng Lin Edwin Yeo

Yes, you can try using the SignatureUpdateProcessorFactory to hash the content into a signature field, and then group on the signature field during your search. You can find more information here: https://cwiki.apache.org/confluence/display/solr/De-Duplication I have been using this method to g
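The idea of hashing content into a signature field and grouping on it can be sketched in plain Python. This is a conceptual illustration only, not Solr's implementation: MD5 is used here for convenience, and the field names and sample documents are made up.

```python
import hashlib


def compute_signature(doc, fields):
    """Hash the concatenated values of the chosen fields into a hex digest."""
    # A separator avoids accidental collisions such as ("ab", "c") vs ("a", "bc").
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


docs = [
    {"id": "a1", "title": "Install guide", "body": "Run the installer."},
    {"id": "a2", "title": "Install guide", "body": "Run the installer."},
    {"id": "b1", "title": "Release notes", "body": "Fixed bugs."},
]
for doc in docs:
    doc["signature"] = compute_signature(doc, ["title", "body"])

# Group document IDs by signature, as a search-time grouping would.
groups = {}
for doc in docs:
    groups.setdefault(doc["signature"], []).append(doc["id"])
```

Documents with identical `title` and `body` values end up with the same signature and therefore in the same group.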
