> The first question I'd ask is "why are there duplicates
> in your index in the first place?". If you're denormalizing,
> that would account for it. Mostly, I'm just asking to be
> sure that you expect duplicate product IDs. If you make
> your productid a <uniqueKey>, there'll only be one of each.
> 
> You'll have to re-index if you make this change though.
> 
> But grouping/field collapsing would, indeed, apply to this
> problem.
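
For illustration, grouping on productid can be requested by adding
group=true&group.field=productid to the original query. The same parameters
can also be set as handler defaults in solrconfig.xml. This is only a sketch:
the handler name "/products" is hypothetical, and it assumes a Solr version
that ships result grouping and a single-valued, indexed productid field.

```xml
<!-- Hypothetical handler; the name "/products" is an assumption. -->
<requestHandler name="/products" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Turn on result grouping (field collapsing). -->
    <str name="group">true</str>
    <!-- Collapse documents that share the same productid. -->
    <str name="group.field">productid</str>
    <!-- Keep one document per group. -->
    <str name="group.limit">1</str>
    <!-- Return the groups as a flat document list. -->
    <str name="group.main">true</str>
    <str name="fl">productid</str>
  </lst>
</requestHandler>
```

With group.main=true the response should come back as a plain list of docs,
one per productid, rather than the nested grouped format.
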
> 
> Deduplication isn't applicable, since you know exactly what
> the duplicates are. Deduplication is more for "fuzzy" removal
> of near-duplicates.

That's only true if you use Nutch's TextProfileSignature; MD5 and Lookup3 are
meant for exact matching. I don't know whether Lookup3Signature works on
non-string/text values, but I see no reason why it should not.
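
As a rough sketch, an exact-match dedup chain in solrconfig.xml could look
like the one below, along the lines of the Solr wiki's Deduplication example.
The chain name "dedupe" and the field name "signature" are assumptions; the
signature field would also have to be declared in schema.xml, and the chain
has to be attached to the update handler, which differs between Solr versions.

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Field that stores the computed hash (assumed name). -->
    <str name="signatureField">signature</str>
    <!-- A new document replaces an existing one with the same signature. -->
    <bool name="overwriteDupes">true</bool>
    <!-- Compute the signature from productid only. -->
    <str name="fields">productid</str>
    <!-- Exact-match hash, as opposed to the fuzzy TextProfileSignature. -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Note that this works at index time, so it needs a re-index, and with
overwriteDupes enabled only one document per productid survives. That is
probably not what you want if the duplicates come from denormalizing.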

It might be an improvement to allow deduplication that skips creating a
signature field and dedups directly on non-string values instead.

> 
> Hope this helps
> Erick
> 
> On Wed, Aug 31, 2011 at 12:01 AM, Aaron Bains <aaronba...@gmail.com> wrote:
> > Hello,
> > 
> > What is the best way to remove duplicate values on output? I am using the
> > following query:
> > 
> > /solr/select/?q=wrt54g2&version=2.2&start=0&rows=10&indent=on&fl=productid
> > 
> > And I get the following results:
> > 
> > <doc>
> > <int name="productid">1011630553</int>
> > </doc>
> > <doc>
> > <int name="productid">1011630553</int>
> > </doc>
> > <doc>
> > <int name="productid">1011630553</int>
> > </doc>
> > <doc>
> > <int name="productid">1011630553</int>
> > </doc>
> > <doc>
> > <int name="productid">1011630553</int>
> > </doc>
> > <doc>
> > <int name="productid">1011630553</int>
> > </doc>
> > <doc>
> > <int name="productid">1011630553</int>
> > </doc>
> > <doc>
> > <int name="productid">1013033708</int>
> > </doc>
> > <doc>
> > <int name="productid">1013033708</int>
> > </doc>
> > <doc>
> > <int name="productid">1013033708</int>
> > </doc>
> > 
> > 
> > But I don't want those results because there are duplicates. I am looking
> > for results like below:
> > 
> > <doc>
> > <int name="productid">1011630553</int>
> > </doc>
> > <doc>
> > <int name="productid">1013033708</int>
> > </doc>
> > 
> > I know there is deduplication and field collapsing but I am not sure if
> > they are applicable in this situation. Thanks for your help!
