http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
> 
> The idea is to not index a doc if something similar (headline+bodytext)
> already exists for the exact same medianame.
> 
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding to final
> index?
> 
> Thanks,
> Frederico
> 
> 
> -----Original Message-----
> From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> Sent: Monday, 4 April 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> Do you want to skip indexing when something similar exists, or only when
> an exact copy exists? For exact duplicates I would look into a hash code
> of the document. Similarity, though, I think has to be computed against a
> document that is already in the index.
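
For the exact-duplicate case, the hash idea can be sketched roughly as below. This is only an illustration and not something from this thread: the helper names (doc_signature, should_index), the whitespace and case normalization, and the in-memory set are all assumptions. In practice the signature would more likely be stored in an index field.

```python
import hashlib


def doc_signature(medianame, headline, body):
    """Return a SHA-1 hex digest over the normalized fields.

    Normalization (lowercasing, collapsing whitespace) is an assumption;
    adjust it to whatever "exact" should mean for your data.
    """
    parts = (medianame, headline, body)
    # Join with a separator that cannot appear after whitespace collapsing,
    # so ("ab", "c") and ("a", "bc") do not produce the same signature.
    normalized = "\x1f".join(" ".join(p.lower().split()) for p in parts)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def should_index(doc, seen):
    """Return True (and remember the signature) if the doc is new."""
    sig = doc_signature(doc["medianame"], doc["headline"], doc["body"])
    if sig in seen:
        return False
    seen.add(sig)
    return True
```

This only catches copies that are identical after normalization; a single changed word gives a different hash, so it does not help with "similar".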
> 
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro <frederico.azeite...@cision.com> wrote:
> > Hi,
> > 
> > 
> > 
> > I would like to hear your opinion about the MLT feature and if it's a
> > good solution to what I need to implement.
> > 
> > 
> > 
> > My index has fields like: headline, body and medianame.
> > 
> > What I need to do is, before adding a new doc, verify if a similar doc
> > exists for this media.
> > 
> > 
> > 
> > My idea is to use the MoreLikeThisHandler
> > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> > 
> > For each new doc, perform a MLT search with q= medianame and
> > stream.body=headline+bodytext.
> > 
> > If no similar docs are found then I can safely add the doc.
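
A request along those lines could be built roughly as follows. This is a hedged sketch, not a tested setup: it assumes the handler is registered at /mlt, that the fields are named headline, body and medianame as described, and that the medianame restriction goes into fq rather than q (as far as I know the handler uses q to look up a source document in the index, so it would not be combined with stream.body). The function name and the base URL are made up for the example.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, medianame, headline, body, rows=5):
    """Build a MoreLikeThisHandler URL that posts the new text as stream.body."""
    # Escape backslashes and quotes so medianame is safe inside a quoted phrase.
    escaped = medianame.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "stream.body": headline + "\n" + body,  # text to find similar docs for
        "mlt.fl": "headline,body",              # fields used for similarity
        "mlt.mintf": 1,                         # minimum term frequency
        "mlt.mindf": 1,                         # minimum document frequency
        "fq": 'medianame:"%s"' % escaped,       # only docs from the same media
        "fl": "id,score",
        "rows": rows,
        "wt": "json",
    }
    return base_url.rstrip("/") + "/mlt?" + urlencode(params)
```

For example, build_mlt_url("http://localhost:8983/solr", "Daily News", headline, bodytext) gives a URL to fetch before adding the doc. Whether "no similar docs" means an empty result or a top score below some cutoff is a decision left to the caller, since MLT scores are not normalized.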
> > 
> > 
> > 
> > Is this feasible using the MLT handler? Is it a good approach? Is there
> > a better way to perform this comparison?
> > 
> > 
> > 
> > Thank you for your help.
> > 
> > 
> > 
> > Best regards,
> > 
> > ____________________________________________
> > 
> > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
