http://wiki.apache.org/solr/Deduplication
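
For the exact-duplicate case, the hash idea mentioned in the quoted thread below can be sketched on the client side. This is only a minimal illustration, not how Solr's deduplication works internally. The function names `signature` and `should_index`, the whitespace and case normalization, the choice of SHA-256 and the in-memory `seen` set are all assumptions made for the example; only the field names headline, body and medianame come from the original question.

```python
import hashlib
import re

# Hypothetical in-memory store of signatures already indexed.
# A real setup would keep the signature in the index itself.
seen = set()


def _normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase (an assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def signature(medianame: str, headline: str, body: str) -> str:
    """Return a hex digest for one (medianame, headline, body) combination."""
    parts = [_normalize(medianame), _normalize(headline), _normalize(body)]
    # Join with a separator that cannot appear after normalization,
    # so field boundaries stay unambiguous.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def should_index(doc: dict) -> bool:
    """Return True the first time a signature is seen, False for repeats."""
    sig = signature(doc["medianame"], doc["headline"], doc["body"])
    if sig in seen:
        return False
    seen.add(sig)
    return True
```

A hash like this only catches documents that are identical after normalization. For "similar" rather than "exact", a fuzzy signature is needed; as far as I recall, the wiki page above describes one (TextProfileSignature) that can be computed at index time.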
On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
>
> The idea is to not index a doc if something similar (headline+bodytext)
> already exists for the exact same medianame.
>
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding it to the
> final index?
>
> Thanks,
> Frederico
>
>
> -----Original Message-----
> From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> Sent: Monday, 4 April 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
>
> Do you want to skip indexing if something similar exists, or only if an
> exact duplicate exists? For exact duplicates, I would look into a hash
> code of the document. Similarity, though, I think has to be based on a
> document already in the index.
>
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro
> <frederico.azeite...@cision.com> wrote:
> > Hi,
> >
> > I would like to hear your opinion about the MLT feature and whether
> > it's a good solution for what I need to implement.
> >
> > My index has fields like: headline, body and medianame.
> >
> > What I need to do is, before adding a new doc, verify whether a
> > similar doc already exists for this media.
> >
> > My idea is to use the MoreLikeThisHandler
> > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
> > way:
> >
> > For each new doc, perform an MLT search with q=medianame and
> > stream.body=headline+bodytext.
> >
> > If no similar docs are found, then I can safely add the doc.
> >
> > Is this feasible using the MLT handler? Is it a good approach? Is
> > there a better way to perform this comparison?
> >
> > Thank you for your help.
> >
> > Best regards,
> >
> > ____________________________________________
> >
> > Frederico Azeiteiro

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350