Re: Using MLT feature

Markus Jelsma Tue, 05 Apr 2011 04:01:36 -0700


On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
> Sorry, the reply I made yesterday was directed to Markus and not the
> list...
> 
> Here are my thoughts on this. At this point I'm not sure whether SOLR
> is a good option for finding near-duplicate docs.
> 
> >> Yes there is, try set overwriteDupes to true and documents yielding
> >> the same signature will be overwritten
> 
> The problem is that I don't want to overwrite the doc, I need to
> keep the original version (because the doc has other fields I need
> to keep).
> 
> >> If you need both fuzzy and exact matching then add a second
> >> update processor inside the chain and create another signature field.
> 
> I just need the fuzzy matching, but the quick tests I ran return
> different signatures for what I consider duplicate docs:
> "Army deploys as clan war kills 11 in Philippine south"
> "Army deploys as clan war kills 11 in Philippine south."
> 
> Same sig for the above 2 strings, that's ok.
> 
> But a different sig was created for:
> "Army deploys as clan war kills 11 in Philippine south the."
> 
> Is there a way to set up the TextProfileSignature parameters to adjust
> the "sensitivity" in SOLR (QUANT_RATE or MIN_TOKEN_LEN)?
> 
> Do you think that these parameters can help create the same sig for
> the above example?


The only way to fix this is to raise minTokenLen to 4, which keeps `the` out
of the list of tokens. This may affect other signatures, though: more
documents will probably end up with the same signature. Tuning quantRate
won't do much good, because every token in this short text has the same
frequency (1), so quant will always be 1. That's why TextProfileSignature
works less well for short texts.
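
To make the reasoning above concrete, here is a rough Python sketch of the
profile algorithm as described in the Nutch javadoc linked below. It is not
the Solr implementation: the function name, the tie-breaking sort order and
the exact cut-off (tokens of length >= min_token_len are kept, which matches
the minTokenLen=4 suggestion above) are assumptions, so the hashes will not
match what Solr stores. It only shows how the extra token changes the
profile.

```python
import hashlib
import re
from collections import Counter


def text_profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Hypothetical sketch of a TextProfileSignature-style hash."""
    # Lowercase and split into runs of ASCII letters/digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: tokens shorter than min_token_len are dropped.
    tokens = [t for t in tokens if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # quant is derived from the highest token frequency.
    max_freq = max(counts.values())
    quant = int(max_freq * quant_rate + 0.5)
    if quant < 2:
        # In a short text every token occurs once, so quant ends up as 1.
        quant = 2 if max_freq > 1 else 1

    profile = []
    for token, freq in counts.items():
        freq = (freq // quant) * quant  # round down to a multiple of quant
        if freq >= quant:               # discard tokens below quant
            profile.append((token, freq))

    # Sort by frequency (descending); token order is an assumed tie-breaker.
    profile.sort(key=lambda item: (-item[1], item[0]))
    flat = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(flat.encode("utf-8")).hexdigest()


a = "Army deploys as clan war kills 11 in Philippine south"
b = "Army deploys as clan war kills 11 in Philippine south."
c = "Army deploys as clan war kills 11 in Philippine south the."

print(text_profile_signature(a) == text_profile_signature(b))
print(text_profile_signature(a) == text_profile_signature(c))
print(text_profile_signature(a, min_token_len=4)
      == text_profile_signature(c, min_token_len=4))
```

With min_token_len=4 the tokens `as`, `war`, `11` and `in` are dropped along
with `the`, so a headline that differs only in those short words would get
the same signature as well. That is the side effect on other signatures
mentioned above.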

http://nutch.apache.org/apidocs-1.2/org/apache/nutch/crawl/TextProfileSignature.html

> 
> Is anyone using the TextProfileSignature with success?
> 
> Thank you,
> Frederico
> 
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Monday, 4 April 2011 16:47
> To: [email protected]
> Cc: Frederico Azeiteiro
> Subject: Re: Using MLT feature
> 
> > Hi again,
> > I guess I was wrong in my earlier post... There's no automated way to
> > avoid indexing the duplicate doc.
> 
> Yes there is, try setting overwriteDupes to true and documents yielding
> the same signature will be overwritten. If you need both fuzzy and exact
> matching then add a second update processor inside the chain and create
> another signature field.
> 
> > I guess I have 2 options:
> > 
> > 1. Create a temp index with signatures and then have an app that, for
> > each new doc, verifies whether the sig exists in my primary index. If
> > not, add the article.
> > 
> > 2. Before adding the doc, create a signature (using the same algorithm
> > that SOLR uses) in my indexing app and then verify whether the signature
> > exists before adding.
> > 
> > Am I thinking the right way here? :)
> > 
> > Thank you,
> > Frederico
> > 
> > 
> > 
> > -----Original Message-----
> > From: Frederico Azeiteiro [mailto:[email protected]]
> > Sent: Monday, 4 April 2011 11:59
> > To: [email protected]
> > Subject: RE: Using MLT feature
> > 
> > Thank you Markus it looks great.
> > 
> > But the wiki is not very detailed on this.
> > Do you mean if I:
> > 
> > 1. Create:
> > <updateRequestProcessorChain name="dedupe">
> >     <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
> >       <bool name="enabled">true</bool>
> >       <bool name="overwriteDupes">false</bool>
> >       <str name="signatureField">signature</str>
> >       <str name="fields">headline,body,medianame</str>
> >       <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
> >     </processor>
> >     <processor class="solr.LogUpdateProcessorFactory" />
> >     <processor class="solr.RunUpdateProcessorFactory" />
> >   </updateRequestProcessorChain>
> > 
> > 2. Add the request as the default update request
> > 3. Add a "signature" indexed field to my schema.
> > 
> > Then,
> > When adding a new doc to my index, it is only added if it is not
> > considered a duplicate using a Lookup3Signature on the fields defined?
> > All duplicates are ignored and not added to my index?
> > Is it as simple as that?
> > 
> > Does it work even if the medianame should be an exact match (not a
> > similar match as the headline and bodytext are)?
> > 
> > Thank you for your help,
> > 
> > ____________________________________________
> > Frederico Azeiteiro
> > Developer
> > 
> > 
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Monday, 4 April 2011 10:48
> > To: [email protected]
> > Subject: Re: Using MLT feature
> > 
> > http://wiki.apache.org/solr/Deduplication
> > 
> > On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> > > Hi,
> > > 
> > > The idea is not to index a doc if something similar (headline+bodytext)
> > > already exists for the exact same medianame.
> > > 
> > > Do you mean I would need to index the doc first (maybe in a temp index)
> > > and then use the MLT feature to find similar docs before adding it to
> > > the final index?
> > > 
> > > Thanks,
> > > Frederico
> > > 
> > > 
> > > -----Original Message-----
> > > From: Chris Fauerbach [mailto:[email protected]]
> > > Sent: Monday, 4 April 2011 10:22
> > > To: [email protected]
> > > Subject: Re: Using MLT feature
> > > 
> > > Do you want to skip indexing if something similar exists, or only if an
> > > exact copy exists? I would look into a hash code of the document if you
> > > only want to skip exact copies. Similar, though, I think has to be based
> > > off a document in the index.
> > > 
> > > On Apr 4, 2011, at 5:16, Frederico Azeiteiro
> > > 
> > > <[email protected]> wrote:
> > > > Hi,
> > > > 
> > > > 
> > > > 
> > > > I would like to hear your opinion about the MLT feature and whether
> > > > it's a good solution for what I need to implement.
> > > > 
> > > > 
> > > > 
> > > > My index has fields like: headline, body and medianame.
> > > > 
> > > > What I need to do is, before adding a new doc, verify whether a
> > > > similar doc exists for this media.
> > > > 
> > > > 
> > > > 
> > > > My idea is to use the MoreLikeThisHandler
> > > > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
> > > > way:
> > > > 
> > > > For each new doc, perform an MLT search with q=medianame and
> > > > stream.body=headline+bodytext.
> > > > 
> > > > If no similar docs are found then I can safely add the doc.
> > > > 
> > > > 
> > > > 
> > > > Is this feasible using the MLT handler? Is it a good approach? Is
> > > > there a better way to perform this comparison?
> > > > 
> > > > 
> > > > 
> > > > Thank you for your help.
> > > > 
> > > > 
> > > > 
> > > > Best regards,
> > > > 
> > > > ____________________________________________
> > > > 
> > > > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
