If you check the code for TextProfileSignature [1] you'll notice the init
method reading params. You can set those params as you did. Reading the Javadoc
[2] might help as well. What the Javadoc does not document is how QUANT
is computed: it rounds.
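
A minimal sketch of that rounding in Java, assuming the logic matches the
linked source [1]: QUANT is round(maxFreq * quantRate), raised to 2 when it
would fall below 2 and at least one token repeats, and to 1 otherwise. The
names QuantSketch, quant and quantize are illustrative and not part of Solr,
so check the source before relying on the details.

```java
class QuantSketch {
    // Assumed rule: round maxFreq * quantRate, then apply a floor of 2
    // (or 1 when no token occurs more than once).
    static int quant(int maxFreq, float quantRate) {
        int q = Math.round(maxFreq * quantRate);
        if (q < 2) {
            q = (maxFreq > 1) ? 2 : 1;
        }
        return q;
    }

    // Round a token frequency down to the nearest multiple of quant.
    static int quantize(int freq, int quant) {
        return (freq / quant) * quant;
    }

    public static void main(String[] args) {
        // Short text where every token occurs once: quant stays at 1.
        System.out.println(quant(1, 0.01f));
        // Longer text whose most frequent token occurs 1000 times.
        System.out.println(quant(1000, 0.01f));
    }
}
```

In a short text where every token occurs once, quant comes out as 1 whatever
quantRate is, which is why changing quantRate does not help there.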

[1]: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/src/java/org/apache/solr/update/processor/TextProfileSignature.java?view=markup
[2]: http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
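
For the example strings quoted below, here is a rough Java sketch of how the
minimum token length changes the token profile. The tokenization (lower-cased
runs of letters and digits) and the strict "longer than minTokenLen"
comparison are assumptions to verify against [1]. TokenProfileSketch and
profile are made-up names, and quantization and hashing are left out.

```java
import java.util.Map;
import java.util.TreeMap;

class TokenProfileSketch {
    // Count lower-cased letter/digit tokens, keeping only those strictly
    // longer than minTokenLen (assumed comparison, see the note above).
    static Map<String, Integer> profile(String text, int minTokenLen) {
        Map<String, Integer> counts = new TreeMap<>();
        StringBuilder cur = new StringBuilder();
        // A trailing space makes sure the last token is flushed as well.
        for (char c : (text + " ").toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                if (cur.length() > minTokenLen) {
                    counts.merge(cur.toString(), 1, Integer::sum);
                }
                cur.setLength(0);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String a = "Army deploys as clan war kills 11 in Philippine south.";
        String b = "Army deploys as clan war kills 11 in Philippine south the.";
        System.out.println(profile(a, 2).equals(profile(b, 2))); // false
        System.out.println(profile(a, 4).equals(profile(b, 4))); // true
    }
}
```

In this sketch, with minTokenLen at 2 the token `the` stays in the profile of
the second string, so the two profiles differ. At 4 it is dropped and they
match, at the cost of also dropping four-letter tokens such as `army` and
`clan`.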

On Tuesday 05 April 2011 16:10:08 Frederico Azeiteiro wrote:
> Thank you, I'll try to write a C# method that produces the same signature
> as Solr, and then compare both signatures before indexing the doc. This way
> I can avoid indexing existing docs.
> 
> If anyone needs to use this parameter (as this info is not on the wiki),
> you can add the option
> 
> <str name="minTokenLen">5</str>
> 
> inside the processor element.
> 
> Best regards,
> Frederico 
> 
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, 5 April 2011 12:01
> To: solr-user@lucene.apache.org
> Cc: Frederico Azeiteiro
> Subject: Re: Using MLT feature
> 
> On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
> > Sorry, the reply I made yesterday was directed to Markus and not the
> > list...
> > 
> > Here are my thoughts on this. At this point I'm a little unsure whether
> > Solr is a good option for finding near-duplicate docs.
> > 
> > >> Yes there is, try setting overwriteDupes to true and documents yielding
> > >> the same signature will be overwritten
> > 
> > The problem is that I don't want to overwrite the doc, I need to
> > maintain the original version (because the doc has other fields I need
> > to maintain).
> > 
> > >> If you need both fuzzy and exact matching then add a second
> > >> update processor inside the chain and create another signature field.
> > 
> > I just need the fuzzy search, but the quick tests I ran return
> > different signatures for what I consider duplicate docs:
> > "Army deploys as clan war kills 11 in Philippine south"
> > "Army deploys as clan war kills 11 in Philippine south."
> > 
> > Same sig for the above 2 strings, that's ok.
> > 
> > But a different sig was created for:
> > "Army deploys as clan war kills 11 in Philippine south the."
> > 
> > Is there a way to set up the TextProfileSignature parameters to adjust
> > the "sensitivity" in Solr (QUANT_RATE or MIN_TOKEN_LEN)?
> > 
> > Do you think that these parameters can help create the same sig for
> > the above example?
> 
> You can only fix this by increasing minTokenLen to 4 to prevent `the` from
> being added to the list of tokens but this may affect other signatures.
> Possibly more documents will then get the same signature. Messing around
> with quantRate won't do much good because all your tokens have the same
> frequency (1) so quant will always be 1 in this short text. That's why
> TextProfileSignature works less well for short texts.
> 
> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/crawl/TextProfileSignature.html
> 
> > Is anyone using the TextProfileSignature with success?
> > 
> > Thank you,
> > Frederico
> > 
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Monday, 4 April 2011 16:47
> > To: solr-user@lucene.apache.org
> > Cc: Frederico Azeiteiro
> > Subject: Re: Using MLT feature
> > 
> > > Hi again,
> > > I guess I was wrong in my earlier post... There's no automated way to
> > > avoid indexing the duplicate doc.
> > 
> > Yes there is, try setting overwriteDupes to true and documents yielding
> > the same signature will be overwritten. If you need both fuzzy and exact
> > matching then add a second update processor inside the chain and create
> > another signature field.
> > 
> > > I guess I have 2 options:
> > > 
> > > 1. Create a temp index with signatures and then have an app that, for
> > > each new doc, verifies whether the sig exists in my primary index. If
> > > not, add the article.
> > > 
> > > 2. Before adding the doc, create a signature (using the same algorithm
> > > that Solr uses) in my indexing app and then verify whether the
> > > signature exists before adding.
> > > 
> > > Am I thinking the right way here? :)
> > > 
> > > Thank you,
> > > Frederico
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
> > > Sent: Monday, 4 April 2011 11:59
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Using MLT feature
> > > 
> > > Thank you Markus it looks great.
> > > 
> > > But the wiki is not very detailed on this.
> > > Do you mean if I:
> > > 
> > > 1. Create:
> > > <updateRequestProcessorChain name="dedupe">
> > >     <processor
> > >         class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
> > >       <bool name="enabled">true</bool>
> > >       <bool name="overwriteDupes">false</bool>
> > >       <str name="signatureField">signature</str>
> > >       <str name="fields">headline,body,medianame</str>
> > >       <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
> > >     </processor>
> > >     <processor class="solr.LogUpdateProcessorFactory" />
> > >     <processor class="solr.RunUpdateProcessorFactory" />
> > > </updateRequestProcessorChain>
> > > 
> > > 2. Add the request as the default update request
> > > 3. Add a "signature" indexed field to my schema.
> > > 
> > > Then,
> > > when adding a new doc to my index, it is only added if not considered a
> > > duplicate using a Lookup3Signature on the fields defined? All duplicates
> > > are ignored and not added to my index?
> > > Is it as simple as that?
> > > 
> > > Does it work even if the medianame should be an exact match (not a
> > > similar match as the headline and bodytext are)?
> > > 
> > > Thank you for your help,
> > > 
> > > ____________________________________________
> > > Frederico Azeiteiro
> > > Developer
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Monday, 4 April 2011 10:48
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Using MLT feature
> > > 
> > > http://wiki.apache.org/solr/Deduplication
> > > 
> > > On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> > > > Hi,
> > > > 
> > > > The idea is not to index a doc if something similar (headline+bodytext)
> > > > exists for the exact same medianame.
> > > > 
> > > > Do you mean I would need to index the doc first (maybe in a temp index)
> > > > and then use the MLT feature to find similar docs before adding to the
> > > > final index?
> > > > 
> > > > Thanks,
> > > > Frederico
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> > > > Sent: Monday, 4 April 2011 10:22
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Using MLT feature
> > > > 
> > > > Do you want to skip indexing if something is similar, or only if it is
> > > > exact? I would look into a hash code of the document if you don't want
> > > > to index exact duplicates. Similar, though, I think has to be based off
> > > > a document in the index.
> > > > 
> > > > On Apr 4, 2011, at 5:16, Frederico Azeiteiro
> > > > <frederico.azeite...@cision.com> wrote:
> > > > > Hi,
> > > > > 
> > > > > I would like to hear your opinion about the MLT feature and whether
> > > > > it's a good solution for what I need to implement.
> > > > > 
> > > > > My index has fields like: headline, body and medianame.
> > > > > 
> > > > > What I need to do is, before adding a new doc, verify whether a
> > > > > similar doc exists for this media.
> > > > > 
> > > > > My idea is to use the MoreLikeThisHandler
> > > > > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
> > > > > way:
> > > > > 
> > > > > For each new doc, perform an MLT search with q=medianame and
> > > > > stream.body=headline+bodytext.
> > > > > 
> > > > > If no similar docs are found then I can safely add the doc.
> > > > > 
> > > > > Is this feasible using the MLT handler? Is it a good approach? Is
> > > > > there a better way to perform this comparison?
> > > > > 
> > > > > Thank you for your help.
> > > > > 
> > > > > Best regards,
> > > > > 
> > > > > ____________________________________________
> > > > > 
> > > > > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
