If you check the code for TextProfileSignature [1], you'll notice that the init method reads the params. You can set those params as you did. Reading the Javadoc [2] might help as well. What the Javadoc does not document is how QUANT is computed: the value is rounded.
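To illustrate the rounding, here is a minimal Python sketch of how a TextProfileSignature-style profile could be computed. It is not the Solr source: the function names, the default values, the `>=` length test, the clamping of quant and the MD5 step are all assumptions made for illustration, so check [1] before porting anything to C#.

```python
import hashlib
import math
from collections import Counter


def text_profile(text, quant_rate=0.01, min_token_len=2):
    """Build a quantized token profile (illustrative, not the Solr source)."""
    # Split on anything that is not a letter or digit, lowercasing as we go.
    tokens, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))

    # Assumption: tokens shorter than min_token_len are dropped.
    counts = Counter(t for t in tokens if len(t) >= min_token_len)
    if not counts:
        return []
    max_freq = max(counts.values())

    # QUANT is rounded (half up here), then clamped (assumed behaviour).
    quant = int(math.floor(max_freq * quant_rate + 0.5))
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant; drop what falls below it.
    profile = []
    for token, count in counts.items():
        quantized = (count // quant) * quant
        if quantized >= quant:
            profile.append((token, quantized))
    profile.sort(key=lambda item: (-item[1], item[0]))
    return profile


def signature(text, **params):
    """Hash the profile; MD5 is an assumption chosen for the sketch."""
    lines = [f"{token} {count}" for token, count in text_profile(text, **params)]
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


a = "Army deploys as clan war kills 11 in Philippine south"
c = a + " the."
print(signature(a) == signature(a + "."))  # True
print(signature(a, min_token_len=4) == signature(c, min_token_len=4))  # True
```

Under these assumptions the sketch behaves as described in the thread below: a trailing period does not change the signature, the extra "the" does unless minTokenLen is raised, and changing quantRate has no effect on a text in which every token occurs once.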
[1]: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/src/java/org/apache/solr/update/processor/TextProfileSignature.java?view=markup
[2]: http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html

On Tuesday 05 April 2011 16:10:08 Frederico Azeiteiro wrote:
> Thank you, I'll try to create a C# method to create the same sig as SOLR,
> and then compare both sigs before indexing the doc. This way I can avoid
> indexing existing docs.
>
> If anyone needs to use this parameter (as this info is not on the wiki),
> you can add the option
>
> <str name="minTokenLen">5</str>
>
> to the processor tag.
>
> Best regards,
> Frederico
>
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, 5 April 2011 12:01
> To: solr-user@lucene.apache.org
> Cc: Frederico Azeiteiro
> Subject: Re: Using MLT feature
>
> On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
> > Sorry, the reply I made yesterday was directed to Markus and not the
> > list...
> >
> > Here are my thoughts on this. At this point I'm a little confused about
> > whether SOLR is a good option to find near-duplicate docs.
> >
> > >> Yes there is, try setting overwriteDupes to true and documents
> > >> yielding the same signature will be overwritten
> >
> > The problem is that I don't want to overwrite the doc, I need to
> > maintain the original version (because the doc has other fields I need
> > to maintain).
> >
> > >> If you need both fuzzy and exact matching then add a second
> > >> update processor inside the chain and create another signature field.
> >
> > I just need the fuzzy search, but the quick tests I made return
> > different signatures for what I consider duplicate docs.
> > "Army deploys as clan war kills 11 in Philippine south"
> > "Army deploys as clan war kills 11 in Philippine south."
> >
> > Same sig for the above 2 strings, that's ok.
> >
> > But a different sig was created for:
> > "Army deploys as clan war kills 11 in Philippine south the."
> >
> > Is there a way to set up the TextProfileSignature parameters to adjust
> > the "sensitivity" on SOLR (QUANT_RATE or MIN_TOKEN_LEN)?
> >
> > Do you think that these parameters can help create the same sig for
> > the above example?
>
> You can only fix this by increasing minTokenLen to 4 to prevent `the` from
> being added to the list of tokens but this may affect other signatures.
> Possibly more documents will then get the same signature. Messing around
> with quantRate won't do much good because all your tokens have the same
> frequency (1) so quant will always be 1 in this short text. That's why
> TextProfileSignature works less well for short texts.
>
> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/crawl/TextProfileSignature.html
>
> > Is anyone using the TextProfileSignature with success?
> >
> > Thank you,
> > Frederico
> >
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Monday, 4 April 2011 16:47
> > To: solr-user@lucene.apache.org
> > Cc: Frederico Azeiteiro
> > Subject: Re: Using MLT feature
> >
> > > Hi again,
> > > I guess I was wrong in my earlier post... There's no automated way to
> > > avoid indexing the duplicate doc.
> >
> > Yes there is, try setting overwriteDupes to true and documents yielding
> > the same signature will be overwritten. If you need both fuzzy and exact
> > matching then add a second update processor inside the chain and create
> > another signature field.
> >
> > > I guess I have 2 options:
> > >
> > > 1. Create a temp index with signatures and then have an app that for
> > > each new doc verifies if the sig exists on my primary index. If not,
> > > add the article.
> > >
> > > 2. Before adding the doc, create a signature (using the same algorithm
> > > that SOLR uses) in my indexing app and then verify if the signature
> > > exists before adding.
> > >
> > > Am I thinking the right way here? :)
> > >
> > > Thank you,
> > > Frederico
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
> > > Sent: Monday, 4 April 2011 11:59
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Using MLT feature
> > >
> > > Thank you Markus, it looks great.
> > >
> > > But the wiki is not very detailed on this.
> > > Do you mean if I:
> > >
> > > 1. Create:
> > > <updateRequestProcessorChain name="dedupe">
> > >   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
> > >     <bool name="enabled">true</bool>
> > >     <bool name="overwriteDupes">false</bool>
> > >     <str name="signatureField">signature</str>
> > >     <str name="fields">headline,body,medianame</str>
> > >     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
> > >   </processor>
> > >   <processor class="solr.LogUpdateProcessorFactory" />
> > >   <processor class="solr.RunUpdateProcessorFactory" />
> > > </updateRequestProcessorChain>
> > >
> > > 2. Add the request as the default update request
> > > 3. Add a "signature" indexed field to my schema.
> > >
> > > Then,
> > > when adding a new doc to my index, it is only added if not considered a
> > > duplicate using a Lookup3Signature on the fields defined? All duplicates
> > > are ignored and not added to my index?
> > > Is it as simple as that?
> > >
> > > Does it work even if the medianame should be an exact match (not a
> > > similar match as the headline and bodytext are)?
> > >
> > > Thank you for your help,
> > >
> > > ____________________________________________
> > > Frederico Azeiteiro
> > > Developer
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Monday, 4 April 2011 10:48
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Using MLT feature
> > >
> > > http://wiki.apache.org/solr/Deduplication
> > >
> > > On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> > > > Hi,
> > > >
> > > > The idea is: don't index if something similar (headline+bodytext)
> > > > exists for the same exact medianame.
> > > >
> > > > Do you mean I would need to index the doc first (maybe in a temp
> > > > index) and then use the MLT feature to find similar docs before
> > > > adding to the final index?
> > > >
> > > > Thanks,
> > > > Frederico
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> > > > Sent: Monday, 4 April 2011 10:22
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Using MLT feature
> > > >
> > > > Do you want to not index if something similar? Or don't index if
> > > > exact. I would look into a hash code of the document if you don't
> > > > want to index exact. Similar though, I think has to be based off a
> > > > document in the index.
> > > >
> > > > On Apr 4, 2011, at 5:16, Frederico Azeiteiro
> > > > <frederico.azeite...@cision.com> wrote:
> > > > > Hi,
> > > > >
> > > > > I would like to hear your opinion about the MLT feature and if
> > > > > it's a good solution to what I need to implement.
> > > > >
> > > > > My index has fields like: headline, body and medianame.
> > > > >
> > > > > What I need to do is, before adding a new doc, verify if a similar
> > > > > doc exists for this media.
> > > > >
> > > > > My idea is to use the MoreLikeThisHandler
> > > > > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
> > > > > way:
> > > > >
> > > > > For each new doc, perform an MLT search with q= medianame and
> > > > > stream.body=headline+bodytext.
> > > > >
> > > > > If no similar docs are found then I can safely add the doc.
> > > > >
> > > > > Is this feasible using the MLT handler? Is it a good approach? Is
> > > > > there a better way to perform this comparison?
> > > > >
> > > > > Thank you for your help.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > ____________________________________________
> > > > >
> > > > > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350