Re: TextProfileSigature using deduplication

2008-11-20 Thread Andrzej Bialecki
Mark Miller wrote: Thanks for sharing, Marc, that's very nice to know. I'll take your experience as a starting point for some wiki recommendations. Sounds like we should add a switch to order alphabetically as well. On the general note of near-duplicate detection ... I found this paper in the proceedin

Re: TextProfileSigature using deduplication

2008-11-20 Thread Mark Miller
Thanks for sharing, Marc, that's very nice to know. I'll take your experience as a starting point for some wiki recommendations. Sounds like we should add a switch to order alphabetically as well. Marc Sturlese wrote: Hey there, I found a couple of solutions that work fine for my case (it is not exactly what

Re: TextProfileSigature using deduplication

2008-11-20 Thread Marc Sturlese
Hey there, I found a couple of solutions that work fine for my case (it is not exactly what I was looking for at the beginning, but I could adapt it). First one: always use quantum=1 and minTokenLen=2. Instead of ordering the tokens by frequency, I order them alphabetically; doing this I am a little more p
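
The first approach in the message above (quantum=1, minTokenLen=2, tokens ordered alphabetically rather than by frequency) could be sketched roughly as follows. This is an illustrative Python sketch and not the Solr code: the function name `alpha_signature`, the tokenizing regex, the choice to hash "token count" pairs, and the use of MD5 are all assumptions made for the example.

```python
import hashlib
import re
from collections import Counter


def alpha_signature(text, min_token_len=2):
    """Hash an alphabetically ordered token profile (hypothetical helper)."""
    # Assumption: a token is a lowercase run of ASCII letters and digits.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    # quantum=1 means the raw counts are kept unchanged.
    # Sorting by token makes the profile independent of word order.
    payload = "\n".join(f"{tok} {counts[tok]}" for tok in sorted(counts))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

With this sketch, two texts get the same signature only when they contain the same tokens of at least two characters with the same counts, regardless of case, punctuation, or word order.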

Re: TextProfileSigature using deduplication

2008-11-18 Thread Ken Krugler
Marc Sturlese wrote: Hey there, I've been testing and checking the source of TextProfileSignature.java to avoid similar entries at indexing time. What I understood is that it is useful for huge texts where the frequency of the tokens (the words in lowercase, with just numbers and letters in that

Re: TextProfileSigature using deduplication

2008-11-18 Thread Andrzej Bialecki
Marc Sturlese wrote: Hey there, I've been testing and checking the source of TextProfileSignature.java to avoid similar entries at indexing time. What I understood is that it is useful for huge texts where the frequency of the tokens (the words in lowercase, with just numbers and letters in that
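
As a rough illustration of the idea described above (lowercased tokens made of letters and digits, profiled by frequency), a frequency-based signature might look like the sketch below. It is a simplified Python approximation written for this thread, not the actual TextProfileSignature.java logic: the function name `profile_signature`, the exact quantization rule, the tie-breaking order, and the MD5 hash are assumptions.

```python
import hashlib
import re
from collections import Counter


def profile_signature(text, quant_rate=0.01, min_token_len=2):
    """Hash a quantized token-frequency profile (hypothetical helper)."""
    # Assumption: a token is a lowercase run of ASCII letters and digits.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    # Assumed quantization step: proportional to the highest count, at least 1.
    quant = max(1, round(max(counts.values()) * quant_rate))
    profile = []
    for tok, freq in counts.items():
        rounded = (freq // quant) * quant  # round down to a multiple of quant
        if rounded >= quant:               # rare tokens drop out of the profile
            profile.append((tok, rounded))
    # Most frequent tokens first; ties broken by token so the result is stable.
    profile.sort(key=lambda p: (-p[1], p[0]))
    payload = "\n".join(f"{tok} {freq}" for tok, freq in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

In this sketch a larger `quant_rate` makes the signature more tolerant: tokens that occur rarely relative to the most frequent token are dropped, so long texts that differ only in a few rare words hash to the same value, while short texts have little to quantize.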

Re: TextProfileSigature using deduplication

2008-11-18 Thread Marc Sturlese
>> I have my own duplication system to detect that but I use String comparison so it works really slow...
> What are you doing for the String comparison? Not exact, right?
Hey, my comparison method looks for similar (not just exact)... what I do is compare two texts word by word. Wh

Re: TextProfileSigature using deduplication

2008-11-18 Thread Mark Miller
> I have my own duplication system to detect that but I use String comparison so it works really slow...
What are you doing for the String comparison? Not exact, right?

Re: TextProfileSigature using deduplication

2008-11-18 Thread Mark Miller
Have you tried the tuning params for TextProfileSignature? I probably have to update the dedupe wiki. You can set the quantRate and the minTokenLength. Those are the variable names, and you set them right alongside signatureClass, signatureField, fields, etc. Whether or not you can tune it to me