Mark Miller wrote:
Thanks for sharing, Marc, that's very nice to know. I'll take your
experience as a starting point for some wiki recommendations.
Sounds like we should add a switch to order alphabetically as well.
On the general note of near-duplicate detection ... I found this paper
in the proceedin
Marc Sturlese wrote:
Hey there,
I found a couple of solutions that work fine for my case (not exactly what
I was looking for at the beginning, but I could adapt them).
First one:
Always use quantum=1 and minTokenLen=2.
Instead of ordering the tokens by frequency, I order them alphabetically; doing
this I am a little more p
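
To make the first solution concrete, here is a minimal sketch in Python of a signature built with quantum=1 and minTokenLen=2 and with the tokens ordered alphabetically. It is not the code from the thread or from TextProfileSignature.java: the function name `alpha_signature`, the ASCII-only tokenizer, the "token count" line layout and the use of MD5 are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter


def alpha_signature(text, min_token_len=2):
    """Hash a text profile whose tokens are ordered alphabetically.

    Hypothetical helper: a sketch of the idea, not Solr's implementation.
    """
    # Lowercase, then keep runs of ASCII letters and digits as tokens
    # (assumed tokenizer), dropping tokens shorter than min_token_len.
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if len(t) >= min_token_len
    ]
    counts = Counter(tokens)
    # quantum = 1 means no rounding: every token keeps its exact count.
    # sorted() over the distinct tokens gives the alphabetical order.
    lines = [f"{tok} {counts[tok]}" for tok in sorted(counts)]
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(alpha_signature("The quick brown fox"))
    print(alpha_signature("fox, BROWN quick the!"))  # same tokens, same hash
```

Because the distinct tokens are sorted by name, the order of the profile is fully determined by the token set, so two texts with the same tokens and counts always hash to the same value regardless of word order, case or punctuation.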
Marc Sturlese wrote:
Hey there, I've been testing and checking the source of
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for large texts where the frequency of
the tokens (the words in lowercase, with just numbers and letters in that
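
As a rough illustration of the frequency-based idea described here, the following Python sketch builds a quantized token-frequency profile and hashes it. It is a simplification under stated assumptions, not a port of TextProfileSignature.java: the names `quantized_profile` and `profile_signature`, the ASCII-only tokenizer, the rounding rule for the quantum and the MD5 hash are choices made for this example.

```python
import hashlib
import re
from collections import Counter


def quantized_profile(text, quant_rate=0.01, min_token_len=2):
    """Return (token, quantized_count) pairs, most frequent first.

    Hypothetical helper: a simplified sketch, not Solr's implementation.
    """
    # Tokens: lowercase runs of ASCII letters and digits (assumed rule).
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if len(t) >= min_token_len
    ]
    counts = Counter(tokens)
    if not counts:
        return []
    # The quantum scales with the most frequent token; never below 1.
    quant = max(1, round(max(counts.values()) * quant_rate))
    profile = []
    for tok, n in counts.items():
        q = (n // quant) * quant  # round the count down to a multiple of quant
        if q >= quant:            # tokens rarer than one quantum are dropped
            profile.append((tok, q))
    # Sort by descending count, breaking ties alphabetically.
    profile.sort(key=lambda p: (-p[1], p[0]))
    return profile


def profile_signature(text, **kwargs):
    """MD5 over the profile, one 'token count' pair per line."""
    profile = quantized_profile(text, **kwargs)
    payload = "\n".join(f"{tok} {freq}" for tok, freq in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "spam spam spam spam eggs eggs ham"
    b = "spam spam spam spam eggs eggs toast"
    # With a coarse quantum the rare words are dropped and the hashes match.
    print(profile_signature(a, quant_rate=0.5) == profile_signature(b, quant_rate=0.5))
```

The design point the sketch tries to show is that a larger quantum ignores rare tokens, so long texts that differ in only a few words can collapse to the same signature, while short texts have little frequency information to quantize.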
>>
>> I have my own duplication system to detect that but I use String
>> comparison
>> so it works really slow...
>>
What are you doing for the String comparison? Not exact, right?
Hey,
My comparison method looks for similar texts (not just exact matches)... what I do is
compare two texts word by word. Wh
I have my own duplication system to detect that but I use String
comparison
so it works really slow...
What are you doing for the String comparison? Not exact, right?
Have you tried the tuning params for TextProfileSignature? I probably
have to update the dedupe wiki.
You can set the quantRate and the minTokenLength. Those are the
variable names, and you set them right alongside signatureClass,
signatureField, fields, etc.
Whether or not you can tune it to me