Have you tried the tuning params for TextProfileSignature? I probably
have to update the dedupe wiki.
You can set quantRate and minTokenLen. Those are the
parameter names, and you set them right alongside signatureClass,
signatureField, fields, etc.
Whether or not you can tune it to meet your needs I am not quite sure.
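For what it's worth, here is a rough Python sketch of why tuning may not help much on short text. It assumes the quantum is round(maxFreq * quantRate), raised to 2 when some term repeats and to 1 otherwise, which is how I read the description quoted below. The function name is made up and this is not the Solr code.

```python
def quantum(max_freq, quant_rate):
    """Assumed rule for the frequency quantum (illustrative only)."""
    # Round half up, as an integer.
    quant = int(max_freq * quant_rate + 0.5)
    # Assumed floor: never below 2 if any term repeats, else 1.
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    return quant


if __name__ == "__main__":
    for max_freq in (1, 2, 100):
        for rate in (0.01, 0.1, 1.0):
            print(max_freq, rate, quantum(max_freq, rate))
```

Under that assumption, a text with no repeated terms always gets a quantum of 1, whatever quantRate is, so every term ends up in the hash.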
There are quite a few more advanced fuzzy hash algorithms out there, but
frankly, most of them are still just making my head hurt. Hope to see
some of them in Solr at some point though. The rolling hash spamsum alg
looks like it might be fairly doable... I've got half a dozen PDF papers
on other algorithms as well, but they are no joke for me to implement.
- Mark
Marc Sturlese wrote:
Hey there, I've been testing TextProfileSignature.java and reading its
source, trying to avoid similar entries at indexing time.
What I understood is that it is useful for large texts, where the frequency of
the tokens (here, the lowercased words, keeping only letters and digits)
is important. If you want to detect duplicates in short texts without
giving much weight to the frequencies, it doesn't work...
The hash is built only from the terms whose frequency is at least a
QUANTUM (whose value is derived from the max frequency among all the
terms). So it will say that:
aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn
are duplicates, because the quantum here would be 2 and the frequency of aaa
would be 2 as well. So only the term aaa would be used to build the hash.
In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo
Here the quantum would be 1 and the frequency of every term would be 1, so all
terms would be used for the hash. It will consider these two strings not
similar.
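To make the two cases concrete, here is a minimal Python sketch of the profile step as I described it. It is not the Solr code: the function names are mine, and the quantum rule (round maxFreq * quantRate, then raise it to 2 if any term repeats, else 1), the default values, the "term count" line format and the final MD5 are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter


def profile(text, quant_rate=0.01, min_token_len=2):
    """Return the (term, quantized frequency) pairs that would be hashed."""
    # Tokens: lowercased runs of letters/digits longer than min_token_len.
    tokens = [t for t in re.findall(r"[^\W_]+", text.lower())
              if len(t) > min_token_len]
    freqs = Counter(tokens)
    if not freqs:
        return []
    max_freq = max(freqs.values())
    # Assumed quantum rule (see the lead-in above).
    quant = int(max_freq * quant_rate + 0.5)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    kept = []
    for term, count in freqs.items():
        rounded = (count // quant) * quant  # round down to a multiple of quant
        if rounded >= quant:                # drop terms below the quantum
            kept.append((term, rounded))
    # Sort by frequency (descending), then by term, for a stable profile.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept


def signature(text, **kwargs):
    """Hash the profile; MD5 is only a stand-in for the real digest."""
    body = "\n".join(f"{term} {count}" for term, count in profile(text, **kwargs))
    return hashlib.md5(body.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "aaa sss ddd fff ggg hhh aaa kkk lll ooo"
    b = "aaa xxx iii www qqq aaa jjj eee zzz nnn"
    c = "aaa sss ddd fff ggg hhh kkk lll ooo"
    d = "apa sss ddd fff ggg hhh kkk lll ooo"
    print(signature(a) == signature(b))  # first pair
    print(signature(c) == signature(d))  # second pair
```

With these assumptions the first pair reduces to the single profile entry for aaa, so both strings hash the same, while the second pair keeps all nine terms and differs in one of them, so the hashes differ.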
As I understand the algorithm, there's no way to make it recognize that in
my second case both strings are similar. I wish I were wrong...
I have my own duplicate detection system that handles that, but it uses String
comparison, so it is really slow... I would like to know if there is any tuning
option to do that with TextProfileSignature.
Don't know if I should post this here or in the developers forum...
Thanks in advance