TextProfileSignature using deduplication

Marc Sturlese Tue, 18 Nov 2008 05:03:51 -0800

Hey there, I've been testing TextProfileSignature.java and reading its
source, with the aim of avoiding similar entries at indexing time.
As far as I understand, it is useful for very long texts where the frequency
of the tokens (the lowercased words, keeping only letters and digits in
that case) matters. If you want to detect duplicates in short texts without
giving much weight to the frequencies, it doesn't work...
The hash is built only from the terms whose frequency is at least a
QUANTUM (whose value is derived from the maximum frequency among all the
terms). So it will say that:


aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn

are duplicates, because the quantum here would be 2 and the frequency of aaa
would be 2 as well. So only the term aaa would be used to build the hash.

In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo

Here the quantum would be 1 and the frequency of every term would be 1, so
all terms would be used for the hash. It will consider these two strings not
similar.
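
The two cases above can be reproduced with a minimal sketch of the
profile-building step as described in this message. This is not the Solr
source: the class name ProfileSketch, the QUANT_RATE value of 0.01 and the
rounding rule are assumptions for illustration, and the sketch only lists
the terms that would feed the hash, it does not compute the hash itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Solr.
class ProfileSketch {
    // Assumed rate for illustration; check the defaults of your Solr version.
    static final float QUANT_RATE = 0.01f;

    // Count lowercased tokens made only of letters and digits.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!tok.isEmpty()) {
                freq.merge(tok, 1, Integer::sum);
            }
        }
        return freq;
    }

    // Quantum derived from the maximum frequency (assumed rule):
    // 2 as soon as any term repeats, 1 when every term occurs once.
    static int quantum(Map<String, Integer> freq) {
        int maxFreq = 0;
        for (int f : freq.values()) {
            maxFreq = Math.max(maxFreq, f);
        }
        int quant = Math.round(maxFreq * QUANT_RATE);
        if (quant < 2) {
            quant = (maxFreq > 1) ? 2 : 1;
        }
        return quant;
    }

    // Terms whose frequency reaches the quantum, in alphabetical order
    // (the ordering is only there to make the output stable).
    static List<String> profile(String text) {
        Map<String, Integer> freq = countTokens(text);
        int quant = quantum(freq);
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : new TreeMap<>(freq).entrySet()) {
            if (e.getValue() >= quant) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

Under these assumptions, ProfileSketch.profile(...) returns just [aaa] for
both strings of the first pair, so they would hash identically. For the
second pair it returns nine terms per string, and the two lists differ in
one term (aaa versus apa), so the hashes would differ.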

As I understand the algorithm, there's no way to make it recognize that the
two strings in my second case are similar. I wish I were wrong...

I have my own deduplication system that detects this, but it uses String
comparison, so it is really slow... I would like to know if
TextProfileSignature can be tuned to do that.

I don't know if I should post this here or in the developers forum...

Thanks in advance
-- 
View this message in context: 
http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20559155.html
Sent from the Solr - User mailing list archive at Nabble.com.
