Hey there,

I've been testing TextProfileSignature.java and reading its source, in order to avoid indexing near-duplicate entries. What I understood is that it is useful for huge texts, where the frequency of the tokens (in this case, the lowercased words made up only of letters and digits) is important. If you want to detect duplicates in short texts, without giving much weight to the frequencies, it doesn't work...

The hash is built only from the terms whose frequency is at least a QUANTUM (whose value is derived from the maximum frequency among all the terms). So it will say that:

    aaa sss ddd fff ggg hhh aaa kkk lll ooo
    aaa xxx iii www qqq aaa jjj eee zzz nnn

are duplicates, because the quantum here would be 2 and the frequency of aaa would be 2 as well. So only the term aaa would be used to build the hash.

In this case:

    aaa sss ddd fff ggg hhh kkk lll ooo
    apa sss ddd fff ggg hhh kkk lll ooo

the quantum would be 1 and the frequency of every term would be 1, so all the terms would be used for the hash. It will consider these two strings not similar.

As I understand the algorithm, there's no way to make it recognise that the two strings in my second case are similar. I wish I were wrong...

I have my own deduplication system that detects this, but it uses String comparison, so it runs really slowly... I would like to know whether TextProfileSignature can be tuned to do that.

I don't know if I should post this here or in the developers forum...

Thanks in advance

--
View this message in context: http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20559155.html
Sent from the Solr - User mailing list archive at Nabble.com.
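
P.S. To make the two cases above concrete, here is a minimal sketch of the profile step as I understand it. It is not the Solr source: the class and method names (ProfileSketch, countTokens, quantum, profile) are made up for illustration, and the quantum rule is an assumption reduced to what the two examples need (2 if any term repeats, otherwise 1).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; NOT the actual TextProfileSignature code.
final class ProfileSketch {

    // Lowercase the text and count each run of letters/digits as one token.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        StringBuilder cur = new StringBuilder();
        // The appended space flushes the last token.
        for (char c : (text + " ").toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                counts.merge(cur.toString(), 1, Integer::sum);
                cur.setLength(0);
            }
        }
        return counts;
    }

    // Assumption for short texts like the examples:
    // quantum is 2 when some term repeats, otherwise 1.
    static int quantum(Map<String, Integer> counts) {
        int maxFreq = 0;
        for (int n : counts.values()) {
            maxFreq = Math.max(maxFreq, n);
        }
        return maxFreq > 1 ? 2 : 1;
    }

    // Terms whose frequency reaches the quantum, sorted alphabetically.
    // The hash would be computed from this list only.
    static List<String> profile(String text) {
        Map<String, Integer> counts = countTokens(text);
        int q = quantum(counts);
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= q) {
                kept.add(e.getKey());
            }
        }
        Collections.sort(kept);
        return kept;
    }

    public static void main(String[] args) {
        // First case: both profiles reduce to the single term aaa.
        System.out.println(profile("aaa sss ddd fff ggg hhh aaa kkk lll ooo"));
        System.out.println(profile("aaa xxx iii www qqq aaa jjj eee zzz nnn"));
        // Second case: every term is kept, so aaa vs apa changes the profile.
        System.out.println(profile("aaa sss ddd fff ggg hhh kkk lll ooo"));
        System.out.println(profile("apa sss ddd fff ggg hhh kkk lll ooo"));
    }
}
```

With this sketch, the two strings of the first case give the same profile (just aaa), so they would hash identically, while the two strings of the second case give profiles that differ in one term, so their hashes would differ.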