Re: TextProfileSigature using deduplication

Marc Sturlese Thu, 20 Nov 2008 04:26:53 -0800

Hey there,
I found couple of solutions that work fine  for my case (is not exacly what
I was looking for at the begining but I could adapt it).


First one:
Use always quantum=1 and minTokenLen=2.
Instead of order the tokens by frequency, I order them alphabetically, doing
this I am a little more permissive (there will be more duplicates that
ordering by freq).
Use minTokenLen=3 could be usefull in here, depending on the use case

Second one
Oreder tokens alphabetically
Use minToken=2 quantum 2 for all fields less one. In this one I use
quantum=1. This filed that uses quantum=1 will make me be more restrictive
but I know that if it doesn't match the docs can't be considered as
duplicated.

These are 2 ways to detect duplicates in not huge texts. It is specific for
my case but the idea of giving different quantum to different fields could
be helpful in other cases aswell





 

Ken Krugler wrote:
> 
>>Marc Sturlese wrote:
>>>Hey there, I've been testing and checking the source of the
>>>TextProfileSignature.java to avoid similar entries at indexing time.
>>>What I understood is that it is useful for huge text where the frequency
of
>>>the tokens (the words in lowercase just with number and leters in taht
case)
>>>is important. If you want to detect duplicates in not huge text and not
>>>giving a lot of importance to the frequencies it doesn't work...
>>>The hash will be made just with the terms wich frequency is higher than a
>>>QUANTUM (which value is given in function of the max freq between all the
>>>terms). So it will say that:
>>>
>>>aaa sss ddd fff ggg hhh aaa kkk lll ooo
>>>aaa xxx iii www qqq aaa jjj eee zzz nnn
>>>
>>>are duplicates because quantum here wolud be 2 and the frequency of aaa
>>>would be 2 aswell. So, to make the hash just the term aaa would be used.
>>>
>>>In this case:
>>>aaa sss ddd fff ggg hhh kkk lll ooo
>>>apa sss ddd fff ggg hhh kkk lll ooo
>>>
>>>Here quantum would be 1 and the frequencies of all terms would be 1 so
all
>>>terms would be use for the hash. It will consider this two strings not
>>>similar.
>>>
>>>As I understood the algorithm there's no way to make it understand that
in
>>>my second case both strings are similar. I wish i were wrong...
>>>
>>>I have my own duplication system to detect that but I use String
comparison
>>>so it works really slow... Would like to know if there is any tuning
>>>possibility to do that with TextProfileSignature
>>>Don't know if I should pot this here or in the developers forum...
>>
>>Hi Marc,
>>
>>TextProfileSignature is a rather crude 
>>implementation of approximate similarity, and as 
>>you pointed out it's best suited for large 
>>texts. The original purpose of this Signature 
>>was to deduplicate web pages in large amounts of 
>>crawled pages (in Nutch), where it worked 
>>reasonably well. Its advantage is also that it's 
>>easy to compute and doesn't require multiple 
>>passes over the corpus.
>>
>>As it is implemented now, it breaks badly in the 
>>case you describe. You could modify this 
>>implementation to include also word-level 
>>ngrams, i.e. sequences of more than 1 word, up 
>>to N (e.g. 5) - this should work in your case.
>>
>>Ultimately, what you are probably looking for is 
>>a shingle-based algorithm, but it's relatively 
>>costly and requires multiple passes.
> 
> There's an intermediate approach we use...
> 
> * Generate separate hashes for each of the quantized bands
> * Create additional fingerprint values (depends on the nature of the data)
> * Find potentially similar files using the above
> * Then apply an accurate but slower comparison to determine true
> similarity
> 
> From our data, it's common to get files where 
> (due to small text changes) the frequency of a 
> term moves between quantized bands. This then 
> changes the über hash that you get from combining 
> all terms, but with 10 or so bands we still get 
> some matches on the hashes from the individual 
> bands.
> 
> The "find potentially similar files" uses a 
> simple Lucene scoring function, based on the 
> number of matching fingerprint values.
> 
> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"
> 
> 

-- 
View this message in context: 
http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20600118.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: TextProfileSigature using deduplication

Reply via email to