Thanks for sharing, Marc, that's very good to know. I'll take your
experience as a starting point for some wiki recommendations.
Sounds like we should add a switch to order alphabetically as well.
Marc Sturlese wrote:
Hey there,
I found a couple of solutions that work fine for my case (not exactly what
I was looking for at the beginning, but I could adapt it).
First one:
Always use quantum=1 and minTokenLen=2.
Instead of ordering the tokens by frequency, I order them alphabetically; doing
this I am a little more permissive (there will be more duplicates than when
ordering by frequency).
Using minTokenLen=3 could be useful here, depending on the use case.
Second one:
Order tokens alphabetically.
Use minTokenLen=2 and quantum=2 for all fields but one; in that one I use
quantum=1. The field that uses quantum=1 makes the check more restrictive,
but I know that if it doesn't match, the docs can't be considered
duplicates.
These are two ways to detect duplicates in texts that aren't huge. It is
specific to my case, but the idea of giving a different quantum to different
fields could be helpful in other cases as well.
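Marc's second scheme - alphabetical ordering plus a stricter quantum on one
field - could be sketched roughly like this. This is an illustration of the
idea from the thread, not the actual TextProfileSignature code; the field
names, the duplicate rule, and the MD5 step are all assumptions:

```java
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Rough sketch of the per-field idea: each field gets its own quantum,
// tokens shorter than minTokenLen are dropped, and the surviving tokens
// are concatenated in alphabetical order (TreeMap iteration) before
// hashing. Two documents count as duplicates only if every per-field
// signature matches, so the quantum=1 field is the restrictive one.
public class PerFieldSignature {

    public static String fieldSignature(String text, int quantum, int minTokenLen) {
        Map<String, Integer> freq = new TreeMap<>(); // alphabetical order
        for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
            if (tok.length() >= minTokenLen) freq.merge(tok, 1, Integer::sum);
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() >= quantum) profile.append(e.getKey()).append(' ');
        }
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(profile.toString().getBytes());
            StringBuilder hex = new StringBuilder();
            for (byte b : d) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Duplicates only if all fields agree; the strict field uses quantum=1.
    public static boolean duplicates(Map<String, String> a, Map<String, String> b,
                                     Map<String, Integer> quanta, int minTokenLen) {
        for (String field : a.keySet()) {
            int q = quanta.get(field);
            if (!fieldSignature(a.get(field), q, minTokenLen)
                    .equals(fieldSignature(b.get(field), q, minTokenLen))) {
                return false;
            }
        }
        return true;
    }
}
```

With quantum=2 on the body field, only frequent terms reach the hash, so
small wording changes are tolerated; the quantum=1 field must match exactly
at the token level.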
Ken Krugler wrote:
Marc Sturlese wrote:
Hey there, I've been testing and checking the source of
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for huge texts, where the frequency
of the tokens (the words lowercased, keeping just numbers and letters in
that case) is important. If you want to detect duplicates in texts that
aren't huge, without giving a lot of importance to the frequencies, it
doesn't work...
The hash is made just with the terms whose frequency is higher than a
QUANTUM (whose value is derived from the max frequency among all the
terms). So it will say that:
aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn
are duplicates, because quantum here would be 2 and the frequency of aaa
would be 2 as well. So only the term aaa would be used to make the hash.
In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo
Here quantum would be 1 and the frequencies of all terms would be 1, so
all terms would be used for the hash. It will consider these two strings
not similar.
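Both examples can be reproduced with a small sketch of the quantum
filtering described above (a simplification for illustration, not the real
TextProfileSignature implementation; the token splitting and the keep rule
are assumptions based on this thread):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified version of the behaviour described above: count token
// frequencies, then keep only tokens whose frequency reaches the quantum.
// The kept tokens are what the signature hash would be built from.
public class QuantumFilter {

    public static List<String> profile(String text, int quantum) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            freq.merge(tok, 1, Integer::sum);
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() >= quantum) kept.add(e.getKey());
        }
        return kept;
    }
}
```

With quantum=2, both strings in the first pair collapse to just [aaa], so
their hashes match. In the second pair quantum is 1, every token survives,
and the single differing token (aaa vs apa) changes the whole hash.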
As I understand the algorithm, there's no way to make it recognize that
in my second case both strings are similar. I wish I were wrong...
I have my own deduplication system to detect that, but I use string
comparison, so it works really slowly... I would like to know if there is
any tuning possibility to do that with TextProfileSignature.
Don't know if I should post this here or in the developers forum...
Hi Marc,
TextProfileSignature is a rather crude
implementation of approximate similarity, and as
you pointed out it's best suited for large
texts. The original purpose of this Signature
was to deduplicate web pages in large amounts of
crawled pages (in Nutch), where it worked
reasonably well. Its advantage is also that it's
easy to compute and doesn't require multiple
passes over the corpus.
As it is implemented now, it breaks badly in the
case you describe. You could modify this
implementation to also include word-level
ngrams, i.e. sequences of more than one word, up
to N (e.g. 5) - this should work in your case.
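Ken's ngram suggestion could be sketched with a generic shingling helper
like the one below (illustrative only, not code from TextProfileSignature):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the suggestion above: in addition to single tokens, emit
// word-level ngrams up to length maxN, so local word order contributes
// to the profile and a one-token difference no longer changes every
// element that feeds the signature.
public class WordNgrams {

    public static List<String> ngrams(String text, int maxN) {
        String[] toks = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= toks.length; i++) {
                out.add(String.join(" ", Arrays.copyOfRange(toks, i, i + n)));
            }
        }
        return out;
    }
}
```

For Marc's second pair of strings, the unigrams and bigrams not touching
the differing first token would still match, so a signature built over
these ngrams degrades gracefully instead of flipping entirely.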
Ultimately, what you are probably looking for is
a shingle-based algorithm, but it's relatively
costly and requires multiple passes.
There's an intermediate approach we use...
* Generate separate hashes for each of the quantized bands
* Create additional fingerprint values (depends on the nature of the data)
* Find potentially similar files using the above
* Then apply an accurate but slower comparison to determine true
similarity
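The banded-hash step of that pipeline could be sketched as follows. This
is purely illustrative: the band assignment, the hashing, and the idea of
counting shared bands as the candidate filter are assumptions based on the
description above, not Krugle's actual code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the banded approach: split terms into frequency bands, hash
// each band separately, and treat documents that share enough band hashes
// as candidates for the slower, accurate comparison. A term drifting into
// a neighbouring band breaks one band hash, not all of them.
public class BandedFingerprints {

    public static List<Integer> bandHashes(String text, int bands) {
        Map<String, Integer> freq = new HashMap<>();
        int maxFreq = 0;
        for (String tok : text.toLowerCase().split("\\s+")) {
            maxFreq = Math.max(maxFreq, freq.merge(tok, 1, Integer::sum));
        }
        // One token list per frequency band.
        List<List<String>> perBand = new ArrayList<>();
        for (int b = 0; b < bands; b++) perBand.add(new ArrayList<>());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int band = (e.getValue() - 1) * bands / maxFreq; // 0..bands-1
            perBand.get(Math.min(band, bands - 1)).add(e.getKey());
        }
        // Sort each band so the hash is order-independent.
        List<Integer> hashes = new ArrayList<>();
        for (List<String> tokens : perBand) {
            tokens.sort(String::compareTo);
            hashes.add(String.join(" ", tokens).hashCode());
        }
        return hashes;
    }

    // Candidate filter: how many band hashes two documents share.
    public static int sharedBands(List<Integer> a, List<Integer> b) {
        int shared = 0;
        for (int i = 0; i < a.size(); i++) {
            if (a.get(i).equals(b.get(i))) shared++;
        }
        return shared;
    }
}
```

Documents sharing most of their band hashes would then go through the
accurate but slower comparison; the number of matching values is what the
Lucene scoring step mentioned below effectively counts.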
From our data, it's common to get files where
(due to small text changes) the frequency of a
term moves between quantized bands. This then
changes the über hash that you get from combining
all terms, but with 10 or so bands we still get
some matches on the hashes from the individual
bands.
The "find potentially similar files" uses a
simple Lucene scoring function, based on the
number of matching fingerprint values.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"