Thanks for sharing, Marc, that's very good to know. I'll take your
experience as a starting point for some wiki recommendations.
Sounds like we should add a switch to order alphabetically as well.
Marc Sturlese wrote:
Hey there,
I found a couple of solutions that work fine for my case (not exactly what
I was looking for at the beginning, but I could adapt it).
First one:
Always use quantum=1 and minTokenLen=2.
Instead of ordering the tokens by frequency, I order them alphabetically; doing
this I am a little more permissive (there will be more duplicates than when
ordering by frequency).
Using minTokenLen=3 could be useful here, depending on the use case.
Second one:
Order tokens alphabetically.
Use minTokenLen=2 and quantum=2 for all fields but one; in that one I use
quantum=1. The field that uses quantum=1 makes the check more restrictive,
but I know that if it doesn't match, the docs can't be considered
duplicates.
These are two ways to detect duplicates in texts that aren't huge. It is
specific to my case, but the idea of giving a different quantum to different
fields could be helpful in other cases as well.
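Marc's second scheme - alphabetical ordering plus a stricter quantum on one
field - could be sketched roughly like this. This is an illustration of the
idea from the thread, not the actual TextProfileSignature code; the field
names, the duplicate rule, and the MD5 step are all assumptions:

```java
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Rough sketch of the per-field idea: each field gets its own quantum,
// tokens shorter than minTokenLen are dropped, and the surviving tokens
// are concatenated in alphabetical order (TreeMap iteration) before
// hashing. Two documents count as duplicates only if every per-field
// signature matches, so the quantum=1 field is the restrictive one.
public class PerFieldSignature {

    public static String fieldSignature(String text, int quantum, int minTokenLen) {
        Map<String, Integer> freq = new TreeMap<>(); // alphabetical order
        for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
            if (tok.length() >= minTokenLen) freq.merge(tok, 1, Integer::sum);
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() >= quantum) profile.append(e.getKey()).append(' ');
        }
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(profile.toString().getBytes());
            StringBuilder hex = new StringBuilder();
            for (byte b : d) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Duplicates only if all fields agree; the strict field uses quantum=1.
    public static boolean duplicates(Map<String, String> a, Map<String, String> b,
                                     Map<String, Integer> quanta, int minTokenLen) {
        for (String field : a.keySet()) {
            int q = quanta.get(field);
            if (!fieldSignature(a.get(field), q, minTokenLen)
                    .equals(fieldSignature(b.get(field), q, minTokenLen))) {
                return false;
            }
        }
        return true;
    }
}
```

With quantum=2 on the body field, only frequent terms reach the hash, so
small wording changes are tolerated; the quantum=1 field must match exactly
at the token level.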
Ken Krugler wrote:
Marc Sturlese wrote:
Hey there, I've been testing and checking the source of
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for huge texts, where the frequency
of the tokens (the words lowercased, keeping just numbers and letters in
that case) is important. If you want to detect duplicates in texts that
aren't huge, without giving a lot of importance to the frequencies, it
doesn't work...
The hash is made just with the terms whose frequency is higher than a
QUANTUM (whose value is derived from the max frequency among all the
terms). So it will say that:
aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn
are duplicates, because quantum here would be 2 and the frequency of aaa
would be 2 as well. So only the term aaa would be used to make the hash.
In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo
Here quantum would be 1 and the frequencies of all terms would be 1, so
all terms would be used for the hash. It will consider these two strings
not similar.
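Both examples can be reproduced with a small sketch of the quantum
filtering described above (a simplification for illustration, not the real
TextProfileSignature implementation; the token splitting and the keep rule
are assumptions based on this thread):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified version of the behaviour described above: count token
// frequencies, then keep only tokens whose frequency reaches the quantum.
// The kept tokens are what the signature hash would be built from.
public class QuantumFilter {

    public static List<String> profile(String text, int quantum) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            freq.merge(tok, 1, Integer::sum);
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() >= quantum) kept.add(e.getKey());
        }
        return kept;
    }
}
```

With quantum=2, both strings in the first pair collapse to just [aaa], so
their hashes match. In the second pair quantum is 1, every token survives,
and the single differing token (aaa vs apa) changes the whole hash.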
As I understand the algorithm, there's no way to make it recognize that
in my second case both strings are similar. I wish I were wrong...
I have my own deduplication system to detect that, but I use string
comparison, so it works really slowly... I would like to know if there is
any tuning possibility to do that with TextProfileSignature.
Don't know if I should post this here or in the developers forum...
Hi Marc,
TextProfileSignature is a rather crude
implementation of approximate similarity, and as
you pointed out it's best suited for large
texts. The original purpose of this Signature
was to deduplicate web pages in large amounts of
crawled pages (in Nutch), where it worked
reasonably well. Its advantage is also that it's
easy to compute and doesn't require multiple
passes over the corpus.
As it is implemented now, it breaks badly in the
case you describe. You could modify this
implementation to also include word-level
ngrams, i.e. sequences of more than one word, up
to N (e.g. 5) - this should work in your case.
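Ken's ngram suggestion could be sketched with a generic shingling helper
like the one below (illustrative only, not code from TextProfileSignature):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the suggestion above: in addition to single tokens, emit
// word-level ngrams up to length maxN, so local word order contributes
// to the profile and a one-token difference no longer changes every
// element that feeds the signature.
public class WordNgrams {

    public static List<String> ngrams(String text, int maxN) {
        String[] toks = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= toks.length; i++) {
                out.add(String.join(" ", Arrays.copyOfRange(toks, i, i + n)));
            }
        }
        return out;
    }
}
```

For Marc's second pair of strings, the unigrams and bigrams not touching
the differing first token would still match, so a signature built over
these ngrams degrades gracefully instead of flipping entirely.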
Ultimately, what you are probably looking for is
a shingle-based algorithm, but it's relatively
costly and requires multiple passes.
There's an intermediate approach we use...
* Generate separate hashes for each of the quantized bands
* Create additional fingerprint values (depends on the nature of the data)
* Find potentially similar files using the above
* Then apply an accurate but slower comparison to determine true
similarity
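The banded-hash step of that pipeline could be sketched as follows. This
is purely illustrative: the band assignment, the hashing, and the idea of
counting shared bands as the candidate filter are assumptions based on the
description above, not Krugle's actual code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the banded approach: split terms into frequency bands, hash
// each band separately, and treat documents that share enough band hashes
// as candidates for the slower, accurate comparison. A term drifting into
// a neighbouring band breaks one band hash, not all of them.
public class BandedFingerprints {

    public static List<Integer> bandHashes(String text, int bands) {
        Map<String, Integer> freq = new HashMap<>();
        int maxFreq = 0;
        for (String tok : text.toLowerCase().split("\\s+")) {
            maxFreq = Math.max(maxFreq, freq.merge(tok, 1, Integer::sum));
        }
        // One token list per frequency band.
        List<List<String>> perBand = new ArrayList<>();
        for (int b = 0; b < bands; b++) perBand.add(new ArrayList<>());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int band = (e.getValue() - 1) * bands / maxFreq; // 0..bands-1
            perBand.get(Math.min(band, bands - 1)).add(e.getKey());
        }
        // Sort each band so the hash is order-independent.
        List<Integer> hashes = new ArrayList<>();
        for (List<String> tokens : perBand) {
            tokens.sort(String::compareTo);
            hashes.add(String.join(" ", tokens).hashCode());
        }
        return hashes;
    }

    // Candidate filter: how many band hashes two documents share.
    public static int sharedBands(List<Integer> a, List<Integer> b) {
        int shared = 0;
        for (int i = 0; i < a.size(); i++) {
            if (a.get(i).equals(b.get(i))) shared++;
        }
        return shared;
    }
}
```

Documents sharing most of their band hashes would then go through the
accurate but slower comparison; the number of matching values is what the
Lucene scoring step mentioned below effectively counts.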
From our data, it's common to get files where
(due to small text changes) the frequency of a
term moves between quantized bands. This then
changes the über hash that you get from combining
all terms, but with 10 or so bands we still get
some matches on the hashes from the individual
bands.
The "find potentially similar files" uses a
simple Lucene scoring function, based on the
number of matching fingerprint values.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"