Well the first is something I have found out earlier, consider a table
with the fields A and B and the following records:
A B
1 1
1 2
2 1
for that record, you can not add the constraint UNIQUE(A) nor
UNIQUE(B), but you can add the constraint UNIQUE(A,B)
Correct. But if we're starting with A and B= f(a) where f is 1:1 lime
md5, then you will actually get a table like:
A B
1 aa...43
1 aa...43
2 48...6c
Well, as you have stated there is a non zero possibility to get the
same fingerprint for two different messageblocks.
Non-zero, but on the order of 1 in 2^127 chance, or virtually never in
the age of the universe, at least for md5. You can obviously get
duplicates, since if you hash every 128 bit string into 128 bits,
you're assured of getting a collision on the next string that you hash,
if you haven't done so already. But as I've said, that's a major time
investment.
I beleive the query optimizer in the databsae will only use the
fingerprint as index, since this probably would be much faster than
searching on the messageblock itself.
It is probably wrong to have the fingerprint as a bigint, it should
probably only be 32 or 64 bits.
128 is probably the minimum that I've seen for a good hash, 160 bit
sha1 is another possibility.
I mean some kind of for us invisible index, that only the database
query engine uses, without us really knowing.
MySQL has some text search functions that could be useful.
You should just add it to the DB like INDEX(messageblock)
That would be a fulltext index on messageblock, and rather expensive
and wouldn't even help the uniqueness problem.
I am still wondering if this is possible or if it kills the DB. If the
DB die, then we would have to use a separate table/record with
fingerprint records etc. since the DB can live with it.
I don't see how storing fingerprints could kill a database.
eric