Well the first is something I have found out earlier, consider a table with the fields A and B and the following records:
A B
1 1
1 2
2 1

for that record, you can not add the constraint UNIQUE(A) nor UNIQUE(B), but you can add the constraint UNIQUE(A,B)

Correct. But if we're starting with A and B= f(a) where f is 1:1 lime md5, then you will actually get a table like:
A B
1 aa...43
1 aa...43
2 48...6c

Well, as you have stated there is a non zero possibility to get the same fingerprint for two different messageblocks.

Non-zero, but on the order of 1 in 2^127 chance, or virtually never in the age of the universe, at least for md5. You can obviously get duplicates, since if you hash every 128 bit string into 128 bits, you're assured of getting a collision on the next string that you hash, if you haven't done so already. But as I've said, that's a major time investment.

I beleive the query optimizer in the databsae will only use the fingerprint as index, since this probably would be much faster than searching on the messageblock itself. It is probably wrong to have the fingerprint as a bigint, it should probably only be 32 or 64 bits.

128 is probably the minimum that I've seen for a good hash, 160 bit sha1 is another possibility.

I mean some kind of for us invisible index, that only the database query engine uses, without us really knowing.


MySQL has some text search functions that could be useful.
You should just add it to the DB like INDEX(messageblock)

That would be a fulltext index on messageblock, and rather expensive and wouldn't even help the uniqueness problem.

I am still wondering if this is possible or if it kills the DB. If the DB die, then we would have to use a separate table/record with fingerprint records etc. since the DB can live with it.

I don't see how storing fingerprints could kill a database.

eric

Reply via email to