Eric Soroos wrote:

On Nov 20, 2003, at 8:28 AM, Magnus Sundberg wrote:

Eric Soroos wrote:


You can. That's effectively what I was doing in my message, except that you're not seeing it in the messageblk view. You probably don't want a unique constraint on messageblk, since the idea of the fingerprint is that it's a 1:1 mapping of the messageblk down to 128 bits.

I agree with you if I had written "UNIQUE (fingerprint)", but my belief is that the combination of both keys should be unique.

It's redundant.

I agree about that, but I my belief is that
"UNIQUE (messageblock, fingerprint)" is faster than
"UNIQUE (messageblock)"
Well the first is something I have found out earlier, consider a table with the fields A and B and the following records:
A B
1 1
1 2
2 1

for that record, you can not add the constraint UNIQUE(A) nor UNIQUE(B), but you can add the constraint UNIQUE(A,B)

Well, as you have stated there is a non zero possibility to get the same fingerprint for two different messageblocks. Therefore you can use the UNIQUE(messageblock, fingerprint) as a protection against duplicate messageblocks. I beleive the query optimizer in the databsae will only use the fingerprint as index, since this probably would be much faster than searching on the messageblock itself. It is probably wrong to have the fingerprint as a bigint, it should probably only be 32 or 64 bits.



I don't understand. Can you explain your reasoning? Especially how it's different/faster/better than Unique(fingerprint).


Thats the reason I put it together like that.

The fingerprint is going to be unique iff the messageblock is unique. If that wasn't the case, then there's no point in storing or calculating it. Unique constraints are generally done with indexes (not sure about mysql, but it's certainly the case with postgres), so you'd end up adding the entire messageblk to the index. And it's not a particularly useful index, since that's the only thing that index would be good for.


I agree with you on this point.


Is it possible to have some hash algoritms and just index the messageblock? I.e. let the database itself do the fingerprinting and hiding it for us?


Yes, that was my proposal. I'm using md5 in my test case. The proposal requires a couple of rules, a stored procedure, and a view, so it's doubtful that it will work on mysql.

I mean some kind of for us invisible index, that only the database query engine uses, without us really knowing.
MySQL has some text search functions that could be useful.
You should just add it to the DB like INDEX(messageblock)

I am still wondering if this is possible or if it kills the DB. If the DB die, then we would have to use a separate table/record with fingerprint records etc. since the DB can live with it.



Would this kill the database?


No, it seems to have about the same performance as the original schema. It appears that the md5 hash of the messageblk data is lost in the overhead of all the other work that's happening on insert.

eric

Magnus



Reply via email to