Would xxHash be large enough for digests? Looks like there's no 128-bit version yet, and it seems like 64 bits would be a bit short to avoid accidental collisions/matches. FarmHash128 or MetroHash128 might be a good choice. Not quite as fast as xxHash64, but not far off and still much, much faster than MD5 and somewhat faster than murmur3. May require some amount of benchmarking, since most of the performance comparisons are C and the performance of the Java implementations may vary drastically.
Looks like https://issues.apache.org/jira/browse/CASSANDRA-13291 already switched to Guava, which probably makes Murmur3_128 easier to switch to than the rest, and it may be enough faster than MD5 to be beyond the point of diminishing returns anyways. (far from an expert, but this thread prompted me to go poking through hash options out of curiosity) On Wed, Sep 26, 2018 at 9:04 PM Joseph Lynch <joe.e.ly...@gmail.com> wrote: > Michael Kjellman and others (Jason, Sam, et al.) have already done a lot of > work in 4.0 to help change the use of MD5 to something more modern [1][2]. > Also I cut a ticket a little while back about the significant performance > penalty of using MD5 for digests when doing quorum reads of wide partitions > [1]. Given the profiling that Michael has done and the production profiling > we did I think it's fair to say that changing the digest from MD5 to > murmur3 or xxHash would lead to a noticeable performance improvement for > quorum reads, perhaps even something like a 2x throughput increase for e.g. > wide partition workloads. > > The hard part is changing the digest hash without breaking older versions, > e.g. during a rolling restart you can't have one node give a MD5 hash and > the other give a xxHash hash as you'll end up with lots of mismatches and > read repairs ... so that would be the tricky part. I believe that we just > need to do what was done during the 3.0 storage engine refactor (I can't > remember the ticket but I'm pretty sure Sylvain did the work) which checked > the messaging version of the destination node and sent the appropriate hash > back. > > -Joey > > [1] https://issues.apache.org/jira/browse/CASSANDRA-13291 > [2] https://issues.apache.org/jira/browse/CASSANDRA-13292 > [3] https://issues.apache.org/jira/browse/CASSANDRA-14611 > > > On Wed, Sep 26, 2018 at 5:00 PM Elliott Sims <elli...@backblaze.com> > wrote: > > > They also don't matter for digests, as long as we're assuming all nodes > in > > the cluster are non-malicious (which is a pretty reasonable and probably > > necessary assumption). Or at least, deliberate collisions don't. > > Accidental collisions do, but 128 bits is sufficient to make that > > sufficiently unlikely (as in, chances are nobody will ever see a single > > collision) > > > > On Wed, Sep 26, 2018 at 7:58 PM Brandon Williams <dri...@gmail.com> > wrote: > > > > > Collisions don't matter in the partitioner. > > > > > > On Wed, Sep 26, 2018, 6:53 PM Anirudh Kubatoor < > > anirudh.kubat...@gmail.com > > > > > > > wrote: > > > > > > > Isn't MD5 broken from a security standpoint? From wikipedia > > > > *"One basic requirement of any cryptographic hash function is that it > > > > should be computationally infeasible > > > > < > > > > > > > > > > https://en.wikipedia.org/wiki/Computational_complexity_theory#Intractability > > > > > > > > > to > > > > find two non-identical messages which hash to the same value. MD5 > fails > > > > this requirement catastrophically; such collisions > > > > <https://en.wikipedia.org/wiki/Collision_resistance> can be found in > > > > seconds on an ordinary home computer"* > > > > > > > > Regards, > > > > Anirudh > > > > > > > > On Wed, Sep 26, 2018 at 7:14 PM Jeff Jirsa <jji...@gmail.com> wrote: > > > > > > > > > In some installations, it's used for hashing the partition key to > > find > > > > the > > > > > host ( RandomPartitioner ) > > > > > It's used for prepared statement IDs > > > > > It's used for hashing the data for reads to know if the data > matches > > on > > > > all > > > > > different replicas. > > > > > > > > > > We don't use CRC because conflicts would be really bad. There's > > > probably > > > > > something in the middle that's slightly faster than md5 without the > > > > > drawbacks of crc32 > > > > > > > > > > > > > > > On Wed, Sep 26, 2018 at 3:47 PM Tyagi, Preetika < > > > > preetika.ty...@intel.com> > > > > > wrote: > > > > > > > > > > > Hi all, > > > > > > > > > > > > I have a question about MD5 being used in the read path in > > Cassandra. > > > > > > I wanted to understand what exactly it is being used for and why > > not > > > > > > something like CRC is used which is less complex in comparison to > > > MD5. > > > > > > > > > > > > Thanks, > > > > > > Preetika > > > > > > > > > > > > > > > > > > > > > > > > > > >