Michael Kjellman and others (Jason, Sam, et al.) have already done a lot of work in 4.0 to help move from MD5 to something more modern [1][2]. I also cut a ticket a little while back about the significant performance penalty of using MD5 for digests when doing quorum reads of wide partitions [3]. Given the profiling that Michael has done and the production profiling we did, I think it's fair to say that changing the digest from MD5 to murmur3 or xxHash would lead to a noticeable performance improvement for quorum reads, perhaps even something like a 2x throughput increase for e.g. wide partition workloads.

The hard part is changing the digest hash without breaking older versions. During a rolling restart, for example, you can't have one node return an MD5 hash and another return an xxHash hash, as you'll end up with lots of mismatches and read repairs. I believe that we just need to do what was done during the 3.0 storage engine refactor (I can't remember the ticket, but I'm pretty sure Sylvain did the work), which checked the messaging version of the destination node and sent the appropriate hash back.

-Joey

[1] https://issues.apache.org/jira/browse/CASSANDRA-13291
[2] https://issues.apache.org/jira/browse/CASSANDRA-13292
[3] https://issues.apache.org/jira/browse/CASSANDRA-14611

On Wed, Sep 26, 2018 at 5:00 PM Elliott Sims <elli...@backblaze.com> wrote:

> They also don't matter for digests, as long as we're assuming all nodes in
> the cluster are non-malicious (which is a pretty reasonable and probably
> necessary assumption). Or at least, deliberate collisions don't.
> Accidental collisions do, but 128 bits is sufficient to make that
> sufficiently unlikely (as in, chances are nobody will ever see a single
> collision)
>
> On Wed, Sep 26, 2018 at 7:58 PM Brandon Williams <dri...@gmail.com> wrote:
>
> > Collisions don't matter in the partitioner.
> >
> > On Wed, Sep 26, 2018, 6:53 PM Anirudh Kubatoor <anirudh.kubat...@gmail.com>
> > wrote:
> >
> > > Isn't MD5 broken from a security standpoint? From wikipedia
> > > *"One basic requirement of any cryptographic hash function is that it
> > > should be computationally infeasible
> > > <https://en.wikipedia.org/wiki/Computational_complexity_theory#Intractability>
> > > to find two non-identical messages which hash to the same value. MD5 fails
> > > this requirement catastrophically; such collisions
> > > <https://en.wikipedia.org/wiki/Collision_resistance> can be found in
> > > seconds on an ordinary home computer"*
> > >
> > > Regards,
> > > Anirudh
> > >
> > > On Wed, Sep 26, 2018 at 7:14 PM Jeff Jirsa <jji...@gmail.com> wrote:
> > >
> > > > In some installations, it's used for hashing the partition key to find
> > > > the host ( RandomPartitioner )
> > > > It's used for prepared statement IDs
> > > > It's used for hashing the data for reads to know if the data matches on
> > > > all different replicas.
> > > >
> > > > We don't use CRC because conflicts would be really bad. There's probably
> > > > something in the middle that's slightly faster than md5 without the
> > > > drawbacks of crc32
> > > >
> > > > On Wed, Sep 26, 2018 at 3:47 PM Tyagi, Preetika <preetika.ty...@intel.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I have a question about MD5 being used in the read path in Cassandra.
> > > > > I wanted to understand what exactly it is being used for and why not
> > > > > something like CRC is used which is less complex in comparison to MD5.
> > > > >
> > > > > Thanks,
> > > > > Preetika
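
For illustration only, the messaging-version check suggested at the top of this message could look roughly like the sketch below. It is not Cassandra code: the version constants and function names are hypothetical, and BLAKE2b (truncated to 128 bits) stands in for murmur3/xxHash only because neither ships with the Python standard library.

```python
import hashlib

# Hypothetical messaging versions, invented for this sketch.
LEGACY_VERSION = 1        # peers that only understand MD5 digests
FAST_DIGEST_VERSION = 2   # first version that understands the newer digest


def md5_digest(data: bytes) -> bytes:
    """Digest that pre-upgrade nodes expect (128 bits)."""
    return hashlib.md5(data).digest()


def fast_digest(data: bytes) -> bytes:
    """Stand-in for murmur3/xxHash: a 128-bit BLAKE2b digest."""
    return hashlib.blake2b(data, digest_size=16).digest()


def digest_for_peer(data: bytes, peer_version: int) -> bytes:
    """Pick the digest algorithm from the destination node's messaging version."""
    if peer_version >= FAST_DIGEST_VERSION:
        return fast_digest(data)
    return md5_digest(data)


if __name__ == "__main__":
    rows = b"pk=42|c1=a|c2=b"
    # Identical data hashed with different algorithms looks like a mismatch,
    # which is the spurious read repair a rolling restart would trigger.
    print(md5_digest(rows) == fast_digest(rows))
    # Gating on the peer's version keeps both sides on the same algorithm.
    print(digest_for_peer(rows, LEGACY_VERSION) == md5_digest(rows))
```

The point of the sketch is only that the algorithm is chosen per destination, so an upgraded node keeps answering MD5 to peers that have not been upgraded yet.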