Re: Calculating Revision IDs outside erlang (proposal to add {minor_version, 1} to the calc)

Adam Kocoloski Tue, 22 Mar 2016 19:53:29 -0700

Wow, does this mean that a CouchDB server running R16 and another running R17 
will compute different revision IDs for the same document? We should certainly 
bump to minor_version=1 across the board; we did this for on-disk 
representations of document bodies quite a long time ago, I think.


Adam

> On Mar 22, 2016, at 10:45 PM, Paul Davis wrote:
> 
> +1 to adding the minor version option. Floats are hard. It's still not
> perfect, but it should at least make most cases easier.
> 
> On Tue, Mar 22, 2016 at 7:30 PM, Michael Fair wrote:
>> Greetings CouchDBers!
>> 
>> I've been modifying a BERT library to recreate the md5 calc of a RevisionID
>> in Java.
>> 
>> I haven't tackled attachments yet. However, with the awesome help of rnewson
>> on the IRC channel, I've succeeded in recreating the md5 for all the
>> documents I've tried so far. That includes docs with string values, big
>> and small integers, lists of big integers, lists of small integers, true,
>> false, null, and objects. The glaring exception is floats.
>> 
>> The {minor_version, 0} format used for floats (a 31-byte, string-based
>> representation in %.20e format) depends on the host environment doing
>> the encoding and can't be reliably duplicated on other machines or in
>> other languages.
>> 
>> For instance, here are examples of encoding 3.14159 as a %.20e string on
>> this laptop:
>> erlang: 3.1415899999999999000e+00  (This is what term_to_binary is using)
>> python: 3.14158999999999988262e+00
>> java:   3.14159000000000000000e+00
>> 
>> These minor numerical differences unfortunately make the md5 computation
>> untenable.  Further, it seems that even different OTP versions and
>> different hardware will encode the {minor_version, 0} format slightly
>> differently on different Couch instances (a couple of people on IRC shared
>> with me what their OTP produced).
>> 
>> 
>> To make a long story short and spare folks the mind-numbing details:
>> unless something changes, replicating the md5 for the revision id of
>> documents with floats just can't be done sanely.
>> 
>> As things stand, as I mentioned, even different installations of
>> CouchDB can disagree on the MD5 revision id for the document {"pi":3.14159}.
>> 
>> 
>> So where does this create an issue?
>> 
>> It shows up as a conflict during replication when the two servers have
>> calculated different revision ids for the same document update. That only
>> happens with a multi-master update, where both sides were updated before
>> replicating (like separate laptops on separate planes each doing the same
>> thing).
>> 
>> If only one side or the other was updated, it doesn't cause a problem.
>> 
>> My goal is to enable people to upload documents from multiple server
>> applications using JSON, with Couch handling the replication bits.
>> 
>> To give this heterogeneous environment the same multi-master intelligence
>> that Couch has, those applications need to be able to compute the same
>> revision id that Couch would compute; otherwise documents modified directly
>> in Couch could create these kinds of multi-master conflicts.
>> 
>> 
>> ----
>> 
>> What to do (aside from simply doing nothing)?
>> 
>> At the very least, I recommend changing the term_to_binary computation to
>> use the {minor_version, 1} option in the rev_id calculation.
>> 
>> This changes how floats are encoded to the 64-bit IEEE format.  It became
>> the standard way of encoding floats in OTP 17.0+ and is available as an
>> option all the way back to OTP 11.  As long as it's explicitly provided as
>> a requested option in the term_to_binary call, all currently deployed OTP
>> installations for Couch can do it.
>> 
>> Doing this normalizes the md5 calculation for floats regardless of the OTP
>> platform, and should make it feasible for third party applications to
>> replicate the encoding.
>> 
>> 
>> 
>> I have some other ideas beyond that, but they would require changes to the
>> replication protocol to support.
>> 
>> 
>> ----
>> 
>> For anyone interested, I'd be happy to share the code I have.  It's still a
>> bit rough in the document construction part, but once the document is
>> constructed, getting the binary encoding and getting the revision id are
>> each just a single call.
>> 
>> 
>> Thanks,
>> Mike
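
To make the difference between the two float encodings in the quoted message concrete, here is a minimal Python sketch. It is an illustration, not CouchDB's or Michael's code. The function names are hypothetical, and the tag bytes (131, 99, 70) are taken from the Erlang external term format documentation rather than from this thread. The {minor_version, 0} function only mimics the layout of that encoding; the digits it writes are whatever the local printf-style formatter produces, which is exactly the portability problem described above.

```python
import struct

ETF_VERSION = 131    # version byte that starts every external term
FLOAT_EXT = 99       # tag for the textual float ({minor_version, 0})
NEW_FLOAT_EXT = 70   # tag for the IEEE 754 float ({minor_version, 1})


def encode_float_v0(value):
    """Layout of the old encoding: '%.20e' text, zero-padded to 31 bytes.

    The text depends on the local float formatter, so the bytes (and any
    md5 computed over them) can differ between platforms and languages.
    """
    text = ("%.20e" % value).encode("ascii")
    return bytes([ETF_VERSION, FLOAT_EXT]) + text.ljust(31, b"\x00")


def encode_float_v1(value):
    """Layout of the new encoding: 8 bytes, big-endian IEEE 754 double."""
    return bytes([ETF_VERSION, NEW_FLOAT_EXT]) + struct.pack(">d", value)


def decode_float_v1(data):
    """Inverse of encode_float_v1, used here only to check the round trip."""
    if data[0] != ETF_VERSION or data[1] != NEW_FLOAT_EXT:
        raise ValueError("not a NEW_FLOAT_EXT term")
    return struct.unpack(">d", data[2:10])[0]


if __name__ == "__main__":
    print(encode_float_v0(3.14159))        # text payload, formatter-dependent
    print(encode_float_v1(3.14159).hex())  # fixed binary payload
```

Because the second form is just the bit pattern of the double, any language that can write a big-endian IEEE 754 double should be able to reproduce it byte for byte, which is what makes the md5 comparable across implementations.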
