Greetings all! Sorry if this one is a bit long; you really only need to read the first few paragraphs. The rest is rationale for why the idea is sound/sane, plus some proposed implementation details.
I've been thinking about replication with third-party servers and the problems of evolving binary formats and revision id algorithms, and I believe that by adding a simple, optional change to the way conflicts are handled, eventual consistency will make most of the associated problems go away. Basically, instead of trying to create the one true canonical JSON representation, add a set of "digests" to the current leaves of a document and follow some rules:

1) The JSON format used between two endpoints is either the original JSON document, or a negotiated format that preserves the fidelity of the original JSON.

2) The deterministic algorithm for selecting which revision to use in the presence of conflicting branches is honored.

3) The following proposed enhancement for merging revisions with the same digest data is applied. It centers on the idea that two documents with the same _id and the same contents are in fact the same document, so conflicting branches with the same JSON content should be merged (not preserved) when the receiving server detects them.

When a conflicting leaf of a document is updated to have the same contents (as determined by a message digest of those contents) as another current leaf, this should be treated as a MERGE operation between those leaves. The deterministically selected revision id of the two would be returned, and the losing leaf automatically deleted, thereby resolving the conflict now that the contents match. Storing a record on the deleted leaf noting which revision id it was merged into would make for nicer revision history graphs, but is completely unnecessary.

Further, when replicating or receiving _bulk_docs/all_or_nothing requests, if two documents are detected with the same _id, different revision ids, and the same digest, a conflict should not be created at all. The same merge algorithm applies: the deterministically selected revision id of the two is kept and the other is automatically marked as deleted.
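To make the merge rule concrete, here's a rough sketch in Python. Everything here is illustrative, not CouchDB internals: the sorted-key JSON serialization is just one stable choice for this sketch, and `max()` stands in for CouchDB's actual deterministic winner-selection algorithm.

```python
import hashlib
import json

def content_digest(doc):
    # Digest the document body with _id/_rev metadata stripped; sorted keys
    # give a stable serialization for this sketch (not a canonical form).
    body = {k: v for k, v in doc.items() if not k.startswith("_")}
    return hashlib.md5(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def merge_matching_leaves(leaves):
    """leaves: list of (rev_id, doc) for the current leaves of one _id.
    Returns (kept_leaves, deleted_rev_ids) after merging any leaves whose
    contents digest identically."""
    kept = {}      # digest -> (rev_id, doc)
    deleted = []
    for rev, doc in leaves:
        d = content_digest(doc)
        if d in kept:
            # Same contents under different rev ids: deterministically keep
            # one rev (max() stands in for the real selection algorithm)
            # and mark the other as deleted, resolving the conflict.
            other_rev, _ = kept[d]
            winner, loser = max(rev, other_rev), min(rev, other_rev)
            kept[d] = (winner, doc)
            deleted.append(loser)
        else:
            kept[d] = (rev, doc)
    return list(kept.values()), deleted
```

Running this over three leaves where two have identical contents leaves two surviving branches and one automatic deletion, exactly the end state a human resolver would have produced by hand.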
It's important to note that the digest here is a separate computation from the revision id (and could use its own algorithm); the revision id could even have been randomly generated. This proposal says: "check the contents (via a digest) before creating/persisting a conflict based on revision id."

Here's the rationale: it keeps with the philosophy of one _id, one document; it matches what people consider a document revision to be (a version of the document's contents); and it supports people and applications in resolving document conflicts in a meaningful way. I realize doing any conflict resolution is something new for the CouchDB code. For added context on why this keeps with the Couch philosophy, here's a small snippet from the docs:

[ Here, (r4a, r3b, r3c) are the set of conflicting revisions. The way you resolve a conflict is to delete the leaf nodes along the other branches. So when you combine (r4a+r3b+r3c) into a single merged document, you would replace r4a and delete r3b and r3c. ]

This statement is all about bringing the document branches to a common place and terminating the lower branches. This proposal isn't doing something new by attempting to merge/resolve documents with different information; it's aiding what's already the defined procedure. It provides an alternative method of accomplishing the same end result: you can PUT the merged content into r4a, r3b, and r3c (or, if r3b already has the right info, just update r4a and r3c to match r3b's contents); under this proposal the result would be identical to updating r4a with the corrected info and deleting r3b and r3c as described. The example in the docs is about managing a contact record, which is a great example of a document that can arrive at the same end state while taking many different paths internally on local device databases.
When syncing with another database, the fact that the document took a different path isn't something to preserve a conflict over; the contents at the time the document is synced are. And while it's tempting to say "you might want to track those histories separately" or "just because the contents match doesn't make them the same branch," those assertions go against one _id, one document thinking; they advocate keeping conflicts as a lightweight form of revision history tracking. If the contents of two leaves of a document can be shown to be the same, those branches should be merged and that conflict within the _id resolved. The number of times a document was saved, or the path its values took, in a local device database is not important to the current state. Said another way: if it has the same _id and the same contents, excluding the revision id, it's the same document. And once two leaves are detected as the same document, the document needs a single revision id, so the deterministic algorithm selects the winner.

Assuming the concept is acceptable, then to avoid breaking anything existing, I propose this new message digest value be stored in a new string field called "MD5/CouchDB-2.0" inside a new optional document object field called "_digests". Its value would be the same as the md5 portion of the revision id. Making "_digests" an object allows different digest algorithms to be added to the same doc by other applications, or by future algorithms, without breaking anything existing. It also doesn't break anything for a server to throw out the _digests field and not store it (that means more CPU work during replication for revision id conflicts, but that might be preferable to storing the data on the doc). This proposal also resolves the third-party revision id problem.
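For illustration, here's what a doc carrying the proposed "_digests" object might look like, built in Python. The field names match the proposal above; the example _id, _rev, and the sorted-key serialization used to compute the digest are my own assumptions for this sketch.

```python
import hashlib
import json

doc = {
    "_id": "contact-123",
    "_rev": "3-917fa2381192822767f010b95b45325b",  # made-up revision id
    "name": "Bob",
    "phone": "555-0100",
}

# Digest the body only (underscore-prefixed metadata excluded); sorted-key
# JSON is just one possible stable serialization for this illustration.
body = {k: v for k, v in doc.items() if not k.startswith("_")}
doc["_digests"] = {
    "MD5/CouchDB-2.0": hashlib.md5(
        json.dumps(body, sort_keys=True).encode("utf-8")
    ).hexdigest()
}
```

Because "_digests" is an object, other applications or future algorithms could add sibling keys (say, a SHA-256 entry) alongside this one without disturbing anything existing.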
When a server (Couch or otherwise) receives documents and detects a revision id conflict, it can use its own supported digest algorithms on the documents to detect and resolve conflicts where the JSON content is the same but the revision id was calculated using a different method. Because the revision id selection algorithm is deterministic, each receiving server will pick the same revision to keep and the same revision to delete. The only requirement is that the server's chosen digest algorithm generate different digests for different documents and the same digest for the same document; it doesn't matter whether the two servers used the same digest algorithm.

A server may ignore this proposal and produce a conflict revision instead of merging; replication with a server that does honor the proposal would then detect and resolve those conflicts, or applications and humans could resolve them the way it's already done today. A server should preserve all digests listed in the _digests object on a document, but it may preserve only its own, some of them, or throw out the object entirely (as mentioned earlier). A server may populate as many digest algorithms as it wishes and knows how to compute. Only digests for the currently active leaves of a document need be preserved (and even then, only for documents that have active conflicts); historical digests add no value for this purpose. A digest algorithm could be provided as part of a design document or map/reduce view definition, enabling other servers to compute a preferred digest (this also helps ensure the same algorithm is used, since design docs/map view definitions can be replicated). When a server receives a document that doesn't have the server's own algorithm listed in the "_digests" object, it will have to compute that digest should a revision id conflict be detected. I believe this can be implemented as an Erlang plugin for existing 1.6.1 servers.
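A rough sketch of that receiving-side decision, again in Python with made-up names (`on_incoming_leaf`, `my_digest`) rather than real CouchDB code: use the sender-supplied digest when our algorithm is listed in "_digests", compute it ourselves when it isn't, and only raise a conflict when the contents genuinely differ.

```python
import hashlib
import json

MY_ALGO = "MD5/CouchDB-2.0"  # the digest this hypothetical server prefers

def my_digest(doc):
    # Same illustrative body digest as earlier: metadata stripped, sorted keys.
    body = {k: v for k, v in doc.items() if not k.startswith("_")}
    return hashlib.md5(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def on_incoming_leaf(local_leaves, incoming_rev, incoming_doc):
    """local_leaves: {rev_id: doc} for the current leaves of this _id.
    Returns ("merge", winning_rev, losing_rev) when contents match an
    existing leaf, otherwise ("conflict", incoming_rev, None)."""
    # Trust the sender's digest if our algorithm is listed; compute otherwise.
    d = incoming_doc.get("_digests", {}).get(MY_ALGO) or my_digest(incoming_doc)
    for rev, doc in local_leaves.items():
        if doc.get("_digests", {}).get(MY_ALGO, my_digest(doc)) == d:
            # Same contents under a different rev id: merge, don't conflict.
            # max() again stands in for the deterministic winner selection.
            winner, loser = max(rev, incoming_rev), min(rev, incoming_rev)
            return ("merge", winner, loser)
    return ("conflict", incoming_rev, None)
```

Note that because the winner selection is deterministic, a peer running the same logic on the mirror-image case (our leaf arriving at their database) reaches the same winner/loser decision, which is what lets this stay eventually consistent.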
And lastly, this sets up the preferred application method for resolving conflicts: download all the existing conflicting revisions, massage the JSON contents into one merged document, then upload that same document to every conflicting revision. That seems easier to code and follow than having the application decide which revision id is the right one to update and which ones it should delete. "Just update them all" is an easier approach. :)

Thanks everyone, thoughts on the matter are obviously welcomed,

Mike
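P.S. For the curious, that "just update them all" flow could be sketched as a pure function that builds the batch of writes (function name and doc shapes are mine, purely illustrative):

```python
def resolve_all(conflicting, merged_body):
    """conflicting: {rev_id: doc} for every conflicting revision of one _id.
    Build a _bulk_docs-style batch that writes the same merged contents into
    every revision; under this proposal the server would then collapse the
    now-identical leaves into a single winner automatically."""
    return [
        {"_id": doc["_id"], "_rev": rev, **merged_body}
        for rev, doc in conflicting.items()
    ]
```

The application never has to pick a winner or issue deletes; it just POSTs the batch and lets the digest-based merge do the rest.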
