ACCUMULO-2925 Add warning about server-assigned timestamps with replication
Leave a note about different updates to equal keys that are assigned the same timestamp by the server.

Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/4d7e90ae
Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/4d7e90ae
Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/4d7e90ae

Branch: refs/heads/master
Commit: 4d7e90aeef3a6de6a36a30a188d5c1bc564ade3a
Parents: 0676057
Author: Josh Elser <els...@apache.org>
Authored: Thu Jun 19 17:58:10 2014 -0700
Committer: Josh Elser <els...@apache.org>
Committed: Thu Jun 19 17:58:10 2014 -0700

----------------------------------------------------------------------
 docs/src/main/asciidoc/chapters/replication.txt | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/accumulo/blob/4d7e90ae/docs/src/main/asciidoc/chapters/replication.txt
----------------------------------------------------------------------
diff --git a/docs/src/main/asciidoc/chapters/replication.txt b/docs/src/main/asciidoc/chapters/replication.txt
index 8755e24..5d24649 100644
--- a/docs/src/main/asciidoc/chapters/replication.txt
+++ b/docs/src/main/asciidoc/chapters/replication.txt
@@ -361,3 +361,23 @@ primary and peer. As such, the SummingCombiner wouldn't be recommended on a tabl
 While there are changes that could be made to the replication implementation which could attempt to mitigate this risk,
 presently, it is not recommended to configure Iterators or Combiners which are not idempotent to support cases where
 inaccuracy of aggregations is not acceptable.
+
+==== Server-Assigned Timestamps
+
+When the client does not provide one, Accumulo can assign a timestamp to updates made to a table. This is a
+very useful feature as it reduces the amount of code a client must write and also gives some notion of ordering to the
+updates that were made to a table (in addition to solving some very problematic Accumulo implementation details).
+However, replicating Mutations that were created with a server-assigned timestamp can be very problematic. To understand
+this, we must first start at the BatchWriter.
+
+To allow for efficient ingest into Accumulo, the BatchWriter will collect many mutations, group them into batches and
+send them to the correct server to be applied to the appropriate Tablet. For each Mutation in that batch that the server
+receives, the server will set a timestamp that is at least as large as the last timestamp (to account for clock skew). In short,
+this means that all of the Mutations in this batch will get the same timestamp and be deduplicated in a certain order
+via the in-memory map and recorded in the write-ahead log.
+
+The problem is that these updates could be replayed on the remote in different commit sessions, which means that they
+could result in different RFiles on disk (separate minor compactions). Because of this, mutations with server-assigned
+timestamps which are written within the same batch may be applied in a different order on a peer. In
+the case where a user might submit multiple updates for the same Key in rapid succession, the user should ensure proper
+timestamps are set at the client.
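
The ordering hazard described in the added section can be illustrated with a toy model in plain Java. This is only a sketch under stated assumptions and does not use the Accumulo API: the class `TimestampOrderSketch`, the type `Update`, and the helpers `visibleValue` and `reversed` are hypothetical names, and the tie-break rule (among equal timestamps, the update applied last wins) is a simplification chosen for illustration, not Accumulo's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TimestampOrderSketch {

    /** Hypothetical stand-in for one update to a single Key: a timestamp and a value. */
    public static final class Update {
        final long timestamp;
        final String value;

        public Update(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Assumed resolution rule for this sketch: the highest timestamp wins,
     * and among equal timestamps the update applied last wins.
     */
    public static String visibleValue(List<Update> appliedInOrder) {
        Update winner = null;
        for (Update u : appliedInOrder) {
            if (winner == null || u.timestamp >= winner.timestamp) {
                winner = u;
            }
        }
        return winner == null ? null : winner.value;
    }

    /** Returns a reversed copy, modelling a peer that replays the updates in the opposite order. */
    public static List<Update> reversed(List<Update> updates) {
        List<Update> copy = new ArrayList<>(updates);
        Collections.reverse(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Both updates in one batch received the same (server-assigned) timestamp.
        List<Update> tied = Arrays.asList(new Update(100L, "first"), new Update(100L, "second"));
        // The client set distinct timestamps itself.
        List<Update> distinct = Arrays.asList(new Update(100L, "first"), new Update(101L, "second"));

        System.out.println(visibleValue(tied));               // prints "second"
        System.out.println(visibleValue(reversed(tied)));     // prints "first"
        System.out.println(visibleValue(distinct));           // prints "second"
        System.out.println(visibleValue(reversed(distinct))); // prints "second"
    }
}
```

In this model, replaying the tied updates in a different order changes the visible value, while distinct client-set timestamps give the same result in either order. In real client code, a timestamp can be supplied through the `Mutation.put` overloads that accept a `long` timestamp argument.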