Repository: accumulo
Updated Branches:
  refs/heads/master 4b1196257 -> 80805545e
ACCUMULO-3500 Update replication docs for bulk imports

Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/80805545
Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/80805545
Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/80805545

Branch: refs/heads/master
Commit: 80805545e7617bed41bfd5f50c0ba8032fd71d91
Parents: 4b11962
Author: Josh Elser <els...@apache.org>
Authored: Thu Jan 22 10:39:41 2015 -0500
Committer: Josh Elser <els...@apache.org>
Committed: Thu Jan 22 10:39:41 2015 -0500

----------------------------------------------------------------------
 docs/src/main/asciidoc/chapters/replication.txt | 10 ++++++++++
 1 file changed, 10 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/accumulo/blob/80805545/docs/src/main/asciidoc/chapters/replication.txt
----------------------------------------------------------------------
diff --git a/docs/src/main/asciidoc/chapters/replication.txt b/docs/src/main/asciidoc/chapters/replication.txt
index 48f6ffa..69bb3c4 100644
--- a/docs/src/main/asciidoc/chapters/replication.txt
+++ b/docs/src/main/asciidoc/chapters/replication.txt
@@ -377,3 +377,13 @@ As is the recommendation without replication enabled, if multiple values for the
 Accumulo, it is strongly recommended that the value in the timestamp properly reflects the intended version by
 the client. That is to say, newer values inserted into the table should have larger timestamps. If the time between
 writing updates to the same key is significant (order minutes), this concern can likely be ignored.
+
+==== Bulk Imports
+
+Currently, files that are bulk imported into a table configured for replication are not replicated. There is no
+technical reason why it was not implemented; it was simply omitted from the initial implementation. This is considered a
+fair limitation because bulk importing generated files to multiple locations is much simpler than bifurcating "live" ingest
+data into two instances. Given some existing bulk import process which creates files and then imports them into an
+Accumulo instance, it is trivial to copy those files to a new HDFS instance and import them into another Accumulo
+instance using the same process. Hadoop's +distcp+ command provides an easy way to copy large amounts of data to another
+HDFS instance which makes the problem of duplicating bulk imports very easy to solve.
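
The workaround described in the new "Bulk Imports" section could look roughly like the sketch below. It is an illustration only, not part of the commit. The NameNode addresses, directory paths, and table name are hypothetical placeholders, and passing `false` for the set-time argument of `importdirectory` is an assumption that should match whatever the original import used. By default the script only prints the Hadoop commands it would run.

```shell
#!/bin/sh
# Sketch: repeat one bulk import on a second (peer) Accumulo instance.
# All hosts, paths and the table name are hypothetical placeholders.
SRC_URI="hdfs://primary-nn:8020/bulk/batch1/files"
PEER_NN="hdfs://peer-nn:8020"
PEER_FILES="/bulk/batch1/files"
PEER_FAILURES="/bulk/batch1/failures"
TABLE="mytable"

# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

replicate_bulk_import() {
  # 1. Copy the generated files to the peer HDFS instance.
  run hadoop distcp "$SRC_URI" "$PEER_NN$PEER_FILES"

  # 2. Create the failures directory on the peer; bulk import expects it
  #    to exist and be empty.
  run hadoop fs -mkdir -p "$PEER_NN$PEER_FAILURES"

  # 3. Print the commands to enter in the peer's Accumulo shell.
  #    The trailing "false" (do not set time) is an assumption.
  echo "table $TABLE"
  echo "importdirectory $PEER_FILES $PEER_FAILURES false"
}

replicate_bulk_import
```

Running the script with `DRY_RUN=0` executes the two Hadoop commands for real. The last two printed lines are meant to be entered in an Accumulo shell connected to the peer instance.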