http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/administration/overview.md ---------------------------------------------------------------------- diff --git a/_docs-unreleased/administration/overview.md b/_docs-unreleased/administration/overview.md new file mode 100644 index 0000000..39ff163 --- /dev/null +++ b/_docs-unreleased/administration/overview.md @@ -0,0 +1,1151 @@ +--- +title: Overview +category: administration +order: 1 +--- + +## Hardware + +Because we are running essentially two or three systems simultaneously layered +across the cluster: HDFS, Accumulo and MapReduce, it is typical for hardware to +consist of 4 to 8 cores, and 8 to 32 GB RAM. This is so each running process can have +at least one core and 2 - 4 GB each. + +One core running HDFS can typically keep 2 to 4 disks busy, so each machine may +typically have as little as 2 x 300GB disks and as much as 4 x 1TB or 2TB disks. + +It is possible to do with less than this, such as with 1u servers with 2 cores and 4GB +each, but in this case it is recommended to only run up to two processes per +machine -- i.e. DataNode and TabletServer or DataNode and MapReduce worker but +not all three. The constraint here is having enough available heap space for all the +processes on a machine. + +## Network + +Accumulo communicates via remote procedure calls over TCP/IP for both passing +data and control messages. In addition, Accumulo uses HDFS clients to +communicate with HDFS. To achieve good ingest and query performance, sufficient +network bandwidth must be available between any two machines. + +In addition to needing access to ports associated with HDFS and ZooKeeper, Accumulo will +use the following default ports. Please make sure that they are open, or change +their value in accumulo-site.xml. + +|Port | Description | Property Name +|-----|-------------|-------------- +|4445 | Shutdown Port (Accumulo MiniCluster) | n/a +|4560 | Accumulo monitor (for centralized log display) | monitor.port.log4j +|9995 | Accumulo HTTP monitor | monitor.port.client +|9997 | Tablet Server | tserver.port.client +|9998 | Accumulo GC | gc.port.client +|9999 | Master Server | master.port.client +|12234 | Accumulo Tracer | trace.port.client +|42424 | Accumulo Proxy Server | n/a +|10001 | Master Replication service | master.replication.coordinator.port +|10002 | TabletServer Replication service | replication.receipt.service.port + +In addition, the user can provide `0` and an ephemeral port will be chosen instead. This +ephemeral port is likely to be unique and not already bound. Thus, configuring ports to +use `0` instead of an explicit value, should, in most cases, work around any issues of +running multiple distinct Accumulo instances (or any other process which tries to use the +same default ports) on the same hardware. Finally, the *.port.client properties will work +with the port range syntax (M-N) allowing the user to specify a range of ports for the +service to attempt to bind. The ports in the range will be tried in a 1-up manner starting +at the low end of the range to, and including, the high end of the range. + +## Installation + +Download a binary distribution of Accumulo and install it to a directory on a disk with +sufficient space: + + cd <install directory> + tar xzf accumulo-X.Y.Z-bin.tar.gz # Replace 'X.Y.Z' with your Accumulo version + cd accumulo-X.Y.Z + +Repeat this step on each machine in your cluster. Typically, the same `<install directory>` +is chosen for all machines in the cluster. 
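For example, a minimal sketch of repeating the installation across a cluster, assuming passwordless SSH and a hypothetical `hosts.txt` file listing the machines (adjust the install path and version to your environment):

    # Copy the tarball to every host and unpack it in the same install directory
    for host in $(cat hosts.txt); do
      scp accumulo-X.Y.Z-bin.tar.gz "$host":/opt/
      ssh "$host" "cd /opt && tar xzf accumulo-X.Y.Z-bin.tar.gz"
    done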
There are four scripts in the `bin/` directory that are used to manage Accumulo:

1. `accumulo` - Runs Accumulo command-line tools and starts Accumulo processes
2. `accumulo-service` - Runs Accumulo processes as services
3. `accumulo-cluster` - Manages an Accumulo cluster on a single node or several nodes
4. `accumulo-util` - Accumulo utilities for creating configuration, native libraries, etc.

These scripts will be used in the remaining instructions to configure and run Accumulo.

## Dependencies

Accumulo requires HDFS and ZooKeeper to be configured and running before starting. Password-less SSH should be configured between at least the Accumulo master and TabletServer machines. It is also a good idea to run Network Time Protocol (NTP) within the cluster to ensure nodes' clocks don't get too out of sync, which can cause problems with automatically timestamped data.

## Configuration

The Accumulo tarball contains a `conf/` directory where Accumulo looks for configuration. If you installed Accumulo using downstream packaging, the `conf/` directory could be somewhere else, such as `/etc/accumulo/`.

Before starting Accumulo, the configuration files `accumulo-env.sh` and `accumulo-site.xml` must exist in `conf/` and be properly configured. If you are using `accumulo-cluster` to launch a cluster, the `conf/` directory must also contain host files for the Accumulo services (i.e. `gc`, `masters`, `monitor`, `tservers`, `tracers`). You can either create these files manually or run `accumulo-cluster create-config`.

Logging is configured in `accumulo-env.sh` to use three log4j configuration files in `conf/`. The file used depends on the Accumulo command or service being run. Logging for most Accumulo services (i.e. Master, TabletServer, Garbage Collector) is configured by `log4j-service.properties`, except for the Monitor, which is configured by `log4j-monitor.properties`. All Accumulo commands (i.e. `init`, `shell`, etc.) are configured by `log4j.properties`.

### Configure accumulo-env.sh

Accumulo needs to know where to find the software it depends on. Edit accumulo-env.sh and specify the following:

1. Enter the location of Hadoop for `$HADOOP_PREFIX`
2. Enter the location of ZooKeeper for `$ZOOKEEPER_HOME`
3. Optionally, choose a different location for Accumulo logs using `$ACCUMULO_LOG_DIR`

Accumulo uses `HADOOP_PREFIX` and `ZOOKEEPER_HOME` to locate the Hadoop and ZooKeeper jars and add them to the `CLASSPATH` variable. If you are running a vendor-specific release of Hadoop or ZooKeeper, you may need to change how your `CLASSPATH` is built in `accumulo-env.sh`. If Accumulo later has problems finding jars, run `accumulo classpath -d` to debug and print Accumulo's classpath.

You may want to change the default memory settings for Accumulo's TabletServer, which are set in the `JAVA_OPTS` settings for 'tservers' in `accumulo-env.sh`. Note that the syntax is that of the Java JVM command-line options. This value should be less than the physical memory of the machines running TabletServers.

There are similar options for the master's memory usage and the garbage collector process. Reduce these if they exceed the physical RAM of your hardware and increase them, within the bounds of the physical RAM, if a process fails because of insufficient memory.

Note that you will be specifying the Java heap space in accumulo-env.sh.
You should +make sure that the total heap space used for the Accumulo tserver and the Hadoop +DataNode and TaskTracker is less than the available memory on each worker node in +the cluster. On large clusters, it is recommended that the Accumulo master, Hadoop +NameNode, secondary NameNode, and Hadoop JobTracker all be run on separate +machines to allow them to use more heap space. If you are running these on the +same machine on a small cluster, likewise make sure their heap space settings fit +within the available memory. + +### Native Map + +The tablet server uses a data structure called a MemTable to store sorted key/value +pairs in memory when they are first received from the client. When a minor compaction +occurs, this data structure is written to HDFS. The MemTable will default to using +memory in the JVM but a JNI version, called the native map, can be used to significantly +speed up performance by utilizing the memory space of the native operating system. The +native map also avoids the performance implications brought on by garbage collection +in the JVM by causing it to pause much less frequently. + +#### Building + +32-bit and 64-bit Linux and Mac OS X versions of the native map can be built by executing +`accumulo-util build-native`. If your system's default compiler options are insufficient, +you can add additional compiler options to the command line, such as options for the +architecture. These will be passed to the Makefile in the environment variable `USERFLAGS`. + +Examples: + + accumulo-util build-native + accumulo-util build-native -m32 + +After building the native map from the source, you will find the artifact in +`lib/native`. Upon starting up, the tablet server will look +in this directory for the map library. If the file is renamed or moved from its +target directory, the tablet server may not be able to find it. The system can +also locate the native maps shared library by setting `LD_LIBRARY_PATH` +(or `DYLD_LIBRARY_PATH` on Mac OS X) in `accumulo-env.sh`. + +#### Native Maps Configuration + +As mentioned, Accumulo will use the native libraries if they are found in the expected +location and `tserver.memory.maps.native.enabled` is set to `true` (which is the default). +Using the native maps over JVM Maps nets a noticeable improvement in ingest rates; however, +certain configuration variables are important to modify when increasing the size of the +native map. + +To adjust the size of the native map, increase the value of `tserver.memory.maps.max`. +By default, the maximum size of the native map is 1GB. When increasing this value, it is +also important to adjust the values of `table.compaction.minor.logs.threshold` and +`tserver.walog.max.size`. `table.compaction.minor.logs.threshold` is the maximum +number of write-ahead log files that a tablet can reference before they will be automatically +minor compacted. `tserver.walog.max.size` is the maximum size of a write-ahead log. + +The maximum size of the native maps for a server should be less than the product +of the write-ahead log maximum size and minor compaction threshold for log files: + +`$table.compaction.minor.logs.threshold * $tserver.walog.max.size >= $tserver.memory.maps.max` + +This formula ensures that minor compactions won't be automatically triggered before the native +maps can be completely saturated. + +Subsequently, when increasing the size of the write-ahead logs, it can also be important +to increase the HDFS block size that Accumulo uses when creating the files for the write-ahead log. 
This is controlled via `tserver.wal.blocksize`. A basic recommendation is that when `tserver.walog.max.size` is larger than 2GB, set `tserver.wal.blocksize` to 2GB. Increasing the block size to a value larger than 2GB can result in decreased write performance to the write-ahead log file, which will slow ingest.

### Cluster Specification

If you are using `accumulo-cluster` to start a cluster, configure the following on the machine that will serve as the Accumulo master:

1. Write the IP address or domain name of the Accumulo Master to the `conf/masters` file.
2. Write the IP addresses or domain names of the machines that will be TabletServers in `conf/tservers`, one per line.

Note that if using domain names rather than IP addresses, DNS must be configured properly for all machines participating in the cluster. DNS can be a confusing source of errors.

### Configure accumulo-site.xml

Specify appropriate values for the following settings in `accumulo-site.xml`:

```xml
<property>
  <name>instance.zookeeper.host</name>
  <value>zooserver-one:2181,zooserver-two:2181</value>
  <description>list of zookeeper servers</description>
</property>
```

This enables Accumulo to find ZooKeeper. Accumulo uses ZooKeeper to coordinate settings between processes and helps finalize TabletServer failure.

```xml
<property>
  <name>instance.secret</name>
  <value>DEFAULT</value>
</property>
```

The instance needs a secret to enable secure communication between servers. Configure your secret and make sure that the `accumulo-site.xml` file is not readable by other users. For alternatives to storing the `instance.secret` in plaintext, please read the `Sensitive Configuration Values` section.

Some settings can be modified via the Accumulo shell and take effect immediately, but some settings require a process restart to take effect. See the [configuration management][config-mgmt] documentation for details.

### Hostnames in configuration files

Accumulo has a number of configuration files which can contain references to other hosts in your network. All of the "host" configuration files for Accumulo (`gc`, `masters`, `tservers`, `monitor`, `tracers`) as well as `instance.volumes` in accumulo-site.xml must contain some host reference.

While IP addresses, short hostnames, or fully qualified domain names (FQDNs) are all technically valid, it is good practice to always use FQDNs for both Accumulo and other processes in your Hadoop cluster. Failing to consistently use FQDNs can have unexpected consequences in how Accumulo uses the FileSystem.

A common way this problem is observed is via applications that use bulk ingest. The Accumulo Master coordinates moving the input files for bulk ingest to an Accumulo-managed directory. However, Accumulo cannot safely move files across different Hadoop FileSystems, and it cannot reliably determine that two FileSystems specified with different names are in fact the same FileSystem. For example, while `127.0.0.1:8020` might be a valid identifier for an HDFS instance, Accumulo identifies `localhost:8020` as a different HDFS instance than `127.0.0.1:8020`.

### Deploy Configuration

Copy accumulo-env.sh and accumulo-site.xml from the `conf/` directory on the master to all Accumulo tablet servers. The "host" configuration files used by `accumulo-cluster` only need to be on servers where that command is run.
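For example, a minimal sketch of pushing the configuration out with `scp`, assuming passwordless SSH and a hypothetical install path of `/opt/accumulo-X.Y.Z` that is identical on every host:

    # Push the two required configuration files to every tablet server
    for host in $(cat /opt/accumulo-X.Y.Z/conf/tservers); do
      scp /opt/accumulo-X.Y.Z/conf/accumulo-env.sh \
          /opt/accumulo-X.Y.Z/conf/accumulo-site.xml \
          "$host":/opt/accumulo-X.Y.Z/conf/
    done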
### Sensitive Configuration Values

Accumulo has a number of properties that can be specified via the accumulo-site.xml file which are sensitive in nature; instance.secret and trace.token.property.password are two common examples. If compromised, either of these properties could result in data being leaked to users who should not have access to that data.

In Hadoop-2.6.0, a new CredentialProvider class was introduced which serves as a common implementation to abstract away the storage and retrieval of passwords from plaintext storage in configuration files. Any property marked with the `Sensitive` annotation is a candidate for use with these CredentialProviders. For versions of Hadoop which lack these classes, the feature will simply be unavailable for use.

A comma-separated list of CredentialProviders can be configured using the Accumulo property `general.security.credential.provider.paths`. Each configured URL will be consulted when the Configuration object for accumulo-site.xml is accessed.

### Using a JavaKeyStoreCredentialProvider for storage

One of the implementations provided in Hadoop-2.6.0 is a Java KeyStore CredentialProvider. Each entry in the KeyStore is the Accumulo property key name. For example, to store the `instance.secret`, the following command can be used:

    hadoop credential create instance.secret --provider jceks://file/etc/accumulo/conf/accumulo.jceks

The command will then prompt you to enter the secret to use and create a keystore in:

    /etc/accumulo/conf/accumulo.jceks

Then, accumulo-site.xml must be configured to use this KeyStore as a CredentialProvider:

```xml
<property>
    <name>general.security.credential.provider.paths</name>
    <value>jceks://file/etc/accumulo/conf/accumulo.jceks</value>
</property>
```

This configuration will then transparently extract the `instance.secret` from the configured KeyStore and avoids storing the sensitive property in a human-readable form.

A KeyStore can also be stored in HDFS, which will make the KeyStore readily available to all Accumulo servers. If the local filesystem is used, be aware that each Accumulo server will expect the KeyStore in the same location.

### Client Configuration

In version 1.6.0, Accumulo included a new type of configuration file known as a client configuration file. One problem with the traditional "site.xml" file that is prevalent through Hadoop is that it is a single file used by both clients and servers. This makes it very difficult to protect secrets that are only meant for the server processes while still allowing the clients to connect to the servers.

The client configuration file is a subset of the information stored in accumulo-site.xml meant only for consumption by clients of Accumulo. By default, Accumulo checks a number of locations for a client configuration file:

* `/path/to/accumulo/conf/client.conf`
* `/etc/accumulo/client.conf`
* `/etc/accumulo/conf/client.conf`
* `~/.accumulo/config`

These files are [Java Properties files](https://en.wikipedia.org/wiki/.properties). They can currently contain information about ZooKeeper servers, RPC properties (such as SSL or SASL connectors), and distributed tracing properties. Valid properties are defined by the [ClientProperty](https://github.com/apache/accumulo/blob/f1d0ec93d9f13ff84844b5ac81e4a7b383ced467/core/src/main/java/org/apache/accumulo/core/client/ClientConfiguration.java#L54) enum contained in the client API.
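For illustration, a short sketch of what such a client configuration file might contain; the values below are placeholders, and the full set of valid keys is defined by the `ClientProperty` enum referenced above:

    # ~/.accumulo/config (Java properties format)
    instance.name=myinstance
    instance.zookeeper.host=zooserver-one:2181,zooserver-two:2181
    instance.zookeeper.timeout=30s
    instance.rpc.ssl.enabled=false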
#### Custom Table Tags

Accumulo has the ability for users to add custom tags to tables. This allows applications to set application-level metadata about a table. These tags can be anything from a table description, administrator notes, date created, etc. This is done by naming and setting a property with the prefix `table.custom.*`.

Currently, table properties are stored in ZooKeeper. This means that the number and size of custom properties should be restricted to, at most, on the order of tens of properties, with no property exceeding 1MB in size. ZooKeeper's performance can be very sensitive to an excessive number of nodes and the sizes of the nodes. Applications which leverage the use of custom properties should take these warnings into consideration. There is no enforcement of these warnings via the API.

#### Configuring the ClassLoader

Accumulo builds its Java classpath in `accumulo-env.sh`. After an Accumulo application has started, it will load classes from the locations specified in the deprecated `general.classpaths` property. Additionally, Accumulo will load classes from the locations specified in the `general.dynamic.classpaths` property and will monitor and reload them if they change. The reloading feature is useful during the development and testing of iterators, as new or modified iterator classes can be deployed to Accumulo without having to restart the database.

Accumulo also has an alternate configuration for the classloader which will allow it to load classes from remote locations. This mechanism uses Apache Commons VFS, which enables locations such as http and hdfs to be used. This alternate configuration also uses the `general.classpaths` property in the same manner described above. It differs in that you need to configure the `general.vfs.classpaths` property instead of the `general.dynamic.classpaths` property. As in the default configuration, this alternate configuration will also monitor the VFS locations for changes and reload if necessary.

The Accumulo classpath can be viewed in human-readable format by running `accumulo classpath -d`.

##### ClassLoader Contexts

With the addition of the VFS-based classloader, we introduced the notion of classloader contexts. A context is identified by a name and references a set of locations from which to load classes; it can be specified in the accumulo-site.xml file or added using the `config` command in the shell. Below is an example for specifying the app1 context in the accumulo-site.xml file:

```xml
<property>
  <name>general.vfs.context.classpath.app1</name>
  <value>hdfs://localhost:8020/applicationA/classpath/.*.jar,file:///opt/applicationA/lib/.*.jar</value>
  <description>Application A classpath, loads jars from HDFS and local file system</description>
</property>
```

The default behavior follows the Java ClassLoader contract in that classes, if they exist, are loaded from the parent classloader first. You can override this behavior by delegating to the parent classloader after looking in this classloader first.
An example of this configuration is:

```xml
<property>
  <name>general.vfs.context.classpath.app1.delegation=post</name>
  <value>hdfs://localhost:8020/applicationA/classpath/.*.jar,file:///opt/applicationA/lib/.*.jar</value>
  <description>Application A classpath, loads jars from HDFS and local file system</description>
</property>
```

To use contexts in your application, you can set `table.classpath.context` on your tables or use the `setClassLoaderContext()` method on Scanner and BatchScanner, passing in the name of the context, app1 in the example above. Setting the property on the table allows your minc, majc, and scan iterators to load classes from the locations defined by the context. Passing the context name to the scanners allows you to override the table setting to load only scan-time iterators from a different location.

## Initialization

Accumulo must be initialized to create the structures it uses internally to locate data across the cluster. HDFS is required to be configured and running before Accumulo can be initialized.

Once HDFS is started, initialization can be performed by executing `accumulo init`. This script will prompt for a name for this instance of Accumulo. The instance name is used to identify a set of tables and instance-specific settings. The script will then write some information into HDFS so Accumulo can start properly.

The initialization script will prompt you to set a root password. Once Accumulo is initialized, it can be started.

## Running

### Starting Accumulo

Make sure Hadoop is configured on all of the machines in the cluster, including access to a shared HDFS instance. Make sure HDFS is running and that ZooKeeper is configured and running on at least one machine in the cluster. Start Accumulo using `accumulo-cluster start`.

To verify that Accumulo is running, check the [Accumulo monitor][monitor]. In addition, the Shell can provide some information about the status of tables by reading the metadata tables.

### Stopping Accumulo

To shut down cleanly, run `accumulo-cluster stop` and the master will orchestrate the shutdown of all the tablet servers. Shutdown waits for all minor compactions to finish, so it may take some time for particular configurations.

### Adding a Tablet Server

Update your `conf/tservers` file to account for the addition.

Next, ssh to each of the hosts you want to add and run:

    accumulo-service tserver start

Make sure the host in question has the new configuration, or else the tablet server won't start; at a minimum this needs to be on the host(s) being added, but in practice it's good to ensure consistent configuration across all nodes.

### Decommissioning a Tablet Server

If you need to take a node out of operation, you can trigger a graceful shutdown of a tablet server. Accumulo will automatically rebalance the tablets across the available tablet servers.

    accumulo admin stop <host(s)> {<host> ...}

Alternatively, you can ssh to each of the hosts you want to remove and run:

    accumulo-service tserver stop

Be sure to update your `conf/tservers` file to account for the removal of these hosts. Bear in mind that the monitor will not re-read the tservers file automatically, so it will report the decommissioned servers as down; it's recommended that you restart the monitor so that the node list is up to date.
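As a rough end-to-end sketch of the procedure above, assuming a hypothetical host `tserver3.example.com` and an install path of `/opt/accumulo-X.Y.Z` (exact service names and paths may differ in your deployment):

    # Gracefully stop the tablet server; its tablets are rebalanced to the remaining servers
    accumulo admin stop tserver3.example.com

    # Remove the host from the tservers file so future cluster starts skip it
    sed -i '/^tserver3\.example\.com$/d' /opt/accumulo-X.Y.Z/conf/tservers

    # Restart the monitor (on the monitor host) so its node list is up to date
    accumulo-service monitor stop
    accumulo-service monitor start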
The steps described to decommission a node can also be used (without removal of the host from the `conf/tservers` file) to gracefully stop a node. This will ensure that the tabletserver is cleanly stopped and recovery will not need to be performed when the tablets are re-hosted.

### Restarting process on a node

Occasionally, it might be necessary to restart the processes on a specific node. In addition to the `accumulo-cluster` script, Accumulo has an `accumulo-service` script that can be used to start/stop processes on a node.

#### A note on rolling restarts

For sufficiently large Accumulo clusters, restarting multiple TabletServers within a short window can place significant load on the Master server. If slightly lower availability is acceptable, this load can be reduced by globally setting `table.suspend.duration` to a positive value.

With `table.suspend.duration` set to, say, `5m`, Accumulo will wait for 5 minutes for any dead TabletServer to return before reassigning that TabletServer's responsibilities to other TabletServers. If the TabletServer returns to the cluster before the specified timeout has elapsed, Accumulo will assign the TabletServer its original responsibilities.

It is important not to choose too large a value for `table.suspend.duration`, as during this time, all scans against the data that TabletServer had hosted will block (or time out).

### Running multiple TabletServers on a single node

With very powerful nodes, it may be beneficial to run more than one TabletServer on a given node. This decision should be made carefully and with much deliberation, as Accumulo is designed to be able to scale to using tens of GB of RAM and tens of CPU cores.

Accumulo TabletServers bind certain ports on the host to accommodate remote procedure calls to/from other nodes. Running more than one TabletServer on a host requires that you set the environment variable `ACCUMULO_SERVICE_INSTANCE` to an instance number (i.e. 1, 2) for each instance that is started. Also, set these properties in `accumulo-site.xml`:

```xml
  <property>
    <name>tserver.port.search</name>
    <value>true</value>
  </property>
  <property>
    <name>replication.receipt.service.port</name>
    <value>0</value>
  </property>
```

## Monitoring

### Accumulo Monitor

The Accumulo Monitor provides an interface for monitoring the status and health of Accumulo components, exposed as a web UI at `http://_monitorhost_:9995/`.

Things highlighted in yellow may be in need of attention. If anything is highlighted in red on the monitor page, it definitely needs attention.

The Overview page contains some summary information about the Accumulo instance, including the version, instance name, and instance ID. There is a table labeled Accumulo Master with current status, a table listing the active ZooKeeper servers, and graphs displaying various metrics over time. These include ingest and scan performance and other useful measurements.

The Master Server, Tablet Servers, and Tables pages display metrics grouped in different ways (e.g. by tablet server or by table). Metrics typically include the number of entries (key/value pairs) and ingest and query rates. The numbers of running scans and major and minor compactions are displayed in the form _number_running_ (_number_queued_). Another important metric is hold time, which is the amount of time a tablet has been waiting but unable to flush its memory in a minor compaction.
The Server Activity page graphically displays tablet server status, with each server represented as a circle or square. Different metrics may be assigned to the nodes' color and speed of oscillation. The Overall Avg metric is only used on the Server Activity page, and represents the average of all the other metrics (after normalization). Similarly, the Overall Max metric picks the metric with the maximum normalized value.

The Garbage Collector page displays a list of garbage collection cycles, the number of files found of each type (including deletion candidates in use and files actually deleted), and the length of the deletion cycle. The Traces page displays data for recent traces performed (see the following section for information on [tracing][tracing]). The Recent Logs page displays warning and error logs forwarded to the monitor from all Accumulo processes. Also, the XML and JSON links provide metrics in XML and JSON formats, respectively.

### SSL

SSL may be enabled for the monitor page by setting the following properties in the `accumulo-site.xml` file:

    monitor.ssl.keyStore
    monitor.ssl.keyStorePassword
    monitor.ssl.trustStore
    monitor.ssl.trustStorePassword

If the Accumulo conf directory has been configured (in particular, the `accumulo-env.sh` file must be set up), the `accumulo-util gen-monitor-cert` command can be used to create the keystore and truststore files with random passwords. The command will print out the properties that need to be added to the `accumulo-site.xml` file. The stores can also be generated manually with the Java `keytool` command, whose usage can be seen in the `accumulo-util` script.

If desired, the SSL ciphers allowed for connections can be controlled via the following properties in `accumulo-site.xml`:

    monitor.ssl.include.ciphers
    monitor.ssl.exclude.ciphers

If SSL is enabled, the monitor URL can only be accessed via https. This also allows you to access the Accumulo shell through the monitor page. The left navigation bar will have a new link to Shell. An Accumulo user name and password must be entered for access to the shell.

## Metrics

Accumulo can expose metrics through a legacy metrics library and through the Hadoop Metrics2 library.

### Legacy Metrics

Accumulo has a legacy metrics library that can expose metrics using JMX endpoints or file-based logging. These metrics can be enabled by setting `general.legacy.metrics` to `true` in `accumulo-site.xml` and placing the `accumulo-metrics.xml` configuration file on the classpath (which is typically done by placing the file in the `conf/` directory). A template for `accumulo-metrics.xml` can be found in `conf/templates` of the Accumulo tarball.

### Hadoop Metrics2

Hadoop Metrics2 is a library which allows for routing of metrics generated by registered MetricsSources to configured MetricsSinks. Examples of sinks that are implemented by Hadoop include file-based logging, Graphite, and Ganglia. All metric sources are exposed via JMX when using Metrics2.

Metrics2 is configured by examining the classpath for a file that matches `hadoop-metrics2*.properties`. The Accumulo tarball contains an example `hadoop-metrics2-accumulo.properties` file in `conf/templates` which can be copied to `conf/` to place it on the classpath. This file is used to enable file, Graphite or Ganglia sinks (some minimal configuration required for Graphite and Ganglia).
Because the Hadoop configuration is also on the Accumulo classpath, be sure that you do not have multiple +Metrics2 configuration files. It is recommended to consolidate metrics in a single properties file in a central location to +remove ambiguity. The contents of `hadoop-metrics2-accumulo.properties` can be added to a central `hadoop-metrics2.properties` +in `$HADOOP_CONF_DIR`. + +As a note for configuring the file sink, the provided path should be absolute. A relative path or file name will be created relative +to the directory in which the Accumulo process was started. External tools, such as logrotate, can be used to prevent these files +from growing without bound. + +Each server process should have log messages from the Metrics2 library about the sinks that were created. Be sure to check +the Accumulo processes log files when debugging missing metrics output. + +For additional information on configuring Metrics2, visit the [Javadoc page for Metrics2](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html). + +## Tracing + +It can be difficult to determine why some operations are taking longer +than expected. For example, you may be looking up items with very low +latency, but sometimes the lookups take much longer. Determining the +cause of the delay is difficult because the system is distributed, and +the typical lookup is fast. + +Accumulo has been instrumented to record the time that various +operations take when tracing is turned on. The fact that tracing is +enabled follows all the requests made on behalf of the user throughout +the distributed infrastructure of accumulo, and across all threads of +execution. + +These time spans will be inserted into the `trace` table in +Accumulo. You can browse recent traces from the Accumulo monitor +page. You can also read the `trace` table directly like any +other table. + +The design of Accumulo's distributed tracing follows that of [Google's Dapper](http://research.google.com/pubs/pub36356.html). + +### Tracers + +To collect traces, Accumulo needs at least one tracer server running. If you are using `accumulo-cluster` to start your cluster, +configure your server in `conf/tracers`. The server collects traces from clients and writes them to the `trace` table. The Accumulo +user that the tracer connects to Accumulo with can be configured with the following properties (see the [configuration management][config-mgmt] +page for setting Accumulo server properties) + + trace.user + trace.token.property.password + +Other tracer configuration properties include + + trace.port.client - port tracer listens on + trace.table - table tracer writes to + trace.zookeeper.path - zookeeper path where tracers register + +The zookeeper path is configured to /tracers by default. If +multiple Accumulo instances are sharing the same ZooKeeper +quorum, take care to configure Accumulo with unique values for +this property. + +### Configuring Tracing + +Traces are collected via SpanReceivers. The default SpanReceiver +configured is org.apache.accumulo.core.trace.ZooTraceClient, which +sends spans to an Accumulo Tracer process, as discussed in the +previous section. This default can be changed to a different span +receiver, or additional span receivers can be added in a +comma-separated list, by modifying the property + + trace.span.receivers + +Individual span receivers may require their own configuration +parameters, which are grouped under the trace.span.receiver.* +prefix. ZooTraceClient uses the following properties. 
The first +three properties are populated from other Accumulo properties, +while the remaining ones should be prefixed with +trace.span.receiver. when set in the Accumulo configuration. + + tracer.zookeeper.host - populated from instance.zookeepers + tracer.zookeeper.timeout - populated from instance.zookeeper.timeout + tracer.zookeeper.path - populated from trace.zookeeper.path + tracer.send.timer.millis - timer for flushing send queue (in ms, default 1000) + tracer.queue.size - max queue size (default 5000) + tracer.span.min.ms - minimum span length to store (in ms, default 1) + +Note that to configure an Accumulo client for tracing, including +the Accumulo shell, the client configuration must be given the same +trace.span.receivers, trace.span.receiver.*, and trace.zookeeper.path +properties as the servers have. + +Hadoop can also be configured to send traces to Accumulo, as of +Hadoop 2.6.0, by setting properties in Hadoop's core-site.xml +file. Instead of using the trace.span.receiver.* prefix, Hadoop +uses hadoop.htrace.*. The Hadoop configuration does not have +access to Accumulo's properties, so the +hadoop.htrace.tracer.zookeeper.host property must be specified. +The zookeeper timeout defaults to 30000 (30 seconds), and the +zookeeper path defaults to /tracers. An example of configuring +Hadoop to send traces to ZooTraceClient is + +```xml + <property> + <name>hadoop.htrace.spanreceiver.classes</name> + <value>org.apache.accumulo.core.trace.ZooTraceClient</value> + </property> + <property> + <name>hadoop.htrace.tracer.zookeeper.host</name> + <value>zookeeperHost:2181</value> + </property> + <property> + <name>hadoop.htrace.tracer.zookeeper.path</name> + <value>/tracers</value> + </property> + <property> + <name>hadoop.htrace.tracer.span.min.ms</name> + <value>1</value> + </property> +``` + +The accumulo-core, accumulo-tracer, accumulo-fate and libthrift +jars must also be placed on Hadoop's classpath. + +##### Adding additional SpanReceivers + +[Zipkin](https://github.com/openzipkin/zipkin) has a SpanReceiver supported by HTrace and popularized by Twitter +that users looking for a more graphical trace display may opt to use. +The following steps configure Accumulo to use `org.apache.htrace.impl.ZipkinSpanReceiver` +in addition to the Accumulo's default ZooTraceClient, and they serve as a template +for adding any SpanReceiver to Accumulo: + +1. Add the Jar containing the ZipkinSpanReceiver class file to the +`lib/` directory. It is critical that the Jar is placed in +`lib/` and NOT in `lib/ext/` so that the new SpanReceiver class +is visible to the same class loader of htrace-core. + +2. Add the following to `accumulo-site.xml`: + + <property> + <name>trace.span.receivers</name> + <value>org.apache.accumulo.tracer.ZooTraceClient,org.apache.htrace.impl.ZipkinSpanReceiver</value> + </property> + +3. Restart your Accumulo tablet servers. + +In order to use ZipkinSpanReceiver from a client as well as the Accumulo server, + +1. Ensure your client can see the ZipkinSpanReceiver class at runtime. For Maven projects, +this is easily done by adding to your client's pom.xml (taking care to specify a good version) + + <dependency> + <groupId>org.apache.htrace</groupId> + <artifactId>htrace-zipkin</artifactId> + <version>3.1.0-incubating</version> + <scope>runtime</scope> + </dependency> + +2. Add the following to your client configuration. + + trace.span.receivers=org.apache.accumulo.tracer.ZooTraceClient,org.apache.htrace.impl.ZipkinSpanReceiver + +3. 
Instrument your client as in the next section.

Your SpanReceiver may require additional properties, and if so these should likewise be placed in the ClientConfiguration (if applicable) and Accumulo's `accumulo-site.xml`. Two such properties for ZipkinSpanReceiver, listed with their default values, are

```xml
  <property>
    <name>trace.span.receiver.zipkin.collector-hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>trace.span.receiver.zipkin.collector-port</name>
    <value>9410</value>
  </property>
```

#### Instrumenting a Client

Tracing can be used to measure a client operation, such as a scan, as the operation traverses the distributed system. To enable tracing for your application call

```java
import org.apache.accumulo.core.trace.DistributedTrace;
...
DistributedTrace.enable(hostname, "myApplication");
// do some tracing
...
DistributedTrace.disable();
```

Once tracing has been enabled, a client can wrap an operation in a trace.

```java
import org.apache.htrace.Sampler;
import org.apache.htrace.Trace;
import org.apache.htrace.TraceScope;
...
TraceScope scope = Trace.startSpan("Client Scan", Sampler.ALWAYS);
BatchScanner scanner = conn.createBatchScanner(...);
// Configure your scanner
for (Entry<Key,Value> entry : scanner) {
}
scope.close();
```

The user can create additional Spans within a Trace.

The sampler (such as `Sampler.ALWAYS`) for the trace should only be specified with a top-level span; subsequent spans will be collected depending on whether that first span was sampled. Don't forget to specify a Sampler at the top-level span, because the default Sampler only samples when part of a pre-existing trace, which will never occur in a client that never specifies a Sampler.

```java
TraceScope scope = Trace.startSpan("Client Update", Sampler.ALWAYS);
...
TraceScope readScope = Trace.startSpan("Read");
...
readScope.close();
...
TraceScope writeScope = Trace.startSpan("Write");
...
writeScope.close();
scope.close();
```

Like Dapper, Accumulo tracing supports user-defined annotations to associate additional data with a Trace. Checking whether the client is currently tracing is necessary when using a sampler other than Sampler.ALWAYS.

```java
...
int numberOfEntriesRead = 0;
TraceScope readScope = Trace.startSpan("Read");
// Do the read, update the counter
...
if (Trace.isTracing())
  readScope.getSpan().addKVAnnotation("Number of Entries Read".getBytes(StandardCharsets.UTF_8),
      String.valueOf(numberOfEntriesRead).getBytes(StandardCharsets.UTF_8));
```

It is also possible to add timeline annotations to your spans. This associates a string with a given timestamp between the start and stop times for a span.

```java
...
writeScope.getSpan().addTimelineAnnotation("Initiating Flush");
```

Some client operations may have a high volume within your application. As such, you may wish to only sample a percentage of operations for tracing. As seen below, the CountSampler can be used to help enable tracing for 1-in-1000 operations.

```java
import org.apache.htrace.impl.CountSampler;
...
Sampler sampler = new CountSampler(HTraceConfiguration.fromMap(
    Collections.singletonMap(CountSampler.SAMPLER_FREQUENCY_CONF_KEY, "1000")));
...
TraceScope readScope = Trace.startSpan("Read", sampler);
...
readScope.close();
```

Remember to close all spans and disable tracing when finished.
```java
DistributedTrace.disable();
```

### Viewing Collected Traces

To view collected traces, use the "Recent Traces" link on the Monitor UI. You can also programmatically access and print traces using the `TraceDump` class.

#### Trace Table Format

This section is for developers looking to use data recorded in the trace table directly, above and beyond the default services of the Accumulo monitor. Please note the trace table format and its supporting classes are not in the public API and may be subject to change in future versions.

Each span received by a tracer's ZooTraceClient is recorded in the trace table in the form of three entries: span entries, index entries, and start time entries. Span and start time entries record full span information, whereas index entries provide indexing into span information useful for quickly finding spans by type or start time.

Each entry is illustrated by a description and a sample of data. In the description, a token in quotes is a String literal, whereas other tokens are span variables. Parentheses group parts together, to distinguish colon characters inside the column family or qualifier from the colon that separates column family and qualifier. We use the format `row columnFamily:columnQualifier columnVisibility value` (omitting the timestamp, which records the time an entry is written to the trace table).

Span entries take the following form:

    traceId "span":(parentSpanId:spanId) [] spanBinaryEncoding
    63b318de80de96d1 span:4b8f66077df89de1:3778c6739afe4e1 [] %18;%09;...

The parentSpanId is "" for the root span of a trace. The spanBinaryEncoding is a compact Apache Thrift encoding of the original Span object. This allows clients (and the Accumulo monitor) to recover all the details of the original Span at a later time, by scanning the trace table and decoding the value of span entries via `TraceFormatter.getRemoteSpan(entry)`.

The trace table has a formatter class by default (org.apache.accumulo.tracer.TraceFormatter) that changes how span entries appear in the Accumulo shell. Normal scans of the trace table do not use this formatter representation; it exists only to make span entries easier to view inside the Accumulo shell.

Index entries take the following form:

    "idx":service:startTime description:sender [] traceId:elapsedTime
    idx:tserver:14f3828f58b startScan:localhost [] 63b318de80de96d1:1

The service and sender are set by the first call of each Accumulo process (and instrumented client processes) to `DistributedTrace.enable(...)` (the sender is autodetected if not specified). The description is specified in each span. The start time and the elapsed time (stop - start, 1 millisecond in the example above) are recorded in milliseconds as long values serialized to a string in hex.

Start time entries take the following form:

    "start":startTime "id":traceId [] spanBinaryEncoding
    start:14f3828a351 id:63b318de80de96d1 [] %18;%09;...

The following classes may be run while Accumulo is running to provide insight into trace statistics. These require accumulo-trace-VERSION.jar to be provided on the Accumulo classpath (`lib/ext` is fine).

    $ accumulo org.apache.accumulo.tracer.TraceTableStats -u username -p password -i instancename
    $ accumulo org.apache.accumulo.tracer.TraceDump -u username -p password -i instancename -r

### Tracing from the Shell

You can enable tracing for operations run from the shell by using the `trace on` and `trace off` commands.
```
root@test test> trace on

root@test test> scan
a b:c []    d

root@test test> trace off
Waiting for trace information
Waiting for trace information
Trace started at 2013/08/26 13:24:08.332
Time  Start  Service@Location       Name
 3628+0      shell@localhost shell:root
    8+1690     shell@localhost scan
    7+1691       shell@localhost scan:location
    6+1692         tserver@localhost startScan
    5+1692           tserver@localhost tablet read ahead 6
```

## Logging

Accumulo processes each write to a set of log files. By default, these logs are found in the directory set by `ACCUMULO_LOG_DIR` in `accumulo-env.sh`.

## Recovery

In the event of TabletServer failure, or an error while shutting Accumulo down, some mutations may not have been minor compacted to HDFS properly. In this case, Accumulo will automatically reapply such mutations from the write-ahead log, either when the tablets from the failed server are reassigned by the Master (in the case of a single TabletServer failure) or the next time Accumulo starts (in the event of failure during shutdown).

Recovery is performed by asking a tablet server to sort the logs so that tablets can easily find their missing updates. The sort status of each file is displayed on the Accumulo monitor status page. Once the recovery is complete, any tablets involved should return to an "online" state. Until then, those tablets will be unavailable to clients.

The Accumulo client library is configured to retry failed mutations and in many cases clients will be able to continue processing after the recovery process without throwing an exception.

## Migrating Accumulo from non-HA Namenode to HA Namenode

The following steps will allow a non-HA instance to be migrated to an HA instance. Consider an HDFS URL `hdfs://namenode.example.com:8020` which is going to be moved to `hdfs://nameservice1`.

Before moving HDFS over to the HA namenode, use `accumulo admin volumes` to confirm that the only volume displayed is the volume from the current namenode's HDFS URL.

    Listing volumes referenced in zookeeper
            Volume : hdfs://namenode.example.com:8020/accumulo

    Listing volumes referenced in accumulo.root tablets section
            Volume : hdfs://namenode.example.com:8020/accumulo
    Listing volumes referenced in accumulo.root deletes section (volume replacement occurs at deletion time)

    Listing volumes referenced in accumulo.metadata tablets section
            Volume : hdfs://namenode.example.com:8020/accumulo

    Listing volumes referenced in accumulo.metadata deletes section (volume replacement occurs at deletion time)

After verifying the current volume is correct, shut down the cluster and transition HDFS to the HA nameservice.

Edit `accumulo-site.xml` to notify Accumulo that a volume is being replaced. First, add the new nameservice volume to the `instance.volumes` property. Next, add the `instance.volumes.replacements` property in the form of `old new`. It's important not to include the volume that's being replaced in `instance.volumes`; otherwise it's possible Accumulo could continue to write to the volume.

```xml
<!-- instance.dfs.uri and instance.dfs.dir should not be set -->
<property>
  <name>instance.volumes</name>
  <value>hdfs://nameservice1/accumulo</value>
</property>
<property>
  <name>instance.volumes.replacements</name>
  <value>hdfs://namenode.example.com:8020/accumulo hdfs://nameservice1/accumulo</value>
</property>
```

Run `accumulo init --add-volumes` and start up the Accumulo cluster.
Verify that the new nameservice volume shows up with `accumulo admin volumes`.

    Listing volumes referenced in zookeeper
            Volume : hdfs://namenode.example.com:8020/accumulo
            Volume : hdfs://nameservice1/accumulo

    Listing volumes referenced in accumulo.root tablets section
            Volume : hdfs://namenode.example.com:8020/accumulo
            Volume : hdfs://nameservice1/accumulo
    Listing volumes referenced in accumulo.root deletes section (volume replacement occurs at deletion time)

    Listing volumes referenced in accumulo.metadata tablets section
            Volume : hdfs://namenode.example.com:8020/accumulo
            Volume : hdfs://nameservice1/accumulo
    Listing volumes referenced in accumulo.metadata deletes section (volume replacement occurs at deletion time)

Some erroneous GarbageCollector messages may still be seen for a small period while data is transitioning to the new volumes. This is expected and can usually be ignored.

## Achieving Stability in a VM Environment

For testing, demonstration, and even operational uses, Accumulo is often installed and run in a virtual machine (VM) environment. The majority of long-term operational uses of Accumulo are on bare-metal clusters. However, the core design of Accumulo and its dependencies do not preclude running stably for long periods within a VM. Many of Accumulo's operational robustness features for handling failures, such as periodic network partitioning in a large cluster, carry over well to VM environments. This guide covers general recommendations for maximizing stability in a VM environment, including some of the failure modes that are more common when running in VMs.

### Known failure modes: Setup and Troubleshooting

In addition to the general failure modes of running Accumulo, VMs can introduce a couple of environmental challenges that can affect process stability. Clock drift is something that is more common in VMs, especially when VMs are suspended and resumed. Clock drift can cause Accumulo servers to assume that they have lost connectivity to the other Accumulo processes and/or lose their locks in Zookeeper. VM environments also frequently have constrained resources, such as CPU, RAM, network, and disk throughput and capacity. Accumulo generally deals well with constrained resources from a stability perspective (optimizing performance will require additional tuning, which is not covered in this section); however, there are some limits.

#### Physical Memory

One of those limits has to do with the Linux out-of-memory killer. A common failure mode in VM environments (and in some bare-metal installations) is when the Linux out-of-memory killer decides to kill processes in order to avoid a kernel panic when provisioning a memory page. This often happens in VMs due to the large number of processes that must run in a small memory footprint. In addition to the Linux core processes, a single-node Accumulo setup requires a Hadoop Namenode, a Hadoop Secondary Namenode, a Hadoop Datanode, a Zookeeper server, an Accumulo Master, an Accumulo GC, and an Accumulo TabletServer. Typical setups also include an Accumulo Monitor, an Accumulo Tracer, a Hadoop ResourceManager, a Hadoop NodeManager, provisioning software, and client applications. Between all of these processes, it is not uncommon to over-subscribe the available RAM in a VM.
We recommend setting up VMs without swap enabled, so rather than performance grinding to a halt when physical memory is exhausted, the kernel will randomly select processes to kill in order to free up memory.

Calculating the maximum possible memory usage is essential in creating a stable Accumulo VM setup. Safely engineering memory allocation for stability is then a matter of bringing the calculated maximum memory usage under the physical memory by a healthy margin. The margin is to account for operating-system-level operations, such as managing processes, maintaining virtual memory pages, and file system caching. When the Linux out-of-memory killer finds your process, you will probably only see evidence of that in /var/log/messages. Out-of-memory process kills do not show up in Accumulo or Hadoop logs.

To calculate the maximum memory usage of all Java virtual machine (JVM) processes, add the maximum heap size (often limited by a -Xmx... argument, such as in accumulo-site.xml) and the off-heap memory usage. Off-heap memory usage includes the following:

* "Permanent Space", where the JVM stores Classes, Methods, and other code elements. This can be limited by a JVM flag such as `-XX:MaxPermSize=100m`, and is typically tens of megabytes.
* Code generation space, where the JVM stores just-in-time compiled code. This is typically small enough to ignore.
* Socket buffers, where the JVM stores send and receive buffers for each socket.
* Thread stacks, where the JVM allocates memory to manage each thread.
* Direct memory space and JNI code, where applications can allocate memory outside of the JVM-managed space. For Accumulo, this includes the native in-memory maps that are allocated with the `tserver.memory.maps.max` parameter in accumulo-site.xml.
* Garbage collection space, where the JVM stores information used for garbage collection.

You can assume that each Hadoop and Accumulo process will use ~100-150MB of off-heap memory, plus the in-memory map of the Accumulo TServer process. A simple calculation of the physical memory requirements follows:

```
  Physical memory needed
  = (per-process off-heap memory) + (heap memory) + (other processes) + (margin)
  = (number of java processes * 150M + native map) + (sum of -Xmx settings for java processes) + (total applications memory, provisioning memory, etc.) + (1G)
  = (11*150M + 500M) + (1G + 1G + 1G + 256M + 1G + 256M + 512M + 512M + 512M + 512M + 512M) + (2G) + (1G)
  = (2150M) + (7G) + (2G) + (1G)
  = ~12GB
```

These calculations can add up quickly with the large number of processes, especially in constrained VM environments. To reduce the physical memory requirements, it is a good idea to reduce maximum heap limits and turn off unnecessary processes. If you're not using YARN in your application, you can turn off the ResourceManager and NodeManager. If you're not expecting to re-provision the cluster frequently, you can turn off or reduce provisioning processes such as Salt Stack minions and masters.

#### Disk Space

Disk space is primarily used for two operations: storing data and storing logs. While Accumulo generally stores all of its key/value data in HDFS, Accumulo, Hadoop, and Zookeeper all store a significant amount of logs in a directory on a local file system. Care should be taken to make sure that (a) limitations on the amount of logs generated are in place, and (b) enough space is available to host the generated logs on the partitions that they are assigned. When space is not available to log, processes will hang.
This can cause interruptions in the availability of Accumulo, as well as cascade into failures of various processes.

Hadoop, Accumulo, and Zookeeper use log4j as a logging mechanism, and each of them has a way of limiting the logs and directing them to a particular directory. Logs are generated independently for each process, so when considering the total space you need to add up the maximum logs generated by each process. Typically, a rolling log setup is instituted in which each process can generate something like ten 100MB files, resulting in a maximum file system usage of 1GB per process. Default setups for Hadoop and Zookeeper are often unbounded, so it is important to set these limits in the logging configuration files for each subsystem. Consult the user manual for each system for instructions on how to limit generated logs.

#### Zookeeper Interaction

Accumulo is designed to scale up to thousands of nodes. At that scale, intermittent interruptions in network service and other rare failures of compute nodes become more common. To limit the impact of node failures on overall service availability, Accumulo uses a heartbeat monitoring system that leverages Zookeeper's ephemeral locks. There are several conditions that can occur that cause Accumulo processes to lose their Zookeeper locks, some of which are true interruptions to availability and some of which are false positives. Several of these conditions become more common in VM environments, where they can be exacerbated by resource constraints and clock drift.

#### Tested Versions

Each release of Accumulo is built with a specific version of Apache Hadoop, Apache ZooKeeper and Apache Thrift. We expect Accumulo to work with versions that are API compatible with those versions. However, this compatibility is not guaranteed because Hadoop, ZooKeeper and Thrift may not provide guarantees between their own versions. We have also found that certain versions of Accumulo and Hadoop included bugs that greatly affected overall stability. Thrift is particularly prone to compatibility changes between versions and you must use the same version your Accumulo is built with.

Please check the release notes for your Accumulo version or use the mailing lists at https://accumulo.apache.org for more info.

[tracing]: {{page.docs_baseurl}}/administration/overview#tracing
[monitor]: {{page.docs_baseurl}}/administration/overview#monitoring
[config-mgmt]: {{page.docs_baseurl}}/administration/configuration-management
[config-props]: {{page.docs_baseurl}}/administration/configuration-properties
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/administration/replication.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/administration/replication.md b/_docs-unreleased/administration/replication.md
new file mode 100644
index 0000000..edfa1f8
--- /dev/null
+++ b/_docs-unreleased/administration/replication.md
@@ -0,0 +1,385 @@
---
title: Replication
category: administration
order: 5
---

## Overview

Replication is a feature of Accumulo which provides a mechanism to automatically
copy data to other systems, typically for the purpose of disaster recovery,
high availability, or geographic locality. It is best to consider this feature
as a framework for automatic replication rather than simply the ability to copy data
from one Accumulo instance to another, as copying to another Accumulo cluster is
only one implementation of the framework. The local Accumulo cluster is hereafter referred
to as the `primary` while systems being replicated to are known as
`peers`.

This replication framework makes two Accumulo instances, where one instance
replicates to another, eventually consistent with one another, as opposed
to the strong consistency that each single Accumulo instance still holds. That
is to say, attempts to read data from a table on a peer which has pending replication
from the primary will not wait for that data to be replicated before running the scan.
This is desirable for a number of reasons, the most important being that the replication
framework is not limited by network outages or offline peers, but only by the HDFS
space available on the primary system.

A replication configuration can be considered a directed graph which allows cycles.
The set of systems from which data has already been replicated is maintained in each
Mutation, which allows each system to determine whether a peer already has the data
that the system wants to send.

Data is replicated using the write-ahead logs (WALs) that each TabletServer
already maintains. TabletServers record in the `accumulo.metadata` table which WALs
contain data that needs to be replicated. The Master uses these records,
combined with the local Accumulo table that the WAL was used with, to create records
in the `replication` table which track which peers the given WAL should be
replicated to. The Master later uses these work entries to assign the actual
replication task to a local TabletServer using ZooKeeper. A TabletServer will get
a lock in ZooKeeper for the replication of this file to a peer, and proceed to
replicate to the peer, recording progress in the `replication` table as
data is successfully replicated on the peer. Later, the Master and Garbage Collector
will remove records from the `accumulo.metadata` and `replication` tables
and files from HDFS, respectively, after replication to all peers is complete.

## Configuration

Configuration of Accumulo to replicate data to another system can be categorized
into the following sections.

### Site Configuration

Each system involved in replication (even the primary) needs a name that uniquely
identifies it across all peers in the replication graph. This should be considered
fixed for an instance, and set in `accumulo-site.xml`:
```xml
<property>
  <name>replication.name</name>
  <value>primary</value>
  <description>Unique name for this system used by replication</description>
</property>
```

### Instance Configuration

For each peer of this system, Accumulo needs to know the name of that peer,
the class used to replicate data to that system, and some configuration information
needed to connect to this remote peer. In the case of Accumulo, this additional data
is the Accumulo instance name and ZooKeeper quorum; however, this varies depending on the
replication implementation for the peer.

These can be set in the site configuration to ease deployments; however, as they may
change, it can be useful to set this information using the Accumulo shell.

To configure a peer with the name `peer1`, which is an Accumulo system with an instance name of `accumulo_peer`
and a ZooKeeper quorum of `10.0.0.1,10.0.2.1,10.0.3.1`, invoke the following
command in the shell.

```
root@accumulo_primary> config -s
replication.peer.peer1=org.apache.accumulo.tserver.replication.AccumuloReplicaSystem,accumulo_peer,10.0.0.1,10.0.2.1,10.0.3.1
```

Since this is an Accumulo system, we also want to set a username and password
to use when authenticating with this peer. On our peer, we make a special user,
"replication" with a password of "password", which has permission to write to the
tables we want to replicate data into. We then need to record this in the primary's configuration.

```
root@accumulo_primary> config -s replication.peer.user.peer1=replication
root@accumulo_primary> config -s replication.peer.password.peer1=password
```

Alternatively, when configuring replication on an Accumulo instance running Kerberos, a keytab
file per peer can be configured instead of a password. The provided keytabs must be readable
by the unix user running Accumulo. The keytab for a peer can be different from the
keytab used by Accumulo or any keytabs for other peers.

```
accum...@example.com@accumulo_primary> config -s replication.peer.user.peer1=replicat...@example.com
accum...@example.com@accumulo_primary> config -s replication.peer.keytab.peer1=/path/to/replication.keytab
```

### Table Configuration

Now that we have a peer defined, we need to configure which tables will
replicate to that peer. We also need to configure an identifier to determine where
this data will be replicated on the peer. Since we're replicating to another Accumulo
cluster, this is a table ID. In this example, we want to enable replication on
`my_table` and configure our peer `accumulo_peer` as a target, sending
the data to the table with an ID of `2` in `accumulo_peer`.

```
root@accumulo_primary> config -t my_table -s table.replication=true
root@accumulo_primary> config -t my_table -s table.replication.target.accumulo_peer=2
```

To replicate a single table on the primary to multiple peers, the second command
in the above shell snippet can be issued for each peer and remote identifier pair,
as shown in the sketch below.
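For example, assuming a second peer named `accumulo_peer_dr` had also been defined and the
destination table on that peer had an ID of `3` (both the peer name and the table ID here are
hypothetical), the table would be configured with one target per peer:

```
root@accumulo_primary> config -t my_table -s table.replication.target.accumulo_peer=2
root@accumulo_primary> config -t my_table -s table.replication.target.accumulo_peer_dr=3
```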
## Monitoring

Basic information about replication status from a primary can be found on the Accumulo
Monitor server, using the `Replication` link in the sidebar.

On this page, information is broken down into the following sections:

1. Files pending replication by peer and target
2. Files queued for replication, with progress made

## Work Assignment

Depending on the schema of a table, different implementations of the WorkAssigner
can be configured. The implementation is controlled via the property `replication.work.assigner`
and the full class name for the implementation. This can be configured via the shell or
`accumulo-site.xml`.

```xml
<property>
  <name>replication.work.assigner</name>
  <value>org.apache.accumulo.master.replication.SequentialWorkAssigner</value>
  <description>Implementation used to assign work for replication</description>
</property>
```

```
root@accumulo_primary> config -t my_table -s replication.work.assigner=org.apache.accumulo.master.replication.SequentialWorkAssigner
```

Two implementations are provided. By default, the `SequentialWorkAssigner` is configured for an
instance. The SequentialWorkAssigner ensures that, for each peer and remote identifier, each WAL is
replicated in the order in which it was created. This is sufficient to ensure that updates to a table
will be replayed in the correct order on the peer. This implementation has the downside of only replicating
a single WAL at a time.

The second implementation, the `UnorderedWorkAssigner`, can be used to overcome the limitation
of only a single WAL being replicated to a target and peer at any time. Depending on the table schema,
it's possible that multiple versions of the same Key with different values are infrequent or nonexistent.
In this case, parallel replication to a peer and target is possible without any downsides. In the case
where this implementation is used and column updates are frequent, it is possible that there will be
an inconsistency between the primary and the peer.

## ReplicaSystems

`ReplicaSystem` is the interface which allows abstraction of replication of data
to peers of various types. Presently, only an `AccumuloReplicaSystem` is provided,
which will replicate data to another Accumulo instance. A `ReplicaSystem` implementation
is run inside of the TabletServer process, and can be configured as mentioned in the
`Instance Configuration` section of this document. Theoretically, an implementation
of this interface could send data to other filesystems, databases, etc.

### AccumuloReplicaSystem

The `AccumuloReplicaSystem` uses Thrift to communicate with a peer Accumulo instance
and replicate the necessary data. The TabletServer running on the primary will communicate
with the Master on the peer to request the address of a TabletServer on the peer which
this TabletServer will use to replicate the data.

The TabletServer on the primary will then replicate data in batches of a configurable
size (`replication.max.unit.size`). The TabletServer on the peer will report how many
records were applied back to the primary, which will be used to record how many records
were successfully replicated. The TabletServer on the primary will continue to replicate
data in these batches until no more data can be read from the file.
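If the default batch size needs tuning, `replication.max.unit.size` can be set in
`accumulo-site.xml` (or via the shell); the 32M value below is purely illustrative and not a
recommendation:

```xml
<property>
  <name>replication.max.unit.size</name>
  <value>32M</value>
  <description>Maximum amount of data sent to a peer in a single replication RPC</description>
</property>
```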
## Other Configuration

There are a number of configuration values that can be used to control how
the various components of the implementation operate.

|Property | Description | Default
|---------|-------------|--------
|replication.max.work.queue | Maximum number of files queued for replication at one time | 1000
|replication.work.assignment.sleep | Time between invocations of the WorkAssigner | 30s
|replication.worker.threads | Size of threadpool used to replicate data to peers | 4
|replication.receipt.service.port | Thrift service port to listen for replication requests, can use '0' for a random port | 10002
|replication.work.attempts | Number of attempts to replicate to a peer before aborting the attempt | 10
|replication.receiver.min.threads | Minimum number of idle threads for handling incoming replication | 1
|replication.receiver.threadcheck.time | Time between attempting adjustments of thread pool for incoming replications | 30s
|replication.max.unit.size | Maximum amount of data to be replicated in one RPC | 64M
|replication.work.assigner | Work Assigner implementation | org.apache.accumulo.master.replication.SequentialWorkAssigner
|tserver.replication.batchwriter.replayer.memory| Size of BatchWriter cache to use in applying replication requests | 50M

## Example Practical Configuration

A real-life example is now provided to give a concrete application of replication configuration. This
example is a two-instance Accumulo system, one primary system and one peer system. They are called
primary and peer, respectively. Each system also has a table of the same name, "my_table". The instance
name for each is also the same (primary and peer), and both have ZooKeeper hosts on a node with a hostname
of that name as well (primary:2181 and peer:2181).

We want to configure these systems so that "my_table" on "primary" replicates to "my_table" on "peer".

### accumulo-site.xml

We can assign the "unique" name that identifies this Accumulo instance among all others that might participate
in replication together. In this example, we will use the names provided in the description.

#### Primary

```xml
<property>
  <name>replication.name</name>
  <value>primary</value>
  <description>Defines the unique name</description>
</property>
```

#### Peer

```xml
<property>
  <name>replication.name</name>
  <value>peer</value>
</property>
```

### masters and tservers files

Be *sure* to use non-local IP addresses in these files. Other nodes need to connect to these services,
and using localhost will likely result in a local node talking to another local node.

### Start both instances

The rest of the configuration is dynamic and is best configured on the fly (in ZooKeeper) rather than in accumulo-site.xml.

### Peer

The next series of commands is to be run on the peer system. Create a user account for the primary instance called
"peer". The password for this account will need to be saved in the configuration on the primary.

```
root@peer> createtable my_table
root@peer> createuser peer
root@peer> grant -t my_table -u peer Table.WRITE
root@peer> grant -t my_table -u peer Table.READ
root@peer> tables -l
```

Remember what the table ID for 'my_table' is. You'll need it to configure the primary instance.
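The output of `tables -l` maps table names to table IDs; it will look roughly like the
following sketch (the exact formatting varies by version, and the ID of `2` is only an
assumed value for this example):

```
accumulo.metadata    =>        !0
accumulo.root        =>        +r
my_table             =>         2
```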
### Primary

Next, configure the primary instance.

#### Set up the table

```
root@primary> createtable my_table
```

#### Define the Peer as a replication peer to the Primary

We're defining the instance with a replication.name of 'peer' as a peer. We provide the implementation of ReplicaSystem
that we want to use, and the configuration for the AccumuloReplicaSystem. In this case, the configuration is the Accumulo
instance name for 'peer' and the ZooKeeper quorum string. The configuration key is of the form
"replication.peer.$peer_name".

```
root@primary> config -s replication.peer.peer=org.apache.accumulo.tserver.replication.AccumuloReplicaSystem,peer,$peer_zk_quorum
```

#### Set the authentication credentials

We want to use the special username and password that we created on the peer, so we have a means to write data to
the table that we want to replicate to. The configuration key is of the form "replication.peer.user.$peer_name".

```
root@primary> config -s replication.peer.user.peer=peer
root@primary> config -s replication.peer.password.peer=peer
```

#### Enable replication on the table

Now that we have defined the peer on the primary and provided the authentication credentials, we need to configure
our table with the implementation of ReplicaSystem we want to use to replicate to the peer. In this case, our peer
is an Accumulo instance, so we want to use the AccumuloReplicaSystem.

The configuration for the AccumuloReplicaSystem is the table ID for the table on the peer instance that we
want to replicate into. Be sure to use the correct value for $peer_table_id. The configuration key is of
the form "table.replication.target.$peer_name".

```
root@primary> config -t my_table -s table.replication.target.peer=$peer_table_id
```

Finally, we can enable replication on this table.

```
root@primary> config -t my_table -s table.replication=true
```

## Extra considerations for use

While this feature is intended for general-purpose use, its implementation does carry some baggage. Like any software,
replication operates well within some set of use cases but is not meant to support all use cases.
For the benefit of users, we can enumerate these cases.

### Latency

As previously mentioned, the replication feature uses the write-ahead log files for a number of reasons, one of which
is to avoid the need for data to be written to RFiles before it is available to be replicated. While this can help
reduce the latency for a batch of Mutations that have been written to Accumulo, the latency is still at least seconds to tens
of seconds once ingest is active. For a table on which replication has just been enabled, it will likely
take a few minutes before replication begins.

Once ingest is active and flowing into the system at a regular rate, replication should be occurring at a similar rate,
given sufficient computing resources. Replication attempts to copy data at a rate that can be considered low latency,
but it is not a replacement for custom indexing code which can ensure near real-time referential integrity on secondary indexes.

### Table-Configured Iterators

Accumulo Iterators tend to be a heavy hammer which can be used to solve a variety of problems. In general, it is highly
recommended that Iterators which are applied at major compaction time are both idempotent and associative due to the
non-determinism in which some set of files for a Tablet might be compacted. In practice, this translates to common patterns,
such as aggregation, being implemented in a manner resilient to duplication (such as using a Set instead of a List).
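To make the "Set instead of a List" point concrete, the following is a minimal sketch (a
hypothetical class, not shipped with Accumulo) of a combiner that unions comma-delimited set
members, so applying it to the same updates more than once yields the same result:

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.TreeSet;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Combiner;

/**
 * Hypothetical combiner that treats each Value as a comma-delimited set of members and
 * reduces multiple versions of a Key to the union of those sets. Because set union is
 * idempotent, seeing the same update twice (e.g. after re-replication) does not change
 * the result, unlike a list-append or a sum.
 */
public class SetUnionCombiner extends Combiner {
  @Override
  public Value reduce(Key key, Iterator<Value> iter) {
    TreeSet<String> members = new TreeSet<>();
    while (iter.hasNext()) {
      // Split each version's value into its members and add them to the set
      for (String member : new String(iter.next().get(), StandardCharsets.UTF_8).split(",")) {
        if (!member.isEmpty()) {
          members.add(member);
        }
      }
    }
    // Re-encode the union as a single comma-delimited value
    return new Value(String.join(",", members).getBytes(StandardCharsets.UTF_8));
  }
}
```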
Due to the asynchronous nature of replication and the expectation that hardware failures and network partitions will exist,
it is generally not recommended to configure replication on a table which has Iterators set that are not idempotent.
While the replication implementation can make some simple assertions to try to avoid re-replication of data, it is not
presently guaranteed that all data will be sent to a peer only once. Data will be replicated at least once. Typically,
this is not a problem, as the VersioningIterator will automatically deduplicate this over-replication because the
duplicates have the same timestamp; however, certain Combiners may produce inaccurate aggregations.

As a concrete example, consider a table which has the SummingCombiner configured to sum all values for
multiple versions of the same Key. For some key, consider a set of numeric values that are written to a table on the
primary: [1, 2, 3]. On the primary, all of these are successfully written and thus the current value for the given key
would be 6, (1 + 2 + 3). Consider, however, that each of these updates to the peer was done independently (because
other data was also included in the write-ahead log that needed to be replicated). The update with a value of 1 was
successfully replicated, and then we attempted to replicate the update with a value of 2, but the remote server never
responded. The primary does not know whether the update with a value of 2 was actually applied or not, so the
only recourse is to re-send the update. After we receive confirmation that the update with a value of 2 was replicated,
we will then replicate the update with 3. If the peer never applied the first attempt at the update of '2', the summation is accurate.
If the update was applied but the acknowledgement was lost for some reason (system failure, network partition), the
update will be resent to the peer. Because addition is not idempotent, we have created an inconsistency between the
primary and peer. As such, the SummingCombiner wouldn't be recommended on a table being replicated.

While there are changes that could be made to the replication implementation to mitigate this risk,
presently it is not recommended to configure Iterators or Combiners which are not idempotent on replicated tables
in cases where inaccurate aggregations are not acceptable.

### Duplicate Keys

In Accumulo, when more than one key exists that is exactly the same, that is, keys that are equal down to the timestamp,
the retained value is non-deterministic. Replication introduces another level of non-determinism in this case.
For a table that is being replicated and has multiple equal keys with different values inserted into it, the final
value in that table on the primary instance is not guaranteed to be the final value on all replicas.

For example, say the values that were inserted on the primary instance were `value1` and `value2` and the final
value was `value1`; it is not guaranteed that all replicas will have `value1` like the primary. The final value is
non-deterministic for each instance.

As is the recommendation even without replication enabled, if multiple values for the same key (sans timestamp) are written to
Accumulo, it is strongly recommended that the timestamp properly reflects the version intended by
the client. That is to say, newer values inserted into the table should have larger timestamps. If the time between
writing updates to the same key is significant (on the order of minutes), this concern can likely be ignored.
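If clients do assign their own timestamps, a write might look like the following sketch; the
table name, row, column names, and the source of the `version` value are assumptions for
illustration, and the Connector is assumed to have been obtained elsewhere:

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class ExplicitTimestampWrite {
  /**
   * Writes a single value to the (hypothetical) table "my_table" with a client-chosen
   * timestamp. 'version' should increase for newer values so the intended ordering is
   * unambiguous on the primary and on every peer.
   */
  static void writeVersioned(Connector connector, long version) throws Exception {
    BatchWriter writer = connector.createBatchWriter("my_table", new BatchWriterConfig());
    try {
      Mutation m = new Mutation(new Text("row1"));
      // Supplying the timestamp explicitly, rather than letting the server assign it
      m.put(new Text("cf"), new Text("cq"), version, new Value("v1".getBytes()));
      writer.addMutation(m);
    } finally {
      writer.close();
    }
  }
}
```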
### Bulk Imports

Currently, files that are bulk imported into a table configured for replication are not replicated. There is no
technical reason why this was not implemented; it was simply omitted from the initial implementation. This is considered a
fair limitation because bulk importing generated files into multiple locations is much simpler than bifurcating "live" ingest
data into two instances. Given some existing bulk import process which creates files and then imports them into an
Accumulo instance, it is trivial to copy those files to a new HDFS instance and import them into another Accumulo
instance using the same process. Hadoop's `distcp` command provides an easy way to copy large amounts of data to another
HDFS instance, which makes the problem of duplicating bulk imports very easy to solve.
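As a sketch of that workflow (the paths, NameNode hostnames, and ports below are placeholders),
the generated files can be copied to the peer's HDFS and then imported on the peer:

```
# Copy the bulk-import files produced for the primary over to the peer's HDFS
hadoop distcp hdfs://primary-nn:8020/data/bulk/my_table hdfs://peer-nn:8020/data/bulk/my_table

# Then, in the Accumulo shell on the peer, import the copied directory
root@peer> importdirectory /data/bulk/my_table /data/bulk/my_table_failures false
```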
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/_docs-unreleased/administration/ssl.md
----------------------------------------------------------------------
diff --git a/_docs-unreleased/administration/ssl.md b/_docs-unreleased/administration/ssl.md
new file mode 100644
index 0000000..a104adf
--- /dev/null
+++ b/_docs-unreleased/administration/ssl.md
@@ -0,0 +1,124 @@
---
title: SSL
category: administration
order: 8
---

Accumulo, through Thrift's TSSLTransport, provides the ability to encrypt
wire communication between Accumulo servers and clients using Secure
Sockets Layer (SSL). SSL certificates signed by the same certificate authority
control the "circle of trust" in which a secure connection can be established.
Typically, each host running Accumulo processes would be given a certificate
which identifies it.

Clients can optionally also be given a certificate, when client-auth is enabled,
which prevents unwanted clients from accessing the system. The SSL integration
presently provides no authentication support within Accumulo (an Accumulo username
and password are still required) and is only used to establish a means for
secure communication.

## Server configuration

As previously mentioned, the circle of trust is established by the certificate
authority which created the certificates in use. Because of the tight coupling
of certificate generation with an organization's policies, Accumulo does not
provide a method to automatically create the necessary SSL components.

Administrators without existing infrastructure built on SSL are encouraged to
use OpenSSL and the `keytool` command. Examples of these commands are
included in a section below. Accumulo servers require a certificate and keystore,
in the form of Java KeyStores, to enable SSL. The following configuration assumes
these files already exist.

In `accumulo-site.xml`, the following properties are required (an example follows this list):

* **rpc.javax.net.ssl.keyStore**=_The path on the local filesystem to the keystore containing the server's certificate_
* **rpc.javax.net.ssl.keyStorePassword**=_The password for the keystore containing the server's certificate_
* **rpc.javax.net.ssl.trustStore**=_The path on the local filesystem to the keystore containing the certificate authority's public key_
* **rpc.javax.net.ssl.trustStorePassword**=_The password for the keystore containing the certificate authority's public key_
* **instance.rpc.ssl.enabled**=_true_
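Put together in `accumulo-site.xml`, those properties might look like the following sketch;
the paths and passwords are placeholders for your own keystore material:

```xml
<property>
  <name>rpc.javax.net.ssl.keyStore</name>
  <value>/path/to/server.jks</value>
</property>
<property>
  <name>rpc.javax.net.ssl.keyStorePassword</name>
  <value>keystore_password</value>
</property>
<property>
  <name>rpc.javax.net.ssl.trustStore</name>
  <value>/path/to/truststore.jks</value>
</property>
<property>
  <name>rpc.javax.net.ssl.trustStorePassword</name>
  <value>truststore_password</value>
</property>
<property>
  <name>instance.rpc.ssl.enabled</name>
  <value>true</value>
</property>
```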
Optionally, SSL client-authentication (two-way SSL) can also be enabled by setting
`instance.rpc.ssl.clientAuth=true` in `accumulo-site.xml`.
This requires that each client has access to a valid certificate in order to set up a secure connection
to the servers. By default, Accumulo uses one-way SSL, which does not require clients to have
their own certificate.

## Client configuration

To establish a connection to Accumulo servers, each client must also have
special configuration. This is typically accomplished through the use of
the client configuration file, whose default location is `~/.accumulo/config`.

The following properties must be set to connect to an Accumulo instance using SSL:

* **rpc.javax.net.ssl.trustStore**=_The path on the local filesystem to the keystore containing the certificate authority's public key_
* **rpc.javax.net.ssl.trustStorePassword**=_The password for the keystore containing the certificate authority's public key_
* **instance.rpc.ssl.enabled**=_true_

If two-way SSL is enabled (`instance.rpc.ssl.clientAuth=true`) for the instance, the client must also define
its own certificate and enable client authentication as well.

* **rpc.javax.net.ssl.keyStore**=_The path on the local filesystem to the keystore containing the client's certificate_
* **rpc.javax.net.ssl.keyStorePassword**=_The password for the keystore containing the client's certificate_
* **instance.rpc.ssl.clientAuth**=_true_

## Generating SSL material using OpenSSL

The following is included as an example for generating your own SSL material (certificate authority and server/client
certificates) using OpenSSL and Java's keytool command.

### Generate a certificate authority

```shell
# Create a private key
openssl genrsa -des3 -out root.key 4096

# Create a self-signed certificate for the certificate authority using the private key
openssl req -x509 -new -key root.key -days 365 -out root.pem

# Convert the PEM certificate into DER format so keytool can import it
openssl x509 -outform der -in root.pem -out root.der

# Import the certificate into a Java KeyStore
keytool -import -alias root-key -keystore truststore.jks -file root.der

# Remove the DER formatted certificate file (as we don't need it anymore)
rm root.der
```

The `truststore.jks` file is the Java keystore which contains the certificate authority's public key.

### Generate a certificate/keystore per host

It's common that each host in the instance is issued its own certificate (notably to ensure that revocation procedures
can be easily followed). The following steps can be taken for each host.

```shell
# Create the private key for our server
openssl genrsa -out server.key 4096

# Generate a certificate signing request (CSR) with our private key
openssl req -new -key server.key -out server.csr

# Use the CSR and the CA to create a certificate for the server (a reply to the CSR)
openssl x509 -req -in server.csr -CA root.pem -CAkey root.key -CAcreateserial \
    -out server.crt -days 365

# Use the certificate and the private key for our server to create a PKCS12 file
openssl pkcs12 -export -in server.crt -inkey server.key -certfile server.crt \
    -name 'server-key' -out server.p12

# Create a Java KeyStore for the server using the PKCS12 file (private key)
keytool -importkeystore -srckeystore server.p12 -srcstoretype pkcs12 -destkeystore \
    server.jks -deststoretype JKS

# Remove the PKCS12 file as we don't need it
rm server.p12

# Import the CA-signed certificate into the keystore
keytool -import -trustcacerts -alias server-crt -file server.crt -keystore server.jks
```

The `server.jks` file is the Java keystore containing the certificate for a given host. The above
steps are the same whether the certificate is generated for an Accumulo server or a client.
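Tying this together, a client connecting to the instance over one-way SSL might use a
`~/.accumulo/config` file along the lines of the following sketch; the path and password are
placeholders for the truststore generated above:

```
instance.rpc.ssl.enabled=true
rpc.javax.net.ssl.trustStore=/path/to/truststore.jks
rpc.javax.net.ssl.trustStorePassword=truststore_password
```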