Merge branch '1.6.1-SNAPSHOT' Conflicts: docs/src/main/asciidoc/chapters/administration.txt docs/src/main/asciidoc/chapters/clients.txt
Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/c3907969 Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/c3907969 Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/c3907969 Branch: refs/heads/master Commit: c390796911f9f47a787966eb6f3702aabddb9bf3 Parents: d440049 d5e094d Author: Josh Elser <els...@apache.org> Authored: Thu Aug 7 14:32:18 2014 -0400 Committer: Josh Elser <els...@apache.org> Committed: Thu Aug 7 14:32:18 2014 -0400 ---------------------------------------------------------------------- docs/src/main/asciidoc/chapters/administration.txt | 4 ++++ docs/src/main/asciidoc/chapters/clients.txt | 10 ++++++++++ 2 files changed, 14 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/accumulo/blob/c3907969/docs/src/main/asciidoc/chapters/administration.txt ---------------------------------------------------------------------- diff --cc docs/src/main/asciidoc/chapters/administration.txt index 5e92465,0000000..9817b07 mode 100644,000000..100644 --- a/docs/src/main/asciidoc/chapters/administration.txt +++ b/docs/src/main/asciidoc/chapters/administration.txt @@@ -1,492 -1,0 +1,496 @@@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+// See the License for the specific language governing permissions and +// limitations under the License. + +== Administration + +=== Hardware + +Because we are running essentially two or three systems simultaneously layered +across the cluster: HDFS, Accumulo and MapReduce, it is typical for hardware to +consist of 4 to 8 cores, and 8 to 32 GB RAM. This is so each running process can have +at least one core and 2 - 4 GB each. + +One core running HDFS can typically keep 2 to 4 disks busy, so each machine may +typically have as little as 2 x 300GB disks and as much as 4 x 1TB or 2TB disks. + +It is possible to do with less than this, such as with 1u servers with 2 cores and 4GB +each, but in this case it is recommended to only run up to two processes per +machine -- i.e. DataNode and TabletServer or DataNode and MapReduce worker but +not all three. The constraint here is having enough available heap space for all the +processes on a machine. + +=== Network + +Accumulo communicates via remote procedure calls over TCP/IP for both passing +data and control messages. In addition, Accumulo uses HDFS clients to +communicate with HDFS. To achieve good ingest and query performance, sufficient +network bandwidth must be available between any two machines. + +In addition to needing access to ports associated with HDFS and ZooKeeper, Accumulo will +use the following default ports. Please make sure that they are open, or change +their value in conf/accumulo-site.xml. 
+ +.Accumulo default ports +[width="75%",cols=">,^2,^2"] +[options="header"] +|==== +|Port | Description | Property Name +|4445 | Shutdown Port (Accumulo MiniCluster) | n/a +|4560 | Accumulo monitor (for centralized log display) | monitor.port.log4j +|9997 | Tablet Server | tserver.port.client +|9999 | Master Server | master.port.client +|12234 | Accumulo Tracer | trace.port.client +|42424 | Accumulo Proxy Server | n/a +|50091 | Accumulo GC | gc.port.client +|50095 | Accumulo HTTP monitor | monitor.port.client +|==== + +In addition, the user can provide +0+ and an ephemeral port will be chosen instead. This +ephemeral port is likely to be unique and not already bound. Thus, configuring ports to +use +0+ instead of an explicit value, should, in most cases, work around any issues of +running multiple distinct Accumulo instances (or any other process which tries to use the +same default ports) on the same hardware. + +=== Installation +Choose a directory for the Accumulo installation. This directory will be referenced +by the environment variable +$ACCUMULO_HOME+. Run the following: + + $ tar xzf accumulo-1.6.0-bin.tar.gz # unpack to subdirectory + $ mv accumulo-1.6.0 $ACCUMULO_HOME # move to desired location + +Repeat this step at each machine within the cluster. Usually all machines have the +same +$ACCUMULO_HOME+. + +=== Dependencies +Accumulo requires HDFS and ZooKeeper to be configured and running +before starting. Password-less SSH should be configured between at least the +Accumulo master and TabletServer machines. It is also a good idea to run Network +Time Protocol (NTP) within the cluster to ensure nodes' clocks don't get too out of +sync, which can cause problems with automatically timestamped data. + +=== Configuration + +Accumulo is configured by editing several Shell and XML files found in ++$ACCUMULO_HOME/conf+. The structure closely resembles Hadoop's configuration +files. 
+ +==== Edit conf/accumulo-env.sh + +Accumulo needs to know where to find the software it depends on. Edit accumulo-env.sh +and specify the following: + +. Enter the location of the installation directory of Accumulo for +$ACCUMULO_HOME+ +. Enter your system's Java home for +$JAVA_HOME+ +. Enter the location of Hadoop for +$HADOOP_PREFIX+ +. Choose a location for Accumulo logs and enter it for +$ACCUMULO_LOG_DIR+ +. Enter the location of ZooKeeper for +$ZOOKEEPER_HOME+ + +By default Accumulo TabletServers are set to use 1GB of memory. You may change +this by altering the value of +$ACCUMULO_TSERVER_OPTS+. Note the syntax is that of +the Java JVM command line options. This value should be less than the physical +memory of the machines running TabletServers. + +There are similar options for the master's memory usage and the garbage collector +process. Reduce these if they exceed the physical RAM of your hardware and +increase them, within the bounds of the physical RAM, if a process fails because of +insufficient memory. + +Note that you will be specifying the Java heap space in accumulo-env.sh. You should +make sure that the total heap space used for the Accumulo tserver and the Hadoop +DataNode and TaskTracker is less than the available memory on each slave node in +the cluster. On large clusters, it is recommended that the Accumulo master, Hadoop +NameNode, secondary NameNode, and Hadoop JobTracker all be run on separate +machines to allow them to use more heap space. If you are running these on the +same machine on a small cluster, likewise make sure their heap space settings fit +within the available memory. + +==== Native Map + +The tablet server uses a data structure called a MemTable to store sorted key/value +pairs in memory when they are first received from the client. When a minor compaction +occurs, this data structure is written to HDFS. 
The MemTable will default to using +memory in the JVM but a JNI version, called the native map, can be used to significantly +speed up performance by utilizing the memory space of the native operating system. The +native map also avoids the performance implications brought on by garbage collection +in the JVM by causing it to pause much less frequently. + +32-bit and 64-bit Linux and Mac OS X versions of the native map can be built +from the Accumulo bin package by executing ++$ACCUMULO_HOME/bin/build_native_library.sh+. If your system's +default compiler options are insufficient, you can add additional compiler +options to the command line, such as options for the architecture. These will be +passed to the Makefile in the environment variable +USERFLAGS+. + +Examples: + +. +$ACCUMULO_HOME/bin/build_native_library.sh+ +. +$ACCUMULO_HOME/bin/build_native_library.sh -m32+ + +After building the native map from the source, you will find the artifact in ++$ACCUMULO_HOME/lib/native+. Upon starting up, the tablet server will look +in this directory for the map library. If the file is renamed or moved from its +target directory, the tablet server may not be able to find it. The system can +also locate the native maps shared library by setting +LD_LIBRARY_PATH+ +(or +DYLD_LIBRARY_PATH+ on Mac OS X) in +$ACCUMULO_HOME/conf/accumulo-env.sh+. + +==== Cluster Specification + +On the machine that will serve as the Accumulo master: + +. Write the IP address or domain name of the Accumulo Master to the +$ACCUMULO_HOME/conf/masters+ file. +. Write the IP addresses or domain name of the machines that will be TabletServers in +$ACCUMULO_HOME/conf/slaves+, one per line. + +Note that if using domain names rather than IP addresses, DNS must be configured +properly for all machines participating in the cluster. DNS can be a confusing source +of errors. 
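The DNS note above can be checked before starting the cluster. The following is a minimal sketch, not part of Accumulo: the class name `HostListCheck` and its methods are hypothetical, and skipping blank lines and `#` comments is an assumption about the host file layout. It reads a masters/slaves-style file (one host per line) and reports entries that do not resolve from the local machine, using only the JDK.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an Accumulo class): finds host entries that
// cannot be resolved from the machine it runs on.
class HostListCheck {

    // Parse file content with one host per line. Skipping blank lines and
    // lines starting with '#' is an assumption made for this sketch.
    static List<String> parseHosts(String content) {
        List<String> hosts = new ArrayList<String>();
        for (String line : content.split("\\r?\\n")) {
            String host = line.trim();
            if (!host.isEmpty() && !host.startsWith("#")) {
                hosts.add(host);
            }
        }
        return hosts;
    }

    // True if the name (or IP literal) can be turned into an address.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Return the subset of hosts that failed to resolve.
    static List<String> unresolved(List<String> hosts) {
        List<String> bad = new ArrayList<String>();
        for (String host : hosts) {
            if (!resolves(host)) {
                bad.add(host);
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: HostListCheck <path to masters or slaves file>");
            return;
        }
        String content = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (String host : unresolved(parseHosts(content))) {
            System.out.println("cannot resolve: " + host);
        }
    }
}
```

Running this on each machine against the same `slaves` file is one way to catch hosts whose DNS view differs from the master's; it only tests forward resolution, not reverse lookups.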
+ +==== Accumulo Settings +Specify appropriate values for the following settings in ++$ACCUMULO_HOME/conf/accumulo-site.xml+ : + +[source,xml] +<property> + <name>instance.zookeeper.host</name> + <value>zooserver-one:2181,zooserver-two:2181</value> + <description>list of zookeeper servers</description> +</property> + +This enables Accumulo to find ZooKeeper. Accumulo uses ZooKeeper to coordinate +settings between processes and helps finalize TabletServer failure. + +[source,xml] +<property> + <name>instance.secret</name> + <value>DEFAULT</value> +</property> + +The instance needs a secret to enable secure communication between servers. Configure your +secret and make sure that the +accumulo-site.xml+ file is not readable to other users. +For alternatives to storing the +instance.secret+ in plaintext, please read the ++Sensitive Configuration Values+ section. + +Some settings can be modified via the Accumulo shell and take effect immediately, but +some settings require a process restart to take effect. See the configuration documentation +(available in the docs directory of the tarball and in <<configuration>>) for details. + +==== Deploy Configuration + +Copy the masters, slaves, accumulo-env.sh, and if necessary, accumulo-site.xml +from the +$ACCUMULO_HOME/conf/+ directory on the master to all the machines +specified in the slaves file. + +==== Sensitive Configuration Values + +Accumulo has a number of properties that can be specified via the accumulo-site.xml +file which are sensitive in nature, instance.secret and trace.token.property.password +are two common examples. Both of these properties, if compromised, have the ability +to result in data being leaked to users who should not have access to that data. + +In Hadoop-2.6.0, a new CredentialProvider class was introduced which serves as a common +implementation to abstract away the storage and retrieval of passwords from plaintext +storage in configuration files. 
Any Property marked with the +Sensitive+ annotation +is a candidate for use with these CredentialProviders. For versions of Hadoop which lack +these classes, the feature will just be unavailable for use. + +A comma separated list of CredentialProviders can be configured using the Accumulo Property ++general.security.credential.provider.paths+. Each configured URL will be consulted +when the Configuration object for accumulo-site.xml is accessed. + +==== Using a JavaKeyStoreCredentialProvider for storage + +One of the implementations provided in Hadoop-2.6.0 is a Java KeyStore CredentialProvider. +Each entry in the KeyStore is the Accumulo Property key name. For example, to store the ++instance.secret+, the following command can be used: + + hadoop credential create instance.secret --provider jceks://file/etc/accumulo/conf/accumulo.jceks + +The command will then prompt you to enter the secret to use and create a keystore in: + + /etc/accumulo/conf/accumulo.jceks + +Then, accumulo-site.xml must be configured to use this KeyStore as a CredentialProvider: + +[source,xml] +<property> + <name>general.security.credential.provider.paths</name> + <value>jceks://file/etc/accumulo/conf/accumulo.jceks</value> +</property> + +This configuration will then transparently extract the +instance.secret+ from +the configured KeyStore and avoids storing the sensitive property in +human-readable form. + ++A KeyStore can also be stored in HDFS, which will make the KeyStore readily available to ++all Accumulo servers. If the local filesystem is used, be aware that each Accumulo server ++will expect the KeyStore in the same location. ++ +=== Initialization + +Accumulo must be initialized to create the structures it uses internally to locate +data across the cluster. HDFS is required to be configured and running before +Accumulo can be initialized. + +Once HDFS is started, initialization can be performed by executing ++$ACCUMULO_HOME/bin/accumulo init+ .
This script will prompt for a name +for this instance of Accumulo. The instance name is used to identify a set of tables +and instance-specific settings. The script will then write some information into +HDFS so Accumulo can start properly. + +The initialization script will prompt you to set a root password. Once Accumulo is +initialized it can be started. + +=== Running + +==== Starting Accumulo + +Make sure Hadoop is configured on all of the machines in the cluster, including +access to a shared HDFS instance. Make sure HDFS is running, and that ZooKeeper is +configured and running on at least one machine in the cluster. +Start Accumulo using the +bin/start-all.sh+ script. + +To verify that Accumulo is running, check the Status page as described in +<<monitoring>>. In addition, the Shell can provide some information about the status of +tables by reading the metadata tables. + +==== Stopping Accumulo + +To shut down cleanly, run +bin/stop-all.sh+ and the master will orchestrate the +shutdown of all the tablet servers. Shutdown waits for all minor compactions to finish, so it may +take some time for particular configurations. + +==== Adding a Node + +Update your +$ACCUMULO_HOME/conf/slaves+ (or +$ACCUMULO_CONF_DIR/slaves+) file to account for the addition. + + $ACCUMULO_HOME/bin/accumulo admin start <host(s)> {<host> ...} + +Alternatively, you can ssh to each of the hosts you want to add and run: + + $ACCUMULO_HOME/bin/start-here.sh + +Make sure the host in question has the new configuration, or else the tablet +server won't start; at a minimum this needs to be on the host(s) being added, +but in practice it's good to ensure consistent configuration across all nodes. + +==== Decommissioning a Node + +If you need to take a node out of operation, you can trigger a graceful shutdown of a tablet +server. Accumulo will automatically rebalance the tablets across the available tablet servers.
+ + $ACCUMULO_HOME/bin/accumulo admin stop <host(s)> {<host> ...} + +Alternatively, you can ssh to each of the hosts you want to remove and run: + + $ACCUMULO_HOME/bin/stop-here.sh + +Be sure to update your +$ACCUMULO_HOME/conf/slaves+ (or +$ACCUMULO_CONF_DIR/slaves+) file to +account for the removal of these hosts. Bear in mind that the monitor will not re-read the +slaves file automatically, so it will report the decommissioned servers as down; it's +recommended that you restart the monitor so that the node list is up to date. + +[[monitoring]] +=== Monitoring + +The Accumulo Master provides an interface for monitoring the status and health of +Accumulo components. The Accumulo Monitor provides a web UI for accessing this information at ++http://_monitorhost_:50095/+. + +Things highlighted in yellow may be in need of attention. +If anything is highlighted in red on the monitor page, it is something that definitely needs attention. + +The Overview page contains some summary information about the Accumulo instance, including the version, instance name, and instance ID. +There is a table labeled Accumulo Master with current status, a table listing the active ZooKeeper servers, and graphs displaying various metrics over time. +These include ingest and scan performance and other useful measurements. + +The Master Server, Tablet Servers, and Tables pages display metrics grouped in different ways (e.g. by tablet server or by table). +Metrics typically include number of entries (key/value pairs), ingest and query rates. +The number of running scans, major and minor compactions are in the form _number_running_ (_number_queued_). +Another important metric is hold time, which is the amount of time a tablet has been waiting but unable to flush its memory in a minor compaction. + +The Server Activity page graphically displays tablet server status, with each server represented as a circle or square.
+Different metrics may be assigned to the nodes' color and speed of oscillation. +The Overall Avg metric is only used on the Server Activity page, and represents the average of all the other metrics (after normalization). +Similarly, the Overall Max metric picks the metric with the maximum normalized value. + +The Garbage Collector page displays a list of garbage collection cycles, the number of files found of each type (including deletion candidates in use and files actually deleted), and the length of the deletion cycle. +The Traces page displays data for recent traces performed (see the following section for information on <<tracing>>). +The Recent Logs page displays warning and error logs forwarded to the monitor from all Accumulo processes. +Also, the XML and JSON links provide metrics in XML and JSON formats, respectively. + +==== SSL +SSL may be enabled for the monitor page by setting the following properties in the +accumulo-site.xml+ file: + + monitor.ssl.keyStore + monitor.ssl.keyStorePassword + monitor.ssl.trustStore + monitor.ssl.trustStorePassword + +If the Accumulo conf directory has been configured (in particular the +accumulo-env.sh+ file must be set up), the +generate_monitor_certificate.sh+ script in the Accumulo +bin+ directory can be used to create the keystore and truststore files with random passwords. +The script will print out the properties that need to be added to the +accumulo-site.xml+ file. +The stores can also be generated manually with the Java +keytool+ command, whose usage can be seen in the +generate_monitor_certificate.sh+ script. + +If desired, the SSL ciphers allowed for connections can be controlled via the following properties in +accumulo-site.xml+: + + monitor.ssl.include.ciphers + monitor.ssl.exclude.ciphers + +If SSL is enabled, the monitor URL can only be accessed via https. +This also allows you to access the Accumulo shell through the monitor page. +The left navigation bar will have a new link to Shell. 
+An Accumulo user name and password must be entered for access to the shell. + +[[tracing]] +=== Tracing +It can be difficult to determine why some operations are taking longer +than expected. For example, you may be looking up items with very low +latency, but sometimes the lookups take much longer. Determining the +cause of the delay is difficult because the system is distributed, and +the typical lookup is fast. + +Accumulo has been instrumented to record the time that various +operations take when tracing is turned on. The fact that tracing is +enabled follows all the requests made on behalf of the user throughout +the distributed infrastructure of accumulo, and across all threads of +execution. + +These time spans will be inserted into the +trace+ table in +Accumulo. You can browse recent traces from the Accumulo monitor +page. You can also read the +trace+ table directly like any +other table. + +The design of Accumulo's distributed tracing follows that of +http://research.google.com/pubs/pub36356.html[Google's Dapper]. + +==== Tracers +To collect traces, Accumulo needs at least one server listed in + +$ACCUMULO_HOME/conf/tracers+. The server collects traces +from clients and writes them to the +trace+ table. The Accumulo +user that the tracer connects to Accumulo with can be configured with +the following properties + + trace.user + trace.token.property.password + +==== Instrumenting a Client +Tracing can be used to measure a client operation, such as a scan, as +the operation traverses the distributed system. To enable tracing for +your application call + +[source,java] +DistributedTrace.enable(instance, new ZooReader(instance), hostname, "myApplication"); + +Once tracing has been enabled, a client can wrap an operation in a trace. 
+ +[source,java] +Trace.on("Client Scan"); +BatchScanner scanner = conn.createBatchScanner(...); +// Configure your scanner +for (Entry entry : scanner) { +} +Trace.off(); + +Additionally, the user can create additional Spans within a Trace. + +[source,java] +Trace.on("Client Update"); +... +Span readSpan = Trace.start("Read"); +... +readSpan.stop(); +... +Span writeSpan = Trace.start("Write"); +... +writeSpan.stop(); +Trace.off(); + +Like Dapper, Accumulo tracing supports user defined annotations to associate additional data with a Trace. + +[source,java] +... +int numberOfEntriesRead = 0; +Span readSpan = Trace.start("Read"); +// Do the read, update the counter +... +readSpan.data("Number of Entries Read", String.valueOf(numberOfEntriesRead)); + +Some client operations may have a high volume within your +application. As such, you may wish to only sample a percentage of +operations for tracing. As seen below, the CountSampler can be used to +help enable tracing for 1-in-1000 operations + +[source,java] +Sampler sampler = new CountSampler(1000); +... +if (sampler.next()) { + Trace.on("Read"); +} +... +Trace.offNoFlush(); + +It should be noted that it is safe to turn off tracing even if it +isn't currently active. The +Trace.offNoFlush()+ should be used if the +user does not wish to have +Trace.off()+ block while flushing trace +data. + +==== Viewing Collected Traces +To view collected traces, use the "Recent Traces" link on the Monitor +UI. You can also programmatically access and print traces using the ++TraceDump+ class. + +==== Tracing from the Shell +You can enable tracing for operations run from the shell by using the ++trace on+ and +trace off+ commands. 
+ +---- +root@test test> trace on + +root@test test> scan +a b:c [] d + +root@test test> trace off +Waiting for trace information +Waiting for trace information +Trace started at 2013/08/26 13:24:08.332 +Time Start Service@Location Name + 3628+0 shell@localhost shell:root + 8+1690 shell@localhost scan + 7+1691 shell@localhost scan:location + 6+1692 tserver@localhost startScan + 5+1692 tserver@localhost tablet read ahead 6 +---- + +=== Logging +Accumulo processes each write to a set of log files. By default these are found under ++$ACCUMULO/logs/+. + +=== Recovery + +In the event of TabletServer failure or error on shutting Accumulo down, some +mutations may not have been minor compacted to HDFS properly. In this case, +Accumulo will automatically reapply such mutations from the write-ahead log +either when the tablets from the failed server are reassigned by the Master (in the +case of a single TabletServer failure) or the next time Accumulo starts (in the event of +failure during shutdown). + +Recovery is performed by asking a tablet server to sort the logs so that tablets can easily find their missing +updates. The sort status of each file is displayed on +Accumulo monitor status page. Once the recovery is complete any +tablets involved should return to an ``online'' state. Until then those tablets will be +unavailable to clients. + +The Accumulo client library is configured to retry failed mutations and in many +cases clients will be able to continue processing after the recovery process without +throwing an exception. 
http://git-wip-us.apache.org/repos/asf/accumulo/blob/c3907969/docs/src/main/asciidoc/chapters/clients.txt ---------------------------------------------------------------------- diff --cc docs/src/main/asciidoc/chapters/clients.txt index 6b071ba,0000000..48123a3 mode 100644,000000..100644 --- a/docs/src/main/asciidoc/chapters/clients.txt +++ b/docs/src/main/asciidoc/chapters/clients.txt @@@ -1,320 -1,0 +1,330 @@@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +== Writing Accumulo Clients + +=== Running Client Code + +There are multiple ways to run Java code that uses Accumulo. Below is a list +of the different ways to execute client code. + +* using java executable +* using the accumulo script +* using the tool script + +In order to run client code written to run against Accumulo, you will need to +include the jars that Accumulo depends on in your classpath. Accumulo client +code depends on Hadoop and Zookeeper. For Hadoop add the hadoop client jar, all +of the jars in the Hadoop lib directory, and the conf directory to the +classpath. For Zookeeper 3.3 you only need to add the Zookeeper jar, and not +what is in the Zookeeper lib directory. 
You can run the following command on a +configured Accumulo system to see what it's using for its classpath. + + $ACCUMULO_HOME/bin/accumulo classpath + +Another option for running your code is to put a jar file in ++$ACCUMULO_HOME/lib/ext+. After doing this you can use the accumulo +script to execute your code. For example, if you create a jar containing the +class +com.foo.Client+ and place it in +lib/ext+, then you could use the command ++$ACCUMULO_HOME/bin/accumulo com.foo.Client+ to execute your code. + +If you are writing a map reduce job that accesses Accumulo, then you can use the +bin/tool.sh script to run those jobs. See the map reduce example. + +=== Connecting + +All clients must first identify the Accumulo instance to which they will be +communicating. Code to do this is as follows: + +[source,java] +---- +String instanceName = "myinstance"; +String zooServers = "zooserver-one,zooserver-two"; +Instance inst = new ZooKeeperInstance(instanceName, zooServers); + +Connector conn = inst.getConnector("user", new PasswordToken("passwd")); +---- + ++The PasswordToken is the most common implementation of an +AuthenticationToken+. ++This general interface allows authentication as an Accumulo user to come from ++a variety of sources or means. The CredentialProviderToken leverages the Hadoop ++CredentialProviders (new in Hadoop 2.6). ++ ++For example, the CredentialProviderToken can be used in conjunction with a Java ++KeyStore to avoid storing passwords in cleartext. When stored in HDFS, a single ++KeyStore can be used across an entire instance. Be aware that KeyStores stored on ++the local filesystem must be made available to all nodes in the Accumulo cluster. ++ +=== Writing Data + +Data are written to Accumulo by creating Mutation objects that represent all the +changes to the columns of a single row. The changes are made atomically in the +TabletServer. Clients then add Mutations to a BatchWriter which submits them to +the appropriate TabletServers.
+ +Mutations can be created thus: + +[source,java] +---- +Text rowID = new Text("row1"); +Text colFam = new Text("myColFam"); +Text colQual = new Text("myColQual"); +ColumnVisibility colVis = new ColumnVisibility("public"); +long timestamp = System.currentTimeMillis(); + +Value value = new Value("myValue".getBytes()); + +Mutation mutation = new Mutation(rowID); +mutation.put(colFam, colQual, colVis, timestamp, value); +---- + +==== BatchWriter +The BatchWriter is highly optimized to send Mutations to multiple TabletServers +and automatically batches Mutations destined for the same TabletServer to +amortize network overhead. Care must be taken to avoid changing the contents of +any Object passed to the BatchWriter since it keeps objects in memory while +batching. + +Mutations are added to a BatchWriter thus: + +[source,java] +---- +// BatchWriterConfig has reasonable defaults +BatchWriterConfig config = new BatchWriterConfig(); +config.setMaxMemory(10000000L); // bytes available to batchwriter for buffering mutations + +BatchWriter writer = conn.createBatchWriter("table", config); + +writer.add(mutation); + +writer.close(); +---- + +An example of using the batch writer can be found at ++accumulo/docs/examples/README.batch+. + +==== ConditionalWriter +The ConditionalWriter enables efficient, atomic read-modify-write operations on +rows. The ConditionalWriter writes special Mutations which have a list of +per-column conditions that must all be met before the mutation is applied. The +conditions are checked in the tablet server while a row lock is +held (Mutations written by the BatchWriter will not obtain a row +lock). The conditions that can be checked for a column are equality and +absence. For example, a conditional mutation can require that column A is +absent in order to be applied. Iterators can be applied when checking +conditions. Using iterators, many other operations besides equality and +absence can be checked.
For example, using an iterator that converts values +less than 5 to 0 and everything else to 1, it is possible to only apply a +mutation when a column is less than 5. + +In the case when a tablet server dies after a client sent a conditional +mutation, it is not known if the mutation was applied or not. When this happens +the ConditionalWriter reports a status of UNKNOWN for the ConditionalMutation. +In many cases this situation can be dealt with by simply reading the row again +and possibly sending another conditional mutation. If this is not sufficient, +then a higher level of abstraction can be built by storing transactional +information within a row. + +An example of using the ConditionalWriter can be found at ++accumulo/docs/examples/README.reservations+. + +=== Reading Data + +Accumulo is optimized to quickly retrieve the value associated with a given key, and +to efficiently return ranges of consecutive keys and their associated values. + +==== Scanner + +To retrieve data, Clients use a Scanner, which acts like an Iterator over +keys and values. Scanners can be configured to start and stop at particular keys, and +to return a subset of the columns available. + +[source,java] +---- +// specify which visibilities we are allowed to see +Authorizations auths = new Authorizations("public"); + +Scanner scan = + conn.createScanner("table", auths); + +scan.setRange(new Range("harry","john")); +scan.fetchColumnFamily(new Text("attributes")); + +for(Entry<Key,Value> entry : scan) { + Text row = entry.getKey().getRow(); + Value value = entry.getValue(); +} +---- + +==== Isolated Scanner + +Accumulo supports the ability to present an isolated view of rows when +scanning. There are three possible ways that a row could change in Accumulo: + +* a mutation applied to a table +* iterators executed as part of a minor or major compaction +* bulk import of new files + +Isolation guarantees that either all or none of the changes made by these +operations on a row are seen.
Use the IsolatedScanner to obtain an isolated +view of an Accumulo table. When using the regular scanner it is possible to see +a non-isolated view of a row. For example, if a mutation modifies three +columns, it is possible that you will only see two of those modifications. +With the isolated scanner either all three of the changes are seen or none. + +The IsolatedScanner buffers rows on the client side so a large row will not +crash a tablet server. By default rows are buffered in memory, but the user +can easily supply their own buffer if they wish to buffer to disk when rows are +large. + +For an example, look at the following + + examples/simple/src/main/java/org/apache/accumulo/examples/simple/isolation/InterferenceTest.java + +==== BatchScanner + +For some types of access, it is more efficient to retrieve several ranges +simultaneously. This arises, for example, when accessing a set of non-consecutive rows +whose IDs have been retrieved from a secondary index. + +The BatchScanner is configured similarly to the Scanner; it can be configured to +retrieve a subset of the columns available, but rather than passing a single Range, +BatchScanners accept a set of Ranges. It is important to note that the keys returned +by a BatchScanner are not in sorted order since the keys streamed are from multiple +TabletServers in parallel. + +[source,java] +---- +ArrayList<Range> ranges = new ArrayList<Range>(); +// populate list of ranges ... + +BatchScanner bscan = + conn.createBatchScanner("table", auths, 10); +bscan.setRanges(ranges); +bscan.fetchColumnFamily(new Text("attributes")); + +for(Entry<Key,Value> entry : bscan) { + System.out.println(entry.getValue()); +} +---- + +An example of the BatchScanner can be found at ++accumulo/docs/examples/README.batch+. + +=== Proxy + +The proxy API allows interaction with Accumulo from languages other than Java. +A proxy server is provided in the codebase and a client can further be generated.
 +
 +==== Prerequisites
 +
 +The proxy server can run on any node on which the basic client API would work. That
 +means it must be able to communicate with the Master, ZooKeepers, NameNode, and the
 +DataNodes. A proxy client only needs the ability to communicate with the proxy server.
 +
 +
 +==== Configuration
 +
 +The configuration options for the proxy server live inside of a properties file. At
 +the very least, you need to supply the following properties:
 +
 +  protocolFactory=org.apache.thrift.protocol.TCompactProtocol$Factory
 +  tokenClass=org.apache.accumulo.core.client.security.tokens.PasswordToken
 +  port=42424
 +  instance=test
 +  zookeepers=localhost:2181
 +
 +You can find a sample configuration file in your distribution:
 +
 +  $ACCUMULO_HOME/proxy/proxy.properties
 +
 +This sample configuration file also shows how to back the proxy server
 +with MockAccumulo or the MiniAccumuloCluster.
 +
 +==== Running the Proxy Server
 +
 +After the properties file holding the configuration is created, the proxy server
 +can be started using the following command in the Accumulo distribution (assuming
 +your properties file is named +config.properties+):
 +
 +  $ACCUMULO_HOME/bin/accumulo proxy -p config.properties
 +
 +==== Creating a Proxy Client
 +
 +Aside from installing the Thrift compiler, you will also need the language-specific library
 +for Thrift installed to generate client code in that language. Typically, your operating
 +system's package manager will be able to automatically install these for you in an expected
 +location such as +/usr/lib/python/site-packages/thrift+.
 +
 +The Thrift file for generating the client can be found at:
 +
 +  $ACCUMULO_HOME/proxy/proxy.thrift
 +
 +After a client is generated, the port specified in the configuration properties above will be
 +used to connect to the server.
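
Before pointing a generated client at the proxy, it can help to confirm that the
properties file contains the minimum keys listed under Configuration and to read back
the port the client should use. The following is a minimal sketch using only the JDK.
It assumes the file follows the standard +java.util.Properties+ format; the class
+ProxyConfigCheck+ and its methods are hypothetical and are not part of Accumulo.

[source,java]
----
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of Accumulo): checks proxy properties
// for the minimum set of keys named in the Configuration section.
class ProxyConfigCheck {

    // The five property names listed in the Configuration section.
    static final List<String> REQUIRED = Arrays.asList(
        "protocolFactory", "tokenClass", "port", "instance", "zookeepers");

    // Parse properties text; assumes the java.util.Properties file format.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Return the required keys that are absent or have a blank value.
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<String>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    // The port a generated client should connect to.
    static int port(Properties props) {
        return Integer.parseInt(props.getProperty("port").trim());
    }

    public static void main(String[] args) {
        // Same values as the minimal configuration shown earlier.
        String sample =
            "protocolFactory=org.apache.thrift.protocol.TCompactProtocol$Factory\n"
            + "tokenClass=org.apache.accumulo.core.client.security.tokens.PasswordToken\n"
            + "port=42424\n"
            + "instance=test\n"
            + "zookeepers=localhost:2181\n";
        Properties props = parse(sample);
        System.out.println("missing: " + missingKeys(props));
        System.out.println("client port: " + port(props));
    }
}
----

In practice the text would be read from the properties file passed to the proxy with
+-p+ rather than from an inline string; the inline sample only keeps the sketch
self-contained.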
 +
 +==== Using a Proxy Client
 +
 +The following examples are written in Java; the method signatures may be
 +slightly different depending on the language specified when generating the client with
 +the Thrift compiler. After initiating a connection to the Proxy (see Apache Thrift's
 +documentation for examples of connecting to a Thrift service), the methods on the
 +proxy client will be available. The first thing to do is log in:
 +
 +[source,java]
 +Map<String,String> password = new HashMap<String,String>();
 +password.put("password", "secret");
 +ByteBuffer token = client.login("root", password);
 +
 +Once logged in, the token returned will be used for most subsequent calls to the client.
 +Let's create a table, add some data, scan the table, and delete it.
 +
 +
 +First, create a table:
 +
 +[source,java]
 +client.createTable(token, "myTable", true, TimeType.MILLIS);
 +
 +
 +Next, add some data:
 +
 +[source,java]
 +----
 +// first, create a writer on the server
 +String writer = client.createWriter(token, "myTable", new WriterOptions());
 +
 +// build column updates
 +Map<ByteBuffer, List<ColumnUpdate>> cellsToUpdate = //...
 +
 +// send updates to the server
 +client.updateAndFlush(writer, "myTable", cellsToUpdate);
 +
 +client.closeWriter(writer);
 +----
 +
 +
 +Scan for the data and batch the return of the results on the server:
 +
 +[source,java]
 +----
 +String scanner = client.createScanner(token, "myTable", new ScanOptions());
 +// fetch up to 100 results in one call
 +ScanResult results = client.nextK(scanner, 100);
 +
 +for(KeyValue keyValue : results.getResults()) {
 +    // do something with results
 +}
 +
 +client.closeScanner(scanner);
 +----
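
The examples above pass binary data as +ByteBuffer+ values: the login token is one,
and the update map is keyed by them. The following is a minimal sketch of converting
between Strings and ByteBuffers with the JDK alone. It assumes the data is UTF-8 text;
the class +ProxyBytes+ and its methods are hypothetical and are not part of the proxy API.

[source,java]
----
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of the proxy API): converts between
// Strings and the ByteBuffers used in the proxy client examples.
class ProxyBytes {

    // Encode a String as UTF-8 bytes wrapped in a ByteBuffer.
    static ByteBuffer toBytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // Decode the remaining bytes as UTF-8. duplicate() leaves the
    // caller's position and limit untouched, so the buffer can be reused.
    static String fromBytes(ByteBuffer buf) {
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        ByteBuffer row = toBytes("row1");
        System.out.println(fromBytes(row));
        // The buffer is still readable after decoding.
        System.out.println(row.remaining());
    }
}
----

Because +ByteBuffer.equals+ compares the remaining bytes, two buffers built from the
same String compare equal, which is what makes them usable as map keys.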