http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/design.md ---------------------------------------------------------------------- diff --git a/docs/master/design.md b/docs/master/design.md new file mode 100644 index 0000000..6c77cb6 --- /dev/null +++ b/docs/master/design.md @@ -0,0 +1,180 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +== Accumulo Design + +=== Data Model + +Accumulo provides a richer data model than simple key-value stores, but is not a +fully relational database. Data is represented as key-value pairs, where the key and +value are comprised of the following elements: + +[width="75%",cols="^,^,^,^,^,^"] +|=========================================================================== + 5+|Key .3+^.^|Value +.2+^.^|Row ID 3+|Column .2+^.^|Timestamp + |Family |Qualifier |Visibility +|=========================================================================== + +All elements of the Key and the Value are represented as byte arrays except for +Timestamp, which is a Long. Accumulo sorts keys by element and lexicographically +in ascending order. Timestamps are sorted in descending order so that later +versions of the same Key appear first in a sequential scan. Tables consist of a set of +sorted key-value pairs. + +=== Architecture + +Accumulo is a distributed data storage and retrieval system and as such consists of +several architectural components, some of which run on many individual servers. +Much of the work Accumulo does involves maintaining certain properties of the +data, such as organization, availability, and integrity, across many commodity-class +machines. + +=== Components + +An instance of Accumulo includes many TabletServers, one Garbage Collector process, +one Master server and many Clients. + +==== Tablet Server + +The TabletServer manages some subset of all the tablets (partitions of tables). This includes receiving writes from clients, persisting writes to a +write-ahead log, sorting new key-value pairs in memory, periodically +flushing sorted key-value pairs to new files in HDFS, and responding +to reads from clients, forming a merge-sorted view of all keys and +values from all the files it has created and the sorted in-memory +store. + +TabletServers also perform recovery of a tablet +that was previously on a server that failed, reapplying any writes +found in the write-ahead log to the tablet. + +==== Garbage Collector + +Accumulo processes will share files stored in HDFS. Periodically, the Garbage +Collector will identify files that are no longer needed by any process, and +delete them. Multiple garbage collectors can be run to provide hot-standby support. +They will perform leader election among themselves to choose a single active instance. 
+ +==== Master + +The Accumulo Master is responsible for detecting and responding to TabletServer +failure. It tries to balance the load across TabletServer by assigning tablets carefully +and instructing TabletServers to unload tablets when necessary. The Master ensures all +tablets are assigned to one TabletServer each, and handles table creation, alteration, +and deletion requests from clients. The Master also coordinates startup, graceful +shutdown and recovery of changes in write-ahead logs when Tablet servers fail. + +Multiple masters may be run. The masters will choose among themselves a single master, +and the others will become backups if the master should fail. + +==== Tracer + +The Accumulo Tracer process supports the distributed timing API provided by Accumulo. +One to many of these processes can be run on a cluster which will write the timing +information to a given Accumulo table for future reference. Seeing the section on +Tracing for more information on this support. + +==== Monitor + +The Accumulo Monitor is a web application that provides a wealth of information about +the state of an instance. The Monitor shows graphs and tables which contain information +about read/write rates, cache hit/miss rates, and Accumulo table information such as scan +rate and active/queued compactions. Additionally, the Monitor should always be the first +point of entry when attempting to debug an Accumulo problem as it will show high-level problems +in addition to aggregated errors from all nodes in the cluster. See the section on <<monitoring>> +for more information. + +Multiple Monitors can be run to provide hot-standby support in the face of failure. Due to the +forwarding of logs from remote hosts to the Monitor, only one Monitor process should be active +at one time. Leader election will be performed internally to choose the active Monitor. + +==== Client + +Accumulo includes a client library that is linked to every application. The client +library contains logic for finding servers managing a particular tablet, and +communicating with TabletServers to write and retrieve key-value pairs. + +=== Data Management + +Accumulo stores data in tables, which are partitioned into tablets. Tablets are +partitioned on row boundaries so that all of the columns and values for a particular +row are found together within the same tablet. The Master assigns Tablets to one +TabletServer at a time. This enables row-level transactions to take place without +using distributed locking or some other complicated synchronization mechanism. As +clients insert and query data, and as machines are added and removed from the +cluster, the Master migrates tablets to ensure they remain available and that the +ingest and query load is balanced across the cluster. + +image::data_distribution.png[width=500] + +=== Tablet Service + + +When a write arrives at a TabletServer it is written to a Write-Ahead Log and +then inserted into a sorted data structure in memory called a MemTable. When the +MemTable reaches a certain size, the TabletServer writes out the sorted +key-value pairs to a file in HDFS called a Relative Key File (RFile), which is a +kind of Indexed Sequential Access Method (ISAM) file. This process is called a +minor compaction. A new MemTable is then created and the fact of the compaction +is recorded in the Write-Ahead Log. 
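The write path described above begins with a client submitting mutations through the client library. Purely as a point of orientation, a minimal client-side write might look like the following sketch; the table name and cell contents are illustrative, and obtaining the +conn+ Connector is assumed to have been done as shown in the development clients documentation.

[source,java]
----
// Assumes "mytable" exists and conn is a previously obtained Connector.
BatchWriter writer = conn.createBatchWriter("mytable", new BatchWriterConfig());

Mutation m = new Mutation("row1");   // row ID
// column family, qualifier, visibility, and value -- the elements of the data model
m.put("cf", "cq", new ColumnVisibility("public"), new Value("value".getBytes()));

writer.addMutation(m);   // buffered and sent to the TabletServer hosting row1's tablet
writer.close();          // flushes any remaining buffered mutations
----

Each mutation is routed to the TabletServer currently hosting the tablet that contains its row, where it follows the write-ahead log, MemTable, and minor compaction path described above.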
+ +When a request to read data arrives at a TabletServer, the TabletServer does a +binary search across the MemTable as well as the in-memory indexes associated +with each RFile to find the relevant values. If clients are performing a scan, +several key-value pairs are returned to the client in order from the MemTable +and the set of RFiles by performing a merge-sort as they are read. + +=== Compactions + +In order to manage the number of files per tablet, periodically the TabletServer +performs Major Compactions of files within a tablet, in which some set of RFiles +are combined into one file. The previous files will eventually be removed by the +Garbage Collector. This also provides an opportunity to permanently remove +deleted key-value pairs by omitting key-value pairs suppressed by a delete entry +when the new file is created. + +=== Splitting + +When a table is created it has one tablet. As the table grows its initial +tablet eventually splits into two tablets. Its likely that one of these +tablets will migrate to another tablet server. As the table continues to grow, +its tablets will continue to split and be migrated. The decision to +automatically split a tablet is based on the size of a tablets files. The +size threshold at which a tablet splits is configurable per table. In addition +to automatic splitting, a user can manually add split points to a table to +create new tablets. Manually splitting a new table can parallelize reads and +writes giving better initial performance without waiting for automatic +splitting. + +As data is deleted from a table, tablets may shrink. Over time this can lead +to small or empty tablets. To deal with this, merging of tablets was +introduced in Accumulo 1.4. This is discussed in more detail later. + +=== Fault-Tolerance + +If a TabletServer fails, the Master detects it and automatically reassigns the tablets +assigned from the failed server to other servers. Any key-value pairs that were in +memory at the time the TabletServer fails are automatically reapplied from the Write-Ahead +Log(WAL) to prevent any loss of data. + +Tablet servers write their WALs directly to HDFS so the logs are available to all tablet +servers for recovery. To make the recovery process efficient, the updates within a log are +grouped by tablet. TabletServers can quickly apply the mutations from the sorted logs +that are destined for the tablets they have now been assigned. + +TabletServer failures are noted on the Master's monitor page, accessible via ++http://master-address:9995/monitor+. + +image::failure_handling.png[width=500]
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/development_clients.md ---------------------------------------------------------------------- diff --git a/docs/master/development_clients.md b/docs/master/development_clients.md new file mode 100644 index 0000000..18821e3 --- /dev/null +++ b/docs/master/development_clients.md @@ -0,0 +1,107 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +== Development Clients + +Normally, Accumulo consists of lots of moving parts. Even a stand-alone version of +Accumulo requires Hadoop, Zookeeper, the Accumulo master, a tablet server, etc. If +you want to write a unit test that uses Accumulo, you need a lot of infrastructure +in place before your test can run. + +=== Mock Accumulo + +Mock Accumulo supplies mock implementations for much of the client API. It presently +does not enforce users, logins, permissions, etc. It does support Iterators and Combiners. +Note that MockAccumulo holds all data in memory, and will not retain any data or +settings between runs. + +While normal interaction with the Accumulo client looks like this: + +[source,java] +Instance instance = new ZooKeeperInstance(...); +Connector conn = instance.getConnector(user, passwordToken); + +To interact with the MockAccumulo, just replace the ZooKeeperInstance with MockInstance: + +[source,java] +Instance instance = new MockInstance(); + +In fact, you can use the +--fake+ option to the Accumulo shell and interact with +MockAccumulo: + +---- +$ accumulo shell --fake -u root -p '' + +Shell - Apache Accumulo Interactive Shell +- +- version: 2.x.x +- instance name: fake +- instance id: mock-instance-id +- +- type 'help' for a list of available commands +- + +root@fake> createtable test + +root@fake test> insert row1 cf cq value +root@fake test> insert row2 cf cq value2 +root@fake test> insert row3 cf cq value3 + +root@fake test> scan +row1 cf:cq [] value +row2 cf:cq [] value2 +row3 cf:cq [] value3 + +root@fake test> scan -b row2 -e row2 +row2 cf:cq [] value2 + +root@fake test> +---- + +When testing Map Reduce jobs, you can also set the Mock Accumulo on the AccumuloInputFormat +and AccumuloOutputFormat classes: + +[source,java] +// ... set up job configuration +AccumuloInputFormat.setMockInstance(job, "mockInstance"); +AccumuloOutputFormat.setMockInstance(job, "mockInstance"); + +=== Mini Accumulo Cluster + +While the Mock Accumulo provides a lightweight implementation of the client API for unit +testing, it is often necessary to write more realistic end-to-end integration tests that +take advantage of the entire ecosystem. The Mini Accumulo Cluster makes this possible by +configuring and starting Zookeeper, initializing Accumulo, and starting the Master as well +as some Tablet Servers. 
It runs against the local filesystem instead of having to start +up HDFS. + +To start it up, you will need to supply an empty directory and a root password as arguments: + +[source,java] +File tempDirectory = // JUnit and Guava supply mechanisms for creating temp directories +MiniAccumuloCluster accumulo = new MiniAccumuloCluster(tempDirectory, "password"); +accumulo.start(); + +Once we have our mini cluster running, we will want to interact with the Accumulo client API: + +[source,java] +Instance instance = new ZooKeeperInstance(accumulo.getInstanceName(), accumulo.getZooKeepers()); +Connector conn = instance.getConnector("root", new PasswordToken("password")); + +Upon completion of our development code, we will want to shutdown our MiniAccumuloCluster: + +[source,java] +accumulo.stop(); +// delete your temporary folder http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/high_speed_ingest.md ---------------------------------------------------------------------- diff --git a/docs/master/high_speed_ingest.md b/docs/master/high_speed_ingest.md new file mode 100644 index 0000000..2a7a702 --- /dev/null +++ b/docs/master/high_speed_ingest.md @@ -0,0 +1,124 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +== High-Speed Ingest + +Accumulo is often used as part of a larger data processing and storage system. To +maximize the performance of a parallel system involving Accumulo, the ingestion +and query components should be designed to provide enough parallelism and +concurrency to avoid creating bottlenecks for users and other systems writing to +and reading from Accumulo. There are several ways to achieve high ingest +performance. + +=== Pre-Splitting New Tables + +New tables consist of a single tablet by default. As mutations are applied, the table +grows and splits into multiple tablets which are balanced by the Master across +TabletServers. This implies that the aggregate ingest rate will be limited to fewer +servers than are available within the cluster until the table has reached the point +where there are tablets on every TabletServer. + +Pre-splitting a table ensures that there are as many tablets as desired available +before ingest begins to take advantage of all the parallelism possible with the cluster +hardware. Tables can be split at any time by using the shell: + + user@myinstance mytable> addsplits -sf /local_splitfile -t mytable + +For the purposes of providing parallelism to ingest it is not necessary to create more +tablets than there are physical machines within the cluster as the aggregate ingest +rate is a function of the number of physical machines. Note that the aggregate ingest +rate is still subject to the number of machines running ingest clients, and the +distribution of rowIDs across the table. 
The aggregation ingest rate will be +suboptimal if there are many inserts into a small number of rowIDs. + +=== Multiple Ingester Clients + +Accumulo is capable of scaling to very high rates of ingest, which is dependent upon +not just the number of TabletServers in operation but also the number of ingest +clients. This is because a single client, while capable of batching mutations and +sending them to all TabletServers, is ultimately limited by the amount of data that +can be processed on a single machine. The aggregate ingest rate will scale linearly +with the number of clients up to the point at which either the aggregate I/O of +TabletServers or total network bandwidth capacity is reached. + +In operational settings where high rates of ingest are paramount, clusters are often +configured to dedicate some number of machines solely to running Ingester Clients. +The exact ratio of clients to TabletServers necessary for optimum ingestion rates +will vary according to the distribution of resources per machine and by data type. + +=== Bulk Ingest + +Accumulo supports the ability to import files produced by an external process such +as MapReduce into an existing table. In some cases it may be faster to load data this +way rather than via ingesting through clients using BatchWriters. This allows a large +number of machines to format data the way Accumulo expects. The new files can +then simply be introduced to Accumulo via a shell command. + +To configure MapReduce to format data in preparation for bulk loading, the job +should be set to use a range partitioner instead of the default hash partitioner. The +range partitioner uses the split points of the Accumulo table that will receive the +data. The split points can be obtained from the shell and used by the MapReduce +RangePartitioner. Note that this is only useful if the existing table is already split +into multiple tablets. + + user@myinstance mytable> getsplits + aa + ab + ac + ... + zx + zy + zz + +Run the MapReduce job, using the AccumuloFileOutputFormat to create the files to +be introduced to Accumulo. Once this is complete, the files can be added to +Accumulo via the shell: + + user@myinstance mytable> importdirectory /files_dir /failures + +Note that the paths referenced are directories within the same HDFS instance over +which Accumulo is running. Accumulo places any files that failed to be added to the +second directory specified. + +See the https://github.com/apache/accumulo-examples/blob/master/docs/bulkIngest.md[Bulk Ingest example] +for a complete example. + +=== Logical Time for Bulk Ingest + +Logical time is important for bulk imported data, for which the client code may +be choosing a timestamp. At bulk import time, the user can choose to enable +logical time for the set of files being imported. When its enabled, Accumulo +uses a specialized system iterator to lazily set times in a bulk imported file. +This mechanism guarantees that times set by unsynchronized multi-node +applications (such as those running on MapReduce) will maintain some semblance +of causal ordering. This mitigates the problem of the time being wrong on the +system that created the file for bulk import. These times are not set when the +file is imported, but whenever it is read by scans or compactions. At import, a +time is obtained and always used by the specialized system iterator to set that +time. + +The timestamp assigned by Accumulo will be the same for every key in the file. 
+This could cause problems if the file contains multiple keys that are identical
+except for the timestamp. In this case, the sort order of the keys will be
+undefined. This could occur if an insert and an update were in the same bulk
+import file.
+
+=== MapReduce Ingest
+
+It is possible to efficiently write many mutations to Accumulo in parallel via a
+MapReduce job. In this scenario the MapReduce job is written to process data that lives
+in HDFS and write mutations to Accumulo using the AccumuloOutputFormat. See
+the MapReduce section under Analytics for details. The https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md[MapReduce example]
+is also a good reference for example code.

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/implementation.md
----------------------------------------------------------------------
diff --git a/docs/master/implementation.md b/docs/master/implementation.md
new file mode 100644
index 0000000..520f538
--- /dev/null
+++ b/docs/master/implementation.md
@@ -0,0 +1,86 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+== Implementation Details
+
+=== Fault-Tolerant Executor (FATE)
+
+Accumulo must implement a number of distributed, multi-step operations to support
+the client API. Creating a new table is a simple example of an atomic client call
+which requires multiple steps in the implementation: get a unique table ID, configure
+default table permissions, populate information in ZooKeeper to record the table's
+existence, create directories in HDFS for the table's data, etc. Implementing these
+steps in a way that is tolerant to node failure and other concurrent operations is
+very difficult to achieve. Accumulo includes a Fault-Tolerant Executor (FATE) which
+is widely used server-side to implement the client API safely and correctly.
+
+FATE is the implementation detail which ensures that tables in creation when the
+Master dies will be successfully created when another Master process is started.
+This alleviates the need for any external tools to correct some bad state -- Accumulo can
+undo the failure and self-heal without any external intervention.
+
+=== Overview
+
+FATE consists of three primary components: a repeatable, persisted operation (REPO), a storage
+layer for REPOs and an execution system to run REPOs. Accumulo uses ZooKeeper as the storage
+layer for FATE and the Accumulo Master acts as the execution system to run REPOs.
+
+The important characteristic of REPOs is that they are implemented in a way that is idempotent:
+every operation must be able to undo or replay a partial execution of itself. Requiring the
+implementation of the operation to support this functionality greatly simplifies the execution
+of these operations.
This property is also what guarantees safety in light of failure conditions.
+
+=== Administration
+
+Sometimes, it is useful to inspect the current FATE operations, both pending and executing.
+For example, a command that is not completing could be blocked on the execution of another
+operation. Accumulo provides a shell command to interact with FATE.
+
+The +fate+ shell command accepts a number of arguments for different functionality:
++list+/+print+, +fail+, +delete+, +dump+.
+
+==== List/Print
+
+Without any additional arguments, this command will print all operations that still exist in
+the FATE store (ZooKeeper). This will include active, pending, and completed operations (completed
+operations are lazily removed from the store). Each operation includes a unique "transaction ID", the
+state of the operation (e.g. +NEW+, +IN_PROGRESS+, +FAILED+), any locks the
+transaction actively holds and any locks it is waiting to acquire.
+
+This option can also accept transaction IDs which will restrict the list of transactions shown.
+
+==== Fail
+
+This command can be used to manually fail a FATE transaction and requires a transaction ID
+as an argument. Failing an operation is not a normal procedure and should only be performed
+by an administrator who understands the implications of why they are failing the operation.
+
+==== Delete
+
+This command requires a transaction ID and will delete any locks that the transaction
+holds. Like the fail command, this command should only be used in extreme circumstances
+by an administrator who understands the implications of the command they are about to
+invoke. It is not normal to invoke this command.
+
+==== Dump
+
+This command accepts zero or more transaction IDs. If given no transaction IDs,
+it will dump all active transactions. A FATE operation is composed of a
+sequence of REPOs. In order to start a FATE transaction, a REPO is pushed onto
+a per-transaction REPO stack. The top of the stack always contains the next
+REPO the FATE transaction should execute. When a REPO is successful it may
+return another REPO which is pushed on the stack. The +dump+ command will
+print all of the REPOs on each transaction's stack. The REPOs are serialized to
+JSON in order to make them human readable.

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/introduction.md
----------------------------------------------------------------------
diff --git a/docs/master/introduction.md b/docs/master/introduction.md
new file mode 100644
index 0000000..1b964b4
--- /dev/null
+++ b/docs/master/introduction.md
@@ -0,0 +1,25 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+== Introduction
+Apache Accumulo is a highly scalable structured store based on Google's BigTable.
+Accumulo is written in Java and operates over the Hadoop Distributed File System +(HDFS), which is part of the popular Apache Hadoop project. Accumulo supports +efficient storage and retrieval of structured data, including queries for ranges, and +provides support for using Accumulo tables as input and output for MapReduce +jobs. + +Accumulo features automatic load-balancing and partitioning, data compression +and fine-grained security labels. http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/iterator_design.md ---------------------------------------------------------------------- diff --git a/docs/master/iterator_design.md b/docs/master/iterator_design.md new file mode 100644 index 0000000..4beaeb0 --- /dev/null +++ b/docs/master/iterator_design.md @@ -0,0 +1,401 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +== Iterator Design + +Accumulo SortedKeyValueIterators, commonly referred to as Iterators for short, are server-side programming constructs +that allow users to implement custom retrieval or computational purpose within Accumulo TabletServers. The name rightly +brings forward similarities to the Java Iterator interface; however, Accumulo Iterators are more complex than Java +Iterators. Notably, in addition to the expected methods to retrieve the current element and advance to the next element +in the iteration, Accumulo Iterators must also support the ability to "move" (`seek`) to an specified point in the +iteration (the Accumulo table). Accumulo Iterators are designed to be concatenated together, similar to applying a +series of transformations to a list of elements. Accumulo Iterators can duplicate their underlying source to create +multiple "pointers" over the same underlying data (which is extremely powerful since each stream is sorted) or they can +merge multiple Iterators into a single view. In this sense, a collection of Iterators operating in tandem is close to +a tree-structure than a list, but there is always a sense of a flow of Key-Value pairs through some Iterators. Iterators +are not designed to act as triggers nor are they designed to operate outside of the purview of a single table. + +Understanding how TabletServers invoke the methods on a SortedKeyValueIterator can be obtuse as the actual code is +buried within the implementation of the TabletServer; however, it is generally unnecessary to have a strong +understanding of this as the interface provides clear definitions about what each action each method should take. This +chapter aims to provide a more detailed description of how Iterators are invoked, some best practices and some common +pitfalls. 
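Before describing how the server side works, it is worth seeing how a client names an Iterator in the first place. The following is a hedged sketch of attaching a custom Iterator to a single scan using an +IteratorSetting+; the priority, name, option key, and the `com.example.MyFilteringIterator` class are all illustrative, and the same setting could instead be attached to a table configuration so that the Iterator also runs during compactions.

[source,java]
----
// Attach a custom Iterator to one scan. The named class must be available on the
// TabletServer's classpath (see the Instantiation section below).
Scanner scanner = conn.createScanner("mytable", Authorizations.EMPTY);

IteratorSetting setting = new IteratorSetting(30, "myFilter", "com.example.MyFilteringIterator");
setting.addOption("pattern", "foo.*");   // delivered to the Iterator's init() in the options Map

scanner.addScanIterator(setting);
for (Map.Entry<Key,Value> entry : scanner) {
  // only the Key-Value pairs the Iterator emits are returned to the client
}
----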
+ +=== Instantiation + +To invoke an Accumulo Iterator inside of the TabletServer, the Iterator class must be on the classpath of every +TabletServer. For production environments, it is common to place a JAR file which contains the Iterator in +`lib/`. In development environments, it is convenient to instead place the JAR file in `lib/ext/` as JAR files +in this directory are dynamically reloaded by the TabletServers alleviating the need to restart Accumulo while +testing an Iterator. Advanced classloader features which enable other types of filesystems and per-table classpath +configurations (as opposed to process-wide classpaths). These features are not covered here, but elsewhere in the user +manual. + +Accumulo references the Iterator class by name and uses Java reflection to instantiate the Iterator. This means that +Iterators must have a public no-args constructor. + +=== Interface + +A normal implementation of the SortedKeyValueIterator defines functionality for the following methods: + +[source,java] +---- +void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException; + +boolean hasTop(); + +void next() throws IOException; + +void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException; + +Key getTopKey(); + +Value getTopValue(); + +SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env); +---- + +==== `init` + +The `init` method is called by the TabletServer after it constructs an instance of the Iterator. This method should +clear/reset any internal state in the Iterator and prepare it to process data. The first argument, the `source`, is the +Iterator "below" this Iterator (where the client is at "top" and the Iterator for files in HDFS are at the "bottom"). +The "source" Iterator provides the Key-Value pairs which this Iterator will operate upon. + +The second argument, a Map of options, is made up of options provided by the user, options set in the table's +configuration, and/or options set in the containing namespace's configuration. +These options allow for Iterators to dynamically configure themselves on the fly. If no options are used in the current context +(a Scan or Compaction), the Map will be empty. An example of a configuration item for an Iterator could be a pattern used to filter +Key-Value pairs in a regular expression Iterator. + +The third argument, the `IteratorEnvironment`, is a special object which provides information to this Iterator about the +context in which it was invoked. Commonly, this information is not necessary to inspect. For example, if an Iterator +knows that it is running in the context of a full-major compaction (reading all of the data) as opposed to a user scan +(which may strongly limit the number of columns), the Iterator might make different algorithmic decisions in an attempt to +optimize itself. + +==== `seek` + +The `seek` method is likely the most confusing method on the Iterator interface. The purpose of this method is to +advance the stream of Key-Value pairs to a certain point in the iteration (the Accumulo table). It is common that before +the implementation of this method returns some additional processing is performed which may further advance the current +position past the `startKey` of the `Range`. This, however, is dependent on the functionality the iterator provides. For +example, a filtering iterator would consume a number Key-Value pairs which do not meets its criteria before `seek` +returns. 
The important condition for `seek` to meet is that this Iterator should be ready to return the first Key-Value +pair, or none if no such pair is available, when the method returns. The Key-Value pair would be returned by `getTopKey` +and `getTopValue`, respectively, and `hasTop` should return a boolean denoting whether or not there is +a Key-Value pair to return. + +The arguments passed to seek are as follows: + +The TabletServer first provides a `Range`, an object which defines some collection of Accumulo `Key`s, which defines the +Key-Value pairs that this Iterator should return. Each `Range` has a `startKey` and `endKey` with an inclusive flag for +both. While this Range is often similar to the Range(s) set by the client on a Scanner or BatchScanner, it is not +guaranteed to be a Range that the client set. Accumulo will split up larger ranges and group them together based on +Tablet boundaries per TabletServer. Iterators should not attempt to implement any custom logic based on the Range(s) +provided to `seek` and Iterators should not return any Keys that fall outside of the provided Range. + +The second argument, a `Collection<ByteSequence>`, is the set of column families which should be retained or +excluded by this Iterator. The third argument, a boolean, defines whether the collection of column families +should be treated as an inclusion collection (true) or an exclusion collection (false). + +It is likely that all implementations of `seek` will first make a call to the `seek` method on the +"source" Iterator that was provided in the `init` method. The collection of column families and +the boolean `include` argument should be passed down as well as the `Range`. Somewhat commonly, the Iterator will +also implement some sort of additional logic to find or compute the first Key-Value pair in the provided +Range. For example, a regular expression Iterator would consume all records which do not match the given +pattern before returning from `seek`. + +It is important to retain the original Range passed to this method to know when this Iterator should stop +reading more Key-Value pairs. Ignoring this typically does not affect scans from a Scanner, but it +will result in duplicate keys emitting from a BatchScan if the scanned table has more than one tablet. +Best practice is to never emit entries outside the seek range. + +==== `next` + +The `next` method is analogous to the `next` method on a Java Iterator: this method should advance +the Iterator to the next Key-Value pair. For implementations that perform some filtering or complex +logic, this may result in more than one Key-Value pair being inspected. This method alters +some internal state that is exposed via the `hasTop`, `getTopKey`, and `getTopValue` methods. + +The result of this method is commonly caching a Key-Value pair which `getTopKey` and `getTopValue` +can later return. While there is another Key-Value pair to return, `hasTop` should return true. +If there are no more Key-Value pairs to return from this Iterator since the last call to +`seek`, `hasTop` should return false. + +==== `hasTop` + +The `hasTop` method is similar to the `hasNext` method on a Java Iterator in that it informs +the caller if there is a Key-Value pair to be returned. If there is no pair to return, this method +should return false. Like a Java Iterator, multiple calls to `hasTop` (without calling `next`) should not +alter the internal state of the Iterator. 
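Taken together, `seek`, `next`, and `hasTop` usually follow a single pattern in filtering-style Iterators: position the source, then advance until the next acceptable pair is found. The following is a simplified sketch of that pattern only; the `accept` and `findTop` names are illustrative, and the abstract `Filter` class described later in this chapter implements essentially this logic.

[source,java]
----
// Simplified skeleton of the seek/next/hasTop pattern for a filtering Iterator.
// "source" is the Iterator provided to init(); accept() is this Iterator's own test.
@Override
public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException {
  source.seek(range, columnFamilies, inclusive);  // position the underlying stream
  findTop();                                      // skip pairs this Iterator rejects
}

@Override
public void next() throws IOException {
  source.next();   // move past the pair that was just returned
  findTop();
}

private void findTop() throws IOException {
  while (source.hasTop() && !accept(source.getTopKey(), source.getTopValue())) {
    source.next();
  }
}

@Override
public boolean hasTop() {
  return source.hasTop();
}
----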
+ +==== `getTopKey` and `getTopValue` + +These methods simply return the current Key-Value pair for this iterator. If `hasTop` returns true, +both of these methods should return non-null objects. If `hasTop` returns false, it is undefined +what these methods should return. Like `hasTop`, multiple calls to these methods should not alter +the state of the Iterator. + +Users should take caution when either + +1. caching the Key/Value from `getTopKey`/`getTopValue`, for use after calling `next` on the source iterator. +In this case, the cached Key/Value object is aliased to the reference returned by the source iterator. +Iterators may reuse the same Key/Value object in a `next` call for performance reasons, changing the data +that the cached Key/Value object references and resulting in a logic bug. +2. modifying the Key/Value from `getTopKey`/`getTopValue`. If the source iterator reuses data stored in the Key/Value, +then the source iterator may use the modified data that the Key/Value references. This may/may not result in a logic bug. + +In both cases, copying the Key/Value's data into a new object ensures iterator correctness. If neither case applies, +it is safe to not copy the Key/Value. The general guideline is to be aware of who else may use Key/Value objects +returned from `getTopKey`/`getTopValue`. + +==== `deepCopy` + +The `deepCopy` method is similar to the `clone` method from the Java `Cloneable` interface. +Implementations of this method should return a new object of the same type as the Accumulo Iterator +instance it was called on. Any internal state from the instance `deepCopy` was called +on should be carried over to the returned copy. The returned copy should be ready to have +`seek` called on it. The SortedKeyValueIterator interface guarantees that `init` will be called on +an iterator before `deepCopy` and that `init` will not be called on the iterator returned by +`deepCopy`. + +Typically, implementations of `deepCopy` call a copy-constructor which will initialize +internal data structures. As with `seek`, it is common for the `IteratorEnvironment` +argument to be ignored as most Iterator implementations can be written without the explicit +information the environment provides. + +In the analogy of a series of Iterators representing a tree, `deepCopy` can be thought of as +early programming assignments which implement their own tree data structures. `deepCopy` calls +copy on its sources (the children), copies itself, attaches the copies of the children, and +then returns itself. + +=== TabletServer invocation of Iterators + +The following code is a general outline for how TabletServers invoke Iterators. + +[source,java] +---- + List<KeyValue> batch; + Range range = getRangeFromClient(); + while(!overSizeLimit(batch)){ + SortedKeyValueIterator source = getSystemIterator(); + + for(String clzName : getUserIterators()){ + Class<?> clz = Class.forName(clzName); + SortedKeyValueIterator iter = (SortedKeyValueIterator) clz.newInstance(); + iter.init(source, opts, env); + source = iter; + } + + // read a batch of data to return to client + // the last iterator, the "top" + SortedKeyValueIterator topIter = source; + topIter.seek(getRangeFromUser(), ...) 
+
+    while(topIter.hasTop() && !overSizeLimit(batch)){
+      key = topIter.getTopKey();
+      val = topIter.getTopValue();
+      batch.add(new KeyValue(key, val));
+      if(systemDataSourcesChanged()){
+        // code does not show isolation case, which will
+        // keep using same data sources until a row boundary is hit
+        range = new Range(key, false, range.endKey(), range.endKeyInclusive());
+        break;
+      }
+    }
+  }
+  //return batch of key values to client
+----
+
+Additionally, the obtuse "re-seek" case can be outlined as the following:
+
+[source,java]
+----
+  // Given the above
+  List<KeyValue> batch = getNextBatch();
+
+  // Store off lastKeyReturned for this client
+  lastKeyReturned = batch.get(batch.size() - 1).getKey();
+
+  // thread goes away (client stops asking for the next batch).
+
+  // Eventually client comes back
+  // Setup as before...
+
+  Range userRange = getRangeFromUser();
+  Range actualRange = new Range(lastKeyReturned, false,
+      userRange.getEndKey(), userRange.isEndKeyInclusive());
+
+  // Use the actualRange, not the user provided one
+  topIter.seek(actualRange);
+----
+
+
+=== Isolation
+
+Accumulo provides a feature which clients can enable to prevent the viewing of partially
+applied mutations within the context of rows. If a client is submitting multiple column
+updates to rows at a time, isolation would ensure that a client would either see all of the
+updates made to that row or none of the updates (until they are all applied).
+
+When using Isolation, there are additional concerns in iterator design. A scan-time iterator in Accumulo
+reads from a set of data sources. While an iterator is reading data it has an isolated view. However, after it returns a
+key/value it is possible that Accumulo may switch data sources and re-seek the iterator. This is done so that resources
+may be reclaimed. When the user does not request isolation this can occur after any key is returned. When a user enables
+Isolation, this will only occur after a new row is returned, in which case it will re-seek to the very beginning of the
+next possible row.
+
+=== Abstract Iterators
+
+A number of Abstract implementations of Iterators are provided to allow for faster creation
+of common patterns. The most commonly used abstract implementations are the `Filter` and
+`Combiner` classes. When possible these classes should be used instead as they have been
+thoroughly tested inside Accumulo itself.
+
+==== Filter
+
+The `Filter` abstract Iterator provides a very simple implementation which allows implementations
+to define whether or not a Key-Value pair should be returned via an `accept(Key, Value)` method.
+
+Filters are extremely simple to implement; however, when the implementation is filtering a
+large percentage of Key-Value pairs with respect to the total number of pairs examined,
+it can be very inefficient. For example, if a Filter implementation can determine after examining
+part of the row that no other pairs in this row will be accepted, there is no mechanism to
+efficiently skip the remaining Key-Value pairs. Concretely, take a row which is comprised of
+1000 Key-Value pairs. After examining the first 10 Key-Value pairs, it is determined
+that no other Key-Value pairs in this row will be accepted. The Filter must still examine each
+of the remaining 990 Key-Value pairs in this row. Another way to express this deficiency is that
+Filters have no means to leverage the `seek` method to efficiently skip large portions
+of Key-Value pairs.
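Despite that limitation, a Filter is usually only a few lines of code. The following is a hedged sketch of a Filter that keeps only Key-Value pairs whose row begins with a configured prefix; the class name and the `prefix` option are hypothetical. Note that even this simple example is subject to the row-skipping deficiency just described.

[source,java]
----
import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Filter;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

/** Keeps only Key-Value pairs whose row starts with a configured prefix (illustrative only). */
public class RowPrefixFilter extends Filter {

  private String prefix = "";

  @Override
  public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options,
      IteratorEnvironment env) throws IOException {
    super.init(source, options, env);
    if (options.containsKey("prefix")) {
      prefix = options.get("prefix");
    }
  }

  @Override
  public boolean accept(Key k, Value v) {
    return k.getRow().toString().startsWith(prefix);
  }

  @Override
  public SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env) {
    // carry the configured prefix over to the copy
    RowPrefixFilter copy = (RowPrefixFilter) super.deepCopy(env);
    copy.prefix = prefix;
    return copy;
  }
}
----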
As such, the `Filter` class functions well for filtering small amounts of data, but is
+inefficient for filtering large amounts of data. The decision to use a `Filter` strongly
+depends on the use case and distribution of data being filtered.
+
+==== Combiner
+
+The `Combiner` class is another common abstract Iterator. Similar to the `Combiner` interface
+defined in Hadoop's MapReduce framework, implementations of this abstract class reduce
+multiple Values for different versions of a Key (Keys which only differ by timestamps) into one Key-Value pair.
+Combiners provide a simple way to implement common operations like summation and
+aggregation without the need to implement the entire Accumulo Iterator interface.
+
+One important consideration when choosing to design a Combiner is that the "reduction" operation
+works best when it is associative and commutative. Operations which do not meet
+these criteria can be implemented; however, the implementation can be difficult.
+
+A second consideration is that a Combiner is not guaranteed to see every Key-Value pair
+which differs only by timestamp every time it is invoked. For example, if there are 5 Key-Value
+pairs in a table which only differ by the timestamps 1, 2, 3, 4, and 5, it is not guaranteed that
+every invocation of the Combiner will see 5 timestamps. One invocation might see the Values for
+Keys with timestamp 1 and 4, while another invocation might see the Values for Keys with the
+timestamps 1, 2, 4 and 5.
+
+Finally, when configuring an Accumulo table to use a Combiner, be sure to disable the Versioning Iterator or set the
+Combiner at a priority less than the Versioning Iterator (the Versioning Iterator is added at a priority of 20 by default). The
+Versioning Iterator will filter out multiple Key-Value pairs that differ only by timestamp and return only the Key-Value
+pair that has the largest timestamp.
+
+=== Best practices
+
+Because of the flexibility that the `SortedKeyValueIterator` interface provides, it doesn't directly disallow
+many implementations which are poor design decisions. The following are some common recommendations to
+follow and pitfalls to avoid in Iterator implementations.
+
+==== Avoid special logic encoded in Ranges
+
+Commonly, granular Ranges that a client passes to an Iterator from a `Scanner` or `BatchScanner` are unmodified.
+If a `Range` falls within the boundaries of a Tablet, an Iterator will often see that same Range in the
+`seek` method. However, there is no guarantee that the `Range` will remain unaltered from client to server. As such, Iterators
+should *never* make assumptions about the current state/context based on the `Range`.
+
+The common failure condition is referred to as a "re-seek". In the context of a Scan, TabletServers construct the
+"stack" of Iterators and batch up Key-Value pairs to send back to the client. When a sufficient number of Key-Value
+pairs are collected, it is common for the Iterators to be "torn down" until the client asks for the next batch of
+Key-Value pairs. This is done by the TabletServer to add fairness in ensuring one Scan does not monopolize the available
+resources. When the client asks for the next batch, the implementation modifies the original Range so that servers know
+the point to resume the iteration (to avoid returning duplicate Key-Value pairs). Specifically, the new Range is created
+from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan,
+non-inclusive.
+ +==== `seek`'ing backwards + +The ability for an Iterator to "skip over" large blocks of Key-Value pairs is a major tenet behind Iterators. +By `seek`'ing when it is known that there is a collection of Key-Value pairs which can be ignored can +greatly increase the speed of a scan as many Key-Value pairs do not have to be deserialized and processed. + +While the `seek` method provides the `Range` that should be used to `seek` the underlying source Iterator, +there is no guarantee that the implementing Iterator uses that `Range` to perform the `seek` on its +"source" Iterator. As such, it is possible to seek to any `Range` and the interface has no assertions +to prevent this from happening. + +Since Iterators are allowed to `seek` to arbitrary Keys, it also allows Iterators to create infinite loops +inside Scans that will repeatedly read the same data without end. If an arbitrary Range is constructed, it should +construct a completely new Range as it allows for bugs to be introduced which will break Accumulo. + +Thus, `seek`'s should always be thought of as making "forward progress" in the view of the total iteration. The +`startKey` of a `Range` should always be greater than the current Key seen by the Iterator while the `endKey` of the +`Range` should always retain the original `endKey` (and `endKey` inclusivity) of the last `Range` seen by your +Iterator's implementation of seek. + +==== Take caution in constructing new data in an Iterator + +Implementations of Iterator might be tempted to open BatchWriters inside of an Iterator as a means +to implement triggers for writing additional data outside of their client application. The lifecycle of an Iterator +is *not* managed in such a way that guarantees that this is safe nor efficient. Specifically, there +is no way to guarantee that the internal ThreadPool inside of the BatchWriter is closed (and the thread(s) +are reaped) without calling the close() method. `close`'ing and recreating a `BatchWriter` after every +Key-Value pair is also prohibitively performance limiting to be considered an option. + +The only safe way to generate additional data in an Iterator is to alter the current Key-Value pair. +For example, the `WholeRowIterator` serializes the all of the Key-Values pairs that fall within each +row. A safe way to generate more data in an Iterator would be to construct an Iterator that is +"higher" (at a larger priority) than the `WholeRowIterator`, that is, the Iterator receives the Key-Value pairs which are +a serialization of many Key-Value pairs. The custom Iterator could deserialize the pairs, compute +some function, and add a new Key-Value pair to the original collection, re-serializing the collection +of Key-Value pairs back into a single Key-Value pair. + +Any other situation is likely not guaranteed to ensure that the caller (a Scan or a Compaction) will +always see all intended data that is generated. + +=== Final things to remember + +Some simple recommendations/points to keep in mind: + +==== Method call order + +On an instance of an Iterator: `init` is always called before `seek`, `seek` is always called before `hasTop`, +`getTopKey` and `getTopValue` will not be called if `hasTop` returns false. + +==== Teardown + +As mentioned, instance of Iterators may be torn down inside of the server transparently. 
When a complex
+collection of iterators is performing some advanced functionality, they will not be torn down until a Key-Value
+pair is returned out of the "stack" of Iterators (and added into the batch of Key-Values to be returned
+to the caller). Being torn down is equivalent to a new instance of the Iterator being created and `deepCopy`
+being called on the new instance with the old instance provided as the argument to `deepCopy`. References
+to the old instance are removed and the object is lazily garbage collected by the JVM.
+
+=== Compaction-time Iterators
+
+When Iterators are configured to run during compactions, at the `minc` or `majc` scope, these Iterators sometimes need
+to make different assertions than those that only operate at scan time. Iterators won't see the delete entries; however,
+Iterators will not necessarily see all of the Key-Value pairs in every invocation. Because compactions often do not rewrite
+all files (only a subset of them), the Iterator's logic must take this into consideration.
+
+For example, a Combiner that runs over data during compactions might not see all of the values for a given Key. The
+Combiner must recognize this and not perform any function that would be incorrect due
+to the missing values.

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/iterator_test_harness.md
----------------------------------------------------------------------
diff --git a/docs/master/iterator_test_harness.md b/docs/master/iterator_test_harness.md
new file mode 100644
index 0000000..91ae53a
--- /dev/null
+++ b/docs/master/iterator_test_harness.md
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+== Iterator Testing
+
+Iterators, while extremely powerful, are notoriously difficult to test. While the API defines
+the methods an Iterator must implement and each method's functionality, the actual invocation
+of these methods by Accumulo TabletServers can be surprisingly difficult to mimic in unit tests.
+
+The Apache Accumulo "Iterator Test Harness" is designed to provide a generalized testing framework
+which all Accumulo Iterators can leverage to identify common pitfalls in user-created Iterators.
+
+=== Framework Use
+
+The harness provides an abstract class for use with JUnit4. Users must define the following for this
+abstract class:
+
+ * A `SortedMap` of input data (`Key`-`Value` pairs)
+ * A `Range` to use in tests
+ * A `Map` of options (`String` to `String` pairs)
+ * A `SortedMap` of output data (`Key`-`Value` pairs)
+ * A list of `IteratorTestCase`s (these can be automatically discovered)
+
+The majority of effort a user must make is in creating the input dataset and the expected
+output dataset for the iterator being tested.
+
+=== Normal Test Outline
+
+Most iterator tests will follow the given outline:
+
+[source,java]
+----
+import java.util.List;
+import java.util.Map;
+import java.util.SortedMap;
+
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Range;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.iteratortest.IteratorTestCaseFinder;
+import org.apache.accumulo.iteratortest.IteratorTestInput;
+import org.apache.accumulo.iteratortest.IteratorTestOutput;
+import org.apache.accumulo.iteratortest.junit4.BaseJUnit4IteratorTest;
+import org.apache.accumulo.iteratortest.testcases.IteratorTestCase;
+import org.junit.runners.Parameterized.Parameters;
+
+public class MyIteratorTest extends BaseJUnit4IteratorTest {
+
+  @Parameters
+  public static Object[][] parameters() {
+    final IteratorTestInput input = createIteratorInput();
+    final IteratorTestOutput output = createIteratorOutput();
+    final List<IteratorTestCase> testCases = IteratorTestCaseFinder.findAllTestCases();
+    return BaseJUnit4IteratorTest.createParameters(input, output, testCases);
+  }
+
+  private static SortedMap<Key,Value> INPUT_DATA = createInputData();
+  private static SortedMap<Key,Value> OUTPUT_DATA = createOutputData();
+
+  private static SortedMap<Key,Value> createInputData() {
+    // TODO -- implement this method
+  }
+
+  private static SortedMap<Key,Value> createOutputData() {
+    // TODO -- implement this method
+  }
+
+  private static IteratorTestInput createIteratorInput() {
+    final Map<String,String> options = createIteratorOptions();
+    final Range range = createRange();
+    return new IteratorTestInput(MyIterator.class, options, range, INPUT_DATA);
+  }
+
+  private static Map<String,String> createIteratorOptions() {
+    // TODO -- implement this method
+    // Tip: Use INPUT_DATA if helpful in generating output
+  }
+
+  private static Range createRange() {
+    // TODO -- implement this method
+  }
+
+  private static IteratorTestOutput createIteratorOutput() {
+    return new IteratorTestOutput(OUTPUT_DATA);
+  }
+
+}
+----
+
+=== Limitations
+
+While the provided `IteratorTestCase`s should exercise common edge-cases in user iterators,
+there are still many limitations to the existing test harness. Some of them are:
+
+ * Can only specify a single iterator, not many (a "stack")
+ * No control over provided IteratorEnvironment for tests
+ * Exercising delete keys (especially with major compactions that do not include all files)
+
+These are left as future improvements to the harness.
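As an illustration of the data-building methods left as TODOs in the outline above, the input and output datasets are ordinary sorted maps of `Key` to `Value`. The following hedged sketch (which additionally requires `java.util.TreeMap`) builds input data and the expected output for a hypothetical iterator that drops the column family named "skip":

[source,java]
----
private static SortedMap<Key,Value> createInputData() {
  SortedMap<Key,Value> data = new TreeMap<>();
  data.put(new Key("row1", "keep", "cq1"), new Value("v1".getBytes()));
  data.put(new Key("row1", "skip", "cq1"), new Value("v2".getBytes()));
  data.put(new Key("row2", "keep", "cq1"), new Value("v3".getBytes()));
  return data;
}

private static SortedMap<Key,Value> createOutputData() {
  // The expected result after the iterator under test removes the "skip" column family.
  SortedMap<Key,Value> data = new TreeMap<>(createInputData());
  data.keySet().removeIf(key -> key.getColumnFamily().toString().equals("skip"));
  return data;
}
----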
