http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/design.md ---------------------------------------------------------------------- diff --git a/docs/master/design.md b/docs/master/design.md deleted file mode 100644 index 6c77cb6..0000000 --- a/docs/master/design.md +++ /dev/null @@ -1,180 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -== Accumulo Design - -=== Data Model - -Accumulo provides a richer data model than simple key-value stores, but is not a -fully relational database. Data is represented as key-value pairs, where the key and -value are comprised of the following elements: - -[width="75%",cols="^,^,^,^,^,^"] -|=========================================================================== - 5+|Key .3+^.^|Value -.2+^.^|Row ID 3+|Column .2+^.^|Timestamp - |Family |Qualifier |Visibility -|=========================================================================== - -All elements of the Key and the Value are represented as byte arrays except for -Timestamp, which is a Long. Accumulo sorts keys by element and lexicographically -in ascending order. Timestamps are sorted in descending order so that later -versions of the same Key appear first in a sequential scan. Tables consist of a set of -sorted key-value pairs. - -=== Architecture - -Accumulo is a distributed data storage and retrieval system and as such consists of -several architectural components, some of which run on many individual servers. -Much of the work Accumulo does involves maintaining certain properties of the -data, such as organization, availability, and integrity, across many commodity-class -machines. - -=== Components - -An instance of Accumulo includes many TabletServers, one Garbage Collector process, -one Master server and many Clients. - -==== Tablet Server - -The TabletServer manages some subset of all the tablets (partitions of tables). This includes receiving writes from clients, persisting writes to a -write-ahead log, sorting new key-value pairs in memory, periodically -flushing sorted key-value pairs to new files in HDFS, and responding -to reads from clients, forming a merge-sorted view of all keys and -values from all the files it has created and the sorted in-memory -store. - -TabletServers also perform recovery of a tablet -that was previously on a server that failed, reapplying any writes -found in the write-ahead log to the tablet. - -==== Garbage Collector - -Accumulo processes will share files stored in HDFS. Periodically, the Garbage -Collector will identify files that are no longer needed by any process, and -delete them. Multiple garbage collectors can be run to provide hot-standby support. -They will perform leader election among themselves to choose a single active instance. - -==== Master - -The Accumulo Master is responsible for detecting and responding to TabletServer -failure. It tries to balance the load across TabletServer by assigning tablets carefully -and instructing TabletServers to unload tablets when necessary. The Master ensures all -tablets are assigned to one TabletServer each, and handles table creation, alteration, -and deletion requests from clients. The Master also coordinates startup, graceful -shutdown and recovery of changes in write-ahead logs when Tablet servers fail. - -Multiple masters may be run. The masters will choose among themselves a single master, -and the others will become backups if the master should fail. - -==== Tracer - -The Accumulo Tracer process supports the distributed timing API provided by Accumulo. -One to many of these processes can be run on a cluster which will write the timing -information to a given Accumulo table for future reference. Seeing the section on -Tracing for more information on this support. - -==== Monitor - -The Accumulo Monitor is a web application that provides a wealth of information about -the state of an instance. The Monitor shows graphs and tables which contain information -about read/write rates, cache hit/miss rates, and Accumulo table information such as scan -rate and active/queued compactions. Additionally, the Monitor should always be the first -point of entry when attempting to debug an Accumulo problem as it will show high-level problems -in addition to aggregated errors from all nodes in the cluster. See the section on <<monitoring>> -for more information. - -Multiple Monitors can be run to provide hot-standby support in the face of failure. Due to the -forwarding of logs from remote hosts to the Monitor, only one Monitor process should be active -at one time. Leader election will be performed internally to choose the active Monitor. - -==== Client - -Accumulo includes a client library that is linked to every application. The client -library contains logic for finding servers managing a particular tablet, and -communicating with TabletServers to write and retrieve key-value pairs. - -=== Data Management - -Accumulo stores data in tables, which are partitioned into tablets. Tablets are -partitioned on row boundaries so that all of the columns and values for a particular -row are found together within the same tablet. The Master assigns Tablets to one -TabletServer at a time. This enables row-level transactions to take place without -using distributed locking or some other complicated synchronization mechanism. As -clients insert and query data, and as machines are added and removed from the -cluster, the Master migrates tablets to ensure they remain available and that the -ingest and query load is balanced across the cluster. - -image::data_distribution.png[width=500] - -=== Tablet Service - - -When a write arrives at a TabletServer it is written to a Write-Ahead Log and -then inserted into a sorted data structure in memory called a MemTable. When the -MemTable reaches a certain size, the TabletServer writes out the sorted -key-value pairs to a file in HDFS called a Relative Key File (RFile), which is a -kind of Indexed Sequential Access Method (ISAM) file. This process is called a -minor compaction. A new MemTable is then created and the fact of the compaction -is recorded in the Write-Ahead Log. - -When a request to read data arrives at a TabletServer, the TabletServer does a -binary search across the MemTable as well as the in-memory indexes associated -with each RFile to find the relevant values. If clients are performing a scan, -several key-value pairs are returned to the client in order from the MemTable -and the set of RFiles by performing a merge-sort as they are read. - -=== Compactions - -In order to manage the number of files per tablet, periodically the TabletServer -performs Major Compactions of files within a tablet, in which some set of RFiles -are combined into one file. The previous files will eventually be removed by the -Garbage Collector. This also provides an opportunity to permanently remove -deleted key-value pairs by omitting key-value pairs suppressed by a delete entry -when the new file is created. - -=== Splitting - -When a table is created it has one tablet. As the table grows its initial -tablet eventually splits into two tablets. Its likely that one of these -tablets will migrate to another tablet server. As the table continues to grow, -its tablets will continue to split and be migrated. The decision to -automatically split a tablet is based on the size of a tablets files. The -size threshold at which a tablet splits is configurable per table. In addition -to automatic splitting, a user can manually add split points to a table to -create new tablets. Manually splitting a new table can parallelize reads and -writes giving better initial performance without waiting for automatic -splitting. - -As data is deleted from a table, tablets may shrink. Over time this can lead -to small or empty tablets. To deal with this, merging of tablets was -introduced in Accumulo 1.4. This is discussed in more detail later. - -=== Fault-Tolerance - -If a TabletServer fails, the Master detects it and automatically reassigns the tablets -assigned from the failed server to other servers. Any key-value pairs that were in -memory at the time the TabletServer fails are automatically reapplied from the Write-Ahead -Log(WAL) to prevent any loss of data. - -Tablet servers write their WALs directly to HDFS so the logs are available to all tablet -servers for recovery. To make the recovery process efficient, the updates within a log are -grouped by tablet. TabletServers can quickly apply the mutations from the sorted logs -that are destined for the tablets they have now been assigned. - -TabletServer failures are noted on the Master's monitor page, accessible via -+http://master-address:9995/monitor+. - -image::failure_handling.png[width=500]
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/development_clients.md ---------------------------------------------------------------------- diff --git a/docs/master/development_clients.md b/docs/master/development_clients.md deleted file mode 100644 index 18821e3..0000000 --- a/docs/master/development_clients.md +++ /dev/null @@ -1,107 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -== Development Clients - -Normally, Accumulo consists of lots of moving parts. Even a stand-alone version of -Accumulo requires Hadoop, Zookeeper, the Accumulo master, a tablet server, etc. If -you want to write a unit test that uses Accumulo, you need a lot of infrastructure -in place before your test can run. - -=== Mock Accumulo - -Mock Accumulo supplies mock implementations for much of the client API. It presently -does not enforce users, logins, permissions, etc. It does support Iterators and Combiners. -Note that MockAccumulo holds all data in memory, and will not retain any data or -settings between runs. - -While normal interaction with the Accumulo client looks like this: - -[source,java] -Instance instance = new ZooKeeperInstance(...); -Connector conn = instance.getConnector(user, passwordToken); - -To interact with the MockAccumulo, just replace the ZooKeeperInstance with MockInstance: - -[source,java] -Instance instance = new MockInstance(); - -In fact, you can use the +--fake+ option to the Accumulo shell and interact with -MockAccumulo: - ----- -$ accumulo shell --fake -u root -p '' - -Shell - Apache Accumulo Interactive Shell -- -- version: 2.x.x -- instance name: fake -- instance id: mock-instance-id -- -- type 'help' for a list of available commands -- - -root@fake> createtable test - -root@fake test> insert row1 cf cq value -root@fake test> insert row2 cf cq value2 -root@fake test> insert row3 cf cq value3 - -root@fake test> scan -row1 cf:cq [] value -row2 cf:cq [] value2 -row3 cf:cq [] value3 - -root@fake test> scan -b row2 -e row2 -row2 cf:cq [] value2 - -root@fake test> ----- - -When testing Map Reduce jobs, you can also set the Mock Accumulo on the AccumuloInputFormat -and AccumuloOutputFormat classes: - -[source,java] -// ... set up job configuration -AccumuloInputFormat.setMockInstance(job, "mockInstance"); -AccumuloOutputFormat.setMockInstance(job, "mockInstance"); - -=== Mini Accumulo Cluster - -While the Mock Accumulo provides a lightweight implementation of the client API for unit -testing, it is often necessary to write more realistic end-to-end integration tests that -take advantage of the entire ecosystem. The Mini Accumulo Cluster makes this possible by -configuring and starting Zookeeper, initializing Accumulo, and starting the Master as well -as some Tablet Servers. It runs against the local filesystem instead of having to start -up HDFS. - -To start it up, you will need to supply an empty directory and a root password as arguments: - -[source,java] -File tempDirectory = // JUnit and Guava supply mechanisms for creating temp directories -MiniAccumuloCluster accumulo = new MiniAccumuloCluster(tempDirectory, "password"); -accumulo.start(); - -Once we have our mini cluster running, we will want to interact with the Accumulo client API: - -[source,java] -Instance instance = new ZooKeeperInstance(accumulo.getInstanceName(), accumulo.getZooKeepers()); -Connector conn = instance.getConnector("root", new PasswordToken("password")); - -Upon completion of our development code, we will want to shutdown our MiniAccumuloCluster: - -[source,java] -accumulo.stop(); -// delete your temporary folder http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/high_speed_ingest.md ---------------------------------------------------------------------- diff --git a/docs/master/high_speed_ingest.md b/docs/master/high_speed_ingest.md deleted file mode 100644 index 2a7a702..0000000 --- a/docs/master/high_speed_ingest.md +++ /dev/null @@ -1,124 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -== High-Speed Ingest - -Accumulo is often used as part of a larger data processing and storage system. To -maximize the performance of a parallel system involving Accumulo, the ingestion -and query components should be designed to provide enough parallelism and -concurrency to avoid creating bottlenecks for users and other systems writing to -and reading from Accumulo. There are several ways to achieve high ingest -performance. - -=== Pre-Splitting New Tables - -New tables consist of a single tablet by default. As mutations are applied, the table -grows and splits into multiple tablets which are balanced by the Master across -TabletServers. This implies that the aggregate ingest rate will be limited to fewer -servers than are available within the cluster until the table has reached the point -where there are tablets on every TabletServer. - -Pre-splitting a table ensures that there are as many tablets as desired available -before ingest begins to take advantage of all the parallelism possible with the cluster -hardware. Tables can be split at any time by using the shell: - - user@myinstance mytable> addsplits -sf /local_splitfile -t mytable - -For the purposes of providing parallelism to ingest it is not necessary to create more -tablets than there are physical machines within the cluster as the aggregate ingest -rate is a function of the number of physical machines. Note that the aggregate ingest -rate is still subject to the number of machines running ingest clients, and the -distribution of rowIDs across the table. The aggregation ingest rate will be -suboptimal if there are many inserts into a small number of rowIDs. - -=== Multiple Ingester Clients - -Accumulo is capable of scaling to very high rates of ingest, which is dependent upon -not just the number of TabletServers in operation but also the number of ingest -clients. This is because a single client, while capable of batching mutations and -sending them to all TabletServers, is ultimately limited by the amount of data that -can be processed on a single machine. The aggregate ingest rate will scale linearly -with the number of clients up to the point at which either the aggregate I/O of -TabletServers or total network bandwidth capacity is reached. - -In operational settings where high rates of ingest are paramount, clusters are often -configured to dedicate some number of machines solely to running Ingester Clients. -The exact ratio of clients to TabletServers necessary for optimum ingestion rates -will vary according to the distribution of resources per machine and by data type. - -=== Bulk Ingest - -Accumulo supports the ability to import files produced by an external process such -as MapReduce into an existing table. In some cases it may be faster to load data this -way rather than via ingesting through clients using BatchWriters. This allows a large -number of machines to format data the way Accumulo expects. The new files can -then simply be introduced to Accumulo via a shell command. - -To configure MapReduce to format data in preparation for bulk loading, the job -should be set to use a range partitioner instead of the default hash partitioner. The -range partitioner uses the split points of the Accumulo table that will receive the -data. The split points can be obtained from the shell and used by the MapReduce -RangePartitioner. Note that this is only useful if the existing table is already split -into multiple tablets. - - user@myinstance mytable> getsplits - aa - ab - ac - ... - zx - zy - zz - -Run the MapReduce job, using the AccumuloFileOutputFormat to create the files to -be introduced to Accumulo. Once this is complete, the files can be added to -Accumulo via the shell: - - user@myinstance mytable> importdirectory /files_dir /failures - -Note that the paths referenced are directories within the same HDFS instance over -which Accumulo is running. Accumulo places any files that failed to be added to the -second directory specified. - -See the https://github.com/apache/accumulo-examples/blob/master/docs/bulkIngest.md[Bulk Ingest example] -for a complete example. - -=== Logical Time for Bulk Ingest - -Logical time is important for bulk imported data, for which the client code may -be choosing a timestamp. At bulk import time, the user can choose to enable -logical time for the set of files being imported. When its enabled, Accumulo -uses a specialized system iterator to lazily set times in a bulk imported file. -This mechanism guarantees that times set by unsynchronized multi-node -applications (such as those running on MapReduce) will maintain some semblance -of causal ordering. This mitigates the problem of the time being wrong on the -system that created the file for bulk import. These times are not set when the -file is imported, but whenever it is read by scans or compactions. At import, a -time is obtained and always used by the specialized system iterator to set that -time. - -The timestamp assigned by Accumulo will be the same for every key in the file. -This could cause problems if the file contains multiple keys that are identical -except for the timestamp. In this case, the sort order of the keys will be -undefined. This could occur if an insert and an update were in the same bulk -import file. - -=== MapReduce Ingest - -It is possible to efficiently write many mutations to Accumulo in parallel via a -MapReduce job. In this scenario the MapReduce is written to process data that lives -in HDFS and write mutations to Accumulo using the AccumuloOutputFormat. See -the MapReduce section under Analytics for details. The https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md[MapReduce example] -is also a good reference for example code. http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/implementation.md ---------------------------------------------------------------------- diff --git a/docs/master/implementation.md b/docs/master/implementation.md deleted file mode 100644 index 520f538..0000000 --- a/docs/master/implementation.md +++ /dev/null @@ -1,86 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -== Implementation Details - -=== Fault-Tolerant Executor (FATE) - -Accumulo must implement a number of distributed, multi-step operations to support -the client API. Creating a new table is a simple example of an atomic client call -which requires multiple steps in the implementation: get a unique table ID, configure -default table permissions, populate information in ZooKeeper to record the table's -existence, create directories in HDFS for the table's data, etc. Implementing these -steps in a way that is tolerant to node failure and other concurrent operations is -very difficult to achieve. Accumulo includes a Fault-Tolerant Executor (FATE) which -is widely used server-side to implement the client API safely and correctly. - -FATE is the implementation detail which ensures that tables in creation when the -Master dies will be successfully created when another Master process is started. -This alleviates the need for any external tools to correct some bad state -- Accumulo can -undo the failure and self-heal without any external intervention. - -=== Overview - -FATE consists of two primary components: a repeatable, persisted operation (REPO), a storage -layer for REPOs and an execution system to run REPOs. Accumulo uses ZooKeeper as the storage -layer for FATE and the Accumulo Master acts as the execution system to run REPOs. - -The important characteristic of REPOs are that they implemented in a way that is idempotent: -every operation must be able to undo or replay a partial execution of itself. Requiring the -implementation of the operation to support this functional greatly simplifies the execution -of these operations. This property is also what guarantees safety in light of failure conditions. - -=== Administration - -Sometimes, it is useful to inspect the current FATE operations, both pending and executing. -For example, a command that is not completing could be blocked on the execution of another -operation. Accumulo provides an Accumulo shell command to interact with fate. - -The +fate+ shell command accepts a number of arguments for different functionality: -+list+/+print+, +fail+, +delete+, +dump+. - -==== List/Print - -Without any additional arguments, this command will print all operations that still exist in -the FATE store (ZooKeeper). This will include active, pending, and completed operations (completed -operations are lazily removed from the store). Each operation includes a unique "transaction ID", the -state of the operation (e.g. +NEW+, +IN_PROGRESS+, +FAILED+), any locks the -transaction actively holds and any locks it is waiting to acquire. - -This option can also accept transaction IDs which will restrict the list of transactions shown. - -==== Fail - -This command can be used to manually fail a FATE transaction and requires a transaction ID -as an argument. Failing an operation is not a normal procedure and should only be performed -by an administrator who understands the implications of why they are failing the operation. - -==== Delete - -This command requires a transaction ID and will delete any locks that the transaction -holds. Like the fail command, this command should only be used in extreme circumstances -by an administrator that understands the implications of the command they are about to -invoke. It is not normal to invoke this command. - -==== Dump - -This command accepts zero more transaction IDs. If given no transaction IDs, -it will dump all active transactions. A FATE operations is compromised as a -sequence of REPOs. In order to start a FATE transaction, a REPO is pushed onto -a per transaction REPO stack. The top of the stack always contains the next -REPO the FATE transaction should execute. When a REPO is successful it may -return another REPO which is pushed on the stack. The +dump+ command will -print all of the REPOs on each transactions stack. The REPOs are serialized to -JSON in order to make them human readable. http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/introduction.md ---------------------------------------------------------------------- diff --git a/docs/master/introduction.md b/docs/master/introduction.md deleted file mode 100644 index 1b964b4..0000000 --- a/docs/master/introduction.md +++ /dev/null @@ -1,25 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -== Introduction -Apache Accumulo is a highly scalable structured store based on Google's BigTable. -Accumulo is written in Java and operates over the Hadoop Distributed File System -(HDFS), which is part of the popular Apache Hadoop project. Accumulo supports -efficient storage and retrieval of structured data, including queries for ranges, and -provides support for using Accumulo tables as input and output for MapReduce -jobs. - -Accumulo features automatic load-balancing and partitioning, data compression -and fine-grained security labels. http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/iterator_design.md ---------------------------------------------------------------------- diff --git a/docs/master/iterator_design.md b/docs/master/iterator_design.md deleted file mode 100644 index 4beaeb0..0000000 --- a/docs/master/iterator_design.md +++ /dev/null @@ -1,401 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -== Iterator Design - -Accumulo SortedKeyValueIterators, commonly referred to as Iterators for short, are server-side programming constructs -that allow users to implement custom retrieval or computational purpose within Accumulo TabletServers. The name rightly -brings forward similarities to the Java Iterator interface; however, Accumulo Iterators are more complex than Java -Iterators. Notably, in addition to the expected methods to retrieve the current element and advance to the next element -in the iteration, Accumulo Iterators must also support the ability to "move" (`seek`) to an specified point in the -iteration (the Accumulo table). Accumulo Iterators are designed to be concatenated together, similar to applying a -series of transformations to a list of elements. Accumulo Iterators can duplicate their underlying source to create -multiple "pointers" over the same underlying data (which is extremely powerful since each stream is sorted) or they can -merge multiple Iterators into a single view. In this sense, a collection of Iterators operating in tandem is close to -a tree-structure than a list, but there is always a sense of a flow of Key-Value pairs through some Iterators. Iterators -are not designed to act as triggers nor are they designed to operate outside of the purview of a single table. - -Understanding how TabletServers invoke the methods on a SortedKeyValueIterator can be obtuse as the actual code is -buried within the implementation of the TabletServer; however, it is generally unnecessary to have a strong -understanding of this as the interface provides clear definitions about what each action each method should take. This -chapter aims to provide a more detailed description of how Iterators are invoked, some best practices and some common -pitfalls. - -=== Instantiation - -To invoke an Accumulo Iterator inside of the TabletServer, the Iterator class must be on the classpath of every -TabletServer. For production environments, it is common to place a JAR file which contains the Iterator in -`lib/`. In development environments, it is convenient to instead place the JAR file in `lib/ext/` as JAR files -in this directory are dynamically reloaded by the TabletServers alleviating the need to restart Accumulo while -testing an Iterator. Advanced classloader features which enable other types of filesystems and per-table classpath -configurations (as opposed to process-wide classpaths). These features are not covered here, but elsewhere in the user -manual. - -Accumulo references the Iterator class by name and uses Java reflection to instantiate the Iterator. This means that -Iterators must have a public no-args constructor. - -=== Interface - -A normal implementation of the SortedKeyValueIterator defines functionality for the following methods: - -[source,java] ----- -void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException; - -boolean hasTop(); - -void next() throws IOException; - -void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException; - -Key getTopKey(); - -Value getTopValue(); - -SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env); ----- - -==== `init` - -The `init` method is called by the TabletServer after it constructs an instance of the Iterator. This method should -clear/reset any internal state in the Iterator and prepare it to process data. The first argument, the `source`, is the -Iterator "below" this Iterator (where the client is at "top" and the Iterator for files in HDFS are at the "bottom"). -The "source" Iterator provides the Key-Value pairs which this Iterator will operate upon. - -The second argument, a Map of options, is made up of options provided by the user, options set in the table's -configuration, and/or options set in the containing namespace's configuration. -These options allow for Iterators to dynamically configure themselves on the fly. If no options are used in the current context -(a Scan or Compaction), the Map will be empty. An example of a configuration item for an Iterator could be a pattern used to filter -Key-Value pairs in a regular expression Iterator. - -The third argument, the `IteratorEnvironment`, is a special object which provides information to this Iterator about the -context in which it was invoked. Commonly, this information is not necessary to inspect. For example, if an Iterator -knows that it is running in the context of a full-major compaction (reading all of the data) as opposed to a user scan -(which may strongly limit the number of columns), the Iterator might make different algorithmic decisions in an attempt to -optimize itself. - -==== `seek` - -The `seek` method is likely the most confusing method on the Iterator interface. The purpose of this method is to -advance the stream of Key-Value pairs to a certain point in the iteration (the Accumulo table). It is common that before -the implementation of this method returns some additional processing is performed which may further advance the current -position past the `startKey` of the `Range`. This, however, is dependent on the functionality the iterator provides. For -example, a filtering iterator would consume a number Key-Value pairs which do not meets its criteria before `seek` -returns. The important condition for `seek` to meet is that this Iterator should be ready to return the first Key-Value -pair, or none if no such pair is available, when the method returns. The Key-Value pair would be returned by `getTopKey` -and `getTopValue`, respectively, and `hasTop` should return a boolean denoting whether or not there is -a Key-Value pair to return. - -The arguments passed to seek are as follows: - -The TabletServer first provides a `Range`, an object which defines some collection of Accumulo `Key`s, which defines the -Key-Value pairs that this Iterator should return. Each `Range` has a `startKey` and `endKey` with an inclusive flag for -both. While this Range is often similar to the Range(s) set by the client on a Scanner or BatchScanner, it is not -guaranteed to be a Range that the client set. Accumulo will split up larger ranges and group them together based on -Tablet boundaries per TabletServer. Iterators should not attempt to implement any custom logic based on the Range(s) -provided to `seek` and Iterators should not return any Keys that fall outside of the provided Range. - -The second argument, a `Collection<ByteSequence>`, is the set of column families which should be retained or -excluded by this Iterator. The third argument, a boolean, defines whether the collection of column families -should be treated as an inclusion collection (true) or an exclusion collection (false). - -It is likely that all implementations of `seek` will first make a call to the `seek` method on the -"source" Iterator that was provided in the `init` method. The collection of column families and -the boolean `include` argument should be passed down as well as the `Range`. Somewhat commonly, the Iterator will -also implement some sort of additional logic to find or compute the first Key-Value pair in the provided -Range. For example, a regular expression Iterator would consume all records which do not match the given -pattern before returning from `seek`. - -It is important to retain the original Range passed to this method to know when this Iterator should stop -reading more Key-Value pairs. Ignoring this typically does not affect scans from a Scanner, but it -will result in duplicate keys emitting from a BatchScan if the scanned table has more than one tablet. -Best practice is to never emit entries outside the seek range. - -==== `next` - -The `next` method is analogous to the `next` method on a Java Iterator: this method should advance -the Iterator to the next Key-Value pair. For implementations that perform some filtering or complex -logic, this may result in more than one Key-Value pair being inspected. This method alters -some internal state that is exposed via the `hasTop`, `getTopKey`, and `getTopValue` methods. - -The result of this method is commonly caching a Key-Value pair which `getTopKey` and `getTopValue` -can later return. While there is another Key-Value pair to return, `hasTop` should return true. -If there are no more Key-Value pairs to return from this Iterator since the last call to -`seek`, `hasTop` should return false. - -==== `hasTop` - -The `hasTop` method is similar to the `hasNext` method on a Java Iterator in that it informs -the caller if there is a Key-Value pair to be returned. If there is no pair to return, this method -should return false. Like a Java Iterator, multiple calls to `hasTop` (without calling `next`) should not -alter the internal state of the Iterator. - -==== `getTopKey` and `getTopValue` - -These methods simply return the current Key-Value pair for this iterator. If `hasTop` returns true, -both of these methods should return non-null objects. If `hasTop` returns false, it is undefined -what these methods should return. Like `hasTop`, multiple calls to these methods should not alter -the state of the Iterator. - -Users should take caution when either - -1. caching the Key/Value from `getTopKey`/`getTopValue`, for use after calling `next` on the source iterator. -In this case, the cached Key/Value object is aliased to the reference returned by the source iterator. -Iterators may reuse the same Key/Value object in a `next` call for performance reasons, changing the data -that the cached Key/Value object references and resulting in a logic bug. -2. modifying the Key/Value from `getTopKey`/`getTopValue`. If the source iterator reuses data stored in the Key/Value, -then the source iterator may use the modified data that the Key/Value references. This may/may not result in a logic bug. - -In both cases, copying the Key/Value's data into a new object ensures iterator correctness. If neither case applies, -it is safe to not copy the Key/Value. The general guideline is to be aware of who else may use Key/Value objects -returned from `getTopKey`/`getTopValue`. - -==== `deepCopy` - -The `deepCopy` method is similar to the `clone` method from the Java `Cloneable` interface. -Implementations of this method should return a new object of the same type as the Accumulo Iterator -instance it was called on. Any internal state from the instance `deepCopy` was called -on should be carried over to the returned copy. The returned copy should be ready to have -`seek` called on it. The SortedKeyValueIterator interface guarantees that `init` will be called on -an iterator before `deepCopy` and that `init` will not be called on the iterator returned by -`deepCopy`. - -Typically, implementations of `deepCopy` call a copy-constructor which will initialize -internal data structures. As with `seek`, it is common for the `IteratorEnvironment` -argument to be ignored as most Iterator implementations can be written without the explicit -information the environment provides. - -In the analogy of a series of Iterators representing a tree, `deepCopy` can be thought of as -early programming assignments which implement their own tree data structures. `deepCopy` calls -copy on its sources (the children), copies itself, attaches the copies of the children, and -then returns itself. - -=== TabletServer invocation of Iterators - -The following code is a general outline for how TabletServers invoke Iterators. - -[source,java] ----- - List<KeyValue> batch; - Range range = getRangeFromClient(); - while(!overSizeLimit(batch)){ - SortedKeyValueIterator source = getSystemIterator(); - - for(String clzName : getUserIterators()){ - Class<?> clz = Class.forName(clzName); - SortedKeyValueIterator iter = (SortedKeyValueIterator) clz.newInstance(); - iter.init(source, opts, env); - source = iter; - } - - // read a batch of data to return to client - // the last iterator, the "top" - SortedKeyValueIterator topIter = source; - topIter.seek(getRangeFromUser(), ...) - - while(topIter.hasTop() && !overSizeLimit(batch)){ - key = topIter.getTopKey() - val = topIter.getTopValue() - batch.add(new KeyValue(key, val) - if(systemDataSourcesChanged()){ - // code does not show isolation case, which will - // keep using same data sources until a row boundry is hit - range = new Range(key, false, range.endKey(), range.endKeyInclusive()); - break; - } - } - } - //return batch of key values to client ----- - -Additionally, the obtuse "re-seek" case can be outlined as the following: - -[source,java] ----- - // Given the above - List<KeyValue> batch = getNextBatch(); - - // Store off lastKeyReturned for this client - lastKeyReturned = batch.get(batch.size() - 1).getKey(); - - // thread goes away (client stops asking for the next batch). - - // Eventually client comes back - // Setup as before... - - Range userRange = getRangeFromUser(); - Range actualRange = new Range(lastKeyReturned, false - userRange.getEndKey(), userRange.isEndKeyInclusive()); - - // Use the actualRange, not the user provided one - topIter.seek(actualRange); ----- - - -=== Isolation - -Accumulo provides a feature which clients can enable to prevent the viewing of partially -applied mutations within the context of rows. If a client is submitting multiple column -updates to rows at a time, isolation would ensure that a client would either see all of -updates made to that row or none of the updates (until they are all applied). - -When using Isolation, there are additional concerns in iterator design. A scan time iterator in accumulo -reads from a set of data sources. While an iterator is reading data it has an isolated view. However, after it returns a -key/value it is possible that accumulo may switch data sources and re-seek the iterator. This is done so that resources -may be reclaimed. When the user does not request isolation this can occur after any key is returned. When a user enables -Isolation, this will only occur after a new row is returned, in which case it will re-seek to the very beginning of the -next possible row. - -=== Abstract Iterators - -A number of Abstract implementations of Iterators are provided to allow for faster creation -of common patterns. The most commonly used abstract implementations are the `Filter` and -`Combiner` classes. When possible these classes should be used instead as they have been -thoroughly tested inside Accumulo itself. - -==== Filter - -The `Filter` abstract Iterator provides a very simple implementation which allows implementations -to define whether or not a Key-Value pair should be returned via an `accept(Key, Value)` method. - -Filters are extremely simple to implement; however, when the implementation is filtering a -large percentage of Key-Value pairs with respect to the total number of pairs examined, -it can be very inefficient. For example, if a Filter implementation can determine after examining -part of the row that no other pairs in this row will be accepted, there is no mechanism to -efficiently skip the remaining Key-Value pairs. Concretely, take a row which is comprised of -1000 Key-Value pairs. After examining the first 10 Key-Value pairs, it is determined -that no other Key-Value pairs in this row will be accepted. The Filter must still examine each -remaining 990 Key-Value pairs in this row. Another way to express this deficiency is that -Filters have no means to leverage the `seek` method to efficiently skip large portions -of Key-Value pairs. - -As such, the `Filter` class functions well for filtering small amounts of data, but is -inefficient for filtering large amounts of data. The decision to use a `Filter` strongly -depends on the use case and distribution of data being filtered. - -==== Combiner - -The `Combiner` class is another common abstract Iterator. Similar to the `Combiner` interface -define in Hadoop's MapReduce framework, implementations of this abstract class reduce -multiple Values for different versions of a Key (Keys which only differ by timestamps) into one Key-Value pair. -Combiners provide a simple way to implement common operations like summation and -aggregation without the need to implement the entire Accumulo Iterator interface. - -One important consideration when choosing to design a Combiner is that the "reduction" operation -is often best represented when it is associative and commutative. Operations which do not meet -these criteria can be implemented; however, the implementation can be difficult. - -A second consideration is that a Combiner is not guaranteed to see every Key-Value pair -which differ only by timestamp every time it is invoked. For example, if there are 5 Key-Value -pairs in a table which only differ by the timestamps 1, 2, 3, 4, and 5, it is not guaranteed that -every invocation of the Combiner will see 5 timestamps. One invocation might see the Values for -Keys with timestamp 1 and 4, while another invocation might see the Values for Keys with the -timestamps 1, 2, 4 and 5. - -Finally, when configuring an Accumulo table to use a Combiner, be sure to disable the Versioning Iterator or set the -Combiner at a priority less than the Combiner (the Versioning Iterator is added at a priority of 20 by default). The -Versioning Iterator will filter out multiple Key-Value pairs that differ only by timestamp and return only the Key-Value -pair that has the largest timestamp. - -=== Best practices - -Because of the flexibility that the `SortedKeyValueInterface` provides, it doesn't directly disallow -many implementations which are poor design decisions. The following are some common recommendations to -follow and pitfalls to avoid in Iterator implementations. - -==== Avoid special logic encoded in Ranges - -Commonly, granular Ranges that a client passes to an Iterator from a `Scanner` or `BatchScanner` are unmodified. -If a `Range` falls within the boundaries of a Tablet, an Iterator will often see that same Range in the -`seek` method. However, there is no guarantee that the `Range` will remain unaltered from client to server. As such, Iterators -should *never* make assumptions about the current state/context based on the `Range`. - -The common failure condition is referred to as a "re-seek". In the context of a Scan, TabletServers construct the -"stack" of Iterators and batch up Key-Value pairs to send back to the client. When a sufficient number of Key-Value -pairs are collected, it is common for the Iterators to be "torn down" until the client asks for the next batch of -Key-Value pairs. This is done by the TabletServer to add fairness in ensuring one Scan does not monopolize the available -resources. When the client asks for the next batch, the implementation modifies the original Range so that servers know -the point to resume the iteration (to avoid returning duplicate Key-Value pairs). Specifically, the new Range is created -from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan, -non-inclusive. - -==== `seek`'ing backwards - -The ability for an Iterator to "skip over" large blocks of Key-Value pairs is a major tenet behind Iterators. -By `seek`'ing when it is known that there is a collection of Key-Value pairs which can be ignored can -greatly increase the speed of a scan as many Key-Value pairs do not have to be deserialized and processed. - -While the `seek` method provides the `Range` that should be used to `seek` the underlying source Iterator, -there is no guarantee that the implementing Iterator uses that `Range` to perform the `seek` on its -"source" Iterator. As such, it is possible to seek to any `Range` and the interface has no assertions -to prevent this from happening. - -Since Iterators are allowed to `seek` to arbitrary Keys, it also allows Iterators to create infinite loops -inside Scans that will repeatedly read the same data without end. If an arbitrary Range is constructed, it should -construct a completely new Range as it allows for bugs to be introduced which will break Accumulo. - -Thus, `seek`'s should always be thought of as making "forward progress" in the view of the total iteration. The -`startKey` of a `Range` should always be greater than the current Key seen by the Iterator while the `endKey` of the -`Range` should always retain the original `endKey` (and `endKey` inclusivity) of the last `Range` seen by your -Iterator's implementation of seek. - -==== Take caution in constructing new data in an Iterator - -Implementations of Iterator might be tempted to open BatchWriters inside of an Iterator as a means -to implement triggers for writing additional data outside of their client application. The lifecycle of an Iterator -is *not* managed in such a way that guarantees that this is safe nor efficient. Specifically, there -is no way to guarantee that the internal ThreadPool inside of the BatchWriter is closed (and the thread(s) -are reaped) without calling the close() method. `close`'ing and recreating a `BatchWriter` after every -Key-Value pair is also prohibitively performance limiting to be considered an option. - -The only safe way to generate additional data in an Iterator is to alter the current Key-Value pair. -For example, the `WholeRowIterator` serializes the all of the Key-Values pairs that fall within each -row. A safe way to generate more data in an Iterator would be to construct an Iterator that is -"higher" (at a larger priority) than the `WholeRowIterator`, that is, the Iterator receives the Key-Value pairs which are -a serialization of many Key-Value pairs. The custom Iterator could deserialize the pairs, compute -some function, and add a new Key-Value pair to the original collection, re-serializing the collection -of Key-Value pairs back into a single Key-Value pair. - -Any other situation is likely not guaranteed to ensure that the caller (a Scan or a Compaction) will -always see all intended data that is generated. - -=== Final things to remember - -Some simple recommendations/points to keep in mind: - -==== Method call order - -On an instance of an Iterator: `init` is always called before `seek`, `seek` is always called before `hasTop`, -`getTopKey` and `getTopValue` will not be called if `hasTop` returns false. - -==== Teardown - -As mentioned, instance of Iterators may be torn down inside of the server transparently. When a complex -collection of iterators is performing some advanced functionality, they will not be torn down until a Key-Value -pair is returned out of the "stack" of Iterators (and added into the batch of Key-Values to be returned -to the caller). Being torn-down is equivalent to a new instance of the Iterator being creating and `deepCopy` -being called on the new instance with the old instance provided as the argument to `deepCopy`. References -to the old instance are removed and the object is lazily garbage collected by the JVM. - -=== Compaction-time Iterators - -When Iterators are configured to run during compactions, at the `minc` or `majc` scope, these Iterators sometimes need -to make different assertions than those who only operate at scan time. Iterators won't see the delete entries; however, -Iterators will not necessarily see all of the Key-Value pairs in ever invocation. Because compactions often do not rewrite -all files (only a subset of them), it is possible that the logic take this into consideration. - -For example, a Combiner that runs over data at during compactions, might not see all of the values for a given Key. The -Combiner must recognize this and not perform any function that would be incorrect due -to the missing values. http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/iterator_test_harness.md ---------------------------------------------------------------------- diff --git a/docs/master/iterator_test_harness.md b/docs/master/iterator_test_harness.md deleted file mode 100644 index 91ae53a..0000000 --- a/docs/master/iterator_test_harness.md +++ /dev/null @@ -1,110 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -== Iterator Testing - -Iterators, while extremely powerful, are notoriously difficult to test. While the API defines -the methods an Iterator must implement and each method's functionality, the actual invocation -of these methods by Accumulo TabletServers can be surprisingly difficult to mimic in unit tests. - -The Apache Accumulo "Iterator Test Harness" is designed to provide a generalized testing framework -for all Accumulo Iterators to leverage to identify common pitfalls in user-created Iterators. - -=== Framework Use - -The harness provides an abstract class for use with JUnit4. Users must define the following for this -abstract class: - - * A `SortedMap` of input data (`Key`-`Value` pairs) - * A `Range` to use in tests - * A `Map` of options (`String` to `String` pairs) - * A `SortedMap` of output data (`Key`-`Value` pairs) - * A list of `IteratorTestCase`s (these can be automatically discovered) - -The majority of effort a user must make is in creating the input dataset and the expected -output dataset for the iterator being tested. - -=== Normal Test Outline - -Most iterator tests will follow the given outline: - -[source,java] ----- -import java.util.List; -import java.util.SortedMap; - -import org.apache.accumulo.core.data.Key; -import org.apache.accumulo.core.data.Range; -import org.apache.accumulo.core.data.Value; -import org.apache.accumulo.iteratortest.IteratorTestCaseFinder; -import org.apache.accumulo.iteratortest.IteratorTestInput; -import org.apache.accumulo.iteratortest.IteratorTestOutput; -import org.apache.accumulo.iteratortest.junit4.BaseJUnit4IteratorTest; -import org.apache.accumulo.iteratortest.testcases.IteratorTestCase; -import org.junit.runners.Parameterized.Parameters; - -public class MyIteratorTest extends BaseJUnit4IteratorTest { - - @Parameters - public static Object[][] parameters() { - final IteratorTestInput input = createIteratorInput(); - final IteratorTestOutput output = createIteratorOutput(); - final List<IteratorTestCase> testCases = IteratorTestCaseFinder.findAllTestCases(); - return BaseJUnit4IteratorTest.createParameters(input, output, tests); - } - - private static SortedMap<Key,Value> INPUT_DATA = createInputData(); - private static SortedMap<Key,Value> OUTPUT_DATA = createOutputData(); - - private static SortedMap<Key,Value> createInputData() { - // TODO -- implement this method - } - - private static SortedMap<Key,Value> createOutputData() { - // TODO -- implement this method - } - - private static IteratorTestInput createIteratorInput() { - final Map<String,String> options = createIteratorOptions(); - final Range range = createRange(); - return new IteratorTestInput(MyIterator.class, options, range, INPUT_DATA); - } - - private static Map<String,String> createIteratorOptions() { - // TODO -- implement this method - // Tip: Use INPUT_DATA if helpful in generating output - } - - private static Range createRange() { - // TODO -- implement this method - } - - private static IteratorTestOutput createIteratorOutput() { - return new IteratorTestOutput(OUTPUT_DATA); - } - -} ----- - -=== Limitations - -While the provided `IteratorTestCase`s should exercise common edge-cases in user iterators, -there are still many limitations to the existing test harness. Some of them are: - - * Can only specify a single iterator, not many (a "stack") - * No control over provided IteratorEnvironment for tests - * Exercising delete keys (especially with major compactions that do not include all files) - -These are left as future improvements to the harness.