http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/design.md ---------------------------------------------------------------------- diff --git a/docs/master/design.md b/docs/master/design.md new file mode 100644 index 0000000..6c77cb6 --- /dev/null +++ b/docs/master/design.md @@ -0,0 +1,180 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +== Accumulo Design + +=== Data Model + +Accumulo provides a richer data model than simple key-value stores, but is not a +fully relational database. Data is represented as key-value pairs, where the key and +value are comprised of the following elements: + +[width="75%",cols="^,^,^,^,^,^"] +|=========================================================================== + 5+|Key .3+^.^|Value +.2+^.^|Row ID 3+|Column .2+^.^|Timestamp + |Family |Qualifier |Visibility +|=========================================================================== + +All elements of the Key and the Value are represented as byte arrays except for +Timestamp, which is a Long. Accumulo sorts keys by element and lexicographically +in ascending order. Timestamps are sorted in descending order so that later +versions of the same Key appear first in a sequential scan. Tables consist of a set of +sorted key-value pairs. + +=== Architecture + +Accumulo is a distributed data storage and retrieval system and as such consists of +several architectural components, some of which run on many individual servers. +Much of the work Accumulo does involves maintaining certain properties of the +data, such as organization, availability, and integrity, across many commodity-class +machines. + +=== Components + +An instance of Accumulo includes many TabletServers, one Garbage Collector process, +one Master server and many Clients. + +==== Tablet Server + +The TabletServer manages some subset of all the tablets (partitions of tables). This includes receiving writes from clients, persisting writes to a +write-ahead log, sorting new key-value pairs in memory, periodically +flushing sorted key-value pairs to new files in HDFS, and responding +to reads from clients, forming a merge-sorted view of all keys and +values from all the files it has created and the sorted in-memory +store. + +TabletServers also perform recovery of a tablet +that was previously on a server that failed, reapplying any writes +found in the write-ahead log to the tablet. + +==== Garbage Collector + +Accumulo processes will share files stored in HDFS. Periodically, the Garbage +Collector will identify files that are no longer needed by any process, and +delete them. Multiple garbage collectors can be run to provide hot-standby support. +They will perform leader election among themselves to choose a single active instance. 
+ +==== Master + +The Accumulo Master is responsible for detecting and responding to TabletServer +failure. It tries to balance the load across TabletServer by assigning tablets carefully +and instructing TabletServers to unload tablets when necessary. The Master ensures all +tablets are assigned to one TabletServer each, and handles table creation, alteration, +and deletion requests from clients. The Master also coordinates startup, graceful +shutdown and recovery of changes in write-ahead logs when Tablet servers fail. + +Multiple masters may be run. The masters will choose among themselves a single master, +and the others will become backups if the master should fail. + +==== Tracer + +The Accumulo Tracer process supports the distributed timing API provided by Accumulo. +One to many of these processes can be run on a cluster which will write the timing +information to a given Accumulo table for future reference. Seeing the section on +Tracing for more information on this support. + +==== Monitor + +The Accumulo Monitor is a web application that provides a wealth of information about +the state of an instance. The Monitor shows graphs and tables which contain information +about read/write rates, cache hit/miss rates, and Accumulo table information such as scan +rate and active/queued compactions. Additionally, the Monitor should always be the first +point of entry when attempting to debug an Accumulo problem as it will show high-level problems +in addition to aggregated errors from all nodes in the cluster. See the section on <<monitoring>> +for more information. + +Multiple Monitors can be run to provide hot-standby support in the face of failure. Due to the +forwarding of logs from remote hosts to the Monitor, only one Monitor process should be active +at one time. Leader election will be performed internally to choose the active Monitor. + +==== Client + +Accumulo includes a client library that is linked to every application. The client +library contains logic for finding servers managing a particular tablet, and +communicating with TabletServers to write and retrieve key-value pairs. + +=== Data Management + +Accumulo stores data in tables, which are partitioned into tablets. Tablets are +partitioned on row boundaries so that all of the columns and values for a particular +row are found together within the same tablet. The Master assigns Tablets to one +TabletServer at a time. This enables row-level transactions to take place without +using distributed locking or some other complicated synchronization mechanism. As +clients insert and query data, and as machines are added and removed from the +cluster, the Master migrates tablets to ensure they remain available and that the +ingest and query load is balanced across the cluster. + +image::data_distribution.png[width=500] + +=== Tablet Service + + +When a write arrives at a TabletServer it is written to a Write-Ahead Log and +then inserted into a sorted data structure in memory called a MemTable. When the +MemTable reaches a certain size, the TabletServer writes out the sorted +key-value pairs to a file in HDFS called a Relative Key File (RFile), which is a +kind of Indexed Sequential Access Method (ISAM) file. This process is called a +minor compaction. A new MemTable is then created and the fact of the compaction +is recorded in the Write-Ahead Log. 
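The write path described above begins with a client submitting mutations through the client library. Purely as a point of orientation, a minimal client-side write might look like the following sketch; the table name and cell contents are illustrative, and obtaining the +conn+ Connector is assumed to have been done as shown in the development clients documentation.

[source,java]
----
// Assumes "mytable" exists and conn is a previously obtained Connector.
BatchWriter writer = conn.createBatchWriter("mytable", new BatchWriterConfig());

Mutation m = new Mutation("row1");   // row ID
// column family, qualifier, visibility, and value -- the elements of the data model
m.put("cf", "cq", new ColumnVisibility("public"), new Value("value".getBytes()));

writer.addMutation(m);   // buffered and sent to the TabletServer hosting row1's tablet
writer.close();          // flushes any remaining buffered mutations
----

Each mutation is routed to the TabletServer currently hosting the tablet that contains its row, where it follows the write-ahead log, MemTable, and minor compaction path described above.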
+ +When a request to read data arrives at a TabletServer, the TabletServer does a +binary search across the MemTable as well as the in-memory indexes associated +with each RFile to find the relevant values. If clients are performing a scan, +several key-value pairs are returned to the client in order from the MemTable +and the set of RFiles by performing a merge-sort as they are read. + +=== Compactions + +In order to manage the number of files per tablet, periodically the TabletServer +performs Major Compactions of files within a tablet, in which some set of RFiles +are combined into one file. The previous files will eventually be removed by the +Garbage Collector. This also provides an opportunity to permanently remove +deleted key-value pairs by omitting key-value pairs suppressed by a delete entry +when the new file is created. + +=== Splitting + +When a table is created it has one tablet. As the table grows its initial +tablet eventually splits into two tablets. Its likely that one of these +tablets will migrate to another tablet server. As the table continues to grow, +its tablets will continue to split and be migrated. The decision to +automatically split a tablet is based on the size of a tablets files. The +size threshold at which a tablet splits is configurable per table. In addition +to automatic splitting, a user can manually add split points to a table to +create new tablets. Manually splitting a new table can parallelize reads and +writes giving better initial performance without waiting for automatic +splitting. + +As data is deleted from a table, tablets may shrink. Over time this can lead +to small or empty tablets. To deal with this, merging of tablets was +introduced in Accumulo 1.4. This is discussed in more detail later. + +=== Fault-Tolerance + +If a TabletServer fails, the Master detects it and automatically reassigns the tablets +assigned from the failed server to other servers. Any key-value pairs that were in +memory at the time the TabletServer fails are automatically reapplied from the Write-Ahead +Log(WAL) to prevent any loss of data. + +Tablet servers write their WALs directly to HDFS so the logs are available to all tablet +servers for recovery. To make the recovery process efficient, the updates within a log are +grouped by tablet. TabletServers can quickly apply the mutations from the sorted logs +that are destined for the tablets they have now been assigned. + +TabletServer failures are noted on the Master's monitor page, accessible via ++http://master-address:9995/monitor+. + +image::failure_handling.png[width=500]
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/development_clients.md ---------------------------------------------------------------------- diff --git a/docs/master/development_clients.md b/docs/master/development_clients.md new file mode 100644 index 0000000..18821e3 --- /dev/null +++ b/docs/master/development_clients.md @@ -0,0 +1,107 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +== Development Clients + +Normally, Accumulo consists of lots of moving parts. Even a stand-alone version of +Accumulo requires Hadoop, Zookeeper, the Accumulo master, a tablet server, etc. If +you want to write a unit test that uses Accumulo, you need a lot of infrastructure +in place before your test can run. + +=== Mock Accumulo + +Mock Accumulo supplies mock implementations for much of the client API. It presently +does not enforce users, logins, permissions, etc. It does support Iterators and Combiners. +Note that MockAccumulo holds all data in memory, and will not retain any data or +settings between runs. + +While normal interaction with the Accumulo client looks like this: + +[source,java] +Instance instance = new ZooKeeperInstance(...); +Connector conn = instance.getConnector(user, passwordToken); + +To interact with the MockAccumulo, just replace the ZooKeeperInstance with MockInstance: + +[source,java] +Instance instance = new MockInstance(); + +In fact, you can use the +--fake+ option to the Accumulo shell and interact with +MockAccumulo: + +---- +$ accumulo shell --fake -u root -p '' + +Shell - Apache Accumulo Interactive Shell +- +- version: 2.x.x +- instance name: fake +- instance id: mock-instance-id +- +- type 'help' for a list of available commands +- + +root@fake> createtable test + +root@fake test> insert row1 cf cq value +root@fake test> insert row2 cf cq value2 +root@fake test> insert row3 cf cq value3 + +root@fake test> scan +row1 cf:cq [] value +row2 cf:cq [] value2 +row3 cf:cq [] value3 + +root@fake test> scan -b row2 -e row2 +row2 cf:cq [] value2 + +root@fake test> +---- + +When testing Map Reduce jobs, you can also set the Mock Accumulo on the AccumuloInputFormat +and AccumuloOutputFormat classes: + +[source,java] +// ... set up job configuration +AccumuloInputFormat.setMockInstance(job, "mockInstance"); +AccumuloOutputFormat.setMockInstance(job, "mockInstance"); + +=== Mini Accumulo Cluster + +While the Mock Accumulo provides a lightweight implementation of the client API for unit +testing, it is often necessary to write more realistic end-to-end integration tests that +take advantage of the entire ecosystem. The Mini Accumulo Cluster makes this possible by +configuring and starting Zookeeper, initializing Accumulo, and starting the Master as well +as some Tablet Servers. 
It runs against the local filesystem instead of having to start +up HDFS. + +To start it up, you will need to supply an empty directory and a root password as arguments: + +[source,java] +File tempDirectory = // JUnit and Guava supply mechanisms for creating temp directories +MiniAccumuloCluster accumulo = new MiniAccumuloCluster(tempDirectory, "password"); +accumulo.start(); + +Once we have our mini cluster running, we will want to interact with the Accumulo client API: + +[source,java] +Instance instance = new ZooKeeperInstance(accumulo.getInstanceName(), accumulo.getZooKeepers()); +Connector conn = instance.getConnector("root", new PasswordToken("password")); + +Upon completion of our development code, we will want to shutdown our MiniAccumuloCluster: + +[source,java] +accumulo.stop(); +// delete your temporary folder http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/high_speed_ingest.md ---------------------------------------------------------------------- diff --git a/docs/master/high_speed_ingest.md b/docs/master/high_speed_ingest.md new file mode 100644 index 0000000..2a7a702 --- /dev/null +++ b/docs/master/high_speed_ingest.md @@ -0,0 +1,124 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +== High-Speed Ingest + +Accumulo is often used as part of a larger data processing and storage system. To +maximize the performance of a parallel system involving Accumulo, the ingestion +and query components should be designed to provide enough parallelism and +concurrency to avoid creating bottlenecks for users and other systems writing to +and reading from Accumulo. There are several ways to achieve high ingest +performance. + +=== Pre-Splitting New Tables + +New tables consist of a single tablet by default. As mutations are applied, the table +grows and splits into multiple tablets which are balanced by the Master across +TabletServers. This implies that the aggregate ingest rate will be limited to fewer +servers than are available within the cluster until the table has reached the point +where there are tablets on every TabletServer. + +Pre-splitting a table ensures that there are as many tablets as desired available +before ingest begins to take advantage of all the parallelism possible with the cluster +hardware. Tables can be split at any time by using the shell: + + user@myinstance mytable> addsplits -sf /local_splitfile -t mytable + +For the purposes of providing parallelism to ingest it is not necessary to create more +tablets than there are physical machines within the cluster as the aggregate ingest +rate is a function of the number of physical machines. Note that the aggregate ingest +rate is still subject to the number of machines running ingest clients, and the +distribution of rowIDs across the table. 
The aggregation ingest rate will be +suboptimal if there are many inserts into a small number of rowIDs. + +=== Multiple Ingester Clients + +Accumulo is capable of scaling to very high rates of ingest, which is dependent upon +not just the number of TabletServers in operation but also the number of ingest +clients. This is because a single client, while capable of batching mutations and +sending them to all TabletServers, is ultimately limited by the amount of data that +can be processed on a single machine. The aggregate ingest rate will scale linearly +with the number of clients up to the point at which either the aggregate I/O of +TabletServers or total network bandwidth capacity is reached. + +In operational settings where high rates of ingest are paramount, clusters are often +configured to dedicate some number of machines solely to running Ingester Clients. +The exact ratio of clients to TabletServers necessary for optimum ingestion rates +will vary according to the distribution of resources per machine and by data type. + +=== Bulk Ingest + +Accumulo supports the ability to import files produced by an external process such +as MapReduce into an existing table. In some cases it may be faster to load data this +way rather than via ingesting through clients using BatchWriters. This allows a large +number of machines to format data the way Accumulo expects. The new files can +then simply be introduced to Accumulo via a shell command. + +To configure MapReduce to format data in preparation for bulk loading, the job +should be set to use a range partitioner instead of the default hash partitioner. The +range partitioner uses the split points of the Accumulo table that will receive the +data. The split points can be obtained from the shell and used by the MapReduce +RangePartitioner. Note that this is only useful if the existing table is already split +into multiple tablets. + + user@myinstance mytable> getsplits + aa + ab + ac + ... + zx + zy + zz + +Run the MapReduce job, using the AccumuloFileOutputFormat to create the files to +be introduced to Accumulo. Once this is complete, the files can be added to +Accumulo via the shell: + + user@myinstance mytable> importdirectory /files_dir /failures + +Note that the paths referenced are directories within the same HDFS instance over +which Accumulo is running. Accumulo places any files that failed to be added to the +second directory specified. + +See the https://github.com/apache/accumulo-examples/blob/master/docs/bulkIngest.md[Bulk Ingest example] +for a complete example. + +=== Logical Time for Bulk Ingest + +Logical time is important for bulk imported data, for which the client code may +be choosing a timestamp. At bulk import time, the user can choose to enable +logical time for the set of files being imported. When its enabled, Accumulo +uses a specialized system iterator to lazily set times in a bulk imported file. +This mechanism guarantees that times set by unsynchronized multi-node +applications (such as those running on MapReduce) will maintain some semblance +of causal ordering. This mitigates the problem of the time being wrong on the +system that created the file for bulk import. These times are not set when the +file is imported, but whenever it is read by scans or compactions. At import, a +time is obtained and always used by the specialized system iterator to set that +time. + +The timestamp assigned by Accumulo will be the same for every key in the file. 
+This could cause problems if the file contains multiple keys that are identical
+except for the timestamp. In this case, the sort order of the keys will be
+undefined. This could occur if an insert and an update were in the same bulk
+import file.
+
+=== MapReduce Ingest
+
+It is possible to efficiently write many mutations to Accumulo in parallel via a
+MapReduce job. In this scenario the MapReduce job is written to process data that lives
+in HDFS and write mutations to Accumulo using the AccumuloOutputFormat. See
+the MapReduce section under Analytics for details. The https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md[MapReduce example]
+is also a good reference for example code.

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/implementation.md
----------------------------------------------------------------------
diff --git a/docs/master/implementation.md b/docs/master/implementation.md
new file mode 100644
index 0000000..520f538
--- /dev/null
+++ b/docs/master/implementation.md
@@ -0,0 +1,86 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+== Implementation Details
+
+=== Fault-Tolerant Executor (FATE)
+
+Accumulo must implement a number of distributed, multi-step operations to support
+the client API. Creating a new table is a simple example of an atomic client call
+which requires multiple steps in the implementation: get a unique table ID, configure
+default table permissions, populate information in ZooKeeper to record the table's
+existence, create directories in HDFS for the table's data, etc. Implementing these
+steps in a way that is tolerant to node failure and other concurrent operations is
+very difficult to achieve. Accumulo includes a Fault-Tolerant Executor (FATE) which
+is widely used server-side to implement the client API safely and correctly.
+
+FATE is the implementation detail which ensures that tables in creation when the
+Master dies will be successfully created when another Master process is started.
+This alleviates the need for any external tools to correct some bad state -- Accumulo can
+undo the failure and self-heal without any external intervention.
+
+=== Overview
+
+FATE consists of three primary components: a repeatable, persisted operation (REPO), a storage
+layer for REPOs and an execution system to run REPOs. Accumulo uses ZooKeeper as the storage
+layer for FATE and the Accumulo Master acts as the execution system to run REPOs.
+
+The important characteristic of REPOs is that they are implemented in a way that is idempotent:
+every operation must be able to undo or replay a partial execution of itself. Requiring the
+implementation of the operation to support this functionality greatly simplifies the execution
+of these operations.
This property is also what guarantees safety in light of failure conditions.
+
+=== Administration
+
+Sometimes, it is useful to inspect the current FATE operations, both pending and executing.
+For example, a command that is not completing could be blocked on the execution of another
+operation. Accumulo provides a shell command to interact with FATE.
+
+The +fate+ shell command accepts a number of arguments for different functionality:
++list+/+print+, +fail+, +delete+, +dump+.
+
+==== List/Print
+
+Without any additional arguments, this command will print all operations that still exist in
+the FATE store (ZooKeeper). This will include active, pending, and completed operations (completed
+operations are lazily removed from the store). Each operation includes a unique "transaction ID", the
+state of the operation (e.g. +NEW+, +IN_PROGRESS+, +FAILED+), any locks the
+transaction actively holds and any locks it is waiting to acquire.
+
+This option can also accept transaction IDs which will restrict the list of transactions shown.
+
+==== Fail
+
+This command can be used to manually fail a FATE transaction and requires a transaction ID
+as an argument. Failing an operation is not a normal procedure and should only be performed
+by an administrator who understands the implications of why they are failing the operation.
+
+==== Delete
+
+This command requires a transaction ID and will delete any locks that the transaction
+holds. Like the fail command, this command should only be used in extreme circumstances
+by an administrator who understands the implications of the command they are about to
+invoke. It is not normal to invoke this command.
+
+==== Dump
+
+This command accepts zero or more transaction IDs. If given no transaction IDs,
+it will dump all active transactions. A FATE operation is composed of a
+sequence of REPOs. In order to start a FATE transaction, a REPO is pushed onto
+a per-transaction REPO stack. The top of the stack always contains the next
+REPO the FATE transaction should execute. When a REPO is successful it may
+return another REPO which is pushed on the stack. The +dump+ command will
+print all of the REPOs on each transaction's stack. The REPOs are serialized to
+JSON in order to make them human readable.

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/introduction.md
----------------------------------------------------------------------
diff --git a/docs/master/introduction.md b/docs/master/introduction.md
new file mode 100644
index 0000000..1b964b4
--- /dev/null
+++ b/docs/master/introduction.md
@@ -0,0 +1,25 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+== Introduction
+Apache Accumulo is a highly scalable structured store based on Google's BigTable.
+Accumulo is written in Java and operates over the Hadoop Distributed File System +(HDFS), which is part of the popular Apache Hadoop project. Accumulo supports +efficient storage and retrieval of structured data, including queries for ranges, and +provides support for using Accumulo tables as input and output for MapReduce +jobs. + +Accumulo features automatic load-balancing and partitioning, data compression +and fine-grained security labels. http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/iterator_design.md ---------------------------------------------------------------------- diff --git a/docs/master/iterator_design.md b/docs/master/iterator_design.md new file mode 100644 index 0000000..4beaeb0 --- /dev/null +++ b/docs/master/iterator_design.md @@ -0,0 +1,401 @@ +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +== Iterator Design + +Accumulo SortedKeyValueIterators, commonly referred to as Iterators for short, are server-side programming constructs +that allow users to implement custom retrieval or computational purpose within Accumulo TabletServers. The name rightly +brings forward similarities to the Java Iterator interface; however, Accumulo Iterators are more complex than Java +Iterators. Notably, in addition to the expected methods to retrieve the current element and advance to the next element +in the iteration, Accumulo Iterators must also support the ability to "move" (`seek`) to an specified point in the +iteration (the Accumulo table). Accumulo Iterators are designed to be concatenated together, similar to applying a +series of transformations to a list of elements. Accumulo Iterators can duplicate their underlying source to create +multiple "pointers" over the same underlying data (which is extremely powerful since each stream is sorted) or they can +merge multiple Iterators into a single view. In this sense, a collection of Iterators operating in tandem is close to +a tree-structure than a list, but there is always a sense of a flow of Key-Value pairs through some Iterators. Iterators +are not designed to act as triggers nor are they designed to operate outside of the purview of a single table. + +Understanding how TabletServers invoke the methods on a SortedKeyValueIterator can be obtuse as the actual code is +buried within the implementation of the TabletServer; however, it is generally unnecessary to have a strong +understanding of this as the interface provides clear definitions about what each action each method should take. This +chapter aims to provide a more detailed description of how Iterators are invoked, some best practices and some common +pitfalls. 
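Before describing how the server side works, it is worth seeing how a client names an Iterator in the first place. The following is a hedged sketch of attaching a custom Iterator to a single scan using an +IteratorSetting+; the priority, name, option key, and the `com.example.MyFilteringIterator` class are all illustrative, and the same setting could instead be attached to a table configuration so that the Iterator also runs during compactions.

[source,java]
----
// Attach a custom Iterator to one scan. The named class must be available on the
// TabletServer's classpath (see the Instantiation section below).
Scanner scanner = conn.createScanner("mytable", Authorizations.EMPTY);

IteratorSetting setting = new IteratorSetting(30, "myFilter", "com.example.MyFilteringIterator");
setting.addOption("pattern", "foo.*");   // delivered to the Iterator's init() in the options Map

scanner.addScanIterator(setting);
for (Map.Entry<Key,Value> entry : scanner) {
  // only the Key-Value pairs the Iterator emits are returned to the client
}
----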
+ +=== Instantiation + +To invoke an Accumulo Iterator inside of the TabletServer, the Iterator class must be on the classpath of every +TabletServer. For production environments, it is common to place a JAR file which contains the Iterator in +`lib/`. In development environments, it is convenient to instead place the JAR file in `lib/ext/` as JAR files +in this directory are dynamically reloaded by the TabletServers alleviating the need to restart Accumulo while +testing an Iterator. Advanced classloader features which enable other types of filesystems and per-table classpath +configurations (as opposed to process-wide classpaths). These features are not covered here, but elsewhere in the user +manual. + +Accumulo references the Iterator class by name and uses Java reflection to instantiate the Iterator. This means that +Iterators must have a public no-args constructor. + +=== Interface + +A normal implementation of the SortedKeyValueIterator defines functionality for the following methods: + +[source,java] +---- +void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException; + +boolean hasTop(); + +void next() throws IOException; + +void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException; + +Key getTopKey(); + +Value getTopValue(); + +SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env); +---- + +==== `init` + +The `init` method is called by the TabletServer after it constructs an instance of the Iterator. This method should +clear/reset any internal state in the Iterator and prepare it to process data. The first argument, the `source`, is the +Iterator "below" this Iterator (where the client is at "top" and the Iterator for files in HDFS are at the "bottom"). +The "source" Iterator provides the Key-Value pairs which this Iterator will operate upon. + +The second argument, a Map of options, is made up of options provided by the user, options set in the table's +configuration, and/or options set in the containing namespace's configuration. +These options allow for Iterators to dynamically configure themselves on the fly. If no options are used in the current context +(a Scan or Compaction), the Map will be empty. An example of a configuration item for an Iterator could be a pattern used to filter +Key-Value pairs in a regular expression Iterator. + +The third argument, the `IteratorEnvironment`, is a special object which provides information to this Iterator about the +context in which it was invoked. Commonly, this information is not necessary to inspect. For example, if an Iterator +knows that it is running in the context of a full-major compaction (reading all of the data) as opposed to a user scan +(which may strongly limit the number of columns), the Iterator might make different algorithmic decisions in an attempt to +optimize itself. + +==== `seek` + +The `seek` method is likely the most confusing method on the Iterator interface. The purpose of this method is to +advance the stream of Key-Value pairs to a certain point in the iteration (the Accumulo table). It is common that before +the implementation of this method returns some additional processing is performed which may further advance the current +position past the `startKey` of the `Range`. This, however, is dependent on the functionality the iterator provides. For +example, a filtering iterator would consume a number Key-Value pairs which do not meets its criteria before `seek` +returns. 
The important condition for `seek` to meet is that this Iterator should be ready to return the first Key-Value +pair, or none if no such pair is available, when the method returns. The Key-Value pair would be returned by `getTopKey` +and `getTopValue`, respectively, and `hasTop` should return a boolean denoting whether or not there is +a Key-Value pair to return. + +The arguments passed to seek are as follows: + +The TabletServer first provides a `Range`, an object which defines some collection of Accumulo `Key`s, which defines the +Key-Value pairs that this Iterator should return. Each `Range` has a `startKey` and `endKey` with an inclusive flag for +both. While this Range is often similar to the Range(s) set by the client on a Scanner or BatchScanner, it is not +guaranteed to be a Range that the client set. Accumulo will split up larger ranges and group them together based on +Tablet boundaries per TabletServer. Iterators should not attempt to implement any custom logic based on the Range(s) +provided to `seek` and Iterators should not return any Keys that fall outside of the provided Range. + +The second argument, a `Collection<ByteSequence>`, is the set of column families which should be retained or +excluded by this Iterator. The third argument, a boolean, defines whether the collection of column families +should be treated as an inclusion collection (true) or an exclusion collection (false). + +It is likely that all implementations of `seek` will first make a call to the `seek` method on the +"source" Iterator that was provided in the `init` method. The collection of column families and +the boolean `include` argument should be passed down as well as the `Range`. Somewhat commonly, the Iterator will +also implement some sort of additional logic to find or compute the first Key-Value pair in the provided +Range. For example, a regular expression Iterator would consume all records which do not match the given +pattern before returning from `seek`. + +It is important to retain the original Range passed to this method to know when this Iterator should stop +reading more Key-Value pairs. Ignoring this typically does not affect scans from a Scanner, but it +will result in duplicate keys emitting from a BatchScan if the scanned table has more than one tablet. +Best practice is to never emit entries outside the seek range. + +==== `next` + +The `next` method is analogous to the `next` method on a Java Iterator: this method should advance +the Iterator to the next Key-Value pair. For implementations that perform some filtering or complex +logic, this may result in more than one Key-Value pair being inspected. This method alters +some internal state that is exposed via the `hasTop`, `getTopKey`, and `getTopValue` methods. + +The result of this method is commonly caching a Key-Value pair which `getTopKey` and `getTopValue` +can later return. While there is another Key-Value pair to return, `hasTop` should return true. +If there are no more Key-Value pairs to return from this Iterator since the last call to +`seek`, `hasTop` should return false. + +==== `hasTop` + +The `hasTop` method is similar to the `hasNext` method on a Java Iterator in that it informs +the caller if there is a Key-Value pair to be returned. If there is no pair to return, this method +should return false. Like a Java Iterator, multiple calls to `hasTop` (without calling `next`) should not +alter the internal state of the Iterator. 
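Taken together, `seek`, `next`, and `hasTop` usually follow a single pattern in filtering-style Iterators: position the source, then advance until the next acceptable pair is found. The following is a simplified sketch of that pattern only; the `accept` and `findTop` names are illustrative, and the abstract `Filter` class described later in this chapter implements essentially this logic.

[source,java]
----
// Simplified skeleton of the seek/next/hasTop pattern for a filtering Iterator.
// "source" is the Iterator provided to init(); accept() is this Iterator's own test.
@Override
public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException {
  source.seek(range, columnFamilies, inclusive);  // position the underlying stream
  findTop();                                      // skip pairs this Iterator rejects
}

@Override
public void next() throws IOException {
  source.next();   // move past the pair that was just returned
  findTop();
}

private void findTop() throws IOException {
  while (source.hasTop() && !accept(source.getTopKey(), source.getTopValue())) {
    source.next();
  }
}

@Override
public boolean hasTop() {
  return source.hasTop();
}
----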
+ +==== `getTopKey` and `getTopValue` + +These methods simply return the current Key-Value pair for this iterator. If `hasTop` returns true, +both of these methods should return non-null objects. If `hasTop` returns false, it is undefined +what these methods should return. Like `hasTop`, multiple calls to these methods should not alter +the state of the Iterator. + +Users should take caution when either + +1. caching the Key/Value from `getTopKey`/`getTopValue`, for use after calling `next` on the source iterator. +In this case, the cached Key/Value object is aliased to the reference returned by the source iterator. +Iterators may reuse the same Key/Value object in a `next` call for performance reasons, changing the data +that the cached Key/Value object references and resulting in a logic bug. +2. modifying the Key/Value from `getTopKey`/`getTopValue`. If the source iterator reuses data stored in the Key/Value, +then the source iterator may use the modified data that the Key/Value references. This may/may not result in a logic bug. + +In both cases, copying the Key/Value's data into a new object ensures iterator correctness. If neither case applies, +it is safe to not copy the Key/Value. The general guideline is to be aware of who else may use Key/Value objects +returned from `getTopKey`/`getTopValue`. + +==== `deepCopy` + +The `deepCopy` method is similar to the `clone` method from the Java `Cloneable` interface. +Implementations of this method should return a new object of the same type as the Accumulo Iterator +instance it was called on. Any internal state from the instance `deepCopy` was called +on should be carried over to the returned copy. The returned copy should be ready to have +`seek` called on it. The SortedKeyValueIterator interface guarantees that `init` will be called on +an iterator before `deepCopy` and that `init` will not be called on the iterator returned by +`deepCopy`. + +Typically, implementations of `deepCopy` call a copy-constructor which will initialize +internal data structures. As with `seek`, it is common for the `IteratorEnvironment` +argument to be ignored as most Iterator implementations can be written without the explicit +information the environment provides. + +In the analogy of a series of Iterators representing a tree, `deepCopy` can be thought of as +early programming assignments which implement their own tree data structures. `deepCopy` calls +copy on its sources (the children), copies itself, attaches the copies of the children, and +then returns itself. + +=== TabletServer invocation of Iterators + +The following code is a general outline for how TabletServers invoke Iterators. + +[source,java] +---- + List<KeyValue> batch; + Range range = getRangeFromClient(); + while(!overSizeLimit(batch)){ + SortedKeyValueIterator source = getSystemIterator(); + + for(String clzName : getUserIterators()){ + Class<?> clz = Class.forName(clzName); + SortedKeyValueIterator iter = (SortedKeyValueIterator) clz.newInstance(); + iter.init(source, opts, env); + source = iter; + } + + // read a batch of data to return to client + // the last iterator, the "top" + SortedKeyValueIterator topIter = source; + topIter.seek(getRangeFromUser(), ...) 
+
+    while(topIter.hasTop() && !overSizeLimit(batch)){
+      key = topIter.getTopKey();
+      val = topIter.getTopValue();
+      batch.add(new KeyValue(key, val));
+      if(systemDataSourcesChanged()){
+        // code does not show isolation case, which will
+        // keep using same data sources until a row boundary is hit
+        range = new Range(key, false, range.endKey(), range.endKeyInclusive());
+        break;
+      }
+    }
+  }
+  //return batch of key values to client
+----
+
+Additionally, the obtuse "re-seek" case can be outlined as the following:
+
+[source,java]
+----
+  // Given the above
+  List<KeyValue> batch = getNextBatch();
+
+  // Store off lastKeyReturned for this client
+  lastKeyReturned = batch.get(batch.size() - 1).getKey();
+
+  // thread goes away (client stops asking for the next batch).
+
+  // Eventually client comes back
+  // Setup as before...
+
+  Range userRange = getRangeFromUser();
+  Range actualRange = new Range(lastKeyReturned, false,
+      userRange.getEndKey(), userRange.isEndKeyInclusive());
+
+  // Use the actualRange, not the user provided one
+  topIter.seek(actualRange);
+----
+
+
+=== Isolation
+
+Accumulo provides a feature which clients can enable to prevent the viewing of partially
+applied mutations within the context of rows. If a client is submitting multiple column
+updates to rows at a time, isolation would ensure that a client would either see all of the
+updates made to that row or none of the updates (until they are all applied).
+
+When using Isolation, there are additional concerns in iterator design. A scan-time iterator in Accumulo
+reads from a set of data sources. While an iterator is reading data it has an isolated view. However, after it returns a
+key/value it is possible that Accumulo may switch data sources and re-seek the iterator. This is done so that resources
+may be reclaimed. When the user does not request isolation this can occur after any key is returned. When a user enables
+Isolation, this will only occur after a new row is returned, in which case it will re-seek to the very beginning of the
+next possible row.
+
+=== Abstract Iterators
+
+A number of Abstract implementations of Iterators are provided to allow for faster creation
+of common patterns. The most commonly used abstract implementations are the `Filter` and
+`Combiner` classes. When possible these classes should be used instead as they have been
+thoroughly tested inside Accumulo itself.
+
+==== Filter
+
+The `Filter` abstract Iterator provides a very simple implementation which allows implementations
+to define whether or not a Key-Value pair should be returned via an `accept(Key, Value)` method.
+
+Filters are extremely simple to implement; however, when the implementation is filtering a
+large percentage of Key-Value pairs with respect to the total number of pairs examined,
+it can be very inefficient. For example, if a Filter implementation can determine after examining
+part of the row that no other pairs in this row will be accepted, there is no mechanism to
+efficiently skip the remaining Key-Value pairs. Concretely, take a row which is comprised of
+1000 Key-Value pairs. After examining the first 10 Key-Value pairs, it is determined
+that no other Key-Value pairs in this row will be accepted. The Filter must still examine each
+of the remaining 990 Key-Value pairs in this row. Another way to express this deficiency is that
+Filters have no means to leverage the `seek` method to efficiently skip large portions
+of Key-Value pairs.
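Despite that limitation, a Filter is usually only a few lines of code. The following is a hedged sketch of a Filter that keeps only Key-Value pairs whose row begins with a configured prefix; the class name and the `prefix` option are hypothetical. Note that even this simple example is subject to the row-skipping deficiency just described.

[source,java]
----
import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Filter;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

/** Keeps only Key-Value pairs whose row starts with a configured prefix (illustrative only). */
public class RowPrefixFilter extends Filter {

  private String prefix = "";

  @Override
  public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options,
      IteratorEnvironment env) throws IOException {
    super.init(source, options, env);
    if (options.containsKey("prefix")) {
      prefix = options.get("prefix");
    }
  }

  @Override
  public boolean accept(Key k, Value v) {
    return k.getRow().toString().startsWith(prefix);
  }

  @Override
  public SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env) {
    // carry the configured prefix over to the copy
    RowPrefixFilter copy = (RowPrefixFilter) super.deepCopy(env);
    copy.prefix = prefix;
    return copy;
  }
}
----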
As such, the `Filter` class functions well for filtering small amounts of data, but is
+inefficient for filtering large amounts of data. The decision to use a `Filter` strongly
+depends on the use case and distribution of data being filtered.
+
+==== Combiner
+
+The `Combiner` class is another common abstract Iterator. Similar to the `Combiner` interface
+defined in Hadoop's MapReduce framework, implementations of this abstract class reduce
+multiple Values for different versions of a Key (Keys which only differ by timestamps) into one Key-Value pair.
+Combiners provide a simple way to implement common operations like summation and
+aggregation without the need to implement the entire Accumulo Iterator interface.
+
+One important consideration when choosing to design a Combiner is that the "reduction" operation
+works best when it is associative and commutative. Operations which do not meet
+these criteria can be implemented; however, the implementation can be difficult.
+
+A second consideration is that a Combiner is not guaranteed to see every Key-Value pair
+which differs only by timestamp every time it is invoked. For example, if there are 5 Key-Value
+pairs in a table which only differ by the timestamps 1, 2, 3, 4, and 5, it is not guaranteed that
+every invocation of the Combiner will see 5 timestamps. One invocation might see the Values for
+Keys with timestamp 1 and 4, while another invocation might see the Values for Keys with the
+timestamps 1, 2, 4 and 5.
+
+Finally, when configuring an Accumulo table to use a Combiner, be sure to disable the Versioning Iterator or set the
+Combiner at a priority less than the Versioning Iterator (the Versioning Iterator is added at a priority of 20 by default). The
+Versioning Iterator will filter out multiple Key-Value pairs that differ only by timestamp and return only the Key-Value
+pair that has the largest timestamp.
+
+=== Best practices
+
+Because of the flexibility that the `SortedKeyValueIterator` interface provides, it doesn't directly disallow
+many implementations which are poor design decisions. The following are some common recommendations to
+follow and pitfalls to avoid in Iterator implementations.
+
+==== Avoid special logic encoded in Ranges
+
+Commonly, granular Ranges that a client passes to an Iterator from a `Scanner` or `BatchScanner` are unmodified.
+If a `Range` falls within the boundaries of a Tablet, an Iterator will often see that same Range in the
+`seek` method. However, there is no guarantee that the `Range` will remain unaltered from client to server. As such, Iterators
+should *never* make assumptions about the current state/context based on the `Range`.
+
+The common failure condition is referred to as a "re-seek". In the context of a Scan, TabletServers construct the
+"stack" of Iterators and batch up Key-Value pairs to send back to the client. When a sufficient number of Key-Value
+pairs are collected, it is common for the Iterators to be "torn down" until the client asks for the next batch of
+Key-Value pairs. This is done by the TabletServer to add fairness in ensuring one Scan does not monopolize the available
+resources. When the client asks for the next batch, the implementation modifies the original Range so that servers know
+the point to resume the iteration (to avoid returning duplicate Key-Value pairs). Specifically, the new Range is created
+from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan,
+non-inclusive.
+ +==== `seek`'ing backwards + +The ability for an Iterator to "skip over" large blocks of Key-Value pairs is a major tenet behind Iterators. +By `seek`'ing when it is known that there is a collection of Key-Value pairs which can be ignored can +greatly increase the speed of a scan as many Key-Value pairs do not have to be deserialized and processed. + +While the `seek` method provides the `Range` that should be used to `seek` the underlying source Iterator, +there is no guarantee that the implementing Iterator uses that `Range` to perform the `seek` on its +"source" Iterator. As such, it is possible to seek to any `Range` and the interface has no assertions +to prevent this from happening. + +Since Iterators are allowed to `seek` to arbitrary Keys, it also allows Iterators to create infinite loops +inside Scans that will repeatedly read the same data without end. If an arbitrary Range is constructed, it should +construct a completely new Range as it allows for bugs to be introduced which will break Accumulo. + +Thus, `seek`'s should always be thought of as making "forward progress" in the view of the total iteration. The +`startKey` of a `Range` should always be greater than the current Key seen by the Iterator while the `endKey` of the +`Range` should always retain the original `endKey` (and `endKey` inclusivity) of the last `Range` seen by your +Iterator's implementation of seek. + +==== Take caution in constructing new data in an Iterator + +Implementations of Iterator might be tempted to open BatchWriters inside of an Iterator as a means +to implement triggers for writing additional data outside of their client application. The lifecycle of an Iterator +is *not* managed in such a way that guarantees that this is safe nor efficient. Specifically, there +is no way to guarantee that the internal ThreadPool inside of the BatchWriter is closed (and the thread(s) +are reaped) without calling the close() method. `close`'ing and recreating a `BatchWriter` after every +Key-Value pair is also prohibitively performance limiting to be considered an option. + +The only safe way to generate additional data in an Iterator is to alter the current Key-Value pair. +For example, the `WholeRowIterator` serializes the all of the Key-Values pairs that fall within each +row. A safe way to generate more data in an Iterator would be to construct an Iterator that is +"higher" (at a larger priority) than the `WholeRowIterator`, that is, the Iterator receives the Key-Value pairs which are +a serialization of many Key-Value pairs. The custom Iterator could deserialize the pairs, compute +some function, and add a new Key-Value pair to the original collection, re-serializing the collection +of Key-Value pairs back into a single Key-Value pair. + +Any other situation is likely not guaranteed to ensure that the caller (a Scan or a Compaction) will +always see all intended data that is generated. + +=== Final things to remember + +Some simple recommendations/points to keep in mind: + +==== Method call order + +On an instance of an Iterator: `init` is always called before `seek`, `seek` is always called before `hasTop`, +`getTopKey` and `getTopValue` will not be called if `hasTop` returns false. + +==== Teardown + +As mentioned, instance of Iterators may be torn down inside of the server transparently. 
When a complex
+collection of iterators is performing some advanced functionality, they will not be torn down until a Key-Value
+pair is returned out of the "stack" of Iterators (and added into the batch of Key-Values to be returned
+to the caller). Being torn down is equivalent to a new instance of the Iterator being created and `deepCopy`
+being called on the new instance with the old instance provided as the argument to `deepCopy`. References
+to the old instance are removed and the object is lazily garbage collected by the JVM.
+
+=== Compaction-time Iterators
+
+When Iterators are configured to run during compactions, at the `minc` or `majc` scope, these Iterators sometimes need
+to make different assertions than those that only operate at scan time. Iterators won't see the delete entries; however,
+Iterators will not necessarily see all of the Key-Value pairs in every invocation. Because compactions often do not rewrite
+all files (only a subset of them), the Iterator's logic must take this into consideration.
+
+For example, a Combiner that runs over data during compactions might not see all of the values for a given Key. The
+Combiner must recognize this and not perform any function that would be incorrect due
+to the missing values.

http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/32ac61d3/docs/master/iterator_test_harness.md
----------------------------------------------------------------------
diff --git a/docs/master/iterator_test_harness.md b/docs/master/iterator_test_harness.md
new file mode 100644
index 0000000..91ae53a
--- /dev/null
+++ b/docs/master/iterator_test_harness.md
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+== Iterator Testing
+
+Iterators, while extremely powerful, are notoriously difficult to test. While the API defines
+the methods an Iterator must implement and each method's functionality, the actual invocation
+of these methods by Accumulo TabletServers can be surprisingly difficult to mimic in unit tests.
+
+The Apache Accumulo "Iterator Test Harness" is designed to provide a generalized testing framework
+which all Accumulo Iterators can leverage to identify common pitfalls in user-created Iterators.
+
+=== Framework Use
+
+The harness provides an abstract class for use with JUnit4. Users must define the following for this
+abstract class:
+
+ * A `SortedMap` of input data (`Key`-`Value` pairs)
+ * A `Range` to use in tests
+ * A `Map` of options (`String` to `String` pairs)
+ * A `SortedMap` of output data (`Key`-`Value` pairs)
+ * A list of `IteratorTestCase`s (these can be automatically discovered)
+
+The majority of effort a user must make is in creating the input dataset and the expected
+output dataset for the iterator being tested.
+
+=== Normal Test Outline
+
+Most iterator tests will follow the given outline:
+
+[source,java]
+----
+import java.util.List;
+import java.util.Map;
+import java.util.SortedMap;
+
+import org.apache.accumulo.core.data.Key;
+import org.apache.accumulo.core.data.Range;
+import org.apache.accumulo.core.data.Value;
+import org.apache.accumulo.iteratortest.IteratorTestCaseFinder;
+import org.apache.accumulo.iteratortest.IteratorTestInput;
+import org.apache.accumulo.iteratortest.IteratorTestOutput;
+import org.apache.accumulo.iteratortest.junit4.BaseJUnit4IteratorTest;
+import org.apache.accumulo.iteratortest.testcases.IteratorTestCase;
+import org.junit.runners.Parameterized.Parameters;
+
+public class MyIteratorTest extends BaseJUnit4IteratorTest {
+
+  @Parameters
+  public static Object[][] parameters() {
+    final IteratorTestInput input = createIteratorInput();
+    final IteratorTestOutput output = createIteratorOutput();
+    final List<IteratorTestCase> testCases = IteratorTestCaseFinder.findAllTestCases();
+    return BaseJUnit4IteratorTest.createParameters(input, output, testCases);
+  }
+
+  private static SortedMap<Key,Value> INPUT_DATA = createInputData();
+  private static SortedMap<Key,Value> OUTPUT_DATA = createOutputData();
+
+  private static SortedMap<Key,Value> createInputData() {
+    // TODO -- implement this method
+  }
+
+  private static SortedMap<Key,Value> createOutputData() {
+    // TODO -- implement this method
+  }
+
+  private static IteratorTestInput createIteratorInput() {
+    final Map<String,String> options = createIteratorOptions();
+    final Range range = createRange();
+    return new IteratorTestInput(MyIterator.class, options, range, INPUT_DATA);
+  }
+
+  private static Map<String,String> createIteratorOptions() {
+    // TODO -- implement this method
+    // Tip: Use INPUT_DATA if helpful in generating output
+  }
+
+  private static Range createRange() {
+    // TODO -- implement this method
+  }
+
+  private static IteratorTestOutput createIteratorOutput() {
+    return new IteratorTestOutput(OUTPUT_DATA);
+  }
+
+}
+----
+
+=== Limitations
+
+While the provided `IteratorTestCase`s should exercise common edge-cases in user iterators,
+there are still many limitations to the existing test harness. Some of them are:
+
+ * Can only specify a single iterator, not many (a "stack")
+ * No control over provided IteratorEnvironment for tests
+ * Exercising delete keys (especially with major compactions that do not include all files)
+
+These are left as future improvements to the harness.
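As an illustration of the data-building methods left as TODOs in the outline above, the input and output datasets are ordinary sorted maps of `Key` to `Value`. The following hedged sketch (which additionally requires `java.util.TreeMap`) builds input data and the expected output for a hypothetical iterator that drops the column family named "skip":

[source,java]
----
private static SortedMap<Key,Value> createInputData() {
  SortedMap<Key,Value> data = new TreeMap<>();
  data.put(new Key("row1", "keep", "cq1"), new Value("v1".getBytes()));
  data.put(new Key("row1", "skip", "cq1"), new Value("v2".getBytes()));
  data.put(new Key("row2", "keep", "cq1"), new Value("v3".getBytes()));
  return data;
}

private static SortedMap<Key,Value> createOutputData() {
  // The expected result after the iterator under test removes the "skip" column family.
  SortedMap<Key,Value> data = new TreeMap<>(createInputData());
  data.keySet().removeIf(key -> key.getColumnFamily().toString().equals("skip"));
  return data;
}
----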
