http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7b2eb317/docs/unreleased/development/analytics.html ---------------------------------------------------------------------- diff --git a/docs/unreleased/development/analytics.html b/docs/unreleased/development/analytics.html new file mode 100644 index 0000000..98285c1 --- /dev/null +++ b/docs/unreleased/development/analytics.html @@ -0,0 +1,552 @@ +<!DOCTYPE html> +<html lang="en"> +<head> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +<meta charset="utf-8"> +<meta http-equiv="X-UA-Compatible" content="IE=edge"> +<meta name="viewport" content="width=device-width, initial-scale=1"> +<link href="https://maxcdn.bootstrapcdn.com/bootswatch/3.3.7/paper/bootstrap.min.css" rel="stylesheet" integrity="sha384-awusxf8AUojygHf2+joICySzB780jVvQaVCAt1clU3QsyAitLGul28Qxb2r1e5g+" crossorigin="anonymous"> +<link href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" rel="stylesheet"> +<link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/v/bs/jq-2.2.3/dt-1.10.12/datatables.min.css"> +<link href="/css/accumulo.css" rel="stylesheet" type="text/css"> + +<title>Accumulo Documentation - Analytics</title> + +<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script> +<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script> +<script type="text/javascript" src="https://cdn.datatables.net/v/bs/jq-2.2.3/dt-1.10.12/datatables.min.js"></script> +<script> + // show location of canonical site if not currently on the canonical site + $(function() { + var host = window.location.host; + if (typeof host !== 'undefined' && host !== 'accumulo.apache.org') { + $('#non-canonical').show(); + } + }); + + $(function() { + // decorate section headers with anchors + return $("h2, h3, h4, h5, h6").each(function(i, el) { + var $el, icon, id; + $el = $(el); + id = $el.attr('id'); + icon = '<i class="fa fa-link"></i>'; + if (id) { + return $el.append($("<a />").addClass("header-link").attr("href", "#" + id).html(icon)); + } + }); + }); + + // configure Google Analytics + (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ + (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), + m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) + })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); + + if (ga.hasOwnProperty('loaded') && ga.loaded === true) { + ga('create', 'UA-50934829-1', 'apache.org'); + ga('send', 'pageview'); + } +</script> + +</head> +<body style="padding-top: 100px"> + + <nav class="navbar navbar-default navbar-fixed-top"> + <div class="container"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar-items"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a href="/"><img id="nav-logo" alt="Apache Accumulo" class="img-responsive" src="/images/accumulo-logo.png" width="200" + /></a> + </div> + <div class="collapse navbar-collapse" id="navbar-items"> + <ul class="nav navbar-nav"> + <li class="nav-link"><a href="/downloads">Download</a></li> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Releases<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/release/accumulo-1.8.1/">1.8.1 (Latest)</a></li> + <li><a href="/release/accumulo-1.7.3/">1.7.3</a></li> + <li><a href="/release/accumulo-1.6.6/">1.6.6</a></li> + <li><a href="/release/">Archive</a></li> + </ul> + </li> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Documentation<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/1.8/accumulo_user_manual.html">User Manual (1.8)</a></li> + <li><a href="/1.8/apidocs">Javadocs (1.8)</a></li> + <li><a href="/1.8/examples">Examples (1.8)</a></li> + <li><a href="/features">Features</a></li> + <li><a href="/glossary">Glossary</a></li> + <li><a href="/external-docs">External Docs</a></li> + <li><a href="/docs-archive/">Archive</a></li> + </ul> + </li> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Community<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/get_involved">Get Involved</a></li> + <li><a href="/mailing_list">Mailing Lists</a></li> + <li><a href="/people">People</a></li> + <li><a href="/related-projects">Related Projects</a></li> + <li><a href="/contributor/">Contributor Guide</a></li> + </ul> + </li> + </ul> + <ul class="nav navbar-nav navbar-right"> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Apache Software Foundation<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="https://www.apache.org">Apache Homepage <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/licenses/LICENSE-2.0">License <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/foundation/sponsorship">Sponsorship <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/security">Security <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/foundation/thanks">Thanks <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/foundation/policies/conduct">Code of Conduct <i class="fa fa-external-link"></i></a></li> + </ul> + </li> + </ul> + </div> + </div> +</nav> + + <div class="container"> + <div class="row"> + <div class="col-md-12"> + + <div id="non-canonical" style="display: none; background-color: #F0E68C; padding-left: 1em;"> + Visit the official site at: <a href="https://accumulo.apache.org">https://accumulo.apache.org</a> + </div> + <div id="content"> + + <div class="alert alert-danger" role="alert">This documentation is for an unreleased version of Apache Accumulo that is currently under development! Check out the <a href="/docs-1.8/">documentation for the latest release</a>.</div> + +<div class="row"> + <div class="col-md-3"> + <div class="panel-group" id="accordion" role="tablist" aria-multiselectable="true"> + <div class="panel panel-default"> + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapsegetting-started" aria-expanded="false" aria-controls="collapsegetting-started"> + Getting started + </a> + </h4> + </div> + <div id="collapsegetting-started" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/design">Accumulo Design</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/clients">Accumulo Clients</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/shell">Accumulo Shell</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/table_design">Table Design</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/table_configuration">Table Configuration</a></div> + + </div> + </div> + + + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapsedevelopment" aria-expanded="true" aria-controls="collapsedevelopment"> + Development + </a> + </h4> + </div> + <div id="collapsedevelopment" class="panel-collapse collapse in" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/iterator_design">Iterator Design</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/iterator_testing">Iterator Testing</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/development_tools">Development Tools</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/sampling">Sampling</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/summaries">Summary Statistics</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/security">Security</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/high_speed_ingest">High-Speed Ingest</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/analytics">Analytics</a></div> + + </div> + </div> + + + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapseadministration" aria-expanded="false" aria-controls="collapseadministration"> + Administration + </a> + </h4> + </div> + <div id="collapseadministration" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/overview">Overview</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/configuration-management">Configuration Management</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/configuration-properties">Configuration Properties</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/kerberos">Kerberos</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/replication">Replication</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/fate">FATE</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/multivolume">Multi-Volume Installations</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/ssl">SSL</a></div> + + </div> + </div> + + + + + + + + + + + + + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapsetroubleshooting" aria-expanded="false" aria-controls="collapsetroubleshooting"> + Troubleshooting + </a> + </h4> + </div> + <div id="collapsetroubleshooting" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/troubleshooting/overview">Overview</a></div> + + </div> + </div> + + + + </div> + </div> + </div> + <div class="col-md-9"> + + <p><a href="/docs/unreleased/">Accumulo unreleased docs</a> >> Development >> Analytics</p> + + + <h1>Analytics</h1> + + <p>Accumulo supports more advanced data processing than simply keeping keys +sorted and performing efficient lookups. Analytics can be developed by using +MapReduce and Iterators in conjunction with Accumulo tables.</p> + +<h2 id="mapreduce">MapReduce</h2> + +<p>Accumulo tables can be used as the source and destination of MapReduce jobs. To +use an Accumulo table with a MapReduce job (specifically with the new Hadoop API +as of version 0.20), configure the job parameters to use the AccumuloInputFormat +and AccumuloOutputFormat. Accumulo specific parameters can be set via these +two format classes to do the following:</p> + +<ul> + <li>Authenticate and provide user credentials for the input</li> + <li>Restrict the scan to a range of rows</li> + <li>Restrict the input to a subset of available columns</li> +</ul> + +<h3 id="mapper-and-reducer-classes">Mapper and Reducer classes</h3> + +<p>To read from an Accumulo table create a Mapper with the following class +parameterization and be sure to configure the AccumuloInputFormat.</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">MyMapper</span> <span class="kd">extends</span> <span class="n">Mapper</span><span class="o"><</span><span class="n">Key</span><span class="o">,</span><span class="n">Value</span><span class="o">,</span><span class="n">WritableComparable</span><span class="o">,</span><span class="n">Writable</span><span class="o">></span> <span class="o">{</span> + <span class="kd">public</span> <span class="kt">void</span> <span class="nf">map</span><span class="o">(</span><span class="n">Key</span> <span class="n">k</span><span class="o">,</span> <span class="n">Value</span> <span class="n">v</span><span class="o">,</span> <span class="n">Context</span> <span class="n">c</span><span class="o">)</span> <span class="o">{</span> + <span class="c1">// transform key and value data here</span> + <span class="o">}</span> +<span class="o">}</span> +</code></pre> +</div> + +<p>To write to an Accumulo table, create a Reducer with the following class +parameterization and be sure to configure the AccumuloOutputFormat. The key +emitted from the Reducer identifies the table to which the mutation is sent. This +allows a single Reducer to write to more than one table if desired. A default table +can be configured using the AccumuloOutputFormat, in which case the output table +name does not have to be passed to the Context object within the Reducer.</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">MyReducer</span> <span class="kd">extends</span> <span class="n">Reducer</span><span class="o"><</span><span class="n">WritableComparable</span><span class="o">,</span> <span class="n">Writable</span><span class="o">,</span> <span class="n">Text</span><span class="o">,</span> <span class="n">Mutation</span><span class="o">></span> <span class="o">{</span> + <span class="kd">public</span> <span class="kt">void</span> <span class="nf">reduce</span><span class="o">(</span><span class="n">WritableComparable</span> <span class="n">key</span><span class="o">,</span> <span class="n">Iterable</span><span class="o"><</span><span class="n">Text</span><span class="o">></span> <span class="n">values</span><span class="o">,</span> <span class="n">Context</span> <span class="n">c</span><span class="o">)</span> <span class="o">{</span> + <span class="n">Mutation</span> <span class="n">m</span><span class="o">;</span> + <span class="c1">// create the mutation based on input key and value</span> + <span class="n">c</span><span class="o">.</span><span class="na">write</span><span class="o">(</span><span class="k">new</span> <span class="n">Text</span><span class="o">(</span><span class="s">"output-table"</span><span class="o">),</span> <span class="n">m</span><span class="o">);</span> + <span class="o">}</span> +<span class="o">}</span> +</code></pre> +</div> + +<p>The Text object passed as the output should contain the name of the table to which +this mutation should be applied. The Text can be null in which case the mutation +will be applied to the default table name specified in the AccumuloOutputFormat +options.</p> + +<h3 id="accumuloinputformat-options">AccumuloInputFormat options</h3> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">Job</span> <span class="n">job</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Job</span><span class="o">(</span><span class="n">getConf</span><span class="o">());</span> +<span class="n">AccumuloInputFormat</span><span class="o">.</span><span class="na">setInputInfo</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> + <span class="s">"user"</span><span class="o">,</span> + <span class="s">"passwd"</span><span class="o">.</span><span class="na">getBytes</span><span class="o">(),</span> + <span class="s">"table"</span><span class="o">,</span> + <span class="k">new</span> <span class="nf">Authorizations</span><span class="o">());</span> + +<span class="n">AccumuloInputFormat</span><span class="o">.</span><span class="na">setZooKeeperInstance</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="s">"myinstance"</span><span class="o">,</span> + <span class="s">"zooserver-one,zooserver-two"</span><span class="o">);</span> +</code></pre> +</div> + +<p><strong>Optional Settings:</strong></p> + +<p>To restrict Accumulo to a set of row ranges:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">ArrayList</span><span class="o"><</span><span class="n">Range</span><span class="o">></span> <span class="n">ranges</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">Range</span><span class="o">>();</span> +<span class="c1">// populate array list of row ranges ...</span> +<span class="n">AccumuloInputFormat</span><span class="o">.</span><span class="na">setRanges</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="n">ranges</span><span class="o">);</span> +</code></pre> +</div> + +<p>To restrict Accumulo to a list of columns:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">ArrayList</span><span class="o"><</span><span class="n">Pair</span><span class="o"><</span><span class="n">Text</span><span class="o">,</span><span class="n">Text</span><span class="o">>></span> <span class="n">columns</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">Pair</span><span class="o"><</span><span class="n">Text</span><span class="o">,</span><span class="n">Text</span><span class="o">>>();</span> +<span class="c1">// populate list of columns</span> +<span class="n">AccumuloInputFormat</span><span class="o">.</span><span class="na">fetchColumns</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="n">columns</span><span class="o">);</span> +</code></pre> +</div> + +<p>To use a regular expression to match row IDs:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">IteratorSetting</span> <span class="n">is</span> <span class="o">=</span> <span class="k">new</span> <span class="n">IteratorSetting</span><span class="o">(</span><span class="mi">30</span><span class="o">,</span> <span class="n">RexExFilter</span><span class="o">.</span><span class="na">class</span><span class="o">);</span> +<span class="n">RegExFilter</span><span class="o">.</span><span class="na">setRegexs</span><span class="o">(</span><span class="n">is</span><span class="o">,</span> <span class="s">".*suffix"</span><span class="o">,</span> <span class="kc">null</span><span class="o">,</span> <span class="kc">null</span><span class="o">,</span> <span class="kc">null</span><span class="o">,</span> <span class="kc">true</span><span class="o">);</span> +<span class="n">AccumuloInputFormat</span><span class="o">.</span><span class="na">addIterator</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="n">is</span><span class="o">);</span> +</code></pre> +</div> + +<h3 id="accumulomultitableinputformat-options">AccumuloMultiTableInputFormat options</h3> + +<p>The AccumuloMultiTableInputFormat allows the scanning over multiple tables +in a single MapReduce job. Separate ranges, columns, and iterators can be +used for each table.</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">InputTableConfig</span> <span class="n">tableOneConfig</span> <span class="o">=</span> <span class="k">new</span> <span class="n">InputTableConfig</span><span class="o">();</span> +<span class="n">InputTableConfig</span> <span class="n">tableTwoConfig</span> <span class="o">=</span> <span class="k">new</span> <span class="n">InputTableConfig</span><span class="o">();</span> +</code></pre> +</div> + +<p>To set the configuration objects on the job:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">Map</span><span class="o"><</span><span class="n">String</span><span class="o">,</span> <span class="n">InputTableConfig</span><span class="o">></span> <span class="n">configs</span> <span class="o">=</span> <span class="k">new</span> <span class="n">HashMap</span><span class="o"><</span><span class="n">String</span><span class="o">,</span><span class="n">InputTableConfig</span><span class="o">>();</span> +<span class="n">configs</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="s">"table1"</span><span class="o">,</span> <span class="n">tableOneConfig</span><span class="o">);</span> +<span class="n">configs</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="s">"table2"</span><span class="o">,</span> <span class="n">tableTwoConfig</span><span class="o">);</span> +<span class="n">AccumuloMultiTableInputFormat</span><span class="o">.</span><span class="na">setInputTableConfigs</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="n">configs</span><span class="o">);</span> +</code></pre> +</div> + +<p><strong>Optional settings:</strong></p> + +<p>To restrict to a set of ranges:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">ArrayList</span><span class="o"><</span><span class="n">Range</span><span class="o">></span> <span class="n">tableOneRanges</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">Range</span><span class="o">>();</span> +<span class="n">ArrayList</span><span class="o"><</span><span class="n">Range</span><span class="o">></span> <span class="n">tableTwoRanges</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">Range</span><span class="o">>();</span> +<span class="c1">// populate array lists of row ranges for tables...</span> +<span class="n">tableOneConfig</span><span class="o">.</span><span class="na">setRanges</span><span class="o">(</span><span class="n">tableOneRanges</span><span class="o">);</span> +<span class="n">tableTwoConfig</span><span class="o">.</span><span class="na">setRanges</span><span class="o">(</span><span class="n">tableTwoRanges</span><span class="o">);</span> +</code></pre> +</div> + +<p>To restrict Accumulo to a list of columns:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">ArrayList</span><span class="o"><</span><span class="n">Pair</span><span class="o"><</span><span class="n">Text</span><span class="o">,</span><span class="n">Text</span><span class="o">>></span> <span class="n">tableOneColumns</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">Pair</span><span class="o"><</span><span class="n">Text</span><span class="o">,</span><span class="n">Text</span><span class="o">>>();</span> +<span class="n">ArrayList</span><span class="o"><</span><span class="n">Pair</span><span class="o"><</span><span class="n">Text</span><span class="o">,</span><span class="n">Text</span><span class="o">>></span> <span class="n">tableTwoColumns</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">Pair</span><span class="o"><</span><span class="n">Text</span><span class="o">,</span><span class="n">Text</span><span class="o">>>();</span> +<span class="c1">// populate lists of columns for each of the tables ...</span> +<span class="n">tableOneConfig</span><span class="o">.</span><span class="na">fetchColumns</span><span class="o">(</span><span class="n">tableOneColumns</span><span class="o">);</span> +<span class="n">tableTwoConfig</span><span class="o">.</span><span class="na">fetchColumns</span><span class="o">(</span><span class="n">tableTwoColumns</span><span class="o">);</span> +</code></pre> +</div> + +<p>To set scan iterators:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">List</span><span class="o"><</span><span class="n">IteratorSetting</span><span class="o">></span> <span class="n">tableOneIterators</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">IteratorSetting</span><span class="o">>();</span> +<span class="n">List</span><span class="o"><</span><span class="n">IteratorSetting</span><span class="o">></span> <span class="n">tableTwoIterators</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o"><</span><span class="n">IteratorSetting</span><span class="o">>();</span> +<span class="c1">// populate the lists of iterator settings for each of the tables ...</span> +<span class="n">tableOneConfig</span><span class="o">.</span><span class="na">setIterators</span><span class="o">(</span><span class="n">tableOneIterators</span><span class="o">);</span> +<span class="n">tableTwoConfig</span><span class="o">.</span><span class="na">setIterators</span><span class="o">(</span><span class="n">tableTwoIterators</span><span class="o">);</span> +</code></pre> +</div> + +<p>The name of the table can be retrieved from the input split:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="kd">class</span> <span class="nc">MyMapper</span> <span class="kd">extends</span> <span class="n">Mapper</span><span class="o"><</span><span class="n">Key</span><span class="o">,</span><span class="n">Value</span><span class="o">,</span><span class="n">WritableComparable</span><span class="o">,</span><span class="n">Writable</span><span class="o">></span> <span class="o">{</span> + <span class="kd">public</span> <span class="kt">void</span> <span class="nf">map</span><span class="o">(</span><span class="n">Key</span> <span class="n">k</span><span class="o">,</span> <span class="n">Value</span> <span class="n">v</span><span class="o">,</span> <span class="n">Context</span> <span class="n">c</span><span class="o">)</span> <span class="o">{</span> + <span class="n">RangeInputSplit</span> <span class="n">split</span> <span class="o">=</span> <span class="o">(</span><span class="n">RangeInputSplit</span><span class="o">)</span><span class="n">c</span><span class="o">.</span><span class="na">getInputSplit</span><span class="o">();</span> + <span class="n">String</span> <span class="n">tableName</span> <span class="o">=</span> <span class="n">split</span><span class="o">.</span><span class="na">getTableName</span><span class="o">();</span> + <span class="c1">// do something with table name</span> + <span class="o">}</span> +<span class="o">}</span> +</code></pre> +</div> + +<h3 id="accumulooutputformat-options">AccumuloOutputFormat options</h3> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="kt">boolean</span> <span class="n">createTables</span> <span class="o">=</span> <span class="kc">true</span><span class="o">;</span> +<span class="n">String</span> <span class="n">defaultTable</span> <span class="o">=</span> <span class="s">"mytable"</span><span class="o">;</span> + +<span class="n">AccumuloOutputFormat</span><span class="o">.</span><span class="na">setOutputInfo</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> + <span class="s">"user"</span><span class="o">,</span> + <span class="s">"passwd"</span><span class="o">.</span><span class="na">getBytes</span><span class="o">(),</span> + <span class="n">createTables</span><span class="o">,</span> + <span class="n">defaultTable</span><span class="o">);</span> + +<span class="n">AccumuloOutputFormat</span><span class="o">.</span><span class="na">setZooKeeperInstance</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="s">"myinstance"</span><span class="o">,</span> + <span class="s">"zooserver-one,zooserver-two"</span><span class="o">);</span> +</code></pre> +</div> + +<p><strong>Optional Settings:</strong></p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">AccumuloOutputFormat</span><span class="o">.</span><span class="na">setMaxLatency</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="mi">300000</span><span class="o">);</span> <span class="c1">// milliseconds</span> +<span class="n">AccumuloOutputFormat</span><span class="o">.</span><span class="na">setMaxMutationBufferSize</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="mi">50000000</span><span class="o">);</span> <span class="c1">// bytes</span> +</code></pre> +</div> + +<p>The <a href="https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md">MapReduce example</a> +contains a complete example of using MapReduce with Accumulo.</p> + +<h2 id="combiners">Combiners</h2> + +<p>Many applications can benefit from the ability to aggregate values across common +keys. This can be done via Combiner iterators and is similar to the Reduce step in +MapReduce. This provides the ability to define online, incrementally updated +analytics without the overhead or latency associated with batch-oriented +MapReduce jobs.</p> + +<p>All that is needed to aggregate values of a table is to identify the fields over which +values will be grouped, insert mutations with those fields as the key, and configure +the table with a combining iterator that supports the summarizing operation +desired.</p> + +<p>The only restriction on an combining iterator is that the combiner developer +should not assume that all values for a given key have been seen, since new +mutations can be inserted at anytime. This precludes using the total number of +values in the aggregation such as when calculating an average, for example.</p> + +<h3 id="feature-vectors">Feature Vectors</h3> + +<p>An interesting use of combining iterators within an Accumulo table is to store +feature vectors for use in machine learning algorithms. For example, many +algorithms such as k-means clustering, support vector machines, anomaly detection, +etc. use the concept of a feature vector and the calculation of distance metrics to +learn a particular model. The columns in an Accumulo table can be used to efficiently +store sparse features and their weights to be incrementally updated via the use of an +combining iterator.</p> + +<h2 id="statistical-modeling">Statistical Modeling</h2> + +<p>Statistical models that need to be updated by many machines in parallel could be +similarly stored within an Accumulo table. For example, a MapReduce job that is +iteratively updating a global statistical model could have each map or reduce worker +reference the parts of the model to be read and updated through an embedded +Accumulo client.</p> + +<p>Using Accumulo this way enables efficient and fast lookups and updates of small +pieces of information in a random access pattern, which is complementary to +MapReduceâs sequential access model.</p> + + </div> +</div> + + </div> + + +<footer> + + <p><a href="https://www.apache.org/foundation/contributing"><img src="https://www.apache.org/images/SupportApache-small.png" alt="Support the ASF" id="asf-logo" height="100" /></a></p> + + <p>Copyright © 2011-2017 The Apache Software Foundation. Licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.</p> + +</footer> + + + </div> + </div> + </div> +</body> +</html>
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7b2eb317/docs/unreleased/development/development_tools.html ---------------------------------------------------------------------- diff --git a/docs/unreleased/development/development_tools.html b/docs/unreleased/development/development_tools.html new file mode 100644 index 0000000..a0a34e4 --- /dev/null +++ b/docs/unreleased/development/development_tools.html @@ -0,0 +1,426 @@ +<!DOCTYPE html> +<html lang="en"> +<head> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +<meta charset="utf-8"> +<meta http-equiv="X-UA-Compatible" content="IE=edge"> +<meta name="viewport" content="width=device-width, initial-scale=1"> +<link href="https://maxcdn.bootstrapcdn.com/bootswatch/3.3.7/paper/bootstrap.min.css" rel="stylesheet" integrity="sha384-awusxf8AUojygHf2+joICySzB780jVvQaVCAt1clU3QsyAitLGul28Qxb2r1e5g+" crossorigin="anonymous"> +<link href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" rel="stylesheet"> +<link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/v/bs/jq-2.2.3/dt-1.10.12/datatables.min.css"> +<link href="/css/accumulo.css" rel="stylesheet" type="text/css"> + +<title>Accumulo Documentation - Development Tools</title> + +<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script> +<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script> +<script type="text/javascript" src="https://cdn.datatables.net/v/bs/jq-2.2.3/dt-1.10.12/datatables.min.js"></script> +<script> + // show location of canonical site if not currently on the canonical site + $(function() { + var host = window.location.host; + if (typeof host !== 'undefined' && host !== 'accumulo.apache.org') { + $('#non-canonical').show(); + } + }); + + $(function() { + // decorate section headers with anchors + return $("h2, h3, h4, h5, h6").each(function(i, el) { + var $el, icon, id; + $el = $(el); + id = $el.attr('id'); + icon = '<i class="fa fa-link"></i>'; + if (id) { + return $el.append($("<a />").addClass("header-link").attr("href", "#" + id).html(icon)); + } + }); + }); + + // configure Google Analytics + (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ + (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), + m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) + })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); + + if (ga.hasOwnProperty('loaded') && ga.loaded === true) { + ga('create', 'UA-50934829-1', 'apache.org'); + ga('send', 'pageview'); + } +</script> + +</head> +<body style="padding-top: 100px"> + + <nav class="navbar navbar-default navbar-fixed-top"> + <div class="container"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar-items"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a href="/"><img id="nav-logo" alt="Apache Accumulo" class="img-responsive" src="/images/accumulo-logo.png" width="200" + /></a> + </div> + <div class="collapse navbar-collapse" id="navbar-items"> + <ul class="nav navbar-nav"> + <li class="nav-link"><a href="/downloads">Download</a></li> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Releases<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/release/accumulo-1.8.1/">1.8.1 (Latest)</a></li> + <li><a href="/release/accumulo-1.7.3/">1.7.3</a></li> + <li><a href="/release/accumulo-1.6.6/">1.6.6</a></li> + <li><a href="/release/">Archive</a></li> + </ul> + </li> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Documentation<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/1.8/accumulo_user_manual.html">User Manual (1.8)</a></li> + <li><a href="/1.8/apidocs">Javadocs (1.8)</a></li> + <li><a href="/1.8/examples">Examples (1.8)</a></li> + <li><a href="/features">Features</a></li> + <li><a href="/glossary">Glossary</a></li> + <li><a href="/external-docs">External Docs</a></li> + <li><a href="/docs-archive/">Archive</a></li> + </ul> + </li> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Community<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/get_involved">Get Involved</a></li> + <li><a href="/mailing_list">Mailing Lists</a></li> + <li><a href="/people">People</a></li> + <li><a href="/related-projects">Related Projects</a></li> + <li><a href="/contributor/">Contributor Guide</a></li> + </ul> + </li> + </ul> + <ul class="nav navbar-nav navbar-right"> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Apache Software Foundation<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="https://www.apache.org">Apache Homepage <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/licenses/LICENSE-2.0">License <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/foundation/sponsorship">Sponsorship <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/security">Security <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/foundation/thanks">Thanks <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/foundation/policies/conduct">Code of Conduct <i class="fa fa-external-link"></i></a></li> + </ul> + </li> + </ul> + </div> + </div> +</nav> + + <div class="container"> + <div class="row"> + <div class="col-md-12"> + + <div id="non-canonical" style="display: none; background-color: #F0E68C; padding-left: 1em;"> + Visit the official site at: <a href="https://accumulo.apache.org">https://accumulo.apache.org</a> + </div> + <div id="content"> + + <div class="alert alert-danger" role="alert">This documentation is for an unreleased version of Apache Accumulo that is currently under development! Check out the <a href="/docs-1.8/">documentation for the latest release</a>.</div> + +<div class="row"> + <div class="col-md-3"> + <div class="panel-group" id="accordion" role="tablist" aria-multiselectable="true"> + <div class="panel panel-default"> + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapsegetting-started" aria-expanded="false" aria-controls="collapsegetting-started"> + Getting started + </a> + </h4> + </div> + <div id="collapsegetting-started" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/design">Accumulo Design</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/clients">Accumulo Clients</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/shell">Accumulo Shell</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/table_design">Table Design</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/table_configuration">Table Configuration</a></div> + + </div> + </div> + + + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapsedevelopment" aria-expanded="true" aria-controls="collapsedevelopment"> + Development + </a> + </h4> + </div> + <div id="collapsedevelopment" class="panel-collapse collapse in" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/iterator_design">Iterator Design</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/iterator_testing">Iterator Testing</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/development_tools">Development Tools</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/sampling">Sampling</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/summaries">Summary Statistics</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/security">Security</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/high_speed_ingest">High-Speed Ingest</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/analytics">Analytics</a></div> + + </div> + </div> + + + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapseadministration" aria-expanded="false" aria-controls="collapseadministration"> + Administration + </a> + </h4> + </div> + <div id="collapseadministration" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/overview">Overview</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/configuration-management">Configuration Management</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/configuration-properties">Configuration Properties</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/kerberos">Kerberos</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/replication">Replication</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/fate">FATE</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/multivolume">Multi-Volume Installations</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/ssl">SSL</a></div> + + </div> + </div> + + + + + + + + + + + + + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapsetroubleshooting" aria-expanded="false" aria-controls="collapsetroubleshooting"> + Troubleshooting + </a> + </h4> + </div> + <div id="collapsetroubleshooting" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/troubleshooting/overview">Overview</a></div> + + </div> + </div> + + + + </div> + </div> + </div> + <div class="col-md-9"> + + <p><a href="/docs/unreleased/">Accumulo unreleased docs</a> >> Development >> Development Tools</p> + + + <h1>Development Tools</h1> + + <p>Normally, Accumulo consists of lots of moving parts. Even a stand-alone version of +Accumulo requires Hadoop, Zookeeper, the Accumulo master, a tablet server, etc. If +you want to write a unit test that uses Accumulo, you need a lot of infrastructure +in place before your test can run.</p> + +<h2 id="mock-accumulo">Mock Accumulo</h2> + +<p>Mock Accumulo supplies mock implementations for much of the client API. It presently +does not enforce users, logins, permissions, etc. It does support Iterators and Combiners. +Note that MockAccumulo holds all data in memory, and will not retain any data or +settings between runs.</p> + +<p>While normal interaction with the Accumulo client looks like this:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">Instance</span> <span class="n">instance</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ZooKeeperInstance</span><span class="o">(...);</span> +<span class="n">Connector</span> <span class="n">conn</span> <span class="o">=</span> <span class="n">instance</span><span class="o">.</span><span class="na">getConnector</span><span class="o">(</span><span class="n">user</span><span class="o">,</span> <span class="n">passwordToken</span><span class="o">);</span> +</code></pre> +</div> + +<p>To interact with the MockAccumulo, just replace the ZooKeeperInstance with MockInstance:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">Instance</span> <span class="n">instance</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MockInstance</span><span class="o">();</span> +</code></pre> +</div> + +<p>In fact, you can use the <code class="highlighter-rouge">--fake</code> option to the Accumulo shell and interact with +MockAccumulo:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>$ accumulo shell --fake -u root -p '' + +Shell - Apache Accumulo Interactive Shell +- +- version: 2.x.x +- instance name: fake +- instance id: mock-instance-id +- +- type 'help' for a list of available commands +- + +root@fake> createtable test + +root@fake test> insert row1 cf cq value +root@fake test> insert row2 cf cq value2 +root@fake test> insert row3 cf cq value3 + +root@fake test> scan +row1 cf:cq [] value +row2 cf:cq [] value2 +row3 cf:cq [] value3 + +root@fake test> scan -b row2 -e row2 +row2 cf:cq [] value2 + +root@fake test> +</code></pre> +</div> + +<p>When testing Map Reduce jobs, you can also set the Mock Accumulo on the AccumuloInputFormat +and AccumuloOutputFormat classes:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="c1">// ... set up job configuration</span> +<span class="n">AccumuloInputFormat</span><span class="o">.</span><span class="na">setMockInstance</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="s">"mockInstance"</span><span class="o">);</span> +<span class="n">AccumuloOutputFormat</span><span class="o">.</span><span class="na">setMockInstance</span><span class="o">(</span><span class="n">job</span><span class="o">,</span> <span class="s">"mockInstance"</span><span class="o">);</span> +</code></pre> +</div> + +<h2 id="mini-accumulo-cluster">Mini Accumulo Cluster</h2> + +<p>While the Mock Accumulo provides a lightweight implementation of the client API for unit +testing, it is often necessary to write more realistic end-to-end integration tests that +take advantage of the entire ecosystem. The Mini Accumulo Cluster makes this possible by +configuring and starting Zookeeper, initializing Accumulo, and starting the Master as well +as some Tablet Servers. It runs against the local filesystem instead of having to start +up HDFS.</p> + +<p>To start it up, you will need to supply an empty directory and a root password as arguments:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">File</span> <span class="n">tempDirectory</span> <span class="o">=</span> <span class="c1">// JUnit and Guava supply mechanisms for creating temp directories</span> +<span class="n">MiniAccumuloCluster</span> <span class="n">accumulo</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MiniAccumuloCluster</span><span class="o">(</span><span class="n">tempDirectory</span><span class="o">,</span> <span class="s">"password"</span><span class="o">);</span> +<span class="n">accumulo</span><span class="o">.</span><span class="na">start</span><span class="o">();</span> +</code></pre> +</div> + +<p>Once we have our mini cluster running, we will want to interact with the Accumulo client API:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">Instance</span> <span class="n">instance</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ZooKeeperInstance</span><span class="o">(</span><span class="n">accumulo</span><span class="o">.</span><span class="na">getInstanceName</span><span class="o">(),</span> <span class="n">accumulo</span><span class="o">.</span><span class="na">getZooKeepers</span><span class="o">());</span> +<span class="n">Connector</span> <span class="n">conn</span> <span class="o">=</span> <span class="n">instance</span><span class="o">.</span><span class="na">getConnector</span><span class="o">(</span><span class="s">"root"</span><span class="o">,</span> <span class="k">new</span> <span class="n">PasswordToken</span><span class="o">(</span><span class="s">"password"</span><span class="o">));</span> +</code></pre> +</div> + +<p>Upon completion of our development code, we will want to shutdown our MiniAccumuloCluster:</p> + +<div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="n">accumulo</span><span class="o">.</span><span class="na">stop</span><span class="o">();</span> +<span class="c1">// delete your temporary folder</span> +</code></pre> +</div> + + </div> +</div> + + </div> + + +<footer> + + <p><a href="https://www.apache.org/foundation/contributing"><img src="https://www.apache.org/images/SupportApache-small.png" alt="Support the ASF" id="asf-logo" height="100" /></a></p> + + <p>Copyright © 2011-2017 The Apache Software Foundation. Licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.</p> + +</footer> + + + </div> + </div> + </div> +</body> +</html> http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7b2eb317/docs/unreleased/development/high_speed_ingest.html ---------------------------------------------------------------------- diff --git a/docs/unreleased/development/high_speed_ingest.html b/docs/unreleased/development/high_speed_ingest.html new file mode 100644 index 0000000..0ad4ef8 --- /dev/null +++ b/docs/unreleased/development/high_speed_ingest.html @@ -0,0 +1,443 @@ +<!DOCTYPE html> +<html lang="en"> +<head> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +<meta charset="utf-8"> +<meta http-equiv="X-UA-Compatible" content="IE=edge"> +<meta name="viewport" content="width=device-width, initial-scale=1"> +<link href="https://maxcdn.bootstrapcdn.com/bootswatch/3.3.7/paper/bootstrap.min.css" rel="stylesheet" integrity="sha384-awusxf8AUojygHf2+joICySzB780jVvQaVCAt1clU3QsyAitLGul28Qxb2r1e5g+" crossorigin="anonymous"> +<link href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" rel="stylesheet"> +<link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/v/bs/jq-2.2.3/dt-1.10.12/datatables.min.css"> +<link href="/css/accumulo.css" rel="stylesheet" type="text/css"> + +<title>Accumulo Documentation - High-Speed Ingest</title> + +<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script> +<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script> +<script type="text/javascript" src="https://cdn.datatables.net/v/bs/jq-2.2.3/dt-1.10.12/datatables.min.js"></script> +<script> + // show location of canonical site if not currently on the canonical site + $(function() { + var host = window.location.host; + if (typeof host !== 'undefined' && host !== 'accumulo.apache.org') { + $('#non-canonical').show(); + } + }); + + $(function() { + // decorate section headers with anchors + return $("h2, h3, h4, h5, h6").each(function(i, el) { + var $el, icon, id; + $el = $(el); + id = $el.attr('id'); + icon = '<i class="fa fa-link"></i>'; + if (id) { + return $el.append($("<a />").addClass("header-link").attr("href", "#" + id).html(icon)); + } + }); + }); + + // configure Google Analytics + (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ + (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), + m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) + })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); + + if (ga.hasOwnProperty('loaded') && ga.loaded === true) { + ga('create', 'UA-50934829-1', 'apache.org'); + ga('send', 'pageview'); + } +</script> + +</head> +<body style="padding-top: 100px"> + + <nav class="navbar navbar-default navbar-fixed-top"> + <div class="container"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar-items"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a href="/"><img id="nav-logo" alt="Apache Accumulo" class="img-responsive" src="/images/accumulo-logo.png" width="200" + /></a> + </div> + <div class="collapse navbar-collapse" id="navbar-items"> + <ul class="nav navbar-nav"> + <li class="nav-link"><a href="/downloads">Download</a></li> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Releases<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/release/accumulo-1.8.1/">1.8.1 (Latest)</a></li> + <li><a href="/release/accumulo-1.7.3/">1.7.3</a></li> + <li><a href="/release/accumulo-1.6.6/">1.6.6</a></li> + <li><a href="/release/">Archive</a></li> + </ul> + </li> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Documentation<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/1.8/accumulo_user_manual.html">User Manual (1.8)</a></li> + <li><a href="/1.8/apidocs">Javadocs (1.8)</a></li> + <li><a href="/1.8/examples">Examples (1.8)</a></li> + <li><a href="/features">Features</a></li> + <li><a href="/glossary">Glossary</a></li> + <li><a href="/external-docs">External Docs</a></li> + <li><a href="/docs-archive/">Archive</a></li> + </ul> + </li> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Community<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/get_involved">Get Involved</a></li> + <li><a href="/mailing_list">Mailing Lists</a></li> + <li><a href="/people">People</a></li> + <li><a href="/related-projects">Related Projects</a></li> + <li><a href="/contributor/">Contributor Guide</a></li> + </ul> + </li> + </ul> + <ul class="nav navbar-nav navbar-right"> + <li class="dropdown"> + <a class="dropdown-toggle" data-toggle="dropdown" href="#">Apache Software Foundation<span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="https://www.apache.org">Apache Homepage <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/licenses/LICENSE-2.0">License <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/foundation/sponsorship">Sponsorship <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/security">Security <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/foundation/thanks">Thanks <i class="fa fa-external-link"></i></a></li> + <li><a href="https://www.apache.org/foundation/policies/conduct">Code of Conduct <i class="fa fa-external-link"></i></a></li> + </ul> + </li> + </ul> + </div> + </div> +</nav> + + <div class="container"> + <div class="row"> + <div class="col-md-12"> + + <div id="non-canonical" style="display: none; background-color: #F0E68C; padding-left: 1em;"> + Visit the official site at: <a href="https://accumulo.apache.org">https://accumulo.apache.org</a> + </div> + <div id="content"> + + <div class="alert alert-danger" role="alert">This documentation is for an unreleased version of Apache Accumulo that is currently under development! Check out the <a href="/docs-1.8/">documentation for the latest release</a>.</div> + +<div class="row"> + <div class="col-md-3"> + <div class="panel-group" id="accordion" role="tablist" aria-multiselectable="true"> + <div class="panel panel-default"> + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapsegetting-started" aria-expanded="false" aria-controls="collapsegetting-started"> + Getting started + </a> + </h4> + </div> + <div id="collapsegetting-started" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/design">Accumulo Design</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/clients">Accumulo Clients</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/shell">Accumulo Shell</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/table_design">Table Design</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/getting-started/table_configuration">Table Configuration</a></div> + + </div> + </div> + + + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapsedevelopment" aria-expanded="true" aria-controls="collapsedevelopment"> + Development + </a> + </h4> + </div> + <div id="collapsedevelopment" class="panel-collapse collapse in" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/iterator_design">Iterator Design</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/iterator_testing">Iterator Testing</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/development_tools">Development Tools</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/sampling">Sampling</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/summaries">Summary Statistics</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/security">Security</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/high_speed_ingest">High-Speed Ingest</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/development/analytics">Analytics</a></div> + + </div> + </div> + + + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapseadministration" aria-expanded="false" aria-controls="collapseadministration"> + Administration + </a> + </h4> + </div> + <div id="collapseadministration" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/overview">Overview</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/configuration-management">Configuration Management</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/configuration-properties">Configuration Properties</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/kerberos">Kerberos</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/replication">Replication</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/fate">FATE</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/multivolume">Multi-Volume Installations</a></div> + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/administration/ssl">SSL</a></div> + + </div> + </div> + + + + + + + + + + + + + + + + + + + + + + <div class="panel-heading" role="tab" id="headingOne"> + <h4 class="panel-title"> + <a role="button" data-toggle="collapse" data-parent="#accordion" href="#collapsetroubleshooting" aria-expanded="false" aria-controls="collapsetroubleshooting"> + Troubleshooting + </a> + </h4> + </div> + <div id="collapsetroubleshooting" class="panel-collapse collapse" role="tabpanel" aria-labelledby="headingOne"> + <div class="panel-body"> + + + <div class="row doc-sidebar-link"><a href="/docs/unreleased/troubleshooting/overview">Overview</a></div> + + </div> + </div> + + + + </div> + </div> + </div> + <div class="col-md-9"> + + <p><a href="/docs/unreleased/">Accumulo unreleased docs</a> >> Development >> High-Speed Ingest</p> + + + <h1>High-Speed Ingest</h1> + + <p>Accumulo is often used as part of a larger data processing and storage system. To +maximize the performance of a parallel system involving Accumulo, the ingestion +and query components should be designed to provide enough parallelism and +concurrency to avoid creating bottlenecks for users and other systems writing to +and reading from Accumulo. There are several ways to achieve high ingest +performance.</p> + +<h2 id="pre-splitting-new-tables">Pre-Splitting New Tables</h2> + +<p>New tables consist of a single tablet by default. As mutations are applied, the table +grows and splits into multiple tablets which are balanced by the Master across +TabletServers. This implies that the aggregate ingest rate will be limited to fewer +servers than are available within the cluster until the table has reached the point +where there are tablets on every TabletServer.</p> + +<p>Pre-splitting a table ensures that there are as many tablets as desired available +before ingest begins to take advantage of all the parallelism possible with the cluster +hardware. Tables can be split at any time by using the shell:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>user@myinstance mytable> addsplits -sf /local_splitfile -t mytable +</code></pre> +</div> + +<p>For the purposes of providing parallelism to ingest it is not necessary to create more +tablets than there are physical machines within the cluster as the aggregate ingest +rate is a function of the number of physical machines. Note that the aggregate ingest +rate is still subject to the number of machines running ingest clients, and the +distribution of rowIDs across the table. The aggregation ingest rate will be +suboptimal if there are many inserts into a small number of rowIDs.</p> + +<h2 id="multiple-ingester-clients">Multiple Ingester Clients</h2> + +<p>Accumulo is capable of scaling to very high rates of ingest, which is dependent upon +not just the number of TabletServers in operation but also the number of ingest +clients. This is because a single client, while capable of batching mutations and +sending them to all TabletServers, is ultimately limited by the amount of data that +can be processed on a single machine. The aggregate ingest rate will scale linearly +with the number of clients up to the point at which either the aggregate I/O of +TabletServers or total network bandwidth capacity is reached.</p> + +<p>In operational settings where high rates of ingest are paramount, clusters are often +configured to dedicate some number of machines solely to running Ingester Clients. +The exact ratio of clients to TabletServers necessary for optimum ingestion rates +will vary according to the distribution of resources per machine and by data type.</p> + +<h2 id="bulk-ingest">Bulk Ingest</h2> + +<p>Accumulo supports the ability to import files produced by an external process such +as MapReduce into an existing table. In some cases it may be faster to load data this +way rather than via ingesting through clients using BatchWriters. This allows a large +number of machines to format data the way Accumulo expects. The new files can +then simply be introduced to Accumulo via a shell command.</p> + +<p>To configure MapReduce to format data in preparation for bulk loading, the job +should be set to use a range partitioner instead of the default hash partitioner. The +range partitioner uses the split points of the Accumulo table that will receive the +data. The split points can be obtained from the shell and used by the MapReduce +RangePartitioner. Note that this is only useful if the existing table is already split +into multiple tablets.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>user@myinstance mytable> getsplits +aa +ab +ac +... +zx +zy +zz +</code></pre> +</div> + +<p>Run the MapReduce job, using the AccumuloFileOutputFormat to create the files to +be introduced to Accumulo. Once this is complete, the files can be added to +Accumulo via the shell:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>user@myinstance mytable> importdirectory /files_dir /failures +</code></pre> +</div> + +<p>Note that the paths referenced are directories within the same HDFS instance over +which Accumulo is running. Accumulo places any files that failed to be added to the +second directory specified.</p> + +<p>See the <a href="https://github.com/apache/accumulo-examples/blob/master/docs/bulkIngest.md">bulk ingest example</a> +for a complete example.</p> + +<h2 id="logical-time-for-bulk-ingest">Logical Time for Bulk Ingest</h2> + +<p>Logical time is important for bulk imported data, for which the client code may +be choosing a timestamp. At bulk import time, the user can choose to enable +logical time for the set of files being imported. When its enabled, Accumulo +uses a specialized system iterator to lazily set times in a bulk imported file. +This mechanism guarantees that times set by unsynchronized multi-node +applications (such as those running on MapReduce) will maintain some semblance +of causal ordering. This mitigates the problem of the time being wrong on the +system that created the file for bulk import. These times are not set when the +file is imported, but whenever it is read by scans or compactions. At import, a +time is obtained and always used by the specialized system iterator to set that +time.</p> + +<p>The timestamp assigned by Accumulo will be the same for every key in the file. +This could cause problems if the file contains multiple keys that are identical +except for the timestamp. In this case, the sort order of the keys will be +undefined. This could occur if an insert and an update were in the same bulk +import file.</p> + +<h2 id="mapreduce-ingest">MapReduce Ingest</h2> + +<p>It is possible to efficiently write many mutations to Accumulo in parallel via a +MapReduce job. In this scenario the MapReduce is written to process data that lives +in HDFS and write mutations to Accumulo using the AccumuloOutputFormat. See +the MapReduce section under Analytics for details. The <a href="https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md">MapReduce example</a> +is also a good reference for example code.</p> + + </div> +</div> + + </div> + + +<footer> + + <p><a href="https://www.apache.org/foundation/contributing"><img src="https://www.apache.org/images/SupportApache-small.png" alt="Support the ASF" id="asf-logo" height="100" /></a></p> + + <p>Copyright © 2011-2017 The Apache Software Foundation. Licensed under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.</p> + +</footer> + + + </div> + </div> + </div> +</body> +</html>