accumulo git commit: Added sampling to release notes

kturner Tue, 06 Sep 2016 08:19:19 -0700

Repository: accumulo
Updated Branches:
  refs/heads/gh-pages be06c7629 -> e70549671



Added sampling to release notes


Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/e7054967
Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/e7054967
Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/e7054967

Branch: refs/heads/gh-pages
Commit: e705496714f39f5bf1383710ba253adb695948d7
Parents: be06c76
Author: Keith Turner <ktur...@apache.org>
Authored: Tue Sep 6 11:18:07 2016 -0400
Committer: Keith Turner <ktur...@apache.org>
Committed: Tue Sep 6 11:18:07 2016 -0400

----------------------------------------------------------------------
 release_notes/1.8.0.md | 55 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 12 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/accumulo/blob/e7054967/release_notes/1.8.0.md
----------------------------------------------------------------------
diff --git a/release_notes/1.8.0.md b/release_notes/1.8.0.md
index b191dfb..6dfe2ad 100644
--- a/release_notes/1.8.0.md
+++ b/release_notes/1.8.0.md
@@ -47,8 +47,8 @@ default. Root tablet assignment can not be suspended. See 
[ACCUMULO-4353] for mo
 
 ### Run multiple Tablet Servers on one node
 
-[ACCUMULO-4328] introduces the capability of running multiple tservers on a 
single node. This intended for nodes with a large
-amount of memory. This feature is disabled by default. There are several 
related tickets: [ACCUMULO-4072], [ACCUMULO-4331]
+[ACCUMULO-4328] introduces the capability of running multiple tservers on a 
single node. This is intended for nodes with large
+amounts of memory and/or disk. This feature is disabled by default. There are 
several related tickets: [ACCUMULO-4072], [ACCUMULO-4331]
 and [ACCUMULO-4406]. Note that when this is enabled, the names of the log 
files change. Previous log file names were defined in the
 generic_logger.xml as 
`${org.apache.accumulo.core.application}_${org.apache.accumulo.core.ip.localhost.hostname}.log`.
 The files will now include the instance id after the application with
@@ -60,11 +60,32 @@ names do not change if this feature is not used.
 
 ### Rate limiting Major Compactions
 
-Major Compactions can significantly increase the amount of load on 
TabletServers. [ACCUMULO-4187] take a cue from Apache
+Major Compactions can significantly increase the amount of load on 
TabletServers. [ACCUMULO-4187] takes a cue from Apache
 Cassandra and restricts the rate at which data is read and written when 
performing major compactions. This has a direct effect
 on the IO load caused by major compactions with a similar effect on the CPU 
utilization. This behavior is controlled
 by a new property `tserver.compaction.major.throughput` with a default of 0B 
which disables the rate limiting.
 
+### Sampling
+
+Queryable sample data was added by [ACCUMULO-3913].  This allows users to 
configure a pluggable
+function to generate sample data.  At scan time, the sample data can 
optionally be scanned.
+Iterators also have access to sample data.  Iterators can access all data and 
sample data; this
+allows an iterator to use sample data for query optimizations.  The new user 
level RFile API
+supports writing RFiles with sample data for bulk import.
+
+A simple configurable sampler function is included with Accumulo.  This 
sampler uses hashing and
+can be configured to use a subset of Key fields.  For example, if it is 
desired to have entire rows
+in the sample, then this sampler would be configured to hash+mod the row.   
Then when a row is
+selected for the sample, all of its columns and all of its updates will be in 
the sample data.
+Another scenario is one in which a document id is in the column qualifier.  In 
this scenario, one
+would either want all data related to a document in the sample data or none.  
To achieve this, the
+sampler could be configured to hash+mod on the column qualifier.  See the 
sample [Readme
+example][sample] and javadocs on the new APIs for more information.
+
+For sampling to work, all tablets scanned must have pre-generated sample data 
that was generated in
+the same way.  If this is not the case then scans will fail.  For existing 
tables, samples can be
+generated by configuring sampling on the table and compacting the table.
+
 ### Upgrade to Apache Thrift 0.9.3
 
 Accumulo relies on Apache Thrift to implement remote procedure calls between 
Accumulo services.
@@ -74,7 +95,7 @@ on the changes to Thrift.
 ### Iterator Test Harness
 
 Users often write iterators without fully understanding their limits and 
lifetime. Previously, Accumulo did
-not provide any means in which a user could test iterators to catch common 
issues that only become apparant
+not provide any means in which a user could test iterators to catch common 
issues that only become apparent
 in multi-node production deployments. [ACCUMULO-626] provides a framework and 
a collection of initial tests
 which can be used to simulate common issues with Iterators that only appear in 
production deployments. This test
 harness can be used directly by users as a supplemental tool to unit tests and 
integration tests with MiniAccumuloCluster.
@@ -93,14 +114,18 @@ defaults out of the ephemeral range, we can guarantee that 
the Monitor and GC wi
 
 ## Other Notable Changes
 
- * [ACCUMULO-1055][ACCUMULO-1055] Configurable maximum file size for merging 
minor compactions
- * [ACCUMULO-1124][ACCUMULO-1124] Optimization of RFile index
- * [ACCUMULO-2883][ACCUMULO-2883] API to fetch current tablet assignments
- * [ACCUMULO-3871][ACCUMULO-3871] Support for running integration tests in 
MapReduce
- * [ACCUMULO-3920][ACCUMULO-3920] Deprecate the MockAccumulo class and remove 
usage in our tests
- * [ACCUMULO-4339][ACCUMULO-4339] Make hadoop-minicluster optional dependency 
of acccumulo-minicluster
- * [ACCUMULO-4354][ACCUMULO-4354] Bump dependency versions to include gson, 
jetty, and sl4j
- * [ACCUMULO-3735][ACCUMULO-3735] Bulk Import status page on the monitor
+ * [ACCUMULO-1055] Configurable maximum file size for merging minor compactions
+ * [ACCUMULO-1124] Optimization of RFile index
+ * [ACCUMULO-2883] API to fetch current tablet assignments
+ * [ACCUMULO-3871] Support for running integration tests in MapReduce
+ * [ACCUMULO-3920] Deprecate the MockAccumulo class and remove usage in our 
tests
+ * [ACCUMULO-4339] Make hadoop-minicluster optional dependency of 
accumulo-minicluster
+ * [ACCUMULO-4318] BatchWriter, ConditionalWriter, and ScannerBase now extend 
AutoCloseable
+ * [ACCUMULO-4326] Value constructor now accepts Strings (and CharSequences)
+ * [ACCUMULO-4354] Bump dependency versions to include gson, jetty, and slf4j
+ * [ACCUMULO-3735] Bulk Import status page on the monitor
+ * [ACCUMULO-4066] Reduced time to process conditional mutations.
+ * [ACCUMULO-4164] Reduced seek time for cached data.
 
 ## Testing
 
@@ -127,11 +152,16 @@ HDFS High-Availability instances, forcing NameNode 
failover.
 [ACCUMULO-3423]: https://issues.apache.org/jira/browse/ACCUMULO-3423
 [ACCUMULO-3735]: https://issues.apache.org/jira/browse/ACCUMULO-3735
 [ACCUMULO-3871]: https://issues.apache.org/jira/browse/ACCUMULO-3871
+[ACCUMULO-3913]: https://issues.apache.org/jira/browse/ACCUMULO-3913
 [ACCUMULO-3920]: https://issues.apache.org/jira/browse/ACCUMULO-3920
 [ACCUMULO-4072]: https://issues.apache.org/jira/browse/ACCUMULO-4072
 [ACCUMULO-4077]: https://issues.apache.org/jira/browse/ACCUMULO-4077
+[ACCUMULO-4066]: https://issues.apache.org/jira/browse/ACCUMULO-4066
+[ACCUMULO-4164]: https://issues.apache.org/jira/browse/ACCUMULO-4164
 [ACCUMULO-4165]: https://issues.apache.org/jira/browse/ACCUMULO-4165
 [ACCUMULO-4187]: https://issues.apache.org/jira/browse/ACCUMULO-4187
+[ACCUMULO-4318]: https://issues.apache.org/jira/browse/ACCUMULO-4318
+[ACCUMULO-4326]: https://issues.apache.org/jira/browse/ACCUMULO-4326
 [ACCUMULO-4328]: https://issues.apache.org/jira/browse/ACCUMULO-4328
 [ACCUMULO-4331]: https://issues.apache.org/jira/browse/ACCUMULO-4331
 [ACCUMULO-4339]: https://issues.apache.org/jira/browse/ACCUMULO-4339
@@ -144,4 +174,5 @@ HDFS High-Availability instances, forcing NameNode failover.
 [THRIFT-0.9.3-RN]: https://github.com/apache/thrift/blob/0.9.3/CHANGES
 [api]: https://github.com/apache/accumulo/blob/1.8/README.md#api
 [semver]: http://semver.org
+[sample]: http://accumulo.apache.org/1.8/examples/sample
 [ITER_TEST]: 
https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterator_testing
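
The hash+mod idea described in the new Sampling section can be illustrated with a small standalone sketch. This is not Accumulo code and does not use the Accumulo sampling API: the class `HashModSampler`, its methods, and the `{row, qualifier, value}` array layout are hypothetical, and CRC32 is used only as a convenient standard-library hash standing in for whatever hash function the real sampler is configured with. The sketch shows why hashing a single Key field (here the row) keeps or drops all entries sharing that field together.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

public class HashModSampler {

    // Hypothetical helper: a field is "in the sample" when hash(field) mod modulus == 0.
    // CRC32 is an assumption made for illustration; modulus must be positive.
    static boolean inSample(String field, int modulus) {
        CRC32 crc = new CRC32();
        crc.update(field.getBytes(StandardCharsets.UTF_8));
        return crc.getValue() % modulus == 0;
    }

    // Each entry is {row, qualifier, value}. Only the row is hashed, so every
    // column of a selected row lands in the sample, and no column of an
    // unselected row does.
    static List<String[]> sampleByRow(List<String[]> entries, int modulus) {
        List<String[]> sample = new ArrayList<>();
        for (String[] entry : entries) {
            if (inSample(entry[0], modulus)) {
                sample.add(entry);
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        List<String[]> entries = new ArrayList<>();
        for (int r = 0; r < 20; r++) {
            entries.add(new String[] {"row" + r, "colA", "v1"});
            entries.add(new String[] {"row" + r, "colB", "v2"});
        }
        // Roughly one row in three is expected to be kept with modulus 3.
        for (String[] entry : sampleByRow(entries, 3)) {
            System.out.println(Arrays.toString(entry));
        }
    }
}
```

Hashing the column qualifier instead of the row (index 1 instead of index 0 in this sketch) would give the all-or-nothing behavior per document id that the notes describe for the second scenario.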
