Merge branch '1.5.2-SNAPSHOT' into 1.6.1-SNAPSHOT

Conflicts:
    docs/src/main/resources/examples/README.mapred
    docs/src/main/resources/examples/README.maxmutation

Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/d7c1125d
Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/d7c1125d
Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/d7c1125d

Branch: refs/heads/1.6.1-SNAPSHOT
Commit: d7c1125d3d8a8101e121f112303762a65c30f7da
Parents: e8916f1 9f3cbb3
Author: Josh Elser <els...@apache.org>
Authored: Wed Jul 23 01:08:37 2014 -0400
Committer: Josh Elser <els...@apache.org>
Committed: Wed Jul 23 01:08:37 2014 -0400

----------------------------------------------------------------------
 docs/src/main/resources/examples/README.batch       |  2 +-
 docs/src/main/resources/examples/README.bloom       |  8 ++++----
 docs/src/main/resources/examples/README.maxmutation | 12 +++++++-----
 docs/src/main/resources/examples/README.regex       |  3 +--
 4 files changed, 13 insertions(+), 12 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/accumulo/blob/d7c1125d/docs/src/main/resources/examples/README.batch
----------------------------------------------------------------------
diff --cc docs/src/main/resources/examples/README.batch
index 05f2304,0000000..463481b
mode 100644,000000..100644
--- a/docs/src/main/resources/examples/README.batch
+++ b/docs/src/main/resources/examples/README.batch
@@@ -1,55 -1,0 +1,55 @@@
 +Title: Apache Accumulo Batch Writing and Scanning Example
 +Notice:    Licensed to the Apache Software Foundation (ASF) under one
 +           or more contributor license agreements. See the NOTICE file
 +           distributed with this work for additional information
 +           regarding copyright ownership. The ASF licenses this file
 +           to you under the Apache License, Version 2.0 (the
 +           "License"); you may not use this file except in compliance
 +           with the License. You may obtain a copy of the License at
 +           .
 +             http://www.apache.org/licenses/LICENSE-2.0
 +           .
 +           Unless required by applicable law or agreed to in writing,
 +           software distributed under the License is distributed on an
 +           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 +           KIND, either express or implied. See the License for the
 +           specific language governing permissions and limitations
 +           under the License.
 +
 +This tutorial uses the following Java classes, which can be found in org.apache.accumulo.examples.simple.client in the examples-simple module:
 +
 + * SequentialBatchWriter.java - writes mutations with sequential rows and random values
 + * RandomBatchWriter.java - used by SequentialBatchWriter to generate random values
 + * RandomBatchScanner.java - reads random rows and verifies their values
 +
 +This is an example of how to use the batch writer and batch scanner. To compile
 +the example, run maven and copy the produced jar into the accumulo lib dir.
 +This is already done in the tar distribution.
 +
 +Below are commands that add 10000 entries to accumulo and then do 100 random
 +queries. The write command generates random 50 byte values.
 +
 +Be sure to use the name of your instance (given as instance here) and the appropriate
 +list of zookeeper nodes (given as zookeepers here).
 +
 +Before you run this, you must ensure that the user you are running has the
 +"exampleVis" authorization. (you can set this in the shell with "setauths -u username -s exampleVis")
 +
 +    $ ./bin/accumulo shell -u root -e "setauths -u username -s exampleVis"
 +
 +You must also create the table, batchtest1, ahead of time. (In the shell, use "createtable batchtest1")
 +
 +    $ ./bin/accumulo shell -u username -e "createtable batchtest1"
 +    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.SequentialBatchWriter -i instance -z zookeepers -u username -p password -t batchtest1 --start 0 --num 10000 --size 50 --batchMemory 20M --batchLatency 500 --batchThreads 20 --vis exampleVis
-     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner -i instance -z zookeepers -u username -p password -t batchtest1 --num 100 --min 0 --max 10000 --size 50 --scanThreads 20 --vis exampleVis
++    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner -i instance -z zookeepers -u username -p password -t batchtest1 --num 100 --min 0 --max 10000 --size 50 --scanThreads 20 --auths exampleVis
 +    07 11:33:11,103 [client.CountingVerifyingReceiver] INFO : Generating 100 random queries...
 +    07 11:33:11,112 [client.CountingVerifyingReceiver] INFO : finished
 +    07 11:33:11,260 [client.CountingVerifyingReceiver] INFO : 694.44 lookups/sec 0.14 secs
 +
 +    07 11:33:11,260 [client.CountingVerifyingReceiver] INFO : num results : 100
 +
 +    07 11:33:11,364 [client.CountingVerifyingReceiver] INFO : Generating 100 random queries...
 +    07 11:33:11,370 [client.CountingVerifyingReceiver] INFO : finished
 +    07 11:33:11,416 [client.CountingVerifyingReceiver] INFO : 2173.91 lookups/sec 0.05 secs
 +
 +    07 11:33:11,416 [client.CountingVerifyingReceiver] INFO : num results : 100

http://git-wip-us.apache.org/repos/asf/accumulo/blob/d7c1125d/docs/src/main/resources/examples/README.bloom
----------------------------------------------------------------------
diff --cc docs/src/main/resources/examples/README.bloom
index 6fe4602,0000000..555f06d
mode 100644,000000..100644
--- a/docs/src/main/resources/examples/README.bloom
+++ b/docs/src/main/resources/examples/README.bloom
@@@ -1,219 -1,0 +1,219 @@@
 +Title: Apache Accumulo Bloom Filter Example
 +Notice:    Licensed to the Apache Software Foundation (ASF) under one
 +           or more contributor license agreements. See the NOTICE file
 +           distributed with this work for additional information
 +           regarding copyright ownership. The ASF licenses this file
 +           to you under the Apache License, Version 2.0 (the
 +           "License"); you may not use this file except in compliance
 +           with the License. You may obtain a copy of the License at
 +           .
 +             http://www.apache.org/licenses/LICENSE-2.0
 +           .
 +           Unless required by applicable law or agreed to in writing,
 +           software distributed under the License is distributed on an
 +           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 +           KIND, either express or implied. See the License for the
 +           specific language governing permissions and limitations
 +           under the License.
 +
 +This example shows how to create a table with bloom filters enabled. It also
 +shows how bloom filters increase query performance when looking for values that
 +do not exist in a table.
 +
 +Below table named bloom_test is created and bloom filters are enabled.
 +
 +    $ ./bin/accumulo shell -u username -p password
 +    Shell - Apache Accumulo Interactive Shell
 +    - version: 1.5.0
 +    - instance name: instance
 +    - instance id: 00000000-0000-0000-0000-000000000000
 +    -
 +    - type 'help' for a list of available commands
 +    -
 +    username@instance> setauths -u username -s exampleVis
 +    username@instance> createtable bloom_test
 +    username@instance bloom_test> config -t bloom_test -s table.bloom.enabled=true
 +    username@instance bloom_test> exit
 +
 +Below 1 million random values are inserted into accumulo. The randomly
 +generated rows range between 0 and 1 billion. The random number generator is
 +initialized with the seed 7.
 +
-     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 1000000 -min 0 -max 1000000000 -valueSize 50 -batchMemory 2M -batchLatency 60s -batchThreads 3 --vis exampleVis
++    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis
 +
 +Below the table is flushed:
 +
 +    $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test -w'
 +    05 10:40:06,069 [shell.Shell] INFO : Flush of table bloom_test completed.
 +
 +After the flush completes, 500 random queries are done against the table. The
 +same seed is used to generate the queries, therefore everything is found in the
 +table.
 +
-     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 -batchThreads 20 --vis exampleVis
++    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
 +    Generating 500 random queries...finished
 +    96.19 lookups/sec 5.20 secs
 +    num results : 500
 +    Generating 500 random queries...finished
 +    102.35 lookups/sec 4.89 secs
 +    num results : 500
 +
 +Below another 500 queries are performed, using a different seed which results
 +in nothing being found. In this case the lookups are much faster because of
 +the bloom filters.
 +
 +    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 8 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 -batchThreads 20 -auths exampleVis
 +    Generating 500 random queries...finished
 +    2212.39 lookups/sec 0.23 secs
 +    num results : 0
 +    Did not find 500 rows
 +    Generating 500 random queries...finished
 +    4464.29 lookups/sec 0.11 secs
 +    num results : 0
 +    Did not find 500 rows
 +
 +********************************************************************************
 +
 +Bloom filters can also speed up lookups for entries that exist. In accumulo
 +data is divided into tablets and each tablet has multiple map files. Every
 +lookup in accumulo goes to a specific tablet where a lookup is done on each
 +map file in the tablet. So if a tablet has three map files, lookup performance
 +can be three times slower than a tablet with one map file. However if the map
 +files contain unique sets of data, then bloom filters can help eliminate map
 +files that do not contain the row being looked up. To illustrate this two
 +identical tables were created using the following process. One table had bloom
 +filters, the other did not. Also the major compaction ratio was increased to
 +prevent the files from being compacted into one file.
 +
 + * Insert 1 million entries using RandomBatchWriter with a seed of 7
 + * Flush the table using the shell
 + * Insert 1 million entries using RandomBatchWriter with a seed of 8
 + * Flush the table using the shell
 + * Insert 1 million entries using RandomBatchWriter with a seed of 9
 + * Flush the table using the shell
 +
 +After following the above steps, each table will have a tablet with three map
 +files. Flushing the table after each batch of inserts will create a map file.
 +Each map file will contain 1 million entries generated with a different seed.
 +This is assuming that Accumulo is configured with enough memory to hold 1
 +million inserts. If not, then more map files will be created.
 +
 +The commands for creating the first table without bloom filters are below.
 +
 +    $ ./bin/accumulo shell -u username -p password
 +    Shell - Apache Accumulo Interactive Shell
 +    - version: 1.5.0
 +    - instance name: instance
 +    - instance id: 00000000-0000-0000-0000-000000000000
 +    -
 +    - type 'help' for a list of available commands
 +    -
 +    username@instance> setauths -u username -s exampleVis
 +    username@instance> createtable bloom_test1
 +    username@instance bloom_test1> config -t bloom_test1 -s table.compaction.major.ratio=7
 +    username@instance bloom_test1> exit
 +
-     $ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test1 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --auths exampleVis"
++    $ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test1 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis"
 +    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS
 +    $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
 +    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS
 +    $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
 +    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS
 +    $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
 +
 +The commands for creating the second table with bloom filers are below.
 +
 +    $ ./bin/accumulo shell -u username -p password
 +    Shell - Apache Accumulo Interactive Shell
 +    - version: 1.5.0
 +    - instance name: instance
 +    - instance id: 00000000-0000-0000-0000-000000000000
 +    -
 +    - type 'help' for a list of available commands
 +    -
 +    username@instance> setauths -u username -s exampleVis
 +    username@instance> createtable bloom_test2
 +    username@instance bloom_test2> config -t bloom_test2 -s table.compaction.major.ratio=7
 +    username@instance bloom_test2> config -t bloom_test2 -s table.bloom.enabled=true
 +    username@instance bloom_test2> exit
 +
-     $ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test2 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --auths exampleVis"
++    $ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test2 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis"
 +    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS
 +    $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
 +    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS
 +    $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
 +    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS
 +    $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
 +
 +Below 500 lookups are done against the table without bloom filters using random
 +NG seed 7. Even though only one map file will likely contain entries for this
 +seed, all map files will be interrogated.
 +
 +    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test1 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
 +    Generating 500 random queries...finished
 +    35.09 lookups/sec 14.25 secs
 +    num results : 500
 +    Generating 500 random queries...finished
 +    35.33 lookups/sec 14.15 secs
 +    num results : 500
 +
 +Below the same lookups are done against the table with bloom filters. The
 +lookups were 2.86 times faster because only one map file was used, even though three
 +map files existed.
 +
 +    $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test2 --num 500 --min 0 --max 1000000000 --size 50 -scanThreads 20 --auths exampleVis
 +    Generating 500 random queries...finished
 +    99.03 lookups/sec 5.05 secs
 +    num results : 500
 +    Generating 500 random queries...finished
 +    101.15 lookups/sec 4.94 secs
 +    num results : 500
 +
 +You can verify the table has three files by looking in HDFS. To look in HDFS
 +you will need the table ID, because this is used in HDFS instead of the table
 +name. The following command will show table ids.
 +
 +    $ ./bin/accumulo shell -u username -p password -e 'tables -l'
 +    accumulo.metadata => !0
 +    accumulo.root => +r
 +    bloom_test1 => o7
 +    bloom_test2 => o8
 +    trace => 1
 +
 +So the table id for bloom_test2 is o8. The command below shows what files this
 +table has in HDFS. This assumes Accumulo is at the default location in HDFS.
 +
 +    $ hadoop fs -lsr /accumulo/tables/o8
 +    drwxr-xr-x - username supergroup 0 2012-01-10 14:02 /accumulo/tables/o8/default_tablet
 +    -rw-r--r-- 3 username supergroup 52672650 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dj.rf
 +    -rw-r--r-- 3 username supergroup 52436176 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dk.rf
 +    -rw-r--r-- 3 username supergroup 52850173 2012-01-10 14:02 /accumulo/tables/o8/default_tablet/F00000dl.rf
 +
 +Running the rfile-info command shows that one of the files has a bloom filter
 +and its 1.5MB.
 +
 +    $ ./bin/accumulo rfile-info /accumulo/tables/o8/default_tablet/F00000dj.rf
 +    Locality group : <DEFAULT>
 +        Start block : 0
 +        Num blocks : 752
 +        Index level 0 : 43,598 bytes 1 blocks
 +        First key : row_0000001169 foo:1 [exampleVis] 1326222052539 false
 +        Last key : row_0999999421 foo:1 [exampleVis] 1326222052058 false
 +        Num entries : 999,536
 +        Column families : [foo]
 +
 +    Meta block : BCFile.index
 +      Raw size : 4 bytes
 +      Compressed size : 12 bytes
 +      Compression type : gz
 +
 +    Meta block : RFile.index
 +      Raw size : 43,696 bytes
 +      Compressed size : 15,592 bytes
 +      Compression type : gz
 +
 +    Meta block : acu_bloom
 +      Raw size : 1,540,292 bytes
 +      Compressed size : 1,433,115 bytes
 +      Compression type : gz
 +

http://git-wip-us.apache.org/repos/asf/accumulo/blob/d7c1125d/docs/src/main/resources/examples/README.maxmutation
----------------------------------------------------------------------
diff --cc docs/src/main/resources/examples/README.maxmutation
index 7fb3e08,0000000..45b80d4
mode 100644,000000..100644
--- a/docs/src/main/resources/examples/README.maxmutation
+++ b/docs/src/main/resources/examples/README.maxmutation
@@@ -1,47 -1,0 +1,49 @@@
 +Title: Apache Accumulo MaxMutation Constraints Example
 +Notice:    Licensed to the Apache Software Foundation (ASF) under one
 +           or more contributor license agreements. See the NOTICE file
 +           distributed with this work for additional information
 +           regarding copyright ownership. The ASF licenses this file
 +           to you under the Apache License, Version 2.0 (the
 +           "License"); you may not use this file except in compliance
 +           with the License. You may obtain a copy of the License at
 +           .
 +             http://www.apache.org/licenses/LICENSE-2.0
 +           .
 +           Unless required by applicable law or agreed to in writing,
 +           software distributed under the License is distributed on an
 +           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 +           KIND, either express or implied. See the License for the
 +           specific language governing permissions and limitations
 +           under the License.
 +
 +This an example of how to limit the size of mutations that will be accepted into
 +a table. Under the default configuration, accumulo does not provide a limitation
 +on the size of mutations that can be ingested. Poorly behaved writers might
 +inadvertently create mutations so large, that they cause the tablet servers to
 +run out of memory. A simple contraint can be added to a table to reject very
 +large mutations.
 +
 +    $ ./bin/accumulo shell -u username -p password
 +
 +    Shell - Apache Accumulo Interactive Shell
 +    -
 +    - version: 1.5.0
 +    - instance name: instance
 +    - instance id: 00000000-0000-0000-0000-000000000000
 +    -
 +    - type 'help' for a list of available commands
 +    -
 +    username@instance> createtable test_ingest
 +    username@instance test_ingest> config -t test_ingest -s table.constraint.1=org.apache.accumulo.examples.simple.constraints.MaxMutationSize
 +    username@instance test_ingest>
 +
 +
- Now the table will reject any mutation that is larger than 1/256th of the
- working memory of the tablet server. The following command attempts to ingest
- a single row with 10000 columns, which exceeds the memory limit:
++Now the table will reject any mutation that is larger than 1/256th of the
++working memory of the tablet server. The following command attempts to ingest
++a single row with 10000 columns, which exceeds the memory limit. Depending on the
++amount of Java heap your tserver(s) are given, you may have to increase the number
++of columns provided to see the failure.
 +
-     $ ./bin/accumulo org.apache.accumulo.test.TestIngest -i instance -z zookeepers -u username -p password --rows 1 --cols 10000
-     ERROR : Constraint violates : ConstraintViolationSummary(constrainClass:org.apache.accumulo.examples.simple.constraints.MaxMutationSize, violationCode:0, violationDescription:mutation exceeded maximum size of 188160, numberOfViolatingMutations:1)
++    $ ./bin/accumulo org.apache.accumulo.test.TestIngest -i instance -z zookeepers -u username -p password --rows 1 --cols 10000
++    ERROR : Constraint violates : ConstraintViolationSummary(constrainClass:org.apache.accumulo.examples.simple.constraints.MaxMutationSize, violationCode:0, violationDescription:mutation exceeded maximum size of 188160, numberOfViolatingMutations:1)
 +

http://git-wip-us.apache.org/repos/asf/accumulo/blob/d7c1125d/docs/src/main/resources/examples/README.regex
----------------------------------------------------------------------
diff --cc docs/src/main/resources/examples/README.regex
index a5cc854,0000000..ea9f208
mode 100644,000000..100644
--- a/docs/src/main/resources/examples/README.regex
+++ b/docs/src/main/resources/examples/README.regex
@@@ -1,58 -1,0 +1,57 @@@
 +Title: Apache Accumulo Regex Example
 +Notice:    Licensed to the Apache Software Foundation (ASF) under one
 +           or more contributor license agreements. See the NOTICE file
 +           distributed with this work for additional information
 +           regarding copyright ownership. The ASF licenses this file
 +           to you under the Apache License, Version 2.0 (the
 +           "License"); you may not use this file except in compliance
 +           with the License. You may obtain a copy of the License at
 +           .
 +             http://www.apache.org/licenses/LICENSE-2.0
 +           .
 +           Unless required by applicable law or agreed to in writing,
 +           software distributed under the License is distributed on an
 +           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 +           KIND, either express or implied. See the License for the
 +           specific language governing permissions and limitations
 +           under the License.
 +
 +This example uses mapreduce and accumulo to find items using regular expressions.
 +This is accomplished using a map-only mapreduce job and a scan-time iterator.
 +
 +To run this example you will need some data in a table. The following will
 +put a trivial amount of data into accumulo using the accumulo shell:
 +
 +    $ ./bin/accumulo shell -u username -p password
 +    Shell - Apache Accumulo Interactive Shell
 +    - version: 1.5.0
 +    - instance name: instance
 +    - instance id: 00000000-0000-0000-0000-000000000000
 +    -
 +    - type 'help' for a list of available commands
 +    -
 +    username@instance> createtable input
 +    username@instance> insert dogrow dogcf dogcq dogvalue
 +    username@instance> insert catrow catcf catcq catvalue
 +    username@instance> quit
 +
 +The RegexExample class sets an iterator on the scanner. This does pattern matching
 +against each key/value in accumulo, and only returns matching items. It will do this
 +in parallel and will store the results in files in hdfs.
 +
 +The following will search for any rows in the input table that starts with "dog":
 +
 +    $ bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.RegexExample -u user -p passwd -i instance -t input --rowRegex 'dog.*' --output /tmp/output
 +
 +    $ hadoop fs -ls /tmp/output
 +    Found 3 items
 +    -rw-r--r-- 1 username supergroup 0 2013-01-10 14:11 /tmp/output/_SUCCESS
 +    drwxr-xr-x - username supergroup 0 2013-01-10 14:10 /tmp/output/_logs
 +    -rw-r--r-- 1 username supergroup 51 2013-01-10 14:10 /tmp/output/part-m-00000
 +
 +We can see the output of our little map-reduce job:
 +
-     $ hadoop fs -text /tmp/output/output/part-m-00000
++    $ hadoop fs -text /tmp/output/part-m-00000
 +    dogrow dogcf:dogcq [] 1357844987994 false dogvalue
-     $
 +
 +
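
Note on the README.bloom text in this commit: it explains that when a tablet's map files hold distinct sets of rows, a per-file bloom filter lets a lookup skip files that cannot contain the row. The Python below is a toy sketch of that idea only, not Accumulo's implementation. The BloomFilter class, its sizing, the SHA-256 based hashing, the files_to_search helper, and the file and row names are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: k hash positions over a fixed-size bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def files_to_search(filters, row):
    """Return the names of files whose filter cannot rule out the row."""
    return [name for name, bf in filters.items() if bf.might_contain(row)]


# Three hypothetical map files holding disjoint sets of rows,
# loosely mirroring the three flushes in the README.
filters = {}
for index, name in enumerate(["file_a", "file_b", "file_c"]):
    bf = BloomFilter()
    for n in range(1000):
        bf.add(f"row_{index}_{n:06d}")
    filters[name] = bf

candidates = files_to_search(filters, "row_1_000042")
print(candidates)
```

A lookup then only needs to open the files returned by files_to_search. The filter never reports a stored row as absent, so the file that holds the row is always in the list; it may occasionally keep a file that does not hold the row, which costs a wasted read but never a wrong answer.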