http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/ssl.md ---------------------------------------------------------------------- diff --git a/docs/master/ssl.md b/docs/master/ssl.md deleted file mode 100644 index d315c6e..0000000 --- a/docs/master/ssl.md +++ /dev/null @@ -1,134 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -== SSL -Accumulo, through Thrift's TSSLTransport, provides the ability to encrypt -wire communication between Accumulo servers and clients using secure -sockets layer (SSL). SSL certificates signed by the same certificate authority -control the "circle of trust" in which a secure connection can be established. -Typically, each host running Accumulo processes would be given a certificate -that identifies it. - -Clients can optionally also be given a certificate, when client authentication is enabled, -which prevents unwanted clients from accessing the system. The SSL integration -presently provides no authentication support within Accumulo (an Accumulo username -and password are still required) and is only used to establish a means for -secure communication. - -=== Server configuration - -As previously mentioned, the circle of trust is established by the certificate -authority which created the certificates in use. Because of the tight coupling -of certificate generation with an organization's policies, Accumulo does not -provide a method to automatically create the necessary SSL components. - -Administrators without existing infrastructure built on SSL are encouraged to -use OpenSSL and the +keytool+ command. Examples of these commands are -included in a section below. Accumulo servers require a certificate and keystore, -in the form of Java KeyStores, to enable SSL. The following configuration assumes -these files already exist. - -In +accumulo-site.xml+, the following properties are required: - -* *rpc.javax.net.ssl.keyStore*=_The path on the local filesystem to the keystore containing the server's certificate_ -* *rpc.javax.net.ssl.keyStorePassword*=_The password for the keystore containing the server's certificate_ -* *rpc.javax.net.ssl.trustStore*=_The path on the local filesystem to the keystore containing the certificate authority's public key_ -* *rpc.javax.net.ssl.trustStorePassword*=_The password for the keystore containing the certificate authority's public key_ -* *instance.rpc.ssl.enabled*=_true_ - -Optionally, SSL client authentication (two-way SSL) can also be enabled by setting -+instance.rpc.ssl.clientAuth=true+ in +accumulo-site.xml+. -This requires that each client has access to a valid certificate to set up a secure connection -to the servers. By default, Accumulo uses one-way SSL, which does not require clients to have -their own certificate.
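-For reference, a fragment of +accumulo-site.xml+ with some of these properties set might look like the following sketch (the path and password shown are placeholders for your own values): - ----- -<property> -  <name>rpc.javax.net.ssl.keyStore</name> -  <value>/path/to/server.jks</value> -</property> -<property> -  <name>rpc.javax.net.ssl.keyStorePassword</name> -  <value>keystore-password</value> -</property> -<property> -  <name>instance.rpc.ssl.enabled</name> -  <value>true</value> -</property> -----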
- -=== Client configuration - -To establish a connection to Accumulo servers, each client must also have -special configuration. This is typically accomplished through the use of -the client configuration file whose default location is +~/.accumulo/config+. - -The following properties must be set to connect to an Accumulo instance using SSL: - -* *rpc.javax.net.ssl.trustStore*=_The path on the local filesystem to the keystore containing the certificate authority's public key_ -* *rpc.javax.net.ssl.trustStorePassword*=_The password for the keystore containing the certificate authority's public key_ -* *instance.rpc.ssl.enabled*=_true_ - -If two-way SSL is enabled (+instance.rpc.ssl.clientAuth=true+) for the instance, the client must also define -its own certificate and enable client authentication as well. - -* *rpc.javax.net.ssl.keyStore*=_The path on the local filesystem to the keystore containing the client's certificate_ -* *rpc.javax.net.ssl.keyStorePassword*=_The password for the keystore containing the client's certificate_ -* *instance.rpc.ssl.clientAuth*=_true_ - -=== Generating SSL material using OpenSSL - -The following is included as an example for generating your own SSL material (certificate authority and server/client -certificates) using OpenSSL and Java's +keytool+ command. - -==== Generate a certificate authority - ----- -# Create a private key -openssl genrsa -des3 -out root.key 4096 - -# Create a self-signed certificate using the private key -openssl req -x509 -new -key root.key -days 365 -out root.pem - -# Generate a DER-formatted version of the certificate just created -openssl x509 -outform der -in root.pem -out root.der - -# Import the certificate into a Java KeyStore -keytool -import -alias root-key -keystore truststore.jks -file root.der - -# Remove the DER-formatted certificate file (as we don't need it anymore) -rm root.der ----- - -The +truststore.jks+ file is the Java keystore which contains the certificate authority's public key. - -==== Generate a certificate/keystore per host - -It's common that each host in the instance is issued its own certificate (notably to ensure that revocation procedures -can be easily followed). The following steps can be taken for each host. - ----- -# Create the private key for our server -openssl genrsa -out server.key 4096 - -# Generate a certificate signing request (CSR) with our private key -openssl req -new -key server.key -out server.csr - -# Use the CSR and the CA to create a certificate for the server (a reply to the CSR) -openssl x509 -req -in server.csr -CA root.pem -CAkey root.key -CAcreateserial \ - -out server.crt -days 365 - -# Use the certificate and the private key for our server to create a PKCS12 file -openssl pkcs12 -export -in server.crt -inkey server.key -certfile server.crt \ - -name 'server-key' -out server.p12 - -# Create a Java KeyStore for the server using the PKCS12 file (private key) -keytool -importkeystore -srckeystore server.p12 -srcstoretype pkcs12 -destkeystore \ - server.jks -deststoretype JKS - -# Remove the PKCS12 file as we don't need it -rm server.p12 - -# Import the CA-signed certificate to the keystore -keytool -import -trustcacerts -alias server-crt -file server.crt -keystore server.jks ----- - -The +server.jks+ file is the Java keystore containing the certificate for a given host. The above -steps are the same whether the certificate is generated for an Accumulo server or a client.
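-As a minimal sketch in Java, the client-side settings above can also be supplied programmatically through +ClientConfiguration+ (the instance name, zookeepers, truststore path, and credentials below are placeholders): - -[source,java] ----- -// mirror the client configuration file shown above -ClientConfiguration clientConf = ClientConfiguration.loadDefault() -    .withInstance("myinstance") -    .withZkHosts("zoo1:2181") -    .withSsl(true) -    .withTruststore("/path/to/truststore.jks"); - -Instance instance = new ZooKeeperInstance(clientConf); -Connector conn = instance.getConnector("user", new PasswordToken("password")); -----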
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/summaries.md ---------------------------------------------------------------------- diff --git a/docs/master/summaries.md b/docs/master/summaries.md deleted file mode 100644 index 08d8011..0000000 --- a/docs/master/summaries.md +++ /dev/null @@ -1,232 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -== Summary Statistics - -=== Overview - -Accumulo has the ability to generate summary statistics about data in a table -using user defined functions. Currently these statistics are only generated for -data written to files. Data recently written to Accumulo that is still in -memory will not contribute to summary statistics. - -This feature can be used to inform a user about what data is in their table. -Summary statistics can also be used by compaction strategies to make decisions -about which files to compact. - -Summary data is stored in each file Accumulo produces. Accumulo can gather -summary information from across a cluster, merging it along the way. In order -for this to be fast, the summary information should fit in cache. There is a -dedicated cache for summary data on each tserver with a configurable size. In -order for summary data to fit in cache, it should probably be small. - -For information on writing a custom summarizer, see the javadoc for -+org.apache.accumulo.core.client.summary.Summarizer+. The package -+org.apache.accumulo.core.client.summary.summarizers+ contains summarizer -implementations that ship with Accumulo and can be configured for use. - -=== Inaccuracies - -Summary data can be inaccurate when files are missing summary data or when -files have extra summary data. Files can contain data outside of a tablet's -boundaries. This can happen as a result of bulk imported files and tablet splits. -When this happens, those files could contain extra summary information. -Accumulo offsets this somewhat by storing summary information for multiple row -ranges per file. However, the ranges are not granular enough to completely -offset extra data. - -Any source of inaccuracy is reported when summary information is requested. -In the shell examples below this can be seen on the +File Statistics+ line. -For files missing summary information, the compact command in the shell has a -+--sf-no-summary+ option. This option compacts files that do not have the -summary information configured for the table. The compact command also has the -+--sf-extra-summary+ option, which will compact files with extra summary -information. - -=== Configuring - -The following tablet server and table properties configure summarization.
- -* <<appendices/config.txt#_tserver_cache_summary_size>> -* <<appendices/config.txt#_tserver_summary_partition_threads>> -* <<appendices/config.txt#_tserver_summary_remote_threads>> -* <<appendices/config.txt#_tserver_summary_retrieval_threads>> -* <<appendices/config.txt#TABLE_SUMMARIZER_PREFIX>> -* <<appendices/config.txt#_table_file_summary_maxsize>> - -=== Permissions - -Because summary data may be derived from sensitive data, requesting summary data -requires a special permission. Users must have the table permission -+GET_SUMMARIES+ in order to retrieve summary data. - - -=== Bulk import - -When generating RFiles to bulk import into Accumulo, those RFiles can contain -summary data. To use this feature, look at the javadoc on the -+AccumuloFileOutputFormat.setSummarizers(...)+ method. Also, -+org.apache.accumulo.core.client.rfile.RFile+ has options for creating RFiles -with embedded summary data. - -=== Examples - -This example walks through using summarizers in the Accumulo shell. Below, a -table is created and some data is inserted to summarize. - - root@uno> createtable summary_test - root@uno summary_test> setauths -u root -s PI,GEO,TIME - root@uno summary_test> insert 3b503bd name last Doe - root@uno summary_test> insert 3b503bd name first John - root@uno summary_test> insert 3b503bd contact address "123 Park Ave, NY, NY" -l PI&GEO - root@uno summary_test> insert 3b503bd date birth "1/11/1942" -l PI&TIME - root@uno summary_test> insert 3b503bd date married "5/11/1962" -l PI&TIME - root@uno summary_test> insert 3b503bd contact home_phone 1-123-456-7890 -l PI - root@uno summary_test> insert d5d18dd contact address "50 Lake Shore Dr, Chicago, IL" -l PI&GEO - root@uno summary_test> insert d5d18dd name first Jane - root@uno summary_test> insert d5d18dd name last Doe - root@uno summary_test> insert d5d18dd date birth 8/15/1969 -l PI&TIME - root@uno summary_test> scan -s PI,GEO,TIME - 3b503bd contact:address [PI&GEO] 123 Park Ave, NY, NY - 3b503bd contact:home_phone [PI] 1-123-456-7890 - 3b503bd date:birth [PI&TIME] 1/11/1942 - 3b503bd date:married [PI&TIME] 5/11/1962 - 3b503bd name:first [] John - 3b503bd name:last [] Doe - d5d18dd contact:address [PI&GEO] 50 Lake Shore Dr, Chicago, IL - d5d18dd date:birth [PI&TIME] 8/15/1969 - d5d18dd name:first [] Jane - d5d18dd name:last [] Doe - -After inserting the data, summaries are requested below. No summaries are returned. - - root@uno summary_test> summaries - -The visibility summarizer is configured below and the table is flushed. -Flushing the table creates a file, generating summary data in the process. The -summary data returned counts how many times each column visibility occurred. -The statistics with a +c:+ prefix are visibilities. The others are generic -statistics created by the CountingSummarizer that VisibilitySummarizer extends. - - root@uno summary_test> config -t summary_test -s table.summarizer.vis=org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer - root@uno summary_test> summaries - root@uno summary_test> flush -w - 2017-02-24 19:54:46,090 [shell.Shell] INFO : Flush of table summary_test completed.
- root@uno summary_test> summaries - Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {} - File Statistics : [total:1, missing:0, extra:0, large:0] - Summary Statistics : - c: = 4 - c:PI = 1 - c:PI&GEO = 2 - c:PI&TIME = 3 - emitted = 10 - seen = 10 - tooLong = 0 - tooMany = 0 - -VisibilitySummarizer has an option +maxCounters+ that determines the max number -of column visibilities it will track. Below, this option is set and a compaction -is forced to regenerate summary data. The new summary data only has three -visibilities and now the +tooMany+ statistic is 4. This is the number of -visibilities that were not counted. - - root@uno summary_test> config -t summary_test -s table.summarizer.vis.opt.maxCounters=3 - root@uno summary_test> compact -w - 2017-02-24 19:54:46,267 [shell.Shell] INFO : Compacting table ... - 2017-02-24 19:54:47,127 [shell.Shell] INFO : Compaction of table summary_test completed for given range - root@uno summary_test> summaries - Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3} - File Statistics : [total:1, missing:0, extra:0, large:0] - Summary Statistics : - c:PI = 1 - c:PI&GEO = 2 - c:PI&TIME = 3 - emitted = 10 - seen = 10 - tooLong = 0 - tooMany = 4 - -Another summarizer is configured below that tracks the number of deletes. Also, -a compaction strategy that uses this summary data is configured. The -+TooManyDeletesCompactionStrategy+ will force a compaction of the tablet when -the ratio of deletes to non-deletes is over 25%. This threshold is -configurable. Below, a delete is added and is reflected in the statistics. In -this case there is 1 delete and 10 non-deletes, not enough to force a -compaction of the tablet. - -.... -root@uno summary_test> config -t summary_test -s table.summarizer.del=org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer -root@uno summary_test> compact -w -2017-02-24 19:54:47,282 [shell.Shell] INFO : Compacting table ... -2017-02-24 19:54:49,236 [shell.Shell] INFO : Compaction of table summary_test completed for given range -root@uno summary_test> config -t summary_test -s table.compaction.major.ratio=10 -root@uno summary_test> config -t summary_test -s table.majc.compaction.strategy=org.apache.accumulo.tserver.compaction.strategies.TooManyDeletesCompactionStrategy -root@uno summary_test> deletemany -r d5d18dd -c date -f -[DELETED] d5d18dd date:birth [PI&TIME] -root@uno summary_test> flush -w -2017-02-24 19:54:49,686 [shell.Shell] INFO : Flush of table summary_test completed. -root@uno summary_test> summaries - Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3} - File Statistics : [total:2, missing:0, extra:0, large:0] - Summary Statistics : - c:PI = 1 - c:PI&GEO = 2 - c:PI&TIME = 4 - emitted = 11 - seen = 11 - tooLong = 0 - tooMany = 4 - - Summarizer : org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer del {} - File Statistics : [total:2, missing:0, extra:0, large:0] - Summary Statistics : - deletes = 1 - total = 11 -.... - -Some more deletes are added and the table is flushed below. This results in 4 -deletes and 10 non-deletes, which triggers a full compaction. A full -compaction of all files is the only time when delete markers are dropped. The -compaction ratio was set to 10 above to show that the number of files did not -trigger the compaction. After the compaction there are no deletes and 6 non-deletes. - -....
-root@uno summary_test> deletemany -r d5d18dd -f -[DELETED] d5d18dd contact:address [PI&GEO] -[DELETED] d5d18dd name:first [] -[DELETED] d5d18dd name:last [] -root@uno summary_test> flush -w -2017-02-24 19:54:52,800 [shell.Shell] INFO : Flush of table summary_test completed. -root@uno summary_test> summaries - Summarizer : org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer vis {maxCounters=3} - File Statistics : [total:1, missing:0, extra:0, large:0] - Summary Statistics : - c:PI = 1 - c:PI&GEO = 1 - c:PI&TIME = 2 - emitted = 6 - seen = 6 - tooLong = 0 - tooMany = 2 - - Summarizer : org.apache.accumulo.core.client.summary.summarizers.DeletesSummarizer del {} - File Statistics : [total:1, missing:0, extra:0, large:0] - Summary Statistics : - deletes = 0 - total = 6 -root@uno summary_test> -.... - http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/table_configuration.md ---------------------------------------------------------------------- diff --git a/docs/master/table_configuration.md b/docs/master/table_configuration.md deleted file mode 100644 index e78d7bd..0000000 --- a/docs/master/table_configuration.md +++ /dev/null @@ -1,670 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -== Table Configuration - -Accumulo tables have a few options that can be configured to alter the default -behavior of Accumulo as well as improve performance based on the data stored. -These include locality groups, constraints, bloom filters, iterators, and block -cache. For a complete list of available configuration options, see <<configuration>>. - -=== Locality Groups - -Accumulo supports storing sets of column families separately on disk to allow -clients to efficiently scan over columns that are frequently used together and to avoid -scanning over column families that are not requested. After a locality group is set, -Scanner and BatchScanner operations will automatically take advantage of them -whenever the fetchColumnFamilies() method is used. - -By default, tables place all column families into the same ``default'' locality group. -Additional locality groups can be configured at any time via the shell or -programmatically as follows: - -==== Managing Locality Groups via the Shell - - usage: setgroups <group>=<col fam>{,<col fam>}{ <group>=<col fam>{,<col fam>}} - [-?] 
-t <table> - - user@myinstance mytable> setgroups group_one=colf1,colf2 -t mytable - - user@myinstance mytable> getgroups -t mytable - -==== Managing Locality Groups via the Client API - -[source,java] ----- -Connector conn; - -HashMap<String,Set<Text>> localityGroups = new HashMap<String, Set<Text>>(); - -HashSet<Text> metadataColumns = new HashSet<Text>(); -metadataColumns.add(new Text("domain")); -metadataColumns.add(new Text("link")); - -HashSet<Text> contentColumns = new HashSet<Text>(); -contentColumns.add(new Text("body")); -contentColumns.add(new Text("images")); - -localityGroups.put("metadata", metadataColumns); -localityGroups.put("content", contentColumns); - -conn.tableOperations().setLocalityGroups("mytable", localityGroups); - -// existing locality groups can be obtained as follows -Map<String, Set<Text>> groups = - conn.tableOperations().getLocalityGroups("mytable"); ----- - -The assignment of Column Families to Locality Groups can be changed at any time. The -physical movement of column families into their new locality groups takes place via -the periodic Major Compaction process that takes place continuously in the -background. Major Compaction can also be scheduled to take place immediately -through the shell: - - user@myinstance mytable> compact -t mytable - -=== Constraints - -Accumulo supports constraints applied on mutations at insert time. This can be -used to disallow certain inserts according to a user defined policy. Any mutation -that fails to meet the requirements of the constraint is rejected and sent back to the -client. - -Constraints can be enabled by setting a table property as follows: - ----- -user@myinstance mytable> constraint -t mytable -a com.test.ExampleConstraint com.test.AnotherConstraint - -user@myinstance mytable> constraint -l -com.test.ExampleConstraint=1 -com.test.AnotherConstraint=2 ----- - -Currently there are no general-purpose constraints provided with the Accumulo -distribution. New constraints can be created by writing a Java class that implements -the following interface: - - org.apache.accumulo.core.constraints.Constraint - -To deploy a new constraint, create a jar file containing the class implementing the -new constraint and place it in the lib directory of the Accumulo installation. New -constraint jars can be added to Accumulo and enabled without restarting but any -change to an existing constraint class requires Accumulo to be restarted. - -See the https://github.com/apache/accumulo-examples/blob/master/docs/contraints.md[constraints example] -for example code. - -=== Bloom Filters - -As mutations are applied to an Accumulo table, several files are created per tablet. If -bloom filters are enabled, Accumulo will create and load a small data structure into -memory to determine whether a file contains a given key before opening the file. -This can speed up lookups considerably. - -To enable bloom filters, enter the following command in the Shell: - - user@myinstance> config -t mytable -s table.bloom.enabled=true - -The https://github.com/apache/accumulo-examples/blob/master/docs/bloom.md[bloom filter example] -contains an extensive example of using Bloom Filters. - -=== Iterators - -Iterators provide a modular mechanism for adding functionality to be executed by -TabletServers when scanning or compacting data. This allows users to efficiently -summarize, filter, and aggregate data. In fact, the built-in features of cell-level -security and column fetching are implemented using Iterators. 
-Some useful Iterators are provided with Accumulo and can be found in the -*+org.apache.accumulo.core.iterators.user+* package. -In each case, any custom Iterators must be included in Accumulo's classpath, -typically by including a jar in +lib/+ or +lib/ext/+, although the VFS classloader -allows for classpath manipulation using a variety of schemes including URLs and HDFS URIs. - -==== Setting Iterators via the Shell - -Iterators can be configured on a table at scan, minor compaction and/or major -compaction scopes. If the Iterator implements the OptionDescriber interface, the -setiter command can be used, which will interactively prompt the user to provide -values for the necessary options. - - usage: setiter [-?] -ageoff | -agg | -class <name> | -regex | - -reqvis | -vers [-majc] [-minc] [-n <itername>] -p <pri> - [-scan] [-t <table>] - - user@myinstance mytable> setiter -t mytable -scan -p 15 -n myiter -class com.company.MyIterator - -The config command can always be used to manually configure iterators, which is useful -in cases where the Iterator does not implement the OptionDescriber interface. - - config -t mytable -s table.iterator.scan.myiter=15,com.company.MyIterator - config -t mytable -s table.iterator.minc.myiter=15,com.company.MyIterator - config -t mytable -s table.iterator.majc.myiter=15,com.company.MyIterator - config -t mytable -s table.iterator.scan.myiter.opt.myoptionname=myoptionvalue - config -t mytable -s table.iterator.minc.myiter.opt.myoptionname=myoptionvalue - config -t mytable -s table.iterator.majc.myiter.opt.myoptionname=myoptionvalue - -Typically, a table will have multiple iterators. Accumulo configures a set of -system level iterators for each table. These iterators provide core -functionality like visibility label filtering and may not be removed by -users. User level iterators are applied in the order of their priority. -Priority is a user configured integer; iterators with lower numbers go first, -passing the results of their iteration on to the other iterators up the -stack. - -==== Setting Iterators Programmatically - -[source,java] -scanner.addIterator(new IteratorSetting( - 15, // priority - "myiter", // name this iterator - "com.company.MyIterator" // class name -)); - -Some iterators take additional parameters from client code, as in the following -example: - -[source,java] -IteratorSetting iter = new IteratorSetting(...); -iter.addOption("myoptionname", "myoptionvalue"); -scanner.addIterator(iter); - -Tables support separate Iterator settings to be applied at scan time, upon minor -compaction and upon major compaction. For most uses, tables will have identical -iterator settings for all three to avoid inconsistent results. - -==== Versioning Iterators and Timestamps - -Accumulo provides the capability to manage versioned data through the use of -timestamps within the Key. If a timestamp is not specified in the key created by the -client, then the system will set the timestamp to the current time. Two keys with -identical rowIDs and columns but different timestamps are considered two versions -of the same key. If two inserts are made into Accumulo with the same rowID, -column, and timestamp, then the behavior is non-deterministic. - -Timestamps are sorted in descending order, so the most recent data comes first. -Accumulo can be configured to return the top k versions, or versions later than a -given date. The default is to return the one most recent version.
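-As a short sketch of this behavior, the following writes two versions of the same cell with explicit timestamps (assuming an existing +Connector conn+ and a table named mytable): - -[source,java] ----- -BatchWriter writer = conn.createBatchWriter("mytable", new BatchWriterConfig()); - -Mutation m = new Mutation("row1"); -// same row and column but different timestamps: two versions of one key -m.put(new Text("colf"), new Text("colq"), 1000L, new Value("version-1".getBytes())); -m.put(new Text("colf"), new Text("colq"), 2000L, new Value("version-2".getBytes())); -writer.addMutation(m); -writer.close(); - -// with the default VersioningIterator settings, a scan over row1 returns -// only "version-2", the entry with the most recent timestamp -----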
- -The version policy can be changed by changing the VersioningIterator options for a -table as follows: - ----- -user@myinstance mytable> config -t mytable -s table.iterator.scan.vers.opt.maxVersions=3 - -user@myinstance mytable> config -t mytable -s table.iterator.minc.vers.opt.maxVersions=3 - -user@myinstance mytable> config -t mytable -s table.iterator.majc.vers.opt.maxVersions=3 ----- - -When a table is created, by default it is configured to use the -VersioningIterator and keep one version. A table can be created without the -VersioningIterator with the -ndi option in the shell. Also, the Java API -has the following method: - -[source,java] -connector.tableOperations().create(String tableName, boolean limitVersion); - -===== Logical Time - -Accumulo 1.2 introduces the concept of logical time. This ensures that timestamps -set by Accumulo always move forward. This helps avoid problems caused by -TabletServers that have different time settings. The per-tablet counter gives unique, -one-up timestamps on a per-mutation basis. When using time in milliseconds, if -two things arrive within the same millisecond then both receive the same -timestamp; even so, timestamps set by Accumulo will still -always move forward and never backwards. - -A table can be configured to use logical timestamps at creation time as follows: - - user@myinstance> createtable -tl logical - -===== Deletes - -Deletes are special keys in Accumulo that get sorted along with all the other data. -When a delete key is inserted, Accumulo will not show anything that has a -timestamp less than or equal to the delete key. During major compaction, any keys -older than a delete key are omitted from the new file created, and the omitted keys -are removed from disk as part of the regular garbage collection process. - -==== Filters - -When scanning over a set of key-value pairs, it is possible to apply an arbitrary -filtering policy through the use of a Filter. Filters are types of iterators that return -only key-value pairs that satisfy the filter logic. Accumulo has a few built-in filters -that can be configured on any table: AgeOff, ColumnAgeOff, Timestamp, NoVis, and RegEx. More can be added -by writing a Java class that extends the -+org.apache.accumulo.core.iterators.Filter+ class. - -The AgeOff filter can be configured to remove data older than a certain date or a fixed -amount of time from the present.
The following example sets a table to delete -everything inserted over 30 seconds ago: - ----- -user@myinstance> createtable filtertest - -user@myinstance filtertest> setiter -t filtertest -scan -minc -majc -p 10 -n myfilter -ageoff -AgeOffFilter removes entries with timestamps more than <ttl> milliseconds old -----------> set org.apache.accumulo.core.iterators.user.AgeOffFilter parameter negate, default false - keeps k/v that pass accept method, true rejects k/v that pass accept method: -----------> set org.apache.accumulo.core.iterators.user.AgeOffFilter parameter ttl, time to - live (milliseconds): 30000 -----------> set org.apache.accumulo.core.iterators.user.AgeOffFilter parameter currentTime, if set, - use the given value as the absolute time in milliseconds as the current time of day: - -user@myinstance filtertest> - -user@myinstance filtertest> scan - -user@myinstance filtertest> insert foo a b c - -user@myinstance filtertest> scan -foo a:b [] c - -user@myinstance filtertest> sleep 31 - -user@myinstance filtertest> scan - -user@myinstance filtertest> ----- - -To see the iterator settings for a table, use: - - user@example filtertest> config -t filtertest -f iterator - ---------+---------------------------------------------+------------------ - SCOPE | NAME | VALUE - ---------+---------------------------------------------+------------------ - table | table.iterator.majc.myfilter .............. | 10,org.apache.accumulo.core.iterators.user.AgeOffFilter - table | table.iterator.majc.myfilter.opt.ttl ...... | 30000 - table | table.iterator.majc.vers .................. | 20,org.apache.accumulo.core.iterators.VersioningIterator - table | table.iterator.majc.vers.opt.maxVersions .. | 1 - table | table.iterator.minc.myfilter .............. | 10,org.apache.accumulo.core.iterators.user.AgeOffFilter - table | table.iterator.minc.myfilter.opt.ttl ...... | 30000 - table | table.iterator.minc.vers .................. | 20,org.apache.accumulo.core.iterators.VersioningIterator - table | table.iterator.minc.vers.opt.maxVersions .. | 1 - table | table.iterator.scan.myfilter .............. | 10,org.apache.accumulo.core.iterators.user.AgeOffFilter - table | table.iterator.scan.myfilter.opt.ttl ...... | 30000 - table | table.iterator.scan.vers .................. | 20,org.apache.accumulo.core.iterators.VersioningIterator - table | table.iterator.scan.vers.opt.maxVersions .. | 1 - ---------+---------------------------------------------+------------------ - -==== Combiners - -Accumulo supports on-the-fly lazy aggregation of data using Combiners. Aggregation is -done at compaction and scan time. No lookup is done at insert time, which greatly -speeds up ingest. - -Accumulo allows Combiners to be configured on tables and column -families. When a Combiner is set, it is applied across the values -associated with any keys that share rowID, column family, and column qualifier. -This is similar to the reduce step in MapReduce, which applies some function to all -the values associated with a particular key. - -For example, if a summing combiner were configured on a table and the following -mutations were inserted: - - Row Family Qualifier Timestamp Value - rowID1 colfA colqA 20100101 1 - rowID1 colfA colqA 20100102 1 - -The table would reflect only one aggregate value: - - rowID1 colfA colqA - 2 - -Combiners can be enabled for a table using the setiter command in the shell. Below is an example.
- ---- -root@a14 perDayCounts> setiter -t perDayCounts -p 10 -scan -minc -majc -n daycount - -class org.apache.accumulo.core.iterators.user.SummingCombiner -TypedValueCombiner can interpret Values as a variety of number encodings - (VLong, Long, or String) before combining -----------> set SummingCombiner parameter columns, - <col fam>[:<col qual>]{,<col fam>[:<col qual>]} : day -----------> set SummingCombiner parameter type, <VARNUM|LONG|STRING>: STRING - -root@a14 perDayCounts> insert foo day 20080101 1 -root@a14 perDayCounts> insert foo day 20080101 1 -root@a14 perDayCounts> insert foo day 20080103 1 -root@a14 perDayCounts> insert bar day 20080101 1 -root@a14 perDayCounts> insert bar day 20080101 1 - -root@a14 perDayCounts> scan -bar day:20080101 [] 2 -foo day:20080101 [] 2 -foo day:20080103 [] 1 ----- - -Accumulo includes some useful Combiners out of the box. To find these, look in -the *+org.apache.accumulo.core.iterators.user+* package. - -Additional Combiners can be added by creating a Java class that extends -+org.apache.accumulo.core.iterators.Combiner+ and adding a jar containing that -class to Accumulo's lib/ext directory. - -See the https://github.com/apache/accumulo-examples/blob/master/docs/combiner.md[combiner example] -for example code. - -=== Block Cache - -In order to increase throughput of commonly accessed entries, Accumulo employs a block cache. -This block cache buffers data in memory so that it doesn't have to be read off of disk. -The RFile format that Accumulo prefers is a mix of index blocks and data blocks, where the index blocks are used to find the appropriate data blocks. -Typical queries to Accumulo result in a binary search over several index blocks followed by a linear scan of one or more data blocks. - -The block cache can be configured on a per-table basis, and all tablets hosted on a tablet server share a single resource pool. -To configure the size of the tablet server's block cache, set the following properties: - - tserver.cache.data.size: Specifies the size of the cache for file data blocks. - tserver.cache.index.size: Specifies the size of the cache for file indices. - -To enable the block cache for your table, set the following properties: - - table.cache.block.enable: Determines whether file (data) block cache is enabled. - table.cache.index.enable: Determines whether index cache is enabled. - -The block cache can have a significant effect on alleviating hot spots, as well as reducing query latency. -It is enabled by default for the metadata tables. - -=== Compaction - -As data is written to Accumulo it is buffered in memory. The data buffered in -memory is eventually written to HDFS on a per-tablet basis. Files can also be -added to tablets directly by bulk import. In the background, tablet servers run -major compactions to merge multiple files into one. The tablet server has to -decide which tablets to compact and which files within a tablet to compact. -This decision is made using the compaction ratio, which is configurable on a -per-table basis. To configure this ratio, modify the following property: - - table.compaction.major.ratio - -Increasing this ratio will result in more files per tablet and less compaction -work. More files per tablet means higher query latency. So adjusting -this ratio is a trade-off between ingest and query performance. The ratio -defaults to 3.
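-As a sketch, the ratio can also be adjusted through the client API rather than the shell (assuming an existing +Connector conn+; the value shown is illustrative): - -[source,java] ----- -// trade more files per tablet (higher query latency) for less compaction work -conn.tableOperations().setProperty("mytable", "table.compaction.major.ratio", "5"); -----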
- -The way the ratio works is that a set of files is compacted into one file if the -sum of the sizes of the files in the set is larger than the ratio multiplied by -the size of the largest file in the set. If this is not true for the set of all -files in a tablet, the largest file is removed from consideration, and the -remaining files are considered for compaction. This is repeated until a -compaction is triggered or there are no files left to consider. - -The number of background threads tablet servers use to run major compactions is -configurable. To configure this, modify the following property: - - tserver.compaction.major.concurrent.max - -Also, the number of threads tablet servers use for minor compactions is -configurable. To configure this, modify the following property: - - tserver.compaction.minor.concurrent.max - -The numbers of minor and major compactions running and queued are visible on the -Accumulo monitor page. This allows you to see if compactions are backing up -and whether adjustments to the above settings are needed. When adjusting the number of -threads available for compactions, consider the number of cores and other tasks -running on the nodes such as maps and reduces. - -If major compactions are not keeping up, then the number of files per tablet -will grow to a point such that query performance starts to suffer. One way to -handle this situation is to increase the compaction ratio. For example, if the -compaction ratio were set to 1, then every new file added to a tablet by minor -compaction would immediately queue the tablet for major compaction. So if a -tablet has a 200M file and minor compaction writes a 1M file, then the major -compaction will attempt to merge the 200M and 1M files. If the tablet server -has lots of tablets trying to do this sort of thing, then major compactions -will back up and the number of files per tablet will start to grow, assuming -data is being continuously written. Increasing the compaction ratio will -alleviate backups by lowering the amount of major compaction work that needs to -be done. - -Another option to deal with the files per tablet growing too large is to adjust -the following property: - - table.file.max - -When a tablet reaches this number of files and needs to flush its in-memory -data to disk, it will choose to do a merging minor compaction. A merging minor -compaction will merge the tablet's smallest file with the data in memory at -minor compaction time. Therefore the number of files will not grow beyond this -limit. This will make minor compactions take longer, which will cause ingest -performance to decrease. This can cause ingest to slow down until major -compactions have enough time to catch up. When adjusting this property, also -consider adjusting the compaction ratio. Ideally, merging minor compactions -never need to occur and major compactions will keep up. It is possible to -configure the file max and compaction ratio such that only merging minor -compactions occur and major compactions never occur. This should be avoided -because doing only merging minor compactions causes O(_N_^2^) work to be done. -The amount of work done by major compactions is O(_N_*log~_R_~(_N_)) where -_R_ is the compaction ratio. - -Compactions can be initiated manually for a table. To initiate a minor -compaction, use the flush command in the shell. To initiate a major compaction, -use the compact command in the shell. The compact command will compact all -tablets in a table to one file. Even tablets with one file are compacted.
This -is useful for the case where a major compaction filter is configured for a -table. In 1.4 the ability to compact a range of a table was added. To use this -feature, specify start and stop rows for the compact command. This will only -compact tablets that overlap the given row range. - -==== Compaction Strategies - -The default behavior of major compactions is defined in the class DefaultCompactionStrategy. -This behavior can be changed by overriding the following property with a fully qualified class name: - - table.majc.compaction.strategy - -Custom compaction strategies can have additional properties that are specified following the prefix property: - - table.majc.compaction.strategy.opts.* - -Accumulo provides a few classes that can be used as an alternative compaction strategy. These classes are located in the -org.apache.accumulo.tserver.compaction.* package. EverythingCompactionStrategy will simply compact all files. This is the -strategy used by the user "compact" command. SizeLimitCompactionStrategy compacts files no bigger than the limit set in the -property table.majc.compaction.strategy.opts.sizeLimit. - -TwoTierCompactionStrategy is a hybrid compaction strategy that supports two types of compression. If the total size of -files being compacted is larger than table.majc.compaction.strategy.opts.file.large.compress.threshold, then a larger -compression type will be used. The larger compression type is specified in table.majc.compaction.strategy.opts.file.large.compress.type. -Otherwise, the configured table compression will be used. To use this strategy with minor compactions, set table.file.compress.type=snappy -and set a different compress type in table.majc.compaction.strategy.opts.file.large.compress.type for larger files. - -=== Pre-splitting tables - -Accumulo will balance and distribute tables across servers. Before a -table gets large, it will be maintained as a single tablet on a single -server. This limits the speed at which data can be added or queried -to the speed of a single node. To improve performance when a table -is new, or small, you can add split points and generate new tablets. - -In the shell: - - root@myinstance> createtable newTable - root@myinstance> addsplits -t newTable g n t - -This will create a new table with 4 tablets. The table will be split -on the letters ``g'', ``n'', and ``t'', which will work nicely if the -row data starts with lower-case alphabetic characters. If your row -data includes binary information or numeric information, or if the -distribution of the row information is not flat, then you would pick -different split points. Now ingest and query can proceed on 4 nodes, -which can improve performance. - -=== Merging tablets - -Over time, a table can get very large, so large that it has hundreds -of thousands of split points. Once there are enough tablets to spread -a table across the entire cluster, additional splits may not improve -performance, and may create unnecessary bookkeeping. The distribution -of data may change over time. For example, if row data contains date -information, and data is continually added and removed to maintain a -window of current information, tablets for older rows may be empty. - -Accumulo supports tablet merging, which can be used to reduce -the number of split points.
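-Programmatically, a merge can be requested through the table operations API. A minimal sketch, assuming an existing +Connector conn+ as in earlier examples: - -[source,java] ----- -// merge all tablets of myTable whose rows sort between "A" and "Z" -conn.tableOperations().merge("myTable", new Text("A"), new Text("Z")); -----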
The following command will merge all rows -from ``A'' to ``Z'' into a single tablet: - - root@myinstance> merge -t myTable -s A -e Z - -If the result of a merge produces a tablet that is larger than the -configured split size, the tablet may be split by the tablet server. -Be sure to increase your tablet size prior to any merges if the goal -is to have larger tablets: - - root@myinstance> config -t myTable -s table.split.threshold=2G - -In order to merge small tablets, you can ask Accumulo to merge -sections of a table smaller than a given size. - - root@myinstance> merge -t myTable -s 100M - -By default, small tablets will not be merged into tablets that are -already larger than the given size. This can leave isolated small -tablets. To force small tablets to be merged into larger tablets use -the +--force+ option: - - root@myinstance> merge -t myTable -s 100M --force - -Merging away small tablets works on one section at a time. If your -table contains many sections of small split points, or you are -attempting to change the split size of the entire table, it will be -faster to set the split threshold and merge the entire table: - - root@myinstance> config -t myTable -s table.split.threshold=256M - root@myinstance> merge -t myTable - -=== Delete Range - -Consider an indexing scheme that uses date information in each row. -For example ``20110823-15:20:25.013'' might be a row that specifies a -date and time. In some cases, we might like to delete rows based on -this date, say to remove all the data older than the current year. -Accumulo supports a delete range operation which efficiently -removes data between two rows. For example: - - root@myinstance> deleterange -t myTable -s 2010 -e 2011 - -This will delete all rows starting with ``2010'' and it will stop at -any row starting with ``2011''. You can delete any data prior to 2011 -with: - - root@myinstance> deleterange -t myTable -e 2011 --force - -The shell will not allow you to delete an unbounded range (no start) -unless you provide the +--force+ option. - -Range deletion is implemented using splits at the given start/end -positions, and will affect the number of splits in the table. - -=== Cloning Tables - -A new table can be created that points to an existing table's data. This is a -very quick metadata operation; no data is actually copied. The cloned table -and the source table can change independently after the clone operation. One -use case for this feature is testing. For example, to test a new filtering -iterator, clone the table, add the filter to the clone, and force a major -compaction. To perform a test on less data, clone a table and then use delete -range to efficiently remove a lot of data from the clone. Another use case is -generating a snapshot to guard against human error. To create a snapshot, -clone a table and then disable write permissions on the clone. - -The clone operation will point to the source table's files. This is why the -flush option is present and is enabled by default in the shell. If the flush -option is not enabled, then any data the source table currently has in memory -will not exist in the clone. - -A cloned table copies the configuration of the source table. However, the -permissions of the source table are not copied to the clone. After a clone is -created, only the user that created the clone can read and write to it. - -In the following example we see that data inserted after the clone operation is -not visible in the clone.
- ---- -root@a14> createtable people - -root@a14 people> insert 890435 name last Doe -root@a14 people> insert 890435 name first John - -root@a14 people> clonetable people test - -root@a14 people> insert 890436 name first Jane -root@a14 people> insert 890436 name last Doe - -root@a14 people> scan -890435 name:first [] John -890435 name:last [] Doe -890436 name:first [] Jane -890436 name:last [] Doe - -root@a14 people> table test - -root@a14 test> scan -890435 name:first [] John -890435 name:last [] Doe - -root@a14 test> ----- - -The du command in the shell shows how much space a table is using in HDFS. -This command can also show how much overlapping space two cloned tables have in -HDFS. In the example below du shows table ci is using 428M. Then ci is cloned -to cic and du shows that both tables share 428M. After three entries are -inserted into cic and it is flushed, du shows the two tables still share 428M but -cic has 226 bytes to itself. Finally, table cic is compacted and then du shows -that each table uses 428M. - ----- -root@a14> du ci - 428,482,573 [ci] - -root@a14> clonetable ci cic - -root@a14> du ci cic - 428,482,573 [ci, cic] - -root@a14> table cic - -root@a14 cic> insert r1 cf1 cq1 v1 -root@a14 cic> insert r1 cf1 cq2 v2 -root@a14 cic> insert r1 cf1 cq3 v3 - -root@a14 cic> flush -t cic -w -27 15:00:13,908 [shell.Shell] INFO : Flush of table cic completed. - -root@a14 cic> du ci cic - 428,482,573 [ci, cic] - 226 [cic] - -root@a14 cic> compact -t cic -w -27 15:00:35,871 [shell.Shell] INFO : Compacting table ... -27 15:03:03,303 [shell.Shell] INFO : Compaction of table cic completed for given range - -root@a14 cic> du ci cic - 428,482,573 [ci] - 428,482,612 [cic] - -root@a14 cic> ----- - -=== Exporting Tables - -Accumulo supports exporting tables for the purpose of copying tables to another -cluster. Exporting and importing tables preserves the table's configuration, -splits, and logical time. Tables are exported and then copied via the hadoop -distcp command. To export a table, it must be offline and stay offline while -distcp runs. The reason it needs to stay offline is to prevent files from being -deleted. A table can be cloned and the clone taken offline in order to avoid -losing access to the table. See the https://github.com/apache/accumulo-examples/blob/master/docs/export.md[export example] -for example code. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/table_design.md ---------------------------------------------------------------------- diff --git a/docs/master/table_design.md b/docs/master/table_design.md deleted file mode 100644 index 31fa49a..0000000 --- a/docs/master/table_design.md +++ /dev/null @@ -1,336 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one or more -// contributor license agreements. See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License.
- -== Table Design - -=== Basic Table - -Since Accumulo tables are sorted by row ID, each table can be thought of as being -indexed by the row ID. Lookups performed by row ID can be executed quickly by doing -a binary search, first across the tablets and then within a tablet. Clients should -choose a row ID carefully in order to support their desired application. A simple rule -is to select a unique identifier as the row ID for each entity to be stored and to assign -all the other attributes to be tracked as columns under this row ID. For example, -if we have the following data in a comma-separated file: - - userid,age,address,account-balance - -We might choose to store this data using the userid as the rowID, the column -name in the column family, and a blank column qualifier: - -[source,java] ----- -Mutation m = new Mutation(userid); -final String column_qualifier = ""; -m.put("age", column_qualifier, age); -m.put("address", column_qualifier, address); -m.put("balance", column_qualifier, account_balance); - -writer.add(m); ----- - -We could then retrieve any of the columns for a specific userid by specifying the -userid as the range of a scanner and fetching specific columns: - -[source,java] ----- -Range r = new Range(userid, userid); // single row -Scanner s = conn.createScanner("userdata", auths); -s.setRange(r); -s.fetchColumnFamily(new Text("age")); - -for(Entry<Key,Value> entry : s) { - System.out.println(entry.getValue().toString()); -} ----- - -=== RowID Design - -Often it is necessary to transform the rowID in order to have rows ordered in a way -that is optimal for anticipated access patterns. A good example of this is reversing -the order of components of internet domain names in order to group rows of the -same parent domain together: - - com.google.code - com.google.labs - com.google.mail - com.yahoo.mail - com.yahoo.research - -Some data may result in the creation of very large rows - rows with many columns. -In this case the table designer may wish to split up these rows for better load -balancing while keeping them sorted together for scanning purposes. This can be -done by appending a random substring at the end of the row: - - com.google.code_00 - com.google.code_01 - com.google.code_02 - com.google.labs_00 - com.google.mail_00 - com.google.mail_01 - -It could also be done by adding a string representation of some period of time such as date to the week -or month: - - com.google.code_201003 - com.google.code_201004 - com.google.code_201005 - com.google.labs_201003 - com.google.mail_201003 - com.google.mail_201004 - -Appending dates provides the additional capability of restricting a scan to a given -date range. - -=== Lexicoders -Since Keys in Accumulo are sorted lexicographically by default, it's often useful to encode -common data types into a byte format in which their sort order corresponds to the sort order -in their native form. An example of this is encoding dates and numerical data so that they can -be efficiently seeked and searched in ranges. - -The lexicoders are a standard and extensible way of encoding Java types.
Here's an example -of a lexicoder that encodes a Java Date object so that it sorts lexicographically: - -[source,java] ----- -// create new date lexicoder -DateLexicoder dateEncoder = new DateLexicoder(); - -// truncate time to hours -long epoch = System.currentTimeMillis(); -Date hour = new Date(epoch - (epoch % 3600000)); - -// encode the rowId so that it is sorted lexicographically -Mutation mutation = new Mutation(dateEncoder.encode(hour)); -mutation.put(new Text("colf"), new Text("colq"), new Value(new byte[]{})); ----- - -If we want to return the most recent date first, we can reverse the sort order -with the reverse lexicoder: - -[source,java] ----- -// create new date lexicoder and reverse lexicoder -DateLexicoder dateEncoder = new DateLexicoder(); -ReverseLexicoder<Date> reverseEncoder = new ReverseLexicoder<>(dateEncoder); - -// truncate date to hours -long epoch = System.currentTimeMillis(); -Date hour = new Date(epoch - (epoch % 3600000)); - -// encode the rowId so that it sorts in reverse lexicographic order -Mutation mutation = new Mutation(reverseEncoder.encode(hour)); -mutation.put(new Text("colf"), new Text("colq"), new Value(new byte[]{})); ----- - - -=== Indexing -In order to support lookups via more than one attribute of an entity, additional -indexes can be built. However, because Accumulo tables can support any number of -columns without specifying them beforehand, a single additional index will often -suffice for supporting lookups of records in the main table. Here the index has, as -its rowID, the Value or Term from the main table; the column families are the same; -and the column qualifier of the index table contains the rowID from the main table. - -[width="75%",cols="^,^,^,^"] -[grid="rows"] -[options="header"] -|============================================= -|RowID |Column Family |Column Qualifier |Value -|Term |Field Name |MainRowID | -|============================================= - -Note: We store rowIDs in the column qualifier rather than the Value so that we can -have more than one rowID associated with a particular term within the index. If we -stored this in the Value we would only see one of the rows in which the value -appears, since Accumulo is configured by default to return the one most recent -value associated with a key. - -Lookups can then be done by scanning the Index Table first for occurrences of the -desired values in the columns specified, which returns a list of row IDs from the main -table. These can then be used to retrieve each matching record, in its entirety or a -subset of its columns, from the Main Table. - -To support efficient lookups of multiple rowIDs from the same table, the Accumulo -client library provides a BatchScanner. Users specify a set of Ranges to the -BatchScanner, which performs the lookups in multiple threads to multiple servers -and returns an Iterator over all the rows retrieved. The rows returned are NOT in -sorted order, as is the case with the basic Scanner interface.
- -[source,java] ----- -// first we scan the index for IDs of rows matching our query -Text term = new Text("mySearchTerm"); - -HashSet<Range> matchingRows = new HashSet<Range>(); - -Scanner indexScanner = conn.createScanner("index", auths); -indexScanner.setRange(new Range(term, term)); - -// we retrieve the matching rowIDs and create a set of ranges -for(Entry<Key,Value> entry : indexScanner) { - matchingRows.add(new Range(entry.getKey().getColumnQualifier())); -} - -// now we pass the set of rowIDs to the batch scanner to retrieve them -BatchScanner bscan = conn.createBatchScanner("table", auths, 10); -bscan.setRanges(matchingRows); -bscan.fetchColumnFamily(new Text("attributes")); - -for(Entry<Key,Value> entry : bscan) { - System.out.println(entry.getValue()); -} ----- - -One advantage of the dynamic schema capabilities of Accumulo is that different -fields may be indexed into the same physical table. However, it may be necessary to -create different index tables if the terms must be formatted differently in order to -maintain proper sort order. For example, real numbers must be formatted -differently than their usual notation in order to be sorted correctly. In these cases, -usually one index per unique data type will suffice. - -=== Entity-Attribute and Graph Tables - -Accumulo is ideal for storing entities and their attributes, especially if the -attributes are sparse. It is often useful to join several datasets together on common -entities within the same table. This can allow for the representation of graphs, -including nodes, their attributes, and connections to other nodes. - -Rather than storing individual events, Entity-Attribute or Graph tables store -aggregate information about the entities involved in the events and the -relationships between entities. This is often preferable when single events aren't -very useful and when a continuously updated summarization is desired. - -The physical schema for an entity-attribute or graph table is as follows: - -[width="75%",cols="^,^,^,^"] -[grid="rows"] -[options="header"] -|================================================== -|RowID |Column Family |Column Qualifier |Value -|EntityID |Attribute Name |Attribute Value |Weight -|EntityID |Edge Type |Related EntityID |Weight -|================================================== - -For example, to keep track of employees, managers and products, the following -entity-attribute table could be used. Note that the weights are not always necessary -and are set to 0 when not used.
- -[width="75%",cols="^,^,^,^"] -[grid="rows"] -[options="header"] -|============================================= -|RowID |Column Family |Column Qualifier |Value -| E001 | name | bob | 0 -| E001 | department | sales | 0 -| E001 | hire_date | 20030102 | 0 -| E001 | units_sold | P001 | 780 -| E002 | name | george | 0 -| E002 | department | sales | 0 -| E002 | manager_of | E001 | 0 -| E002 | manager_of | E003 | 0 -| E003 | name | harry | 0 -| E003 | department | accounts_recv | 0 -| E003 | hire_date | 20000405 | 0 -| E003 | units_sold | P002 | 566 -| E003 | units_sold | P001 | 232 -| P001 | product_name | nike_airs | 0 -| P001 | product_type | shoe | 0 -| P001 | in_stock | germany | 900 -| P001 | in_stock | brazil | 200 -| P002 | product_name | basic_jacket | 0 -| P002 | product_type | clothing | 0 -| P002 | in_stock | usa | 3454 -| P002 | in_stock | germany | 700 -|============================================= - -To allow efficient updating of edge weights, an aggregating iterator can be -configured to add the value of all mutations applied with the same key. These types -of tables can easily be created from raw events by simply extracting the entities, -attributes, and relationships from individual events and inserting the keys into -Accumulo each with a count of 1. The aggregating iterator will take care of -maintaining the edge weights. - -=== Document-Partitioned Indexing - -Using a simple index as described above works well when looking for records that -match one of a set of given criteria. When looking for records that match more than -one criterion simultaneously, such as when looking for documents that contain all of -the words `the' and `white' and `house', there are several issues. - -The first is that the set of all records matching any one of the search terms must be sent -to the client, which incurs a lot of network traffic. The second problem is that the -client is responsible for performing set intersection on the sets of records returned -to eliminate all but the records matching all search terms. The memory of the client -may easily be overwhelmed during this operation. - -For these reasons Accumulo includes support for a scheme known as sharded -indexing, in which these set operations can be performed at the TabletServers and -decisions about which records to include in the result set can be made without -incurring network traffic. - -This is accomplished via partitioning records into bins that each reside on at most -one TabletServer, and then creating an index of terms per record within each bin as -follows: - -[width="75%",cols="^,^,^,^"] -[grid="rows"] -[options="header"] -|============================================== -|RowID |Column Family |Column Qualifier |Value -|BinID |Term |DocID |Weight -|============================================== - -Documents or records are mapped into bins by a user-defined ingest application. By -storing the BinID as the RowID we ensure that all the information for a particular -bin is contained in a single tablet and hosted on a single TabletServer since -Accumulo never splits rows across tablets. Storing the Terms as column families -serves to enable fast lookups of all the documents within this bin that contain the -given term. - -Finally, we perform set intersection operations on the TabletServer via a special -iterator called the Intersecting Iterator. Since documents are partitioned into many -bins, a search of all documents must search every bin. We can use the BatchScanner -to scan all bins in parallel.
The Intersecting Iterator should be enabled on a -BatchScanner within user query code as follows: - -[source,java] ----- -Text[] terms = {new Text("the"), new Text("white"), new Text("house")}; - -BatchScanner bscan = conn.createBatchScanner(table, auths, 20); - -IteratorSetting iter = new IteratorSetting(20, "ii", IntersectingIterator.class); -IntersectingIterator.setColumnFamilies(iter, terms); - -bscan.addScanIterator(iter); -bscan.setRanges(Collections.singleton(new Range())); - -for(Entry<Key,Value> entry : bscan) { - System.out.println(" " + entry.getKey().getColumnQualifier()); -} ----- - -This code effectively has the BatchScanner scan all tablets of a table, looking for -documents that match all the given terms. Because all tablets are being scanned for -every query, each query is more expensive than other Accumulo scans, which -typically involve a small number of TabletServers. This reduces the number of -concurrent queries supported and is subject to what is known as the `straggler' -problem, in which every query runs as slowly as the slowest server participating. - -Of course, fast servers will return their results to the client, which can display them -to the user immediately while waiting for the rest of the results to arrive. If the -results are unordered, this is quite effective, as the first results to arrive are as good -as any others to the user.