This is an automated email from the ASF dual-hosted git repository.

ctubbsii pushed a commit to branch next-release
in repository https://gitbox.apache.org/repos/asf/accumulo-website.git

The following commit(s) were added to refs/heads/next-release by this push:
     new 73f66ef  Update Compaction documentation apache/accumulo#1613 (#232)
73f66ef is described below

commit 73f66efaf21ac266fe154d9609da92a7e851cace
Author: Keith Turner <ktur...@apache.org>
AuthorDate: Tue Feb 9 18:56:06 2021 -0500

    Update Compaction documentation apache/accumulo#1613 (#232)

    * Add throughput to example
---
 _docs-2/administration/compaction.md           | 119 +++++++++++++++++++++++++
 _docs-2/getting-started/table_configuration.md | 108 +---------------------
 css/accumulo.scss                              |   4 +-
 3 files changed, 122 insertions(+), 109 deletions(-)

diff --git a/_docs-2/administration/compaction.md b/_docs-2/administration/compaction.md
new file mode 100644
index 0000000..18786a6
--- /dev/null
+++ b/_docs-2/administration/compaction.md
@@ -0,0 +1,119 @@
+---
+title: Compactions
+category: administration
+order: 6
+---
+
+In Accumulo each tablet has a list of files associated with it. As data is
+written to Accumulo it is buffered in memory. The data buffered in memory is
+eventually written to files in DFS on a per tablet basis. Files can also be
+added to tablets directly by bulk import. In the background tablet servers run
+major compactions to merge multiple files into one. The tablet server has to
+decide which tablets to compact and which files within a tablet to compact.
+
+Within each tablet server there are one or more user configurable Comapction
+Services that compact tablets. Each Accumulo table has a user configurable
+Compaction Dispatcher that decides which compaction services that table will
+use. Accumulo generates metrics for each compaction service which enable users
+to adjust compaction service settings based on actual activity.
+
+Each compaction service has a compaction planner that decides which files to
+compact. The default compaction planner uses the table property {% plink
+table.compaction.major.ratio %} to decide which files to compact. The
+compaction ratio is real number >= 1.0. Assume LFS is the size of the largest
+file in a set, CR is the compaction ratio, and FSS is the sum of file sizes in
+a set. The default planner looks for file sets where LFS*CR <= FSS. By only
+compacting sets of files that meet this requirement the amount of work done by
+compactions is O(N * log<sub>CR</sub>(N)). Increasing the ratio will
+result in less compaction work and more files per tablet. More files per
+tablet means more higher query latency. So adjusting this ratio is a trade off
+between ingest and query performance.
+
+When CR=1.0 this will result in a goal of a single per file tablet, but the
+amount of work is O(N<sup>2</sup>) so 1.0 should be used with caution. For
+example if a tablet has a 1G file and 1M file is added, then a compaction of
+the 1G and 1M file would be queued.
+
+Compaction services and dispatchers were introduced in Accumulo 2.1, so much
+of this documentation only applies to Accumulo 2.1 and later.
+
+## Configuration
+
+Below are some Accumulo shell commands that do the following :
+
+ * Create a compaction service named `cs1` that has three executors. The first executor named `small` has 8 threads and runs compactions less than 16M. The second executor `medium` runs compactions less than 128M with 4 threads. The last executor `large` runs all other compactions.
+ * Create a compaction service named `cs2` that has three executors. It has similar config to `cs1`, but its executors have less threads. Limits total I/O of all compactions within the service to 40MB/s.
+* Configure table `ci` to use compaction service `cs1` for system compactions and service `cs2` for user compactions.
+
+```
+config -s tserver.compaction.major.service.cs1.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
+config -s 'tserver.compaction.major.service.cs1.planner.opts.executors=[{"name":"small","maxSize":"16M","numThreads":8},{"name":"medium","maxSize":"128M","numThreads":4},{"name":"large","numThreads":2}]'
+config -s tserver.compaction.major.service.cs2.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
+config -s 'tserver.compaction.major.service.cs2.planner.opts.executors=[{"name":"small","maxSize":"16M","numThreads":4},{"name":"medium","maxSize":"128M","numThreads":2},{"name":"large","numThreads":1}]'
+config -s tserver.compaction.major.service.cs2.throughput=40M
+config -t ci -s table.compaction.dispatcher=org.apache.accumulo.core.spi.compaction.SimpleCompactionDispatcher
+config -t ci -s table.compaction.dispatcher.opts.service=cs1
+config -t ci -s table.compaction.dispatcher.opts.service.user=cs2
+```
+
+For more information see the javadoc for {% jlink org.apache.accumulo.core.spi.compaction %},
+{% jlink org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner %} and
+{% jlink org.apache.accumulo.core.spi.compaction.SimpleCompactionDispatcher %}
+
+The names of the compaction services and executors are used for logging and metrics.
+
+## Logging
+
+The names of compaction services and executors are used in logging. The log
+messages below are from a tserver with the configuration above with data being
+written to the ci table. Also a compaction of the table was forced from the
+shell.
+
+```
+2020-06-25T16:34:31,669 [tablet.files] DEBUG: Compacting 3;667;6 on cs1.small for SYSTEM from [C00001cm.rf, C00001a7.rf, F00001db.rf] size 15 MB
+2020-06-25T16:34:45,165 [tablet.files] DEBUG: Compacted 3;667;6 for SYSTEM created hdfs://localhost:8020/accumulo/tables/3/t-000006f/C00001de.rf from [C00001cm.rf, C00001a7.rf, F00001db.rf]
+2020-06-25T16:35:01,965 [tablet.files] DEBUG: Compacting 3;667;6 on cs1.medium for SYSTEM from [C00001de.rf, A000017v.rf, F00001e7.rf] size 33 MB
+2020-06-25T16:35:11,686 [tablet.files] DEBUG: Compacted 3;667;6 for SYSTEM created hdfs://localhost:8020/accumulo/tables/3/t-000006f/A00001er.rf from [C00001de.rf, A000017v.rf, F00001e7.rf]
+2020-06-25T16:37:12,521 [tablet.files] DEBUG: Compacting 3;667;6 on cs2.medium for USER from [F00001f8.rf, A00001er.rf] size 35 MB config []
+2020-06-25T16:37:17,917 [tablet.files] DEBUG: Compacted 3;667;6 for USER created hdfs://localhost:8020/accumulo/tables/3/t-000006f/A00001fr.rf from [F00001f8.rf, A00001er.rf]
+```
+
+## Metrics
+
+The numbers of major and minor compactions running and queued is visible on the
+Accumulo monitor page. This allows you to see if compactions are backing up
+and adjustments to the above settings are needed. When adjusting the number of
+threads available for compactions, consider the number of cores and other tasks
+running on the nodes.
+
+The numbers displayed on the Accumulo monitor are an aggregate of all
+compaction services and executors. Accumulo emits metrics about the number of
+compactions queued and running on each compaction executor. Accumulo also
+emits metrics about the number of files per tablets. These metrics can be used
+to guide adjusting compaction ratios and compaction service configurations to ensure
+tablets do not have to many files.
+
+For example if metrics show that some compaction executors within a compaction
+service are under utilized while others are over utilized, then the
+configuration for compaction service may need to be adjusted. If the metrics
+show that all compaction executors are fully utilized for long periods then
+maybe the compaction ratio on a table needs to be increased.
+
+## User compactions
+
+Compactions can be initiated manually for a table. To initiate a minor
+compaction, use the `flush` command in the shell. To initiate a major compaction,
+use the `compact` command in the shell:
+
+    user@myinstance mytable> compact -t mytable
+
+If needed, the compaction can be canceled using `compact --cancel -t mytable`.
+
+The `compact` command will compact all tablets in a table to one file. Even tablets
+with one file are compacted. This is useful for the case where a major compaction
+filter is configured for a table. In 1.4, the ability to compact a range of a table
+was added. To use this feature specify start and stop rows for the compact command.
+This will only compact tablets that overlap the given row range.
+
+
+
diff --git a/_docs-2/getting-started/table_configuration.md b/_docs-2/getting-started/table_configuration.md
index 09e384d..0044790 100644
--- a/_docs-2/getting-started/table_configuration.md
+++ b/_docs-2/getting-started/table_configuration.md
@@ -343,109 +343,7 @@ in reduced read latency. Read the [Caching] documentation to learn more.
 
 ## Compaction
 
-As data is written to Accumulo it is buffered in memory. The data buffered in
-memory is eventually written to HDFS on a per tablet basis. Files can also be
-added to tablets directly by bulk import. In the background tablet servers run
-major compactions to merge multiple files into one. The tablet server has to
-decide which tablets to compact and which files within a tablet to compact.
-This decision is made using the compaction ratio, which is configurable on a
-per table basis by the [table.compaction.major.ratio] property.
-
-Increasing this ratio will result in more files per tablet and less compaction
-work. More files per tablet means more higher query latency. So adjusting
-this ratio is a trade off between ingest and query performance. The ratio
-defaults to 3.
-
-The way the ratio works is that a set of files is compacted into one file if the
-sum of the sizes of the files in the set is larger than the ratio multiplied by
-the size of the largest file in the set. If this is not true for the set of all
-files in a tablet, the largest file is removed from consideration, and the
-remaining files are considered for compaction. This is repeated until a
-compaction is triggered or there are no files left to consider.
-
-The number of background threads tablet servers use to run major and minor
-compactions is configured by the [tserver.compaction.major.concurrent.max]
-and [tserver.compaction.minor.concurrent.max] properties respectively.
-
-The numbers of major and minor compactions running and queued is visible on the
-Accumulo monitor page. This allows you to see if compactions are backing up
-and adjustments to the above settings are needed. When adjusting the number of
-threads available for compactions, consider the number of cores and other tasks
-running on the nodes such as maps and reduces.
-
-If major compactions are not keeping up, then the number of files per tablet
-will grow to a point such that query performance starts to suffer. One way to
-handle this situation is to increase the compaction ratio. For example, if the
-compaction ratio were set to 1, then every new file added to a tablet by minor
-compaction would immediately queue the tablet for major compaction. So if a
-tablet has a 200M file and minor compaction writes a 1M file, then the major
-compaction will attempt to merge the 200M and 1M file. If the tablet server
-has lots of tablets trying to do this sort of thing, then major compactions
-will back up and the number of files per tablet will start to grow, assuming
-data is being continuously written. Increasing the compaction ratio will
-alleviate backups by lowering the amount of major compaction work that needs to
-be done.
-
-Another option to deal with the files per tablet growing too large is to adjust
-the [table.file.max] property. When a tablet reaches this number of files and needs
-to flush its in-memory data to disk, it will choose to do a merging minor compaction.
-A merging minor compaction will merge the tablet's smallest file with the data in memory at
-minor compaction time. Therefore the number of files will not grow beyond this
-limit. This will make minor compactions take longer, which will cause ingest
-performance to decrease. This can cause ingest to slow down until major
-compactions have enough time to catch up. When adjusting this property, also
-consider adjusting the compaction ratio. Ideally, merging minor compactions
-never need to occur and major compactions will keep up. It is possible to
-configure the file max and compaction ratio such that only merging minor
-compactions occur and major compactions never occur. This should be avoided
-because doing only merging minor compactions causes O(N<sup>2</sup>) work to be done.
-The amount of work done by major compactions is `O(N*log<sub>R</sub>(N))` where
-R is the compaction ratio.
-
-Compactions can be initiated manually for a table. To initiate a minor
-compaction, use the `flush` command in the shell. To initiate a major compaction,
-use the `compact` command in the shell:
-
-    user@myinstance mytable> compact -t mytable
-
-If needed, the compaction can be canceled using `compact --cancel -t mytable`.
-
-The `compact` command will compact all tablets in a table to one file. Even tablets
-with one file are compacted. This is useful for the case where a major compaction
-filter is configured for a table. In 1.4, the ability to compact a range of a table
-was added. To use this feature specify start and stop rows for the compact command.
-This will only compact tablets that overlap the given row range.
-
-### Compaction Strategies
-
-The default behavior of major compactions is defined in the class {% jlink org.apache.accumulo.tserver.compaction.DefaultCompactionStrategy %}.
-This behavior can be changed by overriding [table.majc.compaction.strategy] with a fully
-qualified class name.
-
-Custom compaction strategies can have additional properties that are specified with the
-{% plink table.majc.compaction.strategy.opts.\* %} prefix.
-
-Accumulo provides a few classes that can be used as an alternative compaction strategy. These classes are located in the
-{% jlink -f org.apache.accumulo.tserver.compaction %} package. {% jlink org.apache.accumulo.tserver.compaction.EverythingCompactionStrategy %}
-will simply compact all files. This is the strategy used by the user `compact` command.
-
-{% jlink org.apache.accumulo.tserver.compaction.strategies.BasicCompactionStrategy %} is
-a compaction strategy that supports a few options based on file size. It
-supports filtering out large files from ever being included in a compaction.
-It also supports using a different compression algorithm for larger files.
-This allows frequent compactions of smaller files to use a fast algorithm and
-infrequent compactions of more data to use a slower algorithm. Using this may
-enable an increase in throughput w/o using a lot more space.
-
-The following shell command configures a table to use snappy for small files,
-gzip for files over 100M, and avoid compacting any file larger than 250M.
-
-    config -t myTable -s table.file.compress.type=snappy
-    config -t myTable -s table.majc.compaction.strategy=org.apache.accumulo.tserver.compaction.strategies.BasicCompactionStrategy
-    config -t myTable -s table.majc.compaction.strategy.opts.filter.size=250M
-    config -t myTable -s table.majc.compaction.strategy.opts.large.compress.threshold=100M
-    config -t myTable -s table.majc.compaction.strategy.opts.large.compress.type=gzip
-
+See {% dlink administration/compaction %}
 ## Pre-splitting tables
 
 Accumulo will balance and distribute tables across servers. Before a
@@ -719,9 +617,5 @@ preserved.
 [Scanner]: {% jurl org.apache.accumulo.core.client.Scanner %}
 [BatchScanner]: {% jurl org.apache.accumulo.core.client.BatchScanner %}
 [Caching]: {% durl administration/caching %}
-[table.compaction.major.ratio]: {% purl table.compaction.major.ratio %}
-[tserver.compaction.major.concurrent.max]: {% purl tserver.compaction.major.concurrent.max %}
-[tserver.compaction.minor.concurrent.max]: {% purl tserver.compaction.minor.concurrent.max %}
-[table.file.max]: {% purl table.file.max %}
 [table.bloom.enabled]: {% purl table.bloom.enabled %}
 [table.file.compress.type]: {% purl table.file.compress.type %}
diff --git a/css/accumulo.scss b/css/accumulo.scss
index 17609cb..6579cc9 100644
--- a/css/accumulo.scss
+++ b/css/accumulo.scss
@@ -43,13 +43,13 @@ body {
 
 pre code {
   font-size: 14px;
+  /* override nowrap in bootstrap */
+  white-space: pre;
 }
 
 code {
   background-color: #f5f5f5;
   color: #555;
-  /* override nowrap in bootstrap */
-  white-space: normal;
 }
 
 #nav-logo {