eab148 commented on PR #6327:
URL: https://github.com/apache/hbase/pull/6327#issuecomment-2420034164

   > Since I cannot leave a comment on the design doc, would you mind filling in some detail about how these statistics are queried and used? Thanks @eab148 !
   
   #### Querying row statistics:
   
   We have an API that fetches our row statistics and performs aggregations and filtering on them. The results are cached for a period of time.
   
   The available API queries include:
   
   - Get a given region's row statistics
   - Sample X row statistics for a given table/CF
   - Aggregate all of the row statistics across table/CF pairs
   - Fetch the "top N" row statistics for a given field and table/CF pair
     - For example, we may provide `LARGEST_ROW_NUM_BYTES` to fetch the top N row statistics ordered by `RowStatistics::largestRowNumBytes`, descending
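
   To make the query surface concrete, here is a minimal sketch of what such an API could look like. The interface and method names are hypothetical illustrations, not part of this PR, and the statistics are returned here as their raw JSON blobs:

   ```java
   import java.util.List;

   /** Hypothetical query surface for the row statistics API sketched above. */
   public interface RowStatisticsQueryApi {

     /** Fields the top-N query can order by; only this one appears above. */
     enum Field { LARGEST_ROW_NUM_BYTES }

     /** Get a given region's row statistics as a JSON blob. */
     String getForRegion(String fullRegionName);

     /** Sample up to {@code count} row statistics for a given table/CF pair. */
     List<String> sample(String table, String family, int count);

     /** Aggregate all of the row statistics for a given table/CF pair. */
     String aggregate(String table, String family);

     /** Fetch the top-N row statistics for {@code field}, descending. */
     List<String> topN(String table, String family, Field field, int n);
   }
   ```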
   
   As a reminder, the cells in our internal row statistics table have the 
following fields:
   
   - **rowKey:** full region name
   - **family:** `0`
   - **qualifier:** `1` for major compaction row statistics, `0` for minor compaction row statistics
   - **value:** JSON blob of the `RowStatistics` object
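
   Given that schema, fetching one region's statistics is a plain `Get` against the statistics table. Below is a minimal, hedged sketch using the standard HBase client; the table name `row-statistics` is a placeholder, not the actual internal name:

   ```java
   import org.apache.hadoop.hbase.HBaseConfiguration;
   import org.apache.hadoop.hbase.TableName;
   import org.apache.hadoop.hbase.client.Connection;
   import org.apache.hadoop.hbase.client.ConnectionFactory;
   import org.apache.hadoop.hbase.client.Get;
   import org.apache.hadoop.hbase.client.Result;
   import org.apache.hadoop.hbase.client.Table;
   import org.apache.hadoop.hbase.util.Bytes;

   public class RowStatisticsReader {

     private static final TableName STATS_TABLE = TableName.valueOf("row-statistics"); // placeholder name
     private static final byte[] FAMILY = Bytes.toBytes("0");
     private static final byte[] MAJOR_COMPACTION = Bytes.toBytes("1"); // major compaction stats
     private static final byte[] MINOR_COMPACTION = Bytes.toBytes("0"); // minor compaction stats

     /** Fetch the major compaction row statistics JSON blob for one region. */
     public static String fetchMajorCompactionStats(Connection conn, String fullRegionName)
         throws java.io.IOException {
       try (Table table = conn.getTable(STATS_TABLE)) {
         Get get = new Get(Bytes.toBytes(fullRegionName)); // rowKey is the full region name
         get.addColumn(FAMILY, MAJOR_COMPACTION);
         Result result = table.get(get);
         byte[] value = result.getValue(FAMILY, MAJOR_COMPACTION);
         return value == null ? null : Bytes.toString(value); // JSON blob of the RowStatistics object
       }
     }

     public static void main(String[] args) throws Exception {
       try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
         System.out.println(fetchMajorCompactionStats(conn, args[0]));
       }
     }
   }
   ```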
   
   #### Using row statistics:
   
   At my day job, we've used the row statistics to:
   1. Tune the block sizes for tables that service random read workloads, 
reducing disk I/O on relevant clusters
   2. Remove huge cells (>512 KiB) from our tables. Huge cells are ticking time bombs in HBase, as they cannot be cached in memory without admin intervention or memory configuration changes.
   3. Implement smarter compaction schedules, reducing the daily network data 
transfer cost of our HBase setup
   
   ##### Tune Block Sizes
   
   Our block size tuning job halves the block size for each family of a table if (see the sketch at the end of this subsection):
   - The table’s "typical row" can fit into the smaller block size. At my day job, clients usually query full rows, so we want all cells for a given row to be in the same block.
   - The cluster’s memory has space for the larger index size.
   - The table serves mostly random read traffic.
   
   Tuning block sizes for sequential read workloads is more complicated: estimating the number of blocks a Scan will require is ambiguous once one considers filtered, partial, _and_ full table Scans.
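
   As a rough illustration, the halving decision can be sketched as a pure predicate. The parameter names and thresholds below are assumptions for illustration, not our production values:

   ```java
   /** A minimal sketch of the block size halving decision described above. */
   public final class BlockSizeTuner {

     /** All inputs are hypothetical names; thresholds are illustrative only. */
     static boolean shouldHalveBlockSize(long typicalRowNumBytes,
                                         int currentBlockSizeBytes,
                                         long projectedIndexSizeBytes,
                                         long availableIndexMemoryBytes,
                                         double randomReadFraction) {
       int halvedBlockSize = currentBlockSizeBytes / 2;
       boolean rowFits = typicalRowNumBytes <= halvedBlockSize;                  // typical row fits the smaller block
       boolean indexFits = projectedIndexSizeBytes <= availableIndexMemoryBytes; // room for the larger index
       boolean mostlyRandom = randomReadFraction >= 0.9;                         // assumed cutoff for "mostly random"
       return rowFits && indexFits && mostlyRandom;
     }
   }
   ```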
   
   ##### Remove Huge Cells
   
   1. Use our Row Statistics API to aggregate all of the row statistics for a 
given table. 
   2. Use this response to find the number of cells that exceed the max cache 
size, which we have set to 512 KiB.
   3. Alert the relevant teams within our organization that they need to remove 
or break up these cells in their HBase tables.
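
   Step 2 is straightforward if the aggregated response exposes per-cell-size counts. A hedged sketch, assuming a hypothetical `(cell size -> count)` histogram in the API response:

   ```java
   import java.util.Map;

   /** A minimal sketch of the huge-cell count from step 2 above. */
   public final class HugeCellAuditor {

     private static final long MAX_CACHE_SIZE_BYTES = 512L * 1024; // the 512 KiB cap from step 2

     /** Count cells larger than the cache cap, given a hypothetical size histogram. */
     static long countHugeCells(Map<Long, Long> cellSizeHistogram) {
       return cellSizeHistogram.entrySet().stream()
           .filter(e -> e.getKey() > MAX_CACHE_SIZE_BYTES)
           .mapToLong(Map.Entry::getValue)
           .sum();
     }
   }
   ```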
   
   ##### Smarter Compaction Schedules 
   
   1. Use our Row Statistics API to get the most recent row statistics for a given region.
   2. Estimate the amount of useful work that a major compaction would accomplish, based on the region's StoreFile count, its tombstone count, and the time since its last major compaction (sketched below).
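
   The scoring itself can be sketched as a simple saturating estimate. The saturation points and equal weighting below are assumptions for illustration, not our production tuning:

   ```java
   /** A rough sketch of the "useful work" estimate from step 2 above. */
   public final class CompactionScorer {

     /** Higher scores mean a major compaction would accomplish more useful work. */
     static double usefulWorkScore(int storeFileCount,
                                   long tombstoneCount,
                                   long millisSinceLastMajorCompaction) {
       double fileScore = Math.min(storeFileCount / 10.0, 1.0);             // saturate at 10 StoreFiles
       double tombstoneScore = Math.min(tombstoneCount / 1_000_000.0, 1.0); // saturate at 1M tombstones
       double ageScore =
           Math.min(millisSinceLastMajorCompaction / (7 * 24 * 3600 * 1000.0), 1.0); // saturate at one week
       return (fileScore + tombstoneScore + ageScore) / 3.0;                // equal weights, illustrative
     }
   }
   ```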

