eab148 commented on PR #6327: URL: https://github.com/apache/hbase/pull/6327#issuecomment-2420034164
> Since I cannot leave a comment on the design doc, would you mind filling in some detail about how these statistics are queried and used? Thanks @eab148 !

#### Querying row statistics

We have an API that fetches our row statistics and performs aggregations/filtering on them. The results are cached for a period of time. The available API queries include:

- Get a given region's row statistics
- Sample X row statistics for a given table/CF
- Aggregate all of the row statistics across table/CF pairs
- Fetch the "top N" row statistics for a given field and table/CF pair
  - In other words, we may provide `LARGEST_ROW_NUM_BYTES` and fetch the top N row statistics ordered by `RowStatistics::largestRowNumBytes` (descending)

As a reminder, the cells in our internal row statistics table have the following fields:

- **rowKey:** full region name
- **family:** `0`
- **qualifier:** `1` for major compaction row statistics, `0` for minor compaction row statistics
- **value:** JSON blob of the `RowStatistics` object

(A minimal client-side read sketch is included at the end of this comment.)

#### Using row statistics

At my day job, we've used the row statistics to

1. Tune the block sizes for tables that serve random read workloads, reducing disk I/O on the relevant clusters
2. Remove huge cells (>512 KiB) from our tables. Huge cells are ticking time bombs in HBase, as they cannot be cached in memory without admin intervention/memory configuration changes.
3. Implement smarter compaction schedules, reducing the daily network data transfer cost of our HBase setup

##### Tune Block Sizes

Our block size tuning job halves the block size for each family of a table if

- The table's "typical row" can fit into the smaller block size. At my day job, clients usually query full rows, so we want all cells for a given row to be in the same block.
- The cluster's memory has space for the larger index size
- The table serves mostly random read traffic. Tuning block sizes for sequential read workloads is more complicated, because estimating the number of blocks a Scan will require is ambiguous once you consider filtered, partial, _and_ full table Scans.

(A rough sketch of this decision is included at the end of this comment.)

##### Remove Huge Cells

1. Use our Row Statistics API to aggregate all of the row statistics for a given table.
2. Use this response to find the number of cells that exceed the max cache size, which we have set to 512 KiB.
3. Alert the relevant teams within our organization that they need to remove or break up these cells in their HBase tables.

##### Smarter Compaction Schedules

1. Use our Row Statistics API to fetch the most recent row statistics for a given region.
2. Estimate the amount of useful work that a major compaction would accomplish, based on the number of StoreFiles in that region, the number of tombstones in that region, and the time since that region's last compaction. (A rough sketch of this estimate is included at the end of this comment.)
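##### Sketch: Reading a Region's Row Statistics Cell

For illustration only, here is a minimal sketch of how a client could read one region's major compaction row statistics with the standard HBase client, following the schema above. The table name `row-statistics` and the surrounding class are assumptions for this example, not part of this PR.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public final class RowStatisticsReadSketch {

  // Schema described above: family "0"; qualifier "1" = major compaction stats,
  // "0" = minor compaction stats; value = JSON blob of the RowStatistics object.
  private static final byte[] FAMILY = Bytes.toBytes("0");
  private static final byte[] MAJOR_COMPACTION_QUALIFIER = Bytes.toBytes("1");

  /** Returns the JSON blob for the region's major compaction row statistics, or null if absent. */
  public static String fetchMajorCompactionStats(Connection conn, String fullRegionName)
      throws java.io.IOException {
    // "row-statistics" is a placeholder table name for this sketch.
    try (Table table = conn.getTable(TableName.valueOf("row-statistics"))) {
      Get get = new Get(Bytes.toBytes(fullRegionName)); // rowKey = full region name
      get.addColumn(FAMILY, MAJOR_COMPACTION_QUALIFIER);
      Result result = table.get(get);
      byte[] json = result.getValue(FAMILY, MAJOR_COMPACTION_QUALIFIER);
      return json == null ? null : Bytes.toString(json);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      System.out.println(fetchMajorCompactionStats(conn, args[0]));
    }
  }
}
```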
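##### Sketch: Block Size Tuning Decision

A minimal sketch of the halving decision described under "Tune Block Sizes". The input names and the 90% random-read cutoff are assumptions for illustration; the real job derives these inputs from the row statistics and cluster metrics.

```java
// Hypothetical helper; not the actual tuning job.
final class BlockSizeTuningSketch {

  /** Returns true if halving the block size looks safe and worthwhile for this family. */
  static boolean shouldHalveBlockSize(long currentBlockSizeBytes,
                                      long typicalRowNumBytes,          // e.g. a high percentile of row size
                                      double randomReadFraction,        // fraction of reads that are random Gets
                                      long projectedIndexSizeBytes,     // index size after halving (~2x today's)
                                      long availableIndexMemoryBytes) { // memory budget for block indexes
    long halvedBlockSizeBytes = currentBlockSizeBytes / 2;

    // Clients usually read full rows, so a "typical row" should still fit in one block.
    boolean rowFitsInSmallerBlock = typicalRowNumBytes <= halvedBlockSizeBytes;

    // Halving the block size roughly doubles the index; make sure memory has room for it.
    boolean indexFitsInMemory = projectedIndexSizeBytes <= availableIndexMemoryBytes;

    // Only tune tables that serve mostly random read traffic (0.9 is an illustrative cutoff).
    boolean mostlyRandomReads = randomReadFraction >= 0.9;

    return rowFitsInSmallerBlock && indexFitsInMemory && mostlyRandomReads;
  }
}
```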
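##### Sketch: Compaction Usefulness Estimate

A minimal sketch of the "useful work" estimate described under "Smarter Compaction Schedules". The field names and weights are illustrative assumptions, not our production scheduler.

```java
// Hypothetical scoring helper; higher scores mean a major compaction is expected to do more useful work.
final class CompactionUsefulnessSketch {

  static double usefulWorkScore(int storeFileCount,
                                long tombstoneCount,
                                long totalCellCount,
                                long hoursSinceLastCompaction) {
    // Merging more StoreFiles means fewer files to touch per read afterwards.
    double fileMergeBenefit = Math.max(0, storeFileCount - 1);

    // Fraction of cells that are tombstones; a major compaction can reclaim this space.
    double tombstoneFraction = totalCellCount == 0 ? 0.0 : (double) tombstoneCount / totalCellCount;

    // Staleness in days since the region was last compacted.
    double stalenessDays = hoursSinceLastCompaction / 24.0;

    // Illustrative weights; a real scheduler would tune these per cluster.
    return 1.0 * fileMergeBenefit + 10.0 * tombstoneFraction + 0.5 * stalenessDays;
  }
}
```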