Jaehui-Lee commented on PR #7482: URL: https://github.com/apache/hbase/pull/7482#issuecomment-3581654722
@wchevreuil Thank you for the review! I apologize for the insufficient explanation in my initial description. Let me provide additional clarification.

> And can you explain further what you mean by "non-cached reads (such as those from compactions)" ? I don't think compactions bypass the cache, AFAIK, it use same HFileReaderImpl in it's scanner, which always looks the cache first.

You're absolutely right! Compactions do read from the cache first. What I meant is that although compaction reads check the cache first, they generate many cache misses, and those misses are reflected in the `...(Hit|Miss)Count` metrics. However, since scanners created by compactions have `cacheBlocks` set to false, they don't affect the `...(Hit|Miss)CachingCount` metrics.

Therefore, to monitor the cache hit ratio specifically for client read requests to the HBase cluster (excluding internal operations like compactions), we need to check the `...HitCachingRatio` metric, which is what we currently monitor.

This can be verified with a simple test. With BucketCache enabled and an empty region server, follow these steps:

```
create 'test', 'cf'
put 'test', 'r1', 'cf:q', 'v1'
flush 'test'
put 'test', 'r2', 'cf:q', 'v2'
flush 'test'

# 1) Load one of the two blocks into the block cache
get 'test', 'r1', 'cf:q'

# 2) Observe metric changes after major compaction
major_compact 'test'
```

After step 1, all hit metrics remain at 0 (no change), while both `missCount` and `missCachingCount` increase to 2 (not entirely sure why it's 2, but that's what we observe).

After step 2, `hitCount` becomes 1, `hitCachingCount` remains 0, `missCount` becomes 3, and `missCachingCount` stays at 2. (These metrics were observed in the L2 cache section of the region server web UI.)

This demonstrates that reads from compactions are reflected in the general hit/miss metrics but not in the Caching metrics.

> I'm not sure this is true. We do account and expose separate hit rations for L1 and L2.

For the reasons explained above, we specifically want to monitor the `(Hit|Miss)Caching` metrics rather than the general `Hit|Miss` metrics.

When `BucketCache` is enabled, `BlockCache` is used as L1 and `BucketCache` as L2, which is represented as `CombinedBlockCache` in the HBase implementation. The currently exposed `BlockCache(...Count|...Ratio)` metrics represent the combined sum of L1 and L2. While we need to monitor each cache tier separately, there are currently no `Caching` metrics available for individual cache layers.

The metrics you mentioned only include `(Hit|Miss)Count` format metrics, which, as explained above, are significantly influenced by operations like compactions and don't represent the values we're looking for.

The `Caching` metrics for each cache tier are not currently exposed via JMX (which is what this PR aims to enable), but they can be viewed in the BlockCache section of the region server web UI.

As an example from our production cluster, here are the L2 metrics:

- Hits: 14,709,630
- Hits Caching: 12,340,830
- Misses: 59,908,348
- Misses Caching: 17,610,008
- Hit Ratio: 19.71%

When we calculate the Hit Caching Ratio from these values (Hits Caching / (Hits Caching + Misses Caching)), we get 41.2%, which differs significantly from the 19.7% general hit ratio. This is why separate caching metrics are important for accurate monitoring of client-driven cache performance.

Please let me know if anything is still unclear or if I've misunderstood something.

Thanks

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
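
For readers who want to check the numbers in the comment above, here is a minimal sketch of the counter behaviour it describes. `ToyCacheStats` and `ratio` are hypothetical names made up for illustration, not HBase's actual classes or API. The only assumption is the one stated in the comment: every read updates the general hit/miss counter, and the Caching counter is updated only when the read has `cacheBlocks` set to true. The block replays the shell test and then recomputes both ratios from the production L2 figures.

```python
from dataclasses import dataclass


@dataclass
class ToyCacheStats:
    """Toy model of the four counters discussed above (not HBase code)."""
    hit_count: int = 0
    hit_caching_count: int = 0
    miss_count: int = 0
    miss_caching_count: int = 0

    def hit(self, caching: bool) -> None:
        # Every hit is counted; the Caching counter only for cacheBlocks=true reads.
        self.hit_count += 1
        if caching:
            self.hit_caching_count += 1

    def miss(self, caching: bool) -> None:
        # Every miss is counted; the Caching counter only for cacheBlocks=true reads.
        self.miss_count += 1
        if caching:
            self.miss_caching_count += 1


def ratio(hits: int, misses: int) -> float:
    """Return hits / (hits + misses), or 0.0 when there were no accesses."""
    total = hits + misses
    return hits / total if total else 0.0


# Replay of the shell test as reported in the comment.
stats = ToyCacheStats()
stats.miss(caching=True)    # step 1: the client get produced two misses
stats.miss(caching=True)
stats.hit(caching=False)    # step 2: compaction hits the block cached in step 1
stats.miss(caching=False)   # step 2: compaction misses the other block

# Ratios recomputed from the production L2 figures quoted in the comment.
prod_hit_ratio = ratio(14_709_630, 59_908_348)
prod_hit_caching_ratio = ratio(12_340_830, 17_610_008)
print(f"hit ratio: {prod_hit_ratio:.2%}, hit caching ratio: {prod_hit_caching_ratio:.2%}")
```

After the replay the toy counters read 1 / 0 / 3 / 2 (hits, hits caching, misses, misses caching), matching the values observed in the web UI, and the two production ratios come out at roughly 19.7% and 41.2%.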
