zclllyybb commented on issue #63694: URL: https://github.com/apache/doris/issues/63694#issuecomment-4544186906
Breakwater-GitHub-Analysis-Slot: slot_75b5f7a31eb3 Initial triage: this looks like a valid performance/caching enhancement for the HMS external table planning path, not a correctness bug. I checked current `origin/master` locally at `4483daf9f03`. The relevant flow matches the report: - `ExternalTable.getRowCount()` initializes the table and loads the row-count cache through `ExternalRowCountCache.getCachedRowCount(...)`. - `ExternalRowCountCache.RowCountCacheLoader` calls `table.fetchRowCount()`. - `HMSExternalTable.fetchRowCount()` first tries HMS table parameters, then falls back to `getRowCountFromFileList()` when the row count is unknown and `enable_get_row_count_from_file_list` is enabled. - In that fallback, `getFilesForPartitions(...)` explicitly uses `cache.getAllPartitionsWithoutCache(...)` and then `cache.getFilesByPartitions(..., withCache=false, ...)`. - Later scan planning for the same HMS table uses the normal cached path: `HiveScanNode.getPartitions()` calls `getAllPartitionsWithCache(...)`, and file split generation calls `getFilesByPartitions(...)` with `withCache = Config.max_external_file_cache_num > 0`. So the same Hive partition/file metadata can indeed be fetched once for row-count estimation and then fetched again for scan planning in one planning process. The issue description is consistent with the current code. Suggested implementation direction: - Add an explicit row-count loading mode, for example `QUERY_PLANNING` vs `DISPLAY_ONLY`, rather than inferring intent inside `HMSExternalTable`. - Use the planning mode from `ExternalTable.getRowCount()` / optimizer stats paths, and allow the HMS file-list row-count fallback to populate or reuse the normal Hive partition/file caches in that mode. - Keep display-oriented paths such as `getCachedRowCount()`, `SHOW TABLE STATUS`, `SHOW STATS`, and `information_schema.tables` lightweight. If the intent is strictly "cached value only", consider wiring them to `ExternalRowCountCache.getCachedRowCountIfPresent(...)` or equivalent non-loading behavior; the method already exists but is not currently used. - For HMS row-count estimation, make the partition and file metadata APIs mode-dependent: cached APIs for query planning, non-cached APIs for display/background-only loading. Recommended validation: - Add an FE unit test or mock-cache test proving that planning-time row-count estimation for an HMS table uses the cached partition/file path. - Add a companion test proving that display-only calls do not populate heavy Hive partition/file caches just to render row count. - If possible, include a simple before/after planning profile or FE log snippet for a partitioned HMS table without `numRows` in HMS parameters and with `enable_get_row_count_from_file_list=true`, showing that partition/file listing happens once after the change. No additional user logs are required to accept this as an enhancement, but a concrete reproduction table shape would help measure impact: Doris commit/version, catalog properties, partition count, file count, `hive_stats_partition_sample_size`, `max_external_file_cache_num`, `fetch_hive_row_count_sync`, and a representative `EXPLAIN` or planning profile. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
