Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/24153 )
Change subject: IMPALA-14597: Initial HBO support ...................................................................... Patch Set 4: (14 comments) http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@17 PS3, Line 17: historical runs. The key in the HBO cache is a hash of this string. In > typo: historiical -> historical Done http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@34 PS3, Line 34: e.g. "a IN (2, 1)" -> "a IN (1, 2)". It guarantees the expressions are > typo: guarentees -> guarantees Done http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@43 PS3, Line 43: partition predicates, e.g., ds > '20260331', it keeps them as-is. > as-it -> as-is? Done http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@46 PS3, Line 46: canonicalization strategy. In the future, we can add more strategies, > stragety -> strategy Done http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@105 PS3, Line 105: > strategy Done http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@115 PS3, Line 115: partition predicates: b.`year` = 2010 > retrived -> retrieved Done http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/common/global-flags.cc File be/src/common/global-flags.cc: http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/common/global-flags.cc@240 PS3, Line 240: "Default is 1GB. The in-memory backend is mainly used for testing."); > This seems a little big? But it's probably fine; lite users in a small envi Yeah, that makes sense. I need some tests to measure the cache size to support a given workload. Currently just picking an arbitual default value like other caches, e.g., codegen_cache_capacity. http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/common/global-flags.cc@243 PS3, Line 243: "Prefix of log filename - " > Should note the default. The new patch set removes this flag and uses unregistration_thread_pool_size directly. This flag is used for concurrencyLevel in Guava cache which limits the write concurrency. The writers of HBO stats are threads from the thread-pool for unregistering queries. Their concurrency is configured by unregistration_thread_pool_size. I think using that flag is simpler. http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/runtime/coordinator.cc@a1887 PS3, Line 1887: > Why remove the comment? Oops, unintentionally removed in rebasing the conflicts. http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/service/impala-server.h@1264 PS3, Line 1264: /// <ms-since-epoch> <query-id> <thrift query profile URL encoded and gzipped> > nit: could be a static function rather than part of the class interface. Done http://gerrit.cloudera.org:8080/#/c/24153/3/common/thrift/HBO.thrift File common/thrift/HBO.thrift: http://gerrit.cloudera.org:8080/#/c/24153/3/common/thrift/HBO.thrift@34 PS3, Line 34: TPlanNodeRun I'm trying to avoid using required fields since this will be shared across coordinators in the future (when we support shared storage like Redis). Even coordinators from different clusters (might be in different versions) can share the HBO stats. Using optional fields help to evolving this structure, e.g. we might add fields like kudu_scan_timestamp, cpu usages, etc. > why not use simple java objects? They are passed to backend to update fields of execution stats like num_rows, mem_usage. Using thrift objects is simpler. http://gerrit.cloudera.org:8080/#/c/24153/3/fe/src/main/java/org/apache/impala/service/HistoricalStats.java File fe/src/main/java/org/apache/impala/service/HistoricalStats.java: http://gerrit.cloudera.org:8080/#/c/24153/3/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@124 PS3, Line 124: // TODO: handle races from concurrent writers. > This doesn't look hard using asMap().compute() on the map. That's a good point! Need the CacheBackend to expose a compute interface for this since GuavaCache is just one of the implementations. http://gerrit.cloudera.org:8080/#/c/24153/3/fe/src/main/java/org/apache/impala/service/InMemoryCacheBackend.java File fe/src/main/java/org/apache/impala/service/InMemoryCacheBackend.java: http://gerrit.cloudera.org:8080/#/c/24153/3/fe/src/main/java/org/apache/impala/service/InMemoryCacheBackend.java@73 PS3, Line 73: weighHistoricalStatsValue > It seems simpler to store the stats in serialized format, which would make That's a good point! I haven't measured the performance yet. I'll think more about where to serialize the values (inside/outside CacheBackend) and what kinds of format (Java Serialization/Thrift/FlatBuffer) to use. Also need to consider whether moving the read-and-update logic inside CacheBackend so we can make it atomic, e.g., by using asMap().compute() in Guava as you mentioned. Note that though adding serialization simplify the weighing code, it adds codes in serialization and de-serialization which might be longer if using Thrift/FlatBuffer. http://gerrit.cloudera.org:8080/#/c/24153/3/fe/src/main/java/org/apache/impala/service/InMemoryCacheBackend.java@97 PS3, Line 97: */ > Is this method used? Removed this unused default constructor. I think it's a good point to coordinate/unify the caches in JVM. Since this InMemoryCacheBackend would mainly be used in tests due to its local nature, we can keep it simple. -- To view, visit http://gerrit.cloudera.org:8080/24153 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6ff60a8bd22c13c0ecad1198934cc96249b1015e Gerrit-Change-Number: 24153 Gerrit-PatchSet: 4 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Surya Hebbar <[email protected]> Gerrit-Comment-Date: Mon, 13 Apr 2026 14:07:13 +0000 Gerrit-HasComments: Yes
