Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24153 )

Change subject: IMPALA-14597: Initial HBO support
......................................................................


Patch Set 4:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@17
PS3, Line 17: historical runs. The key in the HBO cache is a hash of this 
string. In
> typo: historiical -> historical
Done


http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@34
PS3, Line 34: e.g. "a IN (2, 1)" -> "a IN (1, 2)". It guarantees the 
expressions are
> typo: guarentees -> guarantees
Done


http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@43
PS3, Line 43: partition predicates, e.g., ds > '20260331', it keeps them as-is.
> as-it -> as-is?
Done


http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@46
PS3, Line 46: canonicalization strategy. In the future, we can add more 
strategies,
> stragety -> strategy
Done


http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@105
PS3, Line 105:
> strategy
Done


http://gerrit.cloudera.org:8080/#/c/24153/3//COMMIT_MSG@115
PS3, Line 115:      partition predicates: b.`year` = 2010
> retrived -> retrieved
Done


http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/common/global-flags.cc
File be/src/common/global-flags.cc:

http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/common/global-flags.cc@240
PS3, Line 240:     "Default is 1GB. The in-memory backend is mainly used for 
testing.");
> This seems a little big? But it's probably fine; lite users in a small envi
Yeah, that makes sense. I need some tests to measure the cache size to support 
a given workload. Currently just picking an arbitual default value like other 
caches, e.g., codegen_cache_capacity.


http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/common/global-flags.cc@243
PS3, Line 243:     "Prefix of log filename - "
> Should note the default.
The new patch set removes this flag and uses unregistration_thread_pool_size 
directly. This flag is used for concurrencyLevel in Guava cache which limits 
the write concurrency. The writers of HBO stats are threads from the 
thread-pool for unregistering queries. Their concurrency is configured by 
unregistration_thread_pool_size. I think using that flag is simpler.


http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/runtime/coordinator.cc@a1887
PS3, Line 1887:
> Why remove the comment?
Oops, unintentionally removed in rebasing the conflicts.


http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/service/impala-server.h
File be/src/service/impala-server.h:

http://gerrit.cloudera.org:8080/#/c/24153/3/be/src/service/impala-server.h@1264
PS3, Line 1264:   /// <ms-since-epoch> <query-id> <thrift query profile URL 
encoded and gzipped>
> nit: could be a static function rather than part of the class interface.
Done


http://gerrit.cloudera.org:8080/#/c/24153/3/common/thrift/HBO.thrift
File common/thrift/HBO.thrift:

http://gerrit.cloudera.org:8080/#/c/24153/3/common/thrift/HBO.thrift@34
PS3, Line 34: TPlanNodeRun
I'm trying to avoid using required fields since this will be shared across 
coordinators in the future (when we support shared storage like Redis). Even 
coordinators from different clusters (might be in different versions) can share 
the HBO stats. Using optional fields help to evolving this structure, e.g. we 
might add fields like kudu_scan_timestamp, cpu usages, etc.

> why not use simple java objects?

They are passed to backend to update fields of execution stats like num_rows, 
mem_usage. Using thrift objects is simpler.


http://gerrit.cloudera.org:8080/#/c/24153/3/fe/src/main/java/org/apache/impala/service/HistoricalStats.java
File fe/src/main/java/org/apache/impala/service/HistoricalStats.java:

http://gerrit.cloudera.org:8080/#/c/24153/3/fe/src/main/java/org/apache/impala/service/HistoricalStats.java@124
PS3, Line 124:     // TODO: handle races from concurrent writers.
> This doesn't look hard using asMap().compute() on the map.
That's a good point! Need the CacheBackend to expose a compute interface for 
this since GuavaCache is just one of the implementations.


http://gerrit.cloudera.org:8080/#/c/24153/3/fe/src/main/java/org/apache/impala/service/InMemoryCacheBackend.java
File fe/src/main/java/org/apache/impala/service/InMemoryCacheBackend.java:

http://gerrit.cloudera.org:8080/#/c/24153/3/fe/src/main/java/org/apache/impala/service/InMemoryCacheBackend.java@73
PS3, Line 73: weighHistoricalStatsValue
> It seems simpler to store the stats in serialized format, which would make
That's a good point! I haven't measured the performance yet. I'll think more 
about where to serialize the values (inside/outside CacheBackend) and what 
kinds of format (Java Serialization/Thrift/FlatBuffer) to use. Also need to 
consider whether moving the read-and-update logic inside CacheBackend so we can 
make it atomic, e.g., by using asMap().compute() in Guava as you mentioned.

Note that though adding serialization simplify the weighing code, it adds codes 
in serialization and de-serialization which might be longer if using 
Thrift/FlatBuffer.


http://gerrit.cloudera.org:8080/#/c/24153/3/fe/src/main/java/org/apache/impala/service/InMemoryCacheBackend.java@97
PS3, Line 97:    */
> Is this method used?
Removed this unused default constructor. I think it's a good point to 
coordinate/unify the caches in JVM. Since this InMemoryCacheBackend would 
mainly be used in tests due to its local nature, we can keep it simple.



--
To view, visit http://gerrit.cloudera.org:8080/24153
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6ff60a8bd22c13c0ecad1198934cc96249b1015e
Gerrit-Change-Number: 24153
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Surya Hebbar <[email protected]>
Gerrit-Comment-Date: Mon, 13 Apr 2026 14:07:13 +0000
Gerrit-HasComments: Yes

Reply via email to