Copilot commented on code in PR #61945:
URL: https://github.com/apache/doris/pull/61945#discussion_r3015021033
##########
fe/fe-core/src/main/java/org/apache/doris/catalog/CloudTabletStatMgr.java:
##########
@@ -394,4 +548,78 @@ private void
filterTopTableStatsByDataSize(List<OlapTable.Statistics> newCloudTa
}
this.cloudTableStatsList = new ArrayList<>(topStats);
}
+
+ public void addActiveTablets(List<Long> tabletIds) {
+ if (Config.cloud_get_tablet_stats_version == 1 || tabletIds == null ||
tabletIds.isEmpty()) {
+ return;
+ }
+ activeTablets.addAll(tabletIds);
+ }
+
+ // master FE send update tablet stats rpc to other FEs
+ private void pushTabletStats(GetTabletStatsResponse response) {
+ List<Frontend> frontends = getFrontends();
+ if (frontends == null || frontends.isEmpty()) {
+ return;
+ }
+ TSyncCloudTabletStatsRequest request = new
TSyncCloudTabletStatsRequest();
+ request.setTabletStatsPb(ByteBuffer.wrap(response.toByteArray()));
+ for (Frontend fe : frontends) {
Review Comment:
`pushTabletStats` sets `tabletStatsPb` using `ByteBuffer.wrap(...)`, but the
receiving side (`FrontendServiceImpl.syncCloudTabletStats`) reads this field as
a `byte[]` (`request.getTabletStatsPb()`). This looks like a type mismatch that
will either not compile or will serialize incorrectly. Set the field using the
expected `byte[]` payload (e.g., `response.toByteArray()`) and drop the
`ByteBuffer` usage/import here.
##########
fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudReplica.java:
##########
@@ -69,6 +71,20 @@ public class CloudReplica extends Replica implements
GsonPostProcessable {
private long indexId = -1;
@SerializedName(value = "idx")
private long idx = -1;
+ // last time to get tablet stats
+ @Getter
+ @Setter
+ long lastGetTabletStatsTime = 0;
+ /**
+ * The index of {@link
org.apache.doris.catalog.CloudTabletStatMgr#DEFAULT_INTERVAL_LADDER_MS} array.
+ * Used to control the interval of getting tablet stats.
+ * When get tablet stats:
+ * if the stats is unchanged, will update this index to next value to get
stats less frequently;
+ * if the stats is changed, will update this index to 0 to get stats more
frequently.
+ */
+ @Getter
+ @Setter
+ int statsIntervalIndex = 0;
Review Comment:
New fields `lastGetTabletStatsTime` and `statsIntervalIndex` are
package-private and lack `@SerializedName`, while `CloudReplica` is
Gson-persisted and other fields use short `@SerializedName` keys. Make these
fields `private` and add appropriate `@SerializedName` (with alternates if
needed) to keep metadata serialization stable/backward-compatible.
##########
fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudReplica.java:
##########
@@ -69,6 +71,20 @@ public class CloudReplica extends Replica implements
GsonPostProcessable {
private long indexId = -1;
@SerializedName(value = "idx")
private long idx = -1;
+ // last time to get tablet stats
+ @Getter
+ @Setter
+ long lastGetTabletStatsTime = 0;
+ /**
+ * The index of {@link
org.apache.doris.catalog.CloudTabletStatMgr#DEFAULT_INTERVAL_LADDER_MS} array.
+ * Used to control the interval of getting tablet stats.
+ * When get tablet stats:
+ * if the stats is unchanged, will update this index to next value to get
stats less frequently;
+ * if the stats is changed, will update this index to 0 to get stats more
frequently.
+ */
Review Comment:
The Javadoc links to `CloudTabletStatMgr#DEFAULT_INTERVAL_LADDER_MS`, but
that constant is `private` in `CloudTabletStatMgr`, so the Javadoc reference
will be broken. Either make the constant accessible (public/protected) or
remove/adjust the link text.
##########
fe/fe-core/src/main/java/org/apache/doris/catalog/CloudTabletStatMgr.java:
##########
@@ -55,20 +77,81 @@ public class CloudTabletStatMgr extends MasterDaemon {
private volatile List<OlapTable.Statistics> cloudTableStatsList = new
ArrayList<>();
private static final ExecutorService GET_TABLET_STATS_THREAD_POOL =
Executors.newFixedThreadPool(
- Config.max_get_tablet_stat_task_threads_num);
+ Config.max_get_tablet_stat_task_threads_num,
+ new
ThreadFactoryBuilder().setNameFormat("get-tablet-stats-%d").setDaemon(true).build());
+ // Master: send tablet stats to followers and observers
+ // Follower and observer: receive tablet stats from master
+ private static final ExecutorService SYNC_TABLET_STATS_THREAD_POOL =
Executors.newFixedThreadPool(
+ Config.cloud_sync_tablet_stats_task_threads_num,
+ new
ThreadFactoryBuilder().setNameFormat("sync-tablet-stats-%d").setDaemon(true).build());
+ private Set<Long> activeTablets = ConcurrentHashMap.newKeySet();
+
+ /**
+ * Interval ladder in milliseconds: 1m, 5m, 10m, 30m, 2h, 6h, 12h, 3d,
infinite.
+ * Tablets with changing stats stay at lower intervals; stable tablets
move to higher intervals.
+ */
+ private static final long[] DEFAULT_INTERVAL_LADDER_MS = {
+ TimeUnit.MINUTES.toMillis(1), // 1 minute
+ TimeUnit.MINUTES.toMillis(5), // 5 minutes
+ TimeUnit.MINUTES.toMillis(10), // 10 minutes
+ TimeUnit.MINUTES.toMillis(30), // 30 minutes
+ TimeUnit.HOURS.toMillis(2), // 2 hours
+ TimeUnit.HOURS.toMillis(6), // 6 hours
+ TimeUnit.HOURS.toMillis(12), // 12 hours
+ TimeUnit.DAYS.toMillis(3), // 3 days
+ Long.MAX_VALUE // infinite (never auto-fetch)
+ };
public CloudTabletStatMgr() {
super("cloud tablet stat mgr",
Config.tablet_stat_update_interval_second * 1000);
}
@Override
protected void runAfterCatalogReady() {
- LOG.info("cloud tablet stat begin");
- List<Long> dbIds = getAllTabletStats();
+ int version = Config.cloud_get_tablet_stats_version;
+ LOG.info("cloud tablet stat begin with version: {}", version);
+
+ // version1: get all tablet stats
+ if (version == 1) {
+ this.activeTablets.clear();
+ List<Long> dbIds = getAllTabletStats(null);
+ updateStatInfo(dbIds);
+ return;
+ }
+
+ // version2: get stats for active tablets
+ Set<Long> copiedTablets = new HashSet<>(activeTablets);
+ activeTablets.removeAll(copiedTablets);
+ getActiveTabletStats(copiedTablets);
Review Comment:
`activeTablets` is cleared (`removeAll`) before
`getActiveTabletStats(copiedTablets)` runs. If fetching tablet stats fails (RPC
error / task failure), those tablet IDs are still removed and won’t be retried
until they become active again, which can leave stats stale. Consider only
removing IDs after a successful fetch, or re-adding failed IDs back to
`activeTablets`.
```suggestion
getActiveTabletStats(copiedTablets);
activeTablets.removeAll(copiedTablets);
```
##########
fe/fe-common/src/main/java/org/apache/doris/common/Config.java:
##########
@@ -3554,6 +3553,13 @@ public static int metaServiceRpcRetryTimes() {
"Maximal concurrent num of get tablet stat job."})
public static int max_get_tablet_stat_task_threads_num = 4;
+ @ConfField(mutable = true, description = {"Version of getting tablet stats
in cloud mode. "
+ + "Version 1: get all tablets; Version 2: get active and interval
expired tablets"})
+ public static int cloud_get_tablet_stats_version = 2;
+
+ @ConfField(description = {"Maximum concurrent number of get tablet stat
jobs."})
Review Comment:
Config description for `cloud_sync_tablet_stats_task_threads_num` says "get
tablet stat jobs", but this setting controls the concurrency of syncing/pushing
tablet stats between FEs. Please update the description to match the actual
behavior to avoid operator confusion.
```suggestion
@ConfField(description = {"存算分离模式下 FE 之间同步 tablet 统计信息任务的最大并发数。",
"Maximum concurrent number of syncing tablet stats between
FEs."})
```
##########
fe/fe-core/src/main/java/org/apache/doris/service/FrontendServiceImpl.java:
##########
@@ -4543,16 +4546,28 @@ public TStatus
reportCommitTxnResult(TReportCommitTxnResultRequest request) thro
return new TStatus(TStatusCode.NOT_MASTER);
}
- LOG.info("receive load stats report request: {}, backend: {}, dbId:
{}, txnId: {}, label: {}",
- request, clientAddr, request.getDbId(), request.getTxnId(),
request.getLabel());
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("receive load stats report from backend: {}, dbId: {},
txnId: {}, label: {}, tabletIds: {}",
+ clientAddr, request.getDbId(), request.getTxnId(),
request.getLabel(), request.getTabletIds());
+ }
try {
- byte[] receivedProtobufBytes = request.getPayload();
- if (receivedProtobufBytes == null || receivedProtobufBytes.length
<= 0) {
- return new TStatus(TStatusCode.INVALID_ARGUMENT);
+ List<Long> tabletIds = request.isSetTabletIds() ?
request.getTabletIds() : Collections.emptyList();
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("force sync tablet stats for txnId: {}, tabletNum:
{}, tabletIds: {}", request.txnId,
Review Comment:
This debug log references `request.txnId` directly instead of using the
Thrift accessor (`getTxnId()`), which is inconsistent with the rest of the
method and can be misleading when the field is unset/defaulted. Prefer
`request.getTxnId()` for consistency.
```suggestion
LOG.debug("force sync tablet stats for txnId: {}, tabletNum:
{}, tabletIds: {}", request.getTxnId(),
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]