Hi Junbo,

First of all, thank you for the thoughtful feedback. These are exactly the
kinds of questions that help sharpen the design!

*Memory-based limitation for operators*
You're right that the in-memory nature means historical tiering events
won't be queryable across coordinator restarts. I will update the FIP to
include a 'Non-Goals' section explicitly stating that persistent
historical logging is out of scope for this initial phase. The intended
value for operators is real-time state visibility, identifying at a glance
which tables are currently stuck in FAILED or PENDING states without
digging through logs.

*Prometheus/SQL overlap*
Regarding the metrics you mentioned, I’ve just reviewed the latest updates
and confirmed that several tiering-related metrics (such as
pendingTablesCount and timestampLag) were indeed added recently. Thank you
for pointing this out.

However, while these metrics are excellent for alerting on performance
trends, they aren't designed for granular diagnostics. For instance,
Prometheus can tell you that pendingTablesCount has increased, but it
cannot tell you which specific tables are pending or why they are stuck.
The SQL interface fills this gap by providing context that isn't
expressible in Prometheus:

SELECT table_name, tiering_state, last_failure_message
FROM sys.lake_tiering_status
WHERE tiering_state = 'FAILED';

In this sense, the two are complementary rather than redundant.

*Flink SQL vs Flink logs for failure detection*
There's a subtle but important reason SQL is more valuable here. Due to
Fluss's per-table fault isolation design, when a single table's tiering
fails, the Flink job continues running normally. This failure is invisible
in the Flink UI.

Even with the pendingTablesCount gauge, an operator only knows that
something is pending; they still don't know which table is affected or
why. Instead of scanning Flink logs, the system table lets an operator
immediately identify the affected table and the failure reason,
significantly reducing diagnostic and recovery time.
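
To make that drill-down concrete, it might look something like the query
below. (The pending_since column is my own assumption for illustration;
the actual schema will be defined in the FIP.)

SELECT table_name, tiering_state, pending_since
FROM sys.lake_tiering_status
WHERE tiering_state = 'PENDING'
ORDER BY pending_since;

A single query like this answers both "which table?" and "for how long?",
which neither the gauge nor the Flink UI can.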

I'll update the FIP to clarify these motivations. I hope this clarifies my
perspective, but please feel free to correct me if I’ve misinterpreted your
points or missed any context.

Thanks again for the thorough review!

Best regards,
SeungMin Lee

On Tue, Mar 10, 2026 at 8:24 PM, Junbo Wang <[email protected]> wrote:

> Hi SeungMin,
>
> Thank you for the detailed proposal. I've been thinking about the target
> audience and wanted to share some thoughts — please feel free to correct me
> if I'm misunderstanding anything.
>
> For cluster operators: In practice, operators often need to look back at
> earlier failures, not just the last one. Since the system table is
> memory-based, it might be difficult to support historical tiering
> scheduling events. I'm also wondering if providing only the last failure
> via Flink SQL would be more convenient than manually checking Flink logs.
> I'm not sure if system tables would be the most suitable approach for
> operators, though they might be helpful in the future for viewing and
> managing tiering jobs.
>
> For end users: It seems they would primarily care about data freshness and
> scheduling priority, which would be quite convenient to access through
> system tables. The latency metrics are already well covered by Prometheus,
> so I'm curious about the specific scenarios where users would need the SQL
> interface for error information.
>
> These are just my initial thoughts. I'd appreciate your insights on the use
> cases you've encountered. Thank you in advance for your time.
>
> Best regards,
> Junbo Wang
>
>
>
> On Tue, Mar 10, 2026 at 3:47 PM, SeungMin Lee <[email protected]> wrote:
>
> > Hi devs,
> > I'd like to start a vote on FIP-30: Support tracking the tiering status
> of
> > a tiering table [1].
> >
> > You can find the discussion on it in [2].
> > This vote will last for at least 72 hours, unless there are objections or
> > insufficient votes.
> >
> > Best regards,
> > SeungMin Lee
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLUSS/FIP-30%3A+Support+tracking+the+tiering+status+of+a+tiering+table
> > [2] https://lists.apache.org/thread/9ljqmvnfjkktkgl0m6gp0c42nv8f0z1q
> >
>