Re: [DISCUSS] Allow user to track the tiering status of a tiering table

Keith Lee Sun, 01 Mar 2026 07:18:31 -0800

Hello SeungMin,

Thank you for the detailed and well-laid-out proposal. I am not familiar with
tiering failures and their visibility issues, so I have a few questions that
will hopefully help my understanding.


1. On the premise `there's no way for users to check the status of lake
tiering`: my understanding is that tiering is itself a Flink job, so would
the status of the Flink tiering job be a good signal for the status of tiering?
I assume that if there is a tiering issue, it would surface as a Flink job
failure, and the job would retry with exception logs captured against the
job and visible through the Flink dashboard. Can you clarify whether this is
true and, additionally, what further information the proposal
captures?
2. I really like the idea of exposing job metadata on Flink SQL. However,
thinking about user personas, there are two groups that I can identify:
a. Data Scientists (or similar roles) and b. System/Software Engineers
(reliability, operations). The information that the proposal seeks to
expose serves the second group and not the first. Is Flink SQL therefore
the correct channel to expose this information?
3. I think I mentioned I really like the idea of exposing job metadata on
Flink SQL. 😄 Have we considered whether Fluss is the best place to implement
Flink SQL support for job metadata queries? I can see such a feature being
useful in Flink in general. If job health, failure reason, etc. are queryable
in Flink, they can serve a much broader set of use cases. Perhaps we can engage
the Flink community on expanding SHOW JOBS [1] to include exceptions, last
failure reason, etc.?

Best regards
Keith Lee

[1]
https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/dev/table/sql/job/


On Mon, Feb 23, 2026 at 3:29 PM SeungMin Lee <[email protected]> wrote:

> Hi dev,
>
> Hope you had a refreshing break.
>
> Touching base on FIP-30. I'm aiming to wrap up the feedback process by the
> week the 0.9 release vote
> <https://lists.apache.org/thread/3c8w6ofrssjxrpvz85pkm2n2kx1gyzxd> ends, so
> we can stay aligned with the project timeline. Also, hope the 0.9 release
> vote <https://lists.apache.org/thread/3c8w6ofrssjxrpvz85pkm2n2kx1gyzxd>
> gets plenty of interest as well.
>
> Looking forward to your thoughts.
>
> Best regards,
> SeungMin Lee
>
> On Sun, Feb 15, 2026 at 12:43 AM SeungMin Lee <[email protected]> wrote:
>
> > Hi Mehul Batra,
> >
> > First of all, thank you very much for the detailed review and valuable
> > suggestions. I really appreciate your insights.
> >
> > *1. Per-Table System Table vs Global System Table*
> > I think the use case for the global view is easy integration with
> > monitoring tools like Grafana. Without a SQL interface, users have to build
> > a custom exporter using the Admin API to monitor the tiering status of all
> > tables. I do share your concerns regarding the performance impact when
> > querying thousands of tables. While I acknowledge the potential performance
> > risks in massive clusters, I believe it’s better to provide full visibility
> > first. We can monitor real-world performance data and, if necessary,
> > implement safeguards like implicit limits or forced LIMIT clauses as a
> > follow-up optimization.
> >
> >
> > *2. Error Message Truncation Strategy*
> > It is a great point. Simply truncating the head of the error message might
> > indeed cut off some important information. I agree with your "smart
> > extraction" suggestion, which prioritizes the text near phrases like
> > "*Caused by*". To keep the initial FIP-30 scope focused, I plan to implement
> > basic truncation first. However, I would be very grateful if you could help
> > with the smart extraction as a follow-up PR if you have the capacity.
> >
> >
> > *3. Consolidating State Maps in LakeTableTieringManager*
> > I also fully agree with consolidating the maps in LakeTableTieringManager.
> > Looking at the code again, managing 7 separate maps (and soon 9) for each
> > table is getting a bit complicated. It’s quite easy to miss one map when
> > registering or removing tables, which could lead to bugs or small memory
> > leaks over time. Grouping everything into a single TableTieringInfo object
> > will make the logic much easier to follow and help keep all the metadata
> > consistent. Plus, it should be a bit more memory-efficient by reducing the
> > number of internal map nodes. I’ll definitely include this refactoring as
> > part of the FIP-30 implementation.
> >
> >
> > Thanks again for helping refine the design!
> >
> > Best Regards,
> > SeungMin Lee
> >
> >
> > On Sat, Feb 14, 2026 at 2:22 AM Mehul Batra <[email protected]> wrote:
> >
> >>  Hi SeungMinLee,
> >>
> >>
> >>
> >>   First of all, thank you for putting together FIP-30. The ability to track
> >>   tiering status is a much-needed feature, and I appreciate the thorough
> >>   design work that went into this proposal.
> >>
> >>
> >>
> >>   After reviewing the FIP, I have a few thoughts and questions I'd like to
> >>   raise for discussion. These are suggestions based on my understanding - I
> >>   may be missing context, so please feel free to correct me if any of these
> >>   points have already been considered.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>   1. Per-Table System Table vs Global System Table
> >>
> >>
> >>
> >>   The proposal introduces both:
> >>
> >>   - Global view: `fluss_catalog.sys.lake_tiering_status`
> >>
> >>   - Per-table view: `my_db.my_table$tiering_status`
> >>
> >>
> >>
> >>   I was wondering if we could simplify the initial implementation by
> >>   focusing on the per-table `$tiering_status` virtual table for SQL access,
> >>   while relying on the `listTieringStatuses()` Admin API for bulk/system-wide
> >>   queries.
> >>
> >>
> >>
> >>   My reasoning:
> >>
> >>   - Consistency: The per-table pattern (`$tiering_status`) aligns with
> >>     Fluss's existing virtual table conventions, such as `$changelog`,
> >>     `$binlog`, etc.
> >>
> >>   - Scalability: A global SQL table querying thousands of tables could have
> >>     performance implications. The Admin API seems better suited for bulk
> >>     operations with potential pagination support.
> >>
> >>
> >>
> >>
> >> A phased approach (Phase 1: per-table SQL, Phase 2: Admin API) could ship
> >> value to users faster with reduced initial scope.
> >>
> >> That said, I may be underestimating the need for the global SQL table. Are
> >> there specific use cases that would be difficult to serve with just the
> >> Admin API?
> >>
> >>
> >>
> >>
> >>
> >>  2. Error Message Truncation Strategy
> >>
> >>
> >>
> >>   The proposal mentions truncating error messages to 2-4KB before sending
> >>   to the Coordinator. I have a concern about simple head truncation
> >>   potentially removing the most useful diagnostic information.
> >>
> >>
> >>
> >>
> >>
> >>
> >>   Are we considering an extraction strategy to deal with this? In my mind,
> >>   something like:
> >>
> >>   - Smart extraction: Parse and extract all "Caused by:" lines, which
> >>     typically contain the most actionable information
> >>
> >>   I understand this adds complexity, so it's a trade-off. Curious to hear
> >>   others' thoughts on whether this is worth addressing.
> >>
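A minimal sketch of the "smart extraction" idea above, offered only as an illustration: it keeps the top-level exception line plus every "Caused by:" line, then applies a byte cap so the result still fits a size limit such as the 2-4KB mentioned in the thread. The class name `ErrorSummarizer` and both method names are hypothetical; nothing here is taken from Fluss or FIP-30.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Fluss or FIP-30. */
class ErrorSummarizer {

    /** Keeps the first line and every "Caused by:" line, then caps the result at maxBytes of UTF-8. */
    static String summarize(String stackTrace, int maxBytes) {
        List<String> kept = new ArrayList<>();
        String[] lines = stackTrace.split("\\R"); // \R matches any line break
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            if (i == 0 || line.startsWith("Caused by:")) {
                kept.add(line);
            }
        }
        return truncateUtf8(String.join("\n", kept), maxBytes);
    }

    /** Keeps at most maxBytes (assumed >= 0) without splitting a multi-byte character. */
    static String truncateUtf8(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return s;
        }
        int end = maxBytes;
        // Back up while the first dropped byte is a UTF-8 continuation byte (10xxxxxx).
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String trace = "java.lang.RuntimeException: tiering failed\n"
                + "\tat a.B.c(B.java:1)\n"
                + "Caused by: java.io.IOException: disk full\n"
                + "\tat d.E.f(E.java:2)";
        System.out.println(summarize(trace, 4096));
    }
}
```

Because the extraction runs before the cap, the root cause survives even when the full stack trace is far larger than the limit; the cap only matters when the extracted lines themselves are very long.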
> >>
> >>
> >>
> >>
> >>   3. Consolidating State Maps in LakeTableTieringManager
> >>
> >>
> >>
> >>   The proposal adds `tieringFailMessages` and `tieringFailTimes` maps to
> >>   `LakeTableTieringManager`. Looking at the current implementation, the
> >>   manager already maintains 6+ separate maps keyed by `tableId`:
> >>
> >>
> >>
> >>   ```java
> >>   Map<Long, TieringState> tieringStates;
> >>   Map<Long, TablePath> tablePaths;
> >>   Map<Long, Long> tableLakeFreshness;
> >>   Map<Long, Long> tableTierEpoch;
> >>   Map<Long, Long> tableLastTieredTime;
> >>   Map<Long, Long> liveTieringTableIds;
> >>   // Proposed additions:
> >>   Map<Long, String> tieringFailMessages;
> >>   Map<Long, Long> tieringFailTimes;
> >>   ```
> >>
> >>   One thought: would it be cleaner to consolidate these into a single
> >>   `TableTieringInfo` object?
> >>
> >>   ```java
> >>   Map<Long, TableTieringInfo> tableInfos;
> >>
> >>   class TableTieringInfo {
> >>       TablePath tablePath;
> >>       long lakeFreshness;
> >>       TieringState state;
> >>       long tieringEpoch;
> >>       long lastTieredTime;
> >>       @Nullable String lastError;
> >>       @Nullable Long lastErrorTime;
> >>   }
> >>   ```
> >>
> >>
> >>
> >>   Potential benefits:
> >>
> >>   - Single map lookup instead of multiple
> >>
> >>   - Related state updated together naturally
> >>
> >>   - Cleaner cleanup in removeLakeTable() (one removal vs. 8)
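The benefits listed above can be sketched as follows. This is an illustration under stated assumptions, not the actual `LakeTableTieringManager`: the class `TieringRegistry`, the method `recordFailure`, the `String` stand-ins for `TablePath` and `TieringState`, and the state names `"NEW"` and `"FAILED"` are all hypothetical, and the locking a real coordinator component would need is omitted.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for a manager holding one consolidated record per table. */
class TieringRegistry {

    /** One object per table replaces the parallel maps keyed by tableId. */
    static final class TableTieringInfo {
        final String tablePath;   // the thread uses TablePath; a String keeps the sketch self-contained
        String state = "NEW";     // placeholder for the TieringState enum (state names are assumed)
        long tieringEpoch;
        long lastTieredTime;
        String lastError;         // null until a failure is recorded
        Long lastErrorTime;       // null until a failure is recorded

        TableTieringInfo(String tablePath) {
            this.tablePath = tablePath;
        }
    }

    private final Map<Long, TableTieringInfo> tableInfos = new HashMap<>();

    void addLakeTable(long tableId, String tablePath) {
        tableInfos.put(tableId, new TableTieringInfo(tablePath));
    }

    /** Records a failure; state, message, and timestamp change together after one lookup. */
    void recordFailure(long tableId, String message, long nowMs) {
        TableTieringInfo info = tableInfos.get(tableId);
        if (info == null) {
            return; // table was already removed
        }
        info.state = "FAILED";
        info.lastError = message;
        info.lastErrorTime = nowMs;
    }

    /** A single removal drops all state for the table, so no per-field map can be missed. */
    void removeLakeTable(long tableId) {
        tableInfos.remove(tableId);
    }

    TableTieringInfo get(long tableId) {
        return tableInfos.get(tableId);
    }

    int size() {
        return tableInfos.size();
    }
}
```

With this shape, adding a new per-table field means adding one member to `TableTieringInfo` rather than another map that every register and remove path has to remember.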
> >>
> >>
> >>
> >>
> >>   This could be a separate preparatory refactoring PR or part of FIP-30.
> >>   However, I understand this might be out of scope for this FIP, and I don't
> >>   want to expand the scope unnecessarily. Just raising it as a thought for
> >>   the authors to consider.
> >>
> >>
> >>
> >>   These are just suggestions based on my reading of the proposal. I'm happy
> >>   to be corrected if I've misunderstood anything. Also happy to help with
> >>   implementation or further discussion if useful.
> >>
> >>
> >>
> >>   Thanks again for driving this important feature!
> >>
> >>
> >>
> >>   Best regards,
> >>
> >>   Mehul Batra
> >>
> >> On Thu, Feb 12, 2026 at 5:53 PM SeungMin Lee <[email protected]> wrote:
> >>
> >> > Hi dev,
> >> >
> >> > Just a quick update.
> >> >
> >> > I have migrated the design Google Doc to the cwiki and registered it as
> >> > *FIP-30*. Please refer to the link below for the formal proposal:
> >> >
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLUSS/FIP-30%3A+Support+tracking+the+tiering+status+of+a+tiering+table
> >> >
> >> > The content remains consistent with the previous Google Doc.
> >> >
> >> > Best regards,
> >> > SeungMin Lee
> >> >
> >> > On Thu, Feb 12, 2026 at 5:37 PM SeungMin Lee <[email protected]> wrote:
> >> > >
> >> > > Hi, dev
> >> > >
> >> > > Currently, there is no way for users to check the status of lake
> >> > > tiering. Users have no way of knowing when tiering fails, and they have
> >> > > to manually parse the Tiering Service logs to identify the cause.
> >> > >
> >> > > So, I'd like to propose Issue-2362 [1]: Allow users to track the tiering
> >> > > status of a tiering table to address this visibility issue.
> >> > >
> >> > > I have drafted a design doc [2]. Please feel free to review and share
> >> > > your feedback.
> >> > >
> >> > > Considering the upcoming holidays in some regions, I'll wait for
> >> > > feedback and give a ping on this thread around Feb 23rd.
> >> > >
> >> > > Looking forward to your thoughts.
> >> > >
> >> > > Best regards,
> >> > > SeungMin Lee
> >> > >
> >> > > [1] https://github.com/apache/fluss/issues/2362
> >> > > [2]
> >> >
> >> >
> >>
> https://docs.google.com/document/d/1eJbRCwzAbeJLA97zQQ0I3JM1jerBXXhq69Dn8r4xWV0/edit?usp=sharing
> >> >
> >>
> >
>
