Thank you all for the healthy discussion.

My view is that all automated maintenance tasks — such as compactions, data
expiration, and cleanup — should be supported by the tiering service, but
remain configurable.

These features can be enabled by default to support simple end-to-end use
cases, without requiring users to rely on an external service.

However, they should also be optional. In many cases, users may already
have, or prefer to use, a dedicated maintenance service for their lakehouse
tables. Such specialized services are often more capable and can manage
maintenance tasks across all lake tables, not only the Fluss tiered tables.
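
To make this concrete, table-level switches along these lines could work;
the property names below are purely illustrative, not part of the FIP:

```properties
# Illustrative only — these property names are hypothetical, not the FIP's API.
# Maintenance tasks enabled by default for simple end-to-end use:
table.datalake.auto-compaction = true
table.datalake.snapshot-expiration = true

# A user who runs a dedicated lakehouse maintenance service would turn
# them off so that only the external service commits maintenance actions:
# table.datalake.auto-compaction = false
# table.datalake.snapshot-expiration = false
```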


Besides, +1 for the design proposal. I think we can kick off a vote now.

Best,
Jark

On Tue, 8 Jul 2025 at 15:24, Mehul Batra <[email protected]> wrote:

> Hi Yuxia,
>
> Thanks for the clarification.
>
> It's good to know that compaction and snapshot expiration will be
> addressed in the initial version, and I completely understand your point
> on the complexity of orphan file cleanup. I agree it's better to avoid
> over-design at this stage and evolve as needed based on future usage
> patterns and feedback.
>
> Also, thanks for confirming that snapshot expiration will be triggered
> explicitly via the LakeCommitter. That clears things up for me.
>
> Looking forward to working on this with you and the community, and to
> seeing how this evolves!
>
> Best regards,
> Mehul
>
> On Tue, Jul 8, 2025 at 7:16 AM yuxia <[email protected]> wrote:
>
> > Hi, Mehul
> >
> > "Tableflow automates table maintenance by compacting and cleaning up
> > small files generated by continuous streaming data in object storage."
> > It seems Tableflow supports compaction, which is covered in this FIP.
> > I haven't seen orphan file cleanup in Tableflow.
> > Orphan file cleanup is not straightforward and a little complex: it
> > requires listing all files and comparing them with the files in the
> > Iceberg manifests to find the orphan files.
> > I still prefer not to introduce this complexity in the first version of
> > Iceberg support, as it would be over-design. Let's see what happens in
> > the future.
> >
> > As for snapshot expiration: yes, the LakeCommitter should trigger the
> > snapshot expiration action explicitly. It's a lightweight operation.
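
The comparison described above is essentially a set difference between the
storage listing and the manifest listing; a minimal sketch, with purely
illustrative names and paths:

```python
# Minimal sketch of orphan-file detection as described above: list every file
# in storage, subtract the files referenced by the Iceberg manifests, and the
# remainder are orphan candidates. Names and paths here are illustrative only.

def find_orphan_files(storage_files, manifest_files):
    """Return files present in storage but not referenced by any manifest."""
    return sorted(set(storage_files) - set(manifest_files))

# Example: two referenced data files plus one leftover from a failed task.
storage = ["data/f1.parquet", "data/f2.parquet", "data/tmp-failed.parquet"]
manifests = ["data/f1.parquet", "data/f2.parquet"]
print(find_orphan_files(storage, manifests))  # ['data/tmp-failed.parquet']
```

The real difficulty is doing this listing at object-store scale and avoiding
deletion of files from in-flight writes, which is part of why it is complex.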
> >
> > Best regards,
> > Yuxia
> >
> > ----- Original Message -----
> > From: "Mehul Batra" <[email protected]>
> > To: "dev" <[email protected]>
> > Sent: Tuesday, July 8, 2025 4:26:19 AM
> > Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> >
> > Hi Yuxia, Cheng,
> >
> > Thank you both for the insights.
> >
> > From a user’s perspective, I believe our goal should be to abstract away
> > as much operational complexity as possible. For example, Tableflow
> > handles both data writing and maintenance seamlessly for the user, which
> > avoids the burden of running separate processes.
> >
> > https://docs.confluent.io/cloud/current/topics/tableflow/overview.html#table-maintenance-and-optimizations
> >
> > In the Fluss integration, if users are expected to run a separate
> > maintenance job (e.g., for snapshot expiration or orphan file cleanup),
> > there's a real risk of job overlap and failure, especially due to
> > optimistic concurrency issues when both the tiering and maintenance jobs
> > try to commit around the same time.
> >
> > Yuxia, you mentioned that the LakeCommitter will respect the
> > history.expire.max-snapshot-age-ms property (similar to Paimon). I just
> > wanted to clarify: while the property sets the retention policy, we
> > still need to trigger the snapshot expiration action explicitly. Do we
> > envision Fluss's tiering job playing that role?
> >
> > If so, that would be a great win: it could help automate snapshot
> > expiration and indirectly clean up orphan files, making things much
> > smoother for users.
> >
> > Please correct me if I’ve misunderstood anything.
> >
> > Best regards,
> > Mehul
> >
> > On Mon, Jul 7, 2025 at 11:32 AM Wang Cheng <[email protected]>
> > wrote:
> >
> > > Hi Mehul,
> > >
> > >
> > > I agree with Yuxia's point. We should leave table maintenance work
> > > like expiring snapshots and deleting orphan files to Iceberg users
> > > rather than relying on the Fluss tiering job.
> > >
> > >
> > >
> > > Regards,
> > > Cheng
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > ------------------ Original ------------------
> > > From: "dev" <[email protected]>
> > > Date: Sat, Jul 5, 2025 11:38 PM
> > > To: "dev" <[email protected]>
> > > Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> > >
> > >
> > >
> > > Hi Yuxia,
> > > Great, that sounds good to me and will help users get better read
> > > latency.
> > > How about snapshot expiration (to keep metadata in check) and removing
> > > orphan files (files that are no longer referenced, or dangling files
> > > from failed tasks)?
> > > Are we planning to introduce them as part of the automated maintenance
> > > provided by the Fluss cluster?
> > > Warm regards,
> > > Mehul Batra
> > >
> > > On Fri, Jul 4, 2025 at 5:02 PM yuxia <[email protected]>
> > > wrote:
> > >
> > > > Hi, Mehul.
> > > > Thanks for your attention. I think we don't need to introduce an
> > > > extra post-commit hook to manage small files. In the design, all
> > > > files that belong to the same bucket (in Iceberg, it'll be the same
> > > > partition) are distributed to the same task to write. So, the task
> > > > can compact these small files for the partition.
> > > > As this FIP says, while creating the IcebergLakeWriter in one round
> > > > of tiering, the writer can scan the manifest to know the files in
> > > > this bucket; if it finds compaction is available, it can compact
> > > > these files while writing new files. We have similar logic for
> > > > tiering to Paimon.
> > > >
> > > > Best regards,
> > > > Yuxia
> > > >
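
A rough sketch of that per-bucket decision, with threshold values and names
that are assumptions rather than the FIP's actual API:

```python
# Rough sketch of the per-bucket compaction decision described above: scan the
# manifest entries for the bucket and, when enough small files have piled up,
# rewrite them into one larger file. Threshold values are assumptions.

SMALL_FILE_BYTES = 32 * 1024 * 1024  # files below this are compaction candidates
MIN_FILES_TO_COMPACT = 3             # only compact when merging is worthwhile

def plan_compaction(manifest_entries):
    """manifest_entries: list of (path, size_bytes) pairs for one bucket."""
    small = [path for path, size in manifest_entries if size < SMALL_FILE_BYTES]
    return small if len(small) >= MIN_FILES_TO_COMPACT else []

entries = [("f1", 1 << 20), ("f2", 2 << 20), ("f3", 3 << 20), ("big", 256 << 20)]
print(plan_compaction(entries))  # ['f1', 'f2', 'f3']
```

Because the same task owns all files of a bucket, this check can run inside
the writer without racing a separate maintenance job.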
> > > > ----- Original Message -----
> > > > From: "Mehul Batra" <[email protected]>
> > > > To: "dev" <[email protected]>
> > > > Sent: Thursday, July 3, 2025 5:04:18 PM
> > > > Subject: Re: [DISCUSS] FIP-3: Support tiering Fluss data to Iceberg
> > > >
> > > > +1 This will help us address the missing table format and provide
> > > > better ecosystem interoperability. Iceberg's growing adoption in the
> > > > data lakehouse space makes this a valuable addition to Fluss's
> > > > tiering capabilities.
> > > > Are there any plans to integrate maintenance services as part of
> > > > tiering itself, as a post-commit hook to manage small files?
> > > > Warm regards,
> > > > Mehul Batra
> > > >
> > > > On Thu, Jul 3, 2025 at 2:24 PM yuxia <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Fluss currently supports tiering data to Apache Paimon, enabling
> > > > > cost-effective storage management for warm/cold data. However, the
> > > > > lack of native Iceberg tiering support limits flexibility and
> > > > > ecosystem integration for users who rely on Iceberg’s open table
> > > > > format.
> > > > >
> > > > > To address this gap, I’d like to propose FIP-3: Support Tiering
> > > > > Fluss Data to Iceberg [1], which aims to integrate Iceberg into
> > > > > Fluss’s tiering capabilities.
> > > > >
> > > > > Welcome your feedback and suggestions on this proposal. Looking
> > > > > forward to a productive discussion!
> > > > >
> > > > > [1]:
> > > > > https://cwiki.apache.org/confluence/display/FLUSS/FIP-3%3A+Support+tiering+Fluss+data+to+Iceberg
> > > > >
> > > > > Best regards,
> > > > > Yuxia
> > > > >
> >
>
