Hi Jark!

Thank you for your insightful suggestions. This FIP is a small step for
Fluss towards multi-remote (or multi-cloud) storage. As you mentioned, we
envision future support for commit-level multi-pathing, similar to the
approaches taken by Paimon and Lance.

Regarding your comments on the current FIP, I'm generally in agreement.

1. FileSystem#obtainSecurityToken(FsPath f)
For the current implementation, obtainSecurityToken(FsPath f) is actually
redundant and can be removed.

2. GetFileSystemSecurityTokenRequest/Response and Client Token Management
Your suggestion simplifies the implementation of multi-path authorization.
By deferring table-level authentication to a later stage, we can expedite
the landing of this FIP. I will update the FIP accordingly.
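
To make sure we're aligned before I update the FIP, here is a rough sketch of
the connection-level token handling you described. The class and method names
below are illustrative placeholders only, not the actual Fluss API: a single
cache keyed by FsKey, held at the connection level, so scanners for different
tables share tokens instead of overwriting each other.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: one token cache per connection, keyed by FsKey,
// updated whenever the server returns the full token list.
public class ConnectionTokenCache {

    // FsKey (e.g. "oss://bucket1") -> opaque STS token string.
    private final Map<String, String> tokensByFsKey = new ConcurrentHashMap<>();

    // Applied when the response returns STS tokens for every FsKey
    // configured in the cluster.
    public void updateAll(Map<String, String> freshTokens) {
        tokensByFsKey.putAll(freshTokens);
    }

    // A filesystem instance looks up the token for its own FsKey;
    // returns null if no token is cached for that FsKey yet.
    public String tokenFor(String fsKey) {
        return tokensByFsKey.get(fsKey);
    }
}
```

With this shape the token manager lives alongside the connection and the
LogScanner logic stays untouched, as you suggested.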

Best regards,
Liebing Yu


On Tue, 3 Mar 2026 at 01:19, Jark Wu <[email protected]> wrote:

> Hi Liebing,
>
> Thank you for the proposal. I believe this is an excellent initiative
> to improve throughput for large-scale clusters utilizing remote
> storage.
>
> The current design implements multi-location support at the table or
> partition level, meaning only new tables and partitions will utilize
> new remote locations. Consequently, even after upgrading the cluster
> to support multiple paths, data distribution will remain concentrated
> in a single location for an extended period, failing to achieve rapid
> traffic fan-out. In contrast, industry solutions like Paimon support
> "data-file.external-paths" [1] to distribute new data files across
> multiple paths, and Lance has recently introduced a file-level
> multi-base layout [2].
>
> Ultimately, we need file-level multi-location support (I believe this
> approach will resolve most of the concerns raised above by Yang Guo).
> However, I am fine with supporting partition-level multi-location as
> an initial phase, provided we have a clear roadmap toward the final
> solution.
>
> Regarding the design details of this FIP, I have the following comments:
>
> 1. FileSystem#obtainSecurityToken(FsPath f)
> We should not add the FsPath parameter to the obtainSecurityToken
> interface for now. In the current design, this interface only
> retrieves the security token for the entire filesystem rather than for
> a specific path. Since a filesystem is defined per authority, the
> authority does not need to be derived from an FsPath.
>
> In fact, we plan to refactor the FileSystem soon. This refactoring
> will add the FsPath parameter to obtainSecurityToken, ensuring the
> returned token is strictly scoped to that specific path. This change
> aims to address current permission leakage issues where a token
> requested for reading one table inadvertently grants access to all
> remote files of other tables.
>
> 2. GetFileSystemSecurityTokenRequest/Response and Client Token Management
>
> Current Issue: The FIP proposes maintaining a SecurityTokenManager per
> LogScanner. However, since tokens are shared at the filesystem
> granularity, tokens for the same FsKey across different tables should
> be consolidated. Therefore, the DefaultSecurityTokenManager must be
> maintained within the FlussConnection; otherwise,
> SecurityTokenManagers for different tables will overwrite each other's
> tokens.
>
> Recommendation: A straightforward approach is to leave
> GetFileSystemSecurityTokenRequest unchanged while modifying
> GetFileSystemSecurityTokenResponse to return a list of tokens. The
> server side would then return STS tokens for each FsKey configured in
> the cluster. The client-side Filesystem would subsequently retrieve
> the corresponding STS token based on the FsKey. This avoids changes to
> the LogScanner logic.
>
> While this approach retains the existing permission leakage issue,
> that problem is already present today. We can address it in a
> separate, dedicated FIP to simplify the scope and implementation of
> the current proposal.
>
> Best,
> Jark
>
> [1] https://paimon.apache.org/docs/1.3/maintenance/configurations/
> [2]
> https://lancedb.com/blog/rethinking-table-file-paths-lance-multi-base-layout/
>
>
> On Sat, 28 Feb 2026 at 20:37, Yang Guo <[email protected]> wrote:
> >
> > Hi Liebing and all,
> >
> > This is a good FIP to resolve bottlenecks in remote storage. Thanks for
> > your effort. The design looks good to me and the above discussion has
> > covered some concerns in my mind.
> >
> > Now there are some further considerations I'm thinking of:
> >
> > 1. What happens if a path goes down?
> >   Right now, there’s no automatic failover. If one S3 bucket (or HDFS
> > path) dies, every table or partition assigned to it just fails. Could we add
> > simple health checks? If a path looks dead, the remote dir selector
> > temporarily skips it until it’s back up.
> >
> > 2. New paths don't always help old data.
> > The routing only happens when a new table or new partition is created, and
> > it depends on the partition strategy.
> > - If the table is using time-based partitions (e.g., daily), adding new
> > paths works well because new data goes to new partitions on new paths.
> > - But for non-partitioned tables, or if it keeps writing to old partitions,
> > the new paths sit idle. The traffic never shifts over.
> > It requires developers to think further about partition strategy and input
> > data when adding remote dirs.
> >
> > 3. Managing "weights" manually is tricky for developers/maintainers.
> > Since the weighted round-robin is static:
> >     - Developers/Maintainers have to determine the right weights based on
> > current traffic.
> >     - If you skew weights to favor a path, you have to remember to
> > rebalance them later, or that path gets overloaded forever. E.g., if two
> > paths are weighted [1, 2] at the beginning to rebalance the higher traffic
> > in the first path, developers/maintainers should remember to change the
> > weights back to [1, 1] after the traffic is balanced between two paths.
> > Otherwise the traffic in the second path will keep growing.
> >     - Also, setting a weight to 0 behaves differently depending on your
> > partition type (time-based paths eventually go quiet, but field-based ones
> > like "country=US" keep writing there forever).
> > Instead of manual tuning, could we eventually make this dynamic? Let the
> > system adjust weights based on real-time latency or throttling metrics.
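> >
> > To make the rebalancing concern concrete, here is a tiny sketch of static
> > weighted round-robin selection (class and method names are illustrative
> > only, not the actual Fluss API). With weights [1, 2], two of every three
> > new assignments land on the second path, which is why the weights must be
> > reset once traffic evens out:
> >
> > ```java
> > import java.util.List;
> > import java.util.concurrent.atomic.AtomicLong;
> >
> > // Illustrative static weighted round-robin over remote dirs.
> > class WeightedDirSelector {
> >     private final List<String> dirs;
> >     private final int[] weights;
> >     private final int totalWeight;
> >     private final AtomicLong counter = new AtomicLong();
> >
> >     WeightedDirSelector(List<String> dirs, int[] weights) {
> >         this.dirs = dirs;
> >         this.weights = weights;
> >         int sum = 0;
> >         for (int w : weights) sum += w;
> >         this.totalWeight = sum;
> >     }
> >
> >     // Slot i of each cycle of totalWeight assignments maps to the dir
> >     // whose cumulative weight range covers i, so the weights keep
> >     // skewing traffic until someone changes them back.
> >     String next() {
> >         long slot = counter.getAndIncrement() % totalWeight;
> >         for (int i = 0; i < weights.length; i++) {
> >             slot -= weights[i];
> >             if (slot < 0) {
> >                 return dirs.get(i);
> >             }
> >         }
> >         throw new IllegalStateException("weights must sum to > 0");
> >     }
> > }
> > ```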
> >
> > The points above are about future operational considerations regarding
> > failover and maintenance after this solution is deployed. I think they
> > won't block this FIP, and we may not need to fix them right now; I just
> > want to bring them into this discussion.
> >
> > Regards,
> > Yang Guo
> >
> > On Fri, Feb 27, 2026 at 5:53 PM Liebing Yu <[email protected]> wrote:
> >
> > > Hi Lorenzo, sorry for the late reply.
> > >
> > > Thanks for the AWS example! This further solidifies the case for
> > > multi-path support.
> > >
> > > Regarding your question about multi-cloud support:
> > > Our current design naturally supports multi-cloud object storage systems.
> > > Since the implementation is built upon a multi-scheme filesystem
> > > abstraction (supporting schemes like s3://, oss://, abfs://, etc.), the
> > > system is inherently "cloud-agnostic."
> > >
> > > Best regards,
> > > Liebing Yu
> > >
> > >
> > > On Wed, 4 Feb 2026 at 23:37, Lorenzo Affetti via dev
> > > <[email protected]> wrote:
> > >
> > > > This is quite an interesting FIP and I think it is a significant
> > > > enhancement, especially for large-scale clusters.
> > > >
> > > > I think you can also add the AWS case in your motivation:
> > > >
> > > >
> > >
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-high-request-rate
> > > > AWS automatically scales if requests exceed 5,500 per second for the
> > > > same prefix, which results in transient 503 errors.
> > > > Your approach would eliminate this problem by providing another bucket.
> > > >
> > > > I was wondering if it might also provide the possibility of configuring
> > > > the same Fluss cluster for multi-cloud object storage systems.
> > > > From a design perspective, nothing should prevent me from storing
> > > > remote data on both Azure and AWS at the same time, probably resulting in
> > > > different performance numbers for different partitions/tables.
> > > > Should the design force the use of only 1 filesystem implementation?
> > > >
> > > > Thank you again!
> > > >
> > > > On Fri, Jan 30, 2026 at 7:59 AM Liebing Yu <[email protected]> wrote:
> > > >
> > > > > Hi Yuxia, thanks for the thoughtful response. Let me go through your
> > > > > questions one by one.
> > > > >
> > > > > 1. I think after we support `remote.data.dirs`, different schemes
> > > > > will be supported naturally.
> > > > > 2. Yes, I think we should change from `PbTablePath` to
> > > > > `PbPhysicalTablePath`.
> > > > > 3. Thanks for the reminder. I'll PoC authentication in
> > > > > https://github.com/apache/fluss/issues/2518. But it doesn't block the
> > > > > multiple-paths implementation in the Fluss server in
> > > > > https://github.com/apache/fluss/issues/2517.
> > > > > 4. For a partitioned table, the table itself has a remote data dir
> > > > > for metadata (such as lake offset), and each partition has its own
> > > > > remote dir for table data (e.g. kv or log data).
> > > > > 5. Legacy clients can access data in the new cluster.
> > > > >
> > > > >    - If the permissions of the paths specified in `remote.data.dirs`
> > > > >    on the new cluster match those configured in `remote.data.dir`,
> > > > >    seamless access is achievable.
> > > > >    - If the permissions are inconsistent, access permissions must be
> > > > >    explicitly configured. For example, when using OSS, a policy granting
> > > > >    access permissions to the account identified by `fs.oss.roleArn` must
> > > > >    be configured for each bucket specified in `remote.data.dirs`.
> > > > >
> > > > >
> > > > > Best regards,
> > > > > Liebing Yu
> > > > >
> > > > >
> > > > > On Thu, 29 Jan 2026 at 10:07, Yuxia Luo <[email protected]> wrote:
> > > > >
> > > > > > Hi, Liebing
> > > > > >
> > > > > > Thanks for the detailed FIP. I have a few questions:
> > > > > > 1. Does `remote.data.dirs` support paths with different schemes?
> > > > > > For example:
> > > > > > ```
> > > > > > remote.data.dirs: oss://bucket1/fluss-data, s3://bucket2/fluss-data
> > > > > > ```
> > > > > >
> > > > > > 2. Should `GetFileSystemSecurityTokenRequest` include partition?
> > > > > > The FIP adds `table_path` to the request, but since different
> > > > > > partitions may reside on different remote paths (and require
> > > > > > different tokens), should the request also include partition
> > > > > > information?
> > > > > >
> > > > > > 3. Just a reminder that `DefaultSecurityTokenManager` will become
> > > > > > more complex...
> > > > > > This is not a blocker, but worth a PoC to recognize any complexity.
> > > > > >
> > > > > > 4. I want to confirm my understanding: For a partitioned table,
> > > > > > does the table itself have a remote dir, AND each partition also has
> > > > > > its own remote dir?
> > > > > >
> > > > > > Or is it:
> > > > > > - Non-partitioned table → table-level remote dir
> > > > > > - Partitioned table → only partition-level remote dirs (no
> > > > > > table-level)?
> > > > > >
> > > > > > 5. Can old clients (without table path in the token request) still
> > > > > > read data from new clusters?
> > > > > > One possible solution is: for RPCs without table information, the
> > > > > > server returns a token for the first dir in `remote.data.dirs`. Or
> > > > > > other ways that allow users to configure the cluster to keep
> > > > > > compatibility.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 2026/01/21 03:52:29 Zhe Wang wrote:
> > > > > > > Thanks for your response, now it looks good to me.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Zhe Wang
> > > > > > >
> > > > > > > Liebing Yu <[email protected]> wrote on Tue, 20 Jan 2026 at 14:29:
> > > > > > >
> > > > > > > > Hi Zhe, sorry for the late reply.
> > > > > > > >
> > > > > > > > The primary focus of this FIP is not to address read/write
> > > > > > > > issues at the table or partition level, but rather to overcome
> > > > > > > > limitations at the cluster level. Given the current capabilities
> > > > > > > > of object storage, read/write performance for a single table or
> > > > > > > > partition is unlikely to be a bottleneck; however, for a
> > > > > > > > large-scale Fluss cluster, it can easily become one. Therefore,
> > > > > > > > the core objective here is to distribute the cluster-wide
> > > > > > > > read/write traffic across multiple remote storage systems.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Liebing Yu
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, 14 Jan 2026 at 16:07, Zhe Wang <[email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Liebing, Thanks for the clarification.
> > > > > > > > > >1. To clarify, the data is currently split by partition level
> > > > > > > > > >for partitioned tables and by table for non-partitioned tables.
> > > > > > > > >
> > > > > > > > > Therefore the main aim of this FIP is improving the speed of
> > > > > > > > > reading data from different partitions, while write speed may
> > > > > > > > > still be limited for a single system?
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Zhe Wang
> > > > > > > > >
> > > > > > > > > Liebing Yu <[email protected]> wrote on Tue, 13 Jan 2026 at 19:11:
> > > > > > > > >
> > > > > > > > > > Hi Zhe, Thanks for the questions!
> > > > > > > > > >
> > > > > > > > > > 1. To clarify, the data is currently split by partition
> > > > > > > > > > level for partitioned tables and by table for
> > > > > > > > > > non-partitioned tables.
> > > > > > > > > >
> > > > > > > > > > 2. Regarding RemoteStorageCleaner, you are absolutely
> > > > > > > > > > right. Supporting remote.data.dirs there is necessary for a
> > > > > > > > > > complete cleanup when a table is dropped.
> > > > > > > > > >
> > > > > > > > > > Thanks for pointing that out!
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Liebing Yu
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, 12 Jan 2026 at 17:02, Zhe Wang <[email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Liebing,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for driving this, I think it's a really useful
> > > > feature.
> > > > > > > > > > > I have two small questions:
> > > > > > > > > > > 1. What's the scope for splitting data across dirs? I see
> > > > > > > > > > > there's a partitionId in the ZK data, so will the data be
> > > > > > > > > > > split by partition into different directories, or by
> > > > > > > > > > > bucket?
> > > > > > > > > > > 2. Maybe it needs to support remote.data.dirs in
> > > > > > > > > > > RemoteStorageCleaner? So we can delete all remote storage
> > > > > > > > > > > when deleting a table.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Zhe Wang
> > > > > > > > > > >
> > > > > > > > > > > Liebing Yu <[email protected]> wrote on Thu, 8 Jan 2026 at 20:10:
> > > > > > > > > > >
> > > > > > > > > > > > Hi devs,
> > > > > > > > > > > >
> > > > > > > > > > > > I propose initiating discussion on FIP-25[1]. Fluss
> > > > > > > > > > > > leverages remote storage systems—such as Amazon S3, HDFS,
> > > > > > > > > > > > and Alibaba Cloud OSS—to deliver a cost-efficient, highly
> > > > > > > > > > > > available, and fault-tolerant storage solution compared
> > > > > > > > > > > > to local disk. *However, in production environments, we
> > > > > > > > > > > > often find that the bandwidth of a single remote storage
> > > > > > > > > > > > becomes a bottleneck.* Taking OSS[2] as an example, the
> > > > > > > > > > > > typical upload bandwidth limit for a single account is
> > > > > > > > > > > > 20 Gbit/s (Internal) and 10 Gbit/s (Public). So I
> > > > > > > > > > > > initiated this FIP, which aims to introduce support for
> > > > > > > > > > > > multiple remote storage paths and enables the dynamic
> > > > > > > > > > > > addition of new storage paths without service
> > > > > > > > > > > > interruption.
> > > > > > > > > > > >
> > > > > > > > > > > > Any feedback and suggestions on this proposal are
> > > welcome!
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLUSS/FIP-25%3A+Support+Multi-Location+for+Remote+Storage
> > > > > > > > > > > > [2]
> > > > > > > > > > > > https://www.alibabacloud.com/help/en/oss/user-guide/limits?spm=a2c63.l28256.help-menu-31815.d_0_0_5.2ac34d06oZYFvK
> > > > > > > > > > > >
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Liebing Yu
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Lorenzo Affetti
> > > > Senior Software Engineer @ Flink Team
> > > > Ververica <http://www.ververica.com>
> > > >
> > >
>
