liurenjie1024 commented on code in PR #1885: URL: https://github.com/apache/iceberg-rust/pull/1885#discussion_r2580505559
########## docs/rfcs/0002_storage_trait.md: ########## @@ -0,0 +1,353 @@ +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. +--> + +# Making Storage a Trait in Iceberg-rust + +## Background + +### Existing Implementation +The existing code implements storage functionality through a concrete Storage enum that handles different storage backends (S3, local filesystem, GCS, etc.). This implementation is tightly coupled with OpenDAL as the underlying storage layer. The FileIO struct wraps this Storage enum and provides a high-level API for file operations. + +Original structure: + +- **FileIO:** Main interface for file operations +- **Storage:** Enum with variants for different storage backends +- **InputFile / OutputFile:** Concrete structs for reading and writing files + +All storage operations are implemented directly in these concrete types, making it hard to extend or customize the storage layer without modifying the core codebase. + +### Problem Statement +The original design has several limitations: + +- **Tight Coupling** – All storage logic depends on OpenDAL, limiting flexibility. Users cannot easily opt in for other storage implementations like `object_store` +- **Customization Barriers** – Users cannot easily add custom behaviors or optimizations + +As discussed in Issue #1314, making Storage a trait would allow pluggable storage and better integration with existing systems. + +## Design + +### New Architecture + +The new architecture uses trait-based abstractions with a registry pattern and separate storage crates: + +``` +┌─────────────────────────────────────────────────────┐ +│ crates/iceberg/src/io/ │ +│ ┌───────────────────────────────────────────────┐ │ +│ │ Storage Trait & Registry │ │ +│ │ - pub trait Storage │ │ +│ │ - pub trait StorageBuilder │ │ +│ │ - pub struct StorageBuilderRegistry │ │ +│ │ - pub struct InputFile │ │ +│ │ - pub struct OutputFile │ │ +│ └───────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────┘ + ▲ + │ + ┌───────────────┼───────────────┬───────────────┐ + │ │ │ │ +┌───────┴────────────┐ │ ┌────────────┴──────┐ ┌────┴──────────┐ +│ crates/storage/ │ │ │ crates/storage/ │ │ Third-Party │ +│ opendal/ │ │ │ object_store/ │ │ Crates │ +│ ┌────────────────┐ │ │ │ ┌───────────────┐ │ │ ┌───────────┐ │ +│ │ opendal-s3 │ │ │ │ │ objstore-s3 │ │ │ │ custom │ │ +│ │ impl Storage │ │ │ │ │ impl Storage │ │ │ │ storage │ │ +│ │ impl Builder │ │ │ │ │ impl Builder │ │ │ │impl traits│ │ +│ └────────────────┘ │ │ │ └───────────────┘ │ │ └───────────┘ │ +│ ┌────────────────┐ │ │ │ ┌───────────────┐ │ └───────────────┘ +│ │ opendal-fs │ │ │ │ │ objstore-gcs │ │ +│ │ impl Storage │ │ │ │ │ impl Storage │ │ +│ │ impl Builder │ │ │ │ │ impl Builder │ │ +│ └────────────────┘ │ │ │ └───────────────┘ │ +│ ┌────────────────┐ │ │ │ ┌───────────────┐ │ +│ │ opendal-gcs │ │ │ │ │ objstore-azure│ │ +│ │ impl Storage │ │ │ │ │ impl Storage │ │ +│ │ impl Builder │ │ │ │ │ impl Builder │ │ +│ └────────────────┘ │ │ │ └───────────────┘ │ +│ ... (oss, azure, │ │ └───────────────────┘ +│ memory) │ │ +└────────────────────┘ │ +``` + +### Storage Trait +The Storage trait defines the interface for all storage operations. This implementation uses Option 2 from the initial design: Storage as a trait with concrete `InputFile` and `OutputFile` structs. + +```rust +#[async_trait] +pub trait Storage: Debug + Send + Sync { + // File existence and metadata + async fn exists(&self, path: &str) -> Result<bool>; + async fn metadata(&self, path: &str) -> Result<FileMetadata>; + + // Reading operations + async fn read(&self, path: &str) -> Result<Bytes>; + async fn reader(&self, path: &str) -> Result<Box<dyn FileRead>>; + + // Writing operations + async fn write(&self, path: &str, bs: Bytes) -> Result<()>; + async fn writer(&self, path: &str) -> Result<Box<dyn FileWrite>>; + + // Deletion operations + async fn delete(&self, path: &str) -> Result<()>; + async fn remove_dir_all(&self, path: &str) -> Result<()>; + + // File object creation + fn new_input(&self, path: &str) -> Result<InputFile>; + fn new_output(&self, path: &str) -> Result<OutputFile>; +} +``` + +### InputFile and OutputFile Structs + +`InputFile` and `OutputFile` are concrete structs that contain a reference to the storage: + +```rust + +pub struct InputFile { + pub storage: Arc<dyn Storage>, + pub path: String, +} + +pub struct OutputFile { + pub storage: Arc<dyn Storage>, + pub path: String, +} +``` + +Functions in `InputFile` and `OutputFile` delegate operations to the underlying Storage: + +```rust + +impl InputFile { + pub async fn exists(&self) -> Result<bool> { + self.storage.exists(&self.path).await + } + + pub async fn read(&self) -> Result<Bytes> { + self.storage.read(&self.path).await + } + // ... other methods +} +``` + +Benefits of this approach: + +- Simpler and easier to maintain +- Less trait object overhead +- Clear delegation pattern +- Sufficient flexibility for most use cases + +## StorageBuilder and Registry + +### StorageBuilder Trait +The `StorageBuilder` trait defines how storage backends are constructed: + +```rust +pub trait StorageBuilder: Debug + Send + Sync { + fn build( + &self, + props: HashMap<String, String>, + extensions: Extensions, + ) -> Result<Arc<dyn Storage>>; +} +``` + +Key design decisions: + +- Uses `&self` instead of `self` - builders are reusable +- No associated type - returns `Arc<dyn Storage>` directly +- `Send + Sync` for thread safety + +### StorageBuilderRegistry +The registry manages storage builders and provides lookup by scheme: + +```rust +#[derive(Debug, Clone)] +pub struct StorageBuilderRegistry { Review Comment: I'm thinking maybe we also need a trait for `StorageBuilderRegistry`, and this trait lives in `iceberg` crate. ```rust pub trait StorageBuilderRegistry { } ``` The motivation is that we may also need a default concrete registry implementation, which by default loads builtin builders. And the default registry implementaion should have a standalone crate, and it needs to depend on different storages. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
