qzyu999 opened a new pull request, #474:
URL: https://github.com/apache/fluss-rust/pull/474
### Purpose
Linked issue: close #469
The purpose of this change is to complete the Python implementation of
Array types by adding support for fixed-length arrays (`FixedSizeList`).
This lets the Python client work with the `List`, `LargeList`, and
`FixedSizeList` Arrow layouts and provides an idiomatic, programmatic way to
construct nested schemas.
### Brief change log
* **Core Engine Refinement (`crates/fluss/src/row/column.rs`):** Added
explicit downcasting for `FixedSizeListArray`. Element positions are computed
by multiplying the row index by the fixed list size, which avoids reading an
offsets buffer.
* **Programmatic Schema API (`bindings/python/src/metadata.rs`):**
Introduced a `DataTypes` factory class for Python. This enables the
construction of nested types (e.g., `DataTypes.array(DataTypes.int())`), which
was previously impossible for FFI-backed types.
* **Idiomatic FFI Bindings:** Replaced inherent `.to_string()` methods with
the standard Rust `fmt::Display` trait for `DataType`. This standardizes the
string representation across Rust and Python (`__str__` and `__repr__`) and
should also be reusable when `Map<Key, Value>` or Struct/Row types are added.
* **Arrow Translator Update (`crates/fluss/src/record/arrow.rs`):** Updated
the type translator to unify `List`, `LargeList`, and `FixedSizeList` into a
single logical Fluss `Array` type.
* **Linter & Precision Pass:** Cleaned up `clippy::clone_on_copy` warnings
in integration tests and replaced hardcoded float approximations with
`std::f32::consts::PI` and `std::f64::consts::E` to prevent precision drift in
streaming aggregations.
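
The index arithmetic behind the first bullet can be illustrated with a minimal sketch. It is plain Python rather than the actual Rust code in `column.rs`, and the function names `fixed_size_bounds` and `variable_size_bounds` are hypothetical; it only shows why a fixed-size layout needs no offsets lookup.

```python
from typing import Sequence, Tuple


def fixed_size_bounds(row: int, list_size: int) -> Tuple[int, int]:
    """Return the [start, end) range of `row` in the flat child values.

    Every row holds exactly `list_size` elements, so the start position
    is a single multiplication and no offsets buffer is consulted.
    """
    start = row * list_size
    return start, start + list_size


def variable_size_bounds(row: int, offsets: Sequence[int]) -> Tuple[int, int]:
    """Return the [start, end) range of `row` using an offsets buffer.

    A variable-length list stores len(rows) + 1 offsets, so each lookup
    has to read two entries from that extra buffer.
    """
    return offsets[row], offsets[row + 1]


# Three 2-D points stored back to back in one flat child buffer.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
start, end = fixed_size_bounds(1, 2)
second_point = values[start:end]
```

With rows of equal length, both functions return the same range; the fixed-size variant simply derives it from the row index instead of reading it from memory.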
### Tests
* **Rust Unit Tests:** Added `test_from_arrow_type_fixed_size_list` to
verify the Arrow-to-Fluss type translation.
* **Python Metadata Tests:** Added tests in `test_schema.py` to verify the
`DataTypes` factory and string representation logic.
* **Integration Tests:**
  * `test_append_and_scan_with_array`: Verifies the round trip for
variable-length arrays.
  * `test_append_and_scan_with_fixed_size_array`: Verified client-side but
currently **SKIPPED** in CI, because it requires a Fluss server version >= 0.9.1
to handle the new storage layout.
### API and Format
* **API:** This change adds a new `DataTypes` factory to the Python API. It
also improves the `__repr__` output for schemas, making nested types
human-readable.
* **Format:** This PR introduces support for the `FixedSizeList` Arrow
storage format within the Fluss engine. Because it needs no offsets buffer, it
is more space-efficient and faster to index for fixed-length vector data (such
as coordinates or embeddings).
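
To illustrate how a single string rendering can make nested types readable through both `__str__` and `__repr__`, here is a rough sketch. The classes `IntType` and `ArrayType` and the `ARRAY<...>` output format are assumptions made for illustration; they are not the bindings' real classes, and the actual strings produced by this PR may differ.

```python
class IntType:
    """Hypothetical leaf type standing in for a Fluss integer type."""

    def __str__(self) -> str:
        return "INT"

    # Reuse one rendering for both str() and repr().
    __repr__ = __str__


class ArrayType:
    """Hypothetical array type that wraps an element type."""

    def __init__(self, element: object) -> None:
        self.element = element

    def __str__(self) -> str:
        # Formatting the element calls its own __str__, so nesting composes.
        return f"ARRAY<{self.element}>"

    __repr__ = __str__


# An array of arrays of integers renders recursively.
nested = ArrayType(ArrayType(IntType()))
rendered = str(nested)  # "ARRAY<ARRAY<INT>>"
```

Defining the rendering once and aliasing `__repr__` to it mirrors the idea of backing both Python methods with one Rust `fmt::Display` implementation.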
### Documentation
This change introduces a new feature: Support for the `Array` data type in
the Python client, including support for `FixedSizeList`. Users can now define,
write, and read array-based columns using the Python SDK.