Re: [PR] [TASK-429] Pre-size Arrow builders and recommend jemalloc for write path [fluss-rust]

via GitHub Sat, 07 Mar 2026 13:47:12 -0800


leekeiabstraction commented on code in PR #430:
URL: https://github.com/apache/fluss-rust/pull/430#discussion_r2900626203



##########
crates/examples/src/example_table.rs:
##########
@@ -15,6 +15,10 @@
 // specific language governing permissions and limitations
 // under the License.
 
+#[cfg(not(target_env = "msvc"))]
+#[global_allocator]
+static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

Review Comment:
   I wonder if we should go this route, if I understand correctly, the RSS 
bloat comes from Rust side lacks memory pooling i.e. creating 
MemoryLogRecordsArrowBuilder and discarding per append. The proper fix for the 
issue would be to implement memory pooling on rust side as well.
   
   While I do like that this fixes the issue without implementing memory 
pooling immediately, think we should consider whether it's worth providing the 
advice to users around use of Jemalloc to mitigate the issue.
   
   I am not too opinionated, if there are strong signals for production use of 
0.1 client and performance is an issue, this is certainly a low hanging fruit 
that we can take.



##########
crates/fluss/src/record/arrow.rs:
##########
@@ -151,6 +151,7 @@ pub const NO_BATCH_SEQUENCE: i32 = -1;
 
 pub const BUILDER_DEFAULT_OFFSET: i64 = 0;
 
+// TODO: Switch to byte-size-based is_full() like Java's ArrowWriter instead 
of a hard record cap.
 pub const DEFAULT_MAX_RECORD: i32 = 256;

Review Comment:
   Java uses initial capacity of 1024, is there a motivation for using lower 
capacity here?
   
   Also, consider naming this initial capacity? My understanding is that this 
is not max records but initial capacity which arrow can grow



##########
crates/fluss/src/record/arrow.rs:
##########
@@ -251,26 +253,47 @@ impl RowAppendRecordBatchBuilder {
         })
     }
 
-    fn create_builder(data_type: &arrow_schema::DataType) -> Result<Box<dyn 
ArrayBuilder>> {
+    fn create_builder(
+        data_type: &arrow_schema::DataType,
+        capacity: usize,
+    ) -> Result<Box<dyn ArrayBuilder>> {
         match data_type {
-            arrow_schema::DataType::Int8 => Ok(Box::new(Int8Builder::new())),
-            arrow_schema::DataType::Int16 => Ok(Box::new(Int16Builder::new())),
-            arrow_schema::DataType::Int32 => Ok(Box::new(Int32Builder::new())),
-            arrow_schema::DataType::Int64 => Ok(Box::new(Int64Builder::new())),
-            arrow_schema::DataType::UInt8 => Ok(Box::new(UInt8Builder::new())),
-            arrow_schema::DataType::UInt16 => 
Ok(Box::new(UInt16Builder::new())),
-            arrow_schema::DataType::UInt32 => 
Ok(Box::new(UInt32Builder::new())),
-            arrow_schema::DataType::UInt64 => 
Ok(Box::new(UInt64Builder::new())),
-            arrow_schema::DataType::Float32 => 
Ok(Box::new(Float32Builder::new())),
-            arrow_schema::DataType::Float64 => 
Ok(Box::new(Float64Builder::new())),
-            arrow_schema::DataType::Boolean => 
Ok(Box::new(BooleanBuilder::new())),
-            arrow_schema::DataType::Utf8 => Ok(Box::new(StringBuilder::new())),
-            arrow_schema::DataType::Binary => 
Ok(Box::new(BinaryBuilder::new())),
-            arrow_schema::DataType::FixedSizeBinary(size) => {
-                Ok(Box::new(FixedSizeBinaryBuilder::new(*size)))
+            arrow_schema::DataType::Int8 => 
Ok(Box::new(Int8Builder::with_capacity(capacity))),
+            arrow_schema::DataType::Int16 => 
Ok(Box::new(Int16Builder::with_capacity(capacity))),
+            arrow_schema::DataType::Int32 => 
Ok(Box::new(Int32Builder::with_capacity(capacity))),
+            arrow_schema::DataType::Int64 => 
Ok(Box::new(Int64Builder::with_capacity(capacity))),
+            arrow_schema::DataType::UInt8 => 
Ok(Box::new(UInt8Builder::with_capacity(capacity))),
+            arrow_schema::DataType::UInt16 => 
Ok(Box::new(UInt16Builder::with_capacity(capacity))),
+            arrow_schema::DataType::UInt32 => 
Ok(Box::new(UInt32Builder::with_capacity(capacity))),
+            arrow_schema::DataType::UInt64 => 
Ok(Box::new(UInt64Builder::with_capacity(capacity))),
+            arrow_schema::DataType::Float32 => {
+                Ok(Box::new(Float32Builder::with_capacity(capacity)))
             }
+            arrow_schema::DataType::Float64 => {
+                Ok(Box::new(Float64Builder::with_capacity(capacity)))
+            }
+            arrow_schema::DataType::Boolean => {
+                Ok(Box::new(BooleanBuilder::with_capacity(capacity)))
+            }
+            arrow_schema::DataType::Utf8 => {
+                // Estimate 64 bytes average per string value
+                Ok(Box::new(StringBuilder::with_capacity(
+                    capacity,
+                    capacity * 64,
+                )))
+            }
+            arrow_schema::DataType::Binary => {
+                // Estimate 64 bytes average per binary value
+                Ok(Box::new(BinaryBuilder::with_capacity(
+                    capacity,
+                    capacity * 64,

Review Comment:
   nit: magic numbers, can define as a const
   
   Also, if we follow java side's BaseVariableWidthVector uses a multiplier of 
8.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [TASK-429] Pre-size Arrow builders and recommend jemalloc for write path [fluss-rust]

Reply via email to