date:20241231

[I] [Question] Why does plan_files not seem to get multi-threading improvement [iceberg-python]

2024-12-31 Thread via GitHub



gitzwz opened a new issue, #1479:
URL: https://github.com/apache/iceberg-python/issues/1479

   ### Question
   
   I encountered a problem with table.scan().plan_files(): there is no 
noticeable time difference between single-threaded and multi-threaded 
execution. The total time is directly proportional to the number of manifest 
entries. The table I used for testing has 6 manifest files, and each manifest 
file contains around 70,000 entries. The most time-consuming step is 
_open_manifest in the DataScan.plan_files() function, and it performs similarly 
whether or not a thread pool is used. Could someone help me investigate whether 
there might be an issue?
   
   Here is my test code:
   `from pyiceberg.catalog import load_catalog
   from pyspark.sql import SparkSession
   from pyiceberg import expressions as pyi_expr
   import time
   from line_profiler import LineProfiler
   
   catalog = load_catalog("default")
   table = catalog.load_table('b_ods.pyiceberg_test2')
   
   def scan_plan_files(key, values):
       # Plan a scan restricted to rows whose `key` column is in `values`.
       row_filter = pyi_expr.In(key, values)
   
       files = table.scan(
           row_filter=row_filter,
           limit=1000
       ).plan_files()
       print(f"total plans {len(files)}")
       for file in files:
           print(file.file.file_path)
   
   # Time the whole planning call, including manifest reads.
   start_time = time.perf_counter()
   scan_plan_files("cid", {'844'})
   print(f"Time consumed: {time.perf_counter() - start_time:.3f} seconds")
   `
   
   I also modified the ~/.pyiceberg.yaml file, changing *max-workers: 1* to 
*max-workers: 32*, but the total time is still around 64 seconds with little to 
no change.
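
One possible explanation, which this report does not confirm, is that the per-manifest work is CPU-bound Python code; under CPython's global interpreter lock, a thread pool cannot run such work in parallel. The sketch below uses only the standard library to show how that effect can be measured. `decode_entries` and `run` are hypothetical stand-ins, not PyIceberg functions, and the workload sizes simply mirror the 6 manifests of roughly 70,000 entries mentioned above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def decode_entries(n):
    """Hypothetical stand-in for CPU-bound, pure-Python decoding of n entries."""
    total = 0
    for i in range(n):
        total += (i * 31) % 7
    return total


def run(num_manifests, entries_per_manifest, max_workers):
    """Process every 'manifest' on a thread pool; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(decode_entries, [entries_per_manifest] * num_manifests))
    return results, time.perf_counter() - start


serial_results, serial_s = run(6, 70_000, max_workers=1)
pooled_results, pooled_s = run(6, 70_000, max_workers=32)
print(f"1 worker: {serial_s:.3f}s, 32 workers: {pooled_s:.3f}s")
```

If the two timings come out close to each other, the work is probably bound by the interpreter rather than by I/O; if the pooled run is clearly faster, the bottleneck lies elsewhere.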


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] [Question] Why does plan_files not seem to get multi-threading improvement [iceberg-python]

2024-12-31 Thread via GitHub



gitzwz commented on issue #1479:
URL: 
https://github.com/apache/iceberg-python/issues/1479#issuecomment-2566291942

   
   Here is my test code:
   `
   from pyiceberg.catalog import load_catalog
   from pyspark.sql import SparkSession
   from pyiceberg import expressions as pyi_expr
   import time
   from line_profiler import LineProfiler
   
   catalog = load_catalog("default")
   table = catalog.load_table('b_ods.pyiceberg_test2')
   
   def scan_plan_files(key, values):
       # Plan a scan restricted to rows whose `key` column is in `values`.
       row_filter = pyi_expr.In(key, values)
   
       files = table.scan(
           row_filter=row_filter,
           limit=1000
       ).plan_files()
       print(f"total plans {len(files)}")
       for file in files:
           print(file.file.file_path)
   
   # Time the whole planning call, including manifest reads.
   start_time = time.perf_counter()
   scan_plan_files("cid", {'844'})
   print(f"Time consumed: {time.perf_counter() - start_time:.3f} seconds")
   `
   
   I also modified the ~/.pyiceberg.yaml file, changing *max-workers: 1* to 
*max-workers: 32*, but the total time is still around 64 seconds with little to 
no change.



Re: [PR] feat: Support metadata table "Manifests" [iceberg-rust]

2024-12-31 Thread via GitHub



flaneur2020 commented on PR #861:
URL: https://github.com/apache/iceberg-rust/pull/861#issuecomment-2566334491

   @Xuanwo merged the main branch, PTAL 🫡



Re: [PR] Core: Add list/map block sizes [iceberg]

2024-12-31 Thread via GitHub



rustyconover commented on PR #10973:
URL: https://github.com/apache/iceberg/pull/10973#issuecomment-2566807700

   Seems like it's still pending.



Re: [I] How to apply partition/bloom filter to old data? Does rewrite_data_files/rewrite_manifests procedure work? [iceberg]

2024-12-31 Thread via GitHub



hashmapybx commented on issue #11878:
URL: https://github.com/apache/iceberg/issues/11878#issuecomment-2566842621

   By the way, ALTER TABLE prod.db.sample SET TBLPROPERTIES . Have you run into 
any other problems?



Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2024-12-31 Thread via GitHub



kou commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1900309004


##
cmake_modules/BuildUtils.cmake:
##
@@ -182,13 +183,7 @@ function(ADD_ICEBERG_LIB LIB_NAME)
   target_include_directories(${LIB_NAME}_static PRIVATE 
${ARG_PRIVATE_INCLUDES})
 endif()
 
-if(MSVC_TOOLCHAIN)
-  set(LIB_NAME_STATIC ${LIB_NAME}_static)
-else()
-  set(LIB_NAME_STATIC ${LIB_NAME})
-endif()
-
-set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME_STATIC})
+set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME})

Review Comment:
   The error log for this is the following, right? 
https://github.com/apache/iceberg-cpp/actions/runs/12505613514/job/34889235775#step:4:33
   
   ```text
  The imported target "Iceberg::iceberg_arrow_shared" references the file
   
"C:/Users/runneradmin/AppData/Local/Temp/iceberg/lib/iceberg_arrow.lib"
   
 but this file does not exist.
   ```
   
   It seems that the `.lib` for the shared library isn't installed. That's 
strange: the `.lib` for a shared library is installed by default.
   
   Could you try reverting this?




Re: [PR] Remove unneeded metadata read during update event generation [iceberg]

2024-12-31 Thread via GitHub



amogh-jahagirdar commented on code in PR #11829:
URL: https://github.com/apache/iceberg/pull/11829#discussion_r1900307975


##
core/src/main/java/org/apache/iceberg/FastAppend.java:
##
@@ -157,12 +158,16 @@ public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
   }
 
   @Override
-  public Object updateEvent() {
+  public Object updateEvent(Snapshot committedSnapshot) {
 long snapshotId = snapshotId();
-Snapshot snapshot = ops().current().snapshot(snapshotId);
-long sequenceNumber = snapshot.sequenceNumber();
+ValidationException.check(
+snapshotId == committedSnapshot.snapshotId(),
+"Committed snapshotId %s does not match expected snapshotId %s",
+committedSnapshot.snapshotId(),
+snapshotId);
+long sequenceNumber = committedSnapshot.sequenceNumber();

Review Comment:
   Can we uplevel this logic? It's common to all the implementations.



##
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##
@@ -475,10 +475,14 @@ public void commit() {
 }
   }
 
+  Object updateEvent(Snapshot committedSnapshot) {

Review Comment:
   `SnapshotProducer` is package private, so I think we're OK in terms of 
backwards compatibility since it's not like a public API is being broken. 



##
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##
@@ -956,23 +952,16 @@ public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
   }
 
   @Override
-  public Object updateEvent() {
+  public Object updateEvent(Snapshot committedSnapshot) {
 long snapshotId = snapshotId();
-Snapshot justSaved = ops().refresh().snapshot(snapshotId);
-long sequenceNumber = TableMetadata.INVALID_SEQUENCE_NUMBER;
-Map<String, String> summary;
-if (justSaved == null) {
-  // The snapshot just saved may not be present if the latest metadata couldn't be loaded due to
-  // eventual
-  // consistency problems in refresh.
-  LOG.warn("Failed to load committed snapshot: omitting sequence number from notifications");
-  summary = summary();
-} else {
-  sequenceNumber = justSaved.sequenceNumber();
-  summary = justSaved.summary();
-}
-
-return new CreateSnapshotEvent(tableName, operation(), snapshotId, sequenceNumber, summary);
+ValidationException.check(
+snapshotId == committedSnapshot.snapshotId(),
+"Committed snapshotId %s does not match expected snapshotId %s",
+committedSnapshot.snapshotId(),
+snapshotId);

Review Comment:
   Do we really need the validation? I feel like the principle of this change 
is that the update event that is produced is always going to be derived from 
the passed-in committed snapshot. I think passing 
`committedSnapshot.snapshotId()` to the event suffices.




Re: [PR] feat: Support metadata table "Manifests" [iceberg-rust]

2024-12-31 Thread via GitHub



xxchan commented on code in PR #861:
URL: https://github.com/apache/iceberg-rust/pull/861#discussion_r1900309256


##
crates/iceberg/src/metadata_scan.rs:
##
@@ -128,6 +137,135 @@ impl<'a> SnapshotsTable<'a> {
 }
 }
 
+/// Manifests table.
+pub struct ManifestsTable<'a> {
+metadata_table: &'a MetadataTable,
+}
+
+impl<'a> ManifestsTable<'a> {
+fn partition_summary_fields(&self) -> Vec<Field> {
+vec![
+Field::new("contains_null", DataType::Boolean, false),
+Field::new("contains_nan", DataType::Boolean, true),
+Field::new("lower_bound", DataType::Utf8, true),
+Field::new("upper_bound", DataType::Utf8, true),
+]
+}
+
+fn schema(&self) -> Schema {
+Schema::new(vec![
+Field::new("content", DataType::Int8, false),
+Field::new("path", DataType::Utf8, false),
+Field::new("length", DataType::Int64, false),
+Field::new("partition_spec_id", DataType::Int32, false),
+Field::new("added_snapshot_id", DataType::Int64, false),
+Field::new("added_data_files_count", DataType::Int32, false),
+Field::new("existing_data_files_count", DataType::Int32, false),
+Field::new("deleted_data_files_count", DataType::Int32, false),
+Field::new("added_delete_files_count", DataType::Int32, false),
+Field::new("existing_delete_files_count", DataType::Int32, false),
+Field::new("deleted_delete_files_count", DataType::Int32, false),
+Field::new(
+"partition_summaries",
+DataType::List(Arc::new(Field::new_struct(
+"item",
+self.partition_summary_fields(),
+false,
+))),
+false,
+),
+])
+}
+
+/// Scans the manifests table.
+pub async fn scan(&self) -> Result<RecordBatch> {
+let mut content = PrimitiveBuilder::<Int8Type>::new();
+let mut path = StringBuilder::new();
+let mut length = PrimitiveBuilder::<Int64Type>::new();
+let mut partition_spec_id = PrimitiveBuilder::<Int32Type>::new();
+let mut added_snapshot_id = PrimitiveBuilder::<Int64Type>::new();
+let mut added_data_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut existing_data_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut deleted_data_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut added_delete_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut existing_delete_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut deleted_delete_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut partition_summaries = ListBuilder::new(StructBuilder::from_fields(
+Fields::from(self.partition_summary_fields()),
+0,
+))
+.with_field(Arc::new(Field::new_struct(
+"item",
+self.partition_summary_fields(),
+false,
+)));
+
+if let Some(snapshot) = 
self.metadata_table.metadata().current_snapshot() {
+let manifest_list = snapshot
+.load_manifest_list(
+self.metadata_table.0.file_io(),
+&self.metadata_table.0.metadata_ref(),
+)
+.await?;
+for manifest in manifest_list.entries() {
+content.append_value(manifest.content.clone() as i8);

Review Comment:
   We may derive `Copy` for `ManifestContentType`




Re: [PR] feat: Support metadata table "Manifests" [iceberg-rust]

2024-12-31 Thread via GitHub



xxchan commented on code in PR #861:
URL: https://github.com/apache/iceberg-rust/pull/861#discussion_r1900309436


##
crates/iceberg/src/metadata_scan.rs:
##
@@ -128,6 +137,135 @@ impl<'a> SnapshotsTable<'a> {
 }
 }
 
+/// Manifests table.
+pub struct ManifestsTable<'a> {
+metadata_table: &'a MetadataTable,
+}
+
+impl<'a> ManifestsTable<'a> {
+fn partition_summary_fields(&self) -> Vec<Field> {
+vec![
+Field::new("contains_null", DataType::Boolean, false),
+Field::new("contains_nan", DataType::Boolean, true),
+Field::new("lower_bound", DataType::Utf8, true),
+Field::new("upper_bound", DataType::Utf8, true),
+]
+}
+
+fn schema(&self) -> Schema {

Review Comment:
   We might want to make this `pub`, so engines can get the schema first 
without fetching the data.




Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2024-12-31 Thread via GitHub



kou commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1900339989


##
cmake_modules/BuildUtils.cmake:
##
@@ -182,13 +183,7 @@ function(ADD_ICEBERG_LIB LIB_NAME)
   target_include_directories(${LIB_NAME}_static PRIVATE 
${ARG_PRIVATE_INCLUDES})
 endif()
 
-if(MSVC_TOOLCHAIN)
-  set(LIB_NAME_STATIC ${LIB_NAME}_static)
-else()
-  set(LIB_NAME_STATIC ${LIB_NAME})
-endif()
-
-set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME_STATIC})
+set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME})

Review Comment:
   Thanks.
   
   Hmm... `libiceberg_arrow.lib` for `libiceberg_arrow.dll` wasn't installed... 
I think that it's installed by default... I'll take a look at it later...




Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2024-12-31 Thread via GitHub



wgtmac commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1900166523


##
cmake_modules/BuildUtils.cmake:
##
@@ -182,13 +183,7 @@ function(ADD_ICEBERG_LIB LIB_NAME)
   target_include_directories(${LIB_NAME}_static PRIVATE 
${ARG_PRIVATE_INCLUDES})
 endif()
 
-if(MSVC_TOOLCHAIN)
-  set(LIB_NAME_STATIC ${LIB_NAME}_static)
-else()
-  set(LIB_NAME_STATIC ${LIB_NAME})
-endif()
-
-set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME_STATIC})
+set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME})

Review Comment:
   I removed this because I encountered an error on Windows. It complained that 
`iceberg_arrow.dll` had referenced `iceberg_arrow.lib` but actually it was 
named `iceberg_arrow_static.lib`.




Re: [PR] feat: Support metadata table "Manifests" [iceberg-rust]

2024-12-31 Thread via GitHub



Xuanwo commented on code in PR #861:
URL: https://github.com/apache/iceberg-rust/pull/861#discussion_r1900169821


##
crates/iceberg/src/metadata_scan.rs:
##
@@ -50,6 +52,13 @@ impl MetadataTable {
 }
 }
 
+/// Get the manifests table.
+pub fn manifests(&self) -> ManifestsTable {
+ManifestsTable {
+metadata_table: self,

Review Comment:
   Hi, I think we can simply use `Table` here, which suggests that 
`MetadataTable` is merely a wrapper and doesn't implement any additional API.
   



##
crates/iceberg/src/metadata_scan.rs:
##
@@ -128,6 +137,135 @@ impl<'a> SnapshotsTable<'a> {
 }
 }
 
+/// Manifests table.
+pub struct ManifestsTable<'a> {
+metadata_table: &'a MetadataTable,
+}
+
+impl<'a> ManifestsTable<'a> {
+fn partition_summary_fields(&self) -> Vec<Field> {
+vec![
+Field::new("contains_null", DataType::Boolean, false),
+Field::new("contains_nan", DataType::Boolean, true),
+Field::new("lower_bound", DataType::Utf8, true),
+Field::new("upper_bound", DataType::Utf8, true),
+]
+}
+
+fn schema(&self) -> Schema {
+Schema::new(vec![
+Field::new("content", DataType::Int8, false),
+Field::new("path", DataType::Utf8, false),
+Field::new("length", DataType::Int64, false),
+Field::new("partition_spec_id", DataType::Int32, false),
+Field::new("added_snapshot_id", DataType::Int64, false),
+Field::new("added_data_files_count", DataType::Int32, false),
+Field::new("existing_data_files_count", DataType::Int32, false),
+Field::new("deleted_data_files_count", DataType::Int32, false),
+Field::new("added_delete_files_count", DataType::Int32, false),
+Field::new("existing_delete_files_count", DataType::Int32, false),
+Field::new("deleted_delete_files_count", DataType::Int32, false),
+Field::new(
+"partition_summaries",
+DataType::List(Arc::new(Field::new_struct(
+"item",
+self.partition_summary_fields(),
+false,
+))),
+false,
+),
+])
+}
+
+/// Scans the manifests table.
+pub async fn scan(&self) -> Result<RecordBatch> {
+let mut content = PrimitiveBuilder::<Int8Type>::new();
+let mut path = StringBuilder::new();
+let mut length = PrimitiveBuilder::<Int64Type>::new();
+let mut partition_spec_id = PrimitiveBuilder::<Int32Type>::new();
+let mut added_snapshot_id = PrimitiveBuilder::<Int64Type>::new();
+let mut added_data_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut existing_data_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut deleted_data_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut added_delete_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut existing_delete_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut deleted_delete_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut partition_summaries = ListBuilder::new(StructBuilder::from_fields(
+Fields::from(self.partition_summary_fields()),
+0,
+))
+.with_field(Arc::new(Field::new_struct(
+"item",
+self.partition_summary_fields(),
+false,
+)));
+
+if let Some(snapshot) = 
self.metadata_table.metadata().current_snapshot() {
+let manifest_list = snapshot
+.load_manifest_list(
+self.metadata_table.0.file_io(),
+&self.metadata_table.0.metadata_ref(),
+)
+.await?;
+for manifest in manifest_list.entries() {
+content.append_value(manifest.content.clone() as i8);

Review Comment:
   It's a bit unusual to see something that can use `as i8` but still requires 
`clone`.




[PR] Updated Readme file to reflect Implemented operations [iceberg-go]

2024-12-31 Thread via GitHub



chil-pavn opened a new pull request, #242:
URL: https://github.com/apache/iceberg-go/pull/242

   Hey @zeroshade, I raised this PR because I thought it would be helpful for 
folks checking the Readme.md file to see the roadmap.



Re: [PR] Add GitHub cpp-linter-action [iceberg-cpp]

2024-12-31 Thread via GitHub



wgtmac commented on PR #20:
URL: https://github.com/apache/iceberg-cpp/pull/20#issuecomment-2566233077

   > pre-commit only runs clang-format, not clang-tidy - I think it'd still be 
useful even without the PR comment?
   
   That makes sense! Let me enable it for now. We can improve or discard it 
after we have experience with more PRs.



[PR] ParallelIterable: Queue Size w/ O(1) [iceberg]

2024-12-31 Thread via GitHub



shanielh opened a new pull request, #11895:
URL: https://github.com/apache/iceberg/pull/11895

   Instead of calling ConcurrentLinkedQueue.size(), which traverses the linked
   queue to compute its size, maintain an AtomicInteger that tracks the size
   of the queue.
   
   The ConcurrentLinkedQueue.size() documentation states that this method is
   typically not very useful in concurrent applications.
   
   Note: I have a JFR dump showing that this method accounts for 35% of CPU
   utilization, which is why I think this commit is important.
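
The idea in this description can be sketched as follows. This is an illustration in Python rather than the Java change itself: `CountedLinkedQueue` is a hypothetical class, and it uses a lock where the PR describes a lock-free queue paired with an AtomicInteger. What it shows is the trade: every insert and removal updates a counter, so reading the size no longer requires walking the links.

```python
import threading


class _Node:
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None


class CountedLinkedQueue:
    """FIFO linked queue that tracks its size in a counter (hypothetical sketch)."""

    def __init__(self):
        self._head = None
        self._tail = None
        self._size = 0
        self._lock = threading.Lock()

    def offer(self, value):
        """Append a value and bump the counter under the same lock."""
        node = _Node(value)
        with self._lock:
            if self._tail is None:
                self._head = self._tail = node
            else:
                self._tail.next = node
                self._tail = node
            self._size += 1

    def poll(self):
        """Remove and return the oldest value, or None if the queue is empty."""
        with self._lock:
            node = self._head
            if node is None:
                return None
            self._head = node.next
            if self._head is None:
                self._tail = None
            self._size -= 1
            return node.value

    def size(self):
        """O(1): read the counter instead of traversing the list."""
        return self._size

    def traversed_size(self):
        """O(n): count nodes by walking the links, as a size-by-traversal would."""
        with self._lock:
            count = 0
            node = self._head
            while node is not None:
                count += 1
                node = node.next
            return count
```

The cost of this design is one extra counter update per `offer` and `poll`, which is usually small next to a traversal on every size check.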



[I] Data Loss in Flink Job with Iceberg Sink After Restart: How to Ensure Consistent Writes? [iceberg]

2024-12-31 Thread via GitHub

sanchay0 opened a new issue, #11894:
URL: https://github.com/apache/iceberg/issues/11894

### Query engine

Flink

### Question

I am running a Flink job that reads data from Kafka, processes it into a
Flink [Row
object](https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/types/Row.html),
and writes it to an [Iceberg
Sink](https://iceberg.apache.org/javadoc/1.3.0/org/apache/iceberg/flink/sink/FlinkSink.Builder.html#append--).
To deploy new code changes, I restart the job from the latest savepoint, which
is committed at the Kafka source. This savepoint stores the most recent Kafka
offset read by the job, allowing the subsequent run to continue from the
previous offset, ensuring transparent failure recovery. Recently, I encountered
a scenario where data was lost in the final output after a job restart. Here's
what happened:

- The job reads from Kafka starting at offset 1 and writes the data to an
Iceberg table backed by S3.
- It then reads from offset 2, but the writes to the Iceberg table are
delayed due to backpressure or network issues.
- The job proceeds to read from offset 3.
- Before the data from offset 2 is written to S3, I restart the job. After
the restart, the job begins reading from offset 3, resulting in the loss of
data from offset 2, which was never written to S3.

Is there a workaround for this problem, or is it an inherent limitation of
the Iceberg Sink?
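
The sequence of events described above can be reproduced in a toy model. The sketch below is not Flink or Iceberg code: `simulate` and its parameters are hypothetical, and the second mode (resuming from the oldest offset that was never durably written) is only an illustrative recovery policy, not a statement about how the Iceberg sink behaves.

```python
def simulate(offsets, delayed, restart_at, resume_from_durable):
    """Toy model of a restart while some writes are still in flight.

    offsets: offsets in the order they are read.
    delayed: offsets whose writes have not completed when the restart happens.
    restart_at: offset the job is about to process when it is restarted.
    resume_from_durable: if True, resume from the oldest unwritten offset;
        if False, resume from the last read position, as in the report.
    """
    written = []    # offsets durably written to the table
    in_flight = []  # offsets read but not yet written at restart time

    for offset in offsets:
        if offset == restart_at:
            break
        if offset in delayed:
            in_flight.append(offset)
        else:
            written.append(offset)

    # The restart discards every in-flight write.
    if resume_from_durable and in_flight:
        resume = min(in_flight)
    else:
        resume = restart_at

    for offset in offsets:
        if offset >= resume and offset not in written:
            written.append(offset)
    return sorted(written)


lossy = simulate([1, 2, 3, 4, 5], delayed={2}, restart_at=3, resume_from_durable=False)
safe = simulate([1, 2, 3, 4, 5], delayed={2}, restart_at=3, resume_from_durable=True)
print(lossy, safe)
```

In the first mode offset 2 never reaches the table, which matches the loss described in the question; in the second mode it is read again after the restart.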


[PR] Gh 1223 metadata only row count [iceberg-python]

2024-12-31 Thread via GitHub



tusharchou opened a new pull request, #1480:
URL: https://github.com/apache/iceberg-python/pull/1480

   (no comment)



Re: [PR] Gh 1223 metadata only row count [iceberg-python]

2024-12-31 Thread via GitHub



tusharchou closed pull request #1480: Gh 1223 metadata only row count
URL: https://github.com/apache/iceberg-python/pull/1480



Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2024-12-31 Thread via GitHub



kou commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1900073077


##
README.md:
##
@@ -44,9 +61,14 @@ After installing the core libraries, you can build the 
examples:
 
 ```bash
 cd iceberg-cpp/example
-mkdir build && cd build
-cmake .. -DCMAKE_PREFIX_PATH=/tmp/iceberg
-cmake --build .
+cmake -S . -B build -DCMAKE_PREFIX_PATH=/path/to/install
+cmake --build build
+```
+
+If you are using provided Apache Arrow, you need to include `/path/to/arrow` 
in `CMAKE_PREFIX_PATH` as below.
+
+```bash
+cmake .. -DCMAKE_PREFIX_PATH="/path/to/install;/path/to/arrow"

Review Comment:
   ```suggestion
   cmake -S . -B build -DCMAKE_PREFIX_PATH="/path/to/install;/path/to/arrow"
   ```



##
cmake_modules/ThirdpartyToolchain.cmake:
##
@@ -0,0 +1,132 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Accumulate all dependencies to provide suitable static link parameters to the
+# third party libraries.
+set(ICEBERG_SYSTEM_DEPENDENCIES)
+set(ICEBERG_ARROW_INSTALL_INTERFACE_LIBS)
+
+# --
+# Versions and URLs for toolchain builds
+
+set(ICEBERG_ARROW_BUILD_VERSION "18.1.0")
+set(ICEBERG_ARROW_BUILD_SHA256_CHECKSUM
+"2dc8da5f8796afe213ecc5e5aba85bb82d91520eff3cf315784a52d0fa61d7fc")
+
+if(DEFINED ENV{ICEBERG_ARROW_URL})
+  set(ARROW_SOURCE_URL "$ENV{ICEBERG_ARROW_URL}")
+else()
+  set(ARROW_SOURCE_URL
+  
"https://www.apache.org/dyn/closer.cgi?action=download&filename=/arrow/arrow-${ICEBERG_ARROW_BUILD_VERSION}/apache-arrow-${ICEBERG_ARROW_BUILD_VERSION}.tar.gz";

Review Comment:
   `.lua` is better: 
https://infra.apache.org/release-download-pages.html#download-page
   
   ```suggestion
 
"https://www.apache.org/dyn/closer.lua?action=download&filename=/arrow/arrow-${ICEBERG_ARROW_BUILD_VERSION}/apache-arrow-${ICEBERG_ARROW_BUILD_VERSION}.tar.gz";
   ```



##
example/CMakeLists.txt:
##
@@ -22,10 +22,10 @@ project(example)
 
 set(CMAKE_CXX_STANDARD 20)
 
-find_package(iceberg CONFIG REQUIRED)
-find_package(puffin CONFIG REQUIRED)
+find_package(Iceberg CONFIG REQUIRED)
 
 add_executable(demo_example demo_example.cc)
 
-target_link_libraries(demo_example PRIVATE iceberg::iceberg_core_static
-   puffin::iceberg_puffin_static)
+target_link_libraries(demo_example
+  PRIVATE Iceberg::iceberg_core_static 
Iceberg::iceberg_puffin_static
+  Iceberg::iceberg_arrow_static)

Review Comment:
   OK. Then `PRIVATE Iceberg::iceberg_puffin_static 
Iceberg::iceberg_arrow_static` will be better.
   (`Iceberg::iceberg_core_static` should be linked after 
`Iceberg::iceberg_{puffin,arrow}_static`. `Iceberg::iceberg_core_static` 
should be linked automatically by `Iceberg::iceberg_{arrow,puffin}_static`.)



##
cmake_modules/ThirdpartyToolchain.cmake:
##
@@ -0,0 +1,142 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Accumulate all dependencies to provide suitable static link parameters to the
+# third party libraries.
+set(ICEBERG_SYSTEM_DEPENDENCIES)
+set(ICEBERG_VENDOR_DEPENDENCIES)
+set(ICEBERG_ARROW_INSTALL_INTERFACE_LIBS)
+
+# --
+# Versions and URLs for toolchain builds
+
+set(ICEBERG_ARROW_BUILD_VERSION "18.1.0")
+set(ICEBERG_ARROW_BUILD_SHA256_CHECKSUM
+"2dc8da5f8796afe213ecc5e5aba85bb82d91520eff3cf315784a

Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2024-12-31 Thread via GitHub



wgtmac commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1900325544


##
cmake_modules/BuildUtils.cmake:
##
@@ -182,13 +183,7 @@ function(ADD_ICEBERG_LIB LIB_NAME)
   target_include_directories(${LIB_NAME}_static PRIVATE 
${ARG_PRIVATE_INCLUDES})
 endif()
 
-if(MSVC_TOOLCHAIN)
-  set(LIB_NAME_STATIC ${LIB_NAME}_static)
-else()
-  set(LIB_NAME_STATIC ${LIB_NAME})
-endif()
-
-set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME_STATIC})
+set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME})

Review Comment:
   It is installed, but to a different location (bin vs lib)
   - C:/Users/runneradmin/AppData/Local/Temp/iceberg/bin/iceberg_arrow.dll
   - C:/Users/runneradmin/AppData/Local/Temp/iceberg/lib/iceberg_arrow.lib
   
   If I use `${LIB_NAME_STATIC}`, it will be
   - 
C:/Users/runneradmin/AppData/Local/Temp/iceberg/lib/iceberg_arrow_static.lib
   
   I don't know how to make `iceberg_arrow.dll` reference 
`iceberg_arrow_static.lib` instead of the default `iceberg_arrow.lib`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

[PR] Doc: Add DELETE ORPHAN-FILES example [iceberg]

2024-12-31 Thread via GitHub



ebyhr opened a new pull request, #11896:
URL: https://github.com/apache/iceberg/pull/11896

   Relates to https://github.com/apache/hive/pull/4897
   
   



Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2024-12-31 Thread via GitHub



kou commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1900327030


##
cmake_modules/ThirdpartyToolchain.cmake:
##
@@ -0,0 +1,142 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Accumulate all dependencies to provide suitable static link parameters to the
+# third party libraries.
+set(ICEBERG_SYSTEM_DEPENDENCIES)
+set(ICEBERG_VENDOR_DEPENDENCIES)
+set(ICEBERG_ARROW_INSTALL_INTERFACE_LIBS)
+
+# --
+# Versions and URLs for toolchain builds
+
+set(ICEBERG_ARROW_BUILD_VERSION "18.1.0")
+set(ICEBERG_ARROW_BUILD_SHA256_CHECKSUM
+"2dc8da5f8796afe213ecc5e5aba85bb82d91520eff3cf315784a52d0fa61d7fc")
+set(ARROW_VENDORED TRUE)
+
+if(DEFINED ENV{ICEBERG_ARROW_URL})
+  set(ARROW_SOURCE_URL "$ENV{ICEBERG_ARROW_URL}")
+else()
+  set(ARROW_SOURCE_URL
+  "https://www.apache.org/dyn/closer.cgi?action=download&filename=/arrow/arrow-${ICEBERG_ARROW_BUILD_VERSION}/apache-arrow-${ICEBERG_ARROW_BUILD_VERSION}.tar.gz"
+  "https://downloads.apache.org/arrow/arrow-${ICEBERG_ARROW_BUILD_VERSION}/apache-arrow-${ICEBERG_ARROW_BUILD_VERSION}.tar.gz"
+  "https://github.com/apache/arrow/releases/download/apache-arrow-${ICEBERG_ARROW_BUILD_VERSION}/apache-arrow-${ICEBERG_ARROW_BUILD_VERSION}.tar.gz"
+  )
+endif()
+
+# --
+# FetchContent
+
+include(FetchContent)
+set(FC_DECLARE_COMMON_OPTIONS)
+if(CMAKE_VERSION VERSION_GREATER_EQUAL 3.28)
+  list(APPEND FC_DECLARE_COMMON_OPTIONS EXCLUDE_FROM_ALL TRUE)
+endif()
+
+macro(prepare_fetchcontent)
+  set(BUILD_SHARED_LIBS OFF)
+  set(BUILD_STATIC_LIBS ON)
+  set(CMAKE_COMPILE_WARNING_AS_ERROR FALSE)
+  set(CMAKE_EXPORT_NO_PACKAGE_REGISTRY TRUE)
+  set(CMAKE_POSITION_INDEPENDENT_CODE ON)
+endmacro()
+
+# --
+# Apache Arrow
+
+function(resolve_arrow_dependency)
+  prepare_fetchcontent()
+
+  set(ARROW_BUILD_SHARED
+  OFF
+  CACHE BOOL "" FORCE)
+  set(ARROW_BUILD_STATIC
+  ON
+  CACHE BOOL "" FORCE)
+  set(ARROW_FILESYSTEM
+  OFF
+  CACHE BOOL "" FORCE)
+  set(ARROW_SIMD_LEVEL
+  "NONE"
+  CACHE STRING "" FORCE)
+  set(ARROW_RUNTIME_SIMD_LEVEL
+  "NONE"
+  CACHE STRING "" FORCE)
+  set(ARROW_POSITION_INDEPENDENT_CODE
+  ON
+  CACHE BOOL "" FORCE)
+  set(ARROW_DEPENDENCY_SOURCE
+  "AUTO"
+  CACHE STRING "" FORCE)
+
+  fetchcontent_declare(Arrow
+   ${FC_DECLARE_COMMON_OPTIONS}
+   URL ${ARROW_SOURCE_URL}
+   URL_HASH "SHA256=${ICEBERG_ARROW_BUILD_SHA256_CHECKSUM}"
+   SOURCE_SUBDIR
+   cpp
+   FIND_PACKAGE_ARGS
+   NAMES
+   Arrow
+   CONFIG)
+
+  # Add Arrow cmake modules to the search path
+  list(PREPEND CMAKE_MODULE_PATH
+   ${CMAKE_CURRENT_BINARY_DIR}/_deps/arrow-src/cpp/cmake_modules)

Review Comment:
   https://github.com/apache/arrow/issues/45142




Re: [PR] Core: Add list/map block sizes [iceberg]

2024-12-31 Thread via GitHub



github-actions[bot] commented on PR #10973:
URL: https://github.com/apache/iceberg/pull/10973#issuecomment-2566766280

   This pull request has been marked as stale due to 30 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@iceberg.apache.org list. Thank you for your 
contributions.



Re: [PR] Core: Expose `added_rows_count`, `existing_rows_count` and `deleted_rows_count` fields in all_manifests and manifests tables [iceberg]

2024-12-31 Thread via GitHub



github-actions[bot] commented on PR #11679:
URL: https://github.com/apache/iceberg/pull/11679#issuecomment-2566766304

   This pull request has been marked as stale due to 30 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@iceberg.apache.org list. Thank you for your 
contributions.



Re: [I] Custom s3 endpoint: Unable to execute HTTP request: Remote host terminated the handshake [iceberg]

2024-12-31 Thread via GitHub



github-actions[bot] commented on issue #10490:
URL: https://github.com/apache/iceberg/issues/10490#issuecomment-2566766228

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'



Re: [I] Custom s3 endpoint: Unable to execute HTTP request: Remote host terminated the handshake [iceberg]

2024-12-31 Thread via GitHub



github-actions[bot] closed issue #10490: Custom s3 endpoint: Unable to execute 
HTTP request: Remote host terminated the handshake
URL: https://github.com/apache/iceberg/issues/10490



Re: [I] [SUPPORT] Support setting the maximum number of partitions for a table [iceberg]

2024-12-31 Thread via GitHub



github-actions[bot] commented on issue #10628:
URL: https://github.com/apache/iceberg/issues/10628#issuecomment-2566766271

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.



Re: [I] Crash when writing map type with unsigned types [iceberg-python]

2024-12-31 Thread via GitHub



github-actions[bot] closed issue #837: Crash when writing map type with 
unsigned types
URL: https://github.com/apache/iceberg-python/issues/837



Re: [I] Crash when writing map type with unsigned types [iceberg-python]

2024-12-31 Thread via GitHub



github-actions[bot] commented on issue #837:
URL: https://github.com/apache/iceberg-python/issues/837#issuecomment-2566767364

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'



Re: [I] `parquet_path_to_id_mapping` generates incorrect path for List types [iceberg-python]

2024-12-31 Thread via GitHub



github-actions[bot] closed issue #716: `parquet_path_to_id_mapping` generates 
incorrect path for List types
URL: https://github.com/apache/iceberg-python/issues/716



Re: [I] `parquet_path_to_id_mapping` generates incorrect path for List types [iceberg-python]

2024-12-31 Thread via GitHub



github-actions[bot] commented on issue #716:
URL: https://github.com/apache/iceberg-python/issues/716#issuecomment-2566767375

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'



Re: [I] [SUPPORT] Support setting the maximum number of partitions for a table [iceberg]

2024-12-31 Thread via GitHub



melin closed issue #10628: [SUPPORT] Support setting the maximum number of 
partitions for a table
URL: https://github.com/apache/iceberg/issues/10628



Re: [PR] feat: Support metadata table "Manifests" [iceberg-rust]

2024-12-31 Thread via GitHub



flaneur2020 commented on code in PR #861:
URL: https://github.com/apache/iceberg-rust/pull/861#discussion_r1900324416


##
crates/iceberg/src/metadata_scan.rs:
##
@@ -128,6 +137,135 @@ impl<'a> SnapshotsTable<'a> {
 }
 }
 
+/// Manifests table.
+pub struct ManifestsTable<'a> {
+metadata_table: &'a MetadataTable,
+}
+
+impl<'a> ManifestsTable<'a> {
+fn partition_summary_fields(&self) -> Vec<Field> {
+vec![
+Field::new("contains_null", DataType::Boolean, false),
+Field::new("contains_nan", DataType::Boolean, true),
+Field::new("lower_bound", DataType::Utf8, true),
+Field::new("upper_bound", DataType::Utf8, true),
+]
+}
+
+fn schema(&self) -> Schema {
+Schema::new(vec![
+Field::new("content", DataType::Int8, false),
+Field::new("path", DataType::Utf8, false),
+Field::new("length", DataType::Int64, false),
+Field::new("partition_spec_id", DataType::Int32, false),
+Field::new("added_snapshot_id", DataType::Int64, false),
+Field::new("added_data_files_count", DataType::Int32, false),
+Field::new("existing_data_files_count", DataType::Int32, false),
+Field::new("deleted_data_files_count", DataType::Int32, false),
+Field::new("added_delete_files_count", DataType::Int32, false),
+Field::new("existing_delete_files_count", DataType::Int32, false),
+Field::new("deleted_delete_files_count", DataType::Int32, false),
+Field::new(
+"partition_summaries",
+DataType::List(Arc::new(Field::new_struct(
+"item",
+self.partition_summary_fields(),
+false,
+))),
+false,
+),
+])
+}
+
+/// Scans the manifests table.
+pub async fn scan(&self) -> Result<RecordBatch> {
+let mut content = PrimitiveBuilder::<Int8Type>::new();
+let mut path = StringBuilder::new();
+let mut length = PrimitiveBuilder::<Int64Type>::new();
+let mut partition_spec_id = PrimitiveBuilder::<Int32Type>::new();
+let mut added_snapshot_id = PrimitiveBuilder::<Int64Type>::new();
+let mut added_data_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut existing_data_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut deleted_data_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut added_delete_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut existing_delete_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut deleted_delete_files_count = PrimitiveBuilder::<Int32Type>::new();
+let mut partition_summaries = ListBuilder::new(StructBuilder::from_fields(
+Fields::from(self.partition_summary_fields()),
+0,
+))
+.with_field(Arc::new(Field::new_struct(
+"item",
+self.partition_summary_fields(),
+false,
+)));
+
+if let Some(snapshot) = 
self.metadata_table.metadata().current_snapshot() {
+let manifest_list = snapshot
+.load_manifest_list(
+self.metadata_table.0.file_io(),
+&self.metadata_table.0.metadata_ref(),
+)
+.await?;
+for manifest in manifest_list.entries() {
+content.append_value(manifest.content.clone() as i8);

Review Comment:
   fixed in 4c6e338




Re: [PR] feat: Support metadata table "Manifests" [iceberg-rust]

2024-12-31 Thread via GitHub



flaneur2020 commented on code in PR #861:
URL: https://github.com/apache/iceberg-rust/pull/861#discussion_r1900324391


##
crates/iceberg/src/metadata_scan.rs:
##
@@ -128,6 +137,135 @@ impl<'a> SnapshotsTable<'a> {
 }
 }
 
+/// Manifests table.
+pub struct ManifestsTable<'a> {
+metadata_table: &'a MetadataTable,
+}
+
+impl<'a> ManifestsTable<'a> {
+fn partition_summary_fields(&self) -> Vec<Field> {
+vec![
+Field::new("contains_null", DataType::Boolean, false),
+Field::new("contains_nan", DataType::Boolean, true),
+Field::new("lower_bound", DataType::Utf8, true),
+Field::new("upper_bound", DataType::Utf8, true),
+]
+}
+
+fn schema(&self) -> Schema {

Review Comment:
   fixed in 83e8811




Re: [PR] feat: Support metadata table "Manifests" [iceberg-rust]

2024-12-31 Thread via GitHub



flaneur2020 commented on code in PR #861:
URL: https://github.com/apache/iceberg-rust/pull/861#discussion_r1900324950


##
crates/iceberg/src/metadata_scan.rs:
##
@@ -50,6 +52,13 @@ impl MetadataTable {
 }
 }
 
+/// Get the manifests table.
+pub fn manifests(&self) -> ManifestsTable {
+ManifestsTable {
+metadata_table: self,

Review Comment:
   fixed in 9fe6bd0, PTAL




Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2024-12-31 Thread via GitHub



wgtmac commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1900328737


##
cmake_modules/BuildUtils.cmake:
##
@@ -182,13 +183,7 @@ function(ADD_ICEBERG_LIB LIB_NAME)
   target_include_directories(${LIB_NAME}_static PRIVATE 
${ARG_PRIVATE_INCLUDES})
 endif()
 
-if(MSVC_TOOLCHAIN)
-  set(LIB_NAME_STATIC ${LIB_NAME}_static)
-else()
-  set(LIB_NAME_STATIC ${LIB_NAME})
-endif()
-
-set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME_STATIC})
+set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME})

Review Comment:
   I have reverted ${LIB_NAME_STATIC}: 
https://github.com/apache/iceberg-cpp/actions/runs/12567791967/job/35034664088?pr=6




Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2024-12-31 Thread via GitHub



kou commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1900327762


##
cmake_modules/BuildUtils.cmake:
##
@@ -182,13 +183,7 @@ function(ADD_ICEBERG_LIB LIB_NAME)
   target_include_directories(${LIB_NAME}_static PRIVATE 
${ARG_PRIVATE_INCLUDES})
 endif()
 
-if(MSVC_TOOLCHAIN)
-  set(LIB_NAME_STATIC ${LIB_NAME}_static)
-else()
-  set(LIB_NAME_STATIC ${LIB_NAME})
-endif()
-
-set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME_STATIC})
+set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME})

Review Comment:
   > It is installed, but to a different location (bin vs lib)
   > 
   > * C:/Users/runneradmin/AppData/Local/Temp/iceberg/bin/iceberg_arrow.dll
   > 
   > * C:/Users/runneradmin/AppData/Local/Temp/iceberg/lib/iceberg_arrow.lib
   
   This is expected.
   
   Users use `lib/iceberg_arrow.lib` when they build their program.
   Users use `bin/iceberg_arrow.dll` when they run their program.
   
   > If I use `${LIB_NAME_STATIC}`, it will be
   > 
   > * 
C:/Users/runneradmin/AppData/Local/Temp/iceberg/lib/iceberg_arrow_static.lib
   > 
   > 
   > I don't know how to make `iceberg_arrow.dll` reference 
`iceberg_arrow_static.lib` instead of the default `iceberg_arrow.lib`.
   
   If users use the static library, they don't use `iceberg_arrow.dll`. They use 
only `iceberg_arrow_static.lib` for static linking.
   
   If users use the shared library, they use `lib/iceberg_arrow.lib` when building 
and `bin/iceberg_arrow.dll` when running.
   




Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2024-12-31 Thread via GitHub



wgtmac commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1900328737


##
cmake_modules/BuildUtils.cmake:
##
@@ -182,13 +183,7 @@ function(ADD_ICEBERG_LIB LIB_NAME)
   target_include_directories(${LIB_NAME}_static PRIVATE 
${ARG_PRIVATE_INCLUDES})
 endif()
 
-if(MSVC_TOOLCHAIN)
-  set(LIB_NAME_STATIC ${LIB_NAME}_static)
-else()
-  set(LIB_NAME_STATIC ${LIB_NAME})
-endif()
-
-set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME_STATIC})
+set_target_properties(${LIB_NAME}_static PROPERTIES OUTPUT_NAME 
${LIB_NAME})

Review Comment:
   I have reverted ${LIB_NAME_STATIC}. Let's see.




Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2024-12-31 Thread via GitHub



wgtmac commented on PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#issuecomment-2566218953

   Thanks @kou for your suggestion! Now the CMake implementation is greatly 
simplified. FYI, the installed directory looks like below:
   ```
   ├── include/
   │   └── iceberg/
   │       ├── puffin.h
   │       ├── table.h
   │       └── demo_arrow.h
   ├── lib/
   │   ├── libiceberg_vendored_arrow.a
   │   ├── libiceberg_arrow.a
   │   ├── libiceberg_core.a
   │   ├── libiceberg_puffin.a
   │   └── cmake/
   │       ├── IcebergConfig.cmake
   │       ├── IcebergConfigVersion.cmake
   │       ├── IcebergTargets.cmake
   │       └── IcebergTargets-debug.cmake
   └── share/
       └── doc/
           └── Iceberg/
               ├── LICENSE
               └── NOTICE
   ```



Re: [I] FileIO S3: Add support for Assume-Role-Arn and other AWS Client properties [iceberg-rust]

2024-12-31 Thread via GitHub



charlesdong1991 commented on issue #527:
URL: https://github.com/apache/iceberg-rust/issues/527#issuecomment-2566649287

   Hi, I am new to the project. If nobody has picked this up yet, can I give it a 
try to get to know the code base better?



Re: [PR] Count rows as a metadata only operation [iceberg-python]

2024-12-31 Thread via GitHub



gli-chris-hao commented on code in PR #1388:
URL: https://github.com/apache/iceberg-python/pull/1388#discussion_r1900228680


##
pyiceberg/table/__init__.py:
##
@@ -1594,6 +1609,29 @@ def to_ray(self) -> ray.data.dataset.Dataset:
 
 return ray.data.from_arrow(self.to_arrow())
 
+def count(self) -> int:
+"""
+Usage: calculates the total number of records in a Scan that haven't had positional deletes
+"""
+res = 0
+# every task is a FileScanTask
+tasks = self.plan_files()
+
+for task in tasks:
+# task.residual is a Boolean Expression if the filter condition is fully satisfied by the
+# partition value and task.delete_files represents that positional deletes haven't been merged yet
+# hence those files have to be read as a pyarrow table applying the filter and deletes
+if task.residual == AlwaysTrue() and not len(task.delete_files):
+# Every File has a metadata stat that stores the file record count
+res += task.file.record_count
+else:
+from pyiceberg.io.pyarrow import ArrowScan
+tbl = ArrowScan(
+self.table_metadata, self.io, self.projection(), 
self.row_filter, self.case_sensitive, self.limit
+).to_table([task])
+res += len(tbl)
+return res

Review Comment:
   I love this approach! My only concern is loading too much data into 
memory at once. Although this loads just one file at a time, in the worst 
case a single file could be very large. Shall we define a threshold and 
check it? For example, if `file size < 512MB`, load the entire file; otherwise 
turn it into a `pa.RecordBatchReader` and read a stream of record batches for 
counting.
   
   ```
   import pyarrow as pa

   from pyiceberg.io.pyarrow import ArrowScan, schema_to_pyarrow

   target_schema = schema_to_pyarrow(self.projection())

   batches = ArrowScan(
       self.table_metadata, self.io, self.projection(), self.row_filter, self.case_sensitive, self.limit
   ).to_record_batches([task])

   reader = pa.RecordBatchReader.from_batches(
       target_schema,
       batches,
   )

   # Stream the batches so only one batch is held in memory at a time.
   count = 0
   for batch in reader:
       count += batch.num_rows
   return count
   ```
   
   
https://github.com/apache/iceberg-python/blob/e6465001bd8a47718ff79da4def5800962e6b895/pyiceberg/table/__init__.py#L1557-L1564
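
The threshold idea in the comment above can be sketched independently of PyIceberg. This is a minimal illustration under stated assumptions, not the PR's implementation: `Batch`, `count_streaming`, `count_task`, and `SIZE_THRESHOLD_BYTES` are hypothetical names, `Batch` stands in for a `pa.RecordBatch` (only `num_rows` is used), and the 512 MB value is simply the example figure suggested above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

# Hypothetical threshold, taken from the "512MB" example in the review comment.
SIZE_THRESHOLD_BYTES = 512 * 1024 * 1024


@dataclass
class Batch:
    """Stand-in for a record batch; only `num_rows` matters for counting."""

    num_rows: int


def count_streaming(batches: Iterable[Batch]) -> int:
    """Sum row counts batch by batch, so only one batch is alive at a time."""
    return sum(batch.num_rows for batch in batches)


def count_task(
    file_size_in_bytes: int,
    load_all: Callable[[], int],
    stream: Callable[[], Iterator[Batch]],
    threshold: int = SIZE_THRESHOLD_BYTES,
) -> int:
    """Count eagerly for files below the threshold, otherwise stream batches."""
    if file_size_in_bytes < threshold:
        return load_all()
    return count_streaming(stream())
```

In a real scan, `load_all` would presumably wrap the eager `to_table([task])` path and `stream` the `to_record_batches([task])` path; passing them as callables keeps the unused path from doing any work.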




[PR] Bump pyparsing from 3.2.0 to 3.2.1 [iceberg-python]

2024-12-31 Thread via GitHub



dependabot[bot] opened a new pull request, #1481:
URL: https://github.com/apache/iceberg-python/pull/1481

   Bumps [pyparsing](https://github.com/pyparsing/pyparsing) from 3.2.0 to 
3.2.1.
   
   Changelog
   Sourced from pyparsing's changelog 
(https://github.com/pyparsing/pyparsing/blob/master/CHANGES).
   
   Version 3.2.1 - December, 2024
   
   - Updated generated railroad diagrams to make non-terminal elements links 
to their related sub-diagrams. This greatly improves navigation of the diagram, 
especially for large, complex parsers.
   - Simplified railroad diagrams emitted for parsers using `infix_notation`, 
by hiding lookahead terms. Renamed internally generated expressions for clarity, 
and improved diagramming.
   - Improved performance of `cpp_style_comment`, `c_style_comment`, 
`common.fnumber` and `common.ieee_float` Regex expressions. PRs submitted by 
Gabriel Gerlero, nice work, thanks!
   - Add missing type annotations to `match_only_at_col`, `replace_with`, 
`remove_quotes`, `with_attribute`, and `with_class`. Issue #585 
(https://redirect.github.com/pyparsing/pyparsing/issues/585) reported by 
rafrafrek.
   - Added generated diagrams for many of the examples.
   - Replaced old examples/0README.html file with examples/README.md file.
   
   
   
   Commits
   
   - abb3b4b 
(https://github.com/pyparsing/pyparsing/commit/abb3b4bcfaeaa5e1c83a3af4c1c6c8a03acc7dcd): 
Prep for 3.2.1 release
   - 91a56b4 
(https://github.com/pyparsing/pyparsing/commit/91a56b468386e63b45030463e3d5fc78f4d03e98): 
delete old 0README.html file from the examples directory, replaced with READM...
   - 51e2555 
(https://github.com/pyparsing/pyparsing/commit/51e25558c9e5d5fb3a904045d8bc79b9468c12a8): 
improved README.md for examples directory
   - 5e32ee6 
(https://github.com/pyparsing/pyparsing/commit/5e32ee62ed9818e6a2fdf3854fea2faeb6f3715d): 
add parse_python_value.py to test_examples
   - 2f41d0f 
(https://github.com/pyparsing/pyparsing/commit/2f41d0f701caeeb5a25fa86de7cbed99a37dbd7d): 
update version timestamp
   - dc8d66e 
(https://github.com/pyparsing/pyparsing/commit/dc8d66eac77c6ff89ce9fac14f06d4f933e25845): 
blackening
   - 00d7aed 
(https://github.com/pyparsing/pyparsing/commit/00d7aed694b8c0971c2f853e09af211419b9017d): 
Fix nested expression diagram element generation
   - dde1a02 
(https://github.com/pyparsing/pyparsing/commit/dde1a025d0783d4e90ce0ee089c9b7cf89f50c78): 
Added generated diagrams for many of the examples
   - 471801e 
(https://github.com/pyparsing/pyparsing/commit/471801e30a4432a79535e8a3a21685dc19e934c5): 
Fixed cacheing bug in diagram generation; modified names for inner elements o...
   - 4bb2647 
(https://github.com/pyparsing/pyparsing/commit/4bb26472131e8fe9d2230138b48ad5d2b9c93466): 
Remove modification of Regex exprs when generating railroad diags; also short...
   - Additional commits viewable in the compare view 
(https://github.com/pyparsing/pyparsing/compare/3.2.0...3.2.1)
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pyparsing&package-manager=pip&previous-version=3.2.0&new-version=3.2.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   



Re: [I] Count rows as a metadata-only operation [iceberg-python]

2024-12-31 Thread via GitHub



gli-chris-hao commented on issue #1223:
URL: https://github.com/apache/iceberg-python/issues/1223#issuecomment-2566622940

   We have the same use case and the same concern about loading too much data 
into memory just to count rows. My workaround is to use 
`DataScan.to_arrow_batch_reader` and count the rows by iterating over the 
batches, which should avoid memory issues for large data scans:
   ```
   count = 0
   for batch in datascan.to_arrow_batch_reader():
       count += batch.num_rows
   ```
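
The memory argument can be illustrated without PyIceberg or Arrow at all. The following is a minimal sketch, not the library's implementation: `FakeBatch` and `fake_batch_reader` are hypothetical stand-ins for an Arrow record batch and for the reader returned by `to_arrow_batch_reader`, and only the `num_rows` attribute is modelled.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeBatch:
    """Hypothetical stand-in for an Arrow record batch; only num_rows is modelled."""
    num_rows: int


def fake_batch_reader(sizes: Iterable[int]) -> Iterator[FakeBatch]:
    """Yield batches lazily, one at a time, as a batch reader would."""
    for size in sizes:
        yield FakeBatch(num_rows=size)


def count_rows(batches: Iterable[FakeBatch]) -> int:
    """Sum num_rows across batches without holding them all in memory."""
    return sum(batch.num_rows for batch in batches)


total = count_rows(fake_batch_reader([1000, 1000, 250]))
print(total)
```

Because the reader is a generator, each batch can be discarded as soon as its row count has been added, so peak memory is bounded by the batch size rather than by the size of the scan.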



Re: [PR] Count rows as a metadata only operation [iceberg-python]

2024-12-31 Thread via GitHub



gli-chris-hao commented on code in PR #1388:
URL: https://github.com/apache/iceberg-python/pull/1388#discussion_r1900228680


##
pyiceberg/table/__init__.py:
##
@@ -1594,6 +1609,29 @@ def to_ray(self) -> ray.data.dataset.Dataset:
 
         return ray.data.from_arrow(self.to_arrow())
 
+    def count(self) -> int:
+        """
+        Usage: calculates the total number of records in a Scan that haven't had positional deletes
+        """
+        res = 0
+        # every task is a FileScanTask
+        tasks = self.plan_files()
+
+        for task in tasks:
+            # task.residual is a Boolean Expression if the filter condition is fully satisfied by the
+            # partition value and task.delete_files represents that positional delete haven't been merged yet
+            # hence those files have to read as a pyarrow table applying the filter and deletes
+            if task.residual == AlwaysTrue() and not len(task.delete_files):
+                # Every File has a metadata stat that stores the file record count
+                res += task.file.record_count
+            else:
+                from pyiceberg.io.pyarrow import ArrowScan
+                tbl = ArrowScan(
+                    self.table_metadata, self.io, self.projection(), self.row_filter, self.case_sensitive, self.limit
+                ).to_table([task])
+                res += len(tbl)
+        return res

Review Comment:
   I love this approach! My only concern is loading too much data into memory 
at once. Although this loads just one file at a time, in the worst case a 
single file could be very large. Shall we define a threshold and check it? For 
example, if `file size < 512MB`, load the entire file; otherwise turn it into 
a `pa.RecordBatchReader` and stream record batches for counting.
   
   ```
   target_schema = schema_to_pyarrow(self.projection())
   
   batches = ArrowScan(
       self.table_metadata, self.io, self.projection(), self.row_filter, self.case_sensitive, self.limit
   ).to_record_batches([task])
   
   reader = pa.RecordBatchReader.from_batches(
       target_schema,
       batches,
   )
   
   count = 0
   for batch in reader:
       count += batch.num_rows
   return count
   ```
   
https://github.com/apache/iceberg-python/blob/e6465001bd8a47718ff79da4def5800962e6b895/pyiceberg/table/__init__.py#L1541-L1564
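
To make the proposed control flow concrete, here is a rough, dependency-free sketch of the three paths (metadata-only, load whole file, stream batches). Everything in it is an assumption made for illustration: `FakeTask`, `load_table`, `stream_batch_sizes`, and `SIZE_THRESHOLD_BYTES` are hypothetical names, the 512MB value is only the example figure mentioned above, and the real PR code would use `ArrowScan` instead of the simulated readers.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List

# Hypothetical cut-off; 512MB is only the example value floated above.
SIZE_THRESHOLD_BYTES = 512 * 1024 * 1024


@dataclass
class FakeTask:
    """Hypothetical stand-in for a FileScanTask plus its data file stats."""
    record_count: int        # row count stored in the file metadata
    file_size_in_bytes: int  # size of the data file
    matching_rows: int       # rows left after filter and deletes (simulated)
    residual_always_true: bool = True
    delete_files: List[str] = field(default_factory=list)


def load_table(task: FakeTask) -> List[int]:
    """Simulate reading the whole file into memory at once."""
    return list(range(task.matching_rows))


def stream_batch_sizes(task: FakeTask, batch_size: int = 1000) -> Iterator[int]:
    """Simulate a batch reader by yielding one batch's row count at a time."""
    remaining = task.matching_rows
    while remaining > 0:
        size = min(batch_size, remaining)
        yield size
        remaining -= size


def count(tasks: Iterable[FakeTask], threshold: int = SIZE_THRESHOLD_BYTES) -> int:
    total = 0
    for task in tasks:
        if task.residual_always_true and not task.delete_files:
            # Metadata-only path: trust the stored record count.
            total += task.record_count
        elif task.file_size_in_bytes < threshold:
            # Small file: materialise it fully and take its length.
            total += len(load_table(task))
        else:
            # Large file: stream batches so memory stays bounded.
            total += sum(stream_batch_sizes(task))
    return total


tasks = [
    FakeTask(record_count=100, file_size_in_bytes=10_000, matching_rows=100),
    FakeTask(record_count=90, file_size_in_bytes=10_000, matching_rows=40,
             residual_always_true=False),
    FakeTask(record_count=3000, file_size_in_bytes=2 * SIZE_THRESHOLD_BYTES,
             matching_rows=2500, delete_files=["deletes.parquet"]),
]
total = count(tasks)
print(total)
```

Whether the threshold branch is worth the extra code is a design question: always streaming would be simpler and would bound memory in every case, at the cost of some per-batch overhead on small files.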




Re: [PR] Count rows as a metadata only operation [iceberg-python]

2024-12-31 Thread via GitHub



gli-chris-hao commented on code in PR #1388:
URL: https://github.com/apache/iceberg-python/pull/1388#discussion_r1900228680


##
pyiceberg/table/__init__.py:
##
@@ -1594,6 +1609,29 @@ def to_ray(self) -> ray.data.dataset.Dataset:
 
         return ray.data.from_arrow(self.to_arrow())
 
+    def count(self) -> int:
+        """
+        Usage: calculates the total number of records in a Scan that haven't had positional deletes
+        """
+        res = 0
+        # every task is a FileScanTask
+        tasks = self.plan_files()
+
+        for task in tasks:
+            # task.residual is a Boolean Expression if the filter condition is fully satisfied by the
+            # partition value and task.delete_files represents that positional delete haven't been merged yet
+            # hence those files have to read as a pyarrow table applying the filter and deletes
+            if task.residual == AlwaysTrue() and not len(task.delete_files):
+                # Every File has a metadata stat that stores the file record count
+                res += task.file.record_count
+            else:
+                from pyiceberg.io.pyarrow import ArrowScan
+                tbl = ArrowScan(
+                    self.table_metadata, self.io, self.projection(), self.row_filter, self.case_sensitive, self.limit
+                ).to_table([task])
+                res += len(tbl)
+        return res

Review Comment:
   I love this approach! My only concern is loading too much data into memory 
at once. Although this loads just one file at a time, in the worst case a 
single file could be very large. Shall we define a threshold and check it? For 
example, if `file size < XXX`, load the entire file; otherwise turn it into 
a `pa.RecordBatchReader` and stream record batches for counting.
   
   ```
   target_schema = schema_to_pyarrow(self.projection())
   
   batches = ArrowScan(
       self.table_metadata, self.io, self.projection(), self.row_filter, self.case_sensitive, self.limit
   ).to_record_batches([task])
   
   reader = pa.RecordBatchReader.from_batches(
       target_schema,
       batches,
   )
   
   count = 0
   for batch in reader:
       count += batch.num_rows
   return count
   ```
   
https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L1541-L1564




Re: [I] Fields are out of order in equality delete files if equality fields are not together [iceberg]

2024-12-31 Thread via GitHub



singhpk234 commented on issue #11891:
URL: https://github.com/apache/iceberg/issues/11891#issuecomment-2566633478

   > But this equality delete file is out of order and this record can still be 
read in the Iceberg table
   
   The equality delete file was written with ptr as **111** instead of 
**202412130**. That seems to be the root cause, since the equality delete 
predicate will then not be applied.
   
   Q: 
   1/ Is Flink producing the equality delete? 
   2/ How was the table created? Is something like name-mapping coming into 
play that might interfere with the actual delete file write? 


