Re: [I] PyIceberg appending data creates snapshots incompatible with Athena/Spark [iceberg-python]

2024-12-16 Thread via GitHub


Fokko commented on issue #1424:
URL: https://github.com/apache/iceberg-python/issues/1424#issuecomment-2544900271

   I'm also not seeing how this could happen. To test this, I also ran this script:
   
   ```
   Python 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)]
   Type 'copyright', 'credits' or 'license' for more information
   IPython 8.26.0 -- An enhanced Interactive Python. Type '?' for help.

   In [1]: import uuid
      ...: 
      ...: while True:
      ...:     rnd_uuid = uuid.uuid4()
      ...:     snapshot_id = int.from_bytes(
      ...:         bytes(lhs ^ rhs for lhs, rhs in zip(rnd_uuid.bytes[0:8], rnd_uuid.bytes[8:16])),
      ...:         byteorder="little",
      ...:         signed=True,
      ...:     )
      ...:     snapshot_id = snapshot_id if snapshot_id >= 0 else snapshot_id * -1
      ...:     if snapshot_id > 9223372036854775807:
      ...:         print("Boom")
   ```
   
   I think we should add a check to ensure that the `snapshot_id` is in the range `[0, 9223372036854775807]`, and raise a `ValueError` otherwise. Would you be interested in contributing that, @Samreay? I think this should happen in the `SnapshotProducer`.
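
   For illustration, such a guard might look like this (a minimal sketch; the exact placement inside `SnapshotProducer` and the helper name are assumptions, not the actual implementation):

   ```python
   MAX_SNAPSHOT_ID = 9223372036854775807  # max signed 64-bit integer


   def _validated_snapshot_id(snapshot_id: int) -> int:
       # Hypothetical helper: reject IDs outside the range Java engines accept.
       if not 0 <= snapshot_id <= MAX_SNAPSHOT_ID:
           raise ValueError(f"Snapshot ID out of range: {snapshot_id}")
       return snapshot_id
   ```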


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Doc: Do Not Modify the Source Data Table During MergeIntoCommand Exec… [iceberg]

2024-12-16 Thread via GitHub


BsoBird commented on code in PR #11787:
URL: https://github.com/apache/iceberg/pull/11787#discussion_r1886334243


##
docs/docs/spark-writes.md:
##
@@ -101,6 +101,9 @@ Spark 3.5 added support for `WHEN NOT MATCHED BY SOURCE ... 
THEN ...` to update
 WHEN NOT MATCHED BY SOURCE THEN UPDATE SET status = 'invalid'
 ```
 
+!!! danger
+    Note: For copy-on-write tables, do not modify the source data table during execution. Because Spark needs to read the source table twice during the computation, the relation created from the source data must remain constant across the two passes. If the source data query returns different results between the passes, users will see odd behavior.

Review Comment:
   > Would it be possible to be more specific? What kind of odd behavior will users experience?
   
   @Fokko Hello, sir. The most common situation encountered in our production environment is data loss.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


Fokko commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1886483368


##
pyiceberg/table/__init__.py:
##
@@ -1423,6 +1451,66 @@ def plan_files(self) -> Iterable[FileScanTask]:
             for data_entry in data_entries
         ]
 
+    def _target_split_size(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties, TableProperties.READ_SPLIT_SIZE, TableProperties.READ_SPLIT_SIZE_DEFAULT
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_SIZE, table_value)  # type: ignore
+
+    def _loop_back(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties, TableProperties.READ_SPLIT_LOOKBACK, TableProperties.READ_SPLIT_LOOKBACK_DEFAULT
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_LOOKBACK, table_value)  # type: ignore
+
+    def _split_open_file_cost(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties,
+            TableProperties.READ_SPLIT_OPEN_FILE_COST,
+            TableProperties.READ_SPLIT_OPEN_FILE_COST_DEFAULT,
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_OPEN_FILE_COST, table_value)  # type: ignore
+
+    def plan_task(self) -> Iterable[CombinedFileScanTask]:

Review Comment:
   I have the same concern as @corleyma. For example, Java splits on row-groups to make full use of the parallelism of Spark, Trino, Hive, etc. On a local machine, it makes more sense to just plow through the file itself, preferably using a native reader like PyArrow or Iceberg-Rust in the future.
   
   There are a lot of details around this API; for example, a task might point to a row-group that doesn't contain any relevant information, and we don't know that until we open the file itself.
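
   To make the idea concrete, here is a toy illustration (an editorial sketch, not the PR's code) of how per-row-group offsets can be bin-packed into combined tasks of roughly a target size:

   ```python
   def combine(offsets_and_lengths, target_size):
       # Greedily group (offset, length) pairs until a bin reaches target_size.
       tasks, current, current_size = [], [], 0
       for offset, length in offsets_and_lengths:
           if current and current_size + length > target_size:
               tasks.append(current)
               current, current_size = [], 0
           current.append((offset, length))
           current_size += length
       if current:
           tasks.append(current)
       return tasks


   print(combine([(0, 64), (64, 64), (128, 64), (192, 64)], target_size=128))
   # [[(0, 64), (64, 64)], [(128, 64), (192, 64)]]
   ```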



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


Fokko commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1886487094


##
pyiceberg/table/__init__.py:
##
@@ -1423,6 +1451,66 @@ def plan_files(self) -> Iterable[FileScanTask]:
             for data_entry in data_entries
         ]
 
+    def _target_split_size(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties, TableProperties.READ_SPLIT_SIZE, TableProperties.READ_SPLIT_SIZE_DEFAULT
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_SIZE, table_value)  # type: ignore
+
+    def _loop_back(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties, TableProperties.READ_SPLIT_LOOKBACK, TableProperties.READ_SPLIT_LOOKBACK_DEFAULT
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_LOOKBACK, table_value)  # type: ignore
+
+    def _split_open_file_cost(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties,
+            TableProperties.READ_SPLIT_OPEN_FILE_COST,
+            TableProperties.READ_SPLIT_OPEN_FILE_COST_DEFAULT,
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_OPEN_FILE_COST, table_value)  # type: ignore
+
+    def plan_task(self) -> Iterable[CombinedFileScanTask]:

Review Comment:
   Sorry, I think we had a race condition. Thoughts @jaychia @samster25?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add license checker [iceberg-cpp]

2024-12-16 Thread via GitHub


Fokko commented on code in PR #10:
URL: https://github.com/apache/iceberg-cpp/pull/10#discussion_r1886489090


##
.github/workflows/license_check.yml:
##
@@ -0,0 +1,26 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: "Run License Check"

Review Comment:
   That's fancy, let me change to SkyWalking!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


ConeyLiu commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1886465895


##
tests/integration/test_reads.py:
##
@@ -873,3 +874,76 @@ def test_table_scan_empty_table(catalog: Catalog) -> None:
     result_table = tbl.scan().to_arrow()
 
     assert len(result_table) == 0
+
+
+@pytest.mark.integration
+def test_plan_tasks(session_catalog: Catalog) -> None:
+    from pyiceberg.table import TableProperties
+
+    table_name = "default.test_plan_tasks"
+    try:
+        session_catalog.drop_table(table_name)
+    except NoSuchTableError:
+        pass  # Just to make sure that the table doesn't exist
+
+    tbl = session_catalog.create_table(
+        table_name,
+        Schema(
+            NestedField(1, "number", LongType()),
+        ),
+        properties={TableProperties.PARQUET_ROW_GROUP_LIMIT: "1"},
+    )
+
+    # Write 10 row groups, that should end up as 10 batches
+    entries = 10
+    tbl.append(
+        pa.Table.from_pylist(
+            [
+                {
+                    "number": number,
+                }
+                for number in range(entries)
+            ],
+        )
+    )
+
+    assert len(tbl.inspect.files()) == 1
+
+    plan_files = list(tbl.scan().plan_files())
+    assert len(plan_files) == 1
+    data_file = plan_files[0].file
+    assert data_file.split_offsets is not None and len(data_file.split_offsets) == 10
+
+    plan_tasks = list(tbl.scan(options={TableProperties.READ_SPLIT_SIZE: 1}).plan_task())
+    assert len(plan_tasks) == 10
+
+    split_offsets = []
+    for task in plan_tasks:
+        assert len(task.tasks) == 1
+        split_offsets.append(task.tasks[0].start)
+
+    assert split_offsets == plan_files[0].file.split_offsets
+
+    split_sizes = []
+    for i in range(1, len(data_file.split_offsets)):
+        split_sizes.append(data_file.split_offsets[i] - data_file.split_offsets[i - 1])
+
+    split_sizes.append(data_file.file_size_in_bytes - data_file.split_offsets[-1])
+
+    read_split_size = int(data_file.file_size_in_bytes / 4)
+    read_split_open_file_cost = 1
+    read_split_lookback = 5
+
+    plan_tasks = list(

Review Comment:
   @kevinjqliu I added more tests here; please take another look and share any further advice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


ConeyLiu commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1886464171


##
pyiceberg/table/__init__.py:
##
@@ -1423,6 +1451,66 @@ def plan_files(self) -> Iterable[FileScanTask]:
             for data_entry in data_entries
         ]
 
+    def _target_split_size(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties, TableProperties.READ_SPLIT_SIZE, TableProperties.READ_SPLIT_SIZE_DEFAULT
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_SIZE, table_value)  # type: ignore
+
+    def _loop_back(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties, TableProperties.READ_SPLIT_LOOKBACK, TableProperties.READ_SPLIT_LOOKBACK_DEFAULT
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_LOOKBACK, table_value)  # type: ignore
+
+    def _split_open_file_cost(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties,
+            TableProperties.READ_SPLIT_OPEN_FILE_COST,
+            TableProperties.READ_SPLIT_OPEN_FILE_COST_DEFAULT,
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_OPEN_FILE_COST, table_value)  # type: ignore
+
+    def plan_task(self) -> Iterable[CombinedFileScanTask]:

Review Comment:
   > I assume it's intentional that we're not actually calling this in any of the methods (like to_arrow, etc) that actually execute a scan?
   
   Yes. I am not sure of the original intention behind reading the full file, so I kept that behavior unchanged.
   
   > If there is no plan to call this directly in pyiceberg, I wonder who the intended consumers of this API would be? I would expect most query engines -- distributed or otherwise -- to have their own notions for how to optimize scan planning.
   
   I recently ran into OOM errors when reading an Iceberg table with Ray in a distributed setting. Also, when planning with files, the parallelism is limited by the number of files (files are often rewritten into larger ones, e.g. 512 MB or 1 GB). As with Spark/Flink, I think `plan_tasks` should be useful for distributed reading.
   
   > In the case that this is primarily intended to be for the benefit of pyiceberg's own scan execution I would consider making this a private API.
   
   Distributed engines such as Ray/Daft would have to implement `plan_tasks` themselves if they want it; wouldn't it be better to put it into pyiceberg?
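
   As a rough editorial sketch of that use case (Ray fan-out over this PR's `plan_task()` API; the task body is deliberately abstract because per-task reading is engine-specific):

   ```python
   import ray


   @ray.remote
   def read_task(task):
       # An engine would open only the byte ranges covered by this combined task.
       return [t.start for t in task.tasks]


   def read_distributed(table):
       futures = [read_task.remote(task) for task in table.scan().plan_task()]
       return ray.get(futures)
   ```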



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


ConeyLiu commented on PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#issuecomment-2545047787

   Thanks @kevinjqliu and @corleyma for your review. Please take another look. Thanks a lot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add license checker [iceberg-cpp]

2024-12-16 Thread via GitHub


Fokko commented on code in PR #10:
URL: https://github.com/apache/iceberg-cpp/pull/10#discussion_r1886494679


##
.github/workflows/license_check.yml:
##
@@ -0,0 +1,26 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: "Run License Check"

Review Comment:
   Done 👍  Tnx @wgtmac and @raulcd 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Add RemovePartitionSpecs table update [iceberg-rust]

2024-12-16 Thread via GitHub


Fokko commented on code in PR #804:
URL: https://github.com/apache/iceberg-rust/pull/804#discussion_r1886516435


##
crates/iceberg/src/spec/table_metadata_builder.rs:
##
@@ -740,6 +740,32 @@ impl TableMetadataBuilder {
             .set_default_partition_spec(Self::LAST_ADDED)
     }
 
+    /// Remove partition specs by their ids from the table metadata.
+    /// Does nothing if a spec id is not present.
+    ///
+    /// If the default partition spec is removed, it is re-added
+    /// upon build.
+    pub fn remove_partition_specs(mut self, spec_ids: &[i32]) -> Self {
+        let mut removed_specs = Vec::with_capacity(spec_ids.len());
+
+        self.metadata.partition_specs.retain(|k, _| {
+            if spec_ids.contains(k) {

Review Comment:
   Why not turn the `spec_ids` into a `HashSet`? :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Generate release notes on website. [iceberg-rust]

2024-12-16 Thread via GitHub


Fokko commented on issue #593:
URL: https://github.com/apache/iceberg-rust/issues/593#issuecomment-2545128114

   I'm going to remove the milestone to unblock the 0.4.0 release. I'm happy to help here. In general, I think it would be great to add more to the website (getting started, etc.) since it is still pretty empty.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Kafka Connect Sporadic Commit Delay [iceberg]

2024-12-16 Thread via GitHub


trolle4 commented on issue #11796:
URL: https://github.com/apache/iceberg/issues/11796#issuecomment-2545131515

   Also added some debug logging inside the class 
`org.apache.iceberg.connect.channel.CommitterImpl`
   ```java
   @Override
   public void save(Collection<SinkRecord> sinkRecords) {
     if (sinkRecords != null && !sinkRecords.isEmpty()) {
       worker.save(sinkRecords);
       if (sinkRecords.iterator().next().topic().contains("iceberg")) {
         LOG.info(
             "Saving {} records. Last offset {}",
             sinkRecords.size(),
             sinkRecords.stream()
                 .reduce((first, second) -> second)
                 .map(SinkRecord::kafkaOffset)
                 .orElse(-1L));
       }
     }
     processControlEvents();
   }
   ```
   
   Which produced the following logs for the above problem:
   
   ```
   2024-12-16 06:40:15,390 INFO  Saving 1 records. Last offset 151629   
[org.apache.iceberg.connect.channel.CommitterImpl]
   2024-12-16 06:40:15,661 INFO  Saving 1 records. Last offset 151630   
[org.apache.iceberg.connect.channel.CommitterImpl]
   ``` 
   
   So the message is read from the source Kafka topic when it should be.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Store file io props to allow re-build it [iceberg-rust]

2024-12-16 Thread via GitHub


Xuanwo commented on code in PR #802:
URL: https://github.com/apache/iceberg-rust/pull/802#discussion_r1887055534


##
crates/iceberg/src/io/file_io.rs:
##
@@ -165,7 +175,7 @@ impl FileIOBuilder {
     /// Fetch the scheme string.
     ///
     /// The scheme_str will be empty if it's None.
-    pub(crate) fn into_parts(self) -> (String, HashMap<String, String>) {
+    pub fn into_parts(self) -> (String, HashMap<String, String>) {

Review Comment:
   > What's the goal of the `scheme_str`? If we want to have a specific impl, 
we could also pass that into the configuration itself. This is of course bound 
to a language, [for example 
`py-io-impl`](https://py.iceberg.apache.org/configuration/#fileio) and in Java 
it is just `io-impl`.
   
   Makes sense to me.
   
   I exposed the necessary elements as they are to build `FileIO` for now. I 
can address this in another PR that removes the requirement for `scheme` to 
build `FileIO`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Add RemovePartitionSpecs table update [iceberg-rust]

2024-12-16 Thread via GitHub


c-thiel commented on PR #804:
URL: https://github.com/apache/iceberg-rust/pull/804#issuecomment-2546189749

   @Fokko resolved merge conflicts. Could you re-approve?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Fix numeric overflow of timestamp nano literal [iceberg]

2024-12-16 Thread via GitHub


amogh-jahagirdar commented on code in PR #11775:
URL: https://github.com/apache/iceberg/pull/11775#discussion_r1887202656


##
api/src/main/java/org/apache/iceberg/expressions/Literals.java:
##
@@ -300,8 +300,7 @@ public <T> Literal<T> to(Type type) {
       case TIMESTAMP:
         return (Literal<T>) new TimestampLiteral(value());
       case TIMESTAMP_NANO:
-        // assume micros and convert to nanos to match the behavior in the timestamp case above
-        return new TimestampLiteral(value()).to(type);
+        return (Literal<T>) new TimestampNanoLiteral(value());

Review Comment:
   This change seems correct to me. The previous behavior was to assume the value was in microseconds and then pass it through to `TimestampLiteral`, but that can overflow and does not actually represent a nanosecond timestamp!
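
   A quick back-of-the-envelope check (Python, for illustration) of why the old micros-to-nanos path can overflow a signed 64-bit value:

   ```python
   MAX_I64 = 2**63 - 1  # 9223372036854775807

   micros = MAX_I64 // 1000 + 1  # a microsecond value that still fits in a long
   nanos = micros * 1000         # the old conversion multiplied by 1000
   assert nanos > MAX_I64        # ...which no longer fits in a Java long
   ```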



##
api/src/test/java/org/apache/iceberg/expressions/TestTimestampLiteralConversions.java:
##
@@ -107,6 +108,28 @@ public void testTimestampMicrosToDateConversion() {
 assertThat(dateOrdinal).isEqualTo(-1);
   }
 
+  @Test
+  public void testTimestampNanoWithLongLiteral() {

Review Comment:
   could we also add a case for `withZone`? 



##
api/src/test/java/org/apache/iceberg/types/TestConversions.java:
##
@@ -111,9 +111,9 @@ public void testByteBufferConversions() {
     assertConversion(
         4L, TimestampNanoType.withZone(), new byte[] {0, -124, -41, 23, 0, 0, 0, 0});
 
     assertThat(Literal.of(40L).to(TimestampNanoType.withoutZone()).toByteBuffer().array())
-        .isEqualTo(new byte[] {0, -124, -41, 23, 0, 0, 0, 0});
+        .isEqualTo(new byte[] {-128, 26, 6, 0, 0, 0, 0, 0});
 
     assertThat(Literal.of(40L).to(TimestampNanoType.withZone()).toByteBuffer().array())

Review Comment:
   I'm a bit confused how the original assertion was passing? Shouldn't have 
this always been equivalent to `{-128, 26, 6, 0, 0, 0, 0, 0}`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Auth Manager API part 1: HTTPRequest, HTTPHeader [iceberg]

2024-12-16 Thread via GitHub


danielcweeks commented on code in PR #11769:
URL: https://github.com/apache/iceberg/pull/11769#discussion_r1887271043


##
core/src/main/java/org/apache/iceberg/rest/HTTPHeaders.java:
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.immutables.value.Value;
+
+/**
+ * Represents a group of HTTP headers.
+ *
+ * <p>Header name comparison in this class is always case-insensitive, in accordance with RFC 2616.
+ *
+ * <p>This class exposes methods to convert to and from different representations such as maps and
+ * multimap, for easier access and manipulation – especially when dealing with multiple headers with
+ * the same name.
+ */
+@Value.Style(depluralize = true)
+@Value.Immutable
+@SuppressWarnings({"ImmutablesStyle", "SafeLoggingPropagation"})
+public interface HTTPHeaders {
+
+  HTTPHeaders EMPTY = of();
+
+  /** Returns all the header entries in this group. */
+  List<HTTPHeader> entries();

Review Comment:
   Nit: shouldn't this be `Set`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Auth Manager API part 1: HTTPRequest, HTTPHeader [iceberg]

2024-12-16 Thread via GitHub


danielcweeks commented on code in PR #11769:
URL: https://github.com/apache/iceberg/pull/11769#discussion_r1887271278


##
core/src/main/java/org/apache/iceberg/rest/HTTPHeaders.java:
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.immutables.value.Value;
+
+/**
+ * Represents a group of HTTP headers.
+ *
+ * <p>Header name comparison in this class is always case-insensitive, in accordance with RFC 2616.
+ *
+ * <p>This class exposes methods to convert to and from different representations such as maps and
+ * multimap, for easier access and manipulation – especially when dealing with multiple headers with
+ * the same name.
+ */
+@Value.Style(depluralize = true)
+@Value.Immutable
+@SuppressWarnings({"ImmutablesStyle", "SafeLoggingPropagation"})
+public interface HTTPHeaders {
+
+  HTTPHeaders EMPTY = of();
+
+  /** Returns all the header entries in this group. */
+  List<HTTPHeader> entries();
+
+  /** Returns all the entries in this group for the given name (case-insensitive). */
+  default List<HTTPHeader> entries(String name) {

Review Comment:
   Same here, set



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Add Variant implementation to read serialized objects [iceberg]

2024-12-16 Thread via GitHub


rdblue commented on code in PR #11415:
URL: https://github.com/apache/iceberg/pull/11415#discussion_r1887270705


##
core/src/main/java/org/apache/iceberg/variants/Variants.java:
##
@@ -0,0 +1,276 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.variants;
+
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.util.DateTimeUtil;
+
+public class Variants {
+  private Variants() {}
+
+  enum LogicalType {

Review Comment:
   I've left these package-private for now. That gives us the flexibility to 
remove them later if needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Bump `pyiceberg_core` to 0.4.0 [iceberg-rust]

2024-12-16 Thread via GitHub


Xuanwo merged PR #808:
URL: https://github.com/apache/iceberg-rust/pull/808


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implementing namespace_exists function on the REST Catalog [iceberg-python]

2024-12-16 Thread via GitHub


kevinjqliu commented on code in PR #1434:
URL: https://github.com/apache/iceberg-python/pull/1434#discussion_r1887152140


##
tests/catalog/test_rest.py:
##
@@ -681,6 +681,51 @@ def test_update_namespace_properties_200(rest_mock: Mocker) -> None:
     assert response == PropertiesUpdateSummary(removed=[], updated=["prop"], missing=["abc"])
 
 
+def test_namespace_exists_200(rest_mock: Mocker) -> None:
+    rest_mock.head(
+        f"{TEST_URI}v1/namespaces/fokko",
+        status_code=200,
+        request_headers=TEST_HEADERS,
+    )
+    catalog = RestCatalog("rest", uri=TEST_URI, token=TEST_TOKEN)
+
+    assert catalog.namespace_exists("fokko")

Review Comment:
   nit: the tests here only verify against the mock response.
   It would be great to also test against an actual REST catalog in the integration tests.
   
   I noticed that such a test does not exist right now; perhaps we can start a `tests/integration/test_rest_catalog.py`.
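
   For instance, a possible starting point (an illustrative sketch reusing the existing `session_catalog` REST fixture; the namespace names are made up):

   ```python
   import pytest

   from pyiceberg.catalog import Catalog
   from pyiceberg.exceptions import NamespaceAlreadyExistsError


   @pytest.mark.integration
   def test_namespace_exists(session_catalog: Catalog) -> None:
       try:
           session_catalog.create_namespace("fokko")
       except NamespaceAlreadyExistsError:
           pass
       assert session_catalog.namespace_exists("fokko")
       assert not session_catalog.namespace_exists("nonexistent_namespace")
   ```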



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Tracking issues of iceberg rust v0.4.0 Release [iceberg-rust]

2024-12-16 Thread via GitHub


Xuanwo commented on issue #739:
URL: https://github.com/apache/iceberg-rust/issues/739#issuecomment-2546101308

   > Could someone help review this PR so we can get the versions synced to 0.4.0? #808
   
   Merged. Let's move!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deserialize NestedField initial-default and write-default Attributes [iceberg-python]

2024-12-16 Thread via GitHub


kevinjqliu commented on code in PR #1432:
URL: https://github.com/apache/iceberg-python/pull/1432#discussion_r1887143738


##
pyiceberg/types.py:
##
@@ -328,8 +328,8 @@ def __init__(
 data["type"] = data["type"] if "type" in data else field_type
 data["required"] = required
 data["doc"] = doc

Review Comment:
   I went down a rabbit hole; here's what I learned:
   1. Using `__init__` with a pydantic model is kind of an anti-pattern; the pydantic model typically handles all the initialization/validation.
   2. `__init__` here is to provide the ability to initialize `NestedField` with positional args, i.e. `NestedField(1, "blah")`.
   
   For backward compatibility, we can't change `NestedField` to not take positional args.
   But we can make this `__init__` (and other `__init__`s) more resilient to bugs with something like
   ```
   def __init__(self, *args, **kwargs):
       # implements `__init__` to support initialization with positional arguments
       if args:
           field_names = list(self.model_fields.keys())  # gets all field names defined on the model
           if len(args) > len(field_names):
               raise TypeError(f"Too many positional arguments. Expected at most {len(field_names)}")
           kwargs.update(dict(zip(field_names, args)))  # maps positional args to field names
       # Let Pydantic handle aliases and validation
       super().__init__(**kwargs)
   ```
   
   This doesn't guard against mixing positional and keyword args for the same field, e.g. `NestedField(1, field_id=10)`.
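
   For example, after such a change these two spellings should construct the same field (an illustrative check, assuming `NestedField`'s usual field order):

   ```python
   from pyiceberg.types import NestedField, StringType

   assert NestedField(1, "name", StringType()) == NestedField(
       field_id=1, name="name", field_type=StringType()
   )
   ```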



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] fix: gurantee the deserialize order of struct is same as the struct type [iceberg-rust]

2024-12-16 Thread via GitHub


Xuanwo commented on code in PR #795:
URL: https://github.com/apache/iceberg-rust/pull/795#discussion_r1887090289


##
crates/iceberg/src/spec/values.rs:
##
@@ -3604,4 +3608,29 @@ mod tests {
 
         assert_eq!(result, expected);
     }
+
+    #[test]
+    fn test_record_ser_de() {
+        let expected_literal = Literal::Struct(Struct::from_iter(vec![
+            Some(Literal::Primitive(PrimitiveLiteral::Int(1))),
+            Some(Literal::Primitive(PrimitiveLiteral::String(
+                "bar".to_string(),
+            ))),
+            None,
+            Some(Literal::Primitive(PrimitiveLiteral::Int(1000))),
+        ]));
+        let fields = Type::Struct(StructType::new(vec![
+            NestedField::required(2, "id", Type::Primitive(PrimitiveType::Int)).into(),
+            NestedField::optional(3, "name", Type::Primitive(PrimitiveType::String)).into(),
+            NestedField::optional(4, "address", Type::Primitive(PrimitiveType::String)).into(),
+            NestedField::required(5, "extra", Type::Primitive(PrimitiveType::Int)).into(),
+        ]));
+
+        let raw_literal = RawLiteral::try_from(expected_literal.clone(), &fields).unwrap();
+        let serialized = serde_json::to_string(&raw_literal).unwrap();
+        let deserialized: RawLiteral = serde_json::from_str(&serialized).unwrap();
+        let deserialized_literal = deserialized.try_into(&fields).unwrap().unwrap();
+
+        assert_eq!(expected_literal, deserialized_literal);

Review Comment:
   Marking as unresolved for my newly added comment: https://github.com/apache/iceberg-rust/pull/795#discussion_r1887077718
   
   Sorry @Fokko :smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] fix: gurantee the deserialize order of struct is same as the struct type [iceberg-rust]

2024-12-16 Thread via GitHub


Fokko commented on code in PR #795:
URL: https://github.com/apache/iceberg-rust/pull/795#discussion_r1887083904


##
crates/iceberg/src/spec/values.rs:
##
@@ -3604,4 +3608,29 @@ mod tests {
 
         assert_eq!(result, expected);
     }
+
+    #[test]
+    fn test_record_ser_de() {
+        let expected_literal = Literal::Struct(Struct::from_iter(vec![
+            Some(Literal::Primitive(PrimitiveLiteral::Int(1))),
+            Some(Literal::Primitive(PrimitiveLiteral::String(
+                "bar".to_string(),
+            ))),
+            None,
+            Some(Literal::Primitive(PrimitiveLiteral::Int(1000))),
+        ]));
+        let fields = Type::Struct(StructType::new(vec![
+            NestedField::required(2, "id", Type::Primitive(PrimitiveType::Int)).into(),
+            NestedField::optional(3, "name", Type::Primitive(PrimitiveType::String)).into(),
+            NestedField::optional(4, "address", Type::Primitive(PrimitiveType::String)).into(),
+            NestedField::required(5, "extra", Type::Primitive(PrimitiveType::Int)).into(),
+        ]));
+
+        let raw_literal = RawLiteral::try_from(expected_literal.clone(), &fields).unwrap();
+        let serialized = serde_json::to_string(&raw_literal).unwrap();
+        let deserialized: RawLiteral = serde_json::from_str(&serialized).unwrap();
+        let deserialized_literal = deserialized.try_into(&fields).unwrap().unwrap();
+
+        assert_eq!(expected_literal, deserialized_literal);

Review Comment:
   Ah, missed that. Let me unblock this for now 👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Failed to read iceberg TPCH generated by snowflake [iceberg-rust]

2024-12-16 Thread via GitHub


Xuanwo commented on issue #790:
URL: https://github.com/apache/iceberg-rust/issues/790#issuecomment-2546018588

   > @Xuanwo are you running into anything else? Curious to learn if it works :)
   
   Hi, it works great! Thank you for the fix (just in time, as I needed it!). Let's close.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Failed to read iceberg TPCH generated by snowflake [iceberg-rust]

2024-12-16 Thread via GitHub


Xuanwo closed issue #790: Failed to read iceberg TPCH generated by snowflake
URL: https://github.com/apache/iceberg-rust/issues/790


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Bump `pyiceberg_core` to 0.4.0 [iceberg-rust]

2024-12-16 Thread via GitHub


sungwy opened a new pull request, #808:
URL: https://github.com/apache/iceberg-rust/pull/808

   We are releasing the bindings simultaneously with the core rust crates.
   
   Bump the version to be consistent with the core rust crates for simplicity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Add Variant implementation to read serialized objects [iceberg]

2024-12-16 Thread via GitHub


rdblue commented on code in PR #11415:
URL: https://github.com/apache/iceberg/pull/11415#discussion_r1887268226


##
core/src/main/java/org/apache/iceberg/variants/PrimitiveWrapper.java:
##
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.variants;
+
+import java.math.BigDecimal;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.charset.StandardCharsets;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.variants.Variants.Primitives;
+
+class PrimitiveWrapper<T> implements VariantPrimitive<T> {
+  private static final byte NULL_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_NULL);
+  private static final byte TRUE_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_TRUE);
+  private static final byte FALSE_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_FALSE);
+  private static final byte INT8_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_INT8);
+  private static final byte INT16_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_INT16);
+  private static final byte INT32_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_INT32);
+  private static final byte INT64_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_INT64);
+  private static final byte FLOAT_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_FLOAT);
+  private static final byte DOUBLE_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_DOUBLE);
+  private static final byte DATE_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_DATE);
+  private static final byte TIMESTAMPTZ_HEADER =
+      VariantUtil.primitiveHeader(Primitives.TYPE_TIMESTAMPTZ);
+  private static final byte TIMESTAMPNTZ_HEADER =
+      VariantUtil.primitiveHeader(Primitives.TYPE_TIMESTAMPNTZ);
+  private static final byte DECIMAL4_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_DECIMAL4);
+  private static final byte DECIMAL8_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_DECIMAL8);
+  private static final byte DECIMAL16_HEADER =
+      VariantUtil.primitiveHeader(Primitives.TYPE_DECIMAL16);
+  private static final byte BINARY_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_BINARY);
+  private static final byte STRING_HEADER = VariantUtil.primitiveHeader(Primitives.TYPE_STRING);
+
+  private final Variants.PhysicalType type;
+  private final T value;
+  private ByteBuffer buffer = null;
+
+  PrimitiveWrapper(Variants.PhysicalType type, T value) {
+    this.type = type;
+    this.value = value;
+  }
+
+  @Override
+  public Variants.PhysicalType type() {
+    return type;
+  }
+
+  @Override
+  public T get() {
+    return value;
+  }
+
+  @Override
+  public int sizeInBytes() {
+    switch (type()) {
+      case NULL:
+      case BOOLEAN_TRUE:
+      case BOOLEAN_FALSE:
+        return 1; // 1 header only
+      case INT8:
+        return 2; // 1 header + 1 value
+      case INT16:
+        return 3; // 1 header + 2 value
+      case INT32:
+      case DATE:
+      case FLOAT:
+        return 5; // 1 header + 4 value
+      case INT64:
+      case DOUBLE:
+      case TIMESTAMPTZ:
+      case TIMESTAMPNTZ:
+        return 9; // 1 header + 8 value
+      case DECIMAL4:
+        return 6; // 1 header + 1 scale + 4 unscaled value
+      case DECIMAL8:
+        return 10; // 1 header + 1 scale + 8 unscaled value
+      case DECIMAL16:
+        return 18; // 1 header + 1 scale + 16 unscaled value
+      case BINARY:
+        return 5 + ((ByteBuffer) value).remaining(); // 1 header + 4 length + value length

Review Comment:
   @aihuaxu, if you modify the buffer then it will change the value. What 
you're suggesting is equivalent to other modifications, like this:
   
   ```java
   byte[] arr = new byte[] { 0x0A, 0x0B, 0x0C, 0x0D };
   VariantPrimitive<ByteBuffer> primitive = Variants.of(ByteBuffer.wrap(arr));
   // modify the primitive value
   arr[0] = 0;
   
   ...
   ```
   
   If you modify the value then it is going to be different. We could duplicate 
the incoming buffer to avoid the case where `position` is modified, but not the 
case where the backing byte array is modified. That would be okay, but I don't 
see a lot of value in w

Re: [PR] Spark 3.4: Add REST catalog to Spark integration tests [iceberg]

2024-12-16 Thread via GitHub


danielcweeks merged PR #11698:
URL: https://github.com/apache/iceberg/pull/11698


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Feat: support aliyun oss backend. [iceberg-go]

2024-12-16 Thread via GitHub


zeroshade commented on PR #216:
URL: https://github.com/apache/iceberg-go/pull/216#issuecomment-2546599302

   The Java Iceberg impl has some mocking and test setups for Aliyun, as seen [here](https://github.com/apache/iceberg/tree/main/aliyun/src/test/java/org/apache/iceberg/aliyun/oss). Would it make sense to create a similar utility in Go, or is it not worth the effort there?
   
   Is there *any* sort of testing we could add here for this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Prep 0.4.0 release [iceberg-rust]

2024-12-16 Thread via GitHub


kevinjqliu commented on code in PR #809:
URL: https://github.com/apache/iceberg-rust/pull/809#discussion_r1887538305


##
crates/iceberg/src/spec/snapshot.rs:
##
@@ -192,13 +191,6 @@ impl Snapshot {
             partition_type_provider,
         )
     }
-
-    pub(crate) fn log(&self) -> SnapshotLog {

Review Comment:
   Is this intentional?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Prep 0.4.0 release [iceberg-rust]

2024-12-16 Thread via GitHub


kevinjqliu commented on code in PR #809:
URL: https://github.com/apache/iceberg-rust/pull/809#discussion_r1887539719


##
crates/iceberg/src/spec/table_metadata.rs:
##
@@ -398,33 +397,6 @@ impl TableMetadata {
         self.partition_statistics.get(&snapshot_id)
     }
 
-    /// Append snapshot to table
-    #[deprecated(
-        since = "0.4.0",
-        note = "please use `TableMetadataBuilder.set_branch_snapshot` instead"
-    )]

Review Comment:
   Is this deprecated since `0.4.0`? If that's the case, we shouldn't remove it right now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Docs: add note for `day` transform [iceberg]

2024-12-16 Thread via GitHub


kevinjqliu commented on code in PR #11749:
URL: https://github.com/apache/iceberg/pull/11749#discussion_r1887580518


##
format/spec.md:
##
@@ -454,7 +454,7 @@ Partition field IDs must be reused if an existing partition 
spec contains an equ
 | **`truncate[W]`** | Value truncated to width `W` (see below) 
| `int`, `long`, `decimal`, `string`, `binary`  
| Source type |
 | **`year`**| Extract a date or timestamp year, as years from 1970 
| `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`  
| `int`   |
 | **`month`**   | Extract a date or timestamp month, as months from 
1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, 
`timestamptz_ns`  | `int`   |
-| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 
| `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`  
| `int`   |
+| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 
| `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`  
| `int` (the physical type should be an `int`, 
but the logical type should be a `date`) |

Review Comment:
   I also encountered this before in 
[apache/iceberg-python#1208](https://github.com/apache/iceberg-python/pull/1208)
 and 
[apache/iceberg-python#1211](https://github.com/apache/iceberg-python/pull/1211).
 There's also this accompanying devlist thread 
https://lists.apache.org/thread/2gq7b54nvc9q6f1j08l9lnzgm5onkmx5 






Re: [PR] Implementing namespace_exists function on the REST Catalog [iceberg-python]

2024-12-16 Thread via GitHub


sungwy commented on PR #1434:
URL: https://github.com/apache/iceberg-python/pull/1434#issuecomment-2546756342

   Hi @AhmedNader42 - thank you very much for picking up this issue and getting 
a working solution up already!
   
   I'm in agreement with @kevinjqliu 's comment, that it would be great to have 
an integration test as well, and I think it is well within the scope of this PR.
   
   
https://github.com/apache/iceberg-python/blob/main/tests/integration/test_reads.py
 already uses `session_catalog` in its tests, which is a rest catalog fixture 
defined in the `conftest.py`, and I think it would be a good place to add this 
test case.
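
   For reference, a minimal sketch of what such a test could look like 
(illustrative only: `namespace_exists` is the method this PR adds, and 
`session_catalog` is the REST catalog fixture from `conftest.py`):
   
   ```
   import pytest
   
   from pyiceberg.catalog import Catalog
   from pyiceberg.exceptions import NamespaceAlreadyExistsError
   
   
   @pytest.mark.integration
   def test_namespace_exists(session_catalog: Catalog) -> None:
       # Make sure the namespace is present, then probe both outcomes.
       try:
           session_catalog.create_namespace("test_namespace_exists")
       except NamespaceAlreadyExistsError:
           pass
       assert session_catalog.namespace_exists("test_namespace_exists")
       assert not session_catalog.namespace_exists("namespace_that_does_not_exist")
   ```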





[PR] Fix comment on `WRITE_OBJECT_STORE_PARTITIONED_PATHS` table property [iceberg]

2024-12-16 Thread via GitHub


smaheshwar-pltr opened a new pull request, #11798:
URL: https://github.com/apache/iceberg/pull/11798

   The code comment above the `WRITE_OBJECT_STORE_PARTITIONED_PATHS` constant 
in `TableProperties` was incorrect: partition values are excluded when this 
property is set to *false* (not true) and object store is enabled, as stated in 
the 
[docs](https://iceberg.apache.org/docs/1.7.0/aws/#object-store-file-layout). 
This PR fixes the comment to be consistent with that behavior.
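
   To make the corrected behavior concrete, a small illustrative model of the 
path logic (an assumption-laden sketch, not the actual `LocationProviders` 
code; the hash prefix is faked with a UUID):
   
   ```
   import uuid
   
   
   def new_data_path(table_location: str, partition_path: str, filename: str,
                     object_storage: bool, partitioned_paths: bool) -> str:
       # Partition values stay in the path unless object storage is enabled
       # AND write.object-storage.partitioned-paths is false.
       if object_storage:
           prefix = uuid.uuid4().hex[:8]  # stand-in for the computed hash prefix
           if partitioned_paths:
               return f"{table_location}/data/{prefix}/{partition_path}/{filename}"
           return f"{table_location}/data/{prefix}/{filename}"
       return f"{table_location}/data/{partition_path}/{filename}"
   ```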





Re: [PR] Avro: Support default values for generic data [iceberg]

2024-12-16 Thread via GitHub


Fokko commented on code in PR #11786:
URL: https://github.com/apache/iceberg/pull/11786#discussion_r1887585686


##
core/src/main/java/org/apache/iceberg/data/avro/PlannedDataReader.java:
##
@@ -0,0 +1,181 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.data.avro;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Supplier;
+import org.apache.avro.LogicalType;
+import org.apache.avro.LogicalTypes;
+import org.apache.avro.Schema;
+import org.apache.avro.io.DatumReader;
+import org.apache.avro.io.Decoder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.avro.AvroWithPartnerVisitor;
+import org.apache.iceberg.avro.SupportsRowPosition;
+import org.apache.iceberg.avro.ValueReader;
+import org.apache.iceberg.avro.ValueReaders;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.Pair;
+
+public class PlannedDataReader<T> implements DatumReader<T>, SupportsRowPosition {
+
+  public static <T> PlannedDataReader<T> create(org.apache.iceberg.Schema expectedSchema) {
+    return create(expectedSchema, ImmutableMap.of());
+  }
+
+  public static <T> PlannedDataReader<T> create(
+      org.apache.iceberg.Schema expectedSchema, Map<Integer, ?> idToConstant) {
+    return new PlannedDataReader<>(expectedSchema, idToConstant);
+  }
+
+  private final org.apache.iceberg.Schema expectedSchema;
+  private final Map<Integer, ?> idToConstant;
+  private ValueReader<T> reader;
+
+  protected PlannedDataReader(
+      org.apache.iceberg.Schema expectedSchema, Map<Integer, ?> idToConstant) {
+    this.expectedSchema = expectedSchema;
+    this.idToConstant = idToConstant;
+  }
+
+  @Override
+  @SuppressWarnings("unchecked")
+  public void setSchema(Schema fileSchema) {
+    this.reader =
+        (ValueReader<T>)
+            AvroWithPartnerVisitor.visit(
+                expectedSchema.asStruct(),
+                fileSchema,
+                new ReadBuilder(idToConstant),
+                AvroWithPartnerVisitor.FieldIDAccessors.get());
+  }
+
+  @Override
+  public T read(T reuse, Decoder decoder) throws IOException {
+    return reader.read(decoder, reuse);
+  }
+
+  @Override
+  public void setRowPositionSupplier(Supplier<Long> posSupplier) {
+    if (reader instanceof SupportsRowPosition) {
+      ((SupportsRowPosition) reader).setRowPositionSupplier(posSupplier);
+    }
+  }
+
+  private static class ReadBuilder extends AvroWithPartnerVisitor<Type, ValueReader<?>> {
+    private final Map<Integer, ?> idToConstant;
+
+    private ReadBuilder(Map<Integer, ?> idToConstant) {
+      this.idToConstant = idToConstant;
+    }
+
+    @Override
+    public ValueReader<?> record(Type partner, Schema record, List<ValueReader<?>> fieldReaders) {
+      if (partner == null) {
+        return ValueReaders.skipStruct(fieldReaders);
+      }
+
+      Types.StructType expected = partner.asStructType();
+      List<Pair<Integer, ValueReader<?>>> readPlan =
+          ValueReaders.buildReadPlan(expected, record, fieldReaders, idToConstant);
+
+      return GenericReaders.struct(readPlan, expected);
+    }
+
+    @Override
+    public ValueReader<?> union(Type partner, Schema union, List<ValueReader<?>> options) {
+      return ValueReaders.union(options);
+    }
+
+    @Override
+    public ValueReader<?> array(Type ignored, Schema array, ValueReader<?> elementReader) {
+      return ValueReaders.array(elementReader);
+    }
+
+    @Override
+    public ValueReader<?> arrayMap(
+        Type ignored, Schema map, ValueReader<?> keyReader, ValueReader<?> valueReader) {
+      return ValueReaders.arrayMap(keyReader, valueReader);
+    }
+
+    @Override
+    public ValueReader<?> map(Type ignored, Schema map, ValueReader<?> valueReader) {
+      return ValueReaders.map(ValueReaders.strings(), valueReader);
+    }
+
+    @Override
+    public ValueReader<?> primitive(Type ignored, Schema primitive) {
+      LogicalType logicalType = primitive.getLogicalType();
+      if (logicalType != null) {
+        switch (logicalType.getName()) {
+          case "date":
+            return GenericReaders.dates();
+
+          case "time-micros":
+            return GenericReaders.times();
+
+          case "timestamp-micros

Re: [PR] Spec: Support geo type [iceberg]

2024-12-16 Thread via GitHub


jiayuasu commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1887549063


##
format/spec.md:
##
@@ -584,8 +589,8 @@ The schema of a manifest file is a struct called 
`manifest_entry` with the follo
 | _optional_ | _optional_ | _optional_ | **`110  null_value_counts`**  | 
`map<121: int, 122: long>`  | 
Map from column id to number of null values in the column   

   |
 | _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`**   | 
`map<138: int, 139: long>`  | 
Map from column id to number of NaN values in the column

   |
 | _optional_ | _optional_ | _optional_ | **`111  distinct_counts`**| 
`map<123: int, 124: long>`  | 
Map from column id to number of distinct values in the column; distinct counts 
must be derived using values in the file by counting or using sketches, but not 
using methods like merging existing distinct counts |
-| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**   | 
`map<126: int, 127: binary>`| 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2] |
-| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**   | 
`map<129: int, 130: binary>`| 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]  |
+| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**   | 
`map<126: int, 127: binary>`| 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2]. See [7] for`geometry` and [8] for `geography`.  |
+| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**   | 
`map<129: int, 130: binary>`| 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]. See [9] for `geometry` and [10] for `geography`. |

Review Comment:
   @szehon-ho 
   
   I found it from my PR: https://github.com/szehon-ho/iceberg/pull/2/files
   
   Can you figure out a way to add it back? 
   
   
   > The concepts of westernmost and easternmost values are explicitly 
introduced to address cases involving antimeridian crossing, where the 
`lower_bound` may be greater than `upper_bound`.
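
   For illustration, a minimal sketch of longitude containment under that 
convention (assuming degrees in [-180, 180], with the westernmost value stored 
as the lower bound):
   
   ```
   def lon_in_bounds(lon: float, lower: float, upper: float) -> bool:
       # Ordinary interval when lower <= upper; wraparound across the
       # antimeridian when lower > upper.
       if lower <= upper:
           return lower <= lon <= upper
       return lon >= lower or lon <= upper
   
   
   # A box from 170 to -170 crosses the antimeridian:
   assert lon_in_bounds(175.0, 170.0, -170.0)
   assert not lon_in_bounds(0.0, 170.0, -170.0)
   ```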
   






Re: [PR] fix: gurantee the deserialize order of struct is same as the struct type [iceberg-rust]

2024-12-16 Thread via GitHub


Fokko commented on code in PR #795:
URL: https://github.com/apache/iceberg-rust/pull/795#discussion_r1887570509


##
crates/iceberg/src/spec/values.rs:
##
@@ -3604,4 +3608,29 @@ mod tests {
 
 assert_eq!(result, expected);
 }
+
+#[test]
+fn test_record_ser_de() {
+let expected_literal = Literal::Struct(Struct::from_iter(vec![
+Some(Literal::Primitive(PrimitiveLiteral::Int(1))),
+Some(Literal::Primitive(PrimitiveLiteral::String(
+"bar".to_string(),
+))),
+None,
+Some(Literal::Primitive(PrimitiveLiteral::Int(1000))),
+]));
+let fields = Type::Struct(StructType::new(vec![
+NestedField::required(2, "id", 
Type::Primitive(PrimitiveType::Int)).into(),
+NestedField::optional(3, "name", 
Type::Primitive(PrimitiveType::String)).into(),
+NestedField::optional(4, "address", 
Type::Primitive(PrimitiveType::String)).into(),
+NestedField::required(5, "extra", 
Type::Primitive(PrimitiveType::Int)).into(),
+]));
+
+let raw_literal = RawLiteral::try_from(expected_literal.clone(), 
&fields).unwrap();
+let serialized = serde_json::to_string(&raw_literal).unwrap();
+let deserialized: RawLiteral = 
serde_json::from_str(&serialized).unwrap();
+let deserialized_literal = 
deserialized.try_into(&fields).unwrap().unwrap();
+
+assert_eq!(expected_literal, deserialized_literal);

Review Comment:
   Ah, around the same time! I share your concern, and I would like to check 
later on if we do the field-ID projection properly, but I didn't want to block 
the release 👍 
   
   For Avro it is very simple, it will always be decoded in the same order as 
the schema (otherwise it will just break). That said, we can rely on the order 
for V1, but use field-ID-based projection for V2 tables.
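
   As a toy sketch of the difference (illustrative only, not the iceberg-rust 
code): positional decoding trusts the arrival order, while field-ID projection 
resolves each expected field by its id.
   
   ```
   def project_by_field_id(pairs: list[tuple[int, object]],
                           expected_ids: list[int]) -> list[object]:
       # Resolve fields by id rather than position, so reordered columns
       # still land in the right slot.
       by_id = dict(pairs)
       return [by_id.get(field_id) for field_id in expected_ids]
   
   
   # Values arriving out of schema order are still projected correctly:
   assert project_by_field_id([(3, "bar"), (2, 42)], [2, 3]) == [42, "bar"]
   ```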






Re: [PR] Parquet: Implement defaults for generic data [iceberg]

2024-12-16 Thread via GitHub


rdblue merged PR #11785:
URL: https://github.com/apache/iceberg/pull/11785





Re: [PR] Parquet: Implement defaults for generic data [iceberg]

2024-12-16 Thread via GitHub


rdblue commented on PR #11785:
URL: https://github.com/apache/iceberg/pull/11785#issuecomment-2546711840

   Merging this. Thanks for the review, @Fokko!





Re: [PR] Avro: Support default values for generic data [iceberg]

2024-12-16 Thread via GitHub


rdblue commented on code in PR #11786:
URL: https://github.com/apache/iceberg/pull/11786#discussion_r1887573521


##
core/src/main/java/org/apache/iceberg/data/avro/DataReader.java:
##
@@ -36,6 +36,10 @@
 import org.apache.iceberg.types.Type;
 import org.apache.iceberg.types.Types;
 
+/**
+ * @deprecated will be removed in 2.0.0; use {@link PlannedDataReader} instead.
+ */
+@Deprecated

Review Comment:
   Most of the changed files in this PR are to avoid using this class now that 
it is deprecated.






Re: [PR] Prep 0.4.0 release [iceberg-rust]

2024-12-16 Thread via GitHub


sungwy commented on code in PR #809:
URL: https://github.com/apache/iceberg-rust/pull/809#discussion_r1887569858


##
crates/iceberg/src/spec/table_metadata.rs:
##
@@ -398,33 +397,6 @@ impl TableMetadata {
 self.partition_statistics.get(&snapshot_id)
 }
 
-/// Append snapshot to table
-#[deprecated(
-since = "0.4.0",
-note = "please use `TableMetadataBuilder.set_branch_snapshot` instead"
-)]

Review Comment:
   Yes, my bad. I read this as having to remove it by `0.4.0`; reverting.



##
crates/iceberg/src/spec/snapshot.rs:
##
@@ -192,13 +191,6 @@ impl Snapshot {
 partition_type_provider,
 )
 }
-
-pub(crate) fn log(&self) -> SnapshotLog {

Review Comment:
   Yes, it's because the deprecated function `append_snapshot` is the only 
function that calls it. Since we are reverting that change, I will revert this 
as well.






Re: [PR] Spec: Support geo type [iceberg]

2024-12-16 Thread via GitHub


szehon-ho commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1887625905


##
format/spec.md:
##
@@ -584,8 +589,8 @@ The schema of a manifest file is a struct called 
`manifest_entry` with the follo
 | _optional_ | _optional_ | _optional_ | **`110  null_value_counts`**  | 
`map<121: int, 122: long>`  | 
Map from column id to number of null values in the column   

   |
 | _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`**   | 
`map<138: int, 139: long>`  | 
Map from column id to number of NaN values in the column

   |
 | _optional_ | _optional_ | _optional_ | **`111  distinct_counts`**| 
`map<123: int, 124: long>`  | 
Map from column id to number of distinct values in the column; distinct counts 
must be derived using values in the file by counting or using sketches, but not 
using methods like merging existing distinct counts |
-| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**   | 
`map<126: int, 127: binary>`| 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2] |
-| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**   | 
`map<129: int, 130: binary>`| 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]  |
+| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`**   | 
`map<126: int, 127: binary>`| 
Map from column id to lower bound in the column serialized as binary [1]. Each 
value must be less than or equal to all non-null, non-NaN values in the column 
for the file [2]. See [7] for`geometry` and [8] for `geography`.  |
+| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`**   | 
`map<129: int, 130: binary>`| 
Map from column id to upper bound in the column serialized as binary [1]. Each 
value must be greater than or equal to all non-null, non-Nan values in the 
column for the file [2]. See [9] for `geometry` and [10] for `geography`. |

Review Comment:
   Whoops, I must have accidentally removed it while dealing with too many 
versions :). Put it back, and moved the other common note about the 180/90 
bound to this point as well. This is for geography only, right? (The new point 
is for the geography type.)









Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


ConeyLiu commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1886441268


##
pyiceberg/manifest.py:
##
@@ -105,6 +105,9 @@ def _missing_(cls, value: object) -> Union[None, str]:
 return member
 return None
 
+def is_splittable(self) -> bool:
+return self.name == "AVRO" or self.name == "PARQUET" or self.name == 
"ORC"

Review Comment:
   changed to enum
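
   For context, roughly what an enum-based version could look like (a sketch, 
not the PR's exact code):
   
   ```
   from enum import Enum
   
   
   class FileFormat(Enum):
       AVRO = "AVRO"
       PARQUET = "PARQUET"
       ORC = "ORC"
       METADATA = "METADATA"
   
       @property
       def is_splittable(self) -> bool:
           # Avro blocks and Parquet/ORC row groups/stripes allow splitting;
           # other formats are read whole.
           return self in {FileFormat.AVRO, FileFormat.PARQUET, FileFormat.ORC}
   
   
   assert FileFormat.PARQUET.is_splittable
   assert not FileFormat.METADATA.is_splittable
   ```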






Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


ConeyLiu commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1886445512


##
tests/integration/test_reads.py:
##
@@ -873,3 +873,48 @@ def test_table_scan_empty_table(catalog: Catalog) -> None:
 result_table = tbl.scan().to_arrow()
 
 assert len(result_table) == 0
+
+
+@pytest.mark.integration
+def test_plan_tasks(session_catalog: Catalog) -> None:
+from pyiceberg.table import TableProperties
+
+table_name = "default.test_plan_tasks"
+try:
+session_catalog.drop_table(table_name)
+except NoSuchTableError:
+pass  # Just to make sure that the table doesn't exist
+
+tbl = session_catalog.create_table(
+table_name,
+Schema(
+NestedField(1, "number", LongType()),
+),
+properties={TableProperties.PARQUET_ROW_GROUP_LIMIT: "1"},
+)
+
+# Write 10 row groups, that should end up as 10 batches
+entries = 10
+tbl.append(
+pa.Table.from_pylist(
+[
+{
+"number": number,
+}
+for number in range(entries)
+],
+)
+)
+

Review Comment:
   added an assert. Here it should only have one file but 10 row groups.






Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


ConeyLiu commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1886444111


##
pyiceberg/table/__init__.py:
##
@@ -1423,6 +1451,66 @@ def plan_files(self) -> Iterable[FileScanTask]:
 for data_entry in data_entries
 ]
 
+def _target_split_size(self) -> int:
+table_value = property_as_int(
+self.table_metadata.properties, TableProperties.READ_SPLIT_SIZE, 
TableProperties.READ_SPLIT_SIZE_DEFAULT
+)
+return property_as_int(self.options, TableProperties.READ_SPLIT_SIZE, 
table_value)  # type: ignore
+
+def _loop_back(self) -> int:
+table_value = property_as_int(
+self.table_metadata.properties, 
TableProperties.READ_SPLIT_LOOKBACK, TableProperties.READ_SPLIT_LOOKBACK_DEFAULT
+)
+return property_as_int(self.options, 
TableProperties.READ_SPLIT_LOOKBACK, table_value)  # type: ignore
+
+def _split_open_file_cost(self) -> int:
+table_value = property_as_int(
+self.table_metadata.properties,
+TableProperties.READ_SPLIT_OPEN_FILE_COST,
+TableProperties.READ_SPLIT_OPEN_FILE_COST_DEFAULT,
+)
+return property_as_int(self.options, 
TableProperties.READ_SPLIT_OPEN_FILE_COST, table_value)  # type: ignore
+
+def plan_task(self) -> Iterable[CombinedFileScanTask]:
+"""Plan balanced combined tasks for this scan by splitting large and 
combining small tasks.
+
+Returns:
+List of CombinedFileScanTasks
+"""
+split_size = self._target_split_size()
+loop_back = self._loop_back()
+open_file_cost = self._split_open_file_cost()
+
+def split(task: FileScanTask) -> List[FileScanTask]:
+data_file = task.file
+if not data_file.split_offsets:
+return [task]
+
+split_offsets = data_file.split_offsets
+if not all(split_offsets[i] <= split_offsets[i + 1] for i in 
range(len(split_offsets) - 1)):
+# split offsets are strictly ascending
+return [task]
+
+all_tasks = []
+for i in range(len(split_offsets) - 1):
+all_tasks.append(
+FileScanTask(data_file, task.delete_files, 
split_offsets[i], split_offsets[i + 1] - split_offsets[i])
+)
+
+all_tasks.append(
+FileScanTask(data_file, task.delete_files, split_offsets[-1], 
data_file.file_size_in_bytes - split_offsets[-1])
+)
+
+return all_tasks
+
+def weight_func(task: FileScanTask) -> int:
+return max(task.size_in_bytes(), (1 + len(task.delete_files)) * 
open_file_cost)

Review Comment:
   yes
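
   A quick worked example of that weight function (numbers assumed for 
illustration; 4 MiB is the usual `read.split.open-file-cost` default):
   
   ```
   open_file_cost = 4 * 1024 * 1024   # assumed default open-file cost
   size_in_bytes = 1 * 1024 * 1024    # a small 1 MiB task
   num_deletes = 2                    # two attached delete files
   
   # The open cost dominates for tiny tasks, so they aren't packed together
   # too aggressively:
   weight = max(size_in_bytes, (1 + num_deletes) * open_file_cost)
   assert weight == 12 * 1024 * 1024
   ```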






Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


ConeyLiu commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1886443611


##
pyiceberg/table/__init__.py:
##
@@ -191,6 +193,15 @@ class TableProperties:
 DELETE_MODE_MERGE_ON_READ = "merge-on-read"
 DELETE_MODE_DEFAULT = DELETE_MODE_COPY_ON_WRITE
 
+READ_SPLIT_SIZE = "read.split.target-size"

Review Comment:
   added to docs, those options exist in table properties as well 
https://iceberg.apache.org/docs/latest/configuration/#read-properties
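
   A hedged usage sketch of how the per-scan option overrides the table 
property (`plan_task` comes from this PR; the catalog/table names and the 
`tasks` attribute on `CombinedFileScanTask` are assumptions here):
   
   ```
   from pyiceberg.catalog import load_catalog
   
   catalog = load_catalog("default")  # hypothetical catalog name
   tbl = catalog.load_table("default.test_plan_tasks")  # hypothetical table
   
   # The scan option takes precedence over the table property of the same name.
   scan = tbl.scan(options={"read.split.target-size": str(128 * 1024 * 1024)})
   for combined in scan.plan_task():
       for task in combined.tasks:  # attribute name assumed
           print(task.file.file_path, task.start, task.length)
   ```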






Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


ConeyLiu commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1886446277


##
pyiceberg/table/__init__.py:
##
@@ -1229,7 +1240,8 @@ def with_case_sensitive(self: S, case_sensitive: bool = 
True) -> S:
 
 
 class ScanTask(ABC):
-pass
+def size_in_bytes(self) -> int:

Review Comment:
   sure, marked as abstract






Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


ConeyLiu commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1886446918


##
pyiceberg/table/__init__.py:
##
@@ -1253,6 +1265,22 @@ def __init__(
 self.start = start or 0
 self.length = length or data_file.file_size_in_bytes
 
+def size_in_bytes(self) -> int:
+return self.length + sum(f.file_size_in_bytes for f in 
self.delete_files)
+
+
+@dataclass(init=False)
+class CombinedFileScanTask(ScanTask):
+"""Task representing combined multiple file scan tasks."""

Review Comment:
   Updated the doc, please take another look.






Re: [PR] feat: support `arrow_struct_to_iceberg_struct` [iceberg-rust]

2024-12-16 Thread via GitHub


ZENOTME commented on PR #731:
URL: https://github.com/apache/iceberg-rust/pull/731#issuecomment-2545020315

   > Thanks @ZENOTME 's effort. I saw that both java/python have schema with 
partner visitor:
   > 
   > 1. 
[SchemaWithPartnerVisitor](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/core/src/main/java/org/apache/iceberg/schema/SchemaWithPartnerVisitor.java#L27)
   > 2. 
https://github.com/apache/iceberg-python/blob/bfc0d9a62176803094da0867ee793808f105d352/pyiceberg/io/pyarrow.py#L1726
   > 
   > I'm thinking that we should adopt similar design?
   
   I tried to adopt the SchemaWithPartner design, but I found the interface is 
not general enough. 
https://github.com/apache/iceberg-python/blob/bfc0d9a62176803094da0867ee793808f105d352/pyiceberg/io/pyarrow.py#L1726
 is not meant for array conversion, I think. To make it suitable for Arrow 
array conversion, I made some changes:
   
   1. [Introduced ListIterator and 
MapIterator](https://github.com/apache/iceberg-rust/pull/731/files#diff-919de395f497e9e93101d26e0618400083c15c4eb7ac3679e3662107f0c4b0eeR1225),
 since the control flow of list and map is not well suited to Arrow array conversion.
   2. An array in arrow-rs can be a NullArray, so we need some pre-order access 
before diving into the subcolumns; I introduced `visit_type_before` to solve 
this.
   





Re: [PR] Fix ParallelIterable deadlock [iceberg]

2024-12-16 Thread via GitHub


sopel39 commented on code in PR #11781:
URL: https://github.com/apache/iceberg/pull/11781#discussion_r1887631464


##
core/src/main/java/org/apache/iceberg/util/ParallelIterable.java:
##
@@ -257,17 +257,17 @@ private static class Task<T> implements Supplier<Optional<Task<T>>>, Closeable {
 @Override
 public Optional> get() {
   try {
+if (queue.size() >= approximateMaxQueueSize) {

Review Comment:
   In order for a connection to be kept around, a task has to be started and 
then it needs to yield. If a task doesn't yield, then the S3 connection is 
fully utilized and returned afterwards.
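
   To illustrate the ordering this change enforces, a simplified sketch (plain 
Python, not the Java implementation): the queue-size check happens before any 
work, so a full queue makes the task yield without ever taking a connection.
   
   ```
   from queue import Queue
   from typing import Iterator
   
   
   def task_step(queue: Queue, items: Iterator, max_queue_size: int) -> bool:
       """One cooperative step of a producer task; True means 'yield and retry'."""
       if queue.qsize() >= max_queue_size:
           return True  # yield before touching the underlying stream
       for item in items:
           queue.put(item)
           if queue.qsize() >= max_queue_size:
               return True  # yield; the iterator keeps its position
       return False  # source exhausted
   ```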






Re: [PR] fix: field id in name mapping should be optional [iceberg-python]

2024-12-16 Thread via GitHub


barronw commented on code in PR #1426:
URL: https://github.com/apache/iceberg-python/pull/1426#discussion_r1887662401


##
pyiceberg/table/name_mapping.py:
##
@@ -333,8 +334,8 @@ def struct(self, struct: StructType, struct_partner: 
Optional[MappedField], fiel
 return StructType(*field_results)
 
 def field(self, field: NestedField, field_partner: Optional[MappedField], 
field_result: IcebergType) -> IcebergType:
-if field_partner is None:
-raise ValueError(f"Field missing from NameMapping: 
{'.'.join(self.current_path)}")
+if field_partner is None or field_partner.field_id is None:

Review Comment:
   This should just be a type check since `NestedField` expects a field ID 
below. The field partner is looked up by field ID before being passed into this 
method.
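
   In other words, the two failure modes can be told apart explicitly; a small 
sketch of the check (illustrative, with a hypothetical helper name and error 
messages):
   
   ```
   from typing import List, Optional
   
   
   def ensure_field_id(field_partner: Optional["MappedField"], path: List[str]) -> int:
       # No mapping entry at all vs. an entry that legally omits field-id.
       if field_partner is None:
           raise ValueError(f"Field missing from NameMapping: {'.'.join(path)}")
       if field_partner.field_id is None:
           raise ValueError(f"Field has no field-id in NameMapping: {'.'.join(path)}")
       return field_partner.field_id
   ```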






Re: [PR] Avro: Support default values for generic data [iceberg]

2024-12-16 Thread via GitHub


rdblue commented on code in PR #11786:
URL: https://github.com/apache/iceberg/pull/11786#discussion_r1887662465


##
core/src/main/java/org/apache/iceberg/data/avro/PlannedDataReader.java:
##
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.data.avro;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Supplier;
+import org.apache.avro.LogicalType;
+import org.apache.avro.LogicalTypes;
+import org.apache.avro.Schema;
+import org.apache.avro.io.DatumReader;
+import org.apache.avro.io.Decoder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.avro.AvroWithPartnerVisitor;
+import org.apache.iceberg.avro.SupportsRowPosition;
+import org.apache.iceberg.avro.ValueReader;
+import org.apache.iceberg.avro.ValueReaders;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.Pair;
+
+public class PlannedDataReader<T> implements DatumReader<T>, SupportsRowPosition {
+
+  public static <T> PlannedDataReader<T> create(org.apache.iceberg.Schema expectedSchema) {
+    return create(expectedSchema, ImmutableMap.of());
+  }
+
+  public static <T> PlannedDataReader<T> create(
+      org.apache.iceberg.Schema expectedSchema, Map<Integer, ?> idToConstant) {
+    return new PlannedDataReader<>(expectedSchema, idToConstant);
+  }
+
+  private final org.apache.iceberg.Schema expectedSchema;
+  private final Map<Integer, ?> idToConstant;
+  private ValueReader<T> reader;
+
+  protected PlannedDataReader(
+      org.apache.iceberg.Schema expectedSchema, Map<Integer, ?> idToConstant) {
+    this.expectedSchema = expectedSchema;
+    this.idToConstant = idToConstant;
+  }
+
+  @Override
+  @SuppressWarnings("unchecked")
+  public void setSchema(Schema fileSchema) {
+    this.reader =
+        (ValueReader<T>)
+            AvroWithPartnerVisitor.visit(
+                expectedSchema.asStruct(),
+                fileSchema,
+                new ReadBuilder(idToConstant),
+                AvroWithPartnerVisitor.FieldIDAccessors.get());
+  }
+
+  @Override
+  public T read(T reuse, Decoder decoder) throws IOException {
+    return reader.read(decoder, reuse);
+  }
+
+  @Override
+  public void setRowPositionSupplier(Supplier<Long> posSupplier) {
+    if (reader instanceof SupportsRowPosition) {
+      ((SupportsRowPosition) reader).setRowPositionSupplier(posSupplier);
+    }
+  }
+
+  private static class ReadBuilder extends AvroWithPartnerVisitor<Type, ValueReader<?>> {
+    private final Map<Integer, ?> idToConstant;
+
+    private ReadBuilder(Map<Integer, ?> idToConstant) {
+      this.idToConstant = idToConstant;
+    }
+
+    @Override
+    public ValueReader<?> record(Type partner, Schema record, List<ValueReader<?>> fieldReaders) {
+      if (partner == null) {
+        return ValueReaders.skipStruct(fieldReaders);
+      }
+
+      Types.StructType expected = partner.asStructType();
+      List<Pair<Integer, ValueReader<?>>> readPlan =
+          ValueReaders.buildReadPlan(expected, record, fieldReaders, idToConstant);
+
+      return GenericReaders.struct(readPlan, expected);
+    }
+
+    @Override
+    public ValueReader<?> union(Type partner, Schema union, List<ValueReader<?>> options) {
+      return ValueReaders.union(options);
+    }
+
+    @Override
+    public ValueReader<?> array(Type ignored, Schema array, ValueReader<?> elementReader) {
+      return ValueReaders.array(elementReader);
+    }
+
+    @Override
+    public ValueReader<?> arrayMap(
+        Type ignored, Schema map, ValueReader<?> keyReader, ValueReader<?> valueReader) {
+      return ValueReaders.arrayMap(keyReader, valueReader);
+    }
+
+    @Override
+    public ValueReader<?> map(Type ignored, Schema map, ValueReader<?> valueReader) {
+      return ValueReaders.map(ValueReaders.strings(), valueReader);
+    }
+
+    @Override
+    public ValueReader<?> primitive(Type partner, Schema primitive) {
+      LogicalType logicalType = primitive.getLogicalType();
+      if (logicalType != null) {
+        switch (logicalType.getName()) {
+          case "date":
+            return GenericReaders.dates();
+
+          case "time-micros":
+            return GenericReaders.times();
+
+          case "timestamp-micro

Re: [PR] Avro: Support default values for generic data [iceberg]

2024-12-16 Thread via GitHub


rdblue commented on PR #11786:
URL: https://github.com/apache/iceberg/pull/11786#issuecomment-2546993293

   Thanks for the review, @Fokko!





Re: [PR] Avro: Support default values for generic data [iceberg]

2024-12-16 Thread via GitHub


rdblue merged PR #11786:
URL: https://github.com/apache/iceberg/pull/11786





Re: [PR] Docs: Change to Flink directory for instructions [iceberg]

2024-12-16 Thread via GitHub


szehon-ho commented on PR #11031:
URL: https://github.com/apache/iceberg/pull/11031#issuecomment-2546995771

   Whoops sorry, I must have missed this.  I think it makes sense to me.  cc 
@pvary @stevenzwu 





[PR] chore(docs): Update Readme - Lakekeeper repository moved [iceberg-rust]

2024-12-16 Thread via GitHub


c-thiel opened a new pull request, #810:
URL: https://github.com/apache/iceberg-rust/pull/810

   (no comment)





[PR] Bump moto from 5.0.22 to 5.0.23 [iceberg-python]

2024-12-16 Thread via GitHub


dependabot[bot] opened a new pull request, #1435:
URL: https://github.com/apache/iceberg-python/pull/1435

   Bumps [moto](https://github.com/getmoto/moto) from 5.0.22 to 5.0.23.
   
   Changelog
   
   Sourced from [moto's changelog](https://github.com/getmoto/moto/blob/master/CHANGELOG.md).
   
   5.0.23
   
   Docker Digest for 5.0.23: sha256:d41e007bb1f7d41b530959ae9cbed1edf42737ee839faf8da7e925bf19f63105
   
   New Services:
   * Kafka:
     * create_cluster()
     * create_cluster_v2()
     * describe_cluster()
     * describe_cluster_v2()
     * delete_cluster()
     * list_clusters()
     * list_clusters_v2()
     * list_tags_for_resource()
     * tag_resource()
     * untag_resource()
   
   New Methods:
   * DirectConnect:
     * associate_mac_sec_key()
     * create_lag()
     * describe_lags()
     * describe_settings()
     * disassociate_mac_sec_key()
     * update_settings()
   * EFS:
     * describe_file_system_policy()
     * put_file_system_policy()
   * ES:
     * describe_elasticsearch_domains()
   * OpenSearch:
     * describe_domains()
   
   Miscellaneous:
   * Athena: list_query_executions() now supports the WorkGroup-parameter
   * Athena: start_query_execution() now supports the WorkGroup-parameter
   * CloudFormation: AWS::IAM::Role now supports updates
   * CognitoIDP: list_users() now correctly filters before applying the Limit
   * DirectConnect: describe_trusts() no longer requires a DirectoryId-parameter
   * DynamoDB: The DeleteProtectionEnabled can now be disabled
   * DynamoDB: update_item() can now return list of binaries
   * EC2: SecurityGroups now contain a SecurityGroupArn
   * EC2: update_route() now correctly handles DestinationPrefixListId
   * KMS: get_public_key() now supports passing in aliases
   * Lambda: publish_function() now publishes a function even if the updated code hasn't changed
   * MemoryDB: tag_resource/list_tags_for_resource() now supports Snapshots and SubnetGroups
   * RDS: copy_db_snapshot() now supports the CopyTags-parameter
   
   ... (truncated)
   
   Commits
   * [426df0b](https://github.com/getmoto/moto/commit/426df0b74d98bd7d1c7b42918e11c66c4adc4c96) Pre-Release: Up Version Number
   * [479c211](https://github.com/getmoto/moto/commit/479c2119e030ad961f6b783ff461a20663f31a36) Prep release 5.0.23 ([#8404](https://redirect.github.com/getmoto/moto/issues/8404))
   * [82c0c63](https://github.com/getmoto/moto/commit/82c0c63bc1a3a7b682274ce0e5d41ae161dcb91c) EC2: update_route() now correctly handles DestinationPreflixListId ([#8395](https://redirect.github.com/getmoto/moto/issues/8395))
   * [ea13ff2](https://github.com/getmoto/moto/commit/ea13ff2c4aea37623c5ca9c27a0bd1465afce5bd) DynamoDB: update_item() can now return list of binaries ([#8401](https://redirect.github.com/getmoto/moto/issues/8401))
   * [a7c3613](https://github.com/getmoto/moto/commit/a7c36135f0d2b1c8c1d3e873c5c3d51f7cce4164) chore: update SSM default parameters ([#8402](https://redirect.github.com/getmoto/moto/issues/8402))
   * [f46b9e5](https://github.com/getmoto/moto/commit/f46b9e5925bd1bbd1f541db34918ad7e5bb34bf1) chore: update EC2 Instance Types ([#8399](https://redirect.github.com/getmoto/moto/issues/8399))
   * [d21bb86](https://github.com/getmoto/moto/commit/d21bb8624d12c7054b51ef17b06fb6894e642b76) chore: update EC2 Instance Offerings ([#8398](https://redirect.github.com/getmoto/moto/issues/8398))
   * [b5c665e](https://github.com/getmoto/moto/commit/b5c665edf41352cd1647d720eaf9e4199436f0e1) EC2: add SecurityGroupArn to security groups ([#8393](https://redirect.github.com/getmoto/moto/issues/8393))
   * [44df2b0](https://github.com/getmoto/moto/commit/44df2b02a65255659405f7fa31f1f24259d1b1bb) Kafka V1 Support ([#8381](https://redirect.github.com/getmoto/moto/issues/8381))
   * [8896093](https://github.com/getmoto/moto/commit/8896093e650be8b75a3b3103bb3a0d82380fb393) CognitoIDP: Fix Limit and Filter params for list_users() ([#8392](https://redirect.github.com/getmoto/moto/issues/8392))
   * Additional commits viewable in [compare view](https://github.com/getmoto/moto/compare/5.0.22...5.0.23)
   
   [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=moto&package-manager=pip&previous-version=5.0.22&new-version=5.0.23)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   Dependabot commands and options
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that

[PR] Bump mkdocs-material from 9.5.48 to 9.5.49 [iceberg-python]

2024-12-16 Thread via GitHub


dependabot[bot] opened a new pull request, #1437:
URL: https://github.com/apache/iceberg-python/pull/1437

   Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 
9.5.48 to 9.5.49.
   
   Release notes
   
   Sourced from [mkdocs-material's releases](https://github.com/squidfunk/mkdocs-material/releases).
   
   mkdocs-material-9.5.49
   * Adjusted title color in dark mode for all supported Mermaid.js diagrams
   * Fixed [#7803](https://redirect.github.com/squidfunk/mkdocs-material/issues/7803): Privacy plugin crashes on generated files
   * Fixed [#7781](https://redirect.github.com/squidfunk/mkdocs-material/issues/7781): Mermaid.js flow chart title not visible in dark mode
   
   Changelog
   
   Sourced from [mkdocs-material's changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG).
   
   mkdocs-material-9.5.49 (2024-12-16)
   * Adjusted title color in dark mode for all supported Mermaid.js diagrams
   * Fixed [#7803](https://redirect.github.com/squidfunk/mkdocs-material/issues/7803): Privacy plugin crashes on generated files
   * Fixed [#7781](https://redirect.github.com/squidfunk/mkdocs-material/issues/7781): Mermaid.js flow chart title not visible in dark mode
   
   mkdocs-material-9.5.48 (2024-12-08)
   * Fixed [#7774](https://redirect.github.com/squidfunk/mkdocs-material/issues/7774): Disabling social cards doesn't work
   
   mkdocs-material-9.5.47 (2024-12-01)
   * Fixed [#7750](https://redirect.github.com/squidfunk/mkdocs-material/issues/7750): Numeric tags break search
   * Fixed [#7748](https://redirect.github.com/squidfunk/mkdocs-material/issues/7748): Blog plugin breaks when using future drafts (9.5.45 regression)
   
   mkdocs-material-9.5.46 (2024-11-25)
   * Added support for removing preload hints in privacy plugin
   * Fixed [#7734](https://redirect.github.com/squidfunk/mkdocs-material/issues/7734): Code blocks in h5 headlines are uppercased
   * Fixed [#7725](https://redirect.github.com/squidfunk/mkdocs-material/issues/7725): Blog plugin crashing on missing timezone (9.5.45 regression)
   
   mkdocs-material-9.5.45 (2024-11-20)
   * Reduced size of Docker image through multi-stage build
   * Fixed [#7708](https://redirect.github.com/squidfunk/mkdocs-material/issues/7708): Blog plugin crashing on YAML dates with timezones
   
   mkdocs-material-9.5.44 (2024-11-05)
   * Fixed [#7672](https://redirect.github.com/squidfunk/mkdocs-material/issues/7672): Font CSS 404's when using privacy plugin (9.5.43 regression)
   
   mkdocs-material-9.5.43 (2024-10-31)
   * Added support for external images in SVGs in privacy plugin
   * Fixed [#7651](https://redirect.github.com/squidfunk/mkdocs-material/issues/7651): Privacy plugin doesn't handle quoted URLs in CSS
   
   mkdocs-material-9.5.42 (2024-10-20)
   * Fixed [#7625](https://redirect.github.com/squidfunk/mkdocs-material/issues/7625): Invalid encoding of boolean attributes in privacy plugin
   * Fixed [#7624](https://redirect.github.com/squidfunk/mkdocs-material/issues/7624): Crash when disabling privacy plugin (9.5.41 regression)
   
   mkdocs-material-9.5.41 (2024-10-15)
   * Fixed [#7619](https://redirect.github.com/squidfunk/mkdocs-material/issues/7619): Improved tooltip on logo disappears after instant navigation
   * Fixed [#7616](https://redirect.github.com/squidfunk/mkdocs-material/issues/7616): Race condition in built-in privacy plugin when inlining assets
   * Fixed [#7615](https://redirect.github.com/squidfunk/mkdocs-material/issues/7615): Comments and "Was this page helpful?" visible when printing
   
   mkdocs-material-9.5.40 (2024-10-10)
   * Updated Latvian translations
   * Fixed [#7597](https://redirect.github.com/squidfunk/mkdocs-material/issues/7597): Social cards not using site name on home page
   
   ... (truncated)
   
   Commits
   * [f947bbe](https://github.com/squidfunk/mkdocs-material/commit/f947bbe64a6f0f4754c0a7a1afc3b67e8b735c8b) Prepare 9.5.49 release
   * [968fbe1](https://github.com/squidfunk/mkdocs-material/commit/968fbe1ae213cb350850cfddf2614a1c93e73adb) Fixed privacy plugin interop with generated files
   * [c97270c](https://github.com/squidfunk/mkdocs-material/commit/c97270c8f1d7a068eb20335ad961a356971c5f05) Updated dependencies
   * [7fb3a39](https://github.com/squidfunk/mkdocs-material/commit/7fb3a39bbc17865743b8cb8846aa4e845ac53eb4) Documentation ([#7794](https://redirect.github.com/squidfunk/mkdocs-material/issues/7794))
   * [1ee1115](https://github.com/squidfunk/mkdocs-material/commit/1ee11152260a8081d2e6f394162194d1e591aadb) Updated dependencies
   * [b368cee](https://github.com/squidfunk/mkdocs-material/commit/b368cee13f23b52335fd1ab39cacae9e26b08291) Fixed diagram title not switching color in dark mode
   * [0f97a8d](https://github.com/squidfunk/mkdocs-material/commit/0f97a8d5f029b1d38ec84fcdb9733a3682f84d50) Documentation ([#7784](https://redirect.github.com/squidfunk/mkdocs-material/issues/7784))
   * See full diff in https://github.com/squidfunk/mkd

[PR] Bump adlfs from 2024.7.0 to 2024.12.0 [iceberg-python]

2024-12-16 Thread via GitHub


dependabot[bot] opened a new pull request, #1436:
URL: https://github.com/apache/iceberg-python/pull/1436

   Bumps [adlfs](https://github.com/fsspec/adlfs) from 2024.7.0 to 2024.12.0.
   
   Changelog
   Sourced from https://github.com/fsspec/adlfs/blob/main/CHANGELOG.md";>adlfs's 
changelog.
   
   2024.12.0
   
   Add "exclusive" more to put, pipe and open, as in upstream 
fsspec
   Allow concurrency in fetch_ranges
   update versions
   signed URLs from connection string; and add content type
   CI cleanups
   better error messages
   honor anon parameter better
   deterministic blob IDs on upload
   AzureBlobFileSystem and AzureBlobFile support 
pickling.
   Handle mixed casing for hdi_isfolder metadata when 
determining whether a blob should be treated as a folder.
   _put_file: overwrite now defaults to 
True.
   
   2023.10.0
   
   Added support for timeout/connection_timeout/read_timeout
   
   2023.9.0
   
   Compatability with new glob behavior in fsspec 2023.9.0
   Compatability with azure.storage.blob 12.18.1
   
   2022.11.2
   
   Reorder fs.info() to search the parent directory only after searching 
for the specified item directly
   Removed pin on upper bound for azure-storage-blob
   Moved AzureDatalakeFileSystem to a separate module, in acknowledgement 
of https://learn.microsoft.com/en-us/answers/questions/281107/azure-data-lake-storage-gen1-retirement-announceme.html%7D";>Microsoft
 End of Life notice
   Added DeprecationWarning to AzureDatalakeFilesystem
   
   2022.10.1
   
   Pin azure-storage-blob >=12.12.0,=1.23.1,<2.0.0
   
   2022.9.1
   
   Fixed missing dependency on aiohttp in package 
metadata.
   
   2022.9.0
   
   Add support to AzureBlobFileSystem for versioning
   Assure full uri's are left stripped to remove ""
   Set Python requires >=3.8 in setup.cfg
   _strip_protocol handle lists
   
   2022.7.0
   
   Fix overflow error when uploading files > 2GB
   
   
   
   ... (truncated)
   
   
   Commits
   
   See full diff in https://github.com/fsspec/adlfs/compare/2024.7.0...2024.12.0";>compare 
view
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=adlfs&package-manager=pip&previous-version=2024.7.0&new-version=2024.12.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix comment on `WRITE_OBJECT_STORE_PARTITIONED_PATHS` table property [iceberg]

2024-12-16 Thread via GitHub


ebyhr commented on PR #11798:
URL: https://github.com/apache/iceberg/pull/11798#issuecomment-2547049649

   I believe this change is correct. The usage is: 
   
https://github.com/apache/iceberg/blob/b9b61b1d72ebb192d5e90453ff7030ece73d2603/core/src/main/java/org/apache/iceberg/LocationProviders.java#L130-L134


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Allow adding files to multiple partition specs in FastAppend [iceberg]

2024-12-16 Thread via GitHub


anuragmantri commented on code in PR #11771:
URL: https://github.com/apache/iceberg/pull/11771#discussion_r1887709101


##
core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java:
##
@@ -1590,13 +1590,15 @@ public void testCompleteCreateTransactionMultipleSchemas() {
     updateSchema.commit();
 
     UpdatePartitionSpec updateSpec = create.updateSpec().addField("new_col");
-    PartitionSpec newSpec = updateSpec.apply();
     updateSpec.commit();
 
     ReplaceSortOrder replaceSortOrder = create.replaceSortOrder().asc("new_col");
     SortOrder newSortOrder = replaceSortOrder.apply();
     replaceSortOrder.commit();
 
+    // Get new spec after commit to write new file with new spec
+    PartitionSpec newSpec = create.table().spec();

Review Comment:
   This test was not using the latest spec. It would fail with `newAppend()` too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Fix numeric overflow of timestamp nano literal [iceberg]

2024-12-16 Thread via GitHub


ebyhr commented on code in PR #11775:
URL: https://github.com/apache/iceberg/pull/11775#discussion_r1887709351


##
api/src/test/java/org/apache/iceberg/types/TestConversions.java:
##
@@ -111,9 +111,9 @@ public void testByteBufferConversions() {
     assertConversion(
         400000000L, TimestampNanoType.withZone(), new byte[] {0, -124, -41, 23, 0, 0, 0, 0});
 
     assertThat(Literal.of(400000L).to(TimestampNanoType.withoutZone()).toByteBuffer().array())
-        .isEqualTo(new byte[] {0, -124, -41, 23, 0, 0, 0, 0});
+        .isEqualTo(new byte[] {-128, 26, 6, 0, 0, 0, 0, 0});
 
     assertThat(Literal.of(400000L).to(TimestampNanoType.withZone()).toByteBuffer().array())

Review Comment:
   I think the cause is that the original logic called the 
`DateTimeUtil.microsToNanos` method, which multiplies the value by 1000:
   
https://github.com/apache/iceberg/blob/b9b61b1d72ebb192d5e90453ff7030ece73d2603/api/src/main/java/org/apache/iceberg/expressions/Literals.java#L444-L445
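
   For illustration, a standalone Python sketch of the before/after encoding (the 
value 400000 is recovered from the byte arrays in the test above; this is not 
the project's code):

   ```
   import struct

   value = 400_000  # the long literal from the test

   # old behavior: the long was treated as micros and multiplied by 1_000
   # (in Java, value * 1_000 overflows a 64-bit long once value exceeds
   # Long.MAX_VALUE / 1_000, which is the numeric overflow being fixed)
   old_bytes = struct.pack("<q", value * 1_000)

   # fixed behavior: the long literal is taken as nanos directly
   new_bytes = struct.pack("<q", value)

   assert list(old_bytes) == [0, 0x84, 0xD7, 0x17, 0, 0, 0, 0]
   assert list(new_bytes) == [0x80, 0x1A, 0x06, 0, 0, 0, 0, 0]
   ```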



##
api/src/test/java/org/apache/iceberg/expressions/TestTimestampLiteralConversions.java:
##
@@ -107,6 +108,28 @@ public void testTimestampMicrosToDateConversion() {
 assertThat(dateOrdinal).isEqualTo(-1);
   }
 
+  @Test
+  public void testTimestampNanoWithLongLiteral() {

Review Comment:
   Sure, added the test case. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix ParallelIterable deadlock [iceberg]

2024-12-16 Thread via GitHub


sopel39 commented on code in PR #11781:
URL: https://github.com/apache/iceberg/pull/11781#discussion_r1886884393


##
core/src/main/java/org/apache/iceberg/util/ParallelIterable.java:
##
@@ -257,17 +257,17 @@ private static class Task<T> implements Supplier<Optional<Task<T>>>, Closeable {
 @Override
 public Optional> get() {
   try {
+if (queue.size() >= approximateMaxQueueSize) {

Review Comment:
   > so, each table scan leads to 2*cores tasks/s3 connections?
   
   Potentially, yes, but there are extra conditions, e.g. the queue must be full.
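   
   For readers following along, a minimal standalone sketch (Python, with 
hypothetical names) of the back-pressure check being discussed: the producer 
stops early once the shared queue looks full, and the leftover work is 
re-scheduled later, which is what bounds how many inputs are open at once.
   
   ```
   from queue import Queue
   from typing import List, Optional

   def produce(task_queue: Queue, approximate_max_queue_size: int,
               remaining: List[str]) -> Optional[List[str]]:
       """Enqueue items until the queue looks full; return leftover work
       so the caller can re-schedule it instead of blocking a worker."""
       for index, item in enumerate(remaining):
           if task_queue.qsize() >= approximate_max_queue_size:
               return remaining[index:]  # hand the rest back for later
           task_queue.put(item)
       return None  # all work enqueued
   ```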
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Unimplement Map from CharSequenceMap to obey contract [iceberg]

2024-12-16 Thread via GitHub


findepi commented on PR #11704:
URL: https://github.com/apache/iceberg/pull/11704#issuecomment-2545942666

   @nastra  @Fokko can you please take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Implementing namespace_exists function on the REST Catalog [iceberg-python]

2024-12-16 Thread via GitHub


AhmedNader42 opened a new pull request, #1434:
URL: https://github.com/apache/iceberg-python/pull/1434

   I have added the `namespace_exists` functionality for the REST Catalog.
   
   Here's an example usage, similar to the `table_exists` function of the same 
class:
   
   ![Screenshot from 2024-12-16 
18-29-00](https://github.com/user-attachments/assets/d31a7031-9a44-4db1-9077-2f8aeac6525b)
   
   Please review the changes 
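   
   For reviewers, a usage sketch of the proposed method (catalog name, URI, and 
namespace are placeholders):
   
   ```
   from pyiceberg.catalog import load_catalog

   catalog = load_catalog("rest", uri="http://localhost:8181")  # placeholder endpoint

   # True if the namespace is known to the REST catalog, False otherwise
   if catalog.namespace_exists("examples"):
       print(catalog.list_tables("examples"))
   ```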


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] i cant table create dayTransform() and monthTransform() togetger for single field in schema [iceberg]

2024-12-16 Thread via GitHub


Fokko commented on issue #11788:
URL: https://github.com/apache/iceberg/issues/11788#issuecomment-2545929169

   @dvnageshpatil If you have a timestamp value:
   
   ```
   ts = '2024-10-22T19:25:00'
   ```
   
   Then the transforms will produce:
   
   ```
   day(ts) = '2024-10-22'
   month(ts) = '2024-10'
   ```
   
   In this case, the `month` information is useless since it is also encoded in 
the `day` information. Therefore just adding a `day` partition transform will 
do it 👍 I hope this helps! I'll close this issue for now, feel free to re-open 
it if there are any further questions.
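   
   For anyone wiring this up in PyIceberg, a minimal sketch of a spec that uses 
only the day transform (field and spec names are made up for the example):
   
   ```
   from pyiceberg.partitioning import PartitionField, PartitionSpec
   from pyiceberg.schema import Schema
   from pyiceberg.transforms import DayTransform
   from pyiceberg.types import NestedField, TimestampType

   schema = Schema(
       NestedField(field_id=1, name="ts", field_type=TimestampType(), required=False),
   )

   # day(ts) already implies the month, so a single transform is enough
   spec = PartitionSpec(
       PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="ts_day")
   )
   ```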


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] i cant table create dayTransform() and monthTransform() togetger for single field in schema [iceberg]

2024-12-16 Thread via GitHub


Fokko closed issue #11788: i cant table create dayTransform() and 
monthTransform() togetger for single field in schema
URL: https://github.com/apache/iceberg/issues/11788


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add plan tasks for TableScan [iceberg-python]

2024-12-16 Thread via GitHub


kevinjqliu commented on code in PR #1427:
URL: https://github.com/apache/iceberg-python/pull/1427#discussion_r1887034502


##
pyiceberg/table/__init__.py:
##
@@ -1423,6 +1451,66 @@ def plan_files(self) -> Iterable[FileScanTask]:
 for data_entry in data_entries
 ]
 
+    def _target_split_size(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties, TableProperties.READ_SPLIT_SIZE, TableProperties.READ_SPLIT_SIZE_DEFAULT
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_SIZE, table_value)  # type: ignore
+
+    def _loop_back(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties, TableProperties.READ_SPLIT_LOOKBACK, TableProperties.READ_SPLIT_LOOKBACK_DEFAULT
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_LOOKBACK, table_value)  # type: ignore
+
+    def _split_open_file_cost(self) -> int:
+        table_value = property_as_int(
+            self.table_metadata.properties,
+            TableProperties.READ_SPLIT_OPEN_FILE_COST,
+            TableProperties.READ_SPLIT_OPEN_FILE_COST_DEFAULT,
+        )
+        return property_as_int(self.options, TableProperties.READ_SPLIT_OPEN_FILE_COST, table_value)  # type: ignore
+
+    def plan_task(self) -> Iterable[CombinedFileScanTask]:

Review Comment:
   This is where daft does scan planning, using `plan_files`.
   
   
https://github.com/Eventual-Inc/Daft/blob/95a61d26bfb198c590570229a81f7e3bec0049a0/daft/iceberg/iceberg_scan.py#L153-L206
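   
   To make the proposal concrete, a simplified standalone sketch of the greedy 
packing that a `plan_task`-style API performs (the real implementation also 
honors the lookback knob and can split large files; names here are 
illustrative):
   
   ```
   from dataclasses import dataclass
   from typing import Iterable, List

   @dataclass
   class FileScanTask:
       length: int  # bytes the task will read

   def plan_tasks(
       tasks: Iterable[FileScanTask],
       target_split_size: int,
       open_file_cost: int,
   ) -> List[List[FileScanTask]]:
       """Greedily pack file scan tasks into combined splits of roughly
       target_split_size bytes, charging at least open_file_cost per file."""
       combined: List[List[FileScanTask]] = []
       current: List[FileScanTask] = []
       current_weight = 0
       for task in tasks:
           weight = max(task.length, open_file_cost)
           if current and current_weight + weight > target_split_size:
               combined.append(current)
               current, current_weight = [], 0
           current.append(task)
           current_weight += weight
       if current:
           combined.append(current)
       return combined
   ```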



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Implement `namespace_exists` function on the REST Catalog [iceberg-python]

2024-12-16 Thread via GitHub


AhmedNader42 commented on issue #1430:
URL: 
https://github.com/apache/iceberg-python/issues/1430#issuecomment-2545954476

   Submitted #1434


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] [DISCUSS] Exceptions vs status codes [iceberg-cpp]

2024-12-16 Thread via GitHub


wgtmac commented on issue #14:
URL: https://github.com/apache/iceberg-cpp/issues/14#issuecomment-2545956079

   As an Arrow developer who has experienced both statuses and exceptions in the 
same repo, I feel inclined to use exceptions to make the code shorter and easier 
to debug. Exceptions are also the preferred (perhaps biased) choice for modern C++: 
https://learn.microsoft.com/en-us/cpp/cpp/errors-and-exception-handling-modern-cpp?view=msvc-170


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Integrate Test Framework [iceberg-cpp]

2024-12-16 Thread via GitHub


pitrou commented on PR #13:
URL: https://github.com/apache/iceberg-cpp/pull/13#issuecomment-2545484399

   > I found this link: 
https://yurigeronimus.medium.com/guide-for-choosing-a-test-framework-for-your-c-project-2a7741b53317
   
   Seems like a lot of fluff with no substance, unfortunately.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Add RemovePartitionSpecs table update [iceberg-rust]

2024-12-16 Thread via GitHub


c-thiel commented on code in PR #804:
URL: https://github.com/apache/iceberg-rust/pull/804#discussion_r1886757422


##
crates/iceberg/src/spec/table_metadata_builder.rs:
##
@@ -740,6 +740,32 @@ impl TableMetadataBuilder {
 .set_default_partition_spec(Self::LAST_ADDED)
 }
 
+/// Remove partition specs by their ids from the table metadata.
+/// Does nothing if a spec id is not present.
+///
+/// If the default partition spec is removed, it is re-added
+/// upon build.
+pub fn remove_partition_specs(mut self, spec_ids: &[i32]) -> Self {
+let mut removed_specs = Vec::with_capacity(spec_ids.len());
+
+self.metadata.partition_specs.retain(|k, _| {
+if spec_ids.contains(k) {

Review Comment:
   I switched the order - it makes more sense to me this way. We are now looping 
over a probably very short array doing O(1) lookups. `partition_specs` is a 
HashMap.
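   
   The same complexity argument, sketched in Python (a dict standing in for the 
Rust HashMap; numbers are arbitrary):
   
   ```
   spec_ids = [2, 5]  # usually a very short list
   partition_specs = {i: f"spec-{i}" for i in range(1_000)}

   # before: scan the whole map, with a linear `contains` check per entry
   retained = {k: v for k, v in partition_specs.items() if k not in spec_ids}

   # after: loop the short list instead, one O(1) hash removal per id
   removed = [sid for sid in spec_ids if partition_specs.pop(sid, None) is not None]
   ```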



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: TableMetadata Statistic Files [iceberg-rust]

2024-12-16 Thread via GitHub


Fokko commented on code in PR #799:
URL: https://github.com/apache/iceberg-rust/pull/799#discussion_r1886819635


##
crates/iceberg/src/catalog/mod.rs:
##
@@ -446,6 +446,30 @@ pub enum TableUpdate {
 /// Properties to remove
 removals: Vec<String>,
 },
+/// Set statistics for a snapshot
+#[serde(with = "_serde_set_statistics")]
+SetStatistics {
+/// File containing the statistics
+statistics: StatisticsFile,
+},

Review Comment:
   Thanks for raising this on the dev-list: 
https://lists.apache.org/thread/6fnrp73wokfqlt5vormpjyjmtvl29ll1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] add Status data structure [iceberg-cpp]

2024-12-16 Thread via GitHub


gaborkaszab commented on PR #8:
URL: https://github.com/apache/iceberg-cpp/pull/8#issuecomment-2545642065

   > There is an old stackoverflow question which I think we can take a look
   
   There is a stackoverflow answer for everything, and for the opposite too. I just 
googled "Status codes vs exceptions" and it gives me the first 5 results 
supporting exceptions over status codes, I guess because the algorithm is 
biased by the personal preference it learned from me.
   
   > when you set a breakpoint on the method chain and exception happens, how 
do you know which call give out the exception?
   
   This works well in Java, and I don't think it would be any different in the C++ 
world. If you get an exception you can check the trace for where it was 
thrown and then set your debugger accordingly.
   Also, method chaining doesn't mean that you write everything in a single 
line; you can break lines as you want.
   
   I did a quick run-through of this PR and my general opinion is this: the PR 
assumes that we simply want to pull in auxiliary stuff from the Arrow project, 
and without questioning whether we need it here or not we just pull entire 
files into this project. This is pretty difficult to review TBH, and for someone 
who is not from the Arrow project it's even more difficult, because they have to 
figure out the initial motivation of each piece of functionality in another project.
   
   I think this should be the other way around. Let's identify some 
functionality that we'd like to develop in this project, say an interface for 
catalogs, and if we find something in other projects that could help, 
then pull in only the required part. Or let's have a very simple implementation 
ourselves; we can still evolve it later on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: TableMetadata Statistic Files [iceberg-rust]

2024-12-16 Thread via GitHub


Fokko commented on code in PR #799:
URL: https://github.com/apache/iceberg-rust/pull/799#discussion_r1886823887


##
crates/iceberg/src/spec/table_metadata_builder.rs:
##
@@ -524,6 +526,52 @@ impl TableMetadataBuilder {
 self
 }
 
+/// Set statistics for a snapshot
+pub fn set_statistics(mut self, statistics: StatisticsFile) -> Self {
+self.metadata
+.statistics
+.insert(statistics.snapshot_id, statistics.clone());

Review Comment:
   Do we want to check if a `snapshot-id` exists?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Exceptions vs status codes [iceberg-cpp]

2024-12-16 Thread via GitHub


gaborkaszab opened a new issue, #14:
URL: https://github.com/apache/iceberg-cpp/issues/14

   I'm pretty sure there are pros and cons for each side and people might get 
into religious fights on this topic. Let's try to come to a conclusion on which 
one we should use in this project.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] add Status data structure [iceberg-cpp]

2024-12-16 Thread via GitHub


wgtmac commented on PR #8:
URL: https://github.com/apache/iceberg-cpp/pull/8#issuecomment-2545814446

   I agree with @gaborkaszab that it would be better to discuss a concrete API 
design (e.g. Table, FileIO, etc.) before introducing a fully functional status 
implementation. If we decide to go with statuses over exceptions, we can add a 
basic status class at first and make it more powerful later on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: Add RemovePartitionSpecs table update [iceberg-rust]

2024-12-16 Thread via GitHub


c-thiel commented on code in PR #804:
URL: https://github.com/apache/iceberg-rust/pull/804#discussion_r1886939579


##
crates/iceberg/src/spec/table_metadata_builder.rs:
##
@@ -740,6 +740,35 @@ impl TableMetadataBuilder {
 .set_default_partition_spec(Self::LAST_ADDED)
 }
 
+/// Remove partition specs by their ids from the table metadata.
+/// Does nothing if a spec id is not present.
+///

Review Comment:
   Added a comment in my last commit.
   As I am now following the java behavior to fail if the default spec is 
removed, I also added a comment for that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Build: Bump mkdocs-material from 9.5.47 to 9.5.48 [iceberg]

2024-12-16 Thread via GitHub


Fokko merged PR #11790:
URL: https://github.com/apache/iceberg/pull/11790


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] [Discussion] googletest(gtest) or Catch2 [iceberg-cpp]

2024-12-16 Thread via GitHub


pitrou commented on issue #12:
URL: https://github.com/apache/iceberg-cpp/issues/12#issuecomment-2545837400

   Arrow C++ and Parquet C++ developers are certainly familiar with GTest too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: TableMetadata Statistic Files [iceberg-rust]

2024-12-16 Thread via GitHub


c-thiel commented on code in PR #799:
URL: https://github.com/apache/iceberg-rust/pull/799#discussion_r1886956770


##
crates/iceberg/src/spec/table_metadata_builder.rs:
##
@@ -524,6 +526,52 @@ impl TableMetadataBuilder {
 self
 }
 
+/// Set statistics for a snapshot
+pub fn set_statistics(mut self, statistics: StatisticsFile) -> Self {
+self.metadata
+.statistics
+.insert(statistics.snapshot_id, statistics.clone());

Review Comment:
   I thought about this as well, but then followed Java.
   
   Currently statistics and snapshots are quite separate from each other. If we 
implement your check (which I like), I think we should eventually also 
implement:
   * Upon deserialization, discard statistics that belong to nonexistent 
snapshots
   * When a snapshot is removed, delete the statistics for it as well
   
   This would guarantee that a statistics entry never points at a missing 
snapshot. It is unclear however what should happen to the puffin files in these 
cases. We would have coherent metadata, but probably also orphan files.
   
   Do we know why the check is not there in Java?
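   
   (For concreteness, a minimal Python sketch of the validation being debated, 
with hypothetical field names; this is not the Rust code from the PR:)
   
   ```
   def set_statistics(metadata: dict, statistics_file: dict) -> None:
       """Reject a statistics file that points at a snapshot the table lacks."""
       snapshot_id = statistics_file["snapshot_id"]
       known = {s["snapshot_id"] for s in metadata["snapshots"]}
       if snapshot_id not in known:
           raise ValueError(f"Snapshot {snapshot_id} does not exist in table metadata")
       metadata["statistics"][snapshot_id] = statistics_file
   ```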



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] [DISCUSS] Exceptions vs status codes [iceberg-cpp]

2024-12-16 Thread via GitHub


pitrou commented on issue #14:
URL: https://github.com/apache/iceberg-cpp/issues/14#issuecomment-2545841290

   The main thing I like about the "Status" style is that it makes error 
propagation and handling (or lack thereof) explicit. However, I have no 
religious preference :)
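   
   A tiny illustration of that trade-off (in Python for brevity; the pattern, not 
any particular library API):
   
   ```
   # Status-style: every call site must propagate errors explicitly.
   def parse_status(s: str):
       if not s.isdigit():
           return None, f"invalid int: {s!r}"  # (value, error) pair
       return int(s), None

   def double_status(s: str):
       value, err = parse_status(s)
       if err is not None:  # explicit propagation, visible in the code
           return None, err
       return value * 2, None

   # Exception-style: shorter, but propagation is implicit.
   def double_exc(s: str) -> int:
       return int(s) * 2  # ValueError propagates invisibly
   ```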


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] [Discussion] googletest(gtest) or Catch2 [iceberg-cpp]

2024-12-16 Thread via GitHub


gaborkaszab commented on issue #12:
URL: https://github.com/apache/iceberg-cpp/issues/12#issuecomment-2545662593

   No strong opinions from me. I myself gravitate towards gtest just because I 
have experience with it, but I could live with any of them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark3.4,3.5: In describe extended view command: fix wrong view catal… [iceberg]

2024-12-16 Thread via GitHub


nastra commented on code in PR #11751:
URL: https://github.com/apache/iceberg/pull/11751#discussion_r1886968667


##
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestViews.java:
##
@@ -1414,7 +1414,42 @@ public void describeExtendedView() {
             String.format(
                 "['format-version' = '1', 'location' = '%s', 'provider' = 'iceberg']",
                 location),
-            ""));
+            ""),
+        row("View Text", sql, ""));
+  }
+
+  @TestTemplate
+  public void describeExtendedViewWithoutCurrentNamespace() {

Review Comment:
   Sorry, I actually confused myself and that's exactly the part I had to 
double-check. So you're right about how the current namespace is being used.
   Can you please update the tests to this?
   
   ```
   +
   +  @TestTemplate
   +  public void createAndDescribeViewInDefaultNamespace() {
   +    String viewName = viewName("createViewInDefaultNamespace");
   +    String sql = String.format("SELECT id, data FROM %s WHERE id <= 3", tableName);
   +
   +    sql("CREATE VIEW %s (id, data) AS %s", viewName, sql);
   +    TableIdentifier identifier = TableIdentifier.of(NAMESPACE, viewName);
   +    View view = viewCatalog().loadView(identifier);
   +    assertThat(view.currentVersion().defaultCatalog()).isNull();
   +    assertThat(view.name()).isEqualTo(ViewUtil.fullViewName(catalogName, identifier));
   +    assertThat(view.currentVersion().defaultNamespace()).isEqualTo(NAMESPACE);
   +
   +    String location = viewCatalog().loadView(identifier).location();
   +    assertThat(sql("DESCRIBE EXTENDED %s.%s", NAMESPACE, viewName))
   +        .contains(
   +            row("id", "int", ""),
   +            row("data", "string", ""),
   +            row("", "", ""),
   +            row("# Detailed View Information", "", ""),
   +            row("Comment", "", ""),
   +            row("View Catalog and Namespace", String.format("%s.%s", catalogName, NAMESPACE), ""),
   +            row("View Query Output Columns", "[id, data]", ""),
   +            row(
   +                "View Properties",
   +                String.format(
   +                    "['format-version' = '1', 'location' = '%s', 'provider' = 'iceberg']",
   +                    location),
   +                ""));
   +  }
   +
   +  @TestTemplate
   +  public void createAndDescribeViewWithoutCurrentNamespace() {
   +    String viewName = viewName("createViewWithoutCurrentNamespace");
   +    Namespace namespace = Namespace.of("test_namespace");
   +    String sql = String.format("SELECT id, data FROM %s WHERE id <= 3", tableName);
   +
   +    sql("CREATE NAMESPACE IF NOT EXISTS %s", namespace);
   +    sql("CREATE VIEW %s.%s (id, data) AS %s", namespace, viewName, sql);
   +    TableIdentifier identifier = TableIdentifier.of(namespace, viewName);
   +    View view = viewCatalog().loadView(identifier);
   +    assertThat(view.currentVersion().defaultCatalog()).isNull();
   +    assertThat(view.name()).isEqualTo(ViewUtil.fullViewName(catalogName, identifier));
   +    assertThat(view.currentVersion().defaultNamespace()).isEqualTo(NAMESPACE);
   +
   +    String location = viewCatalog().loadView(identifier).location();
   +    assertThat(sql("DESCRIBE EXTENDED %s.%s", namespace, viewName))
   +        .contains(
   +            row("id", "int", ""),
   +            row("data", "string", ""),
   +            row("", "", ""),
   +            row("# Detailed View Information", "", ""),
   +            row("Comment", "", ""),
   +            row("View Catalog and Namespace", String.format("%s.%s", catalogName, namespace), ""),
   +            row("View Query Output Columns", "[id, data]", ""),
   +            row(
   +                "View Properties",
   +                String.format(
   +                    "['format-version' = '1', 'location' = '%s', 'provider' = 'iceberg']",
   +                    location),
   +                ""));
   +  }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deserialize NestedField initial-default and write-default Attributes [iceberg-python]

2024-12-16 Thread via GitHub


kevinjqliu commented on code in PR #1432:
URL: https://github.com/apache/iceberg-python/pull/1432#discussion_r1886979577


##
tests/conftest.py:
##
@@ -149,6 +149,35 @@ def table_schema_simple() -> Schema:
 )
 
 
+@pytest.fixture(scope="session")

Review Comment:
   nit: wdyt about adding a NestedField to 
[table_schema_simple](https://github.com/apache/iceberg-python/pull/1432/files#diff-e52e4ddd58b7ef887ab03c04116e676f6280b824ab7469d5d3080e5cba4f2128R141-R149)
 instead of creating a new Schema altogether? 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Azure: Support vended credentials refresh in ADLSFileIO. [iceberg]

2024-12-16 Thread via GitHub


nastra commented on code in PR #11577:
URL: https://github.com/apache/iceberg/pull/11577#discussion_r1886971501


##
azure/src/main/java/org/apache/iceberg/azure/adlsv2/VendedAdlsCredentialProvider.java:
##
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.azure.adlsv2;
+
+import com.azure.core.credential.AzureSasCredential;
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.Future;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import org.apache.commons.io.IOUtils;
+import org.apache.iceberg.azure.AzureProperties;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.rest.ErrorHandlers;
+import org.apache.iceberg.rest.HTTPClient;
+import org.apache.iceberg.rest.RESTClient;
+import org.apache.iceberg.rest.RESTUtil;
+import org.apache.iceberg.rest.auth.OAuth2Properties;
+import org.apache.iceberg.rest.auth.OAuth2Util;
+import org.apache.iceberg.rest.credentials.Credential;
+import org.apache.iceberg.rest.responses.LoadCredentialsResponse;
+import org.apache.iceberg.util.Pair;
+import org.apache.iceberg.util.SerializableMap;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class VendedAdlsCredentialProvider implements Serializable, 
AutoCloseable {
+  private static final Logger LOG = 
LoggerFactory.getLogger(VendedAdlsCredentialProvider.class);
+
+  private static final String THREAD_PREFIX = "adls-vended-credential-refresh";
+  public static final String URI = "credentials.uri";
+
+  private final SerializableMap<String, String> properties;
+  private transient volatile Map<String, AzureSasCredentialRefresher> azureSasCredentialRefresherMap;
+  private transient volatile RESTClient client;
+  private transient volatile ScheduledExecutorService refreshExecutor;
+
+  public VendedAdlsCredentialProvider(Map<String, String> properties) {
+Preconditions.checkArgument(null != properties, "Invalid properties: 
null");
+Preconditions.checkArgument(null != properties.get(URI), "Invalid URI: 
null");
+this.properties = SerializableMap.copyOf(properties);
+azureSasCredentialRefresherMap = Maps.newHashMap();
+  }
+
+  public AzureSasCredential credentialForAccount(String storageAccount) {
+    Map<String, AzureSasCredentialRefresher> refresherForAccountMap =
+        azureSasCredentialRefresherMap();
+if (refresherForAccountMap.containsKey(storageAccount)) {
+  return refresherForAccountMap.get(storageAccount).azureSasCredential();
+} else {
+  AzureSasCredentialRefresher azureSasCredentialRefresher =
+  new AzureSasCredentialRefresher(
+  () -> this.sasTokenWithExpiration(storageAccount), 
credentialRefreshExecutor());
+  refresherForAccountMap.put(storageAccount, azureSasCredentialRefresher);
+  return azureSasCredentialRefresher.azureSasCredential();
+}
+  }
+
+  private Pair<String, Long> sasTokenWithExpiration(String storageAccount) {
+LoadCredentialsResponse response = fetchCredentials();
+    List<Credential> adlsCredentials =
+response.credentials().stream()
+.filter(c -> c.prefix().contains(storageAccount))
+.collect(Collectors.toList());
+Preconditions.checkState(
+!adlsCredentials.isEmpty(),
+String.format("Invalid ADLS Credentials for storage-account %s: 
empty", storageAccount));
+Preconditions.checkState(
+adlsCredentials.size() == 1,
+"Invalid ADLS Credentials: only one ADLS credential should exist per 
storage-account");
+
+Credential adlsCredential = adlsCredentials.get(0);
+checkCredential(adlsCredential, AzureProperties.ADLS_SAS_TOKEN_PREFIX + 
storageAccount);
+checkCredential(
+adlsCredential, AzureProperties.ADLS_SAS_TOKEN_EXPIRES_AT_MS_PREFIX + 
storageAccount);
+
+String updatedSasToken =
+adlsCredential.config().get(AzureProperties.ADLS_SAS_TOKEN_PREFIX + 
storageAccount);
+Long tokenExpiresAtMillis =
+Long.parseLong(
+adlsCredential
+.config()
+.get(Az

Re: [I] [Discussion] googletest(gtest) or Catch2 [iceberg-cpp]

2024-12-16 Thread via GitHub


wgtmac commented on issue #12:
URL: https://github.com/apache/iceberg-cpp/issues/12#issuecomment-2545859560

   I don't have experience with Catch2 either, but I don't object to adopting it 
if it brings good stuff.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Azure: Support vended credentials refresh in ADLSFileIO. [iceberg]

2024-12-16 Thread via GitHub


ChaladiMohanVamsi commented on code in PR #11577:
URL: https://github.com/apache/iceberg/pull/11577#discussion_r1886977070


##
azure/src/main/java/org/apache/iceberg/azure/adlsv2/VendedAdlsCredentialProvider.java:
##
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.azure.adlsv2;
+
+import com.azure.core.credential.AzureSasCredential;
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.Future;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import org.apache.commons.io.IOUtils;
+import org.apache.iceberg.azure.AzureProperties;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.rest.ErrorHandlers;
+import org.apache.iceberg.rest.HTTPClient;
+import org.apache.iceberg.rest.RESTClient;
+import org.apache.iceberg.rest.RESTUtil;
+import org.apache.iceberg.rest.auth.OAuth2Properties;
+import org.apache.iceberg.rest.auth.OAuth2Util;
+import org.apache.iceberg.rest.credentials.Credential;
+import org.apache.iceberg.rest.responses.LoadCredentialsResponse;
+import org.apache.iceberg.util.Pair;
+import org.apache.iceberg.util.SerializableMap;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class VendedAdlsCredentialProvider implements Serializable, 
AutoCloseable {
+  private static final Logger LOG = 
LoggerFactory.getLogger(VendedAdlsCredentialProvider.class);
+
+  private static final String THREAD_PREFIX = "adls-vended-credential-refresh";
+  public static final String URI = "credentials.uri";
+
+  private final SerializableMap<String, String> properties;
+  private transient volatile Map<String, AzureSasCredentialRefresher> azureSasCredentialRefresherMap;
+  private transient volatile RESTClient client;
+  private transient volatile ScheduledExecutorService refreshExecutor;
+
+  public VendedAdlsCredentialProvider(Map<String, String> properties) {
+Preconditions.checkArgument(null != properties, "Invalid properties: 
null");
+Preconditions.checkArgument(null != properties.get(URI), "Invalid URI: 
null");
+this.properties = SerializableMap.copyOf(properties);
+azureSasCredentialRefresherMap = Maps.newHashMap();
+  }
+
+  public AzureSasCredential credentialForAccount(String storageAccount) {
+    Map<String, AzureSasCredentialRefresher> refresherForAccountMap =
+        azureSasCredentialRefresherMap();
+if (refresherForAccountMap.containsKey(storageAccount)) {
+  return refresherForAccountMap.get(storageAccount).azureSasCredential();
+} else {
+  AzureSasCredentialRefresher azureSasCredentialRefresher =
+  new AzureSasCredentialRefresher(
+  () -> this.sasTokenWithExpiration(storageAccount), 
credentialRefreshExecutor());
+  refresherForAccountMap.put(storageAccount, azureSasCredentialRefresher);
+  return azureSasCredentialRefresher.azureSasCredential();
+}
+  }
+
+  private Pair<String, Long> sasTokenWithExpiration(String storageAccount) {
+LoadCredentialsResponse response = fetchCredentials();
+    List<Credential> adlsCredentials =
+response.credentials().stream()
+.filter(c -> c.prefix().contains(storageAccount))
+.collect(Collectors.toList());
+Preconditions.checkState(
+!adlsCredentials.isEmpty(),
+String.format("Invalid ADLS Credentials for storage-account %s: 
empty", storageAccount));
+Preconditions.checkState(
+adlsCredentials.size() == 1,
+"Invalid ADLS Credentials: only one ADLS credential should exist per 
storage-account");
+
+Credential adlsCredential = adlsCredentials.get(0);
+checkCredential(adlsCredential, AzureProperties.ADLS_SAS_TOKEN_PREFIX + 
storageAccount);
+checkCredential(
+adlsCredential, AzureProperties.ADLS_SAS_TOKEN_EXPIRES_AT_MS_PREFIX + 
storageAccount);
+
+String updatedSasToken =
+adlsCredential.config().get(AzureProperties.ADLS_SAS_TOKEN_PREFIX + 
storageAccount);
+Long tokenExpiresAtMillis =
+Long.parseLong(
+adlsCredential
+.config()
+

Re: [PR] Deserialize NestedField initial-default and write-default Attributes [iceberg-python]

2024-12-16 Thread via GitHub


kevinjqliu commented on code in PR #1432:
URL: https://github.com/apache/iceberg-python/pull/1432#discussion_r1886982570


##
pyiceberg/types.py:
##
@@ -328,8 +328,8 @@ def __init__(
 data["type"] = data["type"] if "type" in data else field_type
 data["required"] = required
 data["doc"] = doc

Review Comment:
   curious if we should do the same here too
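   
   (For context, a minimal sketch of the defaulting pattern being referenced, 
applied to the alias keys from this PR; hypothetical, not the PR's code:)
   
   ```
   def merge_defaults(data: dict, initial_default=None, write_default=None) -> dict:
       # prefer an already-supplied alias key, fall back to the keyword argument
       data["initial-default"] = data.get("initial-default", initial_default)
       data["write-default"] = data.get("write-default", write_default)
       return data
   ```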



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deserialize NestedField initial-default and write-default Attributes [iceberg-python]

2024-12-16 Thread via GitHub


paulcichonski commented on code in PR #1432:
URL: https://github.com/apache/iceberg-python/pull/1432#discussion_r1886986206


##
tests/conftest.py:
##
@@ -149,6 +149,35 @@ def table_schema_simple() -> Schema:
 )
 
 
+@pytest.fixture(scope="session")

Review Comment:
   Sure, I can make that change and see what happens. I was hesitant because 
`table_schema_simple` seems to be used by lots of tests, so I wasn't sure of the 
implications. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deserialize NestedField initial-default and write-default Attributes [iceberg-python]

2024-12-16 Thread via GitHub


kevinjqliu commented on code in PR #1432:
URL: https://github.com/apache/iceberg-python/pull/1432#discussion_r1886988485


##
tests/conftest.py:
##
@@ -149,6 +149,35 @@ def table_schema_simple() -> Schema:
 )
 
 
+@pytest.fixture(scope="session")

Review Comment:
   +1. If it becomes too tedious, the current test is fine. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deserialize NestedField initial-default and write-default Attributes [iceberg-python]

2024-12-16 Thread via GitHub


paulcichonski commented on code in PR #1432:
URL: https://github.com/apache/iceberg-python/pull/1432#discussion_r1886987316


##
pyiceberg/types.py:
##
@@ -328,8 +328,8 @@ def __init__(
 data["type"] = data["type"] if "type" in data else field_type
 data["required"] = required
 data["doc"] = doc

Review Comment:
   I originally had that, but the tests seem to pass without it so I just kept 
it as is. Happy to change if you'd like. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] PyIceberg appending data creates snapshots incompatible with Athena/Spark [iceberg-python]

2024-12-16 Thread via GitHub


kevinjqliu commented on issue #1424:
URL: 
https://github.com/apache/iceberg-python/issues/1424#issuecomment-2545885155

   The above replicates the logic of `_generate_snapshot_id` 
   
https://github.com/apache/iceberg-python/blob/b981780d313f7fa6fb911381962fe00017073cfe/pyiceberg/table/metadata.py#L322-L333
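   
   A sketch of the kind of range check being discussed for the generated id 
(standalone Python; the upper bound is Java's Long.MAX_VALUE):
   
   ```
   JAVA_LONG_MAX = 9223372036854775807

   def validate_snapshot_id(snapshot_id: int) -> int:
       """Reject ids that engines storing snapshot ids as signed 64-bit
       longs would not be able to read back."""
       if not 0 <= snapshot_id <= JAVA_LONG_MAX:
           raise ValueError(f"Snapshot id out of range: {snapshot_id}")
       return snapshot_id
   ```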
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deserialize NestedField initial-default and write-default Attributes [iceberg-python]

2024-12-16 Thread via GitHub


kevinjqliu commented on code in PR #1432:
URL: https://github.com/apache/iceberg-python/pull/1432#discussion_r1886989722


##
pyiceberg/types.py:
##
@@ -328,8 +328,8 @@ def __init__(
 data["type"] = data["type"] if "type" in data else field_type
 data["required"] = required
 data["doc"] = doc

Review Comment:
   Weird, I'm trying to figure out when this bug occurs and see if it's present 
in other places in the codebase. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deserialize NestedField initial-default and write-default Attributes [iceberg-python]

2024-12-16 Thread via GitHub


paulcichonski commented on code in PR #1432:
URL: https://github.com/apache/iceberg-python/pull/1432#discussion_r1886992009


##
pyiceberg/types.py:
##
@@ -328,8 +328,8 @@ def __init__(
 data["type"] = data["type"] if "type" in data else field_type
 data["required"] = required
 data["doc"] = doc

Review Comment:
   Let me know what you find. I tried digging into Pydantic to understand why 
it happens, but ran out of time. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org


