Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2025-01-06 Thread via GitHub


wgtmac commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1904813642


##
cmake_modules/BuildUtils.cmake:
##
@@ -201,17 +202,26 @@ function(ADD_ICEBERG_LIB LIB_NAME)
 PUBLIC 
"$")
 endif()
 
-install(TARGETS ${LIB_NAME}_static ${INSTALL_IS_OPTIONAL}
-EXPORT ${LIB_NAME}_targets
+install(TARGETS ${LIB_NAME}_static
+EXPORT iceberg_targets
 ARCHIVE DESTINATION ${INSTALL_ARCHIVE_DIR}
 LIBRARY DESTINATION ${INSTALL_LIBRARY_DIR}
 RUNTIME DESTINATION ${INSTALL_RUNTIME_DIR}
 INCLUDES
 DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
   endif()
 
-  if(ARG_CMAKE_PACKAGE_NAME)
-iceberg_install_cmake_package(${ARG_CMAKE_PACKAGE_NAME} 
${LIB_NAME}_targets)
+  # generate export header file
+  string(TOUPPER ${LIB_NAME} LIB_NAME_UPPER)
+  if(BUILD_SHARED)
+generate_export_header(${LIB_NAME}_shared BASE_NAME ${LIB_NAME_UPPER})
+target_compile_definitions(${LIB_NAME}_shared PUBLIC 
${LIB_NAME}_shared_EXPORTS)

Review Comment:
   Thanks for the explanation! @kou 
   
   > is it OK that iceberg::Puffin exists in libiceberg_core.dll not 
libiceberg_puffin.dll?
   
   I think this depends on how we implement the puffin interfaces. It seems to 
me there are at least two different interpretations of `libiceberg_puffin`:
   
   1. `libiceberg_core` provides only the interface of the `iceberg::Puffin` reader 
and writer. `libiceberg_puffin` provides a concrete implementation of 
`iceberg::Puffin` together with roaring bitmap serialization and 
deserialization. This means that `libiceberg_puffin` cannot be used alone and 
should always depend on `libiceberg_core`.
   2. `libiceberg_puffin` provides the minimal building block of the puffin file 
format and does not depend on anything. In this case, `libiceberg_core` depends 
on `libiceberg_puffin` for a concrete implementation.
   
   Since the goal of `libiceberg_core` is to provide a lightweight library with 
minimal dependencies that implements the core logic of Apache Iceberg, and 
puffin is a required component (same as Parquet and Avro), I am a bit skeptical 
about the value of splitting out an individual `libiceberg_puffin` library at 
this moment. As this was a request from @gaborkaszab, could you please chime in 
to provide more context about your perspective? Thanks!






Re: [I] [BUG] pyiceberg hanging on multiprocessing [iceberg-python]

2025-01-06 Thread via GitHub


frankliee commented on issue #1488:
URL: 
https://github.com/apache/iceberg-python/issues/1488#issuecomment-2574263503

   @kevinjqliu 
   I used pystack to get the stack of the child process; it shows that 
`pyarrow.FileSystem` causes the hang.
   By the way, it is not easy for us to upgrade pyiceberg to 0.8.1 in our 
environment right away.
   
   ```
  for manifest_file in snapshot.manifests(self.io)
   (Python) File 
"/home/env/lib/python3.9/site-packages/pyiceberg/table/snapshots.py", line 255, 
in manifests
   return list(read_manifest_list(file))
   (Python) File 
"/home/env/lib/python3.9/site-packages/pyiceberg/manifest.py", line 659, in 
read_manifest_list
   with AvroFile[ManifestFile](
   (Python) File 
"/home/env/lib/python3.9/site-packages/pyiceberg/avro/file.py", line 170, in 
__enter__
   with self.input_file.open() as f:
   (Python) File 
"/home/env/lib/python3.9/site-packages/pyiceberg/io/pyarrow.py", line 272, in 
open
   input_file = self._filesystem.open_input_file(self._path)
   ```
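   
   A minimal sketch of a possible workaround, assuming the hang comes from 
`fork()`-ing worker processes after the parent has already created a pyarrow 
filesystem (whose internal threads and locks are not fork-safe). Using the 
`spawn` start method and building the catalog inside the child avoids 
inheriting that state; the catalog properties and table name below are 
hypothetical:
   
   ```python
   import multiprocessing as mp
   
   from pyiceberg.catalog import load_catalog
   
   
   def count_rows(catalog_props: dict, table_name: str) -> int:
       # Build the catalog (and its pyarrow FileSystem) inside the child
       # process instead of inheriting one from the parent across fork().
       catalog = load_catalog("default", **catalog_props)
       table = catalog.load_table(table_name)
       return table.scan().to_arrow().num_rows
   
   
   if __name__ == "__main__":
       props = {"uri": "http://localhost:8181"}  # hypothetical REST catalog config
       # "spawn" starts a fresh interpreter per worker, so no pyarrow
       # threads or locks from the parent are inherited at fork time.
       ctx = mp.get_context("spawn")
       with ctx.Pool(processes=2) as pool:
           print(pool.starmap(count_rows, [(props, "db.tbl")] * 2))
   ```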





Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2025-01-06 Thread via GitHub


wgtmac commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1904830435


##
cmake_modules/BuildUtils.cmake:
##
@@ -201,17 +202,26 @@ function(ADD_ICEBERG_LIB LIB_NAME)
 PUBLIC 
"$")
 endif()
 
-install(TARGETS ${LIB_NAME}_static ${INSTALL_IS_OPTIONAL}
-EXPORT ${LIB_NAME}_targets
+install(TARGETS ${LIB_NAME}_static
+EXPORT iceberg_targets
 ARCHIVE DESTINATION ${INSTALL_ARCHIVE_DIR}
 LIBRARY DESTINATION ${INSTALL_LIBRARY_DIR}
 RUNTIME DESTINATION ${INSTALL_RUNTIME_DIR}
 INCLUDES
 DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
   endif()
 
-  if(ARG_CMAKE_PACKAGE_NAME)
-iceberg_install_cmake_package(${ARG_CMAKE_PACKAGE_NAME} 
${LIB_NAME}_targets)
+  # generate export header file
+  string(TOUPPER ${LIB_NAME} LIB_NAME_UPPER)
+  if(BUILD_SHARED)
+generate_export_header(${LIB_NAME}_shared BASE_NAME ${LIB_NAME_UPPER})
+target_compile_definitions(${LIB_NAME}_shared PUBLIC 
${LIB_NAME}_shared_EXPORTS)

Review Comment:
   Thanks for your help!






Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2025-01-06 Thread via GitHub


kou commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1904828898


##
cmake_modules/BuildUtils.cmake:
##
@@ -201,17 +202,26 @@ function(ADD_ICEBERG_LIB LIB_NAME)
 PUBLIC 
"$")
 endif()
 
-install(TARGETS ${LIB_NAME}_static ${INSTALL_IS_OPTIONAL}
-EXPORT ${LIB_NAME}_targets
+install(TARGETS ${LIB_NAME}_static
+EXPORT iceberg_targets
 ARCHIVE DESTINATION ${INSTALL_ARCHIVE_DIR}
 LIBRARY DESTINATION ${INSTALL_LIBRARY_DIR}
 RUNTIME DESTINATION ${INSTALL_RUNTIME_DIR}
 INCLUDES
 DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
   endif()
 
-  if(ARG_CMAKE_PACKAGE_NAME)
-iceberg_install_cmake_package(${ARG_CMAKE_PACKAGE_NAME} 
${LIB_NAME}_targets)
+  # generate export header file
+  if(BUILD_SHARED)
+generate_export_header(${LIB_NAME}_shared BASE_NAME ${LIB_NAME})
+# target_compile_definitions(${LIB_NAME}_shared PRIVATE 
${LIB_NAME}_shared_EXPORTS)

Review Comment:
   ```suggestion
   ```






Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2025-01-06 Thread via GitHub


kou commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1904828726


##
cmake_modules/BuildUtils.cmake:
##
@@ -201,17 +202,26 @@ function(ADD_ICEBERG_LIB LIB_NAME)
 PUBLIC 
"$")
 endif()
 
-install(TARGETS ${LIB_NAME}_static ${INSTALL_IS_OPTIONAL}
-EXPORT ${LIB_NAME}_targets
+install(TARGETS ${LIB_NAME}_static
+EXPORT iceberg_targets
 ARCHIVE DESTINATION ${INSTALL_ARCHIVE_DIR}
 LIBRARY DESTINATION ${INSTALL_LIBRARY_DIR}
 RUNTIME DESTINATION ${INSTALL_RUNTIME_DIR}
 INCLUDES
 DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
   endif()
 
-  if(ARG_CMAKE_PACKAGE_NAME)
-iceberg_install_cmake_package(${ARG_CMAKE_PACKAGE_NAME} 
${LIB_NAME}_targets)
+  # generate export header file
+  string(TOUPPER ${LIB_NAME} LIB_NAME_UPPER)
+  if(BUILD_SHARED)
+generate_export_header(${LIB_NAME}_shared BASE_NAME ${LIB_NAME_UPPER})
+target_compile_definitions(${LIB_NAME}_shared PUBLIC 
${LIB_NAME}_shared_EXPORTS)

Review Comment:
   Thanks for explaining the context.
   How about discussing it as a follow-up task?






Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2025-01-06 Thread via GitHub


kou commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1904829176


##
cmake_modules/BuildUtils.cmake:
##
@@ -201,17 +202,26 @@ function(ADD_ICEBERG_LIB LIB_NAME)
 PUBLIC 
"$")
 endif()
 
-install(TARGETS ${LIB_NAME}_static ${INSTALL_IS_OPTIONAL}
-EXPORT ${LIB_NAME}_targets
+install(TARGETS ${LIB_NAME}_static
+EXPORT iceberg_targets
 ARCHIVE DESTINATION ${INSTALL_ARCHIVE_DIR}
 LIBRARY DESTINATION ${INSTALL_LIBRARY_DIR}
 RUNTIME DESTINATION ${INSTALL_RUNTIME_DIR}
 INCLUDES
 DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
   endif()
 
-  if(ARG_CMAKE_PACKAGE_NAME)
-iceberg_install_cmake_package(${ARG_CMAKE_PACKAGE_NAME} 
${LIB_NAME}_targets)
+  # generate export header file
+  string(TOUPPER ${LIB_NAME} LIB_NAME_UPPER)
+  if(BUILD_SHARED)
+generate_export_header(${LIB_NAME}_shared BASE_NAME ${LIB_NAME_UPPER})
+target_compile_definitions(${LIB_NAME}_shared PUBLIC 
${LIB_NAME}_shared_EXPORTS)

Review Comment:
   I think that we can merge this now.



##
cmake_modules/BuildUtils.cmake:
##
@@ -201,17 +202,26 @@ function(ADD_ICEBERG_LIB LIB_NAME)
 PUBLIC 
"$")
 endif()
 
-install(TARGETS ${LIB_NAME}_static ${INSTALL_IS_OPTIONAL}
-EXPORT ${LIB_NAME}_targets
+install(TARGETS ${LIB_NAME}_static
+EXPORT iceberg_targets
 ARCHIVE DESTINATION ${INSTALL_ARCHIVE_DIR}
 LIBRARY DESTINATION ${INSTALL_LIBRARY_DIR}
 RUNTIME DESTINATION ${INSTALL_RUNTIME_DIR}
 INCLUDES
 DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
   endif()
 
-  if(ARG_CMAKE_PACKAGE_NAME)
-iceberg_install_cmake_package(${ARG_CMAKE_PACKAGE_NAME} 
${LIB_NAME}_targets)
+  # generate export header file
+  string(TOUPPER ${LIB_NAME} LIB_NAME_UPPER)
+  if(BUILD_SHARED)
+generate_export_header(${LIB_NAME}_shared BASE_NAME ${LIB_NAME_UPPER})
+target_compile_definitions(${LIB_NAME}_shared PUBLIC 
${LIB_NAME}_shared_EXPORTS)

Review Comment:
   Thanks for explaining the context.
   How about discussing it as a follow-up task if it's needed?






[PR] feat(catalog): Add Catalog Registry [iceberg-go]

2025-01-06 Thread via GitHub


zeroshade opened a new pull request, #244:
URL: https://github.com/apache/iceberg-go/pull/244

   Allow flexibility for handling catalogs and catalog implementations by 
adding a Catalog Registry that makes it easy to add custom catalog 
implementations.
   
   The Registry allows registering an identifier mapped to a factory function, 
which is used to create a catalog on the fly; when no identifier is provided, 
the URI scheme is used to select the factory.
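   
   For illustration, a minimal language-neutral sketch of this registry pattern 
(shown in Python for brevity; the names are hypothetical and this is not the 
actual iceberg-go API):
   
   ```python
   from typing import Callable, Dict, Optional
   from urllib.parse import urlparse
   
   # A factory takes a catalog name plus a URI and returns a catalog instance.
   CatalogFactory = Callable[[str, str], object]
   
   _registry: Dict[str, CatalogFactory] = {}
   
   
   def register(identifier: str, factory: CatalogFactory) -> None:
       """Map an identifier (e.g. "rest", "glue") to a factory function."""
       _registry[identifier] = factory
   
   
   def load(uri: str, identifier: Optional[str] = None) -> object:
       """Create a catalog on the fly, falling back to the URI scheme
       when no identifier is provided explicitly."""
       key = identifier or urlparse(uri).scheme
       if key not in _registry:
           raise KeyError(f"no catalog factory registered for {key!r}")
       return _registry[key](key, uri)
   
   
   # Hypothetical usage: the "http" scheme selects the registered factory.
   register("http", lambda name, uri: f"catalog({name!r}, {uri!r})")
   print(load("http://localhost:8181/ws"))
   ```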





[I] [Bug] Error in overwrite(): pyarrow.lib.ArrowInvalid: offset overflow with large dataset (~3M rows, ~1GB) [iceberg-python]

2025-01-06 Thread via GitHub


sundaresanr opened a new issue, #1491:
URL: https://github.com/apache/iceberg-python/issues/1491

   ### Apache Iceberg version
   
   0.8.1 (latest release)
   
   ### Please describe the bug 🐞
   
   Encountered the following error while calling overwrite() on a dataset with 
over 3 million rows (~1GB parquet file size; ~6 GB pyarrow table size):
   
   ```
   pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
   ```
   
   Backtrace: 
   
   ```
   txn.overwrite(pat, overwrite_filter=overwrite_filter)
 File ".../site-packages/pyiceberg/table/__init__.py", line 470, in overwrite
   for data_file in data_files:
 File ".../site-packages/pyiceberg/io/pyarrow.py", line 2636, in _dataframe_to_data_files
   partitions = _determine_partitions(spec=table_metadata.spec(), schema=table_metadata.schema(), arrow_table=df)
 File ".../site-packages/pyiceberg/io/pyarrow.py", line 2726, in _determine_partitions
   arrow_table = arrow_table.take(sort_indices)
 File "pyarrow/table.pxi", line 2133, in pyarrow.lib._Tabular.take
 File ".../site-packages/pyarrow/compute.py", line 487, in take
   return call_function('take', [data, indices], options, memory_pool)
 File "pyarrow/_compute.pyx", line 590, in pyarrow._compute.call_function
 File "pyarrow/_compute.pyx", line 385, in pyarrow._compute.Function.call
 File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
 File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
   ```
   
   Seems to be related to:
   
   https://github.com/apache/arrow/issues/25822
   https://github.com/apache/arrow/issues/33049
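   
   One possible mitigation (a sketch, assuming the overflow comes from 32-bit 
string/binary offsets exceeding 2 GiB in a single chunk during the internal 
`take`) is to cast such columns to their `large_*` variants, which use 64-bit 
offsets, before calling `overwrite`. Whether this helps depends on where 
pyiceberg materializes the chunks:
   
   ```python
   import pyarrow as pa
   
   
   def to_large_types(table: pa.Table) -> pa.Table:
       """Cast 32-bit-offset string/binary columns to 64-bit-offset variants."""
       fields = []
       for f in table.schema:
           if pa.types.is_string(f.type):
               fields.append(pa.field(f.name, pa.large_string(), f.nullable))
           elif pa.types.is_binary(f.type):
               fields.append(pa.field(f.name, pa.large_binary(), f.nullable))
           else:
               fields.append(f)
       return table.cast(pa.schema(fields))
   
   
   # pat = to_large_types(pat)
   # txn.overwrite(pat, overwrite_filter=overwrite_filter)
   ```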
   
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time





Re: [I] Writing an arrow table with date64 unsupported [iceberg-python]

2025-01-06 Thread via GitHub


github-actions[bot] closed issue #830: Writing an arrow table with date64 
unsupported
URL: https://github.com/apache/iceberg-python/issues/830





Re: [I] Writing an arrow table with date64 unsupported [iceberg-python]

2025-01-06 Thread via GitHub


github-actions[bot] commented on issue #830:
URL: https://github.com/apache/iceberg-python/issues/830#issuecomment-2574156186

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'





Re: [I] Implement Human OAuth2 Flows for OAuth2Manager [iceberg]

2025-01-06 Thread via GitHub


github-actions[bot] commented on issue #10677:
URL: https://github.com/apache/iceberg/issues/10677#issuecomment-2574157406

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.





Re: [PR] Spark : Derive Stats From Manifest on the Fly [iceberg]

2025-01-06 Thread via GitHub


guykhazma commented on PR #11615:
URL: https://github.com/apache/iceberg/pull/11615#issuecomment-2574163828

   @huaxingao @RussellSpitzer friendly reminder: could you please review this PR?





Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2025-01-06 Thread via GitHub


kou commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1904773431


##
cmake_modules/BuildUtils.cmake:
##
@@ -201,17 +202,26 @@ function(ADD_ICEBERG_LIB LIB_NAME)
 PUBLIC 
"$")
 endif()
 
-install(TARGETS ${LIB_NAME}_static ${INSTALL_IS_OPTIONAL}
-EXPORT ${LIB_NAME}_targets
+install(TARGETS ${LIB_NAME}_static
+EXPORT iceberg_targets
 ARCHIVE DESTINATION ${INSTALL_ARCHIVE_DIR}
 LIBRARY DESTINATION ${INSTALL_LIBRARY_DIR}
 RUNTIME DESTINATION ${INSTALL_RUNTIME_DIR}
 INCLUDES
 DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
   endif()
 
-  if(ARG_CMAKE_PACKAGE_NAME)
-iceberg_install_cmake_package(${ARG_CMAKE_PACKAGE_NAME} 
${LIB_NAME}_targets)
+  # generate export header file
+  string(TOUPPER ${LIB_NAME} LIB_NAME_UPPER)
+  if(BUILD_SHARED)
+generate_export_header(${LIB_NAME}_shared BASE_NAME ${LIB_NAME_UPPER})
+target_compile_definitions(${LIB_NAME}_shared PUBLIC 
${LIB_NAME}_shared_EXPORTS)

Review Comment:
   If we include `iceberg/puffin.h` in `demo_table.cc`, then `demo_table.cc` 
exports the `iceberg::Puffin` declared in `puffin.h`. So `libiceberg_core.dll` 
has `iceberg::Puffin`.
   
   BTW, is it OK that `iceberg::Puffin` exists in `libiceberg_core.dll` and not 
in `libiceberg_puffin.dll`? If `iceberg::Puffin` should exist in 
`libiceberg_puffin.dll`, then it should be `iceberg::puffin::Puffin`.






Re: [I] [Bug] Cannot use PyIceberg with multiple FS [iceberg-python]

2025-01-06 Thread via GitHub


jiakai-li commented on issue #1041:
URL: 
https://github.com/apache/iceberg-python/issues/1041#issuecomment-2574308740

   @kevinjqliu I guess we can close this issue now? In the meantime, I'm keen 
to work on `write.data.path` and `write.metadata.path` if that's something we 
want to enable and no one else is currently working on it.





Re: [PR] bump version to 0.8.1 [iceberg-python]

2025-01-06 Thread via GitHub


Fokko commented on PR #1489:
URL: https://github.com/apache/iceberg-python/pull/1489#issuecomment-2573620664

   Thanks for addressing this. We bumped the version on the 0.8.x branch. How 
about moving this to 0.9.0 right away, since that would be the next version 
released from the main branch?





Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on code in PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#discussion_r1904467272


##
mkdocs/docs/api.md:
##
@@ -951,8 +951,10 @@ Using `add_column` you can add a column, without having to 
worry about the field
 with table.update_schema() as update:
 update.add_column("retries", IntegerType(), "Number of retries to place 
the bid")
 # In a struct
-update.add_column("details.confirmed_by", StringType(), "Name of the 
exchange")
+update.add_column("details", StructType())
+update.add_column(("details", "confirmed_by"), StringType(), "Name of the 
exchange")
 ```
+A complex type must exist before columns can added to it. Fields in complex 
types are added in a tuple.

Review Comment:
   ```suggestion
   A complex type must exist before columns can be added to it. Fields in 
complex types are added in a tuple.
   ```






[PR] Bump moto from 5.0.25 to 5.0.26 [iceberg-python]

2025-01-06 Thread via GitHub


dependabot[bot] opened a new pull request, #1490:
URL: https://github.com/apache/iceberg-python/pull/1490

   Bumps [moto](https://github.com/getmoto/moto) from 5.0.25 to 5.0.26.
   
   Changelog
   Sourced from [moto's changelog](https://github.com/getmoto/moto/blob/master/CHANGELOG.md).
   
   5.0.26
   Docker Digest for 5.0.26: sha256:1cae28be97cc87151ecabb531d1507b8dd3d52d3636b86143a16cccf4b5fcf43
   New Services:
   * Kinesis: delete_resource_policy(), get_resource_policy(), put_resource_policy()
   Miscellaneous:
   * DynamoDB: transact_write_items() now validates empty ExpressionAttributeValues
   * Logs: describe_log_groups() now returns the logStreamArn-property
   * Organizations now has additional validation around creation and deletion of organizations and accounts
   * SecretsManager: list_secrets() now properly splits words when filtering
   * StepFunctions: describe_state_machine() now takes Version ARN's
   * StepFunctions: describe_state_machine() now returns the versionDescription
   
   Commits
   * [846a535](https://github.com/getmoto/moto/commit/846a535d8c2d381877bfbd15f754f2378bdb737c) Pre-Release: Up Version Number
   * [699b55e](https://github.com/getmoto/moto/commit/699b55ed4a18c8a4e9e8c16c49195ad21ec81974) Prep release 5.0.26 ([#8471](https://redirect.github.com/getmoto/moto/issues/8471))
   * [e9991da](https://github.com/getmoto/moto/commit/e9991da07f6309b2b2cc02f3231d03a10195c254) SecretsManager: list_secrets() now properly splits words when filtering all v...
   * [6d39358](https://github.com/getmoto/moto/commit/6d39358e747308b1078799a23d466b900aa1c4f3) chore: update SSM default parameters ([#8468](https://redirect.github.com/getmoto/moto/issues/8468))
   * [5b80494](https://github.com/getmoto/moto/commit/5b804940e3a885bb46b5b5e827d3aab837df62ef) chore: update EC2 Instance Types ([#8467](https://redirect.github.com/getmoto/moto/issues/8467))
   * [87214ab](https://github.com/getmoto/moto/commit/87214ab79873185d28a57e55c73b1b1b30545b46) StepFunctions: Support Version Descriptions ([#8466](https://redirect.github.com/getmoto/moto/issues/8466))
   * [4886ac6](https://github.com/getmoto/moto/commit/4886ac685a4c96ff763bdc62927e786b0236aa38) Refactor test_rds.py ([#8461](https://redirect.github.com/getmoto/moto/issues/8461))
   * [1eadeeb](https://github.com/getmoto/moto/commit/1eadeeb683eabede8d322f6ae90857b399b73dcd) Kinesis resource based policy implementation ([#8463](https://redirect.github.com/getmoto/moto/issues/8463))
   * [ec277aa](https://github.com/getmoto/moto/commit/ec277aa25294839f5b9ce5cd346955050a4bbe8d) chore: update SSM Optimized AMI's ([#8457](https://redirect.github.com/getmoto/moto/issues/8457))
   * [d246ef2](https://github.com/getmoto/moto/commit/d246ef254a4df42b40f27df64c00fa12bc389c74) chore: update SSM default AMI's ([#8456](https://redirect.github.com/getmoto/moto/issues/8456))
   * Additional commits viewable in [compare view](https://github.com/getmoto/moto/compare/5.0.25...5.0.26)
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=moto&package-manager=pip&previous-version=5.0.25&new-version=5.0.26)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   

Re: [PR] Remove unneeded metadata read during update event generation [iceberg]

2025-01-06 Thread via GitHub


grantatspothero commented on code in PR #11829:
URL: https://github.com/apache/iceberg/pull/11829#discussion_r1904687147


##
core/src/test/java/org/apache/iceberg/TestTables.java:
##
@@ -255,8 +257,13 @@ void failCommits(int numFailures) {
   this.failCommits = numFailures;
 }
 
+int getMetadataFetchCount() {

Review Comment:
   Decided to count calls to current()/refresh() as a proxy for reading 
metadata from object storage. For TestTables (which uses in-memory metadata), 
that is the best we can do.
   
   Is there some other test I could write that uses real file system metadata 
(not TestTables)?









Re: [PR] Remove unneeded metadata read during update event generation [iceberg]

2025-01-06 Thread via GitHub


grantatspothero commented on code in PR #11829:
URL: https://github.com/apache/iceberg/pull/11829#discussion_r1904696125


##
core/src/test/java/org/apache/iceberg/TestTables.java:
##
@@ -255,8 +257,13 @@ void failCommits(int numFailures) {
   this.failCommits = numFailures;
 }
 
+int getMetadataFetchCount() {

Review Comment:
   Why counting current()/refresh() calls is not great: cached metadata fetches 
are free while uncached fetches are not, and the count does not differentiate 
between the two.
   
   Instrumenting at the FileIO level would let us validate that only a single 
uncached metadata request is needed for the whole commit.
   
   






Re: [I] java.lang.IllegalArgumentException: requirement failed: length (-6235972) cannot be smaller than -1 [iceberg]

2025-01-06 Thread via GitHub


justdoitvimal commented on issue #9689:
URL: https://github.com/apache/iceberg/issues/9689#issuecomment-2573677957

   Do we have a final conclusion on what exactly will help fix this issue? I 
also ran into it after using the maintenance job for a few months. I don't know 
the exact reason, but the problem seems a little strange to me.





Re: [PR] Use SupportsPrefixOperations for Remove OrphanFile Procedure [iceberg]

2025-01-06 Thread via GitHub


ismailsimsek commented on code in PR #11906:
URL: https://github.com/apache/iceberg/pull/11906#discussion_r1903140118


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java:
##
@@ -292,14 +293,37 @@ private Dataset<FileURI> validFileIdentDS() {
 
   private Dataset<FileURI> actualFileIdentDS() {
 StringToFileURI toFileURI = new StringToFileURI(equalSchemes, equalAuthorities);
+Dataset<String> dataList;
 if (compareToFileList == null) {
-  return toFileURI.apply(listedFileDS());
+  dataList =
+  table.io() instanceof SupportsPrefixOperations ? listWithPrefix() : listWithoutPrefix();
 } else {
-  return toFileURI.apply(filteredCompareToFileList());
+  dataList = filteredCompareToFileList();
 }
+
+return toFileURI.apply(dataList);
+  }
+
+  @VisibleForTesting
+  Dataset<String> listWithPrefix() {
+List<String> matchingFiles = Lists.newArrayList();
+PathFilter pathFilter = PartitionAwareHiddenPathFilter.forSpecs(table.specs());
+
+Iterator<org.apache.iceberg.io.FileInfo> iterator =
+((SupportsPrefixOperations) table.io()).listPrefix(location).iterator();
+while (iterator.hasNext()) {
+  org.apache.iceberg.io.FileInfo fileInfo = iterator.next();
+  if (fileInfo.createdAtMillis() < olderThanTimestamp
+  && pathFilter.accept(new Path(fileInfo.location()))) {
+matchingFiles.add(fileInfo.location());
+  }
+}
+JavaRDD<String> matchingFileRDD = sparkContext().parallelize(matchingFiles, 1);
+return spark().createDataset(matchingFileRDD.rdd(), Encoders.STRING());
   }
 
-  private Dataset<String> listedFileDS() {
+  @VisibleForTesting
+  Dataset<String> listWithoutPrefix() {

Review Comment:
   I had to make it package-private in order to test it via `@VisibleForTesting`; 
I'm not sure how else to call the `listWithoutPrefix()` method.
   
   By default, current executions use `HadoopFileIO`, which implements 
`DelegateFileIO` -> `SupportsPrefixOperations`, so they call the new 
`listWithPrefix()` method.
   
   [mentioned in this comment 
too](https://github.com/apache/iceberg/pull/7914#issuecomment-2212775830)






Re: [PR] Remove unneeded metadata read during update event generation [iceberg]

2025-01-06 Thread via GitHub


grantatspothero commented on code in PR #11829:
URL: https://github.com/apache/iceberg/pull/11829#discussion_r1904580753


##
core/src/test/java/org/apache/iceberg/util/TestReachableFileUtil.java:
##
@@ -114,7 +114,10 @@ public void testMetadataFileLocationsWithMissingFiles() {
 // delete v3.metadata.json making v2.metadata.json and v1.metadata.json 
inaccessible
 table.io().deleteFile(location);
 
-Set<String> metadataFileLocations = ReachableFileUtil.metadataFileLocations(table, true);
+// original hadoop table will not see the file deletion

Review Comment:
   This is a side effect of removing `refresh()` calls during update event 
generation: code that relied on `table.refresh()` being called for correctness 
will now break.
   
   This is mostly a problem for hadoop/filesystem tables, and those are 
deprecated: https://iceberg.apache.org/spec/#file-system-tables
   
   For metastore catalogs, for example, `table.refresh()` can be manually called 
after commits. This does not work for filesystem tables, hence the creation of 
a new table here.






Re: [PR] Open-API: Fix compilation errors in generated Java classes due to mismatched return types [iceberg]

2025-01-06 Thread via GitHub


VladimirYushkevich commented on code in PR #11806:
URL: https://github.com/apache/iceberg/pull/11806#discussion_r1904463285


##
open-api/rest-catalog-open-api.py:
##
@@ -981,8 +966,33 @@ class ValueMap(BaseModel):
 )
 
 
+class ContentFile(BaseModel):
+content: ContentEnum
+file_path: str = Field(..., alias='file-path')
+file_format: FileFormat = Field(..., alias='file-format')
+spec_id: int = Field(..., alias='spec-id')
+partition: List[PrimitiveTypeValue] = Field(
+...,
+description='A list of partition field values ordered based on the 
fields of the partition spec specified by the `spec-id`',
+example=[1, 'bar'],
+)
+file_size_in_bytes: int = Field(
+..., alias='file-size-in-bytes', description='Total file size in bytes'
+)
+record_count: int = Field(
+..., alias='record-count', description='Number of records in the file'
+)
+key_metadata: Optional[BinaryTypeValue] = Field(
+None, alias='key-metadata', description='Encryption key metadata blob'
+)
+split_offsets: Optional[List[int]] = Field(
+None, alias='split-offsets', description='List of splittable offsets'
+)
+sort_order_id: Optional[int] = Field(None, alias='sort-order-id')
+
+
 class DataFile(ContentFile):
-content: Literal['data']
+content: ContentEnum

Review Comment:
   you mean the .py change? I ran the tests as mentioned here: 
https://github.com/apache/iceberg/tree/main/open-api#running-compatibility-tests






Re: [I] Variant Data Type Support [iceberg]

2025-01-06 Thread via GitHub


aihuaxu commented on issue #10392:
URL: https://github.com/apache/iceberg/issues/10392#issuecomment-2573670168

   > @aihuaxu Will this also work for protobuf encoded columns? I have a 
dataset with event_bytes | event_name
   > 
   > 101010100 | e1 101010100 | e2
   
   That shouldn't work. Variant has its own specific encoding. We need to store 
the value in a Variant column, which will convert it and store it in Variant's 
specific binary format. Protobuf encoding is stored differently.





Re: [PR] Spec: Support geo type [iceberg]

2025-01-06 Thread via GitHub


paleolimbot commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1904570442


##
format/spec.md:
##
@@ -584,8 +589,8 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | _optional_ | **`110  null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
 | _optional_ | _optional_ | _optional_ | **`137  nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |
 | _optional_ | _optional_ | _optional_ | **`111  distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |
-| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] |
-| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] |
+| _optional_ | _optional_ | _optional_ | **`125  lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2]. See [7] for `geometry` and [8] for `geography`. |
+| _optional_ | _optional_ | _optional_ | **`128  upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2]. See [9] for `geometry` and [10] for `geography`. |

Review Comment:
   I like this! I think it's clear what it means and how to implement it. Thank 
you for iterating!






Re: [PR] Core: add variant type support [iceberg]

2025-01-06 Thread via GitHub


aihuaxu commented on code in PR #11831:
URL: https://github.com/apache/iceberg/pull/11831#discussion_r1904474514


##
core/src/test/java/org/apache/iceberg/TestMetadataUpdateParser.java:
##
@@ -52,6 +56,15 @@ public class TestMetadataUpdateParser {
   Types.NestedField.required(1, "id", Types.IntegerType.get()),
   Types.NestedField.optional(2, "data", Types.StringType.get()));
 
+  private static final Schema ID_VARIANTDATA_SCHEMA =

Review Comment:
   Let me know if we need to add a TestSchemaParser class. I'm not sure whether 
we already cover these tests indirectly with TestMetadataUpdateParser.java and 
intentionally did not add TestSchemaParser.java.



##
core/src/main/java/org/apache/iceberg/SchemaParser.java:
##
@@ -42,6 +42,7 @@ private SchemaParser() {}
   private static final String STRUCT = "struct";
   private static final String LIST = "list";
   private static final String MAP = "map";
+  private static final String VARIANT = "variant";

Review Comment:
   @rdblue  Can you help check if the new change works for you? 



##
api/src/main/java/org/apache/iceberg/transforms/Identity.java:
##
@@ -39,7 +42,7 @@ class Identity implements Transform {
   @Deprecated
   public static  Identity get(Type type) {
 Preconditions.checkArgument(
-type.typeId() != Type.TypeID.VARIANT, "Unsupported type for identity: 
%s", type);
+!UNSUPPORTED_TYPES.contains(type), "Unsupported type for identity: 
%s", type);

Review Comment:
   I don't follow this change suggestion. Let me revert the unrelated change; 
we can add it back if needed.
   
   > I would normally structure it as the contains at first






Re: [I] TypeError when `operation` field is missing in `summary`. [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on issue #1106:
URL: 
https://github.com/apache/iceberg-python/issues/1106#issuecomment-2573699614

   resolved by #1263





Re: [I] TypeError when `operation` field is missing in `summary`. [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu closed issue #1106: TypeError when `operation` field is missing in 
`summary`.
URL: https://github.com/apache/iceberg-python/issues/1106





Re: [PR] Use ExternalTypeInfo in Rowconverter code instead of deprecated TableSchema.getFieldTypes [iceberg]

2025-01-06 Thread via GitHub


stevenzwu merged PR #11838:
URL: https://github.com/apache/iceberg/pull/11838





Re: [PR] backport #11301(rowconverter) to Flink 1.19 and 1.18 [iceberg]

2025-01-06 Thread via GitHub


stevenzwu commented on PR #11826:
URL: https://github.com/apache/iceberg/pull/11826#issuecomment-2573964980

   I don't understand why this backport depends on PR #11838.





Re: [PR] Use compatible column name to set Parquet bloom filter [iceberg]

2025-01-06 Thread via GitHub


huaxingao commented on PR #11799:
URL: https://github.com/apache/iceberg/pull/11799#issuecomment-2573977829

   @RussellSpitzer 
   
   > Looks like tests are not passing?
   
 I looked at the failed test again. The reason it failed is that the bloom 
filter is set on a field of the struct type, `struct_not_null._int_field`. When 
we use:
   
   ```
   String colPath = makeCompatibleName(entry.getKey());
   ```
   `makeCompatibleName` changes `struct_not_null._int_field` to 
`struct_not_null_x2E_int_field`, which we actually don't want. If the entry 
contains a period, we could check whether it is a field of a complex type and 
only apply `makeCompatibleName` to the field name. However, I feel it's 
probably simpler to use my original approach: get the fieldId of the entry, and 
then get the corresponding Parquet path for that fieldId.





Re: [PR] feat: support S3 Table Buckets with S3TablesCatalog [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on code in PR #1429:
URL: https://github.com/apache/iceberg-python/pull/1429#discussion_r190406


##
tests/catalog/test_s3tables.py:
##
@@ -0,0 +1,227 @@
+import pytest

Review Comment:
   that's a good point; we have integration tests marked for `gcs`: 
https://github.com/apache/iceberg-python/blob/551f524170b12900cfaa3fef1ec8a0f9f437ee4c/tests/io/test_fsspec.py#L479






Re: [PR] Hive: Optimize viewExists API in hive catalog [iceberg]

2025-01-06 Thread via GitHub


dramaticlly commented on code in PR #11813:
URL: https://github.com/apache/iceberg/pull/11813#discussion_r1904501804


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveViewCatalog.java:
##
@@ -125,6 +125,51 @@ public void testHiveViewAndIcebergViewWithSameName() 
throws TException, IOExcept
 .hasMessageStartingWith("Not an iceberg view: 
hive.hivedb.test_hive_view");
   }
 
+  @Test
+  public void testHiveViewExists() throws IOException, TException {
+String dbName = "hivedb";
+Namespace ns = Namespace.of(dbName);
+String viewName = "test_hive_view_exists";
+TableIdentifier identifier = TableIdentifier.of(ns, viewName);
+TableIdentifier invalidIdentifier = TableIdentifier.of(dbName, "invalid", 
viewName);
+if (requiresNamespaceCreate()) {
+  catalog.createNamespace(identifier.namespace());
+}
+
+assertThat(catalog.viewExists(invalidIdentifier))
+.as("Should return false on invalid view identifier")
+.isFalse();
+assertThat(catalog.viewExists(identifier)).as("View should not exist 
before create").isFalse();
+
+catalog
+.buildView(identifier)
+.withSchema(SCHEMA)
+.withDefaultNamespace(ns)
+.withQuery("hive", "select * from hivedb.tbl")
+.create();
+assertThat(catalog.viewExists(identifier)).as("View should exist after 
create").isTrue();
+
+catalog.dropView(identifier);
+assertThat(catalog.viewExists(identifier)).as("View should not exist after 
drop").isFalse();
+
+// create a hive table
+Table hiveTable =
+createHiveView(
+viewName, dbName, 
Files.createTempDirectory("hive-view-tests-name").toString());
+HIVE_METASTORE_EXTENSION.metastoreClient().createTable(hiveTable);
+assertThat(catalog.viewExists(identifier))
+.as("ViewExists should return false if identifier refers to a 
non-iceberg view")
+.isFalse();
+HIVE_METASTORE_EXTENSION.metastoreClient().dropTable(dbName, viewName);

Review Comment:
   thank you @pvary, I added test scenarios to validate the behavior of 
`viewExists` when there are existing Hive views, Hive tables, and Iceberg 
tables in the catalog. Hope this is what you are looking for.






Re: [PR] Core: add variant builder implementation [iceberg]

2025-01-06 Thread via GitHub


aihuaxu commented on PR #11857:
URL: https://github.com/apache/iceberg/pull/11857#issuecomment-2574053859

   @rdblue, @RussellSpitzer Please help review. Thanks.





Re: [PR] feat: support S3 Table Buckets with S3TablesCatalog [iceberg-python]

2025-01-06 Thread via GitHub


HonahX commented on code in PR #1429:
URL: https://github.com/apache/iceberg-python/pull/1429#discussion_r1904584229


##
pyiceberg/catalog/s3tables.py:
##
@@ -0,0 +1,324 @@
+import re
+from typing import TYPE_CHECKING, List, Optional, Set, Tuple, Union
+
+import boto3
+
+from pyiceberg.catalog import DEPRECATED_BOTOCORE_SESSION, MetastoreCatalog, 
PropertiesUpdateSummary
+from pyiceberg.exceptions import (
+CommitFailedException,
+InvalidNamespaceName,
+InvalidTableName,
+NamespaceNotEmptyError,
+NoSuchNamespaceError,
+NoSuchTableError,
+S3TablesError,
+TableAlreadyExistsError,
+TableBucketNotFound,
+)
+from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, 
AWS_SESSION_TOKEN, load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import CommitTableResponse, Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.table.update import TableRequirement, TableUpdate
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+from pyiceberg.utils.properties import get_first_property_value
+
+if TYPE_CHECKING:
+import pyarrow as pa
+
+S3TABLES_PROFILE_NAME = "s3tables.profile-name"
+S3TABLES_REGION = "s3tables.region"
+S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
+S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
+S3TABLES_SESSION_TOKEN = "s3tables.session-token"
+
+S3TABLES_TABLE_BUCKET_ARN = "s3tables.table-bucket-arn"
+
+S3TABLES_ENDPOINT = "s3tables.endpoint"
+
+# for naming rules see: 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-naming.html
+S3TABLES_VALID_NAME_REGEX = pattern = re.compile("[a-z0-9][a-z0-9_]{2,62}")
+S3TABLES_RESERVED_NAMESPACE = "aws_s3_metadata"
+
+
+class S3TableCatalog(MetastoreCatalog):
+def __init__(self, name: str, **properties: str):
+super().__init__(name, **properties)
+
+self.table_bucket_arn = self.properties[S3TABLES_TABLE_BUCKET_ARN]
+
+session = boto3.Session(
+profile_name=properties.get(S3TABLES_PROFILE_NAME),
+region_name=get_first_property_value(properties, S3TABLES_REGION, 
AWS_REGION),
+botocore_session=properties.get(DEPRECATED_BOTOCORE_SESSION),
+aws_access_key_id=get_first_property_value(properties, 
S3TABLES_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
+aws_secret_access_key=get_first_property_value(properties, 
S3TABLES_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
+aws_session_token=get_first_property_value(properties, 
S3TABLES_SESSION_TOKEN, AWS_SESSION_TOKEN),
+)
+try:
+self.s3tables = session.client("s3tables", 
endpoint_url=properties.get(S3TABLES_ENDPOINT))
+except boto3.session.UnknownServiceError as e:
+raise S3TablesError("'s3tables' requires boto3>=1.35.74. Current 
version: {boto3.__version__}.") from e
+
+try:
+
self.s3tables.get_table_bucket(tableBucketARN=self.table_bucket_arn)
+except self.s3tables.exceptions.NotFoundException as e:
+raise TableBucketNotFound(e) from e
+
+def commit_table(
+self, table: Table, requirements: Tuple[TableRequirement, ...], 
updates: Tuple[TableUpdate, ...]
+) -> CommitTableResponse:

Review Comment:
   I did not find the logic for the case when the table does not exist, which means `create_table_transaction` will not be supported in the current version.
   
   https://github.com/apache/iceberg-python/blob/e41c428e852db78459890bab2c29aee9d13097b8/pyiceberg/catalog/__init__.py#L754-L765
   
   We do not have to support everything in the initial PR, but it would be good to override `create_table_transaction` to raise "Not Implemented" for the s3tables catalog, as sketched below.
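   
   A minimal sketch of such an override (the signature mirrors the base `Catalog.create_table_transaction` linked above; it reuses the imports already at the top of `s3tables.py`, plus `CreateTableTransaction` from `pyiceberg.table`):
   
   ```python
   # a sketch only, not part of the PR; raises until transactional create is supported
   def create_table_transaction(
       self,
       identifier: Union[str, Identifier],
       schema: Union[Schema, "pa.Schema"],
       location: Optional[str] = None,
       partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC,
       sort_order: SortOrder = UNSORTED_SORT_ORDER,
       properties: Properties = EMPTY_DICT,
   ) -> "CreateTableTransaction":
       raise NotImplementedError("create_table_transaction is not yet supported for S3 Tables")
   ```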



##
pyiceberg/catalog/s3tables.py:
##
@@ -0,0 +1,324 @@
+import re
+from typing import TYPE_CHECKING, List, Optional, Set, Tuple, Union
+
+import boto3
+
+from pyiceberg.catalog import DEPRECATED_BOTOCORE_SESSION, MetastoreCatalog, 
PropertiesUpdateSummary
+from pyiceberg.exceptions import (
+CommitFailedException,
+InvalidNamespaceName,
+InvalidTableName,
+NamespaceNotEmptyError,
+NoSuchNamespaceError,
+NoSuchTableError,
+S3TablesError,
+TableAlreadyExistsError,
+TableBucketNotFound,
+)
+from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, 
AWS_SESSION_TOKEN, load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import CommitTableResponse, Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.table.update import TableRequirement, TableUpdate

Re: [PR] feat: support S3 Table Buckets with S3TablesCatalog [iceberg-python]

2025-01-06 Thread via GitHub


HonahX commented on code in PR #1429:
URL: https://github.com/apache/iceberg-python/pull/1429#discussion_r1904875003


##
pyiceberg/catalog/s3tables.py:
##
@@ -0,0 +1,324 @@
+import re
+from typing import TYPE_CHECKING, List, Optional, Set, Tuple, Union
+
+import boto3
+
+from pyiceberg.catalog import DEPRECATED_BOTOCORE_SESSION, MetastoreCatalog, 
PropertiesUpdateSummary
+from pyiceberg.exceptions import (
+CommitFailedException,
+InvalidNamespaceName,
+InvalidTableName,
+NamespaceNotEmptyError,
+NoSuchNamespaceError,
+NoSuchTableError,
+S3TablesError,
+TableAlreadyExistsError,
+TableBucketNotFound,
+)
+from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, 
AWS_SESSION_TOKEN, load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import CommitTableResponse, Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.table.update import TableRequirement, TableUpdate
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+from pyiceberg.utils.properties import get_first_property_value
+
+if TYPE_CHECKING:
+import pyarrow as pa
+
+S3TABLES_PROFILE_NAME = "s3tables.profile-name"
+S3TABLES_REGION = "s3tables.region"
+S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
+S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
+S3TABLES_SESSION_TOKEN = "s3tables.session-token"
+
+S3TABLES_TABLE_BUCKET_ARN = "s3tables.table-bucket-arn"
+
+S3TABLES_ENDPOINT = "s3tables.endpoint"
+
+# for naming rules see: 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-naming.html
+S3TABLES_VALID_NAME_REGEX = re.compile("[a-z0-9][a-z0-9_]{2,62}")
+S3TABLES_RESERVED_NAMESPACE = "aws_s3_metadata"
+
+
+class S3TableCatalog(MetastoreCatalog):
+def __init__(self, name: str, **properties: str):
+super().__init__(name, **properties)
+
+self.table_bucket_arn = self.properties[S3TABLES_TABLE_BUCKET_ARN]
+
+session = boto3.Session(
+profile_name=properties.get(S3TABLES_PROFILE_NAME),
+region_name=get_first_property_value(properties, S3TABLES_REGION, 
AWS_REGION),
+botocore_session=properties.get(DEPRECATED_BOTOCORE_SESSION),
+aws_access_key_id=get_first_property_value(properties, 
S3TABLES_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
+aws_secret_access_key=get_first_property_value(properties, 
S3TABLES_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
+aws_session_token=get_first_property_value(properties, 
S3TABLES_SESSION_TOKEN, AWS_SESSION_TOKEN),
+)
+try:
+self.s3tables = session.client("s3tables", 
endpoint_url=properties.get(S3TABLES_ENDPOINT))
+except boto3.session.UnknownServiceError as e:
+raise S3TablesError(f"'s3tables' requires boto3>=1.35.74. Current version: {boto3.__version__}.") from e
+
+try:
+
self.s3tables.get_table_bucket(tableBucketARN=self.table_bucket_arn)
+except self.s3tables.exceptions.NotFoundException as e:
+raise TableBucketNotFound(e) from e
+
+def commit_table(
+self, table: Table, requirements: Tuple[TableRequirement, ...], 
updates: Tuple[TableUpdate, ...]
+) -> CommitTableResponse:
+table_identifier = table.name()
+database_name, table_name = 
self.identifier_to_database_and_table(table_identifier, NoSuchTableError)
+
+current_table, version_token = 
self._load_table_and_version(identifier=table_identifier)
+
+updated_staged_table = self._update_and_stage_table(current_table, 
table_identifier, requirements, updates)
+if current_table and updated_staged_table.metadata == 
current_table.metadata:
+# no changes, do nothing
+return CommitTableResponse(metadata=current_table.metadata, 
metadata_location=current_table.metadata_location)
+
+self._write_metadata(
+metadata=updated_staged_table.metadata,
+io=updated_staged_table.io,
+metadata_path=updated_staged_table.metadata_location,
+overwrite=True,
+)
+
+# try to update metadata location which will fail if the versionToken 
changed meanwhile
+try:
+self.s3tables.update_table_metadata_location(
+tableBucketARN=self.table_bucket_arn,
+namespace=database_name,
+name=table_name,
+versionToken=version_token,
+metadataLocation=updated_staged_table.metadata_location,
+)
+except self.s3tables.exceptions.ConflictException as e:
+raise CommitFailedException(
+f"Cannot commit {database_name}.{table_name} because of a 
concurrent update to the table versio

[I] Question about Decimal Type Limitations in Iceberg [iceberg]

2025-01-06 Thread via GitHub


rice668 opened a new issue, #11920:
URL: https://github.com/apache/iceberg/issues/11920

   ### Query engine
   
   Spark
   
   ### Question
   
   Hello Iceberg Community,
   
   https://github.com/user-attachments/assets/573a1a35-a3da-45a2-bb7e-10ec378dfd4c
   
   I would like to ask why Iceberg imposes the restriction on changing the 
scale for the decimal type. According to the SQL standard, the scale can indeed 
be changed. For instance, the SQL standard states that:
   
   
![WeChatWorkScreenshot_e28f4f34-1eca-42bf-b370-c96e335ff958](https://github.com/user-attachments/assets/283fe262-61af-487f-a5df-dd2e7ee60f22)
   
   Given this, I am curious about the rationale behind Iceberg's decision to 
restrict the change of scale for decimal types.
   
   Thank you!
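   
   For a concrete illustration of the current rule (a sketch using pyiceberg's schema evolution API; the catalog and table names are made up): Iceberg allows widening a decimal's precision, but rejects any change to the scale, since the scale determines how stored values, statistics, and bounds are interpreted, so changing it would silently reinterpret existing data.
   
   ```python
   from pyiceberg.catalog import load_catalog
   from pyiceberg.types import DecimalType
   
   catalog = load_catalog("default")        # hypothetical catalog
   table = catalog.load_table("db.prices")  # hypothetical table with an "amount" decimal(10, 2) column
   
   with table.update_schema() as update:
       # allowed: widening precision only, decimal(10, 2) -> decimal(12, 2)
       update.update_column("amount", field_type=DecimalType(12, 2))
   
   with table.update_schema() as update:
       # rejected: changing the scale, decimal(12, 2) -> decimal(12, 4),
       # would reinterpret every stored value, so this raises
       update.update_column("amount", field_type=DecimalType(12, 4))
   ```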


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Kafka Connect: Add table to topics mapping property [iceberg]

2025-01-06 Thread via GitHub


yornstei commented on PR #10422:
URL: https://github.com/apache/iceberg/pull/10422#issuecomment-2574366593

   This solution seems best to me; it is the most explicit and least restrictive. I'll be using this in my fork. Thanks @igorvoltaic.
   I also agree this isn't related to #11313, which is for dynamic routing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Kafka Connect: Add the configuration option to provide a transactional id prefix to use [iceberg]

2025-01-06 Thread via GitHub


bryanck commented on PR #11780:
URL: https://github.com/apache/iceberg/pull/11780#issuecomment-2574371981

   I'll take a look this week, thanks for your patience!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Kafka Connect: Add table to topics mapping property [iceberg]

2025-01-06 Thread via GitHub


bryanck commented on PR #10422:
URL: https://github.com/apache/iceberg/pull/10422#issuecomment-2574373145

   Thanks @yornstei, good to know you found this useful. I was wondering if you 
had an opinion on https://github.com/apache/iceberg/pull/11623?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support S3 Table Buckets with S3TablesCatalog [iceberg-python]

2025-01-06 Thread via GitHub


HonahX commented on code in PR #1429:
URL: https://github.com/apache/iceberg-python/pull/1429#discussion_r1904713991


##
pyiceberg/catalog/s3tables.py:
##
@@ -0,0 +1,324 @@
+import re
+from typing import TYPE_CHECKING, List, Optional, Set, Tuple, Union
+
+import boto3
+
+from pyiceberg.catalog import DEPRECATED_BOTOCORE_SESSION, MetastoreCatalog, 
PropertiesUpdateSummary
+from pyiceberg.exceptions import (
+CommitFailedException,
+InvalidNamespaceName,
+InvalidTableName,
+NamespaceNotEmptyError,
+NoSuchNamespaceError,
+NoSuchTableError,
+S3TablesError,
+TableAlreadyExistsError,
+TableBucketNotFound,
+)
+from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, 
AWS_SESSION_TOKEN, load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import CommitTableResponse, Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.table.update import TableRequirement, TableUpdate
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+from pyiceberg.utils.properties import get_first_property_value
+
+if TYPE_CHECKING:
+import pyarrow as pa
+
+S3TABLES_PROFILE_NAME = "s3tables.profile-name"
+S3TABLES_REGION = "s3tables.region"
+S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
+S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
+S3TABLES_SESSION_TOKEN = "s3tables.session-token"
+
+S3TABLES_TABLE_BUCKET_ARN = "s3tables.table-bucket-arn"
+
+S3TABLES_ENDPOINT = "s3tables.endpoint"
+
+# for naming rules see: 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-naming.html
+S3TABLES_VALID_NAME_REGEX = re.compile("[a-z0-9][a-z0-9_]{2,62}")
+S3TABLES_RESERVED_NAMESPACE = "aws_s3_metadata"
+
+
+class S3TableCatalog(MetastoreCatalog):
+def __init__(self, name: str, **properties: str):
+super().__init__(name, **properties)
+
+self.table_bucket_arn = self.properties[S3TABLES_TABLE_BUCKET_ARN]
+
+session = boto3.Session(
+profile_name=properties.get(S3TABLES_PROFILE_NAME),
+region_name=get_first_property_value(properties, S3TABLES_REGION, 
AWS_REGION),
+botocore_session=properties.get(DEPRECATED_BOTOCORE_SESSION),
+aws_access_key_id=get_first_property_value(properties, 
S3TABLES_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
+aws_secret_access_key=get_first_property_value(properties, 
S3TABLES_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
+aws_session_token=get_first_property_value(properties, 
S3TABLES_SESSION_TOKEN, AWS_SESSION_TOKEN),
+)
+try:
+self.s3tables = session.client("s3tables", 
endpoint_url=properties.get(S3TABLES_ENDPOINT))
+except boto3.session.UnknownServiceError as e:
+raise S3TablesError(f"'s3tables' requires boto3>=1.35.74. Current version: {boto3.__version__}.") from e
+
+try:
+
self.s3tables.get_table_bucket(tableBucketARN=self.table_bucket_arn)
+except self.s3tables.exceptions.NotFoundException as e:
+raise TableBucketNotFound(e) from e
+
+def commit_table(
+self, table: Table, requirements: Tuple[TableRequirement, ...], 
updates: Tuple[TableUpdate, ...]
+) -> CommitTableResponse:
+table_identifier = table.name()
+database_name, table_name = 
self.identifier_to_database_and_table(table_identifier, NoSuchTableError)
+
+current_table, version_token = 
self._load_table_and_version(identifier=table_identifier)
+
+updated_staged_table = self._update_and_stage_table(current_table, 
table_identifier, requirements, updates)
+if current_table and updated_staged_table.metadata == 
current_table.metadata:
+# no changes, do nothing
+return CommitTableResponse(metadata=current_table.metadata, 
metadata_location=current_table.metadata_location)
+
+self._write_metadata(
+metadata=updated_staged_table.metadata,
+io=updated_staged_table.io,
+metadata_path=updated_staged_table.metadata_location,
+overwrite=True,
+)
+
+# try to update metadata location which will fail if the versionToken 
changed meanwhile
+try:
+self.s3tables.update_table_metadata_location(
+tableBucketARN=self.table_bucket_arn,
+namespace=database_name,
+name=table_name,
+versionToken=version_token,
+metadataLocation=updated_staged_table.metadata_location,
+)
+except self.s3tables.exceptions.ConflictException as e:
+raise CommitFailedException(
+f"Cannot commit {database_name}.{table_name} because of a 
concurrent update to the table versio
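   
   For readers following along, the `versionToken` in the quoted code implements optimistic concurrency (compare-and-swap): the token is read together with the state the change is based on, and the update succeeds only if no one committed in between. A stripped-down sketch of the pattern, assuming the boto3 `s3tables` calls quoted above return the token as documented (argument values are hypothetical):
   
   ```python
   import boto3
   
   from pyiceberg.exceptions import CommitFailedException
   
   s3tables = boto3.Session().client("s3tables")
   
   def swap_metadata_location(arn: str, namespace: str, name: str, new_location: str) -> None:
       # read the current version token along with the current metadata location
       current = s3tables.get_table(tableBucketARN=arn, namespace=namespace, name=name)
       token = current["versionToken"]
       # ... write the new metadata file to new_location here ...
       try:
           # succeeds only if the token still matches, i.e. no concurrent commit happened
           s3tables.update_table_metadata_location(
               tableBucketARN=arn,
               namespace=namespace,
               name=name,
               versionToken=token,
               metadataLocation=new_location,
           )
       except s3tables.exceptions.ConflictException as e:
           raise CommitFailedException(f"concurrent update to {namespace}.{name}") from e
   ```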

Re: [PR] Kafka Connect: Add table to topics mapping property [iceberg]

2025-01-06 Thread via GitHub


yornstei commented on PR #10422:
URL: https://github.com/apache/iceberg/pull/10422#issuecomment-2574383989

   > Thanks @yornstei, good to know you found this useful. I was wondering if 
you had an opinion on #11623?
   
   I took a look at that one too. From a configuration perspective, it seems better, and the pattern aligns more with the connector's other configs of specifying the `route-regex` on each table. From a code perspective, there's quite a bit more going on there, which I'm hesitant to merge/maintain in my fork.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Update a TableSchema from a Schema [iceberg-rust]

2025-01-06 Thread via GitHub


Lordworms commented on issue #698:
URL: https://github.com/apache/iceberg-rust/issues/698#issuecomment-2574388365

   I would like to do this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Use Snapshot's statistics file in SparkScan [iceberg]

2025-01-06 Thread via GitHub


jeesou commented on PR #11040:
URL: https://github.com/apache/iceberg/pull/11040#issuecomment-2574390763

   Hi @karuppayya @amogh-jahagirdar, friendly reminder: please take a look at the comments when you get a chance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add iceberg_arrow library [iceberg-cpp]

2025-01-06 Thread via GitHub


wgtmac commented on code in PR #6:
URL: https://github.com/apache/iceberg-cpp/pull/6#discussion_r1904261495


##
cmake_modules/BuildUtils.cmake:
##
@@ -201,17 +202,26 @@ function(ADD_ICEBERG_LIB LIB_NAME)
 PUBLIC 
"$")
 endif()
 
-install(TARGETS ${LIB_NAME}_static ${INSTALL_IS_OPTIONAL}
-EXPORT ${LIB_NAME}_targets
+install(TARGETS ${LIB_NAME}_static
+EXPORT iceberg_targets
 ARCHIVE DESTINATION ${INSTALL_ARCHIVE_DIR}
 LIBRARY DESTINATION ${INSTALL_LIBRARY_DIR}
 RUNTIME DESTINATION ${INSTALL_RUNTIME_DIR}
 INCLUDES
 DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
   endif()
 
-  if(ARG_CMAKE_PACKAGE_NAME)
-iceberg_install_cmake_package(${ARG_CMAKE_PACKAGE_NAME} 
${LIB_NAME}_targets)
+  # generate export header file
+  string(TOUPPER ${LIB_NAME} LIB_NAME_UPPER)
+  if(BUILD_SHARED)
+generate_export_header(${LIB_NAME}_shared BASE_NAME ${LIB_NAME_UPPER})
+target_compile_definitions(${LIB_NAME}_shared PUBLIC 
${LIB_NAME}_shared_EXPORTS)

Review Comment:
   Thanks! These suggestions seem to work. Why does including `iceberg/puffin.h` fix this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Support virtual addressing style in PyArrowFileIO [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu closed issue #21: Support virtual addressing style in PyArrowFileIO
URL: https://github.com/apache/iceberg-python/issues/21


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Support virtual addressing style in PyArrowFileIO [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on issue #21:
URL: https://github.com/apache/iceberg-python/issues/21#issuecomment-2573312668

   closed by 
https://github.com/apache/iceberg-python/pull/1392/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR377-R378


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Spark 3.5: Implement RewriteTablePath [iceberg]

2025-01-06 Thread via GitHub


szehon-ho opened a new pull request, #11555:
URL: https://github.com/apache/iceberg/pull/11555

   This is the implementation for #10920 (an action to prepare metadata for an 
Iceberg table for DR copy)
   
   This has been used in production for a while in our setup, although support for rewrite of V2 position deletes is new. I performed the following cleanups while contributing it.
   
   - Made RewriteTablePathSparkAction code more functional (avoid using member variables on the action to track state)
   - Moved some RewriteTablePathSparkAction code to core util classes to avoid having to make some classes public, as was previously done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Implement RewriteTablePath [iceberg]

2025-01-06 Thread via GitHub


szehon-ho closed pull request #11555: Spark 3.5: Implement RewriteTablePath
URL: https://github.com/apache/iceberg/pull/11555


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#issuecomment-2573362375

   I'm having trouble running the new statements in the docs, could you give it a try?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


jeppe-dos commented on PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#issuecomment-2573564249

   > I'm having trouble running the new statements in the docs, could you give it a try?
   
   The code doesn't work, as "confirmed_by" has been renamed to "exchange". Exchange can therefore not be moved before confirmed_by, as confirmed_by no longer exists. I have changed the renamed field to processed_by to make this a bit clearer.
   
   In your opinion, should you be able to copy the whole documentation and have it work in sequence? It wasn't the case before, but I can change it to be the case if you would like.
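   
   For reference, a sketch of the corrected sequence being discussed (field names follow this thread; illustrative only, not the final docs):
   
   ```python
   with table.update_schema() as update:
       # rename the nested field; after this, "confirmed_by" no longer exists
       update.rename_column(("details", "confirmed_by"), "processed_by")
   
   with table.update_schema() as update:
       # later steps must reference the new name
       update.move_before(("details", "processed_by"), ("details", "exchange"))
   ```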


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add Hive 4 support and remove Hive runtime [iceberg]

2025-01-06 Thread via GitHub


pvary commented on PR #11750:
URL: https://github.com/apache/iceberg/pull/11750#issuecomment-2572593083

   > Spark (and other modules) also depend on `TestHiveMetastore` from test 
modules of `iceberg-hive-metastore`. We cannot use old Hive dependency from 
Spark to run this class due to API changes.
   
   `TestHiveMetastore` is test-only code. If we use a shaded version for testing, with our own Hive version behind it, then it should be fine.
   
   For the production code:
   I would argue that if there are compatibility issues with the HMS client (I don't know of any ATM), then, since we use our own shaded Avro version to write the manifest files, HiveCatalog should probably use its own shaded HMS client too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Backport #11557 to Flink 1.19 and 1.18 [iceberg]

2025-01-06 Thread via GitHub


pvary merged PR #11834:
URL: https://github.com/apache/iceberg/pull/11834


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Backport #11557 to Flink 1.19 and 1.18 [iceberg]

2025-01-06 Thread via GitHub


pvary commented on PR #11834:
URL: https://github.com/apache/iceberg/pull/11834#issuecomment-2572785020

   Merged to main.
   Thanks for the backport @huyuanfeng2018!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Optimize viewExists API in hive catalog [iceberg]

2025-01-06 Thread via GitHub


pvary commented on code in PR #11813:
URL: https://github.com/apache/iceberg/pull/11813#discussion_r1903964817


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveViewCatalog.java:
##
@@ -125,6 +125,51 @@ public void testHiveViewAndIcebergViewWithSameName() 
throws TException, IOExcept
 .hasMessageStartingWith("Not an iceberg view: 
hive.hivedb.test_hive_view");
   }
 
+  @Test
+  public void testHiveViewExists() throws IOException, TException {
+String dbName = "hivedb";
+Namespace ns = Namespace.of(dbName);
+String viewName = "test_hive_view_exists";
+TableIdentifier identifier = TableIdentifier.of(ns, viewName);
+TableIdentifier invalidIdentifier = TableIdentifier.of(dbName, "invalid", 
viewName);
+if (requiresNamespaceCreate()) {
+  catalog.createNamespace(identifier.namespace());
+}
+
+assertThat(catalog.viewExists(invalidIdentifier))
+.as("Should return false on invalid view identifier")
+.isFalse();
+assertThat(catalog.viewExists(identifier)).as("View should not exist 
before create").isFalse();
+
+catalog
+.buildView(identifier)
+.withSchema(SCHEMA)
+.withDefaultNamespace(ns)
+.withQuery("hive", "select * from hivedb.tbl")
+.create();
+assertThat(catalog.viewExists(identifier)).as("View should exist after 
create").isTrue();
+
+catalog.dropView(identifier);

Review Comment:
   Why did you deviate here from the table tests? Why not check the return 
value of the drop?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Optimize viewExists API in hive catalog [iceberg]

2025-01-06 Thread via GitHub


pvary commented on code in PR #11813:
URL: https://github.com/apache/iceberg/pull/11813#discussion_r1903966574


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveViewCatalog.java:
##
@@ -125,6 +125,51 @@ public void testHiveViewAndIcebergViewWithSameName() 
throws TException, IOExcept
 .hasMessageStartingWith("Not an iceberg view: 
hive.hivedb.test_hive_view");
   }
 
+  @Test
+  public void testHiveViewExists() throws IOException, TException {
+String dbName = "hivedb";
+Namespace ns = Namespace.of(dbName);
+String viewName = "test_hive_view_exists";
+TableIdentifier identifier = TableIdentifier.of(ns, viewName);
+TableIdentifier invalidIdentifier = TableIdentifier.of(dbName, "invalid", 
viewName);
+if (requiresNamespaceCreate()) {
+  catalog.createNamespace(identifier.namespace());
+}
+
+assertThat(catalog.viewExists(invalidIdentifier))
+.as("Should return false on invalid view identifier")
+.isFalse();
+assertThat(catalog.viewExists(identifier)).as("View should not exist 
before create").isFalse();
+
+catalog
+.buildView(identifier)
+.withSchema(SCHEMA)
+.withDefaultNamespace(ns)
+.withQuery("hive", "select * from hivedb.tbl")
+.create();
+assertThat(catalog.viewExists(identifier)).as("View should exist after 
create").isTrue();
+
+catalog.dropView(identifier);
+assertThat(catalog.viewExists(identifier)).as("View should not exist after 
drop").isFalse();
+
+// create a hive table

Review Comment:
   nit: create a hive view?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Optimize viewExists API in hive catalog [iceberg]

2025-01-06 Thread via GitHub


pvary commented on code in PR #11813:
URL: https://github.com/apache/iceberg/pull/11813#discussion_r1903970848


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveViewCatalog.java:
##
@@ -125,6 +125,51 @@ public void testHiveViewAndIcebergViewWithSameName() 
throws TException, IOExcept
 .hasMessageStartingWith("Not an iceberg view: 
hive.hivedb.test_hive_view");
   }
 
+  @Test
+  public void testHiveViewExists() throws IOException, TException {
+String dbName = "hivedb";
+Namespace ns = Namespace.of(dbName);
+String viewName = "test_hive_view_exists";
+TableIdentifier identifier = TableIdentifier.of(ns, viewName);
+TableIdentifier invalidIdentifier = TableIdentifier.of(dbName, "invalid", 
viewName);
+if (requiresNamespaceCreate()) {
+  catalog.createNamespace(identifier.namespace());
+}
+
+assertThat(catalog.viewExists(invalidIdentifier))
+.as("Should return false on invalid view identifier")
+.isFalse();
+assertThat(catalog.viewExists(identifier)).as("View should not exist 
before create").isFalse();
+
+catalog
+.buildView(identifier)
+.withSchema(SCHEMA)
+.withDefaultNamespace(ns)
+.withQuery("hive", "select * from hivedb.tbl")
+.create();
+assertThat(catalog.viewExists(identifier)).as("View should exist after 
create").isTrue();
+
+catalog.dropView(identifier);
+assertThat(catalog.viewExists(identifier)).as("View should not exist after 
drop").isFalse();
+
+// create a hive table
+Table hiveTable =
+createHiveView(
+viewName, dbName, 
Files.createTempDirectory("hive-view-tests-name").toString());
+HIVE_METASTORE_EXTENSION.metastoreClient().createTable(hiveTable);
+assertThat(catalog.viewExists(identifier))
+.as("ViewExists should return false if identifier refers to a 
non-iceberg view")
+.isFalse();
+HIVE_METASTORE_EXTENSION.metastoreClient().dropTable(dbName, viewName);

Review Comment:
   Maybe I am too detail-oriented, but would it be worth adding?
   ```
   HIVE_METASTORE_EXTENSION
   .metastoreClient()
   .createTable(createHiveTable(testTableName, 
TableType.EXTERNAL_TABLE));
   assertThat(catalog.viewExists(identifier))
   .as("ViewExists should return false if identifier refers to a 
non-iceberg view")
   .isFalse();
   HIVE_METASTORE_EXTENSION.metastoreClient().dropTable(DB_NAME, 
testTableName);
   ```
   
   And similarly to HiveTableTest.testTableExists:
   ```
   Table hiveTable =
   createHiveView(
   viewName, dbName, 
Files.createTempDirectory("hive-view-tests-name").toString());
   HIVE_METASTORE_EXTENSION.metastoreClient().createTable(hiveTable);
   assertThat(catalog.tableExists(identifier))
   .as("Should return false when a hive table with the same name 
exists")
   .isFalse();
   assertThat(catalog.tableExists(metadataIdentifier))
   .as("Metadata table should not exist")
   .isFalse();
   HIVE_METASTORE_EXTENSION.metastoreClient().dropTable(dbName, viewName);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Count rows as a metadata only operation [iceberg-python]

2025-01-06 Thread via GitHub


tusharchou commented on PR #1388:
URL: https://github.com/apache/iceberg-python/pull/1388#issuecomment-2572804832

   Hi @Fokko @kevinjqliu @gli-chris-hao ,
   
   I have implemented these suggestions to the best of my understanding.
   
   - [x] residual evaluator
   - [x] positional deletes
   - [x] batch processing of files larger than 512 MB
   
   It would be helpful to get a fresh review; a rough sketch of the overall idea follows below.
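   
   For reviewers skimming this, a rough sketch of the idea (not the PR's actual code; it assumes scan tasks expose the residual expression this PR adds, and `read_and_count` stands in for the fallback read path):
   
   ```python
   from pyiceberg.expressions import AlwaysTrue
   
   def count(table, row_filter=AlwaysTrue()) -> int:
       rows = 0
       for task in table.scan(row_filter=row_filter).plan_files():
           if isinstance(task.residual, AlwaysTrue) and not task.delete_files:
               # metadata-only: the whole file matches and no deletes apply,
               # so the manifest's record count is the answer
               rows += task.file.record_count
           else:
               # fall back: open this file and count the matching rows
               rows += read_and_count(task)  # hypothetical helper
       return rows
   ```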
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Add ability to pickle a `Table` [iceberg-python]

2025-01-06 Thread via GitHub


Fokko opened a new issue, #513:
URL: https://github.com/apache/iceberg-python/issues/513

   ### Feature Request / Improvement
   
   This allows distribution of the Table object within Ray.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


jeppe-dos commented on PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#issuecomment-2573318869

   What if you create the struct first, and then add the nested field like so: 
   
   ```
   with table.update_schema() as update:
   update.add_column("retries", IntegerType(), "Number of retries to place 
the bid")
   # In a struct
   update.add_column("details", StructType())
   
   with table.update_schema() as update:
   update.add_column(("details", "confirmed_by"), StringType(), "Name of 
the exchange")
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#issuecomment-2573325406

   That works, but I think the first example should work too. We can track this in a separate issue.
   
   
   ```
   >>> with table.update_schema() as update:
   ... update.add_column("retries", IntegerType(), "Number of retries to 
place the bid")
   ... # In a struct
   ... update.add_column("details", StructType())
   ...
   
   
   >>> with table.update_schema() as update:
   ... update.add_column(("details", "confirmed_by"), StringType(), "Name 
of the exchange")
   ...
   
   >>> print(table.refresh().schema())
   table {
 1: city: optional string
 2: lat: optional double
 3: long: optional double
 4: retries: optional int (Number of retries to place the bid)
 5: details: optional struct<6: confirmed_by: optional string (Name of the 
exchange)>
   }
   ```
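   
   For context, the reason the docs moved to tuples, as far as I understand pyiceberg's handling of ambiguous names (a sketch):
   
   ```python
   from pyiceberg.types import StringType
   
   with table.update_schema() as update:
       # a tuple names the nested field unambiguously
       update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")
   
   with table.update_schema() as update:
       # a dotted string could equally mean a top-level column whose name
       # contains a dot, so add_column rejects it and asks for a tuple
       update.add_column("details.confirmed_by", StringType())  # raises ValueError
   ```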


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Implement RewriteTablePath [iceberg]

2025-01-06 Thread via GitHub


szehon-ho commented on code in PR #11555:
URL: https://github.com/apache/iceberg/pull/11555#discussion_r1903293186


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java:
##
@@ -0,0 +1,731 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.UUID;
+import java.util.stream.Collectors;
+import org.apache.iceberg.ContentFile;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.ManifestFile;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.RewriteTablePathUtil;
+import org.apache.iceberg.RewriteTablePathUtil.PositionDeleteReaderWriter;
+import org.apache.iceberg.RewriteTablePathUtil.RewriteResult;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.SerializableTable;
+import org.apache.iceberg.Snapshot;
+import org.apache.iceberg.StaticTableOperations;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableMetadata.MetadataLogEntry;
+import org.apache.iceberg.TableMetadataParser;
+import org.apache.iceberg.actions.ImmutableRewriteTablePath;
+import org.apache.iceberg.actions.RewriteTablePath;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.data.avro.DataReader;
+import org.apache.iceberg.data.avro.DataWriter;
+import org.apache.iceberg.data.orc.GenericOrcReader;
+import org.apache.iceberg.data.orc.GenericOrcWriter;
+import org.apache.iceberg.data.parquet.GenericParquetReaders;
+import org.apache.iceberg.data.parquet.GenericParquetWriter;
+import org.apache.iceberg.deletes.PositionDeleteWriter;
+import org.apache.iceberg.exceptions.RuntimeIOException;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.io.DeleteSchemaUtil;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.orc.ORC;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.spark.JobGroupInfo;
+import org.apache.iceberg.util.Pair;
+import org.apache.spark.api.java.function.ForeachFunction;
+import org.apache.spark.api.java.function.MapFunction;
+import org.apache.spark.api.java.function.ReduceFunction;
+import org.apache.spark.broadcast.Broadcast;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoder;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import scala.Tuple2;
+
+public class RewriteTablePathSparkAction extends 
BaseSparkAction
+implements RewriteTablePath {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(RewriteTablePathSparkAction.class);
+  private static final String RESULT_LOCATION = "file-list";
+
+  private String sourcePrefix;
+  private String targetPrefix;
+  private String startVersionName;
+  private String endVersionName;
+  private String stagingDir;
+
+  private final Table table;
+
+  RewriteTablePathSparkAction(SparkSession spark, Table table) {
+super(spark);
+this.table = table;
+  }
+
+  @Override
+  protected RewriteTablePath self() {
+return this;
+  }
+
+  @Override
+  public RewriteTablePath rewriteLocationPrefix(String sPrefix, String 
tPrefix) {
+Preconditions.checkArgument(
+sPrefix != null && !sPrefix.isEmpty(), "Source prefix('%s') cannot be 
empty.", sPrefix);
+this.sourcePrefix = sPrefix;
+this.targetPrefix = tPrefix;
+return this;
+  }
+
+  @Override
+  public RewriteTabl

Re: [PR] feat: support S3 Table Buckets with S3TablesCatalog [iceberg-python]

2025-01-06 Thread via GitHub


felixscherz commented on code in PR #1429:
URL: https://github.com/apache/iceberg-python/pull/1429#discussion_r1904397511


##
tests/catalog/test_s3tables.py:
##
@@ -0,0 +1,173 @@
+import pytest
+
+from pyiceberg.catalog.s3tables import S3TableCatalog
+from pyiceberg.exceptions import NoSuchNamespaceError, NoSuchTableError, 
TableBucketNotFound
+from pyiceberg.schema import Schema
+from pyiceberg.types import IntegerType
+
+
+@pytest.fixture
+def database_name(database_name: str) -> str:
+# naming rules prevent "-" in namespaces for s3 table buckets
+return database_name.replace("-", "_")
+
+
+@pytest.fixture
+def table_name(table_name: str) -> str:
+# naming rules prevent "-" in table names for s3 table buckets
+return table_name.replace("-", "_")
+
+
+@pytest.fixture
+def table_bucket_arn() -> str:
+import os
+
+# since the moto library does not support s3tables as of 2024-12-14 we 
have to test against a real AWS endpoint
+# in one of the supported regions.
+
+return os.environ["ARN"]
+
+
+@pytest.fixture
+def catalog(table_bucket_arn: str) -> S3TableCatalog:
+properties = {"s3tables.table-bucket-arn": table_bucket_arn, 
"s3tables.region": "us-east-1", "s3.region": "us-east-1"}
+return S3TableCatalog(name="test_s3tables_catalog", **properties)
+
+
+def test_creating_catalog_validates_s3_table_bucket_exists(table_bucket_arn: 
str) -> None:
+properties = {"s3tables.table-bucket-arn": f"{table_bucket_arn}-modified", 
"s3tables.region": "us-east-1"}
+with pytest.raises(TableBucketNotFound):
+S3TableCatalog(name="test_s3tables_catalog", **properties)
+
+
+def test_create_namespace(catalog: S3TableCatalog, database_name: str) -> None:
+catalog.create_namespace(namespace=database_name)
+namespaces = catalog.list_namespaces()
+assert (database_name,) in namespaces
+
+
+def test_load_namespace_properties(catalog: S3TableCatalog, database_name: 
str) -> None:
+catalog.create_namespace(namespace=database_name)
+assert database_name in 
catalog.load_namespace_properties(database_name)["namespace"]
+
+
+def test_drop_namespace(catalog: S3TableCatalog, database_name: str) -> None:
+catalog.create_namespace(namespace=database_name)
+assert (database_name,) in catalog.list_namespaces()
+catalog.drop_namespace(namespace=database_name)
+assert (database_name,) not in catalog.list_namespaces()
+
+
+def test_create_table(catalog: S3TableCatalog, database_name: str, table_name: 
str, table_schema_nested: Schema) -> None:
+identifier = (database_name, table_name)
+
+catalog.create_namespace(namespace=database_name)
+table = catalog.create_table(identifier=identifier, 
schema=table_schema_nested)
+
+assert table == catalog.load_table(identifier)
+
+
+def test_create_table_in_invalid_namespace_raises_exception(
+catalog: S3TableCatalog, database_name: str, table_name: str, 
table_schema_nested: Schema
+) -> None:
+identifier = (database_name, table_name)
+
+with pytest.raises(NoSuchNamespaceError):
+catalog.create_table(identifier=identifier, schema=table_schema_nested)
+
+
+def test_table_exists(catalog: S3TableCatalog, database_name: str, table_name: 
str, table_schema_nested: Schema) -> None:
+identifier = (database_name, table_name)
+
+catalog.create_namespace(namespace=database_name)
+assert not catalog.table_exists(identifier=identifier)
+catalog.create_table(identifier=identifier, schema=table_schema_nested)
+assert catalog.table_exists(identifier=identifier)
+
+
+def test_rename_table(catalog: S3TableCatalog, database_name: str, table_name: 
str, table_schema_nested: Schema) -> None:
+identifier = (database_name, table_name)
+
+catalog.create_namespace(namespace=database_name)
+catalog.create_table(identifier=identifier, schema=table_schema_nested)
+
+to_database_name = f"{database_name}new"
+to_table_name = f"{table_name}new"
+to_identifier = (to_database_name, to_table_name)
+catalog.create_namespace(namespace=to_database_name)
+catalog.rename_table(from_identifier=identifier, 
to_identifier=to_identifier)
+
+assert not catalog.table_exists(identifier=identifier)
+assert catalog.table_exists(identifier=to_identifier)
+
+
+def test_list_tables(catalog: S3TableCatalog, database_name: str, table_name: 
str, table_schema_nested: Schema) -> None:
+identifier = (database_name, table_name)
+
+catalog.create_namespace(namespace=database_name)
+assert not catalog.list_tables(namespace=database_name)
+catalog.create_table(identifier=identifier, schema=table_schema_nested)
+assert catalog.list_tables(namespace=database_name)
+
+
+def test_drop_table(catalog: S3TableCatalog, database_name: str, table_name: 
str, table_schema_nested: Schema) -> None:
+identifier = (database_name, table_name)
+
+catalog.create_namespace(namespace=database_name)
+catalog.create_table(identifier=identifier, schema=
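   
   One possible refinement for the `table_bucket_arn` fixture above, so the suite skips cleanly instead of failing with a `KeyError` when no real bucket is configured (a sketch; the `ARN` variable name follows the test code above):
   
   ```python
   import os
   
   import pytest
   
   @pytest.fixture
   def table_bucket_arn() -> str:
       # moto cannot emulate s3tables yet, so these tests need a real table bucket;
       # skip instead of raising when the environment is not set up
       arn = os.environ.get("ARN")
       if arn is None:
           pytest.skip("set ARN to an S3 table bucket ARN to run s3tables tests")
       return arn
   ```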

Re: [PR] Spark 3.5: Implement RewriteTablePath [iceberg]

2025-01-06 Thread via GitHub


szehon-ho commented on code in PR #11555:
URL: https://github.com/apache/iceberg/pull/11555#discussion_r1903293655


##
core/src/main/java/org/apache/iceberg/io/DeleteSchemaUtil.java:
##
@@ -43,4 +43,15 @@ public static Schema pathPosSchema() {
   public static Schema posDeleteSchema(Schema rowSchema) {
 return rowSchema == null ? pathPosSchema() : pathPosSchema(rowSchema);
   }
+
+  public static Schema posDeleteReadSchema(Schema rowSchema) {

Review Comment:
   Somehow after the rebase this is needed for position delete rewrite (there 
must be some intervening change related to delete readers).  Previously this 
used the method above `pathPosSchema(rowSchema)` for the read schema, which has 
'row' as required.  This would fail saying 'row' is required but not found in 
the delete file, as 'row' is usually not set.
   
   Note that Spark and all readers now actually seem to no longer include the 
'row' field in the read schema 
https://github.com/apache/iceberg/blob/main/data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java#L70.
 
   
   But here, I do want to read the 'row' field and preserve it if it is set by 
some engine. 
So I am taking the strategy of RewritePositionDelete and actually reading 
this field, but as optional to avoid the assert error if it is not found.  
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/PositionDeletesTable.java#L118.
 (the reader there is derived from schema of metadata table 
PositionDeletesTable).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Implement RewriteTablePath [iceberg]

2025-01-06 Thread via GitHub


szehon-ho commented on code in PR #11555:
URL: https://github.com/apache/iceberg/pull/11555#discussion_r1903293655


##
core/src/main/java/org/apache/iceberg/io/DeleteSchemaUtil.java:
##
@@ -43,4 +43,15 @@ public static Schema pathPosSchema() {
   public static Schema posDeleteSchema(Schema rowSchema) {
 return rowSchema == null ? pathPosSchema() : pathPosSchema(rowSchema);
   }
+
+  public static Schema posDeleteReadSchema(Schema rowSchema) {

Review Comment:
   Somehow after the rebase this is needed for position delete rewrite (there 
must be some intervening change related to delete readers).  Previously this 
used the method above `pathPosSchema(rowSchema)` for the read schema, which has 
'row' as required.  This would fail saying 'row' is required but not found in 
the delete file, as 'row' is usually not set.
   
   Note that Spark and all readers actually don't include the 'row' field in 
the read schema 
https://github.com/apache/iceberg/blob/main/data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java#L70.
 
   
   But here, I do want to read the 'row' field and preserve it if it is set by 
some engine. 
So I am taking the strategy of RewritePositionDelete and actually reading 
this field, but as optional to avoid the assert error if it is not found.  
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/PositionDeletesTable.java#L118.
 (the reader there is derived from schema of metadata table 
PositionDeletesTable).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Implement RewriteTablePath [iceberg]

2025-01-06 Thread via GitHub


szehon-ho commented on code in PR #11555:
URL: https://github.com/apache/iceberg/pull/11555#discussion_r1902646277


##
core/src/main/java/org/apache/iceberg/util/ContentFileUtil.java:
##
@@ -60,28 +69,63 @@ public static CharSequence referencedDataFile(DeleteFile 
deleteFile) {
   return deleteFile.referencedDataFile();
 }
 
-int pathId = MetadataColumns.DELETE_FILE_PATH.fieldId();
-Type pathType = MetadataColumns.DELETE_FILE_PATH.type();
-
 Map<Integer, ByteBuffer> lowerBounds = deleteFile.lowerBounds();
-ByteBuffer lowerPathBound = lowerBounds != null ? lowerBounds.get(pathId) : null;
+ByteBuffer lowerPathBound = lowerBounds != null ? lowerBounds.get(PATH_ID) : null;
 if (lowerPathBound == null) {
   return null;
 }
 
 Map<Integer, ByteBuffer> upperBounds = deleteFile.upperBounds();
-ByteBuffer upperPathBound = upperBounds != null ? upperBounds.get(pathId) : null;
+ByteBuffer upperPathBound = upperBounds != null ? upperBounds.get(PATH_ID) : null;
 if (upperPathBound == null) {
   return null;
 }
 
 if (lowerPathBound.equals(upperPathBound)) {
-  return Conversions.fromByteBuffer(pathType, lowerPathBound);
+  return Conversions.fromByteBuffer(PATH_TYPE, lowerPathBound);
 } else {
   return null;
 }
   }
 
+  /**
+   * Replace file_path reference for a delete file entry
+   *
+   * @param deleteFile delete file whose entry will be replaced
+   * @param sourcePrefix source prefix which will be replaced
+   * @param targetPrefix target prefix which will replace it
+   * @return metrics for the new delete file entry
+   */
+  public static Metrics replacePathBounds(
+  DeleteFile deleteFile, String sourcePrefix, String targetPrefix) {
+Preconditions.checkArgument(
+deleteFile.content() == FileContent.POSITION_DELETES,
+"Only position delete files supported");
+
+Map<Integer, ByteBuffer> lowerBounds = deleteFile.lowerBounds();
+ByteBuffer lowerPathBound = lowerBounds != null ? lowerBounds.get(PATH_ID) : null;
+if (lowerPathBound == null) {
+  return metricsWithoutPathBounds(deleteFile);
+}
+
+Map<Integer, ByteBuffer> upperBounds = deleteFile.upperBounds();
+ByteBuffer upperPathBound = upperBounds != null ? upperBounds.get(PATH_ID) : null;
+if (upperPathBound == null) {
+  return metricsWithoutPathBounds(deleteFile);
+}
+
+if (lowerPathBound.equals(upperPathBound)) {

Review Comment:
   For more context: I think having the same upper/lower bound is related to 
file-scoped delete files, which it seems will become the default soon, ref: 
https://lists.apache.org/thread/2mpmdp4ns66gp595c9b3clstgskbslsp. Hence my 
thought that it's not worth copying the metrics for the other case (which doesn't 
even attempt filtering).


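   A minimal Python stand-in for the bound check being discussed (not the Java 
implementation; the names are illustrative):
   
   ```python
   from typing import Dict, Optional
   
   def single_referenced_path(
       lower_bounds: Optional[Dict[int, bytes]],
       upper_bounds: Optional[Dict[int, bytes]],
       path_field_id: int,
   ) -> Optional[bytes]:
       """Return the single referenced data file path, or None if indeterminate."""
       lower = lower_bounds.get(path_field_id) if lower_bounds else None
       upper = upper_bounds.get(path_field_id) if upper_bounds else None
       if lower is None or upper is None:
           return None
       # Equal bounds mean the delete file is file-scoped: every row references
       # the same data file, so its path prefix can be rewritten safely.
       return lower if lower == upper else None
   ```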

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support S3 Table Buckets with S3TablesCatalog [iceberg-python]

2025-01-06 Thread via GitHub


felixscherz commented on code in PR #1429:
URL: https://github.com/apache/iceberg-python/pull/1429#discussion_r1904401166


##
pyiceberg/catalog/s3tables.py:
##
@@ -0,0 +1,322 @@
+import re
+from typing import TYPE_CHECKING, List, Optional, Set, Tuple, Union
+
+import boto3
+
+from pyiceberg.catalog import DEPRECATED_BOTOCORE_SESSION, MetastoreCatalog, 
PropertiesUpdateSummary
+from pyiceberg.exceptions import (
+CommitFailedException,
+InvalidNamespaceName,
+InvalidTableName,
+NamespaceNotEmptyError,
+NoSuchNamespaceError,
+NoSuchTableError,
+S3TablesError,
+TableAlreadyExistsError,
+TableBucketNotFound,
+)
+from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, 
AWS_SESSION_TOKEN, load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import CommitTableResponse, Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.table.update import TableRequirement, TableUpdate
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+from pyiceberg.utils.properties import get_first_property_value
+
+if TYPE_CHECKING:
+import pyarrow as pa
+
+S3TABLES_PROFILE_NAME = "s3tables.profile-name"
+S3TABLES_REGION = "s3tables.region"
+S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
+S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
+S3TABLES_SESSION_TOKEN = "s3tables.session-token"
+
+S3TABLES_TABLE_BUCKET_ARN = "s3tables.table-bucket-arn"
+
+S3TABLES_ENDPOINT = "s3tables.endpoint"
+
+# for naming rules see: 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-naming.html
+S3TABLES_VALID_NAME_REGEX = re.compile("[a-z0-9][a-z0-9_]{2,62}")
+S3TABLES_RESERVED_NAMESPACE = "aws_s3_metadata"
+
+
+class S3TableCatalog(MetastoreCatalog):
+def __init__(self, name: str, **properties: str):
+super().__init__(name, **properties)
+
+self.table_bucket_arn = self.properties[S3TABLES_TABLE_BUCKET_ARN]
+
+session = boto3.Session(
+profile_name=properties.get(S3TABLES_PROFILE_NAME),
+region_name=get_first_property_value(properties, S3TABLES_REGION, 
AWS_REGION),
+botocore_session=properties.get(DEPRECATED_BOTOCORE_SESSION),
+aws_access_key_id=get_first_property_value(properties, 
S3TABLES_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
+aws_secret_access_key=get_first_property_value(properties, 
S3TABLES_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
+aws_session_token=get_first_property_value(properties, 
S3TABLES_SESSION_TOKEN, AWS_SESSION_TOKEN),
+)
+try:
+self.s3tables = session.client("s3tables", 
endpoint_url=properties.get(S3TABLES_ENDPOINT))
+except boto3.session.UnknownServiceError as e:
+raise S3TablesError(f"'s3tables' requires boto3>=1.35.74. Current 
version: {boto3.__version__}.") from e
+
+try:
+
self.s3tables.get_table_bucket(tableBucketARN=self.table_bucket_arn)
+except self.s3tables.exceptions.NotFoundException as e:
+raise TableBucketNotFound(e) from e
+
+def commit_table(
+self, table: Table, requirements: Tuple[TableRequirement, ...], 
updates: Tuple[TableUpdate, ...]
+) -> CommitTableResponse:
+table_identifier = table.name()
+database_name, table_name = 
self.identifier_to_database_and_table(table_identifier, NoSuchTableError)
+
+current_table, version_token = 
self._load_table_and_version(identifier=table_identifier)
+
+updated_staged_table = self._update_and_stage_table(current_table, 
table_identifier, requirements, updates)
+if current_table and updated_staged_table.metadata == 
current_table.metadata:
+# no changes, do nothing
+return CommitTableResponse(metadata=current_table.metadata, 
metadata_location=current_table.metadata_location)
+
+self._write_metadata(
+metadata=updated_staged_table.metadata,
+io=updated_staged_table.io,
+metadata_path=updated_staged_table.metadata_location,
+overwrite=True,
+)
+
+# try to update metadata location which will fail if the versionToken 
changed meanwhile
+try:
+self.s3tables.update_table_metadata_location(
+tableBucketARN=self.table_bucket_arn,
+namespace=database_name,
+name=table_name,
+versionToken=version_token,
+metadataLocation=updated_staged_table.metadata_location,
+)
+except self.s3tables.exceptions.ConflictException as e:
+raise CommitFailedException(
+f"Cannot commit {database_name}.{table_name} because of a 
concurrent update to the table v

Re: [PR] Fix read from multiple s3 regions [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on PR #1453:
URL: https://github.com/apache/iceberg-python/pull/1453#issuecomment-2573265511

   Thank you @jiakai-li for the contribution and @Fokko for the review :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Fix read from multiple s3 regions [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu merged PR #1453:
URL: https://github.com/apache/iceberg-python/pull/1453


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on code in PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#discussion_r1904280959


##
mkdocs/docs/api.md:
##
@@ -961,20 +963,21 @@ Renaming a field in an Iceberg table is simple:
 ```python
 with table.update_schema() as update:
 update.rename_column("retries", "num_retries")
-# This will rename `confirmed_by` to `exchange`
-update.rename_column("properties.confirmed_by", "exchange")
+# This will rename `confirmed_by` to `exchange` in the `details` struct
+update.rename_column("details", "confirmed_by"), "exchange")
 ```
 
 ### Move column
 
-Move a field inside of struct:
+Move order of fields:
 
 ```python
 with table.update_schema() as update:
 update.move_first("symbol")
+# This will move `bid` after `ask`
 update.move_after("bid", "ask")
-# This will move `confirmed_by` before `exchange`
-update.move_before("details.created_by", "details.exchange")
+# This will move `confirmed_by` before `exchange` in the `details` struct
+update.move_before("details", "confirmed_by"), ("details", "exchange"))

Review Comment:
   ```suggestion
   update.move_before(("details", "confirmed_by"), ("details", "exchange"))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on code in PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#discussion_r1904282046


##
mkdocs/docs/api.md:
##
@@ -961,20 +963,21 @@ Renaming a field in an Iceberg table is simple:
 ```python
 with table.update_schema() as update:
 update.rename_column("retries", "num_retries")
-# This will rename `confirmed_by` to `exchange`
-update.rename_column("properties.confirmed_by", "exchange")
+# This will rename `confirmed_by` to `exchange` in the `details` struct
+update.rename_column("details", "confirmed_by"), "exchange")

Review Comment:
   ```suggestion
   update.rename_column(("details", "confirmed_by"), ("details", 
"exchange"))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on code in PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#discussion_r1904280648


##
mkdocs/docs/api.md:
##
@@ -1006,6 +1009,8 @@ Delete a field, careful this is a incompatible change 
(readers/writers might exp
 ```python
 with table.update_schema(allow_incompatible_changes=True) as update:
 update.delete_column("some_field")
+# In a struct
+update.delete_column("details", "confirmed_by"))

Review Comment:
   ```suggestion
   update.delete_column(("details", "confirmed_by"))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


jeppe-dos commented on PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#issuecomment-2573343676

   > that works, but I think the first example should work too. We can track 
this in a separate issue.
   > 
   > ```
   > >>> with table.update_schema() as update:
   > ... update.add_column("retries", IntegerType(), "Number of retries to 
place the bid")
   > ... # In a struct
   > ... update.add_column("details", StructType())
   > ...
   > 
   > 
   > >>> with table.update_schema() as update:
   > ... update.add_column(("details", "confirmed_by"), StringType(), "Name 
of the exchange")
   > ...
   > 
   > >>> print(table.refresh().schema())
   > table {
   >   1: city: optional string
   >   2: lat: optional double
   >   3: long: optional double
   >   4: retries: optional int (Number of retries to place the bid)
   >   5: details: optional struct<6: confirmed_by: optional string (Name of 
the exchange)>
   > }
   > ```
   
   Agreed. Thank you for reviewing the changes. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] bump version to 0.8.1 [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu opened a new pull request, #1489:
URL: https://github.com/apache/iceberg-python/pull/1489

   I noticed we never bumped the version in the main branch to 0.8.1.
   
   Cherry-picked the commit from the 0.8.x branch:
   
https://github.com/apache/iceberg-python/commit/a051584a3684392d2db6556449eb299145d47d15


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#issuecomment-2573349822

   @jeppe-dos that would be great! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support S3 Table Buckets with S3TablesCatalog [iceberg-python]

2025-01-06 Thread via GitHub


felixscherz commented on code in PR #1429:
URL: https://github.com/apache/iceberg-python/pull/1429#discussion_r1904369776


##
pyiceberg/catalog/s3tables.py:
##
@@ -0,0 +1,322 @@
+import re
+from typing import TYPE_CHECKING, List, Optional, Set, Tuple, Union
+
+import boto3
+
+from pyiceberg.catalog import DEPRECATED_BOTOCORE_SESSION, MetastoreCatalog, 
PropertiesUpdateSummary
+from pyiceberg.exceptions import (
+CommitFailedException,
+InvalidNamespaceName,
+InvalidTableName,
+NamespaceNotEmptyError,
+NoSuchNamespaceError,
+NoSuchTableError,
+S3TablesError,
+TableAlreadyExistsError,
+TableBucketNotFound,
+)
+from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, 
AWS_SESSION_TOKEN, load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import CommitTableResponse, Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.table.update import TableRequirement, TableUpdate
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+from pyiceberg.utils.properties import get_first_property_value
+
+if TYPE_CHECKING:
+import pyarrow as pa
+
+S3TABLES_PROFILE_NAME = "s3tables.profile-name"
+S3TABLES_REGION = "s3tables.region"
+S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
+S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
+S3TABLES_SESSION_TOKEN = "s3tables.session-token"
+
+S3TABLES_TABLE_BUCKET_ARN = "s3tables.table-bucket-arn"
+
+S3TABLES_ENDPOINT = "s3tables.endpoint"
+
+# for naming rules see: 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-naming.html
+S3TABLES_VALID_NAME_REGEX = re.compile("[a-z0-9][a-z0-9_]{2,62}")
+S3TABLES_RESERVED_NAMESPACE = "aws_s3_metadata"
+
+
+class S3TableCatalog(MetastoreCatalog):
+def __init__(self, name: str, **properties: str):
+super().__init__(name, **properties)
+
+self.table_bucket_arn = self.properties[S3TABLES_TABLE_BUCKET_ARN]
+
+session = boto3.Session(
+profile_name=properties.get(S3TABLES_PROFILE_NAME),
+region_name=get_first_property_value(properties, S3TABLES_REGION, 
AWS_REGION),
+botocore_session=properties.get(DEPRECATED_BOTOCORE_SESSION),
+aws_access_key_id=get_first_property_value(properties, 
S3TABLES_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
+aws_secret_access_key=get_first_property_value(properties, 
S3TABLES_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
+aws_session_token=get_first_property_value(properties, 
S3TABLES_SESSION_TOKEN, AWS_SESSION_TOKEN),
+)
+try:
+self.s3tables = session.client("s3tables", 
endpoint_url=properties.get(S3TABLES_ENDPOINT))
+except boto3.session.UnknownServiceError as e:
+raise S3TablesError(f"'s3tables' requires boto3>=1.35.74. Current 
version: {boto3.__version__}.") from e

Review Comment:
   I rebased my branch. I think we need to keep this check, however, since per 
`pyproject.toml` the requirement for `boto3` is `>=1.24.59`. We could bump that 
to `>=1.35.74`, but since it is only needed for the S3 Tables API, I'm not sure 
about forcing everyone to upgrade their boto3 version just to support S3 
Tables. What do you think?


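   A hedged alternative sketch, if an explicit version gate is preferred over 
catching `UnknownServiceError` (this assumes `packaging` is available; it is not 
necessarily a pyiceberg dependency):
   
   ```python
   import boto3
   from packaging import version
   
   MIN_S3TABLES_BOTO3 = "1.35.74"
   
   def require_s3tables_support() -> None:
       # Keep the base requirement at boto3>=1.24.59; only gate the S3 Tables path.
       if version.parse(boto3.__version__) < version.parse(MIN_S3TABLES_BOTO3):
           raise RuntimeError(
               f"'s3tables' requires boto3>={MIN_S3TABLES_BOTO3}; "
               f"found {boto3.__version__}"
           )
   ```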

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat(datafusion): Expose DataFusion statistics on an IcebergTableScan [iceberg-rust]

2025-01-06 Thread via GitHub


gruuya commented on code in PR #880:
URL: https://github.com/apache/iceberg-rust/pull/880#discussion_r1904197220


##
crates/integrations/datafusion/src/table/mod.rs:
##
@@ -130,10 +131,14 @@ impl TableProvider for IcebergTableProvider {
 filters: &[Expr],
 _limit: Option<usize>,
 ) -> DFResult<Arc<dyn ExecutionPlan>> {
+let statistics = compute_statistics(&self.table, self.snapshot_id)
+.await
+.unwrap_or(Statistics::new_unknown(self.schema.as_ref()));

Review Comment:
   Arguably this should be computed when instantiating the 
`IcebergTableProvider` (consequently `IcebergTableProvider::new` would become 
async).
   
   That way, not only could `TableProvider::statistics` be exposed, but we'd also 
avoid paying a constant perf penalty during planning for reading the manifests.


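   A Python stand-in for the tradeoff in this comment (the actual code is Rust): 
pay the manifest-reading cost once at provider construction so every plan gets 
statistics for free. `compute_statistics` below is a stub standing in for the 
function in the diff:
   
   ```python
   def compute_statistics(table, snapshot_id):
       # Stub for the manifest-reading function in the diff above.
       return {"num_rows": None, "total_byte_size": None}
   
   class IcebergTableProvider:
       def __init__(self, table, snapshot_id):
           self.table = table
           self.snapshot_id = snapshot_id
           # Paid once, at construction; planning no longer re-reads manifests.
           self.statistics = compute_statistics(table, snapshot_id)
   
       def scan(self, filters, limit=None):
           return self.statistics  # cheap lookup on every plan
   ```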

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Support removeUnusedSpecs in ExpireSnapshots [iceberg]

2025-01-06 Thread via GitHub


advancedxy commented on PR #10755:
URL: https://github.com/apache/iceberg/pull/10755#issuecomment-2573229194

   @rdblue @danielcweeks @amogh-jahagirdar would you mind taking a look at 
this again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] [BUG] pyiceberg hanging on multiprocessing [iceberg-python]

2025-01-06 Thread via GitHub


kevinjqliu commented on issue #1488:
URL: 
https://github.com/apache/iceberg-python/issues/1488#issuecomment-2573279503

   Hi @frankliee thanks for reporting this issue. I noticed you're using 
version 0.7.1, the latest version is 0.8.1. Could you retry with the latest 
version?
   
   The issue might be related to pickling the Table object: 
https://github.com/apache/iceberg-python/issues/513
   
   > The arr will never be printed due to process hanging
   
   It would be helpful if you could share a stack trace showing where the hang happens. 
   

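   A hedged workaround sketch while the root cause is investigated: rebuild the 
Table inside each worker instead of pickling it, so no fork-unsafe state is 
inherited. The catalog name, properties, and identifier below are illustrative:
   
   ```python
   from multiprocessing import get_context
   
   from pyiceberg.catalog import load_catalog
   
   def count_rows(args):
       catalog_name, catalog_props, identifier = args
       # Construct the catalog, table, and underlying FileIO in the child process.
       table = load_catalog(catalog_name, **catalog_props).load_table(identifier)
       return table.scan().to_arrow().num_rows
   
   if __name__ == "__main__":
       ctx = get_context("spawn")  # spawn avoids inheriting fork-unsafe state
       work = [("default", {"uri": "http://localhost:8181"}, "db.tbl")] * 2
       with ctx.Pool(2) as pool:
           print(pool.map(count_rows, work))
   ```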

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Unimplement Map from CharSequenceMap to obey contract [iceberg]

2025-01-06 Thread via GitHub


findepi commented on PR #11704:
URL: https://github.com/apache/iceberg/pull/11704#issuecomment-2573411391

   CharSequenceMap is not a sound Map implementation; it would be nice if 
someone could spend a couple of minutes reviewing this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Implement RewriteTablePath [iceberg]

2025-01-06 Thread via GitHub


flyrain commented on code in PR #11555:
URL: https://github.com/apache/iceberg/pull/11555#discussion_r1904432648


##
core/src/main/java/org/apache/iceberg/io/DeleteSchemaUtil.java:
##
@@ -43,4 +43,15 @@ public static Schema pathPosSchema() {
   public static Schema posDeleteSchema(Schema rowSchema) {
 return rowSchema == null ? pathPosSchema() : pathPosSchema(rowSchema);
   }
+
+  public static Schema posDeleteReadSchema(Schema rowSchema) {

Review Comment:
   This might not be directly related to this PR, but it seems like the column `row` 
should NOT be marked as `required` anywhere. 



##
core/src/main/java/org/apache/iceberg/util/ContentFileUtil.java:
##
@@ -60,28 +69,63 @@ public static CharSequence referencedDataFile(DeleteFile 
deleteFile) {
   return deleteFile.referencedDataFile();
 }
 
-int pathId = MetadataColumns.DELETE_FILE_PATH.fieldId();
-Type pathType = MetadataColumns.DELETE_FILE_PATH.type();
-
 Map<Integer, ByteBuffer> lowerBounds = deleteFile.lowerBounds();
-ByteBuffer lowerPathBound = lowerBounds != null ? lowerBounds.get(pathId) : null;
+ByteBuffer lowerPathBound = lowerBounds != null ? lowerBounds.get(PATH_ID) : null;
 if (lowerPathBound == null) {
   return null;
 }
 
 Map<Integer, ByteBuffer> upperBounds = deleteFile.upperBounds();
-ByteBuffer upperPathBound = upperBounds != null ? upperBounds.get(pathId) : null;
+ByteBuffer upperPathBound = upperBounds != null ? upperBounds.get(PATH_ID) : null;
 if (upperPathBound == null) {
   return null;
 }
 
 if (lowerPathBound.equals(upperPathBound)) {
-  return Conversions.fromByteBuffer(pathType, lowerPathBound);
+  return Conversions.fromByteBuffer(PATH_TYPE, lowerPathBound);
 } else {
   return null;
 }
   }
 
+  /**
+   * Replace file_path reference for a delete file entry
+   *
+   * @param deleteFile delete file whose entry will be replaced
+   * @param sourcePrefix source prefix which will be replaced
+   * @param targetPrefix target prefix which will replace it
+   * @return metrics for the new delete file entry
+   */
+  public static Metrics replacePathBounds(
+  DeleteFile deleteFile, String sourcePrefix, String targetPrefix) {
+Preconditions.checkArgument(
+deleteFile.content() == FileContent.POSITION_DELETES,
+"Only position delete files supported");
+
+Map<Integer, ByteBuffer> lowerBounds = deleteFile.lowerBounds();
+ByteBuffer lowerPathBound = lowerBounds != null ? lowerBounds.get(PATH_ID) : null;
+if (lowerPathBound == null) {
+  return metricsWithoutPathBounds(deleteFile);
+}
+
+Map<Integer, ByteBuffer> upperBounds = deleteFile.upperBounds();
+ByteBuffer upperPathBound = upperBounds != null ? upperBounds.get(PATH_ID) : null;
+if (upperPathBound == null) {
+  return metricsWithoutPathBounds(deleteFile);
+}
+
+if (lowerPathBound.equals(upperPathBound)) {

Review Comment:
   I think it's worth having a comment to clarify the behavior. It isn't a 
blocker for me though. 



##
core/src/main/java/org/apache/iceberg/io/DeleteSchemaUtil.java:
##
@@ -43,4 +43,15 @@ public static Schema pathPosSchema() {
   public static Schema posDeleteSchema(Schema rowSchema) {
 return rowSchema == null ? pathPosSchema() : pathPosSchema(rowSchema);
   }
+
+  public static Schema posDeleteReadSchema(Schema rowSchema) {
+return new Schema(
+MetadataColumns.DELETE_FILE_PATH,
+MetadataColumns.DELETE_FILE_POS,
+Types.NestedField.optional(
+MetadataColumns.DELETE_FILE_ROW_FIELD_ID,
+"row",

Review Comment:
   could we reuse this constant `DELETE_FILE_ROW_FIELD_NAME`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Support removeUnusedSpecs in ExpireSnapshots [iceberg]

2025-01-06 Thread via GitHub


amogh-jahagirdar commented on code in PR #10755:
URL: https://github.com/apache/iceberg/pull/10755#discussion_r1904445943


##
core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java:
##
@@ -1284,6 +1284,62 @@ public void testUpdateTableSpecThenRevert() {
 assertThat(table.spec()).as("Loaded table should have expected 
spec").isEqualTo(TABLE_SPEC);
   }
 
+  @ParameterizedTest
+  @ValueSource(booleans = {true, false})
+  public void testRemoveUnusedSpec(boolean withBranch) {
+String branch = "test";
+C catalog = catalog();
+
+if (requiresNamespaceCreate()) {
+  catalog.createNamespace(NS);
+}
+
+Table table =
+catalog
+.buildTable(TABLE, SCHEMA)
+.withPartitionSpec(SPEC)
+.withProperty(TableProperties.GC_ENABLED, "true")
+.create();
+PartitionSpec spec = table.spec();
+// added a file to trigger snapshot expiration
+table.newFastAppend().appendFile(FILE_A).commit();
+if (withBranch) {
+  table.manageSnapshots().createBranch(branch).commit();
+}
+table.updateSpec().addField(Expressions.bucket("data", 16)).commit();
+table.updateSpec().removeField(Expressions.bucket("data", 16)).commit();
+table.updateSpec().addField("data").commit();
+assertThat(table.specs()).as("Should have 3 total specs").hasSize(3);
+PartitionSpec current = table.spec();
+table.expireSnapshots().cleanExpiredMetadata(true).commit();
+
+Table loaded = catalog.loadTable(TABLE);
+assertThat(loaded.specs().values()).containsExactlyInAnyOrder(spec, 
current);
+
+// add a data file with current spec and remove the old data file
+table.newDelete().deleteFile(FILE_A).commit();
+DataFile anotherFile =
+DataFiles.builder(current)
+.withPath("/path/to/data-b.parquet")
+.withFileSizeInBytes(10)
+.withPartitionPath("id_bucket=0/data=123") // easy way to set 
partition data for now
+.withRecordCount(2) // needs at least one record or else metrics 
will filter it out

Review Comment:
   Non-blocking nit: sorry for failing to catch this earlier, but these 
comments don't seem particularly useful; could we just remove them?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Open-API: Fix compilation errors in generated Java classes due to mismatched return types [iceberg]

2025-01-06 Thread via GitHub


VladimirYushkevich commented on code in PR #11806:
URL: https://github.com/apache/iceberg/pull/11806#discussion_r1904456125


##
open-api/rest-catalog-open-api.py:
##
@@ -981,8 +966,33 @@ class ValueMap(BaseModel):
 )
 
 
+class ContentFile(BaseModel):

Review Comment:
   This file is autogenerated; I moved `ContentEnum` and `ActionEnum` into the proper 
place to minimize the diff a bit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Update maintenance.md [iceberg]

2025-01-06 Thread via GitHub


nastra commented on code in PR #11916:
URL: https://github.com/apache/iceberg/pull/11916#discussion_r1905022184


##
docs/docs/maintenance.md:
##
@@ -137,7 +137,7 @@ See the [`RewriteDataFiles` Javadoc](../../javadoc/{{ 
icebergVersion }}/org/apac
 
 ### Rewrite manifests
 
-Iceberg uses metadata in its manifest list and manifest files speed up query 
planning and to prune unnecessary data files. The metadata tree functions as an 
index over a table's data.
+Iceberg uses metadata in its manifest list and manifest files to speed up 
query planning, and to prune unnecessary data files. The metadata tree 
functions as an index over a table's data.

Review Comment:
   ```suggestion
   Iceberg uses metadata in its manifest list and manifest files to speed up 
query planning and to prune unnecessary data files. The metadata tree functions 
as an index over a table's data.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Update maintenance.md [iceberg]

2025-01-06 Thread via GitHub


nastra merged PR #11916:
URL: https://github.com/apache/iceberg/pull/11916


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Build: Bump software.amazon.awssdk:bom from 2.29.43 to 2.29.45 [iceberg]

2025-01-06 Thread via GitHub


nastra merged PR #11910:
URL: https://github.com/apache/iceberg/pull/11910


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Build: Bump org.xerial:sqlite-jdbc from 3.47.1.0 to 3.47.2.0 [iceberg]

2025-01-06 Thread via GitHub


nastra merged PR #11907:
URL: https://github.com/apache/iceberg/pull/11907


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Build: Bump org.assertj:assertj-core from 3.27.0 to 3.27.2 [iceberg]

2025-01-06 Thread via GitHub


nastra commented on PR #11908:
URL: https://github.com/apache/iceberg/pull/11908#issuecomment-2574587110

   @dependabot rebase


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat(puffin): Parse Puffin FileMetadata [iceberg-rust]

2025-01-06 Thread via GitHub


liurenjie1024 commented on code in PR #765:
URL: https://github.com/apache/iceberg-rust/pull/765#discussion_r1903911917


##
crates/iceberg/src/puffin/metadata.rs:
##
@@ -0,0 +1,777 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::collections::{HashMap, HashSet};
+
+use bytes::Bytes;
+use serde::{Deserialize, Serialize};
+
+use crate::io::{FileRead, InputFile};
+use crate::puffin::compression::CompressionCodec;
+use crate::{Error, ErrorKind, Result};
+
+/// Human-readable identification of the application writing the file, along 
with its version.
+/// Example: "Trino version 381"
+pub(crate) const CREATED_BY_PROPERTY: &str = "created-by";
+
+/// Metadata about a blob.
+/// For more information, see: 
https://iceberg.apache.org/puffin-spec/#blobmetadata
+#[derive(Debug, PartialEq, Eq, Serialize, Deserialize, Clone)]
+#[serde(rename_all = "kebab-case")]
+pub(crate) struct BlobMetadata {
+/// See blob types: https://iceberg.apache.org/puffin-spec/#blob-types
+pub(crate) r#type: String,
+/// List of field IDs the blob was computed for; the order of items is 
used to compute sketches stored in the blob.
+pub(crate) fields: Vec<i32>,
+/// ID of the Iceberg table's snapshot the blob was computed from
+pub(crate) snapshot_id: i64,
+/// Sequence number of the Iceberg table's snapshot the blob was computed 
from
+pub(crate) sequence_number: i64,
+/// The offset in the file where the blob contents start
+pub(crate) offset: u64,
+/// The length of the blob stored in the file (after compression, if 
compressed)
+pub(crate) length: usize,

Review Comment:
   `usize` is platform-dependent; should we use `u64` here?



##
crates/iceberg/src/puffin/metadata.rs:
##
@@ -0,0 +1,777 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::collections::{HashMap, HashSet};
+
+use bytes::Bytes;
+use serde::{Deserialize, Serialize};
+
+use crate::io::{FileRead, InputFile};
+use crate::puffin::compression::CompressionCodec;
+use crate::{Error, ErrorKind, Result};
+
+/// Human-readable identification of the application writing the file, along 
with its version.
+/// Example: "Trino version 381"
+pub(crate) const CREATED_BY_PROPERTY: &str = "created-by";
+
+/// Metadata about a blob.
+/// For more information, see: 
https://iceberg.apache.org/puffin-spec/#blobmetadata
+#[derive(Debug, PartialEq, Eq, Serialize, Deserialize, Clone)]
+#[serde(rename_all = "kebab-case")]
+pub(crate) struct BlobMetadata {
+/// See blob types: https://iceberg.apache.org/puffin-spec/#blob-types
+pub(crate) r#type: String,
+/// List of field IDs the blob was computed for; the order of items is 
used to compute sketches stored in the blob.
+pub(crate) fields: Vec<i32>,
+/// ID of the Iceberg table's snapshot the blob was computed from
+pub(crate) snapshot_id: i64,
+/// Sequence number of the Iceberg table's snapshot the blob was computed 
from
+pub(crate) sequence_number: i64,
+/// The offset in the file where the blob contents start
+pub(crate) offset: u64,
+/// The length of the blob stored in the file (after compression, if 
compressed)
+pub(crate) length: usize,
+/// The compression codec used to compress the data
+#[serde(skip_serializing_if = "CompressionCodec::is_none")]
+#[serde(default)]
+pub(crate) compression_codec: CompressionCodec,
+/// Arbitrary meta-information about the blob
+ 

Re: [PR] GCP: Add Iceberg Catalog for GCP BigQuery Metastore [iceberg]

2025-01-06 Thread via GitHub


vrishin-bolt commented on PR #11039:
URL: https://github.com/apache/iceberg/pull/11039#issuecomment-2574498242

   > @hesham-medhat @rdblue - could you pls give an update on this PR? It seems 
it would massively simplify the Iceberg table management in GCP for non-spark 
usecases.
   > 
   > thanks!
   
   Just a curious question: how are you currently supporting non-Spark use 
cases at all? By scheduling a metadata copy in Iceberg metadata format and 
then pointing other engines at it? 
   However, even with this PR merged, the non-Spark use cases must still 
integrate with the BigQuery metastore for the simplification to happen, right? 
What am I missing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Split metadata tables into separate modules [iceberg-rust]

2025-01-06 Thread via GitHub


liurenjie1024 commented on code in PR #872:
URL: https://github.com/apache/iceberg-rust/pull/872#discussion_r1904969065


##
crates/iceberg/src/inspect/metadata_table.rs:
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use super::{ManifestsTable, SnapshotsTable};
+use crate::table::Table;
+
+/// Metadata table is used to inspect a table's history, snapshots, and other 
metadata as a table.
+///
+/// References:
+/// - 

+/// - 
+/// - 
+#[derive(Debug)]
+pub struct MetadataTable(Table);

Review Comment:
   cc @Xuanwo @sdd @Fokko, what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Split metadata tables into separate modules [iceberg-rust]

2025-01-06 Thread via GitHub


liurenjie1024 commented on code in PR #872:
URL: https://github.com/apache/iceberg-rust/pull/872#discussion_r1904966540


##
crates/iceberg/src/inspect/snapshots.rs:
##
@@ -0,0 +1,183 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use arrow_array::builder::{MapBuilder, PrimitiveBuilder, StringBuilder};
+use arrow_array::types::{Int64Type, TimestampMillisecondType};
+use arrow_array::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, TimeUnit};
+
+use crate::table::Table;
+use crate::Result;
+
+/// Snapshots table.
+pub struct SnapshotsTable<'a> {
+table: &'a Table,
+}
+
+impl<'a> SnapshotsTable<'a> {
+/// Create a new Snapshots table instance.
+pub fn new(table: &'a Table) -> Self {
+Self { table }
+}
+
+/// Returns the schema of the snapshots table.
+pub fn schema(&self) -> Schema {

Review Comment:
   Same question as https://github.com/apache/iceberg-rust/issues/868. But we 
could defer this to a later issue.



##
crates/iceberg/src/inspect/snapshots.rs:
##
@@ -0,0 +1,183 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use arrow_array::builder::{MapBuilder, PrimitiveBuilder, StringBuilder};
+use arrow_array::types::{Int64Type, TimestampMillisecondType};
+use arrow_array::RecordBatch;
+use arrow_schema::{DataType, Field, Schema, TimeUnit};
+
+use crate::table::Table;
+use crate::Result;
+
+/// Snapshots table.
+pub struct SnapshotsTable<'a> {
+table: &'a Table,
+}
+
+impl<'a> SnapshotsTable<'a> {
+/// Create a new Snapshots table instance.
+pub fn new(table: &'a Table) -> Self {
+Self { table }
+}
+
+/// Returns the schema of the snapshots table.
+pub fn schema(&self) -> Schema {
+Schema::new(vec![
+Field::new(
+"committed_at",
+DataType::Timestamp(TimeUnit::Millisecond, 
Some("+00:00".into())),
+false,
+),
+Field::new("snapshot_id", DataType::Int64, false),
+Field::new("parent_id", DataType::Int64, true),
+Field::new("operation", DataType::Utf8, false),
+Field::new("manifest_list", DataType::Utf8, false),
+Field::new(
+"summary",
+DataType::Map(
+Arc::new(Field::new(
+"entries",
+DataType::Struct(
+vec![
+Field::new("keys", DataType::Utf8, false),
+Field::new("values", DataType::Utf8, true),
+]
+.into(),
+),
+false,
+)),
+false,
+),
+false,
+),
+])
+}
+
+/// Scans the snapshots table.
+pub fn scan(&self) -> Result<RecordBatch> {

Review Comment:
   Same as #870 



##
crates/iceberg/src/inspect/metadata_table.rs:
##
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with t

[PR] API: Align CharSequenceSet impl with Data/DeleteFileSet [iceberg]

2025-01-06 Thread via GitHub


nastra opened a new pull request, #11322:
URL: https://github.com/apache/iceberg/pull/11322

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Metadata table scans as streams [iceberg-rust]

2025-01-06 Thread via GitHub


liurenjie1024 commented on PR #870:
URL: https://github.com/apache/iceberg-rust/pull/870#issuecomment-2574495817

   Thanks @rshkv for this PR, exactly what I meant! Should we rebase this PR 
after #872?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support S3 Table Buckets with S3TablesCatalog [iceberg-python]

2025-01-06 Thread via GitHub


felixscherz commented on code in PR #1429:
URL: https://github.com/apache/iceberg-python/pull/1429#discussion_r1905038020


##
pyiceberg/catalog/s3tables.py:
##
@@ -0,0 +1,324 @@
+import re
+from typing import TYPE_CHECKING, List, Optional, Set, Tuple, Union
+
+import boto3
+
+from pyiceberg.catalog import DEPRECATED_BOTOCORE_SESSION, MetastoreCatalog, 
PropertiesUpdateSummary
+from pyiceberg.exceptions import (
+CommitFailedException,
+InvalidNamespaceName,
+InvalidTableName,
+NamespaceNotEmptyError,
+NoSuchNamespaceError,
+NoSuchTableError,
+S3TablesError,
+TableAlreadyExistsError,
+TableBucketNotFound,
+)
+from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, 
AWS_SESSION_TOKEN, load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import CommitTableResponse, Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.table.update import TableRequirement, TableUpdate
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+from pyiceberg.utils.properties import get_first_property_value
+
+if TYPE_CHECKING:
+import pyarrow as pa
+
+S3TABLES_PROFILE_NAME = "s3tables.profile-name"
+S3TABLES_REGION = "s3tables.region"
+S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
+S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
+S3TABLES_SESSION_TOKEN = "s3tables.session-token"
+
+S3TABLES_TABLE_BUCKET_ARN = "s3tables.table-bucket-arn"
+
+S3TABLES_ENDPOINT = "s3tables.endpoint"
+
+# for naming rules see: 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-naming.html
+S3TABLES_VALID_NAME_REGEX = re.compile("[a-z0-9][a-z0-9_]{2,62}")
+S3TABLES_RESERVED_NAMESPACE = "aws_s3_metadata"
+
+
+class S3TableCatalog(MetastoreCatalog):
+def __init__(self, name: str, **properties: str):
+super().__init__(name, **properties)
+
+self.table_bucket_arn = self.properties[S3TABLES_TABLE_BUCKET_ARN]
+
+session = boto3.Session(
+profile_name=properties.get(S3TABLES_PROFILE_NAME),
+region_name=get_first_property_value(properties, S3TABLES_REGION, 
AWS_REGION),
+botocore_session=properties.get(DEPRECATED_BOTOCORE_SESSION),
+aws_access_key_id=get_first_property_value(properties, 
S3TABLES_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
+aws_secret_access_key=get_first_property_value(properties, 
S3TABLES_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
+aws_session_token=get_first_property_value(properties, 
S3TABLES_SESSION_TOKEN, AWS_SESSION_TOKEN),
+)
+try:
+self.s3tables = session.client("s3tables", 
endpoint_url=properties.get(S3TABLES_ENDPOINT))
+except boto3.session.UnknownServiceError as e:
+raise S3TablesError(f"'s3tables' requires boto3>=1.35.74. Current 
version: {boto3.__version__}.") from e
+
+try:
+
self.s3tables.get_table_bucket(tableBucketARN=self.table_bucket_arn)
+except self.s3tables.exceptions.NotFoundException as e:
+raise TableBucketNotFound(e) from e
+
+def commit_table(
+self, table: Table, requirements: Tuple[TableRequirement, ...], 
updates: Tuple[TableUpdate, ...]
+) -> CommitTableResponse:
+table_identifier = table.name()
+database_name, table_name = 
self.identifier_to_database_and_table(table_identifier, NoSuchTableError)
+
+current_table, version_token = 
self._load_table_and_version(identifier=table_identifier)
+
+updated_staged_table = self._update_and_stage_table(current_table, 
table_identifier, requirements, updates)
+if current_table and updated_staged_table.metadata == 
current_table.metadata:
+# no changes, do nothing
+return CommitTableResponse(metadata=current_table.metadata, 
metadata_location=current_table.metadata_location)
+
+self._write_metadata(
+metadata=updated_staged_table.metadata,
+io=updated_staged_table.io,
+metadata_path=updated_staged_table.metadata_location,
+overwrite=True,
+)
+
+# try to update metadata location which will fail if the versionToken 
changed meanwhile
+try:
+self.s3tables.update_table_metadata_location(
+tableBucketARN=self.table_bucket_arn,
+namespace=database_name,
+name=table_name,
+versionToken=version_token,
+metadataLocation=updated_staged_table.metadata_location,
+)
+except self.s3tables.exceptions.ConflictException as e:
+raise CommitFailedException(
+f"Cannot commit {database_name}.{table_name} because of a 
concurrent update to the table v

Re: [PR] feat: support S3 Table Buckets with S3TablesCatalog [iceberg-python]

2025-01-06 Thread via GitHub


felixscherz commented on code in PR #1429:
URL: https://github.com/apache/iceberg-python/pull/1429#discussion_r1905038693


##
pyiceberg/catalog/s3tables.py:
##
@@ -0,0 +1,324 @@
+import re
+from typing import TYPE_CHECKING, List, Optional, Set, Tuple, Union
+
+import boto3
+
+from pyiceberg.catalog import DEPRECATED_BOTOCORE_SESSION, MetastoreCatalog, 
PropertiesUpdateSummary
+from pyiceberg.exceptions import (
+CommitFailedException,
+InvalidNamespaceName,
+InvalidTableName,
+NamespaceNotEmptyError,
+NoSuchNamespaceError,
+NoSuchTableError,
+S3TablesError,
+TableAlreadyExistsError,
+TableBucketNotFound,
+)
+from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, 
AWS_SESSION_TOKEN, load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import CommitTableResponse, Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.table.update import TableRequirement, TableUpdate
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+from pyiceberg.utils.properties import get_first_property_value
+
+if TYPE_CHECKING:
+import pyarrow as pa
+
+S3TABLES_PROFILE_NAME = "s3tables.profile-name"
+S3TABLES_REGION = "s3tables.region"
+S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
+S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
+S3TABLES_SESSION_TOKEN = "s3tables.session-token"
+
+S3TABLES_TABLE_BUCKET_ARN = "s3tables.table-bucket-arn"
+
+S3TABLES_ENDPOINT = "s3tables.endpoint"
+
+# for naming rules see: 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-naming.html
+S3TABLES_VALID_NAME_REGEX = re.compile("[a-z0-9][a-z0-9_]{2,62}")
+S3TABLES_RESERVED_NAMESPACE = "aws_s3_metadata"
+
+
+class S3TableCatalog(MetastoreCatalog):

Review Comment:
   Absolutely, I will change the name!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support S3 Table Buckets with S3TablesCatalog [iceberg-python]

2025-01-06 Thread via GitHub


HonahX commented on code in PR #1429:
URL: https://github.com/apache/iceberg-python/pull/1429#discussion_r1904909293


##
pyiceberg/catalog/s3tables.py:
##
@@ -0,0 +1,324 @@
+import re
+from typing import TYPE_CHECKING, List, Optional, Set, Tuple, Union
+
+import boto3
+
+from pyiceberg.catalog import DEPRECATED_BOTOCORE_SESSION, MetastoreCatalog, PropertiesUpdateSummary
+from pyiceberg.exceptions import (
+CommitFailedException,
+InvalidNamespaceName,
+InvalidTableName,
+NamespaceNotEmptyError,
+NoSuchNamespaceError,
+NoSuchTableError,
+S3TablesError,
+TableAlreadyExistsError,
+TableBucketNotFound,
+)
+from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import CommitTableResponse, Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.table.update import TableRequirement, TableUpdate
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+from pyiceberg.utils.properties import get_first_property_value
+
+if TYPE_CHECKING:
+import pyarrow as pa
+
+S3TABLES_PROFILE_NAME = "s3tables.profile-name"
+S3TABLES_REGION = "s3tables.region"
+S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
+S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
+S3TABLES_SESSION_TOKEN = "s3tables.session-token"
+

Review Comment:
   Is there any case where the s3tables catalog would need a different set of credentials than the underlying FileIO for S3 buckets? It seems we could just re-use the `s3.` prefix here to avoid adding a new set of configuration names. After all, S3 Tables still belongs to S3, so reusing the prefix feels intuitive. WDYT?
   
   cc @kevinjqliu 
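   
   A minimal sketch of that suggestion, reusing the generic AWS property constants already imported from `pyiceberg.io` in this module (the helper below is illustrative, not part of the PR):
   
   import boto3
   
   from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN
   
   def make_session(properties: dict) -> boto3.Session:
       # Both the s3tables client and the FileIO would read the same keys.
       return boto3.Session(
           region_name=properties.get(AWS_REGION),
           aws_access_key_id=properties.get(AWS_ACCESS_KEY_ID),
           aws_secret_access_key=properties.get(AWS_SECRET_ACCESS_KEY),
           aws_session_token=properties.get(AWS_SESSION_TOKEN),
       )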



##
pyiceberg/catalog/s3tables.py:
##
@@ -0,0 +1,324 @@
+import re
+from typing import TYPE_CHECKING, List, Optional, Set, Tuple, Union
+
+import boto3
+
+from pyiceberg.catalog import DEPRECATED_BOTOCORE_SESSION, MetastoreCatalog, PropertiesUpdateSummary
+from pyiceberg.exceptions import (
+CommitFailedException,
+InvalidNamespaceName,
+InvalidTableName,
+NamespaceNotEmptyError,
+NoSuchNamespaceError,
+NoSuchTableError,
+S3TablesError,
+TableAlreadyExistsError,
+TableBucketNotFound,
+)
+from pyiceberg.io import AWS_ACCESS_KEY_ID, AWS_REGION, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, load_file_io
+from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionSpec
+from pyiceberg.schema import Schema
+from pyiceberg.serializers import FromInputFile
+from pyiceberg.table import CommitTableResponse, Table
+from pyiceberg.table.metadata import new_table_metadata
+from pyiceberg.table.sorting import UNSORTED_SORT_ORDER, SortOrder
+from pyiceberg.table.update import TableRequirement, TableUpdate
+from pyiceberg.typedef import EMPTY_DICT, Identifier, Properties
+from pyiceberg.utils.properties import get_first_property_value
+
+if TYPE_CHECKING:
+import pyarrow as pa
+
+S3TABLES_PROFILE_NAME = "s3tables.profile-name"
+S3TABLES_REGION = "s3tables.region"
+S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
+S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
+S3TABLES_SESSION_TOKEN = "s3tables.session-token"
+
+S3TABLES_TABLE_BUCKET_ARN = "s3tables.table-bucket-arn"
+
+S3TABLES_ENDPOINT = "s3tables.endpoint"
+
+# for naming rules see: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-naming.html
+S3TABLES_VALID_NAME_REGEX = re.compile("[a-z0-9][a-z0-9_]{2,62}")
+S3TABLES_RESERVED_NAMESPACE = "aws_s3_metadata"
+
+
+class S3TableCatalog(MetastoreCatalog):
+def __init__(self, name: str, **properties: str):
+super().__init__(name, **properties)
+
+self.table_bucket_arn = self.properties[S3TABLES_TABLE_BUCKET_ARN]
+
+session = boto3.Session(
+profile_name=properties.get(S3TABLES_PROFILE_NAME),
+region_name=get_first_property_value(properties, S3TABLES_REGION, AWS_REGION),
+botocore_session=properties.get(DEPRECATED_BOTOCORE_SESSION),
+aws_access_key_id=get_first_property_value(properties, S3TABLES_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
+aws_secret_access_key=get_first_property_value(properties, S3TABLES_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
+aws_session_token=get_first_property_value(properties, S3TABLES_SESSION_TOKEN, AWS_SESSION_TOKEN),
+)
+try:
+self.s3tables = session.client("s3tables", endpoint_url=properties.get(S3TABLES_ENDPOINT))
+except boto3.session.UnknownServiceError as e:
+raise S3TablesError(f"'s3tables' requires boto3>=1.35.74. Current version: {boto3.__version__}.") from e
+
+try:
+self.s3t
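
   For context, a hedged usage sketch of the constructor quoted above; the ARN and region values are placeholders, while the property keys come from the constants defined in this file:
   
   from pyiceberg.catalog.s3tables import S3TableCatalog
   
   catalog = S3TableCatalog(
       "s3tables_catalog",
       **{
           "s3tables.table-bucket-arn": "arn:aws:s3tables:us-east-1:111122223333:bucket/example-bucket",
           "s3tables.region": "us-east-1",
       },
   )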

Re: [PR] Change dot notation in add column documentation to tuple [iceberg-python]

2025-01-06 Thread via GitHub


jeppe-dos commented on PR #1433:
URL: https://github.com/apache/iceberg-python/pull/1433#issuecomment-2572871762

   The struct is now created first in the add column section. I have also changed the dot notation to tuple notation in the move and rename column sections.
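   
   For reference, a minimal sketch of the tuple notation the updated docs use (the `table` handle, field names, and types are illustrative):
   
   from pyiceberg.types import StringType, StructType
   
   # Create the struct first, in its own transaction.
   with table.update_schema() as update:
       update.add_column("details", StructType())
   
   # Then address nested fields with tuple paths rather than dot notation.
   with table.update_schema() as update:
       update.add_column(("details", "confirmed_by"), StringType())
       update.rename_column(("details", "confirmed_by"), "processed_by")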




