Re: [PR] feat(table): Implement converting Iceberg schema and types to Arrow [iceberg-go]

2024-10-18 Thread via GitHub


nastra commented on PR #168:
URL: https://github.com/apache/iceberg-go/pull/168#issuecomment-2421911918

   @zeroshade can you please rebase due to the merge conflict?





Re: [PR] [KafkaConnect] Fix RecordConverter [iceberg]

2024-10-18 Thread via GitHub


singhpk234 commented on code in PR #11346:
URL: https://github.com/apache/iceberg/pull/11346#discussion_r1806695970


##
orc/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java:
##
@@ -101,8 +101,8 @@ public static OrcValueWriter<ByteBuffer> byteBuffers() {
     return ByteBufferWriter.INSTANCE;
   }
 
-  public static OrcValueWriter<UUID> uuids() {
-    return UUIDWriter.INSTANCE;
+  public static OrcValueWriter<ByteBuffer> uuids() {
+    return ByteBufferWriter.INSTANCE;

Review Comment:
   Need to revisit this.



##
kafka-connect/kafka-connect/src/test/java/org/apache/iceberg/connect/data/RecordConverterTest.java:
##
@@ -921,11 +929,24 @@ private void assertRecordValues(Record record) {
     assertThat(rec.getField("dec")).isEqualTo(DEC_VAL);
     assertThat(rec.getField("s")).isEqualTo(STR_VAL);
     assertThat(rec.getField("b")).isEqualTo(true);
-    assertThat(rec.getField("u")).isEqualTo(UUID_VAL);
-    assertThat(rec.getField("f")).isEqualTo(BYTES_VAL);
+    assertThat(rec.getField("u")).isEqualTo(UUIDUtil.convert(UUID_VAL));
+    assertThat(rec.getField("f")).isEqualTo(BYTES_VAL.array());
     assertThat(rec.getField("bi")).isEqualTo(BYTES_VAL);
     assertThat(rec.getField("li")).isEqualTo(LIST_VAL);
     assertThat(rec.getField("ma")).isEqualTo(MAP_VAL);
+
+    // check by actually writing it
+    for (String format : new String[] {"parquet"}) {
+      IcebergSinkConfig config = mock(IcebergSinkConfig.class);
+      when(config.tableConfig(any())).thenReturn(mock(TableSinkConfig.class));
+      when(config.writeProps()).thenReturn(ImmutableMap.of("write.format.default", format));
+      WriteResult result = writeTest(ImmutableList.of(rec), config, UnpartitionedWriter.class);
+
+      assertThat(result.dataFiles()).hasSize(1);
+      assertThat(result.dataFiles())
+          .allMatch(file -> file.format() == FileFormat.fromString(format));
+      assertThat(result.deleteFiles()).hasSize(0);

Review Comment:
   This test fails before the fix for Parquet, and apparently fixing Parquet made ORC fail because it handles UUIDs differently.
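   For context: the test above now expects a UUID field to come back as its 16-byte big-endian representation. A minimal sketch of that conversion (assumed to mirror what `UUIDUtil.convert(UUID)` returns, based on its use in the test; not the library's actual code):

   ```java
   import java.nio.ByteBuffer;
   import java.util.UUID;

   public class UuidBytesSketch {
     // Serialize a UUID into its canonical 16-byte big-endian form.
     static byte[] toBytes(UUID uuid) {
       ByteBuffer buf = ByteBuffer.allocate(16);
       buf.putLong(uuid.getMostSignificantBits());
       buf.putLong(uuid.getLeastSignificantBits());
       return buf.array();
     }

     public static void main(String[] args) {
       UUID u = UUID.randomUUID();
       System.out.println(toBytes(u).length); // 16
     }
   }
   ```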






[PR] OpenAPI: Update Snapshots to be a common denominator of V1 and V2 specs [iceberg]

2024-10-18 Thread via GitHub


sungwy opened a new pull request, #11353:
URL: https://github.com/apache/iceberg/pull/11353

   In a recent [mail list discussion](https://lists.apache.org/thread/h9qmrmlgxh91ol0y2v8olt90b9q6p9xr), it has come to the community's attention that the OpenAPI spec does not conform to the current table spec in the Snapshot component.
   
   I am putting this PR together to help visualize the changes that would be required if we were to rectify the spec so that it supports both the v1 and v2 table specs, as suggested in the [REST API OpenAPI spec description](https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L30-L30) itself.
   
   The proposed change below removes the `manifest-list` and `summary` fields, which are `optional` in the V1 spec, and adds `manifests` as an optional field (following the V1 spec).





Re: [I] REST Catalog pagination can throw IndexOutOfBoundsException [iceberg]

2024-10-18 Thread via GitHub


munendrasn commented on issue #11142:
URL: https://github.com/apache/iceberg/issues/11142#issuecomment-2422468094

   @rcjverhoef Is this good to be closed as the PR is merged? cc 
@amogh-jahagirdar 





Re: [PR] Spec: add variant type [iceberg]

2024-10-18 Thread via GitHub


aihuaxu commented on PR #10831:
URL: https://github.com/apache/iceberg/pull/10831#issuecomment-2423052011

   > This needs some notes in `Partition Transforms`; I think we should explicitly disallow identity.
   > 
   > For Appendix B, we should define something or state explicitly that we don't define it for variant.
   > 
   > Appendix C: we'll need some details on the JSON serialization, since that's going to have to define some string representations, I think.
   > 
   > Under Sort Orders we should probably note you cannot sort on a Variant?
   > 
   > Appendix D: Single Value Serialization needs an entry; we can probably write "Not Supported" for now. JSON needs an entry.
   
   Thanks @RussellSpitzer, I missed those sections and just updated.
   
   I marked Partition Transforms, sorting, and hashing as not supported/allowed for now.
   For Appendix C, I think it should be just `variant`, similar to a primitive type, since it's the Iceberg schema, as I understand the section.





Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]

2024-10-18 Thread via GitHub


ookumuso commented on code in PR #2:
URL: https://github.com/apache/iceberg/pull/2#discussion_r1806927534


##
core/src/main/java/org/apache/iceberg/LocationProviders.java:
##
@@ -172,10 +193,45 @@ private static String pathContext(String tableLocation) {
     }
 
     private String computeHash(String fileName) {
-      byte[] bytes = TEMP.get();
-      HashCode hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
-      hash.writeBytesTo(bytes, 0, 4);
-      return BASE64_ENCODER.encode(bytes);
+      HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
+
+      // {@link Integer#toBinaryString} excludes leading zeros, which we want to preserve.
+      // force the first bit to be set to get around that.
+      String hashAsBinaryString = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE);
+      return dirsFromHash(hashAsBinaryString.substring(HASH_BINARY_STRING_START_INDEX));
+    }
+
+    /**
+     * Divides hash into directories for optimized orphan removal operation using ENTROPY_DIR_DEPTH
+     * and ENTROPY_DIR_LENGTH
+     *
+     * @param hash 10011001100110011001
+     * @return 1001/1001/1001/10011001 with depth 3 and length 4
+     */
+    private String dirsFromHash(String hash) {
+      if (hash.length() < ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH) {

Review Comment:
   Yeah, maybe I am being a bit over-paranoid here. Given that I have the right unit tests to cover any changes, and the final string is 20 bits long (after addressing the other feedback), maybe I will just remove this condition?
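   For readers following the thread, a minimal standalone sketch of the entropy-directory scheme under discussion (the depth and length values are taken from the javadoc example above; the class and method names are illustrative, not the PR's actual code):

   ```java
   public class EntropyDirsSketch {
     static final int ENTROPY_DIR_DEPTH = 3;  // assumed from the javadoc example
     static final int ENTROPY_DIR_LENGTH = 4; // assumed from the javadoc example

     // Force the top bit so Integer.toBinaryString always yields 32 characters,
     // then keep the lower 20 bits (characters 12..31).
     static String hashBits(int hash) {
       return Integer.toBinaryString(hash | Integer.MIN_VALUE).substring(12);
     }

     // 10011001100110011001 -> 1001/1001/1001/10011001 with depth 3 and length 4
     static String dirsFromHash(String hash) {
       StringBuilder sb = new StringBuilder();
       for (int i = 0; i < ENTROPY_DIR_DEPTH; i++) {
         sb.append(hash, i * ENTROPY_DIR_LENGTH, (i + 1) * ENTROPY_DIR_LENGTH).append('/');
       }
       return sb.append(hash.substring(ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH)).toString();
     }

     public static void main(String[] args) {
       // 0x99999 is the 20-bit pattern from the javadoc example
       System.out.println(dirsFromHash(hashBits(0x99999))); // 1001/1001/1001/10011001
     }
   }
   ```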






Re: [PR] Core: code clean up for deleteRemovedMetadataFiles [iceberg]

2024-10-18 Thread via GitHub


leesf commented on code in PR #11352:
URL: https://github.com/apache/iceberg/pull/11352#discussion_r1806407924


##
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##
@@ -77,6 +83,51 @@ public static TableMetadata newTableMetadata(
     return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties);
   }
 
 
+  /**
+   * Deletes the oldest metadata files if {@link
+   * TableProperties#METADATA_DELETE_AFTER_COMMIT_ENABLED} is true.
+   *
+   * @param io file IO
+   * @param base table metadata on which previous versions were based
+   * @param metadata new table metadata with updated previous versions
+   */
+  public static void deleteRemovedMetadataFiles(

Review Comment:
   @nastra done






Re: [I] REST Catalog pagination can throw IndexOutOfBoundsException [iceberg]

2024-10-18 Thread via GitHub


rcjverhoef closed issue #11142: REST Catalog pagination can throw 
IndexOutOfBoundsException
URL: https://github.com/apache/iceberg/issues/11142





Re: [I] REST Catalog pagination can throw IndexOutOfBoundsException [iceberg]

2024-10-18 Thread via GitHub


rcjverhoef commented on issue #11142:
URL: https://github.com/apache/iceberg/issues/11142#issuecomment-2422470740

   Yes, it is! Let me do so.





Re: [PR] feat(table): Implement converting Iceberg schema and types to Arrow [iceberg-go]

2024-10-18 Thread via GitHub


nastra merged PR #168:
URL: https://github.com/apache/iceberg-go/pull/168





Re: [PR] Spark 3.5: Fix NotSerializableException when migrating partitioned Spark tables [iceberg]

2024-10-18 Thread via GitHub


manuzhang commented on code in PR #11157:
URL: https://github.com/apache/iceberg/pull/11157#discussion_r1806768662


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##
@@ -711,7 +719,7 @@ public static void importSparkPartitions(
         spec,
         stagingDir,
         checkDuplicateFiles,
-        TableMigrationUtil.migrationService(parallelism));
+        executorService(parallelism));

Review Comment:
   We already added `executeWith(ExecutorService service)` to the `SnapshotTable` and `MigrateTable` interfaces, and it's more testable than passing `parallelism`. Hence, I prefer to fix forward if the cost is not high. A sketch of the caller-side usage follows.
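   As a hedged sketch (not code from this PR), calling the API mentioned above might look like this, assuming an existing `SparkSession` and an illustrative table name:

   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import org.apache.iceberg.spark.actions.SparkActions;
   import org.apache.spark.sql.SparkSession;

   class MigrateWithExecutor {
     // Run the migration with an explicitly supplied executor instead of a
     // parallelism count; the caller owns the pool and shuts it down afterwards.
     static void migrate(SparkSession spark) {
       ExecutorService pool = Executors.newFixedThreadPool(8);
       try {
         SparkActions.get(spark)
             .migrateTable("db.source_table") // hypothetical table name
             .executeWith(pool)
             .execute();
       } finally {
         pool.shutdown();
       }
     }
   }
   ```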






Re: [PR] REST: AuthManager API [iceberg]

2024-10-18 Thread via GitHub


danielcweeks commented on code in PR #10753:
URL: https://github.com/apache/iceberg/pull/10753#discussion_r1807055242


##
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:
##
@@ -273,35 +227,11 @@ public void initialize(String name, Map<String, String> unresolved) {
       this.endpoints = ImmutableSet.copyOf(config.endpoints());
     }
 
-    this.sessions = newSessionCache(mergedProps);
-    this.tableSessions = newSessionCache(mergedProps);
-    this.keepTokenRefreshed =
-        PropertyUtil.propertyAsBoolean(
-            mergedProps,
-            OAuth2Properties.TOKEN_REFRESH_ENABLED,
-            OAuth2Properties.TOKEN_REFRESH_ENABLED_DEFAULT);
     this.client = clientBuilder.apply(mergedProps);
     this.paths = ResourcePaths.forCatalogProperties(mergedProps);
 
-    String token = mergedProps.get(OAuth2Properties.TOKEN);
-    this.catalogAuth =
-        new AuthSession(
-            baseHeaders,
-            AuthConfig.builder()
-                .credential(credential)
-                .scope(scope)
-                .oauth2ServerUri(oauth2ServerUri)
-                .optionalOAuthParams(optionalOAuthParams)
-                .build());
-    if (authResponse != null) {
-      this.catalogAuth =
-          AuthSession.fromTokenResponse(
-              client, tokenRefreshExecutor(name), authResponse, startTimeMillis, catalogAuth);
-    } else if (token != null) {
-      this.catalogAuth =
-          AuthSession.fromAccessToken(
-              client, tokenRefreshExecutor(name), token, expiresAtMillis(mergedProps), catalogAuth);
-    }
+    authManager.initialize(name, client, mergedProps);

Review Comment:
   I feel like we should have separate AuthManager instances, not 
double-initialize.



##
core/src/main/java/org/apache/iceberg/rest/auth/AuthManager.java:
##
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest.auth;
+
+import java.util.Map;
+import org.apache.iceberg.catalog.SessionCatalog;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.rest.RESTClient;
+
+/**
+ * Manager for authentication sessions. This interface is used to create sessions for the catalog,
+ * the tables/views, and any other context that requires authentication.
+ *
+ * Managers are usually stateful and may require initialization and cleanup. The manager is
+ * created by the catalog and is closed when the catalog is closed.
+ */
+public interface AuthManager extends AutoCloseable {
+
+  /**
+   * Initializes the manager with the owner of the catalog, the REST client, and the properties
+   * provided in the catalog configuration.
+   *
+   * This method is intended to be called many times, typically once before contacting the config
+   * endpoint, and once after contacting the config endpoint. The properties provided in the catalog
+   * configuration are passed in the first call, and the properties returned by the config endpoint
+   * are passed in the second call, merged with the catalog ones.
+   *
+   * This method cannot return null.
+   */
+  void initialize(String owner, RESTClient client, Map<String, String> properties);

Review Comment:
   I feel like `owner` isn't the right term.  Maybe just `name`?



##
core/src/main/java/org/apache/iceberg/rest/auth/OAuth2Manager.java:
##
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest.auth;

Re: [PR] Arrow: Fix indexing in Parquet dictionary encoded values readers [iceberg]

2024-10-18 Thread via GitHub


wypoon commented on code in PR #11247:
URL: https://github.com/apache/iceberg/pull/11247#discussion_r1807089230


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetDictionaryEncodedVectorizedReads.java:
##
@@ -93,4 +125,64 @@ public void testMixedDictionaryNonDictionaryReads() throws IOException {
         true,
         BATCH_SIZE);
   }
+
+  @Test
+  public void testBinaryNotAllPagesDictionaryEncoded() throws IOException {
+    Schema schema = new Schema(Types.NestedField.required(1, "bytes", Types.BinaryType.get()));
+    File parquetFile = File.createTempFile("junit", null, temp.toFile());
+    assertThat(parquetFile.delete()).as("Delete should succeed").isTrue();
+
+    Iterable<GenericData.Record> records = RandomData.generateFallbackData(schema, 500, 0L, 100);
+    try (FileAppender<GenericData.Record> writer =
+        Parquet.write(Files.localOutput(parquetFile))
+            .schema(schema)
+            .set(PARQUET_DICT_SIZE_BYTES, "4096")
+            .set(PARQUET_PAGE_ROW_LIMIT, "100")
+            .build()) {
+      writer.addAll(records);
+    }
+
+    // After the above, parquetFile contains one column chunk of binary data in five pages,
+    // the first two RLE dictionary encoded, and the remaining three plain encoded.
+    assertRecordsMatch(schema, 500, records, parquetFile, true, BATCH_SIZE);
+  }
+
+  /**
+   * decimal_dict_and_plain_encoding.parquet contains one column chunk of decimal(38, 0) data in two
+   * pages, one RLE dictionary encoded and one plain encoded, each with 200 rows.
+   */
+  @Test
+  public void testDecimalNotAllPagesDictionaryEncoded() throws Exception {
+    Schema schema = new Schema(Types.NestedField.required(1, "id", Types.DecimalType.of(38, 0)));
+    Path path =
+        Paths.get(
+            getClass()
+                .getClassLoader()
+                .getResource("decimal_dict_and_plain_encoding.parquet")
+                .toURI());

Review Comment:
   @amogh-jahagirdar I have already explained this at length and convinced 
@nastra.






Re: [I] Create empty snapshot for metadata operations [iceberg]

2024-10-18 Thread via GitHub


github-actions[bot] closed issue #7075: Create empty snapshot for metadata 
operations
URL: https://github.com/apache/iceberg/issues/7075





Re: [I] Support Parquet v2 Spark vectorized read [iceberg]

2024-10-18 Thread via GitHub


github-actions[bot] closed issue #7162: Support Parquet v2 Spark vectorized read
URL: https://github.com/apache/iceberg/issues/7162





Re: [I] Support Page Skipping in Iceberg Parquet Reader [iceberg]

2024-10-18 Thread via GitHub


github-actions[bot] closed issue #193: Support Page Skipping in Iceberg Parquet 
Reader
URL: https://github.com/apache/iceberg/issues/193





Re: [I] Does JDBC connector uses any retry mechanism? [iceberg]

2024-10-18 Thread via GitHub


github-actions[bot] commented on issue #7173:
URL: https://github.com/apache/iceberg/issues/7173#issuecomment-2423396446

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'





Re: [I] Does JDBC connector uses any retry mechanism? [iceberg]

2024-10-18 Thread via GitHub


github-actions[bot] closed issue #7173: Does JDBC connector uses any retry 
mechanism?
URL: https://github.com/apache/iceberg/issues/7173





Re: [I] org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore at [iceberg]

2024-10-18 Thread via GitHub


github-actions[bot] closed issue #9030: 
org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive 
Metastore at 
URL: https://github.com/apache/iceberg/issues/9030





Re: [I] org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore at [iceberg]

2024-10-18 Thread via GitHub


github-actions[bot] commented on issue #9030:
URL: https://github.com/apache/iceberg/issues/9030#issuecomment-2423396460

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'





Re: [I] Create empty snapshot for metadata operations [iceberg]

2024-10-18 Thread via GitHub


github-actions[bot] commented on issue #7075:
URL: https://github.com/apache/iceberg/issues/7075#issuecomment-2423396430

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'





Re: [I] Support Page Skipping in Iceberg Parquet Reader [iceberg]

2024-10-18 Thread via GitHub


github-actions[bot] commented on issue #193:
URL: https://github.com/apache/iceberg/issues/193#issuecomment-2423396417

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'





Re: [I] Support Parquet v2 Spark vectorized read [iceberg]

2024-10-18 Thread via GitHub


github-actions[bot] commented on issue #7162:
URL: https://github.com/apache/iceberg/issues/7162#issuecomment-2423396436

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'





Re: [I] Integration tests performance degradation [iceberg-python]

2024-10-18 Thread via GitHub


github-actions[bot] commented on issue #604:
URL: https://github.com/apache/iceberg-python/issues/604#issuecomment-2423397846

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.





Re: [I] Inconsistency in catalog.list_tables Behavior Across Python and Java: Returns Non-Iceberg Tables in Python Only [iceberg-python]

2024-10-18 Thread via GitHub


sungwy closed issue #314: Inconsistency in catalog.list_tables Behavior Across 
Python and Java: Returns Non-Iceberg Tables in Python Only
URL: https://github.com/apache/iceberg-python/issues/314





Re: [PR] core: use testcontainers-minio [iceberg]

2024-10-18 Thread via GitHub


sullis commented on PR #11349:
URL: https://github.com/apache/iceberg/pull/11349#issuecomment-2423404051

   GitHub Actions CI passed.





Re: [PR] Glue and Hive catalog return only Iceberg tables [iceberg-python]

2024-10-18 Thread via GitHub


sungwy merged PR #1145:
URL: https://github.com/apache/iceberg-python/pull/1145





Re: [PR] Build: Bump parquet from 1.13.1 to 1.14.3 [iceberg]

2024-10-18 Thread via GitHub


findepi commented on PR #11264:
URL: https://github.com/apache/iceberg/pull/11264#issuecomment-2423244567

   thank you @nastra !





Re: [PR] core: use testcontainers-minio [iceberg]

2024-10-18 Thread via GitHub


sullis commented on code in PR #11349:
URL: https://github.com/apache/iceberg/pull/11349#discussion_r1807054261


##
aws/src/test/java/org/apache/iceberg/aws/s3/signer/TestS3RestSigner.java:
##
@@ -71,11 +73,10 @@ public class TestS3RestSigner {
 
   private static final Region REGION = Region.US_WEST_2;
   private static final String BUCKET = "iceberg-s3-signer-test";
+  private static final MinIOContainer MINIO_CONTAINER = MinioHelper.createContainer();
   static final AwsCredentialsProvider CREDENTIALS_PROVIDER =
       StaticCredentialsProvider.create(
-          AwsBasicCredentials.create("accessKeyId", "secretAccessKey"));
-  private static final MinioContainer MINIO_CONTAINER =
-      new MinioContainer(CREDENTIALS_PROVIDER.resolveCredentials());
+          AwsBasicCredentials.create(MINIO_CONTAINER.getUserName(), MINIO_CONTAINER.getPassword()));

Review Comment:
   1) I have restored virtual-host-style addressing (restored MINIO_DOMAIN)
   
   2) MINIO_ACCESS_KEY is deprecated in favor of MINIO_ROOT_USER
   
   3) MINIO_SECRET_KEY is deprecated in favor of MINIO_ROOT_PASSWORD
   
   source: https://min.io/docs/minio/linux/reference/minio-server/settings/deprecated.html
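   For reference, a minimal sketch of what a testcontainers-minio helper like the `MinioHelper.createContainer()` above might look like (the image tag and credentials here are assumptions, not the PR's actual values):

   ```java
   import org.testcontainers.containers.MinIOContainer;

   public class MinioHelperSketch {
     // Build a MinIO container; user name and password map onto the
     // MINIO_ROOT_USER / MINIO_ROOT_PASSWORD settings mentioned above.
     static MinIOContainer createContainer() {
       return new MinIOContainer("minio/minio:RELEASE.2023-09-04T19-57-37Z")
           .withUserName("admin")
           .withPassword("password");
     }

     public static void main(String[] args) {
       try (MinIOContainer minio = createContainer()) {
         minio.start();
         System.out.println(minio.getS3URL());    // endpoint for the S3 client
         System.out.println(minio.getUserName()); // replaces deprecated MINIO_ACCESS_KEY
         System.out.println(minio.getPassword()); // replaces deprecated MINIO_SECRET_KEY
       }
     }
   }
   ```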
   
   






Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]

2024-10-18 Thread via GitHub


danielcweeks commented on code in PR #2:
URL: https://github.com/apache/iceberg/pull/2#discussion_r1806946923


##
core/src/main/java/org/apache/iceberg/LocationProviders.java:
##
@@ -172,10 +193,45 @@ private static String pathContext(String tableLocation) {
     }
 
     private String computeHash(String fileName) {
-      byte[] bytes = TEMP.get();
-      HashCode hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
-      hash.writeBytesTo(bytes, 0, 4);
-      return BASE64_ENCODER.encode(bytes);
+      HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
+
+      // {@link Integer#toBinaryString} excludes leading zeros, which we want to preserve.
+      // force the first bit to be set to get around that.
+      String hashAsBinaryString = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE);
+      return dirsFromHash(hashAsBinaryString.substring(HASH_BINARY_STRING_START_INDEX));
+    }
+
+    /**
+     * Divides hash into directories for optimized orphan removal operation using ENTROPY_DIR_DEPTH
+     * and ENTROPY_DIR_LENGTH
+     *
+     * @param hash 10011001100110011001
+     * @return 1001/1001/1001/10011001 with depth 3 and length 4
+     */
+    private String dirsFromHash(String hash) {
+      if (hash.length() < ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH) {

Review Comment:
   If it's all internal to this class, I think we can safely assume that `Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE)` will produce the correct number of bits. The check seems unnecessary since these are not user-supplied inputs.






Re: [PR] Flink: make FLIP-27 default in SQL and mark the old FlinkSource as deprecated [iceberg]

2024-10-18 Thread via GitHub


stevenzwu commented on code in PR #11345:
URL: https://github.com/apache/iceberg/pull/11345#discussion_r1806716122


##
flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java:
##
@@ -86,7 +85,6 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-@Experimental

Review Comment:
   I feel it is fine to be just public.






Re: [I] REST: Standardize vended credentials used in loadTable / loadView responses [iceberg]

2024-10-18 Thread via GitHub


nastra closed issue #8: REST: Standardize vended credentials used in 
loadTable / loadView responses
URL: https://github.com/apache/iceberg/issues/8





Re: [PR] OpenAPI: Standardize credentials in loadTable/loadView responses [iceberg]

2024-10-18 Thread via GitHub


nastra merged PR #10722:
URL: https://github.com/apache/iceberg/pull/10722





Re: [PR] Core: Add credentials to loadTable / loadView responses [iceberg]

2024-10-18 Thread via GitHub


nastra merged PR #11173:
URL: https://github.com/apache/iceberg/pull/11173





Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-18 Thread via GitHub


corleyma commented on issue #1229:
URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2422931184

   > Most of the time is spent processing the manifests record-by-record and converting each record to a dict
   
   I haven't looked at this closely, but if memory serves, pyiceberg implements its own avro reader/writer using Cython. Concurrency is great, but I wonder if we can make big gains by implementing a more direct conversion of avro records to pyarrow record batches at that level. Then, processing the manifests could probably be implemented using pyarrow compute functions (C++) for a big performance gain.





Re: [PR] Bump mkdocstrings from 0.26.1 to 0.26.2 [iceberg-python]

2024-10-18 Thread via GitHub


sungwy merged PR #1235:
URL: https://github.com/apache/iceberg-python/pull/1235





Re: [PR] Bump mypy-boto3-glue from 1.35.23 to 1.35.25 [iceberg-python]

2024-10-18 Thread via GitHub


sungwy merged PR #1236:
URL: https://github.com/apache/iceberg-python/pull/1236





Re: [PR] AWS: Update ObjectStorageLocationProvider hash to optimize for S3 performance [iceberg]

2024-10-18 Thread via GitHub


jackye1995 merged PR #2:
URL: https://github.com/apache/iceberg/pull/2





Re: [PR] AWS: Update ObjectStorageLocationProvider hash to optimize for S3 performance [iceberg]

2024-10-18 Thread via GitHub


jackye1995 commented on PR #2:
URL: https://github.com/apache/iceberg/pull/2#issuecomment-2423380108

   Looks like CI has passed, merging. Thanks for the work and patience @ookumuso, and thanks for the review @danielcweeks!





Re: [PR] Spec: add variant type [iceberg]

2024-10-18 Thread via GitHub


flyrain commented on code in PR #10831:
URL: https://github.com/apache/iceberg/pull/10831#discussion_r1807111582


##
format/spec.md:
##
@@ -444,6 +449,9 @@ Sorting floating-point numbers should produce the following behavior: `-NaN` < `
 
 A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
 
+Note:
+
+1. `variant` columns are not valid for sorting.

Review Comment:
   I thought whether a variant is orderable is determined by engines, per previous discussion. Are we explicitly saying that all variants are not orderable?






Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

2024-10-18 Thread via GitHub


kevinjqliu commented on issue #1229:
URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2423385861

   > pyiceberg implements its own avro reader/writer using Cython
   
   Yep, we optimistically build the avro decoder and fall back to pure Python.
   See https://github.com/apache/iceberg-python/issues/1093#issuecomment-2341456013





Re: [PR] Spark: support rewrite on specified target branch [iceberg]

2024-10-18 Thread via GitHub


amitgilad3 commented on PR #8797:
URL: https://github.com/apache/iceberg/pull/8797#issuecomment-2422284309

   Thanks for looking into this @jackye1995, will fix all the comments.





[PR] Custom fileio implementation is now in the docs [iceberg-python]

2024-10-18 Thread via GitHub


sikehish opened a new pull request, #1237:
URL: https://github.com/apache/iceberg-python/pull/1237

   This PR addresses #1233.
   Do let me know if any changes are to be made.





Re: [PR] [KafkaConnect] Fix RecordConverter for UUID and Fixed Types [iceberg]

2024-10-18 Thread via GitHub


singhpk234 commented on code in PR #11346:
URL: https://github.com/apache/iceberg/pull/11346#discussion_r1806749659


##
orc/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java:
##
@@ -101,8 +101,8 @@ public static OrcValueWriter<ByteBuffer> byteBuffers() {
     return ByteBufferWriter.INSTANCE;
   }
 
-  public static OrcValueWriter<UUID> uuids() {
-    return UUIDWriter.INSTANCE;
+  public static OrcValueWriter<ByteBuffer> uuids() {
+    return ByteBufferWriter.INSTANCE;

Review Comment:
   Removed this and handled it in the record converter instead.






Re: [PR] [KafkaConnect] Fix RecordConverter for UUID and Fixed Types [iceberg]

2024-10-18 Thread via GitHub


singhpk234 commented on code in PR #11346:
URL: https://github.com/apache/iceberg/pull/11346#discussion_r1806695581


##
kafka-connect/kafka-connect/src/test/java/org/apache/iceberg/connect/data/RecordConverterTest.java:
##
@@ -921,11 +929,24 @@ private void assertRecordValues(Record record) {
     assertThat(rec.getField("dec")).isEqualTo(DEC_VAL);
     assertThat(rec.getField("s")).isEqualTo(STR_VAL);
     assertThat(rec.getField("b")).isEqualTo(true);
-    assertThat(rec.getField("u")).isEqualTo(UUID_VAL);
-    assertThat(rec.getField("f")).isEqualTo(BYTES_VAL);
+    assertThat(rec.getField("u")).isEqualTo(UUIDUtil.convert(UUID_VAL));
+    assertThat(rec.getField("f")).isEqualTo(BYTES_VAL.array());
     assertThat(rec.getField("bi")).isEqualTo(BYTES_VAL);
     assertThat(rec.getField("li")).isEqualTo(LIST_VAL);
     assertThat(rec.getField("ma")).isEqualTo(MAP_VAL);
+
+    // check by actually writing it
+    for (String format : new String[] {"parquet"}) {
+      IcebergSinkConfig config = mock(IcebergSinkConfig.class);
+      when(config.tableConfig(any())).thenReturn(mock(TableSinkConfig.class));
+      when(config.writeProps()).thenReturn(ImmutableMap.of("write.format.default", format));
+      WriteResult result = writeTest(ImmutableList.of(rec), config, UnpartitionedWriter.class);
+
+      assertThat(result.dataFiles()).hasSize(1);
+      assertThat(result.dataFiles())
+          .allMatch(file -> file.format() == FileFormat.fromString(format));
+      assertThat(result.deleteFiles()).hasSize(0);

Review Comment:
   This test fails before the fix for Parquet.






Re: [PR] docs/configuration.md: Documented table properties (#1231) [iceberg-python]

2024-10-18 Thread via GitHub


mths1 commented on PR #1232:
URL: https://github.com/apache/iceberg-python/pull/1232#issuecomment-2422796878

   Hi all
   
   I was trying 'target-file-size-bytes' lately, and to my understanding, in the pyiceberg version we were using it somehow violates the principle of least surprise. As far as I understand, in pyiceberg it is not the file size on disk but the size _in memory_. A target-file-size-bytes of 512 MB resulted for us in files of 20 MB on disk. This caused a lot of trouble for us (first in understanding it, and secondly because other tools now pick the wrong value from the metadata). If I am not mistaken, it would be great to document that behaviour, as it is not quite intuitive.





[PR] Custom fileio docs [iceberg-python]

2024-10-18 Thread via GitHub


sikehish opened a new pull request, #1238:
URL: https://github.com/apache/iceberg-python/pull/1238

   This PR addresses https://github.com/apache/iceberg-python/issues/1233.
   Do let me know if any changes are to be made.





Re: [I] Document Custom FileIO [iceberg-python]

2024-10-18 Thread via GitHub


sikehish commented on issue #1233:
URL: https://github.com/apache/iceberg-python/issues/1233#issuecomment-2422849621

   > Assigned to you. I think we can add it under [the `FileIO` 
section](https://py.iceberg.apache.org/configuration/#fileio) as something like 
"Custom FileIO Implementations"
   
   Hi @kevinjqliu. I've added a custom FileIO implementation section to the docs. I'm not sure whether what I've added is what you're looking for. Do let me know if any changes are to be made :)





Re: [PR] Flink: disable the flaky range distribution bucketing tests for now [iceberg]

2024-10-18 Thread via GitHub


stevenzwu merged PR #11347:
URL: https://github.com/apache/iceberg/pull/11347





Re: [PR] Spark: Add RewriteTablePath action interface [iceberg]

2024-10-18 Thread via GitHub


szehon-ho merged PR #10920:
URL: https://github.com/apache/iceberg/pull/10920





Re: [PR] Spark: Add RewriteTablePath action interface [iceberg]

2024-10-18 Thread via GitHub


szehon-ho commented on PR #10920:
URL: https://github.com/apache/iceberg/pull/10920#issuecomment-2422978173

   Merged, thanks @laithalzyoud





Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]

2024-10-18 Thread via GitHub


danielcweeks commented on code in PR #2:
URL: https://github.com/apache/iceberg/pull/2#discussion_r1806878880


##
core/src/main/java/org/apache/iceberg/LocationProviders.java:
##
@@ -108,10 +108,15 @@ public String newDataLocation(String filename) {
   static class ObjectStoreLocationProvider implements LocationProvider {
 
     private static final HashFunction HASH_FUNC = Hashing.murmur3_32_fixed();
-    private static final BaseEncoding BASE64_ENCODER = BaseEncoding.base64Url().omitPadding();
-    private static final ThreadLocal<byte[]> TEMP = ThreadLocal.withInitial(() -> new byte[4]);
+    // the starting index of the lower 20-bits of a 32-bit binary string
+    private static final int HASH_BINARY_STRING_START_INDEX = 12;

Review Comment:
   Can we flip this to be `HASH_BINARY_STRING_BITS = 20`, since that's more applicable to what we're constructing? The start index is confusing because it only has meaning in the context of the length of what we're skipping from the bit field.
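   A small sketch of what that flip could look like (a fragment; the constant name and the 20-bit value are assumed from this thread, and it is equivalent to taking `substring(12)` of the 32-character binary string):

   ```java
   // Keep only the lower HASH_BINARY_STRING_BITS bits of the 32-bit hash.
   private static final int HASH_BINARY_STRING_BITS = 20;

   private String entropyBits(int hash) {
     // Setting the top bit guarantees a full 32-character binary string.
     String binary = Integer.toBinaryString(hash | Integer.MIN_VALUE);
     return binary.substring(binary.length() - HASH_BINARY_STRING_BITS);
   }
   ```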






Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]

2024-10-18 Thread via GitHub


danielcweeks commented on code in PR #2:
URL: https://github.com/apache/iceberg/pull/2#discussion_r1806873426


##
core/src/main/java/org/apache/iceberg/LocationProviders.java:
##
@@ -172,10 +193,45 @@ private static String pathContext(String tableLocation) {
 }
 
 private String computeHash(String fileName) {
-  byte[] bytes = TEMP.get();
-  HashCode hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
-  hash.writeBytesTo(bytes, 0, 4);
-  return BASE64_ENCODER.encode(bytes);
+  HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
+
+  // {@link Integer#toBinaryString} excludes leading zeros, which we want to preserve.
+  // force the first bit to be set to get around that.
+  String hashAsBinaryString = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE);
+  return dirsFromHash(hashAsBinaryString.substring(HASH_BINARY_STRING_START_INDEX));
+}
+
+/**
+ * Divides hash into directories for optimized orphan removal operation using ENTROPY_DIR_DEPTH
+ * and ENTROPY_DIR_LENGTH
+ *
+ * @param hash 10011001100110011001
+ * @return 1001/1001/1001/10011001 with depth 3 and length 4
+ */
+private String dirsFromHash(String hash) {
+  if (hash.length() < ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH) {

Review Comment:
   We should use a `Preconditions.check` here. Also, we're doing a lot of math here, but don't we always expect the hash to be 20 characters? Why not just make that the requirement?
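
   A sketch of what such a guard could look like, assuming Guava's `Preconditions` (already used across Iceberg) and the `HASH_BINARY_STRING_BITS = 20` constant suggested here; the depth/length constants mirror the PR discussion, but the body is illustrative, not the PR's actual code:

   ```
   import com.google.common.base.Preconditions;

   public class DirsFromHashSketch {
     private static final int HASH_BINARY_STRING_BITS = 20;
     private static final int ENTROPY_DIR_DEPTH = 3;
     private static final int ENTROPY_DIR_LENGTH = 4;

     static String dirsFromHash(String hash) {
       Preconditions.checkArgument(
           hash.length() == HASH_BINARY_STRING_BITS,
           "Hash binary string must have %s bits, but was: %s", HASH_BINARY_STRING_BITS, hash);
       StringBuilder sb = new StringBuilder();
       for (int i = 0; i < ENTROPY_DIR_DEPTH; i++) {
         // one fixed-length directory per level: 1001/1001/1001/...
         sb.append(hash, i * ENTROPY_DIR_LENGTH, (i + 1) * ENTROPY_DIR_LENGTH).append('/');
       }
       // the remaining bits form the last path segment
       sb.append(hash.substring(ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH));
       return sb.toString();
     }

     public static void main(String[] args) {
       System.out.println(dirsFromHash("10011001100110011001"));  // 1001/1001/1001/10011001
     }
   }
   ```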






Re: [PR] Spark: Add RewriteTablePath action interface [iceberg]

2024-10-18 Thread via GitHub


flyrain commented on PR #10920:
URL: https://github.com/apache/iceberg/pull/10920#issuecomment-2423035986

   Great! Thanks a lot for working on it, guys!





Re: [PR] Custom fileio implementation is now in the docs [iceberg-python]

2024-10-18 Thread via GitHub


sikehish closed pull request #1237: Custom fileio implementation is now in the 
docs
URL: https://github.com/apache/iceberg-python/pull/1237





Re: [PR] Spark: Add RewriteTablePath action interface [iceberg]

2024-10-18 Thread via GitHub


laithalzyoud commented on PR #10920:
URL: https://github.com/apache/iceberg/pull/10920#issuecomment-2422990247

   Thanks @szehon-ho! Will follow up with the implementation PR soon 👍 





Re: [PR] Spark 3.5: Fix NotSerializableException when migrating partitioned Spark tables [iceberg]

2024-10-18 Thread via GitHub


manuzhang commented on code in PR #11157:
URL: https://github.com/apache/iceberg/pull/11157#discussion_r1806768662


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:
##
@@ -711,7 +719,7 @@ public static void importSparkPartitions(
 spec,
 stagingDir,
 checkDuplicateFiles,
-TableMigrationUtil.migrationService(parallelism));
+executorService(parallelism));

Review Comment:
   We already added `executeWith(ExecutorService service)` to the `SnapshotTable` and `MigrateTable` interfaces, and it's more testable than passing `parallelism`. Hence, I prefer to fix forward if the cost isn't high.
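
   For context, a usage sketch of the `executeWith` style referred to above; the pool size and table identifier are placeholders:

   ```
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import org.apache.iceberg.spark.actions.SparkActions;
   import org.apache.spark.sql.SparkSession;

   public class MigrateWithPool {
     static void migrate(SparkSession spark) {
       ExecutorService pool = Executors.newFixedThreadPool(8);  // placeholder size
       try {
         SparkActions.get(spark)
             .migrateTable("db.sample_table")  // placeholder identifier
             .executeWith(pool)  // inject the executor instead of a raw parallelism int
             .execute();
       } finally {
         pool.shutdown();
       }
     }
   }
   ```

   Injecting the executor lets a test pass a deterministic executor, which is the testability advantage mentioned above.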






Re: [PR] Glue and Hive catalog return only Iceberg tables [iceberg-python]

2024-10-18 Thread via GitHub


mark-major commented on code in PR #1145:
URL: https://github.com/apache/iceberg-python/pull/1145#discussion_r1806419626


##
pyiceberg/catalog/dynamodb.py:
##
@@ -393,7 +393,7 @@ def drop_namespace(self, namespace: Union[str, Identifier]) -> None:
 raise NoSuchNamespaceError(f"Database does not exist: {database_name}") from e
 
 def list_tables(self, namespace: Union[str, Identifier]) -> List[Identifier]:
-"""List tables under the given namespace in the catalog (including non-Iceberg tables).
+"""List Iceberg tables under the given namespace in the catalog.

Review Comment:
   It has been quite a long time since I worked on this, but if I recall correctly then yes, DynamoDB returned only Iceberg tables.






Re: [PR] Spec: Support geo type [iceberg]

2024-10-18 Thread via GitHub


desruisseaux commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1806416230


##
format/spec.md:
##
@@ -198,6 +199,9 @@ Notes:
 - Timestamp values _with time zone_ represent a point in time: values are 
stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` 
is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are 
considered identical).
 - Timestamp values _without time zone_ represent a date and time of day 
regardless of zone: the time value is independent of zone adjustments 
(`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
 3. Character strings must be stored as UTF-8 encoded byte arrays.
+4. Coordinate Reference System, i.e. mapping of how coordinates refer to 
precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be 
changed by schema evolution.

Review Comment:
   > Coordinate Reference System, i.e. mapping of how coordinates refer to 
precise locations on earth. Defaults to "OGC:CRS84".
   
   Maybe remove the word "precise"? Not all CRS are precise. The proposed default, `OGC:CRS84`, has an uncertainty of about 2 meters (unrelated to floating-point precision). Maybe it would be worth changing the sentence to: _Defaults to "OGC:CRS84", which provides an accuracy of about 2 meters._
   
   Another reason for removing the "precise" word is that the CRS alone is not 
sufficient for really precise locations. We also need the epoch, which is a 
topic that has not been covered at all in this PR.






Re: [PR] Flink: make FLIP-27 default in SQL and mark the old FlinkSource as deprecated [iceberg]

2024-10-18 Thread via GitHub


stevenzwu commented on PR #11345:
URL: https://github.com/apache/iceberg/pull/11345#issuecomment-2422803067

   thanks @pvary for the review





Re: [PR] Flink: make FLIP-27 default in SQL and mark the old FlinkSource as deprecated [iceberg]

2024-10-18 Thread via GitHub


stevenzwu merged PR #11345:
URL: https://github.com/apache/iceberg/pull/11345





Re: [PR] feat(table): Implement converting Iceberg schema and types to Arrow [iceberg-go]

2024-10-18 Thread via GitHub


zeroshade commented on PR #168:
URL: https://github.com/apache/iceberg-go/pull/168#issuecomment-2422814408

   @nastra rebased! Thanks





Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]

2024-10-18 Thread via GitHub


danielcweeks commented on code in PR #2:
URL: https://github.com/apache/iceberg/pull/2#discussion_r1806946923


##
core/src/main/java/org/apache/iceberg/LocationProviders.java:
##
@@ -172,10 +193,45 @@ private static String pathContext(String tableLocation) {
 }
 
 private String computeHash(String fileName) {
-  byte[] bytes = TEMP.get();
-  HashCode hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
-  hash.writeBytesTo(bytes, 0, 4);
-  return BASE64_ENCODER.encode(bytes);
+  HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
+
+  // {@link Integer#toBinaryString} excludes leading zeros, which we want to preserve.
+  // force the first bit to be set to get around that.
+  String hashAsBinaryString = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE);
+  return dirsFromHash(hashAsBinaryString.substring(HASH_BINARY_STRING_START_INDEX));
+}
+
+/**
+ * Divides hash into directories for optimized orphan removal operation using ENTROPY_DIR_DEPTH
+ * and ENTROPY_DIR_LENGTH
+ *
+ * @param hash 10011001100110011001
+ * @return 1001/1001/1001/10011001 with depth 3 and length 4
+ */
+private String dirsFromHash(String hash) {
+  if (hash.length() < ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH) {

Review Comment:
   If it's all internal to this class, I think we can safely assume that 
`Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE)` will produce the 
correct number of bits.  The check seems unnecessary since it's not user 
supplied inputs.
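
   An end-to-end sketch of the hashing scheme under discussion, using only Guava and the JDK (the constants and file name are illustrative, not the PR's code):

   ```
   import com.google.common.hash.HashCode;
   import com.google.common.hash.Hashing;
   import java.nio.charset.StandardCharsets;

   public class ObjectStoreHashDemo {
     public static void main(String[] args) {
       HashCode hashCode =
           Hashing.murmur3_32_fixed().hashString("data-file-0001.parquet", StandardCharsets.UTF_8);
       // forcing bit 31 guarantees a 32-char binary string, so the substring
       // below always yields exactly 20 bits, as the comment above assumes
       String bits = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE).substring(12);
       StringBuilder dirs = new StringBuilder();
       for (int i = 0; i < 3; i++) {  // depth 3, length 4
         dirs.append(bits, i * 4, (i + 1) * 4).append('/');
       }
       dirs.append(bits.substring(12));  // trailing 8 bits
       System.out.println(dirs);  // a path like 0110/0111/0100/01100110 (depends on the name)
     }
   }
   ```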






Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]

2024-10-18 Thread via GitHub


ookumuso commented on code in PR #2:
URL: https://github.com/apache/iceberg/pull/2#discussion_r1806988984


##
core/src/main/java/org/apache/iceberg/LocationProviders.java:
##
@@ -172,10 +193,45 @@ private static String pathContext(String tableLocation) {
 }
 
 private String computeHash(String fileName) {
-  byte[] bytes = TEMP.get();
-  HashCode hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
-  hash.writeBytesTo(bytes, 0, 4);
-  return BASE64_ENCODER.encode(bytes);
+  HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
+
+  // {@link Integer#toBinaryString} excludes leading zeros, which we want to preserve.
+  // force the first bit to be set to get around that.
+  String hashAsBinaryString = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE);
+  return dirsFromHash(hashAsBinaryString.substring(HASH_BINARY_STRING_START_INDEX));
+}
+
+/**
+ * Divides hash into directories for optimized orphan removal operation using ENTROPY_DIR_DEPTH
+ * and ENTROPY_DIR_LENGTH
+ *
+ * @param hash 10011001100110011001
+ * @return 1001/1001/1001/10011001 with depth 3 and length 4
+ */
+private String dirsFromHash(String hash) {
+  if (hash.length() < ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH) {

Review Comment:
   Done






Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]

2024-10-18 Thread via GitHub


ookumuso commented on code in PR #2:
URL: https://github.com/apache/iceberg/pull/2#discussion_r1806988739


##
core/src/main/java/org/apache/iceberg/LocationProviders.java:
##
@@ -108,10 +108,15 @@ public String newDataLocation(String filename) {
   static class ObjectStoreLocationProvider implements LocationProvider {
 
 private static final HashFunction HASH_FUNC = Hashing.murmur3_32_fixed();
-private static final BaseEncoding BASE64_ENCODER = BaseEncoding.base64Url().omitPadding();
-private static final ThreadLocal<byte[]> TEMP = ThreadLocal.withInitial(() -> new byte[4]);
+// the starting index of the lower 20-bits of a 32-bit binary string
+private static final int HASH_BINARY_STRING_START_INDEX = 12;

Review Comment:
   Done






Re: [PR] Flink: Add RowConverter for Iceberg Source [iceberg]

2024-10-18 Thread via GitHub


abharath9 commented on code in PR #11301:
URL: https://github.com/apache/iceberg/pull/11301#discussion_r1807179938


##
flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/source/reader/RowConverter.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink.source.reader;
+
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.data.conversion.DataStructureConverter;
+import org.apache.flink.table.data.conversion.DataStructureConverters;
+import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.flink.table.types.logical.RowType;
+import org.apache.flink.table.types.utils.TypeConversions;
+import org.apache.flink.types.Row;
+import org.apache.iceberg.flink.FlinkSchemaUtil;
+import org.jetbrains.annotations.NotNull;
+
+public class RowConverter implements RowDataConverter<Row> {

Review Comment:
   Thanks for the feedback. Do you mean changing this to a RowData-to-Row mapper?
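
   For reference, a minimal sketch of a RowData-to-Row conversion using Flink's built-in converters; the row type here is a stand-in, not the PR's code:

   ```
   import org.apache.flink.table.data.GenericRowData;
   import org.apache.flink.table.data.RowData;
   import org.apache.flink.table.data.StringData;
   import org.apache.flink.table.data.conversion.DataStructureConverter;
   import org.apache.flink.table.data.conversion.DataStructureConverters;
   import org.apache.flink.table.types.logical.IntType;
   import org.apache.flink.table.types.logical.RowType;
   import org.apache.flink.table.types.logical.VarCharType;
   import org.apache.flink.table.types.utils.TypeConversions;
   import org.apache.flink.types.Row;

   public class RowDataToRowDemo {
     public static void main(String[] args) {
       RowType rowType = RowType.of(new IntType(), new VarCharType(VarCharType.MAX_LENGTH));
       DataStructureConverter<Object, Object> converter =
           DataStructureConverters.getConverter(TypeConversions.fromLogicalToDataType(rowType));
       converter.open(Thread.currentThread().getContextClassLoader());

       RowData internal = GenericRowData.of(1, StringData.fromString("a"));
       Row external = (Row) converter.toExternal(internal);
       System.out.println(external);  // +I[1, a]
     }
   }
   ```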






[PR] Build: Bump parquet from 1.13.1 to 1.14.3 [iceberg]

2024-10-18 Thread via GitHub


dependabot[bot] opened a new pull request, #11264:
URL: https://github.com/apache/iceberg/pull/11264

   Bumps `parquet` from 1.13.1 to 1.14.3.
   Updates `org.apache.parquet:parquet-avro` from 1.13.1 to 1.14.3

   Release notes, sourced from org.apache.parquet:parquet-avro's releases
   (https://github.com/apache/parquet-mr/releases):

   Apache Parquet Java 1.14.3
   What's Changed

   - GH-3007: Ensure version specific Jackson classes are shaded
   - GH-3013: Fix potential ClassCastException at reading DELTA_BYTE_ARRAY encoding
   - GH-3021: Upgrade Avro dependency to 1.11.4

   Apache Parquet Java 1.14.2
   What's Changed

   - GH-2948: Fix NPE when using the AvroParquetReader.Builder with LocalInputFile
   - GH-2956: Use avro SchemaBuilder API to convert record
   - GH-2935: Avoid double close of ParquetFileWriter
   - GH-2992: Gate LocalTimestamp references in AvroSchemaConverter
   - PARQUET-1126: Write unencrypted Parquet files without Hadoop
   - PARQUET-2472: Close in finally block in ParquetFileWriter#end

   Apache Parquet 1.14.1
   What's Changed

   - Update changes for the parquet-1.14.x branch by @Fokko in apache/parquet-java#1347

   ... (truncated)

   Changelog, sourced from org.apache.parquet:parquet-avro's CHANGES.md
   (https://github.com/apache/parquet-java/blob/master/CHANGES.md):

   Version 1.14.1
   Bug

   - PARQUET-2468 - ParquetMetadata.toPrettyJSON throws exception on file read when LOG.isDebugEnabled()
   - PARQUET-2498 - Hadoop vector IO API doesn't handle empty list of ranges

   Version 1.14.0
   Bug

   - PARQUET-2260 - Bloom filter bytes size shouldn't be larger than maxBytes size in the configuration
   ... (truncated)

Re: [PR] Spec: Support geo type [iceberg]

2024-10-18 Thread via GitHub


dmeaux commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1805976050


##
format/spec.md:
##
@@ -483,6 +485,8 @@ Notes:
 2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the 
IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper 
bounds.
 3. If sort order ID is missing or unknown, then the order is assumed to be 
unsorted. Only data files and equality delete files should be written with a 
non-null order id. [Position deletes](#position-delete-files) are required to 
be sorted by file and position, not a table order, and should set sort order id 
to null. Readers must ignore sort order id for position delete files.
 4. The following field ids are reserved on `data_file`: 141.
+5. For `geometry`, this is a point. X = min value of all component points of 
all geometries in file when edges = PLANAR, westernmost bound of all geometries 
in file if edges = SPHERICAL. Y = max value of all component points of all 
geometries in file when edges = PLANAR, northernmost bound of all geometries if 
edges = SPHERICAL. Z is min value for all component points of all geometries in 
the file. M is min value of all component points of all geometries in the file. 
See Appendix D for encoding.

Review Comment:
   Personally, I'm good with geodesic; however, there were others that brought 
up concerns with that term in the GeoParquet group meetup and were more 
comfortable starting with pseudo-geodesic. I don't remember the exact details 
of the issue. Perhaps @paleolimbot, @jiayuasu, and/or @desruisseaux can chime 
in with their perspectives on it.
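
   In compact form, the planar-edge case of note 5 quoted above is (with $F$ the geometries in the file and $p$ ranging over their component points):

   $$x_{\text{lower}} = \min_{g \in F}\,\min_{p \in g} p.x,\qquad y_{\text{upper}} = \max_{g \in F}\,\max_{p \in g} p.y\qquad(\text{edges} = \text{PLANAR})$$

   with the spherical case substituting westernmost and northernmost bounds.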






Re: [PR] Build: Bump parquet from 1.13.1 to 1.14.3 [iceberg]

2024-10-18 Thread via GitHub


nastra closed pull request #11264: Build: Bump parquet from 1.13.1 to 1.14.3
URL: https://github.com/apache/iceberg/pull/11264





Re: [PR] Core: delete temp metadata file when version already exists [iceberg]

2024-10-18 Thread via GitHub


nastra commented on code in PR #11350:
URL: https://github.com/apache/iceberg/pull/11350#discussion_r1806163830


##
core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java:
##
@@ -368,7 +368,13 @@ private void renameToFinal(FileSystem fs, Path src, Path dst, int nextVersion) {
   }
 
   if (fs.exists(dst)) {
-throw new CommitFailedException("Version %d already exists: %s", nextVersion, dst);
+CommitFailedException cfe =
+new CommitFailedException("Version %d already exists: %s", nextVersion, dst);
+RuntimeException re = tryDelete(src);

Review Comment:
   is it always correct to delete the src file or could there be cases where 
we'd want to keep it?






Re: [PR] Build: Bump parquet from 1.13.1 to 1.14.3 [iceberg]

2024-10-18 Thread via GitHub


nastra commented on code in PR #11264:
URL: https://github.com/apache/iceberg/pull/11264#discussion_r1806145764


##
flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestMetadataTableReadableMetrics.java:
##
@@ -217,27 +217,27 @@ public void testPrimitiveColumns() throws Exception {
 
 Row binaryCol =
 Row.of(
-52L,
+55L,

Review Comment:
   looks like column sizes got slightly larger:
   
   
   Parquet 1.13.1
   ```
   % parquet-cli/1.14.3/bin/parquet column-size old-version.parquet
   binaryCol-> Size In Bytes: 52 Size In Ratio: 0.08695652
   intCol-> Size In Bytes: 71 Size In Ratio: 0.1187291
   decimalCol-> Size In Bytes: 85 Size In Ratio: 0.14214046
   fixedCol-> Size In Bytes: 44 Size In Ratio: 0.073578596
   booleanCol-> Size In Bytes: 32 Size In Ratio: 0.053511705
   stringCol-> Size In Bytes: 79 Size In Ratio: 0.13210702
   floatCol-> Size In Bytes: 71 Size In Ratio: 0.1187291
   longCol-> Size In Bytes: 79 Size In Ratio: 0.13210702
   doubleCol-> Size In Bytes: 85 Size In Ratio: 0.14214046
   ```
   
   Parquet 1.14.3
   ```
   % parquet-cli/1.14.3/bin/parquet column-size new-version.parquet
   binaryCol-> Size In Bytes: 55 Size In Ratio: 0.085403726
   intCol-> Size In Bytes: 77 Size In Ratio: 0.11956522
   decimalCol-> Size In Bytes: 91 Size In Ratio: 0.14130434
   fixedCol-> Size In Bytes: 47 Size In Ratio: 0.072981365
   booleanCol-> Size In Bytes: 36 Size In Ratio: 0.055900622
   stringCol-> Size In Bytes: 85 Size In Ratio: 0.13198757
   floatCol-> Size In Bytes: 77 Size In Ratio: 0.11956522
   longCol-> Size In Bytes: 85 Size In Ratio: 0.13198757
   doubleCol-> Size In Bytes: 91 Size In Ratio: 0.14130434
   ```






Re: [PR] Revert "Support wasb[s] paths in ADLSFileIO" [iceberg]

2024-10-18 Thread via GitHub


nastra merged PR #11344:
URL: https://github.com/apache/iceberg/pull/11344





Re: [I] Add SparkSessionCatalog support for views [iceberg]

2024-10-18 Thread via GitHub


nqvuong1998 commented on issue #9845:
URL: https://github.com/apache/iceberg/issues/9845#issuecomment-2421995463

   Hi @nastra, any updates on this issue?





Re: [I] Lack of integration tests for SQL catalog implementation [iceberg-rust]

2024-10-18 Thread via GitHub


liurenjie1024 commented on issue #673:
URL: https://github.com/apache/iceberg-rust/issues/673#issuecomment-2422012602

   Thanks @callum-ryan!





[PR] Core: code clean up for deleteRemovedMetadataFiles [iceberg]

2024-10-18 Thread via GitHub


leesf opened a new pull request, #11352:
URL: https://github.com/apache/iceberg/pull/11352

   (no comment)





Re: [PR] Core: code clean up for deleteRemovedMetadataFiles [iceberg]

2024-10-18 Thread via GitHub


nastra commented on code in PR #11352:
URL: https://github.com/apache/iceberg/pull/11352#discussion_r1806159950


##
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##
@@ -77,6 +83,51 @@ public static TableMetadata newTableMetadata(
 return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties);
   }
 
+  /**
+   * Deletes the oldest metadata files if {@link
+   * TableProperties#METADATA_DELETE_AFTER_COMMIT_ENABLED} is true.
+   *
+   * @param io file IO
+   * @param base table metadata on which previous versions were based
+   * @param metadata new table metadata with updated previous versions
+   */
+  public static void deleteRemovedMetadataFiles(

Review Comment:
   I don't think we should be moving this to `TableMetadata`as this adds 
unnecessary dependencies to `TableMetadata` (`FileIO` and 
`SupportsBulkOperations`). 
   @leesf what's the goal of this PR and why do we want to move this method 
here?






Re: [PR] core: use testcontainers-minio [iceberg]

2024-10-18 Thread via GitHub


nastra commented on code in PR #11349:
URL: https://github.com/apache/iceberg/pull/11349#discussion_r1806186638


##
aws/src/test/java/org/apache/iceberg/aws/s3/signer/TestS3RestSigner.java:
##
@@ -71,11 +73,10 @@ public class TestS3RestSigner {
 
   private static final Region REGION = Region.US_WEST_2;
   private static final String BUCKET = "iceberg-s3-signer-test";
+  private static final MinIOContainer MINIO_CONTAINER = MinioHelper.createContainer();
   static final AwsCredentialsProvider CREDENTIALS_PROVIDER =
   StaticCredentialsProvider.create(
-  AwsBasicCredentials.create("accessKeyId", "secretAccessKey"));
-  private static final MinioContainer MINIO_CONTAINER =
-  new MinioContainer(CREDENTIALS_PROVIDER.resolveCredentials());
+  AwsBasicCredentials.create(MINIO_CONTAINER.getUserName(), MINIO_CONTAINER.getPassword()));

Review Comment:
   The old `MinioContainer` was setting multiple env vars (MINIO_ACCESS_KEY / 
MINIO_SECRET_KEY) for this and also configured virtual-host-style addressing, 
which we should continue to do
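
   A sketch of what such a helper could configure, assuming testcontainers' `MinIOContainer`; the image tag and credentials are placeholders, and `MINIO_DOMAIN` is the MinIO setting that enables virtual-host-style addressing:

   ```
   import org.testcontainers.containers.MinIOContainer;

   public class MinioHelperSketch {
     static MinIOContainer createContainer() {
       return new MinIOContainer("minio/minio:latest")
           .withUserName("accessKeyId")       // becomes the S3 access key
           .withPassword("secretAccessKey")   // becomes the S3 secret key
           .withEnv("MINIO_DOMAIN", "localhost");  // virtual-host-style addressing
     }
   }
   ```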






Re: [PR] Spec: Support geo type [iceberg]

2024-10-18 Thread via GitHub


desruisseaux commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1806236061


##
format/spec.md:
##
@@ -198,6 +199,9 @@ Notes:
 - Timestamp values _with time zone_ represent a point in time: values are 
stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` 
is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are 
considered identical).
 - Timestamp values _without time zone_ represent a date and time of day 
regardless of zone: the time value is independent of zone adjustments 
(`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
 3. Character strings must be stored as UTF-8 encoded byte arrays.
+4. Coordinate Reference System, i.e. mapping of how coordinates refer to 
precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be 
changed by schema evolution.

Review Comment:
   Note that this example is non-standard. A CRS JSON encoding is in progress 
at OGC, but we don't know yet what it will look like. The only standards at 
this time are GML, WKT, or simply putting a reference to the EPSG database or OGC 
registry (e.g. https://www.opengis.net/def/crs/OGC/0/CRS84 or 
`"urn:ogc:def:crs:OGC::CRS84"`).






Re: [PR] Core: code clean up for deleteRemovedMetadataFiles [iceberg]

2024-10-18 Thread via GitHub


leesf commented on code in PR #11352:
URL: https://github.com/apache/iceberg/pull/11352#discussion_r1806236088


##
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##
@@ -77,6 +83,51 @@ public static TableMetadata newTableMetadata(
 return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties);
   }
 
+  /**
+   * Deletes the oldest metadata files if {@link
+   * TableProperties#METADATA_DELETE_AFTER_COMMIT_ENABLED} is true.
+   *
+   * @param io file IO
+   * @param base table metadata on which previous versions were based
+   * @param metadata new table metadata with updated previous versions
+   */
+  public static void deleteRemovedMetadataFiles(

Review Comment:
   @nastra currently the `deleteRemovedMetadataFiles` method lives in both `HadoopTableOperations` and `BaseMetastoreTableOperations`, which causes duplicate code. Moving the method here was what I had in mind; or do you have a better idea?
   
   






Re: [PR] Spec: Support geo type [iceberg]

2024-10-18 Thread via GitHub


desruisseaux commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1806246237


##
format/spec.md:
##
@@ -200,12 +200,16 @@ Supported primitive types are defined in the table below. 
Primitive types added
 |  | **`uuid`** | Universally unique identifiers   
| Should use 16-byte fixed  
   |
 |  | **`fixed(L)`** | Fixed-length byte array of length L  
|   
   |
 |  | **`binary`**   | Arbitrary-length byte array  
|   
   |
+| [v3](#version-3) | **`geometry(C, CE, E)`** | An object of the simple 
feature geometry model as defined by Appendix G; This may be any of the 
geometry subclasses defined therein; crs C [4], crs-encoding CE [5], edges E 
[6] | C, CE, E are fixed, and if unset will take default values. Encoded as 
WKB, see Appendix G. |
 
 Notes:
 
 1. Timestamp values _without time zone_ represent a date and time of day 
regardless of zone: the time value is independent of zone adjustments 
(`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
 2. Timestamp values _with time zone_ represent a point in time: values are 
stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` 
is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are 
considered identical).
 3. Character strings must be stored as UTF-8 encoded byte arrays.
+4. Crs (coordinate reference system) is a mapping of how coordinates refer to 
precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be 
changed by schema evolution.
+5. Crs-encoding (coordinate reference system encoding) is the type of crs 
field. Must be set if crs is set. Defaults to "PROJJSON". Fixed and cannot be 
changed by schema evolution.
+6. Edges is the interpretation for non-point geometries in geometry object, 
i.e. whether an edge between points represent a straight cartesian line or the 
shortest line on the sphere. Defaults to "planar". Fixed and cannot be changed 
by schema evolution.

Review Comment:
   > Crs-encoding (coordinate reference system encoding) is the type of crs 
field.
   
   * Please change "is the type" to "is the encoding". The CRS type is something completely different from the information put in this field.
   * (minor detail) "crs" should be upper-case "CRS" as it is an acronym for "Coordinate Reference System".
   
   > Defaults to "PROJJSON".
   
   I suggest removing this specification. PROJJSON is non-standard and may never be (we don't know yet what OGC's decision will be). I don't think that we want to tie Iceberg forever to a format specific to one project.






Re: [PR] Core: delete temp metadata file when version already exists [iceberg]

2024-10-18 Thread via GitHub


leesf commented on code in PR #11350:
URL: https://github.com/apache/iceberg/pull/11350#discussion_r1806229187


##
core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java:
##
@@ -368,7 +368,13 @@ private void renameToFinal(FileSystem fs, Path src, Path 
dst, int nextVersion) {
   }
 
   if (fs.exists(dst)) {
-throw new CommitFailedException("Version %d already exists: %s", 
nextVersion, dst);
+CommitFailedException cfe =
+new CommitFailedException("Version %d already exists: %s", 
nextVersion, dst);
+RuntimeException re = tryDelete(src);

Review Comment:
   @nastra yes, I think it is always safe to delete the temp src file (the temp metadata file); a commit retry would create a new temp metadata file.
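
   For illustration, one way the truncated diff above could continue, with the delete failure attached as a suppressed exception; this is an assumption about the PR, not its actual code:

   ```
   if (fs.exists(dst)) {
     CommitFailedException cfe =
         new CommitFailedException("Version %d already exists: %s", nextVersion, dst);
     RuntimeException re = tryDelete(src);  // assumed to return the failure instead of throwing
     if (re != null) {
       cfe.addSuppressed(re);  // keep the cleanup failure visible alongside the commit failure
     }
     throw cfe;
   }
   ```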






Re: [PR] Spec: Support geo type [iceberg]

2024-10-18 Thread via GitHub


desruisseaux commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1806278747


##
format/spec.md:
##
@@ -483,6 +485,8 @@ Notes:
 2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the 
IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper 
bounds.
 3. If sort order ID is missing or unknown, then the order is assumed to be 
unsorted. Only data files and equality delete files should be written with a 
non-null order id. [Position deletes](#position-delete-files) are required to 
be sorted by file and position, not a table order, and should set sort order id 
to null. Readers must ignore sort order id for position delete files.
 4. The following field ids are reserved on `data_file`: 141.
+5. For `geometry`, this is a point. X = min value of all component points of 
all geometries in file when edges = PLANAR, westernmost bound of all geometries 
in file if edges = SPHERICAL. Y = max value of all component points of all 
geometries in file when edges = PLANAR, northernmost bound of all geometries if 
edges = SPHERICAL. Z is min value for all component points of all geometries in 
the file. M is min value of all component points of all geometries in the file. 
See Appendix D for encoding.

Review Comment:
   > Nowadays the trend seems to be Geography vs Geometry type actually
   
   This trend was initiated by PostGIS, but I don't know how widely it is used 
outside PostGIS. This is not necessarily a trend that we should follow. The 
PostGIS's Geography type can handle envelopes crossing the anti-meridian 
(longitude wraparound), but this is not the only case where wraparounds are 
needed. Other examples are:
   
   * Radar data using a polar coordinate system.
   * Climatological data, where "December" is followed by "January" of no 
particular year (temporal wraparound).
   
   Instead of a distinct Geography type, a better and more generic solution is 
to interpret "min" and "max" as "starting point" and "ending point", with 
interior in the direction of increasing values. So `min > max` (or `west > 
east`) means that a wraparound happens at infinity:
   
   
![image](https://github.com/user-attachments/assets/dc26cd6a-77ce-4747-8346-aa97be48cdee)
   
   This convention is already used by the EPSG geodetic registry, by the GGXF 
standard, by some versions of the WMS or WCS standards (I do not remember which 
one exactly). This is the only truly generic approach that I'm aware of. All 
other approaches work only in special cases.
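
   A minimal sketch of the min > max wraparound convention described above (illustration only; this is not in the Iceberg spec or this PR):

   ```
   // tests whether a longitude falls inside [west, east], where west > east
   // means the interval wraps across the anti-meridian
   static boolean lonInRange(double lon, double west, double east) {
     if (west <= east) {
       return west <= lon && lon <= east;  // ordinary interval
     }
     return lon >= west || lon <= east;    // wrapped interval
   }
   ```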






Re: [PR] Core: code clean up for deleteRemovedMetadataFiles [iceberg]

2024-10-18 Thread via GitHub


nastra commented on code in PR #11352:
URL: https://github.com/apache/iceberg/pull/11352#discussion_r1806309295


##
core/src/main/java/org/apache/iceberg/TableMetadata.java:
##
@@ -77,6 +83,51 @@ public static TableMetadata newTableMetadata(
 return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties);
   }
 
+  /**
+   * Deletes the oldest metadata files if {@link
+   * TableProperties#METADATA_DELETE_AFTER_COMMIT_ENABLED} is true.
+   *
+   * @param io file IO
+   * @param base table metadata on which previous versions were based
+   * @param metadata new table metadata with updated previous versions
+   */
+  public static void deleteRemovedMetadataFiles(

Review Comment:
   I would say maybe rather move it to `CatalogUtil`






Re: [PR] Spec: Support geo type [iceberg]

2024-10-18 Thread via GitHub


desruisseaux commented on code in PR #10981:
URL: https://github.com/apache/iceberg/pull/10981#discussion_r1806309872


##
format/spec.md:
##
@@ -483,6 +485,8 @@ Notes:
 2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the 
IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper 
bounds.
 3. If sort order ID is missing or unknown, then the order is assumed to be 
unsorted. Only data files and equality delete files should be written with a 
non-null order id. [Position deletes](#position-delete-files) are required to 
be sorted by file and position, not a table order, and should set sort order id 
to null. Readers must ignore sort order id for position delete files.
 4. The following field ids are reserved on `data_file`: 141.
+5. For `geometry`, this is a point. X = min value of all component points of 
all geometries in file when edges = PLANAR, westernmost bound of all geometries 
in file if edges = SPHERICAL. Y = max value of all component points of all 
geometries in file when edges = PLANAR, northernmost bound of all geometries if 
edges = SPHERICAL. Z is min value for all component points of all geometries in 
the file. M is min value of all component points of all geometries in the file. 
See Appendix D for encoding.

Review Comment:
   On the "geodesics" versus "pseudo-geodesics" names, I do not have a strong opinion at this time. The rationale for "pseudo-geodesic" is that current implementations are cheating: they do not compute the real geodesic. Instead, they pretend that the ellipsoid is a sphere (e.g., they ignore WGS84 flattening) and apply the formulas for a sphere.
   
   When we pretend that the ellipsoid is a sphere, the results are of course a little bit wrong, but there are some ways to make them "less wrong". For example, some map projections will convert the _geodetic latitudes_ to _authalic latitudes_. An _authalic sphere_ is a sphere with the same surface as the ellipsoid. An _authalic latitude_ is the latitude that a point would have if we deformed the ellipsoid until it got the shape of the authalic sphere. Therefore, for each point in a geometry, the numerical value of the _authalic latitude_ is slightly different from the _geodetic latitude_. If an application converts all geodetic latitudes to authalic latitudes before performing computations with libraries that assume a sphere, such as S2, then even if the shapes are a little bit wrong, the surfaces are closer to reality. Conversely, if someone is more interested in shapes rather than surfaces, she/he may use _conformal latitudes_ instead. See for example the list of [auxiliary latitudes on MathWorld](https://mathworld.wolfram.com/AuxiliaryLatitude.html).
   
   In summary, even if Iceberg doesn't dive into all this complexity for now, there is a possibility that future versions may want to add something like `AUTHALIC_GEODESIC` (I don't know if putting the "authalic geodesic" terms together is orthodox; we would need to verify usage) for geodesics computed on a sphere but with geodetic latitudes converted to authalic latitudes. The same goes for other kinds of latitude, such as conformal.
   
   The current implementations just ignore all this complexity, use the latitudes _as-is_ and plug the values directly into the spherical formulas without bothering about the fact that this is not quite correct. This is exactly what the "Google Mercator" projection does. EPSG calls that the "Popular Visualisation Pseudo Mercator" projection. Hence the proposal for "pseudo" in "pseudo-geodesic", by analogy with "pseudo-Mercator".

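   For concreteness, the spherical formula such implementations effectively apply is the great-circle distance on a sphere of radius $R$ (the spherical law of cosines), with geodetic latitudes $\varphi$ and longitudes $\lambda$ plugged in as-is:

   $$d = R \cdot \arccos\bigl(\sin\varphi_1 \sin\varphi_2 + \cos\varphi_1 \cos\varphi_2 \cos(\lambda_2 - \lambda_1)\bigr)$$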

