Re: [PR] feat(table): Implement converting Iceberg schema and types to Arrow [iceberg-go]
nastra commented on PR #168: URL: https://github.com/apache/iceberg-go/pull/168#issuecomment-2421911918 @zeroshade can you please rebase due to the merge conflict?
Re: [PR] [KafkaConnect] Fix RecordConverter [iceberg]
singhpk234 commented on code in PR #11346: URL: https://github.com/apache/iceberg/pull/11346#discussion_r1806695970

## orc/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java:

@@ -101,8 +101,8 @@
   public static OrcValueWriter byteBuffers() {
     return ByteBufferWriter.INSTANCE;
   }

-  public static OrcValueWriter uuids() {
-    return UUIDWriter.INSTANCE;
+  public static OrcValueWriter uuids() {
+    return ByteBufferWriter.INSTANCE;

Review Comment: Need to revisit this.

## kafka-connect/kafka-connect/src/test/java/org/apache/iceberg/connect/data/RecordConverterTest.java:

@@ -921,11 +929,24 @@ private void assertRecordValues(Record record) {
     assertThat(rec.getField("dec")).isEqualTo(DEC_VAL);
     assertThat(rec.getField("s")).isEqualTo(STR_VAL);
     assertThat(rec.getField("b")).isEqualTo(true);
-    assertThat(rec.getField("u")).isEqualTo(UUID_VAL);
-    assertThat(rec.getField("f")).isEqualTo(BYTES_VAL);
+    assertThat(rec.getField("u")).isEqualTo(UUIDUtil.convert(UUID_VAL));
+    assertThat(rec.getField("f")).isEqualTo(BYTES_VAL.array());
     assertThat(rec.getField("bi")).isEqualTo(BYTES_VAL);
     assertThat(rec.getField("li")).isEqualTo(LIST_VAL);
     assertThat(rec.getField("ma")).isEqualTo(MAP_VAL);
+
+    // check by actually writing it
+    for (String format : new String[] {"parquet"}) {
+      IcebergSinkConfig config = mock(IcebergSinkConfig.class);
+      when(config.tableConfig(any())).thenReturn(mock(TableSinkConfig.class));
+      when(config.writeProps()).thenReturn(ImmutableMap.of("write.format.default", format));
+      WriteResult result = writeTest(ImmutableList.of(rec), config, UnpartitionedWriter.class);
+
+      assertThat(result.dataFiles()).hasSize(1);
+      assertThat(result.dataFiles())
+          .allMatch(file -> file.format() == FileFormat.fromString(format));
+      assertThat(result.deleteFiles()).hasSize(0);

Review Comment: This test fails before the fix for Parquet, and apparently fixing Parquet made ORC fail because ORC handles UUIDs differently.
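Editor's note: the assertion change above expects the UUID field to come back as raw bytes. Converting a UUID to its 16-byte big-endian form, which is roughly what a helper like Iceberg's `UUIDUtil.convert(UUID)` produces, can be sketched as follows (a minimal standalone illustration, not Iceberg's actual implementation; the class and method names here are made up):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytesSketch {
  // Serialize a UUID into its 16-byte big-endian representation.
  static byte[] toBytes(UUID uuid) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    return buf.array();
  }

  public static void main(String[] args) {
    UUID u = UUID.fromString("01234567-89ab-cdef-0123-456789abcdef");
    System.out.println(toBytes(u).length); // prints 16
  }
}
```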
[PR] OpenAPI: Update Snapshots to be a common denominator of V1 and V2 specs [iceberg]
sungwy opened a new pull request, #11353: URL: https://github.com/apache/iceberg/pull/11353

In a recent [mail list discussion](https://lists.apache.org/thread/h9qmrmlgxh91ol0y2v8olt90b9q6p9xr), it came to the community's attention that the OpenAPI spec does not conform to the current table spec in the Snapshot component. I am putting this PR together to help visualize the changes that would be required if we were to rectify the spec so that it supports both the v1 and v2 table specs, as has been suggested in the [REST API OpenAPI spec description](https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L30-L30) itself. The proposed change below removes the `manifest-list` and `summary` fields, which are `optional` fields in the V1 spec, and adds `manifests` as an optional field (following the V1 spec).
Re: [I] REST Catalog pagination can throw IndexOutOfBoundsException [iceberg]
munendrasn commented on issue #11142: URL: https://github.com/apache/iceberg/issues/11142#issuecomment-2422468094 @rcjverhoef Is this good to be closed as the PR is merged? cc @amogh-jahagirdar
Re: [PR] Spec: add variant type [iceberg]
aihuaxu commented on PR #10831: URL: https://github.com/apache/iceberg/pull/10831#issuecomment-2423052011

> This needs some notes in `Partition Transforms`; I think we should explicitly disallow identity.
>
> For Appendix B - we should define something or state explicitly that we don't define it for variant.
>
> Appendix C - we'll need some details on the JSON serialization, since that's going to have to define some string representations, I think.
>
> Under Sort Orders we should probably note you cannot sort on a Variant?
>
> Appendix D: Single Value Serialization needs an entry; we can probably write "Not Supported" for now. JSON needs an entry.

Thanks @RussellSpitzer, I missed those sections and just updated. I marked Partition Transforms, sorting, and hashing as not supported/allowed for now. For Appendix C, I think it should be just `variant`, similar to a primitive type, since it's the Iceberg schema, as I understand the section.
Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]
ookumuso commented on code in PR #2: URL: https://github.com/apache/iceberg/pull/2#discussion_r1806927534

## core/src/main/java/org/apache/iceberg/LocationProviders.java:

@@ -172,10 +193,45 @@ private static String pathContext(String tableLocation) {
     }

     private String computeHash(String fileName) {
-      byte[] bytes = TEMP.get();
-      HashCode hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
-      hash.writeBytesTo(bytes, 0, 4);
-      return BASE64_ENCODER.encode(bytes);
+      HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
+
+      // {@link Integer#toBinaryString} excludes leading zeros, which we want to preserve.
+      // force the first bit to be set to get around that.
+      String hashAsBinaryString = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE);
+      return dirsFromHash(hashAsBinaryString.substring(HASH_BINARY_STRING_START_INDEX));
+    }
+
+    /**
+     * Divides hash into directories for optimized orphan removal operation using ENTROPY_DIR_DEPTH
+     * and ENTROPY_DIR_LENGTH
+     *
+     * @param hash 10011001100110011001
+     * @return 1001/1001/1001/10011001 with depth 3 and length 4
+     */
+    private String dirsFromHash(String hash) {
+      if (hash.length() < ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH) {

Review Comment: Yeah, maybe I am a bit over-paranoid here. Given that I have the right unit tests to cover any changes and the final string is 20 bits long (after addressing the other feedback), maybe I will just remove this condition?
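Editor's note: to see the directory-splitting behavior under discussion in isolation, here is a minimal, self-contained sketch of the logic from the diff above (the constants mirror the PR; the class wrapper, `main` method, and the use of `String.hashCode` as a stand-in for the murmur3 hash are only for illustration):

```java
public class EntropyDirSketch {
  private static final int ENTROPY_DIR_DEPTH = 3;
  private static final int ENTROPY_DIR_LENGTH = 4;

  // Force the sign bit so toBinaryString always yields 32 characters,
  // then keep the lower 20 bits as the entropy string.
  static String hashToBinary(int hash) {
    return Integer.toBinaryString(hash | Integer.MIN_VALUE).substring(12);
  }

  // 10011001100110011001 -> 1001/1001/1001/10011001 with depth 3 and length 4
  static String dirsFromHash(String hash) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < ENTROPY_DIR_DEPTH; i++) {
      sb.append(hash, i * ENTROPY_DIR_LENGTH, (i + 1) * ENTROPY_DIR_LENGTH).append('/');
    }
    sb.append(hash, ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH, hash.length());
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(dirsFromHash("10011001100110011001")); // 1001/1001/1001/10011001
    // String.hashCode stands in for the murmur3 hash used in the PR.
    System.out.println(dirsFromHash(hashToBinary("data-file-1.parquet".hashCode())));
  }
}
```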
Re: [PR] Core: code clean up for deleteRemovedMetadataFiles [iceberg]
leesf commented on code in PR #11352: URL: https://github.com/apache/iceberg/pull/11352#discussion_r1806407924

## core/src/main/java/org/apache/iceberg/TableMetadata.java:

@@ -77,6 +83,51 @@ public static TableMetadata newTableMetadata(
     return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties);
   }

+  /**
+   * Deletes the oldest metadata files if {@link
+   * TableProperties#METADATA_DELETE_AFTER_COMMIT_ENABLED} is true.
+   *
+   * @param io file IO
+   * @param base table metadata on which previous versions were based
+   * @param metadata new table metadata with updated previous versions
+   */
+  public static void deleteRemovedMetadataFiles(

Review Comment: @nastra done
Re: [I] REST Catalog pagination can throw IndexOutOfBoundsException [iceberg]
rcjverhoef closed issue #11142: REST Catalog pagination can throw IndexOutOfBoundsException URL: https://github.com/apache/iceberg/issues/11142
Re: [I] REST Catalog pagination can throw IndexOutOfBoundsException [iceberg]
rcjverhoef commented on issue #11142: URL: https://github.com/apache/iceberg/issues/11142#issuecomment-2422470740 yes it is! let me do so.
Re: [PR] feat(table): Implement converting Iceberg schema and types to Arrow [iceberg-go]
nastra merged PR #168: URL: https://github.com/apache/iceberg-go/pull/168
Re: [PR] Spark 3.5: Fix NotSerializableException when migrating partitioned Spark tables [iceberg]
manuzhang commented on code in PR #11157: URL: https://github.com/apache/iceberg/pull/11157#discussion_r1806768662

## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java:

@@ -711,7 +719,7 @@ public static void importSparkPartitions(
         spec,
         stagingDir,
         checkDuplicateFiles,
-        TableMigrationUtil.migrationService(parallelism));
+        executorService(parallelism));

Review Comment: We already added `executeWith(ExecutorService service)` to the `SnapshotTable` and `MigrateTable` interfaces, and it's more testable than passing `parallelism`. Hence, I prefer to fix forward if the cost is not high.
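Editor's note: from the caller's side, the `executeWith` path mentioned here would look roughly like this (a hedged sketch assuming an active `SparkSession` named `spark`; the table name and pool size are placeholders, and `executeWith(ExecutorService)` on these actions is taken from the comment above rather than verified against a release):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.iceberg.spark.actions.SparkActions;

// Supply the ExecutorService directly instead of a raw parallelism count.
ExecutorService pool = Executors.newFixedThreadPool(8);
try {
  SparkActions.get(spark)
      .migrateTable("db.sample") // placeholder table name
      .executeWith(pool)
      .execute();
} finally {
  pool.shutdown();
}
```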
Re: [PR] REST: AuthManager API [iceberg]
danielcweeks commented on code in PR #10753: URL: https://github.com/apache/iceberg/pull/10753#discussion_r1807055242

## core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java:

@@ -273,35 +227,11 @@ public void initialize(String name, Map unresolved) {
       this.endpoints = ImmutableSet.copyOf(config.endpoints());
     }

-    this.sessions = newSessionCache(mergedProps);
-    this.tableSessions = newSessionCache(mergedProps);
-    this.keepTokenRefreshed =
-        PropertyUtil.propertyAsBoolean(
-            mergedProps,
-            OAuth2Properties.TOKEN_REFRESH_ENABLED,
-            OAuth2Properties.TOKEN_REFRESH_ENABLED_DEFAULT);
     this.client = clientBuilder.apply(mergedProps);
     this.paths = ResourcePaths.forCatalogProperties(mergedProps);

-    String token = mergedProps.get(OAuth2Properties.TOKEN);
-    this.catalogAuth =
-        new AuthSession(
-            baseHeaders,
-            AuthConfig.builder()
-                .credential(credential)
-                .scope(scope)
-                .oauth2ServerUri(oauth2ServerUri)
-                .optionalOAuthParams(optionalOAuthParams)
-                .build());
-    if (authResponse != null) {
-      this.catalogAuth =
-          AuthSession.fromTokenResponse(
-              client, tokenRefreshExecutor(name), authResponse, startTimeMillis, catalogAuth);
-    } else if (token != null) {
-      this.catalogAuth =
-          AuthSession.fromAccessToken(
-              client, tokenRefreshExecutor(name), token, expiresAtMillis(mergedProps), catalogAuth);
-    }
+    authManager.initialize(name, client, mergedProps);

Review Comment: I feel like we should have separate AuthManager instances, not double-initialize.

## core/src/main/java/org/apache/iceberg/rest/auth/AuthManager.java:

@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest.auth;
+
+import java.util.Map;
+import org.apache.iceberg.catalog.SessionCatalog;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.rest.RESTClient;
+
+/**
+ * Manager for authentication sessions. This interface is used to create sessions for the catalog,
+ * the tables/views, and any other context that requires authentication.
+ *
+ * Managers are usually stateful and may require initialization and cleanup. The manager is
+ * created by the catalog and is closed when the catalog is closed.
+ */
+public interface AuthManager extends AutoCloseable {
+
+  /**
+   * Initializes the manager with the owner of the catalog, the REST client, and the properties
+   * provided in the catalog configuration.
+   *
+   * This method is intended to be called many times, typically once before contacting the config
+   * endpoint, and once after contacting the config endpoint. The properties provided in the catalog
+   * configuration are passed in the first call, and the properties returned by the config endpoint
+   * are passed in the second call, merged with the catalog ones.
+   *
+   * This method cannot return null.
+   */
+  void initialize(String owner, RESTClient client, Map properties);

Review Comment: I feel like `owner` isn't the right term. Maybe just `name`?

## core/src/main/java/org/apache/iceberg/rest/auth/OAuth2Manager.java:

@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceb
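Editor's note: the two-phase initialization flow described in the javadoc above can be sketched with a trimmed-down stand-in interface (this is not the PR's actual API; `RESTClient` is replaced with a placeholder `Object`, and the class names are invented, so the sketch stays self-contained):

```java
import java.util.HashMap;
import java.util.Map;

// Trimmed stand-in for the AuthManager lifecycle discussed above.
interface SimpleAuthManager extends AutoCloseable {
  void initialize(String name, Object client, Map<String, String> properties);
}

class MergingAuthManager implements SimpleAuthManager {
  private final Map<String, String> effectiveProps = new HashMap<>();

  @Override
  public void initialize(String name, Object client, Map<String, String> properties) {
    // First call: catalog configuration properties. Second call: properties
    // returned by the config endpoint, merged with the catalog ones; later
    // calls overwrite earlier values.
    effectiveProps.putAll(properties);
  }

  @Override
  public void close() {
    effectiveProps.clear();
  }
}
```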
Re: [PR] Arrow: Fix indexing in Parquet dictionary encoded values readers [iceberg]
wypoon commented on code in PR #11247: URL: https://github.com/apache/iceberg/pull/11247#discussion_r1807089230

## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetDictionaryEncodedVectorizedReads.java:

@@ -93,4 +125,64 @@ public void testMixedDictionaryNonDictionaryReads() throws IOException {
         true,
         BATCH_SIZE);
   }

+  @Test
+  public void testBinaryNotAllPagesDictionaryEncoded() throws IOException {
+    Schema schema = new Schema(Types.NestedField.required(1, "bytes", Types.BinaryType.get()));
+    File parquetFile = File.createTempFile("junit", null, temp.toFile());
+    assertThat(parquetFile.delete()).as("Delete should succeed").isTrue();
+
+    Iterable records = RandomData.generateFallbackData(schema, 500, 0L, 100);
+    try (FileAppender writer =
+        Parquet.write(Files.localOutput(parquetFile))
+            .schema(schema)
+            .set(PARQUET_DICT_SIZE_BYTES, "4096")
+            .set(PARQUET_PAGE_ROW_LIMIT, "100")
+            .build()) {
+      writer.addAll(records);
+    }
+
+    // After the above, parquetFile contains one column chunk of binary data in five pages,
+    // the first two RLE dictionary encoded, and the remaining three plain encoded.
+    assertRecordsMatch(schema, 500, records, parquetFile, true, BATCH_SIZE);
+  }
+
+  /**
+   * decimal_dict_and_plain_encoding.parquet contains one column chunk of decimal(38, 0) data in two
+   * pages, one RLE dictionary encoded and one plain encoded, each with 200 rows.
+   */
+  @Test
+  public void testDecimalNotAllPagesDictionaryEncoded() throws Exception {
+    Schema schema = new Schema(Types.NestedField.required(1, "id", Types.DecimalType.of(38, 0)));
+    Path path =
+        Paths.get(
+            getClass()
+                .getClassLoader()
+                .getResource("decimal_dict_and_plain_encoding.parquet")
+                .toURI());

Review Comment: @amogh-jahagirdar I have already explained this at length and convinced @nastra.
Re: [I] Create empty snapshot for metadata operations [iceberg]
github-actions[bot] closed issue #7075: Create empty snapshot for metadata operations URL: https://github.com/apache/iceberg/issues/7075
Re: [I] Support Parquet v2 Spark vectorized read [iceberg]
github-actions[bot] closed issue #7162: Support Parquet v2 Spark vectorized read URL: https://github.com/apache/iceberg/issues/7162
Re: [I] Support Page Skipping in Iceberg Parquet Reader [iceberg]
github-actions[bot] closed issue #193: Support Page Skipping in Iceberg Parquet Reader URL: https://github.com/apache/iceberg/issues/193
Re: [I] Does JDBC connector uses any retry mechanism? [iceberg]
github-actions[bot] commented on issue #7173: URL: https://github.com/apache/iceberg/issues/7173#issuecomment-2423396446 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Does JDBC connector uses any retry mechanism? [iceberg]
github-actions[bot] closed issue #7173: Does JDBC connector uses any retry mechanism? URL: https://github.com/apache/iceberg/issues/7173
Re: [I] org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore at [iceberg]
github-actions[bot] closed issue #9030: org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore at URL: https://github.com/apache/iceberg/issues/9030
Re: [I] org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore at [iceberg]
github-actions[bot] commented on issue #9030: URL: https://github.com/apache/iceberg/issues/9030#issuecomment-2423396460 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Create empty snapshot for metadata operations [iceberg]
github-actions[bot] commented on issue #7075: URL: https://github.com/apache/iceberg/issues/7075#issuecomment-2423396430 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Support Page Skipping in Iceberg Parquet Reader [iceberg]
github-actions[bot] commented on issue #193: URL: https://github.com/apache/iceberg/issues/193#issuecomment-2423396417 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Support Parquet v2 Spark vectorized read [iceberg]
github-actions[bot] commented on issue #7162: URL: https://github.com/apache/iceberg/issues/7162#issuecomment-2423396436 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
Re: [I] Integration tests performance degradation [iceberg-python]
github-actions[bot] commented on issue #604: URL: https://github.com/apache/iceberg-python/issues/604#issuecomment-2423397846 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] Inconsistency in catalog.list_tables Behavior Across Python and Java: Returns Non-Iceberg Tables in Python Only [iceberg-python]
sungwy closed issue #314: Inconsistency in catalog.list_tables Behavior Across Python and Java: Returns Non-Iceberg Tables in Python Only URL: https://github.com/apache/iceberg-python/issues/314
Re: [PR] core: use testcontainers-minio [iceberg]
sullis commented on PR #11349: URL: https://github.com/apache/iceberg/pull/11349#issuecomment-2423404051 GitHub Actions CI passed.
Re: [PR] Glue and Hive catalog return only Iceberg tables [iceberg-python]
sungwy merged PR #1145: URL: https://github.com/apache/iceberg-python/pull/1145
Re: [PR] Build: Bump parquet from 1.13.1 to 1.14.3 [iceberg]
findepi commented on PR #11264: URL: https://github.com/apache/iceberg/pull/11264#issuecomment-2423244567 thank you @nastra!
Re: [PR] core: use testcontainers-minio [iceberg]
sullis commented on code in PR #11349: URL: https://github.com/apache/iceberg/pull/11349#discussion_r1807054261

## aws/src/test/java/org/apache/iceberg/aws/s3/signer/TestS3RestSigner.java:

@@ -71,11 +73,10 @@ public class TestS3RestSigner {
   private static final Region REGION = Region.US_WEST_2;
   private static final String BUCKET = "iceberg-s3-signer-test";
+  private static final MinIOContainer MINIO_CONTAINER = MinioHelper.createContainer();
   static final AwsCredentialsProvider CREDENTIALS_PROVIDER =
       StaticCredentialsProvider.create(
-          AwsBasicCredentials.create("accessKeyId", "secretAccessKey"));
-  private static final MinioContainer MINIO_CONTAINER =
-      new MinioContainer(CREDENTIALS_PROVIDER.resolveCredentials());
+          AwsBasicCredentials.create(MINIO_CONTAINER.getUserName(), MINIO_CONTAINER.getPassword()));

Review Comment:
1) I have restored virtual-host-style addressing (restored MINIO_DOMAIN)
2) MINIO_ACCESS_KEY is deprecated in favor of MINIO_ROOT_USER
3) MINIO_SECRET_KEY is deprecated in favor of MINIO_ROOT_PASSWORD

source: https://min.io/docs/minio/linux/reference/minio-server/settings/deprecated.html
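Editor's note: for context on the environment variables being discussed, here is a minimal sketch of standing up a testcontainers `MinIOContainer` (the image tag and the `MINIO_DOMAIN` value are placeholders; `MinioHelper` from the diff is a PR-local helper not shown here):

```java
import org.testcontainers.containers.MinIOContainer;

public class MinioContainerSketch {
  public static void main(String[] args) {
    // MINIO_ROOT_USER / MINIO_ROOT_PASSWORD replace the deprecated
    // MINIO_ACCESS_KEY / MINIO_SECRET_KEY; MINIO_DOMAIN enables
    // virtual-host-style (bucket-in-hostname) addressing.
    try (MinIOContainer minio =
        new MinIOContainer("minio/minio:latest").withEnv("MINIO_DOMAIN", "localhost")) {
      minio.start();
      // The container exposes the root credentials for clients to use.
      System.out.println(minio.getUserName() + " / " + minio.getPassword());
    }
  }
}
```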
Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]
danielcweeks commented on code in PR #2: URL: https://github.com/apache/iceberg/pull/2#discussion_r1806946923

## core/src/main/java/org/apache/iceberg/LocationProviders.java:

@@ -172,10 +193,45 @@ private static String pathContext(String tableLocation) {
     }

     private String computeHash(String fileName) {
-      byte[] bytes = TEMP.get();
-      HashCode hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
-      hash.writeBytesTo(bytes, 0, 4);
-      return BASE64_ENCODER.encode(bytes);
+      HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
+
+      // {@link Integer#toBinaryString} excludes leading zeros, which we want to preserve.
+      // force the first bit to be set to get around that.
+      String hashAsBinaryString = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE);
+      return dirsFromHash(hashAsBinaryString.substring(HASH_BINARY_STRING_START_INDEX));
+    }
+
+    /**
+     * Divides hash into directories for optimized orphan removal operation using ENTROPY_DIR_DEPTH
+     * and ENTROPY_DIR_LENGTH
+     *
+     * @param hash 10011001100110011001
+     * @return 1001/1001/1001/10011001 with depth 3 and length 4
+     */
+    private String dirsFromHash(String hash) {
+      if (hash.length() < ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH) {

Review Comment: If it's all internal to this class, I think we can safely assume that `Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE)` will produce the correct number of bits. The check seems unnecessary since these are not user-supplied inputs.
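Editor's note: the claim that the bit string always has a fixed width is easy to check; a minimal sketch:

```java
public class BinaryWidthSketch {
  public static void main(String[] args) {
    // Forcing the sign bit on makes toBinaryString always return 32 characters,
    // so the 20-character suffix taken downstream is always well defined.
    int[] samples = {0, 1, -1, 42, Integer.MAX_VALUE, Integer.MIN_VALUE};
    for (int x : samples) {
      String s = Integer.toBinaryString(x | Integer.MIN_VALUE);
      System.out.println(s.length() + " -> " + s.substring(12)); // length is always 32
    }
  }
}
```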
Re: [PR] Flink: make FLIP-27 default in SQL and mark the old FlinkSource as deprecated [iceberg]
stevenzwu commented on code in PR #11345: URL: https://github.com/apache/iceberg/pull/11345#discussion_r1806716122

## flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java:

@@ -86,7 +85,6 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

-@Experimental

Review Comment: I feel it is fine to be just public.
Re: [I] REST: Standardize vended credentials used in loadTable / loadView responses [iceberg]
nastra closed issue #8: REST: Standardize vended credentials used in loadTable / loadView responses URL: https://github.com/apache/iceberg/issues/8
Re: [PR] OpenAPI: Standardize credentials in loadTable/loadView responses [iceberg]
nastra merged PR #10722: URL: https://github.com/apache/iceberg/pull/10722
Re: [PR] Core: Add credentials to loadTable / loadView responses [iceberg]
nastra merged PR #11173: URL: https://github.com/apache/iceberg/pull/11173
Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]
corleyma commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2422931184

> Most of the time is spent processing the manifests record-by-record and converting each record to a dict

I haven't looked at this closely, but if memory serves, pyiceberg implements its own avro reader/writer using Cython. Concurrency is great, but I wonder if we can make big gains by implementing a more direct conversion of avro records to pyarrow record batches at that level? Then processing the manifests could probably be implemented using pyarrow compute functions (C++) for a lot of performance gain.
Re: [PR] Bump mkdocstrings from 0.26.1 to 0.26.2 [iceberg-python]
sungwy merged PR #1235: URL: https://github.com/apache/iceberg-python/pull/1235
Re: [PR] Bump mypy-boto3-glue from 1.35.23 to 1.35.25 [iceberg-python]
sungwy merged PR #1236: URL: https://github.com/apache/iceberg-python/pull/1236
Re: [PR] AWS: Update ObjectStorageLocationProvider hash to optimize for S3 performance [iceberg]
jackye1995 merged PR #2: URL: https://github.com/apache/iceberg/pull/2
Re: [PR] AWS: Update ObjectStorageLocationProvider hash to optimize for S3 performance [iceberg]
jackye1995 commented on PR #2: URL: https://github.com/apache/iceberg/pull/2#issuecomment-2423380108 looks like CI has passed, merging. Thanks for the work and patience @ookumuso, and thanks for the review @danielcweeks!
Re: [PR] Spec: add variant type [iceberg]
flyrain commented on code in PR #10831: URL: https://github.com/apache/iceberg/pull/10831#discussion_r1807111582

## format/spec.md:

@@ -444,6 +449,9 @@ Sorting floating-point numbers should produce the following behavior: `-NaN` < `

 A data or delete file is associated with a sort order by the sort order's id within [a manifest](#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.

+Note:
+
+1. `variant` columns are not valid for sorting.

Review Comment: I thought whether a variant is orderable is determined by engines, per previous discussion. Are we explicitly saying that all variants are not orderable?
Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]
kevinjqliu commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2423385861

> pyiceberg implements its own avro reader/writer using Cython

yep, we optimistically build the avro decoder and fall back to pure Python. See https://github.com/apache/iceberg-python/issues/1093#issuecomment-2341456013
Re: [PR] Spark: support rewrite on specified target branch [iceberg]
amitgilad3 commented on PR #8797: URL: https://github.com/apache/iceberg/pull/8797#issuecomment-2422284309 Thanks for looking into this @jackye1995, will fix all the comments.
[PR] Custom fileio implementation is now in the docs [iceberg-python]
sikehish opened a new pull request, #1237: URL: https://github.com/apache/iceberg-python/pull/1237 This PR addresses #1233. Do let me know if any changes are to be made.
Re: [PR] [KafkaConnect] Fix RecordConverter for UUID and Fixed Types [iceberg]
singhpk234 commented on code in PR #11346: URL: https://github.com/apache/iceberg/pull/11346#discussion_r1806749659

## orc/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java:

@@ -101,8 +101,8 @@
   public static OrcValueWriter byteBuffers() {
     return ByteBufferWriter.INSTANCE;
   }

-  public static OrcValueWriter uuids() {
-    return UUIDWriter.INSTANCE;
+  public static OrcValueWriter uuids() {
+    return ByteBufferWriter.INSTANCE;

Review Comment: Removed this and handled it in the record converter instead.
Re: [PR] [KafkaConnect] Fix RecordConverter for UUID and Fixed Types [iceberg]
singhpk234 commented on code in PR #11346: URL: https://github.com/apache/iceberg/pull/11346#discussion_r1806695581

## kafka-connect/kafka-connect/src/test/java/org/apache/iceberg/connect/data/RecordConverterTest.java:

@@ -921,11 +929,24 @@ private void assertRecordValues(Record record) {
     assertThat(rec.getField("dec")).isEqualTo(DEC_VAL);
     assertThat(rec.getField("s")).isEqualTo(STR_VAL);
     assertThat(rec.getField("b")).isEqualTo(true);
-    assertThat(rec.getField("u")).isEqualTo(UUID_VAL);
-    assertThat(rec.getField("f")).isEqualTo(BYTES_VAL);
+    assertThat(rec.getField("u")).isEqualTo(UUIDUtil.convert(UUID_VAL));
+    assertThat(rec.getField("f")).isEqualTo(BYTES_VAL.array());
     assertThat(rec.getField("bi")).isEqualTo(BYTES_VAL);
     assertThat(rec.getField("li")).isEqualTo(LIST_VAL);
     assertThat(rec.getField("ma")).isEqualTo(MAP_VAL);
+
+    // check by actually writing it
+    for (String format : new String[] {"parquet"}) {
+      IcebergSinkConfig config = mock(IcebergSinkConfig.class);
+      when(config.tableConfig(any())).thenReturn(mock(TableSinkConfig.class));
+      when(config.writeProps()).thenReturn(ImmutableMap.of("write.format.default", format));
+      WriteResult result = writeTest(ImmutableList.of(rec), config, UnpartitionedWriter.class);
+
+      assertThat(result.dataFiles()).hasSize(1);
+      assertThat(result.dataFiles())
+          .allMatch(file -> file.format() == FileFormat.fromString(format));
+      assertThat(result.deleteFiles()).hasSize(0);

Review Comment: This test fails before the fix for Parquet.
Re: [PR] docs/configuration.md: Documented table properties (#1231) [iceberg-python]
mths1 commented on PR #1232: URL: https://github.com/apache/iceberg-python/pull/1232#issuecomment-2422796878

Hi all, I was trying `target-file-size-bytes` recently, and in the pyiceberg version we were using it somewhat violates the principle of least surprise. As far as I understand, in pyiceberg it is not the file size on disk, but the size _in memory_. A target-file-size-bytes of 512 MB resulted for us in files of 20 MB on disk. This caused a lot of trouble for us (first in understanding it), and secondly other tools now pick the wrong value from the metadata. If I am not mistaken, it would be great to document that behaviour, as it is not quite intuitive.
[PR] Custom fileio docs [iceberg-python]
sikehish opened a new pull request, #1238: URL: https://github.com/apache/iceberg-python/pull/1238 This PR addresses https://github.com/apache/iceberg-python/issues/1233. Do let me know if any changes are to be made.
Re: [I] Document Custom FileIO [iceberg-python]
sikehish commented on issue #1233: URL: https://github.com/apache/iceberg-python/issues/1233#issuecomment-2422849621

> Assigned to you. I think we can add it under [the `FileIO` section](https://py.iceberg.apache.org/configuration/#fileio) as something like "Custom FileIO Implementations"

Hi @kevinjqliu. I've added a custom FileIO implementation section to the docs. I'm not very sure if what I've added is what you are looking for. Do let me know if any changes are to be made :)
Re: [PR] Flink: disable the flaky range distribution bucketing tests for now [iceberg]
stevenzwu merged PR #11347: URL: https://github.com/apache/iceberg/pull/11347
Re: [PR] Spark: Add RewriteTablePath action interface [iceberg]
szehon-ho merged PR #10920: URL: https://github.com/apache/iceberg/pull/10920
Re: [PR] Spark: Add RewriteTablePath action interface [iceberg]
szehon-ho commented on PR #10920: URL: https://github.com/apache/iceberg/pull/10920#issuecomment-2422978173 Merged, thanks @laithalzyoud
Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]
danielcweeks commented on code in PR #2: URL: https://github.com/apache/iceberg/pull/2#discussion_r1806878880

## core/src/main/java/org/apache/iceberg/LocationProviders.java:

@@ -108,10 +108,15 @@ public String newDataLocation(String filename) {
   static class ObjectStoreLocationProvider implements LocationProvider {
     private static final HashFunction HASH_FUNC = Hashing.murmur3_32_fixed();
-    private static final BaseEncoding BASE64_ENCODER = BaseEncoding.base64Url().omitPadding();
-    private static final ThreadLocal TEMP = ThreadLocal.withInitial(() -> new byte[4]);
+    // the starting index of the lower 20-bits of a 32-bit binary string
+    private static final int HASH_BINARY_STRING_START_INDEX = 12;

Review Comment: Can we flip this to be `HASH_BINARY_STRING_BITS = 20`, since that's more applicable to what we're constructing? The start index is confusing because it only has meaning in the context of the length of what we're skipping from the bit field.
Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]
danielcweeks commented on code in PR #2: URL: https://github.com/apache/iceberg/pull/2#discussion_r1806873426

## core/src/main/java/org/apache/iceberg/LocationProviders.java:

@@ -172,10 +193,45 @@ private static String pathContext(String tableLocation) {
     }

     private String computeHash(String fileName) {
-      byte[] bytes = TEMP.get();
-      HashCode hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
-      hash.writeBytesTo(bytes, 0, 4);
-      return BASE64_ENCODER.encode(bytes);
+      HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
+
+      // {@link Integer#toBinaryString} excludes leading zeros, which we want to preserve.
+      // force the first bit to be set to get around that.
+      String hashAsBinaryString = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE);
+      return dirsFromHash(hashAsBinaryString.substring(HASH_BINARY_STRING_START_INDEX));
+    }
+
+    /**
+     * Divides hash into directories for optimized orphan removal operation using ENTROPY_DIR_DEPTH
+     * and ENTROPY_DIR_LENGTH
+     *
+     * @param hash 10011001100110011001
+     * @return 1001/1001/1001/10011001 with depth 3 and length 4
+     */
+    private String dirsFromHash(String hash) {
+      if (hash.length() < ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH) {

Review Comment: We should use a `Preconditions.check` here. Also, we're doing a lot of math here, but don't we always expect the hash to be 20 characters? Why not just make that the requirement?
Re: [PR] Spark: Add RewriteTablePath action interface [iceberg]
flyrain commented on PR #10920: URL: https://github.com/apache/iceberg/pull/10920#issuecomment-2423035986 Great! Thanks a lot for working on it, guys! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Custom fileio implementation is now in the docs [iceberg-python]
sikehish closed pull request #1237: Custom fileio implementation is now in the docs URL: https://github.com/apache/iceberg-python/pull/1237 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Add RewriteTablePath action interface [iceberg]
laithalzyoud commented on PR #10920: URL: https://github.com/apache/iceberg/pull/10920#issuecomment-2422990247 Thanks @szehon-ho! Will follow up with the implementation PR soon 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Fix NotSerializableException when migrating partitioned Spark tables [iceberg]
manuzhang commented on code in PR #11157: URL: https://github.com/apache/iceberg/pull/11157#discussion_r1806768662 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java: ## @@ -711,7 +719,7 @@ public static void importSparkPartitions( spec, stagingDir, checkDuplicateFiles, -TableMigrationUtil.migrationService(parallelism)); +executorService(parallelism)); Review Comment: We already added `executeWith(ExecutorService service)` to the `SnapshotTable` and `MigrateTable` interfaces, and it's more testable than passing `parallelism`. Hence, I prefer to fix forward if the cost is not high. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
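For context, a hedged sketch of why the `ExecutorService` variant is the more testable surface (the action chain mirrors the public `SparkActions` API; the pool size and table name are made up, and a running Spark session is assumed):

```
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.iceberg.spark.actions.SparkActions;

public class MigrateWithExecutor {
  public static void main(String[] args) {
    // the caller owns the pool, so tests can inject a deterministic or
    // instrumented executor, which a bare `parallelism` int cannot offer
    ExecutorService service = Executors.newFixedThreadPool(4);
    try {
      SparkActions.get()
          .migrateTable("spark_catalog.db.sample")
          .executeWith(service)
          .execute();
    } finally {
      service.shutdown();
    }
  }
}
```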
Re: [PR] Glue and Hive catalog return only Iceberg tables [iceberg-python]
mark-major commented on code in PR #1145: URL: https://github.com/apache/iceberg-python/pull/1145#discussion_r1806419626 ## pyiceberg/catalog/dynamodb.py: ## @@ -393,7 +393,7 @@ def drop_namespace(self, namespace: Union[str, Identifier]) -> None: raise NoSuchNamespaceError(f"Database does not exist: {database_name}") from e def list_tables(self, namespace: Union[str, Identifier]) -> List[Identifier]: -"""List tables under the given namespace in the catalog (including non-Iceberg tables). +"""List Iceberg tables under the given namespace in the catalog. Review Comment: It has been quite a long time since I worked on this, but if I recall correctly then yes, DynamoDB returned only Iceberg tables. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec: Support geo type [iceberg]
desruisseaux commented on code in PR #10981: URL: https://github.com/apache/iceberg/pull/10981#discussion_r1806416230 ## format/spec.md: ## @@ -198,6 +199,9 @@ Notes: - Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). - Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). 3. Character strings must be stored as UTF-8 encoded byte arrays. +4. Coordinate Reference System, i.e. mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be changed by schema evolution. Review Comment: > Coordinate Reference System, i.e. mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84". Maybe remove the word "precise"? Not all CRS are precise. The proposed default, `OGC:CRS84`, has an uncertainty of about 2 meters (unrelated to floating point precision). Maybe it would be worth changing the sentence to: _Defaults to "OGC:CRS84", which provides an accuracy of about 2 meters._ Another reason for removing the "precise" word is that the CRS alone is not sufficient for really precise locations. We also need the epoch, which is a topic that has not been covered at all in this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: make FLIP-27 default in SQL and mark the old FlinkSource as deprecated [iceberg]
stevenzwu commented on PR #11345: URL: https://github.com/apache/iceberg/pull/11345#issuecomment-2422803067 thanks @pvary for the review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: make FLIP-27 default in SQL and mark the old FlinkSource as deprecated [iceberg]
stevenzwu merged PR #11345: URL: https://github.com/apache/iceberg/pull/11345 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat(table): Implement converting Iceberg schema and types to Arrow [iceberg-go]
zeroshade commented on PR #168: URL: https://github.com/apache/iceberg-go/pull/168#issuecomment-2422814408 @nastra rebased! Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]
danielcweeks commented on code in PR #2: URL: https://github.com/apache/iceberg/pull/2#discussion_r1806946923 ## core/src/main/java/org/apache/iceberg/LocationProviders.java: ## @@ -172,10 +193,45 @@ private static String pathContext(String tableLocation) { } private String computeHash(String fileName) { - byte[] bytes = TEMP.get(); - HashCode hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8); - hash.writeBytesTo(bytes, 0, 4); - return BASE64_ENCODER.encode(bytes); + HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8); + + // {@link Integer#toBinaryString} excludes leading zeros, which we want to preserve. + // force the first bit to be set to get around that. + String hashAsBinaryString = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE); + return dirsFromHash(hashAsBinaryString.substring(HASH_BINARY_STRING_START_INDEX)); +} +/** + * Divides hash into directories for optimized orphan removal operation using ENTROPY_DIR_DEPTH + * and ENTROPY_DIR_LENGTH + * + * @param hash 10011001100110011001 + * @return 1001/1001/1001/10011001 with depth 3 and length 4 + */ +private String dirsFromHash(String hash) { + if (hash.length() < ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH) { Review Comment: If it's all internal to this class, I think we can safely assume that `Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE)` will produce the correct number of bits. The check seems unnecessary since the inputs are not user supplied. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]
ookumuso commented on code in PR #2: URL: https://github.com/apache/iceberg/pull/2#discussion_r1806988984 ## core/src/main/java/org/apache/iceberg/LocationProviders.java: ## @@ -172,10 +193,45 @@ private static String pathContext(String tableLocation) { } private String computeHash(String fileName) { - byte[] bytes = TEMP.get(); - HashCode hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8); - hash.writeBytesTo(bytes, 0, 4); - return BASE64_ENCODER.encode(bytes); + HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8); + + // {@link Integer#toBinaryString} excludes leading zeros, which we want to preserve. + // force the first bit to be set to get around that. + String hashAsBinaryString = Integer.toBinaryString(hashCode.asInt() | Integer.MIN_VALUE); + return dirsFromHash(hashAsBinaryString.substring(HASH_BINARY_STRING_START_INDEX)); +} + +/** + * Divides hash into directories for optimized orphan removal operation using ENTROPY_DIR_DEPTH + * and ENTROPY_DIR_LENGTH + * + * @param hash 10011001100110011001 + * @return 1001/1001/1001/10011001 with depth 3 and length 4 + */ +private String dirsFromHash(String hash) { + if (hash.length() < ENTROPY_DIR_DEPTH * ENTROPY_DIR_LENGTH) { Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance [iceberg]
ookumuso commented on code in PR #2: URL: https://github.com/apache/iceberg/pull/2#discussion_r1806988739 ## core/src/main/java/org/apache/iceberg/LocationProviders.java: ## @@ -108,10 +108,15 @@ public String newDataLocation(String filename) { static class ObjectStoreLocationProvider implements LocationProvider { private static final HashFunction HASH_FUNC = Hashing.murmur3_32_fixed(); -private static final BaseEncoding BASE64_ENCODER = BaseEncoding.base64Url().omitPadding(); -private static final ThreadLocal TEMP = ThreadLocal.withInitial(() -> new byte[4]); +// the starting index of the lower 20-bits of a 32-bit binary string +private static final int HASH_BINARY_STRING_START_INDEX = 12; Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Add RowConverter for Iceberg Source [iceberg]
abharath9 commented on code in PR #11301: URL: https://github.com/apache/iceberg/pull/11301#discussion_r1807179938 ## flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/source/reader/RowConverter.java: ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.source.reader; + +import org.apache.flink.api.common.typeinfo.TypeInformation; +import org.apache.flink.api.java.typeutils.RowTypeInfo; +import org.apache.flink.table.data.RowData; +import org.apache.flink.table.data.conversion.DataStructureConverter; +import org.apache.flink.table.data.conversion.DataStructureConverters; +import org.apache.flink.table.types.logical.LogicalType; +import org.apache.flink.table.types.logical.RowType; +import org.apache.flink.table.types.utils.TypeConversions; +import org.apache.flink.types.Row; +import org.apache.iceberg.flink.FlinkSchemaUtil; +import org.jetbrains.annotations.NotNull; + +public class RowConverter implements RowDataConverter { Review Comment: Thanks for the feedback. Do you mean changing this to a RowData-to-Row mapper? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
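For reference, a hedged sketch of a RowData-to-Row mapping built on Flink's converter utilities, close to what the imports in the diff suggest (the class name is hypothetical and this is not the PR's exact code):

```
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.conversion.DataStructureConverter;
import org.apache.flink.table.data.conversion.DataStructureConverters;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.utils.TypeConversions;
import org.apache.flink.types.Row;
import org.apache.iceberg.Schema;
import org.apache.iceberg.flink.FlinkSchemaUtil;

public class RowDataToRowMapper {
  private final DataStructureConverter<Object, Object> converter;

  public RowDataToRowMapper(Schema icebergSchema) {
    RowType rowType = FlinkSchemaUtil.convert(icebergSchema);
    converter = DataStructureConverters.getConverter(
        TypeConversions.fromLogicalToDataType(rowType));
    // converters must be opened before first use
    converter.open(Thread.currentThread().getContextClassLoader());
  }

  public Row map(RowData rowData) {
    // toExternal turns Flink's internal RowData into the external Row type
    return (Row) converter.toExternal(rowData);
  }
}
```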
[PR] Build: Bump parquet from 1.13.1 to 1.14.3 [iceberg]
dependabot[bot] opened a new pull request, #11264: URL: https://github.com/apache/iceberg/pull/11264 Bumps `parquet` from 1.13.1 to 1.14.3. Updates `org.apache.parquet:parquet-avro` from 1.13.1 to 1.14.3.

Release notes (sourced from org.apache.parquet:parquet-avro's releases: https://github.com/apache/parquet-mr/releases):

Apache Parquet Java 1.14.3
- GH-3007: Ensure version specific Jackson classes are shaded (https://redirect.github.com/apache/parquet-java/issues/3007)
- GH-3013: Fix potential ClassCastException at reading DELTA_BYTE_ARRAY encoding (https://redirect.github.com/apache/parquet-java/issues/3013)
- GH-3021: Upgrade Avro dependency to 1.11.4 (https://redirect.github.com/apache/parquet-java/issues/3021)

Apache Parquet Java 1.14.2
- GH-2948: Fix NPE when using the AvroParquetReader.Builder with LocalInputFile (https://redirect.github.com/apache/parquet-java/pull/2949)
- GH-2956: Use avro SchemaBuilder API to convert record (https://redirect.github.com/apache/parquet-java/pull/2957)
- GH-2935: Avoid double close of ParquetFileWriter (https://redirect.github.com/apache/parquet-java/pull/2951)
- GH-2992: Gate LocalTimestamp references in AvroSchemaConverter (https://redirect.github.com/apache/parquet-java/issues/2992)
- PARQUET-1126: Fix NPE when using the AvroParquetReader.Builder with LocalInputFile (https://redirect.github.com/apache/parquet-java/pull/1376)
- PARQUET-1126: Write unencrypted Parquet files without Hadoop (https://redirect.github.com/apache/parquet-java/pull/1376)
- PARQUET-2472: Close in finally block in ParquetFileWriter#end (https://redirect.github.com/apache/parquet-java/pull/1350)

Apache Parquet 1.14.1
- Update changes for the parquet-1.14.x branch by @Fokko in apache/parquet-java#1347
... (truncated)

Changelog (sourced from org.apache.parquet:parquet-avro's changelog: https://github.com/apache/parquet-java/blob/master/CHANGES.md):

Version 1.14.1 Release Notes - Bug
- PARQUET-2468: ParquetMetadata.toPrettyJSON throws exception on file read when LOG.isDebugEnabled() (https://issues.apache.org/jira/browse/PARQUET-2468)
- PARQUET-2498: Hadoop vector IO API doesn't handle empty list of ranges (https://issues.apache.org/jira/browse/PARQUET-2498)

Version 1.14.0 Release Notes - Bug
- PARQUET-2260: Bloom filter bytes size shouldn't be larger than maxBytes size in the configuration (https://issues.apache.org/jira/browse/PARQUET-2260)
... (truncated)
Re: [PR] Spec: Support geo type [iceberg]
dmeaux commented on code in PR #10981: URL: https://github.com/apache/iceberg/pull/10981#discussion_r1805976050 ## format/spec.md: ## @@ -483,6 +485,8 @@ Notes: 2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper bounds. 3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files. 4. The following field ids are reserved on `data_file`: 141. +5. For `geometry`, this is a point. X = min value of all component points of all geometries in file when edges = PLANAR, westernmost bound of all geometries in file if edges = SPHERICAL. Y = max value of all component points of all geometries in file when edges = PLANAR, northernmost bound of all geometries if edges = SPHERICAL. Z is min value for all component points of all geometries in the file. M is min value of all component points of all geometries in the file. See Appendix D for encoding. Review Comment: Personally, I'm good with geodesic; however, there were others who brought up concerns with that term in the GeoParquet group meetup and were more comfortable starting with pseudo-geodesic. I don't remember the exact details of the issue. Perhaps @paleolimbot, @jiayuasu, and/or @desruisseaux can chime in with their perspectives on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Bump parquet from 1.13.1 to 1.14.3 [iceberg]
nastra closed pull request #11264: Build: Bump parquet from 1.13.1 to 1.14.3 URL: https://github.com/apache/iceberg/pull/11264 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: delete temp metadata file when version already exists [iceberg]
nastra commented on code in PR #11350: URL: https://github.com/apache/iceberg/pull/11350#discussion_r1806163830 ## core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java: ## @@ -368,7 +368,13 @@ private void renameToFinal(FileSystem fs, Path src, Path dst, int nextVersion) { } if (fs.exists(dst)) { -throw new CommitFailedException("Version %d already exists: %s", nextVersion, dst); +CommitFailedException cfe = +new CommitFailedException("Version %d already exists: %s", nextVersion, dst); +RuntimeException re = tryDelete(src); Review Comment: is it always correct to delete the src file or could there be cases where we'd want to keep it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Bump parquet from 1.13.1 to 1.14.3 [iceberg]
nastra commented on code in PR #11264: URL: https://github.com/apache/iceberg/pull/11264#discussion_r1806145764 ## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestMetadataTableReadableMetrics.java: ## @@ -217,27 +217,27 @@ public void testPrimitiveColumns() throws Exception { Row binaryCol = Row.of( -52L, +55L, Review Comment: looks like column sizes got slightly larger: Parquet 1.13.1 ``` % parquet-cli/1.14.3/bin/parquet column-size old-version.parquet binaryCol-> Size In Bytes: 52 Size In Ratio: 0.08695652 intCol-> Size In Bytes: 71 Size In Ratio: 0.1187291 decimalCol-> Size In Bytes: 85 Size In Ratio: 0.14214046 fixedCol-> Size In Bytes: 44 Size In Ratio: 0.073578596 booleanCol-> Size In Bytes: 32 Size In Ratio: 0.053511705 stringCol-> Size In Bytes: 79 Size In Ratio: 0.13210702 floatCol-> Size In Bytes: 71 Size In Ratio: 0.1187291 longCol-> Size In Bytes: 79 Size In Ratio: 0.13210702 doubleCol-> Size In Bytes: 85 Size In Ratio: 0.14214046 ``` Parquet 1.14.3 ``` % parquet-cli/1.14.3/bin/parquet column-size new-version.parquet binaryCol-> Size In Bytes: 55 Size In Ratio: 0.085403726 intCol-> Size In Bytes: 77 Size In Ratio: 0.11956522 decimalCol-> Size In Bytes: 91 Size In Ratio: 0.14130434 fixedCol-> Size In Bytes: 47 Size In Ratio: 0.072981365 booleanCol-> Size In Bytes: 36 Size In Ratio: 0.055900622 stringCol-> Size In Bytes: 85 Size In Ratio: 0.13198757 floatCol-> Size In Bytes: 77 Size In Ratio: 0.11956522 longCol-> Size In Bytes: 85 Size In Ratio: 0.13198757 doubleCol-> Size In Bytes: 91 Size In Ratio: 0.14130434 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Revert "Support wasb[s] paths in ADLSFileIO" [iceberg]
nastra merged PR #11344: URL: https://github.com/apache/iceberg/pull/11344 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Add SparkSessionCatalog support for views [iceberg]
nqvuong1998 commented on issue #9845: URL: https://github.com/apache/iceberg/issues/9845#issuecomment-2421995463 Hi @nastra , any updates for this issue? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Lack of integration tests for SQL catalog implementation [iceberg-rust]
liurenjie1024 commented on issue #673: URL: https://github.com/apache/iceberg-rust/issues/673#issuecomment-2422012602 Thanks @callum-ryan ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Core: code clean up for deleteRemovedMetadataFiles [iceberg]
leesf opened a new pull request, #11352: URL: https://github.com/apache/iceberg/pull/11352 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: code clean up for deleteRemovedMetadataFiles [iceberg]
nastra commented on code in PR #11352: URL: https://github.com/apache/iceberg/pull/11352#discussion_r1806159950 ## core/src/main/java/org/apache/iceberg/TableMetadata.java: ## @@ -77,6 +83,51 @@ public static TableMetadata newTableMetadata( return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties); } + /** + * Deletes the oldest metadata files if {@link + * TableProperties#METADATA_DELETE_AFTER_COMMIT_ENABLED} is true. + * + * @param io file IO + * @param base table metadata on which previous versions were based + * @param metadata new table metadata with updated previous versions + */ + public static void deleteRemovedMetadataFiles( Review Comment: I don't think we should be moving this to `TableMetadata` as this adds unnecessary dependencies to `TableMetadata` (`FileIO` and `SupportsBulkOperations`). @leesf what's the goal of this PR and why do we want to move this method here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] core: use testcontainers-minio [iceberg]
nastra commented on code in PR #11349: URL: https://github.com/apache/iceberg/pull/11349#discussion_r1806186638 ## aws/src/test/java/org/apache/iceberg/aws/s3/signer/TestS3RestSigner.java: ## @@ -71,11 +73,10 @@ public class TestS3RestSigner { private static final Region REGION = Region.US_WEST_2; private static final String BUCKET = "iceberg-s3-signer-test"; + private static final MinIOContainer MINIO_CONTAINER = MinioHelper.createContainer(); static final AwsCredentialsProvider CREDENTIALS_PROVIDER = StaticCredentialsProvider.create( - AwsBasicCredentials.create("accessKeyId", "secretAccessKey")); - private static final MinioContainer MINIO_CONTAINER = - new MinioContainer(CREDENTIALS_PROVIDER.resolveCredentials()); + AwsBasicCredentials.create(MINIO_CONTAINER.getUserName(), MINIO_CONTAINER.getPassword())); Review Comment: this isn't correctly setting accessKey/secretAccessKey. The old `MinioContainer` was setting multiple env vars (MINIO_ACCESS_KEY / MINIO_SECRET_KEY) for this and also configured virtual-host-style addressing, which we should continue to do -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
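A hedged sketch of a `MinioHelper` that would keep the old behavior on top of `testcontainers-minio` (the image tag and the `MINIO_DOMAIN` setting for virtual-host-style addressing are assumptions, not the PR's final code):

```
import org.testcontainers.containers.MinIOContainer;

public class MinioHelper {
  static MinIOContainer createContainer() {
    return new MinIOContainer("minio/minio:latest")
        .withUserName("accessKeyId")
        .withPassword("secretAccessKey")
        // MinIO only serves buckets in virtual-host style when MINIO_DOMAIN is set
        .withEnv("MINIO_DOMAIN", "localhost");
  }
}
```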
Re: [PR] Spec: Support geo type [iceberg]
desruisseaux commented on code in PR #10981: URL: https://github.com/apache/iceberg/pull/10981#discussion_r1806236061 ## format/spec.md: ## @@ -198,6 +199,9 @@ Notes: - Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). - Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). 3. Character strings must be stored as UTF-8 encoded byte arrays. +4. Coordinate Reference System, i.e. mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be changed by schema evolution. Review Comment: Note that this example is non-standard. A CRS JSON encoding is in progress at OGC, but we don't know yet what it will look like. The only standards at this time are GML, WKT, or simply a reference to the EPSG database or OGC registry (e.g. https://www.opengis.net/def/crs/OGC/0/CRS84 or `"urn:ogc:def:crs:OGC::CRS84"`). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: code clean up for deleteRemovedMetadataFiles [iceberg]
leesf commented on code in PR #11352: URL: https://github.com/apache/iceberg/pull/11352#discussion_r1806236088 ## core/src/main/java/org/apache/iceberg/TableMetadata.java: ## @@ -77,6 +83,51 @@ public static TableMetadata newTableMetadata( return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties); } + /** + * Deletes the oldest metadata files if {@link + * TableProperties#METADATA_DELETE_AFTER_COMMIT_ENABLED} is true. + * + * @param io file IO + * @param base table metadata on which previous versions were based + * @param metadata new table metadata with updated previous versions + */ + public static void deleteRemovedMetadataFiles( Review Comment: @nastra currently the `deleteRemovedMetadataFiles` method lives in both `HadoopTableOperations` and `BaseMetastoreTableOperations`, which causes duplicate code, so moving the method here was the place I had in mind. Or do you have a better idea? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec: Support geo type [iceberg]
desruisseaux commented on code in PR #10981: URL: https://github.com/apache/iceberg/pull/10981#discussion_r1806246237 ## format/spec.md: ## @@ -200,12 +200,16 @@ Supported primitive types are defined in the table below. Primitive types added | | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | | | **`fixed(L)`** | Fixed-length byte array of length L | | | | **`binary`** | Arbitrary-length byte array | | +| [v3](#version-3) | **`geometry(C, CE, E)`** | An object of the simple feature geometry model as defined by Appendix G; This may be any of the geometry subclasses defined therein; crs C [4], crs-encoding CE [5], edges E [6] | C, CE, E are fixed, and if unset will take default values. Encoded as WKB, see Appendix G. | Notes: 1. Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). 2. Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). 3. Character strings must be stored as UTF-8 encoded byte arrays. +4. Crs (coordinate reference system) is a mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be changed by schema evolution. +5. Crs-encoding (coordinate reference system encoding) is the type of crs field. Must be set if crs is set. Defaults to "PROJJSON". Fixed and cannot be changed by schema evolution. +6. Edges is the interpretation for non-point geometries in geometry object, i.e. whether an edge between points represent a straight cartesian line or the shortest line on the sphere. Defaults to "planar". Fixed and cannot be changed by schema evolution. Review Comment: > Crs-encoding (coordinate reference system encoding) is the type of crs field. * Please change "is the type" to "is the encoding". The CRS type is something completely different from the information put in this field. * (minor detail) "crs" should be upper-case "CRS" as it is an acronym for "Coordinate Reference System". > Defaults to "PROJJSON". I suggest removing this specification. PROJJSON is non-standard and may never be (we don't know yet what OGC's decision will be). I don't think that we want to tie Iceberg forever to a format specific to one project. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: delete temp metadata file when version already exists [iceberg]
leesf commented on code in PR #11350: URL: https://github.com/apache/iceberg/pull/11350#discussion_r1806229187 ## core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java: ## @@ -368,7 +368,13 @@ private void renameToFinal(FileSystem fs, Path src, Path dst, int nextVersion) { } if (fs.exists(dst)) { -throw new CommitFailedException("Version %d already exists: %s", nextVersion, dst); +CommitFailedException cfe = +new CommitFailedException("Version %d already exists: %s", nextVersion, dst); +RuntimeException re = tryDelete(src); Review Comment: @nastra yes, I think it is always safe to delete the temp src file (the temp metadata file); a commit retry would create a new temp metadata file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
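The hunk quoted above is truncated; spelled out, the pattern under discussion looks roughly like this (a sketch based on the diff; `tryDelete` is the PR's helper and is assumed to return null when the delete succeeds):

```
if (fs.exists(dst)) {
  // the commit lost the race, so the CommitFailedException is the primary error
  CommitFailedException cfe =
      new CommitFailedException("Version %d already exists: %s", nextVersion, dst);
  // best-effort cleanup of the temp metadata file; a retry writes a fresh one
  RuntimeException re = tryDelete(src);
  if (re != null) {
    // attach the cleanup failure instead of letting it mask the commit failure
    cfe.addSuppressed(re);
  }
  throw cfe;
}
```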
Re: [PR] Spec: Support geo type [iceberg]
desruisseaux commented on code in PR #10981: URL: https://github.com/apache/iceberg/pull/10981#discussion_r1806278747 ## format/spec.md: ## @@ -483,6 +485,8 @@ Notes: 2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper bounds. 3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files. 4. The following field ids are reserved on `data_file`: 141. +5. For `geometry`, this is a point. X = min value of all component points of all geometries in file when edges = PLANAR, westernmost bound of all geometries in file if edges = SPHERICAL. Y = max value of all component points of all geometries in file when edges = PLANAR, northernmost bound of all geometries if edges = SPHERICAL. Z is min value for all component points of all geometries in the file. M is min value of all component points of all geometries in the file. See Appendix D for encoding. Review Comment: > Nowaday the trend seems to be Geography vs Geometry type actually This trend was initiated by PostGIS, but I don't know how widely it is used outside PostGIS. This is not necessarily a trend that we should follow. The PostGIS's Geography type can handle envelopes crossing the anti-meridian (longitude wraparound), but this is not the only case where wraparounds are needed. Other examples are: * Radar data using a polar coordinate system. * Climatological data, where "December" is followed by "January" of no particular year (temporal wraparound). Instead of a distinct Geography type, a better and more generic solution is to interpret "min" and "max" as "starting point" and "ending point", with interior in the direction of increasing values. So `min > max` (or `west > east`) means that a wraparound happens at infinity:  This convention is already used by the EPSG geodetic registry, by the GGXF standard, by some versions of the WMS or WCS standards (I do not remember which one exactly). This is the only truly generic approach that I'm aware of. All other approaches work only in special cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
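A small illustration of the containment test that falls out of the "min may exceed max" convention described above (illustrative only; longitudes in degrees):

```
public class LonRange {
  // with west <= east this is an ordinary interval test; with west > east the
  // interior wraps around the anti-meridian, per the convention described above
  static boolean lonInRange(double lon, double west, double east) {
    return west <= east ? (lon >= west && lon <= east) : (lon >= west || lon <= east);
  }

  public static void main(String[] args) {
    System.out.println(lonInRange(179.0, 170.0, -170.0)); // true: inside the wrapped interval
    System.out.println(lonInRange(0.0, 170.0, -170.0));   // false: outside it
  }
}
```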
Re: [PR] Core: code clean up for deleteRemovedMetadataFiles [iceberg]
nastra commented on code in PR #11352: URL: https://github.com/apache/iceberg/pull/11352#discussion_r1806309295 ## core/src/main/java/org/apache/iceberg/TableMetadata.java: ## @@ -77,6 +83,51 @@ public static TableMetadata newTableMetadata( return newTableMetadata(schema, spec, SortOrder.unsorted(), location, properties); } + /** + * Deletes the oldest metadata files if {@link + * TableProperties#METADATA_DELETE_AFTER_COMMIT_ENABLED} is true. + * + * @param io file IO + * @param base table metadata on which previous versions were based + * @param metadata new table metadata with updated previous versions + */ + public static void deleteRemovedMetadataFiles( Review Comment: I would say maybe rather move it to `CatalogUtil` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
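For illustration, a hedged sketch of what the consolidated helper could look like once moved (the host class name and the simple per-file delete are assumptions; the real implementations may differ, for example by using bulk deletes when the `FileIO` supports them):

```
import java.util.Set;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.relocated.com.google.common.collect.Sets;

public class MetadataCleanupSketch {
  static void deleteRemovedMetadataFiles(FileIO io, TableMetadata base, TableMetadata metadata) {
    if (base == null) {
      return; // first commit: nothing was replaced
    }

    boolean deleteAfterCommit =
        metadata.propertyAsBoolean(
            TableProperties.METADATA_DELETE_AFTER_COMMIT_ENABLED,
            TableProperties.METADATA_DELETE_AFTER_COMMIT_ENABLED_DEFAULT);
    if (deleteAfterCommit) {
      // metadata files tracked by the old metadata but dropped from the new one
      Set<TableMetadata.MetadataLogEntry> removed = Sets.newHashSet(base.previousFiles());
      metadata.previousFiles().forEach(removed::remove);
      removed.forEach(entry -> io.deleteFile(entry.file()));
    }
  }
}
```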
Re: [PR] Spec: Support geo type [iceberg]
desruisseaux commented on code in PR #10981: URL: https://github.com/apache/iceberg/pull/10981#discussion_r1806309872 ## format/spec.md: ## @@ -483,6 +485,8 @@ Notes: 2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate. NaNs are not permitted as lower or upper bounds. 3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files. 4. The following field ids are reserved on `data_file`: 141. +5. For `geometry`, this is a point. X = min value of all component points of all geometries in file when edges = PLANAR, westernmost bound of all geometries in file if edges = SPHERICAL. Y = max value of all component points of all geometries in file when edges = PLANAR, northernmost bound of all geometries if edges = SPHERICAL. Z is min value for all component points of all geometries in the file. M is min value of all component points of all geometries in the file. See Appendix D for encoding. Review Comment: On the "geodesics" versus "pseudo-geodesics" names, I do not have a strong opinion at this time. The rationale for "pseudo-geodesic" is that current implementations are cheating: they do not compute the real geodesic. Instead, they pretend that the ellipsoid is a sphere (e.g., they ignore WGS84 flattening) and apply the formulas for a sphere. When we pretend that the ellipsoid is a sphere, the results are of course a little bit wrong, but there are some ways to make them "less wrong". For example, some map projections will convert the _geodetic latitudes_ to _authalic latitudes_. An _authalic sphere_ is a sphere with the same surface as the ellipsoid. An _authalic latitude_ is the latitude that a point would have if we deform the ellipsoid until it gets the shape of the authalic sphere. Therefore, for each point in a geometry, the numerical value of the _authalic latitude_ is slightly different from that of the _geodetic latitude_. If an application converts all geodetic latitudes to authalic latitudes before performing computations with libraries that assume a sphere, such as S2, then even if the shapes are a little bit wrong, the surfaces are closer to reality. Conversely, if someone is more interested in shapes rather than surfaces, she/he may use _conformal latitudes_ instead. See for example the list of [auxiliary latitudes on MathWorld](https://mathworld.wolfram.com/AuxiliaryLatitude.html). In summary, even if Iceberg doesn't dive into all this complexity for now, there is a possibility that future versions may want to add something like `AUTHALIC_GEODESIC` (I don't know if putting the "authalic" and "geodesic" terms together is orthodox; we would need to verify usage) for geodesics computed on a sphere but with geodetic latitudes converted to authalic latitudes. Idem for other kinds of latitude such as conformal. The current implementations just ignore all this complexity, use the latitudes _as-is_ and plug the values directly into the spherical formulas without bothering about the fact that this is not quite correct. This is exactly what the "Google Mercator" projection does. EPSG calls that "Popular Visualisation Pseudo Mercator" projection. Hence the proposal for "pseudo" in "pseudo-geodesic", by analogy with "pseudo-Mercator".
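For reference, the standard definition of the authalic latitude mentioned above (e.g., Snyder, Map Projections: A Working Manual), in terms of the geodetic latitude and the eccentricity of the ellipsoid; stated here for readers, not part of the PR:

```
q(\varphi) = (1 - e^2)\left[\frac{\sin\varphi}{1 - e^2\sin^2\varphi}
  - \frac{1}{2e}\,\ln\frac{1 - e\sin\varphi}{1 + e\sin\varphi}\right],
\qquad
\beta = \arcsin\frac{q(\varphi)}{q(\pi/2)}
```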
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org