[GitHub] [iceberg] nastra opened a new pull request, #6562: Core: Improvements around Token Refresh time expiration
nastra opened a new pull request, #6562: URL: https://github.com/apache/iceberg/pull/6562

This currently depends on changes from https://github.com/apache/iceberg/pull/6489, but will be rebased once https://github.com/apache/iceberg/pull/6489 is in.
[GitHub] [iceberg] nastra commented on a diff in pull request #6489: Core: Improve token exchange handling when token expires
nastra commented on code in PR #6489: URL: https://github.com/apache/iceberg/pull/6489#discussion_r1066849968

## core/src/main/java/org/apache/iceberg/rest/auth/OAuth2Util.java: ##

@@ -376,5 +404,127 @@ public Pair refresh(RESTClient client) {
       return null;
     }
+
+    @SuppressWarnings("FutureReturnValueIgnored")
+    public static void scheduleTokenRefresh(
+        RESTClient client,
+        ScheduledExecutorService tokenRefreshExecutor,
+        AuthSession session,
+        long startTimeMillis,
+        long expiresIn,
+        TimeUnit unit) {
+      // convert expiration interval to milliseconds
+      long expiresInMillis = unit.toMillis(expiresIn);
+      // how much ahead of time to start the request to allow it to complete
+      long refreshWindowMillis = Math.min(expiresInMillis / 10, MAX_REFRESH_WINDOW_MILLIS);
+      // how much time to wait before expiration
+      long waitIntervalMillis = expiresInMillis - refreshWindowMillis;
+      // how much time has already elapsed since the new token was issued
+      long elapsedMillis = System.currentTimeMillis() - startTimeMillis;
+      // how much time to actually wait
+      long timeToWait = Math.max(waitIntervalMillis - elapsedMillis, MIN_REFRESH_WAIT_MILLIS);
+
+      tokenRefreshExecutor.schedule(
+          () -> {
+            long refreshStartTime = System.currentTimeMillis();
+            Pair expiration = session.refresh(client);
+            if (expiration != null) {
+              scheduleTokenRefresh(
+                  client,
+                  tokenRefreshExecutor,
+                  session,
+                  refreshStartTime,
+                  expiration.first(),
+                  expiration.second());
+            }
+          },
+          timeToWait,
+          TimeUnit.MILLISECONDS);
+    }
+
+    public static AuthSession sessionFromToken(
+        RESTClient client,
+        ScheduledExecutorService tokenRefreshExecutor,
+        String token,
+        @Nullable String credential,
+        Long defaultExpirationMillis,
+        AuthSession parent) {
+      Optional jwt = JWT.of(token);
+
+      if (jwt.isPresent() && jwt.get().isExpired()) {
+        Preconditions.checkState(
+            null != credential, "Credential is required to refresh expired token.");
+
+        // we add the credential to the Authorization header and perform a token exchange to
+        // refresh the expired token
+        AuthSession session = new AuthSession(OAuth2Util.basicAuthHeader(credential), null, null);
+
+        return AuthSession.sessionFromTokenExchange(
+            client,
+            tokenRefreshExecutor,
+            token,
+            OAuth2Properties.ACCESS_TOKEN_TYPE,
+            session,
+            OAuth2Properties.CATALOG_SCOPE);
+      } else {
+        AuthSession session =
+            new AuthSession(parent.headers(), token, OAuth2Properties.ACCESS_TOKEN_TYPE);
+        Long expiresInMillis = jwt.map(JWT::expiresInMillis).orElse(defaultExpirationMillis);
+
+        if (null != expiresInMillis) {
+          scheduleTokenRefresh(
+              client,
+              tokenRefreshExecutor,
+              session,
+              System.currentTimeMillis(),
+              expiresInMillis,

Review Comment:
   I've refactored this part of the code in a separate PR (https://github.com/apache/iceberg/pull/6562/commits/7c4655ea6d5390301106c2f5f2d51dfc63adf779)
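To make the scheduling arithmetic above concrete, here is a small worked example; the two constant values below are illustrative assumptions for the sketch, not the values actually defined in `OAuth2Util`:

```java
public class RefreshWaitExample {
  public static void main(String[] args) {
    // Assumed values for illustration only; the real constants live in OAuth2Util.
    long maxRefreshWindowMillis = 5 * 60 * 1000; // assume a 5-minute max refresh window
    long minRefreshWaitMillis = 10 * 1000;       // assume a 10-second minimum wait

    long expiresInMillis = 60 * 60 * 1000;       // token valid for 1 hour
    // refresh window: a tenth of the token lifetime, capped at the max window -> 5 min
    long refreshWindowMillis = Math.min(expiresInMillis / 10, maxRefreshWindowMillis);
    // wait until the refresh window before expiration -> 55 min
    long waitIntervalMillis = expiresInMillis - refreshWindowMillis;
    // assume 2 minutes already elapsed since the token was issued
    long elapsedMillis = 2 * 60 * 1000;
    // delay handed to the executor -> 53 min, floored at the minimum wait
    long timeToWait = Math.max(waitIntervalMillis - elapsedMillis, minRefreshWaitMillis);
    System.out.println(timeToWait); // prints 3180000 (53 minutes)
  }
}
```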
[GitHub] [iceberg] Fokko commented on a diff in pull request #6525: Python: Refactor loading manifests
Fokko commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1066862697

## python/pyiceberg/avro/resolver.py: ##

@@ -109,38 +109,46 @@ def resolve(

 class SchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
-    read_types: Optional[Dict[int, Callable[[Schema], StructProtocol]]]
+    read_types: Dict[int, Type[StructProtocol]]
+    context: List[int]

-    def __init__(self, read_types: Optional[Dict[int, Callable[[Schema], StructProtocol]]]):
+    def __init__(self, read_types: Dict[int, Type[StructProtocol]] = EMPTY_DICT) -> None:
         self.read_types = read_types
+        self.context = []

     def schema(self, schema: Schema, expected_schema: Optional[IcebergType], result: Reader) -> Reader:
         return result

+    def before_field(self, field: NestedField, field_partner: Optional[NestedField]) -> None:
+        self.context.append(field.field_id)
+
+    def after_field(self, field: NestedField, field_partner: Optional[NestedField]) -> None:
+        self.context.pop()
+
     def struct(self, struct: StructType, expected_struct: Optional[IcebergType], field_readers: List[Reader]) -> Reader:
+        # -1 indicates the struct root
+        read_struct_id = self.context[-1] if len(self.context) > 0 else -1
+        struct_callable = self.read_types.get(read_struct_id, Record)

Review Comment:
   Got it. In this PR there is no record schema passed in for the typed classes (`ManifestEntry`, `ManifestFile`, etc.). I would be hesitant to add yet another schema there. Are there any cases where we want the Record schema and the read schema to differ? Ideally, you want to shape the read schema to the data that you need, so you don't read anything that isn't being used.

   > It's also better to pass the read schema because we can make generic records more friendly.

   This is something that we do for the `Record` class:
   ```python
   def read(self, decoder: BinaryDecoder) -> Any:
       if issubclass(self.create_struct, Record):
           struct = self.create_struct(length=len(self.field_readers), fields=self.fields)
       elif issubclass(self.create_struct, PydanticStruct):
           struct = self.create_struct.construct()
       else:
           raise ValueError(f"Expected a subclass of PydanticStruct, got: {self.create_struct}")
   ```
   And then we have the attributes available:
   ```python
   def test_named_record() -> None:
       r = Record(fields=(NestedField(0, "id", IntegerType()), NestedField(1, "name", StringType())))
       assert r.id is None
       assert r.name is None
       r[0] = 123
       r[1] = "abc"
       assert r[0] == 123
       assert r[1] == "abc"
       assert r.id == 123
       assert r.name == "abc"
   ```
   > It looks like we can use inspect.getfullargspec(class.__init__) to inspect the init arguments. So we could do something like this:

   I'm not a big fan of doing inspections; as in Java, it is quite expensive. And it looks like there is a [performance issue](https://bugs.python.org/issue37010) in `getfullargspec`. What do you think of passing in the fields as we have currently?
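For readers skimming the thread, a minimal sketch of the `inspect.getfullargspec` idea being debated; `construct` and the `fields` argument are hypothetical names for this sketch, not existing pyiceberg API:

```python
import inspect
from typing import Any, Tuple, Type


def construct(cls: Type[Any], fields: Tuple[Any, ...]) -> Any:
    """Hypothetical dispatch: pass `fields` only when the constructor accepts it."""
    # getfullargspec is the call mentioned in the comment above; note the
    # performance caveat it raises (https://bugs.python.org/issue37010).
    params = inspect.getfullargspec(cls.__init__).args
    if "fields" in params:
        return cls(fields=fields)
    return cls()
```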
[GitHub] [iceberg] Fokko merged pull request #6555: Python: Expression to disjunctive normal form
Fokko merged PR #6555: URL: https://github.com/apache/iceberg/pull/6555
[GitHub] [iceberg] pvary commented on pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg
pvary commented on PR #6557: URL: https://github.com/apache/iceberg/pull/6557#issuecomment-1378673916

@stevenzwu: We had several issues with Avro/RowData conversions. This PR solves one of them - we might not want to solve all of them at once and in one PR, but I think we should collect/consider all of them in the long run.

The specific issues:
- Iceberg shades the Avro version. This makes it inconvenient to use our own Avro tools/libraries, even if the two Avro versions are the same.
  - Our current solution is to write the Avro schema to a string and parse it with the shaded Avro library.
- Sometimes the Iceberg schema is not exactly the same as the Avro schema:
  - A different order of the columns could cause the built-in Iceberg converters to fail.
  - Maybe one of the schemas contains extra columns which are not needed after the transform.
- We often need transformation in both directions:
  - When writing: Avro to RowData.
  - When reading: RowData to Avro.

Your PR solves one of the issues above. It might be worth considering whether we can easily add a solution for some of the other issues in the same PR, or whether we would like to solve those separately.

Thanks, Peter
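As an aside, a minimal sketch of the string round-trip workaround Peter mentions; the relocated package name in the comment below is an assumption about how the Iceberg runtime jars typically shade Avro, so check the actual relocation prefix of the jar in use:

```java
import org.apache.avro.SchemaBuilder;

public class ShadedSchemaRoundTrip {
  public static void main(String[] args) {
    // Build a schema with the application's own (unshaded) Avro.
    org.apache.avro.Schema unshaded =
        SchemaBuilder.record("sample").fields().requiredLong("id").endRecord();

    // Cross the shading boundary by serializing the schema to its JSON form...
    String json = unshaded.toString();

    // ...and re-parsing it with the relocated Avro classes. Commented out so the
    // sketch compiles without the shaded jar on the classpath; the package name
    // below is an assumption, not a verified coordinate.
    // org.apache.iceberg.shaded.org.apache.avro.Schema shaded =
    //     new org.apache.iceberg.shaded.org.apache.avro.Schema.Parser().parse(json);
    System.out.println(json);
  }
}
```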
[GitHub] [iceberg] findepi commented on a diff in pull request #6474: Make it explicit that metrics reporter is required
findepi commented on code in PR #6474: URL: https://github.com/apache/iceberg/pull/6474#discussion_r1066995039

## core/src/main/java/org/apache/iceberg/BaseTable.java: ##

@@ -48,6 +49,7 @@ public BaseTable(TableOperations ops, String name) {
   }

   public BaseTable(TableOperations ops, String name, MetricsReporter reporter) {
+    Preconditions.checkNotNull(reporter, "reporter cannot be null");

Review Comment:
   > Also we generally use `Preconditions.checkArgument` more than `Preconditions.checkNotNull`:

   Do we commonly use it to check for nullness? I think in the code I've contributed to Iceberg so far I used `checkNotNull` and don't remember being turned around to use `checkArgument`.
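For readers unfamiliar with the two Guava idioms under discussion, a minimal comparison; Iceberg ships these under the relocated package `org.apache.iceberg.relocated.com.google.common.base`, but the plain Guava import works the same way:

```java
import com.google.common.base.Preconditions;

public class PreconditionsDemo {
  static void withCheckNotNull(Object reporter) {
    // Throws NullPointerException with the given message when reporter is null.
    Preconditions.checkNotNull(reporter, "reporter cannot be null");
  }

  static void withCheckArgument(Object reporter) {
    // Same guard, but throws IllegalArgumentException instead.
    Preconditions.checkArgument(reporter != null, "reporter cannot be null");
  }

  public static void main(String[] args) {
    withCheckNotNull(new Object());
    withCheckArgument(new Object());
  }
}
```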
[GitHub] [iceberg] amogh-jahagirdar commented on pull request #6559: Core: View core parser implementations
amogh-jahagirdar commented on PR #6559: URL: https://github.com/apache/iceberg/pull/6559#issuecomment-1378823387

Agreed @jackye1995 we can break this down further for easier review. I'll raise the version, representation and history entry PRs separately.
[GitHub] [iceberg] nastra opened a new pull request, #6563: Core: Align commit metric name
nastra opened a new pull request, #6563: URL: https://github.com/apache/iceberg/pull/6563

All other commit metric names use "positional" rather than "position", except for this one. Also, the method that retrieves this metric (`addedPositionalDeleteFiles()`) already uses the correct name.
[GitHub] [iceberg] kmozaid commented on a diff in pull request #6410: Configurable metrics reporter by catalog properties
kmozaid commented on code in PR #6410: URL: https://github.com/apache/iceberg/pull/6410#discussion_r1067058918

## core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java: ##

@@ -301,4 +305,16 @@ protected static String fullTableName(String catalogName, TableIdentifier identi
     return sb.toString();
   }
+
+  private MetricsReporter metricsReporter() {
+    if (metricsReporter == null) {
+      metricsReporter =
+          properties().containsKey(CatalogProperties.METRICS_REPORTER_IMPL)
+              ? CatalogUtil.loadMetricsReporter(
+                  properties().get(CatalogProperties.METRICS_REPORTER_IMPL))
+              : LoggingMetricsReporter.instance();

Review Comment:
   Hi @szehon-ho, even without this change, LoggingMetricsReporter is always enabled for all catalogs. See BaseTable: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseTable.java#L45
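For context, a hedged sketch of what a pluggable reporter looks like under the mechanism in the diff above; the class below is hypothetical, and the property key string is my reading of `CatalogProperties.METRICS_REPORTER_IMPL`:

```java
import org.apache.iceberg.metrics.MetricsReport;
import org.apache.iceberg.metrics.MetricsReporter;

// Hypothetical reporter that a catalog could load when the "metrics-reporter-impl"
// catalog property points at this class name.
public class StdoutMetricsReporter implements MetricsReporter {
  // A no-arg constructor is needed so CatalogUtil can instantiate it reflectively.
  public StdoutMetricsReporter() {}

  @Override
  public void report(MetricsReport report) {
    // Replace with whatever sink the deployment needs (logs, StatsD, etc.).
    System.out.println("iceberg metrics: " + report);
  }
}
```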
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6559: Core: View core parser implementations
amogh-jahagirdar commented on code in PR #6559: URL: https://github.com/apache/iceberg/pull/6559#discussion_r1067058866

## core/src/main/java/org/apache/iceberg/view/SQLViewRepresentationParser.java: ##

@@ -0,0 +1,87 @@
+/* ... ASF license header ... */
+package org.apache.iceberg.view;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.util.JsonUtil;
+
+public class SQLViewRepresentationParser {
+  private enum Field {
+    SQL("sql"),

Review Comment:
   Good point, this was just how it was in the original PR, so I left it as-is. We can just use static fields (same as what we do in TableMetadataParser).
[GitHub] [iceberg] Fokko opened a new issue, #6564: Python write support
Fokko opened a new issue, #6564: URL: https://github.com/apache/iceberg/issues/6564

### Feature Request / Improvement

This is a placeholder ticket for implementing write support for PyIceberg. Since we want PyIceberg to handle only the metadata part of the Iceberg table format and not write the actual data, we need to get an overview of the frameworks that we most likely want to integrate with (PyArrow, Dask (fastparquet?), etc.).

Missing pieces:
- Being able to efficiently get statistics (upper and lower bounds) for the written files
- Being able to write Avro files (ManifestList, ManifestEntry)
- Proper integration tests between Java and Python to make sure that we don't brick any tables (started in https://github.com/apache/iceberg/pull/6398)

### Query engine

None
[GitHub] [iceberg] amogh-jahagirdar opened a new pull request, #6565: Core: View history entry core implementation
amogh-jahagirdar opened a new pull request, #6565: URL: https://github.com/apache/iceberg/pull/6565

Co-authored-by: John Zhuge

Separating this PR from https://github.com/apache/iceberg/pull/6559/files#diff-2a70d3056d3d0cca0da3ff4ddabc83c41c06af2296f281a5b37c5b54ead98915 for easier review from the community.
[GitHub] [iceberg] ajantha-bhat commented on pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file
ajantha-bhat commented on PR #6461: URL: https://github.com/apache/iceberg/pull/6461#issuecomment-1379088280

@RussellSpitzer: After checking some code, I found that callers have logic to find the required ordering (by checking the distribution table properties for merge/write/update/delete), so I think I can use that. I did some manual testing and it looks OK to me. Please let me know if this change is OK; if yes, I can add all combinations of test cases.
[GitHub] [iceberg] rdblue commented on a diff in pull request #6353: Make sure S3 stream opened by ReadConf ctor is closed
rdblue commented on code in PR #6353: URL: https://github.com/apache/iceberg/pull/6353#discussion_r1067212201

## parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java: ##

@@ -79,9 +83,11 @@ private ReadConf init() {
           nameMapping,
           reuseContainers,
           caseSensitive,
-          null);
-      this.conf = readConf.copy();
-      return readConf;
+          null)) {
+        this.conf = readConf.copy();

Review Comment:
   @islamismailov, on the first call to init, the `ReadConf` is copied and stored in the `ParquetReader` for future use, but the `ReadConf` that was just created is the one that is returned. That way, the first init call uses the same stream for the `FileIterator` and doesn't reopen one.
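To restate the pattern being defended, a simplified, self-contained sketch; the names follow the diff above, and the nested `ReadConf` class is a stand-in for the real `org.apache.iceberg.parquet.ReadConf`:

```java
class LazyReadConf {
  private ReadConf conf = null;

  private ReadConf init() {
    if (conf == null) {
      // Opens the underlying input stream once.
      ReadConf readConf = new ReadConf();
      // The stored copy is reused by later calls and does not hold the open stream.
      this.conf = readConf.copy();
      // The first caller gets the original, so its FileIterator reuses the stream
      // that was just opened instead of opening a second one.
      return readConf;
    }
    return conf;
  }

  // Stand-in for the real ReadConf, which wraps the Parquet file and stream state.
  static class ReadConf {
    ReadConf copy() {
      return new ReadConf();
    }
  }
}
```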
[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation
jackye1995 commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067210433

## core/src/test/resources/ViewHistoryEntry.json: ##

@@ -0,0 +1,4 @@
+{

Review Comment:
   I think the resources are reserved for full JSON files that might be stored in Iceberg, like table metadata. For specific objects inside the JSON file we can just use hard-coded strings for testing.
[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation
jackye1995 commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067214846

## core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java: ##

@@ -0,0 +1,71 @@
+/* ... ASF license header ... */
+package org.apache.iceberg.view;
+
+import java.util.Objects;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+
+public class BaseViewHistoryEntry implements ViewHistoryEntry {
+  private final long timestampMillis;
+  private final int versionId;
+
+  static ViewHistoryEntry of(long timestampMillis, int versionId) {

Review Comment:
   why do we need a static method if it is just delegating to the constructor?
[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation
jackye1995 commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067215702

## core/src/main/java/org/apache/iceberg/view/ViewHistoryEntryParser.java: ##

@@ -0,0 +1,44 @@
+/* ... ASF license header ... */
+package org.apache.iceberg.view;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import org.apache.iceberg.util.JsonUtil;
+
+class ViewHistoryEntryParser {
+
+  static final String VERSION_ID = "version-id";
+  static final String TIMESTAMP_MS = "timestamp-ms";
+
+  static void toJson(ViewHistoryEntry entry, JsonGenerator generator) throws IOException {
+    generator.writeStartObject();
+    generator.writeNumberField(TIMESTAMP_MS, entry.timestampMillis());
+    generator.writeNumberField(VERSION_ID, entry.versionId());
+    generator.writeEndObject();
+  }
+
+  static ViewHistoryEntry fromJson(JsonNode node) {
+    return BaseViewHistoryEntry.of(
+        JsonUtil.getLong(TIMESTAMP_MS, node), JsonUtil.getInt(VERSION_ID, node));
+  }
+
+  private ViewHistoryEntryParser() {}

Review Comment:
   nit: constructor should be at top before the methods?
[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation
jackye1995 commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067224129

## core/src/test/java/org/apache/iceberg/view/ParserTestBase.java: ##

@@ -0,0 +1,57 @@
+/* ... ASF license header ... */
+package org.apache.iceberg.view;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import org.apache.iceberg.util.JsonUtil;
+import org.apache.iceberg.util.TestJsonUtil;
+import org.junit.Assert;
+import org.junit.Test;
+
+public abstract class ParserTestBase {
+
+  private final T entry;
+  private final String json;
+  private final JsonUtil.JsonWriter writer;
+  private final JsonUtil.JsonReader reader;
+
+  public ParserTestBase(

Review Comment:
   I am wondering if we really should structure the tests like this with a test base. Each object model could have different fields to be tested, and this makes the tests less flexible. Also, we added all these things to JsonUtil, and I am not sure they are useful outside this context.
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation
amogh-jahagirdar commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067224185

## core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java: ##

@@ -0,0 +1,71 @@
+/* ... ASF license header ... */
+package org.apache.iceberg.view;
+
+import java.util.Objects;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+
+public class BaseViewHistoryEntry implements ViewHistoryEntry {
+  private final long timestampMillis;
+  private final int versionId;
+
+  static ViewHistoryEntry of(long timestampMillis, int versionId) {

Review Comment:
   It's not needed, just a part of the original PR. We can just make the constructor public and use that.
[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation
jackye1995 commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067226988

## core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java: ##

@@ -0,0 +1,71 @@
+/* ... ASF license header ... */
+package org.apache.iceberg.view;
+
+import java.util.Objects;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+
+public class BaseViewHistoryEntry implements ViewHistoryEntry {
+  private final long timestampMillis;
+  private final int versionId;
+
+  static ViewHistoryEntry of(long timestampMillis, int versionId) {

Review Comment:
   I see. I can see the constructor being potentially protected; we can keep it like that for now, but the static method seems unnecessary.
[GitHub] [iceberg] dmgcodevil commented on issue #6370: What is the purpose of Hive Lock ?
dmgcodevil commented on issue #6370: URL: https://github.com/apache/iceberg/issues/6370#issuecomment-1379151268

I still don't understand why we need locks if we have transactions and can implement an `optimistic locking` model.
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation
amogh-jahagirdar commented on code in PR #6565: URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067231261

## core/src/test/java/org/apache/iceberg/view/ParserTestBase.java: ##

@@ -0,0 +1,57 @@
+/* ... ASF license header ... */
+package org.apache.iceberg.view;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import org.apache.iceberg.util.JsonUtil;
+import org.apache.iceberg.util.TestJsonUtil;
+import org.junit.Assert;
+import org.junit.Test;
+
+public abstract class ParserTestBase {
+
+  private final T entry;
+  private final String json;
+  private final JsonUtil.JsonWriter writer;
+  private final JsonUtil.JsonReader reader;
+
+  public ParserTestBase(

Review Comment:
   Agreed. I think what we can do is just follow the existing pattern for testing table metadata and enumerate the different toJson/fromJson cases. While the benefit of having a parent class is to reduce some duplication, for the parsing code it seems better just to have more explicit tests (for example, like we do with the ref parser).
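As a concrete illustration of the "explicit tests" style being agreed on, a hedged sketch assuming the `ViewHistoryEntryParser` shown earlier in this thread and the same package; the timestamp value is arbitrary:

```java
import java.io.IOException;
import org.apache.iceberg.util.JsonUtil;
import org.junit.Assert;
import org.junit.Test;

public class TestViewHistoryEntryParser {
  @Test
  public void testFromJson() throws IOException {
    // Explicit, self-describing input, in the style of the existing ref parser tests.
    String json = "{\"timestamp-ms\":1573518431292,\"version-id\":1}";
    ViewHistoryEntry entry =
        ViewHistoryEntryParser.fromJson(JsonUtil.mapper().readTree(json));
    Assert.assertEquals(1573518431292L, entry.timestampMillis());
    Assert.assertEquals(1, entry.versionId());
  }
}
```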
[GitHub] [iceberg] danielcweeks commented on a diff in pull request #6352: AWS: Fix inconsistent behavior of naming S3 location between read and write operations by allowing only s3 bucket name
danielcweeks commented on code in PR #6352: URL: https://github.com/apache/iceberg/pull/6352#discussion_r1067232103

## aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java: ##

@@ -74,17 +74,14 @@ class S3URI {
     this.scheme = schemeSplit[0];

     String[] authoritySplit = schemeSplit[1].split(PATH_DELIM, 2);
-    ValidationException.check(
-        authoritySplit.length == 2, "Invalid S3 URI, cannot determine bucket: %s", location);
-    ValidationException.check(
-        !authoritySplit[1].trim().isEmpty(), "Invalid S3 URI, path is empty: %s", location);
+
     this.bucket = bucketToAccessPointMapping == null ? authoritySplit[0] : bucketToAccessPointMapping.getOrDefault(authoritySplit[0], authoritySplit[0]);

     // Strip query and fragment if they exist
-    String path = authoritySplit[1];
+    String path = authoritySplit.length > 1 ? authoritySplit[1] : "";

Review Comment:
   I think @amogh-jahagirdar meant using java `Optional path()` to differentiate empty from having a path. I would generally agree, but Iceberg doesn't currently use Optional very much. I'm inclined to just let this go through as-is. Seems like an uncommon use case, but since it's easy to support, we should.
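A minimal standalone sketch of the parsing rule after this change, assuming `PATH_DELIM` is `"/"`:

```java
public class S3UriSketch {
  public static void main(String[] args) {
    for (String location : new String[] {"s3://bucket", "s3://bucket/path/to/file"}) {
      String[] schemeSplit = location.split("://", 2);
      String[] authoritySplit = schemeSplit[1].split("/", 2);
      String bucket = authoritySplit[0];
      // With the change above, a missing path yields "" instead of a ValidationException.
      String path = authoritySplit.length > 1 ? authoritySplit[1] : "";
      System.out.println(bucket + " -> '" + path + "'");
    }
  }
}
```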
[GitHub] [iceberg] danielcweeks merged pull request #6352: AWS: Fix inconsistent behavior of naming S3 location between read and write operations by allowing only s3 bucket name
danielcweeks merged PR #6352: URL: https://github.com/apache/iceberg/pull/6352
[GitHub] [iceberg] stevenzwu commented on pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg
stevenzwu commented on PR #6557: URL: https://github.com/apache/iceberg/pull/6557#issuecomment-1379226025

> Iceberg shades the Avro version.

I already updated the Flink doc in this PR and pointed out that the `iceberg-flink-runtime` shaded bundle jar shouldn't be used in this case, as it shades the Avro package. Instead, the non-shaded `iceberg-flink` jar should be used.

> Sometimes the Iceberg schema is not exactly the same as the Avro schema

This can probably be addressed separately using name mapping. Here the assumption is that the output Avro schema should be the same as the Iceberg schema.

> We often need transformation in both directions:
> - When writing: Avro to RowData
> - When reading: RowData to Avro

This PR only addresses the write part: Avro to RowData. I will create a separate PR for the read path.
[GitHub] [iceberg] amogh-jahagirdar commented on pull request #6565: Core: View history entry core implementation
amogh-jahagirdar commented on PR #6565: URL: https://github.com/apache/iceberg/pull/6565#issuecomment-1379247513

@jackye1995 I updated the PR, so it should be a bit simpler now; I agree that at this point we don't need the abstractions of base test classes. Thanks for the review! @jzhuge @rdblue let me know your thoughts on this!
[GitHub] [iceberg] haizhou-zhao commented on a diff in pull request #6324
haizhou-zhao commented on code in PR #6324: URL: https://github.com/apache/iceberg/pull/6324#discussion_r1067317621

## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveHadoopUtil.java: ##

@@ -0,0 +1,36 @@
+/* ... ASF license header ... */
+package org.apache.iceberg.hive;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import org.apache.hadoop.security.UserGroupInformation;
+
+public class HiveHadoopUtil {
+
+  private HiveHadoopUtil() {}
+
+  public static String getUserName() {

Review Comment:
   changed in the newest commit.
[GitHub] [iceberg] haizhou-zhao commented on a diff in pull request #6324
haizhou-zhao commented on code in PR #6324: URL: https://github.com/apache/iceberg/pull/6324#discussion_r1067317866

## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveHadoopUtil.java: ##

@@ -0,0 +1,36 @@
+/* ... ASF license header ... */
+package org.apache.iceberg.hive;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import org.apache.hadoop.security.UserGroupInformation;
+
+public class HiveHadoopUtil {
+
+  private HiveHadoopUtil() {}
+
+  public static String getUserName() {
+    try {
+      return UserGroupInformation.getCurrentUser().getUserName();
+    } catch (IOException e) {
+      throw new UncheckedIOException("Fail to obtain Hadoop UGI user", e);

Review Comment:
   Falling back to the OS username in the newest commit.
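A hedged sketch of the fallback described in that comment; the exact shape of the committed code may differ:

```java
import java.io.IOException;
import org.apache.hadoop.security.UserGroupInformation;

public class HiveHadoopUtilSketch {
  private HiveHadoopUtilSketch() {}

  // Prefer the Hadoop UGI user; fall back to the OS user when UGI resolution fails.
  public static String getUserName() {
    try {
      return UserGroupInformation.getCurrentUser().getUserName();
    } catch (IOException e) {
      // Standard JVM property holding the operating-system user name.
      return System.getProperty("user.name");
    }
  }
}
```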
[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests
rdblue commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067330294

## python/pyiceberg/avro/resolver.py: ##

@@ -109,38 +109,46 @@ def resolve(

 class SchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
-    read_types: Optional[Dict[int, Callable[[Schema], StructProtocol]]]
+    read_types: Dict[int, Type[StructProtocol]]
+    context: List[int]

-    def __init__(self, read_types: Optional[Dict[int, Callable[[Schema], StructProtocol]]]):
+    def __init__(self, read_types: Dict[int, Type[StructProtocol]] = EMPTY_DICT) -> None:
         self.read_types = read_types
+        self.context = []

     def schema(self, schema: Schema, expected_schema: Optional[IcebergType], result: Reader) -> Reader:
         return result

+    def before_field(self, field: NestedField, field_partner: Optional[NestedField]) -> None:
+        self.context.append(field.field_id)
+
+    def after_field(self, field: NestedField, field_partner: Optional[NestedField]) -> None:
+        self.context.pop()
+
     def struct(self, struct: StructType, expected_struct: Optional[IcebergType], field_readers: List[Reader]) -> Reader:
+        # -1 indicates the struct root
+        read_struct_id = self.context[-1] if len(self.context) > 0 else -1
+        struct_callable = self.read_types.get(read_struct_id, Record)

Review Comment:
   There is a need to be able to read a subset of fields when reading a manifest list or a manifest file. Right now, we don't use the lower or upper bounds, for example, but we still read them into memory rather than skipping them.

   There are also other use cases. A generic record, for example, should be able to use the field names from the read schema. To do those things, this PR needs to support passing the read schema to the instance that is instantiated.

   I don't really care how we do it. It can be by inspecting the constructor, assuming that the constructor accepts a schema, checking for a `set_read_schema` method, or some other idea. But it needs to be done as part of this PR.
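A sketch of the hook-based option from that list, in plain Python; `set_read_schema` is a name taken from the comment above, not an existing pyiceberg API:

```python
from typing import Any, Type


def instantiate_struct(cls: Type[Any], read_schema: Any) -> Any:
    """Hypothetical dispatch: hand the read schema to the new instance when the
    class exposes a `set_read_schema` hook; otherwise construct it plainly."""
    instance = cls()
    if hasattr(instance, "set_read_schema"):
        instance.set_read_schema(read_schema)
    return instance
```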
[GitHub] [iceberg] szehon-ho merged pull request #5629: Spark: Add "Iceberg" prefix to SparkTable name string for better observability of Iceberg tables on SparkUI
szehon-ho merged PR #5629: URL: https://github.com/apache/iceberg/pull/5629
[GitHub] [iceberg] szehon-ho commented on pull request #5629: Spark: Add "Iceberg" prefix to SparkTable name string for better observability of Iceberg tables on SparkUI
szehon-ho commented on PR #5629: URL: https://github.com/apache/iceberg/pull/5629#issuecomment-1379324890

Merged, thanks @sumeetgajjar, @wypoon
[GitHub] [iceberg] szehon-ho merged pull request #6476: API, Core, Flink, Parquet, Spark: Use enhanced for loop
szehon-ho merged PR #6476: URL: https://github.com/apache/iceberg/pull/6476
[GitHub] [iceberg] szehon-ho commented on pull request #6476: API, Core, Flink, Parquet, Spark: Use enhanced for loop
szehon-ho commented on PR #6476: URL: https://github.com/apache/iceberg/pull/6476#issuecomment-137912 Merged, thanks @krvikash , @hililiwei , @amogh-jahagirdar -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] dramaticlly opened a new issue, #6567: pyiceberg table scan problem with row filter set to non-partition columns
dramaticlly opened a new issue, #6567: URL: https://github.com/apache/iceberg/issues/6567

### Apache Iceberg version None

### Query engine None

### Please describe the bug 🐞

I really like the new table scan feature released in the latest pyiceberg 0.2.1 release, thanks @Fokko. It works great when I provide the partition column as the row filter, but it does not work as expected when I provide other columns as part of the expression. `scan.plan_files()` should return the collection of parquet files that satisfy the predicate in the row filter, but it is returning all of them instead. Here are my repro steps; I created a simple table `hongyue_zhang.mls23` to start with.

### Schema

```ddl
Create Table
--------------------------------------------------------------------
 CREATE TABLE iceberg.hongyue_zhang.mls23 (
    id bigint NOT NULL,
    data varchar,
    ts date
 )
 WITH (
    format = 'PARQUET',
    location = 's3a://warehouse-default/warehouse/hongyue_zhang.db/mls23',
    partitioning = ARRAY['ts']
 )
(1 row)
```

### Setup

The table has 2 partitions and 198 records total; each write has its own parquet file for the sake of simplicity.

```
partition       | record_count | file_count | total_size | data
----------------+--------------+------------+------------+--------------------------------------------------------------------------
{ts=2023-01-04} | 99           | 99         | 115300     | {id={min=1, max=1, null_count=0}, data={min=b, max=bbbc, null_count=0}}
{ts=2023-01-05} | 99           | 99         | 115303     | {id={min=0, max=0, null_count=0}, data={min=a, max=aaab, null_count=0}}
(2 rows)
```

### Python code

```python
import os
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual, And, EqualTo

catalog = load_catalog("prod")
table = catalog.load_table("hongyue_zhang.mls23")
table.location()

scan1 = table.scan(row_filter=EqualTo("ts", "2023-01-04"))
yesterday_files = [task.file.file_path for task in scan1.plan_files()]
print(len(yesterday_files))
# expect 99, and actual is 99 parquet files for the single partition

scan2 = table.scan(row_filter=EqualTo("data", "a"))
a_files = [task.file.file_path for task in scan2.plan_files()]
print(len(a_files))
# expect 1, but I am seeing 198 instead, which means all parquet files are returned

scan3 = table.scan(row_filter=And(EqualTo("ts", "2023-01-04"), EqualTo("data", "a")))
yesterday_and_a_files = [task.file.file_path for task in scan3.plan_files()]
print(len(yesterday_and_a_files))
# expect 1, but I am seeing 99, which means the row filter applies the 1st
# expression on partition column ts but not the 2nd expression on data
```

For validation, I also used Spark to query with a similar condition, and it returns 1 file as expected:

```spark
val result = spark.sql("select id, data, input_file_name(), ts from iceberg.hongyue_zhang.mls23 where data = 'a'")
+---+----+------------------------------------------------------------------------------------------------------------------------------+----------+
|id |data|input_file_name()                                                                                                             |ts        |
+---+----+------------------------------------------------------------------------------------------------------------------------------+----------+
|0  |a   |s3a://warehouse-default/warehouse/hongyue_zhang.db/mls23/data/ts=2023-01-05/0-6-5573682f-d72c-4a68-a08f-8fe4dbca8581-1.parquet|2023-01-05|
+---+----+------------------------------------------------------------------------------------------------------------------------------+----------+
```

I can't seem to figure out why it fails and am happy to contribute if anyone can share insights.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] jackye1995 commented on issue #6523: Table creation fails with Glue catalog on EMR
jackye1995 commented on issue #6523: URL: https://github.com/apache/iceberg/issues/6523#issuecomment-1379360910 @singhpk234 @rajarshisarkar could you take a look into this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] erikcw commented on issue #6567: pyiceberg table scan problem with row filter set to non-partition columns
erikcw commented on issue #6567: URL: https://github.com/apache/iceberg/issues/6567#issuecomment-1379415205

I stumbled into the same issue with a slight twist. I deleted all the rows from my table; however, pyiceberg is still returning parquet files with those records. Shouldn't those files no longer be in the current manifest?

```sql
-- Executed in Athena
DELETE FROM iceberg_test WHERE uid = '200441';
select count(uid) from "iceberg_test" where uid = '200441'; -- Returns 0.
```

```python
# Glue catalog type.
catalog = load_catalog("default")
table = catalog.load_table("testing.iceberg_test")

scan = table.scan(
    row_filter=NotEqualTo("uid", "200441"),  # Doesn't seem to make a difference with or without this line.
    selected_fields=("uid"),
)

files = [task.file.file_path for task in scan.plan_files()]
# files all contain the deleted value.
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on issue #6567: pyiceberg table scan problem with row filter set to non-partition columns
Fokko commented on issue #6567: URL: https://github.com/apache/iceberg/issues/6567#issuecomment-1379472226 @dramaticlly I just checked, and I can confirm that we don't filter on the datafile ranges; this will be implemented very soon 👍🏻 @erikcw Thanks for raising the issue; we're not handling deleted files right now. Could you create a separate issue so we can make sure that we keep track of it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
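For context, filtering on datafile ranges means evaluating the predicate against the per-column lower/upper bounds recorded in the manifest and skipping files that provably cannot match. A minimal, library-agnostic sketch (all names here are illustrative, not pyiceberg's API):

```python
from typing import Any, Optional


def may_contain(lower: Optional[Any], upper: Optional[Any], value: Any) -> bool:
    """Return False only when the file provably cannot contain `value`."""
    if lower is None or upper is None:
        return True  # no bounds recorded; the file must be read
    return lower <= value <= upper


# Using the bounds from the issue above: a file whose data column spans
# [min='b', max='bbbc'] can be skipped for the predicate data == 'a'.
assert may_contain("b", "bbbc", "a") is False
assert may_contain("a", "aaab", "a") is True
```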
[GitHub] [iceberg] erikcw opened a new issue, #6568: pyiceberg table scan returning deleted data
erikcw opened a new issue, #6568: URL: https://github.com/apache/iceberg/issues/6568

### Apache Iceberg version 1.1.0 (latest release)

### Query engine Other

### Please describe the bug 🐞

I originally raised this issue in #6567. After deleting rows from a table (in my case with Athena), pyiceberg is still returning parquet files with those records from a table scan. Shouldn't those files no longer be in the current manifest and hence not be returned by the table scan?

```sql
-- Executed in Athena
DELETE FROM iceberg_test WHERE uid = '200441';
select count(uid) from "iceberg_test" where uid = '200441'; -- Returns 0.
```

```py
# Glue catalog type.
catalog = load_catalog("default")
table = catalog.load_table("testing.iceberg_test")

scan = table.scan(
    row_filter=NotEqualTo("uid", "200441"),  # Doesn't seem to make a difference with or without this line.
    selected_fields=("uid"),
)

files = [task.file.file_path for task in scan.plan_files()]
# files all contain the deleted value.
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] stevenzwu commented on a diff in pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg
stevenzwu commented on code in PR #6557: URL: https://github.com/apache/iceberg/pull/6557#discussion_r1067488603 ## flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/sink/TestAvroGenericRecordToRowDataMapper.java: ## @@ -0,0 +1,37 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.sink; + +import org.apache.flink.table.data.RowData; +import org.apache.iceberg.flink.AvroGenericRecordConverterBase; +import org.apache.iceberg.flink.DataGenerator; +import org.junit.Assert; + +public class TestAvroGenericRecordToRowDataMapper extends AvroGenericRecordConverterBase { Review Comment: read path will be addressed separately -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] sumeetgajjar commented on pull request #5629: Spark: Add "Iceberg" prefix to SparkTable name string for better observability of Iceberg tables on SparkUI
sumeetgajjar commented on PR #5629: URL: https://github.com/apache/iceberg/pull/5629#issuecomment-1379529328 Thanks @wypoon and @szehon-ho for the reviews and for merging the PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #5234: Core, API: BaseRowDelta to branch Impl
amogh-jahagirdar commented on code in PR #5234: URL: https://github.com/apache/iceberg/pull/5234#discussion_r1067483646

## core/src/test/java/org/apache/iceberg/TestRowDelta.java: ## @@ -429,29 +470,31 @@ public void testValidateNoConflictsFromSnapshot() { statuses(Status.ADDED)); } - @Test + // @Test

Review Comment: Any reason this test is commented out?

## core/src/test/java/org/apache/iceberg/TableTestBase.java: ## @@ -472,6 +484,20 @@ void validateTableDeleteFiles(Table tbl, DeleteFile... expectedFiles) { Assert.assertEquals("Delete files should match", expectedFilePaths, actualFilePaths); } + void validateTableDeleteFilesWithRef(Table tbl, String ref, DeleteFile... expectedFiles) {

Review Comment: validateBranchDeleteFiles?

## core/src/test/java/org/apache/iceberg/TestRowDelta.java: ## @@ -1270,11 +1332,16 @@ public void testConcurrentNonConflictingRowDeltaAndRewriteFilesWithSequenceNumbe baseSnapshot.sequenceNumber()) .validateFromSnapshot(baseSnapshot.snapshotId()); -rowDelta.commit(); -rewriteFiles.commit(); +commit(table, rowDelta, branch); +commit(table, rewriteFiles, branch); -validateTableDeleteFiles(table, deleteFile1); -validateTableFiles(table, dataFile2); +if (branch == "testBranch") {

Review Comment: equals() instead of ==

## core/src/main/java/org/apache/iceberg/SnapshotProducer.java: ## @@ -267,7 +267,26 @@ private Map summary(TableMetadata previous) { } Map previousSummary; -if (previous.currentSnapshot() != null) { +if (!targetBranch.equals(SnapshotRef.MAIN_BRANCH)) { + if (previous.ref(targetBranch) != null) { +if (previous.snapshot(previous.ref(targetBranch).snapshotId()).summary() != null) { + previousSummary = previous.snapshot(previous.ref(targetBranch).snapshotId()).summary(); +} else { + previousSummary = ImmutableMap.of(); +}

Review Comment: Could we move the value of previous.ref(targetBranch) to its own variable for readability?

## core/src/main/java/org/apache/iceberg/SnapshotProducer.java: ## @@ -267,7 +267,26 @@ private Map summary(TableMetadata previous) { } Map previousSummary; -if (previous.currentSnapshot() != null) { +if (!targetBranch.equals(SnapshotRef.MAIN_BRANCH)) { + if (previous.ref(targetBranch) != null) { +if (previous.snapshot(previous.ref(targetBranch).snapshotId()).summary() != null) { + previousSummary = previous.snapshot(previous.ref(targetBranch).snapshotId()).summary(); +} else { + previousSummary = ImmutableMap.of(); +}

Review Comment: Also, I think we could probably simplify all the if/else logic here. The branch will either be main or not, and in both cases it's the same logic: either there is no previous snapshot (default to 0), the previous snapshot has no summary (use an empty summary), or the previous snapshot has a summary (use it).

## core/src/test/java/org/apache/iceberg/TableTestBase.java: ## @@ -458,6 +458,18 @@ void validateTableFiles(Table tbl, DataFile... expectedFiles) { Assert.assertEquals("Files should match", expectedFilePaths, actualFilePaths); } + void validateTableFilesWithRef(Table tbl, String ref, DataFile... expectedFiles) {

Review Comment: validateBranchDataFiles?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #5234: Core, API: BaseRowDelta to branch Impl
amogh-jahagirdar commented on code in PR #5234: URL: https://github.com/apache/iceberg/pull/5234#discussion_r1067482077 ## core/src/test/java/org/apache/iceberg/TestRowDelta.java: ## @@ -1270,11 +1332,16 @@ public void testConcurrentNonConflictingRowDeltaAndRewriteFilesWithSequenceNumbe baseSnapshot.sequenceNumber()) .validateFromSnapshot(baseSnapshot.snapshotId()); -rowDelta.commit(); -rewriteFiles.commit(); +commit(table, rowDelta, branch); +commit(table, rewriteFiles, branch); -validateTableDeleteFiles(table, deleteFile1); -validateTableFiles(table, dataFile2); +if (branch == "testBranch") { Review Comment: equals() instead of == ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
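For reference, a self-contained illustration of why the review asks for `equals()`: `==` compares object references, so two `String` objects with the same characters may not be `==` even though they are `equals()`.

```java
public class StringEqualityExample {
  public static void main(String[] args) {
    // Built at runtime so it is a distinct object from the literal.
    String branch = new StringBuilder("test").append("Branch").toString();

    System.out.println(branch == "testBranch");      // false: different objects
    System.out.println(branch.equals("testBranch")); // true: same characters

    // Constant-first style also avoids a NullPointerException when branch may be null.
    System.out.println("testBranch".equals(branch)); // true
  }
}
```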
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #5234: Core, API: BaseRowDelta to branch Impl
amogh-jahagirdar commented on code in PR #5234: URL: https://github.com/apache/iceberg/pull/5234#discussion_r1067501650 ## core/src/test/java/org/apache/iceberg/TestRowDelta.java: ## @@ -1270,11 +1332,16 @@ public void testConcurrentNonConflictingRowDeltaAndRewriteFilesWithSequenceNumbe baseSnapshot.sequenceNumber()) .validateFromSnapshot(baseSnapshot.snapshotId()); -rowDelta.commit(); -rewriteFiles.commit(); +commit(table, rowDelta, branch); +commit(table, rewriteFiles, branch); -validateTableDeleteFiles(table, deleteFile1); -validateTableFiles(table, dataFile2); +if (branch == "testBranch") { Review Comment: Also just realized we don't need to differentiate the two cases here, I think we could just have a single method which takes in a branch (and the branch could just be main or not, it doesn't really matter) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #5234: Core, API: BaseRowDelta to branch Impl
amogh-jahagirdar commented on code in PR #5234: URL: https://github.com/apache/iceberg/pull/5234#discussion_r1067520428 ## core/src/test/java/org/apache/iceberg/TestRowDelta.java: ## @@ -1449,18 +1519,48 @@ public void testRowDeltaCaseSensitivity() { } @Test - public void testRowDeltaToBranchUnsupported() { + public void testBranchValidationsNotValidAncestor() { +table.newAppend().appendFile(FILE_A).commit(); +table.manageSnapshots().createBranch("branch", table.currentSnapshot().snapshotId()).commit(); +table.newAppend().appendFile(FILE_B).commit(); + +// This commit will result in validation exception as we start validation from a snapshot which +// is not an ancestor of the branch +RowDelta rowDelta = +table +.newRowDelta() +.toBranch("branch") +.addDeletes(FILE_A_DELETES) +.validateFromSnapshot(table.currentSnapshot().snapshotId()) +.conflictDetectionFilter(Expressions.alwaysTrue()) +.validateNoConflictingDeleteFiles(); + AssertHelpers.assertThrows( -"Should reject committing row delta to branch", -UnsupportedOperationException.class, -"Cannot commit to branch someBranch: org.apache.iceberg.BaseRowDelta does not support branch commits", -() -> -table -.newRowDelta() -.caseSensitive(false) -.addRows(FILE_B) -.addDeletes(FILE_A2_DELETES) -.toBranch("someBranch") -.commit()); +"Snapshot 2 is not an ancestor of 1", +IllegalArgumentException.class, +() -> rowDelta.commit()); + } + + @Test + public void testBranchValidationsValidAncestor() { +table.newAppend().appendFile(FILE_A).commit(); +Long ancestorSnapshot = table.currentSnapshot().snapshotId(); +table.manageSnapshots().createBranch("branch", ancestorSnapshot).commit(); + +// This commit not result in validation exception as we start validation from a snapshot which +// is an actual ancestor of the branch +table +.newRowDelta() +.toBranch("branch") +.addDeletes(FILE_A_DELETES) +.validateFromSnapshot(ancestorSnapshot) +.conflictDetectionFilter(Expressions.alwaysTrue()) +.validateNoConflictingDeleteFiles() +.commit(); + +int branchSnapshot = 2; + +Assert.assertEquals(table.currentSnapshot().snapshotId(), 1); +Assert.assertEquals(table.ops().current().ref("branch").snapshotId(), branchSnapshot); Review Comment: Sorry not getting why we need this test case, won't it be covered by the others? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests
rdblue commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067543958 ## python/pyiceberg/avro/reader.py: ## @@ -252,30 +252,33 @@ def skip(self, decoder: BinaryDecoder) -> None: class StructReader(Reader): field_readers: Tuple[Tuple[Optional[int], Reader], ...] create_struct: Type[StructProtocol] -fields: Optional[Tuple[NestedField, ...]] +struct: Optional[StructType] def __init__( self, field_readers: Tuple[Tuple[Optional[int], Reader], ...], create_struct: Optional[Type[StructProtocol]] = None, -fields: Optional[Tuple[NestedField, ...]] = None, +struct: Optional[StructType] = None, ): self.field_readers = field_readers self.create_struct = create_struct or Record -self.fields = fields +self.struct = struct def read(self, decoder: BinaryDecoder) -> Any: -if issubclass(self.create_struct, Record): -struct = self.create_struct(length=len(self.field_readers), fields=self.fields) -elif issubclass(self.create_struct, PydanticStruct): +if issubclass(self.create_struct, PydanticStruct): struct = self.create_struct.construct() else: -raise ValueError(f"Expected a subclass of PydanticStruct, got: {self.create_struct}") +struct = self.create_struct() + +if not isinstance(struct, StructProtocol): +raise ValueError(f"Expected struct to implement StructProtocol: {struct}") + +if self.struct: +struct.set_record_schema(self.struct) Review Comment: Do we want to do an `isinstance` check here eventually? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] namrathamyske commented on a diff in pull request #5234: Core, API: BaseRowDelta to branch Impl
namrathamyske commented on code in PR #5234: URL: https://github.com/apache/iceberg/pull/5234#discussion_r1067542176 ## core/src/test/java/org/apache/iceberg/TestRowDelta.java: ## @@ -1449,18 +1519,48 @@ public void testRowDeltaCaseSensitivity() { } @Test - public void testRowDeltaToBranchUnsupported() { + public void testBranchValidationsNotValidAncestor() { +table.newAppend().appendFile(FILE_A).commit(); +table.manageSnapshots().createBranch("branch", table.currentSnapshot().snapshotId()).commit(); +table.newAppend().appendFile(FILE_B).commit(); + +// This commit will result in validation exception as we start validation from a snapshot which +// is not an ancestor of the branch +RowDelta rowDelta = +table +.newRowDelta() +.toBranch("branch") +.addDeletes(FILE_A_DELETES) +.validateFromSnapshot(table.currentSnapshot().snapshotId()) +.conflictDetectionFilter(Expressions.alwaysTrue()) +.validateNoConflictingDeleteFiles(); + AssertHelpers.assertThrows( -"Should reject committing row delta to branch", -UnsupportedOperationException.class, -"Cannot commit to branch someBranch: org.apache.iceberg.BaseRowDelta does not support branch commits", -() -> -table -.newRowDelta() -.caseSensitive(false) -.addRows(FILE_B) -.addDeletes(FILE_A2_DELETES) -.toBranch("someBranch") -.commit()); +"Snapshot 2 is not an ancestor of 1", +IllegalArgumentException.class, +() -> rowDelta.commit()); + } + + @Test + public void testBranchValidationsValidAncestor() { +table.newAppend().appendFile(FILE_A).commit(); +Long ancestorSnapshot = table.currentSnapshot().snapshotId(); +table.manageSnapshots().createBranch("branch", ancestorSnapshot).commit(); + +// This commit not result in validation exception as we start validation from a snapshot which +// is an actual ancestor of the branch +table +.newRowDelta() +.toBranch("branch") +.addDeletes(FILE_A_DELETES) +.validateFromSnapshot(ancestorSnapshot) +.conflictDetectionFilter(Expressions.alwaysTrue()) +.validateNoConflictingDeleteFiles() +.commit(); + +int branchSnapshot = 2; + +Assert.assertEquals(table.currentSnapshot().snapshotId(), 1); +Assert.assertEquals(table.ops().current().ref("branch").snapshotId(), branchSnapshot); Review Comment: Cool, will be removing these. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests
rdblue commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067556636 ## python/pyiceberg/typedef.py: ## @@ -126,30 +135,41 @@ def json(self, exclude_none: bool = True, exclude: Optional[Set[str]] = None, by class PydanticStruct(IcebergBaseModel): +_position_to_field_name: Dict[int, str] = PrivateAttr() +_field_name_to_pydantic_field: Dict[str, Field] = PrivateAttr() + class Config: frozen = False -def __setitem__(self, pos: int, value: Any) -> None: -positions = list(self.__fields__.values()) -field = positions[pos] -if value is None: -if field.default is not None: -value = field.default -elif field.default_factory is not None: -value = field.default_factory() +@staticmethod +def _get_default_field_value(field: Field) -> Optional[Any]: +if field.default is not None: +return field.default +elif field.default_factory is not None: +return field.default_factory() +else: +return None + +def set_record_schema(self, record_schema: StructType) -> None: +self._field_name_to_pydantic_field = {field.name: field for field in self.__fields__.values()} +self._position_to_field_name = {idx: field.name for idx, field in enumerate(record_schema.fields)} +for name, field in self.__fields__.items(): +setattr(self, name, PydanticStruct._get_default_field_value(field)) -self.__setattr__(field.name, value) +def __setitem__(self, pos: int, value: Any) -> None: +field_name = self._position_to_field_name[pos] +# Check if the field exists +if field := self._field_name_to_pydantic_field.get(field_name): +self.__setattr__(field.name, value if value is not None else PydanticStruct._get_default_field_value(field)) Review Comment: I think that this is incorrect because it assumes that `None` should result in the default value. The file may explicitly contain a `null` value, in which case we do not want to default it. For now, I think it is best if we do not add default values from Pydantic. We can add that later, but we need to be careful and add them in the readers, not in the setter. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
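One way to keep the distinction the review draws explicit is an "unset" sentinel, so an explicit null read from the file is never confused with a field the reader simply never wrote. A sketch, not pyiceberg code:

```python
from typing import Any

_UNSET = object()  # sentinel: "the reader never set this position"


def resolve(value: Any, default: Any) -> Any:
    # An explicit None read from the file stays None; only a genuinely
    # unset field falls back to the default.
    return default if value is _UNSET else value


assert resolve(None, 42) is None  # null in the file is preserved
assert resolve(_UNSET, 42) == 42  # unset field gets the default
```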
[GitHub] [iceberg] stevenzwu commented on a diff in pull request #6382: Implement ShuffleOperator to collect data statistics
stevenzwu commented on code in PR #6382: URL: https://github.com/apache/iceberg/pull/6382#discussion_r1067557305 ## flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatistics.java: ## @@ -0,0 +1,34 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.sink.shuffle; + +/** + * DataStatistics defines the interface to collect data statistics. + * + * Data statistics tracks traffic volume distribution across data keys. For low-cardinality key, + * a simple map of (key, count) can be used. For high-cardinality key, probabilistic data structures + * (sketching) can be used. + */ +interface DataStatistics { + long size(); Review Comment: let's add some Javadoc for the methods. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
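As a sketch of what that might look like (the wording below is illustrative, not the PR's actual documentation):

```java
interface DataStatistics {
  /**
   * Returns the number of key occurrences recorded by this statistics
   * object so far. (Illustrative Javadoc; the real contract is up to
   * the PR author.)
   */
  long size();
}
```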
[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests
rdblue commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067557776 ## python/pyiceberg/typedef.py: ## @@ -126,30 +135,41 @@ def json(self, exclude_none: bool = True, exclude: Optional[Set[str]] = None, by class PydanticStruct(IcebergBaseModel): +_position_to_field_name: Dict[int, str] = PrivateAttr() +_field_name_to_pydantic_field: Dict[str, Field] = PrivateAttr() + class Config: frozen = False -def __setitem__(self, pos: int, value: Any) -> None: -positions = list(self.__fields__.values()) -field = positions[pos] -if value is None: -if field.default is not None: -value = field.default -elif field.default_factory is not None: -value = field.default_factory() +@staticmethod +def _get_default_field_value(field: Field) -> Optional[Any]: +if field.default is not None: +return field.default +elif field.default_factory is not None: +return field.default_factory() +else: +return None + +def set_record_schema(self, record_schema: StructType) -> None: +self._field_name_to_pydantic_field = {field.name: field for field in self.__fields__.values()} +self._position_to_field_name = {idx: field.name for idx, field in enumerate(record_schema.fields)} +for name, field in self.__fields__.items(): +setattr(self, name, PydanticStruct._get_default_field_value(field)) Review Comment: I'm okay with adding defaults here, although I think that fields that are used should always be set through `__setitem__` so it shouldn't be necessary. That's why I'd prefer not creating attrs for these and simply raising an exception if an unset field is referenced. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests
rdblue commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067558837 ## python/pyiceberg/typedef.py: ## @@ -126,30 +135,41 @@ def json(self, exclude_none: bool = True, exclude: Optional[Set[str]] = None, by class PydanticStruct(IcebergBaseModel): +_position_to_field_name: Dict[int, str] = PrivateAttr() +_field_name_to_pydantic_field: Dict[str, Field] = PrivateAttr() + class Config: frozen = False -def __setitem__(self, pos: int, value: Any) -> None: -positions = list(self.__fields__.values()) -field = positions[pos] -if value is None: -if field.default is not None: -value = field.default -elif field.default_factory is not None: -value = field.default_factory() +@staticmethod +def _get_default_field_value(field: Field) -> Optional[Any]: +if field.default is not None: +return field.default +elif field.default_factory is not None: +return field.default_factory() +else: +return None + +def set_record_schema(self, record_schema: StructType) -> None: +self._field_name_to_pydantic_field = {field.name: field for field in self.__fields__.values()} +self._position_to_field_name = {idx: field.name for idx, field in enumerate(record_schema.fields)} +for name, field in self.__fields__.items(): +setattr(self, name, PydanticStruct._get_default_field_value(field)) -self.__setattr__(field.name, value) +def __setitem__(self, pos: int, value: Any) -> None: +field_name = self._position_to_field_name[pos] Review Comment: Since it isn't safe to set defaults in this method, I think it can be simpler: ```python def __setitem__(self, pos: int, value: Any) -> None: self.__setattr__(self._position_to_field_name[pos], value) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests
rdblue commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067559792 ## python/pyiceberg/typedef.py: ## @@ -126,30 +135,41 @@ def json(self, exclude_none: bool = True, exclude: Optional[Set[str]] = None, by class PydanticStruct(IcebergBaseModel): Review Comment: If we can't use Pydantic for defaults, is there still value in using it as the basis for our structs? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
issues@iceberg.apache.org
rdblue commented on code in PR #6324: URL: https://github.com/apache/iceberg/pull/6324#discussion_r1067562662 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveHadoopUtil.java: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.io.IOException; +import org.apache.hadoop.security.UserGroupInformation; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class HiveHadoopUtil { + + private static final Logger LOG = LoggerFactory.getLogger(HiveHadoopUtil.class); + + private HiveHadoopUtil() {} + + public static String currentUser() { +try { + return UserGroupInformation.getCurrentUser().getUserName(); +} catch (IOException e) { + LOG.warn("Fail to obtain Hadoop UGI user", e);

Review Comment: Minor: to fit with conventions, I think the error message should be `Failed to get Hadoop user`.
* Error messages use past tense to describe what was attempted
* "Obtain" is no more specific than "get", so prefer the simpler word
* `UGI` is an internal Hadoop term that is more confusing than helpful

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
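Applying all three conventions, the catch block might read as follows. This is a sketch: the class mirrors the PR's `HiveHadoopUtil`, but the `return null` fallback is assumed, since the quoted diff is truncated at the log statement.

```java
import java.io.IOException;
import org.apache.hadoop.security.UserGroupInformation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class HiveHadoopUtilSketch {
  private static final Logger LOG = LoggerFactory.getLogger(HiveHadoopUtilSketch.class);

  private HiveHadoopUtilSketch() {}

  public static String currentUser() {
    try {
      return UserGroupInformation.getCurrentUser().getUserName();
    } catch (IOException e) {
      // Past tense ("Failed"), the simpler verb ("get"), no "UGI" jargon.
      LOG.warn("Failed to get Hadoop user", e);
      return null; // fallback behavior is assumed; the diff is truncated here
    }
  }
}
```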
[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on code in PR #6012: URL: https://github.com/apache/iceberg/pull/6012#discussion_r1067563712 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/GenerateChangesProcedure.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import java.sql.Timestamp; +import java.util.Arrays; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; +import org.apache.iceberg.MetadataColumns; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.spark.ChangelogIterator; +import org.apache.iceberg.spark.SparkReadOptions; +import org.apache.iceberg.spark.source.SparkChangelogTable; +import org.apache.spark.api.java.function.MapPartitionsFunction; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.encoders.RowEncoder; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import scala.runtime.BoxedUnit; + +/** + * A procedure that creates a view for changed rows. + * + * The procedure computes update-rows and removes the carry-over rows by default. You can disable + * them through parameters to get better performance. + * + * Carry-over rows are the result of a removal and insertion of the same row within an operation + * because of the copy-on-write mechanism. For example, given a file which contains row1 (id=1, + * data='a') and row2 (id=2, data='b'). A copy-on-write delete of row2 would require erasing this + * file and preserving row1 in a new file. The change-log table would report this as (id=1, + * data='a', op='DELETE') and (id=1, data='a', op='INSERT'), despite it not being an actual change + * to the table. The iterator finds the carry-over rows and removes them from the result. + * + * An update-row is converted from a pair of delete row and insert row. Identifier columns are + * needed for identifying whether they refer to the same row. You can either set Identifier Field + * IDs as the table properties or input them as the procedure parameters. Here is an example of + * update-row with an identifier column(id). 
A pair of delete row and insert row with the same id: + * + * + * (id=1, data='a', op='DELETE') + * (id=1, data='b', op='INSERT') + * + * + * will be marked as update-rows: + * + * + * (id=1, data='a', op='UPDATE_BEFORE') + * (id=1, data='b', op='UPDATE_AFTER') + * + */ +public class GenerateChangesProcedure extends BaseProcedure { + private static final Logger LOG = LoggerFactory.getLogger(GenerateChangesProcedure.class); + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { +ProcedureParameter.required("table", DataTypes.StringType), +ProcedureParameter.optional("table_change_view", DataTypes.StringType), +ProcedureParameter.optional("options", STRING_MAP), +ProcedureParameter.optional("compute_updates", DataTypes.BooleanType), +ProcedureParameter.optional("remove_carryovers", DataTypes.BooleanType), +ProcedureParameter.optional("identifier_columns", DataTypes.StringType), + }; + + private static final StructType OUTPUT_TYPE = + new StructType( + new StructField[] { +new StructField("view_name", DataTypes.StringType, false, Metadata.empty()) + }); + + public static SparkProcedures.ProcedureBuilder builder() { +return new BaseProcedure.Builder() { + @Override + protected GenerateChangesProcedure doBuild() { +return new GenerateChangesProcedure(tableCatalog()); + } +
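To make the carry-over idea in the Javadoc above concrete, a small self-contained model (not the procedure's actual implementation): rows deleted and re-inserted unchanged within one operation cancel out, and only net changes remain.

```java
import java.util.ArrayList;
import java.util.List;

public class CarryOverExample {
  // Simplified model: rows are plain strings; a DELETE and an INSERT of
  // identical content within one operation is a carry-over and is dropped.
  static List<String> netChanges(List<String> deleted, List<String> inserted) {
    List<String> remainingInserts = new ArrayList<>(inserted);
    List<String> changes = new ArrayList<>();
    for (String row : deleted) {
      if (!remainingInserts.remove(row)) {
        changes.add("DELETE " + row);
      }
    }
    for (String row : remainingInserts) {
      changes.add("INSERT " + row);
    }
    return changes;
  }

  public static void main(String[] args) {
    // Copy-on-write delete of row2 rewrites the file, re-inserting row1.
    List<String> deleted = List.of("(id=1, data='a')", "(id=2, data='b')");
    List<String> inserted = List.of("(id=1, data='a')");
    // Prints [DELETE (id=2, data='b')]; the row1 carry-over is removed.
    System.out.println(netChanges(deleted, inserted));
  }
}
```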
issues@iceberg.apache.org
rdblue commented on code in PR #6324: URL: https://github.com/apache/iceberg/pull/6324#discussion_r1067564415 ## hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java: ## @@ -247,17 +249,28 @@ public void testReplaceTxnBuilder() throws Exception { @Test public void testCreateTableWithOwner() throws Exception { +createTableAndVerifyOwner( Review Comment: Why are these mixed into a single test method rather than two test methods? I think this makes it harder to follow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests
rdblue commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067581319 ## python/tests/test_typedef.py: ## @@ -46,7 +51,8 @@ def test_record_repr() -> None: def test_named_record() -> None: -r = Record(fields=(NestedField(0, "id", IntegerType()), NestedField(1, "name", StringType( +r = Record() +r.set_record_schema(StructType(NestedField(0, "id", IntegerType()), NestedField(1, "name", StringType( Review Comment: I think I prefer passing the schema into the constructor. We may want to keep doing that, at least for the generic record. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
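A minimal sketch of constructor injection for a generic record (names illustrative, not pyiceberg's actual classes): the schema fixes the position-to-name mapping once, at construction time, so it can never be swapped mid-read.

```python
from typing import Any, List, Optional


class GenericRecord:
    """Illustrative generic record; takes its field names when constructed."""

    def __init__(self, field_names: Optional[List[str]] = None) -> None:
        self._names = list(field_names or [])
        self._values: List[Any] = [None] * len(self._names)

    def __setitem__(self, pos: int, value: Any) -> None:
        self._values[pos] = value

    def __getitem__(self, pos: int) -> Any:
        return self._values[pos]

    def get(self, name: str) -> Any:
        return self._values[self._names.index(name)]


record = GenericRecord(field_names=["id", "name"])
record[0] = 1
record[1] = "iceberg"
assert record.get("name") == "iceberg"
```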
[GitHub] [iceberg] github-actions[bot] commented on issue #5173: Add Flink test for Parquet bloom filter
github-actions[bot] commented on issue #5173: URL: https://github.com/apache/iceberg/issues/5173#issuecomment-1379643434 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] github-actions[bot] commented on issue #5174: MERGE INTO TABLE is not supported temporarily on Spark3.2.0, Scala2.12, Iceberg0.13.1
github-actions[bot] commented on issue #5174: URL: https://github.com/apache/iceberg/issues/5174#issuecomment-1379643412 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] github-actions[bot] closed issue #5173: Add Flink test for Parquet bloom filter
github-actions[bot] closed issue #5173: Add Flink test for Parquet bloom filter URL: https://github.com/apache/iceberg/issues/5173 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests
rdblue commented on code in PR #6525: URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067584072 ## python/pyiceberg/typedef.py: ## @@ -79,6 +84,10 @@ def __missing__(self, key: K) -> V: class StructProtocol(Protocol): # pragma: no cover """A generic protocol used by accessors to get and set at positions of an object""" +@abstractmethod +def set_record_schema(self, record_schema: StructType) -> None: Review Comment: One thing that I didn't think about when we were discussing where to put this method is that this can be called multiple times if we don't pass the schema to the constructor, allowing callers to change the record schema. We assume that it will only be set once in the readers, but this makes the API quite strange. I can read with one schema, then set a different schema, and then the behavior is unclear. Because of that, I'd prefer to move this method to our `PydanticStruct` class instead of putting it here. That way we control all of the implementations and it doesn't matter. And we will have the ability to change it later if needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
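If the setter does stay on an implementation class, one way to remove the ambiguity is to make the call-once contract explicit. A sketch, under the assumption that re-setting the schema should be an error:

```python
from typing import Any


class SchemaOnceMixin:
    """Illustrative guard: the record schema may be assigned only once."""

    _record_schema: Any = None

    def set_record_schema(self, record_schema: Any) -> None:
        if self._record_schema is not None:
            raise ValueError("Record schema may only be set once")
        self._record_schema = record_schema
```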
[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on code in PR #6012: URL: https://github.com/apache/iceberg/pull/6012#discussion_r1067602297 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/GenerateChangesProcedure.java: ## @@ -0,0 +1,271 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.procedures; + +import java.util.Arrays; +import java.util.UUID; +import org.apache.iceberg.ChangelogOperation; +import org.apache.iceberg.MetadataColumns; +import org.apache.iceberg.Snapshot; +import org.apache.iceberg.Table; +import org.apache.iceberg.spark.SparkReadOptions; +import org.apache.iceberg.spark.source.SparkChangelogTable; +import org.apache.iceberg.util.DateTimeUtil; +import org.apache.iceberg.util.SnapshotUtil; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.apache.spark.sql.connector.catalog.TableCatalog; +import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter; +import org.apache.spark.sql.expressions.Window; +import org.apache.spark.sql.functions; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.unsafe.types.UTF8String; +import org.jetbrains.annotations.NotNull; + +public class GenerateChangesProcedure extends BaseProcedure { + + public static SparkProcedures.ProcedureBuilder builder() { +return new BaseProcedure.Builder() { + @Override + protected GenerateChangesProcedure doBuild() { +return new GenerateChangesProcedure(tableCatalog()); + } +}; + } + + private static final ProcedureParameter[] PARAMETERS = + new ProcedureParameter[] { +ProcedureParameter.required("table", DataTypes.StringType), +// the snapshot ids input are ignored when the start/end timestamps are provided +ProcedureParameter.optional("start_snapshot_id_exclusive", DataTypes.LongType), +ProcedureParameter.optional("end_snapshot_id_inclusive", DataTypes.LongType), +ProcedureParameter.optional("table_change_view", DataTypes.StringType), +ProcedureParameter.optional("identifier_columns", DataTypes.StringType), Review Comment: Made the change per suggestion -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] flyrain commented on pull request #6012: Spark 3.3: Add a procedure to generate table changes
flyrain commented on PR #6012: URL: https://github.com/apache/iceberg/pull/6012#issuecomment-1379673208 Ready for another look. cc @RussellSpitzer @szehon-ho @aokolnychyi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] hililiwei commented on a diff in pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg
hililiwei commented on code in PR #6557: URL: https://github.com/apache/iceberg/pull/6557#discussion_r1067664873 ## flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/DataGenerators.java: ## @@ -70,6 +78,12 @@ public static class Primitives implements DataGenerator { private static final BigDecimal BIG_DECIMAL_NEGATIVE = new BigDecimal("-1.50"); private static final byte[] FIXED_BYTES = "012345689012345".getBytes(StandardCharsets.UTF_8); +private static org.apache.avro.Schema addAdjustToUtc( Review Comment: Do we need it? It doesn't seem to have been used. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] stevenzwu commented on a diff in pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg
stevenzwu commented on code in PR #6557: URL: https://github.com/apache/iceberg/pull/6557#discussion_r1067710226 ## flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/DataGenerators.java: ## @@ -70,6 +78,12 @@ public static class Primitives implements DataGenerator { private static final BigDecimal BIG_DECIMAL_NEGATIVE = new BigDecimal("-1.50"); private static final byte[] FIXED_BYTES = "012345689012345".getBytes(StandardCharsets.UTF_8); +private static org.apache.avro.Schema addAdjustToUtc( Review Comment: good catch. this is not needed anymore. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] stevenzwu commented on a diff in pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg
stevenzwu commented on code in PR #6557: URL: https://github.com/apache/iceberg/pull/6557#discussion_r1067710226 ## flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/DataGenerators.java: ## @@ -70,6 +78,12 @@ public static class Primitives implements DataGenerator { private static final BigDecimal BIG_DECIMAL_NEGATIVE = new BigDecimal("-1.50"); private static final byte[] FIXED_BYTES = "012345689012345".getBytes(StandardCharsets.UTF_8); +private static org.apache.avro.Schema addAdjustToUtc( Review Comment: good catch. this is not needed anymore. I forgot to delete the obsoleted code after some refactoring. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] pvary commented on issue #6370: What is the purpose of Hive Lock ?
pvary commented on issue #6370: URL: https://github.com/apache/iceberg/issues/6370#issuecomment-1379883500 With apache/hive#3888 we can implement a solution that handles failures the same way as the current one, without using locks, relying instead on the `alter_table` call itself to fail when there is a concurrent change (see the sketch below).
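For reference, apache/hive#3888 adds a transactional expected-parameter check to `alter_table` in the Hive Metastore. A minimal sketch of how a lock-free commit could lean on that follows; the expected-parameter property names and the helper shape are assumptions, not the final Iceberg implementation.

import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.EnvironmentContext;
import org.apache.hadoop.hive.metastore.api.Table;

class LockFreeCommitSketch {
  // Point the table at the new metadata file and ask the metastore to reject
  // the alter_table if a concurrent commit already moved metadata_location.
  void commit(IMetaStoreClient client, Table table,
      String baseMetadataLocation, String newMetadataLocation) throws Exception {
    table.getParameters().put("metadata_location", newMetadataLocation);
    table.getParameters().put("previous_metadata_location", baseMetadataLocation);

    EnvironmentContext context = new EnvironmentContext();
    // Assumed property names: the metastore verifies that metadata_location
    // still equals the base value this commit started from.
    context.putToProperties("expected_parameter_key", "metadata_location");
    context.putToProperties("expected_parameter_value", baseMetadataLocation);

    // Throws if the check fails, which the caller can treat as a retryable
    // commit conflict instead of acquiring a Hive lock up front.
    client.alter_table_with_environmentContext(
        table.getDbName(), table.getTableName(), table, context);
  }
}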
[GitHub] [iceberg] JonasJ-ap commented on pull request #6449: Delta: Support Snapshot Delta Lake Table to Iceberg Table
JonasJ-ap commented on PR #6449: URL: https://github.com/apache/iceberg/pull/6449#issuecomment-1379900129 > Also, because the Spark-based tests are under `integrationTest` task, I think we need to create a new CI task to run the tests, otherwise it won't automatically run? I've added a new CI workflow called Delta Conversion CI (`.github/workflows/delta-conversion-ci.yml`) and upgraded the tests to run against a DataFrame with nested fields.
[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #6449: Delta: Support Snapshot Delta Lake Table to Iceberg Table
JonasJ-ap commented on code in PR #6449: URL: https://github.com/apache/iceberg/pull/6449#discussion_r106276 ## delta-lake/src/main/java/org/apache/iceberg/delta/BaseSnapshotDeltaLakeTableAction.java: ## @@ -0,0 +1,370 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.delta;
+
+import io.delta.standalone.DeltaLog;
+import io.delta.standalone.VersionLog;
+import io.delta.standalone.actions.Action;
+import io.delta.standalone.actions.AddFile;
+import io.delta.standalone.actions.RemoveFile;
+import java.io.File;
+import java.net.URI;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.AppendFiles;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DataFiles;
+import org.apache.iceberg.DeleteFiles;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.Metrics;
+import org.apache.iceberg.MetricsConfig;
+import org.apache.iceberg.OverwriteFiles;
+import org.apache.iceberg.PartitionField;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.Transaction;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.mapping.MappingUtil;
+import org.apache.iceberg.mapping.NameMapping;
+import org.apache.iceberg.mapping.NameMappingParser;
+import org.apache.iceberg.orc.OrcMetrics;
+import org.apache.iceberg.parquet.ParquetUtil;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Type;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Takes a Delta Lake table's location and attempts to create an Iceberg table snapshot in an
+ * optional user-specified location (defaulting to the Delta Lake table's location) with a
+ * different identifier.
+ */
+class BaseSnapshotDeltaLakeTableAction implements SnapshotDeltaLakeTable {
+
+  private static final Logger LOG = LoggerFactory.getLogger(BaseSnapshotDeltaLakeTableAction.class);
+
+  private static final String SNAPSHOT_SOURCE_PROP = "snapshot_source";
+  private static final String DELTA_SOURCE_VALUE = "delta";
+  private static final String ORIGINAL_LOCATION_PROP = "original_location";
+  private static final String PARQUET_SUFFIX = ".parquet";
+  private static final String AVRO_SUFFIX = ".avro";
+  private static final String ORC_SUFFIX = ".orc";
+  private final ImmutableMap.Builder<String, String> additionalPropertiesBuilder =
+      ImmutableMap.builder();
+  private DeltaLog deltaLog;
+  private Catalog icebergCatalog;
+  private final String deltaTableLocation;
+  private TableIdentifier newTableIdentifier;
+  private String newTableLocation;
+  private HadoopFileIO deltaLakeFileIO;
+
+  /**
+   * Snapshot a Delta Lake table to be an Iceberg table. The action will read the Delta Lake
+   * table's log through the table's path, create a new Iceberg table using the given
+   * icebergCatalog and newTableIdentifier, and commit all changes in one Iceberg transaction.
+   *
+   * <p>The new table will only be created if the snapshot is successful.
+   *
+   * @param deltaTableLocation the Delta Lake table's path
+   */
+  BaseSnapshotDeltaLakeTableAction(String deltaTableLocation) {
+    this.deltaTableLocation = deltaTableLocation;
+    this.newTableLocation = deltaTableLocation;
+  }
+
+  @Override
+  public SnapshotDeltaLakeTable tableProperties(Map<String, String> properties) {
+    additionalPropertiesBuilder.putAll(properties);
+    return this;
+  }
+
+  @Override
+  public SnapshotDeltaLakeTable tableProperty(String name,
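For readers skimming the diff, here is a minimal usage sketch of the builder-style API this class implements, pieced together from the fields visible above. The `icebergCatalog`, `as`, and `execute` calls (and the `Result` type) are assumptions inferred from the `icebergCatalog`/`newTableIdentifier` fields, not the final published interface.

// Hedged sketch; setter names beyond tableProperty(...) are assumptions.
SnapshotDeltaLakeTable.Result result =
    new BaseSnapshotDeltaLakeTableAction("s3://bucket/path/to/delta-table") // Delta log read from here
        .icebergCatalog(catalog)                          // assumed setter for the target catalog
        .as(TableIdentifier.of("db", "delta_snapshot"))   // assumed setter for the new identifier
        .tableProperty("migrated-by", "delta-snapshot")   // extra table property (shown above)
        .execute();                                       // commits all changes in one transaction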
[GitHub] [iceberg] nastra opened a new pull request, #6569: Spark: Add the query ID to file names
nastra opened a new pull request, #6569: URL: https://github.com/apache/iceberg/pull/6569 Co-authored-by: Ryan Blue
[GitHub] [iceberg] nastra commented on pull request #5214: Spark: Add the query ID to file names
nastra commented on PR #5214: URL: https://github.com/apache/iceberg/pull/5214#issuecomment-1379936898 Closing this one as it's been superseded by https://github.com/apache/iceberg/pull/6569
[GitHub] [iceberg] nastra closed pull request #5214: Spark: Add the query ID to file names
nastra closed pull request #5214: Spark: Add the query ID to file names URL: https://github.com/apache/iceberg/pull/5214