[GitHub] [iceberg] nastra opened a new pull request, #6562: Core: Improvements around Token Refresh time expiration

2023-01-11 Thread GitBox


nastra opened a new pull request, #6562:
URL: https://github.com/apache/iceberg/pull/6562

   Currently depends on changes from 
https://github.com/apache/iceberg/pull/6489, but will be rebased once 
https://github.com/apache/iceberg/pull/6489 is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [iceberg] nastra commented on a diff in pull request #6489: Core: Improve token exchange handling when token expires

2023-01-11 Thread GitBox


nastra commented on code in PR #6489:
URL: https://github.com/apache/iceberg/pull/6489#discussion_r1066849968


##
core/src/main/java/org/apache/iceberg/rest/auth/OAuth2Util.java:
##
@@ -376,5 +404,127 @@ public Pair<Integer, TimeUnit> refresh(RESTClient client) {
 
   return null;
 }
+
+@SuppressWarnings("FutureReturnValueIgnored")
+public static void scheduleTokenRefresh(
+    RESTClient client,
+    ScheduledExecutorService tokenRefreshExecutor,
+    AuthSession session,
+    long startTimeMillis,
+    long expiresIn,
+    TimeUnit unit) {
+  // convert expiration interval to milliseconds
+  long expiresInMillis = unit.toMillis(expiresIn);
+  // how much ahead of time to start the request to allow it to complete
+  long refreshWindowMillis = Math.min(expiresInMillis / 10, MAX_REFRESH_WINDOW_MILLIS);
+  // how much time to wait before expiration
+  long waitIntervalMillis = expiresInMillis - refreshWindowMillis;
+  // how much time has already elapsed since the new token was issued
+  long elapsedMillis = System.currentTimeMillis() - startTimeMillis;
+  // how much time to actually wait
+  long timeToWait = Math.max(waitIntervalMillis - elapsedMillis, MIN_REFRESH_WAIT_MILLIS);
+
+  tokenRefreshExecutor.schedule(
+      () -> {
+        long refreshStartTime = System.currentTimeMillis();
+        Pair<Integer, TimeUnit> expiration = session.refresh(client);
+        if (expiration != null) {
+          scheduleTokenRefresh(
+              client,
+              tokenRefreshExecutor,
+              session,
+              refreshStartTime,
+              expiration.first(),
+              expiration.second());
+        }
+      },
+      timeToWait,
+      TimeUnit.MILLISECONDS);
+}
+
+public static AuthSession sessionFromToken(
+    RESTClient client,
+    ScheduledExecutorService tokenRefreshExecutor,
+    String token,
+    @Nullable String credential,
+    Long defaultExpirationMillis,
+    AuthSession parent) {
+  Optional<JWT> jwt = JWT.of(token);
+
+  if (jwt.isPresent() && jwt.get().isExpired()) {
+    Preconditions.checkState(
+        null != credential, "Credential is required to refresh expired token.");
+
+    // we add the credential to the Authorization header and perform a token exchange to
+    // refresh the expired token
+    AuthSession session = new AuthSession(OAuth2Util.basicAuthHeader(credential), null, null);
+
+    return AuthSession.sessionFromTokenExchange(
+        client,
+        tokenRefreshExecutor,
+        token,
+        OAuth2Properties.ACCESS_TOKEN_TYPE,
+        session,
+        OAuth2Properties.CATALOG_SCOPE);
+  } else {
+    AuthSession session =
+        new AuthSession(parent.headers(), token, OAuth2Properties.ACCESS_TOKEN_TYPE);
+    Long expiresInMillis = jwt.map(JWT::expiresInMillis).orElse(defaultExpirationMillis);
+
+    if (null != expiresInMillis) {
+      scheduleTokenRefresh(
+          client,
+          tokenRefreshExecutor,
+          session,
+          System.currentTimeMillis(),
+          expiresInMillis,

Review Comment:
   I've refactored this part of the code in a separate PR 
(https://github.com/apache/iceberg/pull/6562/commits/7c4655ea6d5390301106c2f5f2d51dfc63adf779)
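The wait-time arithmetic in the quoted `scheduleTokenRefresh` snippet can be checked with a small Python sketch. The window and minimum-wait constants below are illustrative defaults, not the values Iceberg actually uses:

```python
def time_to_wait_millis(expires_in_millis, elapsed_millis,
                        max_refresh_window_millis=300_000,
                        min_refresh_wait_millis=10):
    # start the refresh ahead of expiration so the request can complete
    refresh_window = min(expires_in_millis // 10, max_refresh_window_millis)
    # time from token issuance until the refresh should start
    wait_interval = expires_in_millis - refresh_window
    # subtract time already elapsed, but never wait less than the minimum
    return max(wait_interval - elapsed_millis, min_refresh_wait_millis)
```

For a token valid 60 s with no time elapsed, this waits 54 s; if 59 s have already passed, it clamps to the minimum wait so the refresh still fires.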






[GitHub] [iceberg] Fokko commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-11 Thread GitBox


Fokko commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1066862697


##
python/pyiceberg/avro/resolver.py:
##
@@ -109,38 +109,46 @@ def resolve(
 
 
 class SchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
-    read_types: Optional[Dict[int, Callable[[Schema], StructProtocol]]]
+    read_types: Dict[int, Type[StructProtocol]]
+    context: List[int]
 
-    def __init__(self, read_types: Optional[Dict[int, Callable[[Schema], StructProtocol]]]):
+    def __init__(self, read_types: Dict[int, Type[StructProtocol]] = EMPTY_DICT) -> None:
         self.read_types = read_types
+        self.context = []
 
     def schema(self, schema: Schema, expected_schema: Optional[IcebergType], result: Reader) -> Reader:
         return result
 
+    def before_field(self, field: NestedField, field_partner: Optional[NestedField]) -> None:
+        self.context.append(field.field_id)
+
+    def after_field(self, field: NestedField, field_partner: Optional[NestedField]) -> None:
+        self.context.pop()
+
     def struct(self, struct: StructType, expected_struct: Optional[IcebergType], field_readers: List[Reader]) -> Reader:
+        # -1 indicates the struct root
+        read_struct_id = self.context[-1] if len(self.context) > 0 else -1
+        struct_callable = self.read_types.get(read_struct_id, Record)

Review Comment:
   Got it. Currently, there is no record schema for the typed classes 
(`ManifestEntry`, `ManifestFile`, etc). I would be hesitant to add yet another 
schema there. Are there any cases where we want to have the Record schema and 
the Read schema differ? Ideally, you want to shape the read schema to the data 
that you need, so you don't read anything that's not being used.
   
   > It's also better to pass the read schema because we can make generic 
records more friendly.
   
   This is something that we do for the `Record` class:
   
   ```python
   def read(self, decoder: BinaryDecoder) -> Any:
       if issubclass(self.create_struct, Record):
           struct = self.create_struct(length=len(self.field_readers), fields=self.fields)
       elif issubclass(self.create_struct, PydanticStruct):
           struct = self.create_struct.construct()
       else:
           raise ValueError(f"Expected a subclass of PydanticStruct, got: {self.create_struct}")
   ```
   
   And then we have the attributes available:
   ```python
   def test_named_record() -> None:
       r = Record(fields=(NestedField(0, "id", IntegerType()), NestedField(1, "name", StringType())))
   
       assert r.id is None
       assert r.name is None
   
       r[0] = 123
       r[1] = "abc"
   
       assert r[0] == 123
       assert r[1] == "abc"
   
       assert r.id == 123
       assert r.name == "abc"
   ```
   
   > It looks like we can use inspect.getfullargspec(class.__init__) to inspect 
the init arguments. So we could do something like this:
   
   I'm not a big fan of doing inspections; as in Java, they are quite 
expensive. And it looks like there is a [performance 
issue](https://bugs.python.org/issue37010) in `getfullargspec`. What do you 
think of passing in the fields as we do currently?
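The index-plus-attribute access exercised in `test_named_record` above can be sketched with a minimal stand-in `Record`. This is a simplified illustration, not pyiceberg's actual implementation:

```python
class NestedField:
    # simplified stand-in for pyiceberg's NestedField
    def __init__(self, field_id, name, field_type):
        self.field_id = field_id
        self.name = name
        self.field_type = field_type


class Record:
    def __init__(self, fields):
        # remember each field's name in positional order
        self._names = [f.name for f in fields]
        self._values = [None] * len(fields)

    def __getitem__(self, pos):
        return self._values[pos]

    def __setitem__(self, pos, value):
        self._values[pos] = value

    def __getattr__(self, name):
        # only called when normal attribute lookup fails,
        # so field names resolve to their positional slot
        try:
            return self._values[self._names.index(name)]
        except ValueError:
            raise AttributeError(name)
```

With this, `r[0] = 123` and `r.id` refer to the same slot, matching the behavior the test demonstrates.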
   









[GitHub] [iceberg] Fokko merged pull request #6555: Python: Expression to disjunctive normal form

2023-01-11 Thread GitBox


Fokko merged PR #6555:
URL: https://github.com/apache/iceberg/pull/6555





[GitHub] [iceberg] pvary commented on pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg

2023-01-11 Thread GitBox


pvary commented on PR #6557:
URL: https://github.com/apache/iceberg/pull/6557#issuecomment-1378673916

   @stevenzwu: We had several issues with Avro/RowData conversions. This PR is 
solving one of them - we might not want to solve all of them at once and in one 
PR, but I think we should collect/consider all of them in the long run.
   
   The specific issues:
   - Iceberg shades the Avro version. This makes it inconvenient to use our own 
Avro tools/libraries, even if the two Avro versions are the same. Our current 
solution is to write the Avro schema to a string and parse it with the shaded 
Avro library.
   - Sometimes the Iceberg schema is not exactly the same as the Avro schema:
     - A different order of the columns could cause the built-in Iceberg 
converters to fail
     - One of the schemas could contain extra columns which are not 
needed after the transform
   - We often need transformation in both directions:
     - When writing, Avro to RowData
     - When reading, RowData to Avro
   
   Your PR solves one of the issues above. It might be worth considering 
whether we can easily add a solution for some of the other issues in the same 
PR, or whether we would like to solve those separately.
   
   Thanks,
   Peter
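The string round-trip workaround Peter describes (serialize the schema to text so another copy of the library can re-parse it, rather than passing schema objects across the shading boundary) can be illustrated generically. Plain `json` stands in for the Avro schema parser here; this is a sketch of the idea, not Iceberg's code:

```python
import json

def schema_to_string(schema: dict) -> str:
    # the producing side serializes its schema object to plain text
    return json.dumps(schema)

def parse_schema(text: str) -> dict:
    # the consuming side (e.g. a shaded copy of the library) re-parses it,
    # so no schema *objects* cross the library boundary
    return json.loads(text)
```

Because only a string crosses the boundary, it does not matter that the two sides load the schema class from different (shaded vs. unshaded) packages.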





[GitHub] [iceberg] findepi commented on a diff in pull request #6474: Make it explicit that metrics reporter is required

2023-01-11 Thread GitBox


findepi commented on code in PR #6474:
URL: https://github.com/apache/iceberg/pull/6474#discussion_r1066995039


##
core/src/main/java/org/apache/iceberg/BaseTable.java:
##
@@ -48,6 +49,7 @@ public BaseTable(TableOperations ops, String name) {
   }
 
   public BaseTable(TableOperations ops, String name, MetricsReporter reporter) {
+    Preconditions.checkNotNull(reporter, "reporter cannot be null");

Review Comment:
   > Also we generally use `Preconditions.checkArgument` more than 
`Preconditions.checkNotNull`:
   
   Do we use it commonly to check for nullness?
   
   I think in the code I've contributed so far to Iceberg I used `checkNotNull` 
and don't remember being turned around to use `checkArgument`.






[GitHub] [iceberg] amogh-jahagirdar commented on pull request #6559: Core: View core parser implementations

2023-01-11 Thread GitBox


amogh-jahagirdar commented on PR #6559:
URL: https://github.com/apache/iceberg/pull/6559#issuecomment-1378823387

   Agreed @jackye1995, we can break this down further for easier review. I'll 
raise the version, representation, and history entry PRs separately.





[GitHub] [iceberg] nastra opened a new pull request, #6563: Core: Align commit metric name

2023-01-11 Thread GitBox


nastra opened a new pull request, #6563:
URL: https://github.com/apache/iceberg/pull/6563

   All other commit metric names use "positional" rather than "position"; this 
one is the only exception. The method that retrieves this metric 
(`addedPositionalDeleteFiles()`) already has the correct name.





[GitHub] [iceberg] kmozaid commented on a diff in pull request #6410: Configurable metrics reporter by catalog properties

2023-01-11 Thread GitBox


kmozaid commented on code in PR #6410:
URL: https://github.com/apache/iceberg/pull/6410#discussion_r1067058918


##
core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java:
##
@@ -301,4 +305,16 @@ protected static String fullTableName(String catalogName, TableIdentifier identi
 
     return sb.toString();
   }
+
+  private MetricsReporter metricsReporter() {
+    if (metricsReporter == null) {
+      metricsReporter =
+          properties().containsKey(CatalogProperties.METRICS_REPORTER_IMPL)
+              ? CatalogUtil.loadMetricsReporter(
+                  properties().get(CatalogProperties.METRICS_REPORTER_IMPL))
+              : LoggingMetricsReporter.instance();

Review Comment:
   Hi @szehon-ho, even without this change, `LoggingMetricsReporter` is always 
enabled for all catalogs. See `BaseTable`: 
https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseTable.java#L45
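The lazy reporter lookup in the quoted diff can be sketched in Python. The class and helper names below are hypothetical stand-ins for `CatalogUtil.loadMetricsReporter` and `LoggingMetricsReporter`:

```python
class LoggingReporter:
    # stand-in for LoggingMetricsReporter.instance()
    def report(self, metrics):
        print(metrics)


def load_reporter(impl_name):
    # stand-in for CatalogUtil.loadMetricsReporter's reflective class loading
    raise NotImplementedError(impl_name)


METRICS_REPORTER_IMPL = "metrics-reporter-impl"


class Catalog:
    def __init__(self, properties):
        self._properties = properties
        self._metrics_reporter = None

    def metrics_reporter(self):
        # lazily resolve the reporter once; default to the logging reporter
        if self._metrics_reporter is None:
            impl = self._properties.get(METRICS_REPORTER_IMPL)
            self._metrics_reporter = load_reporter(impl) if impl else LoggingReporter()
        return self._metrics_reporter
```

The point of the lazy check is that the (possibly expensive) reflective load happens at most once per catalog, and only if a custom implementation is configured.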






[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6559: Core: View core parser implementations

2023-01-11 Thread GitBox


amogh-jahagirdar commented on code in PR #6559:
URL: https://github.com/apache/iceberg/pull/6559#discussion_r1067058866


##
core/src/main/java/org/apache/iceberg/view/SQLViewRepresentationParser.java:
##
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.util.Arrays;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.util.JsonUtil;
+
+public class SQLViewRepresentationParser {
+  private enum Field {
+SQL("sql"),

Review Comment:
   Good point; this was just how it was in the original PR, so I left it as is. 
We can just use static fields (same as what we do in `TableMetadataParser`).






[GitHub] [iceberg] Fokko opened a new issue, #6564: Python write support

2023-01-11 Thread GitBox


Fokko opened a new issue, #6564:
URL: https://github.com/apache/iceberg/issues/6564

   ### Feature Request / Improvement
   
   This is a placeholder ticket for implementing write support for PyIceberg.
   
   Since we don't want PyIceberg to write the actual data, and only handle the 
metadata part of the Iceberg table format, we need to get an overview of the 
frameworks that we most likely want to integrate with (PyArrow, Dask 
(fastparquet?), etc).
   
   Missing pieces:
   - Able to efficiently get statistics (upper and lower bound) for the written 
files
   - Able to write Avro files (ManifestList, ManifestEntry)
   - Proper integration tests between Java and Python to make sure that we 
don't brick any tables (started in https://github.com/apache/iceberg/pull/6398)
   
   
   ### Query engine
   
   None





[GitHub] [iceberg] amogh-jahagirdar opened a new pull request, #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


amogh-jahagirdar opened a new pull request, #6565:
URL: https://github.com/apache/iceberg/pull/6565

   Co-authored-by: John Zhuge 
   
   Separating this PR from 
https://github.com/apache/iceberg/pull/6559/files#diff-2a70d3056d3d0cca0da3ff4ddabc83c41c06af2296f281a5b37c5b54ead98915
 for easier review from the community.





[GitHub] [iceberg] ajantha-bhat commented on pull request #6461: Spark-3.3: Store sort-order-id in manifest_entry's data_file

2023-01-11 Thread GitBox


ajantha-bhat commented on PR #6461:
URL: https://github.com/apache/iceberg/pull/6461#issuecomment-1379088280

   @RussellSpitzer: After checking some code, I found that callers have logic 
to find the required ordering (by checking distribution table properties for 
merge/write/update/delete), so I think I can use that.
   
   I did some manual testing and it looks ok to me.
   Please let me know if this change is ok. If yes, I can add all 
combinations of test cases.





[GitHub] [iceberg] rdblue commented on a diff in pull request #6353: Make sure S3 stream opened by ReadConf ctor is closed

2023-01-11 Thread GitBox


rdblue commented on code in PR #6353:
URL: https://github.com/apache/iceberg/pull/6353#discussion_r1067212201


##
parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java:
##
@@ -79,9 +83,11 @@ private ReadConf init() {
   nameMapping,
   reuseContainers,
   caseSensitive,
-  null);
-  this.conf = readConf.copy();
-  return readConf;
+  null)) {
+this.conf = readConf.copy();

Review Comment:
   @islamismailov, on the first call to `init`, the `ReadConf` is copied and 
stored in the `ParquetReader` for future use, but the `ReadConf` that was just 
created is the one that is returned. That way the first `init` call uses the 
same stream for the `FileIterator` and doesn't reopen one.
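The cache-a-copy-but-return-the-original pattern described here can be sketched as follows. The class shapes and the `stream_open` flag are illustrative stand-ins, not Iceberg's actual fields:

```python
class ReadConf:
    def __init__(self):
        # pretend the constructor opened an input stream to the file
        self.stream_open = True

    def copy(self):
        # the cached copy does not keep the open stream
        c = ReadConf.__new__(ReadConf)
        c.stream_open = False
        return c


class ParquetReader:
    def __init__(self):
        self._conf = None

    def init(self):
        if self._conf is None:
            read_conf = ReadConf()
            # cache a stream-less copy for future calls...
            self._conf = read_conf.copy()
            # ...but return the fresh ReadConf so the first iterator
            # reuses the stream that was just opened
            return read_conf
        return self._conf
```

The first `init()` call returns the instance with the open stream; subsequent calls return the cached, stream-less copy, so the original stream is never reopened or leaked into the cache.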






[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


jackye1995 commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067210433


##
core/src/test/resources/ViewHistoryEntry.json:
##
@@ -0,0 +1,4 @@
+{

Review Comment:
   I think the resources are reserved for full JSON files that might be stored 
in Iceberg, like table metadata. For specific objects inside the JSON file, we 
can just use hard-coded strings for testing.






[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


jackye1995 commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067214846


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import java.util.Objects;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+
+public class BaseViewHistoryEntry implements ViewHistoryEntry {
+  private final long timestampMillis;
+  private final int versionId;
+
+  static ViewHistoryEntry of(long timestampMillis, int versionId) {

Review Comment:
   why do we need a static method if it is just delegating to the constructor?






[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


jackye1995 commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067215702


##
core/src/main/java/org/apache/iceberg/view/ViewHistoryEntryParser.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import org.apache.iceberg.util.JsonUtil;
+
+class ViewHistoryEntryParser {
+
+  static final String VERSION_ID = "version-id";
+  static final String TIMESTAMP_MS = "timestamp-ms";
+
+  static void toJson(ViewHistoryEntry entry, JsonGenerator generator) throws IOException {
+    generator.writeStartObject();
+    generator.writeNumberField(TIMESTAMP_MS, entry.timestampMillis());
+    generator.writeNumberField(VERSION_ID, entry.versionId());
+    generator.writeEndObject();
+  }
+
+  static ViewHistoryEntry fromJson(JsonNode node) {
+    return BaseViewHistoryEntry.of(
+        JsonUtil.getLong(TIMESTAMP_MS, node), JsonUtil.getInt(VERSION_ID, node));
+  }
+
+  private ViewHistoryEntryParser() {}

Review Comment:
   nit: constructor should be at top?
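For reference, the JSON shape the quoted parser produces and consumes can be exercised with a quick Python round-trip sketch (using the same field names; this is not the Java implementation):

```python
import json

VERSION_ID = "version-id"
TIMESTAMP_MS = "timestamp-ms"

def entry_to_json(timestamp_millis: int, version_id: int) -> str:
    # mirrors ViewHistoryEntryParser.toJson
    return json.dumps({TIMESTAMP_MS: timestamp_millis, VERSION_ID: version_id})

def entry_from_json(text: str):
    # mirrors ViewHistoryEntryParser.fromJson
    node = json.loads(text)
    return node[TIMESTAMP_MS], node[VERSION_ID]
```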






[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


jackye1995 commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067215702


##
core/src/main/java/org/apache/iceberg/view/ViewHistoryEntryParser.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import org.apache.iceberg.util.JsonUtil;
+
+class ViewHistoryEntryParser {
+
+  static final String VERSION_ID = "version-id";
+  static final String TIMESTAMP_MS = "timestamp-ms";
+
+  static void toJson(ViewHistoryEntry entry, JsonGenerator generator) throws IOException {
+    generator.writeStartObject();
+    generator.writeNumberField(TIMESTAMP_MS, entry.timestampMillis());
+    generator.writeNumberField(VERSION_ID, entry.versionId());
+    generator.writeEndObject();
+  }
+
+  static ViewHistoryEntry fromJson(JsonNode node) {
+    return BaseViewHistoryEntry.of(
+        JsonUtil.getLong(TIMESTAMP_MS, node), JsonUtil.getInt(VERSION_ID, node));
+  }
+
+  private ViewHistoryEntryParser() {}

Review Comment:
   nit: constructor should be at top before the methods?






[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


jackye1995 commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067224129


##
core/src/test/java/org/apache/iceberg/view/ParserTestBase.java:
##
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import org.apache.iceberg.util.JsonUtil;
+import org.apache.iceberg.util.TestJsonUtil;
+import org.junit.Assert;
+import org.junit.Test;
+
+public abstract class ParserTestBase<T> {
+
+  private final T entry;
+  private final String json;
+  private final JsonUtil.JsonWriter<T> writer;
+  private final JsonUtil.JsonReader<T> reader;
+
+  public ParserTestBase(

Review Comment:
   I am wondering if we really should structure the tests like this with a test base. Each object model could have different fields to be tested, and this makes the tests less flexible. Also, we added all these things to JsonUtil, which I am not sure is useful outside this context.






[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


amogh-jahagirdar commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067224185


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import java.util.Objects;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+
+public class BaseViewHistoryEntry implements ViewHistoryEntry {
+  private final long timestampMillis;
+  private final int versionId;
+
+  static ViewHistoryEntry of(long timestampMillis, int versionId) {

Review Comment:
   It's not needed, just a part of the original CR. We can just make the 
constructor public and use that



##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import java.util.Objects;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+
+public class BaseViewHistoryEntry implements ViewHistoryEntry {
+  private final long timestampMillis;
+  private final int versionId;
+
+  static ViewHistoryEntry of(long timestampMillis, int versionId) {

Review Comment:
   It's not needed, just a part of the original PR. We can just make the 
constructor public and use that






[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


jackye1995 commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067226988


##
core/src/main/java/org/apache/iceberg/view/BaseViewHistoryEntry.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import java.util.Objects;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+
+public class BaseViewHistoryEntry implements ViewHistoryEntry {
+  private final long timestampMillis;
+  private final int versionId;
+
+  static ViewHistoryEntry of(long timestampMillis, int versionId) {

Review Comment:
   I see. I can see the constructor being potentially protected, we can keep it 
like that for now, but the static method seems unnecessary






[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


jackye1995 commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067210433


##
core/src/test/resources/ViewHistoryEntry.json:
##
@@ -0,0 +1,4 @@
+{

Review Comment:
   I think the resources are reserved for full files that might be stored in 
Iceberg, like table metadata. For specific objects inside the JSON file we can 
just use hard-coded strings for testing






[GitHub] [iceberg] dmgcodevil commented on issue #6370: What is the purpose of Hive Lock ?

2023-01-11 Thread GitBox


dmgcodevil commented on issue #6370:
URL: https://github.com/apache/iceberg/issues/6370#issuecomment-1379151268

   I still don't understand why we need locks if we have transactions and can implement an `optimistic locking` model.
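For illustration, the optimistic-concurrency commit model being asked about can be sketched as below. The `Catalog`, `commit`, and `commit_with_retry` names are hypothetical stand-ins for this sketch, not Iceberg's actual API: a writer reads a version, prepares new metadata, and commits only if nobody else committed in between, retrying otherwise.

```python
# Hypothetical sketch of optimistic concurrency control (not Iceberg's API).

class ConflictError(Exception):
    pass

class Catalog:
    """Holds a single metadata pointer plus a version counter."""

    def __init__(self):
        self.version = 0
        self.metadata = {}

    def commit(self, base_version, new_metadata):
        # Atomic compare-and-swap: succeed only if nothing was committed
        # since the writer read `base_version`.
        if base_version != self.version:
            raise ConflictError("metadata changed concurrently")
        self.metadata = new_metadata
        self.version += 1
        return self.version

def commit_with_retry(catalog, mutate, attempts=3):
    """Read the current state, apply `mutate`, and retry on conflict."""
    for _ in range(attempts):
        base = catalog.version
        new_metadata = mutate(dict(catalog.metadata))
        try:
            return catalog.commit(base, new_metadata)
        except ConflictError:
            continue  # re-read the latest state and try again
    raise ConflictError("gave up after retries")
```

Under this model no lock is held while writing; conflicts are detected at commit time, which is the trade-off the question is getting at.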





[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


amogh-jahagirdar commented on code in PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#discussion_r1067231261


##
core/src/test/java/org/apache/iceberg/view/ParserTestBase.java:
##
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.view;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import org.apache.iceberg.util.JsonUtil;
+import org.apache.iceberg.util.TestJsonUtil;
+import org.junit.Assert;
+import org.junit.Test;
+
+public abstract class ParserTestBase<T> {
+
+  private final T entry;
+  private final String json;
+  private final JsonUtil.JsonWriter<T> writer;
+  private final JsonUtil.JsonReader<T> reader;
+
+  public ParserTestBase(

Review Comment:
   Agreed, I think what we can do is just follow the existing pattern for 
testing table metadata and just enumerate different toJson/fromJson cases. 
While the benefit of having a parent class is to reduce some duplication, for 
the parsing code it seems better just to have more explicit tests (for example 
like we do with ref parser)






[GitHub] [iceberg] danielcweeks commented on a diff in pull request #6352: AWS: Fix inconsistent behavior of naming S3 location between read and write operations by allowing only s3 bucket name

2023-01-11 Thread GitBox


danielcweeks commented on code in PR #6352:
URL: https://github.com/apache/iceberg/pull/6352#discussion_r1067232103


##
aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java:
##
@@ -74,17 +74,14 @@ class S3URI {
     this.scheme = schemeSplit[0];
 
     String[] authoritySplit = schemeSplit[1].split(PATH_DELIM, 2);
-    ValidationException.check(
-        authoritySplit.length == 2, "Invalid S3 URI, cannot determine bucket: %s", location);
-    ValidationException.check(
-        !authoritySplit[1].trim().isEmpty(), "Invalid S3 URI, path is empty: %s", location);
+
     this.bucket =
         bucketToAccessPointMapping == null
             ? authoritySplit[0]
             : bucketToAccessPointMapping.getOrDefault(authoritySplit[0], authoritySplit[0]);
 
     // Strip query and fragment if they exist
-    String path = authoritySplit[1];
+    String path = authoritySplit.length > 1 ? authoritySplit[1] : "";

Review Comment:
   I think @amogh-jahagirdar meant using java `Optional<String> path()` to differentiate empty from having a path.  I would generally agree, but Iceberg doesn't currently use Optional very much.  I'm inclined to just let this go through as-is.  Seems like an uncommon use case, but since it's easy to support, we should.
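As a sketch of the distinction being discussed — a URI with no path component (`s3://bucket`) versus one with an empty path (`s3://bucket/`) — the snippet below uses `None` the way an Optional-style return would. The `parse_s3_uri` function and its behavior are illustrative only, not the actual `S3URI` implementation:

```python
# Illustrative parser: returns (bucket, path) where path is None when the
# location has no path component at all, and "" when the path is present
# but empty. This is the empty-vs-absent distinction an Optional would model.

def parse_s3_uri(location):
    scheme, _, rest = location.partition("://")
    if not scheme or not rest:
        raise ValueError(f"Invalid S3 URI: {location}")
    bucket, sep, path = rest.partition("/")
    # If no "/" followed the bucket, there is no path component at all.
    return bucket, (path if sep else None)
```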






[GitHub] [iceberg] danielcweeks merged pull request #6352: AWS: Fix inconsistent behavior of naming S3 location between read and write operations by allowing only s3 bucket name

2023-01-11 Thread GitBox


danielcweeks merged PR #6352:
URL: https://github.com/apache/iceberg/pull/6352





[GitHub] [iceberg] stevenzwu commented on pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg

2023-01-11 Thread GitBox


stevenzwu commented on PR #6557:
URL: https://github.com/apache/iceberg/pull/6557#issuecomment-1379226025

   > Iceberg shades the Avro version.
   
   I updated the Flink doc in this PR already and pointed out that the `iceberg-flink-runtime` shaded bundle jar shouldn't be used in this case, as it shades the Avro package. Instead, the non-shaded `iceberg-flink` jar should be used.
   
   > Sometimes the Iceberg schema is not exactly the same as the Avro schema
   
   This probably can be addressed separately using name mapping. Here the assumption is that the output Avro schema should be the same as the Iceberg schema.
   
   > We often need transformation in both directions:
   >  - When writing: Avro to RowData
   >  - When reading: RowData to Avro
   
   This PR only addresses the write part: Avro to RowData. I will create a 
separate PR for the read path.
   
   
   





[GitHub] [iceberg] amogh-jahagirdar commented on pull request #6565: Core: View history entry core implementation

2023-01-11 Thread GitBox


amogh-jahagirdar commented on PR #6565:
URL: https://github.com/apache/iceberg/pull/6565#issuecomment-1379247513

   @jackye1995 I updated the PR so it should be a bit simpler now; I agree that at this point we don't need the abstractions of base test classes. Thanks for the review! @jzhuge @rdblue let me know your thoughts on this!





issues@iceberg.apache.org

2023-01-11 Thread GitBox


haizhou-zhao commented on code in PR #6324:
URL: https://github.com/apache/iceberg/pull/6324#discussion_r1067317621


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveHadoopUtil.java:
##
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import org.apache.hadoop.security.UserGroupInformation;
+
+public class HiveHadoopUtil {
+
+  private HiveHadoopUtil() {}
+
+  public static String getUserName() {

Review Comment:
   changed in the newest commit.






issues@iceberg.apache.org

2023-01-11 Thread GitBox


haizhou-zhao commented on code in PR #6324:
URL: https://github.com/apache/iceberg/pull/6324#discussion_r1067317866


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveHadoopUtil.java:
##
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import org.apache.hadoop.security.UserGroupInformation;
+
+public class HiveHadoopUtil {
+
+  private HiveHadoopUtil() {}
+
+  public static String getUserName() {
+try {
+  return UserGroupInformation.getCurrentUser().getUserName();
+} catch (IOException e) {
+  throw new UncheckedIOException("Fail to obtain Hadoop UGI user", e);

Review Comment:
   Falling back to os username in the newest commit.






[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-11 Thread GitBox


rdblue commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067330294


##
python/pyiceberg/avro/resolver.py:
##
@@ -109,38 +109,46 @@ def resolve(
 
 
 class SchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
-    read_types: Optional[Dict[int, Callable[[Schema], StructProtocol]]]
+    read_types: Dict[int, Type[StructProtocol]]
+    context: List[int]
 
-    def __init__(self, read_types: Optional[Dict[int, Callable[[Schema], StructProtocol]]]):
+    def __init__(self, read_types: Dict[int, Type[StructProtocol]] = EMPTY_DICT) -> None:
         self.read_types = read_types
+        self.context = []
 
     def schema(self, schema: Schema, expected_schema: Optional[IcebergType], result: Reader) -> Reader:
         return result
 
+    def before_field(self, field: NestedField, field_partner: Optional[NestedField]) -> None:
+        self.context.append(field.field_id)
+
+    def after_field(self, field: NestedField, field_partner: Optional[NestedField]) -> None:
+        self.context.pop()
+
     def struct(self, struct: StructType, expected_struct: Optional[IcebergType], field_readers: List[Reader]) -> Reader:
+        # -1 indicates the struct root
+        read_struct_id = self.context[-1] if len(self.context) > 0 else -1
+        struct_callable = self.read_types.get(read_struct_id, Record)

Review Comment:
   There is a need to be able to read a subset of fields when reading a 
manifest list or a manifest file. Right now, we don't use the lower or upper 
bounds, for example, but we still read them into memory rather than skipping 
them. There are also other use cases. A generic record, for example, should be 
able to use the field names from the read schema.
   
   To do those things, this PR needs to support passing the read schema to the 
instance that is instantiated. I don't really care how we do it. It can be by 
inspecting the constructor, assuming that the constructor accepts a schema, 
checking for a `set_read_schema` method, or some other idea. But it needs to be 
done as part of this PR.
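One of the options listed above — inspecting the constructor and passing the read schema only when the registered class accepts one — could be sketched roughly like this. The `Record`, `NamedRecord`, and `instantiate` names are hypothetical for this sketch, not pyiceberg's implementation:

```python
# Sketch: decide at instantiation time whether the registered struct class
# wants the read schema, by inspecting its constructor signature.
import inspect

class Record:
    """A struct class whose constructor takes no schema."""
    def __init__(self):
        self.schema = None

class NamedRecord:
    """A struct class that accepts the read schema, e.g. to expose field names."""
    def __init__(self, schema):
        self.schema = schema

def instantiate(cls, read_schema):
    # Pass the schema only if the constructor declares a `schema` parameter.
    params = inspect.signature(cls.__init__).parameters
    if "schema" in params:
        return cls(schema=read_schema)
    return cls()
```

A `set_read_schema` method check would work the same way with `hasattr` instead of signature inspection; either keeps schema-unaware classes untouched.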






[GitHub] [iceberg] szehon-ho merged pull request #5629: Spark: Add "Iceberg" prefix to SparkTable name string for better observability of Iceberg tables on SparkUI

2023-01-11 Thread GitBox


szehon-ho merged PR #5629:
URL: https://github.com/apache/iceberg/pull/5629





[GitHub] [iceberg] szehon-ho commented on pull request #5629: Spark: Add "Iceberg" prefix to SparkTable name string for better observability of Iceberg tables on SparkUI

2023-01-11 Thread GitBox


szehon-ho commented on PR #5629:
URL: https://github.com/apache/iceberg/pull/5629#issuecomment-1379324890

   Merged, thanks @sumeetgajjar , @wypoon 





[GitHub] [iceberg] szehon-ho merged pull request #6476: API, Core, Flink, Parquet, Spark: Use enhanced for loop

2023-01-11 Thread GitBox


szehon-ho merged PR #6476:
URL: https://github.com/apache/iceberg/pull/6476





[GitHub] [iceberg] szehon-ho commented on pull request #6476: API, Core, Flink, Parquet, Spark: Use enhanced for loop

2023-01-11 Thread GitBox


szehon-ho commented on PR #6476:
URL: https://github.com/apache/iceberg/pull/6476#issuecomment-137912

   Merged, thanks @krvikash , @hililiwei , @amogh-jahagirdar 





[GitHub] [iceberg] dramaticlly opened a new issue, #6567: pyiceberg table scan problem with row filter set to non-partition columns

2023-01-11 Thread GitBox


dramaticlly opened a new issue, #6567:
URL: https://github.com/apache/iceberg/issues/6567

   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   I really like the new table scan feature in the latest pyiceberg 0.2.1 release, thanks @Fokko. It works great when I provide the partition column as the row filter, but it does not work as expected when I provide other columns as part of the expression. `scan.plan_files()` should return the collection of parquet files that satisfy the predicate in the row filter, but it is returning all of them instead.
   
   Here's my repro steps, I created a simple table `hongyue_zhang.mls23` to 
start with
   
   ### schema
   ```ddl
  Create Table
   
--
CREATE TABLE iceberg.hongyue_zhang.mls23 (
   id bigint NOT NULL,
   data varchar,
   ts date
)
WITH (
   format = 'PARQUET',
   location = 's3a://warehouse-default/warehouse/hongyue_zhang.db/mls23',
   partitioning = ARRAY['ts']
)
   (1 row)
   ```
   
   ### Setup
   The table has 2 partitions and 198 records total; each write has its own parquet file, for the sake of simplicity.
   ```
 partition       | record_count | file_count | total_size | data
-----------------+--------------+------------+------------+-------------------------------------------------------------------------
 {ts=2023-01-04} |           99 |         99 |     115300 | {id={min=1, max=1, null_count=0}, data={min=b, max=bbbc, null_count=0}}
 {ts=2023-01-05} |           99 |         99 |     115303 | {id={min=0, max=0, null_count=0}, data={min=a, max=aaab, null_count=0}}
 (2 rows)
   ```
   
   ### Python code
   ```python
   import os
   from pyiceberg.catalog import load_catalog
   from pyiceberg.expressions import GreaterThanOrEqual, And, EqualTo
   
   catalog = load_catalog("prod")
   table = catalog.load_table("hongyue_zhang.mls23")
   table.location()
   
   scan1 = table.scan(
   row_filter=EqualTo("ts", "2023-01-04"))
   yesterday_files = [task.file.file_path for task in scan1.plan_files()]
   print(len(yesterday_files))
   # expect 99 and actual is 99 parquet files for single partition
   
   scan2 = table.scan(
       row_filter=EqualTo("data", "a"))
   a_files = [task.file.file_path for task in scan2.plan_files()]
   print(len(a_files))
   # expect 1 but I am seeing 198 instead, which means all parquet files are returned
   
   scan3 = table.scan(
       row_filter=And(EqualTo("ts", "2023-01-04"), EqualTo("data", "a")))
   yesterday_and_a_files = [task.file.file_path for task in scan3.plan_files()]
   print(len(yesterday_and_a_files))
   # expect 1 but I am seeing 99, which means the row filter is applying the 1st expression on the partition column ts but not the 2nd expression on data
   ```
   
   For the sake of validation, I also tried using Spark to query with a similar condition, and it returns 1 file as expected
   ```spark
   val result = spark.sql("select id, data, input_file_name(), ts from iceberg.hongyue_zhang.mls23 where data = 'a'")
   
   +---+----+------------------------------------------------------------------------------------------------------------------------------+----------+
   |id |data|input_file_name()                                                                                                             |ts        |
   +---+----+------------------------------------------------------------------------------------------------------------------------------+----------+
   |0  |a   |s3a://warehouse-default/warehouse/hongyue_zhang.db/mls23/data/ts=2023-01-05/0-6-5573682f-d72c-4a68-a08f-8fe4dbca8581-1.parquet|2023-01-05|
   +---+----+------------------------------------------------------------------------------------------------------------------------------+----------+
   ```
   
   
   I can't seem to figure out why it failed, and I'm happy to contribute if anyone can share insights 
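For reference, the metrics-based pruning this report expects could be sketched as follows: a scan should skip files whose per-column lower/upper bounds cannot contain the filtered value. The `file_may_contain` and `plan_files` names and logic here are illustrative only, not pyiceberg's actual code:

```python
# Sketch of file pruning via column bounds (min/max statistics).

def file_may_contain(file_bounds, column, value):
    """file_bounds maps column name -> (lower, upper); missing stats keep the file."""
    bounds = file_bounds.get(column)
    if bounds is None:
        return True  # no stats recorded, so the file must be read
    lower, upper = bounds
    # Simple inclusive range check; real implementations must also handle
    # truncated bounds, nulls, and type-specific comparisons.
    return lower <= value <= upper

def plan_files(files, column, value):
    """files is a list of (path, bounds) pairs; keep only files that may match."""
    return [path for path, bounds in files if file_may_contain(bounds, column, value)]
```

With bounds like the ones shown in the partitions table above, a filter `data = 'a'` should prune the files whose `data` range is `[b, bbbc]`, which is why returning all 198 files looks like a bug.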





[GitHub] [iceberg] jackye1995 commented on issue #6523: Table creation fails with Glue catalog on EMR

2023-01-11 Thread GitBox


jackye1995 commented on issue #6523:
URL: https://github.com/apache/iceberg/issues/6523#issuecomment-1379360910

   @singhpk234 @rajarshisarkar could you take a look into this?





[GitHub] [iceberg] erikcw commented on issue #6567: pyiceberg table scan problem with row filter set to non-partition columns

2023-01-11 Thread GitBox


erikcw commented on issue #6567:
URL: https://github.com/apache/iceberg/issues/6567#issuecomment-1379415205

   I stumbled into the same issue with a slight twist. I deleted all the rows 
from my table, but pyiceberg is still returning parquet files with those 
records. Shouldn't those files no longer be in the current manifest?
   
   ```sql
   -- Executed in Athena
   DELETE FROM iceberg_test WHERE uid = '200441';
   
   select count(uid) from "iceberg_test"
   where uid = '200441';
   
   -- Returns 0.
   
   
   ```
   
   ```python
   # Glue catalog type.
   catalog = load_catalog("default")
   table = catalog.load_table("testing.iceberg_test")
   
   scan = table.scan(
       row_filter=NotEqualTo("uid", "200441"),  # Doesn't seem to make a difference with or without this line.
       selected_fields=("uid",),
   )
   files = [task.file.file_path for task in scan.plan_files()]
   # files all contain the deleted value.
   ```





[GitHub] [iceberg] Fokko commented on issue #6567: pyiceberg table scan problem with row filter set to non-partition columns

2023-01-11 Thread GitBox


Fokko commented on issue #6567:
URL: https://github.com/apache/iceberg/issues/6567#issuecomment-1379472226

   @dramaticlly I just checked, and I can confirm that we don't filter on the 
data file ranges yet; this will be implemented very soon 👍🏻 
   
   @erikcw Thanks for raising the issue; we're not handling delete files 
right now. Could you create a separate issue so we can keep track of it?
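For context on what handling delete files would involve: Iceberg position delete files record which row positions of a data file have been deleted, and a reader must mask those rows out while scanning. The sketch below is a simplified, hypothetical illustration of that masking, not pyiceberg's actual internals.

```python
# Conceptual sketch of applying Iceberg position deletes during a scan.
# Structures here are illustrative only.

def apply_position_deletes(rows, deleted_positions):
    """rows: records read from one data file, in file order;
    deleted_positions: set of row positions a delete file marked as deleted."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]

data_rows = [{"uid": "200441"}, {"uid": "300100"}, {"uid": "200441"}]
# Suppose a position-delete file recorded that rows 0 and 2 were deleted.
deleted_positions = {0, 2}

live_rows = apply_position_deletes(data_rows, deleted_positions)
print(live_rows)  # [{'uid': '300100'}]
```

A scan that ignores delete files (as described above) would still return all three rows, which matches the behavior the reporter is seeing.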





[GitHub] [iceberg] erikcw opened a new issue, #6568: pyiceberg table scan returning deleted data

2023-01-11 Thread GitBox


erikcw opened a new issue, #6568:
URL: https://github.com/apache/iceberg/issues/6568

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Other
   
   ### Please describe the bug 🐞
   
   I originally raised this issue in #6567. After deleting rows from 
a table (in my case with Athena), pyiceberg is still returning parquet files 
with those records from a table scan. Shouldn't those files no longer be in the 
current manifest, and hence not be returned by the table scan?
   
   ```sql
   -- Executed in Athena
   DELETE FROM iceberg_test WHERE uid = '200441';
   
   select count(uid) from "iceberg_test"
   where uid = '200441';
   
   -- Returns 0.
   ```
   
   ```py
   # Glue catalog type.
   catalog = load_catalog("default")
   table = catalog.load_table("testing.iceberg_test")
   
   scan = table.scan(
       row_filter=NotEqualTo("uid", "200441"),  # Doesn't seem to make a difference with or without this line.
       selected_fields=("uid",),
   )
   files = [task.file.file_path for task in scan.plan_files()]
   # files all contain the deleted value.
   ```





[GitHub] [iceberg] stevenzwu commented on a diff in pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg

2023-01-11 Thread GitBox


stevenzwu commented on code in PR #6557:
URL: https://github.com/apache/iceberg/pull/6557#discussion_r1067488603


##
flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/sink/TestAvroGenericRecordToRowDataMapper.java:
##
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink.sink;
+
+import org.apache.flink.table.data.RowData;
+import org.apache.iceberg.flink.AvroGenericRecordConverterBase;
+import org.apache.iceberg.flink.DataGenerator;
+import org.junit.Assert;
+
+public class TestAvroGenericRecordToRowDataMapper extends 
AvroGenericRecordConverterBase {

Review Comment:
   read path will be addressed separately






[GitHub] [iceberg] sumeetgajjar commented on pull request #5629: Spark: Add "Iceberg" prefix to SparkTable name string for better observability of Iceberg tables on SparkUI

2023-01-11 Thread GitBox


sumeetgajjar commented on PR #5629:
URL: https://github.com/apache/iceberg/pull/5629#issuecomment-1379529328

   Thanks @wypoon and @szehon-ho for the reviews and for merging the PR.





[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #5234: Core, API: BaseRowDelta to branch Impl

2023-01-11 Thread GitBox


amogh-jahagirdar commented on code in PR #5234:
URL: https://github.com/apache/iceberg/pull/5234#discussion_r1067483646


##
core/src/test/java/org/apache/iceberg/TestRowDelta.java:
##
@@ -429,29 +470,31 @@ public void testValidateNoConflictsFromSnapshot() {
 statuses(Status.ADDED));
   }
 
-  @Test
+  // @Test

Review Comment:
   Any reason this test is commented out? 



##
core/src/test/java/org/apache/iceberg/TableTestBase.java:
##
@@ -472,6 +484,20 @@ void validateTableDeleteFiles(Table tbl, DeleteFile... 
expectedFiles) {
 Assert.assertEquals("Delete files should match", expectedFilePaths, 
actualFilePaths);
   }
 
+  void validateTableDeleteFilesWithRef(Table tbl, String ref, DeleteFile... 
expectedFiles) {

Review Comment:
   validateBranchDeleteFiles?



##
core/src/test/java/org/apache/iceberg/TestRowDelta.java:
##
@@ -1270,11 +1332,16 @@ public void 
testConcurrentNonConflictingRowDeltaAndRewriteFilesWithSequenceNumbe
 baseSnapshot.sequenceNumber())
 .validateFromSnapshot(baseSnapshot.snapshotId());
 
-rowDelta.commit();
-rewriteFiles.commit();
+commit(table, rowDelta, branch);
+commit(table, rewriteFiles, branch);
 
-validateTableDeleteFiles(table, deleteFile1);
-validateTableFiles(table, dataFile2);
+if (branch == "testBranch") {

Review Comment:
   equals() instead of ==



##
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##
@@ -267,7 +267,26 @@ private Map summary(TableMetadata 
previous) {
 }
 
 Map previousSummary;
-if (previous.currentSnapshot() != null) {
+if (!targetBranch.equals(SnapshotRef.MAIN_BRANCH)) {
+  if (previous.ref(targetBranch) != null) {
+if 
(previous.snapshot(previous.ref(targetBranch).snapshotId()).summary() != null) {
+  previousSummary = 
previous.snapshot(previous.ref(targetBranch).snapshotId()).summary();
+} else {
+  previousSummary = ImmutableMap.of();
+}

Review Comment:
   Could we move the value of previous.ref(targetBranch) to it's own variable 
for readability?



##
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##
@@ -267,7 +267,26 @@ private Map summary(TableMetadata 
previous) {
 }
 
 Map previousSummary;
-if (previous.currentSnapshot() != null) {
+if (!targetBranch.equals(SnapshotRef.MAIN_BRANCH)) {
+  if (previous.ref(targetBranch) != null) {
+if 
(previous.snapshot(previous.ref(targetBranch).snapshotId()).summary() != null) {
+  previousSummary = 
previous.snapshot(previous.ref(targetBranch).snapshotId()).summary();
+} else {
+  previousSummary = ImmutableMap.of();
+}

Review Comment:
   Also, I think we could probably simplify all the if/else logic here. The 
branch will either be main or not, and in both cases it's the same logic: 
there is either no previous snapshot (default to 0), the previous snapshot 
had no summary (use an empty summary), or the previous snapshot had a summary 
(use it).



##
core/src/test/java/org/apache/iceberg/TableTestBase.java:
##
@@ -458,6 +458,18 @@ void validateTableFiles(Table tbl, DataFile... 
expectedFiles) {
 Assert.assertEquals("Files should match", expectedFilePaths, 
actualFilePaths);
   }
 
+  void validateTableFilesWithRef(Table tbl, String ref, DataFile... 
expectedFiles) {

Review Comment:
   validateBranchDataFiles?






[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #5234: Core, API: BaseRowDelta to branch Impl

2023-01-11 Thread GitBox


amogh-jahagirdar commented on code in PR #5234:
URL: https://github.com/apache/iceberg/pull/5234#discussion_r1067482077


##
core/src/test/java/org/apache/iceberg/TestRowDelta.java:
##
@@ -1270,11 +1332,16 @@ public void 
testConcurrentNonConflictingRowDeltaAndRewriteFilesWithSequenceNumbe
 baseSnapshot.sequenceNumber())
 .validateFromSnapshot(baseSnapshot.snapshotId());
 
-rowDelta.commit();
-rewriteFiles.commit();
+commit(table, rowDelta, branch);
+commit(table, rewriteFiles, branch);
 
-validateTableDeleteFiles(table, deleteFile1);
-validateTableFiles(table, dataFile2);
+if (branch == "testBranch") {

Review Comment:
   equals() instead of == ?






[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #5234: Core, API: BaseRowDelta to branch Impl

2023-01-11 Thread GitBox


amogh-jahagirdar commented on code in PR #5234:
URL: https://github.com/apache/iceberg/pull/5234#discussion_r1067501650


##
core/src/test/java/org/apache/iceberg/TestRowDelta.java:
##
@@ -1270,11 +1332,16 @@ public void 
testConcurrentNonConflictingRowDeltaAndRewriteFilesWithSequenceNumbe
 baseSnapshot.sequenceNumber())
 .validateFromSnapshot(baseSnapshot.snapshotId());
 
-rowDelta.commit();
-rewriteFiles.commit();
+commit(table, rowDelta, branch);
+commit(table, rewriteFiles, branch);
 
-validateTableDeleteFiles(table, deleteFile1);
-validateTableFiles(table, dataFile2);
+if (branch == "testBranch") {

Review Comment:
   Also, I just realized we don't need to differentiate the two cases here; I 
think we could just have a single method which takes in a branch (the branch 
could be main or not, it doesn't really matter).






[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #5234: Core, API: BaseRowDelta to branch Impl

2023-01-11 Thread GitBox


amogh-jahagirdar commented on code in PR #5234:
URL: https://github.com/apache/iceberg/pull/5234#discussion_r1067520428


##
core/src/test/java/org/apache/iceberg/TestRowDelta.java:
##
@@ -1449,18 +1519,48 @@ public void testRowDeltaCaseSensitivity() {
   }
 
   @Test
-  public void testRowDeltaToBranchUnsupported() {
+  public void testBranchValidationsNotValidAncestor() {
+table.newAppend().appendFile(FILE_A).commit();
+table.manageSnapshots().createBranch("branch", 
table.currentSnapshot().snapshotId()).commit();
+table.newAppend().appendFile(FILE_B).commit();
+
+// This commit will result in validation exception as we start validation 
from a snapshot which
+// is not an ancestor of the branch
+RowDelta rowDelta =
+table
+.newRowDelta()
+.toBranch("branch")
+.addDeletes(FILE_A_DELETES)
+.validateFromSnapshot(table.currentSnapshot().snapshotId())
+.conflictDetectionFilter(Expressions.alwaysTrue())
+.validateNoConflictingDeleteFiles();
+
 AssertHelpers.assertThrows(
-"Should reject committing row delta to branch",
-UnsupportedOperationException.class,
-"Cannot commit to branch someBranch: org.apache.iceberg.BaseRowDelta 
does not support branch commits",
-() ->
-table
-.newRowDelta()
-.caseSensitive(false)
-.addRows(FILE_B)
-.addDeletes(FILE_A2_DELETES)
-.toBranch("someBranch")
-.commit());
+"Snapshot 2 is not an ancestor of 1",
+IllegalArgumentException.class,
+() -> rowDelta.commit());
+  }
+
+  @Test
+  public void testBranchValidationsValidAncestor() {
+table.newAppend().appendFile(FILE_A).commit();
+Long ancestorSnapshot = table.currentSnapshot().snapshotId();
+table.manageSnapshots().createBranch("branch", ancestorSnapshot).commit();
+
+// This commit not result in validation exception as we start validation 
from a snapshot which
+// is an actual ancestor of the branch
+table
+.newRowDelta()
+.toBranch("branch")
+.addDeletes(FILE_A_DELETES)
+.validateFromSnapshot(ancestorSnapshot)
+.conflictDetectionFilter(Expressions.alwaysTrue())
+.validateNoConflictingDeleteFiles()
+.commit();
+
+int branchSnapshot = 2;
+
+Assert.assertEquals(table.currentSnapshot().snapshotId(), 1);
+Assert.assertEquals(table.ops().current().ref("branch").snapshotId(), 
branchSnapshot);

Review Comment:
   Sorry, I'm not getting why we need this test case; won't it be covered by 
the others?






[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-11 Thread GitBox


rdblue commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067543958


##
python/pyiceberg/avro/reader.py:
##
@@ -252,30 +252,33 @@ def skip(self, decoder: BinaryDecoder) -> None:
 class StructReader(Reader):
 field_readers: Tuple[Tuple[Optional[int], Reader], ...]
 create_struct: Type[StructProtocol]
-fields: Optional[Tuple[NestedField, ...]]
+struct: Optional[StructType]
 
 def __init__(
 self,
 field_readers: Tuple[Tuple[Optional[int], Reader], ...],
 create_struct: Optional[Type[StructProtocol]] = None,
-fields: Optional[Tuple[NestedField, ...]] = None,
+struct: Optional[StructType] = None,
 ):
 self.field_readers = field_readers
 self.create_struct = create_struct or Record
-self.fields = fields
+self.struct = struct
 
 def read(self, decoder: BinaryDecoder) -> Any:
-if issubclass(self.create_struct, Record):
-struct = self.create_struct(length=len(self.field_readers), 
fields=self.fields)
-elif issubclass(self.create_struct, PydanticStruct):
+if issubclass(self.create_struct, PydanticStruct):
 struct = self.create_struct.construct()
 else:
-raise ValueError(f"Expected a subclass of PydanticStruct, got: 
{self.create_struct}")
+struct = self.create_struct()
+
+if not isinstance(struct, StructProtocol):
+raise ValueError(f"Expected struct to implement StructProtocol: 
{struct}")
+
+if self.struct:
+struct.set_record_schema(self.struct)

Review Comment:
   Do we want to do an `isinstance` check here eventually?






[GitHub] [iceberg] namrathamyske commented on a diff in pull request #5234: Core, API: BaseRowDelta to branch Impl

2023-01-11 Thread GitBox


namrathamyske commented on code in PR #5234:
URL: https://github.com/apache/iceberg/pull/5234#discussion_r1067542176


##
core/src/test/java/org/apache/iceberg/TestRowDelta.java:
##
@@ -1449,18 +1519,48 @@ public void testRowDeltaCaseSensitivity() {
   }
 
   @Test
-  public void testRowDeltaToBranchUnsupported() {
+  public void testBranchValidationsNotValidAncestor() {
+table.newAppend().appendFile(FILE_A).commit();
+table.manageSnapshots().createBranch("branch", 
table.currentSnapshot().snapshotId()).commit();
+table.newAppend().appendFile(FILE_B).commit();
+
+// This commit will result in validation exception as we start validation 
from a snapshot which
+// is not an ancestor of the branch
+RowDelta rowDelta =
+table
+.newRowDelta()
+.toBranch("branch")
+.addDeletes(FILE_A_DELETES)
+.validateFromSnapshot(table.currentSnapshot().snapshotId())
+.conflictDetectionFilter(Expressions.alwaysTrue())
+.validateNoConflictingDeleteFiles();
+
 AssertHelpers.assertThrows(
-"Should reject committing row delta to branch",
-UnsupportedOperationException.class,
-"Cannot commit to branch someBranch: org.apache.iceberg.BaseRowDelta 
does not support branch commits",
-() ->
-table
-.newRowDelta()
-.caseSensitive(false)
-.addRows(FILE_B)
-.addDeletes(FILE_A2_DELETES)
-.toBranch("someBranch")
-.commit());
+"Snapshot 2 is not an ancestor of 1",
+IllegalArgumentException.class,
+() -> rowDelta.commit());
+  }
+
+  @Test
+  public void testBranchValidationsValidAncestor() {
+table.newAppend().appendFile(FILE_A).commit();
+Long ancestorSnapshot = table.currentSnapshot().snapshotId();
+table.manageSnapshots().createBranch("branch", ancestorSnapshot).commit();
+
+// This commit not result in validation exception as we start validation 
from a snapshot which
+// is an actual ancestor of the branch
+table
+.newRowDelta()
+.toBranch("branch")
+.addDeletes(FILE_A_DELETES)
+.validateFromSnapshot(ancestorSnapshot)
+.conflictDetectionFilter(Expressions.alwaysTrue())
+.validateNoConflictingDeleteFiles()
+.commit();
+
+int branchSnapshot = 2;
+
+Assert.assertEquals(table.currentSnapshot().snapshotId(), 1);
+Assert.assertEquals(table.ops().current().ref("branch").snapshotId(), 
branchSnapshot);

Review Comment:
   Cool, will be removing these.






[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-11 Thread GitBox


rdblue commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067556636


##
python/pyiceberg/typedef.py:
##
@@ -126,30 +135,41 @@ def json(self, exclude_none: bool = True, exclude: 
Optional[Set[str]] = None, by
 
 
 class PydanticStruct(IcebergBaseModel):
+_position_to_field_name: Dict[int, str] = PrivateAttr()
+_field_name_to_pydantic_field: Dict[str, Field] = PrivateAttr()
+
 class Config:
 frozen = False
 
-def __setitem__(self, pos: int, value: Any) -> None:
-positions = list(self.__fields__.values())
-field = positions[pos]
-if value is None:
-if field.default is not None:
-value = field.default
-elif field.default_factory is not None:
-value = field.default_factory()
+@staticmethod
+def _get_default_field_value(field: Field) -> Optional[Any]:
+if field.default is not None:
+return field.default
+elif field.default_factory is not None:
+return field.default_factory()
+else:
+return None
+
+def set_record_schema(self, record_schema: StructType) -> None:
+self._field_name_to_pydantic_field = {field.name: field for field in 
self.__fields__.values()}
+self._position_to_field_name = {idx: field.name for idx, field in 
enumerate(record_schema.fields)}
+for name, field in self.__fields__.items():
+setattr(self, name, PydanticStruct._get_default_field_value(field))
 
-self.__setattr__(field.name, value)
+def __setitem__(self, pos: int, value: Any) -> None:
+field_name = self._position_to_field_name[pos]
+# Check if the field exists
+if field := self._field_name_to_pydantic_field.get(field_name):
+self.__setattr__(field.name, value if value is not None else 
PydanticStruct._get_default_field_value(field))

Review Comment:
   I think that this is incorrect because it assumes that `None` should result 
in the default value. The file may explicitly contain a `null` value, in which 
case we do not want to default it.
   
   For now, I think it is best if we do not add default values from Pydantic. 
We can add that later, but we need to be careful and add them in the readers, 
not in the setter.






[GitHub] [iceberg] stevenzwu commented on a diff in pull request #6382: Implement ShuffleOperator to collect data statistics

2023-01-11 Thread GitBox


stevenzwu commented on code in PR #6382:
URL: https://github.com/apache/iceberg/pull/6382#discussion_r1067557305


##
flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatistics.java:
##
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink.sink.shuffle;
+
+/**
+ * DataStatistics defines the interface to collect data statistics.
+ *
+ * Data statistics tracks traffic volume distribution across data keys. For 
low-cardinality key,
+ * a simple map of (key, count) can be used. For high-cardinality key, 
probabilistic data structures
+ * (sketching) can be used.
+ */
+interface DataStatistics {
+  long size();

Review Comment:
   let's add some Javadoc for the methods.
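As a sketch of the low-cardinality case the interface's Javadoc describes, a plain (key, count) map suffices. The Python class below illustrates the idea only; it is not the Flink/Java implementation, and its names are made up.

```python
# Illustrative low-cardinality data-statistics collector: a (key, count) map.
from collections import Counter

class MapDataStatistics:
    def __init__(self):
        self._counts = Counter()

    def add(self, key):
        """Record one data record observed for the given key."""
        self._counts[key] += 1

    def size(self):
        """Total number of records observed across all keys."""
        return sum(self._counts.values())

    def distribution(self):
        """Traffic volume per key, as a plain dict."""
        return dict(self._counts)

stats = MapDataStatistics()
for key in ["us", "us", "eu", "us", "ap"]:
    stats.add(key)
print(stats.size())          # 5
print(stats.distribution())  # {'us': 3, 'eu': 1, 'ap': 1}
```

For high-cardinality keys, as the Javadoc notes, the exact map would be replaced by a probabilistic sketch with the same interface.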






[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-11 Thread GitBox


rdblue commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067557776


##
python/pyiceberg/typedef.py:
##
@@ -126,30 +135,41 @@ def json(self, exclude_none: bool = True, exclude: 
Optional[Set[str]] = None, by
 
 
 class PydanticStruct(IcebergBaseModel):
+_position_to_field_name: Dict[int, str] = PrivateAttr()
+_field_name_to_pydantic_field: Dict[str, Field] = PrivateAttr()
+
 class Config:
 frozen = False
 
-def __setitem__(self, pos: int, value: Any) -> None:
-positions = list(self.__fields__.values())
-field = positions[pos]
-if value is None:
-if field.default is not None:
-value = field.default
-elif field.default_factory is not None:
-value = field.default_factory()
+@staticmethod
+def _get_default_field_value(field: Field) -> Optional[Any]:
+if field.default is not None:
+return field.default
+elif field.default_factory is not None:
+return field.default_factory()
+else:
+return None
+
+def set_record_schema(self, record_schema: StructType) -> None:
+self._field_name_to_pydantic_field = {field.name: field for field in 
self.__fields__.values()}
+self._position_to_field_name = {idx: field.name for idx, field in 
enumerate(record_schema.fields)}
+for name, field in self.__fields__.items():
+setattr(self, name, PydanticStruct._get_default_field_value(field))

Review Comment:
   I'm okay with adding defaults here, although I think that fields that are 
used should always be set through `__setitem__` so it shouldn't be necessary. 
That's why I'd prefer not creating attrs for these and simply raising an 
exception if an unset field is referenced.






[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-11 Thread GitBox


rdblue commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067558837


##
python/pyiceberg/typedef.py:
##
@@ -126,30 +135,41 @@ def json(self, exclude_none: bool = True, exclude: 
Optional[Set[str]] = None, by
 
 
 class PydanticStruct(IcebergBaseModel):
+_position_to_field_name: Dict[int, str] = PrivateAttr()
+_field_name_to_pydantic_field: Dict[str, Field] = PrivateAttr()
+
 class Config:
 frozen = False
 
-def __setitem__(self, pos: int, value: Any) -> None:
-positions = list(self.__fields__.values())
-field = positions[pos]
-if value is None:
-if field.default is not None:
-value = field.default
-elif field.default_factory is not None:
-value = field.default_factory()
+@staticmethod
+def _get_default_field_value(field: Field) -> Optional[Any]:
+if field.default is not None:
+return field.default
+elif field.default_factory is not None:
+return field.default_factory()
+else:
+return None
+
+def set_record_schema(self, record_schema: StructType) -> None:
+self._field_name_to_pydantic_field = {field.name: field for field in 
self.__fields__.values()}
+self._position_to_field_name = {idx: field.name for idx, field in 
enumerate(record_schema.fields)}
+for name, field in self.__fields__.items():
+setattr(self, name, PydanticStruct._get_default_field_value(field))
 
-self.__setattr__(field.name, value)
+def __setitem__(self, pos: int, value: Any) -> None:
+field_name = self._position_to_field_name[pos]

Review Comment:
   Since it isn't safe to set defaults in this method, I think it can be 
simpler:
   
   ```python
   def __setitem__(self, pos: int, value: Any) -> None:
   self.__setattr__(self._position_to_field_name[pos], value)
   ```
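The suggested `__setitem__` boils down to delegating through a position-to-field-name map. A standalone illustration of that pattern in plain Python (not the actual `PydanticStruct` class):

```python
# Standalone sketch of positional field access via a position -> name map.
# The class and field names are hypothetical.
class PositionalStruct:
    def __init__(self, field_names):
        # map each reader position to an attribute name
        self._position_to_field_name = dict(enumerate(field_names))

    def __setitem__(self, pos, value):
        setattr(self, self._position_to_field_name[pos], value)

    def __getitem__(self, pos):
        return getattr(self, self._position_to_field_name[pos])

record = PositionalStruct(["file_path", "file_format"])
record[0] = "s3://bucket/data/00000.parquet"
record[1] = "PARQUET"
print(record.file_path)  # s3://bucket/data/00000.parquet
```

Keeping the setter this thin leaves defaulting decisions to the readers, which is consistent with the earlier point that explicit nulls must not be silently replaced.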






[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-11 Thread GitBox


rdblue commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067559792


##
python/pyiceberg/typedef.py:
##
@@ -126,30 +135,41 @@ def json(self, exclude_none: bool = True, exclude: 
Optional[Set[str]] = None, by
 
 
 class PydanticStruct(IcebergBaseModel):

Review Comment:
   If we can't use Pydantic for defaults, is there still value in using it as 
the basis for our structs?






issues@iceberg.apache.org

2023-01-11 Thread GitBox


rdblue commented on code in PR #6324:
URL: https://github.com/apache/iceberg/pull/6324#discussion_r1067562662


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveHadoopUtil.java:
##
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.io.IOException;
+import org.apache.hadoop.security.UserGroupInformation;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class HiveHadoopUtil {
+
+  private static final Logger LOG = LoggerFactory.getLogger(HiveHadoopUtil.class);
+
+  private HiveHadoopUtil() {}
+
+  public static String currentUser() {
+    try {
+      return UserGroupInformation.getCurrentUser().getUserName();
+    } catch (IOException e) {
+      LOG.warn("Fail to obtain Hadoop UGI user", e);

Review Comment:
   Minor: to fit with conventions, I think the error message should be `Failed 
to get Hadoop user`.
   * Error messages use past tense to describe what was attempted
   * "Obtain" is no more specific than "get", so prefer the simpler word
   * `UGI` is an internal Hadoop term that is more confusing than helpful






[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes

2023-01-11 Thread GitBox


flyrain commented on code in PR #6012:
URL: https://github.com/apache/iceberg/pull/6012#discussion_r1067563712


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/GenerateChangesProcedure.java:
##
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import java.sql.Timestamp;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.spark.ChangelogIterator;
+import org.apache.iceberg.spark.SparkReadOptions;
+import org.apache.iceberg.spark.source.SparkChangelogTable;
+import org.apache.spark.api.java.function.MapPartitionsFunction;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.encoders.RowEncoder;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import scala.runtime.BoxedUnit;
+
+/**
+ * A procedure that creates a view for changed rows.
+ *
+ * <p>The procedure computes update-rows and removes the carry-over rows by default. You can
+ * disable them through parameters to get better performance.
+ *
+ * <p>Carry-over rows are the result of a removal and insertion of the same row within an
+ * operation because of the copy-on-write mechanism. For example, given a file that contains
+ * row1 (id=1, data='a') and row2 (id=2, data='b'), a copy-on-write delete of row2 would
+ * require erasing this file and preserving row1 in a new file. The change-log table would
+ * report this as (id=1, data='a', op='DELETE') and (id=1, data='a', op='INSERT'), despite it
+ * not being an actual change to the table. The iterator finds the carry-over rows and removes
+ * them from the result.
+ *
+ * <p>An update-row is converted from a pair of a delete row and an insert row. Identifier
+ * columns are needed to determine whether they refer to the same row. You can either set
+ * identifier field IDs as table properties or pass them as procedure parameters. Here is an
+ * example of an update-row with an identifier column (id). A pair of a delete row and an
+ * insert row with the same id:
+ *
+ * <ul>
+ *   <li>(id=1, data='a', op='DELETE')
+ *   <li>(id=1, data='b', op='INSERT')
+ * </ul>
+ *
+ * <p>will be marked as update-rows:
+ *
+ * <ul>
+ *   <li>(id=1, data='a', op='UPDATE_BEFORE')
+ *   <li>(id=1, data='b', op='UPDATE_AFTER')
+ * </ul>
+ */
+public class GenerateChangesProcedure extends BaseProcedure {
+  private static final Logger LOG = LoggerFactory.getLogger(GenerateChangesProcedure.class);
+
+  private static final ProcedureParameter[] PARAMETERS =
+      new ProcedureParameter[] {
+        ProcedureParameter.required("table", DataTypes.StringType),
+        ProcedureParameter.optional("table_change_view", DataTypes.StringType),
+        ProcedureParameter.optional("options", STRING_MAP),
+        ProcedureParameter.optional("compute_updates", DataTypes.BooleanType),
+        ProcedureParameter.optional("remove_carryovers", DataTypes.BooleanType),
+        ProcedureParameter.optional("identifier_columns", DataTypes.StringType),
+      };
+
+  private static final StructType OUTPUT_TYPE =
+      new StructType(
+          new StructField[] {
+            new StructField("view_name", DataTypes.StringType, false, Metadata.empty())
+          });
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+    return new BaseProcedure.Builder<GenerateChangesProcedure>() {
+      @Override
+      protected GenerateChangesProcedure doBuild() {
+        return new GenerateChangesProcedure(tableCatalog());
+      }
+
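
The carry-over removal and update-row pairing described in the javadoc above can be sketched as follows. This is a hedged illustration over plain dicts; `collapse_changelog` and its row shape are assumptions, not the procedure's Spark implementation:

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]

def collapse_changelog(rows: List[Row], identifier_cols: Sequence[str]) -> List[Row]:
    """Drop carry-over DELETE/INSERT pairs and turn DELETE/INSERT pairs that
    share the identifier columns into UPDATE_BEFORE/UPDATE_AFTER rows.

    Illustrative sketch; the real procedure does this with Spark iterators.
    """
    deletes = [r for r in rows if r["op"] == "DELETE"]
    inserts = [r for r in rows if r["op"] == "INSERT"]
    result: List[Row] = []

    for d in deletes:
        data = {k: v for k, v in d.items() if k != "op"}
        # carry-over: an INSERT with identical data cancels the DELETE
        carry = next(
            (i for i in inserts if {k: v for k, v in i.items() if k != "op"} == data), None
        )
        if carry is not None:
            inserts.remove(carry)
            continue
        # update: an INSERT sharing the identifier columns pairs with the DELETE
        key = tuple(d[c] for c in identifier_cols)
        upd = next(
            (i for i in inserts if tuple(i[c] for c in identifier_cols) == key), None
        )
        if upd is not None:
            inserts.remove(upd)
            result.append({**d, "op": "UPDATE_BEFORE"})
            result.append({**upd, "op": "UPDATE_AFTER"})
        else:
            result.append(d)

    result.extend(inserts)  # remaining inserts are genuine inserts
    return result
```

Run on the javadoc's example, the carry-over pair disappears and the id-matched pair becomes an UPDATE_BEFORE/UPDATE_AFTER pair.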

issues@iceberg.apache.org

2023-01-11 Thread GitBox


rdblue commented on code in PR #6324:
URL: https://github.com/apache/iceberg/pull/6324#discussion_r1067564415


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java:
##
@@ -247,17 +249,28 @@ public void testReplaceTxnBuilder() throws Exception {
 
   @Test
   public void testCreateTableWithOwner() throws Exception {
+    createTableAndVerifyOwner(

Review Comment:
   Why are these mixed into a single test method rather than two test methods? 
I think this makes it harder to follow






[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-11 Thread GitBox


rdblue commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067581319


##
python/tests/test_typedef.py:
##
@@ -46,7 +51,8 @@ def test_record_repr() -> None:
 
 
 def test_named_record() -> None:
-    r = Record(fields=(NestedField(0, "id", IntegerType()), NestedField(1, "name", StringType())))
+    r = Record()
+    r.set_record_schema(StructType(NestedField(0, "id", IntegerType()), NestedField(1, "name", StringType())))

Review Comment:
   I think I prefer passing the schema into the constructor. We may want to 
keep doing that, at least for the generic record.






[GitHub] [iceberg] github-actions[bot] commented on issue #5173: Add Flink test for Parquet bloom filter

2023-01-11 Thread GitBox


github-actions[bot] commented on issue #5173:
URL: https://github.com/apache/iceberg/issues/5173#issuecomment-1379643434

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'





[GitHub] [iceberg] github-actions[bot] commented on issue #5174: MERGE INTO TABLE is not supported temporarily on Spark3.2.0, Scala2.12, Iceberg0.13.1

2023-01-11 Thread GitBox


github-actions[bot] commented on issue #5174:
URL: https://github.com/apache/iceberg/issues/5174#issuecomment-1379643412

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.





[GitHub] [iceberg] github-actions[bot] closed issue #5173: Add Flink test for Parquet bloom filter

2023-01-11 Thread GitBox


github-actions[bot] closed issue #5173: Add Flink test for Parquet bloom filter 
URL: https://github.com/apache/iceberg/issues/5173





[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

2023-01-11 Thread GitBox


rdblue commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1067584072


##
python/pyiceberg/typedef.py:
##
@@ -79,6 +84,10 @@ def __missing__(self, key: K) -> V:
 class StructProtocol(Protocol):  # pragma: no cover
     """A generic protocol used by accessors to get and set at positions of an object"""
 
+    @abstractmethod
+    def set_record_schema(self, record_schema: StructType) -> None:

Review Comment:
   One thing that I didn't think about when we were discussing where to put this method is that it can be called multiple times if we don't pass the schema to the constructor, allowing callers to change the record schema. We assume that it will only be set once in the readers, but this makes the API quite strange: I can read with one schema, then set a different schema, and then the behavior is unclear.
   
   Because of that, I'd prefer to move this method to our `PydanticStruct` class instead of putting it here. That way we control all of the implementations and it doesn't matter. And we will have the ability to change it later if needed.
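
   The hazard described above — a re-settable record schema leaving the struct in an unclear state — can be shown with a minimal toy (class and attribute names are illustrative, not the PyIceberg API):

   ```python
   class MutableSchemaStruct:
       """Toy showing why a re-settable record schema is awkward: attributes
       from the old schema survive a second set_record_schema call."""

       def set_record_schema(self, field_names):
           self._position_to_field_name = dict(enumerate(field_names))
           for name in field_names:
               setattr(self, name, None)

   s = MutableSchemaStruct()
   s.set_record_schema(["id", "name"])
   s.name = "a"
   s.set_record_schema(["id", "email"])  # schema swapped mid-use
   # 'name' still lingers from the first schema; behavior is now ambiguous
   stale = hasattr(s, "name")
   ```

   Fixing the schema at construction time, as suggested for the generic record, removes this ambiguity entirely.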






[GitHub] [iceberg] flyrain commented on a diff in pull request #6012: Spark 3.3: Add a procedure to generate table changes

2023-01-11 Thread GitBox


flyrain commented on code in PR #6012:
URL: https://github.com/apache/iceberg/pull/6012#discussion_r1067602297


##
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/procedures/GenerateChangesProcedure.java:
##
@@ -0,0 +1,271 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.procedures;
+
+import java.util.Arrays;
+import java.util.UUID;
+import org.apache.iceberg.ChangelogOperation;
+import org.apache.iceberg.MetadataColumns;
+import org.apache.iceberg.Snapshot;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.spark.SparkReadOptions;
+import org.apache.iceberg.spark.source.SparkChangelogTable;
+import org.apache.iceberg.util.DateTimeUtil;
+import org.apache.iceberg.util.SnapshotUtil;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.apache.spark.sql.connector.catalog.TableCatalog;
+import org.apache.spark.sql.connector.iceberg.catalog.ProcedureParameter;
+import org.apache.spark.sql.expressions.Window;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.unsafe.types.UTF8String;
+import org.jetbrains.annotations.NotNull;
+
+public class GenerateChangesProcedure extends BaseProcedure {
+
+  public static SparkProcedures.ProcedureBuilder builder() {
+    return new BaseProcedure.Builder<GenerateChangesProcedure>() {
+      @Override
+      protected GenerateChangesProcedure doBuild() {
+        return new GenerateChangesProcedure(tableCatalog());
+      }
+    };
+  }
+
+  private static final ProcedureParameter[] PARAMETERS =
+      new ProcedureParameter[] {
+        ProcedureParameter.required("table", DataTypes.StringType),
+        // the snapshot ids input are ignored when the start/end timestamps are provided
+        ProcedureParameter.optional("start_snapshot_id_exclusive", DataTypes.LongType),
+        ProcedureParameter.optional("end_snapshot_id_inclusive", DataTypes.LongType),
+        ProcedureParameter.optional("table_change_view", DataTypes.StringType),
+        ProcedureParameter.optional("identifier_columns", DataTypes.StringType),

Review Comment:
   Made the change per suggestion






[GitHub] [iceberg] flyrain commented on pull request #6012: Spark 3.3: Add a procedure to generate table changes

2023-01-11 Thread GitBox


flyrain commented on PR #6012:
URL: https://github.com/apache/iceberg/pull/6012#issuecomment-1379673208

   Ready for another look. cc @RussellSpitzer @szehon-ho @aokolnychyi 





[GitHub] [iceberg] hililiwei commented on a diff in pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg

2023-01-11 Thread GitBox


hililiwei commented on code in PR #6557:
URL: https://github.com/apache/iceberg/pull/6557#discussion_r1067664873


##
flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/DataGenerators.java:
##
@@ -70,6 +78,12 @@ public static class Primitives implements DataGenerator {
     private static final BigDecimal BIG_DECIMAL_NEGATIVE = new BigDecimal("-1.50");
     private static final byte[] FIXED_BYTES = "012345689012345".getBytes(StandardCharsets.UTF_8);
 
+    private static org.apache.avro.Schema addAdjustToUtc(

Review Comment:
   Do we need it? It doesn't seem to have been used.






[GitHub] [iceberg] stevenzwu commented on a diff in pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg

2023-01-11 Thread GitBox


stevenzwu commented on code in PR #6557:
URL: https://github.com/apache/iceberg/pull/6557#discussion_r1067710226


##
flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/DataGenerators.java:
##
@@ -70,6 +78,12 @@ public static class Primitives implements DataGenerator {
     private static final BigDecimal BIG_DECIMAL_NEGATIVE = new BigDecimal("-1.50");
     private static final byte[] FIXED_BYTES = "012345689012345".getBytes(StandardCharsets.UTF_8);
 
+    private static org.apache.avro.Schema addAdjustToUtc(

Review Comment:
   good catch. this is not needed anymore.






[GitHub] [iceberg] stevenzwu commented on a diff in pull request #6557: Flink: add support of writing Avro GenericRecord DataStream to Iceberg

2023-01-11 Thread GitBox


stevenzwu commented on code in PR #6557:
URL: https://github.com/apache/iceberg/pull/6557#discussion_r1067710226


##
flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/DataGenerators.java:
##
@@ -70,6 +78,12 @@ public static class Primitives implements DataGenerator {
     private static final BigDecimal BIG_DECIMAL_NEGATIVE = new BigDecimal("-1.50");
     private static final byte[] FIXED_BYTES = "012345689012345".getBytes(StandardCharsets.UTF_8);
 
+    private static org.apache.avro.Schema addAdjustToUtc(

Review Comment:
   good catch. this is not needed anymore. I forgot to delete the obsoleted 
code after some refactoring.






[GitHub] [iceberg] pvary commented on issue #6370: What is the purpose of Hive Lock ?

2023-01-11 Thread GitBox


pvary commented on issue #6370:
URL: https://github.com/apache/iceberg/issues/6370#issuecomment-1379883500

   With apache/hive#3888 we can implement a solution that handles failures the same way as the current one, without using locks, depending instead on `alter_table` to fail.





[GitHub] [iceberg] JonasJ-ap commented on pull request #6449: Delta: Support Snapshot Delta Lake Table to Iceberg Table

2023-01-11 Thread GitBox


JonasJ-ap commented on PR #6449:
URL: https://github.com/apache/iceberg/pull/6449#issuecomment-1379900129

   > Also, because the Spark-based tests are under `integrationTest` task, I 
think we need to create a new CI task to run the tests, otherwise it won't 
automatically run?
   
   I've added a new CI workflow called Delta Conversion CI (`.github/workflows/delta-conversion-ci.yml`) and upgraded the test to run against a dataframe with nested fields.





[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #6449: Delta: Support Snapshot Delta Lake Table to Iceberg Table

2023-01-11 Thread GitBox


JonasJ-ap commented on code in PR #6449:
URL: https://github.com/apache/iceberg/pull/6449#discussion_r106276


##
delta-lake/src/main/java/org/apache/iceberg/delta/BaseSnapshotDeltaLakeTableAction.java:
##
@@ -0,0 +1,370 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.delta;
+
+import io.delta.standalone.DeltaLog;
+import io.delta.standalone.VersionLog;
+import io.delta.standalone.actions.Action;
+import io.delta.standalone.actions.AddFile;
+import io.delta.standalone.actions.RemoveFile;
+import java.io.File;
+import java.net.URI;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.AppendFiles;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DataFiles;
+import org.apache.iceberg.DeleteFiles;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.Metrics;
+import org.apache.iceberg.MetricsConfig;
+import org.apache.iceberg.OverwriteFiles;
+import org.apache.iceberg.PartitionField;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.SnapshotSummary;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.Transaction;
+import org.apache.iceberg.avro.Avro;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.hadoop.HadoopFileIO;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.mapping.MappingUtil;
+import org.apache.iceberg.mapping.NameMapping;
+import org.apache.iceberg.mapping.NameMappingParser;
+import org.apache.iceberg.orc.OrcMetrics;
+import org.apache.iceberg.parquet.ParquetUtil;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Type;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Takes a Delta Lake table's location and attempts to create an Iceberg table snapshot in an
+ * optional user-specified location (defaults to the Delta Lake table's location) with a
+ * different identifier.
+ */
+class BaseSnapshotDeltaLakeTableAction implements SnapshotDeltaLakeTable {
+
+  private static final Logger LOG = LoggerFactory.getLogger(BaseSnapshotDeltaLakeTableAction.class);
+
+  private static final String SNAPSHOT_SOURCE_PROP = "snapshot_source";
+  private static final String DELTA_SOURCE_VALUE = "delta";
+  private static final String ORIGINAL_LOCATION_PROP = "original_location";
+  private static final String PARQUET_SUFFIX = ".parquet";
+  private static final String AVRO_SUFFIX = ".avro";
+  private static final String ORC_SUFFIX = ".orc";
+  private final ImmutableMap.Builder<String, String> additionalPropertiesBuilder =
+      ImmutableMap.builder();
+  private DeltaLog deltaLog;
+  private Catalog icebergCatalog;
+  private final String deltaTableLocation;
+  private TableIdentifier newTableIdentifier;
+  private String newTableLocation;
+  private HadoopFileIO deltaLakeFileIO;
+
+  /**
+   * Snapshot a delta lake table to be an iceberg table. The action will read the delta lake
+   * table's log through the table's path, create a new iceberg table using the given
+   * icebergCatalog and newTableIdentifier, and commit all changes in one iceberg transaction.
+   *
+   * <p>The new table will only be created if the snapshot is successful.
+   *
+   * @param deltaTableLocation the delta lake table's path
+   */
+  BaseSnapshotDeltaLakeTableAction(String deltaTableLocation) {
+    this.deltaTableLocation = deltaTableLocation;
+    this.newTableLocation = deltaTableLocation;
+  }
+
+  @Override
+  public SnapshotDeltaLakeTable tableProperties(Map<String, String> properties) {
+    additionalPropertiesBuilder.putAll(properties);
+    return this;
+  }
+
+  @Override
+  public SnapshotDeltaLakeTable tableProperty(String name,
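
The `PARQUET_SUFFIX`/`AVRO_SUFFIX`/`ORC_SUFFIX` constants above suggest a suffix-based file-format dispatch for the Delta data files; a minimal sketch (the function name and return values are illustrative assumptions, not the action's actual code):

```python
def file_format_for(path: str) -> str:
    """Map a data-file path to a format name by its suffix, mirroring the
    PARQUET/AVRO/ORC suffix constants. Illustrative sketch only."""
    suffixes = {".parquet": "PARQUET", ".avro": "AVRO", ".orc": "ORC"}
    for suffix, fmt in suffixes.items():
        if path.lower().endswith(suffix):
            return fmt
    raise ValueError(f"Cannot determine file format for: {path}")
```

Lower-casing the path first keeps the dispatch robust to files written with upper-case extensions.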

[GitHub] [iceberg] nastra opened a new pull request, #6569: Spark: Add the query ID to file names

2023-01-11 Thread GitBox


nastra opened a new pull request, #6569:
URL: https://github.com/apache/iceberg/pull/6569

   Co-authored-by: Ryan Blue 





[GitHub] [iceberg] nastra commented on pull request #5214: Spark: Add the query ID to file names

2023-01-11 Thread GitBox


nastra commented on PR #5214:
URL: https://github.com/apache/iceberg/pull/5214#issuecomment-1379936898

   Closing this one as it's been superseded by 
https://github.com/apache/iceberg/pull/6569





[GitHub] [iceberg] nastra closed pull request #5214: Spark: Add the query ID to file names

2023-01-11 Thread GitBox


nastra closed pull request #5214: Spark: Add the query ID to file names
URL: https://github.com/apache/iceberg/pull/5214

