[PR] Build: Fix minor compilation warnings [iceberg]

2023-10-10 Thread via GitHub


nk1506 opened a new pull request, #8758:
URL: https://github.com/apache/iceberg/pull/8758

   There were a few warnings when running `./gradlew clean build -x test -x integrationTest`.
   This change makes the build **green**.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] flink1.14.4+iceberg0.13.1+hive-metastore3.1.2+minio(S3) error! [iceberg]

2023-10-10 Thread via GitHub


pvary commented on issue #4743:
URL: https://github.com/apache/iceberg/issues/4743#issuecomment-1754585935

   > @pvary I know the error is diff than the issue.
   
   Maybe opening another issue would have been better in this case.
   
   > Do we have document for Flink on how to configure Flink with Iceberg, Hive, and Minio. I am more interested in configuration part. Thanks!
   
   I do not think we have specific documentation for your case.
   We have general docs: https://iceberg.apache.org/docs/latest/flink/
   For the Hive catalog: https://iceberg.apache.org/docs/latest/flink/#hive-catalog and https://iceberg.apache.org/docs/latest/flink-connector/#table-managed-in-hive-catalog
   For S3 access: https://iceberg.apache.org/docs/latest/aws/#s3-fileio
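
   For reference, a Flink SQL sketch that ties those three docs together: a Hive catalog backed by S3FileIO pointing at MinIO. Hostnames, ports, and the warehouse path are placeholders to adapt to your setup.

   ```sql
   CREATE CATALOG hive_catalog WITH (
     'type' = 'iceberg',
     'catalog-type' = 'hive',
     'uri' = 'thrift://metastore-host:9083',            -- Hive Metastore (placeholder host)
     'warehouse' = 's3://warehouse/path',               -- placeholder bucket/path
     'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
     's3.endpoint' = 'http://minio-host:9000',          -- MinIO endpoint (placeholder)
     's3.path-style-access' = 'true'                    -- MinIO typically requires path-style access
   );
   ```

   The relevant AWS SDK region/credential settings still have to be available to the Flink cluster, e.g. via environment variables.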
   
   




Re: [I] flink1.14.4+iceberg0.13.1+hive-metastore3.1.2+minio(S3) error! [iceberg]

2023-10-10 Thread via GitHub


ramdas-jagtap commented on issue #4743:
URL: https://github.com/apache/iceberg/issues/4743#issuecomment-1754588737

   Thanks @pvary for sharing the docs.




[PR] Disable merge-commit and enforce linear history [iceberg-python]

2023-10-10 Thread via GitHub


Fokko opened a new pull request, #57:
URL: https://github.com/apache/iceberg-python/pull/57

   This keeps the git history clean.




Re: [I] Migrate Files using TestRule in dell package to Junit5 [iceberg]

2023-10-10 Thread via GitHub


nastra closed issue #7888: Migrate Files using TestRule in dell package to Junit5
URL: https://github.com/apache/iceberg/issues/7888




Re: [PR] Dell : Migrate Files using TestRule to Junit5. [iceberg]

2023-10-10 Thread via GitHub


nastra merged PR #8707:
URL: https://github.com/apache/iceberg/pull/8707




Re: [I] Failed to find data source: iceberg. Please find packages at [iceberg]

2023-10-10 Thread via GitHub


NhatDuy11 commented on issue #7268:
URL: https://github.com/apache/iceberg/issues/7268#issuecomment-1754691644

   Can someone tell me: if I am using Spark 2.4.5 and Scala 2.11.12, which version of Apache Iceberg should I use? Thank you very much, everyone!




Re: [I] java.lang.IllegalStateException: Connection pool shut down when refreshing table metadata on s3 [iceberg]

2023-10-10 Thread via GitHub


AkshayWise commented on issue #8601:
URL: https://github.com/apache/iceberg/issues/8601#issuecomment-1754706576

   @Kontinuation @stevenzwu I believe this fix was released in 1.4.0 last week, but I am still getting this error in Flink (1.15) Iceberg jobs:
   ```
   java.lang.IllegalStateException: Connection pool shut down
       at org.apache.http.util.Asserts.check(Asserts.java:34)
       at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:269)
       at software.amazon.awssdk.http.apache.internal.conn.ClientConnectionManagerFactory$DelegatingHttpClientConnectionManager.requestConnection(ClientConnectionManagerFactory.java:75)
       at software.amazon.awssdk.http.apache.internal.conn.ClientConnectionManagerFactory$InstrumentedHttpClientConnectionManager.requestConnection(ClientConnectionManagerFactory.java:57)
       at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176)
       at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
       at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
       at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
       at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
       at software.amazon.awssdk.http.apache.internal.impl.ApacheSdkHttpClient.execute(ApacheSdkHttpClient.java:72)
       at software.amazon.awssdk.http.apache.ApacheHttpClient.execute(ApacheHttpClient.java:254)
       at software.amazon.awssdk.http.apache.ApacheHttpClient.access$500(ApacheHttpClient.java:104)
       at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:231)
       at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:228)
       at software.amazon.awssdk.core.internal.util.MetricUtils.measureDurationUnsafe(MetricUtils.java:63)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.executeHttpRequest(MakeHttpRequestStage.java:77)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:56)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:39)
       at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
       at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
       at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
       at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:73)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:50)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:36)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
       at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
       at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
       at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
       at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingSt
   ```

[PR] Core: Use more permissive check when registering existing table [iceberg]

2023-10-10 Thread via GitHub


nastra opened a new pull request, #8759:
URL: https://github.com/apache/iceberg/pull/8759

   (no comment)




Re: [I] java.lang.IllegalStateException: Connection pool shut down when refreshing table metadata on s3 [iceberg]

2023-10-10 Thread via GitHub


nastra commented on issue #8601:
URL: https://github.com/apache/iceberg/issues/8601#issuecomment-1754735024

   @AkshayWise this fix didn't make it into 1.4.0 unfortunately




Re: [PR] Construct a writer tree [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #40:
URL: https://github.com/apache/iceberg-python/pull/40#discussion_r1351918311


##
pyiceberg/avro/resolver.py:
##
@@ -233,7 +255,107 @@ def skip(self, decoder: BinaryDecoder) -> None:
         pass
 
 
-class SchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
+class WriteSchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Writer]):
+    def schema(self, schema: Schema, expected_schema: Optional[IcebergType], result: Writer) -> Writer:
+        return result
+
+    def struct(self, struct: StructType, provided_struct: Optional[IcebergType], field_writers: List[Writer]) -> Writer:
+        if not isinstance(provided_struct, StructType):
+            raise ResolveError(f"File/write schema are not aligned for struct, got {provided_struct}")
+
+        provided_struct_positions: Dict[int, int] = {field.field_id: pos for pos, field in enumerate(provided_struct.fields)}
+
+        results: List[Tuple[Optional[int], Writer]] = []
+        iter(field_writers)
+
+        for pos, write_field in enumerate(struct.fields):
+            if write_field.field_id in provided_struct_positions:
+                results.append((provided_struct_positions[write_field.field_id], field_writers[pos]))
+            else:
+                # There is a default value
+                if isinstance(write_field, NestedField) and write_field.write_default is not None:
+                    # The field is not in the record, but there is a write default value
+                    default_writer = DefaultWriter(
+                        writer=visit(write_field.field_type, CONSTRUCT_WRITER_VISITOR), value=write_field.write_default

Review Comment:
   @rdblue Just to clarify, the type annotation here is not just a hint; it will be enforced by Pydantic. If you pass in something other than what the type allows, it will raise a Pydantic `ValidationError`. An assertion would be similar (but then it would be done in Python land instead of Rust 🦀).
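
   That Pydantic behavior can be demonstrated with a minimal, self-contained model. This is a hypothetical model written for illustration, not a class from pyiceberg itself.

   ```python
   from pydantic import BaseModel, ValidationError


   class BatchConfig(BaseModel):
       # Hypothetical model; pyiceberg's classes subclass pydantic's BaseModel similarly
       batch_size: int


   # The annotation is satisfied, so construction succeeds
   print(BatchConfig(batch_size=128).batch_size)

   # A value the annotation cannot accept is rejected at runtime, not just by a type checker
   try:
       BatchConfig(batch_size="not-a-number")
   except ValidationError as err:
       print("rejected:", err.errors()[0]["loc"])
   ```

   This is the difference from a bare `assert`: the check runs on every construction, even with assertions disabled.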






Re: [PR] Add logic for table format-version updates [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on PR #55:
URL: https://github.com/apache/iceberg-python/pull/55#issuecomment-1754783886

   @rdblue I agree with you there. I think we can still update the method name 
since it was just raising an exception.
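
   As a rough illustration of the kind of guard being discussed (a method whose job is essentially to validate and raise), here is a sketch. The function name and the exact rules are illustrative assumptions, not pyiceberg's actual API.

   ```python
   def update_format_version(current_version: int, new_version: int) -> int:
       """Validate a table format-version update; only upgrades to known versions are allowed."""
       if new_version not in (1, 2):
           # Hypothetical rule: only the format versions this sketch knows about
           raise ValueError(f"Unsupported format version: v{new_version}")
       if new_version < current_version:
           # Downgrades would strip metadata the newer version depends on
           raise ValueError(f"Cannot downgrade v{current_version} table to v{new_version}")
       return new_version


   print(update_format_version(1, 2))
   ```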




Re: [PR] Core: Allow missing object in ErrorResponse [iceberg]

2023-10-10 Thread via GitHub


amogh-jahagirdar commented on code in PR #8760:
URL: https://github.com/apache/iceberg/pull/8760#discussion_r1352021537


##
core/src/main/java/org/apache/iceberg/rest/responses/ErrorResponseParser.java:
##
@@ -76,17 +76,20 @@ public static ErrorResponse fromJson(JsonNode jsonNode) {
 jsonNode != null && jsonNode.isObject(),
 "Cannot parse error response from non-object value: %s",
 jsonNode);
-    Preconditions.checkArgument(jsonNode.has(ERROR), "Cannot parse missing field: error");
-    JsonNode error = JsonUtil.get(ERROR, jsonNode);
-    String message = JsonUtil.getStringOrNull(MESSAGE, error);
-    String type = JsonUtil.getStringOrNull(TYPE, error);
-    Integer code = JsonUtil.getIntOrNull(CODE, error);
-    List<String> stack = JsonUtil.getStringListOrNull(STACK, error);
-    return ErrorResponse.builder()
-        .withMessage(message)
-        .withType(type)
-        .responseCode(code)
-        .withStackTrace(stack)
-        .build();
+    if (jsonNode.has(ERROR)) {
+      JsonNode error = JsonUtil.get(ERROR, jsonNode);
+      String message = JsonUtil.getStringOrNull(MESSAGE, error);
+      String type = JsonUtil.getStringOrNull(TYPE, error);
+      Integer code = JsonUtil.getIntOrNull(CODE, error);
+      List<String> stack = JsonUtil.getStringListOrNull(STACK, error);
+      return ErrorResponse.builder()
+          .withMessage(message)
+          .withType(type)
+          .responseCode(code)
+          .withStackTrace(stack)
+          .build();
+    } else {
+      return ErrorResponse.builder().build();
+    }

Review Comment:
   I may be missing something, but this is `ErrorResponseParser`, no? I'd expect whatever JSON gets passed to this to have all of these details (message, type, code, stack). I just assumed the response model isn't marked as required in the REST spec because it depends on whether an error is thrown as part of the call. If it is thrown, I'd expect all these fields to be set. Maybe there's a better way to convey it in the spec.





Re: [PR] Core: Allow missing object in ErrorResponse [iceberg]

2023-10-10 Thread via GitHub


amogh-jahagirdar commented on code in PR #8760:
URL: https://github.com/apache/iceberg/pull/8760#discussion_r135149


##
core/src/main/java/org/apache/iceberg/rest/responses/ErrorResponse.java:
##
@@ -22,18 +22,17 @@
 import java.io.StringWriter;
 import java.util.Arrays;
 import java.util.List;
-import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
 import org.apache.iceberg.rest.RESTResponse;
 
 /** Standard response body for all API errors */
 public class ErrorResponse implements RESTResponse {
 
   private String message;
   private String type;
-  private int code;
+  private Integer code;

Review Comment:
   I think this still needs to be `int`, since `code` is required: https://github.com/apache/iceberg/blob/master/open-api/rest-catalog-open-api.yaml#L1108, which makes sense: we should always have some known status code.



##
core/src/main/java/org/apache/iceberg/rest/responses/ErrorResponseParser.java:
##
@@ -76,17 +76,20 @@ public static ErrorResponse fromJson(JsonNode jsonNode) {
 jsonNode != null && jsonNode.isObject(),
 "Cannot parse error response from non-object value: %s",
 jsonNode);
-    Preconditions.checkArgument(jsonNode.has(ERROR), "Cannot parse missing field: error");
-    JsonNode error = JsonUtil.get(ERROR, jsonNode);
-    String message = JsonUtil.getStringOrNull(MESSAGE, error);
-    String type = JsonUtil.getStringOrNull(TYPE, error);
-    Integer code = JsonUtil.getIntOrNull(CODE, error);
-    List<String> stack = JsonUtil.getStringListOrNull(STACK, error);
-    return ErrorResponse.builder()
-        .withMessage(message)
-        .withType(type)
-        .responseCode(code)
-        .withStackTrace(stack)
-        .build();
+    if (jsonNode.has(ERROR)) {
+      JsonNode error = JsonUtil.get(ERROR, jsonNode);
+      String message = JsonUtil.getStringOrNull(MESSAGE, error);
+      String type = JsonUtil.getStringOrNull(TYPE, error);
+      Integer code = JsonUtil.getIntOrNull(CODE, error);
+      List<String> stack = JsonUtil.getStringListOrNull(STACK, error);
+      return ErrorResponse.builder()
+          .withMessage(message)
+          .withType(type)
+          .responseCode(code)
+          .withStackTrace(stack)
+          .build();
+    } else {
+      return ErrorResponse.builder().build();
+    }

Review Comment:
   I may be missing something, but this is `ErrorResponseParser`, no? I'd expect whatever JSON gets passed to this to have the error. I just assumed the response model isn't marked as required in the REST spec because it depends on whether an error is thrown as part of the call. If it is thrown, I'd expect all these fields to be set. Maybe there's a better way to convey it in the spec.





[I] Flaky test/env TestFlinkParquetReader, TestFlinkParquetWriter, TestIcebergSourceBoundedSql [iceberg]

2023-10-10 Thread via GitHub


nk1506 opened a new issue, #8761:
URL: https://github.com/apache/iceberg/issues/8761

   ### Apache Iceberg version
   
   1.4.0 (latest release)
   
   ### Query engine
   
   Flink
   
   ### Please describe the bug 🐞
   
   Flink 1.16
   ```
   at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2188)
   at org.apache.flink.table.planner.delegation.DefaultExecutor.executeAsync(DefaultExecutor.java:95)
   at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeQueryOperation(TableEnvironmentImpl.java:884)
   ... 4 more

   Caused by:
   java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException
   at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
   at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
   at org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:292)
   ... 13 more

   Caused by:
   java.lang.IllegalArgumentException
   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
   at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:795)
   at org.apache.hadoop.io.Text.encode(Text.java:451)
   at org.apache.hadoop.io.Text.encode(Text.java:431)
   at org.apache.hadoop.io.Text.writeString(Text.java:480)
   at org.apache.hadoop.conf.Configuration.write(Configuration.java:2889)
   at org.apache.iceberg.hadoop.SerializableConfiguration.writeObject(SerializableConfiguration.java:38)
   at sun.reflect.GeneratedMethodAccessor87.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154)
   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
   at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:632)
   at org.apache.flink.util.InstantiationUtil.writeObjectToConfig(InstantiationUtil.java:548)
   at org.apache.flink.streaming.api.graph.StreamConfig.lambda$serializeAllConfigs$1(StreamConfig.java:195)
   at java.util.HashMap.forEach(HashMap.java:1290)
   at org.apache.flink.streaming.api.graph.StreamConfig.serializeAllConfigs(StreamConfig.java:192)
   at org.apache.flink.streaming.api.graph.StreamConfig.lambda$triggerSerializationAndReturnFuture$0(StreamConfig.java:169)
   at java.util.concurrent.CompletableFuture.uniAccept
   ```

Re: [I] Some questions about Iceberg's capabilities in Flink [iceberg]

2023-10-10 Thread via GitHub


jonathf commented on issue #8754:
URL: https://github.com/apache/iceberg/issues/8754#issuecomment-1754998172

   Okay, that explains it.
   
   Last two questions:
   * Will #8553 support some sort of ordering guarantee?
   * Is the streaming feature associated with Iceberg's tagging features? This might be a weird question, but I have heard some people mention it, though I cannot see it written anywhere.




Re: [PR] Add logic for table format-version updates [iceberg-python]

2023-10-10 Thread via GitHub


Fokko merged PR #55:
URL: https://github.com/apache/iceberg-python/pull/55




Re: [PR] Core: Allow missing object in ErrorResponse [iceberg]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #8760:
URL: https://github.com/apache/iceberg/pull/8760#discussion_r1352239032


##
core/src/main/java/org/apache/iceberg/rest/responses/ErrorResponseParser.java:
##
@@ -76,17 +76,20 @@ public static ErrorResponse fromJson(JsonNode jsonNode) {
 jsonNode != null && jsonNode.isObject(),
 "Cannot parse error response from non-object value: %s",
 jsonNode);
-    Preconditions.checkArgument(jsonNode.has(ERROR), "Cannot parse missing field: error");
-    JsonNode error = JsonUtil.get(ERROR, jsonNode);
-    String message = JsonUtil.getStringOrNull(MESSAGE, error);
-    String type = JsonUtil.getStringOrNull(TYPE, error);
-    Integer code = JsonUtil.getIntOrNull(CODE, error);
-    List<String> stack = JsonUtil.getStringListOrNull(STACK, error);
-    return ErrorResponse.builder()
-        .withMessage(message)
-        .withType(type)
-        .responseCode(code)
-        .withStackTrace(stack)
-        .build();
+    if (jsonNode.has(ERROR)) {
+      JsonNode error = JsonUtil.get(ERROR, jsonNode);
+      String message = JsonUtil.getStringOrNull(MESSAGE, error);
+      String type = JsonUtil.getStringOrNull(TYPE, error);
+      Integer code = JsonUtil.getIntOrNull(CODE, error);
+      List<String> stack = JsonUtil.getStringListOrNull(STACK, error);
+      return ErrorResponse.builder()
+          .withMessage(message)
+          .withType(type)
+          .responseCode(code)
+          .withStackTrace(stack)
+          .build();
+    } else {
+      return ErrorResponse.builder().build();
+    }

Review Comment:
   We can also update the spec, but it looks like not all systems (EMR) send the full message. The question is: do we want to fail when parsing the error message, or just return an empty message (or throw an exception somewhere else)?
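
   The permissive behavior under discussion can be sketched in Python. This is a hypothetical stand-in for the Java parser, not Iceberg code: when the `error` object is absent, fall back to an empty response instead of failing.

   ```python
   import json


   def parse_error_response(payload: str) -> dict:
       """Leniently parse a REST error body; tolerate a missing "error" object."""
       node = json.loads(payload)
       if not isinstance(node, dict):
           raise ValueError(f"Cannot parse error response from non-object value: {node}")
       # Missing "error" yields an empty response rather than an exception
       error = node.get("error") or {}
       return {
           "message": error.get("message"),
           "type": error.get("type"),
           "code": error.get("code"),
           "stack": error.get("stack"),
       }


   # A body without the "error" object parses to an empty response instead of failing
   print(parse_error_response("{}"))
   ```

   The trade-off is exactly the one raised above: lenient parsing avoids masking the original HTTP failure with a parse error, at the cost of sometimes surfacing an error with no message or code.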





Re: [PR] Core: Allow missing object in ErrorResponse [iceberg]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #8760:
URL: https://github.com/apache/iceberg/pull/8760#discussion_r1352243492


##
core/src/main/java/org/apache/iceberg/rest/responses/ErrorResponse.java:
##
@@ -22,18 +22,17 @@
 import java.io.StringWriter;
 import java.util.Arrays;
 import java.util.List;
-import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
 import org.apache.iceberg.rest.RESTResponse;
 
 /** Standard response body for all API errors */
 public class ErrorResponse implements RESTResponse {
 
   private String message;
   private String type;
-  private int code;
+  private Integer code;

Review Comment:
   Hmm, we only require `code` to be there; we also don't check for `message` and `type`.





[I] is there anyway to rewrite onto a specific branch? [iceberg]

2023-10-10 Thread via GitHub


zinking opened a new issue, #8762:
URL: https://github.com/apache/iceberg/issues/8762

   ### Query engine
   
   
   Spark
   
   ### Question
   
   I thought this might do it:
   
   ```
   val table = s"iceberg_catalog.${tableIdentifier}.branch_${branch}"
   val t = Spark3Util.loadIcebergTable(spark, table)
   val start = System.currentTimeMillis()
   try {
     SparkActions.get()
       .rewriteDataFiles(t)
       .skipPlanDeletes(skipPlanDeletes)
       .filter(Expressions.equal("ds", 20230923))
       .execute()
   ```
   I was assuming the data would be read from the branch and the rewrite result written back onto the branch, but it is not; the change is still visible on main.




Re: [PR] Construct a writer tree [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #40:
URL: https://github.com/apache/iceberg-python/pull/40#discussion_r1352295604


##
pyiceberg/avro/resolver.py:
##
@@ -233,7 +255,107 @@ def skip(self, decoder: BinaryDecoder) -> None:
 pass
 
 
-class SchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
+class WriteSchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Writer]):
+def schema(self, schema: Schema, expected_schema: Optional[IcebergType], 
result: Writer) -> Writer:
+return result
+
+def struct(self, struct: StructType, provided_struct: 
Optional[IcebergType], field_writers: List[Writer]) -> Writer:
+if not isinstance(provided_struct, StructType):
+raise ResolveError(f"File/write schema are not aligned for struct, 
got {provided_struct}")
+
+provided_struct_positions: Dict[int, int] = {field.field_id: pos for 
pos, field in enumerate(provided_struct.fields)}
+
+results: List[Tuple[Optional[int], Writer]] = []
+iter(field_writers)
+
+for pos, write_field in enumerate(struct.fields):
+if write_field.field_id in provided_struct_positions:
+
results.append((provided_struct_positions[write_field.field_id], 
field_writers[pos]))
+else:
+# There is a default value
+if isinstance(write_field, NestedField) and 
write_field.write_default is not None:
+# The field is not in the record, but there is a write 
default value
+default_writer = DefaultWriter(
+writer=visit(write_field.field_type, 
CONSTRUCT_WRITER_VISITOR), value=write_field.write_default
+)
+results.append((None, default_writer))
+elif write_field.required:
+raise ValueError(f"Field is required, and there is no 
write default: {write_field}")
+else:
+results.append((pos, NoneWriter()))
+
+return StructWriter(field_writers=tuple(results))
+
+def field(self, field: NestedField, expected_field: Optional[IcebergType], 
field_writer: Writer) -> Writer:
+return field_writer if field.required else OptionWriter(field_writer)
+
+def list(self, list_type: ListType, expected_list: Optional[IcebergType], 
element_reader: Writer) -> Writer:
+if expected_list and not isinstance(expected_list, ListType):
+raise ResolveError(f"File/read schema are not aligned for list, 
got {expected_list}")

Review Comment:
   Created an issue for this: https://github.com/apache/iceberg-python/issues/58






[I] Pass in the correct type for the VisitorWithParent [iceberg-python]

2023-10-10 Thread via GitHub


Fokko opened a new issue, #58:
URL: https://github.com/apache/iceberg-python/issues/58

   ### Feature Request / Improvement
   
   So we can avoid the checks, see: 
https://github.com/apache/iceberg-python/pull/40#discussion_r1349776857





Re: [PR] Construct a writer tree [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #40:
URL: https://github.com/apache/iceberg-python/pull/40#discussion_r1352299345


##
pyiceberg/avro/resolver.py:
##
@@ -192,7 +195,26 @@ def visit_binary(self, binary_type: BinaryType) -> Writer:
 return BinaryWriter()
 
 
-def resolve(
+CONSTRUCT_WRITER_VISITOR = ConstructWriter()
+
+
+def resolve_writer(
+struct_schema: Union[Schema, IcebergType],
+write_schema: Union[Schema, IcebergType],
+) -> Writer:
+"""Resolve the file and read schema to produce a reader.
+
+Args:
+struct_schema (Schema | IcebergType): The schema of the Avro file.
+write_schema (Schema | IcebergType): The requested read schema which 
is equal, subset or superset of the file schema.

Review Comment:
   This is very subjective :D
   
   >  I think the names are still confusing here. When I see data_schema I 
would expect it to be the schema of the data that is being written.
   
   For me, I would assume that the `data_schema` is in memory. `record_schema` 
and `file_schema` sounds the most natural to me.






Re: [PR] Disable merge-commit and enforce linear history [iceberg-python]

2023-10-10 Thread via GitHub


liurenjie1024 commented on code in PR #57:
URL: https://github.com/apache/iceberg-python/pull/57#discussion_r1352301058


##
.asf.yaml:
##
@@ -28,6 +28,16 @@ github:
 - apache
 - hacktoberfest
 - pyiceberg
+  enabled_merge_buttons:
+merge: false
+squash: true
+rebase: trueB

Review Comment:
   ```suggestion
   rebase: true
   ```






Re: [PR] Construct a writer tree [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #40:
URL: https://github.com/apache/iceberg-python/pull/40#discussion_r1352300965


##
pyiceberg/avro/resolver.py:
##
@@ -192,7 +194,28 @@ def visit_binary(self, binary_type: BinaryType) -> Writer:
 return BinaryWriter()
 
 
-def resolve(
+CONSTRUCT_WRITER_VISITOR = ConstructWriter()
+
+
+def resolve_writer(
+data_schema: Union[Schema, IcebergType],
+write_schema: Union[Schema, IcebergType],
+) -> Writer:
+"""Resolve the file and read schema to produce a reader.
+
+Args:
+data_schema (Schema | IcebergType): The schema of the Avro file.
+write_schema (Schema | IcebergType): The requested read schema which 
is equal, subset or superset of the file schema.
+
+Raises:
+NotImplementedError: If attempting to resolve an unrecognized object 
type.
+"""
+if write_schema == data_schema:
+return construct_writer(write_schema)
+return visit_with_partner(write_schema, data_schema, 
WriteSchemaResolver(), SchemaPartnerAccessor())  # type: ignore

Review Comment:
   Yes, this is because the arguments to the function feel most natural from 
left to right. You have the data in some kind of schema, and you want to 
project that to some write schema.






Re: [PR] Construct a writer tree [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #40:
URL: https://github.com/apache/iceberg-python/pull/40#discussion_r1352299345


##
pyiceberg/avro/resolver.py:
##
@@ -192,7 +195,26 @@ def visit_binary(self, binary_type: BinaryType) -> Writer:
 return BinaryWriter()
 
 
-def resolve(
+CONSTRUCT_WRITER_VISITOR = ConstructWriter()
+
+
+def resolve_writer(
+struct_schema: Union[Schema, IcebergType],
+write_schema: Union[Schema, IcebergType],
+) -> Writer:
+"""Resolve the file and read schema to produce a reader.
+
+Args:
+struct_schema (Schema | IcebergType): The schema of the Avro file.
+write_schema (Schema | IcebergType): The requested read schema which 
is equal, subset or superset of the file schema.

Review Comment:
   Missed this one. Thanks and this is very subjective :D
   
   >  I think the names are still confusing here. When I see data_schema I 
would expect it to be the schema of the data that is being written.
   
   For me, I would assume that the `data_schema` is in memory. `record_schema` 
and `file_schema` sounds the most natural to me.






Re: [PR] Construct a writer tree [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #40:
URL: https://github.com/apache/iceberg-python/pull/40#discussion_r1352318212


##
pyiceberg/avro/resolver.py:
##
@@ -233,7 +256,93 @@ def skip(self, decoder: BinaryDecoder) -> None:
 pass
 
 
-class SchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
+class WriteSchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Writer]):
+def schema(self, write_schema: Schema, data_schema: Optional[IcebergType], 
result: Writer) -> Writer:
+return result
+
+def struct(self, write_schema: StructType, data_struct: 
Optional[IcebergType], field_writers: List[Writer]) -> Writer:
+if not isinstance(data_struct, StructType):
+raise ResolveError(f"File/write schema are not aligned for struct, 
got {data_struct}")
+
+data_positions: Dict[int, int] = {field.field_id: pos for pos, field 
in enumerate(data_struct.fields)}
+results: List[Tuple[Optional[int], Writer]] = []
+
+for writer, write_field in zip(field_writers, write_schema.fields):
+if write_field.field_id in data_positions:
+results.append((data_positions[write_field.field_id], writer))
+else:
+# There is a default value
+if write_field.write_default is not None:
+# The field is not in the record, but there is a write 
default value
+results.append((None, DefaultWriter(writer=writer, 
value=write_field.write_default)))  # type: ignore
+elif write_field.required:
+raise ValueError(f"Field is required, and there is no 
write default: {write_field}")
+
+return StructWriter(field_writers=tuple(results))
+
+def field(self, write_field: NestedField, data_type: 
Optional[IcebergType], field_writer: Writer) -> Writer:
+return field_writer if write_field.required else 
OptionWriter(field_writer)
+
+def list(self, write_list_type: ListType, write_list: 
Optional[IcebergType], element_reader: Writer) -> Writer:

Review Comment:
   Nice, thanks!






Re: [PR] Construct a writer tree [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #40:
URL: https://github.com/apache/iceberg-python/pull/40#discussion_r1352317115


##
pyiceberg/avro/resolver.py:
##
@@ -233,7 +256,93 @@ def skip(self, decoder: BinaryDecoder) -> None:
 pass
 
 
-class SchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
+class WriteSchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Writer]):
+def schema(self, write_schema: Schema, data_schema: Optional[IcebergType], 
result: Writer) -> Writer:
+return result
+
+def struct(self, write_schema: StructType, data_struct: 
Optional[IcebergType], field_writers: List[Writer]) -> Writer:
+if not isinstance(data_struct, StructType):
+raise ResolveError(f"File/write schema are not aligned for struct, 
got {data_struct}")
+
+data_positions: Dict[int, int] = {field.field_id: pos for pos, field 
in enumerate(data_struct.fields)}
+results: List[Tuple[Optional[int], Writer]] = []
+
+for writer, write_field in zip(field_writers, write_schema.fields):
+if write_field.field_id in data_positions:
+results.append((data_positions[write_field.field_id], writer))
+else:
+# There is a default value
+if write_field.write_default is not None:
+# The field is not in the record, but there is a write 
default value
+results.append((None, DefaultWriter(writer=writer, 
value=write_field.write_default)))  # type: ignore
+elif write_field.required:
+raise ValueError(f"Field is required, and there is no 
write default: {write_field}")

Review Comment:
   I think this is correct, and let me illustrate this with an example:
   
   
![image](https://github.com/apache/iceberg-python/assets/1134248/df2e5350-dbdc-493c-b6f2-4e409464d339)
   
   All the three branches:
   
   - `if`: The field is in the `record_schema` and is part of the write schema. 
It will produce a `(0, IntegerWriter())` for the `0: status`.
   - `elif`: The field is not in the `record_schema`, but has a write default 
(we use this to write the `block_size_in_bytes` since it is required):
   
![image](https://github.com/apache/iceberg-python/assets/1134248/0491fcce-43da-4ec5-b747-50aac3908f85)
   - `else`: The else is not there anymore, and this branch is taken for the 
`sequence_number` and `file_sequence_number` where the field is part of the 
`record_schema`, but not part of the `file_schema`. Therefore we don't need to 
write any null bytes. For the read-case, this is different, and we would need a 
reader since we need to skip over the data in the file, but for the write case, 
we can just ignore certain fields because they are not part of the 
`file_schema`.
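
The three branches described above can be sketched as a small, self-contained example (hypothetical simplified types, not the PyIceberg visitor classes; fields are matched by field id):

```python
from typing import Dict, List, Optional, Tuple

# Each file-schema field: (field_id, name, required, write_default).
FileField = Tuple[int, str, bool, Optional[object]]

def resolve_struct(
    file_fields: List[FileField], record_field_ids: List[int]
) -> List[Tuple[Optional[int], str]]:
    # Position of each record field, keyed by field id.
    positions: Dict[int, int] = {fid: pos for pos, fid in enumerate(record_field_ids)}
    results: List[Tuple[Optional[int], str]] = []
    for fid, name, required, default in file_fields:
        if fid in positions:
            # Field is present in the record: write it from that position.
            results.append((positions[fid], name))
        elif default is not None:
            # Not in the record, but it has a write default.
            results.append((None, name))
        elif required:
            raise ValueError(f"Field is required, and there is no write default: {name}")
        # Optional field missing from the record: nothing to write.
    return results

print(resolve_struct(
    [(1, "status", True, None), (2, "block_size_in_bytes", True, 67108864), (3, "opt", False, None)],
    [1],
))
# → [(0, 'status'), (None, 'block_size_in_bytes')]
```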
   









[PR] Build: Fix compiler warnings [iceberg]

2023-10-10 Thread via GitHub


nk1506 opened a new pull request, #8763:
URL: https://github.com/apache/iceberg/pull/8763

   Fixing a few Xlint-related warnings. 
   
   Before:
   
   https://github.com/apache/iceberg/assets/4146188/c3666c0d-f879-4dbc-8fe4-89ab91b93079
   
   After:
   
   https://github.com/apache/iceberg/assets/4146188/2e107c9d-796c-4556-b46b-0b49919ecd9b
   
   





Re: [PR] Construct a writer tree [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #40:
URL: https://github.com/apache/iceberg-python/pull/40#discussion_r1352346498


##
pyiceberg/avro/resolver.py:
##
@@ -233,7 +256,93 @@ def skip(self, decoder: BinaryDecoder) -> None:
 pass
 
 
-class SchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
+class WriteSchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Writer]):
+def schema(self, write_schema: Schema, data_schema: Optional[IcebergType], 
result: Writer) -> Writer:
+return result
+
+def struct(self, write_schema: StructType, data_struct: 
Optional[IcebergType], field_writers: List[Writer]) -> Writer:
+if not isinstance(data_struct, StructType):
+raise ResolveError(f"File/write schema are not aligned for struct, 
got {data_struct}")
+
+data_positions: Dict[int, int] = {field.field_id: pos for pos, field 
in enumerate(data_struct.fields)}
+results: List[Tuple[Optional[int], Writer]] = []
+
+for writer, write_field in zip(field_writers, write_schema.fields):
+if write_field.field_id in data_positions:
+results.append((data_positions[write_field.field_id], writer))
+else:
+# There is a default value
+if write_field.write_default is not None:
+# The field is not in the record, but there is a write 
default value
+results.append((None, DefaultWriter(writer=writer, 
value=write_field.write_default)))  # type: ignore
+elif write_field.required:
+raise ValueError(f"Field is required, and there is no 
write default: {write_field}")

Review Comment:
   Yes, you're right! This would apply to `file_ordinal` and `sort_columns`:
   
![image](https://github.com/apache/iceberg-python/assets/1134248/5f9a164a-bcfb-459a-bb65-c003203ba462)
   
   However, we don't write those. Updated the code and added a test-case:
   
   ```python
   def test_writer_missing_optional_in_read_schema() -> None:
   actual = resolve_writer(
   record_schema=Schema(),
   file_schema=Schema(
   NestedField(field_id=1, name="str", type=StringType(), 
required=False),
   ),
   )
   
    expected = StructWriter(field_writers=((None, OptionWriter(option=OptionWriter(option=StringWriter()))),))
   
   assert actual == expected
   ```
   






Re: [PR] Disable merge-commit and enforce linear history [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #57:
URL: https://github.com/apache/iceberg-python/pull/57#discussion_r1352350208


##
.asf.yaml:
##
@@ -28,6 +28,16 @@ github:
 - apache
 - hacktoberfest
 - pyiceberg
+  enabled_merge_buttons:
+merge: false
+squash: true
+rebase: trueB

Review Comment:
   Oops, nice catch @liurenjie1024 ! 🙌 









Re: [PR] Build: Fix minor compilation warnings [iceberg]

2023-10-10 Thread via GitHub


nastra merged PR #8758:
URL: https://github.com/apache/iceberg/pull/8758





Re: [I] Optimize metadata tables? [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on issue #8714:
URL: https://github.com/apache/iceberg/issues/8714#issuecomment-1755285785

   @RussellSpitzer and @aokolnychyi: WDYT?  





Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352477202


##
format/spec.md:
##
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may 
ignore them.
+Each table snapshot may be associated with at most one partition statistic 
file.
+A writer can optionally write the partition statistics file during each write 
operation, and 
+it must be registered in the table metadata file to be considered as a valid 
statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct 
with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+||||--|-|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the 
partition statistics file. |
+
+ Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in 
the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` 
field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+||||--|-|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data 
tuple, schema based on the unified partition type considering all specs in a 
table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of 
records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data 
files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | 
Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | 
Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count 
of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | 
Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count 
of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate 
count of records in a partition after applying the delete files if any |

Review Comment:
   You are right; that is why the field is kept optional in the schema. 
   
   The implementation will not populate it by default (this can be controlled 
by a property or by the way of writing; for example, an async write can 
compute it, but incremental sync writes cannot).
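
As an illustration of what such a statistics file would contain, here is a minimal sketch of aggregating per-partition stats and sorting the rows ascending with NULLs first, as the proposed spec requires (hypothetical in-memory data, not the Iceberg implementation):

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (partition_value, data_record_count, data_file_size_in_bytes) per data file.
files: List[Tuple[Optional[str], int, int]] = [
    ("2023-10-02", 10, 100),
    ("2023-10-01", 5, 50),
    (None, 2, 20),
    ("2023-10-01", 7, 70),
]

# Aggregate record count, data file count and total bytes per partition tuple.
stats: Dict[Optional[str], List[int]] = defaultdict(lambda: [0, 0, 0])
for part, records, size in files:
    stats[part][0] += records
    stats[part][1] += 1
    stats[part][2] += size

# Sort by partition ascending, NULL FIRST.
rows = sorted(stats.items(), key=lambda kv: (kv[0] is not None, kv[0] or ""))
print(rows)
# → [(None, [2, 1, 20]), ('2023-10-01', [12, 2, 120]), ('2023-10-02', [10, 1, 100])]
```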






Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352482149


##
format/spec.md:
##
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may 
ignore them.
+Each table snapshot may be associated with at most one partition statistic 
file.
+A writer can optionally write the partition statistics file during each write 
operation, and 
+it must be registered in the table metadata file to be considered as a valid 
statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct 
with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+||||--|-|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the 
partition statistics file. |
+
+ Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in 
the default data file format of the table (for example, Parquet or ORC).

Review Comment:
   Russell commented that the format type should be mentioned explicitly. 
   
   I have removed the word "default" and reworded it a bit. The implementation 
can decide whether to use the table's default format or the one specified in 
a table property.






Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352483805


##
format/spec.md:
##
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may 
ignore them.
+Each table snapshot may be associated with at most one partition statistic 
file.
+A writer can optionally write the partition statistics file during each write 
operation, and 

Review Comment:
   Added, with some rewording.






Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352485101


##
format/spec.md:
##
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). 

Review Comment:
   True, changed to keep capital only for headers. 






Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352490767


##
format/spec.md:
##
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may 
ignore them.
+Each table snapshot may be associated with at most one partition statistic 
file.
+A writer can optionally write the partition statistics file during each write 
operation, and 
+it must be registered in the table metadata file to be considered as a valid 
statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct 
with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+||||--|-|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the 
partition statistics file. |
+
+ Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in 
the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` 
field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+||||--|-|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data 
tuple, schema based on the unified partition type considering all specs in a 
table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   We can discuss the community's interest in synchronous writes. 
   Some of them might be interested. 
   
   I agree that we should first go with an async implementation to make things 
easier. 
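The "ascending with NULL FIRST" ordering required by the quoted spec text can be sketched as follows (a hedged illustration in plain Python, not Iceberg code):

```python
# Sort partition-stats rows ascending by the partition tuple, NULLs first.
# Each row is (partition_tuple, spec_id); None stands in for a NULL value.

def null_first_key(row):
    partition, _spec_id = row
    # (0, None) sorts before any (1, value), so NULLs come first per field.
    return tuple((0, None) if v is None else (1, v) for v in partition)

rows = [((2023, "b"), 0), ((None, "a"), 0), ((2022, None), 1)]
rows.sort(key=null_first_key)
# rows is now: [((None, "a"), 0), ((2022, None), 1), ((2023, "b"), 0)]
```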






Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352492103


##
format/spec.md:
##
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may 
ignore them.
+Each table snapshot may be associated with at most one partition statistic 
file.
+A writer can optionally write the partition statistics file during each write 
operation, and 
+it must be registered in the table metadata file to be considered as a valid 
statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct 
with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+||||--|-|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |

Review Comment:
   ok. Updated as suggested. 
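For illustration, the quoted `partition-statistics` field could appear in a table metadata file like this (a hedged sketch; the path and id values are made up):

```json
{
  "partition-statistics": [
    {
      "snapshot-id": 3055729675574597004,
      "statistics-file-path": "s3://bucket/db/table/metadata/partition-stats-snap-3055729675574597004.parquet",
      "file-size-in-bytes": 43697
    }
  ]
}
```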






Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1755391052

   @aokolnychyi: Thanks for the detailed review and also going through the POC 
PRs. 
   I have addressed all the comments. Please have a look again.  





Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352500873


##
format/spec.md:
##
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may 
ignore them.
+Each table snapshot may be associated with at most one partition statistic 
file.
+A writer can optionally write the partition statistics file during each write 
operation, and 
+it must be registered in the table metadata file to be considered as a valid 
statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct 
with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+||||--|-|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the 
partition statistics file. |
+
+ Partition Statistics file

Review Comment:
   yes. updated. 
   
   Also updated the header of Table statistics -> Table Statistics 






Re: [I] Optimize metadata tables? [iceberg]

2023-10-10 Thread via GitHub


RussellSpitzer commented on issue #8714:
URL: https://github.com/apache/iceberg/issues/8714#issuecomment-1755413260

   I don't see any particular reason but I also don't see any reason to change 
an existing public api here. 





Re: [I] Optimize metadata tables? [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on issue #8714:
URL: https://github.com/apache/iceberg/issues/8714#issuecomment-1755421403

   > I don't see any particular reason but I also don't see any reason to 
change an existing public api here.
   
   🙃🙃🙃





Re: [I] Optimize metadata tables? [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on issue #8714:
URL: https://github.com/apache/iceberg/issues/8714#issuecomment-1755423504

   > I don't see any particular reason but I also don't see any reason to 
change an existing public api here.
   
   My case for removing it is to keep things simple, with fewer metadata tables 
for users to understand and remember. 





Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1352528356


##
format/spec.md:
##
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may 
ignore them.
+Each table snapshot may be associated with at most one partition statistic 
file.
+A writer can optionally write the partition statistics file during each write 
operation, and 
+it must be registered in the table metadata file to be considered as a valid 
statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct 
with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+||||--|-|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the 
partition statistics file. |
+
+ Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in 
the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` 
field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+||||--|-|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data 
tuple, schema based on the unified partition type considering all specs in a 
table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |

Review Comment:
   Also, Trino is currently writing Puffin in both sync and async ways. Dremio 
is also interested in sync stats. 






Re: [PR] Add ASF DOAP rdf file [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on code in PR #8586:
URL: https://github.com/apache/iceberg/pull/8586#discussion_r1352566918


##
doap.rdf:
##
@@ -0,0 +1,55 @@
+
+
+http://usefulinc.com/ns/doap#"; 
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"; 
+ xmlns:asfext="http://projects.apache.org/ns/asfext#";
+ xmlns:foaf="http://xmlns.com/foaf/0.1/";>
+
+  https://iceberg.apache.org";>
+2023-09-14
+https://spdx.org/licenses/Apache-2.0"; />
+Apache Iceberg
+https://iceberg.apache.org"; />
+https://iceberg.apache.org"; />
+Iceberg is a high-performance format for huge analytic 
tables.
+Iceberg brings the reliability and simplicity of SQL tables 
to big data, while making it possible for engines like Spark, Trino, Flink, 
Presto, Hive and Impala to safely work with the same tables, at the same 
time.
+https://github.com/apache/iceberg/issues"; />
+https://iceberg.apache.org/community/"; />
+https://iceberg.apache.org/releases/"; />
+Java
+Python

Review Comment:
   A DOAP file only accepts one Git repository, so I put 
`https://github.com/apache/iceberg`.






Re: [PR] Add ASF DOAP rdf file [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on code in PR #8586:
URL: https://github.com/apache/iceberg/pull/8586#discussion_r1352567808


##
doap.rdf:
##
@@ -0,0 +1,55 @@
+
+
+http://usefulinc.com/ns/doap#"; 
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"; 
+ xmlns:asfext="http://projects.apache.org/ns/asfext#";
+ xmlns:foaf="http://xmlns.com/foaf/0.1/";>
+
+  https://iceberg.apache.org";>
+2023-09-14
+https://spdx.org/licenses/Apache-2.0"; />
+Apache Iceberg
+https://iceberg.apache.org"; />
+https://iceberg.apache.org"; />
+Iceberg is a high-performance format for huge analytic 
tables.
+Iceberg brings the reliability and simplicity of SQL tables 
to big data, while making it possible for engines like Spark, Trino, Flink, 
Presto, Hive and Impala to safely work with the same tables, at the same 
time.
+https://github.com/apache/iceberg/issues"; />
+https://iceberg.apache.org/community/"; />
+https://iceberg.apache.org/releases/"; />
+Java
+Python
+https://projects.apache.org/category/big-data"; />
+https://projects.apache.org/category/database"; />
+https://projects.apache.org/category/data-engineering"; />
+
+  
+1.3.1

Review Comment:
   I updated to 1.4.0 (NB: only one release is accepted in the DOAP, the 
latest).






Re: [PR] Add ASF DOAP rdf file [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on code in PR #8586:
URL: https://github.com/apache/iceberg/pull/8586#discussion_r1352568422


##
doap.rdf:
##
@@ -0,0 +1,55 @@
+
+
+http://usefulinc.com/ns/doap#"; 
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"; 
+ xmlns:asfext="http://projects.apache.org/ns/asfext#";
+ xmlns:foaf="http://xmlns.com/foaf/0.1/";>
+
+  https://iceberg.apache.org";>
+2023-09-14
+https://spdx.org/licenses/Apache-2.0"; />
+Apache Iceberg
+https://iceberg.apache.org"; />
+https://iceberg.apache.org"; />
+Iceberg is a high-performance format for huge analytic 
tables.
+Iceberg brings the reliability and simplicity of SQL tables 
to big data, while making it possible for engines like Spark, Trino, Flink, 
Presto, Hive and Impala to safely work with the same tables, at the same 
time.
+https://github.com/apache/iceberg/issues"; />
+https://iceberg.apache.org/community/"; />
+https://iceberg.apache.org/releases/"; />
+Java
+Python

Review Comment:
   I added all languages in the DOAP, and only the "main" repository.






Re: [PR] Build: Bump slf4j from 1.7.36 to 2.0.9 [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on PR #8737:
URL: https://github.com/apache/iceberg/pull/8737#issuecomment-1755484959

   @dependabot rebase





Re: [PR] Build: Bump slf4j from 1.7.36 to 2.0.9 [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on PR #8737:
URL: https://github.com/apache/iceberg/pull/8737#issuecomment-1755484513

   It should work with most of the engines.





Re: [PR] Build: Bump slf4j from 1.7.36 to 2.0.9 [iceberg]

2023-10-10 Thread via GitHub


dependabot[bot] commented on PR #8737:
URL: https://github.com/apache/iceberg/pull/8737#issuecomment-1755485022

   Sorry, only users with push access can use that command.





Re: [I] Upgrade to Apache Arrow 13.0.0 [iceberg]

2023-10-10 Thread via GitHub


Fokko commented on issue #8764:
URL: https://github.com/apache/iceberg/issues/8764#issuecomment-1755497790

   Curious why this hasn't been picked up by dependabot.





Re: [I] Upgrade to Apache Arrow 13.0.0 [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on issue #8764:
URL: https://github.com/apache/iceberg/issues/8764#issuecomment-1755501456

   @Fokko I found a few dependencies that were not detected by dependabot. I'm 
doing the updates and checking why dependabot didn't find them :) 





Re: [I] Investigate why dependabot didn't detect upgrades [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on issue #8764:
URL: https://github.com/apache/iceberg/issues/8764#issuecomment-1755583628

   @snazy: Hi, have you already analyzed or do you have any info on these 
dependabot + version catalog problems?  
   Would you recommend [renovateBot](https://github.com/renovatebot) to address 
this? 





Re: [I] Investigate why dependabot didn't detect upgrades [iceberg]

2023-10-10 Thread via GitHub


Fokko commented on issue #8764:
URL: https://github.com/apache/iceberg/issues/8764#issuecomment-1755598241

   I think this is because we limit it to 5 PRs: 
https://github.com/apache/iceberg/blob/master/.github/dependabot.yml#L32
   
   It looks like all five are open: 
https://github.com/apache/iceberg/pulls?q=is%3Apr+is%3Aopen+label%3Adependencies
   I would be in favor of just removing this limit.
   
   I would just set this to 
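For reference, the setting under discussion is dependabot's `open-pull-requests-limit`; a hedged sketch of the relevant `.github/dependabot.yml` fragment (the value `10` is an illustrative assumption, not the number from this thread):

```yaml
version: 2
updates:
  - package-ecosystem: "gradle"
    directory: "/"
    schedule:
      interval: "weekly"
    # Default is 5 open PRs; raise the cap here (setting 0 disables version updates)
    open-pull-requests-limit: 10
```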





Re: [PR] Core: Allow missing object in ErrorResponse [iceberg]

2023-10-10 Thread via GitHub


Fokko commented on PR #8760:
URL: https://github.com/apache/iceberg/pull/8760#issuecomment-1755602836

   Let's do this the other way around





Re: [PR] Core: Allow missing object in ErrorResponse [iceberg]

2023-10-10 Thread via GitHub


Fokko closed pull request #8760: Core: Allow missing object in ErrorResponse
URL: https://github.com/apache/iceberg/pull/8760





[PR] Open-API: Make error required [iceberg]

2023-10-10 Thread via GitHub


Fokko opened a new pull request, #8765:
URL: https://github.com/apache/iceberg/pull/8765

   I think we want to make `error` required, otherwise it would just be an 
empty document `{}`.
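In OpenAPI terms, the change amounts to listing `error` under `required` (a hedged sketch; the exact schema and ref names in the REST spec may differ):

```yaml
IcebergErrorResponse:
  type: object
  required:
    - error
  properties:
    error:
      $ref: '#/components/schemas/ErrorModel'
```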





Re: [I] Investigate why dependabot didn't detect upgrades [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on issue #8764:
URL: https://github.com/apache/iceberg/issues/8764#issuecomment-1755627197

   @Fokko yes, I think we hit the 5-PR limit. +1 to raising it. I'm doing it in 
a PR attached to this issue.





Re: [PR] Docs: Fix missing semicolons in SQL snippets. [iceberg]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #8748:
URL: https://github.com/apache/iceberg/pull/8748#discussion_r1352755607


##
docs/spark-getting-started.md:
##
@@ -69,7 +69,7 @@ To create your first Iceberg table in Spark, use the 
`spark-sql` shell or `spark
 
 ```sql
 -- local is the path-based catalog defined above
-CREATE TABLE local.db.table (id bigint, data string) USING iceberg
+CREATE TABLE local.db.table (id bigint, data string) USING iceberg;

Review Comment:
   It would be even better to have some kind of linter for this.






Re: [I] Unable to write to iceberg table using spark [iceberg]

2023-10-10 Thread via GitHub


RussellSpitzer commented on issue #8419:
URL: https://github.com/apache/iceberg/issues/8419#issuecomment-1755633933

   PySpark, I think, has some issues with setting "packages" in the Spark conf, 
since the py4j execution means that the Spark context has to be started a bit 
unusually. I would try using `--packages` on the CLI instead of configuring it 
within the context to see what happens.





Re: [PR] Docs: Fix missing semicolons in SQL snippets. [iceberg]

2023-10-10 Thread via GitHub


Fokko commented on PR #8748:
URL: https://github.com/apache/iceberg/pull/8748#issuecomment-1755637911

   Great work @Priyansh121096! If we find more we can create a new PR. (I also 
noticed that some blocks start with `` ```SQL `` rather than lowercase 
`` ```sql ``. For consistency it would be nice to have everything lowercase, 
but I think that works as well.)





Re: [PR] Docs: Fix missing semicolons in SQL snippets. [iceberg]

2023-10-10 Thread via GitHub


Fokko merged PR #8748:
URL: https://github.com/apache/iceberg/pull/8748





Re: [PR] Core: Use more permissive check when registering existing table [iceberg]

2023-10-10 Thread via GitHub


Fokko merged PR #8759:
URL: https://github.com/apache/iceberg/pull/8759





Re: [PR] Build: Bump slf4j from 1.7.36 to 2.0.9 [iceberg]

2023-10-10 Thread via GitHub


nastra commented on PR #8737:
URL: https://github.com/apache/iceberg/pull/8737#issuecomment-1755681588

   @jbonofre there's an issue with Spark that needs some investigation





[I] Parquet.write to S3 with GlueCatalog requires commit [iceberg]

2023-10-10 Thread via GitHub


djchapm opened a new issue, #8767:
URL: https://github.com/apache/iceberg/issues/8767

   ### Feature Request / Improvement
   
   Hi, I am writing this in an effort to improve the documentation. I spent a crazy amount of time writing Parquet-Avro files to S3 with the Glue catalog via Iceberg, but could never query the data using Athena. I thought it had to do with all the missing metadata on the Glue tables, but this was a red herring. The problem was that writing files does not automatically update table metadata. According to the API, if you use Table.io():
   
   
![image](https://github.com/apache/iceberg/assets/9857153/2a61401d-b45c-42e4-9881-345311509479)
   
   This made me think using an OutputFile via Table.io() would update metadata. 
 My usage:
   
   ```
   OutputFile outputFile = table.io().newOutputFile(location);
   appenderLocation.put(messageType, location);
   FileAppender appender = Parquet.write(outputFile)
       .forTable(table)
       .setAll(propsBuilder)
       .createWriterFunc(ParquetAvroWriter::buildWriter)
       .build();
   ```
   
   On closing the appender, the file is written but there are no updates to metadata. My table is from GlueCatalog.loadTable(). I'm new, but I could not find documented anywhere that you then have to look up the file again as an InputFile, create a transaction on the table, and commit it:

   ```
   log.info("Closing appender for message type {}", key);
   value.close(); // Appender from above
   // one attempt, does nothing:
   //   tables.get(key).rewriteManifests();
   log.info("Committing {} file {}", key, appenderLocation.get(key));
   InputFile inputFile = tables.get(key).io().newInputFile(appenderLocation.get(key));
   DataFile dataFile = DataFiles.builder(tables.get(key).spec())
       .withInputFile(inputFile)
       .withMetrics(value.metrics())
       .withFormat(FileFormat.PARQUET)
       .build();
   Transaction t = tables.get(key).newTransaction();
   t.newAppend().appendFile(dataFile).commit();
   // commit all changes to the table
   t.commitTransaction();
   ```
   
   So I would like improvements to the documentation and AWS integration for writing Parquet data using GlueCatalog, or at least a test or example people could follow for writing files and updating the corresponding catalog metadata using public APIs (the JUnit tests do all kinds of metadata updates, but through protected APIs we cannot access).
   
   Let me know your thoughts.
   
   
   ### Query engine
   
   Athena





Re: [PR] Build: Fix compiler warnings [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #8763:
URL: https://github.com/apache/iceberg/pull/8763#discussion_r1352796153


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -477,7 +477,7 @@ public void commitTable(
 Branch branch =
 getApi()
 .commitMultipleOperations()
-.operation(Operation.Put.of(key, newTable, expectedContent))
+.operation(Operation.Put.of(key, newTable))

Review Comment:
   Oh wait, 
   https://github.com/projectnessie/nessie/pull/6438 says that we may still need the expected content for V1. 
   But the test cases are passing. 
   
   @snazy, @dimas-b: WDYT? Can it be removed?






Re: [PR] Build: Fix compiler warnings [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #8763:
URL: https://github.com/apache/iceberg/pull/8763#discussion_r1352798678


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -477,7 +477,7 @@ public void commitTable(
 Branch branch =
 getApi()
 .commitMultipleOperations()
-.operation(Operation.Put.of(key, newTable, expectedContent))
+.operation(Operation.Put.of(key, newTable))

Review Comment:
   Since that API is a deprecated usage, Author of the PR has fixed it with 
another API I guess. 






Re: [PR] Build: Bump slf4j from 1.7.36 to 2.0.9 [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on PR #8737:
URL: https://github.com/apache/iceberg/pull/8737#issuecomment-1755700578

   @nastra on a specific Spark version, or any? I can take a look if you want :)





Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on PR #8766:
URL: https://github.com/apache/iceberg/pull/8766#issuecomment-1755702793

   > LGTM, but would be great if somebody with OSX could confirm this
   
   I can confirm. I use a Mac. Not just for the Iceberg project: any project that uses `TestContainers` on macOS will fail its tests with "Could not find a valid Docker environment."
   
   I used to google this and use the command from the answer below, which matches the doc update. 
   https://stackoverflow.com/questions/61108655/test-container-test-cases-are-failing-due-to-could-not-find-a-valid-docker-envi
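   For reference, the workaround that answer describes boils down to pointing Docker-environment discovery at Docker Desktop's per-user socket. A sketch of the commonly used fix (the exact socket paths are assumptions based on a default Docker Desktop install, so adjust as needed):

   ```shell
   # Option 1: symlink Docker Desktop's per-user socket to the default location
   sudo ln -s "$HOME/.docker/run/docker.sock" /var/run/docker.sock

   # Option 2: point the Docker client and Testcontainers at the per-user socket
   export DOCKER_HOST="unix://$HOME/.docker/run/docker.sock"
   export TESTCONTAINERS_DOCKER_SOCKET_OVERRIDE=/var/run/docker.sock
   ```

   Either option should let Testcontainers find a valid Docker environment without changing the tests themselves.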
   
   





Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on PR #8766:
URL: https://github.com/apache/iceberg/pull/8766#issuecomment-1755703566

   @nastra actually it's the workaround I have to do on my Mac :) I'm using 
MacOS (tested both on 13 & 14 with Docker Desktop) on M1.





Re: [PR] Build: increase open-pull-requests-limit to 50 [iceberg]

2023-10-10 Thread via GitHub


Fokko commented on code in PR #8768:
URL: https://github.com/apache/iceberg/pull/8768#discussion_r1352807655


##
.github/dependabot.yml:
##
@@ -28,6 +28,6 @@ updates:
 directory: "/"
 schedule:
   interval: "weekly"
-  day: "sunday"
-open-pull-requests-limit: 5
+  day: "wednesday"

Review Comment:
   We went for Sunday initially to not queue the CI during workdays. I think 
once the PR gets merged, it will retrigger anyway.






Re: [PR] Build: increase open-pull-requests-limit to 50 [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on code in PR #8768:
URL: https://github.com/apache/iceberg/pull/8768#discussion_r1352811489


##
.github/dependabot.yml:
##
@@ -28,6 +28,6 @@ updates:
 directory: "/"
 schedule:
   interval: "weekly"
-  day: "sunday"
-open-pull-requests-limit: 5
+  day: "wednesday"

Review Comment:
   OK, let me move it back to Sunday.






Re: [PR] Build: Bump slf4j from 1.7.36 to 2.0.9 [iceberg]

2023-10-10 Thread via GitHub


nastra commented on PR #8737:
URL: https://github.com/apache/iceberg/pull/8737#issuecomment-1755713705

   
https://github.com/apache/iceberg/actions/runs/6455472887/job/17523030796?pr=8737
 contains a CI run with failures





Re: [PR] Build: Upgrade to gradle 8.4 [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on PR #8486:
URL: https://github.com/apache/iceberg/pull/8486#issuecomment-1755722580

   Unfortunately `gradle-revapi-plugin` doesn't seem super active (https://github.com/palantir/gradle-revapi).
   
   I think it's important to be up to date in regards to Gradle. I will propose a fix to `gradle-revapi`. If the change is not merged and/or it's hard to get a new release, I will propose an alternative plan.





Re: [PR] Build: Fix compiler warnings [iceberg]

2023-10-10 Thread via GitHub


dimas-b commented on code in PR #8763:
URL: https://github.com/apache/iceberg/pull/8763#discussion_r1352823020


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -477,7 +477,7 @@ public void commitTable(
 Branch branch =
 getApi()
 .commitMultipleOperations()
-.operation(Operation.Put.of(key, newTable, expectedContent))

Review Comment:
   This change LGTM, but it's a non-trivial change in the Nessie Catalog, 
certainly not a simple "compiler warning" kind of change... Would you mind 
moving it into a separate PR for the sake of clarity?






Re: [PR] Build: Fix compiler warnings [iceberg]

2023-10-10 Thread via GitHub


dimas-b commented on code in PR #8763:
URL: https://github.com/apache/iceberg/pull/8763#discussion_r1352824568


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -477,7 +477,7 @@ public void commitTable(
 Branch branch =
 getApi()
 .commitMultipleOperations()
-.operation(Operation.Put.of(key, newTable, expectedContent))
+.operation(Operation.Put.of(key, newTable))

Review Comment:
   This change LGTM, but it's a non-trivial change in the Nessie Catalog, 
certainly not a simple "compiler warning" kind of change... 
   
   @nk1506 : Would you mind moving it into a separate PR for the sake of 
clarity?









Re: [PR] Build: Fix compiler warnings [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #8763:
URL: https://github.com/apache/iceberg/pull/8763#discussion_r1352827194


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -477,7 +477,7 @@ public void commitTable(
 Branch branch =
 getApi()
 .commitMultipleOperations()
-.operation(Operation.Put.of(key, newTable, expectedContent))
+.operation(Operation.Put.of(key, newTable))

Review Comment:
   +1 for separate PR. 
   
   I think we can even refactor the method to not pass `expectedContent` but 
just pass contentID as the whole content is unused now. 






Re: [PR] Build: Fix compiler warnings [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on code in PR #8763:
URL: https://github.com/apache/iceberg/pull/8763#discussion_r1352827194


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -477,7 +477,7 @@ public void commitTable(
 Branch branch =
 getApi()
 .commitMultipleOperations()
-.operation(Operation.Put.of(key, newTable, expectedContent))
+.operation(Operation.Put.of(key, newTable))

Review Comment:
   +1 for separate PR. 
   
   I think we can even refactor the `commitTable` method to not accept 
`expectedContent` but just accept contentID as the whole content is unused now. 






Re: [PR] Build: Fix compiler warnings [iceberg]

2023-10-10 Thread via GitHub


dimas-b commented on code in PR #8763:
URL: https://github.com/apache/iceberg/pull/8763#discussion_r1352831841


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -477,7 +477,7 @@ public void commitTable(
 Branch branch =
 getApi()
 .commitMultipleOperations()
-.operation(Operation.Put.of(key, newTable, expectedContent))
+.operation(Operation.Put.of(key, newTable))

Review Comment:
   Please reference https://github.com/projectnessie/nessie/pull/6438 as the 
rationale for removing the third parameter here.






Re: [PR] Build: Fix compiler warnings [iceberg]

2023-10-10 Thread via GitHub


dimas-b commented on code in PR #8763:
URL: https://github.com/apache/iceberg/pull/8763#discussion_r1352831841


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -477,7 +477,7 @@ public void commitTable(
 Branch branch =
 getApi()
 .commitMultipleOperations()
-.operation(Operation.Put.of(key, newTable, expectedContent))
+.operation(Operation.Put.of(key, newTable))

Review Comment:
   Please reference https://github.com/projectnessie/nessie/pull/6438 as the 
rationale for removing the third parameter in the new PR.






Re: [PR] Build: Upgrade to gradle 8.4 [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on PR #8486:
URL: https://github.com/apache/iceberg/pull/8486#issuecomment-1755743335

   > Unfortunately gradle-revapi-plugin doesn't seem super active (https://github.com/palantir/gradle-revapi). I think it's important to be up to date in regards of Gradle. I will propose a fix to gradle-revapi. If the change is not merged and/or it's hard to have a new release, I will propose an alternative plan.
   
   Totally agree. Worst case we can drop revapi and find alternatives, but we can't keep Gradle out of date.
   
   cc: @nastra, @Fokko, @danielcweeks, @rdblue





Re: [PR] Build: Fix compiler warnings [iceberg]

2023-10-10 Thread via GitHub


dimas-b commented on code in PR #8763:
URL: https://github.com/apache/iceberg/pull/8763#discussion_r1352835064


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -477,7 +477,7 @@ public void commitTable(
 Branch branch =
 getApi()
 .commitMultipleOperations()
-.operation(Operation.Put.of(key, newTable, expectedContent))
+.operation(Operation.Put.of(key, newTable))

Review Comment:
   I believe Nessie API v1 only needs the `expectedContent` parameter so that older clients can still serialize it in JSON. The actual values are not used in Nessie Servers 0.54.0 and later.






Re: [I] Upsert support for keyless Apache Flink tables [iceberg]

2023-10-10 Thread via GitHub


Ge commented on issue #8719:
URL: https://github.com/apache/iceberg/issues/8719#issuecomment-1755752149

   `SELECT word, COUNT(*) FROM word_table GROUP BY word;` is the retract stream:
   
   ```
   Flink SQL> SELECT word, COUNT(*) FROM word_table GROUP BY word;
   +----+------+--------+
   | op | word | EXPR$1 |
   +----+------+--------+
   | +I |    6 |      1 |
   | +I |    8 |      1 |
   | +I |    f |      1 |
   | +I |    c |      1 |
   | +I |    b |      1 |
   | -U |    8 |      1 |
   | +U |    8 |      2 |
   | +I |    1 |      1 |
   | +I |    a |      1 |
   | -U |    8 |      2 |
   | +U |    8 |      3 |
   | -U |    6 |      1 |
   | +U |    6 |      2 |
   | +I |    9 |      1 |
   | +I |    e |      1 |
   ```
   
   Can you please elaborate on what is missing, @pvary?





Re: [PR] Build: increase open-pull-requests-limit to 50 [iceberg]

2023-10-10 Thread via GitHub


jbonofre commented on PR #8768:
URL: https://github.com/apache/iceberg/pull/8768#issuecomment-1755758543

   @ajantha-bhat maybe we had 5 pending PRs not closed/merged, blocking any new PR.





Re: [PR] Build: increase open-pull-requests-limit to 50 [iceberg]

2023-10-10 Thread via GitHub


ajantha-bhat commented on PR #8768:
URL: https://github.com/apache/iceberg/pull/8768#issuecomment-1755762122

   > @ajantha-bhat maybe we have 5 pending PRs not closed/merged, so blocking any new PR.
   
   Yeah. Anyway, this change will definitely give clarity on whether that was the problem, so a huge +1 for this.





Re: [PR] Docs: Document all metadata tables. [iceberg]

2023-10-10 Thread via GitHub


nastra merged PR #8709:
URL: https://github.com/apache/iceberg/pull/8709





Re: [I] Document all metadata tables [iceberg]

2023-10-10 Thread via GitHub


nastra closed issue #757: Document all metadata tables
URL: https://github.com/apache/iceberg/issues/757





Re: [PR] Construct a writer tree [iceberg-python]

2023-10-10 Thread via GitHub


Fokko commented on PR #40:
URL: https://github.com/apache/iceberg-python/pull/40#issuecomment-1755767706

   Forgot to push, just pushed the latest changes





Re: [PR] Disable merge-commit and enforce linear history [iceberg-python]

2023-10-10 Thread via GitHub


rdblue merged PR #57:
URL: https://github.com/apache/iceberg-python/pull/57





Re: [I] Unable to write to iceberg table using spark [iceberg]

2023-10-10 Thread via GitHub


di2mot commented on issue #8419:
URL: https://github.com/apache/iceberg/issues/8419#issuecomment-1755796639

   This works for me in general when I run locally:
   ```
   ("spark.jars.packages", "org.apache.iceberg:iceberg-spark3:0.11.0"),
   ("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"),
   ("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog"),
   ("spark.sql.catalog.iceberg.type", "hadoop"),
   ("spark.sql.catalog.iceberg.warehouse", self.path)
   ```
   But that is when I run locally, not in Docker/Kubernetes. On the server, on Airflow, we use this one without `spark.jars.packages`:
   ```
   ...
   ("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"),
   ("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog"),
   ("spark.sql.catalog.iceberg.type", "hadoop"),
   ("spark.sql.catalog.iceberg.warehouse", self.path)
   ...
   ```
   Because we add them in the .yaml file:
   ```
   packages:
     - org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1
     - org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.0
   ```





Re: [PR] push down min/max/count to iceberg [iceberg]

2023-10-10 Thread via GitHub


atifiu commented on PR #6252:
URL: https://github.com/apache/iceberg/pull/6252#issuecomment-1755857764

   @huaxingao I was executing a max/count query on an Iceberg 1.3.0 table with Spark 3.3.1, but I am unable to see the aggregate pushdown (i.e., a LocalTableScan).
   
   Cc: @RussellSpitzer 
   
   `spark.sql(f""" select max(page_view_dtm) from schema.table1 where page_view_dtm between '2020-01-01 00:00:00' and '2021-12-31 23:59:59' """).explain()`
   
   and the explain plan generated is
   
   ```
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- HashAggregate(keys=[], functions=[max(page_view_dtm#139)])
      +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=62]
         +- HashAggregate(keys=[], functions=[partial_max(page_view_dtm#139)])
            +- Filter ((page_view_dtm#139 >= 2020-01-01 00:00:00) AND (page_view_dtm#139 <= 2021-12-31 23:59:59))
               +- BatchScan[page_view_dtm#139] spark_catalog.schema.table1(branch=null) [filters=page_view_dtm IS NOT NULL, page_view_dtm >= 15778548, page_view_dtm <= 164101319900, groupedBy=] RuntimeFilters: []
   ```
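   One thing worth checking first (an assumption on my part; I have not confirmed it is the cause here) is that the session-level flag for this feature is still enabled, since the rewrite to a LocalTableScan only happens when it is on:

   ```
   -- hypothetical sanity check in the same Spark session;
   -- property name taken from Iceberg's SparkSQLProperties
   SET spark.sql.iceberg.aggregate-push-down.enabled = true;
   ```

   If the flag is already true and the plan still shows HashAggregate over a BatchScan, the remaining Filter in the plan may be what is preventing the pushdown.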





Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


aokolnychyi commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1353203145


##
format/spec.md:
##
@@ -702,6 +703,58 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may 
ignore them.
+Each table snapshot may be associated with at most one partition statistic 
file.
+A writer can optionally write the partition statistics file during each write 
operation, and 
+it must be registered in the table metadata file to be considered as a valid 
statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct 
with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+||||--|-|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the 
partition statistics file. |
+
+ Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data tuple, schema based on the unified partition type considering all specs in a table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying the delete files if any |

Review Comment:
   This makes sense to me.
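The sort requirement in the spec excerpt above (rows ordered ascending with NULL FIRST by the `partition` tuple) can be sketched with a plain comparator. This is an illustrative sketch only, not Iceberg API: `PartitionRow` and its integer tuple are hypothetical stand-ins for a row keyed by the unified partition type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PartitionStatsOrderSketch {
  // Hypothetical stand-in for a partition stats row keyed by its partition tuple.
  static class PartitionRow {
    final Integer[] partition;

    PartitionRow(Integer... partition) {
      this.partition = partition;
    }
  }

  // Field-by-field ascending order; a null field sorts before any non-null value.
  static final Comparator<PartitionRow> NULLS_FIRST_ASC =
      (a, b) -> {
        Comparator<Integer> field = Comparator.nullsFirst(Comparator.naturalOrder());
        for (int i = 0; i < a.partition.length; i++) {
          int cmp = field.compare(a.partition[i], b.partition[i]);
          if (cmp != 0) {
            return cmp;
          }
        }
        return 0;
      };

  public static void main(String[] args) {
    List<PartitionRow> rows =
        new ArrayList<>(
            Arrays.asList(
                new PartitionRow(2, 1), new PartitionRow(null, 5), new PartitionRow(2, null)));
    rows.sort(NULLS_FIRST_ASC);
    // Tuples with null leading fields sort first, then ascending by value
    for (PartitionRow row : rows) {
      System.out.println(Arrays.toString(row.partition));
    }
  }
}
```

A reader scanning rows in this order can stop early once the tuple prefix exceeds a filter bound, which is the filtering optimization the spec text alludes to.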






Re: [PR] Spec: Add partition stats spec [iceberg]

2023-10-10 Thread via GitHub


aokolnychyi commented on PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1756079392

   I added this PR to our community sync. I am not sure I will be there this 
week but I'll sync with Russell and Yufei afterwards.





[PR] Fix column rename doc example to reflect correct API [iceberg-python]

2023-10-10 Thread via GitHub


cabhishek opened a new pull request, #59:
URL: https://github.com/apache/iceberg-python/pull/59

   * The rename column example in [this doc](https://py.iceberg.apache.org/api/#rename-column) is incorrect.
   * This PR updates the example to use `update.rename_column(...)` instead of `update.rename(...)`.
   
   After PR
   
   ```
   with table.update_schema() as update:
       update.rename_column("retries", "num_retries")
       # This will rename `confirmed_by` to `exchange`
       update.rename_column("properties.confirmed_by", "exchange")
   ```





Re: [PR] Avro: Add Avro-assisted name mapping [iceberg]

2023-10-10 Thread via GitHub


wmoustafa commented on code in PR #7392:
URL: https://github.com/apache/iceberg/pull/7392#discussion_r1353271608


##########
core/src/main/java/org/apache/iceberg/avro/AvroWithPartnerByStructureVisitor.java:
##########
@@ -93,14 +94,23 @@ private static <P, T> T visitRecord(
   private static <P, T> T visitUnion(
       P type, Schema union, AvroWithPartnerByStructureVisitor<P, T> visitor) {
     List<Schema> types = union.getTypes();
-    Preconditions.checkArgument(
-        AvroSchemaUtil.isOptionSchema(union), "Cannot visit non-option union: %s", union);
     List<T> options = Lists.newArrayListWithExpectedSize(types.size());
-    for (Schema branch : types) {
-      if (branch.getType() == Schema.Type.NULL) {
-        options.add(visit(visitor.nullType(), branch, visitor));
-      } else {
-        options.add(visit(type, branch, visitor));
+    if (AvroSchemaUtil.isOptionSchema(union)) {
+      for (Schema branch : types) {
+        if (branch.getType() == Schema.Type.NULL) {
+          options.add(visit(visitor.nullType(), branch, visitor));
+        } else {
+          options.add(visit(type, branch, visitor));
+        }
+      }
+    } else {
+      List<Schema> nonNullTypes =
+          types.stream().filter(t -> t.getType() != Schema.Type.NULL).collect(Collectors.toList());
+      for (int i = 0; i < nonNullTypes.size(); i++) {
+        // In the case of complex union, the corresponding "type" is a struct. Non-null type i in
+        // the union maps to struct field i + 1 because the first struct field is the "tag".
+        options.add(
+            visit(visitor.fieldNameAndType(type, i + 1).second(), nonNullTypes.get(i), visitor));

Review Comment:
   Visited null but did not add it to the returned options since it does not correspond to a struct field.
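The branch-to-field offset described in the code comment above can be illustrated with a tiny tagged-struct encoding. This is a hedged sketch, not the Iceberg implementation: field 0 holds the branch "tag", and non-null branch i of a complex union lands in struct field i + 1.

```java
import java.util.Arrays;

public class UnionTagSketch {
  // Encode a value from non-null branch `branch` of a complex union with
  // `branchCount` non-null branches as a tagged struct: field 0 is the tag,
  // branch i occupies field i + 1, and all other fields stay null.
  static Object[] asTaggedStruct(int branch, Object value, int branchCount) {
    Object[] struct = new Object[branchCount + 1];
    struct[0] = branch;          // the "tag" field records which branch is set
    struct[branch + 1] = value;  // non-null branch i maps to struct field i + 1
    return struct;
  }

  public static void main(String[] args) {
    // A hypothetical union<int, string> with two non-null branches
    System.out.println(Arrays.toString(asTaggedStruct(0, 42, 2)));    // [0, 42, null]
    System.out.println(Arrays.toString(asTaggedStruct(1, "abc", 2))); // [1, null, abc]
  }
}
```

This is why the visitor skips the NULL branch entirely: a null value is represented by the union being the option's null alternative, not by any struct field, so only non-null branches get a field slot.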






Re: [PR] Avro: Add Avro-assisted name mapping [iceberg]

2023-10-10 Thread via GitHub


wmoustafa commented on PR #7392:
URL: https://github.com/apache/iceberg/pull/7392#issuecomment-1756153921

   > I think this is ready. Just a few minor updates needed; mostly 
https://github.com/apache/iceberg/pull/7392/files#r1224853756.
   
   Addressed.





Re: [PR] Data: Support reading default values from generic Avro readers [iceberg]

2023-10-10 Thread via GitHub


wmoustafa commented on code in PR #6004:
URL: https://github.com/apache/iceberg/pull/6004#discussion_r1353272532


##########
.palantir/revapi.yml:
##########
@@ -451,6 +451,15 @@ acceptedBreaks:
 - code: "java.field.removedWithConstant"
   old: "field org.apache.iceberg.TableProperties.HMS_TABLE_OWNER"
   justification: "Removing deprecations for 1.3.0"
+- code: "java.method.numberOfParametersChanged"
+  old: "method void org.apache.iceberg.avro.ValueReaders.StructReader<S>::<init>(java.util.List<org.apache.iceberg.avro.ValueReader<?>>,\
+    \ org.apache.iceberg.types.Types.StructType, java.util.Map<java.lang.Integer, ?>)"
+  new: "method void org.apache.iceberg.avro.ValueReaders.StructReader<S>::<init>(java.util.List<org.apache.iceberg.avro.ValueReader<?>>,\

Review Comment:
   Yes. Done.






Re: [PR] Data: Support reading default values from generic Avro readers [iceberg]

2023-10-10 Thread via GitHub


wmoustafa commented on PR #6004:
URL: https://github.com/apache/iceberg/pull/6004#issuecomment-1756154580

   > @wmoustafa, looks like there are test failures. Can you take a look?
   
   Fixed.




