Re: [I] fast_forward does not work for the first commit in Spark [iceberg]

2023-10-17 Thread via GitHub


ajantha-bhat commented on issue #8849:
URL: https://github.com/apache/iceberg/issues/8849#issuecomment-1765789016

   Hmm, it is not just the missing null check in the procedure; it later fails 
because the reference MAIN does not exist during the replace-branch operation. 
   
   
https://github.com/apache/iceberg/blob/46cad6ddaeff8104d96defab25206a4ff7e01629/core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java#L123
   
   The reason is that we create the main branch only when the first snapshot is 
created, so in this scenario no main branch exists for the replace operation. 
   I thought of blocking this scenario by throwing an exception, but since we 
support creating an empty branch (from 
https://github.com/apache/iceberg/pull/8072/), we should also support replacing 
to/from an empty branch. 
   
   I am thinking of adding an empty/dummy snapshot, with a MAIN reference, during 
the create-table operation, and tackling this in a separate PR.
   Let me know what you guys think, @amogh-jahagirdar, @ConeyLiu, @rdblue





Re: [I] doc: The doc for catalog is missing. [iceberg]

2023-10-17 Thread via GitHub


liurenjie1024 commented on issue #8850:
URL: https://github.com/apache/iceberg/issues/8850#issuecomment-1765832884

   I'll take this.





Re: [I] Implement basic full table scan. [iceberg-rust]

2023-10-17 Thread via GitHub


liurenjie1024 commented on issue #66:
URL: https://github.com/apache/iceberg-rust/issues/66#issuecomment-1765838560

   Blocked by #79 





Re: [PR] feat: Implement Iceberg values [iceberg-rust]

2023-10-17 Thread via GitHub


JanKaul commented on code in PR #20:
URL: https://github.com/apache/iceberg-rust/pull/20#discussion_r1361668958


##
crates/iceberg/src/spec/values.rs:
##
@@ -0,0 +1,964 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+/*!
+ * Value in iceberg
+ */
+
+use std::{any::Any, collections::BTreeMap};
+
+use bitvec::vec::BitVec;
+use chrono::{DateTime, NaiveDate, NaiveDateTime, NaiveTime, Utc};
+use ordered_float::OrderedFloat;
+use rust_decimal::Decimal;
+use serde_bytes::ByteBuf;
+use serde_json::{Map as JsonMap, Number, Value as JsonValue};
+use uuid::Uuid;
+
+use crate::{Error, ErrorKind};
+
+use super::datatypes::{PrimitiveType, Type};
+
+/// Values present in iceberg type
+#[derive(Clone, Debug, PartialEq, Hash, Eq, PartialOrd, Ord)]
+pub enum PrimitiveLiteral {
+    /// 0x00 for false, non-zero byte for true
+    Boolean(bool),
+    /// Stored as 4-byte little-endian
+    Int(i32),
+    /// Stored as 8-byte little-endian
+    Long(i64),
+    /// Stored as 4-byte little-endian
+    Float(OrderedFloat<f32>),
+    /// Stored as 8-byte little-endian
+    Double(OrderedFloat<f64>),
+    /// Stores days from 1970-01-01 in a 4-byte little-endian int
+    Date(i32),
+    /// Stores microseconds from midnight in an 8-byte little-endian long
+    Time(i64),
+    /// Timestamp without timezone
+    Timestamp(i64),
+    /// Timestamp with timezone
+    TimestampTZ(i64),
+    /// UTF-8 bytes (without length)
+    String(String),
+    /// 16-byte big-endian value
+    UUID(Uuid),
+    /// Binary value
+    Fixed(Vec<u8>),
+    /// Binary value (without length)
+    Binary(Vec<u8>),
+    /// Stores unscaled value as two’s-complement big-endian binary,
+    /// using the minimum number of bytes for the value
+    Decimal(Decimal),
+}
+
+/// Values present in iceberg type
+#[derive(Clone, Debug, PartialEq, Hash, Eq, PartialOrd, Ord)]
+pub enum Literal {
+    /// A primitive value
+    Primitive(PrimitiveLiteral),
+    /// A struct is a tuple of typed values. Each field in the tuple is named
+    /// and has an integer id that is unique in the table schema.
+    /// Each field can be either optional or required, meaning that values can
+    /// (or cannot) be null. Fields may be any type.
+    /// Fields may have an optional comment or doc string. Fields can have
+    /// default values.
+    Struct(Struct),
+    /// A list is a collection of values with some element type.
+    /// The element field has an integer id that is unique in the table schema.
+    /// Elements can be either optional or required. Element types may be any
+    /// type.
+    List(Vec<Option<Literal>>),
+    /// A map is a collection of key-value pairs with a key type and a value
+    /// type.
+    /// Both the key field and value field each have an integer id that is
+    /// unique in the table schema.
+    /// Map keys are required and map values can be either optional or
+    /// required. Both map keys and map values may be any type, including
+    /// nested types.
+    Map(BTreeMap<Literal, Option<Literal>>),
+}
+
+impl From<Literal> for ByteBuf {
+    fn from(value: Literal) -> Self {
+        match value {
+            Literal::Primitive(prim) => match prim {
+                PrimitiveLiteral::Boolean(val) => {
+                    // 0x00 for false, non-zero byte for true
+                    if val {
+                        ByteBuf::from([1u8])
+                    } else {
+                        ByteBuf::from([0u8])
+                    }
+                }
+                PrimitiveLiteral::Int(val) => ByteBuf::from(val.to_le_bytes()),
+                PrimitiveLiteral::Long(val) => ByteBuf::from(val.to_le_bytes()),
+                PrimitiveLiteral::Float(val) => ByteBuf::from(val.to_le_bytes()),
+                PrimitiveLiteral::Double(val) => ByteBuf::from(val.to_le_bytes()),
+                PrimitiveLiteral::Date(val) => ByteBuf::from(val.to_le_bytes()),
+                PrimitiveLiteral::Time(val) => ByteBuf::from(val.to_le_bytes()),
+                PrimitiveLiteral::Timestamp(val) => ByteBuf::from(val.to_le_bytes()),
+                PrimitiveLiteral::TimestampTZ(val) => ByteBuf::from(val.to_le_bytes()),
+                PrimitiveLiteral::String(val) => ByteBuf::from(val.as_bytes()),
+                PrimitiveLiteral::UUID(val) => ByteBuf::from(val.as_u128().to_be_bytes()),
+                PrimitiveLiteral::Fixed(val) => ByteBuf::from(val),

Re: [I] [BUG] to_arrow conversion does not support iceberg table column name containing slash [iceberg-python]

2023-10-17 Thread via GitHub


Fokko commented on issue #81:
URL: https://github.com/apache/iceberg-python/issues/81#issuecomment-1765864606

   Great catch @puchengy, I wasn't aware of this sanitization behavior. Do you 
want to write a patch for it?





[PR] fix: avro bytes test for Literal [iceberg-rust]

2023-10-17 Thread via GitHub


JanKaul opened a new pull request, #80:
URL: https://github.com/apache/iceberg-rust/pull/80

   This PR fixes a logical bug in the tests for converting `Literal`s to and 
from `ByteBuf`: I wrote the wrong bytes to the Avro file.





Re: [PR] fix: avro bytes test for Literal [iceberg-rust]

2023-10-17 Thread via GitHub


Xuanwo commented on code in PR #80:
URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361682216


##
crates/iceberg/src/spec/values.rs:
##
@@ -995,7 +995,7 @@ mod tests {
         assert_eq!(literal, expected_literal);
 
         let mut writer = apache_avro::Writer::new(&schema, Vec::new());
-        writer.append_ser(bytes).unwrap();
+        writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap();

Review Comment:
   Using `Into` seems strange. How about using `ByteBuf::from(literal)` instead?






Re: [PR] fix: avro bytes test for Literal [iceberg-rust]

2023-10-17 Thread via GitHub


JanKaul commented on code in PR #80:
URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361690259


##
crates/iceberg/src/spec/values.rs:
##
@@ -995,7 +995,7 @@ mod tests {
         assert_eq!(literal, expected_literal);
 
         let mut writer = apache_avro::Writer::new(&schema, Vec::new());
-        writer.append_ser(bytes).unwrap();
+        writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap();

Review Comment:
   There is a `ByteBuf::from(value: impl Into<Vec<u8>>)` which overrides the 
`From<Literal>` for `ByteBuf`, and I need to somehow specify that I want the 
conversion from `Literal` to `ByteBuf`. Maybe you know of a different way to 
specify which `from/into` implementation to use.






Re: [PR] fix: avro bytes test for Literal [iceberg-rust]

2023-10-17 Thread via GitHub


Xuanwo commented on code in PR #80:
URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361704555


##
crates/iceberg/src/spec/values.rs:
##
@@ -995,7 +995,7 @@ mod tests {
         assert_eq!(literal, expected_literal);
 
         let mut writer = apache_avro::Writer::new(&schema, Vec::new());
-        writer.append_ser(bytes).unwrap();
+        writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap();

Review Comment:
   > There is a `ByteBuf::from(value: impl Into<Vec<u8>>)` which overrides the 
`From<Literal>` for `ByteBuf`
   
   Hi, I'm not quite understanding. Does `Into<Vec<u8>>` exist for `Literal`? 
Or is it because there is a `From<Literal>` impl for `ByteBuf`?






Re: [PR] Add an expireAfterWrite cache eviction policy to CachingCatalog [iceberg]

2023-10-17 Thread via GitHub


zhangminglei commented on code in PR #8844:
URL: https://github.com/apache/iceberg/pull/8844#discussion_r1361733500


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/TestCachingCatalogExpirationAfterWrite.java:
##
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark;
+
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.hive.HiveClientPool;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.awaitility.Awaitility;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+public class TestCachingCatalogExpirationAfterWrite extends SparkCatalogTestBase {

Review Comment:
   Sounds reasonable. What do you think, @nastra? Do you wish to put it into the 
`spark` module?






Re: [I] How to read data in the order in which files are commited? [iceberg]

2023-10-17 Thread via GitHub


pvary commented on issue #8802:
URL: https://github.com/apache/iceberg/issues/8802#issuecomment-1765938139

   > > Sometimes we need to do similar thing in Flink Source, and we ended up 
creating our own comparator for this which compares Iceberg splits (which are a 
wrapper above ScanTasks).
   > 
   > I'm sorry, I didn't quite understand this point. Could you please explain 
it in more detail?
   
   We fetch all the tasks for a given plan here (and convert them to an 
IcebergSplit using `fromCombinedScanTask`):
   
https://github.com/apache/iceberg/blob/46cad6ddaeff8104d96defab25206a4ff7e01629/flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L73-L82
   
   And we use a 
[SerializableComparator](https://github.com/apache/iceberg/blob/46cad6ddaeff8104d96defab25206a4ff7e01629/flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/source/split/SplitComparators.java#L33)
 to order the splits before assigning them:
   
https://github.com/apache/iceberg/blob/46cad6ddaeff8104d96defab25206a4ff7e01629/flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/source/assigner/DefaultSplitAssigner.java#L44-L46
   
   I hope this helps!
   
   Thanks,
   Peter





Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub


pvary commented on code in PR #8808:
URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361742600


##
flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/data/FlinkParquetReaders.java:
##
@@ -262,7 +262,11 @@ public ParquetValueReader<?> primitive(
       switch (primitive.getPrimitiveTypeName()) {
         case FIXED_LEN_BYTE_ARRAY:
         case BINARY:
-          return new ParquetValueReaders.ByteArrayReader(desc);
+          if (expected.typeId() == Types.StringType.get().typeId()) {
+            return new StringReader(desc);
+          } else {
+            return new ParquetValueReaders.ByteArrayReader(desc);
+          }

Review Comment:
   I am concerned about the backward compatibility of this change.
   Someone might already depend on reading them as binary, and this change 
would break their use case.






Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub


pvary commented on code in PR #8808:
URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361742942


##
flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkParquetReader.java:
##
@@ -81,26 +75,87 @@ public void testTwoLevelList() throws IOException {
     recordBuilder.set("topbytes", expectedBinary);
     GenericData.Record expectedRecord = recordBuilder.build();
 
-    writer.write(expectedRecord);
-    writer.close();
+    try (ParquetWriter<GenericData.Record> writer =
+        AvroParquetWriter.<GenericData.Record>builder(new Path(testFile.toURI()))
+            .withDataModel(GenericData.get())
+            .withSchema(avroSchema)
+            .config("parquet.avro.add-list-element-records", "true")
+            .config("parquet.avro.write-old-list-structure", "true")
+            .build()) {
+      writer.write(expectedRecord);
+    }
 
     try (CloseableIterable<RowData> reader =
         Parquet.read(Files.localInput(testFile))
             .project(schema)
             .createReaderFunc(type -> FlinkParquetReaders.buildReader(schema, type))
             .build()) {
       Iterator<RowData> rows = reader.iterator();
-      Assert.assertTrue("Should have at least one row", rows.hasNext());
+      assertThat(rows).as("Should have at least one row").hasNext();
+      RowData rowData = rows.next();
+      assertThat(expectedByte).isEqualTo(rowData.getArray(0).getBinary(0));
+      assertThat(expectedByte).isEqualTo(rowData.getBinary(1));
+      assertThat(rows).as("Should not have more than one row").isExhausted();
+    }
+  }
+
+  @Test
+  public void testReadBinaryFieldAsString() throws IOException {
+    Schema schemaForWriteBinary = new Schema(optional(1, "strbytes", Types.BinaryType.get()));
+    org.apache.avro.Schema avroSchema = AvroSchemaUtil.convert(schemaForWriteBinary.asStruct());
+
+    File testFile = temp.newFile();
+    assertThat(testFile.delete()).isTrue();
+
+    String expectedString = "hello";
+
+    GenericRecordBuilder recordBuilder = new GenericRecordBuilder(avroSchema);
+    ByteBuffer expectedBinary = ByteBuffer.wrap(expectedString.getBytes(StandardCharsets.UTF_8));
+    recordBuilder.set("strbytes", expectedBinary);
+    GenericData.Record expectedRecord = recordBuilder.build();
+
+    try (ParquetWriter<GenericData.Record> writer =
+        AvroParquetWriter.<GenericData.Record>builder(new Path(testFile.toURI()))
+            .withDataModel(GenericData.get())
+            .withSchema(avroSchema)
+            .build()) {
+      writer.write(expectedRecord);
+    }
+
+    // read as string
+    Schema schemaForReadBinaryAsString =
+        new Schema(optional(1, "strbytes", Types.StringType.get()));
+    try (CloseableIterable<RowData> reader =
+        Parquet.read(Files.localInput(testFile))
+            .project(schemaForReadBinaryAsString)
+            .createReaderFunc(
+                type -> FlinkParquetReaders.buildReader(schemaForReadBinaryAsString, type))
+            .build()) {
+      Iterator<RowData> rows = reader.iterator();
+      assertThat(rows).as("Should have at least one row").hasNext();
+      RowData rowData = rows.next();
+      assertThat(rowData.getString(0)).isInstanceOf(BinaryStringData.class);
+      assertThat(rowData.getString(0).toString()).isEqualTo(expectedString);
+      assertThat(rows).as("Should not have more than one row").isExhausted();
+    }
+
+    // read as byte[]

Review Comment:
   Should this be a separate test case?






Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub


pvary commented on PR #8808:
URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1765954472

   This is a small change, so it might not be too hard to keep the different 
Flink version changes in sync, but usually we introduce changes on the latest 
Flink version and then create a separate backport PR for the older Flink 
versions.
   
   This is better for the reviewer, as the change is smaller and it is easier to 
focus on the issues, and better for the contributor, since required changes do 
not need to be continuously merged to the other branches. And when the backport 
time comes, it is easier to check the backport-specific changes.





Re: [PR] fix: avro bytes test for Literal [iceberg-rust]

2023-10-17 Thread via GitHub


JanKaul commented on code in PR #80:
URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361754339


##
crates/iceberg/src/spec/values.rs:
##
@@ -995,7 +995,7 @@ mod tests {
         assert_eq!(literal, expected_literal);
 
         let mut writer = apache_avro::Writer::new(&schema, Vec::new());
-        writer.append_ser(bytes).unwrap();
+        writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap();

Review Comment:
   There currently is no `Into<Vec<u8>>` impl for `Literal`, but there is a 
`From<Literal> for ByteBuf`, which I would like to use. When I use 
`ByteBuf::from`, the compiler wants to use the existing `ByteBuf::from(value: 
impl Into<Vec<u8>>)` implementation, which gives me a type error because 
`Literal` doesn't implement `Into<Vec<u8>>`. So I need some way to tell the 
compiler the right `from/into` implementation to use. It works the way I did it, 
but it might be a bit confusing.






Re: [PR] fix: avro bytes test for Literal [iceberg-rust]

2023-10-17 Thread via GitHub


ZENOTME commented on PR #80:
URL: https://github.com/apache/iceberg-rust/pull/80#issuecomment-1765964478

   I'm a little confused. So these Avro bytes are different from the [binary 
encoding in the Avro 
spec](https://avro.apache.org/docs/1.11.1/specification/#binary-encoding)? 
What's the difference between the two formats? 🤔





Re: [PR] fix: avro bytes test for Literal [iceberg-rust]

2023-10-17 Thread via GitHub


Xuanwo commented on code in PR #80:
URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361762750


##
crates/iceberg/src/spec/values.rs:
##
@@ -995,7 +995,7 @@ mod tests {
         assert_eq!(literal, expected_literal);
 
         let mut writer = apache_avro::Writer::new(&schema, Vec::new());
-        writer.append_ser(bytes).unwrap();
+        writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap();

Review Comment:
   Oh, I got it. `ByteBuf` provides its own `ByteBuf::from` that may conflict 
with the `From` trait. 
   
   Maybe we should provide a `Literal::to_vec()` instead? Users can utilize 
`ByteBuf::from(literal.to_vec())` to enhance semantic clarity.
   
   Incidentally, since `Literal` already implements `Serialize`, it's easy to 
misuse this API. Can we find a way to prevent this?
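   A sketch of what that suggestion could look like; the one-field `Literal` and 
its `to_vec` are hypothetical stand-ins for the proposal, not existing API:
   
   ```rust
   use serde_bytes::ByteBuf;
   
   struct Literal(i64);
   
   impl Literal {
       // Hypothetical explicit conversion, as suggested above. Because this is
       // an ordinary method rather than a From/Into impl, it cannot collide
       // with the inherent `ByteBuf::from(impl Into<Vec<u8>>)`.
       fn to_vec(&self) -> Vec<u8> {
           self.0.to_le_bytes().to_vec()
       }
   }
   
   fn main() {
       let literal = Literal(7);
       // Unambiguous: the inherent `ByteBuf::from` accepts any `impl Into<Vec<u8>>`.
       let bytes = ByteBuf::from(literal.to_vec());
       assert_eq!(bytes.len(), 8);
   }
   ```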






[PR] API, Core: Add uuid() to View [iceberg]

2023-10-17 Thread via GitHub


nastra opened a new pull request, #8851:
URL: https://github.com/apache/iceberg/pull/8851

   similar to https://github.com/apache/iceberg/pull/8800, adding UUID to the 
`View` API





Re: [PR] fix: avro bytes test for Literal [iceberg-rust]

2023-10-17 Thread via GitHub


JanKaul commented on PR #80:
URL: https://github.com/apache/iceberg-rust/pull/80#issuecomment-1766011191

   I will try to explain it the best way I can. Please correct me if I'm wrong.
   
   Iceberg defines its own [binary serialization 
format](https://iceberg.apache.org/spec/#binary-single-value-serialization) for 
statistics (upper & lower bounds). This statistical information is stored in 
Avro files using the [Avro binary 
encoding](https://avro.apache.org/docs/1.11.1/specification/#binary-encoding). 
So this is somewhat of a two-layered setup: in order to interpret a value, you 
first have to deserialize the Avro value into bytes and then convert the bytes 
to a value using the Iceberg binary format.
   
   When I implemented `Literal`, my goal was to use `serde_bytes::ByteBuf` for 
the Avro binary values, just like you used it for the 
[`ManifestList`](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/manifest_list.rs#L586).
 After deserialization, the `ByteBuf` has to be converted to a `Literal`, and 
that's what the `Literal::try_from_bytes` method is for.
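   A sketch of that two-layered round trip, assuming the `apache_avro` and 
`serde_bytes` crates; the `int_literal_from_bytes` helper is a stand-in for the 
PR's `Literal::try_from_bytes`, reduced to `int` literals only:
   
   ```rust
   use serde_bytes::ByteBuf;
   
   // Stand-in for the PR's Iceberg single-value decoding
   // (`Literal::try_from_bytes`); here we only decode an `int` literal,
   // stored as 4-byte little-endian per the Iceberg spec.
   fn int_literal_from_bytes(bytes: &[u8]) -> i32 {
       i32::from_le_bytes(bytes.try_into().expect("int literal is 4 bytes"))
   }
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Layer 1: Iceberg single-value serialization -> raw bytes.
       let iceberg_bytes = ByteBuf::from(42i32.to_le_bytes());
   
       // Layer 2: those bytes are written as an Avro `bytes` value,
       // i.e. wrapped in the Avro binary encoding.
       let schema = apache_avro::Schema::parse_str(r#""bytes""#)?;
       let mut writer = apache_avro::Writer::new(&schema, Vec::new());
       writer.append_ser(iceberg_bytes)?;
       let encoded = writer.into_inner()?;
   
       // Reading reverses the layers: Avro decode first, Iceberg decode second.
       let reader = apache_avro::Reader::with_schema(&schema, &encoded[..])?;
       for value in reader {
           let bytes: ByteBuf = apache_avro::from_value(&value?)?;
           assert_eq!(int_literal_from_bytes(&bytes), 42);
       }
       Ok(())
   }
   ```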





Re: [PR] fix: avro bytes test for Literal [iceberg-rust]

2023-10-17 Thread via GitHub


JanKaul commented on code in PR #80:
URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361806053


##
crates/iceberg/src/spec/values.rs:
##
@@ -995,7 +995,7 @@ mod tests {
         assert_eq!(literal, expected_literal);
 
         let mut writer = apache_avro::Writer::new(&schema, Vec::new());
-        writer.append_ser(bytes).unwrap();
+        writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap();

Review Comment:
   > Oh, I got it. `ByteBuf` provides its own `ByteBuf::from` that may conflict 
with the `From` trait.
   
   Exactly.
   
   It actually works with `writer.append_ser(ByteBuf::from(literal)).unwrap();` 
when I provide an `impl From<Literal> for Vec<u8>`.
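   As a compilable sketch of that variant (again with a stand-in one-field 
`Literal`), providing that impl makes the plain call resolve:
   
   ```rust
   use serde_bytes::ByteBuf;
   
   struct Literal(i32);
   
   // Providing Into<Vec<u8>> (via this From impl) satisfies the inherent
   // `ByteBuf::from(value: impl Into<Vec<u8>>)`, so the plain call below
   // compiles without fully qualified syntax.
   impl From<Literal> for Vec<u8> {
       fn from(value: Literal) -> Self {
           value.0.to_le_bytes().to_vec()
       }
   }
   
   fn main() {
       let bytes = ByteBuf::from(Literal(42));
       assert_eq!(bytes.len(), 4);
   }
   ```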






Re: [I] Iceberg table support specified column comments by flinksql create [iceberg]

2023-10-17 Thread via GitHub


372242283 commented on issue #8511:
URL: https://github.com/apache/iceberg/issues/8511#issuecomment-1766049159

   I also encountered this issue: checking the `COMMENT` field of the 
`COLUMNS_V2` table in HMS shows that it is null and never written.
   
![image](https://github.com/apache/iceberg/assets/43502242/81df718d-69ed-4066-a924-284485c129a5)
   





Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]

2023-10-17 Thread via GitHub


Fokko commented on PR #8766:
URL: https://github.com/apache/iceberg/pull/8766#issuecomment-1766051674

   For me it looks like the file was created by the daemon:
   ```
   ➜  ls -lah /var/run/docker.sock
   lrwxr-xr-x  1 root  daemon    46B Oct 16 16:32 /var/run/docker.sock -> /Users/fokkodriesprong/.docker/run/docker.sock
   ➜  last reboot
   reboot time    Mon Oct 16 16:29
   ```





Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]

2023-10-17 Thread via GitHub


jbonofre commented on PR #8766:
URL: https://github.com/apache/iceberg/pull/8766#issuecomment-1766072686

   @Fokko I think anyone using docker-desktop on Mac will have the same issue, 
so it applies to 99% of macOS users.





Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]

2023-10-17 Thread via GitHub


Fokko commented on PR #8766:
URL: https://github.com/apache/iceberg/pull/8766#issuecomment-1766080354

   @jbonofre I'm running docker desktop as well.





Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]

2023-10-17 Thread via GitHub


Fokko merged PR #8766:
URL: https://github.com/apache/iceberg/pull/8766





Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]

2023-10-17 Thread via GitHub


jbonofre commented on PR #8766:
URL: https://github.com/apache/iceberg/pull/8766#issuecomment-1766082002

   @Fokko and you don't have the symbolic link? Did you install docker-desktop 
from the Homebrew cask?





Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub


fengjiajie commented on code in PR #8808:
URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361854341


##
flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/data/FlinkParquetReaders.java:
##
@@ -262,7 +262,11 @@ public ParquetValueReader<?> primitive(
       switch (primitive.getPrimitiveTypeName()) {
         case FIXED_LEN_BYTE_ARRAY:
         case BINARY:
-          return new ParquetValueReaders.ByteArrayReader(desc);
+          if (expected.typeId() == Types.StringType.get().typeId()) {
+            return new StringReader(desc);
+          } else {
+            return new ParquetValueReaders.ByteArrayReader(desc);
+          }

Review Comment:
   > I am concerned about the backward compatibility of this change. Someone 
might already depend on reading them as binary, and this change would break 
their use case.
   
   This modification only applies to cases where the Iceberg column is defined 
as `string` and the Parquet column is `binary`. Previously, such cases would 
encounter the following exception (the unit test can reproduce it):
   
   ```
   java.lang.ClassCastException: [B cannot be cast to 
org.apache.flink.table.data.StringData
at 
org.apache.flink.table.data.GenericRowData.getString(GenericRowData.java:169)
at 
org.apache.iceberg.flink.data.TestFlinkParquetReader.testReadBinaryFieldAsString(TestFlinkParquetReader.java:137)
   ```






Re: [PR] fix: avro bytes test for Literal [iceberg-rust]

2023-10-17 Thread via GitHub


ZENOTME commented on PR #80:
URL: https://github.com/apache/iceberg-rust/pull/80#issuecomment-1766094815

   Thanks for your explanation! I totally understand it now!





Re: [PR] API, Core: Add uuid() to View [iceberg]

2023-10-17 Thread via GitHub


ajantha-bhat commented on code in PR #8851:
URL: https://github.com/apache/iceberg/pull/8851#discussion_r1361891676


##
api/src/main/java/org/apache/iceberg/view/View.java:
##
@@ -111,4 +112,13 @@ default ReplaceViewVersion replaceVersion() {
   default UpdateLocation updateLocation() {
     throw new UnsupportedOperationException("Updating a view's location is not supported");
   }
+
+  /**
+   * Returns the view's UUID
+   *
+   * @return the view's UUID
+   */
+  default UUID uuid() {

Review Comment:
   Can it be a `String` instead of a `UUID` object, to be similar to the table?






Re: [I] Iceberg table support specified column comments by flinksql create [iceberg]

2023-10-17 Thread via GitHub


huyuanfeng2018 commented on issue #8511:
URL: https://github.com/apache/iceberg/issues/8511#issuecomment-1766135578

   @stevenzwu In Flink 1.16 and before, it was impossible to parse the comment 
in the table creation statement. Starting from 1.17, the Flink community 
completed this part of the logic, so I think Iceberg should support parsing 
this information and adding the comment to the field. WDYT? I can do this part.





Re: [PR] API, Core: Add uuid() to View [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8851:
URL: https://github.com/apache/iceberg/pull/8851#discussion_r1361897579


##
api/src/main/java/org/apache/iceberg/view/View.java:
##
@@ -111,4 +112,13 @@ default ReplaceViewVersion replaceVersion() {
   default UpdateLocation updateLocation() {
     throw new UnsupportedOperationException("Updating a view's location is not supported");
   }
+
+  /**
+   * Returns the view's UUID
+   *
+   * @return the view's UUID
+   */
+  default UUID uuid() {

Review Comment:
   for tables it might be that the type ends up being UUID based on 
https://github.com/apache/iceberg/pull/8800#discussion_r1355112214






Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub


fengjiajie commented on PR #8808:
URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1766144239

   > This is a small change, so it might not be too hard to keep the different 
Flink version changes in sync, but usually we introduce changes on the latest 
Flink version and then create a separate backport PR for the older Flink 
versions.
   > 
   > This is better for the reviewer, as the change is smaller and it is easier 
to focus on the issues, and better for the contributor, since required changes 
do not need to be continuously merged to the other branches. And when the 
backport time comes, it is easier to check the backport-specific changes.
   
   Thank you for your review suggestions. I have made the changes to only 
include Flink 1.17.





Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2023-10-17 Thread via GitHub


fengjiajie commented on code in PR #8808:
URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361899155


##
flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkParquetReader.java:
##
@@ -81,26 +75,87 @@ public void testTwoLevelList() throws IOException {
     recordBuilder.set("topbytes", expectedBinary);
     GenericData.Record expectedRecord = recordBuilder.build();
 
-    writer.write(expectedRecord);
-    writer.close();
+    try (ParquetWriter<GenericData.Record> writer =
+        AvroParquetWriter.<GenericData.Record>builder(new Path(testFile.toURI()))
+            .withDataModel(GenericData.get())
+            .withSchema(avroSchema)
+            .config("parquet.avro.add-list-element-records", "true")
+            .config("parquet.avro.write-old-list-structure", "true")
+            .build()) {
+      writer.write(expectedRecord);
+    }
 
     try (CloseableIterable<RowData> reader =
         Parquet.read(Files.localInput(testFile))
             .project(schema)
             .createReaderFunc(type -> FlinkParquetReaders.buildReader(schema, type))
             .build()) {
       Iterator<RowData> rows = reader.iterator();
-      Assert.assertTrue("Should have at least one row", rows.hasNext());
+      assertThat(rows).as("Should have at least one row").hasNext();
+      RowData rowData = rows.next();
+      assertThat(expectedByte).isEqualTo(rowData.getArray(0).getBinary(0));
+      assertThat(expectedByte).isEqualTo(rowData.getBinary(1));
+      assertThat(rows).as("Should not have more than one row").isExhausted();
+    }
+  }
+
+  @Test
+  public void testReadBinaryFieldAsString() throws IOException {
+    Schema schemaForWriteBinary = new Schema(optional(1, "strbytes", Types.BinaryType.get()));
+    org.apache.avro.Schema avroSchema = AvroSchemaUtil.convert(schemaForWriteBinary.asStruct());
+
+    File testFile = temp.newFile();
+    assertThat(testFile.delete()).isTrue();
+
+    String expectedString = "hello";
+
+    GenericRecordBuilder recordBuilder = new GenericRecordBuilder(avroSchema);
+    ByteBuffer expectedBinary = ByteBuffer.wrap(expectedString.getBytes(StandardCharsets.UTF_8));
+    recordBuilder.set("strbytes", expectedBinary);
+    GenericData.Record expectedRecord = recordBuilder.build();
+
+    try (ParquetWriter<GenericData.Record> writer =
+        AvroParquetWriter.<GenericData.Record>builder(new Path(testFile.toURI()))
+            .withDataModel(GenericData.get())
+            .withSchema(avroSchema)
+            .build()) {
+      writer.write(expectedRecord);
+    }
+
+    // read as string
+    Schema schemaForReadBinaryAsString =
+        new Schema(optional(1, "strbytes", Types.StringType.get()));
+    try (CloseableIterable<RowData> reader =
+        Parquet.read(Files.localInput(testFile))
+            .project(schemaForReadBinaryAsString)
+            .createReaderFunc(
+                type -> FlinkParquetReaders.buildReader(schemaForReadBinaryAsString, type))
+            .build()) {
+      Iterator<RowData> rows = reader.iterator();
+      assertThat(rows).as("Should have at least one row").hasNext();
+      RowData rowData = rows.next();
+      assertThat(rowData.getString(0)).isInstanceOf(BinaryStringData.class);
+      assertThat(rowData.getString(0).toString()).isEqualTo(expectedString);
+      assertThat(rows).as("Should not have more than one row").isExhausted();
+    }
+
+    // read as byte[]

Review Comment:
   This unit test checks that the expected values can be read from the same file 
and column when the Iceberg column is defined as both string and binary.






Re: [I] Iceberg table support specified column comments by flinksql create [iceberg]

2023-10-17 Thread via GitHub


huyuanfeng2018 commented on issue #8511:
URL: https://github.com/apache/iceberg/issues/8511#issuecomment-1766177686

   > @stevenzwu In Flink 1.16 and before, it was impossible to parse the comment 
in the table creation statement. Starting from 1.17, the Flink community 
completed this part of the logic, so I think Iceberg should support parsing 
this information and adding the comment to the field. WDYT? I can do this part.
   
   https://issues.apache.org/jira/browse/FLINK-29679





Re: [PR] API, Core: Add uuid() to View [iceberg]

2023-10-17 Thread via GitHub


nk1506 commented on code in PR #8851:
URL: https://github.com/apache/iceberg/pull/8851#discussion_r1361819048


##
core/src/main/java/org/apache/iceberg/view/BaseView.java:
##
@@ -97,4 +98,9 @@ public ReplaceViewVersion replaceVersion() {
   public UpdateLocation updateLocation() {
     return new SetViewLocation(ops);
   }
+
+  @Override
+  public UUID uuid() {
+    return UUID.fromString(ops.current().uuid());

Review Comment:
   nit: for a few catalogs, this may go through the whole refresh mechanism. I am 
not sure whether we have a better approach or not.



##
api/src/main/java/org/apache/iceberg/view/View.java:
##
@@ -111,4 +112,13 @@ default ReplaceViewVersion replaceVersion() {
   default UpdateLocation updateLocation() {
     throw new UnsupportedOperationException("Updating a view's location is not supported");
   }
+
+  /**
+   * Returns the view's UUID
+   *
+   * @return the view's UUID

Review Comment:
   nit: the tag description and the summary are the same. IMO, we should have 
only one.






Re: [PR] Spark 3.5: Use Awaitility instead of Thread.sleep() [iceberg]

2023-10-17 Thread via GitHub


nk1506 commented on code in PR #8853:
URL: https://github.com/apache/iceberg/pull/8853#discussion_r1361945348


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/SparkSQLExecutionHelper.java:
##
@@ -42,28 +45,26 @@ public static String lastExecutedMetricValue(SparkSession spark, String metricName) {
     SQLExecutionUIData lastExecution = statusStore.executionsList().last();
     Option<SQLPlanMetric> sqlPlanMetric =
         lastExecution.metrics().find(metric -> metric.name().equals(metricName));
-    Assert.assertTrue(
-        String.format("Metric '%s' not found in last execution", metricName),
-        sqlPlanMetric.isDefined());
+    assertThat(sqlPlanMetric.isDefined())
+        .as(String.format("Metric '%s' not found in last execution", metricName))
+        .isTrue();
     long metricId = sqlPlanMetric.get().accumulatorId();
 
     // Refresh metricValues, they will remain null until the execution is complete and metrics are
     // aggregated
-    int attempts = 3;
-    while (lastExecution.metricValues() == null && attempts > 0) {
-      try {
-        Thread.sleep(100);
-        attempts--;
-      } catch (InterruptedException e) {
-        throw new RuntimeException(e);
-      }
+    Awaitility.await()

Review Comment:
   nit: would it be a good idea to add an alias to the `Awaitility` call here?






Re: [PR] API, Core: Add uuid() to View [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on code in PR #8851:
URL: https://github.com/apache/iceberg/pull/8851#discussion_r1361948344


##
api/src/main/java/org/apache/iceberg/view/View.java:
##
@@ -111,4 +112,13 @@ default ReplaceViewVersion replaceVersion() {
   default UpdateLocation updateLocation() {
     throw new UnsupportedOperationException("Updating a view's location is not supported");
   }
+
+  /**
+   * Returns the view's UUID
+   *
+   * @return the view's UUID
+   */
+  default UUID uuid() {

Review Comment:
   Yeah, sorry for the delay on that PR; I'm planning on updating it to an actual 
UUID. @ajantha-bhat, let me know what you think on that PR.






Re: [PR] Spark: Fix Fast forward before/after snapshot output for non-main branches [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on code in PR #8854:
URL: https://github.com/apache/iceberg/pull/8854#discussion_r1361985873


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java:
##
@@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) {
 return modifyIcebergTable(
 tableIdent,
 table -> {
-  long currentRef = table.currentSnapshot().snapshotId();
+  long currentRef = table.snapshot(source).snapshotId();
   table.manageSnapshots().fastForwardBranch(source, target).commit();

Review Comment:
   Another thing to fix... I think when I originally implemented the fastForward operation I switched the definitions of source/target in my head, so the API is confusing (in the API, target means the branch that is actually being fast-forwarded, and source is where target will be moved). That should probably be reversed so that target is actually the target to which source will be moved.
   
   I *think* we should be able to safely just rename the parameters in the API 
and update the javadoc. Luckily both are String parameters.
   
   cc @rdblue 






Re: [PR] Spark: Fix Fast forward before/after snapshot output for non-main branches [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on PR #8854:
URL: https://github.com/apache/iceberg/pull/8854#issuecomment-1766265102

   also cc @rakesh-das08 let me know your thoughts on this fix!





Re: [PR] feat: suport read/write Manifest [iceberg-rust]

2023-10-17 Thread via GitHub


JanKaul commented on code in PR #79:
URL: https://github.com/apache/iceberg-rust/pull/79#discussion_r1362007623


##
crates/iceberg/src/spec/manifest.rs:
##
@@ -0,0 +1,671 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Manifest for Iceberg.
+use super::{FormatVersion, ManifestContentType, PartitionSpec, Schema, Struct};
+use super::{Literal, Type};
+use crate::{Error, ErrorKind};
+use apache_avro::{from_value, Reader as AvroReader, Schema as AvroSchema};
+use std::collections::HashMap;
+use std::str::FromStr;
+
+/// A manifest contains metadata and a list of entries.
+pub struct Manifest {
+metadata: ManifestMetadata,
+entries: Vec<ManifestEntry>,
+}
+
+impl Manifest {
+/// Parse manifest from bytes of avro file.
+pub fn parse_avro(bs: &[u8]) -> Result<Self> {
+let reader = AvroReader::new(bs)?;
+
+// Parse manifest metadata
+let meta = reader.user_metadata();
+let metadata = ManifestMetadata::parse(meta)?;
+
+// Parse manifest entries
+let partition_type =
+
Type::Struct(metadata.partition_spec.partition_type(&metadata.schema)?);
+let mut entries = Vec::new();
+
+match metadata.format_version {
+FormatVersion::V1 => {
+let reader = AvroReader::with_schema(Self::v1_schema(), bs)?;
+for value in reader {
+entries.push(
+from_value::<_serde::ManifestEntryV1>(&value?)?
+.try_into(&partition_type, &metadata.schema)?,
+);
+}
+}
+FormatVersion::V2 => {
+let reader = AvroReader::with_schema(Self::v2_schema(), bs)?;
+for value in reader {
+entries.push(
+from_value::<_serde::ManifestEntryV2>(&value?)?
+.try_into(&partition_type, &metadata.schema)?,
+);
+}
+}
+};
+
+Ok(Manifest { metadata, entries })
+}
+
+fn v2_schema() -> &'static AvroSchema {
+todo!()
+}
+
+fn v1_schema() -> &'static AvroSchema {
+todo!()
+}
+}
+
+/// Meta data of a manifest.
+#[derive(Debug, PartialEq, Clone)]
+pub struct ManifestMetadata {
+/// The table schema at the time the manifest
+/// was written
+schema: Schema,
+/// ID of the schema used to write the manifest as a string
+schema_id: i32,
+/// The partition spec used  to write the manifest
+partition_spec: PartitionSpec,
+/// ID of the partition spec used to write the manifest as a string
+partition_spec_id: i32,
+/// Table format version number of the manifest as a string
+format_version: FormatVersion,
+/// Type of content files tracked by the manifest: “data” or “deletes”
+content: ManifestContentType,
+}
+
+impl ManifestMetadata {
+/// Parse from metadata in avro file.
+pub fn parse(meta: &HashMap<String, Vec<u8>>) -> Result<Self> {
+let schema = {
+let bs = meta.get("schema").ok_or_else(|| {
+Error::new(
+ErrorKind::DataInvalid,
+"schema is required in manifest metadata but not found",
+)
+})?;
+serde_json::from_slice::<Schema>(bs).map_err(|err| {
+Error::new(
+ErrorKind::DataInvalid,
+"Fail to parse schema in manifest metadata",
+)
+.with_source(err)
+})?
+};
+let schema_id: i32 = meta
+.get("schema-id")
+.and_then(|bs| {
+Some(String::from_utf8_lossy(bs).parse().map_err(|err| {
+Error::new(
+ErrorKind::DataInvalid,
+"Fail to parse schema id in manifest metadata",
+)
+.with_source(err)
+}))
+})
+.transpose()?
+.unwrap_or(0);
+let partition_spec = {
+let bs = meta.get("partition-spec").ok_or_else(|| {
+Error::new(
+ErrorKind::DataInvalid,
+   

Re: [PR] push down min/max/count to iceberg [iceberg]

2023-10-17 Thread via GitHub


atifiu commented on PR #6252:
URL: https://github.com/apache/iceberg/pull/6252#issuecomment-1766286164

   @amogh-jahagirdar I think I know how these delete files were generated even though copy-on-write is defined at the table level: I executed the delete from Trino, and since Trino only supports merge-on-read, the delete files were generated. Now I am finding it really difficult to remove these delete files. I will try removing the partition to see whether they get removed.





Re: [PR] Core: Avro writers use BlockingBinaryEncoder to enable array/map size calculations. [iceberg]

2023-10-17 Thread via GitHub


Fokko commented on PR #8625:
URL: https://github.com/apache/iceberg/pull/8625#issuecomment-1766306045

   I just realized that this would also speed up operations like snapshot expiration, because we do need to access the manifest files but don't need to use the metrics.





Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]

2023-10-17 Thread via GitHub


rakesh-das08 commented on PR #8854:
URL: https://github.com/apache/iceberg/pull/8854#issuecomment-1766321748

   @amogh-jahagirdar the fix LGTM. Thanks for fixing this.





Re: [PR] API, Core: Add uuid() to View [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8851:
URL: https://github.com/apache/iceberg/pull/8851#discussion_r1362046505


##
core/src/main/java/org/apache/iceberg/view/BaseView.java:
##
@@ -97,4 +98,9 @@ public ReplaceViewVersion replaceVersion() {
   public UpdateLocation updateLocation() {
 return new SetViewLocation(ops);
   }
+
+  @Override
+  public UUID uuid() {
+return UUID.fromString(ops.current().uuid());

Review Comment:
   it should not go through an update. The purpose of `current()` is to return 
the currently loaded metadata, without checking for updates. If there's a 
catalog where `current()` does a refresh, then that's a bug






Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]

2023-10-17 Thread via GitHub


ajantha-bhat commented on code in PR #8854:
URL: https://github.com/apache/iceberg/pull/8854#discussion_r1362035790


##
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestFastForwardBranchProcedure.java:
##
@@ -188,4 +188,38 @@ public void testInvalidFastForwardBranchCases() {
 .isInstanceOf(IllegalArgumentException.class)
 .hasMessage("Cannot handle an empty identifier for argument table");
   }
+
+  @Test
+  public void testFastForwardNonMain() {
+sql("CREATE TABLE %s (id int NOT NULL, data string) USING iceberg", 
tableName);
+sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
+Table table = validationCatalog.loadTable(tableIdent);
+table.refresh();
+
+String branch1 = "branch1";
+sql("ALTER TABLE %s CREATE BRANCH %s", tableName, branch1);
+String tableNameWithBranch = String.format("%s.branch_%s", tableName, 
branch1);

Review Comment:
   nit:  tableNameWithBranch -> tableNameWithBranch1 ?



##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java:
##
@@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) {
 return modifyIcebergTable(
 tableIdent,
 table -> {
-  long currentRef = table.currentSnapshot().snapshotId();
+  long currentRef = table.snapshot(source).snapshotId();
   table.manageSnapshots().fastForwardBranch(source, target).commit();

Review Comment:
   I too got confused today when I compared the replace branch API with this procedure's variables. Source and target are reversed.
   
   The procedure's named arguments 'branch' and 'to' are proper. It reads like "fast forward branch x to y". The only thing is that these internal variables in this procedure are reversed. I think we can rename them in this PR.
   
   `ManageSnapshots.replaceBranch` and `ManageSnapshots.fastForwardBranch` seem to have correct naming IMO
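   For reference, a hedged sketch of the procedure call using those named arguments; catalog, table, and branch names are placeholders:
   
   ```
   // "Fast forward branch 'branch1' to 'main'": branch is the ref being
   // advanced, to is the ref it is advanced to.
   spark.sql(
       "CALL my_catalog.system.fast_forward("
           + "table => 'db.tbl', branch => 'branch1', to => 'main')");
   ```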






Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]

2023-10-17 Thread via GitHub


ajantha-bhat commented on code in PR #8854:
URL: https://github.com/apache/iceberg/pull/8854#discussion_r1362051079


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java:
##
@@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) {
 return modifyIcebergTable(
 tableIdent,
 table -> {
-  long currentRef = table.currentSnapshot().snapshotId();
+  long currentRef = table.snapshot(source).snapshotId();

Review Comment:
   There is also a recent issue reported for fast forward on an empty branch
   https://github.com/apache/iceberg/issues/8849. 
   
   I have analyzed it, but it looks clumsy if we try to support a dummy snapshot. Let me know what you guys think.






Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on code in PR #8854:
URL: https://github.com/apache/iceberg/pull/8854#discussion_r1362063516


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java:
##
@@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) {
 return modifyIcebergTable(
 tableIdent,
 table -> {
-  long currentRef = table.currentSnapshot().snapshotId();
+  long currentRef = table.snapshot(source).snapshotId();

Review Comment:
   Right, actually I was investigating that issue, and when going through the procedure code I just noticed this one :) On the dummy snapshot idea, I need to think more. I think that idea has been floated around a few times and it logically makes sense; I just don't know all the implications of that change.






Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]

2023-10-17 Thread via GitHub


ajantha-bhat commented on code in PR #8854:
URL: https://github.com/apache/iceberg/pull/8854#discussion_r1362069529


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java:
##
@@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) {
 return modifyIcebergTable(
 tableIdent,
 table -> {
-  long currentRef = table.currentSnapshot().snapshotId();
+  long currentRef = table.snapshot(source).snapshotId();

Review Comment:
   Yeah, I also couldn't conclude on the implications. Until now, if the snapshot id is -1, we have assumed that the table is empty or that only a create-table happened.
   
   We also need a dummy snapshot id for the ancestor check to pass for fast-forward operations.
   Nessie uses a **constant** default hash on an empty branch to handle this kind of ancestor problem. Maybe we need to introduce something like that.






Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]

2023-10-17 Thread via GitHub


rakesh-das08 commented on code in PR #8854:
URL: https://github.com/apache/iceberg/pull/8854#discussion_r1362071081


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java:
##
@@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) {
 return modifyIcebergTable(
 tableIdent,
 table -> {
-  long currentRef = table.currentSnapshot().snapshotId();
+  long currentRef = table.snapshot(source).snapshotId();

Review Comment:
   I had thought of this case where we could introduce a dummy snapshot, but as you mentioned, it did not look like an elegant solution. And with PR https://github.com/apache/iceberg/pull/7652/ I basically just fall back to the underlying replace operation to throw an appropriate exception.






Re: [PR] Core: Avro writers use BlockingBinaryEncoder to enable array/map size calculations. [iceberg]

2023-10-17 Thread via GitHub


rustyconover commented on PR #8625:
URL: https://github.com/apache/iceberg/pull/8625#issuecomment-1766378641

   Yes it would!





Re: [PR] feat: suport read/write Manifest [iceberg-rust]

2023-10-17 Thread via GitHub


ZENOTME commented on code in PR #79:
URL: https://github.com/apache/iceberg-rust/pull/79#discussion_r1362079101


##
crates/iceberg/src/spec/manifest.rs:
##
@@ -0,0 +1,671 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Manifest for Iceberg.
+use super::{FormatVersion, ManifestContentType, PartitionSpec, Schema, Struct};
+use super::{Literal, Type};
+use crate::{Error, ErrorKind};
+use apache_avro::{from_value, Reader as AvroReader, Schema as AvroSchema};
+use std::collections::HashMap;
+use std::str::FromStr;
+
+/// A manifest contains metadata and a list of entries.
+pub struct Manifest {
+metadata: ManifestMetadata,
+entries: Vec<ManifestEntry>,
+}
+
+impl Manifest {
+/// Parse manifest from bytes of avro file.
+pub fn parse_avro(bs: &[u8]) -> Result<Self> {
+let reader = AvroReader::new(bs)?;
+
+// Parse manifest metadata
+let meta = reader.user_metadata();
+let metadata = ManifestMetadata::parse(meta)?;
+
+// Parse manifest entries
+let partition_type =
+
Type::Struct(metadata.partition_spec.partition_type(&metadata.schema)?);
+let mut entries = Vec::new();
+
+match metadata.format_version {
+FormatVersion::V1 => {
+let reader = AvroReader::with_schema(Self::v1_schema(), bs)?;
+for value in reader {
+entries.push(
+from_value::<_serde::ManifestEntryV1>(&value?)?
+.try_into(&partition_type, &metadata.schema)?,
+);
+}
+}
+FormatVersion::V2 => {
+let reader = AvroReader::with_schema(Self::v2_schema(), bs)?;
+for value in reader {
+entries.push(
+from_value::<_serde::ManifestEntryV2>(&value?)?
+.try_into(&partition_type, &metadata.schema)?,
+);
+}
+}
+};

Review Comment:
   Yes!






[I] Flaky test: TestSparkReaderDeletes.testEqualityDeleteWithDeletedColumn [iceberg]

2023-10-17 Thread via GitHub


ajantha-bhat opened a new issue, #8855:
URL: https://github.com/apache/iceberg/issues/8855

   TestSparkReaderDeletes > [format = orc, vectorized = false, planningMode = DISTRIBUTED] > testEqualityDeleteWithDeletedColumn
   
   PR: #8854
   
   Build: 
https://github.com/apache/iceberg/actions/runs/6546667781/job/1591189?pr=8854

   
   java.lang.AssertionError: Metric 'number of row deletes applied' not found in last execution
   	at org.junit.Assert.fail(Assert.java:89)
   	at org.junit.Assert.assertTrue(Assert.java:42)
   	at org.apache.iceberg.spark.source.SparkSQLExecutionHelper.lastExecutedMetricValue(SparkSQLExecutionHelper.java:45)
   	at org.apache.iceberg.spark.source.TestSparkReaderDeletes.deleteCount(TestSparkReaderDeletes.java:203)
   	at org.apache.iceberg.data.DeleteReadTests.checkDeleteCount(DeleteReadTests.java:188)
   	at org.apache.iceberg.spark.source.TestSparkReaderDeletes.testEqualityDeleteWithDeletedColumn(TestSparkReaderDeletes.java:421)





Re: [PR] API, Core: Add uuid() to View [iceberg]

2023-10-17 Thread via GitHub


ajantha-bhat commented on code in PR #8851:
URL: https://github.com/apache/iceberg/pull/8851#discussion_r1362101911


##
api/src/main/java/org/apache/iceberg/view/View.java:
##
@@ -111,4 +112,13 @@ default ReplaceViewVersion replaceVersion() {
   default UpdateLocation updateLocation() {
 throw new UnsupportedOperationException("Updating a view's location is not 
supported");
   }
+
+  /**
+   * Returns the view's UUID
+   *
+   * @return the view's UUID
+   */
+  default UUID uuid() {

Review Comment:
   Yeah, I am ok with either String or UUID, but I expect both view and table to follow the same type.
   LGTM.






Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]

2023-10-17 Thread via GitHub


ajantha-bhat commented on PR #8854:
URL: https://github.com/apache/iceberg/pull/8854#issuecomment-1766410898

   logged the flaky test: #8855 





Re: [PR] Flink 1.17: Use awaitility instead of Thread.sleep() [iceberg]

2023-10-17 Thread via GitHub


nk1506 commented on code in PR #8852:
URL: https://github.com/apache/iceberg/pull/8852#discussion_r1362111563


##
flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamingMonitorFunction.java:
##
@@ -111,14 +113,19 @@ public void testConsumeWithoutStartSnapshotId() throws 
Exception {
   TestSourceContext sourceContext = new TestSourceContext(latch);
   runSourceFunctionInTask(sourceContext, function);
 
-  Assert.assertTrue(
-  "Should have expected elements.", latch.await(WAIT_TIME_MILLIS, 
TimeUnit.MILLISECONDS));
-  Thread.sleep(1000L);
+  Awaitility.await()

Review Comment:
   IMO, we should move this common part to one place.



##
flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceFailover.java:
##
@@ -98,9 +98,9 @@ protected List generateRecords(int numRecords, long 
seed) {
 return RandomGenericData.generate(schema(), numRecords, seed);
   }
 
-  protected void assertRecords(
-  Table table, List expectedRecords, Duration interval, int 
maxCount) throws Exception {
-SimpleDataUtil.assertTableRecords(table, expectedRecords, interval, 
maxCount);
+  protected void assertRecords(Table table, List expectedRecords, 
Duration interval)

Review Comment:
   do we need this method? can't we replace it with
   `SimpleDataUtil.assertTableRecords(table, expectedRecords, interval);`?






Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar closed pull request #8854: Spark: Fix Fast forward procedure 
output for non-main branches
URL: https://github.com/apache/iceberg/pull/8854





Re: [PR] API, Core: Add uuid() to View [iceberg]

2023-10-17 Thread via GitHub


nk1506 commented on code in PR #8851:
URL: https://github.com/apache/iceberg/pull/8851#discussion_r1362126880


##
core/src/main/java/org/apache/iceberg/view/BaseView.java:
##
@@ -97,4 +98,9 @@ public ReplaceViewVersion replaceVersion() {
   public UpdateLocation updateLocation() {
 return new SetViewLocation(ops);
   }
+
+  @Override
+  public UUID uuid() {
+return UUID.fromString(ops.current().uuid());

Review Comment:
   I might be missing something here, but it seems `current()` can do the refresh [here](https://github.com/apache/iceberg/blob/cb20bdbea7b299b2948e908a7775ce90818e6a92/core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java#L75), unless some catalog has its own implementation.






Re: [PR] Spec: add nanosecond timestamp types [iceberg]

2023-10-17 Thread via GitHub


Fokko commented on code in PR #8683:
URL: https://github.com/apache/iceberg/pull/8683#discussion_r1362161257


##
format/spec.md:
##
@@ -874,6 +878,11 @@ Maps with non-string keys must use an array representation 
with the `map` logica
 |**`list`**|`array`||
 |**`map`**|`array` of key-value records, or `map` when keys are strings 
(optional).|Array storage must use logical type name `map` and must store 
elements that are 2-field records. The first field is a non-null key and the 
second field is the value.|
 
+Notes:
+
+1. Avro type annotation `adjust-to-utc` is an Iceberg convention; default 
value is `false` if not present.
+2. Avro logical type `timestamp-nanos` is an Iceberg convention; the Avro 
specification does not define this type.

Review Comment:
   It depends on what comes out of the Avro PR, but I think we're good for now.
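   To illustrate the two conventions from the notes above, a hedged sketch of an Avro field schema written out as a Java string (the field name is a placeholder):
   
   ```
   // "adjust-to-utc" marks UTC-adjusted (timestamptz) semantics, and
   // "timestamp-nanos" is the Iceberg-defined logical type that plain
   // Avro readers will simply ignore.
   String timestampNanosField =
       "{\"name\": \"event_ts\","
           + " \"type\": {\"type\": \"long\","
           + " \"logicalType\": \"timestamp-nanos\","
           + " \"adjust-to-utc\": true}}";
   ```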






[I] Improve `All` Metadata Tables with Snapshot Information [iceberg]

2023-10-17 Thread via GitHub


RussellSpitzer opened a new issue, #8856:
URL: https://github.com/apache/iceberg/issues/8856

   ### Feature Request / Improvement
   
   Currently all versions of metadata tables have exactly the same schema as their non-"all" counterparts. This is actually not very useful if you are attempting to locate the state of a particular entry at a specific time, because the `snapshot_id` always just shows the file's original snapshot.
   
   For example the entries table looks like
   
   ```
   scala> spark.sql("SELECT * FROM db.timezoned.entries").show
   warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
   +------+-------------------+---------------+--------------------+--------------------+--------------------+
   |status|        snapshot_id|sequence_number|file_sequence_number|           data_file|    readable_metrics|
   +------+-------------------+---------------+--------------------+--------------------+--------------------+
   |     1|6561920950175488866|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     1|5535987506380389562|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     1|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   ```
   
   And `all_entries` looks like
   
   ```
   scala> spark.sql("SELECT * FROM db.timezoned.all_entries").show
   warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
   +------+-------------------+---------------+--------------------+--------------------+--------------------+
   |status|        snapshot_id|sequence_number|file_sequence_number|           data_file|    readable_metrics|
   +------+-------------------+---------------+--------------------+--------------------+--------------------+
   |     1|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     1|6561920950175488866|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     1|5535987506380389562|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     0|6561920950175488866|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     0|5535987506380389562|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     0|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     0|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |     0|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   +------+-------------------+---------------+--------------------+--------------------+--------------------+
   ```
   
   By looking at all_entries it is impossible for me to determine whether all the manifests were rewritten at once or in groups. Ideally we would see something like
   
   ```
   
   +-------------------+-------------+------+-------------------+---------------+--------------------+--------------------+--------------------+
   |              as_of|         time|status|        snapshot_id|sequence_number|file_sequence_number|           data_file|    readable_metrics|
   +-------------------+-------------+------+-------------------+---------------+--------------------+--------------------+--------------------+
   |7179885233531513409|1697493267302|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |2750236691316126600|1697493268363|     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |2750236691316126600|1697493268363|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |2517256618694516958|1697493269568|     1|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |2517256618694516958|1697493269568|     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |2517256618694516958|1697493269568|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
   |5535987506380389562|1697493270419|     1|5535987506380389562|              0|
   ```
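   As a hedged sketch of approximating the desired as_of column today, one query per snapshot over the same example table (the loop and names are illustrative):
   
   ```
   import java.util.List;
   import org.apache.spark.sql.Row;
   
   // Stamp each entries query with the snapshot it was read as of; this
   // emulates the proposed as_of column at the cost of one query per snapshot.
   List<Row> snaps =
       spark.sql("SELECT snapshot_id FROM db.timezoned.snapshots").collectAsList();
   for (Row snap : snaps) {
     long id = snap.getLong(0);
     spark
         .sql("SELECT " + id + " AS as_of, * FROM db.timezoned.entries VERSION AS OF " + id)
         .show();
   }
   ```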

Re: [PR] Spark 3.5: Use Awaitility instead of Thread.sleep() [iceberg]

2023-10-17 Thread via GitHub


nastra merged PR #8853:
URL: https://github.com/apache/iceberg/pull/8853





Re: [PR] Spark 3.5: Use Awaitility instead of Thread.sleep() [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8853:
URL: https://github.com/apache/iceberg/pull/8853#discussion_r1362301067


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/SparkSQLExecutionHelper.java:
##
@@ -42,28 +45,26 @@ public static String lastExecutedMetricValue(SparkSession 
spark, String metricNa
 SQLExecutionUIData lastExecution = statusStore.executionsList().last();
 Option<SQLPlanMetric> sqlPlanMetric =
 lastExecution.metrics().find(metric -> 
metric.name().equals(metricName));
-Assert.assertTrue(
-String.format("Metric '%s' not found in last execution", metricName),
-sqlPlanMetric.isDefined());
+assertThat(sqlPlanMetric.isDefined())
+.as(String.format("Metric '%s' not found in last execution", 
metricName))
+.isTrue();
 long metricId = sqlPlanMetric.get().accumulatorId();
 
 // Refresh metricValues, they will remain null until the execution is 
complete and metrics are
 // aggregated
-int attempts = 3;
-while (lastExecution.metricValues() == null && attempts > 0) {
-  try {
-Thread.sleep(100);
-attempts--;
-  } catch (InterruptedException e) {
-throw new RuntimeException(e);
-  }
+Awaitility.await()

Review Comment:
   in certain cases it makes sense to add an alias, but here I think we should 
be ok, since there's a comment right above that explains what's being done






Re: [PR] API, Core: Add uuid() to View [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8851:
URL: https://github.com/apache/iceberg/pull/8851#discussion_r1362309604


##
core/src/main/java/org/apache/iceberg/view/BaseView.java:
##
@@ -97,4 +98,9 @@ public ReplaceViewVersion replaceVersion() {
   public UpdateLocation updateLocation() {
 return new SetViewLocation(ops);
   }
+
+  @Override
+  public UUID uuid() {
+return UUID.fromString(ops.current().uuid());

Review Comment:
   that kind of (mandatory) refresh is expected, because it's a lazy refresh that is typically done after a commit. This means that the underlying metadata changed after the commit, but we refresh only when accessing data after the commit happened. The same is done in `BaseMetastoreTableOperations`
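   A small hedged sketch of the contract described here, with illustrative variable names:
   
   ```
   // current() answers from the already-loaded metadata; refresh() is the
   // explicit reload that callers opt into, typically after a commit.
   ViewMetadata loaded = ops.current();  // no catalog round-trip expected
   ViewMetadata latest = ops.refresh();  // explicit reload from the catalog
   ```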






[PR] Nessie: retain authorship information when creating a namespace [iceberg]

2023-10-17 Thread via GitHub


adutra opened a new pull request, #8857:
URL: https://github.com/apache/iceberg/pull/8857

   This change enhances the process of creating new namespaces by retaining 
commit authorship information when committing the new namespace.
   
   It also switches to Nessie API V2 for the commit operation.
   
   See https://github.com/projectnessie/nessie/issues/6501 for context.





Re: [PR] Flink 1.17: Use awaitility instead of Thread.sleep() [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8852:
URL: https://github.com/apache/iceberg/pull/8852#discussion_r1362315314


##
flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceFailover.java:
##
@@ -98,9 +98,9 @@ protected List generateRecords(int numRecords, long 
seed) {
 return RandomGenericData.generate(schema(), numRecords, seed);
   }
 
-  protected void assertRecords(
-  Table table, List expectedRecords, Duration interval, int 
maxCount) throws Exception {
-SimpleDataUtil.assertTableRecords(table, expectedRecords, interval, 
maxCount);
+  protected void assertRecords(Table table, List expectedRecords, 
Duration interval)

Review Comment:
   that's a valid point, I'm not sure why this method existed in the first 
place. I've removed it






Re: [PR] Flink 1.17: Use awaitility instead of Thread.sleep() [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8852:
URL: https://github.com/apache/iceberg/pull/8852#discussion_r1362375826


##
flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamingMonitorFunction.java:
##
@@ -111,14 +113,19 @@ public void testConsumeWithoutStartSnapshotId() throws 
Exception {
   TestSourceContext sourceContext = new TestSourceContext(latch);
   runSourceFunctionInTask(sourceContext, function);
 
-  Assert.assertTrue(
-  "Should have expected elements.", latch.await(WAIT_TIME_MILLIS, 
TimeUnit.MILLISECONDS));
-  Thread.sleep(1000L);
+  Awaitility.await()

Review Comment:
   makes sense, done






Re: [PR] Flink: Reverting the default custom partitioner for bucket column [iceberg]

2023-10-17 Thread via GitHub


stevenzwu merged PR #8848:
URL: https://github.com/apache/iceberg/pull/8848





[PR] Flink: Reverting the default custom partitioner for bucket column (#8848) [iceberg]

2023-10-17 Thread via GitHub


nastra opened a new pull request, #8858:
URL: https://github.com/apache/iceberg/pull/8858

   This backports #8848 to 1.4.x





Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]

2023-10-17 Thread via GitHub


nastra commented on issue #8847:
URL: https://github.com/apache/iceberg/issues/8847#issuecomment-1766722846

   Closing this as #8848 has been merged to main and I backported it to 1.4.x 
in https://github.com/apache/iceberg/pull/8858





Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]

2023-10-17 Thread via GitHub


nastra closed issue #8847: Flink: revert the automatic application of custom 
partitioner for bucketing column with hash distribution
URL: https://github.com/apache/iceberg/issues/8847





Re: [PR] [1.4.x] Core: Do not use a lazy split offset list in manifests (#8834) [iceberg]

2023-10-17 Thread via GitHub


rdblue merged PR #8845:
URL: https://github.com/apache/iceberg/pull/8845





Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

2023-10-17 Thread via GitHub


stevenzwu commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1362401859


##
api/src/main/java/org/apache/iceberg/ContentFile.java:
##
@@ -177,4 +191,26 @@ default Long fileSequenceNumber() {
   default F copy(boolean withStats) {
 return withStats ? copy() : copyWithoutStats();
   }
+
+  /**
+   * Copies this file (potentially with or without specific column stats). 
Manifest readers can
+   * reuse file instances; use this method to copy data when collecting files 
from tasks.
+   *
+   * @param withStats Will copy this file without file stats if set to 
false.
+   * @param statsToKeep Will keep stats only for these columns. Not used if 
withStats
+   * is set to false.
+   * @return a copy of this data file. If "withStats" is set to 
false the file will not
+   * contain lower bounds, upper bounds, value counts, null value counts, 
or nan value counts.
+   * If "withStats" is set to true and the "statsToKeep" is 
not empty then only
+   * specific column stats will be kept.
+   */
+  default F copy(boolean withStats, Collection statsToKeep) {

Review Comment:
   is this needed in this interface? `copyWithSpecificStats` from line 177 
should imply `withStats` to be true.
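   A hedged usage sketch of the proposed variant from the diff, assuming the collection holds column field ids (the diff above does not confirm the element type):
   
   ```
   import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
   
   // Copy a file collected from a task, keeping stats only for column id 1.
   DataFile slim = dataFile.copy(true /* withStats */, ImmutableSet.of(1));
   
   // Existing equivalent of copy(false, ...): drop all column stats.
   DataFile noStats = dataFile.copyWithoutStats();
   ```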






Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#issuecomment-1766749771

   cc @bryanck 





Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


bryanck commented on code in PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362407399


##
core/src/main/java/org/apache/iceberg/BaseFile.java:
##
@@ -460,6 +460,11 @@ public ByteBuffer keyMetadata() {
 
   @Override
   public List splitOffsets() {
+// If the last split offset is past the file size this means the split 
offsets are corrupted and should not be used
+if (splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) {

Review Comment:
   you'll need a null check here






Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on code in PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362408168


##
core/src/main/java/org/apache/iceberg/BaseFile.java:
##
@@ -460,6 +460,11 @@ public ByteBuffer keyMetadata() {
 
   @Override
   public List splitOffsets() {
+// If the last split offset is past the file size this means the split 
offsets are corrupted and should not be used
+if (splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) {

Review Comment:
   Ooh yes, good catch






Re: [PR] Add an expireAfterWrite cache eviction policy to CachingCatalog [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8844:
URL: https://github.com/apache/iceberg/pull/8844#discussion_r1362413695


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/TestCachingCatalogExpirationAfterWrite.java:
##
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark;
+
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.hive.HiveClientPool;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.spark.sql.connector.catalog.Identifier;
+import org.awaitility.Awaitility;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+import org.junit.Test;
+import org.junit.runners.Parameterized;
+
+public class TestCachingCatalogExpirationAfterWrite extends 
SparkCatalogTestBase {

Review Comment:
   +1
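   For reference, a hedged sketch of the eviction policy named in the PR title, expressed directly against Caffeine (which `CachingCatalog` builds on); key/value types and the TTL are illustrative:
   
   ```
   import java.time.Duration;
   import com.github.benmanes.caffeine.cache.Cache;
   import com.github.benmanes.caffeine.cache.Caffeine;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.catalog.TableIdentifier;
   
   // Entries become eligible for eviction a fixed time after they were
   // written, no matter how often they are read in between.
   Cache<TableIdentifier, Table> cache =
       Caffeine.newBuilder().expireAfterWrite(Duration.ofMinutes(10)).build();
   ```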






Re: [PR] Python: Add support for Python 3.12 [iceberg-python]

2023-10-17 Thread via GitHub


steinsgateted commented on PR #35:
URL: https://github.com/apache/iceberg-python/pull/35#issuecomment-1766760247

   @jayceslesar Thank you for the information





Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


bryanck commented on code in PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362417563


##
core/src/main/java/org/apache/iceberg/BaseFile.java:
##
@@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() {
 
   @Override
   public List splitOffsets() {
+// If the last split offset is past the file size this means the split 
offsets are corrupted and
+// should not be used
+if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= 
fileSizeInBytes) {

Review Comment:
   Maybe a length check also? It's not likely to be 0, but just in case.
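   Combining the suggestions in this thread, a hedged sketch of the guarded accessor (the list-conversion helper at the end is illustrative):
   
   ```
   @Override
   public List<Long> splitOffsets() {
     // Offsets are unusable if absent, empty, or pointing past the end of
     // the file (corrupted metadata); callers then plan without offsets.
     if (splitOffsets == null
         || splitOffsets.length == 0
         || splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) {
       return null;
     }
   
     return toLongList(splitOffsets); // illustrative conversion helper
   }
   ```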






Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


bryanck commented on code in PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362419546


##
core/src/main/java/org/apache/iceberg/BaseFile.java:
##
@@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() {
 
   @Override
   public List splitOffsets() {
+// If the last split offset is past the file size this means the split 
offsets are corrupted and
+// should not be used
+if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= 
fileSizeInBytes) {

Review Comment:
   Also a test would be nice






Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on code in PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362421006


##
core/src/main/java/org/apache/iceberg/BaseFile.java:
##
@@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() {
 
  @Override
  public List<Long> splitOffsets() {
+    // If the last split offset is past the file size this means the split offsets are corrupted and
+    // should not be used
+    if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) {

Review Comment:
   yeah just realized that, updated! Looking into where to add the test






Re: [PR] [1.4.x] Flink: Reverting the default custom partitioner for bucket column (#8848) [iceberg]

2023-10-17 Thread via GitHub


nastra merged PR #8858:
URL: https://github.com/apache/iceberg/pull/8858





Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8804:
URL: https://github.com/apache/iceberg/pull/8804#discussion_r1362428279


##
api/src/test/java/org/apache/iceberg/metrics/TestDefaultTimer.java:
##
@@ -101,14 +103,7 @@ public void closeableTimer() throws InterruptedException {
  @Test
  public void measureRunnable() {
    Timer timer = new DefaultTimer(TimeUnit.NANOSECONDS);
-    Runnable runnable =

Review Comment:
   the goal here is actually to measure the duration of the `runnable`, so this 
isn't something that we want to replace with Awaitility
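
For context, the intent of that test is roughly this: the sleep is the workload whose duration the timer reports, so replacing it with a polling wait would change what is being measured. A minimal illustration using plain System.nanoTime rather than the project's Timer API (the 100 ms value is illustrative):

    import java.util.concurrent.TimeUnit;

    public class MeasureRunnableSketch {
      public static void main(String[] args) {
        // the sleep *is* the workload under measurement, so it must remain a fixed delay
        Runnable workload =
            () -> {
              try {
                TimeUnit.MILLISECONDS.sleep(100);
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
              }
            };

        long start = System.nanoTime();
        workload.run();
        long elapsed = System.nanoTime() - start;

        // a timer wrapping the runnable should report at least the 100 ms sleep
        System.out.printf("elapsed: %d ms%n", TimeUnit.NANOSECONDS.toMillis(elapsed));
      }
    }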






Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8804:
URL: https://github.com/apache/iceberg/pull/8804#discussion_r1362427190


##
api/src/test/java/org/apache/iceberg/TestHelpers.java:
##
@@ -62,6 +70,54 @@ public static long waitUntilAfter(long timestampMillis) {
    return current;
  }
 
+  /** wait for fixed duration */
+  public static void delayInMilliseconds(String message, long delay, boolean useSameThread) {

Review Comment:
   I don't think we want to use Awaitility just to wait a fixed amount of 
time; it should be used to wait until a condition becomes true
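
The distinction, roughly: Awaitility is for polling until a condition becomes true, not for pausing a fixed duration. A minimal sketch of the intended usage (the timeout, poll interval, and background task are illustrative):

    import static org.awaitility.Awaitility.await;

    import java.time.Duration;
    import java.util.concurrent.atomic.AtomicBoolean;

    public class AwaitilitySketch {
      public static void main(String[] args) {
        AtomicBoolean ready = new AtomicBoolean(false);

        // some background work flips the flag once it finishes
        new Thread(
                () -> {
                  try {
                    Thread.sleep(200);
                  } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                  }
                  ready.set(true);
                })
            .start();

        // poll until the condition holds, instead of sleeping a fixed amount of time
        await()
            .atMost(Duration.ofSeconds(5))
            .pollInterval(Duration.ofMillis(50))
            .until(ready::get);
      }
    }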






Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8804:
URL: https://github.com/apache/iceberg/pull/8804#discussion_r1362429576


##
aws/src/integration/java/org/apache/iceberg/aws/TestAssumeRoleAwsClientFactory.java:
##
@@ -189,7 +192,17 @@ public void testAssumeRoleS3FileIO() throws Exception {
    Assert.assertFalse("should be able to access file", inputFile.exists());
  }
 
-  private void waitForIamConsistency() throws Exception {
-    Thread.sleep(1); // sleep to make sure IAM up to date
+  private void waitForIamConsistency() {
+    TestHelpers.delayUntilTrueCondition(
+        "wait for IAM role policy to update.",
+        1000,
+        1001,
+        10001,

Review Comment:
   I think the parameters are rather confusing. It would be better to use 
Awaitility directly here 
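
Used directly, the helper's positional parameters collapse into Awaitility's fluent call. Something along these lines, assuming `getRolePolicy` fails until the newly attached policy is visible (the durations and request-building details are illustrative, not the PR's code):

    import java.time.Duration;
    import org.awaitility.Awaitility;
    import software.amazon.awssdk.services.iam.IamClient;
    import software.amazon.awssdk.services.iam.model.GetRolePolicyRequest;

    class IamConsistencySketch {
      // poll IAM until the role policy is readable, rather than sleeping blindly
      static void waitForIamConsistency(IamClient iam, String roleName, String policyName) {
        Awaitility.await("wait for IAM role policy to update")
            .pollInterval(Duration.ofSeconds(1))
            .atMost(Duration.ofSeconds(10))
            .ignoreExceptions()
            .untilAsserted(
                () ->
                    iam.getRolePolicy(
                        GetRolePolicyRequest.builder()
                            .roleName(roleName)
                            .policyName(policyName)
                            .build()));
      }
    }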






Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8804:
URL: https://github.com/apache/iceberg/pull/8804#discussion_r1362431071


##
aws/src/integration/java/org/apache/iceberg/aws/TestAssumeRoleAwsClientFactory.java:
##
@@ -189,7 +192,17 @@ public void testAssumeRoleS3FileIO() throws Exception {
    Assert.assertFalse("should be able to access file", inputFile.exists());
  }
 
-  private void waitForIamConsistency() throws Exception {
-    Thread.sleep(1); // sleep to make sure IAM up to date
+  private void waitForIamConsistency() {
+    TestHelpers.delayUntilTrueCondition(
+        "wait for IAM role policy to update.",
+        1000,
+        1001,
+        10001,
+        TimeUnit.MILLISECONDS,
+        () ->
+            iam.getRolePolicy(

Review Comment:
   is that the check required to determine whether IAM is consistent?






Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]

2023-10-17 Thread via GitHub


nastra commented on code in PR #8804:
URL: https://github.com/apache/iceberg/pull/8804#discussion_r1362441102


##
aws/src/integration/java/org/apache/iceberg/aws/dynamodb/TestDynamoDbLockManager.java:
##
@@ -141,11 +142,8 @@ public void testAcquireSingleProcess() throws Exception {
 
    CompletableFuture.supplyAsync(
        () -> {
-          try {
-            Thread.sleep(5000);
-          } catch (InterruptedException e) {
-            throw new RuntimeException(e);
-          }
+          TestHelpers.delayInMilliseconds(

Review Comment:
   this goes back to an earlier comment I had on 
https://github.com/apache/iceberg/pull/8715#issuecomment-1748190639. We don't 
just want to replace `Thread.sleep()` usage blindly with Awaitility by waiting 
the same amount of time. The important piece is that we'd need to understand 
what kind of condition the test is trying to _eventually_ reach, which we can 
then check by using Awaitility (rather than just sleeping X seconds).
   
   I've opened https://github.com/apache/iceberg/pull/8853 and 
https://github.com/apache/iceberg/pull/8852 to give an idea how that might look 
for other places in the code






Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on code in PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362445708


##
core/src/main/java/org/apache/iceberg/BaseFile.java:
##
@@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() {
 
  @Override
  public List<Long> splitOffsets() {
+    // If the last split offset is past the file size this means the split offsets are corrupted and
+    // should not be used
+    if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) {

Review Comment:
   Looks like `TestSplitPlanning` is the best place for verifying the behavior 
of split planning when the offsets are corrupted.
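
Such a test could look roughly like the following sketch, using an unpartitioned spec and illustrative values rather than the PR's actual test:

    import static org.assertj.core.api.Assertions.assertThat;

    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
    import org.junit.jupiter.api.Test;

    public class TestCorruptedSplitOffsetsSketch {
      @Test
      public void splitOffsetsPastFileLengthAreIgnored() {
        DataFile file =
            DataFiles.builder(PartitionSpec.unpartitioned())
                .withPath("/path/to/data-corrupted.parquet")
                .withFileSizeInBytes(10)
                .withRecordCount(1)
                // the last offset (1000) is past the 10-byte file length, so offsets are corrupted
                .withSplitOffsets(ImmutableList.of(2L, 1000L))
                .build();

        // the guarded accessor should ignore the corrupted offsets entirely
        assertThat(file.splitOffsets()).isNull();
      }
    }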






Re: [I] Replace Thread.sleep() usage in test code with Awaitility [iceberg]

2023-10-17 Thread via GitHub


nastra commented on issue #7154:
URL: https://github.com/apache/iceberg/issues/7154#issuecomment-1766792091

   I've opened https://github.com/apache/iceberg/pull/8853 and 
https://github.com/apache/iceberg/pull/8852 to give an idea about places that 
are good candidates to replace with Awaitility. I believe there are more 
similar places in the codebase that can be fixed as part of #7154





Re: [PR] [1.4.x] AWS: avoid static global credentials provider which doesn't play well with lifecycle management (#8677) [iceberg]

2023-10-17 Thread via GitHub


nastra commented on PR #8843:
URL: https://github.com/apache/iceberg/pull/8843#issuecomment-1766801596

   > @nastra and @singhpk234, is this safe for a patch release? It seems like a 
behavior change that would only be safe if the behavior is always the same when 
creating multiple credentials providers.
   
   It appears that multiple people ran into this issue with 1.3.1 on [Slack and 
on the ML](https://lists.apache.org/thread/1nwb11v1qn1nr86stofkvqlb32fp0q9d). I 
don't think those people have an easy way to resolve the issue on their own 
(without building a custom jar with the proposed fix)





Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on code in PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362467796


##
core/src/test/java/org/apache/iceberg/TableTestBase.java:
##
@@ -110,7 +110,7 @@ public class TableTestBase {
  static final DataFile FILE_C =
      DataFiles.builder(SPEC)
          .withPath("/path/to/data-c.parquet")
-          .withFileSizeInBytes(10)
+          .withFileSizeInBytes(3_000_000L)

Review Comment:
   I changed this so split offsets are within the file size for the existing 
test files we use (otherwise 
TestManifestReader#testReaderWithFilterWithoutSelect ends up failing due to 
expectations on the split offsets). I'm adding a new test for a new file which 
has corrupted offsets.






Re: [PR] [1.4.x] AWS: avoid static global credentials provider which doesn't play well with lifecycle management (#8677) [iceberg]

2023-10-17 Thread via GitHub


stevenzwu commented on PR #8843:
URL: https://github.com/apache/iceberg/pull/8843#issuecomment-1766827281

   > would only be safe if the behavior is always the same when creating 
multiple credentials providers.
   
   @rdblue I think it is a safe change from a static global singleton to 
creating new objects for object lifecycle management
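
Generically, the pattern under discussion looks like this sketch (AWS SDK v2 types; this is not the PR's actual diff):

    import software.amazon.awssdk.auth.credentials.AwsCredentialsProvider;
    import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;

    class CredentialsProviderLifecycleSketch {
      // before: one static provider shared by every client, so no single owner
      // can safely close it or control its lifecycle
      private static final AwsCredentialsProvider SHARED = DefaultCredentialsProvider.create();

      // after: each factory call builds its own provider, tying its lifecycle
      // to the client that owns it
      static AwsCredentialsProvider newProvider() {
        return DefaultCredentialsProvider.create();
      }
    }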





Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on code in PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362487232


##
core/src/main/java/org/apache/iceberg/BaseFile.java:
##
@@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() {
 
  @Override
  public List<Long> splitOffsets() {
+    // If the last split offset is past the file size this means the split offsets are corrupted and
+    // should not be used
+    if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) {

Review Comment:
   Actually, `TestManifestReader` seems better for verifying the split offsets 
are null when reading the file entries. This seems like a cleaner test since 
split planning tests already verify the behavior when explicit split offsets 
aren't defined. Manifest reader tests can help us explicitly verify the 
behavior when the offsets are corrupted.






Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


amogh-jahagirdar commented on code in PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362490912


##
core/src/test/java/org/apache/iceberg/TableTestBase.java:
##
@@ -110,7 +110,7 @@ public class TableTestBase {
  static final DataFile FILE_C =
      DataFiles.builder(SPEC)
          .withPath("/path/to/data-c.parquet")
-          .withFileSizeInBytes(10)
+          .withFileSizeInBytes(3_000_000L)

Review Comment:
   Err nvm, looks like split offsets for these files were added in #8834. 
Instead of changing the file size to fit the offsets (this breaks more tests 
that I need to go and update), I'll reverse it; I'll change the offsets to be 
within the file size since that was part of a more recent change and is 
isolated to a single test as far as I can tell.






Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]

2023-10-17 Thread via GitHub


singhpk234 commented on code in PR #8860:
URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362493799


##
core/src/main/java/org/apache/iceberg/BaseFile.java:
##
@@ -460,6 +460,16 @@ public ByteBuffer keyMetadata() {
 
  @Override
  public List<Long> splitOffsets() {
+    if (splitOffsets == null || splitOffsets.length == 0) {
+      return null;
+    }
+
+    // If the last split offset is past the file size this means the split offsets are corrupted and

Review Comment:
   [doubt] wondering if throwing an exception or adding a precondition would 
be helpful for identifying the buggy writer?
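
The stricter alternative being floated would look roughly like this sketch (illustrative only; the PR as written returns null so that reads keep working):

    import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

    class SplitOffsetsPreconditionSketch {
      // fail fast instead of silently dropping corrupted offsets, so the buggy writer surfaces
      static void validateSplitOffsets(long[] splitOffsets, long fileSizeInBytes) {
        if (splitOffsets == null || splitOffsets.length == 0) {
          return;
        }

        long last = splitOffsets[splitOffsets.length - 1];
        Preconditions.checkState(
            last < fileSizeInBytes,
            "Last split offset %s is past file length %s: offsets were written incorrectly",
            last,
            fileSizeInBytes);
      }
    }

The trade-off is that throwing would also fail reads of tables that already carry the corrupted metadata, which is presumably why the change ignores the offsets instead.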






Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

2023-10-17 Thread via GitHub


stevenzwu commented on PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#issuecomment-1766844420

   @pvary I think we probably want to push the `copyStatsForColumns` down to 
ManifestReader. 
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestReader.java#L299
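
The shape of that push-down might be roughly as follows; `copyStatsForColumns` and `requestedColumnIds` are placeholder names from the discussion, not an existing ManifestReader API:

    // hypothetical sketch inside ManifestReader: filter column stats while entries
    // are materialized, instead of after planning has already copied full stats
    CloseableIterable<F> files =
        CloseableIterable.transform(
            liveEntries(), entry -> copyStatsForColumns(entry.file(), requestedColumnIds));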




