Re: [I] fast_forward does not work for the first commit in Spark [iceberg]
ajantha-bhat commented on issue #8849: URL: https://github.com/apache/iceberg/issues/8849#issuecomment-1765789016 Hmm, it is not just the missing null check in the procedure; it later fails because the reference MAIN does not exist during the replace-branch operation. https://github.com/apache/iceberg/blob/46cad6ddaeff8104d96defab25206a4ff7e01629/core/src/main/java/org/apache/iceberg/UpdateSnapshotReferencesOperation.java#L123 The reason is that we create the main branch only during the first snapshot creation, so in this scenario no main branch exists for the replace operation. I thought of blocking this scenario by throwing an exception, but since we support creating an empty branch (from https://github.com/apache/iceberg/pull/8072/), we should also support replacing to/from an empty branch. I am thinking of adding an empty/dummy snapshot with the MAIN reference during the create-table operation, in a separate PR, to tackle this. Let me know what you guys think, @amogh-jahagirdar, @ConeyLiu, @rdblue
Re: [I] doc: The doc for catalog is missing. [iceberg]
liurenjie1024 commented on issue #8850: URL: https://github.com/apache/iceberg/issues/8850#issuecomment-1765832884 I'll take this.
Re: [I] Implement basic full table scan. [iceberg-rust]
liurenjie1024 commented on issue #66: URL: https://github.com/apache/iceberg-rust/issues/66#issuecomment-1765838560 Blocked by #79
Re: [PR] feat: Implement Iceberg values [iceberg-rust]
JanKaul commented on code in PR #20: URL: https://github.com/apache/iceberg-rust/pull/20#discussion_r1361668958 ## crates/iceberg/src/spec/values.rs: ## @@ -0,0 +1,964 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +/*! + * Value in iceberg + */ + +use std::{any::Any, collections::BTreeMap}; + +use bitvec::vec::BitVec; +use chrono::{DateTime, NaiveDate, NaiveDateTime, NaiveTime, Utc}; +use ordered_float::OrderedFloat; +use rust_decimal::Decimal; +use serde_bytes::ByteBuf; +use serde_json::{Map as JsonMap, Number, Value as JsonValue}; +use uuid::Uuid; + +use crate::{Error, ErrorKind}; + +use super::datatypes::{PrimitiveType, Type}; + +/// Values present in iceberg type +#[derive(Clone, Debug, PartialEq, Hash, Eq, PartialOrd, Ord)] +pub enum PrimitiveLiteral { +/// 0x00 for false, non-zero byte for true +Boolean(bool), +/// Stored as 4-byte little-endian +Int(i32), +/// Stored as 8-byte little-endian +Long(i64), +/// Stored as 4-byte little-endian +Float(OrderedFloat), +/// Stored as 8-byte little-endian +Double(OrderedFloat), +/// Stores days from the 1970-01-01 in an 4-byte little-endian int +Date(i32), +/// Stores microseconds from midnight in an 8-byte little-endian long +Time(i64), +/// Timestamp without timezone +Timestamp(i64), +/// Timestamp with timezone +TimestampTZ(i64), +/// UTF-8 bytes (without length) +String(String), +/// 16-byte big-endian value +UUID(Uuid), +/// Binary value +Fixed(Vec), +/// Binary value (without length) +Binary(Vec), +/// Stores unscaled value as two’s-complement big-endian binary, +/// using the minimum number of bytes for the value +Decimal(Decimal), +} + +/// Values present in iceberg type +#[derive(Clone, Debug, PartialEq, Hash, Eq, PartialOrd, Ord)] +pub enum Literal { +/// A primitive value +Primitive(PrimitiveLiteral), +/// A struct is a tuple of typed values. Each field in the tuple is named and has an integer id that is unique in the table schema. +/// Each field can be either optional or required, meaning that values can (or cannot) be null. Fields may be any type. +/// Fields may have an optional comment or doc string. Fields can have default values. +Struct(Struct), +/// A list is a collection of values with some element type. +/// The element field has an integer id that is unique in the table schema. +/// Elements can be either optional or required. Element types may be any type. +List(Vec>), +/// A map is a collection of key-value pairs with a key type and a value type. +/// Both the key field and value field each have an integer id that is unique in the table schema. +/// Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types. 
+Map(BTreeMap>), +} + +impl From for ByteBuf { +fn from(value: Literal) -> Self { +match value { +Literal::Primitive(prim) => match prim { +PrimitiveLiteral::Boolean(val) => { +if val { +ByteBuf::from([0u8]) +} else { +ByteBuf::from([1u8]) +} +} +PrimitiveLiteral::Int(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Long(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Float(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Double(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Date(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Time(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Timestamp(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::TimestampTZ(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::String(val) => ByteBuf::from(val.as_bytes()), +PrimitiveLiteral::UUID(val) => ByteBuf::from(val.as_u128().to_be_bytes()), +PrimitiveLiteral::Fixed(val) => ByteB
Re: [I] [BUG] to_arrow conversion does not support iceberg table column name containing slash [iceberg-python]
Fokko commented on issue #81: URL: https://github.com/apache/iceberg-python/issues/81#issuecomment-1765864606 Great catch @puchengy I wasn't aware of this sanitization behavior. Do you want to write a patch for it?
[PR] fix: avro bytes test for Literal [iceberg-rust]
JanKaul opened a new pull request, #80: URL: https://github.com/apache/iceberg-rust/pull/80 This PR fixes a logical bug in the tests for converting `Literal`s to and from `ByteBuf`. The bug is that I wrote the wrong bytes to the avro file.
Re: [PR] feat: Implement Iceberg values [iceberg-rust]
JanKaul commented on code in PR #20: URL: https://github.com/apache/iceberg-rust/pull/20#discussion_r1361677800 ## crates/iceberg/src/spec/values.rs: ## @@ -0,0 +1,964 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +/*! + * Value in iceberg + */ + +use std::{any::Any, collections::BTreeMap}; + +use bitvec::vec::BitVec; +use chrono::{DateTime, NaiveDate, NaiveDateTime, NaiveTime, Utc}; +use ordered_float::OrderedFloat; +use rust_decimal::Decimal; +use serde_bytes::ByteBuf; +use serde_json::{Map as JsonMap, Number, Value as JsonValue}; +use uuid::Uuid; + +use crate::{Error, ErrorKind}; + +use super::datatypes::{PrimitiveType, Type}; + +/// Values present in iceberg type +#[derive(Clone, Debug, PartialEq, Hash, Eq, PartialOrd, Ord)] +pub enum PrimitiveLiteral { +/// 0x00 for false, non-zero byte for true +Boolean(bool), +/// Stored as 4-byte little-endian +Int(i32), +/// Stored as 8-byte little-endian +Long(i64), +/// Stored as 4-byte little-endian +Float(OrderedFloat), +/// Stored as 8-byte little-endian +Double(OrderedFloat), +/// Stores days from the 1970-01-01 in an 4-byte little-endian int +Date(i32), +/// Stores microseconds from midnight in an 8-byte little-endian long +Time(i64), +/// Timestamp without timezone +Timestamp(i64), +/// Timestamp with timezone +TimestampTZ(i64), +/// UTF-8 bytes (without length) +String(String), +/// 16-byte big-endian value +UUID(Uuid), +/// Binary value +Fixed(Vec), +/// Binary value (without length) +Binary(Vec), +/// Stores unscaled value as two’s-complement big-endian binary, +/// using the minimum number of bytes for the value +Decimal(Decimal), +} + +/// Values present in iceberg type +#[derive(Clone, Debug, PartialEq, Hash, Eq, PartialOrd, Ord)] +pub enum Literal { +/// A primitive value +Primitive(PrimitiveLiteral), +/// A struct is a tuple of typed values. Each field in the tuple is named and has an integer id that is unique in the table schema. +/// Each field can be either optional or required, meaning that values can (or cannot) be null. Fields may be any type. +/// Fields may have an optional comment or doc string. Fields can have default values. +Struct(Struct), +/// A list is a collection of values with some element type. +/// The element field has an integer id that is unique in the table schema. +/// Elements can be either optional or required. Element types may be any type. +List(Vec>), +/// A map is a collection of key-value pairs with a key type and a value type. +/// Both the key field and value field each have an integer id that is unique in the table schema. +/// Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types. 
+Map(BTreeMap>), +} + +impl From for ByteBuf { +fn from(value: Literal) -> Self { +match value { +Literal::Primitive(prim) => match prim { +PrimitiveLiteral::Boolean(val) => { +if val { +ByteBuf::from([0u8]) +} else { +ByteBuf::from([1u8]) +} +} +PrimitiveLiteral::Int(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Long(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Float(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Double(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Date(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Time(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::Timestamp(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::TimestampTZ(val) => ByteBuf::from(val.to_le_bytes()), +PrimitiveLiteral::String(val) => ByteBuf::from(val.as_bytes()), +PrimitiveLiteral::UUID(val) => ByteBuf::from(val.as_u128().to_be_bytes()), +PrimitiveLiteral::Fixed(val) => ByteB
Re: [PR] fix: avro bytes test for Literal [iceberg-rust]
Xuanwo commented on code in PR #80: URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361682216 ## crates/iceberg/src/spec/values.rs: ## @@ -995,7 +995,7 @@ mod tests { assert_eq!(literal, expected_literal); let mut writer = apache_avro::Writer::new(&schema, Vec::new()); -writer.append_ser(bytes).unwrap(); +writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap(); Review Comment: Using `Into` seems strange. How about using `Bytes::from(literal)` instead?
Re: [PR] fix: avro bytes test for Literal [iceberg-rust]
JanKaul commented on code in PR #80: URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361690259 ## crates/iceberg/src/spec/values.rs: ## @@ -995,7 +995,7 @@ mod tests { assert_eq!(literal, expected_literal); let mut writer = apache_avro::Writer::new(&schema, Vec::new()); -writer.append_ser(bytes).unwrap(); +writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap(); Review Comment: There is a `ByteBuf::from(value: impl Into<Vec<u8>>)` which overrides the `From<Literal>` impl for `ByteBuf`, and I need to somehow specify that I want the conversion from `Literal` to `ByteBuf`. Maybe you know of a different way to specify which `from`/`into` implementation to use.
Re: [PR] fix: avro bytes test for Literal [iceberg-rust]
Xuanwo commented on code in PR #80: URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361704555 ## crates/iceberg/src/spec/values.rs: ## @@ -995,7 +995,7 @@ mod tests { assert_eq!(literal, expected_literal); let mut writer = apache_avro::Writer::new(&schema, Vec::new()); -writer.append_ser(bytes).unwrap(); +writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap(); Review Comment: > There is a `ByteBuf::from(value: impl Into<Vec<u8>>)` which overrides the `From<Literal>` impl for `ByteBuf` Hi, I'm not quite understanding. Does `Into<Vec<u8>>` exist for `Literal`? Or is it because `Literal` implements `From<Literal> for ByteBuf`?
Re: [PR] Add an expireAfterWrite cache eviction policy to CachingCatalog [iceberg]
zhangminglei commented on code in PR #8844: URL: https://github.com/apache/iceberg/pull/8844#discussion_r1361733500 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/TestCachingCatalogExpirationAfterWrite.java: ## @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark; + +import java.util.Map; +import java.util.concurrent.TimeUnit; +import org.apache.hadoop.conf.Configuration; +import org.apache.iceberg.hive.HiveClientPool; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.awaitility.Awaitility; +import org.junit.After; +import org.junit.AfterClass; +import org.junit.BeforeClass; +import org.junit.Test; +import org.junit.runners.Parameterized; + +public class TestCachingCatalogExpirationAfterWrite extends SparkCatalogTestBase { Review Comment: Sounds reasonable, What do you think ? @nastra , You wish put it into `spark` module ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] How to read data in the order in which files are commited? [iceberg]
pvary commented on issue #8802: URL: https://github.com/apache/iceberg/issues/8802#issuecomment-1765938139 > > Sometimes we need to do similar thing in Flink Source, and we ended up creating our own comparator for this which compares Iceberg splits (which are a wrapper above ScanTasks). > > I'm sorry, I didn't quite understand this point. Could you please explain it in more detail? We fetch all the tasks for a given plan here (and covert it to an IcebergSplit using `fromCombinedScanTask`): https://github.com/apache/iceberg/blob/46cad6ddaeff8104d96defab25206a4ff7e01629/flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitPlanner.java#L73-L82 And we use a [SerializableComparator](https://github.com/apache/iceberg/blob/46cad6ddaeff8104d96defab25206a4ff7e01629/flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/source/split/SplitComparators.java#L33) to order the splits before assigning them: https://github.com/apache/iceberg/blob/46cad6ddaeff8104d96defab25206a4ff7e01629/flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/source/assigner/DefaultSplitAssigner.java#L44-L46 I hope this helps! Thanks, Peter -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]
pvary commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361742600 ## flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/data/FlinkParquetReaders.java: ## @@ -262,7 +262,11 @@ public ParquetValueReader primitive( switch (primitive.getPrimitiveTypeName()) { case FIXED_LEN_BYTE_ARRAY: case BINARY: - return new ParquetValueReaders.ByteArrayReader(desc); + if (expected.typeId() == Types.StringType.get().typeId()) { +return new StringReader(desc); + } else { +return new ParquetValueReaders.ByteArrayReader(desc); + } Review Comment: I am concerned about the backward compatibility of this change. Someone might already depend on reading them as binary, and this change would break their use-case -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]
pvary commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361742942 ## flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkParquetReader.java: ## @@ -81,26 +75,87 @@ public void testTwoLevelList() throws IOException { recordBuilder.set("topbytes", expectedBinary); GenericData.Record expectedRecord = recordBuilder.build(); -writer.write(expectedRecord); -writer.close(); +try (ParquetWriter writer = +AvroParquetWriter.builder(new Path(testFile.toURI())) +.withDataModel(GenericData.get()) +.withSchema(avroSchema) +.config("parquet.avro.add-list-element-records", "true") +.config("parquet.avro.write-old-list-structure", "true") +.build()) { + writer.write(expectedRecord); +} try (CloseableIterable reader = Parquet.read(Files.localInput(testFile)) .project(schema) .createReaderFunc(type -> FlinkParquetReaders.buildReader(schema, type)) .build()) { Iterator rows = reader.iterator(); - Assert.assertTrue("Should have at least one row", rows.hasNext()); + assertThat(rows).as("Should have at least one row").hasNext(); + RowData rowData = rows.next(); + assertThat(expectedByte).isEqualTo(rowData.getArray(0).getBinary(0)); + assertThat(expectedByte).isEqualTo(rowData.getBinary(1)); + assertThat(rows).as("Should not have more than one row").isExhausted(); +} + } + + @Test + public void testReadBinaryFieldAsString() throws IOException { +Schema schemaForWriteBinary = new Schema(optional(1, "strbytes", Types.BinaryType.get())); +org.apache.avro.Schema avroSchema = AvroSchemaUtil.convert(schemaForWriteBinary.asStruct()); + +File testFile = temp.newFile(); +assertThat(testFile.delete()).isTrue(); + +String expectedString = "hello"; + +GenericRecordBuilder recordBuilder = new GenericRecordBuilder(avroSchema); +ByteBuffer expectedBinary = ByteBuffer.wrap(expectedString.getBytes(StandardCharsets.UTF_8)); +recordBuilder.set("strbytes", expectedBinary); +GenericData.Record expectedRecord = recordBuilder.build(); + +try (ParquetWriter writer = +AvroParquetWriter.builder(new Path(testFile.toURI())) +.withDataModel(GenericData.get()) +.withSchema(avroSchema) +.build()) { + writer.write(expectedRecord); +} + +// read as string +Schema schemaForReadBinaryAsString = +new Schema(optional(1, "strbytes", Types.StringType.get())); +try (CloseableIterable reader = +Parquet.read(Files.localInput(testFile)) +.project(schemaForReadBinaryAsString) +.createReaderFunc( +type -> FlinkParquetReaders.buildReader(schemaForReadBinaryAsString, type)) +.build()) { + Iterator rows = reader.iterator(); + assertThat(rows).as("Should have at least one row").hasNext(); + RowData rowData = rows.next(); + assertThat(rowData.getString(0)).isInstanceOf(BinaryStringData.class); + assertThat(rowData.getString(0).toString()).isEqualTo(expectedString); + assertThat(rows).as("Should not have more than one row").isExhausted(); +} + +// read as byte[] Review Comment: Should this be a separate test case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]
pvary commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1765954472 This is a small change, so it might not be too hard to keep the different Flink version changes in sync, but usually we introduce the changes on the latest Flink and then create a separate backport PR to bring them to the older Flink versions. This is better for the reviewer, as the change is smaller and it is easier to focus on the issues, and better for the contributor, since when changes are required they do not need to be continuously merged to the other branches. And when backport time comes, it is easier to check the backport-specific changes.
Re: [PR] fix: avro bytes test for Literal [iceberg-rust]
JanKaul commented on code in PR #80: URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361754339 ## crates/iceberg/src/spec/values.rs: ## @@ -995,7 +995,7 @@ mod tests { assert_eq!(literal, expected_literal); let mut writer = apache_avro::Writer::new(&schema, Vec::new()); -writer.append_ser(bytes).unwrap(); +writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap(); Review Comment: There currently is no `Into<Vec<u8>>` for `Literal`, but there is a `From<Literal> for ByteBuf`, which I would like to use. When I use `ByteBuf::from`, the compiler wants to use the existing inherent `ByteBuf::from(value: impl Into<Vec<u8>>)` implementation, which gives me a type error because `Literal` doesn't implement `Into<Vec<u8>>`. So I need some way to tell the compiler the right `from`/`into` implementation to use. It works the way I did it, but it might be a bit confusing.
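Since the thread hinges on how Rust resolves `ByteBuf::from` when an inherent `from` method and a `From` impl both exist, here is a minimal, self-contained sketch of the disambiguation options. It uses a stand-in `Doc` type instead of the PR's `Literal` and assumes `serde_bytes` as a dependency; it illustrates the language mechanics, not code from this PR.

```rust
use serde_bytes::ByteBuf;

// Stand-in for the PR's `Literal`.
struct Doc(i32);

// Analogous to the PR's `impl From<Literal> for ByteBuf`.
impl From<Doc> for ByteBuf {
    fn from(value: Doc) -> Self {
        // 4-byte little-endian payload, mirroring Iceberg's int encoding.
        ByteBuf::from(value.0.to_le_bytes().to_vec())
    }
}

fn main() {
    // `ByteBuf::from(Doc(42))` does NOT compile: path syntax picks the inherent
    // `ByteBuf::from(impl Into<Vec<u8>>)`, and `Doc` is not `Into<Vec<u8>>`.
    // Two ways to force the trait impl instead:
    let a: ByteBuf = Into::<ByteBuf>::into(Doc(42)); // turbofish on `Into`
    let b = <ByteBuf as From<Doc>>::from(Doc(42));   // fully qualified syntax
    assert_eq!(a, b);
}
```

The `Into::<ByteBuf>::into(literal)` call in the diff above is the turbofish form; the fully qualified `<ByteBuf as From<Literal>>::from(literal)` would be an equivalent, arguably more explicit, spelling.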
Re: [PR] fix: avro bytes test for Literal [iceberg-rust]
ZENOTME commented on PR #80: URL: https://github.com/apache/iceberg-rust/pull/80#issuecomment-1765964478 I'm a little confused. So this Avro bytes value is different from the [binary encoding in the Avro spec](https://avro.apache.org/docs/1.11.1/specification/#binary-encoding)? What's the difference between these two formats? 🤔
Re: [PR] fix: avro bytes test for Literal [iceberg-rust]
Xuanwo commented on code in PR #80: URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361762750 ## crates/iceberg/src/spec/values.rs: ## @@ -995,7 +995,7 @@ mod tests { assert_eq!(literal, expected_literal); let mut writer = apache_avro::Writer::new(&schema, Vec::new()); -writer.append_ser(bytes).unwrap(); +writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap(); Review Comment: Oh, I got it. `ByteBuf` provides its own inherent `ByteBuf::from` that may conflict with the `From<Literal>` trait impl. Maybe we should provide a `Literal::to_vec()` instead? Users can utilize `ByteBuf::from(Literal::to_vec())` to enhance semantic clarity. Incidentally, since `Literal` already implements `Serialize`, it's easy to misuse this API. Can we find a way to prevent this?
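In the spirit of that suggestion, here is a small self-contained sketch of what an explicit conversion helper could look like and how it sidesteps the `From`/`Into` resolution question at the call site. `Lit` is a stand-in with a single variant, the method name `to_vec` follows the comment above, and the byte layout is only illustrative; `serde_bytes` is assumed as a dependency.

```rust
use serde_bytes::ByteBuf;

// Stand-in for the PR's `Literal`, reduced to one variant for brevity.
enum Lit {
    Int(i32),
}

impl Lit {
    // Explicit helper in the spirit of the suggested `Literal::to_vec()`:
    // Iceberg serializes ints as 4-byte little-endian values.
    fn to_vec(&self) -> Vec<u8> {
        match self {
            Lit::Int(v) => v.to_le_bytes().to_vec(),
        }
    }
}

fn main() {
    let lit = Lit::Int(42);
    // The call site reads unambiguously: the inherent `ByteBuf::from`
    // happily accepts a `Vec<u8>`, so no trait disambiguation is needed.
    let buf = ByteBuf::from(lit.to_vec());
    assert_eq!(buf.len(), 4);
}
```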
[PR] API, Core: Add uuid() to View [iceberg]
nastra opened a new pull request, #8851: URL: https://github.com/apache/iceberg/pull/8851 similar to https://github.com/apache/iceberg/pull/8800, adding UUID to the `View` API
Re: [PR] fix: avro bytes test for Literal [iceberg-rust]
JanKaul commented on PR #80: URL: https://github.com/apache/iceberg-rust/pull/80#issuecomment-1766011191 I will try to explain it the best way I can. Please correct me if I'm wrong. Iceberg defines its own [binary serialization format](https://iceberg.apache.org/spec/#binary-single-value-serialization) for statistics (upper & lower bounds). This statistical information is stored in Avro files using the [Avro binary encoding](https://avro.apache.org/docs/1.11.1/specification/#binary-encoding). So this is somewhat of a two-layered setup. In order to interpret a value you first have to deserialize the Avro value into bytes and then convert the bytes to a value using the Iceberg binary format. When I implemented `Literal` my goal was to use `serde_bytes::ByteBuf` for the Avro binary values, just like you used it for the [`ManifestList`](https://github.com/apache/iceberg-rust/blob/main/crates/iceberg/src/spec/manifest_list.rs#L586). After deserialization the `ByteBuf` has to be converted to a `Literal`, and that's what the `Literal::try_from_bytes` method is for.
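To make the two layers concrete, here is a minimal, runnable sketch using only `apache_avro` and a plain `i32` instead of the crate's `Literal`; in the real code the last decoding step is what `Literal::try_from_bytes` (whose exact signature is not shown here) would perform.

```rust
use apache_avro::{types::Value, Reader, Schema, Writer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Layer 1: Iceberg single-value serialization — an `int` 3 becomes
    // four little-endian bytes.
    let iceberg_encoded: Vec<u8> = 3i32.to_le_bytes().to_vec();

    // Layer 2: the Avro file stores that payload as an Avro `bytes` value,
    // wrapped in Avro's own binary encoding (length prefix, blocks, ...).
    let schema = Schema::parse_str(r#"{"type": "bytes"}"#)?;
    let mut writer = Writer::new(&schema, Vec::new());
    writer.append(Value::Bytes(iceberg_encoded.clone()))?;
    let file_bytes = writer.into_inner()?;

    // Reading reverses the layers: Avro decoding yields the raw bytes,
    // which are then interpreted with the Iceberg single-value format.
    let reader = Reader::with_schema(&schema, &file_bytes[..])?;
    for value in reader {
        if let Value::Bytes(bytes) = value? {
            assert_eq!(bytes, iceberg_encoded);
            assert_eq!(i32::from_le_bytes(bytes.try_into().unwrap()), 3);
        }
    }
    Ok(())
}
```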
Re: [PR] fix: avro bytes test for Literal [iceberg-rust]
JanKaul commented on code in PR #80: URL: https://github.com/apache/iceberg-rust/pull/80#discussion_r1361806053 ## crates/iceberg/src/spec/values.rs: ## @@ -995,7 +995,7 @@ mod tests { assert_eq!(literal, expected_literal); let mut writer = apache_avro::Writer::new(&schema, Vec::new()); -writer.append_ser(bytes).unwrap(); +writer.append_ser(Into::<ByteBuf>::into(literal)).unwrap(); Review Comment: > Oh, I got it. `ByteBuf` provides its own inherent `ByteBuf::from` that may conflict with the `From<Literal>` trait impl. Exactly. It actually works with `writer.append_ser(ByteBuf::from(literal)).unwrap();` when I provide an `impl From<Literal> for Vec<u8>`.
Re: [I] Iceberg table support specified column comments by flinksql create [iceberg]
372242283 commented on issue #8511: URL: https://github.com/apache/iceberg/issues/8511#issuecomment-1766049159 I also encountered this issue: checking the field 'COMMENT' in the table 'COLUMNS-V2' in HMS shows that it is null and never written.
Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]
Fokko commented on PR #8766: URL: https://github.com/apache/iceberg/pull/8766#issuecomment-1766051674 For me it looks like the file was created by the daemon:
```
➜ ls -lah /var/run/docker.sock
lrwxr-xr-x  1 root  daemon  46B Oct 16 16:32 /var/run/docker.sock -> /Users/fokkodriesprong/.docker/run/docker.sock
➜ last reboot
reboot time   Mon Oct 16 16:29
```
Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]
jbonofre commented on PR #8766: URL: https://github.com/apache/iceberg/pull/8766#issuecomment-1766072686 @Fokko I think anyone using docker-desktop on Mac will have the same issue, so it applies to 99% of the macOS users.
Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]
Fokko commented on PR #8766: URL: https://github.com/apache/iceberg/pull/8766#issuecomment-1766080354 @jbonofre I'm running docker desktop as well
Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]
Fokko merged PR #8766: URL: https://github.com/apache/iceberg/pull/8766
Re: [PR] Build: Add note about running tests/itests on MacOS [iceberg]
jbonofre commented on PR #8766: URL: https://github.com/apache/iceberg/pull/8766#issuecomment-1766082002 @Fokko and you don't have the symbolic link? Did you install docker-desktop from the homebrew cask?
Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]
fengjiajie commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361854341 ## flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/data/FlinkParquetReaders.java: ## @@ -262,7 +262,11 @@ public ParquetValueReader primitive( switch (primitive.getPrimitiveTypeName()) { case FIXED_LEN_BYTE_ARRAY: case BINARY: - return new ParquetValueReaders.ByteArrayReader(desc); + if (expected.typeId() == Types.StringType.get().typeId()) { +return new StringReader(desc); + } else { +return new ParquetValueReaders.ByteArrayReader(desc); + } Review Comment: > I am concerned about the backward compatibility of this change. Someone might already depend on reading them as binary, and this change would break their use-case This modification is only applicable to cases where the iceberg definition is 'string' and parquet column is 'binary'. Previously, such cases would encounter the following exception (unit test can reproduce this exception): ``` java.lang.ClassCastException: [B cannot be cast to org.apache.flink.table.data.StringData at org.apache.flink.table.data.GenericRowData.getString(GenericRowData.java:169) at org.apache.iceberg.flink.data.TestFlinkParquetReader.testReadBinaryFieldAsString(TestFlinkParquetReader.java:137) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] fix: avro bytes test for Literal [iceberg-rust]
ZENOTME commented on PR #80: URL: https://github.com/apache/iceberg-rust/pull/80#issuecomment-1766094815 Thanks for your explanation! Totally understood it!
Re: [PR] API, Core: Add uuid() to View [iceberg]
ajantha-bhat commented on code in PR #8851: URL: https://github.com/apache/iceberg/pull/8851#discussion_r1361891676 ## api/src/main/java/org/apache/iceberg/view/View.java: ## @@ -111,4 +112,13 @@ default ReplaceViewVersion replaceVersion() { default UpdateLocation updateLocation() { throw new UnsupportedOperationException("Updating a view's location is not supported"); } + + /** + * Returns the view's UUID + * + * @return the view's UUID + */ + default UUID uuid() { Review Comment: Can it be the `String` instead of `UUID` object to be similar to the table ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Iceberg table support specified column comments by flinksql create [iceberg]
huyuanfeng2018 commented on issue #8511: URL: https://github.com/apache/iceberg/issues/8511#issuecomment-1766135578 @stevenzwu In Flink 1.16 and before, it was impossible to parse the comment in the table creation statement. Starting from 1.17, the Flink community completed this part of the logic, so I think Iceberg should support parsing this part of the information and adding the comment to the field. WDYT? I can do this part.
Re: [PR] API, Core: Add uuid() to View [iceberg]
nastra commented on code in PR #8851: URL: https://github.com/apache/iceberg/pull/8851#discussion_r1361897579 ## api/src/main/java/org/apache/iceberg/view/View.java: ## @@ -111,4 +112,13 @@ default ReplaceViewVersion replaceVersion() { default UpdateLocation updateLocation() { throw new UnsupportedOperationException("Updating a view's location is not supported"); } + + /** + * Returns the view's UUID + * + * @return the view's UUID + */ + default UUID uuid() { Review Comment: for tables it might be that the type ends up being UUID based on https://github.com/apache/iceberg/pull/8800#discussion_r1355112214 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]
fengjiajie commented on PR #8808: URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1766144239 > This is a small change, so it might not be too hard to keep the different Flink version changes in sync, but usually we introduce the changes on the latest Flink, and then create a different backport PR to backport them to the older Flink versions. > > This is better for the reviewer, as the change is smaller and easier to focus on the issues, and better for the contributor since when changes are required then they are not needed to be continuously merged to the other branches. And when the backport time comes, it is easier to check the backport specific changes. Thank you for your review suggestions. I have made the changes to only include flink 1.17. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]
fengjiajie commented on code in PR #8808: URL: https://github.com/apache/iceberg/pull/8808#discussion_r1361899155 ## flink/v1.15/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkParquetReader.java: ## @@ -81,26 +75,87 @@ public void testTwoLevelList() throws IOException { recordBuilder.set("topbytes", expectedBinary); GenericData.Record expectedRecord = recordBuilder.build(); -writer.write(expectedRecord); -writer.close(); +try (ParquetWriter writer = +AvroParquetWriter.builder(new Path(testFile.toURI())) +.withDataModel(GenericData.get()) +.withSchema(avroSchema) +.config("parquet.avro.add-list-element-records", "true") +.config("parquet.avro.write-old-list-structure", "true") +.build()) { + writer.write(expectedRecord); +} try (CloseableIterable reader = Parquet.read(Files.localInput(testFile)) .project(schema) .createReaderFunc(type -> FlinkParquetReaders.buildReader(schema, type)) .build()) { Iterator rows = reader.iterator(); - Assert.assertTrue("Should have at least one row", rows.hasNext()); + assertThat(rows).as("Should have at least one row").hasNext(); + RowData rowData = rows.next(); + assertThat(expectedByte).isEqualTo(rowData.getArray(0).getBinary(0)); + assertThat(expectedByte).isEqualTo(rowData.getBinary(1)); + assertThat(rows).as("Should not have more than one row").isExhausted(); +} + } + + @Test + public void testReadBinaryFieldAsString() throws IOException { +Schema schemaForWriteBinary = new Schema(optional(1, "strbytes", Types.BinaryType.get())); +org.apache.avro.Schema avroSchema = AvroSchemaUtil.convert(schemaForWriteBinary.asStruct()); + +File testFile = temp.newFile(); +assertThat(testFile.delete()).isTrue(); + +String expectedString = "hello"; + +GenericRecordBuilder recordBuilder = new GenericRecordBuilder(avroSchema); +ByteBuffer expectedBinary = ByteBuffer.wrap(expectedString.getBytes(StandardCharsets.UTF_8)); +recordBuilder.set("strbytes", expectedBinary); +GenericData.Record expectedRecord = recordBuilder.build(); + +try (ParquetWriter writer = +AvroParquetWriter.builder(new Path(testFile.toURI())) +.withDataModel(GenericData.get()) +.withSchema(avroSchema) +.build()) { + writer.write(expectedRecord); +} + +// read as string +Schema schemaForReadBinaryAsString = +new Schema(optional(1, "strbytes", Types.StringType.get())); +try (CloseableIterable reader = +Parquet.read(Files.localInput(testFile)) +.project(schemaForReadBinaryAsString) +.createReaderFunc( +type -> FlinkParquetReaders.buildReader(schemaForReadBinaryAsString, type)) +.build()) { + Iterator rows = reader.iterator(); + assertThat(rows).as("Should have at least one row").hasNext(); + RowData rowData = rows.next(); + assertThat(rowData.getString(0)).isInstanceOf(BinaryStringData.class); + assertThat(rowData.getString(0).toString()).isEqualTo(expectedString); + assertThat(rows).as("Should not have more than one row").isExhausted(); +} + +// read as byte[] Review Comment: This unit test is testing whether expected values can be read from same file and column when the iceberg column definition is both string and binary. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Iceberg table support specified column comments by flinksql create [iceberg]
huyuanfeng2018 commented on issue #8511: URL: https://github.com/apache/iceberg/issues/8511#issuecomment-1766177686 > @stevenzwu In Flink 1.16 and before, it was impossible to parse the comment in the table creation statement. Starting from 1.17, the Flink community completed this part of the logic, so I think Iceberg should support parsing this part of the information and adding the comment to the field. WDYT? I can do this part. https://issues.apache.org/jira/browse/FLINK-29679
Re: [PR] API, Core: Add uuid() to View [iceberg]
nk1506 commented on code in PR #8851: URL: https://github.com/apache/iceberg/pull/8851#discussion_r1361819048 ## core/src/main/java/org/apache/iceberg/view/BaseView.java: ## @@ -97,4 +98,9 @@ public ReplaceViewVersion replaceVersion() { public UpdateLocation updateLocation() { return new SetViewLocation(ops); } + + @Override + public UUID uuid() { +return UUID.fromString(ops.current().uuid()); Review Comment: nit: for few catalogs, it may go through all the refresh mechanism. I am not sure if we have any other better approach or not. ## api/src/main/java/org/apache/iceberg/view/View.java: ## @@ -111,4 +112,13 @@ default ReplaceViewVersion replaceVersion() { default UpdateLocation updateLocation() { throw new UnsupportedOperationException("Updating a view's location is not supported"); } + + /** + * Returns the view's UUID + * + * @return the view's UUID Review Comment: nit: tag description and summary are same. IMO, we should have one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Use Awaitility instead of Thread.sleep() [iceberg]
nk1506 commented on code in PR #8853: URL: https://github.com/apache/iceberg/pull/8853#discussion_r1361945348 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/SparkSQLExecutionHelper.java: ## @@ -42,28 +45,26 @@ public static String lastExecutedMetricValue(SparkSession spark, String metricNa SQLExecutionUIData lastExecution = statusStore.executionsList().last(); Option sqlPlanMetric = lastExecution.metrics().find(metric -> metric.name().equals(metricName)); -Assert.assertTrue( -String.format("Metric '%s' not found in last execution", metricName), -sqlPlanMetric.isDefined()); +assertThat(sqlPlanMetric.isDefined()) +.as(String.format("Metric '%s' not found in last execution", metricName)) +.isTrue(); long metricId = sqlPlanMetric.get().accumulatorId(); // Refresh metricValues, they will remain null until the execution is complete and metrics are // aggregated -int attempts = 3; -while (lastExecution.metricValues() == null && attempts > 0) { - try { -Thread.sleep(100); -attempts--; - } catch (InterruptedException e) { -throw new RuntimeException(e); - } +Awaitility.await() Review Comment: nit: is it a good idea to add some alias with `Awaitility` here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: Add uuid() to View [iceberg]
amogh-jahagirdar commented on code in PR #8851: URL: https://github.com/apache/iceberg/pull/8851#discussion_r1361948344 ## api/src/main/java/org/apache/iceberg/view/View.java: ## @@ -111,4 +112,13 @@ default ReplaceViewVersion replaceVersion() { default UpdateLocation updateLocation() { throw new UnsupportedOperationException("Updating a view's location is not supported"); } + + /** + * Returns the view's UUID + * + * @return the view's UUID + */ + default UUID uuid() { Review Comment: Yeah sorry for the delay on that PR, planning on updating to an actual UUID. @ajantha-bhat let me know what you think on that PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Fix Fast forward before/after snapshot output for non-main branches [iceberg]
amogh-jahagirdar commented on code in PR #8854: URL: https://github.com/apache/iceberg/pull/8854#discussion_r1361985873 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java: ## @@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) { return modifyIcebergTable( tableIdent, table -> { - long currentRef = table.currentSnapshot().snapshotId(); + long currentRef = table.snapshot(source).snapshotId(); table.manageSnapshots().fastForwardBranch(source, target).commit(); Review Comment: Another thing to fixI think when I originally implemented the fastForward operation I switched the definitions of source/target in my head so the API is confusing (in the API, target means the branch that is actually being fast forwarded, source is where target will be moved). That should probably be reversed so target is actually the target to which source will be moved. I *think* we should be able to safely just rename the parameters in the API and update the javadoc. Luckily both are String parameters. cc @rdblue -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Fix Fast forward before/after snapshot output for non-main branches [iceberg]
amogh-jahagirdar commented on PR #8854: URL: https://github.com/apache/iceberg/pull/8854#issuecomment-1766265102 also cc @rakesh-das08 let me know your thoughts on this fix! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: support read/write Manifest [iceberg-rust]
JanKaul commented on code in PR #79: URL: https://github.com/apache/iceberg-rust/pull/79#discussion_r1362007623 ## crates/iceberg/src/spec/manifest.rs: ## @@ -0,0 +1,671 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! Manifest for Iceberg. +use super::{FormatVersion, ManifestContentType, PartitionSpec, Schema, Struct}; +use super::{Literal, Type}; +use crate::{Error, ErrorKind}; +use apache_avro::{from_value, Reader as AvroReader, Schema as AvroSchema}; +use std::collections::HashMap; +use std::str::FromStr; + +/// A manifest contains metadata and a list of entries. +pub struct Manifest { +metadata: ManifestMetadata, +entries: Vec, +} + +impl Manifest { +/// Parse manifest from bytes of avro file. +pub fn parse_avro(bs: &[u8]) -> Result { +let reader = AvroReader::new(bs)?; + +// Parse manifest metadata +let meta = reader.user_metadata(); +let metadata = ManifestMetadata::parse(meta)?; + +// Parse manifest entries +let partition_type = + Type::Struct(metadata.partition_spec.partition_type(&metadata.schema)?); +let mut entries = Vecnew(); + +match metadata.format_version { +FormatVersion::V1 => { +let reader = AvroReader::with_schema(Self::v1_schema(), bs)?; +for value in reader { +entries.push( +from_value::<_serde::ManifestEntryV1>(&value?)? +.try_into(&partition_type, &metadata.schema)?, +); +} +} +FormatVersion::V2 => { +let reader = AvroReader::with_schema(Self::v2_schema(), bs)?; +for value in reader { +entries.push( +from_value::<_serde::ManifestEntryV2>(&value?)? +.try_into(&partition_type, &metadata.schema)?, +); +} +} +}; + +Ok(Manifest { metadata, entries }) +} + +fn v2_schema() -> &'static AvroSchema { +todo!() +} + +fn v1_schema() -> &'static AvroSchema { +todo!() +} +} + +/// Meta data of a manifest. +#[derive(Debug, PartialEq, Clone)] +pub struct ManifestMetadata { +/// The table schema at the time the manifest +/// was written +schema: Schema, +/// ID of the schema used to write the manifest as a string +schema_id: i32, +/// The partition spec used to write the manifest +partition_spec: PartitionSpec, +/// ID of the partition spec used to write the manifest as a string +partition_spec_id: i32, +/// Table format version number of the manifest as a string +format_version: FormatVersion, +/// Type of content files tracked by the manifest: “data” or “deletes” +content: ManifestContentType, +} + +impl ManifestMetadata { +/// Parse from metadata in avro file. +pub fn parse(meta: &HashMap>) -> Result { +let schema = { +let bs = meta.get("schema").ok_or_else(|| { +Error::new( +ErrorKind::DataInvalid, +"schema is required in manifest metadata but not found", +) +})?; +serde_json::from_slice::(bs).map_err(|err| { +Error::new( +ErrorKind::DataInvalid, +"Fail to parse schema in manifest metadata", +) +.with_source(err) +})? 
+}; +let schema_id: i32 = meta +.get("schema-id") +.and_then(|bs| { +Some(String::from_utf8_lossy(bs).parse().map_err(|err| { +Error::new( +ErrorKind::DataInvalid, +"Fail to parse schema id in manifest metadata", +) +.with_source(err) +})) +}) +.transpose()? +.unwrap_or(0); +let partition_spec = { +let bs = meta.get("partition-spec").ok_or_else(|| { +Error::new( +ErrorKind::DataInvalid, +
Re: [PR] push down min/max/count to iceberg [iceberg]
atifiu commented on PR #6252: URL: https://github.com/apache/iceberg/pull/6252#issuecomment-1766286164 @amogh-jahagirdar I think I know how these delete files were generated even though copy-on-write is defined at the table level. I executed the delete from Trino, and since Trino only supports merge-on-read, these delete files were generated. Now I am finding it really difficult to remove these delete files. I will try removing the partition to see whether they get removed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
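In case it helps with the cleanup mentioned above, compacting the affected data files is one way to get rid of accumulated position deletes. The sketch below uses the Spark actions API; the table identifier is a placeholder and the exact option names depend on the Iceberg version in use, so treat this as an assumption rather than a recipe:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class CompactAwayDeletes {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("compact-deletes").getOrCreate();
    // "db.sample" is a placeholder identifier.
    Table table = Spark3Util.loadIcebergTable(spark, "db.sample");

    // Rewriting data files applies eligible deletes while rewriting; the delete files
    // themselves stop being referenced and can then be cleaned up by snapshot expiration.
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option("delete-file-threshold", "1") // rewrite any file that has deletes attached
        .execute();
  }
}
```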
Re: [PR] Core: Avro writers use BlockingBinaryEncoder to enable array/map size calculations. [iceberg]
Fokko commented on PR #8625: URL: https://github.com/apache/iceberg/pull/8625#issuecomment-1766306045 I just realized that this would also speed up operations like snapshot expiration, because we do need to access the manifest files but don't need to use the metrics. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
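For context on why that skipping is possible, here is a small, self-contained sketch of plain Apache Avro's blocking encoder (this is the standard Avro API, not Iceberg's wrapper). Writing array blocks with their byte sizes is what lets a reader jump over a whole array, such as per-column metrics, without decoding it:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class BlockingEncoderSketch {
  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // A blocking encoder prefixes array/map blocks with their byte size,
    // so readers that do not need a block can skip it wholesale.
    BinaryEncoder encoder = EncoderFactory.get().blockingBinaryEncoder(out, null);
    encoder.writeArrayStart();
    encoder.setItemCount(2);
    encoder.startItem();
    encoder.writeLong(1L);
    encoder.startItem();
    encoder.writeLong(2L);
    encoder.writeArrayEnd();
    encoder.flush();
    System.out.println("encoded " + out.size() + " bytes");
  }
}
```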
Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]
rakesh-das08 commented on PR #8854: URL: https://github.com/apache/iceberg/pull/8854#issuecomment-1766321748 @amogh-jahagirdar the fix LGTM. Thanks for fixing this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: Add uuid() to View [iceberg]
nastra commented on code in PR #8851: URL: https://github.com/apache/iceberg/pull/8851#discussion_r1362046505 ## core/src/main/java/org/apache/iceberg/view/BaseView.java: ## @@ -97,4 +98,9 @@ public ReplaceViewVersion replaceVersion() { public UpdateLocation updateLocation() { return new SetViewLocation(ops); } + + @Override + public UUID uuid() { +return UUID.fromString(ops.current().uuid()); Review Comment: it should not go through an update. The purpose of `current()` is to return the currently loaded metadata, without checking for updates. If there's a catalog where `current()` does a refresh, then that's a bug -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]
ajantha-bhat commented on code in PR #8854: URL: https://github.com/apache/iceberg/pull/8854#discussion_r1362035790 ## spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestFastForwardBranchProcedure.java: ## @@ -188,4 +188,38 @@ public void testInvalidFastForwardBranchCases() { .isInstanceOf(IllegalArgumentException.class) .hasMessage("Cannot handle an empty identifier for argument table"); } + + @Test + public void testFastForwardNonMain() { +sql("CREATE TABLE %s (id int NOT NULL, data string) USING iceberg", tableName); +sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName); +Table table = validationCatalog.loadTable(tableIdent); +table.refresh(); + +String branch1 = "branch1"; +sql("ALTER TABLE %s CREATE BRANCH %s", tableName, branch1); +String tableNameWithBranch = String.format("%s.branch_%s", tableName, branch1); Review Comment: nit: tableNameWithBranch -> tableNameWithBranch1 ? ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java: ## @@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) { return modifyIcebergTable( tableIdent, table -> { - long currentRef = table.currentSnapshot().snapshotId(); + long currentRef = table.snapshot(source).snapshotId(); table.manageSnapshots().fastForwardBranch(source, target).commit(); Review Comment: I too got confused today when I compared the replace branch API with this procedure's variables. Source and target are reversed. The procedure's named arguments of 'branch' and 'to' are proper. It is like fast forward branch x to y. The only thing is that these internal variables in this procedure are reversed. I think we can rename them in this PR. `ManageSnapshots.replaceBranch` and `ManageSnapshots.fastForwardBranch` seem to have correct naming IMO -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]
ajantha-bhat commented on code in PR #8854: URL: https://github.com/apache/iceberg/pull/8854#discussion_r1362051079 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java: ## @@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) { return modifyIcebergTable( tableIdent, table -> { - long currentRef = table.currentSnapshot().snapshotId(); + long currentRef = table.snapshot(source).snapshotId(); Review Comment: There is also a recent issue reported for fast forward on an empty branch https://github.com/apache/iceberg/issues/8849. I have analyzed but it looks to be clumsy if we try to support dummy snapshot. Let me know what you guys think. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]
amogh-jahagirdar commented on code in PR #8854: URL: https://github.com/apache/iceberg/pull/8854#discussion_r1362063516 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java: ## @@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) { return modifyIcebergTable( tableIdent, table -> { - long currentRef = table.currentSnapshot().snapshotId(); + long currentRef = table.snapshot(source).snapshotId(); Review Comment: Right, actually I was investigating that issue and when going through the procedure code I just noticed this issue :) On the dummy snapshot idea, I need to think more. I think that idea has been floated around a few times and it logically makes sense I just don't know all the implications of that change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]
ajantha-bhat commented on code in PR #8854: URL: https://github.com/apache/iceberg/pull/8854#discussion_r1362069529 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java: ## @@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) { return modifyIcebergTable( tableIdent, table -> { - long currentRef = table.currentSnapshot().snapshotId(); + long currentRef = table.snapshot(source).snapshotId(); Review Comment: Yeah, I also couldn't conclude on the implications. Until now, if the snapshot id is -1, we assume that it is an empty table or that only a create-table happened. We would also need a dummy snapshot id for the ancestor check to pass for fast forward operations. Nessie uses a **constant** default hash on an empty branch to handle this kind of ancestry problem. Maybe we need to introduce something like that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]
rakesh-das08 commented on code in PR #8854: URL: https://github.com/apache/iceberg/pull/8854#discussion_r1362071081 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/FastForwardBranchProcedure.java: ## @@ -77,9 +77,9 @@ public InternalRow[] call(InternalRow args) { return modifyIcebergTable( tableIdent, table -> { - long currentRef = table.currentSnapshot().snapshotId(); + long currentRef = table.snapshot(source).snapshotId(); Review Comment: I had thought of this case where we could introduce a dummy snapshot, but as you mentioned, it did not look like an elegant solution. And with this PR: https://github.com/apache/iceberg/pull/7652/, I just basically fall back to the underlying replace operation to throw an appropriate exception. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Avro writers use BlockingBinaryEncoder to enable array/map size calculations. [iceberg]
rustyconover commented on PR #8625: URL: https://github.com/apache/iceberg/pull/8625#issuecomment-1766378641 Yes it would! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: support read/write Manifest [iceberg-rust]
ZENOTME commented on code in PR #79: URL: https://github.com/apache/iceberg-rust/pull/79#discussion_r1362079101 ## crates/iceberg/src/spec/manifest.rs: ## @@ -0,0 +1,671 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! Manifest for Iceberg. +use super::{FormatVersion, ManifestContentType, PartitionSpec, Schema, Struct}; +use super::{Literal, Type}; +use crate::{Error, ErrorKind}; +use apache_avro::{from_value, Reader as AvroReader, Schema as AvroSchema}; +use std::collections::HashMap; +use std::str::FromStr; + +/// A manifest contains metadata and a list of entries. +pub struct Manifest { +metadata: ManifestMetadata, +entries: Vec, +} + +impl Manifest { +/// Parse manifest from bytes of avro file. +pub fn parse_avro(bs: &[u8]) -> Result { +let reader = AvroReader::new(bs)?; + +// Parse manifest metadata +let meta = reader.user_metadata(); +let metadata = ManifestMetadata::parse(meta)?; + +// Parse manifest entries +let partition_type = + Type::Struct(metadata.partition_spec.partition_type(&metadata.schema)?); +let mut entries = Vecnew(); + +match metadata.format_version { +FormatVersion::V1 => { +let reader = AvroReader::with_schema(Self::v1_schema(), bs)?; +for value in reader { +entries.push( +from_value::<_serde::ManifestEntryV1>(&value?)? +.try_into(&partition_type, &metadata.schema)?, +); +} +} +FormatVersion::V2 => { +let reader = AvroReader::with_schema(Self::v2_schema(), bs)?; +for value in reader { +entries.push( +from_value::<_serde::ManifestEntryV2>(&value?)? +.try_into(&partition_type, &metadata.schema)?, +); +} +} +}; Review Comment: Yes! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Flaky test: TestSparkReaderDeletes.testEqualityDeleteWithDeletedColumn [iceberg]
ajantha-bhat opened a new issue, #8855: URL: https://github.com/apache/iceberg/issues/8855 TestSparkReaderDeletes > [format = orc, vectorized = false, planningMode = DISTRIBUTED] > testEqualityDeleteWithDeletedColumn PR:8854 Build: https://github.com/apache/iceberg/actions/runs/6546667781/job/1591189?pr=8854 java.lang.AssertionError: Metric 'number of row deletes applied' not found in last execution at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.assertTrue(Assert.java:42) at org.apache.iceberg.spark.source.SparkSQLExecutionHelper.lastExecutedMetricValue(SparkSQLExecutionHelper.java:45) at org.apache.iceberg.spark.source.TestSparkReaderDeletes.deleteCount(TestSparkReaderDeletes.java:203) at org.apache.iceberg.data.DeleteReadTests.checkDeleteCount(DeleteReadTests.java:188) at org.apache.iceberg.spark.source.TestSparkReaderDeletes.testEqualityDeleteWithDeletedColumn(TestSparkReaderDeletes.java:421) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: Add uuid() to View [iceberg]
ajantha-bhat commented on code in PR #8851: URL: https://github.com/apache/iceberg/pull/8851#discussion_r1362101911 ## api/src/main/java/org/apache/iceberg/view/View.java: ## @@ -111,4 +112,13 @@ default ReplaceViewVersion replaceVersion() { default UpdateLocation updateLocation() { throw new UnsupportedOperationException("Updating a view's location is not supported"); } + + /** + * Returns the view's UUID + * + * @return the view's UUID + */ + default UUID uuid() { Review Comment: Yeah, I am ok with either String or UUID. But I expect both view and table to follow the same type. LGTM. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]
ajantha-bhat commented on PR #8854: URL: https://github.com/apache/iceberg/pull/8854#issuecomment-1766410898 logged the flaky test: #8855 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink 1.17: Use awaitility instead of Thread.sleep() [iceberg]
nk1506 commented on code in PR #8852: URL: https://github.com/apache/iceberg/pull/8852#discussion_r1362111563 ## flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamingMonitorFunction.java: ## @@ -111,14 +113,19 @@ public void testConsumeWithoutStartSnapshotId() throws Exception { TestSourceContext sourceContext = new TestSourceContext(latch); runSourceFunctionInTask(sourceContext, function); - Assert.assertTrue( - "Should have expected elements.", latch.await(WAIT_TIME_MILLIS, TimeUnit.MILLISECONDS)); - Thread.sleep(1000L); + Awaitility.await() Review Comment: IMO, we should move this common part at one place. ## flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceFailover.java: ## @@ -98,9 +98,9 @@ protected List generateRecords(int numRecords, long seed) { return RandomGenericData.generate(schema(), numRecords, seed); } - protected void assertRecords( - Table table, List expectedRecords, Duration interval, int maxCount) throws Exception { -SimpleDataUtil.assertTableRecords(table, expectedRecords, interval, maxCount); + protected void assertRecords(Table table, List expectedRecords, Duration interval) Review Comment: do we need this method? can't we replace with `SimpleDataUtil.assertTableRecords(table, expectedRecords, interval);` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
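To make the reviewer's suggestion of sharing the common part concrete, a shared helper for the repeated wait could look roughly like this; the class and method names are hypothetical, and the latch plus wait time come from the quoted test:

```java
import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import org.awaitility.Awaitility;

final class LatchWaits {
  private LatchWaits() {}

  /** Waits until the latch has counted down, failing the caller after the timeout. */
  static void awaitExpectedElements(CountDownLatch latch, long waitTimeMillis) {
    Awaitility.await("Should have expected elements.")
        .atMost(Duration.ofMillis(waitTimeMillis))
        .until(() -> latch.getCount() == 0);
  }
}
```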
Re: [PR] Spark: Fix Fast forward procedure output for non-main branches [iceberg]
amogh-jahagirdar closed pull request #8854: Spark: Fix Fast forward procedure output for non-main branches URL: https://github.com/apache/iceberg/pull/8854 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: Add uuid() to View [iceberg]
nk1506 commented on code in PR #8851: URL: https://github.com/apache/iceberg/pull/8851#discussion_r1362126880 ## core/src/main/java/org/apache/iceberg/view/BaseView.java: ## @@ -97,4 +98,9 @@ public ReplaceViewVersion replaceVersion() { public UpdateLocation updateLocation() { return new SetViewLocation(ops); } + + @Override + public UUID uuid() { +return UUID.fromString(ops.current().uuid()); Review Comment: I might be missing something here, but it seems `current()` can do the refresh [here](https://github.com/apache/iceberg/blob/cb20bdbea7b299b2948e908a7775ce90818e6a92/core/src/main/java/org/apache/iceberg/view/BaseViewOperations.java#L75), unless some catalog has its own implementation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spec: add nanosecond timestamp types [iceberg]
Fokko commented on code in PR #8683: URL: https://github.com/apache/iceberg/pull/8683#discussion_r1362161257 ## format/spec.md: ## @@ -874,6 +878,11 @@ Maps with non-string keys must use an array representation with the `map` logica |**`list`**|`array`|| |**`map`**|`array` of key-value records, or `map` when keys are strings (optional).|Array storage must use logical type name `map` and must store elements that are 2-field records. The first field is a non-null key and the second field is the value.| +Notes: + +1. Avro type annotation `adjust-to-utc` is an Iceberg convention; default value is `false` if not present. +2. Avro logical type `timestamp-nanos` is an Iceberg convention; the Avro specification does not define this type. Review Comment: It depends on what comes out of the Avro PR, but I think we're good for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Improve `All` Metadata Tables with Snapshot Information [iceberg]
RussellSpitzer opened a new issue, #8856: URL: https://github.com/apache/iceberg/issues/8856 ### Feature Request / Improvement Currently all versions of metadata tables have the exact same schema as their not "all" versions. This is actually not very useful if you are attempting to locate the state of a particular entry at a specific time because the `snapshot_id` always just shows the file's original snapshot. For example the entries table looks like ``` scala> spark.sql("SELECT * FROM db.timezoned.entries").show warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation` +--+---+---++++ |status|snapshot_id|sequence_number|file_sequence_number| data_file|readable_metrics| +--+---+---++++ | 1|6561920950175488866| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 1|5535987506380389562| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 1|2517256618694516958| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 1|2750236691316126600| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 1|7179885233531513409| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| ``` And `all_entries` looks like ``` scala> spark.sql("SELECT * FROM db.timezoned.all_entries").show warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation` +--+---+---++++ |status|snapshot_id|sequence_number|file_sequence_number| data_file|readable_metrics| +--+---+---++++ | 1|2517256618694516958| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 1|6561920950175488866| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 1|2750236691316126600| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 1|7179885233531513409| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 1|5535987506380389562| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 0|6561920950175488866| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 0|5535987506380389562| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 0|2517256618694516958| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 0|2750236691316126600| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| | 0|7179885233531513409| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| +--+---+---++++ By looking at all_entries it is impossible for me to determine whether or not all the manifests were rewritten at once, or if they were rewritten in groups. Ideally we would see something like ``` +---+-+--+---+---++++ | as_of| time|status| snapshot_id|sequence_number|file_sequence_number| data_file| readable_metrics| +---+-+--+---+---++++ |7179885233531513409|1697493267302| 1|7179885233531513409| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| |2750236691316126600|1697493268363| 1|2750236691316126600| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| |2750236691316126600|1697493268363| 1|7179885233531513409| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| |2517256618694516958|1697493269568| 1|2517256618694516958| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| |2517256618694516958|1697493269568| 1|2750236691316126600| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| |2517256618694516958|1697493269568| 1|7179885233531513409| 0| 0|{0, /Users/russel...|{{51, 1, 0, null,...| |5535987506380389562|1697493270419| 1|5535987506380389562| 0|
Re: [PR] Spark 3.5: Use Awaitility instead of Thread.sleep() [iceberg]
nastra merged PR #8853: URL: https://github.com/apache/iceberg/pull/8853 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Use Awaitility instead of Thread.sleep() [iceberg]
nastra commented on code in PR #8853: URL: https://github.com/apache/iceberg/pull/8853#discussion_r1362301067 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/SparkSQLExecutionHelper.java: ## @@ -42,28 +45,26 @@ public static String lastExecutedMetricValue(SparkSession spark, String metricNa SQLExecutionUIData lastExecution = statusStore.executionsList().last(); Option sqlPlanMetric = lastExecution.metrics().find(metric -> metric.name().equals(metricName)); -Assert.assertTrue( -String.format("Metric '%s' not found in last execution", metricName), -sqlPlanMetric.isDefined()); +assertThat(sqlPlanMetric.isDefined()) +.as(String.format("Metric '%s' not found in last execution", metricName)) +.isTrue(); long metricId = sqlPlanMetric.get().accumulatorId(); // Refresh metricValues, they will remain null until the execution is complete and metrics are // aggregated -int attempts = 3; -while (lastExecution.metricValues() == null && attempts > 0) { - try { -Thread.sleep(100); -attempts--; - } catch (InterruptedException e) { -throw new RuntimeException(e); - } +Awaitility.await() Review Comment: in certain cases it makes sense to add an alias, but here I think we should be ok, since there's a comment right above that explains what's being done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
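For anyone reading the truncated diff above, the shape the sleep-and-retry loop gets replaced with is roughly the following; this is a sketch of the pattern, not the exact committed code, and the helper class name is made up:

```java
import java.time.Duration;
import org.apache.spark.sql.execution.ui.SQLExecutionUIData;
import org.awaitility.Awaitility;

class MetricValuesWait {
  // Poll until the execution's metric values have been aggregated, instead of
  // sleeping a fixed number of times and hoping they are ready.
  static void waitForMetricValues(SQLExecutionUIData lastExecution) {
    Awaitility.await()
        .atMost(Duration.ofSeconds(3))
        .pollInterval(Duration.ofMillis(100))
        .until(() -> lastExecution.metricValues() != null);
  }
}
```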
Re: [PR] API, Core: Add uuid() to View [iceberg]
nastra commented on code in PR #8851: URL: https://github.com/apache/iceberg/pull/8851#discussion_r1362309604 ## core/src/main/java/org/apache/iceberg/view/BaseView.java: ## @@ -97,4 +98,9 @@ public ReplaceViewVersion replaceVersion() { public UpdateLocation updateLocation() { return new SetViewLocation(ops); } + + @Override + public UUID uuid() { +return UUID.fromString(ops.current().uuid()); Review Comment: that kind of (mandatory) refresh is expected, because it's a lazy refresh, that is typically done after a commit. This means that the underlying metadata changed after the commit, but we refresh only when accessing data after the commit happened. The same is being done in `BaseMetastoreTableOperations` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
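A compact sketch of the lazy pattern being described, loosely modeled on the ops classes mentioned above; the field and type names are simplified for illustration and are not the real class members:

```java
class LazyCurrentSketch<M> {
  private volatile M currentMetadata;
  private volatile boolean shouldRefresh = true;

  // current() normally just returns what is already loaded; it only reloads when a
  // prior commit (or the initial state) has marked the cached metadata as stale.
  M current() {
    return shouldRefresh ? refresh() : currentMetadata;
  }

  M refresh() {
    currentMetadata = loadFromCatalog();
    shouldRefresh = false;
    return currentMetadata;
  }

  private M loadFromCatalog() {
    // placeholder for the catalog-specific metadata load
    return currentMetadata;
  }
}
```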
[PR] Nessie: retain authorship information when creating a namespace [iceberg]
adutra opened a new pull request, #8857: URL: https://github.com/apache/iceberg/pull/8857 This change enhances the process of creating new namespaces by retaining commit authorship information when committing the new namespace. It also switches to Nessie API V2 for the commit operation. See https://github.com/projectnessie/nessie/issues/6501 for context. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink 1.17: Use awaitility instead of Thread.sleep() [iceberg]
nastra commented on code in PR #8852: URL: https://github.com/apache/iceberg/pull/8852#discussion_r1362315314 ## flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceFailover.java: ## @@ -98,9 +98,9 @@ protected List generateRecords(int numRecords, long seed) { return RandomGenericData.generate(schema(), numRecords, seed); } - protected void assertRecords( - Table table, List expectedRecords, Duration interval, int maxCount) throws Exception { -SimpleDataUtil.assertTableRecords(table, expectedRecords, interval, maxCount); + protected void assertRecords(Table table, List expectedRecords, Duration interval) Review Comment: that's a valid point, I'm not sure why this method existed in the first place. I've removed it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink 1.17: Use awaitility instead of Thread.sleep() [iceberg]
nastra commented on code in PR #8852: URL: https://github.com/apache/iceberg/pull/8852#discussion_r1362375826 ## flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamingMonitorFunction.java: ## @@ -111,14 +113,19 @@ public void testConsumeWithoutStartSnapshotId() throws Exception { TestSourceContext sourceContext = new TestSourceContext(latch); runSourceFunctionInTask(sourceContext, function); - Assert.assertTrue( - "Should have expected elements.", latch.await(WAIT_TIME_MILLIS, TimeUnit.MILLISECONDS)); - Thread.sleep(1000L); + Awaitility.await() Review Comment: makes sense, done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Reverting the default custom partitioner for bucket column [iceberg]
stevenzwu merged PR #8848: URL: https://github.com/apache/iceberg/pull/8848 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Flink: Reverting the default custom partitioner for bucket column (#8848) [iceberg]
nastra opened a new pull request, #8858: URL: https://github.com/apache/iceberg/pull/8858 This backports #8848 to 1.4.x -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]
nastra commented on issue #8847: URL: https://github.com/apache/iceberg/issues/8847#issuecomment-1766722846 Closing this as #8848 has been merged to main and I backported it to 1.4.x in https://github.com/apache/iceberg/pull/8858 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution [iceberg]
nastra closed issue #8847: Flink: revert the automatic application of custom partitioner for bucketing column with hash distribution URL: https://github.com/apache/iceberg/issues/8847 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] [1.4.x] Core: Do not use a lazy split offset list in manifests (#8834) [iceberg]
rdblue merged PR #8845: URL: https://github.com/apache/iceberg/pull/8845 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Enable column statistics filtering after planning [iceberg]
stevenzwu commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1362401859 ## api/src/main/java/org/apache/iceberg/ContentFile.java: ## @@ -177,4 +191,26 @@ default Long fileSequenceNumber() { default F copy(boolean withStats) { return withStats ? copy() : copyWithoutStats(); } + + /** + * Copies this file (potentially with or without specific column stats). Manifest readers can + * reuse file instances; use this method to copy data when collecting files from tasks. + * + * @param withStats Will copy this file without file stats if set to false. + * @param statsToKeep Will keep stats only for these columns. Not used if withStats + * is set to false. + * @return a copy of this data file. If "withStats" is set to false the file will not + * contain lower bounds, upper bounds, value counts, null value counts, or nan value counts. + * If "withStats" is set to true and the "statsToKeep" is not empty then only + * specific column stats will be kept. + */ + default F copy(boolean withStats, Collection statsToKeep) { Review Comment: is this needed in this interface? `copyWithSpecificStats` from line 177 should imply `withStats` to be true. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
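To illustrate the reviewer's point, the two-argument variant can be expressed in terms of the other copy methods, which is roughly why it may be redundant. This is a hypothetical sketch against a simplified interface, not the actual ContentFile API:

```java
import java.util.Collection;

interface CopyableFileSketch<F> {
  F copy();                                          // copy with all stats
  F copyWithoutStats();                              // copy with stats dropped
  F copyWithSpecificStats(Collection<Integer> ids);  // copy keeping stats for selected column ids

  // A combined variant adds no new capability: it just dispatches to the three above.
  default F copy(boolean withStats, Collection<Integer> statsToKeep) {
    if (!withStats) {
      return copyWithoutStats();
    }
    return (statsToKeep == null || statsToKeep.isEmpty()) ? copy() : copyWithSpecificStats(statsToKeep);
  }
}
```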
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
amogh-jahagirdar commented on PR #8860: URL: https://github.com/apache/iceberg/pull/8860#issuecomment-1766749771 cc @bryanck -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
bryanck commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362407399 ## core/src/main/java/org/apache/iceberg/BaseFile.java: ## @@ -460,6 +460,11 @@ public ByteBuffer keyMetadata() { @Override public List splitOffsets() { +// If the last split offset is past the file size this means the split offsets are corrupted and should not be used +if (splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) { Review Comment: you'll need a null check here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
amogh-jahagirdar commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362408168 ## core/src/main/java/org/apache/iceberg/BaseFile.java: ## @@ -460,6 +460,11 @@ public ByteBuffer keyMetadata() { @Override public List splitOffsets() { +// If the last split offset is past the file size this means the split offsets are corrupted and should not be used +if (splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) { Review Comment: Ooh yes, good catch -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add an expireAfterWrite cache eviction policy to CachingCatalog [iceberg]
nastra commented on code in PR #8844: URL: https://github.com/apache/iceberg/pull/8844#discussion_r1362413695 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/TestCachingCatalogExpirationAfterWrite.java: ## @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark; + +import java.util.Map; +import java.util.concurrent.TimeUnit; +import org.apache.hadoop.conf.Configuration; +import org.apache.iceberg.hive.HiveClientPool; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.spark.sql.connector.catalog.Identifier; +import org.awaitility.Awaitility; +import org.junit.After; +import org.junit.AfterClass; +import org.junit.BeforeClass; +import org.junit.Test; +import org.junit.runners.Parameterized; + +public class TestCachingCatalogExpirationAfterWrite extends SparkCatalogTestBase { Review Comment: +1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
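For background on the policy named in the PR title, expireAfterWrite is the standard cache eviction setting that drops an entry a fixed time after it was written, regardless of reads. A minimal sketch, assuming plain Caffeine (the key and value here are placeholders, not what the caching catalog actually stores):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.TimeUnit;

public class ExpireAfterWriteSketch {
  public static void main(String[] args) {
    Cache<String, String> cache = Caffeine.newBuilder()
        .expireAfterWrite(5, TimeUnit.MINUTES) // evict entries 5 minutes after they are written
        .build();

    cache.put("db.table", "loaded-table-handle"); // placeholder key/value
    System.out.println(cache.getIfPresent("db.table"));
  }
}
```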
Re: [PR] Python: Add support for Python 3.12 [iceberg-python]
steinsgateted commented on PR #35: URL: https://github.com/apache/iceberg-python/pull/35#issuecomment-1766760247 @jayceslesar Thank you for the information -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
bryanck commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362417563 ## core/src/main/java/org/apache/iceberg/BaseFile.java: ## @@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() { @Override public List splitOffsets() { +// If the last split offset is past the file size this means the split offsets are corrupted and +// should not be used +if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) { Review Comment: maybe a length check also? not likely to be 0 but just in case -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
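Putting the null check and the suggested length check together, the guard under discussion would look something like the sketch below; the fields stand in for the ones on BaseFile and the list conversion is simplified, so this is illustrative rather than the committed diff:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SplitOffsetsGuardSketch {
  private final long[] splitOffsets;   // stand-in for BaseFile's offsets field
  private final long fileSizeInBytes;  // stand-in for BaseFile's file size field

  SplitOffsetsGuardSketch(long[] splitOffsets, long fileSizeInBytes) {
    this.splitOffsets = splitOffsets;
    this.fileSizeInBytes = fileSizeInBytes;
  }

  List<Long> splitOffsets() {
    // Missing, empty, or out-of-range offsets are treated as "no offsets" so that
    // planning falls back to size-based splitting instead of trusting corrupt values.
    if (splitOffsets == null
        || splitOffsets.length == 0
        || splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) {
      return null;
    }
    return Arrays.stream(splitOffsets).boxed().collect(Collectors.toList());
  }
}
```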
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
bryanck commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362419546 ## core/src/main/java/org/apache/iceberg/BaseFile.java: ## @@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() { @Override public List splitOffsets() { +// If the last split offset is past the file size this means the split offsets are corrupted and +// should not be used +if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) { Review Comment: Also a test would be nice -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
amogh-jahagirdar commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362421006 ## core/src/main/java/org/apache/iceberg/BaseFile.java: ## @@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() { @Override public List splitOffsets() { +// If the last split offset is past the file size this means the split offsets are corrupted and +// should not be used +if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) { Review Comment: yeah just realized that, updated! Looking into where to add the test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] [1.4.x] Flink: Reverting the default custom partitioner for bucket column (#8848) [iceberg]
nastra merged PR #8858: URL: https://github.com/apache/iceberg/pull/8858 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]
nastra commented on code in PR #8804: URL: https://github.com/apache/iceberg/pull/8804#discussion_r1362428279 ## api/src/test/java/org/apache/iceberg/metrics/TestDefaultTimer.java: ## @@ -101,14 +103,7 @@ public void closeableTimer() throws InterruptedException { @Test public void measureRunnable() { Timer timer = new DefaultTimer(TimeUnit.NANOSECONDS); -Runnable runnable = Review Comment: the goal here is actually to measure the duration of the `runnable`, so this isn't something that we want to replace with Awaitility -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]
nastra commented on code in PR #8804: URL: https://github.com/apache/iceberg/pull/8804#discussion_r1362427190 ## api/src/test/java/org/apache/iceberg/TestHelpers.java: ## @@ -62,6 +70,54 @@ public static long waitUntilAfter(long timestampMillis) { return current; } + /** wait for fixed duration */ + public static void delayInMilliseconds(String message, long delay, boolean useSameThread) { Review Comment: I don't think we want to use Awaitility to just wait a certain amount of time for a true condition to happen -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]
nastra commented on code in PR #8804: URL: https://github.com/apache/iceberg/pull/8804#discussion_r1362429576 ## aws/src/integration/java/org/apache/iceberg/aws/TestAssumeRoleAwsClientFactory.java: ## @@ -189,7 +192,17 @@ public void testAssumeRoleS3FileIO() throws Exception { Assert.assertFalse("should be able to access file", inputFile.exists()); } - private void waitForIamConsistency() throws Exception { -Thread.sleep(1); // sleep to make sure IAM up to date + private void waitForIamConsistency() { +TestHelpers.delayUntilTrueCondition( +"wait for IAM role policy to update.", +1000, +1001, +10001, Review Comment: I think the parameters are rather confusing. It would be better to use Awaitility directly here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]
nastra commented on code in PR #8804: URL: https://github.com/apache/iceberg/pull/8804#discussion_r1362431071 ## aws/src/integration/java/org/apache/iceberg/aws/TestAssumeRoleAwsClientFactory.java: ## @@ -189,7 +192,17 @@ public void testAssumeRoleS3FileIO() throws Exception { Assert.assertFalse("should be able to access file", inputFile.exists()); } - private void waitForIamConsistency() throws Exception { -Thread.sleep(1); // sleep to make sure IAM up to date + private void waitForIamConsistency() { +TestHelpers.delayUntilTrueCondition( +"wait for IAM role policy to update.", +1000, +1001, +10001, +TimeUnit.MILLISECONDS, +() -> +iam.getRolePolicy( Review Comment: is that the check required whether IAM is consistent? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]
nastra commented on code in PR #8804: URL: https://github.com/apache/iceberg/pull/8804#discussion_r1362441102 ## aws/src/integration/java/org/apache/iceberg/aws/dynamodb/TestDynamoDbLockManager.java: ## @@ -141,11 +142,8 @@ public void testAcquireSingleProcess() throws Exception { CompletableFuture.supplyAsync( () -> { - try { -Thread.sleep(5000); - } catch (InterruptedException e) { -throw new RuntimeException(e); - } + TestHelpers.delayInMilliseconds( Review Comment: this goes back to an earlier comment I had on https://github.com/apache/iceberg/pull/8715#issuecomment-1748190639. We don't just want to replace `Thread.sleep()` usage blindly with Awaitility by waiting the same amount of time. The important piece is that we'd need to understand what kind of condition the test is trying to _eventually_ reach, which we can then check by using Awaitility (rather than just sleeping X seconds). I've opened https://github.com/apache/iceberg/pull/8853 and https://github.com/apache/iceberg/pull/8852 to give an idea how that might look for other places in the code -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
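As a minimal, purely illustrative sketch of that distinction (the `lockIsHeldByOtherProcess()` check is hypothetical and stands in for whatever state this particular test is really waiting to reach):

```java
import java.time.Duration;
import org.awaitility.Awaitility;

// Before: sleep a fixed amount of time and hope the state has changed by then.
// Thread.sleep(5000);

// After: poll for the condition the test actually depends on; the test proceeds
// as soon as the condition holds instead of always paying the full five seconds.
Awaitility.await("lock released by the other process")
    .atMost(Duration.ofSeconds(10))
    .pollInterval(Duration.ofMillis(100))
    .until(() -> !lockIsHeldByOtherProcess());
```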
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
amogh-jahagirdar commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362445708 ## core/src/main/java/org/apache/iceberg/BaseFile.java: ## @@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() { @Override public List splitOffsets() { +// If the last split offset is past the file size this means the split offsets are corrupted and +// should not be used +if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) { Review Comment: Looks like `TestSplitPlanning` is the best place for verifying the behavior of split planning when the offsets are corrupted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Replace Thread.sleep() usage in test code with Awaitility [iceberg]
nastra commented on issue #7154: URL: https://github.com/apache/iceberg/issues/7154#issuecomment-1766792091 I've opened https://github.com/apache/iceberg/pull/8853 and https://github.com/apache/iceberg/pull/8852 to give an idea about places that are good candidates to replace with Awaitility. I believe there are more similar places in the codebase that can be fixed as part of #7154 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] [1.4.x] AWS: avoid static global credentials provider which doesn't play well with lifecycle management (#8677) [iceberg]
nastra commented on PR #8843: URL: https://github.com/apache/iceberg/pull/8843#issuecomment-1766801596 > @nastra and @singhpk234, is this safe for a patch release? It seems like a behavior change that would only be safe if the behavior is always the same when creating multiple credentials providers. It appears that multiple people ran into this issue with 1.3.1 on [Slack and on the ML](https://lists.apache.org/thread/1nwb11v1qn1nr86stofkvqlb32fp0q9d). I don't think those people have an easy way to resolve the issue on their own (without building a custom jar with the proposed fix) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
amogh-jahagirdar commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362467796 ## core/src/test/java/org/apache/iceberg/TableTestBase.java: ## @@ -110,7 +110,7 @@ public class TableTestBase { static final DataFile FILE_C = DataFiles.builder(SPEC) .withPath("/path/to/data-c.parquet") - .withFileSizeInBytes(10) + .withFileSizeInBytes(3_000_000L) Review Comment: I changed this so split offsets are within the file size for the existing test files we use (otherwise TestManifestReader#testReaderWithFilterWithoutSelect ends up failing due to expectations on the split offsets). I'm adding a new test for a new file which has corrupted offsets. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
amogh-jahagirdar commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362487232 ## core/src/main/java/org/apache/iceberg/BaseFile.java: ## @@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() { @Override public List splitOffsets() { +// If the last split offset is past the file size this means the split offsets are corrupted and +// should not be used +if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) { Review Comment: Also `TestManifestReader` for verifying the split offsets are null when reading the file entries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] [1.4.x] AWS: avoid static global credentials provider which doesn't play well with lifecycle management (#8677) [iceberg]
stevenzwu commented on PR #8843: URL: https://github.com/apache/iceberg/pull/8843#issuecomment-1766827281 > would only be safe if the behavior is always the same when creating multiple credentials providers. @rdblue I think it is a safe change from a static global singleton to creating new objects for object lifecycle management -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
amogh-jahagirdar commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362487232 ## core/src/main/java/org/apache/iceberg/BaseFile.java: ## @@ -460,6 +460,12 @@ public ByteBuffer keyMetadata() { @Override public List splitOffsets() { +// If the last split offset is past the file size this means the split offsets are corrupted and +// should not be used +if (splitOffsets != null && splitOffsets[splitOffsets.length - 1] >= fileSizeInBytes) { Review Comment: Actually, `TestManifestReader` seems better for verifying the split offsets are null when reading the file entries. This seems like a cleaner test since split planning tests already verify the behavior when explicit split offsets aren't defined. Manifest reader tests can help us explicitly verify the behavior when the offsets are corrupted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
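A rough sketch of what such a `TestManifestReader` case might look like, assuming the `writeManifest` helper, the `FILE_IO` constant, and the bucket-partitioned `SPEC` from the existing test base classes; the offsets and file size are made-up values chosen so the last offset is past the file length:

```java
@Test
public void testInvalidSplitOffsetsIgnored() throws IOException {
  // Last offset (12) is past the declared file size (10), so splitOffsets() should return null.
  DataFile fileWithInvalidOffsets =
      DataFiles.builder(SPEC)
          .withPath("/path/to/data-with-invalid-offsets.parquet")
          .withPartitionPath("data_bucket=0") // assumes the bucket-partitioned SPEC
          .withFileSizeInBytes(10)
          .withRecordCount(1)
          .withSplitOffsets(ImmutableList.of(2L, 12L))
          .build();

  ManifestFile manifest = writeManifest(1000L, fileWithInvalidOffsets);
  try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, FILE_IO)) {
    DataFile entry = Iterables.getOnlyElement(reader);
    assertThat(entry.splitOffsets()).isNull();
  }
}
```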
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
amogh-jahagirdar commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362490912 ## core/src/test/java/org/apache/iceberg/TableTestBase.java: ## @@ -110,7 +110,7 @@ public class TableTestBase { static final DataFile FILE_C = DataFiles.builder(SPEC) .withPath("/path/to/data-c.parquet") - .withFileSizeInBytes(10) + .withFileSizeInBytes(3_000_000L) Review Comment: Err nvm, looks like split offsets for these files were added in #8834. Instead of changing the file size to fit the offsets (this breaks more tests that I need to go and update), I'll reverse it; I'll change the offsets to be within the file size since that was part of a more recent change and is isolated to a single test as far as I can tell. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Ignore split offsets when the last split offset is past the file length [iceberg]
singhpk234 commented on code in PR #8860: URL: https://github.com/apache/iceberg/pull/8860#discussion_r1362493799 ## core/src/main/java/org/apache/iceberg/BaseFile.java: ## @@ -460,6 +460,16 @@ public ByteBuffer keyMetadata() { @Override public List splitOffsets() { +if (splitOffsets == null || splitOffsets.length == 0) { + return null; +} + +// If the last split offset is past the file size this means the split offsets are corrupted and Review Comment: [doubt] wondering if throwing an exception or having a precondition would be helpful to identify the buggy writer? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
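For comparison, a sketch of the fail-fast variant being asked about, using the relocated Guava `Preconditions` that core already depends on; whether to throw or silently drop the offsets (as the PR currently does) is exactly the trade-off under discussion, so this is not the implemented behavior:

```java
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

// Fail-fast alternative: surface the buggy writer instead of returning null.
Preconditions.checkArgument(
    splitOffsets[splitOffsets.length - 1] < fileSizeInBytes,
    "Invalid split offsets: last offset %s is past the file size %s",
    splitOffsets[splitOffsets.length - 1],
    fileSizeInBytes);
```

The likely downside of throwing is that a single file written with bad offsets would then fail every read of its manifest, rather than merely falling back to planning without split offsets.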
Re: [PR] Core: Enable column statistics filtering after planning [iceberg]
stevenzwu commented on PR #8803: URL: https://github.com/apache/iceberg/pull/8803#issuecomment-1766844420 @pvary I think we probably want to push the `copyStatsForColumns` down to ManifestReader. https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestReader.java#L299 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org