[GitHub] [iceberg-rust] JanKaul commented on issue #52: No builder for TableMetadata and no public field
JanKaul commented on issue #52: URL: https://github.com/apache/iceberg-rust/issues/52#issuecomment-1711180269 I would also be in favor of using the builder pattern for the pub structs. If I'm correct, all pub structs except for TableMetadata already have a builder. With the `derive_builder` crate it should be quite easy to implement the builder for TableMetadata. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
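The builder pattern proposed in this comment can be sketched by hand with the standard library only. This is a minimal illustration, not the iceberg-rust implementation: `TableMetadataSketch`, its three fields, and the `String` error type are assumptions chosen to keep the example short, and the `derive_builder` crate would generate setters and a `build` method of a broadly comparable shape rather than this exact code.

```rust
/// Hypothetical, trimmed-down stand-in for the real `TableMetadata`.
/// Fields stay private; callers go through the builder and the getters.
#[derive(Debug, Clone, PartialEq)]
pub struct TableMetadataSketch {
    location: String,
    last_column_id: i32,
    current_schema_id: i32,
}

impl TableMetadataSketch {
    pub fn location(&self) -> &str {
        &self.location
    }
    pub fn last_column_id(&self) -> i32 {
        self.last_column_id
    }
    pub fn current_schema_id(&self) -> i32 {
        self.current_schema_id
    }
}

/// Hand-written builder: every field starts unset and is checked in `build`.
#[derive(Debug, Default)]
pub struct TableMetadataSketchBuilder {
    location: Option<String>,
    last_column_id: Option<i32>,
    current_schema_id: Option<i32>,
}

impl TableMetadataSketchBuilder {
    pub fn location(mut self, location: impl Into<String>) -> Self {
        self.location = Some(location.into());
        self
    }

    pub fn last_column_id(mut self, id: i32) -> Self {
        self.last_column_id = Some(id);
        self
    }

    pub fn current_schema_id(mut self, id: i32) -> Self {
        self.current_schema_id = Some(id);
        self
    }

    /// Fails with a message naming the first missing field.
    pub fn build(self) -> Result<TableMetadataSketch, String> {
        Ok(TableMetadataSketch {
            location: self.location.ok_or("location is required")?,
            last_column_id: self.last_column_id.ok_or("last_column_id is required")?,
            current_schema_id: self
                .current_schema_id
                .ok_or("current_schema_id is required")?,
        })
    }
}
```

A caller would chain the setters, for example `TableMetadataSketchBuilder::default().location("s3://bucket/table").last_column_id(3).current_schema_id(0).build()`, and handle the `Err` case when a required field was left unset.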
[GitHub] [iceberg] Fokko commented on pull request #8521: Python: Non-Cython fallback Avro parser
Fokko commented on PR #8521: URL: https://github.com/apache/iceberg/pull/8521#issuecomment-1711190176 @rustyconover Yes I agree. It looks like it is pulling the wheel correctly but it is missing the `decoder_fast` module. Maybe still good to just add this fallback anyway.
[GitHub] [iceberg] GoGoWen opened a new issue, #8527: Why Iceberg do not support column with default value?
GoGoWen opened a new issue, #8527: URL: https://github.com/apache/iceberg/issues/8527 ### Query engine Why does Iceberg not support columns with a default value, like MySQL's "k1 INT DEFAULT '1'"? ### Question Why does Iceberg not support columns with a default value, like MySQL's "k1 INT DEFAULT '1'"?
[GitHub] [iceberg] Fokko commented on issue #8527: Why Iceberg do not support column with default value?
Fokko commented on issue #8527: URL: https://github.com/apache/iceberg/issues/8527#issuecomment-1711211295 This is actually in the works: https://iceberg.apache.org/spec/#default-values This will be part of Spec version 3 that's being finalized.
[GitHub] [iceberg] getAlexRibeiro closed issue #7537: Error reading version hint file
getAlexRibeiro closed issue #7537: Error reading version hint file URL: https://github.com/apache/iceberg/issues/7537
[GitHub] [iceberg-rust] JanKaul opened a new pull request, #57: Metadata integration tests
JanKaul opened a new pull request, #57: URL: https://github.com/apache/iceberg-rust/pull/57 This PR adds integration tests for reading the table metadata from files. Some of the tests are designed to fail. With the current design of the serialization/deserialization the error doesn't specify which field is missing. So I couldn't do a precise check for certain tests.
[GitHub] [iceberg-rust] JanKaul commented on pull request #57: Metadata integration tests
JanKaul commented on PR #57: URL: https://github.com/apache/iceberg-rust/pull/57#issuecomment-1711222438 @liurenjie1024, @Xuanwo , @Fokko it would be great if you could take a look.
[GitHub] [iceberg-rust] ZENOTME commented on a diff in pull request #56: feat: support read Manifest List
ZENOTME commented on code in PR #56:
URL: https://github.com/apache/iceberg-rust/pull/56#discussion_r1319653288

## crates/iceberg/src/spec/manifest_list.rs: ##
@@ -0,0 +1,881 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! ManifestList for Iceberg.
+
+use crate::{avro::schema_to_avro_schema, spec::Literal, Error};
+use apache_avro::{from_value, types::Value, Reader};
+use once_cell::sync::Lazy;
+use std::sync::Arc;
+
+use super::{FormatVersion, ListType, NestedField, NestedFieldRef, Schema, StructType};
+
+/// Snapshots are embedded in table metadata, but the list of manifests for a
+/// snapshot are stored in a separate manifest list file.
+///
+/// A new manifest list is written for each attempt to commit a snapshot
+/// because the list of manifests always changes to produce a new snapshot.
+/// When a manifest list is written, the (optimistic) sequence number of the
+/// snapshot is written for all new manifest files tracked by the list.
+///
+/// A manifest list includes summary metadata that can be used to avoid
+/// scanning all of the manifests in a snapshot when planning a table scan.
+/// This includes the number of added, existing, and deleted files, and a
+/// summary of values for each field of the partition spec used to write the
+/// manifest.
+#[derive(Debug, Clone)]
+pub struct ManifestList {
+    /// Entries in a manifest list.
+    entries: Vec<ManifestListEntry>,
+}
+
+impl ManifestList {
+    /// Parse manifest list from bytes.
+    ///
+    /// QUESTION: Will we have more than one manifest list in a single file?
+    pub fn parse_with_version(
+        bs: &[u8],
+        version: FormatVersion,
+        partition_type: &StructType,
+    ) -> Result<ManifestList, Error> {
+        match version {
+            FormatVersion::V2 => {
+                let schema = schema_to_avro_schema("manifest_list", &Self::v2_schema()).unwrap();
+                let reader = Reader::with_schema(&schema, bs)?;
+                let values = Value::Array(reader.collect::<Result<Vec<Value>, _>>()?);
+                from_value::<_serde::ManifestListV2>(&values)?.try_into(partition_type)
+            }
+            FormatVersion::V1 => {
+                let schema = schema_to_avro_schema("manifest_list", &Self::v1_schema()).unwrap();
+                let reader = Reader::with_schema(&schema, bs)?;
+                let values = Value::Array(reader.collect::<Result<Vec<Value>, _>>()?);
+                from_value::<_serde::ManifestListV1>(&values)?.try_into(partition_type)
+            }
+        }
+    }
+
+    /// Get the entries in the manifest list.
+    pub fn entries(&self) -> &[ManifestListEntry] {
+        &self.entries
+    }
+
+    const MANIFEST_PATH: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                500,
+                "manifest_path",
+                super::Type::Primitive(super::PrimitiveType::String),
+            ))
+        })
+    };
+    const MANIFEST_LENGTH: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                501,
+                "manifest_length",
+                super::Type::Primitive(super::PrimitiveType::Long),
+            ))
+        })
+    };
+    const PARTITION_SPEC_ID: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                502,
+                "partition_spec_id",
+                super::Type::Primitive(super::PrimitiveType::Int),
+            ))
+        })
+    };
+    const CONTENT: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                517,
+                "content",
+                super::Type::Primitive(super::PrimitiveType::Int),
+            ))
+        })
+    };
+    const SEQUENCE_NUMBER: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                515,
+                "sequence_number",
+                super::Type::Primitive(super::PrimitiveType::Long),
+            ))
+        })
+    };
+    const MIN_SEQUENCE_NUMBER: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                516,
+                "min_sequence_number",
+                super::Type::Primitive(super::PrimitiveType::Long),
+            ))
[GitHub] [iceberg] andreacfm opened a new pull request, #8528: Schema Merge docs
andreacfm opened a new pull request, #8528: URL: https://github.com/apache/iceberg/pull/8528 Documentation about schemaMerge See #8005
[GitHub] [iceberg] zeddit commented on issue #8515: Python: Support vectorization read which improve read performance
zeddit commented on issue #8515: URL: https://github.com/apache/iceberg/issues/8515#issuecomment-1711486546 Many thanks for your help. I have tried a PoC with `minio + hive metastore + iceberg`, and I am using `pyiceberg` to run some performance tests. Performance is poor when just reading data into my Python environment, e.g. 1. reading a table that contains only 1 row takes about 6 seconds. 2. reading a table of double-typed data with 100k rows and about 20 columns, about 200MB in size, takes about 40 seconds. I think that is a bit slow, at a bandwidth of about 5MB/s. I wonder whether the problem is on my side, so I would like benchmark results for comparison. Many thanks.
[GitHub] [iceberg-rust] liurenjie1024 commented on a diff in pull request #57: Metadata integration tests
liurenjie1024 commented on code in PR #57:
URL: https://github.com/apache/iceberg-rust/pull/57#discussion_r1319737798

## crates/iceberg/src/spec/table_metadata.rs: ##
@@ -346,21 +349,29 @@ pub(super) mod _serde {
             } else {
                 value.current_snapshot_id
             };
+            let schemas = HashMap::from_iter(
+                value
+                    .schemas
+                    .into_iter()
+                    .map(|schema| Ok((schema.schema_id, Arc::new(schema.try_into()?))))
+                    .collect::<Result<Vec<_>, Error>>()?,
+            );
             Ok(TableMetadata {
                 format_version: FormatVersion::V2,
                 table_uuid: value.table_uuid,
                 location: value.location,
                 last_sequence_number: value.last_sequence_number,
                 last_updated_ms: value.last_updated_ms,
                 last_column_id: value.last_column_id,
-                schemas: HashMap::from_iter(
-                    value
-                        .schemas
-                        .into_iter()
-                        .map(|schema| Ok((schema.schema_id, Arc::new(schema.try_into()?))))
-                        .collect::<Result<Vec<_>, Error>>()?,
-                ),
-                current_schema_id: value.current_schema_id,
+                current_schema_id: if schemas.keys().contains(&value.current_schema_id) {
+                    Ok(value.current_schema_id)
+                } else {
+                    Err(self::Error::new(
+                        ErrorKind::DataInvalid,
+                        "No schema exists with the current schema id.",

Review Comment:
```suggestion
format!("No schema exists with the current schema id: {}", *value.current_schema_id),
```
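The check under review, together with the suggested error message that names the offending id, can be sketched in isolation as follows. This is a simplified illustration under stated assumptions: `DataInvalid` is a hypothetical stand-in for the crate's `Error`/`ErrorKind::DataInvalid`, the function name `check_current_schema_id` is invented for the example, and the schema type is left generic because only the map keys matter here.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an error of kind `DataInvalid`.
#[derive(Debug, PartialEq)]
pub struct DataInvalid(pub String);

/// Returns the id unchanged when a schema with that id exists in the map;
/// otherwise returns an error whose message includes the missing id.
pub fn check_current_schema_id<S>(
    schemas: &HashMap<i32, S>,
    current_schema_id: i32,
) -> Result<i32, DataInvalid> {
    if schemas.contains_key(&current_schema_id) {
        Ok(current_schema_id)
    } else {
        Err(DataInvalid(format!(
            "No schema exists with the current schema id: {}",
            current_schema_id
        )))
    }
}
```

Using `contains_key` on the map avoids iterating over `keys()`, and including the id in the message makes a failing metadata file easier to diagnose.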
[GitHub] [iceberg-rust] liurenjie1024 commented on pull request #56: feat: support read Manifest List
liurenjie1024 commented on PR #56: URL: https://github.com/apache/iceberg-rust/pull/56#issuecomment-1711520015 > And I find some place is inconsistent with spec. > > > https://iceberg.apache.org/spec/#manifests:~:text=504-,added_files_count,-int In practice, this field in avro is **added_data_files_count**; the same thing exists in: existing_files_count, deleted_files_count > > > > [Optional fields, **array elements**, and map values must be wrapped in an Avro union with null. This is the only union type allowed in Iceberg data files.](https://iceberg.apache.org/spec/#avro:~:text=Optional%20fields%2C%20array%20elements%2C%20and%20map%20values%20must%20be%20wrapped%20in%20an%20Avro%20union%20with%20null.%20This%20is%20the%20only%20union%20type%20allowed%20in%20Iceberg%20data%20files.) > > ``` > manifest_list: >partitions: `list<508: field_summary>` > ``` > > Actually this field_summary field is not an optional value. How about submitting a fix to iceberg-docs?
[GitHub] [iceberg-rust] liurenjie1024 commented on a diff in pull request #56: feat: support read Manifest List
liurenjie1024 commented on code in PR #56:
URL: https://github.com/apache/iceberg-rust/pull/56#discussion_r1319745001

## crates/iceberg/src/avro/mod.rs: ##
@@ -18,3 +18,4 @@
 //! Avro related codes.
 #[allow(dead_code)]
 mod schema;
+pub use schema::*;

Review Comment:
```suggestion
pub(crate) use schema::*;
```
Avro schema is not intended for external users.

## crates/iceberg/src/spec/manifest_list.rs: ##
@@ -0,0 +1,881 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! ManifestList for Iceberg.
+
+use crate::{avro::schema_to_avro_schema, spec::Literal, Error};
+use apache_avro::{from_value, types::Value, Reader};
+use once_cell::sync::Lazy;
+use std::sync::Arc;
+
+use super::{FormatVersion, ListType, NestedField, NestedFieldRef, Schema, StructType};
+
+/// Snapshots are embedded in table metadata, but the list of manifests for a
+/// snapshot are stored in a separate manifest list file.
+///
+/// A new manifest list is written for each attempt to commit a snapshot
+/// because the list of manifests always changes to produce a new snapshot.
+/// When a manifest list is written, the (optimistic) sequence number of the
+/// snapshot is written for all new manifest files tracked by the list.
+///
+/// A manifest list includes summary metadata that can be used to avoid
+/// scanning all of the manifests in a snapshot when planning a table scan.
+/// This includes the number of added, existing, and deleted files, and a
+/// summary of values for each field of the partition spec used to write the
+/// manifest.
+#[derive(Debug, Clone)]
+pub struct ManifestList {
+    /// Entries in a manifest list.
+    entries: Vec<ManifestListEntry>,
+}
+
+impl ManifestList {
+    /// Parse manifest list from bytes.
+    ///
+    /// QUESTION: Will we have more than one manifest list in a single file?
+    pub fn parse_with_version(
+        bs: &[u8],
+        version: FormatVersion,
+        partition_type: &StructType,
+    ) -> Result<ManifestList, Error> {
+        match version {
+            FormatVersion::V2 => {
+                let schema = schema_to_avro_schema("manifest_list", &Self::v2_schema()).unwrap();
+                let reader = Reader::with_schema(&schema, bs)?;
+                let values = Value::Array(reader.collect::<Result<Vec<Value>, _>>()?);
+                from_value::<_serde::ManifestListV2>(&values)?.try_into(partition_type)
+            }
+            FormatVersion::V1 => {
+                let schema = schema_to_avro_schema("manifest_list", &Self::v1_schema()).unwrap();
+                let reader = Reader::with_schema(&schema, bs)?;
+                let values = Value::Array(reader.collect::<Result<Vec<Value>, _>>()?);
+                from_value::<_serde::ManifestListV1>(&values)?.try_into(partition_type)
+            }
+        }
+    }
+
+    /// Get the entries in the manifest list.
+    pub fn entries(&self) -> &[ManifestListEntry] {
+        &self.entries
+    }
+
+    const MANIFEST_PATH: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                500,
+                "manifest_path",
+                super::Type::Primitive(super::PrimitiveType::String),
+            ))
+        })
+    };
+    const MANIFEST_LENGTH: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                501,
+                "manifest_length",
+                super::Type::Primitive(super::PrimitiveType::Long),
+            ))
+        })
+    };
+    const PARTITION_SPEC_ID: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                502,
+                "partition_spec_id",
+                super::Type::Primitive(super::PrimitiveType::Int),
+            ))
+        })
+    };
+    const CONTENT: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                517,
+                "content",
+                super::Type::Primitive(super::PrimitiveType::Int),
+            ))
+        })
+    };
+    const SEQUENCE_NUMBER: Lazy<NestedFieldRef> = {
+        Lazy::new(|| {
+            Arc::new(NestedField::required(
+                515,
+                "sequence_number",
+                super::Type::Primitive(super::PrimitiveType::Long),
[GitHub] [iceberg] xuqi1633 commented on issue #3028: i can't import class which start with org.apache.iceberg.relocated
xuqi1633 commented on issue #3028: URL: https://github.com/apache/iceberg/issues/3028#issuecomment-1711568897 After compiling the project, a relocated guava jar file will be generated under the bundled-guava module ``` ./gradlew clean build -x test -x javadoc -x integrationTest ``` If other modules also rely on the bundled-guava module, add the iceberg-bundled-guava-1.4.0-SNAPSHOT libraries to the module after adding libraries @ahmedriza
[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319799186

## format/spec.md: ##
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+ Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order

Review Comment:
added a detailed note.
[GitHub] [iceberg] ajantha-bhat commented on pull request #7105: Spec: Add partition stats spec
ajantha-bhat commented on PR #7105: URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1711585128 @RussellSpitzer, @flyrain, @szehon-ho, @rdblue: I have addressed the new suggestions. Please approve the PR if it is ok or comment more if we need further changes. Thanks.
[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319801698

## format/spec.md: ##
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to

Review Comment:
Simplified
[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec
ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319802967

## format/spec.md: ##
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
+ Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+ Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the **table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring that statistics for all partitions are present.
+
+A partition statistics file stores statistics as a struct with the following fields:

Review Comment:
ok. Changed to `Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).`
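The two row constraints quoted in this review thread (rows sorted ascending with NULL FIRST over the partition columns, and exactly one row per unique partition tuple) can be illustrated with a small sketch. Everything named here is an assumption for illustration: `PartitionTuple` is a hypothetical two-column partition, and neither function comes from Iceberg. The sketch relies on the fact that Rust orders `None` before any `Some(_)` and compares tuples column by column, which matches ascending order with nulls first.

```rust
/// Hypothetical partition tuple with two nullable partition columns.
pub type PartitionTuple = (Option<i32>, Option<String>);

/// Sorts rows ascending with nulls first, comparing partition columns in
/// declaration order. `Option`'s ordering places `None` before `Some(_)`.
pub fn sort_nulls_first(rows: &mut [PartitionTuple]) {
    rows.sort();
}

/// Checks that already-sorted rows contain each partition tuple only once,
/// by comparing every pair of neighbouring rows.
pub fn tuples_are_unique(sorted_rows: &[PartitionTuple]) -> bool {
    sorted_rows.windows(2).all(|pair| pair[0] != pair[1])
}
```

A writer following these rules would sort its per-partition rows with something like `sort_nulls_first` before writing, and a validator could reject a file for which `tuples_are_unique` returns `false`.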
[GitHub] [iceberg] juanrondineau commented on issue #8333: Unable to merge CDC data into snapshot data. java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.la
juanrondineau commented on issue #8333: URL: https://github.com/apache/iceberg/issues/8333#issuecomment-1711594714 @chandu-1101, thanks for the welcome. I am sharing two screenshots. The first simulates, in a DBeaver session connected to Spark, the operations that dbt executes internally: in this case dbt creates a temporary view from a select over the table where we look for new data. Then, when it tries to merge the new data into the destination, we get the cast exception. In the second screenshot we replace the create temporary view with a create table statement; this avoids the exception and the merge operation works fine.
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #8528: Schema Merge docs
RussellSpitzer commented on code in PR #8528: URL: https://github.com/apache/iceberg/pull/8528#discussion_r1319920277 ## docs/spark-writes.md: ## @@ -313,6 +313,22 @@ data.writeTo("prod.db.table") .createOrReplace() ``` +### Schema Merge + +Iceberg support dynamic `schemaMerge` at writing time. The table must be configured to accept any schema. Review Comment: We should probably explain what this means without reusing the same name
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7105: Spec: Add partition stats spec
szehon-ho commented on code in PR #7105: URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319945486 ## format/spec.md: ## @@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields: | _optional_ | _optional_ | **`properties`** | `map` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. | + Partition statistics + +Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). +Partition statistics are not required for reading or planning and readers may ignore them. +Each table snapshot may be associated with at most one partition statistic file. +A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, +it must be registered in the table metadata file to be considered as a valid statistics file for the reader. + +Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields: + +| v1 | v2 | Field name | Type | Description | +||||--|-| +| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. | +| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). | +| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. | + + Partition Statistics file + +Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC). 
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order Review Comment: Nit: can we simplify to just `These rows must be sorted (in ascending manner with NULL FIRST) by partition to optimize...` ? ## format/spec.md: ## @@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields: | _optional_ | _optional_ | **`properties`** | `map` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. | + Partition statistics + +Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). +Partition statistics are not required for reading or planning and readers may ignore them. +Each table snapshot may be associated with at most one partition statistic file. +A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, +it must be registered in the table metadata file to be considered as a valid statistics file for the reader. + +Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields: Review Comment: Nit: does not make too much sense, does this suffice? `Partition statistics files contain a struct `partition-statistics' with the following fields` ## format/spec.md: ## @@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields: | _optional_ | _optional_ | **`properties`** | `map` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. | + Partition statistics + +Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). +Partition statistics are not required for reading or planning and readers may ignore them. +Each table snapshot may be associated with at most one partition statistic file. 
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, Review Comment: Nit: I am not too sure these two sentences add much value; it is the case for any file reference in Iceberg, isn't it? ```A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, it must be registered in the table metadata file to be considered as a valid statistics file for the reader.```
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7105: Spec: Add partition stats spec
szehon-ho commented on code in PR #7105: URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319942204 ## format/spec.md: ## @@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields: | _optional_ | _optional_ | **`properties`** | `map` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. | + Partition statistics + +Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). +Partition statistics are not required for reading or planning and readers may ignore them. +Each table snapshot may be associated with at most one partition statistic file. +A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, Review Comment: Just my opinion, but I am not too sure these two sentences add much value; it is the case for any file reference in Iceberg, isn't it? ```A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, it must be registered in the table metadata file to be considered as a valid statistics file for the reader.```
[GitHub] [iceberg-docs] amogh-jahagirdar merged pull request #274: Update vendors.md
amogh-jahagirdar merged PR #274: URL: https://github.com/apache/iceberg-docs/pull/274
[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec
ajantha-bhat commented on code in PR #7105: URL: https://github.com/apache/iceberg/pull/7105#discussion_r1320008836 ## format/spec.md: ## @@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields: | _optional_ | _optional_ | **`properties`** | `map` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. | + Partition statistics + +Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). +Partition statistics are not required for reading or planning and readers may ignore them. +Each table snapshot may be associated with at most one partition statistic file. +A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, +it must be registered in the table metadata file to be considered as a valid statistics file for the reader. + +Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields: Review Comment: I copied from puffin statistics file statements few lines above. changed it to `partition-statistics` field of table metadata is an optional list of struct with the following fields: ## format/spec.md: ## @@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields: | _optional_ | _optional_ | **`properties`** | `map` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. | + Partition statistics + +Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). +Partition statistics are not required for reading or planning and readers may ignore them. +Each table snapshot may be associated with at most one partition statistic file. +A writer can optionally write the partition statistics file during each write operation. 
If the statistics file is written for the specific snapshot, Review Comment: I have shortened it a bit. Even though it seems implicit, it links back to how the file is tracked and when it is valid. I remember getting a review comment asking me to add this statement.
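For readers following the thread, here is a minimal Python sketch of the `partition-statistics` metadata entries as they appear in the quoted diff: a list of structs with `snapshot-id`, `statistics-file-path`, and `max-data-sequence-number`, with at most one file per snapshot. The class and function names are hypothetical and chosen for illustration; this is not code from the Iceberg project, and the field set reflects only the proposal under review.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PartitionStatisticsFile:
    """One entry of the proposed `partition-statistics` list (hypothetical class name)."""

    snapshot_id: int               # `snapshot-id`
    statistics_file_path: str      # `statistics-file-path`
    max_data_sequence_number: int  # `max-data-sequence-number`


def index_by_snapshot(
    entries: Iterable[PartitionStatisticsFile],
) -> dict[int, PartitionStatisticsFile]:
    """Index entries by snapshot ID, rejecting a second file for the same snapshot."""
    by_snapshot: dict[int, PartitionStatisticsFile] = {}
    for entry in entries:
        if entry.snapshot_id in by_snapshot:
            raise ValueError(
                f"snapshot {entry.snapshot_id} already has a partition statistics file"
            )
        by_snapshot[entry.snapshot_id] = entry
    return by_snapshot


def stats_for(
    by_snapshot: dict[int, PartitionStatisticsFile], snapshot_id: int
) -> Optional[PartitionStatisticsFile]:
    """Return the registered entry, or None; readers may proceed without statistics."""
    return by_snapshot.get(snapshot_id)
```

Only files registered in the list are visible through `stats_for`, which mirrors the point debated above: a statistics file counts as valid for a snapshot only once the table metadata references it.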
[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec
ajantha-bhat commented on code in PR #7105: URL: https://github.com/apache/iceberg/pull/7105#discussion_r1320009356 ## format/spec.md: ## @@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields: | _optional_ | _optional_ | **`properties`** | `map` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. | + Partition statistics + +Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). +Partition statistics are not required for reading or planning and readers may ignore them. +Each table snapshot may be associated with at most one partition statistic file. +A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot, +it must be registered in the table metadata file to be considered as a valid statistics file for the reader. + +Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields: + +| v1 | v2 | Field name | Type | Description | +||||--|-| +| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. | +| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). | +| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. | + + Partition Statistics file + +Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC). +These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order Review Comment: Done. 
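The ordering discussed in this review thread (rows sorted by partition, ascending, with NULL first) can be sketched in plain Python as follows. This is an illustration under assumptions: each row is a dict with a `partition` tuple, the `record_count` column is a hypothetical statistic, and values within one partition column are mutually comparable. It does not show how any Iceberg writer implements the sort.

```python
from typing import Any, Dict, List, Optional, Tuple

PartitionTuple = Tuple[Optional[Any], ...]


def nulls_first_key(partition: PartitionTuple) -> tuple:
    """Build a sort key where None in any column sorts before every real value."""
    # (0, 0) for NULL and (1, value) otherwise: the leading flag decides
    # NULL-vs-value comparisons, so None is never compared with a value.
    return tuple((0, 0) if value is None else (1, value) for value in partition)


def sort_partition_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return rows sorted ascending by partition tuple, NULLs first per column."""
    return sorted(rows, key=lambda row: nulls_first_key(row["partition"]))


rows = [
    {"partition": ("us", 2), "record_count": 40},
    {"partition": (None, 5), "record_count": 7},
    {"partition": ("eu", None), "record_count": 3},
    {"partition": ("eu", 1), "record_count": 12},
]
ordered = [row["partition"] for row in sort_partition_rows(rows)]
```

The flag-plus-value key avoids the `TypeError` that Python raises when `None` is compared directly with a string or integer.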
[GitHub] [iceberg] amogh-jahagirdar commented on pull request #8491: Python: Improved Readability and Alignment of Regex Patterns
amogh-jahagirdar commented on PR #8491: URL: https://github.com/apache/iceberg/pull/8491#issuecomment-1711851893 @hiteshbedre Since this is more of a cleanup, I'll merge after the checks pass.
[GitHub] [iceberg-go] delaneyj opened a new issue, #4: Implementations?
delaneyj opened a new issue, #4: URL: https://github.com/apache/iceberg-go/issues/4 ### Question Iceberg has subprojects targeting Arrow/ORC/Parquet/etc. Are there plans to have adapters be part of this repo? Are there plans to have interfaces for `SchemaToDatastore`?
[GitHub] [iceberg] amogh-jahagirdar merged pull request #8491: Python: Improved Readability and Alignment of Regex Patterns
amogh-jahagirdar merged PR #8491: URL: https://github.com/apache/iceberg/pull/8491
[GitHub] [iceberg] amogh-jahagirdar commented on pull request #8491: Python: Improved Readability and Alignment of Regex Patterns
amogh-jahagirdar commented on PR #8491: URL: https://github.com/apache/iceberg/pull/8491#issuecomment-1711878925 Thanks for the contribution @hiteshbedre !
[GitHub] [iceberg-docs] melvynator opened a new pull request, #275: Update vendors.md
melvynator opened a new pull request, #275: URL: https://github.com/apache/iceberg-docs/pull/275 Fixed a typo
[GitHub] [iceberg-go] zeroshade commented on issue #4: Implementations?
zeroshade commented on issue #4: URL: https://github.com/apache/iceberg-go/issues/4#issuecomment-1711921361 I plan on supporting Arrow, Parquet, Avro and ORC in this repo as much as I can. That said, I'm not familiar with `SchemaToDatastore`, but I want to support as much as possible in this library.
[GitHub] [iceberg-go] delaneyj commented on issue #4: Implementations?
delaneyj commented on issue #4: URL: https://github.com/apache/iceberg-go/issues/4#issuecomment-1712023219 Oh, it's not a library. I meant including an interface so that any of these options, or others, can be plugged in.
[GitHub] [iceberg] kunal-nandwana opened a new issue, #5556: Feature Request: Support mergeSchema option when using Spark MERGE INTO
kunal-nandwana opened a new issue, #5556: URL: https://github.com/apache/iceberg/issues/5556 ### Feature Request / Improvement Hi Team, I am using Iceberg in my project and found that an important feature is missing: "merge schema", which is readily available in Apache Hudi and Delta Lake. If possible, this feature should be added to Iceberg. My previous ticket, linked below for reference, explains the problem I am facing. [https://github.com/apache/iceberg/issues/5548](#5548) @rdblue any thoughts on this? ### Query engine Spark
[GitHub] [iceberg] vinitamaloo-asu commented on issue #2442: cannot insert value in hive command shell
vinitamaloo-asu commented on issue #2442: URL: https://github.com/apache/iceberg/issues/2442#issuecomment-1712329399 I created a new catalog "iceberg_catalog" using Spark config like below: `.set("spark.sql.catalog.iceberg_catalog", "org.apache.iceberg.spark.SparkCatalog") .set("spark.sql.catalog.iceberg_catalog.type", "hive")` Now, to create Iceberg tables, I also initialized a Hive catalog with the same catalog name and properties, which is redundant. `val catalog = new HiveCatalog() catalog.setConf(conf) catalog.initialize( "iceberg_catalog", JavaConverters.mapAsJavaMap(Map( CatalogProperties.CATALOG_IMPL -> "org.apache.iceberg.hive.HiveCatalog", CatalogProperties.URI -> "thrift://localhost:9083", CatalogProperties.WAREHOUSE_LOCATION -> warehouseUri )))` Is there a way to get the previously initialized catalog from the Spark conf?
[GitHub] [iceberg] vinitamaloo-asu opened a new issue, #8529: CASCADE WITH Drop Namespace Gives exception
vinitamaloo-asu opened a new issue, #8529: URL: https://github.com/apache/iceberg/issues/8529 ### Apache Iceberg version 1.3.1 (latest release) ### Query engine Spark ### Please describe the bug 🐞 Running this command: `spark.sql("DROP DATABASE IF EXISTS dbname CASCADE")` gives the exception below: `org.apache.iceberg.exceptions.NamespaceNotEmptyException: Namespace dbname is not empty. One or more tables exist. at org.apache.iceberg.hive.HiveCatalog.dropNamespace(HiveCatalog.java:353) at org.apache.iceberg.spark.SparkCatalog.dropNamespace(SparkCatalog.java:447) at org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:52) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:46)` The expectation is that, with CASCADE specified, the command should delete all tables and the database itself.
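The difference between the reported behavior and the expected CASCADE semantics can be illustrated with a toy in-memory catalog. Everything below is hypothetical (`ToyCatalog`, `NamespaceNotEmptyError`, `drop_namespace_cascade`); it is not the Iceberg or Spark API, only a sketch of "drop every table first, then drop the namespace", assuming a plain drop refuses to remove a non-empty namespace as the stack trace above suggests.

```python
from typing import Dict, List, Set


class NamespaceNotEmptyError(Exception):
    """Raised when a non-empty namespace is dropped without cascading (toy error type)."""


class ToyCatalog:
    """Hypothetical in-memory catalog; not the Iceberg HiveCatalog."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, Set[str]] = {}

    def create_table(self, namespace: str, table: str) -> None:
        self._namespaces.setdefault(namespace, set()).add(table)

    def list_tables(self, namespace: str) -> List[str]:
        return sorted(self._namespaces.get(namespace, set()))

    def drop_table(self, namespace: str, table: str) -> None:
        self._namespaces[namespace].discard(table)

    def drop_namespace(self, namespace: str) -> bool:
        # A plain drop refuses to remove a namespace that still holds tables.
        if self._namespaces.get(namespace):
            raise NamespaceNotEmptyError(f"Namespace {namespace} is not empty.")
        # Returns False when the namespace did not exist (the IF EXISTS case).
        return self._namespaces.pop(namespace, None) is not None


def drop_namespace_cascade(catalog: ToyCatalog, namespace: str) -> bool:
    """Drop all tables in the namespace, then the namespace itself."""
    for table in catalog.list_tables(namespace):
        catalog.drop_table(namespace, table)
    return catalog.drop_namespace(namespace)
```

With this toy model, `drop_namespace` on a populated `dbname` raises, while `drop_namespace_cascade` empties it and then removes it, which is the outcome the reporter expected from `CASCADE`.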
[GitHub] [iceberg] github-actions[bot] closed issue #6914: change partition led to query bug
github-actions[bot] closed issue #6914: change partition led to query bug URL: https://github.com/apache/iceberg/issues/6914
[GitHub] [iceberg] github-actions[bot] commented on issue #6914: change partition led to query bug
github-actions[bot] commented on issue #6914: URL: https://github.com/apache/iceberg/issues/6914#issuecomment-1712351531 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'