MartynovVA-DE opened a new issue, #8710: URL: https://github.com/apache/iceberg/issues/8710
### Query engine

pyspark version: 3.4.1
iceberg version: 1.3.0

Spark config:

```python
spark = SparkSession.builder \
    .appName("test_spj_with") \
    .enableHiveSupport() \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.hadoop.fs.s3a.endpoint", "storage.yandexcloud.net") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.connection.maximum", "2000") \
    .config("spark.hadoop.fs.s3a.max.total.tasks", "2000") \
    .config("spark.hadoop.fs.s3a.threads.max", "2000") \
    .config("spark.hadoop.fs.s3a.experimental.fadvise", "random") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.executor.instances", "4") \
    .config("spark.driver.cores", "3") \
    .config("spark.executor.cores", "1") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memory", "12g") \
    .config("spark.sql.sources.bucketing.enabled", "true") \
    .config("spark.sql.sources.v2.bucketing.enabled", "true") \
    .config("spark.sql.v2_bucketing_enabled", "true") \
    .config("spark.sql.iceberg.planning.preserve-data-grouping", "true") \
    .config("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true") \
    .config("spark.sql.requireAllClusterKeysForCoPartition", "false") \
    .config("spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled", "true") \
    .getOrCreate()
```

### Question

I am performing this join:

```python
spark.sql("""
    SELECT t1.*
    FROM schema.table_1 t1
    INNER JOIN schema.table_2 t2
        ON t1.account_id_int = t2.account_id_int""")
```

`t1` and `t2` are identical tables with the following DDL:

```sql
CREATE TABLE schema.table_1 (
    account_id_int BIGINT,
    account_id_char STRING,
    agreement_id_int BIGINT,
    agreement_id_char STRING,
    start_date DATE,
    final_date DATE,
    account_nbr STRING,
    name STRING,
    account_type_id_int BIGINT,
    account_type_id_char STRING)
USING iceberg
CLUSTERED BY (account_id_int) INTO 8 BUCKETS
LOCATION 's3a://nova-nt/schema/table_1 '
TBLPROPERTIES (
    'current-snapshot-id' = '3281859074375823773',
    'format' = 'iceberg/parquet',
    'format-version' = '1')
```

The row count in both `t1` and `t2` is 500,000,001.

When performing the join, the query plan is as follows:

*(query plan screenshot omitted)*

The confusing thing in this plan is the number of rows in the BatchScan of `table_2`: 7,500,000,000 rows.

*(BatchScan screenshot omitted)*

Can you please help me understand why this happens? And can I somehow avoid this when SPJ is applied? Thank you in advance for your help!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
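Editor's note, a quick sanity check on the numbers in the question (an observation, not a confirmed diagnosis): the BatchScan row count reported for `table_2` is roughly 15 times the table's actual row count. That multiplicative pattern is what you would expect if each partition of one join side is read more than once, for example when `spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled=true` lets Spark replicate partitions of one side to balance skew; toggling that config is one way to test the hypothesis. The arithmetic:

```python
# Hedged arithmetic check: how inflated is the reported BatchScan row count
# compared to the table's actual size? (Both numbers come from the question.)
rows_per_table = 500_000_001      # actual rows in table_2
batchscan_rows = 7_500_000_000    # rows reported by BatchScan in the plan

inflation = batchscan_rows / rows_per_table
print(f"BatchScan read ~{inflation:.2f}x the table's rows")  # ~15.00x
```

A clean 15x ratio suggests each row was scanned about 15 times rather than a cardinality-estimation artifact, which is why the replication hypothesis is worth checking first.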