MartynovVA-DE opened a new issue, #8710:
URL: https://github.com/apache/iceberg/issues/8710

   ### Query engine
   
   PySpark version: 3.4.1
   Iceberg version: 1.3.0
   
   Spark config:

   from pyspark.sql import SparkSession

   spark = SparkSession.builder \
       .appName("test_spj_with") \
       .enableHiveSupport() \
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
       .config("spark.hadoop.fs.s3a.endpoint", "storage.yandexcloud.net") \
       .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
       .config("spark.hadoop.fs.s3a.connection.maximum", "2000") \
       .config("spark.hadoop.fs.s3a.max.total.tasks", "2000") \
       .config("spark.hadoop.fs.s3a.threads.max", "2000") \
       .config("spark.hadoop.fs.s3a.experimental.fadvise", "random") \
       .config("spark.hadoop.fs.s3a.path.style.access", "true") \
       .config("spark.executor.instances", "4") \
       .config("spark.driver.cores", "3") \
       .config("spark.executor.cores", "1") \
       .config("spark.driver.memory", "6g") \
       .config("spark.executor.memory", "12g") \
       .config("spark.sql.sources.bucketing.enabled", "true") \
       .config("spark.sql.sources.v2.bucketing.enabled", "true") \
       .config("spark.sql.v2_bucketing_enabled", "true") \
       .config("spark.sql.iceberg.planning.preserve-data-grouping", "true") \
       .config("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true") \
       .config("spark.sql.requireAllClusterKeysForCoPartition", "false") \
       .config("spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled", "true") \
       .getOrCreate()
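
   As a sanity check, here is a minimal sketch (assuming the same `spark` session as above) that prints the effective values of the SPJ-related flags; note that "spark.sql.v2_bucketing_enabled" does not look like a standard Spark key, unlike "spark.sql.sources.v2.bucketing.enabled" set above:

   # Hedged sketch: print the runtime values of the SPJ-related settings
   # that were passed to the builder above.
   for key in [
       "spark.sql.sources.v2.bucketing.enabled",
       "spark.sql.sources.v2.bucketing.pushPartValues.enabled",
       "spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled",
       "spark.sql.iceberg.planning.preserve-data-grouping",
       "spark.sql.requireAllClusterKeysForCoPartition",
   ]:
       print(key, spark.conf.get(key))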
   
   ### Question
   
   I am performing this join:
   
   spark.sql("""
   SELECT t1.* 
   FROM schema.table_1 t1 
   INNER JOIN schema.table_2 t2 
       ON t1.account_id_int = t2.account_id_int""")
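
   To double-check whether SPJ removed the shuffle, here is a minimal sketch (assuming the same session) that prints the physical plan as text; with SPJ in effect there should be no Exchange nodes feeding the SortMergeJoin:

   # Hedged sketch: dump the formatted physical plan as text
   # (easier to inspect and share than screenshots).
   df = spark.sql("""
       SELECT t1.*
       FROM schema.table_1 t1
       INNER JOIN schema.table_2 t2
           ON t1.account_id_int = t2.account_id_int""")
   df.explain(mode="formatted")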
   
   t1 and t2 are identical tables with the following DDL:
   
   CREATE TABLE schema.table_1 (
       account_id_int BIGINT,
       account_id_char STRING,
       agreement_id_int BIGINT,
       agreement_id_char STRING,
       start_date DATE,
       final_date DATE,
       account_nbr STRING,
       name STRING,
       account_type_id_int BIGINT,
       account_type_id_char STRING)
   USING iceberg
   CLUSTERED BY (account_id_int)
   INTO 8 BUCKETS
   LOCATION 's3a://nova-nt/schema/table_1'
   TBLPROPERTIES (
       'current-snapshot-id' = '3281859074375823773',
       'format' = 'iceberg/parquet',
       'format-version' = '1')
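
   (As far as I understand, on an Iceberg table the CLUSTERED BY ... INTO 8 BUCKETS clause corresponds to Iceberg's bucket partition transform, so the table could equivalently be declared as in this abbreviated sketch; the table name is hypothetical and the column list is shortened:)

   # Hedged sketch: equivalent declaration via the bucket(8, ...) transform.
   spark.sql("""
       CREATE TABLE schema.table_1_bucketed (
           account_id_int BIGINT,
           account_id_char STRING)
       USING iceberg
       PARTITIONED BY (bucket(8, account_id_int))""")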
   
   Each of t1 and t2 contains 500,000,001 rows.
   
   When the join is executed, the query plan looks like this:
   
![image](https://github.com/apache/iceberg/assets/126143757/a61d1797-a008-40a4-9ad1-734e8caf2f88)
   
   
   The confusing part of the plan is the row count in the BatchScan of table_2: 7,500,000,000 rows, roughly 15 times the 500,000,001 rows actually in the table.
   
![image](https://github.com/apache/iceberg/assets/126143757/fa281b62-3889-43b1-96ba-ed170c918c1f)
   
   Can you please help me understand why this happens?
   And is there a way to avoid it when SPJ is in effect?

   Thank you in advance for your help!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

