gitzwz opened a new issue, #1479: URL: https://github.com/apache/iceberg-python/issues/1479
### Question

I encountered a problem with `table.scan().plan_files()`: there is no noticeable time difference between single-threaded and multi-threaded execution. The total time is directly proportional to the number of manifest entries. The table I used for testing has 6 manifest files, and each manifest file contains around 70,000 entries. The most time-consuming step is `_open_manifest` inside `DataScan.plan_files()`, and it performs similarly whether or not a thread pool is used. Could someone help me investigate whether there might be an issue?

Here is my test code:

```python
from pyiceberg.catalog import load_catalog
from pyspark.sql import SparkSession
from pyiceberg import expressions as pyi_expr
import time
from line_profiler import LineProfiler

catalog = load_catalog("default")
table = catalog.load_table('b_ods.pyiceberg_test2')

def scan_plan_files(key, values):
    row_filter = pyi_expr.In(key, values)
    files = table.scan(
        row_filter=row_filter,
        limit=1000
    ).plan_files()
    print(f"total plans {len(files)}")
    for file in files:
        print(file.file.file_path)

start_time = time.perf_counter()
scan_plan_files("cid", {'844'})
print(f"Time consumed: {time.perf_counter() - start_time:.3f} seconds")
```

I also modified the `~/.pyiceberg.yaml` file, changing `max-workers: 1` to `max-workers: 32`, but the total time stayed around 64 seconds with little to no change.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
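One possible explanation for the behavior described above (not confirmed against the pyiceberg internals, just a general CPython property) is that parsing manifest entries is CPU-bound Python work, and CPython's GIL allows only one thread to execute Python bytecode at a time. If that is the bottleneck, raising `max-workers` for a thread pool would not reduce wall-clock time. The sketch below demonstrates the effect with a stand-in busy loop; the function names and constants are illustrative, not taken from pyiceberg:

```python
# Demonstration that a ThreadPoolExecutor gives little speedup for
# CPU-bound pure-Python work under the GIL. `cpu_bound` is a hypothetical
# stand-in for decoding one manifest file's entries in Python.
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n: int) -> int:
    # Busy loop: pure-Python arithmetic holds the GIL the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 500_000
MANIFESTS = 6  # the issue reports 6 manifest files

# Sequential: process the "manifests" one after another.
start = time.perf_counter()
sequential = [cpu_bound(N) for _ in range(MANIFESTS)]
t_seq = time.perf_counter() - start

# Threaded: same work fanned out to a pool; the GIL serializes it anyway.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=MANIFESTS) as pool:
    threaded = list(pool.map(cpu_bound, [N] * MANIFESTS))
t_thr = time.perf_counter() - start

print(f"sequential: {t_seq:.3f}s, threaded: {t_thr:.3f}s")
```

On a typical CPython build the two timings come out close, which would match the observation that `max-workers: 32` changes nothing; threads only help `plan_files()` where the work releases the GIL (e.g. blocking I/O while fetching manifests from object storage).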