DieHertz commented on issue #1229: URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2428451067
So I haven't tried any actual changes yet, but I decided to collect some baseline measurements with py-spy.

First, pyiceberg 0.7.1 `.inspect.files()` on my big table:

```python
In [3]: start = time.time()
   ...: f = m.inspect.files()
   ...: print('elapsed', time.time() - start)
elapsed 110.91524386405945

In [4]: len(f)
Out[4]: 401188

In [5]: len(m.current_snapshot().manifests(m.io))
Out[5]: 688
```

Most of the time is spent in the `AvroFile` constructor, where the initial decoding occurs, and in the list comprehension over manifest entries, where the Avro records get transformed into dicts. I argue that this is CPU load rather than IO, and to prove that conclusively I ran the same code with a quickly crafted memoized Snapshot/Manifest:

```python
import pyiceberg
from concurrent.futures import ThreadPoolExecutor


class IOFromBytes:
    """Minimal file-like object serving pre-downloaded bytes."""

    def __init__(self, bytes_: bytes):
        self._bytes = bytes_

    def open(self):
        return self

    def __enter__(self):
        return self

    def __exit__(self, a, b, c):
        ...

    def read(self):
        return self._bytes

    def new_input(self, *args, **kwargs):
        return self


class MemoryManifest:
    """Wraps a manifest, reading its file into memory once up front."""

    def __init__(self, manifest, io):
        self._manifest = manifest
        with io.new_input(manifest.manifest_path).open() as f:
            self._io = IOFromBytes(f.read())

    def fetch_manifest_entry(self, *args, **kwargs):
        # Substitute the in-memory IO for whatever io the caller passed.
        return self._manifest.fetch_manifest_entry(self._io, **kwargs)


class MemorySnapshot:
    """Pre-downloads all manifests of the current snapshot concurrently."""

    def __init__(self, table: pyiceberg.table.Table):
        with ThreadPoolExecutor() as pool:
            self._manifests = list(pool.map(
                lambda manifest: MemoryManifest(manifest, table.io),
                table.current_snapshot().manifests(table.io),
            ))

    def manifests(self, *args, **kwargs):
        return self._manifests
```

Now we can see that the actual IO takes less than one second for a total of ~112 MiB (without the `ThreadPoolExecutor` it was closer to 11 seconds):

```python
In [35]: start = time.time()
    ...: snapshot = MemorySnapshot(m)
    ...: print('elapsed', time.time() - start)
elapsed 0.4690868854522705

In [37]: len(snapshot._manifests)
Out[37]: 688

In [39]: sum(len(manifest._io._bytes) for manifest in snapshot._manifests) / 1024 / 1024
Out[39]: 112.21551609039307
```

Now `.inspect.files()` over the already-downloaded data:

```python
In [36]: start = time.time()
    ...: m.inspect._get_snapshot = lambda self_: snapshot
    ...: f = m.inspect.files()
    ...: print('elapsed', time.time() - start)
elapsed 97.30642795562744

In [38]: len(f)
Out[38]: 401188
```

So IO accounts for a little more than 10% of the total time taken by `.inspect.files()`, and that is roughly the improvement I would expect if we use just the `ThreadPoolExecutor`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
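As an aside, the fetch-then-decode split above can be distilled into a small standalone helper that pre-downloads a list of files concurrently before any decoding starts. This is only a sketch: `prefetch` and `read_fn` are hypothetical names, and `read_fn` stands in for something like `io.new_input(path).open().read()` rather than pyiceberg's actual FileIO API.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch(paths, read_fn, max_workers=32):
    """Download all files concurrently and return a {path: bytes} map.

    Object-store latency dominates a serial download loop, so overlapping
    the requests is what collapsed the ~11 s serial fetch to under 1 s in
    the measurement above. Decoding stays serial and CPU-bound either way.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path
        # with its own downloaded bytes.
        return dict(zip(paths, pool.map(read_fn, paths)))


# Usage with a fake reader standing in for object-store IO:
blobs = prefetch(["a.avro", "b.avro"], lambda p: p.encode())
```

Since the remaining ~90% of the time is CPU-bound decoding under the GIL, threads cannot help beyond this; any further gains would have to come from speeding up or parallelizing the decoding itself.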