DieHertz commented on issue #1229:
URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2428451067

   So I haven't tried any actual changes yet, but decided to collect some baseline measurements with py-spy.
   
   First, there's pyiceberg 0.7.1's `.inspect.files()` on my big phat table:
   ```python
   In [3]: start = time.time()
      ...: f = m.inspect.files()
      ...: print('elapsed', time.time() - start)
   elapsed 110.91524386405945
   
   In [4]: len(f)
   Out[4]: 401188
   
   In [5]: len(m.current_snapshot().manifests(m.io))
   Out[5]: 688
   ```
   ![m files baseline](https://github.com/user-attachments/assets/4b3bbe29-101e-4145-8969-15a8af6883ad)
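
   For reference, the flamegraph above was recorded with py-spy; one minimal way to reproduce such a capture (an assumption on my part — attaching by pid; py-spy can also launch a script directly) is:
   ```python
   import os

   # Print the interpreter's pid, then attach from another shell with:
   #   py-spy record --pid <pid> -o flamegraph.svg
   print(os.getpid())
   ```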
   
   It can be seen that most of the time is spent in the `AvroFile` constructor, where some initial decoding occurs, and inside the list comprehension for manifest entries, where the Avro records get transformed into dicts.
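
   Roughly, the hot path has this shape (a paraphrased sketch, not the actual pyiceberg source; `entry_to_dict` is a hypothetical stand-in for the per-entry dict conversion):
   ```python
   # Paraphrased sketch of the hot loop behind `.inspect.files()`; both
   # steps are pure CPU once the manifest bytes are in memory.
   for manifest in snapshot.manifests(io):
       entries = manifest.fetch_manifest_entry(io)  # AvroFile decode happens here
       rows = [entry_to_dict(entry) for entry in entries]  # hypothetical helper
   ```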
   
   I argue that this is CPU-bound rather than IO-bound, and to prove it conclusively I ran the same code with a quickly crafted memoized Snapshot/Manifest:
   ```python
   from concurrent.futures import ThreadPoolExecutor

   from pyiceberg.table import Table


   class IOFromBytes:
       """Duck-typed stand-in for FileIO/InputFile/InputStream that serves
       already-downloaded bytes from memory."""

       def __init__(self, bytes_: bytes):
           self._bytes = bytes_

       def open(self):
           return self

       def __enter__(self):
           return self

       def __exit__(self, a, b, c):
           ...

       def read(self):
           return self._bytes

       def new_input(self, *args, **kwargs):
           return self


   class MemoryManifest:
       """Wraps a ManifestFile, eagerly downloading its bytes once."""

       def __init__(self, manifest, io):
           self._manifest = manifest
           with io.new_input(manifest.manifest_path).open() as f:
               self._io = IOFromBytes(f.read())

       def fetch_manifest_entry(self, *args, **kwargs):
           # Swallow the caller-supplied io (*args) and substitute the
           # in-memory copy, so no network IO happens here.
           return self._manifest.fetch_manifest_entry(self._io, **kwargs)


   class MemorySnapshot:
       """Downloads all manifests of the current snapshot concurrently."""

       def __init__(self, table: Table):
           with ThreadPoolExecutor() as pool:
               self._manifests = list(pool.map(
                   lambda manifest: MemoryManifest(manifest, table.io),
                   table.current_snapshot().manifests(table.io),
               ))

       def manifests(self, *args, **kwargs):
           return self._manifests
   ```
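
   A side note on why `ThreadPoolExecutor` pays off here: the GIL is released while the reads block on the network, so the 688 manifest downloads overlap, whereas the subsequent Avro decoding holds the GIL and wouldn't benefit from threads the same way.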
   
   Now we can see the actual IO takes less than 1 second for a total of ~112 MiB (without `ThreadPoolExecutor` here it was closer to 11 seconds):
   ```python
   In [35]: start = time.time()
       ...: snapshot = MemorySnapshot(m)
       ...: print('elapsed', time.time() - start)
   elapsed 0.4690868854522705
   
   In [37]: len(snapshot._manifests)
   Out[37]: 688
   
   In [39]: sum(len(manifest._io._bytes) for manifest in snapshot._manifests) / 1024 / 1024
   Out[39]: 112.21551609039307
   ```
   
   Now `.inspect.files()` over the already-downloaded data:
   ```python
   In [36]: start = time.time()
       ...: m.inspect._get_snapshot = lambda self_: snapshot
       ...: f = m.inspect.files()
       ...: print('elapsed', time.time() - start)
   elapsed 97.30642795562744

   In [38]: len(f)
   Out[38]: 401188
   ```
   ![m files no-io](https://github.com/user-attachments/assets/e6921611-6641-465f-a216-a4bd6f8b52d0)
   
   It can be seen that IO accounts for a little more than 10% of the total time taken by `.inspect.files()`, so that's roughly the improvement I'd expect from using just a `ThreadPoolExecutor`; a minimal sketch of that prefetch follows.
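
   Here `prefetch_entries` is a hypothetical helper, not pyiceberg API; it reuses `fetch_manifest_entry` exactly as the memoized classes above do:
   ```python
   from concurrent.futures import ThreadPoolExecutor

   def prefetch_entries(tbl):
       # Download all manifest entries concurrently; the dict conversion
       # afterwards stays CPU-bound and is unaffected by this change.
       io = tbl.io
       manifests = tbl.current_snapshot().manifests(io)
       with ThreadPoolExecutor() as pool:
           return list(pool.map(lambda mf: mf.fetch_manifest_entry(io), manifests))
   ```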