Re: [I] [Feature Request] Speed up InspectTable.files() [iceberg-python]

via GitHub Tue, 15 Oct 2024 13:25:34 -0700


kevinjqliu commented on issue #1229:
URL: https://github.com/apache/iceberg-python/issues/1229#issuecomment-2414939042


   > Most of the time is spent processing the manifests record-by-record and 
converting each record to a dict
   
   Here's a snippet that uses threads to parallelize both reading the manifest 
files and converting their entries to records. 
   
   This is generated by Claude, so please double-check.
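   ```python
   from concurrent.futures import ThreadPoolExecutor
   from functools import partial
   from typing import Any, Dict
   
   import pyarrow as pa
   from pyiceberg.conversions import from_bytes
   from pyiceberg.io import FileIO
   from pyiceberg.manifest import (
       FileFormat,
       ManifestEntry,
       ManifestEntryStatus,
       ManifestFile,
   )
   from pyiceberg.schema import Schema
   from pyiceberg.table import Table
   from pyiceberg.types import NestedField
   
   # NOTE: `_FILES_SCHEMA` is not defined in this snippet. It is assumed to be a
   # pa.Schema with the columns `path`, `event_id`, `ts_min`, and `ts_max`.
   
   def get_message_table_files(
       table: Table,
   ) -> pa.Table:
       schema = table.metadata.schema()
       snapshot = table.current_snapshot()
       if not snapshot:
           return pa.Table.from_pylist([], schema=_FILES_SCHEMA)
   
       # Read the manifests in parallel; each worker returns one RecordBatch.
       # pool.map() yields the batches in manifest order.
       with ThreadPoolExecutor() as pool:
           return pa.Table.from_batches(
               pool.map(
                   partial(_process_manifest, schema, table.io),
                   snapshot.manifests(table.io),
               ),
               schema=_FILES_SCHEMA,
           )
   
   def _process_manifest(
       table_schema: Schema,
       io: FileIO,
       manifest: ManifestFile,
   ) -> pa.RecordBatch:
       ts_field = table_schema.find_field('ts')
   
       # Keep only live Parquet data files.
       entries = list(manifest.fetch_manifest_entry(io))
       valid_entries = [
           entry for entry in entries
           if entry.data_file.file_format == FileFormat.PARQUET
           and entry.status != ManifestEntryStatus.DELETED
       ]
   
       # Convert the entries of this manifest to row dicts in parallel.
       with ThreadPoolExecutor() as pool:
           rows = list(pool.map(
               partial(_process_entry, ts_field),
               valid_entries
           ))
   
       return pa.RecordBatch.from_pylist(rows, schema=_FILES_SCHEMA)
   
   def _process_entry(ts_field: NestedField, entry: ManifestEntry) -> Dict[str, Any]:
       # Assumes every data file carries lower/upper bounds for `ts`;
       # a missing bound (None) would likely make from_bytes() fail.
       return {
           'path': entry.data_file.file_path,
           'event_id': entry.data_file.partition.event_id,
           'ts_min': from_bytes(
               ts_field.field_type,
               entry.data_file.lower_bounds.get(ts_field.field_id),
           ),
           'ts_max': from_bytes(
               ts_field.field_type,
               entry.data_file.upper_bounds.get(ts_field.field_id),
           ),
       }
   ```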
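
For anyone who wants to try the fan-out pattern without a table at hand, here is a minimal stdlib-only sketch. Everything in it is a hypothetical stand-in rather than a pyiceberg call: `process_manifest` plays the role of `_process_manifest`, a "manifest" is just a list of file names, and the `s3://bucket` prefix is made up. It only illustrates how `partial()` binds the shared arguments and how `pool.map()` returns results in input order.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Dict, List


def process_manifest(prefix: str, manifest: List[str]) -> List[Dict[str, str]]:
    """Hypothetical stand-in: turn one manifest's entries into row dicts."""
    return [{"path": f"{prefix}/{name}"} for name in manifest]


def collect_rows(prefix: str, manifests: List[List[str]]) -> List[Dict[str, str]]:
    # partial() binds the shared argument; pool.map() supplies one manifest
    # per call and yields results in input order, whichever thread finishes first.
    with ThreadPoolExecutor() as pool:
        per_manifest = pool.map(partial(process_manifest, prefix), manifests)
        return [row for rows in per_manifest for row in rows]


rows = collect_rows("s3://bucket", [["a.parquet", "b.parquet"], ["c.parquet"]])
```

Because the results come back in input order, the rows stay grouped by manifest. Whether threads help in the real case depends on how much of the time goes to I/O rather than pure-Python record conversion, so it is worth measuring.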
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


