paveon opened a new issue, #1132:
URL: https://github.com/apache/iceberg-go/issues/1132

   ### Feature Request / Improvement
   
   `getReferencedFiles` iterates over every snapshot and reads all of its manifests and their entries. Because Iceberg manifests are immutable and shared across snapshots via copy-on-write, the same manifest is read N times, where N is the number of snapshots referencing it.
   
   For tables with many snapshots, this causes the orphan cleaner to spend 93%+ of CPU time in `getReferencedFiles`, making `DeleteOrphanFiles` effectively unusable on large tables.
   
   Proposed fix: a two-pass approach. First, read the lightweight manifest lists to discover the unique manifest paths; then read each unique manifest's entries once, in parallel. Additionally, run the S3 walk and the referenced-file collection concurrently via errgroup.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

