RussellSpitzer commented on issue #8130: URL: https://github.com/apache/iceberg/issues/8130#issuecomment-1671481858
If it is the planning phase, there isn't much to do since most of the cost is in reading the manifests. With 4000 files there are most likely many, many manifests. You can try increasing the size of the manifest-reading thread pool to increase parallelism there, but it's best to just optimize manifests more regularly and accept the cost of one long optimize to start with. I would also highly recommend running optimize data files more frequently if you have 4000 files that only take up 5 GB.

If it's in the delete phase, you just need to enable bulk deletes; this is the default in newer versions of Iceberg. Older versions of Iceberg had an explicit delete parallelism parameter for expire snapshots and delete orphan files. If you are on an older version, set these parameters to a high number like 50 or 100. If deletes are taking a long time, it's probably the latency of waiting on each delete response that is the bottleneck.
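A sketch of the maintenance calls described above, assuming Spark SQL with an Iceberg catalog named `my_catalog` and a table `db.tbl` (both names are placeholders, and the parallelism values are illustrative):

```sql
-- Compact small data files (the "optimize data files" step)
CALL my_catalog.system.rewrite_data_files(table => 'db.tbl');

-- Compact manifests so planning has fewer of them to read
CALL my_catalog.system.rewrite_manifests('db.tbl');

-- On older Iceberg versions, raise the explicit delete parallelism
-- for expire snapshots and remove orphan files
CALL my_catalog.system.expire_snapshots(
  table => 'db.tbl',
  max_concurrent_deletes => 50
);
CALL my_catalog.system.remove_orphan_files(
  table => 'db.tbl',
  max_concurrent_deletes => 50
);
```

For the manifest-reading side, the shared worker pool used during planning is sized by the `iceberg.worker.num-threads` JVM system property, which can be raised to increase planning parallelism.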
