stevenzwu commented on issue #9410: URL: https://github.com/apache/iceberg/issues/9410#issuecomment-1890156116
Ah, I didn't know it is a batch read mode using `asOfSnapshotId`. Note that they are `delete` (not `deleted`) files, which capture the row-level deletes. The actual delete files are not loaded during scan planning on the jobmanager/coordinator node; splits only contain the locations of those delete files. The problem is that an equality delete file can be associated with many data files, which is probably why you are seeing so many of them in one split. That is an unfortunate implication of equality deletes, and skipping those delete files would not be correct.

The delete compaction that was suggested earlier should help (a sketch of one way to run it is below). Did you use Spark for that? Spark batch writes should generate position deletes, which are easier for the read path.

Regardless, I agree with @pvary's suggestion of using `writeBytes` to fix the 64 KB size limit (see the second sketch below). Out of curiosity, how many delete files did you see in one split/data file?
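To illustrate the compaction route: below is a minimal sketch of triggering compaction from Spark with Iceberg's `rewriteDataFiles` action, using the `delete-file-threshold` option so that any data file with at least one associated delete file gets rewritten (and its deletes dropped). This is one way to do it under the assumption that you drive compaction from a Spark job; the class and method names around the action are illustrative, not necessarily what was suggested earlier in the thread.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class CompactAwayDeletes {

  // Rewrites every data file that has at least one associated delete file.
  // The rewritten files carry no deletes, so subsequent batch reads no longer
  // need to fan the equality-delete file locations out across many splits.
  static void rewriteFilesWithDeletes(SparkSession spark, Table table) {
    RewriteDataFiles.Result result =
        SparkActions.get(spark)
            .rewriteDataFiles(table)
            // rewrite any data file with >= 1 delete file attached
            .option("delete-file-threshold", "1")
            .execute();

    System.out.printf("rewrote %d data files%n", result.rewrittenDataFilesCount());
  }
}
```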
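On the 64 KB limit: `java.io.DataOutputStream.writeUTF` stores the payload length in an unsigned 16-bit prefix, so it throws `UTFDataFormatException` once the encoded string exceeds 65535 bytes, which is easy to hit when one split references many delete file paths. Below is a minimal sketch of the workaround for a hand-rolled serializer; it is illustrative, not Iceberg's actual split serializer. Note that `DataOutputStream.writeBytes(String)` itself truncates each char to its low byte, so writing UTF-8 bytes with an explicit `int` length prefix is the safer spelling of the same idea.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class LargeStringSerde {

  // writeUTF() stores the length in two bytes, capping payloads at 64 KB.
  // A 4-byte length prefix plus raw UTF-8 bytes has no such cap.
  static void writeLargeString(DataOutputStream out, String value) throws IOException {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  static String readLargeString(DataInputStream in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```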