kunwp1 opened a new issue, #5281:
URL: https://github.com/apache/texera/issues/5281
### What happened?
`S3StorageClient.deleteDirectory(bucketName, directoryPrefix)` lists objects
under the prefix with a single `listObjectsV2` call and then issues one
`deleteObjects` batch. `listObjectsV2` returns at most 1000 keys per page and
the method does not paginate (no continuation-token loop), so at most the first
1000 objects under the prefix are ever deleted. Any objects beyond that are
silently orphaned.
This affects the per-execution large-binary cleanup added in #4123 (PR
#5280): `LargeBinaryManager.deleteByExecution(eid)` deletes the prefix
`objects/{eid}/`. An execution that produces more than 1000 large binaries
(e.g., a File Scan over many files, or a Python UDF emitting many
`largebinary()` values) will leave the excess objects in the
`texera-large-binaries` bucket forever, undermining the "no leaks" goal of that
fix. The same cap applies to every other caller of `deleteDirectory`
(result/console/stats cleanup paths).
Related latent limit: AWS `DeleteObjectsRequest` accepts at most 1000 keys
per call, so once listing is paginated, deletions must also be chunked into
batches of ≤1000.
Expected: `deleteDirectory` should remove all objects under the prefix
regardless of count.
### Affected code
`common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala`
— `deleteDirectory` (~lines 105–145): single `listObjectsV2` (no
`isTruncated`/`nextContinuationToken` loop) + single `deleteObjects`.
### Suggested fix
Paginate the listing via the continuation token until `isTruncated` is
false, accumulating keys, and delete them in batches of ≤1000 per
`DeleteObjectsRequest`.
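
A minimal sketch of the suggested control flow, not the actual `S3StorageClient` code. To keep it self-contained, the S3 calls are abstracted behind two hypothetical hooks: `listPage` stands in for a `listObjectsV2` call that passes the continuation token and reads `isTruncated`/`nextContinuationToken` from the response, and `deleteBatch` stands in for one `deleteObjects` call. The names `PaginatedDelete`, `Page`, `deleteAll`, `listPage`, and `deleteBatch` are illustrative assumptions, not identifiers from the Texera codebase.

```scala
import scala.annotation.tailrec

object PaginatedDelete {
  // S3 caps both a ListObjectsV2 page and a DeleteObjects request at 1000 keys.
  val MaxKeysPerRequest = 1000

  /** One listing page: the keys returned and, if the listing was truncated, the next token. */
  final case class Page(keys: Seq[String], nextToken: Option[String])

  /**
    * Lists every key by following continuation tokens, then deletes in chunks.
    *
    * @param listPage    hypothetical hook: given an optional continuation token, returns one page
    * @param deleteBatch hypothetical hook: deletes at most MaxKeysPerRequest keys in one call
    * @return the total number of keys handed to deleteBatch
    */
  def deleteAll(listPage: Option[String] => Page, deleteBatch: Seq[String] => Unit): Int = {
    @tailrec
    def collect(token: Option[String], acc: Vector[String]): Vector[String] = {
      val page = listPage(token)
      val keys = acc ++ page.keys
      page.nextToken match {
        case Some(next) => collect(Some(next), keys) // listing truncated: fetch the next page
        case None       => keys                      // last page reached
      }
    }

    val allKeys = collect(None, Vector.empty)
    // grouped() yields nothing for an empty prefix, so no empty delete request is issued.
    allKeys.grouped(MaxKeysPerRequest).foreach(deleteBatch)
    allKeys.size
  }
}
```

Accumulating all keys before deleting mirrors the wording above and keeps listing and deletion from interleaving. For very large prefixes, deleting each page as it arrives would bound memory use instead; that variant is not shown here.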
### How to reproduce?
1. Run a workflow whose execution produces >1000 large binaries (so >1000
objects accumulate under `s3://texera-large-binaries/objects/{eid}/`).
2. Trigger cleanup for that execution (start a new run of the workflow, or
delete the workflow).
3. Inspect the bucket: objects beyond the first 1000 under `objects/{eid}/`
remain.
### Branch
main
### Affected Area
Storage
### Impact / Priority
(P3) Low–Medium — pre-existing; only affects executions producing >1000
objects under a single prefix, but causes silent storage leaks.