danielcweeks commented on PR #11112: URL: https://github.com/apache/iceberg/pull/11112#issuecomment-2402888017
> What do you think about this approach @danielcweeks:
>
> > Is there an optimal number of directories and depth? Maybe we can just create those and put the rest of the entropy into the file name. For example: /data/010/001/100/01010101001-file-name.parquet. This can reduce both the sparse directory problem and help with orphan clean-up?
>
> It might be a nice win-win case, solving both the sparse directory problem and orphan clean-up. As @jackye1995 noted, it is a bit weird for partitioned paths, but we can do it for non-partitioned paths only, so it would look like:
>
> * partitioned-path=true: `/data/010/001/100/01010101001/key=val/file-name.parquet`
> * partitioned-path=false: `/data/010/001/100/01010101001-file-name.parquet`

I like removing the additional path level when `partitioned-path=false` (that's a great idea). I'm just wondering about where we want to put the slashes in the bit field. Breaking it up by three bits means that each recursive listing operates on just 8 subpaths, which feels too small. I feel like we might want to move to four bits (i.e. `/data/0100/0110/0010/101010010110`). This leaves the leaf level with a reasonable `2^12` paths per directory and each parent with 16 children, which feels like enough to parallelize at some level.
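
To make the bit-field layout concrete, here is a minimal sketch of how a hash could be split into directory levels. It is only an illustration of the idea discussed in this comment, not the code in the PR: the function name `entropy_path`, its parameters, and the `/data` prefix handling are all hypothetical.

```python
from typing import Optional


def entropy_path(
    hash_value: int,
    file_name: str,
    bits_per_dir: int = 4,
    depth: int = 3,
    leaf_bits: int = 12,
    partition: Optional[str] = None,
) -> str:
    """Hypothetical helper: split a hash into `depth` directories of
    `bits_per_dir` bits each, followed by a `leaf_bits`-wide leaf field."""
    total_bits = bits_per_dir * depth + leaf_bits
    # Keep only the low `total_bits` bits and render them as a zero-padded binary string.
    bits = format(hash_value & ((1 << total_bits) - 1), f"0{total_bits}b")
    dirs = [bits[i * bits_per_dir:(i + 1) * bits_per_dir] for i in range(depth)]
    leaf = bits[bits_per_dir * depth:]
    prefix = "/data/" + "/".join(dirs)
    if partition is not None:
        # Partitioned paths: the leaf bits stay a directory, followed by the partition path.
        return f"{prefix}/{leaf}/{partition}/{file_name}"
    # Non-partitioned paths: the leaf bits are folded into the file name.
    return f"{prefix}/{leaf}-{file_name}"


h = int("010001100010101010010110", 2)
print(entropy_path(h, "file-name.parquet"))
print(entropy_path(h, "file-name.parquet", partition="key=val"))
```

With four bits per directory, each parent has `2 ** 4 = 16` children; with three bits it has `2 ** 3 = 8`. Passing `bits_per_dir=3, leaf_bits=11` reproduces the three-bit layout from the quoted proposal.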