danielcweeks commented on PR #11112:
URL: https://github.com/apache/iceberg/pull/11112#issuecomment-2402888017

   > What do you think about this approach @danielcweeks:
   > 
   > > Is there an optimal number of directories and depth? Maybe we can just 
create those and put the rest of the entropy into the file name. For example: 
/data/010/001/100/01010101001-file-name.parquet. This can reduce the 
sparse directory problem and also help with orphan clean-up?
   > 
   > It might be a nice win-win, solving both the sparse directory problem and 
orphan clean-up. As @jackye1995 noted, it is a bit weird for partitioned paths, but we 
can do it for non-partitioned paths only, so it would look like:
   > 
   > * partitioned-path=true:  
`/data/010/001/100/01010101001/key=val/file-name.parquet`
   > * partitioned-path=false: `/data/010/001/100/01010101001-file-name.parquet`
   
   I like removing the additional path when `partitioned-path=false` (that's a 
great idea).
   
   I'm just wondering about where we want to put the slashes in the bit 
field.  Breaking it up by three bits means that each recursive listing 
operates on only 8 subpaths, which feels too small.  I feel like we might want to 
move to four bits (i.e. `/data/0100/0110/0010/101010010110`).  This leaves the 
leaf paths with a reasonable `2^12` paths and each parent with 16, which feels 
like enough to parallelize at some level.
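
To make the four-bit proposal concrete, here is a rough Python sketch of how a bit string could be cut into three 4-bit directories plus a remainder. The helper names `split_entropy` and `data_path` are hypothetical and are not taken from the PR; the sketch also assumes the entropy bits are already computed (it does not pick a hash function) and only mirrors the two layouts discussed above.

```python
from typing import List, Optional


def split_entropy(bits: str, group_bits: int = 4, depth: int = 3) -> List[str]:
    """Cut a bit string into `depth` groups of `group_bits`, plus the remainder."""
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")
    if len(bits) <= group_bits * depth:
        raise ValueError("not enough bits to leave a remainder segment")
    parents = [bits[i * group_bits:(i + 1) * group_bits] for i in range(depth)]
    return parents + [bits[group_bits * depth:]]


def data_path(bits: str, file_name: str, partition: Optional[str] = None) -> str:
    """Build a path for the partitioned and non-partitioned layouts."""
    *dirs, remainder = split_entropy(bits)
    prefix = "/data/" + "/".join(dirs)
    if partition is None:
        # partitioned-path=false: fold the remaining entropy into the file name
        return f"{prefix}/{remainder}-{file_name}"
    # partitioned-path=true: the remainder stays a directory above the partition
    return f"{prefix}/{remainder}/{partition}/{file_name}"


bits = "010001100010101010010110"  # 4 + 4 + 4 + 12 bits, as in the example above
print(data_path(bits, "file-name.parquet"))
print(data_path(bits, "file-name.parquet", partition="key=val"))
print(2 ** 4, 2 ** 12)  # fan-out per parent directory, number of leaf values
```

With this split, each of the three parent levels has 16 children and the 12-bit remainder takes one of 4096 values, which matches the numbers above.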


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

