C-Loftus opened a new issue, #1080:
URL: https://github.com/apache/iceberg-go/issues/1080

   ### Question
   
   Thank you for the great work on iceberg-go!
   
   ## Question
   
   Are there any best practices for writing string columns that have high 
cardinality but a finite set of values (e.g. to improve Parquet row group 
statistics and scan performance)?
   
   ## Context
   
   I want to write a column like `project_identifier` via iceberg-go that has 
many string values like `foo_123`, `bar_123`, `foo_baz_123`, `foo_bar_123`, 
`foo_test_123`, etc. This is a finite set, but it is high cardinality (say 
1000+ values).
   
   I don't want to transform the data and separate it into more tables if 
possible.
   
   Since these are all string values, scans can be rather slow (as I saw with 
`DELETE`s in https://github.com/apache/iceberg-go/issues/1077) because, to my 
understanding, strings provide less useful row group statistics. I was thinking 
that if the underlying Parquet files were partitioned by the value of 
`project_identifier` (for instance, Parquet file 1 contains all rows with 
`foo_*` and Parquet file 2 contains all rows with `bar_*`), then the row group 
statistics would be much more useful.
   
   However, I am unclear on how to accomplish this (e.g. is it possible to 
partition on a substring?) and on how 
https://github.com/apache/iceberg-go/pull/931 might affect this once dictionary 
encoding is added.
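
   To make the idea concrete, here is a minimal sketch in plain Go of the 
effect I am after. It does not call any iceberg-go API: `prefixKey`, 
`groupByPrefix`, `stats`, and `mayContain` are hypothetical helpers, and the 
3-byte prefix width is an assumption chosen to fit the sample values. It only 
illustrates how grouping rows by a prefix narrows the min/max range a reader 
can use to skip files.

```go
package main

import (
	"fmt"
	"sort"
)

// prefixKey is a hypothetical stand-in for a prefix-based partition key.
// It keeps the first width bytes of s (ASCII identifiers are assumed).
func prefixKey(s string, width int) string {
	if len(s) <= width {
		return s
	}
	return s[:width]
}

// groupByPrefix assigns each value to the "file" named by its prefix.
func groupByPrefix(values []string, width int) map[string][]string {
	groups := make(map[string][]string)
	for _, v := range values {
		k := prefixKey(v, width)
		groups[k] = append(groups[k], v)
	}
	return groups
}

// minMax mimics the per-column min/max statistics kept for a file or row group.
type minMax struct{ Min, Max string }

// stats computes min/max over a non-empty slice without modifying it.
func stats(values []string) minMax {
	sorted := append([]string(nil), values...)
	sort.Strings(sorted)
	return minMax{Min: sorted[0], Max: sorted[len(sorted)-1]}
}

// mayContain reports whether a file with these statistics could hold v.
// A false result means the file can be skipped for an equality filter on v.
func (m minMax) mayContain(v string) bool {
	return m.Min <= v && v <= m.Max
}

func main() {
	ids := []string{"foo_123", "bar_123", "foo_baz_123", "foo_bar_123", "foo_test_123"}
	target := "bar_123"

	// One unpartitioned file: min/max span the whole value set.
	all := stats(ids)
	fmt.Printf("single file [%s, %s] must scan for %s: %v\n",
		all.Min, all.Max, target, all.mayContain(target))

	// One file per 3-byte prefix: each file gets a much tighter range.
	groups := groupByPrefix(ids, 3)
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for stable output
	for _, k := range keys {
		st := stats(groups[k])
		fmt.Printf("file %q [%s, %s] must scan for %s: %v\n",
			k, st.Min, st.Max, target, st.mayContain(target))
	}
}
```

   With the sample values, a filter on `bar_123` would only need the `bar` 
file in the grouped layout, whereas the single-file layout has to be read in 
full. My question is how to get an equivalent layout through iceberg-go.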
   
   Thank you very much


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
