richardstartin commented on pull request #7661:
URL: https://github.com/apache/pinot/pull/7661#issuecomment-956196251


   > Since the target chunk size is much larger than the header size, I think it should not add much overhead to store long offset and remove the 4G limit for single index. We can also include the uncompressed size in the header in case some compressor does not include the length info in the compressed data.
   
   There are a couple of things here:
   * **Compression metadata** - this was the purpose of #7655: to ensure that all the formats we use have the correct metadata (3/4 already did) and to enforce an upgrade path for `LZ4` when using this chunk format. So there's no need for any per-chunk compression metadata, which factors into the next point.
   * **Offset sizes** - to me, 4GB of compressed chunks feels like a lot. At a compression ratio of 2x, that's 8GB of raw data in a single segment; at 10x (JSON can be amazingly repetitive) it's 40GB. I am aware that in the past 32-bit offsets were shown not to be enough for some use cases, but those were signed offsets, so they only permitted 2GB of compressed data; I am not aware of any evidence that 4GB would not have been enough. Why do I care? It's not for the sake of storage overhead, but because I want to keep the metadata as small as possible in memory for the sake of searching it quickly (see the sketch after this list). Can we frame the discussion in terms of why a user would want/need more than 4GB of compressed data for a raw column in a single segment?
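
   To make the offset-size trade-off concrete, here is a minimal Java sketch (the layout and names here are hypothetical, not Pinot's actual chunk header): offsets stored as unsigned 32-bit ints and read back via `Integer.toUnsignedLong` can address up to 4GB of compressed data while costing 4 bytes per chunk instead of 8, which is what keeps the in-memory offset index small to search.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration, not Pinot's actual header layout: offsets stored as
// unsigned 32-bit ints address up to 4GB of compressed data (vs 2GB for signed
// ints) while costing 4 bytes per chunk instead of the 8 a long offset would.
public class UnsignedOffsetSketch {

  // Read the start offset of the given chunk, treating the stored int as unsigned.
  static long chunkOffset(ByteBuffer offsets, int chunkIndex) {
    return Integer.toUnsignedLong(offsets.getInt(chunkIndex * Integer.BYTES));
  }

  public static void main(String[] args) {
    ByteBuffer offsets = ByteBuffer.allocate(2 * Integer.BYTES);
    // A 3GB offset overflows a signed int but round-trips cleanly as unsigned.
    long threeGb = 3L * 1024 * 1024 * 1024;
    offsets.putInt(0, (int) threeGb);
    offsets.putInt(Integer.BYTES, (int) (threeGb + 123_456));

    System.out.println(chunkOffset(offsets, 0)); // 3221225472
    System.out.println(chunkOffset(offsets, 1)); // 3221348928

    // Metadata footprint for one million chunks: int offsets halve what long
    // offsets would occupy in memory on the read path.
    int numChunks = 1_000_000;
    System.out.println("int offsets:  " + (long) numChunks * Integer.BYTES + " bytes");
    System.out.println("long offsets: " + (long) numChunks * Long.BYTES + " bytes");
  }
}
```

   Whether 4 bytes versus 8 per chunk matters in practice depends on the chunk count per column, which is exactly what the question above is trying to pin down.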


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
