[GitHub] [pinot] richardstartin commented on issue #7870: Possible storage optimization for MV forward index

GitBox Tue, 19 Jul 2022 23:47:44 -0700


richardstartin commented on issue #7870:
URL: https://github.com/apache/pinot/issues/7870#issuecomment-1189894021


   > Hey @Jackie-Jiang @walterddr
   > 
   > As discussed over Slack, we got the compression results for the actual 
table that ran into this forward index size bloat issue. I've updated the 
document in [this 
section](https://docs.google.com/document/d/1BWtNKvxL1Uaydni_BJCgWN8i9_WeSdgL3Ksh4IpY_K0/edit#heading=h.cq0je3xwcssi).
 The TL;DR is that for the actual table the compression savings are very 
minimal. I updated the 
[recommendations](https://docs.google.com/document/d/1BWtNKvxL1Uaydni_BJCgWN8i9_WeSdgL3Ksh4IpY_K0/edit#heading=h.b4ch3eh9yztq)
 to indicate that for now it does not make sense to try to solve this by 
compressing the data using any of the approaches.
   > 
   > @siddharthteotia and I would like to keep this issue open to explore 
further ideas in the future or perhaps revisit compression with dictionary in 
case we find users who have sufficient repeatability in their data to benefit 
from compression.
   > 
   > Also, as discussed over our call, there may be some value in implementing 
Approach 2 from the proposed approaches to speed up queries 
rather than to save on storage costs (i.e. have a dictionary and store the 
forward index in raw format -> which can help avoid an additional dictionary 
lookup). I had started some work on Approach 2 and have an initial PR before we 
ran these compression experiments. My PR stores the data as raw + compressed in 
the forward index but creates a dictionary (passthrough compression can be 
enabled to avoid decompression overhead). I need to spend some time on seeing 
how best to divide up the PRs before submitting this to OSS. Just wanted to 
give a heads up.
   > 
   > cc @siddharthteotia
   
   I can’t access the document, but was byte alignment (rounding the 
dictionary’s bit width up to the next multiple of 8, so that each 
dictionary-encoded value is padded with leading zeros) attempted prior to LZ4 
compression? If the dictionary codes aren’t byte aligned, the same code lands 
at shifting bit offsets within the byte stream, so byte-oriented compression 
schemes such as LZ4 see few repeated byte sequences and won’t work well. I 
explained this on a call with @siddharthteotia several months ago.
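
   To make the byte-alignment point concrete, here is a minimal sketch in 
Python. It is not Pinot code: the helper names `bit_pack`, `byte_align` and 
`compressed_sizes` are hypothetical, and `zlib` is used only as a stand-in 
byte-oriented compressor because LZ4 is not in the Python standard library. 
It compares the same dictionary ids stored bit-packed versus padded to whole 
bytes before compression.

```python
import zlib


def _check(code, bits):
    # Reject ids that cannot be represented in the given bit width.
    if not 0 <= code < (1 << bits):
        raise ValueError(f"code {code} does not fit in {bits} bits")


def bit_pack(codes, bits):
    """Pack each code into exactly `bits` bits (big-endian), with no padding
    between codes. Only the tail is padded to reach a byte boundary."""
    acc = 0
    for code in codes:
        _check(code, bits)
        acc = (acc << bits) | code
    total_bits = bits * len(codes)
    pad = (-total_bits) % 8  # zero bits appended at the end of the stream
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def byte_align(codes, bits):
    """Round the bit width up to the next multiple of 8 and store each code
    in that many bytes, i.e. pad every code with leading zero bits."""
    width = (bits + 7) // 8
    out = bytearray()
    for code in codes:
        _check(code, bits)
        out += code.to_bytes(width, "big")
    return bytes(out)


def compressed_sizes(codes, bits):
    """Return (bit-packed, byte-aligned) sizes in bytes after compression.
    zlib here is an assumption standing in for LZ4."""
    packed = zlib.compress(bit_pack(codes, bits))
    aligned = zlib.compress(byte_align(codes, bits))
    return len(packed), len(aligned)
```

   Calling `compressed_sizes(dict_ids, 10)` on a column's dictionary ids gives 
the two sizes to compare. Whether the aligned layout ends up smaller depends 
on how repetitive the ids are, so this is only a way to check, not a result.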


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

