[GitHub] [lucene] maosuhan opened a new issue, #12137: Add compression feature for DocValues format in new Codec

via GitHub Wed, 08 Feb 2023 22:37:34 -0800


maosuhan opened a new issue, #12137:
URL: https://github.com/apache/lucene/issues/12137


   ### Description
   
   We use ES as an OLAP engine in advertising scenarios, where each advertiser 
queries only their own data. We usually use advertiser_id as both the routing 
key and the index sorting key, so reads of docvalues are very dense. We leverage 
Lucene's postings index structure to speed up queries, and the performance meets 
our expectations.
   
   The most common complaint about ES/Lucene is that its disk usage is much 
larger than that of ClickHouse/Doris; in our case, Lucene storage can be 3-4x 
larger.
   
   ClickHouse/Doris use less space because they both compress data in blocks 
and decompress only the needed blocks on read. Since the read density is high, 
the performance is still acceptable.
   
   We have also implemented zstd/lz4 compression for Lucene docvalues; below is 
the storage cost improvement:
   
   name | total size | docvalue size | docvalue compression ratio
   -- | -- | -- | --
   no compression | 485.8g | 394.6g | 100%
   lz4 | 272g | 255.1g | 64.65%
   zstd | 246.5g | 229.5g | 58.16%
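   
   As a quick check of the table, the ratios can be recomputed from the 
reported sizes. The sketch below uses only the numbers above; the variable and 
function names are illustrative, and the total-size ratios are derived here 
rather than taken from the original measurements.

```python
# Sizes in GB, copied from the table above: (total size, docvalue size).
sizes_gb = {
    "no compression": (485.8, 394.6),
    "lz4": (272.0, 255.1),
    "zstd": (246.5, 229.5),
}

baseline_total, baseline_docvalue = sizes_gb["no compression"]


def ratio_percent(size, baseline):
    """Compressed size as a percentage of the uncompressed baseline."""
    return round(100 * size / baseline, 2)


for name, (total, docvalue) in sizes_gb.items():
    print(
        name,
        ratio_percent(docvalue, baseline_docvalue),  # docvalue ratio
        ratio_percent(total, baseline_total),        # total-size ratio (derived)
    )
```

   The docvalue ratios reproduce the 64.65% and 58.16% figures in the table.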
   
   All the docvalues are numeric, and we compress the data in blocks of 4096 
values.
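   
   To illustrate the block scheme, here is a minimal sketch. It is not the 
actual codec code: it uses Python's standard-library zlib as a stand-in for 
lz4/zstd (which need third-party packages), assumes signed 64-bit values, and 
the function names `compress_blocks` and `read_value` are hypothetical.

```python
import struct
import zlib

BLOCK_SIZE = 4096  # values per block, as described above


def compress_blocks(values, block_size=BLOCK_SIZE):
    """Pack signed 64-bit values into fixed-size blocks and compress each one independently."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        raw = struct.pack(f"<{len(chunk)}q", *chunk)  # little-endian int64
        blocks.append(zlib.compress(raw))  # zlib is a stand-in for lz4/zstd
    return blocks


def read_value(blocks, index, block_size=BLOCK_SIZE):
    """Return the value at `index`; the whole containing block is decompressed first."""
    raw = zlib.decompress(blocks[index // block_size])
    (value,) = struct.unpack_from("<q", raw, (index % block_size) * 8)
    return value


# Sorted, repetitive values, similar in spirit to an index sorted by advertiser_id.
values = [i // 100 for i in range(10_000)]
blocks = compress_blocks(values)
print(len(blocks))               # 3 blocks: 4096 + 4096 + 1808 values
print(read_value(blocks, 5000))  # 50
```

   Because each block is compressed independently, a reader only needs to 
decompress the blocks that contain the documents it visits.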
   
   We also ran a high-QPS (4000) load test using our online query set; the pct50 
and pct99 latencies both decreased by 20% to 30%. We were surprised by this 
improvement, and we guess that the read density is the key to the result.
   
   The disadvantage of compression is that it hurts random-read performance a 
lot, because a full block of data must be read even if only 1 byte is needed.
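   
   The difference between dense and random reads can be sketched by counting 
how many blocks must be decompressed to serve the same number of documents. 
This is an illustrative model only: the document counts and ID ranges are made 
up, and `blocks_touched` is a hypothetical helper, not a Lucene API.

```python
BLOCK_SIZE = 4096  # values per block, as described above


def blocks_touched(doc_ids, block_size=BLOCK_SIZE):
    """Number of distinct blocks that must be decompressed to read these doc IDs."""
    return len({doc_id // block_size for doc_id in doc_ids})


num_docs = 1_000_000

# Dense read: 10,000 consecutive documents, as when one advertiser's data
# is contiguous thanks to index sorting.
dense = range(200_000, 210_000)

# Scattered read: 10,000 documents spread evenly over the whole segment.
scattered = range(0, num_docs, 100)

print(blocks_touched(dense))      # 4
print(blocks_touched(scattered))  # 245, i.e. every block in the segment
```

   In this model both queries read 10,000 values, but the scattered one 
decompresses about 60 times as many blocks.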
   
   I suggest that we create a new codec for compression. The purpose of this 
codec is to reduce storage usage while providing adequate, balanced read 
performance in OLAP cases.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


