buchireddy commented on issue #4317: Support variable length Offline Dictionary 
Indexes for bytes, strings and maps to save on storage
URL: 
https://github.com/apache/incubator-pinot/issues/4317#issuecomment-502254249
 
 
   I've implemented a solution based on the approach discussed in the 
description and ran some benchmarks with a **String dictionary** to compare the 
lookup latency of the VariableLength dictionary, and the storage savings it 
brings, against the FixedLength dictionary. Here are the results and observations.
   
   **Time taken to lookup 10M values:**
   <img width="882" alt="TimeVsDictionarySizesCharts" 
src="https://user-images.githubusercontent.com/945283/59487242-edc20280-8e30-11e9-92d7-1891379d4639.png">
   **Observation:** As the graph shows, the variable length dictionary gives 
much better lookup latencies as the strings get bigger. However, when the 
dictionary size (cardinality) is >1M and the string sizes are small (<100), the 
FixedLength dictionary has better lookup latencies. 
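
To make the two access patterns concrete, here is a minimal sketch of reading an entry by index from each kind of dictionary. It is not the benchmarked implementation: the class name `DictionaryLookupSketch`, the zero padding byte, and the layout of the variable length buffer (a header of 4-byte offsets followed by the unpadded bytes) are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and layouts are assumptions, not Pinot's classes.
final class DictionaryLookupSketch {
  static final byte PAD = 0; // assumed padding byte for the fixed-length layout

  // Fixed-length layout: entry i occupies bytes [i * width, (i + 1) * width).
  // Assumes width >= the longest encoded value.
  static ByteBuffer buildFixed(String[] values, int width) {
    ByteBuffer buf = ByteBuffer.allocate(width * values.length); // zero-filled
    for (int i = 0; i < values.length; i++) {
      buf.position(i * width);
      buf.put(values[i].getBytes(StandardCharsets.UTF_8));
    }
    return buf;
  }

  // Reads the whole padded slot, then strips the trailing padding.
  static String getFixed(ByteBuffer buf, int width, int index) {
    byte[] slot = new byte[width];
    for (int i = 0; i < width; i++) {
      slot[i] = buf.get(index * width + i);
    }
    int len = width;
    while (len > 0 && slot[len - 1] == PAD) {
      len--;
    }
    return new String(slot, 0, len, StandardCharsets.UTF_8);
  }

  // Assumed variable-length layout: (n + 1) int offsets, then the raw bytes.
  static ByteBuffer buildVar(String[] values) {
    int n = values.length;
    byte[][] encoded = new byte[n][];
    int payload = 0;
    for (int i = 0; i < n; i++) {
      encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
      payload += encoded[i].length;
    }
    int header = 4 * (n + 1);
    ByteBuffer buf = ByteBuffer.allocate(header + payload);
    int offset = header;
    for (int i = 0; i < n; i++) {
      buf.putInt(4 * i, offset);   // start offset of entry i
      buf.position(offset);
      buf.put(encoded[i]);
      offset += encoded[i].length;
    }
    buf.putInt(4 * n, offset);     // end offset of the last entry
    return buf;
  }

  // Reads exactly the bytes between two consecutive offsets; no padding to strip.
  static String getVar(ByteBuffer buf, int index) {
    int start = buf.getInt(4 * index);
    int end = buf.getInt(4 * (index + 1));
    byte[] bytes = new byte[end - start];
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = buf.get(start + i);
    }
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String[] values = {"apple", "fig", "watermelon"};
    ByteBuffer fixed = buildFixed(values, 10);
    ByteBuffer varLen = buildVar(values);
    System.out.println(getFixed(fixed, 10, 1));
    System.out.println(getVar(varLen, 1));
  }
}
```

In this sketch the fixed-length read always touches `width` bytes per entry, while the variable length read pays for two offset reads and then touches only the bytes of the value itself.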
   
   **Storage requirements of VarLength dictionary:**
   Since the variable length dictionary doesn't do any padding, it saves 
space in every case where the strings in the dictionary aren't of equal 
length. Hence, this graph plots the % storage savings with the VarLength 
dictionary instead of absolute values.
   <img width="612" alt="DictSizeVsStorageSavingsChart" 
src="https://user-images.githubusercontent.com/945283/59487464-9ff9ca00-8e31-11e9-9e60-13228766662e.png">
   **Observation:** If the strings in the dictionary are of different lengths, 
VarLength dictionary saves 40% space compared to the fixed length dictionary. 
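
As a rough illustration of where the savings come from, the sketch below computes the size of both layouts for a list of strings. It is only a back-of-the-envelope model, not the code used for the benchmark: the class name `DictionarySizeSketch` is hypothetical, and it assumes the variable length layout stores one 4-byte offset per entry (plus one end offset) ahead of the unpadded bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch; the layouts modelled here are assumptions.
final class DictionarySizeSketch {
  // Fixed-length layout: every entry is padded to the longest value's byte length.
  static long fixedLengthBytes(List<String> values) {
    int maxLen = 0;
    for (String v : values) {
      maxLen = Math.max(maxLen, v.getBytes(StandardCharsets.UTF_8).length);
    }
    return (long) maxLen * values.size();
  }

  // Assumed variable-length layout: (n + 1) 4-byte offsets plus the unpadded bytes.
  static long varLengthBytes(List<String> values) {
    long payload = 0;
    for (String v : values) {
      payload += v.getBytes(StandardCharsets.UTF_8).length;
    }
    return 4L * (values.size() + 1) + payload;
  }

  // Percentage of the fixed-length size saved by the variable-length layout.
  static double savingsPercent(List<String> values) {
    long fixed = fixedLengthBytes(values);
    if (fixed == 0) {
      return 0.0;
    }
    return 100.0 * (fixed - varLengthBytes(values)) / fixed;
  }

  public static void main(String[] args) {
    List<String> values = List.of("a", "bb", "c".repeat(20));
    System.out.println("fixed bytes:    " + fixedLengthBytes(values));
    System.out.println("variable bytes: " + varLengthBytes(values));
    System.out.println("savings %:      " + savingsPercent(values));
  }
}
```

Under this assumed layout the offsets are pure overhead, so the savings grow with the spread between the longest string and the typical string, and can shrink or vanish when every value is very short.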
   
   Again thanks @kishoreg for all the guidance on this.
   
   P.S: All raw values from the benchmarking are available at 
https://docs.google.com/spreadsheets/d/1iOLyhD4AUZw3JsdOkmH6h36KWYalVeUBIVcty6Pnv0E/edit?usp=sharing
 so feel free to copy/comment on the results.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
