gaodayue edited a comment on issue #2953: [segment_v2] Switch to Unified and 
Extensible Page Format
URL: https://github.com/apache/incubator-doris/pull/2953#issuecomment-590273942
 
 
   @chaoyli answers inline
   
   > The bitmap index stores the dict column and bitmaps across four pages, so every read takes four I/Os, which may be costly. I think it would be better for dict_column to store a bitmap PagePointer, like this:
   DictColumnValueIndex : val -> dict_column_page_pointer
   DictColumnPage : dict_column|bitmap_page_pointer
   BitmapPage : RoaringBitmap
   which would save one I/O operation
   
   Storing bitmap_page_pointer in DictColumnPage has several drawbacks:
   1. Binary search inside DictColumnPage becomes more complicated because we now need to separate `dict_column` from `bitmap_page_pointer`.
   2. Storage size is not necessarily reduced, because we would now store a bitmap page pointer for each dictionary item, whereas previously we only stored one pointer per bitmap page.
   3. The implementation is more complicated because it tightly couples DictColumn with BitmapColumn, and we lose the benefits of the IndexedColumn abstraction.
   
   > I think it may be better to leave the ZoneMap and OrdinalIndex Reader/Writer logic as it is.
   > Firstly, ZoneMap and OrdinalIndex are simple and may not need the more complicated IndexedColumnWriter/Reader logic.
   > Secondly, IndexedColumnWriter would then have to contain every index writer if we add a BTree index in the future.
   > Thirdly, with the above optimization, ZoneMap and OrdinalIndex would also not be a good fit.
   
   The problem with implementing all kinds of indexes from scratch instead of reusing existing abstractions is lower code reusability and higher long-term maintenance cost. The nice thing about `IndexedColumn` is that it can serve as the building block for all kinds of data and indexes, leading to a more layered system. As for your worries about the cost of using a BTree index for ZoneMap, I think IndexedColumn can support both single-level and multi-level indexes in the future.
   
   > I found that the default encoding of VARCHAR and CHAR is dictionary encoding without a fallback policy, which may waste space when cardinality is high. And if we wanted to change it, we would have to rebuild all of the data.
   
   Actually, the current implementation of `BinaryDictPageBuilder` automatically falls back to plain encoding when it finds that the cardinality is high and the dictionary page grows too big.
   
