[GitHub] [doris] yiguolei commented on a diff in pull request #10420: [improvement]Do not lazily read dict encoded columns

GitBox Sat, 25 Jun 2022 04:04:59 -0700


yiguolei commented on code in PR #10420:
URL: https://github.com/apache/doris/pull/10420#discussion_r906669019



##########
be/src/olap/rowset/segment_v2/segment_iterator.cpp:
##########
@@ -704,61 +705,48 @@ void SegmentIterator::_vec_init_lazy_materialization() {
     //   When output block to query layer, delete column can be skipped.
     //  _schema.column_ids() stands for storage layer block schema, so it contains delete columnid
     //  we just regard delete column as common pred column here.
-    if (_schema.column_ids().size() > pred_column_ids.size()) {
-        for (auto cid : _schema.column_ids()) {
-            if (!_is_pred_column[cid]) {
-                _non_predicate_columns.push_back(cid);
-                FieldType type = _schema.column(cid)->type();
-
-                // todo(wb) maybe we can make read char type faster
-                // todo(wb) support map/array type
-                // todo(wb) consider multiple integer columns cost, such as 1000 columns, maybe lazy materialization faster
-                if (!_lazy_materialization_read &&
-                    (_is_need_vec_eval ||
-                     _is_need_short_eval) && // only when pred exists, we need to consider lazy materialization
-                    (type == OLAP_FIELD_TYPE_HLL || type == OLAP_FIELD_TYPE_OBJECT ||
-                     type == OLAP_FIELD_TYPE_VARCHAR || type == OLAP_FIELD_TYPE_CHAR ||
-                     type == OLAP_FIELD_TYPE_STRING || type == OLAP_FIELD_TYPE_BOOL ||
-                     type == OLAP_FIELD_TYPE_DATE || type == OLAP_FIELD_TYPE_DATETIME ||
-                     type == OLAP_FIELD_TYPE_DECIMAL)) {
-                    _lazy_materialization_read = true;
+    for (size_t i = 0; i < _schema.num_column_ids(); ++i) {
+        auto cid = _schema.column_id(i);
+        FieldType type = _schema.column(cid)->type();
+        if (!_is_pred_column[cid]) {
+            _non_predicate_columns.emplace_back(cid);
+            switch (type) {
+            case OLAP_FIELD_TYPE_VARCHAR:
+            case OLAP_FIELD_TYPE_CHAR:
+            case OLAP_FIELD_TYPE_STRING: {
+                // if a string column is all dict encoding in one segment, it's almost same as
+                // an int32_t column, it can be read together with predicate columns.
+                if (config::enable_low_cardinality_optimize &&
+                    _column_iterators[cid]->is_all_dict_encoding()) {
+                    pred_column_ids.insert(cid);
+                    _is_pred_column[cid] = true;
+                } else {
+                    lazy_read_column_ids.insert(cid);
                 }
+                break;
             }
-        }
-    }
-
-    // Step 3: fill column ids for read and output
-    if (_lazy_materialization_read) {
-        // insert pred cid to first_read_columns
-        for (auto cid : pred_column_ids) {
-            _first_read_column_ids.push_back(cid);
-        }
-    } else if (!_is_need_vec_eval &&
-               !_is_need_short_eval) { // no pred exists, just read and output column
-        for (int i = 0; i < _schema.num_column_ids(); i++) {
-            auto cid = _schema.column_id(i);
-            _first_read_column_ids.push_back(cid);
-        }
-    } else { // pred exits, but we can eliminate lazy materialization
-        // insert pred/non-pred cid to first read columns
-        std::set<ColumnId> pred_id_set;
-        pred_id_set.insert(_short_cir_pred_column_ids.begin(), _short_cir_pred_column_ids.end());
-        pred_id_set.insert(_vec_pred_column_ids.begin(), _vec_pred_column_ids.end());
-        std::set<ColumnId> non_pred_set(_non_predicate_columns.begin(),
-                                        _non_predicate_columns.end());
-
-        for (int i = 0; i < _schema.num_column_ids(); i++) {
-            auto cid = _schema.column_id(i);
-            if (pred_id_set.find(cid) != pred_id_set.end()) {
-                _first_read_column_ids.push_back(cid);
-            } else if (non_pred_set.find(cid) != non_pred_set.end()) {
-                _first_read_column_ids.push_back(cid);
-                // when _lazy_materialization_read = false, non-predicate column should also be filtered by sel idx, so we regard it as pred columns
+            case OLAP_FIELD_TYPE_HLL:
+            case OLAP_FIELD_TYPE_OBJECT:
+            case OLAP_FIELD_TYPE_BOOL:
+            case OLAP_FIELD_TYPE_DATE:
+            case OLAP_FIELD_TYPE_DATETIME:
+            case OLAP_FIELD_TYPE_DECIMAL:
+                lazy_read_column_ids.insert(cid);
+                break;
+            default:
+                pred_column_ids.insert(cid);
                 _is_pred_column[cid] = true;
+                break;
             }
         }
     }
 
+    _lazy_materialization_read = !lazy_read_column_ids.empty();
+    _first_read_column_ids.assign(pred_column_ids.begin(), pred_column_ids.end());
+    if (_lazy_materialization_read && (_is_need_vec_eval || _is_need_short_eval)) {
+        _non_predicate_columns.assign(lazy_read_column_ids.begin(), lazy_read_column_ids.end());

Review Comment:
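   Previously, the segment iterator read _first_read_column_ids first and then
   _non_predicate_columns, which was safe because the predicate columns were
   exactly the first-read columns. If non-predicate columns that are entirely
   dict encoded can now also be first-read columns, that assumption no longer
   holds and the current read logic should be changed accordingly.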
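
   To make the concern concrete, here is a minimal standalone sketch of the
   classification the new loop appears to perform. It is not the Doris code:
   FieldKind, ColumnInfo, ReadPlan and plan_reads are hypothetical names
   invented for illustration, and the type list is reduced to three kinds.
   The "promoted" set holds non-predicate columns that end up among the
   first-read columns, which is the case the old logic never had to handle.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical, simplified stand-ins for the types used in the diff.
enum class FieldKind { Int, String, Date };

struct ColumnInfo {
    uint32_t cid;
    FieldKind kind;
    bool is_predicate;     // column is referenced by a predicate
    bool all_dict_encoded; // column is entirely dict encoded in this segment
};

struct ReadPlan {
    std::vector<uint32_t> first_read; // read before predicates are evaluated
    std::vector<uint32_t> lazy_read;  // read afterwards, for surviving rows only
    std::set<uint32_t> promoted;      // non-predicate columns placed in first_read
};

// Partition columns into first-read and lazy-read groups.
// low_cardinality_opt plays the role of config::enable_low_cardinality_optimize.
ReadPlan plan_reads(const std::vector<ColumnInfo>& columns, bool low_cardinality_opt) {
    ReadPlan plan;
    for (const auto& col : columns) {
        if (col.is_predicate) {
            plan.first_read.push_back(col.cid);
            continue;
        }
        bool lazy = false;
        switch (col.kind) {
        case FieldKind::String:
            // A fully dict-encoded string column is cheap to read eagerly.
            lazy = !(low_cardinality_opt && col.all_dict_encoded);
            break;
        case FieldKind::Date:
            lazy = true;
            break;
        case FieldKind::Int:
            lazy = false;
            break;
        }
        if (lazy) {
            plan.lazy_read.push_back(col.cid);
        } else {
            plan.first_read.push_back(col.cid);
            plan.promoted.insert(col.cid);
        }
    }
    // Every column lands in exactly one of the two groups.
    assert(plan.first_read.size() + plan.lazy_read.size() == columns.size());
    return plan;
}
```

   Whenever "promoted" is non-empty, first_read is a strict superset of the
   predicate columns, so any later step that assumes the two sets are equal
   would need to be revisited.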


