github-actions[bot] commented on code in PR #61713:
URL: https://github.com/apache/doris/pull/61713#discussion_r2986642320


##########
be/src/exec/scan/olap_scanner.cpp:
##########
@@ -513,6 +513,14 @@ Status OlapScanner::_init_tablet_reader_params(
         }
     }
 
+    // General limit pushdown for DUP_KEYS and UNIQUE_KEYS with MOW.
+    // Only set when the ORDER BY key topn optimization is NOT active,
+    // since topn already handles its own limit enforcement.
+    // The _limit value comes from the operator's LIMIT clause.
+    if (!_tablet_reader_params.read_orderby_key && _limit > 0) {
+        _tablet_reader_params.general_read_limit = _limit;
+    }
+

Review Comment:
   **[Critical Bug] `_limit` is counted BEFORE post-storage filtering — 
late-arriving runtime filters cause wrong results.**
   
   `_limit` is pushed as `general_read_limit` into VCollectIterator, which 
counts rows returned from the storage layer. However, 
`Scanner::_filter_output_block()` (scanner.cpp:155) applies `_conjuncts` AFTER 
VCollectIterator returns. Late-arriving runtime filters are added to 
`_conjuncts` after tablet reader initialization and are never pushed into 
`remaining_conjunct_roots`. When these filters remove rows, VCollectIterator 
has already counted them, potentially returning EOF before enough post-filter 
rows are produced.
   
   **Concrete scenario:**
   1. Query: `SELECT * FROM dup_table a JOIN dim_table b ON a.key = b.key LIMIT 
10`
   2. `_limit = 10` → `general_read_limit = 10`
   3. Scanner starts reading, VCollectIterator returns 10 raw rows → marks 
itself done (EOF)
   4. A runtime filter from the build side of the hash join arrives late, is 
added to `_conjuncts`
   5. `_filter_output_block` filters 4 rows → only 6 rows survive
   6. Scanner sees EOF from storage → reports EOF to operator with only 6 rows 
instead of 10
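   Steps 1-6 can be reproduced with a self-contained sketch (hypothetical numbers and stand-in function names, not the real `OlapScanner`/`VCollectIterator` API):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the pre-filter accounting described above. Each element of
// `storage` is the raw row count of one block coming out of the storage
// layer; `keep[i]` is how many rows of block i survive a late-arriving
// runtime filter applied (like _filter_output_block()) AFTER the limit check.
int rows_delivered_with_raw_limit(const std::vector<int>& storage,
                                  const std::vector<int>& keep,
                                  int limit) {
    int raw_seen = 0;   // what the iterator counts (pre-filter)
    int delivered = 0;  // what the operator actually receives (post-filter)
    for (std::size_t i = 0; i < storage.size(); ++i) {
        if (raw_seen >= limit) break;  // storage reports EOF early
        int take = std::min(storage[i], limit - raw_seen);
        raw_seen += take;
        // The filter runs only after the limit has already been charged:
        delivered += std::min(keep[i], take);
    }
    return delivered;
}
```

   With two blocks of 10 raw rows each and a filter keeping 6 rows per block, a `LIMIT 10` query receives only 6 rows even though 12 qualifying rows exist.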
   
   **Suggested fix:** Either (a) move `_conjuncts` into `filter_block_conjuncts` (as the topn path does at olap_scanner.cpp:500-504) so they are evaluated inside the storage iterator before rows are counted, (b) enable `general_read_limit` only when `_conjuncts` is empty and all runtime filters have arrived, or (c) treat the pushed-down limit as a soft hint rather than a hard EOF trigger, so post-storage filtering cannot starve the operator.
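   Option (b) could be as small as a guard in `_init_tablet_reader_params`. The sketch below uses hypothetical parameter names as stand-ins for `_conjuncts` and the runtime-filter arrival state; it is not the actual Doris wiring:

```cpp
#include <cstddef>
#include <cstdint>

// Hedged sketch of fix option (b): only push the limit down when raw-row
// counting is guaranteed to equal post-filter counting. All names here are
// illustrative stand-ins, not real Doris members.
int64_t decide_general_read_limit(bool read_orderby_key, int64_t limit,
                                  std::size_t num_unpushed_conjuncts,
                                  bool runtime_filters_all_arrived) {
    bool safe = num_unpushed_conjuncts == 0 && runtime_filters_all_arrived;
    if (!read_orderby_key && limit > 0 && safe) {
        return limit;  // safe: no post-storage filtering can shrink the result
    }
    return -1;  // keep the limit out of storage (sentinel matching the field default)
}
```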



##########
be/src/storage/iterators.h:
##########
@@ -152,6 +152,10 @@ class StorageReadOptions {
     std::unordered_map<int32_t, ColumnPtr> sparse_column_cache;
 
     uint64_t condition_cache_digest = 0;
+
+    // General limit pushdown for DUP_KEYS and UNIQUE_KEYS with MOW.
+    // Propagated from RowsetReaderContext.general_read_limit.
+    int64_t general_read_limit = -1;
 };
 
 struct CompactionSampleInfo {

Review Comment:
   **[Dead Code]** This field is written by 
`BetaRowsetReader::get_segment_iterators()` but never read by `SegmentIterator` 
or any other consumer. `VCollectIterator` reads `general_read_limit` from 
`_reader->_reader_context` (the `RowsetReaderContext`), not from 
`StorageReadOptions`. This field and the corresponding assignment in 
`beta_rowset_reader.cpp` should be removed to avoid confusion.



##########
be/src/storage/iterator/vcollect_iterator.cpp:
##########
@@ -248,8 +258,31 @@ Status VCollectIterator::next(Block* block) {
         return _topn_next(block);
     }
 
+    // Fast path: if general limit already reached, return EOF immediately
+    if (_general_read_limit > 0 && _general_rows_returned >= _general_read_limit) {
+        return Status::Error<END_OF_FILE>("");
+    }
+
     if (LIKELY(_inner_iter)) {
-        return _inner_iter->next(block);
+        auto st = _inner_iter->next(block);
+        if (UNLIKELY(!st.ok())) {
+            return st;
+        }
+
+        // Enforce general read limit: truncate block if needed
+        if (_general_read_limit > 0) {
+            _general_rows_returned += block->rows();
+            if (_general_rows_returned > _general_read_limit) {
+                // Truncate block to return exactly the remaining rows needed

Review Comment:
   **[Correctness concern]** `block->rows()` here counts rows BEFORE 
`Scanner::_filter_output_block()` applies `_conjuncts`. If any `_conjuncts` 
exist that are NOT also evaluated inside the storage layer (e.g., late-arriving 
runtime filters), this count is inflated. The limit will be reached 
prematurely, causing the scanner to return fewer rows than requested.
   
   For comparison, the existing topn path moves `_conjuncts` into 
`filter_block_conjuncts` (olap_scanner.cpp:500-504) and evaluates them inside 
`_topn_next` (vcollect_iterator.cpp:359-360), so its row counting is 
post-filter and correct.
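   For contrast, counting survivors only after the conjuncts have been evaluated inside the iteration loop (which is what the topn path achieves via `filter_block_conjuncts`) keeps the limit honest. A minimal stand-alone sketch with illustrative stand-ins, not the real API:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Post-filter accounting: evaluate the filters inside the iteration loop,
// then charge only surviving rows against the limit. `keep[i]` is the number
// of rows of block i that pass the conjuncts; `storage[i]` is its raw size.
int rows_delivered_with_postfilter_limit(const std::vector<int>& storage,
                                         const std::vector<int>& keep,
                                         int limit) {
    int delivered = 0;
    for (std::size_t i = 0; i < storage.size() && delivered < limit; ++i) {
        int survivors = std::min(keep[i], storage[i]);
        delivered += std::min(survivors, limit - delivered);
    }
    return delivered;
}
```

   With the same scenario as above (two blocks of 10 raw rows, 6 survivors each, `LIMIT 10`), this version keeps reading past the first block and delivers the full 10 rows.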



##########
be/src/storage/rowset/beta_rowset_reader.cpp:
##########
@@ -116,6 +116,7 @@ Status BetaRowsetReader::get_segment_iterators(RowsetReaderContext* read_context
     _read_options.version = _rowset->version();
     _read_options.tablet_id = _rowset->rowset_meta()->tablet_id();
     _read_options.topn_limit = _topn_limit;
+    _read_options.general_read_limit = _read_context->general_read_limit;
     if (_read_context->lower_bound_keys != nullptr) {
         for (int i = 0; i < _read_context->lower_bound_keys->size(); ++i) {
             _read_options.key_ranges.emplace_back(&_read_context->lower_bound_keys->at(i),

Review Comment:
   **[Dead Code]** This assignment propagates `general_read_limit` into 
`_read_options` (StorageReadOptions), but no downstream consumer 
(SegmentIterator) ever reads it. The limit is consumed solely by 
`VCollectIterator` from `RowsetReaderContext`. This line can be removed.



##########
be/src/storage/iterator/vcollect_iterator.cpp:
##########
@@ -94,6 +94,16 @@ void VCollectIterator::init(TabletReader* reader, bool ori_data_overlapping, boo
         _topn_limit = 0;
         DCHECK_EQ(_reader->_reader_context.filter_block_conjuncts.size(), 0);
     }
+
+    // General limit pushdown: only for READER_QUERY on non-merge path
+    // (DUP_KEYS or UNIQUE_KEYS with MOW)
+    if (!_merge && _reader->_reader_type == ReaderType::READER_QUERY &&
+        _reader->_reader_context.general_read_limit > 0 &&
+        (_reader->_tablet->keys_type() == KeysType::DUP_KEYS ||
+         (_reader->_tablet->keys_type() == KeysType::UNIQUE_KEYS &&

Review Comment:
   The DUP_KEYS / UNIQUE_KEYS-with-MOW guard here is redundant with the `!_merge` check: on the query path, `_merge` is set to `false` only for exactly these table types (vcollect_iterator.cpp:72-76). The explicit keys_type checks add safety but also a maintenance burden, since they could silently go stale if the `_merge` logic ever changes. Consider either adding a comment noting that the redundancy is intentional, or relying solely on `!_merge`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

