Re: [PR] [opt](variant) add column cache for variant sparse column [doris]

via GitHub Wed, 24 Sep 2025 02:53:53 -0700


eldenmoon commented on code in PR #56159:
URL: https://github.com/apache/doris/pull/56159#discussion_r2371050295



##########
be/src/olap/rowset/segment_v2/segment.cpp:
##########
@@ -720,9 +720,10 @@ Status Segment::new_default_iterator(const TabletColumn& tablet_column,
 // in the new schema column c's cid == 2
 // but in the old schema column b's cid == 2
 // but they are not the same column
-Status Segment::new_column_iterator(const TabletColumn& tablet_column,
-                                    std::unique_ptr<ColumnIterator>* iter,
-                                    const StorageReadOptions* opt) {
+Status Segment::new_column_iterator(
+        const TabletColumn& tablet_column, std::unique_ptr<ColumnIterator>* iter,
+        const StorageReadOptions* opt,
+        std::unordered_map<int32_t, PathToSparseColumnCacheUPtr>* variant_sparse_column_cache) {

Review Comment:
   const std::unordered_map<int32_t, PathToSparseColumnCacheUPtr>&



##########
be/src/olap/rowset/segment_v2/segment.cpp:
##########
@@ -744,9 +745,20 @@ Status Segment::new_column_iterator(const TabletColumn& tablet_column,
     }
     if (reader->get_meta_type() == FieldType::OLAP_FIELD_TYPE_VARIANT) {
         // use _column_reader_cache to get variant subcolumn(path column) reader
-        RETURN_IF_ERROR(
-                assert_cast<VariantColumnReader*>(reader.get())
-                        ->new_iterator(iter, &tablet_column, opt, _column_reader_cache.get()));
+        PathToSparseColumnCacheUPtr* sparse_column_cache_ptr = nullptr;
+        if (variant_sparse_column_cache && !variant_sparse_column_cache->contains(unique_id)) {

Review Comment:
   This should not be initialized here; it should be initialized in SegmentIterator::init.
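
   A minimal sketch of the shape this comment points toward, not code from the PR: the owner fills the map once in an init step, and the iterator factory only looks entries up. `PathCacheSketch`, `init_variant_caches`, and `find_variant_cache` are hypothetical stand-ins for `PathToSparseColumnCache`, `SegmentIterator::init`, and the lookup inside `Segment::new_column_iterator`.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for PathToSparseColumnCache; the real type is defined in the PR.
struct PathCacheSketch {};
using PathCacheSketchUPtr = std::unique_ptr<PathCacheSketch>;
using CacheMap = std::unordered_map<int32_t, PathCacheSketchUPtr>;

// Stand-in for the init step: create one cache entry per variant column unique id up front.
inline CacheMap init_variant_caches(const std::vector<int32_t>& variant_column_uids) {
    CacheMap caches;
    for (int32_t uid : variant_column_uids) {
        // try_emplace leaves an existing entry untouched if the uid repeats.
        caches.try_emplace(uid, std::make_unique<PathCacheSketch>());
    }
    return caches;
}

// Stand-in for the iterator factory: it never inserts, so a const reference is enough.
inline PathCacheSketch* find_variant_cache(const CacheMap& caches, int32_t uid) {
    auto it = caches.find(uid);
    return it == caches.end() ? nullptr : it->second.get();
}
```

   With the entries created up front, the factory has no reason to mutate the map, which is what would let it take the map by const reference.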



##########
be/src/olap/rowset/segment_v2/variant/variant_column_reader.h:
##########
@@ -48,6 +48,120 @@ class InvertedIndexIterator;
 class InvertedIndexFileReader;
 class ColumnReaderCache;
 
+/**
+ * SparseColumnCache provides a caching layer for sparse column data access.
+ * 
+ * The "shared" aspect refers to the ability to share cached column data between
+ * multiple iterators or readers that access the same column (SPARSE_COLUMN_PATH). This reduces
+ * redundant I/O operations and memory usage when multiple consumers need the
+ * same column data.
+ * 
+ * Key features:
+ * - Caches column data after reading to avoid repeated I/O
+ * - Maintains state to track the current data validity
+ * - Supports both sequential (next_batch) and random (read_by_rowids) access patterns
+ * - Optimizes performance by reusing cached data when possible
+ * 
+ * The cache operates in different states:
+ * - INVALID: Cache is uninitialized
+ * - INITED: Iterator is initialized but no data cached
+ * - SEEKED_NEXT_BATCHED: Data cached from sequential read
+ * - READ_BY_ROWIDS: Data cached from random access read
+ */
+struct SparseColumnCache {
+    const ColumnIteratorUPtr sparse_column_iterator = nullptr;
+    vectorized::MutableColumnPtr sparse_column = nullptr;
+
+    enum class State : uint8_t {
+        INVALID = 0,
+        INITED = 1,
+        SEEKED_NEXT_BATCHED = 2,
+        READ_BY_ROWIDS = 3,
+    };
+    State state = State::INVALID;
+
+    ordinal_t offset = 0;              // Current offset position for sequential reads
+    std::unique_ptr<rowid_t[]> rowids; // Cached row IDs for random access reads

Review Comment:
   Why keep offset, rowids, length, and state all at the same time?
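
   One possible reading of this question, sketched under assumptions and not taken from the PR: the four fields together describe "which request is currently cached", so they could be folded into a single value whose active alternative plays the role of `state`. Every name below (`CachedRequest`, `SequentialBatch`, `RowIdBatch`, `SparseColumnCacheSketch`) is hypothetical, and the typedefs are stand-ins for the Doris ones.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the Doris typedefs; not the real definitions.
using ordinal_t = uint64_t;
using rowid_t = uint32_t;

// Nothing has been read into the cache yet.
struct NothingCached {
    bool operator==(const NothingCached&) const { return true; }
};

// The cached data came from a sequential read of `length` rows starting at `offset`.
struct SequentialBatch {
    ordinal_t offset;
    size_t length;
    bool operator==(const SequentialBatch& o) const {
        return offset == o.offset && length == o.length;
    }
};

// The cached data came from a random-access read of exactly these row ids.
struct RowIdBatch {
    std::vector<rowid_t> rowids;
    bool operator==(const RowIdBatch& o) const { return rowids == o.rowids; }
};

// The active alternative replaces a separate state enum.
using CachedRequest = std::variant<NothingCached, SequentialBatch, RowIdBatch>;

struct SparseColumnCacheSketch {
    CachedRequest cached = NothingCached {};

    // A request hits only if the same kind of read with the same arguments is cached.
    bool hit(const CachedRequest& request) const {
        if (std::holds_alternative<NothingCached>(request)) {
            return false;
        }
        return cached == request;
    }

    // Record which request the cached column data now corresponds to.
    void remember(CachedRequest request) { cached = std::move(request); }
};
```

   In this sketch a sequential read and a random-access read can never be confused, because comparing two variants with different active alternatives yields false.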



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

