airborne12 opened a new pull request, #63692:
URL: https://github.com/apache/doris/pull/63692

   ## Proposed changes
   
   Issue Number: close #N/A (Jira DORIS-25510)
   
   ### What problem does this PR solve?
   
   When a variant column has a parent INVERTED index with a parser, and a 
sub-column is materialized in some segment as a non-string value (e.g. `{"c": 
false}`), `variant_util::inherit_index` calls `remove_parser_and_analyzer()` 
and writes a BKD/numeric index for that sub-column. The on-disk entry for 
`(parent_index_id, "<sub>")` therefore exists but is **not** a Lucene fulltext 
segment.
   
   `MatchPredicateCollector::collect` (called from BM25 stats collection in 
`OlapScanner::_prepare_impl`) does not have segment context, so when the 
predicate references a variant sub-column it clones the parent fulltext index 
meta and sets the sub-column path as suffix. In segments where the sub-column 
happens to be non-string, `IndexFileReader::open(...)` returns a valid 
`DorisCompoundReader` pointing at the BKD entry, and 
`lucene::index::IndexReader::open(compound_reader.get())` throws 
`CLuceneError("No segments* file found in DorisCompoundReader@...")`.
   
   That `CLuceneError` (derives from `std::exception`, not `doris::Exception`) 
escapes `CollectionStatistics::process_segment`, bubbles through `collect()` 
and `OlapScanner::_prepare_impl`, and the `ASSIGN_STATUS_IF_CATCH_EXCEPTION` 
wrapper in `scanner_scheduler.cpp` only catches `doris::Exception` — so the BE 
SIGABRTs during scanner prepare.
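
   The escape path described above can be illustrated with a minimal, self-contained sketch. The names `EngineException`, `LibraryError`, and `run_guarded` are hypothetical stand-ins chosen for this illustration; they are not the real `doris::Exception`, `CLuceneError`, or `ASSIGN_STATUS_IF_CATCH_EXCEPTION`. The sketch only shows the general C++ behavior: a handler for one exception type does not intercept an unrelated type, and an exception with no handler anywhere up the stack ends in `std::terminate`, which aborts the process.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the engine's own exception type
// (plays the role of doris::Exception in the description above).
struct EngineException : std::exception {
    const char* what() const noexcept override { return "engine error"; }
};

// Hypothetical stand-in for a third-party library error
// (plays the role of CLuceneError): unrelated to EngineException.
struct LibraryError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Mimics a wrapper that converts only the engine's exception type
// into a status string. Any other exception type propagates to the caller.
template <typename Fn>
std::string run_guarded(Fn&& fn) {
    try {
        fn();
        return "OK";
    } catch (const EngineException& e) {
        return std::string("ERROR: ") + e.what();
    }
    // No catch (...) here: a LibraryError passes straight through.
    // If nothing further up the stack handles it, std::terminate runs.
}
```

   In this sketch, calling `run_guarded` with a callable that throws `LibraryError` lets the exception leave the wrapper, which is the same shape as the failure described above.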
   
   Minimal reproducer (from DORIS-25510):
   
   ```sql
   create table t (
       `id` int(11) NULL,
       `v` variant NULL,
       INDEX idx_v (`v`) USING INVERTED PROPERTIES("parser" = "english")
   ) ENGINE=OLAP DUPLICATE KEY(`id`)
     DISTRIBUTED BY HASH(`id`) BUCKETS 1
     PROPERTIES ("replication_allocation" = "tag.location.default:1");
   
   insert into t values(1, '{"a": "abc"}');
   insert into t values(2, '{"b": "abc"}');
   insert into t values(3, '{"c": false}');
   
   select score() from t where v["c"] match "abc" order by score() limit 10;
   -- BE coredumps
   ```
   
   This PR wraps the `IndexReader::open` + searcher-cache-fill path in 
`CollectionStatistics::process_segment` with a `try { ... } catch 
(CLuceneError& e)` that logs and `continue`s to the next field. Skipping 
contributes 0 to `_total_num_tokens` / `_term_doc_freqs` for the affected field 
in that segment, which is the intended semantics for *no fulltext data for this 
sub-column in this segment*. Existing `INVERTED_INDEX_FILE_NOT_FOUND` / 
`INVERTED_INDEX_BYPASS` handling at `CollectionStatistics::collect` is 
unchanged and still applies when the entry is genuinely absent.
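
   The skip-and-continue behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `FulltextOpenError`, `FieldEntry`, `open_and_count_tokens`, and `Stats` are hypothetical, and the `process_segment` below is a simplified free function rather than `CollectionStatistics::process_segment`. It shows a field whose index entry cannot be opened as fulltext being skipped, so that it contributes nothing to the token totals for that segment.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for CLuceneError.
struct FulltextOpenError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical description of one field's index entry in a segment.
struct FieldEntry {
    std::string name;
    bool is_fulltext;     // false models a BKD/numeric entry
    uint64_t num_tokens;  // tokens available when the entry is fulltext
};

// Hypothetical opener: throws when the entry is not a fulltext index.
uint64_t open_and_count_tokens(const FieldEntry& f) {
    if (!f.is_fulltext) {
        throw FulltextOpenError("No segments* file found");
    }
    return f.num_tokens;
}

// Hypothetical accumulator for per-field statistics.
struct Stats {
    std::map<std::string, uint64_t> total_num_tokens;
    std::vector<std::string> skipped;  // real code would log instead
};

// Simplified per-segment loop: a failed open skips only that field.
void process_segment(const std::vector<FieldEntry>& fields, Stats& stats) {
    for (const auto& f : fields) {
        uint64_t tokens = 0;
        try {
            tokens = open_and_count_tokens(f);
        } catch (const FulltextOpenError&) {
            stats.skipped.push_back(f.name);
            continue;  // contributes nothing for this field in this segment
        }
        stats.total_num_tokens[f.name] += tokens;
    }
}
```

   The token count is read into a local variable before the map is touched, so a skipped field never gets an entry at all in this sketch.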
   
   The deeper schema-level fix — never cloning a fulltext parent meta for a 
sub-column whose actual segment-level index was written as BKD — needs segment 
context and is a follow-up. The defensive try/catch is enough to stop the abort 
and is the same shape Doris uses elsewhere when CLucene exceptions cross the BE 
/ Doris boundary.
   
   ### Release note
   
   Fix BE crash when running `score()` / BM25-scoring queries against a variant 
sub-column whose data in some segments is non-string while the parent variant 
column has a fulltext INVERTED index.
   
   ### Check List (For Author)
   
   - [x] Test:
     - Regression test: 
`regression-test/suites/inverted_index_p0/test_bm25_score_variant_boolean_subcolumn.groovy`
 replays the exact DORIS-25510 reproducer (3 single-row inserts so each lands 
in its own segment, including the segment with the boolean sub-column) and 
asserts the query returns without crashing.
   - [x] Behavior changed: No (only converts a crash into a logged warning + 
empty stats contribution for the affected sub-column / segment).
   - [x] Does this need documentation: No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

