xiangfu0 commented on code in PR #17998:
URL: https://github.com/apache/pinot/pull/17998#discussion_r3003628001
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/TextIndexHandler.java:
##########
@@ -210,6 +227,9 @@ private void
createTextIndexForColumn(SegmentDirectory.Writer segmentWriter, Col
// Clean up any existing text index files
cleanupExistingTextIndexFiles(indexDir, columnName);
}
+ TextIndexUtils.cleanupTextIndex(indexDir, columnName);
+ FileUtils.deleteQuietly(inProgress);
+ FileUtils.touch(inProgress);
Review Comment:
Updated `TextIndexHandler` to clean both the top-level segment root and the
versioned segment directory before recreating v3 text indexes. I also added a
v3 regression in `SegmentPreProcessorTest.testV3ReplaceLegacyNativeTextIndex`
that drops a stale file into the Lucene text index dir and verifies reload
removes it before rebuild.
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/LuceneFSTIndexReader.java:
##########
@@ -77,4 +84,68 @@ public void close()
throws IOException {
// Do Nothing
}
+
+ private static List<Long> regexMatch(String regexQuery, FST<Long> fst)
+ throws IOException {
+ Automaton automaton = new RegExp(regexQuery).toAutomaton();
+ if (automaton.getNumStates() == 0) {
+ return Collections.emptyList();
+ }
+
+ List<Path<Long>> queue = new ArrayList<>();
+ List<Long> matchingDictIds = new ArrayList<>();
+ queue.add(new Path<>(0, fst.getFirstArc(new FST.Arc<>()),
fst.outputs.getNoOutput(), new IntsRefBuilder()));
+
+ FST.Arc<Long> scratchArc = new FST.Arc<>();
+ FST.BytesReader fstReader = fst.getBytesReader();
+ Transition transition = new Transition();
+ while (!queue.isEmpty()) {
+ Path<Long> path = queue.remove(queue.size() - 1);
+ if (automaton.isAccept(path._state) && path._fstNode.isFinal()) {
+ matchingDictIds.add(path._output);
+ }
Review Comment:
Updated `LuceneFSTIndexReader` to include `nextFinalOutput()` when a final
arc matches, matching the IFST reader. I also added exact-match coverage in
`LuceneFSTIndexCreatorTest` so the final-output path is exercised directly.
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/TextIndexHandler.java:
##########
@@ -384,4 +394,15 @@ private void
convertTextIndexToV3Format(SegmentDirectory.Writer segmentWriter, S
LOGGER.info("Successfully converted text index to V3 combined format for
column: {}", columnName);
}
+
+ private Set<String> getColumnsWithLegacyNativeTextIndex() {
+ File indexDir = _segmentDirectory.getSegmentMetadata().getIndexDir();
+ Set<String> columns = new HashSet<>();
+ for (String column :
_segmentDirectory.getSegmentMetadata().getAllColumns()) {
+ if (new File(indexDir, column +
V1Constants.Indexes.NATIVE_TEXT_INDEX_FILE_EXTENSION).exists()) {
+ columns.add(column);
+ }
+ }
+ return columns;
Review Comment:
Handled in the same patch. Legacy native text detection now checks both the
segment root and the versioned v3 directory, and the new v3 regression test
covers a legacy native text file under the versioned directory.
##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/SegmentPreProcessorTest.java:
##########
@@ -1337,6 +1338,37 @@ public void testV1CleanupIndices()
assertFalse(bfFile.exists());
}
+ @Test
+ public void testV1ReplaceLegacyNativeFstIndex()
+ throws Exception {
+ buildV1Segment();
+
+ String strColumn = "column3";
+ _fieldConfigMap.put(strColumn, new FieldConfig(strColumn,
FieldConfig.EncodingType.DICTIONARY,
+ List.of(FieldConfig.IndexType.FST), null, null));
+
+ File fstFile = new File(INDEX_DIR, strColumn +
V1Constants.Indexes.LUCENE_V912_FST_INDEX_FILE_EXTENSION);
+ int legacyNativeFstMagic = ('\\' << 24) | ('f' << 16) | ('s' << 8) | 'a';
+
Review Comment:
Reused the production value here by exposing
`FstIndexUtils.LEGACY_NATIVE_FST_MAGIC` and switched the test to use that
constant instead of duplicating the magic number inline.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]