Re: [PR] Enhance REGEXP_LIKE to scan dictionary when dictionary is small [pinot]

via GitHub Thu, 31 Jul 2025 15:10:18 -0700


Copilot commented on code in PR #16478:
URL: https://github.com/apache/pinot/pull/16478#discussion_r2246479840



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/RegexpLikePredicateEvaluatorFactory.java:
##########
@@ -32,6 +36,9 @@ public class RegexpLikePredicateEvaluatorFactory {
   private RegexpLikePredicateEvaluatorFactory() {
   }
 
+  /// When the cardinality of the dictionary is less than this threshold, scan 
the dictionary to get the matching ids.

Review Comment:
   Use standard Java comment syntax `//` instead of `///` for single-line 
comments.
   ```suggestion
     // When the cardinality of the dictionary is less than this threshold, 
scan the dictionary to get the matching ids.
   ```



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/RegexpLikePredicateEvaluatorFactory.java:
##########
@@ -56,16 +66,64 @@ public static BaseDictionaryBasedPredicateEvaluator 
newDictionaryBasedEvaluator(
    */
   public static BaseRawValueBasedPredicateEvaluator 
newRawValueBasedEvaluator(RegexpLikePredicate regexpLikePredicate,
       DataType dataType) {
-    Preconditions.checkArgument(dataType == DataType.STRING, "Unsupported data 
type: " + dataType);
+    Preconditions.checkArgument(dataType.getStoredType() == DataType.STRING, 
"Unsupported data type: " + dataType);
     return new RawValueBasedRegexpLikePredicateEvaluator(regexpLikePredicate);
   }
 
-  private static final class DictionaryBasedRegexpLikePredicateEvaluator 
extends BaseDictionaryBasedPredicateEvaluator {
+  private static final class DictIdBasedRegexpLikePredicateEvaluator
+      extends BaseDictIdBasedRegexpLikePredicateEvaluator {
+    final IntSet _matchingDictIdSet;
+
+    public DictIdBasedRegexpLikePredicateEvaluator(RegexpLikePredicate 
regexpLikePredicate, Dictionary dictionary) {
+      super(regexpLikePredicate, dictionary);
+      Matcher matcher = regexpLikePredicate.getPattern().matcher("");
+      IntList matchingDictIds = new IntArrayList();
+      int dictionarySize = _dictionary.length();
+      for (int dictId = 0; dictId < dictionarySize; dictId++) {
+        if (matcher.reset(dictionary.getStringValue(dictId)).find()) {
+          matchingDictIds.add(dictId);
+        }
+      }
+      int numMatchingDictIds = matchingDictIds.size();
+      if (numMatchingDictIds == 0) {
+        _alwaysFalse = true;
+      } else if (dictionarySize == numMatchingDictIds) {
+        _alwaysTrue = true;
+      }
+      _matchingDictIds = matchingDictIds.toIntArray();
+      _matchingDictIdSet = new IntOpenHashSet(_matchingDictIds);
+    }
+
+    @Override
+    public int getNumMatchingItems() {
+      return _matchingDictIdSet.size();
+    }
+
+    @Override
+    public boolean applySV(int dictId) {
+      return _matchingDictIdSet.contains(dictId);
+    }
+
+    @Override
+    public int applySV(int limit, int[] docIds, int[] values) {
+      // reimplemented here to ensure applySV can be inlined

Review Comment:
   The comment 'reimplemented here to ensure applySV can be inlined' should be 
more specific about why this reimplementation is necessary and what performance 
benefit it provides.
   ```suggestion
         // This method is reimplemented to ensure it can be inlined by the JVM 
for performance optimization.
         // Inlining allows the JVM to eliminate method call overhead and apply 
further optimizations such as loop unrolling.
         // This is particularly beneficial in scenarios where this method is 
called frequently during query execution,
         // as it reduces runtime overhead and improves overall query 
performance.
   ```



##########
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/OfflineClusterIntegrationTest.java:
##########
@@ -2209,6 +2212,22 @@ private void testNewAddedColumns()
     assertEquals(row.get(12).asDouble(), 0.0);
   }
 
+  private void testRegexpLikeOnNewAddedColumns()
+      throws Exception {
+    int numTotalDocs = (int) getCountStarResult();
+
+    // REGEXP_LIKE on new added dictionary-encoded columns should not scan the 
table when it matches all or nothing

Review Comment:
   Change 'new added' to 'newly added' for correct grammar.
   ```suggestion
       // REGEXP_LIKE on newly added dictionary-encoded columns should not scan 
the table when it matches all or nothing
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Enhance REGEXP_LIKE to scan dictionary when dictionary is small [pinot]

Reply via email to