Re: [PR] Fix the behavior for AND/OR within lucene queries with wild cards for classic query parser [pinot]

via GitHub Mon, 16 Jun 2025 11:01:54 -0700


Jackie-Jiang commented on code in PR #16118:
URL: https://github.com/apache/pinot/pull/16118#discussion_r2150535724



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/LuceneTextIndexUtils.java:
##########
@@ -33,27 +35,55 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-

Review Comment:
   (nit, format) We should keep 2 empty lines before class declaration



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/LuceneTextIndexUtils.java:
##########
@@ -33,27 +35,55 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-
 public class LuceneTextIndexUtils {
   private static final Logger LOGGER = 
LoggerFactory.getLogger(LuceneTextIndexUtils.class);
 
   private LuceneTextIndexUtils() {
   }
 
   /**
+   * Converts a BooleanQuery to appropriate SpanQuery based on its clauses.
+   * For AND operations, uses SpanNearQuery with specified slop.
+   * For OR operations, uses SpanOrQuery.
+   * For NOT operations, uses SpanNotQuery.
    *
    * @param query a Lucene query
-   * @return a span query with 0 slop and the original clause order if the 
input query is boolean query with one or more
+   * @return a span query if the input query is boolean query with one or more
    * prefix or wildcard subqueries; the same query otherwise.
    */
   public static Query convertToMultiTermSpanQuery(Query query) {
     if (!(query instanceof BooleanQuery)) {
       return query;
     }
     LOGGER.debug("Perform rewriting for the phrase query {}.", query);
+
+    BooleanQuery booleanQuery = (BooleanQuery) query;
     ArrayList<SpanQuery> spanQueryLst = new ArrayList<>();
     boolean prefixOrSuffixQueryFound = false;
-    for (BooleanClause clause : ((BooleanQuery) query).clauses()) {
+    boolean isOrQuery = false;
+    boolean hasNotClause = false;
+    SpanQuery notQuery = null;
+
+    // First pass: Check query type and collect NOT clauses
+    for (BooleanClause clause : booleanQuery.clauses()) {
+      if (clause.getOccur() == BooleanClause.Occur.SHOULD) {
+        isOrQuery = true;

Review Comment:
   Is it possible to have mixed `MUST` and `SHOULD`? Should it be handled like 
a tree?
   Right now any `SHOULD` will make the entire query handled as `or`



##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/LuceneTest.java:
##########
@@ -0,0 +1,218 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.pinot.segment.local.utils;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.StopFilter;
+import org.apache.lucene.analysis.StopwordAnalyzerBase;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.Tokenizer;
+import org.apache.lucene.analysis.core.LowerCaseFilter;
+import org.apache.lucene.analysis.core.WhitespaceTokenizer;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.TextField;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.queryparser.classic.QueryParser;
+import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
+import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.store.FSDirectory;
+import org.testng.Assert;
+import org.testng.annotations.AfterClass;
+import org.testng.annotations.BeforeClass;
+import org.testng.annotations.Test;
+
+
+/**
+ * Lucene QueryParser Concepts:
+ *
+ * 1. Default Behavior:
+ *    - Terms without operators are treated as OR
+ *    - "term1 term2" is same as "term1 OR term2"
+ *    - Case-insensitive by default
+ *    - Whitespace is used to separate terms
+ *
+ * 2. Keywords and Operators:
+ *    AND     - Both terms must match
+ *    OR      - Either term can match
+ *    NOT     - Term must not match
+ *    +       - Term must match (same as AND)
+ *    -       - Term must not match (same as NOT)
+ *
+ * 3. Wildcards:
+ *    *       - Matches any number of characters
+ *    ?       - Matches single character
+ *
+ * 4. Grouping:
+ *    ( )     - Groups terms together
+ *    " "     - Phrase query (terms must be in exact order)
+ *
+ * 5. Example Queries:
+ *    "power"                     - Matches documents containing "power"
+ *    "power free"               - Matches documents containing either "power" 
OR "free"
+ *    "power AND free"           - Matches documents containing both "power" 
AND "free"
+ *    "power NOT free"           - Matches documents containing "power" but 
NOT "free"
+ *    "power*"                   - Matches "power", "powerful", etc.
+ *    "*power"                   - Matches "power", "superpower", etc. (if 
leading wildcards enabled)
+ *    "p?wer"                    - Matches "power", "pawer", etc.
+ */
+public class LuceneTest {

Review Comment:
   Are we testing pure Lucene behavior here? IMO we should integrate this into 
`LuceneTextIndexUtilsTest` and test Pinot behavior



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/LuceneTextIndexUtils.java:
##########
@@ -33,27 +35,55 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-
 public class LuceneTextIndexUtils {
   private static final Logger LOGGER = 
LoggerFactory.getLogger(LuceneTextIndexUtils.class);
 
   private LuceneTextIndexUtils() {
   }
 
   /**
+   * Converts a BooleanQuery to appropriate SpanQuery based on its clauses.
+   * For AND operations, uses SpanNearQuery with specified slop.
+   * For OR operations, uses SpanOrQuery.
+   * For NOT operations, uses SpanNotQuery.
    *
    * @param query a Lucene query
-   * @return a span query with 0 slop and the original clause order if the 
input query is boolean query with one or more
+   * @return a span query if the input query is boolean query with one or more
    * prefix or wildcard subqueries; the same query otherwise.
    */
   public static Query convertToMultiTermSpanQuery(Query query) {
     if (!(query instanceof BooleanQuery)) {
       return query;
     }
     LOGGER.debug("Perform rewriting for the phrase query {}.", query);
+
+    BooleanQuery booleanQuery = (BooleanQuery) query;
     ArrayList<SpanQuery> spanQueryLst = new ArrayList<>();
     boolean prefixOrSuffixQueryFound = false;
-    for (BooleanClause clause : ((BooleanQuery) query).clauses()) {
+    boolean isOrQuery = false;
+    boolean hasNotClause = false;
+    SpanQuery notQuery = null;
+
+    // First pass: Check query type and collect NOT clauses

Review Comment:
   Any specific reason why you want to do 2 passes? Seems it can be done with 1 
pass



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Re: [PR] Fix the behavior for AND/OR within lucene queries with wild cards for classic query parser [pinot]

Reply via email to