Pulkitg64 opened a new issue, #12414:
URL: https://github.com/apache/lucene/issues/12414

   ### Description
   
   **Context**
   
   In KNN Prefiltering we require a ```BitSet``` of docs which matched with 
prefilter query. This BitSet is used during HNSW Graph Search to check whether 
a node is accepted for matching or not. To create this BitSet we call 
[createBitSet](https://github.com/apache/lucene/blob/9ffc625b2ec43481e2cfba29828efbddf12e6852/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L161)
 function which creates a BitSet with following conditions:
   
   1. When there are **no deletions** in the index and ```DocIdSetIterator``` 
is of ```BitSetIterator``` instance, then directly return the BitSet.
   
   2. For other cases create a ```FilteredDocIdSetIterator``` first then create 
```BitSet``` by iterating over docs from the ```FilteredDocIdSetIterator```.
   
   
   ```
   private BitSet createBitSet(DocIdSetIterator iterator, Bits liveDocs, int 
maxDoc)
         throws IOException {
       if (liveDocs == null && iterator instanceof BitSetIterator 
bitSetIterator) {
         // If we already have a BitSet and no deletions, reuse the BitSet
         return bitSetIterator.getBitSet();
       } else {
         // Create a new BitSet from matching and live docs
         FilteredDocIdSetIterator filterIterator =
             new FilteredDocIdSetIterator(iterator) {
               @Override
               protected boolean match(int doc) {
                 return liveDocs == null || liveDocs.get(doc);
               }
             };
         return BitSet.of(filterIterator, maxDoc);
       }
     }
   
   ```
   **Problem**
   
   When the ```DocIdSetIterator``` is of ```BitSetIterator``` instance and 
there are **deletions** in the index, we create a FilteredDocIdSetIterator 
which is later used to create a BitSet. This is a time consuming process 
because to create the BitSet we iterate over all matching docs from the 
iterator.
   
   
   **Proposed Optimisation**
    
   For the above case, we can wrap matchedDocs and liveDocs under single Bits 
instance something like below which can be directly passed for HNSW search. So, 
now during HNSW Search when a node is explored, we will check if the doc is 
accepted or not by checking bits of both matchedDocs and liveDocs which is 
constant time operation.
   
   
   ```
   public class KNNBitSet implements Bits{
   
       Bits liveDocs;
       Bits matchedDocs;
       public KNNBitSet(Bits matchedDocs, Bits liveDocs) {
           this.matchedDocs = matchedDocs;
           this.liveDocs = liveDocs;
   
       }
       @Override
       public boolean get(int index) {
           if (liveDocs ==null) {
               return matchedDocs.get(index);
           }
           return matchedDocs.get(index) & liveDocs.get(index);
       }
   
       @Override
       public int length() {
           return 0;
       }
   }
   ```
   
   **Advantage**
   
   This will reduce BitSet creation time for some cases (when there are deleted 
documents) because now we are not iterating over each docs to create the 
BitSet. 
   
   
   **Drawback**
   
   There is one drawback in this approach, that we will not be able to give the 
exact cardinality of acceptedDocs and the number will be higher than exact 
number of acceptedDocs because now we are not considering liveDocs into the 
final BitSet.
   
   *Original*:
   Number of Matched Docs = 10, Number of Deleted Docs = 2
   Cardinality of Accepted Docs = 10-2 = 8
   
   *With Proposed Optimisation*:
   Number of Matched Docs = 10, Number of Deleted Docs = 2
   Cardinality of Accepted Docs = 10 (This is an overestimation as we are not 
considering deleted docs in BitSet)
    
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to