Re: [PR] Binary search all terms. [lucene]

via GitHub Thu, 28 Mar 2024 05:36:53 -0700


mikemccand commented on code in PR #13192:
URL: https://github.com/apache/lucene/pull/13192#discussion_r1542884001



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##########
@@ -196,6 +207,90 @@ void loadBlock() throws IOException {
     suffixLengthsReader.reset(suffixLengthBytes, 0, numSuffixLengthBytes);
     totalSuffixBytes = ste.in.getFilePointer() - startSuffixFP;
 
+    // Prepare suffixes, offsets to binary search.
+    if (allEqual) {
+      if (isLeafBlock) {
+        suffix = suffixLengthsReader.readVInt();
+      } else {
+        // Handle subCode for non leaf block.
+        postions = new int[entCount];

Review Comment:
   This many new allocations at such a low-level hot spot (on every block 
load) seems risky and likely to hurt performance?
   
   I wonder how often we see `allEqual` for non-leaf blocks when the field 
really does have fixed-length terms?  It might happen often, depending on how 
evenly distributed the tokens are across term space?  A random UUID should be 
quite even, at least on initial indexing, but a predictable ID (just a 
0-padded incrementing int, as in luceneutil) might be less so?
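
   To illustrate the allocation concern, here is a minimal sketch of one way 
to reuse scratch arrays across block loads instead of calling `new` each time. 
`FrameScratch`, `ensureCapacity`, and the choice of which arrays to hold are 
hypothetical names for illustration only; they are not part of the PR or of 
Lucene.

```java
/** Hypothetical holder for per-frame scratch arrays (an assumption, not Lucene code). */
final class FrameScratch {
  private int[] positions = new int[0];
  private long[] subCodes = new long[0];

  /**
   * Makes sure the arrays can hold entCount entries. A new allocation happens
   * only when a block is larger than any block seen so far by this frame.
   */
  void ensureCapacity(int entCount) {
    if (positions.length < entCount) {
      // Grow geometrically so repeated small increases do not reallocate each time.
      int newSize = Math.max(entCount, positions.length * 2);
      positions = new int[newSize];
      subCodes = new long[newSize];
    }
  }

  int[] positions() {
    return positions;
  }

  long[] subCodes() {
    return subCodes;
  }
}
```

   With this approach the arrays may be longer than the current block and may 
hold stale values from an earlier block, so loops would have to be bounded by 
`entCount` rather than by the array length.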



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##########
@@ -554,49 +683,128 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean exactOnly) throws IOEx
 
     assert prefixMatches(target);
 
-    // TODO: binary search when all terms have the same length, which is common for ID fields,
-    // which are also the most sensitive to lookup performance?
-    // Loop over each entry (term or sub-block) in this block:
-    do {
-      nextEnt++;
+    // TODO early terminate when target length unequals suffix + prefix.
+    // But we need to keep the same status with scanToTermLeaf.
+    int start = nextEnt;
+    int end = entCount - 1;
+    // Binary search the entries (terms) in this leaf block:
+    int cmp = 0;
+    while (start <= end) {
+      int mid = (start + end) / 2;

Review Comment:
   Looks like we need to rebase this change on the latest from #11888?
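
   For context on the hunk above, here is a minimal sketch of binary search 
over fixed-length suffixes packed back to back in one byte array, which is the 
case the removed TODO describes. It is not the PR's implementation: 
`FixedLengthSuffixSearch`, `search`, and the flat `suffixBytes` layout are 
assumptions made for illustration. The midpoint uses `>>> 1` rather than `/ 2` 
so it stays correct even if `lo + hi` were to overflow.

```java
import java.util.Arrays;

/** Hypothetical helper (an assumption, not Lucene code). */
final class FixedLengthSuffixSearch {
  /**
   * Searches entries [start, entCount) of suffixBytes, where entry i occupies
   * bytes [i * suffixLen, (i + 1) * suffixLen) and entries are sorted in
   * unsigned byte order. Returns the matching index, or -(insertionPoint + 1).
   */
  static int search(byte[] suffixBytes, int suffixLen, int entCount, int start, byte[] target) {
    int lo = start;
    int hi = entCount - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int off = mid * suffixLen;
      // Unsigned lexicographic comparison of the mid entry against the target.
      int cmp =
          Arrays.compareUnsigned(suffixBytes, off, off + suffixLen, target, 0, target.length);
      if (cmp < 0) {
        lo = mid + 1;
      } else if (cmp > 0) {
        hi = mid - 1;
      } else {
        return mid;
      }
    }
    return -(lo + 1);
  }
}
```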



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##########
@@ -196,6 +207,90 @@ void loadBlock() throws IOException {
     suffixLengthsReader.reset(suffixLengthBytes, 0, numSuffixLengthBytes);
     totalSuffixBytes = ste.in.getFilePointer() - startSuffixFP;
 
+    // Prepare suffixes, offsets to binary search.
+    if (allEqual) {
+      if (isLeafBlock) {
+        suffix = suffixLengthsReader.readVInt();
+      } else {
+        // Handle subCode for non leaf block.
+        postions = new int[entCount];
+        termExists = new FixedBitSet(entCount);
+        subCodes = new long[entCount];
+        termBlockOrds = new int[entCount];
+        lastSubIndices = new int[entCount];
+        int termBlockOrd = 0;
+        int lastSubIndex = -1;
+        // read first vint to set suffix, byt the way, set termExist, subCode.
+        code = suffixLengthsReader.readVInt();
+        suffix = code >>> 1;
+        if ((code & 1) == 0) {
+          termExists.set(0);
+          termBlockOrd++;
+        } else {
+          // read subCode.
+          subCodes[0] = suffixLengthsReader.readVLong();
+          lastSubIndex = 0;
+        }
+        termBlockOrds[0] = termBlockOrd;
+        postions[0] = suffixLengthsReader.getPosition();
+        lastSubIndices[0] = lastSubIndex;
+        for (int i = 1; i < suffixes.length; i++) {
+          code = suffixLengthsReader.readVInt();
+          suffixes[i] = code >>> 1;
+          if ((code & 1) == 0) {
+            termExists.set(i);
+            termBlockOrd++;
+          } else {
+            // read subCode.
+            subCodes[i] = suffixLengthsReader.readVLong();
+            lastSubIndex = i;
+          }
+          termBlockOrds[i] = termBlockOrd;
+          postions[i] = suffixLengthsReader.getPosition();
+          lastSubIndices[i] = lastSubIndex;
+        }
+      }
+      // Reset suffixLengthsReader's position.
+      suffixLengthsReader.setPosition(0);
+    } else {
+      suffixes = new int[entCount];
+      // TODO: remove postions if it is unnecessary.
+      postions = new int[entCount];
+      if (isLeafBlock) {
+        for (int i = 0; i < suffixes.length; i++) {
+          suffixes[i] = suffixLengthsReader.readVInt();
+          postions[i] = suffixLengthsReader.getPosition();

Review Comment:
   `postions` -> `positions`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

