Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

via GitHub Sun, 08 Oct 2023 07:08:09 -0700


mikemccand commented on code in PR #12631:
URL: https://github.com/apache/lucene/pull/12631#discussion_r1349699402



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/FieldReader.java:
##########
@@ -99,6 +102,26 @@ public final class FieldReader extends Terms {
     */
   }
 
+  long readLongOutput(DataInput in) throws IOException {

Review Comment:
   Maybe rename to `readVLongOutput`?  Make it clearly we are doing some form 
of vlong encoding?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:
##########
@@ -460,6 +460,19 @@ public String toString() {
       return "BLOCK: prefix=" + brToString(prefix);
     }
 
+    private static void writeMSBVLong(long i, DataOutput scratchBytes) throws 
IOException {

Review Comment:
   Maybe rename `i` to `value`?  And then you can use `i` for the for loop 
below?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java:
##########
@@ -81,8 +81,11 @@ public final class Lucene90BlockTreeTermsReader extends 
FieldsProducer {
   /** Initial terms format. */
   public static final int VERSION_START = 0;
 
+  /** Version that uses MSB VLong encoded output */
+  public static final int VERSION_MSB_VLONG_OUTPUT = 1;
+
   /** Current terms format. */
-  public static final int VERSION_CURRENT = VERSION_START;
+  public static final int VERSION_CURRENT = VERSION_MSB_VLONG_OUTPUT;

Review Comment:
   Add a comment explaining what changed, and link to the GitHub issue?



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:
##########
@@ -460,6 +460,19 @@ public String toString() {
       return "BLOCK: prefix=" + brToString(prefix);
     }
 
+    private static void writeMSBVLong(long i, DataOutput scratchBytes) throws 
IOException {
+      assert i >= 0;
+      // Keep zero bits on most significant byte to have more chance to get 
prefix bytes shared.
+      // e.g. we expect 0x7FFF stored as [0x81, 0xFF, 0x7F] but not [0xFF, 
0xFF, 0x40]
+      int LSBVLongBytes = (Long.SIZE - Long.numberOfLeadingZeros(i) - 1) / 7 + 
1;

Review Comment:
   Maybe rename to `bytesNeeded`?  (Full disclosure this was suggested by 
[Bard](https://bard.google.com)).



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java:
##########
@@ -460,6 +460,19 @@ public String toString() {
       return "BLOCK: prefix=" + brToString(prefix);
     }
 
+    private static void writeMSBVLong(long i, DataOutput scratchBytes) throws 
IOException {
+      assert i >= 0;
+      // Keep zero bits on most significant byte to have more chance to get 
prefix bytes shared.
+      // e.g. we expect 0x7FFF stored as [0x81, 0xFF, 0x7F] but not [0xFF, 
0xFF, 0x40]
+      int LSBVLongBytes = (Long.SIZE - Long.numberOfLeadingZeros(i) - 1) / 7 + 
1;
+      i <<= Long.SIZE - LSBVLongBytes * 7;
+      for (int j = 1; j < LSBVLongBytes; j++) {
+        scratchBytes.writeByte((byte) (((i >>> 57) & 0x7FL) | 0x80));

Review Comment:
   Maybe @uschindler can help review this heavy math :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Re: [PR] Write MSB VLong for better outputs sharing in block tree index [lucene]

Reply via email to