jpountz commented on code in PR #875:
URL: https://github.com/apache/lucene/pull/875#discussion_r870001129


##########
lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java:
##########
@@ -48,10 +49,69 @@ public class OrdinalMap implements Accountable {
   // need it
   // TODO: use more efficient packed ints structures?
 
+  /**
+   * Copy the first 8 bytes of the given term as a comparable unsigned long. 
In case the term has
+   * less than 8 bytes, missing bytes will be replaced with zeroes. Note that 
two terms that produce
+   * the same long could still be different due to the fact that missing bytes 
are replaced with
+   * zeroes, e.g. {@code [1, 0]} and {@code [1]} get mapped to the same long.
+   */
+  static long prefix8ToComparableUnsignedLong(BytesRef term) {
+    // Use Big Endian so that longs are comparable
+    if (term.length >= Long.BYTES) {
+      return (long) BitUtil.VH_BE_LONG.get(term.bytes, term.offset);
+    } else {
+      long l;
+      int offset;
+      if (term.length >= Integer.BYTES) {
+        l = (int) BitUtil.VH_BE_INT.get(term.bytes, term.offset);
+        offset = Integer.BYTES;
+      } else {
+        l = 0;
+        offset = 0;
+      }
+      while (offset < term.length) {
+        l = (l << 8) | Byte.toUnsignedLong(term.bytes[term.offset + offset]);
+        offset++;
+      }
+      l <<= (Long.BYTES - term.length) << 3;
+      return l;
+    }
+  }
+
+  private static int compare(BytesRef termA, long prefix8A, BytesRef termB, 
long prefix8B) {
+    assert prefix8A == prefix8ToComparableUnsignedLong(termA);

Review Comment:
   The main improvement I can think of would consist of looking up the first 
and last values of the segment to check if all values share a common prefix, 
e.g. the IPv4-mapped IPv6 addresses case. Maybe in the future we could split 
the value space into smaller blocks or something like that that would help us 
still handle well cases when many values share a common prefix but not all, 
e.g. a dataset of URLs where many values have the `https://www.` prefix, but 
not all, or a dataset that mixes lots of IPv4-mapped IPv6 addresses with 
regular IPv6 addresses.
   
   Maybe the API could tell us about the min and max term lengths, so that we 
could optimize the fixed-length case (e.g. geonames IDs) in the future a bit.
   
   I don't have many ideas beyond these ones. I tried to review existing 
litterature for binary search and sorting string[] keys, which have 
commonalities with what we're doing here since there's a value that's 
potentially going to be compared with several other values, and it looks like 
the main idea consists of identifying shared prefixes so that these bytes 
wouldn't have to be compared over and over again. Maybe something we can try 
out next.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to