rmuir commented on code in PR #875:
URL: https://github.com/apache/lucene/pull/875#discussion_r869459393
##########
lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java:
##########
@@ -48,10 +49,69 @@ public class OrdinalMap implements Accountable {
   // need it
   // TODO: use more efficient packed ints structures?
+  /**
+   * Copy the first 8 bytes of the given term as a comparable unsigned long. In case the term has
+   * less than 8 bytes, missing bytes will be replaced with zeroes. Note that two terms that produce
+   * the same long could still be different due to the fact that missing bytes are replaced with
+   * zeroes, e.g. {@code [1, 0]} and {@code [1]} get mapped to the same long.
+   */
+  static long prefix8ToComparableUnsignedLong(BytesRef term) {
+    // Use Big Endian so that longs are comparable
+    if (term.length >= Long.BYTES) {
+      return (long) BitUtil.VH_BE_LONG.get(term.bytes, term.offset);
+    } else {
+      long l;
+      int offset;
+      if (term.length >= Integer.BYTES) {
+        l = (int) BitUtil.VH_BE_INT.get(term.bytes, term.offset);
+        offset = Integer.BYTES;
+      } else {
+        l = 0;
+        offset = 0;
+      }
+      while (offset < term.length) {
+        l = (l << 8) | Byte.toUnsignedLong(term.bytes[term.offset + offset]);
+        offset++;
+      }
+      l <<= (Long.BYTES - term.length) << 3;
+      return l;
+    }
+  }
+
+  private static int compare(BytesRef termA, long prefix8A, BytesRef termB, long prefix8B) {
+    assert prefix8A == prefix8ToComparableUnsignedLong(termA);

Review Comment:
   How much could we speed up this merge if it were "aware" of more of the docvalues structure (shared prefix length, block id, etc.)? That's the most obvious waste, right? In some cases we needlessly compare the same bytes over and over again because the comparison isn't aware of this structure.

   For "intersection" we added `Terms.intersect`, but this is something different. Maybe we could optimize it without messing up the `TermsEnum` API (e.g. add a similar method, or perhaps an optional interface/subclass that the codec could implement).
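For readers who want to see the prefix trick from the quoted hunk in isolation, here is a minimal standalone sketch. It deliberately avoids Lucene's `BytesRef` and `BitUtil` and works on plain `byte[]`; the class name `Prefix8Sketch` and the method names `prefix8` and `compare` are hypothetical. The tie-break in `compare` is an assumption about what a prefix-aware comparison would do, since the quoted hunk cuts off before the body of the PR's own `compare`.

```java
import java.util.Arrays;

final class Prefix8Sketch {
  /** Packs up to the first 8 bytes big-endian, zero-padding short inputs on the right. */
  static long prefix8(byte[] term) {
    long l = 0;
    int n = Math.min(term.length, Long.BYTES);
    for (int i = 0; i < n; i++) {
      l = (l << 8) | Byte.toUnsignedLong(term[i]);
    }
    // Move the first byte into the most significant position.
    // For n == 0 the shift distance wraps to 0, which is harmless because l is 0.
    return l << ((Long.BYTES - n) << 3);
  }

  /**
   * Compares the prefixes as unsigned longs first. Only on a tie does it fall back to a
   * full unsigned byte comparison (assumed behavior, not taken from the PR).
   */
  static int compare(byte[] a, long prefixA, byte[] b, long prefixB) {
    int cmp = Long.compareUnsigned(prefixA, prefixB);
    return cmp != 0 ? cmp : Arrays.compareUnsigned(a, b);
  }

  public static void main(String[] args) {
    byte[] a = {1};
    byte[] b = {1, 0};
    byte[] c = {(byte) 0x80};
    // [1] and [1, 0] collide on the prefix, as the javadoc in the hunk warns.
    System.out.println(prefix8(a) == prefix8(b));
    // The tie is resolved by the full comparison: [1] sorts before [1, 0].
    System.out.println(compare(a, prefix8(a), b, prefix8(b)) < 0);
    // 0x80 must sort after 0x01, which is why the longs are compared as unsigned.
    System.out.println(compare(c, prefix8(c), a, prefix8(a)) > 0);
  }
}
```

The sketch also makes the cost raised in the review visible: whenever two prefixes tie, the fallback rescans both terms from the first byte, even though the leading bytes are already known to be equal.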