mikemccand commented on code in PR #13192: URL: https://github.com/apache/lucene/pull/13192#discussion_r1542884001
########## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ########## @@ -196,6 +207,90 @@ void loadBlock() throws IOException { suffixLengthsReader.reset(suffixLengthBytes, 0, numSuffixLengthBytes); totalSuffixBytes = ste.in.getFilePointer() - startSuffixFP; + // Prepare suffixes, offsets to binary search. + if (allEqual) { + if (isLeafBlock) { + suffix = suffixLengthsReader.readVInt(); + } else { + // Handle subCode for non leaf block. + postions = new int[entCount]; Review Comment: This many new allocations, at such a low level / hot spot (on each block load) seems risky / performance hurting? I wonder how often we see `allEqual` for non-leaf blocks when the field is indeed fixed-length terms? It might happen often, depending on how evenly distributed the tokens are across term space? A random UUID should be quite even, at least on initial indexing, but a predictable ID (just 0-padded incrementing int, like luceneutil) might be less so? ########## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ########## @@ -554,49 +683,128 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean exactOnly) throws IOEx assert prefixMatches(target); - // TODO: binary search when all terms have the same length, which is common for ID fields, - // which are also the most sensitive to lookup performance? - // Loop over each entry (term or sub-block) in this block: - do { - nextEnt++; + // TODO early terminate when target length unequals suffix + prefix. + // But we need to keep the same status with scanToTermLeaf. + int start = nextEnt; + int end = entCount - 1; + // Binary search the entries (terms) in this leaf block: + int cmp = 0; + while (start <= end) { + int mid = (start + end) / 2; Review Comment: Looks like we need to rebase this change on the latest from #11888? ########## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ########## @@ -196,6 +207,90 @@ void loadBlock() throws IOException { suffixLengthsReader.reset(suffixLengthBytes, 0, numSuffixLengthBytes); totalSuffixBytes = ste.in.getFilePointer() - startSuffixFP; + // Prepare suffixes, offsets to binary search. + if (allEqual) { + if (isLeafBlock) { + suffix = suffixLengthsReader.readVInt(); + } else { + // Handle subCode for non leaf block. + postions = new int[entCount]; + termExists = new FixedBitSet(entCount); + subCodes = new long[entCount]; + termBlockOrds = new int[entCount]; + lastSubIndices = new int[entCount]; + int termBlockOrd = 0; + int lastSubIndex = -1; + // read first vint to set suffix, byt the way, set termExist, subCode. + code = suffixLengthsReader.readVInt(); + suffix = code >>> 1; + if ((code & 1) == 0) { + termExists.set(0); + termBlockOrd++; + } else { + // read subCode. + subCodes[0] = suffixLengthsReader.readVLong(); + lastSubIndex = 0; + } + termBlockOrds[0] = termBlockOrd; + postions[0] = suffixLengthsReader.getPosition(); + lastSubIndices[0] = lastSubIndex; + for (int i = 1; i < suffixes.length; i++) { + code = suffixLengthsReader.readVInt(); + suffixes[i] = code >>> 1; + if ((code & 1) == 0) { + termExists.set(i); + termBlockOrd++; + } else { + // read subCode. + subCodes[i] = suffixLengthsReader.readVLong(); + lastSubIndex = i; + } + termBlockOrds[i] = termBlockOrd; + postions[i] = suffixLengthsReader.getPosition(); + lastSubIndices[i] = lastSubIndex; + } + } + // Reset suffixLengthsReader's position. + suffixLengthsReader.setPosition(0); + } else { + suffixes = new int[entCount]; + // TODO: remove postions if it is unnecessary. + postions = new int[entCount]; + if (isLeafBlock) { + for (int i = 0; i < suffixes.length; i++) { + suffixes[i] = suffixLengthsReader.readVInt(); + postions[i] = suffixLengthsReader.getPosition(); Review Comment: `postions` -> `positions`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org