[I] Reproducible error in TestLucene90HnswVectorsFormat.testIndexedValueNotAliased [lucene]
iverase opened a new issue, #12840:
URL: https://github.com/apache/lucene/issues/12840

Command to reproduce:
```
./gradlew test --tests TestLucene90HnswVectorsFormat.testIndexedValueNotAliased -Dtests.seed=611EEBD0148F03C7
```
error:
```
org.apache.lucene.backward_codecs.lucene90.TestLucene90HnswVectorsFormat > testIndexedValueNotAliased FAILED
    java.lang.AssertionError: expected:<1.0> but was:<2.0>
        at __randomizedtesting.SeedInfo.seed([611EEBD0148F03C7:651A742B93C1394]:0)
        at junit@4.13.1/org.junit.Assert.fail(Assert.java:89)
        at junit@4.13.1/org.junit.Assert.failNotEquals(Assert.java:835)
        at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:577)
        at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:701)
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Skip decoding tail freqs when they are not needed. [lucene]
jpountz commented on PR #12832: URL: https://github.com/apache/lucene/pull/12832#issuecomment-1825371734 This seems to have further helped [`prefix` queries](http://people.apache.org/~mikemccand/lucenebench/Prefix3.html). I'll add an annotation.
Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]
jpountz commented on issue #12826: URL: https://github.com/apache/lucene/issues/12826#issuecomment-1825393392 Let's move your branch to a PR?
Re: [PR] Simplify advancing on postings/impacts enums [lucene]
gf2121 commented on code in PR #12838: URL: https://github.com/apache/lucene/pull/12838#discussion_r1404145355 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java: ## @@ -48,7 +48,7 @@ * * Therefore, we'll trim df before passing it to the interface. see trim(int) Review Comment: This java doc explains in detail why we need this `trim`. We need to update it if we plan to remove this :)
Re: [I] Move group-varint encoding/decoding logic to DataOutput/DataInput? [lucene]
easyice commented on issue #12826: URL: https://github.com/apache/lucene/issues/12826#issuecomment-1825397930 Okay!
Re: [I] add dedicated test to assert internals of LZ4 hashtable [LUCENE-9190] [lucene]
slow-J commented on issue #10230: URL: https://github.com/apache/lucene/issues/10230#issuecomment-1825550388 Already implemented in https://github.com/apache/lucene-solr/pull/1236, this issue can be closed.
Re: [PR] Simplify advancing on postings/impacts enums [lucene]
gf2121 commented on code in PR #12838: URL: https://github.com/apache/lucene/pull/12838#discussion_r1404145355 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java: ## @@ -48,7 +48,7 @@ * * Therefore, we'll trim df before passing it to the interface. see trim(int) Review Comment: This java doc explains in detail why we need this `trim`. We need to update it :)
Re: [PR] Hide the internal data structure of HeapPointWriter [lucene]
iverase merged PR #12762: URL: https://github.com/apache/lucene/pull/12762
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
jpountz commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1404341416

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ##
@@ -303,6 +304,30 @@ public byte readByte(long pos) throws IOException {
     }
   }

+  @Override
+  public void readGroupVInt(long[] docs, int pos) throws IOException {
+    if (curSegment.byteSize() - curPosition < 17) {
+      super.readGroupVInt(docs, pos);
+      return;
+    }
+
+    final int flag = readByte() & 0xFF;
+
+    final int n1Minus1 = flag >> 6;
+    final int n2Minus1 = (flag >> 4) & 0x03;
+    final int n3Minus1 = (flag >> 2) & 0x03;
+    final int n4Minus1 = flag & 0x03;
+
+    docs[pos] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n1Minus1];
+    curPosition += 1 + n1Minus1;
+    docs[pos + 1] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n2Minus1];
+    curPosition += 1 + n2Minus1;
+    docs[pos + 2] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n3Minus1];
+    curPosition += 1 + n3Minus1;
+    docs[pos + 3] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n4Minus1];
+    curPosition += 1 + n4Minus1;
+  }

Review Comment:
   Can you add the same `catch (NullPointerException | IllegalStateException e)` that `readInt()` and other read methods have, for the case when the index input is closed?

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ##
@@ -303,6 +304,30 @@ public byte readByte(long pos) throws IOException {
     }
   }

+  @Override
+  public void readGroupVInt(long[] docs, int pos) throws IOException {
+    if (curSegment.byteSize() - curPosition < 17) {
+      super.readGroupVInt(docs, pos);
+      return;
+    }

Review Comment:
   I don't think we have a test that covers this case well at the moment.

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ##
@@ -49,6 +49,7 @@ abstract class MemorySegmentIndexInput extends IndexInput implements RandomAcces
   final int chunkSizePower;
   final Arena arena;
   final MemorySegment[] segments;
+  private static final int[] MASKS = new int[] {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF};

Review Comment:
   Maybe rename to `GROUP_VINT_MASKS` or something along these lines, now that this logic has moved to a class which is not only about group vint? Also, in general I prefer having constants before instance members in the class definition.

## lucene/core/src/java/org/apache/lucene/store/DataOutput.java: ##
@@ -29,6 +29,7 @@
  * internal state like file position).
  */
 public abstract class DataOutput {
+  BytesRef groupVIntBytes;

Review Comment:
   BytesRefBuilder feels like a better fit for how you're using it (using `length` rather than `offset` to track the number of written bytes). Also let's make it `private`?

## lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestGroupVInt.java: ##
@@ -31,9 +34,7 @@ public void testEncodeDecode() throws IOException {
     long[] values = new long[ForUtil.BLOCK_SIZE];
     long[] restored = new long[ForUtil.BLOCK_SIZE];
     final int iterations = atLeast(100);
-
-    final GroupVIntWriter w = new GroupVIntWriter();
-    byte[] encoded = new byte[(int) (Integer.BYTES * ForUtil.BLOCK_SIZE * 1.25)];
+    Directory dir = FSDirectory.open(createTempDir());

Review Comment:
   Let's use `newFSDirectory` to add coverage for all Directory implementations?
   ```suggestion
   Directory dir = newFSDirectory(createTempDir());
   ```

## lucene/core/src/java/org/apache/lucene/store/DataOutput.java: ##
@@ -324,4 +325,45 @@ public void writeSetOfStrings(Set<String> set) throws IOException {
       writeString(value);
     }
   }
+
+  /**
+   * Encode integers using group-varint. It uses VInt to encode tail values that are not enough for
+   * a group
+   *
+   * @param values the values to write
+   * @param limit the number of values to write.
+   */
+  public void writeGroupVInts(long[] values, int limit) throws IOException {
+    if (groupVIntBytes == null) {
+      // the maximum size of one group is 4 integers + 1 byte flag.
+      groupVIntBytes = new BytesRef(17);
+    }
+    int off = 0;
+
+    // encode each group
+    while ((limit - off) >= 4) {
+      byte flag = 0;
+      groupVIntBytes.offset = 1;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 6;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 4;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 2;
+      flag |= (encodeGroupValue((int) values[off++]) - 1);
+      groupVIntBytes.bytes[0] = flag;
+      writeBytes(groupVIntBytes.bytes, groupVIntBytes.offset);
+    }
+
+    // tail vints
+    for (; off < limit; off++) {
+      writeVInt((int) values[off]);

Review Comment:
   Now that we're moving this to `DataOutput`, we probably need to check these casts, e.g. with `Math.toIntExact`.

##
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
jpountz commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1825685057 And maybe `BufferedIndexInput` too for folks using `NIOFSDirectory`?
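For readers following the group-varint review above, the byte layout being discussed can be sketched in plain Java. This is only an illustration under stated assumptions: the class `GroupVarintSketch` and its `encode`/`decode` methods are hypothetical names, it works on byte arrays rather than Lucene's `DataOutput`/`DataInput`, and it is not the code from the PR. It does follow what the quoted diff shows: one flag byte holding four 2-bit "length minus one" fields, little-endian value bytes, a VInt-encoded tail for leftover values, and `Math.toIntExact` in place of the unchecked `(int)` casts, as suggested in the review.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical illustration of group-varint; not Lucene's implementation. */
final class GroupVarintSketch {

  /** Encodes the first {@code limit} values: full groups of 4, then a VInt tail. */
  static byte[] encode(long[] values, int limit) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] scratch = new byte[17]; // 1 flag byte + at most 4 * 4 value bytes
    int off = 0;
    while (limit - off >= 4) {
      int flag = 0;
      int len = 1; // scratch[0] is reserved for the flag byte
      for (int shift = 6; shift >= 0; shift -= 2) {
        // Checked narrowing: throws ArithmeticException instead of truncating.
        int v = Math.toIntExact(values[off++]);
        int numBytes = Math.max(1, (32 - Integer.numberOfLeadingZeros(v) + 7) / 8);
        for (int b = 0; b < numBytes; b++) {
          scratch[len++] = (byte) (v >>> (8 * b)); // little-endian value bytes
        }
        flag |= (numBytes - 1) << shift;
      }
      scratch[0] = (byte) flag;
      out.write(scratch, 0, len);
    }
    // Tail: fewer than 4 values left, written as plain VInts (7 bits per byte).
    for (; off < limit; off++) {
      int v = Math.toIntExact(values[off]);
      while ((v & ~0x7F) != 0) {
        out.write((v & 0x7F) | 0x80);
        v >>>= 7;
      }
      out.write(v);
    }
    return out.toByteArray();
  }

  /** Decodes {@code limit} values produced by {@link #encode}. */
  static long[] decode(byte[] in, int limit) {
    long[] values = new long[limit];
    int pos = 0;
    int i = 0;
    for (; i <= limit - 4; i += 4) {
      int flag = in[pos++] & 0xFF;
      for (int j = 0, shift = 6; j < 4; j++, shift -= 2) {
        int numBytes = ((flag >> shift) & 0x03) + 1;
        int v = 0;
        for (int b = 0; b < numBytes; b++) {
          v |= (in[pos++] & 0xFF) << (8 * b);
        }
        values[i + j] = v;
      }
    }
    for (; i < limit; i++) {
      int v = 0;
      int shift = 0;
      byte b;
      do {
        b = in[pos++];
        v |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      values[i] = v;
    }
    return values;
  }

  public static void main(String[] args) {
    long[] values = {1, 300, 70000, 16777216, 5, 6};
    byte[] encoded = encode(values, values.length);
    System.out.println(Arrays.toString(decode(encoded, values.length)));
  }
}
```

The fast path in the quoted `readGroupVInt` reads a full 4-byte little-endian int for every value and masks off the unused bytes, which is why it first checks that 17 bytes remain in the segment; the sketch reads byte by byte instead, so it needs no such guard.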
Re: [PR] Simplify advancing on postings/impacts enums [lucene]
jpountz commented on code in PR #12838: URL: https://github.com/apache/lucene/pull/12838#discussion_r1404365624 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99SkipReader.java: ## @@ -48,7 +48,7 @@ * * Therefore, we'll trim df before passing it to the interface. see trim(int) Review Comment: Indeed! OK, this change is bigger than I thought it'd be, I won't try to fold it into 9.9.
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1404387618

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ##
@@ -303,6 +304,30 @@ public byte readByte(long pos) throws IOException {
     }
   }

+  @Override
+  public void readGroupVInt(long[] docs, int pos) throws IOException {
+    if (curSegment.byteSize() - curPosition < 17) {
+      super.readGroupVInt(docs, pos);
+      return;
+    }
+
+    final int flag = readByte() & 0xFF;
+
+    final int n1Minus1 = flag >> 6;
+    final int n2Minus1 = (flag >> 4) & 0x03;
+    final int n3Minus1 = (flag >> 2) & 0x03;
+    final int n4Minus1 = flag & 0x03;
+
+    docs[pos] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n1Minus1];
+    curPosition += 1 + n1Minus1;
+    docs[pos + 1] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n2Minus1];
+    curPosition += 1 + n2Minus1;
+    docs[pos + 2] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n3Minus1];
+    curPosition += 1 + n3Minus1;
+    docs[pos + 3] = curSegment.get(LAYOUT_LE_INT, curPosition) & MASKS[n4Minus1];
+    curPosition += 1 + n4Minus1;
+  }

Review Comment:
   +1, Thanks!

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ##
@@ -49,6 +49,7 @@ abstract class MemorySegmentIndexInput extends IndexInput implements RandomAcces
   final int chunkSizePower;
   final Arena arena;
   final MemorySegment[] segments;
+  private static final int[] MASKS = new int[] {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF};

Review Comment:
   +1, great suggestion!

## lucene/core/src/java/org/apache/lucene/store/DataOutput.java: ##
@@ -29,6 +29,7 @@
  * internal state like file position).
  */
 public abstract class DataOutput {
+  BytesRef groupVIntBytes;

Review Comment:
   +1, Thanks for the suggestion!

## lucene/core/src/java/org/apache/lucene/store/DataOutput.java: ##
@@ -324,4 +325,45 @@ public void writeSetOfStrings(Set<String> set) throws IOException {
       writeString(value);
     }
   }
+
+  /**
+   * Encode integers using group-varint. It uses VInt to encode tail values that are not enough for
+   * a group
+   *
+   * @param values the values to write
+   * @param limit the number of values to write.
+   */
+  public void writeGroupVInts(long[] values, int limit) throws IOException {
+    if (groupVIntBytes == null) {
+      // the maximum size of one group is 4 integers + 1 byte flag.
+      groupVIntBytes = new BytesRef(17);
+    }
+    int off = 0;
+
+    // encode each group
+    while ((limit - off) >= 4) {
+      byte flag = 0;
+      groupVIntBytes.offset = 1;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 6;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 4;
+      flag |= (encodeGroupValue((int) values[off++]) - 1) << 2;
+      flag |= (encodeGroupValue((int) values[off++]) - 1);
+      groupVIntBytes.bytes[0] = flag;
+      writeBytes(groupVIntBytes.bytes, groupVIntBytes.offset);
+    }
+
+    // tail vints
+    for (; off < limit; off++) {
+      writeVInt((int) values[off]);

Review Comment:
   Good idea, I like that!

## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ##
@@ -303,6 +304,30 @@ public byte readByte(long pos) throws IOException {
     }
   }

+  @Override
+  public void readGroupVInt(long[] docs, int pos) throws IOException {
+    if (curSegment.byteSize() - curPosition < 17) {
+      super.readGroupVInt(docs, pos);
+      return;
+    }

Review Comment:
   In `TestGroupVInt#testEncodeDecode` we use a `bpv` range of [1-31] and a `numValues` range of [1-128]. For instance, if `bpv==2` and `numValues==4`, won't it cover this case?

## lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestGroupVInt.java: ##
@@ -31,9 +34,7 @@ public void testEncodeDecode() throws IOException {
     long[] values = new long[ForUtil.BLOCK_SIZE];
     long[] restored = new long[ForUtil.BLOCK_SIZE];
     final int iterations = atLeast(100);
-
-    final GroupVIntWriter w = new GroupVIntWriter();
-    byte[] encoded = new byte[(int) (Integer.BYTES * ForUtil.BLOCK_SIZE * 1.25)];
+    Directory dir = FSDirectory.open(createTempDir());

Review Comment:
   +1

## lucene/core/src/java/org/apache/lucene/store/DataInput.java: ##
@@ -98,6 +98,55 @@ public int readInt() throws IOException {
     return ((b4 & 0xFF) << 24) | ((b3 & 0xFF) << 16) | ((b2 & 0xFF) << 8) | (b1 & 0xFF);
   }

+  /**
+   * Read all the group varints, including the tail vints.
+   *
+   * @param docs the array to read ints into.
+   * @param limit the number of int values to read.
+   */
+  public void readGroupVInts(long[] docs, int limit) throws IOException {
+    int i;
+    for (i = 0; i <= limit - 4; i += 4) {
+      readGroupVInt(docs, i);
+    }
+    for (; i < limit; ++i) {
+      docs[i] = readVInt();
+    }
+  }
+
+  /
[PR] Use group-varint encode the positions [lucene]
easyice opened a new pull request, #12842: URL: https://github.com/apache/lucene/pull/12842 Thanks for the suggestion from @jpountz, as discussed in https://github.com/apache/lucene/issues/12826. This PR uses group-varint to encode some vint values if `storeOffsets` is true. It still uses the `GroupVIntReader` and `GroupVIntWriter` classes; I will update it after https://github.com/apache/lucene/pull/12841 is finished. Currently I don't use group-vint if `(storeOffsets==false && storePayload==false)`, which means only `token` is stored, because I'm worried that it will use extra memory when bulk decoding. Feel free to correct me. I'll add the benchmark and file size changes next week.
Re: [I] Use LinkedList instead of manual array re-sizing for better throughput. [LUCENE-9432] [lucene]
slow-J commented on issue #10472:
URL: https://github.com/apache/lucene/issues/10472#issuecomment-1825871537

I took a quick look at this 3 years on. I took @mohammadsadiq's patch and applied it to `IDVersionSegmentTermsEnum` and `OrdsSegmentTermsEnum`. I then changed the LinkedList to an ArrayDeque. I ran 2 benchmarks, both with wikibigall and JDK19 on an m5.12xlarge EC2 host.

Test 1: https://github.com/slow-J/lucene/commit/e2f5e745f6523688f8bdff09e901aa346ac14d57
```
Task                          QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff              p-value
IntNRQ                              411.92   (3.2%)                   399.44   (4.1%)   -3.0% ( -10% -  4%)  0.009
HighPhrase                           39.83   (8.4%)                    38.94  (11.2%)   -2.2% ( -20% - 18%)  0.473
AndHighMed                          123.34   (2.6%)                   121.53   (2.1%)   -1.5% (  -6% -  3%)  0.051
OrHighHigh                           53.33   (5.7%)                    52.57   (1.7%)   -1.4% (  -8% -  6%)  0.282
AndHighHigh                          42.16   (4.3%)                    41.69   (1.1%)   -1.1% (  -6% -  4%)  0.260
OrNotHighMed                        247.13   (3.0%)                   244.44   (3.6%)   -1.1% (  -7% -  5%)  0.295
MedSloppyPhrase                      11.55   (7.5%)                    11.44   (6.8%)   -0.9% ( -14% - 14%)  0.681
MedPhrase                            70.95   (3.5%)                    70.35   (3.6%)   -0.8% (  -7% -  6%)  0.451
HighSloppyPhrase                     30.95   (4.9%)                    30.75   (4.2%)   -0.6% (  -9% -  8%)  0.653
BrowseMonthSSDVFacets                21.44  (11.2%)                    21.34  (13.2%)   -0.5% ( -22% - 26%)  0.907
BrowseDayOfYearSSDVFacets            21.19  (15.9%)                    21.10  (15.0%)   -0.4% ( -27% - 36%)  0.927
LowSloppyPhrase                      50.43   (4.3%)                    50.21   (4.0%)   -0.4% (  -8% -  8%)  0.738
HighTermTitleSort                   105.90   (1.6%)                   105.44   (2.1%)   -0.4% (  -4% -  3%)  0.453
Respell                              29.29   (1.3%)                    29.17   (1.4%)   -0.4% (  -3% -  2%)  0.339
PKLookup                            173.09   (2.1%)                   172.41   (1.8%)   -0.4% (  -4% -  3%)  0.527
OrNotHighLow                        882.21   (2.7%)                   879.44   (3.3%)   -0.3% (  -6% -  5%)  0.744
MedIntervalsOrdered                  28.81   (3.4%)                    28.72   (3.4%)   -0.3% (  -6% -  6%)  0.779
Prefix3                              60.35   (3.3%)                    60.18   (3.4%)   -0.3% (  -6% -  6%)  0.800
BrowseMonthTaxoFacets                13.18   (1.0%)                    13.15   (1.5%)   -0.2% (  -2% -  2%)  0.578
Fuzzy2                               52.06   (1.4%)                    51.95   (1.3%)   -0.2% (  -2% -  2%)  0.630
Fuzzy1                               72.08   (1.1%)                    71.93   (1.3%)   -0.2% (  -2% -  2%)  0.595
AndHighLow                          635.21   (1.9%)                   633.97   (2.5%)   -0.2% (  -4% -  4%)  0.785
Wildcard                             75.55   (2.3%)                    75.42   (1.9%)   -0.2% (  -4% -  4%)  0.807
OrNotHighHigh                       140.51   (4.8%)                   140.33   (4.9%)   -0.1% (  -9% - 10%)  0.936
AndHighMedDayTaxoFacets              44.82   (1.7%)                    44.76   (2.3%)   -0.1% (  -4% -  3%)  0.847
HighTermDayOfYearSort               345.49   (1.7%)                   345.27   (1.5%)   -0.1% (  -3% -  3%)  0.898
BrowseRandomLabelTaxoFacets          11.78   (0.7%)                    11.77   (0.8%)   -0.0% (  -1% -  1%)  0.872
LowSpanNear                          45.65   (1.0%)                    45.64   (1.3%)   -0.0% (  -2% -  2%)  0.929
AndHighHighDayTaxoFacets             15.07   (2.5%)                    15.07   (2.8%)   -0.0% (  -5% -  5%)  0.971
HighTermTitleBDVSort                 15.95   (5.4%)                    15.95   (5.9%)   -0.0% ( -10% - 11%)  0.987
MedSpanNear                          10.13   (1.2%)                    10.13   (1.2%)   -0.0% (  -2% -  2%)  0.959
BrowseDateTaxoFacets                 13.34   (0.6%)                    13.34   (0.9%)   -0.0% (  -1% -  1%)  0.970
OrHighMed                           220.09   (2.8%)                   220.08   (2.2%)   -0.0% (  -4% -  5%)  0.999
HighIntervalsOrdered                  7.05   (2.5%)                     7.05   (2.6%)    0.0% (  -4% -  5%)  1.000
HighTermMonthSort                  2701.34   (3.4%)                  2701.37   (3.2%)    0.0% (  -6% -  6%)  0.999
LowIntervalsOrdered                  40.23   (2.2%)                    40.24   (2.3%)    0.0% (  -4% -  4%)  0.979
BrowseDateSSDVFacets                  4.68   (4.2%)                     4.68   (4.3%)    0.0% (  -8% -  8%)  0.984
OrHighNotHigh                       152.28   (4.4%)                   152.38   (4.6%)    0.1% (  -8% -  9
Re: [PR] Use group-varint encode the positions [lucene]
jpountz commented on PR #12842: URL: https://github.com/apache/lucene/pull/12842#issuecomment-1825874597 Thanks for looking. Unfortunately, the case I'm most interested in is when `storeOffsets` is false and there are no payloads, since this is the default. :)
Re: [PR] Faster prefix sum for bitsPerValue up to 9. [lucene]
jpountz commented on PR #12843:
URL: https://github.com/apache/lucene/pull/12843#issuecomment-1825884854

luceneutil doesn't see a noticeable difference (all p-values are high) but the micro-benchmark that is attached to this PR seems to see an improvement:
```
main

Benchmark                            (bpv)   Mode  Cnt   Score   Error   Units
ForUtilBenchmark.decodeAndPrefixSum      6  thrpt   25  18.762 ± 0.739  ops/us
ForUtilBenchmark.decodeAndPrefixSum      7  thrpt   25  18.075 ± 0.220  ops/us
ForUtilBenchmark.decodeAndPrefixSum      8  thrpt   25  21.040 ± 0.285  ops/us
ForUtilBenchmark.decodeAndPrefixSum      9  thrpt   25  16.790 ± 0.896  ops/us
ForUtilBenchmark.decodeAndPrefixSum     10  thrpt   25  17.441 ± 1.260  ops/us
ForUtilBenchmark.decodeAndPrefixSum     11  thrpt   25  16.697 ± 0.883  ops/us

PR

Benchmark                            (bpv)   Mode  Cnt   Score   Error   Units
ForUtilBenchmark.decodeAndPrefixSum      6  thrpt   25  19.171 ± 0.277  ops/us
ForUtilBenchmark.decodeAndPrefixSum      7  thrpt   25  18.875 ± 0.203  ops/us
ForUtilBenchmark.decodeAndPrefixSum      8  thrpt   25  22.075 ± 0.497  ops/us
ForUtilBenchmark.decodeAndPrefixSum      9  thrpt   25  18.689 ± 0.792  ops/us
ForUtilBenchmark.decodeAndPrefixSum     10  thrpt   25  17.696 ± 0.252  ops/us
ForUtilBenchmark.decodeAndPrefixSum     11  thrpt   25  16.623 ± 0.856  ops/us
```
Re: [I] Grow arrays up to a given limit to avoid overallocation where possible [lucene]
jpountz commented on issue #12839: URL: https://github.com/apache/lucene/issues/12839#issuecomment-1825928704 If I'm not mistaken, the `NeighborArray` class we use for vector search may have similar needs (it should probably not size its data structure to `maxSize` in the constructor?).
Re: [I] MultiSimilarity.MultiSimScorer should sum up scores into a double [lucene]
jpountz closed issue #12675: MultiSimilarity.MultiSimScorer should sum up scores into a double URL: https://github.com/apache/lucene/issues/12675
Re: [I] MultiSimilarity.MultiSimScorer should sum up scores into a double [lucene]
jpountz commented on issue #12675: URL: https://github.com/apache/lucene/issues/12675#issuecomment-1825930715 @shubhamvishu Yes indeed!
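As background on why the issue title asks for summing scores into a double, here is a small self-contained sketch. It assumes nothing about `MultiSimilarity` internals: the class `ScoreSumSketch` and both methods are hypothetical, and the input values are chosen only to make float rounding visible. It shows that accumulating in a `float` can drop small addends that a `double` accumulator keeps until the final cast.

```java
/** Hypothetical illustration of float vs. double accumulation; not Lucene code. */
final class ScoreSumSketch {

  /** Accumulates in float: each partial sum is rounded to float precision. */
  static float sumIntoFloat(float[] scores) {
    float sum = 0f;
    for (float score : scores) {
      sum += score;
    }
    return sum;
  }

  /** Accumulates in double and narrows once at the end. */
  static float sumIntoDouble(float[] scores) {
    double sum = 0d;
    for (float score : scores) {
      sum += score;
    }
    return (float) sum;
  }

  public static void main(String[] args) {
    // 2^24 is where float stops representing every integer exactly.
    float[] scores = {16777216f, 1f, 1f};
    System.out.println(sumIntoFloat(scores));
    System.out.println(sumIntoDouble(scores));
  }
}
```

With these inputs the float accumulator stays at 2^24 because each `+ 1f` rounds back down, while the double accumulator reaches 2^24 + 2, which is exactly representable as a float after the final cast.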
Re: [PR] Faster prefix sum for bitsPerValue up to 9. [lucene]
jpountz commented on PR #12843: URL: https://github.com/apache/lucene/pull/12843#issuecomment-1826052610 Actually we can do even better by better tuning the disk layout for the prefix sum. Converting this PR to a draft until this is implemented.
Re: [I] Grow arrays up to a given limit to avoid overallocation where possible [lucene]
stefanvodita commented on issue #12839: URL: https://github.com/apache/lucene/issues/12839#issuecomment-1826057193 Thank you for the pointer @jpountz! I'll put together a PR.
Re: [PR] Improve set deletions percentage javadoc [lucene]
yugushihuang commented on code in PR #12828: URL: https://github.com/apache/lucene/pull/12828#discussion_r1404662302 ## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ## @@ -150,9 +150,10 @@ public double getMaxMergedSegmentMB() { } /** - * Controls the maximum percentage of deleted documents that is tolerated in the index. Lower - * values make the index more space efficient at the expense of increased CPU and I/O activity. - * Values must be between 5 and 50. Default value is 20. + * Sets the maximum percentage of deleted documents that is tolerated in the index. The Review Comment: Thanks for the review, I will modify the wording.
[PR] Introduce growInRange to reduce array overallocation [lucene]
stefanvodita opened a new pull request, #12844: URL: https://github.com/apache/lucene/pull/12844 In cases where we know there is an upper limit to the potential size of an array, we can use `growInRange` to avoid allocating beyond that limit. We address such cases in `DirectoryTaxonomyReader` and `NeighborArray`. Closes #12839
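The idea described in this PR can be sketched as follows. Only the name `growInRange` comes from the message above; the signature, the 1/8 over-allocation factor, and the class `GrowInRangeSketch` are assumptions made for illustration and may differ from what the PR actually adds. The point is simply that the usual "grow with some headroom" step is clamped to a known maximum length.

```java
import java.util.Arrays;

/** Hypothetical illustration of bounded array growth; not the code from the PR. */
final class GrowInRangeSketch {

  /**
   * Returns an array with room for at least minLength elements, over-allocating
   * by roughly 1/8 to amortize future growth, but never exceeding maxLength.
   * The input array is returned unchanged if it is already large enough.
   */
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    if (minLength > maxLength) {
      throw new IllegalArgumentException(
          "minLength " + minLength + " exceeds maxLength " + maxLength);
    }
    if (array.length >= minLength) {
      return array;
    }
    long oversized = (long) minLength + (minLength >>> 3); // assumed growth factor
    int newLength = (int) Math.min(maxLength, oversized);
    return Arrays.copyOf(array, newLength);
  }

  public static void main(String[] args) {
    int[] neighbors = new int[4];
    // Unbounded growth would allocate 112 slots here; the limit caps it at 105.
    neighbors = growInRange(neighbors, 100, 105);
    System.out.println(neighbors.length);
  }
}
```

A caller that knows its hard cap up front (for example a structure holding at most `maxSize` entries) would pass that cap as `maxLength` each time it needs more room, so it can start small without ever allocating past the cap.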
Re: [I] Grow arrays up to a given limit to avoid overallocation where possible [lucene]
stefanvodita commented on issue #12839: URL: https://github.com/apache/lucene/issues/12839#issuecomment-1826125298 I added the new method and used it for `DirectoryTaxonomyReader` and `NeighborArray` (#12844). There might be other places where it makes sense to use, but I thought it best to get some feedback before going and hunting down more of those cases.
Re: [PR] Use group-varint encode the positions [lucene]
easyice commented on PR #12842: URL: https://github.com/apache/lucene/pull/12842#issuecomment-1826180124 Thanks for your suggestion. I'm thinking about that too, and I will continue working on this.