Re: [PR] [Draft] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]
kuramitsu commented on PR #12885: URL: https://github.com/apache/lucene/pull/12885#issuecomment-1846729364 The modification within the getRomanization function has been dropped. Instead, in the incrementToken function, I added a process to treat the hiragana OOV term converted to kataka as the reading when it appears. This processing allows for correct handling even when the Romaji option is false. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
jpountz commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846774383 I'll check if there is GC activity during the benchmark. In the meantime, I looked into using lambdas instead, and it seems like it would work well: https://github.com/apache/lucene/commit/8109e639508b6875d657e84634dd27c58942f610 ``` Benchmark(size) Mode Cnt Score Error Units GroupVIntBenchmark.mmap_byteBufferReadGroupVInt 64 thrpt5 13.174 ± 0.870 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline 64 thrpt5 6.834 ± 0.021 ops/us ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
jpountz commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846779641 I confirmed there's GC activity happening with the slice approach by using `-prof gc`: ``` Benchmark (size) Mode Cnt ScoreError Units GroupVIntBenchmark.mmap_byteBufferReadGroupVInt 64 thrpt5 11.482 ± 0.727 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.alloc.rate 64 thrpt5 0.001 ± 0.001 MB/sec GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.alloc.rate.norm 64 thrpt5 ≈ 10⁻⁴ B/op GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.count 64 thrpt5≈ 0 counts GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline 64 thrpt5 6.613 ± 0.504 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline:gc.alloc.rate 64 thrpt5 0.001 ± 0.001 MB/sec GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline:gc.alloc.rate.norm 64 thrpt5 ≈ 10⁻⁴ B/op GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline:gc.count 64 thrpt5≈ 0 counts GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaByteBuffer 64 thrpt5 0.153 ± 0.002 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaByteBuffer:gc.alloc.rate 64 thrpt5 2391.134 ± 25.909 MB/sec GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaByteBuffer:gc.alloc.rate.norm 64 thrpt5 16432.009 ± 0.001B/op GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaByteBuffer:gc.count 64 thrpt5202.000 counts GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaByteBuffer:gc.time 64 thrpt5129.000 ms ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846784038 +1 for gc overhead, here is the gc output (`-prof gc` ): ``` Benchmark (size) Mode Cnt Score Error Units GroupVIntBenchmark.mmap_byteBufferReadGroupVIntUseByteBuffer 64 thrpt5 1.873 ± 0.269 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVIntUseByteBuffer:gc.alloc.rate 64 thrpt5 4800.320 ± 690.606 MB/sec GroupVIntBenchmark.mmap_byteBufferReadGroupVIntUseByteBuffer:gc.alloc.rate.norm 64 thrpt5 2688.001 ± 0.001B/op GroupVIntBenchmark.mmap_byteBufferReadGroupVIntUseByteBuffer:gc.count 64 thrpt5 273.000counts GroupVIntBenchmark.mmap_byteBufferReadGroupVIntUseByteBuffer:gc.time 64 thrpt5 218.000ms GroupVIntBenchmark.mmap_byteBufferReadGroupVInt 64 thrpt5 8.118 ± 5.922 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.alloc.rate 64 thrpt5 0.001 ± 0.001 MB/sec GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.alloc.rate.norm 64 thrpt5≈ 10⁻⁴ B/op GroupVIntBenchmark.mmap_byteBufferReadGroupVInt:gc.count 64 thrpt5 ≈ 0counts ``` > I looked into using lambdas instead, and it seems like it would work well: It's so cool! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846881254 I would still be safe and initialize the IntReader on construction of the IndexInput. It can strongly bind to the current segment. Can we do the same for all other inputs? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846883907 I will nag Maurizio again about the problem with slice(). The reason for this was some strange problem with Hotspot. I thought they fixed it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Correct last remaining instances of typo e.g. "Levenstein" -> "Levenshtein" [lucene]
shaikhu commented on PR #12519: URL: https://github.com/apache/lucene/pull/12519#issuecomment-1846884628 Oops I completely forgot about this. Restored forked repo and reopening. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1846907147 > Can we do the same for all other inputs? I think so, i will do this if @jpountz doesn't mind. > I will nag Maurizio again about the problem with slice(). Thank you so much for pinpointing the cause of the problem! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Enable CheckIndex to exorcise segments with missing segment infos (.si) (#7820) [lucene]
gokaai commented on code in PR #12872: URL: https://github.com/apache/lucene/pull/12872#discussion_r1420277534 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -957,6 +974,9 @@ private Status.SegmentInfoStatus testSegment( SegmentReader reader = null; try { + if (info.info.maxDoc() == 0) { +throw new CheckIndexException(" this segment has missing or corrupt segment info"); Review Comment: > Remove the extra space before this I was keeping it consistent with [this exception message](https://github.com/apache/lucene/blob/ba81826951c1f628efb60c6165add27601173bd9/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java#L949) - should I change this one too? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1847060669 I'm running the performance differences between previous commit, it will take a moment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420364348 ## lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java: ## @@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOException { assertArrayEquals(expected, actual); } } + + public void testDataTypes() throws IOException { +final long[] values = new long[] {43, 12345, 123456, 1234567890}; +try (Directory dir = getDirectory(createTempDir("testDataTypes"))) { + IndexOutput out = dir.createOutput("test", IOContext.DEFAULT); + out.writeByte((byte) 43); + out.writeShort((short) 12345); + out.writeInt(1234567890); + out.writeGroupVInts(values, 4); + out.writeLong(1234567890123456789L); + out.close(); + + long[] restored = new long[4]; + IndexInput in = dir.openInput("test", IOContext.DEFAULT); + assertEquals(43, in.readByte()); + assertEquals(12345, in.readShort()); + assertEquals(1234567890, in.readInt()); + in.readGroupVInts(restored, 4); + assertArrayEquals(values, restored); + assertEquals(1234567890123456789L, in.readLong()); + in.close(); +} + } + + public void testGroupVInt() throws IOException { +try (Directory dir = getDirectory(createTempDir("testGroupVInt"))) { + // test fallback to default implementation of readGroupVInt + doTestGroupVInt(dir, 5, 1, 6, 8); + + // use more iterations to covers all bpv + doTestGroupVInt(dir, atLeast(100), 1, 31, 128); + + // we use BaseChunkedDirectoryTestCase#testGroupVIntMultiBlocks cover multiple blocks for + // ByteBuffersDataInput and MMapDirectory +} + } + + protected void doTestGroupVInt( + Directory dir, int iterations, int minBpv, int maxBpv, int maxNumValues) throws IOException { +long[] values = new long[maxNumValues]; +long[] restored = new long[maxNumValues]; + +for (int i = 0; i < iterations; i++) { Review Comment: I am a bit afraid that those tests kill my SSD at some point (also Jenkin's SSD). This creates files over and over. With test directories its not a problem, but with real FDDirectory ones we should reduce the iteration counts. The chunked test should clearly setup a directory with a random chunk size once and run all tests on it. Ideally write only one file with multiple variants of groupVints behind each other (so it definitely crosses borders) and then read all of them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420364348 ## lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java: ## @@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOException { assertArrayEquals(expected, actual); } } + + public void testDataTypes() throws IOException { +final long[] values = new long[] {43, 12345, 123456, 1234567890}; +try (Directory dir = getDirectory(createTempDir("testDataTypes"))) { + IndexOutput out = dir.createOutput("test", IOContext.DEFAULT); + out.writeByte((byte) 43); + out.writeShort((short) 12345); + out.writeInt(1234567890); + out.writeGroupVInts(values, 4); + out.writeLong(1234567890123456789L); + out.close(); + + long[] restored = new long[4]; + IndexInput in = dir.openInput("test", IOContext.DEFAULT); + assertEquals(43, in.readByte()); + assertEquals(12345, in.readShort()); + assertEquals(1234567890, in.readInt()); + in.readGroupVInts(restored, 4); + assertArrayEquals(values, restored); + assertEquals(1234567890123456789L, in.readLong()); + in.close(); +} + } + + public void testGroupVInt() throws IOException { +try (Directory dir = getDirectory(createTempDir("testGroupVInt"))) { + // test fallback to default implementation of readGroupVInt + doTestGroupVInt(dir, 5, 1, 6, 8); + + // use more iterations to covers all bpv + doTestGroupVInt(dir, atLeast(100), 1, 31, 128); + + // we use BaseChunkedDirectoryTestCase#testGroupVIntMultiBlocks cover multiple blocks for + // ByteBuffersDataInput and MMapDirectory +} + } + + protected void doTestGroupVInt( + Directory dir, int iterations, int minBpv, int maxBpv, int maxNumValues) throws IOException { +long[] values = new long[maxNumValues]; +long[] restored = new long[maxNumValues]; + +for (int i = 0; i < iterations; i++) { Review Comment: I am a bit afraid that those tests kill my SSD at some point (also Jenkin's SSD). This creates files over and over. With test directories its not a problem, but with real FSDirectory ones we should reduce the iteration counts. The chunked test should clearly setup a directory with a random chunk size once and run all tests on it. Ideally write only one file with multiple variants of groupVints behind each other (so it definitely crosses borders) and then read all of them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420373977 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -62,4 +62,42 @@ private static long readLongInGroup(DataInput in, int numBytesMinus1) throws IOE return in.readInt() & 0xL; } } + + /** + * Provides an abstraction for read int values, so that decoding logic can be reused in different DataInput. + * + */ + public static interface IntReader { Review Comment: add `@FunctionalInterface` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420375351 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -62,4 +62,42 @@ private static long readLongInGroup(DataInput in, int numBytesMinus1) throws IOE return in.readInt() & 0xL; } } + + /** + * Provides an abstraction for read int values, so that decoding logic can be reused in different DataInput. + * + */ + public static interface IntReader { +int read(long v); + } + + /** + * Faster implementation of read single group, It read values from the buffer that would not cross + * boundaries. + * + * @param flag the flag of group varint. + * @param reader the supplier of read int. + * @param dst the array to read ints into. Review Comment: parameter "pos" is missing in javadocs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420377148 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws IOException { try { final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF; - - final int n1Minus1 = flag >> 6; - final int n2Minus1 = (flag >> 4) & 0x03; - final int n3Minus1 = (flag >> 2) & 0x03; - final int n4Minus1 = flag & 0x03; - - dst[offset] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n1Minus1]; - curPosition += 1 + n1Minus1; - dst[offset + 1] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n2Minus1]; - curPosition += 1 + n2Minus1; - dst[offset + 2] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n3Minus1]; - curPosition += 1 + n3Minus1; - dst[offset + 3] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n4Minus1]; - curPosition += 1 + n4Minus1; + curPosition += + GroupVIntUtil.readGroupVInt( + flag, p -> curSegment.get(LAYOUT_LE_INT, p), curPosition, dst, offset); Review Comment: isn't the +1 missing here like above? This is why I'd like to have a test that reads a vint group and other data from the same input placed behind each other! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420379683 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -303,6 +304,34 @@ public byte readByte(long pos) throws IOException { } } + @Override + public void readGroupVInts(long[] dst, int limit) throws IOException { +int i; +for (i = 0; i <= limit - 4; i += 4) { + readGroupVInt(dst, i); +} +for (; i < limit; ++i) { + dst[i] = readVInt(); +} + } + + private void readGroupVInt(long[] dst, int offset) throws IOException { +final MemorySegment curSegment = this.curSegment; Review Comment: This part MUST also be inside they try-catch, as it would throw NPE or IllegalStateException when the input was closed! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420382794 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -303,6 +304,48 @@ public byte readByte(long pos) throws IOException { } } + @Override + public void readGroupVInts(long[] dst, int limit) throws IOException { +int i; +for (i = 0; i <= limit - 4; i += 4) { + readGroupVInt(dst, i); +} +for (; i < limit; ++i) { + dst[i] = readVInt(); +} + } + + private void readGroupVInt(long[] dst, int offset) throws IOException { +if (curSegment.byteSize() - curPosition < GroupVIntUtil.MAX_LENGTH_PER_GROUP) { + GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset); + return; +} + Review Comment: the bigger problem here is that it is not in the try-catch block. when the indexinput is closed this fails with NPE or IllegalStateException. Everything in this method must be in the try-catch. Look at othe rmethods, the NPE/ISE catch block always covers everything. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1847102122 It looks good on `byteBuffers` and `MMapDirectory`, the benchmark result is pretty close to previous commit, but a bit slowdon on `NIOFSDirectory`, i will dig it. * `*ReadGroupVInt` the current commit * `*_viaDupCode` the previous commit, using duplicated code. * `*Baseline` no change. java17 ``` Benchmark (size) Mode Cnt Score Error Units GroupVIntBenchmark.byteBuffersReadGroupVInt 64 thrpt 5 5.156 ± 0.378 ops/us GroupVIntBenchmark.byteBuffersReadGroupVIntBaseline 64 thrpt 5 2.130 ± 0.039 ops/us GroupVIntBenchmark.byteBuffersReadGroupVInt_viaDupCode 64 thrpt 5 5.455 ± 0.164 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVInt 64 thrpt 5 7.632 ± 0.377 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline 64 thrpt 5 7.376 ± 0.485 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaDupCode 64 thrpt 5 7.408 ± 0.532 ops/us GroupVIntBenchmark.nioReadGroupVInt 64 thrpt 5 6.905 ± 0.589 ops/us GroupVIntBenchmark.nioReadGroupVIntBaseline 64 thrpt 5 5.574 ± 0.216 ops/us GroupVIntBenchmark.nioReadGroupVInt_viaDupCode 64 thrpt 5 9.014 ± 0.162 ops/us ``` java21 ``` Benchmark (size) Mode Cnt Score Error Units GroupVIntBenchmark.byteBuffersReadGroupVInt 64 thrpt 5 5.539 ± 0.220 ops/us GroupVIntBenchmark.byteBuffersReadGroupVIntBaseline 64 thrpt 5 1.825 ± 0.055 ops/us GroupVIntBenchmark.byteBuffersReadGroupVInt_viaDupCode 64 thrpt 5 5.543 ± 0.097 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVInt 64 thrpt 5 9.428 ± 0.189 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVIntBaseline 64 thrpt 5 5.565 ± 0.080 ops/us GroupVIntBenchmark.mmap_byteBufferReadGroupVInt_viaDupCode 64 thrpt 5 10.476 ± 1.343 ops/us GroupVIntBenchmark.nioReadGroupVInt 64 thrpt 5 8.434 ± 1.887 ops/us GroupVIntBenchmark.nioReadGroupVIntBaseline 64 thrpt 5 5.431 ± 0.072 ops/us GroupVIntBenchmark.nioReadGroupVInt_viaDupCode 64 thrpt 5 10.723 ± 0.399 ops/us ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Removing @lucene.experimental tags in testXXX methods in CheckIndex [lucene]
slow-J opened a new pull request, #12893: URL: https://github.com/apache/lucene/pull/12893 Following up on @mikemccand's comment in previous CheckIndex PR:https://github.com/apache/lucene/pull/12876. > I do think some of these tags in CheckIndex.java could be removed, e.g. on each of the testXXX methods? But likely we should leave it on each of the XXXStatus returned classes, so we are free to break/change what metadata these classes return? And let's do it separately -- this PR looks like a great cleanup already. Progress not perfection! Targetting Lucene 9.10. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420387856 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -303,6 +304,34 @@ public byte readByte(long pos) throws IOException { } } + @Override + public void readGroupVInts(long[] dst, int limit) throws IOException { +int i; +for (i = 0; i <= limit - 4; i += 4) { + readGroupVInt(dst, i); +} +for (; i < limit; ++i) { + dst[i] = readVInt(); +} + } + + private void readGroupVInt(long[] dst, int offset) throws IOException { +final MemorySegment curSegment = this.curSegment; Review Comment: please add a test to the exitsing closed tests, reading readGroupVInt from an already closed IndexInput -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420418454 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util; + +import java.io.IOException; +import org.apache.lucene.store.DataInput; + +/** + * This class contains utility methods and constants for group varint + * + * @lucene.internal + */ +public final class GroupVIntUtil { + // the maximum length of a single group-varint is 4 integers + 1 byte flag. + public static final int MAX_LENGTH_PER_GROUP = 17; + private static final int[] MASKS = new int[] {0xFF, 0x, 0xFF, 0x}; + + /** + * Default implementation of read single group, for optimal performance, you should use {@link + * DataInput#readGroupVInts(long[], int)} instead. + * + * @param dst the array to read ints into. + * @param offset the offset in the array to start storing ints. + */ + public static void readGroupVInt(DataInput in, long[] dst, int offset) throws IOException { +final int flag = in.readByte() & 0xFF; + +final int n1Minus1 = flag >> 6; +final int n2Minus1 = (flag >> 4) & 0x03; +final int n3Minus1 = (flag >> 2) & 0x03; +final int n4Minus1 = flag & 0x03; + +dst[offset] = readLongInGroup(in, n1Minus1); +dst[offset + 1] = readLongInGroup(in, n2Minus1); +dst[offset + 2] = readLongInGroup(in, n3Minus1); +dst[offset + 3] = readLongInGroup(in, n4Minus1); + } + + private static long readLongInGroup(DataInput in, int numBytesMinus1) throws IOException { +switch (numBytesMinus1) { + case 0: +return in.readByte() & 0xFFL; + case 1: +return in.readShort() & 0xL; + case 2: +return (in.readShort() & 0xL) | ((in.readByte() & 0xFFL) << 16); + default: +return in.readInt() & 0xL; +} + } + + /** + * Provides an abstraction for read int values, so that decoding logic can be reused in different + * DataInput. + */ + public static interface IntReader { +int read(long v); + } + + /** + * Faster implementation of read single group, It read values from the buffer that would not cross + * boundaries. + * + * @param flag the flag of group varint. + * @param reader the supplier of read int. + * @param dst the array to read ints into. + * @param offset the offset in the array to start storing ints. + * @return the number of bytes read, it is a positive number and less than {@link + * #MAX_LENGTH_PER_GROUP} + */ + public static int readGroupVInt(int flag, IntReader reader, long pos, long[] dst, int offset) + throws IOException { Review Comment: @uschindler should we also throws `NullPointerException`, `IllegalStateException` exception that maybe happen in `MemorySegmentIndexInput`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420447123 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util; + +import java.io.IOException; +import org.apache.lucene.store.DataInput; + +/** + * This class contains utility methods and constants for group varint + * + * @lucene.internal + */ +public final class GroupVIntUtil { + // the maximum length of a single group-varint is 4 integers + 1 byte flag. + public static final int MAX_LENGTH_PER_GROUP = 17; + private static final int[] MASKS = new int[] {0xFF, 0x, 0xFF, 0x}; + + /** + * Default implementation of read single group, for optimal performance, you should use {@link + * DataInput#readGroupVInts(long[], int)} instead. + * + * @param dst the array to read ints into. + * @param offset the offset in the array to start storing ints. + */ + public static void readGroupVInt(DataInput in, long[] dst, int offset) throws IOException { +final int flag = in.readByte() & 0xFF; + +final int n1Minus1 = flag >> 6; +final int n2Minus1 = (flag >> 4) & 0x03; +final int n3Minus1 = (flag >> 2) & 0x03; +final int n4Minus1 = flag & 0x03; + +dst[offset] = readLongInGroup(in, n1Minus1); +dst[offset + 1] = readLongInGroup(in, n2Minus1); +dst[offset + 2] = readLongInGroup(in, n3Minus1); +dst[offset + 3] = readLongInGroup(in, n4Minus1); + } + + private static long readLongInGroup(DataInput in, int numBytesMinus1) throws IOException { +switch (numBytesMinus1) { + case 0: +return in.readByte() & 0xFFL; + case 1: +return in.readShort() & 0xL; + case 2: +return (in.readShort() & 0xL) | ((in.readByte() & 0xFFL) << 16); + default: +return in.readInt() & 0xL; +} + } + + /** + * Provides an abstraction for read int values, so that decoding logic can be reused in different + * DataInput. + */ + public static interface IntReader { +int read(long v); + } + + /** + * Faster implementation of read single group, It read values from the buffer that would not cross + * boundaries. + * + * @param flag the flag of group varint. + * @param reader the supplier of read int. + * @param dst the array to read ints into. + * @param offset the offset in the array to start storing ints. + * @return the number of bytes read, it is a positive number and less than {@link + * #MAX_LENGTH_PER_GROUP} + */ + public static int readGroupVInt(int flag, IntReader reader, long pos, long[] dst, int offset) + throws IOException { Review Comment: No. Why? It is an unchecked exception. The NPE and ISE are mmap special case (also for the legacy mmap impl). Please only cover the whole method in memorysegment's input by the try-catch. If anywhere while the method is executed we get an NPE the input is closed. No need to do anything outside. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420451580 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util; + +import java.io.IOException; +import org.apache.lucene.store.DataInput; + +/** + * This class contains utility methods and constants for group varint + * + * @lucene.internal + */ +public final class GroupVIntUtil { + // the maximum length of a single group-varint is 4 integers + 1 byte flag. + public static final int MAX_LENGTH_PER_GROUP = 17; + private static final int[] MASKS = new int[] {0xFF, 0x, 0xFF, 0x}; + + /** + * Default implementation of read single group, for optimal performance, you should use {@link + * DataInput#readGroupVInts(long[], int)} instead. + * + * @param dst the array to read ints into. + * @param offset the offset in the array to start storing ints. + */ + public static void readGroupVInt(DataInput in, long[] dst, int offset) throws IOException { +final int flag = in.readByte() & 0xFF; + +final int n1Minus1 = flag >> 6; +final int n2Minus1 = (flag >> 4) & 0x03; +final int n3Minus1 = (flag >> 2) & 0x03; +final int n4Minus1 = flag & 0x03; + +dst[offset] = readLongInGroup(in, n1Minus1); +dst[offset + 1] = readLongInGroup(in, n2Minus1); +dst[offset + 2] = readLongInGroup(in, n3Minus1); +dst[offset + 3] = readLongInGroup(in, n4Minus1); + } + + private static long readLongInGroup(DataInput in, int numBytesMinus1) throws IOException { +switch (numBytesMinus1) { + case 0: +return in.readByte() & 0xFFL; + case 1: +return in.readShort() & 0xL; + case 2: +return (in.readShort() & 0xL) | ((in.readByte() & 0xFFL) << 16); + default: +return in.readInt() & 0xL; +} + } + + /** + * Provides an abstraction for read int values, so that decoding logic can be reused in different + * DataInput. + */ + public static interface IntReader { +int read(long v); + } + + /** + * Faster implementation of read single group, It read values from the buffer that would not cross + * boundaries. + * + * @param flag the flag of group varint. + * @param reader the supplier of read int. + * @param dst the array to read ints into. + * @param offset the offset in the array to start storing ints. + * @return the number of bytes read, it is a positive number and less than {@link + * #MAX_LENGTH_PER_GROUP} + */ + public static int readGroupVInt(int flag, IntReader reader, long pos, long[] dst, int offset) + throws IOException { Review Comment: All public methods in MemorySegmentIndexInput need the try catch. Everything private not. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420450756 ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util; + +import java.io.IOException; +import org.apache.lucene.store.DataInput; + +/** + * This class contains utility methods and constants for group varint + * + * @lucene.internal + */ +public final class GroupVIntUtil { + // the maximum length of a single group-varint is 4 integers + 1 byte flag. + public static final int MAX_LENGTH_PER_GROUP = 17; + private static final int[] MASKS = new int[] {0xFF, 0x, 0xFF, 0x}; + + /** + * Default implementation of read single group, for optimal performance, you should use {@link + * DataInput#readGroupVInts(long[], int)} instead. + * + * @param dst the array to read ints into. + * @param offset the offset in the array to start storing ints. + */ + public static void readGroupVInt(DataInput in, long[] dst, int offset) throws IOException { +final int flag = in.readByte() & 0xFF; + +final int n1Minus1 = flag >> 6; +final int n2Minus1 = (flag >> 4) & 0x03; +final int n3Minus1 = (flag >> 2) & 0x03; +final int n4Minus1 = flag & 0x03; + +dst[offset] = readLongInGroup(in, n1Minus1); +dst[offset + 1] = readLongInGroup(in, n2Minus1); +dst[offset + 2] = readLongInGroup(in, n3Minus1); +dst[offset + 3] = readLongInGroup(in, n4Minus1); + } + + private static long readLongInGroup(DataInput in, int numBytesMinus1) throws IOException { +switch (numBytesMinus1) { + case 0: +return in.readByte() & 0xFFL; + case 1: +return in.readShort() & 0xL; + case 2: +return (in.readShort() & 0xL) | ((in.readByte() & 0xFFL) << 16); + default: +return in.readInt() & 0xL; +} + } + + /** + * Provides an abstraction for read int values, so that decoding logic can be reused in different + * DataInput. + */ + public static interface IntReader { +int read(long v); + } + + /** + * Faster implementation of read single group, It read values from the buffer that would not cross + * boundaries. + * + * @param flag the flag of group varint. + * @param reader the supplier of read int. + * @param dst the array to read ints into. + * @param offset the offset in the array to start storing ints. + * @return the number of bytes read, it is a positive number and less than {@link + * #MAX_LENGTH_PER_GROUP} + */ + public static int readGroupVInt(int flag, IntReader reader, long pos, long[] dst, int offset) + throws IOException { Review Comment: See my other comment. Just extend the NPE/ISE try on the whole read method. Like everywhere else. Just look at the other methods, they are all identically structured. There is a reason why it is done like that. Don't change the pattern. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix position increment in (Reverse)PathHierarchyTokenizer [lucene]
lukas-vlcek commented on PR #12875: URL: https://github.com/apache/lucene/pull/12875#issuecomment-1847241112 @mikemccand Do you think you can give me some hint about? > (e.g. `UnifiedHighlighter`, in certain modes) I am looking at `TestUnifiedHighlighter*` tests. Does it mean that I need to use specific fieldType? Can I use any fieldType(s) from existing `UHTestHelper.parametersFactoryList()`? Such as `postingsType` and `postingsWithTvType`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]
gsmiller commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420685472 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super(allocator); } +/** + * For slices, buffers must be filled with zeros, so that we can find a slice's end based on a + * non-zero final value. + */ +private static boolean assertSliceBuffer(int[] buffer) { + for (int value : buffer) { Review Comment: OK just a guess, but I'm wondering if it was written this way to avoid conditional checks and branching in the CPU? It looks inefficient to us humans, but to a CPU, it might be more efficient to use the older style implementation that is branchless? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]
stefanvodita commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420699163 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super(allocator); } +/** + * For slices, buffers must be filled with zeros, so that we can find a slice's end based on a + * non-zero final value. + */ +private static boolean assertSliceBuffer(int[] buffer) { + for (int value : buffer) { Review Comment: I think that’s possible, but I have some counterarguments: 1. Branch prediction will kick in. We should never take the branch, so mispredictions will be rare and they will happen when we want to stop execution anyway. 2. We would not enable assertions in performance-critical situations, so this method would not run at all. 3. The previous solution has other problems. The count can overflow. It can also be zero at the end if it encounters negative and positive values along the way that happened to add up to zero. ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super(allocator); } +/** + * For slices, buffers must be filled with zeros, so that we can find a slice's end based on a + * non-zero final value. + */ +private static boolean assertSliceBuffer(int[] buffer) { + for (int value : buffer) { Review Comment: I think that’s possible, but I have some counterarguments: 1. Branch prediction will kick in. We should never take the branch, so mispredictions will be rare and they will happen when we want to stop execution anyway. 2. We would not enable assertions in performance-critical situations, so this method would not run at all. 3. The previous solution has other problems. The count can overflow. It can also be zero at the end if it encounters negative and positive values along the way that happened to add up to zero. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]
gsmiller commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420707913 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super(allocator); } +/** + * For slices, buffers must be filled with zeros, so that we can find a slice's end based on a + * non-zero final value. + */ +private static boolean assertSliceBuffer(int[] buffer) { + for (int value : buffer) { Review Comment: Yeah, fair points. Was only pointing out why it may have been written the way it was. I'm honestly not concerned with how we write this check if it's only being used in asserts. I got thrown off since your PR leads with: "This PR makes it so we exit faster if we find a non-zero value", and was just pointing out that this may not be true. If this is only for assertions, then I don't think we really need to "optimize" it :). Avoiding overflow is good. You can do it this way, or with a disjunctive accumulator if you want to keep it branchless. But again, if only for assertions, whatever is simplest is fine with me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]
stefanvodita commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420723711 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super(allocator); } +/** + * For slices, buffers must be filled with zeros, so that we can find a slice's end based on a + * non-zero final value. + */ +private static boolean assertSliceBuffer(int[] buffer) { + for (int value : buffer) { Review Comment: Bad choice of words on my part to say "exit faster". Maybe I should have said "fail faster" or "short-circuit the check". I guess I don't have a strong preference either as long as we are addressing the correctness issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [WIP] LUCENE-10002: Deprecate FacetsCollector#search helper methods as they internally use IndexSearcher#search(Query, Collector) API [lucene]
gsmiller commented on PR #12890: URL: https://github.com/apache/lucene/pull/12890#issuecomment-1847461557 IMO we should deprecate these without replacement. I agree that users should be able to implement this logic pretty easily in their application layer, and would probably be better suited to do so. For example, if a user doesn't want to collect actual hits and just do faceting, these methods are a little trappy since they internally decide to also do a full (accurate) count of hits. They're baking in some assumptions like this that users may not expect. Just my opinion of course. I don't feel all that strongly about it, and if we want some convenience functionality somewhere for this, that's fine. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]
benwtrent commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r1420733099 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -1190,4 +1176,63 @@ public void seekExact(long ord) { public long ord() { throw new UnsupportedOperationException(); } + + static class OutputAccumulator extends DataInput { + +BytesRef[] outputs = new BytesRef[16]; +BytesRef current; +int num; +int outputIndex; +int index; + +void push(BytesRef output) { + if (output != Lucene90BlockTreeTermsReader.NO_OUTPUT) { Review Comment: While we have strict contracts, I can see that `BytesSequenceOutputs#add(BytesRef, BytesRef)` has assertions that the length is > 0. Here in `OutputAccumulator` `readByte()` makes a big assumption that a `BytesRef` has at least length 1. If it had a length of 0, we would read past the ref end and read bytes sitting in a `byte[]` that we shouldn't. IMO `OutputAccumulator` needs to be way more cautious than `BytesSequenceOutputs#add(BytesRef, BytesRef)` because `OutputAccumulator` isn't making copies and is relying on the underlying byte arrays not changing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]
benwtrent commented on code in PR #12699: URL: https://github.com/apache/lucene/pull/12699#discussion_r1420733099 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -1190,4 +1176,63 @@ public void seekExact(long ord) { public long ord() { throw new UnsupportedOperationException(); } + + static class OutputAccumulator extends DataInput { + +BytesRef[] outputs = new BytesRef[16]; +BytesRef current; +int num; +int outputIndex; +int index; + +void push(BytesRef output) { + if (output != Lucene90BlockTreeTermsReader.NO_OUTPUT) { Review Comment: While we have strict contracts, I can see that `BytesSequenceOutputs#add(BytesRef, BytesRef)` has assertions that the length is > 0. Here in `OutputAccumulator` `readByte()` makes a big assumption that a `BytesRef` has at least length 1. If it had a length of 0, we would read past the ref end and read bytes sitting in a `byte[]` that we shouldn't. IMO `OutputAccumulator` needs to be way more cautious than ``BytesSequenceOutputs#add(BytesRef, BytesRef)` because `OutputAccumulator` isn't making copies and is relying on the underlying byte arrays not changing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420565417 ## lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseDirectoryTestCase.java: ## @@ -1438,4 +1440,68 @@ public void testListAllIsSorted() throws IOException { assertArrayEquals(expected, actual); } } + + public void testDataTypes() throws IOException { +final long[] values = new long[] {43, 12345, 123456, 1234567890}; +try (Directory dir = getDirectory(createTempDir("testDataTypes"))) { + IndexOutput out = dir.createOutput("test", IOContext.DEFAULT); + out.writeByte((byte) 43); + out.writeShort((short) 12345); + out.writeInt(1234567890); + out.writeGroupVInts(values, 4); + out.writeLong(1234567890123456789L); + out.close(); + + long[] restored = new long[4]; + IndexInput in = dir.openInput("test", IOContext.DEFAULT); + assertEquals(43, in.readByte()); + assertEquals(12345, in.readShort()); + assertEquals(1234567890, in.readInt()); + in.readGroupVInts(restored, 4); + assertArrayEquals(values, restored); + assertEquals(1234567890123456789L, in.readLong()); + in.close(); +} + } + + public void testGroupVInt() throws IOException { +try (Directory dir = getDirectory(createTempDir("testGroupVInt"))) { + // test fallback to default implementation of readGroupVInt + doTestGroupVInt(dir, 5, 1, 6, 8); + + // use more iterations to covers all bpv + doTestGroupVInt(dir, atLeast(100), 1, 31, 128); + + // we use BaseChunkedDirectoryTestCase#testGroupVIntMultiBlocks cover multiple blocks for + // ByteBuffersDataInput and MMapDirectory +} + } + + protected void doTestGroupVInt( + Directory dir, int iterations, int minBpv, int maxBpv, int maxNumValues) throws IOException { +long[] values = new long[maxNumValues]; +long[] restored = new long[maxNumValues]; + +for (int i = 0; i < iterations; i++) { Review Comment: I changed the all iterations write to a single file, then do `writeVInt` to an other file as expected. ## lucene/core/src/java/org/apache/lucene/util/GroupVIntUtil.java: ## @@ -62,4 +62,42 @@ private static long readLongInGroup(DataInput in, int numBytesMinus1) throws IOE return in.readInt() & 0xL; } } + + /** + * Provides an abstraction for read int values, so that decoding logic can be reused in different DataInput. + * + */ + public static interface IntReader { +int read(long v); + } + + /** + * Faster implementation of read single group, It read values from the buffer that would not cross + * boundaries. + * + * @param flag the flag of group varint. + * @param reader the supplier of read int. + * @param dst the array to read ints into. Review Comment: sorry for the missing! ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws IOException { try { final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF; - - final int n1Minus1 = flag >> 6; - final int n2Minus1 = (flag >> 4) & 0x03; - final int n3Minus1 = (flag >> 2) & 0x03; - final int n4Minus1 = flag & 0x03; - - dst[offset] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n1Minus1]; - curPosition += 1 + n1Minus1; - dst[offset + 1] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n2Minus1]; - curPosition += 1 + n2Minus1; - dst[offset + 2] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n3Minus1]; - curPosition += 1 + n3Minus1; - dst[offset + 3] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n4Minus1]; - curPosition += 1 + n4Minus1; + curPosition += + GroupVIntUtil.readGroupVInt( + flag, p -> curSegment.get(LAYOUT_LE_INT, p), curPosition, dst, offset); Review Comment: the `curPosition` had +1 when read flag, so it don't need +1 > This is why I'd like to have a test that reads a vint group and other data from the same input placed behind each other! In `BaseDirectoryTestCase#testDataTypes` we did the similar test, is that enough? ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -303,6 +304,34 @@ public byte readByte(long pos) throws IOException { } } + @Override + public void readGroupVInts(long[] dst, int limit) throws IOException { +int i; +for (i = 0; i <= limit - 4; i += 4) { + readGroupVInt(dst, i); +} +for (; i < limit; ++i) { + dst[i] = readVInt(); +} + } + + private void readGroupVInt(long[] dst, int offset) throws IOException { +final MemorySeg
Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]
gsmiller commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420739649 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super(allocator); } +/** + * For slices, buffers must be filled with zeros, so that we can find a slice's end based on a + * non-zero final value. + */ +private static boolean assertSliceBuffer(int[] buffer) { + for (int value : buffer) { Review Comment: Since it's only used in asserts, I think readability is probably preferential to performance? The way you have it here in this change would be the most readable IMO, and probably makes the most sense. Let's keep the change the way you have it but maybe just update the CHANGES entry to highlight that you're making the assertion defensive against overflow and negative values? That might leave a better "digital archeology" trail for the future so someone doesn't make the same mistake I did, read this change as an optimization and then go off wondering if a branchless implementation wouldn't make more sense? :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1420763238 ## lucene/CHANGES.txt: ## @@ -67,6 +67,8 @@ API Changes * GITHUB#11023: Adding -level param to CheckIndex, making the old -fast param the default behaviour. (Jakub Slowinski) +* GITHUB#12180: Add Facets#getBulkSpecificValues method to get specific values for multiple FacetLabel at once. (Egor Potemkin) Review Comment: Done! I'm never sure if I'm supposed to use plurals for class names in sentences like this... ## lucene/facet/src/java/org/apache/lucene/facet/Facets.java: ## @@ -44,11 +50,17 @@ public abstract FacetResult getTopChildren(int topN, String dim, String... path) throws IOException; /** - * Return the count or value for a specific path. Returns -1 if this path doesn't exist, else the - * count. + * Return the count or value for a specific path. Returns -1 {@link #MISSING_SPECIFIC_VALUE} if Review Comment: You are right, fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1420763643 ## lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java: ## @@ -568,6 +568,12 @@ public Number getSpecificValue(String dim, String... path) { throw new UnsupportedOperationException(); } + @Override + public Number[] getBulkSpecificValues(FacetLabel[] facetLabels) { +// TODO: should we impl this? Review Comment: Yes, but this class doesn't implement `getSpecificValue`. We can add it as a default implementation maybe, to `Facets` class? But then there will be less incentive to write faster version of `getBulkSpecificValues` for future `Facets` subclasses. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1420764008 ## lucene/facet/src/java/org/apache/lucene/facet/MultiFacets.java: ## @@ -77,6 +80,39 @@ public Number getSpecificValue(String dim, String... path) throws IOException { return facets.getSpecificValue(dim, path); } + @Override + public Number[] getBulkSpecificValues(FacetLabel[] facetLabels) throws IOException { +if (facetLabels.length == 0) { + return new Number[0]; +} +// Split facetLabels into chunks by the Facets they belog to +Map labelsPerFacet = new HashMap<>(); +for (int i = 0; i < facetLabels.length; i++) { + String dim = facetLabels[i].components[0]; + Facets facets = dimToFacets.get(dim); Review Comment: The idea was to rely on defaults which make each object unique. But I just found that there is `IdentityHashMap` class that always uses object ID even if equals/hashCode are overriden, which I think is safer to use in this case. Thanks for heads up! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1420764333 ## lucene/facet/src/java/org/apache/lucene/facet/MultiFacets.java: ## @@ -77,6 +80,39 @@ public Number getSpecificValue(String dim, String... path) throws IOException { return facets.getSpecificValue(dim, path); } + @Override + public Number[] getBulkSpecificValues(FacetLabel[] facetLabels) throws IOException { +if (facetLabels.length == 0) { + return new Number[0]; +} +// Split facetLabels into chunks by the Facets they belog to +Map labelsPerFacet = new HashMap<>(); +for (int i = 0; i < facetLabels.length; i++) { + String dim = facetLabels[i].components[0]; + Facets facets = dimToFacets.get(dim); + if (facets == null) { +if (defaultFacets == null) { + throw new IllegalArgumentException("invalid dim \"" + dim + "\""); +} +facets = defaultFacets; + } + labelsPerFacet.computeIfAbsent(facets, f -> new IntArrayList()).add(i); +} +// Call getBulkSpecificValues for each chunk and reconcile results +Number[] result = new Number[facetLabels.length]; +for (Map.Entry fl : labelsPerFacet.entrySet()) { + FacetLabel[] labelsForFacet = new FacetLabel[fl.getValue().size()]; + for (IntCursor ordI : fl.getValue()) { +labelsForFacet[ordI.index] = facetLabels[ordI.value]; + } + Number[] ords = fl.getKey().getBulkSpecificValues(labelsForFacet); Review Comment: It is because there are Facet implementations for integer and float values that share the same base class `Facets`. Oh, `ords` is a bad name for this variable - renamed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420760926 ## lucene/core/src/test/org/apache/lucene/store/TestMMapDirectory.java: ## @@ -114,4 +115,31 @@ public void testNullParamsIndexInput() throws Exception { } } } + + private void doTestReadingClosedInput(Directory dir, long[] values, int numValues) Review Comment: You don't need to test all variants and different sizes here. Just create an empty file with indexoutput. open it and close it, then call `readGroupVInts`. As this should only trigger the AlreadyClosedException theres no need to repeat this multiple times. These are IO-heavy tests, so only do the required stuff and don't scrub my SSD. It also slows down those tests. Just add it here and try to call that method on "two": - https://github.com/apache/lucene/blob/08d24dd9cb10b0808264b7c44e5b620d7d1a398e/lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseChunkedDirectoryTestCase.java#L70-L74 - https://github.com/apache/lucene/blob/08d24dd9cb10b0808264b7c44e5b620d7d1a398e/lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseChunkedDirectoryTestCase.java#L91-L95 Anyways, I don't think we need this test in reality. Otherwise we would need a call of all public methods. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]
stefanvodita commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420778257 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super(allocator); } +/** + * For slices, buffers must be filled with zeros, so that we can find a slice's end based on a + * non-zero final value. + */ +private static boolean assertSliceBuffer(int[] buffer) { + for (int value : buffer) { Review Comment: Makes sense, I modified the CHANGES entry. Sorry about the confusion! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1420792687 ## lucene/facet/src/java/org/apache/lucene/facet/MultiFacets.java: ## @@ -77,6 +80,39 @@ public Number getSpecificValue(String dim, String... path) throws IOException { return facets.getSpecificValue(dim, path); } + @Override + public Number[] getBulkSpecificValues(FacetLabel[] facetLabels) throws IOException { +if (facetLabels.length == 0) { + return new Number[0]; +} +// Split facetLabels into chunks by the Facets they belog to +Map labelsPerFacet = new HashMap<>(); +for (int i = 0; i < facetLabels.length; i++) { + String dim = facetLabels[i].components[0]; + Facets facets = dimToFacets.get(dim); + if (facets == null) { +if (defaultFacets == null) { + throw new IllegalArgumentException("invalid dim \"" + dim + "\""); +} +facets = defaultFacets; + } + labelsPerFacet.computeIfAbsent(facets, f -> new IntArrayList()).add(i); +} +// Call getBulkSpecificValues for each chunk and reconcile results +Number[] result = new Number[facetLabels.length]; +for (Map.Entry fl : labelsPerFacet.entrySet()) { + FacetLabel[] labelsForFacet = new FacetLabel[fl.getValue().size()]; + for (IntCursor ordI : fl.getValue()) { +labelsForFacet[ordI.index] = facetLabels[ordI.value]; + } + Number[] ords = fl.getKey().getBulkSpecificValues(labelsForFacet); Review Comment: Hmm, re-read your message. I don't think we can change `getSpecificValue` behavior as users might not check for `null`. But we can choose to implement `null` for the new method. I think it's more intuitive to use `null`, but It could be a little weird that 2 sibling methods behave differently? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420793976 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws IOException { try { final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF; - - final int n1Minus1 = flag >> 6; - final int n2Minus1 = (flag >> 4) & 0x03; - final int n3Minus1 = (flag >> 2) & 0x03; - final int n4Minus1 = flag & 0x03; - - dst[offset] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n1Minus1]; - curPosition += 1 + n1Minus1; - dst[offset + 1] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n2Minus1]; - curPosition += 1 + n2Minus1; - dst[offset + 2] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n3Minus1]; - curPosition += 1 + n3Minus1; - dst[offset + 3] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n4Minus1]; - curPosition += 1 + n4Minus1; + curPosition += + GroupVIntUtil.readGroupVInt( + flag, p -> curSegment.get(LAYOUT_LE_INT, p), curPosition, dst, offset); Review Comment: Are you sure. It just reads the flag but does not increment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420793976 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws IOException { try { final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF; - - final int n1Minus1 = flag >> 6; - final int n2Minus1 = (flag >> 4) & 0x03; - final int n3Minus1 = (flag >> 2) & 0x03; - final int n4Minus1 = flag & 0x03; - - dst[offset] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n1Minus1]; - curPosition += 1 + n1Minus1; - dst[offset + 1] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n2Minus1]; - curPosition += 1 + n2Minus1; - dst[offset + 2] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n3Minus1]; - curPosition += 1 + n3Minus1; - dst[offset + 3] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n4Minus1]; - curPosition += 1 + n4Minus1; + curPosition += + GroupVIntUtil.readGroupVInt( + flag, p -> curSegment.get(LAYOUT_LE_INT, p), curPosition, dst, offset); Review Comment: Are you sure? It just reads the flag but does not increment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420760926 ## lucene/core/src/test/org/apache/lucene/store/TestMMapDirectory.java: ## @@ -114,4 +115,31 @@ public void testNullParamsIndexInput() throws Exception { } } } + + private void doTestReadingClosedInput(Directory dir, long[] values, int numValues) Review Comment: You don't need to test all variants and different sizes here. Just create a file with random data using indexoutput. open it and close it, then call `readGroupVInts`. As this should only trigger the AlreadyClosedException theres no need to repeat this multiple times. These are IO-heavy tests, so only do the required stuff and don't scrub my SSD. It also slows down those tests. Just add it here and try to call that method on "two": - https://github.com/apache/lucene/blob/08d24dd9cb10b0808264b7c44e5b620d7d1a398e/lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseChunkedDirectoryTestCase.java#L70-L74 - https://github.com/apache/lucene/blob/08d24dd9cb10b0808264b7c44e5b620d7d1a398e/lucene/test-framework/src/java/org/apache/lucene/tests/store/BaseChunkedDirectoryTestCase.java#L91-L95 Anyways, I don't think we need this test in reality. Otherwise we would need a call of all public methods. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Add Facets#getBulkSpecificValues method (#12180) [lucene]
epotyom commented on PR #12862: URL: https://github.com/apache/lucene/pull/12862#issuecomment-1847540870 Thank you fore reviewing @mikemccand ! Resolved your comments in 2nd commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Fix NPE on off-heap test and FST is null [lucene]
dungba88 opened a new pull request, #12894: URL: https://github.com/apache/lucene/pull/12894 ### Description The test can throw a NPE when it's using off-heap mode and no nodes are accepted -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420812950 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -324,24 +324,9 @@ private void readGroupVInt(long[] dst, int offset) throws IOException { try { final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF; - - final int n1Minus1 = flag >> 6; - final int n2Minus1 = (flag >> 4) & 0x03; - final int n3Minus1 = (flag >> 2) & 0x03; - final int n4Minus1 = flag & 0x03; - - dst[offset] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n1Minus1]; - curPosition += 1 + n1Minus1; - dst[offset + 1] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n2Minus1]; - curPosition += 1 + n2Minus1; - dst[offset + 2] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n3Minus1]; - curPosition += 1 + n3Minus1; - dst[offset + 3] = - curSegment.get(LAYOUT_LE_INT, curPosition) & GroupVIntUtil.GROUP_VINT_MASKS[n4Minus1]; - curPosition += 1 + n4Minus1; + curPosition += + GroupVIntUtil.readGroupVInt( + flag, p -> curSegment.get(LAYOUT_LE_INT, p), curPosition, dst, offset); Review Comment: emmm... we use `curSegment.get(LAYOUT_BYTE, curPosition++)` to read flag byte, so `curPosition++` will increment the read pos , is that right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
benwtrent opened a new issue, #12895: URL: https://github.com/apache/lucene/issues/12895 ### Description It seems that https://github.com/apache/lucene/pull/12699/ has inadvertantly broken reading term dictionaries created in Lucene 9.8<=. To replicate a bug, one can index wikibigall with LuceneUtil & Lucene 9.8 & force-merge. Then attempt to read the created index using a wildcard query: ``` Path path = Paths.get("/data/local/lucene/indices/wikibigall.lucene-main.opt.Lucene90.dvfields.nd6.72652M/index"); try (FSDirectory dir = FSDirectory.open(path); DirectoryReader reader = DirectoryReader.open(dir)) { IndexSearcher searcher = new IndexSearcher(reader); searcher.count(new WildcardQuery(new Term("body", "*fo*"))); } ``` This will result in a trace similar to below: ``` Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3 at org.apache.lucene.store.ByteArrayDataInput.readByte(ByteArrayDataInput.java:136) at org.apache.lucene.store.DataInput.readVInt(DataInput.java:110) at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:114) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:158) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:149) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:203) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373) at org.apache.lucene.search.MultiTermQueryConstantScoreBlendedWrapper$1.rewriteInner(MultiTermQueryConstantScoreBlendedWrapper.java:111) at org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.rewrite(AbstractMultiTermQueryConstantScoreWrapper.java:179) at org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.bulkScorer(AbstractMultiTermQueryConstantScoreWrapper.java:220) at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:930) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:678) at org.apache.lucene.search.IndexSearcher.lambda$4(IndexSearcher.java:636) at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$0(TaskExecutor.java:118) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(TaskExecutor.java:153) at org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:76) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:640) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:607) at org.apache.lucene.search.IndexSearcher.count(IndexSearcher.java:423) at Corruption.main(Corruption.java:18) ``` We are currently not sure if this effects Lucene 9.9 created indices & reading via Lucene 9.9. NOTE: This also fails with just a prefix wildcard query. It seems to be all multi-term queries could be affected. Will provide more example stack traces in issue comments. ### Version and environment details Lucene 9.9 reading Lucene 9.8 indices. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847562169 //cc @gf2121 && @mikemccand -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847568572 Here are some exceptions ran into when trying to do multi-term queries with Lucene 9.9 against an index created in 9.8 or before: ``` Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 4 out of bounds for length 4 at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:112) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:158) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:149) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:203) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373) at org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.next(FilterLeafReader.java:201) ``` ``` Caused by: org.apache.lucene.index.CorruptIndexException: Illegal code for a compression algorithm: 3 (resource=SearchIndexInput{[(clone of) _3e4i_Lucene90_0.tim], context=IOContext [context=READ, mergeInfo=null, flushInfo=null, readOnce=false], cacheFile=SharedCacheFile{cacheKey=FileCacheKey[shardId=[.ds-logs-endpoint.events.file-default-2023.11.22-02][0], primaryTerm=3, fileName=stateless_commit_61353], length=3174808501}, length=308358826, offset=181596}) at org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:196) at org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.loadNextFloorBlock(IntersectTermsEnumFrame.java:119) at org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.popPushNext(IntersectTermsEnum.java:339) at org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:543) at org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373) at org.apache.lucene.core@9.9.0/org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.next(FilterLeafReader.java:201) ``` ``` Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.lucene.util.fst.FST$Arc.output()" because "arc" is null at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:196) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531) at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373) at org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.next(FilterLeafReader.java:201) ``` Still digging into this to figure out why reading is now corrupted. https://github.com/apache/lucene/pull/12699 has many subtle changes, including a very big one where we don't copy the byte arrays any longer. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Remove DrillSideways#createDrillDownFacetsCollector in favor of the manager-based hook [lucene]
gsmiller merged PR #12855: URL: https://github.com/apache/lucene/pull/12855 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Mark DrillSideways#createDrillDownFacetsCollector as @Deprecated [lucene]
gsmiller merged PR #12854: URL: https://github.com/apache/lucene/pull/12854 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]
gsmiller commented on code in PR #12812: URL: https://github.com/apache/lucene/pull/12812#discussion_r1420830252 ## lucene/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java: ## @@ -179,6 +179,19 @@ static class SlicedIntBlockPool extends IntBlockPool { super(allocator); } +/** + * For slices, buffers must be filled with zeros, so that we can find a slice's end based on a + * non-zero final value. + */ +private static boolean assertSliceBuffer(int[] buffer) { + for (int value : buffer) { Review Comment: No problem! Thanks for the change! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] [Minor] Quick exit for non-zero slice buffers [lucene]
gsmiller merged PR #12812: URL: https://github.com/apache/lucene/pull/12812 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847590622 Possibly related: https://github.com/apache/lucene/pull/12631 NOTE: the read corruption doesn't occur when reading from an index created in 9.9. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1420843242 ## lucene/core/src/test/org/apache/lucene/store/TestMMapDirectory.java: ## @@ -114,4 +115,31 @@ public void testNullParamsIndexInput() throws Exception { } } } + + private void doTestReadingClosedInput(Directory dir, long[] values, int numValues) Review Comment: Thank you @uschindler , It's more simpler, i like this! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix NPE on off-heap test and FST is null [lucene]
gsmiller commented on code in PR #12894: URL: https://github.com/apache/lucene/pull/12894#discussion_r1420835887 ## lucene/test-framework/src/java/org/apache/lucene/tests/util/fst/FSTTester.java: ## @@ -283,14 +283,17 @@ public FST doTest() throws IOException { } } FST fst = fstCompiler.compile(); -; Review Comment: Don't like @mikemccand see this! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix NPE on off-heap test and FST is null [lucene]
gsmiller merged PR #12894: URL: https://github.com/apache/lucene/pull/12894 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847604189 Git bisect has confirmed the read corruption occurs with: https://github.com/apache/lucene/pull/12699 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
jpountz commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847604281 I have a 9.8 index that reproduces the bug and ran a `git bisect` to figure out the first commit that fails, it pointed to #12699. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]
gsmiller opened a new issue, #12896: URL: https://github.com/apache/lucene/issues/12896 ### Description Saw this test fail a couple times in automated builds (e.g., [here](https://jenkins.thetaphi.de/job/Lucene-main-Windows/13501/testReport/junit/org.apache.lucene.search.join/TestParentBlockJoinByteKnnVectorQuery/testScoringWithMultipleChildren/)). I was able to reproduce locally against `main` with: `./gradlew test --tests TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren -Dtests.seed=98B9F347D87BF07E` ### Version and environment details Reproduced on the `main` branch -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847782193 Ugh -- I'll try to look at this later today. Disappointing that our back compat test specifically for reading 9.8 indices failed to catch this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847786318 It's also curious that it's not happening w/ 9.9 created indices. #12699 is about optimizing how we accumulate the long output while traversing (reading) the FST block tree terms index. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]
zhaih commented on code in PR #12885: URL: https://github.com/apache/lucene/pull/12885#discussion_r1421011978 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java: ## @@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream input) { this(input, false); } + private boolean isAllHiragana(CharSequence s) { +int len = s.length(); +for (int i = 0; i < len; i++) { + char ch = s.charAt(i); + if (ch < 0x3040 || ch > 0x3097) { +return false; + } +} +return true; + } + @Override public boolean incrementToken() throws IOException { if (input.incrementToken()) { String reading = readingAttr.getReading(); + if (reading == null & isAllHiragana(termAttr)) { Review Comment: What if we have hiragana and katagana mixed text? I feel like we still need to convert the hiragana part to katagana? ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java: ## @@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream input) { this(input, false); } + private boolean isAllHiragana(CharSequence s) { +int len = s.length(); +for (int i = 0; i < len; i++) { + char ch = s.charAt(i); + if (ch < 0x3040 || ch > 0x3097) { +return false; + } +} +return true; + } + @Override public boolean incrementToken() throws IOException { if (input.incrementToken()) { String reading = readingAttr.getReading(); + if (reading == null & isAllHiragana(termAttr)) { Review Comment: Maybe use `&&` instead? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]
zhaih commented on issue #12896: URL: https://github.com/apache/lucene/issues/12896#issuecomment-1847828807 I think it might due to the same problem as: https://github.com/apache/lucene/pull/12889 e.g. a doc reorder merge policy reordered the parent child block I haven't check it myself tho. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]
zhaih commented on issue #12896: URL: https://github.com/apache/lucene/issues/12896#issuecomment-1847830605 Oh probably not, the test is just using the default merge policy (TMP) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1847858897 @mikemccand I have to use at a minimum: `wikibig1m` for it to replicate. Couple of weird things I noticed in that optimization PR: - https://github.com/apache/lucene/pull/12699/files#diff-c30add12e4170e7905cdb0912fd869d447b5b1772d74c9fcaf0afe41cd58866aL200-R204 Now instead of giving a complete view of all appended arcs, there is this weird assumption that we only give one to `floorDataReader` - https://github.com/apache/lucene/pull/12699/files#diff-c30add12e4170e7905cdb0912fd869d447b5b1772d74c9fcaf0afe41cd58866aR200-L202. The frame outputPrefix is then set as the accumulation of all the arc outputs. I am not sure if this is relevant as for it to be important `getFrame` would have to return a frame already seen before, I am not sure if that is valid or not. This makes me think that each frame should have its own `OutputAccumulator` But all this code seems incredibly subtle and opaque to me. Been staring at it for hours now :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]
gautamworah96 commented on code in PR #12853: URL: https://github.com/apache/lucene/pull/12853#discussion_r1421077089 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java: ## @@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOException { FacetsCollector sidewaysCollector = drillSidewaysCollectorManagers[dim].newCollector(); sidewaysCollectors[dim] = sidewaysCollector; - dims[dim] = new DrillSidewaysScorer.DocsAndCost(scorer, sidewaysCollector); + dims[dim] = + new DrillSidewaysScorer.DocsAndCost( + scorer, sidewaysCollector.getLeafCollector(context)); } -// If more than one dim has no matches, then there -// are no hits nor drill-sideways counts. Or, if we -// have only one dim and that dim has no matches, -// same thing. Review Comment: unrelated but how exactly is `(nullCount > 1 || (nullCount == 1 && dims.length == 1))` equivalent to `nullCount > 1`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]
gautamworah96 commented on code in PR #12853: URL: https://github.com/apache/lucene/pull/12853#discussion_r1421069686 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java: ## @@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOException { FacetsCollector sidewaysCollector = drillSidewaysCollectorManagers[dim].newCollector(); sidewaysCollectors[dim] = sidewaysCollector; - dims[dim] = new DrillSidewaysScorer.DocsAndCost(scorer, sidewaysCollector); + dims[dim] = + new DrillSidewaysScorer.DocsAndCost( + scorer, sidewaysCollector.getLeafCollector(context)); } -// If more than one dim has no matches, then there -// are no hits nor drill-sideways counts. Or, if we -// have only one dim and that dim has no matches, -// same thing. -// if (nullCount > 1 || (nullCount == 1 && dims.length == 1)) { -if (nullCount > 1) { +// If baseScorer is null or the dim nullCount > 1, then we have nothing to score. We return +// a null scorer in this case, but we need to make sure #finish gets called on all facet +// collectors since IndexSearcher won't handle this for us: +if (baseScorer == null || nullCount > 1) { + if (drillDownCollector != null) { +drillDownCollector.finish(); + } + for (FacetsCollector fc : sidewaysCollectors) { +fc.finish(); + } return null; } // Sort drill-downs by most restrictive first: -Arrays.sort( -dims, -new Comparator() { - @Override - public int compare(DocsAndCost o1, DocsAndCost o2) { -return Long.compare(o1.approximation.cost(), o2.approximation.cost()); - } -}); - -if (baseScorer == null) { - return null; -} - -FacetsCollector drillDownCollector; -if (drillDownCollectorManager != null) { - drillDownCollector = drillDownCollectorManager.newCollector(); - managedDrillDownCollectors.add(drillDownCollector); -} else { - drillDownCollector = null; -} +Arrays.sort(dims, Comparator.comparingLong(o -> o.approximation.cost())); Review Comment: Thanks for these small cleanups as well. LGTM ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java: ## @@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOException { FacetsCollector sidewaysCollector = drillSidewaysCollectorManagers[dim].newCollector(); sidewaysCollectors[dim] = sidewaysCollector; - dims[dim] = new DrillSidewaysScorer.DocsAndCost(scorer, sidewaysCollector); + dims[dim] = + new DrillSidewaysScorer.DocsAndCost( + scorer, sidewaysCollector.getLeafCollector(context)); } -// If more than one dim has no matches, then there -// are no hits nor drill-sideways counts. Or, if we -// have only one dim and that dim has no matches, -// same thing. Review Comment: unlrelated but how exactly is `(nullCount > 1 || (nullCount == 1 && dims.length == 1))` equivalent to `nullCount > 1`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]
gsmiller commented on code in PR #12853: URL: https://github.com/apache/lucene/pull/12853#discussion_r1421103992 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysQuery.java: ## @@ -193,42 +204,29 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOException { FacetsCollector sidewaysCollector = drillSidewaysCollectorManagers[dim].newCollector(); sidewaysCollectors[dim] = sidewaysCollector; - dims[dim] = new DrillSidewaysScorer.DocsAndCost(scorer, sidewaysCollector); + dims[dim] = + new DrillSidewaysScorer.DocsAndCost( + scorer, sidewaysCollector.getLeafCollector(context)); } -// If more than one dim has no matches, then there -// are no hits nor drill-sideways counts. Or, if we -// have only one dim and that dim has no matches, -// same thing. Review Comment: Yeah, they're not :). We _do_ need to still run the query in the case where there's a single dim but it's null because we need to collect all the "sideways" hits for that null dim. Looks like this comment (along with the commented out condition) came from @mikemccand back in 2014. I doubt he recalls much 9 years later. Maybe he realized this is a bug, commented out the condition (and replaced it with the correct one) and left the comment untouched? That's my best guess. But no need to keep the comment around anymore. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]
gsmiller merged PR #12853: URL: https://github.com/apache/lucene/pull/12853 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored [lucene]
gsmiller commented on PR #12853: URL: https://github.com/apache/lucene/pull/12853#issuecomment-1847976009 Thanks @gautamworah96 for taking a look! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] IntTaxonomyFacets chooses dense values array when FacetsCollector has no MatchingDocs [lucene]
gsmiller commented on issue #12558: URL: https://github.com/apache/lucene/issues/12558#issuecomment-1847976273 Fixed the root cause of this in #12853 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Reproducible TestDrillSideways failure [lucene]
gsmiller commented on issue #12418: URL: https://github.com/apache/lucene/issues/12418#issuecomment-1847976644 OK merged #12853 which I think fixes the root cause of this randomized test failures. I'm going to resolve out this issue and will keep an eye on nightlies for any new failures. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]
kuramitsu commented on code in PR #12885: URL: https://github.com/apache/lucene/pull/12885#discussion_r1421172943 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java: ## @@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream input) { this(input, false); } + private boolean isAllHiragana(CharSequence s) { +int len = s.length(); +for (int i = 0; i < len; i++) { + char ch = s.charAt(i); + if (ch < 0x3040 || ch > 0x3097) { +return false; + } +} +return true; + } + @Override public boolean incrementToken() throws IOException { if (input.incrementToken()) { String reading = readingAttr.getReading(); + if (reading == null & isAllHiragana(termAttr)) { Review Comment: Thank you. I fixed it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]
kuramitsu commented on code in PR #12885: URL: https://github.com/apache/lucene/pull/12885#discussion_r1421180794 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java: ## @@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream input) { this(input, false); } + private boolean isAllHiragana(CharSequence s) { +int len = s.length(); +for (int i = 0; i < len; i++) { + char ch = s.charAt(i); + if (ch < 0x3040 || ch > 0x3097) { +return false; + } +} +return true; + } + @Override public boolean incrementToken() throws IOException { if (input.incrementToken()) { String reading = readingAttr.getReading(); + if (reading == null & isAllHiragana(termAttr)) { Review Comment: Fixed to support partial hiragana. However, it is difficult to describe a test for oov terms that contain a mixture of hiragana and katakana, because the system seems to basically split different character types for oov terms. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]
kuramitsu commented on code in PR #12885: URL: https://github.com/apache/lucene/pull/12885#discussion_r1421180794 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java: ## @@ -43,10 +43,30 @@ public JapaneseReadingFormFilter(TokenStream input) { this(input, false); } + private boolean isAllHiragana(CharSequence s) { +int len = s.length(); +for (int i = 0; i < len; i++) { + char ch = s.charAt(i); + if (ch < 0x3040 || ch > 0x3097) { +return false; + } +} +return true; + } + @Override public boolean incrementToken() throws IOException { if (input.incrementToken()) { String reading = readingAttr.getReading(); + if (reading == null & isAllHiragana(termAttr)) { Review Comment: Fixed to support partial hiragana. However, it is difficult to describe a test for OOV terms that contain a mixture of hiragana and katakana, because the system seems to basically split different character types for OOV terms. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848037016 I think if a fix for this isn't found early next week, we should consider reverting it. No user should upgrade to Lucene 9.9.0 with this bug. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[PR] Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos [lucene]
msfroh opened a new pull request, #12897: URL: https://github.com/apache/lucene/pull/12897 ### Description The SimpleTextSegmentInfoFormat was writing the random byte array used as a segment's ID directly -- not converting to a simple text representation of the byte array. As a result, the segment infos were often malformed (as UTF-8 text). The included test was failing before the change to write out the text representation of the byte array. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Output well-formed UTF-8 bytes in SimpleTextCodec's segmentinfos [lucene]
msfroh commented on PR #12897: URL: https://github.com/apache/lucene/pull/12897#issuecomment-1848071583 If needed, I'm happy to add versions of `testFileIsUTF8()` for the other SimpleTextCodec format unit tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848198927 I am travelling this weekend and unlikely to make much progress on this until early next week. Maybe we just revert and release 9.9.1 now? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]
dungba88 commented on code in PR #12831: URL: https://github.com/apache/lucene/pull/12831#discussion_r1420866991 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -503,9 +518,7 @@ public FSTMetadata getMetadata() { } /** - * Save the FST to DataOutput. If you use an {@link org.apache.lucene.store.IndexOutput} to build - * the FST, then you should not and do not need to call this method, as the FST is already saved. - * Doing so will throw an {@link UnsupportedOperationException}. + * Save the FST to DataOutput. Review Comment: It's no longer possible for user to create a FST that uses `NullFSTReader`, so this Javadoc will no longer holds. If users are able to create FST, then it will always be readable. ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -175,7 +176,7 @@ private FSTCompiler( fst = new FST<>( new FST.FSTMetadata<>(inputType, outputs, null, -1, VERSION_CURRENT, 0), -toFSTReader(dataOutput)); +NULL_FST_READER); Review Comment: As we can't pass actual null to FSTReader, `NullFSTReader` will be used instead. ## lucene/analysis/common/src/java/org/apache/lucene/analysis/charfilter/NormalizeCharMap.java: ## @@ -111,7 +111,7 @@ public NormalizeCharMap build() { for (Map.Entry ent : pendingPairs.entrySet()) { fstCompiler.add(Util.toUTF16(ent.getKey(), scratch), new CharsRef(ent.getValue())); } -map = fstCompiler.compile(); +map = FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader()); Review Comment: This `fromFSTReader` is there to avoid the boilerplate null-check that each consumer must now do. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org