Re: [PR] Revert "Optimize outputs accumulating for SegmentTermsEnum and InterssectTermsEnum " [lucene]
uschindler commented on PR #12899: URL: https://github.com/apache/lucene/pull/12899#issuecomment-1848910834

Are we sure that this did not affect already existing indexes, e.g. while merging?
Re: [PR] Revert "Optimize outputs accumulating for SegmentTermsEnum and InterssectTermsEnum " [lucene]
mikemccand commented on PR #12899: URL: https://github.com/apache/lucene/pull/12899#issuecomment-1848925405

> Are we sure that this did not affect already existing indexes, e.g. while merging?

Great question @uschindler ("blast radius") -- I'm pretty sure newly written 9.9.0 indices are fine: it does look like `SegmentTermsEnum` is correct (the bug is only in `IntersectTermsEnum`), but I'm still trying to convince myself of this / figure out how to make a better test than our existing tiny-index BWC tests...
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848934191

OK I think I understand why we see the bug on 9.8.x indices but NOT 9.9.x indices. And I'm quite sure the blast radius of the bug is contained solely to read-time, i.e. newly written 9.9.0 indices are not corrupt. Though I'd really love to see some stronger testing -- I'm hoping a randomly written, large enough index might reveal the bug. I'll open a spinoff issue so we can create that.

Details: We write the FST with (term-prefix, `byte[]` output) pairs, where term-prefix is the prefix of a set of terms sharing one on-disk block. E.g. `foo` is the prefix, and the N terms would be `foobar`, `foobaz`, ... The start of that `byte[]` is a `vLong`, which is the absolute file pointer of the on-disk block shifted up by 2 bits, and the 2 lower bits are two important flags: `OUTPUT_FLAG_IS_FLOOR` and `OUTPUT_FLAG_HAS_TERMS`.

Now, with 9.8.x, this is LSB encoded: those two flags are in the leading byte, not the last byte. FST will share output byte prefixes when it can, and it does in fact happen (rarely!) that the LSB (6 bits + 2 bit flags) is the same across N term blocks. It can happen if the last 6 bits of their block file pointers happen to be the same. In this case that leading byte (or, possibly even more rarely, bytes) is not on the last arc, and this PR loses that leading flag-carrying byte (or bytes), computes the wrong value for `OUTPUT_FLAG_IS_FLOOR`, then tries to `readVInt` when none was written --> exceptions at read time.

@gf2121 indeed the leading fp + 2-bit flags will never be the same across blocks, so this means even if FST is able to share a leading prefix byte or two, it will never share that entire `vLong`: there must be at least one final (different) byte for `readVLong` to read. And because `vLong` is self-delineating (each byte knows whether it is the last one by checking that its 8th bit is clear), losing a byte or two of its leading bytes won't break reading of the `vLong`, i.e., `readVLong` will indeed stop at the right byte (but of course produce the wrong value).

Importantly, `IntersectTermsEnum` does not use this encoded file pointer in the `floorData`! It gets the file pointer by traversing prior arcs in the FST and accumulating N incremental diffs contained in the on-disk block entries leading to that final floor block. So `IntersectTermsEnum` just needs these two bit flags.

In the 9.9.x written index, where we instead encode that leading long fp + 2 bits in MSB order, we share tons of prefixes of course (this is why the FST got so much smaller), but we will never share the entire thing, and that last byte contains our bit flags. So at read-time we will always have the right (flag containing) byte. The 9.9.x index is intact, and reading it is buggy yet harmlessly so (it throws away bytes it does not try to use), thanks also to how `vLong` is self-delineating. Fun!
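To make the LSB-vs-MSB point concrete, here is a minimal, self-contained sketch of the two encodings of `(fp << 2) | flags` -- the names and flag values are illustrative, not Lucene's actual internals. LSB-first puts the two flag bits in the leading byte, which the FST may then share across outputs; MSB-first puts them in the final byte, which differs across blocks and is never shared:

```java
// Illustrative sketch only -- not Lucene's actual FST/vLong code.
import java.io.ByteArrayOutputStream;

public class VLongFlagsSketch {
  static final long OUTPUT_FLAG_HAS_TERMS = 0x1; // hypothetical flag values
  static final long OUTPUT_FLAG_IS_FLOOR = 0x2;

  // 9.8.x-style LSB vLong: low 7 bits first, so the 2 flag bits land in the
  // LEADING byte -- exactly the byte the FST may share across arcs.
  static byte[] lsbVLong(long v) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((v & ~0x7FL) != 0) {
      out.write((int) ((v & 0x7F) | 0x80));
      v >>>= 7;
    }
    out.write((int) v);
    return out.toByteArray();
  }

  // 9.9.x-style MSB vLong: high bits first, so the low 7 bits (including the
  // 2 flags) land in the LAST byte, whose 8th bit is clear to mark the end.
  static byte[] msbVLong(long v) {
    byte[] tmp = new byte[10]; // a 64-bit vLong needs at most 10 bytes
    int i = tmp.length;
    tmp[--i] = (byte) (v & 0x7F); // final byte carries the flags
    v >>>= 7;
    while (v != 0) {
      tmp[--i] = (byte) ((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    byte[] out = new byte[tmp.length - i];
    System.arraycopy(tmp, i, out, 0, out.length);
    return out;
  }

  public static void main(String[] args) {
    long fp = 123_456; // hypothetical on-disk block file pointer
    long code = (fp << 2) | OUTPUT_FLAG_HAS_TERMS; // 2 low bits = flags
    System.out.printf("LSB: %d bytes, flags in byte[0]%n", lsbVLong(code).length);
    System.out.printf("MSB: %d bytes, flags in byte[n-1]%n", msbVLong(code).length);
  }
}
```

Both encodings use the same number of bytes; what differs is which end carries the flag bits, and therefore which end can end up shared (and then dropped) as an FST output prefix.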
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848934726

> > I don't think this assumption is valid @gf2121? Because that floor data first contains the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.9, where lots of prefix sharing should happen), so, internal arcs before the final arc are in fact expected to output shared prefix bytes?
>
> I thought the 'assumption' here means that we assert the floor data are all stored in the last arc. The whole FST output is encoded as `[ MSBVLong | floordata ]`. We may share prefixes in the MSBVLong, but we can not have two outputs with the same `MSBVLong`, so `floordata` will never be split across more than one arc. Did I misunderstand something?

Sorry @gf2121 -- that is indeed correct: except for the leading vLong-encoded "fp + 2 bits", the remainder of the floor data will always be on the last arc. But that leading vLong has those important flags that we were losing in the LSB encoded case.

> As @benwtrent pointed out, we should accumulate from the `outputPrefix` instead of `arc.output`. I raised #12900 for this. This patch seems to fix the exception when searching `WildcardQuery(new Term("body", "*fo*"))` on `Wikibig1m`. I'll try `Wikibigall` as well.

+1 -- this is the right fix (to not lose any leading bytes of the FST's output in `IntersectTermsEnum`). I'll review the PR and open a follow-on issue to somehow expose the bug with a stronger BWC test.
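A toy model of the fix being discussed, with hypothetical names (`outputPrefix` standing for the output bytes accumulated along the earlier arcs of the path) -- not the actual `IntersectTermsEnum` code. The point: the floor data must be the concatenation of every arc's output on the path, not just the final arc's, or the leading flag-carrying vLong byte(s) are lost:

```java
// Toy model only. Outputs behave like ByteSequenceOutputs: each arc on the
// FST path contributes a byte[] chunk of the total output.
import java.io.ByteArrayOutputStream;

public class OutputAccumulationSketch {
  // Correct: concatenate the accumulated prefix (earlier arcs) with the
  // final arc's output, so the leading flag-carrying vLong byte(s) survive.
  static byte[] floorData(byte[] outputPrefix, byte[] lastArcOutput) {
    ByteArrayOutputStream all = new ByteArrayOutputStream();
    all.writeBytes(outputPrefix);
    all.writeBytes(lastArcOutput);
    return all.toByteArray();
  }

  public static void main(String[] args) {
    byte[] outputPrefix = {(byte) 0x83};       // shared leading vLong byte (carries the 2 flags)
    byte[] lastArcOutput = {0x07, 0x2A, 0x2B}; // rest of the vLong + floor data
    byte[] good = floorData(outputPrefix, lastArcOutput); // flags intact
    byte[] bad = lastArcOutput; // the bug: leading flag byte silently dropped
    System.out.println(good.length + " bytes vs " + bad.length + " bytes");
  }
}
```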
[I] Improve BWC tests to reveal #12895 and confirm its fix [lucene]
mikemccand opened a new issue, #12901: URL: https://github.com/apache/lucene/issues/12901

### Description

Spinoff from #12895 where we inadvertently introduced read-time exceptions in `MultiTermQueries` (e.g. `WildcardQuery` `*fo*`) in 9.9.0 when reading pre-9.9.0 written indices. Our `TestBackwardsCompatibility` test should have caught this but clearly did not. Let's:

1. Make a test revealing #12895
2. Confirm the test (and all other tests) pass when applying the proposed PR fix #12900
3. Merge the new test case
4. Merge the fix in #12900

Hopefully in that order? We could simply snapshot the `wikibigall` index into BWC zip files, but that's way too massive. Can we instead make a randomized test that indexes fewer-ish terms and reveals the bug? With a "just so" set of terms, the bug should reproduce with a tiny index. We could maybe iterate to "just so" by tweaking the set of terms' prefixes until the on-disk file pointers of the blocks share the last 6 bits of their respective pointers (see the sketch below)?

### Gradle command to reproduce

_No response_
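A rough sketch of that iterate-until-"just so" idea. `blockFps` is a hypothetical test hook for extracting the on-disk block file pointers of an indexed term set -- Lucene would need to expose something like it:

```java
// Hypothetical sketch -- 'blockFps' does not exist in Lucene today.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Function;

public class JustSoTermsSketch {
  // Keep drawing random term sets until two blocks' file pointers agree in
  // their last 6 bits, so the LSB-encoded (fp << 2 | flags) leading byte
  // could be shared in the FST -- the precondition for the #12895 bug.
  static List<String> findJustSoTerms(Random random, Function<List<String>, long[]> blockFps) {
    while (true) {
      List<String> terms = randomTermsSharingPrefix(random);
      Set<Long> low6 = new HashSet<>();
      for (long fp : blockFps.apply(terms)) {
        if (low6.add(fp & 0x3F) == false) {
          return terms; // collision: this tiny term set should tickle the bug
        }
      }
    }
  }

  static List<String> randomTermsSharingPrefix(Random r) {
    List<String> terms = new ArrayList<>();
    String prefix = "prefix" + r.nextInt(100);
    int count = 50 + r.nextInt(200);
    for (int i = 0; i < count; i++) {
      terms.add(prefix + Integer.toString(r.nextInt(1 << 20), Character.MAX_RADIX));
    }
    return terms;
  }
}
```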
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848936838

OK I opened #12901 to create a test -- can we do that, first, and then confirm #12900 indeed fixes it, then push the test, then push the fix? I can try to work on creating that test case, but won't have much time until early tomorrow AM. If anyone else wants to crack it, feel free!
Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]
mikemccand commented on PR #12900: URL: https://github.com/apache/lucene/pull/12900#issuecomment-1848936948

OK I opened #12901 to create a better BWC test revealing these read-time exceptions -- can we do that, first, and then confirm #12900 indeed fixes it, then push the test, then push the fix? I can try to work on creating that test case, but won't have much time until early tomorrow AM. If anyone else wants to crack it, feel free!
Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848961163

> and so sorry for introducing this bug!

I am with @mikemccand on this @gf2121! The optimization proved valuable in benchmarks and all testing indicated it was good to go. Mistakes happen. I think the bigger thing is that our testing around this part of the code is lacking! It's awesome to see that a solution was found so quickly!
Re: [PR] Allow FST builder to use different writer (alternative reverse BytesReader) [lucene]
dungba88 commented on PR #12879: URL: https://github.com/apache/lucene/pull/12879#issuecomment-1848988292

Note: We already merged #12624, but this PR can be used as a benchmark candidate for https://github.com/apache/lucene/issues/12884
Re: [I] Create a simple JMH benchmark to measure FST compilation / traversal times [lucene]
dungba88 commented on issue #12884: URL: https://github.com/apache/lucene/issues/12884#issuecomment-1848988593

We can use https://github.com/apache/lucene/pull/12879 as a benchmark candidate to compare against the current baseline.
Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]
dungba88 commented on PR #12831: URL: https://github.com/apache/lucene/pull/12831#issuecomment-184998

One of the points I'm unsure about in this PR is that there are now 2 (obscured) ways to construct the FST: one using the DataInput+FSTStore, and one using the FSTReader returned by the on-heap DataOutput, whereas previously there was only 1 way (calling `compile()`). If users don't read the Javadoc, they might be confused about how to get the FST. Maybe there is another way which feels more natural.
[I] Where should we stream FST to disk directly? [lucene]
dungba88 opened a new issue, #12902: URL: https://github.com/apache/lucene/issues/12902

### Description

Most of the use cases with FST seem to eventually write the FST to a DataOutput (is it IndexOutput?). In that case we are currently writing the FST to an on-heap DataOutput (previously BytesStore and now ReadWriteDataOutput) and then saving it to disk. With #12624 it's possible to write the FST to an on-disk DataOutput. So maybe let's first compile a list of places which can be migrated to the new way?

Note: With the new way, there is a catch: we can't write the metadata in the same DataOutput as the main FST body; it has to be written separately.
Re: [I] Where should we stream FST to disk directly? [lucene]
dungba88 commented on issue #12902: URL: https://github.com/apache/lucene/issues/12902#issuecomment-1849015292

A point of reference: Tantivy saves the metadata at the end of the file, and readers first jump to the end to learn the size and starting node. But we couldn't do that, as one file might contain multiple FSTs.
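A generic sketch of that footer pattern (not Tantivy's or Lucene's actual format): the writer appends the metadata after the body, then a fixed-width pointer to the metadata as the file's very last bytes, so a reader can seek to the end first:

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class FooterSketch {
  // Writer: body, then metadata, then an 8-byte pointer to the metadata as
  // the final bytes of the file.
  static void writeWithFooter(Path path, byte[] body, byte[] metadata) throws IOException {
    try (DataOutputStream out =
        new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(path)))) {
      out.write(body);
      long metaStart = body.length;
      out.write(metadata);
      out.writeLong(metaStart); // reader jumps here first
    }
  }

  // Reader: seek to the last 8 bytes, recover the metadata offset, then read
  // the metadata (which would carry e.g. the FST's start node and size).
  static byte[] readFooterMetadata(Path path) throws IOException {
    try (RandomAccessFile in = new RandomAccessFile(path.toFile(), "r")) {
      in.seek(in.length() - Long.BYTES);
      long metaStart = in.readLong();
      byte[] metadata = new byte[(int) (in.length() - Long.BYTES - metaStart)];
      in.seek(metaStart);
      in.readFully(metadata);
      return metadata;
    }
  }
}
```

As the comment notes, with multiple FSTs per file there is no single "end of file" per FST, so each FST would need its own recorded offsets instead.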
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1421782211

lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java:
```
@@ -330,15 +330,36 @@ public static int[] growExact(int[] array, int newLength) {
     return copy;
   }

+  /**
+   * Returns an array whose size is at least {@code minLength}, generally over-allocating
+   * exponentially, but never allocating more than {@code maxLength} elements.
+   */
+  public static int[] growInRange(int[] array, int minLength, int maxLength) {
+    assert minLength >= 0
+        : "length must be positive (got " + minLength + "): likely integer overflow?";
+
+    if (minLength > maxLength) {
```

Review Comment: I think this can be an assert instead, as this class is marked as internal.

lucene/core/src/test/org/apache/lucene/util/hnsw/HnswGraphTestCase.java:
```
@@ -757,6 +757,30 @@ public void testRamUsageEstimate() throws IOException {
     long estimated = RamUsageEstimator.sizeOfObject(hnsw);
     long actual = ramUsed(hnsw);

+    // The estimation assumes neighbor arrays are always max size.
+    // When this is not true, the estimate can be much larger than the actual value.
+    // In these cases, we compute how much we overestimated the neighbors arrays.
+    if (estimated > actual) {
+      long neighborsError = 0;
+      int numLevels = hnsw.numLevels();
+      for (int level = 0; level < numLevels; ++level) {
+        NodesIterator nodesOnLevel = hnsw.getNodesOnLevel(level);
+        while (nodesOnLevel.hasNext()) {
+          int node = nodesOnLevel.nextInt();
+          NeighborArray neighbors = hnsw.getNeighbors(level, node);
+          long maxNeighborsSize;
+          if (level == 0) {
```

Review Comment: I think generally we'd rather avoid having tests duplicate assumptions or logic from the prod path? This seems to be a specific implementation decision that could be changed independently. I couldn't think of a better way, though. But I'm unsure about the need for this newly added code. It seems we only compute it in a single test, because we want a better estimation? The test seems to verify that our over-estimation cannot be more than 30% of the actual size. If we provide a better estimation, maybe we can lower the tolerance threshold? Still, it seems strange that we'd truly need this better estimation but only in a test.
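For context, a usage sketch of the helper under review, assuming the semantics shown in the hunk above (grow roughly exponentially, but never past a known hard cap -- e.g. an HNSW neighbor array bounded by the max connection count). The class and field names here are hypothetical:

```java
import org.apache.lucene.util.ArrayUtil;

// Hypothetical caller: a neighbor list that can never legally exceed
// maxConn entries, so over-allocating past that bound is pure waste.
class BoundedNeighbors {
  private final int maxConn;
  private int[] nodes;
  private int size;

  BoundedNeighbors(int maxConn) {
    this.maxConn = maxConn;
    this.nodes = new int[8];
  }

  void add(int node) {
    if (size == nodes.length) {
      // grows ~exponentially like ArrayUtil.grow, but caps at maxConn
      nodes = ArrayUtil.growInRange(nodes, size + 1, maxConn);
    }
    nodes[size++] = node;
  }
}
```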
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1421801910

lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/GroupVIntBenchmark.java:
```
@@ -140,10 +155,12 @@ public void init() throws Exception {
     }
     initByteBufferInput(docs);
     initArrayInput(docs);
+    initNioInput(docs);
+    initByteBuffersInput(docs);
   }

   @Benchmark
-  public void byteBufferReadVInt(Blackhole bh) throws IOException {
+  public void mmap_byteBufferReadVInt(Blackhole bh) throws IOException {
```

Review Comment: Looks fine.
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1849057360

Hi, I was busy this Sunday, sorry for the delay. Will check tomorrow.

In general: my idea is to have a single static utility method (like the default one) in the util class that does the *whole* readGroupVInt, but additionally takes a lambda to random-access read the integers, plus a parameter giving the maximum number of bytes available for random-access reads in the context of the caller. MemorySegmentIndexInput would pass `segment.byteSize() - position`, while ByteBuffers would pass `remaining()`. It would also take the reference offset of the *actual* position so it knows what to add for random-access reads. As return value it would return the new position. Both (the reference position before the first block and the return value) would be relative, for usage by the lambda.

The caller then just calls the method with the lambda (possibly wrapping the whole call with the usual NPE/ISE try-catch block). This would allow an easy implementation by 3rd party directory implementations without knowing how readGroupVInt works internally. 3rd party implementations and our own MMap/NIO/... would just call this static method if they support random access, or call super() if they can't.
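A sketch of what that proposed utility could look like. The interface, signature, and selector layout here are hypothetical (they summarize the comment above, not an actual Lucene API), and the decode trick of reading a full little-endian int and masking assumes the caller guarantees enough trailing bytes:

```java
import java.io.EOFException;
import java.io.IOException;

public class GroupVIntUtilSketch {
  /** Hypothetical random-access reader: a little-endian int at a relative offset. */
  @FunctionalInterface
  public interface IntReader {
    int readInt(long relativeOffset) throws IOException;
  }

  /** Worst case: 1 selector byte + 4 ints of 4 bytes each. */
  public static final int MAX_LENGTH_PER_GROUP = 17;

  /**
   * Decodes one group of 4 group-varint values starting at relative position
   * {@code pos}, returning the new relative position. {@code available} is the
   * number of bytes the lambda may safely touch from {@code pos}; callers with
   * less slack than MAX_LENGTH_PER_GROUP should fall back to the default
   * byte-at-a-time implementation instead.
   */
  public static long readGroupVInt(
      IntReader in, long pos, long available, int[] dst, int dstOff) throws IOException {
    if (available < MAX_LENGTH_PER_GROUP) {
      throw new EOFException("not enough slack for random-access decode; use the fallback");
    }
    final int selector = in.readInt(pos) & 0xFF; // low byte = selector
    pos += 1;
    for (int i = 0; i < 4; i++) {
      // 2 selector bits per value encode (numBytes - 1); low bits first here
      final int numBytes = ((selector >>> (2 * i)) & 0x3) + 1;
      // read a full little-endian int, keep only the low numBytes bytes
      final int mask = (int) ((1L << (8 * numBytes)) - 1);
      dst[dstOff + i] = in.readInt(pos) & mask;
      pos += numBytes;
    }
    return pos; // new relative position after the group
  }
}
```

Per the comment, `MemorySegmentIndexInput` would pass `segment.byteSize() - position` as `available`, and the ByteBuffer-based inputs would pass `remaining()`.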
Re: [I] Improve BWC tests to reveal #12895 and confirm its fix [lucene]
jpountz commented on issue #12901: URL: https://github.com/apache/lucene/issues/12901#issuecomment-1849073645

IMO one thing to improve here is that we changed file formats for the MSB vlong optimization by bumping the file format version number, changing the writer to write the new version and adding version checks to the reader. But we sometimes also change file formats by forking the code, like we did with `Lucene90PostingsFormat` -> `Lucene99PostingsFormat`. A benefit of the latter approach is that we keep existing tests in place, specifically `BasePostingsFormatTestCase` which has pretty good coverage, probably more than what `TestBackwardsCompatibility` will ever be able to check?

We should look into more systematically forking the code when we do a file format change, or figuring out other ways to keep testing prior formats with version bumps, e.g. by retaining version checks on the write side and making sure there is a separate `BasePostingsFormatTestCase` impl for every version number?
Re: [PR] Jit workaround [lucene]
ChrisHegarty commented on PR #12903: URL: https://github.com/apache/lucene/pull/12903#issuecomment-1849079207

For reviewers or other interested parties, I run the test as follows:
```
export CI=true; export x=; while ./gradlew --no-build-cache cleanTest :lucene:core:test --rerun --tests "org.apache.lucene.util.bkd.TestDocIdsWriter.testCrash"; do echo $x | wc -c; export x=x$x; done
```
We need `CI=true` because, by default in our gradle environment, C2 compilation is disabled -- and C2 is where the JIT crash happens.
[PR] Add tests for Lucene90PostingsFormat back [lucene]
jpountz opened a new pull request, #12904: URL: https://github.com/apache/lucene/pull/12904

I just noticed that the move from FOR to PFOR did all the work to make the old format (FOR) writeable, but missed keeping an instance of `BasePostingsFormatTestCase` for this format.
Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]
jpountz commented on code in PR #12903: URL: https://github.com/apache/lucene/pull/12903#discussion_r1421819068

lucene/core/src/java/org/apache/lucene/util/MSBRadixSorter.java:
```
@@ -214,6 +214,12 @@ protected int getBucket(int i, int k) {
    * @see #buildHistogram
    */
   private int computeCommonPrefixLengthAndBuildHistogram(int from, int to, int k, int[] histogram) {
+    int commonPrefixLength = computeInitialCommonPrefixLength(from, k);
+    return computeCommonPrefixLengthAndBuildHistogramPart1(
+        from, to, k, histogram, commonPrefixLength);
+  }
+
+  private int computeInitialCommonPrefixLength(int from, int k) {
```

Review Comment: Can you add a comment here and in other places that you touched, explaining that this works around a C2 compiler bug?

lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java:
```
@@ -150,4 +154,18 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
     }
     dir.deleteFile("tmp");
   }
+
+  // This simple test tickles a JVM JIT crash on JDK's less than 21.0.1
+  // Needs to be run with C2, so with the environment variable `CI` set
+  public void testCrash() throws IOException {
+    for (int i = 0; i < 100; i++) {
+      try (Directory dir = newDirectory();
+          IndexWriter iw = new IndexWriter(dir, newIndexWriterConfig(null))) {
+        for (int d = 0; d < 20_000; d++) {
+          iw.addDocument(
+              List.of(new IntPoint("foo", 0), new SortedNumericDocValuesField("bar", 0)));
+        }
+      }
+    }
+  }
```

Review Comment: How long does this test take? If it's multiple seconds, should it only run nightly?
Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]
ChrisHegarty commented on code in PR #12903: URL: https://github.com/apache/lucene/pull/12903#discussion_r1421825869

lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java (the `testCrash` hunk above):

Review Comment: The test takes around 3.5 seconds on my laptop. I can probably reduce the iterations. Or make the number of iterations depend on whether nightly or not? I'll take a look.
Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]
ChrisHegarty commented on code in PR #12903: URL: https://github.com/apache/lucene/pull/12903#discussion_r1421825998

lucene/core/src/java/org/apache/lucene/util/MSBRadixSorter.java (the `computeInitialCommonPrefixLength` hunk above):

Review Comment: Yes. I'll add a comment to that effect.
[I] TransitionAccessor for NFA: get transitions for a given state via random-access leads to wrong results. [lucene]
Tony-X opened a new issue, #12906: URL: https://github.com/apache/lucene/issues/12906

### Description

As part of https://github.com/apache/lucene/pull/12688, I'm trying to develop an algorithm that can intersect an FST and an FSA efficiently. For the FSA this means we can leverage the sorted transitions of a given state and perform binary search in order to advance to a target. So the algorithm uses `TransitionAccessor`, and there are three relevant APIs:
```
/** Initialize iteration over the transitions leaving {@code state}; returns how many there are. */
int initTransition(int state, Transition t);

/** Iterate to the next transition after the provided one */
void getNextTransition(Transition t);

/**
 * Fill the provided {@link Transition} with the index'th transition leaving the specified state.
 */
void getTransition(int state, int index, Transition t);
```
One could call `initTransition` followed by many `getNextTransition` calls to iterate the transitions one by one. Or, in my case, binary search needs to use `getTransition` to randomly access the middle point.

I debugged for hours and realized that for a given test case my algorithm works as expected but the transitions I got back were messed up. The original tests use quite complicated automatons, so I tried to find one small and simple enough to exhibit the behavior. Here is the test:
```
public void testAutomaton() {
  RegExp regExp = new RegExp("+*.", RegExp.NONE);
  Automaton a = regExp.toAutomaton();
  CompiledAutomaton compiledAutomaton = new CompiledAutomaton(a);
  System.out.println("isFinite: " + compiledAutomaton.finite);
  var byteRunnable = compiledAutomaton.getByteRunnable();
  var transitionAccessor = compiledAutomaton.getTransitionAccessor();
  // dfsAutomaton(byteRunnable, transitionAccessor, 0, "");
  // dumpTransitionsViaNext(byteRunnable, transitionAccessor, 0, new HashSet<>());
  dumpTransitionsViaRA(byteRunnable, transitionAccessor, 0, new HashSet<>());
}

void dumpTransitionsViaNext(ByteRunnable a, TransitionAccessor transitionAccessor, int currentState, Set<Integer> seenStates) {
  if (seenStates.contains(currentState)) {
    return;
  }
  seenStates.add(currentState);
  var t = new Transition();
  // initTransition returns the number of transitions leaving currentState
  var numTransitions = transitionAccessor.initTransition(currentState, t);
  for (int i = 0; i < numTransitions; i++) {
    transitionAccessor.getNextTransition(t);
    System.out.println("At: src: " + t.source + " [" + t.min + ", " + t.max + "] "
        + "dest: " + t.dest + " is dest accept: " + (a.isAccept(t.dest) ? "yes" : "no"));
    dumpTransitionsViaNext(a, transitionAccessor, t.dest, seenStates);
  }
}

void dumpTransitionsViaRA(ByteRunnable a, TransitionAccessor transitionAccessor, int currentState, Set<Integer> seenStates) {
  if (seenStates.contains(currentState)) {
    return;
  }
  seenStates.add(currentState);
  var t = new Transition();
  var numTransitions = transitionAccessor.initTransition(currentState, t);
  // transitionAccessor.getTransition(currentState, numTransitions - 1, t);
  for (int i = 0; i < numTransitions; i++) {
    transitionAccessor.getTransition(currentState, i, t);
    System.out.println("At: src: " + t.source + " [" + t.min + ", " + t.max + "] "
        + "dest: " + t.dest + " is dest accept: " + (a.isAccept(t.dest) ? "yes" : "no"));
    dumpTransitionsViaRA(a, transitionAccessor, t.dest, seenStates);
  }
}
```
`dumpTransitionsViaNext` gives expected transitions but `dumpTransitionsViaRA` produces a mess.
```
Via next
At: src: 0 arcIdx: 0 [0, 42] dest: 1 is dest accept: yes
At: src: 0 arcIdx: 1 [43, 43] dest: 2 is dest accept: yes
At: src: 2 arcIdx: 0 [0, 42] dest: 1 is dest accept: yes
At: src: 2 arcIdx: 1 [43, 43] dest: 2 is dest accept: yes
At: src: 2 arcIdx: 2 [44, 127] dest: 1 is dest accept: yes
At: src: 2 arcIdx: 3 [194, 223] dest: 7 is dest accept: no
At: src: 7 arcIdx: 0 [128, 142] dest: 1 is dest accept: yes
At: src: 7 arcIdx: 1 [143, 143] dest: 1 is dest accept: yes
At: src: 7 arcIdx: 2 [144, 190] dest: 1 is dest accept: yes
At: src: 7 arcIdx: 3 [191, 191] dest: 1 is dest accept: yes
At: src: 2 arcIdx: 4 [224, 239] dest: 8 is dest accept: no
At: src: 8 arcIdx: 0 [128, 142] dest: 11 is dest accept: no
At: src: 11 arcIdx: 0 [128, 142] dest: 1 is dest accept: yes
At: src: 11 arcIdx: 1 [143, 143] dest: 1 is dest accept: yes
At: src: 11 arcIdx: 2 [144, 190] dest: 1 is dest accept: yes
At: src: 11 arcIdx: 3 [191, 191] dest: 1 is dest accept: yes
At: src: 8 arcIdx: 1 [143, 143] dest: 11 is dest accept: no
At: src: 8 arcIdx: 2 [144, 190] dest: 11 is dest accept: no
At: src: 8 arcIdx: 3 [191, 191] dest: 11 is dest accept: no
At: src: 2 arcIdx: 5 [240, 243] dest: 9 is dest accept: n
```
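(The output is truncated in the original.) For context on why random access matters here: the intersection algorithm described above wants the classic binary search over a state's sorted transitions. A sketch using the `TransitionAccessor` APIs quoted in the issue, assuming sorted, non-overlapping ranges (which a DFA provides; this issue reports that random access misbehaves for an NFA):

```java
// Sketch: find a transition of 'state' whose [min, max] range contains
// 'label', using random access via getTransition. Returns the transition
// index, or -1 if no transition accepts the label.
static int findTransition(TransitionAccessor ta, int state, int label) {
  Transition t = new Transition();
  int lo = 0;
  int hi = ta.initTransition(state, t) - 1; // number of transitions - 1
  while (lo <= hi) {
    int mid = (lo + hi) >>> 1;
    ta.getTransition(state, mid, t);
    if (t.max < label) {
      lo = mid + 1;
    } else if (t.min > label) {
      hi = mid - 1;
    } else {
      return mid; // label lies within [t.min, t.max]
    }
  }
  return -1;
}
```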
Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1849275229

> 3rd party implementations and our own MMap/NIO/... would just call this static method if they support random access or call super(), if they can't.

Thanks @uschindler, that's OK! I'm going to do this today, and then you can take a look when you get a chance.
Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]
dweiss commented on code in PR #12903: URL: https://github.com/apache/lucene/pull/12903#discussion_r1422008879

lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java:
```
@@ -150,4 +154,18 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
     }
     dir.deleteFile("tmp");
   }
+
+  // This simple test tickles a JVM JIT crash on JDK's less than 21.0.1
+  // Needs to be run with C2, so with the environment variable `CI` set
```

Review Comment: If it's triggered by the CI env variable (and effectively C2 compilation) then perhaps we should check for `System.getProperty("CI")` and assumption-ignore the test if it can't be triggered?
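i.e., something along these lines at the top of `testCrash` (a sketch; the comment says `System.getProperty` while the PR description sets `CI` as an environment variable, so the env form is assumed here):

```java
// Skip the test unless C2 is enabled (our gradle env enables it when CI=true).
assumeTrue("requires C2 JIT; run with CI=true",
    Boolean.parseBoolean(System.getenv("CI")));
```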
Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]
dweiss commented on PR #12903: URL: https://github.com/apache/lucene/pull/12903#issuecomment-1849478590

> export CI=true; export x=; while ./gradlew --no-build-cache cleanTest :lucene:core:test --rerun --tests "org.apache.lucene.util.bkd.TestDocIdsWriter.testCrash"; do echo $x | wc -c; export x=x$x; done

Just a suggestion that this call to gradle should have a similar outcome and be a tad faster (it does fork multiple test JVMs after one compilation - used to run in parallel too but not anymore - gradle's constraint).
```
gradlew --no-build-cache :lucene:core:beast -Ptests.dups=100 --tests "org.apache.lucene.util.bkd.TestDocIdsWriter.testCrash"
```
More info: https://github.com/apache/lucene/blob/main/help/tests.txt#L89-L123
Re: [I] Improve BWC tests to reveal #12895 and confirm its fix [lucene]
jpountz commented on issue #12901: URL: https://github.com/apache/lucene/issues/12901#issuecomment-1849481891

> A benefit of the latter approach is that we keep existing tests in place, specifically BasePostingsFormatTestCase

Actually, it seems like we lost the test as part of refactorings; I'm adding it back in #12904.
[I] jenkins dump file traversal exceptions ("no matches found within 10000") [lucene]
dweiss opened a new issue, #12907: URL: https://github.com/apache/lucene/issues/12907

### Description

These stack traces at the end of jenkins runs are super annoying:
```
...
Archiving artifacts
hudson.FilePath$ValidateAntFileMask$1Cancel
	at hudson.FilePath$ValidateAntFileMask$1.isCaseSensitive(FilePath.java:3300)
	at org.apache.tools.ant.DirectoryScanner.lambda$isIncluded$3(DirectoryScanner.java:1374)
	at java.base/java.util.stream.MatchOps$1MatchSink.accept(MatchOps.java:90)
	at java.base/java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958)
...
	at hudson.FilePath$ValidateAntFileMask.hasMatch(FilePath.java:3313)
Caused: hudson.FilePath$FileMaskNoMatchesFoundException: no matches found within 10000
```
This is a weird path traversal limit in jenkins (to dodge circular link structures, I guess) that is configurable via system properties: https://www.jenkins.io/doc/book/managing/system-properties/#hudson-filepath-validate_ant_file_mask_bound

I don't think we can tweak it, but we can change the default path traversal patterns to be more restrictive:
```
No artifacts found that match the file pattern "**/*.events,heapdumps/**,**/hs_err_pid*". Configuration error?
```
The *.events is no longer used (it used to be emitted from the custom runner in ant). I don't think we can limit the scanning pattern so that it does not exceed the limit, but I think we can add a test finalization task to gradle that will scan only those folders that _can_ contain hs_err_pid* files and perhaps copy them over to one top-level directory, then add a much simpler pattern to include anything present in that folder (see the sketch below)?

### Version and environment details

_No response_
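A sketch of that collection step (the paths, folder names, and depth bound are assumptions, and a real implementation would live inside a gradle task): walk only the test working directories, shallowly, and funnel any hs_err_pid* files into a single top-level folder that a trivial archive pattern can match:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class CrashDumpCollector {
  // Copy hs_err_pid* files from a test work dir into one flat target folder,
  // so jenkins can archive e.g. "crash-dumps/**" without a deep ant-mask scan.
  static void collect(Path testWorkDir, Path target) throws IOException {
    Files.createDirectories(target);
    try (Stream<Path> files = Files.walk(testWorkDir, 4)) { // shallow bound
      files
          .filter(p -> p.getFileName().toString().startsWith("hs_err_pid"))
          .forEach(
              p -> {
                try {
                  Files.copy(
                      p, target.resolve(p.getFileName()), StandardCopyOption.REPLACE_EXISTING);
                } catch (IOException e) {
                  throw new UncheckedIOException(e);
                }
              });
    }
  }
}
```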