Re: [PR] Revert "Optimize outputs accumulating for SegmentTermsEnum and InterssectTermsEnum " [lucene]

2023-12-10 Thread via GitHub


uschindler commented on PR #12899:
URL: https://github.com/apache/lucene/pull/12899#issuecomment-1848910834

   Are we sure that this did not affect already existing indexes, e.g. while 
merging?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Revert "Optimize outputs accumulating for SegmentTermsEnum and InterssectTermsEnum " [lucene]

2023-12-10 Thread via GitHub


mikemccand commented on PR #12899:
URL: https://github.com/apache/lucene/pull/12899#issuecomment-1848925405

   > Are we sure that this did not affect already existing indexes, e.g. while 
merging?
   
   Great question @uschindler ("blast radius") -- I'm pretty sure newly 
written 9.9.0 indices are fine: it does look like `SegmentTermsEnum` is 
correct (the bug is only in `IntersectTermsEnum`), but I'm still trying to 
convince myself of this / figure out how to make a better test than our 
existing tiny-index BWC tests...





Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-10 Thread via GitHub


mikemccand commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848934191

   OK I think I understand why we see the bug on 9.8.x indices but NOT 9.9.x 
indices.  And I'm quite sure the blast radius of the bug is contained solely to 
read-time, i.e. newly written 9.9.0 indices are not corrupt.  Though I'd really 
love to see some stronger testing -- I'm hoping a randomly written, large 
enough index might reveal the bug.  I'll open a spinoff issue so we can create 
that.
   
   Details:
   
   We write the FST with <term-prefix, byte[] output> pairs, where term-prefix 
is the prefix of a set of terms sharing one on-disk block.  E.g. `foo` is the 
prefix, and the N terms would be `foobar`, `foobaz`, ...  The start of that 
`byte[]` is a `vLong`, which is the absolute file pointer of the on-disk block 
shifted up by 2 bits; the 2 lower bits are two important flags: 
`OUTPUT_FLAG_IS_FLOOR` and `OUTPUT_FLAG_HAS_TERMS`.
   
   Now, with 9.8.x, this is LSB encoded: those two flags are in the leading 
byte, not the last byte.  FST will share output byte prefixes when it can, and 
it does in fact happen (rarely!) that the LSB (6 bits + 2 flag bits) is the 
same across N term blocks.  It can happen if the last 6 bits of their block 
file pointers happen to be the same.  In this case that leading byte (or 
possibly, even more rarely, bytes) is not on the last arc, and this PR loses 
that leading flag-containing byte/bytes in that case, computes the wrong flag 
value for `OUTPUT_FLAG_IS_FLOOR`, then tries to `readVInt` when none was 
written --> exceptions at read time.
   
   @gf2121 indeed the leading fp + 2-bit flags will never be the same across 
blocks, so this means even if FST is able to share a leading prefix byte or 
two, it will never share that entire `vLong`: there must be at least one final 
(different) byte for `readVLong` to read.  And because `vLong` is 
self-delineating (each byte knows whether it is the last one by checking that 
its 8th bit is clear), losing a byte or two of its leading bytes won't break 
reading of the `vLong`, i.e., `readVLong` will indeed stop at the right byte 
(but of course produce the wrong value).
   
   Importantly, `IntersectTermsEnum` does not use this encoded file pointer in 
the `floorData`!  It gets the file-pointer by traversing prior arcs in the FST 
and accumulating N incremental diffs contained in the on-disk block entries 
leading to that final floor block.  So `IntersectTermsEnum` just needs these 
two bit flags.
   
   In the 9.9.x written index, where we instead encode that leading long fp + 2 
bits in MSB order, we share tons of prefixes of course (this is why FST got so 
much smaller), but we will never share the entire thing, and that last byte 
contains our bit flags.  So at read-time we will always have the right 
(flag-containing) byte.  The 9.9.x index is intact, and reading it is still 
buggy, yet buggy in a harmless way (throwing away bytes it does not try to 
use), also thanks to how `vLong` is self-delineating.  Fun!





Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-10 Thread via GitHub


mikemccand commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848934726

   > > I don't think this assumption is valid @gf2121? Because that floor data 
first contains the file pointer of the on-disk block that this prefix points to 
(in MSB order as of 9.9, where lots of prefix sharing should happen), so, 
internal arcs before the final arc are in fact expected to output shared prefix 
bytes?
   > 
   > I thought the 'assumption' here means that we assert the floor data are 
all stored in the last arc. The whole FST output is encoded as `[ MSBVLong | 
floordata ]`. We may share prefixes in the MSBVLong, but we cannot have two 
outputs with the same `MSBVLong`, so `floordata` will never be split across 
more than one arc. Did I misunderstand something?
   
   Sorry @gf2121 -- that is indeed correct: except for the leading 
vLong-encoded "fp + 2 bits", the remainder of the floor data will always be on 
the last arc.  But that leading vLong has those important flags that we were 
losing in the LSB-encoded case.
   
   > As @benwtrent pointed out, we should accumulate from the `outputPrefix` 
instead of `arc.output`. I raised #12900 for this. This patch seems to fix the 
exception when searching `WildcardQuery(new Term("body", "*fo*"))` on 
`Wikibig1m`. I'll try `Wikibigall` as well.
   
   +1 -- this is the right fix (to not lose any leading bytes of the FST's 
output in `IntersectTermsEnum`).  I'll review the PR and open a follow-on 
issue to somehow expose the bug with a stronger BWC test.





[I] Improve BWC tests to reveal #12895 and confirm its fix [lucene]

2023-12-10 Thread via GitHub


mikemccand opened a new issue, #12901:
URL: https://github.com/apache/lucene/issues/12901

   ### Description
   
   Spinoff from #12895 where we inadvertently introduced read-time exceptions 
in `MultiTermQueries` (e.g. `WildcardQuery` `*fo*`) in 9.9.0 when reading 
pre-9.9.0 written indices.
   
   Our `TestBackwardsCompatibility` test should have caught this but clearly 
did not.
   
   Let's:
 1. Make a test revealing #12895 
 2. Confirm the test (and all other tests) pass when applying the proposed 
PR fix #12900
 3. Merge the new test case
 4. Merge the fix in #12900 
   
   Hopefully in that order?
   
   We could simply snapshot the `wikibigall` index into BWC zip files, but 
that's way too massive.  Can we instead make a randomized test that indexes 
fewer-ish terms and reveals the bug?  With a "just so" set of terms, the bug 
should reproduce with a tiny index.  We could maybe iterate to "just so" by 
tweaking the set of terms' prefixes until the on-disk file pointers of the 
blocks share the last 6 bits of their respective pointers?
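   To hunt for that collision condition, one could bucket the candidate block 
file pointers by their low 6 bits; any bucket holding two or more pointers is 
a candidate for triggering the LSB prefix-sharing bug.  A hypothetical helper 
(not an existing Lucene utility) might look like:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LowBitsCollision {
  // Group file pointers by their low 6 bits (the payload bits of the leading
  // LSB vLong byte); buckets with 2+ entries can share that leading byte.
  static Map<Long, List<Long>> byLow6Bits(long[] filePointers) {
    Map<Long, List<Long>> buckets = new HashMap<>();
    for (long fp : filePointers) {
      buckets.computeIfAbsent(fp & 0x3F, k -> new ArrayList<>()).add(fp);
    }
    return buckets;
  }

  public static void main(String[] args) {
    // Hypothetical block file pointers: 64 and 128 share low bits 000000.
    long[] fps = {64, 128, 70, 6};
    Map<Long, List<Long>> buckets = byLow6Bits(fps);
    System.out.println(buckets.get(0L)); // prints [64, 128]
  }
}
```

   A randomized test could keep tweaking terms until such a bucket appears, 
then assert that a wildcard query over those terms still succeeds.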
   
   ### Gradle command to reproduce
   
   _No response_





Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-10 Thread via GitHub


mikemccand commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848936838

   OK I opened #12901 to create a test -- can we do that, first, and then 
confirm #12900 indeed fixes it, then push the test, then push the fix?  I can 
try to work on creating that test case, but won't have much time until early 
tomorrow AM.  If anyone else wants to crack it, feel free!





Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-10 Thread via GitHub


mikemccand commented on PR #12900:
URL: https://github.com/apache/lucene/pull/12900#issuecomment-1848936948

   OK I opened #12901 to create a better BWC test revealing these read-time 
exceptions -- can we do that, first, and then confirm #12900 indeed fixes it, 
then push the test, then push the fix?  I can try to work on creating that test 
case, but won't have much time until early tomorrow AM.  If anyone else wants 
to crack it, feel free!





Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-10 Thread via GitHub


benwtrent commented on issue #12895:
URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848961163

   > and so sorry for introducing this bug!
   
   I am with @mikemccand on this @gf2121!
   
   The optimization proved valuable in benchmarks and all testing indicated it 
was good to go. Mistakes happen. I think the bigger thing is that our testing 
around this part of the code is lacking!
   
   It's awesome to see that a solution was found so quickly!





Re: [PR] Allow FST builder to use different writer (alternative reverse BytesReader) [lucene]

2023-12-10 Thread via GitHub


dungba88 commented on PR #12879:
URL: https://github.com/apache/lucene/pull/12879#issuecomment-1848988292

   Note: We already merged #12624, but this PR can be used as a benchmark 
candidate for https://github.com/apache/lucene/issues/12884





Re: [I] Create a simple JMH benchmark to measure FST compilation / traversal times [lucene]

2023-12-10 Thread via GitHub


dungba88 commented on issue #12884:
URL: https://github.com/apache/lucene/issues/12884#issuecomment-1848988593

   We can use https://github.com/apache/lucene/pull/12879 as a benchmark 
candidate to compare against the current baseline.





Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]

2023-12-10 Thread via GitHub


dungba88 commented on PR #12831:
URL: https://github.com/apache/lucene/pull/12831#issuecomment-184998

   One of the points I'm unsure about in this PR is that there are now 2 
(obscure) ways to construct the FST: one using the DataInput+FSTStore and one 
using the FSTReader returned by the on-heap DataOutput, whereas previously 
there was only 1 way (calling `compile()`). If users don't read the Javadoc, 
they might be confused about how to get the FST. Maybe there could be another 
way that feels more natural.





[I] Where should we stream FST to disk directly? [lucene]

2023-12-10 Thread via GitHub


dungba88 opened a new issue, #12902:
URL: https://github.com/apache/lucene/issues/12902

   ### Description
   
   Most of the use cases with FST seem to eventually write the FST to a 
DataOutput (is it IndexOutput?). In that case we are currently writing the FST 
to an on-heap DataOutput (previously BytesStore and now ReadWriteDataOutput) 
and then saving it to disk.
   
   With #12624 it's possible to write the FST to an on-disk DataOutput 
directly. So maybe let's first compile a list of places which can be migrated 
to the new way?
   
   Note: With the new way, there is a catch: we can't write the metadata in the 
same DataOutput as the main FST body; it has to be written separately.





Re: [I] Where should we stream FST to disk directly? [lucene]

2023-12-10 Thread via GitHub


dungba88 commented on issue #12902:
URL: https://github.com/apache/lucene/issues/12902#issuecomment-1849015292

   A point of reference: Tantivy saves the metadata at the end of the file, 
and it will first jump to the end to learn the size and starting node. But we 
couldn't do that, as one file might contain multiple FSTs.





Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-12-10 Thread via GitHub


dungba88 commented on code in PR #12844:
URL: https://github.com/apache/lucene/pull/12844#discussion_r1421782211


##
lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java:
##
@@ -330,15 +330,36 @@ public static int[] growExact(int[] array, int newLength) 
{
 return copy;
   }
 
+  /**
+   * Returns an array whose size is at least {@code minLength}, generally 
over-allocating
+   * exponentially, but never allocating more than {@code maxLength} elements.
+   */
+  public static int[] growInRange(int[] array, int minLength, int maxLength) {
+    assert minLength >= 0
+        : "length must be positive (got " + minLength + "): likely integer overflow?";
+
+    if (minLength > maxLength) {

Review Comment:
   I think this can be an assert instead, as this class is marked as internal
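   For readers following along, a simplified sketch of the `growInRange` 
semantics being reviewed (the oversizing policy here is an illustrative 1.5x, 
not `ArrayUtil`'s exact `oversize` computation):

```java
import java.util.Arrays;

public class GrowInRangeSketch {
  // Grow exponentially (a simplified 1.5x policy) but never past maxLength,
  // mirroring the contract in the proposed Javadoc.
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    assert minLength >= 0 && minLength <= maxLength;
    if (array.length >= minLength) {
      return array; // already big enough
    }
    int newLength = Math.min(Math.max(minLength, array.length + (array.length >> 1)), maxLength);
    return Arrays.copyOf(array, newLength);
  }

  public static void main(String[] args) {
    // Exponential growth would want 15 elements, but maxLength clamps it to 12:
    System.out.println(growInRange(new int[10], 11, 12).length); // prints 12
  }
}
```

   The clamp is the whole point of the PR: callers that know a hard upper 
bound (like max neighbor counts in HNSW) avoid over-allocating past it.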



##
lucene/core/src/test/org/apache/lucene/util/hnsw/HnswGraphTestCase.java:
##
@@ -757,6 +757,30 @@ public void testRamUsageEstimate() throws IOException {
 long estimated = RamUsageEstimator.sizeOfObject(hnsw);
 long actual = ramUsed(hnsw);
 
+    // The estimation assumes neighbor arrays are always max size.
+    // When this is not true, the estimate can be much larger than the actual value.
+    // In these cases, we compute how much we overestimated the neighbors arrays.
+    if (estimated > actual) {
+      long neighborsError = 0;
+      int numLevels = hnsw.numLevels();
+      for (int level = 0; level < numLevels; ++level) {
+        NodesIterator nodesOnLevel = hnsw.getNodesOnLevel(level);
+        while (nodesOnLevel.hasNext()) {
+          int node = nodesOnLevel.nextInt();
+          NeighborArray neighbors = hnsw.getNeighbors(level, node);
+          long maxNeighborsSize;
+          if (level == 0) {

Review Comment:
   I think generally we would rather avoid having tests duplicate assumptions 
or logic from the production path? This seems to be a specific implementation 
decision that could change independently. I couldn't think of a better way 
though.
   
   But I'm unsure about the need for this newly added code. It seems we only 
compute it in a single test, and we want to have a better estimation? The test 
seems to verify that our over-estimation cannot be more than 30% of the actual 
size. If we provide a better estimation maybe we can lower the tolerance 
threshold?
   
   Still, it seems strange that we would truly need this better estimation, 
yet only in a test.






Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-10 Thread via GitHub


uschindler commented on code in PR #12841:
URL: https://github.com/apache/lucene/pull/12841#discussion_r1421801910


##
lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/GroupVIntBenchmark.java:
##
@@ -140,10 +155,12 @@ public void init() throws Exception {
 }
 initByteBufferInput(docs);
 initArrayInput(docs);
+initNioInput(docs);
+initByteBuffersInput(docs);
   }
 
   @Benchmark
-  public void byteBufferReadVInt(Blackhole bh) throws IOException {
+  public void mmap_byteBufferReadVInt(Blackhole bh) throws IOException {

Review Comment:
   Looks fine.






Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-10 Thread via GitHub


uschindler commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1849057360

   Hi, I was busy this Sunday, sorry for the delay. Will check tomorrow.
   
   In general: My idea is to have a single static utility method (like the 
default one) in the util class that does the *whole* readGroupVInt, but that 
also (additionally) takes a lambda to random-access read the integers, plus a 
parameter with the maximum size available in the caller's context for random 
access reads with the lambda. MemorySegmentIndexInput would pass 
segment.byteSize()-position, while ByteBuffers would pass remaining(). It 
would also take the reference offset of the *actual* position so it knows what 
to add for random-access reads. As return value it would return the new 
position. Both (the reference position before the first block and the return 
value) would be relative, for usage by the lambda. The caller then just calls 
the method with the lambda (possibly wrapping the whole call in the usual 
NPE/ISE try-catch block).
   
   This would allow an easy implementation by 3rd-party directory 
implementations without knowing how readGroupVInt works internally. 3rd-party 
implementations and our own MMap/NIO/... would just call this static method if 
they support random access, or call super() if they can't.
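   A rough sketch of the shape of such a helper with a caller-supplied 
random-access lambda (all names, signatures, and the exact group layout here 
are illustrative assumptions, not Lucene's actual API):

```java
public class GroupVIntSketch {
  /** Hypothetical random-access lambda the caller supplies (e.g. backed by a segment or buffer). */
  @FunctionalInterface
  interface ByteReader {
    int readByte(long pos); // unsigned byte at the given (relative) position
  }

  // Encode 4 ints as one group: a flags byte holding (numBytes - 1) in 2 bits
  // per value, then each value little-endian in its encoded byte count.
  static int writeGroup(int[] vals, byte[] buf, int pos) {
    int flagsPos = pos++;
    int flags = 0;
    for (int i = 0; i < 4; i++) {
      int numBytes = vals[i] == 0 ? 1 : (32 - Integer.numberOfLeadingZeros(vals[i]) + 7) / 8;
      flags |= (numBytes - 1) << (i * 2);
      for (int j = 0; j < numBytes; j++) {
        buf[pos++] = (byte) (vals[i] >>> (j * 8));
      }
    }
    buf[flagsPos] = (byte) flags;
    return pos;
  }

  // Decode one group via the lambda; return the new position, mirroring the
  // proposal that the helper hands back where the caller should continue.
  static long readGroup(ByteReader in, long pos, int[] dst) {
    int flags = in.readByte(pos++);
    for (int i = 0; i < 4; i++) {
      int numBytes = ((flags >>> (i * 2)) & 0x3) + 1;
      int v = 0;
      for (int j = 0; j < numBytes; j++) {
        v |= in.readByte(pos++) << (j * 8);
      }
      dst[i] = v;
    }
    return pos;
  }

  public static void main(String[] args) {
    int[] values = {5, 1000, 70000, 1};
    byte[] buf = new byte[32];
    int end = writeGroup(values, buf, 0);
    int[] decoded = new int[4];
    long newPos = readGroup(p -> buf[(int) p] & 0xFF, 0, decoded);
    System.out.println(newPos == end && java.util.Arrays.equals(values, decoded)); // prints true
  }
}
```

   A real implementation would first check the available-size parameter and 
fall back to the safe byte-at-a-time path when fewer bytes remain than the 
maximum group size, which is the part the proposal wants to centralize.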





Re: [I] Improve BWC tests to reveal #12895 and confirm its fix [lucene]

2023-12-10 Thread via GitHub


jpountz commented on issue #12901:
URL: https://github.com/apache/lucene/issues/12901#issuecomment-1849073645

   IMO one thing to improve here is that we changed file formats for the MSB 
vlong optimization by bumping the file format version number, changing the 
writer to write the new version and adding version checks to the reader. But we 
sometimes also change file formats by forking the code, like we did with 
`Lucene90PostingsFormat` -> `Lucene99PostingsFormat`. A benefit of the latter 
approach is that we keep existing tests in place, specifically 
`BasePostingsFormatTestCase` which has pretty good coverage, probably more than 
what `TestBackwardsCompatibility` will ever be able to check? We should look 
into more systematically forking the code when we do a file format change, or 
figuring out other ways to keep testing prior formats with version bumps, e.g. 
by retaining version checks on the write side and making sure there is a 
separate `BasePostingsFormatTestCase` impl for every version number?





Re: [PR] Jit workaround [lucene]

2023-12-10 Thread via GitHub


ChrisHegarty commented on PR #12903:
URL: https://github.com/apache/lucene/pull/12903#issuecomment-1849079207

   For reviewers or other interested parties. I run the test as follows:
   
   ```
   export CI=true; export x=; while ./gradlew --no-build-cache cleanTest 
:lucene:core:test --rerun --tests 
"org.apache.lucene.util.bkd.TestDocIdsWriter.testCrash"; do echo $x | wc -c; 
export x=x$x; done
   ```
   
   We need the `CI=true` because, by default in our Gradle environment, C2 
compilation is disabled -- and that is where the JIT crash happens.




[PR] Add tests for Lucene90PostingsFormat back [lucene]

2023-12-10 Thread via GitHub


jpountz opened a new pull request, #12904:
URL: https://github.com/apache/lucene/pull/12904

   I just noticed that the move from FOR to PFOR did all the work to make the 
old format (FOR) writeable, but missed keeping an instance of 
`BasePostingsFormatTestCase` for this format.





Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]

2023-12-10 Thread via GitHub


jpountz commented on code in PR #12903:
URL: https://github.com/apache/lucene/pull/12903#discussion_r1421819068


##
lucene/core/src/java/org/apache/lucene/util/MSBRadixSorter.java:
##
@@ -214,6 +214,12 @@ protected int getBucket(int i, int k) {
* @see #buildHistogram
*/
   private int computeCommonPrefixLengthAndBuildHistogram(int from, int to, int 
k, int[] histogram) {
+    int commonPrefixLength = computeInitialCommonPrefixLength(from, k);
+    return computeCommonPrefixLengthAndBuildHistogramPart1(
+        from, to, k, histogram, commonPrefixLength);
+  }
+
+  private int computeInitialCommonPrefixLength(int from, int k) {

Review Comment:
   Can you add a comment here and in other places that you touched, explaining 
that this works around a C2 compiler bug?



##
lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java:
##
@@ -150,4 +154,18 @@ public Relation compare(byte[] minPackedValue, byte[] 
maxPackedValue) {
 }
 dir.deleteFile("tmp");
   }
+
+  // This simple test tickles a JVM JIT crash on JDKs older than 21.0.1.
+  // Needs to be run with C2, so with the environment variable `CI` set.
+  public void testCrash() throws IOException {
+    for (int i = 0; i < 100; i++) {
+      try (Directory dir = newDirectory();
+          IndexWriter iw = new IndexWriter(dir, newIndexWriterConfig(null))) {
+        for (int d = 0; d < 20_000; d++) {
+          iw.addDocument(
+              List.of(new IntPoint("foo", 0), new SortedNumericDocValuesField("bar", 0)));
+        }
+      }
+    }
+  }

Review Comment:
   How long does this test take? If it's multiple seconds, should it only run 
nightly?






Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]

2023-12-10 Thread via GitHub


ChrisHegarty commented on code in PR #12903:
URL: https://github.com/apache/lucene/pull/12903#discussion_r1421825869


##
lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java:
##
@@ -150,4 +154,18 @@ public Relation compare(byte[] minPackedValue, byte[] 
maxPackedValue) {
 }
 dir.deleteFile("tmp");
   }
+
+  // This simple test tickles a JVM JIT crash on JDKs older than 21.0.1.
+  // Needs to be run with C2, so with the environment variable `CI` set.
+  public void testCrash() throws IOException {
+    for (int i = 0; i < 100; i++) {
+      try (Directory dir = newDirectory();
+          IndexWriter iw = new IndexWriter(dir, newIndexWriterConfig(null))) {
+        for (int d = 0; d < 20_000; d++) {
+          iw.addDocument(
+              List.of(new IntPoint("foo", 0), new SortedNumericDocValuesField("bar", 0)));
+        }
+      }
+    }
+  }

Review Comment:
   The test takes around 3.5 seconds on my laptop. I can probably reduce the 
iterations. Or make the number of iterations depend on whether nightly or not? 
I’ll take a look. 








Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]

2023-12-10 Thread via GitHub


ChrisHegarty commented on code in PR #12903:
URL: https://github.com/apache/lucene/pull/12903#discussion_r1421825998


##
lucene/core/src/java/org/apache/lucene/util/MSBRadixSorter.java:
##
@@ -214,6 +214,12 @@ protected int getBucket(int i, int k) {
* @see #buildHistogram
*/
   private int computeCommonPrefixLengthAndBuildHistogram(int from, int to, int 
k, int[] histogram) {
+    int commonPrefixLength = computeInitialCommonPrefixLength(from, k);
+    return computeCommonPrefixLengthAndBuildHistogramPart1(
+        from, to, k, histogram, commonPrefixLength);
+  }
+
+  private int computeInitialCommonPrefixLength(int from, int k) {

Review Comment:
   Yes. I’ll add a comment to that effect.









[I] TransitionAccessor for NFA: get transitions for a given state via random-access leads to wrong results. [lucene]

2023-12-10 Thread via GitHub


Tony-X opened a new issue, #12906:
URL: https://github.com/apache/lucene/issues/12906

   ### Description
   
   As part of https://github.com/apache/lucene/pull/12688, I'm trying to develop an algorithm that can intersect an FST and an FSA efficiently. For the FSA this means we can leverage the sorted transitions of a given state and perform binary search in order to advance to a target.
   
   So the algorithm uses `TransitionAccessor` and there are three relevant APIs
   
   ```
  int initTransition(int state, Transition t);

  /** Iterate to the next transition after the provided one */
  void getNextTransition(Transition t);

  /**
   * Fill the provided {@link Transition} with the index'th transition leaving the specified state.
   */
  void getTransition(int state, int index, Transition t);
   ```
   
   One could call `initTransition` followed by many `getNextTransition` calls to iterate the transitions one by one. Or, as in my case, binary search can use `getTransition` to randomly access the midpoint.
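   To illustrate, the binary search over a state's sorted, disjoint transition ranges could look roughly like this (self-contained sketch; the `int[][] {min, max}` representation and `findTransition` are stand-ins for Lucene's `Transition`/`TransitionAccessor`, not actual Lucene API):

```java
public class TransitionBinarySearch {
  // Each transition is {min, max}; a state's transitions are sorted and disjoint.
  // Returns the index of the transition whose [min, max] range contains label,
  // or -1 if no transition accepts it.
  static int findTransition(int[][] transitions, int label) {
    int lo = 0, hi = transitions.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (label < transitions[mid][0]) {
        hi = mid - 1;
      } else if (label > transitions[mid][1]) {
        lo = mid + 1;
      } else {
        return mid; // transitions[mid][0] <= label <= transitions[mid][1]
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    int[][] transitions = {{0, 42}, {43, 43}, {44, 127}, {194, 223}};
    System.out.println(findTransition(transitions, 100)); // prints 2
    System.out.println(findTransition(transitions, 150)); // prints -1
  }
}
```

   This only works if random access returns the transitions in their sorted order, which is exactly what the bug below breaks.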
   
   I debugged for hours and realized that for a given test case my algorithm works as expected, but the transitions I got were messed up. The original tests use quite complicated automatons, so I tried to find one small and simple enough to exhibit the behavior. Here is the test:
   
   ```
  public void testAutomaton() {
    RegExp regExp = new RegExp("+*.", RegExp.NONE);
    Automaton a = regExp.toAutomaton();
    CompiledAutomaton compiledAutomaton = new CompiledAutomaton(a);
    System.out.println("isFinite: " + compiledAutomaton.finite);

    var byteRunnable = compiledAutomaton.getByteRunnable();
    var transitionAccessor = compiledAutomaton.getTransitionAccessor();
    // dfsAutomaton(byteRunnable, transitionAccessor, 0, "");
    // dumpTransitionsViaNext(byteRunnable, transitionAccessor, 0, new HashSet<>());
    dumpTransitionsViaRA(byteRunnable, transitionAccessor, 0, new HashSet<>());
  }

  void dumpTransitionsViaNext(
      ByteRunnable a, TransitionAccessor transitionAccessor, int currentState, Set<Integer> seenStates) {
    if (seenStates.contains(currentState)) {
      return;
    }

    seenStates.add(currentState);

    var t = new Transition();
    var numStates = transitionAccessor.initTransition(currentState, t);

    for (int i = 0; i < numStates; i++) {
      transitionAccessor.getNextTransition(t);
      System.out.println("At: src: " + t.source + " [" + t.min + ", " + t.max + "] "
          + "dest: " + t.dest + " is dest accept: "
          + (a.isAccept(t.dest) ? "yes" : "no"));
      dumpTransitionsViaNext(a, transitionAccessor, t.dest, seenStates);
    }
  }

  void dumpTransitionsViaRA(
      ByteRunnable a, TransitionAccessor transitionAccessor, int currentState, Set<Integer> seenStates) {
    if (seenStates.contains(currentState)) {
      return;
    }

    seenStates.add(currentState);

    var t = new Transition();
    var numStates = transitionAccessor.initTransition(currentState, t);

    // transitionAccessor.getTransition(currentState, numStates - 1, t);
    for (int i = 0; i < numStates; i++) {
      transitionAccessor.getTransition(currentState, i, t);
      System.out.println("At: src: " + t.source + " [" + t.min + ", " + t.max + "] "
          + "dest: " + t.dest + " is dest accept: "
          + (a.isAccept(t.dest) ? "yes" : "no"));
      dumpTransitionsViaRA(a, transitionAccessor, t.dest, seenStates);
    }
  }
   ```
   
   `dumpTransitionsViaNext` gives expected transitions but 
`dumpTransitionsViaRA` produces a mess. 
   
   
   
   ```
   Via next
   At: src: 0 arcIdx: 0  [0, 42] dest: 1 is dest accept: yes
   At: src: 0 arcIdx: 1  [43, 43] dest: 2 is dest accept: yes
   At: src: 2 arcIdx: 0  [0, 42] dest: 1 is dest accept: yes
   At: src: 2 arcIdx: 1  [43, 43] dest: 2 is dest accept: yes
   At: src: 2 arcIdx: 2  [44, 127] dest: 1 is dest accept: yes
   At: src: 2 arcIdx: 3  [194, 223] dest: 7 is dest accept: no
   At: src: 7 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
   At: src: 7 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
   At: src: 7 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
   At: src: 7 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
   At: src: 2 arcIdx: 4  [224, 239] dest: 8 is dest accept: no
   At: src: 8 arcIdx: 0  [128, 142] dest: 11 is dest accept: no
   At: src: 11 arcIdx: 0  [128, 142] dest: 1 is dest accept: yes
   At: src: 11 arcIdx: 1  [143, 143] dest: 1 is dest accept: yes
   At: src: 11 arcIdx: 2  [144, 190] dest: 1 is dest accept: yes
   At: src: 11 arcIdx: 3  [191, 191] dest: 1 is dest accept: yes
   At: src: 8 arcIdx: 1  [143, 143] dest: 11 is dest accept: no
   At: src: 8 arcIdx: 2  [144, 190] dest: 11 is dest accept: no
   At: src: 8 arcIdx: 3  [191, 191] dest: 11 is dest accept: no
   At: src: 2 arcIdx: 5  [240, 243] dest: 9 is dest accept: no
   ```

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-10 Thread via GitHub


easyice commented on PR #12841:
URL: https://github.com/apache/lucene/pull/12841#issuecomment-1849275229

   > 3rd party implementations and our own MMap/NIO/... would just call this 
static method if they support random access or call super(), if they can't.
   
   Thanks @uschindler, that works! I'm going to do this today, and then you can take a look when you get a chance.





Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]

2023-12-10 Thread via GitHub


dweiss commented on code in PR #12903:
URL: https://github.com/apache/lucene/pull/12903#discussion_r1422008879


##
lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java:
##
@@ -150,4 +154,18 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
     }
     dir.deleteFile("tmp");
   }
+
+  // This simple test tickles a JVM JIT crash on JDK's less than 21.0.1
+  // Needs to be run with C2, so with the environment variable `CI` set

Review Comment:
   If it's triggered by the CI env. variable (and effectively C2 compilation), then perhaps we should check for `System.getProperty("CI")` and assumption-ignore the test if it can't be triggered?
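   A minimal sketch of that guard, assuming the build exposes the flag as a `CI` system property (in a real test this would be JUnit's `assumeTrue`; the stand-in below is plain Java so it stays self-contained):

```java
public class CiAssumption {
  // True when the build sets the CI flag for the test JVM.
  // The exact property name is an assumption here.
  static boolean c2CrashReproducible() {
    return System.getProperty("CI") != null;
  }

  public static void main(String[] args) {
    if (!c2CrashReproducible()) {
      // In a real test: assumeTrue("needs C2 (CI build)", c2CrashReproducible());
      System.out.println("skipping: crash only reproduces with C2 enabled");
      return;
    }
    System.out.println("running crash reproducer");
  }
}
```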






Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]

2023-12-10 Thread via GitHub


dweiss commented on PR #12903:
URL: https://github.com/apache/lucene/pull/12903#issuecomment-1849478590

   > export CI=true; export x=; while ./gradlew --no-build-cache cleanTest 
:lucene:core:test --rerun --tests 
"org.apache.lucene.util.bkd.TestDocIdsWriter.testCrash"; do echo $x | wc -c; 
export x=x$x; done
   
   Just a suggestion that this call to gradle should have a similar outcome and 
be a tad faster (it does fork multiple test JVMs after one compilation - used 
to run in parallel too but not anymore - gradle's constraint).
   ```
   gradlew --no-build-cache :lucene:core:beast -Ptests.dups=100 --tests 
"org.apache.lucene.util.bkd.TestDocIdsWriter.testCrash"
   ```
   
   More info:
   https://github.com/apache/lucene/blob/main/help/tests.txt#L89-L123





Re: [I] Improve BWC tests to reveal #12895 and confirm its fix [lucene]

2023-12-10 Thread via GitHub


jpountz commented on issue #12901:
URL: https://github.com/apache/lucene/issues/12901#issuecomment-1849481891

   > A benefit of the latter approach is that we keep existing tests in place, 
specifically BasePostingsFormatTestCase
   
   Actually, it seems like we lost the test as part of refactorings, I'm adding 
it back in #12904.





[I] jenkins dump file traversal exceptions ("no matches found within 10000") [lucene]

2023-12-10 Thread via GitHub


dweiss opened a new issue, #12907:
URL: https://github.com/apache/lucene/issues/12907

   ### Description
   
   These stack traces at the end of jenkins runs are super annoying:
   ```
   ...
   Archiving artifacts
   hudson.FilePath$ValidateAntFileMask$1Cancel
   at 
hudson.FilePath$ValidateAntFileMask$1.isCaseSensitive(FilePath.java:3300)
   at 
org.apache.tools.ant.DirectoryScanner.lambda$isIncluded$3(DirectoryScanner.java:1374)
   at 
java.base/java.util.stream.MatchOps$1MatchSink.accept(MatchOps.java:90)
   at 
java.base/java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958)
   ...
   at hudson.FilePath$ValidateAntFileMask.hasMatch(FilePath.java:3313)
   Caused: hudson.FilePath$FileMaskNoMatchesFoundException: no matches found 
within 1
   ```
   
   This is a weird path traversal limit in jenkins (to dodge circular link 
structure, I guess) that is configurable via system properties -
   
   
https://www.jenkins.io/doc/book/managing/system-properties/#hudson-filepath-validate_ant_file_mask_bound
   
   I don't think we can tweak it but we can change the default path traversal 
patterns to be more restrictive:
   ```
   No artifacts found that match the file pattern 
"**/*.events,heapdumps/**,**/hs_err_pid*". Configuration error?
   ```
   
   The *.events pattern is no longer used (it used to be emitted from the custom runner in ant). I don't think we can limit the scanning pattern so that it stays under the traversal limit, but I think we can add a test-finalization task to gradle that scans only those folders that _can_ contain hs_err_pid* files and perhaps copies them over to one top-level directory, then add a much simpler pattern to include anything present in that folder?
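   The collect-into-one-folder step could be sketched as below (plain Java on `java.nio.file`; the `HsErrCollector` name and layout are hypothetical, and in practice this logic would live in a Gradle test-finalizer task):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class HsErrCollector {
  // JVM crash dumps are named hs_err_pid<pid>.log
  static boolean isHsErr(String fileName) {
    return fileName.startsWith("hs_err_pid");
  }

  // Walk the given test working directory and copy any crash dumps into a
  // single top-level folder that a simple Jenkins pattern can archive.
  static void collect(Path root, Path dest) throws IOException {
    Files.createDirectories(dest);
    try (Stream<Path> files = Files.walk(root)) {
      for (Path p : files.filter(Files::isRegularFile).toList()) {
        if (isHsErr(p.getFileName().toString())) {
          Files.copy(p, dest.resolve(p.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        }
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(isHsErr("hs_err_pid12345.log")); // prints true
  }
}
```

   Jenkins could then archive just `hs_err_collected/**` instead of scanning the whole workspace.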
   
   ### Version and environment details
   
   _No response_

