Re: [PR] Revert "Optimize outputs accumulating for SegmentTermsEnum and InterssectTermsEnum " [lucene]

2023-12-10 Thread via GitHub
uschindler commented on PR #12899: URL: https://github.com/apache/lucene/pull/12899#issuecomment-1848910834 Are we sure that this did not affect already existing indexes, e.g. while merging? -- This is an automated message from the Apache Git Service. To respond to the message, please log

Re: [PR] Revert "Optimize outputs accumulating for SegmentTermsEnum and InterssectTermsEnum " [lucene]

2023-12-10 Thread via GitHub
mikemccand commented on PR #12899: URL: https://github.com/apache/lucene/pull/12899#issuecomment-1848925405 > Are we sure that this did not affect already existing indexes, e.g. while merging? Great question @uschindler ("blast radius") -- I'm pretty sure newly 9.9.0 written indices

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-10 Thread via GitHub
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848934191 OK I think I understand why we see the bug on 9.8.x indices but NOT 9.9.x. indices. And I'm quite sure blast radius of the bug is contained solely to read-time, i.e. newly writt

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-10 Thread via GitHub
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848934726 > > I don't think this assumption is valid @gf2121? Because that floor data first contains the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.

[I] Improve BWC tests to reveal #12895 and confirm its fix [lucene]

2023-12-10 Thread via GitHub
mikemccand opened a new issue, #12901: URL: https://github.com/apache/lucene/issues/12901 ### Description Spinoff from #12895 where we inadvertently introduced read-time exceptions in `MultiTermQueries` (e.g. `WildcardQuery` `*fo*`) in 9.9.0 when reading pre-9.9.0 written indices.

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-10 Thread via GitHub
mikemccand commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848936838 OK I opened #12901 to create a test -- can we do that, first, and then confirm #12900 indeed fixes it, then push the test, then push the fix? I can try to work on creating that

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-10 Thread via GitHub
mikemccand commented on PR #12900: URL: https://github.com/apache/lucene/pull/12900#issuecomment-1848936948 OK I opened #12901 to create a better BWC test revealing these read-time exceptions -- can we do that, first, and then confirm #12900 indeed fixes it, then push the test, then push th

Re: [I] Corruption read on term dictionaries in Lucene 9.9 [lucene]

2023-12-10 Thread via GitHub
benwtrent commented on issue #12895: URL: https://github.com/apache/lucene/issues/12895#issuecomment-1848961163 > and so sorry for introducing this bug! I am with @mikemccand on this @gf2121! The optimization proved valuable in benchmarks and all testing indicated it was good t

Re: [PR] Allow FST builder to use different writer (alternative reverse BytesReader) [lucene]

2023-12-10 Thread via GitHub
dungba88 commented on PR #12879: URL: https://github.com/apache/lucene/pull/12879#issuecomment-1848988292 Note: We already merged #12624 , but this PR can be used as benchmark candidate for https://github.com/apache/lucene/issues/12884 -- This is an automated message from the Apache Git S

Re: [I] Create a simple JMH benchmark to measure FST compilation / traversal times [lucene]

2023-12-10 Thread via GitHub
dungba88 commented on issue #12884: URL: https://github.com/apache/lucene/issues/12884#issuecomment-1848988593 We can use https://github.com/apache/lucene/pull/12879 as a benchmark candidate to compare against the current baseline. -- This is an automated message from the Apache Git Servi

Re: [PR] Make FSTCompiler.compile() to only return the FSTMetadata [lucene]

2023-12-10 Thread via GitHub
dungba88 commented on PR #12831: URL: https://github.com/apache/lucene/pull/12831#issuecomment-184998 One of the point I'm unsure about this PR is that there is now 2 (obscured) ways to construct the FST, one using the DataInput+FSTStore and one using FSTReader returned by the on-heap D

[I] Where should we stream FST to disk directly? [lucene]

2023-12-10 Thread via GitHub
dungba88 opened a new issue, #12902: URL: https://github.com/apache/lucene/issues/12902 ### Description Most of the use cases with FST seems to be writing the FST eventually to a DataOutput (is it IndexOutput?). In that case we are currently writing the FST to an on heap DataOutput (

Re: [I] Where should we stream FST to disk directly? [lucene]

2023-12-10 Thread via GitHub
dungba88 commented on issue #12902: URL: https://github.com/apache/lucene/issues/12902#issuecomment-1849015292 A point for reference, Tantivy saves the metadata to the end of file, and it will first jump to the end to know the size and starting node. But we couldn't do it as one file might

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-12-10 Thread via GitHub
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1421782211 ## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java: ## @@ -330,15 +330,36 @@ public static int[] growExact(int[] array, int newLength) { return copy;

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-10 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1421801910 ## lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/GroupVIntBenchmark.java: ## @@ -140,10 +155,12 @@ public void init() throws Exception { }

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-10 Thread via GitHub
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1849057360 Hi, I was busy this Sunday, sorry for delay. Will check tomorrow. In general: My idea is to have a single static utility method (like the default one) in the util class that doe

Re: [I] Improve BWC tests to reveal #12895 and confirm its fix [lucene]

2023-12-10 Thread via GitHub
jpountz commented on issue #12901: URL: https://github.com/apache/lucene/issues/12901#issuecomment-1849073645 IMO one thing to improve here is that we changed file formats for the MSB vlong optimization by bumping the file format version number, changing the writer to write the new version

Re: [PR] Jit workaround [lucene]

2023-12-10 Thread via GitHub
ChrisHegarty commented on PR #12903: URL: https://github.com/apache/lucene/pull/12903#issuecomment-1849079207 For reviewers or other interested parties. I run the test as follows: ``` export CI=true; export x=; while ./gradlew --no-build-cache cleanTest :lucene:core:test --rerun --

[PR] Add tests for Lucene90PostingsFormat back [lucene]

2023-12-10 Thread via GitHub
jpountz opened a new pull request, #12904: URL: https://github.com/apache/lucene/pull/12904 I just noticed that the move from FOR to PFOR did all the work to make the old format (FOR) writeable, but missed keeping an instance of `BasePostingsFormatTestCase` for this format. -- This is an

Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]

2023-12-10 Thread via GitHub
jpountz commented on code in PR #12903: URL: https://github.com/apache/lucene/pull/12903#discussion_r1421819068 ## lucene/core/src/java/org/apache/lucene/util/MSBRadixSorter.java: ## @@ -214,6 +214,12 @@ protected int getBucket(int i, int k) { * @see #buildHistogram */

Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]

2023-12-10 Thread via GitHub
ChrisHegarty commented on code in PR #12903: URL: https://github.com/apache/lucene/pull/12903#discussion_r1421825869 ## lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java: ## @@ -150,4 +154,18 @@ public Relation compare(byte[] minPackedValue, byte[] maxPacked

Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]

2023-12-10 Thread via GitHub
ChrisHegarty commented on code in PR #12903: URL: https://github.com/apache/lucene/pull/12903#discussion_r1421825998 ## lucene/core/src/java/org/apache/lucene/util/MSBRadixSorter.java: ## @@ -214,6 +214,12 @@ protected int getBucket(int i, int k) { * @see #buildHistogram

[I] TransitionAccessor for NFA: get transitions for a given state via random-access leads to wrong results. [lucene]

2023-12-10 Thread via GitHub
Tony-X opened a new issue, #12906: URL: https://github.com/apache/lucene/issues/12906 ### Description as part of https://github.com/apache/lucene/pull/12688, I'm trying to develop an algorithm that can intersect FST and FSA efficiently. For FSA this means we can leverage the sort

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-10 Thread via GitHub
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1849275229 > 3rd party implementations and our own MMap/NIO/... would just call this static method if they support random access or call super(), if they can't. Thanks @uschindler , it's ok!,

Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]

2023-12-10 Thread via GitHub
dweiss commented on code in PR #12903: URL: https://github.com/apache/lucene/pull/12903#discussion_r1422008879 ## lucene/core/src/test/org/apache/lucene/util/bkd/TestDocIdsWriter.java: ## @@ -150,4 +154,18 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue)

Re: [PR] [branch_9_9] Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash [lucene]

2023-12-10 Thread via GitHub
dweiss commented on PR #12903: URL: https://github.com/apache/lucene/pull/12903#issuecomment-1849478590 > export CI=true; export x=; while ./gradlew --no-build-cache cleanTest :lucene:core:test --rerun --tests "org.apache.lucene.util.bkd.TestDocIdsWriter.testCrash"; do echo $x | wc -c; exp

Re: [I] Improve BWC tests to reveal #12895 and confirm its fix [lucene]

2023-12-10 Thread via GitHub
jpountz commented on issue #12901: URL: https://github.com/apache/lucene/issues/12901#issuecomment-1849481891 > A benefit of the latter approach is that we keep existing tests in place, specifically BasePostingsFormatTestCase Actually, it seems like we lost the test as part of refacto

[I] jenkins dump file traversal exceptions ("no matches found within 10000") [lucene]

2023-12-10 Thread via GitHub
dweiss opened a new issue, #12907: URL: https://github.com/apache/lucene/issues/12907 ### Description These stack traces at the end of jenkins runs are super annoying: ``` ... Archiving artifacts hudson.FilePath$ValidateAntFileMask$1Cancel at hudson.FilePath$Val