Re: [I] Should we clean up the few remaining references to `Lucene/Solr`? [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12934: URL: https://github.com/apache/lucene/issues/12934#issuecomment-1853399419 Please do, thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific com

Re: [I] Upgrade ant requirement to 1.9 in trunk (6.0) [LUCENE-6739] [lucene]

2023-12-12 Thread via GitHub
dweiss closed issue #7797: Upgrade ant requirement to 1.9 in trunk (6.0) [LUCENE-6739] URL: https://github.com/apache/lucene/issues/7797

Re: [I] Build should statically detect when an invisible unicode character, such as U+200B (zero width space), sneak into our sources [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12931: URL: https://github.com/apache/lucene/issues/12931#issuecomment-1853395780 > If you need a text string to pass to an analyzer that splits on whitespace whatever, why can't we forbid invisible code points? Just use an escaped text string. I'm not again
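
To make the idea in this thread concrete, here is a minimal sketch of a scan for invisible code points in a piece of source text. It is not the error-prone check discussed in the thread and not part of the Lucene build; the class name `InvisibleCharScan` and the method `findInvisible` are hypothetical, and treating every Unicode "format" (Cf) code point as invisible is an assumption made for illustration. The test string uses an escaped U+200B, so the sketch itself stays plain ASCII.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: flags "format" (Cf) code points
// such as U+200B (zero width space) in a string.
public class InvisibleCharScan {

  // Returns the char indices at which an invisible (Cf) code point starts.
  public static List<Integer> findInvisible(String text) {
    List<Integer> hits = new ArrayList<>();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (Character.getType(cp) == Character.FORMAT) {
        hits.add(i);
      }
      i += Character.charCount(cp); // step over surrogate pairs correctly
    }
    return hits;
  }

  public static void main(String[] args) {
    // The zero width space is written as an escape, so it is visible here.
    String line = "ab\u200Bcd";
    for (int index : findInvisible(line)) {
      System.out.printf("invisible code point U+%04X at index %d%n",
          line.codePointAt(index), index);
    }
  }
}
```

A real build check would also have to decide whether comments and string literals are exempt, which is the open question in the comments above.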

Re: [I] Build should statically detect when an invisible unicode character, such as U+200B (zero width space), sneak into our sources [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12931: URL: https://github.com/apache/lucene/issues/12931#issuecomment-1853371617 > > Wouldn't it be better if they were explicitly (\u) encoded though? Especially those invisible ones, which can be really annoying since you can't see them... > > Bu

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1853232196 I agree with you, the getInt() seems more expensive than `MemorySegment`, i prefer to revert the change on `ByteBufferIndexInput`, then the similar improve on java20, java19 (to keep th

Re: [I] Build should statically detect when an invisible unicode character, such as U+200B (zero width space), sneak into our sources [lucene]

2023-12-12 Thread via GitHub
rmuir commented on issue #12931: URL: https://github.com/apache/lucene/issues/12931#issuecomment-1853171320 OK see PR, this check just wasn't enabled because there was stuff to fix. I fixed the spatial3d code so now we can prevent this going forwards.

[PR] enable error-prone's DisableUnicodeInCode check [lucene]

2023-12-12 Thread via GitHub
rmuir opened a new pull request, #12936: URL: https://github.com/apache/lucene/pull/12936 Previously this error-prone check was not enabled, requiring code to be simple ascii, because of violations in spatial3d. I used find-replace on the greek letters there and the tests pass. Close

Re: [I] Build should statically detect when an invisible unicode character, such as U+200B (zero width space), sneak into our sources [lucene]

2023-12-12 Thread via GitHub
rmuir commented on issue #12931: URL: https://github.com/apache/lucene/issues/12931#issuecomment-1853144255 I'm looking into this one: https://errorprone.info/bugpattern/UnicodeInCode This wouldn't cause any problems for comments or literals which is my only concern: but would prevent

[PR] Fixing some potential cases of null value dereference (committing 8 year old patch) [lucene]

2023-12-12 Thread via GitHub
slow-J opened a new pull request, #12935: URL: https://github.com/apache/lucene/pull/12935 Took the patch from https://github.com/apache/lucene/issues/7721, I got `No such file or directory` when running git apply so I manually applied it (with some minor modification). All credit to

Re: [I] Null value dereference [LUCENE-6663] [lucene]

2023-12-12 Thread via GitHub
slow-J commented on issue #7721: URL: https://github.com/apache/lucene/issues/7721#issuecomment-1852844789 The patch would still apply now if the files were in the same directory. Will put this forward in a PR and see whether we should still merge it. What is the best way to give credit to Risha

Re: [I] Upgrade ant requirement to 1.9 in trunk (6.0) [LUCENE-6739] [lucene]

2023-12-12 Thread via GitHub
slow-J commented on issue #7797: URL: https://github.com/apache/lucene/issues/7797#issuecomment-1852787690 I think we can close this one since Ant support was removed from trunk in https://issues.apache.org/jira/browse/LUCENE-9433 :D

Re: [PR] Ensure Nori/Kuromoji shipped binary FST is the latest version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12933: URL: https://github.com/apache/lucene/pull/12933#issuecomment-1852784981 I never run all regeneration - it doesn't work on Windows and takes very long

Re: [PR] Ensure Nori/Kuromoji shipped binary FST is the latest version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12933: URL: https://github.com/apache/lucene/pull/12933#issuecomment-1852783783 It would be better to give a more detailed Gradle line like: ```sh ./gradlew :lucene:analysis:nori:regenerate ``` The test is not the nicest looking thing, but I acc

Re: [PR] Clean up sleep in TestBackwardsCompatibility#testCreateMoreTermsIndex [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12914: URL: https://github.com/apache/lucene/pull/12914#issuecomment-1852773264 > I don't get it -- why aren't we seeing that `TestBackwardsCompatibility` times out every time we run it? The magic is here: https://github.com/apache/lucene/blob/2acf76e9e2f9

Re: [PR] Add BWC test to reveal #12895 [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on PR #12912: URL: https://github.com/apache/lucene/pull/12912#issuecomment-1852755681 Oh thanks for working on this @gf2121 -- sorry that we both did it at the same time!

Re: [PR] Fix failing BaseVectorSimilarityQueryTestCase#testApproximate [lucene]

2023-12-12 Thread via GitHub
benwtrent commented on code in PR #12922: URL: https://github.com/apache/lucene/pull/12922#discussion_r1424532683 ## lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java: ## @@ -255,6 +255,11 @@ static VectorSimilarityScorer fromAcceptDocs(

[I] Should we clean up the few remaining references to `Lucene/Solr`? [lucene]

2023-12-12 Thread via GitHub
slow-J opened a new issue, #12934: URL: https://github.com/apache/lucene/issues/12934 ### Description I was looking at a very old issue for a failing unit test with an ant test command. When I tried to run it in gradle (incorrectly), I was greeted with a message that stood out to me:

Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1424520399 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ## @@ -0,0 +1,65 @@ +package org.apache.lucene.analysis.ja;

[PR] Ensure Nori/Kuromoji shipped binary FST is the latest version [lucene]

2023-12-12 Thread via GitHub
mikemccand opened a new pull request, #12933: URL: https://github.com/apache/lucene/pull/12933 Closes #12911 This just adds specific unit tests for the binary FST for Nori's and Kuromoji's `TokenInfoDictionary`. I had to promote some APIs from private -> package private for test vis

Re: [PR] Fix bug where NFARunAutomaton#getTransition does not set Transition correctly [lucene]

2023-12-12 Thread via GitHub
Tony-X commented on code in PR #12909: URL: https://github.com/apache/lucene/pull/12909#discussion_r1424472195 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java: ## @@ -73,6 +73,37 @@ public void testWithRandomRegex() { } } + public voi

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852647559 > > > There is no easy way to fail the build, because the FST files are heavy to generate and stay in the resources folder. It is hard to detect a problem if you don't regenerate

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852640427 > > There is no easy way to fail the build, because the FST files are heavy to generate and stay in the resources folder. It is hard to detect a problem if you don't regenerate t

Re: [PR] Beef up `Terms#intersect` checks in `CheckIndex`. [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on code in PR #12926: URL: https://github.com/apache/lucene/pull/12926#discussion_r1424442691 ## lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java: ## @@ -1119,12 +1122,14 @@ public DirectIntersectTermsEnum(CompiledAutomaton c

Re: [PR] Beef up `Terms#intersect` checks in `CheckIndex`. [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on PR #12926: URL: https://github.com/apache/lucene/pull/12926#issuecomment-1852610006 > This found bugs in `DirectPostingsFormat` Whoa, awesome!

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852574526 > I will commit an push the change to 3 branches. Thanks @uschindler. And thank you @rmuir for helping me fix my home dev box to put the default charset back to UTF-

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852574170 > > Sorry for this My fault, looks like while copypasting Stackoverflow code I introduced the hidden character. > > Maybe apache rat check should actually scan for

[I] Let's run our Monster tests, at least once? [lucene]

2023-12-12 Thread via GitHub
mikemccand opened a new issue, #12932: URL: https://github.com/apache/lucene/issues/12932 ### Description I don't think our monster tests even run? I think we should run them at least for the 9.9.1 release. I'm especially interested in ensuring `Test2BTerms` is happy with our

Re: [I] [Minor] Regeneration investigation [lucene]

2023-12-12 Thread via GitHub
slow-J commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1852542026 > See #12924 - I think it's something that fixes your problem. Also, take a look at: > https://github.com/apache/lucene/blob/main/help/regeneration.txt unless you have already - t

Re: [I] [Minor] Regeneration investigation [lucene]

2023-12-12 Thread via GitHub
slow-J closed issue #12923: [Minor] Regeneration investigation URL: https://github.com/apache/lucene/issues/12923

Re: [PR] Fix failing BaseVectorSimilarityQueryTestCase#testApproximate [lucene]

2023-12-12 Thread via GitHub
kaivalnp commented on code in PR #12922: URL: https://github.com/apache/lucene/pull/12922#discussion_r1424385137 ## lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java: ## @@ -255,6 +255,11 @@ static VectorSimilarityScorer fromAcceptDocs(

Re: [PR] Fix failing BaseVectorSimilarityQueryTestCase#testApproximate [lucene]

2023-12-12 Thread via GitHub
kaivalnp commented on code in PR #12922: URL: https://github.com/apache/lucene/pull/12922#discussion_r1424377622 ## lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java: ## @@ -255,6 +255,11 @@ static VectorSimilarityScorer fromAcceptDocs(

Re: [PR] #12911: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on PR #12924: URL: https://github.com/apache/lucene/pull/12924#issuecomment-1852489631 OK I regenerated these two FSTs on 9.9.x, 9.x, and main.

Re: [I] [Minor] Regeneration investigation [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1852482593 See https://github.com/apache/lucene/pull/12924 - I think it's something that fixes your problem. Also, take a look at: https://github.com/apache/lucene/blob/main/help/regeneration

Re: [I] [Minor] Regeneration investigation [lucene]

2023-12-12 Thread via GitHub
slow-J commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1852471268 > I think it's because of changes in the fst code. So seems legitimate. Thanks @dweiss, does this mean that we should regenerate these 2 `TokenInfoDictionary`? Or does it s

Re: [PR] #12911: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format [lucene]

2023-12-12 Thread via GitHub
mikemccand merged PR #12924: URL: https://github.com/apache/lucene/pull/12924

Re: [PR] #12911: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format [lucene]

2023-12-12 Thread via GitHub
rmuir commented on PR #12924: URL: https://github.com/apache/lucene/pull/12924#issuecomment-1852434323 if it regenerates successfully and the test suite passes afterwards (which it seems they do), that's basically it

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852430721 > > Sorry for this My fault, looks like while copypasting Stackoverflow code I introduced the hidden character. > > Maybe apache rat check should actually scan for

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852429248 > So it may fail in _some_ configurations, e.g. when your default characterset is not fitting the encoding of that special character. Argh! I was worried about this. I mu
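
The configuration-dependent failure described here can be illustrated with a small sketch. It does not reproduce the actual Gradle task or `ExtractJdkApis.java`; the class name `CharsetMismatchDemo` and its `decode` helper are hypothetical. The sketch only shows that the UTF-8 bytes of U+200B decode to one invisible character under UTF-8 but to three unrelated characters under a single-byte charset such as ISO-8859-1, which is presumably why a non-UTF-8 default charset can turn a hidden character into a visible failure.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not Lucene code: the same bytes read with two charsets.
public class CharsetMismatchDemo {

  // Decodes raw file bytes the way a tool would with a given charset.
  public static String decode(byte[] bytes, Charset charset) {
    return new String(bytes, charset);
  }

  public static void main(String[] args) {
    // A zero width space followed by "(", stored as UTF-8 (4 bytes in total).
    byte[] onDisk = "\u200B(".getBytes(StandardCharsets.UTF_8);

    String asUtf8 = decode(onDisk, StandardCharsets.UTF_8);
    String asLatin1 = decode(onDisk, StandardCharsets.ISO_8859_1);

    System.out.println("bytes on disk:       " + onDisk.length);
    System.out.println("chars as UTF-8:      " + asUtf8.length());
    System.out.println("chars as ISO-8859-1: " + asLatin1.length());
    System.out.println("default charset:     " + Charset.defaultCharset());
  }
}
```

Passing the charset explicitly, as `decode` does, is the general remedy: the result then no longer depends on the platform default.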

Re: [PR] #12911: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on PR #12924: URL: https://github.com/apache/lucene/pull/12924#issuecomment-1852434257 Thanks @ChrisHegarty and @dweiss -- I'll merge to 9.9.x, 9.x and main shortly. Hmm, rather, I'll regen on the other two branches (not certain FST format is identical everywhere), and

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852414564 > Sorry for this My fault, looks like while copypasting Stackoverflow code I introduced the hidden character. Maybe apache rat check should actually scan for those, it'

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-12-12 Thread via GitHub
s1monw commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1852410786 @mikemccand I agree we should not add this to sort but rather treat it the same way we treat the softDeletes field. It's essentially the same thing from an IW perspective. I will go ahead

Re: [I] [Minor] Regeneration investigation [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1852401144 I think it's because of changes in the fst code. So seems legitimate.

[PR] Writing a HOWTO migrate codec version [lucene]

2023-12-12 Thread via GitHub
slow-J opened a new pull request, #12930: URL: https://github.com/apache/lucene/pull/12930 More detail in https://github.com/apache/lucene/issues/12918. Changing PFOR encoding to FOR for doc blocks in #12741, required bumping the codec version. The codec upgrade process has numerous movi

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-12-12 Thread via GitHub
msokolov commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1424209861 ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -219,15 +222,33 @@ private Sorter.DocMap maybeSortSegment(SegmentWriteState state) throws I

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1852261778 If you have an on-heap ByteBuffer (like for ByteBuffersIndexInput), it executes completely different code when reading from the underlying data structure.

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1852257293 I have the feeling that for direct buffers (this is what MMap and NIO use), getInt() is more expensive than the sequential reads.
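
For readers unfamiliar with the technique this PR is about, the following is a minimal sketch of generic group-varint coding. It is an assumption-laden illustration and not the byte layout or API of Lucene's `DataOutput`/`DataInput`: the class `GroupVarintSketch`, its methods, and the little-endian, high-bits-first flag layout are all chosen here for clarity. Four ints share one flag byte that stores each value's byte length minus one in two bits, followed by only the bytes each value needs.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of generic group-varint coding; not Lucene's format.
public class GroupVarintSketch {

  // Number of bytes (1..4) needed to hold v when treated as unsigned.
  public static int numBytes(int v) {
    if ((v & 0xFFFFFF00) == 0) return 1;
    if ((v & 0xFFFF0000) == 0) return 2;
    if ((v & 0xFF000000) == 0) return 3;
    return 4;
  }

  // Encodes exactly four ints: one flag byte, then the values little-endian.
  public static byte[] encode(int[] four) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      flag |= (numBytes(four[i]) - 1) << (6 - 2 * i);
    }
    out.write(flag);
    for (int v : four) {
      int n = numBytes(v);
      for (int b = 0; b < n; b++) {
        out.write((v >>> (8 * b)) & 0xFF);
      }
    }
    return out.toByteArray();
  }

  // Decodes four ints from a buffer produced by encode().
  public static int[] decode(byte[] buf) {
    int flag = buf[0] & 0xFF;
    int pos = 1;
    int[] result = new int[4];
    for (int i = 0; i < 4; i++) {
      int n = ((flag >>> (6 - 2 * i)) & 3) + 1;
      int v = 0;
      for (int b = 0; b < n; b++) {
        v |= (buf[pos++] & 0xFF) << (8 * b);
      }
      result[i] = v;
    }
    return result;
  }

  public static void main(String[] args) {
    byte[] encoded = encode(new int[] {1, 300, 70000, -1});
    System.out.println("encoded length: " + encoded.length);
    System.out.println(java.util.Arrays.toString(decode(encoded)));
  }
}
```

The decoder reads byte by byte here; whether a single wide read such as `getInt()` is faster or slower on a given buffer type is exactly what the benchmarks in this thread are probing.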

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1852253188 Even when I used the copy-code approach (avoiding the lambda, for test purposes), it was only 15%-20% faster, like this: ``` @Override public void readGroupVInts(lo

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852173027 > > There is no easy way to fail the build, because the FST files are heavy to generate and stay in the resources folder. It is hard to detect a problem if you don't regenerate t

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
jpountz commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852167681 > There is no easy way to fail the build, because the FST files are heavy to generate and stay in the resources folder. It is hard to detect a problem if you don't regenerate them,

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1852165482 There is no lambda capturing problem. I have no idea why it complains. It really looks like fully inlined. It seems that it is not happy about those ByteBuffers at all. `ix()`

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852141369 To come back to the original issue: There is no easy way to fail the build, because the FST files are heavy to generate and stay in the resources folder. It is hard to detect a p

[PR] Modernize LineFileDocs. [lucene]

2023-12-12 Thread via GitHub
jpountz opened a new pull request, #12929: URL: https://github.com/apache/lucene/pull/12929 This replaces `StringField`/`SortedDocValuesField` with `KeywordField` and `IntPoint`/`NumericDocValuesField` with `IntField`.

Re: [I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]

2023-12-12 Thread via GitHub
msokolov closed issue #12896: Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren URL: https://github.com/apache/lucene/issues/12896

Re: [PR] Add forceMerge to test to fix intermittent failure; addresses #12896 [lucene]

2023-12-12 Thread via GitHub
msokolov commented on PR #12928: URL: https://github.com/apache/lucene/pull/12928#issuecomment-1852116447 I also cherry-picked to branch_9x

Re: [PR] Add forceMerge to test to fix intermittent failure; addresses #12896 [lucene]

2023-12-12 Thread via GitHub
msokolov merged PR #12928: URL: https://github.com/apache/lucene/pull/12928

Re: [PR] Add forceMerge to test to fix intermittent failure; addresses #12896 [lucene]

2023-12-12 Thread via GitHub
msokolov commented on PR #12928: URL: https://github.com/apache/lucene/pull/12928#issuecomment-1852113022 heh, got an unrelated test fail there: org.apache.lucene.search.TestByteVectorSimilarityQuery > testApproximate FAILED java.lang.UnsupportedOperationException at

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852101176 Sorry for this. My fault, looks like while copy-pasting Stack Overflow code I introduced the hidden character.

Re: [PR] Refactor around NeighborArray [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12910: URL: https://github.com/apache/lucene/pull/12910#discussion_r1424037388 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -51,45 +51,61 @@ public NeighborArray(int maxSize, boolean descOrder) { */ public vo

Re: [PR] Refactor around NeighborArray [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12910: URL: https://github.com/apache/lucene/pull/12910#discussion_r1424036233 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -51,45 +51,61 @@ public NeighborArray(int maxSize, boolean descOrder) { */ public vo

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852094139 Fixed in 10387f136ffa88ad9e86c526aa52908829a01ad3.

Re: [I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]

2023-12-12 Thread via GitHub
msokolov commented on issue #12896: URL: https://github.com/apache/lucene/issues/12896#issuecomment-1852094137 I think it's simply that the test writer chooses to flush randomly, creating two segments instead of one. I was able to fix by adding a call to forceMerge(1). Opened https://github

[PR] Add forceMerge to test to fix intermittent failure; addresses #12896 [lucene]

2023-12-12 Thread via GitHub
msokolov opened a new pull request, #12928: URL: https://github.com/apache/lucene/pull/12928 (no comment)

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852090987 OK, I fixed the file and also enforced UTF-8 when executing the JDK version. The general problem was: If I enforce locally another charset than UTF-8 (see the sysprops passed to

[PR] Add a stored fields test that indexes LineFileDocs. [lucene]

2023-12-12 Thread via GitHub
jpountz opened a new pull request, #12927: URL: https://github.com/apache/lucene/pull/12927 Real-world data exhibits patterns that are taken advantage of by the compression logic, but also hardly reproducible in a randomized way. This makes this new test introduce interesting coverage.

Re: [PR] Refactor around NeighborArray [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12910: URL: https://github.com/apache/lucene/pull/12910#discussion_r1424031699 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -51,45 +51,61 @@ public NeighborArray(int maxSize, boolean descOrder) { */ public vo

Re: [PR] Refactor around NeighborArray [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12910: URL: https://github.com/apache/lucene/pull/12910#discussion_r1424029197 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -201,9 +225,69 @@ private int descSortFindRightMostInsertionPoint(float newScore, int boun

Re: [PR] Refactor around NeighborArray [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12910: URL: https://github.com/apache/lucene/pull/12910#discussion_r1424022318 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -34,14 +34,14 @@ public class NeighborArray { private final boolean scoresDescOrder;

Re: [PR] Fix bug where NFARunAutomaton#getTransition does not set Transition correctly [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12909: URL: https://github.com/apache/lucene/pull/12909#discussion_r1424011845 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java: ## @@ -73,6 +73,37 @@ public void testWithRandomRegex() { } } + public void

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852045851 Let me commit a fix for that file to all 3 branches. It really has a hidden character. And there's a second problem: When invoking the JVM it does not pass a character set

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852042380 > Hmm actually I think there may be a zero-width space character U+200B on this line before the opening (? https://github.com/apache/lucene/blob/branch_9_9/gradle/generation/extr

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852036564 > may be a zero-width space character on `main` and `branch_9_9`.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852036299 This is where it configures the language version: https://github.com/apache/lucene/blob/e0f4321b40fd06a556ff4a11f137a3fc0f67b5bb/gradle/generation/extract-jdk-apis.gradle#L46-L48

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852037166 OK I gotta drop off ... will try to root cause this later today.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852034874 Hmm actually I think there may be a zero-width space character U+200B on this line before the opening (? https://github.com/apache/lucene/blob/branch_9_9/gradle/generation/extra

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852033091 I can explain why you see a difference with Java 17 vs 21. Did the job fail on `:lucene:core:generateJdkApiJar21`? If yes, the following happened: All those three tasks are

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852026627 And you're right -- I don't see any special characters here: https://github.com/apache/lucene/blob/main/gradle/generation/extract-jdk-apis/ExtractJdkApis.java#L192 Not sure

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852025005 I regenerated locally, no issues: ``` openjdk version "17.0.9" 2023-10-17 OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9) OpenJDK 64-Bit Server VM Temurin

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852023328 > Can you post in which exact task name this happened? Thanks @uschindler. It happened when I ran `./gradlew regenerate` on 9.9.x branch with OpenJDK 17 (`openjdk full ver

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852016535 I checked that file; there are no such special characters? Or am I missing something? I only checked the main branch.

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1852013823 Hi, I added the implementation for `ByteBufferIndexInput`. Unfortunately, the benchmark shows a bit of a regression: java17 ``` Benchmark

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852004695 > Hmm this is a bit of a spooky crab. If I use OpenJDK 17 (`openjdk full version "17.0.9+8"`) to `./gradlew regenerate` on current `branch_9_9_0` I get this horrifying failure:

Re: [PR] Check `Terms#intersect` in CheckIndex. [lucene]

2023-12-12 Thread via GitHub
jpountz merged PR #12925: URL: https://github.com/apache/lucene/pull/12925

Re: [PR] Check `Terms#intersect` in CheckIndex. [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on PR #12925: URL: https://github.com/apache/lucene/pull/12925#issuecomment-1851987006 > With this change, `TestLucene90PostingsFormat` now exhibits #12895. Oooh that's awesome! I'll review.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851980193 Thanks @dweiss -- I regenerated everything, but only the two dicts showed up in `git diff`, which I took to be a good sign (we haven't missed regenerating any of the many other things)

Re: [PR] Fix failing BaseVectorSimilarityQueryTestCase#testApproximate [lucene]

2023-12-12 Thread via GitHub
benwtrent commented on code in PR #12922: URL: https://github.com/apache/lucene/pull/12922#discussion_r1423943933 ## lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java: ## @@ -255,6 +255,11 @@ static VectorSimilarityScorer fromAcceptDocs(

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851945354 https://github.com/apache/lucene/blob/main/help/regeneration.txt#L64-L76

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851942814 `./gradlew regenerate` regenerates everything, everywhere. I think you can be more selective by passing `-p` (project path).

Re: [PR] Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped [lucene]

2023-12-12 Thread via GitHub
gf2121 merged PR #12900: URL: https://github.com/apache/lucene/pull/12900

Re: [I] [Minor] ForUtil generated file does not conform to spotless [lucene]

2023-12-12 Thread via GitHub
slow-J commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1851936071 An actual valid nuance related to regeneration: running `./gradlew regenerate` will modify ``` modified: lucene/analysis/kuromoji/src/resources/org/apache/lucene/a

Re: [I] [Minor] ForUtil generated file does not conform to spotless [lucene]

2023-12-12 Thread via GitHub
slow-J commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1851934034 Ah ok, answering myself: running `./gradlew generateForUtil` will apply spotless.

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-12 Thread via GitHub
gf2121 commented on PR #12900: URL: https://github.com/apache/lucene/pull/12900#issuecomment-1851927851 Thanks for the review and great advice, @mikemccand! > I think we should merge this to main, 9x and 99x and let's let CI chew on it for a bit (day or so) before cutting 9.9.1? +1

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-12 Thread via GitHub
jpountz commented on PR #12900: URL: https://github.com/apache/lucene/pull/12900#issuecomment-1851928485 FWIW I confirmed that this change makes the new test in #12925 pass.

[PR] Check `Terms#intersect` in CheckIndex. [lucene]

2023-12-12 Thread via GitHub
jpountz opened a new pull request, #12925: URL: https://github.com/apache/lucene/pull/12925 This commit adds coverage of `Terms#intersect` to `CheckIndex` and indexes `LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with real-world data. With this change, `TestLuce

[PR] #12911: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format [lucene]

2023-12-12 Thread via GitHub
mikemccand opened a new pull request, #12924: URL: https://github.com/apache/lucene/pull/12924 This is the result of running the top-level `./gradlew regenerate` to rewrite all generated files in our source tree. The only resulting `git diff` changes were the Nori and Kuromoji FST dictionaries. Note th

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851845215 Hmm this is a bit of a spooky crab. If I use OpenJDK 17 (`openjdk full version "17.0.9+8"`) to `./gradlew regenerate` on current `branch_9_9_0` I get this horrifying failure:

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851829283 I'll open a PR shortly with the regenerated FSTs ... seems only `kuromoji` and `nori` build FSTs.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851827035 > Hmm is there a top-level `gradle` target to do this ... Yes, yes there is: `./gradlew regenerate`.

[I] [Minor] ForUtil generated file does not conform to spotless [lucene]

2023-12-12 Thread via GitHub
slow-J opened a new issue, #12923: URL: https://github.com/apache/lucene/issues/12923 ### Description ForUtil generated from the gradle task is out of sync with spotless, forcing a run of ./gradlew tidy. This is not a large issue, but the code should be consistent in my opinion.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851816034 Since we will shortly respin 9.9.1, should we do this specifically for 9.9.1 now? And leave this issue open for failing the build (or some build) when the generated FSTs are stale?

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on code in PR #12900: URL: https://github.com/apache/lucene/pull/12900#discussion_r1423827461 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -495,7 +495,7 @@ public boolean seekExact(BytesRef target) throws
