Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-12 Thread via GitHub
dungba88 commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423585285 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseKatakanaUppercaseFilter.java: ## @@ -0,0 +1,99 @@ +/* + * Licensed to the Apache Software F

Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-12 Thread via GitHub
dungba88 commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1423583689 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ## @@ -0,0 +1,65 @@ +package org.apache.lucene.analysis.ja; +

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-12-12 Thread via GitHub
s1monw commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1423606084 ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -219,15 +222,33 @@ private Sorter.DocMap maybeSortSegment(SegmentWriteState state) throws IOE

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-12-12 Thread via GitHub
s1monw commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1423637103 ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -219,15 +222,33 @@ private Sorter.DocMap maybeSortSegment(SegmentWriteState state) throws IOE

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1851570021 > Could we consider not changing `MemorySegmentIndexInput` for Java 19 and Java 20? As a preview feature, it seems reasonable that we only do optimizations in higher versions, and the

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-12 Thread via GitHub
gf2121 commented on code in PR #12900: URL: https://github.com/apache/lucene/pull/12900#discussion_r1423704366 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -495,7 +495,7 @@ public boolean seekExact(BytesRef target) throws IOEx

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
uschindler commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1423722848 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersIndexInput.java: ## @@ -205,6 +205,12 @@ public void readLongs(long[] dst, int offset, int length) throw

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
easyice commented on code in PR #12841: URL: https://github.com/apache/lucene/pull/12841#discussion_r1423744250 ## lucene/core/src/java/org/apache/lucene/store/ByteBuffersIndexInput.java: ## @@ -205,6 +205,12 @@ public void readLongs(long[] dst, int offset, int length) throws I

Re: [I] Write a HOWTO migrate Codec format version [lucene]

2023-12-12 Thread via GitHub
shubhamvishu commented on issue #12918: URL: https://github.com/apache/lucene/issues/12918#issuecomment-1851719328 Nice! This would be really helpful! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on code in PR #12900: URL: https://github.com/apache/lucene/pull/12900#discussion_r1423806163 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnum.java: ## @@ -198,6 +204,7 @@ private IntersectTermsEnumFrame pushFrame(int st

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on code in PR #12900: URL: https://github.com/apache/lucene/pull/12900#discussion_r1423808624 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/IntersectTermsEnum.java: ## @@ -198,9 +199,11 @@ private IntersectTermsEnumFrame pushFrame(int s

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on code in PR #12900: URL: https://github.com/apache/lucene/pull/12900#discussion_r1423822953 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -495,7 +495,7 @@ public boolean seekExact(BytesRef target) throws

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on code in PR #12900: URL: https://github.com/apache/lucene/pull/12900#discussion_r1423827461 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnum.java: ## @@ -495,7 +495,7 @@ public boolean seekExact(BytesRef target) throws

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851816034 Since we will shortly respin 9.9.1, should we do this specifically for 9.9.1 now? And leave this issue open for failing the (some) build when the generated FSTs are stale?

[I] [Minor] ForUtil generated file does not conform to spotless [lucene]

2023-12-12 Thread via GitHub
slow-J opened a new issue, #12923: URL: https://github.com/apache/lucene/issues/12923 ### Description ForUtil generated from the gradle task is out of sync with spotless, forcing a run of ./gradlew tidy. This is not a large issue, but the code should be consistent in my opinion.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851827035 > Hmm is there a top-level `gradle` target to do this ... Yes, yes there is: `./gradlew regenerate`.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851829283 I'll open a PR shortly with the regenerated FSTs ... seems only `kuromoji` and `nori` build FSTs.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851845215 Hmm this is a bit of a spooky crab. If I use OpenJDK 17 (`openjdk full version "17.0.9+8"`) to `./gradlew regenerate` on current `branch_9_9_0` I get this horrifying failure:

[PR] #12911: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format [lucene]

2023-12-12 Thread via GitHub
mikemccand opened a new pull request, #12924: URL: https://github.com/apache/lucene/pull/12924 This is the result of top-level `./gradlew regenerate` to rewrite all generated stuff in our source tree. The only resulting `git diff` were the Nori and Kuromoji FST dictionaries. Note th

[PR] Check `Terms#intersect` in CheckIndex. [lucene]

2023-12-12 Thread via GitHub
jpountz opened a new pull request, #12925: URL: https://github.com/apache/lucene/pull/12925 This commit adds coverage of `Terms#intersect` to `CheckIndex` and indexes `LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with real-world data. With this change, `TestLuce

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-12 Thread via GitHub
jpountz commented on PR #12900: URL: https://github.com/apache/lucene/pull/12900#issuecomment-1851928485 FWIW I confirmed that this change makes the new test in #12925 pass.

Re: [PR] IntersectTermsEnum should accumulate from output prefix instead of current output [lucene]

2023-12-12 Thread via GitHub
gf2121 commented on PR #12900: URL: https://github.com/apache/lucene/pull/12900#issuecomment-1851927851 Thanks for the review and great advice, @mikemccand! > I think we should merge this to main, 9x and 99x and let's let CI chew on it for a bit (day or so) before cutting 9.9.1? +1

Re: [I] [Minor] ForUtil generated file does not conform to spotless [lucene]

2023-12-12 Thread via GitHub
slow-J commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1851934034 Ah ok, answering myself: running the command: `./gradlew generateForUtil` will apply spotless

Re: [I] [Minor] ForUtil generated file does not conform to spotless [lucene]

2023-12-12 Thread via GitHub
slow-J commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1851936071 An actual valid nuance related to regeneration: running `./gradlew regenerate` will modify ``` modified: lucene/analysis/kuromoji/src/resources/org/apache/lucene/a

Re: [PR] Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped [lucene]

2023-12-12 Thread via GitHub
gf2121 merged PR #12900: URL: https://github.com/apache/lucene/pull/12900

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851942814 ./gradlew regenerate regenerates everything, everywhere. I think you can be more selective by passing -p (project path).

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851945354 https://github.com/apache/lucene/blob/main/help/regeneration.txt#L64-L76

Re: [PR] Fix failing BaseVectorSimilarityQueryTestCase#testApproximate [lucene]

2023-12-12 Thread via GitHub
benwtrent commented on code in PR #12922: URL: https://github.com/apache/lucene/pull/12922#discussion_r1423943933 ## lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java: ## @@ -255,6 +255,11 @@ static VectorSimilarityScorer fromAcceptDocs(

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1851980193 Thanks @dweiss -- I regenerated everything, but only the two dicts showed up in `git diff`, which I took to be a good sign (we haven't missed regenerating any of the many other things)

Re: [PR] Check `Terms#intersect` in CheckIndex. [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on PR #12925: URL: https://github.com/apache/lucene/pull/12925#issuecomment-1851987006 > With this change, `TestLucene90PostingsFormat` now exhibits #12895. Oooh that's awesome! I'll review.

Re: [PR] Check `Terms#intersect` in CheckIndex. [lucene]

2023-12-12 Thread via GitHub
jpountz merged PR #12925: URL: https://github.com/apache/lucene/pull/12925

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852004695 > Hmm this is a bit of a spooky crab. If I use OpenJDK 17 (`openjdk full version "17.0.9+8"`) to `./gradlew regenerate` on current `branch_9_9_0` I get this horrifying failure:

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1852013823 Hi, I added the implementation for `ByteBufferIndexInput`. Unfortunately, the benchmark shows a slight regression: java17 ``` Benchmark

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852016535 I checked that file, there are no such special characters? Or am I missing something? I only checked the main branch

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852023328 > Can you post in which exact task name this happened? Thanks @uschindler. It happened when I ran `./gradlew regenerate` on 9.9.x branch with OpenJDK 17 (`openjdk full ver

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852025005 I regenerated locally, no issues: ``` openjdk version "17.0.9" 2023-10-17 OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9) OpenJDK 64-Bit Server VM Temurin

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852026627 And you're right -- I don't see any special characters here: https://github.com/apache/lucene/blob/main/gradle/generation/extract-jdk-apis/ExtractJdkApis.java#L192 Not sure

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852033091 I can explain why you see a difference with Java 17 vs 21. Did the job fail on `:lucene:core:generateJdkApiJar21`? If yes, the following happened: All those three tasks are

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852034874 Hmm actually I think there may be a zero-width space character U+200B on this line before the opening (? https://github.com/apache/lucene/blob/branch_9_9/gradle/generation/extra

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852037166 OK I gotta drop off ... will try to root cause this later today.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852036299 This is where it configures the language version: https://github.com/apache/lucene/blob/e0f4321b40fd06a556ff4a11f137a3fc0f67b5bb/gradle/generation/extract-jdk-apis.gradle#L46-L48

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852036564 > may be a zero-width space character on `main` and `branch_9_9`.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852042380 > Hmm actually I think there may be a zero-width space character U+200B on this line before the opening (? https://github.com/apache/lucene/blob/branch_9_9/gradle/generation/extr

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852045851 Let me commit a fix for that file to all 3 branches. It really has a hidden character. And there's a second problem: When invoking the JVM it does not pass a character set

Re: [PR] Fix bug where NFARunAutomaton#getTransition does not set Transition correctly [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12909: URL: https://github.com/apache/lucene/pull/12909#discussion_r1424011845 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java: ## @@ -73,6 +73,37 @@ public void testWithRandomRegex() { } } + public void

Re: [PR] Refactor around NeighborArray [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12910: URL: https://github.com/apache/lucene/pull/12910#discussion_r1424022318 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -34,14 +34,14 @@ public class NeighborArray { private final boolean scoresDescOrder;

Re: [PR] Refactor around NeighborArray [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12910: URL: https://github.com/apache/lucene/pull/12910#discussion_r1424029197 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -201,9 +225,69 @@ private int descSortFindRightMostInsertionPoint(float newScore, int boun

Re: [PR] Refactor around NeighborArray [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12910: URL: https://github.com/apache/lucene/pull/12910#discussion_r1424031699 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -51,45 +51,61 @@ public NeighborArray(int maxSize, boolean descOrder) { */ public vo

[PR] Add a stored fields test that indexes LineFileDocs. [lucene]

2023-12-12 Thread via GitHub
jpountz opened a new pull request, #12927: URL: https://github.com/apache/lucene/pull/12927 Real-world data exhibits patterns that are taken advantage of by the compression logic, but also hardly reproducible in a randomized way. This makes this new test introduce interesting coverage.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852090987 OK, I fixed the file and also enforced UTF-8 when executing the JDK version. The general problem was: If I locally enforce a charset other than UTF-8 (see the sysprops passed to
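The charset dependence described in this comment can be illustrated with a small standalone sketch. It is not the code from `ExtractJdkApis.java` or from the fix commit; the class name `CharsetDecodeDemo` and the sample string are made up for illustration. It only shows that the same bytes containing a zero-width space (U+200B) decode to different text depending on the charset in effect.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: one hidden character, decoded under two charsets. */
final class CharsetDecodeDemo {

  /** Decodes raw source bytes the way a tool would if it used the given charset. */
  static String decode(byte[] sourceBytes, Charset charset) {
    return new String(sourceBytes, charset);
  }

  public static void main(String[] args) {
    // "foo" + ZERO WIDTH SPACE (U+200B) + "(" as stored in a UTF-8 encoded file: 7 bytes.
    byte[] utf8Bytes = "foo\u200B(".getBytes(StandardCharsets.UTF_8);

    String asUtf8 = decode(utf8Bytes, StandardCharsets.UTF_8);
    String asLatin1 = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

    // UTF-8 maps the 3-byte sequence E2 80 8B back to one invisible char.
    System.out.println("UTF-8 chars: " + asUtf8.length()); // prints 5
    // ISO-8859-1 maps every byte to its own char, so 3 stray chars appear.
    System.out.println("ISO-8859-1 chars: " + asLatin1.length()); // prints 7
    // What a JVM uses when no charset is passed explicitly.
    System.out.println("default charset: " + Charset.defaultCharset());
  }
}
```

Passing the charset explicitly, as `decode` does here, removes the dependence on the machine's default, which is the general idea behind enforcing UTF-8.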

[PR] Add forceMerge to test to fix intermittent failure; addresses #12896 [lucene]

2023-12-12 Thread via GitHub
msokolov opened a new pull request, #12928: URL: https://github.com/apache/lucene/pull/12928 (no comment)

Re: [I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]

2023-12-12 Thread via GitHub
msokolov commented on issue #12896: URL: https://github.com/apache/lucene/issues/12896#issuecomment-1852094137 I think it's simply that the test writer chooses to flush randomly, creating two segments instead of one. I was able to fix by adding a call to forceMerge(1). Opened https://github

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852094139 Fixed in 10387f136ffa88ad9e86c526aa52908829a01ad3.

Re: [PR] Refactor around NeighborArray [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12910: URL: https://github.com/apache/lucene/pull/12910#discussion_r1424036233 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -51,45 +51,61 @@ public NeighborArray(int maxSize, boolean descOrder) { */ public vo

Re: [PR] Refactor around NeighborArray [lucene]

2023-12-12 Thread via GitHub
zhaih commented on code in PR #12910: URL: https://github.com/apache/lucene/pull/12910#discussion_r1424037388 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -51,45 +51,61 @@ public NeighborArray(int maxSize, boolean descOrder) { */ public vo

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852101176 Sorry for this. My fault, looks like while copy-pasting Stack Overflow code I introduced the hidden character.

Re: [PR] Add forceMerge to test to fix intermittent failure; addresses #12896 [lucene]

2023-12-12 Thread via GitHub
msokolov commented on PR #12928: URL: https://github.com/apache/lucene/pull/12928#issuecomment-1852113022 heh, got an unrelated test fail there: org.apache.lucene.search.TestByteVectorSimilarityQuery > testApproximate FAILED java.lang.UnsupportedOperationException at

Re: [PR] Add forceMerge to test to fix intermittent failure; addresses #12896 [lucene]

2023-12-12 Thread via GitHub
msokolov merged PR #12928: URL: https://github.com/apache/lucene/pull/12928

Re: [PR] Add forceMerge to test to fix intermittent failure; addresses #12896 [lucene]

2023-12-12 Thread via GitHub
msokolov commented on PR #12928: URL: https://github.com/apache/lucene/pull/12928#issuecomment-1852116447 I also cherry-picked to branch_9x

Re: [I] Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren [lucene]

2023-12-12 Thread via GitHub
msokolov closed issue #12896: Reproducible failure of TestParentBlockJoinByteKnnVectorQuery.testScoringWithMultipleChildren URL: https://github.com/apache/lucene/issues/12896

[PR] Modernize LineFileDocs. [lucene]

2023-12-12 Thread via GitHub
jpountz opened a new pull request, #12929: URL: https://github.com/apache/lucene/pull/12929 This replaces `StringField`/`SortedDocValuesField` with `KeywordField` and `IntPoint`/`NumericDocValuesField` with `IntField`.

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852141369 To come back to the original issue: There is no easy way to fail the build, because the FST files are heavy to generate and stay in the resources folder. It is hard to detect a p

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1852165482 There is no lambda capturing problem. I have no idea why it complains. It really looks fully inlined. It seems that it is not happy about those ByteBuffers at all. `ix()`

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
jpountz commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852167681 > There is no easy way to fail the build, because the FST files are heavy to generate and stay in the resources folder. It is hard to detect a problem if you don't regenerate them,

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852173027 > > There is no easy way to fail the build, because the FST files are heavy to generate and stay in the resources folder. It is hard to detect a problem if you don't regenerate t

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
easyice commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1852253188 Even when I used the copy-code approach (avoiding the lambda, for test purposes), it was only 15%-20% faster, like this: ``` @Override public void readGroupVInts(lo

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1852257293 I have the feeling that for direct buffers (this is what MMap and NIO use), `getInt()` seems more expensive than the sequential reads.

Re: [PR] Move group-varint encoding/decoding logic to DataOutput/DataInput [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12841: URL: https://github.com/apache/lucene/pull/12841#issuecomment-1852261778 If you have an on-heap ByteBuffer (like for ByteBuffersIndexInput), it executes completely different code when reading from the underlying data structure.
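For readers unfamiliar with the group-varint technique this PR thread refers to, here is a minimal sketch of the general idea: four ints share one flag byte that records how many bytes (1 to 4) each value occupies. This is not Lucene's implementation. The class name `GroupVarintSketch`, the flag bit order, and the little-endian byte order are assumptions made for illustration, and the sketch reads bytes one at a time rather than using `getInt()`.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical sketch of group-varint coding for one group of four ints. */
final class GroupVarintSketch {

  /** Number of bytes (1..4) needed to hold v when treated as unsigned. */
  static int byteLength(int v) {
    // (v | 1) avoids the special case v == 0, which still needs one byte.
    return 4 - (Integer.numberOfLeadingZeros(v | 1) >>> 3);
  }

  /** Encodes exactly four ints: one flag byte, then 1..4 bytes per value. */
  static byte[] encode(int[] group) {
    if (group.length != 4) {
      throw new IllegalArgumentException("expected exactly 4 values");
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int flag = 0;
    for (int v : group) {
      flag = (flag << 2) | (byteLength(v) - 1); // two bits per value
    }
    out.write(flag);
    for (int v : group) {
      for (int n = byteLength(v); n > 0; n--) {
        out.write(v & 0xFF); // least significant byte first (assumed order)
        v >>>= 8;
      }
    }
    return out.toByteArray();
  }

  /** Decodes one group produced by encode(). */
  static int[] decode(byte[] in) {
    int flag = in[0] & 0xFF;
    int pos = 1;
    int[] group = new int[4];
    for (int i = 0; i < 4; i++) {
      int n = ((flag >>> (6 - 2 * i)) & 0x3) + 1; // byte count of value i
      int v = 0;
      for (int b = 0; b < n; b++) {
        v |= (in[pos++] & 0xFF) << (8 * b);
      }
      group[i] = v;
    }
    return group;
  }

  public static void main(String[] args) {
    int[] values = {1, 300, 70000, Integer.MAX_VALUE};
    byte[] encoded = encode(values);
    System.out.println(encoded.length); // prints 11 (1 flag + 1 + 2 + 3 + 4)
    System.out.println(Arrays.toString(decode(encoded)));
  }
}
```

Because the flag byte gives all four lengths up front, a decoder needs no per-byte continuation checks, which is where the format's speed usually comes from.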

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-12-12 Thread via GitHub
msokolov commented on code in PR #12829: URL: https://github.com/apache/lucene/pull/12829#discussion_r1424209861 ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -219,15 +222,33 @@ private Sorter.DocMap maybeSortSegment(SegmentWriteState state) throws I

[PR] Writing a HOWTO migrate codec version [lucene]

2023-12-12 Thread via GitHub
slow-J opened a new pull request, #12930: URL: https://github.com/apache/lucene/pull/12930 More detail in https://github.com/apache/lucene/issues/12918. Changing PFOR encoding to FOR for doc blocks in #12741 required bumping the codec version. The codec upgrade process has numerous movi

Re: [I] [Minor] Regeneration investigation [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1852401144 I think it's because of changes in the fst code. So seems legitimate.

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-12-12 Thread via GitHub
s1monw commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1852410786 @mikemccand I agree we should not add this to sort but rather treat it the same way we treat the softDeletes field. It's essentially the same thing from an IW perspective. I will go ahead

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852414564 > Sorry for this. My fault, looks like while copy-pasting Stack Overflow code I introduced the hidden character. Maybe the Apache RAT check should actually scan for those, it'
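A scan for hidden characters of the kind suggested here could look roughly like the sketch below. It is a standalone illustration, not an Apache RAT feature or anything in the Lucene build; the class name `HiddenCharScanner` and the choice to flag all Unicode format characters (category Cf) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical checker that reports invisible characters in source text. */
final class HiddenCharScanner {

  /** True for U+200B and other format characters (Unicode category Cf). */
  static boolean isInvisible(int codePoint) {
    return codePoint == 0x200B || Character.getType(codePoint) == Character.FORMAT;
  }

  /** Returns one "line:column U+XXXX" entry per hit, with 1-based positions. */
  static List<String> scan(String text) {
    List<String> hits = new ArrayList<>();
    int line = 1;
    int column = 1;
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (cp == '\n') {
        line++;
        column = 1;
      } else {
        if (isInvisible(cp)) {
          hits.add(String.format(Locale.ROOT, "%d:%d U+%04X", line, column, cp));
        }
        column++;
      }
      i += Character.charCount(cp); // step over surrogate pairs correctly
    }
    return hits;
  }

  public static void main(String[] args) {
    // A zero-width space hides between "foo" and "(" on the second line.
    String source = "int x = 1;\nfoo\u200B(x);\n";
    scan(source).forEach(System.out::println); // prints 2:4 U+200B
  }
}
```

A build check could run such a scan over each source file after decoding it as UTF-8 and fail when the returned list is non-empty.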

Re: [PR] #12911: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on PR #12924: URL: https://github.com/apache/lucene/pull/12924#issuecomment-1852434257 Thanks @ChrisHegarty and @dweiss -- I'll merge to 9.9.x, 9.x and main shortly. Hmm, rather, I'll regen on the other two branches (not certain FST format is identical everywhere), and

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852429248 > So it may fail in _some_ configurations, e.g. when your default characterset is not fitting the encoding of that special character. Argh! I was worried about this. I mu

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852430721 > > Sorry for this. My fault, looks like while copy-pasting Stack Overflow code I introduced the hidden character. > > Maybe apache rat check should actually scan for

Re: [PR] #12911: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format [lucene]

2023-12-12 Thread via GitHub
rmuir commented on PR #12924: URL: https://github.com/apache/lucene/pull/12924#issuecomment-1852434323 if it regenerates successfully and the test suite passes afterwards (which it seems they do), that's basically it

Re: [PR] #12911: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format [lucene]

2023-12-12 Thread via GitHub
mikemccand merged PR #12924: URL: https://github.com/apache/lucene/pull/12924

Re: [I] [Minor] Regeneration investigation [lucene]

2023-12-12 Thread via GitHub
slow-J commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1852471268 > I think it's because of changes in the fst code. So seems legitimate. Thanks @dweiss, does this mean that we should regenerate these 2 `TokenInfoDictionary`? Or does it s

Re: [I] [Minor] Regeneration investigation [lucene]

2023-12-12 Thread via GitHub
dweiss commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1852482593 See https://github.com/apache/lucene/pull/12924 - I think it's something that fixes your problem. Also, take a look at: https://github.com/apache/lucene/blob/main/help/regeneration

Re: [PR] #12911: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on PR #12924: URL: https://github.com/apache/lucene/pull/12924#issuecomment-1852489631 OK I regenerated these two FSTs on 9.9.x, 9.x, and main.

Re: [PR] Fix failing BaseVectorSimilarityQueryTestCase#testApproximate [lucene]

2023-12-12 Thread via GitHub
kaivalnp commented on code in PR #12922: URL: https://github.com/apache/lucene/pull/12922#discussion_r1424377622 ## lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java: ## @@ -255,6 +255,11 @@ static VectorSimilarityScorer fromAcceptDocs(

Re: [PR] Fix failing BaseVectorSimilarityQueryTestCase#testApproximate [lucene]

2023-12-12 Thread via GitHub
kaivalnp commented on code in PR #12922: URL: https://github.com/apache/lucene/pull/12922#discussion_r1424385137 ## lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java: ## @@ -255,6 +255,11 @@ static VectorSimilarityScorer fromAcceptDocs(

Re: [I] [Minor] Regeneration investigation [lucene]

2023-12-12 Thread via GitHub
slow-J closed issue #12923: [Minor] Regeneration investigation URL: https://github.com/apache/lucene/issues/12923

Re: [I] [Minor] Regeneration investigation [lucene]

2023-12-12 Thread via GitHub
slow-J commented on issue #12923: URL: https://github.com/apache/lucene/issues/12923#issuecomment-1852542026 > See #12924 - I think it's something that fixes your problem. Also, take a look at: > https://github.com/apache/lucene/blob/main/help/regeneration.txt unless you have already - t

[I] Let's run our Monster tests, at least once? [lucene]

2023-12-12 Thread via GitHub
mikemccand opened a new issue, #12932: URL: https://github.com/apache/lucene/issues/12932 ### Description I don't think our monster tests even run? I think we should run them at least for the 9.9.1 release. I'm especially interested in ensuring `Test2BTerms` is happy with our

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852574170 > > Sorry for this. My fault: it looks like I introduced the hidden character while copy-pasting Stack Overflow code. > > Maybe apache rat check should actually scan for

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852574526 > I will commit and push the change to 3 branches. Thanks @uschindler. And thank you @rmuir for helping me fix my home dev box to put the default charset back to UTF-

Re: [PR] Beef up `Terms#intersect` checks in `CheckIndex`. [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on PR #12926: URL: https://github.com/apache/lucene/pull/12926#issuecomment-1852610006 > This found bugs in `DirectPostingsFormat` Whoa, awesome!

Re: [PR] Beef up `Terms#intersect` checks in `CheckIndex`. [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on code in PR #12926: URL: https://github.com/apache/lucene/pull/12926#discussion_r1424442691 ## lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java: ## @@ -1119,12 +1122,14 @@ public DirectIntersectTermsEnum(CompiledAutomaton c

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852640427 > > There is no easy way to fail the build, because the FST files are heavy to generate and stay in the resources folder. It is hard to detect a problem if you don't regenerate t

Re: [I] Require bundled FSTs to be on the current FST version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on issue #12911: URL: https://github.com/apache/lucene/issues/12911#issuecomment-1852647559 > > > There is no easy way to fail the build, because the FST files are heavy to generate and stay in the resources folder. It is hard to detect a problem if you don't regenerate

Re: [PR] Fix bug where NFARunAutomaton#getTransition does not set Transition correctly [lucene]

2023-12-12 Thread via GitHub
Tony-X commented on code in PR #12909: URL: https://github.com/apache/lucene/pull/12909#discussion_r1424472195 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java: ## @@ -73,6 +73,37 @@ public void testWithRandomRegex() { } } + public voi

[PR] Ensure Nori/Kuromoji shipped binary FST is the latest version [lucene]

2023-12-12 Thread via GitHub
mikemccand opened a new pull request, #12933: URL: https://github.com/apache/lucene/pull/12933 Closes #12911 This just adds specific unit tests for the binary FST for Nori's and Kuromoji's `TokenInfoDictionary`. I had to promote some APIs from private -> package private for test vis

Re: [PR] Add new token filters for Japanese sutegana (捨て仮名) [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on code in PR #12915: URL: https://github.com/apache/lucene/pull/12915#discussion_r1424520399 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseHiraganaUppercaseFilter.java: ## @@ -0,0 +1,65 @@ +package org.apache.lucene.analysis.ja;

[I] Should we clean up the few remaining references to `Lucene/Solr`? [lucene]

2023-12-12 Thread via GitHub
slow-J opened a new issue, #12934: URL: https://github.com/apache/lucene/issues/12934 ### Description I was looking at a very old issue for a failing unit test with an ant test command. When I tried to run it in gradle (incorrectly), I was greeted with a message that stood out to me:

Re: [PR] Fix failing BaseVectorSimilarityQueryTestCase#testApproximate [lucene]

2023-12-12 Thread via GitHub
benwtrent commented on code in PR #12922: URL: https://github.com/apache/lucene/pull/12922#discussion_r1424532683 ## lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java: ## @@ -255,6 +255,11 @@ static VectorSimilarityScorer fromAcceptDocs(

Re: [PR] Add BWC test to reveal #12895 [lucene]

2023-12-12 Thread via GitHub
mikemccand commented on PR #12912: URL: https://github.com/apache/lucene/pull/12912#issuecomment-1852755681 Oh thanks for working on this @gf2121 -- sorry that we both did it at the same time!

Re: [PR] Clean up sleep in TestBackwardsCompatibility#testCreateMoreTermsIndex [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12914: URL: https://github.com/apache/lucene/pull/12914#issuecomment-1852773264 > I don't get it -- why aren't we seeing that `TestBackwardsCompatibility` times out every time we run it? The magic is here: https://github.com/apache/lucene/blob/2acf76e9e2f9

Re: [PR] Ensure Nori/Kuromoji shipped binary FST is the latest version [lucene]

2023-12-12 Thread via GitHub
uschindler commented on PR #12933: URL: https://github.com/apache/lucene/pull/12933#issuecomment-1852783783 It would be better to give a more detailed Gradle line like: `./gradlew :lucene:analysis:nori:regenerate` The test is not the nicest looking thing, but I acc
