Re: [PR] Add support for index sorting with document blocks [lucene]

2023-11-28 Thread via GitHub
s1monw commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1829488114 @mikemccand @jpountz thanks for your ideas. I'd love to flesh this out more before we add anything we write to the index. Today we'd only use this for sorting but if that field can be use

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-28 Thread via GitHub
dungba88 commented on code in PR #12624: URL: https://github.com/apache/lucene/pull/12624#discussion_r1407526003 ## lucene/core/src/java/org/apache/lucene/util/fst/ReadWriteDataOutput.java: ## @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-28 Thread via GitHub
mikemccand commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1829601799 > Tested Test2BFST with -Dtests.seed=D193E7FD4B9E68C4 Duh, I forgot to fix the seed! And the test is indeed random in the inputs it compiles. Sorry for the false alarm :)

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-28 Thread via GitHub
mikemccand commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1829613228 Thanks @dungba88 -- I will catch up with the latest iterations soon. I tested just how much slower the `ByteBuffer` based store is than the FST's `BytesStore`: 9.x: ```

Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-28 Thread via GitHub
mikemccand commented on PR #12624: URL: https://github.com/apache/lucene/pull/12624#issuecomment-1829633268 > More than two orders-of-magnitude (base 10) slower! I wonder: are there other places in Lucene that might fall prey to this performance trap (calling `toDataInput` frequently

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-28 Thread via GitHub
mikemccand commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1407685508 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -867,6 +867,10 @@ public long fstRamBytesUsed() { return fst.ramBytesUsed(); }

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-28 Thread via GitHub
dungba88 commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1407739948 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -867,6 +867,10 @@ public long fstRamBytesUsed() { return fst.ramBytesUsed(); } +

Re: [PR] CheckIndex - Adding a `-level` parameter to give ability to control index check detail programmatically [lucene]

2023-11-28 Thread via GitHub
slow-J commented on PR #12797: URL: https://github.com/apache/lucene/pull/12797#issuecomment-1829926883 Hi @mikemccand thanks for all the comments, addressed them all and now resolved the new merge conflicts! -- This is an automated message from the Apache Git Service. To respond to the m

Re: [PR] Copy collected acc(maxFreqs) into empty acc, rather than merge them. [lucene]

2023-11-28 Thread via GitHub
jpountz commented on PR #12846: URL: https://github.com/apache/lucene/pull/12846#issuecomment-1829969643 Sorry I wasn't clear, I meant to replace entries of the treeset with entries of the other treeset by clearing it first, and then doing an `addAll`.
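A minimal sketch of the clear-then-`addAll` idea described in the comment above, using only `java.util.TreeSet`. The class and method names (`TreeSetReplaceSketch`, `replaceContents`) are hypothetical and chosen for illustration; this is not the code from the PR, and the element type is an assumption.

```java
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not taken from the Lucene code base.
final class TreeSetReplaceSketch {

  // Replaces the contents of target with those of source:
  // clear the target first, then bulk-add the other set's entries.
  // Assumes target and source are different objects; if they were the
  // same instance, clear() would empty both.
  static <T> void replaceContents(TreeSet<T> target, Collection<? extends T> source) {
    target.clear();
    target.addAll(source);
  }

  public static void main(String[] args) {
    TreeSet<Integer> acc = new TreeSet<>(List.of(1, 2, 3));
    TreeSet<Integer> other = new TreeSet<>(List.of(7, 8));
    replaceContents(acc, other);
    System.out.println(acc); // prints [7, 8]
  }
}
```

After the call, the target holds exactly the entries of the other set, and the existing `TreeSet` instance is reused rather than reallocated.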

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-28 Thread via GitHub
jpountz commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1407925571 ## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java: ## @@ -330,6 +330,29 @@ public static int[] growExact(int[] array, int newLength) { return copy;

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-28 Thread via GitHub
benwtrent commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1407980242 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -31,18 +33,21 @@ * * @lucene.internal */ -public class NeighborArray { +public cl

Re: [PR] upgrade to OpenNLP 2.3.1 [lucene]

2023-11-28 Thread via GitHub
cpoerschke commented on code in PR #12674: URL: https://github.com/apache/lucene/pull/12674#discussion_r1407985494 ## lucene/licenses/opennlp-tools-NOTICE.txt: ## @@ -1,11 +1,101 @@ Apache OpenNLP Review Comment: https://github.com/apache/opennlp/blob/opennlp-2.3.1/NOTICE

Re: [PR] upgrade to OpenNLP 2.3.1 [lucene]

2023-11-28 Thread via GitHub
cpoerschke commented on code in PR #12674: URL: https://github.com/apache/lucene/pull/12674#discussion_r1407992397 ## lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPChunkerFilterFactory.java: ## @@ -58,7 +58,7 @@ public class TestOpenNLPChunkerFil

Re: [PR] Make FSTPostingsFormat load FSTs off-heap [lucene]

2023-11-28 Thread via GitHub
mikemccand commented on PR #12552: URL: https://github.com/apache/lucene/pull/12552#issuecomment-1830133688 I think this was mistakenly not backported to 9.x? (I only caught this because I was seeing merge conflicts trying to backport #12803 and saw this. I'll backport shortly -- I think

Re: [PR] upgrade to OpenNLP 2.3.1 [lucene]

2023-11-28 Thread via GitHub
cpoerschke commented on code in PR #12674: URL: https://github.com/apache/lucene/pull/12674#discussion_r1408000809 ## lucene/licenses/slf4j-api-LICENSE-MIT.txt: ## @@ -0,0 +1,24 @@ +Copyright (c) 2004-2022 QOS.ch Sarl (Switzerland) Review Comment: https://github.com/qos-ch/s

Re: [PR] Remove FST constructors with DataInput for metadata [lucene]

2023-11-28 Thread via GitHub
mikemccand commented on PR #12803: URL: https://github.com/apache/lucene/pull/12803#issuecomment-1830145262 This one is also low risk for 9.9.0 -- it's cutting over to a cleaner FST ctor API, and has been baking in main for almost a week. I had meant to backport last week but Turkey interv

Re: [PR] Add support for index sorting with document blocks [lucene]

2023-11-28 Thread via GitHub
msokolov commented on PR #12829: URL: https://github.com/apache/lucene/pull/12829#issuecomment-1830150503 @s1monw that makes sense. I think I was confusing index-time changes and query-time changes. This whole piece of functionality is a little confusing given how loosely coupled these thin

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-28 Thread via GitHub
msokolov commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1408011800 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -31,18 +33,21 @@ * * @lucene.internal */ -public class NeighborArray { +public cla

Re: [PR] Random access term dictionary [lucene]

2023-11-28 Thread via GitHub
mikemccand commented on PR #12688: URL: https://github.com/apache/lucene/pull/12688#issuecomment-1830166075 > This is reasonable as the terms index (FST) holds all the terms. +1, nice! > Fuzzy/Wildcard/Prefix queries got _much slower_ > This is also expected because curr

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-28 Thread via GitHub
msokolov commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1408025348 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -31,18 +33,21 @@ * * @lucene.internal */ -public class NeighborArray { +public cla

[PR] Fix *HnswVectorsFormat.testIndexedValueNotAliased test flakiness [lucene]

2023-11-28 Thread via GitHub
benwtrent opened a new pull request, #12848: URL: https://github.com/apache/lucene/pull/12848 periodic and random merge policies can cause the docs iterated to be in a different order (as they are merged). This commit reduces the randomness of the merge policy for more consistent ve

Re: [PR] CheckIndex - Adding a `-level` parameter to give ability to control index check detail programmatically [lucene]

2023-11-28 Thread via GitHub
mikemccand merged PR #12797: URL: https://github.com/apache/lucene/pull/12797

Re: [I] Make CheckIndex doChecksumsOnly / -fast as default [LUCENE-9984] [lucene]

2023-11-28 Thread via GitHub
mikemccand closed issue #11023: Make CheckIndex doChecksumsOnly / -fast as default [LUCENE-9984] URL: https://github.com/apache/lucene/issues/11023

Re: [PR] Let WordDelimiterGraphFilterFactory propagate ignoreKeywords flag [lucene]

2023-11-28 Thread via GitHub
mikemccand merged PR #12525: URL: https://github.com/apache/lucene/pull/12525

Re: [I] Add support for ignoreKeywords in WordDelimiterGraphFilterFactory [lucene]

2023-11-28 Thread via GitHub
mikemccand closed issue #12522: Add support for ignoreKeywords in WordDelimiterGraphFilterFactory URL: https://github.com/apache/lucene/issues/12522

Re: [PR] Introduce growInRange to reduce array overallocation [lucene]

2023-11-28 Thread via GitHub
zhaih commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1408142129 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -31,18 +33,21 @@ * * @lucene.internal */ -public class NeighborArray { +public class

Re: [PR] Fix *HnswVectorsFormat.testIndexedValueNotAliased test flakiness [lucene]

2023-11-28 Thread via GitHub
ChrisHegarty commented on code in PR #12848: URL: https://github.com/apache/lucene/pull/12848#discussion_r1408143557 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java: ## @@ -732,7 +733,13 @@ public void testIndexedValueNotAliased(

Re: [PR] CheckIndex - Adding a `-level` parameter to give ability to control index check detail programmatically [lucene]

2023-11-28 Thread via GitHub
slow-J commented on PR #12797: URL: https://github.com/apache/lucene/pull/12797#issuecomment-1830400036 > Thanks @slow-J -- looks great! > > This is a 10.0 only change right? I'll merge soon. Thanks! And yes 10.0 only.

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-28 Thread via GitHub
mikemccand commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1408282267 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -867,6 +867,10 @@ public long fstRamBytesUsed() { return fst.ramBytesUsed(); }

Re: [PR] Fix *HnswVectorsFormat.testIndexedValueNotAliased test flakiness [lucene]

2023-11-28 Thread via GitHub
benwtrent merged PR #12848: URL: https://github.com/apache/lucene/pull/12848

[PR] Fix bug in UnescapedCharSequence and add basic unit tests [lucene]

2023-11-28 Thread via GitHub
slow-J opened a new pull request, #12849: URL: https://github.com/apache/lucene/pull/12849 Fixing a basic bug in UnescapedCharSequence https://github.com/apache/lucene/blob/2bb69f3246218dd8176cf92d8064623688c5272c/lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/core/util/U

Re: [PR] Fix bug in UnescapedCharSequence and add basic unit tests [lucene]

2023-11-28 Thread via GitHub
slow-J commented on code in PR #12849: URL: https://github.com/apache/lucene/pull/12849#discussion_r1408514109 ## lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/core/util/UnescapedCharSequence.java: ## @@ -101,7 +90,7 @@ public String toStringEscaped() {

Re: [I] IntTaxonomyFacets chooses dense values array when FacetsCollector has no MatchingDocs [lucene]

2023-11-28 Thread via GitHub
gsmiller commented on issue #12558: URL: https://github.com/apache/lucene/issues/12558#issuecomment-1830955298 OK, came back across this while cleaning up open browser tabs and decided to repro it myself. I know what's going on. It has to do with `#finish` not properly getting called on sid

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-28 Thread via GitHub
dungba88 commented on code in PR #12847: URL: https://github.com/apache/lucene/pull/12847#discussion_r1408573574 ## lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java: ## @@ -867,6 +867,10 @@ public long fstRamBytesUsed() { return fst.ramBytesUsed(); } +

Re: [PR] Copy collected acc(maxFreqs) into empty acc, rather than merge them. [lucene]

2023-11-28 Thread via GitHub
vsop-479 commented on PR #12846: URL: https://github.com/apache/lucene/pull/12846#issuecomment-1831132910 > replace entries of the treeset with entries of the other treeset by clearing it first, and then doing an addAll. Sorry, I am confused about this. If we clear a non-empty treeset,

Re: [I] Multi-value Support for KnnVectorField [lucene]

2023-11-28 Thread via GitHub
david-sitsky commented on issue #12313: URL: https://github.com/apache/lucene/issues/12313#issuecomment-1831197772 > The key issue is document collection. Right now, the `topK` is limited to only `topK` children documents. Really, what you want is the `topK` parent documents based on childr

Re: [I] Virtual threads and Lucene (support async tasks) [lucene]

2023-11-28 Thread via GitHub
Jeevananthan-23 commented on issue #12531: URL: https://github.com/apache/lucene/issues/12531#issuecomment-1831313847 Hi @uschindler, I came across an interesting article on Qdrant vector database that uses io_uring for async and mmap benchmarking. https://qdrant.tech/articles/io_uring/

Re: [PR] Report the time it took for building the FST in Test2BFST [lucene]

2023-11-28 Thread via GitHub
dungba88 commented on PR #12847: URL: https://github.com/apache/lucene/pull/12847#issuecomment-1831363117 The test failed with just `Error: The operation was canceled.` but I can't tell why it happened. The same PR in my local branch works: https://github.com/dungba88/lucene/pull/20