[GitHub] [lucene] benwtrent commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub
benwtrent commented on PR #12529: URL: https://github.com/apache/lucene/pull/12529#issuecomment-1711551199 @msokolov what say you? It seems like encapsulating random vector seeking & scoring into one thing makes the code simpler. -- This is an automated message from the Apache Git Service

[GitHub] [lucene] jpountz commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub
jpountz commented on code in PR #12529: URL: https://github.com/apache/lucene/pull/12529#discussion_r1319777844 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java: ## @@ -423,8 +422,12 @@ public RandomAccessVectorValues c

[GitHub] [lucene] mikemccand commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

2023-09-08 Thread via GitHub
mikemccand commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1711608900 Digging into this a bit, I think I found some silly performance bugs in our current FST impl: * We seem to create a `PagedGrowableWriter` with [page size 128 MB here](https:

[GitHub] [lucene] javanna commented on pull request #12544: Close index readers in tests

2023-09-08 Thread via GitHub
javanna commented on PR #12544: URL: https://github.com/apache/lucene/pull/12544#issuecomment-1711628290 thanks @jpountz !

[GitHub] [lucene] javanna merged pull request #12544: Close index readers in tests

2023-09-08 Thread via GitHub
javanna merged PR #12544: URL: https://github.com/apache/lucene/pull/12544

[GitHub] [lucene] dweiss commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

2023-09-08 Thread via GitHub
dweiss commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1711706053 With regard to automata/ FSTs - they're nearly the same thing, conceptually. Automata are logically transducers producing a constant epsilon value (no value). This knowledge can be u

[GitHub] [lucene] jimczi commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub
jimczi commented on code in PR #12529: URL: https://github.com/apache/lucene/pull/12529#discussion_r1319990727 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java: ## @@ -423,8 +422,12 @@ public RandomAccessVectorValues co

[GitHub] [lucene] jimczi commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub
jimczi commented on code in PR #12529: URL: https://github.com/apache/lucene/pull/12529#discussion_r132915 ## lucene/core/src/java/org/apache/lucene/util/hnsw/RandomVectorScorerProvider.java: ## @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] [lucene] jimczi commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub
jimczi commented on code in PR #12529: URL: https://github.com/apache/lucene/pull/12529#discussion_r1320004511 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapByteVectorValues.java: ## @@ -60,13 +61,17 @@ public int size() { @Override public byte[] vecto

[GitHub] [lucene] jimczi commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub
jimczi commented on code in PR #12529: URL: https://github.com/apache/lucene/pull/12529#discussion_r1320004724 ## lucene/core/src/java/org/apache/lucene/util/hnsw/RandomVectorScorerProvider.java: ## @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] [lucene] jimczi commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-08 Thread via GitHub
jimczi commented on code in PR #12529: URL: https://github.com/apache/lucene/pull/12529#discussion_r1320004018 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsReader.java: ## @@ -42,9 +41,7 @@ import org.apache.lucene.util.Bits;

[GitHub] [lucene] mikemccand commented on pull request #12489: Add support for recursive graph bisection.

2023-09-08 Thread via GitHub
mikemccand commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712097368 @jpountz did you measure any change to index size with the reordered docids?

[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-08 Thread via GitHub
jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712166542 I did. My wikimedium file is sorted by title, which already gives some compression compared to random ordering. Disappointingly, recursive graph bisection only improved compression of pos

[GitHub] [lucene] mikemccand commented on issue #12542: Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality

2023-09-09 Thread via GitHub
mikemccand commented on issue #12542: URL: https://github.com/apache/lucene/issues/12542#issuecomment-1712472912 > We seem to create a PagedGrowableWriter with [page size 128 MB here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java#L34

[GitHub] [lucene] mikemccand opened a new pull request, #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub
mikemccand opened a new pull request, #12545: URL: https://github.com/apache/lucene/pull/12545 The bitsRequired passed during NodeHash rehash (when building an FST) was too small, causing excess/wasted reallocations. This is just a performance bug, especially impacting larger FSTs, but lik

[GitHub] [lucene] mikemccand commented on pull request #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub
mikemccand commented on PR #12545: URL: https://github.com/apache/lucene/pull/12545#issuecomment-1712474813 Tests and precommit passed locally (once) for me ... I'll make sure `Test2BFST` passes once too.

[GitHub] [lucene] mikemccand commented on pull request #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub
mikemccand commented on PR #12545: URL: https://github.com/apache/lucene/pull/12545#issuecomment-1712476087 For the record, this command seems to at least kick off `Test2BFST`: `./gradlew test --max-workers=1 --tests org.apache.lucene.util.fst.Test2BFST -Dtests.nightly=true -Dtests.mo

[GitHub] [lucene] mikemccand commented on pull request #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub
mikemccand commented on PR #12545: URL: https://github.com/apache/lucene/pull/12545#issuecomment-1712497190 OK `Test2BFST` is happy: ``` BUILD SUCCESSFUL in 54m 15s ```

[GitHub] [lucene] stefanvodita opened a new issue, #12546: Compute multiple aggregations in one iteration of the match-set

2023-09-09 Thread via GitHub
stefanvodita opened a new issue, #12546: URL: https://github.com/apache/lucene/issues/12546 ### Description When a user knows that they want multiple different aggregations, they have to iterate the match-set once for each aggregation, which [is inefficient](https://lists.apache.org/

[GitHub] [lucene] stefanvodita opened a new pull request, #12547: Compute multiple float aggregations in one go

2023-09-09 Thread via GitHub
stefanvodita opened a new pull request, #12547: URL: https://github.com/apache/lucene/pull/12547 Usually facets maintain a one-dimensional array indexed by ordinal which keeps the values they're supposed to compute. The change here is simple in principle - use a two-dimensional array, in
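The idea in the snippet above can be illustrated with a minimal sketch. It is not code from the PR and does not use Lucene's facets API; the class name, the `[aggregation][ordinal]` array layout, and the choice of sum and max as the two aggregations are all assumptions made for illustration, since the original description is cut off. The point is only that one pass over the match-set can update several per-ordinal aggregations at once.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Lucene.
final class MultiAggregationSketch {
  static final int SUM = 0; // row holding the per-ordinal sum
  static final int MAX = 1; // row holding the per-ordinal maximum

  /**
   * ordinals[i] is the facet ordinal of the i-th matching document and
   * weights[i] is the value to aggregate for that document.
   * Returns an array laid out as values[aggregation][ordinal] (assumed layout).
   */
  static float[][] aggregate(int[] ordinals, float[] weights, int numOrdinals) {
    float[][] values = new float[2][numOrdinals];
    Arrays.fill(values[MAX], Float.NEGATIVE_INFINITY);
    // Single iteration over the match-set; both aggregations are updated together.
    for (int i = 0; i < ordinals.length; i++) {
      int ord = ordinals[i];
      values[SUM][ord] += weights[i];
      values[MAX][ord] = Math.max(values[MAX][ord], weights[i]);
    }
    return values;
  }

  public static void main(String[] args) {
    float[][] v = aggregate(new int[] {0, 1, 0}, new float[] {1.5f, 2.0f, 3.0f}, 2);
    System.out.println(Arrays.toString(v[SUM]) + " " + Arrays.toString(v[MAX]));
  }
}
```

With a one-dimensional array per aggregation, the loop over matches would have to run once for each aggregation; here it runs once in total.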

[GitHub] [lucene] mikemccand merged pull request #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub
mikemccand merged PR #12545: URL: https://github.com/apache/lucene/pull/12545

[GitHub] [lucene] mikemccand commented on pull request #12545: Fix minor (excess reallocation) performance bug when building FSTs

2023-09-09 Thread via GitHub
mikemccand commented on PR #12545: URL: https://github.com/apache/lucene/pull/12545#issuecomment-1712668955 I backported to 9.x as well: https://github.com/apache/lucene/commit/d70c91134726ff5768c0bcdc7bce51f3fbfcac56

[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-10 Thread via GitHub
jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712779097 Wikibigall. Less space spent on doc values this time since I did not enable indexing of facets. There is a more significant size reduction of postings this time (-10.5%). This is not mis

[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-10 Thread via GitHub
jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712923358 > I wonder why stored fields index size wasn't really hurt nearly as much for wikibigall but was for wikimediumall? This is because wikimedium uses chunks of articles as documents,

[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-10 Thread via GitHub
jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712928445 Regarding positions, the reproducibility paper noted that the algorithm helped term frequencies a bit, though not as much as docs. It doesn't say anything about positions, though I suspe

[GitHub] [lucene] shubhamvishu opened a new pull request, #12548: Add API to compute vector similarity in DoubleValuesSource

2023-09-10 Thread via GitHub
shubhamvishu opened a new pull request, #12548: URL: https://github.com/apache/lucene/pull/12548 ### Description This PR addresses the issue #12394. It adds an API **`similarityToQueryVector`** to `DoubleValuesSource` to compute vector similarity scores between the query vector and t

[GitHub] [lucene] jpountz opened a new pull request, #12549: Run merge-on-full-flush even though no changes got flushed.

2023-09-11 Thread via GitHub
jpountz opened a new pull request, #12549: URL: https://github.com/apache/lucene/pull/12549 Currently, merge-on-full-flush only checks if merges need to run if changes have been flushed to disk. This prevents having different merging logic for refreshes and commits, since the merge pol

[GitHub] [lucene] mikemccand commented on a diff in pull request #12337: Index arbitrary fields in taxonomy docs

2023-09-11 Thread via GitHub
mikemccand commented on code in PR #12337: URL: https://github.com/apache/lucene/pull/12337#discussion_r1321799297 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyIndexReader.java: ## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software F

[GitHub] [lucene] mikemccand commented on a diff in pull request #12337: Index arbitrary fields in taxonomy docs

2023-09-11 Thread via GitHub
mikemccand commented on code in PR #12337: URL: https://github.com/apache/lucene/pull/12337#discussion_r1321802426 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/ReindexingEnrichedDirectoryTaxonomyWriter.java: ## @@ -0,0 +1,103 @@ +/* + * Licensed to the Apa

[GitHub] [lucene] mikemccand commented on pull request #12337: Index arbitrary fields in taxonomy docs

2023-09-11 Thread via GitHub
mikemccand commented on PR #12337: URL: https://github.com/apache/lucene/pull/12337#issuecomment-1714232934 > But as I think about this feature and how do I see it mature over time, I DO think the payload should be given when ingesting the documents Hmm -- I don't think that's great b

[GitHub] [lucene] mikemccand commented on issue #12190: Add "Expression" Facets Implementation

2023-09-11 Thread via GitHub
mikemccand commented on issue #12190: URL: https://github.com/apache/lucene/issues/12190#issuecomment-1714240627 I like this idea -- it's an "aggregation level expression", which computes an expression in "aggregation space", instead of the existing (already supported) document level expres

[GitHub] [lucene] onyxmaster commented on issue #4549: ShingleFilter should handle positionIncrement of zero, e.g. synonyms [LUCENE-3475]

2023-09-11 Thread via GitHub
onyxmaster commented on issue #4549: URL: https://github.com/apache/lucene/issues/4549#issuecomment-1714290760 Hi. Got bitten by this today after a lemmatizer filter produced two variants of a base word at the same position and ShingleFilter produced a "shingle" from these variants, failing

[GitHub] [lucene] jpountz commented on pull request #12490: Reduce the overhead of ImpactsDISI.

2023-09-11 Thread via GitHub
jpountz commented on PR #12490: URL: https://github.com/apache/lucene/pull/12490#issuecomment-1714465729 I plan on merging soon if there are no objections.

[GitHub] [lucene] jpountz commented on pull request #12526: Speed up disjunctions by computing estimations of the score of the k-th top hit up-front.

2023-09-11 Thread via GitHub
jpountz commented on PR #12526: URL: https://github.com/apache/lucene/pull/12526#issuecomment-1714471318 We could. These tasks are a bit malicious as the doc freq is slightly greater than the value of `k=100` so it takes lots of collected matches to find k documents that have this term. I s

[GitHub] [lucene] gokaai commented on a diff in pull request #12530: Fix CheckIndex to detect major corruption with old (not the latest) commit point

2023-09-11 Thread via GitHub
gokaai commented on code in PR #12530: URL: https://github.com/apache/lucene/pull/12530#discussion_r1322006478 ## lucene/core/src/java/org/apache/lucene/index/CheckIndex.java: ## @@ -610,6 +610,39 @@ public Status checkIndex(List onlySegments, ExecutorService executorServ

[GitHub] [lucene] jainankitk commented on issue #12527: Optimize readInts24 performance for DocIdsWriter

2023-09-11 Thread via GitHub
jainankitk commented on issue #12527: URL: https://github.com/apache/lucene/issues/12527#issuecomment-1714517103 > Maybe next we should try 4 readLong() for readInts32? Though I wonder how often in this benchy are we really needing 32 bits to encode the docid deltas in a BKD leaf block?

[GitHub] [lucene] Tony-X closed pull request #12541: Document why we need `lastPosBlockOffset`

2023-09-11 Thread via GitHub
Tony-X closed pull request #12541: Document why we need `lastPosBlockOffset` URL: https://github.com/apache/lucene/pull/12541

[GitHub] [lucene] zhaih commented on issue #11537: StackOverflow when RegExp encounters a very large string [LUCENE-10501]

2023-09-11 Thread via GitHub
zhaih commented on issue #11537: URL: https://github.com/apache/lucene/issues/11537#issuecomment-1715016712 I checked the CHANGES list since the last release and it seems we have a good number of commits already, so let me start a thread about releasing the next version. On Wed, Sep 6, 2023 at

[GitHub] [lucene] jpountz commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-12 Thread via GitHub
jpountz commented on PR #12460: URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715126194 The more I think of this change, the more I like it: most of the time, you would need to read data out of binary doc values, e.g. (variable-length) integers, strings, etc. and exposing b

[GitHub] [lucene] jpountz commented on a diff in pull request #12549: Run merge-on-full-flush even though no changes got flushed.

2023-09-12 Thread via GitHub
jpountz commented on code in PR #12549: URL: https://github.com/apache/lucene/pull/12549#discussion_r1322592471 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriter.java: ## @@ -518,11 +518,10 @@ public void testFlushWithNoMerging() throws IOException { doc.add(n

[GitHub] [lucene] jpountz commented on a diff in pull request #12549: Run merge-on-full-flush even though no changes got flushed.

2023-09-12 Thread via GitHub
jpountz commented on code in PR #12549: URL: https://github.com/apache/lucene/pull/12549#discussion_r1322599113 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriterDelete.java: ## @@ -1315,7 +1315,8 @@ public void testTryDeleteDocument() throws Exception { w.addD

[GitHub] [lucene] iverase commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-12 Thread via GitHub
iverase commented on PR #12460: URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715224914 > I'm contemplating not introducing a new DataInputDocValues class, and instead have a dataInput() method on BinaryDocValues I think this approach defeats one of the main purposes f

[GitHub] [lucene] jpountz commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-12 Thread via GitHub
jpountz commented on PR #12460: URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715238722 > I think this approach defeats one of the main purposes for this change, that is to avoid allocating a byte array when reading doc values. I don't think we want BinaryDocValues to do tha

[GitHub] [lucene] stefanvodita opened a new pull request, #12550: [Demo] Per label association facet fields

2023-09-12 Thread via GitHub
stefanvodita opened a new pull request, #12550: URL: https://github.com/apache/lucene/pull/12550 ### Description A user could have data about facet labels. In the demo here, we record an author's popularity score, with authors being facet labels in an index of books. Today, use

[GitHub] [lucene] stefanvodita commented on pull request #12550: [Demo] Per label association facet fields

2023-09-12 Thread via GitHub
stefanvodita commented on PR #12550: URL: https://github.com/apache/lucene/pull/12550#issuecomment-1715245714 Cancelling right away, this is not meant to be merged.

[GitHub] [lucene] stefanvodita closed pull request #12550: [Demo] Per label association facet fields

2023-09-12 Thread via GitHub
stefanvodita closed pull request #12550: [Demo] Per label association facet fields URL: https://github.com/apache/lucene/pull/12550

[GitHub] [lucene] jpountz commented on pull request #12490: Reduce the overhead of ImpactsDISI.

2023-09-12 Thread via GitHub
jpountz commented on PR #12490: URL: https://github.com/apache/lucene/pull/12490#issuecomment-1715453502 Another benchmark run on the last commit to make sure it still works as expected, and wikibigall this time instead of wikimedium10m: ``` TaskQPS base

[GitHub] [lucene] jimczi commented on pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-12 Thread via GitHub
jimczi commented on PR #12529: URL: https://github.com/apache/lucene/pull/12529#issuecomment-1715484871 Given that no further concerns have been raised, I am intending to merge this change soon.

[GitHub] [lucene] stefanvodita commented on a diff in pull request #12337: Index arbitrary fields in taxonomy docs

2023-09-12 Thread via GitHub
stefanvodita commented on code in PR #12337: URL: https://github.com/apache/lucene/pull/12337#discussion_r1322872602 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyIndexReader.java: ## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software

[GitHub] [lucene] stefanvodita commented on pull request #12337: Index arbitrary fields in taxonomy docs

2023-09-12 Thread via GitHub
stefanvodita commented on PR #12337: URL: https://github.com/apache/lucene/pull/12337#issuecomment-1715512722 Thank you for the review @mikemccand! I’ve integrated your feedback. Updatable doc values are definitely something to consider. For comparison, I coded up an [association facet fi

[GitHub] [lucene] uschindler commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-12 Thread via GitHub
uschindler commented on PR #12460: URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715514900 > This has been a challenge so many times in the past, maybe it's time to add `seek()` support to `DataInput`? We have full random access (positional reads), if you extend the i

[GitHub] [lucene] jpountz commented on a diff in pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-12 Thread via GitHub
jpountz commented on code in PR #12529: URL: https://github.com/apache/lucene/pull/12529#discussion_r1322897603 ## lucene/core/src/java/org/apache/lucene/util/hnsw/RandomVectorScorerProvider.java: ## @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] [lucene] uschindler commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-12 Thread via GitHub
uschindler commented on PR #12460: URL: https://github.com/apache/lucene/pull/12460#issuecomment-1715550666 To save more memory copies, the codec may use a slice from the underlying IndexInput directly to support both access apis. All file pointer checks would then be performed by the low l

[GitHub] [lucene] mikemccand merged pull request #12541: Document why we need `lastPosBlockOffset`

2023-09-12 Thread via GitHub
mikemccand merged PR #12541: URL: https://github.com/apache/lucene/pull/12541

[GitHub] [lucene] mikemccand commented on pull request #12541: Document why we need `lastPosBlockOffset`

2023-09-12 Thread via GitHub
mikemccand commented on PR #12541: URL: https://github.com/apache/lucene/pull/12541#issuecomment-1715559983 I backported to 9.x as well ... annoying that GitHub doesn't state in summary that the above push was to 9.x (it's only reflected here because it referenced this PR). It does reflect

[GitHub] [lucene] jimczi merged pull request #12529: Introduce a random vector scorer in HNSW builder/searcher

2023-09-12 Thread via GitHub
jimczi merged PR #12529: URL: https://github.com/apache/lucene/pull/12529

[GitHub] [lucene] jpountz merged pull request #12490: Reduce the overhead of ImpactsDISI.

2023-09-12 Thread via GitHub
jpountz merged PR #12490: URL: https://github.com/apache/lucene/pull/12490

[GitHub] [lucene] jimczi opened a new pull request, #12551: Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery

2023-09-12 Thread via GitHub
jimczi opened a new pull request, #12551: URL: https://github.com/apache/lucene/pull/12551 This PR introduces a new parameter known as 'efSearch' to the knn vector query. 'efSearch' governs the maximum size of the priority queue employed for nearest neighbor searches. As each segment may co
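As a rough illustration of what "maximum size of the priority queue" means in the description above, the following sketch keeps only the best `efSearch` scores seen so far. It is a simplified stand-in built on `java.util.PriorityQueue`, not Lucene's HNSW searcher or the code in this PR; the class and method names are hypothetical.

```java
import java.util.PriorityQueue;

// Hypothetical sketch; not Lucene's HNSW implementation.
final class BoundedCandidateQueueSketch {
  private final int efSearch;
  // Min-heap on score, so the worst retained candidate sits at the head.
  private final PriorityQueue<Float> heap = new PriorityQueue<>();

  BoundedCandidateQueueSketch(int efSearch) {
    if (efSearch <= 0) {
      throw new IllegalArgumentException("efSearch must be positive");
    }
    this.efSearch = efSearch;
  }

  /** Offers a candidate score; returns true if it was retained. */
  boolean offer(float score) {
    if (heap.size() < efSearch) {
      heap.add(score);
      return true;
    }
    if (score > heap.peek()) {
      heap.poll(); // evict the current worst candidate
      heap.add(score);
      return true;
    }
    return false; // not competitive with the efSearch best seen so far
  }

  int size() {
    return heap.size();
  }

  /** Lowest retained score; only valid once at least one score was offered. */
  float worstKept() {
    return heap.peek();
  }
}
```

A larger bound retains more candidates during the search, which generally trades extra work for better recall; a smaller bound does the opposite.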

[GitHub] [lucene] Tony-X opened a new pull request, #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-12 Thread via GitHub
Tony-X opened a new pull request, #12552: URL: https://github.com/apache/lucene/pull/12552 ### Description FSTs have supported off-heap loading for a while. As we were trying to use `FSTPostingsFormat` for some fields, we noticed that heap usage increased. Upon further investigation we reali

[GitHub] [lucene] msokolov commented on a diff in pull request #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-12 Thread via GitHub
msokolov commented on code in PR #12552: URL: https://github.com/apache/lucene/pull/12552#discussion_r1323494538 ## lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsReader.java: ## @@ -191,7 +193,9 @@ final class TermsReader extends Terms { this.sumTotalTerm

[GitHub] [lucene] Tony-X commented on a diff in pull request #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-12 Thread via GitHub
Tony-X commented on code in PR #12552: URL: https://github.com/apache/lucene/pull/12552#discussion_r1323531587 ## lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsReader.java: ## @@ -191,7 +193,9 @@ final class TermsReader extends Terms { this.sumTotalTermFr

[GitHub] [lucene] Tony-X commented on issue #12536: Remove `lastPosBlockOffset` from term metadata for Lucene90PostingsFormat

2023-09-12 Thread via GitHub
Tony-X commented on issue #12536: URL: https://github.com/apache/lucene/issues/12536#issuecomment-1716406470 https://github.com/apache/lucene/pull/12541 is merged and I'll close this one

[GitHub] [lucene] Tony-X closed issue #12536: Remove `lastPosBlockOffset` from term metadata for Lucene90PostingsFormat

2023-09-12 Thread via GitHub
Tony-X closed issue #12536: Remove `lastPosBlockOffset` from term metadata for Lucene90PostingsFormat URL: https://github.com/apache/lucene/issues/12536

[GitHub] [lucene] shubhamvishu commented on pull request #12183: Make some heavy query rewrites concurrent

2023-09-12 Thread via GitHub
shubhamvishu commented on PR #12183: URL: https://github.com/apache/lucene/pull/12183#issuecomment-1716957965 @jpountz I have made some changes to `TermStates#build` to unblock this PR and avoid the deadlock caused by the executor forking into itself, by checking if it's a `Thre

[GitHub] [lucene] javanna commented on a diff in pull request #12183: Make some heavy query rewrites concurrent

2023-09-13 Thread via GitHub
javanna commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1324086202 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf contex

[GitHub] [lucene] javanna commented on a diff in pull request #12183: Make some heavy query rewrites concurrent

2023-09-13 Thread via GitHub
javanna commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1324085210 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf contex

[GitHub] [lucene] javanna commented on a diff in pull request #12183: Make some heavy query rewrites concurrent

2023-09-13 Thread via GitHub
javanna commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1324087271 ## lucene/CHANGES.txt: ## @@ -232,11 +172,6 @@ Other * GITHUB#12410: Refactor vectorization support (split provider from implementation classes). (Uwe Schindler,

[GitHub] [lucene] javanna commented on a diff in pull request #12183: Make some heavy query rewrites concurrent

2023-09-13 Thread via GitHub
javanna commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1324093739 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf contex

[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12183: Make some heavy query rewrites concurrent

2023-09-13 Thread via GitHub
shubhamvishu commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1324225960 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf c

[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
uschindler commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1324259855 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf con

[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-13 Thread via GitHub
jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1717341776 I just found a bug that in practice only made BP run one iteration per level; fixing it makes performance better (wikibigall): ``` TaskQPS baseline

[GitHub] [lucene] javanna commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
javanna commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1324466373 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf contex

[GitHub] [lucene] Shradha26 opened a new issue, #12553: [DISCUSS] Identifying Gaps in Lucene’s Faceting

2023-09-13 Thread via GitHub
Shradha26 opened a new issue, #12553: URL: https://github.com/apache/lucene/issues/12553 I’d like to gather a list of areas where Lucene’s support for aggregations can be improved and discuss if faceting can be augmented to offer that support or if it would need to be separate functionality

[GitHub] [lucene] jmazanec15 commented on issue #12533: Init HNSW merge with graph containing deleted documents

2023-09-13 Thread via GitHub
jmazanec15 commented on issue #12533: URL: https://github.com/apache/lucene/issues/12533#issuecomment-1717981259 Additionally, the [FreshDiskANN](https://arxiv.org/pdf/2105.09613.pdf) paper did some work in this space. They ran a test for NSG where they iteratively repeat the following proc

[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
uschindler commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1324955170 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -64,4 +68,57 @@ final List invokeAll(Collection> tasks) throws IOExcept } retu

[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
shubhamvishu commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1325027898 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf c

[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
shubhamvishu commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1325031509 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -64,4 +68,57 @@ final List invokeAll(Collection> tasks) throws IOExcept } re

[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
uschindler commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1325065806 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf con

[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
uschindler commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1325066260 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf con

[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
shubhamvishu commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1325085377 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf c

[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
uschindler commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1325092791 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf con

[GitHub] [lucene] uschindler commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
uschindler commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1325097710 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf con

[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12183: Make TermStates#build concurrent

2023-09-13 Thread via GitHub
shubhamvishu commented on code in PR #12183: URL: https://github.com/apache/lucene/pull/12183#discussion_r1325111790 ## lucene/core/src/java/org/apache/lucene/index/TermStates.java: ## @@ -86,19 +93,58 @@ public TermStates( * @param needsStats if {@code true} then all leaf c

[GitHub] [lucene] jpountz commented on pull request #12526: Speed up disjunctions by computing estimations of the score of the k-th top hit up-front.

2023-09-14 Thread via GitHub
jpountz commented on PR #12526: URL: https://github.com/apache/lucene/pull/12526#issuecomment-1718893926 FYI there was an interesting observation on another benchmark that took advantage of recursive graph bisection: https://jpountz.github.io/lucene-9.7-vs-9.8/. One query (`the incredibles`

[GitHub] [lucene] gokaai opened a new pull request, #12554: Allow FilteredDocIdSetIterator.match(doc) to throw IOException

2023-09-14 Thread via GitHub
gokaai opened a new pull request, #12554: URL: https://github.com/apache/lucene/pull/12554 ### Description Allows `org.apache.lucene.search.FilteredDocIdSetIterator#match(doc)` to throw an IOException so that users don't have to explicitly catch it Closes #12492

[GitHub] [lucene] mikemccand commented on a diff in pull request #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-14 Thread via GitHub
mikemccand commented on code in PR #12552: URL: https://github.com/apache/lucene/pull/12552#discussion_r1325827523 ## lucene/codecs/src/java/org/apache/lucene/codecs/memory/FSTTermsReader.java: ## @@ -191,7 +193,9 @@ final class TermsReader extends Terms { this.sumTotalTe

[GitHub] [lucene] mikemccand commented on pull request #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-14 Thread via GitHub
mikemccand commented on PR #12552: URL: https://github.com/apache/lucene/pull/12552#issuecomment-1719297920 @Tony-X have you tried passing all Lucene unit tests using this Codec? I think you can add `-Dtests.codec=...` to force all tests to use it.

[GitHub] [lucene] jpountz commented on pull request #12554: Allow FilteredDocIdSetIterator.match(doc) to throw IOException

2023-09-14 Thread via GitHub
jpountz commented on PR #12554: URL: https://github.com/apache/lucene/pull/12554#issuecomment-1719334101 Looks great, can you add a CHANGES entry under "Lucene 9.8.0" / "API Changes"?

[GitHub] [lucene] jimczi commented on pull request #12551: Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery

2023-09-14 Thread via GitHub
jimczi commented on PR #12551: URL: https://github.com/apache/lucene/pull/12551#issuecomment-1719529457 I made some adjustments to the formula to consider the logarithmic complexity of the greedy search. I conducted tests on two datasets: 1. The standard SIFT dataset, which has 128 d

[GitHub] [lucene] epotyom opened a new pull request, #12555: Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167)

2023-09-14 Thread via GitHub
epotyom opened a new pull request, #12555: URL: https://github.com/apache/lucene/pull/12555 Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167) TermsDict `ord` and `bytes` can be out of sync after a call to seekCeil which caused test fai

[GitHub] [lucene] jpountz merged pull request #12554: Allow FilteredDocIdSetIterator.match(doc) to throw IOException

2023-09-14 Thread via GitHub
jpountz merged PR #12554: URL: https://github.com/apache/lucene/pull/12554

[GitHub] [lucene] jpountz closed issue #12492: Allow FilteredDocIdSetIterator.match(doc) to throw IOException

2023-09-14 Thread via GitHub
jpountz closed issue #12492: Allow FilteredDocIdSetIterator.match(doc) to throw IOException URL: https://github.com/apache/lucene/issues/12492

[GitHub] [lucene] jpountz merged pull request #12489: Add support for recursive graph bisection.

2023-09-14 Thread via GitHub
jpountz merged PR #12489: URL: https://github.com/apache/lucene/pull/12489

[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-14 Thread via GitHub
jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1719763923 Since it's fairly unintrusive to other functionality, I felt free to merge.

[GitHub] [lucene] jpountz commented on pull request #12489: Add support for recursive graph bisection.

2023-09-14 Thread via GitHub
jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1719763914 Since it's fairly unintrusive to other functionality, I felt free to merge.

[GitHub] [lucene] Tony-X commented on pull request #12552: Make FSTPostingsFormat load FSTs off-heap

2023-09-14 Thread via GitHub
Tony-X commented on PR #12552: URL: https://github.com/apache/lucene/pull/12552#issuecomment-1719878383 @mikemccand hey Mike, I did not make a new Codec for this. IIRC, `FSTPostingsFormat` will be exercised by the RandomCodec. Also there is `TestFSTPostingsFormat extends BasePostingsFormatT

[GitHub] [lucene] epotyom commented on pull request #12555: Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167)

2023-09-14 Thread via GitHub
epotyom commented on PR #12555: URL: https://github.com/apache/lucene/pull/12555#issuecomment-1719935323 Extended existing nightly random tests to catch the issue most of the time. Would that be enough or do we need a test that catches it every single time?

[GitHub] [lucene] benwtrent commented on pull request #12551: Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery

2023-09-14 Thread via GitHub
benwtrent commented on PR #12551: URL: https://github.com/apache/lucene/pull/12551#issuecomment-1720048714 @jimczi I like this idea at first glance, but I have one major concern. What about data that is indexed according to a specific order? Two tests to verify how this behaves would

[GitHub] [lucene] jimczi commented on pull request #12551: Introduce dynamic segment efSearch to Knn{Byte|Float}VectorQuery

2023-09-14 Thread via GitHub
jimczi commented on PR #12551: URL: https://github.com/apache/lucene/pull/12551#issuecomment-1720078533 Adding some charts together to compare how effective it is to use a dynamic efSearch. The first chart shows how well different efSearch values work on one segment, on multiple segm

[GitHub] [lucene] zhaih commented on a diff in pull request #12555: Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167)

2023-09-14 Thread via GitHub
zhaih commented on code in PR #12555: URL: https://github.com/apache/lucene/pull/12555#discussion_r1326538550 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java: ## @@ -1205,7 +1205,15 @@ public SeekStatus seekCeil(BytesRef text) throws IOE
