[GitHub] [lucene] jimczi commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub
jimczi commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334738549 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Founda

[GitHub] [lucene] jimczi commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub
jimczi commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334746185 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Founda

[GitHub] [lucene] jimczi commented on a diff in pull request #12582: Add new int8 scalar quantization to HNSW codec

2023-09-22 Thread via GitHub
jimczi commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1334758274 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Founda

[GitHub] [lucene] stefanvodita opened a new issue, #12585: Is it correct for facets to assume positive aggregation values?

2023-09-23 Thread via GitHub
stefanvodita opened a new issue, #12585: URL: https://github.com/apache/lucene/issues/12585 ### Description In `IntTaxonomyFacets` and `FloatTaxonomyFacets` when we `getTopChildrenForPath` we maintain a priority queue of aggregation values and corresponding ordinals. If an aggrega

[GitHub] [lucene] stefanvodita commented on pull request #12454: Clean up ordinal map in default SSDV reader state

2023-09-23 Thread via GitHub
stefanvodita commented on PR #12454: URL: https://github.com/apache/lucene/pull/12454#issuecomment-1732315340 @gmiller, I think this PR is ready. Is there anything else you'd like to see changed? -- This is an automated message from the Apache Git Service. To respond to the message, pleas

[GitHub] [lucene] stefanvodita commented on a diff in pull request #12547: Compute multiple float aggregations in one go

2023-09-23 Thread via GitHub
stefanvodita commented on code in PR #12547: URL: https://github.com/apache/lucene/pull/12547#discussion_r1335004657 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java: ## @@ -37,33 +37,43 @@ abstract class FloatTaxonomyFacets extends TaxonomyFace

[GitHub] [lucene] elliotzlin commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

2023-09-23 Thread via GitHub
elliotzlin commented on PR #11724: URL: https://github.com/apache/lucene/pull/11724#issuecomment-1732391579 @dweiss @rmuir apologies for reviving this ancient thread, but I was wondering if one of you can help run the PR checks again? -- This is an automated message from the Apache Git Se

[GitHub] [lucene] vsop-479 commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-25 Thread via GitHub
vsop-479 commented on PR #12528: URL: https://github.com/apache/lucene/pull/12528#issuecomment-1733041087 @iverase Does it make sense to you if MatchState defined in other class, such as BKDReader or IntersectVisitor, and only leave the sorted dimension in IntersectVisitor' visit method as

[GitHub] [lucene] sylph-eu commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-09-25 Thread via GitHub
sylph-eu commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1733118806 Last comment is already a couple of months old, so please let me clarify the status of this initiative. If there's a chance it's going to be merged? If there's any blocker or actio

[GitHub] [lucene] gf2121 opened a new pull request, #12586: Remove over-counting of deleted terms

2023-09-25 Thread via GitHub
gf2121 opened a new pull request, #12586: URL: https://github.com/apache/lucene/pull/12586 `BufferedUpdates` used to count deleted terms without deduplication to respect `IndexWriterConfig.setMaxBufferedDeleteTerms`. As `IndexWriterConfig.setMaxBufferedDeleteTerms` is removed since [LUCENE

[GitHub] [lucene] gf2121 commented on a diff in pull request #12586: Remove over-counting of deleted terms

2023-09-25 Thread via GitHub
gf2121 commented on code in PR #12586: URL: https://github.com/apache/lucene/pull/12586#discussion_r1335560326 ## lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java: ## @@ -284,6 +276,13 @@ void forEachOrdered(DeletedTermConsumer consumer) throw public long

[GitHub] [lucene] uschindler commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-09-25 Thread via GitHub
uschindler commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1733223795 Hi, actually this issue is already resolved, although the DEFAULT did not change (and won't change due to performance risks), see here: https://github.com/apache/lucene/pull/1

[GitHub] [lucene] uschindler commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-09-25 Thread via GitHub
uschindler commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1733230392 @mayya-sharipova: Should we close this issue or are there any plans to also change the default maximum? I don't think so. -- This is an automated message from the Apache Git Se

[GitHub] [lucene] iverase merged pull request #12581: Allow reading / writing binary stored fields as DataInput

2023-09-25 Thread via GitHub
iverase merged PR #12581: URL: https://github.com/apache/lucene/pull/12581 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

[GitHub] [lucene] iverase closed issue #12556: Allow reading binary stored values as DataInput

2023-09-25 Thread via GitHub
iverase closed issue #12556: Allow reading binary stored values as DataInput URL: https://github.com/apache/lucene/issues/12556 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [lucene] iverase commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-25 Thread via GitHub
iverase commented on PR #12528: URL: https://github.com/apache/lucene/pull/12528#issuecomment-1733441807 My recommendation is to use the following method as you are just trying to flag if the visit method needs to keep processing points. ``` /** Similar to {@link IntersectVisitor#vi

[GitHub] [lucene] gf2121 opened a new pull request, #12587: Use radix sort to speed up the sorting of terms in TermInSetQuery

2023-09-25 Thread via GitHub
gf2121 opened a new pull request, #12587: URL: https://github.com/apache/lucene/pull/12587 ### Description Sort terms in TermInSetQuery with radix sort. This helps TermInSetQueries with a number of terms. ### Benchmark I made a simple benchmark on sorting `BytesRef[]` wi

[GitHub] [lucene] jpountz commented on pull request #12382: Run top-level conjunctions of term queries with a specialized BulkScorer.

2023-09-25 Thread via GitHub
jpountz commented on PR #12382: URL: https://github.com/apache/lucene/pull/12382#issuecomment-1733490475 FWIW I ran the benchmark from https://tantivy-search.github.io/bench/ and also observed a speedup on conjunctions, so I think that the speedup is indeed real. -- This is an automated

[GitHub] [lucene] jpountz merged pull request #12382: Run top-level conjunctions of term queries with a specialized BulkScorer.

2023-09-25 Thread via GitHub
jpountz merged PR #12382: URL: https://github.com/apache/lucene/pull/12382 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

[GitHub] [lucene] javanna opened a new pull request, #12588: Shared executor for LuceneTestCase#newSearcher callers

2023-09-25 Thread via GitHub
javanna opened a new pull request, #12588: URL: https://github.com/apache/lucene/pull/12588 Until now, LuceneTestCase#newSearcher randomly associates the returned IndexSearcher instance with an executor that is ad-hoc created, which gets shut down when the index reader is closed. Thi

[GitHub] [lucene] rmuir commented on pull request #12588: Shared executor for LuceneTestCase#newSearcher callers

2023-09-25 Thread via GitHub
rmuir commented on PR #12588: URL: https://github.com/apache/lucene/pull/12588#issuecomment-1733714024 this is nice, much cleaner. I think i added the original cache-helper-hack. Just one question: can we reduce the number of threads used? I have 2 cores. shouldn't 2 be enough to occasional

[GitHub] [lucene] javanna commented on pull request #12588: Shared executor for LuceneTestCase#newSearcher callers

2023-09-25 Thread via GitHub
javanna commented on PR #12588: URL: https://github.com/apache/lucene/pull/12588#issuecomment-1733731499 Thanks @rmuir for looking, I lowered the number of threads -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the UR

[GitHub] [lucene] jpountz opened a new pull request, #12589: Sometimes intersect the essential clause and the best non-essential clause.

2023-09-25 Thread via GitHub
jpountz opened a new pull request, #12589: URL: https://github.com/apache/lucene/pull/12589 The idea behind MAXSCORE is to run disjunctions as `+(essentialClause1 ... essentialClauseM) nonEssentialClause1 ... nonEssentialClauseN`, moving more and more clauses from the essential list to the

[GitHub] [lucene] jpountz commented on pull request #12589: Sometimes intersect the essential clause and the best non-essential clause.

2023-09-25 Thread via GitHub
jpountz commented on PR #12589: URL: https://github.com/apache/lucene/pull/12589#issuecomment-1733875246 Opening as a draft as I still need to figure out how to test this optimization. I tested on wikibigall where this yielded a good speedup. I would expect an even better speedup if

[GitHub] [lucene] javanna merged pull request #12588: Shared executor for LuceneTestCase#newSearcher callers

2023-09-25 Thread via GitHub
javanna merged PR #12588: URL: https://github.com/apache/lucene/pull/12588 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

[GitHub] [lucene] kaivalnp opened a new pull request, #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub
kaivalnp opened a new pull request, #12590: URL: https://github.com/apache/lucene/pull/12590 ### Context Vector search is performed in [`AbstractKnnVectorQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java), w

[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub
benwtrent commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734167969 The new KnnCollector abstraction doesn't already address these needs? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub a

[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub
kaivalnp commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734230703 Thanks for the quick response @benwtrent! As far as I understand (please let me know if I'm missing something), the new [`KnnCollector`](https://github.com/apache/lucene/blob/mai

[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub
benwtrent commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734248507 @kaivalnp I guess you could have a collector that spans all the segments (created once at the query level). I am not really against this change, I am just wondering if there is a

[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub
kaivalnp commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734321355 > you could have a collector that spans all the segments (created once at the query level). Interesting, so a single `KnnCollector` passed to all [`searchNearestVectors`](https:/

[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub
benwtrent commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734371468 @kaivalnp depends on what you need to do. You can easily get around all this without any expensive locking. The collector has a "topDocs" method that could call some higher

[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub
kaivalnp commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734432239 > You can easily get around all this without any expensive locking. > The collector has a "topDocs" method that could call some higher level collector. Nice idea! So basically

[GitHub] [lucene] vsop-479 commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-25 Thread via GitHub
vsop-479 commented on PR #12528: URL: https://github.com/apache/lucene/pull/12528#issuecomment-1734701754 @iverase Thanks for your recommendation, it may makes code more clear. I will try to implement it, and run the performance test. -- This is an automated message from the Apache Git

[GitHub] [lucene] iverase commented on pull request #12460: Allow reading binary doc values as a DataInput

2023-09-26 Thread via GitHub
iverase commented on PR #12460: URL: https://github.com/apache/lucene/pull/12460#issuecomment-1735063044 I have been thinking a bit longer about this and I think this approach of `DataInput` is not right. Instead we should try to return an API more similar to `RandomAccessInput` as Uwe sugg

[GitHub] [lucene] gf2121 opened a new pull request, #12591: Sort update terms with stable radix sorter

2023-09-26 Thread via GitHub
gf2121 opened a new pull request, #12591: URL: https://github.com/apache/lucene/pull/12591 Inspired by the #91, This PR proposes to use a stable radix sorter to sort update terms instead of tie comparator to doc id. As terms are appended in order, the latest update of each term value should

[GitHub] [lucene] iverase opened a new issue, #12592: Add length method to RandomAccessInput

2023-09-26 Thread via GitHub
iverase opened a new issue, #12592: URL: https://github.com/apache/lucene/issues/12592 It is currently not possible to read all bytes from a RandomAccessInput without previous knowledge of how many bytes were written. I would like to propose that RandomAccessInput can provide the user the i

[GitHub] [lucene] jpountz opened a new pull request, #12593: Compute better windows in MaxScoreBulkScorer.

2023-09-26 Thread via GitHub
jpountz opened a new pull request, #12593: URL: https://github.com/apache/lucene/pull/12593 MaxScoreBulkScorer computes windows based on the set of clauses that were essential in the *previous* window. This usually works well as the set of essential clauses tends to be stable over time, but

[GitHub] [lucene] jpountz commented on pull request #12593: Compute better windows in MaxScoreBulkScorer.

2023-09-26 Thread via GitHub
jpountz commented on PR #12593: URL: https://github.com/apache/lucene/pull/12593#issuecomment-1735586604 No impact on wikibigall, which is expected as all queries have sets of essential/non-essential clauses which are stable over time. ``` TaskQPS baseli

[GitHub] [lucene] jpountz commented on pull request #12564: Window-at-a-time scoring for conjunctions.

2023-09-26 Thread via GitHub
jpountz commented on PR #12564: URL: https://github.com/apache/lucene/pull/12564#issuecomment-1735610903 Closing: benchmarks on wikibigall don't suggest it helps more than #12382. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHu

[GitHub] [lucene] jpountz closed pull request #12564: Window-at-a-time scoring for conjunctions.

2023-09-26 Thread via GitHub
jpountz closed pull request #12564: Window-at-a-time scoring for conjunctions. URL: https://github.com/apache/lucene/pull/12564 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

[GitHub] [lucene] s1monw opened a new pull request, #12595: Make IndexWriter#flushNextBuffer also apply deletes if necessary

2023-09-26 Thread via GitHub
s1monw opened a new pull request, #12595: URL: https://github.com/apache/lucene/pull/12595 `IndexWriter#flushNextBuffer()` is a convenient way to control indexing buffer sizes across multiple index writers. This change also flushes deletes if necessary when `#flushNextBuffer()` is called ev

[GitHub] [lucene] heemin32 opened a new issue, #12596: surpriseMePolygon and createRegularPolygon in test util class returns invalid polygon

2023-09-26 Thread via GitHub
heemin32 opened a new issue, #12596: URL: https://github.com/apache/lucene/issues/12596 ### Description https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/geo/ShapeTestUtil.java surpriseMePolygon and createRegularPolygon returns in

[GitHub] [lucene] jpountz commented on pull request #12595: Make IndexWriter#flushNextBuffer also apply deletes if necessary

2023-09-26 Thread via GitHub
jpountz commented on PR #12595: URL: https://github.com/apache/lucene/pull/12595#issuecomment-1736121993 Thinking out loud: it looks like your change always flushes both the largest pending writer and deletes. I wonder if we should try to make it more granular and e.g. check whichever of de

[GitHub] [lucene] sgup432 opened a new issue, #12597: Make IndexReader.CacheKey serializable

2023-09-26 Thread via GitHub
sgup432 opened a new issue, #12597: URL: https://github.com/apache/lucene/issues/12597 ### Description As of now, lucene LRU query cache and other OpenSearch caches uses CacheKey as a primary key for their caches as it helps to determine any changes during segment merges etc. We use

[GitHub] [lucene] shubhamvishu commented on issue #12394: Add the ability to compute vector similarity scores with the new ValuesSource API

2023-09-26 Thread via GitHub
shubhamvishu commented on issue #12394: URL: https://github.com/apache/lucene/issues/12394#issuecomment-1736155862 @jpountz I have raised a PR #12548 that adds the required APIs to DVS for computing vector similarity scores. Thanks! -- This is an automated message from the Apache Git Serv

[GitHub] [lucene] iverase commented on pull request #12594: Add length method to RandomAccessInput

2023-09-26 Thread via GitHub
iverase commented on PR #12594: URL: https://github.com/apache/lucene/pull/12594#issuecomment-1736205488 Thanks @jpountz! What it is not clear to me is if adding a method to this interface is considered a breaking change and it can only be introduced in a major release, or if it can be back

[GitHub] [lucene] gsmiller commented on issue #12558: IntTaxonomyFacets chooses dense values array when FacetsCollector has no MatchingDocs

2023-09-26 Thread via GitHub
gsmiller commented on issue #12558: URL: https://github.com/apache/lucene/issues/12558#issuecomment-1736274592 Oh interesting. Thanks @Shradha26! Is drill-sideways the only means for reproducing this (that we're aware of anyway)? That's tricky. Do we think it's doing the right thing, or sho

[GitHub] [lucene] gf2121 commented on a diff in pull request #12573: Use radix sort to speed up the sorting of deleted terms

2023-09-26 Thread via GitHub
gf2121 commented on code in PR #12573: URL: https://github.com/apache/lucene/pull/12573#discussion_r1338004334 ## lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java: ## @@ -139,15 +131,11 @@ public void addTerm(Term term, int docIDUpto) { return; } -

[GitHub] [lucene] romseygeek commented on pull request #12594: Add length method to RandomAccessInput

2023-09-27 Thread via GitHub
romseygeek commented on PR #12594: URL: https://github.com/apache/lucene/pull/12594#issuecomment-1736960788 I think it's fine to add simple methods to interfaces like this in point releases. Extending RandomAccessInput would be a pretty expert use of lucene. One suggestion: given tha

[GitHub] [lucene] iverase commented on pull request #12594: Add length method to RandomAccessInput

2023-09-27 Thread via GitHub
iverase commented on PR #12594: URL: https://github.com/apache/lucene/pull/12594#issuecomment-1736975225 > One suggestion: given that ByteBuffersDataInput already has a size method, why not name the new interface method size as well? The two terms are more or less interchangeable, and that

[GitHub] [lucene] romseygeek commented on pull request #12594: Add length method to RandomAccessInput

2023-09-27 Thread via GitHub
romseygeek commented on PR #12594: URL: https://github.com/apache/lucene/pull/12594#issuecomment-1736976414 > Because IndexInput already defines a method called length, so adding a size method means that all the IndexInput that implements RandomAccessInput (which there are a few) will have

[GitHub] [lucene] gf2121 opened a new issue, #12598: FST#Compiler allocates too much memory

2023-09-27 Thread via GitHub
gf2121 opened a new issue, #12598: URL: https://github.com/apache/lucene/issues/12598 ### Description https://blunders.io/jfr-demo/indexing-4kb-2023.09.25.18.03.36/allocations-drill-down The allocation profile for nightly indexing benchmark shows that `FST#ByteStore` occupies

[GitHub] [lucene] s1monw commented on pull request #12595: Make IndexWriter#flushNextBuffer also apply deletes if necessary

2023-09-27 Thread via GitHub
s1monw commented on PR #12595: URL: https://github.com/apache/lucene/pull/12595#issuecomment-1737068633 @jpountz I added a simplified version of your idea that applies deletes if they consume more RAM than the DWPT we are flushing. The applyAllDeletes() only has an impact if the `FlushContr

[GitHub] [lucene] iverase merged pull request #12594: Add length method to RandomAccessInput

2023-09-27 Thread via GitHub
iverase merged PR #12594: URL: https://github.com/apache/lucene/pull/12594 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

[GitHub] [lucene] iverase closed issue #12592: Add length method to RandomAccessInput

2023-09-27 Thread via GitHub
iverase closed issue #12592: Add length method to RandomAccessInput URL: https://github.com/apache/lucene/issues/12592 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubs

[GitHub] [lucene] gf2121 closed issue #12598: FST#Compiler allocates too much memory

2023-09-27 Thread via GitHub
gf2121 closed issue #12598: FST#Compiler allocates too much memory URL: https://github.com/apache/lucene/issues/12598 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubsc

[GitHub] [lucene] iverase opened a new issue, #12599: Add readBytes method to RandomAccessInput

2023-09-27 Thread via GitHub
iverase opened a new issue, #12599: URL: https://github.com/apache/lucene/issues/12599 We can currently read a RandomAccessInput byte by byte but it does not provide a method to bulk read a chunck of bytes, similary to what DataInput provides with DataInput#readBytes(byte[], int, int). Ther

[GitHub] [lucene] iverase opened a new pull request, #12600: Add readBytes method to RandomAccessInput

2023-09-27 Thread via GitHub
iverase opened a new pull request, #12600: URL: https://github.com/apache/lucene/pull/12600 Adds a new method to RandomAccessInput tio bulk read bytes into a provided byte array. The default implementation reads byte by byte but faster implementations are provided for all lucene implementat

[GitHub] [lucene] uschindler commented on pull request #12600: Add readBytes method to RandomAccessInput

2023-09-27 Thread via GitHub
uschindler commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1737512784 Hi, > @uschindler I tried to provide implementations for the `MemorySegmentIndexInputProvider`. I am a bit confused because I run `TestMmapDirectory` several times hoping it wi

[GitHub] [lucene] uschindler commented on pull request #12600: Add readBytes method to RandomAccessInput

2023-09-27 Thread via GitHub
uschindler commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1737520431 I need more time to review this as I am a bit crowded. Sorry for possible delays. If you get tests running with all kinds of mmap, just give me some feedback. -- This is an a

[GitHub] [lucene] uschindler commented on pull request #12600: Add readBytes method to RandomAccessInput

2023-09-27 Thread via GitHub
uschindler commented on PR #12600: URL: https://github.com/apache/lucene/pull/12600#issuecomment-1737527376 I changed to draft until all is tested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [lucene] uschindler commented on a diff in pull request #12600: Add readBytes method to RandomAccessInput

2023-09-27 Thread via GitHub
uschindler commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1338712090 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -290,6 +312,23 @@ public byte readByte(long pos) throws IOException { }

[GitHub] [lucene] iverase commented on a diff in pull request #12600: Add readBytes method to RandomAccessInput

2023-09-27 Thread via GitHub
iverase commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1338733436 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -290,6 +312,23 @@ public byte readByte(long pos) throws IOException { } }

[GitHub] [lucene] uschindler commented on a diff in pull request #12600: Add readBytes method to RandomAccessInput

2023-09-27 Thread via GitHub
uschindler commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1338738357 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -290,6 +312,23 @@ public byte readByte(long pos) throws IOException { }

[GitHub] [lucene] uschindler commented on a diff in pull request #12600: Add readBytes method to RandomAccessInput

2023-09-27 Thread via GitHub
uschindler commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1338741133 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -290,6 +312,23 @@ public byte readByte(long pos) throws IOException { }

[GitHub] [lucene] iverase commented on a diff in pull request #12600: Add readBytes method to RandomAccessInput

2023-09-27 Thread via GitHub
iverase commented on code in PR #12600: URL: https://github.com/apache/lucene/pull/12600#discussion_r1338780906 ## lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java: ## @@ -290,6 +312,23 @@ public byte readByte(long pos) throws IOException { } }

[GitHub] [lucene] iverase opened a new issue, #12601: Reproducible TestDrillSideways failure

2023-09-27 Thread via GitHub
iverase opened a new issue, #12601: URL: https://github.com/apache/lucene/issues/12601 ### Description The following gradle command fails reproducibly on main with the following error: ``` > java.lang.AssertionError > at __randomizedtesting.SeedInfo.see

[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-27 Thread via GitHub
benwtrent commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1737897805 > Overriding rewrite (first calling super.rewrite and looking at the rewritten Query) I would say don't call `super.rewrite` and just copy paste things. Or use a collector

[GitHub] [lucene] zouxiang1993 opened a new pull request, #12602: Reduce collection operations when minShouldMatch == 0.

2023-09-27 Thread via GitHub
zouxiang1993 opened a new pull request, #12602: URL: https://github.com/apache/lucene/pull/12602 ### Description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsu

[GitHub] [lucene] sgup432 commented on issue #12597: Make IndexReader.CacheKey serializable

2023-09-28 Thread via GitHub
sgup432 commented on issue #12597: URL: https://github.com/apache/lucene/issues/12597#issuecomment-1738594461 @jpountz What do you think on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to t

[GitHub] [lucene] gf2121 commented on issue #12598: FST#Compiler allocates too much memory

2023-09-28 Thread via GitHub
gf2121 commented on issue #12598: URL: https://github.com/apache/lucene/issues/12598#issuecomment-1738687168 I did an experiment: Index random BytesRefs and count the byte usage when `BytesStore#finish` called. The following are the statistical results: ``` total sample: 23130

[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-28 Thread via GitHub
kaivalnp commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1738933241 > My main concern is that we are adding yet another extension point for "partial control" when we already have that with the rewrite or something even more complex with the collector

[GitHub] [lucene] javanna opened a new pull request, #12603: Simplify TaskExecutor API

2023-09-28 Thread via GitHub
javanna opened a new pull request, #12603: URL: https://github.com/apache/lucene/pull/12603 We recently made TaskExecutor public. It currently exposes two methods: one to create tasks given a collection of callables, and one to execute all tasks created at step 1. We can rather expose a sin

[GitHub] [lucene] gf2121 commented on issue #12598: FST#Compiler allocates too much memory

2023-09-28 Thread via GitHub
gf2121 commented on issue #12598: URL: https://github.com/apache/lucene/issues/12598#issuecomment-1739489111 I get similar statistics for wikimediumall and here are the results when `BytesStore#finish` called 1,000,000 times. ``` BytesStore#finish called: 100 times min:

[GitHub] [lucene] gf2121 opened a new pull request, #12604: Reduce FST block size for BlockTreeTermsWriter

2023-09-28 Thread via GitHub
gf2121 opened a new pull request, #12604: URL: https://github.com/apache/lucene/pull/12604 ### Description https://blunders.io/jfr-demo/indexing-4kb-2023.09.25.18.03.36/allocations-drill-down Nightly benchmark shows that `FSTCompiler#init` allocated most of the memory during i

[GitHub] [lucene] cpoerschke commented on a diff in pull request #12380: Add a post-collection hook to LeafCollector.

2023-09-28 Thread via GitHub
cpoerschke commented on code in PR #12380: URL: https://github.com/apache/lucene/pull/12380#discussion_r1340380526 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/document/SuggestIndexSearcher.java: ## @@ -67,14 +68,16 @@ public void suggest(CompletionQuery query, T

[GitHub] [lucene] cpoerschke opened a new pull request, #12605: IndexingChain.validateMaxVectorDimension: add missing space wording

2023-09-28 Thread via GitHub
cpoerschke opened a new pull request, #12605: URL: https://github.com/apache/lucene/pull/12605 (Keeping as draft for now, to see if similar change might be needed nearby too.) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub a

[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-28 Thread via GitHub
benwtrent commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1739838321 Ok, if we really want the ability to get the final topK for a Lucene index, I think a new method should be added (merge results or something). That seems like a better extension than o

[GitHub] [lucene] jpountz commented on a diff in pull request #12380: Add a post-collection hook to LeafCollector.

2023-09-28 Thread via GitHub
jpountz commented on code in PR #12380: URL: https://github.com/apache/lucene/pull/12380#discussion_r1340582160 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/document/SuggestIndexSearcher.java: ## @@ -67,14 +68,16 @@ public void suggest(CompletionQuery query, TopS

[GitHub] [lucene] javanna opened a new pull request, #12606: Create a task executor when executor is not provided

2023-09-28 Thread via GitHub
javanna opened a new pull request, #12606: URL: https://github.com/apache/lucene/pull/12606 As we introduce more places where we add concurrency (there are currently three) there is a common pattern around checking whether there is an executor provided, and then going sequential on the call

[GitHub] [lucene] javanna commented on a diff in pull request #12606: Create a task executor when executor is not provided

2023-09-28 Thread via GitHub
javanna commented on code in PR #12606: URL: https://github.com/apache/lucene/pull/12606#discussion_r1340606929 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -420,13 +418,12 @@ public int count(Query query) throws IOException { } /** - * Re

[GitHub] [lucene] jpountz commented on issue #12597: Make IndexReader.CacheKey serializable

2023-09-28 Thread via GitHub
jpountz commented on issue #12597: URL: https://github.com/apache/lucene/issues/12597#issuecomment-1739925151 Can you explain a bit more why you need to store your cache keys off-heap? Presumably this isn't because of memory usage since these cache keys are tiny. -- This is an automated m

[GitHub] [lucene] sgup432 commented on issue #12597: Make IndexReader.CacheKey serializable

2023-09-28 Thread via GitHub
sgup432 commented on issue #12597: URL: https://github.com/apache/lucene/issues/12597#issuecomment-1739942303 Yeah this isn't because of memory consumption as such. But more towards providing capability to offload cache data to off-heap/disk so that one can maintain a larger cache if needed

[GitHub] [lucene] jpountz commented on issue #12597: Make IndexReader.CacheKey serializable

2023-09-28 Thread via GitHub
jpountz commented on issue #12597: URL: https://github.com/apache/lucene/issues/12597#issuecomment-1739956888 Would it work to only store the cache values on disk and keep cache keys in memory? These cache keys should be a small fraction of the memory that open `IndexReader`s take anyway?

[GitHub] [lucene] javanna opened a new pull request, #12607: Add missing create github release step to release wizard

2023-09-28 Thread via GitHub
javanna opened a new pull request, #12607: URL: https://github.com/apache/lucene/pull/12607 The "create github release" step was missing from the release wizard. We have forgotten about it a few times recently. While at it, I also expanded the instructions around closing the current

[GitHub] [lucene] quux00 commented on a diff in pull request #12523: TaskExecutor waits for all tasks to complete before returning

2023-09-28 Thread via GitHub
quux00 commented on code in PR #12523: URL: https://github.com/apache/lucene/pull/12523#discussion_r1340643119 ## lucene/core/src/test/org/apache/lucene/search/TestIndexSearcher.java: ## @@ -267,11 +266,130 @@ protected LeafSlice[] slices(List leaves) { return slic

[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-28 Thread via GitHub
kaivalnp commented on PR #12590: URL: https://github.com/apache/lucene/pull/12590#issuecomment-1739975925 Thanks, makes sense to me! Added an explicit method to merge per-segment results -- This is an automated message from the Apache Git Service. To respond to the message, please log on

[GitHub] [lucene] sgup432 commented on issue #12597: Make IndexReader.CacheKey serializable

2023-09-28 Thread via GitHub
sgup432 commented on issue #12597: URL: https://github.com/apache/lucene/issues/12597#issuecomment-1739977503 We can, that's more of a implementation choice and should be kept open. Apart from that, OpenSearch also uses RequestCache which uses a composite key and CacheKey being one of them.

[GitHub] [lucene] gsmiller opened a new issue, #12608: Should DrillSidewaysQuery BulkScorer leverage scoreSuppliers of the base query and dimensions?

2023-09-28 Thread via GitHub
gsmiller opened a new issue, #12608: URL: https://github.com/apache/lucene/issues/12608 ### Description When DrillSidewaysQuery creates a BulkScorer, it directly calls `#scorer` on the base query and each drill-down dim. This means the drill-down dims are not able to optimize based o

[GitHub] [lucene] jpountz commented on issue #12597: Make IndexReader.CacheKey serializable

2023-09-28 Thread via GitHub
jpountz commented on issue #12597: URL: https://github.com/apache/lucene/issues/12597#issuecomment-1739991979 Got it. I'm a bit torn on this change, on the one hand it would be harmless as you pointed out, on the other hand I could see it being a bit of a rabbit hole with future feat

[GitHub] [lucene] jpountz commented on pull request #12382: Run top-level conjunctions of term queries with a specialized BulkScorer.

2023-09-29 Thread via GitHub
jpountz commented on PR #12382: URL: https://github.com/apache/lucene/pull/12382#issuecomment-1740478404 This gave a good speedup to [AndHighHigh](http://people.apache.org/~mikemccand/lucenebench/AndHighHigh.html) on nightlies: +15%, but [AndHighMed](http://people.apache.org/~mikemccand/lu

[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12606: Create a task executor when executor is not provided

2023-09-29 Thread via GitHub
shubhamvishu commented on code in PR #12606: URL: https://github.com/apache/lucene/pull/12606#discussion_r1341149197 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -420,13 +418,12 @@ public int count(Query query) throws IOException { } /** -

[GitHub] [lucene] cpoerschke opened a new pull request, #12609: SuggestIndexSearcher.suggest catches any CollectionTerminatedException (theoretically) thrown by getLeafCollector

2023-09-29 Thread via GitHub
cpoerschke opened a new pull request, #12609: URL: https://github.com/apache/lucene/pull/12609 please see https://github.com/apache/lucene/pull/12380#discussion_r1340380526 and https://github.com/apache/lucene/pull/12380#discussion_r1340582160 for context -- This is an automated message

[GitHub] [lucene] cpoerschke commented on a diff in pull request #12380: Add a post-collection hook to LeafCollector.

2023-09-29 Thread via GitHub
cpoerschke commented on code in PR #12380: URL: https://github.com/apache/lucene/pull/12380#discussion_r1341153435 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/document/SuggestIndexSearcher.java: ## @@ -67,14 +68,16 @@ public void suggest(CompletionQuery query, T

[GitHub] [lucene] cpoerschke commented on a diff in pull request #12436: Move max vector dims limit to Codec

2023-09-29 Thread via GitHub
cpoerschke commented on code in PR #12436: URL: https://github.com/apache/lucene/pull/12436#discussion_r1341157051 ## lucene/core/src/java/org/apache/lucene/index/IndexingChain.java: ## @@ -831,6 +837,20 @@ private static void verifyUnIndexedFieldType(String name, IndexableFiel

[GitHub] [lucene] shubhamvishu commented on pull request #12606: Create a task executor when executor is not provided

2023-09-29 Thread via GitHub
shubhamvishu commented on PR #12606: URL: https://github.com/apache/lucene/pull/12606#issuecomment-1740631119 I really like this! This looks much more cleaner. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL ab

[GitHub] [lucene] shubhamvishu commented on a diff in pull request #12606: Create a task executor when executor is not provided

2023-09-29 Thread via GitHub
shubhamvishu commented on code in PR #12606: URL: https://github.com/apache/lucene/pull/12606#discussion_r1341158559 ## lucene/core/src/test/org/apache/lucene/search/TestIndexSearcher.java: ## Review Comment: We should change the `assertNull ` to `assertNotNull` in `testGet

[GitHub] [lucene] jpountz commented on a diff in pull request #12591: Sort update terms with stable radix sorter

2023-09-29 Thread via GitHub
jpountz commented on code in PR #12591: URL: https://github.com/apache/lucene/pull/12591#discussion_r1341195931 ## lucene/core/src/java/org/apache/lucene/util/StableStringSorter.java: ## @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] [lucene] cpoerschke merged pull request #12605: IndexingChain.validateMaxVectorDimension: add missing space in IllegalArgumentException wording

2023-09-29 Thread via GitHub
cpoerschke merged PR #12605: URL: https://github.com/apache/lucene/pull/12605 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.

[GitHub] [lucene] cpoerschke merged pull request #12609: SuggestIndexSearcher.suggest catches any CollectionTerminatedException (theoretically) thrown by getLeafCollector

2023-09-29 Thread via GitHub
cpoerschke merged PR #12609: URL: https://github.com/apache/lucene/pull/12609 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.

<    1   2   3   4   5   6   7   8   9   10   >