Re: [PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]

2025-03-16 Thread via GitHub
jpountz commented on PR #14359: URL: https://github.com/apache/lucene/pull/14359#issuecomment-2727649234 This doesn't slow down existing tasks significantly, including `CountFilteredPhrase` which now runs with `DenseConjunctionBulkScorer` vs. a `DefaultBulkScorer` on top of a `ConjunctionSc

[PR] Add support for two-phase iterators to DenseConjunctionBulkScorer. [lucene]

2025-03-16 Thread via GitHub
jpountz opened a new pull request, #14359: URL: https://github.com/apache/lucene/pull/14359 The main motivation is to efficiently evaluate range queries on fields that have a doc-value index enabled. These range queries produce two-phase iterators that should match large contiguous range of

Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]

2025-03-16 Thread via GitHub
iverase commented on code in PR #14358: URL: https://github.com/apache/lucene/pull/14358#discussion_r1997607493 ## lucene/core/src/java/org/apache/lucene/util/packed/DirectMonotonicReader.java: ## @@ -90,102 +140,142 @@ public static DirectMonotonicReader getInstance(Meta meta,

[PR] Specialise DirectMonotonicReader when it only contains one block [lucene]

2025-03-16 Thread via GitHub
iverase opened a new pull request, #14358: URL: https://github.com/apache/lucene/pull/14358 While looking into some heap dumps, I noticed in the DirectMonotonicReader.Meta objects held by segments that the case of a single value block is actually common. I wondered if we could specialize that

Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]

2025-03-16 Thread via GitHub
rmuir commented on PR #14360: URL: https://github.com/apache/lucene/pull/14360#issuecomment-2727698038 Here's the same Automaton, but via `toDot()` tossed into https://dreampuf.github.io/GraphvizOnline with all defaults. I guess I'm still a fan of that output style, I feel it is more readab

Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]

2025-03-16 Thread via GitHub
rmuir commented on PR #14360: URL: https://github.com/apache/lucene/pull/14360#issuecomment-2727699099 Mermaid definitely doesn't handle infinite automata very well at all: ```mermaid stateDiagram direction LR classDef accept border-width:5px;stroke-width:5px,s

Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]

2025-03-16 Thread via GitHub
gf2121 merged PR #14203: URL: https://github.com/apache/lucene/pull/14203 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apac

Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]

2025-03-16 Thread via GitHub
jpountz commented on PR #14358: URL: https://github.com/apache/lucene/pull/14358#issuecomment-2727467298 I'm wary about adding all these micro-optimizations to reduce the per-segment per-field overhead. They hurt readability and may easily get lost over time when codecs get replaced with ne

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727461724 +1 let's use `DisjunctionSumScorer` (which already supports two-phase iteration) when one of the clauses exposes a non-null two-phase iterator?

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727528269 If one or more DISI has a high cost (irrespective of TPIs), thus matching many docs, I could see avoiding BS1 as well. An aside, if we are going to refer to these as BS1 vs BS2, th

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727629320 In case you missed it, `BooleanScorer` had optimizations recently that make it hard to beat by `DisjunctionScorer` when clauses are `PostingsEnum`s: - `DocIdSetIterator#intoBitSet` he

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727625240 > If one or more DISI has a high cost (irrespective of TPIs), thus matching many docs, I could see avoiding BS1 as well. I imagine that your idea is that if most of the cost comes

Re: [I] Multi-HNSW graphs per segment? [lucene]

2025-03-16 Thread via GitHub
navneet1v commented on issue #14341: URL: https://github.com/apache/lucene/issues/14341#issuecomment-2727640413 > What you primarily want in the referenced GH issue is the ability to filter on more metadata during traversal vs doing a pre filter on the candidate documents themselves. As Adr

Re: [I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]

2025-03-16 Thread via GitHub
navneet1v commented on issue #14348: URL: https://github.com/apache/lucene/issues/14348#issuecomment-2727641512 @viliam-durina if you have benchmarks that show the performance is better, it would be good to raise a PR. Once the PR is there, maintainers can also run more tests to see if it is rea

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727499162 Thanks for your confirmation of the problem. The collect-per-clause is surprising to me; like what would benefit from that algorithm? Wouldn't that _only_ be in fact _needed_ if scores

[PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]

2025-03-16 Thread via GitHub
rmuir opened a new pull request, #14360: URL: https://github.com/apache/lucene/pull/14360 Mermaid is a state-chart tool supported within fenced code blocks by GitHub. For some reason GitHub doesn't support dotty but instead the latest JS tool. I'm sure in 2 months it will be a different tool. Be

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
jpountz commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2727502419 BS2 uses a heap to merge multiple `DocIdSetIterator`s. Unfortunately, reordering this heap on every call to `nextDoc()` or `advance(int)` is not completely free and BS1's approach of loa
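The heap-based merge described in the comment above can be illustrated with a minimal, self-contained sketch. It does not use Lucene's classes: `HeapMergeSketch`, `Cursor`, and `mergeDocIds` are hypothetical names introduced here for illustration, and plain sorted `int[]` arrays stand in for `DocIdSetIterator`s. The point it shows is that every step of the iteration pays for a heap reordering, which is the overhead the comment refers to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; this is not Lucene's implementation.
final class HeapMergeSketch {

  /** Stand-in for one clause's iterator over a sorted array of doc IDs. */
  static final class Cursor {
    final int[] docs;
    int pos = 0;

    Cursor(int[] docs) {
      this.docs = docs;
    }

    int doc() {
      return docs[pos];
    }

    /** Moves to the next doc ID; returns false when the clause is exhausted. */
    boolean advance() {
      return ++pos < docs.length;
    }
  }

  /** Merges sorted per-clause doc IDs into one sorted, de-duplicated array. */
  static int[] mergeDocIds(int[][] clauses) {
    // The heap is ordered by each cursor's current doc ID.
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
    for (int[] clause : clauses) {
      if (clause.length > 0) {
        heap.add(new Cursor(clause));
      }
    }
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      // Each step removes the smallest cursor and re-inserts it after
      // advancing, costing O(log n) per advanced clause.
      Cursor top = heap.poll();
      int doc = top.doc();
      if (out.isEmpty() || out.get(out.size() - 1) != doc) {
        out.add(doc);
      }
      if (top.advance()) {
        heap.add(top);
      }
    }
    return out.stream().mapToInt(Integer::intValue).toArray();
  }
}
```

With three clauses matching `{1, 4, 9}`, `{2, 4, 7}` and `{4, 9, 12}`, the merge visits doc 4 three times (once per clause) before moving on, so clauses that match many documents drive many heap operations.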

Re: [PR] Specialise DirectMonotonicReader when it only contains one block [lucene]

2025-03-16 Thread via GitHub
iverase commented on PR #14358: URL: https://github.com/apache/lucene/pull/14358#issuecomment-2728327145 I understand what you say, I will close this then.

Re: [PR] Reduce Lucene90DocValuesProducer memory footprint [lucene]

2025-03-16 Thread via GitHub
iverase commented on PR #14340: URL: https://github.com/apache/lucene/pull/14340#issuecomment-2728329058 See here https://github.com/apache/lucene/pull/14358#issuecomment-2727467298, I will close this.

Re: [PR] BooleanScorer doesn't optimize for TwoPhaseIterator [lucene]

2025-03-16 Thread via GitHub
dsmiley commented on PR #14357: URL: https://github.com/apache/lucene/pull/14357#issuecomment-2728182572 I could imagine improving BooleanScorer so that the TPI clauses are separated and converted to a filter around the collector to try to match docs *not* collected (i.e. test for docs inbe

Re: [I] Multi-threaded vector search over multiple segments can lead to inconsistent results [lucene]

2025-03-16 Thread via GitHub
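Zona-hu commented on issue #14180: URL: https://github.com/apache/lucene/issues/14180#issuecomment-2728315199 > ### Description > Related: [#14167](https://github.com/apache/lucene/pull/14167) > > However, apart from multi-leaf collection (e.g. information sharing), multi-threaded search over multiple segments can also obtain consistent results at low values of `k`. > > It is possible to get more consistent results, and possibly by simply collecting more neighbors (`k` in the query, `fan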

Re: [PR] Reduce Lucene90DocValuesProducer memory footprint [lucene]

2025-03-16 Thread via GitHub
iverase closed pull request #14340: Reduce Lucene90DocValuesProducer memory footprint URL: https://github.com/apache/lucene/pull/14340

Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]

2025-03-16 Thread via GitHub
vigyasharma commented on PR #14325: URL: https://github.com/apache/lucene/pull/14325#issuecomment-2728289419 +1 to Adrien's comment, IndexDeletionPolicy can quite easily be implemented and configured by users in IndexWriterConfig. It is often configured outside of Lucene too, like the [Com

Re: [PR] Use Vector API to decode BKD docIds [lucene]

2025-03-16 Thread via GitHub
jpountz commented on code in PR #14203: URL: https://github.com/apache/lucene/pull/14203#discussion_r1997535617 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PointsWriter.java: ## @@ -105,15 +107,22 @@ public Lucene90PointsWriter( } } + public Luce

Re: [PR] A specialized Trie for Block Tree Index [lucene]

2025-03-16 Thread via GitHub
gf2121 commented on code in PR #14333: URL: https://github.com/apache/lucene/pull/14333#discussion_r1997503104 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Trie.java: ## @@ -0,0 +1,486 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one o

Re: [PR] Decode doc ids in BKD leaves with auto-vectorized loops [lucene]

2025-03-16 Thread via GitHub
gf2121 commented on code in PR #14203: URL: https://github.com/apache/lucene/pull/14203#discussion_r1997571999 ## lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java: ## @@ -248,21 +281,68 @@ private void readBitSet(IndexInput in, int count, int[] docIDs) throws I