Re: [PR] Revert "Add UnwrappingReuseStrategy for AnalyzerWrapper (#14154)" [lucene]
mayya-sharipova merged PR #14430: URL: https://github.com/apache/lucene/pull/14430
[PR] Revert "Add UnwrappingReuseStrategy for AnalyzerWrapper (#14154)" [lucene]
mayya-sharipova opened a new pull request, #14432: URL: https://github.com/apache/lucene/pull/14432

This reverts commit 1a676b64f89069284bf7d0510162f8095bba3980. Revert based on https://github.com/apache/lucene/pull/14430
Re: [PR] Adding profiling support for concurrent segment search [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2774601593

> I'd have a top-level tree for everything related to initializing the search and combining results (rewrite(), createWeight(), CollectorManager#reduce) and then a list of trees for each slice.

While working on the code, I realized it is better to have the list of slices within the tree itself at each level, instead of repeating the query structure and information across multiple trees. In this approach, we can easily view the tree for a specific `sliceId` using jq or a simple script. The structure looks like below:

```
"query": [                     <-- list of root queries
  {
    "type": "TermQuery",
    "description": "foo:bar",
    "startTime": 11972972,
    "totalTime": 354343,
    "breakdown": {.},          <-- query-level breakdown, like weight count and time
    "sliceBreakdowns": [
      {.},                     <-- first slice information
      {.}                      <-- second slice information
    ],
    "queryChildren": [
      {..},                    <-- recursive repetition of the above structure
      {..}
    ]
```

> Related, it'd be nice if each per-slice object could also tell us about the thread that it ran in and its start time so that we could understand how exactly Lucene managed to parallelize the search.

Yes, that would be really useful. I have included the threadId as `sliceId`, along with `startTime` and `totalTime` information, for each slice object at every query level.
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
github-actions[bot] commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2774035527 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
[I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]
houserjohn opened a new issue, #14431: URL: https://github.com/apache/lucene/issues/14431

### Description

Using IndexWriter's `forceMergeDeletes` to eliminate merge debt is a very useful feature -- especially during initial indexing. However, larger indexes can require 20+ minutes to finish a `forceMergeDeletes` call. It would be useful to add an API for specifying a timeout alongside the existing `doWait` argument. Instead of blocking until `forceMergeDeletes` finishes, we should block until either `forceMergeDeletes` finishes or the timeout is reached, whichever comes first.
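For illustration only — not the proposed API — a caller can approximate this behavior today with the existing non-blocking `forceMergeDeletes(false)` plus polling of `IndexWriter.hasPendingMerges()`. Note that `hasPendingMerges()` covers all pending merges, not just deletes-only ones, so this is a rough sketch of the desired semantics:

```java
import java.util.concurrent.TimeUnit;
import org.apache.lucene.index.IndexWriter;

final class ForceMergeDeletesWithTimeout {
  /**
   * Kicks off forceMergeDeletes without blocking, then waits up to the given timeout
   * for pending merges to drain. Returns true if no merges remain, false on timeout
   * (merges keep running in the background either way).
   */
  static boolean forceMergeDeletes(IndexWriter writer, long timeout, TimeUnit unit)
      throws Exception {
    writer.forceMergeDeletes(false); // request the merges, do not wait
    final long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (writer.hasPendingMerges()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out
      }
      Thread.sleep(100); // simple polling; a real implementation would wait on merge events
    }
    return true;
  }
}
```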
Re: [I] Incorrect use of fsync [lucene]
viliam-durina commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2772476115

> There has to be a certain trust in what the operating system provides and its consistency guarantees

Sure, but the OS doesn't guarantee anything if you don't fsync. I'd say you can have a degree of optimism that the probability of an undetected corruption is _low enough_. But I think I've made my case that it's possible to fix this with no performance hit and acceptable complexity. I went through the code, and I think it's pretty straightforward in the places I saw. I can try; just tell me whether you'd accept a solution along the lines [described here](https://github.com/apache/lucene/issues/14334#issuecomment-2771963385), so that I don't waste time.

Let's return to the bigger issue: fsyncing non-temporary files before closing them, which probably was the root cause of #10906. Can we agree that this is an issue?
Re: [I] [DISCUSS] Could we have a different ANN algorithm for Learned Sparse Vectors? [lucene]
yuye-aws commented on issue #13675: URL: https://github.com/apache/lucene/issues/13675#issuecomment-2772606870 Hi @atris, thanks for your contribution. I found this paper (**Bridging Dense and Sparse Maximum Inner Product Search**) pretty interesting as it caters to the skip list structure within Lucene: https://arxiv.org/pdf/2309.09013. Can we have a call if possible to discuss how to implement this?
Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]
benwtrent commented on code in PR #14094: URL: https://github.com/apache/lucene/pull/14094#discussion_r2023396128

lucene/CHANGES.txt:
```
@@ -139,6 +139,8 @@ New Features

 * GITHUB#14412: Allow skip cache factor to be updated dynamically. (Sagar Upadhyaya)

+* GITHUB#14094: New KNN query that early terminates when HNSW nearest neighbor queue saturates. (Tommaso Teofili)
```
Review Comment: 10.2 is cut, so this will now be a 10.3 thing :/

Review Comment: Unless this is being added to 10.2
Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]
tteofili merged PR #14094: URL: https://github.com/apache/lucene/pull/14094
Re: [PR] Revert "Add UnwrappingReuseStrategy for AnalyzerWrapper (#14154)" [lucene]
mayya-sharipova merged PR #14432: URL: https://github.com/apache/lucene/pull/14432
Re: [PR] New IndexReaderFunctions.positionLength from the norm [lucene]
rmuir commented on PR #14433: URL: https://github.com/apache/lucene/pull/14433#issuecomment-2774215695

I think the history is just that this norm can contain an arbitrary value, which before was a suboptimal encoding into a single byte. There was a ValueSource that assumed it was a single byte, so that was moved to only work with TFIDF for backwards-compatibility purposes. Elsewhere, the norm was extended and generalized to be an opaque 64-bit value. Depending upon the Similarity's index-time `computeNorm()` implementation, it might not even be possible to decode it to a float. But the default encoding was also fixed to be practical, by @jpountz, whilst still using a single byte. So in practice all the built-in Similarities use the same encoding and can work with this: it just won't work if you extend Similarity to do something else.

Any confusion can be solved with documentation:
* it should be clear that this only works if your similarity uses the default implementation of `computeNorm()`
* I don't think PositionLength is a good name; the norm is not that (see discountOverlaps as an example).

Also I would ask if we really need this `EMPTY` instance: it would be good to keep polymorphism under wraps.
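For context, here is a minimal sketch of the constraint being described, assuming the similarity stored the (possibly overlap-discounted) field length with `SmallFloat.intToByte4`, as the built-in similarities' default `computeNorm()` does; with a custom `computeNorm()` this decoding is meaningless:

```java
import org.apache.lucene.util.SmallFloat;

class NormLengthSketch {
  // Index time: what the default computeNorm() effectively stores -- the field
  // length, lossily compressed into a single byte.
  static long encodeNorm(int fieldLength) {
    return SmallFloat.intToByte4(fieldLength);
  }

  // Search time: recover an approximate length from the stored norm. This is only
  // valid when the default single-byte encoding above was used at index time.
  static int approximateLength(long norm) {
    return SmallFloat.byte4ToInt((byte) norm);
  }
}
```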
Re: [PR] Adding profiling support for concurrent segment search [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2774610745

@jpountz - The code changes are ready for review. For now, I have made changes to accommodate all the timers in `QueryProfilerTimingType`. While this does not cover `rewrite()` and `CollectorManager#reduce` (called out earlier), since they are not part of `QueryProfilerTimingType`, it does lay the groundwork via `createWeight()`. We can take those changes in a follow-up PR.
Re: [I] Incorrect use of fsync [lucene]
dweiss commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2772129998

There has to be a certain trust in what the operating system provides and its consistency guarantees. What you describe seems like a fringe case that - even if possible - falls under the tiny, tiny fraction of possibilities that can end up corrupting your index. The tradeoff here is making everything else potentially slower for the 99.99% of other cases. Let's not be paranoid. It's not possible to fix such low-level corner cases from the Java level. Put some monitoring on remaining disk space, add a UPS and ECC memory. That should get you a long way toward dodging these scenarios.
Re: [PR] Disable the query cache by default. [lucene]
jpountz merged PR #14187: URL: https://github.com/apache/lucene/pull/14187
Re: [I] Incorrect use of fsync [lucene]
uschindler commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771766515

If a file is incomplete, the commit will fail. Lucene's file formats are designed in a way that corruption can be found early (this includes checksums). So zero-byte temp files or truncated ones will cause no issue. The exception handling in Lucene never swallows exceptions; when something goes wrong, it fails the commit.

Please stop arguing here about problems that don't exist. Issue #10906 has nothing to do with temporary files. So please stop arguing about temporary files.
Re: [I] Incorrect use of fsync [lucene]
uschindler commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771774432

> For temporary files, we should either fsync before closing, or start reading without closing the file.

This shows that you have no idea about how Lucene works internally. This is not possible because Lucene solely uses MMap for reading files, and for writing it uses OutputStreams, which can't be reused to read again. For reopening with MMap, it is not possible to reuse any file descriptors which were used for writing; Java just does not allow this.
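As a rough sketch of that separation (the file name and value are made up for illustration): Lucene writes through an `IndexOutput` and reads through a separately opened `IndexInput`, so the write handle has to be closed before the file can be opened (e.g. mmapped) for reading:

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

class SeparateWriteAndReadHandles {
  static void writeThenRead(Directory dir) throws IOException {
    try (IndexOutput out = dir.createOutput("_example.tmp", IOContext.DEFAULT)) {
      out.writeInt(42);
    } // the write handle is closed here; it cannot be turned into a read handle
    try (IndexInput in = dir.openInput("_example.tmp", IOContext.DEFAULT)) {
      int value = in.readInt(); // independent read handle (mmapped under MMapDirectory)
      assert value == 42;
    }
  }
}
```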
Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]
gsmiller commented on code in PR #14267: URL: https://github.com/apache/lucene/pull/14267#discussion_r2025592245

lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
```
@@ -517,6 +623,11 @@ public byte[] getUpperPoint() {
     return upperPoint.clone();
   }

+  // for test
+  public boolean isEqualValues() {
```
Review Comment: I'd really prefer we do not increase our public API surface area only for testing on a class like this. Can we find another way to test without exposing this please?

lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
```
@@ -120,381 +132,475 @@ public void visit(QueryVisitor visitor) {
   public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
       throws IOException {
+    if (this.equalValues) { // lowerPoint == upperPoint
+      return new SinglePointRangeQueryWeight(this, scoreMode, boost);
+    }
     // We don't use RandomAccessWeight here: it's no good to approximate with "match all docs".
     // This is an inverted structure and should be used in the first pass:
+    return new MultiPointRangeQueryWeight(this, scoreMode, boost);
   }
-    return new ConstantScoreWeight(this, boost) {

+  /**
+   * Single-point range query weight implementation class, used to handle the special case where
+   * the lower and upper bounds are equal (i.e. a single-point query).
+   *
+   * Optimizes query performance by reducing the number of comparisons between dimensions. This
+   * implementation is used when the upper and lower bounds of all dimensions are exactly the same.
+   */
+  protected class SinglePointRangeQueryWeight extends PointRangeQueryWeight {
```
Review Comment: Does this actually need to be `protected` instead of `private`? (Same question for `MultiPointRangeQueryWeight` and `PointRangeQueryWeight`.)
[I] Monitor TermFilteredPresearcher does not return stored query if it contains filter field [lucene]
bjacobowitz opened a new issue, #14427: URL: https://github.com/apache/lucene/issues/14427

### Description

`TermFilteredPresearcher` may fail to return stored queries if those queries contain the filter field in the query itself (not just in the metadata).

When building the presearcher query ([here](https://github.com/apache/lucene/blob/ddef6177122d564bf2319fa72b3327ee53ad7019/lucene/monitor/src/java/org/apache/lucene/monitor/TermFilteredPresearcher.java#L99-L152)), `TermFilteredPresearcher` removes filter field tokens from the main clause of the presearcher query ([via the acceptor here](https://github.com/apache/lucene/blob/ddef6177122d564bf2319fa72b3327ee53ad7019/lucene/monitor/src/java/org/apache/lucene/monitor/TermFilteredPresearcher.java#L115-L130)), presumably under the assumption that they will be added back to the query in a separate clause for filter fields, ANDed with the first clause ([here](https://github.com/apache/lucene/blob/ddef6177122d564bf2319fa72b3327ee53ad7019/lucene/monitor/src/java/org/apache/lucene/monitor/TermFilteredPresearcher.java#L138-L146)). We end up with a presearcher query that is something like

```
+(some_other_field_in_query_index: value) #(+(my_filter_field:value))
```

However, if an indexed query contains that filter field, and if that field was the only indexed field for the query's associated document in the query index, the first of the AND'd clauses cannot match (because the filter field was omitted), so the overall AND'd presearcher query cannot match, and the presearcher fails to return the query.

A user can work around this by using an additional dedicated field for the filter field (i.e. adding it to both the query metadata and the document), but this seems like an _easy_ trap to fall into.

**My question here: is this intentional?** Is the idea of a "filter field" that it appears in documents and in MonitorQuery metadata but must not appear in the query itself? I'm aware of another issue about the intended behavior of Monitor filter fields (https://github.com/apache/lucene/issues/11607), so I'm unsure. If intentional, I think we should document it more directly. If unintentional, we might consider removing the check on filter fields ([here](https://github.com/apache/lucene/blob/ddef6177122d564bf2319fa72b3327ee53ad7019/lucene/monitor/src/java/org/apache/lucene/monitor/TermFilteredPresearcher.java#L121)) when building the first part of the presearcher query.

I've set up a test project [here](https://github.com/bjacobowitz/sturdy-chainsaw/tree/main) to demonstrate the problem (specifically in [this test file](https://github.com/bjacobowitz/sturdy-chainsaw/blob/8e24641e2af3e9d5f19de31e78480048f19f5dd4/src/test/java/MonitorTest.java), where [`testWithEmptyMetadata`](https://github.com/bjacobowitz/sturdy-chainsaw/blob/8e24641e2af3e9d5f19de31e78480048f19f5dd4/src/test/java/MonitorTest.java#L22C10-L22C31) works but [`testWithMetadata`](https://github.com/bjacobowitz/sturdy-chainsaw/blob/8e24641e2af3e9d5f19de31e78480048f19f5dd4/src/test/java/MonitorTest.java#L52) fails).

### Version and environment details

Tested with Lucene 10.1.0 on macOS Sequoia 15.3.2 (but I think the problem has been around since at least Lucene version 8, if not earlier).
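For illustration only, the presearcher query shape quoted above can be reconstructed with plain Lucene query classes (the field names are the placeholders from the example, not Monitor internals), which makes it easier to see why a query-index document containing only the filter field cannot match:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

class PresearcherQueryShape {
  // Roughly: +(some_other_field_in_query_index:value) #(+(my_filter_field:value))
  static BooleanQuery build() {
    BooleanQuery.Builder filterClause = new BooleanQuery.Builder();
    filterClause.add(
        new TermQuery(new Term("my_filter_field", "value")), BooleanClause.Occur.MUST);

    BooleanQuery.Builder top = new BooleanQuery.Builder();
    // Main clause: filter-field tokens were stripped by the acceptor, so only other
    // fields remain. If the stored query only used the filter field, nothing here can
    // match its document in the query index.
    top.add(
        new TermQuery(new Term("some_other_field_in_query_index", "value")),
        BooleanClause.Occur.MUST);
    // Filter-field clause, ANDed in as a non-scoring but required FILTER clause.
    top.add(filterClause.build(), BooleanClause.Occur.FILTER);
    return top.build();
  }
}
```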
Re: [I] UnsupportedOperation when merging `Lucene90BlockTreeTermsWriter` [lucene]
mikemccand commented on issue #14429: URL: https://github.com/apache/lucene/issues/14429#issuecomment-2773775763

Phew, this is a spooky exception! I think it means that the same term was fed to the FST Builder twice in a row. The FST Builder in general can support this case: it means a single input can have multiple outputs, and the `Outputs` impl is supposed to be able to combine multiple outputs into a set (internally). But you're right: in this context (BlockTree) there should never be the same term added more than once, each term has a single output, and the Outputs impl does not support merging. It is indeed NOT supposed to happen!

BlockTree is confusing in how it builds up its blocks. It does it one sub-tree at a time, using intermediate FSTs to hold each sub-tree, and then regurgitating the terms from each sub-tree with `FSTTermsEnum`, adding them into a bigger FST Builder to combine multiple sub-trees into a single FST. It keeps doing this up and up the terms trie until it gets to the empty string, and then that FST is the terms index.

So somehow this regurgitation process added the same term twice in a row. This means either a given `FSTTermsEnum` returned the same term twice in a row, or somehow a term was duplicated at the boundary (where one `FSTTermsEnum` ended from a sub-block, and the next `FSTTermsEnum` began).

Do we know any fun details about the use case? Maybe an exotic/old JVM? Massive numbers of terms...? Or the terms are some crazy binary gene sequences or something?
Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]
jainankitk commented on code in PR #14397: URL: https://github.com/apache/lucene/pull/14397#discussion_r2021747427

lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java:
```java
@@ -512,6 +512,7 @@ private void doReset(int docID) throws IOException {
       bytes.offset = bytes.length = 0;
       for (int decompressed = 0; decompressed < totalLength; ) {
         final int toDecompress = Math.min(totalLength - decompressed, chunkSize);
+        decompressor.reset();
         decompressor.decompress(fieldsStream, toDecompress, 0, toDecompress, spare);
```
Review Comment:
> When the block changes, we must discard the cache in time, this operation can only be detected from external.

I am not questioning that. My point is to not have a `reset` method in the Decompressor interface, and instead add another decompress method that takes `reuseIfPossible` as one of the parameters. It ensures functional correctness even if we don't make the `reset` call from somewhere in the code, and allows explicit optimization wherever we deem appropriate. The risk of not explicitly making the `reset` call is much higher than that of using the original decompress without reuse.

```java
public abstract class Decompressor implements Cloneable {

  protected Decompressor() {}

  public void decompress(
      DataInput in, int originalLength, int offset, int length, BytesRef bytes)
      throws IOException {
    decompress(in, originalLength, offset, length, bytes, false);
  }

  public abstract void decompress(
      DataInput in, int originalLength, int offset, int length, BytesRef bytes,
      boolean reuseIfPossible)
      throws IOException;

  @Override
  public abstract Decompressor clone();
}
```
[PR] Revert "Add UnwrappingReuseStrategy for AnalyzerWrapper (#14154)" [lucene]
mayya-sharipova opened a new pull request, #14430: URL: https://github.com/apache/lucene/pull/14430

This reverts commit ce2a917cf2c2f40b3996656f3b294e3c01d25e5b.

### Description

In Elasticsearch (and probably other applications) we reuse the same analyzer across fields, and this change breaks that. A trivial fix in Lucene would be to make the default reuse strategy per_field, but I don't know all the implications of this, so I will revert my change for now.

For example, the test below currently fails:

```java
private static class PhraseWrappedAnalyzer extends AnalyzerWrapper {
  private final Analyzer delegate;
  private final int posIncGap;

  PhraseWrappedAnalyzer(Analyzer delegate, int posIncGap) {
    super(delegate.getReuseStrategy());
    this.delegate = delegate;
    this.posIncGap = posIncGap;
  }

  @Override
  public int getPositionIncrementGap(String fieldName) {
    // Return the configured gap; delegating via delegate.getPositionIncrementGap(fieldName)
    // would also work, but this test does not rely on it.
    return posIncGap;
  }

  @Override
  public int getOffsetGap(String fieldName) {
    // Delegate the offset gap for completeness
    return delegate.getOffsetGap(fieldName);
  }

  @Override
  protected Analyzer getWrappedAnalyzer(String fieldName) {
    return delegate;
  }

  @Override
  protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
    // Wrap the delegate's token stream with a ShingleFilter to produce bigrams
    return new TokenStreamComponents(
        components.getSource(), new ShingleFilter(components.getTokenStream(), 2));
  }
}

public void testIndexDiffFieldsSameAnalyzer() throws IOException {
  final Analyzer textAnalyzer = new StandardAnalyzer();
  final Analyzer phraseAnalyzer = new PhraseWrappedAnalyzer(textAnalyzer, 0);

  FieldType textVectorType = new FieldType(TextField.TYPE_NOT_STORED);
  textVectorType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
  textVectorType.freeze();

  Map<String, Analyzer> analyzerMap = new HashMap<>();
  analyzerMap.put("text", textAnalyzer);
  // Use this field to store phrase tokens
  analyzerMap.put("text_phrases", phraseAnalyzer);
  Analyzer perFieldAnalyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), analyzerMap);

  Directory dir = newDirectory();
  IndexWriterConfig iwc = newIndexWriterConfig(perFieldAnalyzer);
  IndexWriter writer = new IndexWriter(dir, iwc);

  int maxDocs = 4;
  String content = "the quick brown fox jumped over the lazy dog";
  for (int i = 0; i < maxDocs; i++) {
    Document doc = new Document();
    doc.add(new Field("text", content, textVectorType));
    doc.add(new Field("text_phrases", content, textVectorType));
    writer.addDocument(doc);
  }
  writer.commit();

  try (IndexReader reader = DirectoryReader.open(writer)) {
    assertEquals("Should have indexed maxDocs documents", maxDocs, reader.numDocs());

    // Verify term frequencies for the 'text' field
    Terms textTerms = MultiTerms.getTerms(reader, "text");
    assertNotNull("Terms should exist for 'text' field", textTerms);
    TermsEnum textTermsEnum = textTerms.iterator();
    BytesRef term;
    int termCount = 0;
    while ((term = textTermsEnum.next()) != null) {
      assertEquals(
          "Incorrect docFreq for term '" + term.utf8ToString() + "' in field 'text'",
          maxDocs,
          textTermsEnum.docFreq());
      termCount++;
    }
    assertTrue("Should find terms in 'text' field", termCount > 0);

    // Verify term frequencies for the 'text_phrases' field (shingles)
    Terms phraseTerms = MultiTerms.getTerms(reader, "text_phrases");
    assertNotNull("Terms should exist for 'text_phrases' field", phraseTerms);
    TermsEnum phraseTermsEnum = phraseTerms.iterator();
    BytesRef phrase;
    int phraseCount = 0;
    while ((phrase = phraseTermsEnum.next()) != null) {
      assertEquals(
          "Incorrect docFreq for phrase '" + phrase.utf8ToString() + "' in field 'text_phrases'",
          maxDocs,
          phraseTermsEnum.docFreq());
      phraseCount++;
    }
    assertTrue("Should find phrases in 'text_phrases' field", phraseCount > 0);
  }
  writer.close();
  dir.close();
}
```
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2771809487

All dependent Faiss PRs are merged:
1. https://github.com/facebookresearch/faiss/pull/4158: Support pre-filtering on a Java `long[]` (underlying of `FixedBitSet`) using `IDSelectorBitmap`
2. https://github.com/facebookresearch/faiss/pull/4167: Allow a pre-filtered HNSW search, taking other parameters from the index
3. https://github.com/facebookresearch/faiss/pull/4180: Simplify reading / writing to Lucene segments using custom IO
4. https://github.com/facebookresearch/faiss/pull/4186: Publish the C_API to Conda for ease of use

Looking for reviews, as I believe this PR is ready to merge now.
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on PR #14426: URL: https://github.com/apache/lucene/pull/14426#issuecomment-2773007104

Given the feedback so far, I've pivoted this quite a bit to now include per-field metrics. To support this I removed the previously proposed OffHeapAccountable interface and put the accessor on KnnVectorsReader - where the other per-field accessors are. I used FieldInfo as the lookup since it is needed to determine whether the field is Byte or Float32, which the "richer" codec implementations need to know, as they provide quantization of floats and passthrough for bytes.
Re: [I] Incorrect use of fsync [lucene]
viliam-durina commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771546918

There was no crash, that's the problem. You wrote a file, then opened it again for reading, and it's corrupted, and no IO error was reported. As I said, if you close a file without fsyncing, write-back errors can't be reported to the writing process. The IO error might be a network glitch on your SAN, which doesn't automatically crash the whole system.
[I] UnsupportedOperation when merging `Lucene90BlockTreeTermsWriter` [lucene]
benwtrent opened a new issue, #14429: URL: https://github.com/apache/lucene/issues/14429

### Description

Found this in the wild. I haven't been able to replicate :( I don't even know what it means to hit this `fst.outputs.merge` branch, or under what conditions it is valid/invalid. Any pointers here would be useful.

We ran into a strange postings merge error in production. The FST compiler reaches the "merge" line when merging some segments:

```java
if (lastInput.length() == input.length && prefixLenPlus1 == 1 + input.length) {
  // same input more than 1 time in a row, mapping to
  // multiple outputs
  lastNode.output = fst.outputs.merge(lastNode.output, output);
```

However, the "outputs" provided by `Lucene90BlockTreeTermsWriter` is `ByteSequenceOutputs`, which does not override merge, and thus throws an unsupported operation exception.

Given this, it seems like it should be "impossible" to reach the `Outputs.merge` path when merging with the `Lucene90BlockTreeTermsWriter`, but somehow it did. Any ideas on where I should look?

```
Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: unsupported_operation_exception: null
    at org.apache.lucene.util.fst.Outputs.merge(Outputs.java:95) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.util.fst.FSTCompiler.add(FSTCompiler.java:936) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.append(Lucene90BlockTreeTermsWriter.java:593) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.compileIndex(Lucene90BlockTreeTermsWriter.java:562) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlocks(Lucene90BlockTreeTermsWriter.java:776) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.finish(Lucene90BlockTreeTermsWriter.java:1163) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:402) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:204) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:211) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:300) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5293) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4761) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6582) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:660) ~[lucene-core-9.11.1.jar:?]
    at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:134) ~[elasticsearch-8.15.0.jar:?]
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:721) ~[lucene-core-9.11.1.jar:?]
```

### Version and environment details

Lucene 9.11.1
Re: [PR] OptimisticKnnVectorQuery [lucene]
benwtrent commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2773659852

@dungba88 @msokolov Do we know where we stand with this? I wonder if we should simply use this to replace the current buggy behavior: fanning out to segments multiple times with sane sync points should fix our consistency issues, and if the performance is on par with the current information-sharing logic, it seems like a net win. Anything I can do to help here?
Re: [I] Incorrect use of fsync [lucene]
uschindler commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771580012

> There was no crash, that's the problem. You wrote a file, then opened it again for reading and it's corrupted, and there was no IO error reported.
>
> As I said, if you close a file without fsyncing, write-back errors can't be reported to the writing process. The IO error might be a network glitch on your SAN, which doesn't automatically crash the whole system.

This is a general problem and does not have anything to do with fsync. If this were really a problem, then every application, everywhere, would always need to call fsync on close. So it would need to be the default in the OS.

About temporary files: if a temporary file in the Lucene index is corrupted, then Lucene will bail out, which is something that could always happen. There are no guarantees: because Lucene also caches documents in memory before writing, it may or may not create a temporary file! In Lucene, if you want to persist the index, you need to commit. After commit, no temporary files are in use. If a temporary file was corrupt before, the commit fails.

So in short: no problem.
Re: [I] Incorrect use of fsync [lucene]
dweiss commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771582280 Ok, sorry but the scenario you're describing is insane to me. If something like this happens, I don't think it's Lucene's duty to try to correct it - it seems like the entire system is broken then. I don't think it makes sense to be so paranoid. There are some measures in Lucene to check for sanity (checksums on index files). I think this is sufficient.