Re: [PR] Revert "Add UnwrappingReuseStrategy for AnalyzerWrapper (#14154)" [lucene]
mayya-sharipova merged PR #14430: URL: https://github.com/apache/lucene/pull/14430
[PR] Revert "Add UnwrappingReuseStrategy for AnalyzerWrapper (#14154)" [lucene]
mayya-sharipova opened a new pull request, #14432: URL: https://github.com/apache/lucene/pull/14432

This reverts commit 1a676b64f89069284bf7d0510162f8095bba3980. Revert based on https://github.com/apache/lucene/pull/14430
Re: [PR] Adding profiling support for concurrent segment search [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2774601593

> I'd have a top-level tree for everything related to initializing the search and combining results (rewrite(), createWeight(), CollectorManager#reduce) and then a list of trees for each slice.

While working on the code, I realized it is better to have the list of slices within the tree itself at each level, instead of repeating the query structure and information across multiple trees. In this approach, we can easily view the tree for a specific `sliceId` using jq or a simple script. The structure looks like below:

```
"query": [                     <-- list of root queries
  {
    "type": "TermQuery",
    "description": "foo:bar",
    "startTime": 11972972,
    "totalTime": 354343,
    "breakdown": {.},          <-- query-level breakdown, like weight count and time
    "sliceBreakdowns": [
      {.},                     <-- first slice information
      {.}                      <-- second slice information
    ],
    "queryChildren": [
      {..},                    <-- recursive repetition of the above structure
      {..}
    ]
```

> Related, it'd be nice if each per-slice object could also tell us about the thread that it ran in and its start time so that we could understand how exactly Lucene managed to parallelize the search.

Yes, that would be really useful. I have included the threadId as `sliceId`, along with `startTime` and `totalTime` information, for each slice object at every query level.
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
github-actions[bot] commented on PR #14335: URL: https://github.com/apache/lucene/pull/14335#issuecomment-2774035527 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
[I] Add a timeout for forceMergeDeletes in IndexWriter [lucene]
houserjohn opened a new issue, #14431: URL: https://github.com/apache/lucene/issues/14431

### Description

Using IndexWriter's `forceMergeDeletes` to eliminate merge debt is a very useful feature -- especially during initial indexing. However, larger indexes can require 20+ minutes to finish a `forceMergeDeletes` call. It would be useful to add an API for specifying a timeout alongside the existing `doWait` argument. Instead of blocking until `forceMergeDeletes` finishes, we should block until either `forceMergeDeletes` finishes or the timeout is reached, whichever comes first.
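For illustration only — not the proposed API — a caller can approximate this behavior today with the existing non-blocking `forceMergeDeletes(false)` plus polling of `IndexWriter.hasPendingMerges()`. Note that `hasPendingMerges()` covers all pending merges, not just deletes-only ones, so this is a rough sketch of the desired semantics:

```java
import java.util.concurrent.TimeUnit;
import org.apache.lucene.index.IndexWriter;

final class ForceMergeDeletesWithTimeout {
  /**
   * Kicks off forceMergeDeletes without blocking, then waits up to the given timeout
   * for pending merges to drain. Returns true if no merges remain, false on timeout
   * (merges keep running in the background either way).
   */
  static boolean forceMergeDeletes(IndexWriter writer, long timeout, TimeUnit unit)
      throws Exception {
    writer.forceMergeDeletes(false); // request the merges, do not wait
    final long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (writer.hasPendingMerges()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out
      }
      Thread.sleep(100); // simple polling; a real implementation would wait on merge events
    }
    return true;
  }
}
```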
Re: [I] Incorrect use of fsync [lucene]
viliam-durina commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2772476115

> There has to be a certain trust in what the operating system provides and its consistency guarantees

Sure, but the OS doesn't guarantee anything if you don't fsync. I'd say you can have a degree of optimism that the probability of an undetected corruption is _low enough_. But I think I've made my case that it's possible to fix this with no performance hit and acceptable complexity. I went through the code, and I think it's pretty straightforward in the places I saw. I can try; just tell me whether you'd accept a solution along the lines [described here](https://github.com/apache/lucene/issues/14334#issuecomment-2771963385), so that I don't waste time.

Let's return to the bigger issue: fsyncing non-temporary files before closing them, which probably was the root cause of #10906. Can we agree that this is an issue?
Re: [I] [DISCUSS] Could we have a different ANN algorithm for Learned Sparse Vectors? [lucene]
yuye-aws commented on issue #13675: URL: https://github.com/apache/lucene/issues/13675#issuecomment-2772606870 Hi @atris, thanks for your contribution. I found this paper (**Bridging Dense and Sparse Maximum Inner Product Search**) pretty interesting as it caters to the skip list structure within Lucene: https://arxiv.org/pdf/2309.09013. Can we have a call if possible to discuss how to implement this?
Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]
benwtrent commented on code in PR #14094: URL: https://github.com/apache/lucene/pull/14094#discussion_r2023396128

lucene/CHANGES.txt:
```
@@ -139,6 +139,8 @@ New Features

 * GITHUB#14412: Allow skip cache factor to be updated dynamically. (Sagar Upadhyaya)

+* GITHUB#14094: New KNN query that early terminates when HNSW nearest neighbor queue saturates. (Tommaso Teofili)
```
Review Comment: 10.2 is cut, so this will now be a 10.3 thing :/

Review Comment: Unless this is being added to 10.2
Re: [PR] Add a HNSW collector that exits early when nearest neighbor queue saturates [lucene]
tteofili merged PR #14094: URL: https://github.com/apache/lucene/pull/14094
Re: [PR] Revert "Add UnwrappingReuseStrategy for AnalyzerWrapper (#14154)" [lucene]
mayya-sharipova merged PR #14432: URL: https://github.com/apache/lucene/pull/14432
Re: [PR] New IndexReaderFunctions.positionLength from the norm [lucene]
rmuir commented on PR #14433: URL: https://github.com/apache/lucene/pull/14433#issuecomment-2774215695

I think the history is just that this norm can contain an arbitrary value, which before was a suboptimal encoding into a single byte. There was a ValueSource that assumed it was a single byte, so that was moved to only work with TFIDF for backwards-compatibility purposes. Elsewhere, the norm was extended and generalized to be an opaque 64-bit value. Depending upon the Similarity's index-time `computeNorm()` implementation, it might not even be possible to decode it to a float. But the default encoding was also fixed to be practical, by @jpountz, whilst still using a single byte. So in practice all the built-in Similarities use the same encoding and can work with this: it just won't work if you extend Similarity to do something else.

Any confusion can be solved with documentation:
* it should be clear that this only works if your similarity uses the default implementation of `computeNorm()`
* I don't think PositionLength is a good name; the norm is not that (see discountOverlaps as an example).

Also I would ask if we really need this `EMPTY` instance: it would be good to keep polymorphism under wraps.
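For context, here is a minimal sketch of the constraint being described, assuming the similarity stored the (possibly overlap-discounted) field length with `SmallFloat.intToByte4`, as the built-in similarities' default `computeNorm()` does; with a custom `computeNorm()` this decoding is meaningless:

```java
import org.apache.lucene.util.SmallFloat;

class NormLengthSketch {
  // Index time: what the default computeNorm() effectively stores -- the field
  // length, lossily compressed into a single byte.
  static long encodeNorm(int fieldLength) {
    return SmallFloat.intToByte4(fieldLength);
  }

  // Search time: recover an approximate length from the stored norm. This is only
  // valid when the default single-byte encoding above was used at index time.
  static int approximateLength(long norm) {
    return SmallFloat.byte4ToInt((byte) norm);
  }
}
```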
Re: [PR] Adding profiling support for concurrent segment search [lucene]
jainankitk commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2774610745

@jpountz - The code changes are ready for review. For now, I have made changes to accommodate all the timers in `QueryProfilerTimingType`. While this does not cover `rewrite()` and `CollectorManager#reduce` (called out earlier), since they are not part of `QueryProfilerTimingType`, it does lay the groundwork via `createWeight()`. We can take those changes in a follow-up PR.
Re: [I] Incorrect use of fsync [lucene]
dweiss commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2772129998

There has to be a certain trust in what the operating system provides and its consistency guarantees. What you describe seems like a fringe case that - even if possible - falls under the tiny, tiny fraction of possibilities that can end up corrupting your index. The tradeoff here is making everything else potentially slower for the 99.99% of other cases. Let's not be paranoid. It's not possible to fix such low-level corner cases from the Java level. Put some monitoring on remaining disk space, add a UPS and ECC memory. That should get you a long way toward dodging these scenarios.
Re: [PR] Disable the query cache by default. [lucene]
jpountz merged PR #14187: URL: https://github.com/apache/lucene/pull/14187
Re: [I] Incorrect use of fsync [lucene]
uschindler commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771766515

If a file is incomplete, the commit will fail. Lucene's file formats are designed in a way that corruption can be found early (this includes checksums). So zero-byte temp files or truncated ones will cause no issue. The exception handling in Lucene never swallows exceptions; when something goes wrong, it fails the commit.

Please stop arguing here about problems that don't exist. Issue #10906 has nothing to do with temporary files. So please stop arguing about temporary files.
Re: [I] Incorrect use of fsync [lucene]
uschindler commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771774432

> For temporary files, we should either fsync before closing, or start reading without closing the file.

This shows that you have no idea about how Lucene works internally. This is not possible because Lucene solely uses MMap for reading files, and for writing it uses OutputStreams, which can't be reused to read again. For reopening with MMap, it is not possible to reuse any file descriptors which were used for writing; Java just does not allow this.
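As a rough sketch of that separation (the file name and value are made up for illustration): Lucene writes through an `IndexOutput` and reads through a separately opened `IndexInput`, so the write handle has to be closed before the file can be opened (e.g. mmapped) for reading:

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

class SeparateWriteAndReadHandles {
  static void writeThenRead(Directory dir) throws IOException {
    try (IndexOutput out = dir.createOutput("_example.tmp", IOContext.DEFAULT)) {
      out.writeInt(42);
    } // the write handle is closed here; it cannot be turned into a read handle
    try (IndexInput in = dir.openInput("_example.tmp", IOContext.DEFAULT)) {
      int value = in.readInt(); // independent read handle (mmapped under MMapDirectory)
      assert value == 42;
    }
  }
}
```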
Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]
gsmiller commented on code in PR #14267: URL: https://github.com/apache/lucene/pull/14267#discussion_r2025592245

lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
```
@@ -517,6 +623,11 @@ public byte[] getUpperPoint() {
     return upperPoint.clone();
   }

+  // for test
+  public boolean isEqualValues() {
```
Review Comment: I'd really prefer we do not increase our public API surface area only for testing on a class like this. Can we find another way to test without exposing this please?

lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
```
@@ -120,381 +132,475 @@ public void visit(QueryVisitor visitor) {
   public final Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
       throws IOException {
+    if (this.equalValues) { // lowerPoint == upperPoint
+      return new SinglePointRangeQueryWeight(this, scoreMode, boost);
+    }
     // We don't use RandomAccessWeight here: it's no good to approximate with "match all docs".
     // This is an inverted structure and should be used in the first pass:
+    return new MultiPointRangeQueryWeight(this, scoreMode, boost);
   }
-    return new ConstantScoreWeight(this, boost) {

+  /**
+   * Single-point range query weight implementation class, used to handle the special case where
+   * the lower and upper bounds are equal (i.e. a single-point query).
+   *
+   * Optimizes query performance by reducing the number of comparisons between dimensions. This
+   * implementation is used when the upper and lower bounds of all dimensions are exactly the same.
+   */
+  protected class SinglePointRangeQueryWeight extends PointRangeQueryWeight {
```
Review Comment: Does this actually need to be `protected` instead of `private`? (Same question for `MultiPointRangeQueryWeight` and `PointRangeQueryWeight`.)
[I] Monitor TermFilteredPresearcher does not return stored query if it contains filter field [lucene]
bjacobowitz opened a new issue, #14427: URL: https://github.com/apache/lucene/issues/14427

### Description

`TermFilteredPresearcher` may fail to return stored queries if those queries contain the filter field in the query itself (not just in the metadata).

When building the presearcher query ([here](https://github.com/apache/lucene/blob/ddef6177122d564bf2319fa72b3327ee53ad7019/lucene/monitor/src/java/org/apache/lucene/monitor/TermFilteredPresearcher.java#L99-L152)), `TermFilteredPresearcher` removes filter field tokens from the main clause of the presearcher query ([via the acceptor here](https://github.com/apache/lucene/blob/ddef6177122d564bf2319fa72b3327ee53ad7019/lucene/monitor/src/java/org/apache/lucene/monitor/TermFilteredPresearcher.java#L115-L130)), presumably under the assumption that they will be added back to the query in a separate clause for filter fields, ANDed with the first clause ([here](https://github.com/apache/lucene/blob/ddef6177122d564bf2319fa72b3327ee53ad7019/lucene/monitor/src/java/org/apache/lucene/monitor/TermFilteredPresearcher.java#L138-L146)). We end up with a presearcher query that is something like

```
+(some_other_field_in_query_index: value) #(+(my_filter_field:value))
```

However, if an indexed query contains that filter field, and if that field was the only indexed field for the query's associated document in the query index, the first of the AND'd clauses cannot match (because the filter field was omitted), so the overall AND'd presearcher query cannot match, and the presearcher fails to return the query.

A user can work around this by using an additional dedicated field for the filter field (i.e. adding it to both the query metadata and the document), but this seems like an _easy_ trap to fall into.

**My question here: is this intentional?** Is the idea of a "filter field" that it appears in documents and in MonitorQuery metadata but must not appear in the query itself? I'm aware of another issue about the intended behavior of Monitor filter fields (https://github.com/apache/lucene/issues/11607), so I'm unsure. If intentional, I think we should document it more directly. If unintentional, we might consider removing the check on filter fields ([here](https://github.com/apache/lucene/blob/ddef6177122d564bf2319fa72b3327ee53ad7019/lucene/monitor/src/java/org/apache/lucene/monitor/TermFilteredPresearcher.java#L121)) when building the first part of the presearcher query.

I've set up a test project [here](https://github.com/bjacobowitz/sturdy-chainsaw/tree/main) to demonstrate the problem (specifically in [this test file](https://github.com/bjacobowitz/sturdy-chainsaw/blob/8e24641e2af3e9d5f19de31e78480048f19f5dd4/src/test/java/MonitorTest.java), where [`testWithEmptyMetadata`](https://github.com/bjacobowitz/sturdy-chainsaw/blob/8e24641e2af3e9d5f19de31e78480048f19f5dd4/src/test/java/MonitorTest.java#L22C10-L22C31) works but [`testWithMetadata`](https://github.com/bjacobowitz/sturdy-chainsaw/blob/8e24641e2af3e9d5f19de31e78480048f19f5dd4/src/test/java/MonitorTest.java#L52) fails).

### Version and environment details

Tested with Lucene 10.1.0 on macOS Sequoia 15.3.2 (but I think the problem has been around since at least Lucene version 8, if not earlier).
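For illustration only, the presearcher query shape quoted above can be reconstructed with plain Lucene query classes (the field names are the placeholders from the example, not Monitor internals), which makes it easier to see why a query-index document containing only the filter field cannot match:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

class PresearcherQueryShape {
  // Roughly: +(some_other_field_in_query_index:value) #(+(my_filter_field:value))
  static BooleanQuery build() {
    BooleanQuery.Builder filterClause = new BooleanQuery.Builder();
    filterClause.add(
        new TermQuery(new Term("my_filter_field", "value")), BooleanClause.Occur.MUST);

    BooleanQuery.Builder top = new BooleanQuery.Builder();
    // Main clause: filter-field tokens were stripped by the acceptor, so only other
    // fields remain. If the stored query only used the filter field, nothing here can
    // match its document in the query index.
    top.add(
        new TermQuery(new Term("some_other_field_in_query_index", "value")),
        BooleanClause.Occur.MUST);
    // Filter-field clause, ANDed in as a non-scoring but required FILTER clause.
    top.add(filterClause.build(), BooleanClause.Occur.FILTER);
    return top.build();
  }
}
```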
Re: [I] UnsupportedOperation when merging `Lucene90BlockTreeTermsWriter` [lucene]
mikemccand commented on issue #14429: URL: https://github.com/apache/lucene/issues/14429#issuecomment-2773775763

Phew, this is a spooky exception! I think it means that the same term was fed to the FST Builder twice in a row. The FST Builder in general can support this case: it means a single input can have multiple outputs, and the `Outputs` impl is supposed to be able to combine multiple outputs into a set (internally). But you're right: in this context (BlockTree) there should never be the same term added more than once, each term has a single output, and the Outputs impl does not support merging. It is indeed NOT supposed to happen!

BlockTree is confusing in how it builds up its blocks. It does it one sub-tree at a time, using intermediate FSTs to hold each sub-tree, and then regurgitating the terms from each sub-tree with `FSTTermsEnum`, adding them into a bigger FST Builder to combine multiple sub-trees into a single FST. It keeps doing this up and up the terms trie until it gets to the empty string, and then that FST is the terms index.

So somehow this regurgitation process added the same term twice in a row. This means either a given `FSTTermsEnum` returned the same term twice in a row, or somehow a term was duplicated at the boundary (where one `FSTTermsEnum` ended from a sub-block, and the next `FSTTermsEnum` began).

Do we know any fun details about the use case? Maybe an exotic/old JVM? Massive numbers of terms...? Or the terms are some crazy binary gene sequences or something?
Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]
jainankitk commented on code in PR #14397: URL: https://github.com/apache/lucene/pull/14397#discussion_r2021747427

lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsReader.java:
```java
@@ -512,6 +512,7 @@ private void doReset(int docID) throws IOException {
       bytes.offset = bytes.length = 0;
       for (int decompressed = 0; decompressed < totalLength; ) {
         final int toDecompress = Math.min(totalLength - decompressed, chunkSize);
+        decompressor.reset();
         decompressor.decompress(fieldsStream, toDecompress, 0, toDecompress, spare);
```
Review Comment:
> When the block changes, we must discard the cache in time, this operation can only be detected from external.

I am not questioning that. My point is to not have a `reset` method in the Decompressor interface, and instead add another decompress method that takes `reuseIfPossible` as one of the parameters. It ensures functional correctness even if we don't make the `reset` call from somewhere in the code, and allows explicit optimization wherever we deem appropriate. The risk of not explicitly making the `reset` call is much higher than that of using the original decompress without reuse.

```java
public abstract class Decompressor implements Cloneable {

  protected Decompressor() {}

  public void decompress(
      DataInput in, int originalLength, int offset, int length, BytesRef bytes)
      throws IOException {
    decompress(in, originalLength, offset, length, bytes, false);
  }

  public abstract void decompress(
      DataInput in, int originalLength, int offset, int length, BytesRef bytes,
      boolean reuseIfPossible)
      throws IOException;

  @Override
  public abstract Decompressor clone();
}
```
[PR] Revert "Add UnwrappingReuseStrategy for AnalyzerWrapper (#14154)" [lucene]
mayya-sharipova opened a new pull request, #14430: URL: https://github.com/apache/lucene/pull/14430

This reverts commit ce2a917cf2c2f40b3996656f3b294e3c01d25e5b.

### Description

In Elasticsearch (and probably other applications) we reuse the same analyzer across fields, and this change breaks that. A trivial fix in Lucene would be to make the default reuse strategy per_field, but I don't know all the implications of this, so I will revert my change for now.

For example, the test below currently fails:

```java
private static class PhraseWrappedAnalyzer extends AnalyzerWrapper {
  private final Analyzer delegate;
  private final int posIncGap;

  PhraseWrappedAnalyzer(Analyzer delegate, int posIncGap) {
    super(delegate.getReuseStrategy());
    this.delegate = delegate;
    this.posIncGap = posIncGap;
  }

  @Override
  public int getPositionIncrementGap(String fieldName) {
    // Return the configured gap; delegating via delegate.getPositionIncrementGap(fieldName)
    // would also work, but this test does not rely on it.
    return posIncGap;
  }

  @Override
  public int getOffsetGap(String fieldName) {
    // Delegate the offset gap for completeness
    return delegate.getOffsetGap(fieldName);
  }

  @Override
  protected Analyzer getWrappedAnalyzer(String fieldName) {
    return delegate;
  }

  @Override
  protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
    // Wrap the delegate's token stream with a ShingleFilter to produce bigrams
    return new TokenStreamComponents(
        components.getSource(), new ShingleFilter(components.getTokenStream(), 2));
  }
}

public void testIndexDiffFieldsSameAnalyzer() throws IOException {
  final Analyzer textAnalyzer = new StandardAnalyzer();
  final Analyzer phraseAnalyzer = new PhraseWrappedAnalyzer(textAnalyzer, 0);

  FieldType textVectorType = new FieldType(TextField.TYPE_NOT_STORED);
  textVectorType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
  textVectorType.freeze();

  Map<String, Analyzer> analyzerMap = new HashMap<>();
  analyzerMap.put("text", textAnalyzer);
  // Use this field to store phrase tokens
  analyzerMap.put("text_phrases", phraseAnalyzer);
  Analyzer perFieldAnalyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), analyzerMap);

  Directory dir = newDirectory();
  IndexWriterConfig iwc = newIndexWriterConfig(perFieldAnalyzer);
  IndexWriter writer = new IndexWriter(dir, iwc);

  int maxDocs = 4;
  String content = "the quick brown fox jumped over the lazy dog";
  for (int i = 0; i < maxDocs; i++) {
    Document doc = new Document();
    doc.add(new Field("text", content, textVectorType));
    doc.add(new Field("text_phrases", content, textVectorType));
    writer.addDocument(doc);
  }
  writer.commit();

  try (IndexReader reader = DirectoryReader.open(writer)) {
    assertEquals("Should have indexed maxDocs documents", maxDocs, reader.numDocs());

    // Verify term frequencies for the 'text' field
    Terms textTerms = MultiTerms.getTerms(reader, "text");
    assertNotNull("Terms should exist for 'text' field", textTerms);
    TermsEnum textTermsEnum = textTerms.iterator();
    BytesRef term;
    int termCount = 0;
    while ((term = textTermsEnum.next()) != null) {
      assertEquals(
          "Incorrect docFreq for term '" + term.utf8ToString() + "' in field 'text'",
          maxDocs,
          textTermsEnum.docFreq());
      termCount++;
    }
    assertTrue("Should find terms in 'text' field", termCount > 0);

    // Verify term frequencies for the 'text_phrases' field (shingles)
    Terms phraseTerms = MultiTerms.getTerms(reader, "text_phrases");
    assertNotNull("Terms should exist for 'text_phrases' field", phraseTerms);
    TermsEnum phraseTermsEnum = phraseTerms.iterator();
    BytesRef phrase;
    int phraseCount = 0;
    while ((phrase = phraseTermsEnum.next()) != null) {
      assertEquals(
          "Incorrect docFreq for phrase '" + phrase.utf8ToString() + "' in field 'text_phrases'",
          maxDocs,
          phraseTermsEnum.docFreq());
      phraseCount++;
    }
    assertTrue("Should find phrases in 'text_phrases' field", phraseCount > 0);
  }
  writer.close();
  dir.close();
}
```
Re: [PR] Add a Faiss codec for KNN searches [lucene]
kaivalnp commented on PR #14178: URL: https://github.com/apache/lucene/pull/14178#issuecomment-2771809487

All dependent Faiss PRs are merged:
1. https://github.com/facebookresearch/faiss/pull/4158: Support pre-filtering on a Java `long[]` (underlying of `FixedBitSet`) using `IDSelectorBitmap`
2. https://github.com/facebookresearch/faiss/pull/4167: Allow a pre-filtered HNSW search, taking other parameters from the index
3. https://github.com/facebookresearch/faiss/pull/4180: Simplify reading / writing to Lucene segments using custom IO
4. https://github.com/facebookresearch/faiss/pull/4186: Publish the C_API to Conda for ease of use

Looking for reviews, as I believe this PR is ready to merge now.
Re: [PR] Add support for determining off-heap memory requirements for KnnVectorsReader [lucene]
ChrisHegarty commented on PR #14426: URL: https://github.com/apache/lucene/pull/14426#issuecomment-2773007104

Given the feedback so far, I've pivoted this quite a bit to now include per-field metrics. To support this I removed the previously proposed OffHeapAccountable interface and put the accessor on KnnVectorsReader - where the other per-field accessors are. I used FieldInfo as the lookup since it is needed to determine whether the field is Byte or Float32, which the "richer" codec implementations need to know, as they provide quantization of floats and passthrough for bytes.
Re: [I] Incorrect use of fsync [lucene]
viliam-durina commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771546918

There was no crash, that's the problem. You wrote a file, then opened it again for reading, and it's corrupted, and no IO error was reported. As I said, if you close a file without fsyncing, write-back errors can't be reported to the writing process. The IO error might be a network glitch on your SAN, which doesn't automatically crash the whole system.
[I] UnsupportedOperation when merging `Lucene90BlockTreeTermsWriter` [lucene]
benwtrent opened a new issue, #14429: URL: https://github.com/apache/lucene/issues/14429

### Description

Found this in the wild. I haven't been able to replicate :( I don't even know what it means to hit this `fst.outputs.merge` branch, or under what conditions it is valid/invalid. Any pointers here would be useful.

We ran into a strange postings merge error in production. The FST compiler reaches the "merge" line when merging some segments:

```java
if (lastInput.length() == input.length && prefixLenPlus1 == 1 + input.length) {
  // same input more than 1 time in a row, mapping to
  // multiple outputs
  lastNode.output = fst.outputs.merge(lastNode.output, output);
```

However, the "outputs" provided by `Lucene90BlockTreeTermsWriter` is `ByteSequenceOutputs`, which does not override merge, and thus throws an unsupported operation exception.

Given this, it seems like it should be "impossible" to reach the `Outputs.merge` path when merging with the `Lucene90BlockTreeTermsWriter`, but somehow it did. Any ideas on where I should look?

```
Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: unsupported_operation_exception: null
    at org.apache.lucene.util.fst.Outputs.merge(Outputs.java:95) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.util.fst.FSTCompiler.add(FSTCompiler.java:936) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.append(Lucene90BlockTreeTermsWriter.java:593) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$PendingBlock.compileIndex(Lucene90BlockTreeTermsWriter.java:562) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlocks(Lucene90BlockTreeTermsWriter.java:776) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.finish(Lucene90BlockTreeTermsWriter.java:1163) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:402) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:204) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:211) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:300) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5293) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4761) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6582) ~[lucene-core-9.11.1.jar:?]
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:660) ~[lucene-core-9.11.1.jar:?]
    at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:134) ~[elasticsearch-8.15.0.jar:?]
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:721) ~[lucene-core-9.11.1.jar:?]
```

### Version and environment details

Lucene 9.11.1
Re: [PR] OptimisticKnnVectorQuery [lucene]
benwtrent commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2773659852

@dungba88 @msokolov Do we know where we stand with this? I wonder if we should simply use this to replace the current buggy behavior: fanning out to segments multiple times with sane sync points should fix our consistency issues, and if the performance is on par with the current information-sharing logic, it seems like a net win. Anything I can do to help here?
Re: [I] Incorrect use of fsync [lucene]
uschindler commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771580012

> There was no crash, that's the problem. You wrote a file, then opened it again for reading and it's corrupted, and there was no IO error reported.
>
> As I said, if you close a file without fsyncing, write-back errors can't be reported to the writing process. The IO error might be a network glitch on your SAN, which doesn't automatically crash the whole system.

This is a general problem and does not have anything to do with fsync. If this were really a problem, then every application, everywhere, would always need to call fsync on close. So it would need to be the default in the OS.

About temporary files: if a temporary file in the Lucene index is corrupted, then Lucene will bail out, which is something that could always happen. There are no guarantees: because Lucene also caches documents in memory before writing, it may or may not create a temporary file! In Lucene, if you want to persist the index, you need to commit. After commit, no temporary files are in use. If a temporary file was corrupt before, the commit fails.

So in short: no problem.
Re: [I] Incorrect use of fsync [lucene]
dweiss commented on issue #14334: URL: https://github.com/apache/lucene/issues/14334#issuecomment-2771582280 Ok, sorry but the scenario you're describing is insane to me. If something like this happens, I don't think it's Lucene's duty to try to correct it - it seems like the entire system is broken then. I don't think it makes sense to be so paranoid. There are some measures in Lucene to check for sanity (checksums on index files). I think this is sufficient.