Re: [PR] Fix HistogramCollector to not create zero-count buckets. [lucene]

2025-03-31 Thread via GitHub


jpountz merged PR #14421:
URL: https://github.com/apache/lucene/pull/14421


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix HistogramCollector to not create zero-count buckets. [lucene]

2025-03-31 Thread via GitHub


jpountz commented on PR #14421:
URL: https://github.com/apache/lucene/pull/14421#issuecomment-2766465560

   I backported to branch_10_2 since this is a bugfix cc @iverase 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] quick exit on filter query matching no docs when rewriting knn query [lucene]

2025-03-31 Thread via GitHub


jpountz merged PR #14418:
URL: https://github.com/apache/lucene/pull/14418


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Add more information to IOContext [lucene]

2025-03-31 Thread via GitHub


thecoop opened a new issue, #14422:
URL: https://github.com/apache/lucene/issues/14422

   ### Description
   
   Currently Lucene tends to use `ReadAdvice.RANDOM` quite liberally, in 
various hard-coded locations. We have seen that this causes some problems with 
recent kernel versions, such as #14408. There are also several other aspects 
about opening files - such as which file access implementation to use, how to 
map memory, etc, that are dependent on the context of opening such files, and 
cannot be determined solely by the codec that is used. For example, you may 
want to access a flat vector file differently if it’s being used for an 
exhaustive scan vs being used to access specific vectors.
   
   To solve both these issues, Directory implementations could have information 
on the context a particular file is being opened with, to help it decide how 
and with what options it should open the file. However, there is no such 
information provided to Directory classes in the IOContext, and there is no 
space to provide such information in an extensible way.
   
   So there needs to be a way to pass additional context-specific information 
in an IOContext, to help Directory implementations decide on the ReadAdvice to 
use, and how and with what options it should open a particular file. There are 
two main ways I can see of doing this:
   
   1. Make IOContext an interface. This allows custom implementations to be 
created, which can then be checked for specific types inside the custom 
Directory implementation to then cast and pull out the required information 
(prototype can be found 
[here](https://github.com/apache/lucene/commit/739f816b5492a505b1bc964214a6d2de60558d47))
   2. Add a `Map` or `Map` payload to IOContext 
allowing for arbitrary information to be included with an IOContext, that can 
then be checked by the Directory implementation
   
   As part of this, we could introduce some standard patterns, or some standard 
map keys for existing Lucene directories to replace the existing ReadAdvice 
logic, to provide more information on how files should be opened. A later piece 
of work could be to remove ReadAdvice from IOContext, and provide some other 
way Directory implementations could determine the access pattern that is likely 
to be used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] quick exit on filter query matching no docs when rewriting knn query [lucene]

2025-03-31 Thread via GitHub


bugmakerr commented on PR #14418:
URL: https://github.com/apache/lucene/pull/14418#issuecomment-2766274186

   > So in the case when the filter rewrites to MatchNoDocsQuery, 
`IndexSearcher#search` would now return immediately, while it may currently 
need to wait for tasks to be submitted (if the queue is not empty).
   
   Yes, you are right.
   
   > Can you add a CHANGES entry?
   
   Added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-03-31 Thread via GitHub


gf2121 commented on PR #14401:
URL: https://github.com/apache/lucene/pull/14401#issuecomment-2765918007

   The change of https://github.com/apache/lucene/pull/14421 is also included, 
which seems not expected?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Disable the query cache by default. [lucene]

2025-03-31 Thread via GitHub


msfroh commented on code in PR #14187:
URL: https://github.com/apache/lucene/pull/14187#discussion_r2021571472


##
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##
@@ -77,7 +77,8 @@
 public class IndexSearcher {
 
   static int maxClauseCount = 1024;
-  private static QueryCache DEFAULT_QUERY_CACHE;
+  // Caching is disabled by default.
+  private static QueryCache DEFAULT_QUERY_CACHE = null;
   private static QueryCachingPolicy DEFAULT_CACHING_POLICY = new 
UsageTrackingQueryCachingPolicy();

Review Comment:
   > In general, I'm in favor of this, would looking into it in a follow-up 
work for you?
   
   Sounds good to me, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] add Automaton.toMermaid() for emitting mermaid state charts [lucene]

2025-03-31 Thread via GitHub


github-actions[bot] commented on PR #14360:
URL: https://github.com/apache/lucene/pull/14360#issuecomment-2767713257

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Enable collectors to take advantage of pre-aggregated data. [lucene]

2025-03-31 Thread via GitHub


jpountz merged PR #14401:
URL: https://github.com/apache/lucene/pull/14401


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Adding profiling support for concurrent segment search [lucene]

2025-03-31 Thread via GitHub


jainankitk commented on PR #14413:
URL: https://github.com/apache/lucene/pull/14413#issuecomment-2767417910

   @jpountz - Can you provide your thoughts on above?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] PointInSetQuery clips segments by lower and upper [lucene]

2025-03-31 Thread via GitHub


gsmiller commented on code in PR #14268:
URL: https://github.com/apache/lucene/pull/14268#discussion_r2021381006


##
lucene/CHANGES.txt:
##
@@ -186,6 +186,8 @@ Optimizations
 
 * GITHUB#14272: Use DocIdSetIterator#range for continuous-id BKD leaves. (Guo 
Feng)
 
+* GITHUB#14268: PointInSetQuery clips segments by lower and upper (hanbj)

Review Comment:
   Let's move the changes entry to 10.3 since the 10.2 branch has already been 
cut and I don't think we need to squeeze this in with that release?
   
   Also, can we make this entry a little more descriptive? Maybe something like 
`PointInSetQuery optimization for the case when no segment docs can intersect 
with the query values`?



##
lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java:
##
@@ -248,6 +255,33 @@ public long cost() {
 }
   }
 
+  private boolean checkValidPointValues(PointValues values) throws 
IOException {

Review Comment:
   I'd prefer going back to having this logic inlined as it was. I don't think 
checking `values == null` is really part of validating the `PointValues`. And 
there's nothing else calling this, so I think it's a bit easier to read when 
inlined.



##
lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java:
##
@@ -248,6 +255,33 @@ public long cost() {
 }
   }
 
+  private boolean checkValidPointValues(PointValues values) throws 
IOException {

Review Comment:
   (But I do agree we should do the validation before the optimization you 
introduced)



##
lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java:
##
@@ -153,6 +162,21 @@ public ScorerSupplier scorerSupplier(LeafReaderContext 
context) throws IOExcepti
   return null;
 }
 
+if (values.getDocCount() == 0) {

Review Comment:
   Looking at the git annotations in `PointRangeQuery`, these checks were added 
in two different changes. I think it's likely this was just overlooked. I do 
not believe it's possible to have a non-null `PointValue` at this point that 
returns a zero doc count. (I also played around with this using some unit tests 
and a debugger and can confirm that behavior). All that said, I'm not strongly 
opposed to leaving the check in there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Optimize commit retention policy to maintain only the last 5 commits [lucene]

2025-03-31 Thread via GitHub


github-actions[bot] commented on PR #14325:
URL: https://github.com/apache/lucene/pull/14325#issuecomment-2767713463

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Add more information to IOContext [lucene]

2025-03-31 Thread via GitHub


jpountz commented on issue #14422:
URL: https://github.com/apache/lucene/issues/14422#issuecomment-2767398441

   @rmuir I'm curious if you can expand a bit more on what you have in mind, 
what you are describing sounds to me like how things are today where 
`ReadAdvice.RANDOM` is the context and the `MADV_RANDOM` flag to `madvise` is 
the implementation. So I'm sure I'm missing something.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Reduce the number of comparisons when lowerPoint is equal to upperPoint [lucene]

2025-03-31 Thread via GitHub


jainankitk commented on code in PR #14267:
URL: https://github.com/apache/lucene/pull/14267#discussion_r2021794022


##
lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
##
@@ -129,6 +141,16 @@ public final Weight createWeight(IndexSearcher searcher, 
ScoreMode scoreMode, fl
 
   private boolean matches(byte[] packedValue) {
 int offset = 0;
+
+if (equalValues) {
+  for (int dim = 0; dim < numDims; dim++, offset += bytesPerDim) {
+if (comparator.compare(packedValue, offset, lowerPoint, offset) != 
0) {
+  return false;
+}
+  }
+  return true;
+}

Review Comment:
   Given we know about `equalValues` being true/false while initializing the 
`PointRangeQuery`, I would have a separate weight object, instead of having 
this additional logic when `lowerPoint != upperPoint`. For example :
   
   ```
@Override
 public final Weight createWeight(IndexSearcher searcher, ScoreMode 
scoreMode, float boost)
 throws IOException {
   if (this.equalValues) {
 return new ConstantScoreWeight(this, boost) {}
   }
   // We don't use RandomAccessWeight here: it's no good to approximate 
with "match all docs".
   // This is an inverted structure and should be used in the first pass:
   return new ConstantScoreWeight(this, boost) {}
}
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] PointInSetQuery early exit on non-matching segments [lucene]

2025-03-31 Thread via GitHub


hanbj commented on code in PR #14268:
URL: https://github.com/apache/lucene/pull/14268#discussion_r2022087763


##
lucene/CHANGES.txt:
##
@@ -186,6 +186,8 @@ Optimizations
 
 * GITHUB#14272: Use DocIdSetIterator#range for continuous-id BKD leaves. (Guo 
Feng)
 
+* GITHUB#14268: PointInSetQuery clips segments by lower and upper (hanbj)

Review Comment:
   This has been modified.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]

2025-03-31 Thread via GitHub


jainankitk commented on code in PR #14397:
URL: https://github.com/apache/lucene/pull/14397#discussion_r2021750116


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java:
##
@@ -98,12 +98,17 @@ public void decompress(DataInput in, int originalLength, 
int offset, int length,
   final int blockLength = in.readVInt();
 
   final int numBlocks = readCompressedLengths(in, originalLength, 
dictLength, blockLength);
-
-  buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength);
   bytes.length = 0;
-  // Read the dictionary
-  if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
-throw new CorruptIndexException("Illegal dict length", in);
+  if (reused) {
+assert buffer.length >= dictLength + blockLength;
+in.skipBytes(compressedLengths[0]);
+  } else {
+// Read the dictionary
+buffer = ArrayUtil.growNoCopy(buffer, dictLength + blockLength);
+if (LZ4.decompress(in, dictLength, buffer, 0) != dictLength) {
+  throw new CorruptIndexException("Illegal dict length", in);
+}
+reused = true;

Review Comment:
   I am wondering if we should consider exposing metric (simple counter maybe) 
on how many times we could reuse, and how many times had to read from the disk? 
That would provide some useful insights on the usefulness of this change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] cache preset dict for LZ4WithPresetDictDecompressor [lucene]

2025-03-31 Thread via GitHub


jainankitk commented on code in PR #14397:
URL: https://github.com/apache/lucene/pull/14397#discussion_r2021748068


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/LZ4WithPresetDictCompressionMode.java:
##
@@ -144,10 +148,85 @@ public void decompress(DataInput in, int originalLength, 
int offset, int length,
   assert bytes.isValid();
 }
 
+@Override
+public void decompress(
+IndexInput in, int originalLength, int offset, int length, BytesRef 
bytes)
+throws IOException {
+  assert offset + length <= originalLength;
+
+  if (length == 0) {
+bytes.length = 0;
+return;
+  }
+
+  final int dictLength = in.readVInt();
+  final int blockLength = in.readVInt();
+
+  final int numBlocks = readCompressedLengths(in, originalLength, 
dictLength, blockLength);
+
+  buffer = ArrayUtil.grow(buffer, dictLength + blockLength);
+  long startPointer = in.getFilePointer();
+  bytes.length = 0;
+  if (cachedDictFilPointer == startPointer) {
+assert cachedDictLength == dictLength && dictEndFilePointer > 0;
+in.seek(dictEndFilePointer);

Review Comment:
   Yes, looks so much cleaner now. Thanks for the refactoring!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org