Re: [PR] Binary search all terms. [lucene]

2024-03-26 Thread via GitHub


vsop-479 commented on PR #13192:
URL: https://github.com/apache/lucene/pull/13192#issuecomment-2019536235

   Implemented binary search for terms in non-leaf blocks (`allEqual` and `unEqual`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Binary search all terms. [lucene]

2024-03-26 Thread via GitHub


vsop-479 commented on PR #13192:
URL: https://github.com/apache/lucene/pull/13192#issuecomment-2019849096

   @jpountz @mikemccand 
   Please take a look when you get a chance.





Re: [PR] Fix TestIndexWriter.testDeleteUnusedFiles' failure on Windows 11 [lucene]

2024-03-26 Thread via GitHub


vsop-479 commented on PR #13183:
URL: https://github.com/apache/lucene/pull/13183#issuecomment-2019860075

   You are right @vigyasharma. I will dig into them.





Re: [PR] Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-26 Thread via GitHub


uschindler commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2019919116

   I also removed the extra logging included during development from the main 
branch. In 9.x the log message was adapted to list both features together with 
the sysprops to disable them.





Re: [PR] Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-26 Thread via GitHub


ChrisHegarty commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2019923997

   > The test added by @ChrisHegarty sometimes fails on windows: It does not 
close the file it opened for random access testing, so the directory can't be 
deleted. Will fix this in a separate commit.
   
   Oops, sorry about this. Thankfully the fix was straightforward.





Re: [PR] Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-26 Thread via GitHub


uschindler commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2019973001

   > > The test added by @ChrisHegarty sometimes fails on windows: It does not 
close the file it opened for random access testing, so the directory can't be 
deleted. Will fix this in a separate commit.
   > 
   > Oops, sorry about this. Thankfully the fix was straightforward.
   
   About this test and the quite large 8 MB buffer (see 
https://github.com/apache/lucene/pull/13196/files#diff-ebee319f3691cdb1627e5e9b1dfbdd0c266b1da28ccf0b2a9218dee1d34ff2b7R103):
   
   Is this some limit inside the kernel that triggers something? For this test, 
we always call posix_madvise().
   
   Maybe there was a misunderstanding: we only skip madvise if the chunk size is 
too small (< 2 MiB), but by default the chunk size is 16 GiB, so except in tests 
we always call madvise. There's only one special case: if the file is zero 
length, we can't call it.





Re: [PR] Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-26 Thread via GitHub


ChrisHegarty commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2020022627

   I dunno what I was thinking; this is clearly not correct. I opened #13214 to 
fix the test. (Apologies for the stupid test issues!)





Re: [PR] Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-26 Thread via GitHub


jpountz commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2020031076

   @uschindler Should we open a separate issue for adding `fadvise` support to 
`NIOFSDirectory`?





Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-26 Thread via GitHub


jpountz closed issue #13194: Should we fold DirectIODirectory into FSDirectory?
URL: https://github.com/apache/lucene/issues/13194





Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-26 Thread via GitHub


jpountz commented on issue #13194:
URL: https://github.com/apache/lucene/issues/13194#issuecomment-2020035633

   Closing this issue as won't fix; I imagine that madvise/fadvise is 
considered a superior solution to direct I/O.





Re: [PR] Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-26 Thread via GitHub


uschindler commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2020036092

   Unfortunately fadvise is at the moment close to impossible. Reason: we have 
no file handle!
   
   Chances are good that we also get a Java-based fadvise some time in the 
future (e.g., through an OpenOption like with O_DIRECT).





Re: [PR] Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-26 Thread via GitHub


uschindler commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2020039697

   As discussed before, implementing fadvise for reading/writing files would 
require writing a full native I/O stack (an OutputStream for writing and a 
FileChannel for NIOFSDir). See https://bugs.openjdk.org/browse/JDK-8292771





Re: [PR] Add support for posix_madvise to Java 21 MMapDirectory [lucene]

2024-03-26 Thread via GitHub


uschindler commented on PR #13196:
URL: https://github.com/apache/lucene/pull/13196#issuecomment-2020051067

   Anyway, we can open an issue to track what's going on in the JDK (listing 
all relevant issue numbers like the one above).





Re: [PR] Avoid creating large buffer in TestMMapDirectory.testWithRandom [lucene]

2024-03-26 Thread via GitHub


ChrisHegarty merged PR #13214:
URL: https://github.com/apache/lucene/pull/13214





Re: [PR] Break point estimate when threshold exceeded [lucene]

2024-03-26 Thread via GitHub


jpountz commented on code in PR #13199:
URL: https://github.com/apache/lucene/pull/13199#discussion_r1538951789


##
lucene/core/src/java/org/apache/lucene/index/PointValues.java:
##
@@ -375,16 +375,23 @@ private void intersect(IntersectVisitor visitor, 
PointTree pointTree) throws IOE
   public final long estimatePointCount(IntersectVisitor visitor) {
 try {
   final PointTree pointTree = getPointTree();
-  final long count = estimatePointCount(visitor, pointTree);
+  final long count = estimatePointCount(visitor, pointTree, 
Long.MAX_VALUE);
   assert pointTree.moveToParent() == false;
   return count;
 } catch (IOException ioe) {
   throw new UncheckedIOException(ioe);
 }
   }
 
-  private long estimatePointCount(IntersectVisitor visitor, PointTree 
pointTree)
-  throws IOException {
+  /**
+   * Estimate the number of documents that would be matched by {@link 
#intersect} with the given
+   * {@link IntersectVisitor}. The estimation will terminate when the point 
count exceeds the upper
+   * bound.
+   *
+   * TODO: Would breadth-first traversal help estimation terminate earlier?
+   */
+  public static long estimatePointCount(

Review Comment:
   Nit: To make the contracts of these functions clearer, I'd rather make this 
function private, and then have another 
`isEstimatedPointCountGreaterThanOrEqualTo` public function (probably 
tagged with `@lucene.internal` so that we can evolve it as we want) that calls 
this private function?
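
   A minimal sketch of the suggested shape, using a toy count tree instead of 
Lucene's real `PointTree`/`IntersectVisitor` (the node type, method bodies, and 
recursion here are illustrative assumptions, not the actual `PointValues` code; 
only the public method name mirrors the review suggestion):

   ```java
   public class Main {
     // Toy stand-in for a PointTree node (hypothetical).
     public record Node(long count, Node[] children) {}

     // Private bounded estimate: stops descending once the bound is reached.
     private static long estimate(Node node, long upperBound) {
       if (node.children() == null || node.children().length == 0) {
         return node.count();
       }
       long cost = 0;
       for (Node child : node.children()) {
         cost += estimate(child, upperBound - cost);
         if (cost >= upperBound) break; // early termination, cf. this PR
       }
       return cost;
     }

     // Public boolean entry point; callers never see the raw bounded count.
     public static boolean isEstimatedPointCountGreaterThanOrEqualTo(
         Node root, long threshold) {
       return estimate(root, threshold) >= threshold;
     }

     public static void main(String[] args) {
       Node root = new Node(0, new Node[] {new Node(10, null), new Node(20, null)});
       if (!isEstimatedPointCountGreaterThanOrEqualTo(root, 25)) throw new AssertionError();
       if (isEstimatedPointCountGreaterThanOrEqualTo(root, 31)) throw new AssertionError();
     }
   }
   ```

   The boolean contract leaves the private method free to return any value >= 
the bound, which is what makes the early termination legal.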



##
lucene/core/src/java/org/apache/lucene/index/PointValues.java:
##
@@ -398,8 +405,8 @@ private long estimatePointCount(IntersectVisitor visitor, 
PointTree pointTree)
 if (pointTree.moveToChild()) {
   long cost = 0;
   do {
-cost += estimatePointCount(visitor, pointTree);
-  } while (pointTree.moveToSibling());
+cost += estimatePointCount(visitor, pointTree, upperBound - cost);
+  } while (cost <= upperBound && pointTree.moveToSibling());

Review Comment:
   I believe that we can stop counting if `cost == upperBound`?
   
   ```suggestion
 } while (cost < upperBound && pointTree.moveToSibling());
   ```



##
lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java:
##
@@ -301,6 +303,14 @@ public PointValues.Relation compare(byte[] minPackedValue, 
byte[] maxPackedValue
   updateSkipInterval(true);
 }
 
+private PointValues.PointTree pointTree() throws IOException {
+  if (pointTree == null) {
+pointTree = pointValues.getPointTree();
+  }
+  assert !pointTree.moveToParent();
+  return pointTree;
+}

Review Comment:
   What about pulling the point tree in the constructor instead of doing it 
lazily (for simplicity)?






Re: [PR] Convert IOContext, MergeInfo, and FlushInfo to record classes [lucene]

2024-03-26 Thread via GitHub


ChrisHegarty commented on PR #13205:
URL: https://github.com/apache/lucene/pull/13205#issuecomment-2020109140

   > Hi @ChrisHegarty,
   > can you have another look at the crazy randomAccess flag after I merged 
main into this. Especially the checks in the record's constructor should be 
checked again. I have no idea which flags are mutually exclusive. Maybe there 
are more combinations missing.
   
   The record constructor checks, while not pretty, seem right to me. This is a 
nice cleanup and somewhat easier to reason about.





Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-26 Thread via GitHub


uschindler commented on PR #13206:
URL: https://github.com/apache/lucene/pull/13206#issuecomment-2020164536

   > > The bigger problem is when the file is still open in an NRTReader and 
gets deleted.
   > 
   > Hmm does it actually happen? I thought index files were ref-counted so 
that files would only get deleted when the NRT reader gets closed? This is what 
`StandardDirectoryReader#doClose` suggests.
   
   Ah, you're right. This is to work around Windows issues.





Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-26 Thread via GitHub


uschindler commented on PR #13206:
URL: https://github.com/apache/lucene/pull/13206#issuecomment-2020165033

   So I think we are fine then. +1 to merge





Re: [PR] Convert IOContext, MergeInfo, and FlushInfo to record classes [lucene]

2024-03-26 Thread via GitHub


uschindler merged PR #13205:
URL: https://github.com/apache/lucene/pull/13205





Re: [PR] Break point estimate when threshold exceeded [lucene]

2024-03-26 Thread via GitHub


gf2121 merged PR #13199:
URL: https://github.com/apache/lucene/pull/13199





Re: [PR] Break point estimate when threshold exceeded [lucene]

2024-03-26 Thread via GitHub


gf2121 commented on PR #13199:
URL: https://github.com/apache/lucene/pull/13199#issuecomment-2020191817

   Thanks for review @jpountz !





[PR] Add support for Github issue numbers in Markdown converter (e.g., MIGRATE.md file) [lucene]

2024-03-26 Thread via GitHub


uschindler opened a new pull request, #13215:
URL: https://github.com/apache/lucene/pull/13215

   Our `MIGRATE.md` file in the documentation has many GitHub links now, but 
they are not hotlinked like JIRA issue numbers. This adds support for 
`GITHUB#xxx` and `GH-xxx` numbers like in CHANGES.txt (although a bit 
simpler).
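
   The kind of rewrite described can be sketched with a small regex pass 
(hedged: the helper name, regex, and URL template below are my assumptions for 
illustration, not the actual converter code in the Lucene build):

   ```java
   import java.util.regex.Pattern;

   public class Main {
     // Hotlink GITHUB#xxx / GH-xxx references in Markdown text.
     public static String linkIssues(String markdown) {
       Pattern p = Pattern.compile("\\b(?:GITHUB#|GH-)(\\d+)\\b");
       // replaceAll(Function) treats the returned string literally.
       return p.matcher(markdown)
           .replaceAll(m -> "[GITHUB#" + m.group(1)
               + "](https://github.com/apache/lucene/issues/" + m.group(1) + ")");
     }

     public static void main(String[] args) {
       String out = linkIssues("See GH-13215 and GITHUB#13216.");
       if (!out.contains("(https://github.com/apache/lucene/issues/13215)"))
         throw new AssertionError(out);
     }
   }
   ```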





Re: [I] StackOverflow when RegExp encounters a very large string [LUCENE-10501] [lucene]

2024-03-26 Thread via GitHub


reschke commented on issue #11537:
URL: https://github.com/apache/lucene/issues/11537#issuecomment-2020222631

   FWIW: Apache Jackrabbit Oak is stuck with Lucene 4.7.x for the time being. 
That said, backporting the change to the 4.7.2 source seems to be trivial; see 
.





Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-26 Thread via GitHub


mikemccand commented on code in PR #13206:
URL: https://github.com/apache/lucene/pull/13206#discussion_r1539130332


##
lucene/core/src/java/org/apache/lucene/store/NRTCachingDirectory.java:
##
@@ -73,6 +73,7 @@ public class NRTCachingDirectory extends FilterDirectory 
implements Accountable
   new SingleInstanceLockFactory(),
   ByteBuffersDataOutput::new,
   (fileName, content) -> {
+if (!isCachedFile(fileName)) return null;

Review Comment:
   Could we split this onto three lines and add `{` ... `}` around the body?
   
   And replace the `!` with `== false` to reduce chance of future accidental 
refactoring bugs?



##
lucene/core/src/java/org/apache/lucene/store/NRTCachingDirectory.java:
##
@@ -73,6 +73,7 @@ public class NRTCachingDirectory extends FilterDirectory 
implements Accountable
   new SingleInstanceLockFactory(),
   ByteBuffersDataOutput::new,
   (fileName, content) -> {
+if (!isCachedFile(fileName)) return null;

Review Comment:
   Hmm, was this a separate bug?  We were including this file's size in 
`cacheSize` even if it is not cached?
   
   Though, why would this lambda be called on a `fileName` that is not going to 
be cached?  I thought the `if` in `NRTCachingDirectory.createOutput` would only 
ask the `cacheDirectory` to create the output if the `fileName` was to be 
cached?






Re: [PR] Add support for Github issue numbers in Markdown converter (e.g., MIGRATE.md file) [lucene]

2024-03-26 Thread via GitHub


uschindler merged PR #13215:
URL: https://github.com/apache/lucene/pull/13215





[PR] Clean up variable-gaps terms format. [lucene]

2024-03-26 Thread via GitHub


jpountz opened a new pull request, #13216:
URL: https://github.com/apache/lucene/pull/13216

   The variable-gaps terms format uses the legacy storage layout of storing 
metadata at the end of the index file, and storing the start pointer of the 
metadata as the last 8 bytes of the index files (just before the footer). This 
forces an awkward access pattern at open time when we first need to seek 
towards the end to check that a footer is present, then seek some more bytes 
backwards to read metadata, and finally read the content of the index that sits 
before metadata.
   
   To fix this, metadata and index data are now split into different files. 
This way, both files have a clean sequential and read-once access pattern, and 
can take advantage of the `ChecksumIndexInput` abstraction for checksum 
validation.
   
   This further helps clean up `IOContext` by removing the ability to set 
`readOnce` to `true` on an existing `IOContext`.
   





Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-26 Thread via GitHub


jpountz commented on code in PR #13206:
URL: https://github.com/apache/lucene/pull/13206#discussion_r1539187790


##
lucene/core/src/java/org/apache/lucene/store/NRTCachingDirectory.java:
##
@@ -73,6 +73,7 @@ public class NRTCachingDirectory extends FilterDirectory 
implements Accountable
   new SingleInstanceLockFactory(),
   ByteBuffersDataOutput::new,
   (fileName, content) -> {
+if (!isCachedFile(fileName)) return null;

Review Comment:
   This is trying to cover the case when a file gets deleted while it's still 
open for writing.






Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-26 Thread via GitHub


mikemccand commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-2020383140

   I like this idea!  It seems like it'd especially help primary key lookup 
against fixed length IDs like UUID?
   
   Hmm, the QPS in the `luceneutil` runs are way too high (1000s of QPS) to be 
trustworthy?  Was this on `wikimediumall`?





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-26 Thread via GitHub


mikemccand commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1539192140


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -523,7 +526,9 @@ public void scanToSubBlock(long subFP) {
 
   // NOTE: sets startBytePos/suffix as a side effect
   public SeekStatus scanToTerm(BytesRef target, boolean exactOnly) throws 
IOException {
-return isLeafBlock ? scanToTermLeaf(target, exactOnly) : 
scanToTermNonLeaf(target, exactOnly);
+return isLeafBlock

Review Comment:
   I know this was a pre-existing ternary :)   But now we are embedding another 
confusing ternary inside the first one -- could we instead spell all of this 
out as verbose `if`?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -568,8 +573,6 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean 
exactOnly) throws IOEx
 
 assert prefixMatches(target);
 
-// TODO: binary search when all terms have the same length, which is 
common for ID fields,

Review Comment:
   Aha!  Another `TODO` gone, thank you @vsop-479!



##
lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99PostingsFormat.java:
##
@@ -143,4 +141,13 @@ private void doTestImpactSerialization(List 
impacts) throws IOException
   }
 }
   }
+
+  @Override
+  protected void subCheckBinarySearch(TermsEnum termsEnum) throws Exception {
+// 10004a matched block's entries: [11, 13, ..., 100049].
+// if target greater than the last entry of the matched block,
+// termsEnum.term should be the last entry.
+assertFalse(termsEnum.seekExact(new BytesRef("10004a")));
+assertEquals(termsEnum.term(), new BytesRef("100049"));

Review Comment:
   Hmm, when `seekExact` returns `false`, the `TermsEnum` is unpositioned and 
calling `.term()` (and other methods e.g. `.postings()`) is not allowed (the 
behavior is undefined -- it could throw an exception or corrupt its internal 
state or so) ... why are we testing that here :)
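
   A toy illustration of the contract, using a `TreeSet` as a stand-in 
dictionary rather than the real `TermsEnum` API: after a failed exact lookup 
nothing is known about the position, while a ceiling-style lookup (the analogue 
of `seekCeil`) is deterministic. The ID range below mirrors the test's comment:

   ```java
   import java.util.TreeSet;

   public class Main {
     // Stand-in "terms dictionary" for a leaf block of IDs 100040..100049.
     public static final TreeSet<String> TERMS = new TreeSet<>();
     static {
       for (int i = 100040; i <= 100049; i++) TERMS.add(Integer.toString(i));
     }

     public static void main(String[] args) {
       // Exact-match miss: says nothing about where an enum would be positioned.
       if (TERMS.contains("10004a")) throw new AssertionError();
       // Ceiling-style lookup is deterministic: "10004a" sorts after every
       // entry ('a' > '9'), so ceiling() is null -- analogous to SeekStatus.END.
       if (TERMS.ceiling("10004a") != null) throw new AssertionError();
       // A target inside the block lands on the next term: "100044x" -> "100045".
       if (!"100045".equals(TERMS.ceiling("100044x"))) throw new AssertionError();
     }
   }
   ```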






[PR] A new idea for numeric dynamic pruning [lucene]

2024-03-26 Thread via GitHub


gf2121 opened a new pull request, #13217:
URL: https://github.com/apache/lucene/pull/13217

   This PR proposes a new way to do numeric dynamic pruning with following 
changes:
   
   * Instead of sampling and estimating the point count to judge whether to 
build the competitive iterator, this patch proposes to find the threshold 
value. That is, we find the value that is 'N docs away from' the top value, 
relying on [the fact that the top value should be final in 
LeafComparators](https://github.com/apache/lucene/blob/99b9636fd8c383c80d06c8815cfdb49b1b77dcdb/lucene/core/src/java/org/apache/lucene/search/FieldComparator.java#L75).
 
   
   
![image](https://github.com/apache/lucene/assets/52390227/b7125122-194a-4167-a821-57c47479915d)
   
   * Instead of building and rebuilding the competitive iterator when it gets 
smaller, this patch proposes to build the competitive iterator as a disjunction 
of small DISIs. Each small DISI maintains its most competitive value and is 
discarded when that value is no longer competitive, like what we did in 
`TermOrdValComparator`. This helps us intersect the tree only once and update 
the competitive iterator more frequently.
   
    Some minor points
   
   * For simplification, I tweaked the byte-encoded values into a comparable 
long value, e.g. `maxValueAsBytes` -> `maxValueAsLong`.
   
   * This PR still works with the stale method `public static long 
estimatePointCount(IntersectVisitor visitor, PointTree pointTree, long 
upperBound) throws IOException`; it seems a bit challenging to work with the 
new boolean API. I'll dig more, and this is why this is a draft.
   
   Here is a result based on wikimedium10m (baseline contains 
https://github.com/apache/lucene/pull/13199)
   
   ```
   Task                    QPS baseline  StdDev    QPS my_modified_version  StdDev    Pct diff            p-value
   TermDTSort              176.84        (5.0%)    303.31                   (6.3%)    71.5% (57% - 87%)   0.000
   HighTermDayOfYearSort   454.72        (3.2%)    791.09                   (7.9%)    74.0% (60% - 87%)   0.000
   ```
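
   The disjunction-of-small-DISIs idea in the second bullet can be sketched 
standalone with plain arrays. This is a hypothetical illustration, not the 
PR's actual classes; all names here (`CompetitiveBuckets`, `Bucket`, 
`updateBottom`) are invented for the sketch:

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class CompetitiveBuckets {
     // One bucket of doc IDs that share a most-competitive value.
     record Bucket(long mostCompetitiveValue, int[] docs) {}

     private final List<Bucket> buckets = new ArrayList<>();

     void add(long mostCompetitiveValue, int... docs) {
       buckets.add(new Bucket(mostCompetitiveValue, docs));
     }

     // With an ascending sort, smaller values are more competitive: discard any
     // bucket whose best value no longer beats the current bottom of the queue.
     void updateBottom(long bottomValue) {
       buckets.removeIf(b -> b.mostCompetitiveValue() >= bottomValue);
     }

     int remainingDocCount() {
       return buckets.stream().mapToInt(b -> b.docs().length).sum();
     }

     public static void main(String[] args) {
       CompetitiveBuckets c = new CompetitiveBuckets();
       c.add(10, 1, 5, 9);
       c.add(50, 2, 7);
       c.add(90, 3);
       c.updateBottom(60); // the bucket with value 90 is discarded
       System.out.println(c.remainingDocCount()); // prints 5
     }
   }
   ```

   The point of the sketch is that discarding whole buckets only shrinks the 
disjunction; the point tree never needs to be intersected again.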
   
   
   





Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-26 Thread via GitHub


mikemccand commented on code in PR #13206:
URL: https://github.com/apache/lucene/pull/13206#discussion_r1539235911


##
lucene/core/src/java/org/apache/lucene/store/NRTCachingDirectory.java:
##
@@ -73,6 +73,7 @@ public class NRTCachingDirectory extends FilterDirectory 
implements Accountable
   new SingleInstanceLockFactory(),
   ByteBuffersDataOutput::new,
   (fileName, content) -> {
+if (!isCachedFile(fileName)) return null;

Review Comment:
   Ahh thank you for the clarification @jpountz.  I am playing catchup on this 
code ;)  I think I understand it now.
   
   So, this is the `onClose` lambda that is called (when the `IndexOutput` is 
closed), where we normally increase the `cacheSize` (we do not count the bytes 
as they are being appended, live, to the file, but rather only when the 
`IndexOutput` is closed).
   
   And we are (newly, with this PR) checking whether it has already been 
deleted before it was closed and avoiding adding to `cacheSize` in this case.  
Indeed Lucene should never do this, except maybe in exceptional situations (and 
even then, hopefully not: Lucene should first close any still-open 
`IndexOutput`s on exception before delete is tried, I think).  +1 to be 
defensive anyways.
   
   Also this is different from the "delete while file is still open for 
reading" case...






Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-26 Thread via GitHub


mikemccand commented on PR #13206:
URL: https://github.com/apache/lucene/pull/13206#issuecomment-2020444316

   > > The bigger problem is when the file is still open in an NRTReader and 
gets deleted.
   > 
   > Hmm does it actually happen? I thought index files were ref-counted so 
that files would only get deleted when the NRT reader gets closed? This is what 
`StandardDirectoryReader#doClose` suggests.
   
   NRT readers pulled from `IndexWriter` indeed do this ref counting so delete 
is not attempted until all readers still using that segment/file are closed.
   
   But in the NRT segment replication case, I think this may still happen?  
I.e. segments are replicated out to a node that has an open `DirectoryReader`.  
That reader is refreshed to switch to the new segments, but the old reader is 
still alive to finish any in-flight queries or stay open for any long-term 
leases for deep scrolling purposes.  Oh, nevermind, even in the NRT segment 
replication case we have `ReplicaFileDeleter` to do the right ref counting.  So 
yeah I think we don't need to worry about this case either!  Phew, complicated.





Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-26 Thread via GitHub


mikemccand commented on code in PR #13206:
URL: https://github.com/apache/lucene/pull/13206#discussion_r1539244396


##
lucene/core/src/java/org/apache/lucene/store/NRTCachingDirectory.java:
##
@@ -118,7 +119,9 @@ public synchronized void deleteFile(String name) throws 
IOException {
   System.out.println("nrtdir.deleteFile name=" + name);
 }
 if (cacheDirectory.fileExists(name)) {
+  long size = cacheDirectory.fileLength(name);
   cacheDirectory.deleteFile(name);
+  cacheSize.addAndGet(-size);

Review Comment:
   Could we hold onto the returned (get'd) value and assert that it never drops 
below zero?






Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-26 Thread via GitHub


mikemccand commented on code in PR #13206:
URL: https://github.com/apache/lucene/pull/13206#discussion_r1539246356


##
lucene/core/src/java/org/apache/lucene/store/NRTCachingDirectory.java:
##
@@ -73,6 +73,7 @@ public class NRTCachingDirectory extends FilterDirectory 
implements Accountable
   new SingleInstanceLockFactory(),
   ByteBuffersDataOutput::new,
   (fileName, content) -> {
+if (!isCachedFile(fileName)) return null;

Review Comment:
   In addition to the small style comments up above, could you also please add 
a comment explaining that this check is defensive/paranoia to handle deleting a 
file that is still open for writing, which Lucene should never do?






Re: [I] MIGRATE.md doesn't mention the transition from Collector to CollectorManager [lucene]

2024-03-26 Thread via GitHub


mikemccand commented on issue #13218:
URL: https://github.com/apache/lucene/issues/13218#issuecomment-2020482796

   Hmm is there some way to mark issues as blocker for release?  I want to make 
sure we address this for 10.0.0.





Re: [PR] Clean up variable-gaps terms format. [lucene]

2024-03-26 Thread via GitHub


jpountz commented on PR #13216:
URL: https://github.com/apache/lucene/pull/13216#issuecomment-2020509119

   > we now use ChecksumIndexInput as the file is fully readonce without 
seeking. Because we removed the IOContext from directory's open method (as it 
is hardcoded to readOnce), we do not need to change the existing IOContext, so 
the modifier method can go away
   
   This is correct.
   
   > the file is then read in whole and closed, the FST is on-heap.
   
   Correct as well.
   
   > We do not have backwards compatibility here as it is not the default 
format, so we can change it in main branch (10.x). This won't be backported, 
correct?
   
   I don't believe that we attempt to not break non-default formats in minor 
releases, I have broken a few of them in the past in a minor. That said, I'm 
fine with not backporting, we could still have the `IOContext` ctor removed on 
9.x by hardcoding `IOContext.READONCE` when opening the `IndexInput`.
   
   For reference, I bumped the version number to make sure users get an 
`IndexFormatTooOldException` rather than a cryptic corruption exception.





[PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


jpountz opened a new pull request, #13219:
URL: https://github.com/apache/lucene/pull/13219

   This replaces the `load`, `randomAccess` and `readOnce` flags with a 
`ReadAdvice` enum, whose values are aligned with the values accepted by 
(f|m)advise.
   
   Closes #13211





Re: [PR] Fix TestTaxonomyFacetValueSource.testRandom [lucene]

2024-03-26 Thread via GitHub


iamsanjay commented on PR #13198:
URL: https://github.com/apache/lucene/pull/13198#issuecomment-2020515624

   @dweiss Thanks for the clarification. It does change the seed, and hence I 
was not able to reproduce the failure case. To increase the likelihood, I 
switched to choosing from only two values (0, 1) instead of 
`random.nextFloat()`, and was able to reproduce it without any hassle. So far 
the issue is with the test code, which produces the expected aggregated facet 
results including only positive values.
   
   @stefanvodita You are right! #12966 would consider the non-positive values 
as well, and the change that I introduced would start failing. 
   
   So I reverted the change I made for aggregation and ran the boundary test 
case (including negative 1), and it failed again. Try including this new 
boundary case and you will see it.
   
   ```
   public void testBoundaryCases() throws Exception {
     final float[] boundaryCases = new float[] {-1, 0, 1};
     doTestDifferentValueRanges(() -> boundaryCases[random().nextInt(boundaryCases.length)]);
   }
   ```
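
   The trick of narrowing the random value domain to make a rare failure 
reproducible can be sketched on its own. This is a hypothetical illustration, 
not the test's actual helpers:

   ```java
   import java.util.Random;
   import java.util.function.Supplier;

   public class ValueDomains {
     // Draw test values from a tiny fixed domain instead of the full float range,
     // so rare interactions (exact zeros, negative values) show up in almost
     // every run rather than once in a blue moon.
     static Supplier<Float> fromDomain(Random random, float... domain) {
       return () -> domain[random.nextInt(domain.length)];
     }

     public static void main(String[] args) {
       Supplier<Float> source = fromDomain(new Random(42), -1f, 0f, 1f);
       for (int i = 0; i < 5; i++) {
         System.out.println(source.get());
       }
     }
   }
   ```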
   






[PR] Mark TimeLimitingCollector as deprecated [lucene]

2024-03-26 Thread via GitHub


kaivalnp opened a new pull request, #13220:
URL: https://github.com/apache/lucene/pull/13220

   ### Description
   
   Follow-up to 
https://github.com/apache/lucene/pull/13202#issuecomment-2016947960
   
   
[`TimeLimitingCollector`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TimeLimitingCollector.java)
 only seems to be used from comments and tests. Can we mark it as `@Deprecated` 
and remove from `main` in another PR?





Re: [PR] Add timeout support to AbstractKnnVectorQuery [lucene]

2024-03-26 Thread via GitHub


kaivalnp commented on PR #13202:
URL: https://github.com/apache/lucene/pull/13202#issuecomment-2020534763

   > Separately, should we deprecate `TimeLimitingCollector` ? It doesn't use 
`QueryTimeout` and I don't think we're using it anywhere.
   
   Created #13220 to discuss this





Re: [PR] Add Facets#getBulkSpecificValues method [lucene]

2024-03-26 Thread via GitHub


gsmiller commented on code in PR #12862:
URL: https://github.com/apache/lucene/pull/12862#discussion_r1539336428


##
lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java:
##
@@ -568,6 +568,12 @@ public Number getSpecificValue(String dim, String... path) 
{
 throw new UnsupportedOperationException();
   }
 
+  @Override
+  public Number[] getBulkSpecificValues(FacetLabel[] facetLabels) {
+// TODO: should we impl this?

Review Comment:
   @epotyom OK I'm fine with that. Thanks for considering!






Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-26 Thread via GitHub


jfboeuf commented on code in PR #13206:
URL: https://github.com/apache/lucene/pull/13206#discussion_r1539337432


##
lucene/core/src/java/org/apache/lucene/store/NRTCachingDirectory.java:
##
@@ -118,7 +119,9 @@ public synchronized void deleteFile(String name) throws 
IOException {
   System.out.println("nrtdir.deleteFile name=" + name);
 }
 if (cacheDirectory.fileExists(name)) {
+  long size = cacheDirectory.fileLength(name);
   cacheDirectory.deleteFile(name);
+  cacheSize.addAndGet(-size);

Review Comment:
   Done






Re: [PR] Subtract deleted file size from the cache size of NRTCachingDirectory. [lucene]

2024-03-26 Thread via GitHub


jfboeuf commented on code in PR #13206:
URL: https://github.com/apache/lucene/pull/13206#discussion_r1539336230


##
lucene/core/src/java/org/apache/lucene/store/NRTCachingDirectory.java:
##
@@ -73,6 +73,7 @@ public class NRTCachingDirectory extends FilterDirectory 
implements Accountable
   new SingleInstanceLockFactory(),
   ByteBuffersDataOutput::new,
   (fileName, content) -> {
+if (!isCachedFile(fileName)) return null;

Review Comment:
   Done!






Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


jpountz commented on code in PR #13219:
URL: https://github.com/apache/lucene/pull/13219#discussion_r1539310893


##
lucene/codecs/src/java/org/apache/lucene/codecs/blockterms/VariableGapTermsIndexReader.java:
##
@@ -53,7 +54,7 @@ public VariableGapTermsIndexReader(SegmentReadState state) 
throws IOException {
 state.segmentInfo.name,
 state.segmentSuffix,
 VariableGapTermsIndexWriter.TERMS_INDEX_EXTENSION);
-final IndexInput in = state.directory.openInput(fileName, 
state.context.toReadOnce());
+final IndexInput in = state.directory.openInput(fileName, 
IOContext.READONCE);

Review Comment:
   A change to this class wouldn't be necessary anymore once #13216 is merged.



##
lucene/core/src/java/org/apache/lucene/store/IOContext.java:
##
@@ -54,58 +43,74 @@ public enum Context {
 DEFAULT
   };
 
-  public static final IOContext DEFAULT = new IOContext(Context.DEFAULT);
+  /** Advice regarding the read access pattern. */
+  public enum ReadAdvice {
+/**
+ * Normal behavior. Data is expected to be read mostly sequentially. The 
system is expected to
+ * cache the hottest pages.
+ */
+NORMAL,
+/**
+ * Data is expected to be read in a random-access fashion, either by {@link
+ * IndexInput#seek(long) seeking} often and reading relatively short 
sequences of bytes at once,
+ * or by reading data through the {@link RandomAccessInput} abstraction in 
random order.
+ */
+RANDOM,
+/** Data is expected to be read sequentially with very little seeking at 
most. */
+SEQUENTIAL,
+/**
+ * Data is treated as random-access memory in practice. {@link Directory} 
implementations may
+ * explicitly load the content of the file in memory, or provide hints to 
the system so that it
+ * loads the content of the file into the page cache at open time. This 
should only be used on
+ * very small files that can be expected to fit in RAM with very high 
confidence.
+ */
+LOAD
+  }
+
+  public static final IOContext DEFAULT =
+  new IOContext(Context.DEFAULT, null, null, ReadAdvice.NORMAL);
 
-  public static final IOContext READONCE = new IOContext(true, false, false);
+  public static final IOContext READONCE = new 
IOContext(ReadAdvice.SEQUENTIAL);
 
-  public static final IOContext READ = new IOContext(false, false, false);
+  public static final IOContext READ = new IOContext(ReadAdvice.NORMAL);
 
-  public static final IOContext LOAD = new IOContext(false, true, true);
+  public static final IOContext LOAD = new IOContext(ReadAdvice.LOAD);
 
-  public static final IOContext RANDOM = new IOContext(false, false, true);
+  public static final IOContext RANDOM = new IOContext(ReadAdvice.RANDOM);

Review Comment:
   FWIW, I kept these constant names as-is, rather than aligning them with the 
`ReadAdvice` constant names, on purpose: they convey stronger expectations.



##
lucene/core/src/java/org/apache/lucene/store/IOContext.java:
##
@@ -54,58 +43,74 @@ public enum Context {
 DEFAULT
   };
 
-  public static final IOContext DEFAULT = new IOContext(Context.DEFAULT);
+  /** Advice regarding the read access pattern. */
+  public enum ReadAdvice {
+/**
+ * Normal behavior. Data is expected to be read mostly sequentially. The 
system is expected to
+ * cache the hottest pages.
+ */
+NORMAL,
+/**
+ * Data is expected to be read in a random-access fashion, either by {@link
+ * IndexInput#seek(long) seeking} often and reading relatively short 
sequences of bytes at once,
+ * or by reading data through the {@link RandomAccessInput} abstraction in 
random order.
+ */
+RANDOM,
+/** Data is expected to be read sequentially with very little seeking at 
most. */
+SEQUENTIAL,
+/**
+ * Data is treated as random-access memory in practice. {@link Directory} 
implementations may
+ * explicitly load the content of the file in memory, or provide hints to 
the system so that it
+ * loads the content of the file into the page cache at open time. This 
should only be used on
+ * very small files that can be expected to fit in RAM with very high 
confidence.
+ */
+LOAD

Review Comment:
   I wonder if there's a better name for this that is more aligned with other 
constant names. `RANDOM_ACCESS_MEMORY`?






Re: [PR] Add Facets#getBulkSpecificValues method [lucene]

2024-03-26 Thread via GitHub


gsmiller commented on code in PR #12862:
URL: https://github.com/apache/lucene/pull/12862#discussion_r1539340494


##
lucene/facet/src/java/org/apache/lucene/facet/Facets.java:
##
@@ -58,6 +58,9 @@ public abstract FacetResult getTopChildren(int topN, String 
dim, String... path)
   /**
* Return the counts or values for specific paths. Returns {@link 
#MISSING_SPECIFIC_VALUE} if
* corresponding path doesn't exist, else the count.
+   *
+   * {@link #getSpecificValue} for each path should produce similar 
results, but the bulk method

Review Comment:
   Can we make this stronger? It should produce identical results right (not 
"similar")? Or maybe that's not true?



##
lucene/CHANGES.txt:
##
@@ -87,6 +87,10 @@ API Changes
 * GITHUB#13146, GITHUB#13148: Remove ByteBufferIndexInput and only use 
MemorySegment APIs
   for MMapDirectory.  (Uwe Schindler)
 
+* GITHUB#12919: Add Facets#getBulkSpecificValues method to get specific values 
for multiple FacetLabels at once. (Egor Potemkin)
+
+* GITHUB#12919: Add Facets#getBulkSpecificValues method to get specific values 
for multiple FacetLabels at once. (Egor Potemkin)

Review Comment:
   I know it's a big change but we probably don't need double changes entries ;)






Re: [PR] Reduce some unnecessary ArrayUtil#grow calls [lucene]

2024-03-26 Thread via GitHub


easyice closed pull request #13171: Reduce some unnecessary ArrayUtil#grow calls
URL: https://github.com/apache/lucene/pull/13171





Re: [PR] Get better cost estimate on MultiTermQuery over few terms [lucene]

2024-03-26 Thread via GitHub


rquesada-tibco commented on code in PR #13201:
URL: https://github.com/apache/lucene/pull/13201#discussion_r1539388204


##
lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java:
##
@@ -292,7 +292,21 @@ public long cost() {
   };
 }
 
-private static long estimateCost(Terms terms, long queryTermsCount) throws 
IOException {
+private static final int MAX_TERMS_TO_COUNT = 128;
+
+private long estimateCost(Terms terms, long queryTermsCount) throws 
IOException {
+
+  // Try to explicitly compute the cost, but only if the term count is 
sufficiently small.
+  if (queryTermsCount < MAX_TERMS_TO_COUNT) {

Review Comment:
   If `queryTermsCount` is unknown (i.e. -1), we'll also enter here; I guess 
this is intended, right?
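
   For illustration, the -1 sentinel does satisfy the comparison under review. 
A tiny standalone check (`MAX_TERMS_TO_COUNT` is copied from the diff above; 
the class and method names here are hypothetical):

   ```java
   public class CostEstimateDemo {
     // Same threshold as in the diff above.
     static final int MAX_TERMS_TO_COUNT = 128;

     // Mirrors the condition under review: does this term count take the
     // explicit-counting branch?
     static boolean takesExplicitBranch(long queryTermsCount) {
       return queryTermsCount < MAX_TERMS_TO_COUNT;
     }

     public static void main(String[] args) {
       System.out.println(takesExplicitBranch(-1));  // unknown count also enters the branch
       System.out.println(takesExplicitBranch(127));
       System.out.println(takesExplicitBranch(128));
     }
   }
   ```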









Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


uschindler commented on code in PR #13219:
URL: https://github.com/apache/lucene/pull/13219#discussion_r1539364622


##
lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java:
##
@@ -137,17 +136,11 @@ public void madvise(MemorySegment segment, IOContext 
context) throws IOException
   }
 
   private Integer mapIOContext(IOContext ctx) {
-// Merging always wins and implies sequential access, because kernel is 
advised to free pages
-// after use:
-if (ctx.context() == Context.MERGE) {
-  return POSIX_MADV_SEQUENTIAL;
-}
-if (ctx.randomAccess()) {
-  return POSIX_MADV_RANDOM;
-}
-if (ctx.readOnce()) {
-  return POSIX_MADV_SEQUENTIAL;
-}
-return null;
+return switch (ctx.readAdvice()) {

Review Comment:
   I think we can remove the context from the signature and change it to 
`madvise(MemorySegment, ReadAdvice)`. `MemorySegmentIndexInputProvider` would 
just pass `context.readAdvice()` to `madvise()` then.



##
lucene/core/src/java/org/apache/lucene/store/IOContext.java:
##
@@ -54,58 +43,74 @@ public enum Context {
 DEFAULT
   };
 
-  public static final IOContext DEFAULT = new IOContext(Context.DEFAULT);
+  /** Advice regarding the read access pattern. */
+  public enum ReadAdvice {
+/**
+ * Normal behavior. Data is expected to be read mostly sequentially. The 
system is expected to
+ * cache the hottest pages.
+ */
+NORMAL,
+/**
+ * Data is expected to be read in a random-access fashion, either by {@link
+ * IndexInput#seek(long) seeking} often and reading relatively short 
sequences of bytes at once,
+ * or by reading data through the {@link RandomAccessInput} abstraction in 
random order.
+ */
+RANDOM,
+/** Data is expected to be read sequentially with very little seeking at 
most. */
+SEQUENTIAL,
+/**
+ * Data is treated as random-access memory in practice. {@link Directory} 
implementations may
+ * explicitly load the content of the file in memory, or provide hints to 
the system so that it
+ * loads the content of the file into the page cache at open time. This 
should only be used on
+ * very small files that can be expected to fit in RAM with very high 
confidence.
+ */
+LOAD
+  }
+
+  public static final IOContext DEFAULT =
+  new IOContext(Context.DEFAULT, null, null, ReadAdvice.NORMAL);
 
-  public static final IOContext READONCE = new IOContext(true, false, false);
+  public static final IOContext READONCE = new 
IOContext(ReadAdvice.SEQUENTIAL);
 
-  public static final IOContext READ = new IOContext(false, false, false);
+  public static final IOContext READ = new IOContext(ReadAdvice.NORMAL);
 
-  public static final IOContext LOAD = new IOContext(false, true, true);
+  public static final IOContext LOAD = new IOContext(ReadAdvice.LOAD);
 
-  public static final IOContext RANDOM = new IOContext(false, false, true);
+  public static final IOContext RANDOM = new IOContext(ReadAdvice.RANDOM);
 
   @SuppressWarnings("incomplete-switch")
   public IOContext {
+Objects.requireNonNull(context, "context must not be null");
+Objects.requireNonNull(readAdvice, "readAdvice must not be null");
 switch (context) {
   case MERGE -> Objects.requireNonNull(
   mergeInfo, "mergeInfo must not be null if context is MERGE");
   case FLUSH -> Objects.requireNonNull(
   flushInfo, "flushInfo must not be null if context is FLUSH");
 }
-if (load && readOnce) {
-  throw new IllegalArgumentException("load and readOnce are mutually 
exclusive");
-}
-if (readOnce && randomAccess) {
-  throw new IllegalArgumentException("readOnce and randomAccess are 
mutually exclusive");
+if (context == Context.MERGE && readAdvice != ReadAdvice.SEQUENTIAL) {

Review Comment:
   This is really a good idea! It makes code much easier and the merge case 
needs no special handling in MMapDir.



##
lucene/core/src/java/org/apache/lucene/store/IOContext.java:
##
@@ -27,22 +27,11 @@
  * @param context An object of a enumerator Context type
  * @param mergeInfo must be given when {@code context == MERGE}
  * @param flushInfo must be given when {@code context == FLUSH}
- * @param readOnce This flag indicates that the file will be opened, then 
fully read sequentially
- * then closed.
- * @param load This flag is used for files that are a small fraction of the 
total index size and are
- * expected to be heavily accessed in random-access fashion. Some {@link 
Directory}
- * implementations may choose to load such files into physical memory 
(e.g. Java heap) as a way
- * to provide stronger guarantees on query latency.
- * @param randomAccess This flag indicates that the file will be accessed 
randomly. If this flag is
- * set, then readOnce will be false.
+ * @param readAdvice Advice regarding the read access pattern. Write 
operations should disregard

Review Comment:

Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


uschindler commented on code in PR #13219:
URL: https://github.com/apache/lucene/pull/13219#discussion_r1539364622


##
lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java:
##
@@ -137,17 +136,11 @@ public void madvise(MemorySegment segment, IOContext 
context) throws IOException
   }
 
   private Integer mapIOContext(IOContext ctx) {
-// Merging always wins and implies sequential access, because kernel is 
advised to free pages
-// after use:
-if (ctx.context() == Context.MERGE) {
-  return POSIX_MADV_SEQUENTIAL;
-}
-if (ctx.randomAccess()) {
-  return POSIX_MADV_RANDOM;
-}
-if (ctx.readOnce()) {
-  return POSIX_MADV_SEQUENTIAL;
-}
-return null;
+return switch (ctx.readAdvice()) {

Review Comment:
   I think we can remove the context from the signature and change it to 
`madvise(MemorySegment, ReadAdvice)`. `MemorySegmentIndexInputProvider` would 
just pass `context.readAdvice()` to `madvise()` then.
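   The suggested signature change could look roughly like this (a hedged sketch, 
not Lucene's actual code; the enum is redeclared to keep the snippet 
self-contained, and the flag values mirror the usual `<sys/mman.h>` constants):

```java
// Sketch of madvise() taking the ReadAdvice directly instead of the IOContext.
enum ReadAdvice { NORMAL, RANDOM, SEQUENTIAL }

final class PosixNativeAccessSketch {
  // POSIX advice flag values as defined in <sys/mman.h>.
  static final int POSIX_MADV_RANDOM = 1;
  static final int POSIX_MADV_SEQUENTIAL = 2;

  // Returns the native advice flag, or null when no madvise call is needed.
  static Integer mapReadAdvice(ReadAdvice advice) {
    return switch (advice) {
      case RANDOM -> POSIX_MADV_RANDOM;
      case SEQUENTIAL -> POSIX_MADV_SEQUENTIAL;
      case NORMAL -> null;
    };
  }
}
```

   A caller like `MemorySegmentIndexInputProvider` would then just pass 
`context.readAdvice()` through, as suggested above.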



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up writeGroupVInts [lucene]

2024-03-26 Thread via GitHub


easyice merged PR #13203:
URL: https://github.com/apache/lucene/pull/13203





Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


uschindler commented on code in PR #13219:
URL: https://github.com/apache/lucene/pull/13219#discussion_r1539407232


##
lucene/core/src/java/org/apache/lucene/store/IOContext.java:
##
@@ -54,58 +43,74 @@ public enum Context {
 DEFAULT
   };
 
-  public static final IOContext DEFAULT = new IOContext(Context.DEFAULT);
+  /** Advice regarding the read access pattern. */
+  public enum ReadAdvice {
+/**
+ * Normal behavior. Data is expected to be read mostly sequentially. The 
system is expected to
+ * cache the hottest pages.
+ */
+NORMAL,
+/**
+ * Data is expected to be read in a random-access fashion, either by {@link
+ * IndexInput#seek(long) seeking} often and reading relatively short 
sequences of bytes at once,
+ * or by reading data through the {@link RandomAccessInput} abstraction in 
random order.
+ */
+RANDOM,
+/** Data is expected to be read sequentially with very little seeking at 
most. */
+SEQUENTIAL,
+/**
+ * Data is treated as random-access memory in practice. {@link Directory} 
implementations may
+ * explicitly load the content of the file in memory, or provide hints to 
the system so that it
+ * loads the content of the file into the page cache at open time. This 
should only be used on
+ * very small files that can be expected to fit in RAM with very high 
confidence.
+ */
+LOAD
+  }
+
+  public static final IOContext DEFAULT =
+  new IOContext(Context.DEFAULT, null, null, ReadAdvice.NORMAL);
 
-  public static final IOContext READONCE = new IOContext(true, false, false);
+  public static final IOContext READONCE = new 
IOContext(ReadAdvice.SEQUENTIAL);
 
-  public static final IOContext READ = new IOContext(false, false, false);
+  public static final IOContext READ = new IOContext(ReadAdvice.NORMAL);
 
-  public static final IOContext LOAD = new IOContext(false, true, true);
+  public static final IOContext LOAD = new IOContext(ReadAdvice.LOAD);
 
-  public static final IOContext RANDOM = new IOContext(false, false, true);
+  public static final IOContext RANDOM = new IOContext(ReadAdvice.RANDOM);

Review Comment:
   We can still refactor and rename them a bit. The LOAD should really be 
PRELOAD.






Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


jpountz commented on code in PR #13219:
URL: https://github.com/apache/lucene/pull/13219#discussion_r1539431077


##
lucene/core/src/java/org/apache/lucene/store/IOContext.java:
##
@@ -54,58 +43,74 @@ public enum Context {
 DEFAULT
   };
 
-  public static final IOContext DEFAULT = new IOContext(Context.DEFAULT);
+  /** Advice regarding the read access pattern. */
+  public enum ReadAdvice {

Review Comment:
   If you don't mind, I'm leaving changes to `Context` to a separate PR to keep 
this one focused.






Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


uschindler commented on PR #13219:
URL: https://github.com/apache/lucene/pull/13219#issuecomment-2020687878

   P.S.: Are we using RANDOM at the moment?
   
   I also found https://github.com/elastic/elasticsearch/issues/27748, this 
person suggests to pass RANDOM for everything. 





Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


uschindler commented on code in PR #13219:
URL: https://github.com/apache/lucene/pull/13219#discussion_r1539448998


##
lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java:
##
@@ -135,12 +135,12 @@ public void madvise(MemorySegment segment, IOContext 
context) throws IOException
 }
   }
 
-  private Integer mapIOContext(IOContext ctx) {
-return switch (ctx.readAdvice()) {
+  private Integer mapIOContext(ReadAdvice readAdvice) {

Review Comment:
   should be `mapReadAdvice()` :-)






Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


uschindler commented on code in PR #13219:
URL: https://github.com/apache/lucene/pull/13219#discussion_r1539457487


##
lucene/core/src/java/org/apache/lucene/store/IOContext.java:
##
@@ -54,58 +42,50 @@ public enum Context {
 DEFAULT
   };
 
-  public static final IOContext DEFAULT = new IOContext(Context.DEFAULT);
+  public static final IOContext DEFAULT =
+  new IOContext(Context.DEFAULT, null, null, ReadAdvice.NORMAL);
 
-  public static final IOContext READONCE = new IOContext(true, false, false);
+  public static final IOContext READONCE = new 
IOContext(ReadAdvice.SEQUENTIAL);
 
-  public static final IOContext READ = new IOContext(false, false, false);
+  public static final IOContext READ = new IOContext(ReadAdvice.NORMAL);
 
-  public static final IOContext LOAD = new IOContext(false, true, true);
+  public static final IOContext PRELOAD = new 
IOContext(ReadAdvice.RANDOM_PRELOAD);

Review Comment:
   This may need a MIGRATE.md entry.






Re: [PR] Get better cost estimate on MultiTermQuery over few terms [lucene]

2024-03-26 Thread via GitHub


rquesada-tibco commented on code in PR #13201:
URL: https://github.com/apache/lucene/pull/13201#discussion_r1539468242


##
lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java:
##
@@ -292,7 +292,21 @@ public long cost() {
   };
 }
 
-private static long estimateCost(Terms terms, long queryTermsCount) throws 
IOException {
+private static final int MAX_TERMS_TO_COUNT = 128;

Review Comment:
   Stupid question: would it be feasible / desirable to make this value somehow 
configurable?
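   For reference, one common pattern for making such a cutoff configurable is a 
system-property override with a built-in default. This is only a sketch, and 
the property name is invented for illustration:

```java
// Hypothetical sketch: the cutoff falls back to the built-in default
// unless an (invented) system property overrides it.
final class TermCountCutoffSketch {
  static final int DEFAULT_MAX_TERMS_TO_COUNT = 128;

  static int maxTermsToCount() {
    // Integer.getInteger reads the named system property, or returns the default.
    return Integer.getInteger("illustration.maxTermsToCount", DEFAULT_MAX_TERMS_TO_COUNT);
  }
}
```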






Re: [PR] Clean up variable-gaps terms format. [lucene]

2024-03-26 Thread via GitHub


uschindler commented on PR #13216:
URL: https://github.com/apache/lucene/pull/13216#issuecomment-2020751491

   > > we now use ChecksumIndexInput as the file is fully readonce without 
seeking. Because we removed the IOContext from directory's open method (as it 
is hardcoded to readOnce), we do not need to change the existing IOContext, so 
the modifier method can go away
   > 
   > This is correct.
   > 
   > > the file is then read in whole and closed, the FST is on-heap.
   > 
   > Correct as well.
   > 
   > > We do not have backwards compatibility here as it is not the default 
format, so we can change it in main branch (10.x). This won't be backported, 
correct?
   > 
   > I don't believe that we attempt to not break non-default formats in minor 
releases, I have broken a few of them in the past in a minor. That said, I'm 
fine with not backporting, we could still have the `IOContext` ctor removed on 
9.x by hardcoding `IOContext.READONCE` when opening the `IndexInput`.
   
   That's fine. Hardcoding READONCE is also fine for this case, because the 
MergeContext is not needed at all.
   
   We should at least add a note that the format has changed and older indexes 
can't be read.
   
   > For reference, I bumped the version number to make sure users get an 
`IndexFormatTooOldException` rather than a cryptic corruption exception.
   
   Of course, I understood that. Although we provide no backwards compatibility 
we should always change version numbers when format changes.





Re: [PR] New structure for numeric dynamic pruning [lucene]

2024-03-26 Thread via GitHub


gf2121 closed pull request #13217: New structure for numeric dynamic pruning
URL: https://github.com/apache/lucene/pull/13217





Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


jpountz commented on PR #13219:
URL: https://github.com/apache/lucene/pull/13219#issuecomment-2020818027

   > P.S.: Are we using RANDOM at the moment?
   
   Not yet, we'd need to start using it where it makes sense like we do for 
(PRE)LOAD.
   
   > I also found https://github.com/elastic/elasticsearch/issues/27748, this 
person suggests to pass RANDOM for everything.
   
   Yeah, Wikimedia also did testing and they 
[report](https://phabricator.wikimedia.org/T169498) getting best performance 
with a mmap readahead of 16kB instead of the default of 128kB (it's shared on 
the same thread). It feels a bit like a bug to me that mmap has a much higher 
readahead than regular read operations; I wonder if we should recommend 
lowering this default readahead in our wiki / javadocs instead of trying to 
work around it by passing RANDOM everywhere. My preference would be to not 
index too much on how the various hints perform in practice and try to provide 
what seems to be the correct read advice based on what we know of the access 
patterns. E.g. postings and doc values data should probably use NORMAL, stored 
fields, term vectors and vectors data should probably use RANDOM, etc.





[PR] Disjunction as CompetitiveIterator for numeric dynamic pruning [lucene]

2024-03-26 Thread via GitHub


gf2121 opened a new pull request, #13221:
URL: https://github.com/apache/lucene/pull/13221

   This PR proposes a new way to do numeric dynamic pruning, with the following 
changes:
   
   * Instead of complex sampling and estimating the point count to judge whether 
to build the competitive iterator, this patch proposes to find the threshold 
value. That said, we find the value that is 'N docs away from' the top value, 
in favor of [the fact that the top value should be final in 
LeafComparators](https://github.com/apache/lucene/blob/99b9636fd8c383c80d06c8815cfdb49b1b77dcdb/lucene/core/src/java/org/apache/lucene/search/FieldComparator.java#L75),
 like the following picture shows.
   
   
![image](https://github.com/apache/lucene/assets/52390227/b7125122-194a-4167-a821-57c47479915d)
   
   * Instead of building and rebuilding the competitive iterator when the bottom 
value gets more competitive, this patch proposes to build the competitive 
iterator as a disjunction of small DISIs. Each small DISI maintains its most 
competitive value and gets discarded when that value is no longer competitive, 
like what we did in `TermOrdValComparator`. This helps us intersect the tree 
only once and update the competitive iterator more frequently.
   
   * For simplicity, I changed the byte values to comparable long values, e.g. 
`maxValueAsBytes` -> `maxValueAsLong`.
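   The idea of keeping a disjunction of small clauses, each tagged with its most 
competitive value and discarded once that value can no longer compete, can be 
sketched with a toy model (illustrative only; this is not Lucene's actual 
competitive-iterator code, and an ascending sort is assumed):

```java
import java.util.ArrayList;
import java.util.List;

final class CompetitiveDisjunctionSketch {
  // Each clause tracks the most competitive value among its remaining docs.
  record Clause(long mostCompetitiveValue) {}

  private final List<Clause> clauses = new ArrayList<>();

  void add(long mostCompetitiveValue) {
    clauses.add(new Clause(mostCompetitiveValue));
  }

  // When the queue bottom improves, drop clauses whose best value can no
  // longer beat it, instead of rebuilding the whole iterator.
  void onBottomUpdated(long bottomValue) {
    clauses.removeIf(c -> c.mostCompetitiveValue() > bottomValue);
  }

  int remainingClauses() {
    return clauses.size();
  }
}
```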
   
    Benchmark
   
   Here is a result based on wikimedium10m (baseline contains 
https://github.com/apache/lucene/pull/13199)
   
   ```
   Task                    QPS baseline  StdDev    QPS my_modified_version  StdDev    Pct diff         p-value
   TermDTSort                    176.84  (5.0%)                     303.31  (6.3%)    71.5% ( 57% -  87%)  0.000
   HighTermDayOfYearSort         454.72  (3.2%)                     791.09  (7.9%)    74.0% ( 60% -  87%)  0.000
   ```
   
   
   





Re: [PR] Disjunction as CompetitiveIterator for numeric dynamic pruning [lucene]

2024-03-26 Thread via GitHub


gf2121 commented on code in PR #13221:
URL: https://github.com/apache/lucene/pull/13221#discussion_r1539558105


##
lucene/core/src/java/org/apache/lucene/index/PointValues.java:
##
@@ -383,25 +383,18 @@ public final long estimatePointCount(IntersectVisitor 
visitor) {
 }
   }
 
-  /**
-   * Estimate if the point count that would be matched by {@link #intersect} 
with the given {@link
-   * IntersectVisitor} is greater than or equal to the upperBound.
-   *
-   * @lucene.internal
-   */
-  public static boolean isEstimatedPointCountGreaterThanOrEqualTo(
-  IntersectVisitor visitor, PointTree pointTree, long upperBound) throws 
IOException {
-return estimatePointCount(visitor, pointTree, upperBound) >= upperBound;
-  }
-
   /**
* Estimate the number of documents that would be matched by {@link 
#intersect} with the given
-   * {@link IntersectVisitor}. The estimation will terminate when the point 
count gets greater than
-   * or equal to the upper bound.
+   * {@link IntersectVisitor}. The estimation will terminate when the point 
count get greater than
+   * up bound. That said, if the return value is less than the upperBound, it 
is the accurate
+   * estimated point value, otherwise it means the number of points in the 
tree is greater than or
+   * equals to the upperBound.
*
* TODO: will broad-first help estimation terminate earlier?
+   *
+   * @lucene.internal
*/
-  private static long estimatePointCount(
+  public static long estimatePointCount(

Review Comment:
   I have to make this method public again, to get an accurate point count in 
[`intersectThresholdValue`](https://github.com/apache/lucene/blob/5e95d360730794bfe637e23d63b561b7f7e6ef9a/lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java#L245)
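   The early-terminating contract discussed here, where a return value below the 
upper bound is exact while a value at or above it only means "at least this 
many", can be illustrated with a toy counter (a hedged sketch, not Lucene's 
BKD-tree traversal):

```java
import java.util.function.IntPredicate;

final class BoundedCountSketch {
  // Count matching values, terminating once the count reaches upperBound.
  static long countUpTo(int[] values, IntPredicate matches, long upperBound) {
    long count = 0;
    for (int v : values) {
      if (matches.test(v) && ++count >= upperBound) {
        return count; // at the bound: the true total could be larger
      }
    }
    return count; // below the bound: this is the exact count
  }
}
```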






Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


uschindler commented on PR #13219:
URL: https://github.com/apache/lucene/pull/13219#issuecomment-2020886166

   > > P.S.: Are we using RANDOM at the moment?
   > 
   > Not yet, we'd need to start using it where it makes sense like we do for 
(PRE)LOAD.
   > 
   > > I also found 
[elastic/elasticsearch#27748](https://github.com/elastic/elasticsearch/issues/27748),
 this person suggests to pass RANDOM for everything.
   > 
   > Yeah, Wikimedia also did testing and they 
[report](https://phabricator.wikimedia.org/T169498) getting best performance 
with a mmap readahead of 16kB instead of the default of 128kB (it's shared on 
the same thread). It feels a bit like a bug to me that mmap has such a higher 
readahead than regular read operations, I wonder if we should recommend 
lowering this default readahead in our wiki / javadocs instead of trying to 
work around it by passing RANDOM everywhere. My preference would be to not 
index too much on how the various hints perform in practice and try to provide 
what seems to be the correct read advice based on what we know of the access 
patterns. E.g. postings and doc values data should probably use NORMAL, stored 
fields, term vectors and vectors data should probably use RANDOM, etc.
   
   The question that I have about this: how do we handle merging then, if we use 
random access for some files and later want to merge those segments away? As 
you said before, the problem is with reused NRT readers for merging. I think 
we should not hardcode the RANDOM flag on all files for now?!
   
   It is good that IOContext with MergeInfo always requires SEQUENTIAL, but is 
this really used in all cases when we merge? When it's hardcoded while opening 
index files we have a problem. The example of the vargaps reader has exactly 
that problem: it always uses readOnce.
   
   I think you are more familiar with how the merging works; these are just some 
points to consider.





[PR] Use `IOContext#RANDOM` when appropriate. [lucene]

2024-03-26 Thread via GitHub


jpountz opened a new pull request, #13222:
URL: https://github.com/apache/lucene/pull/13222

   This switches the following files to `IOContext.RANDOM`:
- Stored fields data file.
- Term vectors data file.
- HNSW graph.
- Temporary file storing vectors at merge time that we use to construct the 
merged HNSW graph.
- Vector data files, including quantized data files.
   
   I hesitated using `IOContext.RANDOM` on terms, since they have a random 
access pattern when running term queries, but a more sequential access pattern 
when running multi-term queries. I erred on the conservative side and did not 
switch them to `IOContext.RANDOM` for now.
   
   For simplicity, I'm only touching the current codec, not previous codecs. 
There are also some known issues:
- These files will keep using a `RANDOM` `IOContext` at merge time. We need 
some way for merge instances to get an updated `IOContext`? We have the same 
problem with `IOContext#LOAD` today.
- With quantized vectors, raw vectors don't have a random access pattern, but 
it was challenging to give raw vectors a sequential access pattern when there 
are quantized vectors and a random access pattern otherwise. So they assume a 
random access pattern all the time.
   





Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


jpountz commented on PR #13219:
URL: https://github.com/apache/lucene/pull/13219#issuecomment-2020941392

   > The question that I have about this: How to handle merging then?
   
   This is a big question to me too. With reader pooling, if you open a reader 
and then it gets included in a merge, we'll reuse the same `SegmentReader` and 
its existing open index inputs, which have likely been open with a `NORMAL` or 
`RANDOM` advice. Ideally there would be a way for our `getMergeInstance()` APIs 
to somehow return a clone that has a different read advice.
   
   > It is good that IOContext with MergeInfo always requires SEQUENTIAL, but 
is this really used in all cases when we merge?
   
   My understanding is that it will only be used if the index input is created 
with the `IOContext` that is set on the `SegmentReadState` and this reader has 
been opened specifically for merging; in other words, the index has not been 
reopened between the time when the segment got written and the time when the 
same segment got merged away. This is only going to cover a small subset of the 
segments that we write, which doesn't look good enough to me.





Re: [PR] Replace boolean flags on `IOContext` with an enum. [lucene]

2024-03-26 Thread via GitHub


jpountz commented on PR #13219:
URL: https://github.com/apache/lucene/pull/13219#issuecomment-2020943095

   > P.S.: Are we using RANDOM at the moment?
   
   FYI I tried to start switching some files to it at #13222 and discussed some 
limitations.





Re: [I] MIGRATE.md doesn't mention the transition from Collector to CollectorManager [lucene]

2024-03-26 Thread via GitHub


jpountz commented on issue #13218:
URL: https://github.com/apache/lucene/issues/13218#issuecomment-2020945911

   FWIW I plan on treating all issues that have their milestone set to 10.0 as 
needing discussion if we want to exclude them from 10.0.





Re: [PR] New int4 scalar quantization [lucene]

2024-03-26 Thread via GitHub


benwtrent commented on PR #13197:
URL: https://github.com/apache/lucene/pull/13197#issuecomment-2021292348

   I did a bunch of local benchmarking on this. I am adding a parameter to 
allow optional compression, as the numbers without compression are compelling 
enough on ARM to justify it IMO. 
   
   To achieve similar recall, `int4` without compression is about 30% faster. 
With compression it's about 30% slower, but with 50% of the memory requirements.
   
   Here are some latency vs. recall for int7, and int4 with this change.
   
   ```
   plt.plot([1.49, 1.53, 1.54, 1.83, 2.09], [0.952, 0.962, 0.965, 0.974, 0.981], marker='o', label='int7')
   plt.plot([1.72, 1.75, 1.79, 2.04, 2.48], [0.897, 0.915, 0.929, 0.971, 0.980], marker='o', label='int4_compressed')
   plt.plot([1.08, 1.12, 1.12, 1.34, 1.50], [0.897, 0.915, 0.929, 0.971, 0.980], marker='o', label='int4')
   ```
   
![image](https://github.com/apache/lucene/assets/4357155/f825ee55-ca44-4af2-8057-d59ab6a34fd3)
   
   int4 with compression gives 2x space improvement over int7, but it comes at 
an obvious cost as we have to (un)pack bytes during dot-products.
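   The (un)packing overhead mentioned above comes from storing two 4-bit values 
per byte. A minimal sketch of that byte layout (illustrative only, not the 
codec's actual code):

```java
final class Int4PackSketch {
  // Pack two unsigned 4-bit values (0..15) into a single byte.
  static byte pack(int lo, int hi) {
    return (byte) ((lo & 0x0F) | ((hi & 0x0F) << 4));
  }

  // Recover the low nibble.
  static int unpackLo(byte b) {
    return b & 0x0F;
  }

  // Recover the high nibble; mask after the shift to drop sign-extension bits.
  static int unpackHi(byte b) {
    return (b >> 4) & 0x0F;
  }
}
```

   A compressed dot-product has to do this unpacking for every byte it reads, 
which is where the extra latency comes from.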
   
   Here are the numbers around index building as well. I committed every 1MB to 
ensure merging occurred and that force-merging was adequately exercised. 
   
   Int4 no compression:
   ```
   Indexed 50 documents in 312090ms
   Force merge done in: 76169 ms
   ```
   
   Int4 compression:
   ```
   Indexed 50 documents in 326978ms
   Force merge done in: 124961 ms
   ```
   
   Int7:
   ```
   Indexed 50 documents in 344584ms
   Force merge done in: 98311 ms
   ```
   





Re: [PR] Mark TimeLimitingCollector as deprecated [lucene]

2024-03-26 Thread via GitHub


vigyasharma commented on PR #13220:
URL: https://github.com/apache/lucene/pull/13220#issuecomment-2021304317

   As an alternate to deprecating this entirely, we could also change it to 
start using `QueryTimeout`, instead of its custom time limiting counters. I'd 
like to get more opinions from the community, on whether we think it's useful 
to keep a time limiting *collector* around.





Re: [PR] Add timeout support to AbstractKnnVectorQuery [lucene]

2024-03-26 Thread via GitHub


vigyasharma commented on code in PR #13202:
URL: https://github.com/apache/lucene/pull/13202#discussion_r1539995871


##
lucene/core/src/test/org/apache/lucene/search/TestKnnByteVectorQuery.java:
##
@@ -102,14 +103,34 @@ public void testVectorEncodingMismatch() throws 
IOException {
 }
   }
 
+  public void testTimeout() throws IOException {

Review Comment:
   We could also add a test for the partial result case. You could create a 
mock query timeout that returns `false` only the first time it's called, and 
then flips to returning `true`.
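   A possible sketch of such a mock (hedged illustration; the interface is 
redeclared here rather than importing Lucene's `QueryTimeout`, to keep the 
snippet self-contained):

```java
interface TimeoutCheck {
  boolean shouldExit();
}

// Reports "not expired" on the first call and "expired" on every later call,
// so a test can reach the partial-results path deterministically.
final class FlippingTimeout implements TimeoutCheck {
  private boolean calledOnce = false;

  @Override
  public boolean shouldExit() {
    if (!calledOnce) {
      calledOnce = true;
      return false;
    }
    return true;
  }
}
```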



##
lucene/core/src/java/org/apache/lucene/search/TimeLimitingKnnCollectorManager.java:
##
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.QueryTimeout;
+import org.apache.lucene.search.knn.KnnCollectorManager;
+
+/** A {@link KnnCollectorManager} that collects results with a timeout. */
+public class TimeLimitingKnnCollectorManager implements KnnCollectorManager {
+  private final KnnCollectorManager delegate;
+  private final QueryTimeout queryTimeout;
+
+  public TimeLimitingKnnCollectorManager(KnnCollectorManager delegate, 
QueryTimeout timeout) {
+this.delegate = delegate;
+this.queryTimeout = timeout;
+  }
+
+  /** Get the {@link QueryTimeout} for terminating graph searches. */
+  public QueryTimeout getQueryTimeout() {
+return queryTimeout;
+  }
+
+  @Override
+  public KnnCollector newCollector(int visitedLimit, LeafReaderContext 
context) throws IOException {
+KnnCollector collector = delegate.newCollector(visitedLimit, context);
+if (queryTimeout == null) {
+  return collector;
+}
+return new KnnCollector() {
+  @Override
+  public boolean earlyTerminated() {
+return queryTimeout.shouldExit() || collector.earlyTerminated();
+  }
+
+  @Override
+  public void incVisitedCount(int count) {
+collector.incVisitedCount(count);
+  }
+
+  @Override
+  public long visitedCount() {
+return collector.visitedCount();
+  }
+
+  @Override
+  public long visitLimit() {
+return collector.visitLimit();
+  }
+
+  @Override
+  public int k() {
+return collector.k();
+  }
+
+  @Override
+  public boolean collect(int docId, float similarity) {
+return collector.collect(docId, similarity);
+  }
+
+  @Override
+  public float minCompetitiveSimilarity() {
+return collector.minCompetitiveSimilarity();
+  }
+
+  @Override
+  public TopDocs topDocs() {
+TopDocs docs = collector.topDocs();
+
+// Mark results as partial if timeout is met
+TotalHits.Relation relation =
+queryTimeout.shouldExit()
+? TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO
+: docs.totalHits.relation;
+
+return new TopDocs(new TotalHits(docs.totalHits.value, relation), docs.scoreDocs);

Review Comment:
   Can we simply return `collector.topDocs();`  and let collectors decide how 
to handle this? All implementations of `AbstractKnnCollector` already set 
relation to `GTE` based on `earlyTerminated()` check.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add timeout support to AbstractKnnVectorQuery [lucene]

2024-03-26 Thread via GitHub


kaivalnp commented on code in PR #13202:
URL: https://github.com/apache/lucene/pull/13202#discussion_r1540033526


##
lucene/core/src/java/org/apache/lucene/search/TimeLimitingKnnCollectorManager.java:
##
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.QueryTimeout;
+import org.apache.lucene.search.knn.KnnCollectorManager;
+
+/** A {@link KnnCollectorManager} that collects results with a timeout. */
+public class TimeLimitingKnnCollectorManager implements KnnCollectorManager {
+  private final KnnCollectorManager delegate;
+  private final QueryTimeout queryTimeout;
+
+  public TimeLimitingKnnCollectorManager(KnnCollectorManager delegate, QueryTimeout timeout) {
+this.delegate = delegate;
+this.queryTimeout = timeout;
+  }
+
+  /** Get the {@link QueryTimeout} for terminating graph searches. */
+  public QueryTimeout getQueryTimeout() {
+return queryTimeout;
+  }
+
+  @Override
+  public KnnCollector newCollector(int visitedLimit, LeafReaderContext context) throws IOException {
+KnnCollector collector = delegate.newCollector(visitedLimit, context);
+if (queryTimeout == null) {
+  return collector;
+}
+return new KnnCollector() {
+  @Override
+  public boolean earlyTerminated() {
+return queryTimeout.shouldExit() || collector.earlyTerminated();
+  }
+
+  @Override
+  public void incVisitedCount(int count) {
+collector.incVisitedCount(count);
+  }
+
+  @Override
+  public long visitedCount() {
+return collector.visitedCount();
+  }
+
+  @Override
+  public long visitLimit() {
+return collector.visitLimit();
+  }
+
+  @Override
+  public int k() {
+return collector.k();
+  }
+
+  @Override
+  public boolean collect(int docId, float similarity) {
+return collector.collect(docId, similarity);
+  }
+
+  @Override
+  public float minCompetitiveSimilarity() {
+return collector.minCompetitiveSimilarity();
+  }
+
+  @Override
+  public TopDocs topDocs() {
+TopDocs docs = collector.topDocs();
+
+// Mark results as partial if timeout is met
+TotalHits.Relation relation =
+queryTimeout.shouldExit()
+? TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO
+: docs.totalHits.relation;
+
+return new TopDocs(new TotalHits(docs.totalHits.value, relation), docs.scoreDocs);

Review Comment:
   I don't think `collector.topDocs()` will set the relation to `GTE` in case of 
timeouts (it only sets it when `visitLimit()` is crossed, because it uses its 
own internal `earlyTerminated()` check, which has no information about the 
`QueryTimeout`)



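The disagreement above comes down to which `earlyTerminated()` the relation is derived from. A minimal pure-Java sketch (the `Timeout` and `BaseCollector` names are illustrative stand-ins for `QueryTimeout` and an `AbstractKnnCollector`-style class, not Lucene's API) of why a collector that only consults its own visit budget reports `EQUAL_TO` even when an external timeout has fired:

```java
// Hypothetical stand-ins; not the Lucene classes themselves.
interface Timeout {
  boolean shouldExit();
}

class BaseCollector {
  final long visitLimit;
  long visited;

  BaseCollector(long visitLimit) {
    this.visitLimit = visitLimit;
  }

  // Only knows about its own visit budget, not wall-clock timeouts.
  boolean earlyTerminated() {
    return visited >= visitLimit;
  }

  String relation() {
    return earlyTerminated() ? "GREATER_THAN_OR_EQUAL_TO" : "EQUAL_TO";
  }
}

public class RelationDemo {
  public static void main(String[] args) {
    Timeout expired = () -> true; // simulate an elapsed timeout
    BaseCollector inner = new BaseCollector(1000); // visit limit far from reached

    // The inner collector alone reports EQUAL_TO: it never saw the timeout.
    if (!inner.relation().equals("EQUAL_TO")) throw new AssertionError();

    // A timeout-aware wrapper must combine both signals itself, as the PR does.
    String wrapped =
        (expired.shouldExit() || inner.earlyTerminated())
            ? "GREATER_THAN_OR_EQUAL_TO"
            : "EQUAL_TO";
    if (!wrapped.equals("GREATER_THAN_OR_EQUAL_TO")) throw new AssertionError();
    System.out.println("ok");
  }
}
```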



Re: [PR] Get better cost estimate on MultiTermQuery over few terms [lucene]

2024-03-26 Thread via GitHub


msfroh commented on code in PR #13201:
URL: https://github.com/apache/lucene/pull/13201#discussion_r1540080444


##
lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java:
##
@@ -292,7 +292,21 @@ public long cost() {
   };
 }
 
-private static long estimateCost(Terms terms, long queryTermsCount) throws IOException {
+private static final int MAX_TERMS_TO_COUNT = 128;
+
+private long estimateCost(Terms terms, long queryTermsCount) throws IOException {
+
+  // Try to explicitly compute the cost, but only if the term count is sufficiently small.
+  if (queryTermsCount < MAX_TERMS_TO_COUNT) {

Review Comment:
   Correct -- I thought about making that more explicit, but opted to take 
advantage of the "fall through" logic.



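The fall-through being discussed can be sketched in plain Java. Here a `List<Long>` of doc freqs stands in for iterating the matched `TermsEnum`, and `fallbackEstimate` for the existing index-statistics heuristic; the names are illustrative, not the patch's actual code:

```java
import java.util.Collections;
import java.util.List;

public class CostDemo {
  static final int MAX_TERMS_TO_COUNT = 128; // same cutoff as the patch

  // docFreqs stands in for the per-term docFreq() of the matched terms;
  // fallbackEstimate stands in for the cheap heuristic used today.
  static long estimateCost(List<Long> docFreqs, long fallbackEstimate) {
    if (docFreqs.size() < MAX_TERMS_TO_COUNT) {
      long cost = 0;
      for (long df : docFreqs) {
        cost += df; // exact: sum the doc freqs of the few matched terms
      }
      return cost;
    }
    return fallbackEstimate; // "fall through" to the heuristic for many terms
  }

  public static void main(String[] args) {
    // Few terms: counted exactly.
    if (estimateCost(List.of(3L, 5L, 7L), 1_000L) != 15L) throw new AssertionError();
    // Many terms (200 >= 128): the heuristic is used instead.
    if (estimateCost(Collections.nCopies(200, 1L), 1_000L) != 1_000L) throw new AssertionError();
    System.out.println("ok");
  }
}
```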



Re: [PR] Add Facets#getBulkSpecificValues method [lucene]

2024-03-26 Thread via GitHub


epotyom commented on code in PR #12862:
URL: https://github.com/apache/lucene/pull/12862#discussion_r1540089552


##
lucene/CHANGES.txt:
##
@@ -87,6 +87,10 @@ API Changes
* GITHUB#13146, GITHUB#13148: Remove ByteBufferIndexInput and only use MemorySegment APIs
  for MMapDirectory.  (Uwe Schindler)
 
+* GITHUB#12919: Add Facets#getBulkSpecificValues method to get specific values for multiple FacetLabels at once. (Egor Potemkin)
+
+* GITHUB#12919: Add Facets#getBulkSpecificValues method to get specific values for multiple FacetLabels at once. (Egor Potemkin)

Review Comment:
   Oh, thanks for catching that! Updated the branch.






Re: [PR] Add Facets#getBulkSpecificValues method [lucene]

2024-03-26 Thread via GitHub


epotyom commented on code in PR #12862:
URL: https://github.com/apache/lucene/pull/12862#discussion_r1540090770


##
lucene/facet/src/java/org/apache/lucene/facet/Facets.java:
##
@@ -58,6 +58,9 @@ public abstract FacetResult getTopChildren(int topN, String dim, String... path)
   /**
   * Return the counts or values for specific paths. Returns {@link #MISSING_SPECIFIC_VALUE} if
   * corresponding path doesn't exist, else the count.
+   *
+   * {@link #getSpecificValue} for each path should produce similar results, but the bulk method

Review Comment:
   That is true, changed👍






Re: [PR] Get better cost estimate on MultiTermQuery over few terms [lucene]

2024-03-26 Thread via GitHub


msfroh commented on code in PR #13201:
URL: https://github.com/apache/lucene/pull/13201#discussion_r1540109959


##
lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java:
##
@@ -292,7 +292,21 @@ public long cost() {
   };
 }
 
-private static long estimateCost(Terms terms, long queryTermsCount) throws IOException {
+private static final int MAX_TERMS_TO_COUNT = 128;

Review Comment:
   It's a great question! Maybe we could pass it as a constructor parameter, 
and the constant `RewriteMethod`s in `MultiTermQuery` could specify the default?
   
   I host a weekly OpenSearch Lucene Study Group on Zoom and we spent this 
week's meetup talking about this PR (and your issue): 
https://forum.opensearch.org/t/opensearch-lucene-study-group-meeting-monday-march-25th-2024/18547/3
   
   There were some nice ideas that came up there around how to pick something 
better than an arbitrary limit -- like maybe using a time threshold instead. 
(But would that be a threshold per-clause?)



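One shape the constructor-parameter idea could take, sketched with illustrative names (not the actual `MultiTermQuery`/`RewriteMethod` API):

```java
public class RewriteCutoffDemo {
  // Illustrative stand-in for a RewriteMethod that owns its term-count cutoff
  // instead of reading a hard-coded MAX_TERMS_TO_COUNT constant.
  static class ConstantScoreRewrite {
    final int maxTermsToCount;

    ConstantScoreRewrite(int maxTermsToCount) {
      this.maxTermsToCount = maxTermsToCount;
    }

    boolean shouldCountExactly(long queryTermsCount) {
      return queryTermsCount < maxTermsToCount;
    }
  }

  // A shared constant instance would preserve today's default behavior.
  static final ConstantScoreRewrite DEFAULT = new ConstantScoreRewrite(128);

  public static void main(String[] args) {
    if (!DEFAULT.shouldCountExactly(5)) throw new AssertionError();
    if (DEFAULT.shouldCountExactly(500)) throw new AssertionError();
    // Callers wanting a different cost/accuracy trade-off construct their own.
    if (!new ConstantScoreRewrite(1024).shouldCountExactly(500)) throw new AssertionError();
    System.out.println("ok");
  }
}
```

This sidesteps picking one global "right" limit, though it leaves open the time-threshold idea raised in the study group.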



Re: [PR] Mark TimeLimitingCollector as deprecated [lucene]

2024-03-26 Thread via GitHub


jpountz commented on PR #13220:
URL: https://github.com/apache/lucene/pull/13220#issuecomment-2021484078

   It makes sense to me to deprecate it; `IndexSearcher#setTimeout` should be 
used instead.





Re: [PR] Add timeout support to AbstractKnnVectorQuery [lucene]

2024-03-26 Thread via GitHub


kaivalnp commented on code in PR #13202:
URL: https://github.com/apache/lucene/pull/13202#discussion_r1540162745


##
lucene/core/src/test/org/apache/lucene/search/TestKnnByteVectorQuery.java:
##
@@ -102,14 +103,34 @@ public void testVectorEncodingMismatch() throws IOException {
 }
   }
 
+  public void testTimeout() throws IOException {

Review Comment:
   Makes sense!
   
   There was a consideration here that the number of levels in the HNSW graph 
should be == 1, because if the timeout is hit while finding the best entry 
point for the last level, we haven't collected any results yet. I think this 
should be fine as we're only indexing 3 vectors, and running these tests a few 
thousand times did not give an error. Added a note about this as well.
   
   Also fixed [another place](https://github.com/apache/lucene/blob/1f909baca5496d5ea7d06d6e83f3b0c54756d8e2/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L253-L256) where the timeout needs to be checked.






Re: [PR] Add timeout support to AbstractKnnVectorQuery [lucene]

2024-03-26 Thread via GitHub


kaivalnp commented on code in PR #13202:
URL: https://github.com/apache/lucene/pull/13202#discussion_r1540166440


##
lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenFloatKnnVectorQuery.java:
##
@@ -100,8 +102,15 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept
 fi.getVectorSimilarityFunction());
 final int queueSize = Math.min(k, Math.toIntExact(acceptIterator.cost()));
 HitQueue queue = new HitQueue(queueSize, true);
+TotalHits.Relation relation = TotalHits.Relation.EQUAL_TO;
 ScoreDoc topDoc = queue.top();
 while (vectorScorer.nextParent() != DocIdSetIterator.NO_MORE_DOCS) {
+  // Mark results as partial if timeout is met
+  if (queryTimeout != null && queryTimeout.shouldExit()) {
+relation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;
+break;
+  }

Review Comment:
   Also wanted some opinions here: we're checking the timeout in exact search 
of `DiversifyingChildren[Byte|Float]KnnVectorQuery` once per-parent (as opposed 
to once per document elsewhere)
   
   Should we update this to once per-child as well?



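The per-parent vs per-child trade-off can be sketched in plain Java. A counting timeout stands in for `QueryTimeout`, and the loop shape for the parent/child iteration; all names are illustrative:

```java
public class TimeoutGranularityDemo {
  interface Timeout {
    boolean shouldExit();
  }

  // Sketch of the two granularities discussed: check the timeout once per
  // parent block, or once per child document.
  static long scoreDocs(int parents, int childrenPerParent, Timeout timeout, boolean perChild) {
    long scored = 0;
    outer:
    for (int p = 0; p < parents; p++) {
      if (!perChild && timeout.shouldExit()) break; // coarse, per-parent check
      for (int c = 0; c < childrenPerParent; c++) {
        if (perChild && timeout.shouldExit()) break outer; // finer-grained check
        scored++;
      }
    }
    return scored;
  }

  public static void main(String[] args) {
    // A timeout that fires on the 6th check.
    class CountingTimeout implements Timeout {
      int calls;

      public boolean shouldExit() {
        return ++calls > 5;
      }
    }
    // Per-parent: 5 parents complete fully before the 6th check fires -> 50 docs scored.
    if (scoreDocs(100, 10, new CountingTimeout(), false) != 50) throw new AssertionError();
    // Per-child: only 5 children are scored before the 6th check fires.
    if (scoreDocs(100, 10, new CountingTimeout(), true) != 5) throw new AssertionError();
    System.out.println("ok");
  }
}
```

The finer granularity bounds overshoot more tightly at the cost of one extra check per child.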



Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-26 Thread via GitHub


antonha commented on PR #13149:
URL: https://github.com/apache/lucene/pull/13149#issuecomment-2021556140

   > Ideally we'd add a query that adds another implementation of an 
`IntersectVisitor` to the nightly benchmarks before merging that PR so that we 
can see the performance bump?
   
   Yes - this would be ideal imo. Some caveats that I found while testing:
   - `PointRangeQuery`s are executed with different `IntersectVisitor`s 
depending on expected number of matches, which means that several query 
intervals need to be used in each JVM. In luceneutil, I believe that this 
corresponds to setting `taskCountPerCat` high.
   - To see significant change, the number of documents in the segments needs 
to either fit into the 16bit or 32bit implementation of `bkd.DocIdsWriter` - 
this was not the case with the simple `localrun.py` script out of the box.
   - It is possible that we see a performance decrease when we add more 
`IntersectVisitor` variants, since the benchmark becomes more realistically 
slow with the additional implementations.





Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-26 Thread via GitHub


antonha commented on code in PR #13149:
URL: https://github.com/apache/lucene/pull/13149#discussion_r1540175478


##
lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
##
@@ -185,6 +186,13 @@ public void visit(DocIdSetIterator iterator) throws IOException {
 adder.add(iterator);
   }
 
+  @Override
+  public void visit(IntsRef ref) {
+for (int i = ref.offset; i < ref.offset + ref.length; i++) {

Review Comment:
   I unfortunately did not find enough time for this this week - and since this 
PR seems to be wrapping up I will leave it out. It looks like a simple change, 
but I got stuck implementing tests (which I feel are very much needed). I will 
leave this for future improvement!






Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]

2024-03-26 Thread via GitHub


antonha commented on code in PR #13149:
URL: https://github.com/apache/lucene/pull/13149#discussion_r1540177437


##
lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java:
##
@@ -222,6 +230,14 @@ public void visit(DocIdSetIterator iterator) throws IOException {
 cost[0] = Math.max(0, cost[0] - iterator.cost());
   }
 
+  @Override
+  public void visit(IntsRef ref) throws IOException {
+for (int i = ref.offset; i < ref.offset + ref.length; i++) {
+  result.clear(ref.ints[i]);
+  cost[0]--;

Review Comment:
   yes, done :)






Re: [PR] Made the UnifiedHighlighter's hasUnrecognizedQuery function processes FunctionQuery the same way as MatchAllDocsQuery and MatchNoDocsQuery queries for performance reasons. [lucene]

2024-03-26 Thread via GitHub


romseygeek commented on code in PR #13165:
URL: https://github.com/apache/lucene/pull/13165#discussion_r1540259892


##
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java:
##
@@ -1130,7 +1137,7 @@ public boolean acceptField(String field) {
   @Override
   public void visitLeaf(Query query) {
if (MultiTermHighlighting.canExtractAutomataFromLeafQuery(query) == false) {
-  if (!(query instanceof MatchAllDocsQuery || query instanceof MatchNoDocsQuery)) {
+  if (!UnifiedHighlighter.this.isIgnorableQuery(query)) {

Review Comment:
   Can you add a test to TestUnifiedHighlighterExtensibility that checks that 
overrides to `isIgnorableQuery` work here as well?  I think it will, but to be 
perfectly honest I can never remember the inheritance rules around 
class-specific `this` calls...



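To the inheritance question: a qualified `Outer.this.method()` call dispatches virtually, so a subclass override is seen from the inner visitor. A minimal pure-Java sketch (illustrative names, not the actual `UnifiedHighlighter` classes):

```java
public class DispatchDemo {
  static class Highlighter {
    boolean isIgnorableQuery(Object q) {
      return false; // base behavior
    }

    // Inner visitor calls back via Highlighter.this, as the PR's visitor does.
    class Visitor {
      boolean check(Object q) {
        return Highlighter.this.isIgnorableQuery(q);
      }
    }
  }

  static class CustomHighlighter extends Highlighter {
    @Override
    boolean isIgnorableQuery(Object q) {
      return true; // subclass override
    }
  }

  public static void main(String[] args) {
    Highlighter custom = new CustomHighlighter();
    // The qualified-this call reaches the override on the enclosing instance.
    if (!custom.new Visitor().check("q")) throw new AssertionError();
    System.out.println("ok");
  }
}
```

That said, the suggested test in `TestUnifiedHighlighterExtensibility` is still worth having as a regression guard.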



Re: [PR] Made the UnifiedHighlighter's hasUnrecognizedQuery function processes FunctionQuery the same way as MatchAllDocsQuery and MatchNoDocsQuery queries for performance reasons. [lucene]

2024-03-26 Thread via GitHub


romseygeek commented on PR #13165:
URL: https://github.com/apache/lucene/pull/13165#issuecomment-2021674549

   Thanks @vletard!  This looks great; I'd just like to add one more test to 
ensure that inheritance works the way we expect. 





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-26 Thread via GitHub


vsop-479 commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-2021791458

   > Was this on wikimediumall?
   
   No, this was on `wikimedium10k`. 
   I will measure the performance again on `wikimediumall`.





Re: [PR] [Fix] Binary search the entries when all suffixes have the same length in a leaf block. [lucene]

2024-03-26 Thread via GitHub


vsop-479 commented on code in PR #11888:
URL: https://github.com/apache/lucene/pull/11888#discussion_r1540528182


##
lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99PostingsFormat.java:
##
@@ -143,4 +141,13 @@ private void doTestImpactSerialization(List<Impact> impacts) throws IOException
   }
 }
   }
+
+  @Override
+  protected void subCheckBinarySearch(TermsEnum termsEnum) throws Exception {
+// 10004a matched block's entries: [11, 13, ..., 100049].
+// if target greater than the last entry of the matched block,
+// termsEnum.term should be the last entry.
+assertFalse(termsEnum.seekExact(new BytesRef("10004a")));
+assertEquals(termsEnum.term(), new BytesRef("100049"));

Review Comment:
   > why are we testing that here :)
   
   Because there was a bug (fixed by 7084596c1c3a62dec2614aaeb37d0954f5fbd4e2) in 
the previous implementation, I added this test to guard against regressions.
   
   Should I remove it?


