[GitHub] [lucene] jpountz opened a new issue, #11915: Make Lucene smarter about long runs of matches

2022-11-10 Thread GitBox


jpountz opened a new issue, #11915:
URL: https://github.com/apache/lucene/issues/11915

   ### Description
   
   Lucene's abstractions are good at dealing with long runs of documents that 
do not match a query, but much less at dealing with long runs of documents that 
match a query. In such cases, Lucene still needs to linearly scan these matches 
and do some amount of work for each of the matches.
   
   Is it actually common to have long runs of matches? For full-text indexes, 
maybe not so much, only stop words may have runs of adjacent matches. For 
string fields, this may happen if the field has a default value that is the 
value of most documents in the collection. Also it's possible for users to use 
index sorting in order to cluster similar documents together, which increases 
the likelihood to have long runs of adjacent matches.
   
   One idea would be to augment the `DocIdSetIterator` API to add a new 
`peekNextNonMatchingDocID` method, which would return the next doc ID that may 
not be a match. The default implementation could return `docID() + 1`. We'd 
need to implement this API in postings, doc-value iterators and a few other 
important `DocIdSetIterators` like `BitSetIterator` and `DocIdSetIterator#all`. 
And propagate this information in disjunctions and conjunctions. Then we could 
leverage this API in a number of places:
- `ReqExclScorer` could ask the prohibited clause for the next doc ID that 
is worth evaluating on the required clause.
- Conjunctions could ignore non-scoring clauses over ranges of doc IDs that 
all match.
- `FixedBitSet#or` could set ranges of doc IDs at once, which would in-turn 
speed up `DocIdSetBuilder`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rendel commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]

2022-11-10 Thread GitBox


rendel commented on issue #11702:
URL: https://github.com/apache/lucene/issues/11702#issuecomment-1309968917

   > I don't think ESQL is going to be different from existing faceting 
support: it will still want to use ordinals when it makes sense such as 
grouping by term.
   
   @jpountz This may be correct for aggregate operation, however, if you wish 
to support join operation in ESQL at some point, then you'll need to perform a 
scan of the binary values and not of the ordinal values (as they are not 
compatible with a join operation). 
   
   > Would you be willing to contribute your multi-valued binary doc value 
implementation here? I think having multi-value parity with other doc value 
types is good to support multiple use cases like this. 
   
   @nknize We do not see any problem in sharing the code, but our 
implementation is based on the Elasticsearch framework (on the 
`BinaryFieldMapper.CustomBinaryDocValuesField` to be exact), not Lucene, so 
likely it will be not relevant for this thread.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #11888: [Fix] Binary search the entries when all suffixes have the same length in a leaf block.

2022-11-10 Thread GitBox


jpountz commented on PR #11888:
URL: https://github.com/apache/lucene/pull/11888#issuecomment-1310043887

   Thanks @vsop-479. Do you know if the test you added to terms can be improved 
in such a way that it would have caught this bug?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jfboeuf commented on a diff in pull request #11900: Reduce bloom filter size by using the optimal count for hash functions.

2022-11-10 Thread GitBox


jfboeuf commented on code in PR #11900:
URL: https://github.com/apache/lucene/pull/11900#discussion_r1019029965


##
lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java:
##
@@ -46,7 +46,9 @@ public class FuzzySet implements Accountable {
 
   public static final int VERSION_SPI = 1; // HashFunction used to be loaded 
through a SPI
   public static final int VERSION_START = VERSION_SPI;
-  public static final int VERSION_CURRENT = 2;
+  public static final int VERSION_MURMUR2 = 2;
+  private static final int VERSION_MULTI_HASH = 3;
+  public static final int VERSION_CURRENT = VERSION_MULTI_HASH;

Review Comment:
   Awesome, breaking backward compatibility will allow lots of cleaning up!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jfboeuf commented on a diff in pull request #11900: Reduce bloom filter size by using the optimal count for hash functions.

2022-11-10 Thread GitBox


jfboeuf commented on code in PR #11900:
URL: https://github.com/apache/lucene/pull/11900#discussion_r1019030241


##
lucene/codecs/src/java/org/apache/lucene/codecs/bloom/MurmurHash64.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.bloom;
+
+import org.apache.lucene.util.BytesRef;
+
+/**
+ * This is a very fast, non-cryptographic hash suitable for general hash-based 
lookup. See
+ * http://murmurhash.googlepages.com/ for more details.
+ *
+ * The code from Apache Commons was adapted in the form here to work with 
BytesRefs with offsets
+ * and lengths rather than raw byte arrays.
+ */
+public class MurmurHash64 extends HashFunction {
+  private static final long M64 = 0xc6a4a7935bd1e995L;
+  private static final int R64 = 47;
+  public static final HashFunction INSTANCE = new MurmurHash64();
+
+  /**
+   * Generates a 64-bit hash from byte array of the given length and seed.
+   *
+   * @param data The input byte array
+   * @param seed The initial seed value
+   * @param length The length of the array
+   * @return The 64-bit hash of the given array
+   */
+  public static long hash64(byte[] data, int seed, int offset, int length) {
+long h = (seed & 0xL) ^ (length * M64);
+
+final int nblocks = length >> 3;
+
+// body
+for (int i = 0; i < nblocks; i++) {
+
+  long k = getLittleEndianLong(data, offset);

Review Comment:
   Thanks, I will do it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on issue #11911: improve checkindex to be more thorough for vectors (e.g. test seeking)

2022-11-10 Thread GitBox


rmuir commented on issue #11911:
URL: https://github.com/apache/lucene/issues/11911#issuecomment-1310217848

   "read every byte of the index" is the promise that checkindex makes. So this 
bug is really important.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir closed pull request #11906: Add monster test for many knn docs

2022-11-10 Thread GitBox


rmuir closed pull request #11906: Add monster test for many knn docs
URL: https://github.com/apache/lucene/pull/11906


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11906: Add monster test for many knn docs

2022-11-10 Thread GitBox


rmuir commented on PR #11906:
URL: https://github.com/apache/lucene/pull/11906#issuecomment-1310221605

   this test is folded into #11905


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir merged pull request #11905: Fix integer overflow when seeking the vector index for connections

2022-11-10 Thread GitBox


rmuir merged PR #11905:
URL: https://github.com/apache/lucene/pull/11905


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-11-10 Thread GitBox


jpountz commented on PR #907:
URL: https://github.com/apache/lucene/pull/907#issuecomment-1310330397

   Apologies Luca, but after looking more at your changes, I'm getting worried 
that this change is harder than I had anticipated. I was optimistically hoping 
that never returning null PointValues would automatically help save some 
`NullPointerException`s but looking at your change it looks like there are many 
places where we just need to replace these null checks with other checks if we 
want to keep the logic correct, so things wouldn't magically work on segments 
that have ghost fields unless we explicitly test ghost fields. So I would 
suggest closing this PR, apologies for the time you spent on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on issue #11393: Ghost fields and postings/points [LUCENE-10357]

2022-11-10 Thread GitBox


jpountz commented on issue #11393:
URL: https://github.com/apache/lucene/issues/11393#issuecomment-1310333691

   I had hoped that getting rid of ghost fields would automatically help avoid 
some bugs but after looking into it for both postings and points (thanks 
@javanna and @shahrs87 !) it looks like it's not actually possible to make 
things significantly better. Hence closing this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz closed issue #11393: Ghost fields and postings/points [LUCENE-10357]

2022-11-10 Thread GitBox


jpountz closed issue #11393: Ghost fields and postings/points [LUCENE-10357]
URL: https://github.com/apache/lucene/issues/11393


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #11793: Prevent PointValues from returning null for ghost fields

2022-11-10 Thread GitBox


jpountz commented on PR #11793:
URL: https://github.com/apache/lucene/pull/11793#issuecomment-1310334991

   Apologies @javanna, but after looking more at your changes, I'm getting 
worried that this change is harder than I had anticipated. I was optimistically 
hoping that never returning null PointValues would automatically help save some 
NullPointerExceptions but looking at your change it looks like there are many 
places where we just need to replace these null checks with other checks if we 
want to keep the logic correct, so things wouldn't magically work on segments 
that have ghost fields unless we explicitly test ghost fields. So I would 
suggest closing this PR, apologies for the time you spent on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna closed pull request #11793: Prevent PointValues from returning null for ghost fields

2022-11-10 Thread GitBox


javanna closed pull request #11793: Prevent PointValues from returning null for 
ghost fields
URL: https://github.com/apache/lucene/pull/11793


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on pull request #11793: Prevent PointValues from returning null for ghost fields

2022-11-10 Thread GitBox


javanna commented on PR #11793:
URL: https://github.com/apache/lucene/pull/11793#issuecomment-1310370411

   Agreed @jpountz I think it was a good experiment to spend some time on, and 
I have also been thinking along the same lines, that the changes I ended up 
making were not solving the problem like we initially thought. No problem at 
all. I will open a follow-up PR to clarify the current behaviour and 
expectations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent opened a new pull request, #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox


benwtrent opened a new pull request, #11916:
URL: https://github.com/apache/lucene/pull/11916

   Checkindex with vectors should exercise the graph and seek operations. These 
are exposed via the search interface.
   
   There is the option to search EVERY stored vector value as we iterate it, 
but this seemed excessive. 
   
   So, only the first 64 vectors are searched, and we don't limit the search to 
stop early.
   
   closes: https://github.com/apache/lucene/issues/11911


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox


benwtrent commented on PR #11916:
URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310547475

   @rmuir I took a stab at it. I am unfamiliar with checkindex, but this will 
search 64 vectors, seeking the graph to catch if there is something obscene 
broken.
   
   A more complicated check would require inner interface changes (allowing 
seek to be hit directly). But search seems nice as it is a common path and 
actually exercising the thing as it is typically used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox


rmuir commented on PR #11916:
URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310577261

   Thank you, yeah this is fine as a start! I think, it would be an improvement 
in the future to not just search the first 64 vectors but maybe every n'th 
(just a different form of sampling).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz opened a new pull request, #11917: Automatically preload index files that are both tiny and very hot.

2022-11-10 Thread GitBox


jpountz opened a new pull request, #11917:
URL: https://github.com/apache/lucene/pull/11917

   The default codec has a number of small and hot files, that actually used to 
be fully loaded in memory before we moved them off-heap. In the general case, 
these files are expected to fully fit into the page cache for things to work 
well. Should we give control over preloading to codecs? This is what this 
commit does for the following files:
- Terms index (`tip`)
- Points index (`kdi`)
- Stored fields index (`fdx`)
- Terms vector index (`tvx`)
   
   This only has an effect on `MMapDirectory`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler opened a new pull request, #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput

2022-11-10 Thread GitBox


uschindler opened a new pull request, #11918:
URL: https://github.com/apache/lucene/pull/11918

   This also adds incorrect (e.g., negative) positions to exception message.
   
   This also fixes some wrong exception messages (seek vs. read) in 
ByteBufferIndexInput. Sometimes it said "seek" although it should say "read". 
In MemorySegmentIndexInput I already fixed this last year.
   
   This closes #11912.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput

2022-11-10 Thread GitBox


uschindler commented on PR #11918:
URL: https://github.com/apache/lucene/pull/11918#issuecomment-1310726064

   The new test is a bit bad, but unfortunately, MMapDirectory's multi-input 
only has an assert in seek(). If that hits, test also passses. In reality 
negative offsets on slices should also throw correct exception.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on issue #11863: Add large-scale test for kNN vectors

2022-11-10 Thread GitBox


jtibshirani commented on issue #11863:
URL: https://github.com/apache/lucene/issues/11863#issuecomment-1310797975

   In https://github.com/apache/lucene/pull/11905 we added a test for a large 
number of documents (with a tiny dimension).
   
   It'd also be good to clean up and merge something like 
https://github.com/apache/lucene/pull/11867, which indexes a large number of 
bytes (maybe fewer vectors but high dimension).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox


rmuir commented on PR #11916:
URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310890561

   thanks! I like it. Feel free to add a CHANGES entry if you want, it is a 
good one for that, because checkindex is user-visible and important. I would 
suggest in the 9.4.2 section as that's where I want to backport this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox


benwtrent commented on PR #11916:
URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310907074

   pushed CHANGES under 9.4.2 as an `Improvement` @rmuir 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir merged pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox


rmuir merged PR #11916:
URL: https://github.com/apache/lucene/pull/11916


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir closed issue #11911: improve checkindex to be more thorough for vectors (e.g. test seeking)

2022-11-10 Thread GitBox


rmuir closed issue #11911: improve checkindex to be more thorough for vectors 
(e.g. test seeking)
URL: https://github.com/apache/lucene/issues/11911


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput

2022-11-10 Thread GitBox


uschindler commented on PR #11918:
URL: https://github.com/apache/lucene/pull/11918#issuecomment-1310952543

   Should I backport this also to 9.4.2 when it gets released next week. I am 
afraid of more horrible bugs in vectors and I'd like to give people a chance to 
report it.
   
   Problem of Elasticsearch is (LOL): they did not enable `--enable-preview`, 
so they did not get good exception messages (for that I have a solution where I 
will come with a PR soon, which enables MemorySegment support without 
`--enable-preview` by some tricks).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox


rmuir commented on PR #11916:
URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310961095

   @benwtrent I hit issue upon backporting to branch_9x: it may be nothing 
specific to 9.x but just a random seed that hasn't been encountered yet on 
master?
   
   The checkindex error message is a bit generic, maybe we can improve it, as I 
can't tell immediately what went wrong. It seems that it got 0 results when 
searching for nearest neighbors on a vector... but I'm guessing this happens 
because we don't check Bits for deleted docs before issuing the query? If we 
search on a vector that only exists in a deleted doc, its possible all the 
results could be deleted ones too. So I think we are just missing an "if 
statement" before issuing the query?
   
   ```
   think:lucene[branch_9x]$ git cherry-pick 
3a506ec87a01556a530eee5eb54ada49fe3cde3f
   Auto-merging lucene/CHANGES.txt
   [branch_9x cfbb7e9bd35] GITHUB#11911: improve checkindex to be more thorough 
for vectors (#11916)
Author: Benjamin Trent 
Date: Thu Nov 10 16:45:47 2022 -0500
2 files changed, 31 insertions(+), 6 deletions(-)
   think:lucene[branch_9x]$ ./gradlew check
   ...
   > Task :lucene:core:test
   
   org.apache.lucene.codecs.lucene94.TestLucene94HnswVectorsFormat > 
testRandomBytes FAILED
   org.apache.lucene.index.CheckIndex$CheckIndexException: Field "field" 
failed to search k nearest neighbors
   at 
__randomizedtesting.SeedInfo.seed([45F1ECAB3ABC3C78:8EF8167273CC4F95]:0)
   at 
app//org.apache.lucene.index.CheckIndex.testVectors(CheckIndex.java:2603)
   at 
app//org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:1011)
   at 
app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:710)
   at 
app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:548)
   at 
app//org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:343)
   at 
app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
   at 
app//org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase.testRandomBytes(BaseKnnVectorsFormatTestCase.java:975)
   at 
java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
   at 
java.base@17.0.5/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   at 
java.base@17.0.5/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.base@17.0.5/java.lang.reflect.Method.invoke(Method.java:568)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   at 
app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   at 
app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   at 
app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   at 
app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   at 
app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   at app//org.junit.rules.RunRules.evaluate(RunRules.java:20)
   at 
app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   at 
app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   at 
app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
   at 
app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
   at 
app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
   at 
app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   at 
app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   at 
app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   at 

[GitHub] [lucene] rmuir commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox


rmuir commented on PR #11916:
URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310962304

   Another idea, perhaps even simpler, is not to filter deleteddocs at all here 
in this logic. Because checkindex doesnt normally exclude deleted docs and just 
checks everything.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #11916: GITHUB#11911: improve checkindex to be more thorough for vectors

2022-11-10 Thread GitBox


benwtrent commented on PR #11916:
URL: https://github.com/apache/lucene/pull/11916#issuecomment-1310966373

   @rmuir 100%. I reproduced it with that seed, then removed the deleted docs 
check and it cleared up. I bet its because ALL the docs were deleted or 
something.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent opened a new pull request, #11919: Follow up to GITHUB#11916, remove deleted docs check

2022-11-10 Thread GitBox


benwtrent opened a new pull request, #11919:
URL: https://github.com/apache/lucene/pull/11919

   There is a chance that all the docs are deleted. This is ok in a checkindex 
scenario and other checks don't bother with verifying deleted docs like this.
   
   Removing the check.
   
   This reproduce line failed before this commit:
   ```
   ./gradlew test --tests TestLucene94HnswVectorsFormat.testRandomBytes 
-Dtests.seed=45F1ECAB3ABC3C78 -Dtests.badapples=true -Dtests.locale=es-BO 
-Dtests.timezone=America/Fortaleza -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11919: Follow up to GITHUB#11916, remove deleted docs check

2022-11-10 Thread GitBox


rmuir commented on PR #11919:
URL: https://github.com/apache/lucene/pull/11919#issuecomment-1310970223

   looks good, thank you for making the PR so fast. The test failure reproduces 
and with this change it passes again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #11919: Follow up to GITHUB#11916, remove deleted docs check

2022-11-10 Thread GitBox


benwtrent commented on PR #11919:
URL: https://github.com/apache/lucene/pull/11919#issuecomment-1310971424

   Ran
   ```
   ./gradlew test --tests TestLucene94HnswVectorsFormat -Dtests.iters=1000
   ```
   Just to be sure we are good. All green locally. 
   
   @rmuir 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #11919: Follow up to GITHUB#11916, remove deleted docs check

2022-11-10 Thread GitBox


benwtrent commented on PR #11919:
URL: https://github.com/apache/lucene/pull/11919#issuecomment-1310971694

   Apologies for the noise! Still learning all of Lucene's edges :D 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11917: Automatically preload index files that are both tiny and very hot.

2022-11-10 Thread GitBox


rmuir commented on PR #11917:
URL: https://github.com/apache/lucene/pull/11917#issuecomment-1310974545

   I think preload is different from mlock, mlock needs way more discussion and 
personally I'm against it. mlock would be an operational hassle because of 
default resource limits on linux as well. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on a diff in pull request #11917: Automatically preload index files that are both tiny and very hot.

2022-11-10 Thread GitBox


uschindler commented on code in PR #11917:
URL: https://github.com/apache/lucene/pull/11917#discussion_r1019662247


##
lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java:
##
@@ -235,7 +235,7 @@ public IndexInput openInput(String name, IOContext context) 
throws IOException {
 ensureOpen();
 ensureCanRead(name);
 Path path = directory.resolve(name);
-return PROVIDER.openInput(path, context, chunkSizePower, preload, 
useUnmapHack);
+return PROVIDER.openInput(path, context, chunkSizePower, preload || 
context.load, useUnmapHack);

Review Comment:
   See my last message: It may be better in future to leave the decission to 
preload or map to the implementation of the provider. For exactly that reason 
the context is passed to the provider!
   
   So I would move the `|| context.load` to somewhere in 
MappedByteBufferIndexInputProvider and MemorySegmentIndexInputProvider so they 
can have different code.
   
   Maybe also rename "load" to something else because it is up to 
implementation how to guarantee how something is "sticked to memory".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on pull request #11917: Automatically preload index files that are both tiny and very hot.

2022-11-10 Thread GitBox


uschindler commented on PR #11917:
URL: https://github.com/apache/lucene/pull/11917#issuecomment-1310976902

   > I think preload is different from mlock, mlock needs way more discussion 
and personally I'm against it. mlock would be an operational hassle because of 
default resource limits on linux as well.
   
   This would of course be optional (it must be optional as it also requires 
extra JVM command line params). I think we should give users the option to do 
it.
   
   Personally I also tend to be against "load".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput

2022-11-10 Thread GitBox


uschindler commented on PR #11918:
URL: https://github.com/apache/lucene/pull/11918#issuecomment-1310991790

   Ah you added milestone 9.4.2 already to issue. Will do same here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on pull request #912: MR-JAR rewrite of MMapDirectory with JDK-19 preview Panama APIs (>= JDK-19-ea+23)

2022-11-10 Thread GitBox


uschindler commented on PR #912:
URL: https://github.com/apache/lucene/pull/912#issuecomment-1311033936

   After Mike switched to preview mode the results look good. The speed with 
MemorySegmentIndexInput ist similar to old ByteBuffer code.
   
   https://home.apache.org/~mikemccand/lucenebench/
   
   See change EZ. Of course what we can't test is highly concurrent reopening 
of index and therefore closing shared MemorySessions at high ratio.
   
   CC @mcimadamore 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir merged pull request #11919: Follow up to GITHUB#11916, remove deleted docs check

2022-11-10 Thread GitBox


rmuir merged PR #11919:
URL: https://github.com/apache/lucene/pull/11919


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #11918: Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput

2022-11-10 Thread GitBox


rmuir commented on PR #11918:
URL: https://github.com/apache/lucene/pull/11918#issuecomment-1311107401

   yes, +1 to backport. This way if there is another problem, it might be 
easier to debug.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org